All About Cache - Printable Version

All About Cache - Printable Version

+- Mario Kart Wii Gecko Codes, Cheats, & Hacks (https://mariokartwii.com)
+-- Forum: Guides/Tutorials/How-To's (https://mariokartwii.com/forumdisplay.php?fid=45)
+--- Forum: PowerPC Assembly (https://mariokartwii.com/forumdisplay.php?fid=50)
+--- Thread: All About Cache (/showthread.php?tid=1958)

All About Cache - Vega - 05-15-2022

All About Cache

This PowerPC tutorial will teach you the in's and out's of the Cache model of the Wii Broadway Chip, it's instruction set, and how some of these instructions may need to be used for Gecko ASM Codes. This is a lengthy read, but every PPC Coder/dev should have a decent understanding of Broadway's Cache model.

Chapter 1: Understanding some Basics about Memory

There's two types of memory, Virtual & Physical. When Broadway executes in Virtual Memory, this is called Virtual Mode. When Broadway executes in Physical Memory, this is called Real Mode.

Virtual Memory is split into two categories:

Virtual Cached Memory
Virtual Uncached Memory

Virtual Cached Memory is your typical 'normal' memory that you are familiar with (i.e. 0x80000000 thru 0x817FFFFF & 0x90000000 thru 0x93FFFFFF).

Virtual Cached Memory is a representation of Physical Memory but it includes any cached content. Cached content may be 'old' or may be 'too new'. Therefore, what you see in Virtual Cached Memory may not be what is actually present in Physical Memory. Virtual Uncached Memory is a simple representation (copy) of Physical Memory.

Virtual Memory has to be split into Cached & Uncached so software always have the option to bypass cache.

Wii games won't run entirely in Real Mode due to lack of 'security'.

In Real Mode, all of memory has the same properties, and those properties cannot be adjusted from the Broadway default settings. With Virtual Mode, you can set different regions of memory to have a variety of different properties, and adjust said properties whenever you want.

Here's a list of Physical, Virtual Cached, and Virtual Uncached memory ranges for most Wii games.

0x00000000 thru 0x017FFFFF Physical Mem1
0x10000000 thru 0x13FFFFFF Physical Mem2
0x80000000 thru 0x817FFFFF Virtual Cached Mem1 (known as mem80 and mem81)
0x90000000 thru 0x93FFFFFF Virtual Cached Mem2 (known as mem9)
0xC0000000 thru 0xC17FFFFF Virtual Uncached Mem1
0xD0000000 thru 0xD3FFFFFF Virtual Uncached Mem2

The list doesn't include everything (like Hardware Memory), just the most common stuff that's relevant to Gecko ASM Codes.

Chapter 2: Structure of Cache Organization

There are two different cache systems in Broadway. L1 (Level 1) and L2 (Level 2). The L2 cache operates in a similar fashion but is larger. There's no need to deep dive into the intricasies of the L2 cache. The L1 cache will be the only cache unit covered about this this thread. The L1 Cache is split into two categories:

Data Cache
Instruction Cache

Instruction Cache is for anything that contains executable instructions, simple enough. Data Cache is for any data that are part of any load/store mechanism. Executable instructions can also be included in the Data Cache. For example, if you write (i.e. store) a new instruction to memory, it will be utilized by both the Instruction and Data cache.

Here's the layout of a Data Cache set/page (each row is known as a 'way')

Way0 | 32-byte Aligned Physical Address | StateBits | 8 Words
Way1 | 32-byte Aligned Physical Address | StateBits | 8 Words
Way2 | 32-byte Aligned Physical Address | StateBits | 8 Words
.. ..
Way6 | 32-byte Aligned Physical Address | StateBits | 8 Words
Way7 | 32-byte Aligned Physical Address | StateBits | 8 Words

The Instruction Cache implements the same layout, but it uses a single "Valid" Bit in place of the State Bits. Each Way will contain a 32-byte aligned physical address. Even though the address is physical, it is always translated to its Virtual Address for usage. "8 words" means the 8 words of data/instructions that are at the 32-byte aligned address. 8 words = 32 byte block. This block is known as a Cache Block. Since every address has to be 32-byte aligned, this means nothing smaller than a 32-byte aligned block of memory can have unique State/Valid Bits.

8 ways (Way0 thru 7) make up one 'Set'. There are a total of 128 Sets for both the Instruction and Data Cache. Both Caches are 32KB in size (32 bytes x 8 ways x 128 = 32,768 bytes = 32KB).

Since every cache block is 32-byte aligned, this means that you make a modification to the cache of let's say address 0x80001504, cache for the words of addresses 0x80001500 thru 0x8000151C will all be effected simultaneously.

Chapter 3: Cache Hits and Misses

It's crucial to understand that the Data Cache can only have new content added to it by store instructions. This includes any typical store instruction, but it also includes the dcbi, dcbz and dcbz_l instructions (these are treated as store instructions by Broadway). Content in the Data Cache is managed by a pseudo least-recently-used algorithm (aka PLRU).

The Instruction Cache gets content added to it by Broadway's Instruction Fetching mechanism only. It is impossible to control the Fetching mechanism directly. Therefore we cannot, at will, add in new content to the Instruction Cache. Just like the Data Cache, content in the Instruction Cache has its own PLRU.

The inner workings of the PLRU is not a concern for us Gecko Code creators. However, we do need to cover Cache Hits and Misses. Anyway, over time, the PLRU will fill instructions/data in the cache and later remove them so new data can use the Cache. The filling of the cache by the PLRU is usually referred to as 'pushing a block(s) onto the Cache'. We cannot change how the PLRU itself functions, but there are specific instructions we can do to forcefully edit Cache Blocks or push new blocks onto the Cache. This is covered in Chapter 5.

Whenever instructions/data is processed by Broadway, Broadway will check the L1 Cache (then the L2 Cache) to see if the specific memory address is in the Cache. If the address is present, this is known as a Cache Hit. If not, this is known as a Cache Miss. Cache misses severely degrade performance.

Chapter 4: State and Valid Bits

Each cache block (with it's 32-byte aligned physical address) in the Data Cache will have one of the following state bits with it.

State bits--

M = Modified
E = Exclusive
I = Invalid

Modified = Present in Virtual Cached Memory but not yet present on Physical Memory; will be written to physical memory sooner or later. When new blocks are placed into the Cache by the PLRU, they are tagged with M bit. Please note that PPC Manuals will sometimes refer to a Data Cache block as "dirty" if it's tagged with the Modified (M) bit.
Exclusive = What's in Virtual Cached Memory is what's in Physical Memory. Please note that PPC Manuals will sometimes refer to a Data Cache block as "clean" if it's tagged with the Exclusive (E) bit.
Invalid = Old data that is now invalid, you can freely erase/modify this block w/o effecting anything. When the PLRU updates Data Cache, only blocks that are tagged with the I bit qualify to be removed from the Cache.

Each physical address in the instruction cache has a valid bit associated with it

V = Valid
I = Invalid

Valid = next time this address is used by an instruction, the value here is what will be used
Invalid = old data that is now invalid, will not be used, can be tossed whenever. When the PLRU updates Instruction Cache, only blocks that are tagged with the I bit qualify to be removed from the Cache.

Chapter 5: List of Cache Instructions

Broadway comes with the following cache instructions~

dcbf rD, rA = Data Cache Block Flush
dcbi rD, rA = Data Cache Block Invalidate
dcbst rD, rA = Data Cache Block Store
dcbt rD, rA = Data Cache Block Touch
dcbtst rD, rA = Data Cache Touch for Store
dcbz rD, A = Data Cache Block Zero
dcbz_l rD, rA = Data Cache Block Zero then Lock
icbi rD, rA = Instruction Cache Block Invalidate

rD + rA = The address (aka Effective Address aka EA)

Note that in all instructions, if rD = r0, it will be treated as literal zero.

dcbf For cache hits, if the block has a M bit, the data in the block is now written to physical memory and an I bit replaces the M bit. If the block has an E bit, the bit is simply changed to I. For cache misses, no action is taken. Therefore you can use dcbf as a way to "erase" the Cache but make sure memory gets updated before Cache is erased. For "erasing" Cache without updating memory, refer to dcbi.
dcbst For cache hits, if the block has an E bit or I bit, no action is taken. If the block has a M bit, the data in the block is written to physical memory and the bit is changed to E. For cache misses, no action is taken. Pro-Tip: If you are familiar with BAT Registers and the region of memory is in a BAT that is marked at 'Write-Through (W bit high), you will never need the dcbst instruction for that region of memory, but performance of Broadway will be degraded.
dcbi For cache hits, the state bit is always changed to I, regardless of what is was before. If the state bit was M, data that was going to be written to physical memory is now discarded. For cache misses, no action is taken. Therefore, dcbi can be used to "erase" the Cache and prevent the Cache from updating physical memory.
dcbt This is used to give the Cache system a hint that an upcoming Load instruction needs to have its 32-bit aligned Address pushed onto the cache. Thus, this is only useful if you know the Load instruction will end up as a Cache Miss. For cache hits, no action is taken. Improper usage of this instruction (too many Cache hits) will degrade performance.
dcbtst This is used to give the Cache system a hint that an upcoming Store instruction needs to have its 32-bit aligned Address pushed onto the cache. Thus, this is only useful if you know the Store instruction will end up as a Cache Miss. For cache hits, no action is taken. Improper usage of this instruction (too many Cache hits) will degrade performance.
dcbz For cache hits, the contents of the Block (virtual memory) are zero'd, and state bit changed to M. For cache misses, a new block using the address referenced by the dcbz is pushed onto the Cache, then the Cache Block is zero'd (regardless of what is present in Physical Memory) & tagged with the M bit.
dcbz_l does the same as above, but will then lock the cache where it can't be modified. This instruction is only legal when the Locked Cache (via HID2) is enabled, otherwise an exception will occur.
icbi is the only instruction you have available to modify the instruction cache. For cache hits, the block is set to Invalid. For cache misses, no action is taken.

As mentioned in Chapter 3, dcbi, dcbz and dcbz_l are treated as store instructions. All other cache-related instructions are treated as load instructions. In conclusion, there are no cache-related instructions to force any updates (add new Blocks) to the Instruction Cache.

Chapter 6: Overwriting Executable Instructions

For Gecko ASM Codes, the only instance where we really need to worry about cache is if your code involves writing/re-writing new instructions that will be executed later on.

This is known as Self-Modifying Code.

When overwriting instructions, you need to ensure they get updated in physical memory before Broadway fetches them for execution. Or else there's a chance the instructions fetched will be the old instructions.

Here's a template for updating cache for writing in new executable instructions

Code:
#rX = points to memory address of newly written executable instruction

dcbst 0, rX

icbi 0, rX

isync

dcbst 0, rX = This will force the block (if M bit tagged) to be written to physical memory. State bit changed to E.
icbi 0, rX = The old instruction may still be present (and marked Valid) in the Instruction Cache. Therefore, we tag it as Invalid
isync = Broadway is an out-of-order execution CPU like any other modern CPU. Even with the icbi instruction, it's possible Broadway still fetched the older instruction. This isync instruction will force Broadway to purge its current fetched instructions and refetch. Thus forcing the new instruction to be fetched.

You do **NOT** need to 32-byte align the address (i.e. 0x8000151C -> 0x80001500) for rX when using the above example source. Broadway will handle that for you.

You also do **NOT** need to include the isync if the first newly written instruction is at least 5 will-be-executed instructions ahead of the icbi. This is because Broadway can only fetch up to 4 instructions at a time.

Fyi: If using the above snippet in a loop mechanism, you only need an isync at the end. Do not place it inside the loop. Also remember that Cache Blocks are 32-byte aligned. Therefore your address incrementation amounts for load and store instructions (in your loop) should be incrementing by 32.

Chapter 7: In-Depth Explanation

To explain the entirety of why we need..

dcbst
icbi
isync

...for the case of Self-Modifying Code, we need to cover some complex aspects of Broadway that you may not be familiar with.

First understand that all Wii games configure virtual regions of memory via what a mechanism called BAT registers.

We don't need to worry what the BAT registers are exactly and how to use them. Just understand that all of usable physical memory is mapped twice virtually, once for data and once for instructions. (For more info on BATs, read this thread HERE)

Thus we have two virtual copies of the same physical memory. It's important to understand that there is no 'built-in' mechanism by Broadway that ensures these two copies of memory always match each other. That is required by software (the program/game/codes/whatever).

The virtual memory that is used for Data is configured as "Write-Back" and "Cache-Enabled".

Write-Back = store operations update the cached memory, but do not instantly update physical memory
Write-Through = store operations update cached memory, plus updating physical memory. performance is degraded.

Since the virtual memory for Data is also cache-enabled, it is referred to as Virtual Cached Memory. Therefore this memory includes all contents of the Data Cache.

The virtual memory for Instructions is also configured as Cache-Enabled (Write-Back/Through is not applicable here).

Anyway since Virtual Cached Memory, for the use of Data, is in Write-Back Mode, this presents a problem for Instruction execution. It can create scenarios where the Instruction Cache is "seeing" a different virtual memory copy than what the Data Cache is "seeing".

It's important to understand how Broadway fetches instructions for execution. The fetching mechanism will hit a virtual address, translate it to its physical address equivalent, and then search various units for the address's instruction. Broadway searches the following places...

L1 Instruction Cache
L2 Instruction Cache
Physical Memory (may also be called System or Main memory in various manuals/websites)

Broadway checks the L1 Cache first. If the address isn't present there, it will then check the L2 Cache. If not present in the L2 Cache, physical memory is finally checked.

For Cache hits, Broadway will then check the address's valid bit in the cache. If the valid bit is set, Broadway will use the instruction that is currently present in Virtual Cached Memory (the memory that the Instruction Cache "see's"). If the invalid bit is instead set, Broadway will directly go to physical memory for the instruction, bypassing the L2 check if necessary.

Keep in mind that L1 and L2 cache are 'synced', whatever is in the L1 Cache is ALWAYS present in the L2 Cache. This is possible due to L2 being larger than L1.

Now that you understand how instruction fetching works, we need to cover the 'under the hood' stuff of store and load instructions via virtual cached memory.

So let's say you have any plane-jane basic store instruction (i.e stw), that stores to plain-jane virtual cached memory. Welp after that store instruction has executed, the physical address will be pushed onto the Data cache and the data itself is written at the virtual cache memory address.

Now let's say you then execute a load instruction (i.e lwz) as the very next instruction. Obviously, what you have just stored via stw is what will be loaded via lwz. That's because the previous store updated the Data Cache (with a new Block), therefore the load instruction will receive a Cache Hit and the contents to load are retrieved from the Data Cache (virtual cached memory).

Now let's say you store over an instruction, the only changes that instantly occur is in the Data Cache which would be the Virtual Cached Memory that the Data Cache "see's". Physical memory doesn't update instantly since the memory in question is under Write-Back mode. Thus, the next time the new instruction is fetched, the old instruction will most likely be used instead.

Why is this?

This is because the newly written instruction won't be in the Instruction Cache's L1 + L2 meaning it's not present in the virtual memory that the Instruction Cache "see's". It will also not be present in Physical Memory.

The utilization of the dcbst instruction will force the newly written instruction to be also written to Physical Memory. However this instruction alone isn't enough. It is possible the old instruction is currently in the Instruction L1/L2 Cache with being marked as Valid. Meaning the instruction fetching mechanism won't even bother checking Physical Memory since the L1/L2 cache is basically saying "Hey we have the instruction! And it's valid! No need to check physical memory!"

Therefore to alleviate this possible problem, we use the icbi instruction to mark the old instruction in the L1/L2 cache as invalid. If the old instruction isn't in the Instruction Cache, then the icbi has zero effect (like a nop). The isync is needed just in case the old instruction was fetched. It will force Broadway to re-fetch instructions again so now the new instruction is guaranteed to be fetched. As mentioned earlier, an isync is not required if the modified new instruction is at least 5+ would-be-executed instructions ahead of the icbi instruction.

In conclusion, these three instructions (dcbst, icbi, isync) will always ensure that your newly written instructions are always executed.

Still confused? Here's a picture:

[Image: cache.png]

rX = New Instruction to write
rY = Address in question

Yellow font shows the changes invoked by the respective instruction.

Instruction is the instruction that HAS executed. Regarding Fetcher Status, it gives you a basic summary of what is happening in regards to the Fetcher and the Instruction Queue. When Instructions are placed into the Queue, they are also placed into the I-Cache (if not present beforehand), and then marked as Valid.

'Not Present most likely' for rY D-Cache means that its very unlikely that rY (with its cache block data) is already present in the D-Cache. Even if it is, we would have no idea (given the information from the diagram) of what its state bits would be.

'0x38000000 possibly' for rY I-Cache means we have no idea (give the information) if rY (with its cache block data) is present (regardless of Valid vs Invalid) in the I-Cache.

In conclusion, these three instructions (dcbst, icbi, isync) will always ensure your newly written instructions are visible to the instruction fetching mechanism.

Chapter 8: Possible Questions and Answers

Hey Vega, I've seen on some PPC Manuals or are on some websites that a sync must be placed after dcbst. Do I need this sync instruction in my Self-Modifying Code?

TLDR Version: No

Want to know exactly why? Read below...

First, let's recap what isync (Instruction Synchronize). isync does the following...

Waits for all prior instructions to complete
Prevents future instructions from completing until isync itself completes
Future instructions that were already fetched, are purged then refetched

Sync is sort of similar to isync, which brings in a lot of confusion. sync does the following..

Waits for all prior instructions to complete
Waits for all Memory Accesses caused by prior instructions to complete
Prevents future instructions from completing until Sync itself completes

What do I mean by Memory Accesses? Are we talking about Loads and Stores?

No, we are not. Placing a sync after a store instruction does *NOT* ensure said store instruction reaches physical memory. This is the biggest misconception of sync. Only dcbf and dcbst ensures a store reaches physical memory (or actually writing to physical memory directly ofc, lol).

The term "Memory Accesses" refers to....

Some external (non-PowerPC) device (i.e. Starlet, DSP engine, DMA engine, EEPROM chip) is accessing Memory (the 32-byte block in question) in any way.
Another PowerPC core or processor is accessing Memory (the 32-byte block in question) in any way
TLB invalidations (from the core that is executing the self modifying code itself, *NOT* some other core)
Page Table R and C bits being accessed by other cores, processors, or external devices

If the self-modifying code is executing in an environment where any of the following is true....

Page Tables are being utilized (which thus uses TLBs)
There is another core within Broadway or another Broadway processor present (which is impossible since Broadway is single core uni-processor), and cache coherency between all cores and/or processors must be maintained
Some external device (i.e. Starlet) is accessing the memory utilized by the self-modifying code

Then a sync would be required. Because typical self-modifying code that you would write for a regular Gecko Code doesn't meet any of these qualifications, a sync can be omitted safely.

What about using dcbf instead of dcbst?

This depends and it's so hard to answer this due to endless amount of factors. To keep it simple, for a C0 code or C2 codes that execute at or slower than once per frame, use dcbf. For C2 codes that execute quicker than once per frame, use dcbst.

Chapter 9: Some neat tricks with Cache instructions

This chapter will contain some snippets of code to show some neat tricks you can do with the Cache. All tricks are sources meant to be compiled as C0 Gecko Codes. Fyi,these tricks will only work on a regular Wii Console.

---

Trick #1: Write word 1 to memory, load it back, and it will be a different value (0) than what was just stored

Summary:
Write null word to virtual address 0x80001500
Flush the block, so we know its written to physical 0x00001500, and therefore the block is now left invalid
Write 1 to virtual address 0x80001500
Load word from physical (0xC0001500 which is direct physical copy of 0x00001500)
If value is *NOT* 1 (aka 0), game will light up disc drive to show success

Code:
#Disable INTs; not done correctly! Do not copy this for your regular cheat codes!

mfmsr r3

rlwinm r12, r3, 0, 17, 15

mtmsr r12

#Set r12 to 0x80000000

lis r12, 0x8000

#Make sure value at 0x80001500 is null beforehand

li r11, 0

stwu r11, 0x1500 (r12)

#Make sure null is also written to the physical memory

dcbf 0, r12

#Set value of 1

li r11, 1

#Set r10 as pointer to start of uncached memory

lis r10, 0xC000

#Store 1 to Virtual Cache memory

#This store will now push r12 on to the Cache assigned with the M bit.

#Fyi: Anything that has the M bit has not been sent to physical memory yet

stw r11, 0 (r12)

#Load up from uncached memory (exact copy of physical)

lwz r11, 0x1500 (r10)

#Check if r11 = 1. If not, light up disc drive

cmpwi r11, 1

beq- restore_ints

#Disc Drive

lis r12, 0xCD00

lwz r0, 0x00C0 (r12)

ori r0, r0, 0x0020

stw r0, 0x00C0 (r12)

#Restore INTs. #Copy current MSR into r12

#Not done correctly! Do not copy this for your regular cheat codes!

restore_ints:

mfmsr r12

#Insert r3's EE bit into r12, overwriting r12's EE bit

rlwimi r12, r3, 0, 16, 16

#Update MSR

mtmsr r12

#End C0

#blr

---

Trick #2: Write zero to memory without using regular store instructions (this will actually write zero to an entire 32-byte aligned block). This isn't really a 'trick' per say since the dcbz instruction is suppose to behave in such a manner, but you get the idea.

Summary:
Write 1 to virtual address 0x80001500
Make block exclusive (via dcbst) to force update to physical memory
Do a temp load to prove value the word value at physical address is 1
Zero the cache block
Force block to be written to physical (via dcbst)
Load value from physical memory
It will equal 0 (disc drive lights up)

Code:
#Disable INTs; not done correctly! Do not copy this for your regular cheat codes!

mfmsr r3

rlwinm r12, r3, 0, 17, 15

mtmsr r12

#Write 1 to 0x80001500

lis r12, 0x8000

li r11, 1

stwu r11, 0x1500 (r12)

#Make sure updates are in physical memory, keep block exclusive

dcbst 0, r12

#At this moment, physical addr 0x00001500 = 1. We will verify this. If not true, skip lightning up disc drive

lis r10, 0xC000

lwz r11, 0x1500 (r10)

cmpwi r11, 1

bne- restore_ints

#Zero out cache block now, block now tagged with M

#This instruction zero's out the data for the block in cached memory

dcbz 0, r12

#However, let's make sure the changes also go the physical memory

dcbst 0, r12

#Now load value from physical, if null disc drive will light up

lwz r11, 0x1500 (r10)

cmpwi r11, 0

bne- restore_ints

#Disc Drive

lis r12, 0xCD00

lwz r0, 0x00C0 (r12)

ori r0, r0, 0x0020

stw r0, 0x00C0 (r12)

#Restore INTs. #Copy current MSR into r12

#Not done correctly! Do not copy this for your regular cheat codes!

restore_ints: 

mfmsr r12

#Insert r3's EE bit into r12, overwriting r12's EE bit

rlwimi r12, r3, 0, 16, 16

#Update MSR

mtmsr r12

#End C0

#blr

---

Trick #3: Write value of 1 to virtual, then immediately write 2 to physical afterwards. However with some cache trickery, when we load the word value from physical memory, it will be the stale value of 1.

Summary:
Flush block at 0x80001500
Write 1 to 0x80001500
Write 2 to 0xC0001500 immediately afterwards
dcbst on cache block to overwrite the 2 with the earlier value of 1
Load value from 0xC0001500
It will be 1 (not 2) and disc drive will light.

Code:
#Disable INTs; not done correctly! Do not copy this for your regular cheat codes!

mfmsr r3

rlwinm r12, r3, 0, 17, 15

mtmsr r12

#First flush the block to make sure it cannot be in the exclusive state after our write

lis r12, 0x8000

ori r12, r12, 0x1500

dcbf 0, r12

#Set r10 to 0xC000

lis r10, 0xC000

#Set values 1 and 2 in their registers

li r11, 1

li r9, 2

#Write 1 to 0x80001500 first!

#Then Write 2 to 0xC0001500 (physical)

stw r11, 0 (r12)

stw r9, 0x1500 (r10)

#Force cache block at 0x80001500 to update to physical memory now

#This will overwrite the newly written value of 2 present at 0x00001500/0xC0001500

dcbst 0, r12

#Now Load from physical & check. If 1 (old value), disc drive will light up

lwz r11, 0x1500 (r10)

cmpwi r11, 2

beq- restore_ints

#Disc Drive

lis r12, 0xCD00

lwz r0, 0x00C0 (r12)

ori r0, r0, 0x0020

stw r0, 0x00C0 (r12)

#Restore INTs. #Copy current MSR into r12

#Not done correctly! Do not copy this for your regular cheat codes!

restore_ints:

mfmsr r12

#Insert r3's EE bit into r12, overwriting r12's EE bit

rlwimi r12, r3, 0, 16, 16

#Update MSR

mtmsr r12

#End C0

#blr

And that's it for Cache, happy coding!