All About Cache - Vega - 05-15-2022

This PowerPC tutorial will teach you the ins and outs of the Cache model of Broadway, its instruction set, and how some of these instructions may need to be used for Gecko ASM Codes. This is a lengthy read, but every PPC coder/dev should have a decent understanding of Broadway's Cache model.

Chapter 1: Understanding some Basics about Memory

There are two types of memory, Virtual & Physical. When Broadway executes in Virtual Memory, this is called Virtual Mode. When Broadway executes in Physical Memory, this is called Real Mode.

Virtual Memory is split into two categories: Cached and Uncached.
Virtual Cached Memory is the typical 'normal' memory that you are familiar with (i.e. 0x80000000 thru 0x817FFFFF & 0x90000000 thru 0x93FFFFFF). Virtual Cached Memory is a representation of Physical Memory, but it includes any cached content. Cached content may be 'old' or may be 'too new'. Therefore, what you see in Virtual Cached Memory may not be what is actually present in Physical Memory.

Virtual Uncached Memory is a simple representation (copy) of Physical Memory. Virtual Memory has to be split into Cached & Uncached so software always has the option to bypass cache.

Wii games won't run entirely in Real Mode due to lack of 'security'. In Real Mode, all of memory has the same properties, and those properties cannot be adjusted from the Broadway default settings. With Virtual Mode, you can set different regions of memory to have a variety of different properties, and adjust said properties whenever you want.

Here's a list of Physical, Virtual Cached, and Virtual Uncached memory ranges for most Wii games.
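As a rough sketch of how those three views line up, the Python below builds the ranges from a base-plus-offset rule. The Virtual Cached ranges and the 0xC0001500 <-> 0x00001500 pairing come from this thread; the MEM1/MEM2 names, the MEM2 physical base of 0x10000000, and its 0xD0000000 uncached mirror are assumptions based on the usual retail Wii layout. The helper names are hypothetical.

```python
# Sketch only: helper names are hypothetical, not part of any Wii SDK.
# (name, physical base, size in bytes); sizes assume a retail Wii.
REGIONS = [
    ("MEM1", 0x00000000, 0x01800000),  # 24 MB
    ("MEM2", 0x10000000, 0x04000000),  # 64 MB (assumed physical base)
]
CACHED = 0x80000000    # physical + CACHED   = Virtual Cached address
UNCACHED = 0xC0000000  # physical + UNCACHED = Virtual Uncached address


def to_physical(addr):
    """Map a physical, cached, or uncached address to its physical address."""
    if addr >= UNCACHED:
        phys = addr - UNCACHED
    elif addr >= CACHED:
        phys = addr - CACHED
    else:
        phys = addr
    for _name, base, size in REGIONS:
        if base <= phys < base + size:
            return phys
    raise ValueError(f"0x{addr:08X} is outside MEM1/MEM2")


def to_cached(addr):
    """Virtual Cached view of the same physical location."""
    return to_physical(addr) + CACHED


def to_uncached(addr):
    """Virtual Uncached view of the same physical location."""
    return to_physical(addr) + UNCACHED


def range_table():
    """One row per region: physical | virtual cached | virtual uncached."""
    rows = []
    for name, base, size in REGIONS:
        last = base + size - 1
        rows.append(
            f"{name}: 0x{base:08X}-0x{last:08X} | "
            f"0x{base + CACHED:08X}-0x{last + CACHED:08X} | "
            f"0x{base + UNCACHED:08X}-0x{last + UNCACHED:08X}"
        )
    return rows


if __name__ == "__main__":
    print("\n".join(range_table()))
```

For example, to_uncached(0x80001500) gives 0xC0001500, the address the tricks in Chapter 9 use to peek at physical memory.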
The list doesn't include everything (like Hardware Memory), just the most common stuff that's relevant to Gecko ASM Codes.

Chapter 2: Structure of Cache Organization

There are two different cache systems in Broadway: L1 (Level 1) and L2 (Level 2). The L2 cache operates in a similar fashion to the L1 cache but is larger. There's no need to deep dive into the intricacies of the L2 cache. The L1 cache is the only cache unit covered in this thread.

The L1 Cache is split into two categories: Instruction Cache and Data Cache.
Instruction Cache is for anything that contains executable instructions, simple enough. Data Cache is for any data that is part of any load/store mechanism. Executable instructions can also be included in the Data Cache. For example, if you write (i.e. store) a new instruction to memory, it will be utilized by both the Instruction and Data Cache.

Here's the layout of a Data Cache set/page (each row is known as a 'way'):

Way0 | 32-byte Aligned Physical Address | StateBits | 8 Words
Way1 | 32-byte Aligned Physical Address | StateBits | 8 Words
Way2 | 32-byte Aligned Physical Address | StateBits | 8 Words
..
..
Way6 | 32-byte Aligned Physical Address | StateBits | 8 Words
Way7 | 32-byte Aligned Physical Address | StateBits | 8 Words

The Instruction Cache implements the same layout, but it uses a single "Valid" Bit in place of the State Bits.

Each Way will contain a 32-byte aligned physical address. Even though the address is physical, it is always translated to its Virtual Address for usage. "8 words" means the 8 words of data/instructions that are at the 32-byte aligned address. 8 words = 32-byte block. This block is known as a Cache Block. Since every address has to be 32-byte aligned, nothing smaller than a 32-byte aligned block of memory can have unique State/Valid Bits.

8 ways (Way0 thru Way7) make up one 'Set'. There are a total of 128 Sets in each of the Instruction and Data Caches. Both Caches are 32KB in size (32 bytes x 8 ways x 128 sets = 32,768 bytes = 32KB).

Since every cache block is 32-byte aligned, if you modify the cache for, let's say, address 0x80001504, the cache for the words at addresses 0x80001500 thru 0x8000151C will all be affected simultaneously.

Chapter 3: Cache Hits and Misses

It's crucial to understand that the Data Cache can only have new content added to it by store instructions.
This includes any typical store instruction, but it also includes the dcbi, dcbz and dcbz_l instructions (these are treated as store instructions by Broadway). Content in the Data Cache is managed by a pseudo least-recently-used algorithm (aka PLRU).

The Instruction Cache gets content added to it by Broadway's Instruction Fetching mechanism only. It is impossible to control the Fetching mechanism directly. Therefore we cannot, at will, add new content to the Instruction Cache. Just like the Data Cache, content in the Instruction Cache has its own PLRU.

The inner workings of the PLRU are not a concern for us Gecko Code creators. However, we do need to cover Cache Hits and Misses.

Anyway, over time, the PLRU will fill instructions/data into the cache and later remove them so new data can use the Cache. The filling of the cache by the PLRU is usually referred to as 'pushing a block(s) onto the Cache'. We cannot change how the PLRU itself functions, but there are specific instructions we can use to forcefully edit Cache Blocks or push new blocks onto the Cache. This is covered in Chapter 5.

Whenever instructions/data are processed by Broadway, Broadway will check the L1 Cache (then the L2 Cache) to see if the specific memory address is in the Cache. If the address is present, this is known as a Cache Hit. If not, this is known as a Cache Miss. Cache misses severely degrade performance.

Chapter 4: State and Valid Bits

Each cache block (with its 32-byte aligned physical address) in the Data Cache will have one of the following state bits with it.

State bits--
Modified = Present in Virtual Cached Memory but not yet present in Physical Memory; will be written to physical memory sooner or later. When new blocks are placed into the Cache by the PLRU, they are tagged with the M bit. Please note that PPC Manuals will sometimes refer to a Data Cache block as "dirty" if it's tagged with the Modified (M) bit.

Exclusive = What's in Virtual Cached Memory is what's in Physical Memory. Please note that PPC Manuals will sometimes refer to a Data Cache block as "clean" if it's tagged with the Exclusive (E) bit.

Invalid = Old data that is now invalid; you can freely erase/modify this block without affecting anything. When the PLRU updates the Data Cache, only blocks that are tagged with the I bit qualify to be removed from the Cache.

Each physical address in the Instruction Cache has a valid bit associated with it:
Valid = The next time this address is used by an instruction, the value here is what will be used.

Invalid = Old data that is now invalid; it will not be used and can be tossed whenever. When the PLRU updates the Instruction Cache, only blocks that are tagged with the I bit qualify to be removed from the Cache.

Chapter 5: List of Cache Instructions

Broadway comes with the following cache instructions:

dcbf rD, rA = Data Cache Block Flush
dcbi rD, rA = Data Cache Block Invalidate
dcbst rD, rA = Data Cache Block Store
dcbt rD, rA = Data Cache Block Touch
dcbtst rD, rA = Data Cache Block Touch for Store
dcbz rD, rA = Data Cache Block Zero
dcbz_l rD, rA = Data Cache Block Zero then Lock
icbi rD, rA = Instruction Cache Block Invalidate

rD + rA = The address (aka Effective Address aka EA)

Note that in all instructions, if rD = r0, it will be treated as literal zero.
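To make the address arithmetic concrete, here is a minimal Python sketch of how the Effective Address is formed from the two operands and which 32-byte Cache Block it lands in. The function names are hypothetical, and the set index formula is an assumption that follows from the geometry in Chapter 2 (32-byte blocks, 128 sets); it is not taken from this thread.

```python
BLOCK_SIZE = 32  # bytes per Cache Block (8 words)
NUM_WAYS = 8     # ways per set
NUM_SETS = 128   # sets per L1 cache


def effective_address(regs, first, second):
    """EA = first operand + second operand, where r0 as the first operand
    is treated as literal zero. regs maps register number -> 32-bit value."""
    base = 0 if first == 0 else regs[first]
    return (base + regs[second]) & 0xFFFFFFFF


def block_base(ea):
    """32-byte aligned address of the Cache Block containing ea."""
    return ea & ~(BLOCK_SIZE - 1) & 0xFFFFFFFF


def block_words(ea):
    """Addresses of the 8 words affected by a cache instruction at ea."""
    base = block_base(ea)
    return [base + 4 * i for i in range(BLOCK_SIZE // 4)]


def set_index(ea):
    """Assumed set selection: the address bits just above the block offset."""
    return (ea // BLOCK_SIZE) % NUM_SETS


if __name__ == "__main__":
    regs = {0: 0xDEAD0000, 3: 0x80001504}
    ea = effective_address(regs, 0, 3)  # r0 is read as literal zero
    print(hex(ea), [hex(a) for a in block_words(ea)])
```

With r3 = 0x80001504, an instruction written as "dcbst 0, r3" in this sketch resolves to EA 0x80001504 and acts on the block spanning 0x80001500 thru 0x8000151C, matching the example in Chapter 2.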
As mentioned in Chapter 3, dcbi, dcbz and dcbz_l are treated as store instructions. All other cache-related instructions are treated as load instructions. In conclusion, there are no cache-related instructions to force any updates (add new Blocks) to the Instruction Cache.

Chapter 6: Overwriting Executable Instructions

For Gecko ASM Codes, the only instance where we really need to worry about cache is if your code involves writing/re-writing instructions that will be executed later on. This is known as Self-Modifying Code. When overwriting instructions, you need to ensure they get updated in physical memory before Broadway fetches them for execution. Otherwise, there's a chance the instructions fetched will be the old instructions.

Here's a template for updating cache after writing in new executable instructions:

Code:
#rX = points to memory address of newly written executable instruction
dcbst 0, rX
icbi 0, rX
isync
You do **NOT** need to 32-byte align the address (i.e. 0x8000151C -> 0x80001500) for rX when using the above example source. Broadway will handle that for you. You also do **NOT** need to include the isync if the first newly written instruction is at least 5 will-be-executed instructions ahead of the icbi. This is because Broadway can only fetch up to 4 instructions at a time.

Fyi: If using the above snippet in a loop mechanism, you only need an isync at the end. Do not place it inside the loop. Also remember that Cache Blocks are 32-byte aligned. Therefore the address increment for the load and store instructions in your loop should be 32.

Chapter 7: In-Depth Explanation

To explain the entirety of why we need..

dcbst
icbi
isync

...for the case of Self-Modifying Code, we need to cover some complex aspects of Broadway that you may not be familiar with.

First understand that all Wii games configure virtual regions of memory via a mechanism called BAT registers. We don't need to worry about what the BAT registers are exactly or how to use them. Just understand that all of usable physical memory is mapped twice virtually, once for data and once for instructions. (For more info on BATs, read this thread HERE)

Thus we have two virtual copies of the same physical memory. It's important to understand that there is no 'built-in' mechanism by Broadway that ensures these two copies of memory always match each other. That is the responsibility of software (the program/game/codes/whatever).

The virtual memory that is used for Data is configured as "Write-Back" and "Cache-Enabled".
Since the virtual memory for Data is also cache-enabled, it is referred to as Virtual Cached Memory. Therefore this memory includes all contents of the Data Cache. The virtual memory for Instructions is also configured as Cache-Enabled (Write-Back/Through is not applicable here). Anyway since Virtual Cached Memory, for the use of Data, is in Write-Back Mode, this presents a problem for Instruction execution. It can create scenarios where the Instruction Cache is "seeing" a different virtual memory copy than what the Data Cache is "seeing". It's important to understand how Broadway fetches instructions for execution. The fetching mechanism will hit a virtual address, translate it to its physical address equivalent, and then search various units for the address's instruction. Broadway searches the following places...
Broadway checks the L1 Cache first. If the address isn't present there, it will then check the L2 Cache. If not present in the L2 Cache, physical memory is finally checked.

For Cache hits, Broadway will then check the address's valid bit in the cache. If the valid bit is set, Broadway will use the instruction that is currently present in Virtual Cached Memory (the memory that the Instruction Cache "sees"). If the block is instead marked invalid, Broadway will go directly to physical memory for the instruction, bypassing the L2 check if necessary. Keep in mind that the L1 and L2 caches are 'synced'; whatever is in the L1 Cache is ALWAYS present in the L2 Cache. This is possible due to L2 being larger than L1.

Now that you understand how instruction fetching works, we need to cover the 'under the hood' stuff of store and load instructions via virtual cached memory.

So let's say you have any plain-jane basic store instruction (i.e. stw) that stores to plain-jane virtual cached memory. Welp, after that store instruction has executed, the physical address will be pushed onto the Data Cache and the data itself is written at the virtual cached memory address.

Now let's say you then execute a load instruction (i.e. lwz) as the very next instruction. Obviously, what you have just stored via stw is what will be loaded via lwz. That's because the previous store updated the Data Cache (with a new Block), therefore the load instruction will receive a Cache Hit and the contents to load are retrieved from the Data Cache (virtual cached memory).

Now let's say you store over an instruction. The only change that instantly occurs is in the Data Cache, which would be the Virtual Cached Memory that the Data Cache "sees". Physical memory doesn't update instantly since the memory in question is under Write-Back mode. Thus, the next time the new instruction is fetched, the old instruction will most likely be used instead. Why is this?
This is because the newly written instruction won't be in the Instruction Cache's L1 + L2, meaning it's not present in the virtual memory that the Instruction Cache "sees". It will also not be present in Physical Memory.

The utilization of the dcbst instruction will force the newly written instruction to also be written to Physical Memory. However, this instruction alone isn't enough. It is possible the old instruction is currently in the Instruction L1/L2 Cache and marked as Valid. Meaning the instruction fetching mechanism won't even bother checking Physical Memory since the L1/L2 cache is basically saying "Hey we have the instruction! And it's valid! No need to check physical memory!"

Therefore to alleviate this possible problem, we use the icbi instruction to mark the old instruction in the L1/L2 cache as invalid. If the old instruction isn't in the Instruction Cache, then the icbi has zero effect (like a nop).

The isync is needed just in case the old instruction was already fetched. It will force Broadway to re-fetch instructions so the new instruction is guaranteed to be fetched. As mentioned earlier, an isync is not required if the modified new instruction is at least 5 would-be-executed instructions ahead of the icbi instruction.

In conclusion, these three instructions (dcbst, icbi, isync) will always ensure that your newly written instructions are the ones executed.

Still confused? Here's a picture:

rX = New Instruction to write
rY = Address in question

Yellow font shows the changes invoked by the respective instruction. Instruction is the instruction that HAS executed. Regarding Fetcher Status, it gives you a basic summary of what is happening in regards to the Fetcher and the Instruction Queue. When Instructions are placed into the Queue, they are also placed into the I-Cache (if not present beforehand), and then marked as Valid.
'Not Present most likely' for rY D-Cache means that it's very unlikely that rY (with its cache block data) is already present in the D-Cache. Even if it is, we would have no idea (given the information from the diagram) of what its state bits would be.

'0x38000000 possibly' for rY I-Cache means we have no idea (given the information) if rY (with its cache block data) is present (regardless of Valid vs Invalid) in the I-Cache.

In conclusion, these three instructions (dcbst, icbi, isync) will always ensure your newly written instructions are visible to the instruction fetching mechanism.

Chapter 8: Possible Questions and Answers

Hey Vega, I've seen in some PPC Manuals and on some websites that a sync must be placed after dcbst. Do I need this sync instruction in my Self-Modifying Code?

TLDR Version: No

Want to know exactly why? Read below...

First, let's recap what isync (Instruction Synchronize) does. isync does the following...
Sync is sort of similar to isync, which brings in a lot of confusion. sync does the following..
What do I mean by Memory Accesses? Are we talking about Loads and Stores? No, we are not. Placing a sync after a store instruction does *NOT* ensure said store instruction reaches physical memory. This is the biggest misconception about sync. Only dcbf and dcbst ensure a store reaches physical memory (or actually writing to physical memory directly ofc, lol). The term "Memory Accesses" refers to....
If the self-modifying code is executing in an environment where any of the following is true....
Then a sync would be required. Because typical self-modifying code that you would write for a regular Gecko Code doesn't meet any of these qualifications, a sync can be omitted safely.

What about using dcbf instead of dcbst?

This depends, and it's hard to answer due to the endless number of factors. To keep it simple, for a C0 code or C2 codes that execute at or slower than once per frame, use dcbf. For C2 codes that execute quicker than once per frame, use dcbst.

Chapter 9: Some neat tricks with Cache instructions

This chapter contains some snippets of code to show some neat tricks you can do with the Cache. All tricks are sources meant to be compiled as C0 Gecko Codes. Fyi, these tricks will only work on a regular Wii Console.

---

Trick #1: Write word 1 to memory, load it back, and it will be a different value (0) than what was just stored

Summary:
Write null word to virtual address 0x80001500
Flush the block, so we know it's written to physical 0x00001500, and therefore the block is now left invalid
Write 1 to virtual address 0x80001500
Load word from physical (0xC0001500 which is direct physical copy of 0x00001500)
If value is *NOT* 1 (aka 0), game will light up disc drive to show success

Code:
#Disable INTs; not done correctly! Do not copy this for your regular cheat codes!

---

Trick #2: Write zero to memory without using regular store instructions (this will actually write zero to an entire 32-byte aligned block). This isn't really a 'trick' per se since the dcbz instruction is supposed to behave in such a manner, but you get the idea.

Summary:
Write 1 to virtual address 0x80001500
Make block exclusive (via dcbst) to force update to physical memory
Do a temp load to prove the word value at the physical address is 1
Zero the cache block
Force block to be written to physical (via dcbst)
Load value from physical memory
It will equal 0 (disc drive lights up)

Code:
#Disable INTs; not done correctly! Do not copy this for your regular cheat codes!
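To see why Tricks #1 and #2 behave this way, here is a rough Python model of a write-back Data Cache sitting in front of physical memory. It is a sketch, not the Gecko source for either trick, and it is nowhere near Broadway-accurate: there is no PLRU, no L2, and no interrupts. The CacheModel class and its method names are hypothetical, named after the instructions they loosely imitate.

```python
BLOCK = 32  # bytes per Cache Block


class CacheModel:
    """Toy write-back data cache in front of a flat physical memory."""

    def __init__(self):
        self.physical = {}  # physical word address -> value
        self.dcache = {}    # block base -> {"state": "M" or "E", "words": [...]}

    @staticmethod
    def _split(addr):
        phys = addr & 0x3FFFFFFF         # drop the 0x8.../0xC... mirror bits
        base = phys & ~(BLOCK - 1)
        return base, (phys - base) // 4  # block base, word index in block

    def stw_cached(self, addr, value):
        """Store via Virtual Cached memory: only the cache block changes."""
        base, idx = self._split(addr)
        if base not in self.dcache:      # miss: fill the block from physical
            words = [self.physical.get(base + 4 * i, 0) for i in range(8)]
            self.dcache[base] = {"state": "E", "words": words}
        self.dcache[base]["words"][idx] = value
        self.dcache[base]["state"] = "M"

    def lwz_uncached(self, addr):
        """Load via Virtual Uncached memory: reads physical, ignores cache."""
        base, idx = self._split(addr)
        return self.physical.get(base + 4 * idx, 0)

    def dcbst(self, addr):
        """Write a Modified block to physical; the block becomes Exclusive."""
        base, _ = self._split(addr)
        line = self.dcache.get(base)
        if line and line["state"] == "M":
            for i, word in enumerate(line["words"]):
                self.physical[base + 4 * i] = word
            line["state"] = "E"

    def dcbf(self, addr):
        """Write back like dcbst, then drop (invalidate) the block."""
        self.dcbst(addr)
        self.dcache.pop(self._split(addr)[0], None)

    def dcbz(self, addr):
        """Zero the whole block in the cache and mark it Modified."""
        base, _ = self._split(addr)
        self.dcache[base] = {"state": "M", "words": [0] * 8}


def trick1():
    m = CacheModel()
    m.stw_cached(0x80001500, 0)
    m.dcbf(0x80001500)
    m.stw_cached(0x80001500, 1)        # sits in the cache only
    return m.lwz_uncached(0xC0001500)  # physical still holds 0


def trick2():
    m = CacheModel()
    m.stw_cached(0x80001500, 1)
    m.dcbst(0x80001500)
    before = m.lwz_uncached(0xC0001500)
    m.dcbz(0x80001500)
    m.dcbst(0x80001500)
    return before, m.lwz_uncached(0xC0001500)
```

In this model trick1() returns 0 and trick2() returns (1, 0), which are the values the two summaries say the disc drive light confirms on real hardware.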
---

Trick #3: Write value of 1 to virtual, then immediately write 2 to physical afterwards. However, with some cache trickery, when we load the word value from physical memory, it will be the stale value of 1.

Summary:
Flush block at 0x80001500
Write 1 to 0x80001500
Write 2 to 0xC0001500 immediately afterwards
dcbst on cache block to overwrite the 2 with the earlier value of 1
Load value from 0xC0001500
It will be 1 (not 2) and disc drive will light.

Code:
#Disable INTs; not done correctly! Do not copy this for your regular cheat codes!

And that's it for Cache, happy coding!