Chapter 28: Cache Part 1/2
Prelude~
So far, everything we have learned in Assembly can be replicated in higher-level languages (C/C++/Rust/etc), excluding the Chapter about Exceptions. Since Compilers are so good nowadays, the majority of Computer Programs are written only in these higher-level languages. Think of it like this... the Compiler will always beat you no matter how good you are at Assembly.
However, we are now in the realm where higher-level languages no longer apply, and where Compilers won't help us. Everything here will be "lower-level" specific. This is where Assembly gets tough. If your goal for Assembly was to learn the Basics and write out simple Programs, or to better understand C/C++, then your journey ends here. However, I encourage you to stick around; the fun stuff begins now.
To understand cache, we must first understand Virtual vs Physical Memory. We covered some of the basics in the previous Chapters but we will go into further detail. Physical memory is well... the real physical memory. Virtual memory is a customized copy of Physical Memory which includes the ability to split the copied memory into chunks and apply various properties to said chunks. The CPU can then execute a program solely on Virtual Memory. Any writes in Virtual Memory will also be applied to Physical memory, but changes are usually not instant. When a CPU encounters a Virtual Memory address, that address has to be translated into its Physical equivalent.
This is known as Address Translation. For example, say we have the following Physical memory address range...
0x00000000 thru 0x017FFFFF
A program can use a custom numerical (Virtual) range to represent the above Physical range. Such as....
0x80000000 thru 0x817FFFFF
The Virtual address differs from its Physical equivalent by a single bit flip: the most significant bit (bit 0 in PowerPC bit numbering). Most modern programs today use what is called Identical Translation, meaning all Virtual addresses exactly match the real Physical addresses.
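The example mapping above can be sketched in a few lines of Python. This is an illustrative sketch only; real address translation is performed by the CPU's MMU, not by simple bit math, and the function names here are made up.

```python
# Illustrative sketch: the example mapping of Physical range
# 0x00000000-0x017FFFFF to Virtual range 0x80000000-0x817FFFFF
# by flipping the most significant bit (bit 0 in PowerPC numbering).

def phys_to_virt(phys_addr):
    """Map a Physical address to its example Virtual equivalent."""
    return phys_addr | 0x80000000

def virt_to_phys(virt_addr):
    """Map the Virtual address back to its Physical equivalent."""
    return virt_addr & 0x7FFFFFFF

print(hex(phys_to_virt(0x017FFFFF)))  # 0x817fffff
print(hex(virt_to_phys(0x80000000)))  # 0x0
```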
Most programs will set up two copies of Virtual Memory. The two copies are not exactly the same. Before defining them, we need to know what Cache is...
Any modern CPU will contain specialized hardware units/chips where it keeps frequently used contents. The CPU can access the contents in these units/chips quicker than it can access contents in Physical Memory. These contents are known as the Cache.
Virtual Cached Memory is a copy of Physical Memory, but it includes any Cached content. Sometimes Cached content may be too old or too new (contents within a block of Physical Memory not matching the same block of Virtual Cached Memory). Therefore, Virtual Cached Memory may not accurately represent what is currently present in Physical Memory.
Virtual Uncached Memory is an exact 1:1 copy of Physical Memory. This memory is slower than Virtual Cached Memory but it is used by the Program when it needs to bypass the use of Cache.
Whenever a CPU is executing in Virtual Memory, this is known as Virtual Mode. Whenever a CPU is executing in Physical Memory, this is known as Real Mode. You briefly learned about Virtual vs Real Mode back in Chapter 17.
Now you may ask yourself this.... Why even set up a Virtual Uncached Memory? Why not just directly use Physical Memory when needing to bypass the use of Cache? This is because Physical Memory cannot be split into chunks and have different attributes/properties set across said chunks. Not only that, Physical Memory will have set default attributes that cannot be changed. Also, Physical Memory can only be accessed while in Real Mode. Virtual vs Real Mode is determined by some bits in the Machine State Register. Toggling these bits on/off (and adding other necessary code to enter/leave Real Mode) is extremely slow and will unnecessarily add bloat/size to a Program.
A CPU can have multiple levels of Cache Systems. All PowerPC systems will at least have an L1 (Level 1) Cache system. The L1 Cache system contains two units...
Instruction Cache is for anything that contains executable instructions, simple enough. Data Cache is for any data that is part of any load/store mechanism. Executable instructions can also be included in the Data Cache. For example, if you write (i.e. store) a new PowerPC instruction to memory, it will be utilized by both the Instruction and Data cache.
Some PowerPC CPUs will have a secondary Cache System, known as the L2 (Level 2) Cache System. Sometimes the L2 Cache will have options to allow it to be split into two units (L2 Data & L2 Instruction), while other CPUs may allow the L2 Cache to handle Data only. Not only that, some PPC CPUs may contain a Level 3 Cache system.
It's important to understand that a higher level Cache System will contain everything that the lower Cache System(s) contains. Higher level Cache systems are always larger than their lower level counterparts.
For example, the Broadway chip (Nintendo Wii CPU) has two levels of Cache systems. The L1 Cache System has a Data Cache Unit and an Instruction Cache Unit. The L2 Cache System is for both Instruction & Data BUT it can be configured to do Data only. Since it's larger than the L1 Data Cache unit, the L2 Cache will always contain everything that the L1 Data Cache unit contains. This is done to ensure Cache coherency.
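The inclusion property just described can be illustrated with simple sets. The address tags below are made-up values, used only to show the relationship:

```python
# Made-up address tags, for illustration only.
l1_data_tags = {0x00001500, 0x00002A40, 0x000000C0}
l2_tags = {0x00001500, 0x00002A40, 0x000000C0, 0x00005000, 0x00007FE0}

# Coherent inclusion: every tag in the L1 Data Cache is also present
# in the larger L2, but not the other way around.
print(l1_data_tags.issubset(l2_tags))  # True
```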
A Cache Unit is split into equal-sized pages/chunks. These pages/chunks are known as Sets. Each Set is then split into a fixed amount of rows called Ways. Each Way contains an Address Tag and State/Valid bits. The Address Tag is an aligned Physical Address. Data Cache units use State bits while Instruction Cache units use Valid bits. The smallest unit of memory that a Cache Unit can handle is called a Cache Block, and the alignment of the Address Tag's Physical Address depends on the Cache Block size.
We'll keep our focus on the Broadway, the Nintendo Wii PPC CPU. Its L1 Cache Units use 32-byte blocks. This means a Way will contain an address tag that consists of a 32-byte aligned Physical Address. Thus, whatever State/Valid bits are present for the Way will affect the entire 32-byte block.
What does this mean? Well, if you use a Cache Instruction to modify the State/Valid bits of address 0x0000150C, then it affects all contents in the address range of 0x00001500 thru 0x0000153F.
Let's refocus again on the Broadway chip. The Broadway's L1 Data Cache is a 32KB 8-way Set-associative unit that uses 32-byte Cache blocks. The term "set-associative" means the entire unit is split into equal-sized Sets and each Set has the same amount of Ways. With this Cache unit being 8-way, that means each Set has 8 Ways. We can figure out the amount of Sets using this handy formula...

Number of Sets = Cache Size / (Ways x Block Size) = 32768 / (8 x 32) = 128
Set numbering starts at 0; therefore this Unit starts at Set0 and ends at Set127. Way numbering starts at 0 too, therefore each Set has a Way0, Way1, ... til Way7.
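The Set count can be verified in a couple of lines. Note the set-index calculation shown afterwards (taking the address bits just above the block offset) is a common scheme and an assumption for illustration, not confirmed Broadway behavior:

```python
CACHE_SIZE = 32 * 1024  # 32KB
WAYS = 8
BLOCK_SIZE = 32

num_sets = CACHE_SIZE // (WAYS * BLOCK_SIZE)
print(num_sets)  # 128

# Assumed indexing scheme: the low 5 address bits pick the byte within
# the 32-byte block, and the next 7 bits pick one of the 128 Sets.
def set_index(addr):
    return (addr >> 5) % num_sets

print(set_index(0x0000150C))  # 40
```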
Here's a layout of a Set of Broadway's L1 Data Cache..
Way0 | 32-byte Aligned Physical Address | StateBits | 8 Words
Way1 | 32-byte Aligned Physical Address | StateBits | 8 Words
Way2 | 32-byte Aligned Physical Address | StateBits | 8 Words
.. ..
Way6 | 32-byte Aligned Physical Address | StateBits | 8 Words
Way7 | 32-byte Aligned Physical Address | StateBits | 8 Words
Pretend the above is for Set0; now say you have 127 more of these Sets (Set1 thru Set127), and that would be the entire L1 Data Cache. Each Way contains a 32-byte aligned Physical Address as its tag. Even though the address tags use a Physical Address, that doesn't mean the contents in the Cache are always in Physical Memory. StateBits are bits that tell you the state of the Cache Block (we'll cover these Bits shortly). 8 Words is the 8 words of data that start at the aligned address.
8 words is 32 bytes, hence why a Cache Block is 32 bytes in size.
Broadway also has a L1 Instruction Cache that's the same Set size and format, but it uses ValidBits in place of State Bits.
It's crucial to understand that the Data Cache can only have new content added to it by store instructions. This includes any typical store instruction, but it also includes the cache-specific instructions of dcbi, dcbz and dcbz_l (these are treated as store instructions by any PowerPC CPU). We'll cover all the cache instructions in detail in the next Chapter. Content in Broadway's L1 Data Cache is managed by a pseudo least-recently-used algorithm (aka PLRU).
The Instruction Cache gets content added to it by Broadway's Instruction Fetching mechanism only. It is impossible to control the Fetching mechanism directly. Therefore we cannot, at will, add in new content to the Instruction Cache. Just like the L1 Data Cache, content in the L1 Instruction Cache has its own PLRU.
The inner workings of the PLRU are not a concern for us. However, we do need to cover Cache Hits and Misses. Anyway, over time, the PLRU will fill the Cache with instructions/data and later remove them so new data can use the Cache. The filling of the Cache by the PLRU is usually referred to as 'pushing a block (or blocks) onto the Cache'. We cannot change how the PLRU itself functions, but there are specific instructions we can use to forcefully edit Cache Blocks or push new blocks onto the Cache. This is covered in the next Chapter.
Whenever instructions & data are processed by the CPU, it will check the L1 Cache by taking the Virtual Address and checking for its 32-byte aligned Physical Address equivalent (address tag) in the Cache Unit. If the Tag is not present in the L1, then the L2 (if present in the PPC System) will be checked. If not present in the L2, then Physical Memory (aka main/system memory) will have to be used.
Whenever the CPU has to resort to looking at Physical Memory for instructions/data, this is known as a Cache MISS. If the address tag is present in L1/L2, this is known as a Cache HIT. Cache misses severely degrade performance.
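The lookup flow above can be modeled as a toy set-associative cache. This is a hedged sketch: the real hardware also tracks State/Valid bits, the eviction shown is a stand-in for the PLRU, and the set-index scheme is an assumption:

```python
BLOCK_SIZE = 32
NUM_SETS = 128
WAYS = 8

# Each Set is modeled as a list of up to 8 address tags. Real hardware
# also tracks State/Valid bits and PLRU ordering; this sketch only
# demonstrates the HIT/MISS decision.
cache = [[] for _ in range(NUM_SETS)]

def access(phys_addr):
    tag = phys_addr & ~(BLOCK_SIZE - 1)  # 32-byte aligned address tag
    index = (tag >> 5) % NUM_SETS        # which Set to search (assumed scheme)
    ways = cache[index]
    if tag in ways:
        return "HIT"
    if len(ways) == WAYS:                # Set is full: evict a block
        ways.pop(0)                      # stand-in for the PLRU's choice
    ways.append(tag)                     # push the block onto the Cache
    return "MISS"

print(access(0x0000150C))  # MISS (cold cache)
print(access(0x00001500))  # HIT  (same 32-byte block)
```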
Each Data Cache block (address tag within a Way) will have State Bits. The Data Cache uses what is called the "MEI Architecture" (Modified, Exclusive, Invalid).
Modified = Present in Virtual Cached Memory but not yet present in Physical Memory; will be written to Physical Memory sooner or later. When exactly the contents get written to Physical Memory usually depends on the PLRU, what's being added to the Cache, etc. However, contents can be forcefully written to Physical Memory via specific Cache Instructions. When new blocks are pushed onto the Cache, they are tagged with the M bit. The term "Dirty" is sometimes used instead of the term Modified.
Exclusive = What's in Virtual Cached Memory is what's in Physical Memory. The term "Clean" is sometimes used instead of the term Exclusive.
Invalid = Old data that is now invalid; you can freely erase/modify this block without affecting anything.
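The three States above can be sketched as a tiny state machine. The event names ("store", "write_back", "invalidate") are made up for this sketch; they are not Broadway's actual bus/cache events:

```python
# Hedged sketch of MEI state changes for one Data Cache block.
def next_state(state, event):
    if event == "store":        # CPU writes into the block
        return "M"              # block now differs from Physical Memory
    if event == "write_back":   # contents copied out to Physical Memory
        return "E" if state == "M" else state
    if event == "invalidate":   # e.g. via a Cache Instruction
        return "I"
    return state

state = "I"
for event in ("store", "write_back", "invalidate"):
    state = next_state(state, event)
    print(event, "->", state)
```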
As mentioned earlier, we don't need to cover the PLRU in particular, but we must understand the States above and how they apply to each Data Cache Block.
The Instruction Cache has a much simpler protocol. A cache block in the Instruction Cache has a basic Valid/Invalid bit.
Valid = The next time an Address in this Cache Block is referenced/used, the contents (instruction) in the Cache will be utilized by the CPU **regardless** of what's in Physical Memory
Invalid = Old Contents. Will *not* be used; Physical Memory will be checked instead. Can be tossed out of the Cache.
We've talked about the terms clean and dirty. Another Data Cache term that is used a lot is "flush". Flushing contents in the Cache simply means cleaning them and then invalidating them afterwards.
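Here's a minimal sketch of flush as just defined: clean (write back if Modified), then invalidate. The block/memory representation is made up for illustration:

```python
# A Cache Block is modeled as a dict; 'memory' stands in for Physical
# Memory, keyed by the 32-byte aligned address tag.
def flush(block, memory):
    if block["state"] == "M":                 # Dirty: clean it first
        memory[block["tag"]] = block["data"]  # write back to Physical Memory
    block["state"] = "I"                      # then invalidate

memory = {}
block = {"tag": 0x00001500, "state": "M", "data": b"\x12" * 32}
flush(block, memory)
print(block["state"], 0x00001500 in memory)  # I True
```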