AArch64/ARM64 Tutorial

Chapter 28: Cache Part 1/2

To understand cache, we must first understand Virtual vs Physical Memory. Physical memory is, well, the real physical memory of the RAM chip. Virtual memory is a customized copy of Physical Memory, which can be split into chunks with various properties applied to each chunk. The CPU can then execute a program solely on Virtual Memory. Any changes made in Virtual Memory will be applied to Physical Memory, but usually not instantly. When a CPU encounters a Virtual Memory address, that address has to be "translated" into its Physical equivalent.

This is known as Address Translation. For example, let's say we have the following physical memory address range...

0x000000000000 thru 0x00017FFFFFFF

A program can use a custom numerical (virtual) range to represent the above physical range, such as...

0x004000000000 thru 0x00417FFFFFFF

As you can see, each virtual address differs from its physical equivalent by a single flipped bit. Some modern programs running on modern CPUs use what is called Identical Translation, meaning all virtual addresses exactly match the real physical addresses.

Most programs will set up two copies of Virtual Memory. The copies are not exactly the same. They are as follows...
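As a rough sketch, the arithmetic behind the example ranges above can be modelled in a few lines of Python. The constant and function names are made up for illustration, and a real MMU uses translation tables rather than one fixed bit; this only shows the "single flipped bit" relationship for this particular example.

```python
# Toy model of the example mapping above; all names are illustrative only.
PHYS_START = 0x000000000000
PHYS_END = 0x00017FFFFFFF
VIRT_BASE = 0x004000000000  # start of the example virtual range (bit 38 set)


def phys_to_virt(paddr):
    """Map a physical address in the example range to its virtual address."""
    if not PHYS_START <= paddr <= PHYS_END:
        raise ValueError("physical address outside the example range")
    # Bit 38 is always clear in the physical range, so OR-ing sets exactly one bit.
    return paddr | VIRT_BASE


def virt_to_phys(vaddr):
    """Translate a virtual address from the example range back to physical."""
    paddr = vaddr & ~VIRT_BASE  # clear bit 38 again
    if not (vaddr & VIRT_BASE) or not PHYS_START <= paddr <= PHYS_END:
        raise ValueError("virtual address outside the example range")
    return paddr


print(hex(phys_to_virt(0x1000)))        # 0x4000001000
print(hex(virt_to_phys(0x4000001000)))  # 0x1000
```

With Identical Translation, both functions would simply return their argument unchanged.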

  1. Virtual Cached Memory
  2. Virtual Uncached Memory

Any modern CPU will contain specialized hardware units where it keeps frequently used contents. The CPU can access the contents in these units quicker than it can access contents in Physical Memory. These units are known as the Cache.

Virtual Cached Memory is a copy of Physical Memory but it includes any Cached content. Sometimes Cached content may be too old or too new (contents within a block of Physical Memory not matching the same block of Virtual Cached Memory). Therefore Virtual Cached Memory may not accurately represent what is currently present in Physical Memory.

Virtual Uncached Memory is an exact 1:1 copy of Physical Memory. This memory is slower than Virtual Cached Memory but it is used by the Program when it needs to bypass the use of Cache.

Whenever a CPU is executing in Virtual Memory, this is known as Virtual Mode. Whenever a CPU is executing in Physical Memory, this is known as Real Mode.

Now you may ask yourself this: why even set up a Virtual Uncached Memory? Why not just use Physical Memory when needing to bypass/skip the use of Cache? This is because Physical Memory cannot be split into chunks and have different attributes/properties set across said chunks. Not only that, Physical Memory has fixed default attributes that cannot be changed.

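As an FYI, on some other architectures the CPU automatically drops into Real Mode when an exception occurs. AArch64 does not do this: taking an exception does not, by itself, turn off address translation.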


A CPU will have multiple Cache Systems. Any CPU that complies with the ARMv8 Architecture will at least have what's called an L1 (Level-1) Cache System. The L1 Cache System will contain two cache units: an Instruction Cache and a Data Cache.

Instruction Cache is for anything that contains executable instructions, simple enough. Data Cache is for any data that is part of any load/store mechanism. Executable instructions can also be included in the Data Cache. For example, if you write (i.e. store) a new instruction to memory, it will be utilized by both the Instruction and Data cache.

Some CPUs will have a secondary Cache System, known as the L2 Cache System. In some CPUs, the L2 Cache will have options to allow it to be split into two units (L2 Data & L2 Instruction), while other CPUs may only allow the L2 Cache to handle Data.

There can also be CPUs that will have a 3rd Cache system (L3). It's important to understand that a higher level Cache System will contain everything that the lower Cache System(s) contains. Higher level Cache systems are always larger than their lower level counterparts.

For example, the ARM Cortex-A57 CPU has two Cache systems. The L1 Cache System has a Data Cache Unit and an Instruction Cache Unit. The L2 Cache System is for Data only. Since it's larger than the L1 Data Cache unit, it will always contain everything that the L1 Data Cache unit contains (we're talking about a particular core, of course). This is done to ensure Cache coherency.

A Cache Unit is split into equal size pages/chunks. These pages/chunks are known as Sets. Each set will then be split into a fixed amount of rows called Ways. Each Way will then contain an Address Tag and State/Valid bits. This Address Tag is an aligned Physical Address. Data Cache units will use State bits while Instruction Cache units will use Valid bits. The alignment of the Physical Address depends on the Cache Block/Line size. The smallest memory size that Cache Units can handle is called a Cache Block or Cache Line.

Let's say the Cache Unit uses 64-byte blocks/lines. This means a Way will contain an address tag that consists of a 64-byte aligned Physical Address. Thus, whatever State/Valid bits are present for the Way will affect the entire 64-byte block/line.

What does this mean? Well, if you use a Cache Instruction to modify the State/Valid bits of address 0x0000150C, then it affects all memory contents in the address range of 0x00001500 thru 0x0000153F.
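Here is a small sketch of that alignment arithmetic in Python. The function name line_range is made up for illustration; the only assumption is that the line size is a power of two, as in the 64-byte example.

```python
LINE_SIZE = 64  # bytes per cache block/line, as in the example above


def line_range(addr, line_size=LINE_SIZE):
    """Return (first, last) address of the cache line that contains addr."""
    # Clearing the low bits aligns the address down to the start of its line.
    # This only works when line_size is a power of two.
    start = addr & ~(line_size - 1)
    return start, start + line_size - 1


start, end = line_range(0x0000150C)
print(f"0x{start:08X} thru 0x{end:08X}")  # 0x00001500 thru 0x0000153F
```

Any address from 0x00001500 to 0x0000153F gives the same result, which is why a Cache Instruction on one address touches the whole line.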

Let's take the Cortex-A57 CPU. Its L1 Instruction Cache is a 48KB 3-way Set-associative unit that uses 64-byte Cache blocks/lines. The term "set-associative" means the entire unit is split into equal sized sets and each set has the same amount of ways. With this Cache unit being 3-way, that means each Set has 3 ways. We can figure out the amount of Sets using this handy formula...

Number of Sets = Cache Size / (Number of Ways x Block Size)

48KB / (3 x 64 bytes) = 49152 / 192 = 256 Sets

Set numbering starts at 0, therefore this Unit starts at Set0 and ends at Set255. Way numbering starts at 0 too. Therefore each set has a Way0, Way1, and Way2.
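The Set count (cache size divided by ways times line size) can be double-checked with a short Python sketch. The function name num_sets is hypothetical; the numbers are the Cortex-A57 L1 Instruction Cache figures quoted above.

```python
def num_sets(cache_bytes, ways, line_bytes):
    """Number of Sets in a set-associative cache unit."""
    return cache_bytes // (ways * line_bytes)


sets = num_sets(48 * 1024, 3, 64)
print(sets)                         # 256
print(f"Set0 thru Set{sets - 1}")   # Set0 thru Set255
print(sets * 3)                     # 768 cache lines in total (Sets x Ways)
```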


Cache Hits and Misses:

It's crucial to understand that the Data Cache can only have new content added to it by store instructions. This includes any typical store instruction. Please keep in mind that some Cache Instructions are treated as Store instructions (more on the Cache instruction set in the next Chapter). Content in the Data Cache is managed by a replacement algorithm. In some ARMv8 CPUs the algorithm type can be selected (least recently used aka LRU vs round robin aka RR), while other ARMv8 CPUs only use the Least Recently Used algorithm.

The Instruction Cache gets content added to it by the CPU's Instruction Fetching mechanism only. It is impossible to control the Fetching mechanism directly. Therefore we cannot, at will, add in new content to the Instruction Cache. Just like the Data Cache, content in the Instruction Cache has its own LRU.

The inner workings of the LRU are not a concern for us. However, we do need to cover Cache Hits and Misses. Over time the cache will get filled with content, which will later be removed and refilled again. When content is placed into a Cache Unit, this is known as 'pushing' a block/line onto the Cache.

Whenever instructions & data are processed by the CPU, it will check the L1 Cache by taking the Virtual (or Physical) Address and checking it against the Physical Address (address tag) in the Cache Unit. If the Address is not present in the L1, then the L2 (if present) will be checked. If not present in the L2, then the L3 (if present) will be checked. Finally, if the Address is not present in any Cache unit, physical memory (aka main/system memory) will have to be used.

Whenever the CPU has to resort to looking at physical memory for instructions/data, this is known as a Cache MISS. If the address tag is present in L1/L2/L3, this is known as a Cache HIT. Cache misses severely degrade performance.
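The lookup order described above can be sketched as a toy model in Python. Everything here is a simplification for illustration: each cache level is just a set of 64-byte aligned address tags, and the lookup function and the example tags are made up. Real hardware does this with Sets and Ways, not a Python set.

```python
def lookup(addr, levels, line_size=64):
    """Check each cache level in order, from L1 outward.

    levels is a list of (name, tags) pairs, where tags is a set of
    line-aligned addresses currently held by that level.
    """
    tag = addr & ~(line_size - 1)  # align the address down to its cache line
    for name, tags in levels:
        if tag in tags:
            return f"HIT in {name}"
    return "MISS (go to physical memory)"


levels = [
    ("L1", {0x1500}),
    ("L2", {0x1500, 0x2000}),
]

print(lookup(0x150C, levels))  # HIT in L1
print(lookup(0x2010, levels))  # HIT in L2
print(lookup(0x3000, levels))  # MISS (go to physical memory)
```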


Data Cache State Bits:

Each Data Cache block/line (address tag in the Way) will have State Bits. For Data Caches, there are three main protocols that define what kind of State Bits can be present. Some CPUs will use a mix of protocols (i.e. the Cortex-A57 uses MESI for the L1 Cache and MOESI for the L2 Cache). Here are the 3 protocols...

  1. MEI
  2. MESI
  3. MOESI

Let's cover the MEI protocol first, as it's the most basic.

Modified = Present in Virtual Cached Memory but not yet present in Physical Memory; will be written to physical memory sooner or later. When the contents get written to Physical Memory usually depends on the LRU, what's being added to the Cache, etc. However, contents can be forcibly written to Physical Memory via specific Cache Instructions (more on Cache Instructions in the next Chapter). When new blocks/lines are pushed onto the Cache, they are tagged with the M bit. The term "Dirty" is sometimes used instead of the term Modified.

Exclusive = What's in Virtual Cached Memory is what's in Physical Memory. The term "Clean" is sometimes used instead of the term Exclusive.

Invalid = Old data that is now invalid; you can freely erase/modify this block without affecting anything.

Unfortunately, there are no ARMv8 Compliant CPUs which use the MEI protocol. MEI is mainly for older generation single-core processors. Because all ARMv8 CPUs are multi-core, they must at least use MESI. Let's cover MOESI though, since this will inherently cover MESI.

Since a processor has multiple cores, each core will have its own Cache systems. For example, the Nintendo Switch gaming console uses a 4-core Cortex-A57 Processor.

Each Cortex-A57 core has a 48KB L1 Instruction Cache and a 32KB L1 Data Cache. All four cores share a 2MB L2 Data Cache. Therefore, in aggregate, the Nintendo Switch CPU has four L1 Instruction Caches, four L1 Data Caches, and one L2 Data Cache.

This presents cache coherency problems. There will be times a program will need a particular cache block/line to be present in the Cache System of ALL cores.

Modified = Present in Virtual Cached Memory of this Core, but not yet present on Physical Memory. Will be written to physical memory sooner or later. No other copies of this cache line are present in any other Core's Caches. Modified is also known as "Dirty".

Owned = Similar to Modified except only this Core is allowed to have the Cache line in this state. If the Cache line is present in another Core, it must be in the Shared state there.

Exclusive = What's in Virtual Cached Memory of this Core is what's in Physical Memory. No other copies of this cache line are present in other Cores' Caches. Exclusive is also known as "Clean".

Shared = What's in Virtual Cached Memory of this Core is what's in Physical Memory. Other copies of this cache line *may* exist in the other Cores' Caches.

Invalid = Old data that is now invalid; you can freely erase/modify this block without affecting anything.
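To keep the five states straight, here is a small Python sketch that restates the descriptions above as a lookup table. The State enum, the PROPERTIES table, and needs_writeback are all hypothetical names made up for this summary; the table only encodes what this chapter says about each state, not the full coherency protocol.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


# (matches physical memory?, may other cores hold a copy?)
# taken from the state descriptions above; Invalid lines hold nothing useful.
PROPERTIES = {
    State.MODIFIED: (False, False),
    State.OWNED: (False, True),   # other cores may hold it, but only as Shared
    State.EXCLUSIVE: (True, False),
    State.SHARED: (True, True),
}


def needs_writeback(state):
    """True if the line holds content not yet present in physical memory."""
    return state in PROPERTIES and not PROPERTIES[state][0]


# MESI is simply MOESI without the Owned state.
MESI = [s.value for s in State if s is not State.OWNED]

print(MESI)                             # ['M', 'E', 'S', 'I']
print(needs_writeback(State.MODIFIED))  # True
print(needs_writeback(State.SHARED))    # False
```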

As mentioned earlier, we don't need to cover the LRU in particular but we must understand that....

  1. When new blocks are added to the Data Cache (by Store Instructions), they are *always* in the Modified/Dirty State.
  2. Over time, due to the LRU, blocks will end up tagged as Exclusive/Clean (their 2nd to final state). This means the Cache Block's contents match Physical Memory.
  3. Since Exclusive blocks are a waste of Cache space, eventually they will be set to the Invalid state by the LRU.
  4. Sometime later, Invalid blocks are removed from the Data Cache. This is known as "casting out" a Block.


Instruction Cache Valid Bits:

Lucky for us, the Instruction Cache has a much simpler protocol. A cache block/line in the Instruction Cache has a basic Valid/Invalid bit.

Valid = Next Time an Address in this cache line is referenced/used, contents in the cache will be utilized by an instruction **regardless** of what's in Physical Memory

Invalid = Old Contents. Will *not* be used. Physical memory will be checked. Can be tossed out of the Cache.

As mentioned earlier, we don't need to cover the LRU, but the Instruction Cache works like this...

  1. New blocks are added to the Instruction Cache (by the Fetcher); they are *always* tagged Valid.
  2. Over time, Valid blocks will be changed to Invalid.
  3. Sometime later, Invalid blocks are removed from the Instruction Cache.
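The practical consequence of the Valid bit can be shown with a toy model in Python. This is only a sketch under heavy simplification: memory and icache are plain dictionaries keyed by the 64-byte aligned line address, a whole line is represented by one string, and the fetch and invalidate functions are made up for illustration.

```python
memory = {0x1500: "old instruction"}  # stand-in for physical memory
icache = {}                           # line address -> {"valid": ..., "contents": ...}


def fetch(addr, line_size=64):
    """Model of the Fetcher: use a Valid cache line, otherwise go to memory."""
    tag = addr & ~(line_size - 1)
    line = icache.get(tag)
    if line is not None and line["valid"]:
        return line["contents"]  # used regardless of what is in physical memory
    contents = memory[tag]
    icache[tag] = {"valid": True, "contents": contents}  # new lines are tagged Valid
    return contents


def invalidate(addr, line_size=64):
    """Mark the line that contains addr as Invalid."""
    tag = addr & ~(line_size - 1)
    if tag in icache:
        icache[tag]["valid"] = False


print(fetch(0x1504))                  # old instruction
memory[0x1500] = "new instruction"    # physical memory changes behind the cache
print(fetch(0x1504))                  # old instruction (stale, but still Valid)
invalidate(0x1504)
print(fetch(0x1504))                  # new instruction
```

This is why newly written instructions are not picked up until the matching Instruction Cache line has become Invalid.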

We've talked about the terms clean and dirty. Another Data Cache term that is used a lot is "flush". Flushing contents in the Cache simply means cleaning them then invalidating afterwards. In a basic MEI single-core protocol, that would equate to setting a Data Cache Block/Line to Exclusive (which forces contents in Cache to be pushed to physical memory)  then to Invalid.
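As a minimal sketch of a flush in the basic MEI model, here is a toy Python version. The Line class, the memory dictionary, and the clean/invalidate/flush functions are hypothetical names for illustration only; the real operations are done by Cache Instructions, not by code like this.

```python
memory = {0x1500: "old"}  # stand-in for physical memory, keyed by line address


class Line:
    """One Data Cache block/line in a toy MEI model."""

    def __init__(self, tag, data):
        self.tag = tag
        self.data = data
        self.state = "M"  # per this chapter, newly stored lines start as Modified


def clean(line):
    """Write a Modified line back to physical memory and mark it Exclusive."""
    if line.state == "M":
        memory[line.tag] = line.data
        line.state = "E"


def invalidate(line):
    """Mark the line Invalid so it can be cast out later."""
    line.state = "I"


def flush(line):
    """Flush = clean, then invalidate."""
    clean(line)
    invalidate(line)


line = Line(0x1500, "new")
flush(line)
print(memory[0x1500], line.state)  # new I
```

Invalidating without cleaning first would have thrown the "new" contents away, leaving "old" in physical memory.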

