PowerPC Tutorial

Chapter 29: Cache Part 2/2

Section 1: Cache Instructions

NOTE: In this chapter, all programs/code will assume only a Level 1 Cache Exists, and its Cache Blocks are 32-bytes in size. It also assumes the PowerPC system is a single processor containing just a single Core. We'll discuss more about System designs in the next Chapter.

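All of the PowerPC Cache Instructions calculate their Effective Address (EA) the same way, from two register operands (written here as rA, rB).

EA = rA + rB. If rA = r0, then it's literal zero (the contents of r0 are NOT used).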

IMPORTANT: Any EA that gets calculated will automatically have its lower 5 bits cleared so the EA ends up being 32-byte aligned. Hardware does this "behind the scenes", you do not need to write any code (software) for this.

Example: If EA is 0x80001511, the "final" EA is 0x80001500 and the cache instruction will affect 0x80001500 thru 0x8000151F.
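
To see the alignment for yourself, here's a minimal sketch that does by hand what the hardware does automatically. The register choices (r4, r5) are just assumptions for illustration, and clrrwi is the simplified mnemonic for an rlwinm that clears the rightmost bits.

```asm
lis r4, 0x8000
ori r4, r4, 0x1511 #r4 = 0x80001511, the unaligned EA from the example above
clrrwi r5, r4, 5 #r5 = 0x80001500; lower 5 bits cleared, the same thing hardware does
dcbst r0, r4 #no need to use r5 here; hardware aligns r4 itself and works on 0x80001500 thru 0x8000151F
```

The clrrwi is only there so you can look at r5 in a debugger. A real program would just pass the unaligned address straight to the cache instruction.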

dcbi, dcbz and dcbz_l are treated as store instructions. This is because they are the only cache instructions that can directly modify the contents within the Data Cache. All other cache-related instructions are treated as load instructions. Those instructions can't modify the data/instruction cache contents, but they can change the State/Valid bits.

If you want to make sure something in the Data Cache reaches Physical Memory, you would issue a dcbst instruction. If you want to quickly zero a block of aligned Virtual Memory, you would use a dcbz instruction.

If you had something that's in the cache, and for whatever reason you don't want it to reach Physical memory, you would issue a dcbi instruction. If you want something to reach physical memory and then be tossed out of the Cache asap, you would issue a dcbf instruction.
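
As a rough sketch of the dcbz use mentioned above, here is a loop that zeroes a buffer one Cache Block at a time. The register roles (r3 = start address, r4 = block count) and the label name are assumptions made up for this example, not anything from a real program.

```asm
#Assumptions: r3 = 32-byte aligned start address in cached memory
#             r4 = number of 32-byte blocks to zero (must NOT be 0)
mtctr r4 #loop count goes in the Count Register
zero_loop:
dcbz r0, r3 #zero the whole 32-byte block at r3 in the Data Cache
addi r3, r3, 32 #move to the next Cache Block
bdnz zero_loop #decrement CTR, loop until it hits zero
```

Swapping the dcbz for a dcbst, dcbf, or dcbi gives the same loop shape for the other uses described above.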


Section 2: Self Modifying Code Introduction

Self Modifying Code is a great beginner way to understand Cache. What is Self Modifying Code? It is any instruction(s) in a Program that overwrites other already-existing instructions.

Here's a small example of a Program using self modifying code. It will write over an add instruction with a nop. This isn't a practical/feasible Program, it's just to show how Self Modifying Code can cause cache incoherency.

lis r12, ptr_to_add_instruction@h
ori r12, r12, ptr_to_add_instruction@l #complete pointer lower 16-bits right away for ease of tutorial readability
lis r0, 0x6000 #write assembled nop instruction in r0. Assembled nop is 0x60000000.
stw r0, 0 (r12) #store nop instruction overwriting the add instruction

lis r5, 0x809C #Set some address
lbz r3, 0x2100 (r5) #Load some byte

ptr_to_add_instruction:
addi r3, r3, 1 #increment byte (addi, because the 1 is an immediate value), when CPU reaches here, this will be a nop instruction
stb r3, 0x2100 (r5) #Store byte

In the above code, once the CPU reaches the add instruction, there will be a nop present instead. Right? Right?.....

Well, a beginner would think so, but the add instruction will still get executed, even if you see a nop in your Program as you are stepping thru it (e.g. in GDB). How can this be possible? How can we fix this?


Section 3: Understanding what Store Instructions *actually* do

You need to understand how Store Instructions actually operate. Let's look at the following basic Store instruction (this stw is NOT from our program, just a random example stw instruction).

stw r3, 0 (r4) #Store word to address/ea in r4

When the PPC executes any plain jane Store Instruction, you have only modified Virtual Cached Memory. If the 32-byte aligned Block containing the EA wasn't already in the Data Cache, the Store Instruction causes it to be added. (Loads can also bring Blocks into the Data Cache, but only stores and the store-like cache instructions from Section 1 can change a Block's contents.) The 32-byte aligned physical address of the EA (r4 here, r12 in our Program) is the address tag that gets added to the Cache. The 32 bytes of content starting at that aligned address are the Block itself. The Block is tagged with the M (Modified) State bit. We know from the previous Chapter that Modified = What's in Data Cache may NOT be what's in Physical Memory.

At this point, the stored contents are at the Virtual Cached Memory Address, but nothing Physically has taken place yet. If we were to then have a lwz instruction immediately next....

lwz r3, 0 (r4)

The value that was just stored from r3 will be loaded right back into r3. r4/EA's address is already in the Data Cache due to the Store instruction. Thus, when this lwz instruction executes, the Data Cache receives a Cache Hit, and the Virtual Cached Memory is used for loading the contents from. Physical Memory is left *alone*.

Now what happens if we store over an instruction that we will later execute? Let's review that part of our Self Modifying Code from earlier.

stw r0, 0 (r12) #store nop instruction overwriting the add instruction

Once the stw executes, r12/EA's 32-byte aligned physical equivalent address will get added into the Data Cache (if it's not already in there). At this point in time, the new instruction (nop) has only been written to Virtual Cached Memory.

Now this is really important to understand. There are two different Virtual Cached Memories. The one the Data Cache "sees", and the one the Instruction Cache "sees". The contents (newly written nop) have been written to the Virtual Data Cache memory. Remember Data Cache handles stores and loads, not the Instruction Cache.

Alright, after the stw, if you execute the newly written instruction, it will fail (old/stale add instruction will execute). Why does this happen? Well, the Instruction Cache doesn't see what the Data Cache is seeing. When executed, the address of the newly written nop instruction will be checked by the CPU to see if it is in the Instruction Cache. It will not be there, so an Instruction Cache miss will occur. The contents at r12's 32-byte aligned *PHYSICAL* address will be used instead. Since the add instruction will be within those contents, it will be used instead of the new nop.

Ok so we need another new instruction for our Program. After the stw instruction, we need to force the Data Cache to push the contents into Physical Memory. We can use the dcbst instruction.

lis r12, ptr_to_add_instruction@h
ori r12, r12, ptr_to_add_instruction@l #complete pointer lower 16-bits right away for ease of tutorial readability
lis r0, 0x6000 #write assembled nop instruction in r0. Assembled nop is 0x60000000.
stw r0, 0 (r12) #store nop instruction overwriting the add instruction
dcbst r0, r12 #push data cache (includes our nop) contents to physical memory 

lis r5, 0x809C #Set some address
lbz r3, 0x2100 (r5) #Load some byte

ptr_to_add_instruction:
addi r3, r3, 1 #increment byte (addi, because the 1 is an immediate value), when CPU reaches here, this will be a nop instruction
stb r3, 0x2100 (r5) #Store byte

Alright, so now this will overwrite the add with a nop, and the nop will get executed now! Right?.... WRONG!

The dcbst alone isn't enough. It is possible the old add instruction is currently in the Instruction Cache and marked as Valid. Meaning the PPC's instruction fetcher won't even bother checking Physical Memory since the Instruction Cache is saying "Hey we have the add instruction! And it's marked valid! No need to check what's in physical memory!"

Therefore to alleviate this possible problem, we use the icbi instruction to mark the old add instruction in the Instruction Cache as invalid. If the old instruction isn't in the Instruction Cache, then the icbi has zero effect. Let's add the icbi to our Program now...

lis r12, ptr_to_add_instruction@h
ori r12, r12, ptr_to_add_instruction@l #complete pointer lower 16-bits right away for ease of tutorial readability
lis r0, 0x6000 #write assembled nop instruction in r0. Assembled nop is 0x60000000.
stw r0, 0 (r12) #store nop instruction overwriting the add instruction
dcbst r0, r12 #push data cache (includes our nop) contents to physical memory 
icbi r0, r12 #invalidate instruction cache contents just in case the add instruction is in the cache and is marked as valid

lis r5, 0x809C #Set some address
lbz r3, 0x2100 (r5) #Load some byte

ptr_to_add_instruction:
addi r3, r3, 1 #increment byte (addi, because the 1 is an immediate value), when CPU reaches here, this will be a nop instruction
stb r3, 0x2100 (r5) #Store byte

Okay, now it will work! Right?.... NOPE!


Section 4: Quick Instruction Fetching Lesson

Before we can continue further, you need a quick lesson on how Instruction Fetching works for PowerPC. All instructions go thru a "Pipeline". The life cycle of an instruction is the following...

  1. Instruction Queue
  2. Execution Unit
  3. Completion Queue
  4. Retirement

Instructions are fetched from your Program in Program Order. Up to 4 instructions can be fetched at a time. They are placed into the Instruction Queue (IQ for short). There are 6 IQ spots, IQ0 thru IQ5. IQ0 is the front of the line and IQ5 is the back. Newly fetched instructions fill the free spots behind whatever is already waiting. Instructions sitting in IQ0 & IQ1 are sent off to their respective Execution Unit. When the IQ0 and/or IQ1 spots free up, the instructions sitting further back in the IQ "line" are moved up.

Regarding our Program, the icbi still won't work even though it's forcing the Instruction Cache to use the nop that was pushed to Physical Memory. Why? Because the add instruction has already been fetched and is sitting in the Instruction Queue. It will reach its Execution Unit *before* icbi gets fetched & sent off to its respective Execution Unit.

Meaning your program will already have executed and completed the add instruction by the time icbi gets executed.

Well, how do we fix this new problem now? With the isync instruction! It is a "barrier" instruction. PowerPC has 3 barrier instructions: sync, isync, and eieio.

isync is our only concern for now. I'll explain the others in the next Chapter. Remember when I said that the add instruction is already sitting in the IQ? Well we can use isync to remove everything out of the IQ. What does this do? It means that when the Program's instructions get refetched, the newly written nop will be fetched! Thus, the nop (NOT the stale add) will get executed! Here's our updated final Program...

lis r12, ptr_to_add_instruction@h
ori r12, r12, ptr_to_add_instruction@l #complete pointer lower 16-bits right away for ease of tutorial readability
lis r0, 0x6000 #write assembled nop instruction in r0. Assembled nop is 0x60000000.
stw r0, 0 (r12) #store nop instruction overwriting the add instruction
dcbst r0, r12 #push data cache (includes our nop) contents to physical memory 
icbi r0, r12 #invalidate instruction cache contents just in case the add instruction is in the cache and is marked as valid
isync #Stale add will still remain in the IQ. Purge the IQ so the new nop gets fetched.

lis r5, 0x809C #Set some address
lbz r3, 0x2100 (r5) #Load some byte

ptr_to_add_instruction:
addi r3, r3, 1 #increment byte (addi, because the 1 is an immediate value), when CPU reaches here, this will be a nop instruction
stb r3, 0x2100 (r5) #Store byte

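And now this Self Modifying Code will work on the simple single-Core system this chapter assumes. (The official PowerPC manuals also place a sync between the dcbst and the icbi, so the push to Physical Memory is guaranteed to have finished first. sync is one of the barrier instructions covered in the next Chapter.)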
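
If you need to patch more than one instruction, the same steps can be wrapped in a small subroutine. This is only a sketch: the name patch_instruction and the choice of r3/r4 as arguments are assumptions invented for this example.

```asm
#Hypothetical helper. Call with bl patch_instruction
#Assumptions: r3 = address of the instruction to overwrite
#             r4 = the new assembled instruction (e.g. 0x60000000 for nop)
patch_instruction:
stw r4, 0 (r3) #write new instruction; only the Data Cache has it so far
dcbst r0, r3 #push that Data Cache block to physical memory
sync #not in this chapter's sequence; the PowerPC manuals put a sync here so the dcbst is finished before the icbi
icbi r0, r3 #invalidate the stale copy in the Instruction Cache, if there is one
isync #purge the IQ so the new instruction gets fetched
blr #return to caller
```

A caller would load r3 with the address (lis/ori, same as the Program above), load r4 with the new instruction, then bl patch_instruction.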

Question: Hey Vega, why not use dcbf instead of dcbst for the Self Modifying Code?

Answer: It depends on the specific program and many other factors. Remember that dcbf will get rid of the Cache block once it has been cleaned. To keep it simple, if the Self Modifying Code was only to be modified once, or only on uncommon occurrences, then use dcbf. Otherwise use dcbst.
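
For comparison, here is a sketch of the store-and-clean part of the Program using dcbf, for the "modified only once" case. It assumes r12 and r0 were set up exactly as in the Program above.

```asm
stw r0, 0 (r12) #store nop instruction overwriting the add instruction
dcbf r0, r12 #push block to physical memory AND toss it out of the Data Cache
sync #not in this chapter's sequence; the PowerPC manuals put a sync here so the dcbf is finished before the icbi
icbi r0, r12 #invalidate instruction cache contents in case the old add is in there
isync #purge the IQ so the new nop gets fetched
```

The only change from the dcbst version is that the block no longer takes up room in the Data Cache afterwards, which is fine if nothing is going to load or store to it again soon.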


Pro Tip: Remember when I said the Instruction Fetcher can only fetch 4 instructions max at a time? Well this means if the instruction (the one that gets overwritten) is *MORE* than 4 sequential instructions below the responsible store instruction, then you can usually OMIT the isync. Be careful though: this leans on the exact fetch behavior of one particular PowerPC chip, so if you're not sure, keep the isync.

