AArch64/ARM64 Tutorial

Chapter 30: Broadcasting, Barrier Instructions, Self Modifying Code

Because of the modern multi-processor, multi-core design of today's systems, there will be situations where a core needs to "tell" every other core in the system about a cache update/change. This is known as Broadcasting. Here's a list of all the cache instructions and their broadcasting abilities:

*Depends on the specific design of the system. If unsure (and especially if you're a beginner), assume the instruction never broadcasts.

If an instruction does not broadcast, barrier instruction(s) will need to be used to ensure the cache instruction has completed. Also, due to various CPU hardware features (Out-of-Order Execution, TLBs, Cache Units, multiple cores, etc.), an instruction may not complete until quite some time after it first gets executed. Thus, barrier instructions are required. The ARM64 instruction set puts three different barrier instructions at our disposal: isb, dmb, and dsb.

isb is the easiest to understand, so we will cover it first. Before an instruction can be executed, it gets fetched (by the Fetcher) and placed into what is called the Instruction Queue. The Queue is part of a larger environment known as the Instruction Pipeline. The Pipeline involves fetching, decoding, executing, completing, and retiring instructions.

A processor operates on what are called clock cycles. Imagine it like a video. A video is just a series of consecutive pictures; each picture is a frame. Most videos you watch are 30 frames per second. Processors work in the same fashion, except instead of frames we use the term clock cycles, or just clocks.

For example, a core of the Cortex-A57 can fetch up to 4 instructions from your program in 1 clock. After the Instruction Queue stage, instructions are decoded and then sent to their respective execution unit depending on what type of instruction each one is. This alone (we won't cover the other stages of the entire instruction pipeline) is already enough to cause out-of-order execution.

There will be times, due to cache operations, where old/stale instructions may be sitting in the Instruction Queue and a program needs them to be removed before continuing. The isb instruction serves this purpose. The isb instruction will purge (remove) any instructions sitting in the Instruction Queue and then wait for currently executing instructions (instructions past the Queue stage) to complete. Instruction Fetching will be restarted afterwards.

How can instructions become old/stale? Well we will save that explanation for later once we cover self modifying code.

The dmb instruction can be used to ensure load-load, store-store, and load-store ordering. Sometimes a CPU may interact with an external bridge-type chip, such as an I/O chip, and the load/store instructions may need to be ordered for that interaction to work correctly. Therefore the CPU may issue a dmb before/after/between a load-load, store-store, or load-store pair to ensure ordering.
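As a quick illustration, here is a minimal sketch of the classic case: one core publishes some data and then raises a "ready" flag. The registers and addresses below are made up for this example. The dmb (using the full-system "sy" option, covered below) guarantees the two stores become visible to other observers in program order, so anyone who sees the flag set will also see the data.

str x0, [x1] //Store the data to the buffer at x1's address
dmb sy       //Data memory barrier: the store above is ordered before the store below
mov w2, #1
str w2, [x3] //Raise the "data is ready" flag at x3's address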

The dsb instruction waits for all currently executing instructions to complete. It also ensures any load/store operations/instructions fully complete before the next instruction of the program gets executed. However, it does **NOT** purge the Instruction Queue like isb does.

The isb instruction does ensure all executing instructions complete, but that completion only refers to the pipeline. Meaning, load/store operations (in reference to Memory, Cache, TLBs, and other units that read Memory) may not be completed.
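Because the two barriers cover different things, you will often see them used back to back: a dsb to drain outstanding memory operations, then an isb to resynchronize the pipeline. A bare-bones sketch (the store and registers are just placeholders):

str w0, [x1] //Some store whose effects must be fully complete before moving on
dsb sy       //Wait for all outstanding loads/stores (and executing instructions) to finish
isb          //Then purge the Instruction Queue and refetch from this point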

dmb and dsb have options to select load-load vs load-store vs store-store ordering, and what level of domain to affect.

Here are the options that apply to both dmb & dsb:
dmb <op>
dsb <op>

<op> list:

sy     - Full System, all loads and stores (the default)
st     - Full System, stores only (store-store ordering)
ld     - Full System, loads (load-load and load-store ordering)
ish    - Inner Shareable Domain, all loads and stores
ishst  - Inner Shareable Domain, stores only
ishld  - Inner Shareable Domain, loads
osh    - Outer Shareable Domain, all loads and stores
oshst  - Outer Shareable Domain, stores only
oshld  - Outer Shareable Domain, loads
nsh    - Non-shareable (the executing core only), all loads and stores
nshst  - Non-shareable, stores only
nshld  - Non-shareable, loads

NOTE: Leaving <op> blank will tell the Assembler to use "sy"
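As a hedged illustration of how the narrower options get used (registers, addresses, and labels below are made up), a producer core only needs store ordering within the Inner Shareable Domain, and a consumer core only needs load ordering, so both can use the cheaper ish* variants instead of a full sy barrier:

//Producer core:
str x0, [x1]      //Store the data
dmb ishst         //Store-store barrier, Inner Shareable Domain only
mov w2, #1
str w2, [x3]      //Raise the flag

//Consumer core:
wait_loop:
ldr w2, [x3]      //Poll the flag
cbz w2, wait_loop //Keep polling while the flag is still 0
dmb ishld         //Load barrier: the flag load is ordered before the data load
ldr x0, [x1]      //Now safe to load the data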

Full System Vs Outer Shareable:
Roughly speaking, the Full System domain covers every observer in the entire system, including agents that sit outside the Outer Shareable domain, while the Outer Shareable domain is a system-defined subset (typically all the cores plus certain other coherent bus masters). For code that only coordinates between cores, you will rarely notice a difference. If unsure, use a Full System type <op>. With the arrival of multiple cores and clusters, cache handling in any Assembly Language has become so incredibly complex that most devs don't bother learning all the under-the-hood cache+barrier stuff. They simply use already-built functions to handle cache+barrier tasks, or just use templates/examples from ARM.com.


Self Modifying Code:
Well, we've spent all this time going over Memory, Cache, Domains, Broadcasting, Barriers, etc. Finally, we can now use this new knowledge for some real world applications. The most common time when you need to deal with Cache (amongst other things you've just learned) is for scenarios that involve Self Modifying Code.

What is Self Modifying Code? It's a fancy term that simply means that you have rewritten over an existing instruction in your program. That's it.

Now you cannot simply rewrite existing instructions and then execute them. There's a high probability that, even though you see the new instruction in your program, that new instruction actually won't be executed; the old instruction will still be executed. How does this happen? Well, cache coherency was not maintained.

You need to understand how Store Instructions actually operate. Let's look at the following basic Store instruction..

str x0, [x1] //Store x0 to address of x1

When you execute a plain-jane Store instruction, you have only modified Virtual Cached Memory. That Store instruction has forcefully added a new entry (Block/Line) into the Data Cache. The block containing x1's address (64-byte aligned if we are talking about the Cortex-A57) gets added to the Data Cache, and it is marked with the M state bit (Modified aka Dirty).

At this point, the stored contents are only at the Virtual Cached Memory address; nothing has physically taken place yet. If we were to then have an ldr instruction next....

ldr x0, [x1]

The value of x0 that was just stored will be reloaded. x1's address is already in the Data Cache due to the Store instruction. Thus, when this ldr instruction executes, the Data Cache receives a Cache Hit, and the contents are loaded from Virtual Cached Memory. Physical Memory is never checked.

Now what happens if we store over an instruction?

Pretend x5 = address of instruction to overwrite
Pretend w4 = instruction

str w4, [x5]

Once the str executes, x5's address will get added into the Data Cache. At this point in time, the contents of w4 have only been written to Virtual Cached Memory.

Now this is really important to understand. There are two different Virtual Cached Memories: the one the Data Cache "sees", and the one the Instruction Cache "sees". The contents of w4 have been written to the Virtual Data Cache memory. Remember, the Data Cache handles stores and loads, not the Instruction Cache.

Alright, after the str, if you execute the newly written instruction, it will fail (the old instruction will execute). Why does this happen? Well, the Instruction Cache doesn't see what the Data Cache is seeing. When the new instruction is executed, the CPU will check whether its address is in the Core's L1 Instruction Cache. It will not be there, so an Instruction Cache miss will occur. The contents at x5's *PHYSICAL* address will be used instead.

At the physical address, the old instruction is still present. Therefore, the old instruction gets executed.

-----

Ok so we need to remedy a few things. After the str instruction, we need to force the Data Cache to "push" the contents into Physical Memory. We can use the dc cvac instruction.

str w4, [x5] //Write new instruction, at this point only Virtual Data Cache memory has been updated
dc cvac, x5 //Force the Data Cache to update Physical Memory with new w4 contents at x5's address

Alright, so now once the new instruction executes, it will work! Or will it????

This would work if and only if an Instruction Cache miss occurs when we execute the new instruction. But what if, beforehand, for whatever reason, x5's address was already present in the Core's L1 Instruction Cache with the old instruction contents?

How can we fix this possible issue? Well, we force the Instruction Cache to invalidate its contents for that address. That way, even if the old instruction was cached, the Instruction Cache will be forced to fetch what's in Physical Memory.

ic ivau, x5

The above will invalidate the instruction cache block using address in x5. Ok so now we have 3 instructions..

str w4, [x5] //Write new instruction, at this point only Virtual Data Cache memory has been updated
dc cvac, x5 //Force the Data Cache to update Physical Memory with new w4 contents at x5's address
ic ivau, x5 //Just in case there is an Instruction Cache Hit, force Instruction Cache to use what's in Physical Memory

Ok now we are done, right? Nope, we're not. There's a chance that the old instruction has already been fetched and is sitting in the Instruction Queue, meaning our dc and ic instructions are useless for that already-fetched copy. We need an instruction that will purge the Instruction Queue. Ah hah! The isb instruction!

isb  //Force CPU to purge instruction queue, and do a refetch. This will ensure new instruction will be fetched then executed.

Now how do we know if we need the isb or not? Well, for something like the Cortex-A57, a max of 4 sequential instructions can be fetched at a time. This means that if your execution of the newly written instruction is *AT LEAST* 5 sequential instructions below the ic ivau instruction, then you can omit the isb instruction. That reasoning is specific to the implementation though; when in doubt, keep the isb.

-----

Here are all 4 instructions that we are using so far...

str w4, [x5] //Write new instruction, at this point only Virtual Data Cache memory has been updated
dc cvac, x5 //Force the Data Cache to update Physical Memory with new w4 contents at x5's address
ic ivau, x5 //Just in case there is an Instruction Cache Hit, force Instruction Cache to use what's in Physical Memory
isb  //Force CPU to purge instruction queue, and do a refetch. This will ensure new instruction will be fetched then executed.

Now let's address a few things. First, you may have noticed the dc instruction uses the Point of Coherency, since it pushes/forces the cache contents all the way to Physical Memory. Point of Coherency instructions are rarely needed and take a lot of time to execute.

We can use the dc cvau instruction instead. This cleans to the Point of Unification (PoU) instead: the point at which the core's Instruction Cache and Data Cache are guaranteed to see the same copy of a memory location (typically the L2 cache). Here's the updated code...

str w4, [x5] //Write new instruction, at this point only Virtual Data Cache memory has been updated
dc cvau, x5 //Clean the block in the current Working Core
ic ivau, x5 //Just in case there is an Instruction Cache Hit, force Instruction Cache to use what's in Physical Memory
isb  //Force CPU to purge instruction queue, and do a refetch. This will ensure new instruction will be fetched then executed.

The above will work for any case where the Core responsible for executing the Self Modifying Code is also the same Core that is responsible for running the newly modified code. If you are in a situation where a different Core might execute the modified code, we need to change some things up..

str w4, [x5]
dc cvau, x5
dsb ish
ic ivau, x5
dsb ish
isb

The first dsb ish makes sure the dc cvau has fully completed, and that its effect is visible to the other cores, before the ic ivau executes. Also, there are other "hidden" load/store mechanisms (such as TLB operations) throughout the whole system that may become out of sync due to the str & dc cvau instructions. The dsb instruction will force all of these load/store mechanisms to finish.

The 2nd dsb instruction makes sure the ic ivau has completed across the other cores before we go any further. And finally we have an isb. You will need the isb no matter what in this case; keep in mind isb only affects the core that executes it, so any other core that will run the modified code needs to execute its own isb as well. The 'ish' in the dsb instructions selects the Inner Shareable Domain.

Alright so that whole template of code above will work for *any* case of Self Modifying Code. Obviously, alter the registers used in the template to satisfy your program requirements/needs.
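If you find yourself patching instructions in more than one spot, it can be handy to wrap the template in a small routine. Below is a minimal sketch of that idea; the label write_new_instruction is a made-up name, and it assumes w4 already holds the new instruction encoding and x5 the target address, exactly like the template above.

write_new_instruction:
str w4, [x5] //Write the new instruction (only the Data Cache is updated at this point)
dc cvau, x5  //Clean the Data Cache block to the Point of Unification
dsb ish      //Wait for the clean to complete across the Inner Shareable Domain
ic ivau, x5  //Invalidate the Instruction Cache block for that address
dsb ish      //Wait for the invalidate to complete across the Inner Shareable Domain
isb          //Purge this core's Instruction Queue and refetch
ret          //Return to the caller

Call it with bl write_new_instruction after loading w4 and x5. Keep in mind that any other core that will run the patched code still needs its own isb.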


Data Cache Clean vs Flush?
Now why didn't we use the dc civac instruction (clean then invalidate, aka flush)? Well, cleaning is a bit faster than flushing. Also, chances are the program will use the address(es) of the self modifying code multiple times again. Therefore we want the Cache to keep the address(es) as long as possible. If flushed, the address(es) are marked invalid and will be cast out of the Cache sooner.

If the self modifying code is only going to be executed once, then it's actually faster to use a Flush. This is because you won't be forcing an address to stay in the Cache when said address is only going to be used one time. Flushing always broadcasts, but you will need to use dsb ish afterwards regardless, due to the hidden mechanisms I explained earlier.
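Putting that together, a one-shot patch could look like the sketch below. This is just the earlier template with dc civac swapped in for dc cvau based on the reasoning above (it is not an official ARM template); the barriers stay exactly the same.

str w4, [x5] //Write the one-shot instruction
dc civac, x5 //Clean & invalidate (flush) the Data Cache block out to the Point of Coherency
dsb ish      //Wait for the flush to complete and be observed by the other cores
ic ivau, x5  //Invalidate the Instruction Cache block for that address
dsb ish      //Wait for the invalidate to complete
isb          //Purge the Instruction Queue and refetch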


Next Chapter

Tutorial Index