AArch64/ARM64 Tutorial

Chapter 33: SIMD Part 3/3

Sometimes a slew of example instructions doesn't cut it when you are trying to understand ARM's SIMD unit. How about this? We will do a mock exercise of reading a player's XYZ coordinates on a video game map. What are XYZ coordinates? They are measurements that give you the location of an object in a 3 dimensional space.

In a two dimensional space (like a piece of graph paper), going left & right is the X coordinate. Going up and down is the Y coordinate.

Two dimensions is simple enough. Now, we will cover a 3 dimensional space. X & Y are still the same. However, we have a new coordinate for forward & backward motion. This is the Z coordinate.

Imagine you are holding that piece of graph paper in front of your face. Left & Right is X. Up & Down is Y. Going forward (through the paper) and back is the Z coordinate.

Another way to remember how to differentiate Y vs Z is that, in this convention, Y is always for elevation. If we think of it like a compass, X is West & East, and Z (**NOT** Y) is North & South.

These XYZ coordinates in the video game update once per "frame". We will save the XYZs of the current frame. Then on the next frame, we compare old vs new to see the difference. This introduces time as a 4th dimension, which allows us to calculate how far the player moved during that frame: the player's "XYZ speed".

XYZ coordinates are usually single-precision floats and usually reside consecutively in memory as 32-bit words. More often than not, they are stored as "XYZW" coordinates where W is always a constant value of 1.0. We will pretend that W is constant, and therefore will not affect our equations.
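As a rough picture of that layout, here is a minimal C sketch. The struct name PlayerPos and the constant PLAYER_POS_BYTES are made up for illustration, and the sketch assumes the game really does store W right after Z and that a float is 4 bytes.

```c
#include <stddef.h>

/* Hypothetical record: one player position as four consecutive 32-bit floats. */
typedef struct {
    float x;  /* byte offset 0  */
    float y;  /* byte offset 4  */
    float z;  /* byte offset 8  */
    float w;  /* byte offset 12, assumed to stay at 1.0 */
} PlayerPos;

/* Size of the whole record in bytes. */
enum { PLAYER_POS_BYTES = sizeof(PlayerPos) };
```

Under those assumptions the record is 16 bytes, which is the same width as one 128-bit q register, so a single quadword load can pick up X, Y, Z, and W together.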

Pretend we are inserting code at an existing instruction of a program. The instruction (or address where the instruction resides) gets executed once per frame. When it is executed, x0 (at that time) always points to the most recent frame's XYZ coordinates of the moving player. Pretend that we will use an unused spot of memory to keep the old (previous) XYZ coordinates. Its pretend address will be in x1.

To write this program/code (video game cheat is a better term), we first need to...

1. Load the updated/new XYZs (located at address in x0) into an FPR
2. Load the old/previous XYZs (located at address in x1) into an FPR
3. Store the updated/new XYZs into our unused memory spot (x1) to overwrite the previous XYZs

Steps 1 and 2 can happen in either order, as long as step 3 comes AFTER both: it must not overwrite the old XYZs before step 2 has read them, and it needs the new XYZs that step 1 loaded. Let's start writing out the source. We will use v0 for the new XYZs, and use v1 for the old XYZs. Pretend that all registers we use are safe to use and do not alter the video game's behavior in any unintentional way.

// Load new/updating XYZs
ldr q0, [x0]

// Load old/previous XYZs
ldr q1, [x1]

// Store new/updating XYZs, overwriting old
str q0, [x1]
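The load, load, store sequence can be modeled in plain C as a sketch. The function name load_and_save is hypothetical; cur stands in for the memory x0 points to, saved stands in for our spare slot that x1 points to, and the v0/v1 arrays stand in for the registers.

```c
#include <string.h>

/* Hypothetical model of the three memory steps.
   cur   = game-owned newest XYZW (what x0 points to)
   saved = our spare slot holding the previous XYZW (what x1 points to) */
void load_and_save(const float cur[4], float saved[4],
                   float v0[4], float v1[4])
{
    memcpy(v0, cur, 4 * sizeof(float));    /* ldr q0, [x0] : newest XYZW    */
    memcpy(v1, saved, 4 * sizeof(float));  /* ldr q1, [x1] : previous XYZW  */
    memcpy(saved, v0, 4 * sizeof(float));  /* str q0, [x1] : keep for later */
}
```

The store goes to the spare slot, not back to the game's own memory, so the next frame's "old" value is whatever this frame's "new" value was.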

At this point we have both XYZs (plus W) in v0 and v1. Remember that no data conversion occurs on quadword load/stores. Thus these XYZs are still in proper single-precision float form. Therefore we must operate on them using single-precision lane-based vector instructions.

The formula for calculating XYZ speed using two sets of XYZs is as such...
XYZ Speed = sqrt{[(x2 - x1)^2] + [(y2 - y1)^2] + [(z2 - z1)^2]}

x1 = X coordinate at time instance #1 (or frame #1)
y1 = Y coordinate at time instance #1 (or frame #1)
z1 = Z coordinate at time instance #1 (or frame #1)
x2 = X coordinate at time instance #2 (or frame #2)
y2 = Y coordinate at time instance #2 (or frame #2)
z2 = Z coordinate at time instance #2 (or frame #2)

Frame #2 = new/updated. Frame #1 = old/previous

The first thing we need to do in this formula is do the 3 subtraction operations. We can do that with just one vector float instruction (fsub).

fsub v0.4s, v0.4s, v1.4s

Keep in mind that since "W" is still present, this will also do the operation on W as 1 minus 1, which equals 0. Since there's no difference in "W", it will not affect our calculations on XYZ. Therefore, we can just ignore it.

Subtraction is complete. Now we need to raise each of the results to the power of 2. That simply means multiplying a number by itself. Therefore we will use the fmul instruction.

fmul v0.4s, v0.4s, v0.4s
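The fsub and fmul steps can be pictured lane by lane with a small C sketch. The function name squared_deltas is hypothetical, and the arrays simply stand in for the four 32-bit lanes of v0 and v1.

```c
/* Lane-by-lane model of:
     fsub v0.4s, v0.4s, v1.4s
     fmul v0.4s, v0.4s, v0.4s
   v0 holds the new XYZW on entry and the squared differences on exit. */
void squared_deltas(float v0[4], const float v1[4])
{
    for (int lane = 0; lane < 4; lane++) {
        float d = v0[lane] - v1[lane];  /* fsub: new minus old     */
        v0[lane] = d * d;               /* fmul: square the result */
    }
}
```

With new = (4, 6, 3, 1) and old = (1, 2, 3, 1), the lanes end up as (9, 16, 0, 0). The last lane is W, which comes out as 0 because 1 minus 1 is 0.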

At this point we have our 3 "by-products" (ignoring the W by-product). We will refer to these by-products as just X, Y, and Z. These 3 by-products must now be summed together. Unfortunately, we cannot do this in one instruction. No single Advanced SIMD instruction exists that will add all 4 floats of one FPR together (remember W is 0 so it won't affect anything).

What we can do is use the faddp instruction that was discussed in the previous chapter.

faddp v0.4s, v0.4s, v0.4s

faddp adds adjacent pairs of lanes. Assuming X sat at the lowest address (so the little-endian ldr put X in lane 0, Y in lane 1, Z in lane 2, and W in lane 3), X and Y of v0 are added together and the result is placed into the lower 32 bits (lane 0) of v0. Z and W of v0 are added together and the result is placed into the middle-lower 32 bits (lane 1) of v0. Because the second source register is also v0, the same two sums are produced again: X+Y is placed in the middle-upper 32 bits (lane 2) and Z+W is placed in the upper 32 bits (lane 3).

This is confusing to say the least, because the upper 64 bits of the faddp result mean nothing to us, but it had to be done because faddp requires two source registers.

At this point v0 is this...

Upper (far left; lane 3) 32 bits = Z+W (or just Z since by-product W was 0)
Middle-Upper (middle left; lane 2) 32 bits = X+Y
Middle-Lower (middle right; lane 1) 32 bits = Z+W (or just Z since by-product W was 0; same value as upper 32 bits)
Lower (far right; lane 0) 32 bits = X+Y (same value as middle-upper 32 bits)
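To see where each pairwise sum lands, here is a plain C sketch of the vector faddp. The function name faddp_4s is made up, and the sketch assumes lane 0 is the lowest 32 bits of the register, which is where X ends up if X sits at the lowest address of a little-endian quadword load.

```c
/* Model of: faddp vd.4s, vn.4s, vm.4s
   Adjacent pairs of the first source fill the low two lanes,
   adjacent pairs of the second source fill the high two lanes. */
void faddp_4s(float d[4], const float n[4], const float m[4])
{
    float r[4];  /* temporary, so d may be the same array as n or m */

    r[0] = n[0] + n[1];
    r[1] = n[2] + n[3];
    r[2] = m[0] + m[1];
    r[3] = m[2] + m[3];

    for (int lane = 0; lane < 4; lane++) {
        d[lane] = r[lane];
    }
}
```

Calling faddp_4s(v, v, v) on the by-products (9, 16, 0, 0) leaves (25, 0, 25, 0): lane 0 holds X+Y, lane 1 holds Z+W, and the upper two lanes are duplicates.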

Now we could just slap the exact same faddp instruction down again to finish the final addition. We would end up with X+Y+Z in the lower 32 bits of v0. However, we will use a slightly modified version of faddp (the scalar version).

faddp s0, v0.2s

This will simply add the middle-lower 32 bits (lane 1) and lower 32 bits (lane 0) within v0 together (ignoring all other bits/values), and the result is placed back into the lower 32 bits (lane 0) of v0. Exact timings depend on the CPU core, but as a reference point the scalar version of faddp executes at least 3 clock cycles faster than the regular vector version of faddp (7 vs 10/11). We won't deep dive into the instruction pipeline in this tutorial. Just understand that an faddp working on an overall doubleword is faster than an faddp working on an overall quadword.

Finally we have just one more math operation to perform. That is the square root calculation. A simple regular plain jane non-vector fsqrt instruction will get the job done.

fsqrt s0, s0

And that's it! We did it! Here's all the instructions combined...

----
ldr q0, [x0]
ldr q1, [x1]
str q0, [x1]
fsub v0.4s, v0.4s, v1.4s
fmul v0.4s, v0.4s, v0.4s
faddp v0.4s, v0.4s, v0.4s
faddp s0, v0.2s
fsqrt s0, s0
----

s0 would contain the XYZ Speed Result. Link to code (slightly modified) on my GitHub - https://github.com/VegaASM/XYZ-Speed-ARMv8/blob/main/xyz.s
