Creating Loops



This thread will teach a beginner ASM coder how to write basic loops in PowerPC ASM. A loop repeats the same group of instructions a set number of times. In this tutorial, the loop's job is to take a string of data and copy it to a different location in memory. Our string of data starts at memory address 0x80002008.

The string of data is this:

Code:
  Address     Data
0x80002008 0x11223344
0x8000200C 0xAABBCCDD
0x80002010 0x12345678
0x80002014 0xABCDEF01
0x80002018 0x12AB34CD

We want to copy this data to memory starting at address 0x81450000. The string of data is a total of 5 words in length (or 10 halfwords, or 20 bytes).

What a beginner coder might do is write out a separate lwz + stw pair for every single word, which is redundant. It is more efficient to write a loop and let the loop transfer the data.
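For reference, this is roughly what that repetitive approach would look like for our 5 words. It's only a sketch: the register choices (r10, r11, r12) are simply the ones used later in this tutorial, and here the addresses are used as-is with plain offsets.

Code:
```asm
lis r12, 0x8000      #r12 = 0x80002008 (where the data is)
ori r12, r12, 0x2008
lis r11, 0x8145      #r11 = 0x81450000 (where the data goes)

lwz r10, 0x0 (r12)   #Copy word 1
stw r10, 0x0 (r11)
lwz r10, 0x4 (r12)   #Copy word 2
stw r10, 0x4 (r11)
lwz r10, 0x8 (r12)   #Copy word 3
stw r10, 0x8 (r11)
lwz r10, 0xC (r12)   #Copy word 4
stw r10, 0xC (r11)
lwz r10, 0x10 (r12)  #Copy word 5
stw r10, 0x10 (r11)
```

That is 2 more instructions for every extra word, so a longer string of data makes the source grow quickly. A loop stays the same size no matter how much data there is.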

There are 2 types of loops:
CTR Loop
Subic. Loop



CTR Loop

The CTR loop uses the Count Register. The Count Register (CTR) is used to keep track of how many times the loop will execute. The number of times the loop needs to execute depends on two factors: how much total data is being transferred, and how you want to transfer it.

Loops can transfer data via bytes, halfwords, or words. For our data shown above, we have 5 words, and we will transfer it one word at a time. Thus we need our loop to execute a total of 5 times. If we were to transfer a byte at a time, we would need the loop to execute 20 times. If transferring a halfword at a time, we would need the loop to execute 10 times.

First, set the CTR to have the value of 5.

Code:
li r12, 5
mtctr r12

The mtctr instruction stands for Move to CTR. The value of r12 is copied to the CTR. Now we need to set our first loop loading address...

Code:
lis r12, 0x8000
ori r12, r12, 0x2008

And we're good to continue on, right? No, we're not. Our first loop loading address needs to be 0x4 before 0x80002008, which is 0x80002004. Why is this needed? We will be using what are called 'updating' instructions for our loop (more on this shortly). An updating instruction adds its offset to the address register first, then loads or stores at the new address. If r12 started at 0x80002008, the very first load would come from 0x8000200C and the first word would be skipped. Since we are transferring one word at a time, we need to start one word (0x4) before the first loading address. If we were transferring halfwords, this would be 0x2 before; if bytes, 0x1 before.

Now, let's correctly set the first loading address of the loop~

Code:
lis r12, 0x8000 #0x80002008 - 0x4 = 0x80002004
ori r12, r12, 0x2004

Now apply that same logic to the first storing address of the loop...

Code:
lis r11, 0x8144 #0x81450000 - 0x4 = 0x8144FFFC
ori r11, r11, 0xFFFC

We've got our initial loading & storing addresses set, let's make the loop...

Code:
the_loop:
lwzu r10, 0x4 (r12)
stwu r10, 0x4 (r11)
bdnz+ the_loop

A lot to unpack here. First, all loops need a label name. The lwzu and stwu instructions are those 'updating' instructions I mentioned earlier.


Load Word and Zero with Update
lwzu rD, VALUE (rA) #VALUE is a signed 16-bit value

VALUE + rA = the effective address. The word at the effective address is loaded into rD. Afterwards, rA itself is set to the effective address, so rA is increased/decreased by VALUE accordingly. rA cannot be r0 and cannot be the same register as rD.
Use lhzu for halfwords, use lbzu for bytes.


Store Word with Update
stwu rD, VALUE (rA) #VALUE is a signed 16-bit value

Same logic as lwzu, but this stores rD's value to memory instead of loading a value from memory into it. Use sthu for halfwords, use stbu for bytes.
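As an illustration of the halfword versions, here is a sketch of the same 20 byte copy done one halfword at a time. The label name is made up for this example. Notice that three things change together: the loop count (10 instead of 5), the starting addresses (0x2 before instead of 0x4 before), and the offset in the updating instructions (0x2 instead of 0x4).

Code:
```asm
li r12, 10           #20 bytes / 2 bytes per halfword = 10 executions
mtctr r12

lis r12, 0x8000      #0x80002008 - 0x2 = 0x80002006
ori r12, r12, 0x2006

lis r11, 0x8144      #0x81450000 - 0x2 = 0x8144FFFE
ori r11, r11, 0xFFFE

halfword_loop:
lhzu r10, 0x2 (r12)  #Load halfword from r12 + 2, r12 is increased by 2
sthu r10, 0x2 (r11)  #Store halfword to r11 + 2, r11 is increased by 2
bdnz+ halfword_loop
```

The end result in memory is the same as the word loop. It just takes twice as many executions, which is why transferring by words is the better choice when the data length is a multiple of 4.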


Still confused? Okay, so when our loop is first executed and the lwzu instruction gets executed, r12 has the value of 0x80002004. The word is loaded from 0x80002004 + 0x4 = 0x80002008 and placed into r10. As part of the same instruction, r12 gets 0x4 (VALUE) added to it and is now 0x80002008. When the loop repeats a second time, lwzu loads the word at 0x8000200C and r12 becomes 0x8000200C, and so on and so on. The stwu instruction does the same thing with r11 for the storing address. With this 'updating' feature of lwzu and stwu, we don't need to write out extra instructions such as 'addi r12, r12, 4' and 'addi r11, r11, 4'. This saves on compiled code length and on execution time.
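To see what the updating instructions are saving us, here is a rough sketch of the same copy written with plain lwz/stw plus addi. This is only for comparison, and the label name is made up. Since nothing gets added before the load here, the starting addresses are NOT moved back by 0x4.

Code:
```asm
li r12, 5
mtctr r12

lis r12, 0x8000      #r12 = 0x80002008, no -0x4 needed
ori r12, r12, 0x2008

lis r11, 0x8145      #r11 = 0x81450000, no -0x4 needed

plain_loop:
lwz r10, 0x0 (r12)   #Load the word at r12
stw r10, 0x0 (r11)   #Store it at r11
addi r12, r12, 4     #Move the loading address to the next word
addi r11, r11, 4     #Move the storing address to the next word
bdnz+ plain_loop
```

The loop body is 5 instructions instead of 3, and those 2 extra instructions get executed on every pass of the loop.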

Now let's talk about the bdnz+ instruction. This stands for Branch Decrement Not Zero. When this instruction executes, the CTR is decremented by 1 (example: 5 --> 4), then the CTR is compared to 0. If it is NOT zero, execution branches to our loop label (the_loop) and the loop repeats. If it is zero, the loop is over and the instructions after the loop are executed. The + at the end is a branch prediction hint that tells the CPU the branch will usually be taken, which is the case for a loop.
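One thing worth knowing: because bdnz decrements the CTR before checking it, a CTR that starts at 0 wraps around to 0xFFFFFFFF and the loop keeps going for roughly 4 billion executions instead of not running at all. That can't happen with our fixed count of 5. If the count were ever worked out while the game is running, though, a guard like the sketch below could be used. Everything new here is made up for the example: r9 is assumed to hold the count, the two label names are invented, and r11/r12 are assumed to already hold the starting addresses minus 0x4.

Code:
```asm
cmpwi r9, 0          #r9 = word count (assumed to be calculated earlier)
ble- skip_loop       #Zero or negative count, don't enter the loop
mtctr r9

guarded_loop:
lwzu r10, 0x4 (r12)
stwu r10, 0x4 (r11)
bdnz+ guarded_loop

skip_loop:
#Instructions after the loop continue here
```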

With everything being said, here's the entire source~

Code:
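li r12, 5 #Loop will execute 5 times (5 words)
mtctr r12 #Copy r12 to the CTR

lis r12, 0x8000 #0x80002008 - 0x4 = 0x80002004
ori r12, r12, 0x2004 #r12 = first loading address

lis r11, 0x8144 #0x81450000 - 0x4 = 0x8144FFFC
ori r11, r11, 0xFFFC #r11 = first storing address

the_loop:
lwzu r10, 0x4 (r12) #Load word from r12 + 4 into r10, r12 is increased by 4
stwu r10, 0x4 (r11) #Store r10 to r11 + 4, r11 is increased by 4
bdnz+ the_loop #CTR = CTR - 1, repeat the loop if CTR is not 0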

For a better idea of what's going on visually, here is a series of 4 pictures. The 1st picture is right before the loop is first executed (I manually placed in the values for r11 and r12 beforehand). The next 3 pictures show the execution of one iteration of the loop, so the CPU ends up back at the lwzu instruction.

[Image: loop1.png]

[Image: loop2.png]

[Image: loop3.png]

[Image: loop4.png]

That is what one iteration of the loop looks like. A word gets loaded into r10 and 'transferred' to the spot designated by the stwu instruction using r11.

Here are two more pictures showing the final stages of the loop. The 1st pic is right before the loop is completed. You will notice the CTR has a value of 1 and the bdnz+ instruction is about to be executed. The 2nd pic is right after the bdnz+ instruction has executed. You will see the CTR is now 0 and the loop has fully completed.

[Image: loop5.png]

[Image: loop6.png]



Subic. Loop

With the subic. loop, instead of using the CTR for the amount of times the loop needs to execute, we use a normal general purpose register instead. Here's what our source would look like using the subic. loop...

Code:
li r9, 5 #r9 will be used to mimic our 'CTR'

lis r12, 0x8000 #0x80002008 - 0x4 = 0x80002004
ori r12, r12, 0x2004

lis r11, 0x8144 #0x81450000 - 0x4 = 0x8144FFFC
ori r11, r11, 0xFFFC

the_loop:
lwzu r10, 0x4 (r12)
stwu r10, 0x4 (r11)
subic. r9, r9, 1
bne+ the_loop

The subic. instruction stands for Subtract Immediate Carrying (carrying deals with the carry flag; you don't need to worry about what this flag does here). The dot at the end is the record feature. It works like a free 'cmpwi rD, 0', which is 'cmpwi r9, 0' for this source. Every time the loop executes, the instruction subtracts one from the value of r9 and stores the result back into r9. Then the new value of r9 is compared to 0. The bne+ instruction will branch to the_loop whenever r9 is NOT zero. Once r9 is zero, the loop is over and the instructions after the loop will be executed.
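To show what that 'free' compare is replacing, here is a sketch of the same loop body written without the dot. It assumes r9, r11, and r12 have been set up exactly like the source above, and the label name is made up. The subi instruction used here is just a shorthand for addi with a negative value.

Code:
```asm
long_loop:
lwzu r10, 0x4 (r12)
stwu r10, 0x4 (r11)
subi r9, r9, 1       #Subtract 1 from r9, nothing gets compared
cmpwi r9, 0          #So the compare has to be written out separately
bne+ long_loop       #Repeat the loop while r9 is not 0
```

This does the same job, but the loop body is 5 instructions instead of 4. With subic. the subtract and the compare happen in one instruction.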



CTR vs Subic.

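While both loop types result in the same length of compiled code (9 instructions each), the CTR loop is better because fewer instructions get executed in total. The CTR source executes 6 setup instructions plus 3 instructions per iteration (6 + 3 x 5 = 21), while the subic. source executes 5 setup instructions plus 4 per iteration (5 + 4 x 5 = 25). Fewer executed instructions means less execution time. The CTR loop also allows you to use one less GPR (general purpose register) than the subic. loop.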

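The subic. loop is needed when, let's say, your code's default instruction resides at an address that is within a CTR loop. In that case the game is still using the CTR for its own loop, so overwriting it would break the game's loop. The CTR wouldn't be safe for use there, and you will have to use the subic. loop. Happy coding!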