Mario Kart Wii Gecko Codes, Cheats, & Hacks

Creating Loops



This thread will teach a beginner ASM coder how to write basic loops in PowerPC ASM. A loop is a piece of code that repeats the same instructions a set number of times. In this tutorial, we use loops to copy a chunk/string of data from one place in memory to another. Our chunk/string of data starts at memory address 0x80002008.

The string of data is this..
  •     Address         Data
  • 0x80002008 0x11223344
  • 0x8000200C 0xAABBCCDD
  • 0x80002010 0x12345678
  • 0x80002014 0xABCDEF01
  • 0x80002018 0x12AB34CD

We want to copy this data to memory starting at address 0x81450000. The string of data is a total of 5 words in length (or 10 halfwords, or 20 bytes).

In other words, we start with this...
  •     Address         Data
  • 0x80002008 0x11223344
  • 0x8000200C 0xAABBCCDD
  • 0x80002010 0x12345678
  • 0x80002014 0xABCDEF01
  • 0x80002018 0x12AB34CD

And we want to end up with this...
  •     Address         Data
  • 0x81450000 0x11223344
  • 0x81450004 0xAABBCCDD
  • 0x81450008 0x12345678
  • 0x8145000C 0xABCDEF01
  • 0x81450010 0x12AB34CD

What a beginner coder might do is write a source that repeats lwz+stw for every word, like this...

Code:
lis r11, 0x8000
lis r12, 0x8145
lwz r10, 0x2008 (r11)
stw r10, 0 (r12)
lwz r10, 0x200C (r11)
stw r10, 0x4 (r12)
lwz r10, 0x2010 (r11)
stw r10, 0x8 (r12)
lwz r10, 0x2014 (r11)
stw r10, 0xC (r12)
lwz r10, 0x2018 (r11)
stw r10, 0x10 (r12)

Instead of using a stream of lwz+stw's, we can use Loops.

There are 2 types of loops:
CTR Loop
Subic. Loop



CTR Loop

The CTR loop uses the Count Register. The Count Register (CTR) keeps track of how many more times the loop will execute. The number of times a loop needs to execute depends on these two factors.

  1. How much total Data is being copy-pasted (transferred)
  2. How you want to transfer the Data

Loops can transfer data via bytes, halfwords, or words. For our data shown above, we have 5 words, and we will transfer it one word at a time. Therefore, we need our loop to execute a total of 5 times. If we were to transfer a byte at a time, we would need the loop to execute 20 times. If transferring a halfword at a time, we would need the loop to execute 10 times.
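
The loop count is just the total size divided by the size of each transfer. Here's a rough sketch of that arithmetic in Python (only a calculator, not PowerPC code; the name loop_count is made up for illustration):

```python
# Illustrative helper, not part of any Gecko/PowerPC toolchain.
TOTAL_BYTES = 5 * 4  # 5 words of 4 bytes each

def loop_count(total_bytes, unit_size):
    """Number of loop iterations needed when moving unit_size bytes per pass."""
    if total_bytes % unit_size != 0:
        raise ValueError("data length is not a multiple of the transfer size")
    return total_bytes // unit_size

counts = {name: loop_count(TOTAL_BYTES, size)
          for name, size in (("byte", 1), ("halfword", 2), ("word", 4))}
print(counts)  # {'byte': 20, 'halfword': 10, 'word': 5}
```

The result for the transfer size you pick is the value that goes into the CTR.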

First, let's set the CTR to have the value of 5.

Code:
li r12, 5
mtctr r12

The mtctr instruction stands for Move to CTR. The value of r12 is copied to the CTR. Now we need to set our first loop loading address...

Code:
lis r12, 0x8000
ori r12, r12, 0x2008

And we're good to continue on, right? No, we're not. Why is this incorrect?

Our 1st loop loading address needs to be -0x4 away from 0x80002008, which is 0x80002004. Why is this required? Well, we will be using what are called 'updating' instructions for our loop (will explain more on this shortly). An updating instruction adds its offset to the address register first and then accesses memory at the result, so with an offset of 0x4 the very first load lands at the starting address + 0x4. This means since we are transferring one word at a time, we need one word of space (or -0x4) before the first loading address. If we were transferring halfwords, this would be -0x2, if bytes then this would be -0x1.

Now, let's correctly set the first loading address of the loop~

Code:
lis r12, 0x8000 #0x80002008 - 0x4 = 0x80002004
ori r12, r12, 0x2004

We must apply that same logic to the first storing address of the loop. 0x81450000 - 0x4 = 0x8144FFFC.

Code:
lis r11, 0x8144 #0x81450000 - 0x4 = 0x8144FFFC
ori r11, r11, 0xFFFC
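
To double check the two starting addresses, here's a small Python sketch of the arithmetic. It only models the math; start_address and lis_ori are hypothetical helper names, not real instructions or tools.

```python
# Hypothetical helper names; this only models the arithmetic, it is not PowerPC code.
MASK32 = 0xFFFFFFFF

def start_address(first_address, unit_size):
    """Address to preload so the first updating instruction lands on first_address."""
    return (first_address - unit_size) & MASK32

def lis_ori(upper16, lower16):
    """Value a register holds after 'lis rX, upper16' then 'ori rX, rX, lower16'."""
    return ((upper16 << 16) | lower16) & MASK32

load_start = start_address(0x80002008, 4)
store_start = start_address(0x81450000, 4)
print(hex(load_start), hex(store_start))  # 0x80002004 0x8144fffc

# The lis/ori pairs from the thread build exactly these values.
assert lis_ori(0x8000, 0x2004) == load_start
assert lis_ori(0x8144, 0xFFFC) == store_start
```

Notice that subtracting 0x4 from 0x81450000 changes the upper half too, which is why the lis value is 0x8144 and not 0x8145.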

We've got our initial loading & storing addresses set, let's make the loop...

Code:
the_loop:
lwzu r10, 0x4 (r12)
stwu r10, 0x4 (r11)
bdnz+ the_loop

A lot to unpack here. First, all loops need a label name so the branch at the bottom has something to jump back to. The lwzu and stwu instructions are those 'updating' instructions I mentioned earlier. Let's figure out what they do...


Load Word and Zero with Update
lwzu rD, SIMM (rA)


SIMM + rA = The effective address. The word located at the effective address is loaded into rD. Afterwards, the effective address is written back into rA. Therefore, if rA is used in a future instruction, it has a new incremented/decremented value. Use lhzu for halfwords, and lbzu for bytes.


Example:
#r4 = 0x80456CF4
lwzu r0, 0x24 (r4)
#After lwzu has executed r4 is NOW 0x80456D18. (0x80456CF4 + 0x24 = 0x80456D18)
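
If it helps, here's a toy Python model of that example. It's a sketch under my own assumptions: registers and memory are plain dicts, and the word stored at 0x80456D18 is made up for the demo.

```python
# Toy model: 'mem' maps word addresses to 32-bit values, 'regs' maps names to values.
MASK32 = 0xFFFFFFFF

def lwzu(regs, mem, rd, simm, ra):
    """Model of 'lwzu rD, SIMM (rA)': load from rA + SIMM, then write that address to rA."""
    ea = (regs[ra] + simm) & MASK32   # effective address
    regs[rd] = mem[ea]                # load the word at the effective address
    regs[ra] = ea                     # the 'update' part

regs = {"r0": 0, "r4": 0x80456CF4}
mem = {0x80456D18: 0xDEADBEEF}        # made-up word for the demo
lwzu(regs, mem, "r0", 0x24, "r4")
print(hex(regs["r0"]), hex(regs["r4"]))  # 0xdeadbeef 0x80456d18
```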


Store Word with Update
stwu rD, SIMM (rA)


Same concept as lwzu but this is storing rD's value to memory instead of loading a value from memory into it. Use sthu for halfwords, use stbu for bytes.
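
Here's the same kind of toy Python model for stwu, using the first store of our loop as the example. Again this is only a sketch with dicts standing in for registers and memory.

```python
MASK32 = 0xFFFFFFFF

def stwu(regs, mem, rs, simm, ra):
    """Model of 'stwu rS, SIMM (rA)': store to rA + SIMM, then write that address to rA."""
    ea = (regs[ra] + simm) & MASK32   # effective address
    mem[ea] = regs[rs] & MASK32       # store the word at the effective address
    regs[ra] = ea                     # the 'update' part

regs = {"r10": 0x11223344, "r11": 0x8144FFFC}
mem = {}
stwu(regs, mem, "r10", 0x4, "r11")
print(hex(regs["r11"]), hex(mem[0x81450000]))  # 0x81450000 0x11223344
```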


These updating instructions can cut down the number of instructions your source contains. Let's say we have this lwzu instruction...
lwzu r0, 0x24 (r4)

If we were to mimic this withOUT lwzu, we would have to use two instructions...
lwz r0, 0x24 (r4)
addi r4, r4, 0x24

Okay, you now know what lwzu and stwu do. Let's talk about the bdnz+ instruction. This stands for Branch Decrement Not Zero. The + is just a hint to the CPU that the branch will usually be taken. The instruction does the following...
  • Decrement the value in the Count Register by 1
  • If Count Register does not equal 0, take the branch
  • If Count Register equals 0, skip the branch.

By placing a bdnz+ instruction at the end of our loop with its branch label going back to the top of the loop, this allows us to decrement our Loop Tracker (CTR) and at the same time, stop executing the loop once the Loop Tracker (CTR) hits Zero.
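
The three steps above can be sketched in Python like this. It's a model only: the dict 'state' stands in for the Count Register, and the function name is just borrowed from the mnemonic.

```python
def bdnz(state):
    """Model of bdnz: decrement CTR, return True if the branch is taken."""
    state["ctr"] = (state["ctr"] - 1) & 0xFFFFFFFF
    return state["ctr"] != 0

state = {"ctr": 5}
passes = 0
while True:
    passes += 1          # stands in for the loop body
    if not bdnz(state):  # branch not taken, fall out of the loop
        break
print(passes, state["ctr"])  # 5 0
```

One thing the model makes visible: the decrement happens before the check, so if the CTR starts at 0 it wraps around to 0xFFFFFFFF and the loop keeps going instead of ending right away.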

Here's the entire source~

Code:
li r12, 5
mtctr r12

lis r12, 0x8000 #0x80002008 - 0x4 = 0x80002004
ori r12, r12, 0x2004

lis r11, 0x8144 #0x81450000 - 0x4 = 0x8144FFFC
ori r11, r11, 0xFFFC

the_loop:
lwzu r10, 0x4 (r12)
stwu r10, 0x4 (r11)
bdnz+ the_loop
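
To trace the whole source without a debugger, here's a rough Python simulation of it. This is a sketch and not real PowerPC code: memory is a dict keyed by address, and the variables r10, r11, r12 and ctr simply stand in for the registers.

```python
MASK32 = 0xFFFFFFFF

# Source data from the thread, keyed by address.
mem = {
    0x80002008: 0x11223344,
    0x8000200C: 0xAABBCCDD,
    0x80002010: 0x12345678,
    0x80002014: 0xABCDEF01,
    0x80002018: 0x12AB34CD,
}

ctr = 5                 # li r12, 5 / mtctr r12
r12 = 0x80002004        # lis + ori: first load address minus 4
r11 = 0x8144FFFC        # lis + ori: first store address minus 4

while True:
    r12 = (r12 + 4) & MASK32   # lwzu r10, 0x4 (r12): update the address...
    r10 = mem[r12]             # ...and load the word found there
    r11 = (r11 + 4) & MASK32   # stwu r10, 0x4 (r11): update the address...
    mem[r11] = r10             # ...and store the word there
    ctr = (ctr - 1) & MASK32   # bdnz+ the_loop: decrement CTR
    if ctr == 0:               # branch only while CTR is not zero
        break

copied = [mem[0x81450000 + 4 * i] for i in range(5)]
print([hex(w) for w in copied])
```

When the loop is done, r12 and r11 are left pointing at the last word loaded and the last word stored (0x80002018 and 0x81450010).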

For a better idea of what's going on visually speaking, here is a series of 4 pictures. The 1st picture is right before the loop is first executed. I've manually placed in the values for r11 and r12 beforehand. The next 3 pictures show one iteration of the loop, so the CPU ends up back at the lwzu instruction.

[Image: loop1.png]

[Image: loop2.png]

[Image: loop3.png]

[Image: loop4.png]

That is what one iteration of the loop looks like. A word gets loaded into r10 and 'transferred' to the spot designated by stwu instruction using r11.

Here are two more pictures showing the final stages of the loop. The first pic is right before the loop is completed. You will notice the CTR has a value of 1 and the bdnz+ instruction is about to be executed. The 2nd pic is right after the bdnz+ instruction has executed. You will see the CTR is now 0 and the loop has fully completed.

[Image: loop5.png]

[Image: loop6.png]



Subic. Loop

With the subic. loop, instead of using the CTR for the amount of times the loop needs to execute, we use a normal general purpose register instead. Here's what our source would look like using the subic. loop...

Code:
li r9, 5 #r9 will be used to mimic our 'CTR'

lis r12, 0x8000 #0x80002008 - 0x4 = 0x80002004
ori r12, r12, 0x2004

lis r11, 0x8144 #0x81450000 - 0x4 = 0x8144FFFC
ori r11, r11, 0xFFFC

the_loop:
lwzu r10, 0x4 (r12)
stwu r10, 0x4 (r11)
subic. r9, r9, 1
bne+ the_loop

The subic. instruction stands for Subtract Immediate Carrying (carrying deals with the carry flag, you don't need to worry about what this flag does). The small dot you see appended to subic is called the Record feature. It's a free use of 'cmpwi rD, 0', which is cmpwi r9, 0 for this source. The "subic." instruction for our loop subtracts one from the value of r9 and stores the result back into r9 every time the loop executes. Then it compares the new value of r9 against zero. The bne+ instruction will branch to the_loop whenever r9 is NOT zero. Once r9 is zero, the loop is over and the instructions beneath the loop will be executed.
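
Here's a small Python sketch of how subic. and bne+ work together as the loop counter. It's a simplified model: only the 'equal to zero' result of the Record feature is tracked, and subic_dot is a made-up function name.

```python
MASK32 = 0xFFFFFFFF

def subic_dot(value, imm):
    """Model of 'subic. rD, rA, imm': returns (result, cr0_eq)."""
    result = (value - imm) & MASK32
    return result, result == 0     # the dot compares the result against zero

r9 = 5                             # li r9, 5
passes = 0
while True:
    passes += 1                    # stands in for lwzu + stwu
    r9, cr0_eq = subic_dot(r9, 1)  # subic. r9, r9, 1
    if cr0_eq:                     # bne+ falls through once r9 is zero
        break
print(passes, r9)  # 5 0
```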



CTR vs Subic.

While both loop types result in the same length of assembled code (9 instructions each), the CTR loop is better because fewer instructions are executed in total: its loop body is 3 instructions per iteration instead of 4, which results in less execution time. The CTR loop also allows you to use one less GPR (general purpose register) than the subic. loop.
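
To put numbers on that, here's a quick Python tally of executed instructions for our 5-word copy. It counts instructions only, which is a rough stand-in for execution time and not a cycle-accurate measurement.

```python
def executed_instructions(setup, per_pass, passes):
    """Total instructions executed: one-time setup plus the loop body per pass."""
    return setup + per_pass * passes

# CTR loop: li, mtctr, lis, ori, lis, ori = 6 setup; lwzu, stwu, bdnz = 3 per pass
ctr_total = executed_instructions(6, 3, 5)
# subic. loop: li, lis, ori, lis, ori = 5 setup; lwzu, stwu, subic., bne = 4 per pass
subic_total = executed_instructions(5, 4, 5)
print(ctr_total, subic_total)  # 21 25
```

The gap grows by one instruction for every extra pass, so the bigger the copy, the more the CTR loop saves.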

The subic. loop is needed when, let's say, your code's default instruction resides at an address that is inside a CTR loop of the game. Obviously, the CTR wouldn't be safe for use since overwriting it would break the game's own loop count, and you will have to use the subic. loop. Happy coding!