ARM Tutorial for Use with Starlet
First things first. Huge Thank You to Palapeli!
Author's Note: This thread is located in this subforum because the PPC Assembly Tutorial subforum is for PPC only. I don't see a need to make an ARM Assembly Tutorial subforum.
This is an ARM tutorial for the Starlet Core of the Wii. It is designed for those who are already experts in coding/programming with Broadway PowerPC on the Wii. This will cover the basics of the ARMv5 Assembly language, and some Starlet-specific attributes. You will also be provided with some programs/tools to assist you in testing ARM code on Starlet/IOS. For more info, read the following manuals---
NOTE: ONCE AGAIN, this is for those who are already experts in PPC (Broadway) Assembly! This is *not* a "1-0-1" tutorial, this is a "2-0-1" tutorial.
Chapter 1: Intro, Understanding the Starlet and Broadway Boot Sequences
While Broadway is the CPU that runs the Wii's Games, Channels, HBC apps, etc, there is another CPU in the background that actually is the 'Master' CPU. It is an ARM9 (ARM926EJ-S) core nicknamed "Starlet". Its official name is "IOP" which stands for Input-Output Processor. Starlet uses the ARMv5 (specifically ARMv5TEJ) ARM Assembly language.
When you power on the Wii, Starlet boots from an internal MASK ROM (boot0). The following boot sequences are performed~
boot0: Will decrypt and verify (via checksum) the first 48 blocks of the NAND (these blocks are boot1's code). The generated checksum is verified against what is stored in the OTP (One Time Programmable Register). If the checksums don't match, boot0 will halt. Contents of the first 48 blocks of NAND are guaranteed by the manufacturer. Older boot1's had a strncmp bug which allowed the installation of BootMii (as boot2).
boot1: Contains ARM Code that will initialize the GDDR3 memory, set up some Hollywood/Hardware Registers, and check the boot2 version number that is in the SEEPROM. If that number in SEEPROM is higher than what is in the TMD of boot2, then boot1 will halt. Otherwise, boot1 will load up boot2 and then proceed to execute the ARM code in boot2. There are two copies of boot2 in the NAND just in case one copy gets corrupted.
boot2: A stripped down IOS with a small number of tasks. boot2 will check the Wii Menu's TMD (Title Metadata) to know which IOS to load for Starlet to run. boot2 will then load up the System Menu's IOS into memory, and then IOS will be 'booted' and take 'control'.
IOS: Also known as boot3. Loads the Wii Menu into memory. Every menu/channel/etc has a "BS1" code that gets placed into memory at 0x00003400. This "BS1" code is a small snippet of PowerPC code, more on that later.
IOS's are single or multiple ELFs of ARM code that Starlet places into memory and runs the code as an "Operating System". Every IOS has a kernel which is placed into SRAM. SRAM is the memory dedicated for Starlet (under normal operations, Broadway cannot access this memory). Most other IOS's will have additional ELFs that are placed into MEM2 as 'modules/plugins'. Boot2 is an IOS with only a bare bones kernel.
There are different types of IOS for different purposes. Example: IOS80 runs Wii Menu 4.3. This means when the Wii Menu (Broadway/PPC) is running, IOS80 (Starlet) is also running in the background.
Once the Wii Menu is in memory, IOS will write some very elementary PPC code at the EXI Boot Base (a group of Hollywood Hardware Registers). Whenever something is written to the EXI Boot Base, it is copied over (automatically by hardware) to physical address 0xFFF00100 (Broadway's Reset Vector Address).
This is the following code (or something very similar, depending on the specific IOS) that is written to the EXI Boot Base~
'entry' is usually the address of 0x00003400. This is the address to the "BS1" boot code that was placed into memory earlier. 'msr' is the Machine State Register value and it is usually zero.
After writing to the EXI Boot Base, IOS will power on the Broadway (via poking some Hollywood Registers). The Broadway chip boots up and is now running.
When Broadway is powered on from a Hard Reset, its MSR value is 0x00000040. The only bit high is the IP bit (exception prefix), which means exceptions use the addressing mode 0xFFFx_xxxx. This is why Broadway boots at 0xFFF00100 instead of at 0x00000100.
Broadway then executes the code starting at 0xFFF00100 and will jump to 0x00003400 with an MSR value of zero. Broadway is now at the "BS1" code. The BS1 contains code that will do some basic tasks such as setting up some Broadway-specific registers, adjusting the MSR yet again, configuring the Cache, and configuring the BAT registers.
Once that has been completed, execution of Broadway will jump to 0x81330000 (now running in Virtual Cached Mode) and the Wii Menu is now officially running.
It's important to note that once Broadway has been powered on, IOS will recede into a very small loop waiting for tasks/interrupts that are called on by Hardware or Broadway (such as an IPC request for modifying a file on the NAND). When Broadway is running, Starlet (IOS) is running in this loop 99.9% of the time, basically doing nothing.
Chapter 2: Memory, Modes, Instruction Sets, Endianness, Registers
Memory regarding MEM1 and MEM2 (aka Broadway's memory) is straightforward. When referencing this memory in your ARM instructions, you would just use the physical address (e.g. 0x00001500). For Hollywood Registers, just use the physical address equivalent but with bit 8 set high (e.g. 0x0D800064; bit 8 in PPC-style bit numbering is the 0x00800000 bit). However, using SRAM (Starlet's dedicated memory) is confusing....
Without getting into the nitty gritty of various SRAM mappings that can be utilized, SRAM is configured as such whenever a Wii app/game/etc is running...
The IOS Kernel executes code and function calls located at SRAM A (similar to how Wii games execute code at Mem80). Certain parts of SRAM A and all of SRAM B are used as Data/Storage (similar to mem9 on Broadway).
The above addresses are "Mirrors". They are not "real" addresses per se. Also, "Mirrors" are **not** Virtual Addresses. The Mirrors are done by hardware, not software. These Mirrors were implemented to comply with Starlet's hardware requirements for addressing in regards to resets and exceptions.
These are the physical/real/true ranges of SRAM (when Wii games/etc are running):
Therefore, to change a physical address to its Mirrored equivalent, just add 0xF2B00000.
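The address rules in this chapter boil down to a little 32-bit arithmetic. Here is a minimal Python sketch of them. The function names are made up for illustration, and the sample SRAM address used in the usage note is an assumption picked only because the math lands on 0xFFFF0000. It is not taken from the SRAM tables.

```python
MASK32 = 0xFFFFFFFF

# Offset quoted in the text above (physical SRAM -> mirrored SRAM).
SRAM_MIRROR_OFFSET = 0xF2B00000

# "Bit 8" in PPC-style numbering (most significant bit = bit 0).
HOLLYWOOD_STARLET_BIT = 0x00800000


def sram_physical_to_mirror(physical):
    """Add the mirror offset, wrapping around at 32 bits."""
    return (physical + SRAM_MIRROR_OFFSET) & MASK32


def sram_mirror_to_physical(mirror):
    """Undo the mirror offset, wrapping around at 32 bits."""
    return (mirror - SRAM_MIRROR_OFFSET) & MASK32


def hollywood_address_for_starlet(physical):
    """Set PPC-style bit 8 on a Hollywood register address."""
    return physical | HOLLYWOOD_STARLET_BIT
```

For example, sram_physical_to_mirror(0x0D4F0000) gives 0xFFFF0000, and hollywood_address_for_starlet(0x0D000064) gives the 0x0D800064 value shown earlier.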
Regarding the IOS Kernel, it is the highest level of security of IOS and has all privileges. Think of it like the "inner/root core" of the entire IOS. IOS interacts with the AES, SHA, and HMAC engines from the Kernel. When IOS idles while the PPC is running, it is idling via the 'main/idle' loop (called thread 0) that resides in the kernel.
The rest of IOS, what is known as the Modules/Plugins, is located in MEM2 at 0x13400000 thru 0x13FFFFFF (12MB). Under normal conditions, Broadway cannot write to this region of memory.
When an exception occurs in Starlet/IOS, memory mapping starts at 0xFFFF0000. More on exceptions in the next Chapter.
Starlet can execute in 3 different modes---
ARM Mode is the standard 32-bit mode. Instructions are 32 bits in size. All instructions are available for usage.
Thumb Mode is when the processor will run 16-bit sized instructions. This is a reduced instruction set due to the 16-bit size of Thumb instructions. This is also known as Thumbv1 Mode in newer/modern ARM manuals. Starlet runs in this mode for the main loop of the Kernel.
Jazelle Mode is when the processor executes in a mode tailored for a Java Virtual Machine. Starlet *NEVER* runs in this mode. Therefore, it won't be covered.
Unlike most other ARM cores, Starlet runs in Big Endian mode by default, but can be configured to run in Little Endian mode. This was done for compatibility with the Broadway chip.
In the Manuals linked in the first Chapter, the bit labeling is done in Little Endian 'fashion'. The most significant (far left) bit is named as bit 31, and least significant (far right) bit is named as bit 0.
Starlet has absolutely zero Floating Point functionality. Starlet does implement its own Cache and address translation.
Register List of Starlet (in ARM Mode)
Register List of Starlet (in Thumb Mode)
CPSR Breakdown (NOTE: Reverse the bit numbers if referencing an ARM manual!!!)
Chapter 3: Exceptions and Starlet Syscalls
Here's a list of all Exceptions with their Exception Vector Address plus a short Description....
At each Exception Vector Address is a single instruction that changes the Program Counter to automatically branch to the actual code/program that handles said exception. Before discussing Exceptions in further detail, we need to cover the various Processor Modes. Here is a list of all 7 of them...
CPSR Processor Mode bit value breakdown:
All modes except User mode are known as Privileged Modes. All Privileged Modes except System Mode are known as Exception Modes. User mode is the most restrictive. The only way for Software to exit out of User mode is through an Exception, and the contents of that Exception contain instructions to switch to a Privileged Mode once the Exception has been finished. You can 'manually' enter into any Exception Mode with some CPSR related instructions, but Starlet is usually only in such a mode due to an actual Exception occurring.
When any exception occurs, the following takes place...
1. Exception Mode enabled and execution of Starlet is now in Physical Memory
2. Banked Registers enabled
3. SPSR Register enabled (what was in the CPSR, before the exception, is now copied into the SPSR)
4. CPSR value changed depending on type of Exception taken
1. The type of Exception Mode that gets set depends on the type of Exception that occurred. Here is a list...
2. Banked Registers are certain Registers that are only visible/usable during the Exception Modes. The type and amount of Banked Registers allowed for usage depends on the type of Exception that was taken. In every Exception, SP (r13), and LR (r14) are banked. This means you are now using a different version of SP & LR. The original SP & LR are preserved and can only be used again once you leave the Exception. This is a crafty method implemented in ARM to easily save important data without the need to 'manually' save it. The FIQ Exception Mode is the only oddball. It will also include banking registers r8 thru r12.
Each mode has its own banked versions of SP and LR. Thus, there are 6 versions of SP & LR. The regular version, and the versions present for the 5 Exception Modes. Please note that User mode and System Mode do ***NOT*** bank any registers!
The banked LR will be set depending on the type of Exception that occurred (except for Reset, it will be an unpredictable value). Refer to chapter A2.6 of the ARMv6 manual for details of every banked LR value for each Exception.
All banked SP values are set by IOS sometime during initial environment setup/config. In order to change a banked SP, let's say the Banked IRQ SP register, you must be in IRQ Exception Mode.
3. The SPSR (Saved Program Status Register) gets filled with the CPSR's value that was present before the Exception was taken. Each Exception Mode has its own banked version of the SPSR. In totality, there are 5 SPSR's.
4. The CPSR will have some values changed. In every exception, T (thumb) and J (jazelle) bits are cleared. Thus, every exception executes in ARM mode! For every exception (except Reset and FIQ), the FIQ bit is left unchanged. On Reset and in FIQ mode, it is set high (FIQ disabled). For every exception, IRQ bit is set high (IRQ disabled).
A picture is worth a thousand words. Here is a handy diagram of the Banked Registers~
When an exception has ended, SPSR is copied over to the CPSR. Therefore, whatever mode (User or System) you were in beforehand, is now restored. SP and LR (plus r8 thru r12 for FIQ) original versions/values are also restored.
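The CPSR/SPSR shuffle described above can be modelled in a few lines of Python. This is only a sketch of the bookkeeping, not of real hardware. The function and table names are made up for illustration. The mode values and the exception-to-mode mapping come from the ARM Architecture Reference Manual, and the bit positions use the ARM manual's numbering (least significant bit = bit 0).

```python
# Processor mode values (CPSR bits 4..0).
MODE_USER, MODE_FIQ, MODE_IRQ, MODE_SVC = 0x10, 0x11, 0x12, 0x13
MODE_ABORT, MODE_UNDEF, MODE_SYSTEM = 0x17, 0x1B, 0x1F
MODE_MASK = 0x1F

T_BIT = 1 << 5    # Thumb state
F_BIT = 1 << 6    # FIQ disable
I_BIT = 1 << 7    # IRQ disable
J_BIT = 1 << 24   # Jazelle state

# Which Exception Mode each exception type enters.
EXCEPTION_MODE = {
    "reset": MODE_SVC,
    "undefined": MODE_UNDEF,
    "swi": MODE_SVC,
    "prefetch_abort": MODE_ABORT,
    "data_abort": MODE_ABORT,
    "irq": MODE_IRQ,
    "fiq": MODE_FIQ,
}


def take_exception(cpsr, kind):
    """Return (new_cpsr, spsr) for the given exception type.

    Note: on a real Reset the SPSR is unpredictable; this model
    just returns the old CPSR for every type.
    """
    spsr = cpsr                                   # old CPSR is preserved
    new_cpsr = (cpsr & ~MODE_MASK) | EXCEPTION_MODE[kind]
    new_cpsr &= ~(T_BIT | J_BIT)                  # always continue in ARM mode
    new_cpsr |= I_BIT                             # IRQs always disabled
    if kind in ("fiq", "reset"):
        new_cpsr |= F_BIT                         # FIQs disabled too
    return new_cpsr & 0xFFFFFFFF, spsr


def return_from_exception(spsr):
    """Leaving the exception copies the SPSR back into the CPSR."""
    return spsr
```

For example, taking an IRQ while in User mode and Thumb state (CPSR 0x30) gives a new CPSR of 0x92 (IRQ mode, ARM state, IRQs off) with 0x30 saved in the SPSR.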
For every IOS, the Undefined/Illegal instruction exception vector goes to the address of 0xFFFF1F24. This is the location of IOS's specialized Syscall Handler. There are 2 types of Syscalls for IOS.
Regular ARM syscalls:
This uses the swi instruction. Whenever Starlet executes the swi instruction, an exception will occur and Starlet will start execution at 0xFFFF0008.
IOS syscalls:
IOS handles various illegal instructions to represent special syscalls that will perform certain tasks. The 'base' syscall instruction has the compiled form of 0xE6000010. Each syscall has a number, starting at syscall 0. You insert the number into the instruction by shifting the syscall number to the left by 5 bits, then logically ORing it into 0xE6000010.
Example (syscall 0x54)
Therefore to 'call' syscall 0x54, you would use the instruction of 0xE6000A90. Keep in mind some modules/plugins cannot execute certain syscalls due to lack of privilege. Syscall 0x54 can only be called from the Kernel or the /dev/es module.
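If you'd rather not do the shift and OR by hand, here is a small Python sketch of the rule described above. The function name is made up for illustration, and no upper limit on the syscall number is checked since none is stated here.

```python
IOS_SYSCALL_BASE = 0xE6000010  # 'base' syscall instruction from the text


def encode_ios_syscall(number):
    """Shift the syscall number left by 5 bits and OR it into the base."""
    if number < 0:
        raise ValueError("syscall number cannot be negative")
    return IOS_SYSCALL_BASE | (number << 5)


print(hex(encode_ios_syscall(0x54)))
```

Running it for syscall 0x54 reproduces the 0xE6000A90 instruction shown above (0x54 shifted left by 5 is 0xA80).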
The Syscall Program Handler is located at 0xFFFF1F24. Starlet will execute this handler to make a new stack frame to backup registers, check if the instruction is a special syscall instruction, and then use a lookup table to determine which address/function to call based on the syscall number.
Link to WiiBrew article of syscalls and a decomp of the syscall handler - https://wiibrew.org/wiki/IOS/Syscalls
Chapter 3: Immediate Value Rules
While PPC has a signed or unsigned 16-bit immediate value range for instructions, which makes it super simple to write any 32-bit value in a register (lis+ori), the way the ARMv5 language implements immediate values can be a pain in the ass.
To load a value 'from scratch' into a register in ARM, you will typically use the mov instruction. Example~
When writing any immediate value in any ARM instruction, it *MUST* have a hash sign (#) prepended. Notes/comments are designated via the '@' symbol.
You can use any value that can be expressed in an 8-bit field within a 32-bit 'space'. For example, the value 0xFF000000 is a legal immediate value. 0xFF fits into 8 bits. If the value consists of 2 or more binary 1's, the first and last binary 1 must fit within an 8-bit field, and that field must sit at an even bit position (the rotation is always a multiple of 2, as the formula later in this chapter shows). This is why 0x000001FE is invalid even though its 1's span exactly 8 bits.
Therefore, something like the value of 0x000001C1 is invalid. 0x1C1 cannot be expressed in 8 bits.
Examples~
0xFFFFFFFF (-1) is an invalid value. However, there is an instruction called mvn (Move Not). It is the same as mov but will Logically-NOT the value. Therefore, to load 0xFFFFFFFF into a register, you can do this...
Therefore, for any value in the allowed range, its "logical NOT'd" value can also be loaded (via mvn).
Each 32-bit ARM instruction has 8 bits within them for an initial value and 4 bits as a rotation mechanism. If you want to learn how instructions actually process immediate values, formula is listed below.
V = n ror (2*r)
V = Immediate Value
n = 8-bit Initial Value
r = 4-bit Rotation Value
ror = Rotate Right (this is ROTATING, *not* shifting)
Example~
First, take 0xB and multiply it by 2. You will get 0x16 which is 22 in decimal.
Now take the value of 0xFF and rotate it to the right by 22 bits.
You get the result of 0x0003FC00. This value is safe to use in the mov instruction.
In case you are lazy and would rather just type a number into a script/program to check its validity, there's a person named Azeria who has made such a script.
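The formula above is easy to play with in Python. The sketch below is my own quick helper (all names are made up for illustration, and it is not the script linked in this chapter). It decodes an (n, r) pair into the final value, and goes the other way by brute-forcing all 16 rotations.

```python
MASK32 = 0xFFFFFFFF


def ror32(value, amount):
    """Rotate a 32-bit value right by 'amount' bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def decode_immediate(n, r):
    """V = n ror (2*r), with n being 8 bits and r being 4 bits."""
    return ror32(n & 0xFF, 2 * (r & 0xF))


def encode_immediate(value):
    """Return (n, r) if value is a legal immediate, otherwise None."""
    value &= MASK32
    for r in range(16):
        # Rotating LEFT by 2*r undoes the instruction's rotate right.
        n = ror32(value, 32 - 2 * r)
        if n <= 0xFF:
            return n, r
    return None


def mov_or_mvn(value):
    """Say which single instruction (if any) can load the value."""
    if encode_immediate(value) is not None:
        return "mov"
    if encode_immediate(~value & MASK32) is not None:
        return "mvn"
    return None
```

decode_immediate(0xFF, 0xB) gives 0x0003FC00, matching the worked example. encode_immediate returns None for values such as 0x1C1, 511, and 0xFFFFFFFF, while mov_or_mvn(0xFFFFFFFF) reports that mvn can handle it.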
Link - https://raw.githubusercontent.com/azeria...rotator.py
Enter in the number, in decimal form, script will tell you if you can use it or not.
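If the number is invalid (and the mvn instruction won't work either), you will need to break it apart via multiple instructions. For example, the value of 511 (0x1FF) is an invalid immediate value. Therefore you would do this to load it into, let's say, r0.
The immediate value rules apply to ***ALL*** data-processing instructions (add, sub, cmp, orr, etc.), not just mov. Load/store offsets are the exception: ldr/str/ldrb/strb take a plain 12-bit offset (0 thru 0xFFF), and the halfword/signed loads and stores take a plain 8-bit offset (0 thru 0xFF).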
Chapter 4: Multiple Input Variations for same Instruction
For Broadway, there are many different versions of instructions, such as differences for an instruction when it uses two source registers vs the type of instruction that use 1 source and 1 immediate value (i.e. add vs addi)
However for ARM, this is not the case. The add instruction can be written out in different forms. Instead of having add,addi,etc instructions, you just have add.
Example~
Chapter 5: Compares, Branches, and Conditional Instruction Execution
Compare and branch instructions work similarly to PPC. You can use label names just like how you did with PPC. There are a slew of simplified mnemonic conditional branches that can be used.
Example~
Another Example~
How Compare Instructions actually operate:
When a comparison (cmp) instruction is executed, the Source Register (or Immediate Value) is subtracted *FROM* the Destination Register (the first register written in the instruction). For example..
Will do r4 minus r12. The result of this subtraction will flip the appropriate CPSR bits high (i.e. Negative/Less Than Bit).
ARM's cmp instruction doesn't allow specification of a signed vs unsigned comparison. However, there are a wide variety of branch instructions to compensate for this...
Here are all the Conditional Branch options.
Example~
Unlike PPC, there are no branch hints you can use. There is also no dedicated Condition Register (let alone multiple CRs) for Starlet.
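To see why one cmp can serve both signed and unsigned branches, here is a Python sketch that models the subtraction and the four CPSR flags (N, Z, C, V), then evaluates the standard ARM condition codes against them. The function names are made up for illustration. The flag and condition definitions follow the ARM Architecture Reference Manual.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(first, second):
    """Model 'cmp first, second': compute first - second, return (N, Z, C, V)."""
    first &= MASK32
    second &= MASK32
    result = (first - second) & MASK32
    n = result >> 31                      # Negative: bit 31 of the result
    z = int(result == 0)                  # Zero: operands were equal
    c = int(first >= second)              # Carry: set when there is NO borrow
    v = ((first ^ second) & (first ^ result)) >> 31   # signed overflow
    return n, z, c, v


def condition_passes(cond, flags):
    """Evaluate an ARM condition code suffix against (N, Z, C, V)."""
    n, z, c, v = flags
    table = {
        "eq": z == 1, "ne": z == 0,
        "hs": c == 1, "lo": c == 0,                  # unsigned >=, <
        "hi": c == 1 and z == 0,                     # unsigned >
        "ls": c == 0 or z == 1,                      # unsigned <=
        "ge": n == v, "lt": n != v,                  # signed >=, <
        "gt": z == 0 and n == v,                     # signed >
        "le": z == 1 or n != v,                      # signed <=
        "mi": n == 1, "pl": n == 0,
        "vs": v == 1, "vc": v == 0,
    }
    return table[cond]
```

Comparing 0xFFFFFFFF against 1 shows the difference nicely: "lt" passes (signed, -1 is less than 1) and "hi" passes too (unsigned, 0xFFFFFFFF is higher than 1), all from the same set of flags.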
Almost all ARM instructions can have conditional operations applied to them too.
Example:
Another example:
ARM also comes with the cmn instruction. It's basically the opposite of cmp. Instead of a subtraction being done for the operation, an addition is performed.
Example:
This would do the operation of r10 + 0xC10 to set the appropriate CPSR bits.
Chapter 6: Logical Operations plus additional Logical Shift/Rotation Feature
ARM comes with some instructions for logical operations. They are all pretty plain jane. Therefore, there is no need to really deep dive into them.
To mimic...
nand rD, rA, rB
Use....
and rD, rA, rB
mvn rD, rD
To mimic...
nor rD, rA, rB
Use...
orr rD, rA, rB
mvn rD, rD
To mimic...
not rD, rA
Use...
mvn rD, rA
To mimic...
rlwinm rD, rA, 0, 0xFFFF7FFF #Clear Big Endian bit 16
Use...
bic rD, rA, #0x00008000
To mimic...
rlwimi rD, rA, 0, 0x00008000 #Big Endian bit 16
Use...
bic rD, rD, #0x00008000 @Erase previous bit 16 value, whatever it was, in rD
and rA, rA, #0x00008000 @Erase all previous bit values except bit 16 in rA
orr rD, rD, rA @Now OR the two registers together to insert rA's bit 16 into rD
Most instructions also allow the usage of an additional shift/rotation on their 2nd/3rd source register before the instruction continues further calculations. No, you cannot use this feature on an immediate value.
Example:
This will perform a 2-bit lefthand shift of r5's value *BEFORE* adding it to r4.
This will perform a 27-bit righthand shift of r3's value *BEFORE* r0 is compared to it.
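The two descriptions above can be checked with a quick Python model of the shifter. The helper names are made up for illustration, and shift amounts are assumed to be 0 thru 31 to keep things simple. Since the text only says "righthand shift", both the logical (lsr) and arithmetic (asr) flavors are shown.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left, zeros shifted in from the right."""
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right, zeros shifted in from the left."""
    return (value & MASK32) >> amount


def asr(value, amount):
    """Arithmetic shift right, copies of the sign bit shifted in."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32          # reinterpret as a negative number
    return (value >> amount) & MASK32


def ror(value, amount):
    """Rotate right, bits falling off the right re-enter on the left."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def add_with_shift(rn, rm, shift, amount):
    """Model 'add rD, rn, rm, <shift> #amount': rm is shifted FIRST."""
    return (rn + shift(rm, amount)) & MASK32
```

For instance, add_with_shift(0x10, 0x3, lsl, 2) gives 0x1C, because 0x3 becomes 0xC before the addition takes place.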
Here is a list of all operations you can use:
Chapter 7: Basic Loads & Stores
Just like with PPC, in order to modify contents in memory, it must be loaded into a register, the register modified, then stored back to memory. Instead of writing the source register in parentheses, it must be enclosed within brackets.
Example:
ldr is the default load instruction, it loads a word.
ldrh = Load Halfword
ldrsh = Load Halfword Signed (similar to lha instruction in PPC)
ldrb = Load Byte
ldrsb = Load Byte Signed
str is the default store instruction, it stores a word.
strh = Store Halfword
strb = Store Byte
There are no "signed" store instructions (strsh and strsb do not exist). A store simply writes the lower 16 or 8 bits of the register as-is, so sign extension only matters for loads.
Example of load instruction using a source register and immediate value
As you can see in the above example, the immediate value goes at the tail end of the instruction. Here's an example of a store halfword instruction that uses two source registers~
This would be similar to the sthx PPC instruction.
Chapter 8: Program Counter Details and Literal Pools
Unlike PowerPC, in ARMv5, the Program Counter (PC) is a General Purpose Register. You can freely read from and write to it. Therefore, you can also use it as a loading/storing reference (PC + offset value).
In ARM Mode, the PC is the current instruction + 8.
In Thumb Mode, the PC is the current instruction + 4.
This is because the PC is always 2 instructions ahead of the current instruction that is going to execute.
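When working out PC-relative offsets by hand, it helps to let a script do the addition. Here is a small Python sketch of the rule above. The function names are made up for illustration, and the Thumb case only models the raw PC value described in this chapter.

```python
MASK32 = 0xFFFFFFFF


def visible_pc(instruction_address, thumb=False):
    """Value an instruction sees when it reads PC (2 instructions ahead)."""
    ahead = 4 if thumb else 8
    return (instruction_address + ahead) & MASK32


def pc_relative_address(instruction_address, offset, thumb=False):
    """Address referenced by a [pc, #offset] style operand."""
    return (visible_pc(instruction_address, thumb) + offset) & MASK32
```

For example, an ARM mode instruction sitting at 0x00001500 sees PC as 0x00001508, and it has to use an offset of -8 to reference its own address.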
Example (ARM mode assumed):
In the above example, when the ldr instruction executes, r0 will be loaded with "0xE320F000", which is how most modern assemblers encode a nop (an assembler set strictly to ARMv5 emits 0xE1A00000, i.e. mov r0, r0, instead).
Another Example (ARM mode assumed):
In the above example, the compiled form of the bic instruction will be loaded into r0 by the ldr instruction. To load exactly where you are currently at, in ARM mode, you would need to load at PC minus 0x8.
With the PC being a general purpose register, we can use it as an alternative method to PPC style BL Tricks.
Example (ARM mode assumed; pretend first instruction resides at memory address 0x00001500):
The load instruction is loading the word value located at PC+8, which is 0x00001514. At address 0x00001514 is the data from the asciz directive. Manually calculating address values in this manner is a pain. Therefore we can use what are called literal pools.
Example:
HOWEVER, this is nothing more than a compiler trick! Whatever compiler (or instruction simulator) you are using, it will auto add the "long 0xXXXXXXXX" somewhere in your source (usually at the end). The literal pool trick will not work for something such as manually writing over instructions in IOS. You may need to use a BL-Trick Lookup Table instead.
Chapter 9: Pre & Post Indexing of Loads/Stores
The ARM language doesn't have 'update' versions of load/store instructions like PPC does. However, the use of pre/post indexing can mimic this.
To mimic a PPC stbu instruction, you can do this...
The byte of r4 is stored to r0+0x20, *THEN* r0 is incremented by 0x20.
You also can do what is called post indexing. Example~
This will load the word located at r6 into r1. It is ***NOT*** loaded at r6+4. Once the word has been loaded, *THEN* r6 is incremented by 4. As you can see this is different than pre indexing.
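Here is a tiny Python model of the two behaviors described in this chapter, a pre-indexed byte store with update and a post-indexed word load. It's only a sketch: registers are a plain dictionary, memory is a bytearray, words are read Big Endian to match Starlet, and the function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def strb_pre_indexed(regs, memory, rt, rn, offset):
    """Store the low byte of rt at rn+offset, then update rn to rn+offset."""
    address = (regs[rn] + offset) & MASK32
    memory[address] = regs[rt] & 0xFF
    regs[rn] = address


def ldr_post_indexed(regs, memory, rt, rn, offset):
    """Load the word at rn (NOT rn+offset), then add offset to rn."""
    address = regs[rn]
    regs[rt] = int.from_bytes(memory[address:address + 4], "big")
    regs[rn] = (address + offset) & MASK32


# Small demonstration with made-up register contents.
memory = bytearray(0x40)
memory[0:4] = bytes.fromhex("DEADBEEF")
regs = {"r0": 0, "r1": 0, "r4": 0x12345678, "r6": 0}

strb_pre_indexed(regs, memory, "r4", "r0", 0x20)   # byte 0x78 lands at 0x20
ldr_post_indexed(regs, memory, "r1", "r6", 4)      # word comes from address 0
```

After running the demonstration, r0 holds 0x20 and memory[0x20] holds 0x78, while r1 holds 0xDEADBEEF (the word at the original r6) and r6 has moved on to 4.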
Chapter 10: Multi Loading/Storing
ARM can't exactly replicate how PPC does multi loads/stores but it does come with some unique instructions that PPC cannot do. First things first, ARMv5 is only capable of doing multi loads/stores for word values. You will use ldm for basic multi loading.
Example:
This instruction will do the following...
Obviously this differs quite a bit from a PPC lmw instruction. You will also notice the source registers are enclosed in curly brackets instead of regular squared brackets. This is required for any multi load/store instruction.
Instead of writing out every register in the source register list, you can do this...
You can also force an update to r1 after the multi load, like this....
Take NOTE of the Exclamation Point placed immediately after r1. After the words are loaded into r3 thru r6, r1 is then *incremented* by 16 (0x10). Incrementation is 4 bytes per every source register present in the instruction.
For basic multi storing, you use the stm instruction.
There are a variety of extra options for multi loading/storing. Here are 8 more multi load/store instructions....
ldmia is actually the full name of ldm. "Increase afterwards" is the default behavior, so ldm and ldmia compile to the same instruction, with or without a "!" appended to the base register.
stmia is actually the full name of stm. The same rule applies, stm and stmia compile to the same instruction, with or without a "!" appended to the base register.
Examples:
The term "increase" means the loading/storing address is increased during the instruction. "Afterwards" means the incrementation of the loading/storing address (by 4) starts *AFTER* the first load/store.
"Decrease afterwards" is the same as above but addresses are decreasing instead of increasing.
The term "before" means to increase/decrease the loading/storing address by 4 *BEFORE* the first load/store.
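The four increase/decrease, before/after combinations are easier to compare side by side. The Python sketch below (function name made up for illustration) returns the list of addresses touched plus the value the base register would receive if a "!" update is requested. One thing to keep in mind: regardless of mode, the lowest numbered register in the list always goes with the lowest address.

```python
MASK32 = 0xFFFFFFFF


def multi_transfer_addresses(mode, base, count):
    """Return (addresses, updated_base) for ia/ib/da/db with 'count' registers.

    addresses[0] belongs to the lowest numbered register in the list.
    updated_base is only written back when '!' is used.
    """
    size = 4 * count
    if mode == "ia":        # increase after
        start, updated = base, base + size
    elif mode == "ib":      # increase before
        start, updated = base + 4, base + size
    elif mode == "da":      # decrease after
        start, updated = base - size + 4, base - size
    elif mode == "db":      # decrease before
        start, updated = base - size, base - size
    else:
        raise ValueError("mode must be ia, ib, da, or db")
    addresses = [(start + 4 * i) & MASK32 for i in range(count)]
    return addresses, updated & MASK32
```

For example, an "ia" transfer of 4 registers from a base of 0x1000 touches 0x1000 thru 0x100C and updates the base to 0x1010 (an increase of 16). The 0x1000 base is just a made-up value for illustration.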
Examples:
You can also have the destination register be updated in these increase/decrease before/after type multi load/stores.
Example:
Another Example:
There are also alternative mnemonics available. Here's a list of them.
Chapter 11: Loops
There is no Count (CTR) register for Starlet. You must use a general purpose register as a loop tracker. Since load/store instructions come with a Post-Indexing feature, you do not need to decrement the load and/or store start addresses beforehand.
Example of Basic Loop:
The subs instruction will execute a basic sub (subtract) instruction but also update the CPSR flag bits. This is essentially the same mechanism as using the Record (.) shortcut in PPC instructions. Once r2 = 0, the bne branch will not be taken.
You can append many instructions with 's' to force the instruction to update the CPSR flag bits.
Chapter 12: Stack, Prologues, Epilogues
Pushing and popping the stack is a bit different than doing it on PPC. Plus, the layout/structure of the Stack is also different. There are dedicated push and pop mnemonics.
push = stmfd sp! = stmdb sp!
pop = ldmfd sp! = ldmia sp!
Here's an example push instruction that backs up just the LR to a new stack frame (storing of PC is explained later)
It's better to look at this instruction in its stmdb alternative mnemonic form to understand what's going on underneath the hood.
For starters, let's pretend sp's value before the instruction is 0x0000C200. When the instruction executes, it will first temporarily decrease sp's value by 4. PC is stored at 0x0000C1FC. SP's value is temporarily decreased by 4 again. LR is stored at 0x0000C1F8. SP's value is then actually decreased by 8, due to the 2 source registers in the stmdb instruction. After the instruction has executed, SP is now 0x0000C1F8.
PC is stored because Stack Frames in ARM must always have a size divisible by 8. Now to pop this new frame, we do the following...
Let's look at this pop instruction in its ldmia alternative mnemonic form...
In this instruction, the word at sp+0 is loaded into lr. Then PC is loaded from sp+4. Afterwards, SP's value is increased by 8. We are now 'back' to where we 'left off'.
Or are we? In fact, this is WRONG!!! We are loading PC's old value. The Program Counter keeps track of what's the next instruction that will execute.
By executing that above pop instruction, we would actually "branch" back to whatever instruction is present after the initial push instruction. Needless to say, your program/code will crash/fault because of this.
How do we fix this??? You don't have to fix the push instruction, just fix the pop instruction....
Now that you understand basic pushing and popping, let's go over how prologues and epilogues are implemented in IOS. There are a plethora of methods (styles) to write prologues and epilogues in ARM. Regardless of the 'style' used, r4 thru r10 are the non-volatile registers aka the global variables. They are the equivalent of r14 thru r31 for Broadway. What's different is that the lower registers are used first instead of the higher ones. For example, in PPC if you make a stack frame and need to use 2 non-volatile registers, you throw r30 and r31 on the stack. In ARMv5 it's backwards, if you need two non-volatile registers, you throw r4 and r5 on the stack, **NOT** r9 and r10.
The majority of functions within IOS use the "Full Descending" type of stack pushing/popping. Which is perfect, because we can use the push and pop mnemonics.
In ARMv5 compliant prologues, r11 is used for what is called the Frame Pointer aka fp. It must point to the bottom of the current Stack Frame.
r12 is known as the Intra-Procedure-call scratch register aka ip. It is used as a scratch register during prologues for the purpose of making a copy of sp immediately before a new frame (push/stmfd instruction) is created.
Example Prologue that's ARMv5 compliant (1 register being saved)
FP, IP, and LR (in that order) must always be pushed onto a new frame. At this point the Stack structure is as such...
SP+Offset | Item
SP | r4 (Top of new Frame)
SP+4 | old fp (will have address that points to bottom of old frame)
SP+8 | old sp aka old ip (will have address that points to top of old frame)
SP+0xC | function return lr
SP+0x10 | Top of Old Frame (where old SP is pointing to)
.. .. | Unknown Size
SP+?? | Bottom of Old Frame; old function return lr (where old FP is pointing to)
The current value in new fp (r11) at this state in time would be pointing at the address of SP + 0xC. Onto the epilogue...
And here's the corresponding epilogue~
If you don't need to ensure an instruction set change, then the epilogue can be changed to this...
pop {r4, fp, ip, pc}
For me personally, I think the whole idea of fp is redundant. As long as each frame contains its "old sp", then back-chaining is guaranteed. PowerPC does this right. Here's how I would personally write prologues/epilogues of custom functions.
Example prologue saving 1 register~
Epilogue~
When modifying IOS, be sure to follow the ARMv5 compliant style.
If you are in the situation where you need extra space allocated in the new frame for something such as an output buffer for a child function (i.e. sprintf), here's an example prologue (2 registers + 0x30 buffer space)...
When ready to setup your output buffer, do this...
Finally, here's the respective epilogue..
Chapter 13: Exchanging between ARM and Thumb; Thumb Instruction Set
You have to use the bx, blx, or 'bx lr' instructions to switch between ARM mode and Thumb mode. Do not try to edit the Thumb mode bit in the CPSR. You will cause an exception.
Anyway, the Least Significant Bit (bit 0 in ARM manuals) in the target address of a bx/blx/'bx lr' determines which mode to run. If the LSB is high, Thumb mode will be activated. When it's low, ARM Mode will be activated. It's really that simple.
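The LSB rule is simple enough to express in a couple lines of Python. This is just a sketch (function name made up for illustration) showing how a bx/blx target splits into the mode that gets selected and the address execution actually continues at.

```python
MASK32 = 0xFFFFFFFF


def bx_target(target):
    """Split a bx/blx target into (mode, actual address).

    The LSB only selects the mode; it is not part of the address.
    """
    mode = "thumb" if target & 1 else "arm"
    address = target & ~1 & MASK32
    return mode, address
```

So a target of 0x13400001 means "run Thumb code at 0x13400000", while a target of 0xFFFF1F24 means "run ARM code at 0xFFFF1F24". For ARM targets, the address should also be word aligned.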
Since Thumb instructions are limited to 16 bits, there is a reduced instruction set. Read up on Chapter A6 of the ARM reference manual. Keep in mind that on chart A6.2.1, any instructions notated with "[2]" are ARMv6 only and do not work on Starlet.
Regarding immediate values, Thumb instructions only allow the basic unsigned range of 0x00 thru 0xFF (0 thru 255).
Chapter 14: Interrupts
To disable interrupts, do this...
mrs rX, cpsr
and rY, rX, #0xC0 @Keep rY somewhere safe
orr rX, rX, #0xC0
msr cpsr_c, rX
To restore them, do this...
mrs rX, cpsr
bic rX, rX, #0xC0
orr rX, rX, rY @rY can be scrapped now
msr cpsr_c, rX
NOTE: To forcefully enable interrupts (instead of restoration), simply remove the ORR instruction from the above restoration example.
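If the bit juggling above isn't obvious, here is the same logic as a Python sketch operating on plain CPSR values. 0xC0 covers both the IRQ disable bit (0x80) and the FIQ disable bit (0x40). The function names are made up for illustration.

```python
IRQ_FIQ_MASK = 0xC0   # I bit (0x80) + F bit (0x40)


def disable_interrupts(cpsr):
    """Mirror of the mrs/and/orr/msr sequence. Returns (new_cpsr, saved)."""
    saved = cpsr & IRQ_FIQ_MASK        # and rY, rX, #0xC0
    return cpsr | IRQ_FIQ_MASK, saved  # orr rX, rX, #0xC0


def restore_interrupts(cpsr, saved):
    """Mirror of the mrs/bic/orr/msr sequence."""
    return (cpsr & ~IRQ_FIQ_MASK) | saved


def enable_interrupts(cpsr):
    """Same as restoring, but without the orr (both bits forced low)."""
    return cpsr & ~IRQ_FIQ_MASK
```

The point of keeping the saved bits is that a restore puts things back exactly how they were. If IRQs were already disabled beforehand (CPSR of 0x9F for example), they stay disabled after the restore.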
Chapter 15: MEM1 Store/Load issues with Starlet
Read this...
https://twitter.com/marcan42/status/1362...47?lang=en
Marcan, the co-creator of HBC and BootMii, explains it perfectly. Thus, if you are doing any sort of memcpy, memset, memclear, etc., you must ensure every piece of data stored/loaded to/from MEM1 is done as a word value. Not only that, the word value(s) must always be stored/loaded via an address that is divisible by 4.
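One way to satisfy the word-only rule when you need to change a single byte is a read-modify-write of the aligned word that contains it. The Python sketch below shows the idea on a bytearray standing in for MEM1. The function names are made up for illustration, and the Big Endian byte order matches how Starlet is described earlier in this tutorial.

```python
def load_byte_via_word(memory, address):
    """Read one byte using only an aligned 32-bit load."""
    aligned = address & ~3
    word = int.from_bytes(memory[aligned:aligned + 4], "big")
    shift = (3 - (address & 3)) * 8     # byte 0 is the most significant
    return (word >> shift) & 0xFF


def store_byte_via_word(memory, address, value):
    """Write one byte using only an aligned 32-bit load and store."""
    aligned = address & ~3
    word = int.from_bytes(memory[aligned:aligned + 4], "big")
    shift = (3 - (address & 3)) * 8
    word = (word & ~(0xFF << shift) & 0xFFFFFFFF) | ((value & 0xFF) << shift)
    memory[aligned:aligned + 4] = word.to_bytes(4, "big")


# Made-up contents for demonstration.
mem1 = bytearray.fromhex("11223344")
store_byte_via_word(mem1, 1, 0xAA)      # only the second byte changes
```

After the demonstration store, the word reads back as 0x11AA3344, and the only accesses made were 4 bytes wide at an address divisible by 4.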
Chapter 16: Cache, Address Translation, Self Modifying Code
Starlet comes with a single Cache Unit. It is split into a 16KB Instruction Cache and 16KB Data Cache. Also just like Broadway, Starlet uses Cache Blocks of 32-byte size. However, the Cache Blocks are referred to as "Modified Virtual Address", "Lines", or "Single-Entries".
The term "Modified Virtual Address" can be confusing. Basically there is a hardware register (FCSE PID) in Starlet that does another address translation on top of your typical Virtual Address to Physical translation.
Don't fret though, the configuration of the FCSE PID is as such to where the Modified Virtual Address and Virtual Address are always equivalent.
Regarding typical address translation, the kernel and all modules (except ES, FS, STM, and DI) use identity translation (Virtual Address exactly maps to Physical).
The other modules use a translation scheme in which the Virtual Address is the Physical Address but with the Most Significant Bit set low. For example, physical address 0xFFFE0500 is represented virtually via 0x7FFE0500.
Going back to the Cache...Starlet doesn't use the MEI protocol, but its own protocol that is almost identical, just with a different naming system.
Clean (similar to Exclusive in Broadway, what's in Cached Memory is also in Physical/Real Memory)
Dirty (similar to Modified in Broadway, what's in Cached Memory hasn't been updated yet to Physical/Real Memory)
Invalid (just like Invalid in Broadway, Block will be cast out soon, can be tossed)
Starlet's cache 'algorithm' can be changed between two different settings. First setting is Pseudo Random, second setting is Round Robin. IOS uses the Pseudo Random setting.
Here's a list of Handy Cache Operations~
Invalidate both the entire ICache and DCache~
mcr p15, 0, rX, c7, c7, 0 @rX must be zero
Invalidate entire ICache~
mcr p15, 0, rX, c7, c5, 0 @rX must be zero
Invalidate ICache Line~
mcr p15, 0, rX, c7, c5, 1 @rX must be 32-byte aligned address!!!
Invalidate entire DCache~
mcr p15, 0, rX, c7, c6, 0 @rX must be zero
Invalidate DCache Line~
mcr p15, 0, rX, c7, c6, 1 @rX must be 32-byte aligned address!!!
Test and Clean Entire DCache~
loop:
mrc p15, 0, pc, c7, c10, 3 @Must be mrc (not mcr); using pc as the destination updates the CPSR flags for the bne
bne loop
Clean DCache Line~
mcr p15, 0, rX, c7, c10, 1 @rX must be 32-byte aligned address!!!
Test, Clean, and Invalidate entire DCache (aka Test then Flush)~
loop:
mrc p15, 0, pc, c7, c14, 3 @Must be mrc (not mcr); using pc as the destination updates the CPSR flags for the bne
bne loop
Clean and Invalidate DCache Line (aka Flush)~
mcr p15, 0, rX, c7, c14, 1 @rX must be 32-byte aligned address!!!
Prefetch ICache Line~
mcr p15, 0, rX, c7, c13, 1 @rX must be 32-byte aligned address!!!
The prefetch ICache operation is simply a cache hint for the Instruction fetcher. Regarding DCache hints, there is only the PLD (pre-load data). This provides the DCache with a hint for an upcoming Load instruction. It executes exactly like the PPC dcbt instruction. There is no version of PLD for Store Hints.
Unlike Broadway, there is no need for any special pre-flushing routine in regards to flushing the entire DCache. Simply execute the instructions required and you're good to go.
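Since every "Line" operation above wants a 32-byte aligned address, cleaning or invalidating a buffer means walking it one line at a time. Here is a Python sketch that lists which line addresses cover a given buffer; the function name and the example address are just for illustration.

```python
CACHE_LINE = 32  # bytes per cache line on Starlet

def cache_line_addresses(start, length):
    """Return every 32-byte aligned line address touching [start, start + length)."""
    if length <= 0:
        return []
    first = start & ~(CACHE_LINE - 1)                 # align the first byte down
    last = (start + length - 1) & ~(CACHE_LINE - 1)   # align the final byte down
    return list(range(first, last + CACHE_LINE, CACHE_LINE))

# A 0x30-byte buffer that starts mid-line touches two lines
print([hex(a) for a in cache_line_addresses(0x13850010, 0x30)])
```

In ARM, the equivalent loop would issue one Line operation per address in that list, stepping rX by 32 each time.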
---
Self Modifying Code is pretty simple. The following snippet shows how to overwrite a single instruction. Adjust source accordingly for multiple instruction rewrites.
Chapter 17: Details of the Main/Idle Loop, and Writing in Custom ARM Code
As mentioned earlier in Chapter 1, Starlet is running in a loop waiting for tasks to do. This loop is called thread 0 or also called the Idle Thread.
Locating this loop is simple. Since Starlet is in this loop 99.99% of the time when your game is running, we can easily find the "Base Thread Pointer". We use a simple equation to calculate this Pointer.
Thread Pointer = 0xFFFE0000 + (0xB0 * threadnumber)
The loop is known as Thread 0, so the equation is this...
0xFFFE0000 + (0xB0 * 0)
0xFFFE0000 + 0 = 0xFFFE0000
Now we have the Thread Pointer. Using that, we can find other important information:
Thread Pointer + 0 = CPSR
Thread Pointer + 0x3C = SP
Thread Pointer + 0x40 = PC (what we need)
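The equation and the three offsets above can be wrapped into a couple of Python helpers if you want to look up other threads. The names are made up; only the base, the 0xB0 size, and the three offsets come from this chapter.

```python
THREAD_BASE = 0xFFFE0000   # Thread Pointer of thread 0
THREAD_SIZE = 0xB0         # bytes per thread entry
FIELD_OFFSETS = {"cpsr": 0x00, "sp": 0x3C, "pc": 0x40}

def thread_pointer(thread_number):
    """Thread Pointer = 0xFFFE0000 + (0xB0 * threadnumber)."""
    return THREAD_BASE + THREAD_SIZE * thread_number

def thread_field(thread_number, field):
    """Address of a saved register inside the given thread's entry."""
    return thread_pointer(thread_number) + FIELD_OFFSETS[field]

print(hex(thread_field(0, "pc")))   # saved PC of the Idle Thread
```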
We can use something such as my Memory Editor code (HERE) to view 0xFFFE0040 on MKWii to get the current PC addresses of Starlet using Broadway. Your game must be patched using this method (HERE) for you to view SRAM!!! Obviously, convert 0xFFFE0040 to its real SRAM address. Then finally convert it to the usable Broadway virtual address.
0xCD4E0040 #View this address on the Memory Editor Code
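For anyone who wants to see the two conversion steps spelled out, here is a Python sketch. It uses the 0xF2B00000 mirror offset from Chapter 2 and assumes the usual Broadway uncached window at 0xC0000000; the function names are hypothetical.

```python
MIRROR_DELTA = 0xF2B00000    # mirror address = physical address + this
UNCACHED_BASE = 0xC0000000   # Broadway's uncached virtual window

def mirror_to_physical(mirror):
    """Undo the SRAM mirror to get the real SRAM address."""
    return (mirror - MIRROR_DELTA) & 0xFFFFFFFF

def physical_to_broadway_uncached(phys):
    """Broadway virtual (uncached) address for a physical address."""
    return phys | UNCACHED_BASE

phys = mirror_to_physical(0xFFFE0040)
print(hex(phys), hex(physical_to_broadway_uncached(phys)))
```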
You will see the PC is constantly flickering between 3 different addresses. On cIOS249 (using IOS56 as base), these are the following addresses you will see...
The 3 addresses you see on your screen may differ if your MKWii game is running on a different IOS. Anyway, this is the loop. It is executing the following ARM instructions..
IMPORTANT NOTE: Starlet/IOS is in Thumb mode when executing this loop, hence the 0x2 increments to each address. It is also in System Mode (highest privileged mode possible)
Now onto the methods of writing in custom ARM Code for IOS to execute...
Method 1:
You could add in custom patches of code to an IOS and then run a game/channel/etc that uses said IOS. This can be cumbersome at times, as you would constantly need to repatch and then reinstall the IOS for testing new ARM Code.
Method 2:
Another method is to use Palapeli's /dev/sha exploit. It allows you to run a snippet of ARM code without having to interact with IOS directly. Everything is initiated via Broadway. Link - https://github.com/TheLordScruffy/saoirs...ot.cpp#L67
Here is an Assembly Source of Palapeli's exploit that can be used as a C2 Gecko Code for MKWii.
It is currently hooked to the Shared Item Address and will return certain items based on the condition of certain IOS calls (in regards to the exploit working or not). Adjust the addresses to the IOS calls accordingly, they are currently configured for PAL MKWii.
What's also included is the Exploit contents itself that is packaged in an 06 String Write. This 06 Code must be present. The Exploit Contents are currently configured to run ARM 32-bit code that is present at 0x80000A00. Therefore, you will NEED to make another 06 String Write Gecko Code that contains ARM instructions at 0x80000A00. There is a provided example 06 String Write (at 0x80000A00) that does a demo of the exploit by shutting down the Console via Starlet/IOS.
The right-side word of the 2nd to the last line of the 06 String Write Exploit is the physical address (entry point) that you can alter if needed. It must be a physical address. In a nutshell the following Memory Addresses are used...
0x80000000 thru 0x8000001B #Location of Exploit ARM Contents, don't change this (other than the entry point if desired)
0x80000A00+ #Depends on the length of your custom ARM code; change this entry point if needed
0x80001500 thru 0x80001517 #Used as a temp space for IOS_Ioctlv usage, you should be familiar enough with PPC to change this if this happens to conflict with your other unrelated Codes
Important Note about the Exploit:
The exploit can only be run once. This is because Thread 0 is switched to ARM Mode after the exploit has been run. Therefore to run it beyond just once, you will need to rig something up to rewrite the Exploit's ARM contents to be in ARM mode next time the Exploit (code as a whole) is executed by Broadway again.
Chapter 18: Assembling ARMv5 with Devkit
You need to have DevkitPPC already installed and the environments set. Here is a Linux Debian based guide for that - https://mariokartwii.com/showthread.php?tid=1200
You can find various Windows and Ubuntu guides via Google. After you get that done, you need to install devkitARM. Here's the linux command for that...
sudo dkp-pacman -S devkitARM
Now that you have devkitARM installed, here is a guide to assemble ARMv5 assembly into a raw binary file.
Example of ARM assembly contents/instructions in a file called source.s:
1. Navigate to DevkitARM binutils
cd /where/you/installed/devkit/devkitpro/devkitARM/bin
2. Assemble the ARM instructions to object code. Your source file must have the ".s" extension.
./arm-none-eabi-as -march=armv5te -mcpu=arm926ej-s -mbig-endian /where/file/is/source.s -o /where/file/is/source.o
NOTE: To force Thumb mode only, add in "-mthumb"
3. Convert (strip) object code to raw binary file
./arm-none-eabi-objcopy -O binary /where/file/is/source.o /where/file/is/source.bin
Congrats, feel free to view the binary file on a Hex Editor to see the assembled instructions. If your source file has both ARM and Thumb instructions present, you will need to designate them via Assembler directives. Place ".arm" before your instructions in your source file to force the Assembler to assemble the instructions in ARM mode. Place ".thumb" before your Thumb instructions. As an alternative to providing "-mthumb" for the Step #2 command, you can instead slap a ".thumb" at the top of your source file.
Example showing source file using nop in both ARM and Thumb form:
The above will assemble the nop into its 32-bit form, then the second nop into its 16-bit form.
Another option at your discretion is to insert the architecture name and cpu name in your source file instead of having to type it out in Step #2.
Example:
For the above example, your Step #2 command would be...
./arm-none-eabi-as -mbig-endian /where/file/is/source.s -o /where/file/is/source.o
---
How to disassemble (output will appear in your terminal screen)~
./arm-none-eabi-objdump -b binary -m armv5te -D -EB /where/file/is/source.bin
NOTE: The above will disassemble instructions to their ARM 32-bit form. Therefore, Thumb instructions, if present, will not be disassembled correctly.
NOTE: To force thumb only disassembly of instructions, add "-M force-thumb". If present, ARM 32-bit instructions will not be disassembled correctly.
Chapter 19: Instruction Simulator, Links
In the ARM Reference manual, there are many instructions present that obviously won't work for Starlet. In the Instruction Set chapters/sections, if an Instruction has the note "Version 6 and above" it obviously can't be used on Starlet.
There aren't any division instructions available. You will need to use some trickery to mimic division instructions. Luckily, there are multiply based instructions.
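Two common flavors of that trickery are sketched below in Python so the math can be checked on a PC first. The first is a classic shift-and-subtract divide that works for any divisor. The second divides by 10 with one long multiply (what umull gives you) and a shift; the constant 0xCCCCCCCD is the well known reciprocal for 10, not something taken from IOS. Function names are made up.

```python
def udiv32(n, d):
    """Unsigned 32-bit divide via shift-and-subtract; returns (quotient, remainder)."""
    if d == 0:
        raise ZeroDivisionError("divisor is zero")
    q = 0
    r = 0
    for bit in range(31, -1, -1):
        r = (r << 1) | ((n >> bit) & 1)   # bring down the next bit of n
        if r >= d:                        # does the divisor fit?
            r -= d
            q |= 1 << bit
    return q, r

def div10(n):
    """Unsigned 32-bit n // 10 via a 32x32 -> 64 bit multiply and shifts."""
    product = n * 0xCCCCCCCD    # umull lo, hi, n, magic
    hi = product >> 32          # keep only the upper word
    return hi >> 3              # lsr #3

print(udiv32(100, 7), div10(0xFFFFFFFF))
```

The loop version costs up to 32 iterations, so the multiply version is the one to reach for whenever the divisor is a known constant.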
There isn't a good Starlet Emulator for use unfortunately. However, there are some ARM instruction simulators out there. Google is your friend on this one. There are some web browser based ones if you don't want to install anything on your computer.
Here is a decent web browser ARM instruction simulator - https://cpulator.01xz.net/?sys=arm
It's designed for ARMv7-A. The downsides to this simulator are that it can only run in Little Endian using Physical Addressing only, and switching between ARM & Thumb is not supported.
Now that you have completed this tutorial, you should...
1. Read more in depth of the Manuals provided in Chapter 1
2. Try out some snippets of code in a Simulator
3. Read these WiiBrew articles for Starlet and IOS~
You can then start tinkering around in IOS and try out your modifications on a real Wii.
And lastly, this is a handy Co Processor reference. This was a 'snapshot' of various co processor registers using Palapeli's exploit on cIOS249[56]. Thus all info was 'snapshotted' using the IOS Kernel. Read up on the Starlet Manual to understand the details of this reference.
FCSE PID = Null (identical translation for VA to MVA)
TTBR = 0x13850000
Domain Access = Set to Client on all 16 Fields; therefore all Access Permissions for every mapped Memory Region is based on the AP bits of the related Page Table Entry
C1 Details:
Cache Register Details:
TCM Memories = Both DTCM and ITCM low (not present)
Data TCM Region = Disabled, size and address set to Null
Instruction TCM Region = Disabled, size and address set to Null
Chapter 20: Test Code; Conclusion
Using the simulator I linked in the previous chapter, here is a mock-up code you can step thru with. The code attempts to mimic a source that opens, reads, and closes a file via IOS calls.
Please note that a BL Trick was used instead of a Literal Pool so the Simulator wouldn't add any extra content to the source.
Happy Coding!
First thing's first. Huge Thank You to Palapeli!
Author's Note: This thread is located in this subforum because the PPC Assembly Tutorial subforum is for PPC only. I don't see a need to make an ARM Assembly Tutorial subforum.
This is an ARM tutorial for the Starlet Core of the Wii. It is designed for those who are already experts in coding/programming with Broadway PowerPC on the Wii. This will cover the basics of the ARMv5 Assembly language, and some Starlet-specific attributes. You will also be provided with some programs/tools to assist you in testing ARM code on Starlet/IOS. For more info, read the following manuals---
- ARM Reference Manual (has everything you need to know about ARMv5. Also has ARMv6 info)
- Starlet Manual
- Similar ARM9 Core Manual (for anything not found in the Starlet Manual, may be in here)
NOTE: ONCE AGAIN, this is for those who are already experts in PPC (Broadway) Assembly! This is *not* a "1-0-1" tutorial, this is a "2-0-1" tutorial.
Chapter 1: Intro, Understanding the Starlet and Broadway Boot Sequences
While Broadway is the CPU that runs the Wii's Games, Channels, HBC apps, etc, there is another CPU in the background that actually is the 'Master' CPU. It is an Arm9 (arm926ej-s) core nicknamed "Starlet". Its official name is "IOP" which stands for Input-Output Processor. Starlet uses the Armv5 (specifically ARMv5tej) ARM Assembly language.
When you power on the Wii, Starlet boots from an internal MASK ROM (boot0). The following boot sequences are performed~
boot0: Will decrypt and verify (via checksum) the first 48 blocks of the NAND (these blocks are boot1's Code). Generated checksum is verified against what is stored in the OTP (One Time Programmable Register). If checksums don't match, boot0 will halt. Contents of the first 48 blocks of NAND are guaranteed by the manufacturer. Older boot1's had a strncmp bug in them which allowed the installation of Bootmii (as boot2).
boot1: Contains ARM Code that will initialize the GDDR3 memory, set up some Hollywood/Hardware Registers, and check the boot2 version number that is in the SEEPROM. If that number in SEEPROM is higher than what is in the TMD of boot2, then boot1 will halt. Otherwise, boot1 will load up boot2 and then proceed to execute the ARM code in boot2. There are two copies of boot2 in the NAND just in case one copy gets corrupted.
boot2: A stripped down IOS with a small amount of tasks. boot2 will check the Wii Menu's TMD (Title Metadata) to know which IOS to load for Starlet to run. Boot2 will then load up the System Menu's IOS into memory, and then IOS will be 'booted' and take 'control'.
IOS: Also known as boot3. Loads the Wii Menu into memory. Every menu/channel/etc has a "BS1" code that gets placed into memory at 0x00003400. This "BS1" code is a small snippet of PowerPC code, more on that later.
IOS's are single or multiple ELFs of ARM code that Starlet places into memory and runs the code as an "Operating System". Every IOS has a kernel which is placed into SRAM. SRAM is the memory dedicated for Starlet (under normal operations, Broadway cannot access this memory). Most other IOS's will have additional ELFs that are placed into MEM2 as 'modules/plugins'. Boot2 is an IOS with only a bare bones kernel.
There are different types of IOS for different purposes. Example: IOS80 runs Wii Menu 4.3. This means when the Wii Menu (Broadway/PPC) is running, IOS80 (Starlet) is also running in the background.
Once the Wii Menu is in memory, IOS will write some very elementary PPC code at the EXI Boot Base (a group of Hollywood Hardware Registers). Whenever something is written to the EXI Boot Base, it is copied over (automatically by hardware) to physical address 0xFFF00100 (Broadway's Reset Vector Address).
This is the following code (or something very similar, depending on the specific IOS) that is written to the EXI Boot Base~
Code:
lis r3, entry@h #entry is some physical mem1 address
ori r3, r3, entry@l
mtsrr0 r3
lis r4, msr@h #msr is the value to give the Machine State Register after the rfi has executed
ori r4, r4, msr@l
mtsrr1 r4
rfi
'entry' is usually the address of 0x00003400. This is the address to the "BS1" boot code that was placed into memory earlier. 'msr' is the Machine State Register value and it is usually zero.
After writing to the EXI Boot Base, IOS will power on the Broadway (via poking some Hollywood Registers). The Broadway chip boots up and is now running.
When Broadway is powered on from a Hard Reset, its MSR value is 0x00000040. The only bit high is the IP bit (exception prefix), which means exceptions use the addressing mode 0xFFFx_xxxx. This is why Broadway boots at 0xFFF00100 instead of at 0x00000100.
Broadway executes the code starting at 0xFFF00100 and will jump to 0x00003400 with an MSR value of zero. Broadway is now at the "BS1" code. The BS1 contains code that will do some basic tasks such as set up some Broadway specific registers, adjust the MSR yet again, configure the Cache, and configure the BAT registers.
Once that has been completed, execution of Broadway will jump to 0x81330000 (now running in Virtual Cached Mode) and the Wii Menu is now officially running.
It's important to note that once Broadway has been powered on, IOS will recede into a very small loop waiting for tasks/interrupts that are called on by Hardware or Broadway (such as an IPC request for modifying a file on the NAND). When Broadway is running, Starlet (IOS) is running in this loop 99.9% of the time, basically doing nothing.
Chapter 2: Memory, Modes, Instruction Sets, Endianess, Registers
Memory regarding MEM1 and MEM2 (aka Broadway's memory) is straightforward. When referencing this memory in your ARM instructions, you would just use the physical address (i.e. 0x00001500). For Hollywood Registers, just use the physical address equivalent but with bit 8 set high (i.e. 0x0D800064). However, using SRAM (Starlet's dedicated memory) is confusing....
Without getting into the nitty gritty of various SRAM mappings that can be utilized, SRAM is configured as such whenever a Wii app/game/etc is running...
- 0xFFFE0000 thru 0xFFFE7FFF = SRAM B aka SRAM 1 (32KB size)
- 0xFFFE8000 thru 0xFFFEFFFF = Junk, literally junk values, has no meaning or effect
- 0xFFFF0000 thru 0xFFFFFFFF = SRAM A aka SRAM 0 (64KB size)
The IOS Kernel executes code and function calls located at SRAM A (similar to how Wii games execute code at Mem80). Certain parts of SRAM A and all of SRAM B is used as Data/Storage (similar to mem9 on Broadway).
The above addresses are "Mirrors". They are not "real" addresses per se. Also, "Mirrors" are **not** Virtual Addresses. The Mirrors are done by hardware, not software. These Mirrors were implemented to comply with Starlet's hardware requirements for addressing in regards to resets and exceptions.
These are the physical/real/true ranges of SRAM (when Wii games/etc are running):
- 0x0D4E0000 thru 0x0D4E7FFF = SRAM B
- 0x0D4E8000 thru 0x0D4EFFFF = Junk
- 0x0D4F0000 thru 0x0D4FFFFF = SRAM A
Therefore, to change a physical address to its Mirrored equivalent, just add 0xF2B00000.
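Here is that rule as a small Python sketch, with a range check so you don't accidentally "mirror" something that isn't SRAM. The function names are made up; the ranges are the physical ones listed above.

```python
MIRROR_DELTA = 0xF2B00000

def sram_region(phys):
    """Name the SRAM region that a physical address falls in."""
    if 0x0D4E0000 <= phys <= 0x0D4E7FFF:
        return "SRAM B"
    if 0x0D4E8000 <= phys <= 0x0D4EFFFF:
        return "Junk"
    if 0x0D4F0000 <= phys <= 0x0D4FFFFF:
        return "SRAM A"
    raise ValueError("not an SRAM physical address")

def sram_physical_to_mirror(phys):
    """Physical SRAM address -> Mirror address (just adds 0xF2B00000)."""
    sram_region(phys)   # raises ValueError if phys is outside SRAM
    return phys + MIRROR_DELTA

print(sram_region(0x0D4F0000), hex(sram_physical_to_mirror(0x0D4F0000)))
```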
Regarding the IOS Kernel, it is the highest level of security of IOS and has all privileges. Think of it like the "inner/root core" of the entire IOS. IOS interacts with the AES, SHA, and HMAC engines from the Kernel. When IOS idles while the PPC is running, it is idling via the 'main/idle' loop (called thread 0) that resides in the kernel.
The rest of IOS, what is known as the Modules/Plugins, is located in MEM2 at 0x13400000 thru 0x13FFFFFF (12MB). Under normal conditions, Broadway cannot write to this region of memory.
When an exception occurs in Starlet/IOS, memory mapping starts at 0xFFFF0000. More on exceptions in the next Chapter.
Starlet can execute in 3 different modes---
- ARM Mode (all exceptions run in ARM Mode)
- Thumb Mode
- Jazelle Mode
ARM Mode is the standard 32-bit mode. Instructions are 32-bits in size. All instructions are available for usage.
Thumb Mode is when the processor will run 16-bit sized instructions. This is a reduced instruction set due to the 16-bit size of Thumb instructions. This is also known as Thumbv1 Mode in newer/modern ARM manuals. Starlet runs in this mode for the main loop of the Kernel.
Jazelle Mode is when the processor executes in a mode tailored for a Java Virtual Machine. Starlet *NEVER* runs in this mode. Therefore, it won't be covered.
Unlike other ARM cores, Starlet runs in Big Endian mode by default, but can be configured to run in Little Endian mode. This was done for compatibility with the Broadway chip.
In the Manuals linked in the first Chapter, the bit labeling is done in Little Endian 'fashion'. The most significant (far left) bit is named as bit 31, and least significant (far right) bit is named as bit 0.
Starlet has absolutely zero Floating Point functionality. Starlet does implement its own Cache and address translation.
Register List of Starlet (in ARM Mode)
- r0 thru r3 = Volatile Registers w/ r0 being used for function return values (think of these like r3 thru r10 of Broadway)
- r4 thru r10 = Non-Volatile Registers (think of these like r14 thru r31 of Broadway)
- r11 = Frame Pointer (points to bottom of current Frame; more on this in Chapter 12)
- r12 = Scrap register (similar to r0 of Broadway but without the stupid literal zero rule)
- r13 = Stack Pointer (like r1 for Broadway)
- r14 = Link Register (unlike Broadway, you can directly read/write to the LR)
- r15 = Program Counter (unlike Broadway, you can read/write to the PC)
- CPSR (Current Program Status Register)
Register List of Starlet (in Thumb Mode)
- r0 thru r7 only (r0 - r3 volatile, r4 - r6 non volatile, r7 scrap)
- SP, LR, PC, CPSR
CPSR Breakdown (NOTE: Reverse the bit numbers if referencing an ARM manual!!!)
- bit 0 = Negative/Less-Than; flipped high if a comparison results in a negative number
- bit 1 = Zero; flipped high if a comparison results in a zero value
- bit 2 = Carry/Borrow/Extend; flipped high if an addition carries past the 32-bit maximum. For subtractions and comparisons, it is high when *no* borrow was needed.
- bit 3 = Overflow; result of any instruction that the value cannot be represented
- Bit 4 = Saturation; if an overflow and/or Saturation occurs for certain DSP-oriented instructions
- Bit 5 and 6 = Reserved
- Bit 7 = Jazelle Enable; read only. Starlet runs in a heavily watered-down 8-bit variable mode. Do not write to this to enable/disable Jazelle
- bit 8 thru 23 = Reserved
- Bit 24 = IRQ Disable (aka disable interrupts)
- Bit 25 = FIQ Disable
- Bit 26 = Thumb Enable; read only. Starlet runs in a faster but limited mode where instructions (that can be used) are 16-bits in size. Do not write to this to enable/disable Thumb
- Bits 27 thru 31 = Processor Mode (more info on this in the next Chapter)
Chapter 3: Exceptions and Starlet Syscalls
Here's a list of all Exceptions with their Exception Vector Address plus a short Description....
- 0xFFFF0000 Reset Vector (for resets or cold boots)
- 0xFFFF0004 Undefined/Illegal Instruction (similar to a Program Exception in PPC)
- 0xFFFF0008 Software Interrupt (called whenever swi instruction is executed; similar to a System Call exception in PPC)
- 0xFFFF000C Prefetch/Instruction Abort (similar to an ISI exception in PPC)
- 0xFFFF0010 Data Abort (similar to a DSI exception in PPC)
- 0xFFFF0014 Reserved
- 0xFFFF0018 IRQ (interrupt request; similar to External Interrupt exception in PPC)
- 0xFFFF001C FIQ (fast interrupt)
At each Exception Vector Address is a single instruction that changes the Program Counter to automatically branch to the actual code/program that handles said exception. Before discussing Exceptions in further detail, we need to cover the various Processor Modes. Here is a list of all 7 of them...
CPSR Processor Mode bit value breakdown:
- 0x10 = User Mode
- 0x11 = Fast Interrupt Request Mode
- 0x12 = Interrupt Request Mode
- 0x13 = Supervisor Mode
- 0x17 = Abort Mode
- 0x1B = Undefined Mode
- 0x1F = System Mode
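If you have a raw CPSR value in front of you (for example one read out of a Thread entry), a tiny Python decoder saves some squinting. This sketch uses the ARM manual's bit numbering (bit 31 = Negative, bits 4 thru 0 = Processor Mode), so it is the reverse of the Broadway-style numbering used in Chapter 2. Undefined Mode is 0b11011 (0x1B) per the ARM Reference Manual. The function name and the sample value are made up.

```python
MODES = {
    0x10: "User", 0x11: "FIQ", 0x12: "IRQ", 0x13: "Supervisor",
    0x17: "Abort", 0x1B: "Undefined", 0x1F: "System",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into named fields (ARM manual bit numbers)."""
    def bit(n):
        return bool((cpsr >> n) & 1)
    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28), "Q": bit(27),
        "J": bit(24), "I": bit(7), "F": bit(6), "T": bit(5),
        "mode": MODES.get(cpsr & 0x1F, "invalid"),
    }

# Sample value: Thumb bit set, System Mode, interrupts enabled
print(decode_cpsr(0x6000003F))
```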
All modes except User mode are known as Privileged Modes. All Privileged Modes except System Mode are known as Exception Modes. User mode is the most restrictive. The only way for Software to exit out of User mode is through an Exception and the contents of that Exception contain instructions to switch to a Privileged Mode once the Exception has been finished. You can 'manually' enter into any Exception Mode with some CPSR related instructions, but Starlet is usually only in an Exception Mode due to an actual Exception occurring.
When any exception occurs, the following takes place...
1. Exception Mode enabled and execution of Starlet is now in Physical Memory
2. Banked Registers enabled
3. SPSR Register enabled (what was in the CPSR, before the exception, is now copied into the SPSR)
4. CPSR value changed depending on type of Exception taken
1. The type of Exception Mode that gets set depends on the type of Exception that occurred. Here is a list...
- Reset = Supervisor Mode
- Undefined/Illegal = Undefined Mode
- Software Interrupt = Supervisor Mode
- Prefetch Abort = Abort Mode
- Data Abort = Abort Mode
- IRQ = Interrupt Request Mode
- FIQ = Fast Interrupt Request Mode
2. Banked Registers are certain Registers that are only visible/usable during the Exception Modes. The type and amount of Banked Registers allowed for usage depends on the type of Exception that was taken. In every Exception, SP (r13), and LR (r14) are banked. This means you are now using a different version of SP & LR. The original SP & LR are preserved and can only be used again once you leave the Exception. This is a crafty method implemented in ARM to easily save important data without the need to 'manually' save it. The FIQ Exception Mode is the only oddball. It will also include banking registers r8 thru r12.
Each mode has its own banked versions of SP and LR. Thus, there are 6 versions of SP & LR. The regular version, and the versions present for the 5 Exception Modes. Please note that User mode and System Mode do ***NOT*** bank any registers!
The banked LR will be set depending on the type of Exception that occurred (except for Reset, it will be an unpredictable value). Refer to chapter A2.6 of the ARMv6 manual for details of every banked LR value for each Exception.
All banked SP values are set by IOS sometime during initial environment setup/config. In order to change a banked SP, let's say the Banked IRQ SP register, you must be in IRQ Exception Mode.
3. The SPSR (Saved Program Status Register) gets filled with the CPSR's value that was present before the Exception was taken. Each Exception Mode has its own banked version of the SPSR. In totality, there are 5 SPSR's.
4. The CPSR will have some values changed. In every exception, T (thumb) and J (jazelle) bits are cleared. Thus, every exception executes in ARM mode! For every exception (except FIQ and Reset), the FIQ bit is left unchanged. On FIQ and Reset, it is set high (FIQ disabled). For every exception, IRQ bit is set high (IRQ disabled).
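Steps 1, 3, and 4 can be modeled in a few lines of Python to see what the CPSR and SPSR end up as. This is only a sketch of the rules described above (ARM manual bit numbering, made-up names). Reset is left out on purpose because its SPSR and LR are unpredictable.

```python
ENTRY_MODE = {
    "undefined": 0x1B, "swi": 0x13, "prefetch_abort": 0x17,
    "data_abort": 0x17, "irq": 0x12, "fiq": 0x11,
}
T_BIT, F_BIT, I_BIT, J_BIT = 0x20, 0x40, 0x80, 1 << 24

def take_exception(cpsr, kind):
    """Return (new_cpsr, spsr) after the given exception is taken."""
    spsr = cpsr                                        # old CPSR is saved in the SPSR
    new = cpsr & ~(0x1F | T_BIT | J_BIT) & 0xFFFFFFFF  # clear mode, Thumb, Jazelle
    new |= ENTRY_MODE[kind] | I_BIT                    # new mode, IRQ disabled
    if kind == "fiq":
        new |= F_BIT                                   # FIQ also disables FIQ
    return new, spsr

# Idle-loop style CPSR (Thumb, System Mode) hit by an IOS syscall instruction
new_cpsr, spsr = take_exception(0x6000003F, "undefined")
print(hex(new_cpsr), hex(spsr))
```

Returning from the exception is then just the reverse of the first line: the SPSR value goes back into the CPSR.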
A picture is worth a 1000 words. Here is a handy diagram of the Banked Registers~
When an exception has ended, SPSR is copied over to the CPSR. Therefore, whatever mode (User or System) you were in beforehand, is now restored. SP and LR (plus r8 thru r12 for FIQ) original versions/values are also restored.
For every IOS, the Undefined/Illegal instruction exception vector goes to the address of 0xFFFF1F24. This is the location of IOS's specialized Syscall Handler. There are 2 types of Syscalls for IOS.
Regular ARM syscalls:
This uses the swi instruction. Whenever Starlet executes the swi instruction, an exception will occur and Starlet will start execution at 0xFFFF0008.
IOS syscalls:
IOS handles various illegal instructions to represent special syscalls that will perform certain tasks. The 'base' syscall instruction has the compiled form of 0xE6000010. Each syscall has a number, starting at syscall 0. You insert the number into the instruction by shifting the syscall number to the left by 5 bits then logically ORing it into 0xE6000010.
Example (syscall 0x54)
- Shift 0x54 by 5 bits to the left. 0x54 => 0xA80
- Logically OR 0xA80 with 0xE6000010
- The result is 0xE6000A90.
Therefore to 'call' syscall 0x54, you would use the instruction of 0xE6000A90. Keep in mind some modules/plugins cannot execute certain syscalls due to lack of privilege. Syscall 0x54 can only be called from the Kernel or the /dev/es module.
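The shift-then-OR recipe is easy to script if you need a bunch of these. Here is a minimal Python sketch (the function name is made up) that reproduces the syscall 0x54 example:

```python
SYSCALL_BASE = 0xE6000010   # 'base' IOS syscall instruction

def encode_syscall(number):
    """Build the IOS syscall instruction word for the given syscall number."""
    return SYSCALL_BASE | (number << 5)

print(f"{encode_syscall(0x54):08X}")
```

You can then place the resulting word into your source with a .long directive, since the assembler has no mnemonic for it.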
The Syscall Program Handler is located at 0xFFFF1F24. Starlet will execute this handler to make a new stack frame to backup registers, check if the instruction is a special syscall instruction, and then use a lookup table to determine which address/function to call based on the syscall number.
Link to WiiBrew article of syscalls and a decomp of the syscall handler - https://wiibrew.org/wiki/IOS/Syscalls
Chapter 3: Immediate Value Rules
While PPC has a signed or unsigned 16-bit immediate value range for instructions which makes it super simple to write any 32-bit value in a register (lis+ori), the way the ARMv5 language implements immediate values can be a pain in the ass.
To load a value 'from scratch' into a register in ARM, you will typically use the mov instruction. Example~
Code:
mov r0, #1 @To write comments, use the '@' symbol.
When writing any immediate value in any ARM instruction, it *MUST* have a hashtag sign pre-pended. Notes/comments are designated via the '@' symbol.
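You can use any value that can be expressed in an 8-bit field within a 32-bit 'space', as long as that field is rotated into place by an even number of bits. For example, the value 0xFF000000 is a legal immediate value. 0xFF fits into 8 bits. If the value consists of 2 or more binary 1's, the first and last binary 1 must fit within an 8 bit field that starts on an even bit position (the field is also allowed to wrap around from the least significant bit to the most significant bit).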
Therefore, something like the value of 0x000001C1 is invalid. 0x1C1 cannot be expressed in 8 bits.
Examples~
- 0 = valid
- 0x1 = valid
- 0xF7 = valid
- 0x17800000 = valid
- 0x0003FC00 = valid
- 0x101 = invalid
- 0x100C0045 = invalid
0xFFFFFFFF (-1) is an invalid value. However, there is an instruction called mvn (Move then Logically NOT). It is the same as mov but will Logically-NOT the value. Therefore, to load 0xFFFFFFFF into a register, you can do this...
Code:
mvn r3, #0 @Loads zero then does a logical NOT of it, resulting in -1.
Therefore, for any immediate value that is valid for mov, its "logical NOT'd" value can be loaded via mvn.
Code:
mvn r3, #255 @Loads 0xFF then does a logical NOT of it, resulting in 0xFFFFFF00 (-256)
Each 32-bit ARM instruction has 8 bits within them for an initial value and 4 bits as a rotation mechanism. If you want to learn how instructions actually process immediate values, formula is listed below.
V = n ror (2*r)
V = Immediate Value
n = 8-bit Initial Value
r = 4-bit Rotation Value
ror = Rotate Right (this is ROTATING, *not* shifting)
Example~
- Use a 8 bit value (n) of 0xFF
- Use a 4-bit value (r) of 0xB
First, take 0xB and multiply it by 2. You will get 0x16 which is 22 in decimal.
Now take the value of 0xFF and rotate it to the right by 22 bits.
You get the result of 0x0003FC00. This value is safe to use in the mov instruction
Code:
mov r4, #0x0003FC00
In case you are lazy and would rather just type a number into a script/program to check its validity, there's a person named Azeria who has made such a script.
Link - https://raw.githubusercontent.com/azeria...rotator.py
Enter in the number, in decimal form, script will tell you if you can use it or not.
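If you would rather see the formula in action, here is a minimal Python sketch of the same kind of check. To be clear, this is *not* Azeria's script, just a small stand-in written from the V = n ror (2*r) formula above, with made-up function names. It returns the 8-bit Initial Value and 4-bit Rotation Value, or None if the number can't be encoded.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by amount bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def encode_immediate(value):
    """Return (n, r) such that value == n ror (2*r), or None if invalid."""
    value &= 0xFFFFFFFF
    for r in range(16):
        n = ror32(value, 32 - 2 * r)   # rotating left by 2*r undoes the ror
        if n <= 0xFF:
            return n, r
    return None

for v in (0x0003FC00, 0x17800000, 0x101, 0xFFFFFFFF):
    print(hex(v), encode_immediate(v))

# mvn check: a value works with mvn if its logical NOT is encodable
print(encode_immediate(~0xFFFFFF00 & 0xFFFFFFFF))
```

For 0x0003FC00 this gives n = 0xFF and r = 0xB, matching the worked example above.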
If the number is invalid (and the mvn instruction won't work either), you will need to break it apart via multiple instructions. For example, the value of 511 is an invalid immediate value. Therefore, to load it into, let's say, r0, you would do this.
Code:
mov r0, #255 @Hex is 0xFF
add r0, r0, #256 @Hex is 0x100. 255 + 256 = 511. r0 now = 511
The immediate values rules apply to ***ALL*** instructions, not just mov.
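Breaking a value apart like the 511 example can be automated. Below is a simple Python sketch that greedily chops a 32-bit value into pieces that are each a valid immediate (an 8-bit field on an even bit position). The function name is made up, and the greedy approach is not guaranteed to find the fewest pieces, but it never needs more than 4.

```python
def split_immediate(value):
    """Split a 32-bit value into valid immediates that add (or OR) back together."""
    v = value & 0xFFFFFFFF
    chunks = []
    while v:
        low = (v & -v).bit_length() - 1          # index of the lowest set bit
        low &= ~1                                # round down to an even position
        chunk = v & ((0xFF << low) & 0xFFFFFFFF) # grab an 8-bit field from there
        chunks.append(chunk)
        v &= ~chunk                              # remove what was just covered
    return chunks

print([hex(c) for c in split_immediate(511)])
print([hex(c) for c in split_immediate(0x100C0045)])
```

The first piece goes into a mov, and each remaining piece into an add (or orr) on the same register.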
Chapter 4: Multiple Input Variations for same Instruction
For Broadway, there are many different versions of instructions, such as differences for an instruction when it uses two source registers vs the type of instruction that use 1 source and 1 immediate value (i.e. add vs addi)
However for ARM, this is not the case. The add instruction can be written out in different forms. Instead of having add,addi,etc instructions, you just have add.
Example~
Code:
add r0, r3, r4 @Add two source registers to place result in destination register
add r0, r3, #100 @Add source register and immediate value to place result in destination register
Chapter 5: Compares, Branches, and Conditional Instruction Execution
Compare and branch instructions work similarly to PPC. You can use label names just like how you did with PPC. There are a slew of simplified mnemonic conditional branches that can be used.
Example~
Code:
cmp r4, #100 @Check r4 vs 100
beq some_label
Another Example~
Code:
cmp r4, r12 @Check r4 vs r12
bgt some_label
How Compare Instructions actually operate:
When a comparison (cmp) instruction is executed, the second operand (a Source Register or an Immediate Value) is subtracted *FROM* the first register. The result itself is thrown away; only the CPSR bits are updated. For example..
Code:
cmp r4, r12
Will do r4 minus r12. The result of this subtraction will flip the appropriate CPSR bits high (i.e. Negative/Less Than Bit).
ARM's cmp instruction doesn't allow specification of a signed vs unsigned comparison. However, there are a wide variety of branch instructions to compensate for this...
Here are all the Conditional Branch options.
- EQ = Equal
- NE = Not Equal
- GT = Greater Than (signed)
- LT = Less Than (signed)
- GE = Greater Than (signed) or Equal
- LE = Less Than (signed) or Equal
- HS = Unsigned Higher or Same
- LO = Unsigned Lower Than
- MI = Negative
- PL = Positive or Zero
- VS = Signed Overflow
- VC = Not Signed Overflow
- HI = Unsigned Higher
- LS = Unsigned Lower or Same
- CS = Carry Bit Set; same thing as HS
- CC = Carry Bit Clear; same thing as LO
- AL = Always (same as just a regular Branch - B)
Example~
Code:
bhs some_label @branch to some_label if result unsigned higher or same
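To see how one cmp can serve both signed and unsigned branches, here is a small Python model. It is only a sketch: it assumes the standard ARM definitions of the N, Z, C, and V flags, and the helper names (cmp_flags, branch_taken) are invented for this example.

```python
def cmp_flags(a, b):
    """Model 'cmp a, b': do a - b on 32 bits and return the (N, Z, C, V) flags."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    n = result >> 31                           # result is negative
    z = int(result == 0)                       # result is zero
    c = int(a >= b)                            # carry set = no borrow (unsigned a >= b)
    v = ((a ^ b) & (a ^ result)) >> 31         # signed overflow on the subtraction
    return n, z, c, v

CONDITIONS = {
    "eq": lambda n, z, c, v: z == 1,
    "ne": lambda n, z, c, v: z == 0,
    "hs": lambda n, z, c, v: c == 1,
    "lo": lambda n, z, c, v: c == 0,
    "mi": lambda n, z, c, v: n == 1,
    "pl": lambda n, z, c, v: n == 0,
    "vs": lambda n, z, c, v: v == 1,
    "vc": lambda n, z, c, v: v == 0,
    "hi": lambda n, z, c, v: c == 1 and z == 0,
    "ls": lambda n, z, c, v: c == 0 or z == 1,
    "ge": lambda n, z, c, v: n == v,
    "lt": lambda n, z, c, v: n != v,
    "gt": lambda n, z, c, v: z == 0 and n == v,
    "le": lambda n, z, c, v: z == 1 or n != v,
}

def branch_taken(cond, a, b):
    """Would 'cmp a, b' followed by 'b<cond>' take the branch?"""
    return CONDITIONS[cond](*cmp_flags(a, b))
```

For example, comparing 0xFFFFFFFF against 1 takes a bhi (unsigned: a huge number) but not a bgt (signed: -1 is not greater than 1).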
Unlike PPC, there are no branch hints you can use. There is also no dedicated Condition Register (let alone multiple CRs) for Starlet.
Almost all ARM instructions can have conditional operations applied to them too.
Example:
Code:
cmp r10, r11 @Compare r10 vs r11
addne r3, r4, r5 @If r10 doesn't equal r11, add r4 + r5 and place the result in r3.
Another example:
Code:
cmp r7, #0x400 @Compare r7 vs 0x400
movlt r0, #1 @Set r0 to 1 if r7 is less than 0x400
ARM also comes with the cmn instruction. It's basically the opposite of cmp. Instead of a subtraction being done for the operation, an addition is performed.
Example:
Code:
cmn r10, #0xC10
This would do the operation of r10 + 0xC10 to set the appropriate CPSR bits.
Chapter 6: Logical Operations plus additional Logical Shift/Rotation Feature
ARM comes with some instructions for logical operations. They are all pretty plain jane. Therefore, there is no need to really deep dive into them.
- orr = Logical OR
- and = Logical AND
- eor = Logical XOR
- bic = Bit Clear
- lsl = Shift Left
- lsr = Shift Right
- asr = Algebraic Shift Right (real name is Arithmetic Shift Right; just calling it Algebraic for those familiar with PPC)
- ror = Rotate Right
To mimic...
nand rD, rA, rB
Use....
and rD, rA, rB
mvn rD, rD
To mimic...
nor rD, rA, rB
Use...
orr rD, rA, rB
mvn rD, rD
To mimic...
not rD, rA
Use...
mvn rD, rA
To mimic...
rlwinm rD, rA, 0, 0x00008000 #Big Endian bit 16
Use...
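and rD, rA, #0x00008000 @Keep only bit 16 of rA (rlwinm with this mask ANDs rA with the mask; bic would clear the bit instead)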
To mimic...
rlwimi rD, rA, 0, 0x00008000 #Big Endian bit 16
Use...
bic rD, rD, #0x00008000 @Erase previous bit 16 value, whatever it was, in rD
and rA, rA, #0x00008000 @Erase all previous bit values except bit 16 in rA
orr rD, rD, rA @Now OR the two registers together to insert rA's bit 16 into rD
Most instructions also allow an additional shift/rotation to be applied to the last source register (the 2nd or 3rd operand, depending on the instruction) before the instruction continues with further calculations. No, you cannot use this feature on an immediate value.
Example:
Code:
add r3, r4, r5, lsl #2
This will perform a 2-bit left shift of r5's value *BEFORE* adding it to r4.
Code:
cmp r0, r3, lsr #27
This will perform a 27-bit right shift of r3's value *BEFORE* r0 is compared to it.
Here is a list of all operations you can use:
- lsl = shift left by bit amount
- lsr = shift right by bit amount
- asr = algebraic shift right by bit amount (similar to a srawi ppc instruction)
- ror = rotate right by bit amount
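If you want to double check what a shifted operand turns into, the four operations can be modeled in a few lines of Python. This is just a sketch that assumes 32-bit registers and shift amounts of 0 thru 31; the function names simply mirror the mnemonics.

```python
MASK = 0xFFFFFFFF

def lsl(value, amount):
    """Logical shift left; bits shifted past bit 31 are lost."""
    return (value << amount) & MASK

def lsr(value, amount):
    """Logical shift right; zeros are shifted in from the top."""
    return (value & MASK) >> amount

def asr(value, amount):
    """Arithmetic shift right; the sign bit is copied in from the top."""
    value &= MASK
    if value & 0x80000000:
        value -= 1 << 32        # reinterpret the 32-bit pattern as a negative number
    return (value >> amount) & MASK

def ror(value, amount):
    """Rotate right; bits leaving the bottom re-enter at the top."""
    value &= MASK
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK

def add_shifted(r4, r5):
    """Model of 'add r3, r4, r5, lsl #2': r5 is shifted first, then added to r4."""
    return (r4 + lsl(r5, 2)) & MASK
```

For instance, add_shifted(0x1000, 3) gives 0x100C, since r5 becomes 12 before the addition takes place.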
Chapter 7: Basic Loads & Stores
Just like with PPC, in order to modify contents in memory, it must be loaded into a register, the register modified, then stored back to memory. Instead of writing the source register in parenthesis, it must be enclosed within brackets.
Example:
Code:
ldr r8, [r2] @Load the word at address in r2, place into r8
ldr is the default load instruction, it loads a word.
ldrh = Load Halfword
ldrsh = Load Halfword Signed (similar to lha instruction in PPC)
ldrb = Load Byte
ldrsb = Load Byte Signed
str is the default store instruction, it stores a word.
strh = Store Halfword
strb = Store Byte
There are no signed versions of the store instructions (strsh and strsb do not exist). A store only writes the lower 16 bits (strh) or lower 8 bits (strb) of the register, so there is nothing to sign extend. If you want the sign extended 32-bit value in memory, sign extend the register first and then use str.
Example of load instruction using a source register and immediate value
Code:
ldr r11, [r4, #4]
As you can see in the above example, the immediate value goes at the tail end of the instruction. Here's an example of a store halfword instruction that uses two source registers~
Code:
strh r0, [r0, r1] @halfword of r0 is stored at address designated by r0+r1.
This would be similar to the sthx PPC instruction.
Chapter 8: Program Counter Details and Literal Pools
Unlike PowerPC, in ARMv5, the Program Counter (PC) is a General Purpose Register. You can freely read from and write to it. Therefore, you can also use it as a loading/storing reference (PC + offset value).
In ARM Mode, the PC is the current instruction + 8.
In Thumb Mode, the PC is the current instruction + 4.
This is because the PC is always 2 instructions ahead of the current instruction that is going to execute.
Example (ARM mode assumed):
Code:
ldr r0, [pc]
add r1, r2, r3
nop
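In the above example, when the ldr instruction executes, r0 will be loaded with the assembled form of the nop. When assembling for ARMv5, the nop is typically emitted as mov r0, r0 (0xE1A00000). The 0xE320F000 encoding is the dedicated nop of later ARM architectures.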
Another Example (ARM mode assumed):
Code:
ldr r0, [pc, #0xC]
add r1, r2, r3
nop @PC + 0
nop @PC + 4
nop @PC + 8
bic r5, r6, #0xF @PC + 0xC
In the above example, the compiled form of the bic instruction will be loaded into r0 for the ldr instruction. To load exactly where you are currently at, in ARM mode, you would need to load at PC minus 0x8.
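The PC arithmetic from the two examples above can be sanity checked with a tiny Python helper. This is only a sketch with a made-up function name. For Thumb it additionally assumes that the PC value is rounded down to a multiple of 4 for PC-relative loads, which is how the ARM manual describes it.

```python
def pc_relative_address(instr_addr, offset, thumb=False):
    """Address accessed by a 'ldr rX, [pc, #offset]' located at instr_addr."""
    if thumb:
        pc = (instr_addr + 4) & ~3   # Thumb: PC is 4 ahead, word aligned for the load
    else:
        pc = instr_addr + 8          # ARM: PC is 8 ahead (2 instructions)
    return (pc + offset) & 0xFFFFFFFF
```

With a ldr placed at 0x1000 in ARM mode, an offset of 0 reads from 0x1008 (the 3rd instruction), an offset of 0xC reads from 0x1014 (the 6th instruction), and an offset of -8 reads the ldr itself.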
With the PC being a general purpose register, we can use it as an alternative method to PPC style BL Tricks.
Example (ARM mode assumed; pretend first instruction resides at memory address 0x00001500):
Code:
ldr r0, [pc, #0x8]
add r1, r2, r3
nop
b the_end
@Address of test string
.long 0x00001514
@Test string, located at 0x00001514
.asciz "This is a test."
the_end:
The load instruction is loading the word value located at PC+8, which is 0x00001514. At address 0x00001514 is the string created by the .asciz directive. Manually calculating address values in this manner is a pain. Therefore we can use what are called literal pools.
Example:
Code:
ldr r0, =Test
add r1, r2, r3
nop
b the_end
Test:
.asciz "This is a test."
the_end:
HOWEVER, this is nothing more than an assembler trick! Whatever assembler (or instruction simulator) you are using, it will automatically add the ".long 0xXXXXXXXX" somewhere in your source (usually at the end). The literal pool trick will not work for something such as manually writing over instructions in IOS. You may need to use a BL-Trick Lookup Table instead.
Chapter 9: Pre & Post Indexing of Loads/Stores
The ARM language doesn't have 'update' versions of loads/store instructions like with PPC. However the use of pre/post indexing can mimic this.
To mimic a PPC stbu instruction, you can do this...
Code:
strb r4, [r0, #0x20]! @Take note of the "!" appended after the bracketed contents
The byte of r4 is stored to r0+0x20, and r0 is incremented by 0x20 as part of the same instruction.
You also can do what is called post indexing. Example~
Code:
ldr r1, [r6], #4
This will load the word located at r6 into r1. It is ***NOT*** loaded at r6+4. Once the word has been loaded, *THEN* r6 is incremented by 4. As you can see this is different than pre indexing.
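The difference between the two indexing styles can be summed up in a short Python sketch. The function names are invented for this example; each one returns the address that gets accessed along with the new value of the base register.

```python
MASK = 0xFFFFFFFF

def pre_indexed(base, offset):
    """[rN, #offset]! : access at base+offset, base register becomes base+offset."""
    address = (base + offset) & MASK
    return address, address

def post_indexed(base, offset):
    """[rN], #offset : access at base, base register becomes base+offset afterwards."""
    return base & MASK, (base + offset) & MASK
```

So with r0 = 0x1000, the strb example accesses 0x1020 and leaves r0 at 0x1020. With r6 = 0x1000, the ldr example accesses 0x1000 and leaves r6 at 0x1004.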
Chapter 10: Multi Loading/Storing
ARM can't exactly replicate how PPC does multi loads/stores but it does come with some unique instructions that PPC cannot do. First thing's first, ARMv5 is only capable of doing multi loads/stores for word values only. You will use ldm for basic multi loading.
Example:
Code:
ldm r1, {r3,r4,r5,r6}
This instruction will do the following...
- Load word at r1 + 0 into r3
- Load word at r1 + 4 into r4
- Load word at r1 + 8 into r5
- Load word at r1 + 12 into r6
Obviously this differs quite a bit from a PPC lmw instruction. You will also notice the register list is enclosed in curly brackets instead of regular square brackets. This is required for any multi load/store instruction.
Instead of writing out every register in the source register list, you can do this...
Code:
ldm r1, {r3 - r6} @shorter & quicker to write
You can also force an update to r1 after the multi load, like this....
Code:
ldm r1!, {r3 - r6}
Take NOTE of the Exclamation Point placed immediately after r1. After the words are loaded into r3 thru r6, r1 is then *incremented* by 16 (0x10). Incrementation is 4 bytes per every source register present in the instruction.
For basic multi storing, you use the stm instruction.
- stm r1, {r3 - r6} @Stores r3 at r1, r4 at r1+4, r5 at r1+8, r6 at r1+12
- stm r1!, {r3 - r6} @Same as above but r1 is incremented by 16 afterwards
There are a variety of extra options for multi loading/storing. Here are 8 more multi load/store instructions....
- ldmia #Load Multi with increase afterwards
- ldmib #Load Multi with increase before
- ldmda #Load Multi with decrease afterwards
- ldmdb #Load Multi with decrease before
- stmia #Store Multi with increase afterwards
- stmib #Store Multi with increase before
- stmda #Store Multi with decrease afterwards
- stmdb #Store Multi with decrease before
ldmia is actually an alternative mnemonic of ldm. The two are the same instruction, whether or not the base register (r1 in the earlier examples) is appended with a "!"
stmia is likewise an alternative mnemonic of stm.
Examples:
- ldm r1, {r3 - r6} = ldmia r1, {r3 - r6}
- stm r1, {r3 - r6} = stmia r1, {r3 - r6}
The term "increase" means the loading/storing address is increased during the instruction. "Afterwards" means the incrementation of the loading/storing address (by 4) starts *AFTER* the first load/store.
"Decrease afterwards" is the same as above but the addresses are decreasing instead of increasing.
The term "before" means to increase/decrease the loading/storing address by 4 *BEFORE* the first load/store.
Examples:
- ldmib r0, {r7 - r9} @r0 is increased by 4 *FIRST* before the first load. Thus, word at r0+4 is loaded into r7. Word at r0+8 loaded into r8. Word at r0+12, loaded into r9.
- stmda r0, {r7 - r9} @r9 is stored at r0. r8 is stored at r0-4. r7 is stored at r0-8.
- stmdb r0, {r7 - r9} @r0 is decreased by 4 *FIRST* before the first store. Thus, r9 stored at r0-4. r8 stored at r0-8. r7 stored at r0-12.
You can also have the destination register be updated in these increase/decrease before/after type multi load/stores.
Example:
Code:
ldmdb r10!, {r0, r1} @Word at r10-4 loaded into r1. Word at r10-8 loaded into r0. Afterwards, r10's value is decreased by 8 (2 source registers x 4 = 8)
Another Example:
Code:
stmia r7!, {r4 - r6} @r4 stored at r7. r5 stored at r7+4. r6 stored at r7+8. Afterwards, r7's value is increased by 12 (3 source registers x 4 = 12)
There are also alternative mnemonics available. Here's a list of them.
- stmfd (store multiple full descending) = stmdb
- stmed (store multiple empty descending) = stmda
- stmfa (store multiple full ascending) = stmib
- stmea (store multiple empty ascending) = stmia
- ldmfd (load multiple full descending) = ldmia
- ldmed (load multiple empty descending) = ldmib
- ldmfa (load multiple full ascending) = ldmda
- ldmea (load multiple empty ascending) = ldmdb
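Keeping the four increase/decrease before/after variants straight is easier with a small calculator. The Python sketch below uses a made-up function name. It returns the address used for each register (lowest numbered register first, since that one always gets the lowest address) plus the value the base register would have if "!" were used.

```python
MASK = 0xFFFFFFFF

def multi_addresses(base, reg_count, mode="ia"):
    """Addresses touched by ldm/stm<mode> and the written-back base value."""
    size = 4 * reg_count
    if mode == "ia":      # increase after
        start, new_base = base, base + size
    elif mode == "ib":    # increase before
        start, new_base = base + 4, base + size
    elif mode == "da":    # decrease after
        start, new_base = base - size + 4, base - size
    elif mode == "db":    # decrease before
        start, new_base = base - size, base - size
    else:
        raise ValueError("mode must be ia, ib, da, or db")
    addresses = [(start + 4 * i) & MASK for i in range(reg_count)]
    return addresses, new_base & MASK
```

For example, multi_addresses(0x100, 3, "da") gives 0xF8, 0xFC, 0x100, matching the stmda example (r7 at r0-8, r8 at r0-4, r9 at r0).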
Chapter 11: Loops
There is no Count (CTR) register for Starlet. You must use a general purpose register as a loop tracker. Since load/store instructions come with a Post-Indexing feature, you do not need to decrement the load and/or store start addresses beforehand.
Example of Basic Loop:
Code:
@Set loop amount
mov r2, #10
@Do the loop
loop:
ldr r1, [r0], #4
str r1, [r8], #4
subs r2, r2, #1
bne loop
The subs instruction will execute a basic sub (subtract) instruction but also update the CPSR flag bits. This is essentially the same mechanism as using the Record (.) shortcut in PPC instructions. Once r2 = 0, the bne branch will not be taken.
You can append many instructions with 's' to force the instruction to update the CPSR flag bits.
Chapter 12: Stack, Prologues, Epilogues
Pushing and popping the stack is a bit different than doing it on PPC. Plus, the layout/structure of the Stack is also different. There are dedicated push and pop mnemonics.
push = stmfd sp! = stmdb sp!
pop = ldmfd sp! = ldmia sp!
Here's an example push instruction that backups just the LR to a new stack frame (storing of PC is explained later)
Code:
push {lr, pc}
It's better to look at this instruction in its stmdb alternative mnemonic form to understand what's going on underneath the hood.
Code:
stmdb sp!, {lr, pc}
For starters let's pretend sp's value before the instruction = 0x0000C200. When the instruction executes, it will first temporarily decrease sp's value by 4. PC is stored at 0x0000C1FC. SP's value is temporarily decreased by 4 again. LR is stored at 0x0000C1F8. SP's value is then actually decreased by 8, due to 2 source registers in the stmdb instruction. After the instruction has executed, SP is now 0x0000C1F8.
PC is stored because Stack Frames in ARM must always have a size divisible by 8. Now to pop this new frame, we do the following...
Code:
pop {lr, pc}
Let's look at this pop instruction in its ldmia alternative mnemonic form...
Code:
ldmia sp!, {lr, pc}
In this instruction, the word at sp+0 is loaded into lr. Then PC is loaded from sp+4. Afterwards, SP's value is increased by 8. We are now 'back' to where we 'left off'.
Or are we? In fact, this is WRONG!!! We are loading PC's old value. The Program Counter keeps track of what's the next instruction that will execute.
By executing the above pop instruction, we would actually "branch" back to whatever instruction is present after the initial push instruction. Needless to say, your program/code will crash/fault because of this.
How do we fix this??? You don't have to fix the push instruction, just fix the pop instruction....
Code:
pop {lr} @Grab back old LR, do not grab back the old PC!
add sp, sp, #4 @Need to add 4 to SP to compensate not loading the PC
Now that you understand basic pushing and popping, let's go over how prologues and epilogues are implemented in IOS. There are a plethora of methods (styles) to write prologues and epilogues in ARM. Regardless of the 'style' used, r4 thru r10 are the non-volatile registers aka the global variables. They are the equivalent of r14 thru r31 for Broadway. What's different is that the lower registers are used first instead of the higher ones. For example, in PPC if you make a stack frame and need to use 2 non-volatile registers, you throw r30 and r31 on the stack. In ARMv5 it's backwards, if you need two non-volatile registers, you throw r4 and r5 on the stack, **NOT** r9 and r10.
The majority of functions within IOS use the "Full Descending" type of stack pushing/popping. Which is perfect, because we can use the push and pop mnemonics.
In ARMv5 compliant prologues, r11 is used for what is called the Frame Pointer aka fp. It must point to the bottom of the current Stack Frame.
r12 is known as Inter-Procedural scratch register aka ip. It is used as a scratch register during prologues for the purpose of making a copy of sp immediately before a new frame (push/stmfd instruction) is created.
Example Prologue that's ARMv5 compliant (1 register being saved)
Code:
mov ip, sp @Make a copy of soon-to-be old SP
push {r4, fp, ip, lr} @Backup 1 register, old fp, old sp (ip), and lr
sub fp, ip, #4 @Make fp (r11) point to bottom of the newly created frame
FP, IP, and LR (in that order) must always be pushed onto a new frame. At this point the Stack structure is as such...
SP+Offset | Item
SP | r4 (Top of new Frame)
SP+4 | old fp (will have address that points to bottom of old frame)
SP+8 | old sp aka old ip (will have address that points to top of old frame)
SP+0xC | function return lr
SP+0x10 | Top of Old Frame (where old SP is pointing to)
.. .. | Unknown Size
SP+?? | Bottom of Old Frame; old function return lr (where old FP is pointing to)
The current value in new fp (r11) at this state in time would be pointing at the address of SP + 0xC. Onto the epilogue...
And here's the corresponding epilogue~
Code:
pop {r4, fp, ip, lr}
bx lr
If you don't need to ensure an instruction set change, then the epilogue can be changed to this...
pop {r4, fp, ip, pc}
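For me personally, I think the whole idea of fp is redundant. As long as each frame contains its "old sp", then back-chaining is guaranteed. PowerPC does this right. Here's how I would personally write prologues/epilogues of custom functions. Be aware that the ARM reference manual lists the value of the base register as unpredictable when it appears in the register list of a multi load/store that uses "!" (unless, for stores, it is the lowest numbered register in the list), so test this style before relying on it.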
Example prologue saving 1 register~
Code:
push {r4, sp, lr, pc} @PC stored for stack size rules
Epilogue~
Code:
pop {r4, sp, lr}
add sp, sp, #4 @Compensate for not including PC in the pop
bx lr
When modifying IOS, be sure to follow the ARMv5 compliant style.
If you are in the situation where you need extra space allocated in the new frame for something such as an output buffer for a child function (i.e. sprintf), here's an example prologue (2 registers + 0x30 buffer space)...
Code:
mov ip, sp
push {r4, r5, fp, ip, lr}
sub sp, sp, #0x34 @Add 0x34 of space *not* 0x30, because we pushed an odd amount of registers onto the frame which would violate stack frame sizing rules
sub fp, ip, #4
When ready to setup your output buffer, do this...
Code:
mov rX, sp @rX = register being used for output buffer
Finally, here's the respective epilogue..
Code:
add sp, sp, #0x34
pop {r4, r5, fp, ip, lr}
bx lr
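The 0x34 vs 0x30 adjustment can be worked out with a one-line rule. Here is a Python sketch of it. The function name is made up, and it assumes the buffer size is already a multiple of 4.

```python
def sp_adjustment(pushed_regs, buffer_bytes):
    """Bytes to subtract from sp so that the whole frame stays divisible by 8."""
    frame = 4 * pushed_regs + buffer_bytes
    padding = -frame % 8              # 0 if already divisible by 8, else 4
    return buffer_bytes + padding
```

With 5 pushed registers and a 0x30 byte buffer, the frame would be 68 bytes, so 4 bytes of padding are added and the result is 0x34. With 4 pushed registers, no padding is needed and the result stays 0x30.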
Chapter 13: Exchanging between ARM and Thumb; Thumb Instruction Set
You have to use the bx, blx, or 'bx lr' instructions to switch between ARM mode and Thumb mode. Do not try to edit the Thumb mode bit in the CPSR. You will cause an exception.
Anyway, the Least Significant Bit (bit 0 in ARM manuals) in the target address of a bx/blx/'bx lr' determines which mode to run. If the LSB is high, Thumb mode will be activated. When it's low, ARM Mode will be activated. It's really that simple.
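In other words, the target address carries the mode with it. A minimal Python sketch of how a bx/blx target is interpreted (the function name is invented for this example):

```python
def decode_bx_target(target):
    """Split a bx/blx target into (mode, address execution continues at)."""
    mode = "thumb" if target & 1 else "arm"
    address = target & ~1 & 0xFFFFFFFF   # the LSB is only a mode flag, not part of the address
    return mode, address
```

So a target of 0xFFFF0C6B means "run the code at 0xFFFF0C6A in Thumb mode", while 0x10100000 means "run the code at 0x10100000 in ARM mode".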
Since Thumb instructions are limited to 16-bits, there is a reduced instruction set. Read up on Chapter A6 of the ARM reference manual. Keep in mind that on chart A6.2.1, any instructions notated with "[2]" are ARMv6 only and do not work on Starlet.
Regarding immediate values, Thumb instructions only allow the basic unsigned range of 0x00 thru 0xFF (0 thru 255).
Chapter 14: Interrupts
To disable interrupts, do this...
mrs rX, cpsr
and rY, rX, #0xC0 @Keep rY somewhere safe
orr rX, rX, #0xC0
msr cpsr_c, rX
To restore them, do this...
mrs rX, cpsr
bic rX, rX, #0xC0
orr rX, rX, rY @rY can be scrapped now
msr cpsr_c, rX
NOTE: To forcefully enable interrupts (instead of restoration), simply remove the ORR instruction from the above restoration example.
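Here is the same bit juggling modeled in Python, in case you want to verify what ends up in the CPSR. It is a sketch only. It assumes 0xC0 covers the IRQ disable bit (bit 7) and the FIQ disable bit (bit 6), and the function names are made up.

```python
MASK = 0xFFFFFFFF
IRQ_FIQ_BITS = 0xC0   # CPSR bit 7 (IRQ disable) + bit 6 (FIQ disable)

def disable_interrupts(cpsr):
    """Return (new cpsr, saved bits), like the mrs/and/orr/msr sequence."""
    saved = cpsr & IRQ_FIQ_BITS                    # and rY, rX, #0xC0
    return (cpsr | IRQ_FIQ_BITS) & MASK, saved     # orr rX, rX, #0xC0

def restore_interrupts(cpsr, saved):
    """Return the cpsr with the two disable bits put back to their saved state."""
    cleared = cpsr & ~IRQ_FIQ_BITS & MASK          # bic rX, rX, #0xC0
    return cleared | saved                         # orr rX, rX, rY
```

If IRQs were already disabled beforehand (saved = 0x80), the restore puts that bit back instead of blindly enabling everything. That is the difference between restoring and forcefully enabling.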
Chapter 15: MEM1 Store/Load issues with Starlet
Read this...
https://twitter.com/marcan42/status/1362...47?lang=en
Marcan, the co-creator of HBC and Bootmii, explains it perfectly. Thus, if you are doing any sort of memcpy, memset, memclear, etc., you must ensure every piece of data stored/loaded to/from MEM1 is done as a word value. Not only that, the word value(s) must always be stored/loaded via an address that is divisible by 4.
Chapter 16: Cache, Address Translation, Self Modifying Code
Starlet comes with a single Cache Unit. It is split into a 16KB Instruction Cache and 16KB Data Cache. Also just like Broadway, Starlet uses Cache Blocks of 32-byte size. However, the Cache Blocks are referred to as "Modified Virtual Address", "Lines", or "Single-Entries".
The term "Modified Virtual Address" can be confusing. Basically there is a hardware register (FCSE PID) in Starlet that does another address translation on top of your typical Virtual Address to Physical translation.
Don't fret though, the configuration of the FCSE PID is as such to where the Modified Virtual Address and Virtual Address are always equivalent.
Regarding typical address translation, the kernel and all modules (except ES,FS,STM, and DI) use Identical translation (Virtual Address exactly maps to Physical).
The other modules use a translation scheme in which the Virtual Address is the Physical Address but with the Most Significant Bit set low. For example, physical address 0xFFFE0500 is represented virtually via 0x7FFE0500.
Going back to the Cache...Starlet doesn't use the MEI protocol, but its own protocol that is almost identical, just with a different naming system.
Clean (similar to Exclusive in Broadway, what's in Cached Memory is also in Physical/Real Memory)
Dirty (similar to Modified in Broadway, what's in Cached Memory hasn't been updated yet to Physical/Real Memory)
Invalid (just like Invalid in Broadway, Block will be cast out soon, can be tossed)
Starlet's cache 'algorithm' can be changed between two different settings. First setting is Pseudo Random, second setting is Round Robin. IOS uses the Pseudo Random setting.
Here's a list of Handy Cache Operations~
Invalidate both the entire ICache and DCache~
mcr p15, 0, rX, c7, c7, 0 @rX must be zero
Invalidate entire ICache~
mcr p15, 0, rX, c7, c5, 0 @rX must be zero
Invalidate ICache Line~
mcr p15, 0, rX, c7, c5, 1 @rX must be 32-byte aligned address!!!
Invalidate entire DCache~
mcr p15, 0, rX, c7, c6, 0 @rX must be zero
Invalidate DCache Line~
mcr p15, 0, rX, c7, c6, 1 @rX must be 32-byte aligned address!!!
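Test and Clean Entire DCache~
loop:
mrc p15, 0, pc, c7, c10, 3 @Note this one is mrc, not mcr. Using pc as the register makes the result update the CPSR flags for the bne
bne loop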
Clean DCache Line~
mcr p15, 0, rX, c7, c10, 1 @rX must be 32-byte aligned address!!!
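Test, Clean, and Invalidate entire DCache (aka Test then Flush)~
loop:
mrc p15, 0, pc, c7, c14, 3 @Note this one is mrc, not mcr. Using pc as the register makes the result update the CPSR flags for the bne
bne loop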
Clean and Invalidate DCache Line (aka Flush)~
mcr p15, 0, rX, c7, c14, 1 @rX must be 32-byte aligned address!!!
Prefetch ICache Line~
mcr p15, 0, rX, c7, c13, 1 @rX must be 32-byte aligned address!!!
The prefetch ICache operation is simply a cache hint for the Instruction fetcher. Regarding DCache hints, there is only the PLD (pre-load data). This provides the DCache with a hint for an upcoming Load instruction. It executes exactly like the PPC dcbt instruction. There is no version of PLD for Store Hints.
Unlike Broadway, there is no need for any special pre-flushing routine in regards to flushing the entire DCache. Just simply execute the instructions required and you're good to go.
---
Self Modifying Code is pretty simple. The following snippet shows how to overwrite a single instruction. Adjust source accordingly for multiple instruction rewrites.
Code:
@Self Mod code
@rX = Address of Instruction
@Pretend rZ contains new instruction
@Write new instruction at rX
str rZ, [rX]
@Align address to 32-bytes; not required if address is already 32-byte aligned
bic rX, rX, #0x0000001F
@Clean Data Cache Block
mcr p15, 0, rX, c7, c10, 1
@Drain Write Buffer
mov rY, #0 @rY = a scrap register that is safe to use
mcr p15, 0, rY, c7, c10, 4
@Invalidate Instruction Cache Block
mcr p15, 0, rX, c7, c5, 1
@Drain Prefetch Buffer; **not** required if self modified instruction(s) is 5+ sequential instructions ahead
@Using any branch will force the prefetch buffer (in the instruction fetcher) to be drained
@Or else a IMB (Instruction Memory Barrier) instruction is required
b 0x4
Chapter 17: Details of the Main/Idle Loop, and Writing in Custom ARM Code
As mentioned earlier in Chapter 1, Starlet is running in a loop waiting for tasks to do. This loop is called thread 0 or also called the Idle Thread.
Locating this loop is simple. Since Starlet is in this loop 99.99% of the time when your game is running, we can easily find the "Base Thread Pointer". We use a simple equation to calculate this Pointer.
Thread Pointer = 0xFFFE0000 + (0xB0 * threadnumber)
The loop is known as Thread 0, so the equation is this...
0xFFFE0000 + (0xB0 * 0)
0xFFFE0000 + 0 = 0xFFFE0000
Now we have the Thread Pointer. Using that, we can find other important information
Thread Pointer + 0 = CPSR
Thread Pointer + 0x3C = SP
Thread Pointer + 0x40 = PC (what we need)
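The equation and offsets above fit into a short Python helper. This is a sketch using only the numbers given in this chapter; the names are made up.

```python
THREAD_BASE = 0xFFFE0000   # Thread Pointer of thread 0
THREAD_SIZE = 0xB0         # distance between two Thread Pointers
FIELD_OFFSETS = {"cpsr": 0x00, "sp": 0x3C, "pc": 0x40}

def thread_field_address(thread_number, field):
    """Starlet address of a saved register inside a thread's context."""
    pointer = THREAD_BASE + THREAD_SIZE * thread_number
    return pointer + FIELD_OFFSETS[field]
```

For thread 0, thread_field_address(0, "pc") is 0xFFFE0040, which is the address we are after.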
We can use something such as my Memory Editor code (HERE) to view 0xFFFE0040 on MKWii to get the current PC addresses of Starlet using Broadway. Your game must be patched using this method (HERE) for you to view SRAM!!! Obviously, convert 0xFFFE0040 to its real SRAM address. Then finally convert it to the usable Broadway virtual address.
0xCD4E0040 #View this address on the Memory Editor Code
You will see the PC is constantly flickering between 3 different addresses. On cIOS249 (using IOS56 as base), these are the following addresses you will see...
- 0xFFFF0C6A
- 0xFFFF0C6C
- 0xFFFF0C6E
The 3 addresses you see on your screen may differ if your MKWii game is running on a different IOS. Anyway, this is the loop. It is executing the following ARM instructions..
Code:
loop:
ldr r3, [r4, #0] @On cIOS249[56] this is loading the word value located at 0xFFFF9ECC
cmp r3, #0
beq loop
IMPORTANT NOTE: Starlet/IOS is in Thumb mode when executing this loop, hence the 0x2 increments to each address. It is also in System Mode (highest privileged mode possible)
Now onto the methods of writing in custom ARM Code for IOS to execute...
Method 1:
You could add in custom patches of code to an IOS and then run a game/channel/etc that uses said IOS. This can be cumbersome at times, as you would constantly need to repatch and then reinstall the IOS for testing new ARM Code.
Method 2:
Another method is to use Palapeli's /dev/sha exploit. It allows you to run a snippet of ARM code without having to interact with IOS directly. Everything is initiated via Broadway. Link - https://github.com/TheLordScruffy/saoirs...ot.cpp#L67
Here is an Assembly Source of Palapeli's exploit that can be used as a C2 Gecko Code for MKWii.
Code:
#Exploit created by Palapeli
#THIS WORKS!
#exploit contents
06000000 0000001C
4903468D 49034788
49036209 47080000
10100000 00000A00 #entry point physical here
FFFF0014 00000000
#Custom ARM code
06000a00 00000014
e3a00536 e59010e0
e3811002 e58010e0
e12fff1e 00000000
#Custom ARM test code; sets the Shutdown bit in the Starlet GPIO register to power off the console
mov r0, #0x0D800000 @Set r0 upper bits to GPIO Starlet Address
ldr r1, [r0, #0xE0] @Load up GPIO
orr r1, r1, #0x0002 @Flip Shutdown Bit High
str r1, [r0, #0xE0] @Write new GPIO
bx lr @Return to exploit
#########
#PAL
#Hooked at Shared Item address - 807BA164
#Custom return codes
#0 (Green shell) = success
#1 (Red shell) = failed to open /dev/sha
#2 (Banana) = ios_ioctlv error
#Statements
.set ios_open, 0x801938f8
.set ios_ioctlv, 0x801945e0
.set ios_close, 0x80193ad8
.set entrypoint, 0xA00 #physical address for 0x80000A00, this is where the list of custom ARM instructions that we want executed will reside
#Push stack; no need to backup r0, r11, r12, CTR or LR
stwu sp, -0x0080 (sp)
stmw r4, 0x8 (sp) #r3 is the shared item, we are modifying this as a custom return code, dont push it on the stack
#Set r31 as 0x8019 for ios calls
lis r31, 0x8019
#Set r30 as 0x8000 for EVA work
lis r30, 0x8000
#r29 will be used for fd backup
#r28 used for custom return/error code backup
#Open sha via ios
bl open_sha
.string "/dev/sha"
.align 2
open_sha:
mflr r3
li r4, 0
ori r12, r31, ios_open@l
mtctr r12
bctrl
#backup fd
mr r29, r3
#check for errors
cmpwi r3, 0
li r3, 1
blt- the_end
#Setup register args for SHA_Init
mr r3, r29 #fd
li r4, 0 #Ioctl no
li r5, 1 #Amount of input buffers
li r6, 2 #Amount of output buffers
addi r7, r30, 0x1500 #Vector table/root
#Vector table/root is at 0x80001500
#layout of Vector
#0x0 = null
#0x4 = null
#0x8 = 0x7FFE0028
#0xC = null
#0x10 = null (physical address for 0x80000000); for cache safety
#0x14 = 0x4 (length; 32 bits)
#Fill in the Vector contents
li r0, 0
stw r0, 0 (r7)
stw r0, 0x4 (r7)
stw r0, 0xC (r7)
stw r0, 0x10 (r7)
lis r0, 0x7FFE #0xFFFE0028 + 0x80000000. Needed because the ioctlv function call does a subtraction of 0x80000000 from this address to convert it to physical. Obviously this part of the ioctlv is 'bad' code. Why not just set bit 0 low? Smh.
ori r0, r0, 0x0028
stw r0, 0x8 (r7)
li r0, 4
stw r0, 0x14 (r7)
#Call ios_ioctlv
ori r12, r31, ios_ioctlv@l
mtctr r12
bctrl
#check for errors
cmpwi r3, 0
li r28, 0
bge- close_ios
#Ioctlv failed, place return code of 2 in r28
li r28, 2
#IOS Close
close_ios:
mr r3, r29 #fd
ori r12, r31, ios_close@l
mtctr r12
bctrl
mr r3, r28 #I've never seen an error from closing IOS, so just place return code from IOS_Ioctlv as the final return code
#The end, if r3 = 0 success, 1 = failed opening /dev/sha, 2 = failed ioctlv call
the_end:
lmw r4, 0x8 (sp) #Recover r4 thru r31
addi sp, sp, 0x0080 #Pop stack
stw r3, 0x0020 (r23) #Shared Item Code default instruction
It is currently hooked to the Shared Item Address and will return certain items based on the condition of certain IOS calls (in regards to the exploit working or not). Adjust the addresses to the IOS calls accordingly, they are currently configured for PAL MKWii.
What's also included is the Exploit contents itself that is packaged in an 06 String Write. This 06 Code must be present. The Exploit Contents are currently configured to run ARM 32-bit code that is present at 0x80000A00. Therefore, you will NEED to make another 06 String Write Gecko Code that contains ARM instructions at 0x80000A00. There is a provided example 06 String Write (at 0x80000A00) that does a demo of the exploit by shutting down the Console via Starlet/IOS.
The right-side word of the 2nd to the last line of the 06 String Write Exploit is the physical address (entry point) that you can alter if needed. It must be a physical address. In a nutshell the following Memory Addresses are used...
0x80000000 thru 0x8000001B #Location of Exploit ARM Contents, don't change this (other than the entry point if desired)
0x80000A00+ #Depends on the length of your custom ARM code; change this entry point if needed
0x80001500 thru 0x80001517 #Used as a temp space for IOS_Ioctlv usage, you should be familiar enough with PPC to change this if this happens to conflict with your other unrelated Codes
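If you assemble your own ARM code, you still have to wrap it into an 06 String Write by hand. The Python sketch below does the wrapping. It assumes the usual 06 layout that the two 06 codes above follow (header word with the address offset, byte count, then data padded with zeros to a full 8-byte line), and the function name is made up.

```python
def pack_06_string_write(address, words):
    """Build the lines of a Gecko 06 String Write from 32-bit instruction words."""
    header = 0x06000000 | (address & 0x01FFFFFF)       # 0x80000A00 -> 06000A00
    lines = ["%08X %08X" % (header, 4 * len(words))]    # second word = length in bytes
    padded = list(words) + [0] * (len(words) % 2)       # pad to an even number of words
    for i in range(0, len(padded), 2):
        lines.append("%08X %08X" % (padded[i], padded[i + 1]))
    return lines
```

Feeding it 0x80000A00 and the five words of the shutdown demo (E3A00536, E59010E0, E3811002, E58010E0, E12FFF1E) reproduces the example 06 code shown earlier, apart from the letter case.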
Important Note about the Exploit:
The exploit can only be run once. This is because Thread 0 is switched to ARM Mode after the exploit has been run. Therefore to run it beyond just once, you will need to rig something up to rewrite the Exploit's ARM contents to be in ARM mode next time the Exploit (code as a whole) is executed by Broadway again.
Chapter 18: Assembling ARMv5 with Devkit
You need to have DevkitPPC already installed and the environments set. Here is a Linux Debian based guide for that - https://mariokartwii.com/showthread.php?tid=1200
You can find various Windows and Ubuntu guides via Google. After you get that done, you need to install devkitARM. Here's the linux command for that...
sudo dkp-pacman -S devkitARM
Now that you have devkitARM installed, here is a guide to assemble ARMv5 assembly into a raw binary file.
Example of ARM assembly contents/instructions in a file called source.s:
1. Navigate to DevkitARM binutils
cd /where/you/installed/devkit/devkitpro/devkitARM/bin
2. Assemble the ARM instructions to object code. Your source file must have the ".s" extension.
./arm-none-eabi-as -march=armv5te -mcpu=arm926ej-s -mbig-endian /where/file/is/source.s -o /where/file/is/source.o
NOTE: To force Thumb mode only, add in "-mthumb"
3. Convert (strip) object code to raw binary file
./arm-none-eabi-objcopy -O binary /where/file/is/source.o /where/file/is/source.bin
Congrats, feel free to view the binary file on a Hex Editor to see the assembled instructions. If your source file has both ARM and Thumb instructions present, you will need to designate them via Assembler directives. Place ".arm" before your instructions in your source file to force the Assembler to assemble the instructions in ARM mode. Place ".thumb" before your Thumb instructions. As an alternative to providing "-mthumb" for the Step #2 command, you can instead slap a ".thumb" at the top of your source file.
Example showing source file using nop in both ARM and Thumb form:
Code:
.arm
nop
.thumb
nop
The above will assemble the nop into its 32-bit form, then the second nop into its 16-bit form.
Another option at your discretion is to insert the architecture name and cpu name in your source file instead of having to type it out in Step #2.
Example:
Code:
.arch armv5te
.cpu arm926ej-s
.arm
nop
.thumb
nop
For the above example, your Step #2 command would be...
./arm-none-eabi-as -mbig-endian /where/file/is/source.s -o /where/file/is/source.o
---
How to disassemble (output will appear in your terminal screen)~
./arm-none-eabi-objdump -b binary -m armv5te -D -EB /where/file/is/source.bin
NOTE: The above will disassemble instructions to their ARM 32-bit form. Therefore, Thumb instructions, if present, will not be disassembled correctly.
NOTE: To force Thumb-only disassembly of instructions, add "-M force-thumb". If present, ARM 32-bit instructions will not be disassembled correctly.
Chapter 19: Instruction Simulator, Links
In the ARM Reference manual, there are many instructions present that won't work on Starlet. In the Instruction Set chapters/sections, if an Instruction has the note "Version 6 and above", it can't be used on Starlet.
There aren't any division instructions available. You will need to use some trickery to mimic division instructions. Luckily, there are multiply based instructions.
- mla rD, rA, rB, rC @[(rA x rB) + rC] = rD
- mul rD, rA, rB @rA x rB = rD
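One common form of that trickery, when the divisor is a known constant, is to multiply by a scaled reciprocal and then shift right. Below is a minimal sketch for unsigned division by 10. The routine name udiv10 and the register usage are illustrative choices, not something taken from the manuals. It uses umull, so it must be assembled in ARM mode.

```asm
@udiv10: hypothetical helper, unsigned divide by the constant 10
@In:  r0 = dividend (n)
@Out: r0 = n / 10, r1 = n % 10
@Clobbers: r2, r3
.arm
udiv10:
ldr r1, =0xCCCCCCCD       @magic reciprocal = ceil(2^35 / 10)
umull r2, r3, r0, r1      @r3:r2 = n x magic (64-bit product)
mov r3, r3, lsr #3        @r3 = (n x magic) >> 35 = quotient
add r2, r3, r3, lsl #2    @r2 = quotient x 5
sub r1, r0, r2, lsl #1    @r1 = n - (quotient x 10) = remainder
mov r0, r3                @r0 = quotient
bx lr
.ltorg                    @literal pool holding 0xCCCCCCCD goes here
```

For example, calling udiv10 with r0 = 1234 returns r0 = 123 and r1 = 4. A different divisor needs its own magic constant and shift amount. A divisor that is only known at run time needs a shift-and-subtract loop instead.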
There isn't a good Starlet Emulator for use unfortunately. However, there are some ARM instruction simulators out there. Google is your friend on this one. There are some web browser based ones if you don't want to install anything on your computer.
Here is a decent web browser ARM instruction simulator - https://cpulator.01xz.net/?sys=arm
It's designed for ARMv7-A. The downsides to this simulator are that it can only run in Little Endian using Physical Addressing only, and switching between ARM & Thumb is not supported.
Now that you have completed this tutorial, you should...
1. Read the Manuals provided in Chapter 1 in more depth
2. Try out some snippets of code in a Simulator
3. Read these WiiBrew articles for Starlet and IOS~
- https://wiibrew.org/wiki/Hardware/Starlet
- https://wiibrew.org/wiki/IOS
- https://wiibrew.org/wiki/ARM_binaries
- https://wiibrew.org/wiki/Using_Ghidra_with_the_Wii (Section called "Use with IOS")
You can then start tinkering around in IOS and try out your modifications on a real Wii.
And lastly, this is a handy Coprocessor reference. This is a 'snapshot' of various coprocessor registers taken using Palapeli's exploit on cIOS249[56]. Thus all info was 'snapshotted' from within the IOS Kernel. Read up on the Starlet Manual to understand the details of this reference.
FCSE PID = Null (identical translation for VA to MVA)
TTBR = 0x13850000
Domain Access = Set to Client on all 16 Fields; therefore all Access Permissions for every mapped Memory Region is based on the AP bits of the related Page Table Entry
C1 Details:
- L4 bit low (set T bit for Loads to PC)
- RR bit low (Round Robin algo disabled in Cache)
- V bit high (Exceptions map to 0xFFFF0000 thru 0xFFFF001C)
- I bit high (ICache enabled)
- R bit low (ROM protection low)
- S bit low (System protection low)
- B bit high (Big Endian)
- C bit high (DCache enabled)
- A bit high (Alignment Fault checks enabled for Data)
- M bit high (MMU on)
Cache Register Details:
- Ctype = Write Back; Register 7 for Cleaning, Format C for Cache Lockdown
- S bit high (Use Harvard cache)
- DCache size = 16KB
- DCache is 4-way
- DCache M bit low (Cache present; this MUST always be set low for Starlet regardless of Cache usage)
- DCache block size is 8 words (32 bytes)
- ICache size = 16KB
- ICache is 4-way
- ICache M bit low (Cache present; this MUST always be set low for Starlet regardless of Cache usage)
- ICache block size is 8 words (32 bytes)
TCM Memories = Both DTCM and ITCM low (not present)
Data TCM Region = Disabled, size and address set to Null
Instruction TCM Region = Disabled, size and address set to Null
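All of the values above are read from coprocessor 15 with mrc. Here is a minimal sketch of how such a snapshot could be taken. It assumes the code runs in a privileged mode (such as inside the IOS Kernel) and that r0 points to a free 32-byte buffer. The label cp15_snapshot is a made-up name, and the register encodings are the standard ARM926EJ-S ones, so double check them against the Starlet Manual before relying on them.

```asm
@cp15_snapshot: hypothetical helper
@In: r0 = pointer to a 32-byte (8 word) buffer
@Clobbers: r1
.arm
cp15_snapshot:
mrc p15, 0, r1, c0, c0, 1    @Cache Type register
str r1, [r0, #0]
mrc p15, 0, r1, c0, c0, 2    @TCM Status register
str r1, [r0, #4]
mrc p15, 0, r1, c1, c0, 0    @Control register (the C1 Details)
str r1, [r0, #8]
mrc p15, 0, r1, c2, c0, 0    @TTBR
str r1, [r0, #12]
mrc p15, 0, r1, c3, c0, 0    @Domain Access Control
str r1, [r0, #16]
mrc p15, 0, r1, c9, c1, 0    @Data TCM Region
str r1, [r0, #20]
mrc p15, 0, r1, c9, c1, 1    @Instruction TCM Region
str r1, [r0, #24]
mrc p15, 0, r1, c13, c0, 0   @FCSE PID
str r1, [r0, #28]
bx lr
```

Individual bits can then be tested from the saved words. For instance, the B bit (Big Endian) is bit 7 of the Control register word, and the M bit (MMU on) is bit 0.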
Chapter 20: Test Code; Conclusion
Using the simulator I linked in the previous chapter, here is some mock-up code you can step through. The code attempts to mimic a source that opens, reads, and closes a file via IOS calls.
Code:
@Example snippet of code:
@1. open /dev/fs to allow us to open basic files on the NAND virtual filesystem
@2. open /shared2/sys/SYSCONF
@3. dump its contents to sram (0xFFFF1000)
@4. close the SYSCONF file
@Make fake SP address
mov sp, #0x00000A00
@Call source as a custom function
bl example_function
the_end:
nop @We will return here once source has been completed, if no errors were set
b the_end
@FUNCTION
example_function:
@Prologue, save 2 registers plus the return address
@(sp is left out of the register list; storing/loading the base register with writeback is unpredictable)
push {r4, r5, lr}
@Make basic lookup table, and backup its pointer
bl lookup_table
.asciz "/dev/fs"
.asciz "/shared2/sys/SYSCONF"
.align 2
lookup_table:
mov r5, lr
@Open /dev/fs
mov r0, lr @Point to /dev/fs in lookup table
bl ios_open
cmp r0, #0
blt error_handler
@We don't need file descriptor from /dev/fs
@Open SYSCONF
add r0, r5, #8 @Point to /shared2/sys/SYSCONF in lookup table
mov r1, #1 @Read perms for IOS_Open
bl ios_open
cmp r0, #0
blt error_handler
mov r4, r0 @Backup fd
@Read SYSCONF (r0 still holds the fd from ios_open)
mov r1, #0xFF000000 @Set dump address to 0xFFFF1000
orr r1, r1, #0x00FF0000
orr r1, r1, #0x00001000
mov r2, #0x4000 @Set dump size to 0x4000 bytes
bl ios_read
cmp r0, #0x4000
bne error_handler
@Close SYSCONF
mov r0, r4 @fd
bl ios_close
cmp r0, #0
bne error_handler
@Epilogue
pop {r4, r5, pc}
@Fake routine locations, so that you can run source on simulator
error_handler:
nop @Will end here if any errors occur
b error_handler
ios_open:
mov r0, #1 @Edit this to negative number to replicate ios_open error
bx lr
ios_read:
mov r0, #0x4000 @Edit this to anything other than 0x4000 for ios_read error
bx lr
ios_close:
mov r0, #0 @0 for ios_close being successful
bx lr
Please note that a BL Trick was used instead of a Literal Pool so the Simulator wouldn't add any extra content to the source.
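For comparison, here is a sketch of what the Literal Pool form of the lookup table could look like. This is not part of the code above, and the labels get_strings and strings are made-up names:

```asm
.arm
get_strings:
ldr r5, =strings   @Assembler stores the address of strings in a literal pool and emits a pc-relative load
bx lr
.ltorg             @The literal pool (one extra word) is emitted here
strings:
.asciz "/dev/fs"
.asciz "/shared2/sys/SYSCONF"
.align 2
```

The pool entry holds an absolute address, so this form only works when the code is linked for the address it actually runs at. The BL Trick is position independent, because the pointer comes from lr at run time.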
Happy Coding!