PowerPC Page Tables Tutorial - Printable Version
Mario Kart Wii Gecko Codes, Cheats, & Hacks (https://mariokartwii.com), Forum: PowerPC Assembly, Thread: PowerPC Page Tables Tutorial (/showthread.php?tid=2005)
PowerPC Page Tables Tutorial - Vega - 12-01-2022

Works for most 32-bit PPC chips including Broadway

Requirements:
I'm making this tutorial due to a recent uptick in conversation on PPC-related Discord servers about how to set these things up. There appears to be some confusion among some people. This tutorial should clear things up.

Under normal circumstances the BAT registers are enough to map out all the memory you need. However, if that is not the case, then you will need to add in a Page Table.

Chapter 1: Fundamentals

What is a Page Table?

A page table is a region of memory made up of blocks of data called Page Table Entry Groups (PTEGs). Each PTEG contains 8 Page Table Entries (PTEs). Every PTEG address is 64-byte aligned. Each PTE within a PTEG is 64 bits in length (double-word) and contains the information required for a proper address translation (such as the upper bits of the physical address equivalent and the WIMG bits).

PTE bit breakdown~
Upper 32-bits
Lower 32-bits
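To make the two PTE words concrete, here is a minimal host-side Python sketch that packs them. It assumes the standard 32-bit PowerPC PTE layout (V, 24-bit VSID, H and 6-bit API in the upper word; 20-bit RPN, R, C, WIMG and PP in the lower word). The helper names `make_pte` and `pte_is_valid`, and the sample VSID, are made up for illustration and are not taken from the tutorial's source.

```python
# Host-side sketch of the two 32-bit PTE words (IBM bit numbering, bit 0 = MSB).
# Assumed layout:
#   upper word: V[0] VSID[1-24] H[25] API[26-31]
#   lower word: RPN[0-19] reserved[20-22] R[23] C[24] WIMG[25-28] reserved[29] PP[30-31]

def make_pte(vsid, ea, pa, wimg=0b0000, pp=0b10, r=1, c=1, h=0):
    """Return (upper, lower) PTE words for one 4KB page (hypothetical helper)."""
    api = (ea >> 22) & 0x3F            # EA bits 4-9: top 6 bits of the page index
    upper = 0x80000000                 # V (Valid) bit
    upper |= (vsid & 0xFFFFFF) << 7    # VSID
    upper |= (h & 1) << 6              # H bit (1 = placed via Hash Value 2)
    upper |= api                       # API
    lower = pa & 0xFFFFF000            # RPN: physical page number
    lower |= (r & 1) << 8              # R (Referenced)
    lower |= (c & 1) << 7              # C (Changed)
    lower |= (wimg & 0xF) << 3         # WIMG
    lower |= pp & 0x3                  # PP
    return upper, lower

def pte_is_valid(upper):
    """True when the V bit of the upper word is set."""
    return bool(upper & 0x80000000)

# Sample values only: VSID 0xCA701C is an arbitrary illustration.
upper, lower = make_pte(vsid=0xCA701C, ea=0xA0001000, pa=0x00001000)
print(f"{upper:08X} {lower:08X}")
```

With R and C set high, WIMG of 0 and PP of 0b10, the low bits of the lower word come out as 0x182.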
The V (Valid) bit is simple enough to understand. If the PTE is invalid, it won't be used by Broadway. Broadway will try to look for another valid PTE.

VSID (Virtual Segment ID) is a randomly generated identifier used as an input to calculate what is called Hash Value 1 (more on this in Chapter 6).

H (Hash Value 2) is set when the PTE had to be placed using a 2nd hash because Hash Value 1 failed (more on this in Chapter 6).

R (Referenced) and C (Changed) are bits that get updated by Broadway to keep history information of the PTE. You do not need to worry about how Broadway updates these.

The WIMG (write-through, cache-inhibit, memory-coherence, guarded) and PP bits are what they would be in BAT Registers.

--------

Special purpose registers known as Segment Registers contain the VSIDs. Permission related bits are also present and will change the meaning of the PP bits in a PTE. There are 16 segment registers (sr0 thru sr15).

SR bit breakdown~
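As a rough illustration of the Segment Register fields and of how the key bits change what PP means, here is a small Python sketch. It assumes the usual 32-bit PowerPC SR layout (T, Ks, Kp, N, then a 24-bit VSID in the low bits) and the standard key/PP access table. The function names are hypothetical.

```python
# Assumed SR layout (IBM bit numbering): T[0] Ks[1] Kp[2] N[3] reserved[4-7] VSID[8-31]

def make_sr(vsid, ks=0, kp=0, n=0):
    """Build an SR value with T=0 (ordinary memory segment). Hypothetical helper."""
    return ((ks & 1) << 30) | ((kp & 1) << 29) | ((n & 1) << 28) | (vsid & 0xFFFFFF)

def decode_sr(sr):
    """Split an SR value into its fields."""
    return {
        "t": (sr >> 31) & 1,
        "ks": (sr >> 30) & 1,   # supervisor-state key
        "kp": (sr >> 29) & 1,   # user-state key
        "n": (sr >> 28) & 1,    # no-execute
        "vsid": sr & 0xFFFFFF,
    }

def page_access(key, pp):
    """Access allowed for a page, given the selected key bit and the PTE's PP bits."""
    if pp == 0b11:
        return "read-only"      # always read-only
    if pp == 0b10:
        return "read/write"     # always read/write
    if key:
        return "none" if pp == 0b00 else "read-only"
    return "read/write"         # key clear: PP 00 and 01 are both read/write

sr = make_sr(0x123456, ks=1, n=1)
fields = decode_sr(sr)
# Ks applies in supervisor mode, Kp in user mode.
print(hex(sr), page_access(fields["ks"], 0b00), page_access(fields["kp"], 0b00))
```

With every protection bit low (Ks = Kp = N = 0), the SR value is just the VSID itself.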
If multiple SRs are to be used, then each SR must have a unique randomly generated VSID. You can have software generate these by calling some rand function, or have them predefined (generated by a third party) in a source.

To break it down very generally, address translation occurs as such....
Here is a chart of minimum recommended attributes for all allowed Page Table sizes~

Page Tables cannot cover less than 8MB or more than 4GB of memory.

Each PTE covers translation for 4KB of memory. A chunk of memory that is 16MB in size would require 4,096 PTEs. Since each PTE is 8 bytes in size, that would mean 32KB of memory must be allocated for the Page Table as a whole. However, due to collision possibilities, you would need at least 4 times this amount. In conclusion, to cover 16MB of total memory, you would need to allocate 128KB of memory for the page table.

SDR1 is a special purpose register that contains the very start address of the entire Page Table and the input values for the special hashes that are used to calculate the PTEG address.

SDR1 bit breakdown~
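The sizing arithmetic above, plus the SDR1 value that goes with it, can be sketched in a few lines of Python. This assumes the usual 32-bit SDR1 layout (HTABORG in the upper 16 bits, a 9-bit HTABMASK in the low bits), where a table of 64KB times (HTABMASK + 1) must start on a multiple of its own size. The function names are hypothetical.

```python
def recommended_table_bytes(covered_bytes):
    """4KB per PTE, 8 bytes per PTE, times 4 for collision headroom."""
    ptes = covered_bytes // 0x1000
    return ptes * 8 * 4

def make_sdr1(table_phys, table_bytes):
    """Build an SDR1 value for a page table at physical address table_phys."""
    if table_bytes < 0x10000 or table_bytes & (table_bytes - 1):
        raise ValueError("table size must be a power of two, 64KB or larger")
    if table_phys % table_bytes:
        raise ValueError("table must be aligned to its own size")
    htabmask = (table_bytes >> 16) - 1      # 64KB -> 0, 128KB -> 1, 512KB -> 7
    if htabmask > 0x1FF:
        raise ValueError("table larger than HTABMASK can describe")
    return (table_phys & 0xFFFF0000) | htabmask

size = recommended_table_bytes(16 * 1024 * 1024)
print(size // 1024, "KB")                   # size for 16MB of covered memory
print(hex(make_sdr1(0x0F980000, 0x80000)))  # a 512KB table at physical 0x0F980000
```

A 512KB table at physical 0x0F980000 gives SDR1 = 0x0F980007, the same value used in the Chapter 6 example.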
HTABORG is the physical address of the very start of the Page Table. In the above chart, the x's are don't-care bit values, meaning they have no restrictions on what they can be when setting the physical start address. The larger the covered Memory Size, the more right-justified zero bits are required in the physical start address. Bits 7 thru 15 within HTABORG are known as the "Maskable Bits". However many zeroes are required there is the number of high/one bits that must be set in HTABMASK.

As an fyi, BATs are faster than Page Tables. They also take priority over Page Tables. If a virtual address falls under both BAT and Page Table translation, the BAT will be used. This means you can set up two different virtual addresses that translate to the same physical address (i.e. 0x80001500 -> 0x00001500 w/ BAT and also 0xA0001500 -> 0x00001500 w/ Page Table).

Chapter 2: Allocating Memory

You will need quite a bit of memory for your page table entries, especially if you are planning to cover something such as 1+GB of virtual memory. For Mario Kart Wii, you can use something such as the Egg::Heap::Alloc function to allocate some memory for this (read note below)...

Example PAL (fill in mem_needed byte value)~

Code:
.set egg_alloc, 0x80229814

In the above code, r3 returns a pointer to the allocated heap memory. Be sure to make this address physical before writing it to the HTABORG bits of SDR1.

IMPORTANT NOTE: The above code may not work for very large memory chunks due to natural function limitations. The function does work when asking for a 0x10000 (64KB) chunk of memory with 0x10000 (64KB) required alignment.

Chapter 3: SDR1 Configuration & TLB Invalidation

IMPORTANT NOTE: Be sure interrupts are masked (off) the entire time you are working on anything Page Table related (SDR1, SR, PTE construction, etc).

Before any page tables can be constructed, the TLB (Translation Lookaside Buffers) must be invalidated.
TLBs are buffers in an on-chip unit that keep track of recently used PTEs. You cannot read/write to these directly. The only actions you can take are to invalidate a TLB by its index number, or issue a tlbsync instruction to wait for all/any TLB invalidations to complete.

SDR1 configuration must be done in real mode (reference: PPC PEM Book Page 2-42 footnotes for Table 2-22). Once SDR1 has been configured, you can invalidate the TLBs. There are a total of 64 TLBs. Each TLB is referenced by an index number that is contained in bits 14 thru 19 of the Effective Address held in the register operand of the tlbie (TLB invalidate entry) instruction. The first TLB is at index 0 and the last is at index 63.

The following snippet of code configures SDR1 and then invalidates the TLBs. It assumes you went into Real Mode via rfi with EE, IR, and DR of the MSR set low.

Code:
#Setup SDR1

Chapter 4: Segment Registers Configuration

After you invalidate the TLBs, you can set up the Segment Registers. The first 4 bits of an Effective Address choose which Segment Register will be used. Therefore, by design, the following occurs..

Effective Address --> Segment Register Chosen
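A quick way to see both index rules (upper 4 EA bits pick the SR, EA bits 14 thru 19 pick the TLB index) is to model them on a host machine. The Python sketch below is only an illustration of the bit slicing described above; the function names are hypothetical.

```python
def sr_index(ea):
    """Upper 4 bits of the EA select one of sr0 thru sr15."""
    return (ea >> 28) & 0xF

def tlb_index(ea):
    """EA bits 14-19 (IBM numbering) give the TLB index used by tlbie."""
    return (ea >> 12) & 0x3F

def tlbie_addresses():
    """One EA per TLB index: stepping by 0x1000 walks indexes 0 thru 63."""
    return [index << 12 for index in range(64)]

# Each SR covers a 256MB slice of the effective address space.
for sr in range(16):
    start = sr << 28
    print(f"0x{start:08X} thru 0x{start + 0x0FFFFFFF:08X} --> sr{sr}")

print(sr_index(0xA0001500), tlb_index(0xA0001500))
```

So 0x8xxxxxxx addresses use sr8, 0xAxxxxxxx addresses use sr10, and so on.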
Normally, a coder/dev may write a series of cmpwi/branch instructions to take an input Effective Address and know which SR to configure. There's no need for that. Broadway comes with the mtsrin instruction.

Move to Segment Register Indirect~
mtsrin rS, rB
Upper 4 bits of rB select the SR
rS is copied into the SR

Here's an example code that sets up every SR with all protection bits low (no restrictions on supervisor, user, or execute). It includes a lookup table where all 16 randomly generated VSIDs reside.

Code:
#Use a VSID lookup table

Chapter 5: Clearing the Page Table

Before any page table is used, it should always be entirely zeroed. Here's a snippet of code that does that..

Code:
#r3 = *Physical* Start Address of the Page Table

Above code is for real mode use. Assumes r3 is physical.

Chapter 6: Algorithm, How PTEGs are Generated

In order for any Page Table to be constructed for use, it needs to be filled with PTEs at the correct spots within the Table. This is determined by an algorithm. The algorithm requires 2 inputs: the EA and what's in SDR1.

First, here is a very broad overview of how an Effective Address is translated to its Physical Equivalent

As you can see, portions of the EA are broken up. Then the selected SR is utilized with the EA portions to make a temporary 52-bit Address. The VPN portion (upper 40 bits) of this 52-bit Address then goes through a series of operations and hashing. Here's a diagram to display that...

The above chart can be broken down into the following steps~
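For anyone who prefers code to charts, the hashing can be sketched in Python as below. This follows the standard 32-bit PowerPC scheme as I understand it: Hash Value 1 is the low 19 bits of the VSID XOR'd with the 16-bit page index (EA bits 4 thru 19), and Hash Value 2 is its 19-bit complement. The function names are hypothetical. The VSID is an assumption: the text does not state one, and 0x00CA701C was picked only because it reproduces the PTEG addresses quoted in this chapter for SDR1 = 0x0F980007 and EA = 0x00FFA01B.

```python
def hash_value_1(vsid, ea):
    """Low 19 bits of the VSID XOR the 16-bit page index (EA bits 4-19)."""
    page_index = (ea >> 12) & 0xFFFF
    return (vsid & 0x7FFFF) ^ page_index

def hash_value_2(hash1):
    """Secondary hash: 19-bit one's complement of Hash Value 1."""
    return ~hash1 & 0x7FFFF

def pteg_address(sdr1, hash_value):
    """Combine HTABORG, the masked upper 9 hash bits, and the lower 10 hash bits."""
    htaborg = sdr1 & 0xFFFF0000
    htabmask = sdr1 & 0x1FF
    upper9 = (hash_value >> 10) & 0x1FF
    lower10 = hash_value & 0x3FF
    # Masked hash bits are OR'd into address bits 7-15; the low 6 address bits
    # stay zero, which is why every PTEG is 64-byte aligned.
    return htaborg | ((upper9 & htabmask) << 16) | (lower10 << 6)

SDR1 = 0x0F980007
EA = 0x00FFA01B
VSID = 0x00CA701C   # assumed value, not given in the text

h1 = hash_value_1(VSID, EA)
primary = pteg_address(SDR1, h1)
secondary = pteg_address(SDR1, hash_value_2(h1))
print(f"primary 0x{primary:08X}, secondary 0x{secondary:08X}")
```

With HTABMASK = 0x007, only the lowest 3 of the upper 9 hash bits reach the address; a larger mask lets more of them through.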
The following chart demonstrates how you can hand-generate a PTEG using just the EA and SDR1. The chart uses the following inputs...

SDR1 = 0x0F980007
Virtual Addr/EA = 0x00FFA01B

As you can see, the PTEG result is 0x0F9FF980. It's important to understand that the number of "1" bits in HTABMASK in SDR1 determines how many bits of Hash Value 1 are placed into the PTEG, starting at bit 15 and going leftward. The chart indicates this via the bracketed bit contents of the upper 9 bits of Hash Value 1, which in turn point to the bracketed bits in the PTEG.

The above chart showed how a Primary PTEG is generated. Sometimes (due to the result of Hash Value 1) a PTEG can be generated which matches a previous PTEG from a different EA. If that PTEG has no free PTE left, Hash Value 1 must be logically NOT'd (bitwise negated, also known as a 1's complement). This new Hash Value is known as Hash Value 2. In the above chart, it would replace what's in Hash Value 1 (the steps beforehand aren't required anymore). The following chart shows what occurs once Hash Value 2 needs to be used...

Therefore, if the Primary PTEG couldn't be used, the new (Secondary) PTEG would be 0x0F980640.

Chapter 7: Constructing the Page Table

In order to construct a Page Table, you must write all PTEs for all possible PTEGs for your range of Covered Memory.

Summary of constructing part (or all) of the Page Table based on a single EA. Assumes you also have the SDR1 and the PA that you want to use for the EA translation.

1. Using EA, figure out which SR would be used
2. Grab SR data
3. Form Upper 32-bits of PTE by...
   a. Extract VSID from the SR & API from the EA
   b. Form temp upper PTE by inserting both VSID and API
   c. Finalize it by flipping bit 0 high (V/Valid bit)
4. Form Lower 32-bits of PTE by...
   a. Supply the PA (alternatively, you can supply one for identical translation by extracting the RPN bits from the EA)
   b. Insert desired WIMG bits
   c. Insert a high R bit, a high C bit, and desired PP bits
5. Generate PPC-Special Hash Value aka Hash Value 1
6. Generate PTEG Address by....
   a. Create a temp hash called tmp1 using Hash Value 1 & SDR1
   b. Create another temp hash called tmp2 using tmp1 and SDR1
   c. Create temp blank PTEG
   d. Insert blank PTEG, tmp2, & Hash Value 1
7. Using PTEG Address from Step 6, make sure there is an empty (invalid) PTE (out of 8)
8. If empty, write new PTE (will set it valid) that was formed from steps 3 and 4
9. If all 8 PTEs of PTEG are already valid, run a secondary special hash (Hash Value 2) to generate a different PTEG
10. Check 8 PTEs in Second PTEG, if none of those can be used, then halt
11. If one of the PTEs in the 2nd PTEG can be used, write new PTE but with H bit high to indicate Hash Value 2 was required

The above must be done for every 4KB aligned virtual address that you plan to use. So for example, let's say you want to set up the following translation scheme...

Effective/Virtual Address Range | Physical Address Range
0xA0000000 thru 0xA07FFFFF | 0x00000000 thru 0x007FFFFF

The above would be for 8MB of covered memory. To construct all the PTEs, you would first need to construct the PTE for virtual address 0xA0000000, then 0xA0001000, then 0xA0002000, etc etc until the last address of 0xA07FF000. When constructing the PTEs, be sure the correct physical address is used for each new 4KB aligned virtual address you are utilizing.

Example snippet of code for a single PTE construction~

Assumes all SRs are configured, TLBs invalidated, SDR1 configured, and you are in real mode with IR+DR low.

Code:
#r3 = Virtual/Effective EA (assumed to be 4KB aligned)

Alright, well that was a doozy. As you can see in the above source code, the mfsrin instruction was used to know which SR data to grab based on the EA. This is much more efficient than using a list of compare+branch instructions.
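As a host-side cross-check of steps 1 thru 11, here is a Python sketch that builds the 0xA0000000 thru 0xA07FFFFF example into a simulated, zeroed table and then looks translations back up. It is a model under stated assumptions, not the tutorial's assembly: the table location (64KB at physical 0x00100000, so SDR1 = 0x00100000), the sr10 value 0x00CA701C, and all function names are made up for illustration.

```python
import struct

def pteg_addr(sdr1, h):
    """PTEG address from SDR1 and a 19-bit hash value."""
    return (sdr1 & 0xFFFF0000) | ((((h >> 10) & 0x1FF) & (sdr1 & 0x1FF)) << 16) | ((h & 0x3FF) << 6)

def insert_pte(table, sdr1, sr, ea, pa, wimg=0, pp=0b10):
    """Steps 3-11 for one 4KB page. Returns the physical address of the PTE used."""
    base = sdr1 & 0xFFFF0000
    vsid = sr & 0xFFFFFF
    upper = 0x80000000 | (vsid << 7) | ((ea >> 22) & 0x3F)       # V, VSID, API
    lower = (pa & 0xFFFFF000) | 0x180 | (wimg << 3) | pp         # RPN, R, C, WIMG, PP
    h1 = (vsid & 0x7FFFF) ^ ((ea >> 12) & 0xFFFF)                # Hash Value 1
    for h_bit, h in ((0, h1), (1, ~h1 & 0x7FFFF)):               # primary, then secondary
        offset = pteg_addr(sdr1, h) - base
        for slot in range(8):
            pos = offset + slot * 8
            (word,) = struct.unpack_from(">I", table, pos)
            if not word & 0x80000000:                            # invalid = free slot
                struct.pack_into(">II", table, pos, upper | (h_bit << 6), lower)
                return base + pos
    raise RuntimeError("both PTEGs are full, halt")

def translate(table, sdr1, sr, ea):
    """Search both PTEGs for a matching valid PTE. Returns the PA or None."""
    base = sdr1 & 0xFFFF0000
    vsid = sr & 0xFFFFFF
    want = 0x80000000 | (vsid << 7) | ((ea >> 22) & 0x3F)
    h1 = (vsid & 0x7FFFF) ^ ((ea >> 12) & 0xFFFF)
    for h_bit, h in ((0, h1), (1, ~h1 & 0x7FFFF)):
        offset = pteg_addr(sdr1, h) - base
        for slot in range(8):
            upper, lower = struct.unpack_from(">II", table, offset + slot * 8)
            if upper == want | (h_bit << 6):
                return (lower & 0xFFFFF000) | (ea & 0xFFF)
    return None

SDR1 = 0x00100000                 # assumed table location, HTABMASK = 0 (64KB)
SEGMENTS = {10: 0x00CA701C}       # assumed sr10 contents (all protection bits low)
table = bytearray(0x10000)        # zeroed table, big-endian like the real thing

used = []
for page in range(0x800):         # 8MB / 4KB = 2048 pages
    ea = 0xA0000000 + page * 0x1000
    used.append(insert_pte(table, SDR1, SEGMENTS[ea >> 28], ea, page * 0x1000))

print(hex(translate(table, SDR1, SEGMENTS[10], 0xA0001500)))
```

On real hardware, the order in which the two PTE words are written matters whenever the table may already be in use; this model ignores that.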
Move from Segment Register Indirect~
mfsrin rD, rB
Upper 4 bits of rB select the SR
SR is copied into rD

Chapter 8: Wrapping Things Up; Example Gecko Code

When exiting real mode, be sure that IR and DR will be set high in the MSR after the rfi instruction has been executed. Also make sure EE is back to its original state.

Here is a Gecko Code that uses a Page Table for 0xA0000000 thru 0xA07XXXXX (physical 0x00000000 thru 0x007XXXXX) translation. Once the Page Table has been fully constructed, a simple store instruction using the address of 0xA0001500 is completed. If the translation were not working, an exception (page fault) would occur.

-----

0xA0000000+ 8MB Page Table Example [Vega]

PAL
C200A42C 00000032
9421FFE0 BF810008
3D808022 618C9814
7D8803A6 3C600001
3C800001 80ADA360
80A50024 4E800021
38804000 38A3FFFC
38000000 7C8903A6
94050004 4200FFFC
48000005 7FE802A6
3BDF0024 57DE007E
7FDA03A6 7FC000A6
57C0045E 54000732
7C1B03A6 4C000064
6C638000 5464843E
38A4FFFF 7CA52078
7CA50034 20A50020
7C642B78 7C0004AC
7C9903A6 4C00012C
38000040 38600000
7C0903A6 7C001A64
38631000 4200FFF8
7C00046C 3CC000CA
60C6701C 7CCA01A4
3FA0A000 3B800800
7FA3EB78 546500FE
54C03870 506056BE
64008000 54A70026
39000000 51071E78
60E70182 5468A43E
54C9037E 7D084A78
39800000 5509B5FE
7D2A2038 548B85FE
7D4A5B78 39600000
508B000C 514B81DE
510B3432 7C006378
39400008 396BFFF8
7D4903A6 854B0008
75498000 41820024
4200FFF4 2C0C0040
41820010 7D0840F8
39800040 4BFFFFB0
60000000 4BFFFFFC
900B0000 90EB0004
379CFFFF 3BBD1000
4082FF60 7FDB03A6
3BFF0130 7FFA03A6
4C000064 38000007
3C60A000 90031500
BB810008 38210020
38600000 00000000

Code:
#Address Ports

Chapter 9: Credits, Resources
RE: PowerPC Page Tables Tutorial - Gaberboo - 12-17-2022

Quote: Permission related bits are also present and will override the PP bits in a PTE
Not quite, these bits change the meaning of the PP bits. When the key in the SR is set, two PP bit combinations (00 no access, 01 read only) give reduced access; when it is not set, both of these combinations are read/write. PP bit combo 11 is always read only and 10 is always read/write.

Quote: bit 0 = Must be 0 or else the SR will be used for an I/O device
Broadway doesn't support direct-store segments, so it simply triggers a DSI for any access in a segment whose SR has this bit set and that isn't translated by a BAT.

Quote: Bits 7 thru 15 within HTABORG is known as the "Maskable Bits". Meaning however many zeroes were required is the amount of high/one bits are required to be set in HTABMASK.
Broadway's response to not following this is to just OR HTABORG with the hash anyway, according to a comment in Dolphin's source code. (Note I do not encourage doing this.)

Quote: Page Tables cannot cover less than 8MB or more than 4Gbyte of memory.
The values are a recommendation based on physical memory, on the expectation that at least some physical memory will be used in multiple pages (say, if an operating system is providing a file to multiple programs) and/or the page table gets PTEs clumped in one spot (which can be somewhat remedied with smart setting of VSIDs). Doing some math says the absolute limit of a 64 KiB page table is mapping 32 MiB worth of pages, but pushing the limits like this poses problems similar to when your computer's disk space is full.

Quote: Please NOTE that you could use a double-float store mechanism or the dcbz instruction to clear the page table
Integer stores are going to be faster than storing doubles if you only write the first word of every PTE to zero, and I am willing to bet it's faster even if you write 0s to the entire table, since it's likely IBM optimized those more. I'll try to set up a timing comparison on console for this sometime.

Quote: construct_pte:
This is only ok if the page table isn't in use. A PTE should only be valid if it's ok for the processor to see it. With either form of address translation on, this requires the order to be swapped, an eieio placed between the two instructions to prevent out-of-order shenanigans, and a following sync so the processor waits for the PTE to be written before proceeding.

Sidenote, a PR for mkw-sp I coauthored featured usage of page tables https://github.com/stblr/mkw-sp/pull/495 (I wrote the original page table code, it looks a lot nicer than when I first wrote it)

RE: PowerPC Page Tables Tutorial - Vega - 12-18-2022

Thank you for the corrections, I appreciate it.

Regarding the integer vs float for zeroing a table, if we were to ignore alignment, cache hits/misses, etc, integer stores and double-float stores have the same cycle latency (2:1). I remember a long time ago I was doing some quick tests on how to zero a small block of memory quickly. IIRC, dcbz came out on top even with it being an Execution Serialization instruction. Then again, I have an awful memory (lol). I've found a few optimized 32-bit PowerPC memcpy implementations on the web and all of them used double-float stores to some degree to speed up the memcpy.

Regarding eieio, that shouldn't be needed if SGE is low in HID0. A sync shouldn't be required if this is being set in Real Mode (translation off), and if so, the sync would simply need to be done sometime after all the PTEs are written and before translation is enabled. I'll add them regardless for the case that SGE is high in HID0 and/or PTEs are being rewritten while translation is on.