# U64: A Custom ABI For Nintendo 64

```diff
- NOTE this is incomplete, some sections may be unreadable currently.
- Taking suggestions for any of the marked TODOs, or concerns for any part
```

> :information_source: **NOTE** Implementation progress in GCC has started at https://github.com/LateGator/gcc

The U64 (working title) Application Binary Interface (ABI) is mostly N32 or EABI32, with modifications to reduce memory footprint. While primarily designed by and for Nintendo 64 homebrew, it is expected to be the superior ABI choice for MIPS III+ 64-bit embedded devices where memory is precious.

**Goals:**

- 32-bit and 64-bit arithmetic instruction usage
  - Only pay for 64-bit when it's actually used
- 32-bit pointers
- Full MIPS III FPU support
- Reduced stack frame sizes compared to N32
- Pass by register as much as is reasonable

## Basic Definitions

> :information_source: **NOTE** There shouldn't be any surprises here really. To comment on the important bits:
> - sizeof(pointer) == 4: We want ILP32.
> - long double is just double: Nobody uses 128-bit long doubles, they have no support in hardware and only end up complicating printf implementations if they are defined to be 128-bit.
> - IEEE-754-1985: This is the version of IEEE-754 the VR4300 aimed to be compliant with, so we stick with it rather than a newer edition where hardware may make support difficult.
> - Bit-fields are msbit-first: This is for compatibility with existing source code written with this bit-field ordering in mind in other ABIs.
> - char with no qualifiers is unsigned: This is for compatibility with older ABIs.
> - size_t is 32-bit.

- **BYTE**: 8 bits
- **HALFWORD**: 16 bits
- **WORD**: 32 bits
- **DOUBLEWORD**: 64 bits
- **FP32**: IEEE-754-1985 single-precision floating-point
- **FP64**: IEEE-754-1985 double-precision floating-point

The byte order is **BIG Endian**.

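As a quick way to see which of the basic definitions above a given compiler and target actually satisfy, here is a minimal sketch of a runtime probe. The function name `u64_abi_violations` and the `U64_BAD_*` flags are hypothetical names chosen for illustration; they are not part of the ABI or of any toolchain.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag names, one per basic definition checked below. */
enum {
    U64_BAD_POINTER     = 1 << 0, /* sizeof(pointer) != 4 */
    U64_BAD_LONG_DOUBLE = 1 << 1, /* long double differs in size from double */
    U64_BAD_CHAR_SIGN   = 1 << 2, /* plain char is signed */
    U64_BAD_SIZE_T      = 1 << 3, /* size_t is not 32-bit */
    U64_BAD_ENDIAN      = 1 << 4  /* byte order is not big-endian */
};

/* Returns a bitmask of the definitions the current target does NOT meet. */
unsigned u64_abi_violations(void)
{
    unsigned bad = 0;
    const unsigned word = 0x01020304u;
    unsigned char first;

    /* Read the lowest-addressed byte of the word. */
    memcpy(&first, &word, 1);

    if (sizeof(void *) != 4)                   bad |= U64_BAD_POINTER;
    if (sizeof(long double) != sizeof(double)) bad |= U64_BAD_LONG_DOUBLE;
    if ((char)-1 < 0)                          bad |= U64_BAD_CHAR_SIGN;
    if (sizeof(size_t) != 4)                   bad |= U64_BAD_SIZE_T;
    if (first != 0x01)                         bad |= U64_BAD_ENDIAN;
    return bad;
}
```

On a conforming U64 target the function would be expected to return 0; on a typical desktop host it reports several mismatches, which makes it handy for spotting host/target assumptions in shared code.
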
### Fundamental Types

| Type     | C                  | sizeof() | alignof() | MIPS                |
| -------- | ------------------ | -------- | --------- | ------------------- |
| Integral | bool / \_Bool      | 1        | 1         | unsigned byte       |
| Integral | char               | 1        | 1         | unsigned byte       |
| Integral | unsigned char      | 1        | 1         | unsigned byte       |
| Integral | signed char        | 1        | 1         | signed byte         |
| Integral | short              | 2        | 2         | signed halfword     |
| Integral | signed short       | 2        | 2         | signed halfword     |
| Integral | unsigned short     | 2        | 2         | unsigned halfword   |
| Integral | int                | 4        | 4         | signed word         |
| Integral | signed int         | 4        | 4         | signed word         |
| Integral | long               | 4        | 4         | signed word         |
| Integral | signed long        | 4        | 4         | signed word         |
| Integral | signed long long   | 8        | 8         | signed doubleword   |
| Integral | enum               | 4        | 4         | signed word         |
| Integral | unsigned int       | 4        | 4         | unsigned word       |
| Integral | unsigned long      | 4        | 4         | unsigned word       |
| Integral | unsigned long long | 8        | 8         | unsigned doubleword |
| Pointer  | *                  | 4        | 4         | unsigned word       |
| Pointer  | (\*)()             | 4        | 4         | unsigned word       |
| Float    | float              | 4        | 4         | fp32                |
| Float    | double             | 8        | 8         | fp64                |
| Float    | long double        | 8        | 8         | fp64                |

**Aggregates (structures and arrays) and Unions**

- Assume the alignment of their most strictly aligned components
- The size of any object is always a multiple of the alignment
- Structures and unions can require padding to meet size and alignment constraints
- The contents of any padding are undefined

**Bit-Fields**

- In a signed bit-field, the most significant bit is the sign bit
- Sign extension occurs when the field is used in an expression
- Bit-Fields are allocated from left to right, or most-significant to least-significant

```c
#include <stdint.h>

struct {
    uint32_t x : 12; // [31:20] most-significant
    uint32_t y : 10; // [19:10]
    uint32_t z : 10; // [ 9: 0] least-significant
};
```

**Standard C Types**

| Standard C Type | Equivalent Basic Type |
| --------------- | --------------------- |
| wchar_t         | signed long           |
| wint_t          | unsigned long         |
| size_t          | unsigned int          |
| ptrdiff_t       | signed int            |

> :information_source: **NOTE** The rationale for a 32-bit size_t is that pointers are 32-bit, so no array can be larger than the 32-bit address space: anything bigger would not be addressable.

## Register Assignment

### Integer Register Convention

| Bit Width | Number   | Name     | Quantity | Description                        |
| --------- | -------- | -------- | -------- | ---------------------------------- |
| 64        | $0       | zero     | 1        | Hard-wired Zero                    |
| 64        | $1       | AT       | 1        | Assembler Temporary                |
| 64        | $2..$3   | av0..av1 | 2        | Argument / Return Value            |
| 64        | $4..$9   | a2..a7   | 6        | Argument                           |
| 64        | $10..$17 | t0..t7   | 8        | Temporary                          |
| 32        | $18..$25 | s0..s7   | 8        | Callee-Saved                       |
| 32        | $26      | k0       | 1        | Kernel Reserved                    |
| 32        | $27      | k1       | 1        | Read-Only Thread Local Storage (TLS) Pointer, otherwise Kernel Reserved |
| 32        | $28      | gp       | 1        | Small Data or GOT Pointer          |
| 32        | $29      | sp       | 1        | Stack Pointer                      |
| 32        | $30      | fp       | 1        | Frame Pointer or Callee-Saved      |
| 32        | $31      | ra       | 1        | ISA-Preferred Return Address       |

Temporary registers can hold doublewords and double-precision floats, but these values may not be transferred to callee-saved registers or otherwise saved by the prologue/epilogue. This way, 64-bit arithmetic and moves can be efficiently performed in local regions of code. To support this, the exception and interrupt handler must be able to save all 64 bits of those registers.

The temporary register **t7** shall be used as a temporary register in prologue/epilogue where required.

The kernel register **k0** shall have no accessibility outside of kernel-specific contexts. Any use of this register outside of a platform-specific kernel context is undefined, as the kernel may choose to modify it without restoring the prior value afterwards.

The kernel register **k1** shall have limited accessibility outside of kernel-specific contexts.
It shall be used to hold the Thread Local Storage (TLS) pointer for the currently executing kernel thread; applications may access this register in a read-only manner. Writes to this register outside of a platform-specific kernel context are undefined, as the kernel may restore the TLS pointer to this register without saving the prior value. If TLS is not implemented in a particular kernel, this register is treated the same as **k0**.

> :information_source: **NOTE** The registers are cleanly partitioned between 32-bit and 64-bit to simplify iterating all registers.

> :information_source: **NOTE** It's generally advantageous to avoid unnecessary stack frame growth from spilling function arguments to the stack, so the GPR allocation follows N32 and EABI in expanding the number of argument registers.
> We also borrow from the PowerPC and RISC-V ABIs in overlapping the argument and return value registers, which occasionally saves register moves in leaf functions or in functions that immediately call another.

> :information_source: **NOTE** It is recommended for kernel implementors to treat both kernel registers **k0** and **k1** as 32-bit at all times for compatibility with the iQue Player's secure mode execution environment. Besides this, how to use **k0** (resp. **k1** if TLS is unimplemented) is left entirely up to kernel implementors.

> :warning: **TODO** The assembler temporary, $AT, is worth keeping since it is meaningfully used in GAS macros. However, it's generally not required to stay reserved at all times. Could there be a well-defined use for it outside of assembler macro expansions?

> :warning: **TODO** Should we increase the number of callee-saved registers? AArch64 has 10, RISC-V has 12, PowerPC has 18(!). Should we bump from 8 to 10 or 12? Keep in mind that for every additional callee-saved reg, we have fewer registers that can hold and manipulate 64-bit values.

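To make the effect of the 32-bit/64-bit partition in the integer register table concrete, here is a minimal sketch that sums how many bytes a context save of a register range would need at the ABI-permitted widths. The names `u64_gpr_width` and `u64_gpr_save_bytes` are hypothetical, and the sketch ignores alignment padding and which registers a real exception handler would actually skip.

```c
/* Hypothetical constants derived from the integer register table:
 * $0..$17 may hold 64-bit values, $18..$31 are 32-bit only. */
enum { U64_NUM_GPRS = 32, U64_FIRST_32BIT_GPR = 18 };

/* Width in bytes that code may keep live in GPR `regno`. */
unsigned u64_gpr_width(unsigned regno)
{
    return regno < U64_FIRST_32BIT_GPR ? 8u : 4u;
}

/* Bytes needed to save GPRs first..last (inclusive) at their ABI widths. */
unsigned u64_gpr_save_bytes(unsigned first, unsigned last)
{
    unsigned total = 0;
    unsigned r;

    for (r = first; r <= last && r < U64_NUM_GPRS; r++)
        total += u64_gpr_width(r);
    return total;
}
```

Under these assumptions, saving $1..$31 takes `u64_gpr_save_bytes(1, 31)` = 17 × 8 + 14 × 4 = 192 bytes, compared with 31 × 8 = 248 bytes if every register were saved as a doubleword. Moving the partition boundary in either direction changes this total by 4 bytes per register.
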
> :warning: **TODO** Should we allow a subset of callee-saved registers to contain 64-bit values? This would increase stack usage again but could reduce the need to allocate 64-bit auto vars on the stack, which may win the space back overall.

### Floating-Point Register Convention

| Bit Width | Number   | Name       | Quantity | Description             |
| --------- | -------- | ---------- | -------- | ----------------------- |
| 64        | $0..$3   | fav0..fav3 | 4        | Argument / Return Value |
| 64        | $4..$7   | fa4..fa7   | 4        | Argument                |
| 64        | $8..$19  | ft0..ft11  | 12       | Temporary               |
| 32        | $20..$31 | fs0..fs11  | 12       | Callee-Saved            |

The floating-point callee-saved registers, like the integer callee-saved registers, are prohibited from storing 64-bit values. Only 32-bit single-precision values may be stored.

The floating-point control and status register (FPCSR) shall have thread storage duration in accordance with C11 section 7.6.

> :warning: **TODO** Should we partition floating-point registers to reserve some for single-precision only? This stands to reduce the cost of saving/restoring the floating-point register set, and we usually expect double-precision arithmetic to be uncommon compared to single-precision.

> :warning: **TODO** Should we follow N32 or O64 w.r.t. floating-point callee-saved registers? N32 has just 6 callee-saved registers ($20, $22, $24, $26, $28, $30) while O64 has 12 (N32 plus the odd-numbered counterparts). Fewer callee-saved registers means a smaller prologue/epilogue, but could mean more stack spilling mid-function. Needs investigation.

> :warning: **TODO** If necessary, we may need to support moving values between 32-bit GPRs and floating-point registers while `SR_FR=1`. This is usually not supported without MIPS32 instructions `mthc1` and `mfhc1` which the VR4300 of course lacks.
> However, this can be efficiently carried out by noting that, at least on the VR4300, `mtc1` does not update the upper 32 bits of the target floating-point register, allowing the use of the following instruction patterns:
> ```mips
> dsll32  AT, src_hi, 0   # AT = src_hi << 32
> dmtc1   AT, dst         # dst[63:32] = src_hi
> mtc1    src_lo, dst     # dst[31:0] = src_lo, upper half preserved
> ```
> ```mips
> dmfc1   AT, src         # AT = all 64 bits of src
> dsra32  dst, AT, 0      # dst = upper 32 bits of src
> ```
>
> A GCC patch implementing these patterns is here for reference
> https://gist.github.com/Thar0/6216fb44dfc769f8a2ef2d07974916d2

## Stack Frame

> :information_source: **NOTE** Here's the part that should really separate this ABI from close relatives like N32.

The primary shortcoming of the N32 ABI is the vastly increased stack frame sizes due to spilling 64-bit values, mostly in the function prologue and epilogue. We prohibit callee-saved registers from holding 64-bit quantities to combat this disadvantage on memory-constrained systems.

The proposed stack frame borrows from the PowerPC EABI:

```
sp+8+P+A+G+F  +----------------+ align = 16  Prev SP is here (higher address)
              | Saved FPRs     | sp+8+P+A+G  size=F (multiple of 4)
              +----------------+ align = 4
              | Saved GPRs     | sp+8+P+A    size=G (multiple of 4)
              +----------------+ align = 4
              | Auto Vars      | sp+8+P      size=A (multiple of 4)
              |                |
              +----------------+ align = any
              | Parameter Area | sp+8        size=P (multiple of 4)
              |                |
              +----------------+ align = 8
              | RA             | sp+4        size=4
              +----------------+ align = 4
              | SP             | sp+0        size=4
              +----------------+ align = 16  Curr SP is here (lower address)
```

> :warning: **TODO** The PowerPC stack frame, for reference:
> ```
>            |                 ...                   |
>            +---------------------------------------+
>            | back chain to caller's caller         |
> old SP --> +---------------------------------------+ 0+4+4+P+A+V+L+G+F
>            | Save area for FP registers (F)        |
>            +---------------------------------------+ 0+4+4+P+A+V+L+G
>            | Save area for GP registers (G)        |
>            +---------------------------------------+ 0+4+4+P+A+V+L
>            | Local variable space (L)              |
>            +---------------------------------------+ 0+4+4+P+A+V
>            | Varargs save area (V)                 |
>            +---------------------------------------+ 0+4+4+P+A
>            | Alloca space (A)                      |
>            +---------------------------------------+ 0+4+4+P
>            | Parameter save area (+padding) (P)    |
>            +---------------------------------------+ 0+4+4
>            | Caller's saved LR                     |
>            +---------------------------------------+ 0+4
>            | Back chain to caller                  |
>     SP --> +---------------------------------------+ 0
> ```

The stack grows downwards, from high addresses to low addresses.

The previous frame's SP is written into the current frame forming a backtrace link, as 32 bits.

The return address is stored next to be accessible at a consistent offset during backtracing, also as 32 bits.

The parameter area for called functions is next, starting at SP+8 which is guaranteed to be 8-byte aligned. This offset is fixed so that called functions know where the area begins.

- Types smaller than 32 bits are promoted to 32-bit.
- 64-bit types remain 64-bit.
- Varargs walks this region like a struct of 32-bit and 64-bit types, with padding inserted to align any 64-bit types that appear.
- Called functions will reach into this region of the caller's stack to retrieve arguments.

Stack space allocated for variables with automatic storage duration follows the parameter area, aligned as appropriate to accommodate the greatest required alignment, and no less than 32 bits. This region is otherwise unstructured; callers cannot assume anything about the stack layout from here on.

Saved GPRs are next, only saving the lower 32 bits of registers that are used by the called function.
64-bit types are prohibited from entering the saved registers; the upper 32 bits of a saved register should hold a copy of bit 31 (the sign extension of the low word) at all times. This region is 4-byte aligned.

> :warning: **TODO** Can we mix 32-bit and 64-bit saved regs in a consistent manner? `TARGET_HARD_REGNO_MODE_OK`/`mips_hard_regno_mode_ok` in GCC should be enough.

Saved FPRs are next, only saving the registers that are used by the called function, as 32-bit single floats. Doubles are prohibited from entering saved registers. This region is 4-byte aligned.

> :warning: **TODO** This represents a big departure from other ABIs: even O32 saves floating-point registers as 64 bits. Do we save them as 32-bit or 64-bit? Unless we can statically determine the type held in the register, we probably have no choice in the matter. We would likely need to ban doubles from entering saved regs at any point if we want to do this.

The whole stack frame is by default aligned to take up a multiple of 16 bytes total, matching the data cache line size. The stack alignment can be overridden by the (to-be-added) option `-mpreferred-stack-boundary`.

For any given non-leaf function, the prologue and epilogue must be unique, but the epilogue may not necessarily be the final bytes of the function. Further, SP must be moved before any new space is used. Installing the frame link may look like:

```mips
move    $t7, $sp        # t7 = caller's SP
subu    $sp, $sp, [N]   # allocate the new frame, [N] = frame size
sw      $t7, ($sp)      # store the back chain at sp+0
```

> :warning: **TODO** Is this instruction sequence safe in the presence of interrupts? If the store goes through but the subtraction does not, the stored value could be clobbered if a signal handler creates a new stack frame. On the other hand, if we advance sp before issuing the store and then try to backtrace, things may go terribly wrong if the backtrace handler doesn't confirm that it isn't between the subtraction and store. PowerPC uses the Store Word With Update (`stwu`) instruction to do this sequence in a restartable manner.
> RISC-V (optionally) uses a frame pointer, albeit in a more sane way than the MIPS frame pointer.

> :warning: **TODO** How to determine if the first function in backtracing is a leaf or not, since leaf functions have no stack frame? Note that for leaf functions, the RA value is still in the register and SP is unchanged from the caller.

> :warning: **TODO** How to implement `alloca()` in a way that's compatible with the backtrace linked list?

## Argument Passing

The first 8 integer arguments are passed in `av0..av1` and `a2..a7`.

The first 8 floating-point arguments are passed in `fav0..fav3` and `fa4..fa7`.

Integer types smaller than 32-bit are promoted to 32-bit, whether they are passed in registers or via the stack.

Aggregate types passed by value that have fewer members than argument registers are passed through registers, e.g. `struct vec3f { float x, y, z; };` is passed via (`fav0`, `fav1`, `fav2`).

Varargs are passed exclusively via the stack, e.g. `(const char *fmt, ...)` passes `fmt` in `av0` and the next arg is at `sp+8`.

64-bit types are always aligned to 64 bits, e.g. passing u32, u64 through varargs reads:

- `4 bytes at sp+8`
- `8 bytes at sp+16`

For varargs, the `float` type is promoted to `double`, as required by the C standard (§6.5.3.3-6).

`va_list` records how far into the caller's stack frame the argument walk has progressed.

> :warning: **TODO** The varargs layout should be fully deterministic, in that the callee must be able to uniquely reconstruct the argument layout of the caller. GCC varargs is often pretty "dumb", it'll spill all arg registers onto the stack even if they aren't used.

> :information_source: **NOTE** When `va_arg` usage cannot be determined statically (e.g. within a loop), the resulting codegen is such that the varargs passed in registers are spilled to the stack and the varargs pointer is set appropriately to consume them.

## Return Values

**av0** contains the first return value, 32-bit or 64-bit.

**av1** contains a second return value, e.g.
when returning a struct of 2 elements by value.

Aggregates that fit in the return value registers are returned in registers.

There are 4 float return regs so that aggregates of up to 4 floats can be returned in registers.

Complex primitive types are returned in 2 regs for the real and imaginary parts respectively.

## Runtime Support Routines

We shall port the `-msave-restore` / `-mno-save-restore` options to MIPS. They replace inline save/restore of registers in the prologue/epilogue with runtime support calls when code is optimized for size or when the option is explicitly enabled.

```mips
FUNC(_restgpr) /* restore gp registers */
AENT(_restgpr_7)    lw      s7, -32(t7)
AENT(_restgpr_6)    lw      s6, -28(t7)
AENT(_restgpr_5)    lw      s5, -24(t7)
AENT(_restgpr_4)    lw      s4, -20(t7)
AENT(_restgpr_3)    lw      s3, -16(t7)
AENT(_restgpr_2)    lw      s2, -12(t7)
AENT(_restgpr_1)    lw      s1, -8(t7)
AENT(_restgpr_0)    jr      ra
                    lw      s0, -4(t7)      /* delay slot */
END(_restgpr)
```

```mips
FUNC(_restgprx) /* restore gp registers for tail call */
AENT(_restgprx_7)   lw      s7, -32(t7)
AENT(_restgprx_6)   lw      s6, -28(t7)
AENT(_restgprx_5)   lw      s5, -24(t7)
AENT(_restgprx_4)   lw      s4, -20(t7)
AENT(_restgprx_3)   lw      s3, -16(t7)
AENT(_restgprx_2)   lw      s2, -12(t7)
AENT(_restgprx_1)   lw      s1, -8(t7)
AENT(_restgprx_0)   lw      ra, 4(t7)
                    lw      s0, -4(t7)
                    jr      ra
                    move    sp, t7          /* delay slot */
END(_restgprx)
```

```mips
FUNC(_savegpr) /* save gp registers */
AENT(_savegpr_7)    sw      s7, -32(t7)
AENT(_savegpr_6)    sw      s6, -28(t7)
AENT(_savegpr_5)    sw      s5, -24(t7)
AENT(_savegpr_4)    sw      s4, -20(t7)
AENT(_savegpr_3)    sw      s3, -16(t7)
AENT(_savegpr_2)    sw      s2, -12(t7)
AENT(_savegpr_1)    sw      s1, -8(t7)
AENT(_savegpr_0)    jr      ra
                    sw      s0, -4(t7)      /* delay slot */
END(_savegpr)
```

```mips
FUNC(_restfpr) /* restore fp registers */
AENT(_restfpr_11)   ldc1    fs11, -96(t7)
AENT(_restfpr_10)   ldc1    fs10, -88(t7)
AENT(_restfpr_9)    ldc1    fs9, -80(t7)
AENT(_restfpr_8)    ldc1    fs8, -72(t7)
AENT(_restfpr_7)    ldc1    fs7, -64(t7)
AENT(_restfpr_6)    ldc1    fs6, -56(t7)
AENT(_restfpr_5)    ldc1    fs5, -48(t7)
AENT(_restfpr_4)    ldc1    fs4, -40(t7)
AENT(_restfpr_3)    ldc1    fs3, -32(t7)
AENT(_restfpr_2)    ldc1    fs2, -24(t7)
AENT(_restfpr_1)    ldc1    fs1, -16(t7)
AENT(_restfpr_0)    jr      ra
                    ldc1    fs0, -8(t7)     /* delay slot */
END(_restfpr)
```

```mips
FUNC(_restfprx) /* restore fp registers for tail call */
AENT(_restfprx_11)  ldc1    fs11, -96(t7)
AENT(_restfprx_10)  ldc1    fs10, -88(t7)
AENT(_restfprx_9)   ldc1    fs9, -80(t7)
AENT(_restfprx_8)   ldc1    fs8, -72(t7)
AENT(_restfprx_7)   ldc1    fs7, -64(t7)
AENT(_restfprx_6)   ldc1    fs6, -56(t7)
AENT(_restfprx_5)   ldc1    fs5, -48(t7)
AENT(_restfprx_4)   ldc1    fs4, -40(t7)
AENT(_restfprx_3)   ldc1    fs3, -32(t7)
AENT(_restfprx_2)   ldc1    fs2, -24(t7)
AENT(_restfprx_1)   ldc1    fs1, -16(t7)
AENT(_restfprx_0)   lw      ra, 4(t7)
                    ldc1    fs0, -8(t7)
                    jr      ra
                    move    sp, t7          /* delay slot */
END(_restfprx)
```

```mips
FUNC(_savefpr) /* save fp registers */
AENT(_savefpr_11)   sdc1    fs11, -96(t7)
AENT(_savefpr_10)   sdc1    fs10, -88(t7)
AENT(_savefpr_9)    sdc1    fs9, -80(t7)
AENT(_savefpr_8)    sdc1    fs8, -72(t7)
AENT(_savefpr_7)    sdc1    fs7, -64(t7)
AENT(_savefpr_6)    sdc1    fs6, -56(t7)
AENT(_savefpr_5)    sdc1    fs5, -48(t7)
AENT(_savefpr_4)    sdc1    fs4, -40(t7)
AENT(_savefpr_3)    sdc1    fs3, -32(t7)
AENT(_savefpr_2)    sdc1    fs2, -24(t7)
AENT(_savefpr_1)    sdc1    fs1, -16(t7)
AENT(_savefpr_0)    jr      ra
                    sdc1    fs0, -8(t7)     /* delay slot */
END(_savefpr)
```

> :information_source: **NOTE** To support these runtime functions, the register allocator must always allocate saved registers from lowest to highest. t7 is the designated prologue/epilogue hard register for temporary variable storage.

## Small Data

For applications using small data, the **gp** register shall at all times point to the start of the small data section, offset by 0x8000 bytes.

Small data is incompatible with programs that use a Global Offset Table, where **gp** is repurposed for GOT accesses.

When small data is used, 4-byte and 8-byte read-only literals shall be placed in mergeable sections named `.lit4` and `.lit8` respectively.

> :information_source: **NOTE**: Small data merging is supposed to already be the case for other ABIs but this behavior has been broken in GCC for a long time, ever since moving to constant pooling.
> Reference patch fixing this:
> https://gist.github.com/Thar0/3203db6e8203f45b308774a0a510ceb5

## ELF Header

- `e_ident[EI_CLASS]` shall be `ELFCLASS32`.
- `e_ident[EI_DATA]` shall be `ELFDATA2MSB`.
- `e_machine` shall be `EM_MIPS`.
- `e_flags` shall NOT have `EF_MIPS_ABI2` set.
- `e_flags` `EF_MIPS_ABI` shall be 5, the next free value at the time of writing.

There shall be a dummy section `.mdebug.abiU64` to inform gdb of the ABI choice.

There shall be a `.gcc_compiled_long32` section to inform gdb of `sizeof(long)`.

The linker should report errors if mixing U64 with any other ABI.

### Relocations

U64, like New ABIs and unlike Old ABIs or the EABI, shall use the `.rela` relocation format to store relocations.

> :warning: **TODO** `.rela` is a nicer format for offline processing but makes the ELF file larger than `.rel` which may be worse for online processing such as dynamic linking.

> :warning: **TODO** Are there opportunities outside of PIC for linker relaxation? Linker relaxation can take the longest instruction sequence for an operation and shrink it at link-time. You need an alignment relocation to fix up strict alignment of later code since it may become misaligned post-relaxation.

### Thread-Local Storage

Should follow the ELF Thread Local Storage Model:

- .tdata, .tbss sections
- `PT_TLS` program header(s) for the TLS initialization images

## References

- ISO/IEC 9899:2024 Programming Languages - C
  - https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf
- ISO/IEC JTC1 SC22 WG21 N4860 Programming Languages - C++
  - https://isocpp.org/files/papers/N4860.pdf
- SYSTEM V APPLICATION BINARY INTERFACE MIPS RISC Processor Supplement, 3rd Edition
  - https://refspecs.linuxfoundation.org/elf/mipsabi.pdf
- MIPS O64 Application Binary Interface for GCC
  - https://gcc.gnu.org/projects/mipso64-abi.html
- MIPSpro:tm: N32 ABI Handbook
  - https://techpubs.jurassic.nl/library/manuals/2000/007-2816-004/sgi_html/index.html
  - https://math-atlas.sourceforge.net/devel/assembly/007-2816-005.pdf
- mips eabi documentation... From: Eric Christopher <echristo at redhat dot com>
  - https://sourceware.org/legacy-ml/binutils/2003-06/msg00436.html
- Support building for multiple MIPS ABIs: o32 (default), n32 and eabi32 #135 `(The PR description has a concise comparison between O32, N32 and EABI32)`
  - https://github.com/HackerN64/HackerOoT/pull/135
- MIPS Calling Convention: ECE314 Spring 2006 `(Primarily O32)`
  - https://courses.cs.washington.edu/courses/cse410/09sp/examples/MIPSCallingConventionsSummary.pdf
- MIPS Backtracing
  - https://github.com/DragonMinded/libdragon/blob/46a5958fbedde02b0b26dfd50bd61a9a9807e0d3/src/backtrace.c
  - https://www.yumpu.com/en/document/read/49433643/intricacies-of-a-mips-backtrace-implementation-david-vomlehn
  - https://smeso.it/2024/03/02/mips-stacktrace-an-unexpected-journey.html
- PowerPC ABI
  - https://www.nxp.com/docs/en/application-note/PPCEABI.pdf
  - https://www.nxp.com/docs/en/reference-manual/E500ABIUG.pdf
  - http://refspecs.linux-foundation.org/elf/elfspec_ppc.pdf
  - https://www.ibm.com/docs/en/aix/7.1.0?topic=epilogs-updating-stack-pointer
  - https://class.ece.iastate.edu/arun/Cpre381_Sp06/lab/labw12a/eabi_app.pdf
- RISC-V ABI
  - https://lists.riscv.org/g/tech-psabi/attachment/61/0/riscv-abi.pdf
  - https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf
- rv64ilp32: Running ILP32 on RV64 ISA
  - https://lwn.net/Articles/951187/
  - https://lwn.net/Articles/932290/
  - https://lpc.events/event/17/contributions/1475/attachments/1186/2442/rv64ilp32_%20Run%20ILP32%20on%20RV64%20ISA.pdf
  - https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/381
  - https://gcc.gnu.org/pipermail/gcc-patches/2024-November/669046.html
- ELF Handling For Thread-Local Storage
  - https://www.akkadia.org/drepper/tls.pdf