# Assembly & GNU C
###### tags: `ASM` `ABI` `ARM` `x86` `Neon` `SSE` `AVX` `CUDA` `SIMD` `SIMT` `GPU` `Parallel Computing` `Hardware` `Embedded System` `Kernel` `Operating System` `GNU` `Linux` `C` `C++` `Compiler` `GDB`
Notes & Cheat Sheet for assembly language & low-level C, especially on amd64 (x86_64) platforms & GNU/Linux operating system environments.
> *"As programmers, our usual instinct is to write data structures that are as general-purpose and reusable as possible. This is not a bad thing, but it can hinder us if we turn it into a goal right out of the gate."* ---- [The World's Simplest Lock-Free Hash Table - Jeff Preshing](https://preshing.com/20130605/the-worlds-simplest-lock-free-hash-table/)
:::warning
**Why Assembly Matters?**
For example:
```Cpp=
union Data {
    u_int64_t u64 = 0x12'34'56'78'90'ab'cd'ef;
    u_int32_t u32;
} data;
u_int32_t a = data.u64;
u_int32_t b = data.u32;
u_int32_t c = *((u_int32_t *)&data);
u_int32_t d = *((u_int32_t *)&data + 1);
```
> If in little endian (of most of modern systems), memory layout looks like:
> ```Cpp!
> 0xef 0xcd 0xab 0x90 0x78 0x56 0x34 0x12
> ```
> ```Cpp!
> a == 0x90'ab'cd'ef;
> b == 0x90'ab'cd'ef;
> c == 0x90'ab'cd'ef;
> d == 0x12'34'56'78;
> ```
> If in big endian (of low-level networking systems), memory layout looks like:
> ```Cpp!
> 0x12 0x34 0x56 0x78 0x90 0xab 0xcd 0xef
> ```
> ```Cpp!
> a == 0x90'ab'cd'ef; // value truncation, independent of endianness
> b == 0x12'34'56'78;
> c == 0x12'34'56'78;
> d == 0x90'ab'cd'ef;
> ```
Also, should we consider little-/big-endian memory layout when writing literal values in high-level programming languages, such as:
```Cpp!
// for 0x1234
0x34'12;
0x34'12'00'00;
0x12'34;
0x00'00'12'34;
// for 0x12345678
0x78'56'34'12;
0x12'34'56'78;
```
> No! These are values; endianness only affects the in-memory byte order, never the value itself.
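A quick runtime check (a minimal sketch) makes the distinction concrete: endianness only shows up when you reinterpret the *memory* behind a value, never in the value itself.
```C=
#include <stdio.h>
#include <stdint.h>
int main(void) {
    uint32_t x = 0x12345678;
    uint8_t *p = (uint8_t *)&x; // inspect the in-memory byte order
    // little endian: 78 56 34 12 / big endian: 12 34 56 78
    printf("first byte in memory: 0x%02x -> %s endian\n",
           p[0], p[0] == 0x78 ? "little" : "big");
    return 0;
}
```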
:::
> [!Note] **Related articles**
> - [Linux Kernel Debugging - shibarashinu](https://hackmd.io/@shibarashinu/ryyKT2wZR)
> - [Compiler: The Program, the Language, & the Computer Work - shibarashinu](https://hackmd.io/@shibarashinu/SyEHz-JHC)
## x86 Assembly Cheat Sheet
Resources:
- [[List] x86 Assembly - Wikibooks](https://en.wikibooks.org/wiki/X86_Assembly)
- [[List] x86 Wiki - Stackoverflow](https://stackoverflow.com/tags/x86/info)
- [[Book] Intel 80386 Reference Programmer's Manual - MIT](https://pdos.csail.mit.edu/6.828/2004/readings/i386/toc.htm)
- [[Book] 組合語言 - 陳文進](https://www.csie.ntu.edu.tw/~wcchen/asm98/)
- [[Specs] Complete 8086 instruction set - Gabriele Cecchetti](http://www.gabrielececchetti.it/Teaching/CalcolatoriElettronici/Docs/i8086_instruction_set.pdf)
- [[Specs] Technical Resources: Intel® Core™ Ultra and Intel® Core™ Processors - Intel](https://www.intel.com/content/www/us/en/products/docs/processors/core/core-technical-resources.html)
- [[Specs] Intel X86 Instruction Format for Different Modes - Intel](https://uglyduck.vajn.icu/PDF/Intel/OPx86.pdf)
### Assembly Syntax
:::info
**Intel vs. AT&T/UNIX**

[GCC-Inline-Assembly-HOWTO](https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html)
> GNU assembler (`as` & `gcc`) use *AT&T* syntax by default.
:::
### x86 Registers
General-Purpose Registers:
```arm!
rax ; ALU accumulator, syscall number
; e.g., add eax, 3
rbx ; base offset
; e.g., mov eax, [ebx+4]
rcx ; loop counter
; e.g., loop <label> ; check & decrement rcx
rdx ; data manipulation
; e.g., div ecx ; eax := (64bit) edx:eax / (32bit) ecx (quotient), edx := remainder
r8 ~ r11 ; scratch register (temporary)
r12 ~ r15 ; preserved (callee-saved) registers (must be restored before returning if clobbered)
rsi, rdi ; src/dst data copies (e.g., string addr, func args)
; e.g., mov esi, <src>
; mov edi, <dst>
; mov ecx, <len>
; rep movsb ; copy ecx bytes from [esi] to [edi]
rbp, rsp ; stack range: rbp (high) ~ rsp (low)
; e.g., func_prologue:
; call <func> ; same as push rip & jmp <func>
; push rbp
; mov rbp, rsp
; sub rsp, 0x10
; func_epilogue:
; mov rsp, rbp
; pop rbp
; ret ; same as pop rip
```
:::warning
**Register Size / Range Comparison**
| | `ax`,`bx`,`cx`,`dx` | `sp`,`bp` | `si`,`di` |
|:--------------:|:-------------------:|:---------:|:---------:|
| `\|87654321\|` | `rax` | `rsp` | `rsi` |
| `\|----4321\|` | `eax` | `esp` | `esi` |
| `\|------21\|` | `ax` | `sp` | `si` |
| `\|------2-\|` | `ah` | x | x |
| `\|-------1\|` | `al` | `spl` | `sil` |
:::
> 
> (Source: [Registers in x86 Assembly - University of Alaska Fairbanks CS](https://www.cs.uaf.edu/2017/fall/cs301/lecture/09_11_registers.html))
Special Registers:
```arm!
rip ; same as $pc
eflags ; flags for current CPU state
```
Segment Registers:
```arm!
SS ; Stack segment
CS ; Code segment (change on ljmp/ret/int insns)
DS ; Data segment
ES ; Extra data segment
FS, GS ; General purpose segment (extra data segment)
```
:::info
**Flat Memory Model**
Most applications on most modern operating systems (like FreeBSD, Linux, or Windows) use a memory model that points nearly all segment registers to the same place & relies on paging instead (all addresses are page-based), effectively leaving these fixed-purpose segment registers unused (:arrow_right: see [Memory Segmentation](#Memory-Segmentation) section).
But `FS` & `GS` are exceptions, instead being used to point at thread-specific data.
> 
> (Source: [Modes of Memory Addressing on x86 - c-jump](https://www.c-jump.com/CIS77/ASM/Memory/))
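A tiny illustration of that exception (a minimal sketch): thread-local variables in C are the usual way this shows up ---- on x86-64 Linux each thread's copy is typically addressed relative to the `FS` base (the exact relocation/codegen is up to the toolchain).
```C=
#include <pthread.h>
#include <stdio.h>
__thread int tls_counter = 0; // per-thread copy, typically addressed as %fs:offset on x86-64 Linux
static void *worker(void *arg) {
    tls_counter += (int)(long)arg;   // touches only this thread's copy
    printf("thread sees tls_counter = %d\n", tls_counter);
    return NULL;
}
int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_create(&t2, NULL, worker, (void *)2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("main sees tls_counter = %d\n", tls_counter); // still 0
    return 0;
}
```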
:::
Flags:

[What is the purpose of the Parity Flag on a CPU? - StackOverflow](https://stackoverflow.com/questions/25707130/what-is-the-purpose-of-the-parity-flag-on-a-cpu)
### x86 Instructions
```arm!
jo ; jmp if previous instruction overflowed
test eax, eax ; logical AND that only sets flags (ZF/SF/PF) for a following jmp, ...
cmp eax, 0 ; same effect as the above (encoded as a sub whose result is discarded):
; for cmp a, b (unsigned): a == b: zero_flag <= 1, carry_flag <= 0
; a < b: zero_flag <= 0, carry_flag <= 1
; a > b: zero_flag <= 0, carry_flag <= 0
```
### GCC Inline Assembly
```arm!
// simple inline asm function
int o0, i0, i1;
asm volatile goto(
".intel_syntax noprefix \n\t"
".label_asm%=: \n\t" // %= auto unique id
"mov eax, %[a] \n\t" // or %1
"add eax, %2 \n\t"
"mov ebx, eax \n\t"
"mov %[o], ebx \n\t" // or %0
"jc %l[stop] \n\t" // or %l3
: // output operands
[o] "=r"(o0)
: // input operands
[a] "r"(i0),
"r"(i1)
: // clobbered components
"eax", "ebx"
: // label references (asm goto(...); combining them with output operands requires GCC >= 11)
stop
);
stop:
// Input/Output operands
a ; rax, eax, ...
b ; rbx, ebx, ...
c ; rcx, ecx, ...
d ; rdx, edx, ...
S ; rsi, esi, ...
D ; rdi, edi, ...
q ; byte-wise registers (e.g., al, bl, cl, dl, ...)
r ; general-purpose registers
m ; memory-backed operands
g ; general operands (can be register or memory)
i ; immediate constants
n ; integer numbers
f ; legacy floating-point registers (FPU: x87)
x ; vector registers (SIMD: SSE/AVX)
0 ; use same set of %0 operand (can be others)
// Output operand modifiers
= ; written
+ ; read/written
& ; early-clobbered (this output may be written before all inputs are read)
// Clobbers (mark the components dirty)
cc ; may change CPU flags
memory ; may change memory (force writing back to RAM e.g., for mb())
redzone ; may change stack red zone (memory below rsp) (only available on some architectures)
```
[How to Use Inline Assembly Language in C Code - GNU GCC](https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html)
:::warning
**Red Zone (AMD64 ABI for Example)**
An optimization for leaf function calls ---- the callee may use the 128 bytes below `rsp` (the red zone) without adjusting the stack pointer.
> Very useful for programs that frequently do small leaf function calls.
> [!Warning] **But Overall, This Is More likely an ABI Convention**
> The OS guarantees no signal/interrupt handlers will clobber that memory area (interrupt handlers typically run on a kernel-space stack anyway!). And the compiler makes sure the program has sufficient stack memory to use.


[Stack frame layout on x86-64 - Eli Bendersky](https://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64.html)
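A hedged C-level illustration (whether the compiler actually uses the red zone depends on target & optimization level): a leaf function whose locals fit within 128 bytes may address them below `rsp` without the usual `sub rsp, ...` / `add rsp, ...` pair.
```C=
// Leaf function: calls nothing & its locals are small, so under the
// SysV AMD64 ABI the compiler is allowed to keep `tmp` in the 128-byte
// red zone below rsp instead of adjusting the stack pointer.
static long sum4_scaled(const long *p) {
    long tmp[4];
    for (int i = 0; i < 4; ++i)
        tmp[i] = p[i] * 2;
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```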
:::
More examples:
```C=
asm("nop");
```
```Cpp=
// with C++ specific R"(...)" raw string
asm volatile(R"(
.intel_syntax noprefix;
push ebp
mov ebp, esp
sub esp, 0x3C
push ebx
push esi
mov eax, dword ptr [ebp + 8]
)");
```
```C=
float f = 3.14f;
__asm__ ("fld %0" :: "f"(f)); // load into x87 FPU stack-based register
```
> **Note:** In practice, for floating-point operations, prefer intrinsics that translate into SIMD instructions.
```C=
// syscall exit
unsigned char exit_code = 123;
__asm__ volatile(
"movq $60, %%rax \n\t" // syscall number: exit
"movq %0, %%rdi \n\t" // exit code argument
"syscall \n\t"
:
: "r"((long) exit_code) // 64-bit input operand
: "%rax", "%rdi"
);
```
```C=
// goto outside label
asm goto (
"btl %1, %0\n\t"
"jc %l2"
: /* No outputs. */
: "r" (p1), "r" (p2)
: "cc"
: carry);
return 0;
carry:
return 1;
```
```C=
// print CPU vendor string
int res[4] = {0};
asm ("movq $0, %%rax;"
"cpuid;"
: "=b" (res[0]), "=d" (res[1]), "=c" (res[2])
:: "eax");
printf("Vendor string: %s\n", (char*)(&res));
```
```C=
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
int atomic_compare_and_swap(int *ptr, int old, int new) {
    unsigned char is_swap;
    __asm__ __volatile__ (
        "lock cmpxchgl %3, %1\n" // atomic if (*ptr == eax) *ptr := new
                                 //        else eax := *ptr
        "setz %0"                // is_swap := ZF (zero flag)
        : "=q" (is_swap),        // =: used-out (writeonly)
          "+m" (*ptr),           // +: used-in/used-out (mutated)
          "+a" (old)             // +: used-in/used-out (possibly mutated)
        : "r" (new)              // : used-in (readonly)
        : "memory", "cc"         // clobbered: memory & condition flags
    );
    if (!!is_swap)
        return *ptr;
    else
        return old;
}

#define __raw_cmpxchg(ptr, old, new, size, lock)        \
({                                                      \
    __typeof__(*(ptr)) __ret;                           \
    __typeof__(*(ptr)) __old = (old);                   \
    __typeof__(*(ptr)) __new = (new);                   \
                                                        \
    volatile u32 *__ptr = (volatile u32 *)(ptr);        \
    asm volatile(lock "cmpxchgl %2,%1"                  \
                 : "=a" (__ret), "+m" (*__ptr)          \
                 : "r" (__new), "0" (__old)             \
                 : "memory");                           \
                                                        \
    __ret;                                              \
})
```
- [What happens in the assembly output when we add "cc" to clobber list - StackOverflow](https://stackoverflow.com/questions/59656857/what-happens-in-the-assembly-output-when-we-add-cc-to-clobber-list): `cc` on x86 is always assumed being clobbered.
:::warning
**Volatile Qualifier**
Without the `volatile` qualifier, the optimizers might assume that the asm block will always return the same value & may optimize away the following example's 2nd call.
> Use `__volatile__` for memory fences, I/O ports, or instructions with side effects (they must not be optimized away or reordered!).
```c=
#include <stdio.h>
#include <stdint.h>
int main() {
    uint64_t msr_timestamp;
    asm volatile (
        ".intel_syntax noprefix\n"
        "rdtsc\n"            // timestamp stored at edx:eax
        "shl rdx, 32\n"      // shift the upper half (rdx) left by 32
        "or %0, rdx\n"       // rax := rdx | rax (writing eax already zeroed the upper half)
        ".att_syntax \n"
        : "=a" (msr_timestamp)
        :
        : "rdx"
    );
    printf("msr_timestamp: %lx\n", msr_timestamp);
    // the same code ...
    asm volatile (
        ...
    );
    printf("msr_timestamp: %lx\n", msr_timestamp);
    return 0;
}
```
:::
#### Linux Kernel Inline Assembly
- [/arch/x86/include/asm/uaccess.h](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/uaccess.h#L358)
```c=
// copy_from_user()
#define __get_user_asm(x, addr, err, itype) \
asm volatile("\n" \
"1: mov"itype" %[umem],%[output]\n" \
"2:\n" \
_ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG | \
EX_FLAG_CLEAR_AX, \
%[errout]) \
: [errout] "=r" (err), \
[output] "=a" (x) \
: [umem] "m" (__m(addr)), \
"0" (err))
```
- [/arch/x86/include/asm/asm.h](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/asm.h#L204)
```c=
// define hardware exception handling tables that help in safely handling
// errors or exceptions that may occur during execution.
#define _ASM_EXTABLE_TYPE_REG(from, to, type, reg) \
" .pushsection \"__ex_table\",\"a\"\n" \
" .balign 4\n" \
" .long (" #from ") - .\n" \
" .long (" #to ") - .\n" \
DEFINE_EXTABLE_TYPE_REG \
"extable_type_reg reg=" __stringify(reg) ", type=" __stringify(type) " \n"\
UNDEFINE_EXTABLE_TYPE_REG \
" .popsection\n"
```
```c=
#define DEFINE_EXTABLE_TYPE_REG \
".macro extable_type_reg type:req reg:req\n" \
".set .Lfound, 0\n" \
".set .Lregnr, 0\n" \
".irp rs,rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13,r14,r15\n" \
".ifc \\reg, %%\\rs\n" \
".set .Lfound, .Lfound+1\n" \
".long \\type + (.Lregnr << 8)\n" \
".endif\n" \
".set .Lregnr, .Lregnr+1\n" \
".endr\n" \
".set .Lregnr, 0\n" \
".irp rs,eax,ecx,edx,ebx,esp,ebp,esi,edi,r8d,r9d,r10d,r11d,r12d,r13d,r14d,r15d\n" \
".ifc \\reg, %%\\rs\n" \
".set .Lfound, .Lfound+1\n" \
".long \\type + (.Lregnr << 8)\n" \
".endif\n" \
".set .Lregnr, .Lregnr+1\n" \
".endr\n" \
".if (.Lfound != 1)\n" \
".error \"extable_type_reg: bad register argument\"\n" \
".endif\n" \
".endm\n"
```
```c=
#define UNDEFINE_EXTABLE_TYPE_REG \
".purgem extable_type_reg\n"
```
More:
- [Kernel level exception handling in Linux - Linux Kernel Docs](https://www.kernel.org/doc/Documentation/x86/exception-tables.txt)
## Assembly-Level Optimization
*Machine-Dependent Optimization*
### .bss Section vs. .data Section
*GCC puts zero-initialized variables in `.bss` section by default.*
How to fine-tune this behavior?
```cpp=
char bss_buf0[1024]; // default, stored in .bss section.
// at runtime, ELF loader maps it to the zeroed out segment.
char bss_buf1[1024] = {0}; // explicitly set to 0s,
// same stored in .bss section.
// (not specify in standard, compiler-specific action.)
char data_buf0[1024] = {1}; // has initialized data { 1, 0, 0, ... },
// stored in .data section.
// at runtime, ELF loader loads it into the RW .data segment.
__attribute__((section(".data")))
char data_buf1[1024] = {0}; // designated to store all 0s in .data section.
```
Or:
```sh!
gcc -fno-zero-initialized-in-bss ...
```
:::warning
**[C++ Object Initialization] .bss Section or .data Section? Zeroed or Garbage Values for Uninitialized Class Objects?**
*C++ supports constructors & default values in structs; where & how are they initialized?*
> - **C++ Trivial Classes**
> No runtime constructor or virtual-function logic of its own or of its (non-static) members, recursively (in other words, like a C struct) ---- `struct A { A() = default; };` or no constructor specified.
> - **Initialized Objects**
> Objects have explicit assignment logic (e.g., `A a = A();`).
- **Global Objects (Static/Thread-Local Storage Duration)**
- **[compile-time]**
1. generate runtime constructor (for "non-trivial class").
2. write data to `.data` section (only for "trivial class with initialized members").
3. adjust `bss_start` & `bss_end` (for "non-trivial class" or "trivial class without initialized members").
:::info
**`.bss` Section**
At compile time, the linker records the start & size of this "virtual space" & resolves all variable addresses in `.bss`.
At runtime, Linux kernel allocates brand new zeroed pages (demand paging) for `.bss` on `exec`'s ELF loader.
:::
- **[runtime]**
1. OS allocates zeroed pages for `.bss` (if any).
2. CRT calls the constructors at startup (for "non-trivial class").
> [!Note]
> "Initialized objects" generate identical code to "uninitialized objects" in global scope.
:::info
**How Global Non-trivial Class Constructors are Called at Runtime?**
The entries of these static objects' constructors are listed in the `.init_array` section (addr: `DT_INIT_ARRAY`, size: `DT_INIT_ARRAYSZ`), which points to name-mangled constructors in the `.text` section. They are brought up after the entry point `_start()` & before entering `main()`: `call_init()` (invoked by `__libc_start_main` from `libc.so`'s `csu/libc-start.c`) -> constructors (invoked by `__static_initialization_and_destruction_0()` in the `.text` section):
- `csu/libc-start.c`
```Cpp=
// call_init()
ElfW(Dyn) *init_array = l->l_info[DT_INIT_ARRAY];
if (init_array != NULL)
{
unsigned int jm
= l->l_info[DT_INIT_ARRAYSZ]->d_un.d_val / sizeof (ElfW(Addr));
ElfW(Addr) *addrs = (void *) (init_array->d_un.d_ptr + l->l_addr);
for (unsigned int j = 0; j < jm; ++j)
((dl_init_t) addrs[j]) (argc, argv, env);
/*
* call:
* 1. register_tm_clones():
* init transaction memory.
* 2. __static_initialization_and_destruction_0():
* init global constructors.
* 3. custom constructors (with __attribute__((constructor))):
* call custom functions (see below code).
* ...
*/
}
```
```Cpp=
// This function addr will also be registered in .init_array section.
__attribute__((constructor(1234)))
void init1234() {
printf("constructor priority: 1234\n");
}
```
More:
- [.init, .ctors, and .init_array - MaskRay](https://maskray.me/blog/2021-11-07-init-ctors-init-array)
- [Analyzing The Simplest C++ Program - Ray Zhang](https://oneraynyday.github.io/dev/2020/05/03/Analyzing-The-Simplest-C++-Program/)
- [Initialization and Termination Routines - Oracle Docs](https://docs.oracle.com/cd/E19683-01/816-1386/6m7qcobks/index.html)
- [What is transactional memory? - StackOverflow](https://stackoverflow.com/questions/11255640/what-is-transactional-memory): Runtime provides an atomic memory-based transaction mechanism that multithreaded program can access a critical section without using locking (accepted/rejected either by hardware/software supervisor, e.g., CPU's r/w cache watcher, bookkeeping logs). This is still a C++ experimental feature.
```Cpp=
int transaction_func()
{
static int i = 0;
synchronized { // begin synchronized block
// critical section
++i;
return i;
} // end synchronized block
}
```
- [Transactional memory (TM TS) - cppreference](https://en.cppreference.com/w/cpp/language/transactional_memory.html)
:::
- **Stack (Automatic Storage Duration)**
- **[compile-time]**
1. generate runtime constructor (for "non-trivial class").
- **[runtime]**
1. reserve space on stack segment (by adjusting `rsp`).
2. set initial values (only for "trivial class with initialized members").
3. call the constructors (for "non-trivial class").
> [!Warning]
> "Objects with uninitialized members" are garbage values (only stack pointer `rsp` has been moved).
> [!Note]
> "Initialized objects" generate identical code to "uninitialized objects" on stack.
:::info
**Classes with Initialized Members May Use `memset` to Zero Out**
*Either for "trivial object" set in function or "non-trivial object" set in constructor*
Some of the `memset` intrinsics implementations in assembly:
First, zero out memory with x86 memory-stream-based instruction (maybe vectorized at microcode level internally):
```arm!
rep stos QWORD PTR es:[rdi],rax
; rep stosq: memset
; store rax value into [rdi] rcx times.
; (ES segment not used in modern flat memory model)
;
; rep movsq: memcpy
; copy from [rsi] into [rdi].
```
or with SIMD instruction (compile with `-mavx`, `-march=native`):
```arm!
vxorps ymm0, ymm0, ymm0
vmovdqu YMMWORD PTR [rdi], ymm0
vmovdqu YMMWORD PTR [rdi+32], ymm0
...
```
then other members with initial values are loaded into memory.
:::
- **Heap (Dynamic Storage Duration)**
- **[compile-time]**
1. generate runtime constructor (for "non-trivial class" or ++"trivial class with initialized members" for setting initial values++).
:::danger
Trivial class with constructor??
:::
- **[runtime]**
1. reserve space on heap segment (by `malloc`).
2. `memset` to 0 (only for "initialized trivial object (i.e., `new Trivial()`) with initialized members").
3. call the constructor (if any).
> [!Warning]
> "initialized trivial object (i.e., `new Trivial()`) with initialized members" may cause the waste of the 1st `memset` to 0 (then call the constructor for setting initial values).
> [!Note]
> As the result, "non-trivial objects" are always "`malloc` + constructor" no matter if it is initialized, or if with initialized members.
:::info
**`new` Itself Doesn't Zero Out the Memory**
Because fundamentally, `new` utilizes libc's `malloc` function call for allocating heap space.
> `malloc` (implemented by `ptmalloc`, `tcmalloc`, `jemalloc`, ... memory management system) may utilize multiple free lists as multi-level bins (e.g., fast, small, large, ...) for user-space memory recycles, each has different operating strategies, along with syscall `brk` + `mmap` to request extra virtual memory space.
> - [Is malloc good enough? - daily_coding_professor](https://www.youtube.com/watch?v=S3o2hB4s4B4)
> - [ptmalloc、tcmalloc与jemalloc对比分析 - cyningsun](https://www.cyningsun.com/07-07-2018/memory-allocator-contrasts.html)
:::
:::
### Tagged Pointers
*Embedding extra data in the memory pointers.*
[Storing data in pointers - Muxup](https://muxup.com/2023q4/storing-data-in-pointers)
Can be used as an extra marker / data storage (see the sketch after this list) of:
- **for GC:** e.g., ref_cnt, version numbers, mark & sweep, ...
- **for synchronization:** e.g., atomic token id, ref_cnt, ...
> - [lock-free atomic] ABA problem with CAS: [簡化概念下的lock-free 編程 - 知乎](https://zhuanlan.zhihu.com/p/53012280)
- **inline caching:** compact, cache-friendly data structures.
- [PointerIntPair - LLVM](https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/ADT/PointerIntPair.h#L64)
- [value representation in javascript implementations - wingolog](https://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations): use the pointer as an integer, without actual GC operations.
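A minimal sketch of the idea (the helper names here are hypothetical, not from the articles above): a 16-byte-aligned pointer always has 4 zero low bits, which can carry a small tag.
```C=
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#define TAG_MASK 0xFULL /* 16-byte alignment -> 4 spare low bits */
// Pack a small tag into the unused low bits of an aligned pointer.
static inline void *ptr_pack(void *p, unsigned tag) {
    assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= TAG_MASK);
    return (void *)((uintptr_t)p | tag);
}
static inline void *ptr_addr(void *t) { return (void *)((uintptr_t)t & ~TAG_MASK); }
static inline unsigned ptr_tag(void *t) { return (unsigned)((uintptr_t)t & TAG_MASK); }
int main(void) {
    void *p = aligned_alloc(16, 64);   // guarantees 4 zero low bits
    void *t = ptr_pack(p, 0x3);        // e.g., a tiny ref count / type id
    printf("addr=%p tag=%u\n", ptr_addr(t), ptr_tag(t));
    free(ptr_addr(t));
    return 0;
}
```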
### Alignments (Basics)
- data struct alignment (see the sketch below)
- stack alignment
[esp & -0x10 (0xFF'FF'FF'F0) - StackOverflow](https://stackoverflow.com/questions/4175281/what-does-it-mean-to-align-the-stack)
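A minimal C11 sketch of the data-structure side: natural member alignment introduces padding inside a struct, & `alignas` forces stricter (e.g., cacheline-sized) alignment.
```C=
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>
struct Padded {      // natural alignment inserts padding
    char c;          // offset 0 (followed by 7 padding bytes on typical 64-bit ABIs)
    double d;        // offset 8
};
struct CacheAligned {
    alignas(64) char buf[48]; // force cacheline-sized alignment
};
int main(void) {
    printf("sizeof(struct Padded) = %zu, offsetof(d) = %zu\n",
           sizeof(struct Padded), offsetof(struct Padded, d));
    printf("alignof(struct CacheAligned) = %zu\n", alignof(struct CacheAligned));
    return 0;
}
```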
### Alignments (Advanced)
- RAM->cacheline spatial affinity / simd ops:
[Designing data for the CPU - daily_coding_professor](https://www.youtube.com/watch?v=aRhOHZAT0AU)
- array of struct vs. struct of array
```cpp=
// array of struct
struct Object {
vec3 position, velocity;
vec4 color;
int id;
char name[32];
} objects[10000];
```
```cpp=
// struct of array
template<int N>
struct Object {
vec3 position[N], velocity[N];
vec4 color[N];
int id[N];
char name[32][N];
};
Object<10000> objects;
```
- false sharing: multiple CPU cores process different data on the same cacheline
[Your CPU Is Fighting Itself - daily_coding_professor](https://www.youtube.com/watch?v=96L1X6GwQ64)
- linux kernel tcp optimization

- `std::hardware_destructive_interference_size`

[std::hardware_destructive_interference_size, std::hardware_constructive_interference_size - cppreference](https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size.html)
```cpp=
struct keep_apart
{
alignas(std::hardware_destructive_interference_size) std::atomic<int> cat;
alignas(std::hardware_destructive_interference_size) std::atomic<int> dog;
};
```
### Zero Idioms
> 
> (Source: [Fast and Beautiful Assembly - Kay Lack](https://www.youtube.com/watch?v=ON9vuzLiGuc))
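In short, a zero idiom is a pattern like `xor eax, eax` or `vxorps ymm0, ymm0, ymm0` that the CPU recognizes as always producing zero, so it breaks the dependency on the old register value & is often eliminated early in the pipeline. A hedged C view of where such idioms typically come from (the exact codegen is an assumption about common `-O2` output, not a guarantee):
```C=
// Returning / seeding a zero: compilers normally emit `xor eax, eax`
// rather than `mov eax, 0` (shorter encoding + dependency-breaking zero idiom).
int zero(void) { return 0; }

// Zeroing a vector accumulator before a reduction is typically emitted as
// `pxor`/`vxorps` of a register with itself for the same reason.
float sum(const float *a, int n) {
    float s = 0.0f; // accumulator starts from a zero idiom
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```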
### Speculative Branch Prediction
```cpp=
sort(nums.begin(), nums.end()); // better for the following operations
for (int i : nums)
if (i > X)
...
else
...
```
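When the data can't be sorted & the branch stays unpredictable, a branchless form may sidestep the misprediction penalty; a minimal sketch (whether `setcc`/`cmov` is actually emitted is up to the compiler):
```C=
// Branchy: each iteration speculates on (a[i] > x); random data keeps
// defeating the branch predictor.
long sum_branchy(const int *a, int n, int x) {
    long s = 0;
    for (int i = 0; i < n; ++i)
        if (a[i] > x)
            s += a[i];
    return s;
}
// Branchless: the comparison becomes a 0/1 value (typically setcc/cmov),
// trading possible mispredictions for a few extra ALU ops.
long sum_branchless(const int *a, int n, int x) {
    long s = 0;
    for (int i = 0; i < n; ++i)
        s += (long)(a[i] > x) * a[i];
    return s;
}
```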
### Static Branches / Jump Labels -- Branch Hot Replacement
*Kernel runtime dynamic code patching*
Used for embedding tracepoints inside the kernel at (near) zero cost.
- [/include/linux/memory.h](https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/memory.h#L190)
```c=
/*
* Kernel text modification mutex, used for code patching. Users of this lock
* can sleep.
*/
extern struct mutex text_mutex;
```
- [/kernel/jump_label.c](https://elixir.bootlin.com/linux/v6.12.6/source/kernel/jump_label.c#L503)
```c=
static void __jump_label_update(struct static_key *key,
struct jump_entry *entry,
struct jump_entry *stop,
bool init)
{
for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
if (!jump_label_can_update(entry, init))
continue;
if (!arch_jump_label_transform_queue(entry, jump_label_type(entry))) {
/*
* Queue is full: Apply the current queue and try again.
*/
arch_jump_label_transform_apply();
BUG_ON(!arch_jump_label_transform_queue(entry, jump_label_type(entry)));
}
}
arch_jump_label_transform_apply();
}
```
- [/arch/x86/kernel/jump_label.c](https://elixir.bootlin.com/linux/v6.12.6/source/arch/x86/kernel/jump_label.c#L36)
```c=
struct jump_label_patch {
const void *code;
int size;
};
static struct jump_label_patch
__jump_label_patch(struct jump_entry *entry, enum jump_label_type type)
{
const void *expect, *code, *nop;
const void *addr, *dest;
addr = (void *)jump_entry_code(entry);
dest = (void *)jump_entry_target(entry);
...
code = text_gen_insn(JMP32_INSN_OPCODE, addr, dest);
nop = x86_nops[size];
if (type == JUMP_LABEL_JMP)
expect = nop;
else
expect = code;
if (memcmp(addr, expect, size)) {
/*
* The location is not an op that we were expecting.
* Something went wrong. Crash the box, as something could be
* corrupting the kernel.
*/
pr_crit("jump_label: Fatal kernel bug, unexpected op at %pS [%p] (%5ph != %5ph)) size:%d type:%d\n",
addr, addr, addr, expect, size, type);
BUG();
}
if (type == JUMP_LABEL_NOP)
code = nop;
return (struct jump_label_patch){.code = code, .size = size};
}
```
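On top of this patching machinery, ordinary kernel code consumes jump labels through the static-key API; a simplified sketch of typical usage (the tracing helper & names here are hypothetical):
```c=
#include <linux/jump_label.h>

/* Hypothetical tracepoint-style switch, default off. */
static DEFINE_STATIC_KEY_FALSE(my_trace_enabled);

extern void do_expensive_tracing(void); /* hypothetical slow-path helper */

void hot_path(void)
{
	/* Compiled as a nop in the fast path; patched to a jmp when enabled. */
	if (static_branch_unlikely(&my_trace_enabled))
		do_expensive_tracing();
}

void enable_tracing(void) /* e.g., from a debugfs handler */
{
	static_branch_enable(&my_trace_enabled); /* triggers runtime code patching */
}
```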
### Platform-Dependent Instruction Hot Replacement
*Kernel runtime dynamic code patching*
For example, how Linux expands a plain memory barrier macro in C into the optimal target instruction (especially considering different machine configurations detected at boot/runtime):
```c!
mb();
```
:::warning
**Hardware Backward Compatibilities vs. Newer Features**
*Dynamically enable newer features at kernel build / compile time / boot time / runtime*
How does GNU/Linux know which hardware features are available & can apply at boot time, even at runtime?
- [Historical approach] `alternative` macro
```c=
#define mb() \
alternative("lock; addl $0,0(%%esp)", \
"mfence", \
X86_FEATURE_XMM2)
```
> The default implementation is essentially a `bus-locked no-op`; it will work anywhere. On newer systems, however, the more efficient `mfence` instruction is available, and it would be nice to use it.
`alternative` can be used to dynamically patch hardware-supported instructions ==at boot time==, when the kernel invokes `apply_alternatives()`.
> Even more, it surely can be invoked ==at runtime==! Hot swapping system optimization config on the fly (e.g., virtualization / VMs on *uniprocessor mode* vs. *SMP mode*).
[SMP alternatives - LWN.net](https://lwn.net/Articles/164121/)
Modern kernel code for x86 memory barrier:
- [/tools/arch/x86/include/asm/barrier.h](https://elixir.bootlin.com/linux/v6.15.1/source/tools/arch/x86/include/asm/barrier.h#L27)
```c=
# define barrier() __asm__ __volatile__("": : :"memory")
#if defined(__i386__)
#define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
#define rmb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
#define wmb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
#elif defined(__x86_64__)
#define mb() asm volatile("mfence" ::: "memory")
#define rmb() asm volatile("lfence" ::: "memory")
#define wmb() asm volatile("sfence" ::: "memory")
#define smp_rmb() barrier()
#define smp_wmb() barrier()
#define smp_mb() asm volatile("lock; addl $0,-132(%%rsp)" ::: "memory", "cc")
#endif
```
:::
:::info
**[Non-Aggressive Approach] Virtual Interface Mapping -- vDSO**
For example, the program's `vsyscall` interface, which is set up by the kernel through the "auxiliary vector" at runtime (program start time).
> **`vsyscall`**: `64-bit`/`32-bit` syscall compatible trampoline.
:::
## Modern Heterogeneous Computing Architectures
:::info
**The Power of CPU x GPU**
*WebGPU application for example.*
Data transferring between RAM & VRAM can be costly, but we can introduce pointers in the GPU context with the WebGPU API to manipulate VRAM resource buffers directly.
```!
CPU-side program, data & runtime <=> GPU-side computing, resources & display
```
Also, providing compilation support (e.g., header definitions) & runtime support (e.g., a wrapper) for GPU programming that integrates with the existing CPU program (e.g., a JS app) makes the workflow tight, seamless, & less error-prone.

[Compiling JavaScript to WGSL, and the quest for type-safe and composable WebGPU libraries - Iwo Plaza](https://www.youtube.com/watch?v=pBRLqJaG4kk)
:::warning
**[Real World Application] Heterogeneous Hash Tables vs. Lock-Free Hash Tables**
- [CPU x GPU] Heterogeneous Hash Tables:
- [[Paper] Heterogeneous Working-set Hash Tables - Z Choudhury, 2016](https://www.researchgate.net/publication/290946456_Heterogeneous_CPUGPU_Working-set_Hash_Tables): overall intro & comparison
- [Multi-core CPUs] Lock-Free Hash Tables:
- [The World's Simplest Lock-Free Hash Table - Jeff Preshing](https://preshing.com/20130605/the-worlds-simplest-lock-free-hash-table/)
:::
### Multimedia Data Stream Parallel Processing
*++Vector / Graphics / Large Dataset Processing++ Done by ++General Processors++ (e.g., superscalar/multi-core CPUs) or ++Specialized Processors++ (e.g., GPGPUs, GPUs, TPUs, NPUs, ...)*
> 
>
> (Source: [Mythbusters Demo GPU versus CPU - NVIDIA](https://www.youtube.com/watch?v=-P28LKWTzrI))
:::warning
**Multimedia Data Stream Parallel Processing: True Scalability**
With reference to game frame rates at different screen resolutions, a computer operates like a factory production line: as long as there are enough vector (array) processing units, the amount of data that can be handled grows linearly. Therefore, unlike the CPU, a GPU can manage multiple data streams at scale as long as the bandwidth is sufficient, without needing to consider *control flow*, *system resources*, *runtime environment*, and *data locks* (unless data-dependency scheduling is required on ++limited++ hardware resources like SIMT) that cause mutual-exclusion effects.
The more computational units a GPU or SIMD has, the better.
:::
### Multicore CPU
:::info
**[HW Perspective vs. SW Perspective] Processes & Threads**
*How to abstract the CPU resources to the software regardless of the physical CPU architecture differences?*
For example, how OS & programs make use of *4 cores 12 threads* vs. *6 cores 12 threads* vs. *12 cores 12 threads*?
It turns out that as long as the processor provides certain hardware features (abstractions) that can separate & share resources among different execution contexts to certain degrees & under virtualization control ---- such as the *MMU (vm page tables, TLB, ...)*, *icache/dcache*, & *registers (task_struct, mm_struct, ...)* ---- the CPU & the OS don't care about each other's view:
- **From the hardware perspective**
CPU is able to handle those "execution contexts" concurrently.
> No matter they are *processes* or *threads*, which only differ in the shared resource degree at OS implementation level.
- **From the OS perspective**
A 4-core 12-thread CPU is viewed as a processor that has 4 ==physical cores== & 12 ==logical cores==. That is, theoretically, the OS assumes the CPU can run 12 "execution contexts" concurrently.
[SMT and Hyperthreading : threads vs process - StackOverflow](https://stackoverflow.com/questions/46793813/smt-and-hyperthreading-threads-vs-process)
:::
### SIMD on CPU
:::warning
**Frankly, Do We Really Need SIMD?**
If SIMD only improves performance by a constant factor for a few parts of the program, is the tradeoff really worth it?
E.g., a constant ~4x speedup (`4·log(n)` -> `log(n)`) in 8% of the scenarios, while the code complexity (spanning from low-level to high-level), machine cost, & compatibility burden increase sharply.
:::
- **ARM Neon**
- [Permutation - Neon instructions - ARM Developer](https://developer.arm.com/documentation/102159/0400/Permutation---Neon-instructions)
- [Registers, vectors, lanes and elements - ARM Developer](https://developer.arm.com/documentation/102474/0100/Fundamentals-of-Armv8-Neon-technology/Registers--vectors--lanes-and-elements)
```arm=
; add vector (array) register V0 with V1 & store results in V3
add V3.16B, V0.16B, V1.16B
```
```arm=
; table index look up (tbx operates on NEON vector registers)
.data
table:
    .byte 10, 20, 30, 40, 50, 60, 70, 80
.text
.global main
main:
    ; Load the address of the table & its bytes into V0
    ldr X0, =table
    ld1 {V0.8B}, [X0]
    ; Load the index into a NEON register (e.g., index 2 in every lane)
    movi V1.8B, #2
    ; Use tbx to fetch table[index] per lane
    ; (out-of-range indices leave the destination lane unchanged)
    tbx V2.8B, {V0.16B}, V1.8B
    ; each lane of V2 now contains the value 30
```
For example:
:::info
**Using SIMD Vector Operations to Implement `toupper(const char[])` (in AArch64)**
Concept:
1. Write the unmodified parts to the output first.
2. Intentionally translate the range of the operating values to fit into the range of the operation in step 3.
3. Do operations and write to the output (as the picture shown below).
> In this case, we perform:
> ```arm
> tbx dst.16B, {table.64B}, src.16B
> ```
> which can only handle 64 indices (1 Byte each) in the table while processing 16 Bytes of input each time.
4. Go back to step 2. and move on to the other ranges that need processing in step 3., until all ranges are processed.
> We perform this SIMD operation on 16 Bytes each time, & each pass is roughly $O(1)$, so the overall performance boost is roughly 16x.

```arm=
.global _main
.align 4
; CONSTANTS
.set READ_BUF_LEN, 1024
.set SYSCALL_READ, 3
.set FD_STDIN, 0
; USEFUL MACRO SNIPPETS
; Load the address of a label into a register
.macro load_addr reg, label
adrp \reg, \label@PAGE
add \reg, \reg, \label@PAGEOFF
.endm
; Prints a string of a given length to STDOUT
; Clobbers X0, X1, X2, X16
.macro print str len
mov X0, #1
adrp X1, \str@PAGE
add X1, X1, \str@PAGEOFF
mov X2, \len
mov X16, #4
; ARM's version of syscall
svc #0x80
.endm
```
`.data` Section:
```arm=
.data
read_buf:
.space READ_BUF_LEN
char_map:
; 0x00 ~ 0xFF custom ASCII pattern lookup table
; (total size: 256 Bytes)
.byte 0, 1, 2, ..., 15
.byte 16, 17, 18, ..., 31
.byte 32, 33, 34, ..., 47
.byte 48, 49, 50, ..., 63
.ascii "@"
.ascii "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
.ascii "[\\]^_`"
; Map the lower case's ASCII to upper case's (others remain still)
.ascii "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
.ascii "{|}~"
.byte 127
.byte 128, 129, 130, ..., 143
.byte 144, 145, 146, ..., 159
.byte 160, 161, 162, ..., 175
.byte 176, 177, 178, ..., 191
.byte 192, 193, 194, ..., 207
.byte 208, 209, 210, ..., 223
.byte 224, 225, 226, ..., 239
.byte 240, 241, 242, ..., 255
```
`.text` Section:
```arm=
.text
_main:
.L_load_map:
load_addr X19, char_map
; Load char_map into registers V16 ~ V31
ld1 {V16.16B, V17.16B, V18.16B, V19.16B}, [X19]
add X19, X19, #64
ld1 {V20.16B, V21.16B, V22.16B, V23.16B}, [X19]
add X19, X19, #64
ld1 {V24.16B, V25.16B, V26.16B, V27.16B}, [X19]
add X19, X19, #64
ld1 {V28.16B, V29.16B, V30.16B, V31.16B}, [X19]
; V0: 0x40, 0x40, 0x40, ..., 0x40
movi V0.16B, #64
.L_load_buffer:
; X20: address of the read buffer
; X21: length of the read buffer
; X22: end address of the read buffer
mov X21, READ_BUF_LEN
load_addr X20, read_buf
add X22, X20, X21
; Read from stdin to buffer
bl read_stdin_to_buf
; Is EOF?
cmp X0, #0
b.eq .L_exit
mov X21, X0
; Process the 16 Bytes of the read buffer each time
.L_process_chunk:
; Load the read buffer into register V1
ld1 {V1.16B}, [X20]
; Process the first table's range (0 ~ 63)
; tbx instruction:
; src: V1.16B
; table: char_map[0x00] ~ char_map[0x3F]
; dst: V2.16B (if out of bound remains unchanged)
tbx V2.16B, {V16.16B, V17.16B, V18.16B, V19.16B}, V1.16B
; Same procedure for the next table's range (64 ~ 127)
; tbx instruction:
; src: V1.16B (translate to fit in the 0 ~ 64 tbx range)
; table: char_map[0x40] ~ char_map[0x7F]
; dst: V2.16B (if out of bound remains unchanged)
sub V1.16B, V1.16B, V0.16B
tbx V2.16B, {V20.16B, V21.16B, V22.16B, V23.16B}, V1.16B
; Same procedure for the next table's range (128 ~ 191)
; tbx instruction:
; src: V1.16B (translate to fit in the 0 ~ 64 tbx range)
; table: char_map[0x80] ~ char_map[0xBF]
; dst: V2.16B (if out of bound remains unchanged)
sub V1.16B, V1.16B, V0.16B
tbx V2.16B, {V24.16B, V25.16B, V26.16B, V27.16B}, V1.16B
; Same procedure for the last table's range (192 ~ 255)
; tbx instruction:
; src: V1.16B (translate to fit in the 0 ~ 64 tbx range)
; table: char_map[0xC0] ~ char_map[0xFF]
; dst: V2.16B (if out of bound remains unchanged)
sub V1.16B, V1.16B, V0.16B
tbx V2.16B, {V28.16B, V29.16B, V30.16B, V31.16B}, V1.16B
; Store the result back into the buffer
st1 {V2.16B}, [X20]
.L_advance_buffer:
add X20, X20, #16
; Is the end of the read buffer?
cmp X20, X22
; No, keep processing the next 16 Bytes in the read buffer
b.lt .L_process_chunk
.L_flush_buffer:
print read_buf, X21
b .L_load_buffer
.L_exit:
mov X0, #0
mov X16, #1
; ARM's version of syscall
svc #0x80
; USEFUL FUNCTIONS
; read_stdin_to_buf:
; args:
; (none)
; results:
; X0: the number of bytes read
read_stdin_to_buf:
mov X0, FD_STDIN
load_addr X1, read_buf
mov X2, READ_BUF_LEN
mov X16, SYSCALL_READ
; ARM's version of syscall
svc #0x80
ret
```
Refs:
- [Fast and Beautiful Assembly - Kay Lack](https://www.youtube.com/watch?v=ON9vuzLiGuc)
:::
- **SSE2Neon**
[sse2neon - DLTcollab - GitHub](https://github.com/DLTcollab/sse2neon)
- **X86's SIMDs (MMX, SSE, AVX)**

- **MMX (Multimedia / Matrix Math Extensions) on MM / FP Registers**
The 1st SIMD instruction set on x86; it reuses the low `64-bit` of the `80-bit` x87 FPU registers, which can then be divided into 2x`32-bit`, 4x`16-bit`, or 8x`8-bit` lanes for SIMD instructions.
:::info
**FPU is a Legacy Design**
*Modern programs prefer to use SIMD more.*
x87 FPU used ++stack-based register model++ (`push`/`pop` operands). Easy to implement but bad for optimization.
:::
> 
> 
>
> (Source: [Intel MMX for Multimedia PCs - NTU CSIE](https://www.csie.ntu.edu.tw/~cyy/courses/assembly/docs/MMX.pdf))
- **SSE (Streaming SIMD Extensions) on XMM Registers**
Support for packed single-precision floating point (4x`32-bit` in a `128-bit` XMM register), but only two-operand (destructive) forms (e.g., `A = A + B`).
- **SSE2**: `128-bit` ++int++ & ++floating-point++ operations.
:::info
**Intrinsics & Data Types of SSE Series**
- **SSE & Basic SIMD:** [xmmintrin.h - GCC - GitHub](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/xmmintrin.h).
- **SSE2~:** [emmintrin.h - GCC - GitHub](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/emmintrin.h).
- **SSE4.1~:** [smmintrin.h - GCC - GitHub](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/smmintrin.h).
:::
- **AVX (Advanced Vector Extensions) on XMM / YMM / ZMM Registers**
More general, more registers, support prefetch, pipelined execution, masked data gather/scatter, quaternary operation (e.g., `A = B × C + D`).
:::warning
**The complete implementation of the *vector processing* on x86's SIMD!**
:::
- **AVX:** `256-bit` ++floating-point++ operations.
- **AVX2:** `256-bit` ++int++ & ++floating-point++ operations (& 3-operand instructions).
- **AVX-512:** `512-bit` ++int++ & ++floating-point++ operations.
> 
> 
> 
> 
>
> (Source: [硬科技:淺談x86的SIMD指令擴張史(下):AVX到AVX-512 - Cool3c](https://www.cool3c.com/article/152953))
:::info
**Intrinsics & Data Types of AVX Series**
- **Common AVX Headers:** [immintrin.h - GCC - GitHub](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/immintrin.h).
- **AVX:** [avxintrin.h - GCC - GitHub](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/avxintrin.h).
- **AVX2:** [avx2intrin.h - GCC - GitHub](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/avx2intrin.h).
:::
:::warning
**SIMD Not only for Parallel Processing, but also for General Floating-Point Operations!**
Modern compilers use SIMD floating-point instructions with independent XMM register set.
Real world applications:
- **AES En/Decryption:** [/crypto/aes_generic.c](https://elixir.bootlin.com/linux/v6.11.1/source/crypto/aes_generic.c)
- **Compiler flags:** `gcc -O3 -march=native -ftree-vectorize a.c`
- a.c
```C=
void add(float *a, float *b, float *out, int n) {
for (int i = 0; i < n; ++i)
out[i] = a[i] + b[i];
}
```
- **Pattern Matching**
- **Memory Copy**
:::info
**Kernel Side of the FPU & SIMD**
The kernel still discourages using both the FPU & SIMD in kernel code, due to the heavy cost of saving/restoring those registers across function executions, context switches, ...
Also, in past implementations, if the FPU was needed in user space, the kernel would trap & lazily context-switch, automatically saving/restoring the FPU state for users (but on preemptible, multicore systems this is not a practical solution).
For more details, see [Floating-point API - Linux Kernel](https://docs.kernel.org/core-api/floating-point.html): `kernel_fpu_available()`, `kernel_fpu_begin()`, `kernel_fpu_end()`, also `preempt_disable()` may be involved.
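A hedged sketch of the resulting pattern (header locations & guards vary across kernel versions; check the doc above for your tree):
```c=
#include <asm/fpu/api.h> /* kernel_fpu_begin()/kernel_fpu_end() on x86 */

/* Only touch FP/SIMD state inside a kernel_fpu_begin()/end() region. */
static void scale_buffer(float *buf, int n, float k)
{
	kernel_fpu_begin();  /* save the task's FPU/SIMD state & disable preemption */
	for (int i = 0; i < n; ++i)
		buf[i] *= k;
	kernel_fpu_end();    /* restore the state & re-enable preemption */
}
```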
:::
Refs:
- [一文读懂SIMD指令集 目前最全SSE/AVX介绍 - CSDN](https://blog.csdn.net/qq_32916805/article/details/117637192)
:::info
:arrow_right: More on compiler supports of intrinsics functions: [Compiler: The Program, the Language, & the Computer Work - shibarashinu](https://hackmd.io/@shibarashinu/SyEHz-JHC)
:::
For example:
:::info
**256-Bit ++Floating-Point++ Operations (in C)**
- Using 256-bit ++single-fp++ operations: `_mm256_set_ps`, `_mm256_add_ps`, `_mm256_storeu_ps` (refs: [avxintrin.h - GCC - GitHub](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/avxintrin.h)):
```C=
#include <stdio.h>
#include <immintrin.h>
int main() {
    // _mm256_set_ps: create an 8-single-fp vector
    __m256 a_v8sf = _mm256_set_ps(
        1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f
    );
    __m256 b_v8sf = _mm256_set_ps(
        1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f
    );
    // _mm256_add_ps: (__m256) ((__v8sf) arg0 + (__v8sf) arg1)
    __m256 res_v8sf = _mm256_add_ps(a_v8sf, b_v8sf);
    float res_arr[8];
    _mm256_storeu_ps(res_arr, res_v8sf);
    for (int i = 0; i < 8; i++) printf("%f ", res_arr[i]);
    return 0;
}
```
- Using 256-bit ++double-fp++ operations: `_mm256_set_pd`, `_mm256_add_pd`, `_mm256_storeu_pd` (refs: [avxintrin.h - GCC - GitHub](https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/avxintrin.h)):
```C=
#include <stdio.h>
#include <immintrin.h>
int main() {
    // _mm256_set_pd: create a 4-double-fp vector
    __m256d a_v4df = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
    __m256d b_v4df = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
    // _mm256_add_pd: (__m256d) ((__v4df) arg0 + (__v4df) arg1)
    __m256d result = _mm256_add_pd(a_v4df, b_v4df);
    // Store the result in a regular array
    double res_arr[4];
    _mm256_storeu_pd(res_arr, result);
    for (int i = 0; i < 4; i++) printf("%f ", res_arr[i]);
    return 0;
}
```
**GCC Compilation Options for AVX Supports**
```sh
gcc -mavx ... # AVX
gcc -mavx2 ... # AVX2
gcc -mavx512f ... # AVX-512 Foundation
gcc -mavx512f \
-mavx512vl ... # AVX-512 Foundation & Vector Length Extensions
```
:::
### SIMT on GPGPU
*Like a **CPU with multiple hyper-threaded cores & vector computing units**, it has the abilities of ILP, DLP, & TLP, but in a task-based, multiple-data parallel processing manner.*
> 
> 
> (Source: [AI 为什么离不开 GPU?顺序代码 vs. 并行计算! - EfficLab 中文](https://www.youtube.com/watch?v=1SYQkRpcGAQ))
:::info
**SIMD vs. SIMT**

Refs:
- [GPU Flynn 概念 - Johnny Chang](https://johnnybboy.medium.com/gpu-flynn%E6%A6%82%E5%BF%B5-ed19ed1b3e61)
- [淺談 GPU 到底是什麼(中):兼具 SIMD 與 MIMD 優點的 SIMT - Cool3C](https://www.cool3c.com/article/133370)
:::
:::info
**Graphics Rendering Optimization Issues in GPU**
Within the GPU, there are a number of functional units operating in parallel, which essentially act as separate special-purpose processors, & a number of spots where a bottleneck can occur, including vertex & index fetching, vertex shading (transform & lighting (T&L)), fragment shading, & raster operations (ROP).

Refs:
- [Graphics Pipeline Performance - NVIDIA Developer](https://developer.nvidia.com/gpugems/gpugems/part-v-performance-and-practicalities/chapter-28-graphics-pipeline-performance)
:::
#### Nvidia GPU Architecture
*Multiple Streaming Multiprocessors (SMs)*

Refs & More:
- [Future Scaling of Memory Hierarchy for Tensor Cores and Eliminating Redundant Shared Memory Traffic Using Inter-Warp Multicasting - Sunjung Lee, 2022](https://www.computer.org/csdl/journal/tc/2022/12/09893362/1GGLKtuanzW)
#### Nvidia CUDA (Compute Unified Device Architecture)
*As a coprocessor, assists the CPU to do SIMT parallel processing.*
:::warning
**The Modern Computer Architecture: Multicore Processor with Multi-Streaming Coprocessors**
:::

Refs & More:
- [Understanding Parallel Computing: GPUs vs CPUs Explained Simply with role of CUDA - Digital Ocean](https://blog.paperspace.com/demystifying-parallel-computing-gpu-vs-cpu-explained-simply-with-cuda/)
### Miscellaneous Parallel-Processing API Standards
| | OpenGL | OpenCV | OpenMP |
|:------------:|:----------------------------:|:--------------------------------------:|:------------------------------------------------:|
| Targets | GPUs | GPUs | CPUs |
| Purposes | 2D/3D graphics rendering | computer vision & image/video analysis | CPU-bound parallel programming |
| Applications | graphics, simulations, games | face detection, AR, machine learning | performance optimization in numerical computing |
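For the CPU-side column, OpenMP turns a plain loop into a data-parallel one with a single pragma (compile with `gcc -fopenmp`); a minimal sketch:
```C=
#include <omp.h>
#include <stdio.h>
#define N 1000000
static float a[N], b[N], out[N];
int main(void) {
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }
    // Split the loop iterations across the available CPU threads.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        out[i] = a[i] + b[i];
    printf("max threads: %d, out[123] = %f\n", omp_get_max_threads(), out[123]);
    return 0;
}
```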
## Memory Segmentation
Calculate the Instruction/Data Destination.


:::warning
**Linux Virtual Memory on x86-64 Architecture**
- **4-level paging (x86-64's long mode, 48-bit wide)**
- user space (50%): `0x----'0000'0000'0000` ~ `0x----'7fff'ffff'f000` (the last 4k page is reserved as a *guard hole*)
> E.g., `0x0000'7fff'1234'5678`.
- kernel space (50%): `0x----'8000'0000'0000` ~ `0x----'ffff'ffff'ffff`
> E.g., `0xffff'8000'1234'5678`.
> page table:
> 
> 
>
> (Source: [x86 Page Tables - COMS W4118 Operating Systems I](https://cs4118.github.io/www/2023-1/lect/18-x86-paging.html))
> canonical virtual addresses (i.e., bits 63~48 are the sign extension of bit 47):
> 
> non-canonical addresses (e.g., `0x7fff'1234'5678'9abc` on the user side & `0x8000'1234'5678'9abc` on the kernel side, where the upper bits don't follow bit 47) fault on access, which rules out ambiguous user/kernel virtual-memory aliases.
>
> (Source: [x86-64 - Wiki](https://en.wikipedia.org/wiki/X86-64))
- **5-level (extended, 57-bit wide)**
- user space (50%): `0x-000'0000'0000'0000` ~ `0x-0ff'ffff'ffff'f000` (the last 4k page is reserved as a *guard hole*)
> E.g., `0x00ff'ffff'1234'5678`.
- kernel space (50%): `0x-100'0000'0000'0000` ~ `0x-1ff'ffff'ffff'ffff` (the leading bit of `1` (the 57th bit) should be sign-extended here to form canonical addresses, i.e., `0xff...`)
> E.g., `0xffff'ffff'1234'5678`.
[Memory Management - Linux Kernel Docs](https://docs.kernel.org/arch/x86/x86_64/mm.html)
:::
:::info
**Physical Memory Address** = **Segment Selector** (Registers: `SS`, `CS`, `DS`, `ES`, `FS`, `GS`) + **Offset**
*x86 legacy segmentation mechanism (since protected mode came out) used before paging (i.e., page table).*
> This only happens in *protected (32-bit) / long (64-bit) mode* where the MMU is enabled, whereas in *real mode* there is no MMU support & no virtual memory: all memory is addressed directly as `segment * 16 + offset` (e.g., `0x1234:0x5678` -> `0x1234 * 0x10 + 0x5678 = 0x179B8`), without paging or protection.
> - **When User Program Is Trapped into Kernel on x86_64**
>
> In 64-bit mode user space, Linux uses a flat memory model: ALL memory segments (code, data, stack) appear in one continuous address space.
>
> Segment registers like `DS` & `ES` are unused. x86 still requires `CS` & `SS` (whose selectors point to dummy descriptors in the GDT). `FS` & `GS` are used for Thread Local Storage (TLS) or for special kernel tricks.
>
> - **Privileged MSRs (Model-Specific Registers)**
> *For low-level system control on x86.*
> - `IA32_LSTAR`: address of syscall entry in kernel.
> > E.g., on `syscall`, the CPU loads `rip` from `IA32_LSTAR`.
> - `IA32_KERNEL_GS_BASE`: address of percpu data.
> > E.g., `swapgs` exchanges the current GS base with `IA32_KERNEL_GS_BASE`.
> - `CR3`: address of root page table.
> > E.g., switching to the kernel page table: `mov cr3, <kernel_cr3>` (loaded from percpu data), if KPTI is enabled.
> - **Privileged Special-Purpose Registers**
> - `GDTR`: address of GDT.
> > For RAM management.
> - `IDTR`: address of interrupt descriptor table.
> > For interrupt table control.
> - **GDT Table in x86_64**
> - user/kernel's segment descriptors in GDT (accessed by `CS`, `SS` segment selectors):
> > Because of the flat, paging-based memory model in x86_64, the TSS is no longer used for hardware context switching but for privilege transitions only.
> - privilege level: ring3 (RPL: `0x03`) vs. ring0 (RPL: `0x00`).
> - user/kernel's `CS` & `SS` pointed to their dummy selector descriptors in GDT in different privilege level.
> - kernel's TSS descriptors in GDT (pointed by `TR`, per CPU core):
> > This TSS represents the kernel context for executing this task on this CPU core.
> - privilege level: always ring0 (RPL: `0x00`).
> - I/O privilege level: I/O port permission bitmap for current task.
> - `SS0:RSP0`: anchor address of ==this task=='s kernel stack (set on every context switching from `task_struct::stack`).
> - `IST[]` Table: anchor address of ==this CPU=='s interrupt stacks for different scenarios (e.g., double faults, non-maskable interrupts (NMI), machine check exceptions, ...).
>
> kernel `IST[]` table set by TSS in GDT (per CPU core, set by booting process).
>
> kernel syscall stack pointer is restored by TSS in GDT (per CPU core, set by per task switching).
>
> user mode state & syscall arguments are stored onto kernel stack (e.g., `rip`, `rsp`, `rax`, ...).
>
> [怎麼理解 linux 內核棧?- 知乎](https://www.zhihu.com/question/57013926/answer/100820002456)
>
> [Task state segment - Wiki](https://en.wikipedia.org/wiki/Task_state_segment)
- **Segment Register (as Segment Selector)**: Index of Segment Descriptor (in `LDT[]`/`GDT[]` table), "Requested" Privilege Level (in Protected Mode).

- **Segment Descriptor**: (8 Bytes in `LDT[]`/`GDT[]` table) Segment's Status, Range, Type, Permission, Privilege Level, ...

- **Task Register (as TSS Descriptor)**: Current Task State Segment's Status (for hardware context switching).
- **TSS Descriptor**: (8 Bytes in `GDT[]` table) Processor Register State, I/O Port Permissions, Inner-Level Stack Pointers, Previous TSS Link, ...

Refs:
- [物理地址和段暫存器 - 小碼農米爾](https://ithelp.ithome.com.tw/articles/10203546)
- [分段架構 - 記憶體管理 - Wen-Chin Chen](https://www.csie.ntu.edu.tw/~wcchen/asm98/asm/proj/b85506061/chap2/segment.html)
- [邊界和型態檢查 - 保護機制 - Wen-Chin Chen](https://www.csie.ntu.edu.tw/~wcchen/asm98/asm/proj/b85506061/chap3/limit_type_check.html)
- [工作狀態表 - 多工處理 - Wen-Chin Chen](https://www.csie.ntu.edu.tw/~wcchen/asm98/asm/proj/b85506061/chap5/tss.html)
- [Task state segment (TSS) - Wiki](https://en.wikipedia.org/wiki/Task_state_segment)
:::
## Assembly Practices
### Simple Word Count Program (wcx64)

#### Analysis of x86_64 Linux's Program
`.rodata` Section (of Constants):
```arm=
.set READ_SYSCALL, 0
.set WRITE_SYSCALL, 1
.set OPEN_SYSCALL, 2
.set CLOSE_SYSCALL, 3
.set EXIT_SYSCALL, 60
.set STDIN_FD, 0
.set STDOUT_FD, 1
.set O_RDONLY, 0x0
.set OPEN_NO_MODE, 0x0
.set READBUFLEN, 16384
.set ITOABUFLEN, 12
.set NEWLINE, '\n'
.set CR, '\r'
.set TAB, '\t'
.set SPACE, ' '
```
`.data` Section (of Global Variables):
```arm=
.data
newline_str:
.asciz "\n" # (with '\0' terminator)
fourspace_str:
.asciz " "
total_str:
.asciz "total"
read_buffer:
# sizeof(READBUFLEN + 1) space (+1 for '\0')
.space READBUFLEN + 1, 0x0
# itoa_buffer: store int to ASCII result in this const char buffer[12].
# For example, "1234" Bytes (maximum: "9999999999" Bytes (9.99 GB))
itoa_buffer:
.space ITOABUFLEN, 0x0
# itoa_buffer_end: symbol of the itoa buffer's end
.set itoa_buffer_end, itoa_buffer + ITOABUFLEN - 1
```
`.text` Section (of Main Procedure):
```arm=
.globl _start
.text
# Get the filename from args[] & process each of them.
_start:
# for example, if program starts by this command: ./wcx64 file1 file2
# [$rsp] stores argc (i.e., 3)
# [$rsp + 8]: argv[0] (i.e., ./wcx64)
# [$rsp + 16]: argv[1] (i.e., file1)
# [$rsp + 24]: argv[2] (i.e., file2)
# $rbx: argc
mov rbx, [rsp]
# Is argc only 1?
cmp rbx, 1
# Yes, jump to label L_no_argv.
jle .L_no_argv
# No, init setup.
# global variables:
# $r13: global_char_count
# $r14: global_line_count
# $r15: global_word_count
xor r13, r13
xor r14, r14
xor r15, r15
# $rbp: argv counter (range: 0 ~ argc - 1)
mov rbp, 1
.L_argv_loop:
# Get the address of the filename (null-terminated strings) to process.
mov rdi, [rsp + 8 + 8*rbp]
mov r12, rdi
# syscall open:
# args:
# $rax: OPEN_SYSCALL
# $rdx: OPEN_NO_MODE
# $rsi: O_RDONLY
# $rdi: filename
# result:
# $rax: file descriptor
mov rsi, O_RDONLY
mov rdx, OPEN_NO_MODE
mov rax, OPEN_SYSCALL
syscall
# If file descriptor < 0 (error on open()), ignore this file.
cmp rax, 0
jl .L_next_argv
# Push fd onto the stack.
push rax
# call count_in_file:
# args:
# $rdi: file descriptor
# result:
# $rax: local_char_count
# $rdx: local_line_count
# $r9: local_word_count
mov rdi, rax
call count_in_file
# Add those counters to the global totals & call print_counters:
# global variables:
# $r13: global_char_count
# $r14: global_line_count
# $r15: global_word_count
#
# call print_counters:
# args:
# $rdi: $rax (local_char_count)
# $rsi: $rdx (local_line_count)
# $rdx: $r9 (local_word_count)
# $rcx: $r12 (filename)
# print_counters's arg
mov rdi, rax
# sum totals
add r13, rax
# print_counters's arg
mov rsi, rdx
# sum totals
add r14, rdx
# print_counters's arg
mov rdx, r9
# sum totals
add r15, r9
# print_counters's arg
mov rcx, r12
call print_counters
# syscall close:
# args:
# $rax: CLOSE_SYSCALL
# $rdi: file descriptor (popped from stack)
pop rdi
mov rax, CLOSE_SYSCALL
syscall
# Increment argv counter (rbp).
.L_next_argv:
# increment argv counter & check if argv counter > argc
inc rbp
cmp rbp, rbx
# No, do the next round.
jl .L_argv_loop
# Yes, finished.
# call print_counters:
# args:
# $rdi: $r13 (global_char_count)
# $rsi: $r14 (global_line_count)
# $rdx: $r15 (global_word_count)
# $rcx: total_str (address of .data's "total")
mov rdi, r13
mov rsi, r14
mov rdx, r15
lea rcx, total_str
call print_counters
jmp .L_wcx64_exit
# Read from stdin, which fd is 0.
.L_no_argv:
mov rdi, STDIN_FD
call count_in_file
# call print_counters:
# args:
# $rdi: $rax (local_char_count)
# $rsi: $rdx (local_line_count)
# $rdx: $r9 (local_word_count)
# $rcx: 0
mov rdi, rax
mov rsi, rdx
mov rdx, r9
mov rcx, 0
call print_counters
.L_wcx64_exit:
# syscall exit:
# args:
# $rax: EXIT_SYSCALL
# $rdi: 0
mov rdi, 0
mov rax, EXIT_SYSCALL
syscall
ret
```
Function count_in_file:
```arm=
# count_in_file:
# result:
# $rax: char_count of this file
# $rdx: line_count of this file
# $r9: word_count of this file
count_in_file:
# local variables:
# $rdi: file descriptor
# $r9: char counter
# $r13: read buffer address
# $r14: line counter
# $r15: word counter
# $rcx: index for looping the read buffer
# $dl: next byte read from the buffer
# $r12: DFA state indicator
# DFA states:
.set IN_WORD, 1
.set IN_WHITESPACE, 2
# preserved registers (should be restored before ret)
push r12
push r13
push r14
push r15
# init setup
xor r9, r9
xor r15, r15
xor r14, r14
lea r13, read_buffer
mov r12, IN_WHITESPACE
# looping to read file's content (by read_buffer[READBUFLEN])
.L_read_buf:
# syscall read:
# args:
# $rax: READ_SYSCALL
# $rsi: read_buffer
# $rdx: READBUFLEN
# $rdi: file descriptor
# result:
# $rax: size of this read syscall
mov rsi, r13
mov rdx, READBUFLEN
mov rax, READ_SYSCALL
syscall
add r9, rax
# Are all files read?
cmp rax, 0
# Yes, jump to cleanup label & return.
je .L_done_with_file
# No, init setup.
xor rcx, rcx
# Traverse each byte of this buffer for this read.
.L_next_byte_in_buf:
# dl: [read_buffer + offset]
mov dl, [r13 + rcx]
# Switch cases for this byte.
cmp dl, NEWLINE
je .L_seen_newline
cmp dl, CR
je .L_seen_whitespace_not_newline
cmp dl, SPACE
je .L_seen_whitespace_not_newline
cmp dl, TAB
je .L_seen_whitespace_not_newline
# Else, non of those special chars.
# Change state or not.
cmp r12, IN_WORD
je .L_done_with_this_byte
inc r15
mov r12, IN_WORD
jmp .L_done_with_this_byte
# Switch cases for special char.
.L_seen_newline:
inc r14
.L_seen_whitespace_not_newline:
# IN_WORD or IN_WHITESPACE?
cmp r12, IN_WORD
je .L_end_current_word
jmp .L_done_with_this_byte
.L_end_current_word:
# Skip without touching counters.
mov r12, IN_WHITESPACE
.L_done_with_this_byte:
inc rcx
cmp rcx, rax
jl .L_next_byte_in_buf
# Done going over this buffer.
# We need to read another buffer if rax == READBUFLEN.
cmp rax, READBUFLEN
je .L_read_buf
.L_done_with_file:
# Done with this file. Move the counters into their return registers
# ($rax: char count, $rdx: line count, $r9: word count).
mov rax, r9
mov rdx, r14
mov r9, r15
# Restore callee-saved registers.
pop r15
pop r14
pop r13
pop r12
ret
```
& other functions: `print_counters`, `memset`, `itoa`, ...
Refs:
- [Fast and Beautiful Assembly - Kay Lack](https://www.youtube.com/watch?v=ON9vuzLiGuc)
- [wcx64 - eliben - GitHub](https://github.com/eliben/wcx64)
### Write & Compile In-Kernel Syscall Wrapper Library
No glibc wrappers / helpers.
[Making Smallest Possible Linux Distro (x64) - Nir Lichtman](https://www.youtube.com/watch?v=u2Juz5sQyYQ)
:::warning
**[x86 Calling Convention Issue] User vs. Kernel Function Parameters**
For System V ABI (x86_64 systems), the registers in user function parameter interface are taken in the following order:
```arm!
rdi, rsi, rdx, rcx, r8, r9, (SIMD regs ...)
```
Whereas the kernel interface uses:
```arm!
rdi, rsi, rdx, r10, r8, r9
```
> The 4th parameter uses a different register because the `syscall` instruction clobbers `rcx` with the return (back-to-user) address.
[List of x86 calling conventions - Wiki](https://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions)
:::
- a.S
```arm=
.intel_syntax noprefix
/* symbols for linking */
.global my_write
.global my_read
.global my_fork
.global my_execve
.global my_waitid
.global my_exit
my_write:
mov rax, 1 /* listed in /arch/.../syscalls_64.h */
syscall
ret
my_read:
mov rax, 0
syscall
ret
my_fork:
mov rax, 57
syscall
ret
my_execve:
mov rax, 59
syscall
ret
my_waitid:
mov rax, 247
mov r10, rcx /* fix the above x86 calling convention issue */
syscall
ret
my_exit:
mov rax, 60
syscall
ret
```
- a.h
```c=
#pragma once
// rewrite glibc's <func> declaration with my_<func>
#define write my_write
#define read my_read
#define fork my_fork
#define execve my_execve
#define _exit my_exit
#include <unistd.h>
// glibc's waitid() != the kernel's waitid syscall:
// the raw syscall takes a 5th argument (struct rusage *), so declare the wrapper ourselves
#include <sys/wait.h>
int my_waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options, void *rusage);
```
- main.c
```c=
#include "a.h"
void my_main()
{
char cmd_buf[255];
for (;;) {
my_write(1, "$ ", 2);
int count = my_read(0, cmd_buf, 255);
cmd_buf[count - 1] = 0; // e.g., /bin/ls\n -> /bin/ls\0
pid_t fork_result = my_fork();
if (fork_result == 0) {
my_execve(cmd_buf, 0, 0); // NULL argv/envp is accepted by Linux, but non-portable
break; // only reached if execve failed
} else {
// wait
siginfo_t info;
my_waitid(P_ALL, 0, &info, WEXITED, 0);
}
}
my_exit(0);
}
```
Compile via GNU assembler:
```sh!
as a.S -o a.o # use a.o as a normal object file
```
Compile main.c as usual:
```sh!
gcc -c main.c -o main.o
```
Link:
```sh!
ld a.o main.o --entry my_main \
-z noexecstack # will enable NX bit on stack at runtime
# -z relro # partial RELRO
# -z relro -z now # full RELRO
```
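The result is a freestanding static binary: there is no crt0/libc start-up code, so the entry point `my_main` must never simply return; it has to terminate the process itself via `my_exit`, as the example does.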
:::info
**[Extra] GNU Linker Script**
*Customizable binary linking rules for fine-tuning output memory layout.*
> Compile-Time Linker `ld` x Runtime Loader `ld-linux.so`.
[Castor_and_Pollux/firmware/scripts/samd21g18a.ld - GitHub](https://github.com/wntrblm/Castor_and_Pollux/blob/main/firmware/scripts/samd21g18a.ld)
- `main.c`
```c=
// normal variables <-> symbol_table["var"] = <addr>
// linker's symbols <-> symbol_table["ld_sym"] = <value>
// (i.e., the symbol's address *is* the value: read it via &ld_sym)
extern char foo_start[], foo_end[]; // non-modifiable lvalues of type char[x]
// or
extern char text_start, text_end; // then use them like &text_start
__attribute__((section(".foo"))) int foo[22];
int data[4096] = {1};
int bss[4096];
int main() {
foo[10] = 42;
data[10] = 42;
bss[10] = 42;
unsigned char exit_code = foo_end - foo_start;
__asm__ volatile(
"movq $60, %%rax\n\t" // syscall number: exit
"movq %0, %%rdi\n\t" // exit code argument
"syscall\n\t"
:
: "r"((long) exit_code) // 64-bit operand
: "%rax", "%rdi"
);
}
```
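With `int foo[22]`, `foo_end - foo_start` is the size of the `.foo` output section, so the expected exit code is 88 (22 × 4 bytes), possibly a bit more if the script adds alignment padding inside `.foo`.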
- `linkcmds.ld`
> [!Tip] **How Linker Treats the Linker Script**
> If some constraints are not specified, the linker applies its default rules & automatically generates a suitable configuration for the target.
- `example.memory`
```ld=
/*
* how to use:
* INCLUDE example.memory
*
* SECTIONS {
* .output_section :
* {
* ...
* } > RAM AT> BOOTLOADER
* }
*/
MEMORY
{
BOOTLOADER (rx) : ORIGIN = 0x00000, LENGTH = 8K
DATA (rw) : ORIGIN = 0x02000, LENGTH = 64K
RAM (rxw) : ORIGIN = 0x12000, LENGTH = 16M - 72K
}
bootloader_start = ORIGIN(BOOTLOADER);
bootloader_size = LENGTH(BOOTLOADER);
ram_start = ORIGIN(RAM);
```
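The attribute letters (`r`, `w`, `x`) only guide where `ld` places input sections that the script does not map explicitly; they are layout hints, not runtime memory protections.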
:::warning
**Load Memory Address (LMA) vs. Virtual Memory Address (VMA)**
The program loader will load the section to its LMA (e.g., ROM), & the actual runnable region for this section is at VMA (e.g., RAM).
If LMA != VMA, startup code has to copy the section from its load address into RAM before using it, e.g.:
```C=
#include <string.h>
// symbols set via the location counter
extern char data_start[], data_size[], data_load_start[];
extern char bootloader_start[], bootloader_size[], ram_start[];
void copy_to_ram(void) {
if (data_start != data_load_start)
memcpy(data_start, data_load_start, (size_t) data_size);
// or
if (bootloader_start != ram_start)
memcpy(ram_start, bootloader_start, (size_t) bootloader_size);
}
```
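A common companion step is zeroing `.bss`, which has no load image at all (a sketch; `bss_start`/`bss_end` are assumed to be defined by the linker script):
```c=
#include <string.h>
extern char bss_start[], bss_end[];   // assumed linker-script symbols
void zero_bss(void) {
    memset(bss_start, 0, (size_t)(bss_end - bss_start));
}
```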
- [Assign alias names to memory regions - GNU linker ld](https://sourceware.org/binutils/docs/ld/REGION_005fALIAS.html)
- [Source Code Reference - GNU linker ld](https://sourceware.org/binutils/docs/ld/Source-Code-Reference.html)
:::
```ld=
/* linker-defined symbols */
exported_symbol = 1234;
HIDDEN(non_exported_symbol = 1234);
ENTRY(main)
SECTIONS
{
/*
* put all input's *(.text) section to the output's .text section
* on the named memory REGION_TEXT:
* 1. set the section's virtual memory address (VMA) counter:
*
* - auto set:
* .output_section :
* // VMA starts at VMA counter with auto alignment.
* // (if MEMORY {...} is not defined)
*
* - with specific location:
* .output_section 0x1234 :
* // VMA starts at 0x1234.
*
* - with aligned specific location:
* .output_section ALIGN(0x1000) :
* // VMA starts at 0x1000, 0x2000, 0x3000, ...
*
* 2. control the section's load memory address (LMA):
*
* - set LMA same as VMA (by default):
* .output_section : AT(.)
*
* - set LMA starting at specific position:
* .output_section : AT(0x30000)
*/
. = 0xBEEF0001;
.text :
{
text_start = .;
KEEP(*(.isr_vector)) /* Interrupt vectors */
*(.text) /* Code */
x = 0x1234;
. = ALIGN(x); /* the aligned VMA location counter */
. = ((. + x - 1) / x) * x; /* equivalent hand-rolled form */
text_end = .;
}
text_size = SIZEOF(.text); /* same as (text_end - text_start) */
text_start = ADDR(.text); /* .text's virtual memory address */
text_load_start = LOADADDR(.text); /* .text's load memory address */
/* .foo then .data */
.foo : AT(LOADADDR(.text) + SIZEOF(.text))
{
foo_start = .;
KEEP(*(.foo))
foo_end = .;
}
/* .data then the remaining sections (including non-specified bss, ...) */
.data 0xDEAD0000 : AT(0xCCCC0000)
{ *(.data) }
/DISCARD/ :
{
*(.comment)
*.o(.eh_frame)
}
}
```
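Output sections not mentioned in the script ("orphans", e.g. `.bss` here) are still emitted: `ld` places them next to sections with similar attributes using its orphan-placement rules, unless an input-section pattern routes them into `/DISCARD/`.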
Compile & link:
```sh!
gcc <source-code> -c -o <object-file>
```
```sh!
ld <object-file> \
-T <linker-script> \
--entry main \
-o <program>
```
- For a libc program (link via `gcc` so libc & the crt start-up objects are pulled in):
```sh!
gcc <source-code> \
-T <linker-script> \
-o <program>
```
- Print gcc's default linker script:
```sh!
gcc -Wl,-verbose <source-code>
```

Print binary's symbols:
```sh!
readelf -sW <program>
```
Print binary's section headers info:
- ```sh!
objdump -hw <program>
```

- ```sh!
readelf -SW <program>
```

Execute:
```sh!
./<program>; echo exit_code: $?
```
:::warning
**Linker-Defined Symbols Are Stored in `.symtab` & May Be Stripped**
```c=
// use the symbols' addresses, not the values stored at them
extern char text_start[], text_end[];
int main() {
return (int) (text_end - text_start);
}
```
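After `strip <program>` these names disappear from `.symtab` (check with `readelf -sW`), but the program still runs correctly: the references were resolved at link time, so the symbol-table entries are only needed for inspection & debugging.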
:::
:::info
**[Extra] GNU Version Script**
*"API" Version control for exposed symbols of sharedlibraries.*
> Compile-Time Linker `ld` x Runtime Loader `ld-linux.so`.
[[Book] VERSION Command - LD](https://sourceware.org/binutils/docs/ld/VERSION.html)
[[Book] LD Version Scripts - GNU Gnulib](https://www.gnu.org/software/gnulib/manual/html_node/LD-Version-Scripts.html)
> [!Tip]
> Version script is a more advanced API control than `__attribute__((__visibility__("default")))`, ...
- `-Wl,--version-script=version.script`
```ld=
API {
global:
api_func1;
api_func2;
local:
*;
};
VERS_1.1 {
global:
foo1;
local:
old*;
original*;
new*;
};
VERS_1.2 {
foo2;
} VERS_1.1;
VERS_2.0 {
bar1; bar2;
extern "C++" {
ns::*;
"f(int, double)";
};
} VERS_1.2;
```
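The node name after a closing brace (e.g. `VERS_1.2 { ... } VERS_1.1;`) declares that version's predecessor, so the script also records the version hierarchy of the ABI.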
- `source.c`
```c=
// exporting functions with symbols of specific versions
__asm__(".symver original_foo,foo@");
__asm__(".symver old_foo,foo@VERS_1.1");
__asm__(".symver old_foo1,foo@VERS_1.2");
__asm__(".symver new_foo,foo@@VERS_2.0");
// each versioned alias needs a real definition in this translation unit (illustrative bodies)
int original_foo(void) { return 0; }
int old_foo(void) { return 1; }
int old_foo1(void) { return 2; }
int new_foo(void) { return 3; }
```
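To build & inspect such a library, something like `gcc -shared -fPIC source.c -Wl,--version-script=version.script -o libfoo.so` (the library name is illustrative) followed by `readelf --dyn-syms libfoo.so` shows the exported `foo@VERS_*` symbols, with `foo@@VERS_2.0` as the default version that new links bind against.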
:::warning
**[Extra] C++ Version of API Version Control**
```C++=
namespace lib {
inline namespace v1_1 {
void foo();
}
namespace v2 {
void foo();
}
} // namespace lib
```
```C++=
int main() {
lib::v1_1::foo();
lib::v2::foo();
lib::foo(); // resolves to lib::v1_1::foo via the inline namespace
}
```
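Because `v1_1` is an inline namespace, unqualified `lib::foo()` resolves to it, yet each version keeps its own mangled symbol (the namespace is part of the name, e.g. `_ZN3lib4v1_13fooEv`), so binaries built against an older default keep binding to that version.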
:::
## Misc.
### Apollo Program’s Guidance System
The flight control software that took Apollo 11 to the moon.
- **Apollo Guidance Computer**
> On-board flight programs of CM (Command Module `Comanche055`) & LM (Lunar (Excursion) Module `Luminary100`).
- source code (assembly): [Apollo-11 - chrislgarry - GitHub](https://github.com/chrislgarry/Apollo-11)
- **Apollo Guidance System Simulator**
> Simulation software, assembler, mechanical design files, Inertial Measurement Unit (IMU), user interface (DSKY), ...
- detailed docs & commentary: [Virtual AGC](https://www.ibiblio.org/apollo)
- source code: [Virtual Apollo Guidance Computer - virtualagc - GitHub](https://github.com/virtualagc/virtualagc)
- hardware design: [AGC Hardware - virtualagc - GitHub](https://github.com/virtualagc/agc_hardware)
:::info
**Simulated Saturn 5 Launch Checklist**
1. Press Enable IMU button.
1. Wait 85-90 seconds until NO ATT light turns off.
1. Enter V37E01E to initiate major mode 01 (Prelaunch or Service Initialization). PROG should read 01; if not, try again in a few seconds.
1. Wait until IMU is calibrated (pitch 90°).
1. Major mode 02 (Prelaunch or Service Gyrocompassing) will automatically start in a few seconds (PROG shows 02).
1. Press Launch button.
1. MET (Mission Elapsed Time) clock starts running.
1. Major mode 11 (Earth Orbit Insertion Monitor) will start after the detection of the launch event.
1. DSKY updates every two seconds with the latest state-vector data:
- R1 = velocity (XXXXX ft/s).
- R2 = the altitude rate (XXXXX ft/s).
- R3 = the altitude above the pad (XXXX.X nmi).
1. The powered flight lasts 11 minutes and 44 seconds. Both roll and pitch programs are executed.
1. You may monitor the orbit parameters at any time after the launch by entering V82E:
- R1 = the apocenter altitude (XXXX.X nmi).
- R2 = the pericenter altitude (XXXX.X nmi).
- R3 = the time to free fall (XX XX min:sec).
1. After a successful launch, both the apocenter and pericenter altitudes should be >90 nmi.
1. Press PRO to return to major mode 11 display.
1. Enter V06N32E to display time from perigee:
- R1 = 00XXX. hours
- R2 = 000XX. minutes
- R3 = 0XX.XX seconds
> DSKY interface:
[Moonjs: An Online Apollo Guidance Computer (AGC) Simulator](https://svtsim.com/moonjs/agc.html)
:::