<style> .reveal { font-size: 24px; } .reveal section img { background:none; border:none; box-shadow:none; } pre.graphviz { box-shadow: none; } </style>

# Computer Architecture

*(AMH chapters 2, 4)*

Performance Engineering, Lecture 1

Sergey Slotin

May 7, 2022

---

As software engineers, we love building and using *abstractions*

- Engineers working on higher-level modules only need to know a module's *interface*
- Engineers working on the module itself get the freedom to optimize and refactor its implementation as long as it complies with its *contracts*

Examples: data structures, file systems, network protocols, programming languages

----

### Instruction Set Architectures

<!-- hardware engineers love abstraction too -->

An abstraction of a CPU is called an *instruction set architecture* (ISA), and it defines how a computer should interpret the machine language:

- instructions and their binary encodings
- counts, sizes, and purposes of registers
- the memory model and the input/output model

Similar to software interfaces, ISAs can be extended too (adding new instructions)

<!--: they are often updated, mostly in a backward-compatible way, adding new and more specialized instructions that can improve performance-->

----

There used to be many competing ISAs

![](https://imgs.xkcd.com/comics/standards.png =400x)

https://xkcd.com/927/

----

Maintaining an ISA is far costlier than maintaining yet another messenger app, so now we basically have just two:

- **Arm** chips are used in almost all mobile devices (and most other computer-like devices such as TVs, smart fridges, microwaves, [car autopilots](https://en.wikipedia.org/wiki/Tesla_Autopilot))
- **x86** chips are used in almost all servers and desktops (notable exceptions: M1 MacBooks, AWS Graviton, the world's #1 supercomputer [Fugaku](https://en.wikipedia.org/wiki/Fugaku_(supercomputer)))

----

### RISC vs CISC

The main difference between Arm and x86 is that of architectural complexity:

- Arm CPUs are *reduced* instruction set computers (**RISC**): they keep the
instruction set small and highly optimized (but less common operations have to be implemented with several instructions)
- x86 CPUs are *complex* instruction set computers (**CISC**): they improve performance by adding many separate instructions (most of which are rarely used)

RISC designs result in **simpler** and **smaller** chips, projecting to lower manufacturing costs and power usage — which is why they are preferred for mobile devices

---

## Assembly

*Machine language* is a stream of binary-encoded instructions specifying

- the instruction number (called *opcode*)
- what its *operands* are (if there are any)
- where to store the *result* (if one is produced)

*Assembly language* is a human-friendly rendition of machine language that uses mnemonic codes to refer to machine code instructions and registers

----

To generate assembly from C/C++:

- `g++ ... -S filename.s`
- https://godbolt.org/

----

C language:

```c
void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
```

Arm assembly:

```arm
; a = x0, b = x1, c = x2
ldr w0, [x0]    ; load 4 bytes from wherever x0 points into w0
ldr w1, [x1]    ; load 4 bytes from wherever x1 points into w1
add w0, w0, w1  ; add w0 with w1 and save the result to w0
str w0, [x2]    ; write contents of w0 to wherever x2 points
```

x86 assembly:

```asm
; a = rsi, b = rdi, c = rdx
mov eax, DWORD PTR [rsi]  ; load 4 bytes from wherever rsi points into eax
add eax, DWORD PTR [rdi]  ; add whatever is stored at rdi to eax
mov DWORD PTR [rdx], eax  ; write contents of eax to wherever rdx points
```

(We focus on x86 from now on)

----

```asm
; a = rsi, b = rdi, c = rdx
mov eax, DWORD PTR [rsi]  ; load 4 bytes from wherever rsi points into eax
add eax, DWORD PTR [rdi]  ; add whatever is stored at rdi to eax
mov DWORD PTR [rdx], eax  ; write contents of eax to wherever rdx points
```

- A program is a sequence of instructions, each written as its name followed by a variable number of operands
- The result is written in the *first* operand, which may also be
used in computation
- The `[reg]` syntax is used for "dereferencing" a pointer stored in a register (it needs to be prefixed with size information: `BYTE`, `WORD`, `DWORD`, `QWORD`, etc.)
- The `;` sign is used for line comments, similar to `#` and `//` in other languages

----

Instruction mnemonics are very terse for historical reasons: back when people used to write assembly by hand, one less character to type was one step away from insanity.

- `mov` is for "store/load a word"
- `inc` is for "increment by 1"
- `mul` is for "multiply"
- `idiv` is for "integer division"

*Most* instructions write their result into the first operand, which can also be involved in the computation like in the `add eax, [rdi]` case.

Operands can be either **registers**, **constant values**, or **memory locations**

----

### Registers

- There are 16 general-purpose registers named `rax`, `rbx`, `rcx`, `rdx`, `rdi`, `rsi`, `rbp`, `rsp` and `r8`-`r15`
- 32-, 16-, and 8-bit registers with similar names (`rax` → `eax` → `ax` → `al`) that are *aliased* with larger ones: e. g. the lowest 32 bits of `rax` are `eax`
- a separate set of floating-point/SIMD registers
- a few auxiliary registers used for control flow (e. g.
the instruction pointer `rip`)

----

### Constants/immediates

- Constants are just integer or floating-point values: `42`, `0x2a`, `3.14`, `6.02e23`
- They are more commonly called *immediate values* because they are embedded right into the machine code
- Because it complicates instruction encodings, some instructions don't support immediates (you have to move them to a register first)
- There are also string constants like `hello` or `world\n`, but we won't cover them

----

### Moving data

- Some instructions support many different operand types, in which case they perform slightly different operations and are considered different instructions
- The `mov` instruction has ~20 different flavors, all moving data either between memory and registers or just between two registers
- Despite the name, it doesn't *move* a value into a register but *copies* it
- When used between registers, it performs *register renaming* and takes zero cycles
- Some common operations, such as `add` with a memory operand, can read one of their operands directly from memory and don't need a separate `mov`

----

### Addressing Modes

```
SIZE PTR [base + index * scale + displacement]
```

- `base` and `index` are registers (think `a[i] = *(a + i)`)
- `displacement` is an integer constant
- `scale` can be either 1, 2, 4, or 8
- the operand loads `SIZE` bytes from the address `base + index * scale + displacement`

<span>`lea` ("load effective address") calculates the address and saves it in a register, which is useful for arithmetic tricks: e. g.
you can multiply by `3`, `5`, or `9` in one cycle by calculating it as `{2,4,8} * x + x`</span> <!-- .element: class="fragment" data-fragment-index="1" -->

----

### Alternative Syntax

There are actually multiple *assemblers*, most using one of the two syntaxes:

- The *AT&T syntax*, used by default by all Linux tools
- The *Intel syntax*, used by default by… Intel

----

### AT&T syntax

```asm
movl (%rsi), %eax
addl (%rdi), %eax
movl %eax, (%rdx)
```

- The *last* operand is used to specify the destination
- Registers and constants need to be prefixed by `%` and `$` respectively (`addq $1, %rdx` increments `rdx`)
- Memory addressing uses `displacement(%base, %index, scale)`
- The `#` sign is used for line comments, and `/* */` is used for block comments
- The instruction names need to be "suffixed" (`addq`, `movl`, `cmpq`) to specify what size operands are being manipulated

(We will stick to Intel syntax)

---

What does this code do?

```nasm
loop:
    add edx, DWORD PTR [rax]
    add rax, 4
    cmp rax, rcx
    jne loop
```

It calculates the sum of a 32-bit integer array, just as a simple `for` loop would: <!-- .element: class="fragment" data-fragment-index="1" -->

```cpp
int s = 0;
for (int i = 0; i < n; i++)
    s += a[i];
```
<!-- .element: class="fragment" data-fragment-index="1" -->

- The "body" of the loop is `add edx, DWORD PTR [rax]`: it loads data from the address in the iterator `rax` and adds it to the accumulator `edx` <!-- .element: class="fragment" data-fragment-index="1" -->
- Next, we move the iterator 4 bytes forward with `add rax, 4` <!-- .element: class="fragment" data-fragment-index="1" -->
- Then, a slightly more complicated thing happens <!-- .element: class="fragment" data-fragment-index="1" -->

----

### Jumps

- Assembly is a very minimal language (because it needs to be)
- It doesn't have `if`-s, `for`-s, functions, or other control flow that other programming languages tend to have
- What it *does* have is `goto`, or "jump", which moves the instruction pointer to a location specified by
its operand
- The location may be either an absolute address in memory, an address relative to the current one, or even one computed at runtime
- To avoid the headache of managing these addresses, you can mark any instruction with a string followed by `:`, and then use this string as a label

```nasm
hip: jmp hop
hop: jmp hip
```

----

- An **unconditional** jump (`jmp`) can only be used to implement `while (true)` kinds of loops or to stitch parts of a program together
- A family of **conditional** jumps is used to implement actual control flow
- They use a special `FLAGS` register, which first needs to be populated by instructions that perform some kind of check

----

```nasm
loop:
    add edx, DWORD PTR [rax]
    add rax, 4
    cmp rax, rcx
    jne loop
```

- `cmp rax, rcx` compares the iterator `rax` with the end-of-array pointer `rcx`, updating the `FLAGS` register
- It is then accessed by `jne loop`, which looks up a certain bit there that tells whether the two values are equal and either jumps back to the beginning or continues to the next instruction, breaking the loop

----

### An Alternative Approach

Many other instructions populate `FLAGS`: for example, `add` sets flags based on whether the result is zero or negative, whether an overflow or underflow occurred, and so on

```nasm
    mov rax, -100  ; replace 100 with the array size (in bytes)
loop:
    add edx, DWORD PTR [rax + 100 + rcx]
    add rax, 4
    jnz loop       ; checks if the result is zero
```

One instruction shorter in the repeated part (not huge, but not negligible either)

---

## Loop Unrolling

There is a lot of overhead in processing a single element that comes from managing the loop variables; we can *unroll* the loop by grouping its iterations together:

```c++
for (int i = 0; i < n; i += 4) {
    s += a[i];
    s += a[i + 1];
    s += a[i + 2];
    s += a[i + 3];
}
```

In assembly, it would look something like this:

```nasm
loop:
    add edx, [rax]
    add edx, [rax+4]
    add edx, [rax+8]
    add edx, [rax+12]
    add rax, 16
    cmp rax, rsi
    jne loop
```

We need 3 loop control instructions for 4 useful ones
(an improvement in the fraction of useful instructions from $\frac{1}{4}$ to $\frac{4}{7}$), and this can be continued to reduce the overhead to almost zero

----

`-funroll-loops` or `#pragma GCC unroll n`:

```c++
#pragma GCC unroll 4
for (int i = 0; i < n; i++) {
    // ...
}
```

Loop unrolling makes the binary larger and may or may not make it run faster

---

## Functions

To "call a function" in assembly, you need to jump to its beginning and then jump back, but then two important problems arise:

1. What if the caller stores data in the same registers as the callee?
2. Where is "back"?

Both of these concerns can be solved with a dedicated memory location where we can stash the information we need to return from the function before calling it <!-- .element: class="fragment" data-fragment-index="1" -->

----

### The Stack

- The *base pointer* marks the start of the stack and is stored in `rbp`
- The *stack pointer* marks the last element of the stack and is stored in `rsp`

----

### Calling a function

- push all your local variables onto the stack
- push the current instruction pointer
- jump to the beginning of the function

----

### Returning from a function

- look at the pointer stored on top of the stack
- jump there (the return point in the caller)
- carefully read all the variables stored on the stack back into their registers

----

There are 4 special instructions for doing this:

- `push` writes data at the stack pointer and decrements it
- `pop` reads data from the stack pointer and increments it
- `call` puts the address of the next instruction on the stack and jumps to a label
- `ret` reads the return address from the top of the stack and jumps to it

You would call them "syntactic sugar" if they weren't actual hardware instructions

----

```nasm
; "push rax"
sub rsp, 8
mov QWORD PTR [rsp], rax

; "pop rax"
mov rax, QWORD PTR [rsp]
add rsp, 8

; "call func"
push rip  ; <- instruction pointer (although accessing it like that is probably illegal)
jmp func

; "ret"
pop rcx   ; <- choose any unused register
jmp rcx
```

----

### Example
```c
int square(int x) {
    return x * x;
}

int distance(int x, int y) {
    return square(x) + square(y);
}
```

`square`, being a simple one-argument function, can be implemented like this: <!-- .element: class="fragment" data-fragment-index="1" -->

```nasm
square:             ; x = edi, ret = eax
    imul edi, edi
    mov  eax, edi
    ret
```
<!-- .element: class="fragment" data-fragment-index="1" -->

----

```nasm
distance:            ; x = rdi/edi, y = rsi/esi, ret = rax/eax
    push rdi
    push rsi
    call square      ; eax = square(x)
    pop  rsi
    pop  rdi

    mov  ebx, eax    ; save x^2
    mov  rdi, rsi    ; move new x=y

    push rdi
    push rsi
    call square      ; eax = square(x=y)
    pop  rsi
    pop  rdi

    add  eax, ebx    ; x^2 + y^2
    ret
```

----

- Moving data to and from the stack creates overhead
- You have to do this because you don't know whether the callee modifies the registers you hold your data in
- `square` doesn't modify anything except `edi` and `eax`, so it is safe to call it as it is:

```nasm
distance:
    call square
    mov  ebx, eax
    mov  edi, esi
    call square
    add  eax, ebx
    ret
```

----

- We are still implicitly accessing stack memory: we need to push and pop the instruction pointer on each function call
- In simple cases like these, we can *inline* function calls by stitching the callee's code into the caller and resolving conflicts over registers:

```nasm
distance:
    imul edi, edi  ; edi = x^2
    imul esi, esi  ; esi = y^2
    add  edi, esi
    mov  eax, edi  ; there is no "add eax, edi, esi", so we need a separate mov
    ret
```

It may or may not be beneficial; to force it: <!-- .element: class="fragment" data-fragment-index="1" -->

```c++
#define FORCE_INLINE inline __attribute__((always_inline))
```
<!-- .element: class="fragment" data-fragment-index="1" -->

----

```nasm
distance:
    imul edi, edi       ; edi = x^2
    imul esi, esi       ; esi = y^2
    lea  eax, [rdi+rsi] ; eax = x^2 + y^2
    ret
```

---

### Tail Call Elimination

```cpp
int factorial(int n) {
    if (n == 0)
        return 1;
    return factorial(n - 1) * n;
}
```

Equivalent assembly:

```nasm
; n = edi, ret = eax
factorial:
    test edi, edi   ; test if a value is zero
    jne  nonzero    ; (the machine code of "cmp rax, 0" would be one byte longer)
    mov  eax, 1     ; return 1
    ret
nonzero:
    push rdi        ; save n to use later in multiplication
    sub  edi, 1
    call factorial  ; call f(n - 1)
    pop  rdi
    imul eax, edi
    ret
```

----

- Inlining is straightforward to do when the callee doesn't make any other function calls, or at least if these calls are not recursive
- If the function is recursive, it is still often possible to make it "call-less" by restructuring it, effectively turning it into a loop
- This is possible when the function is *tail recursive* and returns right after making a recursive call

----

```cpp
int factorial(int n, int p = 1) {
    if (n == 0)
        return p;
    return factorial(n - 1, p * n);
}
```

This implementation is tail-recursive, and it can be easily folded into a loop:

```nasm
; assuming n > 0
factorial:
    mov  eax, 1
loop:
    imul eax, edi
    sub  edi, 1
    jne  loop
    ret
```

Without tail call elimination, functional programs require *way* more time and memory

---

## Indirect branching

You can also jump by a non-constant value stored inside a register:

```nasm
jmp rax
```

This is called a *computed jump*, and it has a few interesting applications related to dynamic languages and implementing more complex control flow

----

### Multiway Branch

```cpp
switch (grade) {
    case 'A': return 4.0; break;
    case 'B': return 3.0; break;
    case 'C': return 2.0; break;
    case 'D': return 1.0; break;
    case 'E':
    case 'F': return 0.0; break;
    default:  return NAN;
}
```

A pattern useful in parsers and other state machines

----

### Computed Goto

- Instead of making $n$ conditional branches, we can create a *branch table* containing pointers/offsets to possible jump locations and index it with a *state* variable taking values in the $[0, n)$ range.
- Compilers use this technique when the values are densely packed together, but it can also be implemented explicitly with a *computed goto*:

```cpp
void weather_in_russia(int season) {
    static const void* table[] = {&&winter, &&spring, &&summer, &&fall};
    goto *table[season];

    winter:
        printf("Freezing\n");
        return;
    spring:
        printf("Dirty\n");
        return;
    summer:
        printf("Dry\n");
        return;
    fall:
        printf("Windy\n");
        return;
}
```

---

## Dynamic Dispatch

```cpp
struct Animal {
    virtual void speak() { printf("<abstract animal sound>\n"); }
};

struct Dog : Animal {
    void speak() override { printf("Bark\n"); }
};

struct Cat : Animal {
    void speak() override { printf("Meow\n"); }
};
```

We need to somehow invoke the right implementation:

```cpp
Dog sparkles;
Cat mittens;

Animal *catdog = (rand() & 1) ? (Animal*) &sparkles : (Animal*) &mittens;
catdog->speak();
```

----

C++ does it with a *virtual method table*:

- For all concrete implementations, the compiler **pads** all their methods to a uniform length and writes them sequentially somewhere in the instruction memory
- It adds a *run-time type information* field to the structure, which is the offset in that memory region pointing to the right implementation of the methods of the class
- During a virtual method call, that offset field is fetched from the instance of the structure, and a normal function call is made with it

Dynamic languages do something similar

----

Dynamic dispatch adds some overhead:

- Class size increases by a couple of bytes or so
- The binary size itself increases a little bit
- The compiler most likely won't be able to inline the function call itself
- You may need to spend another 15 cycles or so for pipeline flushing reasons
{"metaMigratedAt":"2023-06-17T00:24:22.814Z","metaMigratedFrom":"YAML","title":"Computer Architecture","breaks":true,"slideOptions":"{\"theme\":\"white\"}","contributors":"[{\"id\":\"045b9308-fa89-4b5e-a7f0-d8e7d0b849fd\",\"add\":37965,\"del\":18889}]"}
    326 views