<style>
.reveal {
font-size: 24px;
}
.reveal section img {
background:none;
border:none;
box-shadow:none;
}
pre.graphviz {
box-shadow: none;
}
</style>
# Computer Architecture
*(AMH chapters 2, 4)*
Performance Engineering, Lecture 1
Sergey Slotin
May 7, 2022
---
As software engineers, we love building and using *abstractions*
- Engineers using a module only need to know its *interface*
- Engineers working on the module itself get the freedom to optimize and refactor its implementation as long as it complies with its *contracts*
Examples: data structures, file systems, network protocols, programming languages
----
### Instruction Set Architectures
<!-- hardware engineers love abstraction too -->
An abstraction of a CPU is called an *instruction set architecture* (ISA), and it defines how a computer should interpret the machine language:
- instructions and their binary encodings
- counts, sizes, and purposes of registers
- the memory model and the input/output model
Similar to software interfaces, ISAs can be extended too (adding new instructions)
<!--: they are often updated, mostly in a backward-compatible way, adding new and more specialized instructions that can improve performance-->
----
There used to be many competing ISAs
![](https://imgs.xkcd.com/comics/standards.png =400x)
https://xkcd.com/927/
----
Maintaining an ISA is far costlier than maintaining a messaging app, so by now we basically have just two:
- **Arm** chips are used in almost all mobile devices (and most other computer-like devices such as TVs, smart fridges, microwaves, [car autopilots](https://en.wikipedia.org/wiki/Tesla_Autopilot))
- **x86** chips are used in almost all servers and desktops (notable exceptions: M1 MacBooks, AWS Graviton, world's #1 supercomputer [Fugaku](https://en.wikipedia.org/wiki/Fugaku_(supercomputer)))
----
### RISC vs CISC
The main difference between Arm and x86 is that of architectural complexity:
- Arm CPUs are *reduced* instruction set computers (**RISC**): they keep the instruction set small and highly optimized (but less common operations have to be implemented with several instructions)
- x86 CPUs are *complex* instruction set computers (**CISC**): they improve performance by adding many separate instructions (most of which are rarely used)
RISC designs result in **simpler** and **smaller** chips, translating to lower manufacturing costs and power usage — which is why they are preferred for mobile devices
---
## Assembly
*Machine language* is a stream of binary-encoded instructions specifying
- the instruction number (called *opcode*)
- what its *operands* are (if there are any)
- where to store the *result* (if one is produced)
*Assembly language* is a human-friendly rendition of machine language that uses mnemonic codes to refer to machine code instructions and registers
----
To generate assembly from C/C++:
- `g++ -S file.cpp` (outputs `file.s`)
- https://godbolt.org/
----
C language:
```c
void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
```
Arm assembly:
```arm
; *a = x0, *b = x1, *c = x2
ldr w0, [x0] ; load 4 bytes from wherever x0 points into w0
ldr w1, [x1] ; load 4 bytes from wherever x1 points into w1
add w0, w0, w1 ; add w0 with w1 and save the result to w0
str w0, [x2]   ; write contents of w0 to wherever x2 points
```
x86 assembly:
```asm
; *a = rsi, *b = rdi, *c = rdx
mov eax, DWORD PTR [rsi] ; load 4 bytes from wherever rsi points into eax
add eax, DWORD PTR [rdi] ; add whatever is stored at rdi to eax
mov DWORD PTR [rdx], eax ; write contents of eax to wherever rdx points
```
(We focus on x86 from now on)
----
```asm
; *a = rsi, *b = rdi, *c = rdx
mov eax, DWORD PTR [rsi] ; load 4 bytes from wherever rsi points into eax
add eax, DWORD PTR [rdi] ; add whatever is stored at rdi to eax
mov DWORD PTR [rdx], eax ; write contents of eax to wherever rdx points
```
- A program is a sequence of instructions, each written as its name followed by a variable number of operands
- The result is written to the *first* operand, which may also be used in the computation
- The `[reg]` syntax is used for "dereferencing" a pointer stored in a register
(it needs to be prefixed with size information: `BYTE`, `WORD`, `DWORD`, `QWORD`, etc.)
- The `;` sign is used for line comments, similar to `#` and `//` in other languages
----
Instruction mnemonics are very terse for historical reasons: back when people used to write assembly by hand, one less character to type was one step away from insanity.
- `mov` is for "move (load/store) a word"
- `inc` is for "increment by 1"
- `mul` is for "multiply"
- `idiv` is for "integer division"
*Most* instructions write their result into the first operand, which can also be involved in the computation like in the `add eax, [rdi]` case. Operands can be either **registers**, **constant values**, or **memory locations**
----
### Registers
- There are 16 general-purpose registers named `rax`, `rbx`, `rcx`, `rdx`, `rdi`, `rsi`, `rbp`, `rsp` and `r8`-`r15`
- There are also 32-, 16-, and 8-bit registers with similar names (`rax` → `eax` → `ax` → `al`) that are *aliased* with the larger ones: e. g. `eax` is the lowest 32 bits of `rax`
- a separate set of floating-point/SIMD registers
- a few auxiliary registers used for control flow (e. g. the instruction pointer `rip`)
----
### Constants/immediates
- Constants are just integer or floating-point values: `42`, `0x2a`, `3.14`, `6.02e23`
- They are more commonly called *immediate values* because they are embedded right into the machine code
- Because immediates complicate instruction encoding, some instructions don't support them (you have to move the value to a register first)
- There are also string constants like `hello` or `world\n`, but we won't cover them
----
### Moving data
- Some instructions support many different operand types, in which case they perform slightly different operations and are considered different instructions
- The `mov` instruction has ~20 different flavors, all moving data either between the memory and registers or just between two registers
- Despite the name, it doesn't *move* a value into a register but *copies* it
- When used between registers, it is usually eliminated during *register renaming* and takes zero cycles on modern CPUs
- Some common operations, such as `add` with a memory operand, can read data directly from memory and don't need a separate `mov`
----
### Addressing Modes
```
SIZE PTR [base + index * scale + displacement]
```
- `base` and `index` are registers (think `a[i] = *(a + i)`)
- `displacement` is an integer constant
- `scale` can be 1, 2, 4, or 8
- loads bytes from `base + index * scale + displacement`
<span>`lea` ("load effective address") calculates the address and saves it in a register, which is useful for arithmetic tricks: e. g. you can multiply by `3`, `5`, or `9` in one cycle by calculating it as `{2,4,8} * x + x`</span> <!-- .element: class="fragment" data-fragment-index="1" -->
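Back in C++, the same trick looks like this (function names are made up); a compiler can turn each of these multiplications into a single `lea`:

```cpp
// Each multiplication below typically compiles to one lea:
//   x * 3 -> lea eax, [rdi + rdi*2]
//   x * 5 -> lea eax, [rdi + rdi*4]
//   x * 9 -> lea eax, [rdi + rdi*8]
int times3(int x) { return x * 3; }
int times5(int x) { return x * 5; }
int times9(int x) { return x * 9; }
```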
----
### Alternative Syntax
There are actually multiple *assemblers*, most using one of two syntaxes:
- The *AT&T syntax*, used by default by all Linux tools
- The *Intel syntax*, used by default by… Intel
----
### AT&T syntax
```asm
movl (%rsi), %eax
addl (%rdi), %eax
movl %eax, (%rdx)
```
- The *last* operand is used to specify the destination
- Registers and constants need to be prefixed by `%` and `$` respectively (`addl $1, %rdx` increments `rdx`)
- Memory addressing uses `displacement(%base, %index, scale)`
- Both `;` and `#` are used for line comments, `/* */` is used for block comments
- The instruction names need to be "suffixed" (`addq`, `movl`, `cmpq`) to specify what size operands are being manipulated
(We will stick to Intel syntax)
---
What does this code do?
```nasm
loop:
add edx, DWORD PTR [rax]
add rax, 4
cmp rax, rcx
jne loop
```
It calculates the sum of a 32-bit integer array, just as a simple `for` loop would: <!-- .element: class="fragment" data-fragment-index="1" -->
```cpp
int s = 0;
for (int i = 0; i < n; i++)
s += a[i];
```
<!-- .element: class="fragment" data-fragment-index="1" -->
- The "body" of the loop is `add edx, DWORD PTR [rax]`: load data from the iterator `rax` and add it to the accumulator <!-- .element: class="fragment" data-fragment-index="1" --> `edx`
- Next, we move the iterator 4 bytes forward with <!-- .element: class="fragment" data-fragment-index="1" --> `add rax, 4`
- Then, a slightly more complicated thing happens <!-- .element: class="fragment" data-fragment-index="1" -->
----
### Jumps
- Assembly is a very minimal language (because it needs to be)
- It doesn't have `if`-s, `for`-s, functions, or other control flow that other programming languages tend to have
- What it *does* have is `goto`, or "jump", which moves the instruction pointer to a location specified by its operand
- The target location may be an absolute address in memory, an address relative to the current one, or even a value computed at runtime
- To avoid the headache of managing these addresses manually, you can mark any instruction with a string followed by `:` and then use that string as a label
```nasm
hip:
jmp hop
hop:
jmp hip
```
----
- An **unconditional** jump (`jmp`) can only implement `while (true)`-style loops or stitch parts of a program together
- A family of **conditional** jumps is used to implement actual control flow
- They use a special `FLAGS` register, which first needs to be populated by instructions that perform some kind of checks
----
```nasm
loop:
add edx, DWORD PTR [rax]
add rax, 4
cmp rax, rcx
jne loop
```
- `cmp rax, rcx` compares the iterator `rax` with the end-of-array pointer `rcx`, updating the `FLAGS` register
- It is then accessed by `jne loop`, which looks up a certain bit there that tells whether the two values are equal and either jumps back to the beginning or continues to the next instruction, breaking the loop
----
### An Alternative Approach
Many other instructions populate `FLAGS`: for example, `add` sets flags based on whether the result is zero or negative and whether an overflow or underflow occurred, and so on
```nasm
    mov  rax, -100  ; replace 100 with the array size in bytes
loop:
    add  edx, DWORD PTR [rax + 100 + rcx]
    add  rax, 4
    jnz  loop       ; jump back if "add rax, 4" did not produce zero
```
One instruction shorter in the repeated part (not huge, but not negligible either)
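In C terms, the trick corresponds to something like the following sketch (`sum_neg` is a made-up name):

```cpp
// Count the index up from -n to 0 so that the "i != 0" check can reuse
// the flags already set by the increment, saving a separate cmp
int sum_neg(const int *a, int n) {
    int s = 0;
    for (int i = -n; i != 0; i++)
        s += a[i + n]; // the "+n" offset plays the role of the constant
                       // displacement in the addressing mode
    return s;
}
```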
---
## Loop Unrolling
Managing the loop variables adds a lot of overhead per element; we can *unroll* the loop by grouping its iterations together:
```c++
for (int i = 0; i < n; i += 4) {
s += a[i];
s += a[i + 1];
s += a[i + 2];
s += a[i + 3];
}
```
In assembly, it would look something like this:
```nasm
loop:
add edx, [rax]
add edx, [rax+4]
add edx, [rax+8]
add edx, [rax+12]
add rax, 16
cmp rax, rsi
jne loop
```
We need 3 loop control instructions for 4 useful ones (an improvement from $\frac{1}{4}$ to $\frac{4}{7}$), and this can be continued to reduce the overhead to almost zero
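In real code `n` may not be divisible by 4, so a manual unroll needs a scalar tail; a sketch (`sum_unrolled` is a made-up name):

```cpp
int sum_unrolled(const int *a, int n) {
    int s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) { // main unrolled body
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)           // scalar tail for the remaining 0-3 elements
        s += a[i];
    return s;
}
```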
----
`-funroll-loops` or `#pragma GCC unroll n`:
```c++
#pragma GCC unroll 4
for (int i = 0; i < n; i++) {
// ...
}
```
Loop unrolling makes the binary larger and may or may not make it run faster
---
## Functions
To "call a function" in assembly, you need to jump to its beginning and then jump back, but then two important problems arise:
1. What if the caller stores data in the same registers as the callee?
2. Where is "back"?
Both of these concerns can be solved with a dedicated memory location where we can stash the information we need to return from the function before calling it <!-- .element: class="fragment" data-fragment-index="1" -->
----
### The Stack
- The *base pointer* marks the start of the stack and is stored in `rbp`
- The *stack pointer* marks the last element of the stack and is stored in `rsp`
----
### Calling a function
- push all your local variables onto the stack
- push the current instruction pointer
- jump to the beginning of the function
----
### Returning from a function
- look at the return address stored on top of the stack
- jump there (back into the caller)
- carefully read all the variables stored on the stack back into their registers
----
There are 4 special instructions for doing this:
- `push` decrements the stack pointer and then writes data to where it points (the stack grows downward)
- `pop` reads data from the stack pointer and increments it
- `call` puts the address of the next instruction on the stack and jumps to a label
- `ret` reads the return address from the top of the stack and jumps to it
You would call them "syntactic sugar" if they weren't actual hardware instructions
----
```nasm
; "push rax"
sub rsp, 8
mov QWORD PTR[rsp], rax
; "pop rax"
mov rax, QWORD PTR[rsp]
add rsp, 8
; "call func"
push rip ; <- instruction pointer (although accessing it like that is probably illegal)
jmp func
; "ret"
pop rcx ; <- choose any unused register
jmp rcx
```
----
### Example
```c
int square(int x) {
return x * x;
}
int distance(int x, int y) {
return square(x) + square(y);
}
```
`square`, being a simple one-argument function, can be implemented like this: <!-- .element: class="fragment" data-fragment-index="1" -->
```nasm
square: ; x = edi, ret = eax
imul edi, edi
mov eax, edi
ret
```
<!-- .element: class="fragment" data-fragment-index="1" -->
----
```nasm
distance: ; x = rdi/edi, y = rsi/esi, ret = rax/eax
push rdi
push rsi
call square ; eax = square(x)
pop rsi
pop rdi
mov ebx, eax ; save x^2
mov rdi, rsi ; move new x=y
push rdi
push rsi
call square ; eax = square(x=y)
pop rsi
pop rdi
add eax, ebx ; x^2 + y^2
ret
```
----
- Moving data to and from the stack creates overhead
- You have to do this because you don't know whether the callee modifies the registers you keep your data in
- `square` doesn't modify anything except `eax` and `edi`, so it is safe to call it as it is:
```nasm
distance:
call square
mov ebx, eax
mov edi, esi
call square
add eax, ebx
ret
```
----
- We are still implicitly accessing stack memory: we need to push and pop the instruction pointer on each function call
- In simple cases like these, we can *inline* function calls by stitching callee's code into the caller and resolving conflicts over registers:
```nasm
distance:
imul edi, edi ; edi = x^2
imul esi, esi ; esi = y^2
add edi, esi
mov eax, edi ; there is no "add eax, edi, esi", so we need a separate mov
ret
```
It may or may not be beneficial; to force it:
<!-- .element: class="fragment" data-fragment-index="1" -->
```c++
#define FORCE_INLINE inline __attribute__((always_inline))
```
<!-- .element: class="fragment" data-fragment-index="1" -->
----
```nasm
distance:
imul edi, edi ; edi = x^2
imul esi, esi ; esi = y^2
lea eax, [rdi+rsi] ; eax = x^2 + y^2
ret
```
---
### Tail Call Elimination
```cpp
int factorial(int n) {
if (n == 0)
return 1;
return factorial(n - 1) * n;
}
```
Equivalent assembly:
```nasm
; n = edi, ret = eax
factorial:
test edi, edi ; test if a value is zero
jne nonzero ; (the machine code of "cmp rax, 0" would be one byte longer)
mov eax, 1 ; return 1
ret
nonzero:
    push rdi        ; save n to use later in the multiplication
                    ; (push edi is not encodable in 64-bit mode)
    sub  edi, 1
    call factorial  ; call f(n - 1)
    pop  rdi
    imul eax, edi
    ret
```
----
- Inlining is straightforward to do when the callee doesn't make any other function calls, or at least if these calls are not recursive
- If the function is recursive, it is still often possible to make it "call-less" by restructuring it, effectively turning it into a loop
- This is possible when the function is *tail recursive* and returns right after making a recursive call
----
```cpp
int factorial(int n, int p = 1) {
if (n == 0)
return p;
return factorial(n - 1, p * n);
}
```
This implementation is tail-recursive, and it can be easily folded into a loop:
```nasm
; assuming n > 0
factorial:
mov eax, 1
loop:
imul eax, edi
sub edi, 1
jne loop
ret
```
Without tail call elimination, functional programs require *way* more time and memory
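The loop above, written back in C (a sketch of what the compiler effectively produces):

```cpp
// What tail call elimination turns the accumulator version into
int factorial_loop(int n) {
    int p = 1;
    while (n != 0) {
        p *= n;
        n -= 1;
    }
    return p;
}
```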
---
## Indirect branching
You can also jump by a non-constant value stored inside a register:
```nasm
jmp rax
```
This is called a *computed jump*, and it has a few interesting applications related to dynamic languages and implementing more complex control flow
----
### Multiway Branch
```cpp
double grade_to_gpa(char grade) { // NAN requires <cmath>
    switch (grade) {
        case 'A': return 4.0;
        case 'B': return 3.0;
        case 'C': return 2.0;
        case 'D': return 1.0;
        case 'E':
        case 'F': return 0.0;
        default:  return NAN;
    }
}
```
A pattern useful in parsers and other state machines
----
## Computed Goto
- Instead of making $n$ conditional branches, we can create a *branch table* containing pointers/offsets to possible jump locations and index it with the `state` variable taking values in the $[0, n)$ range.
- Compilers use this technique when the values are densely packed together, but it can also be implemented explicitly with a *computed goto*:
```cpp
void weather_in_russia(int season) {
static const void* table[] = {&&winter, &&spring, &&summer, &&fall};
goto *table[season];
winter:
printf("Freezing\n");
return;
spring:
printf("Dirty\n");
return;
summer:
printf("Dry\n");
return;
fall:
printf("Windy\n");
return;
}
```
---
## Dynamic Dispatch
```cpp
struct Animal {
    virtual void speak() { printf("<abstract animal sound>\n"); }
};
struct Dog : Animal {
    void speak() override { printf("Bark\n"); }
};
struct Cat : Animal {
    void speak() override { printf("Meow\n"); }
};
```
We need to somehow invoke the right implementation:
```cpp
Dog sparkles;
Cat mittens;
Animal *catdog = (rand() & 1) ? &sparkles : &mittens;
catdog->speak();
```
----
C++ does it with a *virtual method table*:
- For all concrete implementations, the compiler **pads** all their methods to a uniform length and writes them sequentially somewhere in the instruction memory
- It adds a *run-time type information* field to the structure: a pointer to the place in that memory region where the right implementations of the class's methods are stored
- During a virtual method call, that offset field is fetched from the instance of a structure, and a normal function call is made with it
Dynamic languages do something similar
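What the compiler generates can be imitated by hand with a table of function pointers (a sketch; all names here are made up):

```cpp
// A hand-rolled "virtual method table": one entry per concrete type
const char *dog_speak() { return "Bark"; }
const char *cat_speak() { return "Meow"; }

const char *(*const speak_table[])() = {dog_speak, cat_speak};

struct AnimalHandle {
    int type; // 0 = dog, 1 = cat; plays the role of the RTTI field
};

const char *speak(AnimalHandle a) {
    return speak_table[a.type](); // an indirect call through the table
}
```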
----
Dynamic dispatch adds some overhead:
- Class size increases by the size of a pointer (8 bytes on 64-bit platforms)
- The binary size itself increases a little bit
- The compiler most likely won't be able to inline the function call itself
- You may need to spend another 15 cycles or so for pipeline flushing reasons