# CS 61C Notes

# Lecture 1: 8/26/20
Intro class, no notes

# Lecture 2: 8/28/20
* Refer to table for easy hex to binary conversion
* Numerals have infinite preceding zeroes
* Numerals are representations of numbers
* Numbers are theoretical
* For a set number of bits, if the result of addition (or sub, mul, div) cannot be represented, there is overflow
    * Not underflow; that would be negative overflow

## Number representations
* Positive numbers look the same in all of the representations below
* Sign and magnitude
    * Leftmost bit is the sign (0 is +, 1 is -)
    * Problem 1: Making a number more negative actually makes its binary representation bigger
    * Problem 2: There are 2 zeroes (+0 and -0)
    * Problem 3: Complicated circuitry
* One's complement
    * Flip all the bits for a negative number
    * Problem: there are still 2 zeroes (all 0s and all 1s)
    * Solution: shift everything over by 1 (leads to two's complement)
* Two's complement
    * Flip all bits and add $1$
    * Or: give the leftmost bit the weight $-2^{N-1}$ (all other bits keep positive weights)
    * All 1s represents $-1$
    * $2^{N-1}$ non-negatives, $2^{N-1}$ negatives
    * There is 1 more negative number than positive
* Bias encoding
    * Add a bias to bring the center of the range down to zero
    * Bias for N bits chosen as $-(2^{N-1}-1)$

# Lecture 3: 8/31/20
* ENIAC: First Electronic General-Purpose Computer
    * Long time to program (patch cords and switches)
* EDSAC: First General Stored-Program Computer
    * Programs held as numbers/bits in memory
* Compile changed source files into .o files (machine code object files), which are then linked into a .out file (machine code executable file)
    * Executables must be rebuilt on each new system (porting)
    * Parallel compiling: `make -j` to rebuild changed pieces in parallel
    * Amdahl's Law: Linking is sequential, forming a bottleneck
* C Pre-Processor (CPP): directives begin with `#`
    * Ex. `#define` or `#include`
    * Macros: small function-like definitions whose text replaces every use below them
    * However, this means a macro argument's expression can be evaluated multiple times
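A minimal sketch of that pitfall; the `SQUARE` macro and `counter` function are hypothetical examples, not from lecture:

```cpp=
#include <stdio.h>

// text substitution: SQUARE(x) becomes ((x) * (x)) before compilation
#define SQUARE(x) ((x) * (x))

int counter(void) {
    static int n = 0;
    return ++n;   // side effect: increments on every call
}

int main(void) {
    // expands to ((counter()) * (counter())): the argument runs twice
    printf("%d\n", SQUARE(counter()));   // prints 2 (1 * 2), not 1
    return 0;
}
```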
## C vs Java
C main function
```cpp=
#include <stdio.h>

int main(void) {
    return 0;
}
```

| | C | Java |
| -------- | -------- | -------- |
| Type of language | Function oriented | Object oriented |
| Programming unit | Function | Class = Abstract Data Type |
| Compilation | `gcc hello.c` | `javac Hello.java` |
| Execution | `a.out` loads and executes | `java Hello` interprets |
| Storage | Manual (`malloc`, `free`) | `new` allocates & initializes, automatic garbage collection |
| Comments | `/* ... */` | `/* ... */` or `//` |
| Constants | `#define`, `const` | `final` |
| Preprocessor | Yes | No |
| Variable declaration | Beginning of block (C99 same as Java) | Before you use it |
| Variable naming conventions | under_scores | camelCase |
| Access a library | `#include <stdio.h>` | `import java.io.File;` |

* ANSI C was updated to C99 (a.k.a. C9x)
    * Added variable-length arrays
    * Added fixed-width int types and booleans
* C11 and C18 are newer fixes
    * Multi-threading, Unicode, removed `gets()`
* Passing args into main: `int main(int argc, char *argv[])`
    * `argc` contains the number of strings on the command line
        * The executable counts as one, plus one for each argument
    * `argv` is a pointer to an array containing the arguments as strings

## C Basics
* Booleans
    * False: `0` (int), `NULL` (a real boolean type was only added in C99)
    * True: everything else
* Types: `int`, `unsigned int`, `float`, `double`, `char`, `long`, `long long`
    * `long` is a longer int (at least 32b), `long long` is even longer (at least 64b)
    * `int` is at least 16b
    * `float` is a floating point decimal; `double` is equal or higher precision floating point
    * Preferable to use `intN_t` and `uintN_t` for exact widths
* Constants: `const int days_in_week`
* Enums: `enum color {RED, GREEN, BLUE};`
* Typedef: `typedef uint8_t BYTE;` creates custom type names
* Struct: custom structured groups of variables

```cpp=
typedef struct {
    int length_in_seconds;
    int year_recorded;
} SONG;
```

* Switch cases fall through (keep evaluating) until a `break` is seen
* `goto` is nonlocal control flow, reserved for advanced use cases
* Includes share one namespace, so it is `atan`, not `math.atan`
* Never make a while loop's termination depend on a floating point equality test

# Lecture 4: 9/2/20
## Pointers
* Variables have undefined values if not initialized in the declaration
* Heisenbugs: bugs from undefined behavior that appear nondeterministically
* An address refers to a memory location (not its value)
* A pointer contains the address of a memory location
* Syntax:
    * `int *p;`: p is a pointer (an address of an int)
    * `p = &y;`: assign the address of y to p
* Dereferencing:
    * `z = *p;`: assign the value at the address in p to z
    * `*p = z;`: assign the value of z to what p is pointing to
* Must initialize pointers first
    * `int *ptr; *ptr = 5;` does not work since `ptr` was never given a valid address
* Benefits: cleaner code, and passing a pointer to a large struct/array is faster than copying it
* Drawback: bugs (dynamic memory management, dangling references, memory leaks)

## Pointer Usage
* Pointers can point to any data type
* `void *` is a generic pointer (not recommended)
* Pointers can point to functions
    * `int (*fn)(void *, void *) = &foo;`
    * fn is a pointer to a function that accepts 2 `void *` pointers and returns an int; initially it points to function `foo`
    * `(*fn)(x, y)` is equivalent to `foo(x, y)`
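A minimal sketch of the function-pointer syntax above; `cmp_int` is a hypothetical function, not from lecture:

```cpp=
#include <stdio.h>

// a function matching the signature described above
int cmp_int(void *a, void *b) {
    return *(int *)a - *(int *)b;   // interpret both pointers as int*
}

int main(void) {
    int x = 3, y = 7;
    int (*fn)(void *, void *) = &cmp_int;  // fn initially points to cmp_int
    printf("%d\n", (*fn)(&x, &y));         // same as cmp_int(&x, &y): -4
    return 0;
}
```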
* Struct arrow notation: `int h = paddr->x;` is equivalent to `int h = (*paddr).x;`
    * Arrow notation is more commonly used
* Struct assignment (`p1 = p2`) copies all the values (every field), not pointers
* NULL pointer: the pointer of all 0s
    * Writing to or reading from a null pointer causes a crash
    * Since 0 is false, a simple test for a non-null pointer is `if (q)`
* The type declaration tells how many bytes to fetch on each access through the pointer
    * A 32-bit int is stored in 4 consecutive 8-bit bytes
* Word alignment: words can only be accessed at 4-byte boundaries
* `sizeof(type)` returns the number of bytes in the object
    * By the C99 definition, `sizeof(char) == 1`

## Arrays
* A contiguous block of memory
* Declaration: `int ar[] = {1, 2};`
* Arrays are almost identical to pointers
    * `char *string` and `char string[]` are nearly identical declarations
    * Cannot increment an array name
* `ar[0]` is the same as `*ar`; `ar[2]` is the same as `*(ar+2)`
* Declared arrays are only allocated while their scope is valid
* Declare a constant like `int ARRAY_SIZE` instead of hardcoding the size
* An array does not know its own length, and bounds are not checked
    * Exception: strings, where the null terminator signifies the end
* Common errors
    * Segmentation fault: read/write to memory you don't have access to
    * Bus error: alignment is wrong (bad word alignment)
* `pointer + n` adds `n * sizeof(type)` to the address
* Handle: pointer to a pointer (`**h`)
    * Needed so a function can advance the caller's pointer into an array
    * Typically no more than 2 stars

# Lecture 5: 9/4/20
## Dynamic Memory Allocation
* `sizeof` gives size in bytes
    * Also knows the size of (declared) arrays, in bytes
* Allocate room to point to: `malloc()`
    * Ex. `ptr = (int *) malloc(n * sizeof(int));`
    * This will cast the `void *` returned by malloc to `int *`
    * A `malloc`ed location contains garbage initially
* Dynamically allocated space must be freed
    * When main returns it is auto-freed, but relying on that is bad practice
* Memory leak: allocated memory that is never freed while the program runs (e.g., the last pointer to it is lost)
* Malloc returns `NULL` if the request cannot be granted
* Common errors
    * `free()`ing the same piece of memory twice
    * `free()`ing something that you didn't `malloc()`
* Resizing
    * `realloc(p, size)` returns a new pointer (`NULL` if it fails)
    * `realloc(ip, 0)` is the same as `free(ip)`
* Array names are not variables
    * `*a`: first value of the array; `a`: pointer to the array; `&a`: pointer to the array (just like `a`)
* 3 rules of pointers
    * 1) Set up the pointee (separate from the pointer)
    * 2) Dereference a pointer to access its pointee
    * 3) Assignment (`=`) between pointers makes them point to the same pointee

## Linked List
```cpp=
struct Node {
    char *value;
    struct Node *next;
};
typedef struct Node *List;  // List is a pointer to a struct Node

List ListNew(void) { return NULL; }  // Create a new empty list
```
* When you `malloc` space for a string, you need `strlen(string) + 1` bytes for the null terminator `\0`
* `strcpy(destination, original)`
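A sketch of how a node might be added to the front of this list, using the `strlen(string) + 1` and `strcpy` points above; `ListPrepend` is my own name, and the declarations are repeated so the sketch is self-contained:

```cpp=
#include <stdlib.h>
#include <string.h>

struct Node {
    char *value;
    struct Node *next;
};
typedef struct Node *List;

// allocate a new node holding a copy of s, and make it the new head
List ListPrepend(List list, const char *s) {
    struct Node *node = malloc(sizeof(struct Node));
    if (node == NULL) return list;             // malloc can fail
    node->value = malloc(strlen(s) + 1);       // +1 for the '\0'
    if (node->value == NULL) {
        free(node);
        return list;
    }
    strcpy(node->value, s);                    // copy into the new space
    node->next = list;                         // old head follows new node
    return node;                               // new head of the list
}
```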
## Memory locations
* A struct declaration does NOT allocate memory
* A variable declaration does allocate memory
* Methods of allocating memory
    * Declaration of a local variable
    * Dynamic allocation at runtime, like `malloc`
    * Global variable (declared above `main()`), which has global scope
* 3 pools of memory
    * 1) Static storage: global variable storage (permanent)
    * 2) Stack: local variable storage, params, return address
    * 3) Heap: dynamic malloc storage; data lives until deallocated by the programmer
* Program address space
    * Stack: local vars, grows downward
        * Big stack: deep recursion/fractals
    * Heap: space from malloc, grows upward
        * Big heap: just big mallocs without deep function calls
    * Static data: declared outside of main. Does not grow/shrink
    * Code: loaded when the program starts. Does not grow/shrink

![](https://i.imgur.com/cBqquil.png)

### Stack
* Stack frame (LIFO)
    * Return instruction address, params, space for other local variables
    * Frames are always contiguous blocks of memory; the stack pointer tells where the top stack frame is
    * When a procedure ends, its stack frame is tossed off the stack, freeing memory for future stack frames
    * Memory isn't erased even when no longer needed; the stack pointer just moves back up

### Heap
* Large pool of memory, not allocated in contiguous order
* Specify the number of bytes to allocate in `malloc`
* Tricky to manage, unlike the other pools
    * Want `malloc()` and `free()` to be fast with minimal memory overhead
    * Want to avoid (external) fragmentation: free memory separated into small chunks
* K&R implementation
    * Each block of memory is preceded by a header with the size of the block and a pointer to the next block
    * Free blocks are kept in a circular linked list; the pointer field is unused in allocated blocks
    * `malloc()`: search the free list for a block large enough
    * `free()`: check if the blocks adjacent to the freed block are also free
        * If so, coalesce them into 1 bigger free block; otherwise add the newly freed block to the free list
    * `free` is very quick; `malloc` can search the full list in the worst case
* `malloc` block choosing
    * Best-fit: choose the smallest block that is big enough
    * First-fit: choose the first block that is big enough
    * Next-fit: same as first-fit, but resume searching where we finished last time
        * Distributes the small slivers of leftover memory

## When Memory Goes Bad
* Pointer problems
    * Dangling reference: using a pointer before `malloc` has set it up
    * Writing off the end of arrays
    * Cannot return a pointer to a variable in a local frame
    * Use after free
    * Stale pointers after `realloc` moves the data
    * Freeing the wrong thing (e.g., an incremented pointer)
    * Double free
    * Memory leaks: tardy free, losing the pointer
    * Incrementing the only pointer to a block
* Valgrind catches memory leaks, misuses of free, and writing over the end of arrays
    * Helpful for debugging

# Lecture 6: 9/9/20
* Very large/small numbers and fractional numbers need to be stored
* Agree where to put the binary point to separate the integer and fractional parts
    * Store where the binary point is in a different variable
    * The binary point can 'float' since it is no longer fixed
* Normalized form: no leading 0s (like scientific notation)

## Floating Point Representation
* The first 1 is not recorded (since it is always 1 in normalized form)
* Sign: 1 bit (uses sign-magnitude form)
* Exponent: 8 bits
* Significand: 23 bits (the LSB represents $2^{-23}$)
* Can represent a range of about $(1.2 \times 10^{-38}, 3.4 \times 10^{38})$ in magnitude
* Overflow: the exponent is too large for 8 bits ($> 3.4 \times 10^{38}$ or $< -3.4 \times 10^{38}$)
* Underflow: the negative exponent is too large for 8 bits
    * Ends up storing a small fraction as 0
* More precision: double (52-bit significand)
* The stored significand (fraction field) is always between 0 and 1 (for normalized numbers)
* Representing 0: exponent and significand fields of 0
    * 0 has no leading 1
    * Results in 2 zeroes because of the sign bit
* Biased exponent representation (IEEE Standard 754)
    * 2's complement exponents would make numbers with negative exponents look bigger
    * Bias: 127
    * $(-1)^{\text{Sign}} \times (1 + \text{Significand}) \times 2^{\text{Exponent} - 127}$

![](https://i.imgur.com/hyKuS0x.png)
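A sketch that pulls the three fields out of a `float` using the layout and bias just described (the bits are copied out with `memcpy`, since C's aliasing rules forbid a direct pointer cast):

```cpp=
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = -0.75f;                 // = (-1)^1 * (1 + 0.5) * 2^(126 - 127)
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   // reinterpret the 32 bits safely

    uint32_t sign        = bits >> 31;           // 1 bit
    uint32_t exponent    = (bits >> 23) & 0xFF;  // 8 bits, biased by 127
    uint32_t significand = bits & 0x7FFFFF;      // 23 bits, implicit leading 1

    printf("sign=%u exp=%u (unbiased %d) sig=0x%06X\n",
           sign, exponent, (int)exponent - 127, significand);
    // prints: sign=1 exp=126 (unbiased -1) sig=0x400000
    return 0;
}
```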
## Special Numbers
* How do we represent $\pm$ infinity?
    * Use exponent 255 with a 0 significand
* Representing NaN
    * Use exponent 255 with a nonzero significand
    * NaNs contaminate results as they percolate up
    * $2^{23} - 1$ possible NaNs can be used to specify each error (sqrt of a negative number, 0/0, etc.) and include which line it occurred on
* Problem of normalization and the implicit 1
    * Gap between FP numbers around 0
    * Smallest positive normalized number: $2^{-126}$
    * Second smallest: $(1 + 2^{-23}) \times 2^{-126} = 2^{-126} + 2^{-149}$
* Denorms: solve the implicit-1 problem
    * Exponent 0 with a nonzero significand
    * No implicit leading 1; implicit exponent of -126
    * Smallest: $2^{-149}$
    * Second smallest: $2^{-148}$
* Even steps until the normalized region, and then the stride doubles with each higher exponent
    * Have $2^{23} \approx$ 8 million slots between each power of 2
    * $2^{24} + 1$ is the smallest positive integer that cannot be represented, since the stride there is 2
* Example: $\frac{1}{3}$
    * Sign: 0
    * Exponent: $-2 + 127 = 125 = 0b01111101$
    * Significand: $0b010101\ldots$

## Fallacies
* Associativity is lost
    * Small numbers can be lost when added to big numbers
* Precision: the count of bits used to represent a value
* Accuracy: the difference between the actual value and its computer representation
    * High precision permits high accuracy but does not guarantee it
* Rounding: keep 2 extra bits of precision while computing in order to round
* Converting double to single precision (or FP to int), the rounding modes:
    * 1) Round to $+\infty$
    * 2) Round to $-\infty$
    * 3) Truncate (round to 0)
    * 4) Unbiased: round to the nearest, ties go to the even number
        * Decimal examples: 2.5 rounds to 2, 3.5 rounds to 4
* FP addition
    * 1) De-normalize to match exponents
    * 2) Add significands
    * 3) Keep the same exponent
    * 4) Normalize again
* Casting
    * `int -> float -> int` is not always the same
    * `float -> int -> float` is not always the same

## Alternate FP Representations
* float: standard single precision (8 exp, 23 sig)
* binary64: double precision (11 exp, 52 sig)
* Quad precision: binary128 (15 exp, 112 sig)
* Oct precision: binary256 (19 exp, 236 sig)
* Half precision: binary16, or fp16 (5 exp, 10 sig)
* bfloat16 (brain FP): (8 exp, 7 sig)
* unum: tells how many bits are for exp/sig (customizable)
    * The u-bit indicates if the number is exact

# Lecture 7: 9/11/20
* A CPU executes a lot of instructions
* Instruction Set Architecture (ISA): the set of instructions a CPU implements
* RISC: keep the instruction set small and simple in order to build fast hardware

## RISC-V Basics
* Registers instead of variables
* Registers are inside the processor (not memory)
* Predetermined number of registers in hardware (32)
* Each register is 32 bits wide in the 32-bit variant of RISC-V
* Numbered from `x0` to `x31`
    * `x0` is special, always holds the value 0
* Registers can be referred to by number or name
* Registers have no type
* Comments start with `#`
* Instruction: one statement/command
    * Each line has at most 1 instruction
    * One operation per instruction

## RISC-V add/sub
* `add x1, x2, x3`
    * First: the operation, by name
    * Second: the operand getting the result (destination)
    * Third: the first source operand
    * Fourth: the second source operand
* Break longer sequences into multiple instructions
* Immediates: numerical constants
    * `addi x3, x4, 10`
    * No `subi`, just `addi` a negative value
* Use `x0` for the number zero instead of an immediate

# Lecture 8: 9/14/20
* Some registers have designated uses (usually < 30 registers freely available)
* An optimizing compiler minimizes register usage
    * Register footprint: the number of registers needed for a compute kernel
* Memory is like a large 1D array
    * The processor gives an address (an offset to a base pointer) to a point in memory
    * Load from memory (read) and store to memory (write)
* Memory addresses are in bytes
    * Word addresses are 4 bytes apart
* Little-endian: the least significant byte is at the smallest address
    * Bits are always stored with the most significant in the upper position
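The little-endian layout above can be checked from C by looking at the first byte of a word; a minimal sketch:

```cpp=
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t word = 0x11223344;
    uint8_t *byte = (uint8_t *)&word;   // view the word one byte at a time

    // on a little-endian machine the least significant byte (0x44)
    // sits at the lowest address
    printf("byte[0] = 0x%02X -> %s-endian\n", byte[0],
           byte[0] == 0x44 ? "little" : "big");
    return 0;
}
```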
* DRAM: Dynamic Random Access Memory (fast, but not as fast as registers; medium capacity)
    * Registers are 50-500x faster than memory

## Loading from and Storing to Memory
* Load word: `lw destination, offset(base_ptr)`
    * The offset is in bytes (a multiple of 4)
    * Data flow moves from right to left
    * C code: `int A[100]; g = h + A[3];`
    * Equivalent: `lw x10, 12(x15)` then `add x10, x12, x10`
* Store word: `sw source, offset(base_ptr)`
    * Data flow moves from left to right
    * `int A[100]; A[10] = h + A[3];`
        * `lw x10, 12(x15)`
        * `add x10, x12, x10`
        * `sw x10, 40(x15)`
* Load byte/store byte: `lb`, `sb` for byte data transfers
    * `lb x10, 3(x11)` copies the content at address `x11 + 3` to the low byte position of `x10`
        * Sign-extends the sign bit to all higher bits
    * Unsigned byte load: `lbu`
        * No sign-extension needed
    * No `sbu` (nonsensical)
* `addi` achieves what a `lw`/`add`/`sw` sequence (with the constant kept in memory) would, but `addi` is much faster
* Immediates are limited in size (they must fit in the instruction)
    * Longer constants need to be loaded from memory

## Decision Making
* If statement: `beq reg1, reg2, L1`
    * Go to the statement labeled L1 if (value in reg1) == (value in reg2)
    * Go to the next statement otherwise
    * Counterpart: `bne`, branch not equal
* Conditional branches:
    * Branch if less than: `blt` (unsigned: `bltu`)
    * Branch if greater than or equal: `bge` (unsigned: `bgeu`)
* Unconditional branch (always taken)
    * Jump: `j label`
* `==` usually translates to `bne` so that the next instruction executes if the condition is true
    * Need a jump like `j Exit` at the end of the `bne` block
* No `bgt`, `ble` (swap the operands instead)
* Need to increment memory pointers by the number of bytes

# Lecture 9: 9/16/20
* Logical instructions: isolate or pack bytes/nibbles into words
    * 2 variants: register and immediate
    * No logical not; just xor with all 1s
* Logical shifting:
    * Shift left logical: `sll` (e.g., shift 2 bits left and insert 0s on the right; bits fall off the left end)
    * Shift left logical immediate: `slli`
    * Shift right logical: `srl` fills the left with 0s
* Arithmetic shifting: the logical shift for signed ints
    * Shift right arithmetic (`sra`, `srai`): moves n bits to the right, inserting copies of the sign bit into the emptied bits
    * Not equivalent to dividing by $2^n$ (fails for odd negative numbers)
        * The C standard is that division should round toward 0

## Registers
* Programs are stored in memory
    * The program lives in one area and data lives in another (in practice, the program sits below the data)
* 1 RISC-V instruction is 32 bits
* Instructions are fetched from memory using the Program Counter (PC)
    * The next instruction is 4 bytes away (increment PC by 4)
* Symbolic register names
    * `x10`-`x17` are `a0`-`a7` for params, with `a0`, `a1` for return values
    * `x1` is `ra` (return address to the point of origin)
    * `x8`-`x9` are `s0`-`s1`, and `x18`-`x27` are `s2`-`s11`: saved registers
    * `x0` is `zero`
* Pseudo-instructions: shorthand for common assembly
    * `mv rd, rs` = `addi rd, rs, 0` (move a value to another register)
    * `li rd, 13` = `addi rd, x0, 13` (load an immediate value into a register)
    * `nop` = `addi x0, x0, 0` (no operation; the processor just waits)

## Function Calls
* 6 fundamental steps
    * 1) Put arguments in a place where the function can access them
    * 2) Transfer control to the function
    * 3) Acquire the local storage resources needed by the function
    * 4) Perform the desired task of the function
    * 5) Put the return value where the calling code can access it, restore used registers, release local storage
    * 6) Return control to the point of origin
* Use `jr`, not `j`, since `jr` allows us to use a variable to return (`jr ra`)
* Jump and link: `jal rd, Label` will jump and save the return address (equivalent to an `addi` then a `j`)
    * Essentially: link, then jump
    * Pseudo-instruction: `ret` = `jr ra`
* `jalr rd, rs, imm`: jump and link register
* `j`, `jr`, `ret` are pseudo-instructions built on `jal`/`jalr` (with no return address saved, it is just a jump)
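A minimal sketch of a call to a leaf function using `jal` and `ret`; the labels `sum` and `main` are my own:

```assembly
sum:                 # a leaf function: returns a0 + a1
    add a0, a0, a1   # put the return value where the caller expects it
    ret              # pseudo-instruction for jr ra

main:
    li  a0, 3        # first argument in a0
    li  a1, 4        # second argument in a1
    jal ra, sum      # link: save PC+4 in ra, then jump to sum
    # execution resumes here after ret, with 7 in a0
```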
# Lecture 10: 9/18/20
* Use the stack to save old register values before calling a function
* Stack pointer: `sp` (`x2`)
* The stack grows downward (the stack address decreases when we push)
    * Push decrements `sp`, pop increments `sp`
* A stack frame includes:
    * The return instruction address
    * Params (args)
    * Space for local vars
* The stack pointer points to the bottom of the stack frame

```assembly
addi sp, sp, -8  # adjust stack for 2 items
sw s1, 4(sp)     # save s1 for use after the call
sw s0, 0(sp)     # save s0
...
lw s0, 0(sp)     # restore s0 for the caller
lw s1, 4(sp)     # restore s1
addi sp, sp, 8   # adjust stack to delete 2 items
jr ra            # return (equivalently, ret)
```

## Nested function calls
* `ra` would be overwritten by future calls
* Register conventions: a set of rules for which registers are unchanged after `jal`
    * Preserved: `sp`, `gp`, `tp`, `s0`-`s11` (`s0` is also `fp`)
    * Not preserved: `a0`-`a7`, `ra`, `t0`-`t6` (argument and temporary registers)
    * The caller must save the arg/return-value registers, temporaries, and `ra` that it still needs
    * The callee must preserve the saved registers and `sp`

## Memory Allocation
* Types of storage classes:
    * Automatic: local to a function, discarded when the function exits
        * Uses the stack for local vars
        * Procedure frame/activation record: the segment of the stack with saved registers and local vars
    * Static: exists across procedures
* Stack: high end, at `0xBFFF_FFF0`
* Text segment at the low end: `0x0001_0000`
* Static data: `gp` points to the static area (`0x1000_0000`)
* Heap: above static, grows up toward high addresses
* The callee saves registers on the stack at the beginning of the function call and restores the stack right before returning

# Lecture 11: 9/21/20
* Everything has a memory address (instructions, data words)
    * Branches and jumps go to different sections
* One register keeps the address of the instruction being executed: the Program Counter (PC), a.k.a. Instruction Pointer (IP)
* Programs are distributed as binaries, but as different versions based on the device
    * Want backwards compatibility
* Data is in words (32-bit chunks)
    * Each register holds a word (`lw`, `sw` access 1 word of memory)
* Divide the instruction into fields; the formats are:
    * R: register-register arithmetic ops
    * I: register-immediate arithmetic ops, loads
    * S: stores
    * B: branches (minor variant of S)
    * U: 20-bit upper immediate instructions
    * J: jumps (minor variant of U)

## R-Format
![](https://i.imgur.com/tzDHERd.png)
* opcode: identifies the format
    * `0b0110011` for all R-format instructions
* funct7 + funct3: tell what operation to perform
* `rs1`, `rs2`: source registers; specify the registers with the first and second operands
* `rd`: destination register; the register that receives the result of the computation
* Each register field holds a 5-bit uint corresponding to a register number (`x0`-`x31`)

## I-Format
![](https://i.imgur.com/wttA7k3.png)
* `imm[11:0]` covers the range $[-2048, 2047]$
    * The immediate is sign-extended
* `opcode`: `0b0010011`
* Shift amounts are limited to 5 bits since larger shifts are meaningless
* Loads:
    * A 12-bit signed immediate for the memory address (the offset)
    * `rd` receives the value loaded from memory
    * `rs1` is the base address to load from

## S-Format
![](https://i.imgur.com/KzbMVst.png)
* A store needs 2 source registers (`rs1` for the base address, `rs2` for the data) but no `rd`
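To make the field layout concrete, a sketch that packs R-format fields into a 32-bit word (positions as in the diagram above); `encode_rtype` is my own helper:

```cpp=
#include <stdint.h>
#include <stdio.h>

// R-format, high bits to low: funct7 | rs2 | rs1 | funct3 | rd | opcode
uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                      uint32_t funct3, uint32_t rd, uint32_t opcode) {
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

int main(void) {
    // add x1, x2, x3: funct7 = 0, funct3 = 0, opcode = 0b0110011
    printf("0x%08X\n", encode_rtype(0, 3, 2, 0, 1, 0x33));
    // prints 0x003100B3
    return 0;
}
```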
# Lecture 12: 9/23/20
## B-Format (SB)
![](https://i.imgur.com/5Icv0bD.png)
* Need to read 2 registers but don't write to a register
* Branches are used for loops (function calls and unconditional jumps are handled by jump instructions)
* Can only jump relative to the PC
    * The largest branch distance needed is limited by the size of the code
* PC-relative addressing: use the immediate as a 2's complement offset from the PC
    * 12 bits gives a reach of $\pm 2^{11}$ 32-bit instructions
    * Can only branch to the beginning of a word
* Branch calculation (word-scaled)
    * No branch: `PC = PC + 4`
    * Branch: `PC = PC + immediate*4`
* Support for 16-bit instructions means the branch reach is cut in half
    * Otherwise half of the 16-bit targets would be errors
    * $\pm 2^{10}$ 32-bit instructions
* Branch calculation (actual, half-word-scaled)
    * No branch: `PC = PC + 4`
    * Branch: `PC = PC + immediate*2`
* The 12-bit immediate field actually encodes a 13-bit offset whose LSB is always 0 (so it is not stored)
* `beq` with offset 0 is an infinite loop

## U-Format
![](https://i.imgur.com/y16TyRH.png)
* Need to branch to destinations farther than $2^{10}$ instructions away
    * Replace the `beq` with an inverted `bne` followed by a `j`
* 20-bit immediate for the upper 20 bits of a 32-bit word
* `lui`: load upper immediate
    * `lui` then `addi` sets all 32 bits
    * `addi` sign-extends, though, which can cause problems
        * Add 1 to the `lui` value to take that into account
    * `li`: pseudo-op that handles this (a `lui` and an `addi`)
* `auipc`: add upper immediate to PC

## J-Format (UJ)
![](https://i.imgur.com/zvYwHhQ.png)
* `jal` saves PC+4 to `rd` (the return address)
* The immediate is the offset, with the always-0 LSB removed
* PC-relative jump to $\pm 2^{19}$ locations $= \pm 2^{18}$ 32-bit instructions
* `jalr rd, rs, immediate` uses the I-format
    * Writes `rd = PC+4`, then `PC = rs + immediate`
    * A way to jump to an absolute address
    * Cannot assume the LSB is 0, so the immediate is not scaled and the range is reduced
* Use `auipc` + `jalr` to jump to a PC-relative 32-bit offset
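A sketch of the `lui`/`addi` fix-up described under U-Format above, using the arbitrary constant `0xDEADBEEF`, whose low 12 bits have bit 11 set:

```assembly
# goal: x10 = 0xDEADBEEF
# upper 20 bits are 0xDEADB, lower 12 bits are 0xEEF
# 0xEEF as a signed 12-bit value is -273, so addi will subtract;
# compensate by adding 1 to the lui constant: 0xDEADB + 1 = 0xDEADC
lui  x10, 0xDEADC     # x10 = 0xDEADC000
addi x10, x10, -273   # x10 = 0xDEADC000 - 0x111 = 0xDEADBEEF
# li x10, 0xDEADBEEF expands to this kind of pair
```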
# Lecture 13: 9/25/20
* Interpreter: directly executes a program in the source language
    * Higher level, better error messages
    * Slower (~10x), code smaller (~2x)
    * Machine-independent instructions (can run on any machine)
* Translator: converts a program from one source language to another
    * Translates to an executable; users don't need to see the source code
* Compilation, Assembly, Linking, Loading
* Compiler
    * Input: high-level language code (`foo.c`)
    * Output: assembly code (`foo.s`)
    * Can contain pseudo-instructions

## Assembler
* Input: assembly code (`foo.s`)
* Output: object code (`foo.o`)
* Reads and uses directives
    * Ex. `.text` (code), `.data` (static data), `.globl sym`, `.string str`, `.word w1...wn`
* Replaces pseudo-instructions, creates machine code, creates the object file
    * Needs to sign-extend immediates
    * Branch immediates count half-words
    * PC-relative branches/jumps: count the half-words between the current instruction and where you want to go
* Forward reference problem: branching to a label defined below
    * Solution: take 2 passes
        * Pass 0: replace pseudo-instructions
        * Pass 1: remember the positions of labels
        * Pass 2: use the label positions to generate code
* PC-relative jumps (`jal`, `beq`, `bne`)
    * `j offset` expands to `jal zero, offset`
    * Counting half-words between the jump and its target gives position-independent code (PIC)
    * PIC is good: everything is relative to the current location
* Static data references:
    * `lui` + `addi` require the full 32-bit address
    * `auipc` + `addi` is PIC
* Symbol table
    * A list of this file's items that can be used by other files
    * Labels: for function calling
    * Data: `.data` variables that can be accessed across files
* Relocation table
    * A list of items whose addresses this file needs
    * Absolute labels (`jal`, `jalr`)
    * Static data (`la`)
* Object file format (ex. ELF)
    * Object file header: size and position of the other pieces
    * Text segment: machine code
    * Data segment: static data (in binary)
    * Relocation info: lines of code that need to be fixed up later
    * Symbol table: a list of the file's labels and static data that can be referenced
    * Debugging info

## Linker
* Input: object code files with info tables (`foo.o`)
* Output: executable code (`a.out`)
* Combines several object files into a single executable
* Throws an error for duplicate or missing symbols
* Types of addresses
    * PC-relative (`beq`, `bne`, `jal`, `auipc`/`addi`): never need to relocate
    * Absolute or external function addresses (`auipc`/`jalr`): always relocate
    * Static data references (`lui`/`addi`): always relocate
* Relocation editing:
    * J-format: `jal` addresses
    * I-, S-format: static pointers
    * No conditional branches need editing
* Text starts at `0x0001_0000` (as in the address space layout above)
* Statically-linked approach: include the entire library even if not all of it is used (self-contained)
* Dynamically-linked (DLL): include the library at runtime
    * Smaller storage, requires less memory
    * Requires time to link at runtime
    * Links at the machine code level

## Loader
* Input: executable code (`a.out`)
* Output: a running program
* Part of the OS
* Reads the header to know the sizes of the text and data segments
* Creates an address space for text, data, and stack
* Copies instructions and data from the executable file into the new address space
* Copies CLI args onto the stack
* Initializes machine registers (mostly cleared)
* Jumps to a start-up routine that copies the program args from the stack to registers and sets the PC

# Lecture 14: 9/28/20
* Synchronous digital system
    * Synchronous: operations coordinated by a central clock
    * Digital: values represented by discrete values
* Transistor: a semiconductor device to amplify or switch signals
* CMOS: Complementary Metal-Oxide Semiconductor
    * Drain, gate, source
    * Current flows from source to drain depending on the gate
    * Can create AND, OR, NOR out of NAND
    * Asserted switch = closed

![](https://i.imgur.com/Cnk1daj.png)
* Signals are transmitted instantly and continuously
    * A wire contains only 1 value at a time
* [Block] propagation delay: the delay between the clock edge and when you can see the output
* Combinational Logic (CL) circuits
    * Output is a function of the inputs only
* State elements
    * Circuits that store information (registers, cache)
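The claim that other gates can be built out of NAND can be sanity-checked in C with bitwise operators; a minimal sketch (helper names mine):

```cpp=
#include <stdio.h>

// NAND on a single logical bit; the & 1u keeps the result to one bit
unsigned nand(unsigned a, unsigned b) { return ~(a & b) & 1u; }

unsigned not_(unsigned a)             { return nand(a, a); }
unsigned and_(unsigned a, unsigned b) { return nand(nand(a, b), nand(a, b)); }
unsigned or_(unsigned a, unsigned b)  { return nand(nand(a, a), nand(b, b)); }
unsigned nor_(unsigned a, unsigned b) { return not_(or_(a, b)); }

int main(void) {
    // print the full truth table to confirm the derivations
    for (unsigned a = 0; a <= 1; a++)
        for (unsigned b = 0; b <= 1; b++)
            printf("a=%u b=%u  AND=%u OR=%u NOR=%u\n",
                   a, b, and_(a, b), or_(a, b), nor_(a, b));
    return 0;
}
```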
# Lecture 15: 9/30/20
* State elements can store values over time (registers, caches)
    * They control the flow of information between logic blocks
* Register = n "flip-flops", a.k.a. D-type flip-flops
    * d: data (input), q: output

![](https://i.imgur.com/MUTg8iv.png)
* At the rising edge of the clock, sample the input d
    * After a time delay, output it on q
* Setup time: the time before the rising edge that the input must be stable
* Hold time: the time after the rising edge that the input must be stable
* Clk-to-q delay: the time until q gets updated and is stable

![](https://i.imgur.com/ruN1tg7.png)
* Max delay = clk-to-q delay + CL delay + setup time (this sets the minimum clock period)
* Extra registers are added to speed up the clock rate
* The clock period is limited by the propagation delay of the adder/shifter
* Finite State Machine: represented with a state transition diagram
    * Built with CL + a register
    * $S_{i-1}$ is the stable output of a register which holds the result of iteration $i-1$

# Lecture 16: 10/2/20
* XOR: 1 if there is an odd number of 1s
* Boolean algebra: $+$ is OR, $\cdot$ is AND, $\bar{x}$ is NOT
* Circuit verification: circuits are equivalent if their boolean algebra expressions are equivalent
* Uniting theorem: $xy + x = x$, $(x+y)x = x$
* DeMorgan's Law: $\overline{x \cdot y} = \bar{x} + \bar{y}$, $\overline{x + y} = \bar{x} \cdot \bar{y}$
    * Distribute the "bar" to every term (and flip the operator)
* Sum-of-products: an OR of ANDs
    * OR together one AND term for each input combination whose output is 1 (complementing the input bits that are 0 in that row)

# Lecture 17: 10/5/20
Multiplexor
![](https://i.imgur.com/5GV4C6T.png)

Arithmetic and Logic Unit (ALU)
* 00: A + B
* 01: A - B
* 10: A & B
* 11: A | B

![](https://i.imgur.com/kt4n1zC.png)
![](https://i.imgur.com/5gd2nMB.png)

## Adders
* LSB: $s_0 = a_0 \oplus b_0$, $c_1 = a_0 b_0$
* ith bit: $s_i = \text{XOR}(a_i, b_i, c_i)$, $c_{i+1} = \text{MAJ}(a_i, b_i, c_i) = a_i b_i + b_i c_i + a_i c_i$
* N 1-bit adders cascaded make 1 N-bit adder
* For unsigned numbers, the existence of a carry out indicates overflow
* For signed numbers, the carry out is not indicative
    * For the highest adder: overflow $= c_n \oplus c_{n-1}$
    * No c_out, no c_in: no overflow
    * c_in and c_out: no overflow
    * c_in but no c_out: overflow (A, B > 0)
    * c_out but no c_in: overflow (A, B < 0)
* Subtractor: A - B = A + (-B)
    * An XOR with the SUB signal flips the bits of B
    * If SUB is high, the carry-in adds the extra 1
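The sum and carry equations above, written out as a 1-bit full adder sketch (the struct and names are mine):

```cpp=
#include <stdio.h>

// s_i = XOR(a, b, c); c_{i+1} = MAJ(a, b, c) = ab + bc + ac
typedef struct { unsigned sum; unsigned carry; } FullAdd;

FullAdd full_adder(unsigned a, unsigned b, unsigned c) {
    FullAdd r;
    r.sum   = a ^ b ^ c;                    // 1 iff an odd number of 1s
    r.carry = (a & b) | (b & c) | (a & c);  // majority of the 3 inputs
    return r;
}

int main(void) {
    // cascading N of these (carry out -> next carry in) gives an N-bit adder
    FullAdd r = full_adder(1, 1, 1);
    printf("sum=%u carry=%u\n", r.sum, r.carry);  // 1 + 1 + 1 = 0b11
    return 0;
}
```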
# Lecture 18: 10/7/20
* Hardware is inherently parallel
* CPU: the active part of the computer that does the work (data manipulation and decision making)
* Datapath: the part of the processor that has the hardware to perform operations
* The computer executes 1 instruction per clock tick
* State outputs drive the inputs to combinational logic, which computes the next state
* All state elements are updated at the rising clock edge
* Problem: a single monolithic block is too bulky
* Solution: break up the process of executing an instruction into stages
    * Stage 1: Instruction Fetch (IF)
    * Stage 2: Instruction Decode (ID)
    * Stage 3: Execute (EX), the ALU
    * Stage 4: Memory Access (MEM)
    * Stage 5: Write Back to Register (WB)
* All 5 stages occur within 1 clock cycle (fetch at one rising edge, write at the next rising edge)

![](https://i.imgur.com/WjbpPlw.png)
![](https://i.imgur.com/s9VxNJP.png)

# Lecture 19: 10/9/20
See lecture diagrams
![](https://i.imgur.com/g39nwYV.png)

# Lecture 20: 10/12/20
* Control and status registers (CSRs): different from the register file
    * Monitor status and performance
    * Up to 4096 CSRs
    * Necessary for peripherals
* `csrrw` swaps the values of a CSR and an integer register
    * `csrw csr, rs1` is `csrrw x0, csr, rs1`
        * rd = x0 means just write rs1 to the CSR
    * `csrwi csr, uimm` is `csrrwi x0, csr, uimm`
        * rd = x0 just writes uimm to the CSR

![](https://i.imgur.com/1eNBk2A.png)
* The LSB is stable first
* Critical path for `lw` $= t_{clk-q} + t_{IMEM} + t_{Imm} + t_{ALU} + t_{DMEM} + t_{mux} + t_{setup}$
* Certain instructions take longer than others
* Most blocks will be idle most of the time
* ROM: Read-Only Memory
![](https://i.imgur.com/vUyZdhh.png)

# Lecture 21: 10/14/20
* One crude performance measure: maximum clock frequency
* Better performance measures:
    * Program execution time
    * Throughput (e.g., the number of server requests handled per hour)
    * Energy per task
* Iron law of processor performance:
    * $\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$
* CPI: cycles per instruction
* Instructions per program:
    * Determined by the task, algorithm, language, compiler, and ISA
* Average CPI:
    * Determined by the ISA, processor implementation, complex instructions (like `strcpy`, CPI >> 1), superscalar processors (CPI < 1)
* Time per cycle:
    * Determined by the processor microarchitecture (critical path), technology, and power budget
* ~30% of energy is leaked in CMOS
* $\frac{\text{Energy}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Energy}}{\text{Instruction}}$
* $\frac{\text{Energy}}{\text{Program}} \propto \frac{\text{Instructions}}{\text{Program}} \times CV^2$
    * Capacitance depends on the technology and processor features
    * Supply voltage: run slower with less supply voltage to save energy
* Energy iron law: Performance = Power $\times$ Energy Efficiency
    * Performance: tasks/second
    * Energy efficiency: tasks/joule

## Pipelining
* Does not help the latency of a single task
* Helps the throughput of the entire workload
* Multiple tasks operate simultaneously using different resources
* (Theoretical) speedup = number of pipe stages
    * In reality, there is time to fill and drain the pipeline
* The pipeline rate is limited by the slowest pipeline stage

# Lecture 22: 10/16/20
* Put registers in between the 5 stages to pipeline

![](https://i.imgur.com/qft6eUl.png)
* The time to process 1 instruction can be longer, so latency is worse, but throughput is better
* The clock rate can be dramatically higher
* Pipelining hazards: situations that prevent starting the next instruction in the next clock cycle
* Structural hazard: a required resource is busy (needed in multiple stages)
    * 2+ instructions compete for a single physical resource
    * Sol 1: instructions take turns, causing some to stall
    * Sol 2: add more hardware (can always solve structural hazards)
        * 2 independent read ports and 1 write port
        * 2 separate memories (instruction and data caches)
* Data hazard: a data dependency between instructions; need to wait for the previous instruction to complete its read/write
    * Register file trick: write in the first half of the cycle, read in the second half (registers are fast)
    * Sol 1: stall using a bubble (`nop`) when an instruction depends on the previous one, but this reduces performance
        * The compiler will try to rearrange code that is not dependent
    * Sol 2: forwarding/bypassing: grab the operand from a pipeline stage rather than the register file (output of the ALU to the input of the ALU)

![](https://i.imgur.com/98F1UjL.png)
* Control hazard: the flow of execution depends on a previous instruction (for branches)
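A two-instruction sketch of the data hazard and the forwarding fix described above (registers chosen arbitrarily):

```assembly
add t1, t0, t0   # t1 is computed in EX, written back in WB
or  t2, t1, t0   # needs t1 in its own EX stage, before the add's WB
# without forwarding: insert bubbles (nops) until the write completes
# with forwarding: the ALU output of add is fed straight into the
# ALU input of or, so no stall is needed
```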
# Lecture 23: 10/18/20
## Load Data Hazard
* Load first, then do operations on the result
* The loaded value is only available after DMEM, but it is needed 1 cycle earlier
* Sol 1: load delay slot: need 1 stall, so convert the next instruction into a `nop`
    * Just deassert regWEn and MemRW, and do not update the PC
    * The cycle after that, execute the original instruction
    * Forward from the DMEM output to the ALU
* Sol 2: put an unrelated instruction into the load delay slot

## Control Hazards
* Branches are conditional, so we don't know the next instruction
* The earliest we know the branch outcome is at the end of Execute (the ALU)
* 2 subsequent instructions are executed regardless of the branch outcome
* Sol 1: use 2 stall cycles after a branch
* Sol 2: if the branch/jump is taken, convert the next 2 instructions to `nop`s
    * Use branch prediction to guess
        * Generally in a loop, so the branch is rarely taken
        * Ex. keep a bit recording whether the branch was taken last time
* Moving the branch comparator to the ID stage would incur a data hazard, and forwarding could fail

## Superscalar Processors
* Clock rate is limited by technology and power dissipation
* Deeper pipelines (10, 15 stages)
    * Less work per stage, so a shorter clock cycle
    * More potential for hazards (CPI > 1)
* Multi-issue superscalar
    * Replicate pipeline stages: multiple pipelines (CPI < 1)
    * Multiple execution units; start multiple instructions per clock cycle
    * Out-of-order execution to reduce hazards
        * Instructions arrive at a commit unit, which reorders the results
* Get CPI using the iron law (time the program, count the instructions, know the time/cycle)
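A worked example of that last point, with made-up numbers: a program that runs in 1 s, executes $5 \times 10^8$ instructions, and has a 1 GHz clock (1 ns per cycle):

$$\text{CPI} = \frac{\text{Time}/\text{Program}}{\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Time}}{\text{Cycle}}} = \frac{1\ \text{s}}{(5 \times 10^8)(10^{-9}\ \text{s})} = 2$$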