# CS 61C Notes

# Lecture 1: 8/26/20

Intro class, no notes

# Lecture 2: 8/28/20

* Refer to table for easy hex to binary conversion
* Numerals have infinite preceding zeroes
* Numbers are representations of numerals
* Numbers are theoretical
* For a set number of bits, if the result of addition (sub, mul, or div) cannot be represented, there is overflow
    * Not underflow; that would be negative overflow

## Number representations

* Positive number representations are always the same for all representations
* Sign and magnitude
    * Leftmost bit is the sign (0 is +, 1 is -)
    * Problem 1: Making numbers smaller than 0 actually makes the number bigger in binary representation
    * Problem 2: There are 2 zeroes
    * Problem 3: Complicated circuitry
* One's complement
    * Flip all the bits for a negative number
    * Problem: there are still 2 zeroes (0000 and 1111)
    * Solution: shift everything over left by 1
* Two's complement
    * Flip all bits and add $1$
    * Or multiply leftmost bit by $-1$
    * All 1s represents -1
    * $2^{N-1}$ non-negatives, $2^{N-1}$ negatives
    * There is 1 more negative than positive
* Bias encoding
    * Add a bias to bring the center of the range down
    * Bias for N bits chosen as $-(2^{N-1}-1)$

# Lecture 3: 8/31/20

* ENIAC: First Electronic General-Purpose Computer
    * Long time to program (patch cords and switches)
* EDSAC: First General Stored-Program Computer
    * Programs held as numbers/bits in memory
* Compile changed source files into .o files (machine code object files) which are then linked into a .out file (machine code executable file)
    * Executables must be rebuilt on each new system (porting)
    * Parallel compiling: make -j to rebuild changed pieces
    * Amdahl's Law: Linking is sequential, forming a bottleneck
* C Pre-Processor (CPP): Begins with #
    * Ex.
#define or #include
* Macros: small functions that text-replace everything below
    * However, this means functions would be called multiple times

## C vs Java

C main function

```cpp
#include <stdio.h>

int main(void) {
    return 0;
}
```

| | C | Java |
| -------- | -------- | -------- |
| Type of Language | Function Oriented | Object Oriented |
| Programming Unit | Function | Class = Abstract Data Type |
| Compilation | gcc hello.c | javac Hello.java |
| Execution | a.out loads and executes | java Hello interprets |
| Storage | Manual (malloc, free) | new allocates & initializes, automatic garbage collection |
| Comments | /* ... */ | /* ... */ or // |
| Constants | #define, const | final |
| Preprocessor | Yes | No |
| Variable declaration | Beginning of block (C99 same as Java) | Before you use it |
| Variable naming conventions | under_scores | camelCase |
| Access a library | #include <stdio.h> | import java.io.File; |

* ANSI C was updated to C99 or C9x
    * Added variable-length arrays
    * Int types and booleans
* C11 and C18 are newer fixes
    * Multi-threading, unicode, removed gets()
* Passing args into main: int main(int argc, char *argv[])
    * argc contains the number of strings on the command line
        * Executable counts as one, plus one for each argument
    * argv is a pointer to an array containing the arguments as strings

## C Basics

* Booleans
    * False: 0 (int), NULL, false (C99 bool)
    * True: Everything else
* Types: int, unsigned int, float, double, char, long, long long
    * long is a longer int (32b), long long is even longer (64b)
    * int is at least 16b (commonly 32b)
    * float is floating point decimal, double is equal or higher precision floating point
    * Preferable to use intN_t and uintN_t
* Constants: const int days_in_week
* Enums: enum color {RED, GREEN, BLUE}
* Typedef: typedef uint8_t BYTE; custom types
* Struct: Custom structured groups of variables

```cpp
typedef struct {
    int length_in_seconds;
    int year_recorded;
} SONG;
```

* switches evaluate cases until it sees a break
* goto is nonlocal control
flow, reserved for advanced use cases
* includes share one namespace, so it's atan, not math.atan
* Never while loop based on a floating point value

# Lecture 4: 9/2/20

## Pointers

* Variables have undefined values if not initialized in declaration
    * Heisenbugs: bugs from undefined behavior that manifest nondeterministically
* Address refers to a memory location (not value)
* Pointer contains the address of a memory location
* Syntax:
    * int *p; p is a pointer (an address of an int)
    * p = &y; assign the address of y to p
* Dereferencing:
    * z = *p; assign value at address in p to z
    * *p = z; assign the value of z to what p is pointing to
* Must initialize pointers first
    * int *ptr; *ptr = 5; does not work since *ptr never had an initial value
* Benefit of cleaner code, faster to pass pointer to large struct/array
* Drawback is bugs (dynamic memory management, dangling references, memory leaks)

## Pointer Usage

* Pointers can point to any data type
    * void* is a generic pointer (not recommended)
* Pointers can point to functions
    * int (*fn) (void*, void*) = &foo
    * fn is a pointer to a function that accepts 2 void* pointers and returns an int.
Initially points to function foo
    * (*fn)(x, y) is equivalent to foo(x, y)
* Struct arrow notation: int h = paddr->x; equivalent to int h = (*paddr).x;
    * Arrow notation is more commonly used
* p1 = p2 copies all values, not pointers
* NULL pointer: The pointer of all 0s
    * Writing or reading to null pointer causes crash
    * Since 0 is false, simple test for null pointer: if(q)
* Type declaration tells how many bytes to fetch for each pointer
    * 32bit int is stored in 4 consecutive 8-bit bytes
* Word alignment: Can only access 4 byte boundaries
* sizeof(type) returns num of bytes in object
    * By C99 definition, sizeof(char) == 1

## Arrays

* A contiguous block of memory
* Declaration: int ar[] = {1, 2};
* Arrays are almost identical to pointers
    * char *string and char string[] are nearly identical
    * Cannot increment an array
    * ar[0] is same as *ar, ar[2] is same as *(ar+2)
* Declared arrays are only allocated while scope is valid
* Declare a constant like int ARRAY_SIZE instead of hardcoding
* Array does not know its own length and bounds not checked
    * Exception is strings, the null terminator signifies end
* Common errors
    * Segmentation fault - read/write to memory you don't have access to
    * Bus error - alignment is wrong (bad word alignment)
* pointer + n adds n * sizeof(type)
* Handle: pointer to a pointer **h
    * Needed to advance a pointer for an array
    * Typically no more than 2 stars

# Lecture 5: 9/4/20

## Dynamic Memory Allocation

* sizeof gives size in bytes
    * Also knows size of arrays (in bytes)
* Allocate room to point to: malloc()
    * Ex.
ptr = (int *) malloc(n*sizeof(int))
    * This will cast the void* returned by malloc to int*
    * malloc location contains garbage initially
* Dynamically allocated space must be freed
    * When main returns it auto-frees it up, but bad practice
    * Memory leak: If memory is not freed before the program exits
* malloc returns NULL if request cannot be granted
* Common errors
    * free()ing the same place of memory twice
    * free()ing something that you didn't malloc()
* Resizing
    * realloc(p, size) which returns a new pointer (NULL if fails)
    * realloc(ip, 0) is same as free(ip)
* Array names are not variables
* *a: first value of array, a: pointer of array, &a: pointer of array (just like a)
* 3 Rules of pointers
    * 1) Need to set up the pointee (separate from pointer)
    * 2) Dereference a pointer to access its pointee
    * 3) Assignment (=) between pointers makes them point to the same pointee

## Linked List

```cpp
struct Node {
    char *value;
    struct Node *next;
};
typedef struct Node *List;  // List is pointer to struct Node

// Create new empty list
List ListNew(void) { return NULL; }
```

* When you malloc a string, need strlen(string) + 1 for null terminator \0
* strcpy(destination, original)

## Memory locations

* Struct declaration does NOT allocate memory
* Variable declaration does allocate memory
* Methods of allocating memory
    * Declaration of local variable
    * Dynamic allocation at runtime like malloc
    * Global variable (above main()) which has global scope
* 3 Pools of memory
    * 1) Static storage: global variable storage (permanent)
    * 2) Stack: Local variable storage, params, return address
    * 3) Heap: Dynamic malloc storage: data lives until deallocated by programmer
* Program Address space
    * Stack: local vars, grows downward
        * Big stack: Deep recursion/fractals
    * Heap: Space from malloc, grows upward
        * Just big mallocs without function calls
    * Static data: Declared outside of main. Does not grow/shrink
    * Code: Loaded when program starts.
Does not grow/shrink

![](https://i.imgur.com/cBqquil.png)

### Stack

* Stack frame (LIFO)
    * Return instruction address, params, space for other local variables
* Always contiguous blocks of memory, stack pointer tells where top of stack is
* When procedure ends, stack frame is tossed off stack, freeing memory for future stack frames
    * Don't erase memory even if not needed, just move stack pointer up

### Heap

* Large pool of memory, not allocated in contiguous order
* Specify the num of bytes to allocate in malloc
* Tricky to manage unlike others
    * Want malloc(), free() to be fast, minimal memory overhead
    * Want to avoid (external) fragmentation (free memory separated in small chunks)
* K&R Implementation
    * Each block of memory is preceded by a header with size of block and pointer to next block
    * Free blocks are kept in a circular linked list
        * Pointer field unused in allocated blocks
    * malloc(): Search free list for a block large enough
    * free(): Check if blocks adjacent to freed block are free
        * Coalesce into 1 bigger free block, otherwise add new freed block to free list
    * free is very quick, malloc can search the full list worst case
* malloc block choosing
    * best-fit: choose smallest block that is big enough
    * first-fit: choose first block that is big enough
    * next-fit: same as first-fit but resume searching where we finished last time
        * distributes the small slivers of memory

## When Memory goes Bad

* Pointer problems
    * Dangling reference: using ptr before malloc
    * Writing off the end of arrays
    * Cannot return pointer to variable in local frame
    * Use after free
    * Stale pointers after realloc moves data
    * Freeing the wrong stuff (incremented pointer)
    * Double free
    * Memory leaks: tardy free, lose the ptr
    * Incrementing a pointer
* Valgrind catches memory leaks, misuses of free, writing over end of arrays
    * Helpful for debugging

# Lecture 6: 9/9/20

* Very large/small numbers and fractional numbers need to be stored
* Agree where to put the binary point to separate the integer
and fractional parts
    * Store where the binary point is in a diff variable
    * Binary point can 'float' as the binary point is no longer fixed
* Normalized form: No leading 0s (like scientific notation)

## Floating Point Representation

* First 1 is not recorded (since always 1)
* Sign: 1 bit (use sign-magnitude form)
* Exponent: 8 bits
* Significand: 23 bits (LSB represents $2^{-23}$)
* Can represent range of $(1.2 * 10^{-38}, 3.4 * 10^{38})$
* Overflow: Exponent is larger than 8 bits ($>3.4 * 10^{38}, < -3.4 * 10^{38}$)
* Underflow: Negative exponent is larger than 8 bits
    * Ends up storing small fraction as 0
* More precision: double (52 bit significand)
* Significand is always between 0, 1 (for normalized numbers)
* Representing 0: Exponent value of 0
    * 0 has no leading 1
    * Results in 2 zeroes because of the sign bit
* Biased exponent representation (IEEE Standard 754)
    * 2's complement would make negative numbers look bigger
    * Bias: 127
* $(-1)^{\text{Sign}} * (1 + \text{Significand}) * 2^{\text{Exponent} - 127}$

![](https://i.imgur.com/hyKuS0x.png)

## Special Numbers

* How do we represent $\pm$ infinity
    * Use exponent 255 with 0 significand
* Representing NaN
    * Use exponent 255 with nonzero significand
    * NaNs contaminate results as they percolate up
    * $2^{23} - 1$ possible NaNs used to specify each error (sqrt neg number, 0/0, etc.)
and include which line it occurred on
* Problem of normalization and implicit 1
    * Gap between FP numbers around 0
    * Smallest pos num: $2^{-126}$
    * Second smallest: $(1 + 2^{-23}) * 2^{-126} = 2^{-126} + 2^{-149}$
* Denorm: Solves implicit 1 problem
    * Exponent 0 with nonzero significand
    * No implicit leading 1, implicit exponent -126
    * Smallest: $2^{-149}$
    * Second smallest: $2^{-148}$
* Even steps until normalized region, and the stride doubles with higher exponents
    * Have $2^{23} \approx$ 8 million slots between each power of 2
    * $2^{24} + 1$ is the smallest int that cannot be represented since the stride would be 2
* Example: $\frac{1}{3}$
    * Sign: 0
    * Exponent: $-2 + 127 = 125 = 0b01111101$
    * Significand: $0b010101 \ldots$

## Fallacies

* Associativity is lost
    * Small numbers can be lost when added to big numbers
* Precision: the number of bits used to represent a value
* Accuracy: Difference between actual value and comp representation
    * High precision permits high accuracy but no guarantee
* Rounding: 2 extra bits of precision to round
* Converting double to single precision (or FP to int)
    * 1) Round to $+\infty$
    * 2) Round to $-\infty$
    * 3) Truncate (round to 0)
    * 4) Unbiased: Round to even number
        * Decimal examples: 2.5 rounds to 2, 3.5 rounds to 4
* FP Addition
    * De-normalize to match exponents
    * Add significands
    * Keep same exponent
    * Normalize again
* Casting
    * int -> float -> int is not always same
    * float -> int -> float is not always same

## Alternate FP Representations

* float: standard float (8 exp, 23 sig)
* binary64: double precision (11 exp, 52 sig)
* Quad-precision: binary128 (15 exp, 112 sig)
* Oct-precision: binary256 (19 exp, 236 sig)
* Half-precision: binary16 or fp16 (5 exp, 10 sig)
* bfloat16 (brain FP): (8 exp, 7 sig)
* unum: Tells how many bits are for exp/sig (customizable)
    * u-bit indicates if the number is exact

# Lecture 7: 9/11/20

* CPU executes a lot of instructions
* Instruction Set Architecture (ISA): set of instructions a CPU
implements
* RISC: Keep instruction set small and simple to build fast hardware

## RISC-V Basics

* Registers instead of variables
    * Registers are inside the processor (not memory)
    * Predetermined number of registers in hardware (32)
    * Each RISC-V register is 32 bits wide (independent of variant)
    * Numbered from x0 to x31
    * x0 is special, always holds value 0
    * Registers can be referred to by number or name
    * Registers have no type
* Comments start with #
* Instruction: one statement/command
    * Each line has at most 1 instruction
    * One operation per instruction

## RISC-V add/sub

* add x1, x2, x3
    * one: operation by name
    * two: operand getting result (destination)
    * three: first operand
    * four: second operand
* Break longer sequences into multiple instructions
* Immediates: numerical constants
    * addi x3, x4, 10
    * No subi, just addi a negative value
* Use x0 for number zero to avoid using immediates

# Lecture 8: 9/14/20

* Some registers have designated use (usually < 30 registers available)
    * Optimizing compiler minimizes register usage
    * Register footprint - number of registers needed for a compute kernel
* Memory is like a large 1D array
    * Processor gives address (offset to base pointer) to point in memory
    * Load from (read) and store to memory (write)
* Memory addresses are in bytes
    * Word addresses are 4 bytes apart
* Little-endian: Least significant byte is smallest address
    * Bits are always stored with most significant in upper position
* DRAM: Dynamic Random Access Memory (fast but not as fast as registers, medium capacity)
    * registers are 50-500x faster than memory

## Loading from and storing Memory

* Load Word: lw destination, offset(base_ptr)
    * Offset is in bytes (multiple of 4)
    * Data flow moves from right to left
    * C Code: int A[100]; g = h + A[3];
    * Equiv: lw x10, 12(x15); add x10, x12, x10
* Store Word: sw current, offset(base_ptr)
    * Data flow moves from left to right
    * int A[100]; A[10] = h + A[3];
        * lw x10, 12(x15)
        * add x10, x12, x10
        * sw x10,
40(x15)
* Load byte/store byte: lb, sb for byte data transfers
    * lb x10, 3(x11) copies content of x11+3 to low byte pos of x10
    * Sign-extend the sign bit to all higher bits
* Unsigned byte load: lbu
    * No sign-extension needed
    * No sbu (nonsensical)
* addi is equivalent to a lw, add, sw sequence, but addi is much faster
* Immediates must be <= 32 bits
    * Need to load longer immediates from memory

## Decision Making

* if statement: beq reg1, reg2, L1
    * Go to statement labeled L1 if (value reg1) == (value reg2)
    * Go to next statement otherwise
    * Counterpart: bne branch not equal
* conditional branch:
    * Less than: blt (unsigned: bltu)
    * Branch greater: bge (unsigned: bgeu)
* Unconditional branch (always)
    * Jump: j label
* == usually translates to bne so next instruction executes if condition true
    * Need a jump like j Exit at the end of bne block
* No bgt, ble
* Need to increment memory pointers by num of bytes

# Lecture 9: 9/16/20

* Logical instructions: isolate or pack bytes/nibbles into words
    * 2 variants: register and immediate
    * No logical not, just use xor with 11111111
* Logical shifting:
    * Shift logical left: sll (shift 2 bits left and insert 0s on right, bits fall off left end)
    * Shift logical left immediate: slli
    * Shift logical right: srl fills left with 0s
* Arithmetic shifting: logical shift for signed int
    * Shift right arithmetic (sra, srai): moves n bits to the right, inserts sign bit into empty bits
    * Not equivalent to divide by $2^n$ (fails for odd neg numbers)
        * The C standard says division should round toward 0

## Registers

* Programs are stored in memory
    * Program lives in an area, data lives in an area (In practice, program above data)
    * 1 RISC-V instruction is 32 bits
* Instruction fetched from memory using Program Counter (PC)
    * Next instruction is 4 bytes away (incr PC +4)
* Symbolic register names
    * x10 - x17 are a0-a7 for params, a0, a1 for return values
    * x1 is ra (return address to point of origin)
    * x8-x9: s0-s1 and x18-x27: s2-s11: saved registers
    * x0 is zero
*
Pseudo-instructions: shorthand for common assembly
    * mv rd, rs = addi rd, rs, 0 (move value to another register)
    * li rd, 13 = addi rd, x0, 13 (load immediate value into register)
    * nop = addi x0, x0, 0 (no operation, processor just waits)

## Function calls

* 6 fundamental steps
    * 1) Put arguments in place where func can access them
    * 2) Transfer control to function
    * 3) Acquire local storage resources needed for func
    * 4) Perform desired task of func
    * 5) Put return value where code can access it, restore used registers, release local storage
    * 6) Return control to origin
* Use jr not j since jr allows us to use a variable to return (jr ra)
* Jump and link: jal rd, Label will jump and save the return address (equiv to addi then j)
    * Essentially link then jump
    * Pseudo-instruction is ret = jr ra
* jalr rd, rs, imm: Jump and link register
* j, jr, ret are pseudo-instructions for jal (no ret address is just a jump)

# Lecture 10: 9/18/20

* Use stack to save old values before calling func
* Stack pointer: sp (x2)
* Stack grows downwards (stack address decreases when we push)
    * Push decrements sp, pop increments sp
* Stack frame includes:
    * return instruction address
    * Params (args)
    * Space for local vars
* Stack pointer points to bottom of stack frame

```assembly
addi sp, sp, -8  # adjust stack for 2 items
sw s1, 4(sp)     # save s1 for use after
sw s0, 0(sp)
...
lw s0, 0(sp)     # restore s0 for caller
lw s1, 4(sp)
addi sp, sp, 8   # adjust stack to delete 2 items
jr ra            # (ret)
```

## Nested function calls

* ra would be overwritten in future calls
* Register conventions: a set of rules for which registers are unchanged after jal
    * Preserved: sp, gp, tp, s0-s11 (s0 is fp)
    * Not preserved: a0-a7, ra, t0-6 (arg and temp registers)
    * Caller must preserve args/rv, temp, ra
    * Callee must preserve saved registers and sp
        * The callee saves them on the stack at the beginning of the function call and restores the stack right before returning

## Memory Allocation

* Types of storage classes:
    * Automatic: local to func, discarded when func exits
        * Use stack for local vars
        * Procedure frame/activation record: segment of stack with saved registers and local vars
    * Static: exist across procedures
* Stack: high end 0xbfff_fff0
* Text segment in low end: 0x0001_0000
* Static: gp points to static (0x1000_0000)
* Heap: above static, grows up to high addresses

# Lecture 11: 9/21/20

* Everything has memory addresses (instructions, data words)
    * Branches and jumps to different sections
* One register keeps address of the instruction being executed (Program counter: PC) aka Instruction Pointer (IP)
* Programs are distributed as binaries but are different versions based on device
    * Want backwards compatibility
* Data is in words (32 bit chunks)
    * Each register is a word (lw, sw to access 1 word of memory)
* Divide instruction into fields
    * R - register-register arithmetic ops
    * I - register-immediate arithmetic ops, loads
    * S - stores
    * B - branches (minor variant of S)
    * U - 20-bit upper immediate instructions
    * J - jumps (minor variant of U)

## R-Format

![](https://i.imgur.com/tzDHERd.png)

* opcode: what format it is
    * 0b0110011 for all R format
* funct7+funct3: tell what operation to perform
* rs1, rs2: source registers: specify the registers with the operands
* rd: destination register: register that receives result of computation
* Each register field holds a 5-bit
uint corresponding to the register number (x0-x31)

## I-Format

![](https://i.imgur.com/wttA7k3.png)

* imm[11:0] covers range $[-2048, 2047]$
    * Sign-extend immediate
* opcode: 0b0010011
* shift amount is limited to 5 bits since larger shifts are meaningless
* Loads:
    * 12bit signed immediate for the memory address (offset)
    * rd contains value loaded from memory
    * rs1 holds the base address to load from

## S-format

![](https://i.imgur.com/KzbMVst.png)

* Store needs 2 registers (rs1, rs2) but no rd

# Lecture 12: 9/23/20

## B-format (SB)

![](https://i.imgur.com/5Icv0bD.png)

* Need to read 2 registers but don't write to register
* Branches used for loops (func calls and unconditional jumps handled by jump instructions)
* Can only jump relative to PC
    * Largest branch dist limited by size of code
* PC-Relative Addressing: Use immediate as 2s complement offset to PC
    * 12 bits gives $\pm 2^{11} * 32$
    * Can only branch to beginning of word
* Branch calculation
    * No branch: PC = PC + 4
    * Branch: PC = PC + immediate*4
* Support for 16-bit (compressed) instructions means branch size reduced by half
    * Half of targets will be errors
    * $\pm 2^{10} * 32$
    * Branch calculation
        * No branch: PC = PC + 4
        * Branch: PC = PC + immediate*2
    * The 12bit immediate is actually 13 bits but always has LSB = 0
* beq with offset 0 is infinite loop

## U-Format

![](https://i.imgur.com/y16TyRH.png)

* Need to branch to destination $> 2^{10}$
    * Replace beq with bne and then j
* 20 bit immediate for upper 20 bits of 32 bit word
* lui: Load upper immediate
    * lui then addi to set all 32 bits
    * addi sign extends though, which can cause problems
        * Add 1 to LUI value to take that into account
    * li: pseudo-op to handle this (lui and addi)
* auipc: Add upper immediate to PC

## J-Format (UJ)

![](https://i.imgur.com/zvYwHhQ.png)

* jal saves PC+4 to rd (return address)
* The immediate is the offset, remove the LSB 0
* PC relative jump to $\pm 2^{19}$ locations $=\pm 2^{18}$ 32bit instructions
* jalr rd, rs, immediate uses I-format
    * Writes rd = PC+4, PC = rs + immediate
*
Way to jump to absolute address
    * Cannot assume LSB is 0 so range is reduced
    * Use auipc, jalr to jump to pc-relative 32-bit offset

# Lecture 13: 9/25/20

* Interpreter: Directly executes a program in source language
    * Higher level, better error msgs
    * Slower (10x), code smaller (2x)
    * Independent instructions (can run on any machine)
* Translator: converts prog from 1 source lang to another
    * Translates to executable, don't know the source code
* Compilation, Assembly, Linking, Loading
* Compiler
    * Input: High level lang code (foo.c)
    * Output: Assembly code (foo.s)
        * Can contain pseudo-instructions

## Assembler

* Input: Assembly code (foo.s)
* Output: Object code (foo.o)
* Reads and uses directives
    * Ex. .text (code), .data (static), .globl sym, .string str, .word w1...wn
* Replace pseudo-instructions, create machine code, create object file
    * Need to sign extend immediates
    * Branch immediates count halfwords
    * PC-relative branches/jumps: count half words between curr and where you want to go
* Forward reference problem: branching below
    * Solution: Take multiple passes
    * Pass 0: Replace pseudo-instructions
    * Pass 1: Remember position of label
    * Pass 2: Use label positions to generate code
* PC-relative jumps (jal, beq, bne)
    * j offset expands to jal, zero, offset
    * Count half-words between target and jump for position-independent code (PIC)
    * PIC is good - relative to location
* Static data refs:
    * lui addi require full 32b address
    * auipc addi uses PIC
* Symbol table
    * List of items that can be used by other files
    * Labels: function calling
    * Data: .data variables can be accessed across files
* Relocation table
    * List of items whose address the file needs
    * Absolute labels (jal, jalr)
    * Static data (la)
* Object file format (ex.
ELF)
    * Object file header: size, pos of other pieces
    * text segment: machine code
    * Data segment: static data (in binary)
    * Relocation info: Has lines of code that need to be fixed later
    * Symbol table: List of the file's labels and static data that can be referenced
    * Debugging info

## Linker

* Input: Object code files, info tables (foo.o)
* Output: Executable code (a.out)
* Combines several object files into a single executable
    * Throws error for duplicate or missing symbols
* Types of addresses
    * PC-Relative (beq, bne, jal, auipc/addi): Never need to relocate
    * Absolute or external func address (auipc/jalr): Always relocate
    * Static data reference (lui/addi): Always relocate
* Relocation editing:
    * J-format: jal address
    * I,S format: static pointers
    * No conditional branches
* Text starts at 0x1000
* Statically-linked approach: include entire library even if not all of it is used (self-contained)
* Dynamically-linked (DLL): include library at runtime
    * Smaller storage, requires less memory
    * Requires time to link at runtime
    * Link at machine code level

## Loader

* Input: Executable code (a.out)
* Output: program run
* Part of the OS
* Read header to know size of text and data
* Create address space for text, data, stack
* Copy instructions and data from executable file to new address space
* Copy CLI args to stack
* Initialize machine registers (mostly cleared)
* Jump to start-up routine that copies program args from stack to register, set PC

# Lecture 14: 9/28/20

* Synchronous digital system
    * Synchronous: operations coordinated by a central clock
    * Digital: Values represented by discrete values
* Transistor: semiconductor device to amplify or switch signals
* CMOS: Complementary metal oxide semiconductor
    * Drain, Gate, Source
    * Current flows from source to drain depending on the gate
* Can create AND, OR, NOR out of NAND
* Asserted switch = closed

![](https://i.imgur.com/Cnk1daj.png)

* Signals are transmitted instantly and continuously
    * Only contains 1 value at a
time
* [Block] propagation delay: Delay between clock start and when you can see the output
* Combinational Logic (CL) circuits
    * Output is a function of inputs
* State elements
    * circuits that store information (registers, cache)

# Lecture 15: 9/30/20

* State can store values for time (registers, caches)
    * Control flow of info between logic blocks
* Register = n "flip-flops" aka D-type Flip-Flop
    * D: data, q: output

![](https://i.imgur.com/MUTg8iv.png)

* At rising edge of the clock, sample input d.
    * After a time delay, output q
* Setup time: Time before rising edge that input must be stable
* Hold time: Time after rising edge that input must be stable
* Clk-to-q delay: Time until q gets updated and is stable

![](https://i.imgur.com/ruN1tg7.png)

* Max delay = CLK to Q delay + CL delay + Setup time
* Extra registers added to speed up clock rate
    * Clock period limited by propagation delay of adder/shifter
* Finite State Machine: Represent with a state transition diagram
    * Use CL + register
    * $S_{i-1}$ is the stable output of a register which holds result of $i-1$ iteration

# Lecture 15: 10/2/20

* XOR: 1 if odd number of 1s
* Boolean algebra: + is OR, $\cdot$ is AND, $\bar{x}$ is NOT
* Circuit verification: If they are equivalent in boolean algebra
* Uniting theorem: $xy + x = x, (x+y)x = x$
* DeMorgan's Law: $\bar{(x \cdot y)} = \bar{x} + \bar{y}$, $\bar{(x + y)} = \bar{x} \cdot \bar{y}$
    * Distribute the "bar" to all terms
* Sum-of-products: ORs of ANDs
    * Sum together all possible outcomes of 1 (not for 0 for each input bit)

# Lecture 16: 10/5/20

Multiplexor

![](https://i.imgur.com/5GV4C6T.png)

Arithmetic and Logic Unit (ALU)

* 00: A + B
* 01: A - B
* 10: A & B
* 11: A | B

![](https://i.imgur.com/kt4n1zC.png)
![](https://i.imgur.com/5gd2nMB.png)

## Adders

* LSB: $s_0 = a_0 \ XOR \ b_0$, $c_1 = a_0 \& b_0$
* ith bit: $s_i = XOR(a_i, b_i, c_i)$, $c_{i+1} = MAJ(a_i, b_i, c_i) = a_ib_i + b_ic_i + a_ic_i$
* N 1-bit adders cascaded make 1 N-bit adder
* For unsigned numbers,
existence of carry indicates overflow
* For signed numbers, carry is not indicative
    * For highest adder: overflow = $c_n \ XOR \ c_{n-1}$
    * No c_out, no c_in: no overflow
    * c_in and c_out: no overflow
    * c_in but no c_out: overflow (A, B > 0)
    * c_out but not c_in: overflow
* Subtractor: A-B = A + (-B)
    * Does an XOR to flip the bits
    * if SUB is high, then we just add that 1

# Lecture 17: 10/7/20

* Hardware is inherently parallel
* CPU - active part of computer that does work (data manipulation and decision making)
* Datapath: Part of processor that has hardware to perform operations
* Computer executes 1 instruction per tick
    * State outputs drive inputs to combinational logic to make next state
    * All state elements updated at rising clock edge
* Problem: single monolithic block is too bulky
* Solution: Break up process of executing instruction into steps
    * Stage 1: Instruction Fetch (IF)
    * Stage 2: Instruction Decode (ID)
    * Stage 3: Execute (EX) ALU
    * Stage 4: Memory Access (MEM)
    * Stage 5: Write Back to Register (WB)
* All 5 stages occur within 1 clock cycle (rising edge fetch, write on the next rising edge)

![](https://i.imgur.com/WjbpPlw.png)
![](https://i.imgur.com/s9VxNJP.png)

# Lecture 18: 10/9/20

See lecture diagrams

![](https://i.imgur.com/g39nwYV.png)

# Lecture 19: 10/12/20

* Control and status register (CSR): diff from register file
    * Monitor status and performance
    * Up to 4096 CSRs
    * Necessary for peripherals
* csrrw swaps value of a CSR and an integer register
    * csrw csr, rs1 is csrrw x0, csr, rs1
        * rd=x0 means just write rs1 to CSR
    * csrwi csr, uimm is csrrwi x0, csr, uimm
        * rd=x0 just writes uimm to CSR

![](https://i.imgur.com/1eNBk2A.png)

* LSB is stable first
* Critical path lw = $t_{clk-q} + t_{IMEM} + t_{Imm} + t_{ALU} + t_{DMEM} + t_{mux} + t_{setup}$
* Certain instructions take longer
    * Most blocks will be idle for most of the time
* ROM: Read Only Memory

![](https://i.imgur.com/vUyZdhh.png)

# Lecture 20: 10/14/20

* Measure performance in maximum
frequency
* Performance measures:
    * Program execution time
    * Throughput (num of server requests handled per hour)
    * Energy per task
* Iron law of processor performance:
    * $\frac{\text{Time}}{\text{Program}} =\frac{\text{Instructions}}{\text{Program}} * \frac{\text{Cycles}}{\text{Instruction}} * \frac{\text{Time}}{\text{Cycle}}$
    * CPI: Cycles per instruction
    * Instructions per program
        * Determined by task, algorithm, language, compiler, ISA
    * Average CPI:
        * Determined by ISA, processor implementation, complex instructions (like strcpy, CPI >> 1), superscalar processors (CPI < 1)
    * Time per cycle:
        * Processor microarchitecture (critical path), technology, power budget
* 30% of energy is leaked in CMOS
* $\frac{\text{Energy}}{\text{Program}} =\frac{\text{Instructions}}{\text{Program}} * \frac{\text{Energy}}{\text{Instruction}}$
* $\frac{\text{Energy}}{\text{Program}} \propto \frac{\text{Instructions}}{\text{Program}} * CV^2$
    * Capacitance depends on technology, processor features
    * Supply voltage (run slower with less supply voltage)
* Energy Iron Law: Performance = Power * Energy Efficiency
    * Performance: tasks/second
    * Energy efficiency: tasks/joule

## Pipelining

* Does not help latency of single task
* Helps throughput of entire workload
* Multiple tasks operate simultaneously using diff resources
* (theoretical) Speedup = num of pipe stages
    * Time to fill and drain pipeline in reality
* Pipeline rate is limited by slowest pipeline stage

# Lecture 21: 10/16/20

* Put registers in between the 5 stages to pipeline

![](https://i.imgur.com/qft6eUl.png)

* Time to process 1 instruction can be longer so worse latency, but better throughput
    * Clock rate can be dramatically higher
* Pipelining hazards: situations that prevent starting the next instruction in the next clock cycle
* Structural hazard: Required resource is busy (needed in multiple stages)
    * 2+ instructions compete for single physical resource
    * Sol1: Instructions take turns, causing some to stall
    * Sol2: Add more
hardware (can always solve)
        * 2 independent read ports and 1 write port
        * 2 separate memories (cache)
* Data hazard: Data dependency between instructions, need to wait for prev inst to complete its read/write
    * Write in the first half of cycle, read in the second half of cycle (registers are fast)
    * Sol1: Stall using bubble (nop) when it depends on previous inst, but reduces performance
        * Compiler will try to rearrange code that is not dependent
    * Sol2: Forwarding/bypassing - grab operand from pipeline stage rather than register file (output of ALU to input of ALU)

![](https://i.imgur.com/98F1UjL.png)

* Control hazard: Flow of execution depends on previous instruction (for branches)

# Lecture 23: 10/18/20

## Load Data Hazard

* Load first, then do operations on it
    * Load is only available after DMEM, but needed 1 cycle earlier
* Sol1: Load delay slot: Need 1 stall, convert next command into a nop
    * Just set regWEn, MemRW, and do not update PC
    * Cycle after that, execute the original command
    * Forward from DMEM output to ALU
* Sol2: Put unrelated instruction into load delay slot

## Control Hazards

* Branches are conditional, so we don't know the next inst
    * Earliest to know branch is at end of execution (ALU)
    * 2 subsequent instructions are executed regardless of the branch outcome
* Sol1: Use 2 stall cycles after a branch
* Sol2: If branch/jump taken, convert the 2 next to nop
* Use branch prediction to guess
    * Generally loop, so branch is rarely taken
    * Ex.
Keep a bit recording whether the branch was taken last time
* Moving branch comparator to ID stage would incur data hazard and forwarding could fail

## Superscalar Processors

* Clock rate limited by tech and power dissipation
* Deeper pipelines (10, 15 stages)
    * Less work per stage, so shorter clock cycle
    * More potential for hazards (CPI > 1)
* Multi-issue superscalar
    * Replicate pipeline stages: multiple pipelines (CPI < 1)
    * Multiple execution units, start multiple instructions per clock cycle
    * Out of order execution to reduce hazards
        * Arrive at commit unit which reorders
* Get CPI using iron law (time program, count instructions, know time/cycle)