# CS 61C Final Notes

# Lecture 24: 10/20/20
* Memory and caches use powers of 2, so use binary prefixes
  * Add `bi`, so `kibi` is `Ki` for `1024`
  * Bits needed to address `x` items: $\lceil \log_2(x) \rceil$
* DRAM speed is much lower than CPU performance (processor-memory performance gap)
* Use a memory cache (copy of a subset of main memory) for faster access
* Separate caches for instructions and data
* Farther away from the processor means:
  * Longer access time, bigger capacity
* Inclusive: the smaller level is always a copy of a subset of the larger level
* Illusion of large, fast memory
* Temporal locality: if you used it recently, you will likely use it again soon
  * Keep most recently accessed data closer
* Spatial locality: if you used a piece of memory, you will likely use neighboring pieces soon
  * Move blocks of contiguous words closer to the processor
* Hierarchy management:
  * Registers to memory: compiler
  * Cache to main memory: cache controller hardware
  * Main memory to disks: OS

# Lecture 25: 10/22/20
* Direct-mapped cache: each memory address is associated with exactly 1 block within the cache
  * Look in a single location in the cache for the data
* Draw memory the same width as the cache
  * Lowest byte in the top right
  * Lower bits tell you which cache row the address maps to (which bits depends on the block size)
* Tags tell you which address the block came from
  * Discard the lower bits (which only indicate the cache index and byte within the block)
  * Aka cache number
* Divide a memory address into 3 fields (see the C sketch after the Lecture 26 notes):
  * tag: check whether the cache holds the correct block
  * index: select the cache block (row)
  * offset: byte offset within the block
* Look up the cache in order: index, then tag, then offset
* Cache size $= 2^{I} \cdot 2^{O} = 2^{I+O}$ bytes, for $I$ index bits and $O$ offset bits

## Terminology
* Cache hit: cache block is valid and contains the proper address
* Cache miss: nothing in the cache for the appropriate block
* Cache miss, block replacement: wrong data in the cache at the appropriate block; discard it and fetch the right one
* Temperature:
  * Cold: cache empty
  * Warming: cache filling with values
  * Warm: cache doing its job, fair % of hits
  * Hot: cache doing very well, high % of hits
* Hit rate: fraction of accesses that hit in the cache
* Miss rate: 1 - hit rate
* Miss penalty: time to replace a block from memory into the cache
* Hit time: time to access the cache memory
* $ is shorthand for cache
* A valid bit marks entries that do not hold real data

# Lecture 26: 10/26/20
* Lookup order IVTO: Index, Valid, Tag, Offset
* Write-through: update cache and memory at the same time
* Write-back: update the word in the cache block, allowing the memory word to be stale
  * Add a dirty bit to the block (update memory when the block is replaced)
* Block size benefits:
  * Spatial locality
  * Works well for sequential array accesses
* Block size drawbacks:
  * Larger miss penalty (longer load time for new blocks)
  * Too few blocks for the cache size (miss rate goes up)

![](https://i.imgur.com/jgTgnXd.png)

## Cache misses
* Compulsory misses (occur even with infinite size and full associativity)
  * When a program has just started (can't be avoided)
  * Every block of memory will have at least 1 compulsory miss
* Conflict misses (finite size, finite associativity)
  * 2 distinct memory addresses map to the same cache location
  * Solution 1: make the cache bigger
  * Solution 2: let multiple distinct blocks fit in the same cache index
* Capacity misses (finite size, fully associative)
  * Miss because the cache is limited in size
  * Primary type of miss for fully associative caches

## Fully Associative Caches
* Tag and offset same as before, index nonexistent
* Compare tags in parallel
* Benefits: no conflict misses since data can go anywhere
* Drawbacks: hardware is difficult to make
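To tie the Lecture 25-26 material together, here is a minimal C sketch (not from the lecture) of the tag/index/offset breakdown for a direct-mapped cache; the cache size, block size, and example address are made-up parameters.

```c
#include <stdio.h>

/* Hypothetical parameters: a 4 KiB direct-mapped cache with
 * 16-byte blocks and 32-bit byte addresses. */
#define CACHE_SIZE  4096u   /* bytes */
#define BLOCK_SIZE  16u     /* bytes */
#define ADDR_BITS   32u

static unsigned log2u(unsigned x) {        /* x must be a power of 2 */
    unsigned b = 0;
    while (x >>= 1) b++;
    return b;
}

int main(void) {
    unsigned num_blocks  = CACHE_SIZE / BLOCK_SIZE;
    unsigned offset_bits = log2u(BLOCK_SIZE);      /* O */
    unsigned index_bits  = log2u(num_blocks);      /* I */
    unsigned tag_bits    = ADDR_BITS - index_bits - offset_bits;

    unsigned addr   = 0x12345678u;                 /* example address */
    unsigned offset = addr & (BLOCK_SIZE - 1);
    unsigned index  = (addr >> offset_bits) & (num_blocks - 1);
    unsigned tag    = addr >> (offset_bits + index_bits);

    printf("T/I/O bits: %u/%u/%u\n", tag_bits, index_bits, offset_bits);
    printf("addr 0x%08x -> tag 0x%x, index %u, offset %u\n",
           addr, tag, index, offset);
    return 0;
}
```

With these parameters there are 256 blocks, so the split is 20 tag / 8 index / 4 offset bits, matching the $2^{I+O}$ size formula above.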
# Lecture 27: 10/28/20
## Set Associative Cache
* Tag and offset are the same as before; the index now points to the correct row/set
* Size = # sets × N blocks/set × block size
* Each set contains multiple blocks
* Once the correct set is found, compare with all blocks in the set
  * Compare the tag against all tags in the set
* Avoids a lot of conflict misses and only needs N comparators
* For a cache with M blocks:
  * Direct-mapped: 1-way set associative
  * Fully associative: M-way set associative
* A one-hot encoding drives the mux that selects the way that hit

## Block replacement
* Direct-mapped: the index specifies the position to replace
* N-way set associative: the index specifies the set, the block can be anywhere in the set
* Fully associative: any position
* Replacement policies:
  * LRU: least recently used block is removed
    * Pro: exploits temporal locality
    * Con: harder to keep track of LRU (with higher associativity)
  * FIFO: ignore accesses, just track insertion order
  * Random: good when the workload has low temporal locality

## AMAT
* AMAT: Average Memory Access Time
  * AMAT = Hit Time + Miss Rate × Miss Penalty
* Multi-level caches: apply the model recursively
  * AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty)
* A larger cache reduces the miss rate (but keep the first-level cache's hit time under the cycle time)
* Associativity gives more places in the cache to put a given block of memory
* L1: miss rate 1-5%, hit time 1 cycle, size: tens of KB
* L2: miss rate 10-20%, hit time: a few clock cycles, size: hundreds of KB
* The L2 miss rate here is the local miss rate: the fraction of L1 misses that also miss in L2
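A quick worked example of the multi-level formula, using the L1/L2 miss rates above plus a made-up L2 hit time (10 cycles) and main-memory penalty (100 cycles):

$$\text{AMAT} = 1 + 0.05 \times (10 + 0.20 \times 100) = 1 + 0.05 \times 30 = 2.5 \text{ cycles}$$

Because most accesses hit in L1, even a 100-cycle trip to memory only adds 1.5 cycles on average.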
# Lecture 28: 10/30/20
* The OS is the first thing that runs when the computer starts
  * Finds and controls devices (device drivers)
  * Starts services (file system, network stack, TTY/keyboard)
  * Loads, runs, and manages programs
* Provides isolation between running processes and mediates interaction with the outside world
* Memory translation: each process has a mapping from virtual to physical addresses
  * Actual memory is accessed at a physical address
* Protection and privilege
  * At least 2 running modes (user, supervisor)
  * The lesser privilege cannot change the memory mapping (but the supervisor can)
* Traps and interrupts
  * Ways to handle unusual behavior in a program
* On boot: execute instructions from flash ROM
  1. BIOS (Basic Input Output System): find a storage device, load the first sector
  2. Bootloader: load the OS kernel from disk into a location in memory, jump to it
  3. OS boot: initialize services, drivers, etc.
  4. Init: launch an application that waits for input in a loop (terminal, desktop)

## Launching applications
* Running applications are called processes
* Threads share memory; processes have separate memory
* Apps are started by the shell calling an OS routine (syscall)
  * Linux uses `fork` to create a new process and `execve` (execute file command) to load the application
  * Loads the executable file from disk, puts instructions and data into memory
  * Sets `argc` and `argv`, jumps to `main`
  * The shell waits for `main` to return (join)
* The OS enforces constraints on each application's memory
* Supervisor mode: CPU mode that protects the OS from applications
  * A process has access to only a subset of instructions when not in supervisor mode
  * The CPU runs in supervisor mode very rarely (an error here means a BSOD)

## Syscalls
* A syscall calls an OS routine
  * Set up function arguments in registers, then raise a software interrupt
* When to transition into supervisor mode:
  * Interrupt: caused by something external to the running program
    * Asynchronous to the current program (e.g., key press, disk I/O)
  * Exception: caused by an event during execution of an instruction in the running program
    * Synchronous; raised on the instruction that causes it (e.g., memory error, bus error, illegal instruction)
  * Trap: the action of servicing an interrupt or exception by a hardware jump to the interrupt/trap handler code
* `ecall`: trigger an exception to a higher privilege level
* `ebreak`: trigger an exception within the current privilege level
* None of the instructions after the trapping instruction have executed; all prior instructions have
* The handler can return from an interrupt by restoring the user registers
* More complex to handle a trap caused by an exception than by an interrupt
  * Tricky for superscalar pipelines
* Exceptions in the 5-stage pipeline:
  * PC address exceptions
  * Illegal opcodes
  * Data address exceptions
* Exceptions are handled like pipeline hazards:
  * Complete execution of the instructions before the exception
  * Flush the current pipeline instructions (turn them into nops)
  * Store the exception cause in a status register
  * Transfer execution to the trap handler
  * Optionally return to the original program and re-execute the instruction
* Multiprogramming: switch between processes very quickly (context switch) to run multiple applications at the same time
  * When jumping into a process, set a timer (interrupt)
  * When it expires, store the PC and registers (process state)
  * Pick a different process to run and load its state
  * Set the timer, change to user mode, jump to the new PC
* Use virtual memory to provide the illusion of a full memory address space to each application

# Lecture 29: 11/2/20
* Virtual memory is the next level in the memory hierarchy (and provides protection)
* The memory manager maps virtual to physical addresses
* The physical address space is hidden from user applications
* Multiple processes can occupy memory simultaneously, with protection
  * Illusion of private memory
* Can swap memory to disk to give the illusion of a larger memory
* DRAM is ~10x denser than SRAM
  * Both are volatile (lose their contents when powered off)
* Non-volatile storage: disks
  * SSD: faster, ~5x more expensive, built from transistors (nothing mechanical), can't erase single bits, only entire blocks
  * HDD: slower, cheaper

## Paged Memory
* Physical memory is broken into pages
* Typical page size: 4 KiB (need 12 bits to address within a 4 KiB page)
* 20-bit page number, 12-bit offset (for 32-bit addresses)
* Each process has a dedicated page table
* A process's physical pages need not be consecutive
* Physical addresses can have more or fewer bits than virtual addresses
* Assigning different DRAM pages to different processes isolates their memory
* Sharing is possible: assign the same physical page to several processes
* A write-protection bit per entry (exception if a process tries to write while it is 1)
* A lw/sw first needs the page table entry from memory, then the physical address
  * Page tables are too big to store entirely in the cache
  * Frequently used parts of the page table can be cached
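A minimal sketch (my own illustration, with a deliberately simplified PTE layout) of the single-level translation just described, assuming 32-bit virtual addresses and 4 KiB pages:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   12u                    /* 4 KiB pages */
#define NUM_VPAGES  (1u << 20)             /* 20-bit virtual page number */

/* Simplified page table entry: real formats pack these bits differently. */
typedef struct {
    bool     valid;          /* page is resident in DRAM */
    bool     write_protect;  /* a store to this page raises an exception */
    uint32_t ppn;            /* physical page number */
} pte_t;

/* One entry per virtual page: 2^20 entries per process even if the
 * process uses only a tiny part of its address space -- the motivation
 * for the hierarchical page tables in Lecture 30. */
static pte_t page_table[NUM_VPAGES];

/* Translate a virtual address; returns false to model a page fault,
 * which the OS would handle by bringing the page into DRAM. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (!page_table[vpn].valid)
        return false;                      /* page fault */
    *paddr = (page_table[vpn].ppn << PAGE_BITS) | offset;
    return true;
}
```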
## Page Faults
* Caches deal with individual blocks (around 64 B)
* VM deals with individual pages (4 KiB)
  * E.g., a page might hold 32 blocks of 32 words each
* If the page table entry is valid and the page is in DRAM, read/write the data
* If the page is valid but not in DRAM, allocate a new page in DRAM, evicting a page if needed
  * Store the evicted page to disk, read the requested page from disk into memory
* If the entry is not valid:
  * Allocate a new page in DRAM
  * If out of memory, evict a page
* A page fault occurs if the page is only on disk or the entry is not valid
  * Treated as an exception

# Lecture 30: 11/4/20
## Hierarchical Page Tables
* Page tables can take up large amounts of space
* Could increase the page size, but that wastes more memory
* Instead, use hierarchical page tables, where each level covers a progressively smaller region of the address space
* Most programs use only a fraction of their address space
* Virtual address with a 10-bit L1 index, 10-bit L2 index, and 12-bit offset
* A register points to the level 1 page table, which points to level 2 page tables, which point to physical memory
* Most L2 pointers will be empty

![](https://i.imgur.com/Eyaq05k.png)
![](https://i.imgur.com/RYgMZ7B.png)

## Translation Lookaside Buffers
* A reference requires 3 memory accesses with a 2-level page table
* Solution: cache some translations in a TLB
* 32-128 entries, usually fully associative
  * Each entry maps a large page, so there is less spatial locality
  * Larger TLBs are 4-8 way set-associative
  * Random or FIFO replacement policy
* TLB reach: the amount of virtual address space that can be mapped by the TLB at once
* TLB hit: single-cycle translation
* TLB miss: page table walk to refill

![](https://i.imgur.com/H2vSZjP.png)
![](https://i.imgur.com/9VXs2pL.png)

* Handling a page fault needs a precise trap so the handler can resume the program after retrieving the page
* A protection violation should abort the process

![](https://i.imgur.com/QrBLimO.png)

* Demand paging: the ability to run programs larger than primary memory
* Context switch: change of the internal state of the processor
  * Save the register values and PC, change the value in the Supervisor Page Table Base Register (SPTBR)
  * All TLB entries are set to invalid on a context switch
* VM is the level below main memory; the TLB sits before the cache
* Same CPI and AMAT equations, but main memory is treated as a mid-level cache

# Lecture 31: 11/6/20
* Need an I/O interface for keyboards, network, mouse, display
  * Input: read a sequence of bytes
  * Output: write a sequence of bytes
* Interface: memory-mapped I/O
  * A portion of the address space is reserved for I/O
  * I/O device registers live there
  * Accessed with normal load/store instructions
* Processor and I/O speeds can be badly mismatched

## Polling
* Device registers (see the sketch after the DMA notes below):
  * Control register: says whether the device is ready for I/O
  * Data register: contains the data
* The processor reads the control register in a loop
* The processor loads from input / writes to output; the I/O device resets the control register bit to 0
* Polling a mouse is fine, but polling a hard disk would be very costly for the CPU

## Interrupts
* Polling wastes processor resources; use an interrupt when I/O is ready or needs attention
* Lots of interrupts can be expensive: thrashing the cache and VM, saving/restoring state
* Low data rate (mouse, keyboard): use interrupts
* High data rate (network, disk): start with interrupts, switch to DMA
* Programmed I/O:
  * The CPU executes lw/sw instructions to move data from the device into main memory, then uses the data to compute

## DMA
* Direct Memory Access (DMA) engine
  * Contains registers written by the CPU:
    * Memory address to place the data, # of bytes, I/O device, transfer direction, unit of transfer, amount to transfer per burst
* Incoming data:
  * Receive an interrupt from the device
  * The CPU takes the interrupt and initiates the transfer (instructs the DMA engine to place the data at chosen addresses)
  * The DMA engine handles the transfer while the CPU does other things
  * Upon completion, the DMA engine interrupts the CPU
* Outgoing data:
  * The CPU confirms the external device is ready
  * The CPU begins the transfer
  * The DMA engine handles the transfer while the CPU does other things
  * The DMA engine interrupts the CPU to signal completion
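A minimal sketch of the memory-mapped-I/O polling loop described above. The register addresses, the "ready" bit, and the convention that reading the data register clears it are all hypothetical, not a real device's register map.

```c
#include <stdint.h>

/* Hypothetical MMIO addresses for a toy receiver device. */
#define RECV_CTRL  ((volatile uint32_t *)0x10000000)  /* bit 0 = data ready */
#define RECV_DATA  ((volatile uint32_t *)0x10000004)

/* Spin until the device sets the ready bit, then read one word.
 * (In this toy device, reading the data register clears the ready bit.) */
uint32_t poll_read(void) {
    while ((*RECV_CTRL & 0x1) == 0)
        ;                    /* busy-wait: this is the CPU cost of polling */
    return *RECV_DATA;
}
```

The busy-wait loop is exactly the cost that interrupts and DMA are introduced to avoid.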
## Networking
* Software send:
  * The app copies data into an OS buffer
  * The OS calculates a checksum and starts a timer
  * The OS hands the data to the network interface hardware and tells it to start
* Software receive:
  * The OS copies the data from the network interface into an OS buffer
  * The OS calculates the checksum; if OK, send an ACK; if not, delete the packet
  * If OK, the OS copies the data to the user address space and signals the app to continue

# Lecture 32: 11/9/20
* dgemm: double-precision floating-point matrix multiply
* C is around 240x faster than Python for matmul
* Hardware can be serial or parallel; software can be sequential or concurrent
* Flynn's taxonomy classifies parallel hardware
  * SIMD: Single Instruction, Multiple Data
    * Specialized hardware for lock-step calculations over arrays
  * SPMD: Single Program, Multiple Data
    * Most common parallel processing programming style

![](https://i.imgur.com/0ws0gRi.png)

* Data-Level Parallelism (DLP): one operation applied to multiple data streams
  * E.g., multiply a coefficient vector by a data vector
  * One instruction is fetched and decoded for the entire operation
* MMX: MultiMedia eXtension
* SSE: Streaming SIMD Extension
  * 8 128-bit data registers (XMM0 to XMM7)

![](https://i.imgur.com/TnyJk3u.png)

* Intrinsics: C functions and procedures that translate directly into assembly language (including SSE instructions)
  * 1:1 correspondence between SSE instructions and intrinsics

# Lecture 33: 11/13/20
* Multiprocessor execution model:
  * Each core executes its own instructions
  * Separate resources per core (datapath, higher-level caches such as L1/L2)
    * Cache coherency is important
  * Shared resources (DRAM, 3rd-level cache)
  * Drawback: the slow shared memory becomes a bottleneck
* 2 ways to use a multiprocessor:
  * Job-level parallelism: processors work on unrelated problems, no communication
  * Partition the work of a single task between several cores

## Threads
* Thread: thread of execution, a single stream of instructions
* A program/process can split (fork) itself into multiple threads that execute simultaneously
* A single CPU can execute many threads by time sharing
* Each thread has a dedicated PC and separate registers, and accesses the shared memory
* Each physical core executes one hardware thread
* The OS multiplexes multiple software threads onto the available hardware threads
  * All threads except the ones mapped to hardware threads are waiting
* OS threads give the illusion of many simultaneously active threads
  * Multiplex software threads onto hardware threads
  * Switch out blocked threads (e.g., on a cache miss)
  * Timer: switch the active thread every time interval
* Remove a software thread from a hardware thread by:
  * Interrupting its execution
  * Saving its registers and PC to memory
* Start executing a different software thread by:
  * Loading its previously saved registers
  * Jumping to its saved PC
* daemon = always-running background thread/process

## Multithreading
* Need to switch thread state very quickly
* Hardware-assisted software multithreading: 1 core, 2 threads
  * 2 copies of the PC and registers, so it looks like 2 different hardware threads
  * Hyper-threading: both threads can be active simultaneously
* Physical CPU: 1 thread at a time per CPU (switches threads on events like I/O)
* Logical CPUs (multithreading): ~1% more hardware, ~10% better performance (separate registers; shared datapath, ALU, caches)
* Multicore: duplicated processors, ~50% more hardware, ~2x better performance
* Hyper-Threading/Simultaneous Multithreading (SMT): exploits the superscalar architecture to launch instructions from different threads at the same time
* Modern machines do both

# Lecture 34: 11/16/20
* Use intrinsics since C has no native SIMD support
* Parallel loops (see the OpenMP sketch at the end of this lecture's notes):
  * `#include <omp.h>`
  * `#pragma omp parallel for` before a for loop to parallelize it
  * Don't know which thread will finish first
  * `#pragma` is ignored by compilers unaware of OpenMP
  * Cannot break out of the loop prematurely
* OpenMP programming model (fork-join model):
  * Requires a shared memory environment
  * Begin as a single process (main thread)
  * The master thread forks into worker threads, which join together at the end
* Use an array to keep track of each thread's results
* Race condition: not deterministic; multiple reads and writes to the same location at the same time
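A minimal OpenMP sketch of the ideas above; the array contents and size are made up, and the `reduction` clause plays the role of the per-thread result array, so the race condition on `sum` never arises. Compile with `gcc -fopenmp`.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    double start = omp_get_wtime();

    /* Each thread accumulates a private partial sum; OpenMP combines
     * them at the join. Without reduction, every thread would update
     * `sum` concurrently -- the race condition described above. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    double elapsed = omp_get_wtime() - start;
    printf("sum = %f (%.6f s, up to %d threads)\n",
           sum, elapsed, omp_get_max_threads());
    return 0;
}
```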
# Lecture 35: 11/18/20
## Synchronization
* Limit access to a shared resource to 1 actor at a time to prevent race conditions
* Use locks to control access to shared resources
* Implementing the lock as an ordinary variable (1 = locked) does not work: the test and the set are separate instructions, so two threads can both see the lock as free
* Critical section: the region of code that needs to be serialized
* Solution: atomic read/write in a single instruction (hardware synchronization)
  * Must use shared memory
* Atomic Memory Operations (AMOs): atomic swap of a register and a memory location
* OpenMP locks: declare `omp_lock_t lock`
  * `omp_init_lock(&lock)`, `omp_set_lock(&lock)`, `omp_unset_lock(&lock)`, `omp_destroy_lock(&lock)`
* `#pragma omp critical` inside a parallel region ensures only 1 thread executes that section at a time
* Deadlock: a system state where no progress is possible
* Elapsed wall-clock time: `double omp_get_wtime(void);`
  * Time is measured per thread

## Shared memory
* SMP: (shared memory) symmetric multiprocessor
  * >= 2 CPUs, single coherent memory
* Memory is a performance bottleneck even for 1 processor
  * Caches reduce bandwidth demands on main memory
* Each processor has its own local private cache
* Cache misses have to access the shared common memory

## Cache Coherency
* Need to keep the cached values coherent
* Idea: notify the other processors when a processor has a cache miss or performs a write
  * If a processor writes, invalidate the other copies
* Each cache tracks the state of each block it holds:
  1. Shared: up-to-date data; other caches may have copies
  2. Modified: up-to-date data, changed/dirty, no other cache has a copy, OK to write; memory is out of date (need to write back)
  3. Exclusive: up-to-date data, no other cache has a copy, OK to write; memory is up to date (no write-back needed if the block is replaced)
  4. Owner: up-to-date data; other caches can have copies, but they must be in the Shared state
     * The owner has the exclusive right to make changes to the block
     * It broadcasts those changes to all other caches (dirty sharing of data)
     * It can become Modified after invalidating the shared copies, OR change to Shared after writing its modifications back to memory (going from Owner to Shared)
     * Owned cache lines must respond to a snoop request with their data

![](https://i.imgur.com/1AF4xLz.png)

* False sharing: a block ping-pongs between 2 caches even though the variables being accessed are disjoint
* Coherence/communication miss: the variables are disjoint, but the block has to be invalidated whenever the other processor updates it

# Lecture 36: 11/20/20
* Amdahl's Law: speedup with enhancement E = (exec time without E) / (exec time with E)
* Suppose enhancement E does not affect a portion $s$ of a task and accelerates the remaining portion $(1-s)$ by a factor of $P$ ($P > 1$):
  * exec time with E = exec time without E $\times \left[s + \frac{1-s}{P}\right]$
  * Speedup with E $= \frac{1}{s + \frac{1-s}{P}} \leq \frac{1}{s}$ (the bound as $P \to \infty$)
* The speedup achievable through parallelism is limited by the serial portion $s$
* Request-level parallelism: many requests per second reading a database (usually little read-write synchronization needed)
* Data-level parallelism (DLP): data in memory, or data spread across many disks
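A quick numeric check of the bound, with made-up numbers: suppose 10% of the task is serial ($s = 0.1$) and the rest is spread over $P$ processors.

$$P = 10:\quad \frac{1}{0.1 + 0.9/10} \approx 5.3\times \qquad\qquad P \to \infty:\quad \frac{1}{0.1} = 10\times$$

Even with unlimited processors the speedup never exceeds $1/s$.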
## MapReduce
* Scalable to large data volumes
* Cost-efficient: commodity machines, commodity network
* Automatic fault tolerance
* Switches at the higher levels of the network are expensive

![](https://i.imgur.com/BJesX0y.png)

* Map: processes an input key/value pair
  * Input data is sliced into shards/splits and distributed to workers
  * Produces a set of intermediate key/value pairs
  * Intermediate pairs are grouped by key and sent to reduce
* Reduce: combines the intermediate values for a particular key and produces an output value
* Maps need to finish before reduce can run, but reducers can start reading intermediate values earlier

## Spark
* Spark is up to 100x faster than Hadoop for in-memory workloads
* Advanced DAG execution engine with support for cyclic data flow and in-memory computing
* Lazy evaluation model
  * Nothing is evaluated until an action such as `.collect()` is called
* `flatMap` flattens the per-element results of the map into one collection

# Lecture 37: 11/23/20
## Warehouse Scale Computers
* ~100k servers in the same datacenter (the building must be cooled to remove the heat generated)
* Homogeneous hardware/software
* High availability
* Want ample parallelism (data-level parallelism), prepare for a high number of component failures, cost of equipment << cost of ownership
* A rack contains 40-80 servers
* An array/cluster contains 16-32 server racks
* Performance:
  * Response time/latency: time between the start and completion of a task
  * Throughput/bandwidth: total amount of work done in a given time
* Lower latency to DRAM in another server than to local disk
* Higher bandwidth to local disk than to DRAM in another server

## Power Usage Effectiveness
* Workload variation:
  * Place data in the right spot within the array for good performance
  * Cope with failures gracefully
  * Scale up and down in response to varying demand
* A server uses about half its peak power when idle
* Most WSC servers are only 10-50% utilized
* Ideally, energy use would be linearly proportional to utilization (energy proportionality)
* WSC energy efficiency: amount of computational work performed divided by the total energy used in the process
* Power Usage Effectiveness (PUE): total building power / IT equipment power (1.0 is perfect)
  * Some power goes to the uninterruptible power supply (UPS, battery) and to the chiller cooling the warm water from the AC
* Need careful airflow handling (don't mix server hot air with cold air); easier to control in small freight containers
* Keep the cold aisle temperature at 81°F (instead of 65°F)
* Use free cooling (cool the warm water outside by evaporation in cooling towers)

# Lecture 38: 11/30/20
* Dependability via redundancy (datacenters, disks, memory bits)
* RAID: Redundant Arrays of Independent Disks
* ECC: Error-Correcting Code (redundant DRAM memory)
* Spatial redundancy: replicated data/hardware to handle hard and transient failures
* Temporal redundancy: redundancy in time to handle transient failures
* Dependability measures:
  * Reliability: Mean Time To Failure (MTTF)
  * Service interruption: Mean Time To Repair (MTTR)
  * Mean Time Between Failures (MTBF)
    * MTBF = MTTF + MTTR
  * Availability = MTTF / (MTTF + MTTR)
  * Increase MTTF with more reliable hardware/software and fault tolerance
  * Reduce MTTR: improve tools/processes for diagnosis and repair
* Number of 9s of availability per year:
  * 1 nine: 90%, 2 nines: 99%
* Reliability measures:
  * Annualized Failure Rate (AFR): average number of failures per year
    * AFR = (24 × 365) / MTTF, with MTTF in hours
  * Failures In Time (FIT) rate: number of failures expected in 1 billion device-hours of operation
    * MTBF = 10^9 / FIT
  * Automotive Safety Integrity Level (ASIL) defines FIT rates for different classes of components in vehicles
* Design principles:
  * No single point of failure
  * Dependability is limited by the part you do not improve

## Error Detection
* Memory systems accidentally flip bits
  * Soft errors: cells are struck by alpha particles or suffer an environmental upset
  * Hard errors: chips permanently fail
* Error Detection/Correction Codes can protect against soft errors
* Hamming distance: the number of bit positions in which two words differ
* Parity bit: forces stored words to have even parity
  * XOR all the bits together: if the result is 0 the parity is even (OK); if 1, the parity is odd, which signals an error
  * Cannot detect an even number of bit errors
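A minimal sketch of the XOR parity check just described (the word size and helper name are my own):

```c
#include <stdint.h>

/* Even parity over a 32-bit word: XOR-fold all the bits down to one.
 * Result 0 means an even number of 1s; checking a stored word together
 * with its parity bit, a result of 1 signals an error. */
uint32_t parity32(uint32_t w) {
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}
```

Flipping any single bit flips the result; flipping two bits cancels out, which is exactly the "cannot detect an even number of errors" limitation above.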
## Error Correction
* Detection: the bit pattern fails the codeword check
* Correction: map to the nearest valid codeword
  * "Snap back" to the closest valid word
* With minimum Hamming distance 2, the single-bit neighbors of every valid codeword are invalid (errors are detected, not corrected)
* Hamming code: place parity bits at the bit positions that are powers of 2
  * p1 covers the positions whose lowest position bit is 1 (1, 3, 5, 7, ...)
  * p2 covers the positions whose second position bit is 1 (2, 3, 6, 7, ...)
  * p4 covers the positions whose third position bit is 1 (4-7, 12-15, ...)
  * Equivalently, a parity bit at position $2^k$ starts at its own position and alternately covers and skips runs of $2^k$ positions (the stride)
* The position of a single flipped bit is the sum of the positions of the parity bits that fail their check
* Stronger protection needs double-error correction, triple-error detection (DECTED)
* Cyclic Redundancy Check (CRC): more advanced codes, often combined with interleaving

![](https://i.imgur.com/8qsqBG0.png)

## RAID
* RAID 1: disk mirroring/shadowing: each disk is fully duplicated onto its mirror
  * Very high availability
  * Writes are limited to single-disk speed; reads can be optimized
  * 100% capacity overhead, so very expensive
* RAID 3: parity disk
  * Stripe physical records across the disks; the parity disk holds the parity of each stripe
  * If a disk fails, combine P with the surviving disks to reconstruct the lost data
* RAID 4: high I/O-rate parity
  * All parity is kept on one dedicated disk
  * For concurrent small writes, the parity must be updated:
    * Option 1: read the other data disks, compute the new parity, and write it to the parity disk
    * Option 2: compare the old data to the new data and apply the difference to P
  * Both options are bottlenecked by writes to the single parity disk
* RAID 5: high I/O-rate interleaved parity
  * Parity blocks are interleaved across all disks in a diagonal pattern, removing the single-parity-disk bottleneck
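A minimal sketch of the RAID 4/5 small-write update ("Option 2" above); the in-memory block buffers and helper name are my own illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* Small-write parity update: new parity = old parity XOR old data XOR new data.
 * Only the target data disk and the parity disk need to be read and written. */
void update_parity(uint8_t *parity, const uint8_t *old_data,
                   const uint8_t *new_data, size_t block_size) {
    for (size_t i = 0; i < block_size; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```

Because XOR is its own inverse, XORing in both the old and the new data replaces the old data's contribution to the parity, which is what "add the difference to P" means above.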