# CS 61C Final Notes
# Lecture 24: 10/20/20
* Memory and caches use powers of 2 for binary prefixes
* add `bi` so `kibi` is `Ki` for `1024`
* Number of bits needed to address $x$ items: $\lceil \log_2(x) \rceil$
* DRAM speed is much lower than CPU Performance (performance gap)
* Use a memory cache (copy of a subset of main memory) for faster access
* Separate caches for instructions and data
* Farther away from processor means:
* Longer access time, bigger capacity
* Inclusive: the smaller level always holds a copy (subset) of the larger level
* Illusion of large fast memory
* Temporal locality: Use it recently, want to use it again soon
* Keep most recently accessed data closer
* Spatial locality: Use a piece of memory, want to use neighboring pieces soon
* Move blocks of contiguous words closer to processor
* Hierarchy management:
* Registers to memory: Compiler
* Cache to main memory: cache controller hardware
* Main memory to disks: OS
# Lecture 25: 10/22/20
* Direct-mapped cache: each mem addr associated with 1 block within cache
* Look in a single location in cache for data
* Draw memory same width as cache
* Lowest byte in top right
* Lower bits tell the cache mapped row (which bits depend on the block size)
* Tags tell you which address it came from
* Discard the lower bits (which indicate the cache index and byte offset within the block)
* Aka cache number
* Divide memory address into 3 fields:
* tag: check if have correct block
* index: select block
* offset: byte offset within block
* Look up the cache in order: index, then tag (with valid bit), then offset
* Cache size (rows × block size) $= 2^I \cdot 2^O = 2^{I+O}$ bytes, where $I$ = index bits, $O$ = offset bits
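A minimal sketch (not from lecture) of splitting an address into tag/index/offset fields for a hypothetical direct-mapped cache; the block size and row count here are assumptions:

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical direct-mapped cache: 16 B blocks (O = 4 offset bits),
// 64 rows (I = 6 index bits); cache size = 2^(I+O) = 1 KiB.
#define OFFSET_BITS 4
#define INDEX_BITS  6

int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```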
## Terminology
* Cache hit: Cache block is valid and contains proper address
* Cache miss: Nothing in cache for appropriate block
* Cache miss, block replacement: wrong data in cache at appropriate block, discard it and fetch the right one
* Temperature:
* Cold: cache empty
* Warming: cache filling with values
* Warm: cache doing its job, fair % of hits
* Hot: Cache doing very well, high % of hits
* Hit rate: fraction of accesses that hit in the cache
* Miss rate: 1 - hit rate
* Miss penalty: time to replace a block from memory to cache
* Hit time: Time to access cache memory
* $ is cache
* Valid bit distinguishes real cached data from garbage (invalid) entries
# Lecture 26: 10/26/20
* Order IVTO: Index, Valid, Tag, Offset
* Write-through: Update cache and memory at same time
* Write-back: Update word in cache block, allowing memory word to be stale
* Add dirty bit to block (update memory when block is replaced)
* Block size benefits:
* Spatial locality
* Works well for sequential array access
* Block size drawbacks
* Larger miss penalty (longer load time for new blocks)
* If blocks are too large, there are too few blocks for the cache size (miss rate goes up)

## Cache misses
* Compulsory misses (infinite size, fully associative)
* When a program is first started (can't be avoided)
* Every block of memory will have at least 1 compulsory miss
* Conflict misses (finite size, finite associativity)
* 2 distinct memory addresses map to same cache location
* Sol1: Make cache size bigger
* Sol2: Multiple distinct blocks can fit in the same cache index
* Capacity misses (finite size, fully associative)
* miss because cache is limited in size
* Primary type of miss for fully associative caches
## Fully Associative Caches
* Tag and offset same as before, index nonexistent
* Compare tags in parallel
* Benefits: no conflict misses since data can go anywhere
* Drawbacks: Hardware difficult to make
# Lecture 27: 10/28/20
## Set associative Cache
* Tag and offset are same as before, index points us to correct row/set
* Size is # sets x N blocks / set x block size
* Each set contains multiple blocks
* Once found correct set, compare with all blocks in set
* Compare tag with all tags in set
* Avoids a lot of conflict misses and only need N comparators
* For a cache with M blocks:
* Direct-mapped: 1 way set assoc
* Fully assoc: M-way set assoc
* One hot encoding for a mux to choose which had a hit
## Block replacement
* Direct mapped: Index specifies the position to replace
* N-way Set Assoc: Index specifies set, block can be anywhere in set
* Fully associative: Any position
* Replacement policy
* LRU: Least recently used is removed
* Pro: temporal locality
* Con: harder to keep track of LRU (with set assoc)
* FIFO: Ignore accesses, just track initial order
* Random: Good for low temporal locality of workload
## AMAT
* AMAT: Average Memory Access Time
* = Hit Time + Miss Penalty * Miss Rate
* Multi level caches: recursive model of caches
* AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty)
* Larger (outer) caches reduce the miss rate; keep the L1 hit time under the cycle time
* More places in cache to put blocks of memory: associativity
* L1: miss rate 1-5%, hit time 1 cycle, size: tens of KB
* L2: miss rate: 10-20%, hit time: few clock cycles, size: hundreds of KB
* L2 miss rate is a fraction of L1 misses that also miss in L2
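A quick sanity-check sketch of the two-level AMAT formula above; the rates and latencies are assumed example values, not the lecture's numbers:

```c
#include <stdio.h>

// AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * L2 miss penalty)
// Times in cycles; all numbers below are assumptions for illustration.
int main(void) {
    double l1_hit = 1.0,  l1_miss_rate = 0.05;   // 5% of accesses miss in L1
    double l2_hit = 10.0, l2_miss_rate = 0.20;   // 20% of those also miss in L2
    double l2_miss_penalty = 100.0;              // go to DRAM

    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_miss_penalty);
    printf("AMAT = %.2f cycles\n", amat);        // 1 + 0.05 * (10 + 20) = 2.5
    return 0;
}
```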
# Lecture 28: 10/30/20
* OS is first thing that runs when computer starts
* Find and control devices (device drivers)
* Start services (file system, network stack, TTY/keyboard)
* Loads, runs, manages programs
* Isolation between running processes and interaction with outside world
* Memory translation: Each process has a mapping from virtual to physical address
* Actual memory accessed at a physical address
* Protection and privilege
* At least 2 running modes (User, supervisor)
* Lesser privilege cannot change memory mapping (but supervisor can)
* Traps and interrupts
* Ways to handle unusual behavior in program
* On boot: Execute instructions from Flash ROM
* 1. BIOS (Basic Input Output System): find a storage device, load first sector
* 2. Bootloader: Load OS kernel from disk into location in memory, jump to it
* 3. OS Boot: initialize services, drivers, etc.
* 4. Init: Launch application that waits for input in loop (Terminal, Desktop)
## Launching applications
* Applications called processes
* Thread: shared memory; process: separate memory
* Apps started using shell calling OS routine (syscall)
* Linux uses fork to create a new process and execve (execute file command) to load the application (see the sketch after this list)
* Loads executable file from disk, puts instructions and data into memory
* Set argc and argv, jump to main
* Shell waits for main to return (join)
* OS enforces constraints for applications memory
* Supervisor mode: CPU's mode to protect OS from application
* Process has access to subset of instructions when not in supervisor mode
* Very rarely runs in supervisor mode (BSOD if error here)
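A minimal sketch of the fork/execve/wait flow described above (launching `/bin/ls` is just an example program choice):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();                  // create a new process (child)
    if (pid == 0) {
        // Child: replace its memory image with the new program's code/data.
        char *argv[] = {"ls", "-l", NULL};
        execve("/bin/ls", argv, NULL);
        perror("execve");                // only reached if execve fails
        exit(1);
    }
    int status;
    waitpid(pid, &status, 0);            // the "shell" waits for the child to finish
    printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}
```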
## Syscalls
* Call an OS routine
* Need to perform a syscall (set up function arguments in registers, raise a software interrupt)
* When to transition into supervisor mode:
* Interrupts: Something external to the running program
* Async to current program (ex. key press, disk IO)
* Exception: Caused by an event during execution of an instruction by the running program
* Sync, on the instruction that causes the exception (ex. memory error, bus error, illegal instruction)
* Trap: action of servicing interrupt or exception by hardware jump to interrupt/trap handler code
* Ecall: Trigger exception to higher privilege
* Ebreak: Trigger exception within current privilege
* None of the instructions after the trap have been executed; all prior instructions have
* Handler can return from an interrupt by restoring user registers
* More complex to handle a trap from an exception than from an interrupt
* Tricky for superscalar pipeline
* Exceptions in 5-stage pipeline
* PC address exception
* Illegal opcode
* Data address exceptions
* Exceptions handled like pipeline hazards
* Complete execution of instructions before the exception
* Flush current pipeline instructions (turn into nops)
* Store exception cause in status register
* Transfer execution to trap handler
* Optionally return to original program, re-execute instruction
* Multiprogramming: Switch between processes very quickly (context switch) to run multiple applications at same time
* When jumping into process, set a timer (interrupt)
* When expires, store PC and registers (process state)
* Pick a different process to run and load state
* Set timer, change to user mode, jump to new PC
* Use virtual memory to provide illusion of full memory address space for applications
# Lecture 29: 11/2/20
* Virtual memory is the next level in the memory hierarchy (also used for protection)
* Memory manager maps virtual to physical addresses
* Physical address space hidden from user apps
* Multiple processes can simultaneously occupy memory with protection
* Illusion of private memory
* Can swap memory to disk to give illusion of larger memory
* DRAM is 10x denser than SRAM
* Both are volatile (lose their contents when powered off)
* Non-volatile storage disks
* SSD: faster, 5x more expensive, made with transistors (nothing mechanical), can't erase single bits, only entire blocks
* HDD: slower, cheaper
## Paged Memory
* Physical memory broken into pages
* Typical page size: 4KiB (need 12 bits to address 4KiB)
* 20 bit page number, 12 bit offset
* Each process has a dedicated page table
* A process's pages need not be consecutive in physical memory
* Physical addresses can have more or fewer bits than virtual addresses
* Assigning diff pages in DRAM isolates process memory
* Sharing is possible, assign same physical page to several processes
* Use a bit to indicate write protection (exception if try to write if 1)
* lw/sw first needs the page table entry from memory, then can form the physical address
* Page tables are too big to store entirely in the cache
* Frequently used parts of page tables can be cached
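A rough sketch of translating a 32-bit virtual address with 4 KiB pages through a hypothetical flat (single-level) page table; the field names and layout are assumptions:

```c
#include <stdint.h>

#define PAGE_OFFSET_BITS 12                 // 4 KiB pages
#define NUM_VPN          (1u << 20)         // 32-bit VA -> 20-bit virtual page number

typedef struct {
    uint32_t valid         : 1;
    uint32_t write_protect : 1;             // exception if a write hits this page
    uint32_t ppn           : 20;            // physical page number
} pte_t;

pte_t page_table[NUM_VPN];                  // one entry per virtual page

// Returns the physical address, or 0 to signal a page fault in this sketch.
uint64_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);
    pte_t pte = page_table[vpn];
    if (!pte.valid)
        return 0;                           // page fault: the OS would handle this
    return ((uint64_t)pte.ppn << PAGE_OFFSET_BITS) | offset;
}
```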
## Page Faults
* Caches deal with individual blocks (around 64 B)
* VM deals with individual pages (4KiB)
* Ex. Each page has 32 blocks, each block has 32 words
* If page table entry is valid and in DRAM, read/write data
* If valid but the page is on disk: allocate a new page in DRAM, evicting a page if DRAM is full
* Write the evicted page to disk, read the requested page from disk into memory
* If the entry is not valid:
* Allocate a new page in DRAM
* If out of memory, evict a page
* Page fault: the requested page is not in DRAM or the entry is not valid
* Treated as an exception (handled by the OS)
# Lecture 30: 11/4/20
## Hierarchical Page Tables
* Page tables can take up large amounts of space
* Can increase page size, but wastes more memory
* Solution: hierarchical (multi-level) page tables
* Most programs use a fraction of memory
* Virtual address with 10-bit L1 index, 10-bit L2 index, 12 bit offset
* register points to level 1 PT which points to Level 2 PT which points to physical memory
* Most pointers to L2 page tables are empty, so those L2 tables never need to be allocated


## Translation Lookaside Buffers
* Reference requires 3 mem accesses for 2-level PT
* Solution: Cache some translations in TLB
* 32-128 entries, usually fully associative
* Each entry maps large page so less spatial locality
* Larger TLBs are 4-8 way set-associative
* Random or FIFO replacement policy
* TLB reach: the amount of virtual address space that can be mapped by the TLB at once
* TLB hit: Single cycle translation
* TLB miss: PT walk to refill
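A sketch of the lookup order implied here: try the TLB first, and only do the page-table walk on a miss. The fully associative TLB, its size, and the `page_table_walk` helper are assumptions for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64

typedef struct { bool valid; uint32_t vpn; uint32_t ppn; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];            // fully associative in this sketch

extern uint32_t page_table_walk(uint32_t vpn);  // slow path: extra memory accesses

uint32_t translate_with_tlb(uint32_t vaddr) {
    uint32_t vpn = vaddr >> 12, offset = vaddr & 0xFFF;
    for (int i = 0; i < TLB_ENTRIES; i++)       // TLB hit: single-cycle in hardware
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << 12) | offset;
    uint32_t ppn = page_table_walk(vpn);        // TLB miss: refill from the page table
    tlb[0] = (tlb_entry_t){ true, vpn, ppn };   // real TLBs use random/FIFO replacement
    return (ppn << 12) | offset;
}
```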


* Handling a page fault needs a precise trap so the handler can resume the program after retrieving the page
* Protection violation should abort process

* Demand paging: ability to run programs larger than primary memory
* Context switch: Change of internal state of processor
* save register values and PC, change value in Supervisor Page Table Base register (SPTBR)
* TLB entries all set to invalid on context switch
* VM is level below main memory, TLB comes before cache
* Same CPI, AMAT equations, but main memory treated as mid-level cache
# Lecture 31: 11/6/20
* Need IO interface for keyboards, network, mouse, display
* Input: Read a sequence of bytes
* Output: Write a sequence of bytes
* Interface: Memory mapped I/O
* Portion of address space for IO
* IO device registers there
* Use normal load/store instructions
* Processor-IO speed can be mismatched
## Polling
* Device registers:
* Control register: says OK if ready for IO
* Data register: contains data
* Processor reads from control register in loop
* Processor loads from input/writes to output, IO resets control register bit to 0
* Polling a mouse is fine, but polling a hard disk would be very costly for the CPU
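A sketch of that polling loop; the register addresses and the ready-bit layout are made up for illustration:

```c
#include <stdint.h>

// Hypothetical memory-mapped device registers (addresses are not real).
#define CTRL_REG ((volatile uint32_t *)0x10000000)  // bit 0 = "data ready"
#define DATA_REG ((volatile uint32_t *)0x10000004)

uint32_t poll_read(void) {
    while ((*CTRL_REG & 0x1) == 0)
        ;                           // spin until the device sets the ready bit
    return *DATA_REG;               // load the data; the device then clears the bit
}
```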
## Interrupts
* Polling wastes processor resources, so use an interrupt when IO is ready or needs attention
* Handling lots of interrupts can be expensive (cache and VM thrashing, saving/restoring state)
* Low data rate (mouse, keyboard): use interrupts
* High data rate (network, disk): start with interrupts, switch to DMA
* Programmed IO:
* CPU execs lw/sw instructions, gets data from device to main memory, uses data to compute
## DMA
* Direct Memory Access (DMA) engine
* Contains registers written by CPU
* Mem addr to place data, # of bytes, IO device, transfer direction, unit of transfer, amount to transfer per burst (sketched below)
* Incoming data:
* Receive interrupt from device
* CPU takes interrupt, initiates transfer (instructs DMA engine to place data at addresses)
* DMA handles transfer as CPU does other things
* Upon completion, DMA interrupts CPU
* Outgoing data:
* CPU confirms external device is ready
* CPU begins transfer
* DMA handles transfer as CPU does other things
* DMA interrupts CPU to signal completion
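A hypothetical register block mirroring the fields listed above; the layout and field names are assumptions, not a real device spec:

```c
#include <stdint.h>

typedef struct {
    volatile uint64_t mem_addr;     // where in memory to place/fetch the data
    volatile uint32_t num_bytes;    // total number of bytes to transfer
    volatile uint32_t device_id;    // which I/O device is involved
    volatile uint32_t direction;    // 0 = device -> memory, 1 = memory -> device
    volatile uint32_t burst_size;   // amount to transfer per burst
    volatile uint32_t start;        // CPU writes 1 to kick off the transfer
} dma_regs_t;

// CPU side of an incoming transfer: program the engine, then go do other work;
// the DMA engine interrupts the CPU when the transfer completes.
void start_incoming_dma(dma_regs_t *dma, void *buf, uint32_t len, uint32_t dev) {
    dma->mem_addr   = (uint64_t)(uintptr_t)buf;
    dma->num_bytes  = len;
    dma->device_id  = dev;
    dma->direction  = 0;
    dma->burst_size = 64;
    dma->start      = 1;
}
```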
## Networking
* sw send:
* App copies data to OS buffer
* OS calculates checksum, starts timer
* OS sends data to network interface hardware, says start
* sw receive
* OS copies data from network interface to OS buffer
* OS calculates checksum, if OK send ACK, if not, delete
* If OK, OS copies data to user address space, signals app to continue
# Lecture 32: 11/9/20
* dgemm: double precision floating-point matrix mul
* C is around 240x faster than python for matmul
* Hardware can be serial or parallel, software can be sequential or concurrent
* Flynn's taxonomy - parallel hardware
* SIMD: Single instruction, multiple data
* Specialized hardware to handle lock-step calculations for arrays
* SPMD: Single Program Multiple data
* Most common parallel processing programming style

* Data Level Parallelism (DLP): operation on multiple data streams
* Ex. multiply coef vector by data vector
* One instruction fetched and decoded for entire operation
* MMX: MultiMedia extension
* SSE: Streaming SIMD Extension
* 8 128-bit data registers (XMM0 to XMM7)

* Intrinsics: C functions and procedures that map directly to assembly instructions (including SSE instructions)
* 1:1 correspondence between SSE instructions and intrinsics
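A small example of that 1:1 mapping using SSE2 double-precision intrinsics (an element-wise vector add, not the dgemm from lecture):

```c
#include <emmintrin.h>   // SSE2 intrinsics; __m128d maps to a 128-bit XMM register
#include <stdio.h>

int main(void) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {10.0, 20.0, 30.0, 40.0};
    double c[4];

    for (int i = 0; i < 4; i += 2) {         // 2 doubles per 128-bit register
        __m128d va = _mm_loadu_pd(&a[i]);    // one intrinsic ~ one SSE load
        __m128d vb = _mm_loadu_pd(&b[i]);
        __m128d vc = _mm_add_pd(va, vb);     // one SIMD add = 2 additions
        _mm_storeu_pd(&c[i], vc);
    }
    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```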
# Lecture 33: 11/13/20
* Multiprocessor execution model:
* Each core executes own instructions
* Separate resources (datapath, high level caches)
* Cache coherency is important
* Shared resources (DRAM, 3rd level cache)
* Drawback: Slow memory shared by cores (bottleneck)
* 2 ways to use a multiprocessor
* Job-level parallelism: processors work on unrelated problems, no communication
* Partition work of single task between several cores
## Threads
* Thread: Thread of execution, single stream of instructions
* Program/process can split/fork itself into multiple threads to execute simultaneously
* Single CPU can execute many threads by time sharing
* Each thread has a dedicated PC and separate registers, and accesses shared memory
* Each physical core executes one hardware thread
* OS multiplexes multiple software threads onto available hardware threads
* All threads except ones mapped to hardware threads are waiting
* OS Threads give illusion of many simultaneously active threads
* Multiplex software threads onto hardware threads
* Switch out blocked threads (cache miss)
* Timer (switch active thread every time interval)
* Remove a software thread from a hardware thread by:
* Interrupting its execution
* Saving its registers and PC to memory
* Start executing diff software thread by
* Loading previously saved regs
* Jump to its saved PC
* daemon = always running
## Multithreading
* Need to switch thread state very quickly
* Hardware Assisted Software Multithreading: 1 core, 2 threads
* 2 copies of PC and registers, looks like 2 diff hardware threads
* Hyper-threading: both threads can be active simultaneously
* Physical CPU: 1 thread at a time per CPU (switches threads from IO events)
* Logical CPU (hyper-threading / simultaneous multithreading): ~1% more hardware, ~10% better performance (separate registers, shared datapath, ALU, cache)
* Multicore: duplicate processors, ~50% more hardware, ~2x better performance
* Exploit superscalar architecture to launch instructions from different threads at the same time
* Modern machines do both
# Lecture 34: 11/16/20
* Use intrinsics since C has no native SIMD support
* Parallel loops:
* `#include <omp.h>`
* `#pragma omp parallel for` before for loop to parallelize
* Don't know what thread will finish first
* `#pragma` ignored by compilers unaware of openMP
* Cannot break out of the parallelized loop early
* OpenMP Programming model (Fork-join model)
* Requires shared memory environment
* Begin as a single process (main thread)
* Master thread forks into threads, join together at the end
* Use array to keep track of all thread results
* Race condition: not deterministic, multiple reads and writes at same time
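A minimal `#pragma omp parallel for` sketch (a simple vector add chosen for illustration; compile with `-fopenmp`):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    // Iterations are divided among the threads; each index i is written by
    // exactly one thread, so there is no race on c[i].
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %.1f (up to %d threads)\n", c[10], omp_get_max_threads());
    return 0;
}
```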
# Lecture 35: 11/18/20
## Synchronization
* Limit access to shared resource to 1 actor at a time to prevent race condition
* Use locks to control acces to shared resources
* Implementing the lock as an ordinary variable (1 = locked) does not work, since the test and set are not atomic
* Critical section: area that needs to be serialized
* Solution: Atomic read/write in single instruction (hardware sync)
* Must use shared memory
* Atomic Memory Operations (AMOs): Atomic swap of register and memory
* Declare `omp_lock_t lock`
* `omp_init_lock(&lock)`, `omp_set_lock(&lock)`, `omp_unset_lock(&lock)`, `omp_destroy_lock(&lock)`
* `#pragma omp critical` within parallel ensures only 1 thread accesses at a time
* Deadlock: System state where no progress is possible
* Elapsed wall clock time: `double omp_get_wtime(void);`
* Time measured per thread
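A small sketch combining the lock calls and the wall-clock timer above (summing into a shared variable is just an illustrative critical section):

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    long long sum = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);

    double t0 = omp_get_wtime();     // elapsed wall-clock time
    #pragma omp parallel for
    for (int i = 1; i <= 100000; i++) {
        omp_set_lock(&lock);         // serialize the read-modify-write on sum
        sum += i;
        omp_unset_lock(&lock);
    }
    double t1 = omp_get_wtime();

    omp_destroy_lock(&lock);
    printf("sum = %lld (expected 5000050000), elapsed = %f s\n", sum, t1 - t0);
    return 0;
}
```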
## Shared memory
* SMP: (Shared memory) symmetric multiprocessor
* >=2 CPUs, single coherent memory
* Memory is a performance bottleneck for 1 processor
* Caches reduce bandwidth demands for main memory
* Each processor has a local private cache
* Cache misses have to access shared common memory
## Cache Coherency
* Need to keep cache values coherent
* Idea: Notify other processors when a processor has a cache miss/write
* If a processor writes, invalidate other copies
* Each cache tracks state of each block in cache
* 1) Shared: up to date data
* 2) Modified: Up to date data, changed/dirty, no other cache has a copy, Ok to write, memory out of date (need to write back)
* 3) Exclusive: Up to date data, no other cache has a copy, ok to write, memory up to date (don't need to write back if block replaced)
* 4) Owner: up to date data, other caches can have copy but must be shared
* Owner has exclusive right to make changes to it
* Broadcasts those changes to all other caches (dirty sharing of data)
* Can become modified after invalidating shared copies OR changed to shared after writing modifications back to memory (you go from owner to shared)
* Owned cache lines must respond to a snoop request with data

* False sharing: the block ping pongs between 2 caches even though variables accessed are disjoint
* Coherence/communication miss: Disjoint variables but have to invalidate block as one updates
# Lecture 36: 11/20/20
* Amdahl's Law: speedup w/ E = exec time w/o E / exec time w/ E
* Enhancement E does not affect a portion s of a task and accelerates the remainder (1-s) by a factor of P (P > 1)
* exec time w/ E = exec time w/o E × [s + (1-s)/P]
* Speedup w/ E $= \frac{1}{s + \frac{1-s}{P}} \leq \frac{1}{s}$ (the bound is approached as $P \to \infty$)
* Speedup achievable through parallelism is limited by the serial portion s (worked example after this list)
* Request level parallelism: Many requests/sec reading a database (usually not read-write/sync)
* Data level parallelism (DLP): data in memory or data on many disks
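A quick worked example of the Amdahl's Law bound (numbers chosen for illustration): with a serial fraction $s = 0.1$ and $P = 10$,

$$\text{Speedup} = \frac{1}{0.1 + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.3,$$

and even as $P \to \infty$ the speedup is capped at $1/s = 10$.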
## MapReduce
* Scalable to large data volumes
* Cost-efficiency: commodity machines, commodity network
* Automatic fault-tolerance
* Expensive switches at higher levels

* Map: processes input key/value pair
* Slices data into shards/splits, distributed to workers
* Produces set of intermediate pairs
* Group by key, send to reduce
* Reduce: Combines intermediate values for a particular key, produces an output value
* Need maps to finish before reduce, but can start reading values
## Spark
* Spark is 100x faster than Hadoop in memory
* Advanced DAG execution engine with acyclic data flow and in-memory computing
* Lazy evaluation model
* Evaluation happens only when an action like `.collect()` is called
* `flatMap` flattens the per-element results of the map into a single collection
# Lecture 37: 11/23/20
## Warehouse Scale Computers
* 100k servers in the same datacenter (Need to cool building for heat generated)
* Homogeneous hardware/software
* High availability
* Want ample parallelism (Data level parallelism), prepare for high num of component failures, cost of equipment << cost of ownership
* Rack contains 40-80 servers
* Array/cluster contains 16-32 server racks
* Performance:
* Response time/latency: time between start and completion of task
* Throughput/bandwidth: total amt of work in a given time
* Lower latency to DRAM in another server than local disk
* Higher bandwidth to local disk than to DRAM in another server
## Power Usage Effectiveness
* Workload variation
* Place data within array in right spot for good performance
* Cope with failure gracefully
* Scale up and down in response to varying demand
* Servers still draw about half of peak power when idle
* Most WSC servers are 10-50% utilized
* Ideally linear energy proportionality
* WSC energy efficiency: amount of computational work performed divided by the total energy used in the process
* Power usage effectiveness (PUE): Total building power / IT equipment power (1.0 is perfect)
* Some power goes to uninterruptable power supply (UPS, battery), chiller cooling warm water from AC
* Need careful air flow handling (don't mix server hot air with cold air), easier to control in small freight containers
* Keep cold aisle temperature at 81°F (instead of 65°F)
* Use free cooling (cool warm water outside by evaporation in cooling towers)
# Lecture 38: 11/30/20
* Dependability via redundancy (datacenters, disks, memory bits)
* RAID: Redundant Arrays of Independent Disks
* ECC: Error correcting code (redundant DRAM memory)
* Spatial redundancy: replicated data/hardware to handle hard and transient failures
* Temporal redundancy: redundancy in time to handle transient failures
* Dependability measures:
* Reliability: Mean Time to Failure (MTTF)
* Service interruption: Mean Time To Repair (MTTR)
* Mean time between failures (MTBF)
* MTBF = MTTF + MTTR
* Availability = MTTF / (MTTF + MTTR)
* Increasing MTTF for more reliable hardware/software, fault tolerance
* Reduce MTTR: improve tools/processes for diagnosis and repair
* Number of 9s availability per year:
* 1 nine: 90%, 2 nine: 99%
* Reliability Measures
* Average number of failures per year: Annualized Failure Rate (AFR)
* AFR = (24 × 365) / MTTF, with MTTF in hours
* Failures in Time (FIT) rate: number of failures expected in 1 billion device-hours of operation
* MTBF = 1 billion hours / FIT (worked example after this list)
* Automotive Safety Integrity Level (ASIL) defines FIT rates for different classes of components in vehicles
* Design principles
* No single point of failure
* Dependability limited by part you do not improve
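A quick worked example of the reliability measures above (the MTTF value is assumed, not from lecture): for a device with MTTF = $10^6$ hours,

$$\mathrm{AFR} = \frac{24 \cdot 365}{10^6} \approx 0.9\% \text{ per year}, \qquad \mathrm{FIT} = \frac{10^9}{10^6} = 1000$$

(taking MTBF $\approx$ MTTF since MTTR is comparatively tiny).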
## Error Detection
* Memory systems accidentally flip bits
* Soft errors: When cells are struck by alpha particles or environmental upset
* Hard error: When chips permanently fail
* Error Detection/Correction Codes can protect soft errors
* Hamming distance: difference in num of bits
* Parity bit: Forces stored words to have even parity
* XOR all the bits: 0 means even parity (OK), 1 means odd parity → error detected
* Cannot detect an even number of errors
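A minimal even-parity sketch in C; the word value and flipped bit are arbitrary examples:

```c
#include <stdint.h>
#include <stdio.h>

// Even parity over a 32-bit word = XOR of all its bits.
// A single-bit flip changes the parity; an even number of flips does not
// (which is why parity alone cannot detect an even number of errors).
static uint32_t parity32(uint32_t w) {
    w ^= w >> 16; w ^= w >> 8; w ^= w >> 4; w ^= w >> 2; w ^= w >> 1;
    return w & 1u;
}

int main(void) {
    uint32_t word = 0xDEADBEEF;
    uint32_t p = parity32(word);            // parity bit stored with the word
    uint32_t corrupted = word ^ (1u << 7);  // simulate a single-bit soft error
    printf("%s\n", parity32(corrupted) == p ? "looks OK" : "error detected");
    return 0;
}
```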
## Error Correction
* Detection: bit pattern fails codeword check
* Correction: map to nearest valid code word
* Snap back to closest valid word
* With Hamming distance 2, the neighbors of valid codewords are invalid, so single-bit errors are detected
* Place parity bits at binary positions of powers of 2
* p1 covers positions whose index has bit 0 set (1, 3, 5, 7, ...)
* p2 covers positions whose index has bit 1 set (2, 3, 6, 7, ...)
* p4 covers positions whose index has bit 2 set (4-7, 12-15, ...)
* Equivalently, parity bit p_n starts at position n and alternately checks and skips n positions (n is the stride)
* The position of the flipped bit is the sum of the positions of the incorrect parity bits
* Will need double-error correction, triple-error detection (DECTED)
* Cyclic Redundancy Check (CRC): more advanced codes, used with interleaving

## RAID
* RAID 1: Disk Mirroring/Shadowing: Each disk fully duplicated on mirror
* Very high availability
* Writes limited by single-disk speed, reads can be optimized
* 100% capacity overhead, so very expensive
* RAID 3: Parity Disk
* Stripe physical records across disks, parity disk has parity of stripe
* If a disk fails, reconstruct its data from P and the surviving disks (P minus the sum of the other disks)
* RAID 4: High IO Rate Parity
* All parity is kept on a single separate parity disk
* For concurrent writes, need to update parity
* Option 1): Read other data disks, create new sum and write to parity disk
* Option 2) Compare old data to new data, add the difference to P (see the XOR sketch below)
* Both are bottlenecked by parity writes
* RAID 5: High IO Rate Interleaved Parity
* Parity blocks are interleaved across all disks in a diagonal pattern (removes the single parity-disk bottleneck)
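A sketch of the XOR arithmetic behind the parity options above (RAID 4/5 small-write update and single-disk reconstruction); the function names are made up:

```c
#include <stdint.h>
#include <stddef.h>

// Small-write update: P_new = P_old ^ D_old ^ D_new, so only the data disk
// and the parity disk need to be read and rewritten.
void update_parity(uint8_t *parity, const uint8_t *old_data,
                   const uint8_t *new_data, size_t n) {
    for (size_t i = 0; i < n; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}

// Reconstruction: the lost block is the XOR of the parity block and the
// corresponding blocks on all surviving data disks.
void reconstruct(uint8_t *lost, const uint8_t *parity,
                 const uint8_t *const survivors[], size_t ndisks, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint8_t x = parity[i];
        for (size_t d = 0; d < ndisks; d++)
            x ^= survivors[d][i];
        lost[i] = x;
    }
}
```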