# CS 61C Final Notes

# Lecture 24: 10/20/20
* Memory and caches use powers of 2, so use binary prefixes
  * Add `bi`, so `kibi` is `Ki` for `1024`
  * Bits needed to address `x` items: $\lceil \log_2(x) \rceil$
* DRAM speed is much lower than CPU performance (processor-memory performance gap)
* Use a memory cache (copy of a subset of main memory) for faster access
* Separate caches for instructions and data
* Farther away from the processor means:
  * Longer access time, bigger capacity
* Inclusive: the smaller level is always a copy of a subset of the larger level
* Illusion of large, fast memory
* Temporal locality: if you used it recently, you will likely use it again soon
  * Keep most recently accessed data closer
* Spatial locality: if you used a piece of memory, you will likely use neighboring pieces soon
  * Move blocks of contiguous words closer to the processor
* Hierarchy management:
  * Registers to memory: compiler
  * Cache to main memory: cache controller hardware
  * Main memory to disks: OS

# Lecture 25: 10/22/20
* Direct-mapped cache: each memory address is associated with exactly 1 block within the cache
  * Look in a single location in the cache for the data
* Draw memory the same width as the cache
  * Lowest byte in the top right
  * Lower bits tell you which cache row the address maps to (which bits depends on the block size)
* Tags tell you which address the block came from
  * Discard the lower bits (which only indicate the cache index and byte within the block)
  * Aka cache number
* Divide a memory address into 3 fields (see the C sketch after the Lecture 26 notes):
  * tag: check whether the cache holds the correct block
  * index: select the cache block (row)
  * offset: byte offset within the block
* Look up the cache in order: index, then tag, then offset
* Cache size $= 2^{I} \cdot 2^{O} = 2^{I+O}$ bytes, for $I$ index bits and $O$ offset bits

## Terminology
* Cache hit: cache block is valid and contains the proper address
* Cache miss: nothing in the cache for the appropriate block
* Cache miss, block replacement: wrong data in the cache at the appropriate block; discard it and fetch the right one
* Temperature:
  * Cold: cache empty
  * Warming: cache filling with values
  * Warm: cache doing its job, fair % of hits
  * Hot: cache doing very well, high % of hits
* Hit rate: fraction of accesses that hit in the cache
* Miss rate: 1 - hit rate
* Miss penalty: time to replace a block from memory into the cache
* Hit time: time to access the cache memory
* $ is shorthand for cache
* A valid bit marks entries that do not hold real data

# Lecture 26: 10/26/20
* Lookup order IVTO: Index, Valid, Tag, Offset
* Write-through: update cache and memory at the same time
* Write-back: update the word in the cache block, allowing the memory word to be stale
  * Add a dirty bit to the block (update memory when the block is replaced)
* Block size benefits:
  * Spatial locality
  * Works well for sequential array accesses
* Block size drawbacks:
  * Larger miss penalty (longer load time for new blocks)
  * Too few blocks for the cache size (miss rate goes up)

![](https://i.imgur.com/jgTgnXd.png)

## Cache misses
* Compulsory misses (occur even with infinite size and full associativity)
  * When a program has just started (can't be avoided)
  * Every block of memory will have at least 1 compulsory miss
* Conflict misses (finite size, finite associativity)
  * 2 distinct memory addresses map to the same cache location
  * Solution 1: make the cache bigger
  * Solution 2: let multiple distinct blocks fit in the same cache index
* Capacity misses (finite size, fully associative)
  * Miss because the cache is limited in size
  * Primary type of miss for fully associative caches

## Fully Associative Caches
* Tag and offset same as before, index nonexistent
* Compare tags in parallel
* Benefits: no conflict misses since data can go anywhere
* Drawbacks: hardware is difficult to make
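To tie the Lecture 25-26 material together, here is a minimal C sketch (not from the lecture) of the tag/index/offset breakdown for a direct-mapped cache; the cache size, block size, and example address are made-up parameters.

```c
#include <stdio.h>

/* Hypothetical parameters: a 4 KiB direct-mapped cache with
 * 16-byte blocks and 32-bit byte addresses. */
#define CACHE_SIZE  4096u   /* bytes */
#define BLOCK_SIZE  16u     /* bytes */
#define ADDR_BITS   32u

static unsigned log2u(unsigned x) {        /* x must be a power of 2 */
    unsigned b = 0;
    while (x >>= 1) b++;
    return b;
}

int main(void) {
    unsigned num_blocks  = CACHE_SIZE / BLOCK_SIZE;
    unsigned offset_bits = log2u(BLOCK_SIZE);      /* O */
    unsigned index_bits  = log2u(num_blocks);      /* I */
    unsigned tag_bits    = ADDR_BITS - index_bits - offset_bits;

    unsigned addr   = 0x12345678u;                 /* example address */
    unsigned offset = addr & (BLOCK_SIZE - 1);
    unsigned index  = (addr >> offset_bits) & (num_blocks - 1);
    unsigned tag    = addr >> (offset_bits + index_bits);

    printf("T/I/O bits: %u/%u/%u\n", tag_bits, index_bits, offset_bits);
    printf("addr 0x%08x -> tag 0x%x, index %u, offset %u\n",
           addr, tag, index, offset);
    return 0;
}
```

With these parameters there are 256 blocks, so the split is 20 tag / 8 index / 4 offset bits, matching the $2^{I+O}$ size formula above.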
# Lecture 27: 10/28/20
## Set Associative Cache
* Tag and offset are the same as before; the index now points to the correct row/set
* Size = # sets × N blocks/set × block size
* Each set contains multiple blocks
* Once the correct set is found, compare with all blocks in the set
  * Compare the tag against all tags in the set
* Avoids a lot of conflict misses and only needs N comparators
* For a cache with M blocks:
  * Direct-mapped: 1-way set associative
  * Fully associative: M-way set associative
* A one-hot encoding drives the mux that selects the way that hit

## Block replacement
* Direct-mapped: the index specifies the position to replace
* N-way set associative: the index specifies the set, the block can be anywhere in the set
* Fully associative: any position
* Replacement policies:
  * LRU: least recently used block is removed
    * Pro: exploits temporal locality
    * Con: harder to keep track of LRU (with higher associativity)
  * FIFO: ignore accesses, just track insertion order
  * Random: good when the workload has low temporal locality

## AMAT
* AMAT: Average Memory Access Time
  * AMAT = Hit Time + Miss Rate × Miss Penalty
* Multi-level caches: apply the model recursively
  * AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty)
* A larger cache reduces the miss rate (but keep the first-level cache's hit time under the cycle time)
* Associativity gives more places in the cache to put a given block of memory
* L1: miss rate 1-5%, hit time 1 cycle, size: tens of KB
* L2: miss rate 10-20%, hit time: a few clock cycles, size: hundreds of KB
* The L2 miss rate here is the local miss rate: the fraction of L1 misses that also miss in L2
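A quick worked example of the multi-level formula, using the L1/L2 miss rates above plus a made-up L2 hit time (10 cycles) and main-memory penalty (100 cycles):

$$\text{AMAT} = 1 + 0.05 \times (10 + 0.20 \times 100) = 1 + 0.05 \times 30 = 2.5 \text{ cycles}$$

Because most accesses hit in L1, even a 100-cycle trip to memory only adds 1.5 cycles on average.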
# Lecture 28: 10/30/20
* The OS is the first thing that runs when the computer starts
  * Finds and controls devices (device drivers)
  * Starts services (file system, network stack, TTY/keyboard)
  * Loads, runs, and manages programs
* Provides isolation between running processes and mediates interaction with the outside world
* Memory translation: each process has a mapping from virtual to physical addresses
  * Actual memory is accessed at a physical address
* Protection and privilege
  * At least 2 running modes (user, supervisor)
  * The lesser privilege cannot change the memory mapping (but the supervisor can)
* Traps and interrupts
  * Ways to handle unusual behavior in a program
* On boot: execute instructions from flash ROM
  1. BIOS (Basic Input Output System): find a storage device, load the first sector
  2. Bootloader: load the OS kernel from disk into a location in memory, jump to it
  3. OS boot: initialize services, drivers, etc.
  4. Init: launch an application that waits for input in a loop (terminal, desktop)

## Launching applications
* Running applications are called processes
* Threads share memory; processes have separate memory
* Apps are started by the shell calling an OS routine (syscall)
  * Linux uses `fork` to create a new process and `execve` (execute file command) to load the application
  * Loads the executable file from disk, puts instructions and data into memory
  * Sets `argc` and `argv`, jumps to `main`
  * The shell waits for `main` to return (join)
* The OS enforces constraints on each application's memory
* Supervisor mode: CPU mode that protects the OS from applications
  * A process has access to only a subset of instructions when not in supervisor mode
  * The CPU runs in supervisor mode very rarely (an error here means a BSOD)

## Syscalls
* A syscall calls an OS routine
  * Set up function arguments in registers, then raise a software interrupt
* When to transition into supervisor mode:
  * Interrupt: caused by something external to the running program
    * Asynchronous to the current program (e.g., key press, disk I/O)
  * Exception: caused by an event during execution of an instruction in the running program
    * Synchronous; raised on the instruction that causes it (e.g., memory error, bus error, illegal instruction)
  * Trap: the action of servicing an interrupt or exception by a hardware jump to the interrupt/trap handler code
* `ecall`: trigger an exception to a higher privilege level
* `ebreak`: trigger an exception within the current privilege level
* None of the instructions after the trapping instruction have executed; all prior instructions have
* The handler can return from an interrupt by restoring the user registers
* More complex to handle a trap caused by an exception than by an interrupt
  * Tricky for superscalar pipelines
* Exceptions in the 5-stage pipeline:
  * PC address exceptions
  * Illegal opcodes
  * Data address exceptions
* Exceptions are handled like pipeline hazards:
  * Complete execution of the instructions before the exception
  * Flush the current pipeline instructions (turn them into nops)
  * Store the exception cause in a status register
  * Transfer execution to the trap handler
  * Optionally return to the original program and re-execute the instruction
* Multiprogramming: switch between processes very quickly (context switch) to run multiple applications at the same time
  * When jumping into a process, set a timer (interrupt)
  * When it expires, store the PC and registers (process state)
  * Pick a different process to run and load its state
  * Set the timer, change to user mode, jump to the new PC
* Use virtual memory to provide the illusion of a full memory address space to each application

# Lecture 29: 11/2/20
* Virtual memory is the next level in the memory hierarchy (and provides protection)
* The memory manager maps virtual to physical addresses
* The physical address space is hidden from user applications
* Multiple processes can occupy memory simultaneously, with protection
  * Illusion of private memory
* Can swap memory to disk to give the illusion of a larger memory
* DRAM is ~10x denser than SRAM
  * Both are volatile (lose their contents when powered off)
* Non-volatile storage: disks
  * SSD: faster, ~5x more expensive, built from transistors (nothing mechanical), can't erase single bits, only entire blocks
  * HDD: slower, cheaper

## Paged Memory
* Physical memory is broken into pages
* Typical page size: 4 KiB (need 12 bits to address within a 4 KiB page)
* 20-bit page number, 12-bit offset (for 32-bit addresses)
* Each process has a dedicated page table
* A process's physical pages need not be consecutive
* Physical addresses can have more or fewer bits than virtual addresses
* Assigning different DRAM pages to different processes isolates their memory
* Sharing is possible: assign the same physical page to several processes
* A write-protection bit per entry (exception if a process tries to write while it is 1)
* A lw/sw first needs the page table entry from memory, then the physical address
  * Page tables are too big to store entirely in the cache
  * Frequently used parts of the page table can be cached
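A minimal sketch (my own illustration, with a deliberately simplified PTE layout) of the single-level translation just described, assuming 32-bit virtual addresses and 4 KiB pages:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   12u                    /* 4 KiB pages */
#define NUM_VPAGES  (1u << 20)             /* 20-bit virtual page number */

/* Simplified page table entry: real formats pack these bits differently. */
typedef struct {
    bool     valid;          /* page is resident in DRAM */
    bool     write_protect;  /* a store to this page raises an exception */
    uint32_t ppn;            /* physical page number */
} pte_t;

/* One entry per virtual page: 2^20 entries per process even if the
 * process uses only a tiny part of its address space -- the motivation
 * for the hierarchical page tables in Lecture 30. */
static pte_t page_table[NUM_VPAGES];

/* Translate a virtual address; returns false to model a page fault,
 * which the OS would handle by bringing the page into DRAM. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (!page_table[vpn].valid)
        return false;                      /* page fault */
    *paddr = (page_table[vpn].ppn << PAGE_BITS) | offset;
    return true;
}
```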
## Page Faults
* Caches deal with individual blocks (around 64 B)
* VM deals with individual pages (4 KiB)
  * E.g., a page might hold 32 blocks of 32 words each
* If the page table entry is valid and the page is in DRAM, read/write the data
* If the page is valid but not in DRAM, allocate a new page in DRAM, evicting a page if needed
  * Store the evicted page to disk, read the requested page from disk into memory
* If the entry is not valid:
  * Allocate a new page in DRAM
  * If out of memory, evict a page
* A page fault occurs if the page is only on disk or the entry is not valid
  * Treated as an exception

# Lecture 30: 11/4/20
## Hierarchical Page Tables
* Page tables can take up large amounts of space
* Could increase the page size, but that wastes more memory
* Instead, use hierarchical page tables, where each level covers a progressively smaller region of the address space
* Most programs use only a fraction of their address space
* Virtual address with a 10-bit L1 index, 10-bit L2 index, and 12-bit offset
* A register points to the level 1 page table, which points to level 2 page tables, which point to physical memory
* Most L2 pointers will be empty

![](https://i.imgur.com/Eyaq05k.png)
![](https://i.imgur.com/RYgMZ7B.png)

## Translation Lookaside Buffers
* A reference requires 3 memory accesses with a 2-level page table
* Solution: cache some translations in a TLB
* 32-128 entries, usually fully associative
  * Each entry maps a large page, so there is less spatial locality
  * Larger TLBs are 4-8 way set-associative
  * Random or FIFO replacement policy
* TLB reach: the amount of virtual address space that can be mapped by the TLB at once
* TLB hit: single-cycle translation
* TLB miss: page table walk to refill

![](https://i.imgur.com/H2vSZjP.png)
![](https://i.imgur.com/9VXs2pL.png)

* Handling a page fault needs a precise trap so the handler can resume the program after retrieving the page
* A protection violation should abort the process

![](https://i.imgur.com/QrBLimO.png)

* Demand paging: the ability to run programs larger than primary memory
* Context switch: change of the internal state of the processor
  * Save the register values and PC, change the value in the Supervisor Page Table Base Register (SPTBR)
  * All TLB entries are set to invalid on a context switch
* VM is the level below main memory; the TLB sits before the cache
* Same CPI and AMAT equations, but main memory is treated as a mid-level cache

# Lecture 31: 11/6/20
* Need an I/O interface for keyboards, network, mouse, display
  * Input: read a sequence of bytes
  * Output: write a sequence of bytes
* Interface: memory-mapped I/O
  * A portion of the address space is reserved for I/O
  * I/O device registers live there
  * Accessed with normal load/store instructions
* Processor and I/O speeds can be badly mismatched

## Polling
* Device registers (see the sketch after the DMA notes below):
  * Control register: says whether the device is ready for I/O
  * Data register: contains the data
* The processor reads the control register in a loop
* The processor loads from input / writes to output; the I/O device resets the control register bit to 0
* Polling a mouse is fine, but polling a hard disk would be very costly for the CPU

## Interrupts
* Polling wastes processor resources; use an interrupt when I/O is ready or needs attention
* Lots of interrupts can be expensive: thrashing the cache and VM, saving/restoring state
* Low data rate (mouse, keyboard): use interrupts
* High data rate (network, disk): start with interrupts, switch to DMA
* Programmed I/O:
  * The CPU executes lw/sw instructions to move data from the device into main memory, then uses the data to compute

## DMA
* Direct Memory Access (DMA) engine
  * Contains registers written by the CPU:
    * Memory address to place the data, # of bytes, I/O device, transfer direction, unit of transfer, amount to transfer per burst
* Incoming data:
  * Receive an interrupt from the device
  * The CPU takes the interrupt and initiates the transfer (instructs the DMA engine to place the data at chosen addresses)
  * The DMA engine handles the transfer while the CPU does other things
  * Upon completion, the DMA engine interrupts the CPU
* Outgoing data:
  * The CPU confirms the external device is ready
  * The CPU begins the transfer
  * The DMA engine handles the transfer while the CPU does other things
  * The DMA engine interrupts the CPU to signal completion
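A minimal sketch of the memory-mapped-I/O polling loop described above. The register addresses, the "ready" bit, and the convention that reading the data register clears it are all hypothetical, not a real device's register map.

```c
#include <stdint.h>

/* Hypothetical MMIO addresses for a toy receiver device. */
#define RECV_CTRL  ((volatile uint32_t *)0x10000000)  /* bit 0 = data ready */
#define RECV_DATA  ((volatile uint32_t *)0x10000004)

/* Spin until the device sets the ready bit, then read one word.
 * (In this toy device, reading the data register clears the ready bit.) */
uint32_t poll_read(void) {
    while ((*RECV_CTRL & 0x1) == 0)
        ;                    /* busy-wait: this is the CPU cost of polling */
    return *RECV_DATA;
}
```

The busy-wait loop is exactly the cost that interrupts and DMA are introduced to avoid.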
## Networking
* Software send:
  * The app copies data into an OS buffer
  * The OS calculates a checksum and starts a timer
  * The OS hands the data to the network interface hardware and tells it to start
* Software receive:
  * The OS copies the data from the network interface into an OS buffer
  * The OS calculates the checksum; if OK, send an ACK; if not, delete the packet
  * If OK, the OS copies the data to the user address space and signals the app to continue

# Lecture 32: 11/9/20
* dgemm: double-precision floating-point matrix multiply
* C is around 240x faster than Python for matmul
* Hardware can be serial or parallel; software can be sequential or concurrent
* Flynn's taxonomy classifies parallel hardware
  * SIMD: Single Instruction, Multiple Data
    * Specialized hardware for lock-step calculations over arrays
  * SPMD: Single Program, Multiple Data
    * Most common parallel processing programming style

![](https://i.imgur.com/0ws0gRi.png)

* Data-Level Parallelism (DLP): one operation applied to multiple data streams
  * E.g., multiply a coefficient vector by a data vector
  * One instruction is fetched and decoded for the entire operation
* MMX: MultiMedia eXtension
* SSE: Streaming SIMD Extension
  * 8 128-bit data registers (XMM0 to XMM7)

![](https://i.imgur.com/TnyJk3u.png)

* Intrinsics: C functions and procedures that translate directly into assembly language (including SSE instructions)
  * 1:1 correspondence between SSE instructions and intrinsics

# Lecture 33: 11/13/20
* Multiprocessor execution model:
  * Each core executes its own instructions
  * Separate resources per core (datapath, higher-level caches such as L1/L2)
    * Cache coherency is important
  * Shared resources (DRAM, 3rd-level cache)
  * Drawback: the slow shared memory becomes a bottleneck
* 2 ways to use a multiprocessor:
  * Job-level parallelism: processors work on unrelated problems, no communication
  * Partition the work of a single task between several cores

## Threads
* Thread: thread of execution, a single stream of instructions
* A program/process can split (fork) itself into multiple threads that execute simultaneously
* A single CPU can execute many threads by time sharing
* Each thread has a dedicated PC and separate registers, and accesses the shared memory
* Each physical core executes one hardware thread
* The OS multiplexes multiple software threads onto the available hardware threads
  * All threads except the ones mapped to hardware threads are waiting
* OS threads give the illusion of many simultaneously active threads
  * Multiplex software threads onto hardware threads
  * Switch out blocked threads (e.g., on a cache miss)
  * Timer: switch the active thread every time interval
* Remove a software thread from a hardware thread by:
  * Interrupting its execution
  * Saving its registers and PC to memory
* Start executing a different software thread by:
  * Loading its previously saved registers
  * Jumping to its saved PC
* daemon = always-running background thread/process

## Multithreading
* Need to switch thread state very quickly
* Hardware-assisted software multithreading: 1 core, 2 threads
  * 2 copies of the PC and registers, so it looks like 2 different hardware threads
  * Hyper-threading: both threads can be active simultaneously
* Physical CPU: 1 thread at a time per CPU (switches threads on events like I/O)
* Logical CPUs (multithreading): ~1% more hardware, ~10% better performance (separate registers; shared datapath, ALU, caches)
* Multicore: duplicated processors, ~50% more hardware, ~2x better performance
* Hyper-Threading/Simultaneous Multithreading (SMT): exploits the superscalar architecture to launch instructions from different threads at the same time
* Modern machines do both

# Lecture 34: 11/16/20
* Use intrinsics since C has no native SIMD support
* Parallel loops (see the OpenMP sketch at the end of this lecture's notes):
  * `#include <omp.h>`
  * `#pragma omp parallel for` before a for loop to parallelize it
  * Don't know which thread will finish first
  * `#pragma` is ignored by compilers unaware of OpenMP
  * Cannot break out of the loop prematurely
* OpenMP programming model (fork-join model):
  * Requires a shared memory environment
  * Begin as a single process (main thread)
  * The master thread forks into worker threads, which join together at the end
* Use an array to keep track of each thread's results
* Race condition: not deterministic; multiple reads and writes to the same location at the same time
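A minimal OpenMP sketch of the ideas above; the array contents and size are made up, and the `reduction` clause plays the role of the per-thread result array, so the race condition on `sum` never arises. Compile with `gcc -fopenmp`.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    double start = omp_get_wtime();

    /* Each thread accumulates a private partial sum; OpenMP combines
     * them at the join. Without reduction, every thread would update
     * `sum` concurrently -- the race condition described above. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    double elapsed = omp_get_wtime() - start;
    printf("sum = %f (%.6f s, up to %d threads)\n",
           sum, elapsed, omp_get_max_threads());
    return 0;
}
```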
# Lecture 35: 11/18/20
## Synchronization
* Limit access to a shared resource to 1 actor at a time to prevent race conditions
* Use locks to control access to shared resources
* Implementing the lock as an ordinary variable (1 = locked) does not work: the test and the set are separate instructions, so two threads can both see the lock as free
* Critical section: the region of code that needs to be serialized
* Solution: atomic read/write in a single instruction (hardware synchronization)
  * Must use shared memory
* Atomic Memory Operations (AMOs): atomic swap of a register and a memory location
* OpenMP locks: declare `omp_lock_t lock`
  * `omp_init_lock(&lock)`, `omp_set_lock(&lock)`, `omp_unset_lock(&lock)`, `omp_destroy_lock(&lock)`
* `#pragma omp critical` inside a parallel region ensures only 1 thread executes that section at a time
* Deadlock: a system state where no progress is possible
* Elapsed wall-clock time: `double omp_get_wtime(void);`
  * Time is measured per thread

## Shared memory
* SMP: (shared memory) symmetric multiprocessor
  * >= 2 CPUs, single coherent memory
* Memory is a performance bottleneck even for 1 processor
  * Caches reduce bandwidth demands on main memory
* Each processor has its own local private cache
* Cache misses have to access the shared common memory

## Cache Coherency
* Need to keep the cached values coherent
* Idea: notify the other processors when a processor has a cache miss or performs a write
  * If a processor writes, invalidate the other copies
* Each cache tracks the state of each block it holds:
  1. Shared: up-to-date data; other caches may have copies
  2. Modified: up-to-date data, changed/dirty, no other cache has a copy, OK to write; memory is out of date (need to write back)
  3. Exclusive: up-to-date data, no other cache has a copy, OK to write; memory is up to date (no write-back needed if the block is replaced)
  4. Owner: up-to-date data; other caches can have copies, but they must be in the Shared state
     * The owner has the exclusive right to make changes to the block
     * It broadcasts those changes to all other caches (dirty sharing of data)
     * It can become Modified after invalidating the shared copies, OR change to Shared after writing its modifications back to memory (going from Owner to Shared)
     * Owned cache lines must respond to a snoop request with their data

![](https://i.imgur.com/1AF4xLz.png)

* False sharing: a block ping-pongs between 2 caches even though the variables being accessed are disjoint
* Coherence/communication miss: the variables are disjoint, but the block has to be invalidated whenever the other processor updates it

# Lecture 36: 11/20/20
* Amdahl's Law: speedup with enhancement E = (exec time without E) / (exec time with E)
* Suppose enhancement E does not affect a portion $s$ of a task and accelerates the remaining portion $(1-s)$ by a factor of $P$ ($P > 1$):
  * exec time with E = exec time without E $\times \left[s + \frac{1-s}{P}\right]$
  * Speedup with E $= \frac{1}{s + \frac{1-s}{P}} \leq \frac{1}{s}$ (the bound as $P \to \infty$)
* The speedup achievable through parallelism is limited by the serial portion $s$
* Request-level parallelism: many requests per second reading a database (usually little read-write synchronization needed)
* Data-level parallelism (DLP): data in memory, or data spread across many disks
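A quick numeric check of the bound, with made-up numbers: suppose 10% of the task is serial ($s = 0.1$) and the rest is spread over $P$ processors.

$$P = 10:\quad \frac{1}{0.1 + 0.9/10} \approx 5.3\times \qquad\qquad P \to \infty:\quad \frac{1}{0.1} = 10\times$$

Even with unlimited processors the speedup never exceeds $1/s$.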
## MapReduce
* Scalable to large data volumes
* Cost-efficient: commodity machines, commodity network
* Automatic fault tolerance
* Switches at the higher levels of the network are expensive

![](https://i.imgur.com/BJesX0y.png)

* Map: processes an input key/value pair
  * Input data is sliced into shards/splits and distributed to workers
  * Produces a set of intermediate key/value pairs
  * Intermediate pairs are grouped by key and sent to reduce
* Reduce: combines the intermediate values for a particular key and produces an output value
* Maps need to finish before reduce can run, but reducers can start reading intermediate values earlier

## Spark
* Spark is up to 100x faster than Hadoop for in-memory workloads
* Advanced DAG execution engine with support for cyclic data flow and in-memory computing
* Lazy evaluation model
  * Nothing is evaluated until an action such as `.collect()` is called
* `flatMap` flattens the per-element results of the map into one collection

# Lecture 37: 11/23/20
## Warehouse Scale Computers
* ~100k servers in the same datacenter (the building must be cooled to remove the heat generated)
* Homogeneous hardware/software
* High availability
* Want ample parallelism (data-level parallelism), prepare for a high number of component failures, cost of equipment << cost of ownership
* A rack contains 40-80 servers
* An array/cluster contains 16-32 server racks
* Performance:
  * Response time/latency: time between the start and completion of a task
  * Throughput/bandwidth: total amount of work done in a given time
* Lower latency to DRAM in another server than to local disk
* Higher bandwidth to local disk than to DRAM in another server

## Power Usage Effectiveness
* Workload variation:
  * Place data in the right spot within the array for good performance
  * Cope with failures gracefully
  * Scale up and down in response to varying demand
* A server uses about half its peak power when idle
* Most WSC servers are only 10-50% utilized
* Ideally, energy use would be linearly proportional to utilization (energy proportionality)
* WSC energy efficiency: amount of computational work performed divided by the total energy used in the process
* Power Usage Effectiveness (PUE): total building power / IT equipment power (1.0 is perfect)
  * Some power goes to the uninterruptible power supply (UPS, battery) and to the chiller cooling the warm water from the AC
* Need careful airflow handling (don't mix server hot air with cold air); easier to control in small freight containers
* Keep the cold aisle temperature at 81°F (instead of 65°F)
* Use free cooling (cool the warm water outside by evaporation in cooling towers)

# Lecture 38: 11/30/20
* Dependability via redundancy (datacenters, disks, memory bits)
* RAID: Redundant Arrays of Independent Disks
* ECC: Error-Correcting Code (redundant DRAM memory)
* Spatial redundancy: replicated data/hardware to handle hard and transient failures
* Temporal redundancy: redundancy in time to handle transient failures
* Dependability measures:
  * Reliability: Mean Time To Failure (MTTF)
  * Service interruption: Mean Time To Repair (MTTR)
  * Mean Time Between Failures (MTBF)
    * MTBF = MTTF + MTTR
  * Availability = MTTF / (MTTF + MTTR)
  * Increase MTTF with more reliable hardware/software and fault tolerance
  * Reduce MTTR: improve tools/processes for diagnosis and repair
* Number of 9s of availability per year:
  * 1 nine: 90%, 2 nines: 99%
* Reliability measures:
  * Annualized Failure Rate (AFR): average number of failures per year
    * AFR = (24 × 365) / MTTF, with MTTF in hours
  * Failures In Time (FIT) rate: number of failures expected in 1 billion device-hours of operation
    * MTBF = 10^9 / FIT
  * Automotive Safety Integrity Level (ASIL) defines FIT rates for different classes of components in vehicles
* Design principles:
  * No single point of failure
  * Dependability is limited by the part you do not improve

## Error Detection
* Memory systems accidentally flip bits
  * Soft errors: cells are struck by alpha particles or suffer an environmental upset
  * Hard errors: chips permanently fail
* Error Detection/Correction Codes can protect against soft errors
* Hamming distance: the number of bit positions in which two words differ
* Parity bit: forces stored words to have even parity
  * XOR all the bits together: if the result is 0 the parity is even (OK); if 1, the parity is odd, which signals an error
  * Cannot detect an even number of bit errors
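A minimal sketch of the XOR parity check just described (the word size and helper name are my own):

```c
#include <stdint.h>

/* Even parity over a 32-bit word: XOR-fold all the bits down to one.
 * Result 0 means an even number of 1s; checking a stored word together
 * with its parity bit, a result of 1 signals an error. */
uint32_t parity32(uint32_t w) {
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}
```

Flipping any single bit flips the result; flipping two bits cancels out, which is exactly the "cannot detect an even number of errors" limitation above.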
## Error Correction
* Detection: the bit pattern fails the codeword check
* Correction: map to the nearest valid codeword
  * "Snap back" to the closest valid word
* With minimum Hamming distance 2, the single-bit neighbors of every valid codeword are invalid (errors are detected, not corrected)
* Hamming code: place parity bits at the bit positions that are powers of 2
  * p1 covers the positions whose lowest position bit is 1 (1, 3, 5, 7, ...)
  * p2 covers the positions whose second position bit is 1 (2, 3, 6, 7, ...)
  * p4 covers the positions whose third position bit is 1 (4-7, 12-15, ...)
  * Equivalently, a parity bit at position $2^k$ starts at its own position and alternately covers and skips runs of $2^k$ positions (the stride)
* The position of a single flipped bit is the sum of the positions of the parity bits that fail their check
* Stronger protection needs double-error correction, triple-error detection (DECTED)
* Cyclic Redundancy Check (CRC): more advanced codes, often combined with interleaving

![](https://i.imgur.com/8qsqBG0.png)

## RAID
* RAID 1: disk mirroring/shadowing: each disk is fully duplicated onto its mirror
  * Very high availability
  * Writes are limited to single-disk speed; reads can be optimized
  * 100% capacity overhead, so very expensive
* RAID 3: parity disk
  * Stripe physical records across the disks; the parity disk holds the parity of each stripe
  * If a disk fails, combine P with the surviving disks to reconstruct the lost data
* RAID 4: high I/O-rate parity
  * All parity is kept on one dedicated disk
  * For concurrent small writes, the parity must be updated:
    * Option 1: read the other data disks, compute the new parity, and write it to the parity disk
    * Option 2: compare the old data to the new data and apply the difference to P
  * Both options are bottlenecked by writes to the single parity disk
* RAID 5: high I/O-rate interleaved parity
  * Parity blocks are interleaved across all disks in a diagonal pattern, removing the single-parity-disk bottleneck
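A minimal sketch of the RAID 4/5 small-write update ("Option 2" above); the in-memory block buffers and helper name are my own illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* Small-write parity update: new parity = old parity XOR old data XOR new data.
 * Only the target data disk and the parity disk need to be read and written. */
void update_parity(uint8_t *parity, const uint8_t *old_data,
                   const uint8_t *new_data, size_t block_size) {
    for (size_t i = 0; i < block_size; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```

Because XOR is its own inverse, XORing in both the old and the new data replaces the old data's contribution to the parity, which is what "add the difference to P" means above.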