CS 61C Final Notes

Lecture 24: 10/20/20

  • Memory and caches use powers of 2 for binary prefixes
    • add "bi", so kibi (Ki) is 2^10 = 1024
    • Number of bits needed to address x items: ceil(log2(x))
  • DRAM speed is much lower than CPU Performance (performance gap)
  • Use a memory cache (copy of a subset of main memory) for faster access
    • Separate caches for instructions and data
  • Farther away from processor means:
    • Longer access time, larger capacity
    • Inclusive: the smaller level is always a subset (copy) of the larger level
    • Illusion of large fast memory
  • Temporal locality: Use it recently, want to use it again soon
    • Keep most recently accessed data closer
  • Spatial locality: Use a piece of memory, want to use neighboring pieces soon
    • Move blocks of contiguous words closer to processor
  • Hierarchy management:
    • Registers to memory: Compiler
    • Cache to main memory: cache controller hardware
    • Main memory to disks: OS

Lecture 25: 10/22/20

  • Direct-mapped cache: each mem addr associated with 1 block within cache
  • Look in a single location in cache for data
  • Draw memory same width as cache
    • Lowest byte in top right
  • Lower bits tell the cache mapped row (which bits depend on the block size)
  • Tags tell you which address it came from
    • Discard the lower bits (which indicate the cache index and byte offset within the block)
    • Aka cache number
  • Divide memory address into 3 fields:
    • tag: check if have correct block
    • index: select block
    • offset: byte offset within block
  • Look up the cache by index first, then check the valid bit and tag, then use the offset
  • Cache size = 2^I × 2^O = 2^(I+O) bytes, where I = # index bits and O = # offset bits (see the field-extraction sketch below)
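
As a rough illustration of the tag/index/offset split, here is a minimal C sketch that carves a 32-bit address into the three fields; the cache parameters (16 B blocks, 64 sets) are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical direct-mapped cache: 16 B blocks, 64 sets
   => O = log2(16) = 4 offset bits, I = log2(64) = 6 index bits. */
#define OFFSET_BITS 4
#define INDEX_BITS  6

int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                 /* byte within block   */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* which cache row     */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* identifies the block */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```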

Terminology

  • Cache hit: Cache block is valid and contains proper address
  • Cache miss: Nothing in cache for appropriate block
  • Cache miss, block replacement: wrong data in cache at appropriate block, discard it and fetch the right one
  • Temperature:
    • Cold: cache empty
    • Warming: cache filling with values
    • Warm: cache doing its job, fair % of hits
    • Hot: Cache doing very well, high % of hits
  • Hit rate: fraction of accesses that hit in the cache
    • Miss rate: 1 - hit rate
    • Miss penalty: time to replace a block from memory to cache
    • Hit time: Time to access cache memory
  • $ is cache
  • Have a valid bit to represent invalid values

Lecture 26: 10/26/20

  • Order IVTO: Index, Valid, Tag, Offset
  • Write-through: Update cache and memory at same time
  • Write-back: Update word in cache block, allowing memory word to be stale
    • Add dirty bit to block (update memory when block is replaced)
  • Block size benefits:
    • Spatial locality
    • Works well for sequential array access
  • Block size drawbacks:
    • Larger miss penalty (longer load time for new blocks)
    • If blocks are too big relative to cache size, too few blocks fit, so miss rate goes up

Cache misses

  • Compulsory misses (infinite size, fully associative)
    • When a program is first started (can't be avoided)
    • Every block of memory will have at least 1 compulsory miss
  • Conflict misses (finite size, finite associativity)
    • 2 distinct memory addresses map to same cache location
    • Sol1: Make cache size bigger
    • Sol2: Multiple distinct blocks can fit in the same cache index
  • Capacity misses (finite size, fully associative)
    • miss because cache is limited in size
    • Primary type of miss for fully associative caches

Fully Associative Caches

  • Tag and offset same as before, index nonexistent
  • Compare tags in parallel
  • Benefits: no conflict misses since data can go anywhere
  • Drawbacks: Hardware difficult to make

Lecture 27: 10/28/20

Set associative Cache

  • Tag and offset are same as before, index points us to correct row/set
  • Cache size = # sets × N blocks/set × block size
  • Each set contains multiple blocks
  • Once found correct set, compare with all blocks in set
    • Compare tag with all tags in set
  • Avoids a lot of conflict misses and only need N comparators
  • For a cache with M blocks:
    • Direct-mapped: 1 way set assoc
    • Fully assoc: M-way set assoc
  • One-hot encoding of the hit signals drives a mux to select the way that hit

Block replacement

  • Direct mapped: Index specifies the position to replace
  • N-way Set Assoc: Index specifies set, block can be anywhere in set
  • Fully associative: Any position
  • Replacement policy
    • LRU: Least recently used block is removed (see the sketch after this list)
      • Pro: exploits temporal locality
      • Con: harder to keep track of LRU as associativity grows
    • FIFO: Ignore accesses, just track initial order
    • Random: Good for low temporal locality of workload
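
A rough sketch of LRU block replacement within one set of an N-way set-associative cache, using a last-used timestamp per block; the 4-way size and struct layout are made up for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define WAYS 4   /* hypothetical 4-way set-associative cache */

struct block { bool valid; uint32_t tag; uint64_t last_used; };

/* Pick the victim within a set: an invalid block if one exists,
   otherwise the least recently used block. */
int choose_victim(struct block set[WAYS]) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid) return i;                    /* free slot, use it           */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                                 /* older access => better victim */
    }
    return victim;
}
```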

AMAT

  • AMAT: Average Memory Access Time
    • = Hit Time + Miss Penalty * Miss Rate
  • Multi level caches: recursive model of caches
    • AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 miss rate × L2 miss penalty) (worked example below)
  • Larger cache reduces miss rate (keep the hit time of the first-level cache below the cycle time)
  • More places in cache to put blocks of memory: associativity
  • L1: miss rate 1-5%, hit time 1 cycle, size: tens of KB
  • L2: miss rate: 10-20%, hit time: few clock cycles, size: hundreds of KB
  • L2 miss rate is a fraction of L1 misses that also miss in L2
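
A small worked computation of the two-level AMAT formula above; the hit times and miss rates are hypothetical numbers in the ranges mentioned.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical timings, in cycles */
    double l1_hit = 1.0,  l1_miss_rate = 0.05;
    double l2_hit = 10.0, l2_miss_rate = 0.15, l2_miss_penalty = 100.0;

    /* AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * L2 miss penalty) */
    double amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * l2_miss_penalty);
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05*(10 + 0.15*100) = 2.25 */
    return 0;
}
```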

Lecture 28: 10/30/20

  • OS is first thing that runs when computer starts
  • Find and control devices (device drivers)
  • Start services (file system, network stack, TTY/keyboard)
  • Loads, runs, manages programs
  • Isolation between running processes and interaction with outside world
  • Memory translation: Each process has a mapping from virtual to physical address
    • Actual memory accessed at a physical address
  • Protection and privilege
    • At least 2 running modes (User, supervisor)
    • Lesser privilege cannot change memory mapping (but supervisor can)
  • Traps and interrupts
    • Ways to handle unusual behavior in program
  • On boot: Execute instructions from Flash ROM
      1. BIOS (Basic Input Output System): find a storage device, load first sector
      2. Bootloader: Load OS kernel from disk into a location in memory, jump to it
      3. OS Boot: initialize services, drivers, etc.
      4. Init: Launch application that waits for input in a loop (Terminal, Desktop)

Launching applications

  • Applications called processes
    • Threads share memory; processes have separate memory
  • Apps started using shell calling OS routine (syscall)
    • Linux uses fork to create a new process and execve (execute file) to load the application (see the sketch after this list)
  • Loads executable file from disk, puts instructions and data into memory
  • Set argc and argv, jump to main
  • Shell waits for main to return (join)
  • OS enforces constraints for applications memory
  • Supervisor mode: CPU's mode to protect OS from application
    • Process has access to subset of instructions when not in supervisor mode
    • Very rarely runs in supervisor mode (BSOD if error here)
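
A minimal sketch of how a shell might launch a program on Linux with fork/execve and then wait for it; the program path and arguments here are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();                  /* create a new process (copy of the shell) */
    if (pid == 0) {
        /* Child: replace this process image with the application. */
        char *argv[] = { "/bin/ls", "-l", NULL };   /* hypothetical program + args */
        char *envp[] = { NULL };
        execve("/bin/ls", argv, envp);
        perror("execve");                /* only reached if execve fails */
        exit(1);
    }
    int status;
    waitpid(pid, &status, 0);            /* shell waits for the program to finish ("join") */
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}
```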

Syscalls

  • Call an OS routine
  • Need to perform a syscall (set up function arguments in registers, raise a software interrupt)
  • When to transition into supervisor mode:
    • Interrupts: Something external to the running program
      • Asynchronous to the current program (ex. key press, disk I/O)
    • Exception: Caused by an event during execution of an instruction by the running program
      • Synchronous, on the instruction that causes the exception (ex. memory error, bus error, illegal instruction)
    • Trap: action of servicing interrupt or exception by hardware jump to interrupt/trap handler code
  • Ecall: Trigger exception to higher privilege
  • Ebreak: Trigger exception within current privilege
  • None of the instructions after the trap have been executed; all prior instructions have completed
    • Handler can return from an interrupt by restoring user registers
    • More complex to handle a trap from an exception than from an interrupt
    • Tricky for superscalar pipeline
  • Exceptions in 5-stage pipeline
    • PC address exception
    • Illegal opcode
    • Data address exceptions
  • Exceptions handled like pipeline hazards
    • Complete execution of instructions before the exception
    • Flush current pipeline instructions (turn into nops)
    • Store exception cause in status register
    • Transfer execution to trap handler
    • Optionally return to original program, re-execute instruction
  • Multiprogramming: Switch between processes very quickly (context switch) to run multiple applications at same time
  • When jumping into process, set a timer (interrupt)
    • When expires, store PC and registers (process state)
    • Pick a different process to run and load state
    • Set timer, change to user mode, jump to new PC
  • Use virtual memory to provide illusion of full memory address space for applications

Lecture 29: 11/2/20

  • Virtual memory is the next level in the memory hierarchy (also used for protection)
  • Memory manager maps virtual to physical addresses
    • Physical address space hidden from user apps
    • Multiple processes can simultaneously occupy memory with protection
    • Illusion of private memory
    • Can swap memory to disk to give illusion of larger memory
  • DRAM is 10x denser than SRAM
    • Both are volatile (lose their contents when powered off)
  • Non-volatile storage disks
    • SSD: faster, 5x more expensive, made with transistors (nothing mechanical), can't erase single bits, only entire blocks
    • HDD: slower, cheaper

Paged Memory

  • Physical memory broken into pages
  • Typical page size: 4KiB (need 12 bits to address 4KiB)
    • 20 bit page number, 12 bit offset
  • Each process has a dedicated page table
  • Physical memory is not consecutive
  • Physical addresses can have more or fewer bits than virtual addresses
  • Assigning diff pages in DRAM isolates process memory
    • Sharing is possible, assign same physical page to several processes
    • Use a bit to indicate write protection (exception if a write is attempted while the bit is 1)
  • lw/sw needs to get page from memory, then get the physical address
    • Page tables are too big to store entirely in the cache
    • Frequent parts of page tables can be cached

Page Faults

  • Caches deal with individual blocks (around 64 B)
  • VM deals with individual pages (4KiB)
  • Ex. Each page has 32 blocks, each block has 32 words
  • If the page table entry is valid and the page is in DRAM, read/write the data
    • If the page is on disk: allocate a page in DRAM (evicting one if necessary), store the evicted page to disk, read the page from disk into memory
  • If the entry is not valid:
    • Allocate a new page in DRAM
    • If out of memory, evict a page
  • Page fault if not on disk or not valid
    • Treated as exception

Lecture 30: 11/4/20

Hierarchical Page Tables

  • Page tables can take up large amounts of space
  • Can increase page size, but wastes more memory
  • Hierarchical page tables with decreasing page size
    • Most programs use a fraction of memory
  • Virtual address with 10-bit L1 index, 10-bit L2 index, 12-bit offset
    • A register points to the Level 1 page table, which points to a Level 2 page table, which points to physical memory (sketched below)
    • Most L2 pointers will be empty
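
A simplified sketch of the two-level walk above (10-bit L1 index, 10-bit L2 index, 12-bit offset); the in-memory table arrays and the "return 0 on fault" convention are hypothetical simplifications of real page-table layouts.

```c
#include <stdint.h>

/* Hypothetical in-memory page tables: each entry holds the number of the
   next-level table (L1) or the physical page number (L2); 0 = not present. */
static uint32_t l1_table[1024];
static uint32_t l2_tables[1024][1024];

/* Translate a 32-bit virtual address; returns 0 to signal a page fault. */
uint32_t translate(uint32_t vaddr) {
    uint32_t l1_idx = (vaddr >> 22) & 0x3FF;   /* top 10 bits         */
    uint32_t l2_idx = (vaddr >> 12) & 0x3FF;   /* next 10 bits        */
    uint32_t offset =  vaddr        & 0xFFF;   /* low 12 bits (4 KiB) */

    uint32_t l2 = l1_table[l1_idx];
    if (l2 == 0) return 0;                     /* no L2 table -> fault           */
    uint32_t ppn = l2_tables[l2][l2_idx];
    if (ppn == 0) return 0;                    /* page not in DRAM -> page fault */
    return (ppn << 12) | offset;               /* physical page number + offset  */
}
```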

Translation Lookaside Buffers

  • Reference requires 3 mem accesses for 2-level PT
  • Solution: Cache some translations in TLB
    • 32-128 entries, usually fully associative
    • Each entry maps large page so less spatial locality
    • Larger TLBs are 4-8 way set-associative
    • Random or FIFO replacement policy
    • TLB reach: amount of virtual address space that can be mapped by the TLB at once
  • TLB hit: Single cycle translation
  • TLB miss: PT walk to refill

  • Handling a page fault needs a precise trap so the handler can resume after retrieving the page
  • Protection violation should abort process
  • Demand paging: ability to run programs larger than primary memory
  • Context switch: Change of internal state of processor
    • save register values and PC, change value in Supervisor Page Table Base register (SPTBR)
    • TLB entries all set to invalid on context switch
  • VM is level below main memory, TLB comes before cache
    • Same CPI, AMAT equations, but main memory treated as mid-level cache

Lecture 31: 11/6/20

  • Need IO interface for keyboards, network, mouse, display
  • Input: Read a sequence of bytes
  • Output: Write a sequence of bytes
  • Interface: Memory mapped I/O
    • Portion of address space for IO
    • IO device registers there
    • Use normal load/store instructions
  • Processor-IO speed can be mismatched

Polling

  • Device registers:
    • Control register: says OK if ready for IO
    • Data register: contains data
  • Processor reads from control register in loop
  • Processor loads from input / writes to output; the I/O device resets the control register bit to 0
  • Polling a mouse is fine, but polling a hard disk would be very costly for the CPU (see the sketch after this list)
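
A rough sketch of polling a memory-mapped input device: spin on the control register until the ready bit is set, then load the data with a normal load; the register addresses and bit layout are hypothetical.

```c
#include <stdint.h>

/* Hypothetical memory-mapped I/O addresses for an input device */
#define CTRL_REG  ((volatile uint32_t *)0x10000000)   /* bit 0 = "data ready" */
#define DATA_REG  ((volatile uint32_t *)0x10000004)

uint32_t poll_read(void) {
    while ((*CTRL_REG & 0x1) == 0)
        ;                       /* spin until the device sets the ready bit      */
    uint32_t data = *DATA_REG;  /* normal load; the device clears the ready bit  */
    return data;
}
```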

Interrupts

  • Polling wastes processor resources; use an interrupt when I/O is ready or needs attention
  • Lots of IOs can be expensive with thrashing cache, VM, saving/restoring state
  • Low data rate (mouse, keyboard): use interrupts
  • High data rate (network, disk): start with interrupts, switch to DMA
  • Programmed IO:
    • CPU execs lw/sw instructions, gets data from device to main memory, uses data to compute

DMA

  • Direct Memory Access (DMA) engine
    • Contains registers written by CPU
    • Mem addr to place data, # of bytes, IO device, transfer direction, unit of transfer, amount to transfer/burst
  • Incoming data:
    • Receive interrupt from device
    • CPU takes interrupt, initiates transfer (instructs DMA engine to place data at addresses)
    • DMA handles transfer as CPU does other things
    • Upon completion, DMA interrupts CPU
  • Outgoing data:
    • CPU confirms external device is ready
    • CPU begins transfer
    • DMA handles transfer as CPU does other things
    • DMA interrupts CPU to signal completion

Networking

  • sw send:
    • App copies data to OS buffer
    • OS calculates checksum, starts timer
    • OS sends data to network interface hardware, says start
  • sw receive
    • OS copies data from network interface to OS buffer
    • OS calculates checksum, if OK send ACK, if not, delete
    • If OK, OS copies data to user address space, signals app to continue

Lecture 32: 11/9/20

  • dgemm: double precision floating-point matrix mul
  • C is around 240x faster than python for matmul
  • Hardware can be serial or parallel, software can be sequential or concurrent
  • Flynn's taxonomy - parallel hardware
  • SIMD: Single instruction, multiple data
    • Specialized hardware to handle lock-step calculations for arrays
  • SPMD: Single Program Multiple data
    • Most common parallel processing programming style
  • Data Level Parallelism (DLP): operation on multiple data streams
    • Ex. multiply coef vector by data vector
    • One instruction fetched and decoded for entire operation
  • MMX: MultiMedia extension
  • SSE: Streaming SIMD Extension
    • 8 128-bit data registers (XMM0 to XMM7)
  • Intrinsics: C functions and procedures that translate directly to assembly instructions (including SSE instructions); see the example after this list
    • 1:1 correspondence between SSE instructions and intrinsics
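
A small example of SSE2 intrinsics from <emmintrin.h>, adding arrays of doubles two elements per instruction using 128-bit XMM registers; the function name and the even-length assumption are just for illustration.

```c
#include <emmintrin.h>   /* SSE2 intrinsics, operate on 128-bit XMM registers */

/* Add n doubles two at a time; assumes n is even. */
void add_pd(double *out, const double *x, const double *y, int n) {
    for (int i = 0; i < n; i += 2) {
        __m128d a = _mm_loadu_pd(x + i);            /* load 2 doubles into an XMM register */
        __m128d b = _mm_loadu_pd(y + i);
        _mm_storeu_pd(out + i, _mm_add_pd(a, b));   /* one instruction adds both lanes      */
    }
}
```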

Lecture 33: 11/13/20

  • Multiprocessor execution model:
    • Each core executes own instructions
    • Separate resources (datapath, high level caches)
      • Cache coherency is important
    • Shared resources (DRAM, 3rd level cache)
    • Drawback: Slow memory shared by cores (bottleneck)
  • 2 ways to use a multiprocessor
    • Job-level parallelism: processors work on unrelated problems, no communication
    • Partition work of single task between several cores

Threads

  • Thread: Thread of execution, single stream of instructions
    • Program/process can split/fork itself into multiple threads to execute simultaneously
    • Single CPU can execute many threads by time sharing
  • Each thread has a dedicated PC and separate registers, and accesses shared memory
  • Each physical core executes one hardware thread
  • OS multiplexes multiple software threads onto available hardware threads
    • All threads except ones mapped to hardware threads are waiting
  • OS Threads give illusion of many simultaneously active threads
    • Multiplex software threads onto hardware threads
      • Switch out blocked threads (cache miss)
      • Timer (switch active thread every time interval)
    • Remove a software thread from a hardware thread by:
      • Interrupting its execution
      • Saving its registers and PC to memory
    • Start executing diff software thread by
      • Loading previously saved regs
      • Jump to its saved PC
  • daemon = always running

Multithreading

  • Need to switch thread state very quickly
  • Hardware Assisted Software Multithreading: 1 core, 2 threads
    • 2 copies of PC and registers, looks like 2 diff hardware threads
    • Hyper-threading: both threads can be active simultaneously
  • Physical CPU: 1 thread at a time per CPU (switches threads from IO events)
  • Logical CPU (multithreading): 1% more hardware, 10% better performance (sep registers, share datapath, ALU, cache)
  • Multicore: duplicate processors, ~50% more hardware, ~2x better performance
  • Hyper-Threading/Simultaneous Multithreading: exploit the superscalar architecture to launch instructions from different threads at the same time
  • Modern machines do both

Lecture 34: 11/16/20

  • Use intrinsics since C has no native SIMD support
  • Parallel loops:
    • #include <omp.h>
    • #pragma omp parallel for before for loop to parallelize
    • Don't know what thread will finish first
    • #pragma is ignored by compilers unaware of OpenMP
    • Cannot break out of the loop early
  • OpenMP Programming model (Fork-join model)
  • Requires shared memory environment
    • Begin as a single process (main thread)
    • Master thread forks into threads, join together at the end
  • Use an array to keep track of each thread's results (example below)
    • Race condition: nondeterministic result when multiple threads read and write the same data at the same time
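
A minimal OpenMP sketch tying the pieces above together: the pragma parallelizes the loop, and each thread accumulates into its own slot of a results array to avoid a race condition (the array size and the 64-thread cap are illustrative assumptions).

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000
#define MAX_THREADS 64          /* assumption: at most 64 threads */

int main(void) {
    static double x[N];
    for (int i = 0; i < N; i++) x[i] = 1.0;

    double partial[MAX_THREADS] = {0};   /* one slot per thread: no shared writes */

    #pragma omp parallel for             /* ignored by compilers unaware of OpenMP */
    for (int i = 0; i < N; i++)
        partial[omp_get_thread_num()] += x[i];

    double sum = 0.0;
    for (int t = 0; t < MAX_THREADS; t++) sum += partial[t];
    printf("sum = %.1f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```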

Lecture 35: 11/18/20

Synchronization

  • Limit access to shared resource to 1 actor at a time to prevent race condition
  • Use locks to control access to shared resources
  • Implementing the lock as an ordinary variable (1 = locked) does not work: the test and the set are separate instructions, so two threads can both see the lock as free
  • Critical section: area that needs to be serialized
  • Solution: Atomic read/write in single instruction (hardware sync)
    • Must use shared memory
  • Atomic Memory Operations (AMOs): Atomic swap of register and memory
  • Declare omp_lock_t lock
  • omp_init_lock(&lock), omp_set_lock(&lock), omp_unset_lock(&lock), omp_destroy_lock(&lock) (see the sketch after this list)
  • #pragma omp critical within a parallel region ensures only 1 thread executes that section at a time
  • Deadlock: System state where no progress is possible
  • Elapsed wall clock time: double omp_get_wtime(void);
    • Time measured per thread
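
A short sketch using the OpenMP lock calls above to serialize updates to a shared counter; the counter is just an illustrative critical section.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int shared_count = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        omp_set_lock(&lock);       /* only one thread at a time in the critical section */
        shared_count++;
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    printf("count = %d\n", shared_count);   /* deterministic: always 100000 */
    return 0;
}
```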

Shared memory

  • SMP: (Shared memory) symmetric multiprocessor
    • ≥ 2 CPUs, single coherent memory

  • Memory is a performance bottleneck for 1 processor
  • Caches reduce bandwidth demands for main memory
  • Each processor has a local private cache
    • Cache misses have to access shared common memory

Cache Coherency

  • Need to keep cache values coherent
  • Idea: Notify other processors when a processor has a cache miss/write
    • If a processor writes, invalidate other copies
    • Each cache tracks state of each block in cache
      1. Shared: up-to-date data
      2. Modified: up-to-date data, changed/dirty, no other cache has a copy, OK to write, memory out of date (need to write back)
      3. Exclusive: up-to-date data, no other cache has a copy, OK to write, memory up to date (no need to write back if block is replaced)
      4. Owner: up-to-date data, other caches can have a copy but must be in Shared state
      • Owner has exclusive right to make changes to it
      • Broadcasts those changes to all other caches (dirty sharing of data)
      • Can become modified after invalidating shared copies OR changed to shared after writing modifications back to memory (you go from owner to shared)
      • Owned cache lines must respond to a snoop request with data
  • False sharing: the block ping pongs between 2 caches even though variables accessed are disjoint
  • Coherence/communication miss: Disjoint variables but have to invalidate block as one updates

Lecture 36: 11/20/20

  • Amdahl's Law: speedup w/ E = exec time w/o E / exec time w/ E
  • Enhancement E does not affect portion s of a task and accelerates the remaining portion (1 − s) by a factor of P (P > 1)
    • exec time w/ E = exec time w/o E * [s + (1-s)/P]
    • Speedup w/ E = 1 / [s + (1 − s)/P] → 1/s as P → ∞
    • The speedup achievable through parallelism is limited by the serial portion s (worked example after this list)
  • Request level parallelism: Many requests/sec reading a database (usually not read-write/sync)
  • Data level parallelism (DLP): data in memory or data on many disks
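
A tiny helper evaluating the Amdahl's Law formula above, with made-up numbers showing how the serial fraction s caps the speedup no matter how large P gets.

```c
#include <stdio.h>

/* Speedup = 1 / (s + (1 - s) / P), where s is the serial fraction
   and P is the speedup of the parallelizable part. */
double amdahl(double s, double P) {
    return 1.0 / (s + (1.0 - s) / P);
}

int main(void) {
    /* Illustrative numbers: even with P = 1000, a 10% serial portion
       limits the overall speedup to just under 10x. */
    printf("s=0.10, P=10   -> %.2fx\n", amdahl(0.10, 10));    /* ~5.26x */
    printf("s=0.10, P=1000 -> %.2fx\n", amdahl(0.10, 1000));  /* ~9.91x */
    return 0;
}
```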

MapReduce

  • Scalable to large data volumes
  • Cost-efficiency: commodity machines, commodity network
  • Automatic fault-tolerance
  • Expensive switches at higher levels
  • Map: processes input key/value pair
    • Slices data into shards/splits, distributed to workers
    • Produces set of intermediate pairs
  • Group by key, send to reduce
  • Reduce: Combines intermediate values for a particular key, produces an output value
  • Need maps to finish before reduce, but can start reading values

Spark

  • Spark is 100x faster than Hadoop in memory
  • Advanced DAG execution engine with acyclic data flow and in-memory computing
  • Lazy evaluation model
  • Only after .collect() it is evaluated
  • flatMap flattens the lists produced by the map step into a single collection

Lecture 37: 11/23/20

Warehouse Scale Computers

  • 100k servers in the same datacenter (Need to cool building for heat generated)
  • Homogeneous hardware/software
  • High availability
  • Want ample parallelism (Data level parallelism), prepare for high num of component failures, cost of equipment << cost of ownership
  • Rack contains 40-80 servers
  • Array/cluster contains 16-32 server racks
  • Performance:
    • Response time/latency: time between start and completion of task
    • Throughput/bandwidth: total amt of work in a given time
    • Lower latency to DRAM in another server than local disk
    • Higher bandwidth to local disk than to DRAM in another server

Power Usage Effectiveness

  • Workload variation
  • Place data within array in right spot for good performance
  • Cope with failure gracefully
  • Scale up and down in response to varying demand
  • Uses half power when idle
    • Most WSC servers are 10-50% utilized
    • Ideally linear energy proportionality
  • WSC energy efficiency: amount of computational work performed divided by the total energy used in the process
  • Power usage effectiveness (PUE): Total building power / IT equipment power (1.0 is perfect)
    • Some power goes to the uninterruptible power supply (UPS, battery) and the chiller cooling the warm water from the AC
    • Need careful air flow handling (don't mix server hot air with cold air), easier to control in small freight containers
    • Keep cold aisle temperature at 81°F (instead of 65°F)
    • Use free cooling (cool warm water outside by evaporation in cooling towers)

Lecture 38: 11/30/20

  • Dependability via redundancy (datacenters, disks, memory bits)
  • RAID: Redundant Arrays of Independent Disks
  • ECC: Error correcting code (redundant DRAM memory)
  • Spatial redundancy: replicated data/hardware to handle hard and transient failures
  • Temporal redundancy: redundancy in time to handle transient failures
  • Dependability measures:
    • Reliability: Mean Time to Failure (MTTF)
    • Service interruption: Mean Time To Repair (MTTR)
    • Mean time between failures (MTBF)
      • MTBF = MTTF + MTTR
    • Availability = MTTF / (MTTF + MTTR)
      • Increasing MTTF for more reliable hardware/software, fault tolerance
      • Reduce MTTR: improve tools/processes for diagnosis and repair
    • Number of 9s availability per year:
      • 1 nine: 90%, 2 nine: 99%
  • Reliability Measures
    • Average number of failures per year: Annualized Failure Rate (AFR)
      • AFR = (24 × 365) / MTTF, with MTTF in hours
    • Failures in Time (FIT) rate: Num of failures expected in 1 billion device-hours of operation
      • MTBF = 1 billion / FIT
    • Automotive Safety Integrity Level (ASIL) defines FIT rates for different classes of components in vehicles
  • Design principles
    • No single point of failure
    • Dependability limited by part you do not improve

Error Detection

  • Memory systems accidentally flip bits
  • Soft errors: When cells are struck by alpha particles or environmental upset
  • Hard error: When chips permanently fail
  • Error Detection/Correction Codes can protect soft errors
    • Hamming distance: number of bit positions in which two words differ
    • Parity bit: Forces stored words to have even parity
      • XOR all the bits; result 0 means even parity, 1 means odd parity → error detected (sketch after this list)
    • Cannot detect an even number of errors
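
A minimal sketch of the even-parity check described above: XOR all the bits of the stored word (data plus parity bit); 0 means even parity, 1 means an odd number of set bits, i.e. a detected error.

```c
#include <stdint.h>

/* Returns 0 if the 32-bit word (data + parity bit) has even parity,
   1 if the parity is odd, i.e. an error was detected. */
int parity_check(uint32_t word) {
    int p = 0;
    while (word) {
        p ^= (word & 1);   /* XOR all bits together */
        word >>= 1;
    }
    return p;
}
```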

Error Correction

  • Detection: bit pattern fails codeword check
  • Correction: map to nearest valid code word
    • Snap back to closest valid word
    • With minimum Hamming distance 2, the single-bit neighbors of each valid codeword are invalid (errors are detectable)
  • Place parity bits at positions that are powers of 2 (1, 2, 4, 8, ...)
    • p1 covers positions whose index has bit 0 set (1, 3, 5, 7, ...)
    • p2 covers positions whose index has bit 1 set (2, 3, 6, 7, ...)
    • p4 covers positions whose index has bit 2 set (4-7, 12-15, ...)
    • Pattern: starting at its own position n, p_n checks n positions, skips n positions, and repeats (stride 2n)
    • The position of a single-bit error is the sum of the positions of the failing parity bits (see the sketch after this list)
  • Will need double-error correction, triple-error detection (DECTED)
  • Cyclic Redundancy Check: CRC interleaving advanced codes
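
A toy sketch of the syndrome computation implied above: each power-of-2 parity bit checks the positions whose index contains it, and the sum of the failing parity positions gives the location of a single-bit error (the bit-per-array-element storage is an illustrative simplification).

```c
#include <stdint.h>

/* Toy Hamming syndrome for a codeword whose bits are stored in positions
   1..n of the array (position 0 unused); parity bits sit at power-of-2
   positions.  Parity bit p_k covers every position whose binary index has
   bit k set.  The returned syndrome is the sum of the failing parity-bit
   positions, i.e. the position of a single-bit error (0 = no error found). */
int hamming_syndrome(const uint8_t *bits, int n) {
    int syndrome = 0;
    for (int p = 1; p <= n; p <<= 1) {          /* p = 1, 2, 4, 8, ...              */
        int parity = 0;
        for (int i = 1; i <= n; i++)
            if (i & p) parity ^= bits[i];       /* positions covered by this parity bit */
        if (parity) syndrome += p;              /* this group's parity check failed  */
    }
    return syndrome;
}
```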

RAID

  • RAID 1: Disk Mirroring/Shadowing: Each disk fully duplicated on mirror
    • Very high availability
    • Writes limited by single-disk speed, reads can be optimized
    • 100% capacity overhead, so very expensive
  • RAID 3: Parity Disk
    • Stripe physical records across disks, parity disk has parity of stripe
    • If a disk fails, subtract the sum of the remaining disks from P to recover its data
  • RAID 4: High IO Rate Parity
    • All parity is kept on a single dedicated parity disk
    • For concurrent writes, need to update parity
    • Option 1): Read other data disks, create new sum and write to parity disk
    • Option 2) Compare old data to new data and add the difference to P (see the sketch at the end of this section)
    • Both are bottlenecked by parity writes
  • RAID 5: High IO Rate Interleaved Parity
    • Interleaved parity as a diagonal pattern
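
A one-line sketch of the "option 2" small-write parity update used by RAID 4/5: the new parity is the old parity XORed with the difference between the old and new data, so the other data disks never need to be read (names are illustrative).

```c
#include <stdint.h>

/* Small-write parity update: read old data and old parity, then
   P_new = P_old XOR D_old XOR D_new (no need to read the other disks). */
uint64_t update_parity(uint64_t old_parity, uint64_t old_data, uint64_t new_data) {
    return old_parity ^ old_data ^ new_data;
}
```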