--- tags: computer-arch --- # Quiz7 of Computer Architecture (2025 Fall) > solutions # Quiz 7: Computer Architecture ## Problem A: Branch Prediction + Vector extension Consider a pipelined RISC-V RV32I processor with 5 stages: IF, ID, EX, MEM, WB. The processor uses dynamic branch prediction. - [ ] Part 1: 2-Bit Saturating Counter A 2-bit saturating counter for branch prediction has four states: - 00: Strongly Not Taken (SNT) - 01: Weakly Not Taken (WNT) - 10: Weakly Taken (WT) - 11: Strongly Taken (ST) ``` Predict: NOT TAKEN (MSB=0) Predict: TAKEN (MSB=1) ┌────────────────────────┬────────────────────────┐ │ │ │ │ ┌─────┐ ┌─────┐ │ ┌─────┐ ┌─────┐ │ │ │ 00 │ T │ 01 │ │ │ 10 │ T │ 11 │ │ │ │(SNT)│────▶│(WNT)│───┼──▶│(WT) │────▶│(ST) │ │ │ │ │◀────│ │◀──┼───│ │◀────│ │ │ │ └─────┘ N └─────┘ │ └─────┘ N └─────┘ │ │ │ │ │ │ │ └─N (saturate) │ T (saturate)─┘ │ └────────────────────────┴────────────────────────┘ T = Taken → Increment (move right) N = Not Taken → Decrement (move left) ``` Given the following branch outcome sequence (T=Taken, N=Not Taken) starting from state `01` (WNT): ``` T, T, N, T, T, T, N, N, N, T ``` Prediction rule: States with MSB=0 (00, 01) predict Not Taken; states with MSB=1 (10, 11) predict Taken. Fill in the prediction and actual outcome for each branch (showing first 5 of 10): | Branch # | Current State | Prediction | Actual | Next State | |----------|---------------|------------|--------|------------| | 1 | 01 (WNT) | __ A01 __ (Hint: N/T) | T | __ A02 __ (Hint: 2-bit state) | | 2 | __ A02 __ | __ A03 __ (Hint: N/T) | T | __ A04 __ (Hint: 2-bit state) | | 3 | __ A04 __ | T | N | __ A05 __ (Hint: 2-bit state) | | 4 | __ A05 __ | __ A06 __ (Hint: N/T) | T | __ A07 __ (Hint: 2-bit state) | | 5 | __ A07 __ | T | T | 11 (ST) | The total number of mispredictions in the complete 10-branch sequence is __ A08 __ (Hint: count mismatches). - [ ] Part 2: Branch Target Buffer (BTB) A BTB stores three key fields per entry: * __ A09 __ (Hint: identifier): To identify the branch instruction * __ A10 __ (Hint: jump destination): Where to fetch next if predicted taken * __ A11 __ (Hint: history record): To remember recent branch behavior ``` ┌───────────────────────────────────────────────┐ │ 32-bit Virtual Address │ │ ┌─────────────┬──────────┬──────┐ │ │ │ TAG │ INDEX │ (00) │ ← 4B align │ │ │ (? bits) │ (8 bits) │ 2b │ │ │ └─────────────┴────┬─────┴──────┘ │ └────────────────────┼──────────────────────────┘ ▼ ┌───────────────────────────────────────────────┐ │ BTB (256 Entries, Direct-Mapped) │ ├─────┬───────┬─────────────┬────────────┬──────┤ │ Idx │ Valid │ TAG │ Target │ Pred │ ├─────┼───────┼─────────────┼────────────┼──────┤ │ 0 │ 1 │ 0x3F2A1.. │ 0x00401234 │ 10 │ │ 1 │ 0 │ ---- │ ---- │ -- │ │ 2 │ 1 │ 0x048BC.. │ 0x00405678 │ 11 │ │ ... │ ... │ ... │ ... │ ... │ │ 255 │ 1 │ 0xABCDE.. │ 0x00409ABC │ 01 │ └─────┴───────┴─────────────┴────────────┴──────┘ Address Breakdown (32-bit, 4-byte aligned): ┌──────────────────────────────────────────┐ │ [31 ... ?+2] [?+1 ... 2] [1:0] │ │ TAG INDEX alignment(00) │ │ (log₂ 256) │ └──────────────────────────────────────────┘ ``` For a direct-mapped BTB with 256 entries and 32-bit addresses (assuming 4-byte aligned RISC-V instructions), the number of tag bits required (excluding valid bit) is __ A12 __ (Hint: remaining bits for tag). 
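To make the index/tag split concrete, here is a small C sketch (helper names are illustrative, not part of the quiz) that extracts the BTB index and tag from a 32-bit PC, assuming the 256-entry direct-mapped BTB and 4-byte-aligned instructions described above:

```c
#include <stdint.h>
#include <stdio.h>

#define BTB_ENTRIES 256u
#define INDEX_BITS  8u   /* log2(256) */
#define ALIGN_BITS  2u   /* bits [1:0] are always 00 for 4-byte alignment */

/* Set-selection index: the bits just above the alignment bits. */
static inline uint32_t btb_index(uint32_t pc)
{
    return (pc >> ALIGN_BITS) & (BTB_ENTRIES - 1u);
}

/* Tag: whatever high-order bits remain after index and alignment. */
static inline uint32_t btb_tag(uint32_t pc)
{
    return pc >> (ALIGN_BITS + INDEX_BITS);
}

int main(void)
{
    uint32_t pc = 0x00401234u;  /* arbitrary example PC */
    printf("index = 0x%02x, tag = 0x%06x\n", btb_index(pc), btb_tag(pc));
    return 0;
}
```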
- [ ] Part 3: Tournament Predictor A tournament predictor combines: - A global predictor using a Global History Register (GHR) - A local predictor using per-branch history - A chooser that selects between them ``` Branch PC │ ┌───────────────┼───────────────┐ │ │ │ ▼ ▼ ▼ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ GLOBAL │ │ LOCAL │ │ CHOOSER │ │ PREDICTOR │ │ PREDICTOR │ │ │ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ │ │ │ ┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐ │ GHR │ │ LHT │ │ 2-bit/entry│ │ (? bits) │ │ (64 entry, │ │ │ │ ┌─┬─┬─┬─┐ │ │ 6-bit ea) │ │ 00,01 → ? │ │ │T│N│T│T│ │ │ │ │ 10,11 → ? │ │ └─┴─┴─┴─┘ │ │ Index │ │ │ │ │ │ │ │ │ └─────┬──────┘ │ ▼ │ │ ▼ │ │ │ ┌────────┐ │ │ ┌────────┐ │ │ │ │ PHT │ │ │ │ Local │ │ │ │ │(16 ent)│ │ │ │ PHT │ │ │ │ │2-bit ea│ │ │ │(64 ent)│ │ │ │ └───┬────┘ │ │ └───┬────┘ │ │ └─────┼──────┘ └─────┼──────┘ │ │ │ │ ▼ ▼ ▼ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ Global │ │ Local │ │ Select │ │Prediction │ │Prediction │──▶│ (MUX) │ └───────────┘ └───────────┘ └─────┬─────┘ │ ▼ ┌───────────┐ │ Final │ │Prediction │ └───────────┘ ``` Given: - Global predictor uses a 16-entry Pattern History Table (PHT), indexed by GHR - The GHR must have __ A13 __ (Hint: bits to index PHT) bits to index all PHT entries - Local History Table (LHT) with 64 entries, each 6-bit history - Local PHT with 64 entries The chooser uses a 2-bit counter per entry (MSB indicates preference): - 00 or 01 (MSB=0): Use __ A14 __ (Hint: global/local) predictor - 10 or 11 (MSB=1): Use __ A15 __ (Hint: global/local) predictor Update rule: When the selected predictor is wrong and the other would be correct, increment counter if local was correct (toward 11), decrement if global was correct (toward 00), saturating at boundaries. For the following scenario: - Global predictor predicts: Taken - Local predictor predicts: Not Taken - Chooser state: 01 - Actual outcome: Not Taken The tournament predictor's prediction is __ A16 __ (Hint: based on chooser state). After this branch, the chooser counter becomes __ A17 __ (Hint: 2-bit value). - [ ] Part 4: Branch Prediction Accuracy A program has 20% branch instructions. The branch predictor achieves 90% accuracy. Each misprediction costs 3 cycles (pipeline flush). The base CPI (without mispredictions) is 1.0. ``` Effective CPI = Base CPI + (Branch% × Mispred% × Penalty) Instruction Mix: ┌────────────┬─────────────────────────────────────┐ │████████████│ │ │ Branches │ Non-Branches │ │ 20% │ 80% │ └────────────┴─────────────────────────────────────┘ Branch Predictions (90% accuracy): ┌──────────────────┬──┐ │██████████████████│░░│ 90% Correct │ 10% Miss └──────────────────┴──┘ │ │ ▼ ▼ No penalty +3 cycles Speedup = CPI_old / CPI_new ``` The effective CPI considering branch mispredictions is __ A18 __ (Hint: 1.0 + penalty). If we improve the predictor to 95% accuracy: - The new effective CPI becomes 1.0 + (0.20 × 0.05 × 3) = __ A19 __ (Hint: decimal) - The speedup ratio (CPI_90% / CPI_95%) is __ A20 __ (Hint: 2 decimal places) - [ ] Part 5: RVV Programming for LLM Inference Recent advances in Large Language Models (LLMs), such as BitNet b1.58, use ternary quantization where weights are restricted to {-1, 0, +1}. This eliminates expensive multiplications—the core operation becomes conditional addition/subtraction based on weight values. **Background: Ternary Multiply-Accumulate** In standard inference: `y = Σ(w_i × x_i)` requires multiplications. 
In BitNet inference with ternary weights: - If `w_i = 0`: skip (no contribution) - If `w_i = +1`: `y += x_i` - If `w_i = -1`: `y -= x_i` ``` Standard GEMV: BitNet Ternary: ┌───────────────────┐ ┌───────────────────┐ │ y += w[i] * x[i] │ │ if w[i] == +1: │ │ (multiply) │ → │ y += x[i] │ │ │ │ if w[i] == -1: │ │ │ │ y -= x[i] │ └───────────────────┘ └───────────────────┘ HW: Multiplier HW: Mux + Adder ``` **RVV Implementation Strategy:** Using RVV, we can vectorize ternary MAC without branches by using vector merge/mask operations: ``` Weight Encoding (2 bits per weight): ┌─────┬───────┬────────────────────┐ │ Enc │ Value │ Operation │ ├─────┼───────┼────────────────────┤ │ 00 │ 0 │ No contribution │ │ 01 │ +1 │ Add activation │ │ 10 │ -1 │ Subtract activ. │ │ 11 │ 0 │ Reserved (as 0) │ └─────┴───────┴────────────────────┘ ``` **Part 5a: vsetvli Configuration** The `vsetvli` instruction configures the vector unit. Its format: ``` vsetvli rd, rs1, vtypei ``` - `rd`: destination register, receives actual vector length (vl) - `rs1`: requested number of elements (AVL - Application Vector Length) - `vtypei`: immediate encoding SEW (element width), LMUL, tail/mask policies For processing 32-bit integers with VLEN=128: - SEW (Selected Element Width) = 32 bits - VLMAX = VLEN / SEW = 128 / 32 = __ A21 __ elements (Hint: integer) The `vsetvli` encoding `e32,m1,ta,ma` means: - `e32`: element width = __ A22 __ bits (Hint: e32 means) - `m1`: LMUL = 1 (use 1 vector register group) - `ta`: tail agnostic - `ma`: mask agnostic **Part 5b: Vector Ternary MAC Implementation** Complete the RVV assembly for ternary multiply-accumulate. Given: - `a0`: pointer to activations array (int32_t x[N]) - `a1`: pointer to weight codes array (int32_t w[N], pre-expanded from 2-bit encoding) - `a2`: number of elements N - Return sum in `a0` Note: Weight codes are pre-expanded to 32-bit for simplified mask comparison (0, 1, or 2). 
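Before filling in the assembly below, it may help to see the same computation written as scalar C over the pre-expanded weight codes (a reference sketch only; the function name is illustrative):

```c
#include <stdint.h>

/* Reference semantics for the vector loop that follows.
 * Weight codes: 0 = zero, 1 = +1, 2 = -1 (one int32_t per element). */
int32_t ternary_mac_ref(const int32_t *x, const int32_t *w, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++) {
        if (w[i] == 1)       /* code 01: add the activation */
            sum += x[i];
        else if (w[i] == 2)  /* code 10: subtract the activation */
            sum -= x[i];
        /* code 00 (and reserved 11): no contribution */
    }
    return sum;
}
```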
``` ┌─────────────────────────────────────────────────────┐ │ RVV Ternary MAC: y = Σ (w[i] ∈ {-1,0,+1}) × x[i] │ └─────────────────────────────────────────────────────┘ # Register usage: # v0: mask register (reserved for masking operations) # v1: scalar for reduction identity/result # v4: activations (x values) # v8: accumulator # v12: weight codes (32-bit) # v16: negated activations (-x) rvv_ternary_mac: # Initialize accumulator to zero vsetvli t0, a2, e32,m1,ta,ma # Set vl, get actual length in t0 vmv.v.i v8, 0 # v8 = 0 (accumulator) .loop: vsetvli t0, a2, e32,m1,ta,ma # Configure for remaining elements # Load activations vle32.v v4, (a0) # v4 = x[0..vl-1] # Load weight codes (already expanded to 32-bit in memory) vle32.v v12, (a1) # v12 = weight codes (0, 1, or 2) # Create mask for w == +1 (code 01) vmseq.vi v0, v12, __A23__ # Hint: code for +1 # Conditional add: accumulator += x where w == +1 vadd.vv v8, v8, v4, __A24__ # Hint: mask syntax # Create mask for w == -1 (code 10, value 2) vmseq.vi v0, v12, __A25__ # Hint: code for -1 # Compute -x for subtraction vneg.v v16, v4 # v16 = -v4 # Conditional add: accumulator += (-x) where w == -1 __A26__ v8, v8, v16, v0.t # Hint: vector-vector add # Advance pointers slli t1, t0, 2 # t1 = vl * 4 (bytes per int32) add a0, a0, t1 # Advance activation pointer add a1, a1, t1 # Advance weight pointer sub a2, a2, t0 # Decrement remaining count bnez a2, .loop # Horizontal reduction: sum all elements in v8 vmv.s.x v1, zero # v1[0] = 0 (scalar identity for sum) __A27__ v1, v8, v1 # Hint: reduction sum instruction vmv.x.s a0, v1 # a0 = v1[0] (return value) ret ``` **Part 5c: RVV Instruction Semantics** Match each RVV instruction to its operation: | Instruction | Operation | |-------------|-----------| | `vmseq.vi vd, vs2, imm` | Set mask bit if vs2[i] __ A28 __ immediate (Hint: comparison operator) | | `vadd.vv vd, vs1, vs2, v0.t` | Add vectors, but only where __ A29 __ (Hint: mask condition) | | `vredsum.vs vd, vs2, vs1` | Scalar vd[0] = vs1[0] + __ A30 __ (Hint: vector sum) | | `vneg.v vd, vs2` | vd[i] = __ A31 __ (Hint: negation) | **Part 5d: Performance Analysis** Given: VLEN=128, SEW=32, N=64 elements, processing 4 elements per iteration. 
**Scalar Implementation (with branches):**

```c
int scalar_ternary_mac(int *x, int *w, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (w[i] == 1) sum += x[i];       // Branch 1
        else if (w[i] == -1) sum -= x[i]; // Branch 2
    }
    return sum;
}
```

- Iterations: 64
- Branches per iteration: 2 (weight checks)
- Total branch instructions: __ A32 __ (Hint: iterations × branches)

**RVV Implementation (branchless inner loop):**

- Iterations: 64 / 4 = 16
- Branches per iteration: 1 (loop control only)
- Total branch instructions: __ A33 __ (Hint: iterations × branches)

Branch reduction ratio: __ A34 __ (Hint: scalar/vector ratio)

**Cycle Analysis (simplified):**

| Operation | Scalar (per elem) | RVV (per 4 elem) |
|-----------|-------------------|------------------|
| Load activation | 1 cycle | 4 cycles |
| Load weight | 1 cycle | 4 cycles |
| Compare/branch | 2-4 cycles | 2 cycles (mask) |
| Add/Sub | 1 cycle | 4 cycles |
| Loop overhead | 3 cycles | 3 cycles |
| **Total** | ~8 cycles/elem | ~17 cycles/4 elem |

Approximate speedup for N=64 (assume 8 cycles for final reduction):
- Scalar: 64 × 8 = 512 cycles (+ branch mispredictions)
- RVV: 16 × 17 + 8 (reduction) = __ A35 __ cycles (Hint: total cycles)
- Speedup ≈ __ A36 __ × (Hint: ratio, 1 decimal)

**Key Insight:** RVV eliminates data-dependent branches in the inner loop by using __ A37 __ operations (Hint: vectorization technique), converting control flow into data flow.

### ✅ Solution to Problem A:

A01: ==N==
A02: ==10==
A03: ==T==
A04: ==11==
A05: ==10==
A06: ==T==
A07: ==11==
A08: ==5==
A09: ==Tag==
A10: ==Target address==
A11: ==Prediction bits==
A12: ==22==
A13: ==4==
A14: ==global==
A15: ==local==
A16: ==Taken==
A17: ==10==
A18: ==1.06==
A19: ==1.03==
A20: ==1.03==
A21: ==4==
A22: ==32==
A23: ==1==
A24: ==`v0.t`==
A25: ==2==
A26: ==`vadd.vv`==
A27: ==`vredsum.vs`==
A28: ==`==`==
A29: ==v0 mask bit is 1==
A30: ==sum(vs2[0..vl-1])==
A31: ==-vs2[i]==
A32: ==128==
A33: ==16==
A34: ==8==
A35: ==280==
A36: ==1.8==
A37: ==predicated/masked==

Detailed explanations:

#### Part 1: 2-Bit Saturating Counter (A01-A08)

The 2-bit saturating counter operates as a finite state machine:
* States 00 (SNT) and 01 (WNT) predict Not Taken
* States 10 (WT) and 11 (ST) predict Taken
* Actual Taken → increment state (saturate at 11)
* Actual Not Taken → decrement state (saturate at 00)

This hysteresis means the predictor tolerates one anomalous outcome: starting from a strong state, two consecutive outcomes in the opposite direction are needed before the predicted direction flips.
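The behavior is easy to check with a short C sketch that steps the FSM through the quiz's outcome sequence; it reproduces the trace table below, including the misprediction count:

```c
#include <stdio.h>

/* Simulate the 2-bit saturating counter for the Part 1 sequence.
 * State encoding: 0=SNT, 1=WNT, 2=WT, 3=ST; prediction = MSB (state >= 2). */
int main(void)
{
    const char outcomes[] = "TTNTTTNNNT";   /* quiz sequence */
    int state = 1;                          /* start in 01 (WNT) */
    int mispredictions = 0;

    for (int i = 0; outcomes[i]; i++) {
        int taken = (outcomes[i] == 'T');
        int predict_taken = (state >= 2);   /* MSB of the 2-bit state */
        if (predict_taken != taken)
            mispredictions++;
        if (taken && state < 3)  state++;   /* saturate at 11 */
        if (!taken && state > 0) state--;   /* saturate at 00 */
    }
    printf("mispredictions = %d\n", mispredictions);  /* prints 5 */
    return 0;
}
```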
Complete trace for sequence T, T, N, T, T, T, N, N, N, T starting from 01 (WNT): | Branch | State | Prediction | Actual | Next State | Mispred | |--------|-------|------------|--------|------------|---------| | 1 | 01 | N | T | 10 | ✗ | | 2 | 10 | T | T | 11 | | | 3 | 11 | T | N | 10 | ✗ | | 4 | 10 | T | T | 11 | | | 5 | 11 | T | T | 11 | | | 6 | 11 | T | T | 11 | | | 7 | 11 | T | N | 10 | ✗ | | 8 | 10 | T | N | 01 | ✗ | | 9 | 01 | N | N | 00 | | | 10 | 00 | N | T | 01 | ✗ | Total mispredictions: 5 (branches 1, 3, 7, 8, 10) #### Part 2: Branch Target Buffer (A09-A12) BTB structure per entry: * Tag: identifies the branch instruction (partial PC) * Target address: predicted jump destination * Prediction bits: 2-bit saturating counter Tag bits calculation for 256-entry direct-mapped BTB with 32-bit addresses: * RISC-V instructions are 4-byte aligned → effective bits: 30 * Index bits: log₂(256) = 8 * Tag bits: 30 - 8 = 22 #### Part 3: Tournament Predictor (A13-A17) GHR bits derivation (A13): * Global PHT has 16 entries * Index bits needed = log₂(16) = 4 bits * Therefore GHR = 4 bits Chooser counter semantics: * 00, 01 → use global predictor (MSB=0 favors global) * 10, 11 → use local predictor (MSB=1 favors local) Given scenario analysis: * Chooser = 01 → select global predictor * Global predicts Taken → final prediction is Taken * Actual = Not Taken → global wrong, local correct * Update: increment chooser toward local: 01 → 10 #### Part 4: Branch Prediction Accuracy (A18-A20) CPI formula: ``` Effective CPI = Base CPI + (Branch% × Misprediction% × Penalty) ``` With 90% accuracy: ``` CPI_90 = 1.0 + (0.20 × 0.10 × 3) = 1.06 → A18 ``` With 95% accuracy: ``` CPI_95 = 1.0 + (0.20 × 0.05 × 3) = 1.03 → A19 ``` Speedup ratio: ``` Speedup = CPI_90 / CPI_95 = 1.06 / 1.03 ≈ 1.029 → A20 (rounded to 1.03) ``` Note: Speedup ≈ 1.03 means 95% accuracy is only ~3% faster than 90%, demonstrating diminishing returns in branch prediction improvement. #### Part 5: RVV Programming for LLM Inference (A21-A37) **Part 5a: vsetvli Configuration (A21-A22)** The vector length calculation: * VLEN = 128 bits (vector register width) * SEW = 32 bits (selected element width for int32) * VLMAX = VLEN / SEW = 128 / 32 = 4 elements The `e32` in `vsetvli` explicitly sets SEW to 32 bits. This determines how many elements can be processed per vector instruction with the given VLEN. **Part 5b: Vector Ternary MAC Implementation (A23-A27)** The RVV implementation uses predicated operations to eliminate data-dependent branches: ``` A23: 1 - Weight code for +1 (binary 01) A24: v0.t - Mask suffix indicating "execute only where v0 bit is set" A25: 2 - Weight code for -1 (binary 10) A26: vadd.vv - Vector-vector add (same instruction, different mask) A27: vredsum.vs - Vector reduction sum to scalar ``` Key insight: Instead of branching on weight values, we: 1. Create a mask register (v0) where bits indicate which elements match 2. Use masked vector operations that only affect elements where mask=1 3. This converts control flow (branches) into data flow (masks) **Part 5c: RVV Instruction Semantics (A28-A31)** | Instruction | Semantics | |-------------|-----------| | `vmseq.vi vd, vs2, imm` | For each element i: vd[i] = (vs2[i] == imm) ? 
1 : 0 | | `vadd.vv vd, vs1, vs2, v0.t` | For each element i: if (v0[i]) vd[i] = vs1[i] + vs2[i] | | `vredsum.vs vd, vs2, vs1` | vd[0] = vs1[0] + Σ(vs2[i]) for i in 0..vl-1 | | `vneg.v vd, vs2` | For each element i: vd[i] = -vs2[i] (two's complement) | **Part 5d: Performance Analysis (A32-A37)** Branch count comparison: ``` Scalar: 64 iterations × 2 conditional branches = 128 branches - Each iteration has: if (w==1) and else if (w==-1) - Data-dependent branches cause pipeline stalls on misprediction RVV: 16 iterations × 1 loop branch = 16 branches - Only the loop control branch remains - Weight comparisons become mask operations (no branches) ``` Branch reduction: 128 / 16 = 8× fewer branches Cycle analysis (simplified model, matching table above): ``` Scalar (per element): Load x: 1 cycle Load w: 1 cycle Compare w==1: 1 cycle Branch: 1-3 cycles (misprediction penalty) Add/Sub: 1 cycle Loop control: 3 cycles Total: ~8 cycles/element average RVV (per 4 elements): Load x (vle32): 4 cycles Load w (vle32): 4 cycles Mask ops (vmseq×2): 2 cycles Add ops (vadd×2): 4 cycles (includes vneg) Loop control: 3 cycles Total: ~17 cycles/4 elements ≈ 4.25 cycles/element ``` Note: This simplified model amortizes vsetvli overhead and assumes pipelined execution. Final reduction (N=64): * Scalar: 64 × 8 = 512 cycles + misprediction penalties * RVV: 16 × 17 + 8 (reduction) ≈ 280 cycles * Speedup ≈ 512 / 280 ≈ 1.8× **Architectural Insight:** The fundamental advantage of RVV for LLM inference workloads: * **Predicated execution**: Data-dependent control flow becomes mask operations * **Eliminated branch mispredictions**: Weight-dependent branches are notorious for poor prediction (random-looking patterns) * **SIMD amortization**: Overhead (loop control, pointer arithmetic) is shared across multiple elements * **Memory bandwidth utilization**: Coalesced vector loads are more efficient than scalar loads This pattern is critical for BitNet and similar quantized LLM inference where weight values ({-1, 0, +1}) determine operations. The RVV approach converts unpredictable branches into deterministic masked operations. --- ## Problem B: UART Device Driver with Synchronization Consider a multi-threaded RISC-V RV32 system where multiple threads need to access a shared UART peripheral. This problem explores the complete design of a thread-safe UART driver, from basic MMIO to singleton pattern implementation. - [ ] Part 1: MMIO Fundamentals Memory-mapped I/O maps device registers to __ B01 __ (Hint: address type) addresses. 
``` ┌─────────────────────────────────────────────┐ │ 32-bit Address Space │ └─────────────────────────────────────────────┘ 0x00000000 ┌─────────────────────┐ │ RAM (Program+Data) │ 0x3FFFFFFF └─────────────────────┘ 0x40000000 ┌─────────────────────┐ ◀── UART_BASE │ UART Registers │ │ ┌─────────────────┐ │ │ │+0x00: STATUS(RO)│ │ b0=TX_RDY, b1=RX_VAL │ │+0x04: BAUD (RO)│ │ │ │+0x08: INTR (WO)│ │ │ │+0x0C: RECV (RO)│ │ │ │+0x10: SEND (WO)│ │ │ └─────────────────┘ │ 0x400FFFFF └─────────────────────┘ │ Other Peripherals │ 0xFFFFFFFF └─────────────────────┘ ``` MyCPU UART register map (base address: `0x40000000`): ```c #define UART_BASE 0x40000000u #define UART_STATUS ((volatile uint32_t *)(UART_BASE + 0x00)) #define UART_BAUDRATE ((volatile uint32_t *)(UART_BASE + 0x04)) #define UART_INTERRUPT ((volatile uint32_t *)(UART_BASE + 0x08)) #define UART_RECV ((volatile uint32_t *)(UART_BASE + 0x0C)) #define UART_SEND ((volatile uint32_t *)(UART_BASE + 0x10)) ``` The registers should be declared as `volatile` because: * The compiler must not __ B02 __ (Hint: compiler behavior) accesses to these addresses * The hardware may change values __ B03 __ (Hint: timing characteristic) ``` Why volatile matters: ┌─────────────────────────┬─────────────────────────┐ │ Problem 1: Compiler Opt │ Problem 2: Hardware │ ├─────────────────────────┼─────────────────────────┤ │ Thread A: │ STATUS register: │ │ x = *STATUS; │ ┌──────────────┐ │ │ y = *STATUS; │ │ TX_READY: 0 │ │ │ ↳ optimized to y = x │ │ RX_VALID: 0 │ │ │ (Wrong! value changed)│ └──────┬───────┘ │ │ │ ↓ │ │ Thread B: │ Hardware IRQ arrives │ │ (expects fresh read) │ RX_VALID → 1 │ └─────────────────────────┴─────────────────────────┘ ``` - [ ] Part 2: The Race Condition Problem Consider two threads trying to send data simultaneously: ``` Thread A Thread B │ │ │ Check TX_READY ───────────┼──▶ TRUE │ │ Check TX_READY ──▶ TRUE │ Write 'H' ────────────────┼──▶ UART_SEND │ │ Write 'W' ──▶ UART_SEND ▼ ▼ Expected: "HelloWorld" Actual: "HWeolrlold" (interleaved!) ``` A naive (unsafe) UART transmit function: ```c #define UART_TX_READY (1 << 0) // UNSAFE: No synchronization! void uart_putc_unsafe(unsigned char byte) { while (!(*UART_STATUS & UART_TX_READY)) { // Busy wait } *UART_SEND = byte; } ``` This function is not thread-safe because the check-then-act sequence between checking `UART_STATUS` and writing to `UART_SEND` is not __ B04 __ (Hint: indivisible). - [ ] Part 3: Spinlock Protection with RISC-V Atomics To protect UART access, we implement a spinlock using RV32A atomic instructions: ```c typedef struct { volatile int locked; } spinlock_t; void spin_lock(spinlock_t *lock) { while (1) { int old; asm volatile( "__B05__ %0, %2, (%1)" // Hint: RV32A atomic swap with acquire : "=r"(old) : "r"(&lock->locked), "r"(1) : "memory" ); if (old == __B06__) { // Hint: unlocked state value break; // Lock acquired (was unlocked) } } } void spin_unlock(spinlock_t *lock) { asm volatile( "fence ...\n\t" // Release barrier: ensure prior writes complete "__B07__ %1, 0(%0)" // Hint: RISC-V store instruction : : "r"(&lock->locked), "r"(0) : "memory" ); } ``` The atomic swap instruction with acquire ordering atomically: 1. Reads the old value from memory 2. Writes the new value to memory 3. Returns the __ B08 __ (Hint: which value?) value 4. 
The `.aq` (acquire) suffix ensures subsequent memory operations cannot be reordered before the lock acquisition ``` Atomic swap with acquire semantics: ┌────────────────────────────────────────────┐ │ [instruction] rd, rs2, (rs1) │ │ │ │ 1. temp = M[rs1] // Read old value │ │ 2. M[rs1] = rs2 // Write new value │ │ 3. rd = temp // Return old value │ │ │ │ * All three steps are ATOMIC │ │ * .aq: no later ops reorder before this │ └────────────────────────────────────────────┘ Why acquire/release semantics matter: ┌────────────────────────────────────────────┐ │ Without ordering: With acquire/release: │ │ ┌──────────────┐ ┌──────────────────┐ │ │ │ spin_lock() │ │ spin_lock()+.aq │ │ │ │ x = shared; │ │ ─────────────── │ │ │ │ shared = y; │ │ x = shared; │ │ │ │ spin_unlock()│ │ shared = y; │ │ │ └──────────────┘ │ ─────────────── │ │ │ Ops may reorder! │ fence ... │ │ │ │ spin_unlock() │ │ │ └──────────────────┘ │ │ Ops stay in order! │ └────────────────────────────────────────────┘ ``` - [ ] Part 4: Thread-Safe UART with Singleton Pattern In embedded systems, hardware peripherals like UART should only be initialized once and accessed through a single, consistent interface. The Singleton pattern ensures: 1. Only one instance of the UART driver exists 2. Thread-safe initialization (initialized exactly once) 3. Global access point for all threads ```c typedef struct { spinlock_t tx_lock; // Protects transmit operations spinlock_t rx_lock; // Protects receive operations volatile int initialized; } uart_driver_t; // Global singleton instance static volatile uart_driver_t *uart_instance = NULL; // volatile for visibility static spinlock_t init_lock = {0}; uart_driver_t *uart_get_instance(void) { // Double-checked locking pattern with proper memory ordering uart_driver_t *inst = (uart_driver_t *) uart_instance; // Acquire fence: ensure we see fully initialized driver if non-NULL asm volatile("fence ..." ::: "memory"); // acquire ordering if (inst == NULL) { spin_lock(&init_lock); inst = (uart_driver_t *)uart_instance; if (inst == __B09__) { // Hint: pointer value for second check static uart_driver_t driver; driver.tx_lock.locked = 0; driver.rx_lock.locked = 0; driver.initialized = 1; // Release barrier before publishing pointer asm volatile("fence __B10__" ::: "memory"); // Hint: release semantics uart_instance = &driver; inst = &driver; } spin_unlock(&init_lock); } return inst; } ``` Why double-checked locking needs memory barriers? ``` Without barriers: With barriers: ┌──────────────────────┐ ┌─────────────────────────┐ │ Thread A Thread B │ │ Thread A Thread B │ │ ──────── ──────── │ │ ──────── ──────── │ │ init fields │ │ init fields │ │ instance = &drv │ │ fence ... ← Release │ │ │ │ │ instance = &drv │ │ └────▶ read ptr │ │ │ │ │ read flds │ │ └────▶ read ptr │ │ (STALE!) │ │ fence ... ←│ │ │ │ read fields │ │ Sees uninit data! │ │ (CORRECT!) │ └──────────────────────┘ └─────────────────────────┘ ``` The fence instruction with release semantics ensures: * All __ B11 __ (Hint: r/w) and __ B12 __ (Hint: r/w) operations before the fence complete * Before any __ B13 __ (Hint: r/w) operations after the fence begin This prevents other threads from seeing `uart_instance` pointing to a partially initialized driver. Note: In C11+, use `atomic_load_explicit(&uart_instance, memory_order_acquire)` and `atomic_store_explicit(&uart_instance, &driver, memory_order_release)` for portable code. 
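As a portable counterpart to the inline-assembly version above, here is a minimal C11 sketch of the same double-checked pattern using `<stdatomic.h>` (the driver fields are simplified and an `atomic_flag` stands in for the init spinlock; this is an illustration, not the driver's actual code):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    int tx_lock;
    int rx_lock;
    int initialized;
} uart_driver_t;                /* simplified stand-in for the real driver struct */

static _Atomic(uart_driver_t *) uart_instance = NULL;
static atomic_flag init_lock = ATOMIC_FLAG_INIT;

uart_driver_t *uart_get_instance(void)
{
    /* First check: acquire load pairs with the release store below. */
    uart_driver_t *inst = atomic_load_explicit(&uart_instance, memory_order_acquire);
    if (inst == NULL) {
        while (atomic_flag_test_and_set_explicit(&init_lock, memory_order_acquire))
            ;                                   /* spin until the init lock is free */
        inst = atomic_load_explicit(&uart_instance, memory_order_relaxed);
        if (inst == NULL) {                     /* second check, under the lock */
            static uart_driver_t driver;
            driver.tx_lock = 0;
            driver.rx_lock = 0;
            driver.initialized = 1;
            /* Release store: publishes only a fully initialized driver. */
            atomic_store_explicit(&uart_instance, &driver, memory_order_release);
            inst = &driver;
        }
        atomic_flag_clear_explicit(&init_lock, memory_order_release);
    }
    return inst;
}
```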
- [ ] Part 5: Interrupt-Driven UART with Ring Buffer For efficient UART communication, we use interrupt-driven I/O with a ring buffer: ``` ┌───────────────────────────────────────────────────────┐ │ Interrupt-Driven UART Architecture │ └───────────────────────────────────────────────────────┘ Application Threads Interrupt Handler │ │ │ uart_putc() │ ▼ │ ┌─────────────┐ │ │ TX Ring │◀────── Producer ───────────┤ │ Buffer │ │ │┌──┬──┬──┬──┐│ Consumer │ ││H │e │l │l ││────────────────────────────▶ TX IRQ │└──┴──┴──┴──┘│ │ │ ▲ ▲ │ │ │head tail │ │ └─────────────┘ │ │ ┌─────────────┐ │ │ RX Ring │◀────── Producer ───────────┤ RX IRQ │ Buffer │ │ │┌──┬──┬──┬──┐│ Consumer │ ││W │o │r │l ││────────────────────────────┤ │└──┴──┴──┴──┘│ │ │ ▲ ▲ │ uart_getc() │ │head tail │◀───────────────────────────│ Application └─────────────┘ ``` ```c #define RING_SIZE 64 typedef struct { volatile uint8_t buffer[RING_SIZE]; volatile uint32_t head; // Write position (producer) volatile uint32_t tail; // Read position (consumer) } ring_buffer_t; typedef struct { spinlock_t tx_lock; spinlock_t rx_lock; ring_buffer_t tx_ring; ring_buffer_t rx_ring; volatile int tx_busy; } uart_driver_t; ``` The ring buffer implements a producer-consumer pattern: * For TX: Application is __ B14 __ (Hint: producer/consumer), interrupt handler is __ B15 __ (Hint: producer/consumer) * For RX: Interrupt handler is __ B16 __ (Hint: producer/consumer), application is __ B17 __ (Hint: producer/consumer) Complete the ring buffer operations: ```c int ring_put(ring_buffer_t *rb, uint8_t byte) { uint32_t next_head = (rb->head + 1) % RING_SIZE; if (next_head == rb->__B18__) { // Hint: head/tail return -1; // Buffer full } rb->buffer[rb->head] = byte; rb->head = next_head; return 0; } int ring_get(ring_buffer_t *rb, uint8_t *byte) { if (rb->head == rb->__B19__) { // Hint: head/tail return -1; // Buffer empty } *byte = rb->buffer[rb->tail]; rb->tail = (rb->tail + 1) % RING_SIZE; return 0; } ``` - [ ] Part 6: Interrupt Handler and CSR Operations For interrupt-driven UART, we configure RISC-V CSRs: ``` mstatus (Machine Status) mie (Machine Interrupt Enable) ┌────┬────┬────┬────┬────┐ ┌────┬────┬────┬────┬────┬────┐ │... │ │ MIE│ │ │ │... │MEIE│ │MTIE│ │MSIE│ │ │ │bit3│ │ │ │ │b11 │ │ b7 │ │ b3 │ └────┴────┴──┬─┴────┴────┘ └────┴──┬─┴────┴──┬─┴────┴──┬─┘ │ │ │ │ Global Enable External Timer Software ``` Enable external interrupts for UART: ```c void uart_enable_interrupts(void) { // Enable RX interrupt in UART peripheral *UART_INTERRUPT = (1 << 1); // RX interrupt enable // Enable external interrupts in mie (bit 11) csrs(mie, __B20__); // Hint: bit 11 mask // Enable global interrupts in mstatus (bit 3) csrs(mstatus, __B21__); // Hint: bit 3 mask } ``` The interrupt handler: ```c void __attribute__((interrupt)) uart_irq_handler(void) { uart_driver_t *uart = uart_get_instance(); uint32_t status = *UART_STATUS; // RX interrupt: data available if (status & (1 << 1)) { uint8_t byte = *UART_RECV; // No lock needed: single producer (IRQ handler) ring_put(&uart->rx_ring, byte); } // TX interrupt: ready to send if (status & (1 << 0)) { uint8_t byte; if (ring_get(&uart->tx_ring, &byte) == 0) { *UART_SEND = byte; } else { uart->tx_busy = 0; // TX complete } } } ``` When an interrupt occurs: 1. The PC is saved to __ B22 __ (Hint: CSR name) 2. The cause is stored in __ B23 __ (Hint: CSR name) 3. Execution jumps to the address in __ B24 __ (Hint: CSR name) The instruction to return from machine-mode trap is __ B25 __ (Hint: RISC-V instruction). 
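One detail the listing leaves open is the `csrs()` helper used in `uart_enable_interrupts()`; it is not defined anywhere above. A plausible definition (an assumption for illustration) wraps the `csrrs` instruction in a macro:

```c
/* Assumed helper, not shown in the listing: set bits in a CSR.
 * The CSR name must be a literal token such as mie or mstatus. */
#define csrs(csr, mask)                                                    \
    do {                                                                   \
        unsigned long __v = (mask);                                        \
        asm volatile("csrrs zero, " #csr ", %0" : : "r"(__v) : "memory");  \
    } while (0)

/* Usage matching Part 6:
 *   csrs(mie, 1u << 11);     set MEIE: machine external interrupt enable
 *   csrs(mstatus, 1u << 3);  set MIE: global machine interrupt enable
 */
```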
- [ ] Part 7: Complete Thread-Safe UART API Putting it all together, the complete thread-safe UART API: ```c // Thread-safe transmit: multiple threads can call safely void uart_putc(unsigned char byte) { uart_driver_t *uart = uart_get_instance(); __B26__(&uart->tx_lock); // Hint: acquire lock // Add to TX ring buffer while (ring_put(&uart->tx_ring, byte) < 0) { // Buffer full, wait for space __B27__(&uart->tx_lock); // Hint: release lock // Yield or brief spin __B26__(&uart->tx_lock); // Hint: re-acquire lock } // Start transmission if not already in progress if (!uart->tx_busy) { uart->tx_busy = 1; uint8_t first; ring_get(&uart->tx_ring, &first); *UART_SEND = first; } __B27__(&uart->tx_lock); // Hint: release lock } // Thread-safe receive: multiple threads can call safely int uart_getc(void) { uart_driver_t *uart = uart_get_instance(); uint8_t byte; __B26__(&uart->rx_lock); // Hint: acquire lock while (ring_get(&uart->rx_ring, &byte) < 0) { // Buffer empty, wait for data __B27__(&uart->rx_lock); // Hint: release lock // Could use interrupt-based waiting here __B26__(&uart->rx_lock); // Hint: re-acquire lock } __B27__(&uart->rx_lock); // Hint: release lock return byte; } ``` - [ ] Part 8: Memory Ordering for MMIO RISC-V requires explicit memory ordering for correct MMIO operations. Under RVWMO (RISC-V Weak Memory Ordering), memory accesses can be reordered unless explicit fences are used. ```c // Ensure status read completes before data write uint32_t status = *UART_STATUS; asm volatile("fence __B28__" ::: "memory"); // Hint: read-to-write ordering *UART_SEND = byte; // Ensure data write completes before status check *UART_SEND = byte; asm volatile("fence __B29__" ::: "memory"); // Hint: write-to-read ordering uint32_t status = *UART_STATUS; ``` ``` RISC-V fence instruction format: ┌─────────────────────────────────────────────────┐ │ fence [predecessor], [successor] │ │ │ │ Predecessor: ops BEFORE fence that must finish │ │ Successor: ops AFTER fence that must wait │ │ │ │ Ordering bits: r = read, w = write │ │ │ │ Examples: │ │ fence r, r → reads before later reads │ │ fence w, w → writes before later writes │ │ fence rw, rw → full memory barrier │ └─────────────────────────────────────────────────┘ Note: For strict MMIO ordering, you may need fence.i or device-specific ordering. The basic fence orders regular memory ops, sufficient for MMIO in this context. ``` The `fence` instruction ordering arguments: * First argument: operations __ B30 __ (Hint: before/after) the fence that must complete * Second argument: operations __ B31 __ (Hint: before/after) the fence that must wait - [ ] Part 9: Interrupt Latency and Synchronization Hazards ``` IRQ Signal Return to App │ │ ▼ ▼ ──┼────┬────────────┬──────────────┬────────────────┼──▶ time │◀──▶│◀──────────▶│◀────────────▶│◀──────────────▶│ │ T1 │ T2 │ T3 │ T4 │ │ │ │ │ │ │ IRQ│ Context │ Handler + │ Context │ │Det.│ Save │ Ring Buf Op │ Restore │ │ │(13 reg × 4)│ │ (13 reg × 4) │ ``` For a processor with: * 2-cycle interrupt recognition * 13 registers to save (4 cycles each) * Handler + ring buffer operation: 60 cycles * Context restore: same as save Total interrupt latency = __ B32 __ cycles (Hint: sum T1+T2+T3+T4). Why spinlocks should NOT be held during interrupt handling: ``` ┌────────────────────────────────────────────────────┐ │ Deadlock Scenario │ ├────────────────────────────────────────────────────┤ │ 1. Thread A: spin_lock(&tx_lock) ← Lock acquired │ │ 2. Interrupt occurs ← A preempted │ │ 3. 
IRQ: spin_lock(&tx_lock) ← Tries to lock │ │ 4. Spins forever ← A can't resume! │ └────────────────────────────────────────────────────┘ ``` If Thread A holds `tx_lock` and gets interrupted, and the IRQ handler tries to acquire the same lock, the result is __ B33 __ (Hint: concurrency issue). **Prevention strategies:** ``` ┌────────────────────────────────────────────────────┐ │ Solution 1: Disable IRQ while holding lock │ ├────────────────────────────────────────────────────┤ │ uint32_t saved = disable_interrupts(); │ │ spin_lock(&tx_lock); │ │ // Critical section (IRQ disabled) │ │ spin_unlock(&tx_lock); │ │ restore_interrupts(saved); │ └────────────────────────────────────────────────────┘ ┌────────────────────────────────────────────────────┐ │ Solution 2: IRQ handler never acquires locks │ ├────────────────────────────────────────────────────┤ │ Use lock-free structures (e.g., SPSC ring buffer) │ │ between threads and IRQ handlers. Only app threads │ │ acquire locks, not handlers. │ └────────────────────────────────────────────────────┘ ``` In our UART driver design, we use Solution 2: the IRQ handler operates on ring buffers without acquiring locks (single-producer/single-consumer pattern), while application threads use spinlocks only among themselves. ### ✅ Solution to Problem B: B01: ==memory== B02: ==optimize away== B03: ==asynchronously== B04: ==atomic== B05: ==`amoswap.w.aq`== B06: ==0== B07: ==`sw`== B08: ==old== B09: ==NULL== B10: ==rw, w== B11: ==read== B12: ==write== B13: ==write== B14: ==producer== B15: ==consumer== B16: ==producer== B17: ==consumer== B18: ==tail== B19: ==tail== B20: ==(1 << 11)== B21: ==(1 << 3)== B22: ==`mepc`== B23: ==`mcause`== B24: ==`mtvec`== B25: ==`mret`== B26: ==`spin_lock`== B27: ==`spin_unlock`== B28: ==r, rw== B29: ==w, r== B30: ==before== B31: ==after== B32: ==166== B33: ==deadlock== Detailed explanations: #### Part 1-2: MMIO and Race Conditions Memory-mapped I/O uses regular memory addresses to access hardware registers. The `volatile` keyword is essential because: * Compiler optimization: Without volatile, the compiler may cache register values or reorder accesses * Hardware behavior: The UART hardware can change STATUS at any time (asynchronously) The race condition in naive UART code: ``` Thread A: Check STATUS (ready) → [context switch] → Write 'H' Thread B: Check STATUS (ready) → Write 'W' Thread A: [resumed] → Write 'H' (overwrites 'W' or interleaves) ``` The check-then-act is not atomic, creating a TOCTOU (Time-Of-Check-Time-Of-Use) vulnerability. #### Part 3: Spinlock with RISC-V Atomics The `amoswap.w.aq` (Atomic Memory Operation Swap Word with Acquire) instruction provides hardware-level atomicity with memory ordering: ``` amoswap.w.aq rd, rs2, (rs1) 1. Read M[rs1] into rd 2. Write rs2 to M[rs1] 3. Both steps are atomic (cannot be interrupted) 4. .aq suffix: subsequent memory ops cannot reorder before this instruction ``` For acquiring a lock: * Swap 1 into lock location, get old value * If old value was 0, lock was free → acquired * If old value was 1, lock was held → spin and retry * The acquire ordering ensures all critical section accesses happen after lock acquisition For releasing a lock: * A release fence (`fence rw, w`) ensures critical section writes complete * Then a simple `sw` (store word) sets the lock to 0 * This is more efficient than `amoswap.w.rl` when no return value is needed #### Part 4: Singleton Pattern The singleton pattern ensures: 1. Single instance: Only one UART driver object exists 2. 
Lazy initialization: Created on first access 3. Thread safety: Double-checked locking prevents races Memory barriers are critical for correct double-checked locking: **Release fence (`fence rw, w`) on the publisher side:** * Ensures all driver initialization writes complete before publishing the pointer * Without it, another thread might see `uart_instance` set before the driver struct is fully initialized **Acquire fence (`fence r, rw`) on the reader side:** * Ensures the pointer read happens before any subsequent reads of the driver fields * Without it, a reader might use stale cached values of driver fields even after seeing the non-NULL pointer This pattern demonstrates the fundamental acquire-release synchronization model used throughout concurrent programming. #### Part 5: Ring Buffer Producer-Consumer The ring buffer implements classic producer-consumer: ``` TX Path: Application (producer) → tx_ring → IRQ handler (consumer) → UART hardware RX Path: UART hardware → IRQ handler (producer) → rx_ring → Application (consumer) ``` Key insight: Single-producer-single-consumer (SPSC) ring buffers don't need locks if: * Only one thread writes head * Only one thread writes tail * Volatile ensures visibility For TX: multiple threads call `uart_putc()` (multiple producers), so we need the `tx_lock`. #### Part 6-7: Interrupt-Driven I/O RISC-V interrupt flow: ``` 1. Hardware triggers external interrupt 2. If mstatus.MIE=1 and mie.MEIE=1: a. Save PC → mepc b. Store cause code → mcause c. Clear mstatus.MIE (disable further interrupts) d. Jump to mtvec 3. Handler executes 4. mret: Restore PC from mepc, restore MIE ``` The handler must: * Check which interrupt occurred * Process data quickly (add to ring buffer) * Not hold locks that application code might hold #### Part 8: Memory Ordering RISC-V has a relaxed memory model. For MMIO: ``` fence r, rw - Acquire: reads complete before following memory ops fence rw, w - Release: all ops complete before following writes fence rw, rw - Full barrier: total ordering ``` Critical for UART: * Must read STATUS before writing SEND (acquire) * Must complete SEND before checking for next operation (release) #### Part 9: Interrupt Latency and Deadlock Latency calculation: ``` Recognition: 2 cycles Context save: 52 cycles (13 × 4) Handler + ring: 60 cycles Context restore: 52 cycles (13 × 4) Total: 166 cycles ``` Deadlock scenario: ``` 1. Thread A: spin_lock(&tx_lock) 2. Interrupt occurs (Thread A preempted) 3. IRQ handler: tries spin_lock(&tx_lock) 4. Spins forever - Thread A can't run to release lock ``` Solution: Disable interrupts while holding locks, or use interrupt-safe lock primitives. --- ## Problem C: Unified Memory Hierarchy (Sv32, Caches, and Coherence) You are architecting a dual-core RISC-V system (Core P0 and Core P1) for a banking application. The system enforces strict memory isolation using Sv32 Virtual Memory and maintains data consistency using a Snooping MSI Cache Coherence Protocol. 
``` ┌────────────────────────────────────────────────────────┐ │ Dual-Core RISC-V Banking System │ └────────────────────────────────────────────────────────┘ Core P0 Core P1 ┌───────────────┐ ┌───────────────┐ │ Pipeline │ │ Pipeline │ │ ┌─────────┐ │ │ ┌─────────┐ │ │ │ TLB │ │ │ │ TLB │ │ │ │ (Sv32) │ │ │ │ (Sv32) │ │ │ └────┬────┘ │ │ └────┬────┘ │ │ ┌────▼────┐ │ │ ┌────▼────┐ │ │ │L1 Cache │ │ │ │L1 Cache │ │ │ │ 16KB/4W │ │ │ │ 16KB/4W │ │ │ │ 64B blk │ │ │ │ 64B blk │ │ │ └────┬────┘ │ │ └────┬────┘ │ └───────┼───────┘ └───────┼───────┘ └───────────────┬────────────────┘ │ Shared Bus (MSI) ┌────▼────┐ │ L2 Cache│ └────┬────┘ ┌────▼────┐ │ Main │ │ Memory │ └─────────┘ ``` **System Specifications:** - Virtual Memory: Sv32 (32-bit VA, 4 KiB pages, 2-level page table) - L1 Cache (Private): 16 KiB, 4-way set associative, 64-byte block, Write-Back/Write-Allocate - Coherence: MSI (Modified, Shared, Invalid) protocol - Physical Address: 32-bit The system executes a critical transaction update. We trace a store instruction from Core P0: ```assembly sw x10, 0x050(x11) # Store new balance to memory # x10 = 100 (new balance value) # x11 = 0x80402000 (virtual base address) # Effective VA = 0x80402050 ``` - [ ] Part 1: L1 Cache Organization Before tracing the memory access, we must understand the L1 cache geometry. ``` L1 Cache Parameters: ┌──────────────────────────────────────────────┐ │ Total Size: 16 KiB = 16384 bytes │ │ Block Size: 64 bytes │ │ Associativity: 4-way │ └──────────────────────────────────────────────┘ 32-bit Physical Address Breakdown: ┌───────────────────┬──────────┬──────────┐ │ TAG │ INDEX │ OFFSET │ │ (? bits) │ (? bits) │ (? bits) │ └───────────────────┴──────────┴──────────┘ │ │ └─▶ Byte within block │ └─▶ Which set to check └─▶ Which block within set ``` Calculate: - Number of blocks = Cache size / Block size = __ C01 __ (Hint: total blocks) - Number of sets = Number of blocks / Associativity = __ C02 __ (Hint: sets count) - Offset bits = log₂(Block size) = __ C03 __ (Hint: bits for block offset) - Index bits = log₂(Number of sets) = __ C04 __ (Hint: bits for set index) - Tag bits = 32 - Index bits - Offset bits = __ C05 __ (Hint: remaining bits) - [ ] Part 2: Virtual Memory Address Translation (Sv32) The core must translate VA `0x80402050` to a Physical Address. 
``` Sv32 Virtual Address Format (32-bit): ┌────────────┬────────────┬──────────────┐ │ VPN[1] │ VPN[0] │ Offset │ │ 10 bits │ 10 bits │ 12 bits │ │ [31:22] │ [21:12] │ [11:0] │ └────────────┴────────────┴──────────────┘ satp CSR (configured as 0x80000002): ┌───────┬──────────┬────────────────────────┐ │ Mode │ ASID │ PPN │ │ bit31 │ [30:22] │ [21:0] │ │ 1 │ 0 │ [PPN] │ └───────┴──────────┴────────────────────────┘ │ │ └─▶ Sv32 enabled └─▶ Root PT @ PPN×4KiB ``` For VA = `0x80402050` (binary: `1000 0000 0100 0000 0010 0000 0101 0000`): Extract address fields: - VPN[1] = bits [31:22] = __ C06 __ (Hint: upper 10 bits in hex) - VPN[0] = bits [21:12] = __ C07 __ (Hint: middle 10 bits in hex) - Page Offset = bits [11:0] = __ C08 __ (Hint: lower 12 bits in hex) ``` Two-Level Page Table Walk: ┌─────────────────────────────────────────────────┐ │ Step 1: Level-1 PT lookup │ │ PT1_base = satp.PPN × 4096 = 0x00002000 │ │ PTE1_addr = PT1_base + VPN[1] × 4 │ └─────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────┐ │ Step 2: Level-0 PT lookup │ │ PT0_base = PTE1.PPN × 4096 │ │ PTE0_addr = PT0_base + VPN[0] × 4 │ └─────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────┐ │ Step 3: Physical Address │ │ PA = PTE0.PPN × 4096 + Offset │ └─────────────────────────────────────────────────┘ ``` Given the Leaf PTE contains PPN `0x12345` with permissions R=1, W=1, X=0, V=1: - Physical Address = PPN concatenated with Offset = __ C09 __ (Hint: hex result) The instruction is `sw` (store). With W=1 permission and assuming hardware manages A/D bits, does a page fault occur? __ C10 __ (Hint: Yes/No) **Sv32 PTE Format:** ``` ┌─────────────────────────┬───┬─┬─┬─┬─┬─┬─┬─┬─┐ │ PPN [31:10] │RSW│D│A│G│U│X│W│R│V│ │ 22 bits │ 2 │1│1│1│1│1│1│1│1│ └─────────────────────────┴───┴─┴─┴─┴─┴─┴─┴─┴─┘ ``` - V (bit 0): __ C11 __ (Hint: bit function) - A (bit 6): __ C12 __ (Hint: bit function) - D (bit 7): __ C13 __ (Hint: bit function) A leaf PTE (actual page mapping) requires __ C14 __ bit(s) set (Hint: W alone invalid). A non-leaf PTE (pointer to next level) has R=W=X=__ C15 __ (Hint: single digit). - [ ] Part 3: TLB and Cache Interaction ``` Memory Access Pipeline: ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ VA │──▶│ TLB │──▶│ Cache │──▶│ Data │ │0x80402050│ │ Lookup │ │ Lookup │ │ │ └──────────┘ └────┬─────┘ └────┬─────┘ └──────────┘ │ │ ┌─────▼─────┐ ┌─────▼─────┐ │ TLB Miss? │ │Cache Miss?│ │ Page Walk │ │ Memory │ └───────────┘ └───────────┘ ``` **TLB Structure (32 entries, fully associative):** For VA `0x80402050`: - TLB lookup uses VPN (upper 20 bits) = __ C16 __ (Hint: upper 20 bits in hex) TLB hit requires: 1. Valid bit is __ C17 __ (Hint: 0 or 1) 2. ASID matches current process OR entry is __ C18 __ (Hint: page attribute) 3. VPN matches the __ C19 __ (Hint: which portion of VA) of the virtual address **VIPT Optimization:** This system uses VIPT (Virtually Indexed, Physically Tagged) cache design: - Cache Way Size = Sets × Block Size = 64 × 64 = 4096 bytes = 4 KiB - Page Size = 4 KiB Since Cache Way Size equals Page Size, Index bits [11:6] fall entirely within the Page Offset [11:0]. This allows TLB lookup and Cache indexing to happen in __ C20 __ (Hint: Parallel/Series), because the index bits don't change during VA→PA translation. 
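The field extraction and final PA formation from Part 2 can be written as a few lines of C (a verification sketch; the leaf PPN is the one given in the problem statement):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t va   = 0x80402050u;
    uint32_t vpn1 = (va >> 22) & 0x3FFu;   /* VPN[1]: bits [31:22] */
    uint32_t vpn0 = (va >> 12) & 0x3FFu;   /* VPN[0]: bits [21:12] */
    uint32_t off  =  va        & 0xFFFu;   /* page offset: bits [11:0] */

    uint32_t leaf_ppn = 0x12345u;          /* PPN from the leaf PTE in Part 2 */
    uint32_t pa = (leaf_ppn << 12) | off;  /* PA = PPN * 4096 + offset */

    printf("VPN[1]=0x%03x VPN[0]=0x%03x offset=0x%03x PA=0x%08x\n",
           vpn1, vpn0, off, pa);
    return 0;
}
```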
- [ ] Part 4: Cache Address Mapping With the translated PA, map to cache structure: ``` PA: [Physical Address from C09] Binary: [32-bit binary representation] └─────────TAG─────────┘ └INDEX─┘ └OFFSET┘ 20 bits 6 bits 6 bits ``` Extract fields from the PA (answer to C09): - Offset = bits [5:0] = __ C21 __ (Hint: lower 6 bits in hex) - Index = bits [11:6] = __ C22 __ (Hint: middle 6 bits in hex) - Tag = bits [31:12] = __ C23 __ (Hint: upper 20 bits in hex) - [ ] Part 5: MSI Cache Coherence Protocol The MSI protocol has three states: ``` MSI State Diagram: ┌─────────────────────────────────────────────────────┐ │ │ │ ┌───────┐ BusRd ┌───────┐ PrWr/ ┌───────┐│ │ │Invalid│─────────────▶│Shared │─BusRdX─▶│ Mod. ││ │ │ (I) │◀─────────────│ (S) │◀────────│ (M) ││ │ └───────┘ BusRdX └───────┘ BusRd └───────┘│ │ ▲ (Inv) │ (Flush) │ │ │ │ │ │ │ │ └──────────────────────┴─────────────────┘ │ │ BusRdX (Invalidate) │ └─────────────────────────────────────────────────────┘ States: - M (Modified): Cache has __C24__ (Hint: copy type) copy, memory is __C25__ (Hint: freshness) - S (Shared): Clean copy, __C26__ (Hint: quantity) caches may also have it - I (Invalid): Cache does not have valid data ``` **State Transition Trace:** Initial state: P0 and P1 both have cache line for address X in state Invalid (I). | Step | Action | P0 State | P1 State | Bus Transaction | |------|--------|----------|----------|-----------------| | 1 | P0: Read X | __ C27 __ (Hint: M/S/I) | I | BusRd, Memory responds | | 2 | P1: Read X | S | __ C28 __ (Hint: M/S/I) | BusRd, Memory responds | | 3 | P0: Write X | __ C29 __ (Hint: M/S/I) | __ C30 __ (Hint: M/S/I) | BusRdX (Invalidation) | | 4 | P1: Read X | __ C31 __ (Hint: M/S/I) | __ C32 __ (Hint: M/S/I) | BusRd, P0 responds + Writeback | - [ ] Part 6: MESI Protocol Enhancement MESI adds the Exclusive (E) state to MSI: ``` MESI State Transitions: PrRd (exclusive) │ ▼ ┌───────┐ ┌─────────┐ ┌───────┐ │Invalid│────────────▶│Exclusive│─────────────▶│ Mod. │ │ (I) │ BusRd │ (E) │ PrWr │ (M) │ └───────┘ (alone) └─────────┘ (silent!) └───────┘ ▲ │ │ │ BusRd (shared) │ ▼ │ ┌─────────┐ └─────────────────│ Shared │ BusRdX │ (S) │ └─────────┘ ``` - E (Exclusive): Cache has __ C33 __ (Hint: copy type) copy, memory is __ C34 __ (Hint: freshness) The key advantage of E state: A write to an Exclusive line transitions directly to Modified __ C35 __ (Hint: with/without) generating bus traffic, because no other cache has the data. 
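The transitions used in the trace above can be summarized in a compact C sketch of the MSI state machine (processor-side and snoop-side views only; the corresponding bus transactions are noted in comments):

```c
typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;

/* Local core performs a load (is_write = 0) or store (is_write = 1). */
static msi_state_t msi_on_cpu_access(msi_state_t cur, int is_write)
{
    if (is_write)
        return MSI_M;                       /* PrWr: issue BusRdX unless already M */
    return (cur == MSI_I) ? MSI_S : cur;    /* PrRd: BusRd on a miss, else unchanged */
}

/* Another core's transaction is observed (snooped) on the shared bus. */
static msi_state_t msi_on_snoop(msi_state_t cur, int bus_rdx)
{
    if (bus_rdx)
        return MSI_I;                       /* BusRdX: invalidate any local copy */
    return (cur == MSI_M) ? MSI_S : cur;    /* BusRd: M flushes and downgrades to S */
}
```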
- [ ] Part 7: Cache Coherence Traffic Analysis Consider this access pattern (X and Y are in different cache lines): ``` P0: Read X ─┐ P1: Read X │ Contention on X P0: Write X │ P1: Write X ─┘ P0: Read Y ─┐ P1: Write Y ─┘ Access to Y ``` **MSI Protocol Traffic:** | Step | Operation | Bus Transaction | Notes | |------|-----------|-----------------|-------| | 1 | P0: Read X | BusRd | Miss, fetch from memory | | 2 | P1: Read X | BusRd | Miss, fetch from memory | | 3 | P0: Write X | BusRdX | Invalidate P1's copy | | 4 | P1: Write X | BusRdX | P0 flushes, P1 takes ownership | | 5 | P0: Read Y | BusRd | Miss, fetch from memory | | 6 | P1: Write Y | BusRdX | P0's copy invalidated | Count for MSI protocol: - Number of bus reads (BusRd): __ C36 __ (Hint: count read transactions) - Number of invalidations (BusRdX): __ C37 __ (Hint: count write transactions) - Number of writebacks: __ C38 __ (Hint: count M→S transitions) - [ ] Part 8: False Sharing ``` False Sharing Scenario: ┌───────────────────────────────────────────────────────┐ │ Cache Line (64 bytes) │ │ ┌─────────┬─────────┬────────────────────────────────┐│ │ │ balance │interest │ unused padding ││ │ │(P0 uses)│(P1 uses)│ ││ │ │ 4 bytes │ 4 bytes │ 56 bytes ││ │ └─────────┴─────────┴────────────────────────────────┘│ └───────────────────────────────────────────────────────┘ │ │ │ └─▶ P1 reads interest_rate └─▶ P0 writes balance (invalidates ENTIRE line!) ``` False sharing occurs when: - Two variables are in the __ C39 __ (Hint: same cache structure) - Different processors write to __ C40 __ (Hint: variable relationship) variables - The cache line __ C41 __ (Hint: action) back and forth between caches Given: ```c struct account { int balance; // Used exclusively by P0 int interest_rate; // Read by P1 }; ``` With `sizeof(int) = 4` and cache line size = 64 bytes: - Both variables fit in the same cache line because __ C42 __ (Hint: size comparison) When P0 writes `balance`, P1's cached copy of `interest_rate` is invalidated even though P0 didn't modify it. **Fix using padding:** ```c struct account_padded { int balance; char padding[__C43__]; // Hint: padding bytes to fill cache line int interest_rate; }; ``` The padding ensures each variable occupies a separate cache line, eliminating false sharing. - [ ] Part 9: Performance Analysis **AMAT Calculation:** Given: - L1 Hit time: 1 cycle - L1 Miss rate: 5% - Miss penalty: 100 cycles Basic AMAT = Hit_time + Miss_rate × Miss_penalty = __ C44 __ cycles (Hint: apply AMAT formula) **With L2 Cache:** Adding a 256KB L2 cache: - L2 Hit time: 10 cycles - L2 Miss rate: 20% (of L1 misses that reach L2) - Memory access: 100 cycles ``` Memory Hierarchy Latency: ┌──────────┬───────────┬───────────┐ │ L1 Cache │ L2 Cache │ Memory │ │ 1 cycle │ 10 cycles │100 cycles │ └──────────┴───────────┴───────────┘ 95% hit │ │ └─ 80% hit ──┘ │ └─ 20% miss to memory ``` New AMAT = L1_hit + L1_miss × (L2_hit + L2_miss × Memory) = 1 + 0.05 × (10 + 0.20 × 100) = __ C45 __ cycles (Hint: substitute and calculate) **With Coherence Overhead:** Due to MSI coherence traffic, average miss penalty increases by 20 cycles when contention occurs (50% of misses). 
Effective miss penalty = 100 + (0.5 × 20) = 110 cycles AMAT with coherence = 1 + 0.05 × (10 + 0.20 × 110) = __ C46 __ cycles (Hint: substitute and calculate) - [ ] Part 10: Cache Optimization for Matrix Operations ```c // Matrix multiplication with poor locality for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { for (int k = 0; k < N; k++) C[i][j] += A[i][k] * B[k][j]; // B[k][j] strides by row } } ``` ``` Memory Access Pattern for B[k][j] (Row-Major): ┌───────────────────────────────────────────────────────┐ │ k=0: B[0][j] ──▶ Addr: &B + j*4 │ │ k=1: B[1][j] ──▶ Addr: &B + N*4 + j*4 (stride = N*4) │ │ k=2: B[2][j] ──▶ Addr: &B + 2*N*4 + j*4 │ │ ... │ │ Each increment of k jumps N elements (one full row)! │ └───────────────────────────────────────────────────────┘ ``` The access to array __ C47 __ (Hint: A/B/C) has poor spatial locality because it accesses __ C48 __ (Hint: access pattern). With N=64 and 64-byte cache lines (16 integers/line): - Array A access pattern: Sequential within rows → good locality - Array B access pattern: Column-wise (stride = N×4 bytes) → poor locality **Optimized Loop Order (i, k, j):** ```c for (int i = 0; i < N; i++) { for (int k = 0; k < N; k++) { for (int j = 0; j < N; j++) C[i][j] += A[i][k] * B[k][j]; // B[k][j] now sequential in j } } ``` Reordering to (i, k, j) specifically fixes the poor spatial locality of array __ C49 __ (Hint: A/B/C) because its innermost loop index now matches the array's row-major storage order. ### ✅ Solution to Problem C: C01: ==256== C02: ==64== C03: ==6== C04: ==6== C05: ==20== C06: ==0x201== C07: ==0x002== C08: ==0x050== C09: ==0x12345050== C10: ==No== C11: ==Valid== C12: ==Accessed== C13: ==Dirty== C14: ==R or X== C15: ==0== C16: ==0x80402== C17: ==1== C18: ==global== C19: ==upper 20 bits== C20: ==Parallel== C21: ==0x10== C22: ==0x01== C23: ==0x12345== C24: ==only valid== C25: ==stale== C26: ==other== C27: ==S== C28: ==S== C29: ==M== C30: ==I== C31: ==S== C32: ==S== C33: ==only valid== C34: ==up-to-date== C35: ==without== C36: ==3== C37: ==3== C38: ==1== C39: ==same cache line== C40: ==different== C41: ==bounces== C42: ==both fit within 64 bytes== C43: ==60== C44: ==6== C45: ==2.5== C46: ==2.6== C47: ==B== C48: ==different rows== C49: ==B== #### Detailed Explanations **Part 1: Cache Organization** The L1 cache geometry determines how addresses are mapped: - 256 blocks total, organized into 64 sets with 4 blocks per set (4-way associativity) - Each address splits into: Tag (identifies block), Index (selects set), Offset (byte within block) - With 64 sets and 64-byte blocks: Index needs 6 bits, Offset needs 6 bits, leaving 20 bits for Tag **Part 2: Virtual Address Translation** Sv32 uses a two-level radix page table: - VPN[1] indexes the first-level table (1024 entries × 4 bytes = 4 KiB page) - VPN[0] indexes the second-level table - The 12-bit page offset passes through unchanged to the physical address For VA 0x80402050: - Binary: 1000 0000 01 | 00 0000 0010 | 0000 0101 0000 - VPN[1] = 0x201, VPN[0] = 0x002, Offset = 0x050 - PA = (PPN << 12) | Offset = 0x12345050 **Part 3: TLB-Cache Interaction** VIPT (Virtually Indexed, Physically Tagged) allows parallel TLB and cache access when: - Cache index bits fall within the page offset - Here: Index bits [11:6] are within page offset [11:0] - This avoids the "synonym problem" where different VAs map to same PA **Part 4-6: Cache Coherence** MSI protocol ensures consistency: - Modified: Exclusive write permission, must writeback before sharing - Shared: Read-only, multiple 
copies allowed - Invalid: No valid data MESI optimization: Exclusive state allows silent upgrade to Modified, reducing bus traffic when a cache line is accessed by only one processor. **Part 7: Coherence Traffic** Counting bus transactions: - BusRd: Initial read misses (P0 Read X, P1 Read X, P0 Read Y) = 3 - BusRdX: Write operations requiring invalidation/ownership = 3 - Writeback: When Modified line is requested by another (P0's X when P1 writes) = 1 **Part 8: False Sharing** False sharing is a performance anti-pattern: - Two logically independent variables share a cache line - Writes to one variable invalidate the other - Solution: Pad structures to cache line boundaries (60 bytes padding for a 4-byte int) **Part 9: AMAT Analysis** Memory hierarchy performance: - Basic AMAT: 1 + 0.05 × 100 = 6 cycles - With L2: Miss to L2 takes 10 cycles, 20% miss further to memory (100 cycles) - AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 1 + 0.05 × 30 = 2.5 cycles - Coherence adds 20 cycles to 50% of misses: 1 + 0.05 × 32 = 2.6 cycles **Part 10: Loop Optimization** Matrix multiplication locality analysis for original (i, j, k) order: - A[i][k]: Row-major, sequential in k → good spatial locality - B[k][j]: Column access (stride = N×sizeof(int)) → poor spatial locality - C[i][j]: Fixed within inner loop → reused in register Reordering to (i, k, j) specifically fixes B's poor locality: - B[k][j] now iterates over j in the innermost loop - With k fixed, B[k][0], B[k][1], B[k][2]... are consecutive in memory - This matches row-major storage, dramatically improving cache hit rate
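Finally, the AMAT numbers from Part 9 can be reproduced with a few lines of C (arithmetic only, using the parameters given above):

```c
#include <stdio.h>

/* Two-level AMAT: L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory). */
static double amat_l2(double l1_hit, double l1_miss,
                      double l2_hit, double l2_miss, double mem)
{
    return l1_hit + l1_miss * (l2_hit + l2_miss * mem);
}

int main(void)
{
    printf("Basic AMAT          = %.1f cycles\n", 1.0 + 0.05 * 100.0);               /* 6.0 */
    printf("AMAT with L2        = %.1f cycles\n", amat_l2(1, 0.05, 10, 0.20, 100));  /* 2.5 */
    printf("AMAT with coherence = %.1f cycles\n", amat_l2(1, 0.05, 10, 0.20, 110));  /* 2.6 */
    return 0;
}
```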