---
tags: computer-arch
---
# Quiz5 of Computer Architecture (2025 Fall)
> Solutions
## Problem `X`
(25 points) This course emphasizes interaction between the instructor and students. List the times during class (not including after class) when you discussed with the instructor, the topics you discussed, the insights you gained, and what additional preparation and experiments you conducted afterward. Write at least 50 words.
$\to$ Award credit if the word count is met and the described time slots are plausible. Note: record the times the student wrote down so they can be verified.
---
## Problem A: MMIO Device Control and Bitwise Packing
In embedded systems and computer architecture, processors communicate with peripherals (like a VGA display controller) using Memory-Mapped I/O (MMIO). Instead of special instructions, the processor uses standard Load and Store instructions to write to specific memory addresses that are mapped to hardware registers.
Consider a simple VGA controller mapped to base address `0x40000000`. To draw a pixel, software must write a single 32-bit word to this address. This word packs the pixel's coordinates (Row, Column) and its Color data.
System Diagram:
```text
                 System Bus
┌───────────┐                ┌──────────────────────────────┐
│  RISC-V   │ ──────►        │ VGA Controller (Slave)       │
│           │  sw/sh         │ Base Addr: 0x40000000        │
└───────────┘                └──────────────┬───────────────┘
                                            │ Decodes 32-bit Data
                                            ▼
 31         22 21          12 11      8 7       4 3       0
┌──────────────┬──────────────┬─────────┬─────────┬─────────┐
│   Row (Y)    │   Col (X)    │   Red   │  Green  │  Blue   │
│  (10 bits)   │  (10 bits)   │ (4 bits)│ (4 bits)│ (4 bits)│
└──────────────┴──────────────┴─────────┴─────────┴─────────┘
```
Field Specifications:
* Row (Y): 10 bits, placed in bits [31:22]. Range [0, 1023].
* Col (X): 10 bits, placed in bits [21:12]. Range [0, 1023].
* Red (R): 4 bits, placed in bits [11:8]. Range [0, 15].
* Green (G): 4 bits, placed in bits [7:4]. Range [0, 15].
* Blue (B): 4 bits, placed in bits [3:0]. Range [0, 15].
- [ ] Part 1: C Pointer Types for MMIO
When writing C code to interact with this hardware register, we must use a specific type qualifier to tell the compiler not to optimize away the memory access, as the "memory" could change or have side effects (like updating a screen) even if the program doesn't read it back.
The qualifier must ensure that the pointed-to object (the `uint32_t` value at the memory address) is volatile, not the pointer itself. This prevents the compiler from caching the value across multiple accesses.
In C, there are several ways to declare a pointer to a volatile object:
- `volatile uint32_t * ptr` means: pointer to a volatile `uint32_t`
- `uint32_t volatile * ptr` means: same as above (different syntax)
- `uint32_t * volatile ptr` means: volatile pointer to a non-volatile `uint32_t`
- `volatile uint32_t * volatile ptr` means: volatile pointer to a volatile `uint32_t`
The blank should be filled with a single keyword qualifier that makes the pointed-to object volatile.
Fill in the missing qualifier in the type cast below:
```c
#define VGA_CTRL_REG (*((__A01__ uint32_t *) 0x40000000))
```
This defines a macro that dereferences a pointer to a volatile `uint32_t` at the fixed memory address.
A01: _________________
- [ ] Part 2: Bitwise Packing in C
Implement the function `vga_draw_pixel` that takes coordinates and color components, packs them into the 32-bit format described above, and writes to the hardware.
Input Validation Assumption: For this part, assume all inputs are pre-validated to be within their valid ranges (`r`, `g`, `b` ∈ [0, 15] and `x`, `y` ∈ [0, 1023]). No overflow handling is needed—simply pack the values directly into their designated bit fields.
Recall the bit layout:
```
Bits [31:22] = Row (Y, 10 bits, so LSB of Y is at bit 22)
Bits [21:12] = Col (X, 10 bits, so LSB of X is at bit 12)
Bits [11:8] = Red (4 bits)
Bits [7:4] = Green (4 bits)
Bits [3:0] = Blue (4 bits)
```
```c
void vga_draw_pixel(uint32_t x, uint32_t y, uint32_t r, uint32_t g, uint32_t b) {
    uint32_t pixel_command = 0;
    // Pack Row (Y) into bits [31:22]
    // Because Y is a 10-bit field, its LSB position is bit 22
    pixel_command |= (y << __A02__);
    // Pack Col (X) into bits [21:12]
    // Because X is a 10-bit field, its LSB position is bit 12
    pixel_command |= (x << __A03__);
    // Pack Red, Green, Blue into bits [11:0]
    // Field placements: Red at [11:8], Green at [7:4], Blue at [3:0]
    pixel_command |= (r << 8) | (__A04__ << 4) | __A05__;
    // Perform the MMIO write
    VGA_CTRL_REG = pixel_command;
}
```
A02 (Shift amount for Y): _________________ (Hint: Y's 10-bit field has LSB at position ___)
A03 (Shift amount for X): _________________ (Hint: X's 10-bit field has LSB at position ___)
A04 (Green variable name): _________________ (Hint: Enter the variable name for the green component—which input parameter holds the green value?)
A05 (Blue variable name): _________________ (Hint: Enter the variable name for the blue component—which input parameter holds the blue value?)
- [ ] Part 3: RISC-V Assembly Implementation
You are implementing the same logic in RISC-V assembly.
* `a0` holds `x`
* `a1` holds `y`
* `a2` holds the packed 12-bit color (0xRGB) where R is bits `[11:8]`, G is `[7:4]`, B is `[3:0]`.
* The base address `0x40000000` is currently in register `s0`.
You need to pack `x` and `y` into their high positions and combine them with the color in `a2`. The RISC-V `sw` instruction format is: `sw rs2, offset(rs1)`, where `rs2` is the source register (data to write) and `rs1` is the base address register.
Note on S-Format Immediate Encoding: In RISC-V, the S-type instruction (including `sw`) encodes a 12-bit signed immediate offset. This immediate is split across two fields in the instruction:
```
S-Format Instruction Layout:
 31     25 24      20 19      15 14     12 11       7 6       0
┌──────────┬──────────┬──────────┬─────────┬──────────┬─────────┐
│ imm[11:5]│   rs2    │   rs1    │ funct3  │ imm[4:0] │ opcode  │
│ (7 bits) │ (5 bits) │ (5 bits) │ (3 bits)│ (5 bits) │ (7 bits)│
└──────────┴──────────┴──────────┴─────────┴──────────┴─────────┘
Immediate assembly: imm[11:0] = { imm[11:5] (bits 31-25), imm[4:0] (bits 11-7) }
```
Example: For `sw t2, 0(s0)`, the offset is 0, so both `imm[11:5]` and `imm[4:0]` are zero.
For `sw t2, 4(s0)`, the offset is 4: `imm[11:5]=0`, `imm[4:0]=4`.
```assembly
# a0 = x (needs to move to [21:12])
# a1 = y (needs to move to [31:22])
# a2 = color (already at [11:0])
# s0 = 0x40000000 (VGA base address)
slli t0, a1, __A06__ # Shift Y to position
slli t1, a0, __A07__ # Shift X to position
or t2, t0, t1 # Combine X and Y
or t2, t2, a2 # Combine with Color
sw __A08__, 0(__A09__) # Write packed pixel to MMIO (format: sw rs2, offset(rs1))
```
A06 (Immediate shift for Y): _________________ (Note: In RISC-V, the immediate shift amount in `slli` must be in range `[0, 31]`)
A07 (Immediate shift for X): _________________ (Note: In RISC-V, the immediate shift amount in `slli` must be in range `[0, 31]`)
A08 (Source register for store, rs2 containing packed pixel): _________________
A09 (Base address register for store, rs1 containing `0x40000000`): _________________
- [ ] Part 4: Endianness Challenge
Scenario: Your RISC-V system is little-endian, but the VGA controller was designed by a team using big-endian conventions. The hardware expects bytes in big-endian order.
Key Principle: When a little-endian CPU writes to big-endian hardware, a mismatch occurs because the byte order in memory is reversed. Solution: Software must pre-swap bytes before writing to compensate for the endianness mismatch.
Currently, your code packs the pixel as:
```c
uint32_t pixel_command = (y << 22) | (x << 12) | (r << 8) | (g << 4) | b;
VGA_CTRL_REG = pixel_command;
```
Example: With `x=0, y=80, r=10, g=12, b=12`, you produce `0x14000ACC`.
Three-layer analysis (detailed explanation):
Layer 1: CPU Register Value (What the CPU computed)
```
Register value (register interpretation): 0x14000ACC (unchanged)
This is what lives in a CPU register.
```
Layer 2: Little-Endian Memory Layout (How CPU writes to memory)
Your little-endian CPU stores this 32-bit value in memory byte-by-byte:
```
Memory[0x40000000]: 0xCC (bits [7:0])
Memory[0x40000001]: 0x0A (bits [15:8])
Memory[0x40000002]: 0x00 (bits [23:16])
Memory[0x40000003]: 0x14 (bits [31:24])
```
Layer 3: Big-Endian Hardware Reads Memory
The big-endian hardware reads these same 4 bytes sequentially from the starting address and assembles them by treating the first byte as the MSB. Consider what happens when a little-endian CPU writes to memory and a big-endian device then reads the same addresses.
The Challenge: A byte-order mismatch occurs because the CPU and hardware interpret byte order differently.
To make the hardware read the correct value, the CPU must handle the byte-order difference before writing to the hardware register.
What function call would you add to the packing code to convert from little-endian to big-endian before writing?
```c
uint32_t pixel_command = (y << 22) | (x << 12) | (r << 8) | (g << 4) | b;
pixel_command = __A10__(pixel_command); // <-- Insert endian conversion
VGA_CTRL_REG = pixel_command;
```
A10: _________________ (function name)
After the conversion, if you trace through memory again with the same example (`x=0, y=80, r=10, g=12, b=12`), fill in all three parts:
Part 1 - Swapped value (after `__builtin_bswap32()`)
```
Swapped value = 0x__ __ __ __
```
Part 2 - Memory layout (how little-endian CPU writes the swapped value)
```
Memory[0x40000000] = 0x__
Memory[0x40000001] = 0x__
Memory[0x40000002] = 0x__
Memory[0x40000003] = 0x__
```
Part 3 - Hardware interpretation (big-endian reading of memory bytes sequentially)
```
Hardware interprets = 0x__ __ __ __ (should match original 0x14000ACC ✓)
```
A11: _________________
Questions A12/A13 (Alternative): If you cannot use a built-in function, use the assembly equivalent below. Given that `a0` contains the pixel command, manually byte-swap by extracting each byte and shifting it to its opposite position.
Example: For input `0x14000ACC`:
```
Byte 3 (MSB) = 0x14 → shift to byte 0 (LSB)
Byte 2 = 0x00 → shift to byte 1
Byte 1 = 0x0A → shift to byte 2
Byte 0 (LSB) = 0xCC → shift to byte 3 (MSB)
Result: 0xCC0A0014
```
```assembly
# Input: a0 = pixel_command (e.g., 0x14000ACC)
# Output: a0 = byte-swapped result (e.g., 0xCC0A0014)
# Use t0, t1, t2, t3 as temporary registers
# Extract each byte and shift to opposite position
slli t0, a0, 24 # t0 = (a0 << 24): byte 0 moves to byte 3 position
# 0xCC -> 0xCC000000
srli t1, a0, 8 # t1 = (a0 >> 8): byte 1 into LSB position
# For 0x14000ACC: (0x14000ACC >> 8) = 0x0014000A, so byte 1 = 0x0A at LSB
andi t1, t1, 0xFF # Mask to isolate byte 1 in LSB position (bits [7:0])
# This zeros out all bits above byte 1, leaving 0x0A in t1
slli t1, t1, 16 # Shift byte 1 to target position (bits [23:16])
# Result in t1: 0x0A0000
srli t2, a0, 16 # t2 = (a0 >> 16): extract upper 16 bits
# For 0x14000ACC: (0x14000ACC >> 16) = 0x00001400, byte 2 = 0x00
andi t2, t2, 0xFF # Mask to isolate byte 2 in LSB position
slli t2, t2, 8 # Shift byte 2 to target position (bits [15:8])
# Result in t2: 0x0000 (byte 2 contributes 0x00 << 8)
srli t3, a0, 24 # t3 = (a0 >> 24): byte 3 into LSB position
# 0x00000014
or t4, t0, t1 # Combine bytes 0 and 1: t4 = t0 | t1
or t4, t4, t2 # Add byte 2: t4 = t4 | t2
or a0, __A12__, __A13__ # Add byte 3: combine for the final result
```
In the final `or` instruction above, fill in the blanks:
A12: _________________ (intermediate register: result after `or t4, t4, t2`)
A13: _________________ (final byte register: the last value to combine with OR)
- [ ] Part 5: Register Overflow Validation
Problem: The register fields are defined as:
* Y: 10 bits (valid range [0, 1023])
* X: 10 bits (valid range [0, 1023])
* R, G, B: 4 bits each (valid range [0, 15])
If someone calls `vga_draw_pixel(2048, 500, 12, 12, 12)`, the value `x=2048` overflows the 10-bit field. The extra bits [31:22] would be corrupted.
Question A14: In RISC-V assembly, write a single instruction that checks if `x` (in register `a0`) is valid. This instruction should set `t0 = 1` if valid, and `t0 = 0` if `x > 1023`.
Important: Coordinates are non-negative by definition (range [0, 1023]), so we must use an unsigned comparison to reject any value ≥ 1024.
Which instruction below is correct?
a. sltiu t0, a0, 1024
b. slti t0, a0, 1024
c. sltu t0, a0, t1
d. bne a0, t0, error
A14: _________________ (letter: a, b, c, or d)
After using the instruction from A14, what branch instruction would you use to handle the error case?
In RISC-V, the instruction `beq rs1, rs2, target` executes: if rs1 == rs2, then jump to target.
Similarly, `bne rs1, rs2, target` executes: if rs1 != rs2, then jump to target.
```assembly
sltiu t0, a0, 1024 # t0 = (a0 < 1024) ? 1 : 0
__A15__ t0, zero, error # Branch when appropriate
```
A15: _________________ (instruction name)
Write a complete assembly snippet that validates all three coordinate/color inputs before packing:
Pattern to Recognize:
Each validation follows this pattern:
1. Use `sltiu` to set a temporary register to 1 (valid) or 0 (invalid)
2. Use a branch-if-equal-to-zero instruction to jump to error when the register equals zero (invalid)
3. The branch instruction condition must be: "if register == zero, jump to error"
Critical Requirement: For A16, A17, and A18, you must use the instruction that branches when the register equals zero. This is the ONLY correct pattern. Any other branch instruction (like `bne`) produces the opposite behavior and is incorrect.
```assembly
# Validate x (a0 < 1024), y (a1 < 1024), color bits (a2 < 16)
# If any are invalid, jump to 'error' label
# Check x coordinate
sltiu t0, a0, 1024
__A16__ t0, zero, error # Jump if t0 == 0 (i.e., x is invalid)
# Check y coordinate
sltiu t1, a1, 1024
__A17__ t1, zero, error # Jump if t1 == 0 (i.e., y is invalid)
# Check color bits
sltiu t2, a2, 16
__A18__ t2, zero, error # Jump if t2 == 0 (i.e., color is invalid)
# All valid; continue to packing...
```
A16: _________________ (instruction name)
A17: _________________ (instruction name)
A18: _________________ (instruction name)
Hint: All three answers should be the same instruction, since they all follow the same pattern: "branch when the result register == zero". Consider the difference between branching when a condition is true vs. false, and whether your choice correctly handles the invalid case.
- [ ] Part 6: Fixed-Point Arithmetic
Background: The graphics system lacks an FPU. To perform coordinate transformations (rotation, scaling), you must use Q16.16 fixed-point arithmetic—32 bits with 16 integer bits and 16 fractional bits.
Q16.16 Format:
* Decimal 1.5 → `0x00018000`
* Decimal 2.0 → `0x00020000`
* Multiplication of two Q16.16 numbers requires 64-bit intermediate to preserve fractional bits
Key Challenge: To rotate a screen coordinate by angle θ, you compute:
```c
int32_t cos_fixed = fx_cos(theta);               // Q16.16
int32_t x_fixed = x << 16;                       // Convert x to Q16.16
int64_t product = (int64_t)cos_fixed * x_fixed;  // 64-bit multiplication
int32_t x_rotated = (int32_t)(product >> 16);    // Back to Q16.16
int32_t x_screen = (x_rotated + 0x8000) >> 16;   // Rounding conversion to integer
```
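To see the recipe in action, here is a minimal host-side sketch (hypothetical values, plain C) that multiplies 1.5 by 2.0 in Q16.16 and recovers 3.0:
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t a = 0x00018000;                     // 1.5 in Q16.16
    int32_t b = 0x00020000;                     // 2.0 in Q16.16
    int64_t product = (int64_t)a * b;           // 64-bit intermediate (Q32.32)
    int32_t result = (int32_t)(product >> 16);  // back to Q16.16
    printf("0x%08" PRIX32 "\n", (uint32_t)result);  // prints 0x00030000 (= 3.0)
    return 0;
}
```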
When multiplying two Q16.16 fixed-point numbers (each 32-bit signed integers), the product temporarily requires more bits. Why is the cast `(int64_t)` necessary in the multiplication `(int64_t)cos_fixed * x_fixed`? __ A19 __
a. The compiler requires explicit type hints for mathematical operations
b. Without it, the 32×32 multiplication would overflow, losing the upper bits needed for the fractional part
c. It forces the result into a signed type, preventing arithmetic errors
d. It reserves memory on the stack for the larger intermediate value
A19 = ?
You have computed a 64-bit signed product from multiplying two Q16.16 numbers. Now you need to right-shift by 16 to extract the final Q16.16 result. Since the values are signed (they can represent negative coordinates or results), you must ensure the sign bit is propagated correctly into the upper bits during the shift. Which shift instruction is correct? __ A20 __
a. `srai` (arithmetic right shift) — shifts bits and replicates the sign bit (MSB) into the vacating upper bits; preserves the sign of the number
b. `srli` (logical right shift) — shifts bits and fills vacating upper bits with zeros; this would corrupt negative numbers
c. `slli` (logical left shift) — moves bits in the wrong direction
d. Both `srai` and `srli` work equally well for signed numbers
A20 = ?
Critical Distinction:
- For signed Q16.16 fixed-point values stored in a two's-complement integer, shifting right by 16 is logically a division by 2^16. You must preserve the sign bit for negative values. That means using an arithmetic right shift (`srai`) on the 64-bit product (or its low 32-bit result).
- Using `srli` (logical right shift) would zero-fill the high bits and corrupt negative results, even though the raw bit pattern of the fractional part might look "preserved".
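A small C illustration of the difference: on mainstream compilers, `>>` on a signed negative value is an arithmetic shift (matching `srai`), while `>>` on an unsigned value is a logical shift (matching `srli`). The value below is a hypothetical example:
```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t fx = (int32_t)0xFFFD0000;     // -3.0 in Q16.16 (two's complement)

    int32_t  arith = fx >> 16;            // srai-style: sign bit replicated
    uint32_t logic = (uint32_t)fx >> 16;  // srli-style: zero-filled high bits

    printf("arithmetic: %d\n", (int)arith);       // -3 (correct)
    printf("logical:    %u\n", (unsigned)logic);  // 65533 (sign destroyed)
    return 0;
}
```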
In Q16.16 format, to round a fixed-point value to the nearest integer (rather than truncating), you add a bias before the right-shift. Consider:
```c
int32_t fx_to_int_rounded(int32_t fixed_val) {
    // fixed_val is in Q16.16 format
    // Goal: round to nearest integer, not truncate
    return (fixed_val + __A21__) >> 16;
}
```
What hexadecimal constant replaces `__A21__` to implement rounding? __ A21 __
a. `0x00008000` (32768 decimal)
b. `0x00010000` (65536 decimal)
c. `0x0000FFFF` (65535 decimal)
d. `0x00007FFF` (32767 decimal)
A21 = ?
### ✅ Solution to Problem A:
A01: ==volatile==
A02: ==22==
A03: ==12==
A04: ==g==
A05: ==b==
A06: ==22==
A07: ==12==
A08: ==t2==
A09: ==s0==
A10: ==__builtin_bswap32==
A11:
Part 1 - Swapped value:
Swapped value = ==0xCC0A0014==
Part 2 - Memory layout:
Memory[0x40000000] = ==0x14==
Memory[0x40000001] = ==0x00==
Memory[0x40000002] = ==0x0A==
Memory[0x40000003] = ==0xCC==
Part 3 - Hardware interpretation:
Hardware interprets = ==0x14000ACC==
A12: ==t4==
A13: ==t3==
A14: ==A==
A15: ==beq==
A16: ==beq==
A17: ==beq==
A18: ==beq==
A19: ==B== (Reason: `(int64_t)` cast prevents 32×32 overflow)
A20: ==A== (Reason: `srai` preserves the sign bit for signed fixed-point values)
A21: ==A== (Reason: `0x8000` is the rounding bias for Q16.16)
Detailed explanations:
#### Part 1: Volatile Keyword (A01)
The `volatile` keyword is essential for MMIO because:
1. It tells the compiler the memory location has side effects (hardware interaction)
2. Prevents optimization that would elide "redundant" writes
3. Ensures every assignment to `VGA_CTRL_REG` generates an actual memory store instruction
4. Preserves memory access ordering (no reordering of volatile accesses)
Without `volatile`, the compiler might recognize you're writing the same address repeatedly and skip intermediate writes, causing visual glitches or missed updates.
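As a concrete illustration, here is a hedged sketch (the macro is the one from Part 1; the loop is a hypothetical caller). With `volatile`, the compiler must emit one store per iteration; without it, it could legally collapse the loop into a single store of the last value:
```c
#include <stdint.h>

#define VGA_CTRL_REG (*((volatile uint32_t *) 0x40000000))

// Draw one horizontal line of pixels. Because the target is volatile,
// every iteration produces a real store instruction, so the hardware
// sees all 64 writes, not just the final one.
void fill_line(uint32_t row_cmd) {
    for (uint32_t x = 0; x < 64; x++)
        VGA_CTRL_REG = row_cmd | (x << 12);  // Col field at bits [21:12]
}
```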
#### Part 2: Bitwise Packing (A02-A05)
Memory Layout:
```
Bit positions: [31:22] Y | [21:12] X | [11:8] R | [7:4] G | [3:0] B
Input values: y (10b) | x (10b) | r (4b) | g (4b) | b (4b)
```
Shift Amounts:
- Y: Input in bits [9:0] → Target bits [31:22] → Shift left by 22
- X: Input in bits [9:0] → Target bits [21:12] → Shift left by 12
- R: Input in bits [3:0] → Target bits [11:8] → Shift left by 8
- G: Input in bits [3:0] → Target bits [7:4] → Shift left by 4
- B: Input in bits [3:0] → Target bits [3:0] → No shift needed (already in place)
Verification Example: `x=0, y=80, r=10, g=12, b=12`
```
(80 << 22) = 0x14000000
(0 << 12)  = 0x00000000
(10 << 8)  = 0x00000A00
(12 << 4)  = 0x000000C0
(12)       = 0x0000000C
             ──────────
             0x14000ACC ✓
```
#### Part 3: Assembly Implementation (A06-A09)
The assembly code mirrors the C implementation:
```
slli t0, a1, 22 # Pack Y: shift register a1 left by 22 bits
slli t1, a0, 12 # Pack X: shift register a0 left by 12 bits
or t2, t0, t1 # Combine Y and X using bitwise OR
or t2, t2, a2 # Combine result with color (already in [11:0])
sw t2, 0(s0) # Store Word: write packed value to memory address s0+0
```
Key differences from C:
- C uses implicit OR operations (`|=`)
- Assembly uses explicit `or` instructions
- Both achieve same result: all bit fields combined without overlap
#### Part 4: Endianness Challenge (A10-A13)
The Problem:
- Your CPU is little-endian (RISC-V default)
- VGA hardware is big-endian (designed with opposite byte order)
- When you write `0x14000ACC` on LE CPU, bytes appear in memory as: `[0xCC, 0x0A, 0x00, 0x14]`
- Hardware reads them in opposite order and interprets as `0xCC0A0014` ✗ WRONG!
Solution: Byte-Swap
```c
uint32_t pixel_command = (y << 22) | (x << 12) | (r << 8) | (g << 4) | b;
pixel_command = __builtin_bswap32(pixel_command); // Swap bytes for BE hardware
VGA_CTRL_REG = pixel_command;
```
Trace Example: `x=0, y=80, r=10, g=12, b=12`
Before swap:
```
Value: 0x14000ACC
Bytes: [0x14] [0x00] [0x0A] [0xCC]
```
After `__builtin_bswap32()`:
```
Value: 0xCC0A0014
Bytes: [0xCC] [0x0A] [0x00] [0x14]
```
When LE CPU writes `0xCC0A0014` to memory:
```
Memory[0x00] = 0x14 (LSB of 0xCC0A0014)
Memory[0x01] = 0x00
Memory[0x02] = 0x0A
Memory[0x03] = 0xCC (MSB of 0xCC0A0014)
```
BE hardware reads memory in order `[0x00], [0x01], [0x02], [0x03]` and interprets as:
```
Value: 0x14000ACC ✓ CORRECT!
```
Manual Byte-Swap Implementation (A12):
If `__builtin_bswap32` is unavailable, manually extract and shift each byte. The assembly approach shown above uses sequential OR instructions to combine bytes:
```c
// C equivalent of the assembly approach:
uint32_t bswap32_manual(uint32_t x) {
    return ((x & 0xFF000000) >> 24) |  // Byte 3 (MSB) → byte 0 (LSB)
           ((x & 0x00FF0000) >> 8)  |  // Byte 2 → byte 1
           ((x & 0x0000FF00) << 8)  |  // Byte 1 → byte 2
           ((x & 0x000000FF) << 24);   // Byte 0 (LSB) → byte 3 (MSB)
}
```
The RISC-V assembly implementation (shown in the problem above) follows the same logic: extract each byte and shift it to its opposite position, then combine all bytes using OR operations. The key steps are:
1. Shift byte 0 left by 24 bits (move to byte 3 position [31:24])
2. Extract byte 1 (via right shift by 8), mask it, then shift left by 16 bits (move to byte 2 position [23:16])
3. Extract byte 2 (via right shift by 16), mask it, then shift left by 8 bits (move to byte 1 position [15:8])
4. Extract byte 3 (via right shift by 24, result already in byte 0 position [7:0])
5. Combine all four shifted bytes using OR operations
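A quick way to gain confidence in the manual version is to compare it against the builtin on a host machine. This is a throwaway test harness reusing `bswap32_manual` from above (note that `__builtin_bswap32` is a GCC/Clang extension):
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t bswap32_manual(uint32_t x) {
    return ((x & 0xFF000000u) >> 24) |  // byte 3 -> byte 0
           ((x & 0x00FF0000u) >>  8) |  // byte 2 -> byte 1
           ((x & 0x0000FF00u) <<  8) |  // byte 1 -> byte 2
           ((x & 0x000000FFu) << 24);   // byte 0 -> byte 3
}

int main(void) {
    uint32_t v = 0x14000ACCu;
    printf("manual:  0x%08" PRIX32 "\n", bswap32_manual(v));     // 0xCC0A0014
    printf("builtin: 0x%08" PRIX32 "\n", __builtin_bswap32(v));  // 0xCC0A0014
    return 0;
}
```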
#### Part 5: Register Overflow Validation (A14-A18)
A14: Which Instruction is Correct?
```
a. sltiu t0, a0, 1024  (unsigned immediate compare)
b. slti  t0, a0, 1024  (signed compare; treats negative values differently)
c. sltu  t0, a0, t1    (uses a register operand, not an immediate; t1 is never set up)
d. bne   a0, t0, error (compares registers, doesn't validate bounds)
```
`sltiu` (Set Less Than Immediate Unsigned) sets `t0 = 1` if `a0 < 1024`, else `t0 = 0`.
- For valid input `a0=500`: `500 < 1024` → `t0 = 1` ✓
- For invalid input `a0=2048`: `2048 < 1024` → `t0 = 0` ✓
A15: Branch Instruction for Error Handling
After `sltiu t0, a0, 1024`, we have:
- `t0 = 1` if valid (a0 < 1024)
- `t0 = 0` if invalid (a0 >= 1024)
To branch on error (when a0 >= 1024), we check if `t0 == 0`:
```assembly
__ A15 __ t0, zero, error # Branch if Equal to zero (invalid)
```
A16: Full Validation Snippet
The same pattern repeats for each input:
```assembly
# Check x (a0 < 1024)
sltiu t0, a0, 1024
beq t0, zero, error # A16: beq
# Check y (a1 < 1024)
sltiu t1, a1, 1024
beq t1, zero, error # A17: beq
# Check color (a2 < 16)
sltiu t2, a2, 16
beq t2, zero, error # A18: beq
# All valid; continue to packing code...
```
Answers: A16 = beq, A17 = beq, A18 = beq
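#### Part 6: Fixed-Point Arithmetic (A19-A21)
A19 = B: a 32×32 multiply kept in 32 bits discards the upper half of the 64-bit product, which carries the integer part; the `(int64_t)` cast widens the operands first.
A20 = A: `srai` replicates the sign bit, so negative Q16.16 results survive the `>> 16`; `srli` would zero-fill and corrupt them.
A21 = A: `0x8000` is 0.5 in Q16.16, so adding it before truncating implements round-to-nearest. A minimal demo (hypothetical test values; note the bias add can overflow for inputs near `INT32_MAX`):
```c
#include <stdint.h>
#include <stdio.h>

static int32_t fx_to_int_rounded(int32_t fixed_val) {
    return (fixed_val + 0x8000) >> 16;  // add 0.5, then truncate
}

int main(void) {
    printf("%d\n", fx_to_int_rounded(0x00028000));  // 2.5   -> 3
    printf("%d\n", fx_to_int_rounded(0x00023000));  // ~2.19 -> 2
    printf("%d\n", fx_to_int_rounded(0x00018000));  // 1.5   -> 2
    return 0;
}
```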
---
## Problem B: Single-Cycle Datapath and Memory Access Control (VGA Sync Logic)
Building on the VGA Controller from Problem A, we now implement a mechanism to prevent "screen tearing". The software must wait for the Vertical Sync (V-Sync) signal before writing the pixel data.
The V-Sync status is located at bit 0 of the `VGA_STATUS_REG` (Memory Address `0x40000004`).
C Code Snippet:
```c
// x11 holds address of VGA_STATUS_REG (0x40000004)
// x12 holds address of VGA_CTRL_REG (0x40000000)
// x10 holds the packed pixel command

// Polling loop: wait until V-Sync bit (bit 0) is 1
while ((vga_status_ptr[0] & 1) == 0) {
    // Busy wait
}
vga_ctrl_ptr[0] = pixel_cmd;
```
RISC-V assembly implementation:
```text
Loop:
I1: lw x5, 0(x11) # Read Status Register
I2: andi x5, x5, 1 # Mask bit 0
I3: beq x5, x0, Loop # If (x5 == 0) goto Loop
I4: sw x10, 0(x12) # Write Pixel Command to Control Register
```
Datapath Reference:
Use the simplified Single-Cycle Datapath definitions:
* PCSel: 0 = PC+4, 1 = ALU Result (Branch Target)
* RegWEn: 1 = Write to RD, 0 = No Write
* ImmSel: I, S, B, J, U
* BrUn: 1 = Unsigned, 0 = Signed
* BSel (ALU Src B): 0 = rs2, 1 = Immediate
* MemRW: 1 = Write Memory, 0 = Read/Idle
* WBSel Encoding:
- 0 = ALU Result (default path for ALU operations)
- 1 = Memory Data (for load instructions)
- 2 = PC+4 (for link/return instructions)
- [ ] Part 1: Encoding I4 `sw x10, 0(x12)` (S-Type)
The Store instruction format (S-Type) splits the immediate value.
For `sw rs2, offset(rs1)`:
* `rs2` (Source Data) = x10 (register number 10 in decimal = 01010₂ in binary)
* `rs1` (Base Address) = x12 (register number 12 in decimal = 01100₂ in binary)
* `offset` = 0
```text
 31      25 24    20 19    15 14    12 11     7 6       0
┌───────────┬────────┬────────┬────────┬────────┬─────────┐
│ imm[11:5] │  rs2   │  rs1   │ funct3 │imm[4:0]│ opcode  │
└───────────┴────────┴────────┴────────┴────────┴─────────┘
    (7b)      (5b)     (5b)     (3b)     (5b)      (7b)
```
Bit Width Note: The immediate field is split across two locations:
- imm[11:5]: 7 bits (bits 31:25 in instruction)
- imm[4:0]: 5 bits (bits 11:7 in instruction)
Together they form the complete 12-bit immediate value.
Fill in the following fields for the instruction `sw x10, 0(x12)`:
Note on the Offset: Since the offset is exactly 0, all 12 bits of the immediate are zero, so both split fields are encoded as zeros in the instruction:
* `imm[11:5] = 0000000` (7 bits of zeros)
* `imm[4:0] = 00000` (5 bits of zeros)
Fill in the following:
* opcode (binary, 7 bits): __ B01 __
* funct3 (binary, 3 bits): __ B02 __
* rs2 (Source Data Register) = x10 = register 10 (decimal), field value: __ B03 __ (binary)
* rs1 (Base Address Register) = x12 = register 12 (decimal), field value: __ B04 __ (binary)
* imm[4:0] (Lower 5 bits of offset) (binary, 5 bits): __ B05 __ (These are the lower 5 bits of the split immediate; since offset=0, this is 00000)
- [ ] Part 2: Control Signals (Load vs. Store)
Compare the control signals for I1 (`lw`) and I4 (`sw`). Both access memory, but their data flow is opposite.
Key Insight: Both load and store instructions need to calculate a memory address: `Address = rs1 + immediate offset`. To compute this address, the ALU must receive the immediate as input B. Therefore:
- For load (`lw`): `BSel = 1` to add the offset to `rs1`
- For store (`sw`): `BSel = 1` to add the offset to `rs1`
The difference is in what happens after the address is calculated: loads read from memory (and write to register), while stores write to memory (and don't write to register).
| Signal | I1: `lw x5, 0(x11)` | I4: `sw x10, 0(x12)` |
| :--- | :---: | :---: |
| RegWEn | __ B06 __ | __ B07 __ |
| MemRW | 0 | __ B08 __ |
| WBSel | __ B09 __ | X (Don't Care) |
| BSel | __ B10 __ | 1 |
* Fill in B06, B07, B08 (0 or 1).
* Fill in B09 (0, 1, or 2). WBSel encoding: 0 = ALU result, 1 = Memory read data, 2 = PC+4
* Fill in B10 (0 or 1). Hint: Store instructions always use the immediate offset for address calculation, just like load instructions.
Important Note on WBSel for Stores: The WBSel field in a store instruction is marked as "don't care" (X) because store instructions do not write back to the register file. Since there is no destination register (stores modify memory, not registers), the value of WBSel is irrelevant—it will never be used because RegWEn = 0 prevents any register write operation.
- [ ] Part 3: Control Signals for Branching (I3)
Analyze instruction I3: `beq x5, x0, Loop`.
Assume the branch is taken (i.e., `x5` is indeed 0).
* ALU Operation: To compare `x5` and `x0` to check for equality, the ALU performs a subtraction.
The ALU computes (rs1 - rs2) to generate the zero flag. This subtraction approach is used because:
- If `x5 == x0`, then `x5 - x0 = 0` → zero flag is set → branch is taken
- If `x5 != x0`, then `x5 - x0 ≠ 0` → zero flag is not set → branch is not taken
Therefore, the ALU control unit must select subtraction to perform the comparison.
ALUSel = __ B11 __ (add / sub / and).
Critical Note on Branch Comparisons: Branch comparisons never use the AND operation for equality testing. The AND operation cannot reliably detect equality:
- `AND(0x0000, 0x0001) = 0` ✗ (misleading: result is zero, but values are not equal)
- `AND(0xFFFF, 0xFFFF) = 0xFFFF` ✗ (result is non-zero, but values ARE equal)
Subtraction is the only correct choice for testing equality in branch instructions. It produces a zero flag that directly and correctly indicates equality.
* PCSel: Since the branch condition is true (Taken), the next PC comes from the branch target calculation.
PCSel = __ B12 __ (0 / 1).
* ImmSel: Which immediate format does `beq` use to encode the offset?
ImmSel = __ B13 __ (I / S / B / J).
### ✅ Solution to Problem B:
Part 1: S-Type Encoding for `sw x10, 0(x12)`
B01: ==0100011== (Standard Store Opcode)
B02: ==010== (Width: Word = 32-bit)
B03: ==01010== (Register x10 in binary; register number 10₁₀ = 01010₂; holds the data to store)
B04: ==01100== (Register x12 in binary; register number 12₁₀ = 01100₂; holds the base address)
B05: ==00000== (Lower 5 bits of offset 0)
Part 2: Load vs. Store Control Signals
B06: ==1== (Load writes to register)
B07: ==0== (Store does not write to register file)
B08: ==1== (Enable Memory Write)
B09: ==1== (Select Memory Data for Writeback)
For `lw`, WBSel = 1 selects memory data for writeback to register x5.
For `sw`, WBSel is don't care (X) because RegWEn = 0 and no register write occurs.
B10: ==1== (ALU Source B is Immediate offset)
Part 3: Branching Control Signals for `beq x5, x0, Loop`
B11: ==sub== (Equality check is done via subtraction)
B12: ==1== (Select Branch Target)
B13: ==B== (Branch-type Immediate)
Detailed Explanations:
#### Part 1: S-Type Encoding
The `sw` (Store Word) instruction encodes the offset in split immediate fields:
* `imm[11:5]` goes into bits [31:25]
* `imm[4:0]` goes into bits [11:7]
Critical Point for offset = 0:
Since the offset is exactly 0, we have `imm[11:0] = 0x000` (all 12 bits are zero).
Therefore:
- `imm[11:5]` = 0000000 (7 bits of zeros)
- `imm[4:0]` = 00000 (5 bits of zeros)
This means both fields in the instruction encoding must be all zeros. Students often mistakenly treat the immediate bits as register numbers, so be careful to fill in 00000 for the `imm[4:0]` field, not a register number.
The opcode `0100011` identifies this as a Store-class instruction.
The `funct3 = 010` specifies 32-bit (word) access.
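As a cross-check, here is a small C sketch (a hypothetical helper, not part of the course code) that assembles the S-type fields and prints the resulting word:
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Pack the fields of an S-type instruction into a 32-bit word.
static uint32_t enc_stype(uint32_t imm, uint32_t rs2, uint32_t rs1,
                          uint32_t funct3, uint32_t opcode) {
    return (((imm >> 5) & 0x7F) << 25) |  // imm[11:5] -> bits [31:25]
           ((rs2 & 0x1F)        << 20) |  // rs2       -> bits [24:20]
           ((rs1 & 0x1F)        << 15) |  // rs1       -> bits [19:15]
           ((funct3 & 0x7)      << 12) |  // funct3    -> bits [14:12]
           ((imm & 0x1F)        <<  7) |  // imm[4:0]  -> bits [11:7]
           (opcode & 0x7F);               // opcode    -> bits [6:0]
}

int main(void) {
    // sw x10, 0(x12): opcode=0100011, funct3=010, rs1=12, rs2=10, imm=0
    printf("0x%08" PRIX32 "\n", enc_stype(0, 10, 12, 2, 0x23));  // 0x00A62023
    return 0;
}
```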
#### Part 2: Memory Access Control
Key Principle: Both load and store instructions compute the effective address as `Address = rs1 + immediate offset`. This computation requires the ALU to add `rs1` (from the register file, which appears on ALU input A) and the immediate offset (which must appear on ALU input B). Therefore, both load and store use `BSel = 1` to route the immediate to ALU input B.
Load (`lw`) Control Signals:
* RegWEn = 1: The loaded data must be written to the destination register (x5)
* MemRW = 0: Read mode; no memory write
* WBSel = 1: Select loaded data from memory for writeback
* BSel = 1: Use the immediate offset for address calculation (`Address = x11 + 0`)
Store (`sw`) Control Signals:
* RegWEn = 0: Store does not modify registers; no writeback needed
* MemRW = 1: Enable memory write; data (x10) flows to memory
* WBSel = X: Don't care; no register is written (store has no destination register)
* BSel = 1: Use the immediate offset for address calculation (`Address = x12 + 0`)
Why `BSel = 1` for Store? The store instruction still needs to calculate the memory address by adding `rs1 + immediate`. Without the immediate on ALU input B, the address calculation would be impossible. The only difference from load is that after the address is calculated, the store operation writes to memory (MemRW=1) instead of reading from it (MemRW=0).
#### Part 3: Branch Control Logic
For `beq x5, x0, Loop`:
1. ALU Subtraction (ALUSel = `sub`): The control unit generates ALUSel = `sub` to compute `x5 - x0` (i.e., rs1 - rs2).
Why subtraction for equality testing?
- If `x5 == x0`, then `x5 - x0 = 0` → The zero flag is set
- If `x5 != x0`, then `x5 - x0 ≠ 0` → The zero flag is clear
The zero flag directly indicates whether the two values are equal. This is the standard way all processors test equality in branch instructions.
Why not `and` operation? The `and` operation would not correctly compute equality. For example:
- `and(0x0000, 0x0001) = 0` (both zero and non-zero values give 0, which is misleading)
- `and(0xFFFF, 0xFFFF) = 0xFFFF` (non-zero result, but values are equal)
Subtraction is the only correct choice for testing equality.
2. Branch Target Calculation: The PC adder computes `PC + (B-type immediate)`
Since the branch is taken, the next PC is the calculated target.
3. PCSel = 1: Routes the branch target to the PC multiplexer
(In a more detailed design, PCSel might be conditional on the zero flag; here we assume the branch is taken)
4. ImmSel = B: B-type immediates encode the offset in split fields within the instruction
---
## Problem C: Fused Multiply-Add and Pipeline Hazard Analysis
Background: Neural Networks and Fused Operations
In modern workloads like Deep Learning (similar to the BitNet context in Problem O) and Digital Signal Processing, the Dot Product ($R = \sum A_i \times B_i$) is the most frequent operation.
To accelerate this, standard RISC-V extensions (F and D) and custom accelerators use a Fused Multiply-Add (FMA) instruction.
Your team has implemented a single-precision floating-point FMA instruction called `fmac.s`.
Operation: `Rd = (Rs1 * Rs2) + Rs3`
Instruction Format (R4-Type):
RISC-V uses the R4-type format for instructions requiring three source operands.
```text
 31    27 26    25 24      20 19      15 14    12 11       7 6       0
┌─────────┬────────┬──────────┬──────────┬────────┬──────────┬─────────┐
│   rs3   │ funct2 │   rs2    │   rs1    │ funct3 │    rd    │ opcode  │
└─────────┴────────┴──────────┴──────────┴────────┴──────────┴─────────┘
```
Important Note on Encoding Order:
The R4-type instruction format shows bits in their physical positions in the encoding. However, the arithmetic operation is `rd = (rs1 * rs2) + rs3`, which does not follow the left-to-right order of operands in the instruction encoding. Specifically, `rs3` occupies bits [31:27] (the leftmost bits), but it is the third operand in the arithmetic sense (the addend, not the multiplicands).
Hardware Specifications:
* opcode: `1000011` (Standard F-ext FMA opcode)
* funct3: `000`
* funct2: `00`
* Latency: The `fmac.s` operation takes 4 cycles in the Execute (EX) stage (EX1, EX2, EX3, EX4). If an instruction enters EX at cycle T:
- Cycle T: EX1 (first phase)
- Cycle T+1: EX2 (second phase)
- Cycle T+2: EX3 (third phase)
- Cycle T+3: EX4 (fourth and final phase)
- Forwarding occurs at the boundary between cycles T+3 and T+4: The result is available for forwarding at the start of cycle T+4 (immediately after EX4 completes).
* Throughput: The unit is fully pipelined (can accept 1 new instruction per cycle).
Note on Standard RISC-V FP Instructions:
* `flw ft0, 0(x10)`: Load 32-bit float from memory `Mem[x10]` into float register `ft0`.
* `fadd.s ft0, ft1, ft2`: Single-precision add (`ft0 = ft1 + ft2`).
* `fmul.s ft0, ft1, ft2`: Single-precision multiply (`ft0 = ft1 * ft2`).
* Floating-point registers are named `f0` to `f31` (or ABI names `ft0`-`ft11`, `fa0`-`fa7`, `fs0`-`fs11`).
- [ ] Part 1: Instruction Encoding
Encode the instruction `fmac.s f10, f1, f2, f3`.
This performs: `f10 = (f1 * f2) + f3`.
* opcode (7 bits): __ C01 __
* rd (f10) (5 bits): __ C02 __
* rs1 (f1) (5 bits): __ C03 __
* rs2 (f2) (5 bits): __ C04 __
* rs3 (f3) (5 bits): __ C05 __
- [ ] Part 2: Matrix Kernel Implementation
You need to write a RISC-V assembly loop to compute the dot product of two arrays, `A` and `B`, of length `N`.
* `x10`: Base address of array A
* `x11`: Base address of array B
* `x12`: Length N (integer)
* `f0`: Accumulator (initialized to 0.0)
Write the assembly code for the loop body using `flw` and the custom `fmac.s` instruction. Your goal is to minimize the number of arithmetic instructions.
```assembly
loop:
# 1. Load A[i] into f1
flw f1, 0(x10)
# 2. Load B[i] into f2
flw f2, 0(x11)
# 3. Fused Multiply-Accumulate: f0 = f0 + (f1 * f2)
# IMPORTANT: The fmac.s instruction syntax is: fmac.s rd, rs1, rs2, rs3
# This performs the computation: rd = (rs1 * rs2) + rs3
# So to compute f0 = (f1 * f2) + f0, you must write: fmac.s f0, f1, f2, f0
# (not f0, f0, f1, f2 or any other ordering)
__C06__
# 4. Increment pointers (float is 4 bytes)
addi x10, x10, 4
addi x11, x11, 4
# 5. Decrement count and branch
addi x12, x12, -1
bnez x12, loop
```
If we did not have `fmac.s`, we would need separate `fmul.s` and `fadd.s` instructions. How many *fewer* instructions are executed in the loop body for N=100 iterations by using `fmac.s`? __ C07 __ instructions.
Key Insight: One `fmac.s` instruction replaces both one `fmul.s` (multiply) and one `fadd.s` (add). Therefore, using `fmac.s` saves exactly 1 instruction per iteration. For N=100 iterations, the total savings is 100 instructions.
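The C-level analogue of this kernel uses `fmaf` from `<math.h>` (C99), which compilers for FMA-capable targets can typically lower to a single fused instruction. A sketch, assuming caller-provided arrays `a` and `b` of length `n`:
```c
#include <math.h>
#include <stddef.h>

// One fused multiply-add per element: acc = (a[i] * b[i]) + acc.
float dot_product(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc = fmaf(a[i], b[i], acc);
    return acc;
}
```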
- [ ] Part 3: Pipeline Latency and Optimization
Consider the execution of the loop from Part 2 on a 5-stage pipeline.
The `fmac.s` instruction uses a floating-point pipeline in the EX stage.
* Latency: 4 cycles (Result available for forwarding after 4 EX cycles).
* Issue: 1 instruction per cycle.
Consider two consecutive iterations of the loop (unrolled view):
```assembly
Iter 1: fmac.s f0, f1, f2, f0 # Writes to f0
... (pointer updates) ...
Iter 2: fmac.s f0, f3, f4, f0 # Reads f0
```
* Hazard Identification: There is a RAW dependency on register __ C08 __ between the `fmac.s` of Iteration 1 and the `fmac.s` of Iteration 2. (Hint: Which register does Iter 1 write, and which does Iter 2 read as input?)
* Stall Calculation:
- Iter 1 `fmac.s` timing: Assume `fmac.s` enters EX1 at cycle `T` immediately after the ID stage completes.
- Iter 1 completes all 4 EX cycles (T, T+1, T+2, T+3) and makes the result available for forwarding at the start of cycle T+4.
- Iter 2 `fmac.s` would ideally enter EX at cycle T+1 (back-to-back execution, one cycle after Iter 1 enters EX).
- However, Iter 2 needs the result from Iter 1 as an input at the start of its EX stage.
- Since the result is not available until cycle T+4, Iter 2 must stall from T+1 through T+3.
- Assuming ideal instruction fetch (pointer updates handled by integer unit in parallel), how many stall cycles occur?
Hazard Timeline Visualization:
```
Cycle     T     T+1    T+2    T+3    T+4    T+5
──────────────────────────────────────────────────────────────
Iter 1    EX1   EX2    EX3    EX4    (result available)
          (f0 write completes at end of T+3)
                                      ↓
Iter 2    IF    STALL  STALL  STALL  EX1
          (wants f0 in EX1)          (f0 available at T+4)
```
Stall Analysis:
- Iter 1 result available for forwarding at: Start of cycle T+4
- Iter 2 can begin EX1 at: Cycle T+4 (when `f0` is ready)
- Iter 2 would have begun EX1 at: Cycle T+1 (without hazard)
- Stall period: Cycles T+1, T+2, T+3 = 3 stall cycles
$\to$ Answer: __ C09 __ cycles (Iter 2 enters EX at T+4 when result is available).
Optimization: To hide this latency without changing hardware, compilers often use Loop Unrolling and Multiple Accumulators.
By using multiple independent accumulators `f0`, `f1`, `f2`, `f3` (each summing a distinct part of the array), we can break the Read-After-Write (RAW) dependency on a single accumulator. Does using these multiple independent accumulators eliminate the RAW hazard and thus remove the 3-cycle stalls calculated in C09?
$\to$ Answer (Yes/No): __ C10 __
- [ ] Part 4: Pipeline Timeline Visualization
Complete the following timeline diagram by filling in the stage labels for Iter 1 and the ready cycle for Iter 2:
```
For Iter 1: fmac.s f0, f1, f2, f0 (starts EX at cycle T)
For Iter 2: fmac.s f0, f3, f4, f0 (dependent on Iter 1)
Cycle T T+1 T+2 T+3 T+4 T+5 T+6
──────────────────────────────────────────────────────
Iter1 __C11__ __C12__ __C13__ __C14__ MEM WB
Iter2 IF STALL STALL STALL EX1 EX2 EX3
Stage abbreviations: IF=Fetch, ID=Decode, EX1/EX2/EX3/EX4=Execute stages, MEM, WB
```
Note: ID is effectively held during the three STALL cycles; the instruction cannot advance into EX1 until the result in `f0` becomes available at cycle T+4.
Fill in the blanks for Iter 1's EX stages (which are 4 cycles long):
C11: The first EX stage (cycle T). Fill: __ C11 __
C12: The second EX stage (cycle T+1). Fill: __ C12 __
C13: The third EX stage (cycle T+2). Fill: __ C13 __
C14: The fourth EX stage (cycle T+3). Fill: __ C14 __
Now, identify the earliest cycle when Iter 2 can begin executing EX1:
C15: At which cycle does Iter 1's result become available for forwarding, allowing Iter 2 to enter EX1? Fill: __ C15 __
Hint: Forwarding occurs at the boundary between cycles T+3 and T+4. The result is available for forwarding at the start of cycle T+4 (immediately after EX4 completes). Therefore, any dependent instruction that needs this result as an EX input may begin its EX1 stage in cycle T+4.
### ✅ Solution to Problem C:
C01: ==1000011==
C02: ==01010==
C03: ==00001==
C04: ==00010==
C05: ==00011==
C06: ==`fmac.s f0, f1, f2, f0`==
C07: ==100==
C08: ==`f0`==
C09: ==3==
C10: ==Yes==
Part 4: Pipeline Timeline Visualization
C11: ==EX1==
C12: ==EX2==
C13: ==EX3==
C14: ==EX4==
C15: ==T+4==
Detailed explanations (optional reading):
Part A: Encoding
1. opcode: 1000011 (or 0x43 in hex)
2. rd (f10): 01010 (binary representation of 10)
3. rs1 (f1): 00001 (binary representation of 1)
4. rs2 (f2): 00010 (binary representation of 2)
5. rs3 (f3): 00011 (binary representation of 3)
The R4-type instruction format is:
- Bits [31:27]: rs3 (source register 3)
- Bits [26:25]: funct2 = 00
- Bits [24:20]: rs2 (source register 2)
- Bits [19:15]: rs1 (source register 1)
- Bits [14:12]: funct3 = 000
- Bits [11:7]: rd (destination register)
- Bits [6:0]: opcode = 1000011
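The same field packing can be checked with a short C sketch (a hypothetical helper, mirroring the S-type encoder shown in Problem B):
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Pack the fields of an R4-type instruction into a 32-bit word.
static uint32_t enc_r4type(uint32_t rs3, uint32_t funct2, uint32_t rs2,
                           uint32_t rs1, uint32_t funct3, uint32_t rd,
                           uint32_t opcode) {
    return ((rs3 & 0x1F)   << 27) |
           ((funct2 & 0x3) << 25) |
           ((rs2 & 0x1F)   << 20) |
           ((rs1 & 0x1F)   << 15) |
           ((funct3 & 0x7) << 12) |
           ((rd & 0x1F)    <<  7) |
           (opcode & 0x7F);
}

int main(void) {
    // fmac.s f10, f1, f2, f3: rs3=3, funct2=0, rs2=2, rs1=1, funct3=0, rd=10
    printf("0x%08" PRIX32 "\n", enc_r4type(3, 0, 2, 1, 0, 10, 0x43));  // 0x18208543
    return 0;
}
```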
Part B: Matrix Kernel
C06: ==`fmac.s f0, f1, f2, f0`== (ordering: rd, rs1, rs2, rs3 → f0 = (f1 * f2) + f0)
C07: ==100== (1 instruction saved per iteration × 100 iterations)
For the loop:
- Without FMA: need `fmul.s temp, f1, f2` followed by `fadd.s f0, f0, temp` (2 instructions per multiply-add)
- With FMA: just `fmac.s f0, f1, f2, f0` (1 instruction per multiply-add)
- Savings: 1 instruction per iteration × 100 iterations = 100 fewer instructions
Part C: Pipeline Hazards and Stalls
C08: ==`f0`==
C09: ==3==
C10: ==Yes==
Detailed Analysis:
1. RAW dependency (C08): Register `f0` - Iteration 1 writes `f0`, Iteration 2 reads `f0` as source
2. Stall Calculation (C09 = 3 cycles, Critical Concept):
Pipeline Timeline:
```
Cycle T T+1 T+2 T+3 T+4 T+5 T+6
──────────────────────────────────────────
Iter1 EX1 EX2 EX3 EX4 MEM WB
Iter2 IF ID -- -- EX1 EX2 EX3 (starts EX at T+4)
```
Detailed Explanation:
- Iter 1 `fmac.s` enters EX stage at cycle T
- Iter 1 progresses through EX1 (T), EX2 (T+1), EX3 (T+2), EX4 (T+3)
- Result is available for forwarding at the start of cycle T+4 (after EX4 completes)
- Iter 2 `fmac.s` would ideally enter EX at cycle T+1 (back-to-back execution)
- However, Iter 2 needs `f0` from Iter 1, which is not available until T+4
- Stall Period: Cycles T+1, T+2, T+3 (3 stall cycles)
- Iter 2 finally enters EX at T+4 (when `f0` is available for forwarding) ✓
- Conclusion: C09 = 3 cycles
3. Loop Unrolling with Multiple Accumulators (C10 = Yes):
Background (ILP - Instruction-Level Parallelism):
The key insight is accumulator independence: by using separate accumulators (`f0`, `f1`, `f2`, `f3`), we break the Read-After-Write (RAW) dependencies between consecutive loop iterations. Each `fmac` instruction writes to a different floating-point register, so there are no data dependencies between them. This allows deeply pipelined (or out-of-order) hardware to execute multiple independent `fmac` instructions in parallel, even though they are issued sequentially in the code. This technique, loop unrolling with multiple accumulators, exploits Instruction-Level Parallelism (ILP) to achieve higher throughput.
By using independent accumulators `f0`, `f1`, `f2`, `f3`, each iteration's `fmac` becomes independent:
```
Cycle T T+1 T+2 T+3 T+4
────────────────────────────────
fmac f0...  EX1  EX2  EX3  EX4  (result in f0)
fmac f1...  IF   ID   EX1  EX2  (no RAW stall! f1 ≠ f0)
fmac f2...       IF   ID   EX1  (no RAW stall! f2 ≠ f0)
fmac f3...            IF   ID   (no RAW stall! f3 ≠ f0)
```
The pipeline stays fully utilized with no stalls. Each `fmac` executes independently because:
- `f0`, `f1`, `f2`, `f3` are physically distinct registers (no write-after-read conflicts)
- No instruction depends on the result of a previously-issued `fmac` (no RAW hazards)
- Forwarding paths are available but not needed (no stalls occur)
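In C, the unrolled kernel looks like the sketch below (assuming `n` is a multiple of 4; `fmaf` is the C99 fused multiply-add):
```c
#include <math.h>
#include <stddef.h>

// Four independent accumulators break the RAW chain between consecutive
// fused multiply-adds, letting a 4-cycle-latency FMA unit stay busy.
float dot_product_unrolled(const float *a, const float *b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 = fmaf(a[i],     b[i],     s0);
        s1 = fmaf(a[i + 1], b[i + 1], s1);
        s2 = fmaf(a[i + 2], b[i + 2], s2);
        s3 = fmaf(a[i + 3], b[i + 3], s3);
    }
    return (s0 + s1) + (s2 + s3);  // combine the partial sums once at the end
}
```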
---
## Problem D: Cache Locality, Miss Penalties, and Branch Prediction in GEMM
Context:
You are optimizing the core kernel of a Deep Learning accelerator running on a RISC-V processor. The operation is General Matrix Multiply (GEMM) followed by a ReLU activation:
$$C = \text{ReLU}(A \times B)$$
System Configuration:
* Processor: 5-stage RISC-V (IF, ID, EX, MEM, WB).
* L1 Data Cache: 4 KiB, 2-Way Set Associative, 64-byte block size (line size), LRU replacement.
- Total Size: $4096$ bytes.
- Block Size: $64$ bytes ($16$ integers).
* Memory: 32-bit address space.
* Latency: Hit = 1 cycle, Miss Penalty = 50 cycles.
Data Layout:
Matrices $A$, $B$, and $C$ are all $64 \times 64$ matrices of 32-bit integers (`int32_t`).
* $N = 64$.
* Matrices are stored in Row-Major order.
* $A$ starts at `0x10000`, $B$ at `0x20000`, $C$ at `0x30000`.
* Size of one matrix = $64 \times 64 \times 4$ bytes = 16 KiB.
Code Variant 1: Naive GEMM (Standard definition)
```c
// i = row, j = col, k = dot-product index
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        int sum = 0;
        for (int k = 0; k < N; k++) {
            // Access Pattern 1
            sum += A[i][k] * B[k][j];
        }
        // ReLU Activation
        if (sum < 0) sum = 0;
        C[i][j] = sum;
    }
}
}
```
Code Variant 2: Locality-Optimized GEMM (Loop Permutation)
```c
// Initialize C to 0
for (int i = 0; i < N; i++) {
for (int k = 0; k < N; k++) { // Loop k swapped with j
int a_val = `A[i][k]`; // Loaded once per inner loop
for (int j = 0; j < N; j++) {
// Access Pattern 2
int partial = a_val * `B[k][j]`;
`C[i][j]` += partial;
}
}
}
// ReLU pass (done separately for simplicity in this variant)
for (int i=0; i<N; i++) for(int j=0; j<N; j++) if(`C[i][j]` < 0) `C[i][j]` = 0;
```
- [ ] Part 1: Cache Geometry
* Calculate the number of sets in the L1 Data Cache.
$\text{Sets} = \frac{\text{Total Size}}{\text{Associativity} \times \text{Block Size}}$
> Answer: __ D01 __ sets.
* How many bits are used for the Index?
> Answer: __ D02 __ bits.
* How many integers (`int32_t`) fit into a single cache block?
> Answer: __ D03 __ integers.
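These three quantities follow mechanically from the configuration; a throwaway sketch (constants taken from the problem statement):
```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const unsigned total_size = 4096;  // 4 KiB
    const unsigned ways       = 2;     // 2-way set associative
    const unsigned block_size = 64;    // bytes per block

    unsigned sets = total_size / (ways * block_size);  // 32
    unsigned index_bits = 0;
    for (unsigned s = sets; s > 1; s >>= 1)            // log2(sets) = 5
        index_bits++;
    unsigned ints_per_block = block_size / (unsigned)sizeof(int32_t);  // 16

    printf("sets=%u index_bits=%u ints_per_block=%u\n",
           sets, index_bits, ints_per_block);
    return 0;
}
```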
- [ ] Part 2: Spatial Locality Analysis (Code Variant 1)
Focus on the innermost loop of Variant 1 (iterating `k`):
`sum += A[i][k] * B[k][j];`
Assume `i=0`, `j=0`, and `k` runs from 0 to 63.
Assume cold start: the L1 cache is initially empty before any memory access.
* Matrix A Access (`A[0][k]`):
As `k` increments, we access `A[0][0], A[0][1], A[0][2]...`.
Are these accesses contiguous in memory? (Yes/No) __ D04 __.
Miss Rate Analysis for A (Cold Start):
In row-major layout, `A[0][k]` addresses are sequential:
- `A[0][0]` at address 0x10000
- `A[0][1]` at address 0x10004
- ...
- `A[0][15]` at address 0x1003C
Each cache block holds 16 integers (64 bytes), so:
- Access `A[0][0]`: MISS (loads cache block [0x10000:0x1003F])
  → This block contains `A[0][0]` through `A[0][15]`
- Access `A[0][1]` through `A[0][15]`: HIT (already in cache)
- Access `A[0][16]`: MISS (loads cache block [0x10040:0x1007F])
- Access `A[0][17]` through `A[0][31]`: HIT
- Access `A[0][32]`: MISS
- Access `A[0][48]`: MISS
Miss calculation:
- Total accesses: 64 (k = 0 to 63)
- Total misses: 4 (at k = 0, 16, 32, 48)
- Miss rate = 4 / 64 = 6.25%
Key Assumptions for this analysis:
- Cache initially empty (cold start)
- LRU replacement policy (when cache is full)
- 2-way associativity does not affect first-access streaming (sequential accesses to different sets)
What is the miss rate for $A$ in this inner loop? __ D05 __ %.
* Matrix B Access (`B[k][0]`):
As `k` increments, we access `B[0][0], B[1][0], B[2][0]...`.
Memory Stride Analysis:
In row-major layout:
- `B[0][0]` is at address 0x20000
- `B[1][0]` is at address 0x20000 + 1×64×4 = 0x20100 (stride = 256 bytes)
- `B[2][0]` is at address 0x20000 + 2×64×4 = 0x20200 (stride = 256 bytes)
Calculate the stride between `B[k][0]` and `B[k+1][0]`:
Stride = Row Width $\times$ Element Size = $64 \times 4$ = __ D06 __ bytes.
Cache Block Spatial Reuse Analysis:
Cache block size = 64 bytes = 16 integers.
Access stride = 256 bytes = 4 consecutive cache blocks.
When B[0][0] is loaded (at address 0x20000):
- Cache block contains: B[0][0], B[0][1], ..., B[0][15]
When B[1][0] is accessed (at address 0x20100):
- This is at a DIFFERENT cache block (4 blocks away)
- The "other 15 integers" from the first miss (B[0][1:15]) are NOT reused
More specifically: do `B[0][0]` and `B[1][0]` reside in the same cache block? That is, does this stride allow us to reuse the other 15 integers that the first miss loaded into the block? (Yes/No)
Answer: __ D07 __
Explanation: No, because B[1][0] is 256 bytes away from B[0][0].
Since 256 > 64 (cache block size), they occupy different cache blocks.
Miss Rate Calculation:
In the k-loop (k = 0 to 63):
- Every iteration accesses B[k][0]
- Every B[k][0] is in a different cache block (stride = 256B > 64B)
- Total accesses: 64
- Total misses: 64 (every single access misses)
What is the approximate miss rate for $B$ in this inner loop?
Answer: __ D08 __ %.
- [ ] Part 3: The Power of Loop Permutation (Code Variant 2)
Focus on the innermost loop of Variant 2 (iterating `j`):
`C[i][j] += a_val * B[k][j];`
As `j` increments, access is `B[k][0], B[k][1]...`.
This yields __ D09 __ (Spatial / Temporal) locality.
Calculation Breakdown:
- Cache block size: 64 bytes = 16 integers (since each integer is 4 bytes)
- Inner loop iteration count: 64 iterations (j goes from 0 to 63)
- Sequential accesses: B[k][0], B[k][1], ..., B[k][15] fit in the first cache block (16 elements per block)
- Accesses B[k][16:31] fit in the second cache block
- Accesses B[k][32:47] fit in the third cache block
- Accesses B[k][48:63] fit in the fourth cache block
- Total misses for each value of k: 4 misses (one per cache block)
- Total accesses per k: 64 accesses
- Miss rate = 4 / 64 = 6.25%
The miss rate for $B$ drops to approximately __ D10 __ %.
Performance Comparison:
Let $N_{ops}$ be the total number of multiply-accumulate operations.
Suppose Variant 1 has a combined miss rate (A, B, and C accesses) of 50% and Variant 2 has 6.25%. Then:
$\text{AMAT}_1 = 1 + 0.50 \times 50 = 26$ cycles.
$\text{AMAT}_2 = 1 + 0.0625 \times 50 = 4.125$ cycles.
Matrix Roles in the GEMM Computation:
In General Matrix Multiply $C = A \times B$:
- Matrix A: Input activation data
  - Accessed once per k-iteration in both variants
  - In Variant 1: row-major access (good spatial locality)
  - In Variant 2: loaded once per k and reused in the j-loop
  - Miss rate: same in both variants (~6.25%)
- Matrix B: Weight matrix
  - In Variant 1: accessed as `B[k][j]` with k varying in the inner loop
    → Column-major access pattern: `B[0][j]`, `B[1][j]`, `B[2][j]`, ...
    → Consecutive accesses are 256 bytes apart (different cache blocks)
    → Every access misses: 100% miss rate
  - In Variant 2: accessed as `B[k][j]` with j varying in the inner loop
    → Row-major access pattern: `B[k][0]`, `B[k][1]`, ..., `B[k][63]`
    → Sequential access within a row enables spatial locality
    → Only 4 misses per k-iteration instead of 64
    → Miss rate: 6.25% instead of 100%
- Matrix C: Output accumulator
  - Both variants: written sequentially, minimal cache penalty
Critical Insight:
Variant 2 transforms the B access pattern from:
```
Column-major: B[0][j], B[1][j], B[2][j], ... (64 misses per j)
```
to:
```
Row-major: B[k][0], B[k][1], ..., B[k][63] (4 misses per k)
```
This change reduces B's cache penalty by 16×, which is the dominant
improvement in overall performance.
Why is Variant 2 theoretically faster for Deep Learning workloads?
Answer: It minimizes the penalty of loading the __ D11 __ matrix
by accessing memory sequentially instead of column-major striding.
- [ ] Part 4: ReLU and Branch Prediction
Consider the ReLU activation in Variant 1:
```assembly
# Compiled RISC-V assembly for: if (sum < 0) sum = 0;
blt x5, x0, is_negative    # Branch if sum < 0
j store_result
is_negative:
li x5, 0                   # sum = 0
store_result:
sw x5, 0(x10)              # Store to C[i][j]
```
In a Neural Network, the distribution of positive/negative sums depends on the weights. Assume the ReLU condition (`sum < 0`) is clustered, with a burst of taken branches followed by a burst of not-taken branches.
The branch predictor uses a 2-bit saturating counter with initial state = Strongly-Not-Taken (state 0).
Assume the instruction pattern is: Taken, Taken, Taken, Taken, Not-taken, Not-taken, Not-taken, Not-taken (8 branches total).
- Taken = branch condition true; jump to `is_negative` label
- Not-taken = branch condition false; fall through to next instruction
How many mispredictions occur in this sequence? __ D12 __
2-Bit Saturating Counter: State Machine and Update Rules
State Reference (determines prediction for next branch):
- State 0 (Strongly-Not-Taken): prediction = NOT TAKEN
- State 1 (Weakly-Not-Taken): prediction = NOT TAKEN
- State 2 (Weakly-Taken): prediction = TAKEN
- State 3 (Strongly-Taken): prediction = TAKEN
Counter Update Rules (apply after each branch executes):
- If branch is Taken (actual outcome = 1):
- Increment state: 0→1, 1→2, 2→3, 3→3 (saturate at max)
- Moves toward "Strongly-Taken" prediction
- If branch is Not-taken (actual outcome = 0):
- Decrement state: 0→0 (saturate at min), 1→0, 2→1, 3→2
- Moves toward "Strongly-Not-Taken" prediction
Key Insight: A 2-bit counter requires two consecutive outcomes in the same direction to change the prediction. This is why steps 1-2 (two consecutive Taken) change the prediction from Not-taken to Taken, and steps 5-6 (two consecutive Not-taken) change it back to Not-taken.
Execution Trace (State Transitions and Predictions):
| Step | Prediction | Actual Outcome | Match? | New State | Notes |
|:----:|:----------:|:--------------:|:------:|:---------:|:------|
| 1 | NOT-TAKEN (0) | Taken | ❌ MISS | 1 | Condition was true (sum < 0) |
| 2 | NOT-TAKEN (1) | Taken | ❌ MISS | 2 | State changes to predict Taken |
| 3 | TAKEN (2) | Taken | ✓ HIT | 3 | Prediction correct |
| 4 | TAKEN (3) | Taken | ✓ HIT | 3 | State saturated at 3 |
| 5 | TAKEN (3) | Not-taken | ❌ MISS | 2 | Condition false (sum ≥ 0) |
| 6 | TAKEN (2) | Not-taken | ❌ MISS | 1 | State changes to predict Not-taken |
| 7 | NOT-TAKEN (1) | Not-taken | ✓ HIT | 0 | Prediction correct |
| 8 | NOT-TAKEN (0) | Not-taken | ✓ HIT | 0 | State saturated at 0 |
Total Mispredictions: 4 (steps 1, 2, 5, 6)
Transitions: Taken event → increment state (max 3), Not-Taken event → decrement state (min 0)
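The trace above can be reproduced with a few lines of C; a minimal simulation of the 2-bit counter over the given pattern (hypothetical harness):
```c
#include <stdio.h>

// States 0,1 predict not-taken; states 2,3 predict taken.
int main(void) {
    int state = 0;                              // Strongly-Not-Taken
    int outcome[8] = {1, 1, 1, 1, 0, 0, 0, 0};  // 4x Taken, then 4x Not-taken
    int mispredicts = 0;

    for (int i = 0; i < 8; i++) {
        int predict_taken = (state >= 2);
        if (predict_taken != outcome[i])
            mispredicts++;
        if (outcome[i]) { if (state < 3) state++; }  // Taken: increment
        else            { if (state > 0) state--; }  // Not-taken: decrement
    }
    printf("mispredictions = %d\n", mispredicts);    // prints 4
    return 0;
}
```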
Cache-Branch Interaction: The Fundamental Bottleneck
In Variant 1, the ReLU activation depends on the sign of `sum`:
```c
sum += A[i][k] * B[k][j];  // Execution depends on B[k][j] load
if (sum < 0)               // Branch depends on sum value
    sum = 0;
```
Dependency Chain Analysis:
The instruction sequence has a data dependency chain:
1. Load `B[k][j]` from memory
- Address calculation: k and j determine which cache block
- If `B[k][j]` misses in the cache → 50-cycle memory penalty
2. ADD instruction (computing `sum += result_of_load`)
- Depends on the result of the load instruction
- Cannot execute until the load completes
3. BLT instruction (branch on `sum < 0`)
- Depends on the result of the ADD instruction
- Cannot evaluate the condition until sum is computed
- Cannot be resolved (the prediction verified) until the pipeline stalls clear
The Pipeline Stall Sequence:
```
Cycle 0: Load instruction issued (B[k][j])
Cycle 1-50: Pipeline STALLED - waiting for L1 miss to resolve
Cycle 51: Load completes, value written to register
Cycle 52: ADD instruction completes
Cycle 53: BLT instruction can now evaluate condition
```
Does the branch resolution occur BEFORE the cache miss is resolved? __ D13 __ (Yes / No)
Classify the fundamental hazard type:
This performance bottleneck is a __ D14 __ (Data / Control / Structural) hazard.
### ✅ Solution to Problem D:
D01: ==32==
D02: ==5==
D03: ==16==
D04: ==Yes==
D05: ==6.25==
D06: ==256==
D07: ==No==
D08: ==100==
D09: ==Spatial==
D10: ==6.25==
D11: ==Matrix B==
D12: ==4==
D13: ==No==
D14: ==Data==
Detailed Explanation of Part D (Cache-Branch Interaction Synthesis):
D12: Branch Predictor Behavior Under Pattern Transitions
The 2-bit saturating counter starts at state 0 (Strongly-Not-Taken).
When the actual branch pattern is: Taken × 4, then Not-taken × 4:
- The predictor transitions through states: 0→1→2→3→2→1→0
- Mispredictions occur at transition points (steps 1, 2, 5, 6), yielding 4 total mispredictions
- Key insight: The predictor learns the pattern gradually (two consecutive outcomes needed to flip prediction)
- Steps 1-2: Two consecutive Taken branches → prediction flips from Not-taken to Taken
- Steps 5-6: Two consecutive Not-taken branches → prediction flips from Taken to Not-taken
D13-D14: Load-to-Use Data Hazard Cascading into Control Hazard
The fundamental performance bottleneck in Variant 1 arises from memory latency, not branch prediction:
Dependency Chain (Data Hazard):
```
Load `B[k][j]` → ADD (compute sum) → BLT (test sum < 0)
(Miss: 50 cycles)
```
Pipeline Stall Sequence:
1. Load instruction issues `B[k][j]`
2. If `B[k][j]` misses: Pipeline stalls for 50 cycles (fetch from memory)
3. ADD instruction cannot complete until load finishes
4. BLT instruction cannot evaluate its condition until ADD completes
5. Result: Branch resolution is delayed by the cache miss penalty
Why This Is Fundamentally a Data Hazard (D14 = Data):
- The instruction dependency chain is Load → ADD → Branch (data dependencies)
- Each instruction depends on the previous one's result value
- The root cause is the load-to-use dependency, not control flow
However, It Manifests as a Control Hazard Because:
- The cache miss stalls the data pipeline
- This delays the condition evaluation for the branch
- The branch predictor made its prediction, but the actual resolution is delayed
- Memory latency is the root blocker, not processor cycles
Critical Deep Learning Implication:
Comparing `AMAT = 1 + 0.50 × 50 = 26 cycles` (Variant 1) with `AMAT = 1 + 0.0625 × 50 = 4.125 cycles` (Variant 2):
- Memory latency (cache misses) accounts for roughly 96% of the average memory-access time in Variant 1 (25 of the 26 AMAT cycles)
- Branch-predictor mispredictions are a secondary concern
- The improvement in Variant 2 (26 / 4.125 ≈ 6.3× lower AMAT) comes from optimizing cache access patterns, NOT from improving branch prediction
- Key Lesson: In memory-intensive workloads (like GEMM in Deep Learning), optimize cache locality first; branch-prediction effects are negligible by comparison
---
## Problem E: Sv32 Page Table Construction and PMP Isolation
This problem explores both virtual memory translation (supervisor mode) and physical memory protection (machine mode).
- [ ] Part 1: Bare-metal MMU Configuration (Sv32)
You are analyzing the memory management unit (MMU) setup code from rv32emu. The system implements Sv32 paging (2-level page tables).
You are provided with snippets from `setup.S` (assembly startup) and `vm_setup.c` (C page table configuration).
Source Code Reference:
* `vm_setup.c` (Simplified `map_page` function):
This function maps a Virtual Address (`va`) to a Physical Address (`pa`) with specific permission flags.
```c
#define PTE_V 0x001 // Valid
#define PTE_R 0x002 // Read
#define PTE_W 0x004 // Write
#define PTE_X 0x008 // Execute
#define PTE_U 0x010 // User
#define PGSIZE 4096

typedef uint32_t pte_t;

// root_pt: pointer to the Level 1 (Root) Page Table
void map_page(pte_t *root_pt, uint32_t va, uint32_t pa, uint32_t flags) {
    // 1. Calculate VPN[1] (Level 1 index)
    uint32_t vpn1 = (va >> __E01__);

    // 2. Check if the L1 PTE exists. If not, assume a Level 0 table is
    //    already allocated at physical address 0x80001000.
    if ((root_pt[vpn1] & PTE_V) == 0) {
        // Create a pointer to the next level (L0 table).
        // A pointer PTE has V=1, but R=W=X=0.
        uint32_t next_pt_pa = 0x80001000;
        root_pt[vpn1] = ((next_pt_pa >> 12) << 10) | __E02__;
    }

    // 3. Get a pointer to the Level 0 table:
    //    convert the PPN from the PTE back to a physical pointer.
    pte_t *l0_pt = (pte_t *) ((root_pt[vpn1] >> 10) << 12);

    // 4. Calculate VPN[0] (Level 0 index)
    uint32_t vpn0 = (va >> __E03__) & 0x3FF;

    // 5. Write the Leaf PTE.
    //    A leaf PTE contains the physical Page Frame Number (PPN) and permission flags.
    l0_pt[vpn0] = ((pa >> 12) << 10) | (flags | PTE_V);
}
```
* `setup.S` (Enabling the MMU):
```assembly
# t0 holds the physical address of the Root Page Table
li t0, 0x80000000
# Construct the satp value
# Sv32 Mode = 1 (Bit 31)
# PPN = t0 >> 12
srli t0, t0, 12 # t0 = PPN
li t1, 0x80000000 # Set Mode bit (bit 31)
or t0, t0, t1 # t0 = satp value
# Write to satp
csrw satp, t0
# Flush stale TLB entries to ensure consistency
sfence.vma          # no operands: flushes all translations (rs1=x0, rs2=x0)
```
- [ ] Part 2: Virtual Address Decoding (C Programming)
In `vm_setup.c`, we need to extract the Virtual Page Numbers (VPN) to index into the page tables. In Sv32, the virtual address is split as: `[31:22] VPN[1]` (10 bits), `[21:12] VPN[0]` (10 bits), `[11:0] Offset` (12 bits).
Virtual Address Layout (Sv32):
```
 31          22 21          12 11           0
┌──────────────┬──────────────┬──────────────┐
│    VPN[1]    │    VPN[0]    │    Offset    │
│ bits [31:22] │ bits [21:12] │ bits [11:0]  │
│ 10 bits (L1) │ 10 bits (L0) │   12 bits    │
└──────────────┴──────────────┴──────────────┘
```
VPN[1] Extraction (E01):
VPN[1] corresponds to VA bits [31:22]. To extract these bits and move them to bits [9:0] for indexing, shift right by __ E01 __ bits.
Example: If VA = 0x00002000, VPN[1] = (0x00002000 >> ?) & 0x3FF
VPN[0] Extraction (E03):
VPN[0] corresponds to VA bits [21:12]. To extract these bits and move them to bits [9:0] for indexing, shift right by __ E03 __ bits, then mask with 0x3FF to keep only 10 bits.
Example: If VA = 0x00002000, VPN[0] = (0x00002000 >> ?) & 0x3FF
Note: Masking with 0x3FF keeps only VPN[0]'s 10 bits, preventing VPN[1] from interfering with the index.
- [ ] Part 3: Page Table Entry (PTE) Construction
Pointer PTE vs Leaf PTE:
A PTE is interpreted as a pointer entry if and only if V=1 and R=W=X=0. A leaf PTE must set at least one of R/W/X to indicate permission. This distinction is critical: if you mark a PTE with V=1 and all R/W/X=0, the MMU treats it as a pointer to the next page table level, not as a final memory location.
Non-Leaf PTE (Pointer/Intermediate Node) - E02:
When the code finds the L1 entry is invalid, it links it to a Level 0 table. Fill in blank E02 with ONLY the constant `PTE_V` and no other flags.
- `PTE_V` = 0x001 sets the Valid bit
- Do NOT include any of: `PTE_R`, `PTE_W`, `PTE_X`, or `PTE_U`
- Do NOT include any other flags
- Answer format: Use the macro name exactly as defined
Leaf PTE - E05:
Suppose we map VA 0x00002000 to PA 0x80005000 with flags R|W|X.
- pa >> 12 = 0x80005
- PTE PPN field = 0x80005 << 10 = 0x20001400
- Flags = `PTE_R` | `PTE_W` | `PTE_X` | `PTE_V` = 0x00E | 0x001 = 0x00F
- Final PTE Hex Value = 0x2000140F
If the flags argument passed to map_page was just 0 (indicating only Valid bit is set by the code), would this page be accessible for reading by the kernel?
Answer (Yes/No): __ E05 __
The key: a PTE with V=1 and R=W=X=0 is a pointer entry, not a readable page.
- [ ] Part 4: The `satp` Register and Identity Mapping
In `setup.S`, the code configures the `satp` register. Assume the Root Page Table is at physical address 0x80000000.
`satp` Register Layout (Sv32):
```
 31     30         22 21                    0
┌──────┬─────────────┬───────────────────────┐
│ MODE │    ASID     │          PPN          │
│ 1 bit│   9 bits    │        22 bits        │
│Sv32=1│  (0 here)   │  Root Page Table PPN  │
└──────┴─────────────┴───────────────────────┘
```
The PPN field of `satp` is 0x80000000 >> 12 = 0x80000. The Mode bit for Sv32 is bit 31 (value = 1). Mode=1 selects Sv32; paging is enabled automatically when `satp` is written. Combining: `satp = 0x80000000 /* Mode */ | (0x80000000 >> 12) /* PPN */ = __ E06 __`
Identity Mapping Requirement (E07):
When the instruction `csrw satp, t0` executes, the MMU turns on immediately. The very next instruction fetch will be translated via the page table. If the virtual address of the next instruction does not map to its current physical address, the system will crash with an Instruction Page Fault. Therefore, before enabling paging, the OS must ensure that the memory range containing the setup code is __ E07 __ mapped. Identity mapping must be a valid leaf PTE (R or X set), not just a pointer entry.
- [ ] Part 5: TLB Maintenance
Changing the page table in memory (writing to RAM) does not automatically update the hardware Translation Lookaside Buffer (TLB), which caches old translations. After writing to `satp` (switching address spaces) or modifying PTEs, software must flush the TLB.
TLB Flush Instruction (E04):
The specific RISC-V supervisor-level assembly instruction to flush the TLB is __ E04 __. When flushing the entire TLB, use this instruction with rs1=zero and rs2=zero.
- [ ] Part 6: Physical Memory Protection (PMP) Configuration
You are analyzing the boot sequence of a RISC-V operating system. The system starts in Machine Mode (M-mode). The firmware's job is to configure Physical Memory Protection (PMP) to protect its own memory range, grant the OS kernel (Supervisor Mode, S-mode) access to the rest of the memory, and then switch execution to S-mode.
System Memory Map:
* `0x8000_0000` - `0x8000_FFFF`: M-mode Firmware / OpenSBI (64 KiB)
* `0x8001_0000` - `0x8FFF_FFFF`: DRAM (Kernel code, data, and user space)
Hardware Behavior:
* PMP entries are checked in order (`pmp0`, `pmp1`, ...). The lowest-numbered match determines permissions.
* M-mode access: If a PMP entry matches and the Lock (L) bit is set, permissions are enforced. If L=0, M-mode ignores the PMP entry (pass-through).
* S/U-mode access: PMP permissions are always enforced.
* No Match: If no PMP entry matches an address, M-mode access is allowed, but S/U-mode access is denied (Access Fault).
M-mode PMP Configuration
The firmware configures two PMP regions:
* `pmp0`: Covers `0x8000_0000` to `0x8000_FFFF` (Firmware).
Configuration: `L=0` (Not Locked), `A=TOR`, `R=0`, `W=0`, `X=0`.
* `pmp1`: Covers `0x8000_0000` to `0x8FFF_FFFF` (All DRAM).
Configuration: `L=0`, `A=TOR`, `R=1`, `W=1`, `X=1`.
The goal is to allow M-mode to read/write everything, but prevent S-mode from touching the firmware.
When M-mode code tries to write to address `0x8000_0050` (inside Firmware):
* Does it match `pmp0`? (Yes/No): __ E08 __
* Since `L=0`, is the permission check enforced for M-mode? (Yes/No): __ E09 __
* Result: (Allowed/Trap): __ E10 __
When S-mode code tries to write to address `0x8000_0050` (inside Firmware):
* Does it match `pmp0`? (Yes/No): __ E11 __
* Since the mode is S, is the permission check enforced? (Yes/No): __ E12 __
* `pmp0` permissions are `RWX=000`. Result: (Allowed/Trap): __ E13 __
When S-mode code tries to write to `0x8002_0000` (inside DRAM):
* Does it match `pmp0`? (Yes/No): __ E14 __ (No, it is above the limit)
* Does it match `pmp1`? (Yes/No): __ E15 __
* `pmp1` permissions are `RWX=111`. Result: (Allowed/Trap): __ E16 __
Switching to Supervisor Mode
After configuring PMP, the firmware prepares to jump to the OS kernel. This requires modifying CSRs and executing `mret`.
The simplified assembly sequence is:
```assembly
# 1. Set the jump target (OS Kernel Entry Point)
li t0, 0x80010000
csrw mepc, t0
# 2. Configure mstatus to return to S-mode
# mstatus layout: [ ... | MPP(12:11) | ... | MPIE(7) | ... ]
# MPP encoding: 11=Machine, 01=Supervisor, 00=User
li t1, 0x1800 # 0x1800 sets MPP=11 (Machine)
csrc mstatus, t1 # Clear MPP bits
li t2, __E17__ # Load constant for Supervisor Mode
csrs mstatus, t2 # Set MPP bits to Supervisor
# 3. Disable Machine Interrupts (optional safety)
# 4. Perform the switch
__E18__
```
* To set the "Previous Privilege Mode" (MPP) to Supervisor (01), we need to set bit 11 to 1 and bit 12 to 0.
The hex value for `t2` (shifting `1` into bit position 11) is `0x800`.
Fill in __ E17 __ with the hexadecimal immediate to set MPP to Supervisor.
* Which instruction atomically restores the PC from `mepc` and updates the privilege mode based on `mstatus.MPP`?
> Instruction mnemonic: __ E18 __
* Immediately after instruction E18 executes:
* The `pc` will be __ E19 __ (hex value).
* The current privilege mode will be __ E20 __ (M/S/U).
* The "Default Deny" Safety Net
Suppose the firmware developer made a mistake and forgot to configure `pmp1` (the entry granting access to DRAM). Only `pmp0` (protecting firmware) is active.
* The OS kernel starts running in S-mode at `0x8001_0000`.
* It fetches the first instruction.
* Does this address match `pmp0`? (Yes/No) __ E21 __
* Since it does not match any configured PMP entry, what is the default behavior for this Supervisor Mode access?
> Answer: __ E22 __ (Choose: "Allow access" or "Access Fault")
This mechanism ensures that if the secure bootloader doesn't explicitly grant memory regions, the OS cannot execute, preventing a "fail-open" security vulnerability.
### ✅ Solution to Problem E:
E01: ==22==
E02: ==`PTE_V`==
E03: ==12==
E04: ==`sfence.vma`==
E05: ==No==
E06: ==0x80080000==
E07: ==identity==
E08: ==Yes==
E09: ==No==
E10: ==Allowed==
E11: ==Yes==
E12: ==Yes==
E13: ==Trap==
E14: ==No==
E15: ==Yes==
E16: ==Allowed==
E17: ==0x800==
E18: ==`mret`==
E19: ==0x80010000==
E20: ==S==
E21: ==No==
E22: ==Access Fault==
Brief Explanations:
E01 (22): In Sv32, VPN[1] is bits [31:22]. Shift right by 22 to isolate these 10 bits.
E03 (12): VPN[0] is bits [21:12]. Shift right by 12 and mask with 0x3FF to extract the 10 bits.
E02 (`PTE_V`): A PTE is a pointer if V=1 and R=W=X=0. Only set `PTE_V` for pointer PTEs.
E05 (No): If flags=0, the PTE has V=1 but R=W=X=0, making it a pointer to the next level, not a readable leaf page.
E06 (0x80080000): Root PPN = 0x80000000 >> 12 = 0x80000. Mode (Sv32) = 1 << 31 = 0x80000000. `satp` = 0x80000000 | 0x00080000 = 0x80080000.
E07 (identity): Identity mapping (VA == PA) is critical at boot because the PC points to physical memory. If the MMU turns on without identity mapping, the next instruction fetch causes a page fault.
E04 (`sfence.vma`): Supervisor-level instruction to flush the TLB. Use with rs1=zero, rs2=zero to flush all entries.
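As a sanity check on E01, E03, the Part 3 leaf-PTE arithmetic, and E06, the standalone C sketch below (our illustration, not quiz code) plugs the answers into the field extraction and encoding formulas from `map_page` and `setup.S`:
```c
#include <stdint.h>
#include <stdio.h>

#define PTE_V 0x001u
#define PTE_R 0x002u
#define PTE_W 0x004u
#define PTE_X 0x008u

int main(void) {
    uint32_t va = 0x00002000u, pa = 0x80005000u;

    uint32_t vpn1 = (va >> 22) & 0x3FFu;  /* E01 = 22: bits [31:22] */
    uint32_t vpn0 = (va >> 12) & 0x3FFu;  /* E03 = 12: bits [21:12] */
    printf("VPN[1]=0x%03X VPN[0]=0x%03X\n", (unsigned) vpn1, (unsigned) vpn0);

    /* Leaf PTE for the Part 3 example: PPN in [31:10], flags | V in [9:0] */
    uint32_t leaf = ((pa >> 12) << 10) | (PTE_R | PTE_W | PTE_X | PTE_V);
    printf("leaf PTE = 0x%08X\n", (unsigned) leaf);  /* 0x2000140F */

    /* E06: satp = Mode bit (bit 31, Sv32) | root page table PPN */
    uint32_t satp = 0x80000000u | (0x80000000u >> 12);
    printf("satp = 0x%08X\n", (unsigned) satp);      /* 0x80080000 */
    return 0;
}
```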
E08-E10 (M-mode on `pmp0`): M-mode with L=0 ignores the PMP check (passes through). This allows the firmware to access its own memory.
E11-E13 (S-mode on `pmp0`): S-mode checks are always enforced. Since `pmp0` has RWX=0, S-mode access traps, isolating the firmware.
E14-E16 (S-mode on `pmp1`): Address 0x8002_0000 is inside `pmp1` with RWX=1, so S-mode access is allowed.
E17-E20 (Switching to S-mode): MPP is at bits [12:11] of `mstatus`. Set MPP=01 (Supervisor) with 0x800. Execute `mret` to switch mode and jump to `mepc` (0x80010000).
E21-E22 (Default Deny Safety Net): The RISC-V PMP spec states that if no PMP entries match an access, M-mode is allowed, but S-mode and U-mode are denied. This effectively "whitelists" memory for the OS. If the grant (`pmp1`) is missing, the OS crashes immediately upon fetch.
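The PMP decision procedure used in E08-E16 and E21-E22 (lowest-numbered match wins, L=0 lets M-mode pass through, no match denies S/U) fits in a few lines of C. This is a simplified model under this problem's assumptions, not the full PMP spec (ranges are given directly rather than via TOR/NAPOT address encoding):
```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    unsigned lo, hi;  /* matched address range [lo, hi] */
    bool l, r, w, x;  /* Lock bit + permissions */
    bool active;
} Pmp;

/* mode: 'M', 'S', or 'U'; perm: 'r', 'w', or 'x' */
static bool pmp_allows(const Pmp *e, int n, char mode, unsigned addr, char perm) {
    for (int i = 0; i < n; i++) {        /* lowest-numbered match wins */
        if (!e[i].active || addr < e[i].lo || addr > e[i].hi)
            continue;
        if (mode == 'M' && !e[i].l)
            return true;                 /* L=0: M-mode ignores the entry */
        if (perm == 'r') return e[i].r;  /* otherwise enforce R/W/X */
        if (perm == 'w') return e[i].w;
        return e[i].x;
    }
    return mode == 'M';                  /* no match: allow M, deny S/U */
}

int main(void) {
    Pmp e[2] = {
        { 0x80000000u, 0x8000FFFFu, false, false, false, false, true }, /* pmp0 */
        { 0x80000000u, 0x8FFFFFFFu, false, true,  true,  true,  true }, /* pmp1 */
    };
    printf("M write 0x80000050: %d\n", pmp_allows(e, 2, 'M', 0x80000050u, 'w')); /* 1: E10 */
    printf("S write 0x80000050: %d\n", pmp_allows(e, 2, 'S', 0x80000050u, 'w')); /* 0: E13 */
    printf("S write 0x80020000: %d\n", pmp_allows(e, 2, 'S', 0x80020000u, 'w')); /* 1: E16 */
    e[1].active = false;                 /* E21-E22: forgot to configure pmp1 */
    printf("S fetch 0x80010000: %d\n", pmp_allows(e, 2, 'S', 0x80010000u, 'x')); /* 0 */
    return 0;
}
```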
---
## Problem F: Digital Design with Chisel
This problem explores hardware description using Chisel HDL, covering both pipeline control logic and memory controller FSM design.
- [ ] Part 1: Counters and Pipeline Integration
In a RISC-V processor (like the `ca2025-mycpu` 5-stage pipeline), counters are essential components used for:
* CSRs: Such as `mcycle` (cycle counter) and `minstret` (instruction retired counter).
* Hazard Control: Counting cycles to stall the pipeline during multi-cycle operations (e.g., dividing or cache misses).
Performance Counter:
Consider the following Chisel module for a 64-bit performance counter, designed to be part of the Writeback (WB) stage. It cascades two 32-bit registers.
```scala
import chisel3._
import chisel3.util._

class PerformanceCounter extends Module {
  val io = IO(new Bundle {
    val event   = Input(Bool())  // Signal to increment (e.g., instruction retired)
    val read_lo = Input(Bool())  // CPU reading lower 32 bits
    val read_hi = Input(Bool())  // CPU reading upper 32 bits
    val rdata   = Output(UInt(32.W))
  })

  val count_lo = RegInit(0.U(32.W))
  val count_hi = RegInit(0.U(32.W))

  // --- Update Logic ---
  when (io.event) {
    val next_lo = count_lo + 1.U // UInt(32.W) arithmetic wraps around on overflow
    count_lo := next_lo
    // Carry logic: increment the upper half only when the lower half wraps to 0.
    // When count_lo = 0xFFFFFFFF, next_lo = 0x00000000 (overflow/wrap).
    when (next_lo === __F01__) {
      count_hi := count_hi + 1.U
    }
  }

  // --- Read Logic ---
  // Reads return the current register value; an increment in the same cycle
  // becomes visible only after the next clock edge.
  io.rdata := Mux(io.read_lo, count_lo,
              Mux(io.read_hi, __F02__, 0.U))
}
```
Stall Controller:
The Stall Controller in the Instruction Decode (ID) stage handles structural or data hazards by "freezing" the PC and injecting NOPs (bubbles) into the pipeline. The control flow is:
- `Stall_Cnt` is loaded with the number of stall cycles (e.g., 3)
- Each cycle while `Stall_Cnt` > 0: `PC_Enable` = 0, `Bubble` = 1, `Stall_Cnt` := `Stall_Cnt` - 1
- When `Stall_Cnt` reaches 0: `PC_Enable` = 1, `Bubble` = 0
- During a stall: IF and ID stages freeze (hold instruction), ID/EX register receives a bubble (NOP), EX/MEM and MEM/WB continue normally
Timing Diagram (Signal Behavior):
The following ASCII diagram illustrates the desired behavior when a hazard triggers a 3-cycle stall at Cycle T1:
- Trigger: Input signal to start stalling
- `Stall_Cnt`: Internal register value
- `PC_Enable`: Controls the PC register (0 = Freeze, 1 = Update)
- `Bubble`: Controls the ID/EX pipeline register reset (1 = Clear to NOP, 0 = Pass instruction)
```text
Cycle │ T0 │ T1 │ T2 │ T3 │ T4 │
────────────┼───────┼───────┼───────┼───────┼───────┤
Trigger │ 1 │ 0 │ 0 │ 0 │ 0 │
Stall_Cnt │ 0 │ 3 │ 2 │ 1 │ 0 │
Is_Stalling │ 0 │ 1 │ 1 │ 1 │ 0 │
────────────┼───────┼───────┼───────┼───────┼───────┤
PC_Enable │ 1 │ 0 │ 0 │ 0 │ 1 │
Bubble_Out │ 0 │ 1 │ 1 │ 1 │ 0 │
────────────┼───────┼───────┼───────┼───────┼───────┤
Instr @ ID │ I2 │ [I2] │ [I2] │ [I2] │ I3 │
Instr @ EX │ I1 │ NOP │ NOP │ NOP │ I2 │
```
Chisel Implementation:
```scala
class StallController extends Module {
  val io = IO(new Bundle {
    val trigger_stall = Input(Bool())  // Start a 3-cycle stall
    val bubble        = Output(Bool()) // Output to pipeline registers
    val pc_enable     = Output(Bool()) // Control PC update
  })

  // A 2-bit counter to track remaining stall cycles
  val stall_cnt = RegInit(0.U(2.W))

  when (stall_cnt > 0.U) {
    stall_cnt := stall_cnt - 1.U
  } .elsewhen (io.trigger_stall) {
    stall_cnt := 3.U // Load stall duration
  }

  // Generate control signals based on the diagram above
  val is_stalling = (stall_cnt > 0.U)

  // Logic to match the ASCII diagram behavior:
  io.bubble    := __F03__
  io.pc_enable := __F04__
}
```
In `PerformanceCounter`, `count_hi` increments only when `count_lo` overflows. In a 32-bit unsigned adder, `(MaxValue + 1)` wraps to 0. Therefore, the condition ``next_lo === __ F01 __`` detects this wrap-around.
What is the value of __ F01 __?
Fill in blank __ F02 __ to ensure the CPU reads the correct upper 32 bits.
- [ ] Part 2: Stall Control Logic
Refer to the ASCII Timing Diagram and the Chisel code.
To achieve the behavior shown in the `Bubble_Out` row (high while stalling), which signal should be assigned to `io.bubble` at __ F03 __? (Enter `is_stalling` or `!is_stalling`.)
To achieve the behavior shown in the `PC_Enable` row (low while stalling), which signal should be assigned to `io.pc_enable` at __ F04 __? (Enter `is_stalling` or `!is_stalling`.)
In `PerformanceCounter`, the logic implies `count_hi` depends on `count_lo`. Does this create a single critical path spanning all 64 bits of addition in one clock cycle?
Answer (Yes/No): __ F05 __ (Hint: ``next_lo`` is computed, but it is used as a control signal for the *enable* of the separate `count_hi` register).
Refer to the ASCII diagram. The instruction I2 is held in the ID stage for 3 extra cycles (T1, T2, T3). In Cycle T4, the stall ends.
In which cycle does the instruction I3 (the instruction following I2) first enter the ID stage?
Answer: Cycle T__ F06 __.
### ✅ Solution to Problem F (Parts 1-2):
F01: ==0.U==
F02: ==`count_hi`==
F03: ==`is_stalling`==
F04: ==`!is_stalling`==
F05: ==No==
F06: ==4==
Brief Explanations:
F01 (0.U): When UInt(32.W) at 0xFFFFFFFF adds 1, it wraps to 0. Detect overflow by checking `next_lo` === 0.U. Chisel UInt arithmetic automatically wraps on overflow.
F02 (`count_hi`): The multiplexer selects `count_hi` when `read_hi` is asserted. This provides the upper 32 bits of the 64-bit counter.
F03 (`is_stalling`): `Bubble_Out` is High (1) when `Stall_Cnt` > 0 (cycles T1-T3). This matches `is_stalling` condition.
F04 (`!is_stalling`): `PC_Enable` is Low (0) when `Stall_Cnt` > 0. This is the inverse of `is_stalling` (enable only when NOT stalling).
F05 (No): The two 32-bit adders are independent. `count_lo + 1` produces `next_lo`, and the comparison `next_lo === 0.U` yields a single enable bit that selects whether `count_hi` loads `count_hi + 1.U` at the same clock edge. The carry is a 1-bit control signal, not a combinational carry chain, so the critical path is one 32-bit add plus a zero-compare rather than a 64-bit ripple.
F06 (4): I2 enters ID at T0. Stall holds I2 in ID through T1-T3 (PC disabled). At T4, `Stall_Cnt` reaches 0, `PC_Enable` becomes 1, PC updates, and I3 enters ID stage.
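The carry-as-enable structure behind F01 and F05 can be mirrored in software. The minimal C sketch below (ours, not quiz code) models the two 32-bit halves and shows that the upper half needs only a 1-bit enable derived from the lower half's wrap:
```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t lo = 0xFFFFFFFEu, hi = 0; /* start near the wrap point */

    for (int cycle = 0; cycle < 3; cycle++) {  /* three 'event' cycles */
        uint32_t next_lo = lo + 1u;            /* 32-bit add, wraps to 0 */
        int carry_en = (next_lo == 0);         /* F01: compare against 0 */
        if (carry_en)
            hi = hi + 1u;                      /* separate 32-bit adder */
        lo = next_lo;
        printf("hi=0x%08X lo=0x%08X carry_en=%d\n",
               (unsigned) hi, (unsigned) lo, carry_en);
    }
    return 0;
}
```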
- [ ] Part 3: Memory Controller FSM Design
You are designing a synchronous SRAM controller for a RISC-V processor. The raw SRAM chip requires a specific timing sequence:
* Address Setup: Address must be stable 1 cycle before enabling output/write.
* Access: `OE` (Output Enable) or `WE` (Write Enable) must be asserted for at least 1 cycle.
* Hold: Control signals must be de-asserted for 1 cycle before the next transaction.
You are implementing this controller in Chisel 3. The processor communicates via a simplified Decoupled interface (`valid`, `ready`, `addr`, `wdata`, `rdata`).
Hardware Specifications:
* System Clock: 100 MHz (10 ns period).
* SRAM Constraints: Address Setup Time = 2ns, Data Setup Time = 2ns, Output Valid Delay = 8ns.
Chisel 3 FSM Implementation:
Complete the Chisel code to implement the FSM. The FSM states are `sIdle`, `sSetup`, `sAccess`, and `sHold`. State transitions follow the SRAM timing requirements:
- sIdle: Wait for io.cpu.valid. If true, go to sSetup (address setup phase).
- sSetup: Assert chip select and stabilize address. Transition to sAccess (operation phase).
- sAccess: Assert read/write enable based on io.cpu.bits.rw. Transition to sHold (completion phase).
- sHold: De-assert all enables. Transition back to sSetup (if io.cpu.valid for next transaction) or sIdle (otherwise). Data is valid for the CPU during this state for read operations.
The state transitions are:
- F07: Transition from Idle when valid request arrives
- F08: Transition from Setup when address is stable
- F09: Transition from Access when operation is complete
- F10: Transition from Hold back to either Setup or Idle (enables back-to-back transactions)
```scala
import chisel3._
import chisel3.util._

class MemController extends Module {
  val io = IO(new Bundle {
    val cpu = Flipped(Decoupled(new Bundle { // CPU Interface
      val rw   = Bool() // 0=Read, 1=Write
      val addr = UInt(32.W)
      val data = UInt(32.W)
    }))
    val sram = new Bundle { // SRAM Physical Interface
      val cs = Output(Bool()) // Chip Select (Active High)
      val we = Output(Bool()) // Write Enable (Active High)
      val oe = Output(Bool()) // Output Enable (Active High)
    }
  })

  // State Definitions
  val sIdle :: sSetup :: sAccess :: sHold :: Nil = Enum(4)
  val state = RegInit(sIdle)

  // Default Outputs
  io.cpu.ready := false.B
  io.sram.cs   := false.B
  io.sram.we   := false.B
  io.sram.oe   := false.B

  switch(state) {
    is(sIdle) {
      when (io.cpu.valid) {
        state := __F07__
      }
    }
    is(sSetup) {
      io.sram.cs := true.B
      state := __F08__
    }
    is(sAccess) {
      io.sram.cs := true.B
      // If writing, assert WE; if reading, assert OE
      when (io.cpu.bits.rw) {
        io.sram.we := true.B
      } .otherwise {
        io.sram.oe := true.B
      }
      state := __F09__
    }
    is(sHold) {
      // Transaction complete, handshake with CPU
      io.cpu.ready := true.B
      // Support back-to-back transactions
      when (io.cpu.valid) {
        state := sSetup
      } .otherwise {
        state := sIdle
      }
    }
  }
}
```
* State transitions from Idle: __ F07 __
* State transitions from Setup: __ F08 __
* State transitions from Access: __ F09 __
* State transitions from Hold: __ F10 __
- [ ] Part 4: Timing Analysis
Read Operation Timing:
In sAccess, the controller asserts OE. The SRAM takes 8ns (Output Valid Delay) to put data on the bus. The data must be stable at the processor's register input by the rising edge of the next clock (transition from sAccess to sHold). Clock period = 10ns, processor setup time = 1ns.
Slack = (Clock Period) - (SRAM Output Delay) - (Processor Setup Time)
Slack = 10ns - 8ns - 1ns = __ F11 __ ns.
Is this timing met? (Yes/No): __ F12 __
Timing Margin and Wait States:
If the SRAM Output Delay increased to 12ns, we would need additional wait cycles. In which state would you add a counter to pause execution to accommodate this longer delay?
> Answer: __ F13 __
- [ ] Part 5: RISC-V Assembly Verification
You write a bare-metal driver to verify this memory controller. The processor pipeline stalls the MEM stage until io.cpu.ready is true.
Code sequence:
```assembly
# x10 holds Base Address of the memory controller
li x5, 0xDEADBEEF
sw x5, 0(x10) # Write transaction
lw x6, 0(x10) # Read transaction (verify)
```
Transaction Latency (F14):
When sw executes, the Memory Controller steps through sIdle → sSetup → sAccess → sHold. The ready signal is asserted in sHold. How many total clock cycles does the store instruction occupy the memory stage (from request to ready)?
> Answer: __ F14 __ cycles.
Maximum Bandwidth (F15):
If the processor runs at 100 MHz, what is the maximum bandwidth (in Millions of Transactions Per Second, MT/s) this controller can sustain for continuous back-to-back operations? (Hint: Total cycles per transaction = F14. Frequency / Cycles)
> Answer: __ F15 __ MT/s.
### ✅ Solution to Problem F (Parts 3-5):
F07: ==sSetup==
F08: ==sAccess==
F09: ==sHold==
F10: ==sSetup or sIdle==
F11: ==1==
F12: ==Yes==
F13: ==sAccess==
F14: ==4==
F15: ==25==
Brief Explanations for F07-F15:
F07 (sSetup): From Idle, when a valid request arrives, transition to Setup to begin address stabilization.
F08 (sAccess): From Setup, always transition to Access to assert the read/write enable signals.
F09 (sHold): From Access, always transition to Hold to complete the transaction and de-assert the enables.
F10 (sSetup or sIdle): From Hold, transition to Setup if another valid request is waiting (back-to-back transactions); otherwise return to Idle.
F11 (1): Slack = Clock Period − SRAM Output Delay − Processor Setup Time = 10 ns − 8 ns − 1 ns = 1 ns.
F12 (Yes): The slack is positive (+1 ns), so the read data is stable at the processor's register input before the clock edge; the 2 ns address and data setup requirements are also easily covered by the full 10 ns Setup cycle.
F13 (sAccess): The SRAM output delay elapses while OE is asserted in sAccess. If the delay grew to 12 ns, the FSM would need a wait-state counter that holds it in sAccess for extra cycles until the data is valid.
F14 (4): The store passes through 4 states: sIdle (detect request) → sSetup (1 cycle) → sAccess (1 cycle) → sHold (1 cycle, ready asserted). Total = 4 cycles from request to ready.
F15 (25): With 4 cycles per transaction at 100 MHz, maximum throughput = 100 MHz / 4 = 25 million transactions per second (MT/s).
---
## Problem G: RISC-V Vector Extension (RVV)
This problem explores the RISC-V Vector (RVV) extension, covering both correctness (debugging vector code) and performance (cycle analysis).
- [ ] Part 1: Debugging Vector Assembly
Consider a RISC-V system enhanced with the RISC-V Vector (RVV) extension and a custom vector accelerator. The vector unit has:
* 64 vector registers (v0-v63), each 512 bits wide
* Configurable vector length (vl) up to 64 elements
* Support for element widths of 8, 16, 32, or 64 bits
* 2 vector load/store units, 2 vector ALU units, 1 vector multiplier unit
Note: For this problem, assume the microarchitecture can use multiple register groups (LMUL > 1) to support up to 64 elements even though each physical vector register is 512 bits wide. You do not need to reason about LMUL details; just assume VLMAX = 64 elements for SEW=32.
The following vector code implements a matrix-vector multiplication kernel (y = A*x):
```
# Matrix A: 32x32 elements, stored row-major (a0 points to start)
# Vector x: 32 elements (a1 points to start)
# Vector y: 32 elements (a2 points to start), compute y = A*x
li a3, 32 # matrix dimension
vsetvli x0, x0, e32 # set element width to 32 bits
row_loop:
slli t0, t2, 5 # t2 is row index, t0 = row_idx * 32
add t3, a0, t0 # t3 = base address of current row of A
vle32.v v1, (t3) # load one row of matrix A to v1
vle32.v v2, (a1) # load vector x to v2
vfmul.vv v3, v1, v2 # multiply row * x element-wise
vfredsum.vs v4, v3, v4.zero # reduce sum to get y[row]
vse32.v v4, (a2) # store result y[row]
addi t2, t2, 1 # increment row counter
blt t2, a3, row_loop # continue if more rows
```
Identify the bugs in the above vector code implementation:
> Note on RISC-V Register Naming: RISC-V distinguishes between general-purpose integer registers (`x0`–`x31`), floating-point registers (`f0`–`f31`), and vector registers (`v0`–`v63`). Integer loop counters use `x` registers (e.g., `t2` is `x7`), while vector accumulators use `v` registers. When answering the following, be careful to identify the correct register type.
The loop index register __ G01 __ is not initialized before the loop. Without initialization, the starting value is undefined, and `slli t0, t2, ...` uses an undefined value. Add `li t2, 0` before the loop.
The reduction instruction `vfredsum.vs v4, v3, v4.zero` uses a pseudo-operand `v4.zero` to denote a zero seed. In RVV semantics, `vfredsum.vs vd, vs2, vs1` takes the seed from the **scalar element** `vs1[0]` (the first element of vector register `vs1`). The notation `v4.zero` in the problem is a pseudo-operand suggesting the seed is zero; in actual implementation, this requires initializing the scalar element of the seed register beforehand. Answer: __ G02 __ (specify the seed register or initialization requirement).
The instruction `vsetvli x0, x0, e32` sets the vector length (vl) to VLMAX (hardware maximum), ignoring the actual matrix dimension in register `a3`. This causes processing all 64 elements at once instead of respecting loop bounds. The corrected instruction is: `vsetvli __ G03 __, a3, e32`, which sets `vl` to min(VLMAX, value in a3).
The instruction `vle32.v v2, (a1)` loads vector x. Does `a1` change inside the loop? __ G04 __ (Yes/No). In a y = A*x operation, reloading x for every row is functionally correct (x is reused), but it is suboptimal for memory bandwidth. A better implementation would hoist the x load outside the row loop.
Calculate the theoretical peak performance for a 2.5 GHz vector unit that can process up to 8 32-bit elements simultaneously (matching the 8-lane design in Part 2). Be explicit about your convention: are you counting each element-wise arithmetic operation as one operation (20 GOPS), or counting a fused multiply-add as two operations (40 GOPS)? State your assumption clearly.
Peak performance in GOPS = __ G05 __
To maximize utilization of multiple vector functional units, you would use (choose one):
> (a) Sequential execution only
> (b) Vector chaining to overlap operations
> (c) Reduce parallelism
> (d) Ignore functional unit types
> Answer: __ G06 __
- [ ] Part 2: Performance Analysis and Timing
You are given a 32-bit RISC-V system that supports the RVV vector extension.
We focus on 32-bit integer elements (SEW = 32). The hardware has the following properties:
* Vector length (VL) is fixed at 64 elements for this program.
* The vector unit has 8 lanes, so it can process 8 elements per cycle in a fully pipelined fashion.
* All vector instructions are issued in program order.
* There is a single shared vector pipeline; only one vector instruction may be in its execution phase at a time.
Instruction latencies and behavior:
* `vle32.v` (vector load of 32-bit elements)
* Startup latency: 4 cycles
* Then one group of up to 8 elements per cycle
* `vse32.v` (vector store of 32-bit elements)
* Startup latency: 4 cycles
* Then one group of up to 8 elements per cycle
* `vadd.vv` (vector add)
* Startup latency: 1 cycle
* Then one group of up to 8 elements per cycle
* `vredsum.vs` (vector reduction sum)
* Startup latency: 2 cycles
* Then one group of up to 8 input elements per cycle until a single scalar result is produced
Assume:
* No structural hazards other than the "only one vector instruction executing at once" rule.
* Memory system is fast enough that loads and stores never stall once started.
We want to compute the dot product of two length-64 integer vectors:
```c
// a and b point to arrays of 64 int32_t
// result is a scalar int32_t
int32_t dot_product(const int32_t *a, const int32_t *b) {
    int32_t result = 0;
    for (int i = 0; i < 64; i++)
        result += a[i] * b[i];
    return result;
}
```
The compiler emits a single RVV loop body (ignoring loop overhead) equivalent to:
```text
vle32.v v1, (x1) # load 64 elements from a[]
vle32.v v2, (x2) # load 64 elements from b[]
vmul.vv v3, v1, v2 # element-wise multiply
vredsum.vs v4, v3, v0 # reduction sum into v4[0]
vse32.v v4, (x3) # store the scalar result v4[0]
```
For this problem, you may treat `vmul.vv` as having the same timing as `vadd.vv`:
* Startup: 1 cycle
* Then one group of 8 elements per cycle
Because VL = 64 and there are 8 lanes, each vector instruction processes its data in exactly 64 / 8 = 8 groups.
We will compute the total execution time (in cycles) for this sequence.
- [ ] Part 3: Cycles per instruction (without overlap)
For each vector instruction, compute the total number of cycles it occupies the vector pipeline, given:
```text
total_cycles = startup_latency + number_of_element_groups
```
Recall: number_of_element_groups = 64 / 8 = 8.
Fill in:
* `vle32.v v1, (x1)` total cycles: __ G07 __
* `vle32.v v2, (x2)` total cycles: __ G08 __
* `vmul.vv v3, v1, v2` total cycles: __ G09 __
* `vredsum.vs v4, v3, v0` total cycles: __ G10 __
* `vse32.v v4, (x3)` total cycles: __ G11 __
- [ ] Part 4: Total cycles for the full dot-product kernel
Because no chaining is allowed, the five vector instructions execute strictly one after another.
Let the total cycle count be the sum of the individual times from Part 3.
Fill in:
Total cycles for:
```text
vle32.v
vle32.v
vmul.vv
vredsum.vs
vse32.v
```
executed back-to-back with no overlap: __ G12 __ cycles.
### ✅ Solution to Problem G:
Part 1: Debugging Vector Assembly
G01: ==t2==
G02: ==v4 (initialized with v4[0] = 0 before reduction)==
G03: ==t0==
G04: ==No==
G05: ==20==
G06: ==(b) Vector chaining==
Detailed Explanations for Part 1:
G01-G02 (Missing Initialization): The code has two uninitialized registers:
* G01 (`t2`): The loop index register used as the row counter. It is never initialized to 0 before entering the `row_loop` label. This causes the loop to read from an undefined row position initially. **Fix**: Add `li t2, 0` before `row_loop`.
* G02 (`v4` with initialization requirement): The reduction instruction `vfredsum.vs v4, v3, v4.zero` follows RVV semantics: `vfredsum.vs vd, vs2, vs1` takes the seed from **the scalar element `vs1[0]`** (the first element of the seed vector register). In this case, the seed register is `v4`, meaning the reduction sum is seeded from `v4[0]`. The pseudo-operand notation `v4.zero` indicates that `v4[0]` should be zero as the initial accumulator value. **Critical implementation detail**: Before executing the reduction, the seed register `v4` must be explicitly initialized so that `v4[0] = 0`, typically using `vmv.s.x v4, x0` to move a scalar zero into the vector register. The answer is `v4` (understanding that it must be initialized beforehand with `v4[0] = 0`).
G03 (Incorrect Vector Length): The instruction `vsetvli x0, x0, e32` sets vector length to VLMAX (hardware maximum vector length). According to the RISC-V V-extension specification, when `rs1` is `x0`, the AVL (Application Vector Length) is set to VLMAX, which causes the vector unit to process the maximum number of elements in one go, completely ignoring the actual remaining loop count in `a3`. This breaks strip-mining.
The corrected instruction should be `vsetvli t0, a3, e32`, which sets `vl` to the minimum of VLMAX and the value in `a3`, enabling proper strip-mining. This ensures that each iteration processes only the remaining elements up to the loop bound.
G04 (Incorrect Addressing): The instruction `vle32.v v2, (a1)` loads the vector x, but `a1` does not change inside the loop (answer: ==No==). This causes the same vector to be loaded every iteration instead of loading different portions of the vector.
G05 (Peak Performance): The vector unit runs at 2.5 GHz = 2.5 × 10^9 cycles/second and processes 8 32-bit elements per cycle (matching the 8-lane design in Part 2). Counting each element-wise arithmetic operation as one operation, peak = 2.5 × 10^9 × 8 = 20 × 10^9 operations/second, i.e., ==20 GOPS== (counting a fused multiply-add as two operations would give 40 GOPS instead).
G06 (Optimization Strategy): The answer is (b) Vector chaining. Vector chaining allows overlapping operations by having one operation begin using results from another operation before the first one completely finishes, maximizing utilization of multiple functional units.
Part 2: Performance Analysis and Timing
G07: ==12==
G08: ==12==
G09: ==9==
G10: ==10==
G11: ==12==
G12: ==55==
Detailed Explanations for Part 2:
* G07 (vle32.v v1): Startup = 4, plus 8 groups × 1 cycle = 4 + 8 = 12 cycles
* G08 (vle32.v v2): Startup = 4, plus 8 groups × 1 cycle = 4 + 8 = 12 cycles
* G09 (vmul.vv v3): Startup = 1, plus 8 groups × 1 cycle = 1 + 8 = 9 cycles
* G10 (vredsum.vs v4): Startup = 2, plus 8 groups × 1 cycle = 2 + 8 = 10 cycles
* G11 (vse32.v v4): Startup = 4, plus 8 groups × 1 cycle = 4 + 8 = 12 cycles
* G12 (Total): 12 + 12 + 9 + 10 + 12 = 55 cycles (sum of all instruction times with no overlap/chaining)
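The per-instruction formula `total_cycles = startup_latency + number_of_element_groups` can be tabulated directly. A small C sketch (illustrative only) that reproduces G07-G12 for VL = 64 and 8 lanes:
```c
#include <stdio.h>

int main(void) {
    const int vl = 64, lanes = 8;
    const int groups = (vl + lanes - 1) / lanes;  /* 8 element groups */

    /* Startup latencies from the problem statement */
    struct { const char *name; int startup; } insn[] = {
        { "vle32.v  v1",   4 }, { "vle32.v  v2",   4 },
        { "vmul.vv  v3",   1 }, { "vredsum.vs v4", 2 },
        { "vse32.v  v4",   4 },
    };

    int total = 0;
    for (int i = 0; i < 5; i++) {
        int cycles = insn[i].startup + groups;    /* no chaining/overlap */
        total += cycles;
        printf("%-14s %2d cycles\n", insn[i].name, cycles);
    }
    printf("total: %d cycles\n", total);          /* expect 55 (G12) */
    return 0;
}
```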
---
## Problem H: Advanced Memory Systems and Prefetching
Consider a multi-level cache hierarchy with the following characteristics:
- L1 Data Cache: 32KB, 64-byte blocks, 8-way set associative, 3-cycle access time
- L2 Cache: 512KB, 64-byte blocks, 16-way set associative, 15-cycle access time
- L3 Cache: 8MB, 64-byte blocks, direct-mapped, 50-cycle access time
- Main Memory: 100-cycle access time
- Miss rates: L1 (5%), L2 given L1 miss (20%), L3 given L2 miss (5%)
- Prefetcher: Stride-based, can prefetch up to 2 cache lines ahead
Given a program that accesses a 4MB array with a 2D access pattern: `array[i][j]` where the array is 2048x512 integers (4 bytes each), and it accesses the array in row-major order.
Important Definitions:
1. Define the miss penalty at a given cache level as the extra time spent beyond that level's hit time. For example:
- The L1 miss penalty includes the L2 access time and any further misses, but does not include the L1 hit time itself.
- Correspondingly, the prefetcher reduces the latency of accesses that go beyond L2 (i.e., the L2 miss service time and main memory access), and does not change the L2 hit time itself.
2. Miss Rate Scope: For this problem, use the given miss rates (L1: 5%, L2 given L1 miss: 20%, L3 given L2 miss: 5%). You do not need to derive miss rates from the 2D access pattern; these values represent the observable behavior for this workload on this cache hierarchy.
- [ ] Part 1: AMAT Calculation with Prefetching
Part 1a: Without Prefetcher
Calculate the cost of accessing beyond L1 (if L1 misses):
- Cost = L2_hit_time + P(L2 miss | L1 miss) × (L3_hit_time + P(L3 miss | L2 miss) × memory_time)
- Answer: __ H01 __ cycles
Then: AMAT = 3 + 0.05 × (__ H01 __) = __ H02 __ cycles
Part 1b: With Prefetcher
The prefetcher covers 80% of the accesses that go beyond L2 (whose baseline cost is 11 cycles), servicing them at half the penalty; the remaining 20% pay the full cost:
- Effective cost beyond L2: __ H03 __ cycles (0.8 × 5.5 + 0.2 × 11)
- New L1 miss penalty: 15 + (__ H03 __) = __ H04 __ cycles
- AMAT with prefetcher = 3 + 0.05 × (__ H04 __) = __ H05 __ cycles
- [ ] Part 2: Memory vs Compute Bound Analysis
Part 2a: Bandwidth Analysis
Calculate CPU peak data demand:
- CPU demand = 8 bytes/cycle × 2.5 GHz = __ H06 __ GB/s
Compare with memory supply:
- Memory provides 64 GB/s; CPU needs __ H06 __ GB/s → Ratio: __ H07 __ (as %)
Part 2b: Conclusion
Based on low AMAT (__ H05 __ cycles ≈ L1 hit time of 3 cycles) and ample bandwidth (64 GB/s vs. __ H06 __ GB/s demand), the system is more likely __ H08 __ (memory-bound / compute-bound) because: __ H09 __
### ✅ Solution to Problem H:
Part 1a: AMAT Without Prefetcher
H01: Cost = 15 + 0.20 × (50 + 0.05 × 100) = 15 + 0.20 × 55 = 15 + 11 = 26 cycles
H02: AMAT = 3 + 0.05 × 26 = 3 + 1.3 = 4.3 cycles
Part 1b: With Prefetcher
The prefetcher reduces the beyond-L2 cost (11 cycles) by applying 80% coverage at 50% reduction:
- Effective cost beyond L2: 0.8 × (11/2) + 0.2 × 11 = 4.4 + 2.2 = 6.6 cycles → H03
- New L1 miss penalty: 15 + 6.6 = 21.6 cycles → H04
- H05: AMAT = 3 + 0.05 × 21.6 = 4.08 cycles
Part 2a: Bandwidth Analysis
H06: CPU demand = 8 bytes/cycle × 2.5 GHz = 20 GB/s
H07: Ratio = 20 / 64 = 31.25% (the memory system has 3.2× headroom over peak CPU demand)
Part 2b: Conclusion
H08: compute-bound
H09: "The system has ample bandwidth (64 GB/s available vs. 20 GB/s demand) and low latency (AMAT 4.08 cycles ≈ L1 hit time), so performance is limited by CPU execution capability, not memory access."
Key Insight (Latency vs. Bandwidth):
The AMAT calculation in Part 1 characterizes the average latency per access — it tells us how long each memory access takes on average. The bandwidth comparison in Part 2 checks whether the memory system can sustain the average data rate demanded by the processor. Both perspectives are essential for determining memory-bound vs. compute-bound behavior:
- Low AMAT alone does not guarantee compute-bound: A system might have low average latency but still be memory-bound if the CPU demands more bandwidth than the memory system can supply.
- High bandwidth availability alone is not sufficient: A system with ample bandwidth might still be memory-bound if latency is very high (e.g., long AMAT).
- Both must align: For compute-bound behavior, we need both reasonable latency (demonstrated by AMAT) and sufficient bandwidth (demonstrated by the supply–demand comparison). This problem exhibits both, confirming compute-bound behavior.
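All the H blanks follow from the AMAT decomposition and the supply/demand ratio. The C sketch below (ours, for checking the arithmetic mechanically) recomputes H01-H07:
```c
#include <stdio.h>

int main(void) {
    /* Hit times (cycles) and conditional miss rates from the problem */
    const double l1 = 3, l2 = 15, l3 = 50, mem = 100;
    const double m1 = 0.05, m2 = 0.20, m3 = 0.05;

    double beyond_l1 = l2 + m2 * (l3 + m3 * mem);  /* H01 = 26 */
    double amat      = l1 + m1 * beyond_l1;        /* H02 = 4.3 */

    double beyond_l2 = m2 * (l3 + m3 * mem);       /* 11 cycles */
    double pf        = 0.8 * (beyond_l2 / 2) + 0.2 * beyond_l2; /* H03 = 6.6 */
    double amat_pf   = l1 + m1 * (l2 + pf);        /* H05 = 4.08 */

    double demand = 8.0 * 2.5;                     /* H06 = 20 GB/s */
    printf("H01=%.1f H02=%.2f H03=%.1f H05=%.3f\n",
           beyond_l1, amat, pf, amat_pf);
    printf("H06=%.0f GB/s, ratio=%.2f%%\n", demand, 100.0 * demand / 64.0);
    return 0;
}
```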
---
## Problem I: Pipeline Timing, Structural Hazard Reasoning, and Multi-Cycle Functional Units
You are given a five-stage RV32I pipeline similar to the classic MIPS/RISC-V educational pipeline:
```
IF → ID → EX → MEM → WB
```
The base pipeline supports:
* Full data forwarding (EX/MEM → EX, MEM/WB → EX), including store data
* Load–use stall: exactly 1 bubble when an instruction uses a load result immediately
* Branch resolution in EX (predict-not-taken)
Now, the processor is extended with a multi-cycle multiply functional unit:
* Integer multiply (`mul`) takes 3 cycles in a dedicated multiplier pipeline (separate from the main EX stage).
* While a multiply occupies the multiplier pipeline, no other multiply may enter that pipeline.
* For this problem, however, assume in-order single issue: while the multiply is executing, younger instructions may not enter EX (they wait in ID), which creates the structural hazard analyzed below.
* IF and ID continue to run normally (unless a stall is introduced).
* Forwarding of multiply results works exactly like normal ALU results once the multiply is complete.
You are given the following instruction sequence:
```
I1: lw x3, 0(x10)
I2: mul x4, x3, x6
I3: add x7, x4, x5
I4: sw x7, 8(x10)
I5: addi x8, x7, 1
I6: beq x7, x0, L1
I7: addi x9, x9, 1
```
The cycle-by-cycle occupancy of the multiplier for `mul` looks like:
```
Cycle N EX stage holds I2 (mul), cycle 1 of 3
Cycle N+1 EX stage holds I2 (mul), cycle 2 of 3
Cycle N+2 EX stage holds I2 (mul), cycle 3 of 3
Cycle N+3 I2 leaves EX → MEM
```
Other instructions cannot enter EX during N, N+1, N+2.
- [ ] Part 1: RAW dependencies (fill all blanks)
Identify all producer → consumer RAW dependencies and fill the table:
| Producer | Consumer | Register |
| -------- | -------- | -------------------------- |
| I1 | I2 | __ I01 __ |
| I2 | I3 | __ I02 __ |
| I3 | I4 | __ I03 __ |
| I3 | I5 | __ I04 __ |
| I3 | I6 | __ I05 __ |
| I6 | I7 | __ I06 __ (Register, or "none" if no RAW) |
Note: Think carefully about the last one.
- [ ] Part 2: Load–use stall and dependency behavior (prior to considering multi-cycle EX)
Ignoring the multi-cycle multiply for now, and considering only load–use and forwarding, fill:
* I1 → I2 causes __ I07 __ stall cycles (0 or 1)
* I2 → I3 normally causes __ I08 __ stall cycles (0 or 1)
* I3 → I4 normally causes __ I09 __ stall cycles
* I3 → I5 normally causes __ I10 __ stall cycles
* I3 → I6 normally causes __ I11 __ stall cycles
For each blank, write 0 if forwarding resolves it, or 1 if a stall is needed.
- [ ] Part 3: Structural hazard from multi-cycle EX
Now include the fact that I2 (`mul`) occupies EX for 3 cycles.
Because no other instruction may enter EX during this time:
* Which instruction is the first to be blocked from entering EX?
Answer: __ I12 __
* How many cycles does this instruction wait purely due to the EX structural hazard?
(Not counting any data hazard stalls already required.)
Answer: __ I13 __ cycles
* Which later instructions are also delayed by this EX blockage? Fill the sequence:
Instructions delayed: __ I14 __, __ I15 __, __ I16 __, __ I17 __
* Does this multi-cycle EX unit eliminate the need for any load–use stall you identified in Part B?
Answer "yes" or "no": __ I18 __
- [ ] Part 4: ASCII timing illustration (fill in missing cycle numbers)
Consider the timeline below showing the multi-cycle multiply structural hazard:
```
Cycle:   C      C+1    C+2    C+3    C+4    C+5
       ┌──────┬──────┬──────┬──────┬──────┬──────┐
I1:    │ WB   │      │      │      │      │      │ ← lw finished MEM in cycle C-1
       ├──────┼──────┼──────┼──────┼──────┼──────┤
I2:    │ EX1* │ EX2* │ EX3* │ MEM  │ WB   │      │ ← mul occupies multiplier (C, C+1, C+2)
       ├──────┼──────┼──────┼──────┼──────┼──────┤
I3:    │ L    │ S    │ S    │ EX   │ MEM  │ WB   │ ← blocked from EX while I2 occupies it
       ├──────┼──────┼──────┼──────┼──────┼──────┤
I4:    │ IF   │ IF   │ IF   │ ID   │ EX   │ MEM  │ ← ripple delay from I3's stall
       └──────┴──────┴──────┴──────┴──────┴──────┘
* = Multiplier pipeline, separate from main EX
L = slip inherited from the load–use stall (I1→I2 dependency on x3); I3 sits in ID
S = Structural stall (multiply blocks EX)
Total extra cycles for I3: 3 (1 load–use ripple + 2 structural)
Pure structural stalls (I13 answer): 2 cycles (C+1, C+2)
```
Fill in the blanks:
I19: The first cycle in which I2 occupies the EX stage. (Answer: __ I19 __)
I20-I21: The two cycles in which I3 is blocked from entering EX due to I2's occupancy. (Answer: __ I20 __ and __ I21 __)
I22: The cycle in which I3 finally enters EX after I2 completes. (Answer: __ I22 __)
- [ ] Part 5: Branch timing interaction with multi-cycle EX
I6 is:
```
I6: beq x7, x0, L1
```
Given:
* Branch resolves in EX
* I3 produces x7
* I3 is delayed by the multiply
* Thus I6's comparison is also delayed
Fill in:
* The earliest cycle when I3's result x7 can reach EX for comparison by I6 is: cycle __ I23 __
* This means the branch outcome is not known until: cycle __ I24 __
* Under predict-not-taken, the processor will speculatively fetch I7.
I7 will be flushed if the branch is taken.
The number of wasted cycles caused by this delayed branch resolution is: __ I25 __
Hint:
- Consider when the branch result becomes available and when instructions are speculatively fetched.
- If the branch is taken, how many of the speculatively fetched instructions would need to be flushed?
### ✅ Solution to Problem I:
I01: ==x3==
I02: ==x4==
I03: ==x7==
I04: ==x7==
I05: ==x7==
I06: ==none==
I07: ==1==
I08: ==0==
I09: ==0==
I10: ==0==
I11: ==0==
I12: ==I3==
I13: ==2==
I14: ==I4==
I15: ==I5==
I16: ==I6==
I17: ==I7==
I18: ==No==
I19: ==C==
I20: ==C+1==
I21: ==C+2==
I22: ==C+3==
I23: ==C+4==
I24: ==C+4==
I25: ==2==
Detailed Explanations:
Part 1 - RAW Dependencies:
- I01-I06: Identify register dependencies between instructions (I1→I2 on x3, I2→I3 on x4, I3→{I4,I5,I6} on x7, I6 has no RAW to I7)
Part 2 - Load-use Stalls:
- I07-I11: Load-use stalls occur when a load result is used immediately (I1→I2 causes 1 stall; normal ALU forwarding resolves others)
Part 3 - Structural Hazards (Detailed Analysis):
The multiply instruction (I2) occupies the EX stage for 3 cycles. During this time, no other instruction can enter EX.
- I12 (First Blocked Instruction): I3 (add) is the first instruction blocked from entering EX
- I1 completes its EX stage before I2 starts, so I1 is unaffected
- I3 is ready to leave ID at the end of cycle C, but I2 still occupies the EX/multiplier slot during C+1 and C+2 (cycles 2 and 3 of the multiply)
- Therefore I3 waits in ID through C+1 and C+2
- I13 (Stall Cycles): 2 cycles of pure structural stall
- I3 is blocked during C+1 and C+2 (2 stall cycles)
- I3 finally enters EX at cycle C+3, after I2 leaves EX at the end of C+2
- No extra data-hazard stall is added on top: I2's result x4 is forwarded to I3's EX at C+3
- I14-I17 (Ripple Effect): I4, I5, I6, I7 are all delayed by this same 2-cycle structural blockage
- I18 (Load-Use Stall Elimination): No, the structural hazard does not eliminate the load-use stall between I1→I2
- I2 still needs x3 from I1's load; even with forwarding, I2 must wait one bubble so that the loaded value (available after I1's MEM) can feed I2's EX
- That one-cycle slip happens before the window shown in the diagram; the 2 structural stall cycles are added on top of it
- The two penalties accumulate: the multiply's EX occupancy dominates (2 cycles vs. 1), but it does not hide the load-use bubble
Part 4 - Timing Illustration (Corrected):
- I19: C (I2 occupies EX at cycle C)
- I20-I21: C+1, C+2 (I3 stalled in ID while I2 occupies EX during cycles C, C+1, C+2)
- I22: C+3 (I3 finally enters EX after I2 completes its 3-cycle multiply)
Part 5 - Branch Timing:
- I23-I25: Branch resolution delayed due to multiply blockage; I7 fetched speculatively before branch resolves, wasting 2 cycles if branch is taken
---
## Problem J: BitNet SIMD Extension and Hardware Acceleration
Background: Multiplication-Free Inference
Recent advances in Large Language Models (LLMs), such as BitNet b1.58, utilize extreme quantization where weights are restricted to ternary values {-1, 0, +1}. This paradigm shift allows the expensive Matrix-Vector Multiplication (GEMM) to be replaced by purely additive operations.
In a standard dot product `y = sum(w_i * x_i)`:
1. If `w_i = 0`, the term is ignored.
2. If `w_i = 1`, the operation becomes `y = y + x_i`.
3. If `w_i = -1`, the operation becomes `y = y - x_i`.
This eliminates the need for hardware multipliers, reducing power and area. To accelerate this on RISC-V, we design a custom BitNet SIMD Extension. Unlike the general-purpose RISC-V Vector (RVV) extension which requires large vector register files, this extension is a lightweight modification to the integer datapath using custom opcodes.
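A scalar C reference of this trick makes the claim concrete (illustrative; the quiz's hardware performs the same selection with multiplexers instead of branches):
```c
#include <stdint.h>
#include <stdio.h>

/* Reference ternary dot product: w[i] in {-1, 0, +1}; no multiplies. */
static int32_t ternary_dot(const int8_t *w, const int8_t *x, int n) {
    int32_t y = 0;
    for (int i = 0; i < n; i++) {
        if (w[i] == 1)       y += x[i]; /* w = +1: add the activation      */
        else if (w[i] == -1) y -= x[i]; /* w = -1: subtract the activation */
        /* w = 0: the term is skipped entirely */
    }
    return y;
}

int main(void) {
    const int8_t w[4] = { 1, -1, 0, 1 };
    const int8_t x[4] = { 10, 20, 30, 40 };
    printf("%d\n", ternary_dot(w, x, 4)); /* 10 - 20 + 0 + 40 = 30 */
    return 0;
}
```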
Instruction Set Specification
The extension uses the RISC-V custom-0 opcode space (`0x0B`).
1. `bn.store rs1, rs2`
- Function: Loads 32 bits of compressed weights (2 bits per weight, 16 weights total) from `rs1` into a dedicated internal Weight Buffer. `rs2` is ignored and must be set to x0.
- Format: R-type, `opcode=0x0B`, `funct3=0x2`, `rd=x0`.
2. `bn.sum8 rd, rs1`
- Function: Performs a SIMD dot product. It takes 8-bit activation values packed in `rs1` (standard 32-bit register containing four 8-bit integers) and computes the dot product against weights currently stored in the Weight Buffer.
- Format: R-type, `opcode=0x0B`, `funct3=0x1`, `rs2=x0` (unused).
- Operation: `rd = rd + dot_product(rs1_bytes, weight_buffer)`. Note: It accumulates into `rd`.
Instruction Encoding Format (R-type)
```text
31 25 24 20 19 15 14 12 11 7 6 0
| funct7 | rs2 | rs1 | funct3 | rd | opcode |
```
- [ ] Part 1: Concept and Microarchitecture
* In a standard RISC-V core without this extension, computing `w * x` where `w` is ternary typically requires loading the weight, branching to check its value, and then executing an `add` or `sub`. This introduces control flow hazards. In the `bn.sum8` hardware datapath, the logic that replaces the multiplier is a network of __ J01 __ (choose: ALUs / Multiplexers / Floating Point Units) that select between `+x`, `-x`, or `0` based on the weight bits.
* The `bn.store` instruction writes to a specialized "Weight Buffer" rather than a general-purpose register. This design is "stateful." If an operating system context switch occurs between `bn.store` and `bn.sum8`, the OS must explicitly save and restore this buffer. This increases the __ J02 __ (choose: arithmetic latency / context switch overhead / instruction size).
* Standard RVV (Vector Extension) supports variable vector lengths and strided loads. In contrast, this BitNet extension operates on fixed 32-bit integer registers (treating them as packed SIMD data). This approach is generally __ J03 __ (choose: more complex / simpler) to implement in hardware than full RVV.
- [ ] Part 2: Instruction Encoding
You are writing an assembler for this extension. Calculate the binary or hexadecimal fields for the instruction `bn.sum8 x10, x11` (accumulate result into `x10` using activations in `x11`).
Assume: `opcode = 0001011` (binary for 0x0B).
* `opcode` field (bits 6:0) in binary: __ J04 __
* `rd` field (bits 11:7) for `x10`: __ J05 __ (binary, 5 bits)
* `funct3` field (bits 14:12): __ J06 __ (binary, 3 bits)
* `rs1` field (bits 19:15) for `x11`: __ J07 __ (binary, 5 bits)
* `rs2` field (bits 24:20): Since `bn.sum8` does not use a second register source for data (weights come from the buffer), this field should be set to the register index of x0, which is __ J08 __ (binary, 5 bits).
- [ ] Part 3: Performance Analysis
Consider a scenario computing the dot product of a vector of length 16.
Scenario 1: Standard RV32I
Loop unrolled. For each element: load weight (1 cycle) + load activation (1 cycle) + branch to test the weight value (≈2 cycles) + add/sub (1 cycle) ≈ 5 cycles per element.
Total for 16 elements ≈ 80 cycles.
Scenario 2: BitNet SIMD
* `bn.store` sets up weights (handles 16 weights, packed 2-bits each in one 32-bit word). Latency: 1 cycle.
* `bn.sum8` processes 4 elements per instruction (since `rs1` is 32-bit holding four 8-bit values).
* For this problem, assume `bn.store` and `bn.sum8` each take exactly 1 cycle and do not incur additional pipeline stalls. We ignore instruction fetch and decode overhead.
To process 16 elements, we need __ J09 __ `bn.sum8` instructions.
Total cycles = 1 (`bn.store`) + (number of sums) = __ J10 __ cycles. (Assume 1 cycle latency per instruction).
The theoretical speedup of BitNet SIMD over RV32I for this kernel is approximately __ J11 __x.
- [ ] Part 4: Chisel HDL Implementation (v3.6)
You are designing the execution unit for the `bn.sum8` instruction using Chisel 3.6. The module `BitNetPE` computes the dot product of four 8-bit signed integers (activations) and four 2-bit weights.
Weight Encoding (2-bit):
* `00`: 0
* `01`: +1
* `10`: -1
* `11`: 0 (Reserved)
Fill in the missing Chisel code.
```scala
import chisel3._
import chisel3.util._

class BitNetPE extends Module {
  val io = IO(new Bundle {
    val activations = Input(UInt(32.W))
    val weights     = Input(UInt(8.W))
    val result      = Output(SInt(32.W))
  })

  val partials = Wire(Vec(4, SInt(32.W)))

  for (i <- 0 until 4) {
    val act_slice = io.activations((i * 8) + 7, i * 8)
    val act_val   = act_slice.asSInt
    val w_slice   = io.weights((i * 2) + 1, i * 2)

    // Multiplication-free logic (MuxLookup)
    partials(i) := MuxLookup(w_slice, 0.S)(Seq(
      "b00".U -> 0.S,
      "b01".U -> __J12__, // Case: weight is +1
      "b10".U -> __J13__, // Case: weight is -1
      "b11".U -> 0.S
    ))
  }

  // Reduce: sum all partial products
  io.result := __J14__
}
```
* Fill blank __ J12 __: (Chisel expression for positive activation)
* Fill blank __ J13 __: (Chisel expression for negative activation)
* Fill blank __ J14 __: (Chisel expression to sum `partials`)
- [ ] Part 5: Pipeline Hazards and Forwarding
Consider a standard 5-stage RISC-V pipeline (IF, ID, EX, MEM, WB) modified to support this extension.
* The Weight Buffer is updated/read in the EX stage.
* `bn.sum8` requires standard Register File access (ID) and ALU execution (EX).
* Assume the Weight Buffer write for `bn.store` completes at the end of EX in cycle T, and subsequent BitNet instructions see the updated buffer when they reach EX in cycle T+1 (no extra stall is needed).
Analyze the sequence:
```asm
I1: bn.store x1, x2 # Update Weight Buffer
I2: bn.sum8 x10, x3 # Accumulate into x10
I3: bn.sum8 x10, x4 # Accumulate into x10
```
* I1 updates the buffer in EX (Cycle T). I2 reads the buffer in EX (Cycle T+1). Does this require a pipeline stall? __ J15 __ (yes / no).
* There is a RAW dependency on register __ J16 __ between I2 and I3.
* Can standard ALU-to-ALU forwarding resolve the hazard between I2 and I3? __ J17 __ (yes / no).
* If the `bn.sum8` execution unit is complex and takes 2 cycles in the EX stage (meaning it occupies the shared EX stage for 2 consecutive clock cycles and blocks other instructions from entering EX), how many bubbles (stall cycles) must be inserted before I3 can execute? __ J18 __ cycle(s).
### ✅ Solution to Problem J:
J01: ==Multiplexers==
J02: ==context switch overhead==
J03: ==simpler==
J04: ==0001011==
J05: ==01010==
J06: ==001==
J07: ==01011==
J08: ==00000==
J09: ==4==
J10: ==5==
J11: ==16==
J12: ==act_val==
J13: ==-act_val==
J14: ==partials.reduce(_ + _)==
J15: ==no==
J16: ==x10==
J17: ==yes==
J18: ==1==
Detailed Explanations:
Part 1 - Concept and Microarchitecture:
- J01: Multiplexers select between +x, -x, or 0 based on the weight value, replacing the multiplier (see the C sketch after this list)
- J02: The stateful Weight Buffer must be saved and restored on context switches, increasing overhead
- J03: The BitNet extension is simpler to implement in hardware than full RVV due to its fixed 32-bit registers and simpler datapath logic
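To make J01 concrete, here is a minimal C sketch (names are illustrative, not part of the ISA) of the multiplication-free select logic: the 2-bit weight acts as a mux control choosing the activation, its negation, or zero.

```c
#include <stdint.h>

/* Multiplication-free partial product: the 2-bit weight selects
   +act, -act, or 0, which is exactly what a hardware mux does.
   Encoding: 00 -> 0, 01 -> +1, 10 -> -1, 11 -> 0 (reserved). */
static int32_t bitnet_partial(int8_t act, uint8_t w2)
{
    switch (w2 & 0x3) {
    case 0x1: return  (int32_t)act;   /* weight +1: pass through */
    case 0x2: return -(int32_t)act;   /* weight -1: negate       */
    default:  return 0;               /* 00, and reserved 11     */
    }
}

/* Software model of one bn.sum8: dot product of four packed
   8-bit activations and four packed 2-bit weights. */
int32_t bn_sum8_model(uint32_t acts, uint8_t weights)
{
    int32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        int8_t  a = (int8_t)(acts >> (8 * i));    /* lane i activation */
        uint8_t w = (weights >> (2 * i)) & 0x3;   /* lane i weight     */
        sum += bitnet_partial(a, w);
    }
    return sum;
}
```

This doubles as a golden model for checking the `BitNetPE` Chisel module in Part 4.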
Part 2 - Instruction Encoding:
- J04-J08: Binary encoding for `bn.sum8 x10, x11` with opcode 0x0B, rd=10, funct3=1, rs1=11, rs2=0
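As a cross-check of J04-J08, a short C sketch can assemble the full 32-bit word; note the problem does not specify `funct7`, so 0 is assumed here.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Field values for bn.sum8 x10, x11 (R-type layout assumed). */
    uint32_t opcode = 0x0B;  /* 0001011 (custom-0)           */
    uint32_t rd     = 10;    /* x10 -> 01010                 */
    uint32_t funct3 = 1;     /* 001                          */
    uint32_t rs1    = 11;    /* x11 -> 01011                 */
    uint32_t rs2    = 0;     /* unused data source -> x0     */
    uint32_t funct7 = 0;     /* assumption: not given above  */

    uint32_t insn = (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
                    (funct3 << 12) | (rd << 7)   | opcode;
    printf("bn.sum8 x10, x11 = 0x%08X\n", insn);  /* 0x0005950B */
    return 0;
}
```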
Part 3 - Performance Analysis:
- J09: 16 elements ÷ 4 elements per instruction = 4 `bn.sum8` instructions needed
- J10: Total cycles = 1 (bn.store) + 4 (bn.sum8) = 5 cycles
- J11: Speedup = 80 cycles (RV32I) ÷ 5 cycles (BitNet) = 16x
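The J09-J11 arithmetic, spelled out in C with the constants taken directly from the problem's assumptions:

```c
#include <stdio.h>

int main(void)
{
    const int n = 16;              /* vector length                   */
    const int rv32i_per_elem = 5;  /* cycles per element (Scenario 1) */
    const int lanes = 4;           /* elements per bn.sum8            */

    int rv32i_cycles  = n * rv32i_per_elem;  /* 16 * 5 = 80          */
    int num_sums      = n / lanes;           /* J09: 16 / 4 = 4      */
    int bitnet_cycles = 1 + num_sums;        /* J10: bn.store + sums */

    printf("RV32I=%d BitNet=%d speedup=%dx\n",
           rv32i_cycles, bitnet_cycles,
           rv32i_cycles / bitnet_cycles);    /* 80, 5, J11: 16x */
    return 0;
}
```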
Part 4 - Chisel HDL Implementation:
- J12-J14: Multiplication-free logic using multiplexers; sum partial products using reduce
Part 5 - Pipeline Hazards and Forwarding:
- J15-J18: Buffer update in EX(T), read in EX(T+1) requires no stall; RAW on x10; ALU forwarding resolves; 1 bubble for 2-cycle execution unit
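A cycle-by-cycle sketch of the J18 case, assuming the 2-cycle `bn.sum8` EX occupancy stated in Part 5 and no other stalls:

```text
Cycle:       1    2    3    4    5    6    7    8
I2 bn.sum8   IF   ID   EX   EX   MEM  WB
I3 bn.sum8        IF   ID   **   EX   EX   MEM  WB
                            ^ 1 bubble: EX is still busy with I2
```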
---
## Problem K: Pipeline Execution + Cache Behavior Under Memory Stalls
A 5-stage RV32I pipeline:
```
IF → ID → EX → MEM → WB
```
supports:
* Full forwarding (including EX/MEM → EX and MEM/WB → EX for all operands)
* A 1-cycle stall for load–use hazards
* Branch resolution in EX (predict-not-taken), where branches read their source registers and perform the comparison in EX
* No speculative memory operations
* When a cache miss occurs in MEM, the pipeline freezes until the miss is serviced
(IF/ID/EX do not advance; WB completes normally)
The system uses this L1 data cache:
* Size: 4 KiB
* Line size: 32 bytes
* 2-way associative
* Replacement: LRU
* Hit time: 1 cycle
* Miss penalty: 10 cycles (fixed), meaning the MEM stage stalls for 10 extra cycles beyond the 1-cycle hit time
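As a quick sanity check on the geometry (not asked in any blank): 4 KiB / 32 B per line / 2 ways = 64 sets, so a 32-bit address splits into a 5-bit offset, a 6-bit index, and a 21-bit tag. A minimal C sketch, with an arbitrary example address:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t addr   = 0x12345678;           /* arbitrary example    */
    uint32_t offset = addr & 0x1F;          /* bits [4:0]:  32 B    */
    uint32_t index  = (addr >> 5) & 0x3F;   /* bits [10:5]: 64 sets */
    uint32_t tag    = addr >> 11;           /* bits [31:11]         */
    printf("tag=0x%05X index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```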
You execute the following loop:
```c
for (i = 0; i < 64; i++) {
s = s + A[i];
}
```
The compiler emits this RV32I sequence (the base address of A is in x10):
```asm
Loop:
lw x5, 0(x10) # I1
add x6, x6, x5 # I2
addi x10, x10, 4 # I3
addi x11, x11, -1 # I4 (loop counter)
bne x11, x0, Loop # I5
```
Assume:
* A is 256 bytes, aligned to a 32-byte boundary.
* A spans exactly 8 cache lines (32 bytes per line, 4 bytes per element).
* Cache is initially cold.
* Branch predictor always predicts *not taken*.
* x11 is initialized to 64.
- [ ] Part 1: Pipeline hazards without cache misses
For this part, only consider RAW dependencies within a single loop iteration (do not count loop-carried dependencies across iterations).
* How many producer-consumer dependency pairs requiring forwarding or stalls are present between distinct instructions within the loop body? __ K01 __
* Which instruction pair causes a stall? __ K02 __
* How many stall cycles per loop iteration (ignoring cache misses)? __ K03 __
* What type of hazard causes this stall? __ K04 __ (Choose: "load-use", "branch", "structural")
* Can forwarding resolve the hazard between I4 and I5? (yes/no) __ K05 __
- [ ] Part 2: Cache behavior across the loop
Given that A has 64 elements and lines are 8 elements wide:
* How many L1 data cache misses occur during the entire loop? __ K06 __
* How many hits? __ K07 __
* What is the miss rate (as a decimal)? __ K08 __
* After the first access to each line, what is the access pattern for that line? __ K09 __ (Choose: "hit", "miss")
* How many elements are in each cache line? __ K10 __
- [ ] Part 3: Pipeline + cache interaction
For each cache miss on the `lw` instruction:
* How many pipeline stall cycles occur due to the miss? __ K11 __
* When a miss happens, what happens to the IF stage? __ K12 __ (Choose: "advances normally", "freezes", "repeats fetch")
* Does the WB stage continue to complete normally during a miss? (yes/no) __ K13 __
* During a miss, do IF/ID/EX stages advance? (yes/no) __ K14 __
* What type of pipeline behavior occurs during a miss? __ K15 __ (Choose: "selective stall", "pipeline-wide freeze", "bypass")
- [ ] Part 4: Branch behavior during cache stalls
Consider how cache stalls interact with branch prediction:
* How many total branches are executed in the loop? __ K16 __
* How many of these branches are actually taken? __ K17 __
* What is the branch misprediction penalty in cycles? __ K18 __
* When does the branch outcome resolve? __ K19 __ (Choose: "IF", "ID", "EX", "MEM", "WB")
* Does the cache stall affect branch resolution timing? (yes/no) __ K20 __
- [ ] Part 5: Total cycles to execute the loop
Compute the total cycles using the components:
* Base pipeline cost per iteration (including hazard stalls): __ K21 __ cycles
* Total base cycles for all 64 iterations: __ K22 __ cycles
* Branch penalty cycles (63 taken branches × penalty): __ K23 __ cycles
* Cache miss stall cycles (8 misses × penalty): __ K24 __ cycles
* Final total execution time: __ K25 __ cycles
Hint:
- For this problem, assume all costs (stalls, branch penalties, cache misses) accumulate independently with no overlap.
- Build your answers incrementally: first compute K21 from earlier components, then use that to build K22, and so on.
### ✅ Solution to Problem K:
K01: ==2==
K02: ==I1 → I2==
K03: ==1==
K04: ==load-use==
K05: ==yes==
K06: ==8==
K07: ==56==
K08: ==0.125==
K09: ==hit==
K10: ==8==
K11: ==10==
K12: ==freezes==
K13: ==yes==
K14: ==no==
K15: ==pipeline-wide freeze==
K16: ==64==
K17: ==63==
K18: ==2==
K19: ==EX==
K20: ==yes==
K21: ==6==
K22: ==384==
K23: ==126==
K24: ==80==
K25: ==590==
Detailed explanations (optional reading):
Part 1 — Pipeline hazards without cache misses
Two RAW hazards exist: I1 → I2 (on x5) and I4 → I5 (on x11). The I1 (lw) → I2 (add) pair causes a stall due to a load-use hazard, costing 1 stall cycle per iteration. The I4 → I5 hazard is resolved by EX/MEM → EX forwarding, as sketched below.
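A timing sketch of the load-use stall in one iteration, under the 5-stage timing given above:

```text
Cycle:    1    2    3    4    5    6    7
I1 lw     IF   ID   EX   MEM  WB
I2 add         IF   ID   **   EX   MEM  WB   (1 bubble; MEM/WB -> EX forward)
```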
Part 2 — Cache behavior
There are 8 cache misses, one for the first access to each of the 8 cache lines; the remaining 64 − 8 = 56 accesses hit. The miss rate is 8/64 = 0.125. Each cache line holds 32 bytes / 4 bytes per element = 8 elements.
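The K06-K08 counts can be cross-checked with a small C sketch that walks the array exactly as the loop does; since A is 32-byte aligned and accessed sequentially, the first word of each line is a cold miss and every other access hits. This models only this streaming pattern, not a general 2-way LRU cache.

```c
#include <stdio.h>

int main(void)
{
    const int line_bytes = 32, elem_bytes = 4, n = 64;
    int misses = 0, hits = 0;

    for (int i = 0; i < n; i++) {
        int byte_off = i * elem_bytes;      /* A is 32-byte aligned */
        if (byte_off % line_bytes == 0)
            misses++;                       /* first word of a line */
        else
            hits++;                         /* line already present */
    }
    printf("misses=%d hits=%d rate=%.3f\n",
           misses, hits, (double)misses / n);  /* 8, 56, 0.125 */
    return 0;
}
```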
Part 3 — Pipeline + cache interaction
Each cache miss causes 10 stall cycles (the miss penalty). During a miss, the IF stage freezes as part of a pipeline-wide freeze. The WB stage continues to complete normally during a miss (it operates after MEM). The IF/ID/EX stages do not advance when a miss occurs—all stages except WB are frozen.
Part 4 — Branch behavior during cache stalls
There are 64 branch instructions (one per iteration). Of these, 63 branches are taken (all except the last one). The branch outcome resolves in the EX stage. A cache miss in MEM freezes IF/ID/EX. If a branch is in flight, its EX stage (where the comparison and resolution happen) is delayed until after the miss is serviced. Thus, in terms of clock cycles, the branch resolution timing is affected by cache stalls, even though it still logically resolves in the EX stage.
Part 5 — Total cycle calculation
Base pipeline cost: the 5 instructions issue one per cycle in steady state (5 cycles) plus 1 load-use stall = 6 cycles per iteration. For all 64 iterations: 64 × 6 = 384 cycles. Branch penalties: 63 taken branches × 2-cycle misprediction penalty = 126 cycles. Cache stalls: 8 misses × 10-cycle penalty = 80 cycles. Total execution time: 384 + 126 + 80 = 590 cycles.
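Under the hint's no-overlap assumption, the K21-K25 roll-up is a straight sum; the constants below are the ones derived in Parts 1-4.

```c
#include <stdio.h>

int main(void)
{
    const int iters = 64;
    const int per_iter = 5 + 1;   /* K21: 5 instructions + 1 load-use stall */
    const int taken = 63, branch_penalty = 2;
    const int misses = 8, miss_penalty = 10;

    int base   = iters * per_iter;         /* K22: 384 */
    int branch = taken * branch_penalty;   /* K23: 126 */
    int cache  = misses * miss_penalty;    /* K24:  80 */
    printf("total = %d cycles\n", base + branch + cache);  /* K25: 590 */
    return 0;
}
```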