---
tags: computer-arch
---
# Quiz4 of Computer Architecture (2025 Fall)
> Solutions
## Problem `A`
You are given the following truth table describing the **next-cycle** values of two 2-bit registers `r` and `s`:
| c1 | c2 | $r_{next}$ | $s_{next}$ |
| -- | -- | ------ | ------ |
| 0 | 0 | 3 | 3 |
| 0 | 1 | 2 | 3 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 2 | 1 |
You must complete the Chisel code below so that it exactly produces the behavior in the truth table.
Fill in all blanks. **Do not modify the given structure.**
Assume `c1`, `c2` are `Bool`, and `r`, `s` are 2-bit registers initialized to 3.
```scala
val r = RegInit(3.U(2.W))
val s = RegInit(3.U(2.W))
// default assignments
r := 3.U
s := 3.U
when (c1) {
r := 1.U
___ A01 ___
}
when (c2) {
___ A02 ___
}
```
* A01: ==`s := 1.U`==
* A02: ==`r := 2.U`==
> These are the **only** values that satisfy all four rows of the truth table when combined with the required Chisel structure and the "last connect wins" semantics.
Assume Chisel's **last-connect-wins** semantics and independent `when` blocks (not `elsewhen`).
Your task is to derive the **pure combinational** expressions that compute the next values of `r` and `s`, expressed entirely in terms of nested muxes.
Write the mux expression for $r_{next}$. __ A03 __
Your answer must use only the signals `c1`, `c2`, and constants `3.U`, `1.U`, and `2.U`.
A correct answer must explicitly show two levels of muxes.
* A03 = ==`Mux(c2, 2.U, Mux(c1, 1.U, 3.U))`==
> Must show:
> * `c2` overrides `c1` for `r`
> * default is 3.U
Write the mux expression for $s_{next}$. __ A04 __
Your answer must reflect exactly the truth table behavior.
* A04 = ==`Mux(c1, 1.U, 3.U)`==
> Explanation: `s` is updated only under `c1`; `c2` does not affect `s`.
The equivalent hardware block diagram (drawn below) must show:
1. The two-level mux structure for `r_next`
2. The one-level mux structure for `s_next`
3. Proper labeling of select lines (`c1`, `c2`)
4. All constant inputs (3, 1, 2)
5. Connections from mux outputs to the registers `r` and `s`
```
          +----------------+
 1.U ---->| Mux (level 1)  |
 3.U ---->|   sel = c1     |---> t1
          +----------------+

          +----------------+
 2.U ---->| Mux (level 2)  |
 t1  ---->|   sel = c2     |---> r_next ---> Reg r
          +----------------+

          +----------------+
 1.U ---->|   Mux for s    |
 3.U ---->|   sel = c1     |---> s_next ---> Reg s
          +----------------+
```
* `r_next` uses **two** cascaded muxes: inner (c1), outer (c2)
* `s_next` uses **one** mux controlled only by `c1`
Give the nested-mux expressions to be used in the Chisel code for $r_{next}$ and $s_{next}$:
* A05: ==`Mux(c2, 2.U, Mux(c1, 1.U, 3.U))`==
* A06: ==`Mux(c1, 1.U, 3.U)`==
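
As a quick cross-check, the two expressions can be modeled in plain C, where a ternary operator plays the role of a 2-to-1 mux. This is only a sanity-check sketch (the function names are ours, not part of the Chisel design):

```c
#include <assert.h>
#include <stdio.h>

/* Software model of the nested muxes: each ?: expression stands in for one Mux. */
static unsigned r_next(int c1, int c2) { return c2 ? 2u : (c1 ? 1u : 3u); }
static unsigned s_next(int c1, int c2) { return c1 ? 1u : 3u; }

int main(void) {
    /* Truth-table rows: {c1, c2, r_next, s_next} */
    const unsigned rows[4][4] = {
        {0, 0, 3, 3}, {0, 1, 2, 3}, {1, 0, 1, 1}, {1, 1, 2, 1}
    };
    for (int i = 0; i < 4; i++) {
        assert(r_next(rows[i][0], rows[i][1]) == rows[i][2]);
        assert(s_next(rows[i][0], rows[i][1]) == rows[i][3]);
    }
    puts("all four truth-table rows match");
    return 0;
}
```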
---
## Problem `B`
This problem focuses on digital logic timing analysis and critical path optimization in sequential circuits.
Consider the following digital circuit with four pipeline registers (R1, R2, R3, R4) and three combinational logic blocks (A, B, C):
```
CLK ────┬────────┬───────┬──────┐
▼ ▼ ▼ ▼
Input ──────┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│ │R1 │ │R2 │ │R3 │ │R4 │
│ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘
│ └───┐ │ ┌───┘ │
│ │ │ │ │
│ ┌─▼───▼───▼─┐ │
│ │ Logic A │ │
│ │ (12ns) │ │
│ └─────┬─────┘ │
│ ├───────────────┘
│ ┌─────▼─────┐
│ │ Logic B │
│ │ (15ns) │
│ └─────┬─────┘
└────────────────┼───┐
│ │
┌─────▼───▼─┐
│ Logic C │
│ (8ns) │
└─────┬─────┘
│
Output
```
Component specifications:
- Register clk-to-q delay: 6ns
- Register setup time: 2ns
- Register hold time: 1ns
- Logic A propagation delay: 12ns
- Logic A contamination delay: 3ns
- Logic B propagation delay: 15ns
- Logic B contamination delay: 5ns
- Logic C propagation delay: 8ns
- Logic C contamination delay: 2ns
1. What is the minimum clock period for this circuit? __ B01 __ ns
* B01: ==43==
> Critical path analysis:
> R1 clk-to-q (6ns) + Logic A (12ns) + Logic B (15ns) + Logic C (8ns) + R4 setup (2ns)
> = 6 + 12 + 15 + 8 + 2 = 43ns
2. What is the maximum frequency at which this circuit can operate? __ B02 __ MHz (round to 2 decimal places)
* B02: ==23.26==
> Frequency = 1 / Clock period = 1 / 43ns = 23.26 MHz
3. Which logic block is in the critical path? __ B03 __
* B03: ==`Logic B`== OR ==`B`==
> Logic B has the longest delay (15ns) and is part of the longest path from R1 to R4.
4. Check if the circuit satisfies the hold time constraint for register R3. Does it satisfy the hold time requirement? __ B04 __
* B04: ==`Yes`== OR ==`True`==
> Hold time check for R3:
> Shortest path to R3: R1 clk-to-q (6ns) + Logic A contamination (3ns) = 9ns
> R3 hold time requirement: 1ns
> Since 9ns > 1ns, the hold time constraint is satisfied.
5. To improve performance, a designer wants to split Logic B into two pipeline stages: B1 (7ns) and B2 (8ns) with a new register R5 between them. What would be the new minimum clock period? __ B05 __ ns
* B05: ==27==
> With Logic B split into B1 (7ns) + R5 + B2 (8ns):
> Stage 1 critical path: R1 → Logic A → Logic B1 → R5 = 6 + 12 + 7 + 2 = 27ns
> Stage 2 critical path: R5 → Logic B2 → Logic C → R4 = 6 + 8 + 8 + 2 = 24ns
> New minimum clock period = max(27, 24) = 27ns
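
The timing arithmetic above can be reproduced with a short C sketch. The two path lists are transcribed by hand from the diagram, so treat them as an assumption rather than an extracted netlist:

```c
#include <stdio.h>

/* Component timing in ns, from the problem statement. */
enum { T_CLK_Q = 6, T_SETUP = 2, T_A = 12, T_B = 15, T_C = 8 };

int main(void) {
    /* Original circuit: longest register-to-register path R1 -> A -> B -> C -> R4. */
    int t_orig = T_CLK_Q + T_A + T_B + T_C + T_SETUP;

    /* After splitting B into B1 (7 ns) and B2 (8 ns) around the new register R5. */
    int t_stage1 = T_CLK_Q + T_A + 7 + T_SETUP;   /* R1 -> A -> B1 -> R5 */
    int t_stage2 = T_CLK_Q + 8 + T_C + T_SETUP;   /* R5 -> B2 -> C -> R4 */
    int t_split  = t_stage1 > t_stage2 ? t_stage1 : t_stage2;

    printf("original Tclk = %d ns (fmax = %.2f MHz)\n", t_orig, 1000.0 / t_orig);
    printf("split    Tclk = %d ns\n", t_split);
    return 0;
}
```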
---
## Problem `C`
This problem explores circular buffer implementation and bit manipulation for efficient modulo operations.
A circular buffer (also called ring buffer) is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. This is commonly used in embedded systems for FIFO queues, audio/video streaming, and inter-process communication.
Consider implementing a circular buffer with the following properties:
- Buffer size: 64 bytes (power of 2)
- Head pointer: points to the next write position
- Tail pointer: points to the next read position
- Count: number of elements currently in buffer
1. For a circular buffer of size N (power of 2), why is using bitwise AND more efficient than modulo for wrapping indices? __ C01 __
* C01 = (Accept explanations of the bitwise-AND optimization; credit is given for answers with equivalent meaning)
> When N is a power of 2, the modulo operation `index % N` can be replaced with `index & (N-1)`.
> For N=64: `index % 64` becomes `index & 63` (since 63 = 0b00111111)
> Bitwise AND is a single-cycle operation, while modulo requires division (multi-cycle on most processors).
> For RV32I, which has no hardware divide, modulo must be computed in software; avoiding it saves approximately 20-30 cycles per index wrap operation.
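
A minimal C check of the identity used here (nothing RISC-V-specific; any compiler will do):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t N = 64;                   /* buffer size, a power of two */
    for (uint32_t i = 0; i < 1000; i++)
        assert(i % N == (i & (N - 1)));      /* holds for any power-of-two N */
    puts("index % 64 == (index & 63) for all tested indices");
    return 0;
}
```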
2. Implement the buffer_write function that adds a byte to the circular buffer. Complete the RISC-V assembly code:
```assembly
# Circular buffer structure (in memory):
# Offset 0: head (4 bytes)
# Offset 4: tail (4 bytes)
# Offset 8: count (4 bytes)
# Offset 12: size (4 bytes) - always 64
# Offset 16: buffer[64] (64 bytes)
#
# Input: a0 = pointer to buffer structure
# a1 = byte to write (in least significant byte)
# Output: a0 = 0 if success, -1 if buffer full
buffer_write:
lw t0, 8(a0) # Load count __ C02 __
lw t1, 12(a0) # Load size (64)
beq t0, t1, buffer_full # If count == size, buffer is full __ C03 __
lw t2, 0(a0) # Load head index
addi t3, a0, 16 # t3 = address of buffer[0] __ C04 __
add t3, t3, t2 # t3 = address of buffer[head]
sb a1, 0(t3) # Store byte at buffer[head] __ C05 __
addi t2, t2, 1 # head++ __ C06 __
andi t2, t2, 63 # head = head & (size-1) for wrapping __ C07 __
sw t2, 0(a0) # Store new head
addi t0, t0, 1 # count++
sw t0, 8(a0) # Store new count __ C08 __
add a0, x0, x0 # Return 0 (success) __ C09 __
jalr x0, ra, 0
buffer_full:
addi a0, x0, -1 # Return -1 (buffer full) __ C10 __
jalr x0, ra, 0
```
* C02: `lw t0, 8(a0)`
* C03: `beq t0, t1, buffer_full`
* C04: `addi t3, a0, 16`
* C05: `sb a1, 0(t3)`
* C06: `addi t2, t2, 1`
* C07: `andi t2, t2, 63`
* C08: `sw t0, 8(a0)`
* C09: `add a0, x0, x0`
* C10: `addi a0, x0, -1`
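
For reference, a plain-C version of the same routine (a sketch that mirrors the assembly above; the struct layout matches the offsets in the comments, and the type and function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* Field order matches the offsets used by the assembly: head, tail, count, size, data. */
typedef struct {
    uint32_t head, tail, count, size;   /* size is always 64 */
    uint8_t  data[64];
} ringbuf_t;

int buffer_write(ringbuf_t *rb, uint8_t byte) {
    if (rb->count == rb->size)                    /* beq t0, t1, buffer_full */
        return -1;
    rb->data[rb->head] = byte;                    /* sb a1, 0(t3) */
    rb->head = (rb->head + 1) & (rb->size - 1);   /* andi t2, t2, 63 */
    rb->count++;
    return 0;
}

int main(void) {
    ringbuf_t rb = { .size = 64 };
    for (int i = 0; i < 64; i++) {
        int rc = buffer_write(&rb, (uint8_t)i);
        assert(rc == 0);
    }
    int rc_full = buffer_write(&rb, 0xFF);        /* 65th write fails: buffer full */
    assert(rc_full == -1);
    return 0;
}
```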
3. For the buffer_read function, explain why we need to check if count == 0 before reading: __ C11 __
* C11 = (Accept explanations of empty-buffer detection; credit is given for answers with equivalent meaning)
> Checking count == 0 detects an empty buffer condition. If we read when count == 0, we would read stale or uninitialized data from the buffer array. Additionally, decrementing count when it's already 0 would cause integer underflow, wrapping to a large positive number (0xFFFFFFFF), which would corrupt the buffer state and make it appear full when it's actually empty.
4. What is the maximum number of bytes that can be stored in this buffer at any given time? __ C12 __
* C12: ==64==
> Since our implementation maintains a separate `count` variable, we can distinguish between full and empty states even when head == tail. Therefore, we can use all 64 buffer slots:
> - Empty: count == 0 (head and tail can be equal)
> - Full: count == 64 (head and tail can be equal)
> The count variable eliminates ambiguity, allowing maximum capacity = 64 bytes.
5. If we remove the count variable and only use head and tail pointers, what is the maximum capacity? __ C13 __
* C13: ==63==
> Without a count variable, we must distinguish full from empty using only head and tail:
> - Empty: head == tail
> - Full: (head + 1) % size == tail
> This means we must leave one slot always empty as a sentinel, reducing capacity to size-1 = 63 bytes.
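
A sketch of the head/tail-only scheme in C (the names are ours), showing why one slot is sacrificed:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SIZE 64u   /* power of two */

static bool rb_empty(uint32_t head, uint32_t tail) {
    return head == tail;                       /* nothing to read */
}

static bool rb_full(uint32_t head, uint32_t tail) {
    return ((head + 1) & (SIZE - 1)) == tail;  /* one slot kept free as a sentinel */
}

int main(void) {
    uint32_t head = 0, tail = 0, stored = 0;
    printf("empty at start: %s\n", rb_empty(head, tail) ? "yes" : "no");
    while (!rb_full(head, tail)) {             /* fill until the buffer reports full */
        head = (head + 1) & (SIZE - 1);
        stored++;
    }
    printf("capacity without a count field: %u\n", (unsigned)stored);   /* prints 63 */
    return 0;
}
```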
6. In the explanation for C01, we mentioned that replacing modulo with bitwise AND saves roughly 20-30 cycles on RV32I. To understand why, complete the following `software_modulo` routine using **only RV32I base instructions** (no pseudoinstructions):
```assembly
# software_modulo: Software modulo implementation for RV32I
# Algorithm: Simplified shift-and-subtract division
# Input: a0 = dividend, a1 = divisor
# Output: a0 = remainder (a0 % a1)
# Constraints: Assumes positive inputs and divisor != 0
software_modulo:
add t0, x0, x0 # quotient = 0
add t1, x0, x0 # remainder = 0 __ C14 __
addi t2, x0, 32 # bit_count = 32 (process 32 bits)
div_loop:
slli t1, t1, 1 # remainder <<= 1 __ C15 __
srli t3, a0, 31 # Extract MSB of dividend __ C16 __
or t1, t1, t3 # remainder |= extracted bit
slli a0, a0, 1 # dividend <<= 1 (shift left)
blt t1, a1, skip_sub # If remainder < divisor, skip subtraction __ C17 __
sub t1, t1, a1 # remainder -= divisor __ C18 __
ori t0, t0, 1 # quotient |= 1 (set LSB)
skip_sub:
slli t0, t0, 1 # quotient <<= 1 (prepare for next bit) __ C19 __
addi t2, t2, -1 # bit_count--
bne t2, x0, div_loop # Continue if bit_count != 0 __ C20 __
add a0, t1, x0 # Return remainder in a0 __ C21 __
jalr x0, ra, 0 # Return to caller
```
* C14: `add t1, x0, x0`
* C15: `slli t1, t1, 1`
* C16: `srli t3, a0, 31`
* C17: `blt t1, a1, skip_sub`
* C18: `sub t1, t1, a1`
* C19: `slli t0, t0, 1`
* C20: `bne t2, x0, div_loop`
* C21: `add a0, t1, x0`
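
The same shift-and-subtract idea written in C, handy for cross-checking the assembly (a sketch; like the assembly, it assumes a positive dividend and a non-zero divisor):

```c
#include <assert.h>
#include <stdint.h>

/* Restoring-division style remainder: mirrors the loop structure of the assembly. */
uint32_t software_modulo(uint32_t dividend, uint32_t divisor) {
    uint32_t remainder = 0;
    for (int bit = 31; bit >= 0; bit--) {
        remainder = (remainder << 1) | ((dividend >> bit) & 1u);  /* bring in the next bit */
        if (remainder >= divisor)
            remainder -= divisor;                                 /* conditional subtract */
    }
    return remainder;
}

int main(void) {
    for (uint32_t i = 1; i < 5000; i++)
        assert(software_modulo(i, 64) == i % 64);   /* agrees with the hardware operator */
    return 0;
}
```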
7. Based on the `software_modulo` implementation above and the circular buffer code, analyze the cycle savings when using bitwise AND instead of runtime modulo on a **single-cycle RV32I processor**.
Constraints:
- Single-cycle implementation (CPI = 1, each instruction takes exactly 1 cycle)
- RV32I base ISA (no M extension, no hardware division)
- Runtime divisor (compiler cannot optimize modulo to bitwise AND at compile time)
- Buffer size is a runtime variable, not a compile-time constant
Count the number of instructions executed by `software_modulo` for one complete modulo operation, then calculate the cycle savings. __ C22 __
a) Total cycles for `software_modulo` (count: prologue + all loop iterations + epilogue)
b) Total cycles for bitwise AND operation (`andi t2, t2, 63`)
c) Cycle savings = (a) - (b)
* C22: ==`292`==
> a) **Cycles for `software_modulo`:**
> - Prologue (the three instructions before `div_loop`): 3 cycles
> - Loop: 32 iterations for a 32-bit dividend
> - Per iteration: the loop body contains 10 static instructions but executes 8 or 10 dynamically, depending on whether the conditional subtraction is taken; for simplicity, assume 9 cycles per iteration on average
> - Total loop: 32 × 9 = 288 cycles
> - Epilogue (`add a0, t1, x0` and the `jalr`): 2 cycles
> - **Total: 3 + 288 + 2 = 293 cycles**
>
> Note: this is a simplified implementation for educational purposes. Optimized versions with early termination can reduce the cost to roughly 20-40 cycles by detecting when the quotient is fully determined early.
>
> b) **Cycles for bitwise AND:** `andi t2, t2, 63` = 1 cycle
>
> c) **Cycle savings:**
> - With runtime modulo: 293 cycles (or roughly 20-40 for an optimized version)
> - With bitwise AND: 1 cycle
> - **Savings: 293 − 1 = 292 cycles (simplified version), or roughly 19-39 cycles (optimized)**
---
## Problem `D`
This problem focuses on processor timing analysis and pipeline optimization.
You are designing a single-cycle RISC-V processor. The following components have the specified delays:
| Component | Delay (ns) |
|-----------|-----------|
| Instruction Memory | 250 |
| Data Memory | 250 |
| Register File Read | 100 |
| Register File Write Setup | 50 |
| ALU | 150 |
| Mux (2-to-1) | 25 |
| Control Logic | 50 |
| Sign Extend | 30 |
| Adder (PC+4) | 75 |
Below is a simplified single-cycle datapath (ASCII representation):
```
┌──────────────┐
PC ───────>│ Instruction │
│ Memory │───> Instruction
│ (250ns) │
└──────────────┘
│
├─────> Control (50ns)
│
v
┌──────────────┐
Instruction──>│ Register File│
│ (100ns) │─┬──> Read Data 1
│ │ │
└──────────────┘ └──> Read Data 2
^
│
Write Data
│
┌─────┴────┐
│ Mux │<─── Memory/ALU result
│ (25ns) │
└──────────┘
│
v
┌──────────────────┐
Read Data 1 ───>│ │
│ ALU │───> ALU Result
Read Data 2/ ───>│ (150ns) │
Immediate │ │
└──────────────────┘
│
v
┌──────────────┐
ALU Result>│ Data Memory │
│ (250ns) │───> Read Data
└──────────────┘
```
1. What is the minimum clock period for this single-cycle processor? __ D01 __ ns
* D01: ==900==
> Critical path is the lw instruction (uses all stages):
> IF: Instruction Memory (250ns)
> ID: Control (50ns) + Register File Read (100ns) = 150ns
> EX: Mux (25ns) + ALU (150ns) = 175ns
> MEM: Data Memory (250ns)
> WB: Mux (25ns) + Register File Write Setup (50ns) = 75ns
> Total: 250 + 150 + 175 + 250 + 75 = 900ns
2. What instruction type determines the critical path? __ D02 __
* D02: ==`lw`== OR `load word` OR `load instruction`
> The load word (lw) instruction has the longest delay path because it must: fetch instruction, decode, read register, calculate address, access data memory, and write back to register. This is the only instruction that uses all five stages with memory access in both IF and MEM stages.
3. To improve performance, the processor is converted to a 5-stage pipeline with the stages IF, ID, EX, MEM, and WB. Assuming ideal pipeline registers (zero delay), what is the new clock period? __ D03 __ ns
* D03: ==250==
> Clock period = max(IF, ID, EX, MEM, WB)
> IF: 250ns (instruction memory)
> ID: 50ns + 100ns = 150ns (control + register read)
> EX: 25ns + 150ns = 175ns (mux + ALU)
> MEM: 250ns (data memory)
> WB: 25ns + 50ns = 75ns (mux + register write setup)
> Critical stage: max(250, 150, 175, 250, 75) = 250ns
4. For the pipelined processor, calculate the speedup for executing a program with 1000 instructions (assume no hazards). __ D04 __ (express as a decimal, e.g., 3.5)
* D04: ==3.59== (or a close approximation)
> Single-cycle: 1000 instructions × 900ns = 900,000ns
> Pipelined: (4 + 1000) × 250ns = 251,000ns (4 cycles to fill pipeline + 1000 cycles)
> Speedup = 900,000 / 251,000 = 3.586 ≈ 3.59
5. Now consider adding pipeline registers with realistic delays. Each pipeline register has a setup time of 20ns and clk-to-q delay of 10ns. What is the new clock period? __ D05 __ ns
* D05: ==280==
> New clock period = clk-to-q + max(stage delays) + setup
> = 10 + max(250, 150, 175, 250, 75) + 20
> = 10 + 250 + 20 = 280ns
6. The processor implements the following optimization: the data memory and instruction memory can be accessed in parallel during different stages. However, the register file has only one write port. Identify a hazard scenario where this causes a pipeline stall: __ D06 __
* D06 = (Accept answers describing a structural hazard on the register file write port; credit is given for answers with equivalent meaning)
> Structural hazard occurs when two instructions need to write to the register file in the same cycle. This happens when a multi-cycle instruction (like divide or floating-point operation) completes in the same cycle as a regular instruction reaching WB. For example:
> Cycle 8: Normal instruction (add x1, x2, x3) reaches WB stage
> Cycle 8: Long-latency instruction (div x4, x5, x6) also completes and needs to write back
> With only one write port, one instruction must be stalled, even though they write to different registers.
7. To further optimize the processor, the designer proposes splitting the EX stage into two pipeline stages: EX1 (ALU operation) and EX2 (result forwarding). Assuming the ALU delay can be evenly split, what would be the new clock period? __ D07 __ ns
* D07: ==280==
> Splitting the EX stage divides the 175ns delay into two stages of 87.5ns each. However, the critical path remains the memory stages (IF and MEM), both at 250ns. With pipeline registers:
> Clock period = clk-to-q + max(250, 150, 87.5, 87.5, 250, 75) + setup
> Clock period = 10 + 250 + 20 = 280ns (no improvement)
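
The Problem D numbers can be reproduced with a few lines of C. The per-stage delays are taken from the stage breakdown above; the rest is plain arithmetic:

```c
#include <stdio.h>

int main(void) {
    /* Per-stage delays in ns: IF, ID, EX, MEM, WB (from the breakdown above). */
    int stage[5] = {250, 150, 175, 250, 75};
    int single_cycle = 0, slowest = 0;
    for (int i = 0; i < 5; i++) {
        single_cycle += stage[i];
        if (stage[i] > slowest) slowest = stage[i];
    }

    int n = 1000;                                   /* instructions in the program */
    double t_single = (double)n * single_cycle;
    double t_pipe   = (double)(n + 4) * slowest;    /* 4 cycles to fill the pipeline */
    int t_clk_real  = 10 + slowest + 20;            /* clk-to-q + slowest stage + setup */

    printf("single-cycle Tclk            = %d ns\n", single_cycle);         /* 900 */
    printf("pipelined Tclk (ideal regs)  = %d ns\n", slowest);              /* 250 */
    printf("speedup over %d instructions = %.2f\n", n, t_single / t_pipe);  /* 3.59 */
    printf("pipelined Tclk (real regs)   = %d ns\n", t_clk_real);           /* 280 */
    return 0;
}
```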
---
## Problem `E`
This problem analyzes data hazards and branch prediction in pipelined processors.
Consider a 5-stage pipelined RISC-V processor (IF, ID, EX, MEM, WB) with the following properties:
- Full bypassing/forwarding from EX and MEM stages to EX stage inputs
- Branch resolution in EX stage
- 2-bit saturating counter branch predictor, initialized to "predict not taken" (state 01)
- Register file: write in first half of cycle, read in second half (double-pumped)
Analyze the following code sequence:
```assembly
addi x1, x0, 100 # I1
lw x2, 0(x1) # I2
addi x3, x2, 50 # I3
sw x3, 4(x1) # I4
beq x2, x3, target # I5
addi x4, x0, 0 # I6
target: addi x5, x0, 1 # I7
```
1. Create a pipeline diagram showing the execution of instructions I1 through I5. Assume the code starts execution at cycle 1. In which cycle does instruction I5 complete its WB stage? __ E01 __
* E01: ==10==
> Pipeline execution with load-use hazard on I2 → I3:
> Cycle 1: I1-IF
> Cycle 2: I1-ID, I2-IF
> Cycle 3: I1-EX, I2-ID, I3-IF
> Cycle 4: I1-MEM, I2-EX, I3-ID (stall: load-use hazard, x2 not ready)
> Cycle 5: I1-WB, I2-MEM, I3-ID (still stalled)
> Cycle 6: I2-WB, I3-EX (bypass x2), I4-ID
> Cycle 7: I3-MEM, I4-EX (bypass x3), I5-ID
> Cycle 8: I3-WB, I4-MEM, I5-EX (branch resolves)
> Cycle 9: I4-WB, I5-MEM
> Cycle 10: I5-WB (completes)
> I5 completes its WB stage in cycle 10
2. Identify all data hazards in the code sequence above. For each hazard, specify: (a) the instructions involved, (b) the register causing the hazard, and (c) whether it can be resolved by bypassing or requires a stall. __ E02 __
* E02 = (Accept a comprehensive hazard analysis; credit is given for answers with equivalent meaning)
> Hazard 1: I1 → I2 on register x1
> Type: RAW (Read After Write)
> Resolution: Bypass from EX stage (no stall)
>
> Hazard 2: I2 → I3 on register x2
> Type: RAW (load-use)
> Resolution: Requires a 1-cycle stall, then bypass of the loaded value from the MEM stage
>
> Hazard 3: I3 → I4 on register x3
> Type: RAW
> Resolution: Bypass from EX stage (no stall)
>
> Hazard 4: I2 → I5 on register x2
> Type: RAW
> Resolution: No bypass needed; x2 has already been written back by the time I5 reads the register file in ID (no stall)
>
> Hazard 5: I3 → I5 on register x3
> Type: RAW
> Resolution: Bypass from MEM stage (no stall)
>
> (I1 → I4 on x1 is also a RAW dependence, but the distance is large enough that I4 reads the updated value directly from the register file.)
3. If the processor did NOT have bypassing, how many stall cycles would be required to execute this code correctly? __ E03 __
* E03: ==6==
> Without bypassing, each RAW hazard requires stalling until the producer reaches WB (2 stalls per hazard):
> I1 → I2 on x1: 2 stalls (I2 waits for I1's WB)
> I2 → I3 on x2: 2 stalls (I3 waits for I2's WB)
> I3 → I4 on x3: 2 stalls (I4 waits for I3's WB)
> I3 → I5 on x3: 0 stalls (I5 reads x3 after I3's WB due to accumulated stalls)
> Total: 2 + 2 + 2 + 0 = 6 stalls
4. Consider the branch instruction (I5). Assuming the branch is NOT taken, how many instructions are incorrectly fetched and must be annulled? __ E04 __
* E04: ==0==
> The processor uses a 2-bit predictor initialized to "predict not taken". When the branch is NOT taken, the prediction is correct, so I6 executes as intended. No instructions are annulled.
5. Now assume the branch IS taken. How many cycles are wasted due to the misprediction? __ E05 __
* E05: ==2==
> Branch resolves as taken in cycle 8 (I5-EX stage)
> At the cycle where I5 is in EX, the next two sequential instructions (I6 in ID and I7 in IF) have been wrongly fetched and must be annulled
> These 2 instructions are flushed (turned into bubbles)
> 2 cycles are wasted
---
## Problem `F`
Consider a simple 32-entry register file with 2 read ports and 1 write port, described by Chisel HDL.
```scala
class RegisterFile extends Module {
val io = IO(new Bundle {
val rs1Addr = Input(UInt(5.W)) // Read address 1
val rs2Addr = Input(UInt(5.W)) // Read address 2
val rdAddr = Input(UInt(5.W)) // Write address
val rdData = Input(UInt(32.W)) // Write data
val we = Input(Bool()) // Write enable
val rs1Data = Output(UInt(32.W)) // Read data 1
val rs2Data = Output(UInt(32.W)) // Read data 2
})
val regfile = Reg(Vec(32, UInt(32.W))) // 32 registers, each 32 bits
// Read ports (combinational)
io.rs1Data := regfile(io.rs1Addr)
io.rs2Data := regfile(io.rs2Addr)
// Write port (sequential, on clock edge)
when(io.we) {
regfile(io.rdAddr) := io.rdData
}
}
```
1. In the read operation `io.rs1Data := regfile(io.rs1Addr)`, what does `:=` mean? __ F01 __
* F01: (Accept explanations of wire connection/driving; credit is given for answers with equivalent meaning)
> The `:=` operator in Chisel means "connects" or "drives". It creates a combinational path from the right-hand side to the left-hand side. In this case, it continuously drives io.rs1Data with the value from the register at address io.rs1Addr. This is combinational logic, not a software assignment - the connection persists and updates whenever the right-hand side changes.
2. Why is `Reg(Vec(...))` used instead of `Wire(Vec(...))`? __ F02 __
* F02: (Accept explanations of state retention; credit is given for answers with equivalent meaning)
> `Reg` creates registers that hold state across clock cycles, which is necessary for a register file that must remember values between writes. `Wire` would create combinational logic without memory - values would be lost after each clock cycle. The register file needs persistent storage, so `Reg` is required.
3. What happens if `io.rs1Addr` is 0 and we read from it? Is there special handling needed for register x0 in RISC-V? __ F03 __
* F03 = (Accept explanations of x0 hardwiring; credit is given for answers with equivalent meaning)
> In RISC-V, register x0 is hardwired to always read as 0, regardless of what is written to it. The code above does NOT implement this constraint - it would return whatever value is stored in regfile(0). To properly implement RISC-V semantics, we need:
> ```scala
> io.rs1Data := Mux(io.rs1Addr === 0.U, 0.U, regfile(io.rs1Addr))
> ```
> This ensures reads from x0 always return 0, even if something was written to it.
4. Complete the code to implement x0 hardwiring for both read ports: __ F04 __
```scala
io.rs1Data := Mux(io.rs1Addr === 0.U, 0.U, regfile(io.rs1Addr))
io.rs2Data := __ F04 __
```
* F04: ==`Mux(io.rs2Addr === 0.U, 0.U, regfile(io.rs2Addr))`==
5. Should writes to register x0 be prevented? Implement this: __ F05 __
```scala
when(io.we && __ F05 __) {
regfile(io.rdAddr) := io.rdData
}
```
* F05: ==`io.rdAddr =/= 0.U`==
> Blocking writes to x0 keeps `regfile(0)` from ever holding a non-zero value (which matters if the read-side mux from F04 is omitted) and avoids unnecessary register updates. The condition checks that the write address is not 0 before allowing the write.
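
Purely as a behavioral reference (plain C rather than Chisel, with array and function names of our choosing), the intended read/write semantics with x0 hardwired look like this:

```c
#include <stdint.h>
#include <stdio.h>

/* Behavioral model of the 32 x 32-bit register file with x0 hardwired to zero. */
static uint32_t regfile[32];

static uint32_t rf_read(uint32_t addr) {
    return (addr == 0) ? 0 : regfile[addr & 31];   /* x0 always reads as 0 */
}

static void rf_write(uint32_t addr, uint32_t data, int we) {
    if (we && addr != 0)                           /* writes to x0 are dropped */
        regfile[addr & 31] = data;
}

int main(void) {
    rf_write(0, 0xDEADBEEF, 1);   /* attempted write to x0 is ignored */
    rf_write(5, 123, 1);
    printf("x0 = %u, x5 = %u\n", (unsigned)rf_read(0), (unsigned)rf_read(5));  /* 0, 123 */
    return 0;
}
```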
---
## Problem `G`
Consider implementing a population count (popcount) function that counts the number of set bits (1s) in a 32-bit integer. This is useful for various algorithms including cryptography, compression, and error detection.
1. Implement a naive popcount function in RISC-V assembly using a loop:
```assembly
# Input: a0 = 32-bit value
# Output: a0 = count of set bits
popcount_loop:
add t0, x0, x0 # t0 = count = 0 __ G01 __
add t1, a0, x0 # t1 = value (copy of a0)
popcount_loop_iter:
beq t1, x0, popcount_done # if value == 0, done __ G02 __
andi t2, t1, 1 # t2 = value & 1 (check LSB) __ G03 __
add t0, t0, t2 # count += LSB __ G04 __
srli t1, t1, 1 # value >>= 1 __ G05 __
j popcount_loop_iter # repeat
popcount_done:
add a0, t0, x0 # return count __ G06 __
jalr x0, ra, 0
```
* G01: `add t0, x0, x0`
* G02: `beq t1, x0, popcount_done`
* G03: `andi t2, t1, 1`
* G04: `add t0, t0, t2`
* G05: `srli t1, t1, 1`
* G06: `add a0, t0, x0`
2. A more efficient approach uses Brian Kernighan's algorithm, which iterates once per set bit instead of once per bit position. The key insight is that `x & (x-1)` clears the lowest set bit. Implement this:
```assembly
# Input: a0 = 32-bit value
# Output: a0 = count of set bits
popcount_bk:
add t0, x0, x0 # count = 0
add t1, a0, x0 # value (copy of a0)
bk_loop:
beq t1, x0, bk_done # if value == 0, done
addi t0, t0, 1 # count++ __ G07 __
addi t2, t1, -1 # t2 = value - 1 __ G08 __
and t1, t1, t2 # value = value & (value - 1) __ G09 __
j bk_loop
bk_done:
add a0, t0, x0 # return count
jalr x0, ra, 0
```
* G07: `addi t0, t0, 1`
* G08: `addi t2, t1, -1`
* G09: `and t1, t1, t2`
3. Explain why `x & (x-1)` clears the lowest set bit: __ G10 __
* G10 = (Accept explanations of the bit manipulation; credit is given for answers with equivalent meaning)
> Subtracting 1 from x turns the lowest set bit into 0 and sets every bit below it to 1; all higher bits are unchanged.
> Example: x = 0b1010 (decimal 10)
> x-1 = 0b1001 (decimal 9)
> x & (x-1) = 0b1000 (decimal 8)
> ANDing x with x-1 keeps the higher bits (which are identical in both operands) and clears the lowest set bit; the bits below it are already 0 in x, so they stay 0.
4. For a value with k set bits, what is the time complexity of the Brian Kernighan algorithm compared to the naive loop? __ G11 __
* G11 = (Accept explanations of O(k) vs O(n); credit is given for answers with equivalent meaning)
> Brian Kernighan: O(k), where k = number of set bits (one iteration per set bit, worst case 32)
> Naive loop: O(n), where n = position of the highest set bit (the loop above exits once the shifted value reaches 0; worst case 32 iterations for a 32-bit value)
> For sparse bit patterns (few 1s), Brian Kernighan is significantly faster.
> Example: for value = 0x80000000 (one set bit, in the MSB):
> - Brian Kernighan: 1 iteration
> - Naive loop: 32 iterations
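
Both popcount variants in C, with iteration counters added to make the O(n)-versus-O(k) difference concrete (a sketch; the counter parameters are ours):

```c
#include <stdint.h>
#include <stdio.h>

/* Naive loop: one iteration per bit position until the remaining value is 0. */
static int popcount_loop(uint32_t x, int *iters) {
    int count = 0;
    for (*iters = 0; x != 0; x >>= 1, (*iters)++)
        count += x & 1u;
    return count;
}

/* Brian Kernighan: one iteration per set bit. */
static int popcount_bk(uint32_t x, int *iters) {
    int count = 0;
    for (*iters = 0; x != 0; (*iters)++) {
        x &= x - 1;            /* clear the lowest set bit */
        count++;
    }
    return count;
}

int main(void) {
    int it_loop, it_bk;
    uint32_t v = 0x80000001u;  /* two set bits, far apart */
    int n_naive = popcount_loop(v, &it_loop);
    int n_bk    = popcount_bk(v, &it_bk);
    printf("naive: %d set bits in %d iterations\n", n_naive, it_loop);
    printf("bk:    %d set bits in %d iterations\n", n_bk, it_bk);
    return 0;
}
```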
---
## Problem `H`
This problem explores runtime code generation where machine code is created dynamically and executed directly. Consider the following C program that generates RISC-V instructions at runtime:
```c
#include <stdint.h>
#include <stdio.h>
// Function pointer type: int func(int, int)
typedef int (*func_t)(int, int);
int main(int argc, char *argv[])
{
// Array of RISC-V instructions (machine code)
uint32_t instructions[] = {
[0] = H01, /* add a0, a0, a1 */
[1] = H02, /* slli a0, a0, 1 */
[2] = 0x00008067, /* H03 */
};
// Cast array to function pointer and execute
func_t jit = (func_t) instructions;
int result = jit(10, 5);
printf("Result: %d\n", result);
return 0;
}
```
This program creates a simple function that:
1. Adds two input parameters (a0 + a1)
2. Doubles the result (shift left by 1)
3. Returns the final value
Answer the following questions:
1. Encode the `add a0, a0, a1` instruction in hexadecimal: __ H01 __
* H01: ==`0x00b50533`== (not graded)
> ADD is R-type: opcode=0110011 (0x33), funct3=000, funct7=0000000
> rd=a0 (x10=01010), rs1=a0 (x10=01010), rs2=a1 (x11=01011)
> Encoding: [31:25]=funct7=0000000, [24:20]=rs2=01011, [19:15]=rs1=01010,
> [14:12]=funct3=000, [11:7]=rd=01010, [6:0]=opcode=0110011
> Binary: 0000000_01011_01010_000_01010_0110011
> Hex: 0x00b50533
2. Encode the `slli a0, a0, 1` instruction in hexadecimal: __ H02 __
* H02: ==`0x00151513`== (not graded)
> SLLI is I-type: opcode=0010011 (0x13), funct3=001, imm[11:5]=0000000 (shift amount in imm[4:0])
> rd=a0 (x10=01010), rs1=a0 (x10=01010), shamt=1 (00001)
> Encoding: [31:25]=0000000, [24:20]=shamt=00001, [19:15]=rs1=01010,
> [14:12]=funct3=001, [11:7]=rd=01010, [6:0]=opcode=0010011
> Binary: 0000000_00001_01010_001_01010_0010011
> Hex: 0x00151513
3. The instruction `0x00008067` is used at the end of the function. What is this instruction and what is its purpose? __ H03 __
* H03: ==`jalr zero, ra, 0`== OR ==`ret`==
> This instruction is `jalr x0, x1, 0`, which is the standard return instruction in RISC-V.
> **Instruction breakdown**:
> - JALR (Jump And Link Register): I-type instruction with opcode 1100111
> - rd=zero (x0): Discard the return address (no need to save PC+4)
> - rs1=ra (x1): Jump to the address stored in the return address register
> - imm=0: No offset added to the target address
>
> **Purpose**: Returns control to the calling function by:
> 1. Setting PC to the value in ra (return address saved by the caller)
> 2. Not saving a return address (since rd=x0)
> 3. This is the standard way to exit a function in RISC-V
>
> The pseudoinstruction `ret` is an alias for this exact encoding.
4. What is the expected output when this program runs with arguments jit(10, 5)? __ H04 __
* H04: ==`30`==
> Execution trace:
> - Initial: a0=10, a1=5
> - After `add a0, a0, a1`: a0 = 10 + 5 = 15
> - After `slli a0, a0, 1`: a0 = 15 << 1 = 15 * 2 = 30
> - After `ret`: return value = 30
> Output: "Result: 30"
5. Explain why we can cast a `uint32_t` array to a function pointer and execute it as code: __ H05 __
* H05: (Accept explanations covering memory layout, executable permissions, and calling conventions; credit is given for answers with equivalent meaning)
> **Von Neumann Architecture**: Modern computers use the von Neumann architecture where instructions and data share the same memory space. A `uint32_t` array is just a sequence of bytes in memory.
>
> **Function Pointer Mechanics**: A function pointer in C points to the first instruction of executable code. When we cast `(func_t) instructions`, we're telling the compiler to treat the array's memory address as the entry point of a function.
>
> **Execution Flow**: When `jit(10, 5)` is called:
> 1. Arguments 10 and 5 are placed in registers a0 and a1 (RISC-V calling convention)
> 2. PC (program counter) jumps to the array's address
> 3. CPU fetches and executes each 32-bit word as a RISC-V instruction
> 4. The `ret` instruction returns control to the caller with result in a0
>
> **Security Note**: This works because the memory page containing the array has executable permissions. Modern systems with W^X (Write XOR Execute) protection may require `mprotect()` or similar calls to mark memory as executable. On rv32emu or systems without strict memory protection, this works directly.
>
> **JIT Compilation Applications**: This technique is used in JavaScript engines (V8, SpiderMonkey), JVM, Python (PyPy), and dynamic language runtimes to generate optimized machine code at runtime based on actual program behavior.
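
On a system that enforces W^X, the array cannot be executed in place; the code must be copied into a page that is then remapped as executable. A minimal sketch, assuming a Linux-style RISC-V target with `mmap`/`mprotect` and a GCC/Clang toolchain (page size and error handling are simplified):

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*func_t)(int, int);

int main(void) {
    /* Same three instructions as in the program above. */
    uint32_t code[] = { 0x00b50533, 0x00151513, 0x00008067 };

    /* Get a writable anonymous page, copy the code in, then flip it to read+execute. */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    memcpy(page, code, sizeof(code));
    if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0)
        return 1;
    __builtin___clear_cache((char *)page, (char *)page + sizeof(code));  /* sync I-cache */

    func_t jit = (func_t)page;
    printf("Result: %d\n", jit(10, 5));   /* expected: 30 on an RV32 target */
    return 0;
}
```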
6. To create a function that computes `(a * 3) + b`, we can use the following instruction sequence. Fill in the hexadecimal encoding for the first instruction: __ H06 __
```assembly
slli t0, a0, 1 # t0 = a0 * 2 (encoding: H06)
add t0, t0, a0 # t0 = (a0 * 2) + a0 = a0 * 3
add a0, t0, a1 # a0 = (a0 * 3) + a1
jalr zero, ra, 0 # return
```
* H06: ==`0x00151293`==
> SLLI is I-type: opcode=0010011, funct3=001
> rd=t0 (x5=00101), rs1=a0 (x10=01010), shamt=1 (00001)
> Encoding: [31:25]=0000000, [24:20]=00001, [19:15]=01010,
> [14:12]=001, [11:7]=00101, [6:0]=0010011
> Binary: 0000000_00001_01010_001_00101_0010011
> Hex: 0x00151293
>
> Full instruction sequence encodings:
> - `slli t0, a0, 1`: 0x00151293
> - `add t0, t0, a0`: 0x00a282b3
> - `add a0, t0, a1`: 0x00b28533
> - `jalr zero, ra, 0`: 0x00008067
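
The encodings above can be cross-checked with a tiny C encoder (a sketch; the field packing follows the standard R-type and I-type layouts, and the register numbers follow the RISC-V ABI mapping):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode */
static uint32_t enc_r(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                      uint32_t funct3, uint32_t rd, uint32_t opcode) {
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* I-type: imm[11:0] | rs1 | funct3 | rd | opcode */
static uint32_t enc_i(uint32_t imm, uint32_t rs1, uint32_t funct3,
                      uint32_t rd, uint32_t opcode) {
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode;
}

int main(void) {
    /* Register numbers: a0 = x10, a1 = x11, t0 = x5, ra = x1, zero = x0. */
    printf("add  a0, a0, a1  = 0x%08" PRIx32 "\n", enc_r(0, 11, 10, 0, 10, 0x33)); /* 0x00b50533 */
    printf("slli a0, a0, 1   = 0x%08" PRIx32 "\n", enc_i(1, 10, 1, 10, 0x13));     /* 0x00151513 */
    printf("slli t0, a0, 1   = 0x%08" PRIx32 "\n", enc_i(1, 10, 1, 5, 0x13));      /* 0x00151293 */
    printf("jalr zero, ra, 0 = 0x%08" PRIx32 "\n", enc_i(0, 1, 0, 0, 0x67));       /* 0x00008067 */
    return 0;
}
```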
---