---
tags: computer-arch
---

# Quiz3 of Computer Architecture (2025 Fall)

> Solutions

## Problem `A`

[JIT (Just-In-Time) compilation](https://en.wikipedia.org/wiki/Just-in-time_compilation) is a technique where code is compiled at runtime rather than ahead of time. This approach is used by:
- Virtual Machines: Java VM, JavaScript engines (V8, SpiderMonkey)
- Language Runtimes: Python (PyPy), Ruby (YJIT), Lua (LuaJIT)
- Binary Translation: QEMU, rv32emu

How JIT Works: The Three-Step Process
```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│  Source Code    │      │  Machine Code   │      │    Execute      │
│  (interpreted)  │ ───> │  (generated at  │ ───> │  (native CPU)   │
│                 │      │   runtime)      │      │                 │
└─────────────────┘      └─────────────────┘      └─────────────────┘
      Step 1                   Step 2                  Step 3
   Analyze code          Generate machine        Jump to generated
                         code in memory          code and run
```

Key Insight: A JIT compiler treats **code as data**: it generates machine instructions (just numbers!) and stores them in memory, then jumps to that memory location to execute them.

Consider this C code that demonstrates JIT:
```c
typedef int func_t(void); /* type of the generated function */

uint32_t instructions[] = {
    0x00001537, // lui a0, 1 (load upper immediate: a0 = 1 << 12 = 4096)
    0x00008067, // jalr zero, ra, 0 (return)
};

// Treat the array as a function!
func_t *jit = (func_t *) instructions;
int result = jit(); // Calls the generated code, returns 4096
```

This works because:
1. Machine instructions are just 32-bit numbers (e.g., `0x00001537`)
2. We can store these numbers in memory (as data)
3. The CPU can execute from that memory location (as code)

Why Copy from Data to Code Section?

Modern systems enforce **NX (No-eXecute) protection**: memory is either writable OR executable, but not both. This prevents security attacks.
* Problem: If we generate code in writable memory (`.data` section), we cannot execute it directly.
* Solution: JIT compilers use a two-step process:
```
1. Generate code in writable memory (data section)
   ┌─────────────────────────────┐
   │ .data (writable, not exec)  │
   │ [0x12345678, 0xABCDEF00]    │ ← Generate machine code here
   └─────────────────────────────┘

2. Copy to executable memory (code section)
   ┌───────────────────────────────┐
   │ .text (executable, not write) │
   │ [0x12345678, 0xABCDEF00]      │ ← Copy instructions here
   └───────────────────────────────┘

3. Jump to executable memory and run
   ┌───────────────────────────────┐
   │ CPU executes: jalr ra, t0, 0  │ ← Jump and execute!
   └───────────────────────────────┘
```

In real JIT systems (like V8 or rv32emu), this is done using system calls like `mmap()` or `mprotect()` to mark memory as executable. In our simplified Ripes simulation, we copy from `.data` to `.text` to achieve the same effect.

The following program simulates a minimal JIT compiler. It demonstrates the core concept: generating machine code at runtime and executing it.

Program Structure
```assembly
.data
jit_instructions:
    .word 0x????????    # Your answer goes here
    .word 0x00008067    # jalr zero, ra, 0 (ret)

.text
.globl _start
_start:
    la   t0, jit_instructions
    la   t1, jit_code_buffer
    lw   t2, 0(t0)
    sw   t2, 0(t1)
    lw   t2, 4(t0)
    sw   t2, 4(t1)

    la   t0, jit_code_buffer
    jalr ra, t0, 0

    li   a7, 1          # print integer in a0
    ecall
    li   a0, 10
    li   a7, 11         # print character (newline)
    ecall
    li   a7, 10         # exit
    ecall

jit_code_buffer:
    .word 0x00000013    # nop (placeholder, will be overwritten)
    .word 0x00000013    # nop (placeholder, will be overwritten)
```

Execution Flow Diagram
```
Memory Layout:
┌──────────────────────────────────────┐
│ .data section (read/write)           │
│   ┌────────────────────────────────┐ │
│   │ jit_instructions:              │ │
│   │ [0x????????]  ← Step 1         │ │
│   │ [0x00008067]                   │ │
│   └────────────────────────────────┘ │
└──────────────────────────────────────┘
               │
               │ Step 2: lw/sw (copy)
               ↓
┌──────────────────────────────────────┐
│ .text section (executable)           │
│   ┌────────────────────────────────┐ │
│   │ jit_code_buffer:               │ │
│   │ [0x????????]  ← Copied         │ │
│   │ [0x00008067]                   │ │
│   └────────────────────────────────┘ │
│         ↑                            │
│         │ Step 3: jalr (jump)        │
│         │ Execute!                   │
└──────────────────────────────────────┘
               │
               ↓
        Result in a0 = 42
```

What Makes This "JIT"?
1. Code Generation (simulated): The `.word 0x????????` represents machine code that would normally be generated by analyzing source code
2. Runtime Copying: Instructions are copied from data to executable memory at runtime
3. Dynamic Execution: The program jumps to newly-copied code using `jalr`
4. Return Value: The executed code returns a value (42) in register `a0`

This mimics what happens inside JavaScript engines, Python JITs, and other dynamic language runtimes, except that they generate the machine code from actual program analysis rather than using pre-computed values.

The program should output **42** when executed in the Ripes simulator. Your task is to determine the correct instruction encoding for the first word in `jit_instructions` (marked with `0x????????`).

Constraints:
- You must use a **single RV32I instruction** that sets register `a0` to the value 42
- The instruction will be executed as part of the JIT code buffer
- When the JIT code is called, all general-purpose registers except `zero` have unpredictable values
- Only register `zero` is guaranteed to contain the value 0

Why these constraints?
- **Single instruction**: We are simulating the simplest possible JIT-compiled function
- **Only `zero` is predictable**: Real JIT code cannot assume register values; it must build up state from scratch
- **Must use `zero`**: This forces you to think about generating constants from nothing, a fundamental JIT challenge

:::info
For all answers starting with **0x**, you should omit the hexadecimal prefix when responding.
:::

### Q1: ADDI Instruction

To make the program output 42, what is the hexadecimal encoding if you use the `addi` instruction?
> Answer: `0x___ A01 ___`
> ==`0x02a00513`==
>
> Assembly: `addi a0, zero, 42`
>
> Encoding Format (I-type):
```
 31           20 19    15 14    12 11    7 6        0
[     imm      ][  rs1  ][funct3][  rd   ][ opcode  ]
[ 000000101010 ][ 00000 ][ 000  ][ 01010 ][ 0010011 ]
```
> Field Values:
- `imm[11:0]` = 42 (decimal) = `000000101010` (binary) = `0x02a` (hex)
- `rs1` = 0 (register `zero` = x0)
- `funct3` = `000` (identifies ADDI operation)
- `rd` = 10 (register `a0` = x10)
- `opcode` = `0010011` (I-type arithmetic/logical operations)
>
> Step-by-step encoding:
1. Start with immediate: `000000101010` (bits 31-20)
2. Add source register: `00000` (bits 19-15)
3. Add function code: `000` (bits 14-12)
4. Add destination: `01010` (bits 11-7)
5. Add opcode: `0010011` (bits 6-0)
>
> Binary: `00000010101000000000010100010011`
> Hexadecimal: `0x02a00513`

Show your work: Provide the instruction format breakdown (opcode, rd, funct3, rs1, immediate).

> Answer: `__ A02 __`
> (Any answer with a similar meaning receives credit.)
>
> Why ADDI is the canonical choice:
> - Most intuitive: "add immediate value to zero"
> - Commonly used for loading small constants in real code
> - All compilers emit this for constant initialization

### Q2: OR Instruction

Can you use the `or` instruction or its variants (e.g., `ori`) to achieve the same result?

> Answer: `0x___ A03 ___` (or write `N/A` if impossible)
> ==`0x02a06513`==

Explanation: If you wrote `N/A`, explain why the OR instruction cannot produce the value 42. If you provided an encoding, explain why it works.

> Answer: `__ A04 __`
> The bitwise OR operation has the **identity property** when one operand is zero:
```
0 | x = x   (for any value x)
```
> Proof by truth table (single bit):
```
zero | x  →  result
  0  | 0  →    0
  0  | 1  →    1
```
> Since OR outputs 1 if *either* input is 1, and zero provides all 0s, the result is exactly `x`.
> Applied to our case:
```
a0 = zero | 42
a0 = 0 | 42
a0 = 42 ✓
```
> Encoding (I-type):
```
 31           20 19    15 14    12 11    7 6        0
[ 000000101010 ][ 00000 ][ 110  ][ 01010 ][ 0010011 ]
```
> Field Values:
- `imm[11:0]` = 42
- `rs1` = 0 (zero)
- `funct3` = `110` (identifies ORI) ← **Only difference from ADDI**
- `rd` = 10 (a0)
- `opcode` = `0010011`
>
> Binary: `00000010101000000110010100010011`
> Hexadecimal: `0x02a06513`
>
> Comparison with ADDI:
```
ADDI: 0x02a00513 = ...0000000...  (funct3 = 000)
ORI:  0x02a06513 = ...0000110...  (funct3 = 110)
                        ^^^
             Only 2 bits different!
```
> Note on R-type `or`:
> The R-type instruction `or a0, zero, rs2` cannot be used because:
- It requires `rs2` to contain 42
- We cannot guarantee any register except `zero` has a predictable value
- This would violate the constraint that only `zero` is reliable

**Hint:** Consider the bitwise identity: `0 | x = ?`

### Q3: XOR Instruction

Can you use the `xor` instruction or its variants (e.g., `xori`) to achieve the same result?

> Answer: `0x___ A05 ___` (or write `N/A` if impossible)
> ==`0x02a04513`==

**Explanation:** If you wrote `N/A`, explain why the XOR instruction cannot produce the value 42. If you provided an encoding, explain why it works.

> Answer: `__ A06 __`
> The bitwise XOR operation has the **identity property** when one operand is zero:
```
0 ⊕ x = x   (for any value x)
```
> Proof by truth table (single bit):
```
zero ⊕ x  →  result
  0  ⊕ 0  →    0
  0  ⊕ 1  →    1
```
> XOR outputs 1 if inputs are *different*. With zero as one input, the result equals the other input.
> Applied to our case:
```
a0 = zero ⊕ 42
a0 = 0 ⊕ 42
a0 = 42 ✓
```
> Encoding (I-type):
```
 31           20 19    15 14    12 11    7 6        0
[ 000000101010 ][ 00000 ][ 100  ][ 01010 ][ 0010011 ]
```
> Field Values:
- `imm[11:0]` = 42
- `rs1` = 0 (zero)
- `funct3` = `100` (identifies XORI) ← **Only difference from ADDI**
- `rd` = 10 (a0)
- `opcode` = `0010011`
>
> Binary: `00000010101000000100010100010011`
> Hexadecimal: `0x02a04513`

**Hint:** Consider the bitwise identity: `0 ⊕ x = ?`

### Q4: AND Instruction

Can you use the `and` instruction or its variants (e.g., `andi`) to achieve the same result?

> Answer: `0x___ A15 ___` (or write `N/A` if impossible)
> ==`N/A`==

**Explanation:** If you wrote `N/A`, explain why the AND instruction cannot produce the value 42. If you provided an encoding, explain why it works.

> Answer: `__ A16 __`
> The AND instruction cannot produce 42.
> The bitwise AND operation has the **annihilator property** when one operand is zero:
```
0 & x = 0   (for any value x)
```
> Proof by truth table (single bit):
```
zero & x  →  result
  0  & 0  →    0
  0  & 1  →    0
```
> AND outputs 1 only if *both* inputs are 1. Since zero provides all 0s, the result is always 0.
>
> Analysis of `andi a0, zero, 42`:
```
a0 = zero & 42
a0 = 0 & 42
a0 = 0 ≠ 42 ✗
```
> Binary breakdown:
```
  00000000000000000000000000000000  (zero = 0)
& 00000000000000000000000000101010  (42 = 0b101010)
----------------------------------
  00000000000000000000000000000000  (result = 0)
```
> Every bit position: `0 & ? = 0`, so the result is all zeros.
>
> What about R-type `and`?
> The R-type instruction `and a0, rs1, rs2` would require:
```
rs1 & rs2 = 42
```
> For this to work:
- Both `rs1` and `rs2` must have 1s in positions where 42 has 1s
- At least one must have 0s in positions where 42 has 0s
>
> Example: `rs1 = 42, rs2 = 42` would work, but:
- We cannot guarantee any register except `zero` has a predictable value
- We cannot rely on uninitialized register contents
- Not portable or reliable

**Hint:** Consider the bitwise identity: `0 & x = ?`

### Q5: Alternative Instructions

Besides the four instruction types mentioned above (ADDI, OR, XOR, AND), can you use **any other RV32I instruction** to achieve the same result with a single instruction?

> Answer: `0x___ A07 ___` (or write `N/A` if impossible)
> ==`N/A`==

**Explanation:** If you found alternative instruction(s), list them with their encodings and explain why they work. If you wrote `N/A`, provide a systematic argument explaining why no other single RV32I instruction can reliably produce the value 42 in register `a0`.

> Answer: `__ A08 __`
> No other single RV32I instruction can reliably produce the value 42 in register `a0`.
>
> Category 1: I-type Shift Instructions
> Instructions: `slli`, `srli`, `srai`
> Analysis:
```
slli a0, zero, shamt  →  a0 = 0 << shamt = 0
srli a0, zero, shamt  →  a0 = 0 >> shamt = 0
srai a0, zero, shamt  →  a0 = 0 >> shamt = 0 (arithmetic)
```
> For any shift amount (0-31):
- Left shift: zeros shifted in from right → result is 0
- Right shift: zeros shifted in from left → result is 0
> Cannot produce 42.
>
> U-type Instructions
> LUI (Load Upper Immediate):
```
lui a0, imm
a0 = imm << 12
```
> To get `a0 = 42`:
```
imm << 12 = 42
imm = 42 >> 12 = 0
```
> But `0 << 12 = 0 ≠ 42`.
>
> The smallest nonzero value LUI can load is `1 << 12 = 4096`, which is far larger than 42.
> Cannot produce 42.
> AUIPC (Add Upper Immediate to PC):
```
auipc a0, imm
a0 = PC + (imm << 12)
```
> To get `a0 = 42`:
```
PC + (imm << 12) = 42
```
> Problems:
- PC value depends on where the JIT buffer is located in memory
- PC changes based on code layout
- Not portable across different assemblies or runs
- Would only work by coincidence, not by design
>
> ... (other types)
>
> After systematically examining **all RV32I instruction categories**, only three instructions can reliably produce the value 42 in register `a0` using a single instruction:
1. `addi a0, zero, 42` → `0x02a00513`
2. `xori a0, zero, 42` → `0x02a04513`
3. `ori a0, zero, 42` → `0x02a06513`
>
> Summary of why others fail:

| Category | Reason for Failure |
|----------|-------------------|
| Shift (I-type) | Shifting 0 always gives 0 |
| LUI | Operates on upper 20 bits; 42 too small |
| AUIPC | PC-dependent; not portable |
| R-type | Operating on two zeros gives 0 |
| JAL | PC-dependent; has side effects |
| JALR | Doesn't write 42; has side effects |
| Loads | Memory-dependent; unpredictable |
| Stores | Don't modify registers |
| Branches | Don't modify registers |

**Hint:** Consider all RV32I instruction categories:
- I-type: arithmetic/logical/shift/load
- R-type: register-register operations
- U-type: `lui`, `auipc`
- J-type: `jal`
- B-type: branches (do not modify `a0`)
- S-type: stores (do not modify `a0`)

For each category, ask:
1. Can it write to `a0`?
2. Can it produce the value 42 using only `zero` (value 0)?
3. Does it have predictable behavior (no dependency on PC or memory)?

### Q6: Encoding Difference Analysis

Analyze the valid solutions (if you found multiple). What is the **only difference** in their binary encodings? Which specific bit field differs, and what does this field represent in the RISC-V instruction format?
> Answer: `__ A09 __`
> ==The only difference among the three valid solutions is the `funct3` field (bits [14:12]).==
>
> Binary Comparison:
```
0x02a00513 = 00000010101000000 000 010100010011  (addi, funct3=000)
0x02a04513 = 00000010101000000 100 010100010011  (xori, funct3=100)
0x02a06513 = 00000010101000000 110 010100010011  (ori,  funct3=110)
             ^^^^^^^^^^^^^^^^^ ^^^ ^^^^^^^^^^^^
                All identical   │   All identical
                                └─ Only difference: bits [14:12]
```
> Detailed bit-by-bit comparison:
```
Bit position:  31                             0
ADDI:          00000010101000000000010100010011
XORI:          00000010101000000100010100010011
ORI:           00000010101000000110010100010011
               ┗━━━━━━━━━━━━━━━┛┗━┛ ┗━━━┛┗━━━━━┛
               Same (imm + rs1) diff same  same
                              funct3   rd  opcode
```
> Field Breakdown:
```
Bits [31:20]: imm[11:0] = 000000101010 (42)  ✓ Same
Bits [19:15]: rs1       = 00000 (zero)       ✓ Same
Bits [14:12]: funct3    = 000/100/110        ← DIFFERENT
Bits [11:7]:  rd        = 01010 (a0)         ✓ Same
Bits [6:0]:   opcode    = 0010011            ✓ Same
```
> The `funct3` field is a **3-bit function code** that differentiates operations within the same instruction type and opcode.
>
> For I-type arithmetic/logical instructions (opcode = `0010011`):

| funct3 | Instruction | Operation |
|--------|-------------|-----------|
| `000` | ADDI | rd = rs1 + imm |
| `010` | SLTI | rd = (rs1 < imm) ? 1 : 0 (signed) |
| `011` | SLTIU | rd = (rs1 < imm) ? 1 : 0 (unsigned) |
| `100` | XORI | rd = rs1 ⊕ imm |
| `110` | ORI | rd = rs1 \| imm |
| `111` | ANDI | rd = rs1 & imm |
| `001` | SLLI | rd = rs1 << imm[4:0] |
| `101` | SRLI/SRAI | rd = rs1 >> imm[4:0] (logical/arithmetic) |

> Key insight: All 8 I-type operations share:
- Same opcode (`0010011`)
- Same instruction format
- Same encoding scheme
>
> The CPU's decode logic examines `funct3` to determine which ALU operation to perform.
> Why These Three Work (Mathematical Perspective)
> Identity elements in Boolean algebra and arithmetic:
```
ADDI: 0 + 42 = 42 ✓  (0 is the additive identity)
XORI: 0 ⊕ 42 = 42 ✓  (0 is the XOR identity)
ORI:  0 | 42 = 42 ✓  (0 is the OR identity)
ANDI: 0 & 42 = 0  ✗  (0 is the AND annihilator, not identity)
```
> Why AND fails:
> AND has **1** as its identity (`x & 1 = x` per bit; all-ones for a 32-bit word), but we only have access to **0** (the zero register), which is the annihilator.

Why this matters for JIT compilers: Understanding that multiple instructions can produce the same result is crucial for JIT optimization. A JIT might choose `ori` over `addi` if that simplifies its instruction-selection logic; real JIT compilers constantly make such micro-decisions.

### Q7: Instruction Type Classification

After verifying all valid answers, what **instruction encoding type** do they all belong to in the RISC-V instruction format?

> Answer: `__ A10 __`
> ==All valid solutions belong to the I-type (Immediate-type) instruction format==

**Explanation:** Describe the key characteristics of this instruction type:
- What is the general format?
- What kind of operations does this type typically support?
- Why is this type suitable for loading constants (like the value 42)?

> Answer: `__ A11 __`
> What is I-type format?
> I-type instructions are one of the six basic instruction formats in RISC-V. They are designed for operations involving an immediate value (a constant encoded directly in the instruction).
> General format:
```
 31           20 19    15 14    12 11    7 6        0
[     imm      ][  rs1  ][funct3][  rd   ][ opcode  ]
[   12 bits    ][5 bits ][3 bits][5 bits ][ 7 bits  ]
```
> Field descriptions:
- `imm[11:0]`: 12-bit signed immediate value (constant)
- `rs1`: 5-bit source register
- `funct3`: 3-bit function code (selects specific operation)
- `rd`: 5-bit destination register
- `opcode`: 7-bit operation code (identifies instruction type)
>
> Operations supported by I-type:
> I-type instructions typically support:
1. **Arithmetic operations with immediates:**
   - `addi` (add immediate)
   - `slti` (set less than immediate, signed)
   - `sltiu` (set less than immediate, unsigned)
2. **Logical operations with immediates:**
   - `xori` (XOR immediate)
   - `ori` (OR immediate)
   - `andi` (AND immediate)
3. **Shift operations:**
   - `slli` (shift left logical immediate)
   - `srli` (shift right logical immediate)
   - `srai` (shift right arithmetic immediate)
4. **Load operations:**
   - `lb`, `lh`, `lw` (load byte/halfword/word)
   - `lbu`, `lhu` (load byte/halfword unsigned)
5. **Jump operations:**
   - `jalr` (jump and link register)
>
> Why I-type is suitable for loading constants:
1. **Immediate field:** The 12-bit immediate field can encode constants from -2048 to +2047, covering small integer values like 42.
2. **Single instruction:** Loading a small constant requires only one instruction (no need to load from memory or use multiple instructions).
3. **Efficient encoding:** The constant is embedded directly in the instruction, making it compact and fast.
4. **Versatile operations:** Multiple operations (add, or, xor) can achieve the same result, giving the compiler/JIT flexibility in instruction selection.
> Why the value 42 fits perfectly:
```
42 (decimal) = 0b000000101010 (binary)
```
> Since 42 requires only 6 bits and fits comfortably within the 12-bit immediate field, any I-type arithmetic/logical instruction can encode it directly:
```
addi a0, zero, 42:  imm = 000000101010
xori a0, zero, 42:  imm = 000000101010
ori  a0, zero, 42:  imm = 000000101010
```
> Contrast with other instruction types:
- **R-type:** Requires two source registers; cannot encode immediate values
- **S-type:** Used for stores; does not write to destination register
- **B-type:** Used for branches; does not write to destination register
- **U-type:** Has 20-bit immediate but loads to upper bits (value × 4096); cannot represent small values like 42
- **J-type:** Used for jumps; saves PC+4, not arbitrary constants
>
> Key insight for JIT compilers:
> I-type instructions are the workhorse for constant materialization in JIT compilers. For small constants (fitting in 12 bits), a single I-type instruction suffices. For larger constants, JIT compilers must use multi-instruction sequences like:
```assembly
lui  a0, %hi(large_constant)     # Load upper 20 bits (U-type)
addi a0, a0, %lo(large_constant) # Add lower 12 bits (I-type)
```
> This is why understanding I-type encoding is fundamental for anyone working on code generation, JIT compilation, or compiler backends.

**Hint:** RISC-V defines six basic instruction formats: R-type, I-type, S-type, B-type, U-type, and J-type. Each format serves different purposes and has different field layouts.

### Q8: Optimization

If the JIT needed to generate code for `return x + 42` (where `x` is in register `a1`), what single instruction would you use? Show the efficient one.
> Answer: `__ A12 __`
> ==`addi a0, a1, 42`==

**Explanation:**

> Answer: `__ A13 __`
> Encoding:
```
imm[11:0] = 42  = 000000101010
rs1       = a1  = 11 (x11)
funct3    = 000 (ADDI)
rd        = a0  = 10 (x10)
opcode    = 0010011

Binary: 00000010101001011000010100010011
Hex:    0x02a58513
```
> Why this works:
- `a1` contains the input value `x`
- Add immediate 42 to it
- Store result in `a0` (standard return value register)
>
> Alternative (less efficient):
> Could use two instructions:
```assembly
addi a0, zero, 42   # Load constant
add  a0, a0, a1     # Add input
```
> But the single `addi a0, a1, 42` is better: fewer instructions, lower latency.

### Q9: Security

Why do modern systems prevent executing code from writable memory? What attacks does this prevent?

> Answer: `__ A14 __`
> (Credit given for mentioning buffer overflow or ROP.)
> NX (No-eXecute) / W⊕X (Write XOR Execute) protection prevents a class of attacks called code injection.
> Attack scenario without NX:
1. **Buffer overflow:** Attacker overflows a buffer on the stack:
```c
char buffer[64];
gets(buffer); // No bounds checking!
```
2. **Inject shellcode:** Attacker writes malicious machine code into the buffer:
```
buffer = [0x93, 0x08, 0xD0, 0x0D,  // addi a7, zero, 221 (execve syscall number)
          0x67, 0x80, 0x00, 0x00,  // jalr zero, ra, 0
          ...]
```
3. **Redirect execution:** Overflow overwrites return address to point to buffer
4. **Execute payload:** Function returns, jumps to buffer, executes shellcode
5. **System compromised:** Attacker has shell access
>
> Other attacks prevented:
- Return-oriented programming (ROP) - harder without executable stack
- Heap spraying - cannot execute injected code on heap

---

## Problem `B`

Consider a RISC-V instruction set simulator capable of loading and executing the JIT program presented in Problem `A`, using all possible instructions required to produce the expected result. The simulator should be able to fetch, decode, and execute a subset of RISC-V instructions.
A partial implementation is shown below, and you are required to complete the code. Full source code: [q3-vm.c](https://github.com/sysprog21/ca2025-quizzes/blob/main/q3-vm.c)

B01: ==`XORI`==
B02: ==`cpu->regs[inst.rd] = cpu->regs[inst.rs1] ^ imm`== or equivalent
B03: ==`ORI`==
B04: ==`cpu->regs[inst.rd] = cpu->regs[inst.rs1] | imm`== or equivalent
B05: ==`ADDI`==
B06: ==`cpu->regs[inst.rd] = cpu->regs[inst.rs1] + imm`== or equivalent
B07: ==`(cpu->regs[inst.rs1] + imm) & ~1`== or equivalent
B08: ==4==
B09: ==jump table== or branch table
B10: (Credit given if the code is close; covering any one of the instructions is sufficient.)

Full source code: [q3-vm-jump-table.c](https://github.com/sysprog21/ca2025-quizzes/blob/main/q3-vm-jump-table.c)

```c
/* Fetch and dispatch macro */
#define FETCH_AND_DISPATCH()                   \
    do {                                       \
        if (cpu->pc >= code_size)              \
            goto end;                          \
        raw = LOAD32(cpu->pc);                 \
        cpu->pc += 4;                          \
        opcode = OPCODE(raw);                  \
        void *target = dispatch_table[opcode]; \
        if (!target)                           \
            goto illegal_instruction;          \
        goto *target;                          \
    } while (0)

/*
 * Instruction implementations using computed goto
 * Each label handles one instruction type, then fetches the next
 */
op_integer_comp_ri: {
    int32_t imm = I_IMM(raw);
    uint8_t rd = RD(raw);
    uint8_t rs1 = RS1(raw);
    uint8_t funct3 = FUNCT3(raw);
    switch (funct3) {
    case ADDI:
        regs[rd] = regs[rs1] + imm;
        break;
    case XORI:
        regs[rd] = regs[rs1] ^ imm;
        break;
    case ORI:
        regs[rd] = regs[rs1] | imm;
        break;
    default:
        fatal("Unknown FUNCT3 for INTEGER_COMP_RI: 0x%x\n", funct3);
    }
    regs[0] = 0; /* x0 always zero */
    FETCH_AND_DISPATCH();
}

op_load: {
    int32_t imm = I_IMM(raw);
    uint8_t rd = RD(raw);
    uint8_t rs1 = RS1(raw);
    uint8_t funct3 = FUNCT3(raw);
    if (funct3 != LW)
        fatal("Unknown FUNCT3 for LOAD: 0x%x\n", funct3);
    uint32_t addr = regs[rs1] + imm;
    regs[rd] = LOAD32(addr);
    regs[0] = 0;
    FETCH_AND_DISPATCH();
}

op_store: {
    /* S-type immediate: imm[11:5] = inst[31:25], imm[4:0] = inst[11:7] */
    uint32_t imm_11_5 = (raw >> 25) & 0x7F;
    uint32_t imm_4_0 = (raw >> 7) & 0x1F;
    int32_t imm = (imm_11_5 << 5) | imm_4_0;
    if (imm & 0x800)
        imm |= 0xFFFFF000;
    uint8_t rs1 = RS1(raw);
    uint8_t rs2 = RS2(raw);
    uint8_t funct3 = FUNCT3(raw);
    if (funct3 != SW)
        fatal("Unknown FUNCT3 for STORE: 0x%x\n", funct3);
    uint32_t addr = regs[rs1] + imm;
    STORE32(addr, regs[rs2]);
    FETCH_AND_DISPATCH();
}

op_jalr: {
    int32_t imm = I_IMM(raw);
    uint8_t rd = RD(raw);
    uint8_t rs1 = RS1(raw);
    uint32_t target = (regs[rs1] + imm) & ~1;
    regs[rd] = cpu->pc;
    cpu->pc = target;
    regs[0] = 0;
    FETCH_AND_DISPATCH();
}

op_auipc: {
    int32_t imm = U_IMM(raw);
    uint8_t rd = RD(raw);
    regs[rd] = cpu->pc - 4 + imm; /* PC already incremented */
    regs[0] = 0;
    FETCH_AND_DISPATCH();
}

op_ecall: {
    if (raw == 0x100073) { /* EBREAK - ignore */
        FETCH_AND_DISPATCH();
    }
    ecall_handler(cpu);
    FETCH_AND_DISPATCH();
}

illegal_instruction:
    fatal("Illegal Instruction 0x%x at PC 0x%lx\n", opcode, cpu->pc - 4);

end:
    cpu_free(cpu);
```

---

## Problem C

`rsqrt` provides a fast reciprocal square root implementation optimized for RV32I.
- **No hardware multiplier**: Uses shift-add for 32×32 → 64-bit multiplication
- **LUT + Newton iterations**: Balances precision and performance
- **Precision**: ~3-8% error

Newton's method is an algorithm to compute the root of a function.

![image](https://hackmd.io/_uploads/SyThnZ4Agl.png =70%x)

$f'(x_n)$ is the derivative of $f$ at point $x_n$, which equals the slope of the tangent line at $(x_n, f(x_n))$. The tangent line passes through $(x_n, f(x_n))$ with slope $f'(x_n)$:

$$
y - f(x_n) = f'(x_n)(x - x_n)
$$

Setting $y = 0$ to find where the tangent crosses the x-axis gives us the next approximation:

$$
0 - f(x_n) = f'(x_n)(x_{n+1} - x_n)
$$

Solving for $x_{n+1}$:

$$
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
$$

To compute the reciprocal square root, we want to find $y$ such that $y = \frac{1}{\sqrt{x}}$. Equivalently, we want to solve $y^2 = \frac{1}{x}$, or $\frac{1}{y^2} = x$.

Define the function: $f(y) = \frac{1}{y^2} - x$

The root of $f(y) = 0$ gives us $\frac{1}{y^2} = x$, so $y = \frac{1}{\sqrt{x}}$.
Computing the derivative: $f'(y) = -\frac{2}{y^3}$

Applying Newton's formula $y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)}$:

$$
y_{n+1} = y_n - \frac{\frac{1}{y_n^2} - x}{-\frac{2}{y_n^3}} = y_n - \frac{y_n^3(\frac{1}{y_n^2} - x)}{-2} = y_n + \frac{y_n - xy_n^3}{2}
$$

Simplifying:

$$
y_{n+1} = y_n + \frac{y_n(1 - xy_n^2)}{2} = \frac{2y_n + y_n - xy_n^3}{2} = \frac{y_n(3 - xy_n^2)}{2}
$$

Final formula:

$$
y_{n+1} = y_n \left(\frac{3 - xy_n^2}{2}\right) = y_n \left(\frac{3}{2} - \frac{xy_n^2}{2}\right)
$$

For integer inputs $x \geq 1$, the reciprocal square root $\frac{1}{\sqrt{x}} \leq 1$ must be represented in either floating-point or fixed-point format. Since we assume no floating-point unit (FPU), we use fixed-point representation. In Q0.32 format, values less than 1.0 are represented with 32-bit fractional precision:

```c
/*
 * Newton iteration: new_y = y * (3/2 - x * y^2 / 2)
 * Here, y is a Q0.32 fixed-point number (< 1.0)
 */
static void newton_step(uint32_t *rec_inv_sqrt, uint32_t x)
{
    uint32_t invsqrt, invsqrt2;
    uint64_t val;

    invsqrt = *rec_inv_sqrt; /* Dereference pointer */
    invsqrt2 = ((uint64_t)invsqrt * invsqrt) >> 32;
    val = (3LL << 32) - ((uint64_t)x * invsqrt2);
    val >>= 2; /* Avoid overflow in following multiply */
    val = (val * invsqrt) >> 31; /* Right shift by 31 = (32 - 2 + 1) */
    *rec_inv_sqrt = (uint32_t)val;
}
```

Newton's method converges quickly with good initial guesses but slowly with poor ones. For example, computing $\frac{1}{\sqrt{5}}$ with a starting guess of 0.5 requires 4 iterations to reach 32-bit precision. A lookup table provides better initial guesses:

```c
#define REC_INV_SQRT_CACHE (16)
static const uint32_t inv_sqrt_cache[REC_INV_SQRT_CACHE] = {
    ~0U,        ~0U,        3037000500, 2479700525,
    2147483647, 1920767767, 1753413056, 1623345051,
    1518500250, 1431655765, 1358187914, 1294981364,
    1239850263, 1191209601, 1147878294, 1108955788
};
```

The actual implementation in `rsqrt` uses 2^16 scaling instead of Q0.32 for better performance on RV32I.
**Why $2^{16}$ instead of Q0.32?**

1. **Range**: $2^{16}$ scaling supports $\frac{2^{16}}{\sqrt{x}}$ for all uint32_t values of $x$, giving results up to 65536 (when $x=1$). Q0.32 can only represent values < 1.0.
2. **Performance**: With $2^{16}$ scaling, the intermediate products only need 16-bit shifts, which stay cheap on RV32I even when 64-bit intermediates are split across 32-bit register pairs.
3. **Precision**: 16-bit fractional precision ($2^{16}$ = 65536) provides sufficient accuracy (3-8% error) for most embedded applications while maintaining integer-only arithmetic.

The Q0.32 code example above illustrates the Newton iteration principle but is not the actual implementation.

Example: Fast Distance Calculation

```c
/* Calculate distance using rsqrt
 *
 * Faster than computing sqrt directly:
 *   sqrt(x) = x * (1/sqrt(x))
 */
uint32_t fast_distance_3d(int32_t dx, int32_t dy, int32_t dz)
{
    uint64_t dist_sq =
        (uint64_t)dx * dx + (uint64_t)dy * dy + (uint64_t)dz * dz;
    int scaled = 0;
    if (dist_sq > 0xFFFFFFFF) {
        dist_sq >>= 16; /* fit into 32 bits; remember to rescale below */
        scaled = 1;
    }
    uint32_t inv_dist = fast_rsqrt((uint32_t)dist_sq);
    /* sqrt(x) = x / sqrt(x) = x * (1/sqrt(x)) */
    uint64_t dist = ((uint64_t)dist_sq * inv_dist) >> 16;
    if (scaled)
        dist <<= 8; /* undo the >>16 applied under the square root */
    return (uint32_t)dist;
}
```

C01: ==`1U << i`==
C02: ==`(uint64_t)a << i`==
C03: ==`31 - clz(x)`==
C04: ==`rsqrt_table[exp]`==
C05: ==`(exp < 31) ? rsqrt_table[exp + 1] : 0`==
C06: ==`y - y_next`==
C07: ==`(uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) >> exp)`==
C08: ==`(uint32_t) ((delta * frac) >> 16)`==
C09: ==`(uint32_t)mul32(y, y)`==
C10: ==`(uint32_t)(mul32(x, y2) >> 16)`==
C11: ==`(uint32_t)(mul32(y, (3u << 16) - xy2) >> 17)`==

To validate fast_rsqrt(x), consider the case where x = UINT32_MAX. Is this input valid in terms of expected behavior?

C12: ==Valid==

What should the output be?
C13: ==1==

> For x = UINT32_MAX with our 2^16 scaling implementation:
>
> Input: x = 4294967295 (2^32 - 1)
> Output: y = 1 (0x00000001)
>
> Formula: y = 65536 / sqrt(x) ≈ 65536 / 65536 = 1
>
> To get the actual floating-point value: 1/sqrt(x) = y / 65536 = 1 / 65536 ≈ 0.0000152588

What types of applications are suitable for fast_rsqrt()? Provide at least three examples.

C14: (Credit for any reasonable answers.)
- Real-time graphics (vector normalization, lighting)
- Game physics (distance approximations, collision detection)
- Embedded systems without FPU
- Performance-critical code where 3-8% error is acceptable

### 1. Lookup Table Initial Estimate

The implementation uses a 32-entry table that provides initial approximations based on the input's magnitude:

```c
static const uint32_t rsqrt_table[32] = {
    65536, 46341, 32768, 23170, 16384, /* 2^0  to 2^4  */
    11585,  8192,  5793,  4096,  2896, /* 2^5  to 2^9  */
     2048,  1448,  1024,   724,   512, /* 2^10 to 2^14 */
      362,   256,   181,   128,    90, /* 2^15 to 2^19 */
       64,    45,    32,    23,    16, /* 2^20 to 2^24 */
       11,     8,     6,     4,     3, /* 2^25 to 2^29 */
        2,     1                       /* 2^30, 2^31   */
};
```

**How it works:**
1. Find the position of the most significant bit (MSB) in the input $x$
   - For $x = 1$ (binary: 1), MSB position = 0
   - For $x = 16$ (binary: 10000), MSB position = 4
   - For $x = 1024$ (binary: 10000000000), MSB position = 10
2. Use this position as an index: `rsqrt_table[MSB_position]`
3. Each entry approximates $\frac{65536}{\sqrt{2^{exp}}}$ where $exp$ is the MSB position

**Example:** For $x = 16$:
- MSB position = 4
- Table entry: `rsqrt_table[4] = 16384`
- This approximates $\frac{65536}{\sqrt{16}} = \frac{65536}{4} = 16384$ (exact!)

### 2. Linear Interpolation

For inputs that are not exact powers of 2, the lookup table alone is insufficient. The implementation interpolates between adjacent table entries to improve accuracy.
**Algorithm:**

```c
/* x is between 2^exp and 2^(exp+1) */
y_base = rsqrt_table[exp];     /* Value at 2^exp */
y_next = rsqrt_table[exp + 1]; /* Value at 2^(exp+1) */

/* fraction: how far x is between 2^exp and 2^(exp+1), kept in Q16
 * so integer division does not truncate it to zero */
fraction = ((x - 2^exp) << 16) >> exp;

/* Linear interpolation; >> 16 removes the Q16 scale of fraction */
y = y_base - (((y_base - y_next) * fraction) >> 16);
```

**Example:** For $x = 20$:
- MSB position = 4, so $x$ is between $2^4 = 16$ and $2^5 = 32$
- `y_base = rsqrt_table[4] = 16384` (approximates $65536/\sqrt{16}$)
- `y_next = rsqrt_table[5] = 11585` (approximates $65536/\sqrt{32}$)
- `fraction = (20 - 16) / 16 = 0.25` (in Q16: 0.25 × 65536 = 16384)
- `y = 16384 - (16384 - 11585) * 0.25 = 16384 - 1200 = 15184`
- Actual value: $65536/\sqrt{20} \approx 14653$, error ≈ 3.6%

This interpolation significantly improves accuracy before Newton iterations.

### 3. Newton-Raphson Iteration

For 2^16 scaling (actual implementation in `rsqrt.c`), where $y$ represents $\frac{2^{16}}{\sqrt{x}}$ instead of $\frac{1}{\sqrt{x}}$, the Newton iteration formula becomes:

$$
y_{n+1} = \frac{y_n \cdot \left(3 \cdot 2^{16} - x \cdot y_n^2 / 2^{16}\right)}{2^{17}}
$$

Fixed-point implementation:

```c
y2  = y * y;                         /* Compute y^2 in 2^32 scaling */
xy2 = (x * y2) >> 16;                /* Compute x * y^2 / 2^16 */
y   = (y * ((3<<16) - xy2)) >> 17;   /* Complete Newton step */
```

**Why the shifts?**
- `>> 16`: Compensates for the 2^16 scaling factor in multiplications
- `>> 17`: Removes the 2^16 scale of the `(3<<16) - xy2` factor and divides by 2 for the Newton step (17 = 16 + 1)

Two iterations improve precision from approximately 20% error to 3-8% error.

---

## Problem `D`

The following circuit will be analyzed:

![circuit](https://hackmd.io/_uploads/ryoGoLa0ee.png)

In a small sequential subsystem extracted from a pipelined digital design, two registers, Reg0 and Reg1, feed their outputs into a short combinational logic chain that ultimately drives a third register, RegY. The goal is to determine the appropriate clock period and verify that the system maintains correct timing behavior under given constraints.
The signal flow is as follows:

* The outputs of Reg0 and Reg1 are first combined by an XOR gate.
* The XOR output passes through a NOT gate, followed by an AND gate, producing a signal labeled `y_output`.
* The signal `y_output` is captured by RegY on the *rising edge* of the clock.
* All registers are triggered by the same clock edge and share identical timing characteristics.

You may assume that:

* The clock-to-Q delay of all flip-flops (`Reg0`, `Reg1`, and `RegY`) is 4 ns.
* The setup time required at `RegY` is 3 ns.
* Propagation delays for the combinational elements are:
  * XOR = 10 ns
  * NOT = 5 ns
  * AND = 9 ns

There is no external input timing offset (such as "`x_input` switches 25 ns after the clock"). Instead, the data launched from `Reg0` and `Reg1` propagates through the combinational logic toward `RegY`, following a standard synchronous data transfer.

Determine the minimum clock period (in nanoseconds) that ensures correct data capture at `RegY` without violating setup-time requirements.

__ D01 __ ns

> ==31==
> The critical path is from the launching register's clock-to-Q delay through the entire combinational logic chain to the destination setup requirement.
>
> Calculation:
> `Tmin = clk-to-Q + (XOR + NOT + AND) + setup`
> = `4 + (10 + 5 + 9) + 3 = 31 ns`

Determine the maximum hold time (in nanoseconds) that can be tolerated before data stability at `RegY` is compromised.

__ D02 __ ns

> ==28==
> Since there is only one functional path from the launching register to `RegY`, the shortest and longest combinational delays are identical.
>
> Calculation:
> `Thold_max = clk-to-Q + shortest CL delay`
> = `4 + 24 = 28 ns`

Assume the system operates with a 60 ns clock period, the **first rising edge** occurs at 20 ns, and all registers start with logical zeros (`Reg0=Reg1=0`). Treat this first edge as the launch event: register outputs become valid only 4 ns (clock-to-Q) after it, and only then does the launched data propagate through the combinational logic.

At what time will `y_output` (the *combinational* node feeding RegY) first attain a logic high (`1`)?
__ D03 __ ns > ==48== > Launch registers update their outputs after the clock edge plus clock-to-Q delay: > `20 ns + 4 ns = 24 ns`. > > The combinational propagation delay to `y_output` is `(10 + 5 + 9) = 24 ns`. > > Hence, `y_output` becomes 1 at: > `24 + 24 = 48 ns` Determine the first time the registered output (RegY.Q) becomes logic high (1) __ D04 __ ns Assumptions: * The clock period is 60 ns. * The first rising edge of the clock occurs at 20 ns. * The external input x_input is initialized to logic 0 at time t = 0 ns. * Both Reg0 and Reg1 sample the value of x_input on each rising clock edge. * The signal y_output is generated by the combinational logic (XOR → NOT → AND) and then captured by a destination register RegY. * All flip-flops share the same clock-to-Q delay of 4 ns. > ==`84`== > 1. Initial condition (t = 0 ns): Since x_input = 0, both Reg0 and Reg1 hold 0 before the first clock event. Therefore: Reg0 = 0, Reg1 = 0 XOR(Reg0, Reg1) = 0 NOT(0) = 1 AND(1, 1) = 1 This value does not immediately appear at RegY’s Q output because the register captures data only at the rising clock edge. > 2. At the first rising edge (t = 20 ns): Both Reg0 and Reg1 again sample x_input = 0, so their outputs remain unchanged. Consequently, the XOR output remains 0, the NOT output remains 1, and the AND output remains 1. > 3. After the clock-to-Q delay: The outputs of Reg0 and Reg1 become valid 4 ns after the rising edge, that is, at t = 24 ns. Any new data launched from the registers starts to propagate through the combinational logic from this point onward. > 4. Combinational logic propagation: The total propagation delay through the logic chain is 10 ns (XOR) + 5 ns (NOT) + 9 ns (AND) = 24 ns. Therefore, the internal node y_output (the D input of RegY) reaches logic 1 at 24 ns (register output valid) + 24 ns (logic delay) = 48 ns. > 5. Result – Combinational node timing: The combinational signal y_output feeding RegY first becomes 1 at t = 48 ns. 
> Answer (combinational y_output): 48 ns.
> 6. Registered output timing: RegY captures this high value on the next rising clock edge at 80 ns, and its Q output becomes valid after its own clock-to-Q delay (4 ns). Thus, the registered output of RegY becomes 1 at 80 ns + 4 ns = 84 ns.

Impact of increased setup time on the minimum clock period

__ D05 __ ns

> ==`33`==
> If the setup time at RegY increases from 3 ns to 5 ns, the new minimum clock period must be recalculated.
>
> Computation:
> Tmin = clk-to-Q (launch) + combinational delay + setup time (destination)
> Tmin = 4 + 24 + 5 = 33 ns

Determining the latest possible clock edge that still captures valid data at RegY.

Suppose the minimum clock period requirement (31 ns) has been met. If the system designer allows a maximum of 2 ns clock skew between the launching and capturing registers, where the capture clock arrives later than the launch clock, determine the latest time at which RegY's capturing clock edge can occur while still meeting setup timing.

__ D06 __ ns

Given values:
* clk-to-Q = 4 ns
* Combinational delay = 24 ns
* Setup time = 3 ns
* Positive skew = 2 ns

> ==33==
> The data launched at t = 0 will arrive at the D input of RegY at 4 ns (clk-to-Q) + 24 ns (logic) = 28 ns. RegY must capture valid data before setup time expires, so the latest possible capture edge occurs at 28 ns + 3 ns (setup) = 31 ns nominally. If the capture clock edge is delayed by 2 ns (positive skew), the effective capture time becomes 31 + 2 = 33 ns. Therefore, the latest allowable capturing edge is 33 ns after the launch edge.

Determining the slack for setup timing at a 40 ns clock period.

Assume the design is operated with a 40 ns clock period. Determine the setup slack at RegY.

__ D07 __ ns

Given:
* clk-to-Q = 4 ns
* Logic delay = 24 ns
* Setup time = 3 ns

> ==9==
> Required arrival time = 40 ns (next capture edge) − 3 ns (setup) = 37 ns. Actual arrival time = 4 ns + 24 ns = 28 ns. Setup slack = 37 − 28 = 9 ns.
Determining the slack for hold timing.

Under the same conditions, determine the hold slack at RegY if the hold requirement = 0 ns.

__ D08 __ ns

> ==28==
> Earliest data arrival = 4 ns + 24 ns = 28 ns. Required hold time = 0 ns. Hold slack = 28 − 0 = 28 ns.