# CA2025 Quiz6
> 張暐俊
## Q65
### The Question
Why is JAL easier to pipeline than JALR?
### My Answer
According to page 39 of the Week 2 [RISC-V Instructions](https://docs.google.com/presentation/d/1qNVK8ULddo6luq0Rrjj_bmOShhcx9hlESN36AK91n6I/edit?slide=id.g13aaf244ca4_0_219#slide=id.g13aaf244ca4_0_219) material, two types of Unconditional Jumps are introduced, namely `JAL` and `JALR`. It clearly states the difference in usage between the two: one jumps directly to the corresponding address based on the label, while the other jumps to the address stored in a register.
This difference implies a correlation in the complexity of instruction execution within the pipeline.
Starting from page 25 of the Week 8 [RISC-V Datapath II](https://docs.google.com/presentation/d/1nJknDrB402GuriZuqeYOVpoJv9bvYoZeFRSPKzqjdqw/edit?slide=id.p26#slide=id.p26) slides, we can see the difference in address calculation between the two:
- `JAL`: $PC = PC + Imm$ (unconditional PC-relative jump)
- `JALR`: $PC = R[rs1] + Imm$ (register-indirect jump; the spec also clears the lowest bit of the result)
From these two lines, we can see that the jump target of `JAL` is fixed at encode time, since it depends only on the current $PC$ and the immediate. This means `JAL` never needs to read a register to determine where to jump. `JALR`, on the other hand, must read $R[rs1]$ from the register file. The consequence is that if a preceding instruction is still producing the value of $rs1$, `JALR` encounters a Data Hazard and can easily force a Pipeline Stall. Because the jump target of `JAL` is independent of any register, no such issue arises.
The following are two images also explained in the Week 8 [RISC-V Datapath II](https://docs.google.com/presentation/d/1nJknDrB402GuriZuqeYOVpoJv9bvYoZeFRSPKzqjdqw/edit?slide=id.p26#slide=id.p26) slides:

In the `JAL` datapath, the target-address path completely bypasses the Register File (the Register File is only written with the return address $PC+4$).

In the `JALR` diagram, we can see a solid green line connecting $R[rs1]$ to the ALU, showing its data dependency on a register value. This extra path is the root cause of why `JALR` is harder to handle in the pipeline and more prone to causing stalls.
### Verification
To investigate the CPU Pipeline's handling mechanisms for different jump instructions, I designed a dedicated assembly test program `hazard_test.S`, comprising two comparison groups:
First is Case A: JAL (Jump And Link)
```asm
_start:
    # ----------------------------------------------------------------
    # Case A: JAL (Jump And Link) - Should be pipeline-friendly
    # ----------------------------------------------------------------
    nop
    nop
    jal x1, label_jal_target    # Direct jump to label
    nop                         # Padding (RISC-V has no delay slots)
    nop

label_jal_target:
    nop
    nop
```
This section tests the standard direct jump instruction, `JAL`. Its purpose is to verify that, since the target address of `JAL` can be calculated as early as the Decode stage and does not rely on register contents, it does not produce a Pipeline Stall.
Next is the control group, Case B: JALR (Jump And Link Register) with a Load-Use Hazard:
```asm
    # ----------------------------------------------------------------
    # Case B: JALR (Jump And Link Register) with Data Hazard
    # ----------------------------------------------------------------
    # 1. Set up a register with the target address
    la x2, label_jalr_target
    # 2. Store it to memory (to simulate loading it from memory later)
    la x3, var_target_addr
    sw x2, 0(x3)
    nop
    nop
    nop
    # 3. Trigger the hazard: load, then immediately JALR
    lw x4, 0(x3)                # Load target address into x4
    jalr x5, x4, 0              # JALR uses x4 immediately.
    # EXPECTATION: stall cycles here, waiting for x4 from the MEM stage.
    nop
    nop
    jal x0, end                 # Jump to the final infinite loop
label_jalr_target:
    nop
    nop
    jal x0, end

end:
    jal x0, end                 # Infinite loop to end

.data
var_target_addr:
    .word 0
```
This code tests the indirect jump instruction while intentionally creating a Data Dependency. The approach is to first load the target address into a register, then immediately execute the `JALR` that depends on that register. Since `JALR` must wait for `lw` to bring the data back from the Memory stage, the pipeline has to insert a stall until the data is ready; this is how the hardware resolves the load-use Data Hazard.
Next, execute the test. We use the fully implemented CPU in the `soc` directory for simulation.
First, compile the assembly test program:
```bash
make hazard_test.asmbin
```
Then simulate:
```bash
make sim BINARY=src/main/resources/hazard_test.asmbin
```
However, during simulation, we encountered an issue where the simulator failed to generate a waveform file. Upon inspection, we discovered that `sim.cpp` in the `soc` directory lacked VCD output logic. Therefore, I first included `<verilated_vcd_c.h>` in `sim.cpp` to implement the Dump function, successfully generating `trace.vcd`.
A second problem then surfaced: because no cycle limit was set, the simulation ran for 500 million cycles, inflating the VCD file to 5.4 GB and making it impossible to open. `sim.cpp` was therefore modified again to cap the simulation at 20,000 cycles.
After the fix, we successfully generated a lightweight `trace.vcd` and analyzed it using GTKWave.
To analyze the Load-Use Hazard, we monitored the following three key signals in GTKWave:
* `io_instruction_address_if` (PC): Observe whether the program execution flow stalls.
* `io_pc_stall`: Observe whether the pipeline asserts a stall signal.
* `io_output_instruction`: Observe whether the pipeline inserts a bubble (NOP).
The resulting waveform is shown below:

From this waveform, we can capture extremely clear Hazard behavior. First, `io_instruction_address_if` remains at address `00001044` for several cycles, which is exactly the address of the `JALR` instruction. Second, corresponding to the PC stall period, the `io_pc_stall` signal goes high (`1`), confirming the Pipeline enters a paused state. Finally, and most critically, `io_output_instruction` becomes `00000013` (NOP). This is because `JALR` must wait for the preceding `lw` instruction to return data from the Memory stage, forcing the hardware to insert a NOP to delay execution.
This experiment verifies that `JALR`, due to Data Dependency, is indeed more prone to triggering Stalls in Pipeline design compared to `JAL`. This confirms it is relatively more complex, indirectly validating the conclusion of the first question.
### PIC/PIE
Additionally, the Week 6 [Compiling, Assembling, Linking, and Loading](https://docs.google.com/presentation/d/1uAURy-tL-K9oMtUP1gY9034gsOrxbhhDL9MnoTm1NKM/edit?usp=sharing) slides introduce the concept of Position-Independent Code (PIC).
Page 17 indicates that PC-relative jump instructions such as `beq`, `bne`, and `jal` are classified as PIC because they calculate the target address using an offset relative to the current PC, without requiring knowledge of the program's absolute load address.
In the linker section, page 22 further clarifies that PC-relative addressing, such as `jal` and `auipc/addi`, never requires relocation, as the relative offsets remain unchanged when the program is moved to different memory addresses. In contrast, instructions using absolute addresses like `lui/addi` for accessing static data must be relocated during loading. Page 23 then details which instructions require relocation. The first mentioned is J-Format's `JAL`, which only needs relocation editing when jumping to external functions; internal jumps require no processing at all. More importantly, B-Type instructions, due to their PC-relative addressing nature, require no editing even when the code is relocated.
This demonstrates that `JAL`'s PC-relative design not only simplifies pipeline implementation but also provides inherent hardware support for Position-Independent Code, reducing the linker's burden. While `JALR` can also achieve PC-relative jumps through the `auipc` + `jalr` combination, this two-instruction sequence is more complex than a single `JAL` instruction, further validating the conclusion of Q65.
## Q64
### The Question
Why is fixed-point preferred on small RISC-V cores?
### My Answer
For this question, I use [Fixed-point Arithmetic](https://hackmd.io/@maromaSamsa/HkjefPbFs) from Week 4 to support my argument.
First, "preferred on small RISC-V cores" implies that fixed-point can save certain hardware costs and bring corresponding hardware efficiency. Relevant discussions can be found in the [Fixed point](https://hackmd.io/@maromaSamsa/HkjefPbFs#Fixed-point) section of the development documentation:
> In essence, fixed-point numbers are an integer data structure, which means that they can just perform fast calculations by using general ALU, without using floating-point operators.
This passage proves that fixed-point can perform fast calculations directly using the processor's general ALU without requiring additional floating-point hardware support.
Additionally, the documentation mentions experimental results on `rv32emu`. Next, I will attempt to reproduce them to prove that fixed-point significantly reduces Cycles and time compared to floating-point, making it more suitable for small RISC-V cores.
### Verification
To verify the argument that Fixed-point arithmetic provides hardware efficiency, I referred to the experiment [Compare with floating point in rv32emu](https://hackmd.io/@maromaSamsa/HkjefPbFs#Compare-with-floating-point-in-rv32emu) from the [Fixed-point Arithmetic](https://hackmd.io/@maromaSamsa/HkjefPbFs) documentation. I conducted a reproduction experiment to compare the performance differences between fixed-point and floating-point arithmetic on `rv32emu`.
Based on the documentation, I created a Micro-benchmark program implementing the core Fixed-Point functions for addition, multiplication, and division:
```c
#include <stdint.h>

#define Q 11                     /* fractional bits (Q20.11 format) */
typedef int32_t q_fmt;           /* stored fixed-point value        */
typedef int64_t q_buf;           /* wide buffer for intermediates   */
#define QFMT_MAX INT32_MAX
#define QFMT_MIN INT32_MIN

static inline q_fmt q_add(q_fmt a, q_fmt b) {
    q_buf tmp = (q_buf)a + (q_buf)b;
    if (tmp > (q_buf)QFMT_MAX) return (q_fmt)QFMT_MAX;   /* saturate high */
    if (tmp < (q_buf)QFMT_MIN) return (q_fmt)QFMT_MIN;   /* saturate low  */
    return (q_fmt)tmp;
}

static inline q_fmt q_mul(q_fmt a, q_fmt b) {
    q_buf tmp = (q_buf)a * (q_buf)b;
    tmp += (q_buf)1 << (Q - 1);                          /* rounding */
    tmp >>= Q;                                           /* rescale  */
    if (tmp > (q_buf)QFMT_MAX) return (q_fmt)QFMT_MAX;
    if (tmp < (q_buf)QFMT_MIN) return (q_fmt)QFMT_MIN;
    return (q_fmt)tmp;
}

static inline q_fmt q_div(q_fmt a, q_fmt b) {
    q_buf tmp = (q_buf)a << Q;                           /* pre-scale dividend */
    /* bias by half the divisor so the truncating division rounds to nearest */
    if ((tmp >= 0 && b >= 0) || (tmp < 0 && b < 0)) tmp += (b >> 1);
    else tmp -= (b >> 1);
    return (q_fmt)(tmp / b);
}
```
To precisely measure the Cycle Count consumed by the code execution, the test program uses inline assembly to directly read RISC-V CSRs such as `cycle` and `time`, as suggested by the documentation. The following example demonstrates reading `rdcycle`:
```c
static inline uint32_t read_cycles() {
    uint32_t cycle;
    asm volatile ("rdcycle %0" : "=r"(cycle));
    return cycle;
}
```
In this experiment, I performed 1000 loop iterations for both standard `float` and fixed-point (Q20.11) operations. Taking addition as an example (the test expresses subtraction as adding a negated operand, which exercises the same add path), the test logic is as follows:
```c
// Float Test
start_cycle = read_cycles();
for (int i = 0; i < 1000; ++i) {
    res_f = a_f - b_f;
}
end_cycle = read_cycles();

// Fixed-Point Test
start_cycle = read_cycles();
for (int i = 0; i < 1000; ++i) {
    // q_add handles fixed-point addition
    res_q = q_add(a_q, -b_q);
}
end_cycle = read_cycles();
```
The compilation was done using `riscv-none-elf-gcc` with the `-O0` flag to observe the raw computational overhead:
```shell
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 -o compare.elf test.c
```
Finally, the program was executed on `rv32emu` to record the data. The experimental results (total Cycle Count for 1000 operations) are shown in the table below:
| Operation | Float (Cycles) | Fixed (Cycles) | Improvement |
| :-------- | -------------: | -------------: | :---------------: |
| **Add** | 75,047 | 55,047 | **~26.6% Faster** |
| **Mul** | 428,019 | 358,019 | **~16.3% Faster** |
| **Div** | 866,019 | 914,019 | *~5.5% Slower* |
From the reproduction results, clearly, Fixed-Point provides a performance improvement over floating-point in the most basic and frequently used addition and multiplication operations, with improvements of approximately 26% and 16% respectively.
This demonstrates that on Small RISC-V Cores lacking a hardware Floating-Point Unit (FPU), bypassing the complex software floating-point emulation layer and directly utilizing the general-purpose Integer ALU for Fixed-Point operations significantly reduces computational costs and enhances execution efficiency.
Although the division operation is slightly slower in this test, this is because the dividend must be left-shifted by Q bits to preserve precision, potentially overflowing the 32-bit range. Consequently, a 64-bit type is required to temporarily store the intermediate result, incurring additional computational overhead. However, considering that addition and multiplication are generally much more frequent than division in embedded applications, Fixed-point still offers better overall performance.
## Q63
### The Question
Why is ra caller-saved in RISC-V?
### My Answer
First, we must understand what a caller and a callee are.
Page 20 of the Week 3 [RISC-V Procedures](https://docs.google.com/presentation/d/18F1vHY5Mv5E6WU6BvktXzkVbZrTZ3jQUbkWOjEPGE4I/edit?slide=id.p21#slide=id.p21) slides explains the reason for their naming:
- Calle**R**: the calling function
- Calle**E**: the function being called
And it explains the Register Conventions:
> A set of generally accepted rules as to which registers will be unchanged after a procedure call (`jal`) and which may be changed.
From these statements, we can deduce two things. First, as the names suggest, the caller is the one calling, and the callee is the one being called. Second, and most importantly, when a Caller calls a Callee, the hardware executes `JAL`. The function of `JAL` is to jump to the target function while forcing the "address of the next instruction" (PC+4) to be written into the `ra` register.
Because the action of calling a function itself overwrites the original value of `ra`, the Caller cannot rely on `ra` remaining unchanged before and after the call. That is to say, since the old value of `ra` will be destroyed, if the Caller needs to use the original `ra` after the function call ends, the Caller must back up the value of `ra` itself before calling others.
Therefore, based on the discussion on page 21 of the [RISC-V Procedures](https://docs.google.com/presentation/d/18F1vHY5Mv5E6WU6BvktXzkVbZrTZ3jQUbkWOjEPGE4I/edit?slide=id.p21#slide=id.p21) slides:
> 2. Not preserved across function call
> - Caller cannot rely on values being unchanged
> - Argument/return registers a0-a7, ra, “temporary registers” t0-t6
We can conclude that the reason `ra` is Caller-saved (Not preserved) is that the function call mechanism itself overwrites it. Therefore, to save the cost of load and store operations, the RISC-V Register Convention chooses to categorize it under "Caller cannot rely on values being unchanged".
### Verification
To verify the behavior of the Caller-saved `ra` register and the consequences of failing to save it, I designed a nested function call test case `ra_test.S`.
```asm
_start:
    # 1. Main calls Function A
    jal x1, func_A      # x1 (ra) becomes the address of the next instruction (0x100C)
    nop
    j end

func_A:
    nop
    nop
    # 2. Function A calls Function B (nested call)
    # CRITICAL: we intentionally DO NOT save x1 (ra) to the stack here.
    jal x1, func_B      # x1 is OVERWRITTEN by this call (becomes 0x1024)
    nop
    # 3. Return from Function A
    # Since x1 was clobbered by the call to func_B, this "ret" jumps to
    # 0x1024 (an infinite loop inside func_A), never back to _start.
    jalr x0, x1, 0

func_B:
    nop
    nop
    # 4. Return from Function B
    jalr x0, x1, 0

end:
    j end               # Never reached in this experiment
```

The experiment was conducted using the simulator with `trace.vcd` generation enabled. We observed the program counter (PC) and the `ra` register (`x1`) in GTKWave.
**Waveform Analysis:**
1. The program executes `jal func_A`, and `ra` is correctly set to the return address in `_start` (`0x100C`).
2. Upon entering `func_A` and calling `func_B`, the `jal func_B` instruction at `0x1020` executes. This **overwrites** `ra` with `0x1024` (the instruction following the call in `func_A`). The original return address to `_start` (`0x100C`) is permanently lost.
3. When `func_B` returns, it jumps to `0x1024`. Then, when `func_A` attempts to return using `ret` (`jalr x0, x1, 0`) at `0x1028`, it reads the current value of `ra`, which is still `0x1024`.
4. The CPU enters an infinite loop between `0x1024` (NOP) and `0x1028` (RET), verifying that the return address was indeed destroyed by the nested call.
This confirms that `ra` is not preserved across function calls (Caller-saved); placing the responsibility of saving `ra` on the caller (in practice, any non-leaf function saves `ra` in its own prologue) is an architectural necessity in RISC-V.