
# CA2025 Quiz6

> 吳承晉

> I draft all the content myself first, then use Gemini to polish the English for better fluency.

## Q43: What’s the hazard if two instructions write the same register?

### Answer

> I use [Gemini](https://gemini.google.com/?hl=zh-TW) to rewrite my wording in English and to correct my understanding where it is wrong. All ideas and the final analysis are my own.

When two instructions write to the same register, the result is a Write-After-Write (WAW) hazard, a type of data hazard also known as an output dependence.

In a simple pipeline where every instruction writes its result back in the same pipeline stage (in-order writeback), WAW hazards do not occur. However, issues arise in complex pipelines with variable execution latencies. For example, a `LOAD` instruction or a complex arithmetic operation (like `DIV` or `MUL`) might take many cycles to complete. If a subsequent instruction (like a simple `ADD`) writes to the same register while the slower instruction is still in flight, the faster instruction might write its result first. Later, when the slow instruction finally finishes, it overwrites the "newer" value with its "older" result, leaving the register with incorrect data.

> The lecture material [L16: Processor Pipelining](https://www.youtube.com/watch?v=TMpjvAvQCWA) covers pipeline hazards, but it mentions this case only briefly at [24:20](https://youtu.be/TMpjvAvQCWA?t=1470).

> I could not find lecture material that discusses WAW; I only found RAW. So I searched further and found a [website](https://www.icsa.inf.ed.ac.uk/cgi-bin/hase/dlx-scb.pl?depend-t.html,depend-f.html,menu.html) that describes it and supports my answer.

> *A Write-After-Write hazard occurs when an instruction tries to write its result to the same register as a previously issued, but as yet uncompleted instruction. In the WAW example shown in the figure, both instructions write their results to R6.
Although this latter example is unlikely to arise in normal programming practice, it must nevertheless give the correct result. Without proper interlocks the add operation would complete first and the result in R6 would then be overwritten by that of the multiplication.*

The other source of WAW hazards is **Out-of-Order** execution. If the processor does not handle them properly, later instructions can read incorrect data because the register ends up holding a stale value.

>[!Note] Out-of-Order processors vs. In-Order processors
> **Out-of-Order**
> *The key concept of out-of-order processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable.*
>
> **In-Order**
> *Often, an in-order processor has a bit vector recording which registers will be written to by a pipeline. If any input operands have the corresponding bit set in this vector, the instruction stalls.*
>> [reference](https://en.wikipedia.org/wiki/Out-of-order_execution)

### How to validate this with experiments?

#### WAW in MyCPU

In a system without hazard protection, if we execute the instructions below, the final value of `t0` should be the result of the `add`. However, due to a WAW hazard, the `lw` (long latency) might write back after the `add` (short latency), causing `t0` to hold the result of the `lw` instead.

```asm
lw  t0, 0(t2)    # Long latency operation (writes to t0)
add t0, t1, t2   # Short latency operation (writes to t0)
```

When executing this sequence on MyCPU, we expect to see automatic hazard handling, such as the pipeline stalling to wait for data. Since MyCPU is an in-order processor, it does not suffer from the WAW hazards typical of Out-of-Order (OoO) execution, as instructions are committed in sequence. We will now build a small test case to visualize this behavior and check whether MyCPU actively handles the potential hazard or needs no special handling for it:

1. Insert a WAW test case at the beginning of `4-soc/csrc/init.S` and create a dummy main function. We modify `init.S` directly instead of creating a standalone test file for two reasons:
   1. It is easier to locate and debug this tiny test case than to search through a large standard test suite.
   2. This simple insertion does not affect the original logic of `init.S`.

`4-soc/csrc/init.S`

```diff=10
 .section .text.init
 .globl _start
 _start:
+    li t0, 0x3
+    sw t0, 4(x0)
+
+    # WAW
+    lw t1, 4(x0)
+    li t1, 0x12
+
+    sw t1, 4(x0)
     # Initialize global pointer for linker relaxation
     .option push
     .option norelax
     la gp, __global_pointer$
     .option pop
```

Create the dummy code `wawhazard.S` in `4-soc/csrc/`:

```asm
.globl main
main:
    jr ra
```

2. Add the compilation rule to `4-soc/csrc/Makefile` and compile.

```diff=41
+wawhazard.asmbin: wawhazard.S init.o link.lds
+	$(AS) -R $(ASFLAGS) -o wawhazard.o wawhazard.S
+	$(CROSS_COMPILE)ld -o wawhazard.elf -T link.lds $(LDFLAGS) wawhazard.o init.o
+	$(OBJCOPY) -O binary -j .text -j .data wawhazard.elf $@
 init.o: init.S
 	$(AS) -R $(ASFLAGS) -o $@ $<
```

Run the update:

```bash
# in 4-soc/csrc
$ make update
```

3. Enable waveform tracing in `4-soc/verilog/verilator/sim.cpp`. Add the following code to capture the simulation output (this assumes `sim.cpp` already includes `<memory>` and `<verilated_vcd_c.h>`, which declares `VerilatedVcdC`):

```diff=360
     try {
         mem.load(binary);
         std::cout << "Loaded: " << binary << "\n";
     } catch (const std::exception &e) {
         std::cerr << e.what() << "\n";
         return 1;
     }
+    Verilated::traceEverOn(true);
+    auto tfp = std::make_unique<VerilatedVcdC>();
+    top->trace(tfp.get(), 99);
+    tfp->open("trace.vcd");
```

```diff=437
         top->io_instruction = inst;
         top->clock = !top->clock;
+        tfp->dump(cycle);
```

```diff=642
+    tfp->close();
     return 0;
```

4. Add the `--trace` and `--headless` flags in `4-soc/makefile`:

```diff=21
-	cd verilog/verilator && verilator --exe --cc sim.cpp Top.v \
+	cd verilog/verilator && verilator --exe --trace --cc sim.cpp Top.v \
 		-CFLAGS "$$(sdl2-config --cflags)" \
 		-LDFLAGS "$$(sdl2-config --libs)" && \
 	make -C obj_dir -f VTop.mk
```

```diff=37
-	cd verilog/verilator/obj_dir && ./VTop -i ../../../$(BINARY)
+	cd verilog/verilator/obj_dir && ./VTop -i ../../../$(BINARY) --headless
```

5. Use Surfer to view `trace.vcd` (the simulator runs inside `obj_dir`, so the trace is written there):

```bash
# in 4-soc
surfer ./verilog/verilator/obj_dir/trace.vcd
```

To confirm that we are looking at the right place, we first inspect the `io_rom_instruction` signal in the `inst_fetch` module, which carries the raw instruction fetched from the ROM. We can verify it by comparing instruction encodings. For example, [`li t0, 0x3` corresponds to the hexadecimal `0x00300293`](https://luplab.gitlab.io/rvcodecjs/#q=addi+t0,+x0,+3&abi=false&isa=AUTO), as shown by the [online RISC-V instruction encoder/decoder](https://luplab.gitlab.io/rvcodecjs/) introduced in the week 5 lecture material.

![image](https://hackmd.io/_uploads/BkVSZiuNbl.png)

The first five ROM instructions match the following WAW code:

```asm
addi t0, x0, 3      #0x00300293
sw   t0, 4(x0)      #0x00502223
lw   t1, 4(x0)      #0x00402303
addi t1, x0, 0x12   #0x01200313
sw   t1, 4(x0)      #0x00602223
```

To observe how the WAW hazard is handled, we monitor the following signals:

- `io_stall_flag_ctrl` in the `inst_fetch` module.
- `registers_4` (mapping to `t0`) and `registers_5` (mapping to `t1`) in the `regs` module.
- The `io_memory_read_data` register in the `wb` module.
- The `io_memory_read_enable` and `io_memory_write_enable` wires in the `mem` module.

>[!Note] Why not `registers_5` and `registers_6`?
> Since MyCPU only implements 31 physical registers (keeping `x0` constant at zero), it maps the architectural registers `x1`-`x31` to indices `0`-`30` in physical storage, so `t0` (`x5`) lands at index 4 and `t1` (`x6`) at index 5.
This is shown in the source code below:
> ```scala
> // Allocate only 31 physical registers (x1-x31), x0 is constant zero
> // This saves 3% of register file resources (992 vs 1024 flip-flops)
> val registers = Reg(Vec(Parameters.PhysicalRegisters - 1, UInt(Parameters.DataWidth)))
>
> when(!reset.asBool) {
>   when(io.write_enable && io.write_address =/= 0.U) {
>     // Map x1-x31 to indices 0-30 in physical storage
>     registers(io.write_address - 1.U) := io.write_data
>   }
> }
> ```

---

![image](https://hackmd.io/_uploads/HkfB8jOEbx.png)

We can observe that the first stall occurs at `8ps`, caused by the store instruction. A second stall occurs at `22ps`, caused by the load instruction, which is confirmed by the `io_memory_read_enable` signal.

---

![image](https://hackmd.io/_uploads/rkzCPs_NZx.png)

The potential WAW hazard appears at `36ps`, where two instructions (`lw` and `addi`) both intend to write into `registers_5` (which maps to `t1`). We can see that the pipeline stalls until the load has written its data into the register, and only then writes the `addi` result. The WAW hazard is therefore resolved: the writes reach the register file in program order because the pipeline waits for each instruction to be fully retired.

>[!Note] Retired Instruction
> Instruction Retire (also known as Commit) is the final stage of an instruction's lifecycle, where its execution is finalized.
> It signifies three key things:
> - Permanent Update: The instruction writes its results to the architectural state (registers or memory), making them visible to the software.
> - Point of No Return: The instruction is no longer speculative; it cannot be flushed or undone.
> - Exception Free: The instruction has completed without triggering any traps or exceptions.
>> reference: Gemini and [Wikipedia](https://en.wikipedia.org/wiki/Out-of-order_execution)

#### WAW in Ripes

Educational simulators like Ripes often abstract away variable latencies, treating most instructions as taking only 1 cycle.
Therefore, we typically cannot observe WAW hazards in Ripes.

We ran the tiny WAW test case in Ripes and found that it does not stall. The five instructions took a total of 9 cycles to retire.

![image](https://hackmd.io/_uploads/rJIvbnuEWx.png)

## Q44: Why is hazard detection logic essential in decode?

Because decode is the earliest point at which the processor knows what the instruction intends to do and which registers it needs to read or write.

> This is mentioned in the lecture material [video note](https://computationstructures.org/lectures/pbeta/pbeta.html#5) from Week 10 (Nov 11):

> *The bypass MUXes are controlled by logic that’s matching the number of the source register to the number of the destination registers in the ALU, MEM, and WB stages, with the usual complications of dealing with R31.*

In the MyCPU project, this detection logic is what makes it possible to implement full bypassing and to deal with RAW hazards. In MyCPU's `fivestage_stall` implementation, we observe that without forwarding, the pipeline must stall to resolve RAW hazards:

> `3-pipeline/src/main/scala/riscv/core/fivestage_stall/Control.scala`

```scala
.elsewhen(
  // =========================== Data Hazard (RAW) ===========================
  // Conservative stalling: ANY register dependency causes a stall
  // No forwarding capability, so must wait for register write to complete
  // Check EX stage dependency (1-cycle old instruction):
  (io.reg_write_enable_ex &&                                  // EX stage will write a register
    (io.rd_ex === io.rs1_id || io.rd_ex === io.rs2_id) &&     // Destination matches ID source
    io.rd_ex =/= 0.U)                                         // Not writing to x0 (always zero)
    ||
    // Check MEM stage dependency (2-cycle old instruction):
    (io.reg_write_enable_mem &&                               // MEM stage will write a register
      (io.rd_mem === io.rs1_id || io.rd_mem === io.rs2_id) && // Destination matches ID source
      io.rd_mem =/= 0.U)                                      // Not writing to x0
) {
  // Stall action: Insert bubble (NOP) and freeze earlier stages
  io.id_flush := true.B // Insert NOP into ID/EX register (bubble)
  io.pc_stall := true.B // Freeze PC (don't fetch new instruction)
  io.if_stall := true.B // Freeze IF/ID register (hold current instruction)
  // Result: ID stage instruction waits until dependency resolved
}
```

The logic compares the source register numbers in the Decode stage with the destination register numbers in the Execute and Memory stages to determine whether a RAW hazard exists. Building on the same detection, full bypassing can be implemented by forwarding the values, which lets us remove most of the stall logic:

> `3-pipeline/src/main/scala/riscv/core/fivestage_forward/Control.scala`

```scala
.elsewhen(
  // =========================== Load-Use Hazard ===========================
  // Special case: Load instruction followed immediately by dependent instruction
  // Forwarding CANNOT resolve this because load data isn't available until MEM stage
  //
  // Detection conditions (ALL must be true):
  io.memory_read_enable_ex &&                          // 1. EX stage has load instruction
    io.rd_ex =/= 0.U &&                                // 2. Load destination is not x0
    (io.rd_ex === io.rs1_id || io.rd_ex === io.rs2_id) // 3. ID stage uses load destination
  //
  // Example triggering this hazard:
  //   LW  x1, 0(x2)   [EX stage: memory_read_enable_ex=1, rd_ex=x1]
  //   ADD x3, x1, x4  [ID stage: rs1_id=x1] → STALL required
  //
  // Timeline without stall (WRONG):
  //   Cycle N:   LW in EX (initiates memory read)
  //   Cycle N+1: LW in MEM (data arrives), ADD in EX (needs x1 - NOT READY!)
  //
  // Timeline with stall (CORRECT):
  //   Cycle N:   LW in EX
  //   Cycle N+1: LW in MEM, bubble in EX (ADD stalled in ID)
  //   Cycle N+2: LW in WB, ADD in EX (x1 forwarded from MEM/WB)
) {
  // Insert one bubble to delay dependent instruction by 1 cycle
  io.id_flush := true.B // Insert NOP/bubble into ID/EX register
  io.pc_stall := true.B // Freeze PC (hold next instruction fetch)
  io.if_stall := true.B // Freeze IF/ID (hold current instruction)
  // After stall: forwarding unit will forward load result from MEM/WB stage
}
```

We can see that stalling is now only necessary for **Load-Use** hazards, as other RAW hazards are resolved via forwarding.

### How to validate this with experiments?

Since MyCPU provides both a stalling pipeline (`ImplementationType.FiveStageStall`) and a pipeline with bypassing logic (`ImplementationType.FiveStageForward`) to solve RAW hazards, we can validate the impact of hazard detection logic by executing the same code on both and comparing the execution time.

1. To experiment with different implementations of MyCPU in `3-pipeline`, we need to modify `3-pipeline/src/main/scala/board/verilator/Top.scala`:

```diff=16
- val cpu = Module(new CPU(implementation = ImplementationType.ThreeStage))
+ val cpu = Module(new CPU(implementation = ImplementationType.FiveStageStall))
```

or

```diff=16
- val cpu = Module(new CPU(implementation = ImplementationType.ThreeStage))
+ val cpu = Module(new CPU(implementation = ImplementationType.FiveStageForward))
```

2. Run the [HW3 homework testcase](https://hackmd.io/pabGYUumRvSXP0VDkefwug?view#Modify-the-handwritten-RISC-V-assembly-code-in-Homework2-to-ensure-it-functions-correctly-on-the-pipelined-RISC-V-CPU) on both implementations via simulation:

```bash
make sim SIM_ARGS="-instruction ./src/main/resources/uf8.asmbin"
```

3. Use Surfer to view the waveform and identify when the code enters the endless loop (e.g., pc=`0x109c`, `0x10a0`, `0x10a4`, `0x109c`, `0x10a0`...):

```bash
# in 3-pipeline
$ surfer ./trace.vcd
```

We observe that `FiveStageStall` enters the endless loop at `1074 ps`, whereas `FiveStageForward` enters it at `798 ps`, about 26% earlier. This significant performance difference validates that hazard detection logic (which is what enables forwarding) is essential for efficient execution.

- Stall

  ![image](https://hackmd.io/_uploads/S1kOy1K4bl.png)

- Forward

  ![image](https://hackmd.io/_uploads/rJdex1YV-l.png)
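
The gap between the two control units can also be estimated without running the full simulation. Below is a minimal sketch in Python (used instead of Chisel so it runs standalone) that applies the two stall conditions quoted from the `Control.scala` files to a short instruction sequence and counts the cycles. The names `Instr`, `ex_cycles`, `total_cycles`, and `PROGRAM` are hypothetical and not part of MyCPU. The model assumes an ideal five-stage pipeline with one cycle per stage and no branches, that only `lw` counts as a load, and that a value written in WB is readable by ID in the same cycle (neither stall condition checks the WB stage).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str       # mnemonic; only "lw" is treated as a load in this model
    rd: int = 0   # destination register number (0 = x0, i.e. no write)
    rs1: int = 0  # source register numbers (0 = x0 / unused)
    rs2: int = 0


def ex_cycles(program, forwarding):
    """Return the cycle (0-based) in which each instruction occupies EX."""
    ex = []
    for i, ins in enumerate(program):
        # In-order issue: at best one cycle behind the previous instruction.
        # The first instruction is in IF at cycle 0, ID at 1, EX at 2.
        earliest = ex[-1] + 1 if ex else 2
        for j, prod in enumerate(program[:i]):
            if prod.rd == 0 or prod.rd not in (ins.rs1, ins.rs2):
                continue  # no RAW dependency on this earlier instruction
            if forwarding:
                # Stall only while a load producer is in EX (load-use hazard).
                if prod.op == "lw":
                    earliest = max(earliest, ex[j] + 2)
            else:
                # Stall while the producer is in EX or MEM.
                earliest = max(earliest, ex[j] + 3)
        ex.append(earliest)
    return ex


def total_cycles(program, forwarding):
    """Cycles until the last instruction leaves WB (EX -> MEM -> WB)."""
    return ex_cycles(program, forwarding)[-1] + 3


PROGRAM = [
    Instr("lw", rd=5, rs1=6),          # lw  x5, 0(x6)
    Instr("add", rd=7, rs1=5, rs2=5),  # add x7, x5, x5  (load-use)
    Instr("add", rd=8, rs1=7, rs2=5),  # add x8, x7, x5
    Instr("sw", rs1=6, rs2=8),         # sw  x8, 0(x6)
]

print(total_cycles(PROGRAM, forwarding=False))  # 14
print(total_cycles(PROGRAM, forwarding=True))   # 9
```

With no hazards these four instructions would need 8 cycles. The forwarding policy adds a single load-use bubble, while the stall-only policy adds two bubbles for each of the three dependencies, which is the same trend the waveform comparison shows.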