# Rewrite Lab3 as 3-stage pipeline RISC-V processor with branch predictor

contributed by <[`shhung`](https://github.com/shhung/ca2023-lab3/tree/pipeline)><[`SUE3K`](https://github.com/SUE3K)>

The details can be found on the [pipeline](https://github.com/shhung/ca2023-lab3/tree/pipeline) branch.

## Introduction

In this project, we refactored the single-cycle CPU from [Assignment 3](https://hackmd.io/@shhung/rkSkyLDmT) into a 3-stage pipeline CPU. The original single-cycle CPU segmented the data path into five stages: Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write-Back. Notably, the Instruction Decode stage included fetching the necessary data from registers. Drawing inspiration from srv32's 3-stage pipeline architecture, we reorganized our CPU into the following 3 stages:

- Stage 1 comprises Instruction Fetch and Instruction Decode
- Stage 2 involves Execution
- Stage 3 encompasses Memory Access and Write-Back

Register reading is now incorporated into Stage 2.

For the CPU to operate in a pipelined manner, temporary registers must be introduced between the stages. These stage registers store the results computed by the previous stage together with the data required by the subsequent stages. Managing these stage registers efficiently and ensuring that values propagate correctly is critical to the functionality of the pipeline.

## Implementation

### 3-stage pipeline CPU architecture diagram

![pipeline_forward](https://hackmd.io/_uploads/HkyOHZmOp.png)

- **IF** stands for Instruction Fetch
- **ID** stands for Instruction Decode
- **REG** stands for Register file
- **EXE** stands for Execution
- **MEM** stands for Memory access

### Description

Before diving into coding, it is crucial to plan the architecture and identify the issues the pipeline needs to solve. The final architecture diagram is shown above. The main modules have already been proven in the single-cycle CPU, so we can reuse them.
The key additions are the stage registers between stages and the forwarding circuitry, highlighted in red in the diagram.

### Issues

#### Data hazard

Data hazards are the first issue we need to solve. On a single-issue, in-order processor, only RAW (read-after-write) hazards are possible. To address them, we adopt forwarding instead of inserting stalls, which would reduce performance. The forwarded data comes either from memory (for loads) or from the stage register holding the ALU result. In the diagram above, the two red lines mark the forwarding paths into the ALU.

#### Control hazard

To mitigate control hazards, we use static branch prediction instead of delayed branches. Our processor always predicts that a branch is not taken. As analyzed in the lecture material for **srv32**, our design also incurs a two-cycle penalty for taken branches.

### Feature

- Forwarding
- Static branch prediction

### Pipeline register

The pipeline registers store the required data and control signals and propagate them from one stage to the next. The most crucial signal is `stall`, which allows an instruction to be flushed when a branch is taken.
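The cost of the taken-branch penalty mentioned under control hazards can be estimated with a rough cycle-count model. The sketch below is plain Scala, not part of the CPU sources; the object name `BranchPenaltyModel` and its functions are hypothetical. It assumes an ideal 3-stage pipeline in which forwarding removes all data stalls and every taken branch or jump costs exactly two squashed instructions.

```scala
// Back-of-the-envelope model (assumption: no stalls other than the
// two squashed slots after each taken branch or jump).
object BranchPenaltyModel {
  val Stages = 3        // IF/ID, EXE, MEM/WB
  val TakenPenalty = 2  // squashed instructions per taken branch (assumed)

  // Total cycles to retire `retired` instructions, `taken` of which
  // are taken branches or jumps.
  def cycles(retired: Long, taken: Long): Long = {
    require(retired >= 0 && taken >= 0 && taken <= retired)
    if (retired == 0) 0L
    else retired + (Stages - 1) + TakenPenalty * taken
  }

  // Average cycles per retired instruction.
  def cpi(retired: Long, taken: Long): Double = {
    require(retired > 0)
    cycles(retired, taken).toDouble / retired
  }

  def main(args: Array[String]): Unit = {
    // 100 instructions, 10 taken branches: 100 + 2 (fill) + 20 (penalty)
    println(cycles(100, 10)) // prints 122
  }
}
```

Under these assumptions the fill latency is paid once, while the penalty scales with the number of taken branches, which is why the prediction scheme matters for branch-heavy code.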
#### IF/ID to EXE

```scala
// Pipelining, FD-EXE
fd_ex.instruction := inst_fetch.io.instruction
fd_ex.instruction_address := inst_fetch.io.instruction_address
fd_ex.immediate := id.io.ex_immediate
fd_ex.ex_aluop1_source := id.io.ex_aluop1_source
fd_ex.ex_aluop2_source := id.io.ex_aluop2_source
fd_ex.reg_read_address1 := id.io.regs_reg1_read_address
fd_ex.reg_read_address2 := id.io.regs_reg2_read_address
// first stall: set while the EXE-WB register holds a taken jump
fd_ex.stall := ex_wb.if_jump_flag
// control signals that travel on to the write-back stage
fd_ex.wbcontrol.memory_read_enable := id.io.memory_read_enable
fd_ex.wbcontrol.memory_write_enable := id.io.memory_write_enable
fd_ex.wbcontrol.wb_reg_write_source := id.io.wb_reg_write_source
fd_ex.wbcontrol.reg_write_enable := id.io.reg_write_enable
fd_ex.wbcontrol.reg_write_address := id.io.reg_write_address
```

#### EXE to MEM/WB

```scala
// Pipelining, EXE-WB
ex_wb.instruction := fd_ex.instruction
ex_wb.instruction_address := fd_ex.instruction_address
ex_wb.mem_alu_result := ex.io.mem_alu_result
ex_wb.reg2_data := reg2data
ex_wb.if_jump_address := ex.io.if_jump_address
// second stall: carry the first stall forward, or start one on a taken jump
ex_wb.stall := fd_ex.stall || ex_wb.if_jump_flag
```

## Dealing with data hazards

During the **EXE** stage, we may need to read data from `rs1` or `rs2`. If the `rd` of the previous instruction is the same as `rs1` or `rs2`, a data hazard occurs, because that result has not yet been written to the register file. Therefore, we must forward data from **WB** to **EXE** to ensure that the CPU operates as expected. Referencing **srv32**, we have implemented the forwarding logic as shown below.
```scala
// Operand 1: forward from WB when the previous instruction writes rs1
when(ex_wb.wbcontrol.reg_write_enable && ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address1) {
  when(ex_wb.wbcontrol.memory_read_enable) {
    ex.io.reg1_data := mem.io.wb_memory_read_data
  }.otherwise {
    ex.io.reg1_data := ex_wb.mem_alu_result
  }
}.otherwise {
  ex.io.reg1_data := regs.io.read_data1
}

// Operand 2: same selection for rs2
when(ex_wb.wbcontrol.reg_write_enable && ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address2) {
  when(ex_wb.wbcontrol.memory_read_enable) {
    ex.io.reg2_data := mem.io.wb_memory_read_data
  }.otherwise {
    ex.io.reg2_data := ex_wb.mem_alu_result
  }
}.otherwise {
  ex.io.reg2_data := regs.io.read_data2
}
```

`reg_write_enable` indicates that an instruction intends to write a register. When `reg_write_address` equals `reg_read_address1` or `reg_read_address2`, we assign the forwarded data to the corresponding `reg1_data` or `reg2_data` input. To decide which value to forward, we use `memory_read_enable`: only a **Load** instruction reads data from memory and writes it back to a register, while all other instructions write back the result calculated by the **ALU** in **EXE**.

## Stall implementation

In pipelined processors, a stall is a technique used to pause the pipeline, usually to resolve data or control dependencies. During a stall, operations in some stages are suspended, but the pipeline as a whole remains active.

There are two ways to implement a stall:

1. Replace the instruction with a NOP (`addi x0, x0, 0`)
2. Disable the control signals `reg_write_enable` and `memory_write_enable`

To achieve the functionality with minimal changes, we adopt the second method.
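The difference between the two methods can be illustrated with a small software model. This is a plain-Scala sketch rather than the Chisel implementation; `SquashModel`, `WbControl`, and the function names are hypothetical and only mirror the control signals named above. It assumes that writes to `x0` never change state, as the register file tests in this project expect.

```scala
object SquashModel {
  // Hypothetical, simplified stand-in for the write-back control signals.
  final case class WbControl(
      memoryReadEnable: Boolean,
      memoryWriteEnable: Boolean,
      regWriteEnable: Boolean,
      regWriteAddress: Int
  )

  // Method 1: what `addi x0, x0, 0` looks like at write-back.
  // It still asserts regWriteEnable, but targets x0.
  val nop: WbControl = WbControl(false, false, true, 0)

  def squashByNop(ctrl: WbControl, squash: Boolean): WbControl =
    if (squash) nop else ctrl

  // Method 2: keep every other field, clear only the two write enables.
  def squashByGating(ctrl: WbControl, squash: Boolean): WbControl =
    if (squash) ctrl.copy(memoryWriteEnable = false, regWriteEnable = false)
    else ctrl

  // True if the instruction can still change architectural state
  // (assumption: x0 is hard-wired to zero, so writing it has no effect).
  def changesState(ctrl: WbControl): Boolean =
    ctrl.memoryWriteEnable || (ctrl.regWriteEnable && ctrl.regWriteAddress != 0)

  def main(args: Array[String]): Unit = {
    val store = WbControl(false, true, false, 0)
    val load = WbControl(true, false, true, 6)
    for (c <- Seq(store, load)) {
      val gated = squashByGating(c, squash = true)
      val replaced = squashByNop(c, squash = true)
      println(s"$c -> gated changes state: ${changesState(gated)}, " +
        s"nop changes state: ${changesState(replaced)}")
    }
  }
}
```

In this model both methods leave the architectural state untouched. Gating only overrides two signals that already exist in the stage register, which is consistent with the "minimal changes" argument above.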
### Disable regWE&memWE in pipeline EXE-WB

```scala
// disable regWE&memWE
when(fd_ex.stall || ex_wb.if_jump_flag) {
  // squashed instruction: it must not jump or write anything
  ex_wb.if_jump_flag := false.B
  ex_wb.wbcontrol.memory_read_enable := fd_ex.wbcontrol.memory_read_enable
  ex_wb.wbcontrol.wb_reg_write_source := fd_ex.wbcontrol.wb_reg_write_source
  ex_wb.wbcontrol.reg_write_address := fd_ex.wbcontrol.reg_write_address
  ex_wb.wbcontrol.reg_write_enable := false.B
  ex_wb.wbcontrol.memory_write_enable := false.B
}.otherwise {
  ex_wb.wbcontrol := fd_ex.wbcontrol
  ex_wb.if_jump_flag := ex.io.if_jump_flag
}
```

### Test CPU

We wrote a simple program to test our 3-stage pipeline CPU.

```scala
import chisel3._
import chiseltest._
import org.scalatest.flatspec.AnyFlatSpec
// `CPU` and `TestAnnotations` come from the project sources.

class PipelineTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Pipeline")
  it should "print out the stage register" in {
    test(new CPU).withAnnotations(TestAnnotations.annos) { c =>
      c.io.instruction_valid.poke(true.B)
      // c.io.instruction.poke(0x3e001463L.U) // bne x0, x0, 1000
      c.io.instruction.poke(0x002081b3L.U) // add x3, x1, x2
      c.clock.step()
      c.io.instruction.poke(0x3e000463L.U) // beq x0, x0, 1000
      c.clock.step()
      c.io.instruction.poke(0x0146a583L.U) // lw x11, 20(x13)
      c.clock.step()
      // c.io.instruction.poke(0x00100513L.U) // addi x10, x0, 1
      // c.clock.step()
      // c.io.instruction.poke(0x00500593L.U) // addi x11, x0, 5
      // c.clock.step()
      // c.io.instruction.poke(0x40a58633L.U) // sub x12, x11, x10
      // c.clock.step()
      // c.io.instruction.poke(0x00a58633L.U) // add x12, x11, x10
      c.clock.step(3)
    }
  }
}
```

After this simple test, we ran the full test suite.

```shell
sbt test
```

```shell
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers *** FAILED ***
[info]   io_mem_debug_read_data=0 (0x0) did not equal expected=1 (0x1) (lines in CPUTest.scala: 93, 90, 85) (CPUTest.scala:93)
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] Run completed in 41 seconds, 416 milliseconds.
[info] Total number of tests run: 11
[info] Suites: completed 9, aborted 0
[info] Tests: succeeded 8, failed 3, canceled 0, ignored 0, pending 0
[info] *** 3 TESTS FAILED ***
[error] Failed tests:
[error]   riscv.singlecycle.ByteAccessTest
[error]   riscv.singlecycle.FibonacciTest
[error]   riscv.singlecycle.QuicksortTest
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 43 s, completed Jan 6, 2024 10:43:28 PM
```

The first run produced three failed tests. After checking the waveform, we found two bugs:

1. The stage register was not assigned the correct data (`reg2data`).
2. When two consecutive branch instructions both set `jump_flag` to 1, the second branch must not be executed, because it lies in the shadow of the first taken branch. The original version nevertheless passed its `jump_flag` and `jump_addr` back to IF/ID and jumped directly to the target of the second branch.

![1704608796541](https://hackmd.io/_uploads/SJZlmvtdp.jpg)

---

Fix for the first problem: assign the correct (forwarded) data to `reg2data`.

```scala
// reg2data holds the forwarded rs2 value so that it can feed both
// the ALU input and the EXE-WB stage register
val reg2data = Wire(UInt(Parameters.DataWidth))
when(ex_wb.wbcontrol.reg_write_enable && (ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address1)) {
  when(ex_wb.wbcontrol.memory_read_enable) {
    ex.io.reg1_data := mem.io.wb_memory_read_data
  }.otherwise {
    ex.io.reg1_data := ex_wb.mem_alu_result
  }
}.otherwise {
  ex.io.reg1_data := regs.io.read_data1
}
when(ex_wb.wbcontrol.reg_write_enable && (ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address2)) {
  when(ex_wb.wbcontrol.memory_read_enable) {
    reg2data := mem.io.wb_memory_read_data
  }.otherwise {
    reg2data := ex_wb.mem_alu_result
  }
}.otherwise {
  reg2data := regs.io.read_data2
}
ex.io.reg2_data := reg2data
```

Fix for the second problem: clear `ex_wb.if_jump_flag` together with the two write-enable control signals.

```scala
// ex_wb.if_jump_flag := ex.io.if_jump_flag
ex_wb.if_jump_address := ex.io.if_jump_address
ex_wb.stall := fd_ex.stall || ex_wb.if_jump_flag // second stall
// disable regWE&memWE
when(fd_ex.stall || ex_wb.if_jump_flag) {
  ex_wb.if_jump_flag := false.B
  ex_wb.wbcontrol.memory_read_enable := fd_ex.wbcontrol.memory_read_enable
  ex_wb.wbcontrol.wb_reg_write_source := fd_ex.wbcontrol.wb_reg_write_source
  ex_wb.wbcontrol.reg_write_address := fd_ex.wbcontrol.reg_write_address
  ex_wb.wbcontrol.reg_write_enable := false.B
  ex_wb.wbcontrol.memory_write_enable := false.B
}.otherwise {
  ex_wb.wbcontrol := fd_ex.wbcontrol
  ex_wb.if_jump_flag := ex.io.if_jump_flag
}
```

Finally, all tests passed.

```shell
[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/ianli/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/ianli/ca2023-lab3/)
[info] InstructionFetchTest:
[info] InstructionFetch of Single Cycle CPU
[info] - should fetch instruction
[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] PipelineTest:
[info] Pipeline
[info] - should print out the stage register
[info] BranchTest:
[info] 3-stage Pipeline CPU
[info] - should branch correctly
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] ByteAccessTest:
[info] 3-stage Pipeline CPU
[info] - should store and load a single byte
[info] QuicksortTest:
[info] 3-stage Pipeline CPU
[info] - should perform a quicksort on 10 numbers
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] FibonacciTest:
[info] 3-stage Pipeline CPU
[info] - should recursively calculate Fibonacci(10)
[info] ForwardTest:
[info] 3-stage Pipeline CPU
[info] - should bypass the operand
starting address: UInt<32>(5600)
UInt<32>(1059250332), UInt<32>(1057652809), UInt<32>(1063491255), UInt<32>(1061586249), UInt<32>(1059681246), UInt<32>(1057776241),
UInt<32>(1054777865), UInt<32>(1062939607), UInt<32>(1060727120), UInt<32>(1058514636), UInt<32>(1055639692), UInt<32>(1051214720), UInt<32>(1062387961), UInt<32>(1059867992), UInt<32>(1057348026), UInt<32>(1052691510), UInt<32>(1046727151), UInt<32>(2), UInt<32>(5), UInt<32>(1048576000), UInt<32>(1056964608), UInt<32>(1061158912), UInt<32>(1064594550), UInt<32>(1059434382), UInt<32>(1062387961),
[info] ImgScaleTest:
[info] 3-stage Pipeline CPU
[info] - should Image Scaling...
[info] Run completed in 30 seconds, 315 milliseconds.
[info] Total number of tests run: 13
[info] Suites: completed 11, aborted 0
[info] Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 32 s, completed Jan 8, 2024 8:03:40 PM
```

We also added a counter, `mispredictCounter`, that counts the number of incorrect predictions.

```shell
ianli@new:~/ca2023-lab3 $ sbt "testOnly riscv.singlecycle.ByteAccessTest"
[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/ianli/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/ianli/ca2023-lab3/)
[info] compiling 1 Scala source to /home/ianli/ca2023-lab3/target/scala-2.13/classes ...
fd_ex = inst: 0x00000000 instAddr: 0 imm: 0 op1_src: 0 op2_src: 0 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 ex_wb = inst: 0x00000000 instAddr: 0 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 mispredictCounter: 0 fd_ex = inst: 0x00000013 instAddr: 4096 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000000 instAddr: 0 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 mispredictCounter: 0 fd_ex = inst: 0x00000013 instAddr: 4096 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000013 instAddr: 4096 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4096 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 mispredictCounter: 0 fd_ex = inst: 0x00000013 instAddr: 4096 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000013 instAddr: 4096 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4096 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 mispredictCounter: 0 fd_ex = inst: 0x00400513 instAddr: 4096 imm: 4 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 4 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 10 ex_wb = inst: 0x00000013 instAddr: 4096 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4096 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 mispredictCounter: 0 fd_ex = inst: 0xdeadc2b7 instAddr: 4100 imm: 3735928832 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 10 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 5 ex_wb = inst: 0x00400513 instAddr: 4096 alu_out: 4 reg2_data: 0 jump_flag: 0 jumpAddr: 4100 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 10 mispredictCounter: 0 fd_ex = inst: 
0xeef28293 instAddr: 4104 imm: 4294967023 op1_src: 0 op2_src: 1 reg1_RA: 5 reg2_RA: 15 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 5 ex_wb = inst: 0xdeadc2b7 instAddr: 4100 alu_out: 3735928832 reg2_data: 4 jump_flag: 0 jumpAddr: 3735932932 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 5 mispredictCounter: 0 fd_ex = inst: 0x00550023 instAddr: 4108 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 10 reg2_RA: 5 stall: 0 mem_RE: 0 mem_WE: 1 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 ex_wb = inst: 0xeef28293 instAddr: 4104 alu_out: 3735928559 reg2_data: 0 jump_flag: 0 jumpAddr: 3831 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 5 mispredictCounter: 0 fd_ex = inst: 0x00052303 instAddr: 4112 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 10 reg2_RA: 0 stall: 0 mem_RE: 1 mem_WE: 0 wb_reg_src: 1 reg_WE: 1 reg_WA: 6 ex_wb = inst: 0x00550023 instAddr: 4108 alu_out: 4 reg2_data: 3735928559 jump_flag: 0 jumpAddr: 4108 stall: 0 mem_RE: 0 mem_WE: 1 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 mispredictCounter: 0 fd_ex = inst: 0x01500913 instAddr: 4116 imm: 21 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 21 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 18 ex_wb = inst: 0x00052303 instAddr: 4112 alu_out: 4 reg2_data: 0 jump_flag: 0 jumpAddr: 4112 stall: 0 mem_RE: 1 mem_WE: 0 wb_reg_src: 1 reg_WE: 1 reg_WA: 6 mispredictCounter: 0 fd_ex = inst: 0x012500a3 instAddr: 4120 imm: 1 op1_src: 0 op2_src: 1 reg1_RA: 10 reg2_RA: 18 stall: 0 mem_RE: 0 mem_WE: 1 wb_reg_src: 0 reg_WE: 0 reg_WA: 1 ex_wb = inst: 0x01500913 instAddr: 4116 alu_out: 21 reg2_data: 0 jump_flag: 0 jumpAddr: 4137 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 18 mispredictCounter: 0 fd_ex = inst: 0x00052083 instAddr: 4124 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 10 reg2_RA: 0 stall: 0 mem_RE: 1 mem_WE: 0 wb_reg_src: 1 reg_WE: 1 reg_WA: 1 ex_wb = inst: 0x012500a3 instAddr: 4120 alu_out: 5 reg2_data: 21 jump_flag: 0 jumpAddr: 4121 stall: 0 mem_RE: 0 mem_WE: 1 wb_reg_src: 0 reg_WE: 0 reg_WA: 1 
mispredictCounter: 0 fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00052083 instAddr: 4124 alu_out: 4 reg2_data: 0 jump_flag: 0 jumpAddr: 4124 stall: 0 mem_RE: 1 mem_WE: 0 wb_reg_src: 1 reg_WE: 1 reg_WA: 1 mispredictCounter: 0 fd_ex = inst: 0x00000013 instAddr: 4132 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x0000006f instAddr: 4128 alu_out: 4128 reg2_data: 0 jump_flag: 1 jumpAddr: 4128 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0 mispredictCounter: 0 fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 mispredictCounter: 0 fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 mispredictCounter: 1 fd_ex = inst: 0x00000013 instAddr: 4132 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x0000006f instAddr: 4128 alu_out: 4128 reg2_data: 0 jump_flag: 1 jumpAddr: 4128 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0 mispredictCounter: 1 fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 
mispredictCounter: 1 fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 mispredictCounter: 2 fd_ex = inst: 0x00000013 instAddr: 4132 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x0000006f instAddr: 4128 alu_out: 4128 reg2_data: 0 jump_flag: 1 jumpAddr: 4128 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0 mispredictCounter: 2 fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 mispredictCounter: 2 fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 mispredictCounter: 3 fd_ex = inst: 0x00000013 instAddr: 4132 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x0000006f instAddr: 4128 alu_out: 4128 reg2_data: 0 jump_flag: 1 jumpAddr: 4128 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0 mispredictCounter: 3 fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0 ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0 
mispredictCounter: 3
fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 4
[info] ByteAccessTest:
[info] 3-stage Pipeline CPU
[info] - should store and load a single byte
[info] Run completed in 5 seconds, 351 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 11 s, completed Jan 9, 2024 4:22:54 PM
```

---

Let us analyze how the branch predictor works. When the `stall` signal of the IF/ID-to-EXE stage register is 1, the `reg_WE` control signal in that register is still set.

```shell
fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 0
```

---

In the next cycle, `reg_WE` is disabled as the stalled instruction moves into the EXE-to-MEM/WB register, and `mispredictCounter` is incremented by one. This means that we mispredicted the branch once.
```shell
fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 1
```

## Analyze srv32

### Forwarding

The srv32 architecture avoids both the data hazard described earlier and the load-use hazard through forwarding, because the register read and the memory read are performed in the same stage, as shown in the following table:

| Instruction       | c1    | c2    | c3    | c4  | c5  |
| ----------------- | ----- | ----- | ----- | --- | --- |
| `lw x2, 0(x5)`    | IF/ID | EX    | WB    |     |     |
| `and x3, x2, x4`  |       | IF/ID | EX    | WB  |     |
| `add x4, x5, x6`  |       |       | IF/ID | EX  | WB  |

```verilog
// register reading @ execution stage and register forwarding
// When the execution result accesses the same register,
// the execution result is directly forwarded from the previous
// instruction (at write back stage)
assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 :
                           (!wb_flush && wb_alu2reg &&
                            (wb_dst_sel == ex_src1_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src1_sel];
assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 :
                           (!wb_flush && wb_alu2reg &&
                            (wb_dst_sel == ex_src2_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src2_sel];
```

## Conclusion

In this project, we reviewed aspects related to pipeline processors and branch prediction. Although we recognized the necessity for additional components to support dynamic branch prediction, time constraints limited our ability to implement them fully. The achieved results are as follows:

- Pipeline
- Static branch prediction (always not taken)

Components left for future implementation include the BHT (Branch History Table) and the BTB (Branch Target Buffer).
With these structures, we could implement dynamic branch prediction and observe the differences between various branch prediction mechanisms.

:::warning
Discuss the effectiveness of the design.
:::

## References

- [Pipeline Processor](https://hackmd.io/@joanne8826/HkT32O85I#4-3%E2%80%934-Data-Hazard-and-forwarding)
- [Single Cycle CPU](https://hackmd.io/@sysprog/r1mlr3I7p#Single-cycle-RISC-V-CPU)
- [srv32](https://github.com/kuopinghsu/srv32)
- Analysis of srv32: [RISCV RV32IM Soft CPU](https://hackmd.io/@sysprog/S1Udn1Xtt)
- [riscv-mini](https://github.com/ucb-bar/riscv-mini)
- [dino-cpu](https://github.com/jlpteaching/dinocpu/tree/lab4-wq19)