contributed by <shhung>, <SUE3K>
The details are available on the pipeline branch.
In this project, we have refactored the single-cycle CPU from Assignment 3 into a 3-stage pipeline CPU. The original single-cycle CPU segmented the data path into five stages: Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write-Back. Notably, the Instruction Decode stage included fetching necessary data from registers. Drawing inspiration from the srv32's 3-stage pipeline architecture, we reorganized our CPU into the following 3 stages:
Register reading is now incorporated into Stage 2.
In order for the CPU to operate in a pipelined manner, temporary registers must be introduced between the stages. The stage register is used to store the data required for subsequent stages and the calculation results of the previous stage. Efficiently managing these stage registers and ensuring correct propagation is critical to the functionality of the pipeline.
Before diving into coding, it is crucial to plan the architecture and identify the problems the pipeline must solve. The final architecture diagram is shown above. The main modules were already verified in the single-cycle CPU, so we can reuse them. The key additions are the stage registers between stages and the forwarding circuitry, highlighted in red in the diagram.
As we have learned, data hazards are an issue we need to solve. Only RAW hazards are possible on a single-issue processor. To address them, we adopt forwarding instead of inserting stalls, which would reduce performance. The forwarded data come from either memory or the EXE-WB stage register. The two red lines in the diagram above represent the forwarding paths to the ALU.
To mitigate control hazards, we use static branch prediction instead of delayed branches. We have configured the processor to always predict not-taken. As analyzed in the lecture material for srv32, our design also incurs a two-cycle penalty for taken branches.
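To make the cost of this policy concrete, a rough cycle estimate can be sketched as follows. This is only a model under stated assumptions: an ideal 3-stage pipeline that otherwise retires one instruction per cycle, two cycles to fill the pipeline, and exactly two squashed slots per taken branch or jump. The names `PipelineCost`, `totalCycles`, and `cpi` are hypothetical and do not exist in the project.

```scala
// Back-of-the-envelope cost model for an always-not-taken 3-stage pipeline.
// All names here are illustrative assumptions, not part of the CPU sources.
object PipelineCost {
  val Stages = 3        // IF/ID, EXE, WB
  val TakenPenalty = 2  // instructions squashed after each taken branch

  // Total cycles = pipeline fill + one cycle per instruction + flush penalty.
  def totalCycles(instructions: Long, takenBranches: Long): Long = {
    require(instructions >= 0 && takenBranches >= 0 && takenBranches <= instructions)
    if (instructions == 0) 0L
    else (Stages - 1) + instructions + TakenPenalty * takenBranches
  }

  // Average cycles per instruction for a non-empty program.
  def cpi(instructions: Long, takenBranches: Long): Double = {
    require(instructions > 0)
    totalCycles(instructions, takenBranches).toDouble / instructions
  }
}
```

For example, `PipelineCost.totalCycles(100, 10)` models 100 instructions of which 10 are taken branches; the more taken branches a program has, the further the CPI drifts above 1.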
The pipeline registers store and propagate the required data and control signals from the previous stage to the following stage. The most crucial signal is stall, which allows for flushing the instruction when a branch is taken.
// Pipelining, FD-EXE
fd_ex.instruction := inst_fetch.io.instruction
fd_ex.instruction_address := inst_fetch.io.instruction_address
fd_ex.immediate := id.io.ex_immediate
fd_ex.ex_aluop1_source := id.io.ex_aluop1_source
fd_ex.ex_aluop2_source := id.io.ex_aluop2_source
fd_ex.reg_read_address1 := id.io.regs_reg1_read_address
fd_ex.reg_read_address2 := id.io.regs_reg2_read_address
fd_ex.stall := ex_wb.if_jump_flag //first stall
fd_ex.wbcontrol.memory_read_enable := id.io.memory_read_enable
fd_ex.wbcontrol.memory_write_enable := id.io.memory_write_enable
fd_ex.wbcontrol.wb_reg_write_source := id.io.wb_reg_write_source
fd_ex.wbcontrol.reg_write_enable := id.io.reg_write_enable
fd_ex.wbcontrol.reg_write_address := id.io.reg_write_address
// Pipelining, EXE-WB
ex_wb.instruction := fd_ex.instruction
ex_wb.instruction_address := fd_ex.instruction_address
ex_wb.mem_alu_result := ex.io.mem_alu_result
ex_wb.reg2_data := reg2data
ex_wb.if_jump_address := ex.io.if_jump_address
ex_wb.stall := fd_ex.stall || ex_wb.if_jump_flag //second stall
During the EXE stage, we may need to read data from rs*. If the rd of the previous instruction is the same as rs1 or rs2, a data hazard occurs. Therefore, we must forward data from WB to EXE to ensure that the CPU operates as expected.
Referencing srv32, we have implemented the forwarding logic as shown below.
when(ex_wb.wbcontrol.reg_write_enable && ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address1) {
  when(ex_wb.wbcontrol.memory_read_enable) {
    ex.io.reg1_data := mem.io.wb_memory_read_data
  }.otherwise {
    ex.io.reg1_data := ex_wb.mem_alu_result
  }
}.otherwise {
  ex.io.reg1_data := regs.io.read_data1
}
when(ex_wb.wbcontrol.reg_write_enable && ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address2) {
  when(ex_wb.wbcontrol.memory_read_enable) {
    ex.io.reg2_data := mem.io.wb_memory_read_data
  }.otherwise {
    ex.io.reg2_data := ex_wb.mem_alu_result
  }
}.otherwise {
  ex.io.reg2_data := regs.io.read_data2
}
reg_write_enable indicates that an instruction intends to write a register. When reg_write_address equals reg_read_address*, we assign the forwarded data to reg*data. To decide which data to forward, we use memory_read_enable, since only load instructions read data from memory and write it back to registers. All other instructions write back the result computed by the ALU in EXE.
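The selection rule described above can also be modelled in plain Scala, which makes it easy to check individual cases without running a simulation. The sketch below is not the project's Chisel code: the names `ForwardingModel`, `ExWb`, and `operand` are hypothetical, and the guard for register x0 is an extra assumption borrowed from srv32's forwarding logic rather than from our implementation.

```scala
// Software model of the WB-to-EXE forwarding mux (illustrative names only).
object ForwardingModel {
  // Fields of the EXE-WB stage register that the forwarding decision reads.
  final case class ExWb(
      regWriteEnable: Boolean,
      regWriteAddress: Int,
      memoryReadEnable: Boolean,
      aluResult: Long,
      memoryReadData: Long
  )

  // Returns the operand value that the EXE stage should see for `readAddress`.
  def operand(readAddress: Int, regFileData: Long, wb: ExWb): Long =
    if (readAddress == 0) 0L // assumption: x0 always reads zero, as in srv32
    else if (wb.regWriteEnable && wb.regWriteAddress == readAddress) {
      // Hazard: a load forwards memory data, anything else forwards the ALU result.
      if (wb.memoryReadEnable) wb.memoryReadData else wb.aluResult
    } else regFileData // no hazard: use the register file output
}
```

With an `ExWb` describing an ALU instruction that writes x5, `operand(5, ...)` returns the ALU result, while `operand(6, ...)` falls through to the register file value.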
In pipeline processors, stall is a technique used to pause the pipeline, usually to solve data dependency or control dependency problems.
In the case of stalling, operations at some stages are suspended, but the entire pipeline remains active. There are two ways to implement stall:
reg_write_enable and mem_write_enable
To achieve the required behavior with minimal changes, we adopt the second method, which disables the register and memory write enables of the instructions to be squashed.
// disable regWE & memWE
when(fd_ex.stall || ex_wb.if_jump_flag) {
  ex_wb.if_jump_flag := false.B
  ex_wb.wbcontrol.memory_read_enable := fd_ex.wbcontrol.memory_read_enable
  ex_wb.wbcontrol.wb_reg_write_source := fd_ex.wbcontrol.wb_reg_write_source
  ex_wb.wbcontrol.reg_write_address := fd_ex.wbcontrol.reg_write_address
  ex_wb.wbcontrol.reg_write_enable := false.B
  ex_wb.wbcontrol.memory_write_enable := false.B
}.otherwise {
  ex_wb.wbcontrol := fd_ex.wbcontrol
  ex_wb.if_jump_flag := ex.io.if_jump_flag
}
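The effect of this block on the write-back control bundle can be summarized with a small plain-Scala model. It is a sketch for reasoning only: `SquashModel`, `WbControl`, and `nextWbControl` are hypothetical names, and the model covers just the control bundle, not the if_jump_flag handling.

```scala
// Software model of squashing an instruction by clearing its write enables.
object SquashModel {
  // Mirrors the fields carried in the wbcontrol bundle of the stage registers.
  final case class WbControl(
      memoryReadEnable: Boolean,
      memoryWriteEnable: Boolean,
      wbRegWriteSource: Int,
      regWriteEnable: Boolean,
      regWriteAddress: Int
  )

  // When `squash` is true the instruction still flows down the pipeline,
  // but it can no longer change the register file or memory.
  def nextWbControl(fdEx: WbControl, squash: Boolean): WbControl =
    if (squash) fdEx.copy(regWriteEnable = false, memoryWriteEnable = false)
    else fdEx
}
```

Because only the two write enables are cleared, a squashed instruction leaves the architectural state untouched without any change to the datapath itself.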
We wrote a simple test to exercise our 3-stage pipeline CPU.
import chisel3._
import chiseltest._
import org.scalatest.flatspec.AnyFlatSpec

// CPU and TestAnnotations come from the project sources.
class PipelineTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Pipeline")
  it should "print out the stage register" in {
    test(new CPU).withAnnotations(TestAnnotations.annos) { c =>
      c.io.instruction_valid.poke(true.B)
      // c.io.instruction.poke(0x3e001463L.U) // bne x0, x0, 1000
      c.io.instruction.poke(0x002081b3L.U) // add x3, x1, x2
      c.clock.step()
      c.io.instruction.poke(0x3e000463L.U) // beq x0, x0, 1000
      c.clock.step()
      c.io.instruction.poke(0x0146a583L.U) // lw x11, 20(x13)
      c.clock.step()
      // c.io.instruction.poke(0x00100513L.U) // addi x10, x0, 1
      // c.clock.step()
      // c.io.instruction.poke(0x00500593L.U) // addi x11, x0, 5
      // c.clock.step()
      // c.io.instruction.poke(0x40a58633L.U) // sub x12, x11, x10
      // c.clock.step()
      // c.io.instruction.poke(0x00a58633L.U) // add x12, x11, x10
      c.clock.step(3) // let the last instruction drain through the pipeline
    }
  }
}
After the simple test, we ran the full test suite to test the CPU comprehensively.
sbt test
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers *** FAILED ***
[info]   io_mem_debug_read_data=0 (0x0) did not equal expected=1 (0x1) (lines in CPUTest.scala: 93, 90, 85) (CPUTest.scala:93)
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] Run completed in 41 seconds, 416 milliseconds.
[info] Total number of tests run: 11
[info] Suites: completed 9, aborted 0
[info] Tests: succeeded 8, failed 3, canceled 0, ignored 0, pending 0
[info] *** 3 TESTS FAILED ***
[error] Failed tests:
[error] riscv.singlecycle.ByteAccessTest
[error] riscv.singlecycle.FibonacciTest
[error] riscv.singlecycle.QuicksortTest
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 43 s, completed Jan 6, 2024 10:43:28 PM
At first, three tests failed.
After checking the waveforms, we found two problems.
First fix: assign the correct (forwarded) data to reg2data, which is also the value latched into ex_wb.reg2_data
val reg2data = Wire(UInt(Parameters.DataWidth))
when(ex_wb.wbcontrol.reg_write_enable && (ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address1)) {
  when(ex_wb.wbcontrol.memory_read_enable) {
    ex.io.reg1_data := mem.io.wb_memory_read_data
  }.otherwise {
    ex.io.reg1_data := ex_wb.mem_alu_result
  }
}.otherwise {
  ex.io.reg1_data := regs.io.read_data1
}
when(ex_wb.wbcontrol.reg_write_enable && (ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address2)) {
  when(ex_wb.wbcontrol.memory_read_enable) {
    reg2data := mem.io.wb_memory_read_data
  }.otherwise {
    reg2data := ex_wb.mem_alu_result
  }
}.otherwise {
  reg2data := regs.io.read_data2
}
ex.io.reg2_data := reg2data
Second fix: disable ex_wb.if_jump_flag and the other two control signals
// ex_wb.if_jump_flag := ex.io.if_jump_flag
ex_wb.if_jump_address := ex.io.if_jump_address
ex_wb.stall := fd_ex.stall || ex_wb.if_jump_flag // second stall
// disable regWE & memWE
when(fd_ex.stall || ex_wb.if_jump_flag) {
  ex_wb.if_jump_flag := false.B
  ex_wb.wbcontrol.memory_read_enable := fd_ex.wbcontrol.memory_read_enable
  ex_wb.wbcontrol.wb_reg_write_source := fd_ex.wbcontrol.wb_reg_write_source
  ex_wb.wbcontrol.reg_write_address := fd_ex.wbcontrol.reg_write_address
  ex_wb.wbcontrol.reg_write_enable := false.B
  ex_wb.wbcontrol.memory_write_enable := false.B
}.otherwise {
  ex_wb.wbcontrol := fd_ex.wbcontrol
  ex_wb.if_jump_flag := ex.io.if_jump_flag
}
Finally, all tests passed.
[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/ianli/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/ianli/ca2023-lab3/)
[info] InstructionFetchTest:
[info] InstructionFetch of Single Cycle CPU
[info] - should fetch instruction
[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] PipelineTest:
[info] Pipeline
[info] - should print out the stage register
[info] BranchTest:
[info] 3-stage Pipeline CPU
[info] - should branch correctly
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] ByteAccessTest:
[info] 3-stage Pipeline CPU
[info] - should store and load a single byte
[info] QuicksortTest:
[info] 3-stage Pipeline CPU
[info] - should perform a quicksort on 10 numbers
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] FibonacciTest:
[info] 3-stage Pipeline CPU
[info] - should recursively calculate Fibonacci(10)
[info] ForwardTest:
[info] 3-stage Pipeline CPU
[info] - should bypass the operand
starting address: UInt<32>(5600)
UInt<32>(1059250332), UInt<32>(1057652809), UInt<32>(1063491255), UInt<32>(1061586249), UInt<32>(1059681246),
UInt<32>(1057776241), UInt<32>(1054777865), UInt<32>(1062939607), UInt<32>(1060727120), UInt<32>(1058514636),
UInt<32>(1055639692), UInt<32>(1051214720), UInt<32>(1062387961), UInt<32>(1059867992), UInt<32>(1057348026),
UInt<32>(1052691510), UInt<32>(1046727151), UInt<32>(2), UInt<32>(5), UInt<32>(1048576000),
UInt<32>(1056964608), UInt<32>(1061158912), UInt<32>(1064594550), UInt<32>(1059434382), UInt<32>(1062387961),
[info] ImgScaleTest:
[info] 3-stage Pipeline CPU
[info] - should Image Scaling...
[info] Run completed in 30 seconds, 315 milliseconds.
[info] Total number of tests run: 13
[info] Suites: completed 11, aborted 0
[info] Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 32 s, completed Jan 8, 2024 8:03:40 PM
We added a counter that tracks the number of incorrect branch predictions. Its value is printed as mispredictCounter in the output below.
ianli@new:~/ca2023-lab3 $ sbt "testOnly riscv.singlecycle.ByteAccessTest"
[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/ianli/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/ianli/ca2023-lab3/)
[info] compiling 1 Scala source to /home/ianli/ca2023-lab3/target/scala-2.13/classes ...
fd_ex = inst: 0x00000000 instAddr: 0 imm: 0 op1_src: 0 op2_src: 0 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
ex_wb = inst: 0x00000000 instAddr: 0 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 0
fd_ex = inst: 0x00000013 instAddr: 4096 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000000 instAddr: 0 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 0
fd_ex = inst: 0x00000013 instAddr: 4096 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4096 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4096 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
mispredictCounter: 0
fd_ex = inst: 0x00000013 instAddr: 4096 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4096 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4096 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
mispredictCounter: 0
fd_ex = inst: 0x00400513 instAddr: 4096 imm: 4 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 4 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 10
ex_wb = inst: 0x00000013 instAddr: 4096 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4096 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
mispredictCounter: 0
fd_ex = inst: 0xdeadc2b7 instAddr: 4100 imm: 3735928832 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 10 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 5
ex_wb = inst: 0x00400513 instAddr: 4096 alu_out: 4 reg2_data: 0 jump_flag: 0 jumpAddr: 4100 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 10
mispredictCounter: 0
fd_ex = inst: 0xeef28293 instAddr: 4104 imm: 4294967023 op1_src: 0 op2_src: 1 reg1_RA: 5 reg2_RA: 15 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 5
ex_wb = inst: 0xdeadc2b7 instAddr: 4100 alu_out: 3735928832 reg2_data: 4 jump_flag: 0 jumpAddr: 3735932932 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 5
mispredictCounter: 0
fd_ex = inst: 0x00550023 instAddr: 4108 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 10 reg2_RA: 5 stall: 0 mem_RE: 0 mem_WE: 1 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
ex_wb = inst: 0xeef28293 instAddr: 4104 alu_out: 3735928559 reg2_data: 0 jump_flag: 0 jumpAddr: 3831 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 5
mispredictCounter: 0
fd_ex = inst: 0x00052303 instAddr: 4112 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 10 reg2_RA: 0 stall: 0 mem_RE: 1 mem_WE: 0 wb_reg_src: 1 reg_WE: 1 reg_WA: 6
ex_wb = inst: 0x00550023 instAddr: 4108 alu_out: 4 reg2_data: 3735928559 jump_flag: 0 jumpAddr: 4108 stall: 0 mem_RE: 0 mem_WE: 1 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 0
fd_ex = inst: 0x01500913 instAddr: 4116 imm: 21 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 21 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 18
ex_wb = inst: 0x00052303 instAddr: 4112 alu_out: 4 reg2_data: 0 jump_flag: 0 jumpAddr: 4112 stall: 0 mem_RE: 1 mem_WE: 0 wb_reg_src: 1 reg_WE: 1 reg_WA: 6
mispredictCounter: 0
fd_ex = inst: 0x012500a3 instAddr: 4120 imm: 1 op1_src: 0 op2_src: 1 reg1_RA: 10 reg2_RA: 18 stall: 0 mem_RE: 0 mem_WE: 1 wb_reg_src: 0 reg_WE: 0 reg_WA: 1
ex_wb = inst: 0x01500913 instAddr: 4116 alu_out: 21 reg2_data: 0 jump_flag: 0 jumpAddr: 4137 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 18
mispredictCounter: 0
fd_ex = inst: 0x00052083 instAddr: 4124 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 10 reg2_RA: 0 stall: 0 mem_RE: 1 mem_WE: 0 wb_reg_src: 1 reg_WE: 1 reg_WA: 1
ex_wb = inst: 0x012500a3 instAddr: 4120 alu_out: 5 reg2_data: 21 jump_flag: 0 jumpAddr: 4121 stall: 0 mem_RE: 0 mem_WE: 1 wb_reg_src: 0 reg_WE: 0 reg_WA: 1
mispredictCounter: 0
fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00052083 instAddr: 4124 alu_out: 4 reg2_data: 0 jump_flag: 0 jumpAddr: 4124 stall: 0 mem_RE: 1 mem_WE: 0 wb_reg_src: 1 reg_WE: 1 reg_WA: 1
mispredictCounter: 0
fd_ex = inst: 0x00000013 instAddr: 4132 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x0000006f instAddr: 4128 alu_out: 4128 reg2_data: 0 jump_flag: 1 jumpAddr: 4128 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
mispredictCounter: 0
fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 0
fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 1
fd_ex = inst: 0x00000013 instAddr: 4132 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x0000006f instAddr: 4128 alu_out: 4128 reg2_data: 0 jump_flag: 1 jumpAddr: 4128 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
mispredictCounter: 1
fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 1
fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 2
fd_ex = inst: 0x00000013 instAddr: 4132 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x0000006f instAddr: 4128 alu_out: 4128 reg2_data: 0 jump_flag: 1 jumpAddr: 4128 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
mispredictCounter: 2
fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 2
fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 3
fd_ex = inst: 0x00000013 instAddr: 4132 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x0000006f instAddr: 4128 alu_out: 4128 reg2_data: 0 jump_flag: 1 jumpAddr: 4128 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
mispredictCounter: 3
fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 3
fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 4
[info] ByteAccessTest:
[info] 3-stage Pipeline CPU
[info] - should store and load a single byte
[info] Run completed in 5 seconds, 351 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 11 s, completed Jan 9, 2024 4:22:54 PM
Let us analyze how the branch predictor works.
When the stall signal of the IF/ID-to-EXE stage register is 1, the reg_WE control signal held in that register is still set.
fd_ex = inst: 0x00000013 instAddr: 4136 imm: 0 op1_src: 0 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4132 alu_out: 4128 reg2_data: 4128 jump_flag: 0 jumpAddr: 4132 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 0
In the next cycle, the reg_WE control signal is disabled in the EXE-to-WB stage register and mispredictCounter is incremented by one, which means that we mispredicted the branch once.
fd_ex = inst: 0x0000006f instAddr: 4128 imm: 0 op1_src: 1 op2_src: 1 reg1_RA: 0 reg2_RA: 0 stall: 0 mem_RE: 0 mem_WE: 0 wb_reg_src: 3 reg_WE: 1 reg_WA: 0
ex_wb = inst: 0x00000013 instAddr: 4136 alu_out: 0 reg2_data: 0 jump_flag: 0 jumpAddr: 4136 stall: 1 mem_RE: 0 mem_WE: 0 wb_reg_src: 0 reg_WE: 0 reg_WA: 0
mispredictCounter: 1
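The counting rule can be modelled in plain Scala. The sketch assumes that, with an always-not-taken policy, every cycle in which jump_flag is asserted corresponds to one misprediction. The Chisel counter itself is not shown in this report and, as the log suggests, updates a couple of cycles after the flag, so only the final totals are comparable. `MispredictModel` and its methods are hypothetical names.

```scala
// Software model of the misprediction counter (illustrative names only).
object MispredictModel {
  // One entry per clock cycle: was jump_flag asserted in the EXE-WB register?
  def countMispredictions(jumpFlagPerCycle: Seq[Boolean]): Int =
    jumpFlagPerCycle.count(identity)

  // Cycles lost to flushes, assuming a fixed penalty per taken branch.
  def wastedCycles(jumpFlagPerCycle: Seq[Boolean], penalty: Int = 2): Int =
    countMispredictions(jumpFlagPerCycle) * penalty
}
```

Feeding in the jump_flag column from the tail of the ByteAccessTest log above (1, 0, 0 repeated four times for the `jal x0, 0` loop) gives four mispredictions, matching the final mispredictCounter value of 4.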
The srv32 architecture avoids the previously mentioned data hazard and load-use hazard through forwarding, because the register-file read and the memory read are executed in the same stage, as shown in the following table:
Instruction | cycle 1 | c2 | c3 | c4 | c5 |
---|---|---|---|---|---|
lw x2, 0(x5) | IF/ID | EX | WB | | |
and x3, x2, x4 | | IF/ID | EX | WB | |
add x4, x5, x6 | | | IF/ID | EX | WB |
// register reading @ execution stage and register forwarding
// When the execution result accesses the same register,
// the execution result is directly forwarded from the previous
// instruction (at write back stage)
assign reg_rdata1[31: 0] = (ex_src1_sel == 5'h0) ? 32'h0 :
(!wb_flush && wb_alu2reg &&
(wb_dst_sel == ex_src1_sel)) ? // register forwarding
(wb_mem2reg ? wb_rdata : wb_result) :
regs[ex_src1_sel];
assign reg_rdata2[31: 0] = (ex_src2_sel == 5'h0) ? 32'h0 :
(!wb_flush && wb_alu2reg &&
(wb_dst_sel == ex_src2_sel)) ? // register forwarding
(wb_mem2reg ? wb_rdata : wb_result) :
regs[ex_src2_sel];
In this project, we reviewed aspects related to pipeline processors and branch prediction. Although we recognized the necessity for additional components to support dynamic branch prediction, time constraints limited our ability to implement them fully. The achieved results are as follows:
Components left for future implementation include BHT (Branch History Table) and BTB (Branch Target Buffer). Through these structures, we can implement dynamic branch prediction to observe differences between various branching mechanisms.
Discuss the effectiveness of the design.
Pipeline Processor
Single Cycle CPU
srv32
Analyze for srv32:RISCV RV32IM Soft CPU
riscv-mini
dino-cpu