Rewrite Lab3 as 3-stage pipeline RISC-V processor with branch predictor

contributed by <shhung><SUE3K>
The details can be found on the pipeline branch.

Introduction

In this project, we have refactored the single-cycle CPU from Assignment 3 into a 3-stage pipeline CPU. The original single-cycle CPU segmented the data path into five stages: Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write-Back. Notably, the Instruction Decode stage included fetching necessary data from registers. Drawing inspiration from the srv32's 3-stage pipeline architecture, we reorganized our CPU into the following 3 stages:

  • Stage 1 comprises Instruction Fetch and Instruction Decode
  • Stage 2 involves Execution
  • Stage 3 encompasses Memory Access and Write-Back

Register reading is now incorporated into Stage 2.

In order for the CPU to operate in a pipelined manner, temporary registers must be introduced between the stages. The stage register is used to store the data required for subsequent stages and the calculation results of the previous stage. Efficiently managing these stage registers and ensuring correct propagation is critical to the functionality of the pipeline.
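
To make the stage split concrete, the sketch below computes which stage each instruction occupies in each cycle of an idealized 3-stage pipeline. It is a plain Scala model, not part of the Chisel design: the object name `PipelineOccupancy` and its helpers are hypothetical, and it assumes one instruction is issued per cycle with no stalls or flushes.

```scala
object PipelineOccupancy {
  // Stage names follow the 3-stage split described above.
  val stages: Vector[String] = Vector("IF/ID", "EXE", "MEM/WB")

  /** Stage occupied by instruction `index` (0-based issue order) in `cycle`
    * (0-based), or None if it has not entered or has already left the pipeline.
    * Assumption: one instruction issued per cycle, no stalls or flushes.
    */
  def stageOf(index: Int, cycle: Int): Option[String] = {
    val pos = cycle - index
    if (pos >= 0 && pos < stages.length) Some(stages(pos)) else None
  }

  /** Cycles needed for `n` instructions to pass through the ideal pipeline. */
  def totalCycles(n: Int): Int = if (n == 0) 0 else n + stages.length - 1

  /** One text row per instruction, one column per cycle ("-" means not in the pipeline). */
  def chart(instructions: Seq[String]): Seq[String] =
    instructions.zipWithIndex.map { case (name, i) =>
      val cells = (0 until totalCycles(instructions.length))
        .map(c => stageOf(i, c).getOrElse("-"))
      (name +: cells).mkString(" | ")
    }
}
```

For example, `PipelineOccupancy.chart(Seq("lw", "and", "add")).foreach(println)` prints one row per instruction, with each row shifted one cycle to the right of the previous one. Every boundary between two adjacent columns corresponds to a value that has to be held in a stage register.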

Implementation

3-stage pipeline CPU architecture diagram

IF stands for Instruction Fetch
ID stands for Instruction Decode
REG stands for Register File
EXE stands for Execution
MEM stands for Memory Access

Description

Before diving into coding, it is crucial to plan the architecture and identify the issues the pipeline needs to solve. The final architecture diagram is shown above. The main modules were already validated in the single-cycle CPU, so we can reuse them. The key additions are the stage registers between stages and the forwarding circuitry, highlighted in red in the diagram.

Issues

Data hazard

Data hazards are the first issue we need to solve. On a single-issue processor, only RAW (read-after-write) hazards are possible. To address them, we adopt forwarding instead of inserting stalls, which would reduce performance. The forwarded data comes from either the data memory (for loads) or the ALU result held in the stage register. The two red lines in the diagram above mark the forwarding paths to the ALU.

Control hazard

To mitigate control hazards, we use static branch prediction instead of delayed branches. Our processor always predicts branches as not taken. As analyzed in the lecture material for srv32, our design also incurs a two-cycle penalty for taken branches, because the two instructions fetched after the branch are squashed.
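
The cost of always predicting not-taken can be estimated with simple arithmetic. The sketch below assumes that each taken branch or jump costs two extra cycles (the two squashed slots), that there are no other stalls, and that the pipeline needs two extra cycles to drain. The object name `BranchPenalty` and the example numbers are hypothetical, not measurements from this project.

```scala
object BranchPenalty {
  val PipelineDepth = 3
  val TakenPenalty = 2 // assumption: cycles lost per taken branch or jump

  /** Estimated cycles to retire `instructions`, of which `takenBranches` are
    * taken branches or jumps. Assumes no other sources of stalls.
    */
  def totalCycles(instructions: Long, takenBranches: Long): Long =
    if (instructions == 0) 0L
    else instructions + (PipelineDepth - 1) + TakenPenalty * takenBranches

  /** Estimated cycles per retired instruction. */
  def cpi(instructions: Long, takenBranches: Long): Double =
    totalCycles(instructions, takenBranches).toDouble / instructions
}
```

With 100 retired instructions and 10 taken branches, `BranchPenalty.totalCycles(100, 10)` gives 100 + 2 + 20 = 122 cycles. Under this model, the cost of the static predictor grows linearly with the number of taken branches.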

Feature

  • Forwarding
  • Static branch prediction

Pipeline register

The pipeline registers are utilized to store and propagate the required data and control signals from the previous stage to the following stage. The most crucial signal is stall, which allows for flushing the instruction when a branch is taken.

IF/ID to EXE

  // Pipelining, FD-EXE
  fd_ex.instruction         := inst_fetch.io.instruction
  fd_ex.instruction_address := inst_fetch.io.instruction_address
  fd_ex.immediate           := id.io.ex_immediate
  fd_ex.ex_aluop1_source    := id.io.ex_aluop1_source
  fd_ex.ex_aluop2_source    := id.io.ex_aluop2_source
  fd_ex.reg_read_address1   := id.io.regs_reg1_read_address
  fd_ex.reg_read_address2   := id.io.regs_reg2_read_address
  fd_ex.stall               := ex_wb.if_jump_flag //first stall
  fd_ex.wbcontrol.memory_read_enable  := id.io.memory_read_enable
  fd_ex.wbcontrol.memory_write_enable := id.io.memory_write_enable
  fd_ex.wbcontrol.wb_reg_write_source := id.io.wb_reg_write_source
  fd_ex.wbcontrol.reg_write_enable    := id.io.reg_write_enable
  fd_ex.wbcontrol.reg_write_address   := id.io.reg_write_address

EXE to MEM/WB

  // Pipelining, EXE-WB
  ex_wb.instruction         := fd_ex.instruction
  ex_wb.instruction_address := fd_ex.instruction_address
  ex_wb.mem_alu_result      := ex.io.mem_alu_result
  ex_wb.reg2_data           := reg2data
  ex_wb.if_jump_address     := ex.io.if_jump_address
  ex_wb.stall               := fd_ex.stall || ex_wb.if_jump_flag //second stall

Deal with data hazard

During the EXE stage, we may need to read data from rs*. If the rd of the previous instruction is the same as rs1 or rs2, a data hazard occurs, because that instruction is still in the WB stage and has not yet written the register file. Therefore, we must forward the data from WB to EXE to ensure that the CPU operates as expected.

Referencing srv32, we have implemented the forwarding logic as shown below.

  when(ex_wb.wbcontrol.reg_write_enable && ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address1) {
    when(ex_wb.wbcontrol.memory_read_enable) {
      ex.io.reg1_data := mem.io.wb_memory_read_data
    }.otherwise {
      ex.io.reg1_data := ex_wb.mem_alu_result
    }
  }.otherwise {
    ex.io.reg1_data := regs.io.read_data1
  }
  when(ex_wb.wbcontrol.reg_write_enable && ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address2) {
    when(ex_wb.wbcontrol.memory_read_enable) {
      ex.io.reg2_data := mem.io.wb_memory_read_data
    }.otherwise {
      ex.io.reg2_data := ex_wb.mem_alu_result
    }
  }.otherwise {
    ex.io.reg2_data := regs.io.read_data2
  }

reg_write_enable represents an instruction intending to modify a register. In the case where reg_write_address equals reg_read_address*, we should assign the data that was forwarded to reg*data. To distinguish which data should be assigned, we utilize memory_read_enable, as only the Load instruction reads data from memory and writes it back to registers. The rest write back the data calculated by the ALU from the EXE.
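
The selection described above can be modelled as a pure function, which makes the three possible sources easy to check in isolation. This is a plain Scala sketch rather than the Chisel code itself: the object name `ForwardingMux` and the parameter names are hypothetical stand-ins for the signals named in the comments.

```scala
object ForwardingMux {
  /** Operand value seen by the ALU for one source register. */
  def select(
      wbRegWriteEnable: Boolean, // ex_wb.wbcontrol.reg_write_enable
      wbRegWriteAddr: Int,       // ex_wb.wbcontrol.reg_write_address
      readAddr: Int,             // fd_ex.reg_read_address1 (or 2)
      wbMemReadEnable: Boolean,  // ex_wb.wbcontrol.memory_read_enable
      memReadData: Long,         // mem.io.wb_memory_read_data
      aluResult: Long,           // ex_wb.mem_alu_result
      regFileData: Long          // regs.io.read_data1 (or 2)
  ): Long =
    if (wbRegWriteEnable && wbRegWriteAddr == readAddr) {
      // Hazard: the previous instruction writes the register we are reading.
      // Loads forward the memory data; everything else forwards the ALU result.
      if (wbMemReadEnable) memReadData else aluResult
    } else {
      // No hazard: use the value read from the register file.
      regFileData
    }
}
```

For instance, `ForwardingMux.select(true, 5, 5, false, 111L, 222L, 333L)` returns the ALU result `222`, while setting the load flag to `true` returns the memory data `111`.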

stall implementation

In pipelined processors, a stall is a technique used to pause part of the pipeline, usually to resolve data or control dependencies.
During a stall, operations at some stages are suspended, but the rest of the pipeline remains active. There are two ways to implement a stall:

  1. Replace the instruction with a NOP (addi x0, x0, 0)
  2. Disable the control signals reg_write_enable and memory_write_enable

To achieve this with minimal changes, we adopt the second method.

Disable regWE&memWE in pipeline EXE-WB

  //disable regWE&memWE
  when(fd_ex.stall || ex_wb.if_jump_flag) {
    ex_wb.if_jump_flag                  := false.B
    ex_wb.wbcontrol.memory_read_enable  := fd_ex.wbcontrol.memory_read_enable
    ex_wb.wbcontrol.wb_reg_write_source := fd_ex.wbcontrol.wb_reg_write_source
    ex_wb.wbcontrol.reg_write_address   := fd_ex.wbcontrol.reg_write_address
    ex_wb.wbcontrol.reg_write_enable    := false.B
    ex_wb.wbcontrol.memory_write_enable := false.B
  }.otherwise {
    ex_wb.wbcontrol := fd_ex.wbcontrol
    ex_wb.if_jump_flag        := ex.io.if_jump_flag
  }
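
The effect of this gating can be sketched as a pure function over a simplified control bundle. The model below is hypothetical (the names `SquashControl` and `WbControl` are ours, and `wb_reg_write_source` is left out for brevity); it only illustrates that a squashed instruction keeps its other control fields but can no longer change the register file, the data memory, or the PC.

```scala
object SquashControl {
  /** Simplified stand-in for the wbcontrol bundle (wb_reg_write_source omitted). */
  final case class WbControl(
      memoryReadEnable: Boolean,
      memoryWriteEnable: Boolean,
      regWriteEnable: Boolean,
      regWriteAddress: Int
  )

  /** Next (control bundle, jump flag) stored in the EXE-WB stage register.
    * `squash` corresponds to fd_ex.stall || ex_wb.if_jump_flag.
    */
  def next(squash: Boolean, fromDecode: WbControl, exJumpFlag: Boolean): (WbControl, Boolean) =
    if (squash)
      // Squashed: clear both write enables and drop any jump request.
      (fromDecode.copy(regWriteEnable = false, memoryWriteEnable = false), false)
    else
      // Normal operation: pass control and jump flag through unchanged.
      (fromDecode, exJumpFlag)
}
```

Note that `memoryReadEnable` passes through even when the instruction is squashed, as in the Chisel code above. This appears to be safe as long as a memory read has no side effects, since the loaded value is never written back.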

Test CPU

We wrote a simple program to test our 3-stage pipeline CPU.

class PipelineTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Pipeline")
  it should "print out the stage register" in {
    test(new CPU).withAnnotations(TestAnnotations.annos) { c =>
      c.io.instruction_valid.poke(true.B)


      // c.io.instruction.poke(0x3e001463L.U) // bne x0, x0, 1000
      
      c.io.instruction.poke(0x002081b3L.U) // add x3, x1, x2
      c.clock.step()
      c.io.instruction.poke(0x3e000463L.U) // beq x0, x0, 1000
      c.clock.step()
      c.io.instruction.poke(0x0146a583L.U) // lw x11, 20(x13)
      c.clock.step()

      // c.io.instruction.poke(0x00100513L.U) // addi x10, x0, 1
      // c.clock.step()
      // c.io.instruction.poke(0x00500593L.U) // addi x11, x0, 5
      // c.clock.step()
      // c.io.instruction.poke(0x40a58633L.U) // sub x12, x11, x10
      // c.clock.step()
      // c.io.instruction.poke(0x00a58633L.U) // add x12, x11, x10
      c.clock.step(3)
    }
  }
}
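
The hexadecimal constants poked in the test can be cross-checked against the RV32I instruction formats. The helper below is a small, hypothetical encoder (the object `RvEncode` is not part of the project) covering only the R, I, and B formats used in this test.

```scala
object RvEncode {
  /** R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode */
  def rType(funct7: Int, rs2: Int, rs1: Int, funct3: Int, rd: Int, opcode: Int): Long =
    ((funct7.toLong << 25) | (rs2.toLong << 20) | (rs1.toLong << 15) |
      (funct3.toLong << 12) | (rd.toLong << 7) | opcode.toLong) & 0xffffffffL

  /** I-type: imm[11:0] | rs1 | funct3 | rd | opcode */
  def iType(imm: Int, rs1: Int, funct3: Int, rd: Int, opcode: Int): Long =
    (((imm & 0xfff).toLong << 20) | (rs1.toLong << 15) | (funct3.toLong << 12) |
      (rd.toLong << 7) | opcode.toLong) & 0xffffffffL

  /** B-type: imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | 0x63 */
  def bType(imm: Int, rs2: Int, rs1: Int, funct3: Int): Long = {
    val bit12 = (imm >> 12) & 0x1
    val bit11 = (imm >> 11) & 0x1
    val bits10to5 = (imm >> 5) & 0x3f
    val bits4to1 = (imm >> 1) & 0xf
    ((bit12.toLong << 31) | (bits10to5.toLong << 25) | (rs2.toLong << 20) |
      (rs1.toLong << 15) | (funct3.toLong << 12) | (bits4to1.toLong << 8) |
      (bit11.toLong << 7) | 0x63L) & 0xffffffffL
  }

  val addX3X1X2: Long = rType(0x00, 2, 1, 0x0, 3, 0x33) // add x3, x1, x2
  val beqX0X0: Long = bType(1000, 0, 0, 0x0)            // beq x0, x0, 1000
  val lwX11: Long = iType(20, 13, 0x2, 11, 0x03)        // lw x11, 20(x13)
}
```

The three values match the constants in the test: `0x002081b3`, `0x3e000463`, and `0x0146a583`. A check like this helps catch a mismatch between a poked constant and its comment before it is mistaken for a CPU bug.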

After the simple test, we ran the full test suite.

sbt test
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers *** FAILED ***
[info]   io_mem_debug_read_data=0 (0x0) did not equal expected=1 (0x1) (lines in CPUTest.scala: 93, 90, 85) (CPUTest.scala:93)
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] Run completed in 41 seconds, 416 milliseconds.
[info] Total number of tests run: 11
[info] Suites: completed 9, aborted 0
[info] Tests: succeeded 8, failed 3, canceled 0, ignored 0, pending 0
[info] *** 3 TESTS FAILED ***
[error] Failed tests:
[error]     riscv.singlecycle.ByteAccessTest
[error]     riscv.singlecycle.FibonacciTest
[error]     riscv.singlecycle.QuicksortTest
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 43 s, completed Jan 6, 2024 10:43:28 PM

The first run reported three failed tests.
After checking the waveform, we found two bugs:

  1. We did not assign the correct data to the stage register.
  2. When two consecutive branch instructions both set jump_flag to 1, the second branch should not be executed. The original version, however, passed its jump_flag and jump_addr back to IF/ID and jumped directly to the target of the second branch.

Solve first problem: assign the correct data to reg2data

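  // Hold the (possibly forwarded) rs2 value in a wire so that the same value
  // feeds both the ALU input and the EXE-WB stage register (ex_wb.reg2_data).
  val reg2data = Wire(UInt(Parameters.DataWidth))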
  when(ex_wb.wbcontrol.reg_write_enable && (ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address1)) {
    when(ex_wb.wbcontrol.memory_read_enable) {
      ex.io.reg1_data := mem.io.wb_memory_read_data
    }.otherwise {
      ex.io.reg1_data := ex_wb.mem_alu_result
    }
  }.otherwise {
    ex.io.reg1_data := regs.io.read_data1
  }
  when(ex_wb.wbcontrol.reg_write_enable && (ex_wb.wbcontrol.reg_write_address === fd_ex.reg_read_address2)) {
    when(ex_wb.wbcontrol.memory_read_enable) {
      reg2data := mem.io.wb_memory_read_data
    }.otherwise {
      reg2data := ex_wb.mem_alu_result
    }
  }.otherwise {
    reg2data := regs.io.read_data2
  }
  ex.io.reg2_data           := reg2data

Solve second problem: disable ex_wb.if_jump_flag and the other two control signals

// ex_wb.if_jump_flag        := ex.io.if_jump_flag
  ex_wb.if_jump_address     := ex.io.if_jump_address
  ex_wb.stall               := fd_ex.stall || ex_wb.if_jump_flag //second stall

  //disable regWE&memWE
  when(fd_ex.stall || ex_wb.if_jump_flag) {
    ex_wb.if_jump_flag                  := false.B
    ex_wb.wbcontrol.memory_read_enable  := fd_ex.wbcontrol.memory_read_enable
    ex_wb.wbcontrol.wb_reg_write_source := fd_ex.wbcontrol.wb_reg_write_source
    ex_wb.wbcontrol.reg_write_address   := fd_ex.wbcontrol.reg_write_address
    ex_wb.wbcontrol.reg_write_enable    := false.B
    ex_wb.wbcontrol.memory_write_enable := false.B
  }.otherwise {
    ex_wb.wbcontrol := fd_ex.wbcontrol
    ex_wb.if_jump_flag        := ex.io.if_jump_flag
  }
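
The timing of the two stall flags and the suppressed jump flag can be traced cycle by cycle with a small model. The sketch below assumes that fd_ex and ex_wb are registers whose next values are computed from their current values on every clock edge, which is consistent with the log printed later in this note. The object `StallTiming` and its field names are hypothetical.

```scala
object StallTiming {
  /** Registered flags; fields mirror fd_ex.stall, ex_wb.stall, ex_wb.if_jump_flag. */
  final case class Flags(fdExStall: Boolean, exWbStall: Boolean, exWbJump: Boolean)

  val reset: Flags = Flags(fdExStall = false, exWbStall = false, exWbJump = false)

  /** One clock edge: all next values are computed from the current register values. */
  def step(cur: Flags, exJumpFlag: Boolean): Flags = {
    val squash = cur.fdExStall || cur.exWbJump
    Flags(
      fdExStall = cur.exWbJump,                        // first stall
      exWbStall = squash,                              // second stall
      exWbJump = if (squash) false else exJumpFlag     // ignore jumps from squashed slots
    )
  }

  /** Register values after each clock edge, given ex.io.if_jump_flag per cycle. */
  def run(exJumpFlags: Seq[Boolean]): Seq[Flags] =
    exJumpFlags.scanLeft(reset)(step).tail
}
```

Feeding three back-to-back jump requests, `StallTiming.run(Seq(true, true, true, false))`, latches only the first one. The second and third arrive while the pipeline is squashing, so `exWbJump` stays false for them, which is the behaviour the fix is meant to guarantee.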

Finally, we passed all tests.

[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/ianli/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/ianli/ca2023-lab3/)
[info] InstructionFetchTest:
[info] InstructionFetch of Single Cycle CPU
[info] - should fetch instruction
[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] PipelineTest:
[info] Pipeline
[info] - should print out the stage register
[info] BranchTest:
[info] 3-stage Pipeline CPU
[info] - should branch correctly
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] ByteAccessTest:
[info] 3-stage Pipeline CPU
[info] - should store and load a single byte
[info] QuicksortTest:
[info] 3-stage Pipeline CPU
[info] - should perform a quicksort on 10 numbers
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] FibonacciTest:
[info] 3-stage Pipeline CPU
[info] - should recursively calculate Fibonacci(10)
[info] ForwardTest:
[info] 3-stage Pipeline CPU
[info] - should bypass the operand
starting address: UInt<32>(5600)
UInt<32>(1059250332), UInt<32>(1057652809), UInt<32>(1063491255), UInt<32>(1061586249), UInt<32>(1059681246), 
UInt<32>(1057776241), UInt<32>(1054777865), UInt<32>(1062939607), UInt<32>(1060727120), UInt<32>(1058514636), 
UInt<32>(1055639692), UInt<32>(1051214720), UInt<32>(1062387961), UInt<32>(1059867992), UInt<32>(1057348026), 
UInt<32>(1052691510), UInt<32>(1046727151), UInt<32>(2), UInt<32>(5), UInt<32>(1048576000), 
UInt<32>(1056964608), UInt<32>(1061158912), UInt<32>(1064594550), UInt<32>(1059434382), UInt<32>(1062387961), 
[info] ImgScaleTest:
[info] 3-stage Pipeline CPU
[info] - should Image Scaling...
[info] Run completed in 30 seconds, 315 milliseconds.
[info] Total number of tests run: 13
[info] Suites: completed 11, aborted 0
[info] Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 32 s, completed Jan 8, 2024 8:03:40 PM

We added a counter, mispredictCounter, to count the number of incorrect predictions. The run below prints it after the two stage registers in every cycle.

ianli@new:~/ca2023-lab3 $ sbt "testOnly riscv.singlecycle.ByteAccessTest"
[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/ianli/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/ianli/ca2023-lab3/)
[info] compiling 1 Scala source to /home/ianli/ca2023-lab3/target/scala-2.13/classes ...
fd_ex =   inst: 0x00000000  instAddr:          0  imm:          0  op1_src: 0  op2_src: 0  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
ex_wb =   inst: 0x00000000  instAddr:          0  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:          0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          0

fd_ex =   inst: 0x00000013  instAddr:       4096  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000000  instAddr:          0  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:          0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          0

fd_ex =   inst: 0x00000013  instAddr:       4096  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4096  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:       4096  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
mispredictCounter:          0

fd_ex =   inst: 0x00000013  instAddr:       4096  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4096  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:       4096  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
mispredictCounter:          0

fd_ex =   inst: 0x00400513  instAddr:       4096  imm:          4  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  4  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA: 10
ex_wb =   inst: 0x00000013  instAddr:       4096  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:       4096  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
mispredictCounter:          0

fd_ex =   inst: 0xdeadc2b7  instAddr:       4100  imm: 3735928832  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA: 10  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  5
ex_wb =   inst: 0x00400513  instAddr:       4096  alu_out:          4  reg2_data:          0  jump_flag: 0  jumpAddr:       4100  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA: 10
mispredictCounter:          0

fd_ex =   inst: 0xeef28293  instAddr:       4104  imm: 4294967023  op1_src: 0  op2_src: 1  reg1_RA:  5  reg2_RA: 15  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  5
ex_wb =   inst: 0xdeadc2b7  instAddr:       4100  alu_out: 3735928832  reg2_data:          4  jump_flag: 0  jumpAddr: 3735932932  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  5
mispredictCounter:          0

fd_ex =   inst: 0x00550023  instAddr:       4108  imm:          0  op1_src: 0  op2_src: 1  reg1_RA: 10  reg2_RA:  5  stall: 0    mem_RE: 0  mem_WE: 1  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
ex_wb =   inst: 0xeef28293  instAddr:       4104  alu_out: 3735928559  reg2_data:          0  jump_flag: 0  jumpAddr:       3831  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  5
mispredictCounter:          0

fd_ex =   inst: 0x00052303  instAddr:       4112  imm:          0  op1_src: 0  op2_src: 1  reg1_RA: 10  reg2_RA:  0  stall: 0    mem_RE: 1  mem_WE: 0  wb_reg_src: 1  reg_WE: 1  reg_WA:  6
ex_wb =   inst: 0x00550023  instAddr:       4108  alu_out:          4  reg2_data: 3735928559  jump_flag: 0  jumpAddr:       4108  stall: 0    mem_RE: 0  mem_WE: 1  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          0

fd_ex =   inst: 0x01500913  instAddr:       4116  imm:         21  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA: 21  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA: 18
ex_wb =   inst: 0x00052303  instAddr:       4112  alu_out:          4  reg2_data:          0  jump_flag: 0  jumpAddr:       4112  stall: 0    mem_RE: 1  mem_WE: 0  wb_reg_src: 1  reg_WE: 1  reg_WA:  6
mispredictCounter:          0

fd_ex =   inst: 0x012500a3  instAddr:       4120  imm:          1  op1_src: 0  op2_src: 1  reg1_RA: 10  reg2_RA: 18  stall: 0    mem_RE: 0  mem_WE: 1  wb_reg_src: 0  reg_WE: 0  reg_WA:  1
ex_wb =   inst: 0x01500913  instAddr:       4116  alu_out:         21  reg2_data:          0  jump_flag: 0  jumpAddr:       4137  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA: 18
mispredictCounter:          0

fd_ex =   inst: 0x00052083  instAddr:       4124  imm:          0  op1_src: 0  op2_src: 1  reg1_RA: 10  reg2_RA:  0  stall: 0    mem_RE: 1  mem_WE: 0  wb_reg_src: 1  reg_WE: 1  reg_WA:  1
ex_wb =   inst: 0x012500a3  instAddr:       4120  alu_out:          5  reg2_data:         21  jump_flag: 0  jumpAddr:       4121  stall: 0    mem_RE: 0  mem_WE: 1  wb_reg_src: 0  reg_WE: 0  reg_WA:  1
mispredictCounter:          0

fd_ex =   inst: 0x0000006f  instAddr:       4128  imm:          0  op1_src: 1  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00052083  instAddr:       4124  alu_out:          4  reg2_data:          0  jump_flag: 0  jumpAddr:       4124  stall: 0    mem_RE: 1  mem_WE: 0  wb_reg_src: 1  reg_WE: 1  reg_WA:  1
mispredictCounter:          0

fd_ex =   inst: 0x00000013  instAddr:       4132  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x0000006f  instAddr:       4128  alu_out:       4128  reg2_data:          0  jump_flag: 1  jumpAddr:       4128  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
mispredictCounter:          0

fd_ex =   inst: 0x00000013  instAddr:       4136  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4132  alu_out:       4128  reg2_data:       4128  jump_flag: 0  jumpAddr:       4132  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          0

fd_ex =   inst: 0x0000006f  instAddr:       4128  imm:          0  op1_src: 1  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4136  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:       4136  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          1

fd_ex =   inst: 0x00000013  instAddr:       4132  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x0000006f  instAddr:       4128  alu_out:       4128  reg2_data:          0  jump_flag: 1  jumpAddr:       4128  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
mispredictCounter:          1

fd_ex =   inst: 0x00000013  instAddr:       4136  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4132  alu_out:       4128  reg2_data:       4128  jump_flag: 0  jumpAddr:       4132  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          1

fd_ex =   inst: 0x0000006f  instAddr:       4128  imm:          0  op1_src: 1  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4136  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:       4136  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          2

fd_ex =   inst: 0x00000013  instAddr:       4132  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x0000006f  instAddr:       4128  alu_out:       4128  reg2_data:          0  jump_flag: 1  jumpAddr:       4128  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
mispredictCounter:          2

fd_ex =   inst: 0x00000013  instAddr:       4136  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4132  alu_out:       4128  reg2_data:       4128  jump_flag: 0  jumpAddr:       4132  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          2

fd_ex =   inst: 0x0000006f  instAddr:       4128  imm:          0  op1_src: 1  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4136  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:       4136  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          3

fd_ex =   inst: 0x00000013  instAddr:       4132  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x0000006f  instAddr:       4128  alu_out:       4128  reg2_data:          0  jump_flag: 1  jumpAddr:       4128  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
mispredictCounter:          3

fd_ex =   inst: 0x00000013  instAddr:       4136  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4132  alu_out:       4128  reg2_data:       4128  jump_flag: 0  jumpAddr:       4132  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          3

fd_ex =   inst: 0x0000006f  instAddr:       4128  imm:          0  op1_src: 1  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4136  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:       4136  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          4

[info] ByteAccessTest:
[info] 3-stage Pipeline CPU
[info] - should store and load a single byte
[info] Run completed in 5 seconds, 351 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 11 s, completed Jan 9, 2024 4:22:54 PM


Let us analyze how the branch predictor works, using two consecutive cycles from the log above.
In the first cycle, the stall signal of the IF/ID-to-EXE stage register (fd_ex) is 1 while its reg_WE control signal is still set.

fd_ex =   inst: 0x00000013  instAddr:       4136  imm:          0  op1_src: 0  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4132  alu_out:       4128  reg2_data:       4128  jump_flag: 0  jumpAddr:       4132  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          0


In the next cycle, reg_WE is disabled in the EXE-to-MEM/WB stage register (ex_wb) and mispredictCounter is incremented by one.
This means that one branch has been mispredicted.

fd_ex =   inst: 0x0000006f  instAddr:       4128  imm:          0  op1_src: 1  op2_src: 1  reg1_RA:  0  reg2_RA:  0  stall: 0    mem_RE: 0  mem_WE: 0  wb_reg_src: 3  reg_WE: 1  reg_WA:  0
ex_wb =   inst: 0x00000013  instAddr:       4136  alu_out:          0  reg2_data:          0  jump_flag: 0  jumpAddr:       4136  stall: 1    mem_RE: 0  mem_WE: 0  wb_reg_src: 0  reg_WE: 0  reg_WA:  0
mispredictCounter:          1

Analyze srv32

Forwarding

The srv32 architecture avoids both the data hazard described earlier and the load-use hazard through forwarding, because the register read and the memory read are executed in the same stage, as shown in the following table:

Instruction      c1     c2     c3     c4     c5
lw  x2, 0(x5)    IF/ID  EX     WB
and x3, x2, x4          IF/ID  EX     WB
add x4, x5, x6                 IF/ID  EX     WB

// register reading @ execution stage and register forwarding
// When the execution result accesses the same register,
// the execution result is directly forwarded from the previous
// instruction (at write back stage)
assign reg_rdata1[31: 0]    = (ex_src1_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src1_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src1_sel];
assign reg_rdata2[31: 0]    = (ex_src2_sel == 5'h0) ? 32'h0 :
                              (!wb_flush && wb_alu2reg &&
                               (wb_dst_sel == ex_src2_sel)) ? // register forwarding
                                (wb_mem2reg ? wb_rdata : wb_result) :
                                regs[ex_src2_sel];
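
For comparison with our Chisel forwarding logic, the nested conditional in the srv32 excerpt can be transcribed into a plain Scala function. This is only a reading aid: the object `Srv32Forward` and its parameter names are hypothetical, and the register file is modelled as a simple sequence.

```scala
object Srv32Forward {
  /** Mirrors the nested conditional in the Verilog excerpt for one source operand. */
  def regRead(
      exSrcSel: Int,      // ex_src1_sel or ex_src2_sel
      wbFlush: Boolean,   // wb_flush
      wbAlu2Reg: Boolean, // wb_alu2reg
      wbDstSel: Int,      // wb_dst_sel
      wbMem2Reg: Boolean, // wb_mem2reg
      wbRdata: Long,      // wb_rdata
      wbResult: Long,     // wb_result
      regs: IndexedSeq[Long]
  ): Long =
    if (exSrcSel == 0) 0L // x0 always reads as zero, even if WB targets x0
    else if (!wbFlush && wbAlu2Reg && wbDstSel == exSrcSel)
      (if (wbMem2Reg) wbRdata else wbResult) // register forwarding
    else regs(exSrcSel)
}
```

Two details differ from the Chisel forwarding logic shown earlier. srv32 checks wb_flush explicitly, whereas our design clears reg_write_enable for squashed instructions, which disables forwarding from them as a side effect. srv32 also returns zero for x0 before considering forwarding. The Chisel snippets in this note show no such guard, so an instruction that writes x0 followed by one that reads x0 may be worth a dedicated test, unless x0 is handled elsewhere in the design.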

Conclusion

In this project, we reviewed aspects related to pipeline processors and branch prediction. Although we recognized the necessity for additional components to support dynamic branch prediction, time constraints limited our ability to implement them fully. The achieved results are as follows:

  • Pipeline
  • Static branch prediction (always not taken)

Components left for future implementation include BHT (Branch History Table) and BTB (Branch Target Buffer). Through these structures, we can implement dynamic branch prediction to observe differences between various branching mechanisms.

Discuss the effectiveness of the design.

References

Pipeline Processor
Single Cycle CPU
srv32
Analyze for srv32:RISCV RV32IM Soft CPU
riscv-mini
dino-cpu