Implement dynamic branch prediction for pipelined RISC-V processor

許翰翔, 簡德彥, 吳秉宥

GitHub

Implement 5-stage pipeline CPU

Architecture

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

We split single-cycle CPU to five stage, each stage is IF, ID, EX, MEM, WB.

IF(instruction fetch)
Fetch instruction from instruction memory(IM) and update program counter selected by next_pc_sel signal.
ID(instruction decode)
Decode instruction to opcode, func7, func3, etc., and retrieve register value from RegFile.
EX(execute)
Executing every operation depends on the instruction. For example, it may involve calculating the branch/jump target address or computing an arithmetic result.
MEM(memory)
Read/write data from data memory(DM)
WB(write back)
Write the result data back to RegFile.

Modules

Module name	Explanation
ALU	Used to slove most of arithmetic operation.When the instruction is a branch type, alu_out[0] will be true when a branch is needed; otherwise, it will be false.
Controller	Output control signal to each pipeline stage to ensure that instruction correctly flows into the pipeline.
Decoder	Decode instruction into `opcode`, `func7`, `func3`, etc.
Imm_Ext	Generating the immediate number depends on the type of instruction.
JB_Unit	Calculate the target address for jump/branch type instructions.
LD_Filter	Filter data depends on `func3`. Used to load type instruction. For example `lw`, `lh`, `lb`, etc.
RegFile	Register file, it contains 32 RISC-V registers. This module can output `rs1` and `rs2` data simultaneously.
Reg_D	Pipeline IF/ID stage register.
Reg_E	Pipeline ID/EX stage register.
Reg_M	Pipeline EX/MEM stage register.
Reg_PC	Program counter register.
Reg_W	Pipeline MEM/WB stage register.
SRAM	In this project, to reduce complexity, we design the SRAM to read data without any delay.

Hazards resolve

Structure hazard
Due to the fact that SRAM can't read and write different addresses simultaneously, we split the memory into IM and DM. This allows the CPU to read instructions from the IM and load/store data from the DM at the same time, as both contain the same data.
Control hazard
Whether an instruction requires a branch is determined in the EX stage. However, at the same time, some instructions that may not need to be executed have already flowed into the IF and ID stages. Therefore, we need the next_pc_sel signal from the Controller unit to flush these instructions.
Data hazard
When instructions encounter a write-and-use or load-and-use issue, the correct data may not be available for use. For write-and-use situations, we forward some signals from the MEM stage to the ID or EX stage. For load-and-use situations, we insert a pipeline bubble (nop e.g. add x0 x1 x0) to stall the entire pipeline.

Static Branch Prediction

Always not jump

Our original design always assumes no jump. Instructions are fetched in order until the EX stage, where the decision to flush or not flush is made.

Always jump

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Code

package Pipeline
import chisel3._
import chisel3.util._

class Static_Jump_Unit extends Module {
    val io = IO(new Bundle {
        val to_static_jump = Input(Bits(32.W))
        val current_pc = Input(Bits(32.W))
        val inst = Input(Bits(32.W))
        val next_pc_sel = Input(Bool())
        
        val out_static_jump = Output(Bits(32.W))
    })
    val target_address_offset = Wire(UInt(32.W))
    val target_address = Wire(UInt(32.W))
    target_address_offset   := Cat(Fill(20, io.inst(31)), io.inst(7), io.inst(30, 25), io.inst(11, 8), 0.U(1.W))
    target_address          := ((io.current_pc + target_address_offset) & "hFFFF_FFFF".U(32.W))

    when(io.next_pc_sel){
        io.out_static_jump  := io.to_static_jump
    }.elsewhen(io.inst(6, 2) === OpcodeFunc7Funct3Definition.B){
        io.out_static_jump  := target_address
    }.otherwise{
        io.out_static_jump  := io.to_static_jump
    }
}

object Static_Jump_Unit extends App {
  (new chisel3.stage.ChiselStage).emitVerilog(new Static_Jump_Unit, Array("--target-dir", "generated/Static_Jump_Unit"))
}

Insert a branch predictor before Reg_PC and determine if the current instruction is a branch type. If it is, the predictor will calculate the target address and send it to Reg_PC. However, if the EX stage needs to flush and obtain the correct address(next_pc_sel is high), it will propagate the correct address.

Jump if the target address before/after current PC

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Code

package Pipeline
import chisel3._
import chisel3.util._

class Static_Jump_Unit extends Module {
    val io = IO(new Bundle {
        val to_static_jump = Input(Bits(32.W))
        val current_pc = Input(Bits(32.W))
        val inst = Input(Bits(32.W))
        val next_pc_sel = Input(Bool())        

        val branch_or_not   = Output(Bool())
        val out_static_jump = Output(Bits(32.W))
    })
    val target_address_offset = Wire(UInt(32.W))
    val target_address = Wire(UInt(32.W))
    val branch_target_address = Wire(UInt(32.W))

    target_address_offset   := Cat(Fill(20, io.inst(31)), io.inst(7), io.inst(30, 25), io.inst(11, 8), 0.U(1.W))
    branch_target_address   := ((io.current_pc + target_address_offset) & "hFFFF_FFFF".U(32.W))
    io.branch_or_not        := (branch_target_address > io.current_pc)
    target_address          := Mux(io.branch_or_not, branch_target_address, io.to_static_jump)

    when(io.next_pc_sel){
        io.out_static_jump  := io.to_static_jump
    }.elsewhen(io.inst(6, 2) === OpcodeFunc7Funct3Definition.B){
        io.out_static_jump  := target_address
    }.otherwise{
        io.out_static_jump  := io.to_static_jump
    }
}

object Static_Jump_Unit extends App {
  (new chisel3.stage.ChiselStage).emitVerilog(new Static_Jump_Unit, Array("--target-dir", "generated/Static_Jump_Unit"))
}

If you change > to < at line 21, it will indeed alter the behavior from "jump if the target address is after the current PC" to "jump if the target address is before the current PC.

    // from     
    // io.branch_or_not        := (branch_target_address > io.current_pc)
    // change to 
    io.branch_or_not           := (branch_target_address < io.current_pc)

Because whether a branch is needed depends on the relative location between the target address and the PC, the branch predictor needs to output a branch_or_not signal to let the controller know whether this instruction in IF stage branched or not.
The branch_or_not signal needs to be propagated through the IF, ID, and EX stages so that the controller can correctly determine whether this instruction branched or not.

Dynamic Branch Prediction

Overall Design

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

We consolidate all PC-related functions into a single BPU (Branch Prediction Unit). This unit manages the next PC, handles prediction misses, and detects Branch Target Buffer (BTB) misses.

BPU Design - 2-bit Branch Predictor

package Pipeline

import chisel3._
import chisel3.util._

class BPU extends Module {
  val io = IO(new Bundle {
    val IF_pc        = Input(UInt(32.W))
    val IF_inst      = Input(UInt(32.W))
    val EXE_pc       = Input(UInt(32.W))
    val EXE_op       = Input(UInt(5.W))
    val alu_out      = Input(Bool())
    val stall        = Input(Bool())
    val jump         = Input(Bool())
    val jb_pc        = Input(UInt(32.W))
    val predict      = Output(Bool())
    val predict_miss = Output(Bool())
    val BTB_miss     = Output(Bool())
    val next_pc      = Output(UInt(32.W))
  })

  // Branch Table Size - bit
  var SIZE          = 6

  // Registers
  val BHT           = RegInit(VecInit(Seq.fill(1<<SIZE)(3.U(2.W))))
  val BTB           = RegInit(VecInit(Seq.fill(1<<SIZE)(0.U(32.W))))
  val BTB_valid     = RegInit(VecInit(Seq.fill(1<<SIZE)(false.B)))

  val delay         = RegInit(false.B)
  val last_predict  = RegInit(false.B)
  val delay_pc      = RegInit(0.U(32.W))
  val last_pc       = RegInit(0.U(32.W))

  val IF_is_B_type  = io.IF_inst(6, 0) === "b1100011".U
  val EXE_is_B_type = io.EXE_op === "b11000".U

  val IF_B_index    = io.IF_pc(SIZE+1, 2)
  val EXE_B_index   = io.EXE_pc(SIZE+1, 2)

  // Predict control
  io.BTB_miss := EXE_is_B_type && last_predict && (BTB(EXE_B_index) =/= io.jb_pc)
  io.predict := BTB_valid(IF_B_index) && IF_is_B_type && BHT(IF_B_index) >= 2.U
  io.predict_miss := EXE_is_B_type && (last_predict ^ io.alu_out)
  
  // Next PC control
  io.next_pc := Mux(io.jump, io.jb_pc, // Jump instruction
                Mux(io.predict_miss, Mux(last_predict, last_pc, io.jb_pc), // Predict miss
                Mux(io.BTB_miss, io.jb_pc, // BTB miss
                Mux(io.predict, BTB(IF_B_index), io.IF_pc + 4.U)))) // Predict jump or not
    
  when(io.stall) {
    delay        := delay
    last_predict := last_predict
    delay_pc     := delay_pc
    last_pc      := last_pc
  }.otherwise{
    delay        := io.predict
    last_predict := delay
    delay_pc     := io.IF_pc + 4.U
    last_pc      := delay_pc
  }

  // State Control
  when(EXE_is_B_type) {
    when(last_predict) {
      when(io.alu_out){
        BHT(EXE_B_index)       := Mux(BHT(EXE_B_index) === 3.U, 3.U, BHT(EXE_B_index) + 1.U)
        BTB(EXE_B_index)       := io.jb_pc
        BTB_valid(EXE_B_index) := true.B
      }.otherwise{
        BHT(EXE_B_index)       := Mux(BHT(EXE_B_index) === 0.U, 0.U, BHT(EXE_B_index) - 1.U)
      }
    }.otherwise{
      when(io.alu_out){
        BHT(EXE_B_index)       := Mux(BHT(EXE_B_index) === 3.U, 3.U, BHT(EXE_B_index) + 1.U)
        BTB(EXE_B_index)       := io.jb_pc
        BTB_valid(EXE_B_index) := true.B
      }.otherwise{
        BHT(EXE_B_index)       := Mux(BHT(EXE_B_index) === 0.U, 0.U, BHT(EXE_B_index) - 1.U)
      }
    }
  }
}

object BPU extends App {
  (new chisel3.stage.ChiselStage).emitVerilog(new BPU, Array("--target-dir", "generated/BPU"))
}

A rough diagram of a Branch Prediction Unit(BPU) architecture.

After inputting the PC, the BHT provides information on whether the branch is predicted to be taken, and the BTB provides the target address for the branch. Since I have integrated all PC-related units into one, it is necessary to further compare whether the prediction was incorrect, whether the address fetched by the BTB is correct, and whether it is a J-type instruction.

Branch History Table

The BHT size is configurable.

  // configrable parameter
  // Branch Table Size - bit
  var BHT_SIZE      = 6

The BHT (Branch History Table) stores the prediction state for each branch instruction. We use a state machine to manage and control the branch prediction states.

1-bit predictior: use 1-bit counter to record the last outcome of the branch.
2-bits predictor: It will change if the branch mispredict twice.

In our implementation, we use a 2-bits predictor.

Branch Target Buffer

Since the jump address is calculated during the EXE stage, we require a cache-like structure to store it. Each branch instruction's jump address is saved in this structure. A tagless design is used, as it can handle incorrect jump addresses. When an error is detected, the pipeline flushes previous instructions to restore proper execution flow.
The size of Branch Target Buffer is the same as Branch History Table

Local History Branch Predictor

  // Modified compare to original BPU
  // configrable paramete
  // Branch Table Size - bit
  var LHT_SIZE      = 3
  var BHT_SIZE      = 5
  ...
  val IF_B_index    = LHT(io.IF_pc(LHT_SIZE+1, 2))
  val EXE_B_index   = LHT(io.EXE_pc(LHT_SIZE+1, 2))
  ...
  when(EXE_is_B_type) {
    when(last_predict) {
      ...
    }.otherwise{
      ...
    }
    LHT(io.EXE_pc(LHT_SIZE+1, 2)):= (LHT(io.EXE_pc(LHT_SIZE+1, 2)) << 1) | io.alu_out
  }

A rough diagram of Local History BPU architecture.

The Local History Branch Predictor uses a local history table to track the behavior of each branch instruction individually. Each branch has its own local history register that stores the outcomes (taken or not-taken) of recent executions for that specific branch. Our implementation allows configuration of different Local Histroy Table and Branch History Table sizes for flexible optimization.

Global History Branch Predictor

  // Modified compare to original BPU
  // configrable paramete
  // Branch Table Size - bit
  var BHT_SIZE      = 5
  ...
  val IF_B_index    = GH
  val EXE_B_index   = delay_GH
  ...
  when(io.stall) {
    temp         := temp
    delay_GH     := delay_GH
  }.otherwise{
    temp         := GH
    delay_GH     := temp
  }
  ...
  when(EXE_is_B_type) {
    when(last_predict) {
      ...
    }.otherwise{
      ...
    }
    GH := Cat(GH(GH_SIZE-2, 0), io.alu_out)
  }

A rough diagram of Global History BPU architecture.

The Global History Branch Predictor uses a global history register (GH) to track the outcomes (taken or not-taken) of recent branches, which is then used to index the Branch History Table (BHT) for prediction.

Global History with Selectd Address

  // Modified compare to original BPU
  // configrable parameter
  // Branch Table Size - bit
  var GH_SIZE       = 2
  var BHT_SIZE      = 4
  ...
  val delay_GH      = RegInit(0.U((GH_SIZE).W))
  val temp          = RegInit(0.U((GH_SIZE).W))
  val GH            = RegInit(0.U((GH_SIZE).W))
  ...
  val IF_B_index    = Cat(io.IF_pc((BHT_SIZE-GH_SIZE+1), 2), GH)
  val EXE_B_index   = Cat(io.EXE_pc((BHT_SIZE-GH_SIZE+1), 2), delay_GH)
  ...
  when(io.stall) {
    temp         := temp
    delay_GH     := delay_GH
  }.otherwise{
    temp         := GH
    delay_GH     := temp
  }
  ...
  when(EXE_is_B_type) {
    when(last_predict) {
      ...
    }.otherwise{
      ...
    }
    GH := Cat(GH(GH_SIZE-2, 0), io.alu_out)
  }

A rough diagram of Gselect BPU architecture.

The Global History with Selected Address (GSelect) approach combines the global history, shared across all branches, with the program counter to make predictions. The program counter and global history are used together to generate an index for the Branch History Table. Our implementation allows configuration of different global history lengths and Branch History Table sizes for flexible optimization.

  // Modified compare to original BPU
  // configrable parameter
  // Branch Table Size - bit
  var BHT_SIZE      = 5
  ...
  val delay_GH      = RegInit(0.U((GH_SIZE).W))
  val temp          = RegInit(0.U((GH_SIZE).W))
  val GH            = RegInit(0.U((GH_SIZE).W))
  ...
  val IF_B_index    = io.IF_pc(BHT_SIZE+1, 2) ^ GH
  val EXE_B_index   = io.EXE_pc(BHT_SIZE+1, 2) ^ delay_GH
  ...
  when(io.stall) {
    temp         := temp
    delay_GH     := delay_GH
  }.otherwise{
    temp         := GH
    delay_GH     := temp
  }
  ...
  when(EXE_is_B_type) {
    when(last_predict) {
      ...
    }.otherwise{
      ...
    }
    GH := Cat(GH(GH_SIZE-2, 0), io.alu_out)
  }

A rough diagram of GShare BPU architecture.

The Global History with Index Sharing (GShare) predictor enhances the GSelect approach by using a bitwise XOR operation between the global history and the program counter to form the index for the Branch history table. Our implementation allows configuration of different Branch History Table sizes for flexible optimization.

Verification

Testing method

To ensure the program executes correctly, each test program includes a setup.s file. This program is placed at the beginning of all .s files after being compiled and linked using the RISC-V toolchain. Its purpose is to manage the main program's execution and termination signals.

























## setup.s ##

.text

_start:

init_stack:
  # set stack pointer
  la sp, _stack

SystemInit:
  # jump to main
  jal main

SystemExit:
  # End simulation
  # Write -1 at _sim_end(0xfffc)
  la t0, _sim_end
  li t1, -1
  sw t1, 0(t0)

dead_loop:
  # infinite loop
  j dead_loop

Initially, setup.s jumps to the main function to execute the main program. Upon successful execution and return of main, setup.s writes a value of -1 to the data memory address 0xfffc and enters a dead loop.

The external ChiselTest Program monitors the value at address 0xfffc after each cycle. Once it detects the value -1, the simulation halts. However, if an issue arises in the test program or the Chisel hardware program that prevents this condition from being met, the ChiselTest Program calculates the total number of executed cycles. If the number exceeds a predefined maximum threshold, the simulation is forcibly terminated to prevent infinite loops.

## MainTest.scala ##

package Pipeline
import chisel3._
import org.scalatest.FreeSpec
import chiseltest._
import org.scalatest.flatspec.AnyFlatSpec
import scala.io.Source


class TopTest extends AnyFlatSpec with ChiselScalatestTester{
   "5-Stage test" should "pass" in{
    val prog_num = 2
    val prog_filename = f"./src/main/scala/Pipeline/prog${prog_num}/prog${prog_num}.hex"
    val golden_filename = f"./src/main/scala/Pipeline/prog${prog_num}/golden.hex"
    val goldenData = Source.fromFile(golden_filename).getLines().map(_.trim).toArray
    val startAddress = 0x9000
    val step = 0x4
    val endAddress = startAddress + step * (goldenData.length - 1)
    var counter = 0;
    var branch_times = 0;
    var miss_time = 0;
    var BTB_miss_time = 0;
    var endProgram = true;
    test(new Top(prog_filename))
        .withAnnotations(Seq(WriteVcdAnnotation)){
            x =>
            x.clock.setTimeout(0)
            while(endProgram){
                x.clock.step()
                x.io.mem_read_test.poke(true.B)
                x.io.mem_addr_test.poke("hfffc".U(32.W))
                if(x.io.mem_data_test.peek().litValue === BigInt("ffffffff", 16)){
                    endProgram = false
                }
                if(counter > 40000){
                    endProgram = false
                }
                if(x.io.EXE_Branch.peek().litValue === BigInt("1", 2)){
                    branch_times = branch_times + 1
                    if(x.io.Predict_miss.peek().litValue === BigInt("1", 2)){
                        miss_time = miss_time + 1
                    }
                    if(x.io.BTB_miss.peek().litValue === BigInt("1", 2)){
                        BTB_miss_time = BTB_miss_time + 1
                    }
                }
                counter = counter + 1
            }
            println(f"program end")
            println(f"Testing prog${prog_num}...")
            for ((i, expectedData) <- (startAddress to endAddress by step).zip(goldenData)){
                x.io.mem_read_test.poke(true.B)
                x.io.mem_addr_test.poke(i.U)
                val actualData = x.io.mem_data_test.peek().litValue
                val expectedValue = BigInt(expectedData, 16)
                if(actualData === expectedValue){
                    println(f"addr: [0x${i.toHexString}], actual data: 0x${actualData}%08x, pass")
                }
                else{
                    println(f"addr: [0x${i.toHexString}], actual data: 0x${actualData}%08x, expected data: 0x${expectedValue}%08x, fail")
                }
            }
            println(f"total cycle: ${counter}")
            println(f"total branch inst: ${branch_times}")
            println(f"total miss time: ${miss_time}")
            println(f"hit rate: ${(branch_times - miss_time).toDouble / branch_times * 100}%%")
            println(f"BTB miss: ${BTB_miss_time}")
        }
    }    
}

To verify the correctness of the program's behavior, we initialize the data memory with test inputs, which the main program uses during execution. After processing, the results are stored at a fixed address (0x9000) in the data memory. These results are then compared, one by one, against the expected answers stored in golden.hex. Finally, the comparison results are displayed in the terminal for further analysis.

## terminal output example ##

[info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.21)
[info] loading project definition from /mnt/c/Linux/ChiselTest/5-stage-RV32I-CPU/project
[info] loading settings for project root-5-stage-rv32i-cpu from build.sbt...
[info] set current project to root-5-stage-rv32i-cpu (in build file:/mnt/c/Linux/ChiselTest/5-stage-RV32I-CPU/)
program end
Testing prog3...
addr: [0x9000], actual data: 0x4131c6f7, pass
addr: [0x9004], actual data: 0x3f000000, pass
addr: [0x9008], actual data: 0x42a6f82c, pass
addr: [0x900c], actual data: 0x400d53b7, pass
addr: [0x9010], actual data: 0x40df7527, pass
addr: [0x9014], actual data: 0x4104e7ee, pass
addr: [0x9018], actual data: 0x3f6e26cc, pass
addr: [0x901c], actual data: 0x3d50db18, pass
total cycle: 25074
[info] TopTest:
[info] 5-Stage test
[info] - should pass
[info] Run completed in 29 seconds, 876 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 31 s, completed Jan 23, 2025, 5:20:01 AM

Test Case

To thoroughly evaluate the CPU's functionality and performance, we designed a total of four test cases, each targeting specific aspects of instruction execution and system behavior.

Prog0: Verifying CPU Instructions

In Prog0, our goal was to verify the correct execution of individual CPU instructions. We divided the program into multiple blocks, with each block dedicated to testing a specific instruction, such as add, branch, or jump. This systematic approach ensured comprehensive validation of all fundamental instructions.

## prog0/main.s ##

add:
  li t0, 0xffffffff   # -1
  li t1, 0xffffffff   # -1
  add t0, t0, t1      # t0 = -2
  add t0, t0, t1      # t0 = -3
  add t0, t0, t1      # t0 = -4
  add t0, t0, t1      # t0 = -5
  add t0, t0, t1      # t0 = -6
  li t1, 0xfffffffe   # -2
  add t0, t1, t0      # t0 = -8
  add t0, t1, t0      # t0 = -10
  add t0, t1, t0      # t0 = -12
  add t0, t1, t0      # t0 = -14
  add t0, t1, t0      # t0 = -16
  sw t0, 0(s0)        # Write Answer (DM[_answer + 0])
  addi s0, s0, 4
  
  ...
  
addi:

  ...
  
jal:

  ...
  
bne:

  ...

Prog1 and Prog2: Testing Branch Prediction with Merge Sort

For Prog1 and Prog2, we implemented two variations of Merge Sort using different styles and test data. The primary purpose of these programs was to assess the effectiveness of branch prediction mechanisms. Since Merge Sort relies heavily on divide-and-conquer strategies, it generates numerous branch instructions, making it an ideal scenario for evaluating branch prediction performance.

## prog1/main.s ##

.data
num_test: .word 3 
TEST1_SIZE: .word 34
TEST2_SIZE: .word 19
TEST3_SIZE: .word 29
test1: .word 3,41,18,8,40,6,45,1,18,10,24,46,37,23,43,12,3,37,0,15,11,49,47,27,23,30,16,10,45,39,1,23,40,38
test2: .word -3,-23,-22,-6,-21,-19,-1,0,-2,-47,-17,-46,-6,-30,-50,-13,-47,-9,-50
test3: .word -46,0,-29,-2,23,-46,46,9,-18,-23,35,-37,3,-24,-18,22,0,15,-43,-16,-17,-42,-49,-29,19,-44,0,-18,23

.text
.globl main

main:
  ...
  jal ra, MERGESORT
  ...

Prog3: Optimizing the Sqrt Function

In Prog3, we re-implemented the Sqrt function from Problem C of Quiz 3, replacing unsupported instructions like bseti and mul that are unavailable in the RV32I instruction set. For the mul instruction, we opted for a shift-and-add method, which ensures a consistent 32 iterations for every execution. This design choice allowed us to gain deeper insights into the impact of branch prediction by creating a predictable and controlled loop structure.

## prog3/mul.s ##

.text
.global mul
mul:
    ####### a0: multiplier/ result  #######
    ####### a1: multiplicand        #######
    ### Callee Saved ###
    addi sp, sp, -12
    sw s0, 0(sp)
    sw s1, 4(sp)
    sw s2, 8(sp)
    ####################
    ## move arguement ##
    mv s0, a0
    mv a0, x0
    li s1, 32
    ####################
loop:
    andi s2, a1, 1      # check LSB is 1
    beqz s2, skip_add   # if LSB is 0, skip add
    add a0, a0, s0      # add multiplier to result
skip_add:
    slli s0, s0, 1      # shift left multiplier by 1
    srli a1, a1, 1      # shift right multiplicand by 1
    addi s1, s1, -1     # iteration time -1
    bnez s1, loop       # if not iterate 32 times, loop again

    ## Callee Retrieve ##
    lw s0, 0(sp)
    lw s1, 4(sp)
    lw s2, 8(sp)
    addi sp, sp, 12
    #####################
    # return
    ret

Evaluation

Total Cycle

	prog0	prog1	prog2	prog3
BaseLine(always not jump)	696	19335	29019	25073
Static (always jump)	812	19419	27521	18057
Static (jump if Target < current pc)	696	18475	29019	20693
Static (jump if Target > current pc)	812	20279	27521	22437
2-bit branch predictor	706	18689	27445	18075
Local history branch predictor	702	19271	28947	23101
Global history branch predictor	696	19255	28489	24645
G-Select branch predictor	704	18797	27421	18075
G-Share branch predictor	708	19265	28357	19409

branch penalty

In our implementation, when a jump occurs, the pipeline flushes two cycles to correct the process and ensure accurate execution. The branch penalty is 2 cycles.

Assume 5 pipeline stages, 500 instructions 20% of instruction is branch.

100% accuracy -> 500 cycles to finish
90% accuracy -> 500 + 100 * 10% * 2 = 520 cycles
80% accuracy -> 500 + 100 * 20% * 2 = 540 cycles

Predict Hit rate

	prog0	prog1	prog2	prog3
BaseLine(always not jump)	89.19%	51.04%	31.36%	12.85%
Static (always jump)	10.81%	48.96%	68.64%	87.14%
Static (jump if Target < current pc)	89.19%	72.43%	31.36%	59.23%
Static (jump if Target > current pc)	10.81%	27.56%	68.64%	40.77%
2-bit branch predictor(5bit BHT)	82.43%	67.76%	70.89%	86.95%
Local history branch predictor(5bit BHT)	86.49%	65.77%	70.43%	86.79%
Global history branch predictor(5bit BHT)	89.18%	63.73%	72.47%	86.23%
G-Select branch predictor(5bit BHT)	82.78%	68.51%	75.66%	86.95%
G-Share branch predictor(5bit BHT)	82.43%	67.26%	67.84%	86.21%

Observations

Static Predictors:

Effective for programs with consistent and predictable branch behaviors (prog0 and prog3).
Perform poorly on programs with mixed or complex branch patterns (prog1, prog2).
Preformance depends on branch behavior.

Dynamic Predictors:

Consistently outperform static methods due to adaptive mechanisms.
Global predictors tend to excel in programs with correlated branches (prog2).

Branch Target Hit rate

	prog0	prog1	prog2	prog3
2-bit branch predictor	93.24%	96.01%	98.26%	99.77%
Local history branch predictor	94.60%	74.08%	49.08%	42.69%
Global history branch predictor	94.60%	80.00%	60.78%	21.86%
G-Select branch predictor	94.60%	89.95%	93.68%	99.85%
G-Share branch predictor	100.0%	73.63%	67.60%	82.21%

Observations

For every dynamic prediction strategy, the prediction hit rate is generally high. However, a low Branch Target Buffer (BTB) hit rate can occur, which prevents a significant reduction in the total cycle count. To address this issue, we can increase the size of the Branch History Table or add more ways (associativity) to the Branch History Table. This helps mitigate branch target misses, which typically happen when two or more branches access the same entry.