許翰翔, 簡德彥, 吳秉宥
IM
) and update program counter selected by next_pc_sel
signal.opcode
, func7
, func3
, etc., and retrieve register value from RegFile
.DM
)RegFile
.Module name | Explanation |
---|---|
ALU | Used to slove most of arithmetic operation.When the instruction is a branch type, alu_out[0] will be true when a branch is needed; otherwise, it will be false. |
Controller | Output control signal to each pipeline stage to ensure that instruction correctly flows into the pipeline. |
Decoder | Decode instruction into opcode , func7 , func3 , etc. |
Imm_Ext | Generating the immediate number depends on the type of instruction. |
JB_Unit | Calculate the target address for jump/branch type instructions. |
LD_Filter | Filter data depends on func3 . Used to load type instruction. For example lw , lh , lb , etc. |
RegFile | Register file, it contains 32 RISC-V registers. This module can output rs1 and rs2 data simultaneously. |
Reg_D | Pipeline IF/ID stage register. |
Reg_E | Pipeline ID/EX stage register. |
Reg_M | Pipeline EX/MEM stage register. |
Reg_PC | Program counter register. |
Reg_W | Pipeline MEM/WB stage register. |
SRAM | In this project, to reduce complexity, we design the SRAM to read data without any delay. |
next_pc_sel
signal from the Controller unit to flush these instructions.nop
e.g. add x0 x1 x0
) to stall the entire pipeline.Our original design always assumes no jump. Instructions are fetched in order until the EX stage, where the decision to flush or not flush is made.
package Pipeline
import chisel3._
import chisel3.util._
class Static_Jump_Unit extends Module {
val io = IO(new Bundle {
val to_static_jump = Input(Bits(32.W))
val current_pc = Input(Bits(32.W))
val inst = Input(Bits(32.W))
val next_pc_sel = Input(Bool())
val out_static_jump = Output(Bits(32.W))
})
val target_address_offset = Wire(UInt(32.W))
val target_address = Wire(UInt(32.W))
target_address_offset := Cat(Fill(20, io.inst(31)), io.inst(7), io.inst(30, 25), io.inst(11, 8), 0.U(1.W))
target_address := ((io.current_pc + target_address_offset) & "hFFFF_FFFF".U(32.W))
when(io.next_pc_sel){
io.out_static_jump := io.to_static_jump
}.elsewhen(io.inst(6, 2) === OpcodeFunc7Funct3Definition.B){
io.out_static_jump := target_address
}.otherwise{
io.out_static_jump := io.to_static_jump
}
}
object Static_Jump_Unit extends App {
(new chisel3.stage.ChiselStage).emitVerilog(new Static_Jump_Unit, Array("--target-dir", "generated/Static_Jump_Unit"))
}
Insert a branch predictor before Reg_PC and determine if the current instruction is a branch type. If it is, the predictor will calculate the target address and send it to Reg_PC. However, if the EX stage needs to flush and obtain the correct address(next_pc_sel
is high), it will propagate the correct address.
package Pipeline
import chisel3._
import chisel3.util._
class Static_Jump_Unit extends Module {
val io = IO(new Bundle {
val to_static_jump = Input(Bits(32.W))
val current_pc = Input(Bits(32.W))
val inst = Input(Bits(32.W))
val next_pc_sel = Input(Bool())
val branch_or_not = Output(Bool())
val out_static_jump = Output(Bits(32.W))
})
val target_address_offset = Wire(UInt(32.W))
val target_address = Wire(UInt(32.W))
val branch_target_address = Wire(UInt(32.W))
target_address_offset := Cat(Fill(20, io.inst(31)), io.inst(7), io.inst(30, 25), io.inst(11, 8), 0.U(1.W))
branch_target_address := ((io.current_pc + target_address_offset) & "hFFFF_FFFF".U(32.W))
io.branch_or_not := (branch_target_address > io.current_pc)
target_address := Mux(io.branch_or_not, branch_target_address, io.to_static_jump)
when(io.next_pc_sel){
io.out_static_jump := io.to_static_jump
}.elsewhen(io.inst(6, 2) === OpcodeFunc7Funct3Definition.B){
io.out_static_jump := target_address
}.otherwise{
io.out_static_jump := io.to_static_jump
}
}
object Static_Jump_Unit extends App {
(new chisel3.stage.ChiselStage).emitVerilog(new Static_Jump_Unit, Array("--target-dir", "generated/Static_Jump_Unit"))
}
If you change >
to <
at line 21, it will indeed alter the behavior from "jump if the target address is after the current PC" to "jump if the target address is before the current PC.
// from
// io.branch_or_not := (branch_target_address > io.current_pc)
// change to
io.branch_or_not := (branch_target_address < io.current_pc)
Because whether a branch is needed depends on the relative location between the target address and the PC, the branch predictor needs to output a branch_or_not
signal to let the controller know whether this instruction in IF stage branched or not.
The branch_or_not
signal needs to be propagated through the IF, ID, and EX stages so that the controller can correctly determine whether this instruction branched or not.
package Pipeline
import chisel3._
import chisel3.util._
class BPU extends Module {
val io = IO(new Bundle {
val IF_pc = Input(UInt(32.W))
val IF_inst = Input(UInt(32.W))
val EXE_pc = Input(UInt(32.W))
val EXE_op = Input(UInt(5.W))
val alu_out = Input(Bool())
val stall = Input(Bool())
val jump = Input(Bool())
val jb_pc = Input(UInt(32.W))
val predict = Output(Bool())
val predict_miss = Output(Bool())
val BTB_miss = Output(Bool())
val next_pc = Output(UInt(32.W))
})
// Branch Table Size - bit
var SIZE = 6
// Registers
val BHT = RegInit(VecInit(Seq.fill(1<<SIZE)(3.U(2.W))))
val BTB = RegInit(VecInit(Seq.fill(1<<SIZE)(0.U(32.W))))
val BTB_valid = RegInit(VecInit(Seq.fill(1<<SIZE)(false.B)))
val delay = RegInit(false.B)
val last_predict = RegInit(false.B)
val delay_pc = RegInit(0.U(32.W))
val last_pc = RegInit(0.U(32.W))
val IF_is_B_type = io.IF_inst(6, 0) === "b1100011".U
val EXE_is_B_type = io.EXE_op === "b11000".U
val IF_B_index = io.IF_pc(SIZE+1, 2)
val EXE_B_index = io.EXE_pc(SIZE+1, 2)
// Predict control
io.BTB_miss := EXE_is_B_type && last_predict && (BTB(EXE_B_index) =/= io.jb_pc)
io.predict := BTB_valid(IF_B_index) && IF_is_B_type && BHT(IF_B_index) >= 2.U
io.predict_miss := EXE_is_B_type && (last_predict ^ io.alu_out)
// Next PC control
io.next_pc := Mux(io.jump, io.jb_pc, // Jump instruction
Mux(io.predict_miss, Mux(last_predict, last_pc, io.jb_pc), // Predict miss
Mux(io.BTB_miss, io.jb_pc, // BTB miss
Mux(io.predict, BTB(IF_B_index), io.IF_pc + 4.U)))) // Predict jump or not
when(io.stall) {
delay := delay
last_predict := last_predict
delay_pc := delay_pc
last_pc := last_pc
}.otherwise{
delay := io.predict
last_predict := delay
delay_pc := io.IF_pc + 4.U
last_pc := delay_pc
}
// State Control
when(EXE_is_B_type) {
when(last_predict) {
when(io.alu_out){
BHT(EXE_B_index) := Mux(BHT(EXE_B_index) === 3.U, 3.U, BHT(EXE_B_index) + 1.U)
BTB(EXE_B_index) := io.jb_pc
BTB_valid(EXE_B_index) := true.B
}.otherwise{
BHT(EXE_B_index) := Mux(BHT(EXE_B_index) === 0.U, 0.U, BHT(EXE_B_index) - 1.U)
}
}.otherwise{
when(io.alu_out){
BHT(EXE_B_index) := Mux(BHT(EXE_B_index) === 3.U, 3.U, BHT(EXE_B_index) + 1.U)
BTB(EXE_B_index) := io.jb_pc
BTB_valid(EXE_B_index) := true.B
}.otherwise{
BHT(EXE_B_index) := Mux(BHT(EXE_B_index) === 0.U, 0.U, BHT(EXE_B_index) - 1.U)
}
}
}
}
object BPU extends App {
(new chisel3.stage.ChiselStage).emitVerilog(new BPU, Array("--target-dir", "generated/BPU"))
}
A rough diagram of a Branch Prediction Unit(BPU) architecture.
After inputting the PC, the BHT provides information on whether the branch is predicted to be taken, and the BTB provides the target address for the branch. Since I have integrated all PC-related units into one, it is necessary to further compare whether the prediction was incorrect, whether the address fetched by the BTB is correct, and whether it is a J-type instruction.
The BHT size is configurable.
// configrable parameter
// Branch Table Size - bit
var BHT_SIZE = 6
The BHT (Branch History Table) stores the prediction state for each branch instruction. We use a state machine to manage and control the branch prediction states.
In our implementation, we use a 2-bits predictor.
Since the jump address is calculated during the EXE stage, we require a cache-like structure to store it. Each branch instruction's jump address is saved in this structure. A tagless design is used, as it can handle incorrect jump addresses. When an error is detected, the pipeline flushes previous instructions to restore proper execution flow.
The size of Branch Target Buffer is the same as Branch History Table
// Modified compare to original BPU
// configrable paramete
// Branch Table Size - bit
var LHT_SIZE = 3
var BHT_SIZE = 5
...
val IF_B_index = LHT(io.IF_pc(LHT_SIZE+1, 2))
val EXE_B_index = LHT(io.EXE_pc(LHT_SIZE+1, 2))
...
when(EXE_is_B_type) {
when(last_predict) {
...
}.otherwise{
...
}
LHT(io.EXE_pc(LHT_SIZE+1, 2)):= (LHT(io.EXE_pc(LHT_SIZE+1, 2)) << 1) | io.alu_out
}
A rough diagram of Local History BPU architecture.
The Local History Branch Predictor uses a local history table to track the behavior of each branch instruction individually. Each branch has its own local history register that stores the outcomes (taken or not-taken) of recent executions for that specific branch. Our implementation allows configuration of different Local Histroy Table and Branch History Table sizes for flexible optimization.
// Modified compare to original BPU
// configrable paramete
// Branch Table Size - bit
var BHT_SIZE = 5
...
val IF_B_index = GH
val EXE_B_index = delay_GH
...
when(io.stall) {
temp := temp
delay_GH := delay_GH
}.otherwise{
temp := GH
delay_GH := temp
}
...
when(EXE_is_B_type) {
when(last_predict) {
...
}.otherwise{
...
}
GH := Cat(GH(GH_SIZE-2, 0), io.alu_out)
}
A rough diagram of Global History BPU architecture.
The Global History Branch Predictor uses a global history register (GH) to track the outcomes (taken or not-taken) of recent branches, which is then used to index the Branch History Table (BHT) for prediction.
// Modified compare to original BPU
// configrable parameter
// Branch Table Size - bit
var GH_SIZE = 2
var BHT_SIZE = 4
...
val delay_GH = RegInit(0.U((GH_SIZE).W))
val temp = RegInit(0.U((GH_SIZE).W))
val GH = RegInit(0.U((GH_SIZE).W))
...
val IF_B_index = Cat(io.IF_pc((BHT_SIZE-GH_SIZE+1), 2), GH)
val EXE_B_index = Cat(io.EXE_pc((BHT_SIZE-GH_SIZE+1), 2), delay_GH)
...
when(io.stall) {
temp := temp
delay_GH := delay_GH
}.otherwise{
temp := GH
delay_GH := temp
}
...
when(EXE_is_B_type) {
when(last_predict) {
...
}.otherwise{
...
}
GH := Cat(GH(GH_SIZE-2, 0), io.alu_out)
}
A rough diagram of Gselect BPU architecture.
The Global History with Selected Address (GSelect) approach combines the global history, shared across all branches, with the program counter to make predictions. The program counter and global history are used together to generate an index for the Branch History Table. Our implementation allows configuration of different global history lengths and Branch History Table sizes for flexible optimization.
// Modified compare to original BPU
// configrable parameter
// Branch Table Size - bit
var BHT_SIZE = 5
...
val delay_GH = RegInit(0.U((GH_SIZE).W))
val temp = RegInit(0.U((GH_SIZE).W))
val GH = RegInit(0.U((GH_SIZE).W))
...
val IF_B_index = io.IF_pc(BHT_SIZE+1, 2) ^ GH
val EXE_B_index = io.EXE_pc(BHT_SIZE+1, 2) ^ delay_GH
...
when(io.stall) {
temp := temp
delay_GH := delay_GH
}.otherwise{
temp := GH
delay_GH := temp
}
...
when(EXE_is_B_type) {
when(last_predict) {
...
}.otherwise{
...
}
GH := Cat(GH(GH_SIZE-2, 0), io.alu_out)
}
A rough diagram of GShare BPU architecture.
The Global History with Index Sharing (GShare) predictor enhances the GSelect approach by using a bitwise XOR operation between the global history and the program counter to form the index for the Branch history table. Our implementation allows configuration of different Branch History Table sizes for flexible optimization.
To ensure the program executes correctly, each test program includes a setup.s
file. This program is placed at the beginning of all .s
files after being compiled and linked using the RISC-V toolchain. Its purpose is to manage the main program's execution and termination signals.
## setup.s ##
.text
_start:
init_stack:
# set stack pointer
la sp, _stack
SystemInit:
# jump to main
jal main
SystemExit:
# End simulation
# Write -1 at _sim_end(0xfffc)
la t0, _sim_end
li t1, -1
sw t1, 0(t0)
dead_loop:
# infinite loop
j dead_loop
Initially, setup.s
jumps to the main function to execute the main program. Upon successful execution and return of main, setup.s
writes a value of -1
to the data memory address 0xfffc
and enters a dead loop.
The external ChiselTest Program monitors the value at address 0xfffc
after each cycle. Once it detects the value -1
, the simulation halts. However, if an issue arises in the test program or the Chisel hardware program that prevents this condition from being met, the ChiselTest Program calculates the total number of executed cycles. If the number exceeds a predefined maximum threshold, the simulation is forcibly terminated to prevent infinite loops.
## MainTest.scala ##
package Pipeline
import chisel3._
import org.scalatest.FreeSpec
import chiseltest._
import org.scalatest.flatspec.AnyFlatSpec
import scala.io.Source
class TopTest extends AnyFlatSpec with ChiselScalatestTester{
"5-Stage test" should "pass" in{
val prog_num = 2
val prog_filename = f"./src/main/scala/Pipeline/prog${prog_num}/prog${prog_num}.hex"
val golden_filename = f"./src/main/scala/Pipeline/prog${prog_num}/golden.hex"
val goldenData = Source.fromFile(golden_filename).getLines().map(_.trim).toArray
val startAddress = 0x9000
val step = 0x4
val endAddress = startAddress + step * (goldenData.length - 1)
var counter = 0;
var branch_times = 0;
var miss_time = 0;
var BTB_miss_time = 0;
var endProgram = true;
test(new Top(prog_filename))
.withAnnotations(Seq(WriteVcdAnnotation)){
x =>
x.clock.setTimeout(0)
while(endProgram){
x.clock.step()
x.io.mem_read_test.poke(true.B)
x.io.mem_addr_test.poke("hfffc".U(32.W))
if(x.io.mem_data_test.peek().litValue === BigInt("ffffffff", 16)){
endProgram = false
}
if(counter > 40000){
endProgram = false
}
if(x.io.EXE_Branch.peek().litValue === BigInt("1", 2)){
branch_times = branch_times + 1
if(x.io.Predict_miss.peek().litValue === BigInt("1", 2)){
miss_time = miss_time + 1
}
if(x.io.BTB_miss.peek().litValue === BigInt("1", 2)){
BTB_miss_time = BTB_miss_time + 1
}
}
counter = counter + 1
}
println(f"program end")
println(f"Testing prog${prog_num}...")
for ((i, expectedData) <- (startAddress to endAddress by step).zip(goldenData)){
x.io.mem_read_test.poke(true.B)
x.io.mem_addr_test.poke(i.U)
val actualData = x.io.mem_data_test.peek().litValue
val expectedValue = BigInt(expectedData, 16)
if(actualData === expectedValue){
println(f"addr: [0x${i.toHexString}], actual data: 0x${actualData}%08x, pass")
}
else{
println(f"addr: [0x${i.toHexString}], actual data: 0x${actualData}%08x, expected data: 0x${expectedValue}%08x, fail")
}
}
println(f"total cycle: ${counter}")
println(f"total branch inst: ${branch_times}")
println(f"total miss time: ${miss_time}")
println(f"hit rate: ${(branch_times - miss_time).toDouble / branch_times * 100}%%")
println(f"BTB miss: ${BTB_miss_time}")
}
}
}
To verify the correctness of the program's behavior, we initialize the data memory with test inputs, which the main program uses during execution. After processing, the results are stored at a fixed address (0x9000
) in the data memory. These results are then compared, one by one, against the expected answers stored in golden.hex
. Finally, the comparison results are displayed in the terminal for further analysis.
## terminal output example ##
[info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.21)
[info] loading project definition from /mnt/c/Linux/ChiselTest/5-stage-RV32I-CPU/project
[info] loading settings for project root-5-stage-rv32i-cpu from build.sbt...
[info] set current project to root-5-stage-rv32i-cpu (in build file:/mnt/c/Linux/ChiselTest/5-stage-RV32I-CPU/)
program end
Testing prog3...
addr: [0x9000], actual data: 0x4131c6f7, pass
addr: [0x9004], actual data: 0x3f000000, pass
addr: [0x9008], actual data: 0x42a6f82c, pass
addr: [0x900c], actual data: 0x400d53b7, pass
addr: [0x9010], actual data: 0x40df7527, pass
addr: [0x9014], actual data: 0x4104e7ee, pass
addr: [0x9018], actual data: 0x3f6e26cc, pass
addr: [0x901c], actual data: 0x3d50db18, pass
total cycle: 25074
[info] TopTest:
[info] 5-Stage test
[info] - should pass
[info] Run completed in 29 seconds, 876 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 31 s, completed Jan 23, 2025, 5:20:01 AM
To thoroughly evaluate the CPU's functionality and performance, we designed a total of four test cases, each targeting specific aspects of instruction execution and system behavior.
In Prog0, our goal was to verify the correct execution of individual CPU instructions. We divided the program into multiple blocks, with each block dedicated to testing a specific instruction, such as add
, branch
, or jump
. This systematic approach ensured comprehensive validation of all fundamental instructions.
## prog0/main.s ##
add:
li t0, 0xffffffff # -1
li t1, 0xffffffff # -1
add t0, t0, t1 # t0 = -2
add t0, t0, t1 # t0 = -3
add t0, t0, t1 # t0 = -4
add t0, t0, t1 # t0 = -5
add t0, t0, t1 # t0 = -6
li t1, 0xfffffffe # -2
add t0, t1, t0 # t0 = -8
add t0, t1, t0 # t0 = -10
add t0, t1, t0 # t0 = -12
add t0, t1, t0 # t0 = -14
add t0, t1, t0 # t0 = -16
sw t0, 0(s0) # Write Answer (DM[_answer + 0])
addi s0, s0, 4
...
addi:
...
jal:
...
bne:
...
For Prog1 and Prog2, we implemented two variations of Merge Sort using different styles and test data. The primary purpose of these programs was to assess the effectiveness of branch prediction mechanisms. Since Merge Sort relies heavily on divide-and-conquer strategies, it generates numerous branch instructions, making it an ideal scenario for evaluating branch prediction performance.
## prog1/main.s ##
.data
num_test: .word 3
TEST1_SIZE: .word 34
TEST2_SIZE: .word 19
TEST3_SIZE: .word 29
test1: .word 3,41,18,8,40,6,45,1,18,10,24,46,37,23,43,12,3,37,0,15,11,49,47,27,23,30,16,10,45,39,1,23,40,38
test2: .word -3,-23,-22,-6,-21,-19,-1,0,-2,-47,-17,-46,-6,-30,-50,-13,-47,-9,-50
test3: .word -46,0,-29,-2,23,-46,46,9,-18,-23,35,-37,3,-24,-18,22,0,15,-43,-16,-17,-42,-49,-29,19,-44,0,-18,23
.text
.globl main
main:
...
jal ra, MERGESORT
...
In Prog3, we re-implemented the Sqrt function from Problem C of Quiz 3, replacing unsupported instructions like bseti
and mul
that are unavailable in the RV32I instruction set. For the mul
instruction, we opted for a shift-and-add method, which ensures a consistent 32 iterations for every execution. This design choice allowed us to gain deeper insights into the impact of branch prediction by creating a predictable and controlled loop structure.
## prog3/mul.s ##
.text
.global mul
mul:
####### a0: multiplier/ result #######
####### a1: multiplicand #######
### Callee Saved ###
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
####################
## move arguement ##
mv s0, a0
mv a0, x0
li s1, 32
####################
loop:
andi s2, a1, 1 # check LSB is 1
beqz s2, skip_add # if LSB is 0, skip add
add a0, a0, s0 # add multiplier to result
skip_add:
slli s0, s0, 1 # shift left multiplier by 1
srli a1, a1, 1 # shift right multiplicand by 1
addi s1, s1, -1 # iteration time -1
bnez s1, loop # if not iterate 32 times, loop again
## Callee Retrieve ##
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
#####################
# return
ret
prog0 | prog1 | prog2 | prog3 | |
---|---|---|---|---|
BaseLine(always not jump) | 696 | 19335 | 29019 | 25073 |
Static (always jump) | 812 | 19419 | 27521 | 18057 |
Static (jump if Target < current pc) | 696 | 18475 | 29019 | 20693 |
Static (jump if Target > current pc) | 812 | 20279 | 27521 | 22437 |
2-bit branch predictor | 706 | 18689 | 27445 | 18075 |
Local history branch predictor | 702 | 19271 | 28947 | 23101 |
Global history branch predictor | 696 | 19255 | 28489 | 24645 |
G-Select branch predictor | 704 | 18797 | 27421 | 18075 |
G-Share branch predictor | 708 | 19265 | 28357 | 19409 |
In our implementation, when a jump occurs, the pipeline flushes two cycles to correct the process and ensure accurate execution. The branch penalty is 2 cycles.
Assume 5 pipeline stages, 500 instructions 20% of instruction is branch.
prog0 | prog1 | prog2 | prog3 | |
---|---|---|---|---|
BaseLine(always not jump) | 89.19% | 51.04% | 31.36% | 12.85% |
Static (always jump) | 10.81% | 48.96% | 68.64% | 87.14% |
Static (jump if Target < current pc) | 89.19% | 72.43% | 31.36% | 59.23% |
Static (jump if Target > current pc) | 10.81% | 27.56% | 68.64% | 40.77% |
2-bit branch predictor(5bit BHT) | 82.43% | 67.76% | 70.89% | 86.95% |
Local history branch predictor(5bit BHT) | 86.49% | 65.77% | 70.43% | 86.79% |
Global history branch predictor(5bit BHT) | 89.18% | 63.73% | 72.47% | 86.23% |
G-Select branch predictor(5bit BHT) | 82.78% | 68.51% | 75.66% | 86.95% |
G-Share branch predictor(5bit BHT) | 82.43% | 67.26% | 67.84% | 86.21% |
prog0 | prog1 | prog2 | prog3 | |
---|---|---|---|---|
2-bit branch predictor | 93.24% | 96.01% | 98.26% | 99.77% |
Local history branch predictor | 94.60% | 74.08% | 49.08% | 42.69% |
Global history branch predictor | 94.60% | 80.00% | 60.78% | 21.86% |
G-Select branch predictor | 94.60% | 89.95% | 93.68% | 99.85% |
G-Share branch predictor | 100.0% | 73.63% | 67.60% | 82.21% |
For every dynamic prediction strategy, the prediction hit rate is generally high. However, a low Branch Target Buffer (BTB) hit rate can occur, which prevents a significant reduction in the total cycle count. To address this issue, we can increase the size of the Branch History Table or add more ways (associativity) to the Branch History Table. This helps mitigate branch target misses, which typically happen when two or more branches access the same entry.