2021 computer architecture
Cloning: To clone this project and its dependencies
git clone --recursive https://github.com/ultraembedded/biriscv.git
Running HelloWorld (For testing)
Install Icarus Verilog (Debian / Ubuntu / Linux Mint)
sudo apt-get install iverilog
Run a simple test image (test.elf)
cd tb/tb_core_icarus
make
makefile:9: *** riscv32-unknown-elf-objcopy missing from PATH. Stop.
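If make stops with the error above, modify line 7 of the Makefile (in ./tb/tb_core_icarus) so that it uses the objcopy from the installed toolchain. In our case the toolchain prefix is riscv-none-embed-.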
OBJCOPY ?= riscv32-unknown-elf-objcopy → OBJCOPY ?= riscv-none-embed-objcopy
make again!
biRISC-V has seven pipeline stages: PC, Fetch, Pre-decode, Issue, ALU, Mem and Writeback.
biRISC-V Feature
How to analyze
First, we use test.elf in the folder /src/tb/tb_core_icarus to analyze the biRISC-V core. Line 29 of the Makefile tells us which Verilog files are used.
SRC_V ?= $(foreach src,$(SRC_V_DIR),$(wildcard $(src)/*.v))
From these files, we can find out how the biRISC-V CPU works. In the next steps we analyze the datapath, control, and pipeline of biRISC-V.
To analyze biRISC-V, we use test.elf (in ./tb/tb_core_icarus) as an example. The disassembly of this file is produced as follows.
riscv-none-embed-objdump -d test.elf > test.s
To understand how the PC works, we trace the code in biriscv_npc.v, and there are some interesting discoveries.
Control
The following lists show the control parameters and the inputs/outputs of the PC stage.
The parameters reveal some design choices, such as the branch predictor (BTB and BHT) used in biriscv.
The user can also enable GSHARE (a global branch predictor).
The I/O contains system signals (clk_i, rst_i …) and branch signals (branch_request_i, branch_is_taken_i …).
When a branch occurs, the I/O also tells us what kind of instruction it is, such as ret (branch_is_ret_i) or j (branch_is_jmp_i).
From the I/O, we can draw a simple diagram.
Datapath
In the datapath, the following source code is divided into two parts, BRANCH_PREDICTION and NO_BRANCH_PREDICTION.
The following source code shows how the outputs next_pc_f_o and next_taken_f_o are computed when SUPPORT_BRANCH_PREDICTION = 1.
The following source code shows the case SUPPORT_BRANCH_PREDICTION = 0.
Control
In biriscv_fetch.v, there is only one parameter to set up: SUPPORT_MMU.
The following source code from biriscv_fetch.v is the I/O of the fetch stage. It shows that biRISC-V gets instructions from the icache, and that there are some cache control bits, such as valid, accept and error…
Datapath
In the datapath, we find a special design called a skid buffer.
The following source code is the implementation of the skid buffer, and it shows how biRISC-V receives instructions.
fetch_fifo is a queue that distributes the instructions to decoder0 and decoder1.
We also need to look at the definitions in biriscv_defs.v, which contains the instruction definitions shown below.
The module structure in decode stage
We need to understand three modules, which are in biriscv_decode.v and biriscv_decoder.v.
biriscv_decoder
This module decodes the instruction by opcode, determining which type of instruction it is and which operation the processor needs to perform.
fetch_fifo
This module feeds decoder 0 and decoder 1, then concatenates the last 3 bits to the previous instruction.
Because of the code above, the PC of decoder 0 and the PC of decoder 1 differ by 4.
biriscv_decode
This module combines biriscv_decoder and fetch_fifo.
There are three units in this stage: Controller, Divider and Register File.
Control
The following source code is the I/O of the divider in biRISC-V. It contains the signals that have been decoded by the decoder, such as rd, rs1, rs2 and opcode …
Datapath
The following source code is the definition of the div and rem instructions and their masks.
The code below shows how biRISC-V identifies the div instruction in the divider.
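As a rough illustration of this identification step, the sketch below models the mask-and-match comparison in Python. The encodings are taken from the RISC-V M extension (funct7 = 0000001, opcode = 0110011, funct3 selecting div/divu/rem/remu). The constant and function names are our own for this example and are not copied from biriscv_defs.v.

```python
# Hypothetical names; values follow the RV32M encoding in the RISC-V spec.
# The mask keeps funct7 (bits 31:25), funct3 (bits 14:12) and opcode (bits 6:0).
INST_MASK = 0xFE00707F
INST_DIV = 0x02004033   # funct3 = 100
INST_DIVU = 0x02005033  # funct3 = 101
INST_REM = 0x02006033   # funct3 = 110
INST_REMU = 0x02007033  # funct3 = 111


def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode=0x33):
    """Assemble a 32-bit R-type instruction word (helper for this example)."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def classify_div(word):
    """Return the divider operation name, or None if the word is not a divide."""
    table = {INST_DIV: "div", INST_DIVU: "divu", INST_REM: "rem", INST_REMU: "remu"}
    return table.get(word & INST_MASK)


A4, A5 = 14, 15  # ABI names a4/a5 are registers x14/x15
div_word = encode_r_type(0b0000001, A5, A4, 0b100, A4)  # div a4,a4,a5
mul_word = encode_r_type(0b0000001, A5, A4, 0b000, A4)  # mul a4,a4,a5

print(hex(div_word), classify_div(div_word))
print(hex(mul_word), classify_div(mul_word))
```

Masking out rd, rs1 and rs2 is what lets one comparison match every register combination of the same instruction.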
The following code is the divider output. writeback_value_o is the result of the divider, and the result is written back to the E1 stage.
Here we analyze the controller of biRISC-V. The controller is the most important part of the pipeline. It must detect hazards, issue and schedule instructions, and track instruction status and branch prediction.
The parameters here enable dual issue, the multiplier and divider, load bypass and multiply bypass.
The cycles for each operation are:
ALU: 1
MUL: 2~3
Load Store: 2~3
DIV: 2~34
CSR: 2~3
Issue and scheduling:
MUL, DIV, CSR after load, can only issue ALU operation
Pipeline 0: Branch, LSU, ALU, MUL, DIV and CSR
Pipeline 1: Branch, LSU, ALU and MUL
Branch Prediction: This information is used to train future predictions, and to correct the BTB, BHT, GShare and RAS indexes on mispredictions. It links back to the PC stage.
Hazard Detection
biRISC-V supports full bypassing, and the code is shown below:
There are several operation units in this stage: ALU, LSU, CSU and MUL.
ALU: Arithmetic operation
LSU: Load/store operation
CSU: Deals with interrupts and exceptions
MUL: Pipelined multiplier
Control
In biriscv_defs.v we can see the definitions for the ALU operations:
From the I/O, we can draw a simple diagram.
The ALU selects among several operations: normal arithmetic operations such as add and sub, logical operations such as and and or, comparisons such as less than and less than signed, and the shift operations shift left, shift right, and shift right with sign bit. In the source code, a variable decides the number of bits to shift by.
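To make the operation list concrete, here is a small behavioural model of a 32-bit ALU in Python. It is only a sketch of the semantics described above, not a translation of the biRISC-V ALU source. The operation names are our own, and we assume the shift amount is the low 5 bits of the second operand, as RV32 specifies.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit value as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op, a, b):
    """Behavioural model of the ALU operations listed above (names are ours)."""
    a &= MASK32
    b &= MASK32
    shamt = b & 0x1F  # RV32 shifts use only the low 5 bits of the operand
    if op == "add":
        return (a + b) & MASK32
    if op == "sub":
        return (a - b) & MASK32
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "less_than":
        return int(a < b)  # unsigned comparison
    if op == "less_than_signed":
        return int(to_signed(a) < to_signed(b))
    if op == "shift_left":
        return (a << shamt) & MASK32
    if op == "shift_right":
        return a >> shamt  # zeros shifted in
    if op == "shift_right_arith":
        return (to_signed(a) >> shamt) & MASK32  # sign bit shifted in
    raise ValueError(f"unknown ALU operation: {op}")


print(hex(alu("shift_right", 0x80000000, 4)))
print(hex(alu("shift_right_arith", 0x80000000, 4)))
```

The two shift-right variants differ only in what fills the vacated upper bits, which is why the signed and unsigned comparisons and shifts are separate operations.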
Datapath
The following source code shows the result signal of the ALU. result_r is a register that stores the result of the instruction, and alu_p_o is the result signal of the ALU.
The Load Store Unit is used to load and store…
CSR and MUL operations run from the ALU stage (E1) to the writeback stage (WB).
The CSRs (Control and Status Registers) are used to deal with exceptions, interrupts and storage protection. There are two modes in the CSR logic, machine mode and supervisor mode. In the standard user-level base ISA, only a handful of read-only counter CSRs are accessible.
When an interrupt is taken in machine mode, the CSR logic saves the interrupt state, disables interrupts, enters supervisor mode and raises the privilege to machine level.
When an interrupt is taken in supervisor mode, the CSR logic saves the interrupt state, disables interrupts, enters supervisor mode and raises the privilege to supervisor level.
Here we introduce the multiplier implemented in this project, which is in biriscv_multiplier.v. It is written with the Verilog * operator, and the EDA tool then generates a pipelined multiplier from it.
In biriscv, there are supervisor mode and user mode, so before accessing the TLB, we need to know whether it is a user page or a supervisor page. Supervisor cannot access user pages.
There is no fixed Write Back stage, because the controller issues each instruction to the hardware unit matching its operation, and the units take different times to finish. For example, ALU takes only one stage, e1, and then the operation is done. However, MUL takes three stages, from e1 to W. When ALU is finished, it writes its result back to the register file, and so does MUL. The schematic figure is shown below:
The figure above shows an x86-64 architecture, not RISC-V, but the concept is similar. It shows that int and float operations take different times, and that the Write Back stage follows whenever the operation finishes. Thus the Write Back stage is not fixed: it may appear in stage 6 or stage 7.
In GTKWave, we mainly use the following signals for the analysis.
pipe0
pipe1
I-Format
We observe the following code from test.elf, especially lw a5,-448(s0).
The following picture is the result, and lw a5,-448(s0) is executed in pipe0.
R-Format
We observe the following code from test.elf, especially add a4, a4, a3.
The following picture is the result, and add a4, a4, a3 is executed in pipe0.
The red boxes represent the flow of add a4, a4, a3, and we can follow the instruction from the PC stage to the WB stage.
The yellow boxes represent the stall that waits for the next instruction to enter the issue stage.
We find a data hazard (blue box) between slli a4, a3, 0x1 and add a4, a4, a3. The following source code in biriscv_issue.v shows how biRISC-V deals with the data hazard.
Now we know that pipe0_rd_e1_w is the a4 in slli a4, a3, 0x1 and issue_a_ra_idx_w is the a4 in add a4, a4, a3. Because the two indexes match, the result of the E1 stage is bypassed to the Issue stage.
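The bypass decision can be sketched in a few lines of Python. This is a simplified model under our own assumptions: it forwards from a single E1 result only, whereas the real issue logic checks more pipeline stages and both pipes. The function and variable names are illustrative; only the idea of comparing the source index (issue_a_ra_idx_w) with the E1 destination (pipe0_rd_e1_w) comes from the text above.

```python
def read_operand(src_idx, regfile, e1_rd, e1_result):
    """Return the operand value, forwarding from E1 when the indexes match.

    src_idx   : source register index of the issuing instruction
    e1_rd     : destination register index of the instruction in E1
    e1_result : value that instruction has just produced
    """
    if src_idx == 0:
        return 0                 # x0 always reads as zero
    if src_idx == e1_rd:
        return e1_result         # bypass: regfile still holds the stale value
    return regfile[src_idx]


A3, A4 = 13, 14                  # ABI names a3/a4 are registers x13/x14
regfile = [0] * 32
regfile[A3] = 5
regfile[A4] = 100                # stale a4, not yet overwritten by slli

# slli a4, a3, 0x1 is in E1 and has produced 5 << 1
e1_rd, e1_result = A4, regfile[A3] << 1

# add a4, a4, a3 is in Issue and reads both operands
ra = read_operand(A4, regfile, e1_rd, e1_result)
rb = read_operand(A3, regfile, e1_rd, e1_result)
print(ra, rb, ra + rb)
```

Without the bypass, add would read the stale value 100 for a4 instead of the freshly shifted value.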
SB-Format
We observe the following code from test.elf, especially beq a4,a5,80000688.
The following picture is the result, and beq a4,a5,80000688 is executed in pipe0.
pipe0_branch_e1_w: whether the instruction in the E1 stage is a branch instruction.
branch_exec0_is_taken_i: the branch instruction is taken.
In the E1 stage, biRISC-V detects and determines whether the branch is taken or not.
There is also a data hazard between addi a5, a5, 564 and beq a4, a5, 80000688. From the last example, we know that the result of addi is bypassed to beq.
Multiplication Operations
We observe the following code from test.elf, especially mul a4,a4,a5.
mul rd, rs1, rs2 → MUL performs an XLEN-bit (rs1) × XLEN-bit (rs2) multiplication and places the lower XLEN bits in the destination register (rd). In RV32IM, XLEN is 32.
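A short Python sketch of this definition follows. It models only the architectural result of MUL (the lower XLEN bits of the product), not the pipelined hardware, and the function name is ours.

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def mul(rs1, rs2):
    """Return the lower XLEN bits of rs1 * rs2, as MUL writes to rd."""
    return (rs1 * rs2) & MASK


print(hex(mul(0xFFFFFFFF, 2)))    # upper bits of the 64-bit product are dropped
print(hex(mul(0x10000, 0x10000))) # product is exactly 2**32, so the low word is 0
```

Because only the low word is kept, the same instruction gives the correct result for both signed and unsigned operands. The upper bits are what MULH, MULHSU and MULHU return.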
The following picture is the result, and mul a4,a4,a5 is executed in pipe0.
There is a data hazard between addi a5,a5,-1997 and mul a4,a4,a5, so we check issue_a_rb_idx_w and pipe0_rd_e1_w.
The result is bypassed from the E1 stage to the Issue stage.
MULT_STAGES controls whether the value is written back at the E2 stage or the WB stage.
div a4,a4,a5
The following picture is the result, and div a4,a4,a5 is executed in pipe0.
Example 1: csrw (csrw csr, rs1 is encoded as csrrw x0, csr, rs1)
The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values in the CSRs and integer registers. CSRRW reads the old value of the CSR, zero-extends the value to XLEN bits, then writes it to integer register rd. The initial value in rs1 is written to the CSR. If rd=x0, then the instruction shall not read the CSR and shall not cause any of the side effects that might occur on a CSR read.
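The swap described above can be modelled in Python as follows. This is a sketch of the architectural behaviour only, not of the biRISC-V CSR unit. The dictionary-based CSR file and the function name are assumptions for illustration. 0x305 is the CSR address of mtvec in the RISC-V privileged specification.

```python
MTVEC = 0x305  # CSR address of mtvec (RISC-V privileged spec)


def csrrw(csrs, regs, rd, csr, rs1):
    """Model CSRRW: rd <- old CSR value, CSR <- rs1 (no CSR read when rd is x0)."""
    new_value = regs[rs1]        # capture rs1 first, since rd may equal rs1
    if rd != 0:
        regs[rd] = csrs[csr]     # the CSR is read only when rd is not x0
    csrs[csr] = new_value


T0, A0 = 5, 10                   # ABI names t0/a0 are registers x5/x10
regs = [0] * 32
regs[T0] = 0x80000010
csrs = {MTVEC: 0}

# csrw mtvec, t0  ==  csrrw x0, mtvec, t0
csrrw(csrs, regs, 0, MTVEC, T0)
print(hex(csrs[MTVEC]), regs[0])

# csrrw a0, mtvec, t0: a0 receives the previous mtvec value
regs[T0] = 0x80000100
csrrw(csrs, regs, A0, MTVEC, T0)
print(hex(regs[A0]), hex(csrs[MTVEC]))
```

The first call is the csrw pseudo-instruction: with rd = x0 it only writes the CSR, which is why x0 stays zero.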
We observe the following code from test.elf, especially csrw mtvec,t0.
The following picture is the result, and csrw mtvec,t0 is executed in pipe0.
Executing the csr instruction takes 3 cycles.
From the source code, we know that a csr instruction takes 2 ~ 3 cycles and stalls the pipeline.
In the picture, we can see that the value of the t0 register is copied to the mtvec register.
In the Verilog code, the controller can issue all operations to pipeline 0 (issue a). However, it can only issue lsu, branch, alu and mul to pipeline 1 (issue b). In this case, the controller issues all but mul. The source code for dual issue is shown below:
Parameters in issue.v
fetch0_instr_exec_i=1: when Pipeline 0 fetches ANDI, ADDI, SLTI, SLTIU, ORI, SLLI, ADD, AUIPC …
fetch0_instr_lsu_i=1: when Pipeline 0 fetches LB, LH, LW, LBU, LHU, LWU, SB, SH or SW
fetch0_instr_branch_i=1: when Pipeline 0 fetches JAL, JALR, BEQ, BNE, BLT, BGE, BLTU or BGEU
fetch0_instr_mul_i=1: when Pipeline 0 fetches MUL, MULH, MULHSU or MULHU
fetch0_instr_div_i=1: when Pipeline 0 fetches DIV, DIVU, REM or REMU
fetch0_instr_csr_i=1: when Pipeline 0 fetches ECALL, EBREAK, ERET, CSRRW, CSRRS, CSRRC, CSRRWI …
pipe1_ok_w: the instruction is lsu, branch, alu or mul
take_interrupt_i: boolean value indicating that an interrupt is taken
enable_dual_issue_w: this value is determined by SUPPORT_DUAL_ISSUE
The following source code shows the conditions under which dual issue can occur.
dual_issue_ok_w is determined by enable_dual_issue_w, pipe1_ok_w, take_interrupt_i, and others.
From this code, we can summarize 3 conditions under which dual issue can occur.
exec, lsu or mul in decoder0 and exec or branch in decoder1
exec or mul in decoder0 and lsu in decoder1
exec or lsu in decoder0 and mul in decoder1
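The three conditions can be written as a small predicate. The Python sketch below is our own restatement of the summary above, not the Verilog expression itself. We assume that enable_dual_issue_w, pipe1_ok_w and the inverse of take_interrupt_i simply gate the result, and the "and others" terms mentioned above are not modelled. The instruction classes are passed as plain strings.

```python
def dual_issue_ok(a, b, enable_dual_issue=True, pipe1_ok=True, take_interrupt=False):
    """Return True when the pair (decoder0 class a, decoder1 class b) may dual issue.

    a, b are one of: "exec", "lsu", "mul", "branch", "div", "csr".
    """
    cond1 = a in {"exec", "lsu", "mul"} and b in {"exec", "branch"}
    cond2 = a in {"exec", "mul"} and b == "lsu"
    cond3 = a in {"exec", "lsu"} and b == "mul"
    return (enable_dual_issue and pipe1_ok and not take_interrupt
            and (cond1 or cond2 or cond3))


for pair in [("exec", "exec"), ("lsu", "exec"), ("exec", "branch"),
             ("lsu", "lsu"), ("mul", "mul"), ("div", "exec")]:
    print(pair, dual_issue_ok(*pair))
```

Read this way, the conditions exclude pairs that would need the same single resource twice (two lsu or two mul operations) and anything with div or csr in decoder0.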
Next, we discuss the inputs of dual_issue_w in the code above.
When opcode_b_issue_r, opcode_b_accept_r and ~take_interrupt_i are all equal to 1, dual_issue_w is 1, and dual issue starts in the next cycle.
Here is an example of dual issue:
issue_a_exec_w and issue_b_exec_w are equal to 1. Therefore, dual_issue_ok_w is equal to 1, too.
opcode_b_issue_r and opcode_b_accept are not equal to 1, so dual_issue_w is equal to 0.
80000294 lui a0, 0x80000 and 8000029c jal ra, 80000624<main>
issue_a_lsu_w and issue_b_exec_w are equal to 1. Therefore, dual_issue_ok_w is equal to 1, too.
issue_a_exec_w and issue_b_branch_w are equal to 1. Therefore, dual_issue_ok_w is equal to 1, too.
opcode_b_issue_r and opcode_b_accept are equal to 1 in both cases, so dual_issue_w is equal to 1.
Branch predictor
BHT: Branch History Table
The Branch History Table records branch outcomes and is used to predict whether a branch instruction will be taken. The simplest scheme uses one bit per entry to record the prediction. However, a one-bit predictor can mispredict twice for each loop exit (once when leaving the loop and once when re-entering it). Thus, branch prediction usually uses at least two bits.
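The difference between one and two bits can be checked with a small simulation. The Python sketch below is a generic textbook model and not the biRISC-V predictor. It runs a single branch through a 1-bit predictor and a 2-bit saturating counter on a loop pattern (taken three times, then not taken, repeated three times). The initial states are our own choice.

```python
def mispredictions_one_bit(outcomes, state=True):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken
    return misses


def mispredictions_two_bit(outcomes, counter=3):
    """2-bit saturating counter (0..3): predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


loop_pattern = [True, True, True, False] * 3
print(mispredictions_one_bit(loop_pattern), mispredictions_two_bit(loop_pattern))
# prints: 5 3
```

The 1-bit predictor misses at every loop exit and again on the next loop entry. The 2-bit counter misses only at the exit, because one not-taken outcome is not enough to flip its prediction.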
Reference: https://users.ece.cmu.edu/~jhoe/course/ece447/S21handouts/L10.pdf
Returning to biriscv, I think it uses only one bit for branch prediction.
BTB: Branch Target Buffer
Reference: https://users.ece.cmu.edu/~jhoe/course/ece447/S21handouts/L10.pdf
The BTB architecture is shown above. After a branch instruction is executed, the address of the instruction and the jump address are recorded in the BTB.
How to use BTB?
We compare the Program Counter with the branch instruction addresses stored in the BTB. If the PC is found in the BTB, we use the target address predicted by the BTB. If not, we just go to PC+4.
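The lookup described above can be sketched in Python. A dictionary stands in for the BTB here, which is a deliberate simplification: a real BTB has a fixed number of entries and a replacement policy. The function names and the branch address 0x80000660 are hypothetical. Only the target 0x80000688 comes from the beq example in this note.

```python
def predict_next_pc(pc, btb):
    """Return the BTB target on a hit, otherwise fall through to PC+4."""
    return btb.get(pc, pc + 4)


def update_btb(btb, branch_pc, target_pc):
    """Record the branch address and its jump address after execution."""
    btb[branch_pc] = target_pc


btb = {}
branch_pc = 0x80000660           # hypothetical address of a taken branch
print(hex(predict_next_pc(branch_pc, btb)))   # miss: PC+4

update_btb(btb, branch_pc, 0x80000688)
print(hex(predict_next_pc(branch_pc, btb)))   # hit: predicted target
```

The first fetch of a branch always misses, so the BTB only helps from the second encounter onwards.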
RAS: Return Address Stack
A stack that records return addresses.
GSHARE
GShare is a dynamic branch predictor that includes the BTB and the BHT. When an instruction arrives, GShare queries the BTB; on a hit, it then queries the BHT to predict whether the branch will be taken.
Static Branch Prediction: predict without history information
Dynamic Branch Prediction: predict with history information
Take N bits from the branch instruction address and M bits from the Branch History Shift Register (BHSR) to look up the table shown above, then predict the next PC.
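One common way to combine the two bit fields is the standard gshare scheme, which XORs them to form the table index. The Python sketch below shows that scheme under our own assumptions (N = M, word-aligned PCs so the low two bits are dropped). It is not taken from biriscv_npc.v, and the function names are illustrative.

```python
def gshare_index(pc, history, index_bits):
    """XOR the low PC bits (above the 2-bit word offset) with the global history."""
    mask = (1 << index_bits) - 1
    pc_bits = (pc >> 2) & mask
    return pc_bits ^ (history & mask)


def update_history(history, taken, history_bits):
    """Shift the latest branch outcome into the history shift register."""
    mask = (1 << history_bits) - 1
    return ((history << 1) | int(taken)) & mask


history = 0b1011
index = gshare_index(0x80000688, history, index_bits=4)
print(index)                                   # prints: 9
print(bin(update_history(history, True, 4)))   # prints: 0b111
```

Because the history is part of the index, the same branch can use different table entries depending on the path that led to it.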
x86-64 Decode Stage
Here we introduce the decode stage in biriscv, which has two decoders. This figure is not the architecture of biriscv; it shows an x86-64 architecture. After reading the Verilog code, I find that they are quite similar, and the figure helped me understand the decode stage in a multithreading architecture.