林趺菩
NucleusRV is a 32-bit 5-stage pipelined RISC-V core implemented in Chisel.
Cloning the `nucleusrv` repository.
Because I ran into difficulties building `riscv-gnu-toolchain`, I looked up web resources and decided to follow one of their guides, so I didn't have to build `riscv-gnu-toolchain` from scratch.
`riscv-gnu-toolchain` related files
I decided to use Docker to build the required environment efficiently.
First, run the Docker container.
The terminal will look like this:
Following the steps given in the `nucleusrv` repository:
The terminal output:
The corresponding `program.dump`, `program.elf`, and `program.hex` files will be generated under `nucleusrv/tools/out/`.
The content of `program.dump`:
Following the steps given in the `nucleusrv` repository:
Moving to the `nucleusrv` directory:
Opening SBT server:
The terminal output:
Running SBT test:
`DwriteVcd=1`: This flag enables VCD (Value Change Dump) file generation, which is useful for waveform viewing and debugging.
`DprogramFile=/app/nucleusrv/tools/out/program.hex`: This specifies the path to the program file (in hexadecimal format) that will be used for testing.
The terminal output:
To exit the sbt server, press `CTRL+D`.
Following the steps given in the `nucleusrv` repository:
Cloning the `riscv-arch-test` repository under `nucleusrv`.
The default `run_compliance.sh` uses `riscv64`, so I changed it to `riscv32`.
Running compliance tests:
The terminal output shows some errors:
The error messages are related to the RISC-V Control and Status Register (CSR) instructions. These errors occur because the compiler does not recognize the CSR instructions, which are part of the `Zicsr` extension in RISC-V.
To resolve this issue, I need to explicitly enable the `Zicsr` extension when compiling the code.
I modified the Makefile at `nucleusrv/riscv-arch-test/riscv-test-suite/rv32i/Makefile`. At line 48, I changed the `-march` flag from `rv32i` to `rv32i_zicsr`.
Running compliance tests again:
The terminal output still shows errors:
It seems that all 48 tests failed, which doesn't make sense.
I want to debug this by comparing the actual output with the golden data.
Taking `I-ADD-01` as an example, I compared `nucleusrv/riscv-arch-test/work/rv32i/I-ADD-01.signature.output` with `nucleusrv/riscv-arch-test/riscv-test-suite/rv32i/references/I-ADD-01.reference_output`.
This problem has not been solved yet …
The `InstructionFetch` module is designed to fetch instructions from memory based on a given address.
Code can be found in nucleusrv/src/main/scala/components/InstructionFetch.scala
Indicating that all four bytes of a 32-bit word are active.
Indicating that this is a read operation.
Since we're performing a read operation (fetching an instruction), we don't need to specify any data to write.
Sets the address for the memory request. The input address `io.address` is right-shifted by 2 bits, which is equivalent to dividing by 4. This operation converts the byte address to the word address.
Ensures that no instruction fetch requests are made when the system is being reset or when the pipeline is stalled.
Ensures that the instruction output is only updated with valid data from memory, and remains in an undefined state when no valid instruction has been fetched.
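As a rough software model of the request fields described above, the following Python sketch builds a fetch request from a byte address. The function name and dictionary keys are hypothetical stand-ins chosen for illustration, not the signal names used in `InstructionFetch.scala`; only the shift by 2 and the reset/stall gating come from the description.

```python
def fetch_request(byte_address: int, reset: bool = False, stalled: bool = False) -> dict:
    """Model an instruction-fetch request (hypothetical field names)."""
    return {
        # Byte address -> word address: drop the two low bits (divide by 4).
        "addrRequest": (byte_address & 0xFFFFFFFF) >> 2,
        # All four bytes of the 32-bit word are active.
        "activeByteLane": 0b1111,
        # An instruction fetch is always a read.
        "isWrite": False,
        # No request is issued while in reset or while the pipeline is stalled.
        "valid": not (reset or stalled),
    }


if __name__ == "__main__":
    print(fetch_request(0x10))
    print(fetch_request(0x10, stalled=True))
```

Byte addresses `0x10` through `0x13` all map to the same word address, which is why the low two bits can be discarded for a word-aligned fetch.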
The `InstructionDecode` stage is responsible for decoding instructions.
Code can be found in nucleusrv/src/main/scala/components/InstructionDecode.scala
The `InstructionDecode` module instantiates several sub-modules to perform specific tasks. The `HazardUnit` module is used to detect and handle hazards in the pipeline. The `Control` module generates control signals based on the instruction opcode. The `Registers` module represents the register file, which stores and retrieves register values. The `ImmediateGen` module generates immediate values from the instruction. The `BranchUnit` module evaluates branch conditions and calculates the target address for branches and jumps.
It allows normal operation when the HDU (Hazard Detection Unit) indicates it's safe and the instruction is not a NOP (No Operation). It disables memory and register writes when there's a hazard or when processing a NOP instruction.
This forwarding logic resolves a read-after-write hazard inside the register file: the case where a register is written and read in the same cycle. Instead of waiting for the write to complete and then reading (which would introduce a delay), it forwards the data being written directly to the read output. It maintains the behavior of the zero register (always reading as 0) even in forwarding situations.
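The behaviour described above can be sketched as a small Python model of a register file with a write-to-read bypass. This is only an illustration under my own naming (`RegisterFile`, `cycle`, and the argument names are hypothetical), not a translation of the Chisel code.

```python
class RegisterFile:
    """Software model of a 32-entry register file with same-cycle bypass."""

    def __init__(self) -> None:
        self.regs = [0] * 32

    def cycle(self, rs1: int, rs2: int, write_enable: bool, rd: int, write_data: int):
        """Return (readData1, readData2) for this cycle, then commit the write."""

        def read(rs: int) -> int:
            if rs == 0:
                return 0  # x0 always reads as zero, even when "forwarding"
            if write_enable and rs == rd:
                return write_data & 0xFFFFFFFF  # bypass the value being written
            return self.regs[rs]

        outputs = (read(rs1), read(rs2))
        if write_enable and rd != 0:
            self.regs[rd] = write_data & 0xFFFFFFFF
        return outputs


if __name__ == "__main__":
    rf = RegisterFile()
    # x5 is written and read in the same cycle: the reader sees the new value.
    print(rf.cycle(rs1=5, rs2=0, write_enable=True, rd=5, write_data=42))
```

Without the bypass branch, the first call would return the stale value 0 for `x5` and the dependent instruction would need an extra cycle.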
The branch forwarding logic resolves data hazards specifically for branch instructions.
If `registerRs1` / `registerRs2` matches the destination register of the instruction in the `EX/MEM` stage, `input1` / `input2` is set to the result from that stage.
Else if `registerRs1` / `registerRs2` matches the destination register of the instruction in the `MEM/WB` stage, `input1` / `input2` is set to the result from that stage.
Otherwise, `input1` / `input2` is set to the value read from the register file.
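The priority order above (newest producer first) can be modelled in a few lines of Python. The function and parameter names are hypothetical, and the sketch leaves out any write-enable or `x0` checks that the actual Chisel code may perform.

```python
def forward_branch_operand(
    rs: int,
    reg_value: int,
    ex_mem_rd: int,
    ex_mem_result: int,
    mem_wb_rd: int,
    mem_wb_result: int,
) -> int:
    """Pick the operand a branch comparison should see for source register rs."""
    if rs == ex_mem_rd:
        return ex_mem_result  # most recent producer wins
    if rs == mem_wb_rd:
        return mem_wb_result  # older producer, not yet written back
    return reg_value          # no hazard: use the register-file value


if __name__ == "__main__":
    # x3 is produced by both in-flight instructions; the EX/MEM value is newer.
    print(forward_branch_operand(3, 10, ex_mem_rd=3, ex_mem_result=20,
                                 mem_wb_rd=3, mem_wb_result=30))
```

Checking `EX/MEM` before `MEM/WB` matters when both stages target the same register, because the `EX/MEM` instruction is later in program order.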
The forwarding logic resolves data hazards that can occur when a jump instruction depends on the result of a recent instruction that hasn't yet been written back to the register file.
If `registerRs1` matches the destination register of the instruction in the `EX` stage, `j_offset` is set to the result from that stage.
Else if `registerRs1` matches the destination register of the instruction in the `EX/MEM` stage, `j_offset` is set to the result from that stage.
Else if `registerRs1` matches the destination register of the instruction in the `MEM/WB` stage, `j_offset` is set to the result from that stage.
The code then checks the `EX` stage a second time, which is redundant (likely a mistake in the code?).
If none of the above conditions are met, `j_offset` is set to `io.readData1`, the value read from the register file.
The code handles offset calculation for jump and branch instructions. It calculates the next program counter (PC) value based on the type of control flow instruction (jump/branch) and determines whether the PC should be updated.
`io.ctl_jump === 1.U` checks if the control signal `ctl_jump` indicates a jump instruction where the offset is calculated relative to the current program counter `pcAddress`. The next PC value `io.pcPlusOffset` is computed as `pcPlusOffset = pcAddress + immediate`, typically for jump instructions like `jal` (jump and link).
`io.ctl_jump === 2.U` checks if the control signal `ctl_jump` indicates a jump instruction where the offset is calculated relative to a register value `j_offset`. The next PC value is computed as `pcPlusOffset = j_offset + immediate`, typically for `jalr` (jump and link register).
Otherwise, if no jump is indicated, it assumes a branch instruction or regular sequential execution. The next PC value is computed as `pcPlusOffset = pcAddress + immediate`, typically for branch instructions like `beq`, `bne`, etc., where the offset is relative to the current PC.
If `bu.io.taken || io.ctl_jump =/= 0.U` is true, meaning that either a branch is taken or a jump instruction is present, `io.pcSrc` is set to `true.B`, indicating that the program counter should be updated to the new target address.
Otherwise, no branch is taken and no jump instruction is present, so `io.pcSrc` is set to `false.B`, indicating that the program counter is not redirected and execution continues sequentially.
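To make the three cases concrete, here is a minimal Python model of the target calculation and the `pcSrc` decision. It mirrors the signal names from the description in snake_case, but the function itself is a hypothetical illustration, and the 32-bit wrap-around mask is my assumption about how the addition behaves in hardware.

```python
MASK32 = 0xFFFFFFFF


def next_pc(ctl_jump: int, pc_address: int, immediate: int, j_offset: int, branch_taken: bool):
    """Return (pc_plus_offset, pc_src) for one decoded instruction."""
    # ctl_jump == 2 (jalr-style): base is the register value j_offset.
    # ctl_jump == 1 (jal-style) and branches: base is the current PC.
    base = j_offset if ctl_jump == 2 else pc_address
    pc_plus_offset = (base + immediate) & MASK32
    # Redirect the PC only for a taken branch or any jump.
    pc_src = branch_taken or ctl_jump != 0
    return pc_plus_offset, pc_src


if __name__ == "__main__":
    print(next_pc(1, 0x100, 8, 0, False))        # jal-style jump
    print(next_pc(2, 0x100, 4, 0x2000, False))   # jalr-style jump
    print(next_pc(0, 0x100, 0xFFFFFFFC, 0, True))  # taken branch, offset -4
```

A negative offset is passed here as its 32-bit two's complement pattern, so `0xFFFFFFFC` acts as -4 after masking.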
The code handles instruction flushing, extracts specific fields from the instruction, and determines if a stall is necessary.
Checks if the opcode (bits 6-0) is either "0110011" (R-type) or "0010011" (I-type) with `func3 == 5`. If true, it sets `func7` to bits 31-25 of the instruction, otherwise it sets `func7` to 0.
Determines if a stall is necessary: it checks if `func7` is 1 and `func3` is either 4, 5, 6, or 7.
This likely identifies specific instructions (RV32M instructions) that require additional processing time, necessitating a pipeline stall.
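The field extraction and stall check can be tried out with a short Python model. I am reading the description as "R-type, or I-type with `func3 == 5`", which is an interpretation on my part; the function names are hypothetical. In the RV32M encoding, `funct7 = 1` with `funct3` 4 to 7 corresponds to `div`, `divu`, `rem`, and `remu`.

```python
def encode_r(funct7: int, rs2: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Assemble a 32-bit R-type instruction word (helper for the examples)."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def decode_func7_and_stall(instruction: int):
    """Return (func7, stall) following the decode rules described above."""
    opcode = instruction & 0x7F          # bits 6-0
    func3 = (instruction >> 12) & 0x7    # bits 14-12
    if opcode == 0b0110011 or (opcode == 0b0010011 and func3 == 5):
        func7 = (instruction >> 25) & 0x7F  # bits 31-25
    else:
        func7 = 0
    stall = func7 == 1 and func3 in (4, 5, 6, 7)
    return func7, stall


if __name__ == "__main__":
    div = encode_r(0b0000001, 3, 2, 0b100, 1, 0b0110011)  # div x1, x2, x3
    mul = encode_r(0b0000001, 3, 2, 0b000, 1, 0b0110011)  # mul x1, x2, x3
    print(decode_func7_and_stall(div))
    print(decode_func7_and_stall(mul))
```

Forcing `func7` to 0 for other opcodes keeps the upper immediate bits of instructions such as `addi` from being mistaken for a function code.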
Code can be found in nucleusrv/src/main/scala/components/Execute.scala
The `Execute` module handles arithmetic and logical operations, data forwarding, etc.
Selects the appropriate input for the ALU.
For `inputMux1` and `inputMux2`:
If `fu.forwardA === 0.U` / `fu.forwardB === 0.U`, selects `io.readData1` / `io.readData2`, the original register value.
Else if `fu.forwardA === 1.U` / `fu.forwardB === 1.U`, selects `io.mem_result`, the result from the memory stage.
Else if `fu.forwardA === 2.U` / `fu.forwardB === 2.U`, selects `io.wb_result`, the result from the writeback stage.
For `aluIn1`:
If `io.ctl_aluSrc1 === 1.U`, selects `io.pcAddress`, the current program counter value.
Else if `io.ctl_aluSrc1 === 2.U`, selects `0.U`, a constant zero.
Else selects `inputMux1`, the result of the forwarding logic.
For `aluIn2`:
If `io.ctl_aluSrc` is true, selects `inputMux2`, the result of the forwarding logic for the second operand.
Else selects `io.immediate`, the immediate value encoded in the instruction.
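The two levels of selection above (forwarding first, then operand source) can be sketched in Python as follows. Parameter names follow the signals in the description, but the function is a hypothetical model; in particular, treating any forward select other than 1 or 2 as "use the register value" is my assumption.

```python
def select_alu_inputs(
    forward_a: int, forward_b: int,
    read_data1: int, read_data2: int,
    mem_result: int, wb_result: int,
    ctl_alu_src1: int, ctl_alu_src: bool,
    pc_address: int, immediate: int,
):
    """Return (alu_in1, alu_in2) for the Execute stage."""

    def forward(select: int, reg_value: int) -> int:
        if select == 1:
            return mem_result  # result from the memory stage
        if select == 2:
            return wb_result   # result from the writeback stage
        return reg_value       # no forwarding needed

    input_mux1 = forward(forward_a, read_data1)
    input_mux2 = forward(forward_b, read_data2)

    if ctl_alu_src1 == 1:
        alu_in1 = pc_address   # PC-relative operations
    elif ctl_alu_src1 == 2:
        alu_in1 = 0            # constant zero
    else:
        alu_in1 = input_mux1

    alu_in2 = input_mux2 if ctl_alu_src else immediate
    return alu_in1, alu_in2


if __name__ == "__main__":
    # Register-register operation with operand A forwarded from the memory stage.
    print(select_alu_inputs(1, 0, 11, 22, 100, 200, 0, True, 0x40, 5))
```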
Code can be found in nucleusrv/src/main/scala/components/MemoryFetch.scala
The `MemoryFetch` module handles read and write operations on the data memory, DCCM (Data Closely Coupled Memory).
I think the code is actually handling a Store Byte (SB) operation, not a Store Half Word (SH) as the comment suggests. When `io.writeEnable` is true and `io.f3 === "b000".U`, the operation stores a single byte (8 bits) at a specified memory address.
`offsetSW` is the least significant 2 bits of the ALU result (the memory address); it determines where within a 32-bit word the byte should be stored.
`activeByteLane` is a 4-bit value indicating which byte within the 32-bit word should be written.
If `offsetSW === 0.U`, the byte is stored in the least significant byte (bits 7-0) of the 32-bit word, and `io.dccmReq.bits.activeByteLane := "b0001".U`.
Else if `offsetSW === 1.U`, the byte is stored in the second least significant byte (bits 15-8) of the 32-bit word, and `io.dccmReq.bits.activeByteLane := "b0010".U`.
Else if `offsetSW === 2.U`, the byte is stored in the second most significant byte (bits 23-16) of the 32-bit word, and `io.dccmReq.bits.activeByteLane := "b0100".U`.
Else if `offsetSW === 3.U`, the byte is stored in the most significant byte (bits 31-24) of the 32-bit word, and `io.dccmReq.bits.activeByteLane := "b1000".U`.
When `io.writeEnable` is true and `io.f3 === "b001".U`, it handles the Store Half Word (SH) operation.
The comment states that `offsetSW` will be either 0 or 2, since the address will be `0x0000` or `0x0002`.
If `offsetSW === 0.U`, the half word is stored in the lower 16 bits (15-0) of the 32-bit word. `io.dccmReq.bits.activeByteLane := "b0011".U`, indicating that the two least significant bytes should be written.
If `offsetSW === 2.U`, the half word is stored in the upper 16 bits (31-16) of the 32-bit word. `io.dccmReq.bits.activeByteLane := "b1100".U`, and the `wdata` is rearranged accordingly.
Store Word (SW) operation. `io.dccmReq.bits.activeByteLane := "b1111".U` indicates that all four bytes of the 32-bit word should be active for writing.
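The byte-lane masks for the three store widths can be summarised in one small Python function. This is a sketch with hypothetical names; raising an error for a halfword store at offset 1 or 3 is my own choice for the model, since the description only covers offsets 0 and 2.

```python
def active_byte_lane(f3: int, address: int) -> int:
    """Return the 4-bit byte-lane mask for a store with the given funct3."""
    offset = address & 0b11  # low two address bits: position within the word
    if f3 == 0b000:          # SB: exactly one lane, chosen by the offset
        return 1 << offset
    if f3 == 0b001:          # SH: lower or upper pair of lanes
        if offset == 0:
            return 0b0011
        if offset == 2:
            return 0b1100
        raise ValueError("halfword store at an offset other than 0 or 2 is not modelled")
    return 0b1111            # SW: all four lanes


if __name__ == "__main__":
    for f3, addr in [(0b000, 0x1003), (0b001, 0x1002), (0b010, 0x1000)]:
        print(f"f3={f3:03b} addr={addr:#x} lanes={active_byte_lane(f3, addr):04b}")
```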
Prepares the memory request by setting up the data to be written (if it's a write operation), calculating the memory address, setting the write enable flag, and validating the request when there's an actual memory operation to perform.
The stall logic ensures that the processor waits for memory operations to complete before proceeding.
Selects the data from the DCCM response if it's valid, otherwise sets it to `DontCare`.
When `funct3 === "b010"`, it performs a full 32-bit word load, Load Word (LW).
When `funct3 === "b000"`, it loads a single byte and sign-extends it to 32 bits, Load Byte (LB). It uses the `offset` to determine which byte of the 32-bit word to load.
Load Byte Unsigned (LBU): similar to Load Byte (LB), but zero-extends the byte instead of sign-extending it.
Load Halfword Unsigned (LHU): loads a 16-bit halfword and zero-extends it to 32 bits.
Load Halfword (LH): loads a 16-bit halfword and sign-extends it to 32 bits.
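The difference between sign extension and zero extension is easiest to see on concrete values, so here is a small Python model of the five load variants. The `funct3` values for LW and LB come from the text above; the ones for LH, LBU, and LHU (`001`, `100`, `101`) are the standard RV32I encodings. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign_bit) - sign_bit


def load_result(funct3: int, word: int, offset: int) -> int:
    """Return the 32-bit register value produced by a load from `word`."""
    shifted = (word & MASK32) >> (8 * offset)  # move the selected byte/halfword down
    if funct3 == 0b000:   # LB
        return sign_extend(shifted, 8) & MASK32
    if funct3 == 0b001:   # LH
        return sign_extend(shifted, 16) & MASK32
    if funct3 == 0b100:   # LBU
        return shifted & 0xFF
    if funct3 == 0b101:   # LHU
        return shifted & 0xFFFF
    return word & MASK32  # LW


if __name__ == "__main__":
    word = 0x8034F0AB
    print(hex(load_result(0b000, word, 0)))  # LB of byte 0xAB
    print(hex(load_result(0b100, word, 0)))  # LBU of byte 0xAB
```

The byte `0xAB` has its top bit set, so LB fills the upper 24 bits with ones while LBU fills them with zeros.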
mul (Multiplication)
Format: mul rd,rs1,rs2
Description: performs a 32-bit × 32-bit multiplication and places the lower 32 bits of the product in the destination register (both `rs1` and `rs2` treated as signed numbers).
Implementation: x[rd] = x[rs1] * x[rs2]
mulh (Multiplication Higher)
Format: mulh rd,rs1,rs2
Description: performs a 32-bit × 32-bit multiplication and places the upper 32 bits of the 64-bit product in the destination register (both `rs1` and `rs2` treated as signed numbers).
Implementation: x[rd] = (x[rs1] s*s x[rs2]) >>s 32
mulhsu (Multiplication Higher Signed Unsigned)
Format: mulhsu rd,rs1,rs2
Description: performs a 32-bit × 32-bit multiplication and places the upper 32 bits of the 64-bit product in the destination register (`rs1` treated as a signed number, `rs2` treated as an unsigned number).
Implementation: x[rd] = (x[rs1] s*u x[rs2]) >>s 32
mulhu (Multiplication Higher Unsigned)
Format: mulhu rd,rs1,rs2
Description: performs a 32-bit × 32-bit multiplication and places the upper 32 bits of the 64-bit product in the destination register (both `rs1` and `rs2` treated as unsigned numbers).
Implementation: x[rd] = (x[rs1] u*u x[rs2]) >>u 32
div (Division)
Format: div rd,rs1,rs2
Description: performs signed integer division of 32 bits by 32 bits (rounding towards zero).
Implementation: x[rd] = x[rs1] /s x[rs2]
divu (Division Unsigned)
Format: divu rd, rs1, rs2
Description: performs unsigned integer division of 32 bits by 32 bits (rounding towards zero).
Implementation: x[rd] = x[rs1] /u x[rs2]
rem (Remainder)
Format: rem rd, rs1, rs2
Description: provides the remainder of the corresponding division operation div (the sign of `rd` equals the sign of `rs1`).
Implementation: x[rd] = x[rs1] %s x[rs2]
remu (Remainder Unsigned)
Format: remu rd, rs1, rs2
Description: provides the remainder of the corresponding division operation divu.
Implementation: x[rd] = x[rs1] %u x[rs2]
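The eight operations listed above can be checked against a small reference model in Python. This is a sketch of the instruction semantics, not of how NucleusRV implements them; the function names are hypothetical. The division-by-zero and signed-overflow results are not covered in the list above and follow the RISC-V specification: dividing by zero gives an all-ones quotient and leaves the dividend as the remainder, and the most negative value divided by -1 gives itself as the quotient with a remainder of 0.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def rv32m(op: str, rs1: int, rs2: int) -> int:
    """Return the 32-bit result pattern of an RV32M operation."""
    a_s, b_s = to_signed(rs1), to_signed(rs2)
    a_u, b_u = rs1 & MASK32, rs2 & MASK32
    if op == "mul":
        return (a_u * b_u) & MASK32          # low 32 bits are sign-agnostic
    if op == "mulh":
        return ((a_s * b_s) >> 32) & MASK32  # Python's >> is arithmetic
    if op == "mulhsu":
        return ((a_s * b_u) >> 32) & MASK32
    if op == "mulhu":
        return ((a_u * b_u) >> 32) & MASK32
    if op in ("div", "rem"):
        if b_s == 0:
            q, r = -1, a_s
        elif a_s == -(1 << 31) and b_s == -1:
            q, r = a_s, 0                    # signed overflow case
        else:
            q = abs(a_s) // abs(b_s)         # truncate towards zero
            if (a_s < 0) != (b_s < 0):
                q = -q
            r = a_s - q * b_s                # remainder takes the sign of rs1
        return (q if op == "div" else r) & MASK32
    if op in ("divu", "remu"):
        q, r = (MASK32, a_u) if b_u == 0 else divmod(a_u, b_u)
        return q if op == "divu" else r
    raise ValueError(f"unknown operation: {op}")


if __name__ == "__main__":
    minus7 = (-7) & MASK32
    print(hex(rv32m("div", minus7, 2)), hex(rv32m("rem", minus7, 2)))
```

Computing the quotient on absolute values avoids Python's floor division, which rounds towards negative infinity rather than towards zero.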