# Evaluate NucleusRV
> 林趺菩
[GitHub](https://github.com/merledu/nucleusrv)
NucleusRV is a 32-bit 5-stage pipelined RISC-V core implemented in Chisel.
## Prerequisites
Clone the `nucleusrv` repository:
```shell
$ git clone https://github.com/merledu/nucleusrv.git
```
---
Building `riscv-gnu-toolchain` from source gave me trouble, so I followed a guide I found online and used a prebuilt release instead of building the toolchain from scratch.
* Download and extract the prebuilt `riscv-gnu-toolchain` release:
```shell
$ wget https://github.com/riscv-collab/riscv-gnu-toolchain/releases/download/2024.12.16/riscv32-elf-ubuntu-22.04-gcc-nightly-2024.12.16-nightly.tar.xz
$ tar -xvf riscv32-elf-ubuntu-22.04-gcc-nightly-2024.12.16-nightly.tar.xz
$ rm -f riscv32-elf-ubuntu-22.04-gcc-nightly-2024.12.16-nightly.tar.xz
```
---
I decided to use Docker to build the required environment efficiently.
* The content of the `Dockerfile`:
```
# Start with Ubuntu 22.04 as the base image
FROM ubuntu:22.04
# Set the time zone to avoid interactive prompts during apt-get install
ENV TZ="Asia/Taipei"
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
# Update package lists and install required packages
RUN apt-get update && apt-get install -y \
build-essential \
verilator \
gtkwave \
curl \
zip \
unzip \
sudo \
bsdmainutils \
&& rm -rf /var/lib/apt/lists/*
# Set up SDKMAN
RUN curl -s "https://get.sdkman.io" | bash
# Set up environment for SDKMAN
ENV SDKMAN_DIR="/root/.sdkman"
ENV PATH="${PATH}:${SDKMAN_DIR}/bin:${SDKMAN_DIR}/candidates/java/current/bin:${SDKMAN_DIR}/candidates/sbt/current/bin"
# Install Java and SBT using SDKMAN
RUN bash -c "source ${SDKMAN_DIR}/bin/sdkman-init.sh && \
sdk install java 11.0.21-tem && \
sdk install sbt"
# Set the working directory
WORKDIR /nucleusrv
############################ for riscv-gnu-toolchain ############################
# Copy the riscv directory to /opt/riscv
COPY ./riscv /opt/riscv
# Add /opt/riscv/bin to PATH
RUN echo "export PATH=/opt/riscv/bin:$PATH" >> /root/.bashrc
############################ for riscv-gnu-toolchain ############################
# Set the default command
CMD ["/bin/bash", "-c", "source /root/.bashrc && /bin/bash"]
```
* Build the Docker image with the build script:
```shell
$ bash build.sh
```
* Run the Docker container with the run script:
```shell
$ bash run.sh
```
## NucleusRV Demo
First, run the Docker container:
```shell
$ bash run.sh
```
The terminal will look like this:
```
root@0e76ca700801:/app# ls
Dockerfile build.sh nucleusrv riscv run.sh
```
### Building C Programs (hello_world)
Following the steps given in the `nucleusrv` repository:

```shell
$ cd nucleusrv/tools/
$ make PROGRAM=hello_world
```
The terminal output:
```
rm -rf out
riscv32-unknown-elf-gcc -c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer -c -o tests/hello_world/hello.o tests/hello_world/hello.c
riscv32-unknown-elf-gcc -c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer -c -o tests/hello_world/main.o tests/hello_world/main.c
riscv32-unknown-elf-gcc -c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer -c -o tests/hello_world/world.o tests/hello_world/world.c
riscv32-unknown-elf-gcc -march=rv32im -mabi=ilp32 -static -nostdlib -nostartfiles -T link.ld tests/hello_world/hello.o tests/hello_world/main.o tests/hello_world/world.o -o out/program.elf -lgcc
riscv32-unknown-elf-objdump --disassemble-all --section=.text out/program.elf > out/program.dump
python3 makehex.py out/program.elf 2048 > out/program.hex
```
The corresponding `program.dump`, `program.elf`, and `program.hex` files are generated under `nucleusrv/tools/out/`.
The content of `program.dump`:
```
out/program.elf: file format elf32-littleriscv
Disassembly of section .text:
00000000 <hello>:
0: ff010113 addi sp,sp,-16
4: 00400793 li a5,4
8: 00f12623 sw a5,12(sp)
c: 00500793 li a5,5
10: 00f12423 sw a5,8(sp)
14: 00c12703 lw a4,12(sp)
18: 00812783 lw a5,8(sp)
1c: 00f707b3 add a5,a4,a5
20: 00f12223 sw a5,4(sp)
24: 00412783 lw a5,4(sp)
28: 00078513 mv a0,a5
2c: 01010113 addi sp,sp,16
30: 00008067 ret
00000034 <main>:
34: fe010113 addi sp,sp,-32
38: 00112e23 sw ra,28(sp)
3c: fc5ff0ef jal 0 <hello>
40: 00a12623 sw a0,12(sp)
44: 02c000ef jal 70 <world>
48: 00a12423 sw a0,8(sp)
4c: 00c12703 lw a4,12(sp)
50: 00812783 lw a5,8(sp)
54: 00f707b3 add a5,a4,a5
58: 00f12223 sw a5,4(sp)
5c: 00000793 li a5,0
60: 00078513 mv a0,a5
64: 01c12083 lw ra,28(sp)
68: 02010113 addi sp,sp,32
6c: 00008067 ret
00000070 <world>:
70: 00500793 li a5,5
74: 00078513 mv a0,a5
78: 00008067 ret
```
### Building with SBT
Following the steps given in the `nucleusrv` repository:

Move to the `nucleusrv` directory:
```shell
$ cd ..
```
Start the sbt shell:
```shell
$ sbt
```
The terminal output:
```
...
[info] loading settings for project nucleusrv from build.sbt ...
[info] set current project to nucleusrv (in build file:/app/nucleusrv/)
[info] sbt server started at local:///root/.sbt/1.0/server/0aa2831cde32e66c128a/sock
[info] started sbt server
```
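Run the test from inside the sbt shell (the command is typed at the `sbt:nucleusrv>` prompt, not in bash):
```shell
sbt:nucleusrv> testOnly nucleusrv.components.TopTest -- -DwriteVcd=1 -DprogramFile=/app/nucleusrv/tools/out/program.hex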
```
* `-DwriteVcd=1`: enables VCD (Value Change Dump) file generation, which is useful for waveform viewing and debugging.
* `-DprogramFile=/app/nucleusrv/tools/out/program.hex`: specifies the path to the program file (in hexadecimal format) that is used for the test.

The terminal output:
```
...
Enabling waves..
Exit Code: 0
[info] - Top Test
[info] Run completed in 4 seconds, 903 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 14 s, completed Jan 9, 2025, 6:29:50 PM
sbt:nucleusrv>
```
To exit the sbt shell, press `Ctrl+D`.
### Running Compliance Tests
Following the steps given in the `nucleusrv` repository:

Clone the `riscv-arch-test` repository under `nucleusrv`:
```shell
$ git clone git@github.com:riscv-non-isa/riscv-arch-test.git -b 1.0
```
The default `run_compliance.sh` uses the `riscv64` toolchain, so I changed it to `riscv32`.

Run the compliance tests:
```shell
$ bash run_compliance.sh rv32i
```
The terminal output shows some errors:
```
/app/nucleusrv/test_run_dir/Top_Test/VTop exists.
make \
RISCV_TARGET=nucleusrv \
RISCV_DEVICE=rv32i \
RISCV_PREFIX=riscv32-unknown-elf- \
clean -C /app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i
make[1]: Entering directory '/app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i'
rm -rf /app/nucleusrv/riscv-arch-test/work
make[1]: Leaving directory '/app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i'
make \
RISCV_TARGET=nucleusrv \
RISCV_DEVICE=rv32i \
RISCV_PREFIX=riscv32-unknown-elf- \
run -C /app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i
make[1]: Entering directory '/app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i'
Compile /app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf
src/I-MISALIGN_JMP-01.S: Assembler messages:
src/I-MISALIGN_JMP-01.S:48: Error: unrecognized opcode `csrrw x31,mtvec,x1', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:51: Error: unrecognized opcode `csrrci x0,misa,4', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:273: Error: unrecognized opcode `csrw mtvec,x31', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:282: Error: unrecognized opcode `csrr x30,mtval', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:284: Error: unrecognized opcode `csrw mepc,x30', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:287: Error: unrecognized opcode `csrr x30,mtval', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:292: Error: unrecognized opcode `csrr x30,mcause', extension `zicsr' required
riscv32-unknown-elf-objcopy: '/app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf': No such file
riscv32-unknown-elf-objcopy: '/app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf': No such file
riscv32-unknown-elf-objdump: '/app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf': No such file
hexdump: /app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf.text.bin: No such file or directory
hexdump: all input file arguments failed
hexdump: /app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf.data.bin: No such file or directory
hexdump: all input file arguments failed
make[1]: *** [Makefile:50: /app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf] Error 1
make[1]: Leaving directory '/app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i'
make: *** [Makefile:79: simulate] Error 2
```
The errors all concern RISC-V Control and Status Register (CSR) instructions. The assembler rejects them because, with this toolchain, CSR instructions belong to the separate `Zicsr` extension and are not implied by `-march=rv32i`.
To resolve this, the `Zicsr` extension has to be enabled explicitly when compiling.
I modified the Makefile at `nucleusrv/riscv-arch-test/riscv-test-suite/rv32i/Makefile`: at line 48, I changed the `-march` flag from `rv32i` to `rv32i_zicsr`.
Run the compliance tests again:
```shell
$ bash run_compliance.sh rv32i
```
The terminal output still shows errors:
```
...
Check I-SW-011d0
< ffffffff
8d6
< 7fffffff
14,15d11
< ffffffff
< fffff801
24,25d19
< 7fffffff
< 00000001
31d24
< fffff801
... FAIL
Check I-XOR-011d0
< 00000000
5d3
< 80000000
9d6
< 7ffffffe
13,15d9
< ffffea33
< 00000000
< fffff800
25d18
< 7ffffffe
33d25
< ffffffff
... FAIL
Check I-XORI-014d3
< ffffffff
7d5
< f89abb21
20d17
< ffffffff
23d19
< f89abb21
25d20
< fffff801
30,31d24
< 00000000
< fffff800
... FAIL
--------------------------------
FAIL: 48/48 RISCV_TARGET=nucleusrv RISCV_DEVICE=rv32i RISCV_ISA=rv32i
make: *** [Makefile:86: verify] Error 1
```
All 48 tests appear to fail, which does not make sense.
To debug this, I want to compare the actual output with the golden data.
Taking `I-ADD-01` as an example, I compared `nucleusrv/riscv-arch-test/work/rv32i/I-ADD-01.signature.output` with `nucleusrv/riscv-arch-test/riscv-test-suite/rv32i/references/I-ADD-01.reference_output`.

This problem has not been solved yet ...
## NucleusRV Explanation
### Instruction Fetch

The `InstructionFetch` module is designed to fetch instructions from memory based on a given address.
The code can be found in `nucleusrv/src/main/scala/components/InstructionFetch.scala`.
```scala=
package nucleusrv.components
import chisel3._
import chisel3.util._
class InstructionFetch extends Module {
  val io = IO(new Bundle {
    val address: UInt = Input(UInt(32.W))
    val instruction: UInt = Output(UInt(32.W))
    val stall: Bool = Input(Bool())
    val coreInstrReq = Decoupled(new MemRequestIO)
    val coreInstrResp = Flipped(Decoupled(new MemResponseIO))
  })
  val rst = Wire(Bool())
  rst := reset.asBool()
  io.coreInstrResp.ready := true.B
  ...
```
---
```scala
io.coreInstrReq.bits.activeByteLane := "b1111".U
```
This indicates that all four bytes of the 32-bit word are active.

---
```scala
io.coreInstrReq.bits.isWrite := false.B
```
This indicates that the request is a read operation.

---
```scala
io.coreInstrReq.bits.dataRequest := DontCare
```
Since this is a read operation (fetching an instruction), there is no data to write, so the field is left as `DontCare`.

---
```scala
io.coreInstrReq.bits.addrRequest := io.address >> 2
```
This sets the address for the memory request. The input address `io.address` is right-shifted by 2 bits, which is equivalent to dividing by 4 and converts the byte address into a word address.

---
```scala
io.coreInstrReq.valid := Mux(rst || io.stall, false.B, true.B)
```
This ensures that no instruction fetch request is issued while the core is in reset or the pipeline is stalled.

---
```scala
io.instruction := Mux(io.coreInstrResp.valid, io.coreInstrResp.bits.dataResponse, DontCare)
```
This drives the instruction output with the memory response only when the response is valid. When no valid instruction has been fetched, the output is left as `DontCare`, i.e. undefined.
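
As a rough software model of the request signals described in this section, the following plain-Scala sketch (not Chisel, and not taken from the repository) builds the fields of one fetch request. The names `FetchModel`, `Req`, `build`, and `instruction` are hypothetical, and `DontCare` is approximated by `None`.

```scala
// Behavioral sketch of the InstructionFetch request logic (assumption: plain Scala, not Chisel).
object FetchModel {
  // Hypothetical container for the fields driven on coreInstrReq.
  final case class Req(addrRequest: Int, activeByteLane: Int, isWrite: Boolean, valid: Boolean)

  def build(address: Int, reset: Boolean, stall: Boolean): Req =
    Req(
      addrRequest = address >>> 2,   // byte address -> word address
      activeByteLane = 0xf,          // all four byte lanes active ("b1111")
      isWrite = false,               // instruction fetch is always a read
      valid = !(reset || stall)      // no request during reset or stall
    )

  // The instruction is only meaningful when the response is valid.
  def instruction(respValid: Boolean, dataResponse: Int): Option[Int] =
    if (respValid) Some(dataResponse) else None
}
```

For example, `FetchModel.build(0x34, reset = false, stall = false)` yields word address `0xd`, which is the slot of the first instruction of `main` in the dump shown earlier.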
### Instruction Decode
The `InstructionDecode` stage is responsible for decoding instructions.
The code can be found in `nucleusrv/src/main/scala/components/InstructionDecode.scala`.
```scala=
package nucleusrv.components
import chisel3._
class InstructionDecode(TRACE:Boolean) extends Module {
  val io = IO(new Bundle {
    val id_instruction = Input(UInt(32.W))
    val writeData = Input(UInt(32.W))
    val writeReg = Input(UInt(5.W))
    val pcAddress = Input(UInt(32.W))
    val ctl_writeEnable = Input(Bool())
    val id_ex_mem_read = Input(Bool())
    // val ex_mem_mem_write = Input(Bool())
    val ex_mem_mem_read = Input(Bool())
    val dmem_resp_valid = Input(Bool())
    val id_ex_rd = Input(UInt(5.W))
    val ex_mem_rd = Input(UInt(5.W))
    val id_ex_branch = Input(Bool())
    //for forwarding
    val ex_mem_ins = Input(UInt(32.W))
    val mem_wb_ins = Input(UInt(32.W))
    val ex_ins = Input(UInt(32.W))
    val ex_result = Input(UInt(32.W))
    val ex_mem_result = Input(UInt(32.W))
    val mem_wb_result = Input(UInt(32.W))
    //Outputs
    val immediate = Output(UInt(32.W))
    val writeRegAddress = Output(UInt(5.W))
    val readData1 = Output(UInt(32.W))
    val readData2 = Output(UInt(32.W))
    val func7 = Output(UInt(7.W))
    val func3 = Output(UInt(3.W))
    val ctl_aluSrc = Output(Bool())
    val ctl_memToReg = Output(UInt(2.W))
    val ctl_regWrite = Output(Bool())
    val ctl_memRead = Output(Bool())
    val ctl_memWrite = Output(Bool())
    val ctl_branch = Output(Bool())
    val ctl_aluOp = Output(UInt(2.W))
    val ctl_jump = Output(UInt(2.W))
    val ctl_aluSrc1 = Output(UInt(2.W))
    val hdu_pcWrite = Output(Bool())
    val hdu_if_reg_write = Output(Bool())
    val pcSrc = Output(Bool())
    val pcPlusOffset = Output(UInt(32.W))
    val ifid_flush = Output(Bool())
    val stall = Output(Bool())
    // RVFI pins
    val rs_addr = if (TRACE) Some(Output(Vec(2, UInt(5.W)))) else None
  })
  //Hazard Detection Unit
  val hdu = Module(new HazardUnit)
  ...
  //Control Unit
  val control = Module(new Control)
  ...
  //Register File
  val registers = Module(new Registers)
  ...
  val immediate = Module(new ImmediateGen)
  immediate.io.instruction := io.id_instruction
  io.immediate := immediate.io.out
  ...
  //Branch Unit
  val bu = Module(new BranchUnit)
  ...
```
The `InstructionDecode` module instantiates several sub-modules to perform specific tasks. The `HazardUnit` module is used to detect and handle hazards in the pipeline. The `Control` module generates control signals based on the instruction opcode. The `Registers` module represents the register file, which stores and retrieves register values. The `ImmediateGen` module generates immediate values from the instruction. The `BranchUnit` module evaluates branch conditions, and calculates the target address for branches and jumps.

---
```scala
when(hdu.io.ctl_mux && io.id_instruction =/= "h13".U) {
  io.ctl_memWrite := control.io.memWrite
  io.ctl_regWrite := control.io.regWrite
}.otherwise {
  io.ctl_memWrite := false.B
  io.ctl_regWrite := false.B
}
```
This allows normal operation when the Hazard Detection Unit (HDU) indicates that it is safe and the instruction is not a `NOP` (`0x00000013`, i.e. `addi x0, x0, 0`). Memory and register writes are disabled when there is a hazard or when the instruction is a `NOP`.

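
To see why the constant `"h13"` stands for a `NOP`, the following plain-Scala sketch (not Chisel, not repository code) decodes the I-type fields of an instruction word and mirrors the write gating above. `NopDecode`, `DecodedI`, `decodeI`, and `writesEnabled` are hypothetical names chosen for this example.

```scala
// Sketch: decode I-type fields and model the write-enable gating (assumption: plain Scala model).
object NopDecode {
  final case class DecodedI(opcode: Int, rd: Int, funct3: Int, rs1: Int, imm: Int)

  def decodeI(instr: Int): DecodedI = DecodedI(
    opcode = instr & 0x7f,           // bits 6..0
    rd     = (instr >>> 7) & 0x1f,   // bits 11..7
    funct3 = (instr >>> 12) & 0x7,   // bits 14..12
    rs1    = (instr >>> 15) & 0x1f,  // bits 19..15
    imm    = instr >> 20             // bits 31..20, sign-extended
  )

  // Writes pass through only when ctl_mux is set and the word is not the NOP encoding.
  def writesEnabled(ctlMux: Boolean, instr: Int): Boolean =
    ctlMux && instr != 0x13
}
```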
---
```scala
//Forwarding to fix structural hazard
when(io.ctl_writeEnable && (io.writeReg === registerRs1)){
  when(registerRs1 === 0.U){
    io.readData1 := 0.U
  }.otherwise{
    io.readData1 := io.writeData
  }
}.otherwise{
  io.readData1 := registers.io.readData(0)
}
when(io.ctl_writeEnable && (io.writeReg === registerRs2)){
  when(registerRs2 === 0.U){
    io.readData2 := 0.U
  }.otherwise{
    io.readData2 := io.writeData
  }
}.otherwise{
  io.readData2 := registers.io.readData(1)
}
```
This forwarding logic resolves what the code comment calls a structural hazard: a register that is written and read in the same cycle. Instead of waiting for the write to complete before reading (which would introduce a delay), it forwards the data being written directly to the read output. The zero register (`x0`) still always reads as 0, even when forwarding applies.

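
The same-cycle bypass can be modeled in a few lines of plain Scala. This is a behavioral sketch only (not Chisel, not repository code); `WritebackBypass` and `read` are hypothetical names, and the register file is represented as a `Vector[Int]`.

```scala
// Sketch: read port with write-to-read bypass (assumption: plain Scala model of the logic above).
object WritebackBypass {
  def read(regs: Vector[Int], rs: Int,
           writeEnable: Boolean, writeReg: Int, writeData: Int): Int =
    if (writeEnable && writeReg == rs) {
      // The register being read is written in this cycle: forward the new value,
      // except for x0, which must always read as zero.
      if (rs == 0) 0 else writeData
    } else {
      regs(rs) // no conflict: use the stored value
    }
}
```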
---
```scala
// Branch Forwarding
val input1 = Wire(UInt(32.W))
val input2 = Wire(UInt(32.W))
when(registerRs1 === io.ex_mem_ins(11, 7)) {
  input1 := io.ex_mem_result
}.elsewhen(registerRs1 === io.mem_wb_ins(11, 7)) {
  input1 := io.mem_wb_result
}
.otherwise {
  input1 := io.readData1
}
when(registerRs2 === io.ex_mem_ins(11, 7)) {
  input2 := io.ex_mem_result
}.elsewhen(registerRs2 === io.mem_wb_ins(11, 7)) {
  input2 := io.mem_wb_result
}
.otherwise {
  input2 := io.readData2
}
```
The branch forwarding logic resolves data hazards specifically for branch instructions. For each source register (`registerRs1` drives `input1`, `registerRs2` drives `input2`):
* If it matches the destination register (bits 11-7) of the instruction in the `EX/MEM` stage, the input is set to the result from that stage.
* Otherwise, if it matches the destination register of the instruction in the `MEM/WB` stage, the input is set to the result from that stage.
* Otherwise, the input is set to the value read from the register file.
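
The priority order above (the younger `EX/MEM` result wins over `MEM/WB`) can be sketched in plain Scala. This is an illustrative model, not repository code; `BranchForward`, `rd`, and `select` are hypothetical names. Like the excerpt, it compares register numbers only and does not look at any write-enable signal.

```scala
// Sketch: operand selection for the branch unit (assumption: plain Scala model).
object BranchForward {
  // Destination register field, bits 11..7 of the instruction word.
  def rd(instr: Int): Int = (instr >>> 7) & 0x1f

  def select(rs: Int,
             exMemIns: Int, exMemResult: Int,
             memWbIns: Int, memWbResult: Int,
             readData: Int): Int =
    if (rs == rd(exMemIns)) exMemResult       // most recent producer first
    else if (rs == rd(memWbIns)) memWbResult  // then the older producer
    else readData                             // otherwise the register file value
}
```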
---
```scala
//Forwarding for Jump
val j_offset = Wire(UInt(32.W))
when(registerRs1 === io.ex_ins(11, 7)){
  j_offset := io.ex_result
}.elsewhen(registerRs1 === io.ex_mem_ins(11, 7)) {
  j_offset := io.ex_mem_result
}.elsewhen(registerRs1 === io.mem_wb_ins(11, 7)) {
  j_offset := io.mem_wb_result
}.elsewhen(registerRs1 === io.ex_ins(11, 7)){
  j_offset := io.ex_result
}.otherwise {
  j_offset := io.readData1
}
```
This forwarding logic resolves data hazards that occur when a jump instruction depends on the result of a recent instruction that has not yet been written back to the register file:
* If `registerRs1` matches the destination register of the instruction in the `EX` stage, `j_offset` is set to the result from that stage.
* Otherwise, if `registerRs1` matches the destination register of the instruction in the `EX/MEM` stage, `j_offset` is set to the result from that stage.
* Otherwise, if `registerRs1` matches the destination register of the instruction in the `MEM/WB` stage, `j_offset` is set to the result from that stage.
* The fourth condition repeats the `EX` stage check. Because the first condition already covers it, this branch can never be taken (likely a mistake in the code?).
* If none of the above conditions are met, `j_offset` is set to the value read from the register file, `io.readData1`.
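
A small plain-Scala model makes the redundancy easy to check. It is a sketch, not repository code: `JumpForward` and `select` are hypothetical names, and the function returns the index of the arm that fired together with the selected value, so that the duplicated arm (index 3) can be observed.

```scala
// Sketch: priority chain for j_offset, reporting which arm fired (assumption: plain Scala model).
object JumpForward {
  // Returns (arm index, selected value); index 4 stands for the final `otherwise`.
  def select(rs1: Int, exRd: Int, exMemRd: Int, memWbRd: Int,
             exResult: Int, exMemResult: Int, memWbResult: Int,
             readData1: Int): (Int, Int) = {
    val arms = Seq(
      (rs1 == exRd)    -> exResult,     // arm 0: EX stage
      (rs1 == exMemRd) -> exMemResult,  // arm 1: EX/MEM stage
      (rs1 == memWbRd) -> memWbResult,  // arm 2: MEM/WB stage
      (rs1 == exRd)    -> exResult      // arm 3: same condition as arm 0
    )
    val idx = arms.indexWhere(_._1)     // first true condition wins, as in a when/elsewhen chain
    if (idx >= 0) (idx, arms(idx)._2) else (arms.length, readData1)
  }
}
```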
---
```scala
//Offset Calculation (Jump/Branch)
when(io.ctl_jump === 1.U) {
  io.pcPlusOffset := io.pcAddress + io.immediate
}.elsewhen(io.ctl_jump === 2.U) {
  io.pcPlusOffset := j_offset + io.immediate
}
.otherwise {
  io.pcPlusOffset := io.pcAddress + immediate.io.out
}
when(bu.io.taken || io.ctl_jump =/= 0.U) {
  io.pcSrc := true.B
}.otherwise {
  io.pcSrc := false.B
}
```
This code handles offset calculation for jump and branch instructions. It computes the target program counter (PC) value according to the type of control-flow instruction and determines whether the PC should be updated.
* `io.ctl_jump === 1.U` indicates a jump whose target is relative to the current program counter `pcAddress`. The target is computed as `pcPlusOffset = pcAddress + immediate`, typically for `jal` (jump and link).
* `io.ctl_jump === 2.U` indicates a jump whose target is relative to a register value, `j_offset`. The target is computed as `pcPlusOffset = j_offset + immediate`, typically for `jalr` (jump and link register).
* Otherwise, no jump is indicated, and a branch instruction or regular sequential execution is assumed. The target is computed as `pcPlusOffset = pcAddress + immediate`, typically for branch instructions such as `beq` and `bne`, where the offset is relative to the current PC.
* If `bu.io.taken || io.ctl_jump =/= 0.U` is true, meaning that either a branch is taken or a jump instruction is present, `io.pcSrc` is set to `true.B`, indicating that the program counter should be updated to the new target address.
* If it is false, meaning that no branch is taken and no jump instruction is present, `io.pcSrc` is set to `false.B`, indicating that the program counter continues sequentially.
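
The target selection can be sketched as a plain-Scala function. This is an illustrative model rather than repository code; `PcTarget`, `pcPlusOffset`, and `pcSrc` here are hypothetical Scala names that reuse the signal names for readability. 32-bit `Int` addition wraps around in the same way as the 32-bit `UInt` addition in the excerpt.

```scala
// Sketch: jump/branch target and PC-select logic (assumption: plain Scala model).
object PcTarget {
  def pcPlusOffset(ctlJump: Int, pc: Int, imm: Int, jOffset: Int): Int = ctlJump match {
    case 1 => pc + imm       // PC-relative jump (jal)
    case 2 => jOffset + imm  // register-relative jump (jalr)
    case _ => pc + imm       // branch target, used only if the branch is taken
  }

  // The PC is redirected when a branch is taken or any jump is present.
  def pcSrc(branchTaken: Boolean, ctlJump: Int): Boolean =
    branchTaken || ctlJump != 0
}
```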
---
```scala
//Instruction Flush
io.ifid_flush := hdu.io.ifid_flush
io.writeRegAddress := io.id_instruction(11, 7)
io.func3 := io.id_instruction(14, 12)
when((io.id_instruction(6,0) === "b0110011".U) | ((io.id_instruction(6,0) === "b0010011".U) & (io.func3 === 5.U))){
  io.func7 := io.id_instruction(31,25)
}.otherwise{
  io.func7 := 0.U
}
io.stall := io.func7 === 1.U && (io.func3 === 4.U || io.func3 === 5.U || io.func3 === 6.U || io.func3 === 7.U)
```
This code handles instruction flushing, extracts specific fields from the instruction, and determines whether a stall is necessary.
* It checks whether the opcode (bits 6-0) is either `0110011` (R-type) or `0010011` (I-type) with `func3 == 5` (the right-shift immediates `srli` / `srai`, which are distinguished by bits 31-25). If so, `func7` is set to bits 31-25 of the instruction; otherwise `func7` is set to 0.
* It requests a stall when `func7` is 1 and `func3` is 4, 5, 6, or 7. This identifies the `RV32M` division and remainder instructions (`div`, `divu`, `rem`, `remu`), which likely require additional processing time and therefore a pipeline stall.
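
The field extraction and stall condition can be tried out with a plain-Scala sketch. It is not repository code; `DecodeStall`, `encodeR`, `funct3`, `funct7`, and `stall` are hypothetical helper names, and `encodeR` simply packs the standard R-type fields.

```scala
// Sketch: func7 extraction and divide/remainder stall detection (assumption: plain Scala model).
object DecodeStall {
  // Pack R-type fields into a 32-bit instruction word.
  def encodeR(funct7: Int, rs2: Int, rs1: Int, funct3: Int, rd: Int, opcode: Int): Int =
    (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

  def funct3(instr: Int): Int = (instr >>> 12) & 0x7

  // func7 is taken from bits 31..25 only for R-type, or for I-type with funct3 == 5.
  def funct7(instr: Int): Int = {
    val opcode = instr & 0x7f
    if (opcode == 0x33 || (opcode == 0x13 && funct3(instr) == 5)) (instr >>> 25) & 0x7f
    else 0
  }

  // Stall for funct7 == 1 with funct3 in 4..7 (div, divu, rem, remu).
  def stall(instr: Int): Boolean = funct7(instr) == 1 && funct3(instr) >= 4
}
```

For instance, an encoded `div` makes `stall` return true, while `mul` and `add` do not.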
### Execute
The code can be found in `nucleusrv/src/main/scala/components/Execute.scala`.
```scala=
package nucleusrv.components
import chisel3._
import chisel3.util.MuxCase
class Execute(M:Boolean = false) extends Module {
  val io = IO(new Bundle {
    val immediate = Input(UInt(32.W))
    val readData1 = Input(UInt(32.W))
    val readData2 = Input(UInt(32.W))
    val pcAddress = Input(UInt(32.W))
    val func7 = Input(UInt(7.W))
    val func3 = Input(UInt(3.W))
    val mem_result = Input(UInt(32.W))
    val wb_result = Input(UInt(32.W))
    val ex_mem_regWrite = Input(Bool())
    val mem_wb_regWrite = Input(Bool())
    val id_ex_ins = Input(UInt(32.W))
    val ex_mem_ins = Input(UInt(32.W))
    val mem_wb_ins = Input(UInt(32.W))
    val ctl_aluSrc = Input(Bool())
    val ctl_aluOp = Input(UInt(2.W))
    val ctl_aluSrc1 = Input(UInt(2.W))
    val writeData = Output(UInt(32.W))
    val ALUresult = Output(UInt(32.W))
    val stall = Output(Bool())
  })
  val alu = Module(new ALU)
  val aluCtl = Module(new AluControl)
  val fu = Module(new ForwardingUnit).io
  // Forwarding Unt
  fu.ex_regWrite := io.ex_mem_regWrite
  fu.mem_regWrite := io.mem_wb_regWrite
  fu.ex_reg_rd := io.ex_mem_ins(11, 7)
  fu.mem_reg_rd := io.mem_wb_ins(11, 7)
  fu.reg_rs1 := io.id_ex_ins(19, 15)
  fu.reg_rs2 := io.id_ex_ins(24, 20)
  val inputMux1 = MuxCase(
    0.U,
    Array(
      (fu.forwardA === 0.U) -> (io.readData1),
      (fu.forwardA === 1.U) -> (io.mem_result),
      (fu.forwardA === 2.U) -> (io.wb_result)
    )
  )
  val inputMux2 = MuxCase(
    0.U,
    Array(
      (fu.forwardB === 0.U) -> (io.readData2),
      (fu.forwardB === 1.U) -> (io.mem_result),
      (fu.forwardB === 2.U) -> (io.wb_result)
    )
  )
  val aluIn1 = MuxCase(
    inputMux1,
    Array(
      (io.ctl_aluSrc1 === 1.U) -> io.pcAddress,
      (io.ctl_aluSrc1 === 2.U) -> 0.U
    )
  )
  val aluIn2 = Mux(io.ctl_aluSrc, inputMux2, io.immediate)
  aluCtl.io.f3 := io.func3
  aluCtl.io.f7 := io.func7(5)
  aluCtl.io.aluOp := io.ctl_aluOp
  aluCtl.io.aluSrc := io.ctl_aluSrc
  alu.io.input1 := aluIn1
  alu.io.input2 := aluIn2
  alu.io.aluCtl := aluCtl.io.out
  io.stall := false.B
  if(M){
    val mdu = Module (new MDU)
    mdu.io.src_a := aluIn1
    mdu.io.src_b := aluIn2
    mdu.io.op := io.func3
    // mdu.io.valid := true.B
    // io.stall := false.B
    val src_a_reg = RegInit(0.U(32.W))
    val src_b_reg = RegInit(0.U(32.W))
    val op_reg = RegInit(0.U(3.W))
    val div_en = RegInit(false.B)
    val f7_reg = RegInit(0.U(6.W))
    val counter = RegInit(0.U(6.W))
    when(io.func7 === 1.U && (io.func3 === 0.U || io.func3 === 1.U || io.func3 === 2.U || io.func3 === 3.U)){
      mdu.io.valid := true.B
    }otherwise{
      mdu.io.valid := false.B
    }
    dontTouch(io.stall)
    when(io.func7 === 1.U && ~div_en && (io.func3 === 4.U || io.func3 === 5.U || io.func3 === 6.U || io.func3 === 7.U)){
      mdu.io.valid := RegNext(true.B)
      div_en := true.B
      src_a_reg := aluIn1
      src_b_reg := aluIn2
      op_reg := io.func3
      f7_reg := io.func7
      io.stall := true.B
      dontTouch(f7_reg)
    }
    when(div_en){
      // io.stall := true.B
      when (counter < 32.U){
        io.stall := true.B
        mdu.io.src_a := src_a_reg
        mdu.io.src_b := src_b_reg
        mdu.io.op := op_reg
        // mdu.io.valid := true.B
        counter := counter + 1.U
      }.otherwise{
        mdu.io.valid := false.B
        div_en := false.B
        mdu.io.src_a := src_a_reg
        mdu.io.src_b := src_b_reg
        mdu.io.op := op_reg
        counter := 0.U
      }
    }//.otherwise{io.stall := false.B}
    when(div_en && f7_reg === 1.U && mdu.io.ready){
      io.ALUresult := Mux(mdu.io.output.valid, mdu.io.output.bits, 0.U)
    }
    .elsewhen (io.func7 === 1.U && mdu.io.ready){
      io.ALUresult := Mux(mdu.io.output.valid, mdu.io.output.bits, 0.U)
    }
    .otherwise{io.ALUresult := alu.io.result}
  }
  else {
    io.ALUresult := alu.io.result
  }
  // io.ALUresult := alu.io.result
  io.writeData := inputMux2
}
```
The `Execute` module handles arithmetic and logical operations, data forwarding, and related tasks.
```scala
val inputMux1 = MuxCase(
  0.U,
  Array(
    (fu.forwardA === 0.U) -> (io.readData1),
    (fu.forwardA === 1.U) -> (io.mem_result),
    (fu.forwardA === 2.U) -> (io.wb_result)
  )
)
val inputMux2 = MuxCase(
  0.U,
  Array(
    (fu.forwardB === 0.U) -> (io.readData2),
    (fu.forwardB === 1.U) -> (io.mem_result),
    (fu.forwardB === 2.U) -> (io.wb_result)
  )
)
val aluIn1 = MuxCase(
  inputMux1,
  Array(
    (io.ctl_aluSrc1 === 1.U) -> io.pcAddress,
    (io.ctl_aluSrc1 === 2.U) -> 0.U
  )
)
val aluIn2 = Mux(io.ctl_aluSrc, inputMux2, io.immediate)
```
This selects the appropriate inputs for the `ALU`.
* For `inputMux1` and `inputMux2` (forwarding):
  * If `fu.forwardA === 0.U` / `fu.forwardB === 0.U`, select `io.readData1` / `io.readData2`, the original register value.
  * Otherwise, if `fu.forwardA === 1.U` / `fu.forwardB === 1.U`, select `io.mem_result`, the result from the memory stage.
  * Otherwise, if `fu.forwardA === 2.U` / `fu.forwardB === 2.U`, select `io.wb_result`, the result from the writeback stage.
  * If none of these match, the `MuxCase` default `0.U` is selected.
* For `aluIn1`:
  * If `io.ctl_aluSrc1 === 1.U`, select `io.pcAddress`, the current program counter value.
  * Otherwise, if `io.ctl_aluSrc1 === 2.U`, select `0.U`, a constant zero.
  * Otherwise, select `inputMux1`, the result of the forwarding logic.
* For `aluIn2`:
  * If `io.ctl_aluSrc` is true, select `inputMux2`, the forwarded second operand.
  * Otherwise, select `io.immediate`, the immediate value encoded in the instruction.
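
The three selection stages can be expressed as plain-Scala functions. This is a behavioral sketch and not repository code; `AluInputs`, `forwardMux`, `aluIn1`, and `aluIn2` are hypothetical Scala names that mirror the signals in the excerpt.

```scala
// Sketch: ALU operand selection (assumption: plain Scala model of the MuxCase/Mux logic).
object AluInputs {
  // Shared shape of inputMux1 / inputMux2: forwarding select with a default of 0.
  def forwardMux(forward: Int, readData: Int, memResult: Int, wbResult: Int): Int =
    forward match {
      case 0 => readData   // no forwarding: register file value
      case 1 => memResult  // forward from the memory stage
      case 2 => wbResult   // forward from the writeback stage
      case _ => 0          // MuxCase default
    }

  def aluIn1(ctlAluSrc1: Int, inputMux1: Int, pcAddress: Int): Int = ctlAluSrc1 match {
    case 1 => pcAddress
    case 2 => 0
    case _ => inputMux1
  }

  def aluIn2(ctlAluSrc: Boolean, inputMux2: Int, immediate: Int): Int =
    if (ctlAluSrc) inputMux2 else immediate
}
```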
---
### Memory Access
The code can be found in `nucleusrv/src/main/scala/components/MemoryFetch.scala`.
```scala=
package nucleusrv.components
import chisel3._
import chisel3.util._
class MemoryFetch extends Module {
  val io = IO(new Bundle {
    val aluResultIn: UInt = Input(UInt(32.W))
    val writeData: UInt = Input(UInt(32.W))
    val writeEnable: Bool = Input(Bool())
    val readEnable: Bool = Input(Bool())
    val readData: UInt = Output(UInt(32.W))
    val stall: Bool = Output(Bool())
    val f3 = Input(UInt(3.W))
    val dccmReq = Decoupled(new MemRequestIO)
    val dccmRsp = Flipped(Decoupled(new MemResponseIO))
  })
  io.dccmRsp.ready := true.B
  val wdata = Wire(Vec(4, UInt(8.W)))
  val rdata = Wire(UInt(32.W))
  val offset = RegInit(0.U(2.W))
  val funct3 = RegInit(0.U(3.W))
  val offsetSW = io.aluResultIn(1,0)
  when(!io.dccmRsp.valid){
    funct3 := io.f3
    offset := io.aluResultIn(1,0)
  }.otherwise{
    funct3 := funct3
    offset := offset
  }
  wdata(0) := io.writeData(7,0)
  wdata(1) := io.writeData(15,8)
  wdata(2) := io.writeData(23,16)
  wdata(3) := io.writeData(31,24)
  /* Store Half Word */
  when(io.writeEnable && io.f3 === "b000".U){
    when(offsetSW === 0.U){
      io.dccmReq.bits.activeByteLane := "b0001".U
    }.elsewhen(offsetSW === 1.U){
      wdata(0) := io.writeData(15,8)
      wdata(1) := io.writeData(7,0)
      wdata(2) := io.writeData(23,16)
      wdata(3) := io.writeData(31,24)
      io.dccmReq.bits.activeByteLane := "b0010".U
    }.elsewhen(offsetSW === 2.U){
      wdata(0) := io.writeData(15,8)
      wdata(1) := io.writeData(23,16)
      wdata(2) := io.writeData(7,0)
      wdata(3) := io.writeData(31,24)
      io.dccmReq.bits.activeByteLane := "b0100".U
    }.otherwise{
      wdata(0) := io.writeData(15,8)
      wdata(1) := io.writeData(23,16)
      wdata(2) := io.writeData(31,24)
      wdata(3) := io.writeData(7,0)
      io.dccmReq.bits.activeByteLane := "b1000".U
    }
  }
  /* Store Half Word */
  .elsewhen(io.writeEnable && io.f3 === "b001".U){
    // offset will either be 0 or 2 since address will be 0x0000 or 0x0002
    when(offsetSW === 0.U){
      // data to be stored at lower 16 bits (15,0)
      io.dccmReq.bits.activeByteLane := "b0011".U
    }.elsewhen(offsetSW === 1.U){
      // data to be stored at lower 16 bits (15,0)
      io.dccmReq.bits.activeByteLane := "b0110".U
      wdata(0) := io.writeData(23,16)
      wdata(1) := io.writeData(7,0)
      wdata(2) := io.writeData(15,8)
      wdata(3) := io.writeData(31,24)
    }.otherwise{
      // data to be stored at upper 16 bits (31,16)
      io.dccmReq.bits.activeByteLane := "b1100".U
      wdata(2) := io.writeData(7,0)
      wdata(3) := io.writeData(15,8)
      wdata(0) := io.writeData(23,16)
      wdata(1) := io.writeData(31,24)
    }
  }
  /* Store Word */
  .otherwise{
    io.dccmReq.bits.activeByteLane := "b1111".U
  }
  io.dccmReq.bits.dataRequest := wdata.asUInt()
  io.dccmReq.bits.addrRequest := (io.aluResultIn & "h00001fff".U) >> 2
  io.dccmReq.bits.isWrite := io.writeEnable
  io.dccmReq.valid := Mux(io.writeEnable | io.readEnable, true.B, false.B)
  io.stall := (io.writeEnable || io.readEnable) && !io.dccmRsp.valid
  rdata := Mux(io.dccmRsp.valid, io.dccmRsp.bits.dataResponse, DontCare)
  when(io.readEnable) {
    when(funct3 === "b010".U) {
      // load word
      io.readData := rdata
    }
    .elsewhen(funct3 === "b000".U) {
      // load byte
      when(offset === "b00".U) {
        // addressing memory with 0,4,8...
        io.readData := Cat(Fill(24,rdata(7)),rdata(7,0))
      } .elsewhen(offset === "b01".U) {
        // addressing memory with 1,5,9...
        io.readData := Cat(Fill(24, rdata(15)),rdata(15,8))
      } .elsewhen(offset === "b10".U) {
        // addressing memory with 2,6,10...
        io.readData := Cat(Fill(24, rdata(23)),rdata(23,16))
      } .elsewhen(offset === "b11".U) {
        // addressing memory with 3,7,11...
        io.readData := Cat(Fill(24, rdata(31)),rdata(31,24))
      } .otherwise {
        // this condition would never occur but using to avoid Chisel generating VOID errors
        io.readData := DontCare
      }
    }
    .elsewhen(funct3 === "b100".U) {
      //load byte unsigned
      when(offset === "b00".U) {
        // addressing memory with 0,4,8...
        io.readData := Cat(Fill(24, 0.U), rdata(7, 0))
      }.elsewhen(offset === "b01".U) {
        // addressing memory with 1,5,9...
        io.readData := Cat(Fill(24, 0.U), rdata(15, 8))
      }.elsewhen(offset === "b10".U) {
        // addressing memory with 2,6,10...
        io.readData := Cat(Fill(24, 0.U), rdata(23, 16))
      }.elsewhen(offset === "b11".U) {
        // addressing memory with 3,7,11...
        io.readData := Cat(Fill(24, 0.U), rdata(31, 24))
      } .otherwise {
        // this condition would never occur but using to avoid Chisel generating VOID errors
        io.readData := DontCare
      }
    }
    .elsewhen(funct3 === "b101".U) {
      // load halfword unsigned
      when(offset === "b00".U) {
        // addressing memory with 0,4,8...
        io.readData := Cat(Fill(16, 0.U),rdata(15,0))
      } .elsewhen(offset === "b01".U) {
        // addressing memory with 2,6,10...
        io.readData := Cat(Fill(16, 0.U),rdata(23,8))
      } .elsewhen(offset === "b10".U) {
        // addressing memory with 2,6,10...
        io.readData := Cat(Fill(16, 0.U),rdata(31,16))
      } .otherwise {
        // this condition would never occur but using to avoid Chisel generating VOID errors
        io.readData := DontCare
      }
    }
    .elsewhen(funct3 === "b001".U) {
      // load halfword
      when(offset === "b00".U) {
        // addressing memory with 0,4,8...
        io.readData := Cat(Fill(16, rdata(15)),rdata(15,0))
      } .elsewhen(offset === "b01".U) {
        // addressing memory with 1,3,7...
        io.readData := Cat(Fill(16, rdata(23)),rdata(23,8))
      } .elsewhen(offset === "b10".U) {
        // addressing memory with 2,6,10...
        io.readData := Cat(Fill(16, rdata(31)),rdata(31,16))
      } .otherwise {
        // this condition would never occur but using to avoid Chisel generating VOID errors
        io.readData := DontCare
      }
    }
    .otherwise {
      // unknown func3 bits
      io.readData := DontCare
    }
  } .otherwise {
    io.readData := DontCare
  }
  when(io.writeEnable && io.aluResultIn(31, 28) === "h8".asUInt()){
    printf("%x\n", io.writeData)
  }
}
```
The `MemoryFetch` module handles read and write operations on the data memory, the DCCM (Data Closely Coupled Memory).

---
```scala
/* Store Half Word */
when(io.writeEnable && io.f3 === "b000".U){
  when(offsetSW === 0.U){
    io.dccmReq.bits.activeByteLane := "b0001".U
  }.elsewhen(offsetSW === 1.U){
    wdata(0) := io.writeData(15,8)
    wdata(1) := io.writeData(7,0)
    wdata(2) := io.writeData(23,16)
    wdata(3) := io.writeData(31,24)
    io.dccmReq.bits.activeByteLane := "b0010".U
  }.elsewhen(offsetSW === 2.U){
    wdata(0) := io.writeData(15,8)
    wdata(1) := io.writeData(23,16)
    wdata(2) := io.writeData(7,0)
    wdata(3) := io.writeData(31,24)
    io.dccmReq.bits.activeByteLane := "b0100".U
  }.otherwise{
    wdata(0) := io.writeData(15,8)
    wdata(1) := io.writeData(23,16)
    wdata(2) := io.writeData(31,24)
    wdata(3) := io.writeData(7,0)
    io.dccmReq.bits.activeByteLane := "b1000".U
  }
}
```
The comment says `Store Half Word`, but I think this branch actually handles `Store Byte (SB)`: in RV32I, `funct3 = 000` encodes `SB`, while `SH` uses `001`. When `io.writeEnable` is true and `io.f3 === "b000".U`, a single byte (`io.writeData(7,0)`) is stored at the target address.
`offsetSW` is the least significant 2 bits of the `ALU` result (the memory address) and selects where within the 32-bit word the byte is stored.
`activeByteLane` is a 4-bit mask indicating which bytes of the 32-bit word are written.
* If `offsetSW === 0.U`, the byte is stored in the least significant byte (bits 7-0) of the 32-bit word, and `io.dccmReq.bits.activeByteLane := "b0001".U`.
* If `offsetSW === 1.U`, the byte is moved to `wdata(1)`, the second least significant byte (bits 15-8), and `io.dccmReq.bits.activeByteLane := "b0010".U`.
* If `offsetSW === 2.U`, the byte is moved to `wdata(2)`, the second most significant byte (bits 23-16), and `io.dccmReq.bits.activeByteLane := "b0100".U`.
* Otherwise (`offsetSW === 3.U`), the byte is moved to `wdata(3)`, the most significant byte (bits 31-24), and `io.dccmReq.bits.activeByteLane := "b1000".U`.
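The lane selection above can be sketched as a small software model. This is plain Scala rather than Chisel, and `StoreByteModel` and `storeByte` are hypothetical names that do not exist in NucleusRV. The model only fills the active lane, whereas the hardware also drives the other lanes with bytes that the mask then ignores.

```scala
// Plain-Scala software model, not Chisel. StoreByteModel and storeByte are
// hypothetical names used only for this illustration.
object StoreByteModel {
  // Returns (activeByteLane mask, data word with the byte placed in its lane).
  def storeByte(addr: Int, writeData: Int): (Int, Int) = {
    val offset = addr & 0x3                          // low two address bits, like offsetSW
    val lane   = 1 << offset                         // one-hot mask: 0001, 0010, 0100 or 1000
    val data   = (writeData & 0xff) << (8 * offset)  // writeData(7,0) moved into that lane
    (lane, data)
  }

  def main(args: Array[String]): Unit = {
    for (addr <- 0x1000 to 0x1003) {
      val (lane, data) = storeByte(addr, 0xab)
      println(f"addr=0x$addr%04x lane=$lane%d data=0x$data%08x")
    }
  }
}
```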
---
```scala
/* Store Half Word */
.elsewhen(io.writeEnable && io.f3 === "b001".U){
// offset will either be 0 or 2 since address will be 0x0000 or 0x0002
when(offsetSW === 0.U){
// data to be stored at lower 16 bits (15,0)
io.dccmReq.bits.activeByteLane := "b0011".U
}.elsewhen(offsetSW === 1.U){
// data to be stored at middle 16 bits (23,8)
io.dccmReq.bits.activeByteLane := "b0110".U
wdata(0) := io.writeData(23,16)
wdata(1) := io.writeData(7,0)
wdata(2) := io.writeData(15,8)
wdata(3) := io.writeData(31,24)
}.otherwise{
// data to be stored at upper 16 bits (31,16)
io.dccmReq.bits.activeByteLane := "b1100".U
wdata(2) := io.writeData(7,0)
wdata(3) := io.writeData(15,8)
wdata(0) := io.writeData(23,16)
wdata(1) := io.writeData(31,24)
}
}
```
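When `io.writeEnable` is true and `io.f3 === "b001".U`, the module handles the `Store Half Word (SH)` operation.
The comment states that `offsetSW` will be either 0 or 2 since the address will be `0x0000` or `0x0002`, but the code also contains a branch for `offsetSW === 1.U`.
* If `offsetSW === 0.U`, the halfword is stored in the lower 16 bits (15-0) of the 32-bit word. `io.dccmReq.bits.activeByteLane := "b0011".U` indicates that the two least significant bytes should be written.
* If `offsetSW === 1.U`, the halfword is stored in the middle 16 bits (23-8). `io.dccmReq.bits.activeByteLane := "b0110".U`, and `io.writeData(7,0)` and `io.writeData(15,8)` are moved to `wdata(1)` and `wdata(2)`.
* Otherwise (`offsetSW` is 2 or 3), the halfword is stored in the upper 16 bits (31-16). `io.dccmReq.bits.activeByteLane := "b1100".U`, and `io.writeData(7,0)` and `io.writeData(15,8)` are moved to `wdata(2)` and `wdata(3)`.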
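As a rough cross-check of the three branches in the `SH` code, here is a plain-Scala model (not Chisel). `StoreHalfModel` and `storeHalf` are made-up names for this sketch, and inactive lanes are simply left at zero.

```scala
// Plain-Scala software model of the SH branches; names are hypothetical.
object StoreHalfModel {
  // Returns (activeByteLane mask, data word with the halfword in its lanes).
  def storeHalf(addr: Int, writeData: Int): (Int, Int) = {
    val half = writeData & 0xffff        // writeData(15,0)
    (addr & 0x3) match {
      case 0 => (0x3, half)              // b0011: bytes 0 and 1
      case 1 => (0x6, half << 8)         // b0110: bytes 1 and 2
      case _ => (0xc, half << 16)        // b1100: bytes 2 and 3 (the .otherwise branch)
    }
  }

  def main(args: Array[String]): Unit = {
    for (addr <- Seq(0x0, 0x1, 0x2)) {
      val (lane, data) = storeHalf(addr, 0x1234beef)
      println(f"offset=$addr%d lane=0x$lane%x data=0x$data%08x")
    }
  }
}
```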
---
```scala
/* Store Word */
.otherwise{
io.dccmReq.bits.activeByteLane := "b1111".U
}
```
This branch handles the `Store Word (SW)` operation. `io.dccmReq.bits.activeByteLane := "b1111".U` indicates that all four bytes of the 32-bit word are active for writing.
---
```scala
io.dccmReq.bits.dataRequest := wdata.asUInt()
io.dccmReq.bits.addrRequest := (io.aluResultIn & "h00001fff".U) >> 2
io.dccmReq.bits.isWrite := io.writeEnable
io.dccmReq.valid := Mux(io.writeEnable | io.readEnable, true.B, false.B)
```
This block prepares the memory request: it packs `wdata` into the data to be written (used only for a write), computes the word address by masking the byte address to its low 13 bits and shifting it right by 2, passes on the write enable flag, and asserts `valid` whenever a read or write is requested.
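To make the address arithmetic concrete, the expression `(io.aluResultIn & "h00001fff".U) >> 2` can be mirrored in plain Scala. `DccmAddress` and `wordIndex` are hypothetical names for this sketch. The mask keeps 13 address bits (an 8 KiB window) and the shift drops the 2 byte-offset bits, so the result is a word index between 0 and 2047, and byte addresses that differ by a multiple of `0x2000` map to the same word.

```scala
// Plain-Scala model of the DCCM address calculation; names are hypothetical.
object DccmAddress {
  // Keep the low 13 bits of the byte address, then convert to a word index.
  def wordIndex(byteAddr: Int): Int = (byteAddr & 0x1fff) >>> 2

  def main(args: Array[String]): Unit = {
    for (a <- Seq(0x0, 0x4, 0x7, 0x1ffc, 0x2000))
      println(f"byte address 0x$a%04x -> word index ${wordIndex(a)}%d")
  }
}
```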
---
```scala
io.stall := (io.writeEnable || io.readEnable) && !io.dccmRsp.valid
```
The stall logic asserts `io.stall` while a read or write is requested and the `DCCM` response is not yet valid, so the processor waits for the memory operation to complete before proceeding.
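The stall condition is a pure Boolean function, so it can be traced cycle by cycle in plain Scala. `StallModel` is a hypothetical name, and the three-cycle response timing below is invented purely for illustration; it is not a measured DCCM latency.

```scala
// Plain-Scala model of the stall expression; names and timing are hypothetical.
object StallModel {
  def stall(writeEnable: Boolean, readEnable: Boolean, rspValid: Boolean): Boolean =
    (writeEnable || readEnable) && !rspValid

  def main(args: Array[String]): Unit = {
    // A load whose response becomes valid in the third cycle (made-up timing).
    val rspValid = Seq(false, false, true)
    rspValid.zipWithIndex.foreach { case (valid, cycle) =>
      val s = stall(writeEnable = false, readEnable = true, rspValid = valid)
      println(s"cycle $cycle: rspValid=$valid stall=$s")
    }
  }
}
```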
---
```scala
rdata := Mux(io.dccmRsp.valid, io.dccmRsp.bits.dataResponse, DontCare)
```
Selects the data from the `DCCM` response if it's valid, otherwise sets it to `DontCare`.
---
```scala
when(io.readEnable) {
when(funct3 === "b010".U) {
// load word
io.readData := rdata
}
```
When `funct3 === "b010"`, it performs a full 32-bit word load, `Load Word (LW)`.
---
```scala
.elsewhen(funct3 === "b000".U) {
// load byte (sign-extended)
// ...
}
```
When `funct3 === "b000"`, it loads a single byte and sign-extends it to 32 bits, `Load Byte (LB)`. It uses the `offset` to determine which byte of the 32-bit word to load.
---
```scala
.elsewhen(funct3 === "b100".U) {
//load byte unsigned
// ...
}
```
When `funct3 === "b100"`, it performs `Load Byte Unsigned (LBU)`: similar to `Load Byte (LB)`, but the byte is zero-extended instead of sign-extended.
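The difference between `LB` and `LBU` is easiest to see on a byte whose top bit is set. The following is a plain-Scala sketch, not the Chisel implementation; `LoadByteModel`, `lb` and `lbu` are names invented for this example.

```scala
// Plain-Scala model of byte loads from a 32-bit word; names are hypothetical.
object LoadByteModel {
  // Pick the byte selected by the low two address bits.
  private def byteAt(word: Int, addr: Int): Int =
    (word >>> (8 * (addr & 0x3))) & 0xff

  def lb(word: Int, addr: Int): Int  = byteAt(word, addr).toByte.toInt // sign-extend
  def lbu(word: Int, addr: Int): Int = byteAt(word, addr)              // zero-extend

  def main(args: Array[String]): Unit = {
    val word = 0x12345680
    println(f"lb  offset 0 = 0x${lb(word, 0)}%08x")
    println(f"lbu offset 0 = 0x${lbu(word, 0)}%08x")
  }
}
```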
---
```scala
.elsewhen(funct3 === "b101".U) {
// load halfword unsigned
// ...
}
```
Loads a 16-bit halfword and zero-extends it to 32 bits, `Load Halfword Unsigned (LHU)`.
---
```scala
.elsewhen(funct3 === "b001".U) {
// load halfword
// ...
}
```
Loads a 16-bit halfword and sign-extends it to 32 bits, `Load Halfword (LH)`.
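A similar plain-Scala sketch covers the halfword loads. `LoadHalfModel`, `lh` and `lhu` are hypothetical names. Offsets 0, 1 and 2 select bits (15,0), (23,8) and (31,16) as in the Chisel code; offset 3 is rejected here because the Chisel code leaves that case as `DontCare`.

```scala
// Plain-Scala model of halfword loads from a 32-bit word; names are hypothetical.
object LoadHalfModel {
  private def halfAt(word: Int, addr: Int): Int = {
    val offset = addr & 0x3
    require(offset != 3, "offset 3 is DontCare in the Chisel code")
    (word >>> (8 * offset)) & 0xffff
  }

  def lh(word: Int, addr: Int): Int  = halfAt(word, addr).toShort.toInt // sign-extend
  def lhu(word: Int, addr: Int): Int = halfAt(word, addr)               // zero-extend

  def main(args: Array[String]): Unit = {
    val word = 0x80012345
    println(f"lh  offset 2 = 0x${lh(word, 2)}%08x")
    println(f"lhu offset 2 = 0x${lhu(word, 2)}%08x")
  }
}
```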
## RV32IM Instruction Introduction
### Multiplication Operations
**mul (Multiplication)**

**Format:** `mul rd,rs1,rs2`
**Description:** performs a 32-bit × 32-bit multiplication and places the lower 32 bits of the product in the destination register (the lower 32 bits are the same whether `rs1` and `rs2` are treated as signed or unsigned numbers).
**Implementation:** `x[rd] = x[rs1] * x[rs2]`
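A minimal software reference for `mul`, in plain Scala (not taken from NucleusRV; `MulModel` is a made-up name). Scala's `Int` multiplication already wraps to 32 bits, and the second helper shows that reading both operands as unsigned gives the same low 32 bits.

```scala
// Plain-Scala reference model for mul; names are hypothetical.
object MulModel {
  // Int multiplication wraps modulo 2^32, i.e. keeps the low 32 bits.
  def mul(rs1: Int, rs2: Int): Int = rs1 * rs2

  // Same operation with both operands read as unsigned 32-bit values.
  def mulUnsignedView(rs1: Int, rs2: Int): Int =
    ((rs1 & 0xffffffffL) * (rs2 & 0xffffffffL)).toInt

  def main(args: Array[String]): Unit = {
    println(f"mul(-2, 3)             = 0x${mul(-2, 3)}%08x")
    println(f"mulUnsignedView(-2, 3) = 0x${mulUnsignedView(-2, 3)}%08x")
  }
}
```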
---
**mulh (Multiplication Higher)**

**Format:** `mulh rd,rs1,rs2`
**Description:** performs a 32-bit × 32-bit multiplication and places the upper 32 bits of the 64-bit product in the destination register (both `rs1` and `rs2` treated as signed numbers).
**Implementation:** `x[rd] = (x[rs1] s*s x[rs2]) >>s 32`
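One way to model `mulh` in software is to widen both operands to `Long`, multiply, and shift arithmetically. This is a plain-Scala sketch with the hypothetical name `MulhModel`, not the NucleusRV implementation.

```scala
// Plain-Scala reference model for mulh; names are hypothetical.
object MulhModel {
  // toLong sign-extends, so the 64-bit product is the signed x signed product.
  def mulh(rs1: Int, rs2: Int): Int = ((rs1.toLong * rs2.toLong) >> 32).toInt

  def main(args: Array[String]): Unit = {
    println(f"mulh(0x40000000, 4) = 0x${mulh(0x40000000, 4)}%08x")
    println(f"mulh(-1, 1)         = 0x${mulh(-1, 1)}%08x")
  }
}
```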
---
**mulhsu (Multiplication Higher Signed Unsigned)**

**Format:** `mulhsu rd,rs1,rs2`
**Description:** performs a 32-bit × 32-bit multiplication and places the upper 32 bits of the 64-bit product in the destination register (`rs1` treated as a signed number, `rs2` treated as an unsigned number).
**Implementation:** `x[rd] = (x[rs1] s*u x[rs2]) >>s 32`
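For `mulhsu`, a software model can sign-extend `rs1` and zero-extend `rs2` before multiplying; the product still fits in a signed 64-bit `Long`. The sketch below is plain Scala and `MulhsuModel` is a hypothetical name.

```scala
// Plain-Scala reference model for mulhsu; names are hypothetical.
object MulhsuModel {
  def mulhsu(rs1: Int, rs2: Int): Int = {
    val signed   = rs1.toLong            // rs1 sign-extended
    val unsigned = rs2 & 0xffffffffL     // rs2 zero-extended
    ((signed * unsigned) >> 32).toInt    // arithmetic shift keeps the sign
  }

  def main(args: Array[String]): Unit = {
    // With rs2 = 0xffffffff read as unsigned, -1 * rs2 is negative.
    println(f"mulhsu(-1, 0xffffffff) = 0x${mulhsu(-1, 0xffffffff)}%08x")
  }
}
```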
---
**mulhu (Multiplication Higher Unsigned)**

**Format:** `mulhu rd,rs1,rs2`
**Description:** performs a 32-bit × 32-bit multiplication and places the upper 32 bits of the 64-bit product in the destination register (both `rs1` and `rs2` treated as unsigned numbers).
**Implementation:** `x[rd] = (x[rs1] u*u x[rs2]) >>u 32`
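For `mulhu`, both operands are zero-extended. The unsigned product can exceed the positive range of `Long`, but its 64-bit pattern is still exact, so a logical shift (`>>>`) recovers the upper half. This is a plain-Scala sketch and `MulhuModel` is a made-up name.

```scala
// Plain-Scala reference model for mulhu; names are hypothetical.
object MulhuModel {
  def mulhu(rs1: Int, rs2: Int): Int = {
    val a = rs1 & 0xffffffffL            // rs1 zero-extended
    val b = rs2 & 0xffffffffL            // rs2 zero-extended
    ((a * b) >>> 32).toInt               // logical shift: treat the product as unsigned
  }

  def main(args: Array[String]): Unit = {
    println(f"mulhu(0xffffffff, 0xffffffff) = 0x${mulhu(0xffffffff, 0xffffffff)}%08x")
  }
}
```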
---
### Division Operations
**div (Division)**

**Format:** `div rd,rs1,rs2`
**Description:** performs signed integer division of 32 bits by 32 bits (rounding towards zero).
**Implementation:** `x[rd] = x[rs1] /s x[rs2]`
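A small plain-Scala reference for `div` follows; `DivModel` is a hypothetical name. The two special cases (divisor zero, and the overflow case of the most negative value divided by -1) follow the RISC-V M extension specification rather than the description above.

```scala
// Plain-Scala reference model for div; names are hypothetical.
object DivModel {
  def div(rs1: Int, rs2: Int): Int =
    if (rs2 == 0) -1                                          // divide by zero: all bits set
    else if (rs1 == Int.MinValue && rs2 == -1) Int.MinValue   // signed overflow
    else rs1 / rs2                                            // truncates toward zero

  def main(args: Array[String]): Unit = {
    println(s"div(-7, 2) = ${div(-7, 2)}")
    println(s"div(7, 0)  = ${div(7, 0)}")
  }
}
```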
---
**divu (Division Unsigned)**

**Format:** `divu rd, rs1, rs2`
**Description:** performs unsigned integer division of 32 bits by 32 bits (rounding towards zero).
**Implementation:** `x[rd] = x[rs1] /u x[rs2]`
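For `divu`, the JVM's `Integer.divideUnsigned` treats both `Int` arguments as unsigned 32-bit values. The sketch below is plain Scala with the hypothetical name `DivuModel`; the divide-by-zero result (all bits set) is taken from the RISC-V M extension specification.

```scala
// Plain-Scala reference model for divu; names are hypothetical.
object DivuModel {
  def divu(rs1: Int, rs2: Int): Int =
    if (rs2 == 0) -1                                  // divide by zero: 0xffffffff
    else java.lang.Integer.divideUnsigned(rs1, rs2)   // operands read as unsigned

  def main(args: Array[String]): Unit = {
    println(f"divu(0xffffffff, 2) = 0x${divu(0xffffffff, 2)}%08x")
  }
}
```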
---
**rem (Remainder)**

**Format:** `rem rd, rs1, rs2`
**Description:** provides the remainder of the corresponding `div` operation (the sign of the result in `rd` equals the sign of `rs1`).
**Implementation:** `x[rd] = x[rs1] %s x[rs2]`
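A plain-Scala reference for `rem` is sketched below; `RemModel` is a made-up name. Scala's `%` already gives a result with the sign of the dividend. The results for a zero divisor (the dividend is returned) and for the overflow case (zero) follow the RISC-V M extension specification.

```scala
// Plain-Scala reference model for rem; names are hypothetical.
object RemModel {
  def rem(rs1: Int, rs2: Int): Int =
    if (rs2 == 0) rs1                                  // remainder by zero: the dividend
    else if (rs1 == Int.MinValue && rs2 == -1) 0       // overflow case of div
    else rs1 % rs2                                     // sign follows rs1

  def main(args: Array[String]): Unit = {
    println(s"rem(-7, 2) = ${rem(-7, 2)}")
    println(s"rem(7, -2) = ${rem(7, -2)}")
  }
}
```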
---
**remu (Remainder Unsigned)**

**Format:** `remu rd, rs1, rs2`
**Description:** provides the remainder of the corresponding `divu` operation.
**Implementation:** `x[rd] = x[rs1] %u x[rs2]`
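For `remu`, `Integer.remainderUnsigned` from the Java standard library reads both operands as unsigned. This is a plain-Scala sketch with the hypothetical name `RemuModel`; returning the dividend for a zero divisor follows the RISC-V M extension specification.

```scala
// Plain-Scala reference model for remu; names are hypothetical.
object RemuModel {
  def remu(rs1: Int, rs2: Int): Int =
    if (rs2 == 0) rs1                                     // remainder by zero: the dividend
    else java.lang.Integer.remainderUnsigned(rs1, rs2)    // operands read as unsigned

  def main(args: Array[String]): Unit = {
    println(s"remu(0xffffffff, 16) = ${remu(0xffffffff, 16)}")
  }
}
```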
## References
* https://github.com/merledu/nucleusrv
* https://hackmd.io/@sysprog/r1mlr3I7p#Single-cycle-RISC-V-CPU
* https://blog.csdn.net/raw_inputhello/article/details/135848711
* https://verilator.org/guide/latest/install.html
* https://github.com/riscv-collab/riscv-gnu-toolchain
* https://github.com/riscv-non-isa/riscv-arch-test/tree/1.0
* https://msyksphinz-self.github.io/riscv-isadoc/html/rvm.html
* https://docs.openhwgroup.org/projects/cva6-user-manual/01_cva6_user/RISCV_Instructions_RV32M.html