Evaluate NucleusRV

林趺菩

NucleusRV is a 32-bit 5-stage pipelined RISC-V core implemented in Chisel.

Prerequisites

Clone the nucleusrv repository:

$ git clone https://github.com/merledu/nucleusrv.git

I ran into difficulties building riscv-gnu-toolchain from source, so I followed an online guide and used a prebuilt release instead of building the toolchain from scratch.

Download and unpack the prebuilt riscv-gnu-toolchain release:

$ wget https://github.com/riscv-collab/riscv-gnu-toolchain/releases/download/2024.12.16/riscv32-elf-ubuntu-22.04-gcc-nightly-2024.12.16-nightly.tar.xz

$ tar -xvf riscv32-elf-ubuntu-22.04-gcc-nightly-2024.12.16-nightly.tar.xz

$ rm -f riscv32-elf-ubuntu-22.04-gcc-nightly-2024.12.16-nightly.tar.xz

I decided to use Docker to build the required environment efficiently.

The Dockerfile content:

# Start with Ubuntu 22.04 as the base image
FROM ubuntu:22.04

# set time zone to avoid some request when apt install
ENV TZ="Asia/Taipei"
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

# Update package lists and install required packages
RUN apt-get update && apt-get install -y \
    build-essential \
    verilator \
    gtkwave \
    curl \
    zip \
    unzip \
    sudo \
    bsdmainutils \
    && rm -rf /var/lib/apt/lists/*

# Set up SDKMAN
RUN curl -s "https://get.sdkman.io" | bash

# Set up environment for SDKMAN
ENV SDKMAN_DIR="/root/.sdkman"
ENV PATH="${PATH}:${SDKMAN_DIR}/bin:${SDKMAN_DIR}/candidates/java/current/bin:${SDKMAN_DIR}/candidates/sbt/current/bin"

# Install Java and SBT using SDKMAN
RUN bash -c "source ${SDKMAN_DIR}/bin/sdkman-init.sh && \
    sdk install java 11.0.21-tem && \
    sdk install sbt"

# Set the working directory
WORKDIR /nucleusrv


############################ for riscv-gnu-toolchain ############################

# Copy the riscv directory to /opt/riscv
COPY ./riscv /opt/riscv

# Add /opt/riscv/bin to PATH
RUN echo "export PATH=/opt/riscv/bin:$PATH" >> /root/.bashrc

############################ for riscv-gnu-toolchain ############################

# Set the default command
CMD ["/bin/bash", "-c", "source /root/.bashrc && /bin/bash"]

The script to build the docker image.

$ bash build.sh

The script to run the docker container.

$ bash run.sh

NucleusRV Demo

First run the docker container.

$ bash run.sh

The terminal will look like this:

root@0e76ca700801:/app# ls
Dockerfile  build.sh  nucleusrv  riscv  run.sh

Building C Programs (hello_world)

Referencing the steps that the nucleusrv repository gives:

$ cd nucleusrv/tools/

$ make PROGRAM=hello_world

The terminal output:

rm -rf out
riscv32-unknown-elf-gcc -c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer   -c -o tests/hello_world/hello.o tests/hello_world/hello.c
riscv32-unknown-elf-gcc -c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer   -c -o tests/hello_world/main.o tests/hello_world/main.c
riscv32-unknown-elf-gcc -c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer   -c -o tests/hello_world/world.o tests/hello_world/world.c
riscv32-unknown-elf-gcc -march=rv32im -mabi=ilp32 -static -nostdlib -nostartfiles -T link.ld tests/hello_world/hello.o tests/hello_world/main.o tests/hello_world/world.o -o out/program.elf -lgcc
riscv32-unknown-elf-objdump --disassemble-all --section=.text out/program.elf > out/program.dump
python3 makehex.py out/program.elf 2048 > out/program.hex

The corresponding program.dump, program.elf, program.hex files will be generated under nucleusrv/tools/out/

The content of program.dump:


out/program.elf:     file format elf32-littleriscv


Disassembly of section .text:

00000000 <hello>:
   0:	ff010113          	addi	sp,sp,-16
   4:	00400793          	li	a5,4
   8:	00f12623          	sw	a5,12(sp)
   c:	00500793          	li	a5,5
  10:	00f12423          	sw	a5,8(sp)
  14:	00c12703          	lw	a4,12(sp)
  18:	00812783          	lw	a5,8(sp)
  1c:	00f707b3          	add	a5,a4,a5
  20:	00f12223          	sw	a5,4(sp)
  24:	00412783          	lw	a5,4(sp)
  28:	00078513          	mv	a0,a5
  2c:	01010113          	addi	sp,sp,16
  30:	00008067          	ret

00000034 <main>:
  34:	fe010113          	addi	sp,sp,-32
  38:	00112e23          	sw	ra,28(sp)
  3c:	fc5ff0ef          	jal	0 <hello>
  40:	00a12623          	sw	a0,12(sp)
  44:	02c000ef          	jal	70 <world>
  48:	00a12423          	sw	a0,8(sp)
  4c:	00c12703          	lw	a4,12(sp)
  50:	00812783          	lw	a5,8(sp)
  54:	00f707b3          	add	a5,a4,a5
  58:	00f12223          	sw	a5,4(sp)
  5c:	00000793          	li	a5,0
  60:	00078513          	mv	a0,a5
  64:	01c12083          	lw	ra,28(sp)
  68:	02010113          	addi	sp,sp,32
  6c:	00008067          	ret

00000070 <world>:
  70:	00500793          	li	a5,5
  74:	00078513          	mv	a0,a5
  78:	00008067          	ret

Building with SBT

Referencing the steps that the nucleusrv repository gives:

Moving to nucleusrv directory:

$ cd ..

Opening SBT server:

$ sbt

The terminal output:

...

[info] loading settings for project nucleusrv from build.sbt ...
[info] set current project to nucleusrv (in build file:/app/nucleusrv/)
[info] sbt server started at local:///root/.sbt/1.0/server/0aa2831cde32e66c128a/sock
[info] started sbt server

Running SBT test:

$ testOnly nucleusrv.components.TopTest -- -DwriteVcd=1 -DprogramFile=/app/nucleusrv/tools/out/program.hex

-DwriteVcd=1: enables VCD (Value Change Dump) file generation, which is useful for waveform viewing and debugging.
-DprogramFile=/app/nucleusrv/tools/out/program.hex: specifies the path to the program file (in hexadecimal format) that the test will run.

The terminal output:

...

Enabling waves..
Exit Code: 0
[info] - Top Test
[info] Run completed in 4 seconds, 903 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 14 s, completed Jan 9, 2025, 6:29:50 PM
sbt:nucleusrv>

To exit the sbt shell, press CTRL+D.

Running Compliance Tests

Referencing the steps that the nucleusrv repository gives:

Cloning riscv-arch-test repository under nucleusrv.

$ git clone git@github.com:riscv-non-isa/riscv-arch-test.git -b 1.0

The default run_compliance.sh uses the riscv64 toolchain prefix, so I changed it to riscv32.

Running compliance tests:

$ bash run_compliance.sh rv32i

The terminal output shows some errors:

/app/nucleusrv/test_run_dir/Top_Test/VTop exists.
make  \
        RISCV_TARGET=nucleusrv \
        RISCV_DEVICE=rv32i \
        RISCV_PREFIX=riscv32-unknown-elf- \
        clean -C /app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i
make[1]: Entering directory '/app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i'
rm -rf /app/nucleusrv/riscv-arch-test/work
make[1]: Leaving directory '/app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i'
make  \
        RISCV_TARGET=nucleusrv \
        RISCV_DEVICE=rv32i \
        RISCV_PREFIX=riscv32-unknown-elf- \
        run -C /app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i
make[1]: Entering directory '/app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i'
Compile /app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf
src/I-MISALIGN_JMP-01.S: Assembler messages:
src/I-MISALIGN_JMP-01.S:48: Error: unrecognized opcode `csrrw x31,mtvec,x1', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:51: Error: unrecognized opcode `csrrci x0,misa,4', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:273: Error: unrecognized opcode `csrw mtvec,x31', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:282: Error: unrecognized opcode `csrr x30,mtval', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:284: Error: unrecognized opcode `csrw mepc,x30', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:287: Error: unrecognized opcode `csrr x30,mtval', extension `zicsr' required
src/I-MISALIGN_JMP-01.S:292: Error: unrecognized opcode `csrr x30,mcause', extension `zicsr' required
riscv32-unknown-elf-objcopy: '/app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf': No such file
riscv32-unknown-elf-objcopy: '/app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf': No such file
riscv32-unknown-elf-objdump: '/app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf': No such file
hexdump: /app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf.text.bin: No such file or directory
hexdump: all input file arguments failed
hexdump: /app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf.data.bin: No such file or directory
hexdump: all input file arguments failed
make[1]: *** [Makefile:50: /app/nucleusrv/riscv-arch-test/work/rv32i/I-MISALIGN_JMP-01.elf] Error 1
make[1]: Leaving directory '/app/nucleusrv/riscv-arch-test/riscv-test-suite/rv32i'
make: *** [Makefile:79: simulate] Error 2

The errors all involve RISC-V Control and Status Register (CSR) instructions. The assembler rejects them because this toolchain treats CSR instructions as part of the separate Zicsr extension, which -march=rv32i alone does not include.

To resolve this, I need to enable the Zicsr extension explicitly when compiling the tests.

I modified the Makefile at nucleusrv/riscv-arch-test/riscv-test-suite/rv32i/Makefile: at line 48, I changed the -march flag from rv32i to rv32i_zicsr.
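The same edit can be scripted. This is only a sketch: it assumes the flag appears literally as -march=rv32i in the Makefile, and the sample line below is hypothetical, not copied from the repository.

```python
import re

def enable_zicsr(makefile_text: str) -> str:
    """Rewrite -march=rv32i to -march=rv32i_zicsr, leaving other ISA strings alone."""
    # (?!\w) stops the pattern from matching rv32im or an already patched rv32i_zicsr.
    return re.sub(r"-march=rv32i(?!\w)", "-march=rv32i_zicsr", makefile_text)

# Hypothetical compile line, for illustration only.
sample = "$(RISCV_GCC) -march=rv32i -mabi=ilp32 -c $< -o $@"
patched = enable_zicsr(sample)
print(patched)
```

Running the function twice changes nothing the second time, so it is safe to re-apply.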

Running compliance tests again:

$ bash run_compliance.sh rv32i

The terminal output still shows errors:

...

Check                  I-SW-011d0
< ffffffff
8d6
< 7fffffff
14,15d11
< ffffffff
< fffff801
24,25d19
< 7fffffff
< 00000001
31d24
< fffff801
 ... FAIL
Check                 I-XOR-011d0
< 00000000
5d3
< 80000000
9d6
< 7ffffffe
13,15d9
< ffffea33
< 00000000
< fffff800
25d18
< 7ffffffe
33d25
< ffffffff
 ... FAIL
Check                I-XORI-014d3
< ffffffff
7d5
< f89abb21
20d17
< ffffffff
23d19
< f89abb21
25d20
< fffff801
30,31d24
< 00000000
< fffff800
 ... FAIL
--------------------------------
FAIL: 48/48 RISCV_TARGET=nucleusrv RISCV_DEVICE=rv32i RISCV_ISA=rv32i

make: *** [Makefile:86: verify] Error 1

All 48 tests appear to fail, which doesn't make sense.

To debug this, I want to compare the actual output against the golden data.

Taking I-ADD-01 as an example, I compared nucleusrv/riscv-arch-test/work/rv32i/I-ADD-01.signature.output with nucleusrv/riscv-arch-test/riscv-test-suite/rv32i/references/I-ADD-01.reference_output.
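One way to make this comparison less manual is to classify the differences instead of reading them by eye. The sketch below assumes both files hold one hexadecimal word per line; the function names are my own and are not part of riscv-arch-test.

```python
import difflib
from pathlib import Path

def read_words(path):
    """Return the non-empty lines of a signature file, stripped and lower-cased."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip().lower() for ln in lines if ln.strip()]

def summarize(actual, expected):
    """Count reference words that are missing, extra, or changed in the actual output."""
    matcher = difflib.SequenceMatcher(a=expected, b=actual, autojunk=False)
    counts = {"missing": 0, "extra": 0, "changed": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":        # present only in the reference
            counts["missing"] += i2 - i1
        elif tag == "insert":      # present only in the actual output
            counts["extra"] += j2 - j1
        elif tag == "replace":     # present in both, but with different values
            counts["changed"] += max(i2 - i1, j2 - j1)
    return counts

# In-memory example; with real files, pass read_words(path) for each argument.
expected = ["ffffffff", "00000000", "7fffffff"]
actual = ["00000000", "7fffffff"]
print(summarize(actual, expected))
```

A result dominated by "missing" or "extra" would point at how the signature is dumped, while "changed" would point at wrong computed values.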

This problem has not been solved yet …

NucleusRV explanation

Instruction Fetch

The InstructionFetch module is designed to fetch instructions from memory based on a given address.

Code can be found in nucleusrv/src/main/scala/components/InstructionFetch.scala

package nucleusrv.components
import chisel3._
import chisel3.util._ 


class InstructionFetch extends Module {
  val io = IO(new Bundle {
    val address: UInt = Input(UInt(32.W))
    val instruction: UInt = Output(UInt(32.W))
    val stall: Bool = Input(Bool())
    val coreInstrReq = Decoupled(new MemRequestIO)
    val coreInstrResp = Flipped(Decoupled(new MemResponseIO))
  })

  val rst = Wire(Bool())
  rst := reset.asBool()
  io.coreInstrResp.ready := true.B

...

io.coreInstrReq.bits.activeByteLane := "b1111".U

Indicating that all four bytes of a 32-bit word are active.

io.coreInstrReq.bits.isWrite := false.B

Indicating that this is a read operation.

io.coreInstrReq.bits.dataRequest := DontCare

Since we're performing a read operation (fetching an instruction), we don't need to specify any data to write.

io.coreInstrReq.bits.addrRequest := io.address >> 2

Sets the address for the memory request. The input address io.address is right-shifted by 2 bits, which is equivalent to dividing by 4. This operation converts the byte address to the word address.
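As a quick check of the shift, here is a small Python model (not Chisel) applied to the function addresses from program.dump above:

```python
def word_address(byte_address: int) -> int:
    """Drop the two low bits: the index of the 32-bit word that contains the byte."""
    return (byte_address & 0xFFFFFFFF) >> 2

# 0x00, 0x34 and 0x70 are the start addresses of hello, main and world.
for pc in (0x00, 0x34, 0x70):
    print(f"byte address 0x{pc:02x} -> word address {word_address(pc)}")
```

Any of the four byte addresses inside one word map to the same word address, e.g. 0x34 through 0x37 all give 13.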

io.coreInstrReq.valid := Mux(rst || io.stall, false.B, true.B)

Ensures that no instruction fetch requests are made when the system is being reset or when the pipeline is stalled.

io.instruction := Mux(io.coreInstrResp.valid, io.coreInstrResp.bits.dataResponse, DontCare)

When the memory response is valid, its data is passed through as the instruction; otherwise the output is assigned DontCare, so its value is unspecified.

Instruction Decode

The InstructionDecode stage is responsible for decoding instructions.

Code can be found in nucleusrv/src/main/scala/components/InstructionDecode.scala

package nucleusrv.components
import chisel3._

class InstructionDecode(TRACE:Boolean) extends Module {
  val io = IO(new Bundle {
    val id_instruction = Input(UInt(32.W))
    val writeData = Input(UInt(32.W))
    val writeReg = Input(UInt(5.W))
    val pcAddress = Input(UInt(32.W))
    val ctl_writeEnable = Input(Bool())
    val id_ex_mem_read = Input(Bool())
//    val ex_mem_mem_write = Input(Bool())
    val ex_mem_mem_read = Input(Bool())
    val dmem_resp_valid = Input(Bool())
    val id_ex_rd = Input(UInt(5.W))
    val ex_mem_rd = Input(UInt(5.W))
    val id_ex_branch = Input(Bool())
    //for forwarding
    val ex_mem_ins = Input(UInt(32.W))
    val mem_wb_ins = Input(UInt(32.W))
    val ex_ins = Input(UInt(32.W))
    val ex_result = Input(UInt(32.W))
    val ex_mem_result = Input(UInt(32.W))
    val mem_wb_result = Input(UInt(32.W))
    
    //Outputs
    val immediate = Output(UInt(32.W))
    val writeRegAddress = Output(UInt(5.W))
    val readData1 = Output(UInt(32.W))
    val readData2 = Output(UInt(32.W))
    val func7 = Output(UInt(7.W))
    val func3 = Output(UInt(3.W))
    val ctl_aluSrc = Output(Bool())
    val ctl_memToReg = Output(UInt(2.W))
    val ctl_regWrite = Output(Bool())
    val ctl_memRead = Output(Bool())
    val ctl_memWrite = Output(Bool())
    val ctl_branch = Output(Bool())
    val ctl_aluOp = Output(UInt(2.W))
    val ctl_jump = Output(UInt(2.W))
    val ctl_aluSrc1 = Output(UInt(2.W))
    val hdu_pcWrite = Output(Bool())
    val hdu_if_reg_write = Output(Bool())
    val pcSrc = Output(Bool())
    val pcPlusOffset = Output(UInt(32.W))
    val ifid_flush = Output(Bool())

    val stall = Output(Bool())

    // RVFI pins
    val rs_addr = if (TRACE) Some(Output(Vec(2, UInt(5.W)))) else None
  })

  //Hazard Detection Unit
  val hdu = Module(new HazardUnit)

...

  //Control Unit
  val control = Module(new Control)
  
...

  //Register File
  val registers = Module(new Registers)

...
  

  val immediate = Module(new ImmediateGen)
  immediate.io.instruction := io.id_instruction
  io.immediate := immediate.io.out

...

  //Branch Unit
  val bu = Module(new BranchUnit)

...

The InstructionDecode module instantiates several sub-modules to perform specific tasks. The HazardUnit module is used to detect and handle hazards in the pipeline. The Control module generates control signals based on the instruction opcode. The Registers module represents the register file, which stores and retrieves register values. The ImmediateGen module generates immediate values from the instruction. The BranchUnit module evaluates branch conditions, and calculates the target address for branches and jumps.

when(hdu.io.ctl_mux && io.id_instruction =/= "h13".U) {
io.ctl_memWrite := control.io.memWrite
io.ctl_regWrite := control.io.regWrite

}.otherwise {
io.ctl_memWrite := false.B
io.ctl_regWrite := false.B
}

It allows normal operation when the HDU (Hazard Detection Unit) indicates it's safe and the instruction is not a NOP (No Operation). It disables memory and register writes when there's a hazard or when processing a NOP instruction.
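The gating can be modelled in a few lines of Python. This is a behavioural sketch only; it assumes, as described above, that hdu.io.ctl_mux being true means no hazard was detected.

```python
NOP = 0x00000013  # addi x0, x0, 0, the value compared against "h13".U

def gated_writes(ctl_mux: bool, instruction: int, mem_write: bool, reg_write: bool):
    """Return (ctl_memWrite, ctl_regWrite) as the when/otherwise block drives them."""
    if ctl_mux and instruction != NOP:
        return mem_write, reg_write
    return False, False

# sw a5,12(sp) from program.dump (0x00f12623) keeps its memory write when it is safe...
print(gated_writes(True, 0x00F12623, mem_write=True, reg_write=False))
# ...and loses it when the hazard unit deasserts ctl_mux.
print(gated_writes(False, 0x00F12623, mem_write=True, reg_write=False))
```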

 //Forwarding to fix structural hazard
  when(io.ctl_writeEnable && (io.writeReg === registerRs1)){
    when(registerRs1 === 0.U){
      io.readData1 := 0.U
    }.otherwise{
      io.readData1 := io.writeData
    }
  }.otherwise{
    io.readData1 := registers.io.readData(0)
  }
  when(io.ctl_writeEnable && (io.writeReg === registerRs2)){
    when(registerRs2 === 0.U){
      io.readData2 := 0.U
    }.otherwise{
      io.readData2 := io.writeData
    }
  }.otherwise{
    io.readData2 := registers.io.readData(1)
  }

This forwarding logic serves to resolve structural hazards. It handles the case where a register is being written to and read from in the same cycle. Instead of waiting for the write to complete and then reading (which would introduce a delay), it forwards the data being written directly to the read output. It maintains the behavior of the zero register (always reading as 0) even in forwarding situations.
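A minimal Python model of one read port shows the three cases (bypass, zero register, normal read). The function and argument names are mine; the register file is modelled as a plain list.

```python
def read_port(regfile, rs, write_enable, write_reg, write_data):
    """Model of io.readData1 / io.readData2 with the same-cycle write bypass."""
    if write_enable and write_reg == rs:
        # x0 must read as zero even if an instruction names it as destination.
        return 0 if rs == 0 else write_data
    return regfile[rs]

regfile = [0] * 32
regfile[5] = 111  # stale value of x5 still in the register file

print(read_port(regfile, 5, True, 5, 222))   # x5 is being written: forward 222
print(read_port(regfile, 5, False, 5, 222))  # no write this cycle: read 111
print(read_port(regfile, 0, True, 0, 222))   # x0 always reads 0
```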

// Branch Forwarding
  val input1 = Wire(UInt(32.W))
  val input2 = Wire(UInt(32.W))

  when(registerRs1 === io.ex_mem_ins(11, 7)) {
    input1 := io.ex_mem_result
  }.elsewhen(registerRs1 === io.mem_wb_ins(11, 7)) {
      input1 := io.mem_wb_result
    }
    .otherwise {
      input1 := io.readData1
    }
  when(registerRs2 === io.ex_mem_ins(11, 7)) {
    input2 := io.ex_mem_result
  }.elsewhen(registerRs2 === io.mem_wb_ins(11, 7)) {
      input2 := io.mem_wb_result
    }
    .otherwise {
      input2 := io.readData2
    }

The branch forwarding logic resolves data hazards specifically for branch instructions.

If registerRs1 / registerRs2 matches the destination register of the instruction in the EX/MEM stage, input1 / input2 is set to the result from that stage.

Else if registerRs1 / registerRs2 matches the destination register of the instruction in the MEM/WB stage, input1 / input2 is set to the result from that stage.

Otherwise, input1 / input2 is set to the value read from the register file.
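The priority order (EX/MEM first, then MEM/WB, then the register file) can be sketched in Python. This is a model of the quoted when/elsewhen chain, not the Chisel itself; the helper names are hypothetical.

```python
def rd_of(instruction: int) -> int:
    """Destination register field, bits 11..7 of a RISC-V instruction."""
    return (instruction >> 7) & 0x1F

def branch_operand(rs, ex_mem_ins, ex_mem_result, mem_wb_ins, mem_wb_result, read_data):
    """Pick the branch comparison operand with the same priority as the Chisel chain."""
    if rs == rd_of(ex_mem_ins):
        return ex_mem_result      # newest value wins
    if rs == rd_of(mem_wb_ins):
        return mem_wb_result
    return read_data              # no match: use the register file value

# Two older instructions both write x15 (a5); the EX/MEM one is more recent.
writes_x15 = (15 << 7) | 0x33
writes_x14 = (14 << 7) | 0x33
print(branch_operand(15, writes_x15, 9, writes_x15, 4, 1))  # EX/MEM result
print(branch_operand(15, writes_x14, 9, writes_x15, 4, 1))  # MEM/WB result
print(branch_operand(13, writes_x14, 9, writes_x15, 4, 1))  # register file
```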

  //Forwarding for Jump
  val j_offset = Wire(UInt(32.W))
    when(registerRs1 === io.ex_ins(11, 7)){
      j_offset := io.ex_result
    }.elsewhen(registerRs1 === io.ex_mem_ins(11, 7)) {
    j_offset := io.ex_mem_result
  }.elsewhen(registerRs1 === io.mem_wb_ins(11, 7)) {
    j_offset := io.mem_wb_result
  }.elsewhen(registerRs1 === io.ex_ins(11, 7)){
    j_offset := io.ex_result
  }.otherwise {
      j_offset := io.readData1
    }

The forwarding logic resolves data hazards that can occur when a jump instruction depends on the result of a recent instruction that hasn't yet been written back to the register file.

If registerRs1 matches the destination register of the instruction in the EX stage, j_offset is set to the result from that stage.

Else if registerRs1 matches the destination register of the instruction in the EX/MEM stage, j_offset is set to the result from that stage.

Else if registerRs1 matches the destination register of the instruction in the MEM/WB stage, j_offset is set to the result from that stage.

The final elsewhen repeats the EX-stage check from the first condition, so it can never be selected (likely a leftover in the code).

If none of the above conditions are met, j_offset is set to the value read from the register file io.readData1.

//Offset Calculation (Jump/Branch)
  when(io.ctl_jump === 1.U) {
    io.pcPlusOffset := io.pcAddress + io.immediate
  }.elsewhen(io.ctl_jump === 2.U) {
      io.pcPlusOffset := j_offset + io.immediate
    }
    .otherwise {
      io.pcPlusOffset := io.pcAddress + immediate.io.out
    }

  when(bu.io.taken || io.ctl_jump =/= 0.U) {
    io.pcSrc := true.B
  }.otherwise {
    io.pcSrc := false.B
  }

The code handles offset calculation for jump and branch instructions. It calculates the next program counter (PC) value based on the type of control flow instruction (jump/branch) and determines whether the PC should be updated.

When io.ctl_jump === 1.U, the jump target is relative to the current program counter pcAddress: pcPlusOffset = pcAddress + immediate. This is typically the case for jal (jump and link).

When io.ctl_jump === 2.U, the jump target is relative to the register value j_offset: pcPlusOffset = j_offset + immediate. This is typically the case for jalr (jump and link register).

Otherwise no jump is indicated, and the value is computed as for a branch: pcPlusOffset = pcAddress + immediate. This covers branch instructions like beq, bne, etc., where the offset is relative to the current PC.

If bu.io.taken || io.ctl_jump =/= 0.U is true, meaning that either a branch is taken or a jump instruction is present, io.pcSrc is set to true.B, indicating that the program counter should be updated to the new target address.

Otherwise, neither a branch is taken nor a jump is present, so io.pcSrc is set to false.B and the program counter continues sequentially.
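To check the arithmetic, here is a small Python model using the two jal instructions of main from program.dump. The 32-bit wrap-around is modelled with a mask; the immediates are the sign-extended offsets implied by the disassembly.

```python
MASK32 = 0xFFFFFFFF

def pc_plus_offset(ctl_jump: int, pc: int, immediate: int, j_offset: int) -> int:
    """Target address: register-relative for ctl_jump == 2, PC-relative otherwise."""
    base = j_offset if ctl_jump == 2 else pc
    return (base + immediate) & MASK32

def pc_src(branch_taken: bool, ctl_jump: int) -> bool:
    """True when the PC must be redirected to pc_plus_offset."""
    return branch_taken or ctl_jump != 0

# 0x3c: jal 0 <hello>   -> offset is -0x3c, i.e. 0xFFFFFFC4 as a 32-bit value
print(hex(pc_plus_offset(1, 0x3C, 0xFFFFFFC4, 0)))
# 0x44: jal 70 <world>  -> offset is +0x2c
print(hex(pc_plus_offset(1, 0x44, 0x2C, 0)))
# ret is jalr x0, 0(ra): the target comes from the register, not from the PC
print(hex(pc_plus_offset(2, 0x30, 0, 0x40)))
```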

//Instruction Flush
  io.ifid_flush := hdu.io.ifid_flush

  io.writeRegAddress := io.id_instruction(11, 7)
  io.func3 := io.id_instruction(14, 12)
  when((io.id_instruction(6,0) === "b0110011".U) | ((io.id_instruction(6,0) === "b0010011".U) & (io.func3 === 5.U))){
    io.func7 := io.id_instruction(31,25)
  }.otherwise{
    io.func7 := 0.U
  }

  io.stall := io.func7 === 1.U && (io.func3 === 4.U || io.func3 === 5.U || io.func3 === 6.U || io.func3 === 7.U)

This code passes on the flush signal from the hazard unit, extracts specific fields from the instruction, and determines whether a stall is necessary.

func7 is taken from bits 31-25 only when the opcode (bits 6-0) is "0110011" (R-type), or "0010011" (I-type) with func3 == 5, which is the encoding of the immediate right shifts srli and srai. Otherwise func7 is set to 0.

A stall is requested when func7 is 1 and func3 is 4, 5, 6, or 7.

In the RV32M extension this combination encodes div, divu, rem and remu, which need additional processing time and therefore stall the pipeline.
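The field extraction and the stall condition can be tried out with a Python model. The r_type helper is my own, used only to build test encodings from the standard R-type layout.

```python
def decode_fields(ins: int) -> dict:
    """Model of the rd / func3 / func7 / stall logic in InstructionDecode."""
    opcode = ins & 0x7F
    func3 = (ins >> 12) & 0x7
    uses_func7 = opcode == 0b0110011 or (opcode == 0b0010011 and func3 == 5)
    func7 = (ins >> 25) & 0x7F if uses_func7 else 0
    stall = func7 == 1 and func3 in (4, 5, 6, 7)
    return {"rd": (ins >> 7) & 0x1F, "func3": func3, "func7": func7, "stall": stall}

def r_type(func7, rs2, rs1, func3, rd, opcode=0b0110011) -> int:
    """Assemble an R-type instruction word (hypothetical helper for the examples)."""
    return (func7 << 25) | (rs2 << 20) | (rs1 << 15) | (func3 << 12) | (rd << 7) | opcode

div_ins = r_type(1, 7, 6, 4, 5)   # div x5, x6, x7
mul_ins = r_type(1, 7, 6, 0, 5)   # mul x5, x6, x7
add_ins = 0x00F707B3              # add a5,a4,a5 from program.dump

for name, ins in (("div", div_ins), ("mul", mul_ins), ("add", add_ins)):
    print(name, decode_fields(ins))
```

Only the division encodes a stall here; mul shares func7 == 1 but has func3 == 0.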

Execute

Code can be found in nucleusrv/src/main/scala/components/Execute.scala

package nucleusrv.components
import chisel3._
import chisel3.util.MuxCase

class Execute(M:Boolean = false) extends Module {
  val io = IO(new Bundle {
    val immediate = Input(UInt(32.W))
    val readData1 = Input(UInt(32.W))
    val readData2 = Input(UInt(32.W))
    val pcAddress = Input(UInt(32.W))
    val func7 = Input(UInt(7.W))
    val func3 = Input(UInt(3.W))
    val mem_result = Input(UInt(32.W))
    val wb_result = Input(UInt(32.W))

    val ex_mem_regWrite = Input(Bool())
    val mem_wb_regWrite = Input(Bool())
    val id_ex_ins = Input(UInt(32.W))
    val ex_mem_ins = Input(UInt(32.W))
    val mem_wb_ins = Input(UInt(32.W))

    val ctl_aluSrc = Input(Bool())
    val ctl_aluOp = Input(UInt(2.W))
    val ctl_aluSrc1 = Input(UInt(2.W))

    val writeData = Output(UInt(32.W))
    val ALUresult = Output(UInt(32.W))

    val stall = Output(Bool())
  })

  val alu = Module(new ALU)
  val aluCtl = Module(new AluControl)
  val fu = Module(new ForwardingUnit).io

  // Forwarding Unit

  fu.ex_regWrite := io.ex_mem_regWrite
  fu.mem_regWrite := io.mem_wb_regWrite
  fu.ex_reg_rd := io.ex_mem_ins(11, 7)
  fu.mem_reg_rd := io.mem_wb_ins(11, 7)
  fu.reg_rs1 := io.id_ex_ins(19, 15)
  fu.reg_rs2 := io.id_ex_ins(24, 20)

  val inputMux1 = MuxCase(
    0.U,
    Array(
      (fu.forwardA === 0.U) -> (io.readData1),
      (fu.forwardA === 1.U) -> (io.mem_result),
      (fu.forwardA === 2.U) -> (io.wb_result)
    )
  )
  val inputMux2 = MuxCase(
    0.U,
    Array(
      (fu.forwardB === 0.U) -> (io.readData2),
      (fu.forwardB === 1.U) -> (io.mem_result),
      (fu.forwardB === 2.U) -> (io.wb_result)
    )
  )

  val aluIn1 = MuxCase(
    inputMux1,
    Array(
      (io.ctl_aluSrc1 === 1.U) -> io.pcAddress,
      (io.ctl_aluSrc1 === 2.U) -> 0.U
    )
  )
  val aluIn2 = Mux(io.ctl_aluSrc, inputMux2, io.immediate)

  aluCtl.io.f3 := io.func3
  aluCtl.io.f7 := io.func7(5)
  aluCtl.io.aluOp := io.ctl_aluOp
  aluCtl.io.aluSrc := io.ctl_aluSrc

  alu.io.input1 := aluIn1
  alu.io.input2 := aluIn2
  alu.io.aluCtl := aluCtl.io.out

  io.stall := false.B
  if(M){
    val mdu = Module (new MDU)
    mdu.io.src_a := aluIn1
    mdu.io.src_b := aluIn2
    mdu.io.op    := io.func3
    // mdu.io.valid := true.B
    // io.stall := false.B
    
    val src_a_reg = RegInit(0.U(32.W))
    val src_b_reg = RegInit(0.U(32.W))
    val op_reg    = RegInit(0.U(3.W))
    val div_en    = RegInit(false.B)
    val f7_reg    = RegInit(0.U(6.W))
    val counter   = RegInit(0.U(6.W))

    when(io.func7 === 1.U && (io.func3 === 0.U || io.func3 === 1.U || io.func3 === 2.U || io.func3 === 3.U)){
      mdu.io.valid := true.B
    }otherwise{
      mdu.io.valid := false.B
    }
    dontTouch(io.stall)
    when(io.func7 === 1.U && ~div_en && (io.func3 === 4.U || io.func3 === 5.U || io.func3 === 6.U || io.func3 === 7.U)){
      mdu.io.valid := RegNext(true.B)
      div_en := true.B
      src_a_reg := aluIn1
      src_b_reg := aluIn2
      op_reg := io.func3
      f7_reg := io.func7
      io.stall := true.B
      dontTouch(f7_reg)
    }

    when(div_en){
      // io.stall := true.B
      when (counter < 32.U){
        io.stall := true.B
        mdu.io.src_a := src_a_reg
        mdu.io.src_b := src_b_reg
        mdu.io.op    := op_reg
        // mdu.io.valid := true.B
        counter := counter + 1.U
      }.otherwise{
        mdu.io.valid := false.B
        div_en       := false.B
        mdu.io.src_a := src_a_reg
        mdu.io.src_b := src_b_reg
        mdu.io.op    := op_reg
        counter := 0.U
      }
    }//.otherwise{io.stall := false.B}

    when(div_en && f7_reg === 1.U && mdu.io.ready){
      io.ALUresult := Mux(mdu.io.output.valid, mdu.io.output.bits, 0.U)
    }
    .elsewhen (io.func7 === 1.U && mdu.io.ready){
      io.ALUresult := Mux(mdu.io.output.valid, mdu.io.output.bits, 0.U)
    }
    .otherwise{io.ALUresult := alu.io.result}
  } 
  else {
    io.ALUresult := alu.io.result
  }

  // io.ALUresult := alu.io.result

  io.writeData := inputMux2
}

The Execute module handles arithmetic and logical operations, data forwarding, etc.

  val inputMux1 = MuxCase(
    0.U,
    Array(
      (fu.forwardA === 0.U) -> (io.readData1),
      (fu.forwardA === 1.U) -> (io.mem_result),
      (fu.forwardA === 2.U) -> (io.wb_result)
    )
  )
  val inputMux2 = MuxCase(
    0.U,
    Array(
      (fu.forwardB === 0.U) -> (io.readData2),
      (fu.forwardB === 1.U) -> (io.mem_result),
      (fu.forwardB === 2.U) -> (io.wb_result)
    )
  )

  val aluIn1 = MuxCase(
    inputMux1,
    Array(
      (io.ctl_aluSrc1 === 1.U) -> io.pcAddress,
      (io.ctl_aluSrc1 === 2.U) -> 0.U
    )
  )
  val aluIn2 = Mux(io.ctl_aluSrc, inputMux2, io.immediate)

Selects the appropriate input for the ALU.

For inputMux1 and inputMux2:

If fu.forwardA === 0.U / fu.forwardB === 0.U, selects io.readData1 / io.readData2, the original register value.

Else if fu.forwardA === 1.U / fu.forwardB === 1.U, selects io.mem_result, the result from the memory stage.

Else if fu.forwardA === 2.U / fu.forwardB === 2.U, selects io.wb_result, the result from the writeback stage.

For aluIn1:

If io.ctl_aluSrc1 === 1.U, selects io.pcAddress, the current program counter value.

Else if io.ctl_aluSrc1 === 2.U, selects 0.U, a constant zero.

Else selects inputMux1, the result of the forwarding logic.

For aluIn2:

If io.ctl_aluSrc is true, selects inputMux2, the (possibly forwarded) value of the second register operand.

Else selects io.immediate, immediate value encoded in the instruction.
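Putting the two levels of selection together, a Python model of the ALU input path looks like this. It is a sketch of the MuxCase/Mux logic quoted above; the function names are my own, and the forwarding selects are taken as plain integers.

```python
def forward_mux(select, read_data, mem_result, wb_result):
    """inputMux1 / inputMux2: 0 = register value, 1 = MEM result, 2 = WB result."""
    # MuxCase falls back to its default (0.U) when no condition matches.
    return {0: read_data, 1: mem_result, 2: wb_result}.get(select, 0)

def alu_inputs(forward_a, forward_b, ctl_alu_src1, ctl_alu_src,
               read_data1, read_data2, mem_result, wb_result, pc, immediate):
    """Return (aluIn1, aluIn2, writeData) as the Execute stage selects them."""
    mux1 = forward_mux(forward_a, read_data1, mem_result, wb_result)
    mux2 = forward_mux(forward_b, read_data2, mem_result, wb_result)
    if ctl_alu_src1 == 1:
        alu_in1 = pc
    elif ctl_alu_src1 == 2:
        alu_in1 = 0
    else:
        alu_in1 = mux1
    alu_in2 = mux2 if ctl_alu_src else immediate
    return alu_in1, alu_in2, mux2  # io.writeData is driven by inputMux2

# rs1 forwarded from the memory stage, second operand is the immediate.
print(alu_inputs(1, 0, 0, False, 10, 20, 99, 77, 0x40, 4))
```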

Memory Access

Code can be found in nucleusrv/src/main/scala/components/MemoryFetch.scala

package nucleusrv.components
import chisel3._
import chisel3.util._ 



class MemoryFetch extends Module {
  val io = IO(new Bundle {
    val aluResultIn: UInt = Input(UInt(32.W))
    val writeData: UInt = Input(UInt(32.W))
    val writeEnable: Bool = Input(Bool())
    val readEnable: Bool = Input(Bool())
    val readData: UInt = Output(UInt(32.W))
    val stall: Bool = Output(Bool())
    val f3 = Input(UInt(3.W))

    val dccmReq = Decoupled(new MemRequestIO)
    val dccmRsp = Flipped(Decoupled(new MemResponseIO))
  })

  io.dccmRsp.ready := true.B

  val wdata = Wire(Vec(4, UInt(8.W)))
  val rdata = Wire(UInt(32.W))
  val offset = RegInit(0.U(2.W))
  val funct3 = RegInit(0.U(3.W))
  val offsetSW = io.aluResultIn(1,0)

  when(!io.dccmRsp.valid){
    funct3 := io.f3
    offset := io.aluResultIn(1,0)
  }.otherwise{
    funct3 := funct3
    offset := offset
  }

  wdata(0) := io.writeData(7,0)
  wdata(1) := io.writeData(15,8)
  wdata(2) := io.writeData(23,16)
  wdata(3) := io.writeData(31,24)

  /* Store Byte */
  when(io.writeEnable && io.f3 === "b000".U){
    when(offsetSW === 0.U){
      io.dccmReq.bits.activeByteLane := "b0001".U
    }.elsewhen(offsetSW === 1.U){
      wdata(0) := io.writeData(15,8)
      wdata(1) := io.writeData(7,0)
      wdata(2) := io.writeData(23,16)
      wdata(3) := io.writeData(31,24)
      io.dccmReq.bits.activeByteLane := "b0010".U
    }.elsewhen(offsetSW === 2.U){
      wdata(0) := io.writeData(15,8)
      wdata(1) := io.writeData(23,16)
      wdata(2) := io.writeData(7,0)
      wdata(3) := io.writeData(31,24)
      io.dccmReq.bits.activeByteLane := "b0100".U
    }.otherwise{
      wdata(0) := io.writeData(15,8)
      wdata(1) := io.writeData(23,16)
      wdata(2) := io.writeData(31,24)
      wdata(3) := io.writeData(7,0)
      io.dccmReq.bits.activeByteLane := "b1000".U
    }
  }
    /* Store Half Word */
    .elsewhen(io.writeEnable && io.f3 === "b001".U){
    // offset will either be 0 or 2 since address will be 0x0000 or 0x0002
    when(offsetSW === 0.U){
      // data to be stored at lower 16 bits (15,0)
      io.dccmReq.bits.activeByteLane := "b0011".U
    }.elsewhen(offsetSW === 1.U){
      // data to be stored at the middle 16 bits (23,8)
      io.dccmReq.bits.activeByteLane := "b0110".U
      wdata(0) := io.writeData(23,16)
      wdata(1) := io.writeData(7,0)
      wdata(2) := io.writeData(15,8)
      wdata(3) := io.writeData(31,24)
    }.otherwise{
      // data to be stored at upper 16 bits (31,16)
      io.dccmReq.bits.activeByteLane := "b1100".U
      wdata(2) := io.writeData(7,0)
      wdata(3) := io.writeData(15,8)
      wdata(0) := io.writeData(23,16)
      wdata(1) := io.writeData(31,24)
    }
  }
    /* Store Word */
    .otherwise{
    io.dccmReq.bits.activeByteLane := "b1111".U
  }

  io.dccmReq.bits.dataRequest := wdata.asUInt()
  io.dccmReq.bits.addrRequest := (io.aluResultIn & "h00001fff".U) >> 2
  io.dccmReq.bits.isWrite := io.writeEnable
  io.dccmReq.valid := Mux(io.writeEnable | io.readEnable, true.B, false.B)

  io.stall := (io.writeEnable || io.readEnable) && !io.dccmRsp.valid

  rdata := Mux(io.dccmRsp.valid, io.dccmRsp.bits.dataResponse, DontCare)


  when(io.readEnable) {
    when(funct3 === "b010".U) {
      // load word
      io.readData := rdata
    }
      .elsewhen(funct3 === "b000".U) {
        // load byte
        when(offset === "b00".U) {
          // addressing memory with 0,4,8...
          io.readData := Cat(Fill(24,rdata(7)),rdata(7,0))
        } .elsewhen(offset === "b01".U) {
          // addressing memory with 1,5,9...
          io.readData := Cat(Fill(24, rdata(15)),rdata(15,8))
        } .elsewhen(offset === "b10".U) {
          // addressing memory with 2,6,10...
          io.readData := Cat(Fill(24, rdata(23)),rdata(23,16))
        } .elsewhen(offset === "b11".U) {
          // addressing memory with 3,7,11...
          io.readData := Cat(Fill(24, rdata(31)),rdata(31,24))
        } .otherwise {
          // this condition would never occur but using to avoid Chisel generating VOID errors
          io.readData := DontCare
        }
      }
      .elsewhen(funct3 === "b100".U) {
        //load byte unsigned
        when(offset === "b00".U) {
          // addressing memory with 0,4,8...
          io.readData := Cat(Fill(24, 0.U), rdata(7, 0))
        }.elsewhen(offset === "b01".U) {
          // addressing memory with 1,5,9...
          io.readData := Cat(Fill(24, 0.U), rdata(15, 8))
        }.elsewhen(offset === "b10".U) {
          // addressing memory with 2,6,10...
          io.readData := Cat(Fill(24, 0.U), rdata(23, 16))
        }.elsewhen(offset === "b11".U) {
          // addressing memory with 3,7,11...
          io.readData := Cat(Fill(24, 0.U), rdata(31, 24))
        } .otherwise {
          // this condition would never occur but using to avoid Chisel generating VOID errors
          io.readData := DontCare
        }
      }
      .elsewhen(funct3 === "b101".U) {
        // load halfword unsigned
        when(offset === "b00".U) {
          // addressing memory with 0,4,8...
          io.readData := Cat(Fill(16, 0.U),rdata(15,0))
        } .elsewhen(offset === "b01".U) {
          // addressing memory with 2,6,10...
          io.readData := Cat(Fill(16, 0.U),rdata(23,8))
        } .elsewhen(offset === "b10".U) {
          // addressing memory with 2,6,10...
          io.readData := Cat(Fill(16, 0.U),rdata(31,16))
        } .otherwise {
          // this condition would never occur but using to avoid Chisel generating VOID errors
          io.readData := DontCare
        }
      }
      .elsewhen(funct3 === "b001".U) {
        // load halfword
        when(offset === "b00".U) {
          // addressing memory with 0,4,8...
          io.readData := Cat(Fill(16, rdata(15)),rdata(15,0))
        } .elsewhen(offset === "b01".U) {
          // addressing memory with 1,3,7...
          io.readData := Cat(Fill(16, rdata(23)),rdata(23,8))
        } .elsewhen(offset === "b10".U) {
          // addressing memory with 2,6,10...
          io.readData := Cat(Fill(16, rdata(31)),rdata(31,16))
        } .otherwise {
          // this condition would never occur but using to avoid Chisel generating VOID errors
          io.readData := DontCare
        }
      }
      .otherwise {
      // unknown func3 bits
      io.readData := DontCare
    }
  } .otherwise {
    io.readData := DontCare
  }


  when(io.writeEnable && io.aluResultIn(31, 28) === "h8".asUInt()){
    printf("%x\n", io.writeData)
  }

}

The MemoryFetch module handles read and write operations on the data memory, the DCCM (Data Closely Coupled Memory).

  /* Store Half Word */
  when(io.writeEnable && io.f3 === "b000".U){
    when(offsetSW === 0.U){
      io.dccmReq.bits.activeByteLane := "b0001".U
    }.elsewhen(offsetSW === 1.U){
      wdata(0) := io.writeData(15,8)
      wdata(1) := io.writeData(7,0)
      wdata(2) := io.writeData(23,16)
      wdata(3) := io.writeData(31,24)
      io.dccmReq.bits.activeByteLane := "b0010".U
    }.elsewhen(offsetSW === 2.U){
      wdata(0) := io.writeData(15,8)
      wdata(1) := io.writeData(23,16)
      wdata(2) := io.writeData(7,0)
      wdata(3) := io.writeData(31,24)
      io.dccmReq.bits.activeByteLane := "b0100".U
    }.otherwise{
      wdata(0) := io.writeData(15,8)
      wdata(1) := io.writeData(23,16)
      wdata(2) := io.writeData(31,24)
      wdata(3) := io.writeData(7,0)
      io.dccmReq.bits.activeByteLane := "b1000".U
    }
  }

I think this code actually handles a Store Byte (SB) operation, not a Store Half Word (SH) as the comment suggests: in RV32I, funct3 000 encodes SB. When io.writeEnable is true and io.f3 === "b000".U, the operation stores a single byte (8 bits) at the specified memory address.

offsetSW is the least significant 2 bits of the ALU result (the memory address). It determines where within a 32-bit word the byte is stored.

activeByteLane is a 4-bit value indicating which byte within the 32-bit word should be written.

If offsetSW === 0.U, the byte is stored in the least significant byte (bits 7-0) of the 32-bit word, and io.dccmReq.bits.activeByteLane := "b0001".U.

Else if offsetSW === 1.U, the byte is stored in the second least significant byte (bits 15-8) of the 32-bit word, and io.dccmReq.bits.activeByteLane := "b0010".U.

Else if offsetSW === 2.U, the byte is stored in the second most significant byte (bits 23-16) of the 32-bit word, and io.dccmReq.bits.activeByteLane := "b0100".U.

Otherwise (offsetSW === 3.U), the byte is stored in the most significant byte (bits 31-24) of the 32-bit word, and io.dccmReq.bits.activeByteLane := "b1000".U.

For the nonzero offsets, wdata is rearranged so that io.writeData(7,0) lands on the active lane.
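The byte-lane selection above can be sketched as a small software model. This is only an illustration in Python, not part of NucleusRV: the function name store_byte_request is hypothetical, and the sketch zeroes the inactive lanes, whereas the Chisel code fills them with other bytes of io.writeData (which should not matter as long as the memory only writes the lanes marked in activeByteLane, an assumption here).

```python
from typing import Tuple


def store_byte_request(write_data: int, offset: int) -> Tuple[int, int]:
    """Model an SB: return (data_word, active_byte_lane) for byte offset 0..3.

    Hypothetical helper; inactive lanes are zeroed for readability.
    """
    byte = write_data & 0xFF           # SB stores only the low byte of the source
    lane_mask = 1 << offset            # 0b0001, 0b0010, 0b0100, 0b1000
    data_word = byte << (8 * offset)   # move the byte onto its lane
    return data_word, lane_mask


for off in range(4):
    word, mask = store_byte_request(0x12345678, off)
    print(f"offset={off} lane={mask:04b} data=0x{word:08x}")
```

The one-hot mask mirrors the four activeByteLane constants in the listing, with one bit set per byte offset.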

    /* Store Half Word */
    .elsewhen(io.writeEnable && io.f3 === "b001".U){
    // offset will either be 0 or 2 since address will be 0x0000 or 0x0002
    when(offsetSW === 0.U){
      // data to be stored at lower 16 bits (15,0)
      io.dccmReq.bits.activeByteLane := "b0011".U
    }.elsewhen(offsetSW === 1.U){
      // data to be stored at lower 16 bits (15,0)
      io.dccmReq.bits.activeByteLane := "b0110".U
      wdata(0) := io.writeData(23,16)
      wdata(1) := io.writeData(7,0)
      wdata(2) := io.writeData(15,8)
      wdata(3) := io.writeData(31,24)
    }.otherwise{
      // data to be stored at upper 16 bits (31,16)
      io.dccmReq.bits.activeByteLane := "b1100".U
      wdata(2) := io.writeData(7,0)
      wdata(3) := io.writeData(15,8)
      wdata(0) := io.writeData(23,16)
      wdata(1) := io.writeData(31,24)
    }
  }

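When io.writeEnable is true and io.f3 === "b001".U, the module handles the Store Half Word (SH) operation.

The comment states that offsetSW will be either 0 or 2, since the address will be 0x0000 or 0x0002. The code nevertheless contains a branch for offsetSW === 1.U as well.

If offsetSW === 0.U, the half word is stored in the lower 16 bits (15-0) of the 32-bit word. io.dccmReq.bits.activeByteLane := "b0011".U indicates that the two least significant bytes should be written.

If offsetSW === 1.U, io.dccmReq.bits.activeByteLane := "b0110".U selects the two middle bytes (bits 23-8), and wdata is rearranged so that io.writeData(15,0) lands on those lanes.

Otherwise (offsetSW === 2.U, and also 3.U, because this is the .otherwise branch), the half word is stored in the upper 16 bits (31-16) of the 32-bit word. io.dccmReq.bits.activeByteLane := "b1100".U, and wdata is rearranged accordingly.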
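A minimal Python sketch of the three SH branches in the listing may make the lane masks easier to compare. The name store_half_request is hypothetical, and inactive lanes are zeroed here instead of carrying the leftover bytes that the Chisel code assigns to them.

```python
from typing import Tuple


def store_half_request(write_data: int, offset: int) -> Tuple[int, int]:
    """Model an SH: return (data_word, active_byte_lane) for byte offset 0..3."""
    half = write_data & 0xFFFF         # SH stores only the low 16 bits
    if offset == 0:
        lane_mask, shift = 0b0011, 0   # lower half word
    elif offset == 1:
        lane_mask, shift = 0b0110, 8   # two middle bytes
    else:
        # Mirrors the .otherwise branch: offsets 2 and 3 both end up here.
        lane_mask, shift = 0b1100, 16  # upper half word
    return (half << shift) & 0xFFFFFFFF, lane_mask


for off in range(4):
    word, mask = store_half_request(0xABCD, off)
    print(f"offset={off} lane={mask:04b} data=0x{word:08x}")
```

Note that offset 3 produces the same request as offset 2 in this model, exactly because the original code has no separate branch for it.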

/* Store Word */
.otherwise{
io.dccmReq.bits.activeByteLane := "b1111".U
}

This is the Store Word (SW) operation. io.dccmReq.bits.activeByteLane := "b1111".U indicates that all four bytes of the 32-bit word are active for writing.

  io.dccmReq.bits.dataRequest := wdata.asUInt()
  io.dccmReq.bits.addrRequest := (io.aluResultIn & "h00001fff".U) >> 2
  io.dccmReq.bits.isWrite := io.writeEnable
  io.dccmReq.valid := Mux(io.writeEnable | io.readEnable, true.B, false.B)

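This prepares the memory request: it packs the data to be written (if it is a write operation), calculates the memory address, sets the write flag, and marks the request valid whenever there is an actual memory operation to perform. The address is masked with 0x1fff, which keeps the low 13 bits, and then shifted right by 2 to turn the byte address into a word index.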
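The address calculation can be checked with a few lines of Python. This is a sketch of the arithmetic only; the names DCCM_ADDR_MASK, dccm_word_index and byte_offset are made up for the example.

```python
DCCM_ADDR_MASK = 0x00001FFF  # same constant as "h00001fff".U in the listing


def dccm_word_index(alu_result: int) -> int:
    """Word index sent as addrRequest: mask to 13 bits, then drop the byte offset."""
    return (alu_result & DCCM_ADDR_MASK) >> 2


def byte_offset(alu_result: int) -> int:
    """Lowest two address bits, i.e. the position of the byte inside the word."""
    return alu_result & 0b11


for addr in (0x0, 0x4, 0x7, 0x80001004, 0x2000):
    print(f"addr=0x{addr:08x} word={dccm_word_index(addr)} offset={byte_offset(addr)}")
```

Because of the mask, addresses that differ only above bit 12 map to the same word index in this model (0x2000 and 0x0, for example).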

io.stall := (io.writeEnable || io.readEnable) && !io.dccmRsp.valid

The stall logic makes the processor wait for a memory operation to complete before proceeding: io.stall is asserted while a read or write is requested and the DCCM response is not yet valid.
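The stall condition is a plain boolean expression, so it can be modelled directly. The sketch below is a Python illustration; stalled_cycles and its latency parameter describe a hypothetical memory that answers a fixed number of cycles after the request, which is an assumption and not a measured property of the DCCM.

```python
def stall(write_enable: bool, read_enable: bool, rsp_valid: bool) -> bool:
    """Same expression as io.stall: a pending access without a valid response."""
    return (write_enable or read_enable) and not rsp_valid


def stalled_cycles(latency: int) -> int:
    """Count stalled cycles for a load if the response turns valid after `latency` cycles."""
    cycles = 0
    while stall(False, True, rsp_valid=(cycles >= latency)):
        cycles += 1
    return cycles


print(stall(False, False, False))  # no memory access, so no stall
print(stalled_cycles(1))
```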

rdata := Mux(io.dccmRsp.valid, io.dccmRsp.bits.dataResponse, DontCare)

Selects the data from the DCCM response if it's valid, otherwise sets it to DontCare.

when(io.readEnable) {
    when(funct3 === "b010".U) {
      // load word
      io.readData := rdata
    }

When funct3 === "b010", it performs a full 32-bit word load, Load Word (LW).

.elsewhen(funct3 === "b000".U) {
  // load byte (sign-extended)
  // ...
}

When funct3 === "b000", it loads a single byte and sign-extends it to 32 bits: Load Byte (LB). It uses the offset to determine which byte of the 32-bit word to load.

.elsewhen(funct3 === "b100".U) {
  //load byte unsigned
  // ...
}

Load Byte Unsigned (LBU) is similar to Load Byte (LB), but zero-extends the byte instead of sign-extending it.
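The difference between LB and LBU is only the extension step. A small Python model (the name load_byte is hypothetical) that selects the byte by offset and then sign- or zero-extends it:

```python
def load_byte(rdata: int, offset: int, signed: bool) -> int:
    """Pick byte `offset` (0..3) of the 32-bit word and extend it to 32 bits."""
    byte = (rdata >> (8 * offset)) & 0xFF
    if signed and (byte & 0x80):
        # LB: replicate bit 7 into bits 31..8, like Fill(24, rdata(7)).
        return byte | 0xFFFFFF00
    # LBU, or LB with a non-negative byte: upper 24 bits stay zero.
    return byte


word = 0x80FF7F01
for off in range(4):
    lb = load_byte(word, off, signed=True)
    lbu = load_byte(word, off, signed=False)
    print(f"offset={off} LB=0x{lb:08x} LBU=0x{lbu:08x}")
```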

.elsewhen(funct3 === "b101".U) {
  // load halfword unsigned
  // ...
}

Loads a 16-bit halfword and zero-extends it to 32 bits, Load Halfword Unsigned (LHU).

.elsewhen(funct3 === "b001".U) {
  // load halfword
  // ...
}

Loads a 16-bit halfword and sign-extends it to 32 bits, Load Halfword (LH).
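The halfword loads follow the same pattern with 16-bit slices. The Python sketch below mirrors the three slices used in the listing, rdata(15,0), rdata(23,8) and rdata(31,16). The name load_half is hypothetical, and returning None for offset 3 is just this model's way of representing the DontCare that the Chisel code assigns in its .otherwise branch.

```python
from typing import Optional


def load_half(rdata: int, offset: int, signed: bool) -> Optional[int]:
    """Pick the halfword starting at byte `offset` and extend it to 32 bits."""
    if offset > 2:
        return None  # the listing assigns DontCare for this case
    half = (rdata >> (8 * offset)) & 0xFFFF
    if signed and (half & 0x8000):
        half |= 0xFFFF0000  # LH: replicate bit 15 into the upper 16 bits
    return half             # LHU: upper 16 bits stay zero


word = 0x8001F234
print(hex(load_half(word, 0, signed=True)))
print(hex(load_half(word, 0, signed=False)))
print(hex(load_half(word, 2, signed=True)))
```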

RV32IM Instruction Introduction

Multiplication Operations

mul (Multiplication)

Format: mul rd,rs1,rs2

Description: performs a 32-bit × 32-bit multiplication and places the lower 32 bits of the product in the destination register. The lower 32 bits are the same whether rs1 and rs2 are treated as signed or unsigned numbers.

Implementation: x[rd] = x[rs1] * x[rs2]
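As a quick check of the mul semantics, here is a small Python reference model. It is a sketch only: registers are represented as unsigned 32-bit integers, and the helper names MASK32, to_signed32 and mul are chosen for this example.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul(rs1: int, rs2: int) -> int:
    """Lower 32 bits of the product."""
    return (rs1 * rs2) & MASK32


a, b = 0xFFFFFFFF, 2  # a is -1 when read as signed
unsigned_low = mul(a, b)
signed_low = (to_signed32(a) * to_signed32(b)) & MASK32
print(hex(unsigned_low), hex(signed_low))
```

Both interpretations give the same low word, which is why a single mul instruction serves signed and unsigned operands.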

mulh (Multiplication Higher)

Format: mulh rd,rs1,rs2

Description: performs a 32-bit × 32-bit multiplication and places the upper 32 bits of the 64-bit product in the destination register (both rs1 and rs2 treated as signed numbers).

Implementation: x[rd] = (x[rs1] s*s x[rs2]) >>s 32

mulhsu (Multiplication Higher Signed Unsigned)

Format: mulhsu rd,rs1,rs2

Description: performs a 32-bit × 32-bit multiplication and places the upper 32 bits of the 64-bit product in the destination register (rs1 treated as a signed number, rs2 treated as an unsigned number).

Implementation: x[rd] = (x[rs1] s*u x[rs2]) >>s 32

mulhu (Multiplication Higher Unsigned)

Format: mulhu rd,rs1,rs2

Description: performs a 32-bit × 32-bit multiplication and places the upper 32 bits of the 64-bit product in the destination register (both rs1 and rs2 treated as unsigned numbers).

Implementation: x[rd] = (x[rs1] u*u x[rs2]) >>u 32
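The three "high" multiplies differ only in how each operand is interpreted before the 64-bit product is formed. A Python sketch, with registers held as unsigned 32-bit integers and helper names chosen for this example, could look like this:

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mulh(rs1: int, rs2: int) -> int:
    """signed x signed, upper 32 bits."""
    return ((to_signed32(rs1) * to_signed32(rs2)) >> 32) & MASK32


def mulhsu(rs1: int, rs2: int) -> int:
    """signed x unsigned, upper 32 bits."""
    return ((to_signed32(rs1) * (rs2 & MASK32)) >> 32) & MASK32


def mulhu(rs1: int, rs2: int) -> int:
    """unsigned x unsigned, upper 32 bits."""
    return (((rs1 & MASK32) * (rs2 & MASK32)) >> 32) & MASK32


x = 0xFFFFFFFF  # -1 as signed, 4294967295 as unsigned
print(hex(mulh(x, x)), hex(mulhsu(x, x)), hex(mulhu(x, x)))
```

Python's >> on negative integers is an arithmetic shift, so it matches the >>s used in the implementation lines above.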

Division Operations

div (Division)

Format: div rd,rs1,rs2

Description: performs signed integer division of 32 bits by 32 bits (rounding towards zero).

Implementation: x[rd] = x[rs1] /s x[rs2]

divu (Division Unsigned)

Format: divu rd, rs1, rs2

Description: performs unsigned integer division of 32 bits by 32 bits (rounding towards zero).

Implementation: x[rd] = x[rs1] /u x[rs2]
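A Python reference model of div and divu is sketched below. Two points are not covered by the descriptions above and are taken from the RISC-V M extension specification rather than from NucleusRV's code: division by zero returns all ones, and the signed overflow case (-2^31 divided by -1) returns -2^31. Python's // rounds towards negative infinity, so the signed version divides magnitudes and restores the sign to get rounding towards zero. The helper names are chosen for this example.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def div(rs1: int, rs2: int) -> int:
    a, b = to_signed32(rs1), to_signed32(rs2)
    if b == 0:
        return MASK32                  # spec: quotient is -1 (all ones)
    q = abs(a) // abs(b)               # divide magnitudes: rounds towards zero
    if (a < 0) != (b < 0):
        q = -q
    return q & MASK32                  # -2^31 / -1 wraps back to 0x80000000


def divu(rs1: int, rs2: int) -> int:
    a, b = rs1 & MASK32, rs2 & MASK32
    return MASK32 if b == 0 else a // b


minus7 = -7 & MASK32
print(hex(div(minus7, 2)), hex(divu(minus7, 2)))
```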

rem (Remainder)

Format: rem rd, rs1, rs2

Description: provides the remainder of the corresponding division operation div (the sign of rd equals the sign of rs1).

Implementation: x[rd] = x[rs1] %s x[rs2]

remu (Remainder Unsigned)

Format: remu rd, rs1, rs2

Description: provides the remainder of the corresponding division operation divu.

Implementation: x[rd] = x[rs1] %u x[rs2]
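The remainder instructions can be modelled the same way. In this Python sketch the result of rem takes the sign of rs1, as described above. The handling of a zero divisor (the remainder equals the dividend) comes from the RISC-V M extension specification, not from the text here. The helper names are chosen for this example.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def rem(rs1: int, rs2: int) -> int:
    a, b = to_signed32(rs1), to_signed32(rs2)
    if b == 0:
        return a & MASK32              # spec: remainder is the dividend
    r = abs(a) % abs(b)                # remainder of the magnitudes
    if a < 0:
        r = -r                         # sign follows the dividend rs1
    return r & MASK32


def remu(rs1: int, rs2: int) -> int:
    a, b = rs1 & MASK32, rs2 & MASK32
    return a if b == 0 else a % b


minus7 = -7 & MASK32
print(hex(rem(minus7, 2)), hex(rem(7, -2 & MASK32)), hex(remu(minus7, 2)))
```

Python's own % takes the sign of the divisor, which is why the model works on magnitudes instead of applying % to the signed values directly.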
