# Cache system for NucleusRV
> 程品叡, 吳睿秉
## Introduction
[NucleusRV](https://github.com/merledu/nucleusrv) is a Chisel-based, 5-stage pipelined RISC-V CPU that implements the 32-bit version of the ISA. [Verilator](https://github.com/verilator/verilator) is used to generate a C++ simulator and an executable, which are then verified against the [RISC-V architectural tests](https://github.com/riscv-non-isa/riscv-arch-test). The CPU currently supports a limited set of instructions. This project focuses on the memory-related components, i.e., those related to SRAM. We built NucleusRV on `Ubuntu 22.04.5` and reviewed the cache work completed so far. We then attempted to implement a simple direct-mapped cache similar to the existing one and studied compulsory misses during instruction fetch.
## Development Environment
* OS: **Ubuntu 22.04.5 LTS**
### Get NucleusRV and RISC-V Architecture Test SIG
#### 1. nucleusrv
**main** branch: https://github.com/merledu/nucleusrv.git
```bash
git clone https://github.com/merledu/nucleusrv.git
```
or **new cache** branch: https://github.com/merledu/nucleusrv/tree/new_cache
```bash
git clone https://github.com/merledu/nucleusrv.git -b new_cache
```
#### 2. riscv-arch-test
riscv-arch-test should be placed inside the nucleusrv directory.
riscv-arch-test: https://github.com/riscv-non-isa/riscv-arch-test.git
```bash
cd nucleusrv
git clone https://github.com/riscv-non-isa/riscv-arch-test.git -b 1.0
```
### Dependencies
#### 1. Install essential packages
```bash
sudo apt install git
sudo apt install curl
sudo apt install make
```
#### 2. Install Java and SBT
Reference: https://www.chisel-lang.org/docs/installation#java-development-kit-jdk
```bash
sudo su
apt install -y wget gpg apt-transport-https
wget -qO - https://packages.adoptium.net/artifactory/api/gpg/key/public | gpg --dearmor | tee /etc/apt/trusted.gpg.d/adoptium.gpg > /dev/null
echo "deb https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" | tee /etc/apt/sources.list.d/adoptium.list
apt update
apt install temurin-17-jdk
exit
```
Reference: https://www.chisel-lang.org/docs/installation#sbt
```bash
curl -s -L https://github.com/sbt/sbt/releases/download/v1.9.7/sbt-1.9.7.tgz | tar xvz
sudo mv sbt/bin/sbt /usr/local/bin/
```
#### 3. Verilator
Reference: https://verilator.org/guide/latest/install.html#package-manager-quick-install
```bash
sudo apt-get install verilator
```
<!-- #### Git Quick Install
```bash
# Prerequisites (ignore if gives error):
sudo apt-get install git help2man perl python3 make autoconf g++ flex bison ccache
sudo apt-get install libgoogle-perftools-dev numactl perl-doc
sudo apt-get install libfl2
sudo apt-get install libfl-dev
sudo apt-get install zlibc zlib1g zlib1g-dev
# Only first time
git clone https://github.com/verilator/verilator
# Every time you need to build:
unset VERILATOR_ROOT # For bash
cd verilator
git pull # Make sure git repository is up-to-date
git tag # See what versions exist
autoconf # Create ./configure script
./configure # Configure and create Makefile
sudo make -j `nproc` # Build Verilator itself (if error, try just 'make')
sudo make test
sudo make install
``` -->
After successfully installing, check the version.
```bash
verilator --version
```
The terminal should show something like:
> ```bash
> Verilator 4.038 2020-07-11 rev v4.036-114-g0cd4a57ad
> ```
NOTE: On `Ubuntu 22.04.5`, running `sudo apt-get install verilator` installs `Verilator 4.038`, which works with this project. Other Ubuntu releases may install a different Verilator version.
:::spoiler
On newer versions of Verilator, the sbt test might fail because Verilator requires an additional argument that specifies how to handle timing. For example, on `Ubuntu 24.04.1`, the same command installs `Verilator 5.020`, which is newer but fails to run the sbt `testOnly` command.
#### Faults During Building
> **Running Compliance Tests** (`README.md` of [nucleusrv](https://github.com/merledu/nucleusrv))
> * Clone `riscv-arch-test` repo in nucleusrv root `git clone
> git@github.com:riscv-non-isa/riscv-arch-test.git -b 1.0`
> * Build the simulation executable as defined in "Building with SBT" section.
> * Run `./run-compliance.sh` in root directory.
When I ran `./run-compliance.sh`, I got the following error message.
```bash=
[info] TopTest:
Elaborating design...
Done elaborating.
cd /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test && verilator --cc Top.v --assert -Wno-fatal -Wno-WIDTH -Wno-STMTDLY -O1 --top-module Top +define+TOP_TYPE=VTop +define+PRINTF_COND=!Top.reset +define+STOP_COND=!Top.reset -CFLAGS "-Wno-undefined-bool-conversion -O1 -DTOP_TYPE=VTop -DVL_USER_FINISH -include VTop.h" -Mdir /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test -f /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test/firrtl_black_box_resource_files.f --exe /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test/Top-harness.cpp --trace
%Error-NEEDTIMINGOPT: /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test/sram.v:131:17: Use --timing or --no-timing to specify how timing controls should be handled
: ... note: In instance 'Top.imem.sram.memory'
131 | dout0 <= #(DELAY) mem[addr0_reg];
| ^
... For error description see https://verilator.org/warn/NEEDTIMINGOPT?v=5.033
%Error-NEEDTIMINGOPT: /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test/sram.v:139:17: Use --timing or --no-timing to specify how timing controls should be handled
: ... note: In instance 'Top.imem.sram.memory'
139 | dout1 <= #(DELAY) mem[addr1_reg];
| ^
%Error: Exiting due to 2 error(s)
[info] - Top Test *** FAILED ***
```
According to the [Errors and Warnings](https://verilator.org/guide/latest/warnings.html#cmdoption-arg-NEEDTIMINGOPT) page, the NEEDTIMINGOPT error means that the command does not specify how Verilator should handle timing constructs such as delays.
Running `./run-compliance.sh` invokes Verilator through the command on line 4 of the log above, and we could not find where to change the arguments passed to Verilator. As a workaround, run that command manually instead of through `./run-compliance.sh`, and append `--timing` or `--no-timing` to it.
The command looks like:
```bash
cd /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test
verilator --cc Top.v --assert -Wno-fatal -Wno-WIDTH -Wno-STMTDLY -O1 \
--top-module Top +define+TOP_TYPE=VTop +define+PRINTF_COND=\!Top.reset +define+STOP_COND=\!Top.reset \
-CFLAGS "-Wno-undefined-bool-conversion \
-O1 -DTOP_TYPE=VTop -DVL_USER_FINISH -include VTop.h" \
-Mdir /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test \
-f /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test/firrtl_black_box_resource_files.f \
--exe /home/vboxuser/Desktop/nucleusrv/test_run_dir/Top_Test/Top-harness.cpp \
--trace --no-timing
```
:::
#### 4. RISC-V GNU Compiler Toolchain
riscv-gnu-toolchain: https://github.com/riscv-collab/riscv-gnu-toolchain
```bash
git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git
sudo apt-get install autoconf automake autotools-dev curl python3 python3-pip python3-tomli libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev ninja-build git cmake libglib2.0-dev libslirp-dev
cd riscv-gnu-toolchain/
./configure --prefix=/opt/riscv
echo 'export PATH="$PATH:/opt/riscv/bin"' >> ~/.bashrc
source ~/.bashrc
sudo make -j `nproc`
```
After successfully installing and building, run the following command to check the installation.
```bash
riscv64-unknown-elf-gcc -v
```
The terminal should show something like:
> ```
> Using built-in specs.
> COLLECT_GCC=riscv64-unknown-elf-gcc
> COLLECT_LTO_WRAPPER=/opt/riscv/libexec/gcc/riscv64-unknown-elf/14.2.0/lto-wrapper
> Target: riscv64-unknown-elf
> Configured with: /home/user/Desktop/riscv-gnu-toolchain/gcc/configure --target=riscv64-unknown-elf --prefix=/opt/riscv --disable-shared --disable-threads --enable-languages=c,c++ --with-pkgversion= --with-system-zlib --enable-tls --with-newlib --with-sysroot=/opt/riscv/riscv64-unknown-elf --with-native-system-header-dir=/include --disable-libmudflap --disable-libssp --disable-libquadmath --disable-libgomp --disable-nls --disable-tm-clone-registry --src=.././gcc --disable-multilib --with-abi=lp64d --with-arch=rv64gc --with-tune=rocket --with-isa-spec=20191213 'CFLAGS_FOR_TARGET=-Os -mcmodel=medlow' 'CXXFLAGS_FOR_TARGET=-Os -mcmodel=medlow'
> Thread model: single
> Supported LTO compression algorithms: zlib
> gcc version 14.2.0 ()
> ```
After building the riscv-gnu-toolchain, the following commands are available. Type `riscv64-unknown-elf-` in the terminal and press Tab to list them.
> ```
> user@user:~/Desktop/nucleusrv/tools$ riscv64-unknown-elf-
> riscv64-unknown-elf-addr2line riscv64-unknown-elf-gdb
> riscv64-unknown-elf-ar riscv64-unknown-elf-gdb-add-index
> riscv64-unknown-elf-as riscv64-unknown-elf-gprof
> riscv64-unknown-elf-c++ riscv64-unknown-elf-ld
> riscv64-unknown-elf-c++filt riscv64-unknown-elf-ld.bfd
> riscv64-unknown-elf-cpp riscv64-unknown-elf-lto-dump
> riscv64-unknown-elf-elfedit riscv64-unknown-elf-nm
> riscv64-unknown-elf-g++ riscv64-unknown-elf-objcopy
> riscv64-unknown-elf-gcc riscv64-unknown-elf-objdump
> riscv64-unknown-elf-gcc-14.2.0 riscv64-unknown-elf-ranlib
> riscv64-unknown-elf-gcc-ar riscv64-unknown-elf-readelf
> riscv64-unknown-elf-gcc-nm riscv64-unknown-elf-run
> riscv64-unknown-elf-gcc-ranlib riscv64-unknown-elf-size
> riscv64-unknown-elf-gcov riscv64-unknown-elf-strings
> riscv64-unknown-elf-gcov-dump riscv64-unknown-elf-strip
> riscv64-unknown-elf-gcov-tool
> ```
## Building with SBT
### Top Test
Run the following command in SBT shell:
```
testOnly nucleusrv.components.TopTest -- -DwriteVcd=1
```
The terminal should show something like:
> ```
> [info] TopTest:
> Elaborating design...
> Done elaborating.
> ...
> sim start on DESKTOP at Mon Jan 20 17:34:11 2025
> inChannelName: 00001368.in
> outChannelName: 00001368.out
> cmdChannelName: 00001368.cmd
> STARTING test_run_dir/Top_Test/VTop
> Enabling waves..
> Exit Code: 0
> [info] - Top Test
> [info] Run completed in 3 seconds, 278 milliseconds.
> [info] Total number of tests run: 1
> [info] Suites: completed 1, aborted 0
> [info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
> [info] All tests passed.
> [success] Total time: 10 s, completed Jan 20, 2025, 5:34:12 PM
> ```
On success, a `VTop` executable is generated in `nucleusrv/test_run_dir/Top_Test`.
### SRAM Test
Additionally, the following command runs the cache SRAM test defined in `nucleusrv/src/test/scala/components/SRamTests.scala`:
```
testOnly nucleusrv.components.CacheSRAMTests
```
## Running Compliance Tests
In the root directory, run `./run-compliance.sh`. If the `VTop` executable already exists, it will start the RISC-V architecture tests. However, before actually starting, some modifications need to be made:
### Modify Makefile
In older versions of the ISA specification, CSR instructions were part of the base instruction set. Newer versions move them into the separate Zicsr extension, so the Makefile must enable that extension explicitly for the compiler to accept these instructions.
Running the RISC-V architecture test suite for `rv32i` requires adding
`-march=rv32imac_zicsr` at the end of `RISCV_GCC_OPTS` in the
`nucleusrv/riscv-target/nucleusrv/device/rv32i/Makefile.include`. This is not needed when running rv32im or rv32imc.
```diff=
#sbt "testOnly nucleusrv.components.TopTest -- -DmemFile=tools/out/program.hex -DwriteVcd=1 -DsignatureFile=test.sig"
TARGET_SIM ?= VTop
ifeq ($(shell command -v $(TARGET_SIM) 2> /dev/null),)
$(error Target simulator executable '$(TARGET_SIM)` not found)
endif
RUN_TARGET=\
cd $(NUCLEUSRV) && sbt "testOnly nucleusrv.components.TopTest -- -DprogramFile=$(<).program.hex -DwriteVcd=1 -DdataFile=$(<).data.hex" \
> $(*).stdout; \
`grep '^[a-f0-9]\+$$' $(*).stdout > $(*).signature.output`;
RISCV_PREFIX ?= riscv64-unknown-elf-
RISCV_GCC ?= $(RISCV_PREFIX)gcc
RISCV_OBJCOPY ?= $(RISCV_PREFIX)objcopy
RISCV_OBJDUMP ?= $(RISCV_PREFIX)objdump
RISCV_ELF2HEX ?= $(RISCV_PREFIX)elf2hex
-RISCV_GCC_OPTS ?= -static -mcmodel=medany -fvisibility=hidden -nostdlib -nostartfiles
+RISCV_GCC_OPTS ?= -static -mcmodel=medany -fvisibility=hidden -nostdlib -nostartfiles -march=rv32imac_zicsr
SBT ?= sbt
COMPILE_TARGET=\
...
```
### Modify `./run-compliance.sh` to Run Test
In `./run-compliance.sh`, the `$ISA` and `$TEST` specify the testcase in
`nucleusrv/riscv-arch-test/riscv-test-suite/[ISA]/src/[TEST]` that will be executed.
Note that if the ISA is `rv32i`, the Makefile must be modified as described in the **Modify Makefile** section above.
To run all the tests, set `TEST` to `$ALL`.
For example, the following settings run only the C-ADDI16SP test of the rv32imc ISA.
```bash
ISA=rv32imc
TEST=C-ADDI16SP
```
After setting the `ISA` and `TEST` variables, the two make commands in this script will run the tests.
> ```
> Compare to reference files ...
>
> Check C-ADDI16SP ... OK
> Check C-ADDI4SPN ... IGNORE
> Check C-ADDI ... IGNORE
> Check C-ADD ... IGNORE
> Check C-ANDI ... IGNORE
> Check C-AND ... IGNORE
> Check C-BEQZ ... IGNORE
> Check C-BNEZ ... IGNORE
> Check C-JAL ... IGNORE
> Check C-JALR ... IGNORE
> Check C-J ... IGNORE
> Check C-JR ... IGNORE
> Check C-LI ... IGNORE
> Check C-LUI ... IGNORE
> Check C-LW ... IGNORE
> Check C-LWSP ... IGNORE
> Check C-MV ... IGNORE
> Check C-OR ... IGNORE
> Check C-SLLI ... IGNORE
> Check C-SRAI ... IGNORE
> Check C-SRLI ... IGNORE
> Check C-SUB ... IGNORE
> Check C-SW ... IGNORE
> Check C-SWSP ... IGNORE
> Check C-XOR ... IGNORE
> --------------------------------
> OK: 25/25 RISCV_TARGET=nucleusrv RISCV_DEVICE=rv32i RISCV_ISA=rv32imc
> ```
The test will compare two files, and if they are identical, the test will pass.
* Reference answer: `riscv-arch-test/riscv-test-suite/[ISA]/reference`
* Test output: `riscv-arch-test/work/[ISA]/[TEST].output`
## Building C Programs
### Modify the Makefile
The toolchain was configured with `./configure --prefix=/opt/riscv`, which by default installs the `riscv64-unknown-elf-` tools.
Before building a C program, change line 1 of the project's Makefile, `nucleusrv/tools/makefile`, to `RISCV=riscv64-unknown-elf-` so that it does not call `riscv32-unknown-elf-`.
```diff=
-RISCV=riscv32-unknown-elf-
+RISCV=riscv64-unknown-elf-
CC=$(RISCV)gcc
OBJDUMP=$(RISCV)objdump
OBJCOPY=$(RISCV)objcopy
CFLAGS=-c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer
OFLAGS=--disassemble-all --section=.text
LFLAGS = -march=rv32im -mabi=ilp32 -static -nostdlib -nostartfiles -T link.ld
PROGRAM ?= fibonacci
...
```
### Make Program
Navigate to `nucleusrv/tools` and run the following command to build your program, which is located in `nucleusrv/tools/tests/FOLDER_NAME`:
`make PROGRAM=<FOLDER_NAME>`
For example, if your program folder in `nucleusrv/tools/tests` is named `hello_world`, use:
```bash
make PROGRAM=hello_world
```
The terminal should show something like:
> ```
> rm -rf out
> riscv64-unknown-elf-gcc -c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer -c -o tests/hello_world/hello.o tests/hello_world/hello.c
> riscv64-unknown-elf-gcc -c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer -c -o tests/hello_world/main.o tests/hello_world/main.c
> riscv64-unknown-elf-gcc -c -march=rv32i -mabi=ilp32 -ffreestanding -fomit-frame-pointer -c -o tests/hello_world/world.o tests/hello_world/world.c
> riscv64-unknown-elf-gcc -march=rv32im -mabi=ilp32 -static -nostdlib -nostartfiles -T link.ld tests/hello_world/hello.o tests/hello_world/main.o tests/hello_world/world.o -o out/program.elf -lgcc
> riscv64-unknown-elf-objdump --disassemble-all --section=.text out/program.elf > out/program.dump
> python3 makehex.py out/program.elf 2048 > out/program.hex
> ```
The corresponding RISC-V disassembly is written to `nucleusrv/tools/out/program.dump`.
---
## Understanding the Cache in new_cache Branch
* new_cache branch: https://github.com/merledu/nucleusrv/tree/new_cache
Note that in this project, the class names and file names do not always correspond and may have inconsistent capitalization. The names mentioned below, if without file extensions, primarily refer to the class names.
### Difference Between Main and new_cache Branch
The main difference between the main branch and the new_cache branch is that the SRAM is divided into Instruction SRAM and Data SRAM. In the new_cache branch, there are four types of memory: `SRamTop`, `new_SRamTop`, `Instruction_SRamTop`, and `Data_SRamTop`. They differ only in the class names or variable names, and their content is essentially identical.
To be more specific, `SRamTop` is not used at all. `new_SRamTop` is only used in the test program `SRamTests.scala` under `CacheSRAMTests` class. After modifying the test program from `new_SRamTop` to test `Instruction_SRamTop` or `Data_SRamTop`, as shown below, `new_SRamTop` is no longer used.
```diff
...
class CacheSRAMTests extends FreeSpec with ChiselScalatestTester {
"New SRAM Test" in {
- test(new new_SRamTop(None)).withAnnotations(Seq(VerilatorBackendAnnotation)) { c =>
+ test(new Instruction_SRamTop(None)).withAnnotations(Seq(VerilatorBackendAnnotation)) { c =>
// Write data to a specific address
c.io.req.valid.poke(true.B)
...
```
This `CacheSRAMTests` test program is also a newly implemented feature in the new_cache branch, which has not been enabled in the main branch.
### Program for Memory Components
Cache or memory serves as the closest place for the CPU to access instructions and data during execution. In the design architecture of this project, an instruction cache and a data cache are instantiated in the `Top` module. The `Core` utilizes them during the IF stage and MEM stage. In the implementation of these instruction and data caches (i.e., `Instruction_SRamTop` and `Data_SRamTop`), the `*.v` files in `src/main/resources` are called by Chisel using the BlackBox mechanism.
We explain the code in more detail in [code_explain](https://hackmd.io/@ZJL22446TDWfer1HG8KYJw/SyhsvNcDJe).
:::spoiler
Here, I will briefly introduce the code in `src/main/scala/components` from the **`new_cache`** branch.
```
|-src/main/scala/components
|-ALU.scala
|-AluControl.scala
|-BranchUnit.scala
|-CompressedDecoder.scala
|-Configs.scala
|-Constants.scala
|-Control.scala
|-Core.scala
|-Data_SRamTOP.scala
|-Execute.scala
|-ForwardingUnit.scala
|-HazardUnit.scala
|-ImmediateGen.scala
|-InstructionDecode.scala
|-InstructionFetch.scala
|-Instruction_SRamTOP.scala
|-JumpUnit.scala
|-MDU.scala
|-Main.scala
|-MemIO.scala
|-MemoryFetch.scala
|-PC.scala
|-Realigner.scala
|-Registers.scala
|-Top.scala
|-SRamTop.scala
|-new_SRamTOP.scala
```
#### ALU.scala
Implements the basic ALU (Arithmetic Logic Unit) operations, such as `and`, `or`, `add`, `sub`, and `slt`.
#### AluControl.scala
This code defines a module named **`AluControl`** using **Chisel**, which implements the control logic for an ALU (Arithmetic Logic Unit). It generates specific ALU control signals based on the input control signals (`aluOp`, `f7`, `f3`, `aluSrc`). The control signals determine the operation mode of the ALU.
The table below maps `aluOp`, `f3`, `f7`, and `aluSrc` to the output `io.out`:
| `aluOp` | `f3` | `f7` | `aluSrc` | Operation Type | `io.out` |
|---------|-------|-------|----------|---------------------|----------|
| `0.U` | Any | Any | Any | Add (`ADD`) | `2.U` |
| `2.U` | `0.U` | `0.U` | `false` | Add (`ADD`) | `2.U` |
| `2.U` | `0.U` | `1.U` | `true` | Subtract (`SUB`) | `3.U` |
| `2.U` | `1.U` | Any | Any | Shift Left (`SLL`) | `6.U` |
| `2.U` | `2.U` | Any | Any | Set Less Than (`SLT`)| `4.U` |
| `2.U` | `3.U` | Any | Any | Set Less Than Unsigned (`SLTU`) | `5.U` |
| `2.U` | `5.U` | `0.U` | Any | Logical Right Shift (`SRL`) | `7.U` |
| `2.U` | `5.U` | `1.U` | Any | Arithmetic Right Shift (`SRA`) | `8.U` |
| `2.U` | `7.U` | Any | Any | Logical AND (`AND`) | `0.U` |
| `2.U` | `6.U` | Any | Any | Logical OR (`OR`) | `1.U` |
| `2.U` | `4.U` | Any | Any | Logical XOR (`XOR`) | `9.U` |
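The table can be read as a small lookup function. The following Python sketch is a software model of the table only, not the Chisel source. It assumes that, for `f3 == 0`, `SUB` is selected only when `aluSrc` is true and `f7` is `1`, and that every other `f3 == 0` combination falls back to `ADD`; the table does not list those remaining combinations. The function name `alu_control` is hypothetical.

```python
# Software model of the AluControl table (illustrative, not the Chisel code).
# Assumption: with f3 == 0, SUB is chosen only when alu_src is True and f7 == 1.

# f3 values whose result does not depend on f7 or aluSrc:
# SLL, SLT, SLTU, XOR, OR, AND
F3_TO_OUT = {1: 6, 2: 4, 3: 5, 4: 9, 6: 1, 7: 0}


def alu_control(alu_op: int, f3: int, f7: int, alu_src: bool) -> int:
    """Return the ALU control code (`io.out`) for one row of the table."""
    if alu_op == 0:
        return 2  # aluOp 0 always selects ADD
    if alu_op != 2:
        raise ValueError("only aluOp 0 and 2 appear in the table")
    if f3 == 0:
        return 3 if (alu_src and f7 == 1) else 2  # SUB or ADD
    if f3 == 5:
        return 8 if f7 == 1 else 7  # SRA or SRL
    return F3_TO_OUT[f3]
```

For example, `alu_control(2, 5, 1, False)` returns `8` (`SRA`), matching the corresponding table row.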
#### BranchUnit.scala
This code implements a RISC-V processor's **branch unit** that determines whether a branch should be taken based on branch conditions (`funct3`), operands, and control signals.
```scala
switch(io.funct3) {
is(0.U) { check := (io.rd1 === io.rd2) } // beq
is(1.U) { check := (io.rd1 =/= io.rd2) } // bne
is(4.U) { check := (io.rd1.asSInt < io.rd2.asSInt) } // blt
is(5.U) { check := (io.rd1.asSInt >= io.rd2.asSInt) } // bge
is(6.U) { check := (io.rd1 < io.rd2) } // bltu
is(7.U) { check := (io.rd1 >= io.rd2) } // bgeu
}
```
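To make the difference between the signed (`blt`, `bge`) and unsigned (`bltu`, `bgeu`) comparisons concrete, here is a Python sketch of the same `funct3` switch. It is a software model under assumptions: operands are treated as 32-bit values, and `check` is assumed to default to false for `funct3` values that the switch does not cover. The names `to_signed32` and `branch_check` are hypothetical.

```python
def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's-complement signed value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def branch_check(funct3: int, rd1: int, rd2: int) -> bool:
    """Model of the branch condition selected by funct3."""
    rd1 &= 0xFFFFFFFF
    rd2 &= 0xFFFFFFFF
    if funct3 == 0:
        return rd1 == rd2                            # beq
    if funct3 == 1:
        return rd1 != rd2                            # bne
    if funct3 == 4:
        return to_signed32(rd1) < to_signed32(rd2)   # blt
    if funct3 == 5:
        return to_signed32(rd1) >= to_signed32(rd2)  # bge
    if funct3 == 6:
        return rd1 < rd2                             # bltu
    if funct3 == 7:
        return rd1 >= rd2                            # bgeu
    return False  # assumed default for funct3 values 2 and 3
```

With `rd1 = 0xFFFFFFFF` and `rd2 = 1`, `blt` is taken because the first operand is -1 when read as signed, while `bltu` is not taken because it is the largest unsigned 32-bit value.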
#### CompressedDecoder.scala
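This module decodes RISC-V **16-bit compressed instructions** into their corresponding **32-bit standard instructions**.
The table below summarizes how each quadrant (**C0-C3**) is decoded. In the second column, the first value is the quadrant, instruction bits `[1:0]`; the second value is taken from the top of the instruction, bits `[15:14]` for C0 and C2 and bits `[15:13]` for C1.
| Instruction Type | Quadrant `[1:0]` + selector (`[15:14]` or `[15:13]`) | Instruction Name | Description | Decoding/Explanation |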
|------------------|-----------------------|-------------------|--------------------------------------------------|-------------------------------------------------|
| **C0** | `00` + `b00` | `c.addi4spn` | Stack pointer offset calculation | Offset calculation and addition to `x2` (sp) |
| | `00` + `b01` | `c.lw` | Load data into register | `lw rd', imm(rs1')` |
| | `00` + `b11` | `c.sw` | Store data to memory | `sw rs2', imm(rs1')` |
| | `00` + `b10` | Illegal instruction| Unrecognized instruction, directly return | - |
| **C1** | `01` + `b000` | `c.addi`/`nop` | Add immediate or no operation | `addi rd, rd, imm` or `nop` |
| | `01` + `b001` | `c.jal` | Unconditional jump, save return address to `x1` | `jal x1, imm` |
| | `01` + `b101` | `c.j` | Unconditional jump | `jal x0, imm` |
| | `01` + `b010` | `c.li` | Load immediate into register | `addi rd, x0, imm` |
| | `01` + `b011` | `c.lui`/`addi16sp`| Load upper immediate or specific stack operation| `lui rd, imm` or `addi x2, x2, imm` |
| | `01` + `b100` | Logical/Arithmetic| Shift, logical operation or subtraction | `srli`, `srai`, `andi`, `sub`, etc. |
| | `01` + `b110` | `c.beqz` | Branch if `rs1'` is zero | `beq rs1', x0, imm` |
| | `01` + `b111` | `c.bnez` | Branch if `rs1'` is not zero | `bne rs1', x0, imm` |
| **C2** | `10` + `b00` | `c.slli` | Logical left shift | `slli rd, rd, shamt` |
| | `10` + `b01` | `c.lwsp` | Load data from stack with offset | `lw rd, imm(x2)` |
| | `10` + `b10` | `c.mv`/`c.add`/`c.jr` etc| Data move, addition, or jump | Depends on whether the register is zero |
| | `10` + `b11` | `c.swsp` | Store data to stack | `sw rs2, imm(x2)` |
| **C3**           | `11`                 | Not a compressed instruction | Low bits `11` mark a standard 32-bit instruction; returned directly | -                                               |
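As a worked example of one table row, the Python sketch below expands `c.li` into `addi rd, x0, imm`. It follows the standard RVC encoding (`imm[5]` in bit 12, `rd` in bits `[11:7]`, `imm[4:0]` in bits `[6:2]`) and is an illustration, not a translation of `CompressedDecoder.scala`. The function names are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def expand_c_li(instr16: int) -> int:
    """Expand a 16-bit `c.li rd, imm` into the 32-bit `addi rd, x0, imm`."""
    quadrant = instr16 & 0b11
    funct3 = (instr16 >> 13) & 0b111
    if quadrant != 0b01 or funct3 != 0b010:
        raise ValueError("not a c.li encoding")
    imm6 = (((instr16 >> 12) & 1) << 5) | ((instr16 >> 2) & 0x1F)
    imm = sign_extend(imm6, 6)
    rd = (instr16 >> 7) & 0x1F
    opcode_op_imm = 0x13  # opcode of addi; rs1 = x0 and funct3 = 000
    return ((imm & 0xFFF) << 20) | (rd << 7) | opcode_op_imm
```

For instance, `0x4515` (`c.li a0, 5`) expands to `0x00500513` (`addi a0, x0, 5`).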
#### Configs.scala
This code defines a **case class** named `Configs`, which is used to store and initialize the basic configuration parameters for a RISC-V core, such as the bit width (`XLEN`), whether the M and C instruction sets are enabled (`M` and `C`), and whether TRACE messages are enabled (`TRACE`).
#### Constants.scala
This code defines an object called `ALUOps` that contains various operation codes (opcodes) for an Arithmetic Logic Unit (ALU).
| ALUop | Opcode |
| ----- | ------ |
| ADD | 2 |
| SUB | 6 |
| AND | 0 |
| OR | 1 |
| XOR | 9 |
| SLL | 3 |
| SRL | 4 |
| SRA | 5 |
| SLT | 9 |
| SLTU | 10 |
| COPY | 11 |
#### Control.scala
This code implements a **control unit based on instruction decoding**, which generates the corresponding control signals by matching instruction bit patterns. It includes signals for ALU source selection (`aluSrc`), memory operation control (`memToReg`, `memRead`, `memWrite`), register write control (`regWrite`), branch decision (`branch`), jump instructions (`jump`), and ALU operation selection (`aluOp`, `aluSrc1`). These control signals guide the operation of the RISC-V processor based on the input instructions.
#### Core.scala
This code implements the core logic of a RISC-V processor, which processes instructions through a five-stage pipeline **(IF, ID, EX, MEM, WB)**. It defines pipeline registers that hold data, control signals, and computation results between stages. The stages handle instruction fetch, decode, execution, memory access, and write-back, respectively. Pipelining improves throughput by overlapping the processing of consecutive instructions. The core also issues memory access requests and outputs `RVFI` (RISC-V Formal Interface) data for simulation and validation purposes.
#### Data_SRamTOP.scala
This code defines an **SRAM memory controller** that handles read and write requests from external devices (such as a processor). It processes incoming `Decoupled` requests to determine whether the operation is a read or a write, and it drives the SRAM's chip-select, read/write control, address, and data inputs accordingly. A status register (`validReg`) tracks whether the response is valid, and the response carries either the read data or invalid data.
#### Execute.scala
This code implements an execution unit that handles the execution phase in a RISC-V processor pipeline. It is responsible for performing arithmetic and logical operations (such as addition and subtraction) based on control signals (`func3`, `func7`). The code uses a forwarding unit to resolve data hazards and selectively supports multiplication and division operations. If multiplication/division functionality (`M` is set to true) is enabled, it performs these operations based on control signals and manages pipeline stalls. Ultimately, the output includes the computation result (`ALUresult`) and the data to be written (`writeData`).
#### ForwardingUnit.scala
**Forwarding Unit** handles the *data hazard* in a processor pipeline. The Forwarding Unit determines the appropriate data source for execution by checking if the source registers (`reg_rs1`, `reg_rs2`) match with the destination registers of the execution unit (`ex_reg_rd`) or the memory unit (`mem_reg_rd`), and then chooses the correct forwarding path.
- `io.forwardA` and `io.forwardB` are output signals that indicate which data source should be forwarded, either from the execution unit (EX) or the memory unit (MEM).
- `io.reg_rs1` and `io.reg_rs2` are the source registers used in branch operations.
- `io.ex_reg_rd` represents the destination register of the execution unit, and `io.mem_reg_rd` represents the destination register of the memory unit.
- `io.ex_regWrite` and `io.mem_regWrite` indicate whether the execution unit and memory unit are writing back to their destination registers.
The Forwarding Unit thus supplies the execution unit (ALU) with the correct operand when a **data hazard** occurs, i.e., when an instruction needs a result that an earlier instruction has computed (or loaded from memory) but has not yet written back to the register file.
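The selection logic can be sketched in Python as follows. This is the textbook forwarding rule, not a transcription of `ForwardingUnit.scala`: the 2-bit encodings, the priority of the EX result over the MEM result, and the check that excludes register `x0` are assumptions. The parameter names mirror the ports listed above; `forward_select` itself is a hypothetical name.

```python
# Assumed encodings for forwardA / forwardB (textbook convention).
NO_FORWARD = 0b00   # use the value read from the register file
FORWARD_MEM = 0b01  # use the result held by the memory unit
FORWARD_EX = 0b10   # use the result held by the execution unit


def forward_select(rs: int, ex_reg_rd: int, ex_reg_write: bool,
                   mem_reg_rd: int, mem_reg_write: bool) -> int:
    """Choose the data source for one source register (rs1 or rs2)."""
    # The most recent producer (EX) takes priority over the older one (MEM).
    if ex_reg_write and ex_reg_rd != 0 and ex_reg_rd == rs:
        return FORWARD_EX
    if mem_reg_write and mem_reg_rd != 0 and mem_reg_rd == rs:
        return FORWARD_MEM
    return NO_FORWARD
```

Calling the function once with `reg_rs1` and once with `reg_rs2` yields the values for `forwardA` and `forwardB`, respectively.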
#### HazardUnit.scala
This code is responsible for detecting **data hazards** and controlling the flow of the pipeline in a processor. The Hazard Unit checks if the current instruction (mainly memory access and branch instructions) causes pipeline hazards and adjusts the control signals accordingly.
##### Code Logic:
1. **load-use hazard**: This part checks whether the instruction held in the ID/EX pipeline register is a memory read whose destination register (`id_ex_rd`) matches a source register (`id_rs1` or `id_rs2`) of the instruction currently in the ID stage.
- If this condition is true, it deasserts `io.ctl_mux`, `io.pc_write`, and `io.if_reg_write`, which stalls the pipeline so that the dependent instruction does not execute with stale data.
2. **branch hazard**: This part of the code detects if a branch hazard exists. If the `taken` signal is true or `jump` is not zero, the `IF/ID` stage needs to be cleared to avoid executing the wrong instruction.
- It sets `io.ifid_flush` to `true.B`, which signals the need to clear the pipeline.
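The two checks can be modeled in Python as below. This is a sketch under assumptions, not the Chisel implementation: the input name `id_ex_mem_read` is hypothetical, and the control outputs are assumed to be true when the pipeline runs normally and false when it is stalled.

```python
from dataclasses import dataclass


@dataclass
class HazardOut:
    ctl_mux: bool = True        # assumed: True lets control signals through
    pc_write: bool = True       # assumed: True lets the PC advance
    if_reg_write: bool = True   # assumed: True lets IF/ID be updated
    ifid_flush: bool = False    # True clears the IF/ID stage


def hazard_unit(id_ex_mem_read: bool, id_ex_rd: int, id_rs1: int,
                id_rs2: int, taken: bool, jump: int) -> HazardOut:
    """Model of the load-use stall and the branch/jump flush."""
    out = HazardOut()
    # Load-use hazard: a load's destination is a source of the next instruction.
    if id_ex_mem_read and id_ex_rd in (id_rs1, id_rs2):
        out.ctl_mux = False
        out.pc_write = False
        out.if_reg_write = False
    # Branch hazard: a taken branch or any jump flushes IF/ID.
    if taken or jump != 0:
        out.ifid_flush = True
    return out
```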
#### ImmediateGen.scala
This code extracts and extends appropriate immediate values from a given 32-bit instruction based on its instruction type (**I-type, U-type, S-type, SB-type, UJ-type**). The code determines the immediate type using the OPCODE part of the instruction and then combines the relevant bits using the `Cat` function. The extended immediate value is output to `io.out`, which will be used in subsequent arithmetic operations, address calculations, or branch decisions during instruction execution, playing a crucial role in the execution phase.
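The bit selection for each format follows the standard RISC-V encoding and can be sketched in Python as shown below. The function names are hypothetical, and the sketch returns signed Python integers, whereas the Chisel module outputs a 32-bit value on `io.out`.

```python
def sign_extend(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def imm_i(instr: int) -> int:
    return sign_extend(instr >> 20, 12)


def imm_s(instr: int) -> int:
    return sign_extend(((instr >> 25) << 5) | ((instr >> 7) & 0x1F), 12)


def imm_b(instr: int) -> int:  # SB-type
    imm = ((((instr >> 31) & 1) << 12) | (((instr >> 7) & 1) << 11)
           | (((instr >> 25) & 0x3F) << 5) | (((instr >> 8) & 0xF) << 1))
    return sign_extend(imm, 13)


def imm_u(instr: int) -> int:
    return sign_extend(instr & 0xFFFFF000, 32)


def imm_j(instr: int) -> int:  # UJ-type
    imm = ((((instr >> 31) & 1) << 20) | (((instr >> 12) & 0xFF) << 12)
           | (((instr >> 20) & 1) << 11) | (((instr >> 21) & 0x3FF) << 1))
    return sign_extend(imm, 21)
```

For example, `imm_i(0xFFF00513)` returns `-1` for `addi a0, x0, -1`, and `imm_b(0xFE000EE3)` returns `-4` for `beq x0, x0, -4`.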
#### InstructionDecode.scala
This module handles control logic, pipeline hazards, immediate generation, register file reads and writes, and branch/jump target calculation, all of which are needed to decode an instruction and pass it to the next pipeline stage.
It primarily performs the following functions:
* **Hazard Detection**:
* Detects data hazards and adjusts the pipeline accordingly based on control signals and memory responses.
* **Control Unit**:
* Decodes the instruction and generates control signals for operations, such as ALU source selection, memory access control, etc.
* **Register File**:
* Reads from and writes to the register file, and performs data forwarding.
* **Immediate Generation**:
* Extracts the immediate value from the instruction for use in calculations.
* **Branch Unit**:
* Handles branch decisions and checks if a branch should be taken.
* **Offset Calculation**:
* Calculates the new program counter (PC) address after a jump or branch.
* **Instruction Flush**:
* Flushes the fetched instruction on a control hazard (a taken branch or a jump).
* **RVFI**:
* If tracing is enabled, outputs register information involved in the instruction operation.
#### InstructionFetch.scala
This code implements an instruction fetch module that retrieves a 32-bit instruction from memory at a specified address. It interacts with the memory using a handshake protocol, sending read requests and receiving instruction responses. The module supports reset and stall mechanisms to ensure request operations are paused when the pipeline is stalled or in a reset state.
#### Instruction_SRamTOP.scala
This code defines an SRAM controller module named `Instruction_SRamTop` designed specifically for instruction access, implemented using Chisel. It utilizes the BlackBox module `instructioncache_sramTop` to simulate SRAM behavior. The functionality and logic are similar to `Data_SRamTop`: it handles SRAM signals like address, write mask, and others based on external read or write requests, enabling instruction read or write operations. The module optionally supports loading an initialization file into memory and sends the processed results back to the response port, making it suitable for managing processor instruction access.
#### JumpUnit.scala
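The table below summarizes the code. Although the input is named `func7`, the values it is compared against are the 7-bit opcodes of JAL and JALR.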
| Condition (`func7`) | Output (`jump`) | Description |
|---------------------|-----------------|-------------------------------------------|
| `1101111` (JAL) | `2` | Indicates a **JAL instruction**, unconditional jump. |
| `1100111` (JALR) | `3` | Indicates a **JALR instruction**, register-based unconditional jump. |
| Other values | `0` | Indicates a non-jump instruction, no jump occurs. |
#### MDU.scala
This code implements a multi-functional integer multiplication and division unit (MDU), capable of performing **multiplication**, **division**, and **modulus** operations.
- **Inputs**: `src_a`, `src_b` (32-bit numbers), `op` (operation type), `valid`.
- **Outputs**: `ready` (whether ready), `output.bits` (operation result).
##### **Key Logic**
- **Multiplication**: Depending on `op`, different multiplication operations are performed.
- **Division**: Simulates division using state registers, outputs quotient or remainder based on `op`.
- **Output**: Based on `op`, it selects whether to output multiplication results, division quotient, or remainder.
#### Main.scala
Nothing here
#### MemIO.scala
These `MemRequestIO` and `MemResponseIO` classes define the request and response interfaces for memory operations, facilitating communication between hardware modules such as CPU cores and memory or peripherals.
#### MemoryFetch.scala
This code implements a **Memory Fetch module** with the following functionality:
1. **Write Operations**:
- Supports word, halfword, and byte-level data storage.
- Activates specific byte lanes based on `funct3` and configures valid data bits.
2. **Read Operations**:
- Supports signed and zero-extended data loading.
- Selects the correct data byte or halfword based on `funct3` and address offset.
3. **Address and Data Formatting**:
- Aligns memory addresses as required and packages them into read/write requests.
- Processes returned data and formats it for proper output.
4. **Request Management**:
- Controls the validity of memory requests and sets a stall signal to ensure operational synchronization based on response readiness.
5. **Debugging Support**:
- Outputs data during specific write operations for debugging purposes.
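
As a rough illustration of the byte-lane and load-extension logic listed above, here is a plain Scala sketch. It assumes the standard RISC-V `funct3` store encoding (0 = SB, 1 = SH, 2 = SW); the names `MemAccessSketch`, `storeLanes`, and `loadByte` are hypothetical, and the real module may handle corner cases (such as a halfword at offset 3) differently.

```scala
object MemAccessSketch {
  // Byte-lane mask for a store.
  // `offset` is the low two bits of the byte address.
  def storeLanes(funct3: Int, offset: Int): Int = funct3 match {
    case 0 => (0x1 << offset) & 0xf // SB: one lane
    case 1 => (0x3 << offset) & 0xf // SH: two adjacent lanes
    case _ => 0xf                   // SW: all four lanes
  }

  // Pick one byte out of a 32-bit word and extend it to 32 bits,
  // either sign-extended (LB) or zero-extended (LBU).
  def loadByte(word: Int, offset: Int, signed: Boolean): Int = {
    val b = (word >>> (8 * offset)) & 0xff
    if (signed) b.toByte.toInt else b
  }
}
```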
#### PC.scala
This code implements a **Program Counter** (PC) module, which manages the current program address in a processor and calculates the next possible address (incremented by 4 or 2) based on a condition (whether halted), supporting program flow control and instruction address generation.
#### Realigner.scala
This code implements a `Realigner` module that handles misaligned instruction addresses. In the processor pipeline, if the instruction address is misaligned (i.e., it is not word-aligned), the module uses a state machine to process the instruction in two steps: first, it stores the upper 16 bits of the current instruction word, halts the PC for one cycle, and outputs a NOP instruction to the core; then, in the next cycle, it concatenates the saved upper 16 bits with the lower 16 bits of the new word to form an aligned instruction, which is then sent to the core. If the address is already aligned, the module passes the instruction through directly. This design ensures instruction alignment and prevents issues caused by misaligned addresses.
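
The concatenation step can be sketched in plain Scala. This assumes the usual little-endian RISC-V layout, where the saved upper halfword of the first word becomes the low half of the instruction and the lower halfword of the next word becomes its high half; `RealignSketch` and `realign` are hypothetical names, and the state machine, PC halt, and NOP insertion are not modeled.

```scala
object RealignSketch {
  // prevWord: the word whose upper 16 bits were saved in the first step.
  // nextWord: the word fetched in the following cycle.
  def realign(prevWord: Int, nextWord: Int): Int =
    ((nextWord & 0xffff) << 16) | (prevWord >>> 16)
}
```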
#### Registers.scala
This code implements a simple 32-bit register file module for handling register read and write operations in a processor. It supports two read ports, allowing data to be read from two registers simultaneously, and can write data to a specified register based on the input write address.
#### Top.scala
The `Top` module serves as the integration point for the various components in the processor system, such as the core, instruction memory, data memory, and trace functionality. It coordinates the interaction between the core and memory modules, ensuring that instructions and data are correctly exchanged between the core and memory. Additionally, when tracing is enabled in the configuration, the `Top` module sends the core's execution details to the trace module for monitoring and debugging purposes.
#### SRamTop.scala
This code defines a module for handling read and write operations to SRAM. The `SRamTop` module acts as an interface for memory requests and responses, working with the `sram_top` module to perform actual read and write operations. When a read request is received, `SRamTop` enables the SRAM, reads the data, and returns it. For a write request, it writes the data into SRAM. The module uses control signals such as `csb_i`, `we_i`, and `wmask_i` to manage memory operations. The `sram_top` module is implemented as a black-box, interacting with an external Verilog memory model and optionally loading a program file to initialize the memory content.
#### new_SRamTOP.scala
The `new_SRamTop` module handles memory requests by managing data read and write operations. It interfaces with a custom SRAM model (`new_sramTop`) and performs read and write actions based on the incoming requests, which include address and data. The module uses a handshake mechanism with `Decoupled` signals and controls the SRAM using signals like `csb_i`, `we_i`, and `addr_i`. The module also incorporates logic to handle both read and write operations, ensuring data integrity by properly managing the valid response and request readiness signals.
:::
#### Memory Request and Response I/O
The memory request and response interfaces use Chisel's `Decoupled` (ready/valid) interface. `Decoupled` separates components so that they can operate independently, which is especially useful when designing systems with multiple modules.
A decoupled interface has two handshake signals that need to be set: the `valid` signal and the `ready` signal.
* The `sender` is responsible for setting the `valid` signal to indicate that the data is going to be sent.
* The `receiver` is responsible for setting the `ready` signal to indicate that it is ready to accept the data.
As shown in the figure below, there are two directions of communication. In either direction, the sender is responsible for setting the `valid` signal, while the receiver is responsible for setting the `ready` signal.

Here, the entity requesting memory could be the CPU or a test program.
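
The handshake rule can be stated compactly: data moves only in a cycle where `valid` and `ready` are both high. The plain Scala sketch below models just that rule; `HandshakeSketch`, `Cycle`, and `transfers` are hypothetical names for illustration, not part of Chisel or NucleusRV.

```scala
object HandshakeSketch {
  // The two handshake signals of one channel during a single clock cycle.
  final case class Cycle(valid: Boolean, ready: Boolean) {
    // A transfer happens only when sender and receiver agree.
    def fire: Boolean = valid && ready
  }

  // Count how many transfers complete over a sequence of cycles.
  def transfers(cycles: Seq[Cycle]): Int = cycles.count(_.fire)
}
```

With this model, a sender that holds `valid` high while the receiver is not yet `ready` simply waits; nothing is transferred in that cycle.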
Take `InstructionFetch` as an example. To accept memory responses, it sets the `ready` signal of its `MemResponseIO` instance (`coreInstrResp`) to true (`io.coreInstrResp.ready := true.B`). Similarly, it sets the `valid` signal of its `MemRequestIO` instance (`coreInstrReq`) to true unless a reset or stall condition occurs (`io.coreInstrReq.valid := Mux(rst || io.stall, false.B, true.B)`).
```scala=
class InstructionFetch extends Module {
  val io = IO(new Bundle {
    val address: UInt = Input(UInt(32.W))
    val instruction: UInt = Output(UInt(32.W))
    val stall: Bool = Input(Bool())
    val coreInstrReq = Decoupled(new MemRequestIO)
    val coreInstrResp = Flipped(Decoupled(new MemResponseIO))
  })
  val rst = Wire(Bool())
  rst := reset.asBool()
  io.coreInstrResp.ready := true.B
  io.coreInstrReq.bits.activeByteLane := "b1111".U
  io.coreInstrReq.bits.isWrite := false.B
  io.coreInstrReq.bits.dataRequest := DontCare
  io.coreInstrReq.bits.addrRequest := io.address >> 2
  io.coreInstrReq.valid := Mux(rst || io.stall, false.B, true.B)
  io.instruction := Mux(io.coreInstrResp.valid, io.coreInstrResp.bits.dataResponse, DontCare)
}
```
The `MemRequestIO` and `MemResponseIO` described here are defined in `MemIO.scala`.
The `addrRequest` specifies the memory location to be accessed, and the `isWrite` signal determines whether the operation is a write or a read.
```scala
class MemRequestIO extends Bundle {
  val addrRequest: UInt = Input(UInt(32.W))
  val dataRequest: UInt = Input(UInt(32.W))
  val activeByteLane: UInt = Input(UInt(4.W))
  val isWrite: Bool = Input(Bool())
}
class MemResponseIO extends Bundle {
  val dataResponse: UInt = Input(UInt(32.W))
}
```
In a write operation, `dataRequest` contains the data to be written to the cache. In a read operation, `isWrite` is set to `false` and `dataRequest` is **DontCare**, as in the `isWrite` and `dataRequest` assignments of `InstructionFetch`. The `MemResponseIO` carries the data that was read in its `dataResponse` signal.
## Observations in Cache Access
Below is the content of `CacheSRAMTests` after modifying the test from `new_SRamTop` to `Instruction_SRamTop`. Note that the programs for `Instruction_SRamTop` and `Data_SRamTop` are basically the same, differing only in name.
In this test program, data is written and then read from the corresponding address to verify that the cache works correctly and can access previously stored data.
```scala
package nucleusrv.components
import chisel3._
import chisel3.util._
import org.scalatest._
import chiseltest._
import chiseltest.experimental.TestOptionBuilder._
import chiseltest.internal.VerilatorBackendAnnotation
class CacheSRAMTests extends FreeSpec with ChiselScalatestTester {
  "New SRAM Test" in {
    test(new Instruction_SRamTop(None)).withAnnotations(Seq(VerilatorBackendAnnotation)) { c =>
      // Write data to a specific address
      c.io.req.valid.poke(true.B)
      c.io.req.bits.isWrite.poke(true.B) // Write operation
      c.io.req.bits.addrRequest.poke(100.U) // Address to write
      c.io.req.bits.dataRequest.poke(42.U) // Data to write
      c.io.req.bits.activeByteLane.poke("b1111".U) // Enable all bytes
      c.clock.step(10) // Allow time for write to complete
      // Read data back from the same address
      c.io.req.bits.isWrite.poke(false.B) // Read operation
      c.io.req.bits.addrRequest.poke(100.U) // Address to read
      c.clock.step(10) // Allow time for read
      c.io.rsp.bits.dataResponse.expect(42.U) // Verify the data
      // Test another address
      c.io.req.bits.addrRequest.poke(5000000.U) // Address to write (large address within range)
      c.io.req.bits.dataRequest.poke(123.U) // Data to write
      c.io.req.bits.isWrite.poke(true.B) // Write operation
      c.clock.step(10) // Allow time for write
      c.io.req.bits.isWrite.poke(false.B) // Read operation
      c.io.req.bits.addrRequest.poke(5000000.U) // Address to read
      c.clock.step(10) // Allow time for read
      c.io.rsp.bits.dataResponse.expect(123.U) // Verify the data
    }
  }
}
```
Here is a code snippet from `Instruction_SRamTop`, where two printf lines are added to observe whether the process is a read or write operation.
```diff
when(io.req.valid && !io.req.bits.isWrite) {
// READ
+ printf("read case\n")
...
} .elsewhen(io.req.valid && io.req.bits.isWrite) {
// WRITE
+ printf("write case\n")
...
} .otherwise {
...
}
```
During the `CacheSRAMTests`, both read and write requests can be observed.
> ```
> ...
> STARTING test_run_dir/New_SRAM_Test/VInstruction_SRamTop
> write case
> write case
> read case
> read case
> write case
> read case
> read case
> Exit Code: 0
> [info] - New SRAM Test
> ...
> [info] All tests passed.
> [success] Total time: 2 s, completed Jan 21, 2025, 11:48:02 AM
> ```
However, in `TopTest`, only read requests are ever observed.
> ```
> ...
> read case
> read case
> read case
> read case
> read case
> Enabling waves..
> Exit Code: 0
> [info] - Top Test
> ...
> [info] All tests passed.
> [success] Total time: 3 s, completed Jan 21, 2025, 11:50:01 AM
> ```
Since this project lacks a memory hierarchy, every request results in a miss. Moreover, the CPU's loading of the program file and data file does not seem to be fully implemented, so it cannot execute instructions from a file that I provide.
Therefore, a more feasible approach for us is to focus first on improving and testing only the instruction fetch part.
## Cache Implementation for Instruction Fetch
### Modifying Memory Request and Response I/O
First, I want to know whether a cache access is a hit or a miss, so I added a `hit` output to `MemResponseIO`. The other parts remain unchanged, keeping the I/O interface as close to the original as possible. Once the hit or miss is known, further actions can follow, such as handling a cache miss.
```scala
class MemRequestIO extends Bundle {
  val addrRequest: UInt = Input(UInt(32.W))
  val isWrite: Bool = Input(Bool())
  val dataRequest: UInt = Input(UInt(32.W))
  val activeByteLane: UInt = Input(UInt(4.W))
}
class MemResponseIO extends Bundle {
  val dataResponse: UInt = Output(UInt(32.W))
  val hit: Bool = Output(Bool())
}
```
### Implementing a Simple Cache
#### Cache Program
I am attempting to write another cache without using Verilog black boxes. Below is our attempt at implementing a simple direct-mapped cache. Each cache line contains a valid bit, a tag, and a data word.
```scala=
package nucleusrv.components
import chisel3._
import chisel3.util._
import chisel3.experimental._
import chisel3.util.experimental._
class Instruction_SRamTop(val programFile: Option[String]) extends Module {
  val io = IO(new Bundle {
    val req = Flipped(Decoupled(new MemRequestIO))
    val rsp = Decoupled(new MemResponseIO)
  })
  val validReg = RegInit(false.B)
  io.rsp.valid := validReg
  io.req.ready := true.B
  val cacheSize = 8  // number of cache lines
  val blockSize = 32 // bits per line
  val indexBits = log2Ceil(cacheSize)
  val validBits = RegInit(VecInit(Seq.fill(cacheSize)(false.B))) // valid bits
  val tags = Reg(Vec(cacheSize, UInt((32 - indexBits).W)))       // tags
  val data = Reg(Vec(cacheSize, UInt(blockSize.W)))              // data
  // Extract the tag and index from the request address
  val tag = io.req.bits.addrRequest(31, indexBits)
  val index = io.req.bits.addrRequest(indexBits - 1, 0)
  // Valid bit, tag, and data currently stored in the selected line
  val isValid = validBits(index)
  val cacheTag = tags(index)
  val cacheData = data(index)
  // Hit: the line is valid and its tag matches the request
  io.rsp.bits.hit := isValid && (cacheTag === tag)
  dontTouch(io.req.valid)
  val target_data = Reg(UInt(32.W))
  target_data := 0.U
  when(io.req.valid && !io.req.bits.isWrite) {
    // READ: return the cached word on a hit, 0 on a miss
    when(io.rsp.bits.hit) {
      printf(">>>Cache hit! %x\n", cacheData)
      target_data := cacheData
    }.otherwise {
      printf(">>>Cache miss!\n")
      target_data := 0.U
    }
    validReg := true.B
  }.elsewhen(io.req.valid && io.req.bits.isWrite) {
    // WRITE: fill the selected line
    printf(">>>Write, %x\n", io.req.bits.dataRequest)
    validBits(index) := true.B
    tags(index) := tag
    data(index) := io.req.bits.dataRequest
    validReg := true.B
  }.otherwise {
    validReg := false.B
  }
  io.rsp.bits.dataResponse := target_data
}
```
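
To make the tag/index lookup easier to follow, here is a plain Scala software model of the same idea. It is only a sketch under the parameters used above (8 lines, 3 index bits); `DirectMappedSketch`, `split`, and `Cache` are hypothetical names, and timing (the registered response) is not modeled.

```scala
object DirectMappedSketch {
  val cacheSize = 8 // number of lines, as in the module above
  val indexBits = 3 // log2(cacheSize)

  // Split an address into (tag, index): low bits select the line,
  // the remaining high bits are stored as the tag.
  def split(addr: Long): (Long, Int) =
    (addr >>> indexBits, (addr & (cacheSize - 1)).toInt)

  final class Cache {
    private val valid = Array.fill(cacheSize)(false)
    private val tags  = Array.fill(cacheSize)(0L)
    private val data  = Array.fill(cacheSize)(0L)

    // Fill the selected line, replacing whatever was there.
    def write(addr: Long, value: Long): Unit = {
      val (tag, index) = split(addr)
      valid(index) = true
      tags(index) = tag
      data(index) = value
    }

    // Returns Some(data) on a hit and None on a miss.
    def read(addr: Long): Option[Long] = {
      val (tag, index) = split(addr)
      if (valid(index) && tags(index) == tag) Some(data(index)) else None
    }
  }
}
```

In this model, writing address 100 and then address 108 targets the same line (index 4), so a later read of address 100 misses.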
#### Test Program
Since this implementation maintains an almost identical interface to the original `Instruction_SRamTop`, the `CacheSRAMTests` can continue to be used to test this new implementation. Additionally, since a `hit` signal has been added to `MemResponseIO`, the test can now also check the hit output, as shown by the three `c.io.rsp.bits.hit.expect(...)` calls.
```scala=
package nucleusrv.components
import chisel3._
import chisel3.util._
import org.scalatest._
import chiseltest._
import chiseltest.experimental.TestOptionBuilder._
import chiseltest.internal.VerilatorBackendAnnotation
class CacheSRAMTests extends FreeSpec with ChiselScalatestTester {
  "New SRAM Test" in {
    test(new Instruction_SRamTop(None)).withAnnotations(Seq(VerilatorBackendAnnotation)) { c =>
      // Write data to a specific address
      c.io.req.valid.poke(true.B)
      c.io.req.bits.isWrite.poke(true.B) // Write operation
      c.io.req.bits.addrRequest.poke(100.U) // Address to write
      c.io.req.bits.dataRequest.poke(42.U) // Data to write
      c.io.req.bits.activeByteLane.poke("b1111".U) // Enable all bytes
      c.clock.step(10) // Allow time for write to complete
      // Read data back from the same address
      c.io.req.bits.isWrite.poke(false.B) // Read operation
      c.io.req.bits.addrRequest.poke(100.U) // Address to read
      c.clock.step(10) // Allow time for read
      c.io.rsp.bits.dataResponse.expect(42.U) // Verify the data
      c.io.rsp.bits.hit.expect(true.B)
      // Test another address
      c.io.req.bits.addrRequest.poke(5000000.U) // Address to write (large address within range)
      c.io.req.bits.dataRequest.poke(123.U) // Data to write
      c.io.req.bits.isWrite.poke(true.B) // Write operation
      c.clock.step(10) // Allow time for write
      c.io.req.bits.isWrite.poke(false.B) // Read operation
      c.io.req.bits.addrRequest.poke(5000000.U) // Address to read
      c.clock.step(10) // Allow time for read
      c.io.rsp.bits.dataResponse.expect(123.U) // Verify the data
      c.io.rsp.bits.hit.expect(true.B)
      // Read an address that was never written: expect a miss
      c.io.req.bits.isWrite.poke(false.B) // Read operation
      c.io.req.bits.addrRequest.poke(777.U) // Address to read
      c.clock.step(10) // Allow time for read
      c.io.rsp.bits.dataResponse.expect(0.U) // Verify the data
      c.io.rsp.bits.hit.expect(false.B)
    }
  }
}
```
The output looks like this. During execution, it shows whether each access is a hit or a miss. The number of printed lines depends on the cycle count passed to `clock.step`.
> ```
> >>>Write, 0000002a
> >>>Write, 0000002a
> >>>Write, 0000002a
> >>>Cache hit! 0000002a
> >>>Cache hit! 0000002a
> >>>Cache hit! 0000002a
> >>>Write, 0000007b
> >>>Write, 0000007b
> >>>Write, 0000007b
> >>>Cache hit! 0000007b
> >>>Cache hit! 0000007b
> >>>Cache hit! 0000007b
> >>>Cache miss!
> >>>Cache miss!
> >>>Cache miss!
> Exit Code: 0
> [info] - New SRAM Test
> ...
> [info] All tests passed.
> [success] Total time: 2 s, completed Jan 21, 2025, 5:36:54 PM
> ```
### Instruction Fetch
As previously described, the CPU only performs reads during each cycle and does not attempt to write missing data into the cache when a cache miss occurs.
After creating the new cache, I started focusing on whether the CPU is truly utilizing the cache. Specifically, I am examining whether the CPU correctly handles cache misses and fetches data from the next level, as it should in the IF (Instruction Fetch) stage.
Based on the original Instruction Fetch process, I made some modifications so that when a miss occurs, the data is placed into the cache. This prevents endless compulsory misses after a cold start.
#### Instruction Fetch Program
```scala=
package nucleusrv.components
import chisel3._
import chisel3.util._
import scala.io.Source
class InstructionFetch extends Module {
  val io = IO(new Bundle {
    val stall: Bool = Input(Bool())
    val address: UInt = Input(UInt(32.W))
    val coreInstrReq = Decoupled(new MemRequestIO)
    val coreInstrResp = Flipped(Decoupled(new MemResponseIO))
    val instruction: UInt = Output(UInt(32.W))
    val hit: Bool = Output(Bool())
  })
  io.instruction := 0.U
  io.hit := false.B
  io.coreInstrResp.ready := true.B
  val rst = Wire(Bool())
  rst := reset.asBool()
  io.coreInstrReq.valid := true.B // Mux(rst || io.stall, false.B, true.B)
  io.coreInstrReq.bits.isWrite := false.B
  io.coreInstrReq.bits.dataRequest := DontCare
  io.coreInstrReq.bits.addrRequest := io.address >> 2
  io.coreInstrReq.bits.activeByteLane := "b1111".U
  // Send a request to the SRAM (cache), trying to read data.
  val real_iSRAM = Module(new Instruction_SRamTop(None))
  real_iSRAM.io.req.valid := true.B
  real_iSRAM.io.rsp.ready := true.B
  val writeEnable = RegInit(false.B)
  real_iSRAM.io.req.bits.isWrite := writeEnable
  real_iSRAM.io.req.bits.addrRequest := io.coreInstrReq.bits.addrRequest
  real_iSRAM.io.req.bits.dataRequest := DontCare
  real_iSRAM.io.req.bits.activeByteLane := "b1111".U
  // Hit or miss, delayed by one cycle
  io.hit := RegNext(real_iSRAM.io.rsp.bits.hit, false.B)
  when(real_iSRAM.io.rsp.valid && !io.hit) {
    // On a cache miss, use a .txt file as the next memory level.
    // The file is read while the hardware is generated (elaboration time).
    val searchTableData = RegInit(0.U(32.W))
    val filename = "address_data.txt"
    val addressDataMap = Source.fromFile(filename).getLines()
      .map { line =>
        val Array(address, data) = line.split("=").map(_.trim)
        BigInt(address, 16) -> BigInt(data, 16).U(32.W)
      }.toMap
    // One comparator per table entry selects the data for io.address
    addressDataMap.foreach { case (address, data) =>
      when(address.U === io.address) {
        searchTableData := data
      }
    }
    // Get the instruction
    io.instruction := searchTableData
    // Load it into the cache
    real_iSRAM.io.req.bits.isWrite := true.B
    real_iSRAM.io.req.bits.addrRequest := io.coreInstrReq.bits.addrRequest
    real_iSRAM.io.req.bits.dataRequest := searchTableData
  }.elsewhen(real_iSRAM.io.rsp.valid && io.hit) {
    real_iSRAM.io.req.bits.isWrite := false.B
    // Get the instruction from the cache
    io.instruction := real_iSRAM.io.rsp.bits.dataResponse
  }
}
```
This is similar to the original Instruction Fetch program, except that when a cache miss occurs, it accesses the next memory level to load the data for the requested address.
If a cache miss occurs, the condition `real_iSRAM.io.rsp.valid && !io.hit` is met, and the data is retrieved from the next level and placed into the cache.
Since there is no memory hierarchy in this system, I use a text file as a substitute for the lower-level memory. When a cache miss occurs, the data for the requested address is looked up in the contents of this text file (the part from `val filename = "address_data.txt"` through the `addressDataMap.foreach` loop).
Copy the `addr_x=data` column (excluding the heading) into an `address_data.txt` file and place it in the root directory of the project.
```
index addr_bin addr_dec addr_x=data
0 000000|00 0 0=fe010113
1 000001|00 4 4=00012a23
2 000010|00 8 8=08000793
3 000011|00 12 c=00f12823
4 000100|00 16 10=000027b7
5 000101|00 20 14=00f12623
6 000110|00 24 18=00012e23
7 000111|00 28 1c=0400006f
0 001000|00 32 20=00012c23
1 001001|00 36 24=0200006f
2 001010|00 40 28=01c12703
3 001011|00 44 2c=01812783
4 001100|00 48 30=00f707b3
5 001101|00 52 34=00f12a23
6 001110|00 56 38=01812783
7 001111|00 60 3c=00178793
0 010000|00 64 40=00f12c23
1 010001|00 68 44=01812703
2 010010|00 72 48=00c12783
3 010011|00 76 4c=fcf74ee3
4 010100|00 80 50=01c12783
5 010101|00 84 54=00178793
6 010110|00 88 58=00f12e23
7 010111|00 92 5c=01c12703
0 011000|00 96 60=01012783
1 011001|00 100 64=faf74ee3
2 011010|00 104 68=00000793
3 011011|00 108 6c=00078513
4 011100|00 112 70=02010113
5 011101|00 116 74=00008067
```
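
The expected file format is one `address=data` pair per line, both in hexadecimal. The plain Scala sketch below parses such lines the same way the fetch module does, but without any Chisel types; `AddressDataSketch` and `parse` are hypothetical names, and unlike the module this version skips blank lines instead of failing on them.

```scala
object AddressDataSketch {
  // Parse lines of the form "<hex address>=<hex data>" into a map.
  def parse(lines: Iterator[String]): Map[BigInt, BigInt] =
    lines.map(_.trim).filter(_.nonEmpty).map { line =>
      val Array(address, data) = line.split("=").map(_.trim)
      BigInt(address, 16) -> BigInt(data, 16)
    }.toMap
}
```

For example, `AddressDataSketch.parse(Iterator("0=fe010113", "1c=0400006f"))` yields a two-entry map keyed by the byte addresses 0 and 28.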
#### Test Program
We also created a test for `InstructionFetch`, which passes. Initially, it reads data from addresses 0 and 28, resulting in compulsory misses (the first two `hit.expect(false.B)` checks).
Since a text file is used as the next-level storage, when a miss occurs, data can be retrieved from it and loaded into the cache. On the second read request, a cache hit occurs (the `read address 0` and `read address 28` parts).
When new data (address 92) is placed into the same line of the cache, the previous data (address 28) is evicted from the cache. So, when address 28 is read again, it results in a conflict miss (from `Request 92 and evict 28` to the end of the test).
```scala=
package nucleusrv.components
import chisel3._
import org.scalatest.FreeSpec
import chiseltest._
class InstructionFetchTest extends FreeSpec with ChiselScalatestTester {
  "IF Test" in {
    test(new InstructionFetch) { IF =>
      IF.io.stall.poke(false.B)
      IF.io.address.poke(0.U)
      IF.io.hit.expect(false.B)
      IF.io.address.poke(28.U)
      IF.io.hit.expect(false.B)
      // read address 0
      IF.clock.step(10)
      IF.io.address.poke(0.U)
      IF.io.hit.expect(true.B)
      IF.clock.step()
      IF.io.instruction.expect(BigInt("fe010113", 16).U)
      // read address 28
      IF.clock.step(10)
      IF.io.address.poke(28.U)
      IF.io.hit.expect(true.B)
      IF.clock.step()
      IF.io.instruction.expect(BigInt("0400006f", 16).U)
      // Request 92 and evict 28
      IF.io.address.poke(92.U)
      IF.clock.step()
      IF.io.hit.expect(false.B)
      // read address 92
      IF.clock.step(10)
      IF.io.address.poke(92.U)
      IF.io.hit.expect(true.B)
      IF.clock.step()
      IF.io.instruction.expect(BigInt("01c12703", 16).U)
      // Conflict miss for 28
      IF.io.address.poke(28.U)
      IF.clock.step()
      IF.io.hit.expect(false.B)
    }
  }
}
```
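
The reason addresses 28 and 92 collide can be checked with a little arithmetic. `InstructionFetch` sends the word address (`io.address >> 2`) to the cache, and the cache uses its low 3 bits as the index. The sketch below reproduces that calculation in plain Scala; `ConflictSketch`, `index`, and `tag` are hypothetical helper names.

```scala
object ConflictSketch {
  val indexBits = 3 // 8 cache lines

  // Cache line selected by a byte address (after the >> 2 in InstructionFetch).
  def index(byteAddr: Int): Int = (byteAddr >> 2) & ((1 << indexBits) - 1)

  // Tag stored alongside the data for that byte address.
  def tag(byteAddr: Int): Int = (byteAddr >> 2) >> indexBits
}
```

Address 28 gives word address 7 (index 7, tag 0), and address 92 gives word address 23 (index 7, tag 2). Both map to line 7 with different tags, so loading one evicts the other.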
### Future Work
In this project, we built and ran NucleusRV, a Chisel-based CPU, with an emphasis on understanding its memory-related components. While a simple instruction cache has been implemented, the data cache remains to be developed. This will likely require a more complex cache design.
Additionally, the cache hierarchy has not yet been fully implemented, and there is room for further refinement in this area.
Another area for future work is improving the CPU's ability to read program and data files, which currently has some limitations. There is also potential for improving how the cache integrates with the CPU and with external systems.
## Reference
* https://github.com/merledu/nucleusrv
* https://github.com/riscv-non-isa/riscv-arch-test
* https://www.chisel-lang.org/
* https://www.veripool.org/verilator/
* https://github.com/riscv-collab/riscv-gnu-toolchain
* https://mybinder.org/v2/gh/freechipsproject/chisel-bootcamp/master