程品叡, 吳睿秉
NucleusRV is a Chisel-based, 5-stage pipelined RISC-V CPU that implements the 32-bit version of the ISA. Verilator is used to generate a C++ simulator and an executable, which are then verified using the RISC-V architectural tests. The CPU currently supports a limited set of instructions. In this project, we focus on the memory-related components, i.e., those related to SRAM. We build NucleusRV on Ubuntu 22.04.5 and review the progress of the cache system completed so far. We then implement a simple direct-mapped cache similar to the existing one and study compulsory misses in instruction fetch.
main branch: https://github.com/merledu/nucleusrv.git
or new cache branch: https://github.com/merledu/nucleusrv/tree/new_cache
riscv-arch-test should be placed inside the nucleusrv directory.
riscv-arch-test: https://github.com/riscv-non-isa/riscv-arch-test.git
Reference: https://www.chisel-lang.org/docs/installation#java-development-kit-jdk
Reference: https://www.chisel-lang.org/docs/installation#sbt
Reference: https://verilator.org/guide/latest/install.html#package-manager-quick-install
After successfully installing, check the version.
The terminal should show something like :
NOTE: Only on Ubuntu 22.04.5 does running `sudo apt-get install verilator` correctly install Verilator 4.038.
On newer versions of Verilator, the sbt test might fail because Verilator requires an additional argument that specifies how to handle timing. For example, on Ubuntu 24.04.1, the same command installs Verilator 5.020, which is newer but fails to run the sbt `testOnly` command.
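The version check above can be automated. The following is a minimal sketch in Python (not part of the nucleusrv repository) that parses the text printed by `verilator --version` and reports whether a timing option will be needed. The helper names are hypothetical, and the sketch assumes that the `--timing` / `--no-timing` requirement applies to Verilator 5 and later.

```python
import re


def verilator_major_version(version_output: str) -> int:
    """Extract the major version from the text printed by `verilator --version`."""
    match = re.search(r"Verilator\s+(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_output!r}")
    return int(match.group(1))


def needs_timing_option(version_output: str) -> bool:
    # Assumption: the --timing / --no-timing requirement applies from Verilator 5 onward.
    return verilator_major_version(version_output) >= 5
```

On a real machine, the input string could be obtained with `subprocess.run(["verilator", "--version"], capture_output=True, text=True).stdout`.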
Running Compliance Tests (`README.md` of nucleusrv)
- Clone the `riscv-arch-test` repo in the nucleusrv root: `git clone git@github.com:riscv-non-isa/riscv-arch-test.git -b 1.0`
- Build the simulation executable as defined in the "Building with SBT" section.
- Run `./run-compliance.sh` in the root directory.
When I executed `./run-compliance.sh`, the following error message appeared.
According to the Errors and Warnings page, the NEEDTIMINGOPT error indicates that the command does not specify how Verilator should handle timing-related constructs, such as delays.
Running `./run-compliance.sh` triggers the command on line 4, which invokes Verilator, and we were unable to locate where to modify the arguments passed to Verilator. An alternative solution is to type the command manually instead of running `./run-compliance.sh` directly. This allows us to add the `--timing` or `--no-timing` argument at the end of the command.
The command looks like:
riscv-gnu-toolchain: https://github.com/riscv-collab/riscv-gnu-toolchain
After successfully installing and building, run the following command to check the installation.
The terminal should show something like :
After building the riscv-gnu-toolchain, the following commands are available. Type `riscv64-unknown-elf-` in the terminal and press Tab to list them.
Run the following command in SBT shell:
The terminal should show something like :
After this completes successfully, a `VTop` executable will be generated in `nucleusrv/test_run_dir/Top_Test`.
Additionally, the following command runs the cache SRAM test defined in `nucleusrv/src/test/scala/components/SRamTests.scala`:
In the root directory, run `./run-compliance.sh`. If the `VTop` executable already exists, it will start the RISC-V architecture tests. However, before actually starting, some modifications need to be made:
In the old ISA specification, CSR related instructions were part of the basic instruction set. However, in the new ISA specification, CSR instructions were separated into the Zicsr extension. Therefore, to recognize the related instructions, explicit configuration settings must be made in the Makefile.
Running the RISC-V architecture test suite for `rv32i` requires adding `-march=rv32imac_zicsr` at the end of `RISCV_GCC_OPTS` in `nucleusrv/riscv-target/nucleusrv/device/rv32i/Makefile.include`. This is not needed when running rv32im or rv32imc.
`./run-compliance.sh` to Run Test

In `./run-compliance.sh`, the `$ISA` and `$TEST` variables specify the test case in `nucleusrv/riscv-arch-test/riscv-test-suite/[ISA]/src/[TEST]` that will be executed.
Note that if the ISA is specifically `rv32i`, modifying the Makefile is necessary, as described in the Modify Makefile section above.
If you want to run all the tests, set `TEST` to `$ALL`.
For example, the following modifications to ISA and TEST will only perform the ISA rv32imc's C-ADDI16SP test.
After setting the `ISA` and `TEST` variables, the two make commands in this script will run the tests.
The test compares the following two files; if they are identical, the test passes.
- `riscv-arch-test/riscv-test-suite/[ISA]/reference`
- `riscv-arch-test/work/[ISA]/[TEST].output`
Before building a C program, note that the toolchain was configured with `./configure --prefix=/opt/riscv`, which by default installs the `riscv64` toolchain. Change line 1 of the project's Makefile in `nucleusrv/tools/makefile` to `RISCV=riscv64-unknown-elf-` so that it does not call `riscv32-unknown-elf-`.
Navigate to `nucleusrv/tools` and run the following command to build your program, which is located in `nucleusrv/tools/tests/FOLDER_NAME`:
make PROGRAM=<FOLDER_NAME>
For example, if your program folder in `nucleusrv/tools/tests` is named `hello_world`, use:
The terminal should show something like :
The corresponding RISC-V instructions will be generated in `nucleusrv/tools/out/program.dump`.
Note that in this project, the class names and file names do not always correspond and may have inconsistent capitalization. The names mentioned below, if without file extensions, primarily refer to the class names.
The main difference between the main branch and the new_cache branch is that the SRAM is divided into an instruction SRAM and a data SRAM. In the new_cache branch, there are four memory classes: `SRamTop`, `new_SRamTop`, `Instruction_SRamTop`, and `Data_SRamTop`. They differ only in class and variable names; their content is essentially identical.
To be more specific, `SRamTop` is not used at all. `new_SRamTop` is only used in the test program `SRamTests.scala`, in the `CacheSRAMTests` class. After modifying the test program to test `Instruction_SRamTop` or `Data_SRamTop` instead of `new_SRamTop`, as shown below, `new_SRamTop` is no longer used.
The `CacheSRAMTests` test program is also a newly implemented feature in the new_cache branch; it has not been enabled in the main branch.
The cache or memory is the closest place for the CPU to access instructions and data during execution. In the design architecture of this project, an instruction cache and a data cache are instantiated in the `Top` module. The `Core` uses them during the IF stage and the MEM stage. In the implementation of these instruction and data caches (i.e., `Instruction_SRamTop` and `Data_SRamTop`), the `*.v` files in `src/main/resources` are instantiated from Chisel using the BlackBox mechanism.
We also explain the code in more detail in code_explain.
Here, I will briefly introduce the code in `src/main/scala/components` from the `new_cache` branch.
Basic ALU (Arithmetic Logic Unit) operations, such as `and`, `or`, `add`, `sub`, `slt`, and so on.
This code defines a Chisel module named `AluControl`, which implements the control logic for an ALU (Arithmetic Logic Unit). It generates specific ALU control signals based on the input control signals (`aluOp`, `f7`, `f3`, `aluSrc`). The control signals determine the operation mode of the ALU.
The table below maps `aluOp`, `f3`, `f7`, and `aluSrc` to the output `io.out`:
`aluOp` | `f3` | `f7` | `aluSrc` | Operation Type | `io.out` |
---|---|---|---|---|---|
`0.U` | Any | Any | Any | Add (`ADD`) | `2.U` |
`2.U` | `0.U` | `0.U` | `false` | Add (`ADD`) | `2.U` |
`2.U` | `0.U` | `1.U` | `true` | Subtract (`SUB`) | `3.U` |
`2.U` | `1.U` | Any | Any | Shift Left (`SLL`) | `6.U` |
`2.U` | `2.U` | Any | Any | Set Less Than (`SLT`) | `4.U` |
`2.U` | `3.U` | Any | Any | Set Less Than Unsigned (`SLTU`) | `5.U` |
`2.U` | `5.U` | `0.U` | Any | Logical Right Shift (`SRL`) | `7.U` |
`2.U` | `5.U` | `1.U` | Any | Arithmetic Right Shift (`SRA`) | `8.U` |
`2.U` | `7.U` | Any | Any | Logical AND (`AND`) | `0.U` |
`2.U` | `6.U` | Any | Any | Logical OR (`OR`) | `1.U` |
`2.U` | `4.U` | Any | Any | Logical XOR (`XOR`) | `9.U` |
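To make the mapping easier to check, here is a small behavioral model in Python (not the project's Chisel code) that reproduces only the rows of the table above. The function name `alu_control` is hypothetical, and input combinations that the table does not list return `None` rather than guessing at the hardware's behavior.

```python
def alu_control(alu_op, f3, f7, alu_src):
    """Return the io.out code for the table rows; arguments mirror aluOp, f3, f7, aluSrc."""
    if alu_op == 0:
        return 2  # ADD, regardless of the other fields
    if alu_op != 2:
        return None  # not covered by the table
    if f3 == 0:
        if f7 == 0 and not alu_src:
            return 2  # ADD
        if f7 == 1 and alu_src:
            return 3  # SUB
        return None  # combination not listed in the table
    if f3 == 5:
        # Right shifts are distinguished by f7: 0 -> SRL, 1 -> SRA
        return {0: 7, 1: 8}.get(f7)
    # Remaining rows depend on f3 only: SLL, SLT, SLTU, XOR, OR, AND
    return {1: 6, 2: 4, 3: 5, 4: 9, 6: 1, 7: 0}.get(f3)
```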
This code implements a RISC-V processor's branch unit, which determines whether a branch should be taken based on the branch condition (`funct3`), the operands, and control signals.
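The branch decision can be sketched as a Python function. This is a behavioral model under stated assumptions, not the project's Chisel module: the `funct3` encodings are the standard RV32I ones (BEQ, BNE, BLT, BGE, BLTU, BGEU), and the `branch` argument stands in for the control signal that marks the instruction as a branch.

```python
def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def branch_taken(funct3, rs1, rs2, branch=True):
    """Return True when a branch instruction with this funct3 would be taken."""
    if not branch:
        return False  # not a branch instruction at all
    a_u, b_u = rs1 & 0xFFFFFFFF, rs2 & 0xFFFFFFFF
    a_s, b_s = to_signed32(rs1), to_signed32(rs2)
    conditions = {
        0b000: a_u == b_u,  # BEQ
        0b001: a_u != b_u,  # BNE
        0b100: a_s < b_s,   # BLT (signed)
        0b101: a_s >= b_s,  # BGE (signed)
        0b110: a_u < b_u,   # BLTU (unsigned)
        0b111: a_u >= b_u,  # BGEU (unsigned)
    }
    return conditions.get(funct3, False)
```

The signed and unsigned comparisons differ only in how the same 32-bit pattern is interpreted, which is why `0xFFFFFFFF` is less than 1 for BLT but not for BLTU.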
The function is to decode RISC-V 16-bit compressed instructions into their corresponding 32-bit standard instructions.
Here’s the table with C0-C3 instructions explained:
Instruction Type | Opcode (15-14/13-12) | Instruction Name | Description | Decoding/Explanation |
---|---|---|---|---|
C0 | `00` + `b00` | `c.addi4spn` | Stack pointer offset calculation | Offset calculation and addition to x2 (sp) |
C0 | `00` + `b01` | `c.lw` | Load data into register | `lw rd', imm(rs1')` |
C0 | `00` + `b11` | `c.sw` | Store data to memory | `sw rs2', imm(rs1')` |
C0 | `00` + `b10` | Illegal instruction | Unrecognized instruction, directly return | - |
C1 | `01` + `b000` | `c.addi`/`nop` | Add immediate or no operation | `addi rd, rd, imm` or `nop` |
C1 | `01` + `b001` | `c.jal` | Unconditional jump, save return address to x1 | `jal x1, imm` |
C1 | `01` + `b101` | `c.j` | Unconditional jump | `jal x0, imm` |
C1 | `01` + `b010` | `c.li` | Load immediate into register | `addi rd, x0, imm` |
C1 | `01` + `b011` | `c.lui`/`addi16sp` | Load upper immediate or specific stack operation | `lui rd, imm` or `addi x2, x2, imm` |
C1 | `01` + `b100` | Logical/Arithmetic | Shift, logical operation or subtraction | `srli`, `srai`, `andi`, `sub`, etc. |
C1 | `01` + `b110` | `c.beqz` | Branch if rs1' is zero | `beq rs1', x0, imm` |
C1 | `01` + `b111` | `c.bnez` | Branch if rs1' is not zero | `bne rs1', x0, imm` |
C2 | `10` + `b00` | `c.slli` | Logical left shift | `slli rd, rd, shamt` |
C2 | `10` + `b01` | `c.lwsp` | Load data from stack with offset | `lw rd, imm(x2)` |
C2 | `10` + `b10` | `c.mv`/`c.add`/`c.jr` etc. | Data move, addition, or jump | Depends on whether the register is zero |
C2 | `10` + `b11` | `c.swsp` | Store data to stack | `sw rs2, imm(x2)` |
C3 | `11` | Illegal instruction | Unrecognized instruction, directly return | - |
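As a worked example of one row, the sketch below expands `c.li` into `addi rd, x0, imm` in Python. It is a behavioral illustration rather than the project's decoder, and it assumes the standard RVC bit layout for `c.li` (quadrant bits [1:0] = `01`, funct3 bits [15:13] = `010`, imm[5] in bit 12, `rd` in bits [11:7], imm[4:0] in bits [6:2]). The helper names are hypothetical.

```python
def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def expand_c_li(halfword):
    """Expand a 16-bit c.li encoding into the 32-bit `addi rd, x0, imm` encoding."""
    if halfword & 0b11 != 0b01 or (halfword >> 13) & 0b111 != 0b010:
        raise ValueError("not a c.li encoding")
    rd = (halfword >> 7) & 0x1F
    imm6 = (((halfword >> 12) & 1) << 5) | ((halfword >> 2) & 0x1F)
    imm = sign_extend(imm6, 6)
    # I-type layout: imm[11:0] | rs1 (x0) | funct3 (000) | rd | opcode (0010011)
    return ((imm & 0xFFF) << 20) | (0 << 15) | (0b000 << 12) | (rd << 7) | 0b0010011
```

For instance, `0x4515` (`c.li a0, 5`) expands to `0x00500513` (`addi a0, x0, 5`).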
This code defines a case class named `Configs`, which stores and initializes the basic configuration parameters for a RISC-V core, such as the bit width (`XLEN`), whether the M and C instruction sets are enabled (`M` and `C`), and whether TRACE messages are enabled (`TRACE`).
This code defines an object called `ALUOps` that contains the operation codes (opcodes) for the Arithmetic Logic Unit (ALU).
ALUop | Opcode |
---|---|
ADD | 2 |
SUB | 6 |
AND | 0 |
OR | 1 |
XOR | 9 |
SLL | 3 |
SRL | 4 |
SRA | 5 |
SLT | 9 |
SLTU | 10 |
COPY | 11 |
This code implements a control unit based on instruction decoding, which generates the corresponding control signals by matching instruction bit patterns. It includes signals for ALU source selection (`aluSrc`), memory operation control (`memToReg`, `memRead`, `memWrite`), register write control (`regWrite`), branch decision (`branch`), jump instructions (`jump`), and ALU operation selection (`aluOp`, `aluSrc1`). These control signals guide the operation of the RISC-V processor based on the input instructions.
This code models the core logic of a RISC-V processor, processing instructions through a pipeline (IF, ID, EX, MEM, WB). It defines multiple registers to store data at each stage, along with control signals and computation results. Each stage handles a different part of the instruction flow: instruction fetch, decode, execution, memory access, and write-back. This pipelined design improves performance by overlapping the processing of successive instructions. Additionally, the core handles memory access requests and outputs `RVFI` (RISC-V Formal Interface) data for simulation and validation purposes.
This code defines an SRAM memory controller that handles read and write requests from external devices (such as a processor). It processes incoming `Decoupled` requests to determine whether the operation is a read or a write, and it controls the SRAM by adjusting the chip-select signal, read/write control, address, and data inputs. A status register (`validReg`) manages the validity of the responses, and either the read data or invalid data is passed back in the response.
This code implements an execution unit that handles the execution phase in a RISC-V processor pipeline. It performs arithmetic and logical operations (such as addition and subtraction) based on control signals (`func3`, `func7`). The code uses a forwarding unit to resolve data hazards and optionally supports multiplication and division operations. If multiplication/division functionality is enabled (`M` is set to true), it performs these operations based on control signals and manages pipeline stalls. The output includes the computation result (`ALUresult`) and the data to be written (`writeData`).
The Forwarding Unit handles data hazards in the processor pipeline. It determines the appropriate data source for execution by checking whether the source registers (`reg_rs1`, `reg_rs2`) match the destination register of the execution unit (`ex_reg_rd`) or the memory unit (`mem_reg_rd`), and then chooses the correct forwarding path.
- `io.forwardA` and `io.forwardB` are output signals that indicate which data source should be forwarded, either from the execution unit (EX) or the memory unit (MEM).
- `io.reg_rs1` and `io.reg_rs2` are the source registers used in branch operations.
- `io.ex_reg_rd` represents the destination register of the execution unit, and `io.mem_reg_rd` represents the destination register of the memory unit.
- `io.ex_regWrite` and `io.mem_regWrite` indicate whether the execution unit and memory unit are writing back to their destination registers.

This code's role is to provide the appropriate data forwarding path to the execution unit (ALU) to resolve data hazards caused by operations that have not yet completed, such as memory accesses.
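The selection logic can be modeled in a few lines of Python. This is a sketch of the classic forwarding rule, not a transcription of the project's Chisel code: the numeric codes `NO_FORWARD`, `FROM_EX`, and `FROM_MEM` are hypothetical, and the sketch assumes that register `x0` is never forwarded and that the EX result takes priority over the MEM result because it is the more recent value.

```python
NO_FORWARD, FROM_EX, FROM_MEM = 0, 1, 2  # hypothetical encodings, for illustration only


def select_source(rs, ex_reg_rd, mem_reg_rd, ex_reg_write, mem_reg_write):
    """Choose the forwarding source for one source register."""
    if ex_reg_write and ex_reg_rd != 0 and ex_reg_rd == rs:
        return FROM_EX   # most recent producer wins
    if mem_reg_write and mem_reg_rd != 0 and mem_reg_rd == rs:
        return FROM_MEM
    return NO_FORWARD    # read the value from the register file


def forwarding_unit(reg_rs1, reg_rs2, ex_reg_rd, mem_reg_rd, ex_reg_write, mem_reg_write):
    """Return (forwardA, forwardB) for the two source registers."""
    forward_a = select_source(reg_rs1, ex_reg_rd, mem_reg_rd, ex_reg_write, mem_reg_write)
    forward_b = select_source(reg_rs2, ex_reg_rd, mem_reg_rd, ex_reg_write, mem_reg_write)
    return forward_a, forward_b
```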
This code is responsible for detecting data hazards and controlling the flow of the pipeline in a processor. The Hazard Unit checks if the current instruction (mainly memory access and branch instructions) causes pipeline hazards and adjusts the control signals accordingly.
- load-use hazard: This part of the code checks for a data hazard in which the ID stage has a memory read operation and a source register (`id_rs1` or `id_rs2`) matches the destination register (`id_ex_rd`).
  - When this is detected, the pipeline is stalled through the control signals (`io.ctl_mux`, `io.pc_write`, `io.if_reg_write`), preventing incorrect instruction execution.
- branch hazard: This part of the code detects whether a branch hazard exists. If the `taken` signal is true or `jump` is not zero, the `IF/ID` stage needs to be cleared to avoid executing the wrong instruction.
  - In that case, `io.ifid_flush` is set to `true.B`, which signals the need to clear the pipeline.

This code extracts and extends the appropriate immediate value from a given 32-bit instruction based on its instruction type (I-type, U-type, S-type, SB-type, UJ-type). The code determines the immediate type using the OPCODE part of the instruction and then combines the relevant bits using the `Cat` function. The extended immediate value is output to `io.out`, which is used in subsequent arithmetic operations, address calculations, or branch decisions during instruction execution.
This code handles control logic, pipeline hazards, immediate value processing, register data read/write, and branch/jump calculations, all of which are needed to decode an instruction and pass it to the next stage.
It primarily performs the following functions:
This code implements an instruction fetch module that retrieves a 32-bit instruction from memory at a specified address. It interacts with the memory using a handshake protocol, sending read requests and receiving instruction responses. The module supports reset and stall mechanisms to ensure request operations are paused when the pipeline is stalled or in a reset state.
This code defines a Chisel SRAM controller module named `Instruction_SRamTop`, designed specifically for instruction access. It uses the BlackBox module `instructioncache_sramTop` to simulate SRAM behavior. The functionality and logic are similar to `Data_SRamTop`: it drives SRAM signals such as the address and write mask based on external read or write requests, enabling instruction read or write operations. The module optionally supports loading an initialization file into memory and sends the processed results back to the response port.
The table below explains the code:
Condition (`func7`) | Output (`jump`) | Description |
---|---|---|
`1101111` (JAL) | `2` | Indicates a JAL instruction, unconditional jump. |
`1100111` (JALR) | `3` | Indicates a JALR instruction, register-based unconditional jump. |
Other values | `0` | Indicates a non-jump instruction, no jump occurs. |
This code implements a multi-functional integer multiplication and division unit (MDU), capable of performing multiplication, division, and modulus operations.
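- Inputs: `src_a`, `src_b` (32-bit numbers), `op` (operation type), `valid`.
- Outputs: `ready` (whether ready), `output.bits` (operation result).
- Depending on `op`, different multiplication operations are performed.
- Division and remainder operations are likewise selected by `op`.
- Depending on `op`, it selects whether to output the multiplication result, the division quotient, or the remainder.

Nothing here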
The `MemRequestIO` and `MemResponseIO` classes define the request and response interfaces for memory operations, facilitating communication between hardware modules such as CPU cores and memory or peripherals.
This code implements a Memory Fetch module with the following functionality:
- Write Operations: handled according to `funct3`, with the valid data bits configured accordingly.
- Read Operations: handled according to `funct3` and the address offset.
- Address and Data Formatting
- Request Management
- Debugging Support
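How `funct3` and the address offset interact can be illustrated with a Python model of byte-lane selection. This is a sketch under stated assumptions rather than the project's implementation: it uses the standard RV32I `funct3` encodings for loads and stores, treats memory as 32-bit little-endian words, and rejects misaligned accesses, which the actual module may handle differently. The function names are hypothetical.

```python
def store_byte_mask(funct3, address):
    """Return a 4-bit mask of the byte lanes written by a store (bit 0 = lowest byte)."""
    offset = address & 0b11
    if funct3 == 0b000:                      # SB: one byte lane
        return 0b0001 << offset
    if funct3 == 0b001 and offset % 2 == 0:  # SH: two adjacent lanes
        return 0b0011 << offset
    if funct3 == 0b010 and offset == 0:      # SW: all four lanes
        return 0b1111
    raise ValueError("unsupported or misaligned store")


def load_extract(funct3, address, word):
    """Pick the addressed bytes out of a 32-bit word and extend them to 32 bits."""
    shifted = (word & 0xFFFFFFFF) >> (8 * (address & 0b11))
    if funct3 == 0b000:  # LB: sign-extend 8 bits
        value = shifted & 0xFF
        return (value | 0xFFFFFF00) if value & 0x80 else value
    if funct3 == 0b001:  # LH: sign-extend 16 bits
        value = shifted & 0xFFFF
        return (value | 0xFFFF0000) if value & 0x8000 else value
    if funct3 == 0b010:  # LW: full word
        return shifted & 0xFFFFFFFF
    if funct3 == 0b100:  # LBU: zero-extend 8 bits
        return shifted & 0xFF
    if funct3 == 0b101:  # LHU: zero-extend 16 bits
        return shifted & 0xFFFF
    raise ValueError("unsupported load funct3")
```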
This code implements a Program Counter (PC) module, which manages the current program address in a processor and calculates the next possible address (incremented by 4 or 2) based on a condition (whether halted), supporting program flow control and instruction address generation.
This code implements a `Realigner` module that handles misaligned instruction addresses. In the processor pipeline, if the instruction address is misaligned (i.e., it is not word-aligned), the module uses a state machine to process the instruction in two steps: first, it stores the upper 16 bits of the current instruction word, halts the PC for one cycle, and outputs a NOP instruction to the core; then, in the next cycle, it concatenates the saved upper 16 bits with the new lower 16 bits to form an aligned instruction, which is then sent to the core. If the address is already aligned, the module passes the instruction through directly.
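The data path of the two-step process can be sketched in Python. This models only what the core receives, not the Chisel state machine itself, and it assumes little-endian ordering, so the saved upper half of the first word becomes the lower half of the realigned instruction. The function name is hypothetical; the NOP encoding is the standard `addi x0, x0, 0`.

```python
NOP = 0x00000013  # addi x0, x0, 0


def realign(current_word, next_word):
    """Return the instructions sent to the core over the two cycles of a misaligned fetch."""
    # Cycle 1: save the upper 16 bits of the current word and send a NOP to the core.
    saved_upper = (current_word >> 16) & 0xFFFF
    # Cycle 2: the lower 16 bits of the next word complete the instruction.
    instruction = ((next_word & 0xFFFF) << 16) | saved_upper
    return [NOP, instruction]
```

For example, if the first word holds `0x0513` in its upper half and the next word holds `0x0050` in its lower half, the core receives a NOP followed by `0x00500513`.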
This code implements a simple 32-bit register file module for handling register read and write operations in a processor. It supports two read ports, allowing data to be read from two registers simultaneously, and can write data to a specified register based on the input write address.
The `Top` module serves as the integration point for the various components in the processor system, such as the core, instruction memory, data memory, and trace functionality. It coordinates the interaction between the core and memory modules, ensuring that instructions and data are correctly exchanged between them. Additionally, when tracing is enabled in the configuration, the `Top` module sends the core's execution details to the trace module for monitoring and debugging purposes.
This code defines a module for handling read and write operations to SRAM. The `SRamTop` module acts as an interface for memory requests and responses, working with the `sram_top` module to perform the actual read and write operations. When a read request is received, `SRamTop` enables the SRAM, reads the data, and returns it. For a write request, it writes the data into SRAM. The module uses control signals such as `csb_i`, `we_i`, and `wmask_i` to manage memory operations. The `sram_top` module is implemented as a black box, interacting with an external Verilog memory model and optionally loading a program file to initialize the memory content.
The `new_SRamTop` module handles memory requests by managing data read and write operations. It interfaces with a custom SRAM model (`new_sramTop`) and performs read and write actions based on the incoming requests, which include an address and data. The module uses a handshake mechanism with `Decoupled` signals and controls the SRAM using signals like `csb_i`, `we_i`, and `addr_i`. It also manages the response-valid and request-ready signals so that both read and write operations complete correctly.
The memory request and response are achieved using the Chisel Decoupled technique. In Chisel, the decoupled technique is used to separate components and allow them to operate independently, especially when designing systems with multiple modules.
In a decoupled system, two signals need to be set: the `valid` signal and the `ready` signal.
- The `sender` is responsible for setting the `valid` signal to indicate that data is available to be sent.
- The `receiver` is responsible for setting the `ready` signal to indicate that it is ready to accept the data.

As shown in the figure below, there are two directions of communication. In either direction, the sender is responsible for setting the `valid` signal, while the receiver is responsible for setting the `ready` signal.
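The handshake rule can be stated compactly: a transfer happens only in a cycle where `valid` and `ready` are both high (Chisel's `DecoupledIO` exposes this condition as `fire`). The Python sketch below replays a per-cycle trace and collects the data that actually transferred. The trace format and the function name are hypothetical, chosen only for illustration.

```python
def transferred_data(trace):
    """Return the payloads of the cycles in which the handshake fired.

    `trace` is a list of (valid, ready, bits) tuples, one per clock cycle.
    """
    received = []
    for valid, ready, bits in trace:
        if valid and ready:  # the transfer fires only when both sides agree
            received.append(bits)
    return received


# The sender holds 0x10 for two cycles until the receiver becomes ready,
# idles for one cycle, then sends 0x14 to a receiver that is already ready.
example_trace = [
    (True, False, 0x10),
    (True, True, 0x10),
    (False, True, 0x00),
    (True, True, 0x14),
]
```

With `example_trace`, only the second and fourth cycles transfer data, so the receiver sees `0x10` and then `0x14`.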
Here, the entity requesting memory could be the CPU or a test program.
Take `InstructionFetch` as an example. To request memory, the MemResponseIO instance (coreInstrResp) sets `ready := true` (line 12). Similarly, the MemRequestIO instance (coreInstrReq) has its valid signal set to true (line 19), unless a reset or stall condition occurs.
The `MemRequestIO` and `MemResponseIO` described here are defined in `MemIO.scala`.
The `addrRequest` signal specifies the memory location to be accessed, and the `isWrite` signal determines whether the operation is a write or a read.
In a write operation, `dataRequest` contains the data to be written to the cache. For a read, `isWrite` is set to `false`, and `dataRequest` is treated as DontCare (see `InstructionFetch` lines 15~16). The `MemResponseIO` carries the data being read in the `dataResponse` signal.
Below is the content of `CacheSRAMTests` after modifying the test from `new_SRamTop` to `Instruction_SRamTop`. Note that the programs for `Instruction_SRamTop` and `Data_SRamTop` are basically the same, differing only in name.
In this test program, data is written and then read from the corresponding address to verify that the cache works correctly and can access previously stored data.
Here is a code snippet from `Instruction_SRamTop`, where two printf lines are added to observe whether the operation is a read or a write.
During `CacheSRAMTests`, both read and write requests can be observed.
However, in `TopTest`, only read requests are ever shown.
Since this project lacks a memory hierarchy, every request results in a miss. Moreover, the CPU's loading of the program file and data file does not seem to be fully implemented, and the CPU is unable to execute instructions from a file I provide.
Therefore, a more feasible approach for us seems to be focusing first on improving and testing only the instruction fetch part.
First, I want to know whether a cache access is a hit or a miss, so I added an output `hit` to `MemResponseIO`. The other parts remain unchanged, keeping the I/O interface as close to the original as possible. Once the hit or miss is known, further actions can follow, such as handling a cache miss.
I am attempting to write another cache without using Verilog black boxes. Below is our attempt at implementing a simple direct-mapped cache. Each cache line contains a valid bit, a tag, and data.
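The lookup rule of a direct-mapped cache can be summarized with a small Python model. This is a behavioral sketch, not the Chisel implementation described here: the class name is hypothetical, and the geometry (16 lines of one 32-bit word each) is an assumption made for illustration. A read hits only when the indexed line is valid and its stored tag matches the tag of the requested address.

```python
class DirectMappedCache:
    """Behavioral model: each line holds a valid bit, a tag, and one 32-bit word."""

    def __init__(self, num_lines=16):  # assumed geometry, for illustration only
        self.num_lines = num_lines
        self.valid = [False] * num_lines
        self.tag = [0] * num_lines
        self.data = [0] * num_lines

    def _split(self, address):
        word = address >> 2  # drop the byte offset within the word
        return word % self.num_lines, word // self.num_lines  # (index, tag)

    def read(self, address):
        """Return (hit, data); data is 0 on a miss."""
        index, tag = self._split(address)
        hit = self.valid[index] and self.tag[index] == tag
        return hit, (self.data[index] if hit else 0)

    def write(self, address, value):
        """Store a word, replacing whatever the indexed line held before."""
        index, tag = self._split(address)
        self.valid[index] = True
        self.tag[index] = tag
        self.data[index] = value & 0xFFFFFFFF
```

Because the index is taken from the low address bits, two addresses that differ only in their tag bits compete for the same line.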
Since this implementation maintains an almost identical interface to the original `Instruction_SRamTop`, `CacheSRAMTests` can continue to be used to test it. Additionally, since a `hit` signal has been added to `MemResponseIO`, the test can now check the expected hit output, as shown in lines 26, 38, and 44.
The output looks like this. During execution, it shows whether each access is a hit or a miss. The number of output lines varies with `clock.step`.
As previously described, the CPU only performs reads during each cycle and does not attempt to write missing data into the cache when a cache miss occurs.
After creating the new cache, I started focusing on whether the CPU is truly utilizing the cache. Specifically, I am examining whether the CPU correctly handles cache misses and fetches data from the next level, like the behavior in the IF (Instruction Fetch) stage.
Based on the original Instruction Fetch process, I made some modifications so that when a miss occurs, the data will be placed into the cache, so that there won't be endless compulsory misses during a cold start.
The new module is similar to the original Instruction Fetch program, but when a cache miss occurs, it accesses the next level of the memory hierarchy to load the data at the requested address.
If a cache miss occurs, the condition on `line 41` is met, and the data is retrieved from the next level and placed into the cache.
Since there is no memory hierarchy in this system, I use a text file as a substitute for the lower-level memory. When a cache miss occurs, the program reads this text file to get the data for the corresponding address (`Line 43~54`).
Copy the `addr_x=data` column (excluding the heading) into an `address_data.txt` file and place it in the root directory of the project.
We also created a test for InstructionFetch, which passes successfully. Initially, it reads data from addresses 0 and 28, resulting in compulsory misses (`line 11~14`).
Since a text file is used as the next-level storage, when a miss occurs, data can be retrieved from it and loaded into the cache. On the second read request, a cache hit occurs (`line 16~28`).
When new data (address 92) is placed into the same cache line, the previous data (address 28) is evicted from the cache. So, when address 28 is read again, it results in a conflict miss (`line 30 to 45`).
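The access sequence in this test can be replayed with a short Python simulation. This is a sketch under assumptions, not the project's test: it assumes 16 one-word lines (so that addresses 28 and 92 map to the same line), uses a dictionary with made-up contents in place of the text file, and labels any miss on a previously fetched address as a conflict miss, which is a simplification.

```python
NUM_LINES = 16  # assumed cache geometry: 16 lines of one 32-bit word


def run_fetches(addresses, backing):
    """Replay instruction fetches and classify each one as a hit or a kind of miss."""
    lines = {}    # index -> (tag, data)
    seen = set()  # word addresses fetched at least once before
    log = []
    for address in addresses:
        word = address >> 2
        index, tag = word % NUM_LINES, word // NUM_LINES
        if index in lines and lines[index][0] == tag:
            outcome = "hit"
        else:
            # First reference is a compulsory miss; a repeat miss means the line was evicted.
            outcome = "conflict miss" if word in seen else "compulsory miss"
            lines[index] = (tag, backing[address])  # refill from the next level
        seen.add(word)
        log.append((address, outcome))
    return log


# Hypothetical next-level contents standing in for address_data.txt.
backing_store = {0: 0x00000013, 28: 0x00500513, 92: 0x00A12423}
results = run_fetches([0, 28, 0, 28, 92, 28], backing_store)
```

Address 92 is word 23, and 23 modulo 16 is 7, the same index as address 28 (word 7), so loading it evicts the line that held address 28.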
In this project, we focused on building the Chisel-based CPU NucleusRV, with an emphasis on understanding its memory-related components. While the instruction cache has been implemented, the data cache remains to be developed, which will likely require a more complex cache design.
Additionally, the cache hierarchy has not yet been fully implemented, and there is room for further refinement in this area.
Another area for future work is improving the CPU's ability to read program and data files, which currently has some limitations. There is also potential for improving the CPU's compatibility so that it integrates more easily with external systems.