# Implement a Cache System for srv32
> 陳榮昶, 林晉德
[GitHub](https://github.com/cjc525/srv32-master)
# Introduction
This project involves enhancing the SRV32 lightweight RISC-V processor core, developed in Verilog, by integrating a cache system. The aim is to improve memory access performance and implement classic cache replacement mechanisms.
This document provides a guide for installing and testing SRV32 on Ubuntu systems and outlines the steps to achieve the project goals.
# Tools Installation Steps
## System Requirements
* Operating System: Ubuntu (tested on Ubuntu 24.04.1 LTS)
* Basic development tools
* RISC-V GNU toolchain
* Verilator
## Install the RISC-V Toolchain
### 1. Install Basic Packages
First, install the required development packages:
```bash
sudo apt-get install autoconf automake autotools-dev curl libmpc-dev \
libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo \
gperf libtool patchutils bc zlib1g-dev git libexpat1-dev python3 \
python-is-python3 lcov
```
### 2. Install Verilator
```bash
sudo apt install verilator
```
### 3. RISC-V Toolchain Installation
There are two methods to install the RISC-V toolchain:
#### Method 1: Using Pre-compiled xPack GNU RISC-V Tools (Recommended)
1. Download xPack RISC-V toolchain
- Visit https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases
- Choose the appropriate version for your system
```bash
# For Linux x64 systems:
wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v14.2.0-3/xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
```
2. Extract and configure
```bash
tar xf xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
sudo mv xpack-riscv-none-elf-gcc-14.2.0-3 /opt/riscv
```
3. Set environment variables
```bash
echo 'export PATH="/opt/riscv/bin:$PATH"' >> ~/.bashrc
echo 'export CROSS_COMPILE=riscv-none-elf-' >> ~/.bashrc
echo 'export EXTRA_CFLAGS="-misa-spec=2.2 -march=rv32im"' >> ~/.bashrc
source ~/.bashrc
```
#### Method 2: Compiling from Source (Time-consuming)
```bash
git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain
mkdir build
cd build
../configure --prefix=/opt/riscv \
--with-isa-spec=20191213 \
--with-multilib-generator="\
rv32im_zicsr-ilp32--;\
rv32imac_zicsr-ilp32--;\
rv32im_zicsr_zba_zbb_zbc_zbs-ilp32--;\
rv32imac_zicsr_zba_zbb_zbc_zbs-ilp32--;\
rv32em_zicsr-ilp32e--;\
rv32emac_zicsr-ilp32e--"
make -j$(nproc)
```
### Common Issues and Solutions
#### 1. Toolchain Version Mismatch
**Problem**:
Error "cannot execute binary file: Exec format error" when running `make tests-all`.
**Solution**:
Ensure you download the correct toolchain version for your system:
- Use `uname -m` to check system architecture
- Choose the appropriate version:
- x86_64: select linux-x64 version
- aarch64: select linux-arm64 version
#### 2. Compilation Errors
**Problem**:
Errors related to "la x0,5b" during compilation.
**Solution**:
1. Clear previous environment variables:
```bash
unset EXTRA_CFLAGS
unset CROSS_COMPILE
```
2. Set correct environment variables:
```bash
export CROSS_COMPILE=riscv-none-elf-
export EXTRA_CFLAGS="-misa-spec=2.2"
```
3. Clean and recompile:
```bash
make distclean
make tests-all
```
## Installation Verification
```bash
# Ubuntu package needed to run the RTL simulation
sudo apt install verilator
# Either edit sim/Makefile and set "verilator ?= 1":
vim sim/Makefile
# ...or pass the option on the command line:
make verilator=1
```
### Basic Function Testing
```bash
# Test Hello World program
make hello
# Run Dhrystone benchmark
make dhrystone
```
### Comprehensive Testing
```bash
# Run all tests
make tests-all
```
# Cache Architecture & Specification
### Srv32 Architecture

The figure above shows the srv32 architecture. It has two separate memory systems: an instruction memory and a data memory. We therefore add two Level 1 caches to the CPU, an Instruction Cache and a Data Cache. Both use 2-way set-associative mapping, and the replacement policy we chose is LRU (Least Recently Used). The resulting system architecture is shown in the figure below.
### System Architecture

In this project, the cache I/O is based on the srv32 CPU interface, with some modifications. Because of the 2-way associative mapping, the cache design includes two banks, each containing 32 blocks. Each block has its own tag and valid bit, so the cache keeps a 64-bit valid signal internally, one bit for each of the 32 blocks in each bank. The following diagram illustrates the hardware architecture of the cache for this project.
### Cache Architecture

### Cache Specification
- Cache size: 1 KB
- 2-way set-associative mapping
- Cache replacement policy
  - LRU (Least Recently Used)
- Block size (cache line size)
  - 16 bytes == 128 bits (4 words)
- Entries: 32
- Tag Array Structure

- Data Array Structure

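Given this specification (1 KB, 2 ways, 32 entries, 16-byte lines), a 32-bit address decomposes into a 23-bit tag, a 5-bit index, and a 4-bit byte offset. The Python sketch below is a behavioral illustration of that split; the field positions follow the specification above, but the function name is ours, not part of the RTL.

```python
# Behavioral sketch (not the RTL): splitting a 32-bit address into the
# tag / index / offset fields implied by the specification above
# (16-byte lines -> 4 offset bits, 32 entries -> 5 index bits).

def split_address(addr: int):
    """Return (tag, index, word_select, byte_offset) for a 32-bit address."""
    byte_offset = addr & 0xF          # bits [3:0], position within the 16-byte line
    word_select = (addr >> 2) & 0x3   # bits [3:2], which of the 4 words
    index       = (addr >> 4) & 0x1F  # bits [8:4], one of 32 sets
    tag         = addr >> 9           # bits [31:9], stored in the tag array
    return tag, index, word_select, byte_offset

tag, index, word, byte_off = split_address(0x0000_1234)
```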
# Cache controller & state diagram
### Instruction Cache

The figure above shows the state diagram of the Instruction Cache controller designed for this project. On the rst signal, the controller enters the IDLE state. When the CPU issues an access request to the instruction memory, the controller transitions to the READ_HIT state to check whether the accessed address matches the tag stored in the cache. If the tag matches and the valid bit of the corresponding block is 1, the access is a hit; otherwise, it is a miss. This state lasts only one cycle.
On a hit, the value held in the cache is returned to the CPU. Each cache block is 128 bits, so the word returned to the CPU is selected by address[3:2]. On a miss, the controller transitions to the READ_AXI state.
On a read miss, the controller follows a Read Allocate policy: it fetches the requested word from the SRAM along with the other three words of the line (since the block size is 128 bits). Once all four words have been written into the cache block in sequence, the read is complete and the controller transitions to the READ_DONE state.
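The read-allocate behavior described above can be sketched in a few lines of Python. This is a behavioral model only, assuming a simple word-addressed backing store; the real controller drives the handshake shown in the state diagram.

```python
# Behavioral sketch of the Read Allocate policy described above
# (assumed word-addressed backing store; not the actual RTL interface).

def refill_line(memory, addr):
    """On a read miss, fetch all 4 words of the 16-byte line containing addr."""
    line_base = addr & ~0xF                               # align to the 16-byte line
    line = [memory[line_base + 4 * i] for i in range(4)]  # 4 sequential word reads
    return line_base, line

mem = {a: a ^ 0xDEAD for a in range(0, 64, 4)}  # toy word-addressed memory
base, line = refill_line(mem, 0x1C)             # miss on the word at 0x1C
```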
### Data Cache

The figure above illustrates the state machine of the Data Cache controller designed for this project. The design is broadly similar to that of the Instruction Cache; the main difference is that the Data Cache must handle both read and write operations, whereas the Instruction Cache only handles reads. The Data Cache state machine therefore includes two additional states, WRITE_HIT and WRITE_CACHE, to manage writes to the data memory.
Initially, the system remains in the IDLE state, waiting for the CPU to issue a core_req. Upon receiving a core_req, the system uses the core_write signal to determine whether the operation is a write. If so, it transitions to the WRITE_HIT state to check for a Write Hit. On a Write Miss, the system notifies the CPU to update the value in the data memory, leaving the cache unchanged. It then returns to the IDLE state to await the next access request, ensuring that the data is successfully written before the CPU proceeds.
On a Write Hit, the system transitions to the WRITE_CACHE state, following the write-through policy: the data written to the data memory is also written to the corresponding block in the cache. The system then moves to the DONE state before returning to IDLE.
On the other hand, if the core_write signal is low in the IDLE state, the CPU intends to read from the data memory, and the flow is similar to that of the Instruction Cache. The system transitions to the READ_HIT state to check for a Read Hit. On a hit, it returns to IDLE while outputting the cached data to the CPU. On a miss, it transitions to the READ_AXI state to fetch the line from the data memory via the CPU wrapper. After receiving all four words, the system transitions to DONE and then returns to IDLE to await the next request.
# Cache Controller Implementation
## L1C_inst.v
This module implements the control logic for the L1 Instruction Cache. Since the instruction cache only handles read operations, it is relatively simple:
```verilog=
reg [1:0] cstate, nstate;
localparam IDLE = 2'd0;
localparam READ_HIT = 2'd1;
localparam READ_AXI = 2'd2;
localparam READ_DONE = 2'd3;
```
## Main control logic:
### 1. IDLE State
```verilog=
IDLE: begin
    if (core_req) nstate = READ_HIT;
    else          nstate = IDLE;
end
```
- Wait for CPU request
### 2. READ_HIT State
```verilog=
READ_HIT: begin
    if (hit) nstate = IDLE;     // cache hit
    else     nstate = READ_AXI; // cache miss
end
```
- Perform cache lookup
- Check for hit (`hit0 || hit1`)
- Return data directly if hit, otherwise go to external memory read state
### 3. READ_AXI State
```verilog=
READ_AXI: begin
    if (sram_counter == 2'd3 && axi_ready)
        nstate = READ_DONE;
    else
        nstate = READ_AXI;
end
```
- Read data from external memory
- Each read transfers 32 bits, needs 4 transfers to fill a cache line
- Use `sram_counter` to track progress
## L1C_data.v
The data cache control logic is more complex since it needs to handle both read and write operations:
```verilog=
reg [2:0] cstate, nstate;
localparam IDLE = 3'd0;
localparam READ_HIT = 3'd1;
localparam READ_AXI = 3'd2; // Read from external memory
localparam WRITE_HIT = 3'd3;
localparam WRITE_CACHE = 3'd4; // Write to cache
localparam DONE = 3'd5;
```
## Main control logic:
### 1. Read Operation
```verilog=
READ_HIT: begin
    if (hit) begin
        case (core_addr_reg[3:2])
            2'b00: core_out = DA_out[31:0];
            2'b01: core_out = DA_out[63:32];
            2'b10: core_out = DA_out[95:64];
            2'b11: core_out = DA_out[127:96];
        endcase
    end
end
```
- Select corresponding 32-bit data based on address when hit
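The effect of this case statement can be expressed behaviorally in Python (an illustration only, not part of the design):

```python
# Behavioral equivalent of the Verilog case statement above: pick one
# 32-bit word out of a 128-bit cache line using address bits [3:2].

def select_word(line128: int, addr: int) -> int:
    word = (addr >> 2) & 0x3                     # core_addr_reg[3:2]
    return (line128 >> (32 * word)) & 0xFFFF_FFFF

line = 0xDDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA     # 4 words packed into a line
```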
### 2. Write Operation
```verilog=
WRITE_CACHE: begin
    case (core_addr_reg[3:2])
        2'b00: begin // write the first word of the line
            DA_write = {{12{1'b1}}, 4'b0000};
            DA_in    = {96'd0, core_in};
        end
        2'b01: begin // write the second word
            DA_write = {{8{1'b1}}, 4'b0000, {4{1'b1}}};
            DA_in    = {64'd0, core_in, 32'd0};
        end
        2'b10: begin // write the third word
            DA_write = {{4{1'b1}}, 4'b0000, {8{1'b1}}};
            DA_in    = {32'd0, core_in, 64'd0};
        end
        2'b11: begin // write the fourth word
            DA_write = {4'b0000, {12{1'b1}}};
            DA_in    = {core_in, 96'd0};
        end
    endcase
end
```
- Write Through: Write to both cache and external memory
- Write No-allocate: Don't load block into cache on miss
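These two policies together can be modeled in a few lines of Python. This is a behavioral sketch with a simplified single-way cache dictionary; names such as `write_word` are ours, not the RTL's.

```python
# Behavioral sketch of the write path described above: write-through
# (memory is always updated) plus write-no-allocate (a write miss does
# not fill the cache). `cache` maps line_base -> list of 4 words; this
# simplifies away the 2-way structure, for illustration only.

def write_word(cache, memory, addr, data):
    memory[addr] = data                     # write-through: memory always updated
    line_base = addr & ~0xF
    if line_base in cache:                  # write hit: also update the cached copy
        cache[line_base][(addr >> 2) & 0x3] = data
    # write miss: no allocation -- the cache is left untouched

cache = {0x10: [0, 0, 0, 0]}
memory = {}
write_word(cache, memory, 0x14, 0xBEEF)    # hit in line 0x10
write_word(cache, memory, 0x24, 0xCAFE)    # miss: memory only
```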
## LRU Replacement Strategy Implementation
In both L1C_data.v and L1C_inst.v, LRU is implemented similarly:
```verilog
// LRU buffer declaration (one LRU bit needed for each index)
reg lru_buffer[31:0]; // 0->set0; 1->set1
```
LRU bit update logic has several cases:
### 1. Reset State:
```verilog=
integer i;
always @(posedge clk or negedge resetb) begin
    if (!resetb) begin
        for (i = 0; i < 32; i = i + 1) begin
            lru_buffer[i] <= 1'b0;
        end
    end
end
```
- Initialize all LRU bits to 0 on system reset
### 2. Update on Read Hit:
```verilog=
READ_HIT: begin
    if (hit) begin
        // Set to 0 for a set0 hit, 1 for a set1 hit
        lru_buffer[core_addr_reg[8:4]] <= (hit0) ? 1'b0 : 1'b1;
    end
end
```
- `hit0`: Indicates hit on set 0
- `hit1`: Indicates hit on set 1
- Records which set was last accessed
### 3. Update on Read Miss:
```verilog=
READ_AXI: begin
    if (TA_write == 2'b10) begin
        lru_buffer[core_addr_reg[8:4]] <= 1'b0;
    end
    else if (TA_write == 2'b01) begin
        lru_buffer[core_addr_reg[8:4]] <= 1'b1;
    end
end
```
- Update LRU bit based on which set is written to
### 4. Replacement Decision:
```verilog=
// Decide which set to replace based on LRU
if (!valid_reg[{1'b0, core_addr_reg[8:4]}]) begin
    DA_wen = 2'b10; // write to set0 if it is invalid
end
else if (!valid_reg[{1'b1, core_addr_reg[8:4]}]) begin
    DA_wen = 2'b01; // write to set1 if it is invalid
end
else if (lru_buffer[core_addr_reg[8:4]] == 1'b0) begin
    DA_wen = 2'b01; // both valid: evict set1 (set0 was used last)
end
else begin
    DA_wen = 2'b10; // both valid: evict set0 (set1 was used last)
end
```
Replacement priority order:
1. Use invalid cache line if available
2. If both sets are valid, choose the least recently used set based on LRU bit
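This priority order reduces to a small decision function, modeled below in Python; here `lru_bit` stores the most recently used way, matching how `lru_buffer` is updated in the RTL.

```python
# Behavioral sketch of the replacement priority above: prefer an invalid
# way; otherwise evict the way that was NOT used most recently.

def pick_victim(valid0: bool, valid1: bool, lru_bit: int) -> int:
    if not valid0:
        return 0            # way 0 is free
    if not valid1:
        return 1            # way 1 is free
    return 1 - lru_bit      # both valid: evict the least recently used way
```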
# Results
## 1. Cache Operation Verification through Waveform Analysis
### 1.1 Instruction Cache Operations
#### A. Read Miss Scenario

The waveform demonstrates an L1 Instruction Cache read-miss case. During the READ_HIT state, the low hit signal indicates that the requested data is not present in the cache block, which triggers a fetch from the instruction memory. Following the cache's Read Allocate policy, the fetch retrieves the requested word along with the other three words of the line to fill the corresponding cache block.

The READ_AXI state communicates with the CPU wrapper to retrieve data from the SRAM. The process completes only after receiving four words, each accompanied by an axi_ready signal. The core_wait signal then informs the CPU that it can proceed with the next request. Notably, an lru_buffer tracks which bank was most recently accessed in each set, supporting efficient replacement decisions.
#### B. Read Hit Scenario

The waveform shows multiple Instruction Cache Read Hit cases. In the READ_HIT state, the high hit signal confirms a cache hit. The requested data is then output to the CPU through the core_out port. The system returns to IDLE state, signaling the CPU's readiness for the next request.
### 1.2 Data Cache Operations
#### A. Write Miss Scenario

In the Data Cache write-miss case, identified by the low hit signal, the controller leaves the WRITE_HIT state and directly signals the CPU, so the write-through policy updates the data memory only, with no cache allocation.
#### B. Write Hit Scenario

The Data Cache Write Hit waveform shows a high hit signal. Upon detection in the WRITE_HIT state, the cache transitions to WRITE_CACHE state, where it writes the CPU's data to the corresponding Cache Block.
# Future Work
## Cache Performance Analysis
- Add hit rate analysis
* Track cache hits and misses
* Monitor hit rates for different workloads
* Analyze miss rate patterns
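One way to start on this before touching the RTL is a software model. The Python sketch below (assuming the parameters specified in this document: 32 sets, 2 ways, 16-byte lines, LRU) replays an address trace and reports the hit rate; the class and trace are illustrative, not derived from actual srv32 runs.

```python
# Software prototype of the planned hit-rate analysis: a 2-way, 32-set,
# 16-byte-line LRU cache model replaying an address trace.

class TwoWayLRUCache:
    def __init__(self, sets=32):
        self.tags = [[None, None] for _ in range(sets)]  # tag per way, per set
        self.mru  = [0] * sets                           # most recently used way
        self.hits = self.misses = 0
        self.sets = sets

    def access(self, addr):
        index = (addr >> 4) % self.sets
        tag = addr >> 9
        ways = self.tags[index]
        if tag in ways:                   # hit: remember which way was used
            self.hits += 1
            self.mru[index] = ways.index(tag)
        else:                             # miss: fill the least recently used way
            self.misses += 1
            victim = 1 - self.mru[index]
            ways[victim] = tag
            self.mru[index] = victim

cache = TwoWayLRUCache()
trace = [0x0, 0x4, 0x8, 0xC, 0x0, 0x400, 0x4]  # toy instruction trace
for a in trace:
    cache.access(a)
hit_rate = cache.hits / len(trace)
```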
## Replacement Policy Enhancement
- Implement FIFO and Random replacement algorithms
- Add prefetching mechanisms
## Cache Architecture Optimization
- Implement 3-way or 4-way set associative cache
- Increase cache size to 2KB or 4KB
- Implement L2 Cache
# References
* [Lab7: RISC-V Caches](https://csg.csail.mit.edu/6.175/labs/lab7-riscv-caches.html)
* [Building a RISC-V SoC from Scratch: Hardware Design and Linux Implementation](https://hackmd.io/@w4K9apQGS8-NFtsnFXutfg/B1Re5uGa5)
* [SRV32 - Simple 3-stage pipeline RISC-V processor](https://github.com/kuopinghsu/srv32)