陳榮昶, 林晉德
This project involves enhancing the SRV32 lightweight RISC-V processor core, developed in Verilog, by integrating a cache system. The aim is to improve memory access performance and implement classic cache replacement mechanisms.
This document provides a guide for installing and testing SRV32 on Ubuntu systems and outlines the steps to achieve the project goals.
First, install the required development packages:
sudo apt-get install autoconf automake autotools-dev curl libmpc-dev \
libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo \
gperf libtool patchutils bc zlib1g-dev git libexpat1-dev python3 \
python-is-python3 lcov
sudo apt install verilator
There are two methods to install the RISC-V toolchain:
Download xPack RISC-V toolchain
# For Linux x64 systems:
wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v14.2.0-3/xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
Extract and configure
tar xf xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
sudo mv xpack-riscv-none-elf-gcc-14.2.0-3 /opt/riscv
Set environment variables
echo 'export PATH="/opt/riscv/bin:$PATH"' >> ~/.bashrc
echo 'export CROSS_COMPILE=riscv-none-elf-' >> ~/.bashrc
echo 'export EXTRA_CFLAGS="-misa-spec=2.2 -march=rv32im"' >> ~/.bashrc
source ~/.bashrc
git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain
mkdir build
cd build
../configure --prefix=/opt/riscv \
--with-isa-spec=20191213 \
--with-multilib-generator="\
rv32im_zicsr-ilp32--;\
rv32imac_zicsr-ilp32--;\
rv32im_zicsr_zba_zbb_zbc_zbs-ilp32--;\
rv32imac_zicsr_zba_zbb_zbc_zbs-ilp32--;\
rv32em_zicsr-ilp32e--;\
rv32emac_zicsr-ilp32e--"
make -j$(nproc)
Problem:
Error "cannot execute binary file: Exec format error" when running make tests-all.
Solution:
Ensure you download the correct toolchain version for your system:
uname -m
to check the system architecture.

Problem:
Errors related to "la x0,5b" during compilation.
Solution:
Clear previous environment variables:
unset EXTRA_CFLAGS
unset CROSS_COMPILE
Set correct environment variables:
export CROSS_COMPILE=riscv-none-elf-
export EXTRA_CFLAGS="-misa-spec=2.2"
Clean and recompile:
make distclean
make tests-all
# Ubuntu package needed to run the RTL simulation
sudo apt install verilator
# Modify the Makefile, and set verilator ?= 1
vim sim/Makefile
or
make verilator=1
# Test Hello World program
make hello
# Run Dhrystone benchmark
make dhrystone
# Run all tests
make tests-all
Above is the srv32 architecture. It has two memory systems: an instruction memory and a data memory.
We therefore add two Level 1 Caches to the CPU, namely the Instruction Cache and the Data Cache. Both use 2-way set-associative mapping. The cache replacement policy we chose is LRU (Least Recently Used). The system architecture is shown in the figure below.
In this project, the Cache I/O is based on the srv32 CPU, with some modifications. Because of the 2-way set-associative mapping, the Cache design includes two banks, each containing 32 blocks. Each block has its own Tag and Valid bit. Therefore, the designed Cache maintains a 64-bit valid signal internally, representing the validity of the 32 blocks in each bank. The following diagram illustrates the hardware architecture of the Cache for this project.
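This organization implies a fixed breakdown of each 32-bit address. Below is a minimal Python sketch (illustrative, not part of the RTL); the field widths are taken from the indices used in the Verilog later in this document: word select is addr[3:2], set index is addr[8:4], and the remaining upper bits form the tag.

```python
def split_addr(addr: int):
    """Decompose a 32-bit byte address for a 2-way cache with
    32 sets and 128-bit (4-word) blocks."""
    word  = (addr >> 2) & 0x3    # addr[3:2]: which 32-bit word within the block
    index = (addr >> 4) & 0x1F   # addr[8:4]: which of the 32 sets
    tag   = addr >> 9            # addr[31:9]: compared against the stored tag
    return tag, index, word
```

For example, address 0x134 selects word 1 of set 19, with tag 0.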
The figure above shows the state diagram of the Instruction Cache controller designed for this project. Upon receiving the rst signal, the controller transitions to the IDLE state. When a CPU access request to the Instruction Memory is received, the controller transitions to the READ_HIT state to determine whether the accessed address matches the tag stored in the Cache. If the tag matches and the Valid bit of the corresponding block is 1, it is considered a hit; otherwise, it is a miss. This state lasts for only one cycle.
If it is determined to be a hit, the temporarily stored value in the Cache is returned to the CPU. Each Cache block is 128 bits, so the word to be returned to the CPU is determined based on address[3:2]. If it is determined to be a miss, the controller transitions to the READ_AXI state.
In the event of a Read Miss, the controller employs a Read Allocate policy, requiring communication with the SRAM to fetch the requested data along with three additional pieces of data (since the block size is 128 bits). When all four pieces of data have been sequentially written into the Cache block, the read operation is completed, and the controller transitions to the READ_DONE state.
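The four-beat fill can be sketched as follows. This is an illustrative Python model of the READ_AXI loop, not the Verilog itself; fetch_word is a hypothetical callback standing in for one AXI read beat.

```python
def fill_block(fetch_word):
    """Assemble a 128-bit cache block from four sequential 32-bit reads,
    mirroring the sram_counter loop in the READ_AXI state."""
    block = 0
    for i in range(4):  # sram_counter counts 0..3
        block |= (fetch_word(i) & 0xFFFF_FFFF) << (32 * i)
    return block
```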
The figure above illustrates the state machine of the Data Cache controller designed for this project. The design process is generally similar to that of the Instruction Cache, with the main difference being that the Data Cache must handle both read and write operations, whereas the Instruction Cache only needs to handle read operations. Therefore, the state machine for the Data Cache includes two additional states, WRITE_HIT and WRITE_CACHE, to manage write operations to the Data Memory.
Initially, the system remains in the IDLE state, waiting for the CPU to issue a core_req. Upon receiving a core_req, the system uses the core_write signal to determine whether the operation is a write. If it is a write operation, the system transitions to the WRITE_HIT state to determine whether it is a Write Hit. In the case of a Write Miss, the system notifies the CPU to update the value in the Data Memory and simultaneously updates the value in the Cache. The system then returns to the IDLE state to await the next access request from the CPU, ensuring that the data is successfully written before allowing the CPU to proceed.
If it is a Write Hit, the system transitions directly to the WRITE_CACHE state, following the Write-Through Policy. In this state, the data to be written to the Data Memory is also written to the corresponding block in the Cache. The system then transitions to the DONE state before returning to the IDLE state.
On the other hand, if the core_write signal is low while in the IDLE state, it indicates that the CPU intends to read from the Data Memory. The process then follows a flow similar to that of the Instruction Cache. The system transitions to the READ_HIT state to determine whether it is a Read Hit. If it is a Read Hit, the system returns to the IDLE state while simultaneously outputting the cached data back to the CPU. Conversely, if it is a Read Miss, the system transitions to the READ_AXI state to fetch data from the Data Memory via the CPU Wrapper. After receiving all four pieces of data, the system transitions to the DONE state and then returns to the IDLE state to await the next request.
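The flow described above can be condensed into a next-state function. The following is an illustrative Python model of the prose, not the RTL; the state and signal names follow the description, and last_beat stands for the condition that the fourth word has arrived.

```python
def next_state(state, core_req=False, core_write=False, hit=False, last_beat=False):
    """Data Cache controller transitions, as described in the text."""
    if state == "IDLE":
        if not core_req:
            return "IDLE"
        return "WRITE_HIT" if core_write else "READ_HIT"
    if state == "READ_HIT":
        return "IDLE" if hit else "READ_AXI"       # hit returns data this cycle
    if state == "READ_AXI":
        return "DONE" if last_beat else "READ_AXI" # wait for all four words
    if state == "WRITE_HIT":
        # Write Hit -> also update the cache copy; Write Miss -> back to IDLE
        return "WRITE_CACHE" if hit else "IDLE"
    if state == "WRITE_CACHE":
        return "DONE"
    return "IDLE"                                  # DONE and default
```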
This module implements the control logic for the L1 Instruction Cache. Since the instruction cache only handles read operations, it is relatively simple:
reg [1:0] cstate, nstate;
localparam IDLE = 2'd0;
localparam READ_HIT = 2'd1;
localparam READ_AXI = 2'd2;
localparam READ_DONE = 2'd3;
always @(*) begin
    case(cstate)
        IDLE: begin
            if(core_req) nstate = READ_HIT;
            else         nstate = IDLE;
        end
        READ_HIT: begin
            if(hit) nstate = IDLE;      // Cache hit
            else    nstate = READ_AXI;  // Cache miss
        end
        READ_AXI: begin
            if(sram_counter == 2'd3 && axi_ready)
                nstate = READ_DONE;
            else
                nstate = READ_AXI;
        end
        READ_DONE: nstate = IDLE;
        default:   nstate = IDLE;
    endcase
end
Here the hit signal is the OR of the two per-way tag comparisons (hit0 || hit1).
A 2-bit counter, sram_counter, is used to track progress through the four-word fill.

The data cache control logic is more complex since it needs to handle both read and write operations:
reg [2:0] cstate, nstate;
localparam IDLE = 3'd0;
localparam READ_HIT = 3'd1;
localparam READ_AXI = 3'd2; // Read from external memory
localparam WRITE_HIT = 3'd3;
localparam WRITE_CACHE = 3'd4; // Write to cache
localparam DONE = 3'd5;
READ_HIT: begin
if(hit) begin
case(core_addr_reg[3:2])
2'b00: core_out = DA_out[31:0];
2'b01: core_out = DA_out[63:32];
2'b10: core_out = DA_out[95:64];
2'b11: core_out = DA_out[127:96];
endcase
end
end
WRITE_CACHE: begin
    case(core_addr_reg[3:2])
        2'b00: begin // write word 0 of the block
            DA_write = {{12{1'b1}}, 4'b0000};
            DA_in    = {96'd0, core_in};
        end
        2'b01: begin // write word 1 of the block
            DA_write = {{8{1'b1}}, 4'b0000, {4{1'b1}}};
            DA_in    = {64'd0, core_in, 32'd0};
        end
        2'b10: begin // write word 2 of the block
            DA_write = {{4{1'b1}}, 4'b0000, {8{1'b1}}};
            DA_in    = {32'd0, core_in, 64'd0};
        end
        2'b11: begin // write word 3 of the block
            DA_write = {4'b0000, {12{1'b1}}};
            DA_in    = {core_in, 96'd0};
        end
    endcase
end
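The byte-enable patterns above follow a regular rule. Below is a small Python sketch of that rule, assuming DA_write is an active-low 16-bit byte-write mask (0 enables the byte), as the 4'b0000 lanes in the Verilog suggest; this is an illustrative model, not the RTL.

```python
def write_lane(word_sel: int, core_in: int):
    """Return (DA_write, DA_in) for a 32-bit write into a 128-bit block.

    word_sel is core_addr_reg[3:2]; DA_write clears the four byte-enable
    bits of the selected word, and DA_in shifts the data into that lane.
    """
    da_write = 0xFFFF & ~(0xF << (4 * word_sel))   # zero the 4 bits of the target word
    da_in    = (core_in & 0xFFFF_FFFF) << (32 * word_sel)
    return da_write, da_in
```

For word_sel = 0 this reproduces the {{12{1'b1}}, 4'b0000} pattern (0xFFF0), and so on for the other lanes.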
In both L1C_data.v and L1C_inst.v, LRU is implemented similarly:
// LRU buffer declaration (one LRU bit needed for each index)
reg lru_buffer[31:0]; // 0->set0; 1->set1
LRU bit update logic has several cases:
integer i;

always @(posedge clk or negedge resetb) begin
    if (!resetb) begin
        // Clear all LRU bits on reset
        for(i=0; i<32; i=i+1) begin
            lru_buffer[i] <= 1'b0;
        end
    end
end
READ_HIT: begin
if(hit) begin
// Set to 0 for set0 hit, 1 for set1 hit
lru_buffer[core_addr_reg[8:4]] <= (hit0)? 1'b0 : 1'b1;
end
end
hit0: indicates a hit in set 0
hit1: indicates a hit in set 1
READ_AXI: begin
if(TA_write == 2'b10) begin
lru_buffer[core_addr_reg[8:4]] <= 1'b0;
end
else if(TA_write == 2'b01) begin
lru_buffer[core_addr_reg[8:4]] <= 1'b1;
end
end
// Decide which set to replace based on LRU
if(!valid_reg[{1'b0, core_addr_reg[8:4]}]) begin
DA_wen = 2'b10; // Write to set0 if invalid
end
else if(!valid_reg[{1'b1, core_addr_reg[8:4]}]) begin
DA_wen = 2'b01; // Write to set1 if invalid
end
else if(lru_buffer[core_addr_reg[8:4]] == 1'b0) begin
DA_wen = 2'b01; // Replace set1 if LRU points to set0
end
else begin
DA_wen = 2'b10; // Replace set0 if LRU points to set1
end
Replacement priority order: an invalid block in set 0 is filled first, then an invalid block in set 1; only when both blocks are valid does the LRU bit decide which set to evict.
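The victim-selection logic above can be condensed into a small function. Note that, despite its name, the lru_buffer bit effectively records the most recently used set (it is set to 0 on a set-0 hit or fill), so the opposite set is the one evicted. An illustrative Python sketch:

```python
def choose_way(valid0: bool, valid1: bool, lru_bit: int) -> int:
    """Pick the set to fill on a miss, following the Verilog priority:
    invalid set 0 first, then invalid set 1, else evict the set that is
    NOT marked by the (MRU-recording) lru_buffer bit."""
    if not valid0:
        return 0                      # fill set 0 if its block is invalid
    if not valid1:
        return 1                      # fill set 1 if its block is invalid
    return 1 if lru_bit == 0 else 0   # both valid: evict the non-MRU set
```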
The waveform demonstrates an L1 Instruction Cache Read Miss case. During the READ_HIT state, the low hit signal indicates that the requested data is not present in the Cache Block. This triggers the need to fetch data from instruction memory. Following the Cache's Read Allocate Policy, this fetch operation includes retrieving the requested data along with three additional data pieces to fill the corresponding Cache Block.
The READ_AXI state communicates with the CPU Wrapper to retrieve data from SRAM. The process completes only after receiving four data pieces, each accompanied by an axi_ready signal. The core_wait signal then informs the CPU that it can proceed with the next request. Notably, an lru_buffer is implemented to track which bank was most recently accessed in each set, facilitating efficient replacement decisions.
The waveform shows multiple Instruction Cache Read Hit cases. In the READ_HIT state, the high hit signal confirms a cache hit. The requested data is then output to the CPU through the core_out port. The system returns to IDLE state, signaling the CPU's readiness for the next request.
In the Data Cache Write Miss case, identified by the low hit signal, the cache leaves the WRITE_HIT state and directly informs the CPU, following the Write-Through policy and updating the Data Memory accordingly.
The Data Cache Write Hit waveform shows a high hit signal. Upon detection in the WRITE_HIT state, the cache transitions to WRITE_CACHE state, where it writes the CPU's data to the corresponding Cache Block.