
Implement cache system for srv32

陳榮昶, 林晉德

GitHub

Introduction

This project involves enhancing the SRV32 lightweight RISC-V processor core, developed in Verilog, by integrating a cache system. The aim is to improve memory access performance and implement classic cache replacement mechanisms.
This document provides a guide for installing and testing SRV32 on Ubuntu systems and outlines the steps to achieve the project goals.

Tools Installation Steps

System Requirements

  • Operating System: Ubuntu (tested on Ubuntu 24.04.1 LTS)
  • Basic development tools
  • RISC-V GNU toolchain
  • Verilator

Install the RISC-V Toolchain

1. Install Basic Packages

First, install the required development packages:

sudo apt-get install autoconf automake autotools-dev curl libmpc-dev \
    libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo \
    gperf libtool patchutils bc zlib1g-dev git libexpat1-dev python3 \
    python-is-python3 lcov

2. Install Verilator

sudo apt install verilator

3. RISC-V Toolchain Installation

There are two methods to install the RISC-V toolchain.

Method 1: Prebuilt xPack Toolchain (Recommended)

  1. Download the xPack RISC-V toolchain

    # For Linux x64 systems:
    wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v14.2.0-3/xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
    
  2. Extract and configure

    tar xf xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
    sudo mv xpack-riscv-none-elf-gcc-14.2.0-3 /opt/riscv
    
  3. Set environment variables

    echo 'export PATH="/opt/riscv/bin:$PATH"' >> ~/.bashrc
    echo 'export CROSS_COMPILE=riscv-none-elf-' >> ~/.bashrc
    echo 'export EXTRA_CFLAGS="-misa-spec=2.2 -march=rv32im"' >> ~/.bashrc
    source ~/.bashrc
    

Method 2: Compiling from Source (Time-consuming)

git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain
mkdir build
cd build
../configure --prefix=/opt/riscv \
   --with-isa-spec=20191213 \
   --with-multilib-generator="\
     rv32im_zicsr-ilp32--;\
     rv32imac_zicsr-ilp32--;\
     rv32im_zicsr_zba_zbb_zbc_zbs-ilp32--;\
     rv32imac_zicsr_zba_zbb_zbc_zbs-ilp32--;\
     rv32em_zicsr-ilp32e--;\
     rv32emac_zicsr-ilp32e--"
make -j$(nproc)

Common Issues and Solutions

1. Toolchain Version Mismatch

Problem:
Error "cannot execute binary file: Exec format error" when running make tests-all.

Solution:
Ensure you download the correct toolchain version for your system:

  • Use uname -m to check system architecture
  • Choose the appropriate version:
    • x86_64: select linux-x64 version
    • aarch64: select linux-arm64 version

2. Compilation Errors

Problem:
Errors related to "la x0,5b" during compilation.

Solution:

  1. Clear previous environment variables:

    unset EXTRA_CFLAGS
    unset CROSS_COMPILE
    
  2. Set correct environment variables:

    export CROSS_COMPILE=riscv-none-elf-
    export EXTRA_CFLAGS="-misa-spec=2.2"
    
  3. Clean and recompile:

    make distclean
    make tests-all
    

Installation Verification

# Ubuntu package needed to run the RTL simulation
sudo apt install verilator

# Either edit sim/Makefile and set "verilator ?= 1" ...
vim sim/Makefile

# ... or pass it on the command line:
make verilator=1

Basic Function Testing

# Test Hello World program
make hello

# Run Dhrystone benchmark
make dhrystone

Comprehensive Testing

# Run all tests
make tests-all

Cache Architecture & Specification

Srv32 Architecture

(Figure: srv32 architecture diagram)

Above is the srv32 architecture. It has two memory systems: instruction memory and data memory.
We therefore add two Level 1 caches to the CPU, an Instruction Cache and a Data Cache. Both use 2-way set-associative mapping, and the replacement policy we chose is LRU (Least Recently Used). The system architecture is shown in the figure below.

System Architecture

(Figure: system architecture with L1 instruction and data caches)

In this project, the Cache I/O is based on the srv32 CPU, with some modifications. Because of the 2-way associative mapping, the cache design includes two banks, each containing 32 blocks. Each block has its own tag and valid bit, so the cache keeps a 64-bit valid signal internally, one bit for each of the 32 blocks in each of the two banks. The following diagram illustrates the hardware architecture of the cache for this project.

Cache Architecture

(Figure: cache hardware architecture)
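The tag/index decomposition described above can be sketched in Python (a model of the addressing scheme, not the RTL; the field boundaries follow from the 1 KB, 2-way, 16-byte-line specification below):

```python
def split_addr(addr: int):
    """Split a 32-bit address into the fields the cache uses."""
    word_sel = (addr >> 2) & 0x3   # addr[3:2]: word within the 128-bit line
    index    = (addr >> 4) & 0x1F  # addr[8:4]: selects one of 32 sets
    tag      = addr >> 9           # addr[31:9]: compared against the stored tag
    return tag, index, word_sel

# 0x1234 = 0b1_0010_0011_0100 -> tag 9, index 3, word 1
tag, index, word_sel = split_addr(0x1234)
```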

Cache Specification

  • Cache size: 1 KB
  • 2-way set-associative mapping
  • Cache replacement policy
    • LRU (Least Recently Used)
  • Block size (cache line size)
    • 16 bytes == 128 bits (4 words)
  • Entries: 32 (per way)
  • Tag array structure
    (Figure: tag array layout)
  • Data array structure
    (Figure: data array layout)
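The numbers in the specification can be cross-checked with a little arithmetic (a sanity check derived from the spec above, not from the RTL):

```python
# Geometry from the specification above.
WAYS       = 2    # 2-way set associative
ENTRIES    = 32   # entries (sets) per way
LINE_BYTES = 16   # block size: 16 B = 128 bits = 4 words

total_bytes = WAYS * ENTRIES * LINE_BYTES   # 2 * 32 * 16 = 1024 = 1 KB

# Resulting address field widths for a 32-bit address:
offset_bits = LINE_BYTES.bit_length() - 1      # 4  (addr[3:0])
index_bits  = ENTRIES.bit_length() - 1         # 5  (addr[8:4])
tag_bits    = 32 - index_bits - offset_bits    # 23 (addr[31:9])
```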

Cache controller & state diagram

Instruction Cache

(Figure: Instruction Cache controller state diagram)
The figure above shows the state diagram of the Instruction Cache controller designed for this project. Upon receiving the rst signal, the controller transitions to the IDLE state. When a CPU access request to the Instruction Memory is received, the controller transitions to the READ_HIT state to determine whether the accessed address matches the tag stored in the Cache. If the tag matches and the Valid bit of the corresponding block is 1, it is considered a hit; otherwise, it is a miss. This state lasts for only one cycle.

If it is determined to be a hit, the temporarily stored value in the Cache is returned to the CPU. Each Cache block is 128 bits, so the word to be returned to the CPU is determined based on address[3:2]. If it is determined to be a miss, the controller transitions to the READ_AXI state.

In the event of a Read Miss, the controller employs a Read Allocate policy: it communicates with the SRAM to fetch the requested word along with the three other words of the line (the block size is 128 bits, i.e. four 32-bit words). Once all four words have been written into the Cache block in sequence, the read operation is complete and the controller transitions to the READ_DONE state.
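The read-allocate refill above can be sketched behaviourally (Python, with a toy word-addressed dictionary standing in for the SRAM; `refill_line` is an illustrative name, not a signal in the design):

```python
def refill_line(mem, addr):
    """On a read miss, fetch all 4 words of the 16-byte line containing addr."""
    base = addr & ~0xF                            # align to the line boundary
    return [mem[base + 4 * i] for i in range(4)]  # 4 sequential word reads

mem = {a: a * 10 for a in range(0, 64, 4)}  # toy word-addressed memory
line = refill_line(mem, 0x1C)               # a miss on 0x1C refills 0x10..0x1C
# line == [mem[0x10], mem[0x14], mem[0x18], mem[0x1C]] == [160, 200, 240, 280]
```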

Data Cache

(Figure: Data Cache controller state diagram)
The figure above illustrates the state machine of the Data Cache controller designed for this project. The design process is generally similar to that of the Instruction Cache, with the main difference being that the Data Cache must handle both read and write operations, whereas the Instruction Cache only needs to handle read operations. Therefore, the state machine for the Data Cache includes two additional states, WRITE_HIT and WRITE_CACHE, to manage write operations to the Data Memory.

Initially, the system remains in the IDLE state, waiting for the CPU to issue a core_req. Upon receiving a core_req, the system uses the core_write signal to determine whether the operation is a write. If it is a write operation, the system transitions to the WRITE_HIT state to determine whether it is a Write Hit. In the case of a Write Miss, the system notifies the CPU to update the value in the Data Memory directly (following the write no-allocate policy, the missing line is not filled into the cache). The system then returns to the IDLE state to await the next access request from the CPU, ensuring that the data is successfully written before allowing the CPU to proceed.

If it is a Write Hit, the system transitions directly to the WRITE_CACHE state, following the Write-Through policy. In this state, the data to be written to the Data Memory is also written to the corresponding block in the Cache. The system then transitions to the DONE state before returning to the IDLE state.

On the other hand, if the core_write signal is low while in the IDLE state, it indicates that the CPU intends to read from the Data Memory. The process then follows a flow similar to that of the Instruction Cache. The system transitions to the READ_HIT state to determine whether it is a Read Hit. If it is a Read Hit, the system returns to the IDLE state while simultaneously outputting the cached data back to the CPU. Conversely, if it is a Read Miss, the system transitions to the READ_AXI state to fetch data from the Data Memory via the CPU Wrapper. After receiving all four words, the system transitions to the DONE state and then returns to the IDLE state to await the next request.
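The overall Data Cache flow can be modelled behaviourally. Below is a heavily simplified Python sketch (one way, one word per line, illustrative names such as `DCacheModel`; the real design is 2-way with 128-bit lines):

```python
class DCacheModel:
    """Write-through, write no-allocate, read-allocate, as described above."""

    def __init__(self):
        self.tags = {}   # index -> stored tag
        self.data = {}   # index -> cached word (one word per line, simplified)

    def _fields(self, addr):
        return (addr >> 4) & 0x1F, addr >> 9     # index, tag

    def read(self, mem, addr):
        idx, tag = self._fields(addr)
        if self.tags.get(idx) == tag:            # READ_HIT
            return self.data[idx], True
        self.tags[idx] = tag                     # READ_AXI: allocate on miss
        self.data[idx] = mem[addr]
        return self.data[idx], False

    def write(self, mem, addr, value):
        idx, tag = self._fields(addr)
        mem[addr] = value                        # write-through: memory always updated
        if self.tags.get(idx) == tag:            # WRITE_HIT -> WRITE_CACHE
            self.data[idx] = value
            return True
        return False                             # write miss: no allocation

mem = {0x100: 7}
dc = DCacheModel()
val, hit   = dc.read(mem, 0x100)      # first read: miss, line is allocated
val2, hit2 = dc.read(mem, 0x100)      # second read: hit
wrote_hit  = dc.write(mem, 0x100, 9)  # write hit: cache and memory both updated
```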

Cache Controller Implementation

L1C_inst.v

This module implements the control logic for the L1 Instruction Cache. Since the instruction cache only handles read operations, it is relatively simple:

reg [1:0] cstate, nstate;
localparam IDLE      = 2'd0;
localparam READ_HIT  = 2'd1;
localparam READ_AXI  = 2'd2;
localparam READ_DONE = 2'd3;

Main control logic:

1. IDLE State

IDLE: begin
    if (core_req) nstate = READ_HIT;
    else          nstate = IDLE;
end
  • Wait for CPU request

2. READ_HIT State

READ_HIT: begin
    if (hit) nstate = IDLE;      // Cache hit
    else     nstate = READ_AXI;  // Cache miss
end
  • Perform cache lookup
  • Check for hit (hit0 || hit1)
  • Return data directly if hit, otherwise go to external memory read state

3. READ_AXI State

READ_AXI: begin
    if (sram_counter == 2'd3 && axi_ready) nstate = READ_DONE;
    else                                   nstate = READ_AXI;
end
  • Read data from external memory
  • Each read transfers 32 bits, needs 4 transfers to fill a cache line
  • Use sram_counter to track progress

L1C_data.v

The data cache control logic is more complex since it needs to handle both read and write operations:

reg [2:0] cstate, nstate;
localparam IDLE        = 3'd0;
localparam READ_HIT    = 3'd1;
localparam READ_AXI    = 3'd2;  // Read from external memory
localparam WRITE_HIT   = 3'd3;
localparam WRITE_CACHE = 3'd4;  // Write to cache
localparam DONE        = 3'd5;

Main control logic:

1. Read Operation

READ_HIT: begin
    if (hit) begin
        case (core_addr_reg[3:2])
            2'b00: core_out = DA_out[31:0];
            2'b01: core_out = DA_out[63:32];
            2'b10: core_out = DA_out[95:64];
            2'b11: core_out = DA_out[127:96];
        endcase
    end
end
  • Select corresponding 32-bit data based on address when hit
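The word selection can be mirrored in a few lines of Python (a model of the case statement above, not the RTL):

```python
def select_word(line128: int, addr: int) -> int:
    """Pick the 32-bit word that addr[3:2] selects out of a 128-bit cache line."""
    word_sel = (addr >> 2) & 0x3
    return (line128 >> (32 * word_sel)) & 0xFFFF_FFFF

# Line holding words 0xA, 0xB, 0xC, 0xD (low word first, like DA_out[31:0]).
line = (0xD << 96) | (0xC << 64) | (0xB << 32) | 0xA
```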

2. Write Operation

WRITE_CACHE: begin
    case (core_addr_reg[3:2])
        2'b00: begin  // write the first word of the line
            DA_write = {{12{1'b1}}, 4'b0000};
            DA_in    = {96'd0, core_in};
        end
        2'b01: begin  // write the second word
            DA_write = {{8{1'b1}}, 4'b0000, {4{1'b1}}};
            DA_in    = {64'd0, core_in, 32'd0};
        end
        2'b10: begin  // write the third word
            DA_write = {{4{1'b1}}, 4'b0000, {8{1'b1}}};
            DA_in    = {32'd0, core_in, 64'd0};
        end
        2'b11: begin  // write the fourth word
            DA_write = {4'b0000, {12{1'b1}}};
            DA_in    = {core_in, 96'd0};
        end
    endcase
end
  • Write Through: Write to both cache and external memory
  • Write No-allocate: Don't load block into cache on miss
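The four DA_write patterns above follow a single rule: clear the four byte enables of the selected word (the pattern suggests they are active low) and shift core_in to that word's position. A Python sketch of that rule, as an observation about the pattern rather than code from the design:

```python
def write_masks(addr: int):
    """Reproduce the DA_write / DA_in shift pattern of the WRITE_CACHE case."""
    word_sel = (addr >> 2) & 0x3
    da_write = 0xFFFF & ~(0xF << (4 * word_sel))  # active-low byte enables
    shift    = 32 * word_sel                      # where core_in lands in DA_in
    return da_write, shift
```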

LRU Replacement Strategy Implementation

In both L1C_data.v and L1C_inst.v, LRU is implemented similarly:

// LRU buffer declaration (one LRU bit needed for each index)
reg lru_buffer[31:0];  // 0 -> set0 was used most recently, 1 -> set1

LRU bit update logic has several cases:

1. Reset State:

integer i;
always @(posedge clk or negedge resetb) begin
    if (!resetb) begin
        for (i = 0; i < 32; i = i + 1) begin
            lru_buffer[i] <= 1'b0;
        end
    end
end
  • Initialize all LRU bits to 0 on system reset

2. Update on Read Hit:

READ_HIT: begin
    if (hit) begin
        // Set to 0 for a set0 hit, 1 for a set1 hit
        lru_buffer[core_addr_reg[8:4]] <= (hit0) ? 1'b0 : 1'b1;
    end
end
  • hit0: Indicates hit on set 0
  • hit1: Indicates hit on set 1
  • Records which set was last accessed

3. Update on Read Miss:

READ_AXI: begin
    if (TA_write == 2'b10) begin
        lru_buffer[core_addr_reg[8:4]] <= 1'b0;
    end else if (TA_write == 2'b01) begin
        lru_buffer[core_addr_reg[8:4]] <= 1'b1;
    end
end
  • Update LRU bit based on which set is written to

4. Replacement Decision:

// Decide which set to replace based on LRU
if (!valid_reg[{1'b0, core_addr_reg[8:4]}]) begin
    DA_wen = 2'b10;  // Write to set0 if it is invalid
end else if (!valid_reg[{1'b1, core_addr_reg[8:4]}]) begin
    DA_wen = 2'b01;  // Write to set1 if it is invalid
end else if (lru_buffer[core_addr_reg[8:4]] == 1'b0) begin
    DA_wen = 2'b01;  // set0 was used most recently, so replace set1
end else begin
    DA_wen = 2'b10;  // set1 was used most recently, so replace set0
end

Replacement priority order:

  1. Use invalid cache line if available
  2. If both sets are valid, choose the least recently used set based on LRU bit
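The priority order can be captured in a small Python function (a model of the decision logic above; the return values 0/1 stand for set0/set1):

```python
def choose_victim(valid0: bool, valid1: bool, lru_bit: int) -> int:
    """Pick the way to fill or replace, mirroring the priority above.
    lru_bit == 0 means set0 was used most recently."""
    if not valid0:
        return 0                       # fill an invalid line in set0 first
    if not valid1:
        return 1                       # then an invalid line in set1
    return 1 if lru_bit == 0 else 0    # both valid: evict the LRU way
```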

Results

1. Cache Operation Verification through Waveform Analysis

1.1 Instruction Cache Operations

A. Read Miss Scenario

(Waveform: Instruction Cache read miss detection)

The waveform demonstrates an L1 Instruction Cache Read Miss case. During the READ_HIT state, the low hit signal indicates that the requested data is not present in the cache block, which triggers a fetch from instruction memory. Following the cache's Read Allocate policy, this fetch retrieves the requested word along with three additional words to fill the corresponding cache block.
(Waveform: READ_AXI refill via the CPU Wrapper)

The READ_AXI state communicates with the CPU Wrapper to retrieve data from SRAM. The process completes only after receiving four words, each accompanied by an axi_ready signal. The core_wait signal then informs the CPU that it can proceed with the next request. Notably, an lru_buffer is implemented to track which bank was most recently accessed in each set, facilitating efficient replacement decisions.

B. Read Hit Scenario

(Waveform: Instruction Cache read hits)

The waveform shows multiple Instruction Cache Read Hit cases. In the READ_HIT state, the high hit signal confirms a cache hit. The requested data is then output to the CPU through the core_out port. The system returns to IDLE state, signaling the CPU's readiness for the next request.

1.2 Data Cache Operations

A. Write Miss Scenario

(Waveform: Data Cache write miss)

In the Data Cache Write Miss case, identified by the low hit signal, the cache leaves the WRITE_HIT state and directly informs the CPU to carry out the Write-Through policy, updating the Data Memory accordingly (the missing line is not allocated in the cache).

B. Write Hit Scenario

(Waveform: Data Cache write hit)

The Data Cache Write Hit waveform shows a high hit signal. Upon detection in the WRITE_HIT state, the cache transitions to WRITE_CACHE state, where it writes the CPU's data to the corresponding Cache Block.

Future Work

Cache Performance Analysis

  • Add hit rate analysis
    • Track cache hits and misses
    • Monitor hit rates for different workloads
    • Analyze miss rate patterns
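One possible shape for that analysis, sketched in Python (`HitRateCounter` is a hypothetical helper, not part of the current design; in practice the counters would live in the testbench or RTL and sample the hit signal):

```python
class HitRateCounter:
    """Accumulate hits vs. total accesses and report the ratio."""

    def __init__(self):
        self.hits = 0
        self.accesses = 0

    def record(self, hit: bool):
        self.accesses += 1
        self.hits += int(hit)

    @property
    def hit_rate(self) -> float:
        return self.hits / self.accesses if self.accesses else 0.0

hrc = HitRateCounter()
for h in [True, True, False, True]:   # sampled hit signal per access
    hrc.record(h)
# hrc.hit_rate == 0.75
```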

Replacement Policy Enhancement

  • Implement FIFO and Random replacement algorithms
  • Add prefetching mechanisms

Cache Architecture Optimization

  • Implement 3-way or 4-way set associative cache
  • Increase cache size to 2KB or 4KB
  • Implement L2 Cache
