
Implement cache system for srv32

陳榮昶, 林晉德

GitHub

Introduction

This project involves enhancing the SRV32 lightweight RISC-V processor core, developed in Verilog, by integrating a cache system. The aim is to improve memory access performance and implement classic cache replacement mechanisms.
This document provides a guide for installing and testing SRV32 on Ubuntu systems and outlines the steps to achieve the project goals.

Tools Installation Steps

System Requirements

  • Operating System: Ubuntu (tested on Ubuntu 24.04.1 LTS)
  • Basic development tools
  • RISC-V GNU toolchain
  • Verilator

Install the RISC-V Toolchain

1. Install Basic Packages

First, install the required development packages:

sudo apt-get install autoconf automake autotools-dev curl libmpc-dev \
    libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo \
    gperf libtool patchutils bc zlib1g-dev git libexpat1-dev python3 \
    python-is-python3 lcov

2. Install Verilator

sudo apt install verilator

3. RISC-V Toolchain Installation

There are two methods to install the RISC-V toolchain.

Method 1: Prebuilt xPack Toolchain (Recommended)

  1. Download the xPack RISC-V toolchain

    # For Linux x64 systems:
    wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v14.2.0-3/xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
    
  2. Extract and configure

    tar xf xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
    sudo mv xpack-riscv-none-elf-gcc-14.2.0-3 /opt/riscv
    
  3. Set environment variables

    echo 'export PATH="/opt/riscv/bin:$PATH"' >> ~/.bashrc
    echo 'export CROSS_COMPILE=riscv-none-elf-' >> ~/.bashrc
    echo 'export EXTRA_CFLAGS="-misa-spec=2.2 -march=rv32im"' >> ~/.bashrc
    source ~/.bashrc
    

Method 2: Compiling from Source (Time-consuming)

git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
cd riscv-gnu-toolchain
mkdir build
cd build
../configure --prefix=/opt/riscv \
   --with-isa-spec=20191213 \
   --with-multilib-generator="\
     rv32im_zicsr-ilp32--;\
     rv32imac_zicsr-ilp32--;\
     rv32im_zicsr_zba_zbb_zbc_zbs-ilp32--;\
     rv32imac_zicsr_zba_zbb_zbc_zbs-ilp32--;\
     rv32em_zicsr-ilp32e--;\
     rv32emac_zicsr-ilp32e--"
make -j$(nproc)

Common Issues and Solutions

1. Toolchain Version Mismatch

Problem:
Error "cannot execute binary file: Exec format error" when running make tests-all.

Solution:
Ensure you download the correct toolchain version for your system:

  • Use uname -m to check system architecture
  • Choose the appropriate version:
    • x86_64: select linux-x64 version
    • aarch64: select linux-arm64 version

2. Compilation Errors

Problem:
Errors related to "la x0,5b" during compilation.

Solution:

  1. Clear previous environment variables:

    unset EXTRA_CFLAGS
    unset CROSS_COMPILE
    
  2. Set correct environment variables:

    export CROSS_COMPILE=riscv-none-elf-
    export EXTRA_CFLAGS="-misa-spec=2.2"
    
  3. Clean and recompile:

    make distclean
    make tests-all
    

Installation Verification

# Ubuntu package needed to run the RTL simulation
sudo apt install verilator

# Either edit sim/Makefile and set "verilator ?= 1" ...
vim sim/Makefile

# ... or pass it on the command line:
make verilator=1

Basic Function Testing

# Test Hello World program
make hello

# Run Dhrystone benchmark
make dhrystone

Comprehensive Testing

# Run all tests
make tests-all

Cache Architecture & Specification

Srv32 Architecture

(Figure: srv32 architecture diagram)

Above is the srv32 architecture. It has two memory systems: instruction memory and data memory.
We therefore add two Level 1 caches to the CPU, an Instruction Cache and a Data Cache. Both use 2-way set-associative mapping, and the replacement policy we chose is LRU (Least Recently Used). The system architecture is shown in the figure below.

System Architecture

(Figure: system architecture with L1 instruction and data caches)

In this project, the Cache I/O is based on the srv32 CPU, with some modifications. Because of the 2-way associative mapping, the cache design includes two banks, each containing 32 blocks. Each block has its own tag and valid bit, so the cache keeps a 64-bit valid signal internally, one bit for each of the 32 blocks in each of the two banks. The following diagram illustrates the hardware architecture of the cache for this project.

Cache Architecture

(Figure: cache hardware architecture)
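The tag/index decomposition described above can be sketched in Python (a model of the addressing scheme, not the RTL; the field boundaries follow from the 1 KB, 2-way, 16-byte-line specification below):

```python
def split_addr(addr: int):
    """Split a 32-bit address into the fields the cache uses."""
    word_sel = (addr >> 2) & 0x3   # addr[3:2]: word within the 128-bit line
    index    = (addr >> 4) & 0x1F  # addr[8:4]: selects one of 32 sets
    tag      = addr >> 9           # addr[31:9]: compared against the stored tag
    return tag, index, word_sel

# 0x1234 = 0b1_0010_0011_0100 -> tag 9, index 3, word 1
tag, index, word_sel = split_addr(0x1234)
```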

Cache Specification

  • Cache size: 1 KB
  • 2-way set-associative mapping
  • Cache replacement policy
    • LRU (Least Recently Used)
  • Block size (cache line size)
    • 16 bytes == 128 bits (4 words)
  • Entries: 32 (per way)
  • Tag array structure
    (Figure: tag array layout)
  • Data array structure
    (Figure: data array layout)
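The numbers in the specification can be cross-checked with a little arithmetic (a sanity check derived from the spec above, not from the RTL):

```python
# Geometry from the specification above.
WAYS       = 2    # 2-way set associative
ENTRIES    = 32   # entries (sets) per way
LINE_BYTES = 16   # block size: 16 B = 128 bits = 4 words

total_bytes = WAYS * ENTRIES * LINE_BYTES   # 2 * 32 * 16 = 1024 = 1 KB

# Resulting address field widths for a 32-bit address:
offset_bits = LINE_BYTES.bit_length() - 1      # 4  (addr[3:0])
index_bits  = ENTRIES.bit_length() - 1         # 5  (addr[8:4])
tag_bits    = 32 - index_bits - offset_bits    # 23 (addr[31:9])
```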

Cache controller & state diagram

Instruction Cache

(Figure: Instruction Cache controller state diagram)
The figure above shows the state diagram of the Instruction Cache controller designed for this project. Upon receiving the rst signal, the controller transitions to the IDLE state. When a CPU access request to the Instruction Memory is received, the controller transitions to the READ_HIT state to determine whether the accessed address matches the tag stored in the Cache. If the tag matches and the Valid bit of the corresponding block is 1, it is considered a hit; otherwise, it is a miss. This state lasts for only one cycle.

If it is determined to be a hit, the temporarily stored value in the Cache is returned to the CPU. Each Cache block is 128 bits, so the word to be returned to the CPU is determined based on address[3:2]. If it is determined to be a miss, the controller transitions to the READ_AXI state.

In the event of a Read Miss, the controller employs a Read Allocate policy: it communicates with the SRAM to fetch the requested word along with the three other words of the line (the block size is 128 bits, i.e. four 32-bit words). Once all four words have been written into the Cache block in sequence, the read operation is complete and the controller transitions to the READ_DONE state.
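The read-allocate refill above can be sketched behaviourally (Python, with a toy word-addressed dictionary standing in for the SRAM; `refill_line` is an illustrative name, not a signal in the design):

```python
def refill_line(mem, addr):
    """On a read miss, fetch all 4 words of the 16-byte line containing addr."""
    base = addr & ~0xF                            # align to the line boundary
    return [mem[base + 4 * i] for i in range(4)]  # 4 sequential word reads

mem = {a: a * 10 for a in range(0, 64, 4)}  # toy word-addressed memory
line = refill_line(mem, 0x1C)               # a miss on 0x1C refills 0x10..0x1C
# line == [mem[0x10], mem[0x14], mem[0x18], mem[0x1C]] == [160, 200, 240, 280]
```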

Data Cache

(Figure: Data Cache controller state diagram)
The figure above illustrates the state machine of the Data Cache controller designed for this project. The design process is generally similar to that of the Instruction Cache, with the main difference being that the Data Cache must handle both read and write operations, whereas the Instruction Cache only needs to handle read operations. Therefore, the state machine for the Data Cache includes two additional states, WRITE_HIT and WRITE_CACHE, to manage write operations to the Data Memory.

Initially, the system remains in the IDLE state, waiting for the CPU to issue a core_req. Upon receiving a core_req, the system uses the core_write signal to determine whether the operation is a write. If it is a write operation, the system transitions to the WRITE_HIT state to determine whether it is a Write Hit. In the case of a Write Miss, the system notifies the CPU to update the value in the Data Memory directly (following the write no-allocate policy, the missing line is not filled into the cache). The system then returns to the IDLE state to await the next access request from the CPU, ensuring that the data is successfully written before allowing the CPU to proceed.

If it is a Write Hit, the system transitions directly to the WRITE_CACHE state, following the Write-Through policy. In this state, the data to be written to the Data Memory is also written to the corresponding block in the Cache. The system then transitions to the DONE state before returning to the IDLE state.

On the other hand, if the core_write signal is low while in the IDLE state, it indicates that the CPU intends to read from the Data Memory. The process then follows a flow similar to that of the Instruction Cache. The system transitions to the READ_HIT state to determine whether it is a Read Hit. If it is a Read Hit, the system returns to the IDLE state while simultaneously outputting the cached data back to the CPU. Conversely, if it is a Read Miss, the system transitions to the READ_AXI state to fetch data from the Data Memory via the CPU Wrapper. After receiving all four words, the system transitions to the DONE state and then returns to the IDLE state to await the next request.
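The overall Data Cache flow can be modelled behaviourally. Below is a heavily simplified Python sketch (one way, one word per line, illustrative names such as `DCacheModel`; the real design is 2-way with 128-bit lines):

```python
class DCacheModel:
    """Write-through, write no-allocate, read-allocate, as described above."""

    def __init__(self):
        self.tags = {}   # index -> stored tag
        self.data = {}   # index -> cached word (one word per line, simplified)

    def _fields(self, addr):
        return (addr >> 4) & 0x1F, addr >> 9     # index, tag

    def read(self, mem, addr):
        idx, tag = self._fields(addr)
        if self.tags.get(idx) == tag:            # READ_HIT
            return self.data[idx], True
        self.tags[idx] = tag                     # READ_AXI: allocate on miss
        self.data[idx] = mem[addr]
        return self.data[idx], False

    def write(self, mem, addr, value):
        idx, tag = self._fields(addr)
        mem[addr] = value                        # write-through: memory always updated
        if self.tags.get(idx) == tag:            # WRITE_HIT -> WRITE_CACHE
            self.data[idx] = value
            return True
        return False                             # write miss: no allocation

mem = {0x100: 7}
dc = DCacheModel()
val, hit   = dc.read(mem, 0x100)      # first read: miss, line is allocated
val2, hit2 = dc.read(mem, 0x100)      # second read: hit
wrote_hit  = dc.write(mem, 0x100, 9)  # write hit: cache and memory both updated
```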

Cache Controller Implementation

L1C_inst.v

This module implements the control logic for the L1 Instruction Cache. Since the instruction cache only handles read operations, it is relatively simple:

reg [1:0] cstate, nstate;
localparam IDLE      = 2'd0;
localparam READ_HIT  = 2'd1;
localparam READ_AXI  = 2'd2;
localparam READ_DONE = 2'd3;

Main control logic:

1. IDLE State

IDLE: begin
    if (core_req) nstate = READ_HIT;
    else          nstate = IDLE;
end
  • Wait for CPU request

2. READ_HIT State

READ_HIT: begin
    if (hit) nstate = IDLE;      // Cache hit
    else     nstate = READ_AXI;  // Cache miss
end
  • Perform cache lookup
  • Check for hit (hit0 || hit1)
  • Return data directly if hit, otherwise go to external memory read state

3. READ_AXI State

READ_AXI: begin
    if (sram_counter == 2'd3 && axi_ready) nstate = READ_DONE;
    else                                   nstate = READ_AXI;
end
  • Read data from external memory
  • Each read transfers 32 bits, needs 4 transfers to fill a cache line
  • Use sram_counter to track progress

L1C_data.v

The data cache control logic is more complex since it needs to handle both read and write operations:

reg [2:0] cstate, nstate;
localparam IDLE        = 3'd0;
localparam READ_HIT    = 3'd1;
localparam READ_AXI    = 3'd2;  // Read from external memory
localparam WRITE_HIT   = 3'd3;
localparam WRITE_CACHE = 3'd4;  // Write to cache
localparam DONE        = 3'd5;

Main control logic:

1. Read Operation

READ_HIT: begin
    if (hit) begin
        case (core_addr_reg[3:2])
            2'b00: core_out = DA_out[31:0];
            2'b01: core_out = DA_out[63:32];
            2'b10: core_out = DA_out[95:64];
            2'b11: core_out = DA_out[127:96];
        endcase
    end
end
  • Select corresponding 32-bit data based on address when hit
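The word selection can be mirrored in a few lines of Python (a model of the case statement above, not the RTL):

```python
def select_word(line128: int, addr: int) -> int:
    """Pick the 32-bit word that addr[3:2] selects out of a 128-bit cache line."""
    word_sel = (addr >> 2) & 0x3
    return (line128 >> (32 * word_sel)) & 0xFFFF_FFFF

# Line holding words 0xA, 0xB, 0xC, 0xD (low word first, like DA_out[31:0]).
line = (0xD << 96) | (0xC << 64) | (0xB << 32) | 0xA
```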

2. Write Operation

WRITE_CACHE: begin
    case (core_addr_reg[3:2])
        2'b00: begin  // write the first word of the line
            DA_write = {{12{1'b1}}, 4'b0000};
            DA_in    = {96'd0, core_in};
        end
        2'b01: begin  // write the second word
            DA_write = {{8{1'b1}}, 4'b0000, {4{1'b1}}};
            DA_in    = {64'd0, core_in, 32'd0};
        end
        2'b10: begin  // write the third word
            DA_write = {{4{1'b1}}, 4'b0000, {8{1'b1}}};
            DA_in    = {32'd0, core_in, 64'd0};
        end
        2'b11: begin  // write the fourth word
            DA_write = {4'b0000, {12{1'b1}}};
            DA_in    = {core_in, 96'd0};
        end
    endcase
end
  • Write Through: Write to both cache and external memory
  • Write No-allocate: Don't load block into cache on miss
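The four DA_write patterns above follow a single rule: clear the four byte enables of the selected word (the pattern suggests they are active low) and shift core_in to that word's position. A Python sketch of that rule, as an observation about the pattern rather than code from the design:

```python
def write_masks(addr: int):
    """Reproduce the DA_write / DA_in shift pattern of the WRITE_CACHE case."""
    word_sel = (addr >> 2) & 0x3
    da_write = 0xFFFF & ~(0xF << (4 * word_sel))  # active-low byte enables
    shift    = 32 * word_sel                      # where core_in lands in DA_in
    return da_write, shift
```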

LRU Replacement Strategy Implementation

In both L1C_data.v and L1C_inst.v, LRU is implemented similarly:

// LRU buffer declaration (one LRU bit needed for each index)
reg lru_buffer[31:0];  // 0 -> set0 was used most recently, 1 -> set1

LRU bit update logic has several cases:

1. Reset State:

integer i;
always @(posedge clk or negedge resetb) begin
    if (!resetb) begin
        for (i = 0; i < 32; i = i + 1) begin
            lru_buffer[i] <= 1'b0;
        end
    end
end
  • Initialize all LRU bits to 0 on system reset

2. Update on Read Hit:

READ_HIT: begin
    if (hit) begin
        // Set to 0 for a set0 hit, 1 for a set1 hit
        lru_buffer[core_addr_reg[8:4]] <= (hit0) ? 1'b0 : 1'b1;
    end
end
  • hit0: Indicates hit on set 0
  • hit1: Indicates hit on set 1
  • Records which set was last accessed

3. Update on Read Miss:

READ_AXI: begin
    if (TA_write == 2'b10) begin
        lru_buffer[core_addr_reg[8:4]] <= 1'b0;
    end else if (TA_write == 2'b01) begin
        lru_buffer[core_addr_reg[8:4]] <= 1'b1;
    end
end
  • Update LRU bit based on which set is written to

4. Replacement Decision:

// Decide which set to replace based on LRU
if (!valid_reg[{1'b0, core_addr_reg[8:4]}]) begin
    DA_wen = 2'b10;  // Write to set0 if it is invalid
end else if (!valid_reg[{1'b1, core_addr_reg[8:4]}]) begin
    DA_wen = 2'b01;  // Write to set1 if it is invalid
end else if (lru_buffer[core_addr_reg[8:4]] == 1'b0) begin
    DA_wen = 2'b01;  // set0 was used most recently, so replace set1
end else begin
    DA_wen = 2'b10;  // set1 was used most recently, so replace set0
end

Replacement priority order:

  1. Use invalid cache line if available
  2. If both sets are valid, choose the least recently used set based on LRU bit
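The priority order can be captured in a small Python function (a model of the decision logic above; the return values 0/1 stand for set0/set1):

```python
def choose_victim(valid0: bool, valid1: bool, lru_bit: int) -> int:
    """Pick the way to fill or replace, mirroring the priority above.
    lru_bit == 0 means set0 was used most recently."""
    if not valid0:
        return 0                       # fill an invalid line in set0 first
    if not valid1:
        return 1                       # then an invalid line in set1
    return 1 if lru_bit == 0 else 0    # both valid: evict the LRU way
```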

Results

1. Cache Operation Verification through Waveform Analysis

1.1 Instruction Cache Operations

A. Read Miss Scenario

(Waveform: Instruction Cache read miss detection)

The waveform demonstrates an L1 Instruction Cache Read Miss case. During the READ_HIT state, the low hit signal indicates that the requested data is not present in the cache block, which triggers a fetch from instruction memory. Following the cache's Read Allocate policy, this fetch retrieves the requested word along with three additional words to fill the corresponding cache block.
(Waveform: READ_AXI refill via the CPU Wrapper)

The READ_AXI state communicates with the CPU Wrapper to retrieve data from SRAM. The process completes only after receiving four words, each accompanied by an axi_ready signal. The core_wait signal then informs the CPU that it can proceed with the next request. Notably, an lru_buffer is implemented to track which bank was most recently accessed in each set, facilitating efficient replacement decisions.

B. Read Hit Scenario

(Waveform: Instruction Cache read hits)

The waveform shows multiple Instruction Cache Read Hit cases. In the READ_HIT state, the high hit signal confirms a cache hit. The requested data is then output to the CPU through the core_out port. The system returns to IDLE state, signaling the CPU's readiness for the next request.

1.2 Data Cache Operations

A. Write Miss Scenario

(Waveform: Data Cache write miss)

In the Data Cache Write Miss case, identified by the low hit signal, the cache leaves the WRITE_HIT state and directly informs the CPU to carry out the Write-Through policy, updating the Data Memory accordingly (the missing line is not allocated in the cache).

B. Write Hit Scenario

(Waveform: Data Cache write hit)

The Data Cache Write Hit waveform shows a high hit signal. Upon detection in the WRITE_HIT state, the cache transitions to WRITE_CACHE state, where it writes the CPU's data to the corresponding Cache Block.

Future Work

Cache Performance Analysis

  • Add hit rate analysis
    • Track cache hits and misses
    • Monitor hit rates for different workloads
    • Analyze miss rate patterns
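One possible shape for that analysis, sketched in Python (`HitRateCounter` is a hypothetical helper, not part of the current design; in practice the counters would live in the testbench or RTL and sample the hit signal):

```python
class HitRateCounter:
    """Accumulate hits vs. total accesses and report the ratio."""

    def __init__(self):
        self.hits = 0
        self.accesses = 0

    def record(self, hit: bool):
        self.accesses += 1
        self.hits += int(hit)

    @property
    def hit_rate(self) -> float:
        return self.hits / self.accesses if self.accesses else 0.0

hrc = HitRateCounter()
for h in [True, True, False, True]:   # sampled hit signal per access
    hrc.record(h)
# hrc.hit_rate == 0.75
```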

Replacement Policy Enhancement

  • Implement FIFO and Random replacement algorithms
  • Add prefetching mechanisms

Cache Architecture Optimization

  • Implement 3-way or 4-way set associative cache
  • Increase cache size to 2KB or 4KB
  • Implement L2 Cache
