Try   HackMD

5-stage pipelined RISC-V processor capable of running FreeRTOS

林子齊

Introduction

Use last year's Lab 3 RISC-V 32-bit single-cycle CPU as a template to construct a pipelined CPU with CLINT (Core Local Interruptor). The goal is to achieve minimal ISA support required to run FreeRTOS. Implement the design on an FPGA and attempt to run FreeRTOS on a physical development board.

Development Environment

I strive to use only open-source tools for this project. Although setting up the development environment can be complex, I manage my toolchains using Nix Home Manager in my configuration file. Refer to my ubuntu-config and use soc-flake to install the necessary packages.

Data Memory

In modern CPUs, communication between the data memory and the CPU occurs via the bus. However, as this is too complex for this project, I have implemented only the simplest data memory, which is directly connected to the memory stage without bus.

Sync & Async Read

After modifying the data memory implementation, the single-cycle CPU failed the CPU test. Upon inspecting the waveform, I noticed that if the current cycle is an L-type instruction, the data from memory does not appear until the next cycle. This delay prevents the destination register from being updated correctly within the same cycle.

In my view, if we want to use a synchronous read memory in a single-cycle CPU, we need to include a buffer to temporarily hold the data and an appropriate register write-enable mechanism, so that the data can be written in the following cycle. Since this adds complexity to a single-cycle design, I switched to an asynchronous memory implementation. With asynchronous reads, the data is available within the same cycle, allowing the CPU to pass the test.

Instruction Read-Only Memory

As previously mentioned, this project is a simplified implementation. The instruction read-only memory is directly connected to the instruction fetch stage. However, in a real system, the instructions would reside in main memory, requiring an operating system to manage the memory state. Since our CPU implementation does not include an operating system, we must load the program’s binary code into the instruction memory for each test.

Testbench Preparation

To test the functionality of the CPU, we need to compile a C program into binary code and manually load the binary program into memory. On my Mac, I use the RISC-V toolchain installed via Homebrew to compile and link the program into an executable file. Then, I extract the .text and .data segments to create a binary program.

On my Mac, only the RISC-V 64 toolchain is available, so it's necessary to add an option to specify the target architecture. For example, if the single-cycle CPU only implements the RV32I instruction set, we need to explicitly specify it.

To compile the C source code into object code, use the -c option. This option ensures that only object code is generated without linking. This is important because we want to link the C program with a custom-defined initialization code that performs some setup before executing the main function in the C program.

$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib -o init.o -c init.S
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib -o fibonacci.o -c fibonacci.c

To link the C program's object code and the initialization code, use a custom-defined memory layout specified in link.ld. Additionally, ensure that the architecture is set to the ELF format with 32-bit little-endian encoding. This ensures compatibility with the target CPU architecture and allows the binary code to be correctly generated and loaded.

$ riscv64-unknown-elf-ld -m elf32lriscv -T link.ld -o fibonacci.elf init.o fibonacci.o

I set the initial program counter to point to the instruction memory address 0x00000000. Consequently, I place the initialization code segment at the memory address 0x00000000 to ensure proper execution from the beginning of the memory.

/* link.ld */
OUTPUT_ARCH(riscv)
ENTRY(_start)

SECTIONS
{
    . = 0x00000000;
    .text :
    {
        *(.text._start)
        *(.text)
        *(.rodata)
    }

    .data :
    {
        *(.data)
        *(.bss)
    }
}

The C program used to test the CPU functionality is quite simple and does not involve any context switching. Therefore, the initialization code only needs to set the stack pointer to the appropriate address and can then directly jump to the main function in the C program.

/* init.S */
.section .text
.globl _start

_start:
    li sp, 4096
    call main
hang:
    j hang

Since the ELF format contains unnecessary metadata for my CPU, only the .text and .data segments are extracted from the ELF file. The target file is then specified as a binary format.

$ riscv64-unknown-elf-objcopy -O binary -j .text -j .data fibonacci.elf fibonacci.bin

In Chisel, memory initialization supports the use of hexadecimal files. Hexadecimal format instructions are easy to trace and debug. The hexdump tool can be used to create a hexadecimal format program. When using hexdump, specify the option to treat 4 bytes as a block, which corresponds to an instruction word.

$ hexdump -v -e '/4 "%08x\n"' fibonacci.bin > fibonacci.hex

Simulation Bug

It is acceptable to run each component separately, but when testing the entire CPU, errors occur. I discovered this is caused by a bug in the latest version of Chisel. The issue lies in the Chisel backend compiler, which does not initialize the instruction memory with the program provided. FIRRTL eliminates the initialization process because it considers it unnecessary. You can find the related issue on GitHub. To resolve this, I reverted to an older version of Chisel with FIRRTL annotations, allowing the CPU to function correctly.

Annotation

In Chisel, an annotation is a tag that provides additional information to the compiler. During the process of transpiling Chisel HDL to System Verilog, several optimizations occur, such as renaming and eliminating unnecessary circuits. However, during CPU behavior simulation with a program, the initialization process might be discarded. To prevent this, the DontTouchAnnotation must be applied to instruct the FIRRTL compiler not to optimize the instruction memory section. Without this annotation, the instruction memory will not function correctly.

val mem = Mem(System.InstructionMemorySizeInBytes, UInt(System.InstructionWidth)).suggestName("mem")
annotate(new ChiselAnnotation {
    def toFirrtl = DontTouchAnnotation(mem.toTarget)
})
loadMemoryFromFileInline(
    mem,
    f"./src/main/resources/${filename}",
    firrtl.annotations.MemoryLoadFileType.Hex
)

Control & Forward

After understanding how a single-cycle CPU works, the next step is learning pipelined CPU architecture. A pipelined CPU reduces the critical path, enhancing the clock rate of the CPU. However, it introduces challenges such as handling register dependencies and jump problems. Since multiple instructions are in different stages simultaneously, the actual state of the CPU cannot be determined at any given moment. Additional buffers are required to cache the state, and data forwarding is used to resolve dependency issues. When hazards cannot be eliminated, the pipeline must be stalled or flushed to ensure correct execution.

Data Hazard

A data hazard occurs when there is a register dependency between instructions, meaning the destination register of a previous instruction is the same as the source register of the current instruction. In a 5-stage pipelined CPU, if there is no forwarding unit to pass the data early, the result will be incorrect. This happens because the current instruction cannot receive the correct source register value before entering the execute stage (EX). This situation is referred to as a Read-After-Write (RAW) hazard.

Below is the code segment for the forward unit. After the instruction is decoded, the instruction's information is cached in the pipeline buffer. This information includes the destination and source register identifiers. The forward unit checks whether the destination register identifier in the memory stage or execute stage matches the source register identifier of the current instruction at the decode stage.

when(io.MEM_REGS_wEn && io.EX_rs1 === io.MEM_rd && io.MEM_rd =/= 0.U) {
    io.EX_reg1_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.EX_rs1 === io.WB_rd && io.WB_rd =/= 0.U) {
    io.EX_reg1_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
    io.EX_reg1_forward_type := ForwardingType.NoForward
}

when(io.MEM_REGS_wEn && io.EX_rs2 === io.MEM_rd && io.MEM_rd =/= 0.U) {
    io.EX_reg2_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.EX_rs2 === io.WB_rd && io.WB_rd =/= 0.U) {
    io.EX_reg2_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
    io.EX_reg2_forward_type := ForwardingType.NoForward
}

when(io.MEM_REGS_wEn && io.ID_rs1 === io.MEM_rd && io.MEM_rd =/= 0.U) {
    io.ID_reg1_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.ID_rs1 === io.WB_rd && io.WB_rd =/= 0.U) {
    io.ID_reg1_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
    io.ID_reg1_forward_type := ForwardingType.NoForward
}

when(io.MEM_REGS_wEn && io.ID_rs2 === io.MEM_rd && io.MEM_rd =/= 0.U) {
    io.ID_reg2_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.ID_rs2 === io.WB_rd && io.WB_rd =/= 0.U) {
    io.ID_reg2_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
    io.ID_reg2_forward_type := ForwardingType.NoForward
}

While forwarding cannot always eliminate data hazards, consider the example below in a 5-stage pipelined CPU with IF, ID, EX, MEM, and WB stages. It is necessary to stall for 1 cycle because the t0 source register in the addi instruction cannot receive the correct value via forwarding when the lw instruction is in the MEM stage and the addi instruction is in the EX stage. This is because the value from memory becomes available only in the WB stage, making it possible to forward the data back.

lw t0, 0(a0)
addi t0, t0, 1

To address this situation, a control unit is used to detect inevitable data hazards. The conditions are defined in the control unit's logic, as described below:

  • When EX_MEM_rEn is high, it indicates that the instruction currently in the EX stage requires data from memory. This condition occurs only for load-type (L-type) instructions, as the data is not yet available in the MEM stage.
  • If the instruction in the ID stage uses the destination register of the L-type instruction as a source register, the ID stage must stall for one cycle. This stall ensures that the data read from memory at the MEM stage is available for the subsequent instruction.

The control logic ensures proper pipeline operation by introducing stalls where necessary to handle data hazards.

((io.ID_jump_inst || io.EX_MEM_rEn)
 && io.EX_rd =/= 0.U
 && (io.EX_rd === io.ID_rs1 || io.EX_rd === io.ID_rs2))

The IF and ID must stall for one cycle, to prevent the instruction at ID pass to EX to cause the error result.

io.PC_stall := true.B /* Stall PC    */
io.IF_stall := true.B /* Flush IF2ID */
io.ID_flush := true.B /* Flush ID2EX */

Inevitable Data Hazards: Possible Solutions

  1. Compiler Optimization: Rearrange the instruction sequence to minimize dependencies and avoid hazards.
  2. Out-of-Order Execution: Implement hardware-level techniques to reduce dependencies by executing instructions out of their original order when possible.
  3. Modify ALUOpSource: Expand ALUOpSource to 2 bits in width, allowing it to select data forwarded from either the MEM or WB stage.
    • Note: It is uncertain whether this method is feasible without further experimentation. Testing this approach could provide valuable insights.

Control Hazard

Decrease Branch Cost

Early Branch Resolution

Global Branch Predictor

CSR (Control/Status Register)

CSR Definition

Atomic Operation

CLINT (Core-Local Interrupt Controller)

Interrupt Handler