林子齊
Use last year's Lab 3 RISC-V 32-bit single-cycle CPU as a template to construct a pipelined CPU with CLINT (Core Local Interruptor). The goal is to achieve minimal ISA support required to run FreeRTOS. Implement the design on an FPGA and attempt to run FreeRTOS on a physical development board.
I strive to use only open-source tools for this project. Although setting up the development environment can be complex, I manage my toolchains using Nix Home Manager in my configuration file. Refer to my ubuntu-config and use soc-flake
to install the necessary packages.
In modern CPUs, communication between the data memory and the CPU occurs via the bus. However, as this is too complex for this project, I have implemented only the simplest data memory, which is directly connected to the memory stage without bus.
After modifying the data memory implementation, the single-cycle CPU failed the CPU test. Upon inspecting the waveform, I noticed that if the current cycle is an L-type instruction, the data from memory does not appear until the next cycle. This delay prevents the destination register from being updated correctly within the same cycle.
In my view, if we want to use a synchronous read memory in a single-cycle CPU, we need to include a buffer to temporarily hold the data and an appropriate register write-enable mechanism, so that the data can be written in the following cycle. Since this adds complexity to a single-cycle design, I switched to an asynchronous memory implementation. With asynchronous reads, the data is available within the same cycle, allowing the CPU to pass the test.
As previously mentioned, this project is a simplified implementation. The instruction read-only memory is directly connected to the instruction fetch stage. However, in a real system, the instructions would reside in main memory, requiring an operating system to manage the memory state. Since our CPU implementation does not include an operating system, we must load the program’s binary code into the instruction memory for each test.
To test the functionality of the CPU, we need to compile a C program into binary code and manually load the binary program into memory. On my Mac, I use the RISC-V toolchain installed via Homebrew to compile and link the program into an executable file. Then, I extract the .text
and .data
segments to create a binary program.
On my Mac, only the RISC-V 64 toolchain is available, so it's necessary to add an option to specify the target architecture. For example, if the single-cycle CPU only implements the RV32I instruction set, we need to explicitly specify it.
To compile the C source code into object code, use the -c
option. This option ensures that only object code is generated without linking. This is important because we want to link the C program with a custom-defined initialization code that performs some setup before executing the main
function in the C program.
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib -o init.o -c init.S
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib -o fibonacci.o -c fibonacci.c
To link the C program's object code and the initialization code, use a custom-defined memory layout specified in link.ld
. Additionally, ensure that the architecture is set to the ELF format with 32-bit little-endian encoding. This ensures compatibility with the target CPU architecture and allows the binary code to be correctly generated and loaded.
$ riscv64-unknown-elf-ld -m elf32lriscv -T link.ld -o fibonacci.elf init.o fibonacci.o
I set the initial program counter to point to the instruction memory address 0x00000000
. Consequently, I place the initialization code segment at the memory address 0x00000000
to ensure proper execution from the beginning of the memory.
/* link.ld */
OUTPUT_ARCH(riscv)
ENTRY(_start)
SECTIONS
{
. = 0x00000000;
.text :
{
*(.text._start)
*(.text)
*(.rodata)
}
.data :
{
*(.data)
*(.bss)
}
}
The C program used to test the CPU functionality is quite simple and does not involve any context switching. Therefore, the initialization code only needs to set the stack pointer to the appropriate address and can then directly jump to the main
function in the C program.
/* init.S */
.section .text
.globl _start
_start:
li sp, 4096
call main
hang:
j hang
Since the ELF format contains unnecessary metadata for my CPU, only the .text
and .data
segments are extracted from the ELF file. The target file is then specified as a binary format.
$ riscv64-unknown-elf-objcopy -O binary -j .text -j .data fibonacci.elf fibonacci.bin
In Chisel, memory initialization supports the use of hexadecimal files. Hexadecimal format instructions are easy to trace and debug. The hexdump
tool can be used to create a hexadecimal format program. When using hexdump
, specify the option to treat 4 bytes as a block, which corresponds to an instruction word.
$ hexdump -v -e '/4 "%08x\n"' fibonacci.bin > fibonacci.hex
It is acceptable to run each component separately, but when testing the entire CPU, errors occur. I discovered this is caused by a bug in the latest version of Chisel. The issue lies in the Chisel backend compiler, which does not initialize the instruction memory with the program provided. FIRRTL eliminates the initialization process because it considers it unnecessary. You can find the related issue on GitHub. To resolve this, I reverted to an older version of Chisel with FIRRTL annotations, allowing the CPU to function correctly.
In Chisel, an annotation is a tag that provides additional information to the compiler. During the process of transpiling Chisel HDL to System Verilog, several optimizations occur, such as renaming and eliminating unnecessary circuits. However, during CPU behavior simulation with a program, the initialization process might be discarded. To prevent this, the DontTouchAnnotation
must be applied to instruct the FIRRTL compiler not to optimize the instruction memory section. Without this annotation, the instruction memory will not function correctly.
val mem = Mem(System.InstructionMemorySizeInBytes, UInt(System.InstructionWidth)).suggestName("mem")
annotate(new ChiselAnnotation {
def toFirrtl = DontTouchAnnotation(mem.toTarget)
})
loadMemoryFromFileInline(
mem,
f"./src/main/resources/${filename}",
firrtl.annotations.MemoryLoadFileType.Hex
)
After understanding how a single-cycle CPU works, the next step is learning pipelined CPU architecture. A pipelined CPU reduces the critical path, enhancing the clock rate of the CPU. However, it introduces challenges such as handling register dependencies and jump problems. Since multiple instructions are in different stages simultaneously, the actual state of the CPU cannot be determined at any given moment. Additional buffers are required to cache the state, and data forwarding is used to resolve dependency issues. When hazards cannot be eliminated, the pipeline must be stalled or flushed to ensure correct execution.
A data hazard occurs when there is a register dependency between instructions, meaning the destination register of a previous instruction is the same as the source register of the current instruction. In a 5-stage pipelined CPU, if there is no forwarding unit to pass the data early, the result will be incorrect. This happens because the current instruction cannot receive the correct source register value before entering the execute stage (EX). This situation is referred to as a Read-After-Write (RAW) hazard.
Below is the code segment for the forward unit. After the instruction is decoded, the instruction's information is cached in the pipeline buffer. This information includes the destination and source register identifiers. The forward unit checks whether the destination register identifier in the memory stage or execute stage matches the source register identifier of the current instruction at the decode stage.
when(io.MEM_REGS_wEn && io.EX_rs1 === io.MEM_rd && io.MEM_rd =/= 0.U) {
io.EX_reg1_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.EX_rs1 === io.WB_rd && io.WB_rd =/= 0.U) {
io.EX_reg1_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
io.EX_reg1_forward_type := ForwardingType.NoForward
}
when(io.MEM_REGS_wEn && io.EX_rs2 === io.MEM_rd && io.MEM_rd =/= 0.U) {
io.EX_reg2_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.EX_rs2 === io.WB_rd && io.WB_rd =/= 0.U) {
io.EX_reg2_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
io.EX_reg2_forward_type := ForwardingType.NoForward
}
when(io.MEM_REGS_wEn && io.ID_rs1 === io.MEM_rd && io.MEM_rd =/= 0.U) {
io.ID_reg1_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.ID_rs1 === io.WB_rd && io.WB_rd =/= 0.U) {
io.ID_reg1_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
io.ID_reg1_forward_type := ForwardingType.NoForward
}
when(io.MEM_REGS_wEn && io.ID_rs2 === io.MEM_rd && io.MEM_rd =/= 0.U) {
io.ID_reg2_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.ID_rs2 === io.WB_rd && io.WB_rd =/= 0.U) {
io.ID_reg2_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
io.ID_reg2_forward_type := ForwardingType.NoForward
}
While forwarding cannot always eliminate data hazards, consider the example below in a 5-stage pipelined CPU with IF
, ID
, EX
, MEM
, and WB
stages. It is necessary to stall for 1 cycle because the t0
source register in the addi
instruction cannot receive the correct value via forwarding when the lw
instruction is in the MEM
stage and the addi
instruction is in the EX
stage. This is because the value from memory becomes available only in the WB
stage, making it possible to forward the data back.
lw t0, 0(a0)
addi t0, t0, 1
To address this situation, a control unit is used to detect inevitable data hazards. The conditions are defined in the control unit's logic, as described below:
EX_MEM_rEn
is high, it indicates that the instruction currently in the EX
stage requires data from memory. This condition occurs only for load-type (L-type) instructions, as the data is not yet available in the MEM
stage.ID
stage uses the destination register of the L-type instruction as a source register, the ID
stage must stall for one cycle. This stall ensures that the data read from memory at the MEM
stage is available for the subsequent instruction.The control logic ensures proper pipeline operation by introducing stalls where necessary to handle data hazards.
((io.ID_jump_inst || io.EX_MEM_rEn)
&& io.EX_rd =/= 0.U
&& (io.EX_rd === io.ID_rs1 || io.EX_rd === io.ID_rs2))
The IF
and ID
must stall for one cycle, to prevent the instruction at ID
pass to EX
to cause the error result.
io.PC_stall := true.B /* Stall PC */
io.IF_stall := true.B /* Flush IF2ID */
io.ID_flush := true.B /* Flush ID2EX */
Inevitable Data Hazards: Possible Solutions
ALUOpSource
: Expand ALUOpSource
to 2 bits in width, allowing it to select data forwarded from either the MEM
or WB
stage.