# 5-stage pipelined RISC-V processor capable of running FreeRTOS
> 林子齊
## Project Links
- [Single-Cycle CPU](https://github.com/ericlin1231/rv32-single-cycle)
- [Pipelined CPU](https://github.com/ericlin1231/rv32-pipeline)
## Introduction
Use last year's Lab 3 RISC-V 32-bit single-cycle CPU as a template to construct a pipelined CPU with CLINT (Core Local Interruptor). The goal is to achieve minimal ISA support required to run FreeRTOS. Implement the design on an FPGA and attempt to run FreeRTOS on a physical development board.
## Development Environment
I strive to use only open-source tools for this project. Although setting up the development environment can be complex, I manage my toolchains using Nix Home Manager in my configuration file. Refer to my [ubuntu-config](https://github.com/ericlin1231/ubuntu-config) and use `soc-flake` to install the necessary packages.
## Data Memory
In modern CPUs, communication between the data memory and the CPU occurs via the bus. However, as this is too complex for this project, I have implemented only the simplest data memory, which is directly connected to the memory stage without bus.
### Sync & Async Read
After modifying the data memory implementation, the single-cycle CPU failed the CPU test. Upon inspecting the waveform, I noticed that if the current cycle is an L-type instruction, the data from memory does not appear until the next cycle. This delay prevents the destination register from being updated correctly within the same cycle.
In my view, if we want to use a synchronous read memory in a single-cycle CPU, we need to include a buffer to temporarily hold the data and an appropriate register write-enable mechanism, so that the data can be written in the following cycle. Since this adds complexity to a single-cycle design, I switched to an asynchronous memory implementation. With asynchronous reads, the data is available within the same cycle, allowing the CPU to pass the test.
## Instruction Read-Only Memory
As previously mentioned, this project is a simplified implementation. The instruction read-only memory is directly connected to the instruction fetch stage. However, in a real system, the instructions would reside in main memory, requiring an operating system to manage the memory state. Since our CPU implementation does not include an operating system, we must load the program’s binary code into the instruction memory for each test.
### Testbench Preparation
To test the functionality of the CPU, we need to compile a C program into binary code and manually load the binary program into memory. On my Mac, I use the RISC-V toolchain installed via Homebrew to compile and link the program into an executable file. Then, I extract the `.text` and `.data` segments to create a binary program.
On my Mac, only the RISC-V 64 toolchain is available, so it's necessary to add an option to specify the target architecture. For example, if the single-cycle CPU only implements the RV32I instruction set, we need to explicitly specify it.
To compile the C source code into object code, use the `-c` option. This option ensures that only object code is generated without linking. This is important because we want to link the C program with a custom-defined initialization code that performs some setup before executing the `main` function in the C program.
```shell
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib -o init.o -c init.S
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib -o fibonacci.o -c fibonacci.c
```
To link the C program's object code and the initialization code, use a custom-defined memory layout specified in `link.ld`. Additionally, ensure that the architecture is set to the ELF format with 32-bit little-endian encoding. This ensures compatibility with the target CPU architecture and allows the binary code to be correctly generated and loaded.
```shell
$ riscv64-unknown-elf-ld -m elf32lriscv -T link.ld -o fibonacci.elf init.o fibonacci.o
```
I set the initial program counter to point to the instruction memory address `0x00000000`. Consequently, I place the initialization code segment at the memory address `0x00000000` to ensure proper execution from the beginning of the memory.
```c
/* link.ld */
OUTPUT_ARCH(riscv)
ENTRY(_start)
SECTIONS
{
. = 0x00000000;
.text :
{
*(.text._start)
*(.text)
*(.rodata)
}
.data :
{
*(.data)
*(.bss)
}
}
```
The C program used to test the CPU functionality is quite simple and does not involve any context switching. Therefore, the initialization code only needs to set the stack pointer to the appropriate address and can then directly jump to the `main` function in the C program.
```asm
/* init.S */
.section .text
.globl _start
_start:
li sp, 4096
call main
hang:
j hang
```
Since the ELF format contains unnecessary metadata for my CPU, only the `.text` and `.data` segments are extracted from the ELF file. The target file is then specified as a binary format.
```shell
$ riscv64-unknown-elf-objcopy -O binary -j .text -j .data fibonacci.elf fibonacci.bin
```
In Chisel, memory initialization supports the use of hexadecimal files. Hexadecimal format instructions are easy to trace and debug. The `hexdump` tool can be used to create a hexadecimal format program. When using `hexdump`, specify the option to treat **4 bytes** as a block, which corresponds to an instruction word.
```shell
$ hexdump -v -e '/4 "%08x\n"' fibonacci.bin > fibonacci.hex
```
### Simulation Bug
It is acceptable to run each component separately, but when testing the entire CPU, errors occur. I discovered this is caused by a bug in the latest version of Chisel. The issue lies in the Chisel backend compiler, which does not initialize the instruction memory with the program provided. FIRRTL eliminates the initialization process because it considers it unnecessary. You can find the related [issue](https://github.com/chipsalliance/chisel/issues/4496) on GitHub. To resolve this, I reverted to an older version of Chisel with FIRRTL annotations, allowing the CPU to function correctly.
### Annotation
In Chisel, an annotation is a tag that provides additional information to the compiler. During the process of transpiling Chisel HDL to System Verilog, several optimizations occur, such as renaming and eliminating unnecessary circuits. However, during CPU behavior simulation with a program, the initialization process might be discarded. To prevent this, the `DontTouchAnnotation` must be applied to instruct the FIRRTL compiler not to optimize the instruction memory section. Without this annotation, the instruction memory will not function correctly.
```scala
val mem = Mem(System.InstructionMemorySizeInBytes, UInt(System.InstructionWidth)).suggestName("mem")
annotate(new ChiselAnnotation {
def toFirrtl = DontTouchAnnotation(mem.toTarget)
})
loadMemoryFromFileInline(
mem,
f"./src/main/resources/${filename}",
firrtl.annotations.MemoryLoadFileType.Hex
)
```
## Control & Forward
After understanding how a single-cycle CPU works, the next step is learning pipelined CPU architecture. A pipelined CPU reduces the critical path, enhancing the clock rate of the CPU. However, it introduces challenges such as handling register dependencies and jump problems. Since multiple instructions are in different stages simultaneously, the actual state of the CPU cannot be determined at any given moment. Additional buffers are required to cache the state, and data forwarding is used to resolve dependency issues. When hazards cannot be eliminated, the pipeline must be stalled or flushed to ensure correct execution.
### Data Hazard
A data hazard occurs when there is a register dependency between instructions, meaning the **destination register** of a previous instruction is the same as the **source register** of the current instruction. In a 5-stage pipelined CPU, if there is no forwarding unit to pass the data early, the result will be incorrect. This happens because the current instruction cannot receive the correct source register value before entering the execute stage (EX). This situation is referred to as a Read-After-Write (RAW) hazard.
Below is the code segment for the forward unit. After the instruction is decoded, the instruction's information is cached in the pipeline buffer. This information includes the destination and source register identifiers. The forward unit checks whether the destination register identifier in the memory stage or execute stage matches the source register identifier of the current instruction at the decode stage.
```scala
when(io.MEM_REGS_wEn && io.EX_rs1 === io.MEM_rd && io.MEM_rd =/= 0.U) {
io.EX_reg1_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.EX_rs1 === io.WB_rd && io.WB_rd =/= 0.U) {
io.EX_reg1_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
io.EX_reg1_forward_type := ForwardingType.NoForward
}
when(io.MEM_REGS_wEn && io.EX_rs2 === io.MEM_rd && io.MEM_rd =/= 0.U) {
io.EX_reg2_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.EX_rs2 === io.WB_rd && io.WB_rd =/= 0.U) {
io.EX_reg2_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
io.EX_reg2_forward_type := ForwardingType.NoForward
}
when(io.MEM_REGS_wEn && io.ID_rs1 === io.MEM_rd && io.MEM_rd =/= 0.U) {
io.ID_reg1_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.ID_rs1 === io.WB_rd && io.WB_rd =/= 0.U) {
io.ID_reg1_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
io.ID_reg1_forward_type := ForwardingType.NoForward
}
when(io.MEM_REGS_wEn && io.ID_rs2 === io.MEM_rd && io.MEM_rd =/= 0.U) {
io.ID_reg2_forward_type := ForwardingType.ForwardFromMEM
}.elsewhen(io.WB_REGS_wEn && io.ID_rs2 === io.WB_rd && io.WB_rd =/= 0.U) {
io.ID_reg2_forward_type := ForwardingType.ForwardFromWB
}.otherwise {
io.ID_reg2_forward_type := ForwardingType.NoForward
}
```
While forwarding cannot always eliminate data hazards, consider the example below in a 5-stage pipelined CPU with `IF`, `ID`, `EX`, `MEM`, and `WB` stages. It is necessary to stall for 1 cycle because the `t0` source register in the `addi` instruction cannot receive the correct value via forwarding when the `lw` instruction is in the `MEM` stage and the `addi` instruction is in the `EX` stage. This is because the value from memory becomes available only in the `WB` stage, making it possible to forward the data back.
```asm
lw t0, 0(a0)
addi t0, t0, 1
```
To address this situation, a control unit is used to detect inevitable data hazards. The conditions are defined in the control unit's logic, as described below:
- When `EX_MEM_rEn` is high, it indicates that the instruction currently in the `EX` stage requires data from memory. This condition occurs only for load-type (L-type) instructions, as the data is not yet available in the `MEM` stage.
- If the instruction in the `ID` stage uses the destination register of the L-type instruction as a source register, the `ID` stage must stall for one cycle. This stall ensures that the data read from memory at the `MEM` stage is available for the subsequent instruction.
The control logic ensures proper pipeline operation by introducing stalls where necessary to handle data hazards.
```scala
((io.ID_jump_inst || io.EX_MEM_rEn)
&& io.EX_rd =/= 0.U
&& (io.EX_rd === io.ID_rs1 || io.EX_rd === io.ID_rs2))
```
The `IF` and `ID` must stall for one cycle, to prevent the instruction at `ID` pass to `EX` to cause the error result.
```scala
io.PC_stall := true.B /* Stall PC */
io.IF_stall := true.B /* Flush IF2ID */
io.ID_flush := true.B /* Flush ID2EX */
```
:::success
Inevitable Data Hazards: Possible Solutions
1. **Compiler Optimization**: Rearrange the instruction sequence to minimize dependencies and avoid hazards.
2. **Out-of-Order Execution**: Implement hardware-level techniques to reduce dependencies by executing instructions out of their original order when possible.
3. **Modify `ALUOpSource`**: Expand `ALUOpSource` to 2 bits in width, allowing it to select data forwarded from either the `MEM` or `WB` stage.
- Note: It is uncertain whether this method is feasible without further experimentation. Testing this approach could provide valuable insights.
:::
### Control Hazard
## Decrease Branch Cost
### Early Branch Resolution
### Global Branch Predictor
## CSR (Control/Status Register)
### CSR Definition
### Atomic Operation
## CLINT (Core-Local Interrupt Controller)
### Interrupt Handler