# Lab3: Construct a RISC-V CPU with Chisel
> [Computer Architecture](https://wiki.csie.ncku.edu.tw/arch/schedule) 2025 Fall
## Development Objectives of this Project
Our goal is to build a RISC-V CPU that stays simple enough for readers with a basic understanding of digital circuits and the C programming language, while still offering as much functionality as that simplicity allows. This project covers the following key aspects, which will feature prominently in the technical report:
1. Implementation in [Chisel](https://www.chisel-lang.org/).
2. RV32I instruction set support.
3. Construction of a RISC-V processor, progressing from a single-cycle to a pipelined design.
4. Execution of programs compiled from the C programming language.
5. Successful completion of RISC-V Architectural Tests, also known as [riscv-arch-test](https://github.com/riscv-non-isa/riscv-arch-test).
## Prerequisites
### Operating Systems
Ensure that you have a functioning GNU/Linux or macOS system with the necessary permissions to install software packages.
:::warning
:warning: Notice
1. Consider using Linux or macOS instead of Microsoft Windows, as resolving any potential issues on Windows would be your responsibility.
2. Ubuntu Linux 24.04 or later is recommended.
> Please note that certain packages, such as [Verilator](https://www.veripool.org/verilator/), which are available in older Ubuntu distributions, may be too outdated to follow the instructions provided below.
3. If you are using an Apple Silicon-based MacBook, you may encounter some problems. Please record them and discuss them with the instructor.
:::
### Install the dependent packages
* For macOS
```shell
$ brew install verilator surfer
```
* For Ubuntu Linux
```shell
$ sudo apt install build-essential verilator gtkwave
```
> Install Surfer manually via [GitLab](https://gitlab.com/surfer-project/surfer/-/packages/).
### Install sbt
[sbt](https://www.scala-sbt.org/), short for [Scala](https://www.scala-lang.org/) Build Tool, relies on [Scala](https://www.scala-lang.org/), a JVM (Java Virtual Machine) language that combines object-oriented and functional programming paradigms into a highly concise, logical, and exceptionally robust language. Therefore, before using sbt, you must ensure that a functional JVM is accessible in your environment.
- [ ] macOS
> Ensure that [Homebrew](https://brew.sh/) is installed.
```shell
# Remove any previously installed sbt and jenv
$ brew uninstall sbt
$ brew uninstall jenv
# Install sdkman
$ curl -s "https://get.sdkman.io" | bash
$ source "$HOME/.sdkman/bin/sdkman-init.sh"
# Install Eclipse Temurin JDK 11
$ sdk install java 11.0.29-tem
$ sdk install sbt
```
> :warning: The version number `11` is crucial; otherwise, you may encounter unexpected issues with sbt.
- [ ] Linux
* Follow [the instructions](https://www.scala-sbt.org/release/docs/Installing-sbt-on-Linux.html). You MUST install Eclipse Temurin JDK 11. You can follow the instructions mentioned above, starting with the `curl` step.
> Another approach is to utilize Apptainer. See [champ's note](https://champyen.blogspot.com/2023/11/chisel.html).
## Fundamental Concepts behind Chisel
[Chisel](https://www.chisel-lang.org/) is a domain-specific language (DSL) embedded in [Scala](https://www.scala-lang.org/) and delivered as a Scala library. Therefore, **all circuit logic must be expressed with the types and operators provided by the [Chisel](https://www.chisel-lang.org/) library**, rather than with plain Scala values and control-flow keywords. Specifically, when you want to use a constant in a circuit, such as 12, and compare it with a circuit value, you cannot write `if (value == 12)` but instead must write `when(value === 12.U)`: the Scala `if` is evaluated once while the circuit is being generated, whereas `when` describes a condition in the hardware itself. Keeping this distinction in mind is essential for writing valid [Chisel](https://www.chisel-lang.org/) code.
> [Chisel Cheatsheet](https://github.com/freechipsproject/chisel-cheatsheet/releases/latest/download/chisel_cheatsheet.pdf)
### Relationship Between Chisel and Verilog
[Chisel](https://www.chisel-lang.org/) is not strictly an equivalent replacement for Verilog but rather a generator language. Hardware circuits written in Chisel need to be compiled into Verilog files and then synthesized into actual circuits through EDA (Electronic Design Automation) software. Furthermore, for the sake of generality, some features found in Verilog, such as negative-edge triggering and multi-clock simulation, are not fully supported or not supported at all in Chisel.

### Illustration of Simple Combinational Logic
Let's explore a simple combinational logic example.

```scala
// simple logic expression
(a & ~b) | (~a & b)
```
Unlike traditional execution models, the logic circuits associated with [Chisel](https://www.chisel-lang.org/) are continuously active, somewhat akin to continuous assignment in Verilog. However, [Chisel](https://www.chisel-lang.org/) takes a departure from Verilog's built-in logic gates and opts for expressions to define logic.
In the above example, we introduce "variables" `a` and `b`, which essentially serve as "named wires" due to their roles as inputs to the circuit. It is important to note that other wires within the circuit do not necessitate explicit names. While we have assumed one-bit-wide inputs and generated wires in this instance, these expressions can seamlessly adapt to wider wires, with [Chisel](https://www.chisel-lang.org/)'s logic operators inherently operating in a "bitwise" manner. Furthermore, [Chisel](https://www.chisel-lang.org/) boasts a powerful wire width inference mechanism, enhancing its versatility and simplifying its use in designing complex logic circuits.
In the preceding example, the named wires `a` and `b` offer the advantage of reusability across multiple locations within the circuit. Likewise, it is possible to assign a name to the output of the circuit:
```scala
// simple logic expression
val out = (a & ~b) | (~a & b)
```
The keyword `val` is derived from Scala and serves as a means to declare a program variable that can only be assigned once, essentially creating a constant. This approach allows for the generation of a single output value at a specific location within the circuit and subsequently distributes it to other locations where the same output is required.
```scala
// fan-out
val z = (a & out) | (out & b)
```
Assigning names to wires and employing fanout mechanisms provide a means to reuse a single output value in multiple locations within the generated circuit.
Function abstraction enables us to reuse a circuit description effectively. Here is an example of a simple logic function:
```scala
def XOR(a: UInt, b: UInt): UInt = (a & ~b) | (~a & b)
```
In this case, the function inputs and output are specified with the type `UInt`, because Chisel 3 defines the bitwise operators `&` and `|` on `UInt` and `SInt` rather than on the abstract `Bits` type. We will delve deeper into types shortly. When we use the XOR function elsewhere, we essentially create a duplicate of the corresponding logic. You can think of the function as a "constructor" for this logic.
```scala
// Constructing multiple copies
val z = (x & XOR(x, y)) | (XOR(x, y) & y)
```
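As a quick sanity check, the gate-level expression can be compared against Scala's built-in `^` operator. The sketch below is a plain-Scala software model on `Int` values, not Chisel hardware, and the names `XorModel`, `xorGates`, and `agreesWithBuiltin` are chosen here for illustration only:

```scala
// Plain-Scala software model (no Chisel dependency). The object and method
// names are illustrative and do not come from the lab repository.
object XorModel {
  // Same expression as the Chisel example, applied bitwise to Scala Ints.
  def xorGates(a: Int, b: Int): Int = (a & ~b) | (~a & b)

  // Exhaustively compare against the built-in XOR for all `width`-bit inputs.
  def agreesWithBuiltin(width: Int): Boolean = {
    val values = 0 until (1 << width)
    values.forall(a => values.forall(b => xorGates(a, b) == (a ^ b)))
  }
}
```

Calling `XorModel.agreesWithBuiltin(4)` checks all 256 combinations of two 4-bit inputs, which also illustrates that the expression works bitwise on wires wider than one bit.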
### Basic Types
Chisel datatypes are employed to define the type of values that reside in state elements or flow through wires. While hardware circuits fundamentally manipulate vectors of binary digits, employing more abstract representations for values enables clearer specifications and aids in the generation of more efficient circuits by the tools.
The basic types:
| Types | meaning |
|:-----:|:--------|
| `Bits` | Raw collection of bits |
| `SInt` | Signed integer number |
| `UInt` | Unsigned integer number |
| `Bool` | Boolean |
All signed numbers are represented using 2's complement in Chisel. Chisel also provides support for several higher-order types, including `Bundles` and `Vecs`.
- [ ] Unsigned and Signed Integers
The unsigned integer type is `UInt`, and the signed integer type is `SInt`. When declaring integers, you can specify the bit width using `.W`, for example, `UInt(8.W)`. If you want to convert an integer from Scala to Chisel's hardware integer, you can use the `.U` operator, for example, `12.U`. The `.U` operator also supports specifying the bit width, for example, `12.U(8.W)`.
- [ ] Booleans
The boolean type is `Bool`, and you can use `.B` to convert Scala boolean values to hardware boolean values, for example, `true.B`.
### Type Inference
While it's useful to keep track of wire types, Scala's type inference allows you to omit type declarations when they are not necessary. For example, in our previous example:
```scala
// simple logic expression
val out = (a & ~b) | (~a & b)
```
the type of `out` was deduced based on the types of `a` and `b` as well as the operators used. If you want to ensure type clarity or if there is insufficient context for the inference engine, you can always explicitly specify the type like this:
```scala
// simple logic expression
val out: Bits = (a & ~b) | (~a & b)
```
Additionally, as we will explore later, explicit type declaration becomes necessary in certain situations.
### Bundles and Vecs
Chisel Bundles are used to represent groups of wires that have named fields. They function in a manner similar to C's `struct`. In Chisel, Bundles are defined as a class, which is analogous to how classes are defined in languages like C++ and Java.
```scala
class FIFOInput extends Bundle {
  val rdy  = Output(Bool())     // Indicates if FIFO has space
  val data = Input(UInt(32.W))  // The value to be enqueued
  val enq  = Input(Bool())      // Assert to enqueue data
}
```
Chisel provides class methods for Bundles, which means that user-created bundles should "extend" the Bundle class to benefit from automatic connection creation (more on this later). In a Bundle, each field is assigned a name and defined using a constructor of the appropriate type, along with parameters that specify its width and direction. This allows you to create instances of FIFOInput, for example.
```scala
val jonsIO = new FIFOInput
```
You can create nested Bundle definitions and construct hierarchies with them. These Bundles are typically used to define the interface of modules. The `Flipped` wrapper (the "flip" operator in older Chisel releases) creates an "opposite" Bundle in which every direction is reversed.
```scala
class MyFloat extends Bundle {
  val sign        = Bool()
  val exponent    = UInt(8.W)
  val significand = UInt(23.W)
}
val x  = new MyFloat()
val xs = x.sign
class BigBundle extends Bundle {
  val myVec = Vec(5, SInt(23.W)) // Vector of 5 23-bit signed integers.
  val flag  = Bool()
  val f     = new MyFloat()      // Previously defined bundle.
}
```
Bundle and Vec are class types used for organizing and grouping other types. The Vec class, in particular, represents an array of objects of the same type that can be indexed:
```scala
val myVec = Vec(5, SInt(23.W)) // Vec of 5 23-bit signed integers.
val third = myVec(2)           // Name the third element (index 2)
```
Note: Vec is not a memory array; it functions as a collection of wires (or registers).
Both Vec and Bundle inherit from the class Data. Any object that ultimately inherits from Data can be represented as a bit vector in a hardware design.
### Literals
Literals are values that you specify directly in your source code. Chisel provides conversion methods (`.U`, `.S`, and `.B`) that turn Scala values into hardware literals.
```scala
"ha".U     // hexadecimal 4-bit literal of type UInt
"o12".U    // octal 4-bit literal of type UInt
"b1010".U  // binary 4-bit literal of type UInt
5.S        // signed decimal 4-bit literal of type SInt
-8.S       // negative decimal 4-bit literal of type SInt
5.U        // unsigned decimal 3-bit literal of type UInt
true.B     // literals for type Bool, from Scala boolean literals
false.B
```
By default, Chisel will determine the width of your literal to be the minimum necessary. Alternatively, you can pass an explicit width to the conversion method if needed.
```scala
"ha".U(8.W)  // hexadecimal 8-bit literal of type UInt, 0-extended
-5.S(32.W)   // 32-bit decimal literal of type SInt, sign-extended
5.U(8.W)     // 8-bit decimal literal of type UInt, 0-extended
```
An error will be reported if the specified width value is insufficient.
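The minimum-width rule can be reproduced in plain Scala to check the bit counts quoted in the literal examples above. This is only a sketch of the arithmetic, not Chisel's own implementation; `LiteralWidth` and its methods are hypothetical names, and the treatment of zero as one bit wide is an assumption:

```scala
// Plain-Scala sketch of the "minimum necessary width" arithmetic.
// Names are illustrative; this is not Chisel library code.
object LiteralWidth {
  // Minimum bits for an unsigned literal.
  // Assumption: zero still occupies one bit.
  def unsignedWidth(v: BigInt): Int = {
    require(v >= 0, "unsigned literals cannot be negative")
    math.max(1, v.bitLength)
  }

  // Minimum bits for a two's-complement literal:
  // the magnitude bits reported by bitLength plus one sign bit.
  def signedWidth(v: BigInt): Int = v.bitLength + 1
}
```

For example, `LiteralWidth.unsignedWidth(5)` gives 3, while `LiteralWidth.signedWidth(5)` and `LiteralWidth.signedWidth(-8)` both give 4, matching the comments in the literal examples.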
### Ports
A port refers to any Data object whose members have directions assigned to them. The `Input` and `Output` wrappers assign a direction when a field is declared:
```scala
class FIFOInput extends Bundle {
  val rdy  = Output(Bool())
  val data = Input(UInt(32.W))
  val enq  = Input(Bool())
}
```
The direction of an entire object can also be set when it is instantiated:
```scala
class ScaleIO extends Bundle {
  val in    = Input(new MyFloat())
  val scale = Input(new MyFloat())
  val out   = Output(new MyFloat())
}
```
Wrapping a Bundle in `Input` or `Output` sets all of its components to the specified direction. The `Flipped` wrapper reverses the direction of every component instead.
### Modules
Modules are employed to establish hierarchy within the generated circuit, resembling modules in Verilog. Each module defines a port interface and interconnects subcircuits. These module definitions are essentially class definitions that extend the Chisel Module class.
```scala
class Mux2 extends Module {
  val io = IO(new Bundle {
    val select = Input(UInt(1.W))
    val in0    = Input(UInt(1.W))
    val in1    = Input(UInt(1.W))
    val out    = Output(UInt(1.W))
  })
  io.out := (io.select & io.in1) | (~io.select & io.in0)
}
```

The Module slot `io` is utilized to store the interface definition, which is of type Bundle. In the above example, `io` is assigned an anonymous Bundle. The `:=` operator then connects the signal on its left-hand side to the output of the circuit on its right-hand side.
Each module implicitly includes clock and reset signals in addition to the declared I/O ports.
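Wires carry combinational values. `Wire` creates a wire of the given type, and `WireInit` creates one with a default value: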
```scala
val wire = Wire(UInt(8.W))
val wireinit = WireInit(0.U(8.W))
```
The most basic type of state element that Chisel supports is a positive-edge-triggered register, which can be functionally instantiated as follows:
```scala
val reg = Reg(UInt(8.W))
val reginit = RegInit(0.U(8.W))
```
`Reg(UInt(8.W))` declares an 8-bit register without a reset value, while `RegInit(0.U(8.W))` declares one that resets to 0. A register created with `RegNext(in)` produces an output that replicates its input signal with a one-clock-cycle delay; in that form we do not need to specify the register's type explicitly, because Chisel infers it from the input. Within a module, the clock and reset signals are implicit and are connected where necessary.
### Hello World in Chisel
```scala
class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}
```
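The counter logic in `Hello` can be traced with a small plain-Scala model. This is a behavioral sketch rather than Chisel code; `BlinkModel`, `State`, `step`, and `run` are names invented for this illustration. Assuming a 50 MHz clock (the source does not state the clock frequency), `CNT_MAX` is 24,999,999, so the LED would toggle every 25,000,000 cycles, that is, every half second:

```scala
// Plain-Scala behavioral model of the Hello counter (no Chisel dependency).
// All names here are illustrative.
object BlinkModel {
  final case class State(cnt: Long, led: Boolean)

  // One rising clock edge of the Hello module for a given counter limit.
  def step(s: State, cntMax: Long): State =
    if (s.cnt == cntMax) State(0L, !s.led) // wrap and toggle, as in the when block
    else State(s.cnt + 1, s.led)           // default assignment: keep counting

  // Apply `cycles` clock edges starting from the reset state (cnt = 0, LED off).
  def run(cycles: Int, cntMax: Long): State =
    (0 until cycles).foldLeft(State(0L, false))((s, _) => step(s, cntMax))
}
```

With a small limit such as `cntMax = 3`, `BlinkModel.run(4, 3)` returns `State(0, true)`: the counter visits 0, 1, 2, 3 and the LED flips on the edge where the counter wraps. The model relies on the Chisel rule that the last connection to `cntReg` wins when the `when` condition holds.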
---
## [Chisel](https://www.chisel-lang.org/) Tutorial
Getting the Repository:
```shell
$ git clone https://github.com/ucb-bar/chisel-tutorial
$ cd chisel-tutorial
$ git checkout release
```
Before testing your system, ensure that you have sbt (the Scala build tool) installed.
```shell
$ sbt run
```
When running for the first time, an internet connection is needed to automatically download necessary components. Reference output:
```
[info] Set current project to chisel-tutorial (in build file:/tmp/chisel-tutorial/)
[info] Updating
https://repo1.maven.org/maven2/edu/berkeley/cs/chisel-iotesters_2.12/maven-metadata.xml
100.0% [##########] 2.1 KiB (8.7 KiB / s)
https://repo1.maven.org/maven2/edu/berkeley/cs/chisel-iotesters_2.12/1.4.1/chisel-iotesters_2.12-1.4.1.pom
100.0% [##########] 2.9 KiB (16.1 KiB / s)
...
https://repo1.maven.org/maven2/edu/berkeley/cs/firrtl-interpreter_2.12/1.3.1/firrtl-interpreter_2.12-1.3.1.jar
100.0% [##########] 444.2 KiB (1.5 MiB / s)
[info] Fetched artifacts of
[info] Compiling 56 Scala sources to /tmp/chisel-tutorial/target/scala-2.12/classes ...
https://repo1.maven.org/maven2/org/scala-sbt/util-interface/1.3.0/util-interface-1.3.0.pom
100.0% [##########] 2.7 KiB (16.4 KiB / s)
[info] Non-compiled module 'compiler-bridge_2.12' for Scala 2.12.10. Compiling...
```
This will generate and test a basic block called `Hello` that consistently produces the number `42` (represented as `0x2a`). You should observe `[success]` in the final line of the output (from `sbt`) and `PASSED` on the preceding line, indicating that the block has successfully passed the test case.
Reference output:
```
test Hello Success: 1 tests passed in 6 cycles taking 0.004980 seconds
[info] [0.002] RAN 1 CYCLES PASSED
[success] Total time: 2 s, completed
```
> source code: `src/main/scala/hello/Hello.scala`
Then, you can run the examples:
```shell
$ ./run-examples.sh FullAdder
$ ./run-examples.sh Adder4
```
> source code: `src/main/scala/examples/FullAdder.scala`, `src/main/scala/examples/Adder4.scala`
```shell
$ ./run-examples.sh SimpleALU
```
> source code: `src/main/scala/examples/SimpleALU.scala`
Alternatively, you can go through all examples:
```shell
$ ./run-examples.sh all
```
Reference: [Digital Design with Chisel](https://github.com/schoeberl/chisel-book)
## [Chisel Bootcamp](https://github.com/freechipsproject/chisel-bootcamp)
Take your hardware design to the next level, transitioning from instances to generators! This bootcamp will introduce you to [Chisel](https://www.chisel-lang.org/), a hardware construction DSL developed at Berkeley and written in [Scala](https://www.scala-lang.org/). It will also provide you with a grasp of [Scala](https://www.scala-lang.org/) as you progress, and it will focus on teaching [Chisel](https://www.chisel-lang.org/) through the concept of hardware generators.
Watch the following videos (required):
* [From Chisel to Chips in Fully Open-Source](https://youtu.be/FenSOWKBbAw)
* [Chisel Introduction](https://youtu.be/OhMuPQcyynY)
* [Chisel Best Practices Intensive](https://youtu.be/e1HRwrNhZhw)
[Learn Chisel online!](https://mybinder.org/v2/gh/sysprog21/chisel-bootcamp/HEAD)
+ Please run each cell block by pressing SHIFT+ENTER on your keyboard.
+ Caution: You might encounter unforeseen connectivity issues during the exercises. Therefore, it is advisable to install the required packages on your personal computer.

### Method 1: Using Docker or nerdctl
:::warning
:warning: The prebuilt Docker image `ucbbar/chisel-bootcamp` mentioned in [Local Installation - Mac/Linux](https://github.com/freechipsproject/chisel-bootcamp/blob/master/Install.md) only works on x86-64 machines. Therefore, we have rebuilt the Docker image as `sysprog21/chisel-bootcamp` to address known issues and support both x86-64 and Arm64.
:::
Make sure you have Docker [installed](https://docs.docker.com/get-docker/) on your system.
* For Ubuntu Linux users, read [Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/) carefully.
> Alternatively, use [nerdctl](https://github.com/containerd/nerdctl).
* For macOS users, if you prefer not to install Docker Desktop, which can be slow and resource-intensive, you can install the [lima](https://github.com/lima-vm/lima) package. Lima provides access to a full Linux system and comes with built-in integration for [nerdctl](https://github.com/containerd/nerdctl), offering Docker-compatible commands. Set it up with the following commands:
```shell
$ brew install lima
$ limactl start
```
Run the following command: (for Linux and macOS + Docker Desktop)
```shell
$ docker run -it --rm -p 8888:8888 sysprog21/chisel-bootcamp
```
For macOS users who have already installed [lima](https://github.com/lima-vm/lima) and run `limactl start`, run the following command instead:
```
$ lima nerdctl run -it --rm -p 127.0.0.1:8888:8888 sysprog21/chisel-bootcamp
```
This will download a Docker image for the bootcamp and run it. The output will end with the following message:
```
To access the notebook, open this file in a browser:
file:///home/bootcamp/.local/share/jupyter/runtime/nbserver-6-open.html
Or copy and paste one of these URLs:
http://79b8df8411f2:8888/?token=LONG_RANDOM_TOKEN
or http://127.0.0.1:8888/?token=LONG_RANDOM_TOKEN
```
Next, you can copy the URL provided above, which starts with `http://127.0.0.1:8888/?`, and paste it into your web browser to access the [Jupyter Notebook](https://jupyter.org/)-based interactive computing platform.
### Method 2: Install packages manually
See [Local Installation - Mac/Linux](https://github.com/freechipsproject/chisel-bootcamp/blob/master/Install.md)
:::warning
:warning: You might suffer from unexpected problems with this method. Record and write down the issues accordingly.
:::
### Learning Chisel by doing!
:information_source: You should go through the following and complete the exercises:
- 1_intro_to_scala
- 2.1_first_module
- 2.2_comb_logic
- 2.3_control_flow
- 2.4_sequential_logic
- 2.5_exercise
- 2.6_chiseltest
- 3.1_parameters
- 3.2_collections
- 3.2_interlude
- 3.3_higher-order_functions
- 3.4_functional_programming
- 3.5_object_oriented_programming
- 3.6_types
Take the following example of a 3-point moving average implemented in the style of a FIR filter.

```scala
// 3-point moving average implemented in the style of a FIR filter
class MovingAverage3(bitWidth: Int) extends Module {
  val io = IO(new Bundle {
    val in  = Input(UInt(bitWidth.W))
    val out = Output(UInt(bitWidth.W))
  })
  val z1 = RegNext(io.in) // Create a register whose input is connected to the argument io.in
  val z2 = RegNext(z1)    // Create a register whose input is connected to the argument z1
  io.out := (io.in * 1.U) + (z1 * 1.U) + (z2 * 1.U) // `1.U` is an unsigned literal with value 1
}
```
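A plain-Scala reference model helps when checking the expected outputs of this module by hand. The sketch below is not part of the bootcamp; `MovingAverage3Model` is a hypothetical name, the registers are assumed to start at 0, and the result is assumed to wrap to `bitWidth` bits because `io.out` is that wide. Note that with all coefficients equal to 1 and no division, the circuit outputs the sum of the last three samples:

```scala
// Plain-Scala reference model for MovingAverage3 (no Chisel dependency).
// Assumptions: z1 and z2 start at 0, and the sum wraps to bitWidth bits.
object MovingAverage3Model {
  // Returns one output per input sample, in order.
  def run(inputs: Seq[Int], bitWidth: Int): Seq[Int] = {
    val mask = (1L << bitWidth) - 1
    var z1 = 0L // models val z1 = RegNext(io.in)
    var z2 = 0L // models val z2 = RegNext(z1)
    inputs.toList.map { in =>
      val out = (in + z1 + z2) & mask // combinational sum seen in this cycle
      z2 = z1                         // both registers update on the clock edge
      z1 = in.toLong
      out.toInt
    }
  }
}
```

For instance, `MovingAverage3Model.run(Seq(1, 2, 3, 4), 8)` yields `List(1, 3, 6, 9)`, which can serve as the expected values in a test bench.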
After defining `class MovingAverage3`, let's instantiate it and take a look at its structure, rendered as an SVG circuit diagram with the following elements:

- Green circular nodes for inputs (clock, reset)
- Red double-circle for output (io)
- Yellow rectangular nodes for registers (z1, z2)
- Proper dataflow arrows showing connections
- Color-coded clusters (Inputs, Outputs, Registers)
---
## RISC-V CPU
A single-cycle CPU completes the execution of one instruction within a single clock cycle. Because the clock period must accommodate the slowest instruction, all instructions take the same amount of time to execute. This leads to poor performance and inefficient hardware utilization. The single-cycle design, however, is conceptually simple: only one instruction is active at any given time, so there are no data or control conflicts to handle.
This experiment begins with building such a single-cycle processor to help you understand the fundamental components of a CPU and the flow of instruction execution. You will incrementally design the data path and the control unit, observe how instructions move through fetch, decode, and execute phases, and complete the required coding exercises to construct your own simple RISC-V processor, named `MyCPU`.
After the single-cycle version is complete, MyCPU will be extended into a pipelined implementation, first with a 3-stage pipeline that combines instruction fetch, decode, and execute/memory operations, and later with a 5-stage pipeline consisting of instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and write-back (WB). Through this progression, you will learn how instruction-level parallelism improves throughput, how hazards occur when multiple instructions overlap, and how mechanisms such as forwarding, stalling, and branch prediction can be used to resolve these challenges.
By the end of this exercise, you will have implemented a functional RISC-V CPU and gained a deeper understanding of how pipelining enhances processor performance.
### Data Path
The path through which data moves between functional units is known as the data path. The elements situated along this route that perform operations on or hold data are called data path components, which include the ALU (Arithmetic-Logic Unit), general-purpose registers, memory, and so on. The data path illustrates the various pathways through which data transitions from one component to another.

> The processor employs synchronous logic design with a clock.
### Control Signals

As the name implies, control signals govern the data path. Whenever a choice must be made, the control unit is responsible for making the correct decision and dispatching control signals to the relevant data path components. For instance, should the ALU perform addition or subtraction? Are we reading from or writing to memory?
So, how does the controller determine what action to take? This primarily relies on the instruction currently in execution. In the RISC-V instruction format, the controller discerns the appropriate decisions by examining the opcode, funct3, and funct7 fields of the instruction, thus emitting the correct control signals. In the CPU schematic, the Decoder, ALUControl, and JumpJudge components can all be regarded as constituents of the control unit. They receive instructions and generate control signals. These control signals serve as guides within the data path to ensure the precise execution of instructions.
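To make the field examination concrete, the sketch below slices a 32-bit instruction word into the standard RV32I field positions in plain Scala. It is a software illustration, not the lab's Decoder; `RV32Fields`, `Fields`, `decode`, and `isSub` are hypothetical names, and `isSub` covers only the R-type add/sub distinction mentioned above:

```scala
// Plain-Scala illustration of RV32I field extraction (no Chisel dependency).
// Names are illustrative and do not come from the lab repository.
object RV32Fields {
  final case class Fields(opcode: Int, rd: Int, funct3: Int, rs1: Int, rs2: Int, funct7: Int)

  // Slice a 32-bit instruction word into its fixed field positions.
  def decode(inst: Int): Fields = Fields(
    opcode = inst & 0x7f,          // bits  6..0
    rd     = (inst >>> 7) & 0x1f,  // bits 11..7
    funct3 = (inst >>> 12) & 0x7,  // bits 14..12
    rs1    = (inst >>> 15) & 0x1f, // bits 19..15
    rs2    = (inst >>> 20) & 0x1f, // bits 24..20
    funct7 = (inst >>> 25) & 0x7f  // bits 31..25
  )

  // R-type add and sub share opcode 0x33 and funct3 0; funct7 tells them apart.
  def isSub(f: Fields): Boolean =
    f.opcode == 0x33 && f.funct3 == 0 && f.funct7 == 0x20
}
```

Decoding `0x002081b3` (`add x3, x1, x2`) and `0x402081b3` (`sub x3, x1, x2`) gives identical fields except for `funct7`, which is exactly the information a control unit needs to choose between addition and subtraction.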
### Combinational Units and State Units
In digital circuits, there are two main types of circuits: combinational logic and sequential logic. In CPU design, units composed of these two types of circuits are called combinational units and state units, respectively. In this experiment, only the registers belong to the sequential logic (memory is not within the scope of the CPU core), while the rest are combinational units.
- [ ] Combinational Logic
The output depends only on the current input and does not require a clock as a triggering condition. Inputs are reflected immediately in the outputs (ignoring delay).

Combinational logic does its work between clock edges:
1. It takes its inputs from state elements.
2. It computes its outputs and delivers them to state elements before the next clock edge.
3. The longest delay through this logic determines the minimum clock period.
- [ ] Sequential (state) Logic
These units store state and use the clock as a triggering condition. Inputs are reflected in the outputs when the clock's rising edge arrives.
Register with write control:
* Only updates on the clock edge when the write control input is 1.
* Used when the stored value is required later
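
The behavior of a register with write control can be sketched as a plain-Scala model in which each element of the input sequence represents one clock cycle. `EnableRegisterModel`, `nextValue`, and `trace` are names invented for this illustration:

```scala
// Plain-Scala model of a register with write control (no Chisel dependency).
// Names are illustrative.
object EnableRegisterModel {
  // Value stored after a rising clock edge.
  def nextValue(current: Int, data: Int, writeEnable: Boolean): Int =
    if (writeEnable) data else current

  // Feed one (data, writeEnable) pair per clock cycle and return the value
  // held by the register after each edge.
  def trace(init: Int, inputs: Seq[(Int, Boolean)]): Seq[Int] =
    inputs.scanLeft(init) { case (q, (d, en)) => nextValue(q, d, en) }.tail
}
```

In `EnableRegisterModel.trace(0, Seq((5, false), (7, true), (9, false)))` the register ignores 5 and 9 because the write control is low, and keeps 7 once it has been written.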

### Implementation
Get the repository:
```shell
$ git clone https://github.com/sysprog21/ca2025-mycpu
$ cd ca2025-mycpu
```
> :warning: Please be aware that the Scala code in this repository is not entirely complete, as the instructor has omitted certain sections for students to work on independently. Only `0-minimal` is complete.
`0-minimal` is a minimal single-cycle RISC-V CPU designed specifically to execute `jit.asmbin` (JIT self-modifying code demonstration), as described in [Quiz 3](https://hackmd.io/@sysprog/arch2025-quiz3-sol). It implements only the 5 instructions required by this program, making it an excellent educational example of how to build a focused, minimal processor.
The CPU supports exactly these RISC-V instructions:
1. `AUIPC` (Add Upper Immediate to PC) - for PC-relative addressing
2. `ADDI` (Add Immediate) - for arithmetic and register initialization
3. `LW` (Load Word) - word-aligned memory reads only
4. `SW` (Store Word) - word-aligned memory writes only
5. `JALR` (Jump and Link Register) - for function calls and returns
Note: `ECALL` is not required for this minimal CPU. Test verification reads registers directly via debug interface.
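Three of these five instructions (`ADDI`, `LW`, and `JALR`) share the I-type format, so one packing routine covers them. The following plain-Scala sketch encodes I-type instructions from their fields; `ITypeEncoder` and its methods are hypothetical helpers, and the example instructions are illustrative rather than taken from `jit.asmbin`:

```scala
// Plain-Scala sketch of RV32I I-type encoding (no Chisel dependency).
// Names are illustrative and do not come from the lab repository.
object ITypeEncoder {
  // Pack an I-type instruction: imm[11:0] | rs1 | funct3 | rd | opcode.
  def encode(imm: Int, rs1: Int, funct3: Int, rd: Int, opcode: Int): Int = {
    require(imm >= -2048 && imm <= 2047, "I-type immediates are 12-bit signed")
    ((imm & 0xfff) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode
  }

  def addi(rd: Int, rs1: Int, imm: Int): Int = encode(imm, rs1, 0x0, rd, 0x13)
  def lw(rd: Int, rs1: Int, imm: Int): Int   = encode(imm, rs1, 0x2, rd, 0x03)
  def jalr(rd: Int, rs1: Int, imm: Int): Int = encode(imm, rs1, 0x0, rd, 0x67)
}
```

For example, `ITypeEncoder.addi(10, 0, 42)` produces `0x02a00513`, the encoding of `addi a0, x0, 42`, and `ITypeEncoder.addi(0, 0, 0)` produces `0x00000013`, the canonical NOP.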
The `jit.asmbin` binary demonstrates RISC-V self-modifying code and JIT (Just-In-Time) compilation concepts:
1. Instructions as Data: Stores compiled instructions as hex data (like a JIT compiler output)
2. Runtime Code Generation: Copies these instructions to an executable code buffer
3. Dynamic Execution: Jumps to the copied code and executes it
4. Result Verification: Returns with register a0 = 42
To simulate and run tests for this project, execute the following commands under the `ca2025-mycpu/0-minimal` directory.
```shell
$ sbt test
```
> Alternatively, run `make`.
If you have successfully filled in the implementation with your own efforts, you should encounter the following messages:
```
[info] JITTest:
[info] Minimal CPU - JIT Test
[info] - should correctly execute jit.asmbin and set a0 to 42
[info] All tests passed.
```
Simulation:
```bash
# Run Verilator simulation
make sim
```
The included Python script analyzes Verilator simulation traces to verify correct CPU behavior:
- JIT Code Execution: Confirms PC reaches and executes from JIT code buffer (0x102c)
- Execution Duration: Tracks cycle count spent executing dynamically generated code
- Memory Layout: Validates expected address layout from jit.S assembly
Example output:
```
Parsed 74 signals
======================================================================
VCD Trace Analysis Report - 0-minimal RISC-V CPU
======================================================================
Overall Status: [PASS]
Key Findings:
[OK] JIT Code Execution: 499978 cycles at buffer address (0x102c)
[NO] Register a0 = 42: False
[NO] Memory Writes: 0 total writes
Detailed Statistics:
PC Samples: 500000
Max PC Address: 0x00001030
Register Writes: 0
Writes to a0 (x10): 0
Expected Memory Layout:
Entry Point: 0x00001000
JIT Code Buffer: 0x0000102c
JIT Instructions: 0x00001034
Interpretation:
[OK] CPU successfully executed JIT self-modifying code
[OK] PC spent 499978 cycles executing from buffer
[OK] JIT code execution flow verified
Note: Internal signals (register writes, memory writes)
are not exported to VCD in this minimal CPU design.
ChiselTest validates a0=42 via debug interface.
```
Alternatively, use [Surfer](https://surfer-project.org/) to view the waveform.
```
# View waveform
surfer trace.vcd
```

:::warning
:information_source: NOTICE
* You must make sure that you fully understand and verify `0-minimal` carefully before proceeding to the following sections.
* Comments marked with `CA25: Exercise` in the source indicate TODOs that must be completed as part of the exercises.
:::
## Single-cycle CPU
> Directory: `1-single-cycle`
CPU architecture diagram
- [ ] Full

- [ ] Simplified

### Overview of Single-Cycle CPU Implementation
The RISC-V CPU we are designing can execute a core subset of RISC-V instructions (RV32I):
1. Arithmetic and Logic Instructions: `add`, `sub`, `slt`, etc.
2. Memory Access Instructions: `lb`, `lw`, `sb`, etc.
3. Branch and Jump Instructions: `beq`, `jal`, etc.
We will divide the instruction execution into five different stages:
1. Instruction Fetch: Fetching the instruction data from memory.
2. Decode: Understanding the meaning of the instruction and reading register data.
3. Execute: Calculating the result using the ALU.
4. Memory Access (load/store instructions): Reading from and writing to memory.
5. Write-back (for instructions that produce a register result, so not stores or conditional branches): Writing the result back to registers.

Now, let's build data path components step by step according to the above stages, and then instantiate and connect these data path components in the CPU's top-level module. (The code related to this is located in the `src/main/scala/riscv` directory.)
### Instruction Fetch
> Code can be found in `src/main/scala/riscv/core/InstructionFetch.scala`.

What the instruction fetch stage does:
* Fetch the instruction from memory based on the current address in the PC register.
* Modify the value of the PC register to point to the next instruction.
```scala
val pc = RegInit(ProgramCounter.EntryAddress)
when(io.instruction_valid) {
io.instruction := io.instruction_read_data
// lab3(InstructionDecode)
// ...
```
First, the value of the PC (program counter) register is initialized to the entry address of the program. When an instruction is valid, the current instruction pointed to by the PC is fetched. If a jump is required, the PC is directed to the jump address; otherwise, it is incremented to `PC + 4`.
The Instruction Fetch stage (`src/main/scala/riscv/core/InstructionFetch.scala`) is responsible for:
1. Managing the program counter (PC) register
2. Fetching instructions from memory
3. Handling control flow changes (branches and jumps)
4. Implementing pipeline stalls when memory is not ready
The missing implementation is the program counter (pc) update logic, which must select between two values:
| Condition | PC Update Rule | Description |
|----------------------|--------------------------|------------------------------------------------|
| Sequential execution | pc := pc + 4.U | Increment by 4 bytes (RV32I instruction width) |
| Control flow change | pc := io.jump_address_id | Jump to target address |
Selection condition: The input signal io.jump_flag_id determines whether to branch:
* `io.jump_flag_id` = true → Branch taken, use io.jump_address_id
* `io.jump_flag_id` = false → Sequential execution, use pc + 4
Complete IF Stage Behavior
- [ ] 1. Valid Instruction Path (io.instruction_valid = true)
When memory provides a valid instruction:
```scala
when(io.instruction_valid) {
io.instruction := io.instruction_read_data // Output fetched instruction
pc := Mux(io.jump_flag_id, io.jump_address_id, pc + 4.U) // Update PC
}
```
* Instruction forwarding: io.instruction_read_data (from memory) → io.instruction (output)
* PC update: Based on io.jump_flag_id condition (Exercise 14 fill-in-the-blank)
- [ ] 2. Invalid Instruction Path (io.instruction_valid = false)
When memory is not ready (stall condition):
```scala
.otherwise {
pc := pc // Hold PC (stall)
io.instruction := 0x00000013.U // Insert NOP (ADDI x0, x0, 0)
}
```
* Pipeline stall: PC freezes, waiting for valid instruction
* NOP insertion: Prevents illegal instruction execution, allows pipeline to continue safely
- [ ] 3. Instruction Address Output
```scala
io.instruction_address := pc // Always output current PC value
```
The output signal io.instruction_address is unconditionally assigned the current PC value, independent of `io.instruction_valid`.
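For readers more comfortable with C than Chisel, the three rules above can be sketched as a small software model. This is only an illustration of the behavior described in this section, not the lab solution: the names `fetch_step` and `fetch_result_t` are hypothetical and do not appear in the project.

```c
#include <stdbool.h>
#include <stdint.h>

#define NOP_INSTRUCTION 0x00000013u /* addi x0, x0, 0 */

/* What the fetch stage produces in one clock cycle (hypothetical type). */
typedef struct {
    uint32_t next_pc;     /* value the PC register takes at the next clock edge */
    uint32_t instruction; /* instruction handed to the decode stage */
} fetch_result_t;

/* Software model of the IF stage rules described above. */
fetch_result_t fetch_step(uint32_t pc, bool instruction_valid,
                          uint32_t instruction_read_data,
                          bool jump_flag, uint32_t jump_address)
{
    fetch_result_t r;
    if (instruction_valid) {
        r.instruction = instruction_read_data;         /* forward fetched word */
        r.next_pc = jump_flag ? jump_address : pc + 4; /* same choice as the Mux */
    } else {
        r.instruction = NOP_INSTRUCTION; /* stall: feed a NOP downstream */
        r.next_pc = pc;                  /* stall: hold the PC */
    }
    return r;
}
```

Calling `fetch_step(0x1000, true, inst, false, 0)` yields a next PC of `0x1004`, while the same call with `instruction_valid` set to `false` keeps the PC at `0x1000` and outputs the NOP encoding.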
Waveform Analysis
- [ ] Waveform 1: Reset and Initial Stall (0-3 ps)

Timeline breakdown:
| Time | Event | PC Value | Explanation |
|------|-------|-------------|-------------|
| 0-2 ps | Reset active | pc = 0x1000 | PC initializes to entry address (ProgramCounter.EntryAddress) |
| 0-2 ps | io.instruction_valid = 0 | pc = 0x1000 | Stall: pc := pc (hold current value) |
| 0-2 ps | Setup/Hold time | — | Input signals (io.jump_flag_id, io.jump_address_id, etc.) remain stable before and after clock edge to prevent metastability |
| 2 ps | Clock falling edge | — | Input signals may change at falling edge (non-triggering edge) |
| 3 ps | io.instruction_valid = 1<br>io.jump_flag_id = 1 | pc = 0x1000 | pc := io.jump_address_id = 0x1000 (branch to same address, appears as stall) |
Key observation: Although io.jump_flag_id is set, branching from 0x1000 to 0x1000 visually appears identical to a stall.
- [ ] Waveform 2: Sequential Execution (No Branch)

Condition: `io.jump_flag_id` = 0 (no control flow change)
```scala
// PC update when no jump is taken
pc := pc + 4.U
```
Result: Program counter increments by 4 bytes, fetching the next sequential instruction.
Example: If current pc = 0x1000, next pc = 0x1004.
- [ ] Waveform 3: Control Flow Change (Branch Taken)

Condition: `io.jump_flag_id` = 1 (branch/jump taken)
```scala
// PC update when a jump is taken (in this waveform, io.jump_address_id = 0x1000)
pc := io.jump_address_id
```
Result: Program counter jumps to the target address specified by the Execute stage.
Instruction types triggering this:
- JAL (Jump and Link)
- JALR (Jump and Link Register)
- Taken branches (BEQ, BNE, BLT, BGE, BLTU, BGEU)
The waveform analysis confirms the IF stage operates correctly according to the design specification:
* Reset behavior: PC initializes to entry address (0x1000)
* Stall mechanism: PC holds current value when `io.instruction_valid` = 0
* Sequential execution: PC increments by 4 when `io.jump_flag_id` = 0
* Control flow: PC jumps to target when `io.jump_flag_id` = 1
* Output consistency: `io.instruction_address` always reflects current PC value
Note: The observation shows `io.instruction` = 0 because the instruction memory is empty in this test scenario. In actual execution with loaded programs, `io.instruction` would contain valid RV32I instructions fetched from memory.
### Instruction Decode
> The code is located in `src/main/scala/riscv/core/InstructionDecode.scala`.
What the decode stage does:
* Read the opcode to determine instruction type and field lengths
* Read in data from all necessary registers
- for `add`, read two registers
- for `addi`, read one register
- for `jal`, no reads are necessary
* Output control signals.

```scala
val rs1 = io.instruction(19, 15)
val rs2 = io.instruction(24, 20)
io.regs_reg1_read_address := Mux(opcode === Instructions.lui, 0.U(Parameters.PhysicalRegisterAddrWidth), rs1)
io.regs_reg2_read_address := rs2
```
The above code extracts the register operand numbers from the instruction. The read address of register 1 comes from the rs1~(19:15)~ field, except when the instruction is `lui`, in which case it is forced to register 0. The read address of register 2 always comes from the rs2~(24:20)~ field.
:::info
:information_source: Allocating a separate register for the constant 0 simplifies the RISC-V ISA, for example, allowing assignment instructions to be replaced with addition instructions using an operand of 0.
:::
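The same field extraction can be written with shifts and masks in C. The sketch below uses the standard RV32I field positions; the helper names (`get_rs1`, `reg1_read_address`, and so on) are made up for this example and are not part of the project.

```c
#include <stdint.h>

#define OPCODE_LUI 0x37u

/* Field positions follow the RV32I base instruction formats. */
uint32_t get_opcode(uint32_t inst) { return inst & 0x7Fu; }         /* bits 6:0   */
uint32_t get_rd(uint32_t inst)     { return (inst >> 7) & 0x1Fu; }  /* bits 11:7  */
uint32_t get_funct3(uint32_t inst) { return (inst >> 12) & 0x7u; }  /* bits 14:12 */
uint32_t get_rs1(uint32_t inst)    { return (inst >> 15) & 0x1Fu; } /* bits 19:15 */
uint32_t get_rs2(uint32_t inst)    { return (inst >> 20) & 0x1Fu; } /* bits 24:20 */

/* Mirrors the Mux above: lui reads x0 instead of whatever bits 19:15 hold. */
uint32_t reg1_read_address(uint32_t inst)
{
    return get_opcode(inst) == OPCODE_LUI ? 0u : get_rs1(inst);
}
```

For `add x3, x1, x2` (encoded as `0x002081B3`), these helpers return rs1 = 1, rs2 = 2, and rd = 3. For a `lui`, bits 19:15 belong to the immediate, which is why the read address is forced to 0.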
```scala
val immediate = MuxLookup(
opcode,
Cat(Fill(20, io.instruction(31)), io.instruction(31, 20)),
IndexedSeq(
InstructionTypes.I -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
InstructionTypes.L -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)),
```
The above code extracts immediate values. Since the position of immediate values varies for different instruction types, it is necessary to differentiate instruction types using the opcode and then extract the corresponding immediate value.
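To make the "position varies by instruction type" point concrete, here is a C sketch of immediate extraction for three of the RV32I formats (I, S, and B). The bit positions come from the RISC-V base ISA; the function names are hypothetical, and the sketch is not meant to replace the `MuxLookup` in the Chisel source.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of `value` (1 <= bits < 32) to 32 bits.
 * The final cast relies on two's-complement conversion, which holds on
 * every mainstream compiler. */
int32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t sign = 1u << (bits - 1);
    value &= (sign << 1) - 1u;
    return (int32_t)((value ^ sign) - sign);
}

/* I-type: imm[11:0] = inst[31:20] */
int32_t imm_i(uint32_t inst)
{
    return sign_extend(inst >> 20, 12);
}

/* S-type: imm[11:5] = inst[31:25], imm[4:0] = inst[11:7] */
int32_t imm_s(uint32_t inst)
{
    uint32_t raw = ((inst >> 25) << 5) | ((inst >> 7) & 0x1Fu);
    return sign_extend(raw, 12);
}

/* B-type: imm[12|10:5] = inst[31|30:25], imm[4:1|11] = inst[11:8|7] */
int32_t imm_b(uint32_t inst)
{
    uint32_t raw = (((inst >> 31) & 0x1u) << 12) | (((inst >> 7) & 0x1u) << 11) |
                   (((inst >> 25) & 0x3Fu) << 5) | (((inst >> 8) & 0xFu) << 1);
    return sign_extend(raw, 13);
}
```

For example, `imm_i` applied to `addi x1, x0, -1` (`0xFFF00093`) gives -1, and `imm_s` applied to `sw a0, 4(zero)` (`0x00A02223`) gives 4.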
```scala
object ALUOp1Source {
val Register = 0.U(1.W)
val InstructionAddress = 1.U(1.W)
}
// ...
io.ex_aluop1_source := Mux(
opcode === Instructions.auipc || opcode === InstructionTypes.B || opcode === Instructions.jal,
ALUOp1Source.InstructionAddress,
ALUOp1Source.Register
)
```
Take the `ex_aluop1_source` control signal as an example: it selects the first operand of the ALU and is assigned based on the opcode. When the instruction is `auipc`, `jal`, or a B-type branch, `ex_aluop1_source` is set to 1 (`ALUOp1Source.InstructionAddress`), so the ALU's first operand is the instruction address. In all other cases, it is set to 0 (`ALUOp1Source.Register`), so the ALU's first operand comes from a register.
As you can see, the design of the decoding unit is also simple combinational logic. Knowing the mapping relationship between control signals and instructions is sufficient to complete it. Next, please complete the code for assigning values to the four control signals: `ex_aluop2_source`, `io.memory_read_enable`, `io.memory_write_enable`, and `io.wb_reg_write_source`.
| control signals | meanings |
|:----------------------|:---------------------------------|
| `ex_aluop2_source` | ALU Input Source Selection |
| `memory_read_enable` | Memory Read Enable |
| `memory_write_enable` | Memory Write Enable |
| `wb_reg_write_source` | Write-Back Data Source Selection |
Memory Read Enable:
- If the decoded instruction is L-type (load instructions: lw, lh, lb, lhu, lbu), whose opcode is 0x03 (binary 0000011), the output flag memory_read_enable will be true/1.
- Otherwise, it will be false/0.
Memory Write Enable:
- If the decoded instruction is S-type (store instructions: sw, sh, sb), whose opcode is 0x23 (binary 0100011), the output flag memory_write_enable will be true/1.
- Otherwise, it will be false/0.

Waveform Observations:
1. When the instruction is sw a0, 4(zero) (encoded as 0x00A02223), its lower 7 bits form the opcode = 0x23 (binary 010_0011), indicating this is an S-type store instruction, so memory_write_enable is true.
2. This particular test does not execute any L-type instructions (opcode 0x03), so memory_read_enable remains false throughout the test.
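The two memory enable rules depend only on the low 7 bits of the instruction, so they can be modeled in C as plain comparisons. This is a sketch of the rule stated above, with hypothetical function names, and not the Chisel code you are asked to write.

```c
#include <stdbool.h>
#include <stdint.h>

#define OPCODE_LOAD  0x03u /* L-type: lb, lh, lw, lbu, lhu */
#define OPCODE_STORE 0x23u /* S-type: sb, sh, sw */

/* True only for load instructions. */
bool memory_read_enable(uint32_t inst)
{
    return (inst & 0x7Fu) == OPCODE_LOAD;
}

/* True only for store instructions. */
bool memory_write_enable(uint32_t inst)
{
    return (inst & 0x7Fu) == OPCODE_STORE;
}
```

With the instruction from the waveform, `memory_write_enable(0x00A02223)` is true and `memory_read_enable(0x00A02223)` is false, matching observation 1.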
### Execution
> The code is located in `src/main/scala/riscv/core/Execute.scala`.
What the execution stage does:
* Perform ALU computation.
* Determine if there is a branch.
```scala
val alu = Module(new ALU)
val alu_ctrl = Module(new ALUControl)
// ...
io.mem_alu_result := alu.io.result
```
The above code instantiates ALU and ALUControl within the Execute module. The specific ALU computation logic is implemented in the ALU module, and here, you only need to assign values to the input ports of ALU. The code for ALU can be found in `src/main/scala/riscv/core/ALU.scala`.
```scala
io.if_jump_flag :=
(opcode === Instructions.jal) ||
(opcode === Instructions.jalr) ||
(opcode === InstructionTypes.B) &&
MuxLookup(
funct3,
false.B,
IndexedSeq(
InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data),
// ...
)
```
The above code determines whether a jump should be taken. The logic for determining a jump is as follows: if it is an unconditional jump instruction, it jumps directly (e.g., `jal` and `jalr` instructions in the code above). If it is a branch instruction, it checks the corresponding jump condition to decide whether to jump (e.g., the `beq` instruction in the code above, which jumps when `io.reg1_data` is equal to `io.reg2_data`). When a jump is taken, the control signal `if_jump_flag` is set to 1.
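Since the Chisel snippet above elides most branch conditions, the following C sketch spells out the full decision using the RV32I `funct3` encodings. It is an illustrative model with hypothetical names (`branch_taken`, `jump_flag`); signed comparisons are expressed by casting the register values to `int32_t`.

```c
#include <stdbool.h>
#include <stdint.h>

/* RV32I major opcodes and branch funct3 encodings. */
enum { OPCODE_BRANCH = 0x63, OPCODE_JALR = 0x67, OPCODE_JAL = 0x6F };
enum { F3_BEQ = 0, F3_BNE = 1, F3_BLT = 4, F3_BGE = 5, F3_BLTU = 6, F3_BGEU = 7 };

/* Evaluate the branch condition selected by funct3. */
bool branch_taken(uint32_t funct3, uint32_t reg1, uint32_t reg2)
{
    switch (funct3) {
    case F3_BEQ:  return reg1 == reg2;
    case F3_BNE:  return reg1 != reg2;
    case F3_BLT:  return (int32_t)reg1 <  (int32_t)reg2; /* signed */
    case F3_BGE:  return (int32_t)reg1 >= (int32_t)reg2; /* signed */
    case F3_BLTU: return reg1 <  reg2;                   /* unsigned */
    case F3_BGEU: return reg1 >= reg2;                   /* unsigned */
    default:      return false; /* like the false.B default of MuxLookup */
    }
}

/* Unconditional jumps always redirect; branches only when the condition holds. */
bool jump_flag(uint32_t opcode, uint32_t funct3, uint32_t reg1, uint32_t reg2)
{
    return opcode == OPCODE_JAL || opcode == OPCODE_JALR ||
           (opcode == OPCODE_BRANCH && branch_taken(funct3, reg1, reg2));
}
```

The difference between `blt` and `bltu` shows up when one operand is `0xFFFFFFFF`: it is -1 for the signed comparison but the largest value for the unsigned one.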
The ALU operand selection logic performs the following three operations:
1. Connect ALU Function Control:
- The output alu_funct from the ALU control unit (alu_ctrl) is connected to the input func of the ALU (alu): `alu.io.func := alu_ctrl.io.alu_funct`
2. Select ALU Operand 1:
- If aluop1_source equals ALUOp1Source.InstructionAddress (value 1), use the program counter (instruction_address)
- Otherwise, use the content of source register 1 (reg1_data)
```scala
val aluOp1 = Mux(io.aluop1_source === ALUOp1Source.InstructionAddress,
io.instruction_address, io.reg1_data)
alu.io.op1 := aluOp1
```
3. Select ALU Operand 2:
- If aluop2_source equals ALUOp2Source.Immediate (value 1), use the decoded immediate value (immediate)
- Otherwise, use the content of source register 2 (reg2_data)
```scala
val aluOp2 = Mux(io.aluop2_source === ALUOp2Source.Immediate,
io.immediate, io.reg2_data)
alu.io.op2 := aluOp2
```
Waveform Examples:
The following waveform shows an R-type instruction where:
- aluop1_source = 0 → op1 = reg1_data
- aluop2_source = 0 → op2 = reg2_data

The following waveform shows a U-type or J-type instruction where:
- aluop1_source = 1 → op1 = instruction_address (PC)
- aluop2_source = 1 → op2 = immediate

### Memory Access
> The code is located in `src/main/scala/riscv/core/MemoryAccess.scala`.
Only load/store instructions have a memory access stage. The other instructions remain idle during this stage or skip it altogether. In this stage, reading loads data from memory into registers, and writing stores data from registers into memory.
In the decode stage, if it is an L-type (I-type subtype) instruction, `memory_read_enable` is set to 1, and if it is an S-type instruction, `memory_write_enable` is set to 1. These two control signals determine whether reading or writing should occur in this stage.
```scala
val mem_address_index = io.alu_result(log2Up(Parameters.WordSize) - 1, 0).asUInt
```
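First, the byte offset within the addressed word is obtained from the lowest `log2Up(Parameters.WordSize)` bits of the ALU result. This offset selects which byte or halfword of the word is read or written.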
When `io.memory_read_enable` is true:
The code reads data from the memory bus and processes it differently based on the instruction type (e.g., `lb` sign-extends, `lbu` zero-extends, `lh` reads two bytes, etc.). The processed data is then assigned to `io.wb_memory_read_data` for writing back.
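As an illustration of the load processing just described, the C sketch below picks a byte or halfword out of a 32-bit word and extends it according to the RV32I load `funct3` encodings. It assumes a 4-byte word and naturally aligned halfword accesses; the function name `load_result` is hypothetical.

```c
#include <stdint.h>

/* RV32I load funct3 encodings. */
enum { F3_LB = 0, F3_LH = 1, F3_LW = 2, F3_LBU = 4, F3_LHU = 5 };

/* Model of the value written back by a load, given the word read from memory. */
uint32_t load_result(uint32_t word, uint32_t address, uint32_t funct3)
{
    uint32_t offset = address & 0x3u;                      /* byte offset in word */
    uint32_t byte = (word >> (8 * offset)) & 0xFFu;        /* selected byte */
    uint32_t half = (word >> (16 * (offset >> 1))) & 0xFFFFu; /* selected halfword */

    switch (funct3) {
    case F3_LB:  return (byte & 0x80u) ? (byte | 0xFFFFFF00u) : byte;   /* sign-extend */
    case F3_LBU: return byte;                                           /* zero-extend */
    case F3_LH:  return (half & 0x8000u) ? (half | 0xFFFF0000u) : half; /* sign-extend */
    case F3_LHU: return half;                                           /* zero-extend */
    default:     return word;                                           /* lw */
    }
}
```

With the word `0xDEADBEEF`, `lb` at offset 0 returns `0xFFFFFFEF`, while `lbu` at the same offset returns `0x000000EF`.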
```scala
.elsewhen(io.memory_write_enable) {
io.memory_bundle.write_data := io.reg2_data
io.memory_bundle.write_enable := true.B
io.memory_bundle.write_strobe := VecInit(Seq.fill(Parameters.WordSize)(false.B))
when(io.funct3 === InstructionsTypeS.sb) {
io.memory_bundle.write_strobe(mem_address_index) := true.B
io.memory_bundle.write_data := io.reg2_data(Parameters.ByteBits, 0) << (mem_address_index << log2Up(Parameters.ByteBits).U)
}.elsewhen(io.funct3 === InstructionsTypeS.sh) {
// ...
```
When `io.memory_write_enable` is true:
The code writes data to memory, and the data is processed differently based on the instruction type (e.g., `sw` takes 32 bits, `sh` takes 16 bits, and `sb` takes 8 bits).
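The interplay between the write data and the write strobe can be modeled in C as follows. This sketch assumes a 4-byte word in which strobe bit `i` enables byte `i`, and naturally aligned halfword stores; `store_request_t` and `store_request` are hypothetical names, not identifiers from the project.

```c
#include <stdint.h>

/* RV32I store funct3 encodings. */
enum { F3_SB = 0, F3_SH = 1, F3_SW = 2 };

typedef struct {
    uint32_t data;   /* value driven on the write data bus */
    uint8_t  strobe; /* bit i set means byte i of the word is written */
} store_request_t;

/* Model of the bus request produced by a store instruction. */
store_request_t store_request(uint32_t reg2_data, uint32_t address, uint32_t funct3)
{
    uint32_t offset = address & 0x3u;        /* byte offset within the word */
    store_request_t r = { reg2_data, 0xFu }; /* default: sw writes all 4 bytes */

    if (funct3 == F3_SB) {
        r.data = (reg2_data & 0xFFu) << (8 * offset); /* move byte into its lane */
        r.strobe = (uint8_t)(1u << offset);
    } else if (funct3 == F3_SH) {
        uint32_t half = offset >> 1;                   /* 0 = low, 1 = high half */
        r.data = (reg2_data & 0xFFFFu) << (16 * half);
        r.strobe = (uint8_t)(0x3u << (2 * half));
    }
    return r;
}
```

For example, storing the low byte of `0x12345678` with `sb` to an address ending in 2 gives data `0x00780000` and strobe `0b0100`, so only byte lane 2 of the memory word changes.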
The RegisterFileTest class (`src/test/scala/riscv/singlecycle/RegisterFileTest.scala`) validates register file operations:
- Write and read-back: Writes 0xdeadbeef to register x1, then reads it back
- x0 hardwired to zero: Verifies x0 always reads 0, even after write attempts
- Write-through behavior: Tests reading a register in the same cycle it's being written

For the byte access test (and the following integration tests: Quicksort, Fibonacci), the external assembly code is loaded into memory by the TestTopModule class in `src/test/scala/riscv/singlecycle/CPUTest.scala`.
Binary Loading Process:
1. InstructionROM (`src/main/scala/peripheral/InstructionROM.scala`):
- Takes exeFilename parameter (e.g., "fibonacci.asmbin")
- Reads binary file from src/main/resources/ in little-endian format (4 bytes per instruction)
- Appends 3 NOP instructions (0x00000013) for safety
- Creates a Chisel Mem initialized with these instructions
- Generates .txt file in verilog/ for Verilator simulation
2. ROMLoader (`src/main/scala/peripheral/ROMLoader.scala`):
- Copies ROM contents word-by-word to main memory (RAM)
- Writes to entry address (defined in Parameters.EntryAddress, typically 0x1000)
- Signals load_finished when complete (line 38)
3. TestTopModule orchestration:
- Before load_finished: ROMLoader writes to Memory, CPU receives zero data
- After load_finished: CPU executes from Memory, ROMLoader is idle
### Write-Back
> The code is located in `src/main/scala/riscv/core/WriteBack.scala`.
In the write-back stage, the computed data or data read from memory is written into registers.
The write-back module is essentially a multiplexer, and the code is very simple. However, it raises an interesting question: the write-enable signal is generated in the decode stage, but at that point, the correct write-back data has not been calculated (or read from memory). So, will incorrect write-back data be written into the register file, and why?
### Combining into a CPU
> The code is located in `src/main/scala/riscv/core/CPU.scala`.
We have implemented all the components required to build the CPU. Now, we need to instantiate and connect these components together according to the single-cycle CPU architecture diagram.
```scala
class CPU extends Module {
val io = IO(new CPUBundle)
// CPUBundle is the channel for data exchange between the CPU and peripherals like memory.
val regs = Module(new RegisterFile)
val inst_fetch = Module(new InstructionFetch)
val id = Module(new InstructionDecode)
val ex = Module(new Execute)
val mem = Module(new MemoryAccess)
val wb = Module(new WriteBack)
// ...
// Here, we instantiate modules for different execution stages.
inst_fetch.io.jump_address_id := ex.io.if_jump_address
inst_fetch.io.jump_flag_id := ex.io.if_jump_flag
// Taking the two lines above as an example, you can see the corresponding connections in the CPU schematic.
}
```
In the code above, we instantiate modules for various execution stages and then establish connections between them to create the CPU.
Please observe the input port code of the Execute module and the CPU architecture diagram, and fill in the connections between the inputs of the Execute module and the outputs of other modules.
### Functional Test
Running `sbt test` executes the test cases that validate the CPU implementation. These tests rely on [Chiseltest](https://index.scala-lang.org/ucb-bar/chiseltest), a testing and formal verification library for Chisel-based RTL (Register-Transfer Level) designs. Chiseltest emphasizes lightweight tests with minimal boilerplate that are easy to read and write and that can be reused through composition.
The following C program computes the [Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_sequence):
```c
static int fib(int a) {
if (a == 1 || a == 2) return 1;
return fib(a - 1) + fib(a - 2);
}
int main() {
*((volatile int *) (4)) = fib(10);
return 0;
}
```
> File: csrc/fibonacci.c
Within the `main` function, the line `*((volatile int *) (4)) = fib(10)` stores the Fibonacci(10) result in the memory address `4`, which can be later verified using a Chiseltest-based test case.
```scala
class FibonacciTest extends AnyFlatSpec with ChiselScalatestTester {
behavior.of("Single Cycle CPU")
it should "calculate recursively fibonacci(10)" in {
test(new TestTopModule("fibonacci.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
for (i <- 1 to 50) {
c.clock.step(1000)
c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
}
c.io.mem_debug_read_address.poke(4.U)
c.clock.step()
c.io.mem_debug_read_data.expect(55.U)
}
}
}
```
> File: src/test/scala/riscv/singlecycle/CPUTest.scala
Given that Fibonacci(10) is known to be `55`, this test case straightforwardly verifies the content of the designated memory region after our CPU executes the instructions.
### Waveform
Before verifying the design on an FPGA board, you can perform another round of testing using waveform simulation.
Generating Waveform Files During Testing:
While running tests, if you set the environment variable `WRITE_VCD` to 1, waveform files will be generated.
```shell
$ WRITE_VCD=1 sbt test
```
Afterward, you can find `.vcd` files in various subdirectories under the `test_run_dir` directory. You can open them using [Surfer](https://surfer-project.org/).

On the left side of the interface, there is a tree-like structure organized by modules. To add a signal to the window, right-click and select `Insert` from the menu.
### Verilator
[Verilator](https://www.veripool.org/verilator/) is a tool that compiles Verilog and SystemVerilog sources into highly optimized C++ or SystemC code, which can be used for verification and modeling in C++ or SystemC testbenches.
For more details, please refer to the official [Verilator website](https://www.veripool.org/verilator/) and [its manual](https://verilator.org/guide/latest/).
Why do we choose [Verilator](https://www.veripool.org/verilator/)? It is a high-performance, open-source Verilog/SystemVerilog simulator. While it is powerful and fast, it is not a direct replacement for event-based simulators such as Modelsim and Vivado Xsim. Verilator operates on a cycle-based simulation model, which means it does not simulate exact circuit timing within a single clock cycle and may not capture intra-period glitches. While this approach has its advantages, it also has limitations compared to other simulators.
If you want to quickly test your own written programs, you can use Verilator for simulation. The main simulation function is already written and located in `verilog/verilator/sim_main.cpp`. After the first run and every time you modify the Chisel code, you need to execute the following command in the project's root directory to generate Verilog files:
```shell
$ make verilator
```
> :warning: If you don't modify the Scala code as mentioned earlier, the Verilog generation will fail.
After compilation, an executable file named `verilog/verilator/obj_dir/VTop` will be generated. This executable file can take parameters to run different code files. Here are the parameters and their usage:
| Parameter | Usage |
|:----------|:------|
| `-memory` | Specify the size of the simulation memory in words (4 bytes each).<br> Example: `-memory 4096` |
| `-instruction` | Specify the RISC-V program used to initialize the simulation memory.<br>Example: `-instruction src/main/resources/hello.asmbin` |
| `-signature` | Specify the memory range and destination file to output after simulation.<br>Example: `-signature 0x100 0x200 mem.txt` |
| `-halt` | Specify the halt identifier address; writing `0xBABECAFE` to this memory address stops the simulation.<br>Example: `-halt 0x8000` |
| `-vcd` | Specify the filename for saving the simulation waveform during the process; not specifying this parameter will not generate a waveform file.<br>Example: `-vcd dump.vcd` |
| `-time` | Specify the maximum simulation time; note that time is **twice** the number of cycles.<br>Example: `-time 1000` |
For example, to load the `fibonacci.asmbin` file, simulate for 1000 cycles, and save the simulation waveform to the `dump.vcd` file, you can run:
```shell
./run-verilator.sh -instruction src/main/resources/fibonacci.asmbin -time 2000 -vcd dump.vcd
```
> :walking: Keep in mind that a time value of **2000** corresponds to simulating 1000 cycles.
Then, run `surfer dump.vcd` to check its waveform.

You can observe that the signal `io_instruction` begins with `00000000` and `00001137`. In the meantime, let's verify the hexadecimal representation of `fibonacci.asmbin`:
```shell
hexdump src/main/resources/fibonacci.asmbin | head -1
```
Its output:
```
0000000 1137 0000 1097 0000 80e7 8b00 006f 0000
```
It aligns with the expected waveform.
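If the match is not obvious, note that `hexdump` prints 16-bit little-endian groups by default, so `1137 0000` corresponds to the bytes `37 11 00 00` in the file. The C sketch below shows how four such bytes form one 32-bit instruction word; `le_word` is a hypothetical helper written for this explanation.

```c
#include <stdint.h>

/* Assemble a 32-bit little-endian word: bytes[0] is the least significant byte. */
uint32_t le_word(const uint8_t bytes[4])
{
    return (uint32_t)bytes[0] | ((uint32_t)bytes[1] << 8) |
           ((uint32_t)bytes[2] << 16) | ((uint32_t)bytes[3] << 24);
}
```

Applied to the first four bytes of the file (`37 11 00 00`), it returns `0x00001137`, which is the value seen on `io_instruction`.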
:walking: You may wonder: why are the modules/IO ports/registers defined in Chisel not visible in the generated Verilog code (or waveform)?
> This occurs because Chisel initially generates [FIRRTL](https://github.com/chipsalliance/firrtl) code and subsequently applies optimization steps, including logical simplification, constant propagation, and dead code elimination, to the FIRRTL code. The final Verilog code is then generated based on the optimized [FIRRTL](https://github.com/chipsalliance/firrtl) representation. If the modules/IO ports/registers you created in Chisel do not appear in the generated Verilog code, it is recommended to check for proper module connections and potential logic issues that could lead to constant values in certain registers, among other possible reasons.
Run simulations:
```bash
# Basic simulation (no program loaded)
make sim
# Run with test program
make sim SIM_ARGS="-instruction ../../../src/main/resources/fibonacci.asmbin"
# Custom simulation time and waveform output
make sim SIM_TIME=100000 SIM_VCD=custom.vcd
```
Simulation Parameters
- `SIM_TIME`: Maximum simulation cycles (default: 1,000,000)
- `SIM_VCD`: Waveform output file (default: trace.vcd)
- `SIM_ARGS`: Additional arguments (program binary, halt address)
View simulation waveforms with Surfer:
```bash
surfer trace.vcd
```
Key signals to observe:
- `io_instruction_address`: Current PC value
- `io_instruction`: Fetched instruction
- `io_memory_bundle_*`: Memory interface signals
- `inst_fetch_*, id_*, ex_*, mem_*, wb_*`: Pipeline stage internals
### RISCOF Compliance Testing
[RISC-V architectural compliance testing](https://github.com/riscv-non-isa/riscv-arch-test) validates correct implementation of the RV32I instruction set against the official RISC-V specification, based on [RISCOF](https://riscof.readthedocs.io/).
Test Coverage:
- RV32I base instruction set (41 tests)
- Arithmetic operations: ADD, SUB, ADDI
- Logical operations: AND, OR, XOR, ANDI, ORI, XORI
- Shift operations: SLL, SRL, SRA, SLLI, SRLI, SRAI
- Comparison: SLT, SLTU, SLTI, SLTIU
- Load operations: LB, LH, LW, LBU, LHU
- Store operations: SB, SH, SW
- Branch instructions: BEQ, BNE, BLT, BGE, BLTU, BGEU
- Jump instructions: JAL, JALR
- Upper immediate: LUI, AUIPC
Running Compliance Tests:
```bash
make compliance
# Expected duration: 10-15 minutes
# Results saved to: results/report.html
```
### Prepare Programs to Run on MyCPU
If no specific argument is provided to the GNU toolchain, the default linker script is utilized for the linking process. However, for more granular control over the linking process, it is possible to create a custom linker script. You can designate a linker script using the `-T` option, as demonstrated below:
```shell
$ riscv-none-elf-gcc -T link.lds hello.o -o hello
```
The content of the linker script would look something like this:
```c
OUTPUT_ARCH( "riscv" )
ENTRY(_start)
SECTIONS
{
. = 0x00001000;
.text : { *(.text.init) *(.text.startup) *(.text) }
.data ALIGN(0x1000) : { *(.data*) *(.rodata*) *(.sdata*) }
. = 0x00100000;
.bss : { *(.bss) }
_end = .;
}
```
> File: `csrc/link.lds`
The first line of the linker script specifies the output architecture, which is RISC-V in this case. The second line specifies the program's entry point, which is the `_start` symbol. Starting from the third line, it defines where the program's sections are placed. For example, the `.text` section starts at address `0x00001000`, the `.data` section follows `.text` and is aligned to a `0x1000`-byte boundary, and the `.bss` section starts at address `0x00100000`. The last line defines the symbol `_end`, which marks the program's end address.
To regenerate the RISC-V programs utilized for unit tests, change to the `csrc` directory and run the `make update` command. Ensure that the `$PATH` environment variable is correctly configured to include the GNU toolchain for RISC-V.
```shell!
$ cd csrc
$ make update
```
> See [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel).
[ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) (Executable and Linkable Format) is an executable file format commonly used in Linux systems. While a detailed understanding of the ELF format is not necessary, having a general idea of its structure can be helpful for this experiment.
An ELF file stores metadata about a program, including the entry address, segment information, symbol information, and more. Among these, segment information is crucial as it contains the program's code and data. Typically, the code segment is named `.text`, and the data segment is named `.data`. These segments are loaded into memory by the operating system for program execution and data access.
In our experiment, since we do not have an operating system, we use Chisel code to specify the CPU's entry address and load the code and data segments into memory for direct program execution. These code and data segments are flashed into the FPGA logic as initialization data during the synthesis phase. The `InstructionROM` module is responsible for generating this initialization data, and the `ROMLoader` module handles the loading of data into memory. After this data copying process is complete, the CPU can start executing the program.
The process of generating an executable file for the CPU involves the following steps:
1. Only the code and data segments from the ELF file are required, while other segments can be disregarded. In the linker script, these two segments are allocated to contiguous addresses to streamline the implementation.
2. The code segment commences from address `0x1000`, with the space before it reserved for the program's stack.
3. The `objcopy` tool is employed to duplicate the code and data segments into a distinct file, resulting in a file containing solely binary code and data.
---
## RISC-V CPU with MMIO Peripherals and Trap Handling
> Directory: `2-mmio-trap`
An extended single-cycle RISC-V processor implementation in Chisel that adds memory-mapped I/O peripherals (Timer, UART) and comprehensive trap handling through Control and Status Registers (CSR) and Core-Local Interrupt Controller (CLINT). This builds upon the basic single-cycle design by introducing privileged architecture features and peripheral interfacing necessary for embedded systems and operating system support.
### Instruction Coverage
The design decodes the complete RV32I base ISA together with the machine-mode CSR (Zicsr) subset:
- Arithmetic / Logical: all OP and OP-IMM forms (`add/sub`, shifts, comparisons, bitwise ops)
- Memory: byte/halfword/word loads and stores with sign or zero extension
- Control Flow: all conditional branches, `jal`, `jalr`, `lui`, `auipc`
- System: `ecall`, `ebreak`, `mret`, and fence instructions (treated as architectural no-ops in this configuration)
- CSR Access: `csrrw`, `csrrs`, `csrrc` and immediate variants with proper read-modify-write semantics (no write-back when the source operand is zero)
### Key Enhancements Over Base Implementation
- MMIO Peripherals: Memory-mapped Timer, UART, and VGA devices with device address decoding
- Timer Peripheral: Configurable 32-bit counter with interrupt generation on threshold
- UART Peripheral: Full-duplex serial communication with TX/RX buffering and interrupts
- VGA Peripheral: 640×480@72Hz display with dual-clock framebuffer, palette, and SDL2 visualization
- CSR Support: 15+ machine-mode CSR registers per RISC-V Privileged Spec v1.10
- Interrupt Handling: Hardware interrupt processing from peripherals via CLINT
- Exception Support: Software traps through `ecall` and `ebreak`
- Privileged Instructions: `mret` for trap return, CSR manipulation instructions (CSRRW/CSRRS/CSRRC)
### Design Philosophy
The interrupt mechanism operates at instruction boundaries, ensuring that:
- Current instruction completes before interrupt handling begins
- Atomicity of individual instructions is preserved
- CSR state remains consistent
- Nested interrupts are explicitly prevented through privilege controls
### Control and Status Registers (CSR)
CSRs form an independent address space of 4096 registers (12-bit CSR addresses), separate from the general-purpose registers. According to the RISC-V ISA specification, "CSR instructions are atomic read-modify-write operations," requiring special handling in the processor pipeline.
#### Machine Information Registers
- `mvendorid` (0xF11): Vendor ID (read-only, returns 0)
- `marchid` (0xF12): Architecture ID (read-only, returns 0)
- `mimpid` (0xF13): Implementation ID (read-only, returns 0)
- `mhartid` (0xF14): Hardware thread ID (read-only, returns 0)
#### Machine Trap Setup
- `mstatus` (0x300): Machine status register
- Bit 3 (MIE): Machine interrupt enable
- Bit 7 (MPIE): Previous interrupt enable state
- `misa` (0x301): ISA and extensions (read-only)
- `mie` (0x304): Interrupt enable register
- Bit 11 (MEIE): External interrupt enable
- `mtvec` (0x305): Trap vector base address
#### Machine Trap Handling
- `mscratch` (0x340): Scratch register for machine trap handlers
- `mepc` (0x341): Machine exception program counter
- `mcause` (0x342): Machine trap cause
- Bit 31: Interrupt flag (1 = interrupt, 0 = exception)
- Bits 30:0: Exception code
- `mtval` (0x343): Machine trap value (bad address or instruction)
- `mip` (0x344): Interrupt pending register
- Bit 11 (MEIP): External interrupt pending
#### Counters and Timers
- `cycle` (0xC00): Cycle counter (lower 32 bits)
- `cycleh` (0xC80): Cycle counter (upper 32 bits)
### CSR Instruction Set
The implementation supports all RV32I Zicsr extension instructions:
#### Atomic Read-Modify-Write Operations
- `CSRRW rd, csr, rs1`: Atomic Read/Write
- Reads CSR into `rd`
- Writes `rs1` value to CSR
- `CSRRS rd, csr, rs1`: Atomic Read and Set Bits
- Reads CSR into `rd`
- Sets bits in CSR where `rs1` bits are 1
- `CSRRC rd, csr, rs1`: Atomic Read and Clear Bits
- Reads CSR into `rd`
- Clears bits in CSR where `rs1` bits are 1
#### Immediate Variants
- `CSRRWI rd, csr, uimm`: Read/Write with 5-bit unsigned immediate
- `CSRRSI rd, csr, uimm`: Read and Set with immediate
- `CSRRCI rd, csr, uimm`: Read and Clear with immediate
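The write side of these six instructions reduces to three operations on the old CSR value, which the following C sketch models. In every case `rd` receives the old value; `src` stands for either the `rs1` value or the zero-extended 5-bit immediate. The names `csr_op_t` and `csr_next_value` are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Low two bits of funct3: 01 = CSRRW(I), 10 = CSRRS(I), 11 = CSRRC(I). */
typedef enum { CSR_RW = 1, CSR_RS = 2, CSR_RC = 3 } csr_op_t;

/* Compute the value written to the CSR; rd always gets old_value. */
uint32_t csr_next_value(csr_op_t op, uint32_t old_value, uint32_t src)
{
    switch (op) {
    case CSR_RW: return src;              /* overwrite */
    case CSR_RS: return old_value | src;  /* set bits where src is 1 */
    case CSR_RC: return old_value & ~src; /* clear bits where src is 1 */
    }
    return old_value; /* unknown operation: leave the CSR unchanged */
}
```

For instance, `csr_next_value(CSR_RS, 0x0, 0x8)` returns `0x8`, which is how software typically sets `mstatus` bit 3 (MIE) without touching other bits. A set or clear with a zero source leaves the value unchanged.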
### CSR Implementation Details
> File: `src/main/scala/riscv/core/CSR.scala`
The CSR module implements:
* Separate 4096-entry register file for CSR address space
* Read-only enforcement for information registers
* Atomic read-modify-write semantics in single cycle
* CLINT interface for interrupt-driven CSR updates
* Debug read port for verification
Key Operations:
- Decode CSR address from instruction (bits 31:20)
- Determine operation type from funct3 field
- Perform atomic RMW for CSRRS/CSRRC operations
- Handle read-only register protection
- Interface with CLINT for exception/interrupt updates
### Core-Local Interrupt Controller (CLINT)
The CLINT manages interrupt and exception processing by coordinating CSR updates and control flow redirection.
The processor handles interrupts at instruction boundaries:
1. Detection: Check `mstatus.mie` and pending interrupt signals
2. Entry: Save state and jump to handler
3. Handling: Execute trap handler code
4. Exit: Restore state via `mret` instruction
- [ ] Interrupt Entry Sequence
> File: `src/main/scala/riscv/core/CLINT.scala`
When responding to an interrupt or exception:
1. Save Return Address: Write PC + 4 to `mepc`
2. Record Cause: Write exception code to `mcause`
   - Hardware interrupt: `mcause[31] = 1`, cause code in bits 30:0
   - Software exception: `mcause[31] = 0`, exception code in bits 30:0
3. Disable Interrupts:
   - Save current `mstatus.mie` to `mstatus.mpie`
   - Clear `mstatus.mie` to prevent nested interrupts
4. Jump to Handler: Redirect PC to address in `mtvec`
- [ ] Interrupt Exit (`mret`)
The `mret` instruction atomically:
1. Restores PC from `mepc`
2. Restores `mstatus.mie` from `mstatus.mpie`
3. Resumes normal execution
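
The entry and exit sequences can be written out as a small C model of the CSR state. It is a sketch under stated assumptions rather than a transcription of `CLINT.scala`: the struct and function names are hypothetical, the return address is passed in by the caller, `mtvec` is assumed to hold a direct-mode handler address, and setting `MPIE` to 1 on `mret` follows the privileged specification rather than the text above. `MIE` is bit 3 and `MPIE` is bit 7 of `mstatus`.

```c
#include <stdint.h>

#define MSTATUS_MIE  (1u << 3)   /* machine interrupt enable */
#define MSTATUS_MPIE (1u << 7)   /* previous MIE, saved on trap entry */

struct machine_csrs {
    uint32_t mstatus, mtvec, mepc, mcause;
};

/* Trap entry: updates the CSRs and returns the next PC (the handler). */
uint32_t trap_enter(struct machine_csrs *c, uint32_t return_pc, uint32_t cause)
{
    c->mepc   = return_pc;                 /* 1. save return address */
    c->mcause = cause;                     /* 2. record cause */
    if (c->mstatus & MSTATUS_MIE)          /* 3. MPIE <- MIE */
        c->mstatus |= MSTATUS_MPIE;
    else
        c->mstatus &= ~MSTATUS_MPIE;
    c->mstatus &= ~MSTATUS_MIE;            /*    MIE <- 0, no nesting */
    return c->mtvec;                       /* 4. jump to handler */
}

/* mret: restores the interrupt enable and returns the next PC (mepc). */
uint32_t trap_return(struct machine_csrs *c)
{
    if (c->mstatus & MSTATUS_MPIE)         /* MIE <- MPIE */
        c->mstatus |= MSTATUS_MIE;
    else
        c->mstatus &= ~MSTATUS_MIE;
    c->mstatus |= MSTATUS_MPIE;            /* MPIE <- 1 (privileged spec) */
    return c->mepc;
}
```

Calling `trap_enter` followed by `trap_return` leaves `mstatus.mie` as it was before the trap, which is the invariant the `irqtrap` tests rely on.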
- [ ] Exception Codes
According to the RISC-V Privileged Specification:
Exceptions (`mcause[31] = 0`):
- `0`: Instruction address misaligned
- `2`: Illegal instruction
- `3`: Breakpoint (`ebreak`)
- `8`: Environment call from U-mode (`ecall`)
- `11`: Environment call from M-mode (`ecall`)
Interrupts (`mcause[31] = 1`):
- `11`: Machine external interrupt
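
A trap handler typically begins by splitting `mcause` into its interrupt flag and code. The helper below is an illustrative sketch (the function name is hypothetical) that covers only the codes listed above and reports everything else as unknown.

```c
#include <stdint.h>

/* Return a short description for an mcause value, or "unknown". */
const char *mcause_describe(uint32_t mcause)
{
    uint32_t code = mcause & 0x7FFFFFFFu;    /* bits 30:0 */
    if (mcause >> 31) {                      /* bit 31 set: interrupt */
        return code == 11 ? "machine external interrupt" : "unknown";
    }
    switch (code) {                          /* bit 31 clear: exception */
    case 0:  return "instruction address misaligned";
    case 2:  return "illegal instruction";
    case 3:  return "breakpoint";
    case 8:  return "environment call from U-mode";
    case 11: return "environment call from M-mode";
    default: return "unknown";
    }
}
```

Note that code 11 means different things depending on bit 31, so the flag has to be checked before the code is interpreted.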
### Software Exceptions
- [ ] Environment Call (`ecall`)
Triggers a synchronous exception for system call interface:
* Saves current PC to `mepc`
* Sets `mcause` to 11 (M-mode ecall)
* Jumps to trap handler in `mtvec`
- [ ] Breakpoint (`ebreak`)
Triggers a synchronous exception for debugging:
* Saves current PC to `mepc`
* Sets `mcause` to 3 (breakpoint)
* Jumps to trap handler in `mtvec`
Both instructions behave identically to hardware interrupts regarding CSR manipulation, differing only in the `mcause` value to indicate the specific exception type.
### Memory-Mapped Peripherals
The processor uses high-order address bits to select between devices:
* `deviceSelect = 0`: Main memory
* `deviceSelect = 1`: Timer peripheral
* `deviceSelect = 2`: UART peripheral
* `deviceSelect = 3`: VGA peripheral
- [ ] Timer Peripheral
> File: `src/main/scala/peripheral/Timer.scala`
A memory-mapped timer peripheral provides periodic interrupt generation capabilities.
- [ ] Memory-Mapped Registers
Located at base address `0x80000000`:
- Timer Limit Register (`0x80000004`): Sets interrupt interval
  - Write: Configure timer period (in cycles)
  - Read: Current limit value
- Timer Enable Register (`0x80000008`): Controls timer operation
  - Write: 1 = enable, 0 = disable
  - Read: Current enable state
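
From software, programming the timer is two stores. The following is a minimal driver sketch, assuming word-sized accesses; the macro and function names are hypothetical, and the register block is passed in as a pointer so the helpers can be tried against an ordinary array on a host machine.

```c
#include <stdint.h>

#define TIMER_BASE    0x80000000u
#define TIMER_LIMIT   0x04u   /* byte offset of the limit register */
#define TIMER_ENABLE  0x08u   /* byte offset of the enable register */

/* Write a 32-bit register located `offset` bytes past `base`. */
static inline void mmio_write(volatile uint32_t *base, uint32_t offset,
                              uint32_t value)
{
    base[offset / 4] = value;
}

/* Set the interrupt period (in cycles), then enable the timer. */
void timer_start(volatile uint32_t *base, uint32_t period_cycles)
{
    mmio_write(base, TIMER_LIMIT, period_cycles);
    mmio_write(base, TIMER_ENABLE, 1);
}

/* Disable the timer; the limit register keeps its value. */
void timer_stop(volatile uint32_t *base)
{
    mmio_write(base, TIMER_ENABLE, 0);
}
```

On the target, the call would look like `timer_start((volatile uint32_t *)TIMER_BASE, 100000);`. Writing the limit before the enable bit avoids running the counter against a stale period.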
- [ ] Timer Operation
1. Internal counter increments each cycle when enabled
2. When counter reaches limit value:
   - Assert interrupt signal to CLINT
   - Reset counter to 0
3. CLINT processes interrupt according to `mstatus.mie` and `mie.meie`
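
The counter behavior in steps 1 and 2 can be captured in a cycle-level C model. This is a sketch with hypothetical names; in particular, comparing after the increment is an assumption, so the exact cycle on which `Timer.scala` raises the interrupt may differ by one.

```c
#include <stdbool.h>
#include <stdint.h>

struct timer_model {
    uint32_t count;    /* internal counter */
    uint32_t limit;    /* value of the limit register */
    bool     enabled;  /* value of the enable register */
};

/* Advance the model by one clock cycle. Returns true on the cycle in
 * which the interrupt signal would be asserted. */
bool timer_tick(struct timer_model *t)
{
    if (!t->enabled)
        return false;          /* counter holds while disabled */
    t->count++;
    if (t->count >= t->limit) {
        t->count = 0;          /* wrap and raise the interrupt */
        return true;
    }
    return false;
}
```

With `limit = 3`, the model reports an interrupt on every third call to `timer_tick`, which matches the periodic behavior described above.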
### VGA Peripheral
> File: `src/main/scala/peripheral/VGA.scala`
A memory-mapped VGA display peripheral for visual output with 640×480@72Hz timing and indexed color support.
Display Specifications:
- Resolution: 640×480 pixels @ 72Hz refresh rate
- Framebuffer: 64×64 pixels (4-bit indexed color)
- Color Depth: 16-color palette with 6-bit RRGGBB format
- Upscaling: 6× hardware upscaler (64×64 → 384×384 centered display)
- Animation: 12-frame double-buffered animation support
Memory Organization:
* Display Memory: 12 frames × 4096 pixels × 4 bits = 24KB
* Pixel Packing: 8 pixels per 32-bit word (4 bits per pixel)
* Frame Capacity: 49,152 bytes uncompressed (12 × 4096 pixels, which corresponds to one byte per pixel) versus 4,755 bytes with delta compression
- [ ] Memory-Mapped Registers
Base address: `0x30000000`
Control Registers:
* VGA_ID (0x30000000): Device identification (read-only, returns 0x56474131 "VGA1")
* VGA_CTRL (0x30000004): Control register
  - Bit 0: Display enable (1 = on, 0 = off)
  - Bit 1: Auto-advance enable (1 = automatic frame cycling)
* VGA_STATUS (0x30000008): Status register (read-only)
  - Bit 0: V-sync active
  - Bit 1: H-sync active
* VGA_UPLOAD_ADDR (0x30000010): Framebuffer write address pointer
  - Format: [frame_index:4][pixel_offset:12] (bits packed as 32-bit word address)
* VGA_STREAM_DATA (0x30000014): Streaming data write port
  - Write: 8 pixels (32 bits) to current upload address, auto-increment address
Palette Registers:
* VGA_PALETTE(n) (`0x30000020 + n*4`): Color palette entries (n = 0..15)
  - Format: 6-bit RRGGBB (bits 5:4 = RR, bits 3:2 = GG, bits 1:0 = BB)
  - Each component: 0-3 scale (4 levels)
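
Building a palette entry is a matter of placing three 2-bit levels into the 6-bit RRGGBB layout. The helpers below sketch this in C; the names are hypothetical, and the register block is passed as a pointer to the VGA base (`0x30000000` on the target) so the code can also run against a plain array.

```c
#include <assert.h>
#include <stdint.h>

#define VGA_PALETTE_OFFSET 0x20u  /* VGA_PALETTE(0) relative to the VGA base */

/* Pack 2-bit red, green, and blue levels (0-3) into 6-bit RRGGBB. */
uint32_t vga_rgb(uint32_t r, uint32_t g, uint32_t b)
{
    assert(r < 4 && g < 4 && b < 4);
    return (r << 4) | (g << 2) | b;
}

/* Store palette entry n (0..15) in the register block at `vga_base`. */
void vga_set_palette(volatile uint32_t *vga_base, unsigned n, uint32_t rgb6)
{
    assert(n < 16);
    vga_base[VGA_PALETTE_OFFSET / 4 + n] = rgb6 & 0x3Fu;
}
```

For instance, `vga_rgb(3, 0, 0)` yields `0x30` (full red) and `vga_rgb(3, 3, 3)` yields `0x3F` (white).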
Pixel Packing Format:
Each 32-bit word contains 8 pixels with 4-bit color indices:
```
Bits: 31-28 | 27-24 | 23-20 | 19-16 | 15-12 | 11-8 | 7-4 | 3-0
Pixel: 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
```
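
The layout above maps pixel `i` of a group to bits `4i+3:4i`. A short C sketch of the packing and unpacking follows; the function names are hypothetical.

```c
#include <stdint.h>

/* Pack eight 4-bit color indices into one word; px[0] lands in bits 3:0. */
uint32_t vga_pack8(const uint8_t px[8])
{
    uint32_t word = 0;
    for (int i = 0; i < 8; i++)
        word |= (uint32_t)(px[i] & 0xF) << (4 * i);
    return word;
}

/* Extract pixel i (0..7) from a packed word. */
uint8_t vga_unpack(uint32_t word, unsigned i)
{
    return (uint8_t)((word >> (4 * (i & 7u))) & 0xF);
}
```

Packing the indices 0 through 7 in order gives `0x76543210`, which makes the layout easy to recognize in a waveform or memory dump.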
### VGA Display Demo
The processor includes a VGA peripheral for visual output with SDL2 support. The demo displays an animated nyancat on a 640×480@72Hz virtual display using advanced delta frame compression.
To ensure proper operation, the target system should have the [SDL2 library](https://www.libsdl.org/) installed.
- macOS: `brew install sdl2`
- Ubuntu Linux / Debian: `sudo apt install libsdl2-dev`
Quick Start:
```bash
make demo
```
This command will:
1. Build Verilator simulator with SDL2 graphics support
2. Run the nyancat animation program (12 frames of animated nyancat)
3. Open an SDL2 window showing real-time VGA output
4. Simulate 500 million cycles (~5 minutes, includes full animation)
5. Display completion progress (1%, 50%, 100%)
VGA Peripheral Features:
- Display: 640×480 @ 72Hz timing
- Framebuffer: Dual-clock RAM with 12 frames of 64×64 pixels
- Rendering: 6× upscaling (64×64 → 384×384 centered display)
- MMIO Base: 0x30000000
- Color: 16-color palette with 6-bit RRGGBB format
- Compression: Delta frame encoding (91% size reduction, 49KB → 4.7KB)
Animation Details:
The nyancat demo uses compressed animation data generated from the upstream [klange/nyancat](https://github.com/klange/nyancat) project:
- Source: Original nyancat terminal animation
- Frames: 12 frames × 4096 pixels (64×64 each)
- Compression: Delta frame encoding achieving 91% reduction (29% better than RLE)
- Generation: Automated Python script downloads and compresses animation data
- Colors: 14-color palette mapped from upstream character encoding
- Binary size: 8.7KB (vs 10.8KB with RLE, 19% smaller)

Delta Frame Compression Format:
The animation uses an advanced delta encoding scheme exploiting 94.4% frame-to-frame similarity:
| Opcode | Meaning | Example |
|--------|---------|---------|
| `0x0X` | SetColor (X = color 0-13) | `0x05` sets current color to 5 |
| `0x1Y` | Skip unchanged (Y+1 pixels, 1-16) | `0x13` skips 4 pixels |
| `0x2Y` | Repeat changed (Y+1 pixels, 1-16) | `0x23` writes 4 pixels |
| `0x3Y` | Skip unchanged ((Y+1)×16 pixels, 16-256) | `0x32` skips 48 pixels |
| `0x4Y` | Repeat changed ((Y+1)×16 pixels, 16-256) | `0x42` writes 48 pixels |
| `0x5Y` | Skip unchanged ((Y+1)×64 pixels, 64-1024) | `0x52` skips 192 pixels |
| `0xFF` | EndOfFrame marker | Signals frame completion |
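
The opcode table translates into a compact decoder loop. The C sketch below is one possible reading of the format, not the demo's actual decompressor: it assumes the frame is held as one color index per byte, that the pixel cursor starts at 0 and advances in row-major order, that "Repeat changed" writes the current color, and that every frame ends with `0xFF`. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_PIXELS 4096   /* 64 x 64 */

/*
 * Apply one delta-encoded frame to `frame`, which initially holds the
 * previous frame. Returns the number of pixels covered when EndOfFrame
 * is reached, or -1 on a malformed stream.
 */
int delta_decode(const uint8_t *stream, size_t len, uint8_t frame[FRAME_PIXELS])
{
    uint8_t color = 0;
    int pos = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t op = stream[i];
        if (op == 0xFF)
            return pos;                                /* EndOfFrame */
        int y = op & 0x0F;
        int count = 0, write = 0;
        switch (op >> 4) {
        case 0x0: color = (uint8_t)y; continue;        /* SetColor */
        case 0x1: count = y + 1;                   break;  /* skip 1-16 */
        case 0x2: count = y + 1;        write = 1; break;  /* write 1-16 */
        case 0x3: count = (y + 1) * 16;            break;  /* skip 16-256 */
        case 0x4: count = (y + 1) * 16; write = 1; break;  /* write 16-256 */
        case 0x5: count = (y + 1) * 64;            break;  /* skip 64-1024 */
        default:  return -1;                       /* undefined opcode */
        }
        if (pos + count > FRAME_PIXELS)
            return -1;                             /* would overrun frame */
        if (write) {
            for (int k = 0; k < count; k++)
                frame[pos + k] = color;
        }
        pos += count;
    }
    return -1;   /* ran out of data before EndOfFrame */
}
```

As a usage example, the stream `0x05 0x13 0x23 0x32 0xFF` selects color 5, skips 4 pixels, writes pixels 4 through 7, skips another 48, and ends with the cursor at pixel 56.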
Compression Performance:
- Frame 0 (baseline): 576 opcodes (86% reduction) using RLE
- Frames 1-11 (delta): avg 390 opcodes (91% reduction) exploiting temporal coherence
- Best frames (3, 9): 235-236 opcodes (95% reduction) with minimal pixel changes
- Total: 4,755 bytes compressed data (vs 6,715 RLE, 29% improvement)
The scheme is lossless and reduces the animation data by about 91%, so all 12 frames plus the delta decompression logic fit in an 8.7KB binary.
---
## Pipelined RISC-V CPU
> Directory: `3-pipeline`
Every design builds upon the single-cycle baseline and shares the same front-end (instruction memory, register file) and peripheral models as the previous labs. The goal is to show how progressively richer pipeline techniques eliminate performance bottlenecks while preserving architectural correctness.
| Implementation | Stages | Highlight |
| --- | --- | --- |
| `ImplementationType.ThreeStage` | IF → ID → EX/MEM/WB (folded) | Minimal pipeline that introduces control-flow redirection and CLINT interaction with a single execute stage. |
| `ImplementationType.FiveStageStall` | IF → ID → EX → MEM → WB | Classic five-stage design that resolves data hazards with interlocks (bubbles) and performs branch resolution in EX. |
| `ImplementationType.FiveStageForward` | IF → ID → EX → MEM → WB | Adds bypass paths from MEM/WB back to EX to reduce stalls caused by RAW hazards. |
| `ImplementationType.FiveStageFinal` | IF → ID → EX → MEM → WB | Combines forwarding, refined flush logic, and the optimized CLINT/CSR handshake that matches the interrupt-capable single-cycle core. |
Select the implementation by passing the desired constant to `new CPU(implementation = …)` in `board/verilator/Top.scala` or within the unit tests.
### Hazards in This Lab
Pipelining introduces overlapping instruction execution, which naturally creates hazards. The lab highlights three families and shows how each implementation responds:
- [ ] Structural Hazards
The pipelines assume a Harvard-style memory system: instruction fetch and data memory have independent ports, so structural hazards are intentionally avoided. The register file supports two reads and one write per cycle. If you experiment with alternative memories (for example, a unified single-port SRAM), you must introduce arbitration or buffers to avoid fetch/data conflicts.
- [ ] Data Hazards
1. Read-After-Write (RAW):
   - *Three-stage* and *five-stage stall* cores insert bubbles when an instruction consumes a value still in flight. This behavior is encoded in `Control.scala` and the hazard unit.
   - *Forwarding* and *final* cores extend `Forwarding.scala` to feed results from MEM or WB back into EX. The tests under `PipelineProgramTest` inspect the register file after running `hazard.asmbin` to ensure RAW hazards are resolved without corrupting architectural state.
2. Write-After-Write (WAW) / Write-After-Read (WAR):
   Single-issue in-order execution eliminates these hazards because writeback occurs in program order. The unified tests still monitor the register file (`hazard.asmbin`) to confirm that the chosen hazard strategy does not disturb older writes.
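
The RAW handling described above reduces to two small decisions per cycle. The C sketch below shows the textbook five-stage formulation; it is not taken from `Forwarding.scala` or `Control.scala`, and all signal and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum fwd_sel { FWD_NONE = 0, FWD_FROM_MEM = 1, FWD_FROM_WB = 2 };

/* Choose the source for one EX operand that reads register `rs`. */
enum fwd_sel forward_select(uint8_t rs,
                            bool mem_reg_write, uint8_t mem_rd,
                            bool wb_reg_write,  uint8_t wb_rd)
{
    if (rs == 0)
        return FWD_NONE;                 /* x0 is hard-wired to zero */
    if (mem_reg_write && mem_rd == rs)
        return FWD_FROM_MEM;             /* youngest producer wins */
    if (wb_reg_write && wb_rd == rs)
        return FWD_FROM_WB;
    return FWD_NONE;                     /* use the register file value */
}

/* Load-use hazard: a load in EX feeds the instruction in ID, so even a
 * forwarding pipeline has to insert one bubble. */
bool load_use_stall(bool ex_mem_read, uint8_t ex_rd,
                    uint8_t id_rs1, uint8_t id_rs2)
{
    return ex_mem_read && ex_rd != 0 &&
           (ex_rd == id_rs1 || ex_rd == id_rs2);
}
```

Checking the MEM stage before the WB stage matters when both hold a write to the same register: the MEM-stage instruction is later in program order, so its result is the one the consumer must see.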
- [ ] Control Hazards
Branches and jumps must redirect the instruction fetch stage. The designs use the following techniques:
* `Control.scala` asserts flush signals whenever a taken branch or exception is detected.
* The CLINT module coordinates interrupt entry (`mtvec`, `mepc`, `mcause`) and ensures that the pipeline drains correctly before executing `mret`.
* `PipelineProgramTest` runs `irqtrap.asmbin`, toggles `interrupt_flag`, and checks that the machine returns to the instruction stream with a valid CSR snapshot. The test accepts either timer or external interrupt causes (`0x80000007` or `0x8000000b`) because both codes are architecturally legal depending on which device raised the interrupt.
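
The two accepted `mcause` values are the interrupt flag (bit 31) combined with cause code 7 (machine timer) or 11 (machine external). A small C predicate makes the check explicit; the macro and function names are hypothetical and do not come from the test suite.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCAUSE_INTERRUPT      0x80000000u  /* bit 31 */
#define IRQ_MACHINE_TIMER     7u
#define IRQ_MACHINE_EXTERNAL  11u

/* True for the interrupt causes the irqtrap check accepts. */
bool is_accepted_irq_cause(uint32_t mcause)
{
    return mcause == (MCAUSE_INTERRUPT | IRQ_MACHINE_TIMER) ||
           mcause == (MCAUSE_INTERRUPT | IRQ_MACHINE_EXTERNAL);
}
```

A value of `0x0000000b` is rejected because, without bit 31, code 11 denotes an M-mode `ecall` rather than an interrupt.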
- [ ] Interaction with the Hazard Unit
Each pipeline variant ties the hazard unit into the register file and the forwarding network. Pay special attention to:
* `Control.scala` (stall and flush decisions).
* `Forwarding.scala` (mux selects for EX operands).
* `ID2EX.scala` and `EX2MEM.scala` (registering control signals so that the hazard unit can observe the pipeline state).
The tests exercise these paths automatically, but it is useful to inspect waveform dumps (`make sim SIM_ARGS="..."`) to see how hazards propagate through the pipeline.
### Test Suite
The implementation is verified in two complementary ways: ChiselTest unit tests and ISCOF architectural compliance tests.
- [ ] ChiselTest Unit Tests (25 tests)
> Located in `src/test/scala/riscv/`:
The test suite validates all four pipeline implementations (ThreeStage, FiveStageStall, FiveStageForward, FiveStageFinal) across multiple dimensions:
1. PipelineProgramTest: Validates correct execution of test programs
   - fibonacci.asmbin: Recursive Fibonacci calculation
   - quicksort.asmbin: Array sorting algorithm
   - hazard.asmbin: Data hazard handling (RAW, WAW)
   - irqtrap.asmbin: Interrupt entry/exit sequences
2. PipelineUartTest: MMIO peripheral verification
   - UART register access and configuration
   - TX/RX buffer operations
   - Memory-mapped I/O correctness
3. PipelineRegisterTest: Pipeline register functionality
   - IF2ID, ID2EX, EX2MEM, MEM2WB register correctness
   - Control signal propagation
All unit tests pass successfully:
```bash
make test
# Total number of tests run: 25
# Tests: succeeded 25, failed 0
```
- [ ] ISCOF Compliance Testing (119 tests)
RISC-V architectural compliance testing validates correct implementation of RV32I + Zicsr extensions with pipelined execution.
Test Coverage:
* RV32I base instruction set (41 tests)
* Zicsr extension - CSR instructions (40 tests)
  - CSRRW, CSRRS, CSRRC and immediate variants
  - Machine-mode CSR registers (mstatus, mie, mtvec, mepc, mcause, etc.)
  - Atomic read-modify-write semantics in pipelined context
* Physical Memory Protection (PMP) registers (38 tests)
Running Compliance Tests:
```bash
make compliance
# Expected duration: 10-15 minutes
# Results saved to: results/report.html
```
---
## Reference
* [(System)Verilog to Chisel Translation for Faster Hardware Design](https://hal.science/hal-02949112/file/sv2chisel.pdf)
* [Digital Design with Chisel](https://www.imm.dtu.dk/~masca/chisel-book.html)
* [Davis In-Order (DINO) CPU models](https://github.com/jlpteaching/dinocpu)
* [Online RISC-V instruction interpreter](https://www.cs.cornell.edu/courses/cs3410/2019sp/riscv/interpreter/)