# part3: Construct a single-cycle RISC-V CPU with Chisel > [Computer Architecture](https://wiki.csie.ncku.edu.tw/arch/schedule) 2024 Fall ## Development Objectives of this Project Our goal is to create a RISC-V CPU that prioritizes simplicity while assuming a foundational understanding of digital circuits and the C programming language among its readers. The CPU should strike a balance between simplicity and sophistication, and we intend to maximize its functionality. This project encompasses the following key aspects, which will be prominently featured in the technical report: 1. Implementation in [Chisel](https://www.chisel-lang.org/). 2. RV32I instruction set support. 3. Execution of programs compiled from the C programming language. 4. Successful completion of RISC-V Architectural Tests, also known as [riscv-arch-test](https://github.com/riscv-non-isa/riscv-arch-test). ## Prerequisites ### Operating Systems Ensure that you have a functioning GNU/Linux or macOS system with the necessary permissions to install software packages. :::warning :warning: Notice 1. Consider using Linux or macOS instead of Microsoft Windows, as resolving any potential issues on Windows would be your responsibility. 2. Ubuntu Linux 22.04 is recommended. > Please note that certain packages, such as [Verilator](https://www.veripool.org/verilator/), which are available in older Ubuntu distributions, may be too outdated to follow the instructions provided below. 3. If you are using an Apple Silicon-based MacBook, you may encounter some problems. Please record them and discuss them with the instructor. ::: ### Install the dependent packages * For macOS ```shell $ brew install verilator gtkwave ``` * For Ubuntu Linux ```shell $ sudo apt install build-essential verilator gtkwave ``` ### Install sbt [sbt](https://www.scala-sbt.org/), short for [Scala](https://www.scala-lang.org/) Build Tool, relies on [Scala](https://www.scala-lang.org/), a JVM (Java Virtual Machine) language that combines object-oriented and functional programming paradigms into a highly concise, logical, and exceptionally robust language. Therefore, before using sbt, you must ensure that a functional JVM is accessible in your environment. - [ ] macOS > Ensure that [Homebrew](https://brew.sh/) is installed. ```shell # Uninstall everything $ brew uninstall sbt $ brew uninstall jenv # Install sdkman $ curl -s "https://get.sdkman.io" | bash $ source "$HOME/.sdkman/bin/sdkman-init.sh" # Install Eclipse Temurin JDK 11 $ sdk install java 11.0.21-tem $ sdk install sbt ``` > :warning: The version number `11` is crucial; otherwise, you may encounter unexpected issues with sbt. - [ ] Linux * Follow [the instructions](https://www.scala-sbt.org/release/docs/Installing-sbt-on-Linux.html). You MUST install Eclipse Temurin JDK 11. You can follow the instructions mentioned above, starting with the `curl` step. > Another approach is to utilize Apptainer. See [champ's node](https://champyen.blogspot.com/2024/11/chisel.html). ## Fundamental Concepts behind Chisel [Chisel](https://www.chisel-lang.org/) is a domain specific language (DSL) implemented using [Scala](https://www.scala-lang.org/)'s macro features. Therefore, **all programming related to circuit logic must be implemented using the macro definitions provided in the [Chisel](https://www.chisel-lang.org/) library**, rather than directly using Scala language keywords. Specifically, when you want to use a constant in a circuit, such as 12, and compare it with a circuit value, you cannot write `if (value == 12)` but instead must write `when(value === 12.U)`. Memorizing this fundamental concept is essential to correctly write valid [Chisel](https://www.chisel-lang.org/) code. > [Chisel Cheatsheet](https://github.com/freechipsproject/chisel-cheatsheet/releases/latest/download/chisel_cheatsheet.pdf) ### Relationship Between Chisel and Verilog [Chisel](https://www.chisel-lang.org/) is not strictly an equivalent replacement for Verilog but rather a generator language. Hardware circuits written in Chisel need to be compiled into Verilog files and then synthesized into actual circuits through EDA (Electronic Design Automation) software. Furthermore, for the sake of generality, some features found in Verilog, such as negative-edge triggering and multi-clock simulation, are not fully supported or not supported at all in Chisel. ![Chisel flow](https://hackmd.io/_uploads/Sk6sD0qXp.png =50%x) ### Illustration of Simple Combinational Logic Let's explore a simple combinational logic example. ![Simple Combinational Logic](https://hackmd.io/_uploads/rybOYN07T.jpg) ```scala // simple logic expression (a & ~b) | (~a & b) ``` Unlike traditional execution models, the logic circuits associated with [Chisel](https://www.chisel-lang.org/) are continuously active, somewhat akin to continuous assignment in Verilog. However, [Chisel](https://www.chisel-lang.org/) takes a departure from Verilog's built-in logic gates and opts for expressions to define logic. In the above example, we introduce "variables" `a` and `b`, which essentially serve as "named wires" due to their roles as inputs to the circuit. It is important to note that other wires within the circuit do not necessitate explicit names. While we have assumed one-bit-wide inputs and generated wires in this instance, these expressions can seamlessly adapt to wider wires, with [Chisel](https://www.chisel-lang.org/)'s logic operators inherently operating in a "bitwise" manner. Furthermore, [Chisel](https://www.chisel-lang.org/) boasts a powerful wire width inference mechanism, enhancing its versatility and simplifying its use in designing complex logic circuits. In the preceding example, the named wires a and b offer the advantage of reusability across multiple locations within the circuit. Likewise, it is possible to assign a name to the output of the circuit: ```scala // simple logic expression val out =(a & ~b) | (~a & b) ``` The keyword `val` is derived from Scala and serves as a means to declare a program variable that can only be assigned once, essentially creating a constant. This approach allows for the generation of a single output value at a specific location within the circuit and subsequently distributes it to other locations where the same output is required. ```scala // fan-out val z = (a & out) | (out & b) ``` Assigning names to wires and employing fanout mechanisms provide a means to reuse a single output value in multiple locations within the generated circuit. Function abstraction enables us to reuse a circuit description effectively. Here is an example of a simple logic function: ```scala def XOR(a: Bits, b: Bits) = (a & ~b) | (~a & b) ``` In this case, the function inputs and output are specified with the type Bits. We will delve deeper into types shortly. When we use the XOR function elsewhere, we essentially create a duplicate of the corresponding logic. You can think of the function as a "constructor" for this logic. ```scala // Constructing multiple copies val z = (x & XOR(x, y)) | (XOR(x, y) & y) ``` ### Basic Types Chisel datatypes are employed to define the type of values that reside in state elements or flow through wires. While hardware circuits fundamentally manipulate vectors of binary digits, employing more abstract representations for values enables clearer specifications and aids in the generation of more efficient circuits by the tools. The basic types: | Types | meaning | |:-----:|:--------| | `Bits` | Raw collection of bits | | `SInt` | Signed integer number | | `UInt` | Unsigned integer number | | `Bool` | Boolean | All signed numbers are represented using 2's complement in Chisel. Chisel also provides support for several higher-order types, including `Bundles` and `Vecs`. - [ ] Unsigned and Signed Integers The unsigned integer type is `UInt`, and the signed integer type is `SInt`. When declaring integers, you can specify the bit width using `.W`, for example, `UInt(8.W)`. If you want to convert an integer from Scala to Chisel's hardware integer, you can use the `.U` operator, for example, `12.U`. The `.U` operator also supports specifying the bit width, for example, `12.U(8.W)`. - [ ] Booleans The boolean type is `Bool`, and you can use `.B` to convert Scala boolean values to hardware boolean values, for example, `true.B`. ### Type Inference While it's useful to keep track of wire types, Scala's type inference allows you to omit type declarations when they are not necessary. For example, in our previous example: ```scala // simple logic expression val out = (a & ~b) | (~a & b) ``` the type of `out` was deduced based on the types of `a` and `b` as well as the operators used. If you want to ensure type clarity or if there is insufficient context for the inference engine, you can always explicitly specify the type like this: ```scala // simple logic expression val out: Bits =(a & ~b) | (~a & b) ``` Additionally, as we will explore later, explicit type declaration becomes necessary in certain situations. ### Bundles and Vecs Chisel Bundles are used to represent groups of wires that have named fields. They function in a manner similar to C's `struct`. In Chisel, Bundles are defined as a class, which is analogous to how classes are defined in languages like C++ and Java. ```scala class FIFOInput extends Bundle { val rdy = Bool(OUTPUT) // Indicates if FIFO has space val data = Bits(INPUT, 32) // The value to be enqueued val enq = Bool(INPUT) // Assert to enqueue data } ``` Chisel provides class methods for Bundles, which means that user-created bundles should "extend" the Bundle class to benefit from automatic connection creation (more on this later). In a Bundle, each field is assigned a name and defined using a constructor of the appropriate type, along with parameters that specify its width and direction. This allows you to create instances of FIFOInput, for example. ```scala val jonsIO = new FIFOInput; ``` You can create nested Bundle definitions and construct hierarchies with them. These Bundles are typically used to define the interface of modules. The Bundle "flip" operator is employed to create an "opposite" Bundle concerning its direction. ```scala class MyFloat extends Bundle { val sign = Bool() val exponent = Bits(width = 8) val significant = Bits(width = 23) } val x = new MyFloat() Val xs = x.sign class BigBundle extends Bundle { val myVec = Vec(5) { SInt(width = 23) } // Vector of 5 23-bit signed integers. val flag = Bool() val f = new MyFloat() // Previously defined bundle. } ``` Bundle and Vec are class types used for organizing and grouping other types. The Vec class, in particular, represents an array of objects of the same type that can be indexed: ```scala val myVec = Vec(5) { SInt(width = 23) } // Vec of 5 23-bit signed integers. val third = myVec(3) // Name one of the 23-bit signed integers ``` Note: Vec is not a memory array; it functions as a collection of wires (or registers). Both Vec and Bundle inherit from the class Data. Any object that ultimately inherits from Data can be represented as a bit vector in a hardware design. ### Literals Literals are values that you specify directly in your source code. Chisel provides type-specific constructors for defining literals. ```scala Bits("ha") // hexadecimal 4-bit literal of type Bits Bits("o12") // octal 4-bit literal of type Bits Bits("b1010") // binary 4-bit literal of type Bits SInt(5) // signed decimal 4-bit literal of type Fix SInt(-8) // negative decimal 4-bit literal of type Fix UInt(5) // unsigned decimal 3-bit literal of type UFix Bool(true) // literals for type Bool, from Scala boolean literals Bool(false) ``` By default, Chisel will determine the width of your literal to be the minimum necessary. Alternatively, you can provide a width value as a second argument if needed. ```scala Bits("ha", 8) // hexadecimal 8-bit literal of type Bits, 0-extended SInt(-5, 32) // 32-bit decimal literal of type Fix, sign-extended SInt(-5, width = 32) // handy if lots of parameters ``` An error will be reported if the specified width value is insufficient. ### Ports A port refers to any Data object whose members have directions assigned to them. Port constructors enable the addition of directions during construction: ```scala class FIFOInput extends Bundle { val rdy = Bool(OUTPUT) val data = Bits(width = 32, OUTPUT) val enq = Bool(INPUT) } ``` The direction of an object can also be set during instantiation: ```scala class ScaleIO extends Bundle { val in = new MyFloat().asInput val scale = new MyFloat().asInput val out = new MyFloat().asOutput } ``` The methods `asInput` and `asOutput` ensure that all components of the data object are set to the specified direction. There are also other methods available for tasks like reversing direction, among others. ### Modules Modules are employed to establish hierarchy within the generated circuit, resembling modules in Verilog. Each module defines a port interface and interconnects subcircuits. These module definitions are essentially class definitions that extend the Chisel Module class. ```scala class Mux2 extends Module { val io = new Bundle { val select = Bits(width=1, dir=INPUT); val in0 = Bits(width=1, dir=INPUT); val in1 = Bits(width=1, dir=INPUT); val out = Bits(width=1, dir=OUTPUT); }; io.out := (io.select & io.in1) | (~io.select & io.in0); } ``` ![mux-1](https://hackmd.io/_uploads/Hk4jYE076.jpg) ![mux-2](https://hackmd.io/_uploads/Bkq3FN0mp.jpg) The Module slot `io` is utilized to store the interface definition, which is of type Bundle. In the above example, `io` is associated with an unnamed Bundle using the `:=` assignment operator. This Chisel operator connects the input on the left-hand side to the output of the circuit on the right-hand side . Each module implicitly includes clock and reset signals in addition to the declared I/O ports. Combinational Logic: ```scala val wire = Wire(UInt(8.W)) val wireinit = WireInit(0.U(8.W)) ``` The most basic type of state element that Chisel supports is a positive-edge-triggered register, which can be functionally instantiated as follows: ```scala val reg = Reg(UInt(8.W)) val reginit = RegInit(0.U(8.W)) ``` This circuit generates an output that replicates the input signal but with a one-clock-cycle delay. It is important to note that we do not need to explicitly specify the type of the register (`Reg`) because Chisel automatically infers it from the input when instantiated in this manner. In Chisel, the clock and reset signals are global and are implicitly included where necessary. ### Hello World in Chisel ```scala class Hello extends Module { val io = IO(new Bundle { val led = Output(UInt(1.W)) }) val CNT_MAX = (50000000 / 2 - 1).U; val cntReg = RegInit(0.U(32.W)) val blkReg = RegInit(0.U(1.W)) cntReg := cntReg + 1.U when(cntReg === CNT_MAX) { cntReg := 0.U blkReg := ~blkReg } io.led := blkReg } ``` --- ## [Chisel](https://www.chisel-lang.org/) Tutorial Getting the Repository: ```shell $ git clone https://github.com/ucb-bar/chisel-tutorial $ cd chisel-tutorial $ git checkout release ``` Before testing your system, ensure that you have sbt (the Scala build tool) installed. ```shell $ sbt run ``` When running for the first time, an internet connection is needed to automatically download necessary components. Reference output: ``` [info] Set current project to chisel-tutorial (in build file:/tmp/chisel-tutorial/) [info] Updating https://repo1.maven.org/maven2/edu/berkeley/cs/chisel-iotesters_2.12/maven-metadata.xml 100.0% [##########] 2.1 KiB (8.7 KiB / s) https://repo1.maven.org/maven2/edu/berkeley/cs/chisel-iotesters_2.12/1.4.1/chisel-iotesters_2.12-1.4.1.pom 100.0% [##########] 2.9 KiB (16.1 KiB / s) ... https://repo1.maven.org/maven2/edu/berkeley/cs/firrtl-interpreter_2.12/1.3.1/firrtl-interpreter_2.12-1.3.1.jar 100.0% [##########] 444.2 KiB (1.5 MiB / s) [info] Fetched artifacts of [info] Compiling 56 Scala sources to /tmp/chisel-tutorial/target/scala-2.12/classes ... https://repo1.maven.org/maven2/org/scala-sbt/util-interface/1.3.0/util-interface-1.3.0.pom 100.0% [##########] 2.7 KiB (16.4 KiB / s) [info] Non-compiled module 'compiler-bridge_2.12' for Scala 2.12.10. Compiling... ``` This will generate and test a basic block called `Hello` that consistently produces the number `42` (represented as `0x2a`). You should observe `[success]` in the final line of the output (from `sbt`) and `PASSED` on the preceding line, indicating that the block has successfully passed the test case. Reference output: ``` test Hello Success: 1 tests passed in 6 cycles taking 0.004980 seconds [info] [0.002] RAN 1 CYCLES PASSED [success] Total time: 2 s, completed ``` > source code: `src/main/scala/hello/Hello.scala` Then, you can run the examples: ```shell $ ./run-examples.sh FullAdder $ ./run-examples.sh Adder4 ``` > source code: `src/main/scala/examples/FullAdder.scala`, `src/main/scala/examples/Adder4.scala` ```shell $ ./run-examples.sh SimpleALU ``` > source code: `src/main/scala/examples/SimpleALU.scala ` Alternatively, you can go through all examples: ```shell $ ./run-examples.sh all ``` ## [Chisel Bootcamp](https://github.com/freechipsproject/chisel-bootcamp) Take your hardware design to the next level, transitioning from instances to generators! This bootcamp will introduce you to [Chisel](https://www.chisel-lang.org/), a hardware construction DSL developed at Berkeley and written in [Scala](https://www.scala-lang.org/). It will also provide you with a grasp of [Scala](https://www.scala-lang.org/) as you progress, and it will focus on teaching [Chisel](https://www.chisel-lang.org/) through the concept of hardware generators. Read the following materials: (required) * [From Chisel to Chips in Fully Open-Source](https://youtu.be/FenSOWKBbAw) * [Chisel Introduction](https://youtu.be/OhMuPQcyynY) * [Chisel Best Practices Intensive](https://youtu.be/e1HRwrNhZhw) [Learn Chisel online!](https://mybinder.org/v2/gh/freechipsproject/chisel-bootcamp/master) + Please run the cell blocks by either pressing SHIFT+ENTER on your keyboard + Caution: You might encounter unforeseen connectivity issues during the exercises. Therefore, it is advisable to install the required packages on your personal computer. ![Chisel Bootcamp](https://hackmd.io/_uploads/BypsOcOXT.png) ### Method 1: Using Docker or nerdctl :::warning :warning: The prebuilt Docker image `ucbbar/chisel-bootcamp` mentioned in [Local Installation - Mac/Linux](https://github.com/freechipsproject/chisel-bootcamp/blob/master/Install.md) is only valid to x86-64 machines. Therefore, we have rebuilt the Docker image `sysprog21/chisel-bootcamp` to address known issues and support both x86-64 and Arm64. ::: Make sure you have Docker [installed](https://docs.docker.com/get-docker/) on your system. * For Ubuntu Linux users, read [Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/) carefully. > Alternatively, use [nerdctl](https://github.com/containerd/nerdctl). * For macOS users, if you prefer not to install Docker Desktop, which can be slow and resource-intensive, you can install the [lima](https://github.com/lima-vm/lima) package. Lima provides access to a full Linux system and comes with built-in integration for [nerdctl](https://github.com/containerd/nerdctl), offering Docker-compatible commands. Simply setup via the following commands: ```shell $ brew install lima $ limactl start ``` Run the following command: (for Linux and macOS + Docker Desktop) ```shell $ docker run -it --rm -p 8888:8888 sysprog21/chisel-bootcamp ``` For macOS users who have already installed [lima](https://github.com/lima-vm/lima) and run `limactl start`, you can run the command. ``` $ lima nerdctl run -it --rm -p 127.0.0.1:8888:8888 sysprog21/chisel-bootcamp ``` This will download a Dokcer image for the bootcamp and run it. The output will end in the following message: ``` To access the notebook, open this file in a browser: file:///home/bootcamp/.local/share/jupyter/runtime/nbserver-6-open.html Or copy and paste one of these URLs: http://79b8df8411f2:8888/?token=LONG_RANDOM_TOKEN or http://127.0.0.1:8888/?token=LONG_RANDOM_TOKEN ``` Next, you can copy the URL provided above, which starts with `http://127.0.0.1:8888/?`, and paste it into your web browser to access the [Jupyter Notebook](https://jupyter.org/)-based interactive computing platform. ### Method 2: Install packages manually See [Local Installation - Mac/Linux](https://github.com/freechipsproject/chisel-bootcamp/blob/master/Install.md) :::warning :warning: You might suffer from unexpected problems with this method. Record and write down the issues accordingly. ::: ### Learning Chisel by doing! :information_source: You should go through the following and complete the exercises: - 1_intro_to_scala - 2.1_first_module - 2.2_comb_logic - 2.3_control_flow - 2.4_sequential_logic - 2.5_exercise - 2.6_chiseltest - 3.1_parameters - 3.2_collections - 3.2_interlude - 3.3_higher-order_functions - 3.4_functional_programming - 3.5_object_oriented_programming - 3.6_types --- ## Single-cycle RISC-V CPU Single-cycle CPU completes the execution of one instruction within a single clock cycle. Because the clock cycle is fixed, all instructions must take the same amount of time to execute as the slowest instruction. This results in poor performance for a single-cycle CPU. A single-cycle CPU runs only one instruction at a time, and there are no conflicts between instructions, making it the simplest to implement. The purpose of this experiment is to help everyone understand the basic structure of the CPU and how it executes instructions. We will first introduce some fundamental concepts, then proceed to build the data path and control unit step by step for instruction execution (there will be coding tasks to complete along the way, please remember to do so), ultimately constructing a simple single-cycle RISC-V processor, named `MyCPU`. ### Implementation Get the repository: ```shell $ git clone https://github.com/sysprog21/ca2024-part3 $ cd ca2024-part3 ``` > :warning: Please be aware that the Scala code in this repository is not entirely complete, as the instructor has omitted certain sections for students to work on independently. To simulate and run tests for this project, execute the following commands under the `ca2024-part3` directory. ```shell $ sbt test ``` > Alternately, run `make`. If you have successfully filled in the implementation with your own efforts, you should encounter the following messages: ``` [info] FibonacciTest: [info] Single Cycle CPU [info] InstructionFetchTest: [info] ByteAccessTest: [info] InstructionDecoderTest: [info] ExecuteTest: [info] InstructionDecoder of Single Cycle CPU [info] Single Cycle CPU [info] InstructionFetch of Single Cycle CPU [info] - should recursively calculate Fibonacci(10) [info] - should fetch instruction [info] - should store and load a single byte [info] - should produce correct control signal [info] Execution of Single Cycle CPU [info] - should execute correctly [info] QuicksortTest: [info] Single Cycle CPU [info] - should perform a quicksort on 10 numbers [info] RegisterFileTest: [info] Register File of Single Cycle CPU [info] - should read the written content [info] - should x0 always be zero [info] - should read the writing content [info] Run completed in 31 seconds, 280 milliseconds. ``` Naturally, if you have not made any efforts to improve this implementation, all test cases will fail as shown below: ``` [info] *** 6 TESTS FAILED *** [error] Failed tests: [error] riscv.singlecycle.InstructionDecoderTest [error] riscv.singlecycle.ByteAccessTest [error] riscv.singlecycle.InstructionFetchTest [error] riscv.singlecycle.ExecuteTest [error] riscv.singlecycle.FibonacciTest [error] riscv.singlecycle.QuicksortTest [error] (Test / test) sbt.TestsFailedException: Tests unsuccessful ``` If you want to run a single test, such as running only `InstructionDecoderTest`, execute the following command: ```shell $ sbt "testOnly riscv.singlecycle.InstructionDecoderTest" ``` ### Single-cycle CPU architecture diagram - [ ] Full ![Single-cycle CPU architecture](https://hackmd.io/_uploads/SJzK891ra.png) - [ ] Simplifed ![Simplifed view for single-cycle CPU](https://hackmd.io/_uploads/S1R9lQYmp.png =80%x) ### Data Path The path through which data moves between functional units is known as the data path. The elements situated along this route that perform operations on or hold data are called data path components, which include the ALU (Arithmetic-Logic Unit), general-purpose registers, memory, and so on. The data path illustrates the various pathways through which data transitions from one component to another. ![synchronous logic design](https://hackmd.io/_uploads/r1bqyEYQT.png =50%x) > The processor employs synchronous logic design with a clock. ### Control Signals ![Control Signals](https://hackmd.io/_uploads/ByWXo7Y7p.png =85%x) As the name implies, control signals govern the data path. Whenever a choice must be made, the control unit is responsible for making the correct decision and dispatching control signals to the relevant data path components. For instance, should the ALU perform addition or subtraction? Are we reading from or writing to memory? So, how does the controller determine what action to take? This primarily relies on the instruction currently in execution. In the RISC-V instruction format, the controller discerns the appropriate decisions by examining the opcode, funct3, and funct7 fields of the instruction, thus emitting the correct control signals. In the CPU schematic, the Decoder, ALUControl, and JumpJudge components can all be regarded as constituents of the control unit. They receive instructions and generate control signals. These control signals serve as guides within the data path to ensure the precise execution of instructions. ### Combinational Units and State Units In digital circuits, there are two main types of circuits: combinational logic and sequential logic. In CPU design, units composed of these two types of circuits are called combinational units and state units, respectively. In this experiment, only the registers belong to the sequential logic (memory is not within the scope of the CPU core), while the rest are combinational units. - [ ] Combinational Logic The output depends only on the current input and does not require a clock as a triggering condition. Inputs are reflected immediately in the outputs (ignoring delay). ![Combinational logic](https://hackmd.io/_uploads/rJTLbEYQT.png =70%x) Combinational logic processes data during clock cycles: 1. Between clock edges. 2. Taking input from state elements and providing output to state elements. 3. The clock period is determined by the longest delay in this process. - [ ] Sequential (state) Logic These units store state and use the clock as a triggering condition. Inputs are reflected in the outputs when the clock's rising edge arrives. Register with write control: * Only updates on the clock edge when the write control input is 1. * Used when the stored value is required later ![Sequential Logic](https://hackmd.io/_uploads/ryvh-4FQa.png) ### Overview of Implementation The RISC-V CPU we are designing can execute a core subset of RISC-V instructions (RV32I): 1. Arithmetic and Logic Instructions: `add`, `sub`, `slt`, etc. 2. Memory Access Instructions: `lb`, `lw`, `sb`, etc. 3. Branch Instructions: `beq`, `jar`, etc. We will divide the instruction execution into five different stages: 1. Instruction Fetch: Fetching the instruction data from memory. 2. Decode: Understanding the meaning of the instruction and reading register data. 3. Execute: Calculating the result using the ALU. 4. Memory Access (load/store instructions): Reading from and writing to memory. 5. Write-back (for all instructions except store): Writing the result back to registers. ![Stages of Execution on Datapath](https://hackmd.io/_uploads/Hkkf-XtQT.png =85%x) Now, let's build data path components step by step according to the above stages, and then instantiate and connect these data path components in the CPU's top-level module. (The code related to this is located in the `src/main/scala/riscv` directory.) ### Instruction Fetch > Code can be found in `src/main/scala/riscv/core/InstructionFetch.scala`. ![Instruction Fetch](https://hackmd.io/_uploads/H1qFBQYQp.png =85%x) What the instruction fetch stage does: * Fetch the instruction from memory based on the current address in the PC register. * Modify the value of the PC register to point to the next instruction. ```scala val pc = RegInit(ProgramCounter.EntryAddress) when(io.instruction_valid) { io.instruction := io.instruction_read_data // part3(InstructionDecode) // ... ``` First, the value of the PC (program counter) register is initialized to the entry address of the program. When an instruction is valid, the current instruction pointed to by the PC is fetched. If a jump is required, the PC is directed to the jump address; otherwise, it is incremented to `PC + 4`. :::info :information_source: Please fill in the code at the `// part3(InstructionFetch)` comment section in `src/main/scala/riscv/core/InstructionFetch.scala` so that it can pass the `InstructionFetchTest` unit test. ::: ### Instruction Decode > The code is located in `src/main/scala/riscv/core/InstructionDecode.scala`. What the decode stage does: * Read the opcode to determine instruction type and field lengths * Read in data from all necessary registers - for `add`, read two registers - for `addi`, read one register - for `jal`, no reads are necessary * Output control signals. ![Instruction Decode](https://hackmd.io/_uploads/B12wQmFXp.png) ```scala val rs1 = io.instruction(19, 15) val rs2 = io.instruction(24, 20) io.regs_reg1_read_address := Mux(opcode === Instructions.lui, 0.U(Parameters.PhysicalRegisterAddrWidth), rs1) io.regs_reg2_read_address := rs2 ``` The above code extracts the register operand numbers from the instruction. In all cases, except when the instruction is `lui`, register 1 is set to register 0, and register 2 is set to the values from the rs1~(19:15)~ and rs2~(24:20)~ fields of the instruction, respectively. :::info :information_source: Allocating a separate register for the constant 0 simplifies the RISC-V ISA, for example, allowing assignment instructions to be replaced with addition instructions using an operand of 0. ::: ```scala val immediate = MuxLookup( opcode, Cat(Fill(20, io.instruction(31)), io.instruction(31, 20)), IndexedSeq( InstructionTypes.I -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)), InstructionTypes.L -> Cat(Fill(21, io.instruction(31)), io.instruction(30, 20)), ``` The above code extracts immediate values. Since the position of immediate values varies for different instruction types, it is necessary to differentiate instruction types using the opcode and then extract the corresponding immediate value. ```scala object ALUOp1Source { val Register = 0.U(1.W) val InstructionAddress = 1.U(1.W) } // ... io.ex_aluop1_source := Mux( opcode === Instructions.auipc || opcode === InstructionTypes.B || opcode === Instructions.jal, ALUOp1Source.InstructionAddress, ALUOp1Source.Register ) ``` Taking `ex_aluop1_source` control signal as an example, this control signal determines the input for the first operand of the ALU. It assigns a value to `ex_aluop1_source` based on the opcode. When the instruction type is either `auipc`, `jal`, or B, `ex_aluop1_source` is set to 0, controlling the ALU's first operand input to be the instruction address. In other cases, `ex_aluop1_source` is set to 1, controlling the ALU's first operand input to be a register. As you can see, the design of the decoding unit is also simple combinational logic. Knowing the mapping relationship between control signals and instructions is sufficient to complete it. Next, please complete the code for assigning values to the four control signals: `ex_aluop2_source`, `io.memory_read_enable`, `io.memory_write_enable`, and `io.wb_reg_write_source`. | control signals | meanings | |:----------------------|:---------------------------------| | `ex_aluop2_source` | ALU Input Source Selection | | `memory_read_enable` | Memory Read Enable | | `memory_write_enable` | Memory Write Enable | | `wb_reg_write_source` | Write-Back Data Source Selection | :::info :information_source: Please insert the code at the `// part3(InstructionDecode)` comment section in `src/main/scala/riscv/core/InstructionDecode.scala` to make it pass the `InstructionDecoderTest` unit test. ::: ### Execution > The code is located in `src/main/scala/riscv/core/Execute.scala`. What the execution stage does: * Perform ALU computation. * Determine if there is a branch. ```scala val alu = Module(new ALU) val alu_ctrl = Module(new ALUControl) // ... io.mem_alu_result := alu.io.result ``` The above code instantiates ALU and ALUControl within the Execute module. The specific ALU computation logic is implemented in the ALU module, and here, you only need to assign values to the input ports of ALU. The code for ALU can be found in `src/main/scala/riscv/core/ALU.scala`. :::info :information_source: Please insert the code at the `// part3(Execute)` comment section in `src/main/scala/riscv/core/Execute.scala` to make it pass the `ExecuteTest` unit test. ::: ```scala io.if_jump_flag := (opcode === Instructions.jal) || (opcode === Instructions.jalr) || (opcode === InstructionTypes.B) && MuxLookup( funct3, false.B, IndexedSeq( InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data), // ... ) ``` The above code determines whether a jump should be taken. The logic for determining a jump is as follows: if it is an unconditional jump instruction, it jumps directly (e.g., `jal` and `jalr` instructions in the code above). If it is a branch instruction, it checks the corresponding jump condition to decide whether to jump (e.g., the `beq` instruction in the code above, which jumps when `io.reg1_data` is equal to `io.reg2_data`). When a jump is taken, the control signal `if_jump_flag` is set to 1. ### Memory Access > The code is located in `src/main/scala/riscv/core/MemoryAccess.scala`. Only load/store instructions have a memory access stage. The other instructions remain idle during this stage or skip it all together. In this stage, reading loads data from memory into registers, and writing stores data from registers into memory. In the decode stage, if it is an L-type (I-type subtype) instruction, `memory_read_enable` is set to 1, and if it is an S-type instruction, `memory_write_enable` is set to 1. These two control signals determine whether reading or writing should occur in this stage. ```scala val mem_address_index = io.alu_result(log2Up(Parameters.WordSize) - 1, 0).asUInt ``` First, the address for reading or writing memory is obtained. When `io.memory_read_enable` is true: The code reads data from the memory bus and processes it differently based on the instruction type (e.g., `lb` sign-extends, `lbu` zero-extends, `lh` reads two bytes, etc.). The processed data is then assigned to io.`wb_memory_read_data` for writing back. ```scala .elsewhen(io.memory_write_enable) { io.memory_bundle.write_data := io.reg2_data io.memory_bundle.write_enable := true.B io.memory_bundle.write_strobe := VecInit(Seq.fill(Parameters.WordSize)(false.B)) when(io.funct3 === InstructionsTypeS.sb) { io.memory_bundle.write_strobe(mem_address_index) := true.B io.memory_bundle.write_data := io.reg2_data(Parameters.ByteBits, 0) << (mem_address_index << log2Up(Parameters.ByteBits).U) }.elsewhen(io.funct3 === InstructionsTypeS.sh) { // ... ``` When `io.memory_write_enable` is true: The code writes data to memory, and the data is processed differently based on the instruction type (e.g., `sw` takes 32 bits, `sh` takes 16 bits, and `sb` takes 8 bits). ### Write-Back > The code is located in `src/main/scala/riscv/core/WriteBack.scala`. In the write-back stage, the computed data or data read from memory is written into registers. The write-back module is essentially a multiplexer, and the code is very simple. However, it raises an interesting question: the write-enable signal is generated in the decode stage, but at that point, the correct write-back data has not been calculated (or read from memory). So, will incorrect write-back data be written into the register file, and why? ### Combining into a CPU > The code is located in `src/main/scala/riscv/core/CPU.scala`. We have implemented all the components required to build the CPU. Now, we need to instantiate and connect these components together according to the single-cycle CPU architecture diagram. ```scala class CPU extends Module { val io = IO(new CPUBundle) // CPUBundle is the channel for data exchange between the CPU and peripherals like memory. val regs = Module(new RegisterFile) val inst_fetch = Module(new InstructionFetch) val id = Module(new InstructionDecode) val ex = Module(new Execute) val mem = Module(new MemoryAccess) val wb = Module(new WriteBack) .. // Here, we instantiate modules for different execution stages. inst_fetch.io.jump_address_id := ex.io.if_jump_address inst_fetch.io.jump_flag_id := ex.io.if_jump_flag // Taking the two lines above as an example, you can see the corresponding connections in the CPU schematic. } ``` In the code above, we instantiate modules for various execution stages and then establish connections between them to create the CPU. ![JumpAddr/Flag](https://hackmd.io/_uploads/Sk27GZvm6.png) Please observe the input port code of the Execute module and the CPU architecture diagram, and fill in the connections between the inputs of the Execute module and the outputs of other modules. :::info :information_source: Please insert the code at the `// part3(CPU)` comment section in `src/main/scala/riscv/core/CPU.scala` to make it pass the corresponding unit tests. ::: ### Functional Test The process by which `sbt test` activates the test cases for validating the CPU implementation relies on [Chiseltest](https://index.scala-lang.org/ucb-bar/chiseltest), an extensive testing and formal verification library designed for Chisel-based RTL (Register-Transfer Level) designs. Chiseltest places a strong emphasis on creating tests that are lightweight, promoting minimal boilerplate code, easy to read and write for enhanced understandability, and conducive to test code reuse through composability. Provided a C program that computes the [Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_sequence), its source code is presented below. ```c static int fib(int a) { if (a == 1 || a == 2) return 1; return fib(a - 1) + fib(a - 2); } int main() { *((volatile int *) (4)) = fib(10); return 0; } ``` > File: csrc/fibonacci.c Within the `main` function, the line `*((volatile int *) (4)) = fib(10)` stores the Fibonacci(10) result in the memory address `4`, which can be later verified using a Chiseltest-based test case. ```scala class FibonacciTest extends AnyFlatSpec with ChiselScalatestTester { behavior.of("Single Cycle CPU") it should "calculate recursively fibonacci(10)" in { test(new TestTopModule("fibonacci.asmbin")).withAnnotations(TestAnnotations.annos) { c => for (i <- 1 to 50) { c.clock.step(1000) c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout } c.io.mem_debug_read_address.poke(4.U) c.clock.step() c.io.mem_debug_read_data.expect(55.U) } } } ``` > File: src/test/scala/riscv/singlecycle/CPUTest.scala Given that Fibonacci(10) is known to be `55`, this test case straightforwardly verifies the content of the designated memory region after our CPU executes the instructions. ### Waveform Before burning the board for verification, you can perform another round of testing using waveform simulation. Generating Waveform Files During Testing: While running tests, if you set the environment variable `WRITE_VCD` to 1, waveform files will be generated. ```shell $ WRITE_VCD=1 sbt test ``` Afterward, you can find `.vcd` files in various subdirectories under the `test_run_dir` directory. You can open them using [Surfer](https://surfer-project.org/). ![Waveform](https://hackmd.io/_uploads/Hy9DrWD7T.png) On the left side of the interface, there is a tree-like structure organized by modules. To add a signal to the window, right-click and select `Insert` from the menu. ### Verilator [Verilator](https://www.veripool.org/verilator/) is a tool that compiles Verilog and SystemVerilog sources into highly optimized C++ or SystemC code, which can be used for verification and modeling in C++ or SystemC testbenches. For more details, please refer to the official [Verilator website](https://www.veripool.org/verilator/) and [its manual](https://verilator.org/guide/latest/). Why do we choose [Verilator](https://www.veripool.org/verilator/)? It is a high-performance, open-source Verilog/SystemVerilog simulator. While it is powerful and fast, it is not a direct replacement for event-based simulators such as Modelsim and Vivado Xsim. Verilator operates on a cycle-based simulation model, which means it does not simulate exact circuit timing within a single clock cycle and may not capture intra-period glitches. While this approach has its advantages, it also has limitations compared to other simulators. If you want to quickly test your own written programs, you can use Verilator for simulation. The main simulation function is already written and located in `verilog/verilator/sim_main.cpp`. After the first run and every time you modify the Chisel code, you need to execute the following command in the project's root directory to generate Verilog files: ```shell $ make verilator ``` > :warning: If you don't modify the Scala code as mentioned earlier, the Verilog generation will fail. After compilation, an executable file named `verilog/verilator/obj_dir/VTop` will be generated. This executable file can take parameters to run different code files. Here are the parameters and their usage: | Parameter | Usage | |:----------|:------| | `-memory` | Specify the size of the simulation memory in words (4 bytes each).<br> Example: `-memory 4096` | | `-instruction` | Specify the RISC-V program used to initialize the simulation memory.<br>Example: `-instruction src/main/resources/hello .asmbin` | | `-signature` | Specify the memory range and destination file to output after simulation.<br>Example: `-signature 0x100 0x200 mem.txt` | | `-halt` | Specify the halt identifier address; writing `0xBABECAFE` to this memory address stops the simulation.<br>Example: `-halt 0x8000` | | `-vcd` | Specify the filename for saving the simulation waveform during the process; not specifying this parameter will not generate a waveform file.<br>Example: `-vcd dump.vcd` | | `-time` | Specify the maximum simulation time; note that time is **twice** the number of cycles.<br>Example: `-time 1000` | For example, to load the `hello.asmbin` file, simulate for 1000 cycles, and save the simulation waveform to the `dump.vcd` file, you can run: ```shell $ ./run-verilator.sh -instruction src/main/resources/hello.asmbin -time 2000 -vcd dump.vcd ``` > :walking: Keep in mind that a time value of **2000** corresponds to simulating 1000 cycles. Then, run `gtkwave dump.vcd` to check its waveform. ![Waveform of hello](https://hackmd.io/_uploads/S1tl3yF7T.png) You can observe that the signal io_instruction begins with `000000000` and `00001137`. In the meantime, let's verify the hexadecimal representation of `hello.asmbin`: ```shell $ hexdump src/main/resources/hello.asmbin | head -1 ``` Its output: ``` 0000000 1137 0000 1097 0000 80e7 8b00 006f 0000 ``` It aligns with the expected waveform. :walking: You may wonder: why are not the modules/IO ports/registers defined in Chisel visible in the generated Verilog code (or waveform)? > This occurs because Chisel initially generates [FIRRTL](https://github.com/chipsalliance/firrtl) code and subsequently applies optimization steps, including logical simplification, constant propagation, and dead code elimination, to the FIRRTL code. The final Verilog code is then generated based on the optimized [FIRRTL](https://github.com/chipsalliance/firrtl) representation. If the modules/IO ports/registers you created in Chisel do not appear in the generated Verilog code, it is recommended to check for proper module connections and potential logic issues that could lead to constant values in certain registers, among other possible reasons. ## Prepare Programs to Run on MyCPU If no specific argument is provided to the GNU toolchain, the default linker script is utilized for the linking process. However, for more granular control over the linking process, it is possible to create a custom linker script. You can designate a linker script using the `-T` option, as demonstrated below: ```shell $ riscv-none-elf-gcc -T link.lds hello.o -o hello ``` The content of the linker script would look something like this: ``` OUTPUT_ARCH( "riscv" ) ENTRY(_start) SECTIONS { . = 0x00001000; .text : { *(.text.init) *(.text.startup) *(.text) } .data ALIGN(0x1000) : { *(.data*) *(.rodata*) *(.sdata*) } . = 0x00100000; .bss : { *(.bss) } _end = .; } ``` > File: `csrc/link.lds` The first line of the linker script specifies the output instruction format, which is RISC-V in this case. The second line specifies the program's entry address, which is the `_start` function. Starting from the third line, it defines the positions of various segments in the program. For example, the `.text` segment starts at address `0x00001000`, the `.data` segment starts after `.text` and is aligned to address `0x1000`, and the `.bss` segment starts at address 0x00100000. The last line specifies the program's end address, denoted as `_end`. To regenerate the RISC-V programs utilized for unit tests, change to the `csrc` directory and run the `make update` command. Ensure that the `$PATH` environment variable is correctly configured to include the GNU toolchain for RISC-V. ```shell! $ cd csrc $ make update ``` > See [part2: RISC-V RV32I[MACF] emulator with ELF support](https://hackmd.io/@sysprog/SJAR5XMmi). [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) (Executable and Linkable Format) is an executable file format commonly used in Linux systems. While a detailed understanding of the ELF format is not necessary, having a general idea of its structure can be helpful for this experiment. An ELF file stores metadata about a program, including the entry address, segment information, symbol information, and more. Among these, segment information is crucial as it contains the program's code and data. Typically, the code segment is named `.text`, and the data segment is named `.data`. These segments are loaded into memory by the operating system for program execution and data access. In our experiment, since we do not have an operating system, we use Chisel code to specify the CPU's entry address and load the code and data segments into memory for direct program execution. These code and data segments are flashed into the FPGA logic as initialization data during the synthesis phase. The `InstructionROM` module is responsible for generating this initialization data, and the `ROMLoader` module handles the loading of data into memory. After this data copying process is complete, the CPU can start executing the program. The process of generating an executable file for the CPU involves the following steps: 1. Only the code and data segments from the ELF file are required, while other segments can be disregarded. In the linker script, these two segments are allocated to contiguous addresses to streamline the implementation. 2. The code segment commences from address `0x1000`, with the space before it reserved for the program's stack. 3. The `objcopy` tool is employed to duplicate the code and data segments into a distinct file, resulting in a file containing solely binary code and data. --- ## Interrupts After completing the single-cycle CPU experiment, you have developed a processor that can execute instructions as expected. However, this simple CPU can only run continuously according to predefined program instructions and cannot be interrupted midway. In our uncertain world, a practical CPU must always be prepared to handle external events, process interrupts in a timely manner, and return to the original program to continue execution. In this experiment, you will learn: - **CSR Registers** and their operation instructions - The principles and design of the **Interrupt Controller** - How to implement a simple **timer-based interrupt generator** Whether using an IDE or executing commands, the root directory is the `part2` directory. ![MyCPU](https://hackmd.io/_uploads/HJN0-RiXJx.png) ### CSR Register Operations From the preliminary knowledge on **interrupts and exceptions**, we have learned the basic concepts of CSR registers. In this experiment, we will extend the single-cycle CPU to support instructions that operate on CSR registers. Details about the CSR-related instruction set, including instruction semantics and encoding, can be found in **Chapter 9 of the Unprivileged ISA Specification**. For the content and meaning of interrupt-related CSR registers, please refer to the **Privilege ISA Specification**. Based on the experience of implementing a single-cycle CPU, we can break down the support for CSR instructions into the following: 1. **CSR Register Set**: CSR registers are a set of independently addressable registers, similar to the `RegisterFile` register group, with an address space of 4096 bytes. From the instruction manual, we can see that operations on CSR registers are atomic read-modify-write (RMW). The specific semantics of CSR instructions can be found in the manual. The CSR register set needs to address the internal registers, fetch their content, and modify them based on the control signals and CSR register addresses provided by the ID decoding module. 2. **ID Decoding Unit**: The ID decoding unit must recognize CSR instructions and generate corresponding control signals and data to other modules based on the instruction semantics and encoding standards described in the manual. 3. **EX Execution Unit**: CSR instructions are atomic RMW operations. The execution result of a single instruction must write the original content of the target CSR register into a target general-purpose register and modify the content of the target CSR register according to the instruction semantics. During this process, the ALU unit in the EX stage is idle. The ALU can either be reused to obtain the value to write into the CSR register or not reused. 4. **WB Write-Back Unit**: After supporting CSR-related instructions, the data source written back to the target general-purpose register includes the original value read from the target CSR register. ### Interrupt Controller (CLINT) In simple terms, the interrupt controller detects external interrupts. When an interrupt occurs and is enabled, it interrupts the CPU’s current execution flow, sets the appropriate CSR register information, and jumps to the interrupt handler to execute the interrupt service routine. The key is determining what information to save into the corresponding CSR registers. The answer is the CPU state after executing the current instruction. For example, if the current instruction is a jump instruction, the `mepc` register should save the jump target address of the current jump instruction. Some special cases need to be considered. We know that external interrupt enablement is determined by the `mstatus` register. If the current instruction modifies `mstatus` to disable interrupts, and an external interrupt arrives during its execution, should the next cycle respond to the interrupt? For consistency, we consider that such a situation should not respond to the interrupt, and the CPU should complete the current instruction execution before jumping to the interrupt handler. ### Responding to (Hardware) Interrupts Responding to interrupts consists of two parts: 1. **Save the CPU’s next state information** in one cycle by writing the relevant content into the appropriate registers. Specifically: - `mepc`: Saves the address from which the CPU will resume execution after completing the interrupt or exception handling. - `mcause`: Records the cause of the interrupt or exception. Details can be found in the Privilege ISA Specification. - `mstatus`: During interrupt handling, set the `MPIE` bit in the `mstatus` register to 0 to disable interrupts. 2. Retrieve the interrupt handler’s address from the `mtvec` register and jump to that address for further interrupt handling. ### (Hardware) Interrupt Return After handling the interrupt (`mret`), the CPU execution flow needs to be restored. This process is similar to responding to interrupts, but the registers written are limited to `mstatus`, and the target address for jumping is fetched from `mepc`. For simplicity, the value written into `mstatus` during `mret` sets the `MIE` bit equal to the `MPIE` bit. Thus, if `MPIE` is 1, `mret` will re-enable interrupts; if `MPIE` is 0, `mret` will not change the value of `mstatus`. This design does not support interrupt nesting. When implementing privilege level switching in the future, changes to `mstatus` will be more complex, but all standards are clearly defined in the manual. To explore the implementation of the interrupt mechanism further, refer to Section 3.1.6.1 of the Privilege ISA Specification. Exception context saving and restoration are similar to interrupt handling, with slight differences in the content saved and restored. Students interested in exploring these differences can investigate further on their own. ### CLINT Implementation There are many ways to implement CLINT. For simplicity, we use a pure combinational logic design for the interrupt controller (the main repository of MyCPU uses a state machine to implement CLINT). Since the CLINT is based on a single-cycle CPU and uses combinational logic, it immediately responds to external interrupts when they occur. CLINT requires the ability to modify the contents of multiple registers in one cycle. Normal CSR instructions, however, can only perform RMW operations on one register at a time. Therefore, there is an independent, higher-priority pathway between CLINT and the CSR module to quickly update CSR register values. ### Simple Timer-Based Interrupt Generator We will implement a **MMIO-based timer interrupt generator**. **MMIO** (Memory-Mapped I/O) refers to peripherals that interact with the CPU using registers mapped into the memory address space. The CPU modifies the values of these registers using load/store instructions, enabling interaction with peripherals. Without implementing a bus, MMIO can be achieved with multiplexers. This is because instruction fetch operations and load/store memory access operations are separated, allowing memory to have a dedicated pathway for instruction fetches. Therefore, our model involves a CPU interacting with multiple peripherals without conflicts between instruction fetch and memory access operations. The logic address emitted by the CPU determines the target device based on the high bits of the address, while the low bits are used for internal device addressing. Additionally, the timer interrupt generator includes two control registers: - **Enable Register**: Controls whether the timer interrupt generator is enabled. If `false`, no interrupts are generated. This register is mapped to logical address `0x80000008`. - **Limit Register**: Controls the interrupt interval for the timer. This register is mapped to logical address `0x80000004`. The interrupt generator contains an incrementing counter that triggers an interrupt signal when its value reaches the `limit` (provided `enable` is true). Note: The duration of the interrupt signal is less critical, but it must be at least one CPU clock cycle to ensure the CPU correctly captures the signal. ### Experimental Tasks - [ ] Ensure the EX execution unit can correctly obtain data to write into CSR registers when handling CSR instructions. ![image](https://hackmd.io/_uploads/r1izf0sQyl.png) CSR Instruction Semantics - **[csr]**: Represents the value stored in the CSR register indexed by `csr` in the CSR register set. - **(rd)**: Represents the value stored in the general-purpose register indexed by `rd`. CSRRW (Atomic Read/Write CSR): Used to exchange values between a CSR register and a general-purpose register. - **func3 = 001** - If `rd` is not register 0: 1. Zero-extend `[csr]` to XLEN bits and write it to `rd`, i.e., `(rd) = (zero-extended)[csr]` 2. Update the CSR register value: `[csr] = (rs1)` - If `rd` is register 0: - Only execute `[csr] = (rs1)`. CSRRS (Atomic Read and Set Bits in CSR): Used to read the value of a CSR register. - **func3 = 010** - Zero-extend `[csr]` to XLEN bits and write it to `rd`, i.e., `(rd) = (zero-extended)[csr]`. - If `rs1` is not register 0: - Perform a bitwise OR operation: `[csr] = [csr] | (rs1)`. CSRRC (Atomic Read and Clear Bits in CSR) - **func3 = 011** - Zero-extend `[csr]` to XLEN bits and write it to `rd`, i.e., `(rd) = (zero-extended)[csr]`. - If `rs1` is not register 0: - Perform a bitwise AND operation with the complement of `rs1`: `[csr] = [csr] & ~(rs1)`. CSRRWI (Atomic Read/Write CSR Immediate) - **func3 = 101** - If `rd` is not register 0: 1. Zero-extend `[csr]` to XLEN bits and write it to `rd`, i.e., `(rd) = (zero-extended)[csr]`. 2. Update the CSR register value: `[csr] = (zero-extended)(rs1)`. - If `rd` is register 0: - Only execute `[csr] = (zero-extended)(rs1)`. CSRRSI (Atomic Read and Set Bits in CSR Immediate): Used to read the value of a CSR register. - **func3 = 110** - Zero-extend `[csr]` to XLEN bits and write it to `rd`, i.e., `(rd) = (zero-extended)[csr]`. - If `rs1` is not 0: - Perform a bitwise OR operation: `[csr] = [csr] | (zero-extended)(rs1)`. CSRRCI (Atomic Read and Clear Bits in CSR Immediate) - **func3 = 111** - Zero-extend `[csr]` to XLEN bits and write it to `rd`, i.e., `(rd) = (zero-extended)[csr]`. - If `rs1` is not 0: - Perform a bitwise AND operation with the complement of `rs1`: `[csr] = [csr] & ~(zero-extended)(rs1)`. ![image](https://hackmd.io/_uploads/H1qjSRsmJl.png) - [ ] Ensure the CSR register set can correctly support read/write operations from both CLINT and CSR instructions. MSTATUS register ![image](https://hackmd.io/_uploads/S1AprCsmyg.png) Test Cases and Waveform Diagrams 1. Checking Interrupt Assertions and Handling: - Set `interrupt_flag` to `1`: - Verify that `interrupt_assert` is `true`. - Check if `interrupt_handler_address` equals the value of `MTVEC` (previously set to `0x1144`). - Set `jump_flag` to `false`: - Verify that `MEPC` saves the value of `PC + 4`, i.e., `0x1904`. - Check `MCAUSE`: - Since `interrupt_flag = 1`, `MCAUSE` should equal `0x80000007L`. - Check `MSTATUS`: - Verify that it changes to `0x1880L` (sets `MIE` from `1` to `0`, disabling interrupts). Waveform Image: ![image](https://hackmd.io/_uploads/rkNI80imJx.png) 2. When encountering the `mret` instruction: - Expectations: - `interrupt_assert` should be `true`. - `interrupt_handler_address` should equal the value of `MEPC`, i.e., `0x1904`. - Check `MSTATUS`: - Verify that it restores to `0x1888L` (sets `MIE` from `0` to `1`, enabling interrupts). 3. Testing Interrupts during a Jump: - Set `interrupt_flag` to `2`: - Verify that `interrupt_assert` is `true`. - Check if `interrupt_handler_address` equals the value of `MTVEC` (previously set to `0x1144`). - Set `jump_flag` to `true`: - Verify that `MEPC` saves the value of `jump_address`, i.e., `0x1990`. - Check `MCAUSE`: - Since `interrupt_flag = 2`, `MCAUSE` should equal `0x8000000BL`. - Check `MSTATUS`: - Verify that it changes to `0x1880L` (sets `MIE` from `1` to `0`, disabling interrupts). - When encountering the `mret` instruction: - Expectations: - `interrupt_assert` should be `true`. - `interrupt_handler_address` should equal the value of `MEPC`, i.e., `0x1990`. - Check `MSTATUS`: - Verify that it restores to `0x1888L` (sets `MIE` from `0` to `1`, enabling interrupts). 4. Testing Interrupts When Interrupts Are Disabled: - Set `MSTATUS` to `0x1880` (interrupts disabled): - Despite `interrupt_flag = 1`, verify that `interrupt_assert` is `false` since interrupts are masked. ![image](https://hackmd.io/_uploads/Hy7KICjXyx.png) - [ ] Ensure the timer interrupt generator correctly generates interrupt signals and implements MMIO for the timer registers. Test Cases 1. Write `0x990315` to memory address `0x4.U` (the memory space corresponding to the `limit` register). - Read the value of the `limit` register and check if it is `0x990315`. 2. Write `0` to memory address `0x8.U` (the memory space corresponding to the `enabled` register). - Read the value of the `enabled` register and check if it is `false`. ![image](https://hackmd.io/_uploads/rJKkwCsQ1l.png) At 3ps, `io_debug_limit` changes to `0x990315`, indicating that `0x4.U` (the memory space corresponding to the `limit` register) was successfully written with `0x990315`. At 7ps, `io_debug_enabled` changes to `0`, indicating that `0x8.U` (the memory space corresponding to the `enabled` register) was successfully written with `0`. - [ ] Ensure CLINT correctly responds to interrupts and resumes the original execution flow after the interrupt ends. Test Cases 1. Run `fibonacci.asmbin` to recursively calculate the 10th Fibonacci number. Verify that the result is `55`. 2. Run `quicksort.asmbin` to execute a quicksort operation. If successful, memory starting at address `4` should contain an array of length 10 with values ranging from `0` to `9`. 3. Run `sb.asmbin` to verify the correctness of register read and write functionality. 4. Run `simpletest.asmbin` to test the interrupt functionality. The following is an analysis of the fourth test: Jump to Trap Handler and Return Based on the code in `simpletest.c`, this program first writes the value `0xDEADBEEF` to memory address `0x4`. Then, it uses the `enable_interrupt()` function to allow interrupts and defines the interrupt handler function `trap_handler`, which writes the value `0x2022` to memory address `0x4`. 1. Allow the program to run for 1000 cycles and then read the value at memory address `0x4` to verify it is `0xDEADBEEF`. As shown in the diagram below, at 2003ps, `mem_debug_read_data = 0xDEADBEEF` is observed. ![image](https://hackmd.io/_uploads/BkR9PComyl.png) 2. Set `interrupt_flag` to `0x1` to trigger an interrupt. The program should jump to execute the `trap_handler` function. After another 1000 cycles, check the values of the `MSTATUS` and `MCAUSE` registers. ![image](https://hackmd.io/_uploads/HkpsDRoQke.png) As shown in the diagram below, `MCAUSE = 0x80000007`, which matches expectations. ![image](https://hackmd.io/_uploads/rJbJ_RjQ1l.png) 3. Finally, read the value at memory address `0x4` to verify whether the interrupt handler executed correctly. The result is `0x2022`, which matches expectations, as shown in the diagram below. ![image](https://hackmd.io/_uploads/H1WxuAjQJg.png) ### Code Files - **EX Execution Unit**: `src/main/scala/riscv/core/Execute.scala` - **CSR Register Set**: `src/main/scala/riscv/core/CSR.scala` - **CLINT**: `src/main/scala/riscv/core/CLINT.scala` - **Timer**: `src/main/scala/riscv/peripheral/Timer.scala` Fill in the corresponding code at the `// part2(CLINTCSR)` comments in the files above to pass the tests `CPUTest`, `ExecuteTest`, `CLINTCSRTest`, and `TimerTest`. --- ## Pipelined CPU ![截圖 2024-12-03 上午11.09.32](https://hackmd.io/_uploads/B1nKUgnmyg.png) After completing the single-cycle CPU experiment, you should now have a basic understanding of CPU principles and structure. However, in the single-cycle CPU design, the critical path is too long, limiting the achievable clock frequency. Additionally, only one instruction can be executed per clock cycle, resulting in low instruction throughput. To address these issues, we will attempt to overlap the execution of multiple instructions using pipelining. Handling hazards is the most challenging and critical aspect of pipelined CPU design. In this experiment, we will first design a simple three-stage pipelined CPU (IF, ID, and EX stages) that only deals with control hazards caused by branch and jump instructions, making it relatively simple to handle. Then, we will further divide the EX stage of the three-stage pipeline into EX, MEM, and WB stages, forming a classic five-stage pipeline. This introduces data hazards that require techniques like stalling and forwarding. Finally, we will move branch and jump handling to the ID stage to further reduce branch delay. In this experiment, you will learn to: - Use pipelining to shorten the critical path - Correctly handle pipeline stalls and flushes - Use forwarding logic to reduce pipeline stalls Whether using an IDE or running commands, the root directory is the `part3` directory. ### Pipeline Registers Pipeline registers act as buffers in the pipeline, dividing combinational logic and shortening the critical path. Their functionality is simple: at each clock cycle, depending on the reset (pipeline flush) or stall (pipeline pause) signals, the register content is either cleared, held, or updated with new values. The output of the register is its stored value. For reusability, we can define a parameterized `PipelineRegister` module to implement pipeline registers with different data widths. The module interface has been defined in `src/main/scala/riscv/core/PipelineRegister.scala`. The `stall` and `flush` signals represent pipeline register stall and flush signals, while `in` and `out` are the values written to and read from the register. Modify the code at the `// part3(PipelineRegister)` comment to pass the `PipelineRegisterTest`. #### Hint You can ignore the CPU for this task and use the basic knowledge from lab0 to implement the required functionality. The code should be no more than seven lines, so treat this as a simple warm-up exercise. ### Three-Stage Pipeline Below is the structure diagram of the three-stage pipelined CPU, where blue lines represent data paths, and red lines represent control signals. *three_stage_pipelined_CPU_structure* We divide the combinational logic of the single-cycle CPU into three stages using two sets of pipeline registers, `IF2ID` and `ID2EX`: 1. Instruction Fetch (IF): Fetch instructions from memory using the address in the `PC`. 2. Instruction Decode (ID): Decode the instruction into control signals and read operands from the register file. 3. Execute (EX): Perform ALU operations, access memory, and write back results. The code for these three stages is similar to that of the single-cycle CPU and has been provided. Focus on handling hazards in the following sections. ### Resolving Control Hazards: Flushing In the three-stage pipeline, since all data processing occurs in the EX stage, data hazards do not exist. The only issue is control hazards caused by program jumps. Program jumps can occur in three cases: - A jump instruction is executed in the EX stage. - A branch instruction is executed in the EX stage, and the branch condition is met. - An interrupt occurs, and the `InterruptAssert` signal from CLINT is received in the EX stage, effectively adding a jump instruction on top of the current EX stage instruction. In all cases, the EX stage sends the `jump_flag` and `jump_address` signals to the IF stage. However, before the new instruction can be fetched using `jump_address`, two instructions already in the IF and ID stages must be discarded. These instructions can be turned into "no-operations" by flushing the corresponding pipeline registers. A control unit will detect control hazards and flush the pipeline. The module is defined in `src/main/scala/riscv/core/threestage/Control.scala`. Based on the analysis, determine the module's input and output, and complete the code at the `// part3(Flush)` comment. Also, connect the module in `src/main/scala/riscv/core/threestage/CPU.scala` at the `// part3(Flush)` comment to pass the `ThreeStageCPUTest`. ### Five-Stage Pipeline In the three-stage pipeline, the EX stage still involves complex logic that can lead to delays. To further shorten the critical path, we can extend the pipeline to five stages by dividing the EX stage into ALU, memory access, and write-back stages: *five_stage_pipelined_CPU_structure* Dividing the three-stage pipeline into a five-stage pipeline introduces more complex data hazards. We will use stalling to handle these hazards, creating a fully functional five-stage pipelined CPU. Forwarding and branch prediction can further improve efficiency and are optional extensions. ### Resolving Data Hazards: Stalling Data hazards occur when instructions in the ID stage depend on the results of instructions in the EX or MEM stages. In such cases, the IF and ID stages are stalled until the required data is available in the WB stage. Consider the following instruction sequence: ```c 0000: add x1, x0, x0 0004: sub x2, x0, x1 0008: and x3, x1, x2 000C: jalr x4, x1, 0 0010: or x5, x3, x4 0014: xor x6, x4, x5 ``` Without stalling, the pipeline state is as follows: | Clock Cycle | 0 | 1 | 2 | 3 | 4 | 5 | 6 | |-------------|------|------|-------|-------|-------|-------|-------| | IF | add | sub | and | jalr | or | xor | | | ID | | add | sub | and | jalr | or | xor | | EX | | | add | sub | and | jalr | or | | MEM | | | | add | sub | and | jalr | | WB | | | | | add | sub | and | In cycle 2, `sub x2, x0, x1` in the ID stage depends on `x1`, which is not written back yet. Similarly, `and x3, x1, x2` in cycle 3 depends on the results of previous instructions. How many cycles should these instructions stall? Complete the `Control.scala` module at the `// part3(Stall)` comment in `src/main/scala/riscv/core/fivestage_stall` to handle stalling and pass the `FiveStageCPUStallTest`. ### Using Forwarding to Reduce Stalls While stalling resolves hazards, it introduces pipeline bubbles, reducing efficiency. Forwarding can bypass waiting by using intermediate pipeline registers (e.g., EX2MEM, MEM2WB). Modify the `Control.scala`, `Forwarding.scala`, and `Execute.scala` modules in `src/main/scala/riscv/core/fivestage_forward` at the `// part3(Forward)` comment to implement forwarding and pass the `FiveStageCPUForward` test. ![forwarding](https://hackmd.io/_uploads/rJxOTBx2XJe.png) Although the second instruction, `sub x2, x0, x1`, depends on the execution result of the first instruction, in the third clock cycle, the `sub` instruction in the EX stage can directly fetch the value of register `x1` from the `EX2MEM` register as its source operand without stalling. Similarly, the third instruction depends on the first two instructions, and in the fourth clock cycle, the `and` instruction in the EX stage can fetch its source operands from `EX2MEM` and `MEM2WB`. It is worth noting that in the fifth clock cycle, both `EX2MEM` and `MEM2WB` contain the value of `x2`. In this case, the "latest" value, located in `EX2MEM`, should be used. In the sixth clock cycle, the situation differs because the fourth instruction is a `load` instruction, which requires the MEM stage to complete before its result is available. Therefore, the `or` instruction in the EX stage cannot fetch its source operand from `EX2MEM` and must stall for one clock cycle to retrieve it from `MEM2WB`. A control unit is used to handle pipeline stalls and flushes, with its module interface defined in `src/main/scala/riscv/core/fivestage_forward/Control.scala`. A forwarding unit is used to detect data hazards and generate forwarding control signals, with its module interface defined in `src/main/scala/riscv/core/fivestage_forward/Forwarding.scala`. Additionally, the execute unit (`src/main/scala/riscv/core/fivestage_forward/Execute.scala`) needs to use the appropriate forwarded data based on the forwarding unit's control signals. Based on this analysis, modify the code at the `// part3(Forward)` comment in the three modules to pass the `FiveStageCPUForward` test. ### Reducing Branch Delays Finally, move branch and jump handling to the ID stage to reduce the branch penalty from two cycles to one. Modify the `InstructionDecode.scala`, `Control.scala`, and `Forwarding.scala` modules in `src/main/scala/riscv/core/fivestage_final` at the `// part3(Final)` comment to implement this optimization and pass the `FiveStageCPUFinal` test. ## Reference * [(System)Verilog to Chisel Translation for Faster Hardware Design](https://hal.science/hal-02949112/file/sv2chisel.pdf) * [Digital Design with Chisel](https://www.imm.dtu.dk/~masca/chisel-book.html) * [Davis In-Order (DINO) CPU models](https://github.com/jlpteaching/dinocpu) * [Online RISC-V instruction interpreter](https://www.cs.cornell.edu/courses/cs3410/2019sp/riscv/interpreter/)