# Assignment3: single-cycle RISC-V CPU
contributed by < [JinYu1225](https://github.com/JinYu1225) >
## Introduction
The objective of this assignment is to develop a single-cycle RISC-V CPU, named MyCPU, using Chisel, a hardware description language based on Scala. The process involves forking and modifying code from [ca2023-lab3](https://github.com/sysprog21/ca2023-lab3) to complete the CPU construction. The final step is to execute the project developed in [Assignment2](https://hackmd.io/@edenlin/GNU_Toolchain_HammingDistanceByCLZ) on the MyCPU.
The complete project can be accessed [here](https://github.com/JinYu1225/ca2023-lab3).
For further details on Chisel, Scala, and CPU construction as part of this assignment, refer to [Lab3: Construct a single-cycle RISC-V CPU with Chisel](https://hackmd.io/@sysprog/r1mlr3I7p#Lab3-Construct-a-single-cycle-RISC-V-CPU-with-Chisel).
## Hello World in Chisel
```scala
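import chisel3._

// Blinks an LED by toggling a 1-bit register each time a counter wraps.
class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  // Number of clock cycles between two toggles, minus one.
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W)) // free-running cycle counter
  val blkReg = RegInit(0.U(1.W)) // current LED state

  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U // wrap the counter
    blkReg := ~blkReg // toggle the LED
  }
  io.led := blkReg
}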
```
`Hello` is a module whose `io` field holds an anonymous bundle with a single output wire named `led`. It also contains a counter register, `cntReg`, and a 1-bit flag register, `blkReg`; only `blkReg` drives the output `led`. `cntReg` increments every clock cycle, and each time it reaches `CNT_MAX` it is reset to 0 and `blkReg` is toggled.
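As a rough cross-check of the timing, the counter and toggle behaviour can be modelled in plain Scala. This is a software sketch, not Chisel hardware: `BlinkModel` and its helpers are hypothetical names, and the 50 MHz clock mentioned below is an assumption suggested by the constant `50000000`, since the source does not state the clock rate.

```scala
// Plain-Scala model of the counter/toggle logic in `Hello`; names are illustrative only.
object BlinkModel {
  // Same arithmetic as CNT_MAX in the Chisel code: 24,999,999.
  val cntMaxHello: Int = 50000000 / 2 - 1

  // Clock cycles between two toggles of the LED register.
  def togglePeriodCycles(cntMax: Int): Long = cntMax.toLong + 1

  // LED value after each of `cycles` rising edges, starting from reset (cnt = 0, led = 0).
  def simulate(cntMax: Int, cycles: Int): Seq[Int] = {
    var cnt = 0 // models cntReg
    var blk = 0 // models blkReg
    (1 to cycles).map { _ =>
      if (cnt == cntMax) {
        cnt = 0
        blk = 1 - blk // wrap and toggle
      } else {
        cnt += 1
      }
      blk
    }
  }
}
```

With a small limit, `BlinkModel.simulate(2, 6)` shows the LED changing every three edges. Under the assumed 50 MHz clock, `togglePeriodCycles(cntMaxHello)` is 25,000,000 cycles, so the LED would change state every 0.5 s.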
## Modifying MyCPU
The project comprises six tests, each targeting a different part of the design; all of them must pass to confirm that MyCPU's basic functionality works. The key `.scala` files that require modification are `InstructionFetch.scala`, `InstructionDecode.scala`, `Execute.scala`, and `CPU.scala`.
Run all 6 tests:
```shell
$ sbt test
```
Result:
```shell
[info] *** 5 TESTS FAILED ***
[error] Failed tests:
[error] riscv.singlecycle.InstructionDecoderTest
[error] riscv.singlecycle.ByteAccessTest
[error] riscv.singlecycle.ExecuteTest
[error] riscv.singlecycle.FibonacciTest
[error] riscv.singlecycle.QuicksortTest
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 11 s, completed Nov 30, 2023, 11:23:51 PM
```
It is evident that five tests have failed. One test has passed due to the modifications made in `InstructionFetch.scala`, which will be elaborated on in the following discussion.
### Instruction Fetch
The instruction fetch stage is implemented in `InstructionFetch.scala`.

MyCPU fetches instructions from the address specified by the Program Counter (PC) and determines the next PC. We can easily implement this by categorizing the situation into two parts: **whether the code jumps or not**.
:::danger
:warning: **Refrain from copying and pasting your solution directly into the HackMD note**. Instead, provide a concise summary of the various test cases, outlining the aspects of the CPU they evaluate, the techniques employed for loading test program instructions, and the outcomes of these test cases.
:::
Another important point is that we should always check whether the instruction input is valid.
```scala
when(io.instruction_valid) {
  io.instruction := io.instruction_read_data
  ...
}.otherwise {
  pc := pc
  io.instruction := 0x00000013.U
}
io.instruction_address := pc
```
`0x00000013` is defined as `nop` in `InstructionDecode.scala`.
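The jump and no-jump cases, together with the invalid-instruction case, can be summarized with a small plain-Scala model. This is only a behavioural sketch of what the text describes, not the Chisel code from `InstructionFetch.scala`; `FetchModel` and `step` are hypothetical names.

```scala
// Behavioural model of the fetch stage; not the actual Chisel implementation.
object FetchModel {
  val Nop: Long = 0x00000013L // addi x0, x0, 0

  // Returns (next pc, instruction passed on to the decode stage).
  def step(pc: Long, valid: Boolean, readData: Long,
           jumpFlag: Boolean, jumpAddress: Long): (Long, Long) = {
    if (!valid) (pc, Nop) // hold pc and feed a nop
    else if (jumpFlag) (jumpAddress, readData) // taken jump or branch
    else ((pc + 4) & 0xFFFFFFFFL, readData) // sequential fetch, 32-bit wrap
  }
}
```

For example, `FetchModel.step(0x1000L, true, 0xFF410113L, false, 0L)` advances to `0x1004`, while passing `valid = false` keeps `pc` unchanged and yields the `nop` encoding.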
### Instruction Decode
Instruction Decode is implemented in `InstructionDecode.scala`. The purpose of the decode stage is to assist the CPU in recognizing each type of operation and determining the corresponding action to be taken. It involves the following steps.
1. Identify the instruction type and the length of each field based on the opcode of the instruction.
2. Retrieve all the arguments (from registers, memory, or immediate values) required for the operation.
3. Generate the control signals for the next stage.
Let's use the ADD instruction as an example, whose opcode is represented as `0b0110011`. A more comprehensive understanding can be gained by referring to the following table.

* First, the instruction fetched in the IF stage arrives as the input. It is then divided into six fields: `opcode`, `funct3`, `funct7`, `rd`, `rs1`, and `rs2`, as the figure above shows.
```scala
val opcode = io.instruction(6, 0)
val funct3 = io.instruction(14, 12)
val funct7 = io.instruction(31, 25)
val rd = io.instruction(11, 7)
val rs1 = io.instruction(19, 15)
val rs2 = io.instruction(24, 20)
```
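To see the slicing on a concrete word, the same bit ranges can be applied in plain Scala. This is a sketch rather than hardware: `Fields`, `bits`, and `RType` are hypothetical names, and the word `0x002081B3` (`add x3, x1, x2`) is an illustrative encoding that does not come from the test programs.

```scala
// Plain-Scala model of the field slicing; names are illustrative only.
object Fields {
  // Bits hi..lo (inclusive) of a 32-bit word, mirroring io.instruction(hi, lo) in Chisel.
  def bits(inst: Long, hi: Int, lo: Int): Int =
    ((inst >>> lo) & ((1L << (hi - lo + 1)) - 1)).toInt

  case class RType(opcode: Int, rd: Int, funct3: Int, rs1: Int, rs2: Int, funct7: Int)

  def decode(inst: Long): RType = RType(
    opcode = bits(inst, 6, 0),
    rd = bits(inst, 11, 7),
    funct3 = bits(inst, 14, 12),
    rs1 = bits(inst, 19, 15),
    rs2 = bits(inst, 24, 20),
    funct7 = bits(inst, 31, 25)
  )
}
```

`Fields.decode(0x002081B3L)` gives opcode `0x33` (`0b0110011`), `rd` = 3, `rs1` = 1, `rs2` = 2, and zero for both `funct3` and `funct7`, which is the field pattern of `add`.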
* Second, determine the instruction type and the exact operation from the `opcode` and `funct3`.
```scala
object InstructionTypes {
  val L = "b0000011".U
  val I = "b0010011".U
  val S = "b0100011".U
  val RM = "b0110011".U
  val B = "b1100011".U
}
...
```
* Third, determine which register or immediate value should be accessed by the instruction.
```scala
...
io.wb_reg_write_source := MuxCase(
  RegWriteSource.ALUResult,
  ArraySeq(
    (opcode === InstructionTypes.RM || opcode === InstructionTypes.I ||
      opcode === Instructions.lui || opcode === Instructions.auipc) -> RegWriteSource.ALUResult, // same as default
    (opcode === InstructionTypes.L) -> RegWriteSource.Memory,
    (opcode === Instructions.jal || opcode === Instructions.jalr) -> RegWriteSource.NextInstructionAddress
  )
)
...
```
To ensure the correct operation of MyCPU, it is necessary to modify the control signal of MemRW in the Decode stage. This modification should be based on the types of operations being performed.
* Finally, pass all the necessary arguments and control signals to the next stage (EXE).
### Execute
The Execute stage is where MyCPU carries out the arithmetic processes of each instruction using the ALU. In this stage, the control signals and arguments are obtained from the ID stage, and a specific operation is performed in the ALU. Then the result will be passed to the next stage.
`ALU.scala` and `ALUControl.scala` are utilized in the Execute stage. ALUControl produces an output specifying the type of operation that the ALU needs to perform. The inputs of the ALU are determined by the results of ALUControl and the previous stage, as illustrated by the inputs provided through ALUop1 and ALUop2.
### Combining into CPU
The CPU assumes a crucial role in coordinating the connection between each stage. Each stage is declared as a module variable within the CPU. The code establishes the necessary connections of inputs and outputs among these stages in the CPU to ensure the proper functioning of the single-cycle CPU.
The connections between inputs and outputs are shown in the following figure.

## Test Result
```shell
$ sbt test
...
[info] Run completed in 14 seconds, 891 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 16 s, completed Dec 1, 2023, 11:57:58 PM
```
Generate waveform files during tests:
```shell
$ WRITE_VCD=1 sbt test
```
Waveform files (`.vcd`) will be generated under the `test_run_dir` directory.
### Instruction Fetch Test

The `reset` signal is typically set to `HIGH` to ensure hardware resets to its initial state at the beginning of a functional test. Additionally, `io_instruction_valid` is set to `LOW` to output the No Operation (NOP) instruction `0x00000013` to `io_instruction`. The necessary input for the test will be set during this period.
Subsequently, at the falling edge of the clock, `reset` and `io_instruction_valid` are set to `LOW` and `HIGH` respectively, in preparation for the input at the subsequent rising edge.

`pc` is set to `io_jump_address_id` = `0x1000` because the signal `io_jump_flag_id` is `HIGH` at the rising edge shown above.
`pc` is set to `pc+4` when `io_jump_flag_id` remains `LOW`.
### Instruction Decode Test

The first `io_instruction` is `0x00A02223` = `0b0000 0000 1010 0000 0010 0010 0010 0011`. The corresponding `opcode` is `0b010 0011`, which indicates an S-type instruction.

- `io_memory_write_enable` is set to `HIGH`, while `reg_write_enable` and `memory_read_enable` are set to `LOW`.
- `io_ex_aluop1_source` = `0` = Register, and `io_ex_aluop2_source` = `1` = Immediate.
- funct3 = `0b010` indicates it is `sw`, and imm = `0b0 0100`, rs1 = `0b0 0000`, rs2 = `0b0 1010`
- `io_regs_reg1_read_address` = `0`, `io_regs_reg2_read_address` = `0xA`, `io_ex_immediate` = `0x4`
We can confirm that the decode stage works properly by checking the instruction input against the corresponding outputs.
An interesting point in this test is that the output signals change at the falling edge of the clock, while the IF stage is triggered at the rising edge.
This is likely caused by the structure of the single-cycle CPU: apart from the IF stage, no stage has registers to hold intermediate signals. A rising-edge trigger therefore appears only when we test the IF stage; elsewhere, the outputs change as soon as an input is applied to the module.
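The field values listed above for `0x00A02223` can be cross-checked with a short plain-Scala decoder. It is a sketch of the standard RV32I S-type layout, not the code in `InstructionDecode.scala`; `StoreDecode`, `bits`, and `immS` are hypothetical names.

```scala
// Plain-Scala check of the S-type fields; names are illustrative only.
object StoreDecode {
  // Bits hi..lo (inclusive) of a 32-bit word.
  def bits(inst: Long, hi: Int, lo: Int): Int =
    ((inst >>> lo) & ((1L << (hi - lo + 1)) - 1)).toInt

  // S-type immediate: imm[11:5] = inst[31:25], imm[4:0] = inst[11:7],
  // sign-extended from 12 bits.
  def immS(inst: Long): Int = {
    val raw = (bits(inst, 31, 25) << 5) | bits(inst, 11, 7)
    (raw << 20) >> 20
  }
}
```

For `0x00A02223`, the opcode is `0x23`, `funct3` is 2 (`sw`), `rs1` is 0, `rs2` is 10 (`0xA`), and `immS` returns 4, matching the values read from the waveform.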
### EXE Test
In this test, the circuit is exercised 100 times with RISC-V code that represents `x3 = x2 + x1`, followed by a few tests of the branch behavior `pc + 2 if x1 === x2`.
`x3 = x2 + x1`:

- `io_if_jump_flag` remains `0` for the `add` instruction.
- In the first clock cycle right after the reset signal, the circuit receives the inputs `0x0a45c5af` and `0x1486d599`, then calculates the output `0x1ECC9B48`.
`pc + 2 if x1 === x2`:

- In the first set of branch test signals, `io_if_jump_flag` = `0` because `io_reg1_data` is not equal to `io_reg2_data`.
- In the next clock cycle, `io_if_jump_flag` = `1`, and `io_if_jump_address` equals `io_instruction_address` + `2`.
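Both checks in this test reduce to 32-bit arithmetic, which can be reproduced in plain Scala. The sketch below is a software model under that assumption, not the Chisel `Execute` module; `ExecuteModel`, `add32`, and `beq` are hypothetical names, and the `pc` value used in the usage note is illustrative.

```scala
// Software model of the two behaviours checked by the EXE test; names are illustrative.
object ExecuteModel {
  // 32-bit wrapping addition, as performed for `add`.
  def add32(a: Long, b: Long): Long = (a + b) & 0xFFFFFFFFL

  // beq: returns (jump flag, jump address). The address is computed
  // regardless of whether the branch is taken.
  def beq(pc: Long, rs1: Long, rs2: Long, imm: Int): (Boolean, Long) =
    (rs1 == rs2, (pc + imm) & 0xFFFFFFFFL)
}
```

`ExecuteModel.add32(0x0a45c5afL, 0x1486d599L)` reproduces the `0x1ECC9B48` seen in the waveform, and `ExecuteModel.beq(0x1000L, 5L, 5L, 2)` raises the flag with a target of `pc + 2`.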
### Register File Test
==**read the writing content:**==

- After `write_enable` is set to `1`, `io_read_data1` changes to `0xDEADBEEF` at the next rising edge, based on the values of `io_read_address1`, `io_write_address`, and `io_write_data`.
==**read the written content:**==

- In this case, we can only observe the results from 0 to 1.5 clock cycles, and the output `io_read_data1` doesn't change during this period.
- We can obtain the waveform for later times by adding more `c.clock.step()` calls to the test bench.

- The result of `io_read_data1` will change to the corresponding result after 2 clock cycles.
## Modify HW2 to fit MyCPU
1. Add a new class `HammingDistanceTest` into `CPUTest.scala`
```scala
class HammingDistanceTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "cal hamming distance of 0x100000 and 0xFFFFF" in {
    test(new TestTopModule("HammingDistance.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 50) {
        c.clock.step(1000)
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      c.io.regs_debug_read_address.poke(10.U) // a0
      c.clock.step()
      c.io.regs_debug_read_data.expect(21.U)
    }
  }
}
```
2. Modify Makefile in `/csrc` to generate `HammingDistance.asmbin`
3. Run the test
```shell
$ sbt "testOnly riscv.singlecycle.HammingDistanceTest"
```
```shell
[info] welcome to sbt 1.9.7 (Oracle Corporation Java 17.0.9)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/edenlin/Documents/Computer_Architecture/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/edenlin/Documents/Computer_Architecture/ca2023-lab3/)
[info] HammingDistanceTest:
[info] Single Cycle CPU
[info] - should cal hamming distance of 0x100000 and 0xFFFFF *** FAILED ***
[info] io_regs_debug_read_data=1 (0x1) did not equal expected=21 (0x15) (lines in CPUTest.scala: 77, 69) (CPUTest.scala:77)
[info] Run completed in 6 seconds, 582 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error] riscv.singlecycle.HammingDistanceTest
[error] (Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 8 s, completed Jan 19, 2024, 11:22:37 PM
```
The test initially failed due to an oversight in the code. Upon reviewing it, I discovered that the `HW2` version terminated through an `ecall` placed in the middle of the code. On MyCPU this termination did not take place, so the program continued executing and the final result deviated from expectations. The issue was fixed by moving the termination point to the end of the code, after which the test passed.
```shell
[info] welcome to sbt 1.9.7 (Oracle Corporation Java 17.0.9)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/edenlin/Documents/Computer_Architecture/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/edenlin/Documents/Computer_Architecture/ca2023-lab3/)
[info] HammingDistanceTest:
[info] Single Cycle CPU
[info] - should cal hamming distance of 0x100000 and 0xFFFFF
[info] Run completed in 5 seconds, 774 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 7 s, completed Jan 19, 2024, 11:38:19 PM
```
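The expected value `21` used in `HammingDistanceTest` can be cross-checked on the host with a plain-Scala helper. This is an independent sketch using the JDK's bit count, not the CLZ-based routine from Assignment2; `HammingCheck` is a hypothetical name.

```scala
// Host-side cross-check of the expected result; not the assembly routine under test.
object HammingCheck {
  // Hamming distance = number of set bits in the XOR of the two operands.
  def hammingDistance(a: Long, b: Long): Int =
    java.lang.Long.bitCount(a ^ b)
}
```

`0x100000 ^ 0xFFFFF` is `0x1FFFFF`, which has 21 set bits, so `HammingCheck.hammingDistance(0x100000L, 0xFFFFFL)` agrees with the value the test expects in `a0`.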
### Verilator
Use Verilator to check the waveform and quickly test programs. The following command should be executed every time the Chisel source files are modified, to regenerate the corresponding Verilog file.
```shell
$ make verilator
```
This produces an executable file, `VTop`, which can run program files with the following parameters.
| Parameter | Usage |
|:----------|:------|
| `-memory` | Specify the size of the simulation memory in words (4 bytes each).<br> Example: `-memory 4096` |
| `-instruction` | Specify the RISC-V program used to initialize the simulation memory.<br>Example: `-instruction src/main/resources/hello.asmbin` |
| `-signature` | Specify the memory range and destination file to output after simulation.<br>Example: `-signature 0x100 0x200 mem.txt` |
| `-halt` | Specify the halt identifier address; writing `0xBABECAFE` to this memory address stops the simulation.<br>Example: `-halt 0x8000` |
| `-vcd` | Specify the filename for saving the simulation waveform during the process; not specifying this parameter will not generate a waveform file.<br>Example: `-vcd dump.vcd` |
| `-time` | Specify the maximum simulation time; note that time is **twice** the number of cycles.<br>Example: `-time 1000` |
Load `HammingDistance.asmbin`, simulate for 2000 cycles, and save the simulation waveform to `dumpH.vcd`.
```shell
$ ./run-verilator.sh -instruction src/main/resources/HammingDistance.asmbin -time 4000 -vcd dumpH.vcd
$ gtkwave dumpH.vcd
```
### Waveforms analysis
- RISC-V Instruction Formats by Reference Data Card

#### I type
Take the following instruction as an I-type example.
`0xFF410113` = `0b 1111 1111 0100 0001 0000 0001 0001 0011`
According to the Reference Data Card, this encodes the following instruction.
`addi sp sp -12`
| imm[11:0] | rs1 | funct3 | rd | Opcode |
|:--------------:|:------:|:------:|:------:|:--------:|
| 1111 1111 0100 | 0 0010 | 000 | 0 0010 | 001 0011 |
**IF:**

- The instruction at address `0x1000`, given by `io_instruction_address`, is placed on `io_instruction_read_data` and then assigned to `io_instruction` once the signal `io_instruction_valid` goes HIGH.
- The `io_jump_flag_id` is LOW, so the next `pc` will be `pc+4`.
**ID:**

- For the I-type instruction, `io_ex_aluop1_source` and `io_ex_aluop2_source` were assigned `0` and `1`, which represent `Register` and `Immediate`.
- The `io_ex_immediate` = `0b1...10100` (`0xFFFFFFF4`), which is `-12` in decimal.
- The read and write enable signals for the memory were both `0` for the I-type instruction, while the write enable signal for the register file was set to `1` because the result must be written back to register `rd`.
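The sign extension of the I-type immediate can be reproduced in plain Scala. This is a sketch of the standard RV32I layout rather than the decoder's Chisel code; `ITypeDecode`, `immI`, and `asUnsigned32` are hypothetical names.

```scala
// Plain-Scala model of I-type immediate decoding; names are illustrative only.
object ITypeDecode {
  // I-type immediate: inst[31:20], sign-extended from 12 bits to 32 bits.
  def immI(inst: Long): Int = {
    val raw = ((inst >>> 20) & 0xFFF).toInt
    (raw << 20) >> 20
  }

  // The same value as an unsigned 32-bit pattern, as a waveform viewer shows it.
  def asUnsigned32(value: Int): Long = value & 0xFFFFFFFFL
}
```

For `0xFF410113`, `immI` returns `-12`, and `asUnsigned32(-12)` is `0xFFFFFFF4`, the pattern that appears on `io_ex_immediate`.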
**EXE:**

- The jump flag was `0` since the `opcode` was not JAL, JALR, or a B-type instruction.
- The `op1` and `op2` of `alu` were assigned `0` from `sp` and `-12` from `imm`, according to `aluop1_source` and `aluop2_source`.
- The ALU added `sp` and `-12`, as selected by the `opcode` and `funct3` that represent `addi`.
**MEM:**

- The memory read/write enable signals, which are controlled by the information decoded in the ID stage for L- and S-type instructions, remained LOW. Therefore, instructions other than L and S types do nothing in the MEM stage.
**WB:**

- The source of `reg_write_data` in the WB stage is decided by the instruction type decoded in the ID stage. **RM**, **I**, **lui**, and **auipc** instructions take `alu_result` as the source data.
- **L**-type instructions take `memory_read_data` as the data source, and **JAL** and **JALR** take `instruction_address + 4`.
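The write-back selection described above amounts to a three-way choice keyed on the opcode. The plain-Scala sketch below models it using the standard RV32I opcodes for loads (`0x03`), `jal` (`0x6F`), and `jalr` (`0x67`); `WriteBackModel` and its members are hypothetical names, not identifiers from the project.

```scala
// Software model of the write-back source selection; names are illustrative only.
object WriteBackModel {
  sealed trait Source
  case object ALUResult extends Source
  case object Memory extends Source
  case object NextInstructionAddress extends Source

  def select(opcode: Int): Source = opcode match {
    case 0x03 => Memory // L: load instructions
    case 0x6F | 0x67 => NextInstructionAddress // jal, jalr
    case _ => ALUResult // RM, I, lui, auipc, and the default
  }
}
```

For the `addi` example, `WriteBackModel.select(0x13)` yields `ALUResult`, so the ALU output is what gets written to `rd`.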
#### B type
Take the following instruction as a B-type example.
`0x04040A63` = `0b0000 0100 0000 0100 0000 1010 0110 0011`
`beq s0 x0 EXIT_HAMDIS`
| imm[12,10:5] | rs2 | rs1 | funct3 | imm[4:1,11] | opcode |
|:------------:|:-----:|:-----:|:------:|:-----------:|:--------:|
| 000 0010 | 00000 | 01000 | 000 | 10100 | 110 0011 |
**IF:**

- We can easily see that `jump_flag` is set to HIGH, and `pc` will be `jump_address_id` in the next cycle.
**ID:**

- The ID stage recognized the instruction as B-type, and specifically as `beq`, from the `opcode` and `funct3`.
- The stage then decoded the instruction, placed each bit field into its corresponding position in the B-type `imm`, and assigned the result to `ex_immediate`.
- `aluop1_source` and `aluop2_source` were assigned `register` and `imm` because, in this CPU, whether the branch is taken is determined in the jump judge unit. Normally the jump address would be calculated by the ALU, but this CPU computes it in the EXE stage outside the ALU module.
**EXE:**

- `if_jump_address` was calculated as `instruction_address` + `immediate`, and `if_jump_flag` was determined by the `opcode` and `funct3`.
- Although the CPU calculates `if_jump_address` outside the `ALU` module in the code, we can still see that `alu_io_result` has the same value as `if_jump_address`. Therefore, the ALU could be used to compute the B-type target address instead of the extra adder.
There was nothing to do in the MEM and WB stages for the B-type instruction.
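The scattered bit positions of the B-type immediate are easier to follow when reassembled in code. The plain-Scala sketch below follows the standard RV32I B-type layout and is not the project's Chisel decoder; `BranchDecode`, `bits`, and `immB` are hypothetical names.

```scala
// Plain-Scala model of B-type immediate decoding; names are illustrative only.
object BranchDecode {
  // Bits hi..lo (inclusive) of a 32-bit word.
  def bits(inst: Long, hi: Int, lo: Int): Int =
    ((inst >>> lo) & ((1L << (hi - lo + 1)) - 1)).toInt

  // imm[12] = inst[31], imm[11] = inst[7], imm[10:5] = inst[30:25],
  // imm[4:1] = inst[11:8], imm[0] = 0; sign-extended from 13 bits.
  def immB(inst: Long): Int = {
    val raw = (bits(inst, 31, 31) << 12) | (bits(inst, 7, 7) << 11) |
      (bits(inst, 30, 25) << 5) | (bits(inst, 11, 8) << 1)
    (raw << 19) >> 19
  }
}
```

For `0x04040A63`, `immB` returns 84 (`0x54`), so the label `EXIT_HAMDIS` lies 84 bytes after the `beq`; `rs1` decodes to 8 (`s0`) and `rs2` to 0 (`x0`).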
#### JAL
Take the following instruction as a JAL example.
`0x010000EF` = `0b 0000 0001 0000 0000 0000 0000 1110 1111`
`jal ra HAMDIS`
| imm[20,10:1,11,19:12] | rd | opcode |
|:------------------------:|:------:|:--------:|
| 0000 0001 0000 0000 0000 | 0 0001 | 110 1111 |
**IF:**

- The `jump_flag` was set to HIGH, and the corresponding `jump_address` was `0x104C`. We can see that the value of `pc` was `0x104C` in the next cycle.
**ID:**

- The `reg_write_enable` was set to HIGH to store `pc + 4` into `rd`, which was assigned to `reg_write_address`.
- The `aluop1` and `aluop2` sources were both set to `1`, which represents `Instruction address` and `Imm` respectively.
**EXE:**

- The control signal `if_jump_flag` was set to HIGH as the instruction was JAL.
- The `if_jump_address` was assigned as `immediate` + `instruction_address` = `0x103C + 0x10` = `0x104C`.
**WB:**

- `regs_write_data` was assigned as `instruction_address + 4` = `0x1040`.
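The jump target and link value above can be re-derived from the instruction word with a plain-Scala sketch of the standard RV32I J-type layout. It is not the project's Chisel code; `JalDecode`, `bits`, `immJ`, and `jal` are hypothetical names.

```scala
// Plain-Scala model of JAL decoding; names are illustrative only.
object JalDecode {
  // Bits hi..lo (inclusive) of a 32-bit word.
  def bits(inst: Long, hi: Int, lo: Int): Int =
    ((inst >>> lo) & ((1L << (hi - lo + 1)) - 1)).toInt

  // imm[20] = inst[31], imm[19:12] = inst[19:12], imm[11] = inst[20],
  // imm[10:1] = inst[30:21], imm[0] = 0; sign-extended from 21 bits.
  def immJ(inst: Long): Int = {
    val raw = (bits(inst, 31, 31) << 20) | (bits(inst, 19, 12) << 12) |
      (bits(inst, 20, 20) << 11) | (bits(inst, 30, 21) << 1)
    (raw << 11) >> 11
  }

  // Returns (jump target, link value written to rd).
  def jal(pc: Long, inst: Long): (Long, Long) =
    ((pc + immJ(inst)) & 0xFFFFFFFFL, (pc + 4) & 0xFFFFFFFFL)
}
```

With `pc` = `0x103C`, `JalDecode.jal(0x103CL, 0x010000EFL)` gives a target of `0x104C` and a link value of `0x1040`, matching the EXE and WB observations.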
#### S type
We take the following S-type instruction as an example in this section.
`0x00512023` = `0b 0000 0000 0101 0001 0010 0000 0010 0011`
which corresponds to `sw t0 0(sp)`
| imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
|:---------:|:-----:|:-----:|:------:|:--------:|:-------:|
| 0000 000 | 00101 | 00010 | 010 | 0 0000 | 010 0011 |
Because the result of the IF stage differs only when the instruction is **JAL**, **JALR**, or a **B**-type instruction, we skip the IF stage analysis for S-type instructions.
**ID:**

- The register number of `t0` was assigned to `reg2_read_address`; its value would be used in the MEM stage since the control signal `memory_write_enable` was HIGH.
**MEM:**

- The control signal `memory_write_enable` was HIGH.
- As `funct3` reveals that the instruction was `sw`, strobes 0 to 3 were all set to 1, corresponding to the size of a word.
- The `memory_bundle_address` = `0xFFFFFFF4` matched the value of `sp` after it was moved by the previous instruction `addi sp sp -12`.
The EXE stage only computes the store address and the WB stage writes nothing back for an S-type instruction, so they are not discussed further in this section.
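The relationship between `funct3` and the write strobes can be sketched in plain Scala. Only the `sw` case is confirmed by the waveform above; the `sb` and `sh` cases assume that byte lanes are selected by the low two address bits, which is a common scheme but is an assumption here. `StrobeModel` and `strobes` are hypothetical names.

```scala
// Software model of store strobes; the sb/sh lane selection is an assumption.
object StrobeModel {
  // Returns the four byte-lane strobes (index 0 = least significant byte).
  def strobes(funct3: Int, address: Long): Seq[Boolean] = {
    val index = (address & 0x3).toInt // byte offset within the word
    funct3 match {
      case 0x0 => Seq.tabulate(4)(_ == index) // sb: one byte lane
      case 0x1 => // sh: lower or upper half-word
        if (index < 2) Seq(true, true, false, false)
        else Seq(false, false, true, true)
      case 0x2 => Seq.fill(4)(true) // sw: all four lanes
      case _ => Seq.fill(4)(false) // not a store width
    }
  }
}
```

For the `sw` at address `0xFFFFFFF4`, `StrobeModel.strobes(2, 0xFFFFFFF4L)` enables all four lanes, consistent with strobes 0 to 3 being set in the waveform.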