# Assignment3: Single-cycle RISC-V CPU
contributed by < [`jeremy90307`](https://github.com/jeremy90307) >
## Environment setup
- OS: Ubuntu 22.04
- sbt version: 1.9.4
- JDK version: 1.8.0

Follow the instructions in [Lab3: Construct a single-cycle RISC-V CPU with Chisel](https://hackmd.io/@sysprog/r1mlr3I7p) to set up the environment.
### GTKWave Installation
**Install**
1. Visit the [GTKWave](https://gtkwave.sourceforge.net/) website to download `gtkwave-3.3.117.tar.gz`.
2. If the installation fails, install the packages listed in the `README` file:
```
sudo apt-get install libjudy-dev
sudo apt-get install libbz2-dev
sudo apt-get install liblzma-dev
sudo apt-get install libgconf2-dev
sudo apt-get install libgtk2.0-dev
sudo apt-get install tcl-dev
sudo apt-get install tk-dev
sudo apt-get install gperf
sudo apt-get install gtk2-engines-pixbuf
```
3. Run `./configure`.
4. Run `make`.
5. Run `make install`.
## Hello World in Chisel
```scala
import chisel3._

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  // Toggle the LED every CNT_MAX + 1 clock cycles.
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg  = RegInit(0.U(32.W))
  val blkReg  = RegInit(0.U(1.W))
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}
```
- This module has only one output signal.
- `led` is an output of unsigned type with a bit width of 1.
- `cntReg` is a 32-bit counter register with an initial value of 0.
- `CNT_MAX` is the maximum value of the counter (50000000 / 2 - 1 = 24999999).
- `blkReg` holds the current LED state; it is 1 bit wide with an initial value of 0.
- `when(...)`: when `cntReg` equals `CNT_MAX`, `cntReg` is reset to 0 and `blkReg` is inverted.
- Finally, `blkReg` is connected to the output signal `io.led`.
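The register behaviour described above can be sketched as a small software model. This is only an illustration in C, not code generated from the Chisel source; the names `hello_state`, `hello_step`, and `hello_led_after` are hypothetical, and `cnt_max` is passed as a parameter so that small values can be tried.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of the two registers in the Hello module (illustrative only). */
typedef struct {
    uint32_t cnt; /* models cntReg, reset value 0 */
    uint8_t  blk; /* models blkReg, reset value 0 */
} hello_state;

/* Advance the model by one clock edge; cnt_max plays the role of CNT_MAX. */
void hello_step(hello_state *s, uint32_t cnt_max)
{
    assert(s != NULL);
    if (s->cnt == cnt_max) {
        s->cnt = 0;          /* the when(...) assignment overrides the increment */
        s->blk = !s->blk;    /* invert the 1-bit state */
    } else {
        s->cnt = s->cnt + 1; /* default assignment: cntReg := cntReg + 1.U */
    }
}

/* Value driven on led after n clock edges, starting from reset. */
uint8_t hello_led_after(uint32_t n, uint32_t cnt_max)
{
    hello_state s = {0, 0};
    for (uint32_t i = 0; i < n; i++)
        hello_step(&s, cnt_max);
    return s.blk;
}
```

With `cnt_max = 3` the model toggles on every fourth edge, so `hello_led_after(4, 3)` is 1 and `hello_led_after(8, 3)` is 0 again. With the real `CNT_MAX` the LED state changes every 25,000,000 cycles.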
## Lab 3 : Single Cycle RISC-V CPU
Install the required packages:
```
sudo apt install build-essential verilator gtkwave
```
Run all tests:
```
sbt test
```
If the execution is successful, you will see the following message.
```
[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/jeremytsai/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/jeremytsai/ca2023-lab3/)
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] InstructionFetchTest:
[info] InstructionFetch of Single Cycle CPU
[info] - should fetch instruction
[info] FibonacciTest:
[info] Single Cycle CPU
[info] - should recursively calculate Fibonacci(10)
[info] ByteAccessTest:
[info] Single Cycle CPU
[info] - should store and load a single byte
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers
[info] HW2Test:
[info] Single Cycle CPU
[info] - should calculate the scale
[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] Run completed in 10 seconds, 35 milliseconds.
[info] Total number of tests run: 10
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 11 s, completed 2023/11/30 下午 06:04:27
```
Run the unit tests of a single Scala test class:
```
sbt "testOnly riscv.singlecycle.XXXTest"
```
Generate `.vcd` waveform files and analyze them with GTKWave:
```
WRITE_VCD=1 sbt test
```
### Resolve MyCPU
**Files to complete**
1. InstructionFetch.scala
2. InstructionDecoder.scala
3. Execute.scala
4. CPU.scala
The code for my completed MyCPU is available in my [repository](https://github.com/jeremy90307/ca2023-lab3.git).

### Instruction Fetch

#### Test
- With `io.instruction_valid` asserted, if `io.jump_flag_id` is true, the PC jumps to the target address (0x1000 in the test).
- Otherwise, the PC advances to pc + 4.
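A minimal C model of the next-PC choice, assuming the jump decision is driven by a jump flag and a jump address, as the `io_jump_flag_id` and `io_jump_address_id` signals in the waveform sections of this note suggest. The function name `next_pc` and the behaviour when no instruction is valid (holding the PC) are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Next-PC selection modeled in C; parameter names mirror the Chisel ports. */
uint32_t next_pc(uint32_t pc, bool instruction_valid,
                 bool jump_flag, uint32_t jump_address)
{
    assert((pc & 3u) == 0);  /* RV32I instructions are 4-byte aligned */
    if (!instruction_valid)
        return pc;           /* assumption: the PC is held while no instruction is valid */
    return jump_flag ? jump_address : pc + 4;
}
```

For example, `next_pc(0x1008, true, true, 0x1000)` returns 0x1000, while `next_pc(0x1000, true, false, 0x1000)` returns 0x1004.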
#### GTKWave


From the diagram, it can be observed that when `io.instruction_valid = 1`, the PC position returns to 0x1000 on the next rising edge of the clock.
### Instruction Decoder
#### Test
The test feeds `sw`, `lui`, and `add` instructions to the decoder and checks that the correct control signals are produced.
- If the opcode is a load, `io.memory_read_enable` is set to true.
- If the opcode is a store, `io.memory_write_enable` is set to true.
#### GTKWave

input = `0x00a02223L.U` --> `sw x10, 4(x0)`
When the instruction is a store, `io_memory_write_enable` is set to 1.
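The decoding of `0x00a02223` can be checked by hand with a short C sketch. It only covers the two memory-enable signals and the S-type fields; the struct `mem_ctrl` and the function `decode_mem_ctrl` are hypothetical names, and the opcode values are the standard RV32I load (`0x03`) and store (`0x23`) opcodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { OPCODE_LOAD = 0x03, OPCODE_STORE = 0x23 }; /* RV32I base opcodes */

typedef struct {
    bool     memory_read_enable;
    bool     memory_write_enable;
    uint32_t rs1, rs2;
    int32_t  imm; /* S-type immediate; only meaningful for stores */
} mem_ctrl;

mem_ctrl decode_mem_ctrl(uint32_t inst)
{
    mem_ctrl c;
    uint32_t opcode = inst & 0x7f;
    /* imm[11:5] = inst[31:25], imm[4:0] = inst[11:7] */
    uint32_t raw = ((inst >> 25) << 5) | ((inst >> 7) & 0x1f);

    assert((inst & 3u) == 3u); /* all 32-bit RISC-V encodings end in 0b11 */
    c.memory_read_enable  = (opcode == OPCODE_LOAD);
    c.memory_write_enable = (opcode == OPCODE_STORE);
    c.rs1 = (inst >> 15) & 0x1f;
    c.rs2 = (inst >> 20) & 0x1f;
    c.imm = (int32_t)raw - ((raw & 0x800u) ? 4096 : 0); /* sign-extend 12 bits */
    return c;
}
```

Calling `decode_mem_ctrl(0x00a02223)` yields `memory_write_enable = true`, `rs1 = 0`, `rs2 = 10`, and `imm = 4`, which matches `sw x10, 4(x0)`.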
### Execution

#### Explanation of Scala syntax
`MuxLookup` description
```scala
io.if_jump_flag :=
  (opcode === Instructions.jal) ||
  (opcode === Instructions.jalr) ||
  (opcode === InstructionTypes.B) &&
  MuxLookup(
    funct3,
    false.B,
    IndexedSeq(
      InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data),
      // ...
    )
  )
```
- `funct3`: the value to be matched.
- `false.B`: the default value when no entry matches.
- `InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data)`: if `funct3` equals `InstructionsTypeB.beq`, the result of `io.reg1_data === io.reg2_data` becomes the output of `MuxLookup`. If `funct3` matches no entry, `MuxLookup` outputs the default value `false.B`.
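The lookup behaves much like a `switch` with a default case. The C sketch below illustrates that analogy; the entries other than `beq` are elided in the snippet above, so the remaining cases here follow the RV32I specification rather than the author's code, and the names `branch_taken` and `F3_*` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* funct3 encodings of the RV32I conditional branches */
enum { F3_BEQ = 0, F3_BNE = 1, F3_BLT = 4, F3_BGE = 5, F3_BLTU = 6, F3_BGEU = 7 };

/* C analogue of MuxLookup(funct3, false.B, ...): pick a comparison by funct3. */
bool branch_taken(uint32_t funct3, uint32_t reg1, uint32_t reg2)
{
    assert(funct3 < 8); /* funct3 is a 3-bit field */
    switch (funct3) {
    case F3_BEQ:  return reg1 == reg2;
    case F3_BNE:  return reg1 != reg2;
    case F3_BLT:  return (int32_t)reg1 <  (int32_t)reg2; /* signed compare */
    case F3_BGE:  return (int32_t)reg1 >= (int32_t)reg2;
    case F3_BLTU: return reg1 <  reg2;                   /* unsigned compare */
    case F3_BGEU: return reg1 >= reg2;
    default:      return false; /* plays the role of the false.B default */
    }
}
```

In hardware all comparisons exist in parallel and a multiplexer selects one of them, whereas the C `switch` evaluates only the selected case; the selected result is the same.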
#### Test
- Test `add`, and obtain the expected output.
- Test `beq`: the jump flag is determined by comparing whether the two register values are equal.
#### GTKWave
1. `add`

- ALU input value `alu_io_op1` = `io_reg1_data` = 0x016A05E2
- ALU input value `alu_io_op2` = `io_reg2_data` = 0x0FBD8F12
- ALU output value `alu_io_result` is the sum of `alu_io_op1` and `alu_io_op2`.
- According to `ALUControl.scala`, `alu_io_function = 1` corresponds to `ALUFunctions.add`.
- Therefore, the output of the `add` operation matches the expected value.
2. `beq`

- ALU input value `alu_io_op1` = `io_instruction_address` = 0x00000002
- ALU input value `alu_io_op2` = `io_immediate` = 0x00000002
- ALU output value `alu_io_result` is the sum of `alu_io_op1` and `alu_io_op2`.
- `alu_io_func=1` -> ALUFunctions.add
- When the data of `reg1` and `reg2` are the same, `io_if_jump_flag` is set to 1.
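The operand selection and the two sums read from the waveforms can be reproduced with a few lines of C. This is a sketch under assumptions: the function name `alu_add` is hypothetical, and the mapping of source value 0 to register data is inferred from the `add` case above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Select the ALU operands the way the waveforms suggest, then add them.
 * op1_is_pc  models aluop1_source (0: reg1_data, 1: instruction_address).
 * op2_is_imm models aluop2_source (0: reg2_data, 1: immediate). */
uint32_t alu_add(bool op1_is_pc, bool op2_is_imm,
                 uint32_t reg1_data, uint32_t reg2_data,
                 uint32_t instruction_address, uint32_t immediate)
{
    uint32_t op1 = op1_is_pc ? instruction_address : reg1_data;
    uint32_t op2 = op2_is_imm ? immediate : reg2_data;
    return op1 + op2; /* wraps modulo 2^32, like a 32-bit adder */
}

/* Self-check against the values listed above. */
void check_waveform_values(void)
{
    /* add: 0x016A05E2 + 0x0FBD8F12 */
    assert(alu_add(false, false, 0x016A05E2u, 0x0FBD8F12u, 0, 0) == 0x112794F4u);
    /* beq: instruction address 2 + immediate 2 gives the branch target */
    assert(alu_add(true, true, 0, 0, 0x00000002u, 0x00000002u) == 0x00000004u);
}
```

For `add`, the expected `alu_io_result` is therefore 0x112794F4; for `beq`, the ALU produces the branch target 0x00000004 while the register comparison decides whether the jump is taken.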
### RegisterFileTest
#### Test
- Write data to a register and check that it can be read back.
- Check that `x0` always reads as 0, even after a write to it.
#### GTKWave

- `io_write_enable` is set to 1, and data is written to registers_2, and it is successfully read.

- It can be observed that the value of x0(registers_0=0x00000000) remains unchanged, with no modifications made to x0.
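The two properties checked here can be expressed as a tiny C model of a register file. It is a sketch only: the names `regfile`, `regfile_write`, `regfile_read`, and `regfile_roundtrip` are hypothetical, and the model ignores clocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t regs[32]; } regfile;

/* Write data to register addr when write_enable is set; x0 is never modified. */
void regfile_write(regfile *rf, bool write_enable, uint32_t addr, uint32_t data)
{
    assert(addr < 32);
    if (write_enable && addr != 0)
        rf->regs[addr] = data;
}

/* Read register addr; x0 always reads as zero. */
uint32_t regfile_read(const regfile *rf, uint32_t addr)
{
    assert(addr < 32);
    return addr == 0 ? 0 : rf->regs[addr];
}

/* Write a value into a cleared register file and return what is read back. */
uint32_t regfile_roundtrip(uint32_t addr, uint32_t data)
{
    regfile rf = {{0}};
    regfile_write(&rf, true, addr, data);
    return regfile_read(&rf, addr);
}
```

`regfile_roundtrip(2, 0xDEADBEEF)` returns the written value, while `regfile_roundtrip(0, 0xDEADBEEF)` returns 0.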
### CPU
Complete `CPU.scala`: this file is missing the data and signal connections to the EXE module.
#### Test
- Calculate the Fibonacci sequence and check for the expected answer.
- Run quicksort and check for the expected result.
- Test whether the CPU can correctly store and load a single byte of data.
# HW2 runs on MyCPU
## Adapt [HW2](https://github.com/jeremy90307/Computer_Architecture/tree/main/HW2) for MyCPU
:::spoiler **C code**
```c=
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>
#define array_size 7
#define range 127 /* 2^(n-1)-1, n: quant bit */
float fp32_to_bf16(float x);
int* quant_bf16_to_int8(float x[]);
float bf16_findmax(float x[]);
typedef uint64_t ticks;
/* Read the 64-bit cycle counter on RV32, guarding against a carry
 * between the two halves. */
static inline ticks getticks(void)
{
    uint64_t result;
    uint32_t l, h, h2;
    asm volatile(
        "rdcycleh %0\n"
        "rdcycle %1\n"
        "rdcycleh %2\n"
        "sub %0, %0, %2\n"
        "seqz %0, %0\n"
        "sub %0, zero, %0\n"
        "and %1, %1, %0\n"
        : "=r"(h), "=r"(l), "=r"(h2));
    result = (((uint64_t) h) << 32) | ((uint64_t) l);
    return result;
}
int main()
{
    ticks t0 = getticks();
    float array[array_size] = {1.200000, 1.203125, 2.310000, 2.312500,
                               3.460000, 3.4531255, 5.630000};
    float array2[array_size] = {0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5};
    float array3[array_size] = {3.14159265, 0.12345678, 1.23456789, 0.00000123,
                                0.00000001, 0.99999999, 0.00000007};
    float array_bf16[array_size] = {0};
    int *after_quant;
    /* data 1 */
    for (int i = 0; i < array_size; i++) {
        array_bf16[i] = fp32_to_bf16(array[i]);
    }
    printf("data 1\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }
    /* data 2 */
    for (int i = 0; i < array_size; i++) {
        array_bf16[i] = fp32_to_bf16(array2[i]);
    }
    printf("data 2\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }
    /* data 3 */
    for (int i = 0; i < array_size; i++) {
        array_bf16[i] = fp32_to_bf16(array3[i]);
    }
    printf("data 3\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }
    ticks t1 = getticks();
    printf("elapsed cycle: %" PRIu64 "\n", t1 - t0);
    system("pause");
    return 0;
}
float fp32_to_bf16(float x)
{
    float y = x;
    int *p = (int *)&y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    if (exp == 0 && man == 0) /* zero */
        return x;
    if (exp == 0x7F800000) /* infinity or NaN */
        return x;
    /* Normalized number */
    /* round to nearest */
    float r = x;
    int *pr = (int *)&r;
    *pr &= 0xFF800000; /* r has the same exp as x */
    r /= 0x100;
    y = x + r;
    *p &= 0xFFFF0000; /* keep only the upper 16 bits */
    return y;
}
int *quant_bf16_to_int8(float x[array_size])
{
    static int after_quant[array_size] = {0};
    /* find the largest absolute value */
    float max = fabs(x[0]);
    for (int i = 1; i < array_size; i++) {
        if (fabs(x[i]) > max) {
            max = fabs(x[i]);
        }
    }
    printf("maximum number is %.12f\n", max);
    float scale = range / max;
    for (int i = 0; i < array_size; i++) {
        after_quant[i] = (int)(x[i] * scale); /* truncates toward zero */
    }
    return after_quant;
}
```
:::
### Process
1. Place the HW2 assembly code (`hw2.S`) in the `ca2023-lab3/csrc` directory.
2. Modify `hw2.S`: remove `ecall` and add the `_start:` label.
3. Modify the `Makefile` and add `hw2.asmbin` to `BINS`.
4. Run `make update` in that directory to generate `hw2.asmbin`.
5. In `CPUTest.scala`, add a test class for `hw2.asmbin`.
```scala
class HW2Test extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "calculate the scale" in {
    test(new TestTopModule("hw2.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 50) {
        c.clock.step(1000)
        c.io.mem_debug_read_address.poke((i * 4).U)
      }
      c.io.regs_debug_read_address.poke(16.U) // a6
      c.clock.step()
      c.io.regs_debug_read_data.expect(0x41d00000.U)
    }
  }
}
```
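The value `0x41d00000` expected in register `a6` is a raw bit pattern. Read as an IEEE 754 single-precision number it has sign 0, exponent field 0x83 (that is, 2^4), and fraction 0.625, which gives 1.625 × 16 = 26.0. The C sketch below performs the same reinterpretation; the helper names `bits_to_float` and `bf16_bits` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE 754 single-precision float. */
float bits_to_float(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f); /* avoids the aliasing problems of pointer casts */
    return f;
}

/* Keep only the upper 16 bits, the part that a bfloat16 value occupies. */
uint32_t bf16_bits(uint32_t fp32_bits)
{
    return fp32_bits & 0xFFFF0000u;
}
```

Since the lower 16 bits of `0x41d00000` are already zero, `bf16_bits` leaves it unchanged, so the pattern is exactly representable as a bfloat16 value.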
Run the test, which checks the scale value for the first data set in HW2:
```
$ sbt "testOnly riscv.singlecycle.HW2Test"
```
Output
```
[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/jeremytsai/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/jeremytsai/ca2023-lab3/)
[info] compiling 1 Scala source to /home/jeremytsai/ca2023-lab3/target/scala-2.13/test-classes ...
[info] HW2Test:
[info] Single Cycle CPU
[info] - should calculate the scale
[info] Run completed in 4 seconds, 832 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 9 s, completed 2023/11/30 下午 05:12:43
```
## Verilator
Generate the Verilog files:
```
$ make verilator
```
| Parameter | Usage |
|:----------|:------|
| `-memory` | Specify the size of the simulation memory in words (4 bytes each).<br> Example: `-memory 4096` |
| `-instruction` | Specify the RISC-V program used to initialize the simulation memory.<br>Example: `-instruction src/main/resources/hello.asmbin` |
| `-signature` | Specify the memory range and destination file to output after simulation.<br>Example: `-signature 0x100 0x200 mem.txt` |
| `-halt` | Specify the halt identifier address; writing `0xBABECAFE` to this memory address stops the simulation.<br>Example: `-halt 0x8000` |
| `-vcd` | Specify the filename for saving the simulation waveform during the process; not specifying this parameter will not generate a waveform file.<br>Example: `-vcd dump.vcd` |
| `-time` | Specify the maximum simulation time; note that time is **twice** the number of cycles.<br>Example: `-time 1000` |
Load the `hw2.asmbin` file, simulate for 2000 cycles (`-time 4000`, since time is twice the cycle count), and save the waveform to `dump01.vcd`:
```
./run-verilator.sh -instruction src/main/resources/hw2.asmbin -time 4000 -vcd dump01.vcd
```
Output
```
-time 4000
-memory 1048576
-instruction src/main/resources/hw2.asmbin
[-------------------->] 100%
```
## Check the waveform of `dump01.vcd` in GTKWave
### I-type : `addi x2, x2, -4`
Hexadecimal = 0xffc10113
Binary = 111111111100 00010 000 00010 0010011 (imm[11:0] | rs1 | funct3 | rd | opcode)
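The field split of the I-type encoding can be verified with a short C sketch. The struct `itype_fields` and the function `decode_itype` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t opcode, rd, funct3, rs1;
    int32_t  imm; /* sign-extended 12-bit immediate */
} itype_fields;

/* Split an I-type instruction word into its fields. */
itype_fields decode_itype(uint32_t inst)
{
    itype_fields f;
    uint32_t raw = inst >> 20; /* imm[11:0] = inst[31:20] */

    assert((inst & 3u) == 3u); /* all 32-bit RISC-V encodings end in 0b11 */
    f.opcode = inst & 0x7f;
    f.rd     = (inst >> 7) & 0x1f;
    f.funct3 = (inst >> 12) & 0x7;
    f.rs1    = (inst >> 15) & 0x1f;
    f.imm    = (int32_t)raw - ((raw & 0x800u) ? 4096 : 0);
    return f;
}
```

For `0xffc10113` this gives `opcode = 0x13`, `rd = 2`, `funct3 = 0`, `rs1 = 2`, and `imm = -4`, which is 0xFFFFFFFC as a 32-bit value.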
#### IF

- `io_instruction=0xFFC10113 -> addi x2, x2, -4`
- `io_instruction_read_data=io_instruction`
- Since `io_jump_flag_id=0`, the next pc is pc+4.
#### ID

- `io_ex_aluop1_source=0` selects `io_reg1_data` as ALU operand 1.
- `io_ex_aluop2_source=1` selects `io_ex_immediate` as ALU operand 2 (`io_ex_immediate=0xFFFFFFFC`, i.e. -4).
- Since `io_memory_read_enable=io_memory_write_enable=0`, memory is neither read nor written.
<font color="#f00">L-type : `io_memory_read_enable = 1`
S-type : `io_memory_write_enable = 1`</font>
- `io_regs_reg1_read_address=02`(x2=sp)
#### EXE

- `alu_ctrl_io_alu_func=1` selects the add function, which `addi` uses.
- Because `io_aluop1_source=0`, `alu_io_op1` is equal to `io_reg1_data`, which is 0.
- Because `io_aluop2_source=1`, `alu_io_op2` is equal to `io_immediate`, which is 0xFFFFFFFC.
#### MEM

- From the figure below, it can be observed that `io_alu_result` is equal to `io_memory_bundle_address`, both being 0xFFFFFFFC.

- At this stage, no read/write operations are performed on the memory.
#### WB

- Write `io_regs_write_data=0xFFFFFFFC` to register `x2`.
### J-type : `jal x0, 68`
Hexadecimal = 0x0440006f
Binary = 00000100010000000000 00000 1101111 (imm[20|10:1|11|19:12] | rd | opcode)
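The J-type immediate is stored in scrambled order, so recovering the offset 68 takes a little bit shuffling. The C sketch below follows the RV32I J-type layout; the function names `decode_jtype_imm` and `jal_target` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reassemble the sign-extended J-type immediate from an instruction word. */
int32_t decode_jtype_imm(uint32_t inst)
{
    uint32_t raw;

    assert((inst & 0x7f) == 0x6f); /* opcode of jal */
    raw = (((inst >> 31) & 0x1)   << 20)   /* imm[20]    = inst[31]    */
        | (((inst >> 21) & 0x3ff) << 1)    /* imm[10:1]  = inst[30:21] */
        | (((inst >> 20) & 0x1)   << 11)   /* imm[11]    = inst[20]    */
        | (((inst >> 12) & 0xff)  << 12);  /* imm[19:12] = inst[19:12] */
    return (int32_t)raw - ((raw & 0x100000u) ? 0x200000 : 0); /* sign-extend 21 bits */
}

/* Jump target of jal: the instruction address plus the immediate. */
uint32_t jal_target(uint32_t pc, uint32_t inst)
{
    return pc + (uint32_t)decode_jtype_imm(inst);
}
```

`decode_jtype_imm(0x0440006f)` returns 68 (0x44), and `jal_target(0x1054, 0x0440006f)` returns 0x1098, the jump address seen in the waveforms of this section.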
#### IF

- `io_instruction[31:0]=0440006F -> jal x0, 68`
- `io_instruction[31:0]=io_instruction_read_data[31:0]`
- Because `io_jump_flag_id=1`, the program counter (pc) takes the value of `io_jump_address_id[31:0]=00001098`, so the next pc is 0x1098.
#### ID

- `io_ex_aluop1_source=1` selects `io.instruction_address` as ALU operand 1.
- `io_ex_aluop2_source=1` selects `io_ex_immediate[31:0]=00000044` as ALU operand 2.
- Because `io_memory_read_enable=0` and `io_memory_write_enable=0`, memory is neither read nor written.
- `io_regs_reg1_read_address[4:0]=00` (x0) and `rd[4:0]=00` (x0) both refer to x0.
#### EXE

- Because `io_aluop1_source=1`, `alu_io_op1` is equal to `io_instruction_address=00001054` (pc = 0x1054).
- Because `io_aluop2_source=1`, `alu_io_op2` is equal to `io_immediate`, which is 0x00000044.
- Because `io_if_jump_flag=1`, the program counter (pc) jumps to `io_if_jump_address=00001098`, so the next pc is 0x1098.
- `io_if_jump_address` is the sum of `io_immediate` and `io_instruction_address`.
#### MEM

- At this stage, no read/write operations are performed on the memory.
- `io_alu_result` is equal to `io_memory_bundle_address`.
#### WB

- At this stage, no changes are made to the registers.
### S-type : `sw x15, 0(x12)`
Hexadecimal = 0x00f62023
Binary = 0000000 01111 01100 010 00000 0100011 (imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode)
#### IF

- `io_instruction=0x00F62023 -> sw x15, 0(x12)`
- `io_instruction_read_data=io_instruction`
- Since `io_jump_flag_id=0`, the next pc is pc+4.
#### ID

- `io_ex_aluop1_source=0` selects register data as ALU operand 1: the base memory address held in x12 (`io_regs_reg1_read_address=0x0C`).
- `io_ex_aluop2_source=1` selects the immediate as ALU operand 2: the memory address offset, which is 0 (`io_ex_immediate=0x00000000`).
- Because `io_memory_write_enable=1`, data is written to memory.
#### EXE

- The ALU output, `alu_io_result=00001334`, is equal to the sum of ALU operands, where `alu_io_op1=00001334` and `alu_io_op2=00000000`.
#### MEM

- The memory address input, `io_memory_bundle_address=0x00001334`, is equivalent to the ALU output, `io_alu_result=00001334`.
- The data input of the memory `io_memory_bundle_write_data` is 0x00000000, which is equal to the value of `io_reg2_data`.
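The EXE and MEM steps of a store can be summarized in a small C model: the ALU adds base and offset, and the result addresses the word that receives `reg2_data`. This is a sketch under assumptions; the memory size `MEM_WORDS` and the names `memory`, `mem_store_word`, and `sw_then_read` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MEM_WORDS = 2048 }; /* hypothetical memory size (8 KiB), large enough for 0x1334 */

typedef struct { uint32_t words[MEM_WORDS]; } memory;

/* MEM stage of a store: the ALU result is the address, reg2_data is the data. */
void mem_store_word(memory *m, bool write_enable, uint32_t alu_result, uint32_t reg2_data)
{
    assert(alu_result % 4 == 0 && alu_result / 4 < MEM_WORDS);
    if (write_enable)
        m->words[alu_result / 4] = reg2_data;
}

/* Model `sw rs2, imm(rs1)` on a memory pre-filled with `fill`, then read the word back. */
uint32_t sw_then_read(uint32_t rs1_data, int32_t imm, uint32_t rs2_data, uint32_t fill)
{
    static memory m;
    uint32_t address = rs1_data + (uint32_t)imm; /* EXE stage: base + offset */

    for (int i = 0; i < MEM_WORDS; i++)
        m.words[i] = fill;
    mem_store_word(&m, true, address, rs2_data);
    return m.words[address / 4];
}
```

With the values from the waveform, `sw_then_read(0x1334, 0, 0x00000000, 0xFFFFFFFF)` returns 0, showing that the word at 0x1334 was overwritten with the store data.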
#### WB

- S-type does not assign new values to registers.
# Reference
- [Lab3: Construct a single-cycle RISC-V CPU with Chisel](https://hackmd.io/@sysprog/r1mlr3I7p)
- [HW2](https://hackmd.io/3oeAp56nT3uVBbEyTxyQnQ)
- [RISC-V Instruction Encoder/Decoder](https://luplab.gitlab.io/rvcodecjs/)