# Assignment3: single-cycle RISC-V CPU
contributed by < [`eeeXun`](https://github.com/eeeXun) >
## Installation
On Arch Linux
```
sudo pacman -S sbt jdk17-openjdk verilator gtkwave
```
The installed versions of these packages:
```
pacman -Q | grep "sbt\|jdk\|verilator\|gtkwave"
```
```
gtkwave 3.3.117-1
jdk17-openjdk 17.0.9.u8-2
jre17-openjdk 17.0.9.u8-2
jre17-openjdk-headless 17.0.9.u8-2
sbt 1:1.8.3-1
verilator 5.018-1
```
## Pass all tests
At the beginning, I have no idea where to start, so I look into the failing tests.
In `InstructionFetchTest`, I see that when `jump_flag_id` is turned on, `instruction_address` should be `entry`, and the value of `entry` is passed into `instruction_address_id`.
If `jump_flag_id` is turned off, `instruction_address` should be `cur`, where `cur` is `prev` + 4.
So at this point, I could make the `InstructionFetchTest` pass.
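The rule the test checks can be summarised as a next-PC function. Below is a minimal plain-Scala model of it, not the Chisel code from the lab. `NextPcModel`, `nextPc` and the parameter names are made up for illustration.

```scala
// Plain-Scala model of the next-PC rule that InstructionFetchTest checks.
// This is not Chisel code; all names here are hypothetical.
object NextPcModel {
  // If a jump is requested, the next PC is the jump target;
  // otherwise it is the current PC plus one 4-byte instruction.
  def nextPc(pc: Long, jumpFlag: Boolean, jumpAddress: Long): Long =
    if (jumpFlag) jumpAddress else pc + 4
}
```

In Chisel, the same rule would typically be expressed with a `Mux` or a `when`/`.otherwise` block that updates the program counter register.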
But when it comes to `InstructionDecoderTest`, I see that the test only examines four signals: `ex_aluop1_source`, `ex_aluop2_source`, `regs_reg1_read_address` and `regs_reg2_read_address`. These four signals are already assigned in `InstructionDecode`, and I think they are all assigned correctly.
The error message shows
```
[info] - should produce correct control signal *** FAILED ***
[info] : io.memory_write_enable <= VOID
[info] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/riscv/core/InstructionDecode.scala 127:14] : [module InstructionDecode] Reference io is not fully initialized.
```
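`memory_write_enable` is not initialized. But I still have no idea why that makes the test fail. In the [bootcamp](https://github.com/freechipsproject/chisel-bootcamp), I only saw failures when a value did not match the value given to the `expect` function. Here the failure comes earlier: FIRRTL rejects the module while compiling it, because an output port is never driven.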
So I look at [Lab3](https://hackmd.io/@sysprog/r1mlr3I7p) again. I found that I missed this [diagram](https://hackmd.io/@sysprog/r1mlr3I7p#Single-cycle-CPU-architecture-diagram). This diagram shows all the signals required for this homework. After following this diagram, all the failed tests are quickly resolved.
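As a rough illustration of what the two memory enable signals should encode, here is a plain-Scala model (not the lab's Chisel code). The opcode values are the standard RV32I load and store opcodes; `MemControlModel` and its members are hypothetical names.

```scala
// Plain-Scala model of the memory enable signals; names are hypothetical.
object MemControlModel {
  val OpcodeLoad  = 0x03 // 0000011: lb, lh, lw, lbu, lhu
  val OpcodeStore = 0x23 // 0100011: sb, sh, sw

  // The opcode is the low 7 bits of a 32-bit instruction word.
  def opcode(instruction: Int): Int = instruction & 0x7f

  // Each output has a defined value for every instruction, which is the
  // property FIRRTL's initialization check asks of a hardware module.
  def memoryReadEnable(instruction: Int): Boolean =
    opcode(instruction) == OpcodeLoad

  def memoryWriteEnable(instruction: Int): Boolean =
    opcode(instruction) == OpcodeStore
}
```

For example, `lw a6,0(a2)` (`0x00062803`) enables only the read, `sw a6,0(sp)` (`0x01012023`) enables only the write, and `li a7,1` (`0x00100893`) enables neither.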
## Assignment 2
In the assembly code of [assignment 2](https://github.com/eeeXun/computer_architecture/blob/master/hw2/hw2.s), I remove all the instructions related to `ecall`. And in [exit](https://github.com/eeeXun/computer_architecture/blob/master/hw2/hw2.s#L270), I change it to an infinite loop:
```assembly!
exit:
j exit
```
There is only one test case, and its result is stored in register `s3`. So my test only checks whether the value of register `s3` is correct.
```scala!
class HW2Test extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "multiply two bfloat16" in {
    test(new TestTopModule("hw2.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      // Step the clock in chunks of 1000 cycles, 50 times in total,
      // so the program has time to reach the `exit` loop.
      for (i <- 1 to 50) {
        c.clock.step(1000)
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      // Read back register x19 (s3), which holds the result.
      c.io.regs_debug_read_address.poke(19.U) // s3
      c.io.regs_debug_read_data.expect(0x440a0000.U)
    }
  }
}
```
## Waveform During Testing
### Startup
I load the `vcd` file of assignment 2 and first compare the `clock` with `instruction_address` and `instruction`. I observe that it takes a long time before the first instruction starts.

Then I compare it with memory, `rom_loader`.

The instruction start time is almost identical to the time when `rom_loader` stops changing. Then I look at the objdump output of assignment 2 and compare it with what I observe in GTKWave.
:::spoiler objdump
```assembly!
hw2.o: file format elf32-littleriscv
Disassembly of section .text:
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
8: 00060613 mv a2,a2
c: 00062803 lw a6,0(a2) # 4 <_start+0x4>
10: 028000ef jal 38 <f32_b16_p1>
14: 000807b3 add a5,a6,zero
18: 00462803 lw a6,4(a2)
1c: 01c000ef jal 38 <f32_b16_p1>
20: 00080733 add a4,a6,zero
24: 0a8000ef jal cc <encoder>
28: 00098cb3 add s9,s3,zero
2c: 0b8000ef jal e4 <decoder>
30: 0d0000ef jal 100 <Multi_bfloat>
34: 25c0006f j 290 <exit>
00000038 <f32_b16_p1>:
38: 01012023 sw a6,0(sp)
3c: 000802b3 add t0,a6,zero
40: 7f800fb7 lui t6,0x7f800
44: 01f2f333 and t1,t0,t6
48: 00800fb7 lui t6,0x800
4c: ffff8f93 add t6,t6,-1 # 7fffff <str+0x7ffd53>
50: 01f2f3b3 and t2,t0,t6
54: 7f800fb7 lui t6,0x7f800
58: 07f30463 beq t1,t6,c0 <inf_or_zero>
5c: 00736e33 or t3,t1,t2
60: 060e0063 beqz t3,c0 <inf_or_zero>
64: 00800fb7 lui t6,0x800
68: 01f3e3b3 or t2,t2,t6
6c: 00008fb7 lui t6,0x8
70: 01f383b3 add t2,t2,t6
74: 0183df13 srl t5,t2,0x18
78: 020f0063 beqz t5,98 <no_overflow>
7c: 00800fb7 lui t6,0x800
80: 01f30333 add t1,t1,t6
84: 0113d393 srl t2,t2,0x11
88: 07f00f93 li t6,127
8c: 01f3f3b3 and t2,t2,t6
90: 01039393 sll t2,t2,0x10
94: 0140006f j a8 <f32_b16_p2>
00000098 <no_overflow>:
98: 0103d393 srl t2,t2,0x10
9c: 07f00f93 li t6,127
a0: 01f3f3b3 and t2,t2,t6
a4: 01039393 sll t2,t2,0x10
000000a8 <f32_b16_p2>:
a8: 01f2d293 srl t0,t0,0x1f
ac: 01f29293 sll t0,t0,0x1f
b0: 0062e2b3 or t0,t0,t1
b4: 0072e2b3 or t0,t0,t2
b8: 00028833 add a6,t0,zero
bc: 00008067 ret
000000c0 <inf_or_zero>:
c0: 01085813 srl a6,a6,0x10
c4: 01081813 sll a6,a6,0x10
c8: 00008067 ret
000000cc <encoder>:
cc: 000782b3 add t0,a5,zero
d0: 00070333 add t1,a4,zero
d4: 01035313 srl t1,t1,0x10
d8: 0062e2b3 or t0,t0,t1
dc: 000289b3 add s3,t0,zero
e0: 00008067 ret
000000e4 <decoder>:
e4: 000c82b3 add t0,s9,zero
e8: ffff0937 lui s2,0xffff0
ec: 0122f333 and t1,t0,s2
f0: 01029393 sll t2,t0,0x10
f4: 00030b33 add s6,t1,zero
f8: 00038ab3 add s5,t2,zero
fc: 00008067 ret
00000100 <Multi_bfloat>:
100: 000a82b3 add t0,s5,zero
104: 000b0333 add t1,s6,zero
108: 7f800fb7 lui t6,0x7f800
10c: 01f2fe33 and t3,t0,t6
110: 01f373b3 and t2,t1,t6
114: 007e0e33 add t3,t3,t2
118: 3f800fb7 lui t6,0x3f800
11c: 41fe0e33 sub t3,t3,t6
120: 0062c3b3 xor t2,t0,t1
124: 01f3d393 srl t2,t2,0x1f
128: 01f39393 sll t2,t2,0x1f
12c: 007e6e33 or t3,t3,t2
130: 00929293 sll t0,t0,0x9
134: 0092d293 srl t0,t0,0x9
138: 005e62b3 or t0,t3,t0
13c: 007f0fb7 lui t6,0x7f0
140: 01f2f3b3 and t2,t0,t6
144: 01f37e33 and t3,t1,t6
148: 00839393 sll t2,t2,0x8
14c: 80000fb7 lui t6,0x80000
150: 01f3e3b3 or t2,t2,t6
154: 0013d393 srl t2,t2,0x1
158: 008e1e13 sll t3,t3,0x8
15c: 01fe6e33 or t3,t3,t6
160: 001e5e13 srl t3,t3,0x1
164: 00000333 add t1,zero,zero
168: 80000fb7 lui t6,0x80000
16c: 001fdf93 srl t6,t6,0x1
170: 01f3feb3 and t4,t2,t6
174: 01d03433 snez s0,t4
178: 40800433 neg s0,s0
17c: 01c474b3 and s1,s0,t3
180: 00930333 add t1,t1,s1
184: 001e5e13 srl t3,t3,0x1
188: 001fdf93 srl t6,t6,0x1
18c: 01f3feb3 and t4,t2,t6
190: 01d03433 snez s0,t4
194: 40800433 neg s0,s0
198: 01c474b3 and s1,s0,t3
19c: 00930333 add t1,t1,s1
1a0: 001e5e13 srl t3,t3,0x1
1a4: 001fdf93 srl t6,t6,0x1
1a8: 01f3feb3 and t4,t2,t6
1ac: 01d03433 snez s0,t4
1b0: 40800433 neg s0,s0
1b4: 01c474b3 and s1,s0,t3
1b8: 00930333 add t1,t1,s1
1bc: 001e5e13 srl t3,t3,0x1
1c0: 001fdf93 srl t6,t6,0x1
1c4: 01f3feb3 and t4,t2,t6
1c8: 01d03433 snez s0,t4
1cc: 40800433 neg s0,s0
1d0: 01c474b3 and s1,s0,t3
1d4: 00930333 add t1,t1,s1
1d8: 001e5e13 srl t3,t3,0x1
1dc: 001fdf93 srl t6,t6,0x1
1e0: 01f3feb3 and t4,t2,t6
1e4: 01d03433 snez s0,t4
1e8: 40800433 neg s0,s0
1ec: 01c474b3 and s1,s0,t3
1f0: 00930333 add t1,t1,s1
1f4: 001e5e13 srl t3,t3,0x1
1f8: 001fdf93 srl t6,t6,0x1
1fc: 01f3feb3 and t4,t2,t6
200: 01d03433 snez s0,t4
204: 40800433 neg s0,s0
208: 01c474b3 and s1,s0,t3
20c: 00930333 add t1,t1,s1
210: 001e5e13 srl t3,t3,0x1
214: 001fdf93 srl t6,t6,0x1
218: 01f3feb3 and t4,t2,t6
21c: 01d03433 snez s0,t4
220: 40800433 neg s0,s0
224: 01c474b3 and s1,s0,t3
228: 00930333 add t1,t1,s1
22c: 001e5e13 srl t3,t3,0x1
230: 001fdf93 srl t6,t6,0x1
234: 01f3feb3 and t4,t2,t6
238: 01d03433 snez s0,t4
23c: 40800433 neg s0,s0
240: 01c474b3 and s1,s0,t3
244: 00930333 add t1,t1,s1
248: 001e5e13 srl t3,t3,0x1
24c: 80000fb7 lui t6,0x80000
250: 01f37eb3 and t4,t1,t6
254: 000e8a63 beqz t4,268 <not_overflow>
258: 00131313 sll t1,t1,0x1
25c: 00800fb7 lui t6,0x800
260: 01f282b3 add t0,t0,t6
264: 0080006f j 26c <Mult_end>
00000268 <not_overflow>:
268: 00231313 sll t1,t1,0x2
0000026c <Mult_end>:
26c: 01835313 srl t1,t1,0x18
270: 00130313 add t1,t1,1
274: 00135313 srl t1,t1,0x1
278: 01031313 sll t1,t1,0x10
27c: 0172d293 srl t0,t0,0x17
280: 01729293 sll t0,t0,0x17
284: 0062e2b3 or t0,t0,t1
288: 000289b3 add s3,t0,zero
28c: 00008067 ret
00000290 <exit>:
290: 0000006f j 290 <exit>
00000294 <test0>:
294: 4141f9a7 .word 0x4141f9a7
298: 423645a2 .word 0x423645a2
0000029c <test1>:
29c: 3fa66666 .word 0x3fa66666
2a0: 42c63333 .word 0x42c63333
000002a4 <test2>:
2a4: 43e43a5e .word 0x43e43a5e
2a8: 42b1999a .word 0x42b1999a
000002ac <str>:
2ac: 0000000a .word 0x0000000a
```
:::

So this period should be the time spent loading the program into memory!
### InstructionFetch
After `inst_fetch.io.instruction_read_data` is loaded, it takes 3 CPU clock cycles for `inst_fetch.io.instruction_address`, which is the `PC`, to change. And after `inst_fetch.io.instruction_address` changes, it takes 1 CPU clock cycle to load `inst_fetch.io.instruction_read_data`.


Together, these 3 CPU clock cycles and 1 CPU clock cycle make up one instruction fetch period of 4 CPU clock cycles.

### InstructionDecode
`InstructionDecode` produces its outputs immediately when the input `instruction` changes, and it holds them for 4 CPU cycles.

Unlike `InstructionFetch`, there is no clock inside `InstructionDecode`. I guess this is because there is no register inside `InstructionDecode`.
The register inside `InstructionFetch`:
```scala!
val pc = RegInit(ProgramCounter.EntryAddress)
```
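To make the difference concrete, here is a toy plain-Scala model, not Chisel: a combinational block is a pure function of its inputs, while a register only takes a new value when the clock ticks. `ClockingModel`, `RegModel` and `decodeRd` are hypothetical names; the position of the `rd` field (bits 11 to 7) comes from the RISC-V base encoding.

```scala
// Toy model of combinational logic versus a clocked register.
// Not Chisel code; all names are hypothetical.
object ClockingModel {
  // Combinational: the output follows the input, no clock involved.
  def decodeRd(instruction: Int): Int = (instruction >>> 7) & 0x1f

  // Register: `next` is only copied into the visible value on a tick.
  final class RegModel(init: Long) {
    private var current: Long = init
    var next: Long = init
    def value: Long = current
    def tick(): Unit = { current = next }
  }
}
```

With this model, setting `next` on a `RegModel` has no visible effect until `tick()` is called, whereas `decodeRd` answers as soon as it is given an instruction word.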
### Execute
When `Execute` receives all of its input signals (`instruction`, `instruction_address`, `reg1_data`, `reg2_data`, `immediate`, `aluop1_source` and `aluop2_source`), it generates the outputs `mem_alu_result`, `if_jump_flag` and `if_jump_address` immediately. And it holds the state for 4 CPU cycles.
But there is something weird: in some cases, it does not hold the state for 4 CPU cycles. Take the following `auipc` instruction for example:
```assembly!
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
```
The `mem_alu_result` changes at the 3rd CPU cycle. I found out this is because `ex.io.instruction` (which comes from `inst_fetch.io.instruction`) does not change at the same time as `ex.io.instruction_address` (which comes from `inst_fetch.io.instruction_address`).
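One way to picture this is a plain-Scala model of what `Execute` computes for `auipc`: the instruction address plus the upper immediate. This is only a sketch, assuming the ALU result is purely combinational; `AuipcModel` and its functions are hypothetical and not taken from the lab code.

```scala
// Plain-Scala model of the auipc result; names are hypothetical.
object AuipcModel {
  // U-type immediate: upper 20 bits of the instruction, low 12 bits zero.
  def immediateU(instruction: Int): Int = instruction & 0xfffff000

  // auipc adds that immediate to the address presented with the instruction.
  def auipcResult(instructionAddress: Int, instruction: Int): Int =
    instructionAddress + immediateU(instruction)
}
```

For `auipc a2,0x0` (`0x00000617`) the immediate is 0, so the modeled result simply equals the address input. If the address input moves from 4 to 8 while the instruction input still holds the `auipc` word, the modeled result moves from 4 to 8 as well, which is one possible reading of the early change seen in the waveform.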

### MemoryAccess
In the following `lw` example, it takes 1 CPU cycle after the instruction is fetched to load the data from memory. And it holds the data for 3 CPU cycles.
```assembly!
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
8: 00060613 mv a2,a2
c: 00062803 lw a6,0(a2) # 4 <_start+0x4>
```

### WriteBack
In the following `auipc` example, `wb_io_regs_write_data` comes from `ex_io_mem_alu_result`.
```assembly!
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
```

In the following `lw` example, it loads the word `0x4141f9a7`. `wb_io_regs_write_data` arrives 1 CPU cycle after the instruction is loaded, because the data comes from `mem_io_wb_memory_read_data`.
```assembly!
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
8: 00060613 mv a2,a2
c: 00062803 lw a6,0(a2) # 4 <_start+0x4>
```
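The two cases above amount to a selection between two data sources. A minimal plain-Scala sketch of that selection follows. It is not the lab's Chisel code, and `WriteBackModel`, `Source`, `FromAlu` and `FromMemory` are hypothetical stand-ins for the real selector signal.

```scala
// Plain-Scala model of the write-back selection; names are hypothetical.
object WriteBackModel {
  sealed trait Source
  case object FromAlu extends Source    // e.g. auipc
  case object FromMemory extends Source // e.g. lw

  // Pick the value that is written to the destination register.
  def regsWriteData(source: Source, aluResult: Long, memoryReadData: Long): Long =
    source match {
      case FromAlu    => aluResult
      case FromMemory => memoryReadData
    }
}
```

A real core also needs a further source for `jal` and `jalr`, which write the return address to the destination register; that case is left out of this sketch.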

## Waveform on Verilator
The `vcd` file generated by Verilator is different from the one generated during testing. There is no boot-up time: from the first CPU cycle, the CPU is already fetching, decoding and executing instructions. And the CPU cycle is 4 ps, which is different from the 2 ps in the waveform generated during testing.

The output `inst_fetch_io_instruction_read_data` is generated a quarter of a CPU cycle (1 ps) after `inst_fetch_io_instruction_address` changes. During testing, however, it takes 1 CPU cycle (2 ps).
The interval between one instruction fetch and the next is 1 CPU cycle (4 ps). This differs from the interval observed during testing, which is 4 CPU cycles (8 ps).
## ecall
I implement `ecall` in this [branch](https://github.com/eeeXun/ca2023-lab3/tree/ecall) and test it with the [assembly program that prints `RISC-V\n`](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md#risc-v-calling-conventions).
Initially, I add the following signals to `CPUBundle.scala`: `ecall_flag`, which is turned on when an `ecall` instruction is decoded, and `ecall_a0`, `ecall_a1`, `ecall_a2` and `ecall_a7`, which carry the values of the corresponding registers.
In the `Simulator.run` function of `verilog/verilator/sim_main.cpp`, I check whether `top->io_ecall_flag` is true. If it is, I read the code in `top->io_ecall_a7` and compare it with the [system call number](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md#newlib-integration). If the code is that of `write`, I get the data with the `memory->read` function, starting at address `top->io_ecall_a1` with length `top->io_ecall_a2`.
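The check described above can be modeled in a few lines. The sketch below is plain Scala rather than the C++ in `sim_main.cpp`, and it is only an approximation of that logic: `EcallModel`, `handle` and the byte-array memory are hypothetical. The value 64 is the `write` system call number from the linked table.

```scala
import java.nio.charset.StandardCharsets

// Plain-Scala model of the ecall check; names are hypothetical.
object EcallModel {
  val SysWrite = 64 // `write` system call number

  // Returns the text to print for a `write` ecall, otherwise None.
  def handle(ecallFlag: Boolean, a7: Int, a1: Int, a2: Int,
             memory: Array[Byte]): Option[String] =
    if (ecallFlag && a7 == SysWrite) {
      // a1 is the buffer address, a2 is the length in bytes.
      val bytes = memory.slice(a1, a1 + a2)
      Some(new String(bytes, StandardCharsets.US_ASCII))
    } else None
}
```

Returning an `Option` keeps the decision (is this a `write` call?) separate from the side effect of printing, which makes the check easy to exercise on its own.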
When I run `make verilator`, I get some errors:
```
[error] firrtl.passes.PassExceptions:
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_a7 <= VOID
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_flag <= VOID
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_a0 <= VOID
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_a2 <= VOID
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_a1 <= VOID
[error] firrtl.passes.PassException: 5 errors detected!
```
So I add the code below to `src/main/scala/board/verilator/Top.scala`
```diff!
--- a/src/main/scala/board/verilator/Top.scala
+++ b/src/main/scala/board/verilator/Top.scala
@@ -25,6 +25,12 @@ class Top extends Module {
cpu.io.instruction := io.instruction
cpu.io.instruction_valid := io.instruction_valid
+
+ io.ecall_flag := cpu.io.ecall_flag
+ io.ecall_a0 := cpu.io.ecall_a0
+ io.ecall_a1 := cpu.io.ecall_a1
+ io.ecall_a2 := cpu.io.ecall_a2
+ io.ecall_a7 := cpu.io.ecall_a7
}
object VerilogGenerator extends App {
```
Then it works! But I'm not sure what the relationship between `Top.scala` and `CPU.scala` is.
When I run Verilator, I find a bug in my code: the string `RISC-V\n` is printed 3 times. Then I inspect the waveform dumped from Verilator. The `ecall` is just like any other instruction: it is held for 1 CPU cycle.

So I suspect it is caused by the while loop in the `Simulator.run` function, which does not iterate exactly once per CPU cycle.
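If that suspicion is right, one possible remedy is to handle an `ecall` at most once per CPU cycle instead of once per loop iteration. The plain-Scala sketch below only illustrates the idea. It is not the C++ simulator code, and `EcallSamplingModel` and the `(cycle, flag)` samples are hypothetical.

```scala
// Model of sampling the ecall flag in a loop that runs several
// times per CPU cycle. All names are hypothetical.
object EcallSamplingModel {
  // One entry per loop iteration: (CPU cycle number, ecall flag).
  type Sample = (Int, Boolean)

  // Naive: act in every iteration where the flag is high.
  def countEveryIteration(samples: Seq[Sample]): Int =
    samples.count { case (_, flag) => flag }

  // Alternative: act once per distinct CPU cycle with the flag high.
  def countOncePerCycle(samples: Seq[Sample]): Int =
    samples.collect { case (cycle, true) => cycle }.distinct.size
}
```

For instance, three iterations that all fall in the same cycle with the flag high are counted three times by the first function and once by the second.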