# Assignment3: Single-cycle RISC-V CPU
<style>
.vis {
user-select:none;
display:inline;
/*oncopy="return false"*/
}
</style>
Contributed by [kc71486](https://github.com/kc71486)
###### tags: `RISC-V`, `jserv`
## Environment setup
os: ubuntu 22.04
sbt version: 1.9 (installed by following [this guide](https://www.scala-sbt.org/release/docs/Installing-sbt-on-Linux.html), using the `Ubuntu and other Debian-based distributions` method)
javac version: openjdk 17.0.8.1
:::info
The javac version does not matter here.
:::
java version: openjdk 17.0.8.1
:::info
The default Ubuntu 22.04 Java version (openjdk 11.0.20.1) also works.
:::
scala version: 2.11.12
::: info
The project works even if Scala isn't installed separately; installing it doesn't change anything.
:::
verilator: 4.038
:::info
If Verilator isn't installed, every test case that requires file access (\*.asmbin) fails with a Java "not a relative path" error when running `sbt test`.
:::
gtkwave: 3.3.104 (only used in evaluation)
## About <a href="https://hackmd.io/@sysprog/r1mlr3I7p#Hello-World-in-Chisel">Hello World in Chisel</a>
```scala=
class Hello extends Module {
val io = IO(new Bundle {
val led = Output(UInt(1.W))
})
val CNT_MAX = (50000000 / 2 - 1).U;
val cntReg = RegInit(0.U(32.W))
val blkReg = RegInit(0.U(1.W))
cntReg := cntReg + 1.U
when(cntReg === CNT_MAX) {
cntReg := 0.U
blkReg := ~blkReg
}
io.led := blkReg
}
```
### Explanation
The module implements a blinking LED. The counter increments on every tick of the internal clock, and when it reaches the threshold `CNT_MAX`, the LED signal inverts. This is a sequential circuit, so in Verilog the whole block would sit inside a clock-triggered `always` block. `cntReg` and `blkReg` behave like registers, while the other values behave like wires.<br>
It looks roughly like this in Verilog:
```verilog
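`define CNT_MAX (50000000 / 2 - 1)
module Hello(input clock, input reset, output led);
    reg [31:0] cntReg;
    reg        blkReg;
    // io.led := blkReg
    assign led = blkReg;
    // Registers update on the clock edge; reset plays the role of RegInit.
    always @(posedge clock) begin
        if (reset) begin
            cntReg <= 32'd0;
            blkReg <= 1'b0;
        end else if (cntReg == `CNT_MAX) begin
            cntReg <= 32'd0;
            blkReg <= ~blkReg;
        end else begin
            cntReg <= cntReg + 32'd1;
        end
    end
endmodule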
```
### enhancement
Conditional blocks such as `when` are often better expressed as explicit multiplexers in HDL design. In this case, the `when` block can be replaced with a mux.
```scala=
class Hello extends Module {
val io = IO(new Bundle {
val led = Output(UInt(1.W))
})
val CNT_MAX = (50000000 / 2 - 1).U;
val cntReg = RegInit(0.U(32.W))
val blkReg = RegInit(0.U(1.W))
val invsig = (cntReg === CNT_MAX)
cntReg := Mux(invsig, 0.U, cntReg + 1.U)
blkReg := blkReg ^ invsig
io.led := blkReg
}
```
`invsig` marks the trigger point and behaves like a wire. When `invsig` is high, `cntReg` becomes zero instead of `cntReg + 1`, and `blkReg` inverts.
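To make the equivalence of the two styles concrete, here is a small software model in C++ (not Chisel). It is only a sketch: `HelloState`, `step_when`, `step_mux` and `models_agree` are hypothetical names, and the threshold is passed in so that a small value can be used instead of the real `CNT_MAX`.

```cpp
#include <cassert>
#include <cstdint>

// Software model of the two registers; not generated from the Chisel code.
struct HelloState {
    uint32_t cnt = 0;  // models cntReg
    uint32_t blk = 0;  // models blkReg (1 bit)
};

// "when" style: default increment, overridden when the threshold is hit.
HelloState step_when(HelloState s, uint32_t cnt_max) {
    HelloState next = s;
    next.cnt = s.cnt + 1;
    if (s.cnt == cnt_max) {
        next.cnt = 0;
        next.blk = ~s.blk & 1u;
    }
    return next;
}

// Mux style: each register is driven by exactly one expression.
HelloState step_mux(HelloState s, uint32_t cnt_max) {
    const uint32_t invsig = (s.cnt == cnt_max) ? 1u : 0u;
    HelloState next;
    next.cnt = invsig ? 0u : s.cnt + 1;
    next.blk = s.blk ^ invsig;
    return next;
}

// Runs both models side by side; returns true if they never diverge.
bool models_agree(uint32_t cnt_max, int cycles) {
    assert(cycles >= 0);
    HelloState a, b;
    for (int i = 0; i < cycles; ++i) {
        a = step_when(a, cnt_max);
        b = step_mux(b, cnt_max);
        if (a.cnt != b.cnt || a.blk != b.blk) return false;
    }
    return true;
}
```

Calling `models_agree(3, 100)` steps both versions for 100 cycles with a threshold of 3 and reports whether the counter and the LED bit stayed identical.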
## Finish mycpu
### add lines
Not much needs to be added, only...
* 4 lines in InstructionFetch.scala
* 2 lines in InstructionDecode.scala
* 7 lines in CPU.scala
* 11 lines in Execute.scala
The heavy lifting, namely the memory and register access modules, has already been done for us; all we need to do is connect the modules together.
### mycpu evaluation (gtkwave in test)
#### InstructionFetch
<a href="https://github.com/kc71486/Computer-Architecture/blob/main/hw3/gtkimg/hamming_if.png"><img src="https://github.com/kc71486/Computer-Architecture/raw/main/hw3/gtkimg/hamming_if.png" width="100%" title="click to open the link" alt="InstructionFetch gtkwave"></a><br>
For instruction fetch, the next address is either the jump target or the current address + 4, depending on `jump_flag_id`.
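The selection can be modelled in C++ as a 2-to-1 mux. This is a sketch rather than the Chisel code from InstructionFetch.scala: `next_pc` and `jump_target` are hypothetical names, and only `jump_flag_id` comes from the waveform.

```cpp
#include <cassert>
#include <cstdint>

// Next-PC selection: take the jump target when the flag is set,
// otherwise fall through to the next sequential instruction.
uint32_t next_pc(uint32_t pc, bool jump_flag_id, uint32_t jump_target) {
    assert(pc % 4 == 0);  // RV32I instructions (no C extension) are 4-byte aligned
    return jump_flag_id ? jump_target : pc + 4;
}
```

For example, `next_pc(0x1000, false, 0x2000)` gives `0x1004`, while the same call with the flag set gives `0x2000`.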
#### InstructionDecode
<a href="https://github.com/kc71486/Computer-Architecture/blob/main/hw3/gtkimg/hamming_id.png"><img src="https://github.com/kc71486/Computer-Architecture/raw/main/hw3/gtkimg/hamming_id.png" width="100%" title="click to open the link" alt="InstructionDecode gtkwave"></a><br>
For instruction decode, a `Load` sets memory_read_enable=1 and a `Store` sets memory_write_enable=1.
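As a rough C++ illustration of that rule (not the code in InstructionDecode.scala), the two enables depend only on the major opcode in bits [6:0]. `MemControl` and `decode_mem_control` are hypothetical names; the opcode values are the standard RV32I ones.

```cpp
#include <cstdint>

// RV32I major opcodes (bits [6:0] of the instruction).
constexpr uint32_t OPCODE_LOAD  = 0x03;  // lb, lh, lw, lbu, lhu
constexpr uint32_t OPCODE_STORE = 0x23;  // sb, sh, sw

struct MemControl {
    bool memory_read_enable;
    bool memory_write_enable;
};

constexpr MemControl decode_mem_control(uint32_t instruction) {
    return MemControl{(instruction & 0x7f) == OPCODE_LOAD,
                      (instruction & 0x7f) == OPCODE_STORE};
}

// 0x0007a703 is `lw a4,0(a5)`, taken from the disassembly in this report.
static_assert(decode_mem_control(0x0007a703).memory_read_enable, "lw reads memory");
static_assert(!decode_mem_control(0x0007a703).memory_write_enable, "lw does not write");
```

Any other instruction, such as the `lui` at 0x000027b7, leaves both enables at 0.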
#### CPU & Execute
<a href="https://github.com/kc71486/Computer-Architecture/blob/main/hw3/gtkimg/hamming_ex.png"><img src="https://github.com/kc71486/Computer-Architecture/raw/main/hw3/gtkimg/hamming_ex.png" width="100%" title="click to open the link" alt="Execute gtkwave"></a><br>
For CPU and execute, two control signals determine the ALU inputs: aluop1_source=1 makes the ALU take pc as its first operand, while aluop2_source=1 makes it take the immediate as its second operand.<br>For example, around the 2171ps mark, 0000139C+FFFFFFB8=00001354.
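The operand selection and the addition from the waveform can be reproduced with a short C++ sketch. `alu_add` is a hypothetical helper, and treating the example as pc + immediate (both selects at 1) is an assumption made for illustration.

```cpp
#include <cstdint>

// Operand selection in front of the ALU, following the text:
// aluop1_source = 1 selects the PC (otherwise rs1),
// aluop2_source = 1 selects the immediate (otherwise rs2).
constexpr uint32_t alu_add(bool aluop1_source, bool aluop2_source,
                           uint32_t rs1, uint32_t rs2,
                           uint32_t pc, uint32_t imm) {
    return (aluop1_source ? pc : rs1) + (aluop2_source ? imm : rs2);
}

// 0xFFFFFFB8 is -72 in two's complement, so the unsigned addition wraps:
// 0x0000139C + 0xFFFFFFB8 = 0x00001354 (modulo 2^32).
static_assert(alu_add(true, true, 0, 0, 0x0000139C, 0xFFFFFFB8) == 0x00001354,
              "matches the value seen around the 2171ps mark");
```

Because the sum wraps modulo 2^32, adding a sign-extended negative immediate acts as a subtraction, which is how backward branches and jumps reach lower addresses.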
## Adapt assignment2
### modify c code
I removed all print and CSR functions, and placed the called functions into the same file.
### modify assembly code
I removed all print and CSR functions, and placed the called functions into the same file. Additionally, I removed some unused variables from the data segment.
:::spoiler code
```
_start:
_sinit:
la sp, 0xffc
call main
_sexit:
li t0, 0xfffc # halt address
li t1, 0xbabecafe
sw t1, 0(t0)
dead_loop:
j dead_loop # infinite loop
...
add sp, sp, -8
sw ra, 4(sp)
sw s0, 0(sp)
li s0, 0x0004
...
call HammingDistance
sw a0, 0(s0)
...
```
:::
### add testbench
I added the test `HomeWorkTest` to check whether the result is as expected.
:::spoiler code
```scala=117
class HomeWorkTest extends AnyFlatSpec with ChiselScalatestTester {
behavior.of("Single Cycle CPU")
it should "execute hamming code calculation" in {
test(new TestTopModule("homework.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
for (i <- 1 to 50000) {
c.clock.step()
c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
}
c.io.mem_debug_read_address.poke(4.U) // #1
c.clock.step()
c.io.mem_debug_read_data.expect(24.U)
c.io.mem_debug_read_address.poke(8.U) // #2
c.clock.step()
c.io.mem_debug_read_data.expect(60.U)
c.io.mem_debug_read_address.poke(12.U) // #3
c.clock.step()
c.io.mem_debug_read_data.expect(0.U)
}
}
}
```
:::
:::info
Use a high clock step count; if the count is too low, the program may not have finished and the results might not be ready when they are checked.
:::
### problem faced (c code)
> <span style="color:green">HomeWorkTest:</span>
> [info] <span style="color:green">Single Cycle CPU</span>
> [info] <span style="color:red">- should execute hamming code calculation *** FAILED ***</span>
> [info] <span style="color:red"> io_mem_debug_read_data=27 (0x1b) did not equal expected=24 (0x18) (lines in CPUTest.scala: 133, 120) (CPUTest.scala:127)</span>
### add more testbench
To address the issue, my first thought was to add more test cases to see whether my datapath was wrong.
The test case is [here](https://github.com/kc71486/Computer-Architecture/blob/main/hw3/csrc/tb.S). I copied it from another lecture's testbench and modified it.<br>
> [info] <span style="color:green">Testbench:</span>
> [info] <span style="color:green">Single Cycle CPU Testbench</span>
> [info] <span style="color:green">- should execute full testbench</span>
<!---->
Unfortunately, there weren't any errors. This is actually bad news, because it means the issue is much trickier than I thought.
### root cause
I tried passing the arguments manually as literals, and all of those calls seem correct, even the ones with exactly the same input.
:::spoiler such as:
```clike
static uint32_t t1_x0 = 0x00100000; // low
static uint32_t t1_y0 = 0x00130000; // high
static uint32_t t1_x1 = 0x000FFFFF;
static uint32_t t1_y1 = 0x00000000;
...
*((volatile int32_t *) 16) = HammingDistance(0x00100000, 0x00130000, 0x000FFFFF, 0x00000000); //24, correct
*((volatile int32_t *) 4) = HammingDistance(t1_x0, t1_y0, t1_x1, t1_y1); //24, output 27
```
:::
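For reference, the expected values in the testbench (24, 60, 0) can be reproduced on the host with a short C++ sketch. This is not the assignment's `HammingDistance` implementation, which is not shown here; `hamming_distance` is a hypothetical stand-in, and it assumes the arguments are the (low, high) halves of the first 64-bit value followed by the (low, high) halves of the second.

```cpp
#include <cstdint>

// Hamming distance between two 64-bit values passed as 32-bit halves.
// Assumed argument order: (low0, high0, low1, high1).
constexpr int hamming_distance(uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1) {
    const uint64_t a = (static_cast<uint64_t>(y0) << 32) | x0;
    const uint64_t b = (static_cast<uint64_t>(y1) << 32) | x1;
    uint64_t diff = a ^ b;  // bits that differ between the two values
    int count = 0;
    while (diff != 0) {
        diff &= diff - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// The three test vectors used in this report.
static_assert(hamming_distance(0x00100000, 0x00130000, 0x000FFFFF, 0x00000000) == 24, "test 1");
static_assert(hamming_distance(0x00000001, 0x00000002, 0x7FFFFFFF, 0xFFFFFFFE) == 60, "test 2");
static_assert(hamming_distance(0x00000002, 0x8370228F, 0x00000002, 0x8370228F) == 0, "test 3");
```

A host-side reference like this makes it easier to tell a wrong algorithm apart from wrong input data: here the algorithm yields 24 for the first vector, so an output of 27 points at the inputs the CPU actually loaded.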
<br>Then I tried putting them on the stack instead of in global variables, and that also works.
Inspecting the data segment, I found something strange.<br>
::: spoiler original c code:
```clike=
static uint32_t t1_x0 = 0x00100000; // low
static uint32_t t1_y0 = 0x00130000; // high
static uint32_t t1_x1 = 0x000FFFFF;
static uint32_t t1_y1 = 0x00000000;
static uint32_t t2_x0 = 0x00000001;
static uint32_t t2_y0 = 0x00000002;
static uint32_t t2_x1 = 0x7FFFFFFF;
static uint32_t t2_y1 = 0xFFFFFFFE;
static uint32_t t3_x0 = 0x00000002;
static uint32_t t3_y0 = 0x8370228F;
static uint32_t t3_x1 = 0x00000002;
static uint32_t t3_y1 = 0x8370228F;
```
:::
<br>.data dump
```
homework.elf: file format elf32-littleriscv
Contents of section .data:
2000 00001000 00001300 ffff0f00 01000000
2010 02000000 ffffff7f feffffff 02000000
2020 8f227083 02000000 8f227083
```
There are only 11 words here, but I declared 12 variables in total, so something must be wrong.
Inspecting the text segment, I think I found the reason.
```
1418: 000027b7 lui a5,0x2
141c: 0007a703 lw a4,0(a5) # 2000 <t1_x0>
1420: 000027b7 lui a5,0x2
1424: 0047a583 lw a1,4(a5) # 2004 <t1_y0>
1428: 000027b7 lui a5,0x2
142c: 0087a603 lw a2,8(a5) # 2008 <t1_x1>
1430: 000027b7 lui a5,0x2
1434: 02c7a783 lw a5,44(a5) # 202c <t1_y1>
```
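Address 202c lies just past the end of the dumped `.data` section, which ends at 202b. Since `t1_y1` is initialized to zero, the compiler presumably placed it in a zero-initialized section (such as `.bss` or `.sbss`) instead of `.data`. That region is not part of the dumped data and nothing in my startup code clears it, so the load might just return a junk value.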
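The address arithmetic behind this observation can be checked with a few lines of C++. `Section` and `contains` are hypothetical helpers; the start address and word count are read off the `.data` dump above.

```cpp
#include <cstdint>

// Half-open address range [start, start + size) covered by a section.
struct Section {
    uint32_t start;
    uint32_t size;
};

constexpr bool contains(Section s, uint32_t addr) {
    return addr >= s.start && addr - s.start < s.size;
}

// The dumped .data section: 11 words of 4 bytes starting at 0x2000.
constexpr Section data_section{0x2000, 11 * 4};

static_assert(contains(data_section, 0x2008), "t1_x1 is inside .data");
static_assert(!contains(data_section, 0x202c), "t1_y1 lies just past the end");
```

Eleven words cover 0x2000 through 0x202b, so the `lw` from offset 44 reads the first byte range outside the section.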
### result
> [info] <span style="color:blue">Run completed in 1 minute, 13 seconds.</span>
> [info] <span style="color:blue">Total number of tests run: 12</span>
> [info] <span style="color:blue">Suites: completed 10, aborted 0</span>
> [info] <span style="color:blue">Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0</span>
> [info] <span style="color:green">All tests passed.</span>
> [<span style="color:green">success</span>] Total time: 91 s (01:31), completed Nov 21, 2023, 1:46:37 PM
## Verilator analysis
### modify sim_main.cpp and run verilator
I changed the memory argument to accept hex input and made the halt output more verbose.
```cpp
if (auto it = std::find(args.begin(), args.end(), "-memory");
it != args.end()) {
memory_words = parse_number(*(it + 1));
...
if (halt_address) {
if (memory->read(halt_address) == 0xBABECAFE) {
std::cout << "halted after " << main_time << " time\n";
break;
}
}
```
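The body of `parse_number` is not shown above. One possible shape, given here only as a sketch and not necessarily what my sim_main.cpp does, relies on `std::stoul` with base 0, which picks the radix from the prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical implementation: base 0 lets std::stoul choose the radix
// ("0x" prefix -> hexadecimal, leading "0" -> octal, otherwise decimal).
uint32_t parse_number(const std::string &text) {
    std::size_t consumed = 0;
    const unsigned long value = std::stoul(text, &consumed, 0);
    assert(consumed == text.size());  // reject trailing garbage such as "12abc"
    return static_cast<uint32_t>(value);
}
```

With this, both `-memory 0x8000` and `-memory 32768` would yield the same word count.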
I then ran the Verilator-built simulator with the following arguments.
```bash
arguments="-memory 0x8000 -instruction src/main/resources/homework.asmbin -time 50000 -signature 0x0 0x80 mem.txt -halt 0xfffc -vcd dump.vcd"
verilog/verilator/obj_dir/VTop $arguments
```
### result
## Small bonus
### Implement mmio
We define the address range 0x20000000 ~ 0x3FFFFFFF as VRAM (as in <a href="https://github.com/sysprog21/ca2023-lab3/blob/main/csrc/hello.c">hello.c</a>). Any write to this range is redirected to a separate, dedicated memory module. VRAM is currently write-only; reading through vram_bundle does nothing.<br>
We modify the address and write enable signals so that each write is routed to the right module.
:::spoiler CPU
```scala
io.memory_bundle.address := Cat(
0.U(Parameters.SlaveDeviceCountBits.W),
mem.io.memory_bundle.address(Parameters.AddrBits - 1 - Parameters.SlaveDeviceCountBits, 0)
)
io.memory_bundle.write_enable :=
mem.io.memory_bundle.write_enable &&
mem.io.memory_bundle.address(Parameters.AddrBits - 1, Parameters.AddrBits - Parameters.SlaveDeviceCountBits) === SlaveType.UserMemory
io.vram_bundle.address := Cat(
0.U(Parameters.SlaveDeviceCountBits.W),
mem.io.memory_bundle.address(Parameters.AddrBits - 1 - Parameters.SlaveDeviceCountBits, 0)
)
io.vram_bundle.write_enable :=
mem.io.memory_bundle.write_enable &&
mem.io.memory_bundle.address(Parameters.AddrBits - 1, Parameters.AddrBits - Parameters.SlaveDeviceCountBits) === SlaveType.VisualMemory
```
:::
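The address split used by this routing can be illustrated with a C++ sketch. The parameter values are assumptions: 32-bit addresses with 3 device-select bits, which is consistent with the 0x20000000 ~ 0x3FFFFFFF window, and device numbers 0 and 1 for `SlaveType.UserMemory` and `SlaveType.VisualMemory`.

```cpp
#include <cstdint>

// Assumed values, chosen to match the VRAM window described above.
constexpr unsigned AddrBits = 32;
constexpr unsigned SlaveDeviceCountBits = 3;
constexpr uint32_t UserMemory = 0;    // assumed encoding of SlaveType.UserMemory
constexpr uint32_t VisualMemory = 1;  // assumed encoding of SlaveType.VisualMemory

// Top bits: which device the access targets.
constexpr uint32_t device_select(uint32_t addr) {
    return addr >> (AddrBits - SlaveDeviceCountBits);
}

// Remaining bits: the address forwarded to the selected device.
constexpr uint32_t device_offset(uint32_t addr) {
    return addr & ((1u << (AddrBits - SlaveDeviceCountBits)) - 1);
}

static_assert(device_select(0x00002000) == UserMemory, "ordinary data access");
static_assert(device_select(0x20000000) == VisualMemory, "first VRAM byte");
static_assert(device_select(0x3FFFFFFF) == VisualMemory, "last VRAM byte");
static_assert(device_offset(0x20000010) == 0x10, "offset inside VRAM");
```

In the Chisel code the same two pieces appear as the upper bit slice compared against `SlaveType` and the lower slice padded with zeros by `Cat`.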
<br> I also added some output ports, so that the simulator can actually act on these signals.
:::spoiler Top
```scala
class Top extends Module {
...
io.memory_bundle <> cpu.io.memory_bundle
io.vram_bundle <> cpu.io.vram_bundle
io.kernel_bundle <> cpu.io.kernel_bundle
io.instruction_address := cpu.io.instruction_address
cpu.io.instruction := io.instruction
cpu.io.instruction_valid := io.instruction_valid
io.ecall_en := cpu.io.ecall_en
io.ecall_a7 := cpu.io.ecall_a7
io.ecall_a0 := cpu.io.ecall_a0
io.ecall_a1 := cpu.io.ecall_a1
io.ecall_a2 := cpu.io.ecall_a2
io.ecall_a3 := cpu.io.ecall_a3
io.ecall_a4 := cpu.io.ecall_a4
io.ecall_a5 := cpu.io.ecall_a5
}
```
:::
:::spoiler CPUBundle
```scala
class CPUBundle extends Bundle {
val instruction_address = Output(UInt(Parameters.AddrWidth))
val instruction = Input(UInt(Parameters.DataWidth))
val memory_bundle = Flipped(new RAMBundle)
val vram_bundle = Flipped(new RAMBundle)
val kernel_bundle = Flipped(new RAMBundle)
val instruction_valid = Input(Bool())
val deviceSelect = Output(UInt(Parameters.SlaveDeviceCountBits.W))
val debug_read_address = Input(UInt(Parameters.PhysicalRegisterAddrWidth))
val debug_read_data = Output(UInt(Parameters.DataWidth))
val ecall_en = Output(Bool())
val ecall_a7 = Output(UInt(Parameters.DataWidth))
val ecall_a0 = Output(UInt(Parameters.DataWidth))
val ecall_a1 = Output(UInt(Parameters.DataWidth))
val ecall_a2 = Output(UInt(Parameters.DataWidth))
val ecall_a3 = Output(UInt(Parameters.DataWidth))
val ecall_a4 = Output(UInt(Parameters.DataWidth))
val ecall_a5 = Output(UInt(Parameters.DataWidth))
}
```
:::
<br>To actually print to the console, I added a VRAM class and implemented a flush method.
:::spoiler sim_main
```cpp
class VRAM {
...
void write(size_t address, uint32_t value, bool write_strobe[4]) {
...
vram_dirty = true;
}
void flush() {
if (! vram_dirty) {
return;
}
vram_dirty = false;
std::cout << "\x1b[A";
for(uint32_t i=0; i<ROWS; i++) {
std::cout << "\x1b[A\x1b[2K";
}
std::cout << std::flush;
for(uint32_t i=0; i<ROWS; i++) {
for(uint32_t j=0; j<COLS; j++) {
uint32_t val = vram[i * COLS + j];
char ch0 = (char) ((val >> 0) & 0xff);
char ch1 = (char) ((val >> 8) & 0xff);
char ch2 = (char) ((val >> 16) & 0xff);
char ch3 = (char) ((val >> 24) & 0xff);
std::cout << ch0 << ch1 << ch2 << ch3;
}
std::cout << "\n";
}
std::cout << "\x1b[B" << std::flush;
}
};
```
:::
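Each VRAM word holds four characters, and `flush()` prints them starting from the lowest byte. The sketch below isolates that unpacking step; `unpack_word` is a hypothetical helper, not part of the class above.

```cpp
#include <array>
#include <cstdint>

// Splits one 32-bit VRAM word into four characters, lowest byte first,
// which is the order flush() prints them in.
constexpr std::array<char, 4> unpack_word(uint32_t val) {
    return {static_cast<char>(val & 0xff),
            static_cast<char>((val >> 8) & 0xff),
            static_cast<char>((val >> 16) & 0xff),
            static_cast<char>((val >> 24) & 0xff)};
}

// 'H' = 0x48, 'e' = 0x65, 'l' = 0x6c: the word 0x6c6c6548 prints as "Hell".
constexpr auto hell_word = unpack_word(0x6c6c6548);
static_assert(hell_word[0] == 'H' && hell_word[1] == 'e', "lowest bytes come first");
static_assert(hell_word[2] == 'l' && hell_word[3] == 'l', "highest bytes come last");
```

This matches a little-endian CPU storing a string byte by byte: the first character ends up in the least significant byte of the word.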
Whenever we write something in VRAM, the terminal will refresh the area according to the data in VRAM.
<img src="https://hackmd.io/_uploads/B1cZSHoB6.gif" width=100% alt="hello.asmbin"><br>
### Implement ECALL (a7=64)
When `ecall` is executed, the CPU asserts the io_ecall_en signal,
```scala
io.do_ecall := io.instruction === InstructionsEnv.ecall
```
which is caught by the simulator.
```cpp
if (top->io_ecall_en) {
if(ecall_timeout == 0) {
ihandler->handle(top->io_ecall_a7, top->io_ecall_a0, top->io_ecall_a1, top->io_ecall_a2,
top->io_ecall_a3, top->io_ecall_a4, top->io_ecall_a5);
ecall_timeout = 3;
}
else {
ecall_timeout --;
}
}
```
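The `ecall_timeout` counter in the snippet above keeps one `ecall` from being handled several times while the signal stays high. Its behaviour can be modelled on a per-cycle trace of io_ecall_en; `count_handled` and `demo` are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <vector>

// Counts how often the handler would fire for a per-cycle trace of
// io_ecall_en: fire when the counter is 0, then skip the next 3 cycles
// in which the signal is still high.
int count_handled(const std::vector<bool> &ecall_en_trace) {
    int ecall_timeout = 0;
    int handled = 0;
    for (bool ecall_en : ecall_en_trace) {
        if (!ecall_en) continue;  // the counter only moves while the signal is high
        if (ecall_timeout == 0) {
            ++handled;
            ecall_timeout = 3;
        } else {
            --ecall_timeout;
        }
    }
    return handled;
}

// A signal that stays high for four cycles is handled once;
// a fifth high cycle would trigger the handler a second time.
void demo() {
    assert(count_handled({true, true, true, true}) == 1);
    assert(count_handled({true, true, true, true, true}) == 2);
}
```

Note that the counter does not decrement while the signal is low, so the window is measured in cycles with io_ecall_en high rather than in wall-clock cycles.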
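The call is then handled by `InterruptHandler`. Type 64 is the number of the `write` system call in the RISC-V Linux ABI, so `a1` is treated as the buffer address and `a2` as the length.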
```cpp
class InterruptHandler {
private:
Memory &memory;
VRAM &vram;
Printer printer;
public:
InterruptHandler(Memory &memory, VRAM &vram) : memory(memory), vram(vram), printer(vram) {}
uintptr_t memoryTranslate(uint32_t src) {
if(src < 0x20000000) {
return (uintptr_t) (((char *) memory.image()) + src);
}
else if(src < 0x40000000) {
return (uintptr_t) (((char *) vram.image()) + src - 0x20000000);
}
else {
return 0;
}
}
void handle(uint32_t type, uint32_t arg0, uint32_t arg1, uint32_t arg2, uint32_t arg3, uint32_t arg4, uint32_t arg5) {
if(type == 64) {
printer.print_string((char *)memoryTranslate(arg1), arg2);
}
}
};
```
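The mapping done by `memoryTranslate` can be checked in isolation if it returns a region and an offset instead of a host pointer. This is a sketch with hypothetical names (`Region`, `Translated`, `translate`), not a replacement for the method above.

```cpp
#include <cstdint>

enum class Region { Memory, Vram, Unmapped };

struct Translated {
    Region region;
    uint32_t offset;  // byte offset inside the selected backing store
};

// Same address split as memoryTranslate, expressed as (region, offset).
constexpr Translated translate(uint32_t src) {
    if (src < 0x20000000) return {Region::Memory, src};
    if (src < 0x40000000) return {Region::Vram, src - 0x20000000};
    return {Region::Unmapped, 0};
}

static_assert(translate(0x00002000).region == Region::Memory, "data lives in main memory");
static_assert(translate(0x20000004).offset == 4, "VRAM offsets start at 0x20000000");
static_assert(translate(0x40000000).region == Region::Unmapped, "outside both windows");
```

An explicit `Unmapped` result also makes it easy for a caller to skip printing instead of dereferencing the null pointer that `memoryTranslate` returns for addresses outside both windows.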
<img src="https://hackmd.io/_uploads/BJ1GrHjHa.gif" width=100% alt="homework.asmbin"><br>
## Reference
* [lab3](https://github.com/sysprog21/ca2023-lab3)