contributed by < eeeXun >
On Arch Linux
sudo pacman -S sbt jdk17-openjdk verilator gtkwave
The current versions of the packages:
pacman -Q | grep "sbt\|jdk\|verilator\|gtkwave"
gtkwave 3.3.117-1
jdk17-openjdk 17.0.9.u8-2
jre17-openjdk 17.0.9.u8-2
jre17-openjdk-headless 17.0.9.u8-2
sbt 1:1.8.3-1
verilator 5.018-1
At the beginning, I had no idea where to start, so I looked into the tests that failed.
In InstructionFetchTest, I see that when jump_flag_id is turned on, instruction_address should be entry, and the value of entry is passed into instruction_address_id.
If jump_flag_id is turned off, instruction_address should be cur, where cur is prev + 4.
At this point, I could make InstructionFetchTest pass.
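The behaviour the test checks can be summarised with a small software model. This is only a C++ sketch of the next-PC choice, not the Chisel code from the lab; the function name next_pc and its parameter names are made up for illustration.

```cpp
#include <cstdint>

// Software model of the next-PC choice observed in InstructionFetchTest.
// `next_pc`, `jump_flag` and `jump_target` are illustrative names, not lab signals.
std::uint32_t next_pc(std::uint32_t prev, bool jump_flag, std::uint32_t jump_target) {
    // Jump taken: the PC becomes the target (`entry` in the test).
    // Otherwise: the PC advances by one 32-bit instruction (`prev + 4`).
    return jump_flag ? jump_target : prev + 4u;
}
```

For example, next_pc(0x1000, false, 0x2000) gives 0x1004, while next_pc(0x1000, true, 0x2000) gives 0x2000.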
But when it comes to InstructionDecoderTest, I see that the test only examines four values: ex_aluop1_source, ex_aluop2_source, regs_reg1_read_address and regs_reg2_read_address. These four signals are already assigned in InstructionDecode, and I think they are all assigned correctly.
The error message shows
[info] - should produce correct control signal *** FAILED ***
[info] : io.memory_write_enable <= VOID
[info] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/riscv/core/InstructionDecode.scala 127:14] : [module InstructionDecode] Reference io is not fully initialized.
memory_write_enable is not initialized. But I still had no idea why the test would fail. In the bootcamp, I only saw failures occur when a value did not match the value given to the expect function.
So I looked at Lab3 again and found that I had missed this diagram, which shows all the signals required for this homework. After following this diagram, all the failing tests were quickly resolved.
In the assembly code of assignment 2, I removed all the instructions related to ecall. And in exit, I changed it to an infinite jump:
exit:
j exit
There is only one test case, and its result is stored in register s3. So in my test, I only check whether the value of register s3 is correct.
import chisel3._
import chiseltest._
import org.scalatest.flatspec.AnyFlatSpec

// TestTopModule and TestAnnotations come from the lab's existing test sources.
class HW2Test extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "multiply two bfloat16" in {
    test(new TestTopModule("hw2.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 50) {
        c.clock.step(1000)
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      c.io.regs_debug_read_address.poke(19.U) // s3 is x19
      c.io.regs_debug_read_data.expect(0x440a0000.U)
    }
  }
}
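To sanity-check the expected value 0x440a0000, here is a rough host-side reference in C++. It assumes the two inputs are the words 0x4141f9a7 and 0x423645a2 stored under test0, and that rounding to bfloat16 is done by adding 0x8000 and dropping the low 16 bits (which is what the assembly appears to do). The helper names are hypothetical, and NaN and infinity are not handled.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret IEEE-754 single-precision bits as a float and back.
static float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

static std::uint32_t float_to_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Round a finite float32 bit pattern to bfloat16 precision by adding half of
// the dropped range (0x8000) and clearing the low 16 bits.
// Assumption: NaN and infinity are out of scope for this sketch.
std::uint32_t to_bf16(std::uint32_t bits) {
    return (bits + 0x8000u) & 0xFFFF0000u;
}

// Multiply two float32 inputs at bfloat16 precision (hypothetical helper).
std::uint32_t bf16_mul(std::uint32_t a_bits, std::uint32_t b_bits) {
    float product = bits_to_float(to_bf16(a_bits)) * bits_to_float(to_bf16(b_bits));
    return to_bf16(float_to_bits(product));
}
```

With these inputs, the operands round to 0x41420000 (12.125) and 0x42360000 (45.5); their product 551.6875 rounds to 0x440a0000, which matches the value the test expects in s3.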
I load the vcd file of assignment 2 and first compare clock with instruction_address and instruction. I observe that it takes a long time before the first instruction starts.
Then I compare it with memory and rom_loader.
The instruction start time is almost identical to the time when rom_loader stops changing. Then I look at the objdump result from assignment 2 and compare it with what I observed in GTKWave.
hw2.o: file format elf32-littleriscv
Disassembly of section .text:
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
8: 00060613 mv a2,a2
c: 00062803 lw a6,0(a2) # 4 <_start+0x4>
10: 028000ef jal 38 <f32_b16_p1>
14: 000807b3 add a5,a6,zero
18: 00462803 lw a6,4(a2)
1c: 01c000ef jal 38 <f32_b16_p1>
20: 00080733 add a4,a6,zero
24: 0a8000ef jal cc <encoder>
28: 00098cb3 add s9,s3,zero
2c: 0b8000ef jal e4 <decoder>
30: 0d0000ef jal 100 <Multi_bfloat>
34: 25c0006f j 290 <exit>
00000038 <f32_b16_p1>:
38: 01012023 sw a6,0(sp)
3c: 000802b3 add t0,a6,zero
40: 7f800fb7 lui t6,0x7f800
44: 01f2f333 and t1,t0,t6
48: 00800fb7 lui t6,0x800
4c: ffff8f93 add t6,t6,-1 # 7fffff <str+0x7ffd53>
50: 01f2f3b3 and t2,t0,t6
54: 7f800fb7 lui t6,0x7f800
58: 07f30463 beq t1,t6,c0 <inf_or_zero>
5c: 00736e33 or t3,t1,t2
60: 060e0063 beqz t3,c0 <inf_or_zero>
64: 00800fb7 lui t6,0x800
68: 01f3e3b3 or t2,t2,t6
6c: 00008fb7 lui t6,0x8
70: 01f383b3 add t2,t2,t6
74: 0183df13 srl t5,t2,0x18
78: 020f0063 beqz t5,98 <no_overflow>
7c: 00800fb7 lui t6,0x800
80: 01f30333 add t1,t1,t6
84: 0113d393 srl t2,t2,0x11
88: 07f00f93 li t6,127
8c: 01f3f3b3 and t2,t2,t6
90: 01039393 sll t2,t2,0x10
94: 0140006f j a8 <f32_b16_p2>
00000098 <no_overflow>:
98: 0103d393 srl t2,t2,0x10
9c: 07f00f93 li t6,127
a0: 01f3f3b3 and t2,t2,t6
a4: 01039393 sll t2,t2,0x10
000000a8 <f32_b16_p2>:
a8: 01f2d293 srl t0,t0,0x1f
ac: 01f29293 sll t0,t0,0x1f
b0: 0062e2b3 or t0,t0,t1
b4: 0072e2b3 or t0,t0,t2
b8: 00028833 add a6,t0,zero
bc: 00008067 ret
000000c0 <inf_or_zero>:
c0: 01085813 srl a6,a6,0x10
c4: 01081813 sll a6,a6,0x10
c8: 00008067 ret
000000cc <encoder>:
cc: 000782b3 add t0,a5,zero
d0: 00070333 add t1,a4,zero
d4: 01035313 srl t1,t1,0x10
d8: 0062e2b3 or t0,t0,t1
dc: 000289b3 add s3,t0,zero
e0: 00008067 ret
000000e4 <decoder>:
e4: 000c82b3 add t0,s9,zero
e8: ffff0937 lui s2,0xffff0
ec: 0122f333 and t1,t0,s2
f0: 01029393 sll t2,t0,0x10
f4: 00030b33 add s6,t1,zero
f8: 00038ab3 add s5,t2,zero
fc: 00008067 ret
00000100 <Multi_bfloat>:
100: 000a82b3 add t0,s5,zero
104: 000b0333 add t1,s6,zero
108: 7f800fb7 lui t6,0x7f800
10c: 01f2fe33 and t3,t0,t6
110: 01f373b3 and t2,t1,t6
114: 007e0e33 add t3,t3,t2
118: 3f800fb7 lui t6,0x3f800
11c: 41fe0e33 sub t3,t3,t6
120: 0062c3b3 xor t2,t0,t1
124: 01f3d393 srl t2,t2,0x1f
128: 01f39393 sll t2,t2,0x1f
12c: 007e6e33 or t3,t3,t2
130: 00929293 sll t0,t0,0x9
134: 0092d293 srl t0,t0,0x9
138: 005e62b3 or t0,t3,t0
13c: 007f0fb7 lui t6,0x7f0
140: 01f2f3b3 and t2,t0,t6
144: 01f37e33 and t3,t1,t6
148: 00839393 sll t2,t2,0x8
14c: 80000fb7 lui t6,0x80000
150: 01f3e3b3 or t2,t2,t6
154: 0013d393 srl t2,t2,0x1
158: 008e1e13 sll t3,t3,0x8
15c: 01fe6e33 or t3,t3,t6
160: 001e5e13 srl t3,t3,0x1
164: 00000333 add t1,zero,zero
168: 80000fb7 lui t6,0x80000
16c: 001fdf93 srl t6,t6,0x1
170: 01f3feb3 and t4,t2,t6
174: 01d03433 snez s0,t4
178: 40800433 neg s0,s0
17c: 01c474b3 and s1,s0,t3
180: 00930333 add t1,t1,s1
184: 001e5e13 srl t3,t3,0x1
188: 001fdf93 srl t6,t6,0x1
18c: 01f3feb3 and t4,t2,t6
190: 01d03433 snez s0,t4
194: 40800433 neg s0,s0
198: 01c474b3 and s1,s0,t3
19c: 00930333 add t1,t1,s1
1a0: 001e5e13 srl t3,t3,0x1
1a4: 001fdf93 srl t6,t6,0x1
1a8: 01f3feb3 and t4,t2,t6
1ac: 01d03433 snez s0,t4
1b0: 40800433 neg s0,s0
1b4: 01c474b3 and s1,s0,t3
1b8: 00930333 add t1,t1,s1
1bc: 001e5e13 srl t3,t3,0x1
1c0: 001fdf93 srl t6,t6,0x1
1c4: 01f3feb3 and t4,t2,t6
1c8: 01d03433 snez s0,t4
1cc: 40800433 neg s0,s0
1d0: 01c474b3 and s1,s0,t3
1d4: 00930333 add t1,t1,s1
1d8: 001e5e13 srl t3,t3,0x1
1dc: 001fdf93 srl t6,t6,0x1
1e0: 01f3feb3 and t4,t2,t6
1e4: 01d03433 snez s0,t4
1e8: 40800433 neg s0,s0
1ec: 01c474b3 and s1,s0,t3
1f0: 00930333 add t1,t1,s1
1f4: 001e5e13 srl t3,t3,0x1
1f8: 001fdf93 srl t6,t6,0x1
1fc: 01f3feb3 and t4,t2,t6
200: 01d03433 snez s0,t4
204: 40800433 neg s0,s0
208: 01c474b3 and s1,s0,t3
20c: 00930333 add t1,t1,s1
210: 001e5e13 srl t3,t3,0x1
214: 001fdf93 srl t6,t6,0x1
218: 01f3feb3 and t4,t2,t6
21c: 01d03433 snez s0,t4
220: 40800433 neg s0,s0
224: 01c474b3 and s1,s0,t3
228: 00930333 add t1,t1,s1
22c: 001e5e13 srl t3,t3,0x1
230: 001fdf93 srl t6,t6,0x1
234: 01f3feb3 and t4,t2,t6
238: 01d03433 snez s0,t4
23c: 40800433 neg s0,s0
240: 01c474b3 and s1,s0,t3
244: 00930333 add t1,t1,s1
248: 001e5e13 srl t3,t3,0x1
24c: 80000fb7 lui t6,0x80000
250: 01f37eb3 and t4,t1,t6
254: 000e8a63 beqz t4,268 <not_overflow>
258: 00131313 sll t1,t1,0x1
25c: 00800fb7 lui t6,0x800
260: 01f282b3 add t0,t0,t6
264: 0080006f j 26c <Mult_end>
00000268 <not_overflow>:
268: 00231313 sll t1,t1,0x2
0000026c <Mult_end>:
26c: 01835313 srl t1,t1,0x18
270: 00130313 add t1,t1,1
274: 00135313 srl t1,t1,0x1
278: 01031313 sll t1,t1,0x10
27c: 0172d293 srl t0,t0,0x17
280: 01729293 sll t0,t0,0x17
284: 0062e2b3 or t0,t0,t1
288: 000289b3 add s3,t0,zero
28c: 00008067 ret
00000290 <exit>:
290: 0000006f j 290 <exit>
00000294 <test0>:
294: 4141f9a7 .word 0x4141f9a7
298: 423645a2 .word 0x423645a2
0000029c <test1>:
29c: 3fa66666 .word 0x3fa66666
2a0: 42c63333 .word 0x42c63333
000002a4 <test2>:
2a4: 43e43a5e .word 0x43e43a5e
2a8: 42b1999a .word 0x42b1999a
000002ac <str>:
2ac: 0000000a .word 0x0000000a
So this period should be the time taken to load the ELF file into memory!
After inst_fetch.io.instruction_read_data is loaded, it takes 3 CPU clock cycles until inst_fetch.io.instruction_address, which is the PC, is output. And when inst_fetch.io.instruction_address changes, it takes 1 CPU clock cycle to load inst_fetch.io.instruction_read_data.
These 3 CPU clock cycles plus 1 CPU clock cycle together correspond to one instruction fetch cycle.
InstructionDecode produces its output immediately when the input instruction is signaled, and it holds the state for 4 CPU cycles.
Unlike InstructionFetch, there is no clock inside InstructionDecode. I guess this is because there is no register inside InstructionDecode.
The register inside InstructionFetch:
val pc = RegInit(ProgramCounter.EntryAddress)
When Execute receives all of its input signals (instruction, instruction_address, reg1_data, reg2_data, immediate, aluop1_source and aluop2_source), it generates the outputs mem_alu_result, if_jump_flag and if_jump_address immediately, and it holds the state for 4 CPU cycles.
But something weird happens in some cases: the state is not held for 4 CPU cycles. Take the following auipc instruction for example:
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
mem_alu_result changes at the 3rd CPU cycle. I found out that this is because ex.io.instruction (which comes from inst_fetch.io.instruction) does not change at the same time as ex.io.instruction_address (which comes from inst_fetch.io.instruction_address).
In the following lw example, it takes 1 CPU cycle to load data from memory after the instruction is fetched, and the data is held for 3 CPU cycles.
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
8: 00060613 mv a2,a2
c: 00062803 lw a6,0(a2) # 4 <_start+0x4>
In the following auipc example, wb_io_regs_write_data comes from ex_io_mem_alu_result:
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
In the following lw example, it loads the word 0x4141f9a7. wb_io_regs_write_data arrives 1 CPU cycle after the instruction is loaded, because the data comes from mem_io_wb_memory_read_data:
00000000 <_start>:
0: 00100893 li a7,1
4: 00000617 auipc a2,0x0
8: 00060613 mv a2,a2
c: 00062803 lw a6,0(a2) # 4 <_start+0x4>
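The two cases above amount to a multiplexer in front of the register file. As a rough C++ model (the type and function names are hypothetical and this is not the lab's Chisel code), the selection and the auipc computation could look like the sketch below. It does not model the one-cycle memory latency seen in the waveform.

```cpp
#include <cstdint>

// Illustrative write-back source selector; the enumerator names are assumptions.
enum class WriteSource { AluResult, MemoryReadData };

// Choose what is written back to the register file.
std::uint32_t regs_write_data(WriteSource source,
                              std::uint32_t alu_result,
                              std::uint32_t memory_read_data) {
    return source == WriteSource::MemoryReadData ? memory_read_data : alu_result;
}

// auipc rd, imm adds the 20-bit immediate, shifted left by 12, to the PC.
std::uint32_t auipc_result(std::uint32_t pc, std::uint32_t imm20) {
    return pc + (imm20 << 12);
}
```

For the auipc at address 4 with immediate 0, the ALU result is 4 and that value is written back; for the lw, the memory word 0x4141f9a7 is selected instead.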
The vcd generated by Verilator is different from the vcd file generated during testing. There is no boot-up time: from the first CPU cycle, the CPU is already fetching, decoding and executing instructions. And the CPU cycle is 4 ps, which is different from the 2 ps in the vcd generated during testing.
The output inst_fetch_io_instruction_read_data is generated a quarter of a CPU cycle (1 ps) after inst_fetch_io_instruction_address is signaled. However, in the vcd generated during testing it takes 1 CPU cycle (2 ps).
The time interval between one instruction fetch and the next is 1 CPU cycle, which is 4 ps. This differs from the vcd generated during testing, where the interval is 4 CPU cycles (8 ps).
I implement ecall in this branch and test it with the assembly program that prints RISC-V\n.
Initially, I add the signals ecall_flag (turned on when an ecall instruction is decoded), ecall_a0, ecall_a1, ecall_a2 and ecall_a7 to CPUBundle.scala. ecall_a1, ecall_a2 and ecall_a7 carry the data of the corresponding registers.
In the Simulator.run function of verilog/verilator/sim_main.cpp, I check whether top->io_ecall_flag is true. If it is, I compare the code in top->io_ecall_a7 with the system call numbers. If the code is write, I get the data through the memory->read function, with starting address top->io_ecall_a1 and length top->io_ecall_a2.
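The check described above could be sketched roughly as follows. This is not the code from the branch: the signature of memory->read is not shown in these notes, so a word-read callback stands in for it, the helper names are hypothetical, and 64 is assumed to be the write system call number (as in the RISC-V Linux ABI). The file descriptor in a0 is ignored.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Assumption: 64 is the `write` system call number (RISC-V Linux ABI).
constexpr std::uint32_t kSysWrite = 64;

// Stand-in for the simulator's memory read: returns the little-endian
// 32-bit word stored at a word-aligned byte address.
using ReadWord = std::function<std::uint32_t(std::uint32_t)>;

// Hypothetical helper: collect `length` bytes starting at byte address `start`.
std::string read_string(const ReadWord& read_word, std::uint32_t start, std::uint32_t length) {
    std::string out;
    for (std::uint32_t i = 0; i < length; ++i) {
        std::uint32_t addr = start + i;
        std::uint32_t word = read_word(addr & ~3u);
        // Pick the byte inside the word that corresponds to `addr`.
        out.push_back(static_cast<char>((word >> (8 * (addr & 3u))) & 0xFFu));
    }
    return out;
}

// Returns the text to print for this ecall, or an empty string if it does not apply.
std::string handle_ecall(bool ecall_flag, std::uint32_t a7, std::uint32_t a1,
                         std::uint32_t a2, const ReadWord& read_word) {
    if (!ecall_flag || a7 != kSysWrite) return "";
    return read_string(read_word, a1, a2);
}
```

In the simulator loop, the arguments would come from top->io_ecall_flag, top->io_ecall_a7, top->io_ecall_a1 and top->io_ecall_a2, and the returned string would be written to standard output.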
When I run make verilator, I get some errors:
[error] firrtl.passes.PassExceptions:
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_a7 <= VOID
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_flag <= VOID
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_a0 <= VOID
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_a2 <= VOID
[error] firrtl.passes.CheckInitialization$RefNotInitializedException: @[src/main/scala/board/verilator/Top.scala 15:14] : [module Top] Reference io is not fully initialized.
[error] : io.ecall_a1 <= VOID
[error] firrtl.passes.PassException: 5 errors detected!
So I add the code below to src/main/scala/board/verilator/Top.scala
--- a/src/main/scala/board/verilator/Top.scala
+++ b/src/main/scala/board/verilator/Top.scala
@@ -25,6 +25,12 @@ class Top extends Module {
cpu.io.instruction := io.instruction
cpu.io.instruction_valid := io.instruction_valid
+
+ io.ecall_flag := cpu.io.ecall_flag
+ io.ecall_a0 := cpu.io.ecall_a0
+ io.ecall_a1 := cpu.io.ecall_a1
+ io.ecall_a2 := cpu.io.ecall_a2
+ io.ecall_a7 := cpu.io.ecall_a7
}
object VerilogGenerator extends App {
Then it works! But I'm not sure what the relationship between Top.scala and CPU.scala is.
When I run the Verilator simulation, I find a bug in my code: the string RISC-V\n is printed 3 times. I then inspect the waveform dumped by Verilator. The ecall is just like any other instruction: it is held for 1 CPU cycle.
So I suspect it is caused by the while loop in the Simulator.run function, whose iterations do not correspond one-to-one to CPU cycles.
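If that suspicion is right and the loop samples io_ecall_flag several times while it stays high, one possible remedy is to act only on the rising edge of the flag. The sketch below is an assumption about how that could be done, not something taken from the branch; the names RisingEdge and count_rising_edges are made up.

```cpp
#include <vector>

// Rising-edge detector: reports true only on the first sample where the
// signal is high after having been low.
struct RisingEdge {
    bool previous = false;

    bool operator()(bool current) {
        bool rose = current && !previous;
        previous = current;
        return rose;
    }
};

// Count how many times an action guarded by the detector would fire
// for a sequence of sampled flag values.
int count_rising_edges(const std::vector<bool>& samples) {
    RisingEdge edge;
    int fired = 0;
    for (bool sample : samples) {
        if (edge(sample)) ++fired;
    }
    return fired;
}
```

With a flag sampled as low, high, high, high, low, the guarded action fires once instead of three times.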