Contributed by kc71486
RISC-V, jserv
os: Ubuntu 22.04
sbt version: 1.9 (installed by following this guide, using the method for Ubuntu and other Debian-based distributions)
javac version: openjdk 17.0.8.1
The javac version doesn't matter.
java version: openjdk 17.0.8.1
The default Ubuntu 22.04 Java version (openjdk 11.0.20.1) also works.
scala version: 2.11.12
The project works even if Scala isn't installed; installing it changes nothing.
verilator: 4.038
If verilator isn't installed, every test case that requires file access (*.asmbin) fails with a Java "not a relative path" error when running sbt test.
gtkwave: 3.3.104 (only used in evaluation)
class Hello extends Module {
val io = IO(new Bundle {
val led = Output(UInt(1.W))
})
val CNT_MAX = (50000000 / 2 - 1).U;
val cntReg = RegInit(0.U(32.W))
val blkReg = RegInit(0.U(1.W))
cntReg := cntReg + 1.U
when(cntReg === CNT_MAX) {
cntReg := 0.U
blkReg := ~blkReg
}
io.led := blkReg
}
The module represents a blinking LED. The counter increments on every tick of the internal clock, and when it reaches a certain threshold, the LED signal inverts. This is a sequential circuit, so in Verilog the whole block would sit inside a clock-triggered always block. cntReg and blkReg behave like registers, while the other values behave like wires.
It kinda looks like…
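`define CNT_MAX (50000000 / 2 - 1)
module Hello(input clock, output led);
reg [31:0] cntReg;
reg blkReg;
initial begin
cntReg = 32'd0;
blkReg = 1'd0;
end
// Registers update on the rising clock edge, as in the Chisel module.
always @(posedge clock) begin
if (cntReg == `CNT_MAX) begin
cntReg <= 0;
blkReg <= ~blkReg;
end else begin
cntReg <= cntReg + 1;
end
end
assign led = blkReg;
endmodule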
The when block should generally be avoided in standard HDL design. In this case, we can replace it with a mux.
class Hello extends Module {
val io = IO(new Bundle {
val led = Output(UInt(1.W))
})
val CNT_MAX = (50000000 / 2 - 1).U;
val cntReg = RegInit(0.U(32.W))
val blkReg = RegInit(0.U(1.W))
val invsig = (cntReg === CNT_MAX)
cntReg := Mux(invsig, 0.U, cntReg + 1.U)
blkReg := blkReg ^ invsig
io.led := blkReg
}
invsig represents the trigger point and behaves like a wire. When invsig is high, cntReg becomes zero instead of cntReg + 1, and blkReg inverts.
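To see that the mux form and the when form describe the same registers, the two update rules can be modelled in software and stepped side by side. This is only a sketch: the struct `Blink`, the helper names, and the small `cnt_max` used in the check are assumptions made for illustration, not part of the Chisel project.

```cpp
#include <cassert>
#include <cstdint>

// Software model of the two registers in the blinking-LED module.
struct Blink {
    uint32_t cnt = 0;  // models cntReg
    uint32_t blk = 0;  // models blkReg (1 bit)
};

// One clock edge, written like the `when` version.
inline Blink step_when(Blink s, uint32_t cnt_max) {
    Blink n = s;
    n.cnt = s.cnt + 1;
    if (s.cnt == cnt_max) {
        n.cnt = 0;
        n.blk = ~s.blk & 1u;
    }
    return n;
}

// One clock edge, written like the mux version.
inline Blink step_mux(Blink s, uint32_t cnt_max) {
    const uint32_t invsig = (s.cnt == cnt_max) ? 1u : 0u;
    Blink n;
    n.cnt = invsig ? 0u : s.cnt + 1;
    n.blk = s.blk ^ invsig;
    return n;
}

// Steps both models together and reports whether they ever diverge.
inline bool models_agree(uint32_t cnt_max, uint32_t cycles) {
    assert(cycles > 0);
    Blink a, b;
    for (uint32_t i = 0; i < cycles; ++i) {
        a = step_when(a, cnt_max);
        b = step_mux(b, cnt_max);
        if (a.cnt != b.cnt || a.blk != b.blk) return false;
    }
    return true;
}
```

With `cnt_max = 3`, the counter goes 0, 1, 2, 3 and wraps to 0 on the fourth edge, which is also the edge where `blk` toggles.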
There is not much we need to add, only…
The heavy lifting in the memory and register access modules has been done for us; all we need to do is make the connections between the modules.
For instruction fetch, the next instruction address is either the jump target or the current address + 4, depending on jump_flag_id.
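The next-address choice in instruction fetch is a two-input mux. A minimal software sketch follows; `jump_flag_id` comes from the text, while `jump_address_id` and the function names are assumed for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Next-PC selection: the jump target when the flag is set, otherwise pc + 4.
inline uint32_t next_pc(uint32_t pc, bool jump_flag_id, uint32_t jump_address_id) {
    return jump_flag_id ? jump_address_id : pc + 4;
}

// Small usage check with made-up addresses.
inline void check_next_pc() {
    assert(next_pc(0x1000, false, 0x2000) == 0x1004);  // sequential fetch
    assert(next_pc(0x1000, true, 0x2000) == 0x2000);   // taken jump
}
```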
For instruction decode, a load sets memory_read_enable=1 and a store sets memory_write_enable=1.
For the CPU and execute stage, two control signals determine the ALU inputs: aluop1_source=1 makes the ALU take the pc as its first operand, while aluop2_source=1 makes it take the immediate as its second operand.
For example, around 2171ps mark, 0000139C+FFFFFFB8=00001354
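The operand selection and the addition from the waveform can be reproduced in a few lines. This is a sketch under assumptions: it treats 0x0000139C as the pc and 0xFFFFFFB8 as the sign-extended immediate (that is, -72), and it assumes the 0 setting of each select picks a register value, which the text does not state.

```cpp
#include <cassert>
#include <cstdint>

struct AluInputs {
    uint32_t op1;
    uint32_t op2;
};

// aluop1_source = 1 picks the pc, otherwise the rs1 value (assumed).
// aluop2_source = 1 picks the immediate, otherwise the rs2 value (assumed).
inline AluInputs select_operands(bool aluop1_source, bool aluop2_source,
                                 uint32_t pc, uint32_t rs1_data,
                                 uint32_t rs2_data, uint32_t immediate) {
    return AluInputs{aluop1_source ? pc : rs1_data,
                     aluop2_source ? immediate : rs2_data};
}

// 32-bit wrap-around addition, so adding 0xFFFFFFB8 subtracts 0x48.
inline uint32_t alu_add(AluInputs in) { return in.op1 + in.op2; }

// Worked example from the waveform: 0x0000139C + 0xFFFFFFB8 = 0x00001354.
inline void check_waveform_example() {
    const AluInputs in = select_operands(true, true, 0x0000139Cu, 0, 0, 0xFFFFFFB8u);
    assert(alu_add(in) == 0x00001354u);
}
```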
I removed all print and CSR functions and placed the called functions in the same file. Additionally, I removed some unused variables from the data segment.
_start:
_sinit:
la sp, 0xffc
call main
_sexit:
li t0, 0xfffc # halt address
li t1, 0xbabecafe
sw t1, 0(t0)
dead_loop:
j dead_loop # infinite loop
...
add sp, sp, -8
sw ra, 4(sp)
sw s0, 0(sp)
li s0, 0x0004
...
call HammingDistance
sw a0, 0(s0)
...
I added a test, HomeWorkTest, to check whether the result is as expected.
class HomeWorkTest extends AnyFlatSpec with ChiselScalatestTester {
behavior.of("Single Cycle CPU")
it should "execute hamming code calculation" in {
test(new TestTopModule("homework.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
for (i <- 1 to 50000) {
c.clock.step()
c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
}
c.io.mem_debug_read_address.poke(4.U) // #1
c.clock.step()
c.io.mem_debug_read_data.expect(24.U)
c.io.mem_debug_read_address.poke(8.U) // #2
c.clock.step()
c.io.mem_debug_read_data.expect(60.U)
c.io.mem_debug_read_address.poke(12.U) // #3
c.clock.step()
c.io.mem_debug_read_data.expect(0.U)
}
}
}
Use a high clock step count: if the count is too low, the program may not have finished when the results are read back.
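The expected values 24, 60 and 0 in the test can be cross-checked on the host with an independent reference. The sketch below is not the HammingDistance implementation that runs on the CPU; it assumes each 64-bit operand is passed as a low word followed by a high word, and it uses the three test vectors from the program's data segment.

```cpp
#include <cassert>
#include <cstdint>

// Number of set bits in a 32-bit word (clears the lowest set bit each round).
inline uint32_t popcount32(uint32_t v) {
    uint32_t n = 0;
    while (v != 0) {
        v &= v - 1;
        ++n;
    }
    return n;
}

// Reference Hamming distance of two 64-bit values given as (low, high) halves:
// first operand = (x0, y0), second operand = (x1, y1).
inline uint32_t hamming_distance_ref(uint32_t x0, uint32_t y0,
                                     uint32_t x1, uint32_t y1) {
    return popcount32(x0 ^ x1) + popcount32(y0 ^ y1);
}

// The three test vectors and the values the Chisel test expects.
inline void check_expected_values() {
    assert(hamming_distance_ref(0x00100000u, 0x00130000u, 0x000FFFFFu, 0x00000000u) == 24);
    assert(hamming_distance_ref(0x00000001u, 0x00000002u, 0x7FFFFFFFu, 0xFFFFFFFEu) == 60);
    assert(hamming_distance_ref(0x00000002u, 0x8370228Fu, 0x00000002u, 0x8370228Fu) == 0);
}
```

For the first vector, the low words differ in 21 bits (0x001FFFFF) and the high words in 3 bits (0x00130000), which gives 24.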
HomeWorkTest:
[info] Single Cycle CPU
[info] - should execute hamming code calculation *** FAILED ***
[info] io_mem_debug_read_data=27 (0x1b) did not equal expected=24 (0x18) (lines in CPUTest.scala: 133, 120) (CPUTest.scala:127)
To address the issue, my first thought was to add more test cases to see whether my datapath is wrong.
The test case is here. I copied it from another lecture's testbench and modified it.
[info] Testbench:
[info] Single Cycle CPU Testbench
[info] - should execute full testbench
Unfortunately, there aren't any errors. That is actually bad news, because it means the issue is much trickier than I thought.
I tried passing the arguments manually as literals, and all of the results are correct, even for calls with exactly the same input values.
static uint32_t t1_x0 = 0x00100000; // low
static uint32_t t1_y0 = 0x00130000; // high
static uint32_t t1_x1 = 0x000FFFFF;
static uint32_t t1_y1 = 0x00000000;
...
*((volatile int32_t *) 16) = HammingDistance(0x00100000, 0x00130000, 0x000FFFFF, 0x00000000); //24, correct
*((volatile int32_t *) 4) = HammingDistance(t1_x0, t1_y0, t1_x1, t1_y1); //24, output 27
Then I tried putting them on the stack instead of in global variables, and that also works.
Inspecting the data segment, I found something strange.
static uint32_t t1_x0 = 0x00100000; // low
static uint32_t t1_y0 = 0x00130000; // high
static uint32_t t1_x1 = 0x000FFFFF;
static uint32_t t1_y1 = 0x00000000;
static uint32_t t2_x0 = 0x00000001;
static uint32_t t2_y0 = 0x00000002;
static uint32_t t2_x1 = 0x7FFFFFFF;
static uint32_t t2_y1 = 0xFFFFFFFE;
static uint32_t t3_x0 = 0x00000002;
static uint32_t t3_y0 = 0x8370228F;
static uint32_t t3_x1 = 0x00000002;
static uint32_t t3_y1 = 0x8370228F;
.data dump
homework.elf: file format elf32-littleriscv
Contents of section .data:
2000 00001000 00001300 ffff0f00 01000000
2010 02000000 ffffff7f feffffff 02000000
2020 8f227083 02000000 8f227083
The dump contains only 11 words, but I declared 12 variables, so something must be wrong.
Inspecting the text segment, I think I found the reason.
1418: 000027b7 lui a5,0x2
141c: 0007a703 lw a4,0(a5) # 2000 <t1_x0>
1420: 000027b7 lui a5,0x2
1424: 0047a583 lw a1,4(a5) # 2004 <t1_y0>
1428: 000027b7 lui a5,0x2
142c: 0087a603 lw a2,8(a5) # 2008 <t1_x1>
1430: 000027b7 lui a5,0x2
1434: 02c7a783 lw a5,44(a5) # 202c <t1_y1>
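0x202c is not listed in the .data section. Since t1_y1 is initialized to zero, the compiler most likely placed it in a zero-initialized section (.bss or .sbss) right after .data. Nothing clears that region at startup, so the load may just return a junk value.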
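The address arithmetic behind this observation is easy to check: 11 words starting at 0x2000 end at 0x202c, so a load from 0x202c falls just outside the dumped contents. The helper names below are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// A section that starts at `base` and holds `words` 32-bit words.
struct Section {
    uint32_t base;
    uint32_t words;
};

// Address of the first byte past the section.
inline uint32_t section_end(Section s) { return s.base + 4 * s.words; }

// True when `addr` lies inside the section's contents.
inline bool covers(Section s, uint32_t addr) {
    return addr >= s.base && addr < section_end(s);
}

inline void check_dump() {
    const Section data{0x2000, 11};        // 11 words visible in the .data dump
    assert(section_end(data) == 0x202c);   // 0x2000 + 44
    assert(covers(data, 0x2008));          // t1_x1 is loaded from inside .data
    assert(!covers(data, 0x202c));         // t1_y1 is loaded from just past it
}
```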
[info] Run completed in 1 minute, 13 seconds.
[info] Total number of tests run: 12
[info] Suites: completed 10, aborted 0
[info] Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 91 s (01:31), completed Nov 21, 2023, 1:46:37 PM
I changed the memory argument to accept hex input and made the halt output more verbose.
if (auto it = std::find(args.begin(), args.end(), "-memory");
it != args.end()) {
memory_words = parse_number(*(it + 1));
...
if (halt_address) {
if (memory->read(halt_address) == 0xBABECAFE) {
std::cout << "halted after " << main_time << " time\n";
break;
}
}
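The body of parse_number is not shown above. One way such a helper could accept both decimal and hex input is to pass base 0 to `std::stoul`, which detects a `0x` prefix automatically. The function below is a hypothetical stand-in (named `parse_number_sketch` to make that clear), not the simulator's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for parse_number. Base 0 lets std::stoul pick
// hex for a "0x" prefix, octal for a leading "0", and decimal otherwise.
inline uint32_t parse_number_sketch(const std::string &text) {
    std::size_t used = 0;
    const unsigned long value = std::stoul(text, &used, 0);
    if (used != text.size()) {
        throw std::invalid_argument("trailing characters in number: " + text);
    }
    return static_cast<uint32_t>(value);
}

// The values used on the command line later in this write-up.
inline void check_parse_number() {
    assert(parse_number_sketch("0x8000") == 0x8000u);
    assert(parse_number_sketch("0xfffc") == 0xfffcu);
    assert(parse_number_sketch("50000") == 50000u);
}
```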
I then ran the Verilator simulation with the following arguments.
arguments="-memory 0x8000 -instruction src/main/resources/homework.asmbin -time 50000 -signature 0x0 0x80 mem.txt -halt 0xfffc -vcd dump.vcd"
verilog/verilator/obj_dir/VTop $arguments
We define the address range 0x20000000 ~ 0x3FFFFFFF as VRAM (in hello.c). Any write to this range is redirected to a separate, dedicated memory module. VRAM is currently write-only; reading through vram_bundle does nothing.
We modify the address and write enable signals so that each write is routed to the correct module.
io.memory_bundle.address := Cat(
0.U(Parameters.SlaveDeviceCountBits.W),
mem.io.memory_bundle.address(Parameters.AddrBits - 1 - Parameters.SlaveDeviceCountBits, 0)
)
io.memory_bundle.write_enable :=
mem.io.memory_bundle.write_enable &&
mem.io.memory_bundle.address(Parameters.AddrBits - 1, Parameters.AddrBits - Parameters.SlaveDeviceCountBits) === SlaveType.UserMemory
io.vram_bundle.address := Cat(
0.U(Parameters.SlaveDeviceCountBits.W),
mem.io.memory_bundle.address(Parameters.AddrBits - 1 - Parameters.SlaveDeviceCountBits, 0)
)
io.vram_bundle.write_enable :=
mem.io.memory_bundle.write_enable &&
mem.io.memory_bundle.address(Parameters.AddrBits - 1, Parameters.AddrBits - Parameters.SlaveDeviceCountBits) === SlaveType.VisualMemory
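The same routing can be modelled in software. The sketch assumes `Parameters.AddrBits` is 32 and `Parameters.SlaveDeviceCountBits` is 3, with `SlaveType.UserMemory` encoded as 0 and `SlaveType.VisualMemory` as 1. None of these values appear in the excerpt; they are chosen because they make device 1 cover exactly 0x20000000 ~ 0x3FFFFFFF.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kAddrBits = 32;             // assumed Parameters.AddrBits
constexpr unsigned kSlaveDeviceCountBits = 3;  // assumed Parameters.SlaveDeviceCountBits
constexpr uint32_t kUserMemory = 0;            // assumed SlaveType.UserMemory
constexpr uint32_t kVisualMemory = 1;          // assumed SlaveType.VisualMemory

struct Decoded {
    uint32_t device;  // top kSlaveDeviceCountBits bits of the address
    uint32_t local;   // address with the device bits replaced by zeros
};

inline Decoded decode(uint32_t address) {
    const unsigned shift = kAddrBits - kSlaveDeviceCountBits;
    return Decoded{address >> shift, address & ((1u << shift) - 1u)};
}

// A write reaches a module only when the device bits select that module.
inline bool memory_write_enable(uint32_t address, bool write_enable) {
    return write_enable && decode(address).device == kUserMemory;
}

inline bool vram_write_enable(uint32_t address, bool write_enable) {
    return write_enable && decode(address).device == kVisualMemory;
}

inline void check_routing() {
    assert(decode(0x20000010u).device == kVisualMemory);
    assert(decode(0x20000010u).local == 0x10u);
    assert(vram_write_enable(0x20000010u, true));
    assert(!memory_write_enable(0x20000010u, true));
    assert(memory_write_enable(0x00002000u, true));
}
```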
I also added some output ports so that the CPU can pass ecall information to the outside.
class Top extends Module {
...
io.memory_bundle <> cpu.io.memory_bundle
io.vram_bundle <> cpu.io.vram_bundle
io.kernel_bundle <> cpu.io.kernel_bundle
io.instruction_address := cpu.io.instruction_address
cpu.io.instruction := io.instruction
cpu.io.instruction_valid := io.instruction_valid
io.ecall_en := cpu.io.ecall_en
io.ecall_a7 := cpu.io.ecall_a7
io.ecall_a0 := cpu.io.ecall_a0
io.ecall_a1 := cpu.io.ecall_a1
io.ecall_a2 := cpu.io.ecall_a2
io.ecall_a3 := cpu.io.ecall_a3
io.ecall_a4 := cpu.io.ecall_a4
io.ecall_a5 := cpu.io.ecall_a5
}
class CPUBundle extends Bundle {
val instruction_address = Output(UInt(Parameters.AddrWidth))
val instruction = Input(UInt(Parameters.DataWidth))
val memory_bundle = Flipped(new RAMBundle)
val vram_bundle = Flipped(new RAMBundle)
val kernel_bundle = Flipped(new RAMBundle)
val instruction_valid = Input(Bool())
val deviceSelect = Output(UInt(Parameters.SlaveDeviceCountBits.W))
val debug_read_address = Input(UInt(Parameters.PhysicalRegisterAddrWidth))
val debug_read_data = Output(UInt(Parameters.DataWidth))
val ecall_en = Output(Bool())
val ecall_a7 = Output(UInt(Parameters.DataWidth))
val ecall_a0 = Output(UInt(Parameters.DataWidth))
val ecall_a1 = Output(UInt(Parameters.DataWidth))
val ecall_a2 = Output(UInt(Parameters.DataWidth))
val ecall_a3 = Output(UInt(Parameters.DataWidth))
val ecall_a4 = Output(UInt(Parameters.DataWidth))
val ecall_a5 = Output(UInt(Parameters.DataWidth))
}
To actually print to the console, I added a VRAM class and implemented a flush method.
class VRAM {
...
void write(size_t address, uint32_t value, bool write_strobe[4]) {
...
vram_dirty = true;
}
void flush() {
if (! vram_dirty) {
return;
}
vram_dirty = false;
std::cout << "\x1b[A";
for(uint32_t i=0; i<ROWS; i++) {
std::cout << "\x1b[A\x1b[2K";
}
std::cout << std::flush;
for(uint32_t i=0; i<ROWS; i++) {
for(uint32_t j=0; j<COLS; j++) {
uint32_t val = vram[i * COLS + j];
char ch0 = (char) ((val >> 0) & 0xff);
char ch1 = (char) ((val >> 8) & 0xff);
char ch2 = (char) ((val >> 16) & 0xff);
char ch3 = (char) ((val >> 24) & 0xff);
std::cout << ch0 << ch1 << ch2 << ch3;
}
std::cout << "\n";
}
std::cout << "\x1b[B" << std::flush;
}
};
Whenever we write something in VRAM, the terminal will refresh the area according to the data in VRAM.
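Each VRAM word holds four characters, with the lowest byte printed first. The round trip between characters and a word can be sketched as follows; `pack_chars` and `unpack_word` are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Packs four characters into one word, lowest byte first.
inline uint32_t pack_chars(char c0, char c1, char c2, char c3) {
    return static_cast<uint32_t>(static_cast<unsigned char>(c0)) |
           (static_cast<uint32_t>(static_cast<unsigned char>(c1)) << 8) |
           (static_cast<uint32_t>(static_cast<unsigned char>(c2)) << 16) |
           (static_cast<uint32_t>(static_cast<unsigned char>(c3)) << 24);
}

// Extracts the four characters in the same order flush() prints them.
inline std::string unpack_word(uint32_t val) {
    std::string out;
    for (unsigned shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<char>((val >> shift) & 0xff));
    }
    return out;
}

inline void check_round_trip() {
    assert(pack_chars('H', 'e', 'l', 'l') == 0x6c6c6548u);
    assert(unpack_word(0x6c6c6548u) == "Hell");
}
```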
When an ecall instruction is executed, the CPU asserts the io_ecall_en signal,
io.do_ecall := io.instruction === InstructionsEnv.ecall
which is caught by the simulator.
if (top->io_ecall_en) {
if(ecall_timeout == 0) {
ihandler->handle(top->io_ecall_a7, top->io_ecall_a0, top->io_ecall_a1, top->io_ecall_a2,
top->io_ecall_a3, top->io_ecall_a4, top->io_ecall_a5);
ecall_timeout = 3;
}
else {
ecall_timeout --;
}
}
It is then handled by InterruptHandler.
class InterruptHandler {
private:
Memory &memory;
VRAM &vram;
Printer printer;
public:
InterruptHandler(Memory &memory, VRAM &vram) : memory(memory), vram(vram), printer(vram) {}
uintptr_t memoryTranslate(uint32_t src) {
if(src < 0x20000000) {
return (uintptr_t) (((char *) memory.image()) + src);
}
else if(src < 0x40000000) {
return (uintptr_t) (((char *) vram.image()) + src - 0x20000000);
}
else {
return 0;
}
}
void handle(uint32_t type, uint32_t arg0, uint32_t arg1, uint32_t arg2, uint32_t arg3, uint32_t arg4, uint32_t arg5) {
if(type == 64) {
printer.print_string((char *)memoryTranslate(arg1), arg2);
}
}
};
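The handler treats type 64 as a write request; 64 is the `write` system call number in the RISC-V Linux ABI, where a1 holds the buffer address and a2 the length. The sketch below models the address translation and the write path with two plain byte buffers. `Backing`, `translate` and `handle_ecall` are illustrative names, and unlike the original, this version also bounds-checks the requested range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Two byte-addressed buffers standing in for Memory and VRAM.
struct Backing {
    std::vector<char> memory;
    std::vector<char> vram;
};

// Guest address to host pointer, using the same ranges as memoryTranslate.
// Returns nullptr when the range is unmapped or does not fit in the buffer.
inline const char *translate(const Backing &b, uint32_t src, uint32_t len) {
    if (src < 0x20000000u) {
        return (uint64_t{src} + len <= b.memory.size()) ? b.memory.data() + src : nullptr;
    }
    if (src < 0x40000000u) {
        const uint32_t off = src - 0x20000000u;
        return (uint64_t{off} + len <= b.vram.size()) ? b.vram.data() + off : nullptr;
    }
    return nullptr;
}

// Returns the text a write ecall would print, or an empty string otherwise.
inline std::string handle_ecall(const Backing &b, uint32_t a7, uint32_t a1, uint32_t a2) {
    if (a7 != 64) return {};
    const char *p = translate(b, a1, a2);
    return p ? std::string(p, a2) : std::string{};
}

inline void check_write_ecall() {
    Backing b;
    b.memory.assign(0x100, '\0');
    b.vram.assign(0x100, '\0');
    const std::string msg = "hello";
    std::copy(msg.begin(), msg.end(), b.memory.begin() + 0x40);
    std::copy(msg.begin(), msg.end(), b.vram.begin() + 0x10);
    assert(handle_ecall(b, 64, 0x40, 5) == "hello");         // main memory
    assert(handle_ecall(b, 64, 0x20000010u, 5) == "hello");  // VRAM range
    assert(handle_ecall(b, 64, 0x40000000u, 5).empty());     // unmapped
    assert(handle_ecall(b, 63, 0x40, 5).empty());            // not a write
}
```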