
Assignment3: Single-cycle RISC-V CPU

Contributed by kc71486

tags: RISC-V, jserv

Environment setup

os: ubuntu 22.04
sbt version: 1.9 (installed by following this guide, using the method for Ubuntu and other Debian-based distributions)
javac version: openjdk 17.0.8.1

The javac version doesn't matter at all.

java version: openjdk 17.0.8.1

The default Ubuntu 22.04 Java version (openjdk 11.0.20.1) also works.

scala version: 2.11.12

Everything works even if Scala isn't installed; installing it doesn't change anything.

verilator: 4.038

If verilator isn't installed, running sbt test makes every test case that requires file access (*.asmbin) fail with a Java "not a relative path" error.

gtkwave: 3.3.104 (only used in evaluation)

About Hello World in Chisel

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}

Explanation

The module represents a blinking LED. The counter increments on every tick of the internal clock, and when it reaches the threshold, the LED signal inverts. This is a sequential circuit, so in Verilog the whole block would sit in a clock-triggered always block. cntReg and blkReg behave like registers, while the other values behave like wires.

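In Verilog, it looks roughly like this:

`define CNT_MAX (50000000 / 2 - 1)
module Hello(input clock, input reset, output led);
    reg [31:0] cntReg;
    reg        blkReg;
    assign led = blkReg;
    // Both registers update on the rising clock edge, as in the Chisel module.
    always @(posedge clock) begin
        if (reset) begin
            cntReg <= 32'd0;
            blkReg <=  1'd0;
        end else if (cntReg == `CNT_MAX) begin
            cntReg <= 32'd0;
            blkReg <= ~blkReg;
        end else begin
            cntReg <= cntReg + 32'd1;
        end
    end
endmodule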

enhancement

The when block can be avoided here by expressing the register updates as an explicit mux.

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg = RegInit(0.U(32.W))
  val blkReg = RegInit(0.U(1.W))
  val invsig = (cntReg === CNT_MAX)
  cntReg := Mux(invsig, 0.U, cntReg + 1.U) // Scala has no ?: operator, so use Chisel's Mux
  blkReg := blkReg ^ invsig
  io.led := blkReg
}

invsig represents the trigger point (wire-like). When invsig is high, cntReg becomes zero instead of cntReg + 1, and the blkReg signal inverts.
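The equivalence of the two forms can be checked with a small host-side model. This is only a sketch: the struct and function names are made up for illustration, and the model works on plain integers rather than on generated hardware.

```cpp
#include <cstdint>

// Software model of the two registers; all names here are illustrative.
struct BlinkState {
    uint32_t cnt = 0;  // models cntReg
    uint32_t blk = 0;  // models blkReg (1 bit)
};

// when-style update: default increment, overridden when the threshold is hit.
BlinkState step_when(BlinkState s, uint32_t cnt_max) {
    BlinkState next = s;
    next.cnt = s.cnt + 1;
    if (s.cnt == cnt_max) {
        next.cnt = 0;
        next.blk = (~s.blk) & 1u;
    }
    return next;
}

// mux-style update: invsig selects the next counter value and toggles blk via XOR.
BlinkState step_mux(BlinkState s, uint32_t cnt_max) {
    uint32_t invsig = (s.cnt == cnt_max) ? 1u : 0u;
    BlinkState next;
    next.cnt = invsig ? 0u : s.cnt + 1;
    next.blk = s.blk ^ invsig;
    return next;
}

// Runs both models side by side and reports whether they ever diverge.
bool models_agree(uint32_t cnt_max, uint32_t cycles) {
    BlinkState a, b;
    for (uint32_t i = 0; i < cycles; i++) {
        a = step_when(a, cnt_max);
        b = step_mux(b, cnt_max);
        if (a.cnt != b.cnt || a.blk != b.blk) return false;
    }
    return true;
}
```

Calling models_agree with a small threshold such as 3 keeps the check fast while still covering several wrap-arounds.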

Finish mycpu

add lines

There isn't much we need to add, only:

  • 4 lines in InstructionFetch.scala
  • 2 lines in InstructionDecode.scala
  • 7 lines in CPU.scala
  • 11 lines in Execute.scala

The heavy lifting in the memory and register access modules has already been done for us; all we need to do is connect the modules to each other.

mycpu evaluation (gtkwave in test)

InstructionFetch



For instruction fetch, the next address is either the jump target or the current address + 4, depending on jump_flag_id.
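The selection can be written as a one-line function. This is a sketch of the behaviour only; jump_address_id is an assumed name for the jump target signal, and the function is not taken from the Chisel source.

```cpp
#include <cstdint>

// Next-PC selection: the jump target when the jump flag is set, otherwise PC + 4.
// jump_address_id is an assumed signal name; the function itself is illustrative.
uint32_t next_pc(uint32_t pc, bool jump_flag_id, uint32_t jump_address_id) {
    return jump_flag_id ? jump_address_id : pc + 4u;
}
```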

InstructionDecode



For instruction decode, a load sets memory_read_enable=1 and a store sets memory_write_enable=1.
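As a rough illustration, the two enables follow directly from the RV32I major opcode in instruction bits 6..0. The struct and function names below are hypothetical; only the opcode values come from the ISA.

```cpp
#include <cstdint>

// RV32I major opcodes (instruction bits 6..0).
constexpr uint32_t OPCODE_LOAD  = 0x03;  // 0b0000011
constexpr uint32_t OPCODE_STORE = 0x23;  // 0b0100011

// Hypothetical container for the two memory control signals.
struct MemControl {
    bool memory_read_enable;
    bool memory_write_enable;
};

// Derives both enables from the opcode field alone.
MemControl decode_mem_control(uint32_t instruction) {
    uint32_t opcode = instruction & 0x7fu;
    return MemControl{opcode == OPCODE_LOAD, opcode == OPCODE_STORE};
}
```

For instance, 0x0007a703 (lw a4,0(a5)) enables a read, while 0x000027b7 (lui a5,0x2) enables neither.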

CPU & Execute



For CPU and execute, two control signals determine the ALU inputs: aluop1_source=1 makes the ALU take pc as its first operand, while aluop2_source=1 makes it take the immediate as its second operand.
For example, around the 2171ps mark, 0000139C+FFFFFFB8=00001354.
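The arithmetic in that example can be reproduced on the host. The sketch below assumes that both selects are 1 at that point, so that 0000139C is the pc and FFFFFFB8 is a sign-extended immediate of -72; the waveform values themselves are from the text, the function names are illustrative.

```cpp
#include <cstdint>

// Operand multiplexers controlled by the two source-select signals.
uint32_t select_op1(bool aluop1_source, uint32_t reg1_data, uint32_t pc) {
    return aluop1_source ? pc : reg1_data;
}

uint32_t select_op2(bool aluop2_source, uint32_t reg2_data, uint32_t immediate) {
    return aluop2_source ? immediate : reg2_data;
}

// 32-bit addition; unsigned arithmetic wraps modulo 2^32, so adding
// 0xFFFFFFB8 is the same as subtracting 0x48.
uint32_t alu_add(uint32_t a, uint32_t b) {
    return a + b;
}
```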

Adapt assignment2

modify c code

I removed all print and CSR functions, and placed the called functions in the same file.

modify assembly code

I removed all print and CSR functions, and placed the called functions in the same file. Additionally, I removed some unused variables from the data segment.

code
_start:
_sinit:
    la sp, 0xffc
    call main
_sexit:
    li t0, 0xfffc       # halt address
    li t1, 0xbabecafe
    sw t1, 0(t0)
dead_loop:
    j dead_loop         # infinite loop
    ...
    add	 sp, sp, -8
    sw  ra, 4(sp)
    sw  s0, 0(sp)
    li  s0, 0x0004
    ...
    call  HammingDistance
    sw  a0, 0(s0)
    ...

add testbench

I added test HomeWorkTest to see if the result is as expected.

code
class HomeWorkTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "execute hamming code calculation" in {
    test(new TestTopModule("homework.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 50000) {
        c.clock.step()
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }
      c.io.mem_debug_read_address.poke(4.U) // #1
      c.clock.step()
      c.io.mem_debug_read_data.expect(24.U)
      c.io.mem_debug_read_address.poke(8.U) // #2
      c.clock.step()
      c.io.mem_debug_read_data.expect(60.U)
      c.io.mem_debug_read_address.poke(12.U) // #3
      c.clock.step()
      c.io.mem_debug_read_data.expect(0.U)
    }
  }
}

Use a high clock step count. If the count is too low, the program may not have finished yet and the results will not be in memory when they are checked.
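The expected values 24, 60 and 0 can be cross-checked with a host-side reference. This is a minimal sketch, not the assignment's HammingDistance: it assumes the first two arguments are the low and high words of one 64-bit value and the last two are the low and high words of the other, and the function names are made up.

```cpp
#include <cstdint>

// Counts set bits without relying on compiler builtins.
int popcount64(uint64_t v) {
    int n = 0;
    while (v != 0) {
        v &= v - 1;  // clear the lowest set bit
        n++;
    }
    return n;
}

// Host-side reference: (x0, y0) are the low/high words of the first value,
// (x1, y1) the low/high words of the second. This pairing is an assumption.
int hamming_distance_ref(uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1) {
    uint64_t a = (static_cast<uint64_t>(y0) << 32) | x0;
    uint64_t b = (static_cast<uint64_t>(y1) << 32) | x1;
    return popcount64(a ^ b);
}
```

With the three input sets from the C source, this reference gives the same numbers the testbench expects.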

problem faced (c code)

HomeWorkTest:
[info] Single Cycle CPU
[info] - should execute hamming code calculation *** FAILED ***
[info]   io_mem_debug_read_data=27 (0x1b) did not equal expected=24 (0x18) (lines in CPUTest.scala: 133, 120) (CPUTest.scala:127)

add more testbench

To address the issue, my first thought was to add more test cases to see whether my datapath was wrong.
The test case is here. I copied and modified it from another lecture's testbench.

[info] Testbench:
[info] Single Cycle CPU Testbench
[info] - should execute full testbench

Unfortunately, there weren't any errors. This is actually bad news, because it means the issue is much trickier than I thought.

root cause

I tried passing the arguments in manually as literals, and all of those calls gave correct results, even ones with exactly the same inputs as the failing call.

such as:
static uint32_t t1_x0 = 0x00100000; // low
static uint32_t t1_y0 = 0x00130000; // high
static uint32_t t1_x1 = 0x000FFFFF;
static uint32_t t1_y1 = 0x00000000;
...
*((volatile int32_t *) 16) = HammingDistance(0x00100000, 0x00130000, 0x000FFFFF, 0x00000000); //24, correct
*((volatile int32_t *) 4) = HammingDistance(t1_x0, t1_y0, t1_x1, t1_y1); //24, output 27


Then I tried putting them on the stack instead of in global variables, and that also worked.
Inspecting the data segment, I found something strange.

original c code:
static uint32_t t1_x0 = 0x00100000; // low
static uint32_t t1_y0 = 0x00130000; // high
static uint32_t t1_x1 = 0x000FFFFF;
static uint32_t t1_y1 = 0x00000000;
static uint32_t t2_x0 = 0x00000001;
static uint32_t t2_y0 = 0x00000002;
static uint32_t t2_x1 = 0x7FFFFFFF;
static uint32_t t2_y1 = 0xFFFFFFFE;
static uint32_t t3_x0 = 0x00000002;
static uint32_t t3_y0 = 0x8370228F;
static uint32_t t3_x1 = 0x00000002;
static uint32_t t3_y1 = 0x8370228F;


.data dump

homework.elf:     file format elf32-littleriscv
Contents of section .data:
 2000 00001000 00001300 ffff0f00 01000000
 2010 02000000 ffffff7f feffffff 02000000
 2020 8f227083 02000000 8f227083 

The dump contains only 11 words, but I declared 12 variables, so something must be wrong.
Inspecting the text segment, I think I found the reason.

1418:	000027b7          	lui	a5,0x2
141c:	0007a703          	lw	a4,0(a5) # 2000 <t1_x0>
1420:	000027b7          	lui	a5,0x2
1424:	0047a583          	lw	a1,4(a5) # 2004 <t1_y0>
1428:	000027b7          	lui	a5,0x2
142c:	0087a603          	lw	a2,8(a5) # 2008 <t1_x1>
1430:	000027b7          	lui	a5,0x2
1434:	02c7a783          	lw	a5,44(a5) # 202c <t1_y1>

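Address 202c is not covered by the .data dump, which ends at 202b. Since t1_y1 is initialized to zero, the compiler most likely placed it in a zero-initialized section (.bss or .sbss) instead of .data. That section is not part of the dumped data, and the startup code shown above does not clear it, so the load might just return a junk value.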
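The address check can be spelled out in a few lines. The start address 0x2000 and the count of 11 words come from the dump above; the helper names are illustrative.

```cpp
#include <cstdint>

// Values from the objdump output in the text: .data starts at 0x2000 and the
// dump shows 11 32-bit words.
constexpr uint32_t DATA_START = 0x2000;
constexpr uint32_t DATA_WORDS = 11;

// True when a 4-byte access at addr lies entirely inside the dumped .data range.
bool in_data_section(uint32_t addr) {
    return addr >= DATA_START && addr + 4 <= DATA_START + DATA_WORDS * 4;
}

// The dump prints raw bytes in memory order, so "00001000" is the
// little-endian word 0x00100000.
uint32_t word_from_dump_bytes(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    return static_cast<uint32_t>(b0) | (static_cast<uint32_t>(b1) << 8) |
           (static_cast<uint32_t>(b2) << 16) | (static_cast<uint32_t>(b3) << 24);
}
```

The loads from 2000, 2004 and 2008 fall inside the range, while the load from 202c is the first word past its end.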

result

[info] Run completed in 1 minute, 13 seconds.
[info] Total number of tests run: 12
[info] Suites: completed 10, aborted 0
[info] Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 91 s (01:31), completed Nov 21, 2023, 1:46:37 PM

Verilator analysis

modify sim_main.cpp and run verilator

I changed the memory argument parsing to accept hex input and made the halt output more verbose.

if (auto it = std::find(args.begin(), args.end(), "-memory");
    it != args.end()) {
    memory_words = parse_number(*(it + 1));
}

...
if (halt_address) {
    if (memory->read(halt_address) == 0xBABECAFE) {
        std::cout << "halted after " << main_time << " time\n";
        break;
    }
}
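The body of parse_number is not shown here. One possible way to accept both decimal and hex input is std::stoul with base 0, sketched below under a different name to make clear that it is a stand-in and not the actual helper.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for parse_number, whose body is not shown in the text.
// Base 0 lets std::stoul auto-detect "0x" (hex), a leading "0" (octal) or decimal.
// std::stoul throws std::invalid_argument if no number can be parsed.
uint32_t parse_number_sketch(const std::string &text) {
    return static_cast<uint32_t>(std::stoul(text, nullptr, 0));
}
```

With this approach, "-memory 0x8000" and "-time 50000" can go through the same parsing path.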

I then ran the Verilator-built simulator with the following arguments.

arguments="-memory 0x8000 -instruction src/main/resources/homework.asmbin -time 50000 -signature 0x0 0x80 mem.txt -halt 0xfffc -vcd dump.vcd"
verilog/verilator/obj_dir/VTop $arguments

result

Small bonus

Implement mmio

We define addresses 0x20000000 ~ 0x3FFFFFFF as VRAM (in hello.c). Any write to this segment is redirected to a separate, dedicated memory module. VRAM is currently write-only; reading through vram_bundle does nothing.

We modify the address and write enable so that the write signal is routed to the correct module.

CPU
io.memory_bundle.address := Cat(
    0.U(Parameters.SlaveDeviceCountBits.W),
    mem.io.memory_bundle.address(Parameters.AddrBits - 1 - Parameters.SlaveDeviceCountBits, 0)
)
io.memory_bundle.write_enable  :=
    mem.io.memory_bundle.write_enable && 
    mem.io.memory_bundle.address(Parameters.AddrBits - 1, Parameters.AddrBits - Parameters.SlaveDeviceCountBits) === SlaveType.UserMemory
io.vram_bundle.address := Cat(
    0.U(Parameters.SlaveDeviceCountBits.W),
    mem.io.memory_bundle.address(Parameters.AddrBits - 1 - Parameters.SlaveDeviceCountBits, 0)
)
io.vram_bundle.write_enable  :=
    mem.io.memory_bundle.write_enable && 
    mem.io.memory_bundle.address(Parameters.AddrBits - 1, Parameters.AddrBits - Parameters.SlaveDeviceCountBits) === SlaveType.VisualMemory
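The same routing can be modelled on the host. This sketch assumes 32 address bits and 3 device-select bits; the 3 is inferred from the 0x20000000 ~ 0x3FFFFFFF range and is not stated in the text, and the numeric encodings of the two SlaveType values are assumptions as well.

```cpp
#include <cstdint>

// Assumed values: 32 address bits and 3 device-select bits. The 3 is inferred
// from the 0x20000000 ~ 0x3FFFFFFF VRAM range and is not stated in the text.
constexpr uint32_t ADDR_BITS = 32;
constexpr uint32_t SLAVE_DEVICE_COUNT_BITS = 3;
constexpr uint32_t USER_MEMORY = 0;    // assumed encoding of SlaveType.UserMemory
constexpr uint32_t VISUAL_MEMORY = 1;  // assumed encoding of SlaveType.VisualMemory

// Top bits of the address pick the device.
uint32_t device_select(uint32_t address) {
    return address >> (ADDR_BITS - SLAVE_DEVICE_COUNT_BITS);
}

// Remaining low bits are the offset inside that device (top bits zeroed).
uint32_t device_offset(uint32_t address) {
    return address & ((1u << (ADDR_BITS - SLAVE_DEVICE_COUNT_BITS)) - 1u);
}

// A write reaches a device only when the select bits match it.
bool memory_write_enable(bool write_enable, uint32_t address) {
    return write_enable && device_select(address) == USER_MEMORY;
}

bool vram_write_enable(bool write_enable, uint32_t address) {
    return write_enable && device_select(address) == VISUAL_MEMORY;
}
```

Under these assumptions a store to 0x20000010 enables only the VRAM write, with offset 0x10.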


I also added some output ports, so that it can actually do something.

Top
class Top extends Module {
  ...
  io.memory_bundle <> cpu.io.memory_bundle
  io.vram_bundle <> cpu.io.vram_bundle
  io.kernel_bundle <> cpu.io.kernel_bundle
  io.instruction_address := cpu.io.instruction_address
  cpu.io.instruction     := io.instruction

  cpu.io.instruction_valid := io.instruction_valid
  
  io.ecall_en    := cpu.io.ecall_en
  io.ecall_a7    := cpu.io.ecall_a7
  io.ecall_a0    := cpu.io.ecall_a0
  io.ecall_a1    := cpu.io.ecall_a1
  io.ecall_a2    := cpu.io.ecall_a2
  io.ecall_a3    := cpu.io.ecall_a3
  io.ecall_a4    := cpu.io.ecall_a4
  io.ecall_a5    := cpu.io.ecall_a5
}
CPUBundle
class CPUBundle extends Bundle {
  val instruction_address = Output(UInt(Parameters.AddrWidth))
  val instruction         = Input(UInt(Parameters.DataWidth))
  val memory_bundle       = Flipped(new RAMBundle)
  val vram_bundle         = Flipped(new RAMBundle)
  val kernel_bundle       = Flipped(new RAMBundle)
  val instruction_valid   = Input(Bool())
  val deviceSelect        = Output(UInt(Parameters.SlaveDeviceCountBits.W))
  val debug_read_address  = Input(UInt(Parameters.PhysicalRegisterAddrWidth))
  val debug_read_data     = Output(UInt(Parameters.DataWidth))
  
  val ecall_en            = Output(Bool())
  val ecall_a7            = Output(UInt(Parameters.DataWidth))
  val ecall_a0            = Output(UInt(Parameters.DataWidth))
  val ecall_a1            = Output(UInt(Parameters.DataWidth))
  val ecall_a2            = Output(UInt(Parameters.DataWidth))
  val ecall_a3            = Output(UInt(Parameters.DataWidth))
  val ecall_a4            = Output(UInt(Parameters.DataWidth))
  val ecall_a5            = Output(UInt(Parameters.DataWidth))
}


To actually print to the console, I added a VRAM class and implemented a flush method.

sim_main
class VRAM {
    ...
    void write(size_t address, uint32_t value, bool write_strobe[4]) {
        ...
        vram_dirty = true;
    }
    void flush() {
        if (! vram_dirty) {
            return;
        }
        vram_dirty = false;
        std::cout << "\x1b[A";
        for(uint32_t i=0; i<ROWS; i++) {
            std::cout << "\x1b[A\x1b[2K";
        }
        std::cout << std::flush;
        for(uint32_t i=0; i<ROWS; i++) {
            for(uint32_t j=0; j<COLS; j++) {
                uint32_t val = vram[i * COLS + j];
                char ch0 = (char) ((val >>  0) & 0xff);
                char ch1 = (char) ((val >>  8) & 0xff);
                char ch2 = (char) ((val >> 16) & 0xff);
                char ch3 = (char) ((val >> 24) & 0xff);
                std::cout << ch0 << ch1 << ch2 << ch3;
            }
            std::cout << "\n";
        }
        std::cout << "\x1b[B" << std::flush;
    }
};

Whenever we write something in VRAM, the terminal will refresh the area according to the data in VRAM.
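The inner loop of flush prints each 32-bit word as four characters, lowest byte first. A small standalone version of that unpacking, with an illustrative function name:

```cpp
#include <cstdint>
#include <string>

// Splits one VRAM word into four characters, lowest byte first, matching the
// ch0..ch3 order used in flush().
std::string unpack_word(uint32_t val) {
    std::string out;
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<char>((val >> shift) & 0xff));
    }
    return out;
}
```

For example, the word 0x6c6c6548 unpacks to the text "Hell".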


Implement ECALL (a7=64)

When ecall is executed, the CPU emits the io_ecall_en signal,

io.do_ecall := io.instruction === InstructionsEnv.ecall

which is caught by the simulator.

if (top->io_ecall_en) {
    if(ecall_timeout == 0) {
        ihandler->handle(top->io_ecall_a7, top->io_ecall_a0, top->io_ecall_a1, top->io_ecall_a2,
                    top->io_ecall_a3, top->io_ecall_a4, top->io_ecall_a5);
        ecall_timeout = 3;
    }
    else {
        ecall_timeout --;
    }
}
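The countdown means the handler runs on the first sample where io_ecall_en is seen and then skips the next three. The sketch below replays the same countdown over a run of consecutive asserted samples and counts the handler calls; the function name is illustrative.

```cpp
#include <cstdint>

// Counts handler invocations for a run of consecutive samples with ecall_en
// high, using the same countdown as the simulator loop in the text.
uint32_t count_handled(uint32_t asserted_samples) {
    uint32_t ecall_timeout = 0;
    uint32_t handled = 0;
    for (uint32_t i = 0; i < asserted_samples; i++) {
        if (ecall_timeout == 0) {
            handled++;          // stands in for ihandler->handle(...)
            ecall_timeout = 3;
        } else {
            ecall_timeout--;
        }
    }
    return handled;
}
```

So a signal that stays high for up to four samples triggers the handler once, and a fifth sample triggers it again.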

It is then handled by InterruptHandler.

class InterruptHandler {
private:
    Memory &memory;
    VRAM &vram;
    Printer printer;
public:
    InterruptHandler(Memory &memory, VRAM &vram) : memory(memory), vram(vram), printer(vram) {}
    
    uintptr_t memoryTranslate(uint32_t src) {
        if(src < 0x20000000) {
            return (uintptr_t) (((char *) memory.image()) + src);
        }
        else if(src < 0x40000000) {
            return (uintptr_t) (((char *) vram.image()) + src - 0x20000000);
        }
        else {
            return 0;
        }
    }
    
    void handle(uint32_t type, uint32_t arg0, uint32_t arg1, uint32_t arg2, uint32_t arg3, uint32_t arg4, uint32_t arg5) {
        if(type == 64) {
            printer.print_string((char *)memoryTranslate(arg1), arg2);
        }
    }
};
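In the RISC-V Linux syscall table, number 64 is write, with the buffer pointer in a1 and the byte count in a2. The sketch below shows that path in isolation. It is a hypothetical simplification: guest_memory stands in for the simulator's Memory object, the VRAM range is left out, and the bounds check is an addition that the handler above does not have.

```cpp
#include <cstdint>
#include <string>
#include <vector>

constexpr uint32_t SYS_WRITE = 64;  // write in the RISC-V Linux syscall table

// Hypothetical handler: returns the bytes a write ecall would print, or an
// empty string for other ecall types and for out-of-range buffers.
std::string handle_ecall(const std::vector<uint8_t> &guest_memory,
                         uint32_t a7, uint32_t a1, uint32_t a2) {
    if (a7 != SYS_WRITE) {
        return "";
    }
    // Reject buffers that do not fit inside the modelled memory.
    if (a1 > guest_memory.size() || a2 > guest_memory.size() - a1) {
        return "";
    }
    return std::string(reinterpret_cast<const char *>(guest_memory.data()) + a1, a2);
}
```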

