Assignment3: Single-cycle RISC-V CPU

contributed by < jeremy90307 >

Environment setup

OS: Ubuntu 22.04
sbt version: 1.9.4
JDK version: 1.8.0
Follow the instructions in Lab3: Construct a single-cycle RISC-V CPU with Chisel to set up the environment.

GTKWave Installation

Install

  1. Visit the GTKWave website to download gtkwave-3.3.117.tar.gz.
  2. Per the README file, if the installation fails, install the following packages and try again.
sudo apt-get install libjudy-dev
sudo apt-get install libbz2-dev
sudo apt-get install liblzma-dev
sudo apt-get install libgconf2-dev
sudo apt-get install libgtk2.0-dev
sudo apt-get install tcl-dev
sudo apt-get install tk-dev
sudo apt-get install gperf
sudo apt-get install gtk2-engines-pixbuf
  3. ./configure
  4. make
  5. make install

Hello World in Chisel

import chisel3._

// Blink an LED by toggling blkReg each time the counter reaches CNT_MAX.
class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg  = RegInit(0.U(32.W))
  val blkReg  = RegInit(0.U(1.W))
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}
  • This module has only one output signal, led.
  • led is an output port of unsigned type with a bit width of 1.
  • cntReg is a counter register with an initial value of 0 and a bit width of 32.
  • CNT_MAX is the maximum value of the counter.
  • blkReg holds the current LED state, with an initial value of 0 and a bit width of 1.
  • when(...) : When cntReg equals CNT_MAX, cntReg is reset to 0 and blkReg is inverted.
  • Finally, blkReg is connected to the output signal.
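
The toggle timing described above can be checked with a small software model. This is a minimal sketch in plain Scala, not Chisel: BlinkModel and ledAfter are hypothetical names introduced here, and the model only mirrors the counter and toggle logic of Hello.

```scala
object BlinkModel {
  // Returns the LED value after `cycles` rising clock edges,
  // starting from cntReg = 0 and blkReg = 0 as in RegInit.
  def ledAfter(cycles: Long, cntMax: Long): Int = {
    var cnt = 0L
    var blk = 0
    var i   = 0L
    while (i < cycles) {
      if (cnt == cntMax) {
        cnt = 0L      // when(cntReg === CNT_MAX): reset the counter
        blk = blk ^ 1 // and invert the 1-bit state
      } else {
        cnt += 1      // otherwise keep counting
      }
      i += 1
    }
    blk
  }

  def main(args: Array[String]): Unit = {
    val cntMax = 50000000L / 2 - 1
    println(cntMax)                       // prints 24999999
    println(ledAfter(cntMax + 1, cntMax)) // prints 1: first toggle after 25,000,000 edges
  }
}
```

The LED therefore changes state every CNT_MAX + 1 = 25,000,000 cycles. If the board clock is 50 MHz (an assumption suggested by the constant 50000000, not stated in the source), that is one full blink per second.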

Lab 3 : Single Cycle RISC-V CPU

Install the dependent packages

sudo apt install build-essential verilator gtkwave

Run all tests

sbt test

If the execution is successful, you will see the following message.

[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/jeremytsai/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/jeremytsai/ca2023-lab3/)
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] InstructionFetchTest:
[info] InstructionFetch of Single Cycle CPU
[info] - should fetch instruction
[info] FibonacciTest:
[info] Single Cycle CPU
[info] - should recursively calculate Fibonacci(10)
[info] ByteAccessTest:
[info] Single Cycle CPU
[info] - should store and load a single byte
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers
[info] HW2Test:
[info] Single Cycle CPU
[info] - should calculate the scale
[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] Run completed in 10 seconds, 35 milliseconds.
[info] Total number of tests run: 10
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 11 s, completed 2023/11/30 下午 06:04:27

Run a single Scala test file as a unit test

sbt "testOnly riscv.singlecycle.XXXTest"

Output a .vcd file and analyze it using GTKWave.

WRITE_VCD=1 sbt test

Complete MyCPU

Files to complete

  1. InstructionFetch.scala
  2. InstructionDecoder.scala
  3. Execute.scala
  4. CPU.scala

My repository, where you can see the code for my completed MyCPU, is available here.

image

Instruction Fetch

image

Test

  • If io.jump_flag_id is true, the PC jumps to the given address (0x1000).
  • If io.jump_flag_id is false, the PC advances to pc + 4.

GTKWave

image

image

From the diagram, it can be observed that when io.jump_flag_id = 1, the PC returns to 0x1000 on the next rising edge of the clock.

Instruction Decoder

Test

After inputting sw, lui, and add, the correct control signals are obtained.

  • If the opcode is of the load type, io.memory_read_enable is set to true.
  • If the opcode is of the store type, io.memory_write_enable is set to true.

GTKWave

image

input = 0x00a02223L.U -> sw x10, 4(x0)
Because the instruction is of the store type, io_memory_write_enable is set to 1.
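
The relation between the opcode and the two memory control signals can be sketched in plain Scala. This is an illustrative software model, not the decoder's Chisel code: DecodeControl and its method names are hypothetical, while the opcode values 0x03 (load) and 0x23 (store) are the RV32I major opcodes.

```scala
object DecodeControl {
  // RV32I major opcodes, taken from instruction bits 6..0
  val OpLoad  = 0x03 // lb, lh, lw, lbu, lhu
  val OpStore = 0x23 // sb, sh, sw

  def opcode(inst: Int): Int = inst & 0x7f

  // Software counterparts of io.memory_read_enable / io.memory_write_enable
  def memoryReadEnable(inst: Int): Boolean  = opcode(inst) == OpLoad
  def memoryWriteEnable(inst: Int): Boolean = opcode(inst) == OpStore

  def main(args: Array[String]): Unit = {
    val sw = 0x00a02223 // sw x10, 4(x0), the instruction used in the test
    println(Integer.toHexString(opcode(sw))) // prints 23
    println(memoryWriteEnable(sw))           // prints true
    println(memoryReadEnable(sw))            // prints false
  }
}
```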

Execution

image

Explanation of Scala syntax

MuxLookup description

io.if_jump_flag :=
  (opcode === Instructions.jal) ||
  (opcode === Instructions.jalr) ||
  (opcode === InstructionTypes.B) &&
  MuxLookup(
    funct3,
    false.B,
    IndexedSeq(
      InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data),
      // ...
    )
  )

funct3 : The value to be matched.
false.B : The default value when no corresponding match is found.
InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data), :
If funct3 is equal to InstructionsTypeB.beq, the result of the comparison io.reg1_data === io.reg2_data becomes the output of MuxLookup. If funct3 matches none of the listed entries, MuxLookup outputs the default value false.B.
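
The selection behaviour of MuxLookup can be imitated in plain Scala. The sketch below is a software analogue only: muxLookup and branchTaken are hypothetical helpers, and the entries other than beq follow the funct3 encodings of the RV32I specification rather than the elided part of the code above.

```scala
object BranchModel {
  // Return the value paired with the first matching key, else the default.
  def muxLookup[K, V](key: K, default: V, mapping: Seq[(K, V)]): V =
    mapping.collectFirst { case (k, v) if k == key => v }.getOrElse(default)

  // RV32I funct3 encodings of the B-type instructions
  val Beq  = 0
  val Bne  = 1
  val Blt  = 4
  val Bge  = 5
  val Bltu = 6
  val Bgeu = 7

  def branchTaken(funct3: Int, reg1: Int, reg2: Int): Boolean =
    muxLookup(
      funct3,
      false,
      Seq(
        Beq  -> (reg1 == reg2),
        Bne  -> (reg1 != reg2),
        Blt  -> (reg1 < reg2),
        Bge  -> (reg1 >= reg2),
        Bltu -> (Integer.compareUnsigned(reg1, reg2) < 0),
        Bgeu -> (Integer.compareUnsigned(reg1, reg2) >= 0)
      )
    )

  def main(args: Array[String]): Unit = {
    println(branchTaken(Beq, 5, 5)) // prints true
    println(branchTaken(2, 5, 5))   // prints false: funct3 = 2 is not listed, so the default is used
  }
}
```

As in the hardware version, every comparison in the list is evaluated and the key only selects which result is passed on.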

Test

  • Test add and check that the expected output is obtained.
  • Test beq, where the output address is determined by whether the two register values are equal.

GTKWave

  1. add
    image
  • ALU input value alu_io_op1 = io_reg1_data = 0x016A05E2
  • ALU input value alu_io_op2 = io_reg2_data = 0x0FBD8F12
  • ALU output value alu_io_result is the sum of alu_io_op1 and alu_io_op2.
  • According to the ALUControl.scala file, alu_io_func = 1 corresponds to ALUFunctions.add.
  • Therefore, the output value of the add operation matches the expectation.
  2. beq
    image
  • ALU input value alu_io_op1 = io_instruction_address = 0x00000002
  • ALU input value alu_io_op2 = io_immediate = 0x00000002
  • ALU output value alu_io_result is the sum of alu_io_op1 and alu_io_op2.
  • alu_io_func = 1 -> ALUFunctions.add
  • When the data of reg1 and reg2 are the same, io_if_jump_flag is set to 1.
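
The sums read from the waveforms can be cross-checked by hand or with a few lines of plain Scala. This is only a sketch of 32-bit addition; AluAddCheck and add32 are hypothetical names.

```scala
object AluAddCheck {
  // 32-bit addition with wrap-around, using Long to avoid sign issues.
  def add32(a: Long, b: Long): Long = (a + b) & 0xFFFFFFFFL

  def main(args: Array[String]): Unit = {
    // add: op1 = 0x016A05E2, op2 = 0x0FBD8F12
    println(f"0x${add32(0x016A05E2L, 0x0FBD8F12L)}%08X") // prints 0x112794F4
    // beq: instruction address 0x2 plus immediate 0x2
    println(f"0x${add32(0x2L, 0x2L)}%08X")               // prints 0x00000004
  }
}
```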

RegisterFileTest

Test

  • Write data to a register and check that it can be read back successfully.
  • Check that x0 always reads as 0, even after a write to it.

GTKWave

image

  • With io_write_enable set to 1, data is written to registers_2 and then read back successfully.

image

  • It can be observed that the value of x0(registers_0=0x00000000) remains unchanged, with no modifications made to x0.
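
The two properties checked above can be summarised in a small software model. This is a sketch in plain Scala under the assumption that writes to address 0 are simply ignored; RegisterFileModel is a hypothetical class and not the lab's RegisterFile module.

```scala
class RegisterFileModel {
  private val regs = Array.fill(32)(0)

  // A write only takes effect when enabled and when the target is not x0.
  def write(enable: Boolean, addr: Int, data: Int): Unit =
    if (enable && addr != 0) regs(addr) = data

  // x0 always reads as zero.
  def read(addr: Int): Int = if (addr == 0) 0 else regs(addr)
}

object RegisterFileDemo {
  def main(args: Array[String]): Unit = {
    val rf = new RegisterFileModel
    rf.write(enable = true, addr = 2, data = 0x12345678)
    rf.write(enable = true, addr = 0, data = 0x12345678)
    println(Integer.toHexString(rf.read(2))) // prints 12345678
    println(rf.read(0))                      // prints 0
  }
}
```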

CPU

Complete CPU.scala. The provided file is missing the data and signal connections to the EXE module.

Test

  • Calculate the Fibonacci sequence and obtain the expected answer.
  • Run quicksort and obtain the expected answer.
  • Test whether the CPU can correctly store and retrieve a single byte of data.

HW2 runs on MyCPU

Adapt HW2 for MyCPU

C code
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>

#define array_size 7
#define range 127 /* 2^(n-1)-1, n: quant bit */

float fp32_to_bf16(float x);
int *quant_bf16_to_int8(float x[]);
float bf16_findmax(float x[]);

typedef uint64_t ticks;
static inline ticks getticks(void)
{
    uint64_t result;
    uint32_t l, h, h2;
    asm volatile(
        "rdcycleh %0\n"
        "rdcycle %1\n"
        "rdcycleh %2\n"
        "sub %0, %0, %2\n"
        "seqz %0, %0\n"
        "sub %0, zero, %0\n"
        "and %1, %1, %0\n"
        : "=r"(h), "=r"(l), "=r"(h2));
    result = (((uint64_t) h) << 32) | ((uint64_t) l);
    return result;
}

int main()
{
    ticks t0 = getticks();
    float array[array_size] = {1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000};
    float array2[array_size] = {0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5};
    float array3[array_size] = {3.14159265, 0.12345678, 1.23456789, 0.00000123, 0.00000001, 0.99999999, 0.00000007};
    float array_bf16[array_size] = {};
    int *after_quant;

    /* data 1 */
    for (int i = 0; i < 7; i++) {
        array_bf16[i] = fp32_to_bf16(array[i]);
    }
    printf("data 1\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }

    /* data 2 */
    for (int i = 0; i < 7; i++) {
        array_bf16[i] = fp32_to_bf16(array2[i]);
    }
    printf("data 2\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }

    /* data 3 */
    for (int i = 0; i < 7; i++) {
        array_bf16[i] = fp32_to_bf16(array3[i]);
    }
    printf("data 3\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }

    ticks t1 = getticks();
    printf("elapsed cycle: %" PRIu64 "\n", t1 - t0);
    system("pause");
    return 0;
}

float fp32_to_bf16(float x)
{
    float y = x;
    int *p = (int *) &y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    if (exp == 0 && man == 0) /* zero */
        return x;
    if (exp == 0x7F800000 /* Fill this! */) /* infinity or NaN */
        return x;

    /* Normalized number */
    /* round to nearest */
    float r = x;
    int *pr = (int *) &r;
    *pr &= 0xFF800000; /* r has the same exp as x */
    r /= 0x100 /* Fill this! */;
    y = x + r;

    *p &= 0xFFFF0000;

    return y;
}

int *quant_bf16_to_int8(float x[array_size])
{
    static int after_quant[array_size] = {};
    float max = fabs(x[0]);
    for (int i = 1; i < array_size; i++) {
        if (fabs(x[i]) > max) {
            max = fabs(x[i]);
        }
    }
    printf("maximum number is %.12f\n", max);
    float scale = range / max;
    for (int i = 0; i < array_size; i++) {
        after_quant[i] = (x[i] * scale);
    }
    return after_quant;
}

Process

  1. Place the assembly code for HW2 (hw2.S) into the ca2023-lab3/csrc directory.
  2. Modify hw2.S to remove ecall and add the _start: label.
  3. Modify the Makefile and add hw2.asmbin under BINS.
  4. Run $ make update in the directory to generate hw2.asmbin.
  5. In CPUTest.scala, add a test class for hw2.asmbin:
class HW2Test extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "calculate the scale" in {
    test(new TestTopModule("hw2.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 50) {
        c.clock.step(1000)
        c.io.mem_debug_read_address.poke((i * 4).U)
      }

      c.io.regs_debug_read_address.poke(16.U) //a6
      c.clock.step()
      c.io.regs_debug_read_data.expect(0x41d00000.U)
    }
  }
}

Test the scale value for the first set of data in hw2.

$ sbt "testOnly riscv.singlecycle.HW2Test"

Output

[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/jeremytsai/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/jeremytsai/ca2023-lab3/)
[info] compiling 1 Scala source to /home/jeremytsai/ca2023-lab3/target/scala-2.13/test-classes ...
[info] HW2Test:
[info] Single Cycle CPU
[info] - should calculate the scale
[info] Run completed in 4 seconds, 832 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 9 s, completed 2023/11/30 下午 05:12:43
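
The value 0x41d00000 expected by HW2Test is a raw 32-bit pattern. Assuming register a6 holds the scale as an IEEE 754 single-precision number (an assumption, since the test only compares bits), the pattern can be converted with the standard Java API available from Scala. ScaleBits and decode are hypothetical names.

```scala
object ScaleBits {
  // Reinterpret a 32-bit pattern as an IEEE 754 single-precision value.
  def decode(bits: Int): Float = java.lang.Float.intBitsToFloat(bits)

  def main(args: Array[String]): Unit = {
    println(decode(0x41d00000)) // prints 26.0
  }
}
```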

Verilator

Generate Verilog files

$ make verilator

Parameters and usage:

  • -memory : Specify the size of the simulation memory in words (4 bytes each).
    Example: -memory 4096
  • -instruction : Specify the RISC-V program used to initialize the simulation memory.
    Example: -instruction src/main/resources/hello.asmbin
  • -signature : Specify the memory range and destination file to output after simulation.
    Example: -signature 0x100 0x200 mem.txt
  • -halt : Specify the halt identifier address; writing 0xBABECAFE to this memory address stops the simulation.
    Example: -halt 0x8000
  • -vcd : Specify the filename for saving the simulation waveform; if this parameter is omitted, no waveform file is generated.
    Example: -vcd dump.vcd
  • -time : Specify the maximum simulation time; note that time is twice the number of cycles.
    Example: -time 1000

Load the hw2.asmbin file, simulate for 2000 cycles, and save the simulation waveform to the dump01.vcd file.

./run-verilator.sh -instruction src/main/resources/hw2.asmbin -time 4000 -vcd dump01.vcd

Output

-time 4000
-memory 1048576
-instruction src/main/resources/hw2.asmbin
[-------------------->] 100%

Open dump01.vcd in GTKWave to check its waveform.

I-type : addi x2, x2, -4

Hexadecimal = 0xffc10113
Binary = 111111111100 00010 000 00010 0010011 (imm[11:0] | rs1 | funct3 | rd | opcode)
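
The fields of this encoding can be extracted with shifts and masks. The following plain-Scala sketch follows the RV32I I-type layout; ITypeDecode and its method names are hypothetical and not part of the lab code.

```scala
object ITypeDecode {
  def opcode(inst: Int): Int = inst & 0x7f          // inst[6:0]
  def rd(inst: Int): Int     = (inst >>> 7) & 0x1f  // inst[11:7]
  def funct3(inst: Int): Int = (inst >>> 12) & 0x7  // inst[14:12]
  def rs1(inst: Int): Int    = (inst >>> 15) & 0x1f // inst[19:15]
  def immI(inst: Int): Int   = inst >> 20           // inst[31:20]; arithmetic shift sign-extends

  def main(args: Array[String]): Unit = {
    val addi = 0xffc10113 // addi x2, x2, -4
    println(s"rd=x${rd(addi)} rs1=x${rs1(addi)} imm=${immI(addi)}") // prints rd=x2 rs1=x2 imm=-4
    println(Integer.toHexString(immI(addi)))                        // prints fffffffc
  }
}
```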

IF

image

  • io_instruction=0xFFC10113 -> addi x2, x2, -4
  • io_instruction_read_data=io_instruction
  • Since io_jump_flag_id=0, the next pc is pc+4.
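
The next-PC choice described in the last bullet can be written as a one-line software model. This is a sketch in plain Scala that ignores io.instruction_valid; NextPc and nextPc are hypothetical names.

```scala
object NextPc {
  // Take the jump address when the jump flag is set, otherwise fall through to pc + 4.
  def nextPc(pc: Int, jumpFlag: Boolean, jumpAddress: Int): Int =
    if (jumpFlag) jumpAddress else pc + 4

  def main(args: Array[String]): Unit = {
    println(Integer.toHexString(nextPc(0x1000, jumpFlag = false, jumpAddress = 0)))     // prints 1004
    println(Integer.toHexString(nextPc(0x1004, jumpFlag = true, jumpAddress = 0x1000))) // prints 1000
  }
}
```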

ID

image

  • io_ex_aluop1_source=0 selects io_reg1_data as the first ALU operand.
  • io_ex_aluop2_source=1 selects io_ex_immediate as the second ALU operand (io_ex_immediate=0xFFFFFFFC = -4).
  • Since io_memory_read_enable=io_memory_write_enable=0, the memory is neither read nor written.
    Load instructions : io_memory_read_enable = 1
    Store instructions : io_memory_write_enable = 1
  • io_regs_reg1_read_address=02 (x2=sp)

EXE

image

  • alu_ctrl_io_alu_func=1 selects addition (ALUFunctions.add), which is the operation addi requires.
  • Because io_aluop1_source=0, alu_io_op1 is equal to io_reg1_data, which is 0.
  • Because io_aluop2_source=1, alu_io_op2 is equal to io_immediate, which is 0xFFFFFFFC.
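
The operand selection above can be sketched as a plain-Scala function. AluOperands and aluResult are hypothetical names. The mapping of aluop1 source 1 to the instruction address follows the later jal walkthrough, while the mapping of aluop2 source 0 to reg2 is an assumption not stated in this section.

```scala
object AluOperands {
  // Select the two ALU operands from the source flags, then add them (32-bit wrap-around).
  def aluResult(aluop1Source: Int, aluop2Source: Int,
                reg1: Int, reg2: Int, pc: Int, imm: Int): Int = {
    val op1 = if (aluop1Source == 0) reg1 else pc  // 0: register, 1: instruction address
    val op2 = if (aluop2Source == 0) reg2 else imm // 0: register (assumed), 1: immediate
    op1 + op2
  }

  def main(args: Array[String]): Unit = {
    // addi x2, x2, -4 with x2 = 0; the pc value is arbitrary because op1 comes from reg1
    val result = aluResult(0, 1, reg1 = 0, reg2 = 0, pc = 0x1000, imm = -4)
    println(Integer.toHexString(result)) // prints fffffffc
  }
}
```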

MEM

image

  • From the figure below, it can be observed that io_alu_result is equal to io_memory_bundle_address, both being 0xFFFFFFFC.
    image
  • At this stage, no read/write operations are performed on the memory.

WB

image

  • Write io_regs_write_data=0xFFFFFFFC to register x2.

J-type : jal x0, 68

Hexadecimal = 0x0440006f
Binary = 00000100010000000000 00000 1101111
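
The 20-bit immediate field of a J-type instruction is stored in scrambled order, so the offset 68 is not directly visible in the binary form. The plain-Scala sketch below reassembles it following the RV32I J-type layout; JTypeDecode and immJ are hypothetical names.

```scala
object JTypeDecode {
  def immJ(inst: Int): Int = {
    val imm20    = (inst >>> 31) & 0x1   // inst[31]    -> imm[20]
    val imm10_1  = (inst >>> 21) & 0x3ff // inst[30:21] -> imm[10:1]
    val imm11    = (inst >>> 20) & 0x1   // inst[20]    -> imm[11]
    val imm19_12 = (inst >>> 12) & 0xff  // inst[19:12] -> imm[19:12]
    val raw = (imm20 << 20) | (imm19_12 << 12) | (imm11 << 11) | (imm10_1 << 1)
    (raw << 11) >> 11 // sign-extend from bit 20
  }

  def main(args: Array[String]): Unit = {
    println(immJ(0x0440006f)) // prints 68 (jal x0, 68)
  }
}
```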

IF

image

  • io_instruction[31:0]=0440006F -> jal x0, 68
  • io_instruction[31:0]=io_instruction_read_data[31:0]
  • Because io_jump_flag_id=1, the program counter (pc) loads io_jump_address_id[31:0]=00001098, so the next pc is 0x1098.

ID

image

  • io_ex_aluop1_source=1 selects io.instruction_address as the first ALU operand.
  • io_ex_aluop2_source=1 selects io_ex_immediate[31:0]=00000044 as the second ALU operand.
  • Because io_memory_read_enable=0 and io_memory_write_enable=0, no read or write operations are performed on the memory.
  • io_regs_reg1_read_address[4:0]=00(x0) is equivalent to rd[4:0]=00(x0).

EXE

image

  • Because io_aluop1_source=1, alu_io_op1 is equal to io_instruction_address=00001054 (pc=0x1054).
  • Because io_aluop2_source=1, alu_io_op2 is equal to io_immediate, which is 0x00000044.
  • Because io_if_jump_flag=1, the program counter (pc) jumps to io_if_jump_address=00001098 (the next pc is 0x1098).
  • io_if_jump_address is the sum of io_immediate and io_instruction_address: 0x1054 + 0x44 = 0x1098.

MEM

image

  • At this stage, no read/write operations are performed on the memory.
  • io_alu_result is equal to io_memory_bundle_address.

WB

image

  • At this stage, no changes are made to the registers, because the destination register is x0.

S-type : sw x15, 0(x12)

Hexadecimal = 0x00f62023
Binary = 0000000 01111 01100 010 00000 0100011 (imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode)
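
In the S-type layout the immediate is split into two pieces around the register fields. The following plain-Scala sketch extracts the fields according to the RV32I specification; STypeDecode and its method names are hypothetical.

```scala
object STypeDecode {
  def funct3(inst: Int): Int = (inst >>> 12) & 0x7  // inst[14:12]
  def rs1(inst: Int): Int    = (inst >>> 15) & 0x1f // inst[19:15], base address register
  def rs2(inst: Int): Int    = (inst >>> 20) & 0x1f // inst[24:20], register holding the store data

  def immS(inst: Int): Int = {
    val hi = inst >> 25          // inst[31:25] -> imm[11:5], sign-extended
    val lo = (inst >>> 7) & 0x1f // inst[11:7]  -> imm[4:0]
    (hi << 5) | lo
  }

  def main(args: Array[String]): Unit = {
    val sw = 0x00f62023 // sw x15, 0(x12)
    println(s"rs1=x${rs1(sw)} rs2=x${rs2(sw)} imm=${immS(sw)}") // prints rs1=x12 rs2=x15 imm=0
  }
}
```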

IF

image

  • io_instruction=0x00F62023 -> sw x15, 0(x12)
  • io_instruction_read_data=io_instruction
  • Since io_jump_flag_id=0, the next pc is pc+4.

ID

image

  • io_ex_aluop1_source=0 selects the base memory address, which is the value of x12 (io_regs_reg1_read_address=0x0C).
  • io_ex_aluop2_source=1 selects the offset of the memory address, which is 0 (io_ex_immediate=0x00000000).
  • Because io_memory_write_enable=1, data is written into memory.

EXE

image

  • The ALU output, alu_io_result=0x00001334, is equal to the sum of the ALU operands, where alu_io_op1=0x00001334 and alu_io_op2=0x00000000.

MEM

image

  • The memory address input, io_memory_bundle_address=0x00001334, is equal to the ALU output, io_alu_result=0x00001334.
  • The memory data input, io_memory_bundle_write_data, is 0x00000000, which is equal to the value of io_reg2_data.

WB

image

  • S-type does not assign new values to registers.

Reference