# Assignment3: Single-cycle RISC-V CPU
contributed by < [`jeremy90307`](https://github.com/jeremy90307) >

## Environment setup
- OS: Ubuntu 22.04
- sbt version: 1.9.4
- JDK version: 1.8.0

Follow the instructions in [Lab3: Construct a single-cycle RISC-V CPU with Chisel](https://hackmd.io/@sysprog/r1mlr3I7p) to set up the environment.

### GTKWave Installation
**Install**
1. Download `gtkwave-3.3.117.tar.gz` from the [GTKWave](https://gtkwave.sourceforge.net/) website.
2. Follow the instructions in the `README` file. If the installation fails, install the following packages first.
```
sudo apt-get install libjudy-dev
sudo apt-get install libbz2-dev
sudo apt-get install liblzma-dev
sudo apt-get install libgconf2-dev
sudo apt-get install libgtk2.0-dev
sudo apt-get install tcl-dev
sudo apt-get install tk-dev
sudo apt-get install gperf
sudo apt-get install gtk2-engines-pixbuf
```
3. `./configure`
4. `make`
5. `make install`

## Hello World in Chisel
```scala
import chisel3._

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  // The counter wraps after 25,000,000 cycles (0 to CNT_MAX inclusive)
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg  = RegInit(0.U(32.W)) // 32-bit counter, reset value 0
  val blkReg  = RegInit(0.U(1.W))  // 1-bit LED state, reset value 0

  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg // toggle the LED state
  }
  io.led := blkReg
}
```
- This module has only one output signal.
- `led` is an output port of unsigned type with a bit width of 1.
- `cntReg` is a counter with an initial value of 0 and a bit width of 32.
- `CNT_MAX` is the maximum value of the counter.
- `blkReg` holds the current state, with an initial value of 0 and a bit width of 1.
- `when(...)`: when `cntReg` equals `CNT_MAX`, reset `cntReg` and toggle `blkReg`.
- Finally, `blkReg` is connected to the output signal.

## Lab 3 : Single Cycle RISC-V CPU
Install the dependent packages
```
sudo apt install build-essential verilator gtkwave
```
Run all tests
```
sbt test
```
If the execution is successful, you will see the following message.
```
[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/jeremytsai/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/jeremytsai/ca2023-lab3/)
[info] InstructionDecoderTest:
[info] InstructionDecoder of Single Cycle CPU
[info] - should produce correct control signal
[info] InstructionFetchTest:
[info] InstructionFetch of Single Cycle CPU
[info] - should fetch instruction
[info] FibonacciTest:
[info] Single Cycle CPU
[info] - should recursively calculate Fibonacci(10)
[info] ByteAccessTest:
[info] Single Cycle CPU
[info] - should store and load a single byte
[info] QuicksortTest:
[info] Single Cycle CPU
[info] - should perform a quicksort on 10 numbers
[info] HW2Test:
[info] Single Cycle CPU
[info] - should calculate the scale
[info] ExecuteTest:
[info] Execution of Single Cycle CPU
[info] - should execute correctly
[info] RegisterFileTest:
[info] Register File of Single Cycle CPU
[info] - should read the written content
[info] - should x0 always be zero
[info] - should read the writing content
[info] Run completed in 10 seconds, 35 milliseconds.
[info] Total number of tests run: 10
[info] Suites: completed 8, aborted 0
[info] Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 11 s, completed 2023/11/30 下午 06:04:27
```
Run a single test class
```
sbt "testOnly riscv.singlecycle.XXXTest"
```
Output a `.vcd` file and analyze it with GTKWave.
```
WRITE_VCD=1 sbt test
```

### Resolve MyCPU
**Pending resolution**
1. InstructionFetch.scala
2. InstructionDecoder.scala
3. Execute.scala
4. CPU.scala

The code for my completed MyCPU is available in my [repository](https://github.com/jeremy90307/ca2023-lab3.git).
![image](https://hackmd.io/_uploads/Hy9FugIST.png)

### Instruction Fetch
![image](https://hackmd.io/_uploads/H1xhueLBT.png)

#### Test
- The test holds `io.instruction_valid` true. If `io.jump_flag_id` is true, the PC jumps to `io.jump_address_id` (0x1000 in the test).
- Otherwise, the PC advances to PC + 4.

#### GTKWave
![image](https://hackmd.io/_uploads/ryJJYE7Ha.png)
![image](https://hackmd.io/_uploads/S1m8KEXHa.png)

From the waveform, when the jump is taken (`io_jump_flag_id = 1` while `io_instruction_valid = 1`), the PC returns to 0x1000 on the next rising edge of the clock.

### Instruction Decoder
#### Test
After feeding in `sw`, `lui`, and `add`, the correct control signals are obtained.
- If the opcode is of the load type, `io.memory_read_enable` is set to true.
- If the opcode is of the store type, `io.memory_write_enable` is set to true.

#### GTKWave
![image](https://hackmd.io/_uploads/H1E_WPXr6.png)

input=0x00a02223L.U --> `sw x10, 4(x0)`

Because the instruction is of the store type, `io_memory_write_enable` is set to 1.

### Execution
![image](https://hackmd.io/_uploads/SkFzJYXrp.png)

#### Explanation of Scala syntax
`MuxLookup` description
```scala
io.if_jump_flag := (opcode === Instructions.jal) ||
  (opcode === Instructions.jalr) ||
  (opcode === InstructionTypes.B) && MuxLookup(
    funct3,
    false.B,
    IndexedSeq(
      InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data),
      // ...
    )
  )
```
- `funct3`: the value to be matched.
- `false.B`: the default value when no matching key is found.
- `InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data),`: if `funct3` equals `InstructionsTypeB.beq`, the result of `io.reg1_data === io.reg2_data` becomes the output of `MuxLookup`. If `funct3` matches none of the listed keys, `MuxLookup` outputs the default value `false.B`.

#### Test
- Test `add` and obtain the expected output.
- Test `beq` and check that the jump flag follows the comparison of the two register values.

#### GTKWave
1.
`add`
![image](https://hackmd.io/_uploads/SyLun8EH6.png)
- ALU input value `alu_io_op1` = `io_reg1_data` = 0x016A05E2
- ALU input value `alu_io_op2` = `io_reg2_data` = 0x0FBD8F12
- ALU output value `alu_io_result` is the sum of `alu_io_op1` and `alu_io_op2`.
- According to the ALUControl.scala file, `alu_io_func` = 1 corresponds to ALUFunctions.add.
- Therefore, the output value of the `add` operation matches the expectation.

2. `beq`
![image](https://hackmd.io/_uploads/SyNYYtmBT.png)
- ALU input value `alu_io_op1` = `io_instruction_address` = 0x00000002
- ALU input value `alu_io_op2` = `io_immediate` = 0x00000002
- ALU output value `alu_io_result` is the sum of `alu_io_op1` and `alu_io_op2`, which is the branch target address.
- `alu_io_func=1` -> ALUFunctions.add
- When the data of `reg1` and `reg2` are the same, `io_if_jump_flag` is set to 1.

### RegisterFileTest
#### Test
- Write data to a register and check that it can be read back.
- Reading `x0` always returns 0, even after a write to it.

#### GTKWave
![image](https://hackmd.io/_uploads/ByIgYPVrT.png)
- `io_write_enable` is set to 1, the data is written to `registers_2`, and it is read back successfully.

![image](https://hackmd.io/_uploads/Syv6KvVHa.png)
- The value of x0 (`registers_0=0x00000000`) remains unchanged; no modification is made to x0.

### CPU
Complete `CPU.scala`. The provided file is missing the data and control signal connections to the EXE module.

#### Test
- Calculate the Fibonacci sequence and obtain the expected answer.
- Run Quicksort and obtain the expected answer.
- Test whether the CPU can correctly store and load a single byte of data.
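
As a software cross-check of the Execute and RegisterFile waveforms above, the following is a minimal behavioural sketch in C. It is not taken from the lab sources: the names `regfile`, `regfile_write`, `regfile_read`, `branch_result`, and `beq_resolve` are hypothetical, and the model only assumes what the bullets state (x0 always reads as 0, and for `beq` the ALU adds the instruction address and the immediate while the jump flag follows `reg1 == reg2`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a 32-entry register file. */
typedef struct {
    uint32_t regs[32];
} regfile;

/* Write only when enabled; writes to x0 are dropped in this model. */
static void regfile_write(regfile *rf, bool write_enable,
                          uint32_t addr, uint32_t data)
{
    assert(addr < 32);
    if (write_enable && addr != 0) {
        rf->regs[addr] = data;
    }
}

/* x0 always reads as zero. */
static uint32_t regfile_read(const regfile *rf, uint32_t addr)
{
    assert(addr < 32);
    return addr == 0 ? 0u : rf->regs[addr];
}

/* Result of resolving a beq in the execute stage (hypothetical type). */
typedef struct {
    bool     jump;    /* models io_if_jump_flag    */
    uint32_t target;  /* models io_if_jump_address */
} branch_result;

/* beq: target = pc + imm (ALU add), taken when both registers match. */
static branch_result beq_resolve(uint32_t pc, uint32_t imm,
                                 uint32_t reg1, uint32_t reg2)
{
    branch_result r;
    r.jump   = (reg1 == reg2);
    r.target = pc + imm; /* unsigned add wraps modulo 2^32 */
    return r;
}
```

With the values from the `beq` waveform (`io_instruction_address` = 2, `io_immediate` = 2), `beq_resolve(2, 2, a, a)` reports a taken branch to address 4 for any equal pair of register values.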
# HW2 runs on MyCPU
## Adapt [HW2](https://github.com/jeremy90307/Computer_Architecture/tree/main/HW2) for MyCPU
:::spoiler **c code**
```c=
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <inttypes.h>
#define array_size 7
#define range 127 /*2^(n-1)-1, n: quant bit*/
float fp32_to_bf16(float x);
int* quant_bf16_to_int8(float x[]);
float bf16_findmax(float x[]);
typedef uint64_t ticks;
static inline ticks getticks(void)
{
    uint64_t result;
    uint32_t l, h, h2;
    asm volatile(
        "rdcycleh %0\n"
        "rdcycle %1\n"
        "rdcycleh %2\n"
        "sub %0, %0, %2\n"
        "seqz %0, %0\n"
        "sub %0, zero, %0\n"
        "and %1, %1, %0\n"
        : "=r"(h), "=r"(l), "=r"(h2));
    result = (((uint64_t) h) << 32) | ((uint64_t) l);
    return result;
}
int main()
{
    ticks t0 = getticks();
    float array[array_size] = {1.200000, 1.203125, 2.310000, 2.312500, 3.460000, 3.4531255, 5.630000};
    float array2[array_size] = { 0.1, 0.2, 1.2, 3, 2.1, -4.2, 3.5};
    float array3[array_size] = { 3.14159265, 0.12345678 , 1.23456789 , 0.00000123, 0.00000001, 0.99999999 , 0.00000007 };
    float array_bf16[array_size] = {};
    int *after_quant;
    /*data 1*/
    for (int i = 0; i < 7; i++) {
        array_bf16[i] = fp32_to_bf16(array[i]);
    }
    printf("data 1\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }
    /*data 2*/
    for (int i = 0; i < 7; i++) {
        array_bf16[i] = fp32_to_bf16(array2[i]);
    }
    printf("data 2\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant = quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }
    /*data 3*/
    for (int i = 0; i < 7; i++) {
        array_bf16[i] = fp32_to_bf16(array3[i]);
    }
    printf("data 3\nbfloat16 number is \n");
    for (int i = 0; i < array_size; i++) {
        printf("%.12f\n", array_bf16[i]);
    }
    after_quant
= quant_bf16_to_int8(array_bf16);
    printf("after quantization \n");
    for (int i = 0; i < array_size; i++) {
        printf("%d\n", after_quant[i]);
    }
    ticks t1 = getticks();
    printf("elapsed cycle: %" PRIu64 "\n", t1 - t0);
    system("pause");
    return 0;
}
float fp32_to_bf16(float x)
{
    float y = x;
    int *p = (int *)&y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    if (exp == 0 && man == 0) /* zero */
        return x;
    if (exp == 0x7F800000 /* Fill this! */) /* infinity or NaN */
        return x;
    /* Normalized number */
    /* round to nearest */
    float r = x;
    int *pr = (int *)&r;
    *pr &= 0xFF800000; /* r has the same exp as x */
    r /= 0x100 /* Fill this! */;
    y = x + r;
    *p &= 0xFFFF0000;
    return y;
}
int* quant_bf16_to_int8(float x[array_size])
{
    static int after_quant[array_size] = {};
    float max = fabs(x[0]);
    for (int i = 1; i < array_size; i++) {
        if (fabs(x[i]) > max) {
            max = fabs(x[i]);
        }
    }
    printf("maximum number is %.12f\n", max);
    float scale = range / max;
    for (int i = 0; i < array_size; i++) {
        after_quant[i] = (x[i] * scale);
    }
    return after_quant;
}
```
:::

### Process
1. Place the assembly code for HW2 (`hw2.S`) into the `ca2023-lab3/csrc` directory.
2. Modify `hw2.S` to remove `ecall` and add `_start:`.
3. Modify the `Makefile` and add `hw2.asmbin` under `BINS`.
4. Run `make update` in that directory to generate `hw2.asmbin`.
5. In `CPUTest.scala`, add a test class for `hw2.asmbin`.
```scala
class HW2Test extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "calculate the scale" in {
    test(new TestTopModule("hw2.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      // Let the program run: 50 rounds of 1000 clock cycles each
      for (i <- 1 to 50) {
        c.clock.step(1000)
        c.io.mem_debug_read_address.poke((i * 4).U)
      }
      // Read register x16 (a6) through the debug port and check its value
      c.io.regs_debug_read_address.poke(16.U) //a6
      c.clock.step()
      c.io.regs_debug_read_data.expect(0x41d00000.U)
    }
  }
}
```
This test checks the scale value for the first set of data in HW2.
```
$ sbt "testOnly riscv.singlecycle.HW2Test"
```
Output
```
[info] welcome to sbt 1.9.7 (Temurin Java 1.8.0_392)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/jeremytsai/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/jeremytsai/ca2023-lab3/)
[info] compiling 1 Scala source to /home/jeremytsai/ca2023-lab3/target/scala-2.13/test-classes ...
[info] HW2Test:
[info] Single Cycle CPU
[info] - should calculate the scale
[info] Run completed in 4 seconds, 832 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 9 s, completed 2023/11/30 下午 05:12:43
```

## Verilator generate Verilog files
```
$ make verilator
```

| Parameter | Usage |
|:----------|:------|
| `-memory` | Specify the size of the simulation memory in words (4 bytes each).<br> Example: `-memory 4096` |
| `-instruction` | Specify the RISC-V program used to initialize the simulation memory.<br>Example: `-instruction src/main/resources/hello.asmbin` |
| `-signature` | Specify the memory range and destination file to output after simulation.<br>Example: `-signature 0x100 0x200 mem.txt` |
| `-halt` | Specify the halt identifier address; writing `0xBABECAFE` to this memory address stops the simulation.<br>Example: `-halt 0x8000` |
| `-vcd` | Specify the filename for saving the simulation waveform during the process; not specifying this parameter will not generate a waveform file.<br>Example: `-vcd dump.vcd` |
| `-time` | Specify the maximum simulation time; note that time is **twice** the number of cycles.<br>Example: `-time 1000` |

Load the `hw2.asmbin` file, simulate for 2000 cycles, and save the simulation waveform to the `dump01.vcd` file.
```
./run-verilator.sh -instruction src/main/resources/hw2.asmbin -time 4000 -vcd dump01.vcd
```
Output
```
-time 4000
-memory 1048576
-instruction src/main/resources/hw2.asmbin
[-------------------->] 100%
```

## Run GTKWave
Open `dump01.vcd` in GTKWave to check its waveform.

### I-type : `addi x2, x2, -4`
Hexadecimal = 0xffc10113
Binary = 111111111100 00010 000 00010 0010011

#### IF
![image](https://hackmd.io/_uploads/BJ-H3GvH6.png)
- `io_instruction=0xFFC10113 -> addi x2, x2, -4`
- `io_instruction_read_data=io_instruction`
- Since `io_jump_flag_id=0`, the next pc is pc+4.

#### ID
![image](https://hackmd.io/_uploads/BJ9DTMvra.png)
- `io_ex_aluop1_source=0` selects the value of `io_reg1_data`.
- `io_ex_aluop2_source=1` selects the value of `io_ex_immediate` (`io_ex_immediate=0xFFFFFFFC = -4`).
- Since `io_memory_read_enable=io_memory_write_enable=0`, the memory is neither read nor written.
<font color="#f00">Load (L-type) : `io_memory_read_enable = 1`
Store (S-type) : `io_memory_write_enable = 1`</font>
- `io_regs_reg1_read_address=02` (x2=sp)

#### EXE
![image](https://hackmd.io/_uploads/BypW0zDBa.png)
- `alu_ctrl_io_alu_func=1` selects ALUFunctions.add, which is the operation `addi` uses.
- Because `io_aluop1_source=0`, `alu_io_op1` is equal to `io_reg1_data`, which is 0.
- Because `io_aluop2_source=1`, `alu_io_op2` is equal to `io_immediate`, which is 0xFFFFFFFC.

#### MEM
![image](https://hackmd.io/_uploads/S1siMQPH6.png)
- From the figure below, it can be observed that `io_alu_result` is equal to `io_memory_bundle_address`, both being 0xFFFFFFFC.
![image](https://hackmd.io/_uploads/S1YBIVDrT.png =40%x)
- At this stage, no read/write operations are performed on the memory.

#### WB
![image](https://hackmd.io/_uploads/SJZTfXDrT.png)
- Write `io_regs_write_data=0xFFFFFFFC` to register `x2`.
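
The `addi x2, x2, -4` encoding walked through above can be cross-checked field by field. The following is a minimal C sketch under stated assumptions: `itype_fields` and `decode_itype` are hypothetical names that do not appear in the lab sources, and the field positions follow the standard RV32I I-type layout.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an RV32I I-type instruction (hypothetical helper type). */
typedef struct {
    uint32_t opcode;  /* bits  6:0  */
    uint32_t rd;      /* bits 11:7  */
    uint32_t funct3;  /* bits 14:12 */
    uint32_t rs1;     /* bits 19:15 */
    int32_t  imm;     /* bits 31:20, sign-extended to 32 bits */
} itype_fields;

static itype_fields decode_itype(uint32_t inst)
{
    itype_fields f;
    uint32_t imm12;

    /* Uncompressed 32-bit encodings have their two lowest bits set. */
    assert((inst & 0x3u) == 0x3u);

    f.opcode = inst & 0x7fu;
    f.rd     = (inst >> 7) & 0x1fu;
    f.funct3 = (inst >> 12) & 0x7u;
    f.rs1    = (inst >> 15) & 0x1fu;

    /* Sign-extend the 12-bit immediate: flip the sign bit, then subtract it. */
    imm12 = inst >> 20;
    f.imm = (int32_t)(imm12 ^ 0x800u) - 0x800;
    return f;
}
```

For `0xffc10113` this yields opcode `0x13`, `rd` = 2, `rs1` = 2, `funct3` = 0, and an immediate of -4, whose 32-bit pattern `0xFFFFFFFC` matches the `io_ex_immediate` value read from the waveform.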
### J-type : `jal x0, 68`
Hexadecimal = 0x0440006f
Binary = 00000100010000000000 00000 1101111

#### IF
![image](https://hackmd.io/_uploads/B1kVaMt_p.png)
- `io_instruction[31:0]=0440006F -> jal x0, 68`
- `io_instruction[31:0]=io_instruction_read_data[31:0]`
- Because `io_jump_flag_id=1`, the program counter (pc) is loaded with `io_jump_address_id[31:0]=00001098`, so the next pc is 0x1098.

#### ID
![image](https://hackmd.io/_uploads/S1hDLXKd6.png)
- `io_ex_aluop1_source=1` selects `io_instruction_address` as ALU operand 1.
- `io_ex_aluop2_source=1` selects the value of `io_ex_immediate[31:0]=00000044` as ALU operand 2.
- Because `io_memory_read_enable=0` and `io_memory_write_enable=0`, no read or write operations are performed on the memory.
- `io_regs_reg1_read_address[4:0]=00` (x0), and the destination `rd[4:0]=00` is also x0.

#### EXE
![image](https://hackmd.io/_uploads/rk_EdmK_a.png)
- Because `io_aluop1_source=1`, `alu_io_op1` is equal to `io_instruction_address=00001054`, that is, pc=0x1054.
- Because `io_aluop2_source=1`, `alu_io_op2` is equal to `io_immediate`, which is 0x00000044.
- Because `io_if_jump_flag=1`, the program counter (pc) jumps to `io_if_jump_address=00001098` (the next pc is 0x1098).
- `io_if_jump_address` is the sum of `io_immediate` and `io_instruction_address`: 0x1054 + 0x44 = 0x1098.

#### MEM
![image](https://hackmd.io/_uploads/S1WGG4K_6.png)
- At this stage, no read/write operations are performed on the memory.
- `io_alu_result` is equal to `io_memory_bundle_address`.

#### WB
![image](https://hackmd.io/_uploads/H15OHVYd6.png)
- At this stage, no changes are made to the registers, because the destination register is x0.

### S-type : `sw x15, 0(x12)`
Hexadecimal = 0x00f62023
Binary = 0000000 01111 01100 010 00000 0100011

#### IF
![image](https://hackmd.io/_uploads/rkQaZrtdT.png)
- `io_instruction=0x00F62023 -> sw x15, 0(x12)`
- `io_instruction_read_data=io_instruction`
- Since `io_jump_flag_id=0`, the next pc is pc+4.
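
As a cross-check of the `sw x15, 0(x12)` encoding `0x00f62023` given above, the following minimal C sketch splits an S-type instruction into its fields. The names `stype_fields` and `decode_stype` are hypothetical and not part of the lab sources; the field positions assume the standard RV32I S-type layout, in which the 12-bit immediate is split across bits 31:25 and bits 11:7.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an RV32I S-type instruction (hypothetical helper type). */
typedef struct {
    uint32_t opcode;  /* bits  6:0  */
    uint32_t funct3;  /* bits 14:12 */
    uint32_t rs1;     /* bits 19:15, base address register */
    uint32_t rs2;     /* bits 24:20, register holding the data to store */
    int32_t  imm;     /* {bits 31:25, bits 11:7}, sign-extended offset */
} stype_fields;

static stype_fields decode_stype(uint32_t inst)
{
    stype_fields f;
    uint32_t imm12;

    /* Uncompressed 32-bit encodings have their two lowest bits set. */
    assert((inst & 0x3u) == 0x3u);

    f.opcode = inst & 0x7fu;
    f.funct3 = (inst >> 12) & 0x7u;
    f.rs1    = (inst >> 15) & 0x1fu;
    f.rs2    = (inst >> 20) & 0x1fu;

    /* Reassemble the split immediate, then sign-extend from 12 bits. */
    imm12 = ((inst >> 25) << 5) | ((inst >> 7) & 0x1fu);
    f.imm = (int32_t)(imm12 ^ 0x800u) - 0x800;
    return f;
}
```

For `0x00f62023` the sketch gives opcode `0x23`, `funct3` = 2 (`sw`), `rs1` = 12, `rs2` = 15, and an offset of 0, so the store address is simply the value held in x12.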
#### ID
![image](https://hackmd.io/_uploads/S1xqfrKua.png)
- `io_ex_aluop1_source=0` selects register data as ALU operand 1: the base memory address held in x12 (`io_regs_reg1_read_address=0x0C`).
- `io_ex_aluop2_source=1` selects the immediate as ALU operand 2: the memory address offset, which is 0 (`io_ex_immediate=0x00000000`).
- Because `io_memory_write_enable=1`, data is written into memory.

#### EXE
![image](https://hackmd.io/_uploads/rJMuGjt_T.png)
- The ALU output, `alu_io_result=00001334`, is the sum of the ALU operands `alu_io_op1=00001334` and `alu_io_op2=00000000`.

#### MEM
![image](https://hackmd.io/_uploads/HyIkvHtu6.png)
- The memory address input, `io_memory_bundle_address=0x00001334`, is equal to the ALU output, `io_alu_result=00001334`.
- The memory data input, `io_memory_bundle_write_data`, is 0x00000000, which is equal to the value of `io_reg2_data`.

#### WB
![image](https://hackmd.io/_uploads/HyRVfjYdT.png)
- An S-type instruction does not write a new value to any register.

# Reference
- [Lab3: Construct a single-cycle RISC-V CPU with Chisel](https://hackmd.io/@sysprog/r1mlr3I7p)
- [HW2](https://hackmd.io/3oeAp56nT3uVBbEyTxyQnQ)
- [RISC-V Instruction Encoder/Decoder](https://luplab.gitlab.io/rvcodecjs/)