Assignment3: single-cycle RISC-V CPU

contributed by < JinYu1225 >

Introduction

The objective of this assignment is to develop a single-cycle RISC-V CPU, named MyCPU, using Chisel, a hardware description language based on Scala. The process involves forking and modifying code from ca2023-lab3 to complete the CPU construction. The final step is to execute the project developed in Assignment2 on the MyCPU.

The complete project can be accessed here

For further details on Chisel, Scala, and CPU construction as part of this assignment, refer to Lab3: Construct a single-cycle RISC-V CPU with Chisel.

Hello World in Chisel

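import chisel3._

// Blinks an LED by toggling a 1-bit register each time a counter wraps.
class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  // Terminal count for half a blink period (1 Hz blink if the clock is 50 MHz, which is an assumption).
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg  = RegInit(0.U(32.W)) // free-running counter
  val blkReg  = RegInit(0.U(1.W))  // current LED state
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U     // wrap the counter
    blkReg := ~blkReg // toggle the LED
  }
  io.led := blkReg
}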

Hello is a module with a port named io, defined as an anonymous Bundle containing a single output wire, led. The module also has a counter register, cntReg, and a 1-bit register, blkReg; only blkReg drives the output led. Each time cntReg reaches CNT_MAX, the counter resets to 0 and blkReg toggles.

Modifying MyCPU

The project comprises six tests designed for various purposes, and MyCPU must pass all of them to show that its basic functionality works. The key .scala files that require modification are InstructionFetch.scala, InstructionDecode.scala, Execute.scala, and CPU.scala.

Run all 6 tests:

$ sbt test

Result:

[info] *** 5 TESTS FAILED ***
[error] Failed tests:
[error] 	riscv.singlecycle.InstructionDecoderTest
[error] 	riscv.singlecycle.ByteAccessTest
[error] 	riscv.singlecycle.ExecuteTest
[error] 	riscv.singlecycle.FibonacciTest
[error] 	riscv.singlecycle.QuicksortTest
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 11 s, completed Nov 30, 2023, 11:23:51 PM

It is evident that five tests have failed. One test has passed due to the modifications made in InstructionFetch.scala, which will be elaborated on in the following discussion.

Instruction Fetch

The instruction fetch stage is implemented in InstructionFetch.scala.

MyCPU fetches the instruction at the address held in the Program Counter (PC) and determines the next PC. This splits into two cases: either the instruction stream jumps, or it continues sequentially.

:warning: Refrain from copying and pasting your solution directly into the HackMD note. Instead, provide a concise summary of the various test cases, outlining the aspects of the CPU they evaluate, the techniques employed for loading test program instructions, and the outcomes of these test cases.

Another important point is that we should always check whether the instruction input is valid.

when(io.instruction_valid) {
  io.instruction := io.instruction_read_data
  ...
}.otherwise {
  pc             := pc
  io.instruction := 0x00000013.U
}
io.instruction_address := pc

0x00000013 is defined as nop in InstructionDecode.scala; it encodes addi x0, x0, 0.
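The fetch behavior described above can be sketched as a plain Scala software model. This is not the Chisel source of InstructionFetch.scala; the names FetchModel and step are hypothetical and exist only for this illustration. It assumes the three cases described in this note: hold the PC and emit a nop when the input is invalid, take the jump address when the jump flag is set, and otherwise advance by 4.

```scala
// Software model of the fetch behavior (not the Chisel implementation).
// FetchModel and step are hypothetical names introduced for this sketch.
object FetchModel {
  val Nop: Long = 0x00000013L // addi x0, x0, 0

  // Returns (next pc, instruction presented to the decoder) for one cycle.
  def step(pc: Long, valid: Boolean, readData: Long,
           jumpFlag: Boolean, jumpAddress: Long): (Long, Long) =
    if (!valid) (pc, Nop)                      // hold pc and emit a nop
    else if (jumpFlag) (jumpAddress, readData) // jump or taken branch
    else ((pc + 4) & 0xFFFFFFFFL, readData)    // sequential fetch, 32-bit wrap

  def main(args: Array[String]): Unit = {
    val held = step(pc = 0x1000L, valid = false, readData = 0L, jumpFlag = false, jumpAddress = 0L)
    val next = step(pc = 0x1000L, valid = true, readData = 0xFF410113L, jumpFlag = false, jumpAddress = 0L)
    println(s"held=$held next=$next")
  }
}
```

A model like this is handy as a reference when reading the Instruction Fetch Test waveform, since each cycle of the waveform should match one call to step.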

Instruction Decode

Instruction Decode is implemented in InstructionDecode.scala. The purpose of the decode stage is to assist the CPU in recognizing each type of operation and determining the corresponding action to be taken. It involves the following steps.

Identify the instruction type and the length of each field based on the opcode of the instruction.
Retrieve all the arguments (from registers, memory, or immediate values) required for the operation.
Generate the control signals for the next stage.

Let's use the ADD instruction as an example, whose opcode is represented as 0b0110011. A more comprehensive understanding can be gained by referring to the following table.

First, the instruction fetched in the IF stage arrives as the input. It is then divided into six fields: opcode, funct3, funct7, rd, rs1, and rs2, as the figure above shows.

val opcode = io.instruction(6, 0)
val funct3 = io.instruction(14, 12)
val funct7 = io.instruction(31, 25)
val rd     = io.instruction(11, 7)
val rs1    = io.instruction(19, 15)
val rs2    = io.instruction(24, 20)
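The bit slices above can be checked in software without running the Chisel design. The following is a minimal sketch in plain Scala; FieldExtract and bits are hypothetical names, and the sample word 0x002081B3 (add x3, x1, x2) is chosen here for illustration and does not come from the test suite.

```scala
// Plain Scala equivalent of Chisel's inst(hi, lo) slice on a 32-bit word.
// FieldExtract and bits are hypothetical names introduced for this sketch.
object FieldExtract {
  def bits(inst: Long, hi: Int, lo: Int): Int =
    ((inst >> lo) & ((1L << (hi - lo + 1)) - 1)).toInt

  def main(args: Array[String]): Unit = {
    val inst = 0x002081B3L // add x3, x1, x2 (illustrative sample)
    val opcode = bits(inst, 6, 0)
    val funct3 = bits(inst, 14, 12)
    val funct7 = bits(inst, 31, 25)
    val rd     = bits(inst, 11, 7)
    val rs1    = bits(inst, 19, 15)
    val rs2    = bits(inst, 24, 20)
    println(s"opcode=${opcode.toBinaryString} funct3=$funct3 funct7=$funct7 rd=$rd rs1=$rs1 rs2=$rs2")
  }
}
```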

Second, determine the instruction type and the exact operation by the opcode and funct3.

object InstructionTypes {
  val L  = "b0000011".U
  val I  = "b0010011".U
  val S  = "b0100011".U
  val RM = "b0110011".U
  val B  = "b1100011".U
}
...

Third, determine which register or immediate value should be accessed by the instruction.

...
io.wb_reg_write_source := MuxCase(
  RegWriteSource.ALUResult,
  ArraySeq(
    (opcode === InstructionTypes.RM || opcode === InstructionTypes.I ||
        opcode === Instructions.lui || opcode === Instructions.auipc) -> RegWriteSource.ALUResult, // same as default
    (opcode === InstructionTypes.L)                                 -> RegWriteSource.Memory,
    (opcode === Instructions.jal || opcode === Instructions.jalr)   -> RegWriteSource.NextInstructionAddress
  )
)
...

To ensure the correct operation of MyCPU, it is necessary to modify the control signal of MemRW in the Decode stage. This modification should be based on the types of operations being performed.

Another important point is that we should always check whether the instruction input is valid.

Finally, pass all the necessary arguments and control signals to the next stage (EXE).

Execute

The Execute stage is where MyCPU carries out the arithmetic processes of each instruction using the ALU. In this stage, the control signals and arguments are obtained from the ID stage, and a specific operation is performed in the ALU. Then the result will be passed to the next stage.

ALU.scala and ALUControl.scala are utilized in the Execute stage. ALUControl produces an output specifying the type of operation that the ALU needs to perform. The inputs of the ALU are determined by the results of ALUControl and the previous stage, as illustrated by the inputs provided through ALUop1 and ALUop2.
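The split between ALUControl and ALU can be illustrated with a small software model. This is only a sketch in plain Scala covering R-type instructions, based on the standard RV32I funct3/funct7 encodings; AluModel, control, and alu are hypothetical names, and the actual signal names and enumerations in ALU.scala and ALUControl.scala may differ.

```scala
// Software model of "control picks the operation, ALU applies it".
// Covers R-type only; all names here are hypothetical.
object AluModel {
  sealed trait AluOp
  case object Add extends AluOp
  case object Sub extends AluOp
  case object Sll extends AluOp
  case object Slt extends AluOp
  case object Sltu extends AluOp
  case object Xor extends AluOp
  case object Srl extends AluOp
  case object Sra extends AluOp
  case object Or extends AluOp
  case object And extends AluOp

  // funct3 selects the operation; funct7 bit 5 selects sub/sra variants.
  def control(funct3: Int, funct7: Int): AluOp = {
    val alt = (funct7 & 0x20) != 0
    (funct3 & 0x7) match {
      case 0x0 => if (alt) Sub else Add
      case 0x1 => Sll
      case 0x2 => Slt
      case 0x3 => Sltu
      case 0x4 => Xor
      case 0x5 => if (alt) Sra else Srl
      case 0x6 => Or
      case _   => And
    }
  }

  // 32-bit two's-complement arithmetic; shift amounts use the low 5 bits.
  def alu(op: AluOp, a: Int, b: Int): Int = op match {
    case Add  => a + b
    case Sub  => a - b
    case Sll  => a << (b & 31)
    case Slt  => if (a < b) 1 else 0
    case Sltu => if (Integer.compareUnsigned(a, b) < 0) 1 else 0
    case Xor  => a ^ b
    case Srl  => a >>> (b & 31)
    case Sra  => a >> (b & 31)
    case Or   => a | b
    case And  => a & b
  }

  def main(args: Array[String]): Unit = {
    // Operands taken from the EXE test waveform discussed later in this note.
    val sum = alu(control(0x0, 0x00), 0x0A45C5AF, 0x1486D599)
    println(Integer.toHexString(sum))
  }
}
```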

Combining into CPU

The CPU assumes a crucial role in coordinating the connection between each stage. Each stage is declared as a module variable within the CPU. The code establishes the necessary connections of inputs and outputs among these stages in the CPU to ensure the proper functioning of the single-cycle CPU.

The connections of inputs and outputs are shown in the following figure.

Test Result

$ sbt test
...
[info] Run completed in 14 seconds, 891 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 16 s, completed Dec 1, 2023, 11:57:58 PM

Generate waveform files during tests:

$ WRITE_VCD=1 sbt test

The .vcd waveform files will be generated under the test_run_dir directory.

Instruction Fetch Test

The reset signal is typically set to HIGH to ensure hardware resets to its initial state at the beginning of a functional test. Additionally, io_instruction_valid is set to LOW to output the No Operation (NOP) instruction 0x00000013 to io_instruction. The necessary input for the test will be set during this period.

Subsequently, on the falling edge of the clock, reset and io_instruction_valid are set to LOW and HIGH respectively, in preparation for the input sampled on the next rising edge.

pc is set to io_jump_address_id = 0x1000 because io_jump_flag_id is HIGH at the rising edge shown above.

pc is set to pc + 4 while io_jump_flag_id remains LOW.

Instruction Decode Test

The first io_instruction is 0x00A02223 = 0b0000 0000 1010 0000 0010 0010 0010 0011. The corresponding opcode is 0b010 0011, which indicates an S-type instruction.

io_memory_write_enable is set to HIGH, while reg_write_enable and memory_read_enable are set to LOW.
io_ex_aluop1_source = 0 = Register, and io_ex_aluop2_source = 1 = Immediate.
funct3 = 0b010 indicates it is sw, and imm = 0b0 0100, rs1 = 0b0 0000, rs2 = 0b0 1010
io_regs_reg1_read_address = 0, io_regs_reg2_read_address = 0xA, io_ex_immediate = 0x4

We can confirm that the decode stage works properly by checking the instruction input against the corresponding outputs.
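The field values listed above for 0x00A02223 can be reproduced with a short software decoder. This is a sketch in plain Scala based on the standard S-type layout; STypeDecode, Store, and decode are hypothetical names, and the second word in the test (0xFEA02E23, sw x10, -4(x0)) is an extra sample added here to exercise sign extension.

```scala
// Decodes an S-type word into rs1, rs2, funct3 and the sign-extended immediate.
// STypeDecode, Store and decode are hypothetical names for this sketch.
object STypeDecode {
  final case class Store(rs1: Int, rs2: Int, funct3: Int, imm: Int)

  def decode(raw: Long): Store = {
    val inst = raw.toInt // 32-bit view; >> is an arithmetic (sign-preserving) shift
    // imm[11:5] sits in bits 31:25 and imm[4:0] in bits 11:7.
    val imm = ((inst >> 25) << 5) | ((inst >> 7) & 0x1f)
    Store(
      rs1    = (inst >> 15) & 0x1f,
      rs2    = (inst >> 20) & 0x1f,
      funct3 = (inst >> 12) & 0x7,
      imm    = imm
    )
  }

  def main(args: Array[String]): Unit =
    println(decode(0x00A02223L)) // the first instruction of the decode test
}
```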

An interesting point in this test is that the output signals change on the falling edge of the clock, while the IF stage outputs change on the rising edge.

This is likely a consequence of the single-cycle structure. Apart from the IF stage, no stage has a register holding its output signals. Therefore, rising-edge updates appear only when we test the IF stage; in the other modules, the outputs change as soon as a new input is applied.

EXE Test

In this test, the circuit is first exercised 100 times with RISC-V code that represents x3 = x2 + x1. It is then given a few tests of the branch behavior pc + 2 if x1 === x2.

x3 = x2 + x1:

io_if_jump_flag remains 0 for the add function.
In the first clock cycle right after the reset signal, the circuit receives the inputs 0x0A45C5AF and 0x1486D599, then calculates the output 0x1ECC9B48.

pc + 2 if x1 === x2:

In the first set of branch test signals, io_if_jump_flag = 0 because io_reg1_data does not equal io_reg2_data.
At the next clock cycle, io_if_jump_flag = 1, and io_if_jump_address equals io_instruction_address + 2.

Register File Test

read the writing content:

After write_enable is set to 1, io_read_data1 changes to 0xDEADBEEF on the next rising edge, as determined by io_read_address1, io_write_address, and io_write_data.

read the written content:

In this case, we can only observe the results from 0 to 1.5 clock cycles, and the output io_read_data1 doesn't change during this period.
We can capture the waveform at later times by adding more c.clock.step() calls to the testbench.
io_read_data1 then changes to the expected value after 2 clock cycles.

Modify HW2 to fit MyCPU

Add a new class HammingDistanceTest into CPUTest.scala

class HammingDistanceTest extends AnyFlatSpec with ChiselScalatestTester {
  behavior.of("Single Cycle CPU")
  it should "cal hamming distance of 0x100000 and 0xFFFFF" in {
    test(new TestTopModule("HammingDistance.asmbin")).withAnnotations(TestAnnotations.annos) { c =>
      for (i <- 1 to 50) {
        c.clock.step(1000)
        c.io.mem_debug_read_address.poke((i * 4).U) // Avoid timeout
      }

      c.io.regs_debug_read_address.poke(10.U) // a0
      c.clock.step()
      c.io.regs_debug_read_data.expect(21.U)
    }
  }
}

Modify Makefile in /csrc to generate HammingDistance.asmbin
Run the test

$ sbt "testOnly riscv.singlecycle.HammingDistanceTest"

[info] welcome to sbt 1.9.7 (Oracle Corporation Java 17.0.9)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/edenlin/Documents/Computer_Architecture/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/edenlin/Documents/Computer_Architecture/ca2023-lab3/)
[info] HammingDistanceTest:
[info] Single Cycle CPU
[info] - should cal hamming distance of 0x100000 and 0xFFFFF *** FAILED ***
[info]   io_regs_debug_read_data=1 (0x1) did not equal expected=21 (0x15) (lines in CPUTest.scala: 77, 69) (CPUTest.scala:77)
[info] Run completed in 6 seconds, 582 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error] 	riscv.singlecycle.HammingDistanceTest
[error] (Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 8 s, completed Jan 19, 2024, 11:22:37 PM

The test initially failed due to an oversight in the code. Upon reviewing it, I discovered that the HW2 version terminated through an ecall placed in the middle of the code. When the program was moved to MyCPU, this ecall no longer stopped execution, so the program ran on past that point and the final result deviated from expectations. The issue was fixed by moving the termination point to the end of the code, after which the expected result was obtained.

[info] welcome to sbt 1.9.7 (Oracle Corporation Java 17.0.9)
[info] loading settings for project ca2023-lab3-build from plugins.sbt ...
[info] loading project definition from /home/edenlin/Documents/Computer_Architecture/ca2023-lab3/project
[info] loading settings for project root from build.sbt ...
[info] set current project to mycpu (in build file:/home/edenlin/Documents/Computer_Architecture/ca2023-lab3/)
[info] HammingDistanceTest:
[info] Single Cycle CPU
[info] - should cal hamming distance of 0x100000 and 0xFFFFF
[info] Run completed in 5 seconds, 774 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 7 s, completed Jan 19, 2024, 11:38:19 PM
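The expected value 21 in the test can be cross-checked with an independent software reference. The sketch below is plain Scala and is not the assembly from HW2; HammingReference and hammingDistance are hypothetical names. It XORs the two operands and counts the set bits, which is the usual definition of Hamming distance between two integers.

```scala
// Independent reference for the value the CPU test expects in a0.
// HammingReference and hammingDistance are hypothetical names for this sketch.
object HammingReference {
  // Number of bit positions in which a and b differ.
  def hammingDistance(a: Long, b: Long): Int =
    java.lang.Long.bitCount(a ^ b)

  def main(args: Array[String]): Unit =
    println(hammingDistance(0x100000L, 0xFFFFFL))
}
```

Here 0x100000 XOR 0xFFFFF is 0x1FFFFF, which has 21 bits set, so the reference agrees with the value checked by expect(21.U).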

Verilator

Use Verilator to check the waveform and quickly test programs. The following command should be run every time the Chisel source files are modified, to regenerate the corresponding Verilog file.

$ make verilator

This produces an executable file, VTop, which can run program files with the following parameters.

Parameter	Usage
`-memory`	Specify the size of the simulation memory in words (4 bytes each). Example: `-memory 4096`
`-instruction`	Specify the RISC-V program used to initialize the simulation memory. Example: `-instruction src/main/resources/hello.asmbin`
`-signature`	Specify the memory range and destination file to output after simulation. Example: `-signature 0x100 0x200 mem.txt`
`-halt`	Specify the halt identifier address; writing `0xBABECAFE` to this memory address stops the simulation. Example: `-halt 0x8000`
`-vcd`	Specify the filename for saving the simulation waveform during the process; not specifying this parameter will not generate a waveform file. Example: `-vcd dump.vcd`
`-time`	Specify the maximum simulation time; note that time is twice the number of cycles. Example: `-time 1000`

Load the HammingDistance.asmbin, simulate for 2000 cycles, and save the simulation waveform to the dumpH.vcd.

$ ./run-verilator.sh -instruction src/main/resources/HammingDistance.asmbin -time 4000 -vcd dumpH.vcd
$ gtkwave dumpH.vcd

Waveforms analysis

RISC-V Instruction Formats by Reference Data Card

I type

Take the following instruction as an I-type example.
0xFF410113 = 0b 1111 1111 0100 0001 0000 0001 0001 0011
According to the Reference Data Card, this encoding represents the following instruction.
addi sp sp -12

imm[11:0]	rs1	funct3	rd	Opcode
1111 1111 0100	0 0010	000	0 0010	001 0011

IF:

The instruction at io_instruction_address = 0x1000 is placed on io_instruction_read_data and is passed to io_instruction as soon as io_instruction_valid goes HIGH.
io_jump_flag_id is LOW, so the next pc is pc + 4.

ID:

For I-type code, io_ex_aluop1_source and io_ex_aluop2_source were assigned 0 and 1, which represent Register and Immediate.
io_ex_immediate = 0b1...10100 (0xFFFFFFF4 as a 32-bit value), which is -12 in decimal.
The memory read and write enable signals were both 0 for the I-type instruction. Meanwhile, the register write enable signal was set to 1, since the result is written back to register rd.
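The sign extension of the I-type immediate can be verified with a few lines of plain Scala. This is only an illustrative sketch; ITypeImmediate and imm are hypothetical names. It relies on the fact that imm[11:0] occupies bits 31:20, so an arithmetic right shift by 20 on a 32-bit signed value both extracts and sign-extends the field.

```scala
// Extracts the sign-extended I-type immediate from a 32-bit instruction word.
// ITypeImmediate and imm are hypothetical names for this sketch.
object ITypeImmediate {
  // toInt gives a signed 32-bit view; >> replicates the sign bit.
  def imm(raw: Long): Int = raw.toInt >> 20

  def main(args: Array[String]): Unit = {
    val value = imm(0xFF410113L) // addi sp sp -12
    println(s"decimal=$value hex=0x${Integer.toHexString(value)}")
  }
}
```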

EXE:

The jump flag was 0 since the instruction was not JAL, JALR, or a B-type instruction.
op1 and op2 of the ALU were assigned 0 from sp and -12 from imm, according to aluop1_source and aluop2_source.
The ALU added sp and -12, as selected by the opcode and funct3 that represent addi.

MEM:

The memory read and write enable signals, which are driven by the information decoded in the ID stage for L- and S-type instructions, remained LOW. Instructions other than L and S types therefore do nothing in the MEM stage.

WB:

The source of reg_write_data in the WB stage is decided by the instruction type decoded in the ID stage. RM, I, lui, and auipc instructions take alu_result as the source data,
while L-type instructions take memory_read_data, and JAL and JALR take instruction_address + 4.

B type

Take the following instruction as a B-type example.
0x04040A63 = 0b0000 0100 0000 0100 0000 1010 0110 0011
beq s0 x0 EXIT_HAMDIS

imm[12,10:5]	rs2	rs1	funct3	imm[4:1,11]	opcode
000 0010	00000	01000	000	10100	110 0011

IF:

The jump_flag is asserted HIGH, and pc takes the value of jump_address_id in the next cycle.

ID:

The ID stage recognized the instruction as B-type, and specifically as beq, from the opcode and funct3.
It then decoded the instruction, placed each immediate bit field into its B-type position, and assigned the result to ex_immediate.
aluop1_source and aluop2_source were assigned register and imm, because in this CPU whether the branch is taken is determined in the jump judge unit. The jump address would normally be calculated by the ALU, but this CPU computes it in the EXE stage outside the ALU module.

EXE:

if_jump_address was calculated as instruction_address + immediate, and if_jump_flag was determined by the opcode and funct3.
Although the CPU calculates if_jump_address outside the ALU module, alu_io_result shows the same value as if_jump_address. The ALU could therefore be used to compute the B-type target instead of the extra adder.
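The B-type immediate reassembly and the target calculation can be sketched in plain Scala. This is a software illustration of the standard B-type layout, not the code in InstructionDecode.scala or Execute.scala; BTypeBranch, imm, and target are hypothetical names. The instruction address of the beq is not stated in this note, so the 0x1000 used in main is only a placeholder.

```scala
// Reassembles the B-type immediate and computes instruction_address + immediate.
// BTypeBranch, imm and target are hypothetical names for this sketch.
object BTypeBranch {
  def imm(raw: Long): Int = {
    val inst    = raw.toInt
    val imm12   = (inst >> 31) << 12          // sign bit, replicated upward
    val imm11   = ((inst >> 7) & 0x1) << 11   // bit 7 holds imm[11]
    val imm10_5 = ((inst >> 25) & 0x3f) << 5  // bits 30:25 hold imm[10:5]
    val imm4_1  = ((inst >> 8) & 0xf) << 1    // bits 11:8 hold imm[4:1]
    imm12 | imm11 | imm10_5 | imm4_1          // imm[0] is always 0
  }

  // Branch target as a 32-bit address.
  def target(instructionAddress: Long, raw: Long): Long =
    (instructionAddress + imm(raw)) & 0xFFFFFFFFL

  def main(args: Array[String]): Unit = {
    val beq = 0x04040A63L // beq s0 x0 EXIT_HAMDIS
    // 0x1000 is a placeholder address, not taken from the waveform.
    println(s"offset=${imm(beq)} target=0x${target(0x1000L, beq).toHexString}")
  }
}
```

For the example word, the fields in the table give imm[10:5] = 000010 and imm[4:1] = 1010, which is an offset of 64 + 20 = 84 bytes.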

There was nothing to do in the MEM and WB stages for the B-type instruction.

JAL

Take the following instruction as a JAL example.
0x010000EF = 0b 0000 0001 0000 0000 0000 0000 1110 1111
jal ra HAMDIS

imm[20,10:1,11,19:12]	rd	opcode
0000 0001 0000 0000 0000	0 0001	110 1111

IF:

The jump_flag was set to HIGH, and the corresponding jump_address was 0x104C. The value of pc was 0x104C in the next cycle.

ID:

reg_write_enable was set to HIGH to store pc + 4 into rd, whose index was assigned to reg_write_address.
The aluop1 and aluop2 sources were both assigned 1, representing Instruction address and Imm respectively.

EXE:

The control signal if_jump_flag was set to HIGH as the instruction was JAL.
The if_jump_address was assigned as immediate + instruction_address = 0x103C + 0x10 = 0x104C.

WB:

regs_write_data was assigned as instruction_address + 4 = 0x1040.
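The JAL numbers above (offset 0x10, target 0x104C, link value 0x1040) can be reproduced with a small plain Scala model of the standard J-type layout. This is an illustrative sketch rather than the CPU's Chisel code; JalModel, imm, target, and link are hypothetical names.

```scala
// Reassembles the J-type immediate and computes the jump target and link value.
// JalModel, imm, target and link are hypothetical names for this sketch.
object JalModel {
  def imm(raw: Long): Int = {
    val inst     = raw.toInt
    val imm20    = (inst >> 31) << 20           // sign bit, replicated upward
    val imm19_12 = ((inst >> 12) & 0xff) << 12  // bits 19:12 hold imm[19:12]
    val imm11    = ((inst >> 20) & 0x1) << 11   // bit 20 holds imm[11]
    val imm10_1  = ((inst >> 21) & 0x3ff) << 1  // bits 30:21 hold imm[10:1]
    imm20 | imm19_12 | imm11 | imm10_1          // imm[0] is always 0
  }

  def target(instructionAddress: Long, raw: Long): Long =
    (instructionAddress + imm(raw)) & 0xFFFFFFFFL

  // Value written back to rd: the address of the following instruction.
  def link(instructionAddress: Long): Long =
    (instructionAddress + 4) & 0xFFFFFFFFL

  def main(args: Array[String]): Unit = {
    val jal = 0x010000EFL // jal ra HAMDIS, located at 0x103C
    println(s"target=0x${target(0x103CL, jal).toHexString} link=0x${link(0x103CL).toHexString}")
  }
}
```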

S type

We take the following S-type instruction as the example in this section.
0x00512023 = 0b 0000 0000 0101 0001 0010 0000 0010 0011
which means sw t0 0(sp)

imm[11:5]	rs2	rs1	funct3	imm[4:0]	opcode
0000 000	00101	00010	010	0 0000	010 0011

Because the IF stage behaves differently only for JAL, JALR, and B-type instructions, we skip the IF stage analysis for the S-type instruction.

The register index of t0 was assigned to reg2_read_address, and its value would be used in the MEM stage since the control signal memory_write_enable was HIGH.

MEM:

The control signal memory_write_enable was HIGH.
As funct3 revealed that the instruction was sw, strobes 0 to 3 were all set to 1, corresponding to the size of a word.
memory_bundle_address = 0xFFFFFFF4 matched the value of sp set by the previous instruction, addi sp sp -12.
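The relation between funct3 and the write strobes can be sketched as a plain Scala model. The byte-lane policy below is an assumption made for this illustration (sb enables the lane selected by the low two address bits, sh enables the lower or upper pair of lanes, sw enables all four); the actual MEM stage code may differ. StoreStrobes and strobes are hypothetical names.

```scala
// Models which of the four byte lanes a store enables.
// The lane policy is an assumption; StoreStrobes and strobes are hypothetical names.
object StoreStrobes {
  def strobes(funct3: Int, address: Long): Seq[Boolean] = {
    val index = (address & 0x3L).toInt // byte offset within the word
    funct3 match {
      case 0x0 => Seq.tabulate(4)(i => i == index)               // sb: one lane
      case 0x1 => Seq.tabulate(4)(i => (i >> 1) == (index >> 1)) // sh: one half
      case 0x2 => Seq.fill(4)(true)                              // sw: whole word
      case _   => Seq.fill(4)(false)                             // not a store width
    }
  }

  def main(args: Array[String]): Unit =
    println(strobes(0x2, 0xFFFFFFF4L)) // sw t0 0(sp) with sp = 0xFFFFFFF4
}
```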

The EXE and WB stages need no special handling for the S-type instruction, so they are not discussed in this section.
