Assignment3: Single-cycle RISC-V CPU

contributed by < paulpeng-popo >

Prerequisites

To avoid affecting the host environment, I set up a container to provide the experimental environment for Assignment 3.

Here, I use Docker to build the container.

FROM arm64v8/ubuntu:22.04

# set the working directory
WORKDIR /root

# set the environment variable
ENV DEBIAN_FRONTEND=noninteractive

# update the repository sources list
RUN apt update

# install sudo
RUN apt install sudo -y

# create a new user as popo
RUN useradd -ms /bin/bash popo

# add the user to sudo group
RUN usermod -aG sudo popo

# set user popo as sudoer without password
RUN echo "popo ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

# change the user to popo
USER popo

# change the working directory to home
WORKDIR /home/popo

# set the environment variable
ENV HOME=/home/popo
ENV USER=popo
ENV PATH=$PATH:/home/popo/.local/bin

# install packages
RUN sudo apt install git wget curl xauth dbus-x11 -y

ENTRYPOINT ["/bin/bash"]

Then, follow the instructions in Lab3: Construct a single-cycle RISC-V CPU with Chisel to install the necessary dependencies and tools.

$ sudo apt install build-essential verilator gtkwave
$ curl -s "https://get.sdkman.io" | bash
$ sdk install java 11.0.21-tem 
$ sdk install sbt

# install scala on aarch64 linux
$ curl -fL https://github.com/VirtusLab/coursier-m1/releases/latest/download/cs-aarch64-pc-linux.gz | gzip -d > cs && chmod +x cs && ./cs setup

# change to scala 2
$ cs install scala:2.13.12 scalac:2.13.12

Hello World in Chisel

// Hello.scala

import chisel3._

class Hello extends Module {
  val io = IO(new Bundle {
    val led = Output(UInt(1.W))
  })
  // number of clock cycles between LED toggles, minus one
  val CNT_MAX = (50000000 / 2 - 1).U
  val cntReg  = RegInit(0.U(32.W))
  val blkReg  = RegInit(0.U(1.W))
  cntReg := cntReg + 1.U
  when(cntReg === CNT_MAX) {
    cntReg := 0.U
    blkReg := ~blkReg
  }
  io.led := blkReg
}

The module has two registers, cntReg and blkReg, both initialized with zero values. cntReg is a 32-bit counter that increments by 1 in each clock cycle. When cntReg reaches a certain value (CNT_MAX), it resets to zero, and blkReg toggles its value.

I tested Hello.scala using chisel-template.

To make the output easier to observe, I reduced the constant from 50,000,000 to 10.

CNT_MAX is therefore 10 / 2 - 1 = 4.

Then create HelloSpec.scala in src/test/scala/example:

// HelloSpec.scala

import chisel3._
import chiseltest._
import org.scalatest.freespec.AnyFreeSpec

class HelloSpec extends AnyFreeSpec with ChiselScalatestTester {
  "Hello" in {
    test(new Hello) { hello =>
      for (clk <- 0 until 10) {
        hello.clock.step(1)
        val led = hello.io.led.peek()
        println(s"clk: $clk, led: $led")
      }
    }
  }
}

The test steps the simulation forward for 10 clock cycles and prints the cycle index (clk) and the value of led after each step; it prints the values rather than asserting on them.

$ sbt "testOnly example.HelloSpec"

The output would look like:

clk: 0, led: UInt<1>(0)
clk: 1, led: UInt<1>(0)
clk: 2, led: UInt<1>(0)
clk: 3, led: UInt<1>(0)
clk: 4, led: UInt<1>(1)
clk: 5, led: UInt<1>(1)
clk: 6, led: UInt<1>(1)
clk: 7, led: UInt<1>(1)
clk: 8, led: UInt<1>(1)
clk: 9, led: UInt<1>(0)
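To see why the LED changes at clk 4 and clk 9, the register updates can be replayed in software. The following is a minimal plain-Scala behavioural sketch, not Chisel hardware; the names HelloModel and ledTrace are hypothetical and introduced only for illustration.

```scala
// Plain-Scala behavioural model of the Hello module (not Chisel hardware).
object HelloModel {
  // Returns the LED value observed after each of `cycles` clock steps.
  def ledTrace(cycles: Int, cntMax: Int): Seq[Int] = {
    var cnt = 0 // models cntReg, reset value 0
    var blk = 0 // models blkReg, reset value 0
    (0 until cycles).map { _ =>
      // Both registers update from the values held before the clock edge.
      val wrap = cnt == cntMax
      cnt = if (wrap) 0 else cnt + 1
      blk = if (wrap) blk ^ 1 else blk
      blk
    }
  }
}
```

Calling HelloModel.ledTrace(10, 4) gives 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, which matches the printed output: the counter needs four steps to reach CNT_MAX = 4, and the toggle becomes visible on the following step.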

Enhancement

Using when blocks in Hardware Description Language (HDL) designs is not necessarily something that should be avoided in all cases. However, in some situations, particularly when dealing with simple state machines or conditional assignments, it might be more readable and synthesizable to use multiplexers (muxes) instead of when blocks.

The primary reason for preferring muxes in some cases is that they directly map to hardware multiplexing structures, which synthesis tools can often recognize and implement more efficiently. This is especially true for simple conditions or state machines where a mux can directly represent the selection of one value from several inputs.

-- From ChatGPT --

- cntReg := cntReg + 1.U
- when(cntReg === CNT_MAX) {
-   cntReg := 0.U
-   blkReg := ~blkReg
- }
+ cntReg := Mux(cntReg === CNT_MAX, 0.U, cntReg + 1.U)
+ blkReg := blkReg ^ (cntReg === CNT_MAX)

Here, a multiplexer selects whether the counter increments or resets to zero. At the same time, an XOR toggles blkReg: XORing with 1 flips the bit and XORing with 0 leaves it unchanged, so blkReg changes state exactly when the counter resets, without using a when block.
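One way to gain confidence that the rewrite preserves behaviour is to compare the next-state functions of both versions over every reachable state. The sketch below is a plain-Scala model, not Chisel; MuxEquivalence, nextWhen, nextMux, and agree are hypothetical names, and it assumes the counter stays within 0..CNT_MAX, so 32-bit wraparound is ignored.

```scala
// Plain-Scala comparison of the `when` and Mux/XOR formulations (not Chisel).
object MuxEquivalence {
  final case class State(cnt: Int, blk: Int)

  // Next state as written with the `when` block (the last assignment wins).
  def nextWhen(s: State, cntMax: Int): State = {
    var cnt = s.cnt + 1
    var blk = s.blk
    if (s.cnt == cntMax) { cnt = 0; blk = s.blk ^ 1 }
    State(cnt, blk)
  }

  // Next state as written with Mux and XOR.
  def nextMux(s: State, cntMax: Int): State = {
    val hit = if (s.cnt == cntMax) 1 else 0
    State(if (hit == 1) 0 else s.cnt + 1, s.blk ^ hit)
  }

  // True when both formulations agree on every state with cnt in 0..cntMax.
  def agree(cntMax: Int): Boolean =
    (for (c <- 0 to cntMax; b <- 0 to 1) yield State(c, b))
      .forall(s => nextWhen(s, cntMax) == nextMux(s, cntMax))
}
```

MuxEquivalence.agree(4) checks all ten states for the reduced CNT_MAX used in the test above.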

Single Cycle RISC-V CPU (MyCPU)

Overview of Implementation

Instruction Fetch: Fetching the instruction data from memory.
Decode: Understanding the meaning of the instruction and reading register data.
Execute: Calculating the result using the ALU.
Memory Access (load/store instructions): Reading from and writing to memory.
Write-back (for all instructions except store): Writing the result back to registers.

Check waveform by GTKWave

$ WRITE_VCD=1 sbt test
$ gtkwave test_run_dir/<xxx>/<xxx>.vcd

Instruction Fetch

The instruction fetch stage does the following:

Fetch the instruction from memory based on the current address in the PC register.
Modify the value of the PC register to point to the next instruction.

The PC register is initially set to the program's entry address. When the instruction is valid, the CPU fetches the instruction at the address held in the PC. It then checks jump_flag_id to decide whether to jump: if the flag is set, the PC is updated with jump_address_id; otherwise, the PC is incremented by 4 to point to the next sequential instruction.

The PC starts at address 0x1000. In the first test case, no jump occurs, so the PC advances to PC + 4. In the second test case, a jump to address 0x1000 is taken, so the PC is updated to 0x1000 on the next clock cycle.
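The next-PC rule can be written as a single selection. The following is a minimal plain-Scala sketch of that rule, not the Chisel module itself; FetchModel and nextPc are hypothetical names, and the 32-bit mask is an assumption about the PC width.

```scala
// Plain-Scala model of the next-PC selection (not Chisel hardware).
object FetchModel {
  // Next PC: the jump target when the jump flag is set, otherwise PC + 4.
  def nextPc(pc: Long, jumpFlag: Boolean, jumpAddress: Long): Long = {
    val next = if (jumpFlag) jumpAddress else pc + 4
    next & 0xFFFFFFFFL // assumption: the PC is 32 bits wide
  }
}
```

With the values from the two test cases, FetchModel.nextPc(0x1000L, false, 0L) evaluates to 0x1004, and a taken jump to 0x1000 yields 0x1000 regardless of the current PC.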

Instruction Decode

The decode stage does the following:

Read the opcode to determine instruction type and field lengths
Read in data from all necessary registers
- for add, read two registers
- for addi, read one register
- for jal, no reads are necessary
Output control signals

Of the signals this stage must generate, 8 are already provided; the remaining two outputs, memory_read_enable and memory_write_enable, have not been implemented yet.

These two signals are associated with load and store instructions.

To implement them, set memory_read_enable to true.B for L-type (load) instructions and memory_write_enable to true.B for S-type (store) instructions; otherwise, both default to false.B.

A warning occurs during compilation:

method apply in object MuxLookup is deprecated (since Chisel 3.6): Use MuxLookup(key, default)(mapping) instead

To remove the warning, move the mapping sequence into a second argument list, as the message suggests:

val immediate = MuxLookup(
    opcode,
    Cat(..., ...)
) {
  IndexedSeq(
    ...,
    ...
  )
}

Three test cases:

0x00a02223 (S-type)
0x000022b7 (lui)
0x002081b3 (add)

object InstructionTypes {
  val L  = "b0000011".U // 0x3
  val I  = "b0010011".U
  val S  = "b0100011".U // 0x23
  val RM = "b0110011".U
  val B  = "b1100011".U
}

According to our design specification, memory_read_enable should be true.B when the opcode is 0x3, and memory_write_enable should be true.B when the opcode is 0x23. The waveform chart above validates this behavior.
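The same rule can be checked in software against the three test encodings listed earlier. This is a plain-Scala sketch rather than the Chisel decode logic; DecodeModel and its members are hypothetical names, and the opcode is taken from the low 7 bits of the instruction as defined by the RISC-V base encoding.

```scala
// Plain-Scala model of the two memory enable signals (not Chisel hardware).
object DecodeModel {
  val OpcodeLoad  = 0x03 // matches InstructionTypes.L
  val OpcodeStore = 0x23 // matches InstructionTypes.S

  // The opcode occupies bits 6..0 of a RISC-V instruction.
  def opcode(instruction: Long): Int = (instruction & 0x7F).toInt

  def memoryReadEnable(instruction: Long): Boolean =
    opcode(instruction) == OpcodeLoad

  def memoryWriteEnable(instruction: Long): Boolean =
    opcode(instruction) == OpcodeStore
}
```

For 0x00a02223 (S-type) only memoryWriteEnable is true, while for 0x000022b7 (lui) and 0x002081b3 (add) both signals stay false.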

Execution

The execution stage does the following:

Perform ALU computation.
Determine if there is a branch.

The control line for the ALU, denoted as alu.io.func, is derived from the output of the ALU control module, specifically alu_ctrl.io.alu_funct. Additionally, the two inputs of the ALU are determined by the control lines aluop1_source and aluop2_source. These control lines drive the corresponding inputs through two Muxes.

Initially, several test cases use the ADD instruction to check that the ALU operates correctly. The final two tests use the BEQ instruction and cover both the taken and not-taken cases. When the branch is taken, the jump target is PC + 2, which equals 0x4 in this test (that is, the PC is 0x2 when the branch executes).
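The operand selection and the branch decision can be sketched together in plain Scala. This is a behavioural model, not the Chisel Execute module. ExecuteModel and its members are hypothetical names, and the encodings of the two select lines (0 for the register operand, 1 for the instruction address or immediate) are assumptions made for illustration. The PC value 0x2 is inferred from the statement that PC + 2 equals 0x4.

```scala
// Plain-Scala model of the ALU operand muxes and the BEQ decision (not Chisel).
object ExecuteModel {
  // Assumed encodings of the operand-select control lines.
  val Op1Register = 0
  val Op1InstructionAddress = 1
  val Op2Register = 0
  val Op2Immediate = 1

  private def mask32(x: Long): Long = x & 0xFFFFFFFFL

  // The two multiplexers in front of the ALU inputs.
  def op1(source: Int, reg1: Long, pc: Long): Long =
    if (source == Op1InstructionAddress) pc else reg1

  def op2(source: Int, reg2: Long, imm: Long): Long =
    if (source == Op2Immediate) imm else reg2

  // BEQ: the ALU adds PC and immediate to form the target;
  // the jump flag is set when the two register values are equal.
  def beq(pc: Long, imm: Long, reg1: Long, reg2: Long): (Boolean, Long) = {
    val a = op1(Op1InstructionAddress, reg1, pc)
    val b = op2(Op2Immediate, reg2, imm)
    (reg1 == reg2, mask32(a + b))
  }
}
```

With pc = 2 and imm = 2, equal register values give a jump flag of true and a target of 4; unequal register values leave the flag false.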

Combining into a CPU

With the completion of modules for each stage, the subsequent phase involves connecting the inputs and outputs of these stages. Once this integration is accomplished, the single-cycle RISC-V CPU will be considered complete.

[info] Run completed in 10 seconds, 938 milliseconds.
[info] Total number of tests run: 9
[info] Suites: completed 7, aborted 0
[info] Tests: succeeded 9, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 12 s, completed Dec 3, 2023, 12:05:01 AM
