Pipelined RISC-V core

李皓翔

GitHub

Single cycle CPU (lab3)

The single-cycle CPU divides the execution of each instruction into five distinct stages:

Fetch: Retrieve instruction data from memory.
Decode: Understand the meaning of the instruction and read data from registers.
Execute: Use the ALU to compute the result.
Memory Access (for load/store instructions): Perform memory read or write operations.
Write Back (for all instructions except store): Write the result back to the registers.

Instruction Fetch

Tasks of the Instruction Fetch Stage:

Retrieve the instruction from memory based on the current address stored in the PC register.
Update the value of the PC register to point to the next instruction.

    when(io.jump_flag_id){
      pc := io.jump_address_id
    }.otherwise{
      pc := pc + 4.U
    }

jump_flag_id is asserted when a jump (jal or jalr) or a taken branch has been resolved. In that case the next PC is set to jump_address_id; otherwise it is set to pc + 4.U. The instruction at the current PC is fetched from memory and passed to the next stage.

Instruction Decode

Tasks of the Instruction Decode Stage:

Read the operands from the registers and the immediate value.
Generate the control signals needed for the subsequent stages.

  val opcode = io.instruction(6, 0)
  val funct3 = io.instruction(14, 12)
  val funct7 = io.instruction(31, 25)
  val rd     = io.instruction(11, 7)
  val rs1    = io.instruction(19, 15)
  val rs2    = io.instruction(24, 20)

In the Decode stage, the instruction is first decomposed into opcode, funct3, funct7, rd, rs1, and rs2. Based on the opcode, the type of the instruction can be identified, as shown in the table below.

opcode	Instruction Type
011 0011	R-type
110 0011	B-type
001 0011	I-type
010 0011	S-type
000 0011	L-type
001 0111	AUIPC
011 0111	LUI
110 1111	JAL
110 0111	JALR

R-type

31-25	24-20	19-15	14-12	11-7	6-0
funct7	rs2	rs1	funct3	rd	opcode

I-type

31-25	24-20	19-15	14-12	11-7	6-0
imm[11:0]		rs1	funct3	rd	opcode

S-Type

31-25	24-20	19-15	14-12	11-7	6-0
imm[11:5]	rs2	rs1	funct3	imm[4:0]	opcode

Branch-Type

31-25	24-20	19-15	14-12	11-7	6-0
imm[12|10:5]	rs2	rs1	funct3	imm[4:1|11]	opcode

U-Type

31-25	24-20	19-15	14-12	11-7	6-0
imm[31:12]				rd	opcode

UJ-Type

31-25	24-20	19-15	14-12	11-7	6-0
imm[20|10:1|11|19:12]				rd	opcode

Depending on the type, the corresponding control signals and the value of the immediate are handled separately.
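
The immediate layouts in the tables above can be checked with a small software model. The following is a plain Scala sketch (not Chisel, and not the lab's InstructionDecode code); the object and function names are hypothetical and chosen only for illustration.

```scala
// Plain Scala reference model of RV32I immediate decoding (illustrative only).
object ImmDecode {
  // Extract bits hi..lo (inclusive) of a 32-bit instruction word.
  private def bits(inst: Int, hi: Int, lo: Int): Int =
    (inst >>> lo) & ((1 << (hi - lo + 1)) - 1)

  // Sign-extend the low `width` bits of `value` to 32 bits.
  private def signExtend(value: Int, width: Int): Int =
    (value << (32 - width)) >> (32 - width)

  // I-type: imm[11:0] sits in bits 31..20.
  def immI(inst: Int): Int = signExtend(bits(inst, 31, 20), 12)

  // S-type: imm[11:5] in bits 31..25, imm[4:0] in bits 11..7.
  def immS(inst: Int): Int =
    signExtend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)

  // B-type: imm[12|10:5] in bits 31..25, imm[4:1|11] in bits 11..7; bit 0 is always 0.
  def immB(inst: Int): Int = signExtend(
    (bits(inst, 31, 31) << 12) | (bits(inst, 7, 7) << 11) |
      (bits(inst, 30, 25) << 5) | (bits(inst, 11, 8) << 1),
    13
  )

  // U-type: imm[31:12] occupies the upper 20 bits; the low 12 bits are zero.
  def immU(inst: Int): Int = inst & 0xfffff000

  // J-type: imm[20|10:1|11|19:12] in bits 31..12; bit 0 is always 0.
  def immJ(inst: Int): Int = signExtend(
    (bits(inst, 31, 31) << 20) | (bits(inst, 19, 12) << 12) |
      (bits(inst, 20, 20) << 11) | (bits(inst, 30, 21) << 1),
    21
  )
}
```

For example, `ImmDecode.immB(0xfe000ee3)` decodes the encoding of `beq x0, x0, -4` and `ImmDecode.immI(0xfff00093)` decodes `addi x1, x0, -1`.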

Execute

Tasks of the Execute Stage:

Perform calculations using the ALU.
Determine whether a branch or jump should occur.

ALU Control

 switch(io.opcode) {
    is(InstructionTypes.I) {
      io.alu_funct := MuxLookup(
        io.funct3,
        ALUFunctions.zero,
        IndexedSeq(
          InstructionsTypeI.addi  -> ALUFunctions.add,
          InstructionsTypeI.slli  -> ALUFunctions.sll,
          InstructionsTypeI.slti  -> ALUFunctions.slt,
          InstructionsTypeI.sltiu -> ALUFunctions.sltu,
          InstructionsTypeI.xori  -> ALUFunctions.xor,
          InstructionsTypeI.ori   -> ALUFunctions.or,
          InstructionsTypeI.andi  -> ALUFunctions.and,
          InstructionsTypeI.sri   -> Mux(io.funct7(5), ALUFunctions.sra, ALUFunctions.srl)
        ),
      )
    }
    is(InstructionTypes.RM) {
      io.alu_funct := MuxLookup(
        io.funct3,
        ALUFunctions.zero,
        IndexedSeq(
          InstructionsTypeR.add_sub -> Mux(io.funct7(5), ALUFunctions.sub, ALUFunctions.add),
          InstructionsTypeR.sll     -> ALUFunctions.sll,
          InstructionsTypeR.slt     -> ALUFunctions.slt,
          InstructionsTypeR.sltu    -> ALUFunctions.sltu,
          InstructionsTypeR.xor     -> ALUFunctions.xor,
          InstructionsTypeR.or      -> ALUFunctions.or,
          InstructionsTypeR.and     -> ALUFunctions.and,
          InstructionsTypeR.sr      -> Mux(io.funct7(5), ALUFunctions.sra, ALUFunctions.srl)
        ),
      )
    }
    is(InstructionTypes.B) {
      io.alu_funct := ALUFunctions.add
    }
    is(InstructionTypes.L) {
      io.alu_funct := ALUFunctions.add
    }
    is(InstructionTypes.S) {
      io.alu_funct := ALUFunctions.add
    }
    is(Instructions.jal) {
      io.alu_funct := ALUFunctions.add
    }
    is(Instructions.jalr) {
      io.alu_funct := ALUFunctions.add
    }
    is(Instructions.lui) {
      io.alu_funct := ALUFunctions.add
    }
    is(Instructions.auipc) {
      io.alu_funct := ALUFunctions.add
    }
  }

In the ALUControl, the opcode is used to determine the corresponding instruction type, and each instruction is mapped to its respective alu_funct.

B-type, S-type, and jump (jal, jalr) instructions, as well as lui, auipc, and loads:
The ALU simply adds its two inputs, so all of them map to the add operation.
I-type instructions:
They are essentially the same as R-type instructions, except that the second operand is an immediate. The mapping therefore follows the R-type one, except that addi always maps to add because there is no subtract-immediate instruction.

ALU

  switch(io.func) {
    is(ALUFunctions.add) {
      io.result := io.op1 + io.op2
    }
    is(ALUFunctions.sub) {
      io.result := io.op1 - io.op2
    }
    is(ALUFunctions.sll) {
      io.result := io.op1 << io.op2(4, 0)
    }
    is(ALUFunctions.slt) {
      io.result := io.op1.asSInt < io.op2.asSInt
    }
    is(ALUFunctions.xor) {
      io.result := io.op1 ^ io.op2
    }
    is(ALUFunctions.or) {
      io.result := io.op1 | io.op2
    }
    is(ALUFunctions.and) {
      io.result := io.op1 & io.op2
    }
    is(ALUFunctions.srl) {
      io.result := io.op1 >> io.op2(4, 0)
    }
    is(ALUFunctions.sra) {
      io.result := (io.op1.asSInt >> io.op2(4, 0)).asUInt
    }
    is(ALUFunctions.sltu) {
      io.result := io.op1 < io.op2
    }
  }

The ALU performs operations on the input operands op1 and op2 according to the corresponding instruction.

Execute

  // Operand 1 is the instruction address (PC) or rs1; operand 2 is the immediate or rs2
  alu.io.op1 := Mux(io.aluop1_source === 1.U, io.instruction_address, io.reg1_data)
  alu.io.op2 := Mux(io.aluop2_source === 1.U, io.immediate, io.reg2_data)

  alu.io.func := alu_ctrl.io.alu_funct

  io.mem_alu_result := alu.io.result
  io.if_jump_flag := opcode === Instructions.jal ||
    (opcode === Instructions.jalr) ||
    (opcode === InstructionTypes.B) && MuxLookup(
      funct3,
      false.B,
      IndexedSeq(
        InstructionsTypeB.beq  -> (io.reg1_data === io.reg2_data),
        InstructionsTypeB.bne  -> (io.reg1_data =/= io.reg2_data),
        InstructionsTypeB.blt  -> (io.reg1_data.asSInt < io.reg2_data.asSInt),
        InstructionsTypeB.bge  -> (io.reg1_data.asSInt >= io.reg2_data.asSInt),
        InstructionsTypeB.bltu -> (io.reg1_data.asUInt < io.reg2_data.asUInt),
        InstructionsTypeB.bgeu -> (io.reg1_data.asUInt >= io.reg2_data.asUInt)
      )
    )
  io.if_jump_address := io.immediate + Mux(opcode === Instructions.jalr, io.reg1_data, io.instruction_address)
}

The Execute module instantiates the ALU and ALUControl. The computation itself is handled inside the ALU module, so the Execute module only needs to drive the ALU's input ports and determine whether to jump.

For unconditional jump instructions (e.g., jal and jalr), the jump is executed directly.
For branch instructions, the decision to jump is based on the corresponding jump condition.
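
The jump decision above can be mirrored by a small software model. This is a plain Scala sketch rather than the Chisel code; the object name and the funct3 constants are written out here as assumptions (they follow the standard RV32I branch encodings), and `target` reproduces the address selection of the Execute module as shown, without clearing bit 0 for jalr.

```scala
// Plain Scala model of the branch condition and jump target (illustrative only).
object BranchModel {
  // funct3 encodings of the RV32I branch instructions
  val BEQ = 0
  val BNE = 1
  val BLT = 4
  val BGE = 5
  val BLTU = 6
  val BGEU = 7

  // True when the branch with the given funct3 is taken for register values rs1, rs2.
  def taken(funct3: Int, rs1: Int, rs2: Int): Boolean = funct3 match {
    case BEQ  => rs1 == rs2
    case BNE  => rs1 != rs2
    case BLT  => rs1 < rs2                                // signed compare
    case BGE  => rs1 >= rs2                               // signed compare
    case BLTU => Integer.compareUnsigned(rs1, rs2) < 0    // unsigned compare
    case BGEU => Integer.compareUnsigned(rs1, rs2) >= 0   // unsigned compare
    case _    => false                                    // unknown funct3: no jump
  }

  // jalr jumps relative to rs1; jal and branches jump relative to the instruction address.
  def target(isJalr: Boolean, pc: Int, rs1: Int, imm: Int): Int =
    imm + (if (isJalr) rs1 else pc)
}
```

The signed and unsigned variants differ only for operands with the top bit set: with rs1 = -1 and rs2 = 1, blt is taken but bltu is not, because -1 is the largest unsigned 32-bit value.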

Memory Access

If the instruction is an L-type, memory_read_enable is set to 1.
If the instruction is an S-type, memory_write_enable is set to 1.

Based on different instructions, the corresponding read and write operations are performed.

WriteBack

In the writeback stage, regs_write_source is used to determine the value to be written, which can be one of the following:

alu_result
memory_read_data
instruction_address + 4.U
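
As a rough illustration, the selection can be modelled as a three-way choice. This is a plain Scala sketch; the `Source` type and its case names are hypothetical stand-ins for the encoded regs_write_source signal, whose actual bit values are not shown here.

```scala
// Plain Scala model of the write-back selection (illustrative only).
object WriteBackModel {
  sealed trait Source
  case object AluResult extends Source
  case object MemoryReadData extends Source
  case object NextInstructionAddress extends Source

  // Pick the value that is written to the destination register.
  def writeData(src: Source, aluResult: Int, memoryReadData: Int, instructionAddress: Int): Int =
    src match {
      case AluResult              => aluResult               // arithmetic, lui, auipc
      case MemoryReadData         => memoryReadData          // loads
      case NextInstructionAddress => instructionAddress + 4  // link address for jal/jalr
    }
}
```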

CPU.scala

package riscv.core

import chisel3._
import chisel3.util.Cat
import riscv.CPUBundle
import riscv.Parameters

class CPU extends Module {
  val io = IO(new CPUBundle)

  val regs       = Module(new RegisterFile)
  val inst_fetch = Module(new InstructionFetch)
  val id         = Module(new InstructionDecode)
  val ex         = Module(new Execute)
  val mem        = Module(new MemoryAccess)
  val wb         = Module(new WriteBack)

  io.deviceSelect := mem.io.memory_bundle
    .address(Parameters.AddrBits - 1, Parameters.AddrBits - Parameters.SlaveDeviceCountBits)

  inst_fetch.io.jump_address_id       := ex.io.if_jump_address
  inst_fetch.io.jump_flag_id          := ex.io.if_jump_flag
  inst_fetch.io.instruction_valid     := io.instruction_valid
  inst_fetch.io.instruction_read_data := io.instruction
  io.instruction_address              := inst_fetch.io.instruction_address

  regs.io.write_enable  := id.io.reg_write_enable
  regs.io.write_address := id.io.reg_write_address
  regs.io.write_data    := wb.io.regs_write_data
  regs.io.read_address1 := id.io.regs_reg1_read_address
  regs.io.read_address2 := id.io.regs_reg2_read_address

  regs.io.debug_read_address := io.debug_read_address
  io.debug_read_data         := regs.io.debug_read_data

  id.io.instruction := inst_fetch.io.instruction

  ex.io.instruction := inst_fetch.io.instruction
  ex.io.instruction_address := inst_fetch.io.instruction_address
  ex.io.reg1_data := regs.io.read_data1
  ex.io.reg2_data := regs.io.read_data2
  ex.io.immediate := id.io.ex_immediate
  ex.io.aluop1_source := id.io.ex_aluop1_source
  ex.io.aluop2_source := id.io.ex_aluop2_source

  mem.io.alu_result          := ex.io.mem_alu_result
  mem.io.reg2_data           := regs.io.read_data2
  mem.io.memory_read_enable  := id.io.memory_read_enable
  mem.io.memory_write_enable := id.io.memory_write_enable
  mem.io.funct3              := inst_fetch.io.instruction(14, 12)

  io.memory_bundle.address := Cat(
    0.U(Parameters.SlaveDeviceCountBits.W),
    mem.io.memory_bundle.address(Parameters.AddrBits - 1 - Parameters.SlaveDeviceCountBits, 0)
  )
  io.memory_bundle.write_enable  := mem.io.memory_bundle.write_enable
  io.memory_bundle.write_data    := mem.io.memory_bundle.write_data
  io.memory_bundle.write_strobe  := mem.io.memory_bundle.write_strobe
  mem.io.memory_bundle.read_data := io.memory_bundle.read_data

  wb.io.instruction_address := inst_fetch.io.instruction_address
  wb.io.alu_result          := ex.io.mem_alu_result
  wb.io.memory_read_data    := mem.io.wb_memory_read_data
  wb.io.regs_write_source   := id.io.wb_reg_write_source
}

In CPU.scala, all components are instantiated and connected together.

Three Stage Pipeline

Using the IF2ID and ID2EX pipeline registers, the pipeline is divided into three stages:

Instruction Fetch (IF): Calculates the program counter (PC) and retrieves data from the instruction memory.
Instruction Decode (ID): Decodes the incoming instruction, converts it into control signals, and passes the decoded data to the register file and immediate generator (imm).
Execute (EX): Performs the ALU operation, memory access, and write-back, all within this single stage.

Hazard

The hazards in a pipeline can be divided into data hazards and control hazards:

Data Hazard
Because the ALU operation, memory access, and write-back all complete together in the combined EX stage, no data hazard occurs at the decode (ID) stage. This removes the need for forwarding, but it lengthens the critical path: the slowest stage now has a delay of EX + MEM + WB, which limits the clock frequency.

Example:

instruction	1	2	3	4
add x1, x2, x3	IF	ID	EX/MEM/WB
sub x4, x5, x1		IF	ID	EX/MEM/WB

Control Hazard
Control hazards occur under the following scenarios:
- When a jump instruction is executed in the EX stage.
- When a branch instruction is executed in the EX stage and the branch condition is satisfied.
- When an interrupt occurs in the EX stage due to the reception of an InterruptAssert signal from the CLINT.

In these cases, the EX stage sends a jump signal (jump_flag and jump_address) to the IF stage. However, before jump_address is written to the program counter (PC), the IF and ID stages may still hold instructions from the wrong path whose results have not been written to any register. The corresponding pipeline registers must therefore be flushed to discard these invalid instructions.

Compared to the single-cycle design, four new files have been added: PipelineRegister.scala, Control.scala, IF2ID.scala, and ID2EX.scala.

PipelineRegister.scala

package riscv.core

import chisel3._
import riscv.Parameters

class PipelineRegister(width: Int = Parameters.DataBits, defaultValue: UInt = 0.U) extends Module {
  val io = IO(new Bundle {
    val stall = Input(Bool())
    val flush = Input(Bool())
    val in    = Input(UInt(width.W))
    val out   = Output(UInt(width.W))
  })
  val myreg = RegInit(UInt(width.W), defaultValue)
  val out   = RegInit(UInt(width.W), defaultValue)
  when(io.flush) {
    out   := defaultValue
    myreg := defaultValue
  }
    .elsewhen(io.stall) {
      out := myreg
    }
    .otherwise {
      myreg := io.in
      out   := io.in
    }
  io.out := out
}

This module acts as a buffer between pipeline stages and splits the combinational logic. On each clock edge it resets to its default value (flush), holds its current value (stall), or latches the new input, with flush taking priority over stall.
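
The priority between flush, stall, and normal update can be seen in a cycle-level software model. This is a plain Scala sketch, not the Chisel module; `PipeReg` and `tick` are hypothetical names, and a single value stands in for the pair of registers used in the hardware.

```scala
// Cycle-level model of a pipeline register's flush/stall behaviour (illustrative only).
final case class PipeReg(value: Int, default: Int) {
  // Returns the register state after one clock edge.
  def tick(in: Int, stall: Boolean, flush: Boolean): PipeReg =
    if (flush) copy(value = default) // flush wins: reset to the default (e.g. a nop)
    else if (stall) this             // stall: hold the current value
    else copy(value = in)            // normal operation: latch the input
}
```

Starting from `PipeReg(0x13, 0x13)`, a normal tick with input 0xAB stores 0xAB; a following tick with stall set keeps 0xAB; a tick with both stall and flush set returns to 0x13.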

Control.scala

package riscv.core.threestage

import chisel3._

class Control extends Module {
  val io = IO(new Bundle {
    val JumpFlag = Input(Bool())
    val Flush    = Output(Bool())
  })
  io.Flush := io.JumpFlag
}

This part will determine when to perform a flush based on the jump flag.

IF2ID.scala & ID2EX.scala

///// IF2ID
package riscv.core.threestage

import chisel3._
import riscv.core.PipelineRegister
import riscv.Parameters

class IF2ID extends Module {
  val io = IO(new Bundle {
    val flush               = Input(Bool())
    val instruction         = Input(UInt(Parameters.InstructionWidth))
    val instruction_address = Input(UInt(Parameters.AddrWidth))
    val interrupt_flag      = Input(UInt(Parameters.InterruptFlagWidth))

    val output_instruction         = Output(UInt(Parameters.DataWidth))
    val output_instruction_address = Output(UInt(Parameters.AddrWidth))
    val output_interrupt_flag      = Output(UInt(Parameters.InterruptFlagWidth))
  })

  val stall = false.B

  val instruction = Module(new PipelineRegister(defaultValue = InstructionsNop.nop))
  instruction.io.in     := io.instruction
  instruction.io.stall  := stall
  instruction.io.flush  := io.flush
  io.output_instruction := instruction.io.out

  val instruction_address = Module(new PipelineRegister(defaultValue = ProgramCounter.EntryAddress))
  instruction_address.io.in     := io.instruction_address
  instruction_address.io.stall  := stall
  instruction_address.io.flush  := io.flush
  io.output_instruction_address := instruction_address.io.out

  val interrupt_flag = Module(new PipelineRegister(Parameters.InterruptFlagBits))
  interrupt_flag.io.in     := io.interrupt_flag
  interrupt_flag.io.stall  := stall
  interrupt_flag.io.flush  := io.flush
  io.output_interrupt_flag := interrupt_flag.io.out
}

Both modules instantiate PipelineRegister and pass the outputs of the previous stage to the next stage through it, with stall and flush support. Only IF2ID is shown above.

Five Stage Pipeline

Stall

Stall Pipeline Example

add t0, t1, t2
or t3, t4, t5
slt t6, t0, t3

Cycle	1	2	3	4	5	6	7
IF	add	or	slt
ID		add	or	slt
EX			add	or	slt
MEM				add	or	slt
WB					add	or	slt

When an instruction in the ID stage reads a register that an instruction still in the EX or MEM stage will write, a data hazard occurs. In the table above, slt t6, t0, t3 enters the ID stage in cycle 4, when add t0, t1, t2 has only reached the MEM stage and or t3, t4, t5 is still in the EX stage. Neither result has been written back yet, so slt would read stale values of t0 and t3.

Cycle	1	2	3	4	5	6	7	8	9
IF	add	or	slt
ID		add	or	slt	slt	slt
EX			add	or	nop	nop	slt
MEM				add	or	nop	nop	slt
WB					add	or	nop	nop	slt

By inserting nop instructions between the instructions and stalling the PC and IF2ID registers, the slt instruction can correctly read the value of t0. It is crucial to ensure that while keeping the IF and ID stages unchanged, the ID2EX register is cleared to insert a blank instruction ("bubble") in the EX stage. Otherwise, the instruction in the ID stage will continue into the EX stage.

This logic is implemented in Control.scala. A data hazard occurs if a source register (rs1_id, rs2_id) of the instruction in the ID stage matches the destination register (rd_ex, rd_mem) of an instruction in the EX or MEM stage that writes a register other than x0. When a data hazard is detected:

io.id_flush := true.B: Flush the ID2EX register.
io.pc_stall := true.B: Stall the program counter to prevent fetching a new instruction.
io.if_stall := true.B: Stall the IF stage to hold the current instruction.

When a jump instruction (jump_flag) is detected:

io.if_flush := true.B: Flush the IF2ID register.
io.id_flush := true.B: Flush the ID2EX register.

The two instructions fetched after the jump lie on the wrong path and must not be executed, so both the IF2ID and ID2EX registers are flushed.

package riscv.core.fivestage_stall

import chisel3._
import riscv.Parameters

class Control extends Module {
  val io = IO(new Bundle {
    val jump_flag            = Input(Bool())                                     // ex.io.if_jump_flag
    val rs1_id               = Input(UInt(Parameters.PhysicalRegisterAddrWidth)) // id.io.regs_reg1_read_address
    val rs2_id               = Input(UInt(Parameters.PhysicalRegisterAddrWidth)) // id.io.regs_reg2_read_address
    val rd_ex                = Input(UInt(Parameters.PhysicalRegisterAddrWidth)) // id2ex.io.output_regs_write_address
    val reg_write_enable_ex  = Input(Bool())                                     // id2ex.io.output_regs_write_enable
    val rd_mem               = Input(UInt(Parameters.PhysicalRegisterAddrWidth)) // ex2mem.io.output_regs_write_address
    val reg_write_enable_mem = Input(Bool())                                     // ex2mem.io.output_regs_write_enable

    val if_flush = Output(Bool())
    val id_flush = Output(Bool())
    val pc_stall = Output(Bool())
    val if_stall = Output(Bool())
  })

  io.if_flush := false.B
  io.id_flush := false.B
  io.pc_stall := false.B
  io.if_stall := false.B
  when(io.jump_flag) {
    io.if_flush := true.B
    io.id_flush := true.B
  }.elsewhen(
    (io.reg_write_enable_ex && (io.rd_ex === io.rs1_id || io.rd_ex === io.rs2_id) && io.rd_ex =/= 0.U)
      || (io.reg_write_enable_mem && (io.rd_mem === io.rs1_id || io.rd_mem === io.rs2_id) && io.rd_mem =/= 0.U)
  ) {
    io.id_flush := true.B
    io.pc_stall := true.B
    io.if_stall := true.B
  }
}
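
To see which cycles stall, the same conditions can be evaluated in software. The following plain Scala sketch mirrors the when/elsewhen structure of the Control module above; `StallControl`, `Signals`, and the parameter names are hypothetical.

```scala
// Plain Scala model of the stall-based hazard control (illustrative only).
object StallControl {
  final case class Signals(ifFlush: Boolean, idFlush: Boolean, pcStall: Boolean, ifStall: Boolean)

  def control(jump: Boolean, rs1Id: Int, rs2Id: Int,
              rdEx: Int, weEx: Boolean, rdMem: Int, weMem: Boolean): Signals = {
    // A producer matters only if it writes a register other than x0 that ID reads.
    def dependsOn(rd: Int, we: Boolean): Boolean =
      we && rd != 0 && (rd == rs1Id || rd == rs2Id)

    if (jump)
      Signals(ifFlush = true, idFlush = true, pcStall = false, ifStall = false)
    else if (dependsOn(rdEx, weEx) || dependsOn(rdMem, weMem))
      Signals(ifFlush = false, idFlush = true, pcStall = true, ifStall = true)
    else
      Signals(ifFlush = false, idFlush = false, pcStall = false, ifStall = false)
  }
}
```

With slt t6, t0, t3 in ID (rs1 = x5, rs2 = x28), or in EX (rd = x28) and add in MEM (rd = x5), `control(false, 5, 28, 28, true, 5, true)` requests a stall. One cycle later the bubble is in EX and or is in MEM, and `control(false, 5, 28, 0, false, 28, true)` still requests a stall because t3 has not been written back.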

Forwarding

Using stalls can resolve data hazard issues; however, this approach involves a significant amount of bubbling, which reduces execution efficiency. To address this, forwarding can be used instead to transfer data to the dependent instruction, avoiding wasted clock cycles.

Forwarding Pipeline Example

0000: addi x1, x0, 1
0004: sub x2, x0, x1
0008: and x2, x1, x2
000C: lw x2, 4(x2)
0010: or x3, x1, x2

clock cycle	0	1	2	3	4	5	6	7
IF	addi	sub	and	lw	or
ID		addi	sub	and	lw	or
EX			addi	sub	and	lw	or
EX2MEM				addi:x1	sub:x2	and:x2
MEM				addi	sub	and	lw	nop
MEM2WB					addi:x1	sub:x2	and:x2	lw:x2
WB					addi	sub	and	lw

In the example above, the instruction sub x2, x0, x1 depends on the result of the previous instruction, addi, which has not yet been written back to the register file. With forwarding, the addi result is passed directly from the EX/MEM register to sub. A load is different: when an instruction needs data loaded from memory by the immediately preceding instruction (here, or x3, x1, x2 after lw x2, 4(x2)), the data only becomes available in the MEM stage, so forwarding cannot resolve the hazard immediately and a stall is still required.
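
The selection logic can be sketched as follows. This is a plain Scala model of the textbook forwarding scheme, not the lab's forwarding module, which may differ in detail; all names (`Forwarding`, `select`, `loadUseStall`, and the `Source` cases) are assumptions made for illustration.

```scala
// Plain Scala model of textbook forwarding and load-use detection (illustrative only).
object Forwarding {
  sealed trait Source
  case object FromRegisterFile extends Source
  case object FromExMem extends Source
  case object FromMemWb extends Source

  // Choose where a source operand of the instruction in EX should come from.
  def select(rs: Int, rdMem: Int, weMem: Boolean, rdWb: Int, weWb: Boolean): Source =
    if (weMem && rdMem != 0 && rdMem == rs) FromExMem      // newest result wins
    else if (weWb && rdWb != 0 && rdWb == rs) FromMemWb    // older result
    else FromRegisterFile                                  // no dependency (or x0)

  // A load in EX whose destination is read by the instruction in ID still needs a stall.
  def loadUseStall(memReadEx: Boolean, rdEx: Int, rs1Id: Int, rs2Id: Int): Boolean =
    memReadEx && rdEx != 0 && (rdEx == rs1Id || rdEx == rs2Id)
}
```

For sub x2, x0, x1 directly after addi x1, x0, 1, `select(1, 1, true, 0, false)` returns `FromExMem`. For or x3, x1, x2 directly after lw x2, 4(x2), `loadUseStall(true, 2, 1, 2)` is true.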

Comparison of Forwarding and Stall

Method	Description	Performance
Forwarding	Resolves data hazards by directly passing data, avoiding pipeline stalls	Improves instruction throughput but increases hardware complexity.
Stall	Inserts a bubble to allow the pipeline to wait for data to be ready before proceeding.	Reduces performance but is simpler to implement.

M extension

The M extension instructions use the R-type format and share its opcode. The extension adds eight instructions: mul, mulh, mulhsu, mulhu, div, divu, rem, and remu.

The main distinction between the M extension and standard R-type instructions lies in the value of the funct7 field. For the M extension, the funct7 field is always 0000001.

To handle this in the ALU control logic, the processing of R-type instructions should first differentiate instructions based on the value of funct7. Specifically:

If funct7 is 0000001: The instruction belongs to the M extension, and the ALU should perform operations such as multiplication, division, or remainder computation.
Otherwise: The instruction belongs to standard R-type, and the ALU should handle operations like addition, subtraction, logical operations, etc.

By distinguishing between the M extension and standard R-type instructions at this stage, the ALU control logic can correctly execute the required operation based on the instruction's functionality.

ALUControl.scala

    is(InstructionTypes.RM) {
        when(io.funct7 === "b0000001".U) {
          // M Extension 
          io.alu_funct := MuxLookup(
            io.funct3,
            ALUFunctions.zero,
            IndexedSeq(
              InstructionsTypeM.mul     -> ALUFunctions.mul,
              InstructionsTypeM.mulh    -> ALUFunctions.mulh,
              InstructionsTypeM.mulhsu  -> ALUFunctions.mulhsu,  
              InstructionsTypeM.mulhum  -> ALUFunctions.mulhu,
              InstructionsTypeM.div     -> ALUFunctions.div,
              InstructionsTypeM.divu    -> ALUFunctions.divu,
              InstructionsTypeM.rem     -> ALUFunctions.rem,   
              InstructionsTypeM.remu    -> ALUFunctions.remu
            )
          )
        }.otherwise {
          // R Type
          io.alu_funct := MuxLookup(
            io.funct3,
            ALUFunctions.zero,
            IndexedSeq(
              InstructionsTypeR.add_sub -> Mux(io.funct7(5), ALUFunctions.sub, ALUFunctions.add),
              InstructionsTypeR.sll     -> ALUFunctions.sll,
              InstructionsTypeR.slt     -> ALUFunctions.slt,
              InstructionsTypeR.sltu    -> ALUFunctions.sltu,
              InstructionsTypeR.xor     -> ALUFunctions.xor,
              InstructionsTypeR.or      -> ALUFunctions.or,
              InstructionsTypeR.and     -> ALUFunctions.and,
              InstructionsTypeR.sr      -> Mux(io.funct7(5), ALUFunctions.sra, ALUFunctions.srl)
            )
          )
        }

    }

ALU.scala

object ALUFunctions extends ChiselEnum {
  val zero, add, sub, sll, slt, xor, or, and, srl, sra, sltu, mul, mulh, mulhsu, mulhu, div, divu, rem, remu = Value
}

First, add the M extension operations to the ALUFunctions enumeration in ALU.scala.

    is(ALUFunctions.mul) {
      // Low 32 bits of the product
      io.result := (io.op1 * io.op2)(31, 0)
    }
    is(ALUFunctions.mulh) {
      // High 32 bits of signed x signed
      io.result := ((io.op1.asSInt * io.op2.asSInt) >> 32).asUInt
    }
    is(ALUFunctions.mulhsu) {
      // High 32 bits of signed x unsigned
      io.result := ((io.op1.asSInt * io.op2) >> 32).asUInt
    }
    is(ALUFunctions.mulhu) {
      // High 32 bits of unsigned x unsigned
      io.result := ((io.op1 * io.op2) >> 32).asUInt
    }
    is(ALUFunctions.div) {
      // Division by zero returns all ones (-1)
      io.result := Mux(io.op2 === 0.U, "hFFFFFFFF".U, (io.op1.asSInt / io.op2.asSInt).asUInt)
    }
    is(ALUFunctions.divu) {
      // Division by zero returns all ones (2^32 - 1)
      io.result := Mux(io.op2 === 0.U, "hFFFFFFFF".U, io.op1 / io.op2)
    }
    is(ALUFunctions.rem) {
      // Remainder of division by zero returns the dividend
      io.result := Mux(io.op2 === 0.U, io.op1, (io.op1.asSInt % io.op2.asSInt).asUInt)
    }
    is(ALUFunctions.remu) {
      // Remainder of division by zero returns the dividend
      io.result := Mux(io.op2 === 0.U, io.op1, io.op1 % io.op2)
    }

In the ALU.scala file, add the computation logic for each M extension instruction.
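
When writing tests for these operations, a software reference model is handy for computing expected values, especially for the division-by-zero and signed-overflow cases defined by the RISC-V specification. The following is a plain Scala sketch of such a model; `MExtModel` and its helper `u` are hypothetical names and not part of the lab code.

```scala
// Plain Scala reference model of the RV32M operations (illustrative only).
object MExtModel {
  // Zero-extend a 32-bit value to 64 bits.
  private def u(x: Int): Long = x & 0xffffffffL

  def mul(a: Int, b: Int): Int    = a * b                                 // low 32 bits
  def mulh(a: Int, b: Int): Int   = ((a.toLong * b.toLong) >> 32).toInt   // signed x signed
  def mulhsu(a: Int, b: Int): Int = ((a.toLong * u(b)) >> 32).toInt       // signed x unsigned
  def mulhu(a: Int, b: Int): Int  = ((u(a) * u(b)) >>> 32).toInt          // unsigned x unsigned

  // Division by zero yields all ones; MinValue / -1 overflows and yields MinValue.
  def div(a: Int, b: Int): Int =
    if (b == 0) -1
    else if (a == Int.MinValue && b == -1) Int.MinValue
    else a / b

  def divu(a: Int, b: Int): Int =
    if (b == 0) -1 else Integer.divideUnsigned(a, b)

  // Remainder of division by zero yields the dividend; the overflow case yields 0.
  def rem(a: Int, b: Int): Int =
    if (b == 0) a
    else if (a == Int.MinValue && b == -1) 0
    else a % b

  def remu(a: Int, b: Int): Int =
    if (b == 0) a else Integer.remainderUnsigned(a, b)
}
```

For instance, `MExtModel.mulhu(-1, -1)` returns 0xfffffffe (as a signed Int, -2), while `MExtModel.mulh(-1, -1)` returns 0, which shows why the three high-half multiplies need separate handling.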

Test

test1 from quiz2 problem A

original code

.text
    la a0, multiplier         # Load multiplier address
    lw a1, 0(a0)              # Load multiplier value
    la a2, multiplicand       # Load multiplicand address
    lw a3, 0(a2)              # Load multiplicand value
    li t0, 0                  # Initialize accumulator
    li t1, 32                 # Set bit counter (#A01)

    # Check for negative values
    bltz a1, handle_negative1 # If multiplier negative (#A02)
    j shift_and_add_loop      # Skip to main loop (#A05)
    bltz a3, handle_negative2 # If multiplicand negative (#A03)
    j shift_and_add_loop      # Continue to main loop (#A04)

handle_negative1:
    neg a1, a1                # Make multiplier positive

handle_negative2:
    neg a3, a3                # Make multiplicand positive

shift_and_add_loop:
    beqz t1, end_shift_and_add # Exit if bit count is zero
    andi t2, a1, 1            # Check least significant bit (#A06)
    beqz t2, skip_add         # Skip add if bit is 0
    add t0, t0, a3            # Add to accumulator

skip_add:
    srai a1, a1, 1            # Right shift multiplier
    slli a3, a3, 1            # Left shift multiplicand
    addi t1, t1, -1           # Decrease bit counter
    j shift_and_add_loop      # Repeat loop (#A07)

end_shift_and_add:
    la a4, result             # Load result address
    sw t0, 0(a4)              # Store final result (#A08)

code after modification

.text
    la a0, multiplier         # Load multiplier address
    lw a1, 0(a0)              # Load multiplier value
    la a2, multiplicand       # Load multiplicand address
    lw a3, 0(a2)              # Load multiplicand value

    mul t0, a1, a3            # Perform multiplication (t0 = a1 * a3)

    la a4, result             # Load result address
    sw t0, 0(a4)              # Store final result

Using mul replaces the whole shift-and-add loop with a single instruction.
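
To illustrate why a single mul is equivalent, here is a plain Scala sketch of the shift-and-add idea. It is not a line-by-line translation of the assembly above: it skips the sign handling and instead relies on 32-bit wrap-around arithmetic, which yields the same low 32 bits as a multiply for any pair of signed inputs. `ShiftAdd.multiply` is a hypothetical name.

```scala
// Plain Scala sketch of 32-step shift-and-add multiplication (illustrative only).
object ShiftAdd {
  def multiply(a: Int, b: Int): Int = {
    var multiplier = a
    var multiplicand = b
    var acc = 0
    for (_ <- 0 until 32) {
      // Add the shifted multiplicand whenever the current multiplier bit is 1.
      if ((multiplier & 1) != 0) acc += multiplicand
      multiplier = multiplier >>> 1   // move to the next multiplier bit
      multiplicand = multiplicand << 1 // weight of the next bit doubles
    }
    acc // equals the low 32 bits of a * b
  }
}
```

The loop costs 32 iterations of several instructions each, whereas mul produces the same low 32 bits in one instruction.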

test2

.globl _start

_start:

    # Set up the initial value for a0
    addi a0, x0, 8        # a0 = 8

    # Multiply a0 by itself (a1 = a0 * a0)
    mul a1, a0, a0        # a1 = a0 * a0
    
    # Division of a1 by a0 (a2 = a1 / a0)
    div a2, a1, a0        # a2 = a1 / a0 (integer division)
    
    # Unsigned division of a1 by a0 (a3 = a1 / a0)
    divu a3, a1, a0       # a3 = a1 / a0 (unsigned division)

    # Signed remainder of a1 divided by a0 (a4 = a1 % a0)
    rem a4, a1, a0        # a4 = a1 % a0 (signed remainder)

    # Unsigned remainder of a1 divided by a0 (a5 = a1 % a0)
    remu a5, a1, a0       # a5 = a1 % a0 (unsigned remainder)

loop:
    j loop

Test the various instructions of the M extension.

test3

.globl _start
_start:
        
        addi t1, x0, 0        # t1 = low = 0
        addi t2, x0, 100      # t2 = high = N
        addi t0, x0, 100      # t0 = N
        addi t5, x0, 2        # t5 = 2 (divisor for the midpoint)
binary_search:
        bgt t1, t2, loop

        # mid = (low + high) / 2
        add t3, t1, t2        # t3 = low + high
        div t3, t3, t5        # t3 = mid = (low + high) / 2

        mul t4, t3, t3        # t4 = mid * mid
        blt t4, t0, set_low   
        beq t4, t0, set_result 

        add t2, t3, x0            
        addi t2, t2, -1
        j binary_search

set_low:
        # low = mid + 1
        addi t1, t3, 1
        j binary_search

set_result:

        add t1, t3,x0
loop:
    j loop

This program computes a square root by binary search using only integer instructions, which avoids the complications of floating-point arithmetic. When N is a perfect square (such as 100), the search stops at the exact root and leaves it in t1. Otherwise the loop ends with low > high, and t1 holds low, which is one more than the floor of the square root.
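
The expected register value can be cross-checked with a software version of the same search. This plain Scala sketch mirrors the control flow of test3 and returns the value left in t1; `SqrtSearch.run` is a hypothetical name, and the model assumes N is small enough that mid * mid does not overflow 32 bits.

```scala
// Plain Scala model of the binary search in test3 (illustrative only).
object SqrtSearch {
  // Returns the value left in t1 when the program reaches `loop`.
  def run(n: Int): Int = {
    var low = 0
    var high = n
    while (low <= high) {
      val mid = (low + high) / 2
      val sq = mid * mid
      if (sq == n) return mid         // set_result: exact root found
      else if (sq < n) low = mid + 1  // set_low
      else high = mid - 1             // fall-through path: high = mid - 1
    }
    low // reached when n is not a perfect square
  }
}
```

`SqrtSearch.run(100)` returns 10, which matches the value the test expects in t1, while `SqrtSearch.run(10)` returns 4, one more than the floor of the square root.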

Convert file format.

riscv32-unknown-elf-as -o test.o test.s
riscv32-unknown-elf-ld -o test.elf -T link.lds test.o
riscv32-unknown-elf-objcopy -O binary test.elf test.asmbin

The commands use the riscv32-unknown-elf toolchain to convert a .s file into a .asmbin file.

testfile in scala

Three stage

it should "test multiplication" in {
  test(new TestTopModule("test.asmbin", ImplementationType.ThreeStage)).withAnnotations(TestAnnotations.annos) {
    c =>
      c.clock.step(1000)
      c.io.regs_debug_read_address.poke(5.U) 
      c.io.regs_debug_read_data.expect(BigInt("4294967233").U)
      val realData = c.io.regs_debug_read_data.peek().litValue
      // Reinterpret the unsigned 32-bit value as a signed integer (4294967233 is -63)
      val signedData = if (realData >= (1L << 31)) {
        realData - (1L << 32)
      } else {
        realData
      }
      println(s"[Check t0] real = $signedData")
      c.clock.step(1) 
   }
  }
  it should "test multiplication, division, unsigned division, and remainders with an infinite loop" in {
  test(new TestTopModule("test2.asmbin", ImplementationType.ThreeStage)).withAnnotations(TestAnnotations.annos) {
    c =>
      // Set the simulation clock steps
      c.clock.step(1000)

      // Verify register a0 (should be 8)
      c.io.regs_debug_read_address.poke(10.U) // a0 corresponds to register number 10
      c.io.regs_debug_read_data.expect(8.U)
      c.clock.step()
      // Verify register a1 (should be 8 * 8 = 64)
      c.io.regs_debug_read_address.poke(11.U) // a1 corresponds to register number 11
      c.io.regs_debug_read_data.expect(64.U)
      c.clock.step()
      // Verify register a2 (should be a1 / a0 = 64 / 8 = 8)
      c.io.regs_debug_read_address.poke(12.U) // a2 corresponds to register number 12 
      c.io.regs_debug_read_data.expect(8.U)
      c.clock.step()
      // Verify register a3 (should be a1 / a0 using unsigned division = 64 / 8 = 8)
      c.io.regs_debug_read_address.poke(13.U) // a3 corresponds to register number 13
      c.io.regs_debug_read_data.expect(8.U)
      c.clock.step()
      // Verify register a4 (should be a1 % a0 = 64 % 8 = 0)
      c.io.regs_debug_read_address.poke(14.U) // a4 corresponds to register number 14
      c.io.regs_debug_read_data.expect(0.U)
      c.clock.step()
      // Verify register a5 (should be a1 % a0 using unsigned remainder = 64 % 8 = 0)
      c.io.regs_debug_read_address.poke(15.U) // a5 corresponds to register number 15
      c.io.regs_debug_read_data.expect(0.U)

      // Check that the simulator enters an infinite loop
      c.clock.step(1) // Advance one more clock cycle to confirm system stability
  }
 }
   it should "correctly calculate the integer square root of 100" in {
    test(new TestTopModule("test3.asmbin", ImplementationType.ThreeStage)).withAnnotations(TestAnnotations.annos) { c =>
      
      for (_ <- 0 until 1000) {
        c.clock.step()
      }

      c.io.regs_debug_read_address.poke(6.U) 
      c.clock.step()
      c.io.regs_debug_read_data.expect(10.U)
    }
  }

five stage

  it should "test multiplication" in {
  test(new TestTopModule("test.asmbin", ImplementationType.FiveStageFinal)).withAnnotations(TestAnnotations.annos) {
    c =>
      c.clock.step(1000)

      c.io.regs_debug_read_address.poke(5.U) 
      c.io.regs_debug_read_data.expect(BigInt("4294967233").U)
      val realData = c.io.regs_debug_read_data.peek().litValue
      // Reinterpret the unsigned 32-bit value as a signed integer (4294967233 is -63)
      val signedData = if (realData >= (1L << 31)) {
        realData - (1L << 32)
      } else {
        realData
      }
      println(s"[Check t0] real = $signedData")
      c.clock.step(1) 
   }
  }
  it should "test multiplication, division, unsigned division, and remainders with an infinite loop" in {
  test(new TestTopModule("test2.asmbin", ImplementationType.FiveStageFinal)).withAnnotations(TestAnnotations.annos) {
    c =>
      // Set the simulation clock steps
      c.clock.step(1000)

      // Verify register a0 (should be 8)
      c.io.regs_debug_read_address.poke(10.U) // a0 corresponds to register number 10
      c.io.regs_debug_read_data.expect(8.U)
      c.clock.step()
      // Verify register a1 (should be 8 * 8 = 64)
      c.io.regs_debug_read_address.poke(11.U) // a1 corresponds to register number 11
      c.io.regs_debug_read_data.expect(64.U)
      c.clock.step()
      // Verify register a2 (should be a1 / a0 = 64 / 8 = 8)
      c.io.regs_debug_read_address.poke(12.U) // a2 corresponds to register number 12 
      c.io.regs_debug_read_data.expect(8.U)
      c.clock.step()
      // Verify register a3 (should be a1 / a0 using unsigned division = 64 / 8 = 8)
      c.io.regs_debug_read_address.poke(13.U) // a3 corresponds to register number 13
      c.io.regs_debug_read_data.expect(8.U)
      c.clock.step()
      // Verify register a4 (should be a1 % a0 = 64 % 8 = 0)
      c.io.regs_debug_read_address.poke(14.U) // a4 corresponds to register number 14
      c.io.regs_debug_read_data.expect(0.U)
      c.clock.step()
      // Verify register a5 (should be a1 % a0 using unsigned remainder = 64 % 8 = 0)
      c.io.regs_debug_read_address.poke(15.U) // a5 corresponds to register number 15
      c.io.regs_debug_read_data.expect(0.U)

      // Check that the simulator enters an infinite loop
      c.clock.step(1) // Advance one more clock cycle to confirm system stability
  }
 }
   it should "correctly calculate the integer square root of 100" in {
    test(new TestTopModule("test3.asmbin", ImplementationType.FiveStageFinal)).withAnnotations(TestAnnotations.annos) { c =>
      
      for (_ <- 0 until 1000) {
        c.clock.step()
      }

      c.io.regs_debug_read_address.poke(6.U) 
      c.clock.step()
      c.io.regs_debug_read_data.expect(10.U)
    }
  }
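
The five-stage multiplication test expects 4294967233 in register 5 and prints it as a signed number. A quick Python check of the same two's-complement reinterpretation (the helper name is only for illustration) shows which signed value that is:

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value

print(to_signed32(4294967233))  # -63
```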

Refer to the test files in the reference documents to complete the test programs for the three-stage and five-stage implementations.

Output

riscv-arch-test

The recommended installation environment from the official website requires Python 3.6. To avoid conflicts with the local environment, I used conda to create a virtual environment:

$ source ~/miniconda3/bin/activate
$ conda create -n RISCOF python=3.6
$ conda activate RISCOF
$ conda deactivate                   # leave the environment when finished
$ conda remove --name RISCOF --all   # delete the environment once it is no longer needed

Install RISCOF

$ pip3 install git+https://github.com/riscv/riscof.git
$ cd riscv-ctg
$ pip3 install --editable .
$ cd riscv-isac
$ pip3 install --editable .

The above are the installation commands provided in the GitHub repository. However, when following the commands to download the resources from GitHub, I noticed that the riscv-ctg and riscv-isac directories were not present.

I then checked the official RISCOF website for installation instructions and found that a single command is enough: pip install riscof

To confirm the installation, I ran riscof --help.
If the installation succeeded, it prints the following message:

Usage: riscof [OPTIONS] COMMAND [ARGS]...

    Options:
      --version                       Show the version and exit.
      -v, --verbose [info|error|debug]
                                      Set verbose level
      --help                          Show this message and exit.

    Commands:
      arch-test     Setup and maintenance for Architectural TestSuite.
      coverage      Run the tests on DUT and reference and compare signatures
      gendb         Generate Database for the Suite.
      run           Run the tests on DUT and reference and compare signatures
      setup         Initiate Setup for riscof.
      testlist      Generate the test list for the given DUT and suite.
      validateyaml  Validate the Input YAMLs using riscv-config.

Install RISCV-GNU Toolchain

The following installation steps are provided on the official website:

$ sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev \
      libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool \
      patchutils bc zlib1g-dev libexpat-dev
$ git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
$ git clone --recursive https://github.com/riscv/riscv-opcodes.git
$ cd riscv-gnu-toolchain
$ ./configure --prefix=/path/to/install --with-arch=rv32gc --with-abi=ilp32d # for 32-bit toolchain
$ [sudo] make # sudo is required depending on the path chosen in the previous setup

However, the command git clone --recursive https://github.com/riscv/riscv-gnu-toolchain failed with an error. Following the riscv-gnu-toolchain GitHub repository, I dropped the --recursive option and simply ran git clone https://github.com/riscv/riscv-gnu-toolchain

After making this adjustment, I followed the rest of the steps as described above.
Finally, to verify the installation, I ran riscv32-unknown-elf-gcc --version, which printed:

lhh@lhh-OptiPlex-Tower-Plus-7020:~$ riscv32-unknown-elf-gcc --version
riscv32-unknown-elf-gcc () 14.2.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Install Spike and Sail

$ sudo apt-get install device-tree-compiler
$ git clone https://github.com/riscv-software-src/riscv-isa-sim.git
$ cd riscv-isa-sim
$ mkdir build
$ cd build
$ ../configure --prefix=/path/to/install
$ make
$ [sudo] make install

$ sudo apt-get install libgmp-dev pkg-config zlib1g-dev curl
$ curl --location https://github.com/rems-project/sail/releases/download/0.18-linux-binary/sail.tar.gz | [sudo] tar xvz --directory=/path/to/install --strip-components=1
$ git clone https://github.com/riscv/sail-riscv.git
$ cd sail-riscv
$ ARCH=RV32 make
$ ARCH=RV64 make

Testing the CPU with riscv-arch-test Using RISCOF

Setting Up RISCOF with Reference Model and Device Under Test

riscof setup --refname=sail_cSim --dutname=my_dut

The above command generates the following folder structure:

├──config.ini                   # configuration file for riscof
├──my_dut/                     # DUT plugin templates
   ├── env
   │   ├── link.ld              # DUT linker script
   │   └── model_test.h         # DUT specific header file
   ├── riscof_my_dut.py        # DUT python plugin
   ├── my_dut_isa.yaml         # DUT ISA yaml based on riscv-config
   └── my_dut_platform.yaml    # DUT Platform yaml based on riscv-config
├──sail_cSim/                   # reference plugin templates
   ├── env
   │   ├── link.ld              # Reference linker script
   │   └── model_test.h         # Reference model specific header file
   ├── __init__.py
   └── riscof_sail_cSim.py      # Reference model python plugin.

Program Changes

Below are the changes made to the riscof_my_dut.py file. First, I replaced the hardcoded ELF file name with a path (output.elf) built inside test_dir. I also added a step that uses riscv32-unknown-elf-objcopy to convert the ELF file into a raw binary (asmbin). The original simcmd was replaced with an sbt-based command that receives the ELF file and signature file paths as arguments. Finally, I extended the execute command to run objcopy_cmd and then change to the project directory (/home/lhh/computer_arch/final_lab/riscv-core) before running simcmd.

def runTests(self, testList):

      # Delete Makefile if it already exists.
      if os.path.exists(self.work_dir+ "/Makefile." + self.name[:-1]):
            os.remove(self.work_dir+ "/Makefile." + self.name[:-1])
      # create an instance of the makeUtil class that we will use to create targets.
      make = utils.makeUtil(makefilePath=os.path.join(self.work_dir, "Makefile." + self.name[:-1]))

      # set the make command that will be used. The num_jobs parameter was set in the __init__
      # function earlier
      make.makeCommand = 'make -k -j' + self.num_jobs

          # we will iterate over each entry in the testList. Each entry node will be referred to by the
      # variable testname.
      for testname in testList:

          # for each testname we get all its fields (as described by the testList format)
          testentry = testList[testname]

          # we capture the path to the assembly file of this test
          test = testentry['test_path']

          # capture the directory where the artifacts of this test will be dumped/created. RISCOF is
          # going to look into this directory for the signature files
          test_dir = testentry['work_dir']

          # name of the elf file after compilation of the test
-         # elf = 'my.elf'
+         elf = os.path.join(test_dir, 'output.elf')

          # name of the signature file as per requirement of RISCOF. RISCOF expects the signature to
          # be named as DUT-<dut-name>.signature. The below variable creates an absolute path of
          # signature file.
          sig_file = os.path.join(test_dir, self.name[:-1] + ".signature")

          # for each test there are specific compile macros that need to be enabled. The macros in
          # the testList node only contain the macros/values. For the gcc toolchain we need to
          # prefix with "-D". The following does precisely that.
          compile_macros= ' -D' + " -D".join(testentry['macros'])

          # substitute all variables in the compile command that we created in the initialize
          # function
          cmd = self.compile_cmd.format(testentry['isa'].lower(), self.xlen, test, elf, compile_macros)
+         asmbin = os.path.join(test_dir, 'output.asmbin')
+         objcopy_cmd = f"riscv32-unknown-elf-objcopy -O binary {elf} {asmbin}"

          # if the user wants to disable running the tests and only compile the tests, then
          # the "else" clause is executed below assigning the sim command to simple no action
          # echo statement.
          if self.target_run:
            # set up the simulation command. Template is for spike. Please change.
+            simcmd= f'sbt -DelfFile={elf} -DsignatureFile={sig_file} "testOnly riscv.mycputest"' 
-            simcmd = self.dut_exe + ' --isa={0} +signature={1} +signature-granularity=4 {2}'.format(self.isa, sig_file, elf)
          else:
            simcmd = 'echo "NO RUN"'

          # concatenate all commands that need to be executed within a make-target.
-          execute = '@cd {0}; {1}; {2};'.format(testentry['work_dir'], cmd, simcmd)
+          execute = '@cd {0}; {1};{2}; cd {3}; {4};'.format(testentry['work_dir'], cmd, objcopy_cmd,"/home/lhh/computer_arch/final_lab/riscv-core  ", simcmd)
          # create a target. The makeutil will create a target with the name "TARGET<num>" where num
          # starts from 0 and increments automatically for each new target that is added
          make.add_target(execute)

      # if you would like to exit the framework once the makefile generation is complete uncomment the
      # following line. Note this will prevent any signature checking or report generation.
      #raise SystemExit

      # once the make-targets are done and the makefile has been created, run all the targets in
      # parallel using the make command set above.
      make.execute_all(self.work_dir)

      # if target runs are not required then we simply exit at this point after running all
      # the makefile targets.
      if not self.target_run:
          raise SystemExit(0)

Writing the test file in Scala

First, the test reads the symbol table of the ELF file to determine the memory range of the signature region, using the following function.

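  import scala.sys.process._ // provides the `.!!` operator used below

  // Look up a symbol in the output of `objdump -t` and return its address.
  def extractSymbolAddress(elfFile: String, objdumpPath: String, symbolName: String): BigInt = {
    val symbolsCmd = s"$objdumpPath -t $elfFile"
    val symbolsOutput = symbolsCmd.!!
    val symbolLine = symbolsOutput
      .split("\n")
      .find(_.contains(s" $symbolName"))
      .getOrElse(throw new RuntimeException(s"Symbol $symbolName not found in $elfFile."))

    // The first whitespace-separated field of a symbol-table line is the hexadecimal address.
    BigInt(symbolLine.split("\\s+")(0), 16)
  }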
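
The same lookup can be tried outside the simulator. The Python sketch below parses text in the format printed by objdump -t and derives the signature size in words. The sample symbol table and its addresses are made up for illustration, and begin_signature / end_signature are assumed to be the symbols that delimit the signature region. Unlike a substring match, it compares the symbol name exactly, so a longer symbol that merely starts with the same text would not be picked up by mistake.

```python
# Illustrative objdump -t style output; the addresses are invented.
SAMPLE_SYMBOLS = """\
SYMBOL TABLE:
00000000 l    d  .text.init 00000000 .text.init
00002010 g       .data 00000000 begin_signature
00002650 g       .data 00000000 end_signature
"""

def symbol_address(objdump_output, name):
    """Return the address of the symbol whose name is exactly `name`."""
    for line in objdump_output.splitlines():
        fields = line.split()
        # A symbol line starts with the hex address and ends with the symbol name.
        if len(fields) >= 2 and fields[-1] == name:
            return int(fields[0], 16)
    raise KeyError(f"symbol {name} not found")

begin = symbol_address(SAMPLE_SYMBOLS, "begin_signature")
end = symbol_address(SAMPLE_SYMBOLS, "end_signature")
signature_words = (end - begin) // 4  # one signature entry per 32-bit word
print(hex(begin), hex(end), signature_words)  # 0x2010 0x2650 400
```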

Next, the data in that memory range is read and written out as the signature file. My first attempt, shown below, read the data directly from the symbol addresses, but every value came back as zero.

  val signatureData = (0 until signatureWords.toInt).map { i =>
    val address = beginSignature + (i * 4)
    c.io.mem_debug_read_address.poke(address.U)
    c.clock.step()
    val data = c.io.mem_debug_read_data.peek().litValue
    writer.println(f"$data%08x")  
}

I then discussed the issue with my classmates and examined the disass file in the reference materials. To investigate, I dumped the memory contents from address 0 to 30000 and found that the region from 0 to 4096 was empty. The reason is that Parameters.scala defines the entry address as EntryAddress = 0x1000.U(Parameters.AddrWidth), so the program appears to be placed in memory starting at 0x1000 (4096). I therefore added an offset of 4096 to each address, as follows.

     (0 until signatureWords.toInt).map { i =>
        val address = beginSignature + (i * 4) + 4096
        c.io.mem_debug_read_address.poke(address.U)
        c.clock.step() 
        val data = c.io.mem_debug_read_data.peek().litValue
        writer.printf("%08x\n", data.toLong)
      }
      writer.close()
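
To make the address arithmetic concrete, here is a small Python model of the dump loop. It assumes the program image is copied into memory starting at the entry address 0x1000, so a symbol at ELF address A is found at A + 0x1000 in the simulated memory. The memory contents, function name and addresses are invented for the example.

```python
import io

ENTRY_ADDRESS = 0x1000  # assumed load offset, matching EntryAddress in Parameters.scala

def dump_signature(memory, begin, end, out, offset=ENTRY_ADDRESS):
    """Write each 32-bit word in [begin, end) as 8 lowercase hex digits, one per line."""
    for i in range((end - begin) // 4):
        address = begin + 4 * i + offset
        word = memory.get(address, 0) & 0xFFFFFFFF  # unwritten memory reads as zero
        out.write(f"{word:08x}\n")

# Hypothetical memory: two signature words stored at ELF addresses 0x2010 and 0x2014.
memory = {0x1000 + 0x2010: 0x00000040, 0x1000 + 0x2014: 0xFFFFFFC1}

with_offset = io.StringIO()
dump_signature(memory, 0x2010, 0x2018, with_offset)
print(with_offset.getvalue(), end="")

# Without the offset the loop reads the wrong locations and sees only zeros.
without_offset = io.StringIO()
dump_signature(memory, 0x2010, 0x2018, without_offset, offset=0)
print(without_offset.getvalue(), end="")
```

The second call reproduces the all-zero signature from my first attempt.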

Running riscv-arch-test with RISCOF

riscof run --config=config.ini --suite=riscv-arch-test/riscv-test-suite/rv32i_m/M --env=riscv-arch-test/riscv-test-suite/env

The above command runs the M-extension tests from the riscv-arch-test suite against the custom CPU. The results are written to an HTML report, shown below.

The above shows the test results for the three-stage pipeline, while the following displays the test results for the five-stage pipeline. Both pipelines passed all the tests.

Reference

https://yatcpu.sysu.tech/

Cycle	1	2	3	4	5	6	7	8
IF	add	or	slt
ID		add	or	slt	slt
EX			add	or	nop	slt
MEM				add	or	nop	slt
WB					add	or	nop	slt
