[Computer Architecture 2022] Final Project

Contributed by < Peter >
Code can be found here: <GitHub>

tags: `RISC-V`

Project Overview

In this project, I am going to build a RISC-V 32-bit which support R32I ISA.

There will be 2 version of this CPU:

non-pipelined
5-state pipelined

In order to evaluate the performance of both 2 versions, I write total 3 test:

Prog0: Evaluate implemented instruction
Prog1: Sort signed 4-byte integers algorithm
Prog2: Multiplication signed 4-byte integers

But first, take a look at the architecture.

Architecture

Top testbench overview

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

System block diagram

Non-pipelined version

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

CPU Block diagram non-pipelined version

5-state pipelined

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

CPU Block diagram with 5 stages

DMEM_controller

In charge of control address, data, and control signal for Data memory (DMEM). This unit oversees masking bit with 0 or 1 depending on what instruction is used. For details:

Read/Write address A[13:0]: connect to ALU_output[15:2] (word-based address)
WEB[3:0] is write byte enable. It can be inferred from funct3 field of instruction along with the byte offset which is the last 2 LSB of ALU_output.

Register_File

Contains of 32 32-bit registers from x0 to x31. Register x0 is connected to ground (value 0). It can be used to read up to 2 different registers and write 1 register in the same clock.

CSR_File

At this moment, it only contains 2 64-bit register csr_cycle and csr_instret. Ability for read and write 1 register at the same time. Read ability support bit masking technique which allows user to hide sensitive information from being read without permission.
Supported CSR:

instret: number of instruction has been procceed and,
cyclecnt: number of clock cycle

Imm_Gen

Generating immediate value with 0 or signed extend for ALU; Generated immediate is based on ImmSel signal from Control_Unit.
Value of Imm is based on instruction which is shown in the below table:

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Control_Unit

Using instruction to generate control signal for the pipeline. Details:

ImmSel
- 3’b0: Imm Type I
- 3’b1: Imm Type S
- 3’b2: Imm Type J
- 3’b3: Imm Type U
- 3’b4: Imm for CSR instruction (shamt)
BrUn: 1 if the current Branch instruction is Unsigned.
ASel: Selector for ALU_DataA Mux:
- 2’b00: Data RS1
- 2’b01: PC
- 2’b10: Immediate
- 2’b11: 0
BSel: Selector for ALU_DataB Mux:
- 2’b00: Data RS2
- 2’b01: Immediate
- 2’b10: CSR read data
- 2’b 11: 0
MemRW: Control signal for DMEM read(0)/write(1) operation
MemEn: High active Enable signal for Read/Write operation
RegWEn: High active Enable signal for Register_File write operation.
WBSel: Selector for WBData Mux:
- 2’b00: DMEM read data
- 2’b01: ALU result
- 2’b10: PC+4
- 2’b 11: 0
ALUSel: Selector for ALU_Unit:
- 4’d0: Addition
- 4’d1: Subtraction
- 4’d2: Shift left logically
- 4’d3: Smaller than comparison (Unsigned)
- 4’d4: Smaller than comparison (Signed)
- 4’d5: XOR operation
- 4’d6: Shift right logically
- 4’d7: Shift right arithmetically
- 4’d8: OR operation
- 4’d9: AND operation
- 4’d10: Shift left logically using Immediate
- 4’d11: Shift right logically using Immediate
- 4’d12: Shift right arithmetically using Immediate
- 4’d13: AND operation with inverted value
csr_ren/csr_wen: CSR read/ write Active high Enable

Forwarding_Unit

This unit collect information about Data Hazard with R-Type instruction. The operation for detect Data hazard is described in below picture. ForwardA/ForwardB selector bit then be transferred to DataRS1/RS2 Mux respectively.

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Branch_Comparator

To deal with the Branch of current PC when B-Type instruction is detected, I create this Unit to compare value of RS1 and RS2 in EX stage. Depended on which B-Type instruction is running, BrEn signal will be active or not. If BrEn is 1 then pc_next will be ALU result data.

Hazard handling process

Structural hazard

Description: This type of hazard come from 2 or more instructions in the pipeline compete for access to a single physical resource.

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Structural RegFile is used in ID and WB [2]

Solution: Separate read and write port in Register_file. Now each instruction can read upto 2 register in ID stage and write 1 value in WB stage

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Register_File unit with 2 read port and 1 write port [2]

Data hazard- R-type instruction

Description: Happened when data in previous instruction is stored in Register_File but current instruction already require its value for calculation. For example:

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Data hazard when t0 register is used for many sequential instructions.

Solution: Data forwarding from MEM or (and) WB stage to EX stage. This technique require more hardware (Forwarding)Unit) to detect wherether data is forwarded or not.

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Data in MEM and WB stage is forwarded to EX stage in my CPU

Example with sequence of 5 add t0,t1,t0 instruction

id_rs1 = 5, id_rs = 6, id_rd= 5 for 5 next cycle
ForwardASel/BSel: selection bit for MUX Data A and B
- 0: No hazard
- 1: Hazard in 1 next instruction, forward data from EX (cpu_out)
- 2: Hazard in 2 next instruction, forward data from MEM
- 3: Hazard in 2 next instructions, forward from MEM (mem_read_data)

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Control hazard

Description: Branch (beq, bne,…) and Jump (jal, jalr) instructions determine flow of control.
Solution: Using the Static Branch Prediction (always not taking the Branch). In case of Branch instruction, I add a Branch_Comparision that will decide to take the branch or not in EX stage. But in case of Jump instruction, I add just let pipeline to load the 2-next instruction to IF and ID stage. After that, when the next PC is calculated in EX stage, I will flush the pipeline with STALL instruction.

Pipeline before flush (JAL in EX stage) [3]

Pipeline after flush (JAL in MEM stage)[3]

Example with instruction BLE: func3= 3’b000; opcode=7’h63
- DataRS1= DataRS2 = 0xFFFFF000
- BrUn = 1 means this instruction is signed comparison
- BrEn = 1 which mean PC jump to PC+ Imm (ALU_out)
- Meanwhile, flush ID and IF so that next 2 instruction go to EX stage become NoP (instruction = 0x13)

Makefile explaination

Verification

Prog0: 48 RV32I instructions

prog0 to perform verification for the functionality of instructions. The code is provided by VSLI System Design course, EE department, NCKU and be writen in Assembly.
You can check it under the name main.S in prog0 folder.

Prog1: Sorting algorithm

Requirement: The number of sorting elements is stored at the address named array_size. The first element is stored at the address named array_addr, others are stored at adjacent addresses. The maximum number of elements is 64. All elements are signed 4-byte integers and you should sort them in ascending order. Rearranged data should be stored at the address named _test_start.




























#include <stdio.h>
int main()
    {
        extern int array_addr ;
        extern int array_size ;
        extern int _test_start;
        int array[64];
        int i, j, a;
        for (i = 0; i < array_size; i++){ //Load array from DMEM 
            array[i] = *(&array_addr + i);
        }
        for (i = 0; i < array_size; ++i) //Exhausted search 
        {
            for (j = i + 1; j < array_size; ++j)
            {
                if (array[i] > array[j]) 
                {
                    a =  array[i];
                    array[i] = array[j];
                    array[j] = a;
                }
            }
        }
        for (i = 0; i < array_size; i++){ //Write sorted array back to DMEM
            *(&_test_start + i) = array[i];
        }
        return 0;
    }

Prog2: Perform multiplication.

Since the implementation focus on RV32I, the multiplication and division is not necessary but we can still having the same function using C code.

Requirement: The multiplicand is stored at the address named mul1. The multiplier is stored at the address named mul2. Their product is signed 8-byte integers and should be stored at the address named _test_start.


























    #include <stdio.h>
    int main()
    {
        extern int mul1 ;
        extern int mul2 ;
        extern int _test_start;
        long long num1 = 0; //Extended mul1 to 64-bit
        long long num2 = 0; //Extended mul2 to 64-bit
        long long result_db;
        int a = *(&mul1);
        int b = *(&mul2);
        int result_l = a*b;
        *(&_test_start) = result_l; //Save resulted of first 32-bit multiplication's result
        if (a<0){ ///Check sign of mul1(a)
            num1 = a|0xFFFFFFF00000000;//YES: Extend signed-bit of mul1 to 32 upper bit
        } else num1 = a;//NO: keeping unchange
        //Same process for mul2(b)
        if (b<0){
            num2 = b|0xFFFFFFF00000000;
        } else num2 = b; 
        //Calculate mulplication to get 32 upper bit of the final result
        result_db = num1*num2;
        int result_h = result_db >>32;//Shift 32 bit 
        *(&_test_start+1) = result_h;//Save last 32-bit of the result to next byte
        return 0;
    }

Result

Prog0

Non-pipeline version make prog0
- CPI: 1
5-stage pipelined version make prog0
- CPI: 1.449

Prog1

Non-pipelined version make prog1
- CPI: 1
5-stage pipelined version make prog1
- CPI: 1.205

Prog2

Non-pipelined version make prog2
- CPI: 1
5-stage pipelined version make prog2
- CPI: 1.477

Hardware Usage

In this project, I implement both 2 version into Xilinx kintex-7 FPGA series.
Remember that the Instruction Memory and Data Memory are excluded from this report.

Non-pipelined version
5-stage pipelined version
Comparision:
- Number of Register increase for 48.24%.
- Number of LUTs increase for 18.36% (mostly from increasement of Register).

Timming Report

Non-pipelined version
5-stage pipelined version
Comparision: WNS(5-stage) > WNS(non-pp) which resulted in better maximum clock speed.