# [Computer Architecture 2022] Final Project
Contributed by < [`Peter`](https://hackmd.io/@kaminto-1999) >
Code can be found here: <[`GitHub`](https://github.com/kaminto-1999/ComputerArchitectureNCKU111)>
###### tags: `RISC-V`
## Project Overview
In this project, I am going to build a RISC-V 32-bit which support R32I ISA.
There will be 2 version of this CPU:
- non-pipelined
- 5-state pipelined
In order to evaluate the performance of both 2 versions, I write total 3 test:
- Prog0: Evaluate implemented instruction
- Prog1: Sort signed 4-byte integers algorithm
- Prog2: Multiplication signed 4-byte integers
But first, take a look at the architecture.
## Architecture
Top testbench overview
> 
> System block diagram
### Non-pipelined version
> 
> CPU Block diagram non-pipelined version
### 5-state pipelined
> 
> CPU Block diagram with 5 stages
#### DMEM_controller
In charge of control address, data, and control signal for Data memory (DMEM). This unit oversees masking bit with 0 or 1 depending on what instruction is used. For details:
- Read/Write address `A[13:0]`: connect to `ALU_output[15:2]` (word-based address)
- `WEB[3:0]` is write byte enable. It can be inferred from `funct3` field of `instruction` along with the byte offset which is the last 2 LSB of `ALU_output`.
#### Register_File
Contains of 32 32-bit registers from x0 to x31. Register x0 is connected to ground (value 0). It can be used to read up to 2 different registers and write 1 register in the same clock.
#### CSR_File
At this moment, it only contains 2 64-bit register csr_cycle and csr_instret. Ability for read and write 1 register at the same time. Read ability support bit masking technique which allows user to hide sensitive information from being read without permission.
Supported CSR:
1. `instret`: number of instruction has been procceed and,
1. `cyclecnt`: number of clock cycle
#### Imm_Gen
Generating immediate value with 0 or signed extend for ALU; Generated immediate is based on ImmSel signal from Control_Unit.
Value of Imm is based on instruction which is shown in the below table:
> 
> 
> 
#### Control_Unit
Using instruction to generate control signal for the pipeline. Details:
- ImmSel
- 3’b0: Imm Type I
- 3’b1: Imm Type S
- 3’b2: Imm Type J
- 3’b3: Imm Type U
- 3’b4: Imm for CSR instruction (shamt)
- BrUn: 1 if the current Branch instruction is Unsigned.
- ASel: Selector for ALU_DataA Mux:
- 2’b00: Data RS1
- 2’b01: PC
- 2’b10: Immediate
- 2’b11: 0
- BSel: Selector for ALU_DataB Mux:
- 2’b00: Data RS2
- 2’b01: Immediate
- 2’b10: CSR read data
- 2’b 11: 0
- MemRW: Control signal for DMEM read(0)/write(1) operation
- MemEn: High active Enable signal for Read/Write operation
- RegWEn: High active Enable signal for Register_File write operation.
- WBSel: Selector for WBData Mux:
- 2’b00: DMEM read data
- 2’b01: ALU result
- 2’b10: PC+4
- 2’b 11: 0
- ALUSel: Selector for ALU_Unit:
- 4’d0: Addition
- 4’d1: Subtraction
- 4’d2: Shift left logically
- 4’d3: Smaller than comparison (Unsigned)
- 4’d4: Smaller than comparison (Signed)
- 4’d5: XOR operation
- 4’d6: Shift right logically
- 4’d7: Shift right arithmetically
- 4’d8: OR operation
- 4’d9: AND operation
- 4’d10: Shift left logically using Immediate
- 4’d11: Shift right logically using Immediate
- 4’d12: Shift right arithmetically using Immediate
- 4’d13: AND operation with inverted value
- csr_ren/csr_wen: CSR read/ write Active high Enable
#### Forwarding_Unit
This unit collect information about Data Hazard with R-Type instruction. The operation for detect Data hazard is described in below picture. `ForwardA/ForwardB` selector bit then be transferred to` DataRS1/RS2 Mux` respectively.
> 
#### Branch_Comparator
To deal with the Branch of current PC when B-Type instruction is detected, I create this Unit to compare value of RS1 and RS2 in EX stage. Depended on which B-Type instruction is running, `BrEn` signal will be active or not. If `BrEn` is 1 then `pc_next` will be ALU result data.
### Hazard handling process
#### Structural hazard
- Description: This type of hazard come from 2 or more instructions in the pipeline compete for access to a single physical resource.
> 
> Structural RegFile is used in ID and WB [2]
- Solution: Separate read and write port in Register_file. Now each instruction can read upto 2 register in ID stage and write 1 value in WB stage
> 
> Register_File unit with 2 read port and 1 write port [2]
#### Data hazard- R-type instruction
- Description: Happened when data in previous instruction is stored in Register_File but current instruction already require its value for calculation. For example:
> 
> Data hazard when t0 register is used for many sequential instructions.
- Solution: Data forwarding from MEM or (and) WB stage to EX stage. This technique require more hardware (Forwarding)Unit) to detect wherether data is forwarded or not.
> 
> Data in MEM and WB stage is forwarded to EX stage in my CPU
- Example with sequence of 5 add t0,t1,t0 instruction
* id_rs1 = 5, id_rs = 6, id_rd= 5 for 5 next cycle
* ForwardASel/BSel: selection bit for MUX Data A and B
* 0: No hazard
* 1: Hazard in 1 next instruction, forward data from EX (cpu_out)
* 2: Hazard in 2 next instruction, forward data from MEM
* 3: Hazard in 2 next instructions, forward from MEM (mem_read_data)
> 
#### Control hazard
- Description: Branch (beq, bne,...) and Jump (jal, jalr) instructions determine flow of control.
- Solution: Using the `Static Branch Prediction` (always not taking the Branch). In case of Branch instruction, I add a Branch_Comparision that will decide to take the branch or not in EX stage. But in case of Jump instruction, I add just let pipeline to load the 2-next instruction to IF and ID stage. After that, when the next PC is calculated in EX stage, I will flush the pipeline with STALL instruction.
> 
> Pipeline before flush (JAL in EX stage) [3]
> 
> Pipeline after flush (JAL in MEM stage)[3]
- Example with instruction BLE: func3= 3’b000; opcode=7’h63
* DataRS1= DataRS2 = 0xFFFFF000
* BrUn = 1 means this instruction is signed comparison
* BrEn = 1 which mean PC jump to PC+ Imm (ALU_out)
* Meanwhile, flush ID and IF so that next 2 instruction go to EX stage become NoP (instruction = 0x13)

## Makefile explaination
## Verification
### Prog0: 48 RV32I instructions
> prog0 to perform verification for the functionality of instructions. The code is provided by VSLI System Design course, EE department, NCKU and be writen in Assembly.
> You can check it under the name `main.S` in `prog0` folder.
### Prog1: Sorting algorithm
> Requirement: The number of sorting elements is stored at the address named `array_size`. The first element is stored at the address named `array_addr`, others are stored at adjacent addresses. The maximum number of elements is 64. All elements are *signed 4-byte integers* and you should sort them in ascending order. Rearranged data should be stored at the address named `_test_start`.
```c=
#include <stdio.h>
int main()
{
extern int array_addr ;
extern int array_size ;
extern int _test_start;
int array[64];
int i, j, a;
for (i = 0; i < array_size; i++){ //Load array from DMEM
array[i] = *(&array_addr + i);
}
for (i = 0; i < array_size; ++i) //Exhausted search
{
for (j = i + 1; j < array_size; ++j)
{
if (array[i] > array[j])
{
a = array[i];
array[i] = array[j];
array[j] = a;
}
}
}
for (i = 0; i < array_size; i++){ //Write sorted array back to DMEM
*(&_test_start + i) = array[i];
}
return 0;
}
```
### Prog2: Perform multiplication.
Since the implementation focus on RV32I, the multiplication and division is not necessary but we can still having the same function using C code.
> Requirement: The multiplicand is stored at the address named `mul1`. The multiplier is stored at the address named `mul2`. Their product is signed 8-byte integers and should be stored at the address named `_test_start`.
```c=
#include <stdio.h>
int main()
{
extern int mul1 ;
extern int mul2 ;
extern int _test_start;
long long num1 = 0; //Extended mul1 to 64-bit
long long num2 = 0; //Extended mul2 to 64-bit
long long result_db;
int a = *(&mul1);
int b = *(&mul2);
int result_l = a*b;
*(&_test_start) = result_l; //Save resulted of first 32-bit multiplication's result
if (a<0){ ///Check sign of mul1(a)
num1 = a|0xFFFFFFF00000000;//YES: Extend signed-bit of mul1 to 32 upper bit
} else num1 = a;//NO: keeping unchange
//Same process for mul2(b)
if (b<0){
num2 = b|0xFFFFFFF00000000;
} else num2 = b;
//Calculate mulplication to get 32 upper bit of the final result
result_db = num1*num2;
int result_h = result_db >>32;//Shift 32 bit
*(&_test_start+1) = result_h;//Save last 32-bit of the result to next byte
return 0;
}
```
## Result
### Prog0
- Non-pipeline version `make prog0`

- **CPI: 1**
- 5-stage pipelined version `make prog0`

- **CPI: 1.449**
### Prog1
- Non-pipelined version `make prog1`

- **CPI: 1**
- 5-stage pipelined version `make prog1`

- **CPI: 1.205**
### Prog2
- Non-pipelined version `make prog2`

- **CPI: 1**
- 5-stage pipelined version `make prog2`

- **CPI: 1.477**
## Hardware Usage
In this project, I implement both 2 version into Xilinx kintex-7 FPGA series.
*Remember that the Instruction Memory and Data Memory are excluded from this report.*
- Non-pipelined version

- 5-stage pipelined version

- Comparision:
- Number of Register increase for `48.24%`.
- Number of LUTs increase for `18.36%` (mostly from increasement of Register).
## Timming Report
- Non-pipelined version

- 5-stage pipelined version

- Comparision: `WNS(5-stage) > WNS(non-pp)` which resulted in better maximum clock speed.