# Assignment1 : RISC-V Assembly and Instruction Pipeline
contributed by <```JapokaChung```>
## Reverse Bits
The problem is chosen from [LeetCode190](https://leetcode.com/problems/reverse-bits/?fbclid=IwAR0iedf7rj-j3cpYrzhqQCMLvCflEtd1PfozK-5CR92ZR3X2Pe765V_qozc) with difficulty "Easy" as the assignment request.
We need to **reverse a given 32-bits unsigned int**, and print the result in unsigned int after processing. For example :
> **Input** : 00000010100101000001111010011100
> **Output**: 964176192 (00111001011110000010100101000000)
>
> **Input** : 11111111111111111111111111111101
> **Output**: 3221225471 (10111111111111111111111111111111)
>
## Implementation
You can see the complete source code in [here](https://github.com/japoka410666/Computer-Architecture-Assignment1).
### C code
I want to see how a function is called in assembly, so I write the reverse process as another function, instead of doing it in main directly.
```c=
int main(void){
uint32_t num = 0b11111111111111111111111111111101;
uint32_t result = 0;
result = reverse(num);
printf("%u is reverse to %u\n",num,result);
return 0;
}
```
In the reverse function, it reverse the input number n bit-by-bit from LSB to MSB.The reulut is initialize to 0 at first, which imply if n is become 0 in advance, we can directly break the loop, because the remaining un-reversed bits are all 0.
```c=
uint32_t reverse(uint32_t n){
uint32_t res = 0;
for(int i = 0; i < 32 && n; i++){
res += (n & 1) << (31 - i);
n = n >> 1;
}
return res;
}
```
### Assembly
The main logic of the following code is identical to C code, but there are more details need to be handled, especially in calling function ,loop branch and using system call.
Also, this program is **not able to handle overflow**, so make sure your input is limited in 32 bits.
```s=
.data
INPUT:.word 0b1111111111111111111111111111101
str: .string " is reversed to "
.text
main:
lw s0, INPUT #laod the input
li s1, 0 #initialize the result to zero
jal ra, reverse
mv s1, a0 #save the return val
jal ra, print
li a7, 10 #call system to end the program
ecall
reverse:
addi sp, sp, -8 #save the arguments
sw ra, 4(sp) #return address
sw s0, 0(sp) #input
li t1, 0 #initialize the result
li t2, 0 #initialize the iterator i
loop:
andi t0, s0, 1 #mask input with 1
sub t3, x0, t2 # -(i)
addi t3, t3, 31 # 31 - i
sll t0, t0, t3 # <<(31 - i)
add t1, t1, t0 # accumualte the result in t1
srai s0, s0, 1 # input = input >> 1
addi t2, t2, 1 # accumualte i
slti t0, t2, 32 # test if i < 32
beqz t0, out # break the loop if ~(i < 32)
bnez s0, loop # go to loop if n is not zero
out:
mv a0, t1 #save the result
lw s0, 0(sp) #laod the original input
lw ra, 4(sp) #laod the return address
addi sp, sp, 8 #recover sp
jr ra #return
#Print the result in console
print:
lw a0, INPUT
li a7, 35 #call system to print binary int (input)
ecall
la a0, str
li a7, 4 #call system to print string
ecall
mv a0, s1
li a7, 35 #call system to print binary int (result)
ecall
jr ra
```
## Analysis
### Pseudo Instruction
When we write our assembly into [Ripes](https://github.com/mortbopet/Ripes), it will translate some our pseudo instructions such as ``` li ```, ``` mv ```, ``` jr ```, into corresponding base instructions.There are reason for using these pseudo instruction. One is that they are more easy to read and realize for human, the other is that we can write pseudo instrcutions in low-level design, and they can be translate to different base instruciton depend on different types of machine. For example, in RV32, if we want to laod a 32-bits constant into a register, it will be acheived through ``` lui ```, ``` addi ```, ```jalr```... in base instructions, telling the processor what actually need to do. (The correspondence can be found in [RISCV Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf).)
The following is my assembly translated by [Ripes](https://github.com/mortbopet/Ripes).
```
00000000 <main>:
0: 10000417 auipc x8 0x10000
4: 00042403 lw x8 0 x8
8: 00000493 addi x9 x0 0
c: 014000ef jal x1 20 <reverse>
10: 00050493 addi x9 x10 0
14: 05c000ef jal x1 92 <print>
18: 00a00893 addi x17 x0 10
1c: 00000073 ecall
00000020 <reverse>:
20: ff810113 addi x2 x2 -8
24: 00112223 sw x1 4 x2
28: 00812023 sw x8 0 x2
2c: 00000313 addi x6 x0 0
30: 00000393 addi x7 x0 0
00000034 <loop>:
34: 00147293 andi x5 x8 1
38: 40700e33 sub x28 x0 x7
3c: 01fe0e13 addi x28 x28 31
40: 01c292b3 sll x5 x5 x28
44: 00530333 add x6 x6 x5
48: 40145413 srai x8 x8 1
4c: 00138393 addi x7 x7 1
50: 0203a293 slti x5 x7 32
54: 00028463 beq x5 x0 8 <out>
58: fc041ee3 bne x8 x0 -36 <loop>
0000005c <out>:
5c: 00030513 addi x10 x6 0
60: 00012403 lw x8 0 x2
64: 00412083 lw x1 4 x2
68: 00810113 addi x2 x2 8
6c: 00008067 jalr x0 x1 0
00000070 <print>:
70: 10000517 auipc x10 0x10000
74: f9052503 lw x10 -112 x10
78: 02300893 addi x17 x0 35
7c: 00000073 ecall
80: 10000517 auipc x10 0x10000
84: f8450513 addi x10 x10 -124
88: 00400893 addi x17 x0 4
8c: 00000073 ecall
90: 00048513 addi x10 x9 0
94: 02300893 addi x17 x0 35
98: 00000073 ecall
9c: 00008067 jalr x0 x1 0
```
### 5 stages of pipeline
There are several types of processor in [Ripes](https://github.com/mortbopet/Ripes) simulater : single-cycle processor, 5-stage processor, 5-stage processor w/o forwarding, 5-stage processor w/o harazd detection, 6-stage dual-issue processor. In this notes ,we will mainly dicuss the **5-stage processor with forwarding and harazd detection**.
The following is the 5-Stage RISCV Processor architecture :

The five stages are :
1. **IF : Instruction Fetch**
Mux will select the PC, either from PC+4 or the other result due to the previous instructions. Then, the selected PC will be sent into Instruction Memory (SRAM) to fetch the next instruction. And because each instruction has length in size of 4 byte in RV32, PC would increment 4 each cycle in this stage to jump 4 memory address to get the next instruction in a normal situation .
2. **ID : Instruction Decode**
Since different type of instructions have to be interpreted in different way, so we decode them here. Instructions would be decomposed into several parts and sent to Register File to extract needed data ,or sent the immediate data part to the next stage.
3. **EXE : Execute**
In this stages, ALU is responsible for performing logic and arithmetic function, and provide the status of calculation. The result might be a data or a address target for load/store. On the other hand, Branch module is responsible for calculating the branch destination.
4. **MEM : Memory access**
If there any data need to be accessed, it would be done here. The memory will give a output data depends on the given address from the EXE stage.
5. **WB : Writeback**
During this stage, the mux decides which data should be sent back and written into Register File. Differnt from ID stages who also access the Register File, the decoding stage is reading source registers, that the WB stages is writing the result data into instruction's destination register.
### Instruction formats
In [RV32 Base Integer ISA](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf), there are four core instruction formats (**R/I/S/U**). Each of them is a fixed 32 bits in length and must be aligned on a 4-byte boundary in memory. And it keeps all `rs` and `rd` in same position to simplify the decode. All the sign bit of immediate are in ins[31] for the same reason. In addition, there are futher two more variants of the instruction formats (**B/J**, or SB/UJ in [Green Card](https://www.cl.cam.ac.uk/teaching/1617/ECAD+Arch/files/docs/RISCVGreenCardv8-20151013.pdf)) based on the handling of immediate.
The following is their alignment :

The difference between `S` and `B` is that `B` format use imm[11] and imm[12] to **encode branch offsets** in multiple of 2. Instead of shifting all bits in the instruction-encoded immediate left, the middle bits (imm[10:1]) and sign bit stay in fixed positions, while the lowest bit in `S` format (inst[7]) encodes a high-order bit in `B` format. It has a similiar difference between `U` and `J`, that the 20-bit immediate is shifted left by 12 bits to form U immediates and by 1 bit to form J immediates.
The following is the analysis of how **each format instructions perform in the processor** in my program. And there will also have some discussion on **memory updates** result by load/store instructions.
### U format
The first instruction in this program is `auipc x8 0x10000` which is a U format instruction and is used to **build PC-relative addresses**. let's see how it is performed in the processor.
1. **IF**

* PC is initialize as 0
* `auipc x8 0x10000 (0x10000417)` is read out from Instr. memory.
* PC = PC + 4
* Mux choose the incremenent PC as the next Instr. address
2. **ID**

* Because both `rs1` and `rs2` is zero, the output of Registers File is zero too.
* The opcode `AUIPC` and `0x10000` is sent to Imm.
It laods the upper 20 bits and fill the lower with 0, getting the output `0x10000000`.
* Wr idx = Ins[7:11] = `0x08`
3. **EXE**

* First level mux select `Reg1` and `Reg2`
* Second level mux select `PC = 0` and `0x10000000 (Imm. ouput)` as ALU's input.
* ALU do `add` operation with the result `0x10000000`.
* No branch is taken.
4. **MEM**

* The result of ALU `0x10000000` is sent into three path:
1.**sent back to EXE** ,used by next inxtruction.
2.**directly go through to WB.**
3.**sent to Data Memory** as the read/write address.
* Because the Data Memory Wr is not enable here, resulting the data at the address = `0x10000000`, which is `0xfffffffd`.

5. **WB**

* Mux select ALU result `0x10000000` to writeback to Register file.
* Wr idx is also be sent to Register file as the write address.
### S format
Another instruciton of this program is `sw x1 4 x2` which is a S format instruction and is used to **save data from register into memory**.
1. **IF**

* PC sent into Instr. memory is `0x0000002c` then getting this intrction `0x00112223`.
* PC = PC + 4
`0x00000030 = 0x0000002c + 0x00000004`
* Mux woudld select the incremenented PC as the next Instr. address.
2. **ID**

* R1 idx = `0x02`,R2 idx = `0x01`.
* Reg1 = `0x07ffffff0` is the data will be saved.
* Reg2 = `0x00000018` is the base Memory address.
* Imm. = `0x00000004`.
3. **EXE**

* ALU should operate `add` to the `Imm.` and the `Reg2`. But the previous instruction re-write x2 after we read it, so the mux need to select the sent back one, which is known as **forwarding**.
* The reslut of adding is ALU's output `0x7fffffec` which is the definite **Data memory address** we are going to access.
* Reg2 data `0x00000018` directly go through this stage.
* No branch is taken.
4. **MEM**

* Use `0x00000018` as the saving data.
* Use the ALU's result `0x7fffffec` as the Data memory address.
* **Data memory Wr is enable.**
5. **WB**

* This instruction `sw` don't need to write back anything to Register File. Althougth data do be sent back, **Wr of Register File is not enable**. So, the selection of the mux in WB is actually `don't care`.
* Data in `0x7fffffec` has already been written to `0x00000018`.

### I format
Another instruciton of this program is `lw x10 -112 x10` which is a I format instruction, and is used to **laod a data from Data memory into destinate register**.
**The difference between `lw` and `sw`** is that `lw` use `rd (ins[7:11])` as the Wr idx, and `sw` use `rs1 (ins[15:19])` as the Wr idx. They need different control signals for Register File and Data memory to do read/write ,making these two instruction implement in different format.
1. **IF**
The situation in IF is similar to the U and S cases above:
* PC = PC + 4
* Mux woudld select the incremenented PC as the next Instr. address.
2. **ID**

* R1 idx = `0x0a` .Reg1 = `0x10000070` should be the base Memory address.
* R2 idx = `0x10`. Reg2 = `0x00000000` .
* `0x0a` is the destinate register.
* Imm. = `0xffffff90`is the two's complement of -112.
3. **EXE**

* Same as the `sw` case above, **forwarding** occur again, because previous intruction `auipc x10 0x10000` re-write x2 after we read it from Regiter File.
* ALU operate `add` to the `Imm.` and `x10`.
* * The reslut of adding is ALU's output `0x10000000` which is the definite **Data memory address** we are going to access.
4. **MEM**

* Data memory's Wr is not enable, so Data memory output the data in `0x10000000` which is `0x7ffffffd`.

5. **WB**

* Mux select the data read from Data memory to send to Register File.
* Register File's Wr is turn to enable.
* Wr idx = `0x0a`.
### R format
R format instructions is resposible for performing **Integer Register-Register Operations** which really common to see in a program. I take `add x10 x8 x0` as example to analys the implementation process.
1. **IF**
The situation in IF is similar to the other cases above:
* PC = PC + 4
* Mux woudld select the incremenented PC as the next Instr. address.
2. **ID**

* R1 idx = `0x08`, Reg2 `0x10000000`
* R2 idx = `0x00`, Reg1 `0x00000000` (data in x0 should always be zero.)
* Destinate register = `0xa0`
* Because there are no any immediate in this instruction, the ouput of Imm. = `deadbeaf`.
3. **EXE**

* At normal situation the first and the second level mux should select the data ouput from Register File. But **forwarding** occurs again here, so the data from x8 is changed from `0x10000000` into `0xffffffd`.
* ALU operate `add` to selected `op1` and `op2` with the result `0x00000000 + 0xfffffffd = 0xfffffffd`.
* No branch is taken.
4. **MEM**

* R format instructions will neither write or read Data memory in this stage. Although there is an ouput with given address input, it is useless.
* ALU's reslut `0xfffffffd` will directly go through this stage.
5. **WB**

* MUX select ALU's result `0xffffffd`, and pass it to Register file as Wr data.
* `0x0a` is sent to Register file as Wr idx, representing x10.
* Register File's Wr is enable.
### J format
After analysing R/I/S/U format instruction above, we are now going to discuss `jal x1 20 <reverse>` which is an U format instruction, and will **change the execution oreder of instructions** later. It will also **save PC + 4** into `ra` to tell the processor where to come back after processing `<reverse>` section.
1. **IF**
Although this istruction is aim to change the execution oreder of instructions, its IF stage is similar to the other cases above. This fact shows that the processor will always perdict instruction executed in oreder, in the first steps.

* PC = PC + 4
* Mux woudld select the incremenented PC as the next Instr. address.
2. **ID**

* R1 idx, R2 idx and their outputs (Reg1, Reg2) are not important in this stage.
* Destinate register = `0x01`
* Immediate is 20 bits in instruction. After multiply by 2 and append its sign bits to 32 bit, Imm. result in `0x00000014`, which is th the **address distance** from current position to <loop> label.
3. **EXE**

* The First level mux will select the ouptut from Register file, but it won't be used.
* The Secong level mux will select the Imm. and PC of this instruction (jal).
* ALU operate `add` to selected `op1` and `op2` with the result `0x00000014 + 0x00000014 = 0x00000028`. (Note : `0x00000028` is the address of <reverse> label.)
* ALU's result is **sent back to IF stage**, and the mux in IF will select this value as the next PC.
* Since the order of execution is changed, which means `add x9 x11 x0` and `jal x1 92 <print>` need to be **flushed**. So, **clear of IF/ID and ID/EXE is turned on**.
* No branch is taken.
4. **MEM**

* U format instructions will neither write or read Data memory in this stage. Although there is an ouput with given address input, it is useless.
* **The previous two stages is flushed (nop).**
5. **WB**

* The return address `0x00000018` will be sent to Register file as Wr data.
* `0x01` is sent to Register file as Wr idx, representing ra.
* Register File's Wr is enable.
### B format
At last, I am going to dicuss B format instruciton which will also change the execution oreder like S format do. Here, I take `bne x8 x0 -36 <loop>` as example. branch instruction will check the condition given, then dicide if processor have to go to the label address or not.
Different from `J` format, B format take more space to place the source register resulting a smaller branch range. In the [Manual(p.15, p.17)](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf), it says that the **unconditional jump can target a ±1 MiB range, and the conditional branch has range in ±4 KiB**. It can also be verified by the information below:
J-immediate encodes a signed offset in multiples of 2 bytes.

The 12-bit B-immediate encodes signed offsets in multiples of 2.

1. **IF**
Its IF stage is similar to the other cases above.
* PC = PC + 4
* Mux woudld select the incremenented PC as the next Instr.
2. **ID**

* R1 idx = `0x08`, Reg 1 = `0x3ffffffe`
* R2 idx = `0x00`, Reg 2 = `0x00000000`
* Immediate is 12 bits in instruction. After multiply by 2 and append its sign bits to 32 bit, Imm. result in `0xffffffdc`, which is th the **address distance** from current position to <loop> label.
3. **EXE**

* The first level mux will select the result, one from Register file, one from forwarding (in this case), then, send them into Branch.
* The Secong level mux will select the Imm. and PC of this instruction (bne).
* ALU operate `add` to selected `op1` and `op2` with the result `0x00000058 + 0xffffffdc = 0x00000034`. (Note : `0x00000034` is the address of <loop> label.)
* **Branch taken**
* Because of Branch taken, the mux in IF stage will **select the sent back value as next PC**.
* Since the order of execution is changed, which means `addi x10 x6 0` and `lw x8 0 x2` need to be **flushed**. So, **clear of IF/ID and ID/EXE is turned on**.
4. **MEM** and **WB**

* It will do nothing in these two stage. Both Register file and Data memory's Wr will be not enable. The value read out from Data memory is meanless and useless too. But this instruction `bne` still go through these stages like others do to achieve pipeline alignment.
* **The previous two stages is flushed (nop).**
### Hazard
#### **Structure Hazard**
**Structure hazard occur when two or more instructions attempt to use one resource at the same time.** But in RV32I architecture, it's prevent by replicating these hardware. In particular, branch can use ALU to decide if needed to be taken. However, branch instruction sneed to calculate the target address at the same time. This confliction is solved by adding another Branch module in addition to ALU.
#### **Data Hazard**
Data Harzard occur when **data dependency** exist btween two instructions executed in pipeline at the same time, and the program will result diffently with the differnt execution order of instrucitons. There are 3 types of data dependency : WAW, RAW, WAR. For the processor used in this assignment, **we should only consider RAW which actually causes data hazard in this case**.
We have two solution with this type of hazard, one is **stalling**, the other is **forwarding**:
**Stalling is also known as pipeline inter-lock.** It's an intuitive solution that stall the conflict instruction until the writing to Register is done. In the following example, we can see that Wr idx of `auipc` is conflict with the Rs idx of `lw`, so the processor stalled it until WB of `auipc` is done. This operation is achieve by insert one or several **NOP instruciton (bubble)** who will actually do nothing.

**Forwarding is also known as bypassing.** It's has been mentioned sevveral times above. The main idea is to send the previous result back to EXE stage before the whole instruction is done. And it is decided by additioanal **combinational logic**. Taking the same example in to consider, we can see there are no more NOP instruction needed.

In conclusion, **forwarding is a faster solution for data harzard**. The following is the execution info. of with and without forwaring cases.
w/ forwarding

w/o forwarding

However, forwarding cannot work by itself sometimes. For example:

So in this case, forwarding should coorperate with stalling to avoid data hazard.

#### **Control Hazard**
**Controal hazard occur when conditional branch or un-conditional jump with wrong prediction.** It have been analysed above, in J and B format section. When it happended, processor needs to turn on `clear` signal of IF/ID and ID/EXE to flush the wrong instruction those are executing.

The panelty can be minimize by applying better branch predition. But in this demostration, the processor always predicts branch not taken causing a lot of wastes of cycles.
## Reference
[Reverse Bits (LeetCode190)](https://leetcode.com/problems/reverse-bits/?fbclid=IwAR1KsE-dwQi636x9m6KCSdFwpdPoz2amb01Oeklfuqu1DuNbfHfqrNMA9k0)
[RISCV Green Card](https://www.cl.cam.ac.uk/teaching/1617/ECAD+Arch/files/docs/RISCVGreenCardv8-20151013.pdf)
[RISCV Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf)
[Wiki-Classic RISC Pipeline](https://en.wikipedia.org/wiki/Classic_RISC_pipeline)