---
tags: CA2020
---
# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < `kevin30292` >
==Requirement==
- [x] Don’t implement the same subject as others do. Your program shall be different.
- [x] Your program(s) MUST contain loops (or recursive calls) and conditional branches.
- [x] You have to ensure the program fully functioned with Ripes simulator.
- [x] Explain your program with the visualization for multiplexer input selection, register write/enable signals and more.
- [x] Illustrate each stage such as IF, ID, IE, MEM, and WB.
- [x] Discuss the steps of memory updates accordingly.
## Selection Sort
Use selection sort to sort a integer array which includes 5 integers
:::info
Selection sort:
selection sort is a simple sorting algorithm. This sorting algorithm divided the list into two parts, `sorted-part` at the left end and `unsorted-part` at the right end. Initially, the sorted part is empty and the unsorted part is the entire list. Everytime choose the minimum value in `unsorted-part` and swapped with the most left value of `unsorted-part`, and it will become a part of `sorted-part`.
:::
### C code
```c=
int main()
{
// inputs
int i, j, min_id, temp;
int arr[5]={1,2,4096,16,256};
// sort
for (i = 0; i<4; i++) {
min_id = i;
for (j=i+1; j<5; j++) {
if (arr[j] < arr[min_id]) {
min_id = j;
}
}
temp = arr[i];
arr[i] = arr[min_id];
arr[min_id] = temp;
}
return 0;
}
```
#### Result
arr={1,2,16,256,4096}
### Assembly code
#### `version 1`
| info | |
|-------|-----|
|cycles |210 |
|retired|168 |
|CPI |1.25 |
|IPC |0.800|
```cpp=
.data
arr: .word 1,2,4096,16,256
.text
main:
#get base address
la x8, arr
#init i
add x15, x0, x0
#for loop i
L1:
#init min_id
add x16, x0, x15
#init j
addi x6, x15, 1
#for loop j
L2:
slli x7, x6, 2
slli x28, x16, 2
add x5, x8, x28
lw x31, 0(x5)
add x5, x8, x7
lw x30, 0(x5)
bge x30, x31, nochange
add x16, x0, x6
nochange:
addi x6, x6, 1
addi x9, x0, 5
blt x6, x9, L2
swap:
#get a[i]
slli x29, x15, 2
add x5, x29, x8
#load a[i] in x30
lw x30, 0(x5)
#get a[min_id]
slli x28, x16, 2
add x5, x28, x8
#load a{min_id] in x31
lw x31, 0(x5)
#put a[min_id] in a[i]
add x5, x29, x8
sw x31, 0(x5)
#get a[min_id]
add x5, x28, x8
sw x30, 0(x5)
addi x1, x0, 4
addi x15, x15, 1
blt x15, x1, L1
```
#### `version 2`
:::info
organized the variable assignment
:::
| variable | register | note |
|-----------------|----------|-------|
| arr -> base | x3 = gp | |
| i | x5 = t0 | |
| i condition | x18 = s2 | |
| arr[i] offset | x28 = t3 | |
| arr[i] address | x12 = a2 |x28(x3)|
| j | x6 = t1 | |
| j condition | x19 = s3 | |
| arr[j] offset | x29 = t4 | |
| arr[j] address | x13 = a3 |x29(x3)|
| min_id | x7 = t2 | |
| arr[j] offset | x30 = t5 | |
| arr[j] address | x14 = a4 |x30(x3)|
| compare1 & temp | x10 = a0 | |
| compare2 | x11 = a1 | |
| info | |
|-------|-----|
|cycles |209 |
|retired|157 |
|CPI |1.33 |
|IPC |0.751|
```cpp=
.data
arr: .word 1,2,4096,16,256
.text
main:
#get array base address
la x3, arr
#init i
addi x5, x0, 0
#for loop i
L1:
#init min_id = i
addi x7, x5, 0
#init j = i + 1
addi x6, x5, 1
#for loop j
L2:
#get arr[j] address offset and address
slli x29, x6, 2
add x13, x3, x29
lw x10, 0(x13)
#get arr[min_id] address offset and address
slli x30, x7, 2
add x14, x3, x30
lw x11, 0(x14)
#check if condition x10 = arr[j] ; x11 = arr[min_id]
bge x10, x11, nochange
addi x7, x6, 0
nochange:
#j++
addi x6, x6, 1
addi x19, x0, 5
blt x6, x19, L2
#get arr[i] address offset and address
slli x28, x5, 2
add x12, x3, x28
#get arr[min_id] address offset and address
slli x30, x7, 2
add x14, x3, x30
# temp = arr[i]
lw x10, 0(x12)
# arr[i] = arr[min_id]
lw x11, 0(x14)
sw x11, 0(x12)
# arr[min_id] = temp
sw x10, 0(x14)
#i++
addi x5, x5, 1
addi x18, x0, 4
blt x5, x18, L1
```
#### Result
initial:

after:

#### code explain
This code have four different parts.
1. main: Initializing value of i and `base-address` of the array.
2. L1: The big for-loop to sort every element in the array.
3. L2: The for-loop in L1 to find the smallest element in `unsorted-part`.
4. nochange: Label for skipping `change-min_id` if it have the smallest element's ID.
## Ripes simulator
:::info
RISC-V instruction format

:::
### la x3, arr
:::warning
==la== will be transform to ==auipc== and ==addi==
==auipc==: add the imm value with `PC` and store in the `rd`
==addi==: subtract the `PC` value from `rd`
:::
#### IF(Instruction Fetch)
The `PC` will initialize with 0x0000_0000 which is the address of first instruction.
:::info
The `PC` value will also pass to `adder` plus 4(PC+4) and update the `PC`.
If no branch or jump, the multiplexer will choose PC+4 for next instruction address.
:::
In the picture, the first instruction(0x10000197) will pass to decoder.

#### ID(Instruction Decode)
:::info
la:

:::
0x1000_0197
0b00010000000000000000_00011_0010111
U type
op -> 0b0010111 = `auipc`
rd -> 0b00011 = `x3`
imm -> 0b0001_0000_0000_0000_0000 = 0x10000
:::info
U type:
20-bit immediate is shifted left by 12 bits to form U immediates.
:::
imm -> 0x10000==000==
The first instruction pass into decoder and come aparts to `AUPIC`, `0x03`, `0x1000_0000`

#### EX(Execute)
`0x03` directly pass to next stage.
ALU plus `imm` and `PC` and pass the sum to next stage.

#### MEM(memory access)
just pass value
#### WB(Write Back)
The multiplexer choose value pass from ALU, not use the Memory Unit value.
Write the value(`imm` + `pc`) back to register(`x3`).
The `Wr En` will turn green to allow write back.

:::info
The register write back will delay one stage to actually update the Registers.
Example:
The x5 doesn't updata when instruction in WB stage

---
Next stage:

:::
:::warning
Now the x3 value is `pc` + `imm`, but we want the value is `imm`, so next instruction is subtract `pc`.
:::
addi x3, x3, 0(-pc)
>if pc is 0x08
addi x3, x3, -8
:::spoiler

imm value will change automatically

:::
#### Result:
`x3` = 0x1000_0000 (address of arr[])

---
### addi x5, x0, 0
Initialize `i`
#### IF(Instruction Fetch)
Fetch instruction at `0x08` -> 0x0000_0293

#### ID(Instruction Decode)
:::info
addi

:::
0x0000_0293
0b000000000000_00000_000_00101_0010011
op -> 0b0010011 = `imm`
rd -> 0b00101 = `x5`
funct -> 0b000 = `addi`
rs1 -> 0b00000 = `x0`
imm -> 0b0 = `0`
:::info
I type instruction:
sign-extended 12-bit immediate.
:::
The instruction pass into decoder and come aparts to `ADDI`, `0x00` ,`0x05`, `0x0000_0000`
The rs1 pass into R1 idx and `Registers` pass the value of `x0` to next stage.
There no rs2, so the value input to `Registers` is invalid, of course `Reg 2` output are no meaning.

#### EX(Execute)
`0x05` directly pass to next stage.
Multuplexer choose `Reg 1` and `imm` value to ALU.
ALU plus `imm` and `reg1` and pass the sum to next stage.

#### MEM(memory access)
Just pass value.
:::info
The memory unit still have the `Read out` value, but it is invalid.
In this case, the memory unit read the value at 0x0000_0000, but 0x0000_0000 actually doesn't mean address.

:::
#### WB(Write Back)
So in this stage, the multiplexer will choose the value pass from `ALU`, by pass the `memory unit`.
Write the value(`imm` + `0x00`) back to register(`x5`).
(`imm` + `0x00`) is directly pass from ALU to WB stage when the instruction still in the EX stage.
Also, the `Registers' Wr En` will turn green to allow register writing.

#### Result:
`x5` = 0x0000_0000

---
### Get array element address
```cpp=
slli x29, x6, 2
add x13, x3, x29
```
1. Get the array element's offset.
`2`: shift imm
`x6`: i
`x29`: offset
2. Get the array element's address, plus base and offset.
`x3`: array base address
`x29`: offset
`x13`: array element's address
#### IF(Instruction Fetch)
1. Fetch instruction at `0x14` -> 0x0023_1e93

2. Fetch instruction at `0x18` -> 0x01d1_86b3

#### ID(Instruction Decode)
:::info
slli:

---
add:

:::
1. 0x0023_1e93
0b0000000_00010_00110_001_11101_0010011
op -> 0b0010011 = `op-imm`
rd -> 0b11101 = `29`
funct3-> 0b001 = `slli`
rs1 -> 0b00110 = `6`
imm -> 0b00010 = `2`
funct7-> 0b0000000 = `slli`
The instruction pass into decoder and come aparts to `SLLI`, `0x06`, `0x02`, `0x1d`
The rs1 pass into R1 idx and `Registers` pass the value of `x6` to next stage.
The rs2 pass into R2 idx and `Registers` pass the value of `x2` to next stage, ==but it is invalid==.

2. 0x01d1_86b3
0b0000000_11101_00011_000_01101_0110011
op -> 0b0110011 = `op`
rd -> 0b01101 = `13`
funct3-> 0b000 = `add`
rs1 -> 0b00011 = `3`
rs2 -> 0b11101 = `29`
funct7-> 0b0000000 = `add`
The instruction pass into decoder and come aparts to `ADD`, `0x03`, `0x1d`, `0x0d`4
The rs1 pass into R1 idx and `Registers` pass the value of `x3` to next stage.
The rs2 pass into R1 idx and `Registers` pass the value of `x29` to next stage.
The `imm.` value is invalid.

#### EX(Execute)
1. slli x29, x6, 2
`0x1d` directly pass to next stage.
:::info
Because the last instruction is addi x6, x5, 1, but the new value of x6 haven't update to `Registers`, so cause ==data hazard==.
`x6` mutiplexer need choose x6 value ==forwarding== from EX/MEM(==red mark==).

:::
ALU shift `reg1` 2(`0x02`) bits and pass the value to next stage.

2. add x13, x3, x29
Because the last instruction have change the register(x29),

but it doesn't update to Rrgisters, so the multiplxer choose to use the value ==forwarding== from EX/MEM.(==red mark==)

#### MEM(Memory access)
Just pass value.
#### WB(Write Back)
1. slli x29, x6, 2
Write back 0x1d(x29), value = 0x04

2. add x13, x3, x29
Write back 0x0d(x13), value = 0x1000_0004

#### Result:
1. slli x29, x6, 2

2. add x13, x3, x29

---
### bge x10, x11, nochange
```c=
if (arr[j] < arr[min_id]) {
mid_id = j;
}
```
change logic to
```c=
if (arr[j] >= arr[min_id]) {
skip(mid_id = j);
}
```
bge x10, x11, nochange
rs1 = `x10` = arr[j]
rs2 = `x11` = arr[min_id]
`bge`: take the branch if rs1 is greater than or equal to rs2
#### IF(Instruction Fetch)
Fetch instruction at `0x2c` -> 0x00b55463

#### ID(Instruction Decode)
:::info
bge:

:::
0x00b55463
0b0_000000_01011_01010_101_0100_0_1100011
op -> 0b1100011 = `branch`
funct3-> 0b101 = `bge`
src1 -> 0b01010 = `10`
src2 -> 0b01011 = `11`
imm -> 0b000000000100 = `4`
turn into -> 0b0000000001000 = `8` bytes
:::spoiler
Cite from CS61 2018 [Lecture 7](https://inst.eecs.berkeley.edu/~cs61c/resources/su18_lec/Lecture7.pdf)

---
Cite from CS61 2020 [Lecture 6](https://cs61c.org/su20/lectures/?file=lec06.pdf)

:::
---
The instruction pass into decoder and come aparts to `BGE`, `0x0a`, `0x0b`, `0x08`
The src1 pass into R1 idx and `Registers` pass the value of `x10` to next stage.
The src2 pass into R2 idx and `Registers` pass the value of `x11` to next stage.

#### EX(Execute)
:::info
==one stall insert==

Because the last instruction is lw x11, 0(x14), and the bge need value of x11, so cause ==data hazard==, and it can't be deal with forwarding because MEM stage doesn't have the forwarding unit.
So insert one stall, and ==forwarding== from WB can solve the hazard.
:::
The multiplexer choose the value to branch unit:
1. `Register`(bule mark)
2. forwarding value(red mark)
and branch unit decide to turn on the `Branch taken` signal.

Also multiplexer choose the `PC` and `imm(offset)` value to ALU unit.
ALU plus PC and offset and pass to PC unit and fetch the code,

but the code had ==be fetched and decoded== need to be ==flush==.

#### MEM(Memory Access)
no operation
#### WB(Write Back)
on operation
#### Result:
branch to flag: nochange
---
### blt x6, x19, L2
#### IF(Instruction Fetch)
Fetch instruction at `0x3c` -> 0xfd33_4ce3

#### ID(Instruction Decode)
:::info
blt:

:::
0xfd334ce3
0b1_111110_10011_00110_100_1100_1_1100011
op -> 0b1100011 = `branch`
funct3-> 0b100 = `blt`
src1 -> 0b00110 = `6`
src2 -> 0b10011 = `19`
imm -> 0b111111101100 = `-20`
0b000000010100 = `20` -> 20*2===40==
The instruction pass into decoder and come aparts to `BLT`, `0x06`, `0x13`, `0xffffffd8`
The src1 pass into R1 idx and `Registers` pass the value of `x06` to next stage.
The src2 pass into R2 idx and `Registers` pass the value of `x19` to next stage.

#### EX(Execute)
Because the registers, `x19` and `x6`, didn't update to `Registers`, so the two value need to use forwarding value.

So the multiplxer choose to use the `x6` value forwarding from WB(==red mark==), and the `x19` value forwarding from EX/MEM(==blue mark==).

The branch unit get the forwarding value, and decide to turn on the `Branch taken` signal.

ALU plus PC and offset and pass to PC unit and fetch the code,

but the code had ==be fetched and decoded== need to be ==flush==.

#### MEM(Memory Access)
on operation
#### WB(Write Back)
on operation
#### Result:
branch to flag: L2
---
### lw x10, 0(x12)
load data from memory to register
#### IF(Instruction Fetch)
Fetch instruction at `0x1c` -> 0x0006a503

#### ID(Instruction Decode)
:::info
lw:

:::
0x0006_a503
0b000000000000_01101_010_01010_0000011
op -> 0b110001 = `LOAD`
rd -> 0b11001 = `25`
funct3-> 0b010 = `load word`
rs1 -> 0b01101 = `13` (base)
imm -> 0b0 = `0` (offset)
The instruction pass into decoder and come aparts to `LW`, `0x0d`, `0x0a`, `0x00000000`
The rs1(base) pass into R1 idx and `Registers` pass the value of `x13` to next stage.

#### EX(Execute)
Because the registers, `x19`, didn't update to `Registers`, so the value need to use forwarding value.

So the multiplxer choose to use the `x13` value forwarding from EX/MEM(==red mark==).

ALU plus `x13` and offset, then pass the value(0x10000004) to next stage.
#### MEM(Memory Access)
Pass rd value to WB.
Data memory unit get the address(0x1000_0004) from `ALU Res`, and output the corresponding data(0x0000_0002).

#### WB(Write Back)
Write back to `Registers`:
0x0a(x10), value = 0x0000_0002

#### Result:
x10 load data from 0(x13) = 0x02

---
### sw x11, 0(x12)
store data from register to memory
#### IF(Instruction Fetch)
Fetch instruction at `0x58` -> 0x00b62023

#### ID(Instruction Decode)
:::info
sw:

:::
0x00b6_2023
0b0000000_01011_01100_010_00000_0100011
op -> 0b110001 = `STORE`
funct3-> 0b010 = `store word`
rs1 -> 0b01100 = `12` (base)
rs2 -> 0b01011 = `11`
imm -> 0b0 = `0` (offset)
The instruction pass into decoder and come aparts to `SW`, `0x0b`, `0x0c`, `0x00000000`
The rs1 pass into R1 idx and `Registers` pass the value of `x12` to next stage.
The rs2 pass into R2 idx and `Registers` pass the value of `x11` to next stage.

#### EX(Execute)
:::info
==one stall insert==
Because the last instruction is lw x11, 0(x14), and the SW need the data of x11, but MEM stage doesn't have the forwarding unit.

So insert one stall, let LW to WB stage when SW in EX stage, and use the forwarding unit in WB to pass the new x11 value to SW.

:::
Multiplexer pass `imm`(offset) value and Reg 1(base) to ALU, ALU plus them and pass the address to MEM stage.
And multiplexer let the forwarding value pass to `Data in`.

#### MEM(memory access)
The address pass from ALU Res and the Data in pass from last stage forwarding value.
The `Wr en` turn green to allow Memory unit store data to memory.

#### WB(Write Back)
no operation
#### Result:
The value at the address(0x1000_0000) is 0x0000_0001

---
### Memory update
The i index for-loop(big loop), will swap a couple of memory.
example:
> i=0 is initialize
| ==i== | arr[0] | arr[1] | arr[2] | arr[3] | arr[4] |
|-------|--------|--------|--------|--------|--------|
| ==0== | 5 | 3 | 8 | 6 | 1 |
| ==1== | 1 | 3 | 8 | 6 | 5 |
| ==2== | 1 | 3 | 8 | 6 | 5 |
| ==3== | 1 | 3 | 5 | 6 | 8 |
| ==4== | 1 | 3 | 5 | 6 | 8 |
---
## Refine assembly code
According to the above analysis, some instruction can be reorginized.
1. The register always same value can move out from loop.
x19
x18
> cycles 209 -> 197
2. i++ move front to before ==sw x11, 0(x12)==
> cycles 197 -> 193
| info | |
|-------|-----|
|cycles |193 |
|retired|145 |
|CPI |1.33 |
|IPC |0.751|
```cpp
.data
arr: .word 1,2,16,256,8
.text
main:
#get array base address
la x3, arr
#init i
addi x5, x0, 0
addi x19, x0, 5
addi x18, x0, 4
#for loop i
L1:
#init mid_id = i
addi x7, x5, 0
#init j = i + 1
addi x6, x5, 1
#for loop j
L2:
#get arr[j] address offset and address
slli x29, x6, 2
add x13, x3, x29
lw x10, 0(x13)
#get arr[mid_id] address offset and address
slli x30, x7, 2
add x14, x3, x30
lw x11, 0(x14)
#check if condition x10 = arr[j] ; x11 = arr[mid_id]
bge x10, x11, nochange
addi x7, x6, 0
nochange:
#j++
addi x6, x6, 1
blt x6, x19, L2
#get arr[i] address offset and address
slli x28, x5, 2
add x12, x3, x28
#get arr[mid_id] address offset and address
slli x30, x7, 2
add x14, x3, x30
# temp = arr[i]
lw x10, 0(x12)
# arr[i] = arr[mid_id]
lw x11, 0(x14)
addi x5, x5, 1
sw x11, 0(x12)
# arr[mid_id] = temp
sw x10, 0(x14)
blt x5, x18, L1
```
## Reference
1. ==Instruction format pictures== from [The RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf)
2. ==CPU architecture pictures== generate from [Ripes Simulator](https://github.com/mortbopet/Ripes)
3. [CS61C_Su18_Lecture 7](https://inst.eecs.berkeley.edu/~cs61c/resources/su18_lec/Lecture7.pdf) P.45
4. [CS61C_Su20_Letture 6](https://cs61c.org/su20/lectures/?file=lec06.pdf) P.62