# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`dhiptmc`](https://github.com/dhiptmc) >
###### tags: `RISC-V` `Computer Architecture 2022`
## Problem
Decompress Run-Length Encoded List([LeetCode1313](https://leetcode.com/problems/decompress-run-length-encoded-list/))
>We are given a list nums of integers representing a list compressed with run-length encoding.
>
>Consider each adjacent pair of elements [freq, val] = [nums[2*i], nums[2*i+1]] (with i >= 0). For each such pair, there are freq elements with value val concatenated in a sublist. Concatenate all the sublists from left to right to generate the decompressed list.
>
>Return the decompressed list.
#### Example 1:
```
Input: nums = [1,2,3,4]
Output: [2,4,4,4]
Explanation: The first pair [1,2] means we have freq = 1 and val = 2 so we generate the array [2].
The second pair [3,4] means we have freq = 3 and val = 4 so we generate [4,4,4].
At the end the concatenation [2] + [4,4,4] is [2,4,4,4].
```
#### Example 2:
```
Input: nums = [1,1,2,3]
Output: [1,3,3]
```
## Solution
Input array and numsSize are known when we start this program.
We first traverse the input array *(int i = 0; i < numsSize; i += 2)* to find the size of the output array.
Next we collect the element after "freq" so as to get "val".
Note that each "val" needs to be insert into output array for "freq" times.
## Implementation
You can find the source code [here](https://github.com/dhiptmc/ComputerArchitecture/tree/main/Hw1). Feel free to fork and modify it.
### C code
```clike=
/**
* Note: The returned array must be malloced, assume caller calls free().
*/
int *decompressRLElist(int *nums, int numsSize, int *returnSize)
{
*returnSize = 0;
for (int i = 0; i < numsSize; i += 2)
{
*returnSize += nums[i];
}
int *result = malloc(sizeof(int) * (*returnSize));
int count = 0;
for (int i = 0; i < numsSize; i += 2)
{
for (int j = 0; j < nums[i]; j++)
{
result[count++] = nums[i+1];
}
}
return result;
}
```
### Assembly code
```
.data
nums1: .word 1,2,3,4 #2444
nums2: .word 6,5,1,3,2,1 #555555311
nums3: .word 1,3,2,4,2,7,1,8 #344778
numsSize1: .word 4
numsSize2: .word 6
numsSize3: .word 8
returnSize1: .word 0
returnSize2: .word 0
returnSize3: .word 0
enter: .string "\n"
.text
main_1: # execute first test data
la a1, nums1 # get nums[i] address a1->nums1
la a2, numsSize1 # get numsSize address a2->numsSize1
lw a2, 0(a2) # a2 = numsSize
la a3, returnSize1 # get returnSize address a3->returnSize1
lw a3, 0(a3) # a3 = returnSize
addi a4, zero, 0 # a4 = i = 0
#sw 0, 0(a3) # *returnSize1 = 0
addi s2, zero, 2 # record number of data set need to be execute
j outer_loop
#####loop#####
proc:
lw t1, 4(t0) # nums[i+1]
slli t2, a6, 2 # calculate result[count] address
add t2, t2, s1 # calculate result[count] address //s1 -> result base address
sw t1, 0(t2) # result[count] = nums[i+1];
addi a6, a6, 1 # count++
addi a5, a5, 1 # j++
inner_loop:
slli t0, a4, 2 # calculate nums[i] address
add t0, t0, a1 # calculate nums[i] address //a1 -> nums base address
lw t1, 0(t0) # load nums[i] to t1
blt a5, t1, proc # if j < nums[i], jump to proc
add a3, a3, t1 # *returnSize += nums[i] ##modified
addi a5, zero, 0 # a5 = j = 0
addi a4, a4, 2 # i += 2
outer_loop:
blt a4, a2, inner_loop # if i < numsSize, jump to inner_loop
addi a4, zero, 0 # set i = 0
addi a5, zero, 0 # a5 = j = 0
addi a6, zero, 0 # a6 = count = 0
j print
#####load different test data#####
main_2:
la a1, nums2
la a2, numsSize2 # get numsSize2 address a2->numsSize2
lw a2, 0(a2) # a2 = numsSize2
la a3, returnSize2 # get returnSize2 address a3->returnSize2
lw a3, 0(a3) # a3 = returnSize
addi a4, zero, 0 # reset i
j outer_loop
main_3:
la a1, nums3
la a2, numsSize3 # get numsSize3 address a2->numsSize3
lw a2, 0(a2) # a2 = numsSize3
la a3, returnSize3 # get returnSize3 address a3->returnSize3
lw a3, 0(a3) # a3 = returnSize
addi a4, zero, 0 # reset i
j outer_loop
#####print result#####
print:
slli t2, a4, 2 # calculate result[i] address
add t2, t2 ,s1 # calculate result[i] address //s1 -> result base address
lw a0, 0(t2) # load result[i] to a0
li a7, 1 # system call print
ecall # execute system call
addi a4, a4, 1 # i++
blt a4, a3, print # if i < returnSize, jump to print
la a0, enter # print \n
li a7, 4 # a7 = 4 means print string
ecall
addi s2, s2, -1
bgtz s2,main_2 # if s2 > 0 jump to main_2
beqz s2,main_3 # if s2 == 0 jump to main_3
j exit
exit:
li a7, 10 # system call exit
ecall # execute system call
```
:::warning
:warning: Can you use fewer instructions?
:notes: jserv
:::
:::success
Indeed, there are possible ways to use fewer instructions.
Modified version of Assembly code is shown below.
```
.data
nums: .word 1,2,3,4, 6,5,1,3,2,1, 1,3,2,4,2,7,1,8
numsSize: .word 4,6,8
returnSize: .word 0,0,0
enter: .string "\n"
result: .word 0,0,0,0,0,0,0,0,0,0 ##used to avoid EXCEEDING boundaries
#####function starts#####
.text
initial:
addi t5, zero, 3 # record number of testing datas need to be executed
addi s4, zero, 0 # identify which testing data is executing
la a1, nums # get nums[i] address a1 -> nums
la a2, numsSize # get numsSize address a2 -> numsSize
la a3, returnSize # get returnSize address a3 -> returnSize
main:
lw s3, 0(a3) # s3 = returnSize
lw s2, 0(a2) # s2 = numsSize
lw a4, result # base address of result
lw t4, result # the desired result address
j loop
#####loop#####
proc:
sw t2, 0(t4) # result[count] = nums[i+1]
addi t4, t4, 4 # t4 -> result[count+1]
addi a6, a6, 1 # j++
blt a6, t1, proc # if j < nums[i], jump back to proc
add s3, s3, t1 # *returnSize += nums[i]
addi a6, zero, 0 # a6 = j = 0
addi a5, a5, 2 # i += 2
addi a1, a1, 8 # a1 -> nums[i+2]
loop:
lw t1, 0(a1) # t1 = nums[i]
lw t2, 4(a1) # t2 = nums[i+1]
bge a5, s2, print # if i >= numsSize, jump to print
blt a6, t1, proc # if j < nums[i] , jump to proc
#####print result and initialize next test set#####
print:
lw a0, 0(a4) # load result[i] to a0
li a7, 1 # system call print
ecall # execute system call
addi a4, a4, 4 # move to next result word
bne a4, t4, print # if a4 != final address of t4, jump to print
la a0, enter # print \n
li a7, 4 # a7 = 4 means print string
ecall
##initialize
addi a5, zero, 0 # a5 = i = 0
addi s4, s4, 1 # move to the next testing data
addi a2, a2, 4 # move to the next numsSize
addi a3, a3, 4 # move to the next returnSize
blt s4, t5, main # if s4 < 3 jump to main
j exit
exit:
li a7, 10 # system call exit
ecall # execute system call
```
As a result, clock cycle as the figure shown below decreases compared to the original work.
Origin:

After modified:

:::
## Analysis
We test our code using [Ripes](https://github.com/mortbopet/Ripes) simulator
## Results
* Test data 1
```
Input: nums = [1,2,3,4]
Output: [244]
```
* Test data 2
```
Input: nums = [6,5,1,3,2,1]
Output: [555555311]
```
* Test data 3
```
Input: nums = [1,3,2,4,2,7,1,8]
Output: [344778]
```
#### Ripes print

## 5-stage pipelined processor

### IF: instruction fetch stage

**Fetch instruction from memory.**
At this example, ***lw x12 0 x12*** is in IF stage.
Current PC(Program counter) is 0x00000010, which leads to an instruction decoded as 0x00062603.
We can find out that there is a simple ALU that adds 4 to the current PC which eventually returns the next PC.
Meanwhile, current PC and PC+4 will be delivered to *IFID* for later use.
Noted that there is also a possibility that *PC*(PC Unit) receives a value from **EXE** stage.
### ID: instruction decode/register file read stage

**Read registers while decoding the instruction.**
At this example, ***addi x12 x12 64*** is in ID stage.
*Decode*(Decode Unit) outputs the x12 address(0x0c) for the *Registers*, and *Registers* outputs the corresponding value stored in x12 to *IDEX*. In this circumstance, *Reg2* is unnecessary for *addi* operation, so we can ignore it.
*Imm.* unit(Immediate Generator) outputs the immediaite value, in this case, 0x00000040(32'd64) will be delivered to *IDEX* for later use.
Noted that target register (x12) address for storing back to the *Registers*, PC, and PC+4 are also delivered to *IDEX*.
### EXE: execution stage

**Execute the operation or calculate an address.**
At this example, ***auipc x12 0x10000*** is in EXE stage.
**EXE** stage is somewhat complicated. 4 MUXs(multiplexer) can be found.
The left two MUXs choose the source of the input registers, which can be directly come from *IDEX* (*Reg1*/*Reg2*) or from two different forwarding paths.
The right two MUXs choose whether the *Op* source comes from PC, *IMM.* unit or register. In this case, *Op1* comes from PC, and *Op2* comes from *IMM.* unit.
*ALU* unit eventually finsihes the arithmetic operation and outputs ths result to *EXMEM*, and outputs the target PC address to **IF** stage in some branch instructions.
In addition, a *Branch* unit takes value of two different registers in order to perform branch control, which splits the traditional operation from the *ALU* unit so as to alleviate the workload for *ALU* unit, and may have better performance for branch operations.
Noted that target register, one register input, and PC+4 is also delivered to *EX/MEM*.
### MEM: memory access stage

**Access an operand in data memory.**
At this example, ***addi x11 x11 0*** is in MEM stage.
*Data memory* unit recieves the *Addr.* and *Data in* signals to perform memory read/write. In this circumstance, there is no need to use *Data memory* unit, so we can simply ignore the *Read out* result.
In addition, result from the previous **EXE** stage can be forwarded to the current **EXE** stage.
### WB: write-back stage

**Write the result into a register.**
At this example, ***auipc x11 0x10000*** is in WB stage.
MUX chooses the input signal from PC+4, *Data memory* unit, or result from the **EXE** stage.
In this circumstance, we choose the result from the **EXE** stage(0x10000000), and stores the value to the register whose target address is 0x0b.
In addition, output signal from the MUX can be forwarded to the **EXE** stage.
## Memory update steps
**Memory condition when program starts**


From the Instruction memory list, we can discover that the next few instructions has been loaded into the pipeline.
For instance, ***lw x12 0 x12*** is in IF stage.
**Memory updates during main_1**


a1 gets the base address of nums1.
a2 gets the numsSize1.
a3 = returnSize which initials as 0 for later use.
a4 = i = 0 for later branch deciding.
s2 = 2 records numbers of testing sets need to be execute.
**Memory updates during proc and inner_loop**


t0 gets the nums1[i] address.
t1 first loads nums[i], and is used to decide whether to branch to proc or not.
Then, t1 loads nums[i+1], t2 gets result[count] address, and nums[i+1] is stored in "result[count] address".
Noted that we shift the index 2 bits so as to match the memory address.
Eventually, we make a5 += 1(j++), a6 += 1(count++).
t1 reloads nums[i], and we check the relationship between the value in a5 and t1, which leads us to decide whether go backs to proc or not.
If we don't need to branch to proc again, we add t1(nums[i]) to a3(returnSize), set a5(j) to 0, and add 2 to a4(i).
**Memory updates during outer_loop**


We reset a4(i), a5(j), a6(count) to zero, prepare for later use.
**Memory updates after print**


s2 value is decreased by 1.
## Hazard
### Data Hazard
Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Ignoring potential data hazards can result in race conditions (also termed race hazards). There are three situations in which a data hazard can occur:
* read after write (RAW), a true dependency
* write after read (WAR), an anti-dependency
* write after write (WAW), an output dependency
Read after read (RAR) is not a hazard case.
* **Solution**
1. Forwarding

By forwarding, we can use the result from **MEM** stage or **WB** stage immediately.
As the picture shown above, the current instruction in **EXE** stage is ***addi x11 x11 0***, while the previous instruction is ***auipc x11 0x10000***.
Therefore, we pass the value from *EX/MEM* to **EXE** stage as the yellow line shows in order to solve the data hazard.
2. Stall when Load-Use

One case where forwarding cannot save the day is when an instruction tries to read a register following a load instruction that writes the same register. The picture shown above illustrates the problem. In this situation, a stall is needed.
### Control Hazard
Control hazard occurs when the pipeline makes wrong decisions on branch prediction and therefore brings instructions into the pipeline that must subsequently be discarded. The term branch hazard also refers to a control hazard.
* Solution:

Control hazard occurs when the branch is taken. Hence, we need to flush the sequential instructions that follow the branch. As a matter of fact, we will waste 2 cycles since branch is dicided at **EXE** stage.
One way to improve branch performance is to reduce the cost of the taken branch. Moving the branch execution to the **ID** stage will be a wise choice, since it reduces the penalty of a branch to only one instruction if the branch is taken, namely, the one currently being fetched. Despite there may be several difficulties to conquer (several new units and mechanisms needed), it is still an improvement to the performance.

> David A. PATTERSON, JOHN L. HENNESSY. Computer Organization and Design 4/e, page 379 FIGURE 4.62.
## Pseudo instruction
Origin:
```
00000000 <main_1>:
0: 10000597 auipc x11 0x10000
4: 00058593 addi x11 x11 0
8: 10000617 auipc x12 0x10000
c: 04060613 addi x12 x12 64
10: 00062603 lw x12 0 x12
14: 10000697 auipc x13 0x10000
18: 04068693 addi x13 x13 64
1c: 0006a683 lw x13 0 x13
20: 00000713 addi x14 x0 0
24: 00200913 addi x18 x0 2
28: 0380006f jal x0 56 <outer_loop>
0000002c <proc>:
2c: 0042a303 lw x6 4 x5
30: 00281393 slli x7 x16 2
34: 009383b3 add x7 x7 x9
38: 0063a023 sw x6 0 x7
3c: 00180813 addi x16 x16 1
40: 00178793 addi x15 x15 1
00000044 <inner_loop>:
44: 00271293 slli x5 x14 2
48: 00b282b3 add x5 x5 x11
4c: 0002a303 lw x6 0 x5
50: fc67cee3 blt x15 x6 -36 <proc>
54: 006686b3 add x13 x13 x6
58: 00000793 addi x15 x0 0
5c: 00270713 addi x14 x14 2
00000060 <outer_loop>:
60: fec742e3 blt x14 x12 -28 <inner_loop>
64: 00000713 addi x14 x0 0
68: 00000793 addi x15 x0 0
6c: 00000813 addi x16 x0 0
70: 0540006f jal x0 84 <print>
00000074 <main_2>:
74: 10000597 auipc x11 0x10000
78: f9c58593 addi x11 x11 -100
7c: 10000617 auipc x12 0x10000
80: fd060613 addi x12 x12 -48
84: 00062603 lw x12 0 x12
88: 10000697 auipc x13 0x10000
8c: fd068693 addi x13 x13 -48
90: 0006a683 lw x13 0 x13
94: 00000713 addi x14 x0 0
98: fc9ff06f jal x0 -56 <outer_loop>
0000009c <main_3>:
9c: 10000597 auipc x11 0x10000
a0: f8c58593 addi x11 x11 -116
a4: 10000617 auipc x12 0x10000
a8: fac60613 addi x12 x12 -84
ac: 00062603 lw x12 0 x12
b0: 10000697 auipc x13 0x10000
b4: fac68693 addi x13 x13 -84
b8: 0006a683 lw x13 0 x13
bc: 00000713 addi x14 x0 0
c0: fa1ff06f jal x0 -96 <outer_loop>
000000c4 <print>:
c4: 00271393 slli x7 x14 2
c8: 009383b3 add x7 x7 x9
cc: 0003a503 lw x10 0 x7
d0: 00100893 addi x17 x0 1
d4: 00000073 ecall
d8: 00170713 addi x14 x14 1
dc: fed744e3 blt x14 x13 -24 <print>
e0: 10000517 auipc x10 0x10000
e4: f8050513 addi x10 x10 -128
e8: 00400893 addi x17 x0 4
ec: 00000073 ecall
f0: fff90913 addi x18 x18 -1
f4: f92040e3 blt x0 x18 -128 <main_2>
f8: fa0902e3 beq x18 x0 -92 <main_3>
fc: 0040006f jal x0 4 <exit>
00000100 <exit>:
100: 00a00893 addi x17 x0 10
104: 00000073 ecall
```
:::success
Modified:
```
00000000 <initial>:
0: 00300f13 addi x30 x0 3
4: 00000a13 addi x20 x0 0
8: 10000597 auipc x11 0x10000
c: ff858593 addi x11 x11 -8
10: 10000617 auipc x12 0x10000
14: 03860613 addi x12 x12 56
18: 10000697 auipc x13 0x10000
1c: 03c68693 addi x13 x13 60
00000020 <main>:
20: 0006a983 lw x19 0 x13
24: 00062903 lw x18 0 x12
28: 10000717 auipc x14 0x10000
2c: 03a72703 lw x14 58 x14
30: 10000e97 auipc x29 0x10000
34: 032eae83 lw x29 50 x29
38: 0240006f jal x0 36 <loop>
0000003c <proc>:
3c: 007ea023 sw x7 0 x29
40: 004e8e93 addi x29 x29 4
44: 00180813 addi x16 x16 1
48: fe684ae3 blt x16 x6 -12 <proc>
4c: 006989b3 add x19 x19 x6
50: 00000813 addi x16 x0 0
54: 00278793 addi x15 x15 2
58: 00858593 addi x11 x11 8
0000005c <loop>:
5c: 0005a303 lw x6 0 x11
60: 0045a383 lw x7 4 x11
64: 0127d463 bge x15 x18 8 <print>
68: fc684ae3 blt x16 x6 -44 <proc>
0000006c <print>:
6c: 00072503 lw x10 0 x14
70: 00100893 addi x17 x0 1
74: 00000073 ecall
78: 00470713 addi x14 x14 4
7c: ffd718e3 bne x14 x29 -16 <print>
80: 10000517 auipc x10 0x10000
84: fe050513 addi x10 x10 -32
88: 00400893 addi x17 x0 4
8c: 00000073 ecall
90: 00000793 addi x15 x0 0
94: 001a0a13 addi x20 x20 1
98: 00460613 addi x12 x12 4
9c: 00468693 addi x13 x13 4
a0: f9ea40e3 blt x20 x30 -128 <main>
a4: 0040006f jal x0 4 <exit>
000000a8 <exit>:
a8: 00a00893 addi x17 x0 10
ac: 00000073 ecall
```
:::
## Reference
* [Computer Organization and Design 4/e](http://staff.ustc.edu.cn/~llxx/cod/reference_books/Computer%20Organization%20and%20Design%204th.pdf)
* [Hazard (computer architecture)](https://en.wikipedia.org/wiki/Hazard_(computer_architecture))