# Lab1: RV32I Simulator
###### tags: `Course` `RISC-V` `Computer Architecture 2022`
## Problem
* Given an array nums. We define a running sum of an array as runningSum[i] = sum(nums[0]…nums[i]).
Return the running sum of nums
More detail description can be founded at [LeetCode 1480. Running Sum of 1d Array](https://leetcode.com/problems/running-sum-of-1d-array/)
## Solution
* Pass the first element to result 0 as initial. Then add the A[i-1] element with num[i], then pass the value to the result. Finally return the result arr
## C Code
```c=
#include <stdlib.h>
#include <stdio.h>
int* runningSum(int* nums, int numsSize){
int* result = (int*)malloc(sizeof(int)*numsSize);
for(int i=0; i<numsSize; i++){
if(i==0) result[0] = nums[0];
else result[i] = result[i-1] + nums[i];
}
return result;
}
int main(int argc, char *argv[]){
int nums1[5] = {1,2,3,4,5};
int nums2[6] = {1,3,5,7,9,11};
int nums3[7] = {0,2,4,6,8,10,12};
int* result1 = runningSum(nums1, 5);
int* result2 = runningSum(nums2, 6);
int* result3 = runningSum(nums3, 7);
for(int j=0; j<5; j++){
printf("%d ", result1[j]);
}
printf("\n");
for(int j=0; j<6; j++){
printf("%d ", result2[j]);
}
printf("\n");
for(int j=0; j<7; j++){
printf("%d ", result3[j]);
}
printf("\n");
}
```
## Assembly Code
:::spoiler
```RISCV=
# The old one
.data
nums1: .word 1,2,3,4,5
nums2: .word 1,3,5,7,9,11
nums3: .word 0,2,4,6,8,10,12
res1: .word 0,0,0,0,0,
res2: .word 0,0,0,0,0,0
res3: .word 0,0,0,0,0,0,0
space: .string " "
nl: .string "\n"
.text
main:
# t2: nums head, t3: result head, t4: numsSize
la t2, nums1 # load arr base to t2
la t3, res1 # load res base to t3
li t4, 5 # load the numsize to t4
jal ra, runningSum
jal ra, print
la t2, nums2 # load arr base to t2
la t3, res2 # load res base to t3
li t4, 6 # load the numsize to t4
jal ra, runningSum
jal ra, print
la t2, nums3 # load arr base to t2
la t3, res3 # load res base to t3
li t4, 7 # load the numsize to t4
jal ra, runningSum
jal ra, print
j exit
runningSum:
# t5: i, t6: j
# initialize
li t5, 0 # load i with 0
li t6, 0 # load j with 0
j end.L1 # jump to L1
end.L1:
bge t5, t4, end.L6 # if i >= numsSize then branch to L6
j end.L2 # jump to L2
end.L2:
bnez t5, end.L4 # if i != 0 then branch to L4
j end.L3 # jump to L3
end.L3:
lw a3, 0(t2) # load nums[0] to a3
sw a3, 0(t3) # save num[0] to result[0]
j end.L5 # jump to L5
end.L4:
addi a3, t3, 0 # load result's address to a3
addi a4, t5, 0 # load i to a4
slli a6, a4, 2 # i*4 (cause of byte size)
add a4, a3, a6 # add result's address with i*4
# to get result[i]'s address
lw a3, -4(a4) # load result[i-1] to a3
addi a5, t2, 0 # load nums's address
add a5, a5, a6 # add nums's address with i*4
# to get nums[i]'s address
lw a5, 0(a5) # load nums[i] to a5
add a3, a3, a5 # add result[i-1] + nums[i]
sw a3, 0(a4) # save to result[i]
j end.L5 # jump to L5
end.L5:
addi t5, t5, 1 # i = i + 1
j end.L1 # jump to L1
end.L6:
jr ra # retuen to main
print:
slli a4, t6, 2
add a5, t3, a4
lw a0, 0(a5) # load result out
li a7, 1 # pint a0
ecall
la a0, space #load space
li a7, 4 #print string
ecall
addi t6, t6, 1 # j++
blt t6, t5, print # loop
la a0, nl #load next line
li a7, 4 #print string
ecall
jr ra
exit:
li a7, 10 # exit
ecall
```
:::
```RISCV=
.data
nums1: .word 1,2,3,4,5
nums2: .word 1,3,5,7,9,11
nums3: .word 0,2,4,6,8,10,12
res1: .word 0,0,0,0,0,
res2: .word 0,0,0,0,0,0
res3: .word 0,0,0,0,0,0,0
space: .string " "
nl: .string "\n"
.text
main:
# t2: nums head, t3: result head, t4: numsSize
la t2, nums1 # load arr base to t2
la t3, res1 # load res base to t3
li t4, 5 # load the numsize to t4
jal ra, runningSum
la t2, nums2 # load arr base to t2
la t3, res2 # load res base to t3
li t4, 6 # load the numsize to t4
jal ra, runningSum
la t2, nums3 # load arr base to t2
la t3, res3 # load res base to t3
li t4, 7 # load the numsize to t4
jal ra, runningSum
j exit
runningSum:
# t5: i, t6: j # initialize
li t5, 1 # load i with 1
lw a3, 0(t2) # load nums[0] to a3
sw a3, 0(t3) # save num[0] to result[0]
end.L1:
slli a0, t5, 2 # i*4 (cause of byte size)
add a1, t3, a0 # add result's address with i*4 to get result[i]'s address
lw a3, -4(a1) # load result[i-1] to a3
add a2, t2, a0 # add nums's address with i*4 to get nums[i]'s address
lw a2, 0(a2) # load nums[i] to a5
add a3, a3, a2 # add result[i-1] + nums[i]
sw a3, 0(a1) # save to result[i]
addi t5, t5, 1 # i = i + 1
blt t5, t4, end.L1 # if i >= numsSize then branch to L3
li t5, 0
print:
slli a1, t5, 2
add a1, t3, a1
lw a0, 0(a1) # load result out
li a7, 1 # pint a0
ecall
la a0, space #load space
li a7, 4 #print string
ecall
addi t5, t5, 1 # j++
blt t5, t4, print # loop
la a0, nl #load next line
li a7, 4 #print string
ecall
jr ra
exit:
li a7, 10 # exit
ecall
```
:::warning
:warning: Can you use fewer instructions?
:notes: jserv
:::
:::info
**Update at 10/6**
* The old one I puted it in the spoiler `>`
* Modify the numbers of Label from 6 into 1
* Modify $i$ begin form 1 not 0, because I put lw,sw of nums [0] right after li
* Modify & Replace $bgt\;t5,\;t4,\;end.L1$ into $blt\;t5,\;t4,\;end.L1$. Thus jr can put in L1
* Delete the reg $t6$
* Delete the code $bnez\;t5,\;end.L2$
* Delete redundant code, which load data to another reg. modify at part $runningSum$
* Only use $a_1$ at $Print$
**Total numbers of lines have decrease from 86 into 62**
:::
## Pipeline stage explain
* **IF : Instruction Fetch**
* The instructioins is fetched at this stage from $Instr. Memory$, and the pointer counter's value is also given at this stage.

* **ID : Instruction Decode**
* The instruction is decoded into OPcode and get the value from registers.

* **EXE : Execute**
* Use the OPcode at previous stage to determine which arithmetic operator should be used to the value which is also passed by the previous stage.

* **MEM : Memory access**
* The result of the calculation will be buffered to the Data Memory.
* **WB : Write Back**
* The data and the register address will be write back to Register File

## Result & Memory & Register
### Result
* **The result array should be**
* nums1
> 1
1 + 2 = 3
1 + 2 + 3 = 6
1 + 2 + 3 + 4 = 10
1 + 2 + 3 + 4 + 5 = 15
* nums2
> 1
1 + 3 = 4
1 + 3 + 5 = 9
1 + 3 + 5 + 7 = 16
1 + 3 + 5 + 7 + 9 = 25
1 + 3 + 5 + 7 + 9 + 11 = 36
* nums3
> 0
0 + 2 = 0
0 + 2 + 4 = 6
0 + 2 + 4 + 6 = 12
0 + 2 + 4 + 6 + 8 = 20
0 + 2 + 4 + 6 + 8 + 10 = 30
0 + 2 + 4 + 6 + 8 + 10 + 12 = 42
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| - | - | - | - | - | - | - |
| 1 | 3 | 6 | 10 | 15 |||
| 1 | 4 | 9 | 16 | 25 | 36 ||
| 0 | 2 | 6 | 12 | 20 | 30 | 42 |
* **Result $Ripes$ print**

### Memory & Register
* The $nums$ is store at $0x1000\_0000$; $res$ is store at $0x1000\_0018$; $numsize$ is store at $0x1000\_0014$ with size $5$;

* The step that pass nums[0] to res[0]

* The reg when load res's address & i's value

* The reg when adding result

* The reg when print the result out

## Pipeline Feature
### Data Hazard
* **The reason why it occur:**
Because we only have 5 stage pipeline. When the current instruction need the result of previous instruction, the CPU need to stall until the result WriteBack. Then CPU will waste lots of time to waiting for the result. Thus, there are two ways to solve this kind of data hazard.
* **solution**
* Forwarding
By forwarding, we can take the result from the previous instruction immediately. As the picture below, the next instruction is $add\;x12\;x12\;x13$ while the previous one is $addi\;x12\;x7\;0$. Thus we pass the value from EX/MEM Reg to EXE stage as the yellow line shows. So we can run the pipeline continuosly.

* Stall (load use)
However, there are still some cases that we need to stall. Just like the below case, we need to $lw\;x10\;0(x7)$. Immediately, the next one need $sw\;x10\;0(x28)$ to save word, while the data hasn't be readed from the Memory. Therefore, we stall the CPU for one cycle, waiting for the data be prepared.

### Control Hazard
* **The reasin why it occur:**
Control hazard occur when the branch is taken. Hence, we need to Flush the instructions that we don't want them to be execute. As the result, we will waste 2 cycle because branch is calculated at EXE stage. Like the picture below, $bne\;x30\;x0\;L4$ is taken. Then, the two of the instructions after branch is Flushed.

* **solution**
* Put branch unit at ID stage
Move the place where branch is judged forward one. That is, Branch decide at ID stage, not at EXE stage. Finally, it will only cost 1 cycle for branch taken.

* Use predict scheme
Usually, we will use branch when we are dealing with loop. Consequently, what we can do is to "guess" the next branch is taken or not taken. We can use "static" or "dynamic" prediction to solve the Control Hazard.


## Reference
* [Stall and Flush](https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec13.pdf)
* [What's the role of EX stage for branching in Pipelined MIPS w Forwarding?](https://stackoverflow.com/questions/44866407/whats-the-role-of-ex-stage-for-branching-in-pipelined-mips-w-forwarding)
* [淺談分支預測與 Hazards 議題](https://ithelp.ithome.com.tw/m/articles/10265705)
* [在Hazard尋求解法是否搞錯了什麼](https://ithelp.ithome.com.tw/articles/10268810)