# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [chinghongfang](https://github.com/chinghongfang) >
###### tags: `RISC-V` `computer architure 2021`
## Problem
[Leetcode 330 Patching Array](https://leetcode.com/problems/patching-array/)
Given a sorted integer array $nums$ and an integer $n$, add/patch elements to the array such that any number in the range $[1, n]$ can be formed by the sum of some elements in the array. Find the least number to add.
Example:
$nums = [1,5,10]$, $n = 20$
Ans:
2
## Solution
### Expalain
In previous Example, add numbers {2, 4}. Then the array becomes {1, 2, 4, 5, 10}
$\implies$ 1 = 1
2 = 2
3 = 1+2
4 = 4
5 = 5
6 = 1+5
7 = 2+5
8 = 1+2+5
9 = 4+5
10 = 10
11 = 1+10
12 = 2+10
13 = 1+2+10
14 = 4+10
15 = 5+10
16 = 1+5+10
17 = 2+5+10
18 = 1+2+5+10
19 = 4+5+10
20 = 1+4+5+10
All numbers in range$[1, 20]$ can be composed by elements in the array.
### Concept
Consider that all 32-bit unsigned number can be composed of the 32 numbers:
0x0000_0001
0x0000_0002
0x0000_0004
0x0000_0008
0x0000_0010
0x0000_0020
...
0x8000_0000
So the numbers we patch should not exceed 32.
### Algorithm
Suppose that we have a set $S$ that can build the numbers in range $[0, k]$.
Then,
* ($S$ $\cup$ $x$) can build (range $[0, k]$ $\cup$ range $[x, k+x]$). ($x$ is a natural number).
By the constraint, we
1. Start from an empty set $S = \{\}$ which can build range $[0,0]$.
2. Each time we add a number in $nums$ or "the best" number to the set $S$.
"The best" number is ($k+1$) because ($k+1$) can expand the range most. And do not cause a gap.
$$
range [0, k] \cup range [k+1, k+1+x] = range [0, k+1+x]
$$
So, any time we encounter a number (in nums) bigger than "the best" number, we add "the best" number first to fill the gap and expand the range.
## C Code
```c
#include<stdio.h>
int main()
{
int nums[] = {1, 5, 10}; // Numbers we have.
int len = 3; // The length of the array
int target = 20; // The upper bound of the range
int ans = 0; // Number of patches we need
long now = 0; // The range of number we can form now
int idx = 0; // The `nums` index we have consumed
while (now < target) {
if (idx < len && (long) nums[idx] <= now + 1) {
now += nums[idx];
++idx;
} else {
now += now + 1;
++ans;
}
}
printf("%d\n", ans);
return 0;
}
```
:::info
Alternatively, we can even shorten as following:
```c
int minPatches(int *nums, int numsSize, int n)
{
unsigned int m = 1, res = 0, i = 0;
while (m <= n)
m += (i < numsSize && nums[i] <= m) ? nums[i++] : (res++, m);
return res;
}
```
:notes: jserv
:::
:::info
Yes,
* using unsigned integer for `m` will prevent from 64-bit integer calculation.
* And set `m` to `1` will avoid adding continuously.
It reduces the execution time. Thanks for this patch.
I will rewrite my c code and the assembly code at the bottom of this page. Also, I will reconstruct it to a fucntion and a main.
Thanks,
ChingHongFang
:::
## abi Code
```
.data
nums: .word 1,5,10
len: .word 3
target: .word 20
.text
main:
la s1, nums # s1 = nums[]; // array
lw s2, len # s2 = len; // length of the array
lw s3, target # s3 = target; // range upper bound
add s4, zero, zero # s4 = ans = 0; // set `ans` to 0
add s5, zero, zero # s5, s6 represent `now`. A 64-bits integer.
add s6, zero, zero # Set `now` to zero
add s7, zero, zero # s7 = idx = 0; // set `idx` to 0
while:
bgt s5, zero, exit # now > target. s5 is more significant.
bge s6, s3, exit # Check s6 >= s3 to break the while loop
# if statment
bge s7, s2, else # idx >= len
lw t0, (0)s1 # t0 = nums[idx]
addi t0, t0, -1 # t0 -= 1
blt zero, s5, if # nums[idx]-1 < now
bgt t0, s6, else # nums[idx]-1 > now
if:
# now += nums[0]; // In fact, now += nums[idx];
add a0, zero, s5 # do function call place argument
add a1, zero, s6 # do function call place argument
add a2, zero, zero # do function call place argument
lw a3, (0)s1 # do function call place argument
jal ra, add_positive_long # function call
add s5, zero, a0 # get return value
add s6, zero, a1 # get rerurn value
addi s1, s1, 4 # nums += 1
addi s7, s7, 1 # ++idx
j while # Loop again
else:
add a0, zero, s5 # do function call place argument
add a1, zero, s6 # do function call place argument
add a2, zero, s5 # do function call place argument
addi a3, s6, 1 # do function call place argument
jal ra, add_positive_long # function call
add s5, zero, a0 # get return value
add s6, zero, a1 # get rerurn value
addi s4, s4, 1 # ++ans
j while # Loop again
exit:
add a0, s4, zero # load print value
li a7, 1 # print a0
ecall
li a7, 10 # exit
ecall
add_positive_long:
# a0a1 + a2a3
add a1, a1, a3 # Add lower bit. a1 = a1 + a3
bge a1, zero, no_carry # a1 becomes negative if overflow occurs.
addi a0, a0, 1 # Add 1 to higher bit
# make mask: 0x7fff_ffff
addi t0, zero, 1 # t0 = 1
slli t0, t0, 31 # t0 = 0x8000_0000
not t0, t0 # t0 = 0x7fff_ffff
# clear sign bit
and a1, a1, t0 # a1 = a1 & 0x7fff_ffff
no_carry:
add a0, a0, a2 # a0 = a0 + a2
ret
```
## Assembly Code
```
00000000 <main>:
0: 10000497 auipc x9 0x10000
4: 00048493 addi x9 x9 0
8: 10000917 auipc x18 0x10000
c: 00492903 lw x18 4 x18
10: 10000997 auipc x19 0x10000
14: 0009a983 lw x19 0 x19
18: 00000a33 add x20 x0 x0
1c: 00000ab3 add x21 x0 x0
20: 00000b33 add x22 x0 x0
24: 00000bb3 add x23 x0 x0
00000028 <while>:
28: 07504463 blt x0 x21 104 <exit>
2c: 073b5263 bge x22 x19 100 <exit>
30: 032bde63 bge x23 x18 60 <else>
34: 0004a283 lw x5 0 x9
38: fff28293 addi x5 x5 -1
3c: 01504463 blt x0 x21 8 <if>
40: 025b4663 blt x22 x5 44 <else>
00000044 <if>:
44: 01500533 add x10 x0 x21
48: 016005b3 add x11 x0 x22
4c: 00000633 add x12 x0 x0
50: 0004a683 lw x13 0 x9
54: 050000ef jal x1 80 <add_positive_long>
58: 00a00ab3 add x21 x0 x10
5c: 00b00b33 add x22 x0 x11
60: 00448493 addi x9 x9 4
64: 001b8b93 addi x23 x23 1
68: fc1ff06f jal x0 -64 <while>
0000006c <else>:
6c: 01500533 add x10 x0 x21
70: 016005b3 add x11 x0 x22
74: 01500633 add x12 x0 x21
78: 001b0693 addi x13 x22 1
7c: 028000ef jal x1 40 <add_positive_long>
80: 00a00ab3 add x21 x0 x10
84: 00b00b33 add x22 x0 x11
88: 001a0a13 addi x20 x20 1
8c: f9dff06f jal x0 -100 <while>
00000090 <exit>:
90: 000a0533 add x10 x20 x0
94: 00100893 addi x17 x0 1
98: 00000073 ecall
9c: 00a00893 addi x17 x0 10
a0: 00000073 ecall
000000a4 <add_positive_long>:
a4: 00d585b3 add x11 x11 x13
a8: 0005dc63 bge x11 x0 24 <no_carry>
ac: 00150513 addi x10 x10 1
b0: 00100293 addi x5 x0 1
b4: 01f29293 slli x5 x5 31
b8: fff2c293 xori x5 x5 -1
bc: 0055f5b3 and x11 x11 x5
000000c0 <no_carry>:
c0: 00c50533 add x10 x10 x12
c4: 00008067 jalr x0 x1 0
```
## Pipeline stage explain
IF: Instruction fetch. Get the instruction from memory according to the program counter(PC).
ID: Instruction Decode.
1. Get the register value that the instruction need.
2. Control most of the multiplexer according to opcode.
3. Deal with imm field(shift left 2 bits).
EX: Exceute. Where Arithmetic Logic Unit(ALU) locate. Compute according to opcode and get the answer. Include (+ - \* / << >> & | ...).
MEM: Memory. Write data to memory or Read data from memory.
WB: Write Back. Write the result back to register.
## Memory
At first, we load data to registers. The data is store in the memory start from address 0x1000_0000. As a result, the assembler use `auipc` and `program counter` to set the higher 20 bits to what we want. Then use `addi` to set the lower 12 bits to get the memory address. Or just load word dirctly by the imcomplete address.
```
.data
nums: .word 1,5,10 # this array locate 0x1000_0000
len: .word 3 # this integer locate 0x1000_000C
target: .word 2000000000 # this integer locate 0x1000_0010
```
($\because$ `nums` has 3 integers. It takes 4\*3 bytes storing in memory.
$\therefore$ the address of `len` is 0x1000_000C)
To get `nums`'s address:
```
# la s1, nums
0: 10000497 auipc x9 0x10000 # x9 == 0x1000_0000
4: 00048493 addi x9 x9 0
```
To load `len`:
```
8: 10000917 auipc x18 0x10000 # x18 == 0x1000_0008
c: 00492903 lw x18 4 x18
```
To load `target`:
```
10: 10000997 auipc x19 0x10000 # x19 == 0x1000_0010
14: 0009a983 lw x19 0 x19
```
----
## Pipeline data hazard
The pipeline has 5 stages. They are instruction fetch(IF), instruction decode(ID), execute(EX), memory(MEM), write back(WB). Each takes one cycle to run.
The data hazard happens when we change a register value in last instruction. But the value has not yet writen back to the register, and we need it in this instruction. There are two ways to deal with this.
1. Data forwarding
2. Stall
### Data Forwarding (multiplexer choose data)

Pic 1. `slli x5 x5 31` comes after `addi x5 x0 1`
In this case, `x5` has not been writen to register in this cycle. For `slli`, the value read from register is old value. So we need to choose the newest value get from MEM stage.
In the picture, we can see the multiplexer choose (the green point) the data from MEM stage, instead of from WB stage or from register.
### Stall

Pic 2. `addi x5 x5 -1` comes after `lw x5 0 x9`
In this case, `x5` has not get from memory. We get the **memory value** at the next stage. But `addi` need the **memory value** at next stage too. So we need to wait for memory read. Here, we insert a `nop` in the pipline. To do so, we
1. clear the IDEX registers $\implies$ nop instruction.
2. Do not enable registers write in IFID $\implies$ still `addi` in ID stage.
3. Do not enable register write in program counter(PC) $\implies$ same PC.

Pic3. Stall by inserting a `nop`
After inserting a `nop` we can forward the data from WB stage.
## 10/31 Rewrite and Reconstruct C and assembly
### In C
`1 <= n <= 2^31-1`
Because `n` is at most 0x7fff\_ffff, `m` is at most `0x7fff_fffe + 0x7fff_ffff = 0xffff_fffd`.
After calculation, we found that `m` will not cause unsigned-integer overflow. So it is OK to use unsigned integer to calculate and compare.
In C code, signed-integer compare to unsigned-integer will cause **implicit conversion**. (Signed-integer will convert to unsigned-integer.) But `n` is in range `[1, 2^31-1]` which do not contain negative number. Moreover, in assembly code, we do not add any instruction to convert the type. Simply use `bltu`.
```C
int minPatches(int *nums, int numsSize, int n){
unsigned int m = 1, res = 0, i = 0;
while (m <= n)
m += (i < numsSize && nums[i] <= m) ? nums[i++] : (res++, m);
return res;
}
```
### Convert to abi
Note:
* The value `nums[i]` can be reused. No need to compute twice.
* Compute `nums[i]` first to reduce the data dependency. (Use after load)
```
.data
nums: .word 1,5,10
numsSize: .word 3
n: .word 2000000000
.text
main:
la a0, nums # s1 = nums[]; // array
lw a1, numsSize # s2 = numsSize; // length of the array
lw a2, n # s3 = n; // range upper bound
jal ra, minPatches # Function call
# a0 is the return value and also the value we want to print
li a7, 1 # print a0
ecall
minPatches:
addi s0, x0, 1 # s0 = m
addi s1, x0, 0 # s1 = res
addi s2, x0, 0 # s2 = i
bgtu a2, s0, exit # n > m
loop:
slli t0, s2, 2 # i*4
add t0, a0, t0 # nums + i*4
lw t0, (0)t0 # t0 = nums[i]
bge s2, a1, else # i >= numsSize
bgtu t0, s0, else # nums[i] > m
addi s2, s2, 1 # ++i
j done # jump out of if-else statement
else:
addi s1, s1, 1 # ++res
mv t0, s0 # t0 = m
done:
add s0, s0, t0 # m += nums[i] or m
bleu s0, a2, loop # m <= n
exit:
mv a0, s1 # return value = res
ret
```
### Result
The improvement of this new version is
1. eliminating the cost on doing long integer adding ,and
2. less branch.
table1. Less cycles, and lower CPI
| |Old code result|New code result|
|---|---|---|
|cycles|883|417|
|instrs.retired|556|279|
|CPI|1.59|1.49|
|IPC|0.63|0.669|
(instrs.retired: An instruction has been retired once leaving the final stage of the processor.)