Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [chinghongfang](https://github.com/chinghongfang) > ###### tags: `RISC-V` `computer architure 2021` ## Problem [Leetcode 330 Patching Array](https://leetcode.com/problems/patching-array/) Given a sorted integer array $nums$ and an integer $n$, add/patch elements to the array such that any number in the range $[1, n]$ can be formed by the sum of some elements in the array. Find the least number to add. Example: $nums = [1,5,10]$, $n = 20$ Ans: 2 ## Solution ### Expalain In previous Example, add numbers {2, 4}. Then the array becomes {1, 2, 4, 5, 10} $\implies$ 1 = 1 2 = 2 3 = 1+2 4 = 4 5 = 5 6 = 1+5 7 = 2+5 8 = 1+2+5 9 = 4+5 10 = 10 11 = 1+10 12 = 2+10 13 = 1+2+10 14 = 4+10 15 = 5+10 16 = 1+5+10 17 = 2+5+10 18 = 1+2+5+10 19 = 4+5+10 20 = 1+4+5+10 All numbers in range$[1, 20]$ can be composed by elements in the array. ### Concept Consider that all 32-bit unsigned number can be composed of the 32 numbers: 0x0000_0001 0x0000_0002 0x0000_0004 0x0000_0008 0x0000_0010 0x0000_0020 ... 0x8000_0000 So the numbers we patch should not exceed 32. ### Algorithm Suppose that we have a set $S$ that can build the numbers in range $[0, k]$. Then, * ($S$ $\cup$ $x$) can build (range $[0, k]$ $\cup$ range $[x, k+x]$). ($x$ is a natural number). By the constraint, we 1. Start from an empty set $S = \{\}$ which can build range $[0,0]$. 2. Each time we add a number in $nums$ or "the best" number to the set $S$. "The best" number is ($k+1$) because ($k+1$) can expand the range most. And do not cause a gap. $$ range [0, k] \cup range [k+1, k+1+x] = range [0, k+1+x] $$ So, any time we encounter a number (in nums) bigger than "the best" number, we add "the best" number first to fill the gap and expand the range. ## C Code ```c #include<stdio.h> int main() { int nums[] = {1, 5, 10}; // Numbers we have. int len = 3; // The length of the array int target = 20; // The upper bound of the range int ans = 0; // Number of patches we need long now = 0; // The range of number we can form now int idx = 0; // The `nums` index we have consumed while (now < target) { if (idx < len && (long) nums[idx] <= now + 1) { now += nums[idx]; ++idx; } else { now += now + 1; ++ans; } } printf("%d\n", ans); return 0; } ``` :::info Alternatively, we can even shorten as following: ```c int minPatches(int *nums, int numsSize, int n) { unsigned int m = 1, res = 0, i = 0; while (m <= n) m += (i < numsSize && nums[i] <= m) ? nums[i++] : (res++, m); return res; } ``` :notes: jserv ::: :::info Yes, * using unsigned integer for `m` will prevent from 64-bit integer calculation. * And set `m` to `1` will avoid adding continuously. It reduces the execution time. Thanks for this patch. I will rewrite my c code and the assembly code at the bottom of this page. Also, I will reconstruct it to a fucntion and a main. Thanks, ChingHongFang ::: ## abi Code ``` .data nums: .word 1,5,10 len: .word 3 target: .word 20 .text main: la s1, nums # s1 = nums[]; // array lw s2, len # s2 = len; // length of the array lw s3, target # s3 = target; // range upper bound add s4, zero, zero # s4 = ans = 0; // set `ans` to 0 add s5, zero, zero # s5, s6 represent `now`. A 64-bits integer. add s6, zero, zero # Set `now` to zero add s7, zero, zero # s7 = idx = 0; // set `idx` to 0 while: bgt s5, zero, exit # now > target. s5 is more significant. bge s6, s3, exit # Check s6 >= s3 to break the while loop # if statment bge s7, s2, else # idx >= len lw t0, (0)s1 # t0 = nums[idx] addi t0, t0, -1 # t0 -= 1 blt zero, s5, if # nums[idx]-1 < now bgt t0, s6, else # nums[idx]-1 > now if: # now += nums[0]; // In fact, now += nums[idx]; add a0, zero, s5 # do function call place argument add a1, zero, s6 # do function call place argument add a2, zero, zero # do function call place argument lw a3, (0)s1 # do function call place argument jal ra, add_positive_long # function call add s5, zero, a0 # get return value add s6, zero, a1 # get rerurn value addi s1, s1, 4 # nums += 1 addi s7, s7, 1 # ++idx j while # Loop again else: add a0, zero, s5 # do function call place argument add a1, zero, s6 # do function call place argument add a2, zero, s5 # do function call place argument addi a3, s6, 1 # do function call place argument jal ra, add_positive_long # function call add s5, zero, a0 # get return value add s6, zero, a1 # get rerurn value addi s4, s4, 1 # ++ans j while # Loop again exit: add a0, s4, zero # load print value li a7, 1 # print a0 ecall li a7, 10 # exit ecall add_positive_long: # a0a1 + a2a3 add a1, a1, a3 # Add lower bit. a1 = a1 + a3 bge a1, zero, no_carry # a1 becomes negative if overflow occurs. addi a0, a0, 1 # Add 1 to higher bit # make mask: 0x7fff_ffff addi t0, zero, 1 # t0 = 1 slli t0, t0, 31 # t0 = 0x8000_0000 not t0, t0 # t0 = 0x7fff_ffff # clear sign bit and a1, a1, t0 # a1 = a1 & 0x7fff_ffff no_carry: add a0, a0, a2 # a0 = a0 + a2 ret ``` ## Assembly Code ``` 00000000 <main>: 0: 10000497 auipc x9 0x10000 4: 00048493 addi x9 x9 0 8: 10000917 auipc x18 0x10000 c: 00492903 lw x18 4 x18 10: 10000997 auipc x19 0x10000 14: 0009a983 lw x19 0 x19 18: 00000a33 add x20 x0 x0 1c: 00000ab3 add x21 x0 x0 20: 00000b33 add x22 x0 x0 24: 00000bb3 add x23 x0 x0 00000028 <while>: 28: 07504463 blt x0 x21 104 <exit> 2c: 073b5263 bge x22 x19 100 <exit> 30: 032bde63 bge x23 x18 60 <else> 34: 0004a283 lw x5 0 x9 38: fff28293 addi x5 x5 -1 3c: 01504463 blt x0 x21 8 <if> 40: 025b4663 blt x22 x5 44 <else> 00000044 <if>: 44: 01500533 add x10 x0 x21 48: 016005b3 add x11 x0 x22 4c: 00000633 add x12 x0 x0 50: 0004a683 lw x13 0 x9 54: 050000ef jal x1 80 <add_positive_long> 58: 00a00ab3 add x21 x0 x10 5c: 00b00b33 add x22 x0 x11 60: 00448493 addi x9 x9 4 64: 001b8b93 addi x23 x23 1 68: fc1ff06f jal x0 -64 <while> 0000006c <else>: 6c: 01500533 add x10 x0 x21 70: 016005b3 add x11 x0 x22 74: 01500633 add x12 x0 x21 78: 001b0693 addi x13 x22 1 7c: 028000ef jal x1 40 <add_positive_long> 80: 00a00ab3 add x21 x0 x10 84: 00b00b33 add x22 x0 x11 88: 001a0a13 addi x20 x20 1 8c: f9dff06f jal x0 -100 <while> 00000090 <exit>: 90: 000a0533 add x10 x20 x0 94: 00100893 addi x17 x0 1 98: 00000073 ecall 9c: 00a00893 addi x17 x0 10 a0: 00000073 ecall 000000a4 <add_positive_long>: a4: 00d585b3 add x11 x11 x13 a8: 0005dc63 bge x11 x0 24 <no_carry> ac: 00150513 addi x10 x10 1 b0: 00100293 addi x5 x0 1 b4: 01f29293 slli x5 x5 31 b8: fff2c293 xori x5 x5 -1 bc: 0055f5b3 and x11 x11 x5 000000c0 <no_carry>: c0: 00c50533 add x10 x10 x12 c4: 00008067 jalr x0 x1 0 ``` ## Pipeline stage explain IF: Instruction fetch. Get the instruction from memory according to the program counter(PC). ID: Instruction Decode. 1. Get the register value that the instruction need. 2. Control most of the multiplexer according to opcode. 3. Deal with imm field(shift left 2 bits). EX: Exceute. Where Arithmetic Logic Unit(ALU) locate. Compute according to opcode and get the answer. Include (+ - \* / << >> & | ...). MEM: Memory. Write data to memory or Read data from memory. WB: Write Back. Write the result back to register. ## Memory At first, we load data to registers. The data is store in the memory start from address 0x1000_0000. As a result, the assembler use `auipc` and `program counter` to set the higher 20 bits to what we want. Then use `addi` to set the lower 12 bits to get the memory address. Or just load word dirctly by the imcomplete address. ``` .data nums: .word 1,5,10 # this array locate 0x1000_0000 len: .word 3 # this integer locate 0x1000_000C target: .word 2000000000 # this integer locate 0x1000_0010 ``` ($\because$ `nums` has 3 integers. It takes 4\*3 bytes storing in memory. $\therefore$ the address of `len` is 0x1000_000C) To get `nums`'s address: ``` # la s1, nums 0: 10000497 auipc x9 0x10000 # x9 == 0x1000_0000 4: 00048493 addi x9 x9 0 ``` To load `len`: ``` 8: 10000917 auipc x18 0x10000 # x18 == 0x1000_0008 c: 00492903 lw x18 4 x18 ``` To load `target`: ``` 10: 10000997 auipc x19 0x10000 # x19 == 0x1000_0010 14: 0009a983 lw x19 0 x19 ``` ---- ## Pipeline data hazard The pipeline has 5 stages. They are instruction fetch(IF), instruction decode(ID), execute(EX), memory(MEM), write back(WB). Each takes one cycle to run. The data hazard happens when we change a register value in last instruction. But the value has not yet writen back to the register, and we need it in this instruction. There are two ways to deal with this. 1. Data forwarding 2. Stall ### Data Forwarding (multiplexer choose data) ![](https://i.imgur.com/ekg5KU2.png) Pic 1. `slli x5 x5 31` comes after `addi x5 x0 1` In this case, `x5` has not been writen to register in this cycle. For `slli`, the value read from register is old value. So we need to choose the newest value get from MEM stage. In the picture, we can see the multiplexer choose (the green point) the data from MEM stage, instead of from WB stage or from register. ### Stall ![](https://i.imgur.com/ntVKerV.png) Pic 2. `addi x5 x5 -1` comes after `lw x5 0 x9` In this case, `x5` has not get from memory. We get the **memory value** at the next stage. But `addi` need the **memory value** at next stage too. So we need to wait for memory read. Here, we insert a `nop` in the pipline. To do so, we 1. clear the IDEX registers $\implies$ nop instruction. 2. Do not enable registers write in IFID $\implies$ still `addi` in ID stage. 3. Do not enable register write in program counter(PC) $\implies$ same PC. ![](https://i.imgur.com/RRBNDXv.png) Pic3. Stall by inserting a `nop` After inserting a `nop` we can forward the data from WB stage. ## 10/31 Rewrite and Reconstruct C and assembly ### In C `1 <= n <= 2^31-1` Because `n` is at most 0x7fff\_ffff, `m` is at most `0x7fff_fffe + 0x7fff_ffff = 0xffff_fffd`. After calculation, we found that `m` will not cause unsigned-integer overflow. So it is OK to use unsigned integer to calculate and compare. In C code, signed-integer compare to unsigned-integer will cause **implicit conversion**. (Signed-integer will convert to unsigned-integer.) But `n` is in range `[1, 2^31-1]` which do not contain negative number. Moreover, in assembly code, we do not add any instruction to convert the type. Simply use `bltu`. ```C int minPatches(int *nums, int numsSize, int n){ unsigned int m = 1, res = 0, i = 0; while (m <= n) m += (i < numsSize && nums[i] <= m) ? nums[i++] : (res++, m); return res; } ``` ### Convert to abi Note: * The value `nums[i]` can be reused. No need to compute twice. * Compute `nums[i]` first to reduce the data dependency. (Use after load) ``` .data nums: .word 1,5,10 numsSize: .word 3 n: .word 2000000000 .text main: la a0, nums # s1 = nums[]; // array lw a1, numsSize # s2 = numsSize; // length of the array lw a2, n # s3 = n; // range upper bound jal ra, minPatches # Function call # a0 is the return value and also the value we want to print li a7, 1 # print a0 ecall minPatches: addi s0, x0, 1 # s0 = m addi s1, x0, 0 # s1 = res addi s2, x0, 0 # s2 = i bgtu a2, s0, exit # n > m loop: slli t0, s2, 2 # i*4 add t0, a0, t0 # nums + i*4 lw t0, (0)t0 # t0 = nums[i] bge s2, a1, else # i >= numsSize bgtu t0, s0, else # nums[i] > m addi s2, s2, 1 # ++i j done # jump out of if-else statement else: addi s1, s1, 1 # ++res mv t0, s0 # t0 = m done: add s0, s0, t0 # m += nums[i] or m bleu s0, a2, loop # m <= n exit: mv a0, s1 # return value = res ret ``` ### Result The improvement of this new version is 1. eliminating the cost on doing long integer adding ,and 2. less branch. table1. Less cycles, and lower CPI | |Old code result|New code result| |---|---|---| |cycles|883|417| |instrs.retired|556|279| |CPI|1.59|1.49| |IPC|0.63|0.669| (instrs.retired: An instruction has been retired once leaving the final stage of the processor.)