# Lab1: RV32I Simulator ###### tags: `Course` `RISC-V` `Computer Architecture 2022` ## Problem * Given an array nums. We define a running sum of an array as runningSum[i] = sum(nums[0]…nums[i]). Return the running sum of nums More detail description can be founded at [LeetCode 1480. Running Sum of 1d Array](https://leetcode.com/problems/running-sum-of-1d-array/) ## Solution * Pass the first element to result 0 as initial. Then add the A[i-1] element with num[i], then pass the value to the result. Finally return the result arr ## C Code ```c= #include <stdlib.h> #include <stdio.h> int* runningSum(int* nums, int numsSize){ int* result = (int*)malloc(sizeof(int)*numsSize); for(int i=0; i<numsSize; i++){ if(i==0) result[0] = nums[0]; else result[i] = result[i-1] + nums[i]; } return result; } int main(int argc, char *argv[]){ int nums1[5] = {1,2,3,4,5}; int nums2[6] = {1,3,5,7,9,11}; int nums3[7] = {0,2,4,6,8,10,12}; int* result1 = runningSum(nums1, 5); int* result2 = runningSum(nums2, 6); int* result3 = runningSum(nums3, 7); for(int j=0; j<5; j++){ printf("%d ", result1[j]); } printf("\n"); for(int j=0; j<6; j++){ printf("%d ", result2[j]); } printf("\n"); for(int j=0; j<7; j++){ printf("%d ", result3[j]); } printf("\n"); } ``` ## Assembly Code :::spoiler ```RISCV= # The old one .data nums1: .word 1,2,3,4,5 nums2: .word 1,3,5,7,9,11 nums3: .word 0,2,4,6,8,10,12 res1: .word 0,0,0,0,0, res2: .word 0,0,0,0,0,0 res3: .word 0,0,0,0,0,0,0 space: .string " " nl: .string "\n" .text main: # t2: nums head, t3: result head, t4: numsSize la t2, nums1 # load arr base to t2 la t3, res1 # load res base to t3 li t4, 5 # load the numsize to t4 jal ra, runningSum jal ra, print la t2, nums2 # load arr base to t2 la t3, res2 # load res base to t3 li t4, 6 # load the numsize to t4 jal ra, runningSum jal ra, print la t2, nums3 # load arr base to t2 la t3, res3 # load res base to t3 li t4, 7 # load the numsize to t4 jal ra, runningSum jal ra, print j exit runningSum: # t5: i, t6: j # initialize li t5, 0 # load i with 0 li t6, 0 # load j with 0 j end.L1 # jump to L1 end.L1: bge t5, t4, end.L6 # if i >= numsSize then branch to L6 j end.L2 # jump to L2 end.L2: bnez t5, end.L4 # if i != 0 then branch to L4 j end.L3 # jump to L3 end.L3: lw a3, 0(t2) # load nums[0] to a3 sw a3, 0(t3) # save num[0] to result[0] j end.L5 # jump to L5 end.L4: addi a3, t3, 0 # load result's address to a3 addi a4, t5, 0 # load i to a4 slli a6, a4, 2 # i*4 (cause of byte size) add a4, a3, a6 # add result's address with i*4 # to get result[i]'s address lw a3, -4(a4) # load result[i-1] to a3 addi a5, t2, 0 # load nums's address add a5, a5, a6 # add nums's address with i*4 # to get nums[i]'s address lw a5, 0(a5) # load nums[i] to a5 add a3, a3, a5 # add result[i-1] + nums[i] sw a3, 0(a4) # save to result[i] j end.L5 # jump to L5 end.L5: addi t5, t5, 1 # i = i + 1 j end.L1 # jump to L1 end.L6: jr ra # retuen to main print: slli a4, t6, 2 add a5, t3, a4 lw a0, 0(a5) # load result out li a7, 1 # pint a0 ecall la a0, space #load space li a7, 4 #print string ecall addi t6, t6, 1 # j++ blt t6, t5, print # loop la a0, nl #load next line li a7, 4 #print string ecall jr ra exit: li a7, 10 # exit ecall ``` ::: ```RISCV= .data nums1: .word 1,2,3,4,5 nums2: .word 1,3,5,7,9,11 nums3: .word 0,2,4,6,8,10,12 res1: .word 0,0,0,0,0, res2: .word 0,0,0,0,0,0 res3: .word 0,0,0,0,0,0,0 space: .string " " nl: .string "\n" .text main: # t2: nums head, t3: result head, t4: numsSize la t2, nums1 # load arr base to t2 la t3, res1 # load res base to t3 li t4, 5 # load the numsize to t4 jal ra, runningSum la t2, nums2 # load arr base to t2 la t3, res2 # load res base to t3 li t4, 6 # load the numsize to t4 jal ra, runningSum la t2, nums3 # load arr base to t2 la t3, res3 # load res base to t3 li t4, 7 # load the numsize to t4 jal ra, runningSum j exit runningSum: # t5: i, t6: j # initialize li t5, 1 # load i with 1 lw a3, 0(t2) # load nums[0] to a3 sw a3, 0(t3) # save num[0] to result[0] end.L1: slli a0, t5, 2 # i*4 (cause of byte size) add a1, t3, a0 # add result's address with i*4 to get result[i]'s address lw a3, -4(a1) # load result[i-1] to a3 add a2, t2, a0 # add nums's address with i*4 to get nums[i]'s address lw a2, 0(a2) # load nums[i] to a5 add a3, a3, a2 # add result[i-1] + nums[i] sw a3, 0(a1) # save to result[i] addi t5, t5, 1 # i = i + 1 blt t5, t4, end.L1 # if i >= numsSize then branch to L3 li t5, 0 print: slli a1, t5, 2 add a1, t3, a1 lw a0, 0(a1) # load result out li a7, 1 # pint a0 ecall la a0, space #load space li a7, 4 #print string ecall addi t5, t5, 1 # j++ blt t5, t4, print # loop la a0, nl #load next line li a7, 4 #print string ecall jr ra exit: li a7, 10 # exit ecall ``` :::warning :warning: Can you use fewer instructions? :notes: jserv ::: :::info **Update at 10/6** * The old one I puted it in the spoiler `>` * Modify the numbers of Label from 6 into 1 * Modify $i$ begin form 1 not 0, because I put lw,sw of nums [0] right after li * Modify & Replace $bgt\;t5,\;t4,\;end.L1$ into $blt\;t5,\;t4,\;end.L1$. Thus jr can put in L1 * Delete the reg $t6$ * Delete the code $bnez\;t5,\;end.L2$ * Delete redundant code, which load data to another reg. modify at part $runningSum$ * Only use $a_1$ at $Print$ **Total numbers of lines have decrease from 86 into 62** ::: ## Pipeline stage explain * **IF : Instruction Fetch** * The instructioins is fetched at this stage from $Instr. Memory$, and the pointer counter's value is also given at this stage. ![](https://i.imgur.com/GEcUYjp.png =x300) * **ID : Instruction Decode** * The instruction is decoded into OPcode and get the value from registers. ![](https://i.imgur.com/TyPdMFL.png =x300) * **EXE : Execute** * Use the OPcode at previous stage to determine which arithmetic operator should be used to the value which is also passed by the previous stage. ![](https://i.imgur.com/Z6g5Xbg.png =x300) * **MEM : Memory access** * The result of the calculation will be buffered to the Data Memory. * **WB : Write Back** * The data and the register address will be write back to Register File ![](https://i.imgur.com/B3s5hJz.png =x300) ## Result & Memory & Register ### Result * **The result array should be** * nums1 > 1 1 + 2 = 3 1 + 2 + 3 = 6 1 + 2 + 3 + 4 = 10 1 + 2 + 3 + 4 + 5 = 15 * nums2 > 1 1 + 3 = 4 1 + 3 + 5 = 9 1 + 3 + 5 + 7 = 16 1 + 3 + 5 + 7 + 9 = 25 1 + 3 + 5 + 7 + 9 + 11 = 36 * nums3 > 0 0 + 2 = 0 0 + 2 + 4 = 6 0 + 2 + 4 + 6 = 12 0 + 2 + 4 + 6 + 8 = 20 0 + 2 + 4 + 6 + 8 + 10 = 30 0 + 2 + 4 + 6 + 8 + 10 + 12 = 42 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | | - | - | - | - | - | - | - | | 1 | 3 | 6 | 10 | 15 ||| | 1 | 4 | 9 | 16 | 25 | 36 || | 0 | 2 | 6 | 12 | 20 | 30 | 42 | * **Result $Ripes$ print** ![](https://i.imgur.com/iOypb9z.png) ### Memory & Register * The $nums$ is store at $0x1000\_0000$; $res$ is store at $0x1000\_0018$; $numsize$ is store at $0x1000\_0014$ with size $5$; ![](https://i.imgur.com/SYchBpM.png =x100)![](https://i.imgur.com/1A9Ue90.png =x100) * The step that pass nums[0] to res[0] ![](https://i.imgur.com/lM1pUx9.png )![](https://i.imgur.com/b0ipqF0.png) * The reg when load res's address & i's value ![](https://i.imgur.com/ukTlnO4.png =x70)![](https://i.imgur.com/HKDCLtg.png =x100) * The reg when adding result ![](https://i.imgur.com/vaBtzKA.png =x50)![](https://i.imgur.com/7W4pbyK.png =x100) * The reg when print the result out ![](https://i.imgur.com/dGlbNqY.png =x150)![](https://i.imgur.com/273eHSM.png) ## Pipeline Feature ### Data Hazard * **The reason why it occur:** Because we only have 5 stage pipeline. When the current instruction need the result of previous instruction, the CPU need to stall until the result WriteBack. Then CPU will waste lots of time to waiting for the result. Thus, there are two ways to solve this kind of data hazard. * **solution** * Forwarding By forwarding, we can take the result from the previous instruction immediately. As the picture below, the next instruction is $add\;x12\;x12\;x13$ while the previous one is $addi\;x12\;x7\;0$. Thus we pass the value from EX/MEM Reg to EXE stage as the yellow line shows. So we can run the pipeline continuosly. ![](https://i.imgur.com/XqbiFow.png) * Stall (load use) However, there are still some cases that we need to stall. Just like the below case, we need to $lw\;x10\;0(x7)$. Immediately, the next one need $sw\;x10\;0(x28)$ to save word, while the data hasn't be readed from the Memory. Therefore, we stall the CPU for one cycle, waiting for the data be prepared. ![](https://i.imgur.com/cnXQAet.png) ### Control Hazard * **The reasin why it occur:** Control hazard occur when the branch is taken. Hence, we need to Flush the instructions that we don't want them to be execute. As the result, we will waste 2 cycle because branch is calculated at EXE stage. Like the picture below, $bne\;x30\;x0\;L4$ is taken. Then, the two of the instructions after branch is Flushed. ![](https://i.imgur.com/FuTz3bo.png) * **solution** * Put branch unit at ID stage Move the place where branch is judged forward one. That is, Branch decide at ID stage, not at EXE stage. Finally, it will only cost 1 cycle for branch taken. ![](https://i.imgur.com/6xmxqSp.png) * Use predict scheme Usually, we will use branch when we are dealing with loop. Consequently, what we can do is to "guess" the next branch is taken or not taken. We can use "static" or "dynamic" prediction to solve the Control Hazard. ![](https://i.imgur.com/7gOrz6W.png) ![](https://i.imgur.com/Ysp4KRZ.png) ## Reference * [Stall and Flush](https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec13.pdf) * [What's the role of EX stage for branching in Pipelined MIPS w Forwarding?](https://stackoverflow.com/questions/44866407/whats-the-role-of-ex-stage-for-branching-in-pipelined-mips-w-forwarding) * [淺談分支預測與 Hazards 議題](https://ithelp.ithome.com.tw/m/articles/10265705) * [在Hazard尋求解法是否搞錯了什麼](https://ithelp.ithome.com.tw/articles/10268810)