Rework Homework 1

Rework Homework 1 === ###### tags: `RISC-V` --- ## Ripes and Hazard Ripes is a visual computer architecture simulator and assembly code editor built for the RISC-V instruction set architecture. It has following features: - ISA:RV32IMC - Processor:5-stage processor - And its architecture is:. ![](https://i.imgur.com/i8oo7ZF.png) Based on the architecture, we will face some hazards with RIPES. :::spoiler data hazard First see the below code segment: ```assembly= add x1,x2,x3 sub x2,x1,x3 ``` Here, suppose the processor has no pipeline registers. Because x1 of first instruction is write back at fifth cycle, but x1 of second instruction has must be fetched in MEM stage, thus, the processor should stall 2 cycle let the sub instruction get the wright value of x1. But RIPES is a 5-stage processor, and the x1 value of first instruction was be calculated in EX stage. Hence with pipelined register, the processor don't need tp stall and can fetch the correct value. Now suppose another situation ```assembly= lw x1,0(x2) add x3,x1,x4 ``` Same, x1 has data dependency, but result of lw is calculated in MEM. This time even though the processor has pipeline register, it must stall 1 cycle so that add instruction can get the correct value of x3, this is called load use hazard. ![](https://i.imgur.com/72D5k6w.png) The way solve data hazard are use different registers or move at least 1 instruction between *lw* and *add*(of course it shouldn't affect the result) ::: :::spoiler control hazard Branch instruction will be calculted whether the processor jump at ALU stage, and RIPES has no branch prediction scheme, instead, it will schedule instruction by the order. So let's condiser ther following situation: ```Assembly= bne x0,x0,Jump add x1,x2,x3 add x2,x3,x4 Jump: sub x3,x4,x5 ``` The code has no meaning, and RIPES will schedule instruction by order. In this case, 2 *add* instructions always execute, so there is no stall occur. And in the following case: ```Assembly= beq x0,x0,Jump add x1,x2,x3 add x2,x3,x4 Jump: sub x3,x4,x5 ``` ![](https://i.imgur.com/iKmrojF.png) ![](https://i.imgur.com/J9ctWMr.png) In this case 2 *add* instructions shouldn't execute ever, but RIPES will always schedulte them right after *beq*. And when the result of comparison is out, processor know that 2 *add* shouldn't schedule, so now processor has to **flush** it, it waste 2 cycles. The way solve control hazard is more complicated. Implement branch prediction scheme, or modify the structure of the code. ::: --- ## Load use hazard In the following 3 part, I will show some work to eliminate load use hazard of different cases. ### Wu's homework 1 - [Wu's homework 1](https://hackmd.io/Nd8VNUnZSQWVpAKS-g0oag) - [2089. Find Target Indices After Sorting Array](https://leetcode.com/problems/find-target-indices-after-sorting-array/description/) I rewrite the code of mine because I write the code with pretty bad algorithm (*bubble sort*) at that time. ::: spoiler C program My first method to solve the problem is following the question description, that is, sorted the array and then find the index of target. It's quite simple but even I sorted the array by Quick Sort, its time complexity is **O(nlongn)**. So another way is not using sort, simply count the number that less than target, then find out how many target the array has. Based on this way, the time complexity is **O(n)**. Below is the C code of this question: ```C= int* targetIndices(int* nums, int numsSize, int target, int* returnSize){ int i=0,count=0,index=0,j=0; int *ans; while(i<numsSize){ if(nums[i]<target) index++; //index count the number that less than target, which means, the staring index of answer else if(nums[i]==target) count++; i++; } ans = (int*)malloc(count*sizeof(int)); for(int i=0;i<count;i++) ans[j++] = index++; *returnSize = count; return ans; } ``` ::: :::spoiler RISC-V program, performance and enhancement ```RISC-V= .data arr: .word 5,4,3,2,1,5,4,3,2,1,5,4,3,2,1,5,4,3,2,1,5,4,3,2,1,5,4,3,2,1 arrSize: .word 30 target: .word 1 Lbracket: .string "[" Rbracket: .string "]" Blank: .string " " .text main: la a0 Lbracket li a7,4 ecall la t0,arrSize #load base address of arrSize1 lw t0,0(t0) #load arrSize1 to t0 la t1,arr #load base address of arr1 (t1 also =i) la t3,target #load base address of target lw t3,0(t3) #load target to t2 li t4,0 #index=0 li t5,0 #count=0 li t6,0 #i=0 li s0,0 Loop1: lw t2,0(t1) #load arr1[i] to t2 bge t2, t3, if1 addi t4,t4,1 #index++ j iPP if1: bne t2,t3,iPP addi t5,t5,1 #count++ iPP: addi t6,t6,1 addi t1,t1,4 blt t6,t0,Loop1 Print: add a0,t4,x0 li a7,1 ecall addi s0,s0,1 addi t4,t4,1 la a0,Blank li a7,4 ecall bne s0,t5,Print la a0, Rbracket #a0 is return value register li a7,4 # print string by specifiy the value of a7 register t0 4 ecall nop ``` ![](https://i.imgur.com/tX8s7Sw.png) In line 33 and 34, there is a load use hazard occur, and because there are in *Loop1*, the processor will stall many times due to it. We can move *li s0,0* between line 33 and 34, but the cost is that *li s0,0* will redundantly execute *arrSize-1* times, let check the performance. ![](https://i.imgur.com/LNbWKpW.png) Well, the difference can calculate easily. As we known, load use hazard generates 1 stall, and in this code, the hazard will occur *arrSize* times, that is, 30 stalls. As we modified the code to delete the hazard, 30 cycle has been eliminated, but the processor also need to execute *numsSize-1* additional instructions. So whatever the *arrSize* and whatever the value in array is, the code will decrease **1** cycle after modification. As we can see in the below example, when we want to elilminate load use hazard and the case the hazard is about the branch instuction, we must pay attention on that whether the code execute redundantly instruction. ::: ### Zheng's homework 1 - [Zhen's homework 1](https://hackmd.io/@Fo7UsdePRsKPVV4CPYGbpA/rk91z0LWi) - [283. Move Zeroes](https://leetcode.com/problems/move-zeroes/) I chose Zhen's work to discuss how to reduce cycles(or stalls) cause in his work he doesn't mention how to solve tha hazard. :::spoiler C program Here I list critical part of the function ```C= if(nums[i] != 0){ nums[next_nonzero_index] = nums[i]; if(next_nonzero_index != i) nums[i] = 0; next_nonzero_index++; } ``` ::: :::spoiler RISC-V program, performance and enhancement The RISC-V code is as below: ```assembly= .data nums1: .word 1,0,0,2,5 nums2: .word 1,9,0,6,7 nums3: .word 0,0,0,0,4 .text main: la s0 nums1 # load nums base address to s0 addi s1 x0 5 #nums_size = 5 #call function moveZeroes add a0 s0 x0 # a0 which stands for address of nums array is the first argument add a1 s1 x0 # a1 which stands for array size of nums is the second argument jal ra moveZeroes #jump to moveZeroes function #PrintArray jal ra printArray li a7 10 ecall moveZeroes: addi sp sp -12 sw ra 0(sp) sw s0 4(sp) sw s1 8(sp) li s1 0 #next non zero index = 0 addi s2 x0 0 # i = 0 loop: bge s2 a1 exit #i >= array_size exit slli t0 s2 2 #i * 4 add t0 a0 t0 #array + i*4 lw t1 0(t0) #t1 = array[i] beq t1 x0 next_iter slli t2 s1 2 #next_nonzero_index * 4 add t2 t2 a0 #array + next_nonzero_index * 4 sw t1 0(t2) #array[next_nonzero_index] = array[i] beq s1 s2 next_iter_addIndex #if(next_nonzero_index != i) sw x0 0(t0) #store 0 to array[i] next_iter_addIndex: addi s1 s1 1 #next_nonzero_index++ next_iter: addi s2 s2 1 #i++ j loop exit: lw ra 0(sp) lw s0 4(sp) lw s1 8(sp) addi sp sp 12 jr ra printArray: li a7 1 addi sp sp -8 sw ra 0(sp) sw s0 4(sp) add s0 a0 x0 # s1 = pointer to array add t0 x0 x0 # t0 = i in for loop printloop: bge t0 a1 finish_print_loop slli t1 t0 2 # t1 = i * 4 add t1 t1 s0 # t1 = t1 + array address lw a0 0(t1) # a0 = element in nums[i] li a7 1 ecall addi t0 t0 1 j printloop finish_print_loop: lw ra 0(sp) lw s0 4(sp) addi sp sp 8 jr ra ``` In his work, load use data hazard occur at line 33 and 34, I list the part and the following segment below: ```assembly= loop: bge s2 a1 exit slli t0 s2 2 add t0 a0 t0 lw t1 0(t0) beq t1 x0 next_iter slli t2 s1 2 add t2 t2 a0 sw t1 0(t2) beq s1 s2 next_iter_addIndex sw x0 0(t0) next_iter_addIndex: addi s1 s1 1 next_iter: addi s2 s2 1 j loop ``` According to the below RISC-V code, if t1 doesn't equal to 0, then value of t1, t2 and MEM[t2] will bechanged. But the question is, even if t1 equal to 0, the value of t1 and t2 won't be changed! So, we can calculate the new value of t1, t2 and MEM[t2] between line 5 and 6 and it will not affect output! ```Assembly= lw t1 0(t0) slli t2 s1 2 beq t1 x0 next_iter ``` Because our poupose is simply solve the hazard, so we only move *slli* between *lw* and *beq*. Let's check the performance! Before modified: ![](https://i.imgur.com/UK3YUoJ.png) After modified: ![](https://i.imgur.com/3VOIDWK.png) Due to move 1 instruction before the loop, there are 2 additional instructions (because the number of 0 equal to 2), but total cycle decrease, and CPI decrease of course! What if the array size is much bigger? So I let the array size be 35, and the array is: ``` 1,2,0,0,5,7,9,4,1,1,0,0,6,4,8,9,0,5,0,4,0,1,0,2,3,6,5,8,0,1,4,5,0,6,0 ``` Before modified: ![](https://i.imgur.com/Whe4rFs.png) After modified: ![](https://i.imgur.com/u83d1o6.png) The gap of cycles and CPI are bigger! Finally, I want to know that is the modified code still be better if the array are all 0? because when there are more 0 in the array, the more instruction the modified code need to run. Before modified: ![](https://i.imgur.com/eqkiLDQ.png) After modified: ![](https://i.imgur.com/Vl7JFVM.png) Same, the difference can be calculated like above, in conclusion, the enhancement depends on how many non-zero element in the array, if the array has 100 non-zero elements, then the cycle will decrease 50 cycle. ::: ### Handwritten Leetcode Medium RISC-V code - [2221. Find Triangular Sum of an Array](https://leetcode.com/problems/find-triangular-sum-of-an-array/description/) :::spoiler RISC-V program, performance, and enhancement The *numsSize* indicated that the first line of triangle has *numsSize* number. So I must calculate the number for *numsSize-1* outer loop, and each outer loop I have to calculate *numsSize-1* times. For eaxmple, if the input array is: ``` [1,2,3,4,5] ``` then first outer loop I have to calculte ``` [1+2, 2+3, 3+4, 4+5] 4 times ``` And seconde loop I have to calculate ``` [3+5, 5+7, 7+9] 3 times ``` Until I calculte 48%10, that is the answer.But due to the length of the test case is prerry big(at most 1000), I have to module 10 right after the sum calculated, or it would overflow. ``` [(1+2)%10, (2+3)%10, (3+4)%10, (4+5)%10] ``` So below is my hadnwritten RISC-V implementation: ```Assembly= .data: nums1: .word 1,2,3,4,5 #nums2: .word 1,2,3,4,5,6,7,8,9,10 #nums3: .word 1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10 numsSize1: .word 5 #numsSize2: .word 10 #numsSize3: .word 20 .text: main: li t3,1 # is used in j<"1" li s3,10 # is used in num[j]%"10" la t2,numsSize1 # load base address of numsSize1 to t2 lw t2,0(t2) # load numsSize1 to t2, so t2=numsSize addi t5,t2,-1 # i=numsSize-1 Outer: li t0,0 # j la s0,nums1 # s0 = base address of nums lw t1,0(s0) # load nums1[0] to t1 add s1,s0,x0 # stored address Inner: lw t1,0(s0) addi s2,s0,4 # lw t4,0(s2) add t6,t1,t4 rem t6,t6,s3 addi t0,t0,1 # j++ sw t6,0(s1) addi s1,s1,4 addi s0,s0,4 blt t0,t5,Inner addi t5,t5,-1 li t0,0 la s0,nums1 add s1,s0,x0 bge t5,t3,Outer rem t6,t6,s3 sw t6,0(s1) ``` For the purpose of save the memory, I calculate the sum and then write it to the input memory ``` [1,2,3,4,5] -> [(1+2)%10, (2+3)%10, (3+4)%10, (4+5)%10, 5 ] 1 2 3 4 5 1 2 3 4 5(This address we dont use anymore after the finish of first loop) So the memory usage equal the input size ``` Now we can see that there is a load use hazard occur between line 29 and 30. And it's another situation different from the above, that is, the this hazard has no relation about branch and also in the loop block, so if we can find a instruction can move between this 2 instrction, then we can eliminate many cycles. Fortunately we can simply move *addi t0,t0,1 # j++* to eliminate the hazard, so let's see what happend. Before modification ![](https://i.imgur.com/Ht5PKUp.png) After modification ![](https://i.imgur.com/xllehm8.png) The numbers of cycle decrease 10, and 10 is equal to the times of addition. And the instruction count not even increase. If the input size is 1000, then the cycle gap is (1+999)*999/2, that is HUGE! So this is the best situation of solving the load use hazard. ::: --- ## Control hazard Next I will try to decrease the stall generated by control hazard. ### Huang's Homework 1 - [Huang's homework 1](https://hackmd.io/@ZLQisilvQvOh2DclLmk1bg/SyjKI7sZi) - [1720. Decode XORed Array](https://leetcode.com/problems/decode-xored-array/) I choose *Huang's homework 1* because I can enhance his code through reducing the **control hazard**. :::spoiler C program ```C= int* decode(int* encoded, int encodedSize, int first, int* returnSize){ int* result = (int*)malloc(sizeof(int)*(encodedSize+1)); for(int i=0;i<=encodedSize;i++){ if(i==0){ result[i]=first; } else{ result[i]=result[i-1] ^ encoded[i-1]; } } *returnSize = encodedSize+1; return result; } ``` ::: :::spoiler RISC-V program, performance and enhancement ```Assembly= .data num1: .word 1,2,3 numSize1: .word 3 first1: .word 1 num2: .word 6,2,7,3 numSize2: .word 4 first2: .word 4 num3: .word 5,6,7,8,9 numSize3: .word 5 first3: .word 2 nextline: .string "\n" space: .string " " .text main: la a1,num1 #get num1 address la a2,numSize1 #get numSize1 address la a3,first1 #get first1 address lw a2,0(a2) #a2 = numSize1 = 3 lw a3,0(a3) #a3 = first1 = 1 li a4,0 #int i = 0 li a5,0 #int count = 0 li a6,3 #input set =3 jal ra,decode.L1 #goto decode.L1 la a1,num2 #get num2 address la a2,numSize2 #get numSize2 address la a3,first2 #get first2 address lw a2,0(a2) #a2 = numSize2 = 4 lw a3,0(a3) #a3 = first2 = 4 jal ra,decode.L1 #goto decode.L1 la a1,num3 #get num3 address la a2,numSize3 #get numSize3 address la a3,first3 #get first3 address lw a2,0(a2) #a2 = numSize3 = 5 lw a3,0(a3) #a3 = first3 = 2 decode.L1: add t0,a3,x0 sw t0,0(a4) #result[0] = first decode.L2: #loop beq a4,a2,index_zero #if i=numSize then goto index_zero slli t0,a4,2 #a4*4 add t2,t0,a1 #get num1[i] lw t1,0(t2) #get num1[i] lw t2,0(t0) #get result[i] xor t1,t1,t2 #result[i] ^ num[i] addi t2,t0,4 #a4*4,k = i + 1 sw t1,0(t2) #result[k] = result[i] ^ num[i] addi a4,a4,1 #i++ j decode.L2 #goto decode.L2 index_zero: addi a4,x0,0 #i=0 print: slli t0,a4,2 #a4*4 lw a0,0(t0) #get result[i] li a7,1 #systemcall print ecall #execute addi a4,a4,1 #i++ la a0,space #print " " li a7,4 ecall ble a4,a2,print #if i<numSize then goto print la a0,nextline #nextline li a7,4 ecall addi a5,a5,1 #count++ beq a5,a6,exit #excute all dataset li a4,0 #int i = 0 jr ra exit: li a7,10 ecall ``` In line 39: ```Assembly= beq a4,a2,index_zero #if i=numSize then goto index_zero ``` It check whether a4 equal to the size of the input array, where a4 is a counter start from 0. We can see that he put this instruction in the begining of the *decode.L2* block. But as LeetCode said, the input size is at least 2, so it is no need to put the branch instruction here, we can execute it later and will not affect the result. We can put the instruction before *j decode.L2* as below: ```Assembly= decode.L2: #loop slli t0,a4,2 #a4*4 add t2,t0,a1 #get num1[i] lw t1,0(t2) #get num1[i] lw t2,0(t0) #get result[i] xor t1,t1,t2 #result[i] ^ num[i] addi t2,t0,4 #a4*4,k = i + 1 sw t1,0(t2) #result[k] = result[i] ^ num[i] addi a4,a4,1 #i++ beq a4,a2,index_zero #if i=numSize then goto index_zero j decode.L2 #goto decode.L2 index_zero: addi a4,x0,0 #i=0 ``` So what's the difference of the two version? Let's recall the *Control Hazard in RIPES*, and check the examples: ```Assembly= beq x0,x0,func add x1,x2,x3 add x2,x3,x4 func: sub x1,x2,x3 ``` Cause RIPES will fetch instuctions by order, 2 *add* instructions will be fetched after beq. When branch occur, 2 add instuctions will be flushed. Let's consider another case: ```Assembly= beq x0,x0,func add x1,x2,x3 funcc: sub x1,x2,x3 ``` In this case, *add* and *sub* will be fetched after beq. When branch occur, *add* will be flushed, while *sub* will not because *sub* is the branch target, so if the fetched instructin is the branch target, it can reduce 1 stall! Before modified: ![](https://i.imgur.com/xGp0p3H.png) After modified: ![](https://i.imgur.com/TV80qFQ.png) With 3 short input array, instructions counts and cycles both decrease. ::: ## Conclusion - Solving load use hazard: When we want to eliminate this type of hazard, we must focus on the result can't be changed and whether the additional instuction counts is more than reduced cycle counts. - Solving control hazard: Solving control hazard is more complicated, it usually needs to change the structure of program, and if the Simulator (of processor) support branch prediction, then we can more easily eliminate control hazard based on the prediction scheme.

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.