owned this note
owned this note
Published
Linked with GitHub
Rework Homework 1
===
###### tags: `RISC-V`
---
## Ripes and Hazard
Ripes is a visual computer architecture simulator and assembly code editor built for the RISC-V instruction set architecture.
It has following features:
- ISA:RV32IMC
- Processor:5-stage processor
- And its architecture is:.

Based on the architecture, we will face some hazards with RIPES.
:::spoiler data hazard
First see the below code segment:
```assembly=
add x1,x2,x3
sub x2,x1,x3
```
Here, suppose the processor has no pipeline registers. Because x1 of first instruction is write back at fifth cycle, but x1 of second instruction has must be fetched in MEM stage, thus, the processor should stall 2 cycle let the sub instruction get the wright value of x1.
But RIPES is a 5-stage processor, and the x1 value of first instruction was be calculated in EX stage. Hence with pipelined register, the processor don't need tp stall and can fetch the correct value.
Now suppose another situation
```assembly=
lw x1,0(x2)
add x3,x1,x4
```
Same, x1 has data dependency, but result of lw is calculated in MEM. This time even though the processor has pipeline register, it must stall 1 cycle so that add instruction can get the correct value of x3, this is called load use hazard.

The way solve data hazard are use different registers or move at least 1 instruction between *lw* and *add*(of course it shouldn't affect the result)
:::
:::spoiler control hazard
Branch instruction will be calculted whether the processor jump at ALU stage, and RIPES has no branch prediction scheme, instead, it will schedule instruction by the order. So let's condiser ther following situation:
```Assembly=
bne x0,x0,Jump
add x1,x2,x3
add x2,x3,x4
Jump:
sub x3,x4,x5
```
The code has no meaning, and RIPES will schedule instruction by order. In this case, 2 *add* instructions always execute, so there is no stall occur.
And in the following case:
```Assembly=
beq x0,x0,Jump
add x1,x2,x3
add x2,x3,x4
Jump:
sub x3,x4,x5
```


In this case 2 *add* instructions shouldn't execute ever, but RIPES will always schedulte them right after *beq*. And when the result of comparison is out, processor know that 2 *add* shouldn't schedule, so now processor has to **flush** it, it waste 2 cycles. The way solve control hazard is more complicated. Implement branch prediction scheme, or modify the structure of the code.
:::
---
## Load use hazard
In the following 3 part, I will show some work to eliminate load use hazard of different cases.
### Wu's homework 1
- [Wu's homework 1](https://hackmd.io/Nd8VNUnZSQWVpAKS-g0oag)
- [2089. Find Target Indices After Sorting Array](https://leetcode.com/problems/find-target-indices-after-sorting-array/description/)
I rewrite the code of mine because I write the code with pretty bad algorithm (*bubble sort*) at that time.
::: spoiler C program
My first method to solve the problem is following the question description, that is, sorted the array and then find the index of target.
It's quite simple but even I sorted the array by Quick Sort, its time complexity is **O(nlongn)**.
So another way is not using sort, simply count the number that less than target, then find out how many target the array has. Based on this way, the time complexity is **O(n)**.
Below is the C code of this question:
```C=
int* targetIndices(int* nums, int numsSize, int target, int* returnSize){
int i=0,count=0,index=0,j=0;
int *ans;
while(i<numsSize){
if(nums[i]<target)
index++; //index count the number that less than target, which means, the staring index of answer
else if(nums[i]==target)
count++;
i++;
}
ans = (int*)malloc(count*sizeof(int));
for(int i=0;i<count;i++)
ans[j++] = index++;
*returnSize = count;
return ans;
}
```
:::
:::spoiler RISC-V program, performance and enhancement
```RISC-V=
.data
arr: .word 5,4,3,2,1,5,4,3,2,1,5,4,3,2,1,5,4,3,2,1,5,4,3,2,1,5,4,3,2,1
arrSize: .word 30
target: .word 1
Lbracket: .string "["
Rbracket: .string "]"
Blank: .string " "
.text
main:
la a0 Lbracket
li a7,4
ecall
la t0,arrSize #load base address of arrSize1
lw t0,0(t0) #load arrSize1 to t0
la t1,arr #load base address of arr1 (t1 also =i)
la t3,target #load base address of target
lw t3,0(t3) #load target to t2
li t4,0 #index=0
li t5,0 #count=0
li t6,0 #i=0
li s0,0
Loop1:
lw t2,0(t1) #load arr1[i] to t2
bge t2, t3, if1
addi t4,t4,1 #index++
j iPP
if1:
bne t2,t3,iPP
addi t5,t5,1 #count++
iPP:
addi t6,t6,1
addi t1,t1,4
blt t6,t0,Loop1
Print:
add a0,t4,x0
li a7,1
ecall
addi s0,s0,1
addi t4,t4,1
la a0,Blank
li a7,4
ecall
bne s0,t5,Print
la a0, Rbracket #a0 is return value register
li a7,4 # print string by specifiy the value of a7 register t0 4
ecall
nop
```

In line 33 and 34, there is a load use hazard occur, and because there are in *Loop1*, the processor will stall many times due to it. We can move *li s0,0* between line 33 and 34, but the cost is that *li s0,0* will redundantly execute *arrSize-1* times, let check the performance.

Well, the difference can calculate easily. As we known, load use hazard generates 1 stall, and in this code, the hazard will occur *arrSize* times, that is, 30 stalls. As we modified the code to delete the hazard, 30 cycle has been eliminated, but the processor also need to execute *numsSize-1* additional instructions. So whatever the *arrSize* and whatever the value in array is, the code will decrease **1** cycle after modification.
As we can see in the below example, when we want to elilminate load use hazard and the case the hazard is about the branch instuction, we must pay attention on that whether the code execute redundantly instruction.
:::
### Zheng's homework 1
- [Zhen's homework 1](https://hackmd.io/@Fo7UsdePRsKPVV4CPYGbpA/rk91z0LWi)
- [283. Move Zeroes](https://leetcode.com/problems/move-zeroes/)
I chose Zhen's work to discuss how to reduce cycles(or stalls) cause in his work he doesn't mention how to solve tha hazard.
:::spoiler C program
Here I list critical part of the function
```C=
if(nums[i] != 0){
nums[next_nonzero_index] = nums[i];
if(next_nonzero_index != i)
nums[i] = 0;
next_nonzero_index++;
}
```
:::
:::spoiler RISC-V program, performance and enhancement
The RISC-V code is as below:
```assembly=
.data
nums1: .word 1,0,0,2,5
nums2: .word 1,9,0,6,7
nums3: .word 0,0,0,0,4
.text
main:
la s0 nums1 # load nums base address to s0
addi s1 x0 5 #nums_size = 5
#call function moveZeroes
add a0 s0 x0 # a0 which stands for address of nums array is the first argument
add a1 s1 x0 # a1 which stands for array size of nums is the second argument
jal ra moveZeroes #jump to moveZeroes function
#PrintArray
jal ra printArray
li a7 10
ecall
moveZeroes:
addi sp sp -12
sw ra 0(sp)
sw s0 4(sp)
sw s1 8(sp)
li s1 0 #next non zero index = 0
addi s2 x0 0 # i = 0
loop:
bge s2 a1 exit #i >= array_size exit
slli t0 s2 2 #i * 4
add t0 a0 t0 #array + i*4
lw t1 0(t0) #t1 = array[i]
beq t1 x0 next_iter
slli t2 s1 2 #next_nonzero_index * 4
add t2 t2 a0 #array + next_nonzero_index * 4
sw t1 0(t2) #array[next_nonzero_index] = array[i]
beq s1 s2 next_iter_addIndex #if(next_nonzero_index != i)
sw x0 0(t0) #store 0 to array[i]
next_iter_addIndex:
addi s1 s1 1 #next_nonzero_index++
next_iter:
addi s2 s2 1 #i++
j loop
exit:
lw ra 0(sp)
lw s0 4(sp)
lw s1 8(sp)
addi sp sp 12
jr ra
printArray:
li a7 1
addi sp sp -8
sw ra 0(sp)
sw s0 4(sp)
add s0 a0 x0 # s1 = pointer to array
add t0 x0 x0 # t0 = i in for loop
printloop:
bge t0 a1 finish_print_loop
slli t1 t0 2 # t1 = i * 4
add t1 t1 s0 # t1 = t1 + array address
lw a0 0(t1) # a0 = element in nums[i]
li a7 1
ecall
addi t0 t0 1
j printloop
finish_print_loop:
lw ra 0(sp)
lw s0 4(sp)
addi sp sp 8
jr ra
```
In his work, load use data hazard occur at line 33 and 34, I list the part and the following segment below:
```assembly=
loop:
bge s2 a1 exit
slli t0 s2 2
add t0 a0 t0
lw t1 0(t0)
beq t1 x0 next_iter
slli t2 s1 2
add t2 t2 a0
sw t1 0(t2)
beq s1 s2 next_iter_addIndex
sw x0 0(t0)
next_iter_addIndex:
addi s1 s1 1
next_iter:
addi s2 s2 1
j loop
```
According to the below RISC-V code, if t1 doesn't equal to 0, then value of t1, t2 and MEM[t2] will bechanged. But the question is, even if t1 equal to 0, the value of t1 and t2 won't be changed! So, we can calculate the new value of t1, t2 and MEM[t2] between line 5 and 6 and it will not affect output!
```Assembly=
lw t1 0(t0)
slli t2 s1 2
beq t1 x0 next_iter
```
Because our poupose is simply solve the hazard, so we only move *slli* between *lw* and *beq*. Let's check the performance!
Before modified:

After modified:

Due to move 1 instruction before the loop, there are 2 additional instructions (because the number of 0 equal to 2), but total cycle decrease, and CPI decrease of course!
What if the array size is much bigger? So I let the array size be 35, and the array is:
```
1,2,0,0,5,7,9,4,1,1,0,0,6,4,8,9,0,5,0,4,0,1,0,2,3,6,5,8,0,1,4,5,0,6,0
```
Before modified:

After modified:

The gap of cycles and CPI are bigger!
Finally, I want to know that is the modified code still be better if the array are all 0? because when there are more 0 in the array, the more instruction the modified code need to run.
Before modified:

After modified:

Same, the difference can be calculated like above, in conclusion, the enhancement depends on how many non-zero element in the array, if the array has 100 non-zero elements, then the cycle will decrease 50 cycle.
:::
### Handwritten Leetcode Medium RISC-V code
- [2221. Find Triangular Sum of an Array](https://leetcode.com/problems/find-triangular-sum-of-an-array/description/)
:::spoiler RISC-V program, performance, and enhancement
The *numsSize* indicated that the first line of triangle has *numsSize* number. So I must calculate the number for *numsSize-1* outer loop, and each outer loop I have to calculate *numsSize-1* times.
For eaxmple, if the input array is:
```
[1,2,3,4,5]
```
then first outer loop I have to calculte
```
[1+2, 2+3, 3+4, 4+5] 4 times
```
And seconde loop I have to calculate
```
[3+5, 5+7, 7+9] 3 times
```
Until I calculte 48%10, that is the answer.But due to the length of the test case is prerry big(at most 1000), I have to module 10 right after the sum calculated, or it would overflow.
```
[(1+2)%10, (2+3)%10, (3+4)%10, (4+5)%10]
```
So below is my hadnwritten RISC-V implementation:
```Assembly=
.data:
nums1: .word 1,2,3,4,5
#nums2: .word 1,2,3,4,5,6,7,8,9,10
#nums3: .word 1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10
numsSize1: .word 5
#numsSize2: .word 10
#numsSize3: .word 20
.text:
main:
li t3,1 # is used in j<"1"
li s3,10 # is used in num[j]%"10"
la t2,numsSize1 # load base address of numsSize1 to t2
lw t2,0(t2) # load numsSize1 to t2, so t2=numsSize
addi t5,t2,-1 # i=numsSize-1
Outer:
li t0,0 # j
la s0,nums1 # s0 = base address of nums
lw t1,0(s0) # load nums1[0] to t1
add s1,s0,x0 # stored address
Inner:
lw t1,0(s0)
addi s2,s0,4 #
lw t4,0(s2)
add t6,t1,t4
rem t6,t6,s3
addi t0,t0,1 # j++
sw t6,0(s1)
addi s1,s1,4
addi s0,s0,4
blt t0,t5,Inner
addi t5,t5,-1
li t0,0
la s0,nums1
add s1,s0,x0
bge t5,t3,Outer
rem t6,t6,s3
sw t6,0(s1)
```
For the purpose of save the memory, I calculate the sum and then write it to the input memory
```
[1,2,3,4,5] -> [(1+2)%10, (2+3)%10, (3+4)%10, (4+5)%10, 5 ]
1 2 3 4 5 1 2 3 4 5(This address we dont use anymore after the finish of first loop)
So the memory usage equal the input size
```
Now we can see that there is a load use hazard occur between line 29 and 30. And it's another situation different from the above, that is, the this hazard has no relation about branch and also in the loop block, so if we can find a instruction can move between this 2 instrction, then we can eliminate many cycles. Fortunately we can simply move *addi t0,t0,1 # j++* to eliminate the hazard, so let's see what happend.
Before modification

After modification

The numbers of cycle decrease 10, and 10 is equal to the times of addition. And the instruction count not even increase. If the input size is 1000, then the cycle gap is (1+999)*999/2, that is HUGE! So this is the best situation of solving the load use hazard.
:::
---
## Control hazard
Next I will try to decrease the stall generated by control hazard.
### Huang's Homework 1
- [Huang's homework 1](https://hackmd.io/@ZLQisilvQvOh2DclLmk1bg/SyjKI7sZi)
- [1720. Decode XORed Array](https://leetcode.com/problems/decode-xored-array/)
I choose *Huang's homework 1* because I can enhance his code through reducing the **control hazard**.
:::spoiler C program
```C=
int* decode(int* encoded, int encodedSize, int first, int* returnSize){
int* result = (int*)malloc(sizeof(int)*(encodedSize+1));
for(int i=0;i<=encodedSize;i++){
if(i==0){
result[i]=first;
}
else{
result[i]=result[i-1] ^ encoded[i-1];
}
}
*returnSize = encodedSize+1;
return result;
}
```
:::
:::spoiler RISC-V program, performance and enhancement
```Assembly=
.data
num1: .word 1,2,3
numSize1: .word 3
first1: .word 1
num2: .word 6,2,7,3
numSize2: .word 4
first2: .word 4
num3: .word 5,6,7,8,9
numSize3: .word 5
first3: .word 2
nextline: .string "\n"
space: .string " "
.text
main:
la a1,num1 #get num1 address
la a2,numSize1 #get numSize1 address
la a3,first1 #get first1 address
lw a2,0(a2) #a2 = numSize1 = 3
lw a3,0(a3) #a3 = first1 = 1
li a4,0 #int i = 0
li a5,0 #int count = 0
li a6,3 #input set =3
jal ra,decode.L1 #goto decode.L1
la a1,num2 #get num2 address
la a2,numSize2 #get numSize2 address
la a3,first2 #get first2 address
lw a2,0(a2) #a2 = numSize2 = 4
lw a3,0(a3) #a3 = first2 = 4
jal ra,decode.L1 #goto decode.L1
la a1,num3 #get num3 address
la a2,numSize3 #get numSize3 address
la a3,first3 #get first3 address
lw a2,0(a2) #a2 = numSize3 = 5
lw a3,0(a3) #a3 = first3 = 2
decode.L1:
add t0,a3,x0
sw t0,0(a4) #result[0] = first
decode.L2: #loop
beq a4,a2,index_zero #if i=numSize then goto index_zero
slli t0,a4,2 #a4*4
add t2,t0,a1 #get num1[i]
lw t1,0(t2) #get num1[i]
lw t2,0(t0) #get result[i]
xor t1,t1,t2 #result[i] ^ num[i]
addi t2,t0,4 #a4*4,k = i + 1
sw t1,0(t2) #result[k] = result[i] ^ num[i]
addi a4,a4,1 #i++
j decode.L2 #goto decode.L2
index_zero:
addi a4,x0,0 #i=0
print:
slli t0,a4,2 #a4*4
lw a0,0(t0) #get result[i]
li a7,1 #systemcall print
ecall #execute
addi a4,a4,1 #i++
la a0,space #print " "
li a7,4
ecall
ble a4,a2,print #if i<numSize then goto print
la a0,nextline #nextline
li a7,4
ecall
addi a5,a5,1 #count++
beq a5,a6,exit #excute all dataset
li a4,0 #int i = 0
jr ra
exit:
li a7,10
ecall
```
In line 39:
```Assembly=
beq a4,a2,index_zero #if i=numSize then goto index_zero
```
It check whether a4 equal to the size of the input array, where a4 is a counter start from 0. We can see that he put this instruction in the begining of the *decode.L2* block. But as LeetCode said, the input size is at least 2, so it is no need to put the branch instruction here, we can execute it later and will not affect the result. We can put the instruction before *j decode.L2* as below:
```Assembly=
decode.L2: #loop
slli t0,a4,2 #a4*4
add t2,t0,a1 #get num1[i]
lw t1,0(t2) #get num1[i]
lw t2,0(t0) #get result[i]
xor t1,t1,t2 #result[i] ^ num[i]
addi t2,t0,4 #a4*4,k = i + 1
sw t1,0(t2) #result[k] = result[i] ^ num[i]
addi a4,a4,1 #i++
beq a4,a2,index_zero #if i=numSize then goto index_zero
j decode.L2 #goto decode.L2
index_zero:
addi a4,x0,0 #i=0
```
So what's the difference of the two version? Let's recall the *Control Hazard in RIPES*, and check the examples:
```Assembly=
beq x0,x0,func
add x1,x2,x3
add x2,x3,x4
func:
sub x1,x2,x3
```
Cause RIPES will fetch instuctions by order, 2 *add* instructions will be fetched after beq. When branch occur, 2 add instuctions will be flushed.
Let's consider another case:
```Assembly=
beq x0,x0,func
add x1,x2,x3
funcc:
sub x1,x2,x3
```
In this case, *add* and *sub* will be fetched after beq.
When branch occur, *add* will be flushed, while *sub* will not because *sub* is the branch target, so if the fetched instructin is the branch target, it can reduce 1 stall!
Before modified:

After modified:

With 3 short input array, instructions counts and cycles both decrease.
:::
## Conclusion
- Solving load use hazard: When we want to eliminate this type of hazard, we must focus on the result can't be changed and whether the additional instuction counts is more than reduced cycle counts.
- Solving control hazard: Solving control hazard is more complicated, it usually needs to change the structure of program, and if the Simulator (of processor) support branch prediction, then we can more easily eliminate control hazard based on the prediction scheme.