# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`阮祈翰`](https://github.com/akkk74159/computer_architecture)>
## [Problem C](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C)
In Problem C, we have three functions: `fabsf`, `my_clz`, `fp16_to_fp32`
## fabsf
### C code
```c=
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
```
#### Test case for C code
| Input in decimal | Input in Hex | Output in decimal | Output in Hex |
|-------------------|---|---|---|
|0|0x00000000|0|0x00000000|
|-0|0x80000000|0|0x00000000|
|Normalized number: -1.15625|0xbf940000|1.15625|0x3f940000|
|Denormalized number: -1.39047e-39|0x800f2411|1.39047e-39|0x000f2411|
|NaN|0xffffffff|nan|0x7fffffff|
|inf|0x7f800000|inf|0x7f800000|
|-inf|0xff800000|inf|0x7f800000|
### Assembly
```c=
.data
case1: .word 0x00000000 # 0.0
case2: .word 0x80000000 # -0.0
case3: .word 0xbf940000 # -1.15625
case4: .word 0x800f2411 # -1.39047e-39
case5: .word 0xffffffff # nan
case6: .word 0x7f800000 # inf
case7: .word 0xff800000 # -inf
output1: .word 0x00000000 # 0.0
output2: .word 0x00000000 # 0.0
output3: .word 0x3f940000 # 1.15625
output4: .word 0x000f2411 # 1.39047e-39
output5: .word 0x7fffffff # nan
output6: .word 0x7f800000 # inf
output7: .word 0x7f800000 # inf
str1: .string "pass "
str2: .string "fail "
.text
main:
li s0, 7 # s0 stores amount of test cases
la t0, case1 # t0 = address of case1
loop:
# run case 1~7
lw a0, 0(t0)
jal ra, fabsf
lw a1, 28(t0)
jal ra, print_result
addi t0, t0, 4 # move to next case
addi s0, s0, -1 # case counter -1
bnez s0, loop
#exit
li a7, 10
ecall
fabsf:
li t1, 0x7FFFFFFF
and a0, a0, t1 # x and 0x7fffffff
jr ra
# print result (pass or fail)
print_result:
bne a0, a1, fail
la a0, str1 # print "pass"
li a7, 4
ecall
j print_end
fail:
la a0, str2 # print "fail"
li a7, 4
ecall
print_end:
jr ra # return
```
#### Test case for Assembly
| Test case 1 | Test case 2 | Test case 3 | Test case 4 | Test case 5 | Test case 6 | Test case 7 |
|----|----|----|----|----|----|----|
|pass|pass|pass|pass|pass|pass|pass|

## my_clz
### C code
```c=
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
#### Test case for C code
| Input in decimal | Input in Hex | Output |
|-------------------|---|---|
|0|0x00000000|32|
|1537|0x00000601|21|
|4294967295|0xffffffff|0|
### Assembly
```c=
.data
case1: .word 0
case2: .word 1537
case3: .word 4294967295
output1: .word 32
output2: .word 21
output3: .word 0
str1: .string "pass "
str2: .string "fail "
.text
main:
li s0, 3 # s0 stores amount of test cases
la t0, case1 # t0 = address of case1
loop:
# run case 1~3
lw a0, 0(t0)
jal ra, my_clz
lw a1, 12(t0)
jal ra, print_result
addi t0, t0, 4 # move to next case
addi s0, s0, -1 # case counter -1
bnez s0, loop
#exit
li a7, 10
ecall
# count leading zero function
my_clz:
li t2, 1 # init t3 = 1U
li t3, 0 # init count = 0
li t4, 31 # init i = 31
my_clz_loop:
sll t5, t2, t4 # t5 = 1U << i
and t5, t5, a0 # t5 = (x & (1U << i))
bne t5, x0, clz_end # if (t5 != 0), exit the loop
addi t3, t3, 1 # count++
addi t4, t4, -1 # --i
bge t4, zero, my_clz_loop # if (i >= 0), stay in the loop
clz_end:
mv a0, t3 # a0 = count
jr ra
# print result (pass or fail)
print_result:
bne a0, a1, fail
la a0, str1 # print "pass"
li a7, 4
ecall
j print_end
fail:
la a0, str2 # print "fail"
li a7, 4
ecall
print_end:
jr ra
```
### Test case for Assembly
| Test case 1 | Test case 2 | Test case 3 |
|-------------|-------------|-------------|
|pass|pass|pass|

## fp16_to_fp32
#### C code
```c=
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
#### Test case for C code
| Input in decimal | Iuput in Hex | Output in decimal | Output in Hex |
|-------------------|---|---|---|
|0|0x0000|0|0x00000000|
|-0|0x8000|-0|0x80000000|
|Normalized number: -1.157|0xbca0|-1.15625|0xbf940000|
|Denormalized number: -0.00000871|0x8092|-0.000008702278|0xb7120000|
|NaN|0xfc01|nan|0xff802000|
|inf|0x7c00|inf|0x7f800000|
|-inf|0xfc00|-inf|0xff800000|
#### Assembly
```c=
.data
case1: .word 0x0000 # 0.0
case2: .word 0x8000 # -0.0
case3: .word 0xbca0 # -1.157
case4: .word 0x8092 # -0.00000871
case5: .word 0xfc01 # nan
case6: .word 0x7c00 # inf
case7: .word 0xfc00 # -inf
output1: .word 0x00000000 # 0.0
output2: .word 0x80000000 # -0.0
output3: .word 0xbf940000 # -1.15625
output4: .word 0xb7120000 # -0.000008702278
output5: .word 0xff802000 # nan
output6: .word 0x7f800000 # inf
output7: .word 0xff800000 # -inf
str1: .string "pass "
str2: .string "fail "
.text
main:
li s0, 7 # s0 stores amount of test cases
la t0, case1 # t0 = address of case1
loop:
# run case 1~7
lw a0, 0(t0)
jal ra, fp16_to_fp32
lw a1, 28(t0)
jal ra, print_result
addi t0, t0, 4 # move to next case
addi s0, s0, -1 # case counter -1
bnez s0, loop
#exit
li a7, 10
ecall
fp16_to_fp32:
# s1 = w
slli s1, a0, 16 # s1 = h << 16
# s2 = sign
li s7, 0x80000000
and s2, s1, s7 # s2 = w & 0x80000000
# s3 = nonsign
li s7, 0x7FFFFFFF
and s3, s1, s7 # s3 = w & 0x7FFFFFFF
# s4 = renorm_shift
my_clz:
li t2, 1 # init t3 = 1U
li t3, 0 # init count = 0
li t4, 31 # init i = 31
my_clz_loop:
sll t5, t2, t4 # t5 = 1U << i
and t5, t5, s3 # t5 = (x & (1U << i))
bne t5, x0, clz_end # if (t5 != 0), exit the loop
addi t3, t3, 1 # count++
addi t4, t4, -1 # --i
bge t4, zero, my_clz_loop # if (i >= 0), stay in the loop
clz_end:
mv s4, t3 # s4 = my_clz(nonsign)
addi s4, s4, -5
bge s4, zero, gt5 # if (s4 > 5), s4 = (s4 - 5)
add s4, zero, zero # else s4 = 0
gt5:
# s5 = inf_nan_mask
li s7, 0x04000000
add s5, s3, s7 # s5 = nonsign + 0x04000000
srai s5, s5, 8 # s5 = (nonsign + 0x04000000) >> 8
li s7, 0x7F800000
and s5, s5, s7 # s5 = ((nonsign + 0x04000000) >> 8) & (0x7F800000)
# s6 = ~zero_mask
addi s6, s3, -1 # s6 = nonsign - 1
srai s6, s6, 31 # s6 = (nonsign - 1) >> 31
li s7, 0xFFFFFFFF
xor s6, s6, s7 # s6 = ~zero_mask
sll s3, s3, s4 # s3 = nonsign << renorm_shift
srli s3, s3, 3 # s3 = s3 >> 3
li s7, 0x70
sub s4, s7, s4 # s4 = 0x70 - renorm_shift
slli s4, s4, 23 # s4 = s4 << 23
add s3, s3, s4 # s3 = (nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)
or s3, s3, s5 # s3 = s3 | inf_nan_mask
and s3, s3, s6 # s3 = s3 & ~zero_mask
or a0, s2, s3 # a0 = sign | s3
jr ra # return
# print result (pass or fail)
print_result:
bne a0, a1, fail
la a0, str1 # print "pass"
li a7, 4
ecall
j print_end
fail:
la a0, str2 # print "fail"
li a7, 4
ecall
print_end:
jr ra # return
```
#### Test case for Assembly
| Test case 1 | Test case 2 | Test case 3 | Test case 4 | Test case 5 | Test case 6 | Test case 7 |
|----|----|----|----|----|----|----|
|pass|pass|pass|pass|pass|pass|pass|

## LeetCode : 3226. Number of Bit Changes to Make Two Integers Equal
[LeetCode : 3226. Number of Bit Changes to Make Two Integers Equal](https://leetcode.com/problems/number-of-bit-changes-to-make-two-integers-equal/)
Difficulty : easy
Problem description:
You are given two positive integers `n` and `k`.
You can choose **any** bit in the **binary representation** of n that is equal to 1 and change it to 0.
Return the *number of changes* needed to make `n` equal to `k`. If it is impossible, return `-1`.
### Example 1:
>Input: n = 13, k = 4
>
>Output: 2
>
>Explanation:
>Initially, the binary >representations of n and k are n = (1101)~2~ and k = (0100)~2~.
>We can change the first and fourth >bits of n. The resulting integer is n = (0100)2 = k.
### Example 2:
>Input: n = 21, k = 21
>
>Output: 0
>
>Explanation:
>n and k are already equal, so no changes are needed.
### Example 3:
>Input: n = 14, k = 13
>
>Output: -1
>
>Explanation:
>It is not possible to make n equal to k.
## Implementation
You can check the source code [here](https://github.com/akkk74159/computer_architecture)
Using the **my_clz** (counting leading zero) function from `Problem C in Quiz 1` to calculate the number of leading zeros of `n XOR k` (assume it is **num**). We then compare bit difference between `n` and `k` starting from least significant bit. After at most `32 - num` times of shifting bits we can find the number of bit changes to make `n` and `k` equal.
### C code
The function of `my_clz`:
```c=
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i)) // 1U = unsigned int 1
break;
count++;
}
return count;
}
```
The function `bitChanges` compare `n` and `k` bits by bits starting from LSB, then return the number of bit to change.
```c=
int bitChanges(int n, int k) {
int num = n ^ k;
num = my_clz(num);
int result = 0;
for(int i = 0;i < 32 - num;i++){
if((n & 1) == 1 && (k & 1) == 0){
result = result + 1;
}
else if((n & 1) == 0 && (k & 1) == 1){
result = -1;
break;
}
n >>= 1;
k >>= 1;
}
return result;
}
```
```c=
int main(){
int n1 = 13, k1 = 4;
int n2 = 21, k2 = 21;
int n3 = 14, k3 = 13;
int output1 = bitChanges(n1, k1);
int output2 = bitChanges(n2, k2);
int output3 = bitChanges(n3, k3);
printf("%d\n", output1);
printf("%d\n", output2);
printf("%d\n", output3);
return 0;
}
```
### Assembly code
```c=
.data
n1: .word 13
k1: .word 4
output1: .word 2
n2: .word 21
k2: .word 21
output2: .word 0
n3: .word 14
k3: .word 13
output3: .word -1
str1: .string "pass "
str2: .string "fail "
.text
main:
li s0, 3 # s0 stores how many test cases
la s1, n1 # s1 stores address of n1
loop:
lw a0, 0(s1) # a0 = n
lw a1, 4(s1) # a1 = k
lw a2, 8(s1) # a2 = correct output
addi sp, sp, -4
sw ra, 0(sp)
jal ra, bitChanges # goto bitChanges function
lw ra, 0(sp)
addi sp, sp, 4
jal ra, print_result # goto print result
addi s1, s1, 12 # move to the next test case
addi s0, s0, -1 # case counter -1
bnez s0, loop
#exit
li a7, 10
ecall
#bitChanges function
bitChanges:
mv t0, a0 # t0 = n
mv t1, a1 # t1 = k
xor a0, t0, t1 # a0 = n xor k
addi sp, sp, -4
sw ra, 0(sp)
jal ra, my_clz # goto my_clz function
lw ra, 0(sp)
addi sp, sp, 4
li t2, 0 # init i = 0
li t3, 32 # let t3 = 32
sub t3, t3, a0 # let t3 = (32 - num)
li t4, 0 # init result = 0
bne t3, x0, bitChanges_loop # if (32 - num) != 0 then start looping
li a0, 0 # if (32 - num) == 0, the result is 0
j bitChanges_end
bitChanges_loop:
andi s2, t0, 1 # s2 = (n & 1)
andi s3, t1, 1 # s3 = (k & 1)
beq s2, zero, skip_add # if s2 == 0, goto skip_add
bne s3, zero, skip_add # if s3 == 1, goto skip_add
addi t4, t4, 1 # if(s2 == 1) && (s3 == 0), then result++
skip_add:
bne s2, zero, shift # if s2 == 1, goto shift
beq s3, zero, shift # if s3 == 0, goto shift
li a0, -1 # if (s2 == 0) && (s3 == 1), then result = -1
j bitChanges_end
shift:
srli t0, t0, 1 # n >>= 1
srli t1, t1, 1 # k >>= 1
addi t2, t2, 1 # i++
bne t2, t3, bitChanges_loop # if(i < (32 - num)), stay in the loop
mv a0, t4 # else, let a0 = result
bitChanges_end:
jr ra # return
#clz function
my_clz:
li t2, 1 # init t2 = 1U
li t3, 0 # init count = 0
li t4, 31 # init i = 31
my_clz_loop:
sll t5, t2, t4 # t5 = 1U << i
and t5, t5, a0 # t5 = (x & (1U << i))
bne t5, x0, my_clz_end # if (t5 != 0), exit the loop
addi t3, t3, 1 # count++
addi t4, t4, -1 # --i
bge t4, zero, my_clz_loop # if (i >= 0), stay in the loop
my_clz_end:
mv a0, t3 # a0 = count
jr ra # return
# print result (pass or fail)
print_result:
bne a0, a2, fail
la a0, str1 # print "pass"
li a7, 4
ecall
j print_end
fail:
la a0, str2 # print "fail"
li a7, 4
ecall
print_end:
jr ra # return
```
## Result
### Test cases:
#### Case 1
>Input: n = 13, k = 4
>
>Output: 2
#### Case 2
>Input: n = 21, k = 21
>
>Output: 0
#### Case 3
>Input: n = 14, k = 13
>
>Output: -1
### C program output
||n|k|output|
|-|-|-|-|
|Case1|13|4|2|
|Case2|21|21|0|
|Case3|14|13|-1|
### Assembly output
||n|k|output|
|-|-|-|-|
|Case1|13|4|pass|
|Case2|21|21|pass|
|Case3|14|13|pass|
### Performance
| Execution info||
| -------------- | --- |
| Cycles | 981 |
| Instrs. retired | 713 |
| CPI | 1.38 |
| IPC | 0.727 |
| Clock rate | 10.87Hz |
## Second version
We implement the **branchless** `my_clz`, meaning there will be no branch instruction in `my_clz` function.
### C code
```c=
int blclz(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return 32 - (x & 0x7f); //move 32 from my_clz
}
```
### Assembly
```c=
.data
n1: .word 13
k1: .word 4
output1: .word 2
n2: .word 21
k2: .word 21
output2: .word 0
n3: .word 14
k3: .word 13
output3: .word -1
str1: .string "pass "
str2: .string "fail "
.text
main:
li s0, 3 # s0 stores how many test cases
la s1, n1 # s1 stores address of n1
loop:
lw a0, 0(s1) # a0 = n
lw a1, 4(s1) # a1 = k
lw a2, 8(s1) # a2 = correct output
addi sp, sp, -4
sw ra, 0(sp)
jal ra, bitChanges # goto bitChanges function
lw ra, 0(sp)
addi sp, sp, 4
jal ra, print_result # goto print result
addi s1, s1, 12 # move to the next test case
addi s0, s0, -1 # case counter -1
bnez s0, loop
#exit
li a7, 10
ecall
#bitChanges function
bitChanges:
mv t0, a0 # t0 = n
mv t1, a1 # t1 = k
xor a0, t0, t1 # a0 = n xor k
addi sp, sp, -4
sw ra, 0(sp)
jal ra, blclz # goto my_clz function
lw ra, 0(sp)
addi sp, sp, 4
li t2, 0 # init i = 0
li t3, 32 # let t3 = 32
sub t3, t3, a0 # let t3 = (32 - num)
li t4, 0 # init result = 0
bne t3, x0, bitChanges_loop # if (32 - num) != 0 then start looping
li a0, 0 # if (32 - num) == 0, the result is 0
j bitChanges_end
bitChanges_loop:
andi s2, t0, 1 # s2 = (n & 1)
andi s3, t1, 1 # s3 = (k & 1)
beq s2, zero, skip_add # if s2 == 0, goto skip_add
bne s3, zero, skip_add # if s3 == 1, goto skip_add
addi t4, t4, 1 # if(s2 == 1) && (s3 == 0), then result++
skip_add:
bne s2, zero, shift # if s2 == 1, goto shift
beq s3, zero, shift # if s3 == 0, goto shift
li a0, -1 # if (s2 == 0) && (s3 == 1), then result = -1
j bitChanges_end
shift:
srli t0, t0, 1 # n >>= 1
srli t1, t1, 1 # k >>= 1
addi t2, t2, 1 # i++
bne t2, t3, bitChanges_loop # if(i < (32 - num)), stay in the loop
mv a0, t4 # else, let a0 = result
bitChanges_end:
jr ra # return
#clz function
blclz:
srli t2, a0, 1 # x >> 1
or a0, a0, t2 # x |= (x >> 1)
srli t2, a0, 2 # x >> 2
or a0, a0, t2 # x |= (x >> 2)
srli t2, a0, 4 # x >> 4
or a0, a0, t2 # x |= (x >> 4)
srli t2, a0, 8 # x >> 8
or a0, a0, t2 # x |= (x >> 8)
srli t2, a0, 16 # x >> 16
or a0, a0, t2 # x |= (x >> 16)
srli t2, a0, 1 # x >> 1
li t4, 0x55555555
and t2, t2, t4 # t2 = ((x >> 1) & 0x55555555)
sub a0, a0, t2 # x = x - t2
srli t2, a0, 2 # x >> 2
li t4, 0x33333333
and t2, t2, t4 # t2 = ((x >> 2) & 0x33333333)
and t4, a0, t4 # t4 = x & 0x33333333
add a0, t2, t4 # x = t2 & t4
srli t2, a0, 4 # x >> 4
add t2, t2, a0 # t2 = (x >> 4) + x
li t4, 0x0f0f0f0f
and a0, t2, t4 # x = t2 & 0x0f0f0f0f
srli t2, a0, 8 # x >> 8
add a0, a0, t2 # x += (x >> 8)
srli t2, a0, 16 # x >> 16
add a0, a0, t2 # x += (x >> 16)
andi a0, a0, 0x7F # x & 0x7f
li t2, 32
sub a0, t2, a0 # a0 = 32 - (x & 0x7f)
clz_end:
jr ra # return
# print result (pass or fail)
print_result:
bne a0, a2, fail
la a0, str1 # print "pass"
li a7, 4
ecall
j print_end
fail:
la a0, str2 # print "fail"
li a7, 4
ecall
print_end:
jr ra # return
```
### Performance
We transform the original `my_clz` function to `blclz` function, resulting less cycles and instructions for the program to run. That indicates a better improvement for the code.
| Execution info|Version1|Version2|
| ------------- | ------ | ------ |
| Cycles |981| 340 |
| Instrs. retired|713| 254 |
| CPI |1.38| 1.34 |
| IPC |7.27| 0.747 |
| Clock rate |10.87Hz| 10.99Hz |
## Analysis
We use [Ripe](https://github.com/mortbopet/Ripes) to simulate RISC-V processor.
### Pseudo Instruction
Ripes will automatically convert the pseudo instructions to the Executable instructions.
```
00000000 <main>:
0: 00300413 addi x8 x0 3
4: 10000497 auipc x9 0x10000
8: ffc48493 addi x9 x9 -4
0000000c <loop>:
c: 0004a503 lw x10 0 x9
10: 0044a583 lw x11 4 x9
14: 0084a603 lw x12 8 x9
18: ffc10113 addi x2 x2 -4
1c: 00112023 sw x1 0 x2
20: 024000ef jal x1 36 <bitChanges>
24: 00012083 lw x1 0 x2
28: 00410113 addi x2 x2 4
2c: 118000ef jal x1 280 <print_result>
30: 00c48493 addi x9 x9 12
34: fff40413 addi x8 x8 -1
38: fc041ae3 bne x8 x0 -44 <loop>
3c: 00a00893 addi x17 x0 10
40: 00000073 ecall
00000044 <bitChanges>:
44: 00050293 addi x5 x10 0
48: 00058313 addi x6 x11 0
4c: 0062c533 xor x10 x5 x6
50: ffc10113 addi x2 x2 -4
54: 00112023 sw x1 0 x2
58: 064000ef jal x1 100 <blclz>
5c: 00012083 lw x1 0 x2
60: 00410113 addi x2 x2 4
64: 00000393 addi x7 x0 0
68: 02000e13 addi x28 x0 32
6c: 40ae0e33 sub x28 x28 x10
70: 00000e93 addi x29 x0 0
74: 000e1663 bne x28 x0 12 <bitChanges_loop>
78: 00000513 addi x10 x0 0
7c: 03c0006f jal x0 60 <bitChanges_end>
00000080 <bitChanges_loop>:
80: 0012f913 andi x18 x5 1
84: 00137993 andi x19 x6 1
88: 00090663 beq x18 x0 12 <skip_add>
8c: 00099463 bne x19 x0 8 <skip_add>
90: 001e8e93 addi x29 x29 1
00000094 <skip_add>:
94: 00091863 bne x18 x0 16 <shift>
98: 00098663 beq x19 x0 12 <shift>
9c: fff00513 addi x10 x0 -1
a0: 0180006f jal x0 24 <bitChanges_end>
000000a4 <shift>:
a4: 0012d293 srli x5 x5 1
a8: 00135313 srli x6 x6 1
ac: 00138393 addi x7 x7 1
b0: fdc398e3 bne x7 x28 -48 <bitChanges_loop>
b4: 000e8513 addi x10 x29 0
000000b8 <bitChanges_end>:
b8: 00008067 jalr x0 x1 0
000000bc <blclz>:
bc: 00155393 srli x7 x10 1
c0: 00756533 or x10 x10 x7
c4: 00255393 srli x7 x10 2
c8: 00756533 or x10 x10 x7
cc: 00455393 srli x7 x10 4
d0: 00756533 or x10 x10 x7
d4: 00855393 srli x7 x10 8
d8: 00756533 or x10 x10 x7
dc: 01055393 srli x7 x10 16
e0: 00756533 or x10 x10 x7
e4: 00155393 srli x7 x10 1
e8: 55555eb7 lui x29 0x55555
ec: 555e8e93 addi x29 x29 1365
f0: 01d3f3b3 and x7 x7 x29
f4: 40750533 sub x10 x10 x7
f8: 00255393 srli x7 x10 2
fc: 33333eb7 lui x29 0x33333
100: 333e8e93 addi x29 x29 819
104: 01d3f3b3 and x7 x7 x29
108: 01d57eb3 and x29 x10 x29
10c: 01d38533 add x10 x7 x29
110: 00455393 srli x7 x10 4
114: 00a383b3 add x7 x7 x10
118: 0f0f1eb7 lui x29 0xf0f1
11c: f0fe8e93 addi x29 x29 -241
120: 01d3f533 and x10 x7 x29
124: 00855393 srli x7 x10 8
128: 00750533 add x10 x10 x7
12c: 01055393 srli x7 x10 16
130: 00750533 add x10 x10 x7
134: 07f57513 andi x10 x10 127
138: 02000393 addi x7 x0 32
13c: 40a38533 sub x10 x7 x10
00000140 <clz_end>:
140: 00008067 jalr x0 x1 0
00000144 <print_result>:
144: 00c51c63 bne x10 x12 24 <fail>
148: 10000517 auipc x10 0x10000
14c: edc50513 addi x10 x10 -292
150: 00400893 addi x17 x0 4
154: 00000073 ecall
158: 0140006f jal x0 20 <print_end>
0000015c <fail>:
15c: 10000517 auipc x10 0x10000
160: ece50513 addi x10 x10 -306
164: 00400893 addi x17 x0 4
168: 00000073 ecall
0000016c <print_end>:
16c: 00008067 jalr x0 x1 0
```
### 5-Stage RISC-V Processor
Ripes provides different processors to execute the code. In my case, I selected 5-stage processer to run my code.

I choose `add x10 x10 x7` as an example to demonstrate how instruction works in each stage, including `IF`, `ID`, `EX`, `MEM`, `WB`.
1. **IF (Instrction Fetch)**
- Fetch the instructions from the instruction memory.
2. **ID (Instruction Decode)**
- Decode the instruction to determine the type of operation, such as addition, subtraction, or load/store.
- Read the values of source registers from the register file.
3. **EX (Execute)**
- The ALU (Arithmetic Logic Unit) executes arithmetic (e.g., add, sub) and logical operations (e.g., AND, OR, XOR).
- For branch instructions, the EX stage calculates whether the branch condition is true or false, based on the values of the source registers.
4. **MEM (Memory access)**
- Load instructions: Reading data from memory to be stored in a register.
- Store instructions: Writing data from a register into a memory location.
5. **WB (Write back)**
- For ALU operations (e.g., add, sub), the result computed by the ALU in the EX stage is written back to the destination register.
- For load instructions (e.g., lw), the value read from memory in the MEM stage is written to the destination register.
---
### 1. Instruction Fetch

- The current program counter is `0x00000128`, which means that `add x10 x10 x7` instruction is located at memory address `000000128`. The processor will **fetch the instruction** which stored at the current address from the instruction memory. In this case, the instruction will be translated to machine code `0x00750533`
- The program counter will then be added by 4(`0x0000012c`) for the next instruction to be fetched.
### 2. Instruction Decode

- The decoder will decode the machine code into several parts.
- `opcode` which indicate this instruction is `ADD`
- `rs1` which is `0x0a` (x10)
- `rs2` which is `0x07` (x7)
- `rd` which is `0x0a` (x10)
- Reg1 will hold the data of `rs1`(0x00000004).
- Reg2 will hold the data of `rs2`(0x00000004).
- `rd` will be sent to next stage.
### 3. Execute

- In this stage, `Op1` will hold the data from `rs1`, `Op2` will hold the data from `rs2`. ALU then operates `ADD` instruction of the two inputs, `Op1`(0x00000004) and `Op2`(0x00000000), the result will be stored in `Res` which is `0x00000004`.
>The reason why Op2 turned into 0x00000000 is because of data forwarding (We will talk about this later).
- `rd` will be sent to next stage.
### 4. Memory Access

- `ADD` instruction uses the R-type format. In this stage, R-type format no need to load data from memory or store data to memory.
- The `Wr en` is set to 0 since `ADD` don't store data to memory.
- `rd` will be sent to next stage.
### 5. Write Back

- In this stage, `RES`(0x00000004) will be stored in `Wr data` and written back to the destination register, which is `rd`(0x0a), represented as `Wr idx` in the `ID` stage.
## Memory
The data is stored in the memory start from address `0x1000_0000`. We want to load data to registers. The assembler uses `auipc` and `program counter` to set the higher 20 bits, then uses `addi` to set the lower 12 bits to get the memory address.
```
.data
n1: .word 13 #0x1000 0000
k1: .word 4 #0x1000 0004
output1: .word 2 #0x1000 0008
n2: .word 21 #0x1000 000c
k2: .word 21 #0x1000 0010
output2: .word 0 #0x1000 0014
n3: .word 14 #0x1000 0018
k3: .word 13 #0x1000 001c
output3: .word -1 #0x1000 0020
```
Store address of `n1` to s1 register
```
la s1, n1
```
Pseudo Instruction
```
4: 10000497 auipc x9 0x10000 #x9 == 0x10000004
8: ffc48493 addi x9 x9 -4 #x9 == 0x10000000
```
Load `n1` to a0 register, `k1` to a1 register, `output1` to a2 register
```
#At this point, s1 == 0x10000000
lw a0, 0(s1) # a0 = n1
lw a1, 4(s1) # a1 = k1
lw a2, 8(s1) # a2 = output1
```
Move to next test case
```
addi s1, s1, 12
```
Load `n2` to a0 register, `k2` to a1 register, `output2` to a2 register
```
#At this point, s1 == 0x1000000c
lw a0, 0(s1) # a0 = n2
lw a1, 4(s1) # a1 = k2
lw a2, 8(s1) # a2 = output2
```
Move to next test case
```
addi s1, s1, 12
```
Load `n3` to a0 register, `k3` to a1 register, `output3` to a2 register
```
#At this point, s1 == 0x10000018
lw a0, 0(s1) # a0 = n3
lw a1, 4(s1) # a1 = k3
lw a2, 8(s1) # a2 = output3
```
## Pipeline data hazard
Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Most commonly seen data hazard is usually caused by `read after write(RAW)` dependency. In the Ripes simulator, we can employ [data forwarding](https://en.wikipedia.org/wiki/Operand_forwarding) to deal with it.
### RAW data hazard
```
addi x5 x10 0
addi x6 x11 0
xor x10 x5 x6
```

In this case, `x5` and `x6` has not been writen to register in this cycle. For `xor` instruction, the data it read from the register is still the original data. Thus, we have to get the newest data from the `EX/MEM` and `MEM/WB`.
By employing data forwarding, we can get the latest `x5` data from `MEM/WB` stage and the latest `x6` data from `EX/MEM` stage. They will then be `op1` and `op2` respectively for the ALU operation.
## Reference
* [Quiz1 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
* [LeetCode : 3226. Number of Bit Changes to Make Two Integers Equal](https://leetcode.com/problems/number-of-bit-changes-to-make-two-integers-equal/description/)
* [RISC-V 指令集架構介紹 - RV32I](https://tclin914.github.io/16df19b4/)