# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < `ChengRay` >
## Problem C in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
In problem C, there are three functions: `fabsf`, `my_clz`, and `uint32_t fp16_to_fp32`.
### fabsf
>##### IEEE754
>The format of IEEE 754 single precision(32 bits) is as follows:

>* `Sign`: 1 bit, it indicates positive or negative.
>* `Exponent`: 8 bits, it is represented in a biased format to accommodate both positive and negative exponents. The actual exponent is calculated by subtracting the bias(127).
>* `fraction`: 23 bits, the fraction part represents the significant digits of the number, with an implicit leading bit of 1 for normalized number.
The purpose is to compute the **absolute value** of a float number.
`uint32_t i` & `0x7FFFFFFF` sets the leftmost bit to 0 and retains the original values of the remaining bits.
##### C code
```c
#include<stdio.h>
#include<stdint.h>
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
int main(){
float test1 = -5.125;
float test2 = 1024.25;
float test3 = 0;
printf("test1: %f" , fabsf(test1));
printf("\ntest2: %f" , fabsf(test2));
printf("\ntest3: %f" , fabsf(test3));
return 0;
}
```
##### RISC-V
```c
.data
test1: .word 0xC0A40000 # -5.125 in IEEE 754
test2: .word 0x44800800 # 1024.25 in IEEE 754
test3: .word 0x00000000 # 0 in IEEE 754
str1: .string "test1: "
str2: .string "\ntest2: "
str3: .string "\ntest3: "
.text
main:
#test1
la a0, str1 # Load the address of the str1
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test1 # Load address of test1
lw a0, 0(t0) # Load the 32-bit float into a0
jal ra, fabsf # Call fabsf
li a7, 2 # Print the answer
ecall
#test2
la a0, str2 # Load the address of the str2
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test2 # Load address of test2
lw a0, 0(t0) # Load the 32-bit float into a0
jal ra, fabsf # Call fabsf
li a7, 2 # Print the answer
ecall
#test3
la a0, str3 # Load the address of the str3
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test3 # Load address of test3
lw a0, 0(t0) # Load the 32-bit float into a0
jal ra, fabsf # Call fabsf
li a7, 2 # Print the answer
ecall
#Exit
li a0, 0 # Exit code 0
li a7, 93 # Syscall for exit
ecall # Make the syscall
fabsf:
li t0, 0X7FFFFFFF
and t1, a0, t0 # Clear the sign bit
mv a0, t1
jr x1
```
###### Output console

:::danger
Do not use screenshots for plain text content, as this is inaccessible to visually impaired users.
:::
:::success
ChengRay: The linking path has been changed.
:::
### my_clz
The function is counts **the number of leading zero bits** in the binary representation of an unsigned **32-bit** integer, starting from the most significant bit.
>[!Note]
The special case: when the input is **`0`**, the output is **`undefined`**.
#### C code_version1
```c
#include <stdio.h>
#include <stdint.h>
static inline int my_clz(uint32_t x){
if(x==0){
printf("Undefined");
return -1;
}
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
int main(){
int test1 = 16;
int test2 = 33;
int test3 = 0;
int result=0;
printf("test1: ");
result = my_clz(test1);
if(result>0)printf("%d" , result);
printf("\ntest2: ");
result = my_clz(test2);
if(result>0)printf("%d" , result);
printf("\ntest3: ");
result = my_clz(test3);
if(result>0)printf("%d\n" , result);
return 0;
}
```
#### RISC-V version1
```c
.data
test1: .word 16
test2: .word 33
test3: .word 0
str1: .string "test1: "
str2: .string "\ntest2: "
str3: .string "\ntest3: "
str4: .string "undefined"
.text
main:
# test1
la a0, str1 # Load the address of the str1
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test1 # Load address of test1
lw a0, 0(t0) # Load the test1 into a0
jal ra, my_clz # Call my_clz
# test2
la a0, str2 # Load the address of the str2
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test2 # Load address of test2
lw a0, 0(t0) # Load the test2 into a0
jal ra, my_clz # Call my_clz
# test3
la a0, str3 # Load the address of the str3
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test3 # Load address of test3
lw a0, 0(t0) # Load the test3 into a0
jal ra, my_clz # Call my_clz
# Exit
li a0, 0 # Exit code 0
li a7, 93 # Syscall for exit
ecall # Make the syscall
my_clz:
beqz a0, return0 # if a0 == 0, jump.
li t0, 1 # load 1 into t0
li t1, 31 # load 31 into t1
cal:
sll t2, t0, t1 # shift
bgeu a0, t2, return # input > t2, get numbers of shift.
addi t1, t1, -1 # shift number - 1
j cal # loop
return:
xori t0, t1, 0x1F # 1' complement
mv a0, t0 # print value
li a7, 1 # Print the answer
ecall
jr x1
return0:
li a0, -1 # return -1
la a0, str4 # Load the address of the str4
li a7, 4 # System call code for printing a string
ecall
jr x1
```
###### output result


I have an idea to modify the code of problem A in [2023_Quiz1](https://hackmd.io/@sysprog/arch2023-quiz1-sol).
ProblemA in 2023_Quiz1 is the function that counts leading zero for unsigned **64-bit** integers. Thus, I can modify the code into the fuction that calculates for 32-bit integer.
The `my_clz` function has a loop, but the modified code does not contain any loops. This may reduce the number of `nop` instructions due to the branch instruction.
*Let’s see which version performs the best.*
#### C code_version2
```c
#include <stdio.h>
#include <stdint.h>
int16_t my_clz(uint32_t x){
if(x==0){
printf("test is 0, Undefined\n");
return -1;
}
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x7f));
}
int main(){
int test1 = 16;
int test2 = 33;
int test3 = 0;
int result = 0;
result = my_clz(test1);
if(result>0)printf("test1: %d\n" , result);
result = my_clz(test2);
if(result>0)printf("test2: %d\n" , result);
result = my_clz(test3);
if(result>0)printf("test3: %d\n" , result);
return 0;
}
```
#### RISC-V version2
```c
.data
test1: .word 16
test2: .word 33
test3: .word 0
str1: .string "test1: "
str2: .string "\ntest2: "
str3: .string "\ntest3: "
str4: .string "undefined"
.text
main:
# test1
la a0, str1 # Load the address of the str1
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test1 # Load address of test1
lw a0, 0(t0) # Load the test1 into a0
jal ra, my_clz2 # Call my_clz
# test2
la a0, str2 # Load the address of the str2
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test2 # Load address of test2
lw a0, 0(t0) # Load the test2 into a0
jal ra, my_clz2 # Call my_clz
# test3
la a0, str3 # Load the address of the str3
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test3 # Load address of test3
lw a0, 0(t0) # Load the test3 into a0
jal ra, my_clz2 # Call my_clz
# Exit
li a0, 0 # Exit code 0
li a7, 93 # Syscall for exit
ecall # Make the syscall
my_clz2:
beqz a0, return0 # if a0 == 0, jump.
mv t0, a0
srli t1, t0, 1 # x |= (x >> 1);
or t0, t0, t1
srli t1, t0, 2 # x |= (x >> 2);
or t0, t0, t1
srli t1, t0, 4 # x |= (x >> 4);
or t0, t0, t1
srli t1, t0, 8 # x |= (x >> 8);
or t0, t0, t1
srli t1, t0, 16 # x |= (x >> 16);
or t0, t0, t1
li t2, 0x55555555
srli t1, t0, 1 # x -= ((x >> 1) & 0x55555555);
li t3, 0xFFFFFFFF
and t1, t1, t2 # t2: 0x55555555
addi t0, t0, 1 # t0+xor(t1)+1 -> I calculate t0+1 first.
xor t1, t1, t3 # negative t1, 2' complement.(xor 0xFFFFFFFF)
add t0, t0, t1
li t3, 0x33333333
srli t1, t0, 2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
and t2, t0, t3 # t3: 0x33333333
and t1, t1, t3
add t0, t1, t2
li t3, 0x0F0F0F0F
srli t1, t0, 4 # x = ((x >> 4) + x) & 0x0f0f0f0f;
add t0, t1, t0
and t0, t0, t3 # t3: 0x0F0F0F0F
srli t1, t0, 8 # x += (x >> 8);
add t0, t0, t1
srli t1, t0, 16 # x += (x >> 16);
add t0, t0, t1
andi t0, t0, 0x7F # return (32 - (x & 0x7f));
xori t0, t0, 0x1F
addi a0, t0, 1
li a7, 1
ecall
jr x1
return0:
la a0, str4 # Load the address of the str4
li a7, 4 # System call code for printing a string
ecall
li a0, -1 # return -1
jr x1
```
###### output result


#### Performance
According to the performance of both.
Version 1 has branch instruction, which results in a worse cycle count compared to Version 2.
Based on other data, version 2's performance is evidently superior.
| | version1(loop) | version2(without loop) |
|:--------------- | --------------:| ----------------------:|
| Cycles: | 408 | 148 |
| Instrs.retired: | 266 | 116 |
| CPI: | 1.53 | 1.28 |
| IPC: | 0.652 | 0.784 |
| Clock rate: | 152.75Hz | 143.83Hz |
### uint32_t fp16_to_fp32
##### C code version1
```c
#include <stdio.h>
#include <stdint.h>
int16_t my_clz(uint32_t x){
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x7f));
}
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
int main(){
int test1 = 0x4BE0;//15.75: 0100 1011 1110 0000
int test2 = 0xC540;//-5.25: 1100 0101 0100 0000
int test3 = 0x0;
test1 = fp16_to_fp32(test1);
float *result1 = (float*)&test1;
test2 = fp16_to_fp32(test2);
float *result2 = (float*)&test2;
test3 = fp16_to_fp32(test3);
float *result3 = (float*)&test3;
printf("1: %.2f\n" , *result1);//Output: 15.75
printf("2: %.2f\n" , *result2);//Output: -5.25
printf("3: %.2f\n" , *result3);//Output: 0.00
return 0;
}
```
`zero_mask = (int32_t)(nonsign - 1) >> 31` involves a lot of operations and will result in at least four instructions. If I check whether the input is zero at the very beginning, it will generate at most three cycles, which is clearly better than the original, and I can also exit the function earlier.
##### C code version2
```c
#include <stdio.h>
#include <stdint.h>
int16_t my_clz(uint32_t x){
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x7f));
}
static inline uint32_t fp16_to_fp32(uint16_t h) {
if(h==0)return 0;
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
//remove const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask));
}
int main(){
int test1 = 0x4BE0;//15.75: 0100 1011 1110 0000
int test2 = 0xC540;//-5.25: 1100 0101 0100 0000
int test3 = 0x0;
test1 = fp16_to_fp32(test1);
float *result1 = (float*)&test1;
test2 = fp16_to_fp32(test2);
float *result2 = (float*)&test2;
test3 = fp16_to_fp32(test3);
float *result3 = (float*)&test3;
printf("1: %.2f\n" , *result1);//Output: 15.75
printf("2: %.2f\n" , *result2);//Output: -5.25
printf("3: %.2f\n" , *result3);//Output: 0.00
return 0;
}
```
##### RISC-V
```c
.data
test1: .word 0x4BE0 # 15.75: 0100 1011 1110 0000
test2: .word 0xc540 # -5.25: 1100 0101 0100 0000
test3: .word 0 # 0
str1: .string "test1: "
str2: .string "\ntest2: "
str3: .string "\ntest3: "
.text
main:
# test1
la a0, str1 # Load the address of the str1
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test1 # Load address of test1
lw a0, 0(t0) # Load the test1 into a0
jal ra, fp16to32 # Call fp16to32
# test2
la a0, str2 # Load the address of the str2
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test2 # Load address of test2
lw a0, 0(t0) # Load the test2 into a0
jal ra, fp16to32 # Call fp16to32
# test3
la a0, str3 # Load the address of the str3
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test3 # Load address of test3
lw a0, 0(t0) # Load the test3 into a0
jal ra, fp16to32 # Call fp16to32
# Exit
li a0, 0 # Exit code 0
li a7, 93 # Syscall for exit
ecall # Make the syscall
fp16to32:
beqz a0, return0 # if a0 == 0, jump.
sw ra, -4(sp) # a0 is unuseful after `slli t0, t1, 16`
# Thus, I didn't store to stack.
slli t0, a0, 16 # w = (uint32_t) h << 16;
# t0: w
li s0, 0x80000000 # sign = w & UINT32_C(0x80000000);
and s1, t0, s0 # s1: sign
li t1, 0x7FFFFFFF # nonsign = w & UINT32_C(0x7FFFFFFF);
and a0, t0, t1 # a0: nonsign
addi t4, a0, 0 # t4: nonsign
jal ra, my_clz # execute my_clz
li t2, 5
addi t0, a0, 0 # renorm_shift = my_clz(nonsign);
# t0: renorm_shift
blt t0, t2, ZERO # renorm_shift = renorm_shift > 5? renorm_shift - 5 : 0;
addi t0, t0, -5 # t0: renorm_shift
continue:
li t3, 0x04000000 # inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
li t2, 0x7F800000 # INT32_C(0x7F800000);
add t1, t4, t3
srli t1, t1, 8
and t1, t1, t2 # t1: inf_nan_mask
sll t4, t4, t0
srli t4, t4, 3 # (nonsign << renorm_shift >> 3)
addi t0, t0, 0b110010000 # Original: (0x70 - renorm_shift) << 23)
xori t0, t0, 0b111111111 # First, calculate renorm_shift - 0x70, then take the two's complement.
addi t0, t0, 1 # 2'complement of 0x70 is 110010000 in 9 bit.
slli t0, t0, 23
add t0, t0, t4 # add: Both of above
or t0, t0, t1 # | inf_nan_mask
or a0, t0, s1
li a7, 2
ecall
lw ra, -4(sp) # load return address
jr x1
ZERO:
li t0, 0 # t0: renorm_shift = 0;
j continue
my_clz:
mv t0, a0
srli t1, t0, 1 # x |= (x >> 1);
or t0, t0, t1
srli t1, t0, 2 # x |= (x >> 2);
or t0, t0, t1
srli t1, t0, 4 # x |= (x >> 4);
or t0, t0, t1
srli t1, t0, 8 # x |= (x >> 8);
or t0, t0, t1
srli t1, t0, 16 # x |= (x >> 16);
or t0, t0, t1
li t2, 0x55555555
srli t1, t0, 1 # x -= ((x >> 1) & 0x55555555);
li t3, 0xFFFFFFFF
and t1, t1, t2 # t2: 0x55555555
addi t0, t0, 1 # t0+xor(t1)+1 -> I calculate t0+1 first.
xor t1, t1, t3 # negative t1, 2' complement.(xor 0xFFFFFFFF)
add t0, t0, t1
li t3, 0x33333333
srli t1, t0, 2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
and t2, t0, t3 # t3: 0x33333333
and t1, t1, t3
add t0, t1, t2
li t3, 0x0F0F0F0F
srli t1, t0, 4 # x = ((x >> 4) + x) & 0x0f0f0f0f;
add t0, t1, t0
and t0, t0, t3 # t3: 0x0F0F0F0F
srli t1, t0, 8 # x += (x >> 8);
add t0, t0, t1
srli t1, t0, 16 # x += (x >> 16);
add t0, t0, t1
andi t0, t0, 0x7F # return (32 - (x & 0x7f));
xori t0, t0, 0x1F
addi a0, t0, 1
jr x1
return0:
li a7, 2
ecall
jr x1
```
##### partial code
```c
addi t0, t0, 0b110010000 # Original: (0x70 - renorm_shift) << 23)
xori t0, t0, 0b111111111 # First, calculate renorm_shift - 0x70, then take the two's complement.
addi t0, t0, 1 # 2'complement of 0x70 is 110010000 in 9 bit.
slli t0, t0, 23
```
In the above code details, the result will be shifted by 23 bits in the end, so the previous operations only need to focus on the 32-23=`9 bits`. Therefore, I use a 9-bit value with I-type instructions instead of exceeding 12 bits, which would require loading into a register before execution. This approach improves performance.
###### output result


## Minimum One Bit Operations to Make Integers Zero([Leetcode 1611](https://leetcode.com/problems/minimum-one-bit-operations-to-make-integers-zero/description/))
### Description
>Given an integer n, you must transform it into 0 using the following operations any number of times:
>###### 1. Change the rightmost (0th) bit in the binary representation of n.
>###### 2. Change the ith bit in the binary representation of n if the (i-1)th bit is set to 1 and the (i-2)th through 0th bits are set to 0.
>Return the minimum number of operations to transform n into 0.
### Example
>* `example1`
Input: n = 3
Output: 2
Explanation: The binary representation of 3 is "11".
"11" -> "01" with the 2nd operation since the 0th bit is 1.
"01" -> "00" with the 1st operation.
>* `example2`
Input: n = 6
Output: 4
Explanation: The binary representation of 6 is "110".
"110" -> "010" with the 2nd operation since the 1st bit is 1 and 0th through 0th bits are 0.
"010" -> "011" with the 1st operation.
"011" -> "001" with the 2nd operation since the 0th bit is 1.
"001" -> "000" with the 1st operation.
### Solution
#### Idea for problem solving
First, I observed a pattern as described below:
| Input | 0 | 1 | 2 | 4 | 8 |
| ---------- | ---- | ---- | ---- | ---- | ---- |
| binary |`0000`|`0001`|`0010`|`0100`|`1000`|
| - | - | - | - | - | - |
| round1 | |`0000`| 0011 | 0101 | 1001 |
| round2 | | |`0001`| 0111 | 1011 |
| round3 | | |`0000`| 0110 | 1010 |
| round4 | | | |`0010`| 1110 |
| round5 | | | | 0011 | 1111 |
| round6 | | | |`0001`| 1101 |
| round7 | | | |`0000`| 1100 |
| round8 | | | | |`0100`|
| round9 | | | | | 0101 |
| round10 | | | | | 0111 |
| round11 | | | | | 0110 |
| round12 | | | | |`0010`|
| round13 | | | | | 0011 |
| round14 | | | | |`0001`|
| round15 | | | | |`0000`|
| $2^n$ -> $2^{n-1}$ | move | $2^n$ -> 0 | move |
| ------------------ | ------- | ---------- |:-------- |
| 1-> 0 | 1 move | 1-> 0 | 1 move |
| 10-> 01 | 2 moves | 10-> 00 | 3 moves |
| 100-> 010 | 4 moves | 100-> 000 | 7 moves |
| 1000->0100 | 8 moves | 1000->0000 | 15 moves |
:::danger
Use LaTeX annotations when possible. e.g., use $2^n$ instead of `2^n`.
:::
:::success
ChengRay: I get it.
:::
Thus, I can observe that when the input is $2^n$, the result is $2^{n+1}$ - 1.
However, if any bit right of the leftmost bit is 1, then it will take fewer moves.
For instance:
`1100`: Compared to `1000`, `1100` has already eliminated the leftmost `1`, reducing many steps. Therefore, I need to calculate how many steps it takes to go from `1000` to `1100`. I find that the number of steps from `1000` to `1100` is the same as from `0000` to `0100`.
I reached a conclusion: subtract when encountering a 1 on the right.
But, what if the `1` I subtract has another `1` on its right... does it become a process formed by the `1` I subtracted? In other words, does the process decrease again?
A clear example is as follows:
```math
Input n = 15, binary is 1111.
f(1111)
=f(1000)-f(0111)
=f(1000)-(f(0100)-f(0011))
=f(1000)-(f(0100)-(f(0010)-f(0001)))
=f(1000)-f(0100)+f(0010)-f(0001)
```
Thus, I derived a logic: "As long as the input encounter a bit `1`, the operation is a cycle of one addition(+) and one subtraction(-)."
### C Programming
##### First version
```c
int minimumOneBitOperations(int n){
if(n == 0)return 0;
int bits = (int)log2(n)+1;
int flop = -1;
int moves = pow(2, bits)-1;
for(int i=bits-2 ; i>=0 ; i--){
if(n>>i & 0b00000001){
moves += flop * (pow(2, i + 1) - 1);
flop *= -1;
}
}
return moves;
}
```
Considering the conversion to RISC-V, I want to replace the function that use the [math.h](https://www.runoob.com/cprogramming/c-standard-library-math-h.html) library.
#### pow()
During the modification process, I realized that in `pow(2, bits)`, the base 2 is **fixed**, and since I've learned about ***left shift***, I directly replaced the original `pow(2, bits)` function with `1 << bits`.
#### log2()
The `log2()` purpose is to get the length of the binary.
Write a function `BinaryLen()` as shown below.
##### log() version1
```c
int BinaryLen(int n){
int ans=0;
while(n > 0){
ans++;
n = n>>1;
}
return ans;
}
```
One loop will generate many `nop` instructions after the branch.
The previously mentioned `count_leading_zero` function calculates how many leading zeros there are in a binary number. What I need to compute here is the total length of the binary number. Since the result of this function is obtained by subtracting the calculated value from `32`, I can simply use this result directly, which will give me the length of the binary number.
The previous comparison results indicate that the version **without loops** has fewer cycles and provides better performance.
##### log() version2
```c
uint16_t BinaryLen(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (x & 0x7f); //move 32 from my_clz
}
```
After optimizing `log()` and `pow()`, the final result of the code is as follows:
```c
uint16_t BinaryLen(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (x & 0x7f);
}
int minimumOneBitOperations(int n){
if(n == 0)return 0;
int bits = BinaryLen(n);
int flop = -1;
int moves = (1<<(bits))-1;
for(int i=bits-2 ; i>=0 ; i--){
if(n>>i & 0b00000001){
moves += flop * ((1<<(i + 1)) - 1);
flop *= -1;
}
}
return moves;
}
```
##### Full C code
```c
#include <stdio.h>
#include <stdint.h>
uint16_t BinaryLen(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (x & 0x7f);
}
int minimumOneBitOperations(int n){
if(n == 0)return 0;
int bits = BinaryLen(n);
int flop = -1;
int moves = (1<<(bits))-1;
for(int i=bits-2 ; i>=0 ; i--){
if(n>>i & 0b00000001){
moves += flop * ((1<<(i + 1)) - 1);
flop *= -1;
}
}
return moves;
}
int main(){
int test1 = 0;
int test2 = 64;
int test3 = 15;
printf("test1 moves %d\n" , minimumOneBitOperations(test1));
//output 0
printf("test2 moves %d\n" , minimumOneBitOperations(test2));
//output 127
printf("test3 moves %d" , minimumOneBitOperations(test3));
//output 10
return 0;
}
```
##### RISC-V
```c
.data
test1: .word 16
test2: .word 33
test3: .word 0
str1: .string "test1: "
str2: .string "\ntest2: "
str3: .string "\ntest3: "
.text
main:
# test1
la a0, str1 # Load the address of the str1
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test1 # Load address of test1
lw a0, 0(t0) # Load the test1 into a0
jal ra, MOBO # Call MOBO
# test2
la a0, str2 # Load the address of the str2
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test2 # Load address of test2
lw a0, 0(t0) # Load the test2 into a0
jal ra, MOBO # Call MOBO
# test3
la a0, str3 # Load the address of the str3
li a7, 4 # System call code for printing a string
ecall # Print the string
la t0, test3 # Load address of test3
lw a0, 0(t0) # Load the test3 into a0
jal ra, MOBO # Call MOBO
# Exit
li a0, 0 # Exit code 0
li a7, 93 # Syscall for exit
ecall # Make the syscall
MOBO:
beqz a0, return0 # if a0 == 0, jump.
addi sp, sp, -8 # MOBO can call BinaryLen
sw ra, 0(sp)
sw a0, 4(sp)
jal ra, BinaryLen # a0 is already set, so directly use jal.
addi t5, a0, 0 # set return value to t5. #bits = BinaryLen(n);
li t4, 1
lw a0, 4(sp) # Restore the original parameters.
li t1, 1 # int flop = -1;
# Here, I set 1 to represent a negative number.
# set 0 to represent a positive number.
sll t2, t4, t5 # int moves = (1<<(bits))-1;
addi t2, t2, -1
addi t3, t5, -1 # i=bits-1
li t5, 0xFFFFFFFF
loop:
addi t3, t3, -1
bltz t3, return
srl t0, a0, t3 # store n >> i in t0
andi t6, t0, 1
beqz t6, loop
addi t0, t3, 1 # (i + 1)
sll t0, t4, t0 # (1<<(i + 1))
addi t0, t0, -1 # (1<<(i + 1)) - 1
beqz t1, addmove # if flop is Positive, jump addmove
xor t0, t0, t5 # negative number
addi t0, t0, 1
addmove:
add t2, t2, t0 # moves + flop * ((1<<(i + 1)) - 1);
xori t1, t1, 1
j loop
return:
lw ra, 0(sp)
addi sp, sp, 8
addi a0, t2, 0 # return moves;
li a7, 1
ecall
jr x1
return0:
li a0, 0 # input = 0, return 0
li a7, 1
ecall
jr x1
BinaryLen:
mv t0, a0
srli t1, t0, 1 # x |= (x >> 1);
or t0, t0, t1
srli t1, t0, 2 # x |= (x >> 2);
or t0, t0, t1
srli t1, t0, 4 # x |= (x >> 4);
or t0, t0, t1
srli t1, t0, 8 # x |= (x >> 8);
or t0, t0, t1
srli t1, t0, 16 # x |= (x >> 16);
or t0, t0, t1
li t2, 0x55555555
srli t1, t0, 1 # x -= ((x >> 1) & 0x55555555);
li t3, 0xFFFFFFFF
and t1, t1, t2 # t2: 0x55555555
addi t0, t0, 1 # t0+xor(t1)+1 -> I calculate t0+1 first.
xor t1, t1, t3 # negative t1, 2' complement.(xor 0xFFFFFFFF)
add t0, t0, t1
li t3, 0x33333333
srli t1, t0, 2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
and t2, t0, t3 # t3: 0x33333333
and t1, t1, t3
add t0, t1, t2
li t3, 0x0F0F0F0F
srli t1, t0, 4 # x = ((x >> 4) + x) & 0x0f0f0f0f;
add t0, t1, t0
and t0, t0, t3 # t3: 0x0F0F0F0F
srli t1, t0, 8 # x += (x >> 8);
add t0, t0, t1
srli t1, t0, 16 # x += (x >> 16);
add t0, t0, t1
andi a0, t0, 0x7F # return (32 - (x & 0x7f));
jr x1
```
###### output result


### Analysis
#### (1) Five stage
> Description: The 5-stage pipeline is a common architectural design used in modern processors, including RISC-V, to **improve instruction throughput and overall performance.** The concept of pipelining allows multiple instructions to be in different stages of execution at the same time, thereby maximizing the utilization of the CPU's resources. Each stage of the pipeline performs a specific function, and together they work in a streamlined manner.
##### Instruction Fetch (IF)
In this stage, the processor fetches the instruction from memory based on the current Program Counter (PC). The PC is then updated to point to the next instruction. This stage is crucial for retrieving the instructions that will be executed in subsequent stages.
* Show `lw a0, 0(t0)` for example, the instruction is converted to `lw x10 0 x5`. PC is `0x00000018`. Next PC is `0x0000001C`. Fetch the instruction from instruction memory using the `PC:0x00000018`. `0x0002a503` is represents the machine code.

##### Instruction Decode (ID)
During the ID stage, the fetched instruction is decoded to determine its operation and the operands required. The control signals are generated, and the source registers are read. This stage ensures that the necessary data is available for execution in the next stage.
* Show `mv t0, a0` for example, it's coverted to `addi x5 x10 0`. `Reg 1` is `0x00000010`;`Reg 2` is `0x00000000`, values are delivered to the next stage.

##### Execute (EX)
In the EX stage, the actual operation specified by the instruction is performed. This can involve arithmetic or logical operations, address calculations for memory access, or branch target calculations. The results of this operation are prepared to be written back to the registers or memory.
* To continue the result of the previous instruction. The **selector A** has three input, they are respectively from the `Reg1` of `ID` stage and from the forwarding technique, which comes from EX/MEM and MEM/WB. The **selector B** is the same as selector A. The **selector C** has two input, they are respectively from the output of selector A and from the `PC`, which is needed to calculate the next instruction. The **selector D** has two input, they are respectively from the output of selector B and immediate value of `ID` stage.

##### Memory Access (MEM)
If the instruction involves memory operations (such as load or store), this stage accesses the memory. For load instructions, data is retrieved from memory and prepared for writing back to the register file. For store instructions, data from the register is written to the specified memory location.
* Data memory have two input, address and data value. This is for use with the `lw` and `sw` instructions. The `lw` instruction requires an address to read the value(read out). The `sw` instruction requires a value and an address to store the value at that address.

##### Write Back (WB)
The final stage is where the results of the executed instruction are written back to the register file. This ensures that the updated values are available for subsequent instructions. The WB stage completes the instruction cycle, allowing the processor to continue with the next instruction.
* In WB stage, there have one selector with three input. they are respectively PC+4, output of ALU and readout of data memory.The light in the middle is on, indicating that the value is from output of ALU.

---
#### (2) Challenge
##### Data Hazards
Data hazards occur when an instruction depends on the result of a previous instruction. This can cause pipeline stalls because subsequent instructions must wait for the required data to become available. Data hazards can be classified into three types:
1. **RAW** (Read After Write): Occurs when an instruction needs to read data that has not yet been written by a previous instruction.
* In my code, there are lots of RAW (Read After Write) hazards. Ripes provides forwarding technology to prevent stalls from occurring.
2. **WAR** (Write After Read): Occurs when an instruction's write operation happens after another instruction's read operation, potentially causing errors.
* This hazard does not occur in the pipeline but happens in scenarios with `early write and late read` or `out-of-order execution`. Therefore, it won't occur in the 5-stage pipeline.
3. **WAW** (Write After Write): Occurs when two instructions attempt to write to the same register.
* Solution: renaming register.
##### Control Hazards
Control hazards arise from branch instructions or jumps. When there is a conditional branch instruction in the pipeline, it may be unclear which instruction should be executed next, leading to incorrect instruction fetching. This may necessitate stalls or branch prediction to **minimize** the impact.
* The following is the conditional branch that appears in the code.
* **`beqz a0, return0`**: The input test data contains at most one *zero*, and the rest are *non-zero*. Therefore, using `beqz a0, return0` will cause at most one jump, which only happens when the input is zero.
* **`bltz t3, return`**: Most of the time, when executing this instruction, it will not jump; it will only jump in the last iteration of the loop.
* **`beqz t6, loop`**: This conditional branch cannot be determined; whether to jump is based on the input, so it's impossible to predict the probability of the branch being taken.
* **`beqz t1, addmove`**: Since the values added by ***move*** will be one positive and one negative, the probability of this conditional branch jumping is 50%.
##### Structural Hazards
Structural hazards occur when hardware resources are insufficient to support multiple operations simultaneously. For example, if the processor cannot access both instruction memory and data memory at the same time, structural hazards may arise, requiring certain instructions to be queued or stalled.
* This involves whether the memory is separated or unified. Based on executing one instruction at a time, I did not observe any stalls caused by structural hazards.
##### Branch Prediction Misses
If the branch prediction mechanism fails to accurately predict the outcome of branch instructions, it can lead to flushing the pipeline and reloading, which adds additional delays and performance loss.
* Using static prediction(always not taken). I designed code where branches have a lower probability of being taken.
* If I use dynamic branch prediction, the number of `nop` instructions will be about the same, so I will choose static because dynamic prediction requires additional hardware support.
##### Ecall
`ecall` is a RISC-V instruction used to trigger a system call. Place the call number in register `a7`, then execute `ecall`, transferring control to the operating system, which executes the requested service and then returns to the original program.
* In order for the instruction to operate correctly, the two preceding instructions must complete the `MEM` and `WB` stages before the ecall executes in the `EX` stage, which results in a delay of two cycles compared to normal operation.
* It's inevitable that there are two stalls before `ecall` execution.
## Reference
IEEE 754 graph: https://en.wikipedia.org/wiki/IEEE_754
arch2024_Quiz1: https://hackmd.io/@sysprog/arch2024-quiz1-sol
arch2023_Quiz1: https://hackmd.io/@sysprog/arch2023-quiz1-sol
Leetcode1611: https://leetcode.com/problems/minimum-one-bit-operations-to-make-integers-zero/description/
Libary math.h: https://www.runoob.com/cprogramming/c-standard-library-math-h.html