# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <`hsuhsiang`> (Hsu, Han-Hsiang)
:::danger
Check the title, making it consistent with the sample pages.
:::
## Quiz1 Probem C
### C code
:::danger
Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data.
:::
```c=
static inline float fabsf(float x)
{
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
```
```c=
static inline int my_clz(uint32_t x)
{
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
```c=
static inline uint32_t fp16_to_fp32(uint16_t h) {
/*
* Extends the 16-bit half-precision floating-point number to 32 bits
* by shifting it to the upper half of a 32-bit word:
* +---+-----+------------+-------------------+
* | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 31 26-30 16-25 0-15
*
* S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits.
*/
const uint32_t w = (uint32_t) h << 16;
/*
* Isolates the sign bit from the input number, placing it in the most
* significant bit of a 32-bit word:
*
* +---+----------------------------------+
* | S |0000000 00000000 00000000 00000000|
* +---+----------------------------------+
* Bits 31 0-31
*/
const uint32_t sign = w & UINT32_C(0x80000000);
/*
* Extracts the mantissa and exponent from the input number, placing
* them in bits 0-30 of the 32-bit word:
*
* +---+-----+------------+-------------------+
* | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 30 27-31 17-26 0-16
*/
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
/*
* The renorm_shift variable indicates how many bits the mantissa
* needs to be shifted to normalize the half-precision number.
* For normalized numbers, renorm_shift will be 0. For denormalized
* numbers, renorm_shift will be greater than 0. Shifting a
* denormalized number will move the mantissa into the exponent,
* normalizing it.
*/
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
/*
* If the half-precision number has an exponent of 15, adding a
* specific value will cause overflow into bit 31, which converts
* the upper 9 bits into ones. Thus:
* inf_nan_mask ==
* 0x7F800000 if the half-precision number is
* NaN or infinity (exponent of 15)
* 0x00000000 otherwise
*/
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
/*
* If nonsign equals 0, subtracting 1 will cause overflow, setting
* bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result
* propagates bit 31 across all bits in zero_mask. Thus:
* zero_mask ==
* 0xFFFFFFFF if the half-precision number is
* zero (+0.0h or -0.0h)
* 0x00000000 otherwise
*/
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
/*
* 1. Shifts nonsign left by renorm_shift to normalize it (for denormal
* inputs).
* 2. Shifts nonsign right by 3, adjusting the exponent to fit in the
* 8-bit exponent field and moving the mantissa into the correct
* position within the 23-bit mantissa field of the single-precision
* format.
* 3. Adds 0x70 to the exponent to account for the difference in bias
* between half-precision and single-precision.
* 4. Subtracts renorm_shift from the exponent to account for any
* renormalization that occurred.
* 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input
* was NaN or infinity.
* 6. ANDs with the inverted zero_mask to set the mantissa and exponent
* to zero if the input was zero.
* 7. Combines everything with the sign bit of the input number.
*/
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
}
```
Implement the fp16_to_fp32 function, which converts a 16-bit floating-point number in IEEE half-precision format (bit representation) to a 32-bit floating-point number in IEEE single-precision format (bit representation). This implementation will avoid using any floating-point arithmetic and will utilize the clz function discussed earlier.
### Assembly code
```c=
.data
num_test: .word 10
test: .word 0x1234,0x5678,0x4876,0xF123,0xFC00,0x0400,0x3555,0x0022,0x0001,0x00FF
answer: .word 0x3A468000,0x42cf0000,0x410ec000,0xc6246000,0xFF800000,0x38800000,0x3eaaa000,0x36080000,0x33800000,0x377f0000
statement1: .string "All Test Pass"
statement2: .string "Wrong, Check Again"
.text
main:
la t0, num_test
lw a6, 0(t0) # a0 = num_test
la s1, test
la t3, answer
fp16tofp32:
lw t1, 0(s1) # t1 = number
lw a2, 0(t3) # a2 = answer
slli t0, t1, 16
li t6, 0x80000000
and t4, t0, t6
li t6, 0x7FFFFFFF
and t5, t0, t6
########### call function clz #############
addi sp, sp, -16
sw a1, 12(sp)
sw t2, 8(sp)
sw t1, 4(sp)
sw t0, 0(sp)
mv a0, t5 # a0 = num
li a1, 0x55af
jal ra, clz # a0 = leading zero
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
lw a1, 12(sp)
##########################################
sltiu t6, a0, 6
bne t6, x0, lessthan5
addi a0, a0, -5
j inf_nan_mask
lessthan5:
li a0, 0
inf_nan_mask:
li t6, 0x04000000
add t6, t6, t5
srai t6, t6, 8
li a1, 0x7F800000
and a1, t6, a1
addi a5, t5, -1
srai a5, a5, 31
sub a5, x0, a5
sll a3, t5, a0
srli a3, a3, 3
li a4, 0x70
sub a4, a4, a0
slli a4, a4, 23
add a3, a3, a4
or a3, a3, a1
or a3, a3, a5
or t2, a3, t4
next:
bne t2, a2, wrong
addi t3, t3, 4
addi s1, s1, 4
addi a6, a6, -1
bne a6, x0, fp16tofp32
return:
la a0, statement1
addi a7, zero, 4
ecall
j fin
wrong:
la a0, statement2
addi a7, zero, 4
ecall
fin:
li a7, 10
ecall
##########################################
###############Loop through###############
clz:
li t2, 0
li s3, 31
li t1, 1
loop:
sll t0, t1, s3
and t0, a0, t0
bne t0, x0, stop
addi t2, t2, 1
addi s3, s3, -1
bge s3, x0, loop
stop:
mv a0, t2
ret
##########################################
```
## Count leading zero
Count the leading zeros starting from the most significant bit (MSB). It returns an integer representing the number of zero bits that precede the first '1' in the binary representation of the number.
:::info
In C, according to the GCC manual, the behavior of __builtin_clz(x) is undefined when x = 0, meaning the output is unpredictable. However, in my implementation of the count leading zero function, when the input is x = 0, the output is explicitly defined to be 32.
:::
1. **Example 1**: For x = 16
* Binary representation: `00000000 00000000 00000000 00010000`
* The number of leading zero bits is `27`(as there are `27` zeros before the first `1`).
* Output: `clz(16)` returns `27`.
1. **Example 2**: For x = -1
* Binary representation: `11111111 11111111 11111111 111111111`
* The number of leading zero bits is `0`(as there are `0` zeros before the first `1`).
* Output: `clz(-1)` returns `0`.
1. **Example 3**: For x = 0
* Binary representation: `00000000 00000000 00000000 00000000`
* The number of leading zero bits is `32`(as x is zero).
* Output: `clz(0)` returns `32`.
### C code
from quiz1 problem C, consider the C implementation of `my_clz`:
#### first version (Naive loop through bits)
```c=
int clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
This approach works by scanning each bit individually from left to right, stopping at the first occurrence of 1. Although this method is simple and easy to understand, it can be inefficient for certain inputs, especially when there are very few or no leading zeros, as the loop will have to examine all 32 bits in the worst case.
:::danger
Show the full C source code corresponding to the original problem set.
:::
#### second version (Conditional checks)
```c=
int clz(uint32_t x) {
if (x == 0) return 32;
int count = 0;
if (x <= 0x0000FFFF) {
count += 16;
x <<= 16;
}
if (x <= 0x00FFFFFF) {
count += 8;
x <<= 8;
}
if (x <= 0x0FFFFFFF) {
count += 4;
x <<= 4;
}
if (x <= 0x3FFFFFFF) {
count += 2;
x <<= 2;
}
if (x <= 0x7FFFFFFF) {
count += 1;
x <<= 1;
}
return count;
}
```
This approach uses conditional checks to progressively determine the number of leading zeros in the input number. It works by shifting the bits and updating the count in chunks—16, 8, 4, and so on—until the total number of leading zeros is found.
#### third version (Bitwise operations without branches)
```c=
int clz(uint32_t x) {
int count = 0, temp;
temp = (x < 0x00010000) << 4;
count += temp;
x <<= temp;
temp = (x < 0x01000000) << 3;
count += temp;
x <<= temp;
temp = (x < 0x10000000) << 2;
count += temp;
x <<= temp;
temp = (x < 0x40000000) << 1;
count += temp;
x <<= temp;
temp = (x < 0x80000000);
count += temp;
x <<= temp;
count += (x == 0);
return count;
}
```
This approach implements the same algorithm in second version but without using any branch instructions. Instead of conditional checks, it relies on bitwise operations to progressively determine the number of leading zeros.
#### fourth version (Lookup table optimization)
```c=
int clz(uint32_t x) {
int count = 0, temp;
temp = (x < 0x00010000) << 4;
count += temp;
x <<= temp;
temp = (x < 0x01000000) << 3;
count += temp;
x <<= temp;
temp = (x < 0x10000000) << 2;
count += temp;
x <<= temp;
temp = (x >> 27) & 0x1e;
count += (0x55af >> temp) & 3;
count += (x == 0);
return count;
}
```
For the final two steps of the third version, we adopt a lookup table approach to efficiently determine the number of leading zeros in the upper 4 bits of x.
### Assembly code
:::danger
Use fewer instructions when possible.
:::
#### first version (Naive loop through bits)
```c=
.data
num_test: .word 12
test: .word -1,0,1,2147483647,-2147483648,48763,54321,123456789,16,256,65536,65535
answer: .word 0,32,31,1,0,16,16,5,27,23,15,16
statement1: .string "All Test Pass"
statement2: .string "Wrong, Check Again"
.text
main:
la t0, num_test
lw a0, 0(t0) # a0 = num_test
la t1, test
la t6, answer
count:
lw a1, 0(t1) # a1 = number
lw a2, 0(t6) # a2 = answer
li t2, 0
li t3, 31
li t4, 1
loop:
sll t5, t4, t3
and t5, a1, t5
bne t5, x0, next
addi t2, t2, 1
addi t3, t3, -1
bge t3, x0, loop
next:
bne t2, a2, wrong
addi t6, t6, 4
addi t1, t1, 4
addi a0, a0, -1
bne a0, x0, count
return:
la a0, statement1
addi a7, zero, 4
ecall
j fin
wrong:
la a0, statement2
addi a7, zero, 4
ecall
fin:
li a7, 10
ecall
```
#### second version (Conditional checks)
```c=
.data
num_test: .word 12
constant: .word 0x0000FFFF,0x00FFFFFF,0x0FFFFFFF,0x3FFFFFFF,0x7FFFFFFF
test: .word -1,0,1,2147483647,-2147483648,48763,54321,123456789,16,256,65536,65535
answer: .word 0,32,31,1,0,16,16,5,27,23,15,16
statement1: .string "All Test Pass"
statement2: .string "Wrong, Check Again"
.text
main:
la t0, num_test
lw a0, 0(t0) # a0 = num_test
la t1, test
la t4, answer
count:
lw a1, 0(t1) # a1 = number
lw a2, 0(t4) # a2 = answer
li t2, 0 # t2 = count
li t6, 32
la t5, constant
bne a1, x0, loop
li t2, 32
j next
loop:
lw t3, 0(t5)
srli t6, t6, 1
beq t6, x0, next
addi t5, t5, 4
bltu t3, a1, loop
add t2, t2, t6
sll a1, a1, t6
j loop
next:
bne t2, a2, wrong
addi t4, t4, 4
addi t1, t1, 4
addi a0, a0, -1
bne a0, x0, count
return:
la a0, statement1
addi a7, zero, 4
ecall
j fin
wrong:
la a0, statement2
addi a7, zero, 4
ecall
fin:
li a7, 10
ecall
```
#### third version (Bitwise operations without branches)
```c=
.data
num_test: .word 12
constant: .word 0x00010000,0x01000000,0x10000000,0x40000000,0x80000000
test: .word -1,0,1,2147483647,-2147483648,48763,54321,123456789,16,256,65536,65535
answer: .word 0,32,31,1,0,16,16,5,27,23,15,16
statement1: .string "All Test Pass"
statement2: .string "Wrong, Check Again"
.text
main:
la t0, num_test
lw a0, 0(t0) # a0 = num_test
la t1, test
la t3, answer
count:
la t5, constant
lw a1, 0(t1) # a1 = number
lw a2, 0(t3) # a2 = answer
li t2, 0 # t2 = count
li t6, 4
loop:
lw a3, 0(t5)
sltu t4, a1, a3
sll t4, t4, t6
add t2, t2, t4
sll a1, a1, t4
addi t5, t5, 4
addi t6, t6, -1
bge t6, x0, loop
sltiu t4, a1, 1
add t2, t2, t4
next:
bne t2, a2, wrong
addi t3, t3, 4
addi t1, t1, 4
addi a0, a0, -1
bne a0, x0, count
return:
la a0, statement1
addi a7, zero, 4
ecall
j fin
wrong:
la a0, statement2
addi a7, zero, 4
ecall
fin:
li a7, 10
ecall
```
#### fourth version (Lookup table optimization and loop unrolling)
```c=
.data
num_test: .word 12
test: .word -1,0,1,2147483647,-2147483648,48763,54321,123456789,16,256,65536,65535
answer: .word 0,32,31,1,0,16,16,5,27,23,15,16
statement1: .string "All Test Pass"
statement2: .string "Wrong, Check Again"
.text
main:
la t0, num_test
lw a0, 0(t0) # a0 = num_test
la t1, test
la t3, answer
li s0, 0x00010000
li s1, 0x01000000
li s2, 0x10000000
li s3, 0x55af
count:
lw a1, 0(t1) # a1 = number
lw a2, 0(t3) # a2 = answer
li t2, 0 # t2 = count
sltu t4, a1, s0
slli t4, t4, 4
add t2, t2, t4
sll a1, a1, t4
sltu t4, a1, s1
slli t4, t4, 3
add t2, t2, t4
sll a1, a1, t4
sltu t4, a1, s2
slli t4, t4, 2
add t2, t2, t4
sll a1, a1, t4
srli t4, a1, 27
andi t4, t4, 0x1e
srl t4, s3, t4
andi t4, t4, 3
add t2, t2, t4
sltiu t4, a1, 1
add t2, t2, t4
next:
bne t2, a2, wrong
addi t3, t3, 4
addi t1, t1, 4
addi a0, a0, -1
bne a0, x0, count
return:
la a0, statement1
addi a7, zero, 4
ecall
j fin
wrong:
la a0, statement2
addi a7, zero, 4
ecall
fin:
li a7, 10
ecall
```
:::danger
You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.
:::
### Comparison
Comparison of the above four clz function:
| | 1st | 2nd | 3rd | 4th (best)|
| --- | --- | --- | --- | --- |
|cycles|1675|725|838|375|
|Insts |1259|537|650|343|
|IPC |0.752|0.741|0.776|0.915|
|CPI |1.33|1.35|1.29|1.09|
From the table, we can find the fourth one has the least execution cycles.
For Quiz1 Problem C, we can replace original loop through clz with branchless and optimized one:
```c=
#########branchless and optimized########
clz:
li t0, 0
li t1, 0x00010000
sltu t2, a0, t1
slli t2, t2, 4
add t0, t0, t2
sll a0, a0, t2
li t1, 0x01000000
sltu t2, a0, t1
slli t2, t2, 3
add t0, t0, t2
sll a0, a0, t2
li t1, 0x10000000
sltu t2, a0, t1
slli t2, t2, 2
add t0, t0, t2
sll a0, a0, t2
srli t1, a0, 27
andi t1, t1, 0x1e
srl t1, a1, t1
andi t1, t1, 3
add t0, t0, t1
sltiu t1, a0, 1
add t0, t0, t1
mv a0, t0
ret
##########################################
```
Comparison of the Original and Revised clz Functions in Quiz1 Problem C:
| | loop through | branchless and optimized |
| --- | --- | --- |
|cycles|1014|815|
|Insts |836|727|
|IPC |0.806|0.892|
|CPI |1.24|1.43|
Based on the data in the table, we can observe that the revised implementation reduces execution cycles by approximately 20%.
## Int to Float conversion
### Problem
The goal of this problem is to convert a 32-bit signed integer into a single-precision floating-point number, following the IEEE 754 standard for floating-point representation. This involves translating the integer into its equivalent floating-point bit representation, which consists of three components: the sign bit, the exponent, and the mantissa.
1. **Example**: For $x = -1 = -1\times 1\times2^0$
* |Binary representation|: `00000000 00000000 00000000 00000001`
* The sign bit is `1`
* The exponent is `127 + 0 = 127`
* The mantissa is `00000000 00000000 0000000`
* The round bit is `0`
* So output will be `1 01111111 00000000000000000000000 = 0xBF800000`
2. **Example**: For $x = 48763 = -1\times 1.48813\times2^{15}$
* |Binary representation|: `00000000 00000000 10111110 01111011`
* The sign bit is `0`
* The exponent is `127 + 15 = 142`
* The mantissa is `01111100 11110110 0000000`
* The round bit is `0`
* So output will be `0 10001110 01111100111101100000000= 0x473E7B00`
3. **Example**: For $x = 134217736 = 1\times (1+2^{-24})\times2^{27}$
* |Binary representation|: `00001000 00000000 00000000 00001000`
* The sign bit is `0`
* The exponent is `127 + 27 = 154`
* The mantissa is `00000000 00000000 0000000`
* The round bit is `1`
* But the last bit of mantissa is `0`
* For round-to-even, round bit is modified to be `0`
* So output will be `0 10011010 00000000000000000000000 = 0x4D000000`
4. **Example**: For $x = 134217752 = 1\times (1+3 \times 2^{-24})\times2^{27}$
* |Binary representation|: `00001000 00000000 00000000 00011000`
* The sign bit is `0`
* The exponent is `127 + 27 = 154`
* The mantissa is `00000000 00000000 0000001`
* The round bit is `1`
* For round-to-even, round bit is still to be `1`
* So output will be `0 10011010 00000000000000000000010 = 0x4D000002`
5. **Example**: For $x = 0$
* For `x = 0`, the output is defined as `0x00000000`
* So output will be `0 00000000 00000000000000000000000 = 0x00000000`
reference:
https://onestepcode.com/int-to-float-c/
https://evanw.github.io/float-toy/
### C code
```c=
uint32_t IntToFloat(int num)
{
int leading_zero, exponent, mantissa, sign = 0, abs_num = num;
int shift, round_bit, temp, last_bit;
sign = num >> 31;
// If num is 0, the function directly returns 0
if (num == 0)
{
return 0;
}
// The absolute value of num is taken to proceed with the conversion.
if (num < 0)
{
abs_num = -num;
}
// calculates the number of leading zeros
leading_zero = clz(abs_num);
shift = 31 - leading_zero;
// The exponent in IEEE 754 format is stored with a bias of 127
// So exponent = 127 + (31 - leading_zero);
exponent = 127 + shift;
mantissa = abs_num ^ (1 << shift);
if (shift > 23)
{
// Round to closest
round_bit = (mantissa >> (shift - 24)) & 0x00000001;
// Round to even modification
last_bit = (mantissa >> (shift - 23)) & 0x00000001;
temp = (1 << (shift - 24)) - 1;
temp = mantissa & temp;
// if the round part is XXX.5
// in C, .5 will be round to closest even
// if last_bit == 1, then round_bit = 1
// if last_bit == 0, then round_bit = 0
if (temp == 0 && round_bit == 1)
{
round_bit = last_bit;
}
mantissa = mantissa >> shift - 23;
mantissa = mantissa + round_bit;
}
else
{
mantissa = mantissa << 23 - shift;
}
// The function assembles the single-precision floating-point number by packing:
// The sign bit into the most significant bit 31,
// The 8-bit exponent into bits 30-23,
// The 23-bit mantissa into bits 22-0.
uint32_t value = ((sign << 31) | (exponent << 23)) + mantissa;
//internal validation
assert(value == fp32_to_bits((float)(num)));
return value;
}
```
Convert Int type to single-precision floating-point bit representation.
```c=
uint32_t fp32_to_bits(float f)
{
union
{
float as_value;
uint32_t as_bits;
} fp32 = {.as_value = f};
return fp32.as_bits;
}
```
Then convert float type to bit representation
```c=
int main()
{
srand(time(NULL));
int num, y, corner_case[7] = {0, -1, 1, -2147483648, 2147483647, 0x08000008,0x08000018 };
for (int i = 0; i < 7; i++)
{
y = IntToFloat(corner_case[i]);
assert(y == fp32_to_bits((float)(corner_case[i])));
}
for (int i = 0; i < 10000; i++)
{
num = ((rand() << 17) | rand());
y = IntToFloat(num);
assert(y == fp32_to_bits((float)(num)));
}
printf("All test pass!!\n");
return 0;
}
```
For the testing program, I started by testing 7 corner cases:
* -1
* 0
* 1
* 2147483647 (INT_MAX)
* -2147483648 (INT_MIN)
* 0x08000008 (to test round-to-even behavior)
* 0x08000018 (to test round-to-even behavior)
Afterward, I generated 10,000 random values to further test the program. I used assert statements to verify that the function's output matched the result of the built-in C int-to-float conversion function.
:::danger
You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.
:::
:::success
modified in the function
:::
### Assembly code
```c=
.data
num_test: .word 13
test: .word -1,0,1,2147483647,-2147483648,0x08000008,0x08000018,48763,-48763,123456789,537530813,278410568,1509961956
answer: .word 0xbf800000,0,0x3f800000,0x4f000000,0xcf000000,0x4d000000,0x4d000002,0x473e7b00,0xc73e7b00,0x4ceb79a3,0x4e002847,0x4d84c1aa,0x4eb40062
statement1: .string "All Test Pass"
statement2: .string "Wrong, Check Again"
.text
main:
la s0, num_test
lw t0, 0(s0) # t0 = num_test
la s1, test
la t2, answer
loop:
lw t1, 0(s1) # t1 = num
lw a2, 0(t2) # a2 = answer
beq t1, x0, next
sign:
srli t3, t1, 31
bgt t1, x0, exponent
sub t1, x0, t1
exponent:
########### call function clz #############
addi sp, sp, -12
sw t2, 8(sp)
sw t1, 4(sp)
sw t0, 0(sp)
mv a0, t1 # a0 = num
li a1, 0x55af
jal ra, clz # a0 = leading zero
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
##########################################
sub a0, x0, a0
addi s5, a0, 31
addi t4, s5, 127
slt s6, s5, x0
mantissa:
li s8, 1
sll s7, s8, s5
xor t5, t1, s7
round:
li s7, 23
ble s5, s7, noround
addi s9, s5, -24
srl s7, t5, s9
andi t6, s7, 1
addi s7, s5, -23
srl s7, t5, s7
andi s6, s7, 1
sll s8, s8, s9
addi s8, s8, -1
and s8, s8, t5
bne s8, x0, noroundeven
beq t6, x0, noroundeven
mv t6, s6
noroundeven:
addi s9, s5, -23
srl t5, t5, s9
add t5, t5, t6
j merge
noround:
sub s5, x0, s5
addi s7, s5, 23
sll t5, t5, s7
merge:
slli t3, t3, 31
slli t4, t4, 23
or a0, t3, t4
add t1, a0, t5
next:
bne t1, a2, wrong
addi t2, t2, 4
addi s1, s1, 4
addi t0, t0, -1
bne t0, zero, loop
return:
la a0, statement1
addi a7, zero, 4
ecall
j fin
wrong:
la a0, statement2
addi a7, zero, 4
ecall
fin:
li a7, 10
ecall
##########################################
clz:
li t0, 0
li t1, 0x00010000
sltu t2, a0, t1
slli t2, t2, 4
add t0, t0, t2
sll a0, a0, t2
li t1, 0x01000000
sltu t2, a0, t1
slli t2, t2, 3
add t0, t0, t2
sll a0, a0, t2
li t1, 0x10000000
sltu t2, a0, t1
slli t2, t2, 2
add t0, t0, t2
sll a0, a0, t2
srli t1, a0, 27
andi t1, t1, 0x1e
srl t1, a1, t1
andi t1, t1, 3
add t0, t0, t1
sltiu t1, a0, 1
add t0, t0, t1
mv a0, t0
ret
##########################################
```
:::danger
Use fewer instructions.
:::
## pipeline explanation
A 5-stage pipeline CPU is a common architecture used to improve instruction throughput by dividing the execution of an instruction into five stages. Each stage handles a different part of instruction processing, allowing multiple instructions to be processed simultaneously at different stages.
#### Take add x7 x7 x29 for example:
#### Instruction Fetch (IF):
The CPU retrieves the next instruction from memory, based on the Program Counter (PC). The PC is then updated for the next cycle.

In this case, the current PC is 60, and the instruction is fetched from the instruction memory address 60 where the preloaded instruction is stored.
#### Instruction Decode (ID):
The fetched instruction is decoded to understand what operation is required. The operands are also fetched from registers if needed.

The instruction is decoded as follows:
* The operation is ADD
* R1 index is 7
* R2 index is 29
* The values 10 and 0 are fetched from the corresponding registers.
#### Execution (EXE):
The CPU performs the actual computation or operation specified by the instruction. This could be an arithmetic or logic operation, or calculating an address for memory access.


For the value of rs1 (x7), which is 10, it directly enters the ALU for computation. However, for rs2 (x29), based on the instruction execution order, x29 will be written by the previous instruction, causing a Read-After-Write (RAW) hazard. Therefore, the mux below selects the forwarded value to provide the ALU with the correct data for calculation.
#### Memory Access (MEM):
If the instruction requires data to be read from or written to memory, this is done in the MEM stage. For load/store instructions, this is where data is fetched or saved in memory.

Since the ADD instruction does not involve memory, the computed value is passed directly to the next stage.
#### Write Back (WB):
The result of the instruction execution (from EXE or MEM stage) is written back to the register file, completing the instruction.

The result of the ADD computation, which is 10, is written back to the register file.