# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [dennis356811](https://github.com/dennis356811) >
## Quiz1 Problem C: Counting Leading Zero
The counting leading zeros (CLZ) function is used to determine how many zero bits are presented at the beginning (most significant bits) of a binary number, before the first occurrence of a '1'.
### C Code
#### Version 1: From quiz 1
```cpp
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
This function counts the leading zeros of an unsigned 32-bit integer by recursively checking each bit starting from the most significant one. It begins with the index ```i = 31``` and shifts the bit to determine how many leading zeros there are. After each shift, it performs a bitwise AND operation with ```x``` to check the value of the bit in the ```32-i``` position. Once it encounters the first non-zero bit, the loop terminates, and the count of leading zeros is returned.
:::info
The version has a big disadvatage that when the number is small, that is, has a lot of leading zero, the function will suffer from going through many loops.
:::
So, I come up with another solution.
#### Version 2: Binary search clz
```cpp
int clz_bs(uint32_t x) {
int count = 0;
if (x == 0) {
return 32;
}
if (x <= 0x0000FFFF) {
count += 16;
x <<= 16;
}
if (x <= 0x00FFFFFF) {
count += 8;
x <<= 8;
}
if (x <= 0x0FFFFFFF) {
count += 4;
x <<= 4;
}
if (x <= 0x3FFFFFFF) {
count += 2;
x <<= 2;
}
if (x <= 0x7FFFFFFF) {
count += 1;
}
return count;
}
```
This version separates 32 bits into 6 cases using binary search alike method. It means the program only need to checks 6 condition for every number. In the first case, when x = 0, just returns 32. In the other cases, x is compared to a specific constant. When x are smaller than it, it means that x at least has that much leading zero corresponding to the cases. After every checking, shift x and start next checking until every cases are done.
#### Test Program
```cpp
#include <stdint.h>
#include <stdio.h>
static inline int clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
int clz_bs(uint32_t x) {
int count = 0;
if (x == 0) {
return 32;
}
if (x <= 0x0000FFFF) {
count += 16;
x <<= 16;
}
if (x <= 0x00FFFFFF) {
count += 8;
x <<= 8;
}
if (x <= 0x0FFFFFFF) {
count += 4;
x <<= 4;
}
if (x <= 0x3FFFFFFF) {
count += 2;
x <<= 2;
}
if (x <= 0x7FFFFFFF) {
count += 1;
}
return count;
}
int main() {
const int size = 8;
uint32_t arr[size] = {64, 128, 1584367, 2147483647,
4294967295, 300, 878780422, 920105};
int error = 0;
printf("test original clz\n");
for (int i = 0; i < size; i++) {
printf("answer: %d, your result: %d", __builtin_clz(arr[i]),
clz(arr[i]));
if (__builtin_clz(arr[i]) == clz(arr[i])) {
printf(" ,pass !\n");
} else {
printf(" ,wrong !\n");
error++;
}
}
printf("test binary search clz\n");
for (int i = 0; i < size; i++) {
printf("answer: %d, your result: %d", __builtin_clz(arr[i]),
clz_bs(arr[i]));
if (__builtin_clz(arr[i]) == clz_bs(arr[i])) {
printf(" ,pass !\n");
} else {
printf(" ,wrong !\n");
error++;
}
}
if (error != 0) {
printf("Total has %d error !\n", error);
} else {
printf("All pass !\n");
}
}
```
The test program's compares the value returned between the two clz function I designed with the __builtin_clz(), which is in gcc's standard library.
:::info
According to GCC manual, for __builtin_clz(x), it is not defined when x = 0, that is, it won't return 32. That's why I didn't choose 0 to test.
:::
#### Output
```cpp
test original clz
answer: 25, your result: 25 ,pass !
answer: 24, your result: 24 ,pass !
answer: 11, your result: 11 ,pass !
answer: 1, your result: 1 ,pass !
answer: 0, your result: 0 ,pass !
answer: 23, your result: 23 ,pass !
answer: 2, your result: 2 ,pass !
answer: 12, your result: 12 ,pass !
test binary search clz
answer: 25, your result: 25 ,pass !
answer: 24, your result: 24 ,pass !
answer: 11, your result: 11 ,pass !
answer: 1, your result: 1 ,pass !
answer: 0, your result: 0 ,pass !
answer: 23, your result: 23 ,pass !
answer: 2, your result: 2 ,pass !
answer: 12, your result: 12 ,pass !
All pass !
```
### Assembly Code
#### Version 1: From quiz 1
```c
.data
num_test: .word 10
test: .word 64, 128, 0, 2147483647, 4294967295, -1, -2147483647, 300, 878780422, 920105
answer: .word 25, 24, 32, 1, 0, 0, 0, 23, 2, 12
output: .string "your result -> "
output2: .string " , answer -> "
next_line: .string "\n"
output_pass: .string "All pass\n"
output_error: .string "Fail, total has "
output_error2: .string " error"
.text
main:
lw s0, num_test # loading number of test
la s1, test # loading test data address
la s2, answer # loading answer data address
li s3, 0 # i = 0
li s6, 0 # int error = 0
bge s3, s0, main_for_loop_exit
main_for_loop:
slli s4, s3, 2 # set s4 = i * 4 bytes
add s5, s4, s1 # set s5 = *arr[i]
lw a0, 0(s5) # loading test data from memory
## do clz function
jal ra, clz
## verification
add s5, s2, s4 # set s5 = *answer[i]
lw a1, 0(s5) # loading answer data from memory
beq a0, a1, pass
addi s6, s6, 1
pass:
jal ra, print
## i++, check i < num_test
addi s3, s3, 1
blt s3, s0, main_for_loop
main_for_loop_exit:
## program finish
beq s6, x0, all_pass
la a0, output_error
li a7, 4
ecall
add a0, s6, x0
li a7, 1
ecall
la a0, output_error2
li a7, 4
ecall
jal x0, exit
all_pass:
la a0, output_pass
li a7, 4
ecall
exit:
li a7, 10
ecall
clz:
li t0, 0 # int count = 0
li t1, 31 # int i = 31
li t2, 1 # set t2 = 1U
blt t1, x0, loop_exit # if i < 0, go outside loop
for_loop:
sll t3, t2, t1 # set t3 = (1U << i)
and t3, t3, a0 # set t3 = (x & (1U << i))
bne t3, x0, loop_exit # if (x & (1U << i) == 1): break
addi t0, t0, 1 # count++
## condition check (j >= 0)
addi t1, t1, -1 # --i
bge t1, x0, for_loop # if (i >= 0) go loop again
loop_exit:
add a0, t0, x0 # return count
jalr x0, ra, 0
print:
## your result:
add a2, a0, x0 # backup a0 result
la a0, output
li a7, 4
ecall
add a0, a2, x0
li a7, 1
ecall
## answer:
la a0, output2
li a7, 4
ecall
add a0, a1, x0
li a7, 1
ecall
la a0, next_line
li a7, 4
ecall
jalr x0, ra, 0
```
#### Version 2: Binary search clz
```cpp
.data
num_test: .word 10
test: .word 64, 128, 0, 2147483647, 4294967295, -1, -2147483647, 300, 878780422, 920105
answer: .word 25, 24, 32, 1, 0, 0, 0, 23, 2, 12
output: .string "your result -> "
output2: .string " , answer -> "
next_line: .string "\n"
output_pass: .string "All pass\n"
output_error: .string "Fail, total has "
output_error2: .string " error"
.text
main:
lw s0, num_test # loading number of test
la s1, test # loading test data address
la s2, answer # loading answer data address
li s3, 0 # i = 0
li s6, 0 # int error = 0
bge s3, s0, main_for_loop_exit
main_for_loop:
slli s4, s3, 2 # set s4 = i * 4 bytes
add s5, s4, s1 # set s5 = *arr[i]
lw a0, 0(s5) # loading test data from memory
## do clz function
jal ra, clz_bs
## verification
add s5, s2, s4 # set s5 = *answer[i]
lw a1, 0(s5) # loading answer data from memory
beq a0, a1, pass
addi s6, s6, 1
pass:
jal ra, print
## i++, check i < num_test
addi s3, s3, 1
blt s3, s0, main_for_loop
main_for_loop_exit:
## program finish
beq s6, x0, all_pass
la a0, output_error
li a7, 4
ecall
add a0, s6, x0
li a7, 1
ecall
la a0, output_error2
li a7, 4
ecall
jal x0, exit
all_pass:
la a0, output_pass
li a7, 4
ecall
exit:
li a7, 10
ecall
clz_bs:
## save ra
addi sp, sp, -4
sw ra, 0(sp)
add t0, x0, x0 # int count = 0
beq a0, x0, x_zero # if (x == 0):
li t1, 0x0000FFFF
bgtu a0, t1, check_8 # if (x <= 0x0000FFFF):
addi t0, t0, 16 # count += 16
slli a0, a0, 16 # x <<= 16
check_8:
li t1, 0x00FFFFFF
bgtu a0, t1, check_4 # if (x <= 0x00FFFFFF):
addi t0, t0, 8 # count += 8
slli a0, a0, 8 # x <<= 8
check_4:
li t1, 0x0FFFFFFF
bgtu a0, t1, check_2 # if (x <= 0x0FFFFFFF):
addi t0, t0, 4 # count += 4
slli a0, a0, 4 # x <<= 4
check_2:
li t1, 0x3FFFFFFF
bgtu a0, t1, check_1 # if (x <= 0x0000FFFF):
addi t0, t0, 2 # count += 2
slli a0, a0, 2 # x <<= 2
check_1:
li t1, 0x7FFFFFFF
bgtu a0, t1, end_clz # if (x <= 0x0000FFFF):
addi t0, t0, 1 # count += 1
slli a0, a0, 1 # x <<= 1
end_clz:
add a0, t0, x0 # return count
## Retrieved ra
lw ra, 0(sp)
addi sp, sp, 4
jalr x0, ra, 0
x_zero:
li a0, 32 # return 32
## Retrieved ra
lw ra, 0(sp)
addi sp, sp, 4
jalr x0, ra, 0
print:
## your result:
add a2, a0, x0 # backup a0 result
la a0, output
li a7, 4
ecall
add a0, a2, x0
li a7, 1
ecall
## answer:
la a0, output2
li a7, 4
ecall
add a0, a1, x0
li a7, 1
ecall
la a0, next_line
li a7, 4
ecall
jalr x0, ra, 0
```
#### Output
```cpp
your result -> 25 , answer -> 25
your result -> 24 , answer -> 24
your result -> 32 , answer -> 32
your result -> 1 , answer -> 1
your result -> 0 , answer -> 0
your result -> 0 , answer -> 0
your result -> 0 , answer -> 0
your result -> 23 , answer -> 23
your result -> 2 , answer -> 2
your result -> 12 , answer -> 12
All pass
Program exited with code: 0
```
#### Comparison
| | clz (v1) | clz_bs (v2) |
| --- | --- | --- |
| Cycles | 1609 | 861 |
| Instructions | 1117 | 557 |
| CPI | 1.44 | 1.55 |
| IPC | 0.694 | 0.647 |
From above table, it is obviously that the second version of clz has fewer execution cycles and instruction.
## Improving shift-and-add multiplication using clz
From Quiz 2 Problem A, it implements the multiplication by using shift-and-add technique. In this approach, the multiplicand ```x``` is repeatedly added to itself according to the value of the multiplier ```y``` , effectively performing ```x``` multiplied by ```y```.
### Assembly Code
#### Version 1: From quiz2
```cpp
.data
num_test: .word 14
multiplier: .word 9, 9, -9, -9, 0, 10, 0, -10, 0, 1024, 32768, -32768, 32768, -32768
multiplicand: .word 7, -7, 7, -7, 0, 0, 10, 0, -10, -1024, 32768, -32768, -32768, 32768
answer: .word 63, -63, -63, 63, 0, 0, 0, 0, 0, -1048576, 1073741824, 1073741824, -1073741824, -1073741824
next_line: .string "\n"
output: .string "your result -> "
output2: .string " , answer -> "
output_pass: .string "All pass\n"
output_error: .string "Fail, total has "
output_error2: .string " error"
.text
main:
la s0, multiplier # load multiplier address
la s1, multiplicand # load multiplicand address
la s2, answer # Load answer address
la s3, num_test # Load number of test
lw s3, 0(s3)
li s5, 0 # int i = 0
li s6, 0 # int error = 0
main_for_loop:
li s4, 0 # Initialize accumulator
## get test data
slli t0, s5, 2 # size = i * 4 byte
add t1, t0, s0 # get multiplier[i] address
lw t1, 0(t1) # set t1 = mulitplier[i]
add t2, t0, s1 # get multiplicand[i] address
lw t2, 0(t2) # set t1 = mulitplicand[i]
## make multipler positvie
srai t3, t1, 31 # t3 = 0xFFFFFFFF if different signed
bge t1, x0, multiplier_positive # If multiplier positive (#A02)
neg t1, t1 # Make multiplier positive
multiplier_positive:
## use clz to initialize bit counter (t4)
add a0, t1, x0 # pass arguement
li t4, 32 # t4 = 32
## do shift_and_add_loop
shift_and_add_loop:
beq t4, x0, end_shift_and_add # Exit if bit count is zero
andi t5, t1, 1 # Check least significant bit
beq t5, x0, skip_add # Skip add if bit is 0
add s4, s4, t2 # Add to accumulator
skip_add:
srai t1, t1, 1 # Right shift multiplier
slli t2, t2, 1 # Left shift multiplicand
addi t4, t4, -1 # Decrease bit counter
jal x0, shift_and_add_loop # Repeat loop
end_shift_and_add:
xor s4, s4, t3
andi t3, t3, 1
add s4, s4, t3
jal ra, print
addi s5, s5, 1
blt s5, s3, main_for_loop
main_for_loop_exit:
beq s6, x0, pass
la a0, output_error
li a7, 4
ecall
add a0, s6, x0
li a7, 1
ecall
la a0, output_error2
li a7, 4
ecall
jal x0, exit
pass:
la a0, output_pass
li a7, 4
ecall
exit:
li a7, 10
ecall
print:
la a0, output
li a7, 4
ecall
add a0, s4, x0
li a7, 1
ecall
la a0, output2
li a7, 4
ecall
add a0, t0, s2
lw a0, 0(a0)
li a7, 1
ecall
beq a0, s4, correct
addi s6, s6, 1
correct:
la a0, next_line
li a7, 4
ecall
jalr x0, ra, 0
```
In this approach, I fixed the bugs of dealing with negative number, and add some testing data for verification. Before the shifting routine, I first check the signed bit of multiplier, ensuring it is positive, so that it can correctly checks its least significant bit ```(LSB)``` to know whether the multiplicand needed to be added. After adding is performed, it then checks the next ```LSB``` of multipler by shifting right one bit, and shifting multiplicand left one bit. In the original one, it totally performs 32 times of shifting in shift-and-add technique, while the actual number don't need that much of shifting.
:::info
By using clz to multiplier in advance, it won't need 32 times of shifting for every number, increasing the efficiency of the program.
:::
#### Version 2: Improving shift-and-add by clz from quiz1
```cpp
.data
num_test: .word 14
multiplier: .word 9, 9, -9, -9, 0, 10, 0, -10, 0, 1024, 32768, -32768, 32768, -32768
multiplicand: .word 7, -7, 7, -7, 0, 0, 10, 0, -10, -1024, 32768, -32768, -32768, 32768
answer: .word 63, -63, -63, 63, 0, 0, 0, 0, 0, -1048576, 1073741824, 1073741824, -1073741824, -1073741824
next_line: .string "\n"
output: .string "your result -> "
output2: .string " , answer -> "
output_pass: .string "All pass\n"
output_error: .string "Fail, total has "
output_error2: .string " error"
.text
main:
la s0, multiplier # load multiplier address
la s1, multiplicand # load multiplicand address
la s2, answer # Load answer address
la s3, num_test # Load number of test
lw s3, 0(s3)
li s5, 0 # int i = 0
li s6, 0 # int error = 0
main_for_loop:
li s4, 0 # Initialize accumulator
## get test data
slli t0, s5, 2 # size = i * 4 byte
add t1, t0, s0 # get multiplier[i] address
lw t1, 0(t1) # set t1 = mulitplier[i]
add t2, t0, s1 # get multiplicand[i] address
lw t2, 0(t2) # set t1 = mulitplicand[i]
## make multipler positvie
srai t3, t1, 31 # t3 = 0xFFFFFFFF if different signed
bge t1, x0, multiplier_positive # If multiplier positive (#A02)
neg t1, t1 # Make multiplier positive
multiplier_positive:
## use clz to initialize bit counter (t4)
add a0, t1, x0 # pass arguement
## Caller save
addi sp, sp, -16
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
sw t3, 12(sp)
jal ra, clz # return a0 = clz(multiplier)
## Retrieved caller save
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
lw t3, 12(sp)
addi sp, sp, 16
li t4, 32 # t4 = 32
sub t4, t4, a0 # t4 = 32 - clz(multiplier)
## do shift_and_add_loop
shift_and_add_loop:
beq t4, x0, end_shift_and_add # Exit if bit count is zero
andi t5, t1, 1 # Check least significant bit
beq t5, x0, skip_add # Skip add if bit is 0
add s4, s4, t2 # Add to accumulator
skip_add:
srai t1, t1, 1 # Right shift multiplier
slli t2, t2, 1 # Left shift multiplicand
addi t4, t4, -1 # Decrease bit counter
jal x0, shift_and_add_loop # Repeat loop
end_shift_and_add:
xor s4, s4, t3
andi t3, t3, 1
add s4, s4, t3
jal ra, print
addi s5, s5, 1
blt s5, s3, main_for_loop
main_for_loop_exit:
beq s6, x0, pass
la a0, output_error
li a7, 4
ecall
add a0, s6, x0
li a7, 1
ecall
la a0, output_error2
li a7, 4
ecall
jal x0, exit
pass:
la a0, output_pass
li a7, 4
ecall
exit:
li a7, 10
ecall
## function clz:
## used register: t0~t3
clz:
li t0, 0 # int count = 0
li t1, 31 # int i = 31
li t2, 1 # set t2 = 1U
blt t1, x0, loop_exit # if i < 0, go outside loop
for_loop:
sll t3, t2, t1 # set t3 = (1U << i)
and t3, t3, a0 # set t3 = (x & (1U << i))
bne t3, x0, loop_exit # if (x & (1U << i) == 1): break
addi t0, t0, 1 # count++
## condition check (j >= 0)
addi t1, t1, -1 # --i
bge t1, x0, for_loop # if (i >= 0) go loop again
loop_exit:
add a0, t0, x0 # return count
jalr x0, ra, 0
print:
la a0, output
li a7, 4
ecall
add a0, s4, x0
li a7, 1
ecall
la a0, output2
li a7, 4
ecall
add a0, t0, s2
lw a0, 0(a0)
li a7, 1
ecall
beq a0, s4, correct
addi s6, s6, 1
correct:
la a0, next_line
li a7, 4
ecall
jalr x0, ra, 0
```
In this approach, instead of shifting ```32``` times for every number, I use clz function to count how many leading zero ```clz(x)``` multiplier has, which means only ```32 - clz(x)``` times of shifting needs to be done.
#### Version 3 : Improving shift-and-add by binary searching clz
```cpp
.data
num_test: .word 14
multiplier: .word 9, 9, -9, -9, 0, 10, 0, -10, 0, 1024, 32768, -32768, 32768, -32768
multiplicand: .word 7, -7, 7, -7, 0, 0, 10, 0, -10, -1024, 32768, -32768, -32768, 32768
answer: .word 63, -63, -63, 63, 0, 0, 0, 0, 0, -1048576, 1073741824, 1073741824, -1073741824, -1073741824
next_line: .string "\n"
output: .string "your result -> "
output2: .string " , answer -> "
output_pass: .string "All pass\n"
output_error: .string "Fail, total has "
output_error2: .string " error"
.text
main:
la s0, multiplier # load multiplier address
la s1, multiplicand # load multiplicand address
la s2, answer # Load answer address
la s3, num_test # Load number of test
lw s3, 0(s3)
li s5, 0 # int i = 0
li s6, 0 # int error = 0
main_for_loop:
li s4, 0 # Initialize accumulator
## get test data
slli t0, s5, 2 # size = i * 4 byte
add t1, t0, s0 # get multiplier[i] address
lw t1, 0(t1) # set t1 = mulitplier[i]
add t2, t0, s1 # get multiplicand[i] address
lw t2, 0(t2) # set t1 = mulitplicand[i]
## make multipler positvie
srai t3, t1, 31 # t3 = 0xFFFFFFFF if different signed
bge t1, x0, multiplier_positive # If multiplier positive (#A02)
neg t1, t1 # Make multiplier positive
multiplier_positive:
## use clz to initialize bit counter (t4)
add a0, t1, x0 # pass arguement
## Caller save
addi sp, sp, -8
sw t0, 0(sp)
sw t1, 4(sp)
jal ra, clz_bs # return a0 = clz(multiplier)
## Retrieved caller save
lw t0, 0(sp)
lw t1, 4(sp)
addi sp, sp, 8
li t4, 32 # t4 = 32
sub t4, t4, a0 # t4 = 32 - clz(multiplier)
## do shift_and_add_loop
shift_and_add_loop:
beq t4, x0, end_shift_and_add # Exit if bit count is zero
andi t5, t1, 1 # Check least significant bit
beq t5, x0, skip_add # Skip add if bit is 0
add s4, s4, t2 # Add to accumulator
skip_add:
srai t1, t1, 1 # Right shift multiplier
slli t2, t2, 1 # Left shift multiplicand
addi t4, t4, -1 # Decrease bit counter
jal x0, shift_and_add_loop # Repeat loop
end_shift_and_add:
xor s4, s4, t3
andi t3, t3, 1
add s4, s4, t3
jal ra, print
addi s5, s5, 1
blt s5, s3, main_for_loop
main_for_loop_exit:
beq s6, x0, pass
la a0, output_error
li a7, 4
ecall
add a0, s6, x0
li a7, 1
ecall
la a0, output_error2
li a7, 4
ecall
jal x0, exit
pass:
la a0, output_pass
li a7, 4
ecall
exit:
li a7, 10
ecall
## function clz:
## used register: t0~t3
clz_bs:
add t0, x0, x0 # int count = 0
beq a0, x0, x_zero # if (x == 0):
li t1, 0x0000FFFF
bgtu a0, t1, check_8 # if (x <= 0x0000FFFF):
addi t0, t0, 16 # count += 16
slli a0, a0, 16 # x <<= 16
check_8:
li t1, 0x00FFFFFF
bgtu a0, t1, check_4 # if (x <= 0x00FFFFFF):
addi t0, t0, 8 # count += 8
slli a0, a0, 8 # x <<= 8
check_4:
li t1, 0x0FFFFFFF
bgtu a0, t1, check_2 # if (x <= 0x0FFFFFFF):
addi t0, t0, 4 # count += 4
slli a0, a0, 4 # x <<= 4
check_2:
li t1, 0x3FFFFFFF
bgtu a0, t1, check_1 # if (x <= 0x0000FFFF):
addi t0, t0, 2 # count += 2
slli a0, a0, 2 # x <<= 2
check_1:
li t1, 0x7FFFFFFF
bgtu a0, t1, end_clz # if (x <= 0x0000FFFF):
addi t0, t0, 1 # count += 1
slli a0, a0, 1 # x <<= 1
end_clz:
add a0, t0, x0 # return count
jalr x0, ra, 0
x_zero:
li a0, 32 # return 32
jalr x0, ra, 0
print:
la a0, output
li a7, 4
ecall
add a0, s4, x0
li a7, 1
ecall
la a0, output2
li a7, 4
ecall
add a0, t0, s2
lw a0, 0(a0)
li a7, 1
ecall
beq a0, s4, correct
addi s6, s6, 1
correct:
la a0, next_line
li a7, 4
ecall
jalr x0, ra, 0
```
In this approach, I use the binary searching clz instead of the looped one from quiz1, futher improving the performance of multiplication
#### Output
```cpp
your result -> 63 , answer -> 63
your result -> -63 , answer -> -63
your result -> -63 , answer -> -63
your result -> 63 , answer -> 63
your result -> 0 , answer -> 0
your result -> 0 , answer -> 0
your result -> 0 , answer -> 0
your result -> 0 , answer -> 0
your result -> 0 , answer -> 0
your result -> -1048576 , answer -> -1048576
your result -> 1073741824 , answer -> 1073741824
your result -> 1073741824 , answer -> 1073741824
your result -> -1073741824 , answer -> -1073741824
your result -> -1073741824 , answer -> -1073741824
All pass
```
For the test data, I randomly choose some calculation between small number, big number , negative number, postive number, and zero.
#### Comparison
| | Quiz2 (v1) | with clz (v2) | with bs clz (v3) |
| --- | --- | --- | --- |
| Cycles | 5772 | 5082 | 2438 |
| Instructions | 3708 | 3644 | 1648 |
| CPI | 1.56 | 1.39 | 1.48 |
| IPC | 0.642 | 0.717 | 0.676 |
In the table, we can find out clz indeed improves the performance of shift-and-add technique. But in version 2, its clz still has a lot of loops to check, so its improvement is not that obvious. On the other hand, in version 3, it significantly decreases the number of cycles and instructions.
## RISC-V 5-stage pipeline processor
Pipeline is a technique which separate the execution of an instruction into different stage, in order to do several instruction at the same time. In convenience, I will use the following program to introduce each stage.
```cpp
0000009c <clz>:
9c: 00000293 addi x5 x0 0
a0: 01f00313 addi x6 x0 31
a4: 00100393 addi x7 x0 1
a8: 00034e63 blt x6 x0 28 <loop_exit>
000000ac <for_loop>:
ac: 00639e33 sll x28 x7 x6
b0: 00ae7e33 and x28 x28 x10
b4: 000e1863 bne x28 x0 16 <loop_exit>
b8: 00128293 addi x5 x5 1
bc: fff30313 addi x6 x6 -1
c0: fe0356e3 bge x6 x0 -20 <for_loop>
000000c4 <loop_exit>:
c4: 00028533 add x10 x5 x0
c8: 00008067 jalr x0 x1 0
```
Take ```bc: fff30313 addi x6 x6 -1``` for example.
The five stage includes:
### IF (Instruction Fetch)

This stage fetches instruction from instruction memory corresponding to the ``PC``. As seen from the picture, when ```pc``` is ```0x000000bc```, the corresponding address in memory is ```0xfff30313```, that is, ```addi x6, x6, -1``` in machine lauguage.
### ID (Instruction Decode)

This stage decodes the instruction from the last stage, and recognized the register and immediate being use in the instruction. As seen from the picture, the decoded register is ```0x06```, that is ```x6```, and the stored data is ```0x0000001f```. ```Imm``` blocks generate ```0xffffffff``` corresponding to ```-1``` in the instruction.
### EXE (Execution)

This stage executes the instruction calculation using ```ALU``` unit. As seen from the picture, two input ```0x0000001f``` and ```0xffffffff``` are added and the output is ```0x0000001e```.
### MEM (Memory)

This stage loads or store data from memory. Because in the previous example ```bc: fff30313 addi x6 x6 -1``` doesn't involve memory reading/writing, I use another example to demonstrate the effect. In instruction ```sw x1 0 x2```, it stores data in ```x1``` register into the address stored in ```x2``` register corresponding to the memory. As seen the picture, ```0x00000034``` is stored in ```x1``` register, and will be writed into ```0x7ffffff0``` stored in ```x2``` register in data memory.
### WB (Write back)
 
This stage writes data back to the register. As seen in the picture, the instruction eventually stores the result into ```x5``` register. The writing data is ```0x00000002``` and writing register is ```x5```.