# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < `王彥珽` >
:::danger
Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data.
:::
## Quiz 1 Problem C
**Problem C** from **Quiz1** has been selected for this assignment to solve a problem from Leetcode. Below are the corresponding code and assembly code with functions in Proble C. I will write c code and assembly code for each function and vertify them.
### C code and test data for `fabsf`
```c
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
```
### assembly code for `fabsf`
```c
.data
test0: .word 0xD8800000
test1: .word 0xCB000000
test2: .word 0x72000000
mask: .word 0x7fffffff
str0: .string "Yes\n"
str1: .string "No\n"
.text
main:
lw t0, test0 #Load test0 into t0
lw s0, mask #Load mask into s0
jal ra, fabsf
lw t0, test1 #Load test1 into t0
jal ra, fabsf
lw t0, test2 #Load test2 into t0
jal ra, fabsf
j exit
fabsf:
and t2,t0,s0 #i &= 0x7FFFFFFF;
beq t2,t0, print_no #if(b == -a)
la a0 str0 #print Yes
li a7, 4
ecall
jr ra #back to next testdata
print_no:
la a0, str1 #print No
li a7, 4
ecall
jr ra #back to next testdata
exit:
li a7, 10 #exit syscall
ecall
```
The output of assembly code will be the table below:
| input to `fabsf` | output from `fabsf` |
|:--------------:|:-----------------:|
| 0xD8800000 | 0x58800000 |
| 0xCB000000 | 0xCB000000 |
| 0x72000000 | 0x72000000 |
### C code and test data for `fp16_to_fp32`
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
### assembly code for `fp16_to_fp32`
```c
.data
test0: .word 0x00003857
test1: .word 0x00005555
test2: .word 0x00008877
mask0: .word 0x80000000
mask1: .word 0x7fffffff
mask2: .word 0x04000000
mask3: .word 0x7f800000
str0: .string "\n"
.text
main:
lw s0, test0 #Load test0 into s0
jal ra, fp16_to_fp32
lw s0, test1 #Load test1 into s0
jal ra, fp16_to_fp32
lw s0, test2 #Load test2 into s0
jal ra, fp16_to_fp32
j exit
fp16_to_fp32:
slli s1, s0, 16 # s1 = w, (uint32_t) h << 16;
lw t0, mask0
and s2, s1, t0 # s2 = sign, w & UINT32_C(0x80000000);
lw t0, mask1
and s3, s1, t0 # s3 = nonsign, w & UINT32_C(0x7FFFFFFF);
mv s0, s3 # s0 = nonsign, x = nonsign
mv s11, ra
jal ra, my_clz # my_clz(x)
mv ra, s11
mv s1, a0 # s1 = renorm, renorm_shift = my_clz and -5
lw t0, mask2
add s0, s3, t0
lw t0, mask3
srli s0, s0, 8
and s0, s0 t0 #s0 = inf_nan_mask
andi s4, s3, -1
srli s4, s4, 31 #s4 = zero_mask
sll s3, s3, s1
srli s3, s3, 3 #s3 = nonsign << renorm_shift >> 3
li t0, 112 #0x70
sub s1, t0, s1 #(0x70 - renorm_shift)
slli s1, s1, 23 #s1 = (0x70 - renorm_shift) << 23
add s1, s1, s3
xori s4, s4, -1 #~zero_mask
or s1, s1, s0
and s1, s1, s4
or s2, s2, s1
mv a0, s2
mv s11, ra
jal ra, print
mv ra, s11
jr ra
my_clz:
li a0, 0 #count = 0
li t0, 31 #i = 31
li s10, 5
for_clz:
li t1, 1
sll t1, t1, t0 #1U << i
and t2, s0, t1 #x & (1U << i)
bne t2, x0, condition_clz #if (x == 1)
addi a0, a0, 1 #count += 1
addi t0, t0, -1 #--i
bge t0, x0, for_clz #Repeat while i >= 0
condition_clz:
blt a0, s10 , false
addi a0,a0,-5
jr ra
false:
sub a0,a0,a0
jr ra
print:
li a7, 34
ecall
la a0, str0
li a7, 4
ecall
jr ra
exit:
li a7, 10 #exit syscall
ecall
```
The output of assembly code will be the table below:
| input to `fp16_to_fp32` | output from `fp16_to_fp32` |
|:-----------------------:|:--------------------------:|
| 0x00003857 | 0x3f0ae000 |
| 0x00005555 | 0x42aaa000 |
| 0x00008877 | 0xb90ee000 |
## Selected LeetCode Problem
> LeetCode [762. Prime Number of Set Bits in Binary Representation](https://leetcode.com/problems/prime-number-of-set-bits-in-binary-representation/description/)
Given two integers `left` and `right`, return the **count** of numbers in the **inclusive** range `[left, right]` having a **prime number of set bits** in their binary representation.
Exanple:
Input: `left = 6, right = 10`
Output: `4`
Explanation:
`6 -> 110 (2 set bits, 2 is prime)`
`7 -> 111 (3 set bits, 3 is prime)`
`8 -> 1000 (1 set bit, 1 is not prime)`
`9 -> 1001 (2 set bits, 2 is prime)`
`10 -> 1010 (2 set bits, 2 is prime)`
`4 numbers have a prime number of set bits.`
## Solution
By utilizing the `my_clz` function from Quiz1 Problem 3, we can first determine the number of set bits for each number. Next, we check if this number is a prime. Finally, we sum up the prime numbers and output the result as the final answer.
### C code
```c
#include<stdio.h>
#include<stdint.h>
#include<stdbool.h>
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i)){
break;
}
count++;
}
return count;
}
bool is_prime(int a){
if(a < 2){
return false;
}
for(int i=2 ; i*i<=a ; i++){
if(a%i == 0){
return false;
}
}
return true;
}
int main()
{
int left = 10, right = 15, ans = 0, clz_cnt = 0;
uint32_t num;
for(int i=left ; i <= right ; i++){
int set_bit_cnt = 0;
num = (uint32_t)i;
while((clz_cnt = my_clz(num)) != 32){ //if my_clz return 32, mean num == 0
num = num << (clz_cnt+1);
set_bit_cnt++; //count set bit
}
if(is_prime(set_bit_cnt)){
ans++;
}
}
printf("The total number of prime number of set bits is: %d ", ans);
return 0;
}
```
### Assembly code
#### version 1
```c
.data
left0: .word 6
right0: .word 10
left1: .word 10
right1: .word 15
left2: .word 20
right2: .word 100
str0: .string "The total number of prime number of set bits is: "
str1: .string "\n"
.text
main:
lw s0, left0 # s0 = left0
lw s1, right0 # s1 = right0
li s2, 0 # s2 = ans = 0
li s3, 0 # s3 = set_bit_cnt = 0
li s4, 32 # s4 = 32
jal ra, store_ra
lw s0, left1
lw s1, right1
jal ra, store_ra
lw s0, left2
lw s1, right2
jal ra store_ra
j exit
store_ra:
mv s11, ra #store ra
loop_main:
bgt s0, s1, print #left > right , print ans
mv t0, s0 # input left to my_clz
while:
jal ra, my_clz
beq a0, s4, is_prime #clz_cnt == 32, check s3 is prime
addi s3, s3, 1 #set_bit_cnt++
addi a0, a0, 1 #clz_cnt++
sll t0, t0, a0 #num = num << (clz_cnt+1);
j while
my_clz:
li t1, 0 # count = 0
li t2, 31 # i = 31
clz_for:
li t3, 1
sll t3, t3, t2 # 1U << i
and t4, t0, t3 # x & (1U << i)
bne t4, x0, clz_done # if (x == 0), exit loop
addi t1, t1, 1 # count += 1
addi t2, t2, -1 # --i
bge t2, x0, clz_for # Repeat while i >= 0
clz_done:
mv a0, t1
jr ra
is_prime:
li t0, 2 # t0 = 2
blt s3, t0, false # set_bit_cnt < 2, not a prime
prime_loop:
li t1, 0 #count i*i
li t3, 0 #counter for 0-i
mul:
add t1, t1, t0 # t2 = t2 + i
addi t3, t3, 1 # t3++
bge t3,t0 , check_prime #t3 = t0, finish i*i
j mul
check_prime:
bgt t1, s3, true # i*i > a, return true
mv t1, s3
mod:
sub t1, t1, t0 # a = a - i
bge t1, t0, mod # a >= i, keep sub
beqz t1, false # a == 0, not prime
addi t0, t0, 1 # i++
j prime_loop
false:
li s3, 0 # reset set_bit_cnt
addi s0, s0, 1 # left++
j loop_main # another data
true:
li s3, 0
addi s2, s2, 1 # ans++
addi s0, s0, 1
j loop_main
print:
la a0, str0
li a7, 4 # print str0
ecall
mv a0, s2
li a7, 1 #print ans
ecall
la a0, str1
li a7, 4 # print \n
ecall
li s2, 0 # s2 = ans = 0
mv ra, s11 #back to main
jr ra
exit:
li a7, 10 # exit
ecall
```
I optimized the `is_prime` function to limit checks to primes less than 32, rather than evaluating each number individually. This change reduces the overall instruction count and decrease the number of branches.
:::danger
Instead of saying "optimized," prove it!
:::
#### version 2
```c
.data
left0: .word 6
right0: .word 10
left1: .word 10
right1: .word 15
left2: .word 20
right2: .word 100
prime: .word 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 33
str0: .string "The total number of prime number of set bits is: "
str1: .string "\n"
.text
main:
lw s0, left0 # s0 = left0
lw s1, right0 # s1 = right0
li s2, 0 # s2 = ans = 0
li s4, 32 # s4 = 32
jal ra, store_ra
lw s0, left1
lw s1, right1
jal ra, store_ra
lw s0, left2
lw s1, right2
jal ra store_ra
j exit
store_ra:
mv s11, ra #store ra
loop_main:
li s3, 0
la a1, prime
li a2, 12
bgt s0, s1, print #left > right , print ans
mv t0, s0 # input left to my_clz
while:
jal ra, my_clz
beq a0, s4, is_prime #clz_cnt == 32, check s3 is prime
addi s3, s3, 1 #set_bit_cnt++
addi a0, a0, 1 #clz_cnt++
sll t0, t0, a0 #num = num << (clz_cnt+1);
j while
my_clz:
li t1, 0 # count = 0
li t2, 31 # i = 31
clz_for:
li t3, 1
sll t3, t3, t2 # 1U << i
and t4, t0, t3 # x & (1U << i)
bne t4, x0, clz_done # if (x == 0), exit loop
addi t1, t1, 1 # count += 1
addi t2, t2, -1 # --i
bge t2, x0, clz_for # Repeat while i >= 0
clz_done:
mv a0, t1
jr ra
is_prime:
lw a3, 0(a1)
addi a1, a1, 4
addi a2, a2, -1
beq s3, a3, true
bgt s3, a3, is_prime
#bge a2, x0, is_prime
addi s0, s0, 1 # left++
j loop_main # another data
true:
addi s2, s2, 1 # ans++
addi s0, s0, 1
j loop_main
print:
la a0, str0
li a7, 4 # print str0
ecall
mv a0, s2
li a7, 1 #print ans
ecall
la a0, str1
li a7, 4 # print \n
ecall
li s2, 0 # s2 = ans = 0
mv ra, s11 #back to main
jr ra
exit:
li a7, 10 # exit
ecall
```
Furthermore, I refactored the `my_clz` function to a loop-free version. While this increased the instruction count slightly, it significantly reduced the total cycle count, improving overall performance.
version 3
```c
.data
left0: .word 6
right0: .word 10
left1: .word 10
right1: .word 15
left2: .word 20
right2: .word 100
prime: .word 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 33
str0: .string "The total number of prime number of set bits is: "
str1: .string "\n"
.text
main:
lw s0, left0 # s0 = left0
lw s1, right0 # s1 = right0
li s2, 0 # s2 = ans = 0
li s4, 32 # s4 = 32
jal ra, loop_main
lw s0, left1
lw s1, right1
jal ra, loop_main
lw s0, left2
lw s1, right2
jal ra loop_main
li a7, 10 # exit
ecall
true:
addi s2, s2, 1 # ans++
addi s0, s0, 1
loop_main:
li s3, 0 # s3 = set_bit_cnt = 0
la a1, prime
li a2, 12
bgt s0, s1, print # left > right , print ans
mv t0, s0 # input left to my_clz
while:
#my_clz
srli t1, t0, 1 #x |= (x >> 1);
or t2, t0, t1
srli t1, t2, 2 #x |= (x >> 2);
or t2, t2, t1
srli t1, t2, 4 #x |= (x >> 4);
or t2, t2, t1
srli t1, t2, 8 #x |= (x >> 8);
or t2, t2, t1
srli t1, t2, 16 #x |= (x >> 16);
or t2, t2, t1
srli t1, t2, 1 #x -= ((x >> 1) & 0x55555555);
li t3, 0x55555555
and t1, t1, t3
sub t2, t2, t1
srli t1, t2, 2 #x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
li t3, 0x33333333
and t1, t1, t3
and t2, t2, t3
add t2, t2, t1
li t3, 0x0f0f0f0f #x = ((x >> 4) + x) & 0x0f0f0f0f;
srli t1, t2, 4
add t2, t1, t2
and t2, t2, t3
srli t1, t2, 8 #x += (x >> 8);
add t2, t2, t1
srli t1, t2, 16 #x += (x >> 16);
add t2, t2, t1
li t3, 0x0000007f #(32 - (x & 0x7f))
and t2, t2, t3
sub a0, s4, t2
beq a0, s4, is_prime #if clz_cnt == 32, check s3 is prime
addi s3, s3, 1 #set_bit_cnt++
addi a0, a0, 1 #clz_cnt++
sll t0, t0, a0 #num = num << (clz_cnt+1);
j while
is_prime:
lw a3, 0(a1)
addi a1, a1, 4
addi a2, a2, -1
beq s3, a3, true
bgt s3, a3, is_prime
addi s0, s0, 1 # left++
j loop_main # another data
print:
la a0, str0
li a7, 4 # print str0
ecall
mv a0, s2
li a7, 1 #print ans
ecall
la a0, str1
li a7, 4 # print \n
ecall
li s2, 0 # s2 = ans = 0
ret
```
:::danger
Use fewer instructions.
:::
## Result
All version outputs:

Execution info among 3 versoins:
|version 1|version 2| versoin 3 |
|:---------:|:---------:|:---------:|
||||
From the three images above, it is clear that version 3 executes fewer instructions and contains fewer branches compared to version 1 and version 2, because it uses a loopless `my_clz` function.Though difference between version 1 and 2 is the `is_prime` function ,there is a slight reduction in cycles except from code size.
## Assembly Optimization Analysis
### `is_prime` function
Since we're only dealing with 32-bit numbers, I modified the `is_prime` function's logic to directly check for primes less than 32 instead of using a loop.
While there is no significant difference with smaller numbers, a noticeable difference emerges as the number of inputs increases.
before:
```c
is_prime:
li t0, 2 # t0 = 2
blt s3, t0, false # set_bit_cnt < 2, not a prime
prime_loop:
li t1, 0 #count i*i
li t3, 0 #counter for 0-i
mul:
add t1, t1, t0 # t2 = t2 + i
addi t3, t3, 1 # t3++
bge t3,t0 , check_prime #t3 = t0, finish i*i
j mul
check_prime:
bgt t1, s3, true # i*i > a, return true
mv t1, s3
mod:
sub t1, t1, t0 # a = a - i
bge t1, t0, mod # a >= i, keep sub
beqz t1, false # a == 0, not prime
addi t0, t0, 1 # i++
j prime_loop
false:
li s3, 0 # reset set_bit_cnt
addi s0, s0, 1 # left++
j loop_main # another data
true:
li s3, 0
addi s2, s2, 1 # ans++
addi s0, s0, 1
j loop_main
```
after:
```c
.data
prime: .word 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 33
.text
la a1, prime
li a2, 12
is_prime:
lw a3, 0(a1)
addi a1, a1, 4
addi a2, a2, -1
beq s3, a3, true
bgt s3, a3, is_prime
addi s0, s0, 1 # left++
j loop_main # another data
```
with a smaller range of numbers(**100 numbers**)
| before | after |
|:------:|:-----:|
|||
Based on the two images above, the cycles were reduced by approximately 0.98%.
with a larger range of numbers(**100000 numbers**)
| before | after |
|:------:|:-----:|
|||
Based on the two images above, the cycles were reduced by approximately 2.36%.
From the two sets of tests, one with 100 data points and a cycle reduction of 0.98%, and the other with 100,000 data points and a cycle reduction of 2.36%, we can conclude that **the performance improvements become more pronounced as the data size increases**. This suggests that the optimization has a greater impact on larger datasets, likely due to better efficiency scaling over more iterations.
### `my_clz` function
In addition, referring to last year's [Quiz 1](https://hackmd.io/@sysprog/arch2023-quiz1-sol), I replaced `my_clz` with a loopless version.
When comparing versions 2 and 3, we observed a significant reduction in cycles, dropping from over 60,000 to nearly 20,000, while IPC improved from 0.7 to 0.9. The key difference between the two versions lies in the implementation of `my_clz`: one uses a loop while the other does not.
It is reasonable to infer that the program's main computations are concentrated in `my_clz`. Therefore, if we can minimize branch evaluations within `my_clz`, we can significantly reduce the number of flushed instructions, thereby enhancing performance.
Although this results in an increase in code length, the reduction in cycle count makes it a worthwhile trade-off.
:::danger
Explain and discuss
:::
function `count_leading_zeros` modified to count 32-bit in last year's Quiz 1 in c
```c
#include <stdint.h>
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x7f));
}
```
assembly code to `count_leading_zeros` which without loop
```c
#my_clz
srli t1, t0, 1 #x |= (x >> 1);
or t2, t0, t1
srli t1, t2, 2 #x |= (x >> 2);
or t2, t2, t1
srli t1, t2, 4 #x |= (x >> 4);
or t2, t2, t1
srli t1, t2, 8 #x |= (x >> 8);
or t2, t2, t1
srli t1, t2, 16 #x |= (x >> 16);
or t2, t2, t1
srli t1, t2, 1 #x -= ((x >> 1) & 0x55555555);
li t3, 0x55555555
and t1, t1, t3
sub t2, t2, t1
srli t1, t2, 2 #x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
li t3, 0x33333333
and t1, t1, t3
and t2, t2, t3
add t2, t2, t1
li t3, 0x0f0f0f0f #x = ((x >> 4) + x) & 0x0f0f0f0f;
srli t1, t2, 4
add t2, t1, t2
and t2, t2, t3
srli t1, t2, 8 #x += (x >> 8);
add t2, t2, t1
srli t1, t2, 16 #x += (x >> 16);
add t2, t2, t1
li t3, 0x0000007f #(32 - (x & 0x7f))
and t2, t2, t3
sub a0, s4, t2
```
assembly code to `my_clz` which with loop
```c
my_clz:
li t1, 0 # count = 0
li t2, 31 # i = 31
clz_for:
li t3, 1
sll t3, t3, t2 # 1U << i
and t4, t0, t3 # x & (1U << i)
bne t4, x0, clz_done # if (x == 0), exit loop
addi t1, t1, 1 # count += 1
addi t2, t2, -1 # --i
bge t2, x0, clz_for # Repeat while i >= 0
clz_done:
mv a0, t1
jr ra
```
| `my_clz` with loop | `my_clz` without loop |
| -------- | -------- |
|  | 
|
Based on the two images above, the cycles were reduced by approximately 70.9%.
From the two sets of test results, we can conclude that there is a significant performance improvement in the latter. The drastic reduction in cycle count indicates that the optimizations made in the second test lead to more efficient execution, suggesting that the changes have effectively enhanced the overall performance of the code.
## Analysis
We use [Ripes](https://github.com/mortbopet/Ripes) to simulate RISC-V processor
Ripes is a 5-stage pipeline instruciton simulator.

| Stage | Description |
|:-----:|:-----------:|
| IF |The instruction is retrieved from memory based on the program counter (PC)|
| ID |The instruction is decoded to determine its type and required operands|
| EXE |The ALU performs the specified operation or computes memory addresses|
| MEM |Data is read from or written to memory if required by the instruction|
| WB |The result is written back to the destination register or memory|
Below, I will use a single instruction as an example:
### IF

1. In the **IF stage**, the program counter (PC) initially holds the value `0x00000000` and is used to fetch the instruction from instruction memory.
2. The instruction memory returns the instruction `0x10000417`, and the PC is updated to `0x00000004` to point to the next instruction.
3. The fetched instruction `0x10000417` is then sent to the instruction decode stage (IFID), where it can be processed for further decoding.
### ID

The Decoder will decode the instruction `0x10000417` into:
| immediate\[31:12] | rd\[11:7] | opcode\[6:0] |
|:-------------------- |:---------:|:------------:|
| 00010000000000000000 | 01000 | 0010111 |
Result in immediate = `0x10000000`, rd = `0x08` and opcode = `AUIPC`. This instruction will operate` rd = immediate + PC`.
### EXE

The ALU adds `PC` and `immediate` ,then output the result to the next stage.
### MEM

The **MEM** stage passes `alu_res` and `write_back_address` into the next stage.
### WB

The **WB** stage writes the `alu_res` back to register file which located at `0x08`.
## Reference
* [Leetcode 762. Prime Number of Set Bits in Binary Representation](https://leetcode.com/problems/prime-number-of-set-bits-in-binary-representation/description/)
* [Quiz1 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
* [Quiz1 of Computer Architecture (2023 Fall)](https://hackmd.io/@sysprog/arch2023-quiz1-sol)
* [Lab1: RV32I Simulator](https://hackmd.io/@sysprog/H1TpVYMdB)
:::danger
Always refer to primary sources, such as official RISC-V documentation.
:::