# Assignment 1: RISC-V Assembly and Instruction Pipeline
Contributed by < [LeoriumDev](https://github.com/LeoriumDev) >
## AI Usage Citation
This assignment was completed with assistance from ChatGPT for refining commit messages, improving wording clarity, inquiring about RISC-V instruction usage (e.g., the format and purpose of `srli`), and explaining assembly structure. All final analysis and conclusions are my own.
## Lab1: RV32I Simulator


## Quiz1 - Problem B
### UF8
> My thoughts + excerpts from the description of Problem B (wording refined with ChatGPT for clarity)
UF8 (Unsigned Float 8-bit) is a compact way to store numbers that covers a huge range ($[0,1{,}015{,}792]$) while using only one byte.
It works kind of like a mini-version of floating-point numbers: rather than storing every exact value, it saves an approximation that’s close enough for many real-world use cases.
UF8 is most useful when range matters more than precision.
For example, sensor readings like temperature, weight, or distance can be stored in UF8 form to reduce memory space on the device.
Because it only uses 8 bits, it can compress 20-bit values by about 2.5 times while keeping results within roughly 6% of the true number.
However, UF8 is not good for things that need perfect accuracy, such as finance or cryptography.
### CLZ (Count Leading Zeros)
My initial thought was to use a bitmask to test the MSB: OR the value with 0x7FFFFFFF and compare the result with the mask to check whether the MSB is zero. If it is, increment the counter. Then, left-shift the value by one bit and continue to the next iteration. Once the value becomes zero (all bits shifted out), return from the function. I also found documentation on using Ripes' environment calls (ecall). [^1]
> Commit: [aaef6bc](https://github.com/sysprog21/ca2025-quizzes/commit/aaef6bcfafb376891b69d86c573e9da59daf3993)
> Processor Mode: Single-cycle Processor
> Cycle Count: 831
```c
.data
mask: .word 0x7FFFFFFF # first bit is zero
bin: .word 0x0000FFFF # expected return value from clz: 16
bin2: .word 0xFFFFFFFF # expected return value from clz: 0
bin3: .word 0x7FFFFFFF # expected return value from clz: 1
.text
.globl main
main:
# Testcase - bin
lw a0, bin # Load the test argument into register a0
jal ra, clz # Jump-and-link to the 'clz' function for counting leading zeros
li a7, 1 # Print values returned from clz
ecall
# Newline, '\n' ASCII code (10_dec)
li a0, 10
li a7, 11
ecall
# Testcase - bin2
lw a0, bin2 # Load the test argument into register a0
jal ra, clz # Jump-and-link to the 'clz' function for counting leading zeros
li a7, 1 # Print values returned from clz
ecall
# Newline, '\n' ASCII code (10_dec)
li a0, 10
li a7, 11 # Print char
ecall
# Testcase - bin3
lw a0, bin3 # Load the test argument into register a0
jal ra, clz # Jump-and-link to the 'clz' function for counting leading zeros
li a7, 1 # Print values returned from clz
ecall
# Exit the program
li a7, 10
ecall
# Count leading zeros; the return value is saved in a0
clz:
addi s0, x0, 0 # Set s0 (saved register) to 0; it serves as the leading-zero counter
clz1:
lw s1, mask # Load the bitmask 0x7FFFFFFF into s1 for testing the MSB
or s2, a0, s1 # OR a0 with the mask: every bit except the MSB becomes 1, so s2 == s1 only if the MSB is 0
beq s1, s2, inc # If s2 equals the mask (MSB of a0 is 0), branch to inc
trailing:
slli a0, a0, 1 # Left-shift a0 by 1
addi s3, x0, 0 # Set s3 to 0, used to test whether all bits have been shifted out
beq a0, s3, end # If a0 is zero, jump to end
jal x0, clz1 # Jump to clz1 for the next iteration
inc:
addi s0, s0, 1 # Increment s0 (clz counter) by 1
jal x0, trailing # Jump to trailing for bit shifting for comparing next bit
end:
add a0, x0, s0 # Save clz counter to a0
ret # return from function clz
```
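This is a purely iterative approach with no algorithmic optimization. Note that the loop keeps counting zero bits after the first set bit and returns 1 for an input of 0, so it is only correct for nonzero inputs whose set bits form one contiguous run, such as the three test cases above.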
However, I noticed some ways the CLZ function could be improved.
First, the code provided by the instructor uses a binary search approach implemented iteratively. (Before I realized this, I thought it was just a simple loop that counts leading zeros.)
```c
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
```
Execution time for this implementation is $O(\log_2n)$, where $n$ is the bit width of the input (five iterations for a 32-bit value).
Second, we can use bit masks to unroll the iterative approach. [^2]
```c
static inline unsigned clz(uint32_t x)
{
if (x == 0) return 32;
char n = 0;
if (x <= 0x0000FFFF) { n += 16; x <<= 16; }
if (x <= 0x00FFFFFF) { n += 8; x <<= 8; }
if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; }
if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; }
if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; }
return n;
}
```
I changed the data type of `n` from int to char: the number of leading zeros is at most 32, so a char is enough to store it.
Next, I implemented the binary search with bitmasks approach:
> Commit: [245d5ec](https://github.com/sysprog21/ca2025-quizzes/commit/245d5ec3de7c848f287d688c742e62514e120c90)
> Processor Mode: Single-cycle Processor
> Cycle Count: 129
```c
.data
mask1: .word 0x0000FFFF
mask2: .word 0x00FFFFFF
mask3: .word 0x0FFFFFFF
mask4: .word 0x3FFFFFFF
mask5: .word 0x7FFFFFFF
bin1: .word 0x0000FFFF # expected return value from clz: 16
bin2: .word 0xFFFFFFFF # expected return value from clz: 0
bin3: .word 0x7FFFFFFF # expected return value from clz: 1
.text
.globl main
main:
# === Testcase bin1 ===
lw a0, bin1 # Load the test argument into register a0
jal ra, clz # Jump-and-link to the 'clz' function for counting leading zeros
li a7, 1 # Print values returned from clz
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase bin2 ===
lw a0, bin2 # Load the test argument into register a0
jal ra, clz # Jump-and-link to the 'clz' function for counting leading zeros
li a7, 1 # Print values returned from clz
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase bin3 ===
lw a0, bin3 # Load the test argument into register a0
jal ra, clz # Jump-and-link to the 'clz' function for counting leading zeros
li a7, 1 # Print values returned from clz
ecall
# Exit
li a7, 10
ecall
# Count Leading Zeros (return value is saved at a0)
# a0: Input argument
clz:
beq a0, x0, check_zero # Check a0 == 0; if true, jump to check_zero for early return
addi sp, sp, -24 # Allocate stack space for local variables
sw s5, 20(sp) # Save for use afterwards
sw s4, 16(sp) # Save for use afterwards
sw s3, 12(sp) # Save for use afterwards
sw s2, 8(sp) # Save for use afterwards
sw s1, 4(sp) # Save for use afterwards
sw s0, 0(sp) # Save for use afterwards
li s0, 0 # Set s0 = 0 for counting leading zeros
lw s1, mask1 # Load the bitmask to register
lw s2, mask2 # Load the bitmask to register
lw s3, mask3 # Load the bitmask to register
lw s4, mask4 # Load the bitmask to register
lw s5, mask5 # Load the bitmask to register
check_16:
bleu a0, s1, less_16 # Check if a0 <= 0x0000FFFF; if true, jump to less_16
check_8:
bleu a0, s2, less_8 # Check if a0 <= 0x00FFFFFF; if true, jump to less_8
check_4:
bleu a0, s3, less_4 # Check if a0 <= 0x0FFFFFFF; if true, jump to less_4
check_2:
bleu a0, s4, less_2 # Check if a0 <= 0x3FFFFFFF; if true, jump to less_2
check_1:
bleu a0, s5, less_1 # Check if a0 <= 0x7FFFFFFF; if true, jump to less_1
j return_clz # Jump to return_clz for restoring saved register and returning to the caller
less_16:
addi s0, s0, 16 # s0 += 16
slli a0, a0, 16 # a0 <<= 16
j check_8
less_8:
addi s0, s0, 8 # s0 += 8
slli a0, a0, 8 # a0 <<= 8
j check_4
less_4:
addi s0, s0, 4 # s0 += 4
slli a0, a0, 4 # a0 <<= 4
j check_2
less_2:
addi s0, s0, 2 # s0 += 2
slli a0, a0, 2 # a0 <<= 2
j check_1
less_1:
addi s0, s0, 1 # s0 += 1
slli a0, a0, 1 # a0 <<= 1
return_clz:
mv a0, s0 # Save s0 (counter) to a0
lw s0, 0(sp) # Restore the original data
lw s1, 4(sp) # Restore the original data
lw s2, 8(sp) # Restore the original data
lw s3, 12(sp) # Restore the original data
lw s4, 16(sp) # Restore the original data
lw s5, 20(sp) # Restore the original data
addi sp, sp, 24 # Deallocate stack space
ret # Return to the caller
check_zero:
li a0, 32 # Set a0 = 32
ret # Return to the caller
```
The cycle count dropped from 831 to 129 by using the bitmask approach, resulting in more than six times fewer instructions to execute.
During the coding process, I reviewed the factorial example from [Lab1: RV32I Simulator](https://hackmd.io/@sysprog/H1TpVYMdB#Example-Factorial-Calculation) and was reminded of the RISC-V calling conventions.
In my previous code, I neither allocated stack space nor saved the callee-saved registers I used. Since these registers must be preserved by the callee, we need to adjust the stack pointer to make room for them on the stack. In this “bitmask” version, I corrected that mistake.
In addition, I added meaningful comments to my assembly program and formatted it to be as readable as the program in Lab1.
### uf8_decode
The code in the quiz is as follows:
```c
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
uint32_t mantissa = fl & 0x0f;
uint8_t exponent = fl >> 4;
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
return (mantissa << exponent) + offset;
}
```
In fact, we don't need to allocate local variables such as mantissa, exponent, and offset since this function can be simplified into one line of code as below.
```c
uint32_t uf8_decode(uf8 fl)
{
return ((fl & 0x0f) << (fl >> 4)) + ((0x7FFF >> (15 - (fl >> 4))) << 4);
}
```
In the simplified code, we only need to manipulate the value `fl`.
We can thus write the equivalent RISC-V assembly accordingly.
> Commit: [7e2fcbb](https://github.com/sysprog21/ca2025-quizzes/commit/7e2fcbbebda69d082ff5e875f82de84f3e0c81b4)
> Processor Mode: Single-cycle Processor
> Cycle Count: 60 (189 - 129 = 60)
```c
.data
...
# DEC
byte1: .word 0x000000FF # expected return value from dec: 1015792
byte2: .word 0x00000055 # expected return value from dec: 656
byte3: .word 0x00000007 # expected return value from dec: 7
.text
.globl main
main:
...
# =====================
# uf8_decode
# =====================
# === Testcase byte1 ===
lw a0, byte1 # Load the test byte into register a0
jal ra, dec # Jump-and-link to the 'dec' function for decoding uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase byte2 ===
lw a0, byte2 # Load the test byte into register a0
jal ra, dec # Jump-and-link to the 'dec' function for decoding uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase byte3 ===
lw a0, byte3 # Load the test byte into register a0
jal ra, dec # Jump-and-link to the 'dec' function for decoding uf8
li a7, 1 # Print integer
ecall
# Exit
li a7, 10
ecall
...
# Decode uf8 to uint32_t (return value is saved at a0)
# a0: Input argument
dec:
mv t0, a0 # Save a0 (argument) for calculating exponent
srli t0, t0, 4 # Save exponent (fl >> 4) to t0
andi a0, a0, 0x0f # Perform (fl & 0x0f)
sll a0, a0, t0 # Perform (fl & 0x0f) << (fl >> 4)
li t1, 15 # Save constant to t1 for calculating (15 - (fl >> 4))
sub t0, t1, t0 # Perform (15 - (fl >> 4)) and save the result to t0
li t1, 0x7FFF # Save constant to t1 for calculating (0x7FFF >> (15 - (fl >> 4)))
srl t0, t1, t0 # Perform (0x7FFF >> (15 - (fl >> 4))) and save it to t0
slli t0, t0, 4 # Perform ((0x7FFF >> (15 - (fl >> 4))) << 4)
add a0, a0, t0 # Add up a0 and t0 and save to a0
ret
```
For the implementation of `decode`, I used temporary registers instead of allocating some stack space and using saved registers.
Also, I created an accompanying C program to find the correct output value of each test case, so that I could make sure the RISC-V assembly I wrote is correct. In addition, this C program verifies that the single-line code works the same as the original one.
> Commit: [07ce1ba](https://github.com/sysprog21/ca2025-quizzes/commit/07ce1baed4de96445e369cb9893b934b5e2e8dc2)
```c
#include <stdint.h>
#include <stdio.h>
typedef uint8_t uf8;
uint32_t uf8_decode(uf8 fl)
{
uint32_t mantissa = fl & 0x0f;
uint8_t exponent = fl >> 4;
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
return (mantissa << exponent) + offset;
}
uint32_t uf8_decode_simple(uf8 fl)
{
return ((fl & 0x0f) << (fl >> 4)) + ((0x7FFF >> (15 - (fl >> 4))) << 4);
}
void print_bin(uint32_t bin) {
for (int i = 31; i >= 0; i--) {
char b = (bin >> i & 0x1) == 0x1 ? '1' : '0';
putchar(b);
}
putchar('\n');
}
int main(void) {
uint8_t testcase = 0x07;
print_bin(uf8_decode(testcase));
printf("%u\n", (unsigned) uf8_decode(testcase));
print_bin(uf8_decode_simple(testcase));
printf("%u\n", (unsigned) uf8_decode_simple(testcase));
return 0;
}
```
### uf8_encode
The implementation provided by the instructor is as follows:
```c
uf8 uf8_encode(uint32_t value)
{
/* Use CLZ for fast exponent calculation */
if (value < 16)
return value;
/* Find appropriate exponent using CLZ hint */
int lz = clz(value);
int msb = 31 - lz;
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
/* Estimate exponent - the formula is empirical */
exponent = msb - 4;
if (exponent > 15)
exponent = 15;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
/* Adjust if estimate was off */
while (exponent > 0 && value < overflow) {
overflow = (overflow - 16) >> 1;
exponent--;
}
}
/* Find exact exponent */
while (exponent < 15) {
uint32_t next_overflow = (overflow << 1) + 16;
if (value < next_overflow)
break;
overflow = next_overflow;
exponent++;
}
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
}
```
I first did a line-by-line translation from C to RISC-V assembly.
```c
# Encode uint32_t to uf8 (return value is saved at a0)
# a0: Input argument
enc:
li t0, 16 # Load 16 to t0 for performing early-return
bltu a0, t0, e_ret_enc # if (value < 16)
addi sp, sp, -12 # Allocate stack space to store local variables
sw a0, 8(sp) # Save a0 to stack to prevent data loss
jal ra, clz # Call clz function, return value is saved at a0
mv a1, a0 # Copy the return value to a1 (a0 = clz(a0))
lw a0, 8(sp) # Restore value from the stack (a0 is the argument)
li t0, 31 # Load 31 to t0 for computing msb = 31 - a0
sub a0, t0, a0 # Perform msb = 31 - a0 and save result to a0
sw s1, 4(sp) # Save s1 (overflow) to stack, restore when function ends
sw s0, 0(sp) # Save s0 (exponent) to stack, restore when function ends
li s0, 0 # s0 is exponent
li s1, 0 # s1 is overflow
li t0, 5 # Load 5 to t0 for performing msb >= 5
bge a0, 5, ge_5 # Perform msb >= 5
exact_exp:
li t0, 15 # Load 15 to t0 for performing exponent < 15
bge s0, t0, mant # when while (exponent < 15) is false jump to mant
slli t0, s0, 1 # next_overflow = (overflow << 1)
addi t0, x0, 16 # next_overflow = next_overflow + 16
lw t1, 8(sp) # Load value to t1
blt t1, t0, mant # if (value < next_overflow) then break
mv s1, t0 # overflow = next_overflow
addi s0, x0, 1 # exponent++
j exact_exp
mant:
sub t0, t1, s1 # mantissa = (value - overflow)
srl t0, t0, s0 # mantissa = mantissa>> exponent
slli t1, s0, 4 # t1 = (exponent << 4)
or a0, t1, t0 # (exponent << 4) | mantissa
ret
e_ret_enc:
ret
ge_5:
li t0, 4 # Load 4 to t0 for subtraction
sub, s0, a0, t0 # exponent (s0) = msb (a0) - 4
li t0, 15 # Load 15 to t0 for comparison
bgt s0, t0, gt_15 # if (exponent > 15)
e_init:
li t0, 0 # for-loop variable e
uf8_overflow:
bge t0, s0, uf8_est_off # for-loop condition: e < exponent
slli s1, s1 1 # overflow = (overflow << 1)
addi s1, s1, 16 # overflow += 16
j uf8_overflow
uf8_est_off:
bleu s0, x0, ret_msb
lw t2, 8(sp) # Load value to t2
bge t2, s1, ret_msb
addi s1, s1, -16 # overflow = (overflow - 16)
srli s1, s1, 1 # overflow >>= 1
addi s0, s0, -1 # exponent--
j uf8_est_off
ret_msb:
j exact_exp
gt_15:
li s0, 15 # exponent = 15
j e_init # Jump back to the next line of bgt
```
(To be clear, I haven’t tested my first attempt at the `uf8_encode` RISC-V assembly, since I came up with an improvement right after finishing the translation. Therefore, there may be bugs in the code above.)
Later, I realized that I could simply use temporary registers (in some places) instead of managing the stack pointer, allocation, and deallocation, since temporary registers are saved by the caller, whereas saved registers must be preserved by the callee.
During the revision of my code, I consulted ChatGPT, which suggested that I follow the RISC-V procedure calling conventions. It cited the statement, "In the standard RISC-V calling convention, the stack grows downward and the stack pointer is always kept 16-byte aligned." [^3] (I specifically and explicitly asked ChatGPT for a conceptual idea of assembly philosophy and some proofs of concept; I am by no means cognitively offloading to AI.)
However, this reference appeared unofficial, so I searched further and found the [RISC-V ELF psABI Document](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/tree/master), which contains a dedicated section elaborating on the [RISC-V Calling Conventions](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc). (In fact, I first checked the [official RISC-V specification](https://docs.riscv.org/reference/isa/_attachments/riscv-unprivileged.pdf), which noted that the calling convention section had been moved to the psABI.)
Note: I forgot to commit each individual improvement, but I've documented the changes here.
Overall, I've made several improvements:
- Use temporary registers where possible. (One exception is when calling another function: that function might overwrite temporary registers, so it is the caller's job to preserve the values in them.)
- Follow the RISC-V calling conventions by adjusting the amount of allocated stack space to be a multiple of 16.
- Avoid double labels for a for-loop. If there’s a branch right before the for-loop, you often add one label to jump back and another to skip the loop’s initialization (e.g., int i = 0). You can simplify this by moving the initialization before the preceding branch. Then the loop needs only a single label, eliminating the extra label. (I did a bit of research and found that this technique is called [hoisting](https://en.wikipedia.org/wiki/Loop-invariant_code_motion).)
```diff
ge_5:
...
li t4, 15 # Load 15 to t4 for comparison
bgt t1, t4, gt_15 # if (exponent > 15)
- ovf_init:
- li t3, 0 # for-loop variable e for calculating overflow
uf8_ovf:
...
gt_15:
li t1, 15 # exponent = 15
- j ovf_init # Jump back to the next line of bgt
***
ge_5:
...
+ li t3, 0 # for-loop variable e for calculating overflow (hoisting)
li t4, 15 # Load 15 to t4 for comparison
bgt t1, t4, gt_15 # if (exponent > 15)
uf8_ovf:
...
gt_15:
li t1, 15 # exponent = 15
+ j uf8_ovf # Jump back to the next line of bgt
- Reverse the branch logic so that normal exponents fall through (skip the clamp), and only values greater than 15 execute the assignment `exponent = 15`.
```diff
ge_5:
...
li t3, 0 # for-loop variable e for calculating overflow (hoisting)
li t4, 15 # Load 15 to t4 for comparison
- bgt t1, t4, gt_15 # if (exponent > 15)
uf8_ovf:
...
- gt_15:
- li t1, 15 # exponent = 15
- j uf8_ovf # Jump back to the next line of bgt
***
ge_5:
...
li t3, 0 # for-loop variable e for calculating overflow (hoisting)
li t4, 15 # Load 15 to t4 for comparison
+ bleu t1, t4, uf8_ovf # if (exponent <= 15), skip the clamp
+ li t1, 15 # exponent = 15
uf8_ovf:
...
- The for-loop that calculates the overflow for the estimated exponent can be simplified. It repeats `overflow = (overflow << 1) + 16;` $x$ times, where $x$ is the value of `exponent`. Since this operation is a recurrence, we can guess a closed form and prove it by [mathematical induction](https://en.wikipedia.org/wiki/Mathematical_induction).
We can represent the overall operation as a nested expression:
$$
\text{OVF}_x
= \underbrace{((\cdots((}_{x\text{ times}}
\text{OVF}_0 \ll 1) + 16) \ll 1) + 16) \cdots \ll 1) + 16
$$
The subscript $_0$ denotes the initial overflow value, whereas $_x$ denotes the result after the operation has been applied $x$ times. OVF stands for overflow.
Next, we can rewrite $\text{OVF}$ as a recurrence relation ($O_n$).
$$
O_{n+1} = 2O_n + 16
$$
Expanding the first few terms suggests a general form:
$$
\begin{align}
\text{Given}\;O_0,\;O_1& = 2O_0 + 16 \\
O_2& = 2O_1 + 16 = 2\times(2O_0 + 16) + 16 = 4O_0 + 48 \\
O_3& = 2O_2 + 16 = 2\times(2O_1 + 16) + 16 = 4O_1 + 48 = 8O_0 + 112 \\
&\;\;\vdots \notag \\
O_n& = 2^nO_0 + 16\times(2^n-1)
\end{align}
$$
Thus, we have our guess of the general form. We now use mathematical induction to prove it.
\begin{align}
&\textbf{Claim:}\quad O_{n+1} = 2O_n + 16,\; O_0 \text{ given}\;\Rightarrow\;O_n = 2^n O_0 + 16(2^n - 1)\quad \forall n\ge 0. \\
\\
&\textbf{Proof By Induction:} \\
\\
&\text{Base case (n = 0):}\;O_0 = 2^0O_0+16(2^0-1) = 2^0O_0 = O_0\quad\text{(Correct)} \\
&\text{Induction step }(n = k \Rightarrow n = k + 1)\text{:}\;\text{Assume}\;O_k = 2^k O_0 + 16(2^k - 1)\;\text{for some}\;k\ge 0.\;\text{Then,}
\end{align}
\begin{align}
O_{k+1}
&= 2O_k + 16 \\
&= 2(2^kO_0+16(2^k-1))+16 \\
&= 2^{k+1}O_0+32\times2^k-32+16 \\
&= 2^{k+1}O_0+16\times2^{k+1}-16 \\
&= 2^{k+1}O_0+16(2^{k+1}-1).
\end{align}
\begin{align}
\text{Thus, }
& O_{k+1} \text{ has the claimed form whenever } O_k \text{ does.}
\\[1em]
\textbf{Conclusion:}\quad
& \text{The base case and the induction step both hold.}\\
& \text{Therefore, by induction,}\\
&O_n = 2^n O_0 + 16(2^n - 1)\quad \forall n \ge 0.
\end{align}
From the proof above, the for-loop is equivalent to $O_x = 2^x O_0 + 16(2^x - 1)$, where $x$ is the value of exponent.
We can write the equivalent expression as a C statement. Note that the parentheses are required because `+` and `-` bind more tightly than `<<` in C.
```c
overflow = (overflow << exponent) + (16 << exponent) - 16;
```
We can combine the terms:
```c
overflow = ((overflow + 16) << exponent) - 16;
```
Because overflow is initialized to 0 and nothing changes it before this loop, we can further simplify it into:
```c
overflow = (16 << exponent) - 16;
```
From now on, we can *finally* translate it to RISC-V assembly.
```diff
ge_5:
# t0 = msb, t1 = exponent, t2 = overflow
...
- li t3, 0 # for-loop variable e for calculating overflow (hoisting)
...
uf8_ovf:
- bge t3, t1, uf8_est_off # for-loop condition: e < exponent (reverse)
- slli t2, t2, 1 # overflow = (overflow << 1)
- addi t2, t2, 16 # overflow += 16
- j uf8_ovf
***
uf8_ovf:
+ li t3, 16 # Load 16 to t3 for bit manipulation
+ sll t4, t3, t1 # t4 = (16 << exponent)
+ sub t2, t4, t3 # overflow = (16 << exponent) - 16;
```
Afterwards, I wrote some test cases to verify the correctness of my program. One thing hit me: it never ended. I first found a few typos in the register names, but the problem persisted. I then stepped through the program in Ripes and realized that I had forgotten to save the return address of `enc`. Since `enc` calls CLZ, which overwrites `ra`, I must save `ra` (here, in a saved register) to preserve where `enc` should return to. Below is the line of code where things went wrong.

After roughly an hour, it worked! But the result was not what I expected. It turned out that I had forgotten to mask the return value (a0) down to one byte, that is, to AND it with 0xFF to clear the higher bits. After I added that to the program, it worked correctly. Hooray~
> Commit: [dd15b88](https://github.com/sysprog21/ca2025-quizzes/commit/dd15b886153cb0ca5c4bb56fbb30294eb564d269)
> Processor Mode: Single-cycle Processor
> Cycle Count: 255 (444 - 189 = 255)
```c
.data
...
# ENC
bin4: .word 0x12345678 # expected return value from enc: 248
bin5: .word 0x55553333 # expected return value from enc: 250
bin6: .word 0x01010101 # expected return value from enc: 242
.text
.globl main
main:
...
# =====================
# uf8_encode
# =====================
# === Testcase bin4 ===
lw a0, bin4 # Load the test binary into register a0
jal ra, enc # Jump-and-link to the 'enc' function for encoding to uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase bin5 ===
lw a0, bin5 # Load the test binary into register a0
jal ra, enc # Jump-and-link to the 'enc' function for encoding to uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase bin6 ===
lw a0, bin6 # Load the test binary into register a0
jal ra, enc # Jump-and-link to the 'enc' function for encoding to uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# Exit
li a7, 10
ecall
...
# Encode uint32_t to uf8 (return value is saved at a0)
# a0: Input argument
enc:
li t0, 16 # Load 16 to t0 for performing early-return
bltu a0, t0, e_ret_enc # if (value < 16)
addi sp, sp, -16 # Allocate stack space to store local variables
sw a0, 12(sp) # Save a0's data to stack to prevent data loss
sw s0, 8(sp) # Save s0's data to stack to prevent data loss
mv s0, ra # Save ra to s0
jal ra, clz # Call CLZ function, return value is saved at a0
mv t0, a0 # lz = clz(value), t0 represents lz
lw a0, 12(sp) # Restore value from the stack (a0 is the argument)
mv ra, s0 # Restore value from s0
lw s0, 8(sp) # Restore value from the stack
addi sp, sp, 16 # Deallocate stack space
li t1, 31 # Load 31 to t1 for computing msb = 31 - lz
sub t0, t1, t0 # Perform msb = 31 - lz and save the result to t0; t0 now represents msb
li t1, 0 # uint8_t exponent = 0; (t1)
li t2, 0 # uint32_t overflow = 0; (t2)
li t3, 5 # Load 5 to t3 for performing if (msb >= 5)
bgeu t0, t3, ge_5 # Perform msb >= 5
exact_exp:
# a0 = value, t0 = msb, t1 = exponent, t2 = overflow
li t3, 15 # Load 15 to t3 for performing the inverse of (exponent < 15)
bgeu t1, t3, mant # when while (exponent < 15) is false jump to mant
slli t3, t2, 1 # next_overflow = (overflow << 1)
addi t3, t3, 16 # next_overflow = next_overflow + 16
bltu a0, t3, mant # if (value < next_overflow) then break
mv t2, t3 # overflow = next_overflow
addi t1, t1, 1 # exponent++
j exact_exp
mant:
sub t3, a0, t2 # mantissa = (value - overflow)
srl t3, t3, t1 # mantissa = mantissa >> exponent
slli t4, t1, 4 # t4 = (exponent << 4)
or a0, t3, t4 # (exponent << 4) | mantissa
li t3, 0xFF # Make a bitmask that mask the least significant byte
and a0, a0, t3 # AND it with a0 to make sure no garbage values remain
ret
e_ret_enc:
ret # early return
ge_5:
# a0 = value, t0 = msb, t1 = exponent, t2 = overflow
li t3, 4 # Load 4 to t3 for subtraction
sub t1, t0, t3 # exponent = msb - 4;
li t3, 15 # Load 15 to t3 for comparison
bleu t1, t3, uf8_ovf # Invert if (exponent > 15): if exponent <= 15, jump past the next line
li t1, 15 # exponent = 15
uf8_ovf:
li t3, 16 # Load 16 to t3 for bit manipulation
sll t4, t3, t1 # t4 = (16 << exponent)
sub t2, t4, t3 # overflow = (16 << exponent) - 16;
uf8_est_off:
bleu t1, x0, ret_msb # Invert (exponent > 0)
bgeu a0, t2, ret_msb # Invert (value < overflow)
addi t2, t2, -16 # overflow = (overflow - 16)
srli t2, t2, 1 # overflow >>= 1
addi t1, t1, -1 # exponent--
j uf8_est_off
ret_msb:
j exact_exp
```
### test
I made a few mistakes while implementing the test function:
1. My advice is not to write the easy part at the bottom first without knowing how the variables above it are stored. This bit me when implementing the test function: I first wrote the condition at the bottom and did not update its register when I switched to a different approach above.
2. I was being a smart alec when trying to optimize the for-loop that iterates from 0 to 255. I tried a count-down loop that reverses the loop condition. However, I overlooked that the logic inside the for-loop _does_ depend on `i`, so when I ran the program, errors flooded my console output. After realizing this mistake, I corrected the program.
> Commit: [0714c10](https://github.com/sysprog21/ca2025-quizzes/commit/0714c1046522f97ec9a7522560f54a434cb7ac3d)
> Processor Mode: Single-cycle Processor
> Cycle Count: 25511 (25955 - 444 = 25511)
```c
.data
...
# Test
str1: .string ": produces value "
str2: .string " but encodes back to "
str3: .string ": value "
str4: .string " <= previous_value "
str5: .string "All tests passed.\n"
.text
.globl main
main:
...
# =====================
# test
# =====================
jal ra, test # Jump-and-link to the 'test' function for verifying the correctness of this assembly program
beq a0, x0, exit # If test returned false (0), skip printing str5
la a0 str5 # Load str5's address to a0 for printing
li a7 4 # print str5
ecall
exit:
# Exit
li a7, 10
ecall
...
test:
addi sp, sp, -16 # Allocate stack space for storing local variables
sw s0, 12(sp) # Save s0's data to the stack
sw s1, 8(sp) # Save s1's data to the stack
sw s2, 4(sp) # Save s2's data to the stack
sw s3, 0(sp) # Save s3's data to the stack
mv s0, ra # Save the return address for this function to s0
li s1, -1 # int32_t previous_value = -1;
li s2, 1 # bool passed = true;
li s3, 0 # i = 0;
testcases:
mv a0, s3 # uint8_t fl = i;
jal ra, dec # int32_t value = uf8_decode(fl); (return value is stored at a0)
mv t6, a0 # Save value to t6
jal ra, enc # uint8_t fl2 = uf8_encode(value); (return value is stored at a0)
mv ra, s0 # Restore ra from s0
# | s0: ra | s1: previous_value | s2: passed | s3: i | t6: value | a0: fl2 |
fl_eq:
beq s3, a0, val_cmp # Invert if (fl != fl2)
mv s0, a0 # Save a0 to s0
mv a0, s3 # Copy i to a0 for printing
li a7 34 # print i in hex form
ecall
la a0 str1 # Load str1's address to a0 for printing
li a7 4 # print str1
ecall
mv a0, t6 # Copy value to a0 for printing
li a7 1 # print value in int form
ecall
la a0 str2 # Load str2's address to a0 for printing
li a7 4 # print str2
ecall
mv a0, s0 # Restore a0 (fl2) from s0
li a7 34 # print fl2 in hex form
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
mv a0, s0 # Restore a0 (fl2) from s0
li s2, 0 # passed = false;
# | s0: ra | s1: previous_value | s2: passed | s3: i | t6: value | a0: fl2 |
val_cmp:
bgt t6, s1, next_it # Invert if (value <= previous_value)
mv s0, a0 # Save a0 to s0
mv a0, s3 # Copy i to a0 for printing
li a7 34 # print i in hex form
ecall
la a0 str3 # Load str3's address to a0 for printing
li a7 4 # print str3
ecall
mv a0, t6 # Copy value to a0 for printing
li a7 1 # print value in int form
ecall
la a0 str4 # Load str4's address to a0 for printing
li a7 4 # print str4
ecall
mv a0, s1 # Copy previous_value to a0 for printing
li a7 34 # print previous_value in hex form
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
mv a0, s0 # Restore a0 (fl2) from s0
li s2, 0 # passed = false;
next_it:
mv s1, t6 # previous_value = value;
addi s3, s3, 1 # i++
li t0, 256 # Load 256 to t0 for comparison
blt s3, t0, testcases # i < 256
end_test:
mv a0, s2 # save return value "passed" to a0
lw s3, 0(sp) # Restore s3 from the stack
lw s2, 4(sp) # Restore s2 from the stack
lw s1, 8(sp) # Restore s1 from the stack
lw s0, 12(sp) # Restore s0 from the stack
addi sp, sp, 16 # Deallocate stack space
ret
```
### Complete RISC-V Translation of the UF8 C Program from Quiz 1
```c
.data
# CLZ
mask1: .word 0x0000FFFF
mask2: .word 0x00FFFFFF
mask3: .word 0x0FFFFFFF
mask4: .word 0x3FFFFFFF
mask5: .word 0x7FFFFFFF
bin1: .word 0x0000FFFF # expected return value from clz: 16
bin2: .word 0xFFFFFFFF # expected return value from clz: 0
bin3: .word 0x7FFFFFFF # expected return value from clz: 1
# DEC
byte1: .word 0x000000FF # expected return value from dec: 1015792
byte2: .word 0x00000055 # expected return value from dec: 656
byte3: .word 0x00000007 # expected return value from dec: 7
# ENC
bin4: .word 0x12345678 # expected return value from enc: 248
bin5: .word 0x55553333 # expected return value from enc: 250
bin6: .word 0x01010101 # expected return value from enc: 242
# Test
str1: .string ": produces value "
str2: .string " but encodes back to "
str3: .string ": value "
str4: .string " <= previous_value "
str5: .string "All tests passed.\n"
.text
.globl main
main:
# =====================
# CLZ
# =====================
# === Testcase bin1 ===
lw a0, bin1 # Load the test argument into register a0
jal ra, clz # Jump-and-link to the 'clz' function for counting leading zeros
li a7, 1 # Print values returned from clz
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase bin2 ===
lw a0, bin2 # Load the test argument into register a0
jal ra, clz # Jump-and-link to the 'clz' function for counting leading zeros
li a7, 1 # Print values returned from clz
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase bin3 ===
lw a0, bin3 # Load the test argument into register a0
jal ra, clz # Jump-and-link to the 'clz' function for counting leading zeros
li a7, 1 # Print values returned from clz
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# =====================
# uf8_decode
# =====================
# === Testcase byte1 ===
lw a0, byte1 # Load the test byte into register a0
jal ra, dec # Jump-and-link to the 'dec' function for decoding uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase byte2 ===
lw a0, byte2 # Load the test byte into register a0
jal ra, dec # Jump-and-link to the 'dec' function for decoding uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase byte3 ===
lw a0, byte3 # Load the test byte into register a0
jal ra, dec # Jump-and-link to the 'dec' function for decoding uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# =====================
# uf8_encode
# =====================
# === Testcase bin4 ===
lw a0, bin4 # Load the test binary into register a0
jal ra, enc # Jump-and-link to the 'enc' function for encoding to uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase bin5 ===
lw a0, bin5 # Load the test binary into register a0
jal ra, enc # Jump-and-link to the 'enc' function for encoding to uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# === Testcase bin6 ===
lw a0, bin6 # Load the test binary into register a0
jal ra, enc # Jump-and-link to the 'enc' function for encoding to uf8
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
# =====================
# test
# =====================
jal ra, test # Jump-and-link to the 'test' function for verifying the correctness of this assembly program
beq a0, x0, exit # If test returned false, skip printing str5
la a0, str5 # Load str5's address to a0 for printing
li a7, 4 # Print str5
ecall
exit:
# Exit
li a7, 10
ecall
# Count Leading Zeros (return value is saved at a0)
# a0: Input argument
clz:
beq a0, x0, check_zero # Check a0 == 0; if true, jump to check_zero for early return
addi sp, sp, -24 # Allocate stack space for local variables
sw s5, 20(sp) # Save for use afterwards
sw s4, 16(sp) # Save for use afterwards
sw s3, 12(sp) # Save for use afterwards
sw s2, 8(sp) # Save for use afterwards
sw s1, 4(sp) # Save for use afterwards
sw s0, 0(sp) # Save for use afterwards
li s0, 0 # Set s0 = 0 for counting leading zeros
lw s1, mask1 # Load the bitmask to register
lw s2, mask2 # Load the bitmask to register
lw s3, mask3 # Load the bitmask to register
lw s4, mask4 # Load the bitmask to register
lw s5, mask5 # Load the bitmask to register
check_16:
bleu a0, s1, less_16 # Check if a0 <= 0x0000FFFF; if true, jump to less_16
check_8:
bleu a0, s2, less_8 # Check if a0 <= 0x00FFFFFF; if true, jump to less_8
check_4:
bleu a0, s3, less_4 # Check if a0 <= 0x0FFFFFFF; if true, jump to less_4
check_2:
bleu a0, s4, less_2 # Check if a0 <= 0x3FFFFFFF; if true, jump to less_2
check_1:
bleu a0, s5, less_1 # Check if a0 <= 0x7FFFFFFF; if true, jump to less_1
j return_clz # Jump to return_clz for restoring saved register and returning to the caller
less_16:
addi s0, s0, 16 # s0 += 16
slli a0, a0, 16 # a0 <<= 16
j check_8
less_8:
addi s0, s0, 8 # s0 += 8
slli a0, a0, 8 # a0 <<= 8
j check_4
less_4:
addi s0, s0, 4 # s0 += 4
slli a0, a0, 4 # a0 <<= 4
j check_2
less_2:
addi s0, s0, 2 # s0 += 2
slli a0, a0, 2 # a0 <<= 2
j check_1
less_1:
addi s0, s0, 1 # s0 += 1
slli a0, a0, 1 # a0 <<= 1
return_clz:
mv a0, s0 # Save s0 (counter) to a0
lw s0, 0(sp) # Restore the original data
lw s1, 4(sp) # Restore the original data
lw s2, 8(sp) # Restore the original data
lw s3, 12(sp) # Restore the original data
lw s4, 16(sp) # Restore the original data
lw s5, 20(sp) # Restore the original data
addi sp, sp, 24 # Deallocate stack space
ret # Return to the caller
check_zero:
li a0, 32 # Set a0 = 32
ret # Return to the caller
# Decode uf8 to uint32_t (return value is saved at a0)
# a0: Input argument
dec:
mv t0, a0 # Save a0 (argument) for calculating exponent
srli t0, t0, 4 # Save exponent (fl >> 4) to t0
andi a0, a0, 0x0f # Perform (fl & 0x0f)
sll a0, a0, t0 # Perform (fl & 0x0f) << (fl >> 4)
li t1, 15 # Save constant to t1 for calculating (15 - (fl >> 4))
sub t0, t1, t0 # Perform (15 - (fl >> 4)) and save the result to t0
li t1, 0x7FFF # Save constant to t1 for calculating (0x7FFF >> (15 - (fl >> 4)))
srl t0, t1, t0 # Perform (0x7FFF >> (15 - (fl >> 4))) and save it to t0
slli t0, t0, 4 # Perform ((0x7FFF >> (15 - (fl >> 4))) << 4)
add a0, a0, t0 # Add up a0 and t0 and save to a0
ret
# Encode uint32_t to uf8 (return value is saved at a0)
# a0: Input argument
enc:
li t0, 16 # Load 16 to t0 for performing early-return
bltu a0, t0, e_ret_enc # if (value < 16)
addi sp, sp, -16 # Allocate stack space to store local variables
sw a0, 12(sp) # Save a0's data to stack to prevent data loss
sw s0, 8(sp) # Save s0's data to stack to prevent data loss
mv s0, ra # Save ra to s0
jal ra, clz # Call CLZ function, return value is saved at a0
mv t0, a0 # lz = clz(value), t0 represents lz
lw a0, 12(sp) # Restore value from the stack (a0 is the argument)
mv ra, s0 # Restore value from s0
lw s0, 8(sp) # Restore value from the stack
addi sp, sp, 16 # Deallocate stack space
li t1, 31 # Load 31 to t1 for computing msb = 31 - lz
sub t0, t1, t0 # msb = 31 - lz; t0 now represents msb
li t1, 0 # uint8_t exponent = 0; (t1)
li t2, 0 # uint32_t overflow = 0; (t2)
li t3, 5 # Load 5 to t3 for performing if (msb >= 5)
bgeu t0, t3, ge_5 # if (msb >= 5), jump to ge_5
exact_exp:
# a0 = value, t0 = msb, t1 = exponent, t2 = overflow
li t3, 15 # Load 15 to t3 for the inverse of (exponent < 15)
bgeu t1, t3, mant # When (exponent < 15) is false, jump to mant
slli t3, t2, 1 # next_overflow = (overflow << 1)
addi t3, t3, 16 # next_overflow = next_overflow + 16
bltu a0, t3, mant # if (value < next_overflow) then break
mv t2, t3 # overflow = next_overflow
addi t1, t1, 1 # exponent++
j exact_exp
mant:
sub t3, a0, t2 # mantissa = (value - overflow)
srl t3, t3, t1 # mantissa = mantissa >> exponent
slli t4, t1, 4 # t4 = (exponent << 4)
or a0, t3, t4 # (exponent << 4) | mantissa
li t3, 0xFF # Bitmask that keeps only the least significant byte
and a0, a0, t3 # AND it with a0 to make sure no garbage values remain
ret
e_ret_enc:
ret # early return
ge_5:
# a0 = value, t0 = msb, t1 = exponent, t2 = overflow
li t3, 4 # Load 4 to t3 for subtraction
sub t1, t0, t3 # exponent = msb - 4;
li t3, 15 # Load 15 to t3 for comparison
bleu t1, t3, uf8_ovf # Inverse of if (exponent > 15): when exponent <= 15, skip the clamp on the next line
li t1, 15 # exponent = 15
uf8_ovf:
li t3, 16 # Load 16 to t3 for bit manipulation
sll t4, t3, t1 # t4 = (16 << exponent)
sub t2, t4, t3 # overflow = (16 << exponent) - 16;
uf8_est_off:
bleu t1, x0, ret_msb # Invert (exponent > 0)
bgeu a0, t2, ret_msb # Invert (value < overflow)
addi t2, t2, -16 # overflow = (overflow - 16)
srli t2, t2, 1 # overflow >>= 1
addi t1, t1, -1 # exponent--
j uf8_est_off
ret_msb:
j exact_exp
test:
addi sp, sp, -16 # Allocate stack space for storing local variables
sw s0, 12(sp) # Save s0's data to the stack
sw s1, 8(sp) # Save s1's data to the stack
sw s2, 4(sp) # Save s2's data to the stack
sw s3, 0(sp) # Save s3's data to the stack
mv s0, ra # Save the return address for this function to s0
li s1, -1 # int32_t previous_value = -1;
li s2, 1 # bool passed = true;
li s3, 0 # i = 0;
testcases:
mv a0, s3 # uint8_t fl = i;
jal ra, dec # int32_t value = uf8_decode(fl); (return value is stored at a0)
mv t6, a0 # Save value to t6
jal ra, enc # uint8_t fl2 = uf8_encode(value); (return value is stored at a0)
mv ra, s0 # Restore ra from s0
# | s0: ra | s1: previous_value | s2: passed | s3: i | t6: value | a0: fl2 |
fl_eq:
beq s3, a0, val_cmp # Invert if (fl != fl2)
mv t5, a0 # Save a0 (fl2) to t5; s0 must keep holding ra
mv a0, s3 # Copy i to a0 for printing
li a7, 34 # Print i in hex form
ecall
la a0, str1 # Load str1's address to a0 for printing
li a7, 4 # Print str1
ecall
mv a0, t6 # Copy value to a0 for printing
li a7, 1 # Print value in int form
ecall
la a0, str2 # Load str2's address to a0 for printing
li a7, 4 # Print str2
ecall
mv a0, t5 # Copy fl2 to a0 for printing
li a7, 34 # Print fl2 in hex form
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
mv a0, t5 # Restore a0 (fl2) from t5
li s2, 0 # passed = false;
# | s0: ra | s1: previous_value | s2: passed | s3: i | t6: value | a0: fl2 |
val_cmp:
bgt t6, s1, next_it # Invert if (value <= previous_value)
mv t5, a0 # Save a0 (fl2) to t5; s0 must keep holding ra
mv a0, s3 # Copy i to a0 for printing
li a7, 34 # Print i in hex form
ecall
la a0, str3 # Load str3's address to a0 for printing
li a7, 4 # Print str3
ecall
mv a0, t6 # Copy value to a0 for printing
li a7, 1 # Print value in int form
ecall
la a0, str4 # Load str4's address to a0 for printing
li a7, 4 # Print str4
ecall
mv a0, s1 # Copy previous_value to a0 for printing
li a7, 34 # Print previous_value in hex form
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
mv a0, t5 # Restore a0 (fl2) from t5
li s2, 0 # passed = false;
next_it:
mv s1, t6 # previous_value = value;
addi s3, s3, 1 # i++
li t0, 256 # Load 256 to t0 for comparison
blt s3, t0, testcases # i < 256
end_test:
mv a0, s2 # Return value: passed
lw s3, 0(sp) # Restore s3 from the stack
lw s2, 4(sp) # Restore s2 from the stack
lw s1, 8(sp) # Restore s1 from the stack
lw s0, 12(sp) # Restore s0 from the stack
addi sp, sp, 16 # Deallocate stack space
ret
```
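The mask comparisons in `clz` can be cross-checked with a small C model. This is a sketch under the assumption that each `bleu` against a mask corresponds to one unsigned `<=` test; `clz_masks` is a hypothetical name, and the comments point at the matching assembly labels.

```c
#include <stdint.h>

/* C model of the branch structure of the `clz` routine above. */
unsigned clz_masks(uint32_t x)
{
    unsigned count = 0;

    if (x == 0)
        return 32;                                    /* check_zero */
    if (x <= 0x0000FFFFu) { count += 16; x <<= 16; }  /* less_16 */
    if (x <= 0x00FFFFFFu) { count += 8;  x <<= 8;  }  /* less_8  */
    if (x <= 0x0FFFFFFFu) { count += 4;  x <<= 4;  }  /* less_4  */
    if (x <= 0x3FFFFFFFu) { count += 2;  x <<= 2;  }  /* less_2  */
    if (x <= 0x7FFFFFFFu) { count += 1; }             /* less_1  */
    return count;
}
```

Each step halves the window in which the leading one can sit, so the routine needs at most five comparisons. For the test inputs `bin1`, `bin2`, and `bin3`, this model gives 16, 0, and 1, which agrees with the expected values noted in the `.data` section.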
### Analysis
TBD
## Quiz1 - Problem C
### bf16_isnan
The C code to be translated to RISC-V assembly is:
```c
static inline bool bf16_isnan(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
(a.bits & BF16_MANT_MASK);
}
```
Below is my first attempt to translate the code above to RISC-V assembly.
```c
.data
BF16_SIGN_MASK: .word 0x8000
BF16_EXP_MASK: .word 0x7F80
BF16_MANT_MASK: .word 0x007F
BF16_EXP_BIAS: .byte 0x007F
.text
.globl main
main:
# =====================
# bf16_isnan
# =====================
# === Test case 1 ===
li a0, 0x8000 # Load testing value to a0
jal ra, bf16_isnan # Call (Jump to) bf16_isnan function
li a7, 1 # Print integer
ecall
li a0, 10 # Newline, '\n'
li a7, 11 # Print char
ecall
exit:
# Exit
li a7, 10
ecall
# Input argument: a0 | Return value: a0
bf16_isnan:
la t0, BF16_EXP_MASK # Load mask to t0 for comparison
and t0, a0, t0 # t0 = a.bits & BF16_EXP_MASK
beq t0, a0, bf16_isnan_1 # (a.bits & BF16_EXP_MASK) == BF16_EXP_MASK
li a0, 0 # Set a0 to 0 and early return
ret
bf16_isnan_1:
la t0, BF16_MANT_MASK # Load mask to t0 for comparison
and t0, t0, a0 # t0 = a.bits & BF16_MANT_MASK
bne x0, t0, bf16_isnan_ret # Compare t0 to 0, if not equal, return 1
li a0, 0 # Set a0 to 0 and early return
ret
bf16_isnan_ret:
li a0, 1 # Set a0 to 1
ret
```
Note:
- Since the assembly operates directly on the raw bits held in a register (branches such as `beq`, bitwise operations such as `and` and `xor`), we don't need a struct to retrieve the bits (e.g., `a.bits`).
When I ran it in Ripes, the program executed without errors, but the result was not what I expected: it printed 0, whereas I had assumed the output would be 1. So I dug into my code and found several bugs.

1. Incorrect Comparison: The 4th line of the assembly below is wrong because it compares `a.bits & BF16_EXP_MASK` with `a.bits` instead of with `BF16_EXP_MASK`; I have no idea what I was thinking when I wrote it. In addition, the 3rd line overwrites register `t0` with `a.bits & BF16_EXP_MASK`, but `BF16_EXP_MASK` is still needed for the comparison that follows.
```c=
bf16_isnan:
la t0, BF16_EXP_MASK # Load mask to t0 for comparison
and t0, a0, t0 # t0 = a.bits & BF16_EXP_MASK
beq t0, a0, bf16_isnan_1 # (a.bits & BF16_EXP_MASK) == BF16_EXP_MASK
li a0, 0 # Set a0 to 0 and early return
ret
```
```diff
bf16_isnan:
la t0, BF16_EXP_MASK # Load mask to t0 for comparison
- and t0, a0, t0 # t0 = a.bits & BF16_EXP_MASK
- beq t0, a0, bf16_isnan_1 # (a.bits & BF16_EXP_MASK) == BF16_EXP_MASK
+ and t1, a0, t0 # t1 = a.bits & BF16_EXP_MASK
+ beq t1, t0, bf16_isnan_1 # (a.bits & BF16_EXP_MASK) == BF16_EXP_MASK
li a0, 0 # Set a0 to 0 and early return
ret
```
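To sanity-check the corrected comparison, the register-level flow can be modeled in C. This is a sketch, not part of the original submission: `bf16_isnan_model` is a hypothetical name, the local variables mirror the registers in the diff, and it assumes `t0` holds the mask's value (`0x7F80`). Since `la` yields a label's address rather than the word stored there, the assembly would need `lw` (or `li`) for `t0` to actually hold that value.

```c
#include <stdint.h>

/* Register-level model of bf16_isnan; a0 carries a.bits. */
int bf16_isnan_model(uint32_t a0)
{
    uint32_t t0 = 0x7F80;   /* value of BF16_EXP_MASK (assumed loaded, not its address) */
    uint32_t t1 = a0 & t0;  /* t1 = a.bits & BF16_EXP_MASK */

    if (t1 != t0)           /* exponent bits are not all ones: not NaN */
        return 0;

    t0 = 0x007F;            /* value of BF16_MANT_MASK */
    t0 = t0 & a0;           /* t0 = a.bits & BF16_MANT_MASK */
    return t0 != 0;         /* NaN only when the mantissa is non-zero */
}
```

With this model, `0x7FC0` (a quiet NaN) yields 1, while `0x7F80` (infinity) and the test input `0x8000` (negative zero) yield 0.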
[^1]: [Supported Environment Calls](https://github.com/mortbopet/Ripes/blob/master/docs/ecalls.md)
[^2]: [2017q3 Homework4 (Improving clz)](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz#fn2)
[^3]: [Chapter 18 Calling Convention](https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf)