arch2025-homework1

# Assignment 1: RISC-V Assembly and Instruction Pipeline

Contributed by < [LeoriumDev](https://github.com/LeoriumDev) >

## AI Usage Citation

This assignment was completed with assistance from ChatGPT for refining commit messages, improving wording clarity, inquiring about RISC-V instruction usage (e.g., the format and purpose of `srli`), and explaining assembly structure. All final analysis and conclusions are my own.

## Lab1: RV32I Simulator

![image](https://hackmd.io/_uploads/HJTeg9n6lg.png)

![output_downscaled1](https://hackmd.io/_uploads/SkzOG92Teg.gif)

## Quiz1 - Problem B

### UF8

> My thoughts + excerpts from the description of Problem B (wording refined with ChatGPT for clarity)

UF8 (Unsigned Float 8-bit) is a compact way to store numbers that covers a huge range ($[0,1{,}015{,}792]$) in a single byte. It works like a miniature floating-point format: rather than storing every exact value, it stores an approximation that is close enough for many real-world use cases.

UF8 is most useful when range matters more than precision. For example, sensor readings such as temperature, weight, or distance can be stored in UF8 form to reduce memory usage on the device. Because it uses only 8 bits, it compresses 20-bit values by a factor of 2.5 while keeping results within roughly 6% of the true number. However, UF8 is not suitable for anything that needs exact values, such as finance or cryptography.

### CLZ (Count Leading Zeros)

My initial idea is to OR the value with the bitmask 0x7FFFFFFF and compare the result with the mask itself: the two are equal exactly when the MSB is zero. If so, increment the counter. Then left-shift the value by one bit and continue with the next iteration. Once the value becomes zero (all bits shifted out), return from the function. I also found documentation on using Ripes' environment calls (ecall).
[^1]

> Commit: [aaef6bc](https://github.com/sysprog21/ca2025-quizzes/commit/aaef6bcfafb376891b69d86c573e9da59daf3993)
> Processor Mode: Single-cycle Processor
> Cycle Count: 831

```c
.data
mask: .word 0x7FFFFFFF  # first bit is zero
bin:  .word 0x0000FFFF  # expected return value from clz: 16
bin2: .word 0xFFFFFFFF  # expected return value from clz: 0
bin3: .word 0x7FFFFFFF  # expected return value from clz: 1

.text
.globl main
main:
    # Testcase - bin
    lw a0, bin          # Load the test argument into register a0
    jal ra, clz         # Jump-and-link to the 'clz' function for counting leading zeros
    li a7, 1            # Print values returned from clz
    ecall

    # Newline, '\n' ASCII code (10_dec)
    li a0, 10
    li a7, 11
    ecall

    # Testcase - bin2
    lw a0, bin2         # Load the test argument into register a0
    jal ra, clz         # Jump-and-link to the 'clz' function for counting leading zeros
    li a7, 1            # Print values returned from clz
    ecall

    # Newline, '\n' ASCII code (10_dec)
    li a0, 10
    li a7, 11           # Print char
    ecall

    # Testcase - bin3
    lw a0, bin3         # Load the test argument into register a0
    jal ra, clz         # Jump-and-link to the 'clz' function for counting leading zeros
    li a7, 1            # Print values returned from clz
    ecall

    # Exit the program
    li a7, 10
    ecall

# Counting Leading Zeros, return value is saved in a0
clz:
    addi s0, x0, 0      # Set s0 (saved register) to 0; it serves as the leading-zero counter
clz1:
    lw s1, mask         # Load the bitmask used to test whether the MSB is 1 into s1
    or s2, a0, s1       # Force every bit except the MSB to 1 and save the result to s2
    beq s1, s2, inc     # If s2 equals the bitmask (s1), the MSB of a0 is 0: branch to inc
trailing:
    slli a0, a0, 1      # Left-shift a0 by 1
    addi s3, x0, 0      # Set s3 to 0; used to detect whether the function is done (a0 is zero)
    beq a0, s3, end     # If a0 is zero, jump to end
    jal x0, clz1        # Jump to clz1 for the next iteration
inc:
    addi s0, s0, 1      # Increment s0 (clz counter) by 1
    jal x0, trailing    # Jump to trailing to shift and compare the next bit
end:
    add a0, x0, s0      # Save clz counter to a0
    ret                 # Return from function clz
```

This is a purely iterative approach with no algorithmic tricks involved. Note that the loop keeps counting zero bits even after the first 1 bit has been seen and only stops once the shifted value becomes zero, so it returns the right answer only for nonzero inputs whose set bits are contiguous, such as the three test cases above. I also noticed some ways the CLZ function could be improved.

First, the code provided by the instructor uses a binary search approach and is implemented iteratively. (Before realizing this, I thought it was just a simple loop that counts leading zeros.)

```c
static inline unsigned clz(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}
```

The execution time of this implementation is $O(\log_2 n)$, where $n$ is the bit width: for 32 bits the loop body runs five times (c = 16, 8, 4, 2, 1).

Second, we can use bitmasks to unroll the iterative approach. [^2]

```c
static inline unsigned clz(uint32_t x)
{
    if (x == 0) return 32;
    char n = 0;
    if (x <= 0x0000FFFF) { n += 16; x <<= 16; }
    if (x <= 0x00FFFFFF) { n += 8; x <<= 8; }
    if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; }
    if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; }
    if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; }
    return n;
}
```

I changed the data type of `n` from int to char: the number of leading zeros is at most 32, so a char is enough to store it.
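As a sanity check, the three CLZ strategies discussed in this section can be compared in plain C before touching the assembly. This is only a minimal sketch and not part of the committed code: the names `clz_naive`, `clz_bsearch`, `clz_masks`, `clz_mismatches`, and `clz_self_check` are hypothetical, and `clz_naive` assumes the intended behavior of the bit-by-bit idea (stop at the first 1 bit, return 32 for a zero input).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: bit-by-bit scan that stops at the first 1 bit. */
unsigned clz_naive(uint32_t x)
{
    unsigned n = 0;
    /* (x | 0x7FFFFFFF) == 0x7FFFFFFF holds exactly when bit 31 of x is 0. */
    while (n < 32 && (x | UINT32_C(0x7FFFFFFF)) == UINT32_C(0x7FFFFFFF)) {
        n++;
        x <<= 1;
    }
    return n;
}

/* Binary-search version, same logic as the instructor's code quoted above. */
unsigned clz_bsearch(uint32_t x)
{
    unsigned n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}

/* Unrolled bitmask version, same logic as the second C snippet above. */
unsigned clz_masks(uint32_t x)
{
    if (x == 0) return 32;
    unsigned n = 0;
    if (x <= UINT32_C(0x0000FFFF)) { n += 16; x <<= 16; }
    if (x <= UINT32_C(0x00FFFFFF)) { n += 8; x <<= 8; }
    if (x <= UINT32_C(0x0FFFFFFF)) { n += 4; x <<= 4; }
    if (x <= UINT32_C(0x3FFFFFFF)) { n += 2; x <<= 2; }
    if (x <= UINT32_C(0x7FFFFFFF)) { n += 1; x <<= 1; }
    return n;
}

/* Counts sample inputs on which any version differs from the expected count. */
unsigned clz_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned s = 0; s < 32; s++) {
        const uint32_t samples[3] = {
            UINT32_C(0xFFFFFFFF) >> s, /* s leading zeros, then all ones */
            UINT32_C(0x80000000) >> s, /* s leading zeros, a single 1 bit */
            UINT32_C(0xA5A5A5A5) >> s, /* s leading zeros, then mixed bits */
        };
        for (unsigned i = 0; i < 3; i++) {
            uint32_t x = samples[i];
            if (clz_naive(x) != s || clz_bsearch(x) != s || clz_masks(x) != s)
                bad++;
        }
    }
    /* Zero is the special case: all 32 bits are leading zeros. */
    if (clz_naive(0) != 32 || clz_bsearch(0) != 32 || clz_masks(0) != 32)
        bad++;
    return bad;
}

/* Hypothetical self-check: aborts if the versions ever disagree. */
void clz_self_check(void)
{
    assert(clz_mismatches() == 0);
}
```

Calling `clz_self_check()` from a small `main` should pass silently. The mixed pattern `0xA5A5A5A5` is included on purpose, because an input with zeros below its leading 1 bit is exactly what a loop that keeps counting after the first 1 would get wrong.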
Next, I implemented the unrolled binary search with bitmasks:

> Commit: [245d5ec](https://github.com/sysprog21/ca2025-quizzes/commit/245d5ec3de7c848f287d688c742e62514e120c90)
> Processor Mode: Single-cycle Processor
> Cycle Count: 129

```c
.data
mask1: .word 0x0000FFFF
mask2: .word 0x00FFFFFF
mask3: .word 0x0FFFFFFF
mask4: .word 0x3FFFFFFF
mask5: .word 0x7FFFFFFF
bin1:  .word 0x0000FFFF  # expected return value from clz: 16
bin2:  .word 0xFFFFFFFF  # expected return value from clz: 0
bin3:  .word 0x7FFFFFFF  # expected return value from clz: 1

.text
.globl main
main:
    # === Testcase bin1 ===
    lw a0, bin1         # Load the test argument into register a0
    jal ra, clz         # Jump-and-link to the 'clz' function for counting leading zeros
    li a7, 1            # Print values returned from clz
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall

    # === Testcase bin2 ===
    lw a0, bin2         # Load the test argument into register a0
    jal ra, clz         # Jump-and-link to the 'clz' function for counting leading zeros
    li a7, 1            # Print values returned from clz
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall

    # === Testcase bin3 ===
    lw a0, bin3         # Load the test argument into register a0
    jal ra, clz         # Jump-and-link to the 'clz' function for counting leading zeros
    li a7, 1            # Print values returned from clz
    ecall

    # Exit
    li a7, 10
    ecall

# Count Leading Zeros (return value is saved in a0)
# a0: Input argument
clz:
    beq a0, x0, check_zero  # Check a0 == 0; if true, jump to check_zero for early return
    addi sp, sp, -24    # Allocate stack space for the callee-saved registers
    sw s5, 20(sp)       # Save s5 so it can be restored later
    sw s4, 16(sp)       # Save s4 so it can be restored later
    sw s3, 12(sp)       # Save s3 so it can be restored later
    sw s2, 8(sp)        # Save s2 so it can be restored later
    sw s1, 4(sp)        # Save s1 so it can be restored later
    sw s0, 0(sp)        # Save s0 so it can be restored later
    li s0, 0            # Set s0 = 0 for counting leading zeros
    lw s1, mask1        # Load the bitmask 0x0000FFFF
    lw s2, mask2        # Load the bitmask 0x00FFFFFF
    lw s3, mask3        # Load the bitmask 0x0FFFFFFF
    lw s4, mask4        # Load the bitmask 0x3FFFFFFF
    lw s5, mask5        # Load the bitmask 0x7FFFFFFF
check_16:
    bleu a0, s1, less_16  # Check if a0 <= 0x0000FFFF; if true, jump to less_16
check_8:
    bleu a0, s2, less_8   # Check if a0 <= 0x00FFFFFF; if true, jump to less_8
check_4:
    bleu a0, s3, less_4   # Check if a0 <= 0x0FFFFFFF; if true, jump to less_4
check_2:
    bleu a0, s4, less_2   # Check if a0 <= 0x3FFFFFFF; if true, jump to less_2
check_1:
    bleu a0, s5, less_1   # Check if a0 <= 0x7FFFFFFF; if true, jump to less_1
    j return_clz        # Jump to return_clz to restore the saved registers and return
less_16:
    addi s0, s0, 16     # s0 += 16
    slli a0, a0, 16     # a0 <<= 16
    j check_8
less_8:
    addi s0, s0, 8      # s0 += 8
    slli a0, a0, 8      # a0 <<= 8
    j check_4
less_4:
    addi s0, s0, 4      # s0 += 4
    slli a0, a0, 4      # a0 <<= 4
    j check_2
less_2:
    addi s0, s0, 2      # s0 += 2
    slli a0, a0, 2      # a0 <<= 2
    j check_1
less_1:
    addi s0, s0, 1      # s0 += 1
    slli a0, a0, 1      # a0 <<= 1
return_clz:
    mv a0, s0           # Copy s0 (counter) to a0
    lw s0, 0(sp)        # Restore the original s0
    lw s1, 4(sp)        # Restore the original s1
    lw s2, 8(sp)        # Restore the original s2
    lw s3, 12(sp)       # Restore the original s3
    lw s4, 16(sp)       # Restore the original s4
    lw s5, 20(sp)       # Restore the original s5
    addi sp, sp, 24     # Deallocate stack space
    ret                 # Return to the caller
check_zero:
    li a0, 32           # Set a0 = 32
    ret                 # Return to the caller
```

The cycle count dropped from 831 to 129 with the bitmask approach, which is about 6.4 times fewer cycles (and, on the single-cycle processor, as many fewer executed instructions).

During the coding process, I reviewed the factorial example from [Lab1: RV32I Simulator](https://hackmd.io/@sysprog/H1TpVYMdB#Example-Factorial-Calculation) and was reminded of the RISC-V calling conventions. In my previous code, I used the callee-saved registers without saving them to the stack or allocating stack space for them. Since these registers must be preserved by the callee, the function has to adjust the stack pointer to make room, store the registers there, and restore them before returning. In this "bitmask" version, I corrected that mistake.
In addition, I added meaningful comments to my assembly program and formatted it so that it is as readable as the example program in Lab1.

### uf8_decode

The code in the quiz is as follows:

```c
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint8_t exponent = fl >> 4;
    uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}
```

In fact, we don't need local variables such as mantissa, exponent, and offset, since this function can be collapsed into a single expression:

```c
uint32_t uf8_decode(uf8 fl)
{
    return ((fl & 0x0f) << (fl >> 4)) + ((0x7FFF >> (15 - (fl >> 4))) << 4);
}
```

In the simplified code, everything is computed directly from the value `fl`, so we can write the equivalent RISC-V assembly accordingly.

> Commit: [7e2fcbb](https://github.com/sysprog21/ca2025-quizzes/commit/7e2fcbbebda69d082ff5e875f82de84f3e0c81b4)
> Processor Mode: Single-cycle Processor
> Cycle Count: 60 (189 - 129 = 60)

```c
.data
...
# DEC
byte1: .word 0x000000FF  # expected return value from dec: 1015792
byte2: .word 0x00000055  # expected return value from dec: 656
byte3: .word 0x00000007  # expected return value from dec: 7

.text
.globl main
main:
    ...
    # =====================
    # uf8_decode
    # =====================
    # === Testcase byte1 ===
    lw a0, byte1        # Load the test byte into register a0
    jal ra, dec         # Jump-and-link to the 'dec' function for decoding uf8
    li a7, 1            # Print integer
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall

    # === Testcase byte2 ===
    lw a0, byte2        # Load the test byte into register a0
    jal ra, dec         # Jump-and-link to the 'dec' function for decoding uf8
    li a7, 1            # Print integer
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall

    # === Testcase byte3 ===
    lw a0, byte3        # Load the test byte into register a0
    jal ra, dec         # Jump-and-link to the 'dec' function for decoding uf8
    li a7, 1            # Print integer
    ecall

    # Exit
    li a7, 10
    ecall

...
# Decode uf8 to uint32_t (return value is saved in a0)
# a0: Input argument
dec:
    mv t0, a0           # Copy a0 (argument) for calculating the exponent
    srli t0, t0, 4      # Save exponent (fl >> 4) to t0
    andi a0, a0, 0x0f   # Perform (fl & 0x0f)
    sll a0, a0, t0      # Perform (fl & 0x0f) << (fl >> 4)
    li t1, 15           # Load constant 15 to t1 for calculating (15 - (fl >> 4))
    sub t0, t1, t0      # Perform (15 - (fl >> 4)) and save the result to t0
    li t1, 0x7FFF       # Load constant to t1 for calculating (0x7FFF >> (15 - (fl >> 4)))
    srl t0, t1, t0      # Perform (0x7FFF >> (15 - (fl >> 4))) and save it to t0
    slli t0, t0, 4      # Perform ((0x7FFF >> (15 - (fl >> 4))) << 4)
    add a0, a0, t0      # Add a0 and t0 and save the sum to a0
    ret
```

For the implementation of `decode`, I used temporary registers instead of allocating stack space and using saved registers.

I also wrote an accompanying C program to find the correct output value for each test case, so that I could check the RISC-V assembly against it. In addition, this C program verifies that the single-expression version behaves the same as the original one.

> Commit: [07ce1ba](https://github.com/sysprog21/ca2025-quizzes/commit/07ce1baed4de96445e369cb9893b934b5e2e8dc2)

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint8_t uf8;

uint32_t uf8_decode(uf8 fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint8_t exponent = fl >> 4;
    uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}

uint32_t uf8_decode_simple(uf8 fl)
{
    return ((fl & 0x0f) << (fl >> 4)) + ((0x7FFF >> (15 - (fl >> 4))) << 4);
}

/* Print all 32 bits of bin, most significant bit first. */
void print_bin(uint32_t bin)
{
    for (int i = 31; i >= 0; i--) {
        char b = ((bin >> i) & 0x1) == 0x1 ? '1' : '0';
        putchar(b);
    }
    putchar('\n');
}

int main(void)
{
    uint8_t testcase = 0x07;
    print_bin(uf8_decode(testcase));
    printf("%" PRIu32 "\n", uf8_decode(testcase)); /* uint32_t needs PRIu32, not %d */
    print_bin(uf8_decode_simple(testcase));
    printf("%" PRIu32 "\n", uf8_decode_simple(testcase));
    return 0;
}
```

### uf8_encode

The implementation provided by the instructor is as follows:

```c
uf8 uf8_encode(uint32_t value)
{
    /* Use CLZ for fast exponent calculation */
    if (value < 16)
        return value;

    /* Find appropriate exponent using CLZ hint */
    int lz = clz(value);
    int msb = 31 - lz;

    /* Start from a good initial guess */
    uint8_t exponent = 0;
    uint32_t overflow = 0;

    if (msb >= 5) {
        /* Estimate exponent - the formula is empirical */
        exponent = msb - 4;
        if (exponent > 15)
            exponent = 15;

        /* Calculate overflow for estimated exponent */
        for (uint8_t e = 0; e < exponent; e++)
            overflow = (overflow << 1) + 16;

        /* Adjust if estimate was off */
        while (exponent > 0 && value < overflow) {
            overflow = (overflow - 16) >> 1;
            exponent--;
        }
    }

    /* Find exact exponent */
    while (exponent < 15) {
        uint32_t next_overflow = (overflow << 1) + 16;
        if (value < next_overflow)
            break;
        overflow = next_overflow;
        exponent++;
    }

    uint8_t mantissa = (value - overflow) >> exponent;
    return (exponent << 4) | mantissa;
}
```

I first did a line-by-line translation from C to RISC-V assembly.
```c
# Encode uint32_t to uf8 (return value is saved in a0)
# a0: Input argument
enc:
    li t0, 16           # Load 16 to t0 for the early-return check
    bltu a0, t0, e_ret_enc  # if (value < 16)
    addi sp, sp, -12    # Allocate stack space to store local variables
    sw a0, 8(sp)        # Save a0 to the stack to prevent data loss
    jal ra, clz         # Call clz function, return value is saved in a0
    mv a1, a0           # Copy the return value to a1 (a0 = clz(a0))
    lw a0, 8(sp)        # Restore value from the stack (a0 is the argument)
    li t0, 31           # Load 31 to t0 for computing msb = 31 - a0
    sub a0, t0, a0      # Perform msb = 31 - a0 and save the result to a0
    sw s1, 4(sp)        # Save s1 (overflow) to the stack, restore when the function ends
    sw s0, 0(sp)        # Save s0 (exponent) to the stack, restore when the function ends
    li s0, 0            # s0 is exponent
    li s1, 0            # s1 is overflow
    li t0, 5            # Load 5 to t0 for performing msb >= 5
    bge a0, t0, ge_5    # Perform msb >= 5
exact_exp:
    li t0, 15           # Load 15 to t0 for performing exponent < 15
    bge s0, t0, mant    # When while (exponent < 15) is false, jump to mant
    slli t0, s0, 1      # next_overflow = (overflow << 1)
    addi t0, x0, 16     # next_overflow = next_overflow + 16
    lw t1, 8(sp)        # Load value to t1
    blt t1, t0, mant    # if (value < next_overflow) then break
    mv s1, t0           # overflow = next_overflow
    addi s0, x0, 1      # exponent++
    j exact_exp
mant:
    sub t0, t1, s1      # mantissa = (value - overflow)
    srl t0, t0, s0      # mantissa = mantissa >> exponent
    slli t1, s0, 4      # t1 = (exponent << 4)
    or a0, t1, t0       # (exponent << 4) | mantissa
    ret
e_ret_enc:
    ret
ge_5:
    li t0, 4            # Load 4 to t0 for subtraction
    sub s0, a0, t0      # exponent (s0) = msb (a0) - 4
    li t0, 15           # Load 15 to t0 for comparison
    bgt s0, t0, gt_15   # if (exponent > 15)
e_init:
    li t0, 0            # for-loop variable e
uf8_overflow:
    bge t0, s0, uf8_est_off  # for-loop condition: e < exponent
    slli s1, s1, 1      # overflow = (overflow << 1)
    addi s1, s1, 16     # overflow += 16
    j uf8_overflow
uf8_est_off:
    bleu s0, x0, ret_msb
    lw t2, 8(sp)        # Load value to t2
    bge t2, s1, ret_msb
    addi s1, s1, -16    # overflow = (overflow - 16)
    srli s1, s1, 1      # overflow >>= 1
    addi s0, s0, -1     # exponent--
    j uf8_est_off
ret_msb:
    j exact_exp
gt_15:
    li s0, 15           # exponent = 15
    j e_init            # Jump back to the next line of bgt
```

(To be clear, I never tested this first attempt at the `uf8_encode` RISC-V assembly, since I came up with an improvement right after finishing the translation. The code above is therefore likely to contain bugs.)

Later, I realized that in some places I could simply use temporary registers instead of managing the stack pointer and allocating and deallocating memory, since temporary registers are saved by the caller, whereas saved registers must be preserved by the callee.

While revising my code, I consulted ChatGPT, which suggested that I follow the RISC-V procedure calling conventions. It cited the statement, "In the standard RISC-V calling convention, the stack grows downward and the stack pointer is always kept 16-byte aligned." [^3] (I explicitly asked ChatGPT only for a conceptual overview of assembly conventions and some proofs of concept; I am by no means offloading my thinking to AI.)

However, this reference appeared unofficial, so I searched further and found the [RISC-V ELF psABI Document](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/tree/master), which contains a dedicated section elaborating on the [RISC-V Calling Conventions](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc). (In fact, I first checked the [official RISC-V specification](https://docs.riscv.org/reference/isa/_attachments/riscv-unprivileged.pdf), which noted that the calling convention section had been moved to the psABI.)

Note: I forgot to commit each improvement individually, but I've documented the changes here.

Overall, I made several improvements:

- Use temporary registers where possible. (One exception is a value that must survive a call to another function: the callee may overwrite temporary registers, so it is the caller's job to preserve such values.)
- Follow the RISC-V calling conventions by making the amount of allocated stack space a multiple of 16.
- Avoid double labels for a for-loop. If there is a branch right before the for-loop, you often add one label to jump back to and another to skip the loop's initialization (e.g., int i = 0). You can simplify this by moving the initialization before the preceding branch. The loop then needs only a single label. (I did a bit of research and found that this technique is called [hoisting](https://en.wikipedia.org/wiki/Loop-invariant_code_motion).)

  ```diff
    ge_5:
        ...
        li t4, 15           # Load 15 to t4 for comparison
        bgt t1, t4, gt_15   # if (exponent > 15)
  - ovf_init:
  -     li t3, 0            # for-loop variable e for calculating overflow
    uf8_ovf:
        ...
    gt_15:
        li t1, 15           # exponent = 15
  -     j ovf_init          # Jump back to the next line of bgt
  ***
    ge_5:
        ...
  +     li t3, 0            # for-loop variable e for calculating overflow (hoisting)
        li t4, 15           # Load 15 to t4 for comparison
        bgt t1, t4, gt_15   # if (exponent > 15)
    uf8_ovf:
        ...
    gt_15:
        li t1, 15           # exponent = 15
  +     j uf8_ovf           # Jump back to the next line of bgt
  ```

- Reverse the branch logic so that normal exponents skip the clamp with a single branch, and only values greater than 15 fall through to the assignment `exponent = 15`.

  ```diff
    ge_5:
        ...
        li t3, 0            # for-loop variable e for calculating overflow (hoisting)
        li t4, 15           # Load 15 to t4 for comparison
  -     bgt t1, t4, gt_15   # if (exponent > 15)
    uf8_ovf:
        ...
  - gt_15:
  -     li t1, 15           # exponent = 15
  -     j uf8_ovf           # Jump back to the next line of bgt
  ***
    ge_5:
        ...
        li t3, 0            # for-loop variable e for calculating overflow (hoisting)
        li t4, 15           # Load 15 to t4 for comparison
  +     bleu t1, t4, uf8_ovf  # Inverted if (exponent > 15): skip the clamp when exponent <= 15
  +     li t1, 15           # exponent = 15
    uf8_ovf:
        ...
  ```

- The for-loop that calculates the overflow for the estimated exponent can be simplified. It repeats `overflow = (overflow << 1) + 16;` $x$ times, where $x$ is the value of `exponent`.
We can replace the loop with a closed-form expression and prove it by [mathematical induction](https://en.wikipedia.org/wiki/Mathematical_induction), since every iteration applies the same operation to the previous result, just like a recursive function. The overall computation can be written as a nested expression:

$$
\text{OVF}_x = \underbrace{(\cdots(((\text{OVF}_0 \ll 1) + 16) \ll 1) + 16 \cdots \ll 1) + 16}_{x\text{ times}}
$$

The subscript $_0$ denotes the initial overflow value, whereas $_x$ denotes the result after the operation has been applied $x$ times. OVF stands for overflow.

Next, we can rewrite $\text{OVF}$ as a recurrence ($O_n$), using the fact that a left shift by one is a multiplication by two.

$$
O_{n+1} = 2O_n + 16
$$

Expanding the first few terms suggests a general form:

$$
\begin{align}
\text{Given}\;O_0,\;O_1& = 2O_0 + 16 \\
O_2& = 2O_1 + 16 = 2\times(2O_0 + 16) + 16 = 4O_0 + 48 \\
O_3& = 2O_2 + 16 = 2\times(2O_1 + 16) + 16 = 4O_1 + 48 = 8O_0 + 112 \\
&\;\;\vdots \notag \\
O_n& = 2^nO_0 + 16\times(2^n-1)
\end{align}
$$

This gives us a guess for the general form, which we now prove by mathematical induction.

\begin{align}
&\textbf{Claim:}\quad O_{n+1} = 2O_n + 16,\; O_0 \text{ given}\;\Rightarrow\;O_n = 2^n O_0 + 16(2^n - 1)\quad \forall n\ge 0. \\
\\
&\textbf{Proof By Induction:} \\
\\
&\text{Base case }(n = 0):\;2^0O_0+16(2^0-1) = O_0 + 0 = O_0\quad\text{(Correct)} \\
&\text{Induction step:}\;\text{Assume}\;O_k = 2^k O_0 + 16(2^k - 1)\;\text{for some}\;k\ge 0.\;\text{Then,}
\end{align}

\begin{align}
O_{k+1} &= 2O_k + 16 \\
&= 2(2^kO_0+16(2^k-1))+16 \\
&= 2^{k+1}O_0+32\times2^k-32+16 \\
&= 2^{k+1}O_0+16\times2^{k+1}-16 \\
&= 2^{k+1}O_0+16(2^{k+1}-1).
\end{align}

\begin{align}
\text{Thus, } & \text{the claimed form holds for } n = k + 1 \text{ whenever it holds for } n = k. \\[1em]
\textbf{Conclusion:}\quad & \text{The base case and the induction step together cover every } n.\\
& \text{Therefore, by induction,}\\
&O_n = 2^n O_0 + 16(2^n - 1)\quad \forall n \ge 0.
\end{align}

From the proof above, the for-loop is equivalent to $O_x = 2^x O_0 + 16(2^x - 1)$, where $x$ is the value of exponent. We can write the equation as a C statement (the parentheses are required, because `+` and `-` bind tighter than `<<` in C):

```c
overflow = (overflow << exponent) + (16 << exponent) - 16;
```

We can combine the terms:

```c
overflow = ((overflow + 16) << exponent) - 16;
```

Because `overflow` is initialized to 0 and nothing modifies it before this loop, we can further simplify the statement to:

```c
overflow = (16 << exponent) - 16;
```

Now we can *finally* translate it to RISC-V assembly.

```diff
  ge_5:    # t0 = msb, t1 = exponent, t2 = overflow
      ...
-     li t3, 0            # for-loop variable e for calculating overflow (hoisting)
      ...
  uf8_ovf:
-     bge t3, t1, uf8_est_off  # for-loop condition: e < exponent (reverse)
-     slli t2, t2, 1      # overflow = (overflow << 1)
-     addi t2, t2, 16     # overflow += 16
-     j uf8_ovf
***
  uf8_ovf:
+     li t3, 16           # Load 16 to t3 for bit manipulation
+     sll t4, t3, t1      # t4 = (16 << exponent)
+     sub t2, t4, t3      # overflow = (16 << exponent) - 16;
```

Afterwards, I wrote some test cases to verify the correctness of my program. One thing hit me: the program never terminated. I first found and fixed a few typos in the register names, but the problem persisted. I then stepped through the program in Ripes and realized that I had forgotten to save the return address of `enc`: since `enc` calls CLZ, which overwrites `ra`, I must first save `ra` (in a saved register or on the stack) to remember where `enc` should return to. Below is the line of code where things went wrong.

![image](https://hackmd.io/_uploads/S13R08Z0el.png)

After roughly an hour, it worked! But the result was not what I expected. It turned out that I had forgotten to mask the return value (a0) down to one byte, that is, to AND it with the bitmask 0xFF, which filters out all other bits. After I added that to the program, it worked correctly.
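The closed form `overflow = (16 << exponent) - 16` derived earlier in this section can also be checked mechanically against the original loop. The following is a minimal C sketch written only for illustration: `overflow_loop`, `overflow_closed`, and `overflow_self_check` are hypothetical helper names, and the check assumes the exponent stays in the UF8 range 0 to 15.

```c
#include <assert.h>
#include <stdint.h>

/* The loop from uf8_encode: apply overflow = (overflow << 1) + 16
 * `exponent` times, starting from overflow = 0. */
uint32_t overflow_loop(unsigned exponent)
{
    uint32_t overflow = 0;
    for (unsigned e = 0; e < exponent; e++)
        overflow = (overflow << 1) + 16;
    return overflow;
}

/* Closed form O_x = 2^x * O_0 + 16 * (2^x - 1) with O_0 = 0. */
uint32_t overflow_closed(unsigned exponent)
{
    return (UINT32_C(16) << exponent) - 16;
}

/* Hypothetical self-check: compare both forms for every exponent 0..15. */
void overflow_self_check(void)
{
    for (unsigned e = 0; e <= 15; e++)
        assert(overflow_loop(e) == overflow_closed(e));
}
```

For example, `overflow_closed(3)` evaluates to 112, which matches $O_3 = 8O_0 + 112$ with $O_0 = 0$, and calling `overflow_self_check()` should pass without tripping an assertion.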
Hooray~

> Commit: [dd15b88](https://github.com/sysprog21/ca2025-quizzes/commit/dd15b886153cb0ca5c4bb56fbb30294eb564d269)
> Processor Mode: Single-cycle Processor
> Cycle Count: 255 (444 - 189 = 255)

```c
.data
...
# ENC
bin4: .word 0x12345678  # expected return value from enc: 248
bin5: .word 0x55553333  # expected return value from enc: 250
bin6: .word 0x01010101  # expected return value from enc: 242

.text
.globl main
main:
    ...
    # =====================
    # uf8_encode
    # =====================
    # === Testcase bin4 ===
    lw a0, bin4         # Load the test binary into register a0
    jal ra, enc         # Jump-and-link to the 'enc' function for encoding to uf8
    li a7, 1            # Print integer
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall

    # === Testcase bin5 ===
    lw a0, bin5         # Load the test binary into register a0
    jal ra, enc         # Jump-and-link to the 'enc' function for encoding to uf8
    li a7, 1            # Print integer
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall

    # === Testcase bin6 ===
    lw a0, bin6         # Load the test binary into register a0
    jal ra, enc         # Jump-and-link to the 'enc' function for encoding to uf8
    li a7, 1            # Print integer
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall

    # Exit
    li a7, 10
    ecall

...
# Encode uint32_t to uf8 (return value is saved in a0)
# a0: Input argument
enc:
    li t0, 16           # Load 16 to t0 for the early-return check
    bltu a0, t0, e_ret_enc  # if (value < 16)
    addi sp, sp, -16    # Allocate stack space to store local variables
    sw a0, 12(sp)       # Save a0's data to the stack to prevent data loss
    sw s0, 8(sp)        # Save s0's data to the stack to prevent data loss
    mv s0, ra           # Save ra to s0
    jal ra, clz         # Call CLZ function, return value is saved in a0
    mv t0, a0           # lz = clz(value), t0 represents lz
    lw a0, 12(sp)       # Restore value from the stack (a0 is the argument)
    mv ra, s0           # Restore ra from s0
    lw s0, 8(sp)        # Restore s0 from the stack
    addi sp, sp, 16     # Deallocate stack space
    li t1, 31           # Load 31 to t1 for computing msb = 31 - lz
    sub t0, t1, t0      # Perform msb = 31 - lz and save the result to t0; t0 now represents msb
    li t1, 0            # uint8_t exponent = 0; (t1)
    li t2, 0            # uint32_t overflow = 0; (t2)
    li t3, 5            # Load 5 to t3 for performing if (msb >= 5)
    bgeu t0, t3, ge_5   # Perform msb >= 5
exact_exp:    # a0 = value, t0 = msb, t1 = exponent, t2 = overflow
    li t3, 15           # Load 15 to t3 for performing the inverse of (exponent < 15)
    bgeu t1, t3, mant   # When while (exponent < 15) is false, jump to mant
    slli t3, t2, 1      # next_overflow = (overflow << 1)
    addi t3, t3, 16     # next_overflow = next_overflow + 16
    bltu a0, t3, mant   # if (value < next_overflow) then break
    mv t2, t3           # overflow = next_overflow
    addi t1, t1, 1      # exponent++
    j exact_exp
mant:
    sub t3, a0, t2      # mantissa = (value - overflow)
    srl t3, t3, t1      # mantissa = mantissa >> exponent
    slli t4, t1, 4      # t4 = (exponent << 4)
    or a0, t3, t4       # (exponent << 4) | mantissa
    li t3, 0xFF         # Make a bitmask that keeps only the least significant byte
    and a0, a0, t3      # AND it with a0 to make sure no garbage bits remain
    ret
e_ret_enc:
    ret                 # Early return
ge_5:    # a0 = value, t0 = msb, t1 = exponent, t2 = overflow
    li t3, 4            # Load 4 to t3 for subtraction
    sub t1, t0, t3      # exponent = msb - 4;
    li t3, 15           # Load 15 to t3 for comparison
    bleu t1, t3, uf8_ovf  # Inverted if (exponent > 15): if exponent <= 15, jump past the next line
    li t1, 15           # exponent = 15
uf8_ovf:
    li t3, 16           # Load 16 to t3 for bit manipulation
    sll t4, t3, t1      # t4 = (16 << exponent)
    sub t2, t4, t3      # overflow = (16 << exponent) - 16;
uf8_est_off:
    bleu t1, x0, ret_msb  # Invert (exponent > 0)
    bgeu a0, t2, ret_msb  # Invert (value < overflow)
    addi t2, t2, -16    # overflow = (overflow - 16)
    srli t2, t2, 1      # overflow >>= 1
    addi t1, t1, -1     # exponent--
    j uf8_est_off
ret_msb:
    j exact_exp
```

### test

I made a few mistakes while implementing the test function:

1. My advice is not to write the easy part at the bottom first and then fill in the code above it without keeping track of which registers hold which variables. This bit me while implementing the test function: I wrote the loop condition at the bottom first and did not update its register when I later switched to a different approach above.
2. I was being a smart alec when trying to optimize the for-loop that iterates from 0 to 255. I tried a countdown loop that reverses the loop condition. What I overlooked is that the logic inside the for-loop _does_ depend on `i`, so when I ran the program, errors flooded my console output. After realizing this mistake, I corrected the program.

> Commit: [0714c10](https://github.com/sysprog21/ca2025-quizzes/commit/0714c1046522f97ec9a7522560f54a434cb7ac3d)
> Processor Mode: Single-cycle Processor
> Cycle Count: 25511 (25955 - 444 = 25511)

```c
.data
...
# Test
str1: .string ": produces value "
str2: .string " but encodes back to "
str3: .string ": value "
str4: .string " <= previous_value "
str5: .string "All tests passed.\n"

.text
.globl main
main:
    ...
    # =====================
    # test
    # =====================
    jal ra, test        # Jump-and-link to the 'test' function for verifying the correctness of this assembly program
    beq a0, x0, exit    # If test returned false, skip the message; otherwise print str5
    la a0, str5         # Load str5's address to a0 for printing
    li a7, 4            # Print str5
    ecall
exit:
    # Exit
    li a7, 10
    ecall

...

test:
    addi sp, sp, -16    # Allocate stack space for storing local variables
    sw s0, 12(sp)       # Save s0's data to the stack
    sw s1, 8(sp)        # Save s1's data to the stack
    sw s2, 4(sp)        # Save s2's data to the stack
    sw s3, 0(sp)        # Save s3's data to the stack
    mv s0, ra           # Save the return address for this function to s0
    li s1, -1           # int32_t previous_value = -1;
    li s2, 1            # bool passed = true;
    li s3, 0            # i = 0;
testcases:
    mv a0, s3           # uint8_t fl = i;
    jal ra, dec         # int32_t value = uf8_decode(fl); (return value is stored in a0)
    mv t6, a0           # Save value to t6
    jal ra, enc         # uint8_t fl2 = uf8_encode(value); (return value is stored in a0)
    mv ra, s0           # Restore ra from s0
    # | s0: ra | s1: previous_value | s2: passed | s3: i | t6: value | a0: fl2 |
fl_eq:
    beq s3, a0, val_cmp # Invert if (fl != fl2)
    mv s0, a0           # Save a0 (fl2) to s0 (note: this overwrites the copy of ra kept in s0)
    mv a0, s3           # Copy i to a0 for printing
    li a7, 34           # Print i in hex form
    ecall
    la a0, str1         # Load str1's address to a0 for printing
    li a7, 4            # Print str1
    ecall
    mv a0, t6           # Copy value to a0 for printing
    li a7, 1            # Print value in int form
    ecall
    la a0, str2         # Load str2's address to a0 for printing
    li a7, 4            # Print str2
    ecall
    mv a0, s0           # Restore a0 (fl2) from s0
    li a7, 34           # Print fl2 in hex form
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall
    mv a0, s0           # Restore a0 (fl2) from s0
    li s2, 0            # passed = false;
    # | s0: ra | s1: previous_value | s2: passed | s3: i | t6: value | a0: fl2 |
val_cmp:
    bgt t6, s1, next_it # Invert if (value <= previous_value)
    mv s0, a0           # Save a0 (fl2) to s0 (note: this overwrites the copy of ra kept in s0)
    mv a0, s3           # Copy i to a0 for printing
    li a7, 34           # Print i in hex form
    ecall
    la a0, str3         # Load str3's address to a0 for printing
    li a7, 4            # Print str3
    ecall
    mv a0, t6           # Copy value to a0 for printing
    li a7, 1            # Print value in int form
    ecall
    la a0, str4         # Load str4's address to a0 for printing
    li a7, 4            # Print str4
    ecall
    mv a0, s1           # Copy previous_value to a0 for printing
    li a7, 34           # Print previous_value in hex form
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall
    mv a0, s0           # Restore a0 (fl2) from s0
    li s2, 0            # passed = false;
next_it:
    mv s1, t6           # previous_value = value;
    addi s3, s3, 1      # i++
    li t0, 256          # Load 256 to t0 for comparison
    blt s3, t0, testcases  # i < 256
end_test:
    mv a0, s2           # Copy the return value "passed" to a0
    lw s3, 0(sp)        # Restore s3 from the stack
    lw s2, 4(sp)        # Restore s2 from the stack
    lw s1, 8(sp)        # Restore s1 from the stack
    lw s0, 12(sp)       # Restore s0 from the stack
    ret                 # Note: sp is not moved back by 16 here; main exits right after the call
```

### Complete RISC-V Translation of the UF8 C Program from Quiz 1

```c
.data
# CLZ
mask1: .word 0x0000FFFF
mask2: .word 0x00FFFFFF
mask3: .word 0x0FFFFFFF
mask4: .word 0x3FFFFFFF
mask5: .word 0x7FFFFFFF
bin1:  .word 0x0000FFFF  # expected return value from clz: 16
bin2:  .word 0xFFFFFFFF  # expected return value from clz: 0
bin3:  .word 0x7FFFFFFF  # expected return value from clz: 1
# DEC
byte1: .word 0x000000FF  # expected return value from dec: 1015792
byte2: .word 0x00000055  # expected return value from dec: 656
byte3: .word 0x00000007  # expected return value from dec: 7
# ENC
bin4:  .word 0x12345678  # expected return value from enc: 248
bin5:  .word 0x55553333  # expected return value from enc: 250
bin6:  .word 0x01010101  # expected return value from enc: 242
# Test
str1: .string ": produces value "
str2: .string " but encodes back to "
str3: .string ": value "
str4: .string " <= previous_value "
str5: .string "All tests passed.\n"

.text
.globl main
main:
    # =====================
    # CLZ
    # =====================
    # === Testcase bin1 ===
    lw a0, bin1         # Load the test argument into register a0
    jal ra, clz         # Jump-and-link to the 'clz' function for counting leading zeros
    li a7, 1            # Print values returned from clz
    ecall
    li a0, 10           # Newline, '\n'
    li a7, 11           # Print char
    ecall

    # === Testcase bin2 ===
    lw a0, bin2  # Load the test argument into register a0
    jal ra, clz  # Jump-and-link to the 'clz' function for counting leading zeros
    li a7, 1  # Print the value returned from clz
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    # === Testcase bin3 ===
    lw a0, bin3  # Load the test argument into register a0
    jal ra, clz  # Jump-and-link to the 'clz' function for counting leading zeros
    li a7, 1  # Print the value returned from clz
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    # =====================
    # uf8_decode
    # =====================
    # === Testcase byte1 ===
    lw a0, byte1  # Load the test byte into register a0
    jal ra, dec  # Jump-and-link to the 'dec' function for decoding uf8
    li a7, 1  # Print integer
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    # === Testcase byte2 ===
    lw a0, byte2  # Load the test byte into register a0
    jal ra, dec  # Jump-and-link to the 'dec' function for decoding uf8
    li a7, 1  # Print integer
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    # === Testcase byte3 ===
    lw a0, byte3  # Load the test byte into register a0
    jal ra, dec  # Jump-and-link to the 'dec' function for decoding uf8
    li a7, 1  # Print integer
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    # =====================
    # uf8_encode
    # =====================
    # === Testcase bin4 ===
    lw a0, bin4  # Load the test value into register a0
    jal ra, enc  # Jump-and-link to the 'enc' function for encoding to uf8
    li a7, 1  # Print integer
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    # === Testcase bin5 ===
    lw a0, bin5  # Load the test value into register a0
    jal ra, enc  # Jump-and-link to the 'enc' function for encoding to uf8
    li a7, 1  # Print integer
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    # === Testcase bin6 ===
    lw a0, bin6  # Load the test value into register a0
    jal ra, enc  # Jump-and-link to the 'enc' function for encoding to uf8
    li a7, 1  # Print integer
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    #
    # =====================
    # test
    # =====================
    jal ra, test  # Jump-and-link to the 'test' function, which verifies the correctness of this assembly program
    beq a0, x0, exit  # If test returned false (0), skip printing str5
    la a0, str5  # Load str5's address to a0 for printing
    li a7, 4  # Print str5
    ecall
exit:
    # Exit
    li a7, 10
    ecall

# Count Leading Zeros (return value is saved at a0)
# a0: Input argument
clz:
    beq a0, x0, check_zero  # Check a0 == 0; if true, jump to check_zero for early return
    addi sp, sp, -24  # Allocate stack space for the saved registers
    sw s5, 20(sp)  # Save for use afterwards
    sw s4, 16(sp)  # Save for use afterwards
    sw s3, 12(sp)  # Save for use afterwards
    sw s2, 8(sp)  # Save for use afterwards
    sw s1, 4(sp)  # Save for use afterwards
    sw s0, 0(sp)  # Save for use afterwards
    li s0, 0  # Set s0 = 0 for counting leading zeros
    lw s1, mask1  # Load the bitmask into a register
    lw s2, mask2  # Load the bitmask into a register
    lw s3, mask3  # Load the bitmask into a register
    lw s4, mask4  # Load the bitmask into a register
    lw s5, mask5  # Load the bitmask into a register
check_16:
    bleu a0, s1, less_16  # Check if a0 <= 0x0000FFFF; if true, jump to less_16
check_8:
    bleu a0, s2, less_8  # Check if a0 <= 0x00FFFFFF; if true, jump to less_8
check_4:
    bleu a0, s3, less_4  # Check if a0 <= 0x0FFFFFFF; if true, jump to less_4
check_2:
    bleu a0, s4, less_2  # Check if a0 <= 0x3FFFFFFF; if true, jump to less_2
check_1:
    bleu a0, s5, less_1  # Check if a0 <= 0x7FFFFFFF; if true, jump to less_1
    j return_clz  # Jump to return_clz to restore the saved registers and return to the caller
less_16:
    addi s0, s0, 16  # s0 += 16
    slli a0, a0, 16  # a0 <<= 16
    j check_8
less_8:
    addi s0, s0, 8  # s0 += 8
    slli a0, a0, 8  # a0 <<= 8
    j check_4
less_4:
    addi s0, s0, 4  # s0 += 4
    slli a0, a0, 4  # a0 <<= 4
    j check_2
less_2:
    addi s0, s0, 2  # s0 += 2
    slli a0, a0, 2  # a0 <<= 2
    j check_1
less_1:
    addi s0, s0, 1  # s0 += 1
    slli a0, a0, 1  # a0 <<= 1
return_clz:
    mv a0, s0  # Copy s0 (counter) to a0
    lw s0, 0(sp)  # Restore the original data
    lw s1, 4(sp)  # Restore the original data
    lw s2, 8(sp)  # Restore the original data
    lw s3, 12(sp)  # Restore the original data
    lw s4, 16(sp)  # Restore the original data
    lw s5, 20(sp)  # Restore the original data
    addi sp, sp, 24  # Deallocate stack space
    ret  # Return to the caller
check_zero:
    li a0, 32  # Set a0 = 32
    ret  # Return to the caller

# Decode uf8 to uint32_t (return value is saved at a0)
# a0: Input argument
dec:
    mv t0, a0  # Copy a0 (argument) for calculating the exponent
    srli t0, t0, 4  # t0 = exponent (fl >> 4)
    andi a0, a0, 0x0f  # Perform (fl & 0x0f)
    sll a0, a0, t0  # Perform (fl & 0x0f) << (fl >> 4)
    li t1, 15  # Load constant to t1 for calculating (15 - (fl >> 4))
    sub t0, t1, t0  # Perform (15 - (fl >> 4)) and save the result to t0
    li t1, 0x7FFF  # Load constant to t1 for calculating (0x7FFF >> (15 - (fl >> 4)))
    srl t0, t1, t0  # Perform (0x7FFF >> (15 - (fl >> 4))) and save it to t0
    slli t0, t0, 4  # Perform ((0x7FFF >> (15 - (fl >> 4))) << 4)
    add a0, a0, t0  # Add a0 and t0 and save the sum to a0
    ret

# Encode uint32_t to uf8 (return value is saved at a0)
# a0: Input argument
enc:
    li t0, 16  # Load 16 to t0 for the early-return check
    bltu a0, t0, e_ret_enc  # if (value < 16)
    addi sp, sp, -16  # Allocate stack space to store local variables
    sw a0, 12(sp)  # Save a0's data to the stack to prevent data loss
    sw s0, 8(sp)  # Save s0's data to the stack to prevent data loss
    mv s0, ra  # Save ra to s0
    jal ra, clz  # Call the clz function; the return value is saved at a0
    mv t0, a0  # lz = clz(value), t0 represents lz
    lw a0, 12(sp)  # Restore value from the stack (a0 is the argument)
    mv ra, s0  # Restore ra from s0
    lw s0, 8(sp)  # Restore s0 from the stack
    addi sp, sp, 16  # Deallocate stack space
    li t1, 31  # Load 31 to t1 for computing msb = 31 - lz
    sub t0, t1, t0  # Perform msb = 31 - lz; t0 now represents msb
    li t1, 0  # uint8_t exponent = 0; (t1)
    li t2, 0  # uint32_t overflow = 0; (t2)
    li t3, 5  # Load 5 to t3 for performing if (msb >= 5)
    bgeu t0, t3, ge_5  # Perform msb >= 5
exact_exp:
    #
    # a0 = value, t0 = msb, t1 = exponent, t2 = overflow
    li t3, 15  # Load 15 to t3 for performing the inverse of (exponent < 15)
    bgeu t1, t3, mant  # When while (exponent < 15) is false, jump to mant
    slli t3, t2, 1  # next_overflow = (overflow << 1)
    addi t3, t3, 16  # next_overflow = next_overflow + 16
    bltu a0, t3, mant  # if (value < next_overflow) then break
    mv t2, t3  # overflow = next_overflow
    addi t1, t1, 1  # exponent++
    j exact_exp
mant:
    sub t3, a0, t2  # mantissa = (value - overflow)
    srl t3, t3, t1  # mantissa = mantissa >> exponent
    slli t4, t1, 4  # t4 = (exponent << 4)
    or a0, t3, t4  # (exponent << 4) | mantissa
    li t3, 0xFF  # Make a bitmask that keeps the least significant byte
    and a0, a0, t3  # AND it with a0 to make sure no garbage values remain
    ret
e_ret_enc:
    ret  # Early return
ge_5:
    # a0 = value, t0 = msb, t1 = exponent, t2 = overflow
    li t3, 4  # Load 4 to t3 for subtraction
    sub t1, t0, t3  # exponent = msb - 4;
    li t3, 15  # Load 15 to t3 for comparison
    bleu t1, t3, uf8_ovf  # Inverse of if (exponent > 15): if exponent <= 15, skip the next line
    li t1, 15  # exponent = 15
uf8_ovf:
    li t3, 16  # Load 16 to t3 for bit manipulation
    sll t4, t3, t1  # t4 = (16 << exponent)
    sub t2, t4, t3  # overflow = (16 << exponent) - 16;
uf8_est_off:
    bleu t1, x0, ret_msb  # Inverse of (exponent > 0)
    bgeu a0, t2, ret_msb  # Inverse of (value < overflow)
    addi t2, t2, -16  # overflow = (overflow - 16)
    srli t2, t2, 1  # overflow >>= 1
    addi t1, t1, -1  # exponent--
    j uf8_est_off
ret_msb:
    j exact_exp

test:
    addi sp, sp, -16  # Allocate stack space for the saved registers
    sw s0, 12(sp)  # Save s0's data to the stack
    sw s1, 8(sp)  # Save s1's data to the stack
    sw s2, 4(sp)  # Save s2's data to the stack
    sw s3, 0(sp)  # Save s3's data to the stack
    mv s0, ra  # Save the return address for this function to s0
    li s1, -1  # int32_t previous_value = -1;
    li s2, 1  # bool passed = true;
    li s3, 0  # i = 0;
testcases:
    mv a0, s3  # uint8_t fl = i;
    jal ra, dec  # int32_t value = uf8_decode(fl); (return value is stored at a0)
    mv t6, a0  # Save value to t6
    jal ra, enc  # uint8_t fl2 = uf8_encode(value); (return value is stored at a0)
    mv ra, s0  # Restore ra from s0
    # | s0: ra | s1: previous_value | s2: passed | s3: i | t6: value | a0: fl2 |
fl_eq:
    beq s3, a0, val_cmp  # Inverse of if (fl != fl2)
    mv s0, a0  # Save a0 (fl2) to s0; NOTE: this overwrites the copy of ra kept in s0
    mv a0, s3  # Copy i to a0 for printing
    li a7, 34  # Print i in hex form
    ecall
    la a0, str1  # Load str1's address to a0 for printing
    li a7, 4  # Print str1
    ecall
    mv a0, t6  # Copy value to a0 for printing
    li a7, 1  # Print value in int form
    ecall
    la a0, str2  # Load str2's address to a0 for printing
    li a7, 4  # Print str2
    ecall
    mv a0, s0  # Restore a0 (fl2) from s0
    li a7, 34  # Print fl2 in hex form
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    mv a0, s0  # Restore a0 (fl2) from s0
    li s2, 0  # passed = false;
    # | s0: ra | s1: previous_value | s2: passed | s3: i | t6: value | a0: fl2 |
val_cmp:
    bgt t6, s1, next_it  # Inverse of if (value <= previous_value)
    mv s0, a0  # Save a0 (fl2) to s0; NOTE: this overwrites the copy of ra kept in s0
    mv a0, s3  # Copy i to a0 for printing
    li a7, 34  # Print i in hex form
    ecall
    la a0, str3  # Load str3's
    # address to a0 for printing
    li a7, 4  # Print str3
    ecall
    mv a0, t6  # Copy value to a0 for printing
    li a7, 1  # Print value in int form
    ecall
    la a0, str4  # Load str4's address to a0 for printing
    li a7, 4  # Print str4
    ecall
    mv a0, s1  # Copy previous_value to a0 for printing
    li a7, 34  # Print previous_value in hex form
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
    mv a0, s0  # Restore a0 (fl2) from s0
    li s2, 0  # passed = false;
next_it:
    mv s1, t6  # previous_value = value;
    addi s3, s3, 1  # i++
    li t0, 256  # Load 256 to t0 for comparison
    blt s3, t0, testcases  # i < 256
end_test:
    mv a0, s2  # Copy the return value "passed" to a0
    lw s3, 0(sp)  # Restore s3 from the stack
    lw s2, 4(sp)  # Restore s2 from the stack
    lw s1, 8(sp)  # Restore s1 from the stack
    lw s0, 12(sp)  # Restore s0 from the stack
    addi sp, sp, 16  # Deallocate stack space
    ret
```

### Analysis

TBD

## Quiz1 - Problem C

### bf16_isnan

The C code to be translated to RISC-V assembly is:

```c
static inline bool bf16_isnan(bf16_t a)
{
    return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
           (a.bits & BF16_MANT_MASK);
}
```

Below is my first attempt to translate the code above to RISC-V assembly.
```c
.data
BF16_SIGN_MASK: .word 0x8000
BF16_EXP_MASK: .word 0x7F80
BF16_MANT_MASK: .word 0x007F
BF16_EXP_BIAS: .byte 0x007F

.text
.globl main
main:
    # =====================
    # bf16_isnan
    # =====================
    # === Test case 1 ===
    li a0, 0x8000  # Load testing value to a0
    jal ra, bf16_isnan  # Call (jump to) the bf16_isnan function
    li a7, 1  # Print integer
    ecall
    li a0, 10  # Newline, '\n'
    li a7, 11  # Print char
    ecall
exit:
    # Exit
    li a7, 10
    ecall

# Input argument: a0 | Return value: a0
bf16_isnan:
    la t0, BF16_EXP_MASK  # Load mask to t0 for comparison
    and t0, a0, t0  # t0 = a.bits & BF16_EXP_MASK
    beq t0, a0, bf16_isnan_1  # (a.bits & BF16_EXP_MASK) == BF16_EXP_MASK
    li a0, 0  # Set a0 to 0 and early return
    ret
bf16_isnan_1:
    la t0, BF16_MANT_MASK  # Load mask to t0 for comparison
    and t0, t0, a0  # t0 = a.bits & BF16_MANT_MASK
    bne x0, t0, bf16_isnan_ret  # Compare t0 to 0, if not equal, return 1
    li a0, 0  # Set a0 to 0 and early return
    ret
bf16_isnan_ret:
    li a0, 1  # Set a0 to 1
    ret
```

Note:
- Since branches (`beq`) and bitwise operations (`and`, `xor`, etc.) work directly on the bits held in a register, we don't need to create a struct for retrieving the bits (e.g., `a.bits`).

When I ran it in Ripes, the program executed without errors, but the result was not what I expected: it printed 0, whereas I had assumed the output would be 1. So I dug into my code and found several bugs.

![image](https://hackmd.io/_uploads/BkFOpSPkZe.png)

1. Incorrect comparison: The 4th line of the assembly below is incorrect, since it compares `a.bits & BF16_EXP_MASK` with `a.bits` instead of with `BF16_EXP_MASK`. In addition, I overwrote register `t0` with `a.bits & BF16_EXP_MASK`, but `BF16_EXP_MASK` is still needed for the comparison afterwards.
```c=
bf16_isnan:
    la t0, BF16_EXP_MASK  # Load mask to t0 for comparison
    and t0, a0, t0  # t0 = a.bits & BF16_EXP_MASK
    beq t0, a0, bf16_isnan_1  # (a.bits & BF16_EXP_MASK) == BF16_EXP_MASK
    li a0, 0  # Set a0 to 0 and early return
    ret
```

```diff
 bf16_isnan:
     la t0, BF16_EXP_MASK  # Load mask to t0 for comparison
-    and t0, a0, t0  # t0 = a.bits & BF16_EXP_MASK
-    beq t0, a0, bf16_isnan_1  # (a.bits & BF16_EXP_MASK) == BF16_EXP_MASK
+    and t1, a0, t0  # t1 = a.bits & BF16_EXP_MASK
+    beq t1, t0, bf16_isnan_1  # (a.bits & BF16_EXP_MASK) == BF16_EXP_MASK
     li a0, 0  # Set a0 to 0 and early return
     ret
```

[^1]: [Supported Environment Calls](https://github.com/mortbopet/Ripes/blob/master/docs/ecalls.md)
[^2]: [2017q3 Homework4 (改善 clz)](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz#fn2)
[^3]: [Chapter 18 Calling Convention](https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf)
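To decide what output to expect from a given test value before debugging the assembly, the C expression for `bf16_isnan` can be evaluated on the host. The block below is only a minimal sketch under stated assumptions: the mask values are copied from the `.data` section of the first attempt, while `bf16_isnan_bits` and `bf16_isnan_selfcheck` are hypothetical helper names that take the raw 16-bit pattern directly (mirroring how the assembly receives it in `a0`) instead of a `bf16_t` struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values copied from the .data section of the assembly listing. */
#define BF16_EXP_MASK  0x7F80u
#define BF16_MANT_MASK 0x007Fu

/* Hypothetical helper (not part of the quiz code): same expression as
 * bf16_isnan, but applied to the raw bit pattern, like a0 in the assembly. */
static bool bf16_isnan_bits(uint16_t bits)
{
    return ((bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
           (bits & BF16_MANT_MASK) != 0;
}

/* Hypothetical self-check: a NaN needs all exponent bits set AND a
 * non-zero mantissa; the sign bit plays no role. */
static void bf16_isnan_selfcheck(void)
{
    assert(!bf16_isnan_bits(0x0000)); /* +0: exponent bits are all zero      */
    assert(!bf16_isnan_bits(0x8000)); /* -0: only the sign bit is set        */
    assert(!bf16_isnan_bits(0x7F80)); /* +inf: exponent all ones, mantissa 0 */
    assert(bf16_isnan_bits(0x7FC0));  /* exponent all ones, mantissa != 0    */
    assert(bf16_isnan_bits(0xFF81));  /* same, with the sign bit set         */
}
```

Calling `bf16_isnan_selfcheck()` from a small `main` passes silently if the model agrees with the listed cases. Under this model, a pattern only counts as NaN when its exponent field is `0x7F80` and its mantissa is non-zero, so a value such as `0x7FC0` is a convenient test input for which the assembly should print 1.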