# Assignment 1: RISC-V Assembly and Instruction Pipeline
contributed by < [`jin11109`](https://github.com/jin11109/ca2025-quizzes) >
> This assignment was completed with assistance from [ChatGPT](https://chatgpt.com/) for grammar checking, translation and initial brainstorming. All final analysis and conclusions are my own.
## [Problem B](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B) from Quiz1
### Abstract
Problem B discusses a logarithmic 8-bit codec that maps 20-bit unsigned integers to 8-bit symbols through logarithmic quantization.
In my work, I optimized the `clz` (count leading zeros) function and translated all the C code from Problem B into RISC-V assembly. Additionally, I applied several low-level optimizations, such as branchless operations, to improve performance.
### Inspiration
>We used ChatGPT to understand the compilation chain behind GCC's `__builtin_clz()` and to locate the software fallback implementation that GCC emits when a target CPU does not provide a hardware CLZ instruction.
#### CLZ optimization
CLZ is a performance-critical primitive in many low-level algorithms. Because this primitive can be implemented either by a single hardware instruction or by a software fallback, I prioritize optimizing its software path when hardware support is absent.
GCC exposes a family of [built-in helpers](https://gcc.gnu.org/onlinedocs/gcc/Bit-Operation-Builtins.html) (e.g. `__builtin_clz()`). If the target CPU lacks a dedicated CLZ instruction, the compiler will arrange for a library helper to be used at runtime. My approach is to inspect the software fallback implementations provided by GCC and adapt or optimize them for our target.
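As a small illustration (a hypothetical snippet; on an RV32I target without the Zbb extension the builtin cannot be expanded inline, so GCC routes it to the libgcc helper `__clzsi2`):
```c
#include <stdint.h>

unsigned leading_zeros(uint32_t x) {
    /* Undefined for x == 0; expands to a native clz where available,
       otherwise to a call into libgcc. */
    return __builtin_clz(x);
}
```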
#### Tracing GCC's `__builtin_clz()`
When the frontend encounters `__builtin_clz()` it lowers that call to an internal helper (for example, `clzsi2`). If the backend can emit a native `clz` instruction, the backend will expand the helper into that instruction. Otherwise the generated code will call a libgcc helper such as `__clzSI2`, which lives in [`gcc/libgcc/libgcc2.c`](https://github.com/gcc-mirror/gcc/blob/master/libgcc/libgcc2.c). An example snippet from that file looks like:
```c
#ifdef L_clzsi2
#undef int
int
__clzSI2 (UWtype x)
{
Wtype ret;
count_leading_zeros (ret, x);
return ret;
}
#endif
```
In [`gcc/include/longlong.h`](https://github.com/gcc-mirror/gcc/blob/master/include/longlong.h), GCC defines `count_leading_zeros()` in an architecture-specific way. Targets that provide a native CLZ (for example RISC-V with the Zbb extension) typically define the macro to use the builtin or direct instruction:
```c
#ifdef __riscv_zbb
#if W_TYPE_SIZE == 32
#define count_leading_zeros(COUNT, X) ((COUNT) = __builtin_clz (X))
#define count_trailing_zeros(COUNT, X) ((COUNT) = __builtin_ctz (X))
#define COUNT_LEADING_ZEROS_0 32
#endif /* W_TYPE_SIZE == 32 */
```
Finally, if no architecture provides a `count_leading_zeros()` macro, the header falls back to a generic software implementation at the end of the file — this is the fallback routine I want to study and optimize:
```c
#if !defined (count_leading_zeros)
#define count_leading_zeros(count, x) \
do { \
UWtype __xr = (x); \
UWtype __a; \
\
if (W_TYPE_SIZE <= 32) \
{ \
__a = __xr < ((UWtype)1<<2*__BITS4) \
? (__xr < ((UWtype)1<<__BITS4) ? 0 : __BITS4) \
: (__xr < ((UWtype)1<<3*__BITS4) ? 2*__BITS4 : 3*__BITS4); \
} \
else \
{ \
for (__a = W_TYPE_SIZE - 8; __a > 0; __a -= 8) \
if (((__xr >> __a) & 0xff) != 0) \
break; \
} \
\
(count) = W_TYPE_SIZE - (__clz_tab[__xr >> __a] + __a); \
} while (0)
#define COUNT_LEADING_ZEROS_0 W_TYPE_SIZE
#endif
```
This fallback locates the highest non-zero byte (or scans by bytes for larger word sizes), then consults a small lookup table (`__clz_tab`) for the leading-zero count of that byte and adds the byte-offset to produce the full count. That approach reduces per-bit work compared with naive bit-by-bit scanning and is a good starting point for further micro-optimizations.
#### Rewriting the Problem B `clz` function
```c
const uint8_t clz_tab[256] = {
0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
static inline unsigned clz(uint32_t x) {
uint32_t a; // used for offset
if (x < (1u << 16)) { // 0x0000_XXXX
if (x < (1u << 8)) { // 0x0000_00XX
a = 0;
} else { // 0x0000_XX00
a = 8;
}
} else { // 0xXXXX_0000
if (x < (1u << 24)) { // 0x00XX_0000
a = 16;
} else { // 0xXX00_0000
a = 24;
}
}
/**
* clz_tab is used to get the position of leading one of the selected 8-bit
* block(XX)
*/
return 32 - (clz_tab[x >> a] + a);
}
```
This version of `clz` uses only two branches and one table lookup, whereas the original version uses 10 branches to find the result.
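A quick sanity check of the rewritten routine (a standalone sketch using the `clz()` and `clz_tab[]` defined above):
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumes clz() and clz_tab[] from the listing above are in scope. */
int main(void) {
    /* 0x00012345: highest set bit is bit 16, so 31 - 16 = 15 leading zeros. */
    assert(clz(0x00012345u) == 15);
    /* 0x000F1234: highest set bit is bit 19, so 12 leading zeros. */
    assert(clz(0x000F1234u) == 12);
    /* Smallest nonzero input: bit 0 set, 31 leading zeros. */
    assert(clz(1u) == 31);
    printf("clz sanity checks passed\n");
    return 0;
}
```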
## Solution
### Idea for problem solving
1. **Version 1**: After optimizing the `clz` function, I perform a direct, line-by-line translation of the original C code into RISC-V assembly to establish a correctness baseline.
2. **Version 2**: Based on observations of the Ripes pipeline, I apply low-level assembly optimizations such as instruction scheduling and branchless implementations to improve throughput and reduce pipeline stalls.
### Version 1 (Naive):
**UF8 Decode:**
```asm
uf8_decode:
andi t0, a0, 0x0f   # mantissa = fl & 0x0f
srli t1, a0, 4      # exponent = fl >> 4
li t2, 0x7fff
li t3, 15
sub t3, t3, t1
srl t2, t2, t3      # (1 << exponent) - 1
slli t2, t2, 4      # offset = ((1 << exponent) - 1) << 4
sll a0, t0, t1      # mantissa << exponent
add a0, a0, t2      # result = (mantissa << exponent) + offset
jr ra
```
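In C terms, the routine corresponds to the following (reconstructed from the assembly above; not necessarily the verbatim Problem B source):
```c
#include <stdint.h>

static uint32_t uf8_decode(uint8_t fl) {
    uint32_t mantissa = fl & 0x0f;
    uint32_t exponent = fl >> 4;
    uint32_t offset = (0x7fffu >> (15 - exponent)) << 4; /* ((1 << e) - 1) * 16 */
    return (mantissa << exponent) + offset;
}
```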
**UF8 Encode:**
This naive version of the assembly does not consider the pipeline and simply translates the C code line by line. The only optimization kept here is inlining the `clz` function.
```asm
uf8_encode:
li t0, 16
blt a0, t0, return1 # if (value < 16) return value
# Find appropriate exponent using CLZ hint
clz:
li t0, 0x10000
bgt a0, t0, 1f
li t0, 0x100
bgt a0, t0, 3f
li t1, 0
j 2f
3:
li t1, 8
j 2f
1:
li t0, 0x1000000
bgt a0, t0, 3f
li t1, 16
j 2f
3:
li t1, 24
2:
li t0, 32
la t2, clz_tab
srl t3, a0, t1
add t2, t2, t3
lb t2, 0(t2)
sub t0, t0, t2
sub t0, t0, t1
clz_done:
addi t0, t0, -31
sub t0, x0, t0 # msb = 31 - lz
# Start from a good initial guess
addi t1, x0, 0 # exponent = 0
addi t2, x0, 0 # overflow = 0
li t3, 5
blt t0, t3, estimate_end # if msb < 5 then estimate_end
addi t1, t0, -4 # exponent = msb - 4
li t3, 15
ble t1, t3, skip_set_exp # if exponent <= 15
li t1, 15 # exponent = 15
skip_set_exp:
# Calculate overflow for estimated exponent
li t3, 0
for_loop:
bge t3, t1, for_loop_end # e < exponent
slli t2, t2, 1 # overflow <<= 1
addi t2, t2, 16 # overflow += 16
addi t3, t3, 1 # e++
j for_loop
for_loop_end:
# Adjust if estimate was off
while_loop:
ble t1, x0, while_loop_end
bge a0, t2, while_loop_end
addi t2, t2, -16 # overflow += -16
srli t2, t2, 1 # overflow >>= 1
addi t1, t1, -1
j while_loop
while_loop_end:
estimate_end:
# Find exact exponent
li t3, 15
while_exact_exp:
bge t1, t3, return2 # while (exponent < 15)
slli t4, t2, 1
addi t4, t4, 16
blt a0, t4, return2 # if (value < next_overflow) break;
mv t2, t4
addi t1, t1, 1
j while_exact_exp
return1:
jr ra
return2:
sub t3, a0, t2
srl t3, t3, t1
slli a0, t1, 4
or a0, a0, t3
jr ra
```
**CLZ lookup table:**
```asm
.data
clz_tab:
.byte 0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5,
.byte 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
.byte 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
.byte 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
.byte 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
.byte 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
```
**Main Test Function:**
Recall that the maximum absolute error for UF8 is defined as $\Delta_{\max} = 2^e - 1$. Based on this property, I designed a verification test: for any given input value, after being encoded and then decoded by the UF8 codec, the absolute error of the reconstructed result should not exceed $\Delta_{\max}$. If the observed error exceeds this bound, either the encoder or the decoder is implemented incorrectly.
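Expressed in C, each test case performs roughly this check (a sketch; the function name is mine, and `uf8_encode()`/`uf8_decode()` are the Problem B routines):
```c
#include <stdbool.h>
#include <stdint.h>

static bool check_roundtrip(uint32_t value, uint32_t max_abs_err) {
    uint32_t decoded = uf8_decode(uf8_encode(value));
    uint32_t err = decoded > value ? decoded - value : value - decoded;
    return err <= max_abs_err;
}
```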
```asm
.data
# test data, max absolute error (2^e - 1)
test_case1: .word 0x3, 0
test_case2: .word 0x50, 0x3
test_case3: .word 0x12345, 0xfff
test_case4: .word 0xf1234, 0x7fff
test_case_end:
msg_pass: .asciz "All tests passed\n"
msg_failed: .asciz "Failed\n"
```
```asm
main:
# Prologue
addi sp, sp, -24
sw ra, 20(sp)
sw s0, 16(sp)
sw s1, 12(sp)
sw s2, 8(sp)
sw s3, 4(sp)
sw s4, 0(sp)
la s2, test_case1 # test case start address
la s3, test_case_end
addi s4, x0, 1 # init result of all cases
# Encode then decode each value, and check that the absolute error of the
# result does not exceed the maximum absolute error.
test_loop:
beq s2, s3, done
lw s0, 0(s2)
mv a0, s0
jal uf8_encode
jal uf8_decode
lw s1, 4(s2)
# Find absolute error
sub s0, a0, s0
bgez s0, 1f
sub s0, x0, s0
1:
ble s0, s1, 1f
andi s4, s4, 0
1:
# next_case
addi s2, s2, 8
j test_loop
done:
# Print result
bnez s4, print_pass
la a0, msg_failed
li a7, 4
ecall
j 1f
print_pass:
la a0, msg_pass
li a7, 4
ecall
1:
# Epilogue
lw ra, 20(sp)
lw s0, 16(sp)
lw s1, 12(sp)
lw s2, 8(sp)
lw s3, 4(sp)
lw s4, 0(sp)
addi sp, sp, 24
j end_of_file
```
#### Ripes RISC-V simulator
All test cases pass in Ripes.


Through observation of the pipeline behavior, I found that each iteration of the loop causes a pipeline flush due to the branch instruction.
To address this issue, I applied optimizations in Version 2 to eliminate or reduce branch dependencies. For example, in the `uf8_encode()` function, the `for` loop used to calculate overflow for the estimated exponent was identified as a major source of control-flow stalls, as shown below:

### Version 2:
- **Branchless operation and unrolling**
**`for` loop in `uf8_encode()`:**
Recall the C source and the corresponding assembly from Version 1.
```c
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
```
```asm
# Calculate overflow for estimated exponent
li t3, 0
for_loop:
bge t3, t1, for_loop_end # e < exponent
slli t2, t2, 1 # overflow <<= 1
addi t2, t2, 16 # overflow += 16
addi t3, t3, 1 # e++
j for_loop
for_loop_end:
```
Through inspection of the loop, we observe the recurrence:
\begin{gather*}\text{overflow}_{n+1} = (\text{overflow}_n \ll 1) + 16\end{gather*}
| exponent | overflow |
| -------- | -------- |
| 0 | overflow |
| 1 | (overflow << 1) + 16 |
| 2 | (overflow << 2) + (16 << 1) + 16 |
| 3 | (overflow << 3) + (16 << 2) + (16 << 1) + 16 |
| n | (overflow << n) + 16 * ((1 << n) - 1) |
Because RV32I does not provide a hardware multiply instruction, computing the term $16 \times ((1 \ll n) - 1)$ with a multiply would require either a software multiply routine (extra branches and overhead) or a loop. Instead, we compute the term purely with shifts.
Therefore the branchless closed-form calculation is:
\begin{gather*}\text{overflow}_n = (\text{overflow}_0 \ll n) + (((1 \ll n) - 1) \ll 4)\end{gather*}
This can then be written in C and assembly:
```c
/* Calculate overflow for estimated exponent */
overflow = (overflow << exponent) + (((1 << exponent) - 1) << 4);
```
```asm
# Calculate overflow for estimated exponent
li t3, 1
sll t3, t3, t1
addi t3, t3, -1
slli t3, t3, 4
sll t2, t2, t1
add t2, t2, t3
```
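As a cross-check, a small standalone program can compare the original loop against the closed form for every exponent and a few initial overflow values (a sketch):
```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    const uint32_t inits[] = {0, 16, 48};
    for (unsigned k = 0; k < 3; k++) {
        for (uint32_t exponent = 0; exponent <= 15; exponent++) {
            /* Loop form from Version 1. */
            uint32_t loop_overflow = inits[k];
            for (uint32_t e = 0; e < exponent; e++)
                loop_overflow = (loop_overflow << 1) + 16;
            /* Closed form used in Version 2. */
            uint32_t closed = (inits[k] << exponent) + (((1u << exponent) - 1) << 4);
            assert(loop_overflow == closed);
        }
    }
    return 0;
}
```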
- **Instruction Scheduling**
After completing the section [Ripes simulator: instruction-level operations](#Ripes-simulator:-instruction-level-operations), I realized that instruction scheduling can be applied to reduce CPU **stalls** caused by data hazards.
Consider the final part of the `clz` function, which involves the lookup table and the computation of the return value:
```asm
li t0, 32
la t2, clz_tab
srl t3, a0, t1
add t2, t2, t3
lb t2, 0(t2)
sub t0, t0, t2
sub t0, t0, t1
```
As discussed in the final section, `sub t0, t0, t2` must **stall** because it depends on the value of `t2` loaded by the preceding `lb`. By applying instruction scheduling, we can reorder instructions to hide this load-use latency. Specifically, swapping:
```asm
sub t0, t0, t2
sub t0, t0, t1
```
so that `sub t0, t0, t1` executes before `sub t0, t0, t2` lets the CPU do useful work while the `lb` completes.
This simple reordering removes the load-use stall and improves execution efficiency without changing the program's semantics.
**Before** instruction scheduling.

**After** instruction scheduling.

#### Ripes RISC-V simulator
After further optimization, the total cycle count drops from `544` to `344`.

All test cases still pass.

## [Problem C](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C) from Quiz 1
### Abstract
Problem C involves multiple operations on the bfloat16 data type.
In my work, I first studied the hardware multiplier and derived several insights based on its behavior. After that, I translated all the C code from Problem C into RISC-V assembly. Based on observations from the Ripes simulator, I focused on the most computationally expensive operations and applied several low-level optimizations to improve performance.
### Inspiration
I found that the implementation of the `bf16_sqrt()` operation uses **multiplication** and **division** inside a binary search, as shown below:
```c
uint32_t low = 90; /* Min sqrt (roughly sqrt(128)) */
uint32_t high = 256; /* Max sqrt (roughly sqrt(512)) */
uint32_t result = 128; /* Default */
/* Binary search for square root of m */
while (low <= high) {
uint32_t mid = (low + high) >> 1;
uint32_t sq = (mid * mid) / 128; /* Square and scale */
if (sq <= m) {
result = mid; /* This could be our answer */
low = mid + 1;
} else {
high = mid - 1;
}
}
```
Since these operations are more computationally expensive than others, I first investigated how to implement fast multiplication. Because RV32I does not include the mul instruction, multiplication must be implemented in software.
(The division by 128 in the binary search can be replaced with a right shift, >> 7, so only multiplication needs to be optimized.)
I began by studying [Hardware Multipliers](https://web.archive.org/web/20000818145531/http://www.andraka.com/multipli.htm#Booth%20Recoding) to identify techniques that could be implemented in assembly — such as look-up table multipliers or scaling accumulator multipliers.
However, in this binary search, the variable `mid` always lies between `low` and `high` (i.e., from 90 to 256). Moreover, the operation is a **square** operation, and the result is always divided by `128`. Therefore, a full 32-bit multiplier is unnecessary. Instead, we can precompute and store the results for these 167 possible values in a lookup table, inspired by the concept of look-up table multipliers.
```c
uint16_t sq_table[167] = {
63, 64, 66, 67, 69, 70, 72, 73, 75, 76, 78, 79, 81, 82,
84, 86, 87, 89, 91, 92, 94, 96, 98, 99, 101, 103, 105, 106,
108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134,
136, 138, 140, 142, 144, 146, 148, 150, 153, 155, 157, 159, 162, 164,
166, 168, 171, 173, 175, 178, 180, 182, 185, 187, 190, 192, 195, 197,
200, 202, 205, 207, 210, 212, 215, 217, 220, 223, 225, 228, 231, 233,
236, 239, 242, 244, 247, 250, 253, 255, 258, 261, 264, 267, 270, 273,
276, 279, 282, 285, 288, 291, 294, 297, 300, 303, 306, 309, 312, 315,
318, 321, 325, 328, 331, 334, 338, 341, 344, 347, 351, 354, 357, 361,
364, 367, 371, 374, 378, 381, 385, 388, 392, 395, 399, 402, 406, 409,
413, 416, 420, 424, 427, 431, 435, 438, 442, 446, 450, 453, 457, 461,
465, 468, 472, 476, 480, 484, 488, 492, 496, 500, 504, 508, 512};
```
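The entries can be reproduced with a short generator (a sketch; each entry is `(mid * mid) / 128` for `mid` from 90 to 256):
```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Entry i of sq_table holds (mid * mid) >> 7 for mid = i + 90. */
    for (uint32_t mid = 90; mid <= 256; mid++)
        printf("%u%s", (mid * mid) >> 7, mid == 256 ? "\n" : ", ");
    return 0;
}
```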
```c
uint32_t low = 90; /* Min sqrt (roughly sqrt(128)) */
uint32_t high = 256; /* Max sqrt (roughly sqrt(512)) */
uint32_t result = 128; /* Default */
/* Binary search for square root of m */
while (low <= high) {
uint32_t mid = (low + high) >> 1;
uint32_t sq = sq_table[mid - 90]; /* Square and scale */
if (sq <= m) {
result = mid; /* This could be our answer */
low = mid + 1;
} else {
high = mid - 1;
}
}
```
The lookup table requires 334 bytes of storage.
However, the benefit is that it completely eliminates the need for a hardware or software multiplier, relying only on one table lookup and one addition instruction.
>Because I want to evaluate the benefit of this optimization, I will include it in Version 2, while Version 1 will use a general 32-bit software multiplier.
## Solution
### Idea for problem solving
1. **Version 1**: Translate the C code line by line as a baseline implementation.
2. **Version 2**: Replace the `bf16_sqrt()` multiplier with the lookup-table approach.
Based on the Ripes pipeline observation, apply assembly-level optimizations such as instruction scheduling and branchless operations.
### Version 1 (Naive):
**f32_to_bf16** and **bf16_to_f32**:
```asm
# ----------------------------------------
# Function: f32_to_bf16
# Input: a0 = float
# Output: a0 = bf16
# ----------------------------------------
f32_to_bf16:
srli t0, a0, 23
andi t0, t0, 0xff
li t1, 0xff
bne t0, t1, 1f
srli a0, a0, 16
li t4, 0xffff
and a0, t4, a0
jr ra
1:
li t1, 0x7fff
srli t0, a0, 16
andi t0, t0, 1
add t0, t0, t1
add a0, a0, t0
srli a0, a0, 16
jr ra
# ----------------------------------------
# Function: bf16_to_f32
# Input: a0 = bf16
# Output: a0 = float
# ----------------------------------------
bf16_to_f32:
slli a0, a0, 16
jr ra
```
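For reference, the rounding path of `f32_to_bf16` above corresponds to this C (a sketch operating on the raw IEEE-754 bit pattern; the function name is mine):
```c
#include <stdint.h>

static uint16_t f32_bits_to_bf16(uint32_t bits) {
    if (((bits >> 23) & 0xff) == 0xff)      /* Inf/NaN: just truncate */
        return (uint16_t)(bits >> 16);
    /* Round to nearest even: add 0x7fff plus the LSB of the kept part. */
    uint32_t round = ((bits >> 16) & 1) + 0x7fff;
    return (uint16_t)((bits + round) >> 16);
}
```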
**bf16_add**:
```asm
# ----------------------------------------
# Function: bf16_add
# Input: a0 = augend, a1 = addend
# Output: a0 = sum
# ----------------------------------------
bf16_add:
# Prologue
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 20(sp)
sw s1, 16(sp)
sw s2, 12(sp)
sw s3, 8(sp)
sw s4, 4(sp)
sw s5, 0(sp)
li t0, 0xffff
and a0, a0, t0
and a1, a1, t0
# Find sign, exponent and mantissa
srli t0, a0, 15
srli t1, a1, 15
andi s0, t0, 1
andi s1, t1, 1
srli t0, a0, 7
andi s2, t0, 0xff
srli t0, a1, 7
andi s3, t0, 0xff
andi s4, a0, 0x7f
andi s5, a1, 0x7f
# Handle special cases
li t0, 0xff
# Test if exp_a == 0xff
bne s2, t0, 1f
beqz s4, 2f
j add_return
2:
bne s3, t0, 2f
bnez s5, 3f
beq s0, s1, 3f
li t6, 0x7FC0
add a0, t6, x0
j add_return
3:
add a0, a1, x0
j add_return
2:
j add_return
1:
# Test if exp_b == 0xff
bne s3, t0, 1f
add a0, a1, x0
j add_return
1:
# Test if a is zero
bnez s2, 1f
bnez s4, 1f
add a0, a1, x0
j add_return
1:
# Test if b is zero
bnez s3, 1f
bnez s5, 1f
j add_return
1:
# Normalize mantissas
beqz s2, 1f
ori s4, s4, 0x80
1:
beqz s3, 1f
ori s5, s5, 0x80
1:
# Align exponents
sub t0, s2, s3
ble t0, x0, 1f
add t2, s2, x0
li t6, 8
ble t0, t6, 2f
j add_return
2:
srl s5, s5, t0
j 3f
1:
bge t0, x0, 1f
add t2, s3, x0
li t6, -8
bge t0, t6, 2f
add a0, a1, x0
j add_return
2:
sub t6, x0, t0
srl s4, s4, t6
j 3f
1:
add t2, x0, s2
3:
# Perform addition or subtraction depending on sign
bne s0, s1, 1f
add t1, s0, x0
add t3, x0, s4
add t3, t3, s5
andi t6, t3, 0x100
beqz t6, add_end
srli t3, t3, 1
addi t2, t2, 1
li t6, 0xff
blt t2, t6, add_end
slli t1, t1, 15
li t6, 0x7f80
or a0, t6, t1
j add_return
1:
blt s4, s5, 2f
add t1, x0, s0
sub t3, s4, s5
j 3f
2:
add t1, x0, s1
sub t3, s5, s4
3:
bnez t3, 1f
li a0, 0x0
j add_return
1:
andi t6, t3, 0x80
bnez t6, add_end
slli t3, t3, 1
addi t2, t2, -1
bgt t2, x0, 1b
li a0, 0x0
j add_return
# Pack final result (sign | exponent | mantissa)
add_end:
slli a0, t1, 15
andi t2, t2, 0xff
slli t2, t2, 7
andi t3, t3, 0x7f
or a0, t2, a0
or a0, t3, a0
j add_return
# Epilogue
add_return:
lw ra, 24(sp)
lw s0, 20(sp)
lw s1, 16(sp)
lw s2, 12(sp)
lw s3, 8(sp)
lw s4, 4(sp)
lw s5, 0(sp)
addi sp, sp, 28
jr ra
```
**bf16_sub**:
```asm
# ----------------------------------------
# Function: bf16_sub
# Input: a0 = minuend, a1 = subtrahend
# Output: a0 = difference
# ----------------------------------------
bf16_sub:
li t0, 0x8000
xor a1, a1, t0
j bf16_add # tail call: return bf16_add(a, b with its sign flipped)
```
**bf16_mul**:
```asm
# ----------------------------------------
# Function: bf16_mul
# Input: a0 = multiplicand, a1 = multiplier
# Output: a0 = product
# ----------------------------------------
bf16_mul:
# Prologue
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 20(sp)
sw s1, 16(sp)
sw s2, 12(sp)
sw s3, 8(sp)
sw s4, 4(sp)
sw s5, 0(sp)
li t0, 0xffff
and a0, a0, t0
and a1, a1, t0
# Find sign, exponent and mantissa
srli t0, a0, 15
srli t1, a1, 15
andi s0, t0, 1
andi s1, t1, 1
srli t0, a0, 7
andi s2, t0, 0xff
srli t0, a1, 7
andi s3, t0, 0xff
andi s4, a0, 0x7f
andi s5, a1, 0x7f
xor s0, s0, s1
# Handle special cases
# If exp_a is 0xff
li t6, 0xff
bne s2, t6, 1f
beqz s4, 2f
j mul_return
2:
bnez s3, 2f
bnez s5, 2f
li a0, 0x7fc0
j mul_return
2:
li a0, 0x7f80
slli s0, s0, 15
or a0, s0, a0
j mul_return
1:
# If exp_b is 0xff
bne s3, t6, 1f
beqz s5, 2f
add a0, a1, x0
j mul_return
2:
bnez s2, 2f
bnez s4, 2f
li a0, 0x7fc0
j mul_return
2:
li a0, 0x7f80
slli s0, s0, 15
or a0, s0, a0
j mul_return
1:
# Test if a or b is zero
bnez s2, 1f
bnez s4, 1f
j 2f
1:
bnez s3, 1f
bnez s5, 1f
2:
slli a0, s0, 15
j mul_return
1:
# Normalize operands
li s1, 0
bnez s2, 1f
3:
andi t0, s4, 0x80
bnez t0, 3f
slli s4, s4, 1
addi s1, s1, -1
j 3b
3:
li s2, 1
j 2f
1:
ori s4, s4, 0x80
2:
bnez s3, 1f
3:
andi t0, s5, 0x80
bnez t0, 3f
slli s5, s5, 1
addi s1, s1, -1
j 3b
3:
li s3, 1
j 2f
1:
ori s5, s5, 0x80
2:
# Multiply mantissas using integer arithmetic
mv a0, s4
mv a1, s5
jal mul32u
mv s4, a0
# Compute result exponent with bias adjustment
add t0, s2, s3
addi t0, t0, -127
add s1, s1, t0
# Normalize mantissa result and handle possible overflow
li t0, 0x8000
and t0, t0, s4
beqz t0, 1f
srli s4, s4, 8
andi s4, s4, 0x7f
addi s1, s1, 1
j 2f
1:
srli s4, s4, 7
andi s4, s4, 0x7f
2:
# Handle overflow and underflow for exponent
li t6, 0xff
blt s1, t6, 1f
li a0, 0x7f80
slli s0, s0, 15
or a0, a0, s0
j mul_return
1:
bgtz s1, 1f
li t6, -6
bge s1, t6, 2f
slli a0, s0, 15
j mul_return
2:
li t0, 1
sub t0, t0, s1
srl s4, s4, t0
li s1, 0
1:
# Pack final result (sign | exponent | mantissa)
slli a0, s0, 15
andi s1, s1, 0xff
slli s1, s1, 7
andi s4, s4, 0x7f
or a0, a0, s1
or a0, a0, s4
j mul_return
# Epilogue
mul_return:
lw ra, 24(sp)
lw s0, 20(sp)
lw s1, 16(sp)
lw s2, 12(sp)
lw s3, 8(sp)
lw s4, 4(sp)
lw s5, 0(sp)
addi sp, sp, 28
jr ra
```
**bf16_div**:
```asm
# ----------------------------------------
# Function: bf16_div
# Input: a0 = dividend, a1 = divisor
# Output: a0 = quotient
# ----------------------------------------
bf16_div:
# Prologue
addi sp, sp, -24
sw ra, 20(sp)
sw s0, 16(sp)
sw s1, 12(sp)
sw s2, 8(sp)
sw s3, 4(sp)
sw s4, 0(sp)
li t0, 0xffff
and a0, a0, t0
and a1, a1, t0
# Find sign, exponent and mantissa
srli t0, a0, 15
srli t1, a1, 15
andi t0, t0, 1
andi t1, t1, 1
xor s0, t0, t1
srli t0, a0, 7
andi s1, t0, 0xff
srli t0, a1, 7
andi s2, t0, 0xff
andi s3, a0, 0x7f
andi s4, a1, 0x7f
li t6, 0xff
beq s2, t6, div_return1
# Handle special cases
bnez s2, 1f
bnez s4, 1f
j div_return2
1:
beq s1, t6, div_return3
seqz t0, s1
seqz t1, s3
and t0, t0, t1
bnez t0, div_return4
# Normalize mantissas
beqz s1, 1f
ori s3, s3, 0x80
1:
beqz s2, 1f
ori s4, s4, 0x80
1:
# Perform fixed-point division
slli s3, s3, 15
li t0, 0
# Bitwise long division loop
li t1, 0
li t2, 16
2:
bge t1, t2, 2f
slli t0, t0, 1
addi t3, t2, -1
sub t3, t3, t1
sll t4, s4, t3
blt s3, t4, 1f
sub s3, s3, t4
ori t0, t0, 1
1:
addi t1, t1, 1
j 2b
2:
# Adjust exponent based on input exponents and bias
sub t1, s1, s2
addi t1, t1, 127
bnez s1, 1f
addi t1, t1, -1
1:
bnez s2, 1f
addi t1, t1, 1
1:
# Normalize quotient and align to 8-bit mantissa format
li t2, 0x8000
and t3, t2, t0
beqz t3, 1f
srli t0, t0, 8
j 2f
1:
li t4, -1
and t3, t2, t0
xor t5, t4, t3
li t6, 1
slt t6, t6, t1
and t6, t6, t5
bnez t6, 1f
slli t0, t0, 1
addi t1, t1, -1
1:
srli t0, t0, 8
2:
andi t0, t0, 0x7f
# Handle overflow and underflow conditions
li t2, 0xff
blt t1, t2, 1f
slli a0, s0, 15
li t2, 0x7f80
or a0, a0, t2
j div_return
1:
bgt t1, x0, 1f
slli a0, s0, 15
j div_return
1:
# Pack final result (sign | exponent | mantissa)
slli a0, s0, 15
andi t1, t1, 0xff
slli t1, t1, 7
andi t0, t0, 0x7f
or a0, a0, t1
or a0, a0, t0
j div_return
# Return result of special cases
div_return1:
beqz s4, 1f
add a0, x0, a1
j div_return
1:
bne s1, t6, 1f
bnez s3, 1f
li a0, 0x7fc0
j div_return
1:
slli a0, s0, 15
j div_return
div_return2:
bnez s1, 1f
bnez s3, 1f
li a0, 0x7fc0
j div_return
1:
li t2, 0x7f80
slli a0, s0, 15
or a0, a0, t2
j div_return
div_return3:
beqz s3, 1f
j div_return
1:
li t2, 0x7f80
slli a0, s0, 15
or a0, t2, a0
j div_return
div_return4:
slli a0, s0, 15
j div_return
# Epilogue
div_return:
lw ra, 20(sp)
lw s0, 16(sp)
lw s1, 12(sp)
lw s2, 8(sp)
lw s3, 4(sp)
lw s4, 0(sp)
addi sp, sp, 24
jr ra
```
**bf16_sqrt**:
```asm
# ----------------------------------------
# Function: bf16_sqrt
# Input: a0 = num
# Output: a0 = square root
# ----------------------------------------
bf16_sqrt:
# Prologue
addi sp, sp, -20
sw ra, 16(sp)
sw s0, 12(sp)
sw s1, 8(sp)
sw s2, 4(sp)
sw s3, 0(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a0, 7
andi s1, s1, 0xff
andi s2, a0, 0x7f
# handle special cases
li t6, 0xff
bne s1, t6, 1f
beqz s2, 2f
j sqrt_return
2:
beqz s0, 2f
li a0, 0x7fc0
j sqrt_return
2:
j sqrt_return
1:
# sqrt(+/-0) = 0
bnez s1, 1f
bnez s2, 1f
li a0, 0x0
j sqrt_return
1:
# sqrt of negative number is NaN
beqz s0, 1f
li a0, 0x7fc0
j sqrt_return
1:
# Flush denormals to zero
bnez s1, 1f
li a0, 0x0
j sqrt_return
1:
addi s1, s1, -127
ori s2, s2, 0x80
andi t0, s1, 1
beqz t0, 1f
slli s2, s2, 1
addi s0, s1, -1
srai s0, s0, 1
addi s0, s0, 127
j 2f
1:
srai s0, s1, 1
addi s0, s0, 127
2:
# Binary search for square root
li t0, 90
li t1, 256
li t2, 128
1:
bgt t0, t1, 1f
add t3, t0, t1
srli t3, t3, 1
addi sp, sp, -20
sw ra, 16(sp)
sw t0, 12(sp)
sw t1, 8(sp)
sw t2, 4(sp)
sw t3, 0(sp)
mv a0, t3
mv a1, t3
jal mul32u
srli t4, a0, 7
lw ra, 16(sp)
lw t0, 12(sp)
lw t1, 8(sp)
lw t2, 4(sp)
lw t3, 0(sp)
addi sp, sp, 20
bgt t4, s2, 2f
mv t2, t3
addi t0, t3, 1
j 1b
2:
addi t1, t3, -1
j 1b
1:
# Normalize to ensure result is in
li t6, 256
blt t2, t6, 1f
srli t2, t2, 1
addi s0, s0, 1
j 2f
1:
li t6, 128
li t5, 1
bge t2, t6, 2f
3:
bge t2, t6, 2f
ble s0, t5, 2f
slli t2, t2, 1
addi s0, s0, -1
j 3b
2:
# Extract 7-bit mantissa (remove implicit 1)
andi t2, t2, 0x7f
# Check for overflow/underflow
li t6, 0xff
blt s0, t6, 1f
li a0, 0x7f80
j sqrt_return
1:
bgtz s0, 1f
li a0, 0x0
j sqrt_return
1:
# Pack final result (sign | exponent | mantissa)
andi a0, s0, 0xff
slli a0, a0, 7
or a0, a0, t2
j sqrt_return
# Epilogue
sqrt_return:
lw ra, 16(sp)
lw s0, 12(sp)
lw s1, 8(sp)
lw s2, 4(sp)
lw s3, 0(sp)
addi sp, sp, 20
jr ra
```
**mul32u**:
```asm
# ----------------------------------------
# Function: mul32u
# Input: a0 = multiplicand
# a1 = multiplier
# Output: a0 = 32-bit product
# ----------------------------------------
mul32u:
li t1, 0
mul_loop:
beqz a1, mul_done
andi t0, a1, 1
beqz t0, mul_skip_add
add t1, t1, a0
mul_skip_add:
slli a0, a0, 1
srli a1, a1, 1
j mul_loop
mul_done:
mv a0, t1
jr ra
```
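For reference, `mul32u` is a plain shift-and-add multiplier; in C it corresponds to roughly:
```c
#include <stdint.h>

static uint32_t mul32u(uint32_t a, uint32_t b) {
    uint32_t product = 0;
    while (b) {               /* examine multiplier bits from LSB up */
        if (b & 1)
            product += a;     /* add the shifted multiplicand when the bit is set */
        a <<= 1;
        b >>= 1;
    }
    return product;           /* low 32 bits of the product */
}
```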
**Main test function**:
For each test case, two floating-point numbers and their expected results after every operation are provided. The program first uses `f32_to_bf16()` to convert the `float` values into the `bf16` format and then performs each operation. After that, the result is converted back to the `float` type for comparison with the expected answer.
If the results of all operations are correct, the program prints **“Passed”** for that case; otherwise, it prints **“Failed.”** Therefore, seeing four “Passed” messages indicates that all test cases have passed successfully.
```asm
.data
# test data1, test data2, div result, add result, mul result, sqrt result
# 25.0, 5.0, 5.0, 30.0, 125.0, 5.0
test_case1: .word 0x41c80000, 0x40a00000, 0x40a00000, 0x41f00000, 0x42fa0000, 0x40a00000
# -2.5, -5, 0.5, -7.5, 12.5, NaN
test_case2: .word 0xc0200000, 0xc0a00000, 0x3f000000, 0xc0f00000, 0x41480000, 0x7fc00000
# -4.0, 0, -inf, -4.0, -0, NaN
test_case3: .word 0xc0800000, 0x00000000, 0xff800000, 0xc0800000, 0x80000000, 0x7fc00000
# inf, inf, NaN, inf, inf, inf
test_case4: .word 0x7f800000, 0x7f800000, 0x7fc00000, 0x7f800000, 0x7f800000, 0x7f800000
test_case_end:
msg_pass: .asciz "Passed\n"
msg_failed: .asciz "Failed\n"
.text
main:
# Prologue
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 20(sp)
sw s1, 16(sp)
sw s2, 12(sp)
sw s3, 8(sp)
sw s4, 4(sp)
sw s5, 0(sp)
la s3, test_case1 # test case start address
la s4, test_case_end
addi s5, x0, 1 # init result of all cases
test_loop:
beq s3, s4, done
lw a0, 0(s3)
jal f32_to_bf16
add s0, a0, x0
lw a0, 4(s3)
jal f32_to_bf16
add s1, a0, x0
# Test div operation
add a0, s0, x0
add a1, s1, x0
jal bf16_div
jal bf16_to_f32
lw s2, 8(s3)
xor t0, a0, s2
seqz t0, t0
and s5, s5, t0
# Test add operation
add a0, s0, x0
add a1, s1, x0
jal bf16_add
jal bf16_to_f32
lw s2, 12(s3)
xor t0, a0, s2
seqz t0, t0
and s5, s5, t0
# Test mul operation
add a0, s0, x0
add a1, s1, x0
jal bf16_mul
jal bf16_to_f32
lw s2, 16(s3)
xor t0, a0, s2
seqz t0, t0
and s5, s5, t0
# Test sqrt operation
add a0, s0, x0
jal bf16_sqrt
jal bf16_to_f32
lw s2, 20(s3)
xor t0, a0, s2
seqz t0, t0
and s5, s5, t0
# Print result
bnez s5, print_pass
j print_failed
print_pass:
la a0, msg_pass
li a7, 4
ecall
j next_case
print_failed:
la a0, msg_failed
li a7, 4
ecall
next_case:
addi s3, s3, 24
j test_loop
done:
# Epilogue
lw ra, 24(sp)
lw s0, 20(sp)
lw s1, 16(sp)
lw s2, 12(sp)
lw s3, 8(sp)
lw s4, 4(sp)
lw s5, 0(sp)
addi sp, sp, 28
j end_of_file
```
#### Ripes RISC-V simulator
All test cases pass in Ripes.


### Version 2:
- **Remove multiplication from `bf16_sqrt()`**
As mentioned in the **Inspiration** section, I implemented this optimization in RISC-V assembly. Recall the binary search in **Version 1**: because the algorithm calls `mul32u`, it is necessary to save caller-saved registers beforehand. This additional step increases the execution time and makes the operation more costly.
```asm
# Binary search for square root
li t0, 90
li t1, 256
li t2, 128
1:
bgt t0, t1, 1f
add t3, t0, t1
srli t3, t3, 1
addi sp, sp, -20
sw ra, 16(sp)
sw t0, 12(sp)
sw t1, 8(sp)
sw t2, 4(sp)
sw t3, 0(sp)
mv a0, t3
mv a1, t3
jal mul32u
srli t4, a0, 7
lw ra, 16(sp)
lw t0, 12(sp)
lw t1, 8(sp)
lw t2, 4(sp)
lw t3, 0(sp)
addi sp, sp, 20
bgt t4, s2, 2f
mv t2, t3
addi t0, t3, 1
j 1b
2:
addi t1, t3, -1
j 1b
1:
```
Now we use `sq_table` instead:
```asm
.data
sq_table: .half 63, 64, 66, 67, 69, 70, 72, 73, 75, 76, 78, 79, 81, 82
.half 84, 86, 87, 89, 91, 92, 94, 96, 98, 99, 101, 103, 105, 106
.half 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134
.half 136, 138, 140, 142, 144, 146, 148, 150, 153, 155, 157, 159, 162, 164
.half 166, 168, 171, 173, 175, 178, 180, 182, 185, 187, 190, 192, 195, 197
.half 200, 202, 205, 207, 210, 212, 215, 217, 220, 223, 225, 228, 231, 233
.half 236, 239, 242, 244, 247, 250, 253, 255, 258, 261, 264, 267, 270, 273
.half 276, 279, 282, 285, 288, 291, 294, 297, 300, 303, 306, 309, 312, 315
.half 318, 321, 325, 328, 331, 334, 338, 341, 344, 347, 351, 354, 357, 361
.half 364, 367, 371, 374, 378, 381, 385, 388, 392, 395, 399, 402, 406, 409
.half 413, 416, 420, 424, 427, 431, 435, 438, 442, 446, 450, 453, 457, 461
.half 465, 468, 472, 476, 480, 484, 488, 492, 496, 500, 504, 508, 512
```
```asm
# Binary search for square root
li t0, 90
li t1, 256
li t2, 128
1:
bgt t0, t1, 1f
add t3, t0, t1
srli t3, t3, 1
# Load square and scale from table
addi t4, t3, -90
slli t4, t4, 1
la t6, sq_table
add t4, t4, t6
lh t4, 0(t4)
bgt t4, s2, 2f
mv t2, t3
addi t0, t3, 1
j 1b
2:
addi t1, t3, -1
j 1b
1:
```
- **`for` loop in `bf16_div()`**
When running the Version 1 code in Ripes, we found that the `for` loop in `bf16_div()` consumes a significant amount of execution time.
The loop in Version 1 can be expressed as:
```asm
# Bitwise long division loop
li t1, 0
li t2, 16
2:
bge t1, t2, 2f
slli t0, t0, 1
addi t3, t2, -1
sub t3, t3, t1
sll t4, s4, t3
blt s3, t4, 1f
sub s3, s3, t4
ori t0, t0, 1
1:
addi t1, t1, 1
j 2b
2:
```
and the corresponding C source code:
```c
for (int i = 0; i < 16; i++) {
quotient <<= 1;
if (dividend >= (divisor << (15 - i))) {
dividend -= (divisor << (15 - i));
quotient |= 1;
}
}
```
We can implement a **branchless version** using the `slt` instruction.
Since `slt` returns `1` or `0`, subtracting `1` from the result produces a mask (`0x0` for true, `0xFFFF_FFFF` for false) that can be used to replace the branch:
```asm
# Bitwise long division loop
li t1, 0
li t2, 16
li t5, 15
2:
bge t1, t2, 2f
slli t0, t0, 1
sub t3, t5, t1 # 15 - i
sll t4, s4, t3 # divisor << (15 - i)
slt t6, s3, t4
addi t6, t6, -1 # mask for branch
and t4, t4, t6
andi t3, t6, 1
sub s3, s3, t4 # dividend -= (divisor << (15 - i));
or t0, t0, t3 # quotient |= 1;
addi t1, t1, 1
j 2b
2:
```
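In C terms, the branchless body corresponds to roughly the following (a sketch; the function name is mine, and the mask plays the role of the `slt`/`addi` pair):
```c
#include <stdint.h>

static uint32_t divide_mantissa(uint32_t dividend, uint32_t divisor) {
    uint32_t quotient = 0;
    for (int i = 0; i < 16; i++) {
        quotient <<= 1;
        uint32_t shifted = divisor << (15 - i);
        uint32_t mask = (dividend < shifted) ? 0u : ~0u; /* all-ones when we subtract */
        dividend -= shifted & mask;
        quotient |= mask & 1;
    }
    return quotient;
}
```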
Moreover, since this loop has a fixed number of iterations, it is a good candidate for **loop unrolling** to further improve performance:
```asm
# Bitwise long division loop
li t5, 15
# first loop
slli t0, t0, 1
addi t3, t5, 0 # change here for 0~-15
sll t4, s4, t3
slt t6, s3, t4
addi t6, t6, -1
and t4, t4, t6
andi t3, t6, 1
sub s3, s3, t4
or t0, t0, t3
# second loop
slli t0, t0, 1
addi t3, t5, -1
sll t4, s4, t3
slt t6, s3, t4
addi t6, t6, -1
and t4, t4, t6
andi t3, t6, 1
sub s3, s3, t4
or t0, t0, t3
.
.
.
# final loop
slli t0, t0, 1
addi t3, t5, -15
sll t4, s4, t3
slt t6, s3, t4
addi t6, t6, -1
and t4, t4, t6
andi t3, t6, 1
sub s3, s3, t4
or t0, t0, t3
```
#### Ripes RISC-V simulator
After further optimization, the total cycle count drops from `2941` to `2075`.

All test cases still pass.

## Ripes simulator: instruction-level operations
Take `lb t2, 0(t2)` in the `clz` function as an example.
```asm
li a0, 0x00100000
clz:
addi t0, x0, 1
slli t0, t0, 16
bgt a0, t0, 1f
srli t0, t0, 8 # t0 = 0x100
bgt a0, t0, 3f
li t1, 0
j 2f
3:
li t1, 8
j 2f
1:
slli t0, t0, 8 # t0 = 0x1000000
bgt a0, t0, 3f
li t1, 16
j 2f
3:
li t1, 24
2:
li t0, 32
la t2, clz_tab
srl t3, a0, t1
add t2, t2, t3
lb t2, 0(t2)
sub t0, t0, t2
sub a0, t0, t1
clz_done:
```
### Program functionality
The `clz` routine computes the number of leading zeros of a 32-bit integer.
1. It uses a short sequence of compare-and-branch operations to locate the highest non-zero byte and derive the byte offset.
2. It then indexes the lookup table `clz_tab` with that byte to obtain the position of its leading one.
3. Finally, the result is computed and stored in `a0`.
### Instruction-level operations
We focus on the `lb t2, 0(t2)` instruction and analyze its pipeline behavior, memory access, and hazard handling:
1. Instruction Fetch (IF)
- The processor fetches the instruction from instruction memory and places it in the IF/ID pipeline register.
- The fetch stage does not yet execute the instruction; it only retrieves the binary encoding.
2. Instruction decode (ID):
- In this stage, the processor decodes the instruction: identifying source and destination registers, immediate values, and control signals.
- These decoded values are prepared for the EX stage.
3. Execute (EX):
- The `lb` instruction computes the effective address by adding the base register (`t2`, i.e. `x7`) and the immediate offset (`0`).
- Because the base register was just produced by the previous instruction (`add x7, x7, x28`, i.e. `add t2, t2, t3`), **data forwarding** lets the EX stage use the latest value **without stalling**.
- This is done via a forwarding circuit connecting MEM/WB results back to the EX stage.
- The ALU then adds the forwarded base value to the immediate offset to form the effective address.
4. Memory (MEM):
- The memory stage performs the byte load from the calculated address.
- We can see that the EX stage stalls at this point: the next instruction, `sub x5, x5, x7` (i.e. `sub t0, t0, t2`), depends on `x7`, but `x7` is not available until the data-memory read completes, so forwarding cannot deliver it to the EX stage in time.
- Memory read results are stored in the MEM/WB pipeline register for the write-back stage.

5. Write-back (WB):
- The loaded byte is written back to the destination register `x7`.
- Thanks to the forwarding paths, subsequent instructions depending on `x7` can access the updated value without additional stalls.