# Assignment 1: RISC-V Assembly and Instruction Pipeline

> contributed by < [Urdan0117](https://github.com/Urdan0117) >
> N26144387 [name=Chiu Kun Chan]

> [color=#2fbeed] **🔹 Use of AI tools**
> I used ChatGPT to assist with Quiz 1 by providing code explanations, grammar polishing, preliminary research, code summaries, and explanations of standard RISC-V instruction usage.

---

# ProbB

## Abstract

This assignment implements a **UF8 logarithmic quantization** encoder and decoder in **RV32I**. Besides reproducing the baseline functionality, the key improvements are:

* Using `clz` to locate the MSB, plus an **O(1)** formula `offset(e)=((1<<e)-1)<<4`
* **Leafified** `clz` and `uf8_decode` (all `t*` registers, zero `sw/lw`); `uf8_encode` saves only `ra/s0`
* A built-in `test` routine that verifies **round-trip** and **monotonicity**
* Code size and cycle counts measured in **Ripes**, with a 5-stage pipeline and control-signal analysis

---

## 1. Use Case

UF8 is suitable for sensor distance / temperature data, graphics LOD distance / fog density, and exponential-backoff timers. It is not suitable for financial computations that need exact precision, or for cryptographic applications that require a uniform value distribution.

---

## 2. UF8 Format and Algorithm

* 8-bit UF8: `fl = (e<<4) | m`, where `e=fl[7:4]`, `m=fl[3:0]`
* **Decoding**

  $$
  \text{offset}(e) = ((1 \ll e) - 1) \ll 4
  $$

  $$
  \text{value} = (m \ll e) + \text{offset}(e)
  $$

* **Encoding (concept)**
  1. `msb = 31 - clz(value)`; initial estimate `e0 = clamp(msb-4, 0..15)`
  2. Compare `value` with `offset(e)` and, if `e<15`, with `offset(e+1)`; adjust `e` by ±1 until `offset(e) ≤ value < offset(e+1)`
  3. `m = (value - offset(e)) >> e`, then combine `fl=(e<<4)|m`
* **Edge cases:** `value < 16` is returned directly; `e=15` has no `offset(e+1)`.

---

## 3. Version Evolution and Key Fixes (V0 → V1 → V2)

### 3.1 V0 (baseline, functional)

First student attempt, rewritten directly from C.

:exclamation: Every function builds a stack frame and saves `ra/s*`. The code is functional, but **even the leaf functions issue frequent `sw/lw`**.
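Since V0 was rewritten from C, a compact C model of the Section 2 rules is a useful reference when reading the assembly versions. The sketch below was written for this note and is not the original C source from the quiz; the helper names `uf8_offset`, `clz32`, and `uf8_self_test` are assumptions chosen for illustration.

```c
#include <stdint.h>

/* offset(e) = ((1 << e) - 1) << 4: the smallest value that has exponent e. */
static uint32_t uf8_offset(uint32_t e) {
    return ((1u << e) - 1u) << 4;
}

/* Decode: value = (m << e) + offset(e). */
static uint32_t uf8_decode(uint8_t fl) {
    uint32_t m = fl & 0x0Fu;   /* mantissa = fl[3:0] */
    uint32_t e = fl >> 4;      /* exponent = fl[7:4] */
    return (m << e) + uf8_offset(e);
}

/* Hypothetical helper: count leading zeros of a non-zero 32-bit word. */
static unsigned clz32(uint32_t x) {
    unsigned n = 0;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Encode: estimate e from the MSB, then correct it so that
 * offset(e) <= value < offset(e+1). Inputs above uf8_decode(0xFF)
 * are outside the representable range and are not handled here. */
static uint8_t uf8_encode(uint32_t value) {
    if (value < 16)
        return (uint8_t)value;             /* e = 0, m = value */

    uint32_t msb = 31u - clz32(value);     /* value >= 16, so msb >= 4 */
    uint32_t e = msb - 4u;
    if (e > 15u)
        e = 15u;

    while (e > 0u && value < uf8_offset(e))
        e--;                               /* estimate was too high */
    while (e < 15u && value >= uf8_offset(e + 1u))
        e++;                               /* estimate was too low */

    uint32_t m = (value - uf8_offset(e)) >> e;
    return (uint8_t)((e << 4) | m);
}

/* Mirrors the assembly `test` routine: round-trip plus strict monotonicity.
 * Returns 1 on success, 0 on the first failure. */
static int uf8_self_test(void) {
    int64_t prev = -1;
    for (int fl = 0; fl < 256; fl++) {
        uint32_t v = uf8_decode((uint8_t)fl);
        if (uf8_encode(v) != fl)
            return 0;
        if ((int64_t)v <= prev)
            return 0;
        prev = (int64_t)v;
    }
    return 1;
}
```

Calling `uf8_self_test()` from a host-side `main` gives a quick cross-check for any of the assembly versions: the values it decodes should match what the RV32I `uf8_decode` leaves in `a0` for the same `fl`.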
* **decode:** used $$ \text{offset}(e)=(0x7FFF \gg (15-e)) \ll 4 $$ and `value = (m<<e) + offset(e)` * **encode:** implemented **overflow loop** accumulating `(overflow<<=1; +=16)` until `e`; if misestimated, `adjust_overflow` / `find_exact_exp` refines `e` before mantissa computation * **test:** stored **fl in `s3`** (safe across two `jal`); checks **round-trip** + **monotonicity** * **Status:** fully functional & passes tests, but with higher **memory traffic** and **stack overhead** <details> <summary>V0 full code</summary> ```asm # 字串常數 .data success_msg: .string "All tests passed.\n" failure_msg: .string "Tests failed.\n" .text .globl main # 主程式 main: addi sp, sp, -16 sw ra, 0(sp) # 呼叫測試函數 jal ra, test # 檢查測試結果 beq a0, zero, print_failure print_success: la a0, success_msg # a0 = C-string 位址 li a7, 4 # 4: print_string (Ripes 教學環境) ecall li a7, 10 # 10: exit ecall print_failure: la a0, failure_msg li a7, 4 ecall li a7, 10 ecall # 程式入口點 _start: jal ra, main # CLZ - Count Leading Zeros # 輸入: a0 = 32位無號整數 # 輸出: a0 = 前導零的數量 (0-32) clz: # 保存暫存器 addi sp, sp, -16 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) # 檢查輸入是否為零 beq a0, zero, clz_zero # 初始化計數器和移位量 li s0, 0 # count = 0 li s1, 32 # n = 32 li s2, 16 # c = 16 clz_loop: # y = x >> c srl t0, a0, s2 # t0 = a0 >> s2 beq t0, zero, clz_no_shift # 如果 y != 0，則 n -= c, x = y sub s1, s1, s2 # n -= c mv a0, t0 # x = y clz_no_shift: # c >>= 1 srli s2, s2, 1 # c = c >> 1 bne s2, zero, clz_loop # return n - x sub a0, s1, a0 j clz_end clz_zero: li a0, 32 # 如果輸入為0，返回32 clz_end: # 恢復暫存器 lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) addi sp, sp, 16 ret # UF8_DECODE - 將 UF8 格式轉換為 32位整數 # 輸入: a0 = UF8 值 (8位) # 輸出: a0 = 32位整數值 uf8_decode: addi sp, sp, -16 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) # 提取尾數 (lower 4 bits) andi s0, a0, 0x0f # mantissa = fl & 0x0f # 提取指數 (upper 4 bits) srli s1, a0, 4 # exponent = fl >> 4 # 計算偏移量: offset = (0x7FFF >> (15 - exponent)) << 4 li t0, 15 sub t0, t0, s1 # t0 = 15 - exponent li t1, 0x7FFF 
srl t1, t1, t0 # t1 = 0x7FFF >> (15 - exponent) slli t1, t1, 4 # offset = t1 << 4 # 計算結果: (mantissa << exponent) + offset sll t2, s0, s1 # t2 = mantissa << exponent add a0, t2, t1 # result = (mantissa << exponent) + offset lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) addi sp, sp, 16 ret # UF8_ENCODE - 將 32位整數轉換為 UF8 格式 # 輸入: a0 = 32位整數值 # 輸出: a0 = UF8 值 (8位) uf8_encode: addi sp, sp, -32 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) sw s4, 20(sp) sw s5, 24(sp) mv s0, a0 # s0 = value # 如果 value < 16，直接返回 li t0, 16 blt s0, t0, encode_direct # 使用 CLZ 計算指數 jal ra, clz # a0 = clz(value) li t0, 31 sub s1, t0, a0 # msb = 31 - lz # 初始化變數 li s2, 0 # exponent = 0 li s3, 0 # overflow = 0 # 檢查 msb >= 5 li t0, 5 blt s1, t0, find_exact_exp # 估算指數: exponent = msb - 4 addi s2, s1, -4 # exponent = msb - 4 li t0, 15 bgt s2, t0, cap_exponent j calc_overflow cap_exponent: li s2, 15 # exponent = 15 calc_overflow: # 計算初始 overflow li s4, 0 # e = 0 overflow_loop: bge s4, s2, adjust_overflow slli s3, s3, 1 # overflow <<= 1 addi s3, s3, 16 # overflow += 16 addi s4, s4, 1 # e++ j overflow_loop adjust_overflow: # 調整 overflow 如果估算錯誤 beq s2, zero, find_exact_exp bge s0, s3, find_exact_exp addi s3, s3, -16 # overflow -= 16 srli s3, s3, 1 # overflow >>= 1 addi s2, s2, -1 # exponent-- j adjust_overflow find_exact_exp: # 找到精確的指數 li t0, 15 bge s2, t0, calc_mantissa slli t1, s3, 1 # next_overflow = overflow << 1 addi t1, t1, 16 # next_overflow += 16 bge s0, t1, update_exp j calc_mantissa update_exp: mv s3, t1 # overflow = next_overflow addi s2, s2, 1 # exponent++ j find_exact_exp calc_mantissa: # 計算尾數: mantissa = (value - overflow) >> exponent sub t0, s0, s3 # t0 = value - overflow srl s5, t0, s2 # mantissa = t0 >> exponent # 組合結果: (exponent << 4) | mantissa slli t0, s2, 4 # t0 = exponent << 4 or a0, t0, s5 # result = (exponent << 4) | mantissa j encode_end encode_direct: # 直接返回 value (< 16) mv a0, s0 encode_end: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 
20(sp) lw s5, 24(sp) addi sp, sp, 32 ret # TEST - 測試編碼/解碼的往返轉換 # 輸出: a0 = 1 (通過) 或 0 (失敗) test: addi sp, sp, -32 sw ra, 0(sp) sw s0, 4(sp) # i (loop counter) sw s1, 8(sp) # previous_value sw s2, 12(sp) # current value sw s3, 16(sp) # fl (original) sw s4, 20(sp) # fl2 (re-encoded) sw s5, 24(sp) # passed flag li s0, 0 # i = 0 li s1, -1 # previous_value = -1 li s5, 1 # passed = true test_loop: li t0, 256 bge s0, t0, test_end # fl = i mv s3, s0 # value = uf8_decode(fl) mv a0, s3 jal ra, uf8_decode mv s2, a0 # s2 = decoded value # fl2 = uf8_encode(value) mv a0, s2 jal ra, uf8_encode mv s4, a0 # s4 = re-encoded value # 檢查 fl != fl2 bne s3, s4, test_fail # 檢查 value <= previous_value ble s2, s1, test_fail # 更新 previous_value mv s1, s2 # i++ addi s0, s0, 1 j test_loop test_fail: li s5, 0 # passed = false test_end: mv a0, s5 # return passed lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) addi sp, sp, 32 ret ``` </details> --- ### 3.2 V1 (Lightweight + O(1) offset) * **encode:** replaced loop with **O(1) formula** $$ \text{offset}(e)=((1\ll e)-1)\ll 4 $$ > Estimate `e` from `msb`, then use `offset(e)` and `offset(e+1)` for single-step interval correction (`update_exp_up/down`) before computing mantissa. * **decode:** kept `0x7FFF` formula (mathematically equivalent to V1’s O(1) form, including `e=0→offset=0`) * **test results:** passed correctly; encode’s **offset construction cost reduced from O(e) to O(1)**; further improvement possible by full `t*` leaf refactor to remove sw/lw. 
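The equivalence of the three offset constructions can be checked mechanically. The C sketch below (function names are illustrative and do not come from the assembly) computes the offset with the V0 accumulation loop, with the `0x7FFF` shift form kept in decode, and with the V1 closed form, then compares them for every exponent `e` in `0..15`.

```c
#include <stdint.h>

/* V0 encode: build the offset iteratively, one step per exponent (O(e)). */
static uint32_t offset_loop(uint32_t e) {
    uint32_t overflow = 0;
    for (uint32_t i = 0; i < e; i++)
        overflow = (overflow << 1) + 16;   /* overflow <<= 1; overflow += 16 */
    return overflow;
}

/* Decode form used in V0..V2: (0x7FFF >> (15 - e)) << 4. */
static uint32_t offset_shift(uint32_t e) {
    return (0x7FFFu >> (15u - e)) << 4;
}

/* V1/V2 encode form: ((1 << e) - 1) << 4, constant time. */
static uint32_t offset_closed(uint32_t e) {
    return ((1u << e) - 1u) << 4;
}

/* Returns 1 if all three forms agree for every valid exponent, else 0. */
static int offsets_agree(void) {
    for (uint32_t e = 0; e <= 15; e++) {
        if (offset_loop(e) != offset_closed(e))
            return 0;
        if (offset_shift(e) != offset_closed(e))
            return 0;
    }
    return 1;
}
```

All three forms produce `16 * (2^e - 1)`: the loop doubles and adds 16 at each step, while `0x7FFF >> (15 - e)` leaves exactly `e` one-bits, which is `(1 << e) - 1`.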
<details> <summary>V1 encode code</summary>

```asm
# ====== O(1) version: offset(e) = ((1<<e)-1) << 4 ======
calc_overflow:
    li   t1, 1
    sll  t1, t1, s2            # t1 = 1 << e
    addi t1, t1, -1            # t1 = (1<<e) - 1
    slli s3, t1, 4             # s3 = offset(e) = ((1<<e)-1)<<4

# ====== Adjust e so that offset(e) <= value < offset(e+1) (when e < 15) ======
find_exact_exp:
    # If e < 15, first compute next_offset = ((1<<(e+1))-1)<<4
    li   t0, 15
    bge  s2, t0, calc_mantissa # e == 15: no next_offset, go compute the mantissa
    addi t2, s2, 1             # t2 = e+1
    li   t3, 1
    sll  t3, t3, t2            # t3 = 1<<(e+1)
    addi t3, t3, -1            # t3 = (1<<(e+1))-1
    slli t3, t3, 4             # t3 = next_offset = offset(e+1)
    # value >= offset(e+1): e++ (move up)
    bge  s0, t3, update_exp_up
    # value < offset(e): e-- (move down)
    blt  s0, s3, update_exp_down
    # Otherwise offset(e) <= value < offset(e+1), so e is final
    j    calc_mantissa

update_exp_up:
    addi s2, s2, 1             # e = e+1
    mv   s3, t3                # s3 = offset(e), equal to the next_offset just computed
    j    find_exact_exp

update_exp_down:
    addi s2, s2, -1            # e = e-1
    # Recompute offset(e) in O(1)
    li   t1, 1
    sll  t1, t1, s2
    addi t1, t1, -1
    slli s3, t1, 4
    j    find_exact_exp
```

</details>

---

### 3.3 V2 (Leafified, s→t conversion; caller-saved fix in test)

* **clz / decode:** **leafified**, all `t*`, **zero sw/lw**
* **encode:** saves only `ra/s0`; everything else lives in `t*`; `calc_overflow` uses the **O(1) offset**, then `find_exact_exp` performs the single-step range adjustment before the mantissa is computed
* **test:**
  * Early V2 kept `fl` in `t1`, but the callees (`uf8_decode` / `uf8_encode`) overwrite `t1` across the `jal`, so the round-trip check failed
  * Fixed by storing `fl` in the callee-saved register `s3`
* **Result:** after moving `fl` to `s3` and fixing the `msb<5` branch, V2 passes all tests with **minimal sw/lw** and fewer stall cycles overall

<details> <summary>V2 full code</summary>

```asm
# String constants
.data
success_msg: .string "All tests passed.\n"
failure_msg: .string "Tests failed.\n"

.text
.globl main

# Main program
main:
    # No need to save ra or adjust sp: main exits through ecall
    # Call the test function
    jal  ra, test
    # Check the test result
    beq  a0, zero, print_failure
print_success:
    la   a0, success_msg       # a0 = address of the C string
    li   a7, 4                 # 4: print_string (Ripes environment call)
    ecall
    li   a7, 10                # 10: exit
    ecall
print_failure:
    la   a0, failure_msg
    li   a7, 4
    ecall
    li   a7, 10
    ecall

# Program entry point
# _start:
#     jal ra, main

# CLZ - Count Leading Zeros
# Input:  a0 = 32-bit unsigned integer
# Output: a0 = number of leading zeros (0-32)
clz:
    # Check whether the input is zero
    beq  a0, zero, clz_zero
    # Initialize the counter and the shift amount
    li   t0, 0                 # count = 0
    li   t1, 32                # n = 32
    li   t2, 16                # c = 16
clz_loop:
    # y = x >> c
    srl  t3, a0, t2            # t3 = a0 >> t2
    beq  t3, zero, clz_no_shift
    # If y != 0: n -= c, x = y
    sub  t1, t1, t2            # n -= c
    mv   a0, t3                # x = y
clz_no_shift:
    # c >>= 1
    srli t2, t2, 1             # c = c >> 1
    bne  t2, zero, clz_loop
    # return n - x
    sub  a0, t1, a0
    ret
clz_zero:
    li   a0, 32                # input is 0: return 32
    ret

# UF8_DECODE - convert a UF8 value to a 32-bit integer
# Input:  a0 = UF8 value (8 bits)
# Output: a0 = 32-bit integer value
uf8_decode:
    # Extract the mantissa (lower 4 bits)
    andi t4, a0, 0x0f          # mantissa = fl & 0x0f
    # Extract the exponent (upper 4 bits)
    srli t5, a0, 4             # exponent = fl >> 4
    # offset = (0x7FFF >> (15 - exponent)) << 4
    li   t0, 15
    sub  t0, t0, t5            # t0 = 15 - exponent
    li   t1, 0x7FFF
    srl  t1, t1, t0            # t1 = 0x7FFF >> (15 - exponent)
    slli t1, t1, 4             # offset = t1 << 4
    # result = (mantissa << exponent) + offset
    sll  t2, t4, t5            # t2 = mantissa << exponent
    add  a0, t2, t1
    ret

# UF8_ENCODE - convert a 32-bit integer to UF8
# Input:  a0 = 32-bit integer value
# Output: a0 = UF8 value (8 bits)
uf8_encode:
    addi sp, sp, -8
    sw   ra, 0(sp)
    sw   s0, 4(sp)             # s0 = value (the only value that must survive the call)
    mv   s0, a0                # s0 = value
    # If value < 16, return it directly
    li   t0, 16
    blt  s0, t0, encode_direct
    # Use CLZ to get msb and the initial exponent
    mv   a0, s0
    jal  ra, clz               # a0 = clz(value)
    li   t1, 31
    sub  t2, t1, a0            # msb = 31 - lz
    # Initialize variables
    li   t3, 0                 # exponent = 0
    li   t4, 0                 # overflow = 0
    # Check msb >= 5
    li   t0, 5
    blt  t2, t0, find_exact_exp
    # Estimate the exponent: exponent = msb - 4
    addi t3, t2, -4            # exponent = msb - 4
    li   t0, 15
    bgt  t3, t0, cap_exponent
    j    calc_overflow
cap_exponent:
    li   t3, 15                # exponent = 15

# O(1) version: offset(e) = ((1<<e)-1) << 4
calc_overflow:
    li   t5, 1
    sll  t5, t5, t3            # t5 = 1 << e
    addi t5, t5, -1            # t5 = (1<<e) - 1
    slli t4, t5, 4             # t4 = offset(e) = ((1<<e)-1)<<4

# Adjust e so that offset(e) <= value < offset(e+1) (when e < 15)
find_exact_exp:
    # If e < 15, first compute next_offset = ((1<<(e+1))-1)<<4
    li   t0, 15
    bge  t3, t0, calc_mantissa # e == 15: no next_offset, go compute the mantissa
    addi t6, t3, 1             # t6 = e+1
    li   t1, 1
    sll  t1, t1, t6            # t1 = 1<<(e+1)
    addi t1, t1, -1            # t1 = (1<<(e+1))-1
    slli t1, t1, 4             # t1 = next_offset = offset(e+1)
    # value >= offset(e+1): e++ (move up)
    bge  s0, t1, update_exp_up
    # value < offset(e): e-- (move down)
    blt  s0, t4, update_exp_down
    # Otherwise offset(e) <= value < offset(e+1), so e is final
    j    calc_mantissa

update_exp_up:
    addi t3, t3, 1             # e = e+1
    mv   t4, t1                # t4 = offset(e), equal to the next_offset just computed
    j    find_exact_exp

update_exp_down:
    addi t3, t3, -1            # e = e-1
    # Recompute offset(e) in O(1)
    li   t5, 1
    sll  t5, t5, t3
    addi t5, t5, -1
    slli t4, t5, 4
    j    find_exact_exp

calc_mantissa:
    # mantissa = (value - overflow) >> exponent
    sub  t0, s0, t4            # t0 = value - overflow
    srl  t2, t0, t3            # mantissa = t0 >> exponent
    # result = (exponent << 4) | mantissa
    slli t1, t3, 4             # t1 = exponent << 4
    or   a0, t1, t2
    j    encode_end

encode_direct:
    # Return value directly (< 16)
    mv   a0, s0

encode_end:
    lw   ra, 0(sp)
    lw   s0, 4(sp)
    addi sp, sp, 8
    ret

# TEST - round-trip test of encode/decode
# Output: a0 = 1 (pass) or 0 (fail)
test:
    addi sp, sp, -24
    sw   ra, 0(sp)
    sw   s0, 4(sp)             # i (loop counter)
    sw   s1, 8(sp)             # previous_value
    sw   s2, 12(sp)            # decoded value (must survive the encode call)
    sw   s3, 16(sp)            # fl (must survive both calls)
    sw   s5, 20(sp)            # passed
    li   s0, 0                 # i = 0
    li   s1, -1                # previous_value = -1
    li   s5, 1                 # passed = true
test_loop:
    li   t0, 256
    bge  s0, t0, test_end
    # fl = i; kept in s3 because the callees overwrite t1
    mv   s3, s0
    # value = uf8_decode(fl)
    mv   a0, s3
    jal  ra, uf8_decode
    mv   s2, a0                # decoded value must survive the next jal, so use s2
    # fl2 = uf8_encode(value)
    mv   a0, s2
    jal  ra, uf8_encode
    mv   t2, a0                # fl2 (no call follows, so t2 is safe)
    # Fail if fl != fl2
    bne  s3, t2, test_fail
    # Fail if value <= previous_value
    ble  s2, s1, test_fail
    # Update previous_value
    mv   s1, s2
    # i++
    addi s0, s0, 1
    j    test_loop
test_fail:
    li   s5, 0                 # passed = false
test_end:
    mv   a0, s5                # return passed
    lw   ra, 0(sp)
    lw   s0, 4(sp)
    lw   s1, 8(sp)
    lw   s2, 12(sp)
    lw   s3, 16(sp)
    lw   s5, 20(sp)
    addi sp, sp, 24
    ret
```

</details>

---

## 4. 
Version Comparison Table (Quick Summary)

| Aspect | V0 | V1 | V2 |
| :------------------ | :-------------------------- | :--------------------------------------------- | :----------------------------------------- |
| `clz` | Uses stack & `s*` | Same as V0 | **Leaf**, all `t*`, zero `sw/lw` |
| `uf8_decode` | Stack, `(0x7FFF>>(15-e))<<4` | Same as V0 | **Leaf**, all `t*`, still `0x7FFF` formula |
| `uf8_encode` | Loop builds overflow (O(e)) | **O(1)** `((1<<e)-1)<<4`; `update_exp_up/down` | Same as V1, saves only `ra/s0` |
| `test` (fl storage) | `s3` (safe) | `s3` (safe) | Originally `t1`, later → `s3` |
| Memory traffic | High | High (same L1-data accesses as V0, see 6.1) | **Low** |

---

## 5. Testing (Automation + Representative Values)

### 5.1 Automated Validation

* **Round-trip:** for `fl∈[0..255]`, check `encode(decode(fl))==fl`
* **Monotonicity:** with `v(fl)=decode(fl)`, verify `v(fl+1)>v(fl)` for `fl∈[0..254]`
* Prints `All tests passed.` on success; otherwise `Tests failed.`

### 5.2 Representative Manual Examples

UF8 rule: `value = (m<<e) + (((1<<e)-1)<<4)`

| UF8 `fl` | decimal | e | m | decoded `value` | re-encoded | Result |
| :------: | ------: | -: | -: | --------------: | ---------: | :----: |
| `0x00` | 0 | 0 | 0 | 0 | `0x00` | ✅ |
| `0x10` | 16 | 1 | 0 | 16 | `0x10` | ✅ |
| `0x7F` | 127 | 7 | 15 | 3952 | `0x7F` | ✅ |
| `0xF0` | 240 | 15 | 0 | 524 272 | `0xF0` | ✅ |
| `0xFF` | 255 | 15 | 15 | 1 015 792 | `0xFF` | ✅ |

---

## 6. Performance and Code Size (Ripes Measurement)

### 6.1 Runtime Metrics (`fl=0..255` test)

| Version | Cycles | Instr. Retired | CPI | IPC | L1-Data Accesses |
| :------ | -----: | -------------: | ---: | ----: | ---------------: |
| **V0** | 53 666 | 39 387 | 1.36 | 0.734 | 7 055 |
| **V1** | 40 214 | 30 651 | 1.31 | 0.762 | 7 055 |
| **V2** | 32 466 | 23 879 | 1.36 | 0.736 | 1 036 |

> Overhead ≈ `Cycles − Instr. Retired`
> V0: 14 279, V1: 9 563, V2: 8 587

---

### 6.2 Improvements (Relative)

1. **Cycles**
   * V1 vs V0: −25.1% (53 666 → 40 214)
   * V2 vs V0: −39.5% (53 666 → 32 466)
   * V2 vs V1: −19.3% (40 214 → 32 466)
2. **Instructions**
   * V1 vs V0: −22.2% (39 387 → 30 651)
   * V2 vs V0: −39.4% (39 387 → 23 879)
   * V2 vs V1: −22.1% (30 651 → 23 879)
3. **CPI**
   * V1 vs V0: 1.36 → 1.31 (−3.7%, an improvement)
   * V2 vs V1: 1.31 → 1.36 (slight regression)

   > Although V2's CPI rose slightly, its 22% fewer instructions still yield fewer total cycles.
4. **L1-Data Accesses**
   * V2 vs V0/V1: 7 055 → 1 036 (−85.3%). V1 only changed the offset computation, so its access count equals V0's; the drop in V2 directly shows the benefit of **leafification** and reduced stack `sw/lw`.

---

### 6.3 Code Size (per function)

Each size is the difference between the start address of the next function and the start address of the function itself:

$$
\text{size}(\texttt{clz}) = \text{addr}(\texttt{uf8\_decode}) - \text{addr}(\texttt{clz})
$$

$$
\text{size}(\texttt{uf8\_decode}) = \text{addr}(\texttt{uf8\_encode}) - \text{addr}(\texttt{uf8\_decode})
$$

$$
\text{size}(\texttt{uf8\_encode}) = \text{addr}(\texttt{test}) - \text{addr}(\texttt{uf8\_encode})
$$

**Start addresses**

* V0: `clz` 0x44, `uf8_decode` 0xA4, `uf8_encode` 0xF0, `test` 0x1DC
* V1: `clz` 0x40, `uf8_decode` 0xA0, `uf8_encode` 0xEC, `test` 0x1DC
* V2: `clz` 0x38, `uf8_decode` 0x70, `uf8_encode` 0x9C, `test` 0x16C

Sizes are given as bytes / instructions:

| Version | `clz` | `uf8_decode` | `uf8_encode` | **Sum** |
| :------ | ----------: | -----------: | -----------: | --------: |
| **V0** | 96 / 24 | 76 / 19 | 236 / 59 | 408 / 102 |
| **V1** | 96 / 24 | 76 / 19 | 240 / 60 | 412 / 103 |
| **V2** | 56 / 14 | 44 / 11 | 208 / 52 | 308 / 77 |

**Improvement**

* V2 vs V0 (total): 408 → 308 B (−24.5%)
* V2 vs V1 (total): 412 → 308 B (−25.2%)
* Per function (V2 vs V0): `clz` −41.7%, `uf8_decode` −42.1%, `uf8_encode` −11.9%

> Observation: V1's closed-form offset slightly increased the encode size (236 → 240 B). V2, through **leafification + register reallocation**, shrank all three functions, with `clz` and `uf8_decode` each shrinking by more than 40%.

---

## 7. References

* AI Tool (ChatGPT)
* Ripes (environment calls) / RISC-V ISA (RV32I)
* UF8 and bit hacks (CLZ)
* [Quiz 1 Solution](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C)
* [Lab 1](https://hackmd.io/@sysprog/H1TpVYMdB)

---

# ProbC

## 1. Abstract

This project implements **BFloat16 arithmetic** on **RV32I** using **pure integer operations**: `f32_to_bf16`, `bf16_to_f32`, `add_bf16`, `sub_bf16`, `mul_bf16` (with round-to-nearest-even, RNE), `div_bf16` (integer long division), and `sqrt_bf16` (binary search). The IEEE-754 corner cases **NaN / Inf / ±0** are handled, and **subnormal numbers are flushed to zero**. All programs run correctly in **Ripes**.

This submission presents a one-step upgrade **from V0 → V1** (V1 = final version):

1. **Simplified `main` test harness**: removed redundant "clear upper 16 bits" operations.
2. **`add_bf16` leafified**: switched to **caller-saved (`a*/t*`)** registers, with **no stack frame or `s*` saves**. Normalization after subtraction was changed from a while-loop to a **multi-bit segmented shift (one-step normalize)**.

The other routines (`mul_bf16`, `div_bf16`, `sqrt_bf16`) retain the V0 logic for compatibility and as a correctness reference.

---

## 2. BFloat16 Overview

![image](https://hackmd.io/_uploads/H1aDR9c6el.png)

![image](https://hackmd.io/_uploads/B1IdR5qpgg.png)

* **add/sub**: align exponents → add/sub 8-bit mantissas (with hidden 1) → normalize → take the 7-bit fraction
* **mul**: multiply two 8-bit mantissas, adjust the exponent by the bias, normalize, then apply **RNE rounding via guard/sticky bits**
* **div**: integer restoring division yields the 7-bit fraction (truncated)
* **sqrt**: binary search on the mantissa; the exponent is adjusted for odd/even and halved

---

## 3. What Changed (V0 → V1)

### 1. main : Remove redundant "clear upper 16 bits"

**V0 behavior**: after every call that returns a bf16 value, `a0` was forced through

```asm
slli a0, a0, 16
srli a0, a0, 16
```

**Why redundant**: each subroutine already returns a properly masked 16-bit result, so the high bits are already zero.

**V1 action**: removed these unnecessary `slli/srli` pairs.

**Effect**: about 12 fewer dynamic instructions (two shifts at each of the six call sites that compare a bf16 result), and correspondingly fewer ALU shift operations.

---

### 2. 
`add_bf16`: Leafified + Normalize via segmented shift **V0**: * Used `s*` registers and built a stack frame (`sw/lw` overhead). * Normalization after subtraction used a while-loop, shifting left bit-by-bit (up to 7 times). **V1**: * Uses only **`a*/t*`** → true **leaf function**, **no stack frame, zero `sw/lw`**. * Normalization replaced with **segmented shifting** (check bits 6..1 → shift 1–7 bits directly). Single-step adjustment reduces branches and loop iterations. **Result**: fewer branches and memory accesses; both **cycles** and **retired instructions** drop noticeably. :::spoiler <details> <summary>V0 full code</summary> ```asm ############################################################ # File : bf16_all_fixed.S # Target : RISC-V RV32I (Ripes-compatible) # Notes : 修正 bgt/ble 類助記詞、立即數溢位、label 對齊、16-bit 遮罩 # ★ 本版加入所有必要的 s* 保存/還原 (callee-saved) ############################################################ .data # 資料區 msg_suite_ok: .asciz "BF16 suite: PASS\n" msg_suite_fail: .asciz "BF16 suite: FAIL\n" msg_t1: .asciz "[OK ] f32_to_bf16\n" msg_t1f: .asciz "[FAIL] f32_to_bf16\n" msg_t2: .asciz "[OK ] bf16_to_f32\n" msg_t2f: .asciz "[FAIL] bf16_to_f32\n" msg_add: .asciz "[OK ] add_bf16\n" msg_addf: .asciz "[FAIL] add_bf16\n" msg_sub: .asciz "[OK ] sub_bf16\n" msg_subf: .asciz "[FAIL] sub_bf16\n" msg_mul: .asciz "[OK ] mul_bf16\n" msg_mulf: .asciz "[FAIL] mul_bf16\n" msg_div: .asciz "[OK ] div_bf16\n" msg_divf: .asciz "[FAIL] div_bf16\n" msg_sqrt: .asciz "[OK ] sqrt_bf16\n" msg_sqrtf: .asciz "[FAIL] sqrt_bf16\n" .text # 程式區 .globl main .globl f32_to_bf16 .globl bf16_to_f32 .globl add_bf16 .globl sub_bf16 .globl mul_bf16 .globl div_bf16 .globl sqrt_bf16 .globl mul_u16 #============================================================ # main：呼叫各函式做基本測試並列印 PASS/FAIL #============================================================ main: addi sp, sp, -16 # 建立簡易 stack frame sw ra, 12(sp) # 保存返回位址 li s0, 1 # s0 = suite_pass = 1 #------------------ T1: f32_to_bf16 
------------------ li a0, 0x3F800000 # 1.0f jal ra, f32_to_bf16 # -> 0x3F80 (bf16) slli a0, a0, 16 # 只比較低 16 srli a0, a0, 16 li t0, 0x3F80 bne a0, t0, print_t1_fail print_t1_ok: la a0, msg_t1 li a7, 4 ecall j t1_done print_t1_fail: la a0, msg_t1f li a7, 4 ecall li s0, 0 #------------------ T2: bf16_to_f32 ------------------ t1_done: li a0, 0x4000 # bf16(2.0) -> f32 0x40000000 jal ra, bf16_to_f32 li t0, 0x40000000 bne a0, t0, print_t2_fail print_t2_ok: la a0, msg_t2 li a7, 4 ecall j t2_done print_t2_fail: la a0, msg_t2f li a7, 4 ecall li s0, 0 #------------------ ADD: 1.0 + 1.0 = 2.0 ------------ t2_done: li a0, 0x3F80 # 1.0 li a1, 0x3F80 # 1.0 jal ra, add_bf16 # -> 2.0 (0x4000) slli a0, a0, 16 srli a0, a0, 16 li t0, 0x4000 bne a0, t0, print_add_fail print_add_ok: la a0, msg_add li a7, 4 ecall j add_done print_add_fail: la a0, msg_addf li a7, 4 ecall li s0, 0 #------------------ SUB: 2.0 - 1.0 = 1.0 ------------ add_done: li a0, 0x4000 # 2.0 li a1, 0x3F80 # 1.0 jal ra, sub_bf16 # -> 1.0 (0x3F80) slli a0, a0, 16 srli a0, a0, 16 li t0, 0x3F80 bne a0, t0, print_sub_fail print_sub_ok: la a0, msg_sub li a7, 4 ecall j sub_done print_sub_fail: la a0, msg_subf li a7, 4 ecall li s0, 0 #------------------ MUL: 1.5 * 2.0 = 3.0 ------------ sub_done: li a0, 0x3FC0 # 1.5 li a1, 0x4000 # 2.0 jal ra, mul_bf16 # -> 3.0 (0x4040) slli a0, a0, 16 srli a0, a0, 16 li t0, 0x4040 bne a0, t0, print_mul_fail print_mul_ok: la a0, msg_mul li a7, 4 ecall j mul_done print_mul_fail: la a0, msg_mulf li a7, 4 ecall li s0, 0 #------------------ DIV: 3.0 / 2.0 = 1.5 ------------ mul_done: li a0, 0x4040 # 3.0 li a1, 0x4000 # 2.0 jal ra, div_bf16 # -> 1.5 (0x3FC0) slli a0, a0, 16 srli a0, a0, 16 li t0, 0x3FC0 bne a0, t0, print_div_fail print_div_ok: la a0, msg_div li a7, 4 ecall j div_done print_div_fail: la a0, msg_divf li a7, 4 ecall li s0, 0 #------------------ SQRT: sqrt(4.0) = 2.0 ----------- div_done: li a0, 0x4080 # 4.0 jal ra, sqrt_bf16 # -> 2.0 (0x4000) slli a0, a0, 16 srli a0, a0, 16 li t0, 
0x4000 bne a0, t0, print_sqrt_fail print_sqrt_ok: la a0, msg_sqrt li a7, 4 ecall j sqrt_done print_sqrt_fail: la a0, msg_sqrtf li a7, 4 ecall li s0, 0 #------------------ 總結 SUITE 結果 ------------------ sqrt_done: bnez s0, suite_ok suite_fail: la a0, msg_suite_fail li a7, 4 ecall j exit_suite suite_ok: la a0, msg_suite_ok li a7, 4 ecall exit_suite: lw ra, 12(sp) # ★ 對稱恢復 ra/sp（雖 ecall 10 不返回） addi sp, sp, 16 li a7, 10 ecall #============================================================ # f32_to_bf16 (leaf; 不使用 s*；不需保存) #============================================================ f32_to_bf16: srli t0, a0, 23 andi t0, t0, 0xFF li t1, 0xFF beq t0, t1, f32_is_special # exp==0xFF -> Inf/NaN srli t2, a0, 16 andi t2, t2, 1 li t3, 0x7FFF # 0.5 ULP add t2, t2, t3 add a0, a0, t2 srli a0, a0, 16 ret f32_is_special: srli a0, a0, 16 ret #============================================================ # bf16_to_f32 (leaf; 不使用 s*；不需保存) #============================================================ bf16_to_f32: slli a0, a0, 16 ret #============================================================ # add_bf16 (A + B) # 使用 s2,s3,s4,s5,s8,s9 → 需保存/還原 #============================================================ add_bf16: # ---------- Prologue: save s* ---------- addi sp, sp, -24 sw s8, 0(sp) sw s9, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) # 取 A/B 的 16-bit（高位清零） slli s8, a0, 16 # s8 = A srli s8, s8, 16 slli s9, a1, 16 # s9 = B srli s9, s9, 16 srli t0, s8, 15 # sign_a andi t0, t0, 1 srli t1, s9, 15 # sign_b andi t1, t1, 1 srli t2, s8, 7 # exp_a andi t2, t2, 0xFF srli t3, s9, 7 # exp_b andi t3, t3, 0xFF andi t4, s8, 0x7F # mant_a andi t5, s9, 0x7F # mant_b # 特殊值 li t6, 0xFF beq t2, t6, add_a_special beq t3, t6, add_b_special # 零值快速路徑 beq t2, x0, add_chk_a_zero j add_chk_b_zero add_chk_a_zero: beq t4, x0, add_ret_b # A == 0 -> 回傳 B add_chk_b_zero: beq t3, x0, add_chk_b_is_zero j add_norm add_chk_b_is_zero: beq t5, x0, add_ret_a # B == 0 -> 回傳 A add_norm: bnez t2, add_a_set1 j 
add_b_set1 add_a_set1: ori t4, t4, 0x80 # exp_a != 0 → 補隱含 1 add_b_set1: beqz t3, add_align # exp_b == 0 → 不補 1 ori t5, t5, 0x80 # exp_b != 0 → 補隱含 1 j add_align add_align: sub s2, t2, t3 # s2 = exp_diff blt x0, s2, add_a_bigger # s2 > 0 blt s2, x0, add_b_bigger # s2 < 0 add s3, x0, t2 # 指數相等 j add_op add_a_bigger: add s3, x0, t2 li t6, 8 blt t6, s2, add_ret_a # 差距 > 8 → 回傳 A srl t5, t5, s2 # B 尾數右移對齊 j add_op add_b_bigger: add s3, x0, t3 li t6, -8 blt s2, t6, add_ret_b # 差距 < -8 → 回傳 B sub t6, x0, s2 # t6 = -s2 srl t4, t4, t6 # A 尾數右移對齊 add_op: beq t0, t1, add_same_sign # 同號 → 相加 # 異號 → 相減 bge t4, t5, add_a_ge_b add s4, x0, t1 # 結果符號 = B sub s5, t5, t4 # mant_res = B - A j add_norm_diff add_a_ge_b: add s4, x0, t0 # 結果符號 = A sub s5, t4, t5 # mant_res = A - B add_norm_diff: beq s5, x0, add_ret_zero # 結果 0 add_norm_loop: andi t6, s5, 0x80 bnez t6, add_pack # 已規範化 addi s3, s3, -1 bge x0, s3, add_ret_zero # 指數 <= 0 → 0 slli s5, s5, 1 j add_norm_loop add_same_sign: add s4, x0, t0 add s5, t4, t5 andi t6, s5, 0x100 beq t6, x0, add_pack srli s5, s5, 1 addi s3, s3, 1 li t6, 0xFF bge s3, t6, add_to_inf add_pack: slli s4, s4, 15 andi s3, s3, 0xFF slli s3, s3, 7 andi s5, s5, 0x7F or a0, s4, s3 or a0, a0, s5 j add_epilogue add_to_inf: slli s4, s4, 15 li t6, 0x7F80 or a0, s4, t6 j add_epilogue add_ret_zero: li a0, 0x0000 j add_epilogue add_ret_a: add a0, x0, s8 j add_epilogue add_ret_b: add a0, x0, s9 j add_epilogue add_a_special: bnez t4, add_ret_a # A=NaN → A beq t3, t6, add_both_inf_nan add a0, x0, s8 # A=Inf、B 非特殊 → A j add_epilogue add_both_inf_nan: bnez t5, add_ret_b # B=NaN → B beq t0, t1, add_ret_b # 同號 Inf → B li a0, 0x7FC0 # +Inf + (-Inf) = NaN j add_epilogue add_b_special: bnez t5, add_ret_b # B=NaN → B add a0, x0, s9 # B=Inf → B j add_epilogue add_epilogue: # ---------- Epilogue: restore s* ---------- lw s5, 20(sp) lw s4, 16(sp) lw s3, 12(sp) lw s2, 8(sp) lw s9, 4(sp) lw s8, 0(sp) addi sp, sp, 24 ret #============================================================ # 
sub_bf16 (A - B) = A + (-B) # 只需保存 ra（caller-saved），本函式會呼叫 add_bf16 #============================================================ sub_bf16: addi sp, sp, -4 sw ra, 0(sp) li t6, 0x8000 xor a1, a1, t6 # 反轉 B 的符號 jal ra, add_bf16 lw ra, 0(sp) addi sp, sp, 4 ret #============================================================ # mul_bf16 (A * B) — 含捨入；使用 s1,s2,s3,s4,s8,s9,s10；呼叫 mul_u16 #============================================================ mul_bf16: # ---------- Prologue ---------- addi sp, sp, -32 sw ra, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s8, 20(sp) sw s9, 24(sp) sw s10,28(sp) # 取 A/B 的 16-bit slli s8, a0, 16 srli s8, s8, 16 slli s9, a1, 16 srli s9, s9, 16 srli t0, s8, 15 # sign_a andi t0, t0, 1 srli t1, s9, 15 # sign_b andi t1, t1, 1 xor s1, t0, t1 # result_sign srli t2, s8, 7 # exp_a andi t2, t2, 0xFF srli t3, s9, 7 # exp_b andi t3, t3, 0xFF andi t4, s8, 0x7F # mant_a andi t5, s9, 0x7F # mant_b li t6, 0xFF beq t2, t6, mul_a_special beq t3, t6, mul_b_special beq t2, x0, mul_chk_a_zero beq t3, x0, mul_chk_b_zero j mul_norm_go mul_chk_a_zero: beqz t4, mul_zero j mul_norm_go mul_chk_b_zero: beqz t5, mul_zero mul_norm_go: addi s2, x0, 0 bnez t2, mul_a_ok beqz t4, mul_zero mul_a_den_loop: andi t6, t4, 0x80 bnez t6, mul_a_normed slli t4, t4, 1 addi s2, s2, -1 j mul_a_den_loop mul_a_normed: addi t2, x0, 1 j mul_b_chk mul_a_ok: ori t4, t4, 0x80 mul_b_chk: bnez t3, mul_b_ok beqz t5, mul_zero mul_b_den_loop: andi t6, t5, 0x80 bnez t6, mul_b_normed slli t5, t5, 1 addi s2, s2, -1 j mul_b_den_loop mul_b_normed: addi t3, x0, 1 j mul_do mul_b_ok: ori t5, t5, 0x80 mul_do: # 直接呼叫（已在 prologue 保存 ra） add a0, x0, t4 add a1, x0, t5 jal ra, mul_u16 # a0 = mant_a * mant_b add s3, x0, a0 # s3 = mant_prod add s4, t2, t3 addi s4, s4, -127 add s4, s4, s2 li t6, 0x8000 and t6, s3, t6 beqz t6, mul_norm_lt2 # >= 2.0 mul_norm_ge2: addi s4, s4, 1 srli s10, s3, 8 andi t0, s10, 0x7F srli t1, s3, 7 andi t1, t1, 1 andi t2, s3, 0x7F j mul_round # [1,2] mul_norm_lt2: srli 
s10, s3, 7 andi t0, s10, 0x7F srli t1, s3, 6 andi t1, t1, 1 andi t2, s3, 0x3F mul_round: beqz t1, mul_after_round bnez t2, mul_do_round andi t3, t0, 1 beqz t3, mul_after_round mul_do_round: addi t0, t0, 1 li t3, 0x80 bne t0, t3, mul_after_round li t0, 0x00 addi s4, s4, 1 mul_after_round: li t6, 0xFF bge s4, t6, mul_to_inf bge x0, s4, mul_underflow slli s1, s1, 15 andi s4, s4, 0xFF slli s4, s4, 7 andi t0, t0, 0x7F or a0, s1, s4 or a0, a0, t0 j mul_epilogue mul_to_inf: slli s1, s1, 15 li t6, 0x7F80 or a0, s1, t6 j mul_epilogue mul_underflow: slli a0, s1, 15 j mul_epilogue mul_zero: slli a0, s1, 15 j mul_epilogue mul_a_special: bnez t4, mul_ret_a beq t3, x0, mul_inf_times_zero slli s1, s1, 15 li t6, 0x7F80 or a0, s1, t6 j mul_epilogue mul_b_special: bnez t5, mul_ret_b beq t2, x0, mul_inf_times_zero slli s1, s1, 15 li t6, 0x7F80 or a0, s1, t6 j mul_epilogue mul_inf_times_zero: li a0, 0x7FC0 j mul_epilogue mul_ret_a: add a0, x0, s8 j mul_epilogue mul_ret_b: add a0, x0, s9 j mul_epilogue mul_epilogue: lw s10,28(sp) lw s9, 24(sp) lw s8, 20(sp) lw s4, 16(sp) lw s3, 12(sp) lw s2, 8(sp) lw s1, 4(sp) lw ra, 0(sp) addi sp, sp, 32 ret #============================================================ # div_bf16 (A / B) # 使用 s1,s2,s3,s4,s5,s8,s9,s10；leaf（不呼叫他人）→ 無需保存 ra #============================================================ div_bf16: # ---------- Prologue ---------- addi sp, sp, -32 sw s1, 0(sp) sw s2, 4(sp) sw s3, 8(sp) sw s4, 12(sp) sw s5, 16(sp) sw s8, 20(sp) sw s9, 24(sp) sw s10,28(sp) # 取 A/B 的 16-bit slli s8, a0, 16 srli s8, s8, 16 slli s9, a1, 16 srli s9, s9, 16 srli t0, s8, 15 # sign_a andi t0, t0, 1 srli t1, s9, 15 # sign_b andi t1, t1, 1 xor s1, t0, t1 # result_sign srli t2, s8, 7 # exp_a andi t2, t2, 0xFF srli t3, s9, 7 # exp_b andi t3, t3, 0xFF andi t4, s8, 0x7F # mant_a andi t5, s9, 0x7F # mant_b li t6, 0xFF beq t3, t6, div_b_special beq t3, x0, div_b_zero_or_subn beq t2, t6, div_a_special beq t2, x0, div_a_zero_or_subn div_common: bnez t2, div_a_set1 beqz t4, 
div_result_zero
div_a_set1:
    ori t4, t4, 0x80
    bnez t3, div_b_set1
    beqz t5, div_result_nan
div_b_set1:
    ori t5, t5, 0x80
    # Long division
    slli s2, t4, 15
    add s3, x0, t5
    li s4, 0
    li s5, 0
div_loop_i:
    li t6, 16
    bge s5, t6, div_qdone
    slli s4, s4, 1
    li t6, 15
    sub t6, t6, s5
    sll t6, s3, t6
    bgeu s2, t6, div_sub
    j div_next
div_sub:
    sub s2, s2, t6
    ori s4, s4, 1
div_next:
    addi s5, s5, 1
    j div_loop_i
div_qdone:
    sub s5, t2, t3
    addi s5, s5, 127
    beqz t2, div_adj_a
    j div_adj_b
div_adj_a:
    addi s5, s5, -1
div_adj_b:
    beqz t3, div_adj_b2
    j div_norm
div_adj_b2:
    addi s5, s5, 1
div_norm:
    li t6, 0x8000
    and t6, s4, t6
    bnez t6, div_q_has1
div_q_shift:
    li t6, 1
    bge t6, s5, div_q_done  # if s5 <= 1
    li s10, 0x8000
    and s10, s4, s10
    bnez s10, div_q_done
    slli s4, s4, 1
    addi s5, s5, -1
    j div_q_shift
div_q_has1:
    srli s4, s4, 8
div_q_done:
    andi s4, s4, 0x7F
    li t6, 0xFF
    bge s5, t6, div_to_inf
    bge x0, s5, div_result_zero
    slli s1, s1, 15
    andi s5, s5, 0xFF
    slli s5, s5, 7
    or a0, s1, s5
    or a0, a0, s4
    j div_epilogue
div_b_special:
    bnez t5, div_ret_b
    slli a0, s1, 15
    j div_epilogue
div_b_zero_or_subn:
    beqz t5, div_by_zero
    j div_common
div_by_zero:
    beq t2, x0, div_result_nan
    slli s1, s1, 15
    li t6, 0x7F80
    or a0, s1, t6
    j div_epilogue
div_a_special:
    bnez t4, div_ret_a
    li t6, 0xFF
    beq t3, t6, div_result_nan
    slli s1, s1, 15
    li t6, 0x7F80
    or a0, s1, t6
    j div_epilogue
div_a_zero_or_subn:
    beqz t4, div_result_zero
    j div_common
div_result_zero:
    slli a0, s1, 15
    j div_epilogue
div_result_nan:
    li a0, 0x7FC0
    j div_epilogue
div_to_inf:
    slli s1, s1, 15
    li t6, 0x7F80
    or a0, s1, t6
    j div_epilogue
div_ret_a:
    add a0, x0, s8
    j div_epilogue
div_ret_b:
    add a0, x0, s9
    j div_epilogue
div_epilogue:
    lw s10, 28(sp)
    lw s9, 24(sp)
    lw s8, 20(sp)
    lw s5, 16(sp)
    lw s4, 12(sp)
    lw s3, 8(sp)
    lw s2, 4(sp)
    lw s1, 0(sp)
    addi sp, sp, 32
    ret

#============================================================
# sqrt_bf16 (sqrt(A))
# Uses s2, s3, s4, s5, s6, s7, s8; calls mul_u16, so ra must also be saved
#============================================================
sqrt_bf16:
    # ---------- Prologue ----------
    addi sp, sp, -32
    sw ra, 0(sp)
    sw s2, 4(sp)
    sw s3, 8(sp)
    sw s4, 12(sp)
    sw s5, 16(sp)
    sw s6, 20(sp)
    sw s7, 24(sp)
    sw s8, 28(sp)
    # A (16-bit): copy the low 16 bits of a0 into s8 (upper 16 bits cleared)
    slli s8, a0, 16
    srli s8, s8, 16
    srli t0, s8, 15  # sign
    andi t0, t0, 1
    srli t1, s8, 7  # exp
    andi t1, t1, 0xFF
    andi t2, s8, 0x7F  # mant
    li t3, 0xFF
    bne t1, t3, sqrt_chk_zero
    bnez t2, sqrt_ret_a  # NaN → A
    bnez t0, sqrt_nan  # -Inf → NaN
    add a0, x0, s8  # +Inf → A
    j sqrt_epilogue
sqrt_chk_zero:
    beq t1, x0, sqrt_zero  # 0/subnormal → 0
    bnez t0, sqrt_nan  # negative → NaN
    addi s2, t1, -127
    ori s3, t2, 0x80
    andi t3, s2, 1
    beqz t3, sqrt_even
    slli s3, s3, 1
    addi s2, s2, -1
sqrt_even:
    srai s4, s2, 1
    addi s4, s4, 127
    li s5, 90  # low
    li s6, 256  # high
    li s7, 128  # result
sqrt_bs_loop:
    blt s6, s5, sqrt_bs_done  # if low > high
    add t3, s5, s6
    srli t3, t3, 1  # mid
    # Call mul_u16 (ra already saved)
    add a0, x0, t3
    add a1, x0, t3
    jal ra, mul_u16
    srli a0, a0, 7  # sq = (mid^2)/128
    blt s3, a0, sqrt_sq_too_big  # if m < sq
    add s7, x0, t3
    addi s5, t3, 1
    j sqrt_bs_loop
sqrt_sq_too_big:
    addi s6, t3, -1
    j sqrt_bs_loop
sqrt_bs_done:
    li t3, 256
    blt s7, t3, sqrt_norm_low_ok
    srli s7, s7, 1
    addi s4, s4, 1
sqrt_norm_low_ok:
    li t3, 128
    bge s7, t3, sqrt_pack
sqrt_shift_up:
    li t3, 1
    bge t3, s4, sqrt_pack  # if s4 <= 1
    slli s7, s7, 1
    addi s4, s4, -1
    blt s7, t3, sqrt_shift_up
sqrt_pack:
    andi s7, s7, 0x7F
    li t3, 0xFF
    bge s4, t3, sqrt_to_inf
    bge x0, s4, sqrt_zero
    slli s4, s4, 7
    or a0, s4, s7
    j sqrt_epilogue
sqrt_zero:
    li a0, 0x0000
    j sqrt_epilogue
sqrt_nan:
    li a0, 0x7FC0
    j sqrt_epilogue
sqrt_to_inf:
    li a0, 0x7F80
    j sqrt_epilogue
sqrt_ret_a:
    add a0, x0, s8
    j sqrt_epilogue
sqrt_epilogue:
    lw s8, 28(sp)
    lw s7, 24(sp)
    lw s6, 20(sp)
    lw s5, 16(sp)
    lw s4, 12(sp)
    lw s3, 8(sp)
    lw s2, 4(sp)
    lw ra, 0(sp)
    addi sp, sp, 32
    ret

#============================================================
# mul_u16: unsigned 16-bit multiply (RV32I: shift-and-add) (leaf)
#============================================================
mul_u16:
    li t0, 0
mul16_loop:
    beq a1, x0, mul16_done
    andi t1, a1, 1
    beq t1, x0, mul16_skip_add
    add t0, t0, a0
mul16_skip_add:
    slli a0, a0, 1
    srli a1, a1, 1
    j mul16_loop
mul16_done:
    add a0, x0, t0
    ret
```
</details>
:::

:::spoiler
<details>
<summary>V1 full code</summary>

```asm
.data
.text
msg_suite_ok: .asciz "BF16 suite: PASS\n"
msg_suite_fail: .asciz "BF16 suite: FAIL\n"
msg_t1: .asciz "[OK ] f32_to_bf16\n"
msg_t1f: .asciz "[FAIL] f32_to_bf16\n"
msg_t2: .asciz "[OK ] bf16_to_f32\n"
msg_t2f: .asciz "[FAIL] bf16_to_f32\n"
msg_add: .asciz "[OK ] add_bf16\n"
msg_addf: .asciz "[FAIL] add_bf16\n"
msg_sub: .asciz "[OK ] sub_bf16\n"
msg_subf: .asciz "[FAIL] sub_bf16\n"
msg_mul: .asciz "[OK ] mul_bf16\n"
msg_mulf: .asciz "[FAIL] mul_bf16\n"
msg_div: .asciz "[OK ] div_bf16\n"
msg_divf: .asciz "[FAIL] div_bf16\n"
msg_sqrt: .asciz "[OK ] sqrt_bf16\n"
msg_sqrtf: .asciz "[FAIL] sqrt_bf16\n"

# Code section
.globl main
.globl f32_to_bf16
.globl bf16_to_f32
.globl add_bf16
.globl sub_bf16
.globl mul_bf16
.globl div_bf16
.globl sqrt_bf16
.globl mul_u16

# Call each function for a basic test and print PASS/FAIL
main:
    addi sp, sp, -16  # small stack frame
    sw ra, 12(sp)  # save return address
    li s0, 1  # s0 = suite_pass = 1

    # T1: f32_to_bf16
    li a0, 0x3F800000  # 1.0f
    jal ra, f32_to_bf16  # -> 0x3F80 (bf16)
    slli a0, a0, 16  # clear the upper 16 bits, keep only the low 16
    srli a0, a0, 16
    li t0, 0x3F80
    bne a0, t0, print_t1_fail
print_t1_ok:
    la a0, msg_t1
    li a7, 4
    ecall
    j t1_done
print_t1_fail:
    la a0, msg_t1f
    li a7, 4
    ecall
    li s0, 0

    # T2: bf16_to_f32
t1_done:
    li a0, 0x4000  # bf16(2.0) -> f32 0x40000000
    jal ra, bf16_to_f32
    li t0, 0x40000000
    bne a0, t0, print_t2_fail
print_t2_ok:
    la a0, msg_t2
    li a7, 4
    ecall
    j t2_done
print_t2_fail:
    la a0, msg_t2f
    li a7, 4
    ecall
    li s0, 0

    # ADD: 1.0 + 1.0 = 2.0
t2_done:
    li a0, 0x3F80  # 1.0
    li a1, 0x3F80  # 1.0
    jal ra, add_bf16  # -> 2.0 (0x4000)
    slli a0, a0, 16  # clear the upper 16 bits
    srli a0, a0, 16
    li t0, 0x4000
    bne a0, t0, print_add_fail
print_add_ok:
    la a0, msg_add
    li a7, 4
    ecall
    j add_done
print_add_fail:
    la a0, msg_addf
    li a7, 4
    ecall
    li s0, 0

    # SUB: 2.0 - 1.0 = 1.0
add_done:
    li a0, 0x4000  # 2.0
    li a1, 0x3F80  # 1.0
    jal ra, sub_bf16  # -> 1.0 (0x3F80)
    slli a0, a0, 16
    srli a0, a0, 16
    li t0, 0x3F80
    bne a0, t0, print_sub_fail
print_sub_ok:
    la a0, msg_sub
    li a7, 4
    ecall
    j sub_done
print_sub_fail:
    la a0, msg_subf
    li a7, 4
    ecall
    li s0, 0

    # MUL: 1.5 * 2.0 = 3.0
sub_done:
    li a0, 0x3FC0  # 1.5
    li a1, 0x4000  # 2.0
    jal ra, mul_bf16  # -> 3.0 (0x4040)
    slli a0, a0, 16
    srli a0, a0, 16
    li t0, 0x4040
    bne a0, t0, print_mul_fail
print_mul_ok:
    la a0, msg_mul
    li a7, 4
    ecall
    j mul_done
print_mul_fail:
    la a0, msg_mulf
    li a7, 4
    ecall
    li s0, 0

    # DIV: 3.0 / 2.0 = 1.5
mul_done:
    li a0, 0x4040  # 3.0
    li a1, 0x4000  # 2.0
    jal ra, div_bf16  # -> 1.5 (0x3FC0)
    slli a0, a0, 16
    srli a0, a0, 16
    li t0, 0x3FC0
    bne a0, t0, print_div_fail
print_div_ok:
    la a0, msg_div
    li a7, 4
    ecall
    j div_done
print_div_fail:
    la a0, msg_divf
    li a7, 4
    ecall
    li s0, 0

    # SQRT: sqrt(4.0) = 2.0
div_done:
    li a0, 0x4080  # 4.0
    jal ra, sqrt_bf16  # -> 2.0 (0x4000)
    slli a0, a0, 16
    srli a0, a0, 16
    li t0, 0x4000
    bne a0, t0, print_sqrt_fail
print_sqrt_ok:
    la a0, msg_sqrt
    li a7, 4
    ecall
    j sqrt_done
print_sqrt_fail:
    la a0, msg_sqrtf
    li a7, 4
    ecall
    li s0, 0

    # SUITE
sqrt_done:
    bnez s0, suite_ok
suite_fail:
    la a0, msg_suite_fail
    li a7, 4
    ecall
    j exit_suite
suite_ok:
    la a0, msg_suite_ok
    li a7, 4
    ecall
exit_suite:
    li a7, 10
    ecall

#############################################
# f32_to_bf16
#############################################
f32_to_bf16:
    srli t0, a0, 23  # t0 = exponent (8 bits)
    andi t0, t0, 0xFF
    li t1, 0xFF
    beq t0, t1, f32_is_special  # exp==0xFF -> Inf/NaN
    srli t2, a0, 16  # t2 = (a0 >> 16)
    andi t2, t2, 1  # t2 = LSB for ties
    li t3, 0x7FFF  # 0.5 ULP bias
    add t2, t2, t3
    add a0, a0, t2
    srli a0, a0, 16  # take the upper 16 bits as bf16
    ret
f32_is_special:
    srli a0, a0, 16  # take the upper 16 bits directly (preserves the NaN/Inf payload)
    ret

#############################################
# bf16_to_f32
#############################################
bf16_to_f32:
    slli a0, a0, 16  # bf16 << 16
    ret

#############################################
# add_bf16 (A + B)
#############################################
add_bf16:
    # Shift the low 16 bits into the upper half, then logical-shift back,
    # so the upper 16 bits are cleared to 0
    slli s8, a0, 16  # A
    srli s8, s8, 16
    slli s9, a1, 16  # B
    srli s9, s9, 16
    srli t0, s8, 15  # sign_a
    andi t0, t0, 1
    srli t1, s9, 15  # sign_b
    andi t1, t1, 1
    srli t2, s8, 7  # exp_a
    andi t2, t2, 0xFF
    srli t3, s9, 7  # exp_b
    andi t3, t3, 0xFF
    andi t4, s8, 0x7F  # mant_a
    andi t5, s9, 0x7F  # mant_b
    # Special values
    li t6, 0xFF
    beq t2, t6, add_a_special
    beq t3, t6, add_b_special
    # Zero fast path
    beq t2, x0, add_chk_a_zero
    j add_chk_b_zero
add_chk_a_zero:
    beq t4, x0, add_ret_b  # A == 0 -> return B
add_chk_b_zero:
    beq t3, x0, add_chk_b_is_zero
    j add_norm
add_chk_b_is_zero:
    beq t5, x0, add_ret_a  # B == 0 -> return A
add_norm:
    bnez t2, add_a_set1
    j add_b_set1
add_a_set1:
    ori t4, t4, 0x80  # A nonzero -> set the implicit 1
add_b_set1:
    beqz t3, add_align  # exp_b != 0: set the implicit 1; exp_b == 0: leave it clear
    ori t5, t5, 0x80  # B nonzero -> set the implicit 1
    j add_align
add_align:
    sub s2, t2, t3  # s2 = exp_diff
    blt x0, s2, add_a_bigger  # if s2 > 0
    blt s2, x0, add_b_bigger  # if s2 < 0
    add s3, x0, t2  # exponents equal
    j add_op
add_a_bigger:
    add s3, x0, t2
    li t6, 8
    blt t6, s2, add_ret_a  # if s2 > 8
    srl t5, t5, s2  # shift B's mantissa right to align
    j add_op
add_b_bigger:
    add s3, x0, t3
    li t6, -8
    blt s2, t6, add_ret_b  # if s2 < -8
    neg t6, s2
    srl t4, t4, t6  # shift A's mantissa right to align
add_op:
    beq t0, t1, add_same_sign  # same sign => add
    # Different signs => subtract
    bge t4, t5, add_a_ge_b
    add s4, x0, t1  # result sign = sign of B
    sub s5, t5, t4  # mant_res = B - A
    j add_norm_diff
add_a_ge_b:
    add s4, x0, t0  # result sign = sign of A
    sub s5, t4, t5  # mant_res = A - B
add_norm_diff:
    beq s5, x0, add_ret_zero  # difference is 0 => 0
add_norm_loop:
    andi t6, s5, 0x80
    bnez t6, add_pack  # already normalized
    addi s3, s3, -1  # exponent--
    bge x0, s3, add_ret_zero  # if s3 <= 0
    slli s5, s5, 1
    j add_norm_loop
add_same_sign:
    add s4, x0, t0  # result sign = the common sign
    add s5, t4, t5  # mant_res = A + B
    andi t6, s5, 0x100
    beq t6, x0, add_pack
    srli s5, s5, 1  # carry: shift mantissa right
    addi s3, s3, 1  # exponent + 1
    li t6, 0xFF
    bge s3, t6, add_to_inf  # overflow => Inf
add_pack:
    slli s4, s4, 15  # sign to bit 15
    andi s3, s3, 0xFF
    slli s3, s3, 7  # exponent to bits 14:7
    andi s5, s5, 0x7F  # 7-bit mantissa
    or a0, s4, s3
    or a0, a0, s5
    ret
add_to_inf:
    slli s4, s4, 15
    li t6, 0x7F80
    or a0, s4, t6
    ret
add_ret_zero:
    li a0, 0x0000
    ret
add_ret_a:
    add a0, x0, s8
    ret
add_ret_b:
    add a0, x0, s9
    ret
add_a_special:
    bnez t4, add_ret_a  # A is NaN -> return A
    beq t3, t6, add_both_inf_nan
    add a0, x0, s8  # A is Inf, B not special -> return A
    ret
add_both_inf_nan:
    bnez t5, add_ret_b  # B is NaN -> return B
    beq t0, t1, add_ret_b  # Infs with the same sign -> return B
    li a0, 0x7FC0  # +Inf + (-Inf) = NaN
    ret
add_b_special:
    bnez t5, add_ret_b  # B is NaN -> return B
    add a0, x0, s9  # B is Inf -> return B
    ret

#############################################
# sub_bf16 (A - B) = A + (-B)
#############################################
sub_bf16:
    li t6, 0x8000  # build the sign mask (0x8000 does not fit xori's 12-bit immediate)
    xor a1, a1, t6  # flip B's sign bit
    # Tail call: A + (-B). A plain `jal ra, add_bf16` followed by `ret` would
    # overwrite ra without saving it, so the final `ret` would jump to itself.
    j add_bf16

#============================================================
# mul_bf16 (A * B): inlined 8x8 multiply + RNE rounding
# Uses s1, s2, s3, s4, s8, s9, s10; no longer calls mul_u16
#============================================================
mul_bf16:
    addi sp, sp, -32
    sw ra, 0(sp)
    sw s1, 4(sp)
    sw s2, 8(sp)
    sw s3, 12(sp)
    sw s4, 16(sp)
    sw s8, 20(sp)
    sw s9, 24(sp)
    sw s10, 28(sp)
    slli s8, a0, 16  # A
    srli s8, s8, 16
    slli s9, a1, 16  # B
    srli s9, s9, 16
    srli t0, s8, 15  # sign_a
    andi t0, t0, 1
    srli t1, s9, 15  # sign_b
    andi t1, t1, 1
    xor s1, t0, t1  # result_sign
    srli t2, s8, 7  # exp_a
    andi t2, t2, 0xFF
    srli t3, s9, 7  # exp_b
    andi t3, t3, 0xFF
    andi t4, s8, 0x7F  # mant_a
    andi t5, s9, 0x7F  # mant_b
    li t6, 0xFF
    beq t2, t6, mul_a_special
    beq t3, t6, mul_b_special
    beq t2, x0, mul_chk_a_zero
    beq t3, x0, mul_chk_b_zero
    j mul_norm_go
mul_chk_a_zero:
    beqz t4, mul_zero
    j mul_norm_go
mul_chk_b_zero:
    beqz t5, mul_zero
mul_norm_go:
    addi s2, x0, 0
    bnez t2, mul_a_ok
    beqz t4, mul_zero
mul_a_den_loop:
    andi t6, t4, 0x80
    bnez t6, mul_a_normed
    slli t4, t4, 1
    addi s2, s2, -1
    j mul_a_den_loop
mul_a_normed:
    addi t2, x0, 1
    j mul_b_chk
mul_a_ok:
    ori t4, t4, 0x80
mul_b_chk:
    bnez t3, mul_b_ok
    beqz t5, mul_zero
mul_b_den_loop:
    andi t6, t5, 0x80
    bnez t6, mul_b_normed
    slli t5, t5, 1
    addi s2, s2, -1
    j mul_b_den_loop
mul_b_normed:
    addi t3, x0, 1
    j mul_do
mul_b_ok:
    ori t5, t5, 0x80
mul_do:
    # ---- Inlined 8x8 multiply: s3 = t4 * t5 ----
    li s3, 0
    add t6, x0, t5  # multiplier
    add s10, x0, t4  # multiplicand
    li t0, 8
mul8_loop:
    andi t1, t6, 1
    beqz t1, mul8_skip_add
    add s3, s3, s10
mul8_skip_add:
    slli s10, s10, 1
    srli t6, t6, 1
    addi t0, t0, -1
    bnez t0, mul8_loop
    # result_exp = exp_a + exp_b - 127 + exp_adjust
    add s4, t2, t3
    addi s4, s4, -127
    add s4, s4, s2
    # Normalize + RNE
    li t6, 0x8000
    and t6, s3, t6
    beqz t6, mul_norm_lt2
    # >= 2.0
mul_norm_ge2:
    addi s4, s4, 1
    srli s10, s3, 8  # m8
    andi t0, s10, 0x7F  # frac7
    srli t1, s3, 7  # guard
    andi t1, t1, 1
    andi t2, s3, 0x7F  # sticky
    j mul_round
    # [1,2)
mul_norm_lt2:
    srli s10, s3, 7  # m8
    andi t0, s10, 0x7F  # frac7
    srli t1, s3, 6  # guard
    andi t1, t1, 1
    andi t2, s3, 0x3F  # sticky
mul_round:
    beqz t1, mul_after_round  # guard=0 → no round-up
    bnez t2, mul_do_round  # sticky!=0 → round up
    andi t3, t0, 1  # tie: round up only if LSB=1 (nearest even)
    beqz t3, mul_after_round
mul_do_round:
    addi t0, t0, 1
    li t3, 0x80
    bne t0, t3, mul_after_round
    li t0, 0x00
    addi s4, s4, 1
mul_after_round:
    li t6, 0xFF
    bge s4, t6, mul_to_inf
    bge x0, s4, mul_underflow
    slli s1, s1, 15
    andi s4, s4, 0xFF
    slli s4, s4, 7
    andi t0, t0, 0x7F
    or a0, s1, s4
    or a0, a0, t0
    j mul_epilogue
mul_to_inf:
    slli s1, s1, 15
    li t6, 0x7F80
    or a0, s1, t6
    j mul_epilogue
mul_underflow:
    slli a0, s1, 15
    j mul_epilogue
mul_zero:
    slli a0, s1, 15
    j mul_epilogue
mul_a_special:
    bnez t4, mul_ret_a
    beq t3, x0, mul_inf_times_zero
    slli s1, s1, 15
    li t6, 0x7F80
    or a0, s1, t6
    j mul_epilogue
mul_b_special:
    bnez t5, mul_ret_b
    beq t2, x0, mul_inf_times_zero
    slli s1, s1, 15
    li t6, 0x7F80
    or a0, s1, t6
    j mul_epilogue
mul_inf_times_zero:
    li a0, 0x7FC0
    j mul_epilogue
mul_ret_a:
    add a0, x0, s8
    j mul_epilogue
mul_ret_b:
    add a0, x0, s9
    j mul_epilogue
mul_epilogue:
    lw s10, 28(sp)
    lw s9, 24(sp)
    lw s8, 20(sp)
    lw s4, 16(sp)
    lw s3, 12(sp)
    lw s2, 8(sp)
    lw s1, 4(sp)
    lw ra, 0(sp)
    addi sp, sp, 32
    ret

########################################
# div_bf16 (A / B)
########################################
div_bf16:
    # Shift the low 16 bits into the upper half, then logical-shift back,
    # so the upper 16 bits are cleared to 0
    slli s8, a0, 16  # A
    srli s8, s8, 16
    slli s9, a1, 16  # B
    srli s9, s9, 16
    srli t0, s8, 15  # sign_a
    andi t0, t0, 1
    srli t1, s9, 15  # sign_b
    andi t1, t1, 1
    xor s1, t0, t1  # result_sign
    srli t2, s8, 7  # exp_a
    andi t2, t2, 0xFF
    srli t3, s9, 7  # exp_b
    andi t3, t3, 0xFF
    andi t4, s8, 0x7F  # mant_a
    andi t5, s9, 0x7F  # mant_b
    li t6, 0xFF
    beq t3, t6, div_b_special
    beq t3, x0, div_b_zero_or_subn
    beq t2, t6, div_a_special
    beq t2, x0, div_a_zero_or_subn
div_common:
    bnez t2, div_a_set1
    beqz t4, div_result_zero
div_a_set1:
    ori t4, t4, 0x80
    bnez t3, div_b_set1
    beqz t5, div_result_nan
div_b_set1:
    ori t5, t5, 0x80
    # Long division
    slli s2, t4, 15
    add s3, x0, t5
    li s4, 0
    li s5, 0
div_loop_i:
    li t6, 16
    bge s5, t6, div_qdone
    slli s4, s4, 1
    li t6, 15
    sub t6, t6, s5
    sll t6, s3, t6
    bgeu s2, t6, div_sub
    j div_next
div_sub:
    sub s2, s2, t6
    ori s4, s4, 1
div_next:
    addi s5, s5, 1
    j div_loop_i
div_qdone:
    sub s5, t2, t3
    addi s5, s5, 127
    beqz t2, div_adj_a
    j div_adj_b
div_adj_a:
    addi s5, s5, -1
div_adj_b:
    beqz t3, div_adj_b2
    j div_norm
div_adj_b2:
    addi s5, s5, 1
div_norm:
    li t6, 0x8000
    and t6, s4, t6
    bnez t6, div_q_has1
div_q_shift:
    li t6, 1
    bge t6, s5, div_q_done  # if s5 <= 1
    li s10, 0x8000
    and s10, s4, s10
    bnez s10, div_q_done
    slli s4, s4, 1
    addi s5, s5, -1
    j div_q_shift
div_q_has1:
    srli s4, s4, 8
div_q_done:
    andi s4, s4, 0x7F
    li t6, 0xFF
    bge s5, t6, div_to_inf
    bge x0, s5, div_result_zero
    slli s1, s1, 15
    andi s5, s5, 0xFF
    slli s5, s5, 7
    or a0, s1, s5
    or a0, a0, s4
    ret
div_b_special:
    bnez t5, div_ret_b
    slli a0, s1, 15
    ret
div_b_zero_or_subn:
    beqz t5, div_by_zero
    j div_common
div_by_zero:
    beq t2, x0, div_result_nan
    slli s1, s1, 15
    li t6, 0x7F80
    or a0, s1, t6
    ret
div_a_special:
    bnez t4, div_ret_a
    li t6, 0xFF
    beq t3, t6, div_result_nan
    slli s1, s1, 15
    li t6, 0x7F80
    or a0, s1, t6
    ret
div_a_zero_or_subn:
    beqz t4, div_result_zero
    j div_common
div_result_zero:
    slli a0, s1, 15
    ret
div_result_nan:
    li a0, 0x7FC0
    ret
div_to_inf:
    slli s1, s1, 15
    li t6, 0x7F80
    or a0, s1, t6
    ret
div_ret_a:
    add a0, x0, s8
    ret
div_ret_b:
    add a0, x0, s9
    ret

#============================================================
# sqrt_bf16 (sqrt(A))
# Uses s2, s3, s4, s5, s6, s7, s8; binary search with inlined mid*mid (9-step multiply)
#============================================================
sqrt_bf16:
    addi sp, sp, -32
    sw ra, 0(sp)
    sw s2, 4(sp)
    sw s3, 8(sp)
    sw s4, 12(sp)
    sw s5, 16(sp)
    sw s6, 20(sp)
    sw s7, 24(sp)
    sw s8, 28(sp)
    slli s8, a0, 16  # low 16 bits of A → s8
    srli s8, s8, 16
    srli t0, s8, 15  # sign
    andi t0, t0, 1
    srli t1, s8, 7  # exp
    andi t1, t1, 0xFF
    andi t2, s8, 0x7F  # mant
    li t3, 0xFF
    bne t1, t3, sqrt_chk_zero
    bnez t2, sqrt_ret_a  # NaN → A
    bnez t0, sqrt_nan  # -Inf → NaN
    add a0, x0, s8  # +Inf → A
    j sqrt_epilogue
sqrt_chk_zero:
    beq t1, x0, sqrt_zero  # 0 / subnormal → 0
    bnez t0, sqrt_nan  # negative → NaN
    addi s2, t1, -127  # e = exp - bias
    ori s3, t2, 0x80  # m = 1.x
    andi t3, s2, 1
    beqz t3, sqrt_even
    slli s3, s3, 1  # odd exponent: m <<= 1, e--
    addi s2, s2, -1
sqrt_even:
    srai s4, s2, 1
    addi s4, s4, 127  # new_exp
    li s5, 90  # low
    li s6, 256  # high
    li s7, 128  # result
sqrt_bs_loop:
    blt s6, s5, sqrt_bs_done  # low > high ?
    add t3, s5, s6
    srli t3, t3, 1  # mid in t3
    # ---- Inlined mid*mid (9-step shift-and-add; mid ∈ [0..256]) ----
    add t4, x0, t3  # multiplicand
    add t5, x0, t3  # multiplier
    li t6, 0  # acc = 0
    li t0, 9
sqrt_mul9_loop:
    andi t1, t5, 1
    beqz t1, sqrt_mul9_skip
    add t6, t6, t4
sqrt_mul9_skip:
    slli t4, t4, 1
    srli t5, t5, 1
    addi t0, t0, -1
    bnez t0, sqrt_mul9_loop
    srli a0, t6, 7  # sq = (mid^2)/128
    blt s3, a0, sqrt_sq_too_big  # if m < sq → high = mid - 1
    add s7, x0, t3  # result = mid
    addi s5, t3, 1  # low = mid + 1
    j sqrt_bs_loop
sqrt_sq_too_big:
    addi s6, t3, -1
    j sqrt_bs_loop
sqrt_bs_done:
    li t3, 256
    blt s7, t3, sqrt_norm_low_ok
    srli s7, s7, 1
    addi s4, s4, 1
sqrt_norm_low_ok:
    li t3, 128
    bge s7, t3, sqrt_pack
sqrt_shift_up:
    li t3, 1
    bge t3, s4, sqrt_pack
    slli s7, s7, 1
    addi s4, s4, -1
    blt s7, t3, sqrt_shift_up
sqrt_pack:
    andi s7, s7, 0x7F
    li t3, 0xFF
    bge s4, t3, sqrt_to_inf
    bge x0, s4, sqrt_zero
    slli s4, s4, 7
    or a0, s4, s7
    j sqrt_epilogue
sqrt_zero:
    li a0, 0x0000
    j sqrt_epilogue
sqrt_nan:
    li a0, 0x7FC0
    j sqrt_epilogue
sqrt_to_inf:
    li a0, 0x7F80
    j sqrt_epilogue
sqrt_ret_a:
    add a0, x0, s8
    j sqrt_epilogue
sqrt_epilogue:
    lw s8, 28(sp)
    lw s7, 24(sp)
    lw s6, 20(sp)
    lw s5, 16(sp)
    lw s4, 12(sp)
    lw s3, 8(sp)
    lw s2, 4(sp)
    lw ra, 0(sp)
    addi sp, sp, 32
    ret

#============================================================
# mul_u16: unsigned 16-bit multiply (RV32I: shift-and-add)
#============================================================
mul_u16:
    li t0, 0
mul16_loop:
    beq a1, x0, mul16_done
    andi t1, a1, 1
    beq t1, x0, mul16_skip_add
    add t0, t0, a0
mul16_skip_add:
    slli a0, a0, 1
    srli a1, a1, 1
    j mul16_loop
mul16_done:
    add a0, x0, t0
    ret
```
</details>
:::

---

## 4. Using s* Registers

* **Calling convention:** `s0–s11` are **callee-saved**, so a subroutine that uses them must save and restore them on the stack (`sw`/`lw`). A **leaf function** (one that makes no further `jal` calls) can use `a*/t*` instead, which **eliminates that stack traffic**.
* **V1 strategy in this project:**
  * `main`: keeps `s0` as a persistent suite flag (it stays valid across multiple `jal` calls).
  * **`add_bf16`**: previously relied heavily on `s*`; now **converted entirely to `a*/t*`**, making it a true leaf with no `sw/lw`.
  * `f32_to_bf16`, `bf16_to_f32`: already leaf. `sub_bf16` only flips the sign of B and forwards to `add_bf16`. All three continue using `a*/t*`.
  * `mul_bf16`, `div_bf16`, `sqrt_bf16`: **retain the V0 stack-based design** because of algorithm complexity, high register pressure, and consistency with the reference C behavior. (Future versions could leafify these for performance, but the current focus is optimizing the frequently called `add_bf16`.)
* **Trade-off:** optimizing the most frequently executed and branch-dense function (`add_bf16`) first provides the greatest benefit; the others remain unchanged for stability and easier validation.

---

## Performance Comparison (Ripes, with print I/O)

| Metric             | **V0** | **V1**    | **Δ (V1 − V0)**    |
| :----------------- | -----: | --------: | -----------------: |
| **Cycles**         | 1572   | **1531**  | **−41 (−2.6%)**    |
| **Instr. retired** | 1105   | **1093**  | **−12 (−1.1%)**    |
| **CPI**            | 1.42   | **1.40**  | **−0.02 (−1.4%)**  |
| **IPC**            | 0.703  | **0.714** | **+0.011 (+1.6%)** |

:::success
V0
![image](https://hackmd.io/_uploads/HkLbJscpxe.png)
V1
![image](https://hackmd.io/_uploads/ByZk1ocTxg.png)
:::

**Interpretation**

* Removing 12 redundant instructions in `main` directly reduces the retired-instruction count.
* Leafification and segmented normalization in `add_bf16` eliminate branches and memory traffic, yielding fewer cycles.
* CPI and IPC improve slightly (CPI 1.42 → 1.40, IPC 0.703 → 0.714).

> Note: results include `ecall` I/O overhead.
> Under microbenchmark mode (no printing, using CSR `cycle/instret`), computational gains are even more pronounced.

---

## 5. Code Excerpts

### `main` (removed post-call clearing)

```asm
- jal ra, add_bf16
- slli a0, a0, 16
- srli a0, a0, 16
+ jal ra, add_bf16
```

### `add_bf16` (normalization: while → one-step segmented shift)

```asm
- add_norm_loop:
-   andi t6, s5, 0x80
-   bnez t6, add_pack
-   addi s3, s3, -1
-   bge x0, s3, add_ret_zero
-   slli s5, s5, 1
-   j add_norm_loop
+ # Check leading bits and decide direct left shift (1–7 bits)
+ andi t6, t5, 0x40 ; bnez t6, add_shift_1
+ andi t6, t5, 0x20 ; bnez t6, add_shift_2
+ ...
+ slli t5, t5, 7
+ addi a6, a6, -7
```

---

## 6. Correctness (Identical to V0)

* **NaN propagation, ±Inf, ±0:** follow the reference C rules.
* **Subnormals** are flushed to zero.
* **`mul_bf16`:** RNE rounding via guard/sticky bits.
* **`add_bf16`:** behaves as defined in Quiz 1 Problem C after normalization.
* **`sqrt_bf16`:** binary search with even/odd exponent adjustment; within the allowed error range.

:::success
The test result
![image](https://hackmd.io/_uploads/BkrPJic6le.png)
:::
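
The `add_bf16` normalization change in Section 5 (bit-by-bit loop replaced by a one-step segmented shift) can be cross-checked with a small model. The sketch below is written in C rather than RV32I so it can be run on a host; it is not the author's code. The names `norm_t`, `normalize_loop`, `normalize_segmented`, and `count_mismatches` are hypothetical, and the underflow check in the segmented variant is an assumption, since the excerpt elides that part with `...`.

```c
#include <stdint.h>

/* Normalized result: significand with the hidden 1 at bit 7 plus the biased
 * exponent. Both fields are 0 when the result is flushed to zero. */
typedef struct {
    uint32_t mant;
    int32_t  exp;
} norm_t;

/* Loop form, mirroring add_norm_loop: shift left one bit per iteration and
 * flush to zero as soon as the exponent reaches 0. */
static norm_t normalize_loop(uint32_t mant, int32_t exp)
{
    norm_t r = {0, 0};
    if (mant == 0) return r;
    while (!(mant & 0x80)) {
        exp--;
        if (exp <= 0) return r;
        mant <<= 1;
    }
    r.mant = mant;
    r.exp = exp;
    return r;
}

/* Segmented form: find the leading 1 with a chain of bit tests, then apply
 * a single shift of 0..7 bits. Assumption: underflow is tested once, after
 * the shift amount is known. */
static norm_t normalize_segmented(uint32_t mant, int32_t exp)
{
    norm_t r = {0, 0};
    int shift;
    if (mant == 0) return r;
    if      (mant & 0x80) shift = 0;
    else if (mant & 0x40) shift = 1;
    else if (mant & 0x20) shift = 2;
    else if (mant & 0x10) shift = 3;
    else if (mant & 0x08) shift = 4;
    else if (mant & 0x04) shift = 5;
    else if (mant & 0x02) shift = 6;
    else                  shift = 7;
    if (shift > 0 && exp - shift <= 0) return r;  /* flush to zero */
    r.mant = mant << shift;
    r.exp = exp - shift;
    return r;
}

/* Exhaustive sweep over every 8-bit significand and biased exponent 0..254;
 * returns the number of inputs where the two forms disagree. */
int count_mismatches(void)
{
    int bad = 0;
    for (uint32_t m = 0; m <= 0xFF; m++) {
        for (int32_t e = 0; e <= 0xFE; e++) {
            norm_t a = normalize_loop(m, e);
            norm_t b = normalize_segmented(m, e);
            if (a.mant != b.mant || a.exp != b.exp) bad++;
        }
    }
    return bad;
}
```

Calling `count_mismatches()` from a host-side `main` and checking that it returns 0 gives some confidence that the segmented shift keeps the loop's flush-to-zero behavior before the change is carried over to the assembly.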