# Assignment 1: RISC-V Assembly and Instruction Pipeline
> contributed by < [Urdan0117](https://github.com/Urdan0117) >
> N26144387 [name=Chiu Kun Chan]
> [color=#2fbeed] **🔹 Use of AI tools**
> I used ChatGPT to assist with Quiz 1 by providing code explanations, grammar polishing, preliminary research, code summaries, and explanations of standard RISC-V instruction usage.
---
# ProbB
## Abstract
This assignment implements a **UF8 logarithmic quantization** encoder and decoder in **RV32I** assembly.
Besides reproducing the baseline functionality, the key improvements include:
* Using `clz` to locate the MSB, plus an **O(1)** closed form for the range offset: `offset(e)=((1<<e)-1)<<4`
* **Leafified** `clz` and `uf8_decode` (all `t*` registers, zero `sw/lw`); `uf8_encode` saves only `ra/s0`
* Built-in `test` verifies **round-trip** and **monotonicity**
* Measured code size / cycles in **Ripes**, and provided 5-stage pipeline and control-signal analysis
---
## 1. Use Case
UF8 is suitable for: sensor distance / temperature data, graphics LOD distance / fog density, and exponential-backoff timers.
Not suitable for: financial precision computations or cryptographic applications requiring uniform value distribution.
---
## 2. UF8 Format and Algorithm
* 8-bit UF8: `fl = (e<<4) | m`, where `e=fl[7:4]`, `m=fl[3:0]`
* **Decoding**
$$ \text{offset}(e) = ((1 \ll e) - 1) \ll 4 $$
$$ \text{value} = (m \ll e) + \text{offset}(e) $$
* **Encoding (concept)**
1. `msb = 31 - clz(value)`; initial estimate `e0 = clamp(msb-4, 0..15)`
2. Compare with `offset(e)` / `offset(e+1)` (if `e<15`) and adjust `e ± 1` if needed
3. `m = (value - offset(e)) >> e`, combine `fl=(e<<4)|m`
* **Edge cases:** `value < 16` is returned unchanged (it already equals `e=0, m=value`); `e=15` has no `offset(e+1)`, so no upper-bound check is made there.
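The decoding and encoding steps above can be sketched as a small Python reference model. This is only an illustration of the algorithm, not the submitted assembly: the function names are chosen here for readability, `int.bit_length()` stands in for the `clz` routine, and inputs are assumed to lie within the representable range (at most `decode(0xFF)` plus the last step size).

```python
def clz32(x: int) -> int:
    """Leading zeros of a 32-bit unsigned value (32 when x == 0)."""
    return 32 - x.bit_length()

def offset(e: int) -> int:
    """First value covered by exponent e: ((1 << e) - 1) << 4."""
    return ((1 << e) - 1) << 4

def uf8_decode(fl: int) -> int:
    e, m = fl >> 4, fl & 0x0F
    return (m << e) + offset(e)

def uf8_encode(value: int) -> int:
    # Assumption: value is within the UF8 range; no saturation is modelled.
    if value < 16:
        return value                      # e = 0, m = value
    msb = 31 - clz32(value)
    e = min(max(msb - 4, 0), 15)          # initial estimate e0
    while e < 15 and value >= offset(e + 1):
        e += 1                            # estimate too low
    while e > 0 and value < offset(e):
        e -= 1                            # estimate too high
    m = (value - offset(e)) >> e
    return (e << 4) | m

if __name__ == "__main__":
    for fl in (0x00, 0x10, 0x7F, 0xF0, 0xFF):
        v = uf8_decode(fl)
        print(f"fl=0x{fl:02X} value={v} re-encoded=0x{uf8_encode(v):02X}")
```

Because `offset(e) < 2^(e+4)`, the estimate `msb - 4` is never too high for in-range inputs, so in practice only the upward adjustment fires; the downward loop is kept to mirror the `e ± 1` wording above.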
---
## 3. Version Evolution and Key Fixes (V0 → V1 → V2)
### 3.1 V0 (baseline, functional)
The first attempt, rewritten directly from the C code.
:exclamation: Every function builds a stack frame and saves `ra/s*`. The code is functional, but even the **leaf functions issue `sw/lw` on every call**.
* **decode:** used
$$ \text{offset}(e)=(0x7FFF \gg (15-e)) \ll 4 $$
and `value = (m<<e) + offset(e)`
* **encode:** builds the offset with an **overflow loop** (`overflow <<= 1; overflow += 16`, repeated `e` times); if the estimate is off, `adjust_overflow` / `find_exact_exp` refine `e` before the mantissa is computed
* **test:** keeps **`fl` in `s3`** (callee-saved, so it survives both `jal` calls); checks **round-trip** + **monotonicity**
* **Status:** fully functional and passes the tests, but with higher **memory traffic** and **stack overhead**
<details>
<summary>V0 full code</summary>
```asm
# String constants
.data
success_msg: .string "All tests passed.\n"
failure_msg: .string "Tests failed.\n"
.text
.globl main
# Main program
main:
addi sp, sp, -16
sw ra, 0(sp)
# Call the test function
jal ra, test
# Check the test result
beq a0, zero, print_failure
print_success:
la a0, success_msg # a0 = address of the C string
li a7, 4 # 4: print_string (Ripes teaching environment)
ecall
li a7, 10 # 10: exit
ecall
print_failure:
la a0, failure_msg
li a7, 4
ecall
li a7, 10
ecall
# Program entry point
_start:
jal ra, main
# CLZ - Count Leading Zeros
# Input: a0 = 32-bit unsigned integer
# Output: a0 = number of leading zeros (0-32)
clz:
# Save registers
addi sp, sp, -16
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
# Check whether the input is zero
beq a0, zero, clz_zero
# Initialize the counter and shift amount
li s0, 0 # count = 0
li s1, 32 # n = 32
li s2, 16 # c = 16
clz_loop:
# y = x >> c
srl t0, a0, s2 # t0 = a0 >> s2
beq t0, zero, clz_no_shift
# If y != 0, then n -= c, x = y
sub s1, s1, s2 # n -= c
mv a0, t0 # x = y
clz_no_shift:
# c >>= 1
srli s2, s2, 1 # c = c >> 1
bne s2, zero, clz_loop
# return n - x
sub a0, s1, a0
j clz_end
clz_zero:
li a0, 32 # return 32 for a zero input
clz_end:
# Restore registers
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
addi sp, sp, 16
ret
# UF8_DECODE - convert a UF8 value to a 32-bit integer
# Input: a0 = UF8 value (8 bits)
# Output: a0 = 32-bit integer value
uf8_decode:
addi sp, sp, -16
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
# Extract the mantissa (lower 4 bits)
andi s0, a0, 0x0f # mantissa = fl & 0x0f
# Extract the exponent (upper 4 bits)
srli s1, a0, 4 # exponent = fl >> 4
# Compute the offset: offset = (0x7FFF >> (15 - exponent)) << 4
li t0, 15
sub t0, t0, s1 # t0 = 15 - exponent
li t1, 0x7FFF
srl t1, t1, t0 # t1 = 0x7FFF >> (15 - exponent)
slli t1, t1, 4 # offset = t1 << 4
# Compute the result: (mantissa << exponent) + offset
sll t2, s0, s1 # t2 = mantissa << exponent
add a0, t2, t1 # result = (mantissa << exponent) + offset
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
addi sp, sp, 16
ret
# UF8_ENCODE - convert a 32-bit integer to UF8
# Input: a0 = 32-bit integer value
# Output: a0 = UF8 value (8 bits)
uf8_encode:
addi sp, sp, -32
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
mv s0, a0 # s0 = value
# If value < 16, return it directly
li t0, 16
blt s0, t0, encode_direct
# Use CLZ to compute the exponent
jal ra, clz # a0 = clz(value)
li t0, 31
sub s1, t0, a0 # msb = 31 - lz
# Initialize variables
li s2, 0 # exponent = 0
li s3, 0 # overflow = 0
# Check msb >= 5
li t0, 5
blt s1, t0, find_exact_exp
# Estimate the exponent: exponent = msb - 4
addi s2, s1, -4 # exponent = msb - 4
li t0, 15
bgt s2, t0, cap_exponent
j calc_overflow
cap_exponent:
li s2, 15 # exponent = 15
calc_overflow:
# Compute the initial overflow
li s4, 0 # e = 0
overflow_loop:
bge s4, s2, adjust_overflow
slli s3, s3, 1 # overflow <<= 1
addi s3, s3, 16 # overflow += 16
addi s4, s4, 1 # e++
j overflow_loop
adjust_overflow:
# Adjust overflow if the estimate was too high
beq s2, zero, find_exact_exp
bge s0, s3, find_exact_exp
addi s3, s3, -16 # overflow -= 16
srli s3, s3, 1 # overflow >>= 1
addi s2, s2, -1 # exponent--
j adjust_overflow
find_exact_exp:
# Find the exact exponent
li t0, 15
bge s2, t0, calc_mantissa
slli t1, s3, 1 # next_overflow = overflow << 1
addi t1, t1, 16 # next_overflow += 16
bge s0, t1, update_exp
j calc_mantissa
update_exp:
mv s3, t1 # overflow = next_overflow
addi s2, s2, 1 # exponent++
j find_exact_exp
calc_mantissa:
# Compute the mantissa: mantissa = (value - overflow) >> exponent
sub t0, s0, s3 # t0 = value - overflow
srl s5, t0, s2 # mantissa = t0 >> exponent
# Assemble the result: (exponent << 4) | mantissa
slli t0, s2, 4 # t0 = exponent << 4
or a0, t0, s5 # result = (exponent << 4) | mantissa
j encode_end
encode_direct:
# Return value directly (< 16)
mv a0, s0
encode_end:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 32
ret
# TEST - round-trip encode/decode test
# Output: a0 = 1 (pass) or 0 (fail)
test:
addi sp, sp, -32
sw ra, 0(sp)
sw s0, 4(sp) # i (loop counter)
sw s1, 8(sp) # previous_value
sw s2, 12(sp) # current value
sw s3, 16(sp) # fl (original)
sw s4, 20(sp) # fl2 (re-encoded)
sw s5, 24(sp) # passed flag
li s0, 0 # i = 0
li s1, -1 # previous_value = -1
li s5, 1 # passed = true
test_loop:
li t0, 256
bge s0, t0, test_end
# fl = i
mv s3, s0
# value = uf8_decode(fl)
mv a0, s3
jal ra, uf8_decode
mv s2, a0 # s2 = decoded value
# fl2 = uf8_encode(value)
mv a0, s2
jal ra, uf8_encode
mv s4, a0 # s4 = re-encoded value
# Fail if fl != fl2
bne s3, s4, test_fail
# Fail if value <= previous_value
ble s2, s1, test_fail
# Update previous_value
mv s1, s2
# i++
addi s0, s0, 1
j test_loop
test_fail:
li s5, 0 # passed = false
test_end:
mv a0, s5 # return passed
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 32
ret
```
</details>
---
### 3.2 V1 (Lightweight + O(1) offset)
* **encode:** replaced loop with **O(1) formula**
$$ \text{offset}(e)=((1\ll e)-1)\ll 4 $$
> Estimate `e` from `msb`, then use `offset(e)` and `offset(e+1)` for single-step interval correction (`update_exp_up/down`) before computing mantissa.
* **decode:** kept `0x7FFF` formula (mathematically equivalent to V1’s O(1) form, including `e=0→offset=0`)
* **test results:** passed correctly; encode’s **offset construction cost reduced from O(e) to O(1)**; further improvement possible by full `t*` leaf refactor to remove sw/lw.
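The claim that V0's loop, the `0x7FFF` form, and the O(1) closed form all produce the same offset can be checked exhaustively over `e = 0..15`. The sketch below is a Python model written for this note (the three function names are illustrative, not labels from the assembly).

```python
def offset_loop(e: int) -> int:
    """V0 encode: e iterations of overflow = (overflow << 1) + 16."""
    overflow = 0
    for _ in range(e):
        overflow = (overflow << 1) + 16
    return overflow

def offset_7fff(e: int) -> int:
    """Decode form: (0x7FFF >> (15 - e)) << 4."""
    return (0x7FFF >> (15 - e)) << 4

def offset_closed(e: int) -> int:
    """V1 O(1) form: ((1 << e) - 1) << 4."""
    return ((1 << e) - 1) << 4

if __name__ == "__main__":
    for e in range(16):
        a, b, c = offset_loop(e), offset_7fff(e), offset_closed(e)
        assert a == b == c
        print(f"e={e:2d} offset={c}")
```

All three reduce to `16 * (2^e - 1)`; the loop needs `e` iterations, while the other two use a fixed handful of instructions regardless of `e`.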
<details>
<summary>V1 encode code</summary>
```asm
# ====== O(1) version: compute offset(e) = ((1<<e)-1) << 4 ======
calc_overflow:
li t1, 1
sll t1, t1, s2 # t1 = 1 << e
addi t1, t1, -1 # t1 = (1<<e) - 1
slli s3, t1, 4 # s3 = offset(e) = ((1<<e)-1)<<4
# ====== Adjust e so that offset(e) ≤ value < offset(e+1) (when e<15) ======
find_exact_exp:
# If e < 15, first compute next_offset = ((1<<(e+1))-1)<<4
li t0, 15
bge s2, t0, calc_mantissa # e==15: no next_offset, go straight to the mantissa
addi t2, s2, 1 # t2 = e+1
li t3, 1
sll t3, t3, t2 # t3 = 1<<(e+1)
addi t3, t3, -1 # t3 = (1<<(e+1))-1
slli t3, t3, 4 # t3 = next_offset(e+1)
# If value >= next_offset(e+1) → e++ (move up)
bge s0, t3, update_exp_up
# If value < offset(e) → e-- (move down)
blt s0, s3, update_exp_down
# Otherwise offset(e) ≤ value < next_offset(e+1); e is final
j calc_mantissa
update_exp_up:
addi s2, s2, 1 # e = e+1
mv s3, t3 # s3 = offset(e) (exactly the next_offset computed above)
j find_exact_exp
update_exp_down:
addi s2, s2, -1 # e = e-1
# Recompute offset(e) in O(1)
li t1, 1
sll t1, t1, s2
addi t1, t1, -1
slli s3, t1, 4
j find_exact_exp
```
</details>
---
### 3.3 V2 (Leafified, s→t conversion; caller-saved fix in test)
* **clz / decode:** **leafified**, all `t*`, **zero sw/lw**
* **encode:** saves only `ra/s0`; all else `t*`; `calc_overflow` uses **O(1) offset** then `find_exact_exp` for single-step range adjust before mantissa
* **test:**
  * Early V2 kept `fl` in `t1`, a caller-saved register that both callees (`uf8_decode` / `uf8_encode`) overwrite across the `jal`, so the round-trip comparison read a clobbered value and failed
  * Fixed by keeping `fl` in callee-saved `s3` → problem resolved
* **Result:** after the `s3` fix and the `msb<5` branch fix, V2 passes all tests with **minimal sw/lw** and fewer total overhead cycles (see Section 6)
<details>
<summary>V2 full code</summary>
```asm
# String constants
.data
success_msg: .string "All tests passed.\n"
failure_msg: .string "Tests failed.\n"
.text
.globl main
# Main program
main:
# No need to save ra or adjust sp
# Call the test function
jal ra, test
# Check the test result
beq a0, zero, print_failure
print_success:
la a0, success_msg # a0 = address of the C string
li a7, 4 # 4: print_string (Ripes teaching environment)
ecall
li a7, 10 # 10: exit
ecall
print_failure:
la a0, failure_msg
li a7, 4
ecall
li a7, 10
ecall
# Program entry point
# _start:
# jal ra, main
# CLZ - Count Leading Zeros
# Input: a0 = 32-bit unsigned integer
# Output: a0 = number of leading zeros (0-32)
clz:
# Check whether the input is zero
beq a0, zero, clz_zero
# Initialize the counter and shift amount
li t0, 0 # count = 0
li t1, 32 # n = 32
li t2, 16 # c = 16
clz_loop:
# y = x >> c
srl t3, a0, t2 # t3 = a0 >> t2
beq t3, zero, clz_no_shift
# If y != 0, then n -= c, x = y
sub t1, t1, t2 # n -= c
mv a0, t3 # x = y
clz_no_shift:
# c >>= 1
srli t2, t2, 1 # c = c >> 1
bne t2, zero, clz_loop
# return n - x
sub a0, t1, a0
ret
clz_zero:
li a0, 32 # return 32 for a zero input
ret
# UF8_DECODE - convert a UF8 value to a 32-bit integer
# Input: a0 = UF8 value (8 bits)
# Output: a0 = 32-bit integer value
uf8_decode:
# Extract the mantissa (lower 4 bits)
andi t4, a0, 0x0f # mantissa = fl & 0x0f
# Extract the exponent (upper 4 bits)
srli t5, a0, 4 # exponent = fl >> 4
# Compute the offset: offset = (0x7FFF >> (15 - exponent)) << 4
li t0, 15
sub t0, t0, t5 # t0 = 15 - exponent
li t1, 0x7FFF
srl t1, t1, t0 # t1 = 0x7FFF >> (15 - exponent)
slli t1, t1, 4 # offset = t1 << 4
# Compute the result: (mantissa << exponent) + offset
sll t2, t4, t5 # t2 = mantissa << exponent
add a0, t2, t1 # result = (mantissa << exponent) + offset
ret
# UF8_ENCODE - convert a 32-bit integer to UF8
# Input: a0 = 32-bit integer value
# Output: a0 = UF8 value (8 bits)
uf8_encode:
addi sp, sp, -8
sw ra, 0(sp)
sw s0, 4(sp) # s0 holds value (the only register that must survive the call)
mv s0, a0 # s0 = value
# If value < 16, return it directly
li t0, 16
blt s0, t0, encode_direct
# Use CLZ to compute msb and the initial exponent
mv a0, s0
jal ra, clz # a0 = clz(value)
li t1, 31
sub t2, t1, a0 # msb = 31 - lz
# Initialize variables
li t3, 0 # exponent = 0
li t4, 0 # overflow = 0
# Check msb >= 5
li t0, 5
blt t2, t0, find_exact_exp
# Estimate the exponent: exponent = msb - 4
addi t3, t2, -4 # exponent = msb - 4
li t0, 15
bgt t3, t0, cap_exponent
j calc_overflow
cap_exponent:
li t3, 15 # exponent = 15
# O(1) version: compute offset(e) = ((1<<e)-1) << 4
calc_overflow:
li t5, 1
sll t5, t5, t3 # t5 = 1 << e
addi t5, t5, -1 # t5 = (1<<e) - 1
slli t4, t5, 4 # t4 = offset(e) = ((1<<e)-1)<<4
# Adjust e so that offset(e) ≤ value < offset(e+1) (when e<15)
find_exact_exp:
# If e < 15, first compute next_offset = ((1<<(e+1))-1)<<4
li t0, 15
bge t3, t0, calc_mantissa # e==15: no next_offset, go straight to the mantissa
addi t6, t3, 1 # t6 = e+1
li t1, 1
sll t1, t1, t6 # t1 = 1<<(e+1)
addi t1, t1, -1 # t1 = (1<<(e+1))-1
slli t1, t1, 4 # t1 = next_offset(e+1)
# If value >= next_offset(e+1) → e++ (move up)
bge s0, t1, update_exp_up
# If value < offset(e) → e-- (move down)
blt s0, t4, update_exp_down
# Otherwise offset(e) ≤ value < next_offset(e+1); e is final
j calc_mantissa
update_exp_up:
addi t3, t3, 1 # e = e+1
mv t4, t1 # t4 = offset(e) (exactly the next_offset computed above)
j find_exact_exp
update_exp_down:
addi t3, t3, -1 # e = e-1
# Recompute offset(e) in O(1)
li t5, 1
sll t5, t5, t3
addi t5, t5, -1
slli t4, t5, 4
j find_exact_exp
calc_mantissa:
# Compute the mantissa: mantissa = (value - overflow) >> exponent
sub t0, s0, t4 # t0 = value - overflow
srl t2, t0, t3 # mantissa = t0 >> exponent
# Assemble the result: (exponent << 4) | mantissa
slli t1, t3, 4 # t1 = exponent << 4
or a0, t1, t2 # result = (exponent << 4) | mantissa
j encode_end
encode_direct:
# Return value directly (< 16)
mv a0, s0
encode_end:
lw ra, 0(sp)
lw s0, 4(sp)
addi sp, sp, 8
ret
# TEST - round-trip encode/decode test
# Output: a0 = 1 (pass) or 0 (fail)
test:
addi sp, sp, -24
sw ra, 0(sp)
sw s0, 4(sp) # i (loop counter)
sw s1, 8(sp) # previous_value
sw s2, 12(sp) # decoded value (must survive the encode call)
sw s3, 16(sp) # fl (must survive both calls; t1 is clobbered by the callees)
sw s5, 20(sp) # passed
li s0, 0 # i = 0
li s1, -1 # previous_value = -1
li s5, 1 # passed = true
test_loop:
li t0, 256
bge s0, t0, test_end
# fl = i
mv s3, s0
# value = uf8_decode(fl)
mv a0, s3
jal ra, uf8_decode
mv s2, a0 # decoded value must survive the next jal → keep it in s2
# fl2 = uf8_encode(value)
mv a0, s2
jal ra, uf8_encode
mv t2, a0 # fl2 (temporary; not live across a call)
# Fail if fl != fl2
bne s3, t2, test_fail
# Fail if value <= previous_value
ble s2, s1, test_fail
# Update previous_value
mv s1, s2
# i++
addi s0, s0, 1
j test_loop
test_fail:
li s5, 0 # passed = false
test_end:
mv a0, s5 # return passed
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s5, 20(sp)
addi sp, sp, 24
ret
```
</details>
---
## 4. Version Comparison Table (Quick Summary)
| Aspect | V0 | V1 | V2 |
| :------------------ | :-------------------------- | :--------------------------------------------- | :----------------------------------------- |
| `clz` | Uses stack & `s*` | Same as V0 | **Leaf**, all `t*`, zero `sw/lw` |
| `uf8_decode` | Stack, `0x7FFF>>(15-e)<<4` | Same as V0 | **Leaf**, all `t*`, still `0x7FFF` formula |
| `uf8_encode` | Loop builds overflow (O(e)) | **O(1)** `((1<<e)-1)<<4`; `update_exp_up/down` | Same as V1, saves only `ra/s0` |
| `test` (fl storage) | `s3` (safe) | `s3` (safe) | Originally `t1`, later → `s3` |
| Memory traffic | High | Medium | **Low** |
---
## 5. Testing (Automation + Representative Values)
### 5.1 Automated Validation
* **Round-trip:** For `fl∈[0..255]`, check `encode(decode(fl))==fl`
* **Monotonicity:** For `v(fl)=decode(fl)`, verify `v(fl+1)>v(fl)`
* Prints `All tests passed.` on success; otherwise `Tests failed.`
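The two checks can be modelled in a few lines of Python, which is handy for cross-checking the expected pass/fail behaviour outside Ripes. This is a sketch under assumptions: `run_suite` and `uf8_encode_scan` are names invented here, and the encoder is a simple exponent scan in the spirit of V0's `find_exact_exp`, not a transcription of any of the three versions.

```python
def uf8_decode(fl: int) -> int:
    e, m = fl >> 4, fl & 0x0F
    return (m << e) + (((1 << e) - 1) << 4)

def uf8_encode_scan(value: int) -> int:
    """Stand-in encoder: walk exponents upward until value fits."""
    e, off = 0, 0
    while e < 15:
        nxt = (off << 1) + 16          # offset(e + 1)
        if value < nxt:
            break
        off, e = nxt, e + 1
    return (e << 4) | ((value - off) >> e)

def run_suite(encode, decode) -> bool:
    """Round-trip and strict monotonicity over fl = 0..255."""
    previous = -1
    for fl in range(256):
        value = decode(fl)
        if encode(value) != fl:        # round-trip check
            return False
        if value <= previous:          # monotonicity check
            return False
        previous = value
    return True

if __name__ == "__main__":
    ok = run_suite(uf8_encode_scan, uf8_decode)
    print("All tests passed." if ok else "Tests failed.")
```

Passing the codec functions as arguments makes it easy to confirm that the harness actually rejects a wrong implementation, for example an identity "decoder".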
### 5.2 Representative Manual Examples
UF8 rule: `value = (m<<e) + (((1<<e)-1)<<4)` (the inner parentheses matter: in C, `+` binds tighter than `<<`)
| UF8 `fl` | decimal | e | m | decoded `value` | re-encoded | Result |
| :------: | ------: | -: | -: | --------------: | ---------: | :----: |
| `0x00` | 0 | 0 | 0 | 0 | `0x00` | ✅ |
| `0x10` | 16 | 1 | 0 | 16 | `0x10` | ✅ |
| `0x7F` | 127 | 7 | 15 | 3952 | `0x7F` | ✅ |
| `0xF0` | 240 | 15 | 0 | 524 272 | `0xF0` | ✅ |
| `0xFF` | 255 | 15 | 15 | 1 015 792 | `0xFF` | ✅ |
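As a hand check of two table rows against the decoding rule from Section 2 (a worked calculation, not additional measured data):

$$ \text{value}(\mathtt{0x7F}) = (15 \ll 7) + \big(((1 \ll 7) - 1) \ll 4\big) = 1920 + 2032 = 3952 $$
$$ \text{value}(\mathtt{0xFF}) = (15 \ll 15) + \big(((1 \ll 15) - 1) \ll 4\big) = 491520 + 524272 = 1015792 $$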
---
## 6. Performance and Code Size (Ripes Measurement)
### 6.1 Runtime Metrics (`fl=0..255` test)
| Version | Cycles | Instr. Retired | CPI | IPC | L1-Data Accesses |
| :------ | -----: | -------------: | ---: | ----: | ---------------: |
| **V0** | 53 666 | 39 387 | 1.36 | 0.734 | 7 055 |
| **V1** | 40 214 | 30 651 | 1.31 | 0.762 | 7 055 |
| **V2** | 32 466 | 23 879 | 1.36 | 0.736 | 1 036 |
> Overhead ≈ `Cycles − Instr. Retired`:
> V0 = 14 279, V1 = 9 563, V2 = 8 587
---
### 6.2 Improvements (Relative)
1. **Cycles**
* V1 vs V0: −25.1% (53 666 → 40 214)
* V2 vs V0: −39.5% (53 666 → 32 466)
* V2 vs V1: −19.3% (40 214 → 32 466)
2. **Instructions**
* V1 vs V0: −22.2% (39 387 → 30 651)
* V2 vs V0: −39.4% (39 387 → 23 879)
* V2 vs V1: −22.1% (30 651 → 23 879)
3. **CPI**
   * V1 vs V0: 1.36 → 1.31 (−3.7%, an improvement)
   * V2 vs V1: 1.31 → 1.36 (slight regression)
> Although V2's CPI rose slightly, its 22% lower instruction count still yields fewer total cycles.
4. **L1-Data Accesses**
* V2 vs V0/V1: 7 055 → 1 036 (−85.3%) → directly shows the benefit of **leafification** and reduced stack `sw/lw`.
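The derived columns and percentages can be reproduced from the raw Ripes counters in Section 6.1. The short script below is only a convenience for re-checking the arithmetic; the dictionary layout and helper names are assumptions made for this sketch.

```python
# Raw counters copied from the Section 6.1 table.
runs = {
    "V0": {"cycles": 53666, "instrs": 39387, "l1d": 7055},
    "V1": {"cycles": 40214, "instrs": 30651, "l1d": 7055},
    "V2": {"cycles": 32466, "instrs": 23879, "l1d": 1036},
}

def cpi(run: dict) -> float:
    return run["cycles"] / run["instrs"]

def overhead(run: dict) -> int:
    """Cycles beyond one per retired instruction."""
    return run["cycles"] - run["instrs"]

def rel_change(old: float, new: float) -> float:
    """Relative change in percent; negative means a reduction."""
    return (new - old) / old * 100.0

if __name__ == "__main__":
    for name, run in runs.items():
        print(f"{name}: CPI={cpi(run):.2f} IPC={1 / cpi(run):.3f} "
              f"overhead={overhead(run)}")
    for key in ("cycles", "instrs", "l1d"):
        change = rel_change(runs["V0"][key], runs["V2"][key])
        print(f"V2 vs V0 {key}: {change:+.1f}%")
```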
---
### 6.3 Code Size (per function)
Computation:
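$$ \mathrm{size}(\mathtt{clz}) = \mathrm{addr}(\mathtt{uf8\_decode}) - \mathrm{addr}(\mathtt{clz}) $$
$$ \mathrm{size}(\mathtt{uf8\_decode}) = \mathrm{addr}(\mathtt{uf8\_encode}) - \mathrm{addr}(\mathtt{uf8\_decode}) $$
$$ \mathrm{size}(\mathtt{uf8\_encode}) = \mathrm{addr}(\mathtt{test}) - \mathrm{addr}(\mathtt{uf8\_encode}) $$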
**Start addresses**
* V0 clz 0x44 decode 0xA4 encode 0xF0 test 0x1DC
* V1 clz 0x40 decode 0xA0 encode 0xEC test 0x1DC
* V2 clz 0x38 decode 0x70 encode 0x9C test 0x16C
| Version | `clz` (B / instr.) | `uf8_decode` (B / instr.) | `uf8_encode` (B / instr.) | **Sum** (B / instr.) |
| :------ | ----------: | -----------: | -----------: | --------: |
| **V0** | 96 / 24 | 76 / 19 | 236 / 59 | 408 / 102 |
| **V1** | 96 / 24 | 76 / 19 | 240 / 60 | 412 / 103 |
| **V2** | 56 / 14 | 44 / 11 | 208 / 52 | 308 / 77 |
**Improvement**
* V2 vs V0 (total): 408 → 308 B (−24.5%)
* V2 vs V1 (total): 412 → 308 B (−25.2%)
* Per function: `clz −41.7%`, `decode −42.1%`, `encode −11.9% (vs V0)`
> Observation: V1’s formulaic offset slightly increased encode size (236→240 B); V2 via **leafification + register reallocation** minimized all three functions, with `clz/decode` shrinking > 40%.
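The per-function sizes follow mechanically from the start addresses listed above. A small sketch of that subtraction, assuming the functions are laid out contiguously in the order `clz`, `uf8_decode`, `uf8_encode`, `test`, and that every RV32I instruction occupies 4 bytes (the helper name `sizes` is made up for this example):

```python
# Start addresses copied from the list in Section 6.3.
starts = {
    "V0": {"clz": 0x44, "uf8_decode": 0xA4, "uf8_encode": 0xF0, "test": 0x1DC},
    "V1": {"clz": 0x40, "uf8_decode": 0xA0, "uf8_encode": 0xEC, "test": 0x1DC},
    "V2": {"clz": 0x38, "uf8_decode": 0x70, "uf8_encode": 0x9C, "test": 0x16C},
}
ORDER = ["clz", "uf8_decode", "uf8_encode", "test"]

def sizes(addr: dict) -> dict:
    """Size in bytes of each function = next start address - own start."""
    return {ORDER[i]: addr[ORDER[i + 1]] - addr[ORDER[i]] for i in range(3)}

if __name__ == "__main__":
    for version, addr in starts.items():
        s = sizes(addr)
        total = sum(s.values())
        cells = ", ".join(f"{fn}={b} B / {b // 4} instr." for fn, b in s.items())
        print(f"{version}: {cells}, sum={total} B / {total // 4} instr.")
```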
---
## 7. References
* AI Tool (ChatGPT)
* Ripes (environment calls) / RISC-V ISA (RV32I)
* UF8 and bit hacks (CLZ)
* [Quiz 1 Solution](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C)
* [Lab 1](https://hackmd.io/@sysprog/H1TpVYMdB)
---
# ProbC
## 1. Abstract
This project implements **BFloat16 arithmetic** on **RV32I** using **pure integer operations**:
`f32_to_bf16`, `bf16_to_f32`, `add_bf16`, `sub_bf16`, `mul_bf16` (with round-to-nearest-even, RNE), `div_bf16` (integer long division), and `sqrt_bf16` (binary search).
IEEE-754 special values (**NaN / Inf / ±0**) are handled explicitly, and **subnormal numbers are flushed to zero**.
All programs run correctly in **Ripes**.
This submission directly presents a one-step upgrade **from V0 → V1** (V1 = final version):
1. **Simplified `main` test harness**: removed redundant “clear upper 16 bits” operations.
2. **`add_bf16` leafified**: switched to **caller-saved (`a*/t*`)** registers, **no stack frame or `s*` saves**.
Normalization after subtraction was changed from a while-loop to a **multi-bit segmented shift (one-step normalize)**.
The other routines (`mul_bf16`, `div_bf16`, `sqrt_bf16`) retain V0 logic for compatibility and correctness reference.
---
## 2. BFloat16 Overview


* **add/sub**: align exponents → add/sub 8-bit mantissas (with hidden 1) → normalize → take 7-bit fraction
* **mul**: multiply two 8-bit mantissas, adjust exponent by bias, normalize, then **RNE rounding via guard/sticky bits**
* **div**: integer restoring division yields 7-bit fraction (truncate)
* **sqrt**: binary search on mantissa; exponent adjusted for odd/even and halved
---
## 3. What Changed (V0 → V1)
### 1. `main`: Remove redundant “clear upper 16 bits”
**V0 behavior**: after every call, `a0` was forced through
```asm
slli a0, a0, 16
srli a0, a0, 16
```
**Why redundant**: each subroutine already returns a properly masked 16-bit result, so the high bits are zero.
**V1 action**: removed these unnecessary `slli/srli`.
**Effect**: 12 fewer dynamic instructions (six `slli`/`srli` pairs, one per masked test), which on a single-issue in-order pipeline saves roughly the same number of cycles.
---
### 2. `add_bf16`: Leafified + Normalize via segmented shift
**V0**:
* Used `s*` registers and built a stack frame (`sw/lw` overhead).
* Normalization after subtraction used a while-loop, shifting left bit-by-bit (up to 7 times).
**V1**:
* Uses only **`a*/t*`** → true **leaf function**, **no stack frame, zero `sw/lw`**.
* Normalization replaced with **segmented shifting** (check bits 6..1 → shift 1–7 bits directly).
Single-step adjustment reduces branches and loop iterations.
**Result**: fewer branches and memory accesses; both **cycles** and **retired instructions** drop noticeably.
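The difference between the two normalization strategies can be illustrated with a Python model. The V1 assembly is not reproduced here, so this is an assumption-laden sketch: `int.bit_length()` stands in for the chain of bit tests, both function names are invented, and `None` represents the flush-to-zero exit.

```python
def normalize_loop(mant: int, exp: int):
    """V0 style: shift left one bit per iteration until bit 7 is set."""
    while not (mant & 0x80):
        exp -= 1
        if exp <= 0:
            return None                  # exponent exhausted: result is zero
        mant <<= 1
    return mant, exp

def normalize_one_step(mant: int, exp: int):
    """V1 idea: find the leading 1 once, then shift by the full distance."""
    shift = 8 - mant.bit_length()        # 0..7 for mant in 1..255
    if shift == 0:
        return mant, exp                 # already normalized
    if exp - shift <= 0:
        return None                      # would underflow: result is zero
    return mant << shift, exp - shift

if __name__ == "__main__":
    # Exhaustive comparison over nonzero 8-bit mantissas and valid exponents.
    mismatches = [
        (m, e)
        for m in range(1, 256)
        for e in range(1, 255)
        if normalize_loop(m, e) != normalize_one_step(m, e)
    ]
    print("mismatches:", len(mismatches))
```

Both variants agree on every input because the loop's exponent check fails at some step exactly when `exp - shift <= 0`; the one-step form simply reaches that conclusion without iterating.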
:::spoiler
<details>
<summary>V0 full code</summary>
```asm
############################################################
# File : bf16_all_fixed.S
# Target : RISC-V RV32I (Ripes-compatible)
# Notes : fixed bgt/ble-style mnemonics, immediate overflow, label alignment, 16-bit masking
# ★ this version adds all required s* save/restore (callee-saved)
############################################################
.data # data section
msg_suite_ok: .asciz "BF16 suite: PASS\n"
msg_suite_fail: .asciz "BF16 suite: FAIL\n"
msg_t1: .asciz "[OK ] f32_to_bf16\n"
msg_t1f: .asciz "[FAIL] f32_to_bf16\n"
msg_t2: .asciz "[OK ] bf16_to_f32\n"
msg_t2f: .asciz "[FAIL] bf16_to_f32\n"
msg_add: .asciz "[OK ] add_bf16\n"
msg_addf: .asciz "[FAIL] add_bf16\n"
msg_sub: .asciz "[OK ] sub_bf16\n"
msg_subf: .asciz "[FAIL] sub_bf16\n"
msg_mul: .asciz "[OK ] mul_bf16\n"
msg_mulf: .asciz "[FAIL] mul_bf16\n"
msg_div: .asciz "[OK ] div_bf16\n"
msg_divf: .asciz "[FAIL] div_bf16\n"
msg_sqrt: .asciz "[OK ] sqrt_bf16\n"
msg_sqrtf: .asciz "[FAIL] sqrt_bf16\n"
.text # code section
.globl main
.globl f32_to_bf16
.globl bf16_to_f32
.globl add_bf16
.globl sub_bf16
.globl mul_bf16
.globl div_bf16
.globl sqrt_bf16
.globl mul_u16
#============================================================
# main: call each function for a basic test and print PASS/FAIL
#============================================================
main:
addi sp, sp, -16 # set up a simple stack frame
sw ra, 12(sp) # save the return address
li s0, 1 # s0 = suite_pass = 1
#------------------ T1: f32_to_bf16 ------------------
li a0, 0x3F800000 # 1.0f
jal ra, f32_to_bf16 # -> 0x3F80 (bf16)
slli a0, a0, 16 # compare only the low 16 bits
srli a0, a0, 16
li t0, 0x3F80
bne a0, t0, print_t1_fail
print_t1_ok:
la a0, msg_t1
li a7, 4
ecall
j t1_done
print_t1_fail:
la a0, msg_t1f
li a7, 4
ecall
li s0, 0
#------------------ T2: bf16_to_f32 ------------------
t1_done:
li a0, 0x4000 # bf16(2.0) -> f32 0x40000000
jal ra, bf16_to_f32
li t0, 0x40000000
bne a0, t0, print_t2_fail
print_t2_ok:
la a0, msg_t2
li a7, 4
ecall
j t2_done
print_t2_fail:
la a0, msg_t2f
li a7, 4
ecall
li s0, 0
#------------------ ADD: 1.0 + 1.0 = 2.0 ------------
t2_done:
li a0, 0x3F80 # 1.0
li a1, 0x3F80 # 1.0
jal ra, add_bf16 # -> 2.0 (0x4000)
slli a0, a0, 16
srli a0, a0, 16
li t0, 0x4000
bne a0, t0, print_add_fail
print_add_ok:
la a0, msg_add
li a7, 4
ecall
j add_done
print_add_fail:
la a0, msg_addf
li a7, 4
ecall
li s0, 0
#------------------ SUB: 2.0 - 1.0 = 1.0 ------------
add_done:
li a0, 0x4000 # 2.0
li a1, 0x3F80 # 1.0
jal ra, sub_bf16 # -> 1.0 (0x3F80)
slli a0, a0, 16
srli a0, a0, 16
li t0, 0x3F80
bne a0, t0, print_sub_fail
print_sub_ok:
la a0, msg_sub
li a7, 4
ecall
j sub_done
print_sub_fail:
la a0, msg_subf
li a7, 4
ecall
li s0, 0
#------------------ MUL: 1.5 * 2.0 = 3.0 ------------
sub_done:
li a0, 0x3FC0 # 1.5
li a1, 0x4000 # 2.0
jal ra, mul_bf16 # -> 3.0 (0x4040)
slli a0, a0, 16
srli a0, a0, 16
li t0, 0x4040
bne a0, t0, print_mul_fail
print_mul_ok:
la a0, msg_mul
li a7, 4
ecall
j mul_done
print_mul_fail:
la a0, msg_mulf
li a7, 4
ecall
li s0, 0
#------------------ DIV: 3.0 / 2.0 = 1.5 ------------
mul_done:
li a0, 0x4040 # 3.0
li a1, 0x4000 # 2.0
jal ra, div_bf16 # -> 1.5 (0x3FC0)
slli a0, a0, 16
srli a0, a0, 16
li t0, 0x3FC0
bne a0, t0, print_div_fail
print_div_ok:
la a0, msg_div
li a7, 4
ecall
j div_done
print_div_fail:
la a0, msg_divf
li a7, 4
ecall
li s0, 0
#------------------ SQRT: sqrt(4.0) = 2.0 -----------
div_done:
li a0, 0x4080 # 4.0
jal ra, sqrt_bf16 # -> 2.0 (0x4000)
slli a0, a0, 16
srli a0, a0, 16
li t0, 0x4000
bne a0, t0, print_sqrt_fail
print_sqrt_ok:
la a0, msg_sqrt
li a7, 4
ecall
j sqrt_done
print_sqrt_fail:
la a0, msg_sqrtf
li a7, 4
ecall
li s0, 0
#------------------ Suite summary ------------------
sqrt_done:
bnez s0, suite_ok
suite_fail:
la a0, msg_suite_fail
li a7, 4
ecall
j exit_suite
suite_ok:
la a0, msg_suite_ok
li a7, 4
ecall
exit_suite:
lw ra, 12(sp) # ★ restore ra/sp symmetrically (even though ecall 10 does not return)
addi sp, sp, 16
li a7, 10
ecall
#============================================================
# f32_to_bf16 (leaf; uses no s*; nothing to save)
#============================================================
f32_to_bf16:
srli t0, a0, 23
andi t0, t0, 0xFF
li t1, 0xFF
beq t0, t1, f32_is_special # exp==0xFF -> Inf/NaN
srli t2, a0, 16
andi t2, t2, 1
li t3, 0x7FFF # 0.5 ULP
add t2, t2, t3
add a0, a0, t2
srli a0, a0, 16
ret
f32_is_special:
srli a0, a0, 16
ret
#============================================================
# bf16_to_f32 (leaf; uses no s*; nothing to save)
#============================================================
bf16_to_f32:
slli a0, a0, 16
ret
#============================================================
# add_bf16 (A + B)
# Uses s2,s3,s4,s5,s8,s9 → must save/restore them
#============================================================
add_bf16:
# ---------- Prologue: save s* ----------
addi sp, sp, -24
sw s8, 0(sp)
sw s9, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
# Take the low 16 bits of A/B (upper bits cleared)
slli s8, a0, 16 # s8 = A
srli s8, s8, 16
slli s9, a1, 16 # s9 = B
srli s9, s9, 16
srli t0, s8, 15 # sign_a
andi t0, t0, 1
srli t1, s9, 15 # sign_b
andi t1, t1, 1
srli t2, s8, 7 # exp_a
andi t2, t2, 0xFF
srli t3, s9, 7 # exp_b
andi t3, t3, 0xFF
andi t4, s8, 0x7F # mant_a
andi t5, s9, 0x7F # mant_b
# Special values
li t6, 0xFF
beq t2, t6, add_a_special
beq t3, t6, add_b_special
# Fast path for zero operands
beq t2, x0, add_chk_a_zero
j add_chk_b_zero
add_chk_a_zero:
beq t4, x0, add_ret_b # A == 0 -> return B
add_chk_b_zero:
beq t3, x0, add_chk_b_is_zero
j add_norm
add_chk_b_is_zero:
beq t5, x0, add_ret_a # B == 0 -> return A
add_norm:
bnez t2, add_a_set1
j add_b_set1
add_a_set1:
ori t4, t4, 0x80 # exp_a != 0 → set the hidden 1
add_b_set1:
beqz t3, add_align # exp_b == 0 → no hidden 1
ori t5, t5, 0x80 # exp_b != 0 → set the hidden 1
j add_align
add_align:
sub s2, t2, t3 # s2 = exp_diff
blt x0, s2, add_a_bigger # s2 > 0
blt s2, x0, add_b_bigger # s2 < 0
add s3, x0, t2 # equal exponents
j add_op
add_a_bigger:
add s3, x0, t2
li t6, 8
blt t6, s2, add_ret_a # difference > 8 → return A
srl t5, t5, s2 # shift B's mantissa right to align
j add_op
add_b_bigger:
add s3, x0, t3
li t6, -8
blt s2, t6, add_ret_b # difference < -8 → return B
sub t6, x0, s2 # t6 = -s2
srl t4, t4, t6 # shift A's mantissa right to align
add_op:
beq t0, t1, add_same_sign # same sign → add
# opposite signs → subtract
bge t4, t5, add_a_ge_b
add s4, x0, t1 # result sign = B
sub s5, t5, t4 # mant_res = B - A
j add_norm_diff
add_a_ge_b:
add s4, x0, t0 # result sign = A
sub s5, t4, t5 # mant_res = A - B
add_norm_diff:
beq s5, x0, add_ret_zero # result is 0
add_norm_loop:
andi t6, s5, 0x80
bnez t6, add_pack # already normalized
addi s3, s3, -1
bge x0, s3, add_ret_zero # exponent <= 0 → 0
slli s5, s5, 1
j add_norm_loop
add_same_sign:
add s4, x0, t0
add s5, t4, t5
andi t6, s5, 0x100
beq t6, x0, add_pack
srli s5, s5, 1
addi s3, s3, 1
li t6, 0xFF
bge s3, t6, add_to_inf
add_pack:
slli s4, s4, 15
andi s3, s3, 0xFF
slli s3, s3, 7
andi s5, s5, 0x7F
or a0, s4, s3
or a0, a0, s5
j add_epilogue
add_to_inf:
slli s4, s4, 15
li t6, 0x7F80
or a0, s4, t6
j add_epilogue
add_ret_zero:
li a0, 0x0000
j add_epilogue
add_ret_a:
add a0, x0, s8
j add_epilogue
add_ret_b:
add a0, x0, s9
j add_epilogue
add_a_special:
bnez t4, add_ret_a # A = NaN → A
beq t3, t6, add_both_inf_nan
add a0, x0, s8 # A = Inf, B not special → A
j add_epilogue
add_both_inf_nan:
bnez t5, add_ret_b # B = NaN → B
beq t0, t1, add_ret_b # same-sign Inf → B
li a0, 0x7FC0 # +Inf + (-Inf) = NaN
j add_epilogue
add_b_special:
bnez t5, add_ret_b # B = NaN → B
add a0, x0, s9 # B = Inf → B
j add_epilogue
add_epilogue:
# ---------- Epilogue: restore s* ----------
lw s5, 20(sp)
lw s4, 16(sp)
lw s3, 12(sp)
lw s2, 8(sp)
lw s9, 4(sp)
lw s8, 0(sp)
addi sp, sp, 24
ret
#============================================================
# sub_bf16 (A - B) = A + (-B)
# Only ra needs saving (it is caller-saved), because this function calls add_bf16
#============================================================
sub_bf16:
addi sp, sp, -4
sw ra, 0(sp)
li t6, 0x8000
xor a1, a1, t6 # flip B's sign
jal ra, add_bf16
lw ra, 0(sp)
addi sp, sp, 4
ret
#============================================================
# mul_bf16 (A * B), with rounding; uses s1,s2,s3,s4,s8,s9,s10; calls mul_u16
#============================================================
mul_bf16:
# ---------- Prologue ----------
addi sp, sp, -32
sw ra, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s8, 20(sp)
sw s9, 24(sp)
sw s10,28(sp)
# Take the low 16 bits of A/B
slli s8, a0, 16
srli s8, s8, 16
slli s9, a1, 16
srli s9, s9, 16
srli t0, s8, 15 # sign_a
andi t0, t0, 1
srli t1, s9, 15 # sign_b
andi t1, t1, 1
xor s1, t0, t1 # result_sign
srli t2, s8, 7 # exp_a
andi t2, t2, 0xFF
srli t3, s9, 7 # exp_b
andi t3, t3, 0xFF
andi t4, s8, 0x7F # mant_a
andi t5, s9, 0x7F # mant_b
li t6, 0xFF
beq t2, t6, mul_a_special
beq t3, t6, mul_b_special
beq t2, x0, mul_chk_a_zero
beq t3, x0, mul_chk_b_zero
j mul_norm_go
mul_chk_a_zero:
beqz t4, mul_zero
j mul_norm_go
mul_chk_b_zero:
beqz t5, mul_zero
mul_norm_go:
addi s2, x0, 0
bnez t2, mul_a_ok
beqz t4, mul_zero
mul_a_den_loop:
andi t6, t4, 0x80
bnez t6, mul_a_normed
slli t4, t4, 1
addi s2, s2, -1
j mul_a_den_loop
mul_a_normed:
addi t2, x0, 1
j mul_b_chk
mul_a_ok:
ori t4, t4, 0x80
mul_b_chk:
bnez t3, mul_b_ok
beqz t5, mul_zero
mul_b_den_loop:
andi t6, t5, 0x80
bnez t6, mul_b_normed
slli t5, t5, 1
addi s2, s2, -1
j mul_b_den_loop
mul_b_normed:
addi t3, x0, 1
j mul_do
mul_b_ok:
ori t5, t5, 0x80
mul_do:
# Direct call (ra was saved in the prologue)
add a0, x0, t4
add a1, x0, t5
jal ra, mul_u16 # a0 = mant_a * mant_b
add s3, x0, a0 # s3 = mant_prod
add s4, t2, t3
addi s4, s4, -127
add s4, s4, s2
li t6, 0x8000
and t6, s3, t6
beqz t6, mul_norm_lt2
# >= 2.0
mul_norm_ge2:
addi s4, s4, 1
srli s10, s3, 8
andi t0, s10, 0x7F
srli t1, s3, 7
andi t1, t1, 1
andi t2, s3, 0x7F
j mul_round
# in [1.0, 2.0)
mul_norm_lt2:
srli s10, s3, 7
andi t0, s10, 0x7F
srli t1, s3, 6
andi t1, t1, 1
andi t2, s3, 0x3F
mul_round:
beqz t1, mul_after_round
bnez t2, mul_do_round
andi t3, t0, 1
beqz t3, mul_after_round
mul_do_round:
addi t0, t0, 1
li t3, 0x80
bne t0, t3, mul_after_round
li t0, 0x00
addi s4, s4, 1
mul_after_round:
li t6, 0xFF
bge s4, t6, mul_to_inf
bge x0, s4, mul_underflow
slli s1, s1, 15
andi s4, s4, 0xFF
slli s4, s4, 7
andi t0, t0, 0x7F
or a0, s1, s4
or a0, a0, t0
j mul_epilogue
mul_to_inf:
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
j mul_epilogue
mul_underflow:
slli a0, s1, 15
j mul_epilogue
mul_zero:
slli a0, s1, 15
j mul_epilogue
mul_a_special:
bnez t4, mul_ret_a
beq t3, x0, mul_inf_times_zero
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
j mul_epilogue
mul_b_special:
bnez t5, mul_ret_b
beq t2, x0, mul_inf_times_zero
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
j mul_epilogue
mul_inf_times_zero:
li a0, 0x7FC0
j mul_epilogue
mul_ret_a:
add a0, x0, s8
j mul_epilogue
mul_ret_b:
add a0, x0, s9
j mul_epilogue
mul_epilogue:
lw s10,28(sp)
lw s9, 24(sp)
lw s8, 20(sp)
lw s4, 16(sp)
lw s3, 12(sp)
lw s2, 8(sp)
lw s1, 4(sp)
lw ra, 0(sp)
addi sp, sp, 32
ret
#============================================================
# div_bf16 (A / B)
# Uses s1,s2,s3,s4,s5,s8,s9,s10; leaf (calls nothing) → no need to save ra
#============================================================
div_bf16:
# ---------- Prologue ----------
addi sp, sp, -32
sw s1, 0(sp)
sw s2, 4(sp)
sw s3, 8(sp)
sw s4, 12(sp)
sw s5, 16(sp)
sw s8, 20(sp)
sw s9, 24(sp)
sw s10,28(sp)
# Take the low 16 bits of A/B
slli s8, a0, 16
srli s8, s8, 16
slli s9, a1, 16
srli s9, s9, 16
srli t0, s8, 15 # sign_a
andi t0, t0, 1
srli t1, s9, 15 # sign_b
andi t1, t1, 1
xor s1, t0, t1 # result_sign
srli t2, s8, 7 # exp_a
andi t2, t2, 0xFF
srli t3, s9, 7 # exp_b
andi t3, t3, 0xFF
andi t4, s8, 0x7F # mant_a
andi t5, s9, 0x7F # mant_b
li t6, 0xFF
beq t3, t6, div_b_special
beq t3, x0, div_b_zero_or_subn
beq t2, t6, div_a_special
beq t2, x0, div_a_zero_or_subn
div_common:
bnez t2, div_a_set1
beqz t4, div_result_zero
div_a_set1:
ori t4, t4, 0x80
bnez t3, div_b_set1
beqz t5, div_result_nan
div_b_set1:
ori t5, t5, 0x80
# Long division
slli s2, t4, 15
add s3, x0, t5
li s4, 0
li s5, 0
div_loop_i:
li t6, 16
bge s5, t6, div_qdone
slli s4, s4, 1
li t6, 15
sub t6, t6, s5
sll t6, s3, t6
bgeu s2, t6, div_sub
j div_next
div_sub:
sub s2, s2, t6
ori s4, s4, 1
div_next:
addi s5, s5, 1
j div_loop_i
div_qdone:
sub s5, t2, t3
addi s5, s5, 127
beqz t2, div_adj_a
j div_adj_b
div_adj_a:
addi s5, s5, -1
div_adj_b:
beqz t3, div_adj_b2
j div_norm
div_adj_b2:
addi s5, s5, 1
div_norm:
li t6, 0x8000
and t6, s4, t6
bnez t6, div_q_has1
div_q_shift:
li t6, 1
bge t6, s5, div_q_done # if s5 <= 1
li s10, 0x8000
and s10, s4, s10
bnez s10, div_q_done
slli s4, s4, 1
addi s5, s5, -1
j div_q_shift
div_q_has1:
srli s4, s4, 8
div_q_done:
andi s4, s4, 0x7F
li t6, 0xFF
bge s5, t6, div_to_inf
bge x0, s5, div_result_zero
slli s1, s1, 15
andi s5, s5, 0xFF
slli s5, s5, 7
or a0, s1, s5
or a0, a0, s4
j div_epilogue
div_b_special:
bnez t5, div_ret_b
slli a0, s1, 15
j div_epilogue
div_b_zero_or_subn:
beqz t5, div_by_zero
j div_common
div_by_zero:
beq t2, x0, div_result_nan
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
j div_epilogue
div_a_special:
bnez t4, div_ret_a
li t6, 0xFF
beq t3, t6, div_result_nan
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
j div_epilogue
div_a_zero_or_subn:
beqz t4, div_result_zero
j div_common
div_result_zero:
slli a0, s1, 15
j div_epilogue
div_result_nan:
li a0, 0x7FC0
j div_epilogue
div_to_inf:
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
j div_epilogue
div_ret_a:
add a0, x0, s8
j div_epilogue
div_ret_b:
add a0, x0, s9
j div_epilogue
div_epilogue:
lw s10,28(sp)
lw s9, 24(sp)
lw s8, 20(sp)
lw s5, 16(sp)
lw s4, 12(sp)
lw s3, 8(sp)
lw s2, 4(sp)
lw s1, 0(sp)
addi sp, sp, 32
ret
#============================================================
# sqrt_bf16 (sqrt(A))
# Uses s2,s3,s4,s5,s6,s7,s8; calls mul_u16, so ra must be saved as well
#============================================================
sqrt_bf16:
# ---------- Prologue ----------
addi sp, sp, -32
sw ra, 0(sp)
sw s2, 4(sp)
sw s3, 8(sp)
sw s4, 12(sp)
sw s5, 16(sp)
sw s6, 20(sp)
sw s7, 24(sp)
sw s8, 28(sp)
# A(16): copy the low 16 bits of a0 into s8 (upper 16 bits cleared)
slli s8, a0, 16
srli s8, s8, 16
srli t0, s8, 15 # sign
andi t0, t0, 1
srli t1, s8, 7 # exp
andi t1, t1, 0xFF
andi t2, s8, 0x7F # mant
li t3, 0xFF
bne t1, t3, sqrt_chk_zero
bnez t2, sqrt_ret_a # NaN → A
bnez t0, sqrt_nan # -Inf → NaN
add a0, x0, s8 # +Inf → A
j sqrt_epilogue
sqrt_chk_zero:
beq t1, x0, sqrt_zero # 0/subnormal → 0
bnez t0, sqrt_nan # negative → NaN
addi s2, t1, -127
ori s3, t2, 0x80
andi t3, s2, 1
beqz t3, sqrt_even
slli s3, s3, 1
addi s2, s2, -1
sqrt_even:
srai s4, s2, 1
addi s4, s4, 127
li s5, 90 # low
li s6, 256 # high
li s7, 128 # result
sqrt_bs_loop:
blt s6, s5, sqrt_bs_done # if low > high
add t3, s5, s6
srli t3, t3, 1 # mid
# Call mul_u16 (ra was saved in the prologue)
add a0, x0, t3
add a1, x0, t3
jal ra, mul_u16
srli a0, a0, 7 # sq = (mid^2)/128
blt s3, a0, sqrt_sq_too_big # if m < sq
add s7, x0, t3
addi s5, t3, 1
j sqrt_bs_loop
sqrt_sq_too_big:
addi s6, t3, -1
j sqrt_bs_loop
sqrt_bs_done:
li t3, 256
blt s7, t3, sqrt_norm_low_ok
srli s7, s7, 1
addi s4, s4, 1
sqrt_norm_low_ok:
li t3, 128
bge s7, t3, sqrt_pack
sqrt_shift_up:
li t3, 1
bge t3, s4, sqrt_pack # if s4 <= 1
slli s7, s7, 1
addi s4, s4, -1
li t3, 128 # t3 was overwritten with 1 above; restore the threshold
blt s7, t3, sqrt_shift_up # loop while result < 128
sqrt_pack:
andi s7, s7, 0x7F
li t3, 0xFF
bge s4, t3, sqrt_to_inf
bge x0, s4, sqrt_zero
slli s4, s4, 7
or a0, s4, s7
j sqrt_epilogue
sqrt_zero:
li a0, 0x0000
j sqrt_epilogue
sqrt_nan:
li a0, 0x7FC0
j sqrt_epilogue
sqrt_to_inf:
li a0, 0x7F80
j sqrt_epilogue
sqrt_ret_a:
add a0, x0, s8
j sqrt_epilogue
sqrt_epilogue:
lw s8, 28(sp)
lw s7, 24(sp)
lw s6, 20(sp)
lw s5, 16(sp)
lw s4, 12(sp)
lw s3, 8(sp)
lw s2, 4(sp)
lw ra, 0(sp)
addi sp, sp, 32
ret
#============================================================
# mul_u16: unsigned 16-bit multiply (RV32I: shift-and-add) (leaf)
#============================================================
mul_u16:
li t0, 0
mul16_loop:
beq a1, x0, mul16_done
andi t1, a1, 1
beq t1, x0, mul16_skip_add
add t0, t0, a0
mul16_skip_add:
slli a0, a0, 1
srli a1, a1, 1
j mul16_loop
mul16_done:
add a0, x0, t0
ret
```
</details>
:::
:::spoiler
<details>
<summary>V1 full code</summary>
```asm
.data
msg_suite_ok: .asciz "BF16 suite: PASS\n"
msg_suite_fail: .asciz "BF16 suite: FAIL\n"
msg_t1: .asciz "[OK ] f32_to_bf16\n"
msg_t1f: .asciz "[FAIL] f32_to_bf16\n"
msg_t2: .asciz "[OK ] bf16_to_f32\n"
msg_t2f: .asciz "[FAIL] bf16_to_f32\n"
msg_add: .asciz "[OK ] add_bf16\n"
msg_addf: .asciz "[FAIL] add_bf16\n"
msg_sub: .asciz "[OK ] sub_bf16\n"
msg_subf: .asciz "[FAIL] sub_bf16\n"
msg_mul: .asciz "[OK ] mul_bf16\n"
msg_mulf: .asciz "[FAIL] mul_bf16\n"
msg_div: .asciz "[OK ] div_bf16\n"
msg_divf: .asciz "[FAIL] div_bf16\n"
msg_sqrt: .asciz "[OK ] sqrt_bf16\n"
msg_sqrtf: .asciz "[FAIL] sqrt_bf16\n"
# Code section
.text
.globl main
.globl f32_to_bf16
.globl bf16_to_f32
.globl add_bf16
.globl sub_bf16
.globl mul_bf16
.globl div_bf16
.globl sqrt_bf16
.globl mul_u16
# Run a basic test of each function and print PASS/FAIL
main:
addi sp, sp, -16 # small stack frame
sw ra, 12(sp) # save the return address
li s0, 1 # s0 = suite_pass = 1
# T1: f32_to_bf16
li a0, 0x3F800000 # 1.0f
jal ra, f32_to_bf16 # -> 0x3F80 (bf16)
slli a0, a0, 16 # clear the upper 16 bits, keeping only the low 16
srli a0, a0, 16
li t0, 0x3F80
bne a0, t0, print_t1_fail
print_t1_ok:
la a0, msg_t1
li a7, 4
ecall
j t1_done
print_t1_fail:
la a0, msg_t1f
li a7, 4
ecall
li s0, 0
# T2: bf16_to_f32
t1_done:
li a0, 0x4000 # bf16(2.0) -> f32 0x40000000
jal ra, bf16_to_f32
li t0, 0x40000000
bne a0, t0, print_t2_fail
print_t2_ok:
la a0, msg_t2
li a7, 4
ecall
j t2_done
print_t2_fail:
la a0, msg_t2f
li a7, 4
ecall
li s0, 0
# ADD: 1.0 + 1.0 = 2.0
t2_done:
li a0, 0x3F80 # 1.0
li a1, 0x3F80 # 1.0
jal ra, add_bf16 # -> 2.0 (0x4000)
slli a0, a0, 16 # clear the upper 16 bits
srli a0, a0, 16
li t0, 0x4000
bne a0, t0, print_add_fail
print_add_ok:
la a0, msg_add
li a7, 4
ecall
j add_done
print_add_fail:
la a0, msg_addf
li a7, 4
ecall
li s0, 0
# SUB: 2.0 - 1.0 = 1.0
add_done:
li a0, 0x4000 # 2.0
li a1, 0x3F80 # 1.0
jal ra, sub_bf16 # -> 1.0 (0x3F80)
slli a0, a0, 16
srli a0, a0, 16
li t0, 0x3F80
bne a0, t0, print_sub_fail
print_sub_ok:
la a0, msg_sub
li a7, 4
ecall
j sub_done
print_sub_fail:
la a0, msg_subf
li a7, 4
ecall
li s0, 0
#MUL: 1.5 * 2.0 = 3.0
sub_done:
li a0, 0x3FC0 # 1.5
li a1, 0x4000 # 2.0
jal ra, mul_bf16 # -> 3.0 (0x4040)
slli a0, a0, 16
srli a0, a0, 16
li t0, 0x4040
bne a0, t0, print_mul_fail
print_mul_ok:
la a0, msg_mul
li a7, 4
ecall
j mul_done
print_mul_fail:
la a0, msg_mulf
li a7, 4
ecall
li s0, 0
# DIV: 3.0 / 2.0 = 1.5
mul_done:
li a0, 0x4040 # 3.0
li a1, 0x4000 # 2.0
jal ra, div_bf16 # -> 1.5 (0x3FC0)
slli a0, a0, 16
srli a0, a0, 16
li t0, 0x3FC0
bne a0, t0, print_div_fail
print_div_ok:
la a0, msg_div
li a7, 4
ecall
j div_done
print_div_fail:
la a0, msg_divf
li a7, 4
ecall
li s0, 0
# SQRT: sqrt(4.0) = 2.0
div_done:
li a0, 0x4080 # 4.0
jal ra, sqrt_bf16 # -> 2.0 (0x4000)
slli a0, a0, 16
srli a0, a0, 16
li t0, 0x4000
bne a0, t0, print_sqrt_fail
print_sqrt_ok:
la a0, msg_sqrt
li a7, 4
ecall
j sqrt_done
print_sqrt_fail:
la a0, msg_sqrtf
li a7, 4
ecall
li s0, 0
#SUITE
sqrt_done:
bnez s0, suite_ok
suite_fail:
la a0, msg_suite_fail
li a7, 4
ecall
j exit_suite
suite_ok:
la a0, msg_suite_ok
li a7, 4
ecall
exit_suite:
li a7, 10
ecall
#############################################
# f32_to_bf16
#############################################
f32_to_bf16:
srli t0, a0, 23 # t0 = exponent (8 bits)
andi t0, t0, 0xFF
li t1, 0xFF
beq t0, t1, f32_is_special # exp==0xFF -> Inf/NaN
srli t2, a0, 16 # t2 = (a0 >> 16)
andi t2, t2, 1 # t2 = LSB for ties
li t3, 0x7FFF # 0.5 ULP rounding offset
add t2, t2, t3
add a0, a0, t2
srli a0, a0, 16 # keep the upper 16 bits as the bf16 value
ret
f32_is_special:
srli a0, a0, 16 # take the upper 16 bits directly (preserves the NaN/Inf payload)
ret
#############################################
# bf16_to_f32
#############################################
bf16_to_f32:
slli a0, a0, 16 # bf16 << 16
ret
#############################################
# add_bf16 (A + B)
#############################################
add_bf16:
# Shift the low 16 bits into the upper half, then shift back logically, which zeroes the upper 16 bits
slli s8, a0, 16 # A
srli s8, s8, 16
slli s9, a1, 16 # B
srli s9, s9, 16
srli t0, s8, 15 # sign_a
andi t0, t0, 1
srli t1, s9, 15 # sign_b
andi t1, t1, 1
srli t2, s8, 7 # exp_a
andi t2, t2, 0xFF
srli t3, s9, 7 # exp_b
andi t3, t3, 0xFF
andi t4, s8, 0x7F # mant_a
andi t5, s9, 0x7F # mant_b
# Special values
li t6, 0xFF
beq t2, t6, add_a_special
beq t3, t6, add_b_special
# Fast path for zero operands
beq t2, x0, add_chk_a_zero
j add_chk_b_zero
add_chk_a_zero:
beq t4, x0, add_ret_b # A == 0 -> return B
add_chk_b_zero:
beq t3, x0, add_chk_b_is_zero
j add_norm
add_chk_b_is_zero:
beq t5, x0, add_ret_a # B == 0 -> return A
add_norm:
bnez t2, add_a_set1
j add_b_set1
add_a_set1:
ori t4, t4, 0x80 # A has a nonzero exponent -> set the implicit 1
add_b_set1:
beqz t3, add_align # set the implicit 1 only when exp_b != 0
ori t5, t5, 0x80 # B has a nonzero exponent -> set the implicit 1
j add_align
add_align:
sub s2, t2, t3 # s2 = exp_diff
blt x0, s2, add_a_bigger # if s2 > 0
blt s2, x0, add_b_bigger # if s2 < 0
add s3, x0, t2 # exponents are equal
j add_op
add_a_bigger:
add s3, x0, t2
li t6, 8
blt t6, s2, add_ret_a # if s2 > 8
srl t5, t5, s2 # shift B's mantissa right to align
j add_op
add_b_bigger:
add s3, x0, t3
li t6, -8
blt s2, t6, add_ret_b # if s2 < -8
neg t6, s2
srl t4, t4, t6 # shift A's mantissa right to align
add_op:
beq t0, t1, add_same_sign # same sign => add
# different signs => subtract
bge t4, t5, add_a_ge_b
add s4, x0, t1 # result sign = sign of B
sub s5, t5, t4 # mant_res = B - A
j add_norm_diff
add_a_ge_b:
add s4, x0, t0 # result sign = sign of A
sub s5, t4, t5 # mant_res = A - B
add_norm_diff:
beq s5, x0, add_ret_zero # difference is 0 => 0
add_norm_loop:
andi t6, s5, 0x80
bnez t6, add_pack # already normalized
addi s3, s3, -1 # exponent--
bge x0, s3, add_ret_zero # if s3 <= 0
slli s5, s5, 1
j add_norm_loop
add_same_sign:
add s4, x0, t0 # result sign = the shared sign
add s5, t4, t5 # mant_res = A + B
andi t6, s5, 0x100
beq t6, x0, add_pack
srli s5, s5, 1 # carry out: shift the mantissa right
addi s3, s3, 1 # exponent + 1
li t6, 0xFF
bge s3, t6, add_to_inf # overflow => Inf
add_pack:
slli s4, s4, 15 # sign to bit 15
andi s3, s3, 0xFF
slli s3, s3, 7 # exponent to bits 14:7
andi s5, s5, 0x7F # 7-bit mantissa
or a0, s4, s3
or a0, a0, s5
ret
add_to_inf:
slli s4, s4, 15
li t6, 0x7F80
or a0, s4, t6
ret
add_ret_zero:
li a0, 0x0000
ret
add_ret_a:
add a0, x0, s8
ret
add_ret_b:
add a0, x0, s9
ret
add_a_special:
bnez t4, add_ret_a # A is NaN -> return A
beq t3, t6, add_both_inf_nan
add a0, x0, s8 # A is Inf, B is not special -> return A
ret
add_both_inf_nan:
bnez t5, add_ret_b # B is NaN -> return B
beq t0, t1, add_ret_b # Inf with the same sign -> return B
li a0, 0x7FC0 # +Inf + (-Inf) = NaN
ret
add_b_special:
bnez t5, add_ret_b # B is NaN -> return B
add a0, x0, s9 # B is Inf -> return B
ret
#############################################
# sub_bf16 (A - B) = A + (-B)
#############################################
sub_bf16:
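li t6, 0x8000 # build the sign mask (0x8000 does not fit xori's 12-bit immediate)
xor a1, a1, t6 # flip the sign bit of B
j add_bf16 # tail call A + (-B): add_bf16 returns straight to our caller, so ra is not clobbered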
#============================================================
# mul_bf16 (A * B): inlined 8x8 multiply + RNE rounding
# Uses s1,s2,s3,s4,s8,s9,s10; no longer calls mul_u16
#============================================================
mul_bf16:
addi sp, sp, -32
sw ra, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s8, 20(sp)
sw s9, 24(sp)
sw s10,28(sp)
slli s8, a0, 16 # A
srli s8, s8, 16
slli s9, a1, 16 # B
srli s9, s9, 16
srli t0, s8, 15 # sign_a
andi t0, t0, 1
srli t1, s9, 15 # sign_b
andi t1, t1, 1
xor s1, t0, t1 # result_sign
srli t2, s8, 7 # exp_a
andi t2, t2, 0xFF
srli t3, s9, 7 # exp_b
andi t3, t3, 0xFF
andi t4, s8, 0x7F # mant_a
andi t5, s9, 0x7F # mant_b
li t6, 0xFF
beq t2, t6, mul_a_special
beq t3, t6, mul_b_special
beq t2, x0, mul_chk_a_zero
beq t3, x0, mul_chk_b_zero
j mul_norm_go
mul_chk_a_zero:
beqz t4, mul_zero
j mul_norm_go
mul_chk_b_zero:
beqz t5, mul_zero
mul_norm_go:
addi s2, x0, 0
bnez t2, mul_a_ok
beqz t4, mul_zero
mul_a_den_loop:
andi t6, t4, 0x80
bnez t6, mul_a_normed
slli t4, t4, 1
addi s2, s2, -1
j mul_a_den_loop
mul_a_normed:
addi t2, x0, 1
j mul_b_chk
mul_a_ok:
ori t4, t4, 0x80
mul_b_chk:
bnez t3, mul_b_ok
beqz t5, mul_zero
mul_b_den_loop:
andi t6, t5, 0x80
bnez t6, mul_b_normed
slli t5, t5, 1
addi s2, s2, -1
j mul_b_den_loop
mul_b_normed:
addi t3, x0, 1
j mul_do
mul_b_ok:
ori t5, t5, 0x80
mul_do:
# ---- Inlined 8x8 multiply: s3 = t4 * t5 ----
li s3, 0
add t6, x0, t5 # multiplier
add s10, x0, t4 # multiplicand
li t0, 8
mul8_loop:
andi t1, t6, 1
beqz t1, mul8_skip_add
add s3, s3, s10
mul8_skip_add:
slli s10, s10, 1
srli t6, t6, 1
addi t0, t0, -1
bnez t0, mul8_loop
# result_exp = exp_a + exp_b - 127 + exp_adjust
add s4, t2, t3
addi s4, s4, -127
add s4, s4, s2
# Normalization + RNE
li t6, 0x8000
and t6, s3, t6
beqz t6, mul_norm_lt2
# >= 2.0
mul_norm_ge2:
addi s4, s4, 1
srli s10, s3, 8 # m8
andi t0, s10, 0x7F # frac7
srli t1, s3, 7 # guard
andi t1, t1, 1
andi t2, s3, 0x7F # sticky
j mul_round
# [1,2)
mul_norm_lt2:
srli s10, s3, 7 # m8
andi t0, s10, 0x7F # frac7
srli t1, s3, 6 # guard
andi t1, t1, 1
andi t2, s3, 0x3F # sticky
mul_round:
beqz t1, mul_after_round # guard=0 → no round-up
bnez t2, mul_do_round # sticky!=0 → round up
andi t3, t0, 1 # tie: round up only if LSB=1 (round to nearest even)
beqz t3, mul_after_round
mul_do_round:
addi t0, t0, 1
li t3, 0x80
bne t0, t3, mul_after_round
li t0, 0x00
addi s4, s4, 1
mul_after_round:
li t6, 0xFF
bge s4, t6, mul_to_inf
bge x0, s4, mul_underflow
slli s1, s1, 15
andi s4, s4, 0xFF
slli s4, s4, 7
andi t0, t0, 0x7F
or a0, s1, s4
or a0, a0, t0
j mul_epilogue
mul_to_inf:
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
j mul_epilogue
mul_underflow:
slli a0, s1, 15
j mul_epilogue
mul_zero:
slli a0, s1, 15
j mul_epilogue
mul_a_special:
bnez t4, mul_ret_a
beq t3, x0, mul_inf_times_zero
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
j mul_epilogue
mul_b_special:
bnez t5, mul_ret_b
beq t2, x0, mul_inf_times_zero
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
j mul_epilogue
mul_inf_times_zero:
li a0, 0x7FC0
j mul_epilogue
mul_ret_a:
add a0, x0, s8
j mul_epilogue
mul_ret_b:
add a0, x0, s9
j mul_epilogue
mul_epilogue:
lw s10,28(sp)
lw s9, 24(sp)
lw s8, 20(sp)
lw s4, 16(sp)
lw s3, 12(sp)
lw s2, 8(sp)
lw s1, 4(sp)
lw ra, 0(sp)
addi sp, sp, 32
ret
########################################
# div_bf16 (A / B)
########################################
div_bf16:
# Shift the low 16 bits into the upper half, then shift back logically, which zeroes the upper 16 bits
slli s8, a0, 16 # A
srli s8, s8, 16
slli s9, a1, 16 # B
srli s9, s9, 16
srli t0, s8, 15 # sign_a
andi t0, t0, 1
srli t1, s9, 15 # sign_b
andi t1, t1, 1
xor s1, t0, t1 # result_sign
srli t2, s8, 7 # exp_a
andi t2, t2, 0xFF
srli t3, s9, 7 # exp_b
andi t3, t3, 0xFF
andi t4, s8, 0x7F # mant_a
andi t5, s9, 0x7F # mant_b
li t6, 0xFF
beq t3, t6, div_b_special
beq t3, x0, div_b_zero_or_subn
beq t2, t6, div_a_special
beq t2, x0, div_a_zero_or_subn
div_common:
bnez t2, div_a_set1
beqz t4, div_result_zero
div_a_set1:
ori t4, t4, 0x80
bnez t3, div_b_set1
beqz t5, div_result_nan
div_b_set1:
ori t5, t5, 0x80
# Long division
slli s2, t4, 15
add s3, x0, t5
li s4, 0
li s5, 0
div_loop_i:
li t6, 16
bge s5, t6, div_qdone
slli s4, s4, 1
li t6, 15
sub t6, t6, s5
sll t6, s3, t6
bgeu s2, t6, div_sub
j div_next
div_sub:
sub s2, s2, t6
ori s4, s4, 1
div_next:
addi s5, s5, 1
j div_loop_i
div_qdone:
sub s5, t2, t3
addi s5, s5, 127
beqz t2, div_adj_a
j div_adj_b
div_adj_a:
addi s5, s5, -1
div_adj_b:
beqz t3, div_adj_b2
j div_norm
div_adj_b2:
addi s5, s5, 1
div_norm:
li t6, 0x8000
and t6, s4, t6
bnez t6, div_q_has1
div_q_shift:
li t6, 1
bge t6, s5, div_q_done # if s5 <= 1
li s10, 0x8000
and s10, s4, s10
bnez s10, div_q_done
slli s4, s4, 1
addi s5, s5, -1
j div_q_shift
div_q_has1:
srli s4, s4, 8
div_q_done:
andi s4, s4, 0x7F
li t6, 0xFF
bge s5, t6, div_to_inf
bge x0, s5, div_result_zero
slli s1, s1, 15
andi s5, s5, 0xFF
slli s5, s5, 7
or a0, s1, s5
or a0, a0, s4
ret
div_b_special:
bnez t5, div_ret_b
slli a0, s1, 15
ret
div_b_zero_or_subn:
beqz t5, div_by_zero
j div_common
div_by_zero:
beq t2, x0, div_result_nan
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
ret
div_a_special:
bnez t4, div_ret_a
li t6, 0xFF
beq t3, t6, div_result_nan
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
ret
div_a_zero_or_subn:
beqz t4, div_result_zero
j div_common
div_result_zero:
slli a0, s1, 15
ret
div_result_nan:
li a0, 0x7FC0
ret
div_to_inf:
slli s1, s1, 15
li t6, 0x7F80
or a0, s1, t6
ret
div_ret_a:
add a0, x0, s8
ret
div_ret_b:
add a0, x0, s9
ret
#============================================================
# sqrt_bf16 (sqrt(A))
# Uses s2,s3,s4,s5,s6,s7,s8; binary search with inlined mid*mid (9-step multiply)
#============================================================
sqrt_bf16:
addi sp, sp, -32
sw ra, 0(sp)
sw s2, 4(sp)
sw s3, 8(sp)
sw s4, 12(sp)
sw s5, 16(sp)
sw s6, 20(sp)
sw s7, 24(sp)
sw s8, 28(sp)
slli s8, a0, 16 # A(16): low 16 bits → s8
srli s8, s8, 16
srli t0, s8, 15 # sign
andi t0, t0, 1
srli t1, s8, 7 # exp
andi t1, t1, 0xFF
andi t2, s8, 0x7F # mant
li t3, 0xFF
bne t1, t3, sqrt_chk_zero
bnez t2, sqrt_ret_a # NaN → A
bnez t0, sqrt_nan # -Inf → NaN
add a0, x0, s8 # +Inf → A
j sqrt_epilogue
sqrt_chk_zero:
beq t1, x0, sqrt_zero # 0 / subnormal → 0
bnez t0, sqrt_nan # negative → NaN
addi s2, t1, -127 # e = exp - bias
ori s3, t2, 0x80 # m = 1.x
andi t3, s2, 1
beqz t3, sqrt_even
slli s3, s3, 1 # odd exponent: m <<= 1, e--
addi s2, s2, -1
sqrt_even:
srai s4, s2, 1
addi s4, s4, 127 # new_exp
li s5, 90 # low
li s6, 256 # high
li s7, 128 # result
sqrt_bs_loop:
blt s6, s5, sqrt_bs_done # low > high ?
add t3, s5, s6
srli t3, t3, 1 # mid in t3
# ---- Inlined mid*mid (9-step shift-and-add; mid ∈ [0..256]) ----
add t4, x0, t3 # multiplicand
add t5, x0, t3 # multiplier
li t6, 0 # acc = 0
li t0, 9
sqrt_mul9_loop:
andi t1, t5, 1
beqz t1, sqrt_mul9_skip
add t6, t6, t4
sqrt_mul9_skip:
slli t4, t4, 1
srli t5, t5, 1
addi t0, t0, -1
bnez t0, sqrt_mul9_loop
srli a0, t6, 7 # sq = (mid^2)/128
blt s3, a0, sqrt_sq_too_big # if m < sq → high = mid - 1
add s7, x0, t3 # result = mid
addi s5, t3, 1 # low = mid + 1
j sqrt_bs_loop
sqrt_sq_too_big:
addi s6, t3, -1
j sqrt_bs_loop
sqrt_bs_done:
li t3, 256
blt s7, t3, sqrt_norm_low_ok
srli s7, s7, 1
addi s4, s4, 1
sqrt_norm_low_ok:
li t3, 128
bge s7, t3, sqrt_pack
sqrt_shift_up:
li t3, 1
bge t3, s4, sqrt_pack
slli s7, s7, 1
addi s4, s4, -1
li t3, 128 # t3 was overwritten with 1 above; restore the threshold
blt s7, t3, sqrt_shift_up # loop while result < 128
sqrt_pack:
andi s7, s7, 0x7F
li t3, 0xFF
bge s4, t3, sqrt_to_inf
bge x0, s4, sqrt_zero
slli s4, s4, 7
or a0, s4, s7
j sqrt_epilogue
sqrt_zero:
li a0, 0x0000
j sqrt_epilogue
sqrt_nan:
li a0, 0x7FC0
j sqrt_epilogue
sqrt_to_inf:
li a0, 0x7F80
j sqrt_epilogue
sqrt_ret_a:
add a0, x0, s8
j sqrt_epilogue
sqrt_epilogue:
lw s8, 28(sp)
lw s7, 24(sp)
lw s6, 20(sp)
lw s5, 16(sp)
lw s4, 12(sp)
lw s3, 8(sp)
lw s2, 4(sp)
lw ra, 0(sp)
addi sp, sp, 32
ret
#============================================================
# mul_u16: unsigned 16-bit multiply (RV32I: shift-and-add)
#============================================================
mul_u16:
li t0, 0
mul16_loop:
beq a1, x0, mul16_done
andi t1, a1, 1
beq t1, x0, mul16_skip_add
add t0, t0, a0
mul16_skip_add:
slli a0, a0, 1
srli a1, a1, 1
j mul16_loop
mul16_done:
add a0, x0, t0
ret
```
</details>
:::
---
## 4. Using s* Registers
* **Calling convention:** `s0–s11` are **callee-saved**.
A subroutine that uses them must save and restore them on the stack (`sw`/`lw`).
For **leaf functions** (no further `jal` calls), switching to `a*/t*` **eliminates stack traffic**.
* **V1 strategy in this project:**
* `main`: keeps `s0` as a persistent suite flag (stable across multiple `jal`).
* **`add_bf16`**: previously relied heavily on `s*`; now **converted entirely to `a*/t*`**, true leaf, no `sw/lw`.
* `sub_bf16`, `f32_to_bf16`, `bf16_to_f32`: already leaf; continue using `a*/t*`.
* `mul_bf16`, `div_bf16`, `sqrt_bf16`: **retain V0 stack-based design** due to algorithm complexity and high register pressure, as well as consistency with reference C behavior.
(Future versions could leafify these for performance, but current focus is optimizing the frequently-called `add_bf16`.)
* **Trade-off:** optimizing the most frequently executed and branch-dense function (`add_bf16`) first provides the greatest benefit; others remain unchanged for stability and validation clarity.
---
## Performance Comparison (Ripes, with print I/O)
| Metric | **V0** | **V1** | **Δ (V1 − V0)** |
| :----------------- | -----: | --------: | -----------------: |
| **Cycles** | 1572 | **1531** | **−41 (−2.6%)** |
| **Instr. retired** | 1105 | **1093** | **−12 (−1.1%)** |
| **CPI** | 1.42 | **1.40** | **−0.02 (−1.4%)** |
| **IPC** | 0.703 | **0.714** | **+0.011 (+1.6%)** |
:::success
V0

V1

:::
**Interpretation**
* Removing 12 redundant instructions in `main` directly reduces the retired-instruction count.
* Leafification and segmented normalization in `add_bf16` eliminate branches and memory traffic, yielding fewer cycles.
* CPI drops slightly (1.42 → 1.40) and IPC rises accordingly (0.703 → 0.714).
> Note: results include `ecall` I/O overhead.
> In a microbenchmark mode (no printing, reading the `cycle`/`instret` CSRs), the computational gains are expected to be more pronounced, because the fixed I/O cost no longer dilutes them.
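
The CPI, IPC, and Δ columns follow directly from the two measured counters. As a quick sanity check, the C sketch below recomputes them from the cycle and retired-instruction counts in the table. The names `run_stats`, `cpi`, `ipc`, `pct_change`, and `print_row` are illustrative only and are not part of the assignment code; call them from a `main` of your own.

```c
#include <stdio.h>

/* Counters as reported by Ripes for one run (values come from the table). */
typedef struct {
    unsigned cycles;
    unsigned instret;
} run_stats;

/* CPI = cycles per retired instruction. */
double cpi(run_stats r)
{
    return (double) r.cycles / (double) r.instret;
}

/* IPC is the reciprocal of CPI. */
double ipc(run_stats r)
{
    return (double) r.instret / (double) r.cycles;
}

/* Relative change of `after` with respect to `before`, in percent. */
double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Print one summary line, e.g. print_row("V0", (run_stats){1572, 1105}). */
void print_row(const char *name, run_stats r)
{
    printf("%s: cycles=%u instret=%u CPI=%.2f IPC=%.3f\n",
           name, r.cycles, r.instret, cpi(r), ipc(r));
}
```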
---
## 5. Code Excerpts
### `main` (removed post-call clearing)
```asm
- jal ra, add_bf16
- slli a0, a0, 16
- srli a0, a0, 16
+ jal ra, add_bf16
```
### `add_bf16` (normalization: while → one-step segmented shift)
```asm
- add_norm_loop:
- andi t6, s5, 0x80
- bnez t6, add_pack
- addi s3, s3, -1
- bge x0, s3, add_ret_zero
- slli s5, s5, 1
- j add_norm_loop
+ # Check leading bits and decide direct left shift (1–7 bits)
+ andi t6, t5, 0x40 ; bnez t6, add_shift_1
+ andi t6, t5, 0x20 ; bnez t6, add_shift_2
+ ...
+ slli t5, t5, 7
+ addi a6, a6, -7
```
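
To make the loop-to-segmented change concrete, here is a C model of both normalization strategies. It is a sketch under stated assumptions: the mantissa is nonzero and fits in 8 bits (the assembly returns zero before normalizing when the difference is 0), and the underflow check in the segmented variant is inferred from the V0 loop, because the excerpt above elides those lines. The names `norm_result`, `normalize_loop`, and `normalize_segmented` are hypothetical. For every valid input the two functions return the same result; the segmented form replaces up to seven loop iterations with one chain of bit tests and a single shift.

```c
#include <stdint.h>

typedef struct {
    uint32_t mant;   /* 8-bit mantissa with the leading 1 at bit 7 */
    int32_t  exp;    /* biased exponent */
    int      is_zero; /* 1 if normalization underflowed to zero */
} norm_result;

/* V0 style: shift left one bit per iteration until bit 7 is set. */
norm_result normalize_loop(uint32_t mant, int32_t exp)
{
    norm_result r = { mant, exp, 0 };
    while (!(r.mant & 0x80)) {
        r.exp--;
        if (r.exp <= 0) {            /* exponent exhausted: result is zero */
            r.mant = 0; r.exp = 0; r.is_zero = 1;
            return r;
        }
        r.mant <<= 1;
    }
    return r;
}

/* V1 style: find the leading bit once, then shift by 0..7 in one step. */
norm_result normalize_segmented(uint32_t mant, int32_t exp)
{
    norm_result r = { mant, exp, 0 };
    int shift;

    if      (mant & 0x80) shift = 0;
    else if (mant & 0x40) shift = 1;
    else if (mant & 0x20) shift = 2;
    else if (mant & 0x10) shift = 3;
    else if (mant & 0x08) shift = 4;
    else if (mant & 0x04) shift = 5;
    else if (mant & 0x02) shift = 6;
    else                  shift = 7;

    /* Assumed underflow rule, matching the loop: zero if the exponent
       would reach 0 or below while shifting. */
    if (shift > 0 && exp - shift <= 0) {
        r.mant = 0; r.exp = 0; r.is_zero = 1;
        return r;
    }
    r.mant = mant << shift;
    r.exp  = exp - shift;
    return r;
}
```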
---
## 6. Correctness (Identical to V0)
* **NaN propagation, ±Inf, ±0:** follow reference C rules.
* **Subnormals flushed to zero.**
* **`mul_bf16`:** RNE rounding via guard/sticky bits.
* **`add_bf16`:** behaves as defined in Quiz 1 Problem C after normalization.
* **`sqrt_bf16`:** binary search with even/odd exponent adjustment; within allowed error range.
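
The guard/sticky rounding mentioned for `mul_bf16` can be modeled in a few lines of C. The sketch below mirrors the `mul_norm_ge2` / `mul_norm_lt2` / `mul_round` paths of the V1 listing: it takes the 16-bit product of two 8-bit significands plus the provisional exponent, and returns the rounded 7-bit fraction and the adjusted exponent. The type `rounded` and the function `round_product_rne` are hypothetical names, and the overflow and underflow checks that follow in the assembly are left out.

```c
#include <stdint.h>

typedef struct {
    uint32_t frac7; /* 7-bit fraction after rounding */
    int32_t  exp;   /* exponent after normalization and rounding carry */
} rounded;

/* Round-to-nearest-even on the product of two 1.7 fixed-point significands. */
rounded round_product_rne(uint32_t prod, int32_t exp)
{
    uint32_t frac7, guard, sticky;
    rounded r;

    if (prod & 0x8000) {
        /* Product in [2, 4): renormalize by one bit, bump the exponent. */
        exp += 1;
        frac7  = (prod >> 8) & 0x7F;
        guard  = (prod >> 7) & 1;
        sticky = prod & 0x7F;
    } else {
        /* Product in [1, 2). */
        frac7  = (prod >> 7) & 0x7F;
        guard  = (prod >> 6) & 1;
        sticky = prod & 0x3F;
    }

    /* Round up when above half, or exactly half with an odd fraction. */
    if (guard && (sticky || (frac7 & 1))) {
        frac7 += 1;
        if (frac7 == 0x80) { /* fraction overflowed into the hidden bit */
            frac7 = 0;
            exp += 1;
        }
    }

    r.frac7 = frac7;
    r.exp = exp;
    return r;
}
```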
:::success
The test result

:::
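
For the `sqrt_bf16` item in the correctness list, the binary search with even/odd exponent adjustment can be sketched in C as follows. This is a simplified model that assumes a positive, normal input (sign bit 0, exponent field 1..254); the special-case paths for NaN, Inf, zero, subnormal, and negative inputs are omitted. The function names `sqrt_mantissa` and `sqrt_bf16_normal` are hypothetical. The search bounds 90 and 256 and the `>> 7` scaling are taken from the assembly listing.

```c
#include <stdint.h>

/* Largest mid in [90, 256] with (mid * mid) >> 7 <= m, as in sqrt_bs_loop. */
uint32_t sqrt_mantissa(uint32_t m)
{
    uint32_t low = 90, high = 256, result = 128;

    while (low <= high) {
        uint32_t mid = (low + high) >> 1;
        uint32_t sq  = (mid * mid) >> 7; /* mid^2 / 128 */
        if (sq <= m) {
            result = mid;   /* mid is still small enough: search higher */
            low = mid + 1;
        } else {
            high = mid - 1; /* mid^2 too large: search lower */
        }
    }
    return result;
}

/* Square root of a positive, normal bf16 bit pattern. */
uint16_t sqrt_bf16_normal(uint16_t a)
{
    int32_t  e = (int32_t) ((a >> 7) & 0xFF) - 127; /* unbiased exponent */
    uint32_t m = (uint32_t) (a & 0x7F) | 0x80;      /* significand 1.x */
    int32_t  new_exp;
    uint32_t result;

    if (e & 1) {    /* odd exponent: fold one factor of 2 into m */
        m <<= 1;
        e -= 1;
    }
    new_exp = e / 2 + 127; /* e is even here, so the division is exact */

    result = sqrt_mantissa(m);
    if (result >= 256) {   /* kept for parity with the assembly */
        result >>= 1;
        new_exp += 1;
    }
    return (uint16_t) (((uint32_t) new_exp << 7) | (result & 0x7F));
}
```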