# Assignment2: Complete Applications
contributed by <`戴仁杰`>
repo: https://github.com/dozingmoon/ca2025-quizzes
Disclaimer: I used AI tools such as ChatGPT for better understanding of algorithm or architecture descriptions.
## Homework 1: BFloat16
```
=== BFloat16 Tests (ASM Version) ===
Test 1: bf16_special_cases (asm helpers)
bf16_iszero(0.0) ... PASSED
bf16_iszero(-0.0) ... PASSED
!bf16_iszero(1.0) ... PASSED
bf16_isinf(Inf) ... PASSED
bf16_isinf(-Inf) ... PASSED
!bf16_isinf(NaN) ... PASSED
bf16_isnan(NaN) ... PASSED
!bf16_isnan(Inf) ... PASSED
Passed: 8 / 8
Cycles: 3122, Instructions: 3122
Test 2: bf16_add
0x3f80 + 0x3f80 = 0x4000 (PASSED)
0x4000 + 0x3f80 = 0x4040 (PASSED)
0x3f80 + 0x4000 = 0x4040 (PASSED)
0x3f00 + 0x3f00 = 0x3f80 (PASSED)
0x3f80 + 0x0 = 0x3f80 (PASSED)
0x3f80 + 0x8000 = 0x3f80 (PASSED)
0x0 + 0x0 = 0x0 (PASSED)
0x8000 + 0x0 = 0x0 (PASSED)
0x8000 + 0x8000 = 0x8000 (PASSED)
0x3f80 + 0xbf80 = 0x0 (PASSED)
0xbf80 + 0xbf80 = 0xc000 (PASSED)
0xc000 + 0x3f80 = 0xbf80 (PASSED)
0x7f80 + 0x3f80 = 0x7f80 (PASSED)
0x7f80 + 0x7f80 = 0x7f80 (PASSED)
0xff80 + 0x3f80 = 0xff80 (PASSED)
0x7f80 + 0xff80 = 0x7fc0 (PASSED)
0x7fc0 + 0x3f80 = 0x7fc0 (PASSED)
0x3f80 + 0x7fc0 = 0x7fc0 (PASSED)
0x7f7f + 0x7f7f = 0x7f80 (PASSED)
0x1 + 0x1 = 0x2 (PASSED)
Passed: 20 / 20
Cycles: 16971, Instructions: 16971
Test 3: bf16_sub
0x4040 - 0x4000 = 0x3f80 (PASSED)
0x3f80 - 0x3f80 = 0x0 (PASSED)
0x3f80 - 0x4000 = 0xbf80 (PASSED)
0x3f80 - 0x0 = 0x3f80 (PASSED)
0x3f80 - 0x8000 = 0x3f80 (PASSED)
0x0 - 0x8000 = 0x0 (PASSED)
0x8000 - 0x0 = 0x8000 (PASSED)
0x7f80 - 0x3f80 = 0x7f80 (PASSED)
0x3f80 - 0x7f80 = 0xff80 (PASSED)
0x7f80 - 0x7f80 = 0x7fc0 (PASSED)
0xff80 - 0xff80 = 0x7fc0 (PASSED)
0x7fc0 - 0x3f80 = 0x7fc0 (PASSED)
0x3f80 - 0x7fc0 = 0xffc0 (PASSED)
Passed: 13 / 13
Cycles: 12944, Instructions: 12944
Test 4: bf16_mul
0x4000 * 0x4040 = 0x40c0 (PASSED)
0x3f00 * 0x3f00 = 0x3e80 (PASSED)
0x3f80 * 0x3f80 = 0x3f80 (PASSED)
0xbf80 * 0x4000 = 0xc000 (PASSED)
0xbf80 * 0xbf80 = 0x3f80 (PASSED)
0x3f80 * 0x0 = 0x0 (PASSED)
0x3f80 * 0x8000 = 0x8000 (PASSED)
0xbf80 * 0x0 = 0x8000 (PASSED)
0xbf80 * 0x8000 = 0x0 (PASSED)
0x8000 * 0x8000 = 0x0 (PASSED)
0x7f80 * 0x4000 = 0x7f80 (PASSED)
0x7f80 * 0xc000 = 0xff80 (PASSED)
0x7f80 * 0x0 = 0x7fc0 (PASSED)
0x7f7f * 0x4000 = 0x7f80 (PASSED)
Passed: 14 / 14
Cycles: 14694, Instructions: 14694
Test 5: bf16_div
0x40c0 / 0x4000 = 0x4040 (PASSED)
0x3f80 / 0x3f80 = 0x3f80 (PASSED)
0x3f80 / 0x4000 = 0x3f00 (PASSED)
0x3f80 / 0x0 = 0x7f80 (PASSED)
0x3f80 / 0x8000 = 0xff80 (PASSED)
0xbf80 / 0x0 = 0xff80 (PASSED)
0x0 / 0x3f80 = 0x0 (PASSED)
0x8000 / 0x3f80 = 0x8000 (PASSED)
0x0 / 0x0 = 0x7fc0 (PASSED)
0x8000 / 0x0 = 0x7fc0 (PASSED)
0x7f80 / 0x3f80 = 0x7f80 (PASSED)
0x3f80 / 0x7f80 = 0x0 (PASSED)
0x7f80 / 0x7f80 = 0x7fc0 (PASSED)
Passed: 13 / 13
Cycles: 12931, Instructions: 12931
Test 6: bf16_sqrt
sqrt(0x3f80) = 0x3f80 (PASSED)
sqrt(0x4080) = 0x4000 (PASSED)
sqrt(0x4110) = 0x4040 (PASSED)
sqrt(0x0) = 0x0 (PASSED)
sqrt(0x7f80) = 0x7f80 (PASSED)
sqrt(0xbf80) = 0x7fc0 (PASSED)
sqrt(0x7fc0) = 0x7fc0 (PASSED)
Passed: 7 / 7
Cycles: 19627, Instructions: 19627
Test 7: Q_sqrt_bf16
Q_sqrt(0x3f80) = 0x3f80 (PASSED)
Q_sqrt(0x4080) = 0x3f00 (PASSED)
Q_sqrt(0x3e80) = 0x4000 (PASSED)
Q_sqrt(0x4180) = 0x3e80 (PASSED)
Q_sqrt(0x7fc0) = 0xffc0 (PASSED)
Passed: 5 / 5
Cycles: 8255, Instructions: 8255
=== All Tests Completed ===
```
## Quiz 2: Tower of Hanoi in Assembly
### Core Algorithm — Gray Code Method (Iterative Tower of Hanoi)
The core algorithm **leverages the Gray code method** to generate the sequence of disk moves **iteratively**, avoiding recursion.
### 🔹 Key Principles
Each step computes:
\begin{aligned}
\text{gray}(i) &= i \oplus (i \gg 1) \\
\text{gray}(i-1) &= (i-1) \oplus ((i-1) \gg 1) \\
\text{changed_bit} &= \text{gray}(i) \oplus \text{gray}(i-1)
\end{aligned}
- The **changed bit** determines **which disk to move**.
- Disk movement follows **specific cyclic rules** depending on disk **parity** (odd or even number of disks).
---
### 🧠 Assembly Design Highlights
- **Registers**
- `s0–s4` are used for persistent variables:
- `s0`: loop counter `i`
- `s1`: current `disk` index
- `s2`: `from_peg`
- `s3`: `to_peg`
- `s4`: pointer to the **output buffer**
- **Stack Frame**
- 48 bytes total
- 16-byte aligned for ABI compliance
- **System Calls (ecall 64)**
Used for text output through `SYS_WRITE`:
```
=== Starting Hanoi ASM ===
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
=== Hanoi ASM Finished ===
Total Cycles: 643
Total Instructions: 643
```
## Quiz 3:
### Algorithm (Scheme 2 – LUT-Only Version)
The implemented function:
```c
uint32_t fast_rsqrt_c_v2_lut_only(uint32_t x)
```
uses a **lookup-table (LUT)** to estimate reciprocal square roots based on the **exponent** of the input number.
### Algorithm Steps
1. **Handle Zero Case:**
* If `x == 0`, return `0xFFFFFFFF` (infinity approximation).
2. **Find Exponent:**
* Compute `exp = 31 - clz_c(x)` where `clz_c` counts leading zeros.
3. **Lookup Approximation:**
* Return `rsqrt_table[exp]` as the result.
This LUT-based version is a simplified (non-iterative) version of the original `fast_rsqrt_c_v1_full()` algorithm, providing fast and deterministic output without wide 64-bit math.
---
## 🧠 Mathematical Background
The **reciprocal square root** function $y = \frac{1}{\sqrt{x}}$ can be approximated by:
$$y \approx 2^{-n/2} \times f(m)$$
where:
* $n$ is the exponent of `x`
* $f(m)$ is a precomputed lookup from normalized mantissa values
Instead of floating-point math, we use **fixed-point integers** and **bit-level exponent extraction**.
---
## ⚙️ Key Components
### 1. **Software Arithmetic (RV32I-Compatible)**
Since RV32I lacks hardware multiply/divide instructions, all arithmetic is implemented in pure software:
```c
static uint32_t umul(uint32_t a, uint32_t b) {
uint32_t result = 0;
while (b) {
if (b & 1) result += a;
a <<= 1;
b >>= 1;
}
return result;
}
```
Similarly, `udiv` and `umod` perform division and modulo by iterative subtraction.
---
### 2. **64-bit Emulation Functions**
To support 64-bit operations (required by some compilers for `uint64_t` math), software versions of GCC’s helper functions are provided:
* `__muldi3` – 32×32 → 64 multiply
* `__ashlodi4` – 64-bit logical left shift
* `__lshrdi3` – 64-bit logical right shift
All are implemented with `memcpy` to safely handle 64-bit values under strict aliasing rules.
```
x=0: Expected=4294967295, Actual=4294967295, Error=0, Cycles=35, Instructions=35
x=1: Expected=65535, Actual=65535, Error=0, Cycles=99, Instructions=99
x=2: Expected=46340, Actual=46341, Error=1, Cycles=96, Instructions=96
x=3: Expected=37836, Actual=46341, Error=8505, Cycles=0, Instructions=0
x=5: Expected=29302, Actual=32768, Error=3466, Cycles=90, Instructions=90
x=15: Expected=16909, Actual=23170, Error=6261, Cycles=90, Instructions=90
x=16: Expected=16384, Actual=16384, Error=0, Cycles=82, Instructions=82
x=17: Expected=15890, Actual=16384, Error=494, Cycles=0, Instructions=0
x=100: Expected=6553, Actual=8192, Error=1639, Cycles=82, Instructions=82
x=511: Expected=2900, Actual=4096, Error=1196, Cycles=73, Instructions=73
x=512: Expected=2896, Actual=2896, Error=0, Cycles=0, Instructions=0
x=513: Expected=2893, Actual=2896, Error=3, Cycles=0, Instructions=0
x=1024: Expected=2048, Actual=2048, Error=0, Cycles=0, Instructions=0
x=65535: Expected=256, Actual=362, Error=106, Cycles=73, Instructions=73
x=100000: Expected=207, Actual=256, Error=49, Cycles=64, Instructions=64
x=1000000: Expected=65, Actual=90, Error=25, Cycles=0, Instructions=0
x=2147483647: Expected=1, Actual=2, Error=1, Cycles=64, Instructions=64
x=4294967295: Expected=1, Actual=1, Error=0, Cycles=0, Instructions=0
Scheme 2 (C) Total: 848 Cycles, 848 Instructions
```
* Note: some CSR counters are zero because of **compiler optimization**
## Disassemble the ELF files produced by the C compiler and contrast the handwritten and compiler-optimized assembly listings. (`bf16_isnan`)
---
## Function Overview
`bf16_isnan` checks whether a **bfloat16 (BF16)** value represents *NaN (Not-a-Number)*.
In BF16 format:
* Exponent bits = `0x7F80` (all 1s)
* Mantissa ≠ 0 → indicates **NaN**
Mathematically:
```c
bool bf16_isnan(uint16_t x) {
return ((x & 0x7F80) == 0x7F80) && ((x & 0x007F) != 0);
}
```
---
## Compiler-Generated Assembly (`bf16_isnan_c`)
```asm
00010d54 <bf16_isnan_c>:
addi sp,sp,-32
sw ra,28(sp)
sw s0,24(sp)
addi s0,sp,32
sh a0,-20(s0)
lhu a5,-20(s0)
mv a4,a5
lui a5,0x8
addi a5,a5,-128 # 0x7F80
and a4,a4,a5
lui a5,0x8
addi a5,a5,-128 # 0x7F80
bne a4,a5,10d9c # if (exp != 0x7F80)
lhu a5,-20(s0)
andi a5,a5,127 # mantissa bits
beqz a5,10d9c # if (mantissa == 0)
li a5,1
j 10da0
10d9c: li a5,0
10da0: andi a5,a5,1
zext.b a5,a5
mv a0,a5
lw ra,28(sp)
lw s0,24(sp)
addi sp,sp,32
ret
```
### 🧩 Characteristics
* **Prologue/Epilogue heavy:** 32-byte stack frame, RA/S0 save-restore.
* **Local variable stored on stack** (`sh`/`lhu` to/from `s0 - 20`).
* Redundant reload of constants and masking.
* Follows *GCC ABI conventions strictly* even though the function is trivial.
* Final result normalized to 0 or 1 (`andi` + `zext.b`).
✅ **Advantage:** Portable, ABI-compliant, easy for debugging.
❌ **Disadvantage:** Inefficient (≈ 20+ instructions for a simple check).
---
## Handwritten Assembly (`bf16_isnan`)
```asm
000100c0 <bf16_isnan>:
addi sp,sp,-4
sw s1,0(sp)
lui s1,0x8
addi s1,s1,-128 # 0x7F80
and t0,a0,s1
bne t0,s1,100e8 # exp != 0x7F80 → not NaN
andi t0,a0,127
beqz t0,100e8 # mantissa == 0 → not NaN
li a0,1
j 100ec
100e8: li a0,0
100ec: lw s1,0(sp)
addi sp,sp,4
ret
```
### ⚙️ Optimized Features
* **Minimal stack usage (4 bytes)**, only saving `s1`.
* No unnecessary locals — computation stays in registers.
* Constant `0x7F80` loaded once and reused.
* No redundant `andi/zext.b` postprocessing.
* Straightline control flow with clear true/false branches.
✅ **Advantage:** Compact (≈ 12 instructions), fewer memory accesses.
❌ **Disadvantage:** Not fully ABI-safe (no frame pointer, assumes caller convention).
---
In performance-sensitive embedded systems (like BF16 math units or soft cores), **hand-tuned assembly** like this can reduce instruction count and latency by over **40–50%** for small helper routines.