# Assignment2: Complete Applications contributed by <`戴仁杰`> repo: https://github.com/dozingmoon/ca2025-quizzes Disclaimer: I used AI tools such as ChatGPT for better understanding of algorithm or architecture descriptions. ## Homework 1: BFloat16 ``` === BFloat16 Tests (ASM Version) === Test 1: bf16_special_cases (asm helpers) bf16_iszero(0.0) ... PASSED bf16_iszero(-0.0) ... PASSED !bf16_iszero(1.0) ... PASSED bf16_isinf(Inf) ... PASSED bf16_isinf(-Inf) ... PASSED !bf16_isinf(NaN) ... PASSED bf16_isnan(NaN) ... PASSED !bf16_isnan(Inf) ... PASSED Passed: 8 / 8 Cycles: 3122, Instructions: 3122 Test 2: bf16_add 0x3f80 + 0x3f80 = 0x4000 (PASSED) 0x4000 + 0x3f80 = 0x4040 (PASSED) 0x3f80 + 0x4000 = 0x4040 (PASSED) 0x3f00 + 0x3f00 = 0x3f80 (PASSED) 0x3f80 + 0x0 = 0x3f80 (PASSED) 0x3f80 + 0x8000 = 0x3f80 (PASSED) 0x0 + 0x0 = 0x0 (PASSED) 0x8000 + 0x0 = 0x0 (PASSED) 0x8000 + 0x8000 = 0x8000 (PASSED) 0x3f80 + 0xbf80 = 0x0 (PASSED) 0xbf80 + 0xbf80 = 0xc000 (PASSED) 0xc000 + 0x3f80 = 0xbf80 (PASSED) 0x7f80 + 0x3f80 = 0x7f80 (PASSED) 0x7f80 + 0x7f80 = 0x7f80 (PASSED) 0xff80 + 0x3f80 = 0xff80 (PASSED) 0x7f80 + 0xff80 = 0x7fc0 (PASSED) 0x7fc0 + 0x3f80 = 0x7fc0 (PASSED) 0x3f80 + 0x7fc0 = 0x7fc0 (PASSED) 0x7f7f + 0x7f7f = 0x7f80 (PASSED) 0x1 + 0x1 = 0x2 (PASSED) Passed: 20 / 20 Cycles: 16971, Instructions: 16971 Test 3: bf16_sub 0x4040 - 0x4000 = 0x3f80 (PASSED) 0x3f80 - 0x3f80 = 0x0 (PASSED) 0x3f80 - 0x4000 = 0xbf80 (PASSED) 0x3f80 - 0x0 = 0x3f80 (PASSED) 0x3f80 - 0x8000 = 0x3f80 (PASSED) 0x0 - 0x8000 = 0x0 (PASSED) 0x8000 - 0x0 = 0x8000 (PASSED) 0x7f80 - 0x3f80 = 0x7f80 (PASSED) 0x3f80 - 0x7f80 = 0xff80 (PASSED) 0x7f80 - 0x7f80 = 0x7fc0 (PASSED) 0xff80 - 0xff80 = 0x7fc0 (PASSED) 0x7fc0 - 0x3f80 = 0x7fc0 (PASSED) 0x3f80 - 0x7fc0 = 0xffc0 (PASSED) Passed: 13 / 13 Cycles: 12944, Instructions: 12944 Test 4: bf16_mul 0x4000 * 0x4040 = 0x40c0 (PASSED) 0x3f00 * 0x3f00 = 0x3e80 (PASSED) 0x3f80 * 0x3f80 = 0x3f80 (PASSED) 0xbf80 * 0x4000 = 0xc000 (PASSED) 0xbf80 * 0xbf80 = 0x3f80 (PASSED) 0x3f80 * 0x0 = 0x0 (PASSED) 0x3f80 * 0x8000 = 0x8000 (PASSED) 0xbf80 * 0x0 = 0x8000 (PASSED) 0xbf80 * 0x8000 = 0x0 (PASSED) 0x8000 * 0x8000 = 0x0 (PASSED) 0x7f80 * 0x4000 = 0x7f80 (PASSED) 0x7f80 * 0xc000 = 0xff80 (PASSED) 0x7f80 * 0x0 = 0x7fc0 (PASSED) 0x7f7f * 0x4000 = 0x7f80 (PASSED) Passed: 14 / 14 Cycles: 14694, Instructions: 14694 Test 5: bf16_div 0x40c0 / 0x4000 = 0x4040 (PASSED) 0x3f80 / 0x3f80 = 0x3f80 (PASSED) 0x3f80 / 0x4000 = 0x3f00 (PASSED) 0x3f80 / 0x0 = 0x7f80 (PASSED) 0x3f80 / 0x8000 = 0xff80 (PASSED) 0xbf80 / 0x0 = 0xff80 (PASSED) 0x0 / 0x3f80 = 0x0 (PASSED) 0x8000 / 0x3f80 = 0x8000 (PASSED) 0x0 / 0x0 = 0x7fc0 (PASSED) 0x8000 / 0x0 = 0x7fc0 (PASSED) 0x7f80 / 0x3f80 = 0x7f80 (PASSED) 0x3f80 / 0x7f80 = 0x0 (PASSED) 0x7f80 / 0x7f80 = 0x7fc0 (PASSED) Passed: 13 / 13 Cycles: 12931, Instructions: 12931 Test 6: bf16_sqrt sqrt(0x3f80) = 0x3f80 (PASSED) sqrt(0x4080) = 0x4000 (PASSED) sqrt(0x4110) = 0x4040 (PASSED) sqrt(0x0) = 0x0 (PASSED) sqrt(0x7f80) = 0x7f80 (PASSED) sqrt(0xbf80) = 0x7fc0 (PASSED) sqrt(0x7fc0) = 0x7fc0 (PASSED) Passed: 7 / 7 Cycles: 19627, Instructions: 19627 Test 7: Q_sqrt_bf16 Q_sqrt(0x3f80) = 0x3f80 (PASSED) Q_sqrt(0x4080) = 0x3f00 (PASSED) Q_sqrt(0x3e80) = 0x4000 (PASSED) Q_sqrt(0x4180) = 0x3e80 (PASSED) Q_sqrt(0x7fc0) = 0xffc0 (PASSED) Passed: 5 / 5 Cycles: 8255, Instructions: 8255 === All Tests Completed === ``` ## Quiz 2: Tower of Hanoi in Assembly ### Core Algorithm — Gray Code Method (Iterative Tower of Hanoi) The core algorithm **leverages the Gray code method** to generate the sequence of disk moves **iteratively**, avoiding recursion. ### 🔹 Key Principles Each step computes: \begin{aligned} \text{gray}(i) &= i \oplus (i \gg 1) \\ \text{gray}(i-1) &= (i-1) \oplus ((i-1) \gg 1) \\ \text{changed_bit} &= \text{gray}(i) \oplus \text{gray}(i-1) \end{aligned} - The **changed bit** determines **which disk to move**. - Disk movement follows **specific cyclic rules** depending on disk **parity** (odd or even number of disks). --- ### 🧠 Assembly Design Highlights - **Registers** - `s0–s4` are used for persistent variables: - `s0`: loop counter `i` - `s1`: current `disk` index - `s2`: `from_peg` - `s3`: `to_peg` - `s4`: pointer to the **output buffer** - **Stack Frame** - 48 bytes total - 16-byte aligned for ABI compliance - **System Calls (ecall 64)** Used for text output through `SYS_WRITE`: ``` === Starting Hanoi ASM === Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C === Hanoi ASM Finished === Total Cycles: 643 Total Instructions: 643 ``` ## Quiz 3: ### Algorithm (Scheme 2 – LUT-Only Version) The implemented function: ```c uint32_t fast_rsqrt_c_v2_lut_only(uint32_t x) ``` uses a **lookup-table (LUT)** to estimate reciprocal square roots based on the **exponent** of the input number. ### Algorithm Steps 1. **Handle Zero Case:** * If `x == 0`, return `0xFFFFFFFF` (infinity approximation). 2. **Find Exponent:** * Compute `exp = 31 - clz_c(x)` where `clz_c` counts leading zeros. 3. **Lookup Approximation:** * Return `rsqrt_table[exp]` as the result. This LUT-based version is a simplified (non-iterative) version of the original `fast_rsqrt_c_v1_full()` algorithm, providing fast and deterministic output without wide 64-bit math. --- ## 🧠 Mathematical Background The **reciprocal square root** function $y = \frac{1}{\sqrt{x}}$ can be approximated by: $$y \approx 2^{-n/2} \times f(m)$$ where: * $n$ is the exponent of `x` * $f(m)$ is a precomputed lookup from normalized mantissa values Instead of floating-point math, we use **fixed-point integers** and **bit-level exponent extraction**. --- ## ⚙️ Key Components ### 1. **Software Arithmetic (RV32I-Compatible)** Since RV32I lacks hardware multiply/divide instructions, all arithmetic is implemented in pure software: ```c static uint32_t umul(uint32_t a, uint32_t b) { uint32_t result = 0; while (b) { if (b & 1) result += a; a <<= 1; b >>= 1; } return result; } ``` Similarly, `udiv` and `umod` perform division and modulo by iterative subtraction. --- ### 2. **64-bit Emulation Functions** To support 64-bit operations (required by some compilers for `uint64_t` math), software versions of GCC’s helper functions are provided: * `__muldi3` – 32×32 → 64 multiply * `__ashlodi4` – 64-bit logical left shift * `__lshrdi3` – 64-bit logical right shift All are implemented with `memcpy` to safely handle 64-bit values under strict aliasing rules. ``` x=0: Expected=4294967295, Actual=4294967295, Error=0, Cycles=35, Instructions=35 x=1: Expected=65535, Actual=65535, Error=0, Cycles=99, Instructions=99 x=2: Expected=46340, Actual=46341, Error=1, Cycles=96, Instructions=96 x=3: Expected=37836, Actual=46341, Error=8505, Cycles=0, Instructions=0 x=5: Expected=29302, Actual=32768, Error=3466, Cycles=90, Instructions=90 x=15: Expected=16909, Actual=23170, Error=6261, Cycles=90, Instructions=90 x=16: Expected=16384, Actual=16384, Error=0, Cycles=82, Instructions=82 x=17: Expected=15890, Actual=16384, Error=494, Cycles=0, Instructions=0 x=100: Expected=6553, Actual=8192, Error=1639, Cycles=82, Instructions=82 x=511: Expected=2900, Actual=4096, Error=1196, Cycles=73, Instructions=73 x=512: Expected=2896, Actual=2896, Error=0, Cycles=0, Instructions=0 x=513: Expected=2893, Actual=2896, Error=3, Cycles=0, Instructions=0 x=1024: Expected=2048, Actual=2048, Error=0, Cycles=0, Instructions=0 x=65535: Expected=256, Actual=362, Error=106, Cycles=73, Instructions=73 x=100000: Expected=207, Actual=256, Error=49, Cycles=64, Instructions=64 x=1000000: Expected=65, Actual=90, Error=25, Cycles=0, Instructions=0 x=2147483647: Expected=1, Actual=2, Error=1, Cycles=64, Instructions=64 x=4294967295: Expected=1, Actual=1, Error=0, Cycles=0, Instructions=0 Scheme 2 (C) Total: 848 Cycles, 848 Instructions ``` * Note: some CSR counters are zero because of **compiler optimization** ## Disassemble the ELF files produced by the C compiler and contrast the handwritten and compiler-optimized assembly listings. (`bf16_isnan`) --- ## Function Overview `bf16_isnan` checks whether a **bfloat16 (BF16)** value represents *NaN (Not-a-Number)*. In BF16 format: * Exponent bits = `0x7F80` (all 1s) * Mantissa ≠ 0 → indicates **NaN** Mathematically: ```c bool bf16_isnan(uint16_t x) { return ((x & 0x7F80) == 0x7F80) && ((x & 0x007F) != 0); } ``` --- ## Compiler-Generated Assembly (`bf16_isnan_c`) ```asm 00010d54 <bf16_isnan_c>: addi sp,sp,-32 sw ra,28(sp) sw s0,24(sp) addi s0,sp,32 sh a0,-20(s0) lhu a5,-20(s0) mv a4,a5 lui a5,0x8 addi a5,a5,-128 # 0x7F80 and a4,a4,a5 lui a5,0x8 addi a5,a5,-128 # 0x7F80 bne a4,a5,10d9c # if (exp != 0x7F80) lhu a5,-20(s0) andi a5,a5,127 # mantissa bits beqz a5,10d9c # if (mantissa == 0) li a5,1 j 10da0 10d9c: li a5,0 10da0: andi a5,a5,1 zext.b a5,a5 mv a0,a5 lw ra,28(sp) lw s0,24(sp) addi sp,sp,32 ret ``` ### 🧩 Characteristics * **Prologue/Epilogue heavy:** 32-byte stack frame, RA/S0 save-restore. * **Local variable stored on stack** (`sh`/`lhu` to/from `s0 - 20`). * Redundant reload of constants and masking. * Follows *GCC ABI conventions strictly* even though the function is trivial. * Final result normalized to 0 or 1 (`andi` + `zext.b`). ✅ **Advantage:** Portable, ABI-compliant, easy for debugging. ❌ **Disadvantage:** Inefficient (≈ 20+ instructions for a simple check). --- ## Handwritten Assembly (`bf16_isnan`) ```asm 000100c0 <bf16_isnan>: addi sp,sp,-4 sw s1,0(sp) lui s1,0x8 addi s1,s1,-128 # 0x7F80 and t0,a0,s1 bne t0,s1,100e8 # exp != 0x7F80 → not NaN andi t0,a0,127 beqz t0,100e8 # mantissa == 0 → not NaN li a0,1 j 100ec 100e8: li a0,0 100ec: lw s1,0(sp) addi sp,sp,4 ret ``` ### ⚙️ Optimized Features * **Minimal stack usage (4 bytes)**, only saving `s1`. * No unnecessary locals — computation stays in registers. * Constant `0x7F80` loaded once and reused. * No redundant `andi/zext.b` postprocessing. * Straightline control flow with clear true/false branches. ✅ **Advantage:** Compact (≈ 12 instructions), fewer memory accesses. ❌ **Disadvantage:** Not fully ABI-safe (no frame pointer, assumes caller convention). --- In performance-sensitive embedded systems (like BF16 math units or soft cores), **hand-tuned assembly** like this can reduce instruction count and latency by over **40–50%** for small helper routines.