# Assignment 1: RISC-V Assembly and Instruction Pipeline :::warning This assignment utilized AI tools for formatting and translation. During the code development process, AI-assisted debugging was employed, and AI was also used to propose and implement optimization methods. ::: ## Problem B — UF8 Encode / Decode ### 1.1 Description & Design Goal This problem implements a compact floating-like integer representation called **UF8**, which encodes a 32-bit integer into **8 bits** using 4-bit exponent and 4-bit mantissa. The original task (provided by the instructor) required translating a C reference implementation into **pure RISC-V assembly (RV32I)** without compiler support. The main goal was to verify bit-level correctness and ensure that every encoded value can be decoded back consistently while maintaining **monotonic order**. --- ### 1.2 Program Structure & Key Functions The program is divided into several self-contained routines: | Function / Label | Description | |:---------------- |:------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `main` | Entry point. Calls `test()` and prints the result using Ripes syscalls. | | `test` | Iterates over all 256 UF8 values: decodes → re-encodes → checks equality and monotonicity. Stops at the first failure, or prints “All tests passed.” if successful. | | `uf8_decode` | Implements the equation `value = (mantissa << exponent) + ((2^E − 1) << 4)` using only shift and add operations. | | `uf8_encode` | Finds the most significant bit via `clz`, calculates exponent and mantissa, then packs them as \`(E << 4) | | `clz` | Counts leading zeros by bitwise shifting (no M extension). | | `print_*` | Utility routines for printing strings, integers, and newline characters in Ripes. | > The **original version** is a direct translation from the C reference, preserving loops and branches as in the original code. > The **optimized version** refactors the control flow to reduce branching, introduces early exits, and minimizes register spills while preserving the same logic and output. --- ### 1.3 Instruction-Level Analysis (in Ripes) #### sw ra, 28(sp) — Store Word (Routine Prologue) **Instruction:** `sw ra, 28(sp)` **Description:** Store the return address register `ra` to the stack frame at offset 28 from `sp`. **IF Stage:** PC fetches the instruction from instruction memory.![image](https://hackmd.io/_uploads/B11hlnd6gx.png) **ID Stage:** Instruction decoded as S-type. Registers `sp` (base) and `ra` (data) read from the register file. Control signals: `MemWrite = 1`, `RegWrite = 0`, `ALUSrc = 1` (immediate offset), `MemToReg = 0`.![image](https://hackmd.io/_uploads/rJMal2dage.png) **EX Stage:** ALU computes the effective address = `sp + 28`. Multiplexer for ALU-B selects the sign-extended immediate.![image](https://hackmd.io/_uploads/Hy9Rx2_Tgl.png) **MEM Stage:** Data memory write enabled. The 32-bit word in `ra` is stored to address `sp + 28`. Signal `MemWrite = 1` is asserted.![image](https://hackmd.io/_uploads/SyayW2dale.png) **WB Stage:** No write back (`RegWrite = 0`). → Memory contains the saved return address for later `lw ra, 28(sp)` restore.![image](https://hackmd.io/_uploads/BkaeWn_6lg.png) --- #### addi sp, sp, -32 — Adjust Stack Pointer **Instruction:** `addi sp, sp, -32` **Description:** Allocate 32 bytes for local variables and callee-saved registers. **IF Stage:** Fetch from PC.![image](https://hackmd.io/_uploads/SJ-V-hOple.png) **ID Stage:** Decode I-type, read `sp`. `ImmSel = I`.![image](https://hackmd.io/_uploads/S12Vb3dTlg.png) **EX Stage:** ALU adds `sp + (-32)`.![image](https://hackmd.io/_uploads/HyFB-2dagx.png) **WB Stage:** Write result back to `sp` (`RegWrite = 1`, `MemToReg = ALU`).![image](https://hackmd.io/_uploads/rk7_WnOalg.png) In the Ripes visualization, the **register-write-enable** signal is asserted, and the destination register multiplexer selects `sp`. This instruction often feeds the next `sw` with a forwarded value. --- #### lw ra, 28(sp) — Load Word (Epilogue Restore) **Instruction:** `lw ra, 28(sp)` **Description:** Restore the saved return address from the stack into `ra`. **IF Stage:** Fetch instruction.![image](https://hackmd.io/_uploads/HybQXhuTxx.png) **ID Stage:** Decode I-type, read `sp`. Control signals: `MemRead = 1`, `RegWrite = 1`, `MemToReg = 1`.![image](https://hackmd.io/_uploads/Sky4mh_ael.png) **EX Stage:** ALU computes address `sp + 28`.![image](https://hackmd.io/_uploads/ryjVXh_6ge.png) **MEM Stage:** Data memory read performed; word loaded from that address.![image](https://hackmd.io/_uploads/rkVYQhOpxg.png) **WB Stage:** Loaded value written to `ra`.![image](https://hackmd.io/_uploads/rkZu7ndpeg.png) → Ripes shows **MemRead high** in the MEM stage and **RegWrite high** in WB, confirming correct stack restore. --- #### jal ra, uf8_decode — Jump and Link (Function Call) **Instruction:** `jal ra, uf8_decode` **Description:** Jump to `uf8_decode` subroutine and store return address in `ra`. **IF Stage:** Fetch J-type instruction.![image](https://hackmd.io/_uploads/B1lyE2upxe.png) **ID Stage:** Immediate decoded (`ImmSel = J`).![image](https://hackmd.io/_uploads/r16J42_Tee.png) **EX Stage:** ALU computes target = `PC + offset`; `PCSrc = 1`.![image](https://hackmd.io/_uploads/rypeE2dpel.png) **WB Stage:** `PC + 4` written to `ra` (`RegWrite = 1`, `MemToReg = PC+4`).![image](https://hackmd.io/_uploads/By0fV2uTxx.png) → Control flow redirect visible in Ripes: new PC = target, pipeline flush one bubble. --- #### beq a0, zero, .Lfail — Conditional Branch **Instruction:** `beq a0, zero, .Lfail` **Description:** If decoded value equals zero, branch to failure label. **IF Stage:** Fetch branch instruction.![image](https://hackmd.io/_uploads/rkt2V3u6ex.png) **ID Stage:** Read `a0`, `x0`; generate branch offset (`ImmSel = B`).![image](https://hackmd.io/_uploads/ByKaVndale.png) **EX Stage:** ALU subtracts `a0 – x0`; sets Zero flag. Control signals: `Branch = 1`; if Zero flag = 1 → `PCSrc = 1` (branch taken). Ripes shows PC redirect and bubble in the next cycle. --- #### sll t0, t0, t1 — Mantissa Shift (in uf8_decode) **Instruction:** `sll t0, t0, t1` **Description:** Shift the mantissa left by exponent bits (`mantissa << exponent`). **IF Stage:** Fetch instruction.![image](https://hackmd.io/_uploads/Byy_1pdagx.png) **ID Stage:** Decode as R-type; registers `t0` and `t1` read from register file.![image](https://hackmd.io/_uploads/H1gFyaOTxg.png) **EX Stage:** ALU performs logical shift left; control signals: `ALUSrc = 0`, `ALUOp = SLL`.![image](https://hackmd.io/_uploads/Hytc1Tuagl.png) **WB Stage:** Result written to `t0` (`RegWrite = 1`, `MemRead = 0`, `MemWrite = 0`).![image](https://hackmd.io/_uploads/HJxh1adTle.png) → Pure bit-shift arithmetic; no memory access. In Ripes, only the **ALU output** and **RegWrite** lines toggle. --- **Summary View:** Bit-shift instructions (`sll`, `srl`) handle UF8 mantissa/exponent logic branch-lessly, minimizing hazards. Branches (`beq`, `bne`) control test loop flow. Ripes signal panes confirm proper multiplexer selection (ALUSrc = imm vs. rs2), register enable timing, and memory access phases mirroring a standard 5-stage RV32I pipeline. --- ### 1.4 Testing and Validation **Automated test loop.** The program exhaustively validates **all 256 UF8 codes**. For each byte `fl ∈ [0..255]`, it **decodes first** to a 32-bit value and then **re-encodes** to check round-trip correctness and strict monotonicity: * `value ← uf8_decode(fl)` * `fl2 ← uf8_encode(value)` * Assert **round-trip:** `fl2 == fl` * Assert **monotonic:** `value > previous_value` (with `previous_value` carried across iterations) This logic is implemented in the `test` routine’s for-loop (`fl` in `t4`, `value` in `t5`, `fl2` in `t6`, `previous_value` in `s3`). A single **printed-once** flag (`s5`) ensures only the **first** failure is reported before exit.&#x20; **Console output** ``` All tests passed. Program exited with code: 0 ``` **Failure detection & messages.** * **Re-encode mismatch:** if `fl2 != fl`, the program prints `FAIL[re-encode] fl=<fl> val=<value> prev=<prev> fl2=<fl2>` then terminates via `sys_exit` (a7=93). * **Monotonic violation:** if `value <= previous_value`, it prints `FAIL[monotonic] fl=<fl> val=<value> prev=<prev> fl2=<fl2>` then terminates similarly. The message labels (`msg_mismatch`, `msg_mono`, etc.), the printed-once guard, and the early `sys_exit` are all in `test`, with small print helpers for strings/ints/newlines.&#x20; > **Note on original vs optimized.** Both versions share the same test harness and pass criteria; the optimized version only reduces control-flow and stack traffic, keeping the validation behavior unchanged.&#x20; --- ### 1.5 Performance Discussion This section compares the **original implementation** and the **optimized implementation** of the `uf8_encode` function, focusing on their computational complexity, microarchitectural behavior, and mathematical formulation. The derivation part provides a complete, unsimplified explanation of how the encoding formula is obtained from the decoding definition. --- #### Overall Performance Comparison ![image](https://hackmd.io/_uploads/HJFQAWIpeg.png) | Aspect | Original | Optimized | Performance Impact | | ---------------------- | ---------------------------------------------- | ----------------------------------------------- | ----------------------------- | | Exponent determination | Iterative overflow accumulation (`(ov<<1)+16`) | Direct extraction using `CLZ` and MSB detection | Eliminates loops and branches | | Overflow computation | Dynamic accumulation | Closed-form expression `((1<<E)-1)<<4` | Reduced arithmetic overhead | | Mantissa calculation | `(value - overflow) >> E` | Same expression (direct computation) | Simplified logic | | Branch count | High (frequent boundary tests) | Minimal (range clamping only) | Fewer pipeline stalls | | Memory access | Frequent temporary loads/stores | Compact register-level operations | Lower memory pressure | --- #### Mathematical Derivation ##### 1. Decode Definition The UF8 format is an 8-bit pseudo-floating representation `[EEEE MMMM]`, where the upper 4 bits denote the exponent `E` and the lower 4 bits denote the mantissa `M`. The decoding process defines the numeric value as: $$ \text{value} = (M \ll E) + 16 \cdot (2^E - 1). $$ This can be equivalently written as: $$ \text{value} = M \cdot 2^E + 16 \cdot 2^E - 16. $$ Let us define: $$ S \triangleq \text{value} + 16, $$ which yields the following exact expression: $$ S = (M + 16) \cdot 2^E. $$ This equality holds *exactly*, without approximation. --- ##### 2. Range Boundaries Since the mantissa $M \in \{0, 1, 2, \dots, 15\}$, it follows that: $$ M + 16 \in [16, 31]. $$ Therefore: $$ 16 \cdot 2^E \le S \le 31 \cdot 2^E. $$ Taking the base-2 logarithm on both sides gives: $$ \log_2 S \in [\log_2(16 \cdot 2^E), \log_2(31 \cdot 2^E)] = [4 + E, \log_2(31) + E]. $$ Since $31 < 32 = 2^5$, we have $\log_2(31) < 5$, and therefore $\lfloor \log_2(31) \rfloor = 4$. For all integers $k \in [16, 31]$, $\lfloor \log_2 k \rfloor = 4$ holds true. Thus, for all valid `M`: $$ \lfloor \log_2 S \rfloor = \lfloor \log_2((M+16) \cdot 2^E) \rfloor = \lfloor \log_2(M+16) \rfloor + E = 4 + E. $$ Hence, the exponent can be expressed *exactly* as: $$ \boxed{E = \lfloor \log_2(\text{value} + 16) \rfloor - 4.} $$ In implementation, the floor of the base-2 logarithm corresponds to the **most significant bit (MSB)** position: $$ E = \text{msb}(\text{value} + 16) - 4. $$ On RISC-V, this can be computed using the instruction `CLZ` (count leading zeros): $$ E = (31 - \text{clz}(\text{value} + 16)) - 4. $$ The result is then clamped to the valid 4-bit range `[0, 15]`. --- ##### 3. Overflow and Mantissa From the decoding definition: $$ \text{value} = (M \ll E) + 16(2^E - 1). $$ We define the **segment base**, also called *overflow*: $$ \text{overflow}(E) = 16(2^E - 1) = ((1 \ll E) - 1) \ll 4. $$ For a given exponent `E`, the mantissa is then derived as: $$ M = \frac{\text{value} - \text{overflow}(E)}{2^E}. $$ Because `value` always lies within the range $[\text{overflow}(E), \text{overflow}(E) + 16 \cdot 2^E]$, the numerator is guaranteed to be in $[0, 16 \cdot 2^E]$, yielding $M \in [0, 16]$. After clamping to a 4-bit integer, we obtain: $$ M = (\text{value} - \text{overflow}(E)) \gg E, \quad \text{uf8} = (E \ll 4) | (M \& 0xF). $$ --- #### Implementation Notes (RISC-V) * Compute: `a0 ← value + 16` `msb ← 31 - clz(a0)` `E ← msb - 4` (then clamped to 0–15) * Then: `overflow ← ((1<<E) - 1) << 4` `M ← (value - overflow) >> E` `uf8 ← (E << 4) | (M & 0xF)` * Extreme cases (very small or large values) are safely handled through exponent clamping and 4-bit masking. Through a **rigorous bit-level derivation**—without any approximation—the relation $$ E = \text{msb}(\text{value}+16) - 4 $$ is established, together with the closed-form overflow and mantissa computation. By replacing iterative searches with direct bitwise arithmetic, the optimized implementation transforms `uf8_encode` from a **loop-driven algorithm** into a **constant-time computation**. This explains its superior performance in instruction count, branch predictability, and memory efficiency compared to the original version. ## Problem C — BF16 Arithmetic ### 2.1 Description & Design Goal This problem implements **bfloat16 (BF16)** arithmetic and conversions in **pure RV32I** (no M/F/D extensions). The goal is to support a practical subset of BF16 operations—**add, sub, mul, div, sqrt**, plus **f32↔bf16** conversion—handling **NaN/Inf/zero/subnormal** corner cases and producing correct 16-bit bit-patterns suitable for Ripes. The implementation emphasizes **closed-form bit-manipulation**, **branch minimization**, and **deterministic testing**. ### 2.2 Program Structure & Key Functions The project is organized into self-contained routines (labels) that operate on 16-bit BF16 bit patterns passed in integer registers: * `bf16_isnan / bf16_isinf / bf16_iszero` Bitwise predicates on the BF16 layout: sign (1), exponent (8), mantissa (7). These isolate exponent/mantissa fields with shifts and masks to decide special categories without any FP hardware. * `conv_f32_to_bf16 / conv_bf16_to_f32` Round-to-nearest-even (via bias + LSB carry-in) when truncating from f32 to bf16; high-word passthrough for Inf/NaN; and zero-extend back to f32 placement. * `bf16_add / bf16_sub` Software alignment + sign logic: add performs exponent alignment with hidden-1 handling for normals, separate paths for equal/unequal signs (normalize and renormalize), overflow to ±Inf; sub is implemented as add with flipped sign. * `bf16_mul` Field extraction, special-case handling, **8×8 shift-add integer multiply** for (1.mant)* (1.mant), exponent biasing, and normalization with under/overflow handling to ±Inf or flush-to-zero for deep subnormals. * `bf16_div` Special-case matrix (NaN/Inf/0), then **non-restoring long division** style quotient build with shifts, exponent difference + bias adjust, normalization, and packing. * `bf16_sqrt` * **Original**: binary-search on the mantissa with an inner 16-step shift-add square loop; post-normalization by shifting and exponent fix-up. * **Optimization**: a **table-driven lookup** for the mantissa, with **even/odd-exponent tables** to avoid the binary search altogether; exponent is derived by parity-aware halving. This removes inner multiply/loop and slashes branches. * `main` + test harness (`tests_sqrt` set) Iterates deterministic BF16 cases, prints formatted results (“a= … -> … (expect= …) ok/FAIL”), and exits—useful for Ripes runs and regression. ### 2.3 RISC-V Pipeline Demonstration — `bf16_sqrt` Example #### `lhu a0, 0(s0)` — Load Halfword * **Instruction:** `lhu a0, 0(s0)` * **Description:** Load a 16-bit value from memory address in `s0` into register `a0`. > **IF Stage:** > PC fetches instruction from Instruction Memory. > ![image](https://hackmd.io/_uploads/BkYbQOvpxx.png) > **ID Stage:** > Instruction decoded; `s0` read from Register File. > Control signals: `MemRead=1`, `RegWrite=1`, `MemToReg=1`. > ![image](https://hackmd.io/_uploads/SkAQ7ODplg.png) > **EX Stage:** > ALU computes effective address = `s0 + 0`. > ![image](https://hackmd.io/_uploads/rJPBmdPale.png) > **MEM Stage:** > Memory accessed; halfword `0x3F80` read (`MemRead=1`). > ![image](https://hackmd.io/_uploads/S1i8QuDple.png) > **WB Stage:** > Loaded value `0x3F80` written to `a0` (`RegWrite=1`). > ![image](https://hackmd.io/_uploads/SkGuXOP6gx.png) --- #### `jal ra, bf16_sqrt` — Jump and Link * **Description:** Jump to `bf16_sqrt` and save return address to `ra`. > ALU computes target address (PC + offset); > `RegWrite=1`, `PCSrc=1`. > ![image](https://hackmd.io/_uploads/Hy7maoO6el.png) ![image](https://hackmd.io/_uploads/rkq4TiOale.png) --- #### `bne t2, t1, sqrt_check_zero` — Branch Not Equal * **Description:** Branch if `t2 != t1`. > ALU performs `t2 - t1`; if not zero, `PCSrc=1` (branch taken). > ![image](https://hackmd.io/_uploads/HJIA6o_6le.png) --- #### `lbu a5, 0(t0)` — Load Byte * **Description:** Load a single byte from memory into `a5`. > Similar to `lhu`, but reads 8 bits (`MemRead=1`, `MemToReg=1`). > ![image](https://hackmd.io/_uploads/Hkp7CidTlx.png) --- #### `or a0, a4, a5` — Bitwise OR * **Description:** Perform logical OR between `a4` and `a5`, store in `a0`. > Pure ALU op, no memory access (`RegWrite=1`, `MemRead=0`). > ![image](https://hackmd.io/_uploads/HkzcRsd6ee.png) ### 2.4 Performance Discussion (Original vs. Optimization) #### What changed from Original → Optimization ![image](https://hackmd.io/_uploads/rJDhsML6eg.png) | Aspect | Original | Optimization | Impact | | ---------------------- | --------------------------------------------------------- | --------------------------------------------------------- | ----------------------------------------------- | | `sqrt` mantissa method | **Binary search** with inner **16-step** shift-add square | **Table lookup** with **even/odd exponent tables** | Removes inner multiply loop; near-constant time | | Branching in `sqrt` | Many (loop + compare) | Few (index + load + pack) | Fewer pipeline stalls; better predictability | | | `add/mul/div` | Same core algorithms | Same (minor cleanups) | Stable behavior | **Why it’s faster:** The table-driven `sqrt` bypasses both the **binary-search iterations** and the **16-step square loop**. Exponent halving is done by parity detection (even/odd) to select the correct table. This mirrors the Part-B optimization spirit—replace iterative control flow with **direct bitwise arithmetic / lookup** to reduce retired instructions and branches. ### 2.5 Testing and Validation #### Test Design Each function was validated through a series of deterministic test cases that covered: * **Normal cases:** Standard numerical inputs (e.g., 1.0, 2.0, 0.5). * **Boundary and exceptional cases:** Including `0`, `+Inf`, `NaN`, and negative inputs to verify IEEE-754 special condition handling. * **Subnormal and signed values:** To check correct sign-bit propagation and denormalized number behavior. For the `bf16_sqrt` operation, the test dataset was defined as a table in memory (`tests_sqrt`), where each record consists of: ``` .half <input_value>, 0x0000, <expected_output> ``` During execution, the program iterates through the table, calls the `bf16_sqrt` function, compares the returned value against the expected one, and prints the result via `ecall` system calls. #### Execution Output The actual console output obtained from the simulation is shown below: ``` a=0x3f80 -> 0x3f80 (expect=0x3f80) ok a=0x4080 -> 0x4000 (expect=0x4000) ok a=0x4010 -> 0x3fc0 (expect=0x3fc0) ok a=0x0000 -> 0x0000 (expect=0x0000) ok a=0x7f80 -> 0x7f80 (expect=0x7f80) ok a=0x7fc1 -> 0x7fc0 (expect=0x7fc0) ok a=0xbf80 -> 0x7fc0 (expect=0x7fc0) ok a=0x0001 -> 0x0000 (expect=0x0000) ok a=0x3f00 -> 0x3f35 (expect=0x3f35) ok a=0x3fc0 -> 0x3f9d (expect=0x3f9d) ok Program exited with code: 0 ``` All test cases returned “ok”, indicating the computed results matched the expected outputs precisely. The program exited normally with return code `0`, confirming stable termination without runtime exceptions. #### Result Analysis The test results validate that: * Normal values produce accurate square-root results. * Special cases such as zero, infinity, and NaN are handled in accordance with IEEE-754 BF16 conventions. * Negative input correctly yields a canonical NaN (`0x7FC0`), verifying the exception logic path. * Subnormal inputs (e.g., `0x0001`) correctly flush to zero. These results confirm the functional correctness and numerical stability of the `bf16_sqrt` implementation. ## Leetcode Performance ### 3.1 Leetcode Problem (367. Valid Perfect Square) ``` Given a positive integer num, return true if num is a perfect square or false otherwise. A perfect square is an integer that is the square of an integer. In other words, it is the product of some integer with itself. You must not use any built-in library function, such as sqrt. Example 1: Input: num = 16 Output: true Explanation: We return true because 4 * 4 = 16 and 4 is an integer. Example 2: Input: num = 14 Output: false Explanation: We return false because 3.742 * 3.742 = 14 and 3.742 is not an integer. Constraints: 1 <= num <= 231 - 1 ``` **Given** a positive integer `num`, **return** `true` if it is a perfect square, otherwise `false`. Constraints: `1 ≤ num ≤ 2^31 – 1`. No built-in `sqrt`. ### 3.2 Key idea (overview) 1. Handle trivial inputs `x ∈ {0,1}`. 2. Convert `x` from unsigned integer → **float32 bit pattern** → **bfloat16**. 3. Compute an **approximate √x in bfloat16** via a 128-entry LUT (two 128-byte tables for even/odd exponents), then convert back to float32 bits. 4. Truncate to get an **integer candidate** `y = ⌊√x⌋`. 5. Because the bf16 approximation can be off slightly around binade boundaries, **verify nearby integers** by exact equality checks using a fast multiply-equals-x routine. * We check `y^2`, `(y±1)^2`, … `(y±4)^2` (constant number of checks, early-out on first hit). 6. Return `true` if any matches `x`, else `false`. All of the above is implemented in assembly: `isPerfectSquare`, bf16 conversions, LUT-based `bf16_sqrt`, and the constant-time verifier. ### 3.3 What each routine does * **`u32_to_f32bits`** – builds an IEEE-754 single-precision bit pattern from a positive integer: finds MSB, constructs exponent = `msb+127`, and aligns the 23-bit fraction. * **`conv_f32_to_bf16` / `conv_bf16_to_f32`** – round-to-nearest-even into bf16 (upper 16 bits with proper bias rounding), and back (shift left 16). Handles Inf/NaN pass-through. * **`bf16_sqrt`** – table-driven: decides **even/odd exponent path**, indexes a table (`sqrt_tbl_even_q17` / `sqrt_tbl_odd_q17`, each 128 bytes) and repacks sign/exp/mantissa in bf16. Returns +0/+Inf/qNaN for special cases. * **`f32bits_trunc_to_u32`** – interprets the f32 bit pattern, extracts unbiased exponent and leading 1, and truncates to unsigned integer by shifting. Subnormals/negatives → 0. * **`mul_eq_u32_x_fast`** – verifies `(a*b)==x` without `mul`: shift-and-add over the multiplier bits with **early overflow pruning** (`sum > x - a` or `a` would exceed `x` on next shift). This keeps the loop short and branch-predictable. * **Harness (`main`)** – iterates over **12 test cases**, prints `x` and the boolean answer, using `ecall` 1/4/11/10 (print_int/print_string/print_char/exit). #### Correctness notes & edge cases * **`x = 0 or 1`**: handled explicitly and return `true`. Matches mathematical definition. * **Special fp forms** (Inf/NaN/subnormal/negative) are sanitized inside converters and `bf16_sqrt`; final path for positive integers always follows the normal-number route. * **Approximation safety**: bf16 has only 7 mantissa bits. The LUT is Q1.7 and we also have exponent-parity handling. Around binade edges, truncating the reconstructed f32 can undershoot by a few units. The code therefore checks `y, y±1, …, y±4`. In practice this makes the method robust over all 32-bit inputs we care about while keeping checks constant and tiny. * **Overflow safety** in verifier: `mul_eq_u32_x_fast` never forms the full product; it accumulates `sum` with **range checks against `x`** every step, so we avoid 64-bit temporaries or UB. ### 3.3 Testing and Validation ``` x = 0, ans = true
 x = 1, ans = true
 x = 2, ans = false
 x = 3, ans = false
 x = 4, ans = true
 x = 8, ans = false
 x = 9, ans = true
 x = 15, ans = false
 x = 16, ans = true
 x = 2147395600, ans = true
 x = 2147483647, ans = false
 x = -1, ans = false

 Program exited with code: 0 ``` * **Zero dynamic memory**; only two small LUTs: **128 B + 128 B = 256 B** total, plus code. * **No binary search loop** over `[1..46340]`. Instead: * One bf16 LUT lookup (O(1)) * A constant number of integer checks (≤ 9 calls to `mul_eq_u32_x_fast`) * Each `mul_eq_u32_x_fast` runs in ≤ 32 iterations with multiple **early-out branches** on overflow or when `a` already exceeds `x` next round. ## References * [工具介紹](https://ithelp.ithome.com.tw/m/articles/10351706) * [LeetCode 338. Counting Bits](https://leetcode.com/problems/counting-bits/description/) * [LeetCode 50. Pow(x, n)](https://leetcode.com/problems/powx-n/description/) * [LeetCode 367. Valid Perfect Square](https://leetcode.com/problems/valid-perfect-square/description/) * [Quiz1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B)