# Assignment 1: RISC-V Assembly and Instruction Pipeline
:::warning
This assignment utilized AI tools for formatting and translation. During the code development process, AI-assisted debugging was employed, and AI was also used to propose and implement optimization methods.
:::
## Problem B — UF8 Encode / Decode
### 1.1 Description & Design Goal
This problem implements a compact float-like representation called **UF8**, which encodes a 32-bit integer into **8 bits** using a 4-bit exponent and a 4-bit mantissa.
The original task (provided by the instructor) required translating a C reference implementation into **pure RISC-V assembly (RV32I)** without compiler support.
The main goal was to verify bit-level correctness and ensure that every encoded value can be decoded back consistently while maintaining **monotonic order**.
---
### 1.2 Program Structure & Key Functions
The program is divided into several self-contained routines:
| Function / Label | Description |
|:---------------- |:------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `main` | Entry point. Calls `test()` and prints the result using Ripes syscalls. |
| `test` | Iterates over all 256 UF8 values: decodes → re-encodes → checks equality and monotonicity. Stops at the first failure, or prints “All tests passed.” if successful. |
| `uf8_decode` | Implements `value = (M << E) + (((1 << E) − 1) << 4)` using only shift and add operations. |
| `uf8_encode` | Finds the most significant bit via `clz`, computes the exponent and mantissa, then packs them as `(E << 4) \| (M & 0xF)`. |
| `clz` | Counts leading zeros with a shift loop (pure RV32I; the base ISA has no `clz` instruction). |
| `print_*` | Utility routines for printing strings, integers, and newline characters in Ripes. |
> The **original version** is a direct translation from the C reference, preserving loops and branches as in the original code.
> The **optimized version** refactors the control flow to reduce branching, introduces early exits, and minimizes register spills while preserving the same logic and output.
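For reference, the decode relation can be written as a few lines of C. This is a minimal sketch consistent with the table above (the assembly operates on registers and avoids any multiply):

```c
#include <stdint.h>

/* Decode an 8-bit UF8 code [EEEE MMMM] into its 32-bit integer value,
   using only shifts and adds: value = (M << E) + ((2^E - 1) << 4). */
uint32_t uf8_decode(uint8_t fl) {
    uint32_t e = fl >> 4;                      /* 4-bit exponent  */
    uint32_t m = fl & 0x0F;                    /* 4-bit mantissa  */
    uint32_t overflow = ((1u << e) - 1) << 4;  /* segment base    */
    return (m << e) + overflow;
}
```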
---
### 1.3 Instruction-Level Analysis (in Ripes)
#### sw ra, 28(sp) — Store Word (Routine Prologue)
**Instruction:** `sw ra, 28(sp)`
**Description:** Store the return address register `ra` to the stack frame at offset 28 from `sp`.
**IF Stage:**
PC fetches the instruction from instruction memory.
**ID Stage:**
Instruction decoded as S-type. Registers `sp` (base) and `ra` (data) read from the register file.
Control signals: `MemWrite = 1`, `RegWrite = 0`, `ALUSrc = 1` (immediate offset), `MemToReg = 0`.
**EX Stage:**
ALU computes the effective address = `sp + 28`.
Multiplexer for ALU-B selects the sign-extended immediate.
**MEM Stage:**
Data memory write enabled. The 32-bit word in `ra` is stored to address `sp + 28`.
Signal `MemWrite = 1` is asserted.
**WB Stage:**
No write back (`RegWrite = 0`).
→ Memory contains the saved return address for later `lw ra, 28(sp)` restore.
---
#### addi sp, sp, -32 — Adjust Stack Pointer
**Instruction:** `addi sp, sp, -32`
**Description:** Allocate 32 bytes for local variables and callee-saved registers.
**IF Stage:** Fetch from PC.
**ID Stage:** Decode I-type, read `sp`. `ImmSel = I`.
**EX Stage:** ALU computes `sp + (-32)` using the sign-extended immediate.
**MEM Stage:** No memory access; the ALU result passes through.
**WB Stage:** Result written back to `sp` (`RegWrite = 1`, `MemToReg = ALU`).
In the Ripes visualization, the **register-write-enable** signal is asserted, and the destination register multiplexer selects `sp`. This instruction often feeds the next `sw` with a forwarded value.
---
#### lw ra, 28(sp) — Load Word (Epilogue Restore)
**Instruction:** `lw ra, 28(sp)`
**Description:** Restore the saved return address from the stack into `ra`.
**IF Stage:** Fetch instruction.
**ID Stage:** Decode I-type, read `sp`. Control signals: `MemRead = 1`, `RegWrite = 1`, `MemToReg = 1`.
**EX Stage:** ALU computes address `sp + 28`.
**MEM Stage:** Data memory read performed; word loaded from that address.
**WB Stage:** Loaded value written to `ra`.
→ Ripes shows **MemRead high** in the MEM stage and **RegWrite high** in WB, confirming correct stack restore.
---
#### jal ra, uf8_decode — Jump and Link (Function Call)
**Instruction:** `jal ra, uf8_decode`
**Description:** Jump to `uf8_decode` subroutine and store return address in `ra`.
**IF Stage:** Fetch J-type instruction.
**ID Stage:** Immediate decoded (`ImmSel = J`).
**EX Stage:** ALU computes the target = `PC + offset`; `PCSrc = 1`.
**MEM Stage:** No memory access.
**WB Stage:** `PC + 4` written to `ra` (`RegWrite = 1`, `MemToReg = PC+4`).
→ Control-flow redirect visible in Ripes: the new PC becomes the jump target and one bubble is inserted into the pipeline.
---
#### beq a0, zero, .Lfail — Conditional Branch
**Instruction:** `beq a0, zero, .Lfail`
**Description:** If decoded value equals zero, branch to failure label.
**IF Stage:** Fetch branch instruction.
**ID Stage:** Read `a0`, `x0`; generate branch offset (`ImmSel = B`).
**EX Stage:** ALU computes `a0 - x0`; the Zero flag is set if the result is zero.
Control signals: `Branch = 1`; if Zero flag = 1 → `PCSrc = 1` (branch taken). Ripes shows PC redirect and bubble in the next cycle.
---
#### sll t0, t0, t1 — Mantissa Shift (in uf8_decode)
**Instruction:** `sll t0, t0, t1`
**Description:** Shift the mantissa left by exponent bits (`mantissa << exponent`).
**IF Stage:** Fetch instruction.
**ID Stage:** Decode as R-type; registers `t0` and `t1` read from register file.
**EX Stage:** ALU performs logical shift left; control signals: `ALUSrc = 0`, `ALUOp = SLL`.
**WB Stage:** Result written to `t0` (`RegWrite = 1`, `MemRead = 0`, `MemWrite = 0`).
→ Pure bit-shift arithmetic; no memory access. In Ripes, only the **ALU output** and **RegWrite** lines toggle.
---
**Summary View:**
Bit-shift instructions (`sll`, `srl`) handle the UF8 mantissa/exponent logic without branches, minimizing hazards.
Branches (`beq`, `bne`) control test loop flow.
Ripes signal panes confirm proper multiplexer selection (ALUSrc = imm vs. rs2), register enable timing, and memory access phases mirroring a standard 5-stage RV32I pipeline.
---
### 1.4 Testing and Validation
**Automated test loop.**
The program exhaustively validates **all 256 UF8 codes**. For each byte `fl ∈ [0..255]`, it **decodes first** to a 32-bit value and then **re-encodes** to check round-trip correctness and strict monotonicity:
* `value ← uf8_decode(fl)`
* `fl2 ← uf8_encode(value)`
* Assert **round-trip:** `fl2 == fl`
* Assert **monotonic:** `value > previous_value` (with `previous_value` carried across iterations)
This logic is implemented in the `test` routine’s for-loop (`fl` in `t4`, `value` in `t5`, `fl2` in `t6`, `previous_value` in `s3`). A single **printed-once** flag (`s5`) ensures only the **first** failure is reported before exit. 
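In C terms, the loop body is roughly the following sketch (register assignments are as noted above; the first-iteration handling of `previous_value` is illustrative):

```c
#include <stdint.h>
#include <stdio.h>

uint32_t uf8_decode(uint8_t fl);    /* sketched in 1.2 */
uint8_t  uf8_encode(uint32_t v);    /* sketched in 1.5 */

int test_uf8(void) {
    uint32_t previous_value = 0;
    for (uint32_t fl = 0; fl < 256; fl++) {
        uint32_t value = uf8_decode((uint8_t)fl);
        uint32_t fl2   = uf8_encode(value);
        if (fl2 != fl) {                           /* round-trip check */
            printf("FAIL[re-encode] fl=%u val=%u prev=%u fl2=%u\n",
                   fl, value, previous_value, fl2);
            return 1;
        }
        if (fl != 0 && value <= previous_value) {  /* monotonicity check */
            printf("FAIL[monotonic] fl=%u val=%u prev=%u fl2=%u\n",
                   fl, value, previous_value, fl2);
            return 1;
        }
        previous_value = value;
    }
    printf("All tests passed.\n");
    return 0;
}
```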
**Console output**
```
All tests passed.
Program exited with code: 0
```
**Failure detection & messages.**
* **Re-encode mismatch:** if `fl2 != fl`, the program prints
`FAIL[re-encode] fl=<fl> val=<value> prev=<prev> fl2=<fl2>`
then terminates via `sys_exit` (a7=93).
* **Monotonic violation:** if `value <= previous_value`, it prints
`FAIL[monotonic] fl=<fl> val=<value> prev=<prev> fl2=<fl2>`
then terminates similarly.
The message labels (`msg_mismatch`, `msg_mono`, etc.), the printed-once guard, and the early `sys_exit` are all in `test`, with small print helpers for strings/ints/newlines. 
> **Note on original vs optimized.** Both versions share the same test harness and pass criteria; the optimized version only reduces control-flow and stack traffic, keeping the validation behavior unchanged. 
---
### 1.5 Performance Discussion
This section compares the **original implementation** and the **optimized implementation** of the `uf8_encode` function, focusing on their computational complexity, microarchitectural behavior, and mathematical formulation. The derivation part provides a complete, unsimplified explanation of how the encoding formula is obtained from the decoding definition.
---
#### Overall Performance Comparison

| Aspect | Original | Optimized | Performance Impact |
| ---------------------- | ---------------------------------------------- | ----------------------------------------------- | ----------------------------- |
| Exponent determination | Iterative overflow accumulation (`(ov<<1)+16`) | Direct extraction using `CLZ` and MSB detection | Eliminates loops and branches |
| Overflow computation | Dynamic accumulation | Closed-form expression `((1<<E)-1)<<4` | Reduced arithmetic overhead |
| Mantissa calculation | `(value - overflow) >> E` | Same expression (direct computation) | Simplified logic |
| Branch count | High (frequent boundary tests) | Minimal (range clamping only) | Fewer pipeline stalls |
| Memory access | Frequent temporary loads/stores | Compact register-level operations | Lower memory pressure |
---
#### Mathematical Derivation
##### 1. Decode Definition
The UF8 format is an 8-bit pseudo-floating representation `[EEEE MMMM]`, where the upper 4 bits denote the exponent `E` and the lower 4 bits denote the mantissa `M`.
The decoding process defines the numeric value as:
$$
\text{value} = (M \ll E) + 16 \cdot (2^E - 1).
$$
This can be equivalently written as:
$$
\text{value} = M \cdot 2^E + 16 \cdot 2^E - 16.
$$
Let us define:
$$
S \triangleq \text{value} + 16,
$$
which yields the following exact expression:
$$
S = (M + 16) \cdot 2^E.
$$
This equality holds *exactly*, without approximation.
---
##### 2. Range Boundaries
Since the mantissa $M \in \{0, 1, 2, \dots, 15\}$, it follows that:
$$
M + 16 \in [16, 31].
$$
Therefore:
$$
16 \cdot 2^E \le S \le 31 \cdot 2^E.
$$
Taking the base-2 logarithm on both sides gives:
$$
\log_2 S \in [\log_2(16 \cdot 2^E), \log_2(31 \cdot 2^E)] = [4 + E, \log_2(31) + E].
$$
Since $31 < 32 = 2^5$, we have $\log_2(31) < 5$, and therefore $\lfloor \log_2(31) \rfloor = 4$.
For all integers $k \in [16, 31]$, $\lfloor \log_2 k \rfloor = 4$ holds true. Thus, for all valid `M`:
$$
\lfloor \log_2 S \rfloor = \lfloor \log_2((M+16) \cdot 2^E) \rfloor = \lfloor \log_2(M+16) \rfloor + E = 4 + E.
$$
Hence, the exponent can be expressed *exactly* as:
$$
\boxed{E = \lfloor \log_2(\text{value} + 16) \rfloor - 4.}
$$
In implementation, the floor of the base-2 logarithm corresponds to the **most significant bit (MSB)** position:
$$
E = \text{msb}(\text{value} + 16) - 4.
$$
On RISC-V, this is computed with a count-leading-zeros routine (`clz`, implemented in software since base RV32I provides no such instruction):
$$
E = (31 - \text{clz}(\text{value} + 16)) - 4.
$$
The result is then clamped to the valid 4-bit range `[0, 15]`.
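A C sketch of the software `clz` and the resulting closed-form exponent follows (the helper name `uf8_exponent` is illustrative; the assembly inlines this logic):

```c
#include <stdint.h>

/* Count leading zeros with a shift loop; base RV32I has no clz instruction. */
static uint32_t clz32(uint32_t x) {
    if (x == 0) return 32;
    uint32_t n = 0;
    while ((x & 0x80000000u) == 0) {  /* shift until the MSB reaches bit 31 */
        x <<= 1;
        n++;
    }
    return n;
}

/* E = msb(value + 16) - 4, clamped to the 4-bit range [0, 15]. */
static uint32_t uf8_exponent(uint32_t value) {
    uint32_t msb = 31 - clz32(value + 16);    /* floor(log2(value + 16)) */
    uint32_t e = msb - 4;                     /* msb >= 4, so e >= 0     */
    return e > 15 ? 15 : e;
}
```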
---
##### 3. Overflow and Mantissa
From the decoding definition:
$$
\text{value} = (M \ll E) + 16(2^E - 1).
$$
We define the **segment base**, also called *overflow*:
$$
\text{overflow}(E) = 16(2^E - 1) = ((1 \ll E) - 1) \ll 4.
$$
For a given exponent `E`, the mantissa is then derived as:
$$
M = \frac{\text{value} - \text{overflow}(E)}{2^E}.
$$
Because `value` always lies in the half-open range $[\text{overflow}(E), \text{overflow}(E) + 16 \cdot 2^E)$,
the numerator is guaranteed to be in $[0, 16 \cdot 2^E)$, yielding $M \in [0, 15]$.
After masking to the 4-bit field (a no-op for in-range values, and a guard when the exponent has been clamped), we obtain:
$$
M = (\text{value} - \text{overflow}(E)) \gg E,
\quad
\text{uf8} = (E \ll 4) | (M \& 0xF).
$$
---
#### Implementation Notes (RISC-V)
* Compute:
`a0 ← value + 16`
`msb ← 31 - clz(a0)`
`E ← msb - 4` (then clamped to 0–15)
* Then:
`overflow ← ((1<<E) - 1) << 4`
`M ← (value - overflow) >> E`
`uf8 ← (E << 4) | (M & 0xF)`
* Extreme cases (very small or large values) are safely handled through exponent clamping and 4-bit masking.
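Putting these steps together, a closed-form `uf8_encode` looks as follows in C (a sketch using the `clz32` helper above; the `& 0xF` mask matches the packing formula from the derivation):

```c
#include <stdint.h>

uint8_t uf8_encode(uint32_t value) {
    uint32_t msb = 31 - clz32(value + 16);     /* floor(log2(value + 16)) */
    uint32_t e = msb - 4;
    if (e > 15) e = 15;                        /* clamp exponent to 4 bits */
    uint32_t overflow = ((1u << e) - 1) << 4;  /* segment base             */
    uint32_t m = (value - overflow) >> e;      /* mantissa                 */
    return (uint8_t)((e << 4) | (m & 0xF));    /* pack [EEEE MMMM]         */
}
```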
Through a **rigorous bit-level derivation**—without any approximation—the relation
$$
E = \text{msb}(\text{value}+16) - 4
$$
is established, together with the closed-form overflow and mantissa computation.
By replacing iterative searches with direct bitwise arithmetic, the optimized implementation transforms `uf8_encode` from a **loop-driven algorithm** into a **constant-time computation**.
This explains its superior performance in instruction count, branch predictability, and memory efficiency compared to the original version.
## Problem C — BF16 Arithmetic
### 2.1 Description & Design Goal
This problem implements **bfloat16 (BF16)** arithmetic and conversions in **pure RV32I** (no M/F/D extensions). The goal is to support a practical subset of BF16 operations—**add, sub, mul, div, sqrt**, plus **f32↔bf16** conversion—handling **NaN/Inf/zero/subnormal** corner cases and producing correct 16-bit bit-patterns suitable for Ripes. The implementation emphasizes **closed-form bit-manipulation**, **branch minimization**, and **deterministic testing**.
### 2.2 Program Structure & Key Functions
The project is organized into self-contained routines (labels) that operate on 16-bit BF16 bit patterns passed in integer registers:
* `bf16_isnan / bf16_isinf / bf16_iszero`
Bitwise predicates on the BF16 layout: sign (1), exponent (8), mantissa (7). These isolate exponent/mantissa fields with shifts and masks to decide special categories without any FP hardware.
* `conv_f32_to_bf16 / conv_bf16_to_f32`
Round-to-nearest-even (via bias + LSB carry-in) when truncating from f32 to bf16; high-word passthrough for Inf/NaN; and zero-extension back into the f32 high half for the reverse conversion (a bit-level C sketch follows this list).
* `bf16_add / bf16_sub`
Software alignment + sign logic: add performs exponent alignment with hidden-1 handling for normals, separate paths for equal/unequal signs (normalize and renormalize), overflow to ±Inf; sub is implemented as add with flipped sign.
* `bf16_mul`
Field extraction, special-case handling, an **8×8 shift-add integer multiply** for `(1.mant) × (1.mant)`, exponent biasing, and normalization with under/overflow handling to ±Inf or flush-to-zero for deep subnormals.
* `bf16_div`
Special-case matrix (NaN/Inf/0), then **non-restoring long division** style quotient build with shifts, exponent difference + bias adjust, normalization, and packing.
* `bf16_sqrt`
* **Original**: binary-search on the mantissa with an inner 16-step shift-add square loop; post-normalization by shifting and exponent fix-up.
* **Optimization**: a **table-driven lookup** for the mantissa, with **even/odd-exponent tables** that avoid the binary search altogether; the exponent is derived by parity-aware halving. This removes the inner multiply loop and sharply reduces branching.
* `main` + test harness (`tests_sqrt` set)
Iterates deterministic BF16 cases, prints formatted results (“a= … -> … (expect= …) ok/FAIL”), and exits—useful for Ripes runs and regression.
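For the bit-level pieces above, the following plain-C sketch models the BF16 field predicates and the f32↔bf16 conversions (BF16 values are carried as raw `uint16_t` bit patterns; names mirror the labels listed above):

```c
#include <stdbool.h>
#include <stdint.h>

/* BF16 layout: sign (1) | exponent (8) | mantissa (7). */
static inline uint16_t bf16_exp(uint16_t h) { return (h >> 7) & 0xFF; }
static inline uint16_t bf16_man(uint16_t h) { return h & 0x7F; }

static inline bool bf16_isnan(uint16_t h)  { return bf16_exp(h) == 0xFF && bf16_man(h) != 0; }
static inline bool bf16_isinf(uint16_t h)  { return bf16_exp(h) == 0xFF && bf16_man(h) == 0; }
static inline bool bf16_iszero(uint16_t h) { return (h & 0x7FFF) == 0; }

/* f32 bits -> bf16 bits: pass Inf/NaN through; otherwise add the usual
   0x7FFF + LSB bias so that truncation rounds to nearest, ties to even. */
static inline uint16_t conv_f32_to_bf16(uint32_t f) {
    if (((f >> 23) & 0xFF) == 0xFF)
        return (uint16_t)(f >> 16);            /* Inf/NaN: high-word passthrough */
    uint32_t lsb = (f >> 16) & 1;
    return (uint16_t)((f + 0x7FFF + lsb) >> 16);
}

/* bf16 bits -> f32 bits: place the 16-bit pattern into the high half. */
static inline uint32_t conv_bf16_to_f32(uint16_t h) {
    return (uint32_t)h << 16;
}
```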
### 2.3 RISC-V Pipeline Demonstration — `bf16_sqrt` Example
#### `lhu a0, 0(s0)` — Load Halfword
* **Instruction:** `lhu a0, 0(s0)`
* **Description:** Load a 16-bit value from memory address in `s0` into register `a0`.
> **IF Stage:**
> PC fetches instruction from Instruction Memory.
> 
> **ID Stage:**
> Instruction decoded; `s0` read from Register File.
> Control signals: `MemRead=1`, `RegWrite=1`, `MemToReg=1`.
> 
> **EX Stage:**
> ALU computes effective address = `s0 + 0`.
> 
> **MEM Stage:**
> Memory accessed; halfword `0x3F80` read (`MemRead=1`).
> 
> **WB Stage:**
> Loaded value `0x3F80` written to `a0` (`RegWrite=1`).
> 
---
#### `jal ra, bf16_sqrt` — Jump and Link
* **Description:** Jump to `bf16_sqrt` and save return address to `ra`.
> ALU computes target address (PC + offset);
> `RegWrite=1`, `PCSrc=1`.
> 

---
#### `bne t2, t1, sqrt_check_zero` — Branch Not Equal
* **Description:** Branch if `t2 != t1`.
> ALU performs `t2 - t1`; if not zero, `PCSrc=1` (branch taken).
> 
---
#### `lbu a5, 0(t0)` — Load Byte
* **Description:** Load a single byte from memory into `a5`.
> Similar to `lhu`, but reads 8 bits (`MemRead=1`, `MemToReg=1`).
> 
---
#### `or a0, a4, a5` — Bitwise OR
* **Description:** Perform logical OR between `a4` and `a5`, store in `a0`.
> Pure ALU op, no memory access (`RegWrite=1`, `MemRead=0`).
> 
### 2.4 Performance Discussion (Original vs. Optimization)
#### What changed from Original → Optimization

| Aspect | Original | Optimization | Impact |
| ---------------------- | --------------------------------------------------------- | --------------------------------------------------------- | ----------------------------------------------- |
| `sqrt` mantissa method | **Binary search** with inner **16-step** shift-add square | **Table lookup** with **even/odd exponent tables** | Removes inner multiply loop; near-constant time |
| Branching in `sqrt` | Many (loop + compare) | Few (index + load + pack) | Fewer pipeline stalls; better predictability |
| `add/mul/div` | Same core algorithms | Same (minor cleanups) | Stable behavior |
**Why it’s faster:** The table-driven `sqrt` bypasses both the **binary-search iterations** and the **16-step square loop**. Exponent halving is done by parity detection (even/odd) to select the correct table. This mirrors the optimization approach of Problem B: replace iterative control flow with **direct bitwise arithmetic / lookup** to reduce retired instructions and branches.
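Concretely, writing a normal BF16 input as $a = 1.f \times 2^{e}$ (unbiased exponent $e = E - 127$), the square root splits by the parity of $e$, which is exactly why two mantissa tables suffice:
$$
\sqrt{1.f \times 2^{e}} =
\begin{cases}
\sqrt{1.f} \cdot 2^{e/2}, & e \text{ even}, \\
\sqrt{2 \cdot 1.f} \cdot 2^{(e-1)/2}, & e \text{ odd}.
\end{cases}
$$
The even-exponent table stores $\sqrt{1.f}$ and the odd-exponent table stores $\sqrt{2 \cdot 1.f}$, both indexed by the 7-bit mantissa, and the result exponent is obtained by a parity-aware halving of $e$ followed by re-biasing.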
### 2.5 Testing and Validation
#### Test Design
Each function was validated through a series of deterministic test cases that covered:
* **Normal cases:** Standard numerical inputs (e.g., 1.0, 2.0, 0.5).
* **Boundary and exceptional cases:** Including `0`, `+Inf`, `NaN`, and negative inputs to verify IEEE-754 special condition handling.
* **Subnormal and signed values:** To check correct sign-bit propagation and denormalized number behavior.
For the `bf16_sqrt` operation, the test dataset was defined as a table in memory (`tests_sqrt`), where each record consists of:
```
.half <input_value>, 0x0000, <expected_output>
```
During execution, the program iterates through the table, calls the `bf16_sqrt` function, compares the returned value against the expected one, and prints the result via `ecall` system calls.
#### Execution Output
The actual console output obtained from the simulation is shown below:
```
a=0x3f80 -> 0x3f80 (expect=0x3f80) ok
a=0x4080 -> 0x4000 (expect=0x4000) ok
a=0x4010 -> 0x3fc0 (expect=0x3fc0) ok
a=0x0000 -> 0x0000 (expect=0x0000) ok
a=0x7f80 -> 0x7f80 (expect=0x7f80) ok
a=0x7fc1 -> 0x7fc0 (expect=0x7fc0) ok
a=0xbf80 -> 0x7fc0 (expect=0x7fc0) ok
a=0x0001 -> 0x0000 (expect=0x0000) ok
a=0x3f00 -> 0x3f35 (expect=0x3f35) ok
a=0x3fc0 -> 0x3f9d (expect=0x3f9d) ok
Program exited with code: 0
```
All test cases returned “ok”, indicating the computed results matched the expected outputs precisely. The program exited normally with return code `0`, confirming stable termination without runtime exceptions.
#### Result Analysis
The test results validate that:
* Normal values produce accurate square-root results.
* Special cases such as zero, infinity, and NaN are handled in accordance with IEEE-754 BF16 conventions.
* Negative input correctly yields a canonical NaN (`0x7FC0`), verifying the exception logic path.
* Subnormal inputs (e.g., `0x0001`) correctly flush to zero.
These results confirm the functional correctness and numerical stability of the `bf16_sqrt` implementation.
## LeetCode Performance
### 3.1 LeetCode Problem (367. Valid Perfect Square)
```
Given a positive integer num, return true if num is a perfect square or false otherwise.
A perfect square is an integer that is the square of an integer. In other words, it is the product of some integer with itself.
You must not use any built-in library function, such as sqrt.
Example 1:
Input: num = 16
Output: true
Explanation: We return true because 4 * 4 = 16 and 4 is an integer.
Example 2:
Input: num = 14
Output: false
Explanation: We return false because 3.742 * 3.742 = 14 and 3.742 is not an integer.
Constraints:
1 <= num <= 2^31 - 1
```
**Given** a positive integer `num`, **return** `true` if it is a perfect square, otherwise `false`.
Constraints: `1 ≤ num ≤ 2^31 – 1`. No built-in `sqrt`.
### 3.2 Key idea (overview)
1. Handle trivial inputs `x ∈ {0,1}`.
2. Convert `x` from unsigned integer → **float32 bit pattern** → **bfloat16**.
3. Compute an **approximate √x in bfloat16** via a 128-entry LUT (two 128-byte tables for even/odd exponents), then convert back to float32 bits.
4. Truncate to get an **integer candidate** `y = ⌊√x⌋`.
5. Because the bf16 approximation can be off slightly around binade boundaries, **verify nearby integers** by exact equality checks using a fast multiply-equals-x routine.
* We check `y^2`, `(y±1)^2`, … `(y±4)^2` (constant number of checks, early-out on first hit).
6. Return `true` if any matches `x`, else `false`.
All of the above is implemented in assembly: `isPerfectSquare`, bf16 conversions, LUT-based `bf16_sqrt`, and the constant-time verifier.
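A compact C model of this flow is sketched below. Here `sqrtf` stands in for the LUT-based `bf16_sqrt` pipeline, and the 64-bit product stands in for the overflow-safe shift-add check (`mul_eq_u32_x_fast`) described in 3.3; both substitutions are for brevity only:

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

bool isPerfectSquare(uint32_t x) {
    if (x <= 1) return true;                 /* trivial inputs 0 and 1 */
    uint32_t y = (uint32_t)sqrtf((float)x);  /* approximate floor(sqrt(x)) */
    /* The bf16 estimate keeps only 7 mantissa bits, so verify a small
       neighborhood of integer candidates exactly. */
    for (int d = -4; d <= 4; d++) {
        int64_t c = (int64_t)y + d;
        if (c < 1) continue;
        if ((uint64_t)(c * c) == x) return true;
    }
    return false;
}
```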
### 3.3 What each routine does
* **`u32_to_f32bits`** – builds an IEEE-754 single-precision bit pattern from a positive integer: finds MSB, constructs exponent = `msb+127`, and aligns the 23-bit fraction.
* **`conv_f32_to_bf16` / `conv_bf16_to_f32`** – round-to-nearest-even into bf16 (upper 16 bits with proper bias rounding), and back (shift left 16). Handles Inf/NaN pass-through.
* **`bf16_sqrt`** – table-driven: decides **even/odd exponent path**, indexes a table (`sqrt_tbl_even_q17` / `sqrt_tbl_odd_q17`, each 128 bytes) and repacks sign/exp/mantissa in bf16. Returns +0/+Inf/qNaN for special cases.
* **`f32bits_trunc_to_u32`** – interprets the f32 bit pattern, extracts unbiased exponent and leading 1, and truncates to unsigned integer by shifting. Subnormals/negatives → 0.
* **`mul_eq_u32_x_fast`** – verifies `(a*b)==x` without `mul`: shift-and-add over the multiplier bits with **early overflow pruning** (bail out when `sum > x - a`, or when `a`, once shifted, would already exceed `x`). This keeps the loop short and branch-predictable (see the C sketch after this list).
* **Harness (`main`)** – iterates over **12 test cases**, prints `x` and the boolean answer, using `ecall` 1/4/11/10 (print_int/print_string/print_char/exit).
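The verifier’s shift-and-add idea, with its early overflow pruning, can be sketched in C as follows (a model of the approach, not a line-by-line transcription of `mul_eq_u32_x_fast`):

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true iff a * b == x, without a mul instruction and without ever
   forming a value larger than x. */
static bool mul_eq_u32(uint32_t a, uint32_t b, uint32_t x) {
    uint32_t sum = 0;
    while (b != 0) {
        if (b & 1) {
            if (a > x - sum) return false;   /* sum + a would exceed x */
            sum += a;
        }
        b >>= 1;
        if (b != 0) {
            if (a > (x >> 1)) return false;  /* a << 1 would already exceed x */
            a <<= 1;
        }
    }
    return sum == x;
}
```

In the perfect-square check it is invoked with `a == b` (the candidate root) for each integer near the bf16 estimate.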
#### Correctness notes & edge cases
* **`x = 0` or `1`**: handled explicitly and returns `true` (0 = 0² and 1 = 1²), matching the mathematical definition.
* **Special fp forms** (Inf/NaN/subnormal/negative) are sanitized inside converters and `bf16_sqrt`; final path for positive integers always follows the normal-number route.
* **Approximation safety**: bf16 has only 7 mantissa bits. The LUT entries are in Q1.7 format, and exponent parity is handled by separate even/odd tables. Around binade edges, truncating the reconstructed f32 can undershoot by a few units, so the code checks `y, y±1, …, y±4`. In practice this keeps the method robust over all 32-bit inputs of interest while the number of checks stays constant and tiny.
* **Overflow safety** in verifier: `mul_eq_u32_x_fast` never forms the full product; it accumulates `sum` with **range checks against `x`** every step, so we avoid 64-bit temporaries or UB.
### 3.4 Testing and Validation
```
x = 0, ans = true
x = 1, ans = true
x = 2, ans = false
x = 3, ans = false
x = 4, ans = true
x = 8, ans = false
x = 9, ans = true
x = 15, ans = false
x = 16, ans = true
x = 2147395600, ans = true
x = 2147483647, ans = false
x = -1, ans = false
Program exited with code: 0
```
* **Zero dynamic memory**; only two small LUTs: **128 B + 128 B = 256 B** total, plus code.
* **No binary search loop** over `[1..46340]`. Instead:
* One bf16 LUT lookup (O(1))
* A constant number of integer checks (≤ 9 calls to `mul_eq_u32_x_fast`)
* Each `mul_eq_u32_x_fast` runs in ≤ 32 iterations with multiple **early-out branches** on overflow or when `a` already exceeds `x` next round.
## References
* [Tool Introduction](https://ithelp.ithome.com.tw/m/articles/10351706)
* [LeetCode 338. Counting Bits](https://leetcode.com/problems/counting-bits/description/)
* [LeetCode 50. Pow(x, n)](https://leetcode.com/problems/powx-n/description/)
* [LeetCode 367. Valid Perfect Square](https://leetcode.com/problems/valid-perfect-square/description/)
* [Quiz1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B)