# Assignment 1: RISC-V Assembly and Instruction Pipeline

## Problem B

Assume **uf8** implements a logarithmic 8-bit codec that maps 20-bit unsigned integers \([0, 1{,}015{,}792]\) to 8-bit symbols via logarithmic quantization, delivering 2.5:1 compression with at most 6.25% relative error.

The uf8 encoding scheme is ideal for sensor-data applications such as temperature and distance measurements, where range matters more than precision. In graphics applications, it efficiently represents level-of-detail (LOD) distances and fog-density values. The scheme is also well suited to timer implementations that use exponential backoff strategies. However, uf8 should not be used for financial calculations, where exact precision is required and rounding errors are unacceptable. It is equally inappropriate for cryptographic applications that require a uniform distribution of values for security.

---

### 🔹 Decoding

$$
D(b) = m \cdot 2^{e} + (2^{e} - 1) \cdot 16
$$

where \( e = \lfloor b/16 \rfloor \) and \( m = b \bmod 16 \).

---

### 🔹 Encoding

$$
E(v) =
\begin{cases}
v, & \text{if } v < 16 \\
16e + \left\lfloor \dfrac{v - \text{offset}(e)}{2^{e}} \right\rfloor, & \text{otherwise}
\end{cases}
$$

where \(\text{offset}(e) = (2^{e} - 1) \cdot 16\).

---

### 🔹 Error Analysis

- **Absolute Error:** \( \Delta_{\max} = 2^{e} - 1 \)
- **Relative Error:** \( \varepsilon_{\max} = 1/16 = 6.25\% \)
- **Expected Error:** \( \mathbb{E}[\varepsilon] \approx 3\% \)

---

### 🔹 Information Theory

| Property | Value |
|-----------|--------|
| Input Entropy | 20 bits |
| Output Entropy | 8 bits |
| Theoretical Minimum | 7.6 bits (for 6.25% error bound) |
| Efficiency | 7.6 / 8 ≈ **95% optimal** |

---

🧩 *In summary, uf8 encoding offers efficient 8-bit logarithmic quantization for 20-bit values, balancing compactness and controlled precision loss.*

---

### Implementation in RV32

[github link](https://github.com/Lin-2002/ca2025-quizzes/blob/main/uf8.s)

I rewrote `clz` (count leading zeros for 32-bit ints) in three flavors:

- **A. `clz_stackless`** — same algorithm as the original loop, but with **no `lw`/`sw` instructions** (no memory access).
- **B. `clz_unrolled`** — fully **unrolled** into 5 fixed steps (16/8/4/2/1).
- **C. `clz_branchless`** — **branchless** (no `beq`/`bne`/`blt*` at all).

---

### Why branchless?

Ripes' 5-stage pipeline doesn't have a fancy branch predictor. Every `if`/`while` is a potential pipeline flush, which makes cycle counts unstable. A branchless design converts conditions into 0/1 values, scales them into shift amounts or masks, and uses `sll`/`and`/`or`/`add` to emulate conditional behavior. The result is **fixed latency** and **pipeline-friendly** timing.

---

### What I changed (concept)

- Kept the classic "binary split" idea (check the top 16/8/4/2/1 bits), but each step does a **conditional shift/add via dataflow** rather than a branch.
- Even the **`x == 0` case** is handled **without** a branch: no step ever updates `Y` or `U`, so the final `Y - U` naturally yields `32 - 0 = 32`.
- Everything is **register-only**: **no stack, no `ra`/`s0` saves**.

---

### Primary version

**Idea:** binary-search the highest set bit. Start with `Y = 32`, `X = 16`, `U = x`. At each step, if `(U >> X) != 0`, then `Y -= X` and `U = U >> X`; either way, `X >>= 1`. When `X` reaches 0, return `Y - U`.
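Before looking at any assembly, here is a minimal C sketch (naming is mine) of the two pieces described so far: the uf8 decode/encode formulas and the binary-search `clz` loop behind the baseline below.

```c
#include <stdint.h>

/* uf8 decode: b = 16*e + m  ->  v = m * 2^e + (2^e - 1) * 16 */
static uint32_t uf8_decode(uint8_t b) {
    uint32_t e = b >> 4;                     /* exponent: upper 4 bits */
    uint32_t m = b & 0x0F;                   /* mantissa: lower 4 bits */
    return (m << e) + (((1u << e) - 1) << 4);
}

/* uf8 encode: pick the exponent whose bucket contains v, then quantize.
   offset(e) = (2^e - 1) * 16 is the smallest value of bucket e. */
static uint8_t uf8_encode(uint32_t v) {
    if (v < 16)
        return (uint8_t)v;                   /* e = 0: exact           */
    uint32_t e = 0;
    while (e < 15 && v >= (((1u << (e + 1)) - 1) << 4))
        e++;                                 /* advance past bucket e  */
    uint32_t offset = ((1u << e) - 1) << 4;
    return (uint8_t)((e << 4) | ((v - offset) >> e));
}

/* Binary-search clz: Y = 32, X = 16, halve X each step; clz(0) == 32. */
static unsigned clz32(uint32_t x) {
    unsigned y = 32;
    uint32_t u = x;
    for (unsigned s = 16; s != 0; s >>= 1) {
        uint32_t t = u >> s;                 /* T = U >> X             */
        if (t != 0) {
            y -= s;                          /* high half is non-empty */
            u = t;                           /* keep searching in it   */
        }
    }
    return y - u;                            /* u ends as 0 or 1       */
}
```

As a sanity check, `uf8_encode(1015792)` yields `0xFF` and `uf8_decode(0xFF)` returns 1,015,792, matching the stated input range.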
Issues in this baseline: excessive spills and reloads. Every local lives on the stack, so each loop iteration pays several `lw`/`sw` round-trips through memory.

```asm
.globl clz
clz:
    addi sp,sp,-48
    sw   ra,44(sp)
    sw   s0,40(sp)
    addi s0,sp,48
    sw   a0,-36(s0)     # U = a0
    li   a5,32
    sw   a5,-20(s0)     # Y = 32
    li   a5,16
    sw   a5,-24(s0)     # X = 16
L3:
    lw   a5,-24(s0)     # a5 = X
    lw   a4,-36(s0)     # a4 = U
    srl  a5,a4,a5       # T = U >> X
    sw   a5,-28(s0)     # T
    lw   a5,-28(s0)
    beq  a5,x0,L2       # if (T==0) skip update
    lw   a4,-20(s0)     # Y
    lw   a5,-24(s0)     # X
    sub  a5,a4,a5       # Y = Y - X
    sw   a5,-20(s0)
    lw   a5,-28(s0)     # T
    sw   a5,-36(s0)     # U = T
L2:
    lw   a5,-24(s0)     # X
    srli a5,a5,1        # X >>= 1 (termination guaranteed)
    sw   a5,-24(s0)
    lw   a5,-24(s0)
    bne  a5,x0,L3       # while (X != 0)
    lw   a4,-20(s0)     # Y
    lw   a5,-36(s0)     # U
    sub  a5,a4,a5       # return Y - U
    addi a0,a5,0
    lw   ra,44(sp)
    lw   s0,40(sp)
    addi sp,sp,48
    jalr x0, ra, 0      # return
```

![image](https://hackmd.io/_uploads/ByWXQ9qTxe.png)

---

### Version A: `clz_stackless` (loop, stackless)

**Idea:** Same algorithm as the original (`Y=32, X=16`, loop with `X >>= 1`), but with no stack frame and no local spills.

```asm
.text
.globl clz_stackless
clz_stackless:
    li   t0, 32              # Y = 32
    li   t1, 16              # X = 16
    mv   t2, a0              # U = x
CLZ_LOOP:
    srl  t3, t2, t1          # T = U >> X
    beq  t3, x0, SKIP_UPDATE # if (T==0) skip
    sub  t0, t0, t1          # Y -= X
    mv   t2, t3              # U = T
SKIP_UPDATE:
    srli t1, t1, 1           # X >>= 1
    bnez t1, CLZ_LOOP        # while (X != 0)
    sub  a0, t0, t2          # return Y - U
    ret
```

![image](https://hackmd.io/_uploads/S196z59aee.png)

**Why faster:** fewer memory operations and simpler control flow.

---

### Version B: `clz_unrolled` (unrolled 5 steps)

**Idea:** Remove the `while`; perform fixed steps for X = 16, 8, 4, 2, 1. The only remaining branch is the per-step "update or skip" test.

```asm
.text
.globl clz_unrolled
clz_unrolled:
    li   t0, 32         # Y = 32
    li   t1, 16         # X = 16
    mv   t2, a0         # U = x

    # Step X=16
    srl  t3, t2, t1
    beq  t3, x0, STEP8
    sub  t0, t0, t1
    mv   t2, t3
STEP8:
    li   t1, 8
    srl  t3, t2, t1
    beq  t3, x0, STEP4
    sub  t0, t0, t1
    mv   t2, t3
STEP4:
    li   t1, 4
    srl  t3, t2, t1
    beq  t3, x0, STEP2
    sub  t0, t0, t1
    mv   t2, t3
STEP2:
    li   t1, 2
    srl  t3, t2, t1
    beq  t3, x0, STEP1
    sub  t0, t0, t1
    mv   t2, t3
STEP1:
    li   t1, 1
    srl  t3, t2, t1
    beq  t3, x0, PACK
    sub  t0, t0, t1
    mv   t2, t3
PACK:
    sub  a0, t0, t2
    ret
```

![image](https://hackmd.io/_uploads/H1eKE9c6xg.png)

**Why faster:** no loop branch; the branch count is fixed (at most 5).

---

### Version C: `clz_branchless` (fully branchless)

**Idea:** For each step, build a mask from `(T != 0)` and compute `Y -= (X & mask)` and `U = (T & mask) | (U & ~mask)` — all via dataflow, with no control-flow branches.
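The same select, written as a C sketch (names are mine; the loop here is exactly what the assembly below unrolls, so no branch survives in the final code):

```c
#include <stdint.h>

/* Branchless clz: every "if" becomes a mask-based select. */
static unsigned clz32_branchless(uint32_t x) {
    uint32_t y = 32, u = x;
    for (uint32_t s = 16; s != 0; s >>= 1) {  /* unrolled in the asm  */
        uint32_t t    = u >> s;
        uint32_t cond = (t != 0);             /* sltu t4, x0, t3      */
        uint32_t mask = (uint32_t)-cond;      /* sub  t5, x0, t4      */
        y -= (s & mask);                      /* Y -= X only if cond  */
        u  = (t & mask) | (u & ~mask);        /* U = cond ? T : U     */
    }
    return y - u;                             /* x == 0 falls out: 32 */
}
```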
```asm
.text
.globl clz_branchless
# a0 = x ; returns a0 = clz(x)
# Register map per step:
#   t0 = Y, t1 = X, t2 = U, t3 = T,
#   t4 = cond (0/1), t5 = mask (-cond), t6 = tmp (~mask or delta)
clz_branchless:
    li   t0, 32         # Y = 32
    li   t1, 16         # X = 16
    mv   t2, a0         # U = x

    # --- Step X=16 ---
    srl  t3, t2, t1     # T = U >> 16
    sltu t4, x0, t3     # cond = (T!=0) ? 1 : 0
    sub  t5, x0, t4     # mask = -cond (0xFFFFFFFF or 0)
    and  t6, t1, t5     # delta = X & mask
    sub  t0, t0, t6     # Y -= delta
    xori t6, t5, -1     # ~mask
    and  t4, t2, t6     # U & ~mask
    and  t3, t3, t5     # T & mask
    or   t2, t4, t3     # U = (U & ~mask) | (T & mask)

    # --- Step X=8 ---
    li   t1, 8
    srl  t3, t2, t1
    sltu t4, x0, t3
    sub  t5, x0, t4
    and  t6, t1, t5
    sub  t0, t0, t6
    xori t6, t5, -1
    and  t4, t2, t6
    and  t3, t3, t5
    or   t2, t4, t3

    # --- Step X=4 ---
    li   t1, 4
    srl  t3, t2, t1
    sltu t4, x0, t3
    sub  t5, x0, t4
    and  t6, t1, t5
    sub  t0, t0, t6
    xori t6, t5, -1
    and  t4, t2, t6
    and  t3, t3, t5
    or   t2, t4, t3

    # --- Step X=2 ---
    li   t1, 2
    srl  t3, t2, t1
    sltu t4, x0, t3
    sub  t5, x0, t4
    and  t6, t1, t5
    sub  t0, t0, t6
    xori t6, t5, -1
    and  t4, t2, t6
    and  t3, t3, t5
    or   t2, t4, t3

    # --- Step X=1 ---
    li   t1, 1
    srl  t3, t2, t1
    sltu t4, x0, t3
    sub  t5, x0, t4
    and  t6, t1, t5
    sub  t0, t0, t6
    xori t6, t5, -1
    and  t4, t2, t6
    and  t3, t3, t5
    or   t2, t4, t3

    # return Y - U
    sub  a0, t0, t2
    ret
```

![image](https://hackmd.io/_uploads/B1hXEcqagl.png)

**Why this can be faster (and when it isn't):**
On a simple 5-stage pipeline, removing branches helps only if branch penalties (flushes from hard-to-predict branches) dominate. In that case, a branchless path stabilizes timing by avoiding control hazards. But when branches are mostly easy or not taken, the extra ALU instructions and longer dependency chains of a branchless design can cost more cycles than the occasional flush, so an unrolled-but-branched version may run faster overall.

---

**Expectation:**

- `clz_stackless` beats the original stack-based version.
- `clz_unrolled` avoids the loop branch; its branch count is small and fixed.
- **`clz_branchless`** is usually the most stable in timing and often the fastest on average.

---

## Problem C

The original BF16 computation logic (in `bf16.s`) relied on compiler-generated soft-float emulation code. That implementation introduced significant performance overhead through frequent stack memory operations and unnecessary function calls. The core goals of this optimization are:

- Remove memory dependency: Replace stack-based variable passing with register-based passing (keeping every operand in registers) to reduce L1 cache pressure.
- Algorithm replacement: Implement efficient bit-level arithmetic (shift-and-add multiplication, restoring division) by hand to replace generic C library calls.
- Instruction reduction: Use inline expansion to eliminate `call`/`ret` and the associated call overhead.

## Implementation Details

### Multiplication Optimization

- File location: `bf16_div_mul.s`
- Optimization strategies:
  - Stackless design: Completely remove `addi sp, sp, -64` and all `sw`/`lw` operations, using only registers `t0`–`t6` for computation.
  - Algorithm: An 8-bit shift-and-add loop, specialized for BF16's 7-bit mantissa (8 bits with the hidden bit); see the C sketch after the division section.
  - Edge cases: Built-in handling of NaN, Inf, zero, and subnormal values according to the IEEE-754 specification.

### Division Optimization

- File locations: `bf16_div.s` / `bf16_div_mul.s`
- Optimization strategies:
  - Algorithm: Restoring division (digit recurrence), sketched in C below.
  - Loop tuning: Match the iteration count to BF16 precision, reducing dynamic branch-prediction pressure.
  - Inlining: Remove dependencies on external `udiv` or software-emulation routines and complete the division entirely within the kernel.
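To make the two kernels concrete, here is a minimal C sketch of both mantissa operations, assuming the hidden bit has already been OR-ed into each operand and leaving rounding and renormalization to the caller (as the assembly does); names and bit widths are illustrative, not the repository's exact code:

```c
#include <stdint.h>

/* Shift-and-add multiply of two 8-bit significands (hidden bit set).
   The product fits in 16 bits; the caller renormalizes and rounds. */
static uint16_t mant_mul8(uint8_t a, uint8_t b) {
    uint16_t acc = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)                       /* add shifted multiplicand */
            acc += (uint16_t)a << i;
        b >>= 1;
    }
    return acc;
}

/* Restoring division: compute q = (n << bits) / d one bit at a time.
   Assumes n, d < 256 (8-bit significands); 'bits' extra fraction
   bits cover the 7-bit mantissa plus guard/round bits. */
static uint32_t mant_div(uint32_t n, uint32_t d, int bits) {
    uint32_t num = n << bits;                 /* pre-scaled dividend */
    uint32_t q = 0, rem = 0;
    for (int i = 8 + bits - 1; i >= 0; i--) {
        rem = (rem << 1) | ((num >> i) & 1);  /* bring down next bit */
        q <<= 1;
        if (rem >= d) {                       /* trial subtract      */
            rem -= d;
            q |= 1;
        }
    }
    return q;                            /* caller renormalizes     */
}
```

For example, `mant_mul8(0x80, 0x80)` returns `0x4000` (1.0 × 1.0 with the binary point at bit 14), which the real kernel then renormalizes into the BF16 result.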
### Square Root Optimization

- File location: `bf16_sqrt.s`
- Optimization strategies:
  - Algorithm: Bitwise approximation using a binary, bit-by-bit trial (successive approximation).
  - Logic: Take the square root of the mantissa in the integer domain, then adjust the exponent accordingly (halving it).
  - Safety: For negative inputs, correctly return the canonical NaN (`0x7FC0`).

## Performance Evaluation

The optimized kernel was evaluated on a 5-stage pipeline simulator. Compared to the baseline, both the dynamic instruction count and the execution cycles improve significantly.

### Benchmark Results

| Version | Cycles | Instructions | CPI | Speedup |
| :-- | :-- | :-- | :-- | :-- |
| Baseline (`bf16.s`) | 4417 | 2939 | 1.50 | 1.00× |
| Opt: Div | 3900 | 2593 | 1.50 | 1.13× |
| Opt: Div + Mul | 3707 | 2475 | 1.50 | 1.19× |
| Opt: Div + Sqrt | 2856 | 1963 | 1.45 | 1.55× |

### Analysis

- Cumulative effects: The best performance (1.55× speedup) comes from combining the division (Div) and square-root (Sqrt) optimizations.
- Bottleneck removal: Starting from the Div-optimized version, adding the Sqrt optimization reduces cycles from 3900 to 2856. This indicates that the original Sqrt implementation (likely built on expensive iterative function calls) was the dominant performance bottleneck.
- CPI improvement: The fully optimized version (Div + Sqrt) improves CPI from 1.50 to 1.45, indicating fewer pipeline stalls and a smoother instruction flow.

## Technical Highlights

Comparing the unoptimized baseline (`bf16.s`) with the optimized versions shows the following microarchitectural improvements:

### Microarchitectural Metrics

| Metric | Baseline (`bf16.s`) | Optimized Version | Impact |
| :-- | :-- | :-- | :-- |
| Memory model | Stack-based (frequent `sp` operations) | Register-based (no load/store) | Eliminates L1 cache pressure |
| Control flow | Function calls (e.g., `call umul32`) | Inlined logic | Removes pipeline-flush and jump overhead |
| Resource usage | High memory traffic | Compute-bound | Improves instruction-level parallelism (ILP) |

## Conclusion

This work optimizes the scalar BF16 computation path. By systematically removing stack memory operations and using inline expansion to eliminate function-call overhead, the implementation achieves a 1.55× speedup (cycles reduced from 4417 to 2856) and roughly a 33% reduction in instruction count (from 2939 to 1963) in the evaluation environment. The implementation also adheres to the IEEE-754 specification, fully supporting correct propagation and handling of boundary values such as NaN and Infinity, which ensures numerical correctness and compatibility.

## Performance Analysis of BF16 Kernels on LeetCode 50

tags: `RISC-V`, `bf16`, `Optimization`, `LeetCode 50`

### Application Case: LeetCode 50. Pow(x, n)

To evaluate the real-world performance of the optimized kernels, we implemented LeetCode 50 (Pow(x, n)) using binary exponentiation (a C sketch follows this list). This problem is an ideal testbed because:

- Complex instruction chain: it requires a high volume of interleaved `bf16_mul` and `bf16_div` (for negative exponents) calls.
- Cumulative efficiency: any micro-optimization in the scalar kernels is amplified through the algorithm's iterative loop.
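A minimal C sketch of the harness logic, assuming the kernel entry points `bf16_mul` and `bf16_div` take and return raw 16-bit BF16 encodings (the constant `BF16_ONE` and the wrapper name are mine):

```c
#include <stdint.h>

typedef uint16_t bf16_t;

/* Provided by the optimized kernels (raw 16-bit BF16 encodings). */
extern bf16_t bf16_mul(bf16_t a, bf16_t b);
extern bf16_t bf16_div(bf16_t a, bf16_t b);

#define BF16_ONE 0x3F80            /* BF16 encoding of 1.0 */

/* LeetCode 50: x^n by binary exponentiation, O(log n) kernel calls. */
static bf16_t bf16_pow(bf16_t x, int32_t n) {
    uint32_t e = (n < 0) ? (uint32_t)(-(int64_t)n) : (uint32_t)n;
    bf16_t result = BF16_ONE;
    while (e) {
        if (e & 1)
            result = bf16_mul(result, x);  /* fold in current square */
        x = bf16_mul(x, x);                /* square the base        */
        e >>= 1;
    }
    /* Negative exponent: x^-n = 1 / x^n, one division at the end. */
    return (n < 0) ? bf16_div(BF16_ONE, result) : result;
}
```

Each loop iteration issues one or two kernel calls, so any per-call saving in the scalar kernels multiplies across the `log n` iterations, which is why this workload amplifies the optimizations so visibly.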
### Performance Benchmark Results

The following data was captured with the Ripes 5-stage pipeline simulator, comparing the unoptimized baseline against the fully optimized (stackless and inlined) kernel.

| Metric | Original Baseline (`bf16.s`) | Optimized Version | Improvement |
|---|---:|---:|---:|
| Cycles | 11838 | 5344 | 54.86% reduction |
| Instructions Retired | 7895 | 3882 | 50.83% reduction |
| CPI (Cycles Per Instruction) | 1.50 | 1.38 | 8% efficiency gain |

Key observations:

- Over 50% cycle reduction: By removing all stack operations (`addi sp, ...`, `lw`, `sw`), the processor saves thousands of cycles that were previously spent on memory traffic.
- Pipeline optimization: The drop in CPI from 1.50 to 1.38 indicates fewer pipeline stalls, a direct result of the register-based model, which minimizes data hazards during high-frequency arithmetic.
- Execution speedup: The optimized kernel achieves a 2.22× speedup ($11838 / 5344 \approx 2.22$) on the Pow(x, n) workload.

### Conclusion

The Problem C implementation demonstrates that low-level assembly refactoring can yield dramatic performance gains. For arithmetic-heavy workloads like LeetCode 50, the optimized BF16 library reduced execution time by 54.86% while maintaining bit-level accuracy across the test cases (positive/negative bases and exponents).