# Computer Architecture — Fall 2025 Homework 2

📦 [Full Source Code on GitHub](https://github.com/ab842650/ca2025_assignment2)

## Implement UF8 Encode and Decode on rv32emu

**UF8 from Quiz 1 Problem B** was successfully executed in a **bare-metal environment** using **rv32emu's system emulation**. The original **`main.c`** was modified to include the **`problemB_main()`** function from the assembly implementation, allowing the test routine to run and return a **pass/fail result** directly to C. The **Makefile** was updated to compile and link all files under the **RV32I_Zicsr** instruction set (**`-march=rv32i_zicsr -mabi=ilp32`**), ensuring compatibility without using the **RV32M** or **RV32F** extensions. The program was then built into **`test.elf`** and executed with **rv32emu** in system mode. **Cycle and instruction counts** were measured using **`get_cycles()`** and **`get_instret()`** within the C test framework, confirming correct execution and enabling **performance profiling** of the UF8 test suite.

> **Code:**
> - [`c code`](https://github.com/ab842650/ca2025_assignment2/blob/main/main.c)
> - [`asm code`](https://github.com/ab842650/ca2025_assignment2/blob/main/Problem_B_refinement.S)

**Result:**

![Screenshot 2025-10-30 16:16](https://hackmd.io/_uploads/ByTAAqlJ-x.png)

---

## Implement Hanoi Tower on rv32emu

**Problem A (Gray Code Hanoi Tower)** from **Quiz 2** was successfully executed in a **bare-metal environment** using **rv32emu's system emulation**. The original **`main.c`** was modified to call the **`hanoi_main()`** function from the custom assembly implementation, while the output logic in **`hanoi.S`** was adapted to use **system-level ECALL I/O** for compatibility with rv32emu's bare-metal mode. The **Makefile** was updated to compile and link all components under the **RV32I_Zicsr** instruction set (**`-march=rv32i_zicsr -mabi=ilp32`**) without floating-point or multiply/divide extensions. The program was built into **`test.elf`** and executed under **rv32emu system mode**, successfully displaying the **Gray Code–based Hanoi Tower move sequence** and verifying correct **bare-metal execution and output behavior**.

> **Code:**
> - [`c code`](https://github.com/ab842650/ca2025_assignment2/blob/main/main_A.c)
> - [`asm code`](https://github.com/ab842650/ca2025_assignment2/blob/main/hanoi.S)

**Result:**

![Screenshot 2025-11-01 15:04](https://hackmd.io/_uploads/BkleZEmJbe.png)

---

## Implement rsqrt in rv32emu

### Objective

Implement and verify a **fixed-point reciprocal square root (`1/sqrt(x)`)** function using the **Q16.16 format**. The goal is to achieve a practical, FPU-free solution for use in RISC-V (RV32I) environments and to analyze its accuracy through simulation on both **rv32emu** and **PC reference tests**.

---

### Background

The reciprocal square root function is widely used in:

- Vector normalization (`1 / sqrt(x² + y² + z²)`)
- Graphics and physics engines
- Machine learning or DSP tasks that replace division with multiplication

In floating-point systems, it is trivial to compute `1.0f / sqrtf(x)`. However, on embedded CPUs **without a floating-point unit**, we must rely on **fixed-point arithmetic** and **Newton–Raphson iteration**.

---

### Algorithm Overview

The goal is to compute the **reciprocal square root**, i.e.

$$
y = \frac{1}{\sqrt{x}}
$$

We use the **Newton–Raphson iteration** to efficiently approximate this value.
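For reference, the update rule stated in the next subsection follows from applying Newton's method to $f(y) = y^{-2} - x$, whose positive root is exactly $1/\sqrt{x}$ (a standard textbook derivation, added here for completeness):

$$
y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)}
        = y_n - \frac{y_n^{-2} - x}{-2\,y_n^{-3}}
        = y_n \left( \frac{3}{2} - \frac{1}{2}\, x\, y_n^2 \right)
$$

Note that the iteration uses only multiplication and subtraction, with no division, which is what makes it attractive on RV32I.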
---

#### Newton–Raphson Iteration

The iterative formula is:

$$
y_{n+1} = y_n \times \left( 1.5 - 0.5 \times x \times y_n^2 \right)
$$

Each iteration improves the accuracy of `y` as an estimate of $1/\sqrt{x}$.

---

#### Q16.16 Fixed-Point Representation

Because the RV32I architecture does not include a floating-point unit, all arithmetic is performed in **Q16.16 fixed-point format**.

- **Integer bits:** 16
- **Fractional bits:** 16
- **Scaling factor:** $2^{16} = 65536$

---

#### Fixed-Point Iteration Formula

The Newton iteration can be implemented in C as:

```c
/* One Newton iteration in Q16.16.
 * Note: x * y * y overflows 32 bits, so in the real implementation
 * the products must be carried out in 64-bit intermediates. */
y = (y * ((3 << 16) - ((x * y * y) >> 16))) >> 17;
```

### Lookup Table (LUT) for Initial Guess

We approximate $y_0 \approx 2^{16} / \sqrt{x}$ using a **32-entry table** keyed by the MSB position of `x` (denoted `exp = floor(log2(x))`). For inputs between $2^{exp}$ and $2^{exp+1}$, we linearly interpolate between `rsqrt_table[exp]` and `rsqrt_table[exp+1]` in **Q16.16**.

**Table (Q16.16, i.e., `65536 / sqrt(2^k)`):**

:::spoiler code
```c
/* uint32_t rather than uint16_t: the first entry (65536) needs 17 bits */
static const uint32_t rsqrt_table[32] = {
    65536, 46341, 32768, 23170, 16384, 11585, 8192, 5793,
    4096,  2896,  2048,  1448,  1024,  724,   512,  362,
    256,   181,   128,   90,    64,    45,    32,   23,
    16,    11,    8,     6,     4,     3,     2,    1
};
```
:::

To determine the exponent (`exp`) efficiently, we reuse the branchless `clz` function from the earlier assignment, which avoids conditional branches and improves performance on RV32I.

:::spoiler code
```c=
uint32_t clz(uint32_t x)
{
    int r = 0, c;
    c = (x < 0x00010000) << 4; r += c; x <<= c; // off 16
    c = (x < 0x01000000) << 3; r += c; x <<= c; // off 8
    c = (x < 0x10000000) << 2; r += c; x <<= c; // off 4
    c = (x < 0x40000000) << 1; r += c; x <<= c; // off 2
    c = x < 0x80000000;        r += c; x <<= c; // off 1
    r += x == 0;
    return r;
}
```
:::
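Putting the pieces together, here is a minimal C sketch of the whole pipeline (`clz` → LUT → linear interpolation → one Newton step). It assumes the `clz()` and `rsqrt_table[]` definitions above are in scope; the function name `rsqrt_q16` and the exact rounding/shift choices are ours and may differ slightly from the repository's `main_C.c`:

```c
#include <stdint.h>

/* Sketch only: assumes clz() and rsqrt_table[] from above are in scope.
 * x is a plain unsigned integer; the return value is Q16.16. */
static uint32_t rsqrt_q16(uint32_t x)
{
    if (x == 0)
        return 0xFFFFFFFFu;          /* saturate: 1/sqrt(0) is undefined */

    uint32_t exp = 31u - clz(x);     /* exp = floor(log2(x)) */

    /* Initial guess: rsqrt_table[exp], interpolated toward
     * rsqrt_table[exp + 1]. ((x << 16) >> exp) is x / 2^exp in Q16.16,
     * so its low 16 bits are exactly the fractional position of x
     * inside [2^exp, 2^(exp+1)). */
    uint32_t y = rsqrt_table[exp];
    if (exp < 31) {
        uint32_t frac = (uint32_t)((((uint64_t)x << 16) >> exp) & 0xFFFFu);
        y -= (uint32_t)(((uint64_t)(y - rsqrt_table[exp + 1]) * frac) >> 16);
    }

    /* One Newton step, y *= (1.5 - 0.5 * x * y^2), all in Q16.16;
     * 64-bit intermediates keep x * y * y from overflowing. */
    uint64_t xy2 = ((uint64_t)x * y * y) >> 16;   /* x * y^2 in Q16.16 */
    y = (uint32_t)(((uint64_t)y * ((3u << 16) - xy2)) >> 17);

    return y;
}
```

Repeating the Newton block once more gives the two-iteration variant (`rsqrt_2x`) measured below, and skipping the `if (exp < 31)` interpolation block gives the no-interpolation variant.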
---

Full code running on rv32emu:

> - [`main_C.c`](https://github.com/ab842650/ca2025_assignment2/blob/main/main_C.c)

Testing code on PC:

:::spoiler code
```c=
#include <math.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static const uint32_t RSQRT_INPUTS[50] = {
    // small range (0 ~ 256)
    0u, 1u, 2u, 3u, 4u, 5u, 6u, 8u, 9u, 10u,
    12u, 15u, 16u, 20u, 25u, 50u, 100u, 128u, 200u, 256u,
    // mid range (300 ~ 16384)
    300u, 400u, 500u, 768u, 1000u, 1500u, 2000u, 3000u, 4096u, 5000u,
    8192u, 10000u, 12000u, 14000u, 15000u, 16000u, 16383u, 16384u, 10000u, 12000u,
    // high range + boundary (≥ 32768)
    32768u, 65536u, 131072u, 262144u, 524288u,
    1048576u, 2097152u, 4194304u, 2147483647u, 4294967295u
};

// One-Newton-iteration results from rv32emu (Q16.16 format)
static const uint32_t RSQRT_Y_Q16_NEWTON1[50] = {
    0xffffffff, 0x00010000, 0x0000b505, 0x00009357, 0x00008000,
    0x00007242, 0x0000682f, 0x00005a83, 0x00005546, 0x000050cb,
    0x000049ab, 0x00004211, 0x00004000, 0x00003921, 0x0000330c,
    0x00002419, 0x00001986, 0x000016a1, 0x0000120c, 0x00001000,
    0x00000ec3, 0x00000cc3, 0x00000b73, 0x00000935, 0x00000818,
    0x00000697, 0x000005b9, 0x000004a9, 0x00000400, 0x0000039d,
    0x000002d4, 0x0000028e, 0x00000254, 0x00000229, 0x00000217,
    0x00000206, 0x00000200, 0x00000200, 0x0000028e, 0x00000254,
    0x0000016a, 0x00000100, 0x000000b5, 0x00000080, 0x0000005b,
    0x00000040, 0x0000002d, 0x00000020, 0x00000001, 0x00000001
};

// Two-Newton-iteration results from rv32emu (Q16.16 format)
static const uint32_t RSQRT_Y_Q16_NEWTON2[50] = {
    0xffffffff, 0x00010000, 0x0000b505, 0x000093cd, 0x00008000,
    0x0000727c, 0x00006883, 0x00005a83, 0x00005555, 0x000050f4,
    0x000049e6, 0x00004219, 0x00004000, 0x0000393e, 0x00003333,
    0x00002434, 0x0000199a, 0x000016a1, 0x0000121a, 0x00001000,
    0x00000ec8, 0x00000ccd, 0x00000b73, 0x0000093d, 0x00000818,
    0x0000069c, 0x000005b9, 0x000004ad, 0x00000400, 0x0000039f,
    0x000002d4, 0x0000028f, 0x00000256, 0x0000022a, 0x00000217,
    0x00000206, 0x00000200, 0x00000200, 0x0000028f, 0x00000256,
    0x0000016a, 0x00000100, 0x000000b5, 0x00000080, 0x0000005b,
    0x00000040, 0x0000002d, 0x00000020, 0x00000001, 0x00000001
};

// LUT only (no interpolation, no Newton iteration)
static const uint32_t RSQRT_Y_Q16_LUT_ONLY[50] = {
    0xffffffff, 0x00010000, 0x0000b505, 0x000087c4, 0x00008000,
    0x00007000, 0x00006000, 0x00005a83, 0x000054db, 0x00004f32,
    0x000043e2, 0x000032ea, 0x00004000, 0x00003800, 0x00002e00,
    0x00002087, 0x00001700, 0x000016a1, 0x00001043, 0x00001000,
    0x00000ea0, 0x00000b80, 0x00000860, 0x0000087c, 0x000005ec,
    0x00000624, 0x00000430, 0x00000458, 0x00000400, 0x0000038f,
    0x000002d4, 0x00000284, 0x0000022c, 0x000001d3, 0x000001a7,
    0x0000017b, 0x0000016a, 0x00000200, 0x00000284, 0x0000022c,
    0x0000016a, 0x00000100, 0x000000b5, 0x00000080, 0x0000005b,
    0x00000040, 0x0000002d, 0x00000020, 0x00000001, 0x00000001
};

static inline float q16_to_float(uint32_t q)
{
    return (float)q / 65536.0f;
}

static inline uint32_t float_to_q16(float f)
{
    float scaled = f * 65536.0f + 0.5f;
    if (scaled < 0) return 0;
    if (scaled > 4294967295.0f) return 0xFFFFFFFFu;
    return (uint32_t)scaled;
}

int main(void)
{
    size_t n = sizeof(RSQRT_INPUTS) / sizeof(RSQRT_INPUTS[0]);
    double sum_rel = 0.0, max_rel = 0.0;
    size_t valid_n = 0;
    size_t idx_max_rel = 0;

    printf("x,y_q16,ref_q16,rel_error(%%)\n");
    for (size_t i = 0; i < n; ++i) {
        uint32_t x = RSQRT_INPUTS[i];
        uint32_t yq = RSQRT_Y_Q16_NEWTON2[i];

        if (x == 0) {
            printf("0x%08X,0x%08X,----,N/A\n", x, yq);
            continue;
        }

        float ref_f = 1.0f / sqrtf((float)x);
        uint32_t ref_q = float_to_q16(ref_f);
        float y_f = q16_to_float(yq);
        float rel_err = fabsf((y_f - ref_f) / ref_f);

        printf("0x%08X,0x%08X,0x%08X,%.6f%%\n",
               x, yq, ref_q, rel_err * 100.0f);

        sum_rel += rel_err;
        valid_n++;
        if (rel_err > max_rel) {
            max_rel = rel_err;
            idx_max_rel = i;
        }
    }

    printf("\n=== SUMMARY ===\n");
    printf("Valid samples: %zu\n", valid_n);
    printf("Average relative error: %.6f%%\n", (sum_rel / valid_n) * 100.0);
    printf("Max relative error: %.6f%% (x=0x%08X)\n",
           max_rel * 100.0, RSQRT_INPUTS[idx_max_rel]);
    return 0;
}
```
:::

---

### Performance & Accuracy (Q16.16 `rsqrt`)

#### 1. Variants

To analyze both **initial approximation quality** and **Newton iteration impact**, we test three representative variants:

| Variant | Description | Purpose |
|----------|--------------|----------|
| `rsqrt_1x` | LUT + interpolation + **1 Newton iteration** | Baseline, good trade-off between speed and precision |
| `rsqrt_2x` | LUT + interpolation + **2 Newton iterations** | Tests convergence improvement with an extra iteration |
| `rsqrt_no_interp_2x` | LUT (**no interpolation**) + **2 Newton iterations** | Measures how much interpolation improves initial accuracy |

This selection highlights how **interpolation** and **iteration depth** affect both **accuracy** and **performance**, without redundant combinations.

---

#### 2. Dataset

- 50 input samples (covering low/mid/high ranges, powers of two, and edge cases)

:::spoiler test data here
```c=
#ifndef RSQRT_TEST_DATA_H
#define RSQRT_TEST_DATA_H

#include <stdint.h>

/* === Test input values (x) === */
static const uint32_t RSQRT_INPUTS[50] = {
    // small range (0 ~ 256)
    0u, 1u, 2u, 3u, 4u, 5u, 6u, 8u, 9u, 10u,
    12u, 15u, 16u, 20u, 25u, 50u, 100u, 128u, 200u, 256u,
    // mid range (300 ~ 16384)
    300u, 400u, 500u, 768u, 1000u, 1500u, 2000u, 3000u, 4096u, 5000u,
    8192u, 10000u, 12000u, 14000u, 15000u, 16000u, 16383u, 16384u, 10000u, 12000u,
    // high range + boundary (≥ 32768)
    32768u, 65536u, 131072u, 262144u, 524288u,
    1048576u, 2097152u, 4194304u, 2147483647u, 4294967295u
};

#endif /* RSQRT_TEST_DATA_H */
```
:::

---

#### 3. Measurement Procedure

To evaluate precision, we first **run the fixed-point `rsqrt` implementation on rv32emu**, which produces outputs in **Q16.16 format**. Then, we **convert those Q16.16 values back to floating point on the PC** and compare them against the reference value computed with the standard library:

$$
y_{ref} = \frac{1}{\sqrt{x}}
$$

The relative error for each sample is calculated as:

$$
\text{RelErr}(x) = 100 \times \frac{|y_{fixed} - y_{ref}|}{y_{ref}}
$$
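As a worked example (our arithmetic, using the two-iteration output for $x = 3$ from the tables above, $y_{q16} = \texttt{0x93CD} = 37837$):

$$
y_{fixed} = \frac{37837}{65536} \approx 0.577347,\qquad
y_{ref} = \frac{1}{\sqrt{3}} \approx 0.577350,\qquad
\text{RelErr} \approx 100 \times \frac{0.577350 - 0.577347}{0.577350} \approx 0.0006\%
$$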
#### 4. Testing Results

The following table summarizes the accuracy and performance results for the reciprocal square-root (`1/√x`) implementation in **Q16.16 fixed-point** format.

| Variant | Description | Avg. Rel. Error (%) | Cycles (50 inputs) | Cycles/Input | Instructions |
|:--------|:-------------|--------------------:|-------------------:|--------------:|--------------:|
| **V1** | LUT + 1× Newton (linear interpolation); baseline (measured) | **0.7401** | **115,208** | **2,304** | **115,208** |
| **V2** | LUT + 2× Newton (linear interpolation) | **0.6298** | **205,900** | **4,118** | **205,900** |
| **V3** | LUT + 1× Newton (no interpolation) | **7.1775** | **101,723** | **2,034** | **101,723** |

- **Accuracy vs. work:** Moving from **1× Newton (V1)** to **2× Newton (V2)** improves average relative error only slightly (**0.7401% → 0.6298%**, ≈15% better), but nearly **doubles the cycles** (≈**+79%** total). In other words, the **marginal accuracy gain is not cost-effective** on RV32I.
- **Why interpolation matters:** Comparing **V1 (with linear interpolation)** to **V3 (no interpolation, still 1× Newton)** shows that a **cheap linear interpolation step** dramatically improves accuracy (**7.1775% → 0.7401%**, ≈**10×** better) with only a modest cycle increase. Intuition: linear interpolation gives a **closer initial guess** $y_0$, so **one Newton step** lands much nearer the true $1/\sqrt{x}$.

## Compiler Optimization Analysis

### Objective

In this section, we disassemble the ELF files generated by the C compiler under different optimization levels and compare their assembly structure and runtime behavior. The goal is to understand how compiler optimizations affect **code size**, **instruction count**, and **execution cycles**. Specifically, we compare the following builds:

- `-O0` : No optimization (baseline)
- `-Os` : Optimize for code size
- `-Ofast` : Aggressive optimization for performance

Each version is compiled, disassembled, and executed on **rv32emu**, with cycle and instruction counts measured.

---

### Methodology

Each version of the program was compiled under three optimization levels: `-O0`, `-Os`, and `-Ofast`, using the same toolchain configuration (`rv32i_zicsr`, `ilp32`). The generated ELF binaries were then disassembled and analyzed at the instruction level. Only the **.text** segment was examined, as ELF headers and section tables remain structurally similar across optimization levels. The analysis focused on:

- Function inlining and call elimination
- Stack frame allocation
- Instruction scheduling and branching patterns
- Memory access frequency

For performance evaluation, each ELF was executed in **rv32emu**, which provides hardware counters for:

- **cycle** – total clock cycles consumed
- **instret** – number of retired instructions

The program prints both counters upon completion, allowing quantitative comparison of compiler optimizations.
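For reference, here is a minimal sketch of how such counters can be read with the Zicsr extension. The repo's actual `get_cycles()`/`get_instret()` helpers may differ in detail; in particular, a full 64-bit count on RV32 also requires the high-word CSRs `cycleh`/`instreth`:

```c
#include <stdint.h>

/* Read the low 32 bits of the cycle / instret counters (Zicsr).
 * Sketch only: a robust 64-bit reader would also sample cycleh and
 * instreth, re-reading until the high word is stable. */
static inline uint32_t read_cycle32(void)
{
    uint32_t c;
    __asm__ volatile("csrr %0, cycle" : "=r"(c));
    return c;
}

static inline uint32_t read_instret32(void)
{
    uint32_t n;
    __asm__ volatile("csrr %0, instret" : "=r"(n));
    return n;
}
```

Sampling each counter before and after the workload and subtracting the two readings gives the per-run totals reported in the tables below.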
---

### Performance Results

| Optimization Level | Code Size (bytes) | Cycles | Instructions | Relative Speed |
|:-------------------|------------------:|-------:|--------------:|----------------:|
| `-O0` | 3473 | 115,208 | 115,208 | Baseline |
| `-Os` | 1853 | 56,890 | 56,889 | **2.03× faster** |
| `-Ofast` | 2245 | 33,165 | 33,164 | **3.47× faster** |

---

### Observations from Disassembly

> **Disassembly Listings:**
> - [`dump_O0.S`](https://github.com/ab842650/ca2025_assignment2/blob/main/dump_o0.S) — baseline (`-O0`)
> - [`dump_Os.S`](https://github.com/ab842650/ca2025_assignment2/blob/main/dump_os.S) — optimized for size (`-Os`)
> - [`dump_Ofast.S`](https://github.com/ab842650/ca2025_assignment2/blob/main/dump_ofast.S) — optimized for speed (`-Ofast`)

#### (1) Function Inlining and Call Reduction

- **`-O0`** keeps all helper routines (`print_dec`, `udiv`, `umod`, `mul32x32`, `__mulsi3`) as *separate function calls*. Each digit conversion in `print_dec` invokes `udiv` and `umod`, producing many `jal` (jump-and-link) instructions and large stack frames.
- **`-Os`** begins inlining small helpers. The decimal-printing loop is expanded in place using shift-and-subtract instead of calling `udiv`/`umod`. `mul32x32` still calls `__mulsi3`, but fewer times overall.
- **`-Ofast`** performs aggressive inlining and constant folding. Hot paths (Newton iteration, interpolation) are flattened into straight-line arithmetic. Most helper calls disappear entirely, which drastically cuts control-flow overhead.

#### (2) Stack Frame and Register Management

- **`-O0`** uses large stack frames and saves all callee-saved registers, even for leaf functions. Many temporary values are spilled to memory (`sw`/`lw` pairs).
- **`-Os`** applies *leaf-function optimization*, minimizing stack size and omitting unnecessary saves. The stack frame shrinks by roughly half.
- **`-Ofast`** goes further: flattened loops and full inlining reduce memory traffic, keeping intermediate values entirely in registers.

#### (3) Arithmetic Transformation and Loop Simplification

- The unoptimized version performs general-purpose software division (`udiv`/`umod`) in a bitwise loop for every decimal digit (a sketch of such a loop follows below).
- Optimized versions inline that logic, replacing division with simple shifts and subtractions.
- In `-Ofast`, arithmetic patterns like `(lo << 16) | (hi >> 16)` and `(3u << 16) - xy2` are merged into compact instruction sequences.
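As referenced in (3), the kind of bitwise loop behind `udiv`/`umod` at `-O0` looks roughly like this (a textbook restoring divider; this sketch is ours, not the repo's exact code):

```c
#include <stdint.h>

/* Bitwise restoring division (assumes den != 0): the kind of loop
 * the -O0 build calls once per decimal digit in print_dec. */
static uint32_t udiv32(uint32_t num, uint32_t den, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((num >> i) & 1u);  /* bring in next dividend bit */
        if (r >= den) {
            r -= den;                      /* divisor fits: subtract */
            q |= 1u << i;                  /* and set the quotient bit */
        }
    }
    if (rem)
        *rem = r;
    return q;
}
```

Roughly 32 shift/compare/branch iterations per call is exactly the overhead that `-Os` and `-Ofast` eliminate by inlining shift-and-subtract sequences instead.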
---

### Discussion

Compiler optimization clearly affects both **code structure** and **execution behavior**:

- `-Os` achieves **nearly half the code size** and doubles the speed by removing redundant calls while preserving readability and correctness.
- `-Ofast` further improves speed by ~3.5× compared to `-O0`, thanks to aggressive inlining, constant propagation, and tighter arithmetic sequences.
- Both optimizations reduce stack usage, function-call overhead, and instruction count, leading to shorter `.text` sections *and* faster runtime.

In summary,

> **Inlining + constant folding + reduced stack traffic = fewer cycles and smaller binaries.**

Even on a simple RV32I core without hardware multiply/divide, compiler optimization significantly reshapes the generated assembly structure and yields substantial performance gains.

---

## Handwritten RISC-V Optimization

### Objective

In this part, we go beyond compiler optimizations and try to **manually optimize** the core routine in RISC-V assembly. The main goals are:

- Reduce the **CSR cycle count** measured by `ticks.c` and `perfcounter`.
- Eliminate as many **function calls** as possible in the hot path.
- Apply low-level techniques such as **loop unrolling** and **peephole optimization**.
- Explore how far a carefully tuned RV32I assembly implementation can compete with (or beat) `gcc`.

---

### Implementation Strategy

To push performance further beyond compiler-generated code, I rewrote the entire reciprocal square-root routine directly in handwritten RV32I assembly.

#### No Function Calls

The entire algorithm is implemented as **one monolithic assembly block**, without calling any subroutines. This removes:

- `jal`/`jalr` overhead
- stack frame setup/teardown
- register-saving conventions
- unnecessary control-flow penalties

#### Fully Inlined Arithmetic

All arithmetic steps—initial LUT lookup, linear interpolation, and Newton–Raphson refinement—are **fully inlined**. Since RV32I lacks hardware multiplication, every 32-bit × 32-bit multiply is expanded into a **manual shift-and-add multiplier**, producing both the 32-bit `hi` and `lo` words of the simulated 64-bit result (see the C model below). This exactly mirrors the Q16 fixed-point computation of the C version.

#### Register-Only Dataflow

Almost all computation stays entirely inside **temporary registers (`t0`–`t6`)**, minimizing the need for loads/stores. Only two saved registers (`s1`, `s2`) are spilled once, using a single pair of `sw`/`lw` instructions in the prologue and epilogue.

#### Code

> - [`rsqrt_fast.S`](https://github.com/ab842650/ca2025_assignment2/blob/main/rsqrt_fast.S)
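As a C model of the shift-and-add multiplier described above (the real instruction sequence lives in `rsqrt_fast.S`; this sketch is ours, reusing the report's `mul32x32` name):

```c
#include <stdint.h>

/* 32x32 -> 64-bit multiply by shift-and-add, as required on RV32I
 * (no mul instruction). Produces the hi and lo words of the result. */
static void mul32x32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_hi = 0, a_lo = a;   /* 64-bit (a_hi:a_lo) = a << step */
    uint32_t r_hi = 0, r_lo = 0;

    while (b) {
        if (b & 1u) {              /* add the shifted multiplicand */
            uint32_t old_lo = r_lo;
            r_lo += a_lo;
            r_hi += a_hi + (r_lo < old_lo);   /* carry into high word */
        }
        a_hi = (a_hi << 1) | (a_lo >> 31);    /* (a_hi:a_lo) <<= 1 */
        a_lo <<= 1;
        b >>= 1;
    }

    *hi = r_hi;
    *lo = r_lo;
}
```

At up to 32 add/shift/branch steps per product, multiplies dominate the cycle budget on RV32I, which is why the handwritten version works so hard to unroll and trim these sequences.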
---

### Performance Comparison

To evaluate the effectiveness of the handwritten RV32I implementation, I compared execution cycles using `ticks.c` and `perfcounter` under different compiler optimization levels. Because this assignment focuses **only on CSR cycle counts**, all I/O-heavy operations — especially `print_hex()` and any console printing — were temporarily disabled to avoid polluting the measurement.

| Version | CSR Cycles | Relative Speed |
|-----------------------------|------------|------------------|
| `gcc -O0` | 109,876 | 1.0× (baseline) |
| `gcc -Os` | 55,101 | ~1.99× faster |
| `gcc -Ofast` | 31,115 | ~3.53× faster |
| **Handwritten RV32I ASM** | 32,020 | **~3.43× faster** |

#### Notes

- `-O0` serves as the baseline; the compiler applies virtually no optimizations.
- `-Os` produces compact code but occasionally sacrifices instruction-level performance.
- `-Ofast` enables aggressive math optimizations and inlines the function heavily.
- The **handwritten assembly version** removes function-call overhead, uses only register-resident dataflow, and expands all multiplications manually, achieving a cycle count close to `-Ofast` and far below `-O0`/`-Os`.

#### Interpretation

The handwritten RV32I implementation clearly beats both `-O0` and `-Os`. By removing all helper-function calls, avoiding stack usage in the hot path, and manually expanding every arithmetic operation (including the shift-and-add multipliers), the cycle count drops dramatically compared to the lower-optimization compiler builds.

However, the assembly version is still slightly slower than `gcc -Ofast`. This is mainly because the compiler can fully inline the rsqrt routine into `main`, enabling instruction scheduling and cross-block optimization that are not possible in my current setup. My assembly implementation lives in a separate function, so each call to `rsqrt_fast` still incurs a prologue/epilogue and prevents whole-function inlining.

If the entire `main` loop were rewritten in assembly, or if the rsqrt code were injected directly into the loop body, the gap between the handwritten version and `-Ofast` would likely shrink further.