# Computer Architecture — Fall 2025 Homework 2
📦 [Full Source Code on GitHub](https://github.com/ab842650/ca2025_assignment2)
## Implement UF8 encode and decode on rv32emu
**UF8 from Quiz 1 Problem B** was successfully executed in a **bare-metal environment** using **rv32emu’s system emulation**.
The original **`main.c`** was modified to include the **`problemB_main()`** function from the assembly implementation, allowing the test routine to run and return a **pass/fail result** directly to C.
The **Makefile** was updated to compile and link all files under the **RV32I_Zicsr** instruction set (**`-march=rv32i_zicsr -mabi=ilp32`**), ensuring compatibility without using **RV32M** or **RV32F** extensions.
The program was then built into **`test.elf`** and executed with **rv32emu** in system mode.
**Cycle and instruction counts** were measured using **`get_cycles()`** and **`get_instret()`** within the C test framework, confirming correct execution and providing performance data for the UF8 test suite.
> **Code:**
> - [`c code`](https://github.com/ab842650/ca2025_assignment2/blob/main/main.c)
> - [`asm code`](https://github.com/ab842650/ca2025_assignment2/blob/main/Problem_B_refinement.S)
**Result:**

---
## Implement Hanoi Tower on rv32emu
**Problem A (Gray Code Hanoi Tower)** from **Quiz 2** was successfully executed in a **bare-metal environment** using **rv32emu’s system emulation**.
The original **`main.c`** was modified to call the **`hanoi_main()`** function from the custom assembly implementation, while the output logic in **`hanoi.S`** was adapted to use **system-level ECALL I/O** for compatibility with rv32emu’s bare-metal mode.
The **Makefile** was updated to compile and link all components under the **RV32I_Zicsr** instruction set (**`-march=rv32i_zicsr -mabi=ilp32`**) without floating-point or multiply/divide extensions.
The program was built into **`test.elf`** and executed under **rv32emu system mode**, successfully displaying the **Gray Code–based Hanoi Tower move sequence** and verifying correct **bare-metal execution and output behavior**.
> **Code:**
> - [`c code`](https://github.com/ab842650/ca2025_assignment2/blob/main/main_A.c)
> - [`asm code`](https://github.com/ab842650/ca2025_assignment2/blob/main/hanoi.S)
**Result:**

---
## Implement rsqrt on rv32emu
### Objective
Implement and verify a **fixed-point reciprocal square root (`1/sqrt(x)`)** function using the **Q16.16 format**.
The goal is a practical, FPU-free solution for RISC-V (RV32I) environments, with accuracy analyzed by comparing **rv32emu** results against **PC reference tests**.
---
### Background
The reciprocal square root function is widely used in:
- Vector normalization (`1 / sqrt(x² + y² + z²)`)
- Graphics and physics engines
- Machine learning or DSP tasks that replace division with multiplication
In floating-point systems, it’s trivial to use `1.0f / sqrtf(x)`.
However, on embedded CPUs **without a floating-point unit**, we must rely on **fixed-point arithmetic** and **Newton–Raphson iteration**.
---
### Algorithm Overview
The goal is to compute the **reciprocal square root**, i.e.
$$
y = \frac{1}{\sqrt{x}}
$$
We use the **Newton–Raphson iteration** to efficiently approximate this value.
---
#### Newton–Raphson Iteration
The iterative formula is:
$$
y_{n+1} = y_n \times \left( 1.5 - 0.5 \times x \times y_n^2 \right)
$$
Each iteration improves the accuracy of `y` as an estimate of $1/\sqrt{x}$.
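The update rule comes from applying Newton's method to $f(y) = 1/y^2 - x$, whose positive root is exactly $1/\sqrt{x}$. As a quick floating-point sanity check (illustrative only; the actual implementation below is fixed-point):
```c
#include <stdio.h>

/* Watch the Newton iterate converge for x = 4 (true answer 0.5),
 * starting from a deliberately poor guess. */
int main(void)
{
    float x = 4.0f, y = 0.3f;
    for (int i = 0; i < 4; ++i) {
        y = y * (1.5f - 0.5f * x * y * y);  /* Newton-Raphson update */
        printf("iter %d: y = %.7f\n", i, y);
    }
    return 0;  /* prints ~0.396, ~0.470, ~0.497, ~0.49998 */
}
```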
---
#### Q16.16 Fixed-Point Representation
Because the RV32I architecture does not include a floating-point unit, all arithmetic is performed in **Q16.16 fixed-point format**.
- **Integer bits:** 16
- **Fractional bits:** 16
- **Scaling factor:** $2^{16} = 65536$
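For example, 1.0 is stored as `0x00010000` and 0.5 as `0x00008000`. A minimal pair of conversion helpers (the same convention the PC test harness below uses):
```c
#include <stdint.h>

/* Q16.16 <-> float: the raw integer equals the real value times 2^16. */
static inline uint32_t float_to_q16(float f)    { return (uint32_t)(f * 65536.0f); }
static inline float    q16_to_float(uint32_t q) { return (float)q / 65536.0f; }

/* e.g. float_to_q16(1.5f) == 0x00018000, q16_to_float(0x00008000) == 0.5f */
```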
---
#### Fixed-Point Iteration Formula
The Newton iteration can be implemented in C as:
```c
// One Newton iteration in Q16.16. x is the raw integer input, y the
// Q16.16 estimate; 64-bit intermediates avoid 32-bit overflow.
uint32_t y2  = (uint32_t)(((uint64_t)y * y) >> 16);  // y^2   (Q16.16)
uint32_t xy2 = (uint32_t)((uint64_t)x * y2);         // x*y^2 (Q16.16)
y = (uint32_t)(((uint64_t)y * ((3u << 16) - xy2)) >> 17);
```
### Lookup Table (LUT) for Initial Guess
We approximate $y_0 \approx 2^{16} / \sqrt{x}$ using a **32-entry table** keyed by the MSB position of `x` (denoted `exp = floor(log2(x))`).
For inputs between $2^{exp}$ and $2^{exp+1}$, we linearly interpolate between `rsqrt_table[exp]` and `rsqrt_table[exp+1]` in **Q16.16**.
**Table (Q16.16, i.e., `65536 / sqrt(2^k)`):**
::: spoiler code
```c
/* uint32_t, not uint16_t: the first entry (65536) needs 17 bits */
static const uint32_t rsqrt_table[32] = {
    65536, 46341, 32768, 23170, 16384,
    11585,  8192,  5793,  4096,  2896,
     2048,  1448,  1024,   724,   512,
      362,   256,   181,   128,    90,
       64,    45,    32,    23,    16,
       11,     8,     6,     4,     3,
        2,     1
};
```
:::
To determine the exponent `exp` efficiently, we reuse the branchless `clz` (count leading zeros) function from the earlier assignment; avoiding data-dependent branches keeps performance predictable on RV32I.
:::spoiler code
```c=
uint32_t clz(uint32_t x)
{
    int r = 0, c;
    c = (x < 0x00010000) << 4;
    r += c;
    x <<= c; // off 16
    c = (x < 0x01000000) << 3;
    r += c;
    x <<= c; // off 8
    c = (x < 0x10000000) << 2;
    r += c;
    x <<= c; // off 4
    c = (x < 0x40000000) << 1;
    r += c;
    x <<= c; // off 2
    c = x < 0x80000000;
    r += c;
    x <<= c; // off 1
    r += x == 0; // x == 0 maps to 32
    return r;
}
```
:::
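Putting the table, interpolation, and `clz` together, the initial guess can be computed roughly as follows (a sketch; `rsqrt_initial_guess` is a hypothetical name — see `main_C.c` for the actual implementation):
```c
#include <stdint.h>

/* Sketch: LUT lookup keyed by exp = floor(log2(x)), with linear
 * interpolation toward the next entry. Assumes x != 0 (x == 0 is
 * handled separately) and the clz/rsqrt_table definitions above. */
static uint32_t rsqrt_initial_guess(uint32_t x)
{
    uint32_t exp = 31 - clz(x);  /* MSB position of x */
    uint32_t y0 = rsqrt_table[exp];
    uint32_t y1 = (exp < 31) ? rsqrt_table[exp + 1] : 1; /* 65536/sqrt(2^32) == 1 */
    /* fraction of the way from 2^exp to 2^(exp+1), in Q16.16 */
    uint32_t off = x - (1u << exp);
    uint32_t frac = (exp >= 16) ? (off >> (exp - 16)) : (off << (16 - exp));
    /* table entries decrease, so interpolate downward from y0 */
    return y0 - (uint32_t)(((uint64_t)(y0 - y1) * frac) >> 16);
}
```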
---
**Full code running on rv32emu:**
> - [`main_C.c`](https://github.com/ab842650/ca2025_assignment2/blob/main/main_C.c)
**Testing code on PC:**
:::spoiler code
```c=
#include <math.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
static const uint32_t RSQRT_INPUTS[50] = {
// small range (0 ~ 256)
0u, 1u, 2u, 3u, 4u, 5u, 6u, 8u, 9u, 10u,
12u, 15u, 16u, 20u, 25u, 50u, 100u, 128u, 200u, 256u,
// mid range (300 ~ 16384)
300u, 400u, 500u, 768u, 1000u, 1500u, 2000u, 3000u, 4096u, 5000u,
8192u, 10000u, 12000u, 14000u, 15000u, 16000u, 16383u, 16384u, 10000u, 12000u,
// high range + boundary (≥ 32768)
32768u, 65536u, 131072u, 262144u, 524288u,
1048576u, 2097152u, 4194304u, 2147483647u, 4294967295u
};
// One-Newton iteration results from RV32Emu (Q16.16 format)
static const uint32_t RSQRT_Y_Q16_NEWTON1[50] = {
0xffffffff, 0x00010000, 0x0000b505, 0x00009357, 0x00008000, 0x00007242,
0x0000682f, 0x00005a83, 0x00005546, 0x000050cb, 0x000049ab, 0x00004211,
0x00004000, 0x00003921, 0x0000330c, 0x00002419, 0x00001986, 0x000016a1,
0x0000120c, 0x00001000, 0x00000ec3, 0x00000cc3, 0x00000b73, 0x00000935,
0x00000818, 0x00000697, 0x000005b9, 0x000004a9, 0x00000400, 0x0000039d,
0x000002d4, 0x0000028e, 0x00000254, 0x00000229, 0x00000217, 0x00000206,
0x00000200, 0x00000200, 0x0000028e, 0x00000254, 0x0000016a, 0x00000100,
0x000000b5, 0x00000080, 0x0000005b, 0x00000040, 0x0000002d, 0x00000020,
0x00000001, 0x00000001
};
// Two-Newton iteration results from RV32Emu (Q16.16 format)
static const uint32_t RSQRT_Y_Q16_NEWTON2[50] = {
0xffffffff, 0x00010000, 0x0000b505, 0x000093cd, 0x00008000, 0x0000727c,
0x00006883, 0x00005a83, 0x00005555, 0x000050f4, 0x000049e6, 0x00004219,
0x00004000, 0x0000393e, 0x00003333, 0x00002434, 0x0000199a, 0x000016a1,
0x0000121a, 0x00001000, 0x00000ec8, 0x00000ccd, 0x00000b73, 0x0000093d,
0x00000818, 0x0000069c, 0x000005b9, 0x000004ad, 0x00000400, 0x0000039f,
0x000002d4, 0x0000028f, 0x00000256, 0x0000022a, 0x00000217, 0x00000206,
0x00000200, 0x00000200, 0x0000028f, 0x00000256, 0x0000016a, 0x00000100,
0x000000b5, 0x00000080, 0x0000005b, 0x00000040, 0x0000002d, 0x00000020,
0x00000001, 0x00000001
};
// LUT-only initial guess (no interpolation) + one Newton iteration, from RV32Emu (Q16.16 format)
static const uint32_t RSQRT_Y_Q16_LUT_ONLY[50] = {
0xffffffff, 0x00010000, 0x0000b505, 0x000087c4, 0x00008000, 0x00007000,
0x00006000, 0x00005a83, 0x000054db, 0x00004f32, 0x000043e2, 0x000032ea,
0x00004000, 0x00003800, 0x00002e00, 0x00002087, 0x00001700, 0x000016a1,
0x00001043, 0x00001000, 0x00000ea0, 0x00000b80, 0x00000860, 0x0000087c,
0x000005ec, 0x00000624, 0x00000430, 0x00000458, 0x00000400, 0x0000038f,
0x000002d4, 0x00000284, 0x0000022c, 0x000001d3, 0x000001a7, 0x0000017b,
0x0000016a, 0x00000200, 0x00000284, 0x0000022c, 0x0000016a, 0x00000100,
0x000000b5, 0x00000080, 0x0000005b, 0x00000040, 0x0000002d, 0x00000020,
0x00000001, 0x00000001
};
static inline float q16_to_float(uint32_t q) { return (float)q / 65536.0f; }

static inline uint32_t float_to_q16(float f)
{
    float scaled = f * 65536.0f + 0.5f;
    if (scaled < 0)
        return 0;
    if (scaled >= 4294967296.0f) /* clamp before the cast overflows */
        return 0xFFFFFFFFu;
    return (uint32_t)scaled;
}

int main(void)
{
    size_t n = sizeof(RSQRT_INPUTS) / sizeof(RSQRT_INPUTS[0]);
    double sum_rel = 0.0, max_rel = 0.0;
    size_t valid_n = 0;
    size_t idx_max_rel = 0;

    printf("x,y_q16,ref_q16,rel_error(%%)\n");
    for (size_t i = 0; i < n; ++i) {
        uint32_t x = RSQRT_INPUTS[i];
        uint32_t yq = RSQRT_Y_Q16_NEWTON2[i];
        if (x == 0) { /* rsqrt(0) is undefined; skip */
            printf("0x%08X,0x%08X,----,N/A\n", x, yq);
            continue;
        }
        float ref_f = 1.0f / sqrtf((float)x);
        uint32_t ref_q = float_to_q16(ref_f);
        float y_f = q16_to_float(yq);
        float rel_err = fabsf((y_f - ref_f) / ref_f);
        printf("0x%08X,0x%08X,0x%08X,%.6f%%\n", x, yq, ref_q, rel_err * 100.0f);
        sum_rel += rel_err;
        valid_n++;
        if (rel_err > max_rel) {
            max_rel = rel_err;
            idx_max_rel = i;
        }
    }

    printf("\n=== SUMMARY ===\n");
    printf("Valid samples: %zu\n", valid_n);
    printf("Average relative error: %.6f%%\n", (sum_rel / valid_n) * 100.0);
    printf("Max relative error: %.6f%% (x=0x%08X)\n",
           max_rel * 100.0, RSQRT_INPUTS[idx_max_rel]);
    return 0;
}
```
:::
---
### Performance & Accuracy (Q16.16 `rsqrt`)
#### 1. Variants
To analyze both **initial approximation quality** and **Newton iteration impact**, we test three representative variants:
| Variant | Description | Purpose |
|----------|--------------|----------|
| `rsqrt_1x` | LUT + interpolation + **1 Newton iteration** | Baseline, good trade-off between speed and precision |
| `rsqrt_2x` | LUT + interpolation + **2 Newton iterations** | Tests convergence improvement with extra iteration |
| `rsqrt_no_interp_2x` | LUT (**no interpolation**) + **1 Newton iteration** | Measures how much interpolation improves initial accuracy |
This selection highlights how **interpolation** and **iteration depth** affect both **accuracy** and **performance**, without redundant combinations.
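In code, the measured variants are thin compositions of the stages above (a sketch with assumed helper names; `rsqrt_initial_guess` is the interpolated LUT guess sketched earlier):
```c
#include <stdint.h>

/* One Q16.16 Newton refinement (same formula as shown earlier). */
static inline uint32_t newton_step(uint32_t x, uint32_t y)
{
    uint32_t y2  = (uint32_t)(((uint64_t)y * y) >> 16);  /* y^2   (Q16.16) */
    uint32_t xy2 = (uint32_t)((uint64_t)x * y2);         /* x*y^2 (Q16.16) */
    return (uint32_t)(((uint64_t)y * ((3u << 16) - xy2)) >> 17);
}

uint32_t rsqrt_1x(uint32_t x) { return newton_step(x, rsqrt_initial_guess(x)); }
uint32_t rsqrt_2x(uint32_t x) { return newton_step(x, rsqrt_1x(x)); }
/* The no-interpolation variant replaces rsqrt_initial_guess() with a
 * direct rsqrt_table[exp] lookup before the Newton step(s). */
```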
---
#### 2. Dataset
- 50 input samples (covering low/mid/high ranges, powers of two, and edge cases)
:::spoiler test data here
```c=
#ifndef RSQRT_TEST_DATA_H
#define RSQRT_TEST_DATA_H
#include <stdint.h>
/* === Test input values (x) === */
static const uint32_t RSQRT_INPUTS[50] = {
// small range (0 ~ 256)
0u, 1u, 2u, 3u, 4u, 5u, 6u, 8u, 9u, 10u,
12u, 15u, 16u, 20u, 25u, 50u, 100u, 128u, 200u, 256u,
// mid range (300 ~ 16384)
300u, 400u, 500u, 768u, 1000u, 1500u, 2000u, 3000u, 4096u, 5000u,
8192u, 10000u, 12000u, 14000u, 15000u, 16000u, 16383u, 16384u, 10000u, 12000u,
// high range + boundary (≥ 32768)
32768u, 65536u, 131072u, 262144u, 524288u,
1048576u, 2097152u, 4194304u, 2147483647u, 4294967295u
};
#endif /* RSQRT_TEST_DATA_H */
```
:::
---
#### 3. Measurement Procedure
To evaluate precision, we first **run the fixed-point `rsqrt` implementation on `rv32emu`**,
which produces outputs in **Q16.16 format**.
Then, we **convert those Q16.16 values back to floating-point on PC** and compare them against the reference value computed with the standard library:
$$
y_{ref} = \frac{1}{\sqrt{x}}
$$
The relative error for each sample is calculated as:
$$
\text{RelErr}(x) = 100 \times \frac{|y_{fixed} - y_{ref}|}{y_{ref}}
$$
#### Testing Results
The following table summarizes the accuracy and performance results for the reciprocal square-root (`1/√x`) implementation in **Q16.16 fixed-point** format.
| Variant | Description | Avg. Rel. Error (%) | Cycles (50 inputs) | Cycles/Input | Instructions |
|:--------|:-------------|--------------------:|-------------------:|--------------:|--------------:|
| **V1** | LUT + 1× Newton (linear interpolation) | **0.7401%** | **115,208** | **2,304** | **115,208** |
| **V2** | LUT + 2× Newton (linear interpolation) | **0.6298%** | **205,900** | **4,118** | **205,900** |
| **V3** | LUT (no interpolation) + 1× Newton | **7.1775%** | **101,723** | **2,034** | **101,723** |
- **Accuracy vs. work:** Moving from **1× Newton (V1)** to **2× Newton (V2)** improves average relative error only slightly (**0.7401% → 0.6298%**, ≈15% better), but nearly **doubles the cycles** (≈**+79%** total). In other words, the **marginal accuracy gain is not cost-effective** on RV32I.
- **Why interpolation matters:** Comparing **V1 (with linear interpolation)** to **V3 (no interpolation, still 1× Newton)** shows that a **cheap linear interpolation step** dramatically improves accuracy (**7.1775% → 0.7401%**, ≈**10×** better) with only a modest cycle increase.
Intuition: linear interpolation gives a **closer initial guess** $y_0$, so **one Newton step** lands much nearer the true $(1/\sqrt{x})$.
## Compiler Optimization Analysis
### Objective
In this section, we disassemble the ELF files generated by the C compiler under different optimization levels
and compare their assembly structure and runtime behavior.
The goal is to understand how compiler optimizations affect **code size**, **instruction count**, and **execution cycles**.
Specifically, we compare the following builds:
- `-O0` : No optimization (baseline)
- `-Os` : Optimize for code size
- `-Ofast` : Aggressive optimization for performance
Each version is compiled, disassembled, and executed on **RV32Emu**, with measured cycle and instruction counts.
---
### Methodology
Each version of the program was compiled under three optimization levels: `-O0`, `-Os`, and `-Ofast`,
using the same toolchain configuration (`rv32i_zicsr`, `ilp32`).
The generated ELF binaries were then disassembled and analyzed at the instruction level.
Only the **.text** segment was examined, as ELF headers and section tables remain structurally similar across optimization levels.
The analysis focused on:
- Function inlining and call elimination
- Stack frame allocation
- Instruction scheduling and branching pattern
- Memory access frequency
For performance evaluation, each ELF was executed in **RV32Emu**, which provides hardware counters for:
- **cycle** – total clock cycles consumed
- **instret** – number of retired instructions
The program prints both counters upon completion, allowing quantitative comparison of compiler optimizations.
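For reference, `get_cycles()` can be implemented as a CSR read pair (a sketch of the usual RV32 pattern; the course's `ticks.c` may differ in detail):
```c
#include <stdint.h>

/* Read the 64-bit cycle counter on RV32 via the cycle/cycleh CSR pair.
 * Re-read the high half to guard against a low-word rollover. */
static inline uint64_t get_cycles(void)
{
    uint32_t lo, hi, hi2;
    do {
        asm volatile("csrr %0, cycleh" : "=r"(hi));
        asm volatile("csrr %0, cycle"  : "=r"(lo));
        asm volatile("csrr %0, cycleh" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
/* get_instret() is analogous, using the instret/instreth CSRs. */
```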
---
### Performance Results
| Optimization Level | Code Size (bytes) | Cycles | Instructions | Relative Speed |
|:-------------------|------------------:|-------:|--------------:|----------------:|
| `-O0` | 3473 | 115,208 | 115,208 | Baseline |
| `-Os` | 1853 | 56,890 | 56,889 | **2.03× faster** |
| `-Ofast` | 2245 | 33,165 | 33,164 | **3.47× faster** |
---
### Observations from Disassembly
> **Disassembly Listings:**
> - [`dump_O0.S`](https://github.com/ab842650/ca2025_assignment2/blob/main/dump_o0.S) — baseline (`-O0`)
> - [`dump_Os.S`](https://github.com/ab842650/ca2025_assignment2/blob/main/dump_os.S) — optimized for size (`-Os`)
> - [`dump_Ofast.S`](https://github.com/ab842650/ca2025_assignment2/blob/main/dump_ofast.S) — optimized for speed (`-Ofast`)
#### (1) Function Inlining and Call Reduction
- **`-O0`** keeps all helper routines (`print_dec`, `udiv`, `umod`, `mul32x32`, `__mulsi3`)
as *separate function calls*. Each digit conversion in `print_dec` invokes `udiv` and `umod`,
producing many `jal` (jump-and-link) instructions and large stack frames.
- **`-Os`** begins inlining small helpers.
The decimal-printing loop is expanded in-place using shift-and-subtract instead of calling
`udiv/umod`. `mul32x32` still calls `__mulsi3`, but fewer times overall.
- **`-Ofast`** performs aggressive inlining and constant folding.
Hot paths (Newton iteration, interpolation) are flattened into straight-line arithmetic.
Most helper calls disappear entirely, which drastically cuts control-flow overhead.
#### (2) Stack Frame and Register Management
- **`-O0`** uses large stack frames and saves all callee-saved registers, even for leaf functions.
Many temporary values are spilled to memory (`sw/lw` pairs).
- **`-Os`** applies *leaf-function optimization*, minimizing stack size and omitting unnecessary saves.
The stack frame shrinks by roughly half.
- **`-Ofast`** goes further—flattened loops and full inlining reduce memory traffic,
keeping intermediate values entirely in registers.
#### (3) Arithmetic Transformation and Loop Simplification
- The unoptimized version performs general-purpose software division (`udiv`/`umod`)
in a bitwise loop for every decimal digit.
- Optimized versions inline that logic, replacing division with simple shifts and subtractions (see the sketch below).
- In `-Ofast`, arithmetic patterns like `(lo << 16) | (hi >> 16)` and
`(3u << 16) - xy2` are merged into compact instruction sequences.
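For illustration, the inlined pattern is equivalent to the following restoring divider (a sketch; `udiv_shift_sub` is a hypothetical name combining the original `udiv`/`umod` helpers):
```c
#include <stdint.h>

/* Restoring (shift-and-subtract) division: one quotient bit per step.
 * Assumes den != 0. */
static uint32_t udiv_shift_sub(uint32_t num, uint32_t den, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((num >> i) & 1);  /* bring down next dividend bit */
        if (r >= den) {                   /* trial subtraction succeeds?  */
            r -= den;
            q |= 1u << i;
        }
    }
    *rem = r;   /* umod result */
    return q;   /* udiv result */
}
```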
---
### Discussion
Compiler optimization clearly affects both **code structure** and **execution behavior**:
- `-Os` achieves **nearly half the code size** and doubles the speed by removing redundant calls
while preserving readability and correctness.
- `-Ofast` further improves speed by ~3.5× compared to `-O0`, thanks to aggressive inlining,
constant propagation, and tighter arithmetic sequences.
- Both optimizations reduce stack usage, function call overhead, and instruction count—
leading to shorter `.text` sections *and* faster runtime.
In summary,
> **Inlining + constant folding + reduced stack traffic = fewer cycles and smaller binaries.**
Even on a simple RV32I core without hardware multiply/divide,
compiler optimization significantly reshapes the generated assembly structure
and yields substantial performance gains.
---
## Handwritten RISC-V Optimization
### Objective
In this part, we go beyond compiler optimizations and try to **manually optimize** the core routine in RISC-V assembly.
The main goals are:
- Reduce **CSR cycle count** measured by `ticks.c` and `perfcounter`.
- Eliminate as many **function calls** as possible in the hot path.
- Apply low-level techniques such as **loop unrolling** and **peephole optimization**.
- Explore how far a carefully tuned RV32I assembly implementation can compete with (or beat) `gcc`-generated code.
---
### Implementation Strategy
To push performance further beyond compiler-generated code, I rewrote the entire reciprocal square-root routine directly in handwritten RV32I assembly.
#### No Function Calls
The entire algorithm is implemented as **one monolithic assembly block**, without calling any subroutines.
This removes:
- `jal`/`jalr` overhead
- stack frame setup/teardown
- register saving conventions
- unnecessary control-flow penalties
#### Fully Inlined Arithmetic
All arithmetic steps—initial LUT lookup, linear interpolation, and Newton–Raphson refinement—are **fully inlined**.
Since RV32I lacks hardware multiplication, every 32-bit × 32-bit multiply is expanded into a **manual shift-and-add multiplier**, producing both the 32-bit `hi` and `lo` words of the simulated 64-bit result.
This exactly mirrors the Q16 fixed-point computation of the C version.
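As a C model of that expansion (a sketch; the actual `rsqrt_fast.S` keeps everything in registers, but the algorithm is the same):
```c
#include <stdint.h>

/* C model of a 32x32 -> 64-bit shift-and-add multiply without the M
 * extension: accumulate the shifted multiplicand for each set bit of b. */
static void mul32x32_model(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_hi = 0, a_lo = a;   /* multiplicand as a 64-bit word pair */
    uint32_t h = 0, l = 0;         /* running 64-bit accumulator         */
    while (b) {
        if (b & 1) {
            uint32_t old = l;
            l += a_lo;
            h += a_hi + (l < old); /* propagate the carry into hi word   */
        }
        b >>= 1;
        a_hi = (a_hi << 1) | (a_lo >> 31);
        a_lo <<= 1;                /* shift multiplicand left by one     */
    }
    *hi = h;
    *lo = l;
}
```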
#### Register-Only Dataflow
Almost all computation stays entirely inside **temporary registers (`t0`–`t6`)**, minimizing the need for loads/stores.
Only two saved registers (`s1`, `s2`) are spilled once, using a single pair of `sw`/`lw` instructions in the prologue and epilogue.
#### Code
> - [`rsqrt_fast.S`](https://github.com/ab842650/ca2025_assignment2/blob/main/rsqrt_fast.S)
---
### Performance Comparison
To evaluate the effectiveness of the handwritten RV32I implementation,
I compared execution cycles using `ticks.c` and `perfcounter` under different compiler optimization levels.
Because this assignment focuses **only on CSR cycle counts**,
all I/O-heavy operations — especially `print_hex()` and any console printing —
were temporarily disabled to avoid polluting the measurement.
| Version | CSR Cycles | Relative Speed |
|-----------------------------|------------|-------------------|
| `gcc -O0` | 109,876 | 1.00× (baseline) |
| `gcc -Os` | 55,101 | ≈1.99× faster |
| `gcc -Ofast` | 31,115 | ≈3.53× faster |
| **Handwritten RV32I ASM** | **32,020** | **≈3.43× faster** |
#### Notes
- `-O0` serves as a baseline; the compiler emits virtually no optimizations.
- `-Os` produces compact code but occasionally sacrifices instruction-level performance.
- `-Ofast` enables aggressive math optimizations and inlines the function heavily.
- The **handwritten assembly version** removes function call overhead, uses only register-resident dataflow, and expands all multiplications manually, achieving a cycle count close to `-Ofast` and far below the other compiler builds.
#### Interpretation
The handwritten RV32I implementation clearly beats both `-O0` and `-Os`.
By removing all helper-function calls, avoiding stack usage in the hot path, and manually expanding every arithmetic operation (including the shift-and-add multipliers), the cycle count drops dramatically compared to the lower-optimization compiler builds.
However, the assembly version is still slightly slower than `gcc -Ofast`.
This is mainly because the compiler is able to fully inline the rsqrt routine into `main`, allowing instruction scheduling and cross-block optimization that are not possible in my current setup. My assembly implementation lives in a separate function, so each call to `rsqrt_fast` still incurs a prologue/epilogue and prevents whole-function inlining.
If the entire `main` loop were rewritten in assembly—or if the rsqrt code were injected directly into the loop body—the gap between the handwritten version and `-Ofast` would likely shrink further.