# Assignment2: Complete Applications
> [GitHub](https://github.com/seallllllllll/arch2025-homework2)
This note documents my work for Assignment 2:
* Setting up the bare-metal environment on rv32emu
* Porting HW1 and Quiz 2A / Quiz 3C to this environment
* Comparing compiler-generated assembly under different optimization levels
* Measuring performance (cycles / instructions) and doing at least one manual optimization
* Analyzing the numerical precision of Quiz 3C (fast reciprocal square root in Q16.16)
All code lives in my course repository under `app/`, `quiz2A/`, and `quiz3C/`.
---
## 1. Environment and toolchain
### 1.1 Toolchain and emulator
* Cross compiler: `riscv-none-elf-gcc` 14.2.0
* ISA / ABI: RV32I with Zicsr, ILP32 ABI
* Emulator: `rv32emu` (system emulator from course repository)
* Host OS: macOS, running rv32emu built from source
Compiler flags are unified across all programs:
```bash
-march=rv32i_zicsr -mabi=ilp32 -ffreestanding -nostdlib -nostartfiles
```
I use `linker.ld` derived from `tests/system/playground`, with only basic sections and a simple stack layout.
### 1.2 Bare-metal skeleton
I started from `tests/system/playground` in the rv32emu repository and copied it into my own repo as `app/`, keeping only the minimal skeleton:
* `crt0.S` – reset entry, stack setup, call into `main`.
* `syscalls.c` – thin wrappers for Linux syscalls through `ecall`.
* `ticks.c` / `perfcounter.c` – helpers to read CSRs such as `mcycle` and `minstret`.
* A small `main.c` that prints `"OK\n"` and exits, just to verify the environment.
The build system uses a simple `Makefile` that produces an ELF file and runs it with rv32emu:
```bash
make run-hw1
make run-quiz2a
make run-quiz3c OPT=-O0 # or -Os / -Ofast
```
---
## 2. Porting HW1 to bare-metal rv32emu
### 2.1 Background
The original HW1 program was written for Ripes and used its teaching syscalls:
* `print_int` (a7 = 1)
* `print_string` (a7 = 4)
* `print_char` (a7 = 11)
* `exit` (a7 = 93, but semantics different)
These syscall numbers exist only in Ripes. rv32emu follows the Linux RISC-V syscall ABI, so I needed to replace all Ripes-specific syscalls and remove any dependency on the host libc.
### 2.2 Syscall adaptation
I implemented a small set of bare-metal wrappers in C/assembly:
* `write(int fd, const void *buf, size_t len)`
* `a7 = 64` (`SYS_write`), `a0 = fd`, `a1 = buf`, `a2 = len`, then `ecall`.
* `exit(int status)`
* `a7 = 93` (`SYS_exit`), `a0 = status`, then `ecall`.
On top of `write`, I reimplemented:
* `print_str(const char *s)` – write a C string to stdout (`fd = 1`).
* `print_char(int ch)` – write a single character.
* `print_int(unsigned int x)` – decimal printing routine implemented in pure RV32I (no `div` / `mul`), using a table of powers of 10 and repeated subtraction, similar to the version used in Quiz 2A.
### 2.3 Integrating the HW1 program
In the copied `app/` directory, I:
1. Added my HW1 assembly source file to the `Makefile`’s object list.
2. Ensured that `crt0.S` calls `main` and that `main` returns via `exit(0)`.
3. Linked everything with my `linker.ld` to produce `hw1.elf`.
Running `make run-hw1` in the `app/` directory executes the ELF under rv32emu and prints the same output as in Ripes. I also added calls to `get_cycles()` and `get_instret()` (from `perfcounter`) to measure performance around the main computation:
```c
uint64_t start_cycles = get_cycles();
uint64_t start_instret = get_instret();
/* HW1 core computation */
uint64_t end_cycles = get_cycles();
uint64_t end_instret = get_instret();
print_str("Cycles: ");
print_int((unsigned long)(end_cycles - start_cycles));
print_str("\nInstructions: ");
print_int((unsigned long)(end_instret - start_instret));
print_str("\n");
```

In my measurement run, HW1 takes on the order of 8×10⁴ cycles and instructions on rv32emu, which is reasonable for a simple loop with several prints.
---
## 3. Quiz 2A: Tower of Hanoi (Problem A)
### 3.1 Porting strategy
Quiz 2A provides an RV32I assembly solution for the 3-disk Tower of Hanoi using Gray codes. The original quiz environment assumes the Ripes syscall interface and does not measure performance. I adapted it to the same bare-metal rv32emu skeleton as HW1.
Key changes in `quiz2a_main.S`:
* Stack frame extended to store:
* `disk[3]` array for peg positions,
* `start_cycles` and `start_instret`.
* After initialization, I call:
```asm
jal ra, get_cycles
sw a0, 32(x2) # start_cycles
jal ra, get_instret
sw a0, 36(x2) # start_instret
```
* At the end (`finish_game`), I compute and print the differences:
```asm
jal ra, get_cycles
lw t0, 32(x2)
sub t0, a0, t0 # cycles elapsed
la a0, msg_cycles # "Cycles: "
jal ra, print_str
mv a0, t0
jal ra, print_int
jal ra, get_instret
lw t0, 36(x2)
sub t0, a0, t0 # instret elapsed
la a0, msg_instret # "\nInstructions: "
jal ra, print_str
mv a0, t0
jal ra, print_int
jal ra, print_nl
```
* The decimal printing routine `print_int` is implemented without any multiply or divide instructions, using a power-of-10 table and repeated subtraction.
### 3.2 Correctness and performance
The adapted program still prints every move:

followed by the measured cycles and instructions. For 3 disks, the number of moves is fixed at 7, so the computation cost is small compared to the I/O cost of printing each move.
Using `get_cycles()` and `get_instret()`, I confirmed that:
* The total cycles scale roughly with the number of printed moves.
* Most of the cost comes from the repeated `print_int` and `print_str` calls rather than from the Gray-code logic itself.
I did not perform an extensive optimization study on Quiz 2A, because the main focus of the assignment is on Quiz 3C.
---
## 4. Quiz 3C: fast reciprocal square root in Q16.16
Quiz 3C implements a fast reciprocal square root `fast_rsqrt(x)` for 32-bit unsigned integers. The function returns `1/sqrt(x)` in Q16.16 fixed-point format.
### 4.1 Implementation overview
The implementation in `quiz3c_main.c` has four stages:
1. **Input handling**
* For `x == 0`, it returns 0 to avoid division by zero.
* For `x > 0`, it computes the exponent of the most significant bit:
```c
int exp = 31 - clz32(x);
```
2. **Lookup-table initial estimate**
* A 32-entry table `rsqrt_table[32]` stores Q16.16 approximations of `1/sqrt(2^exp)` for `exp = 0..31`.
* For a given `exp`, it loads `y_base = rsqrt_table[exp]` and `y_next = rsqrt_table[exp+1]` (or 0 at the end of the range).
3. **Linear interpolation in Q0.16**
* It computes a fractional position of `x` between `2^exp` and `2^{exp+1}`:
```c
uint32_t frac = (uint32_t)((((uint64_t)x - (1ULL << exp)) << 16) >> exp);
```
* Then it linearly interpolates between `y_base` and `y_next` in Q16.16:
```c
uint32_t delta = y_base - y_next;
uint32_t y = y_base - (uint32_t)(((uint64_t)delta * frac) >> 16);
```
4. **Two Newton–Raphson refinement steps**
* Using the standard reciprocal square root iteration:
```c
for (int i = 0; i < 2; ++i) {
uint64_t y2 = (uint64_t)y * y; // Q32.32
uint64_t xy2 = ((uint64_t)x * y2) >> 16; // x*y^2 / 2^16
uint64_t t = ((uint64_t)3 << 16) - xy2;
uint64_t prod = (uint64_t)y * t;
y = (uint32_t)(prod >> 17);
}
```
* The result is returned as a 32-bit Q16.16 integer.
### 4.2 Test harness and measurement
The `main` function runs `fast_rsqrt(x)` for a set of test values and prints the Q16.16 results. It also measures cycles and instructions using `get_cycles()` / `get_instret()` from the perfcounter helpers.
Test inputs:
```c
const uint32_t xs[] = {1, 2, 3, 4, 16, 100, 1000, 65535};
```
For each `x`, the program:
1. Calls `fast_rsqrt(x)` to compute `y`.
2. Handles the special case `x == 1` by clamping `y = 65536` (exact Q16.16 value of 1.0).
3. Prints `x` and `y` using `print_dec`.
4. After the loop, prints the elapsed cycles and instructions.
I built three variants with different optimization levels:
* `-O0`
* `-Os`
* `-Ofast`
Each ELF was run on rv32emu using a `make` target such as:
```bash
make run-quiz3c OPT=-Os
```
---
## 5. Performance comparison for Quiz 3C
### 5.1 Compiler-flag comparison
Measured performance (full test loop including printing):
| Optimization | Cycles | Instructions | ΔCycles vs `-Os` | ΔInst vs `-Os` | Relative cycles |
| ------------ | ------ | ------------ | ---------------- | -------------- | ---------------- |
| `-O0` | 82218 | 82218 | +45269 | +45270 | ≈ 2.23× slower |
| `-Os` | 36949 | 36948 | – | – | 1.00× (baseline) |
| `-Ofast` | 37020 | 37019 | +71 | +71 | ≈ 1.002× slower |
(Values taken from the `Cycles:` / `Instructions:` lines printed by the program on rv32emu.)
* `-O0`

* `-Os`

* `-Ofast`

Observations:
* `-O0` is about 2.23× slower than `-Os`, which matches the expectation that unoptimized code executes many more instructions.
* `-Ofast` is slightly slower than `-Os` (37020 vs. 36949 cycles, about 0.2% difference) and executes about 70 more instructions.
* In this benchmark, the main cost comes from I/O and decimal printing (`print_dec`, which internally performs repeated `udiv` / `umod`), not from the arithmetic in `fast_rsqrt` itself. `-Ofast` tends to inline and expand some code, which slightly increases instruction count without reducing the number of high-level operations (prints and divisions).
### 5.2 Disassembly comparison (`-O0` vs. `-Os` vs. `-Ofast`)
To understand the performance differences, I disassembled:
* `quiz3c_O0.elf`
* `quiz3c_Os.elf`
* `quiz3c_Ofast.elf`
using:
```bash
riscv-none-elf-objdump -d -S quiz3c_O0.elf > quiz3c_O0.dis
riscv-none-elf-objdump -d -S quiz3c_Os.elf > quiz3c_Os.dis
riscv-none-elf-objdump -d -S quiz3c_Ofast.elf > quiz3c_Ofast.dis
```
I focused on `main`, `fast_rsqrt`, and the printing helpers.
Key observations:
1. **Prologue / epilogue and stack usage**
* Under `-O0`, each non-trivial function has a large stack frame and saves many callee-saved registers (`s0`–`sN`). Every call involves multiple `sw`/`lw` instructions.
* Under `-Os`, the compiler saves fewer `s` registers and keeps more temporaries in `t0`–`t6`. The prologues/epilogues are shorter, reducing stack traffic.
* `-Ofast` has a stack layout very similar to `-Os`; the difference in cycles is not primarily from stack usage.
2. **Function calls vs. inlining**
* `-O0` emits explicit `jal` instructions to `fast_rsqrt` and `print_dec` for every test, along with many `mv` / `addi` around calls.
* `-Os` inlines small helper logic and removes redundant moves, reducing the number of `jal` instructions inside the hot loop.
* `-Ofast` inlines even more arithmetic expressions, but because I/O dominates, the extra inlining does not yield meaningful speedup.
3. **Loop structure**
* Under `-O0`, loops follow the C structure directly with separate increment and compare instructions.
* Under `-Os` and `-Ofast`, loop bounds are hoisted, comparisons are merged with branches, and some variables stay in registers instead of memory.
Overall, the disassembly matches the measurement: `-Os` minimizes both code size and dynamic instruction count, and therefore gives the best performance on this simple in-order RV32I model.
---
## 6. Precision analysis for `fast_rsqrt` (Q16.16)
### 6.1 Q16.16 representation
`fast_rsqrt` returns its result in unsigned Q16.16 fixed-point format:
* A 32-bit integer `q` encodes the real value
$v = \frac{q}{2^{16}}.$
* Examples:
* `65536` represents `1.0`.
* `32768` represents `0.5`.
* The ideal value of `1/sqrt(2) ≈ 0.7071` is encoded as `46341`.
### 6.2 Reference computation
To evaluate precision, I selected test values covering small numbers, medium values, powers of two, and the largest input used in the benchmark:
$x \in {1, 2, 3, 4, 5, 10, 16, 100, 256, 1000, 4096, 65535}.$
On the host (double precision), I computed:
$\text{ref}(x) = \frac{1.0}{\sqrt{x}}$
and converted each reference to Q16.16 by rounding:
$\text{reference_q16}(x) =
\text{round}(\text{ref}(x) \times 2^{16}).$
For each `x`, I then:
* Ran `fast_rsqrt(x)` on rv32emu to obtain `fast_rsqrt(x)` in Q16.16.
* Recorded:
* absolute error: `fast_rsqrt(x) - reference_q16(x)` (in LSBs),
* relative error: `abs(error) / reference_q16(x)` as a percentage.
The results are summarized in the table below:
| x | fast_rsqrt (Q16.16) | reference (Q16.16) | abs error | rel error (%) |
| ----- | ------------------- | ------------------ | --------- | ------------- |
| 1 | 65536 | 65536 | 0 | 0.000000 |
| 2 | 46341 | 46341 | 0 | 0.000000 |
| 3 | 37836 | 37837 | -1 | 0.002643 |
| 4 | 32768 | 32768 | 0 | 0.000000 |
| 5 | 29308 | 29309 | -1 | 0.003412 |
| 10 | 20724 | 20724 | 0 | 0.000000 |
| 16 | 16384 | 16384 | 0 | 0.000000 |
| 100 | 6553 | 6554 | -1 | 0.015258 |
| 256 | 4096 | 4096 | 0 | 0.000000 |
| 1000 | 2072 | 2072 | 0 | 0.000000 |
| 4096 | 1024 | 1024 | 0 | 0.000000 |
| 65535 | 255 | 256 | -1 | 0.390625 |
### 6.3 Discussion
* For the range most relevant to the assignment (`1 ≤ x ≤ 4096`), `fast_rsqrt` is always within ±1 LSB of the rounded Q16.16 reference.
* The maximum relative error in this range is around 0.015% (for `x = 100`), which is more than accurate enough for typical fixed-point applications.
* The largest relative error appears at `x = 65535`, where the ideal value is encoded as `256` but the implementation returns `255`. This is still only one LSB of error, corresponding to about 0.39%.
#### Special handling for `x == 1`
Because `1/sqrt(1)` is exactly `1.0`, represented as `65536` in Q16.16, I explicitly clamp this case in the test loop:
```c
if (x == 1)
y = 65536u; /* exact 1.0 in Q16.16 */
```
Without this, the combination of interpolation and Newton–Raphson can produce a value off by ±1 LSB. The clamp guarantees that this numerically important point is always exact.
---
## 7. Manual optimization: reducing I/O overhead in Quiz 3C
The assignment requires at least one example of “modify the code → re-measure” beyond just changing compiler flags. I used Quiz 3C as the case study.
### 7.1 Baseline vs. optimized variant
**Baseline (detailed output)**
* For each test input `x` in `{1, 2, 3, 4, 16, 100, 1000, 65535}`:
* Call `fast_rsqrt(x)`.
* Print `x` and the Q16.16 result `y` using `print_dec` and `print_str`.
* Handle `x == 1` with the clamp to `65536`.
* After the loop, read `mcycle` and `minstret`, and print the total `Cycles` and `Instructions`.
This version is easy to debug but performs a lot of decimal printing. `print_dec` internally uses repeated `udiv` / `umod` and is relatively expensive on RV32I.
**Optimized I/O version**
To isolate the cost of the algorithm itself, I introduced a compile-time switch (e.g., `DETAILED_OUTPUT`) around the per-test printing:
* In **baseline mode**, printing inside the loop is enabled.
* In **optimized I/O mode**, the loop only computes results; it no longer prints every `(x, y)` pair. Only the final summary (`Cycles`, `Instructions`) is printed at the end.
To prevent the compiler from optimizing the entire loop away under `-Os`, I added a checksum:
```c
uint32_t checksum = 0;
for (int i = 0; i < N; ++i) {
uint32_t x = xs[i];
uint32_t y = fast_rsqrt(x);
checksum ^= y; // use the result
}
static volatile uint32_t rsqrt_sink;
rsqrt_sink = checksum; // observable side effect
```
The volatile store ensures that the loop and the calls to `fast_rsqrt` cannot be removed.
### 7.2 Pitfall: “too good to be true” counters
In an intermediate experiment, I removed the per-test printing but forgot to add the checksum and volatile sink. Under `-Os`, the compiler noticed that the loop had no observable side effects and optimized it out completely. The counters reported:
* `Cycles: 14`
* `Instructions: 14`
These values are obviously unrealistic for a benchmark that is supposed to execute several iterations with 64-bit multiplications and shifts: essentially, only the CSR reads and minimal scaffolding remained. This convinced me that I must force an observable use of the results.
After adding the checksum and volatile sink, the measured cycles and instructions again reflect the real cost of the algorithm.
### 7.3 Measurement results
Using `-Os` for both variants and the same measurement window (only the test loop, not `main` initialization), I obtained:
| Version | Cycles | Instructions | Note |
| ------------- | ------ | ------------ | -------------------------------------- |
| baseline | 36949 | 36948 | print all test cases in the loop |
| optimized I/O | 9854 | 9853 | compute + checksum, only print summary |
Compared with the baseline, the optimized version reduces both cycles and instructions by roughly 73%:
$\text{reduction} \approx
\frac{36949 - 9854}{36949}
\approx 73.3%.$
This experiment clearly shows that:
* In the baseline configuration, most of the time is spent in formatted output and integer-to-decimal conversion (`print_dec` with repeated `udiv` / `umod`).
* Once per-test I/O is removed and results are only used to update a checksum, the measured cost drops dramatically, and the counters better represent the cost of the `fast_rsqrt` algorithm itself.
This satisfies the assignment requirement of providing at least one example of “manual optimization → re-measure → explain the difference”.
---
## 8. Summary
* I set up a bare-metal RV32I + Zicsr environment on rv32emu, using a minimal skeleton derived from `tests/system/playground` and a unified set of compiler flags.
* HW1 and Quiz 2A were successfully ported by replacing Ripes-specific syscalls with Linux-style `write` / `exit` and by providing pure-RV32I printing routines.
* For Quiz 3C, I implemented a table-based + Newton–Raphson `fast_rsqrt` function in Q16.16, measured its performance under `-O0`, `-Os`, and `-Ofast`, and compared the generated assembly.
* Precision analysis shows that `fast_rsqrt` is within ±1 LSB of the rounded Q16.16 reference for all tested inputs, with relative error below about 0.4%.
* A simple manual optimization that removes per-test I/O (while preserving a checksum to keep the loop alive) reduces the measured cycles and instructions by more than 70%, demonstrating how much overhead comes from formatted printing rather than from the core numeric algorithm.