# Assignment 2: Complete Applications

> [GitHub](https://github.com/seallllllllll/arch2025-homework2)

This note documents my work for Assignment 2:

* Setting up the bare-metal environment on rv32emu
* Porting HW1 and Quiz 2A / Quiz 3C to this environment
* Comparing compiler-generated assembly under different optimization levels
* Measuring performance (cycles / instructions) and doing at least one manual optimization
* Analyzing the numerical precision of Quiz 3C (fast reciprocal square root in Q16.16)

All code lives in my course repository under `app/`, `quiz2A/`, and `quiz3C/`.

---

## 1. Environment and toolchain

### 1.1 Toolchain and emulator

* Cross compiler: `riscv-none-elf-gcc` 14.2.0
* ISA / ABI: RV32I with Zicsr, ILP32 ABI
* Emulator: `rv32emu` (system emulator from the course repository)
* Host OS: macOS, running rv32emu built from source

Compiler flags are unified across all programs:

```bash
-march=rv32i_zicsr -mabi=ilp32 -ffreestanding -nostdlib -nostartfiles
```

I use a `linker.ld` derived from `tests/system/playground`, with only basic sections and a simple stack layout.

### 1.2 Bare-metal skeleton

I started from `tests/system/playground` in the rv32emu repository and copied it into my own repo as `app/`, keeping only the minimal skeleton:

* `crt0.S` – reset entry, stack setup, call into `main`.
* `syscalls.c` – thin wrappers for Linux syscalls through `ecall`.
* `ticks.c` / `perfcounter.c` – helpers to read CSRs such as `mcycle` and `minstret`.
* A small `main.c` that prints `"OK\n"` and exits, just to verify the environment.

The build system uses a simple `Makefile` that produces an ELF file and runs it with rv32emu:

```bash
make run-hw1
make run-quiz2a
make run-quiz3c OPT=-O0   # or -Os / -Ofast
```

---

## 2. Porting HW1 to bare-metal rv32emu

### 2.1 Background

The original HW1 program was written for Ripes and used its teaching syscalls:

* `print_int` (a7 = 1)
* `print_string` (a7 = 4)
* `print_char` (a7 = 11)
* `exit` (a7 = 93, but with different semantics)

Except for `exit`, these syscall numbers exist only in Ripes. rv32emu follows the Linux RISC-V syscall ABI, so I needed to replace all Ripes-specific syscalls and remove any dependency on the host libc.

### 2.2 Syscall adaptation

I implemented a small set of bare-metal wrappers in C/assembly:

* `write(int fd, const void *buf, size_t len)`
  * `a7 = 64` (`SYS_write`), `a0 = fd`, `a1 = buf`, `a2 = len`, then `ecall`.
* `exit(int status)`
  * `a7 = 93` (`SYS_exit`), `a0 = status`, then `ecall`.

On top of `write`, I reimplemented:

* `print_str(const char *s)` – write a C string to stdout (`fd = 1`).
* `print_char(int ch)` – write a single character.
* `print_int(unsigned int x)` – decimal printing routine implemented in pure RV32I (no `div` / `mul`), using a table of powers of 10 and repeated subtraction, similar to the version used in Quiz 2A.

### 2.3 Integrating the HW1 program

In the copied `app/` directory, I:

1. Added my HW1 assembly source file to the `Makefile`’s object list.
2. Ensured that `crt0.S` calls `main` and that `main` returns via `exit(0)`.
3. Linked everything with my `linker.ld` to produce `hw1.elf`.

Running `make run-hw1` in the `app/` directory executes the ELF under rv32emu and prints the same output as in Ripes.
I also added calls to `get_cycles()` and `get_instret()` (from `perfcounter`) to measure performance around the main computation:

```c
uint64_t start_cycles = get_cycles();
uint64_t start_instret = get_instret();

/* HW1 core computation */

uint64_t end_cycles = get_cycles();
uint64_t end_instret = get_instret();

print_str("Cycles: ");
print_int((unsigned int)(end_cycles - start_cycles));
print_str("\nInstructions: ");
print_int((unsigned int)(end_instret - start_instret));
print_str("\n");
```

![image](https://hackmd.io/_uploads/SyBGAVuGZg.png)

In my measurement run, HW1 takes on the order of 8×10⁴ cycles and instructions on rv32emu, which is reasonable for a simple loop with several prints.

---

## 3. Quiz 2A: Tower of Hanoi (Problem A)

### 3.1 Porting strategy

Quiz 2A provides an RV32I assembly solution for the 3-disk Tower of Hanoi using Gray codes. The original quiz environment assumes the Ripes syscall interface and does not measure performance. I adapted it to the same bare-metal rv32emu skeleton as HW1.

Key changes in `quiz2a_main.S`:

* Stack frame extended to store:
  * `disk[3]` array for peg positions,
  * `start_cycles` and `start_instret`.
* After initialization, I call:

  ```asm
  jal  ra, get_cycles
  sw   a0, 32(x2)        # start_cycles
  jal  ra, get_instret
  sw   a0, 36(x2)        # start_instret
  ```

* At the end (`finish_game`), I compute and print the differences:

  ```asm
  jal  ra, get_cycles
  lw   t0, 32(x2)
  sub  t0, a0, t0        # cycles elapsed
  la   a0, msg_cycles    # "Cycles: "
  jal  ra, print_str
  mv   a0, t0
  jal  ra, print_int

  jal  ra, get_instret
  lw   t0, 36(x2)
  sub  t0, a0, t0        # instret elapsed
  la   a0, msg_instret   # "\nInstructions: "
  jal  ra, print_str
  mv   a0, t0
  jal  ra, print_int
  jal  ra, print_nl
  ```

* The decimal printing routine `print_int` is implemented without any multiply or divide instructions, using a power-of-10 table and repeated subtraction.
### 3.2 Correctness and performance

The adapted program still prints every move:

![image](https://hackmd.io/_uploads/BkhIaN_zbl.png)

followed by the measured cycles and instructions.

For 3 disks, the number of moves is fixed at 7, so the computation cost is small compared to the I/O cost of printing each move. Using `get_cycles()` and `get_instret()`, I confirmed that:

* The total cycles scale roughly with the number of printed moves.
* Most of the cost comes from the repeated `print_int` and `print_str` calls rather than from the Gray-code logic itself.

I did not perform an extensive optimization study on Quiz 2A, because the main focus of the assignment is on Quiz 3C.

---

## 4. Quiz 3C: fast reciprocal square root in Q16.16

Quiz 3C implements a fast reciprocal square root `fast_rsqrt(x)` for 32-bit unsigned integers. The function returns `1/sqrt(x)` in Q16.16 fixed-point format.

### 4.1 Implementation overview

The implementation in `quiz3c_main.c` has four stages:

1. **Input handling**
   * For `x == 0`, it returns 0 to avoid division by zero.
   * For `x > 0`, it computes the exponent of the most significant bit:

     ```c
     int exp = 31 - clz32(x);
     ```

2. **Lookup-table initial estimate**
   * A 32-entry table `rsqrt_table[32]` stores Q16.16 approximations of `1/sqrt(2^exp)` for `exp = 0..31`.
   * For a given `exp`, it loads `y_base = rsqrt_table[exp]` and `y_next = rsqrt_table[exp+1]` (or 0 at the end of the range).

3. **Linear interpolation in Q0.16**
   * It computes a fractional position of `x` between `2^exp` and `2^{exp+1}`:

     ```c
     uint32_t frac = (uint32_t)((((uint64_t)x - (1ULL << exp)) << 16) >> exp);
     ```

   * Then it linearly interpolates between `y_base` and `y_next` in Q16.16:

     ```c
     uint32_t delta = y_base - y_next;
     uint32_t y = y_base - (uint32_t)(((uint64_t)delta * frac) >> 16);
     ```

4. **Two Newton–Raphson refinement steps**
   * Using the standard reciprocal square root iteration:

     ```c
     for (int i = 0; i < 2; ++i) {
         uint64_t y2   = (uint64_t)y * y;           // Q32.32
         uint64_t xy2  = ((uint64_t)x * y2) >> 16;  // x*y^2 / 2^16
         uint64_t t    = ((uint64_t)3 << 16) - xy2;
         uint64_t prod = (uint64_t)y * t;
         y = (uint32_t)(prod >> 17);
     }
     ```

   * The result is returned as a 32-bit Q16.16 integer.

### 4.2 Test harness and measurement

The `main` function runs `fast_rsqrt(x)` for a set of test values and prints the Q16.16 results. It also measures cycles and instructions using `get_cycles()` / `get_instret()` from the perfcounter helpers.

Test inputs:

```c
const uint32_t xs[] = {1, 2, 3, 4, 16, 100, 1000, 65535};
```

For each `x`, the program:

1. Calls `fast_rsqrt(x)` to compute `y`.
2. Handles the special case `x == 1` by clamping `y = 65536` (exact Q16.16 value of 1.0).
3. Prints `x` and `y` using `print_dec`.
4. After the loop, prints the elapsed cycles and instructions.

I built three variants with different optimization levels:

* `-O0`
* `-Os`
* `-Ofast`

Each ELF was run on rv32emu using a `make` target such as:

```bash
make run-quiz3c OPT=-Os
```

---

## 5. Performance comparison for Quiz 3C

### 5.1 Compiler-flag comparison

Measured performance (full test loop including printing):

| Optimization | Cycles | Instructions | ΔCycles vs `-Os` | ΔInst vs `-Os` | Relative cycles  |
| ------------ | ------ | ------------ | ---------------- | -------------- | ---------------- |
| `-O0`        | 82218  | 82218        | +45269           | +45270         | ≈ 2.23× slower   |
| `-Os`        | 36949  | 36948        | –                | –              | 1.00× (baseline) |
| `-Ofast`     | 37020  | 37019        | +71              | +71            | ≈ 1.002× slower  |

(Values taken from the `Cycles:` / `Instructions:` lines printed by the program on rv32emu.)
* `-O0`

  ![image](https://hackmd.io/_uploads/ryhXlS_Gbl.png)

* `-Os`

  ![image](https://hackmd.io/_uploads/S1pzlH_z-l.png)

* `-Ofast`

  ![image](https://hackmd.io/_uploads/S16beH_zbe.png)

Observations:

* `-O0` is about 2.23× slower than `-Os`, which matches the expectation that unoptimized code executes many more instructions.
* `-Ofast` is slightly slower than `-Os` (37020 vs. 36949 cycles, about 0.2% difference) and executes about 70 more instructions.
* In this benchmark, the main cost comes from I/O and decimal printing (`print_dec`, which internally performs repeated `udiv` / `umod`), not from the arithmetic in `fast_rsqrt` itself. `-Ofast` tends to inline and expand some code, which slightly increases instruction count without reducing the number of high-level operations (prints and divisions).

### 5.2 Disassembly comparison (`-O0` vs. `-Os` vs. `-Ofast`)

To understand the performance differences, I disassembled:

* `quiz3c_O0.elf`
* `quiz3c_Os.elf`
* `quiz3c_Ofast.elf`

using:

```bash
riscv-none-elf-objdump -d -S quiz3c_O0.elf    > quiz3c_O0.dis
riscv-none-elf-objdump -d -S quiz3c_Os.elf    > quiz3c_Os.dis
riscv-none-elf-objdump -d -S quiz3c_Ofast.elf > quiz3c_Ofast.dis
```

I focused on `main`, `fast_rsqrt`, and the printing helpers. Key observations:

1. **Prologue / epilogue and stack usage**
   * Under `-O0`, each non-trivial function has a large stack frame and saves many callee-saved registers (`s0`–`sN`). Every call involves multiple `sw`/`lw` instructions.
   * Under `-Os`, the compiler saves fewer `s` registers and keeps more temporaries in `t0`–`t6`. The prologues/epilogues are shorter, reducing stack traffic.
   * `-Ofast` has a stack layout very similar to `-Os`; the difference in cycles is not primarily from stack usage.

2. **Function calls vs. inlining**
   * `-O0` emits explicit `jal` instructions to `fast_rsqrt` and `print_dec` for every test, along with many `mv` / `addi` around calls.
   * `-Os` inlines small helper logic and removes redundant moves, reducing the number of `jal` instructions inside the hot loop.
   * `-Ofast` inlines even more arithmetic expressions, but because I/O dominates, the extra inlining does not yield meaningful speedup.

3. **Loop structure**
   * Under `-O0`, loops follow the C structure directly with separate increment and compare instructions.
   * Under `-Os` and `-Ofast`, loop bounds are hoisted, comparisons are merged with branches, and some variables stay in registers instead of memory.

Overall, the disassembly matches the measurement: `-Os` minimizes both code size and dynamic instruction count, and therefore gives the best performance on this simple in-order RV32I model.

---

## 6. Precision analysis for `fast_rsqrt` (Q16.16)

### 6.1 Q16.16 representation

`fast_rsqrt` returns its result in unsigned Q16.16 fixed-point format:

* A 32-bit integer `q` encodes the real value $v = \frac{q}{2^{16}}$.
* Examples:
  * `65536` represents `1.0`.
  * `32768` represents `0.5`.
  * The ideal value of `1/sqrt(2) ≈ 0.7071` is encoded as `46341`.

### 6.2 Reference computation

To evaluate precision, I selected test values covering small numbers, medium values, powers of two, and the largest input used in the benchmark:

$x \in \{1, 2, 3, 4, 5, 10, 16, 100, 256, 1000, 4096, 65535\}.$

On the host (double precision), I computed

$\text{ref}(x) = \frac{1.0}{\sqrt{x}}$

and converted each reference to Q16.16 by rounding:

$\text{reference\_q16}(x) = \text{round}(\text{ref}(x) \times 2^{16}).$

For each `x`, I then:

* Ran `fast_rsqrt(x)` on rv32emu to obtain `fast_rsqrt(x)` in Q16.16.
* Recorded:
  * absolute error: `fast_rsqrt(x) - reference_q16(x)` (in LSBs),
  * relative error: `abs(error) / reference_q16(x)` as a percentage.
The results are summarized in the table below:

| x     | fast_rsqrt (Q16.16) | reference (Q16.16) | abs error | rel error (%) |
| ----- | ------------------- | ------------------ | --------- | ------------- |
| 1     | 65536               | 65536              | 0         | 0.000000      |
| 2     | 46341               | 46341              | 0         | 0.000000      |
| 3     | 37836               | 37837              | -1        | 0.002643      |
| 4     | 32768               | 32768              | 0         | 0.000000      |
| 5     | 29308               | 29309              | -1        | 0.003412      |
| 10    | 20724               | 20724              | 0         | 0.000000      |
| 16    | 16384               | 16384              | 0         | 0.000000      |
| 100   | 6553                | 6554               | -1        | 0.015258      |
| 256   | 4096                | 4096               | 0         | 0.000000      |
| 1000  | 2072                | 2072               | 0         | 0.000000      |
| 4096  | 1024                | 1024               | 0         | 0.000000      |
| 65535 | 255                 | 256                | -1        | 0.390625      |

### 6.3 Discussion

* For the range most relevant to the assignment (`1 ≤ x ≤ 4096`), `fast_rsqrt` is always within ±1 LSB of the rounded Q16.16 reference.
* The maximum relative error in this range is around 0.015% (for `x = 100`), which is more than accurate enough for typical fixed-point applications.
* The largest relative error appears at `x = 65535`, where the ideal value is encoded as `256` but the implementation returns `255`. This is still only one LSB of error, corresponding to about 0.39%.

#### Special handling for `x == 1`

Because `1/sqrt(1)` is exactly `1.0`, represented as `65536` in Q16.16, I explicitly clamp this case in the test loop:

```c
if (x == 1)
    y = 65536u; /* exact 1.0 in Q16.16 */
```

Without this, the combination of interpolation and Newton–Raphson can produce a value off by ±1 LSB. The clamp guarantees that this numerically important point is always exact.

---

## 7. Manual optimization: reducing I/O overhead in Quiz 3C

The assignment requires at least one example of “modify the code → re-measure” beyond just changing compiler flags. I used Quiz 3C as the case study.

### 7.1 Baseline vs. optimized variant

**Baseline (detailed output)**

* For each test input `x` in `{1, 2, 3, 4, 16, 100, 1000, 65535}`:
  * Call `fast_rsqrt(x)`.
  * Print `x` and the Q16.16 result `y` using `print_dec` and `print_str`.
  * Handle `x == 1` with the clamp to `65536`.
* After the loop, read `mcycle` and `minstret`, and print the total `Cycles` and `Instructions`.

This version is easy to debug but performs a lot of decimal printing. `print_dec` internally uses repeated `udiv` / `umod` and is relatively expensive on RV32I.

**Optimized I/O version**

To isolate the cost of the algorithm itself, I introduced a compile-time switch (e.g., `DETAILED_OUTPUT`) around the per-test printing:

* In **baseline mode**, printing inside the loop is enabled.
* In **optimized I/O mode**, the loop only computes results; it no longer prints every `(x, y)` pair. Only the final summary (`Cycles`, `Instructions`) is printed at the end.

To prevent the compiler from optimizing the entire loop away under `-Os`, I added a checksum:

```c
uint32_t checksum = 0;
for (int i = 0; i < N; ++i) {
    uint32_t x = xs[i];
    uint32_t y = fast_rsqrt(x);
    checksum ^= y;              // use the result
}

static volatile uint32_t rsqrt_sink;
rsqrt_sink = checksum;          // observable side effect
```

The volatile store ensures that the loop and the calls to `fast_rsqrt` cannot be removed.

### 7.2 Pitfall: “too good to be true” counters

In an intermediate experiment, I removed the per-test printing but forgot to add the checksum and volatile sink. Under `-Os`, the compiler noticed that the loop had no observable side effects and optimized it out completely. The counters reported:

* `Cycles: 14`
* `Instructions: 14`

These values are obviously unrealistic for a benchmark that is supposed to execute several iterations with 64-bit multiplications and shifts: essentially, only the CSR reads and minimal scaffolding remained. This convinced me that I must force an observable use of the results. After adding the checksum and volatile sink, the measured cycles and instructions again reflect the real cost of the algorithm.
### 7.3 Measurement results

Using `-Os` for both variants and the same measurement window (only the test loop, not `main` initialization), I obtained:

| Version       | Cycles | Instructions | Note                                   |
| ------------- | ------ | ------------ | -------------------------------------- |
| baseline      | 36949  | 36948        | print all test cases in the loop       |
| optimized I/O | 9854   | 9853         | compute + checksum, only print summary |

Compared with the baseline, the optimized version reduces both cycles and instructions by roughly 73%:

$\text{reduction} \approx \frac{36949 - 9854}{36949} \approx 73.3\%.$

This experiment clearly shows that:

* In the baseline configuration, most of the time is spent in formatted output and integer-to-decimal conversion (`print_dec` with repeated `udiv` / `umod`).
* Once per-test I/O is removed and results are only used to update a checksum, the measured cost drops dramatically, and the counters better represent the cost of the `fast_rsqrt` algorithm itself.

This satisfies the assignment requirement of providing at least one example of “manual optimization → re-measure → explain the difference”.

---

## 8. Summary

* I set up a bare-metal RV32I + Zicsr environment on rv32emu, using a minimal skeleton derived from `tests/system/playground` and a unified set of compiler flags.
* HW1 and Quiz 2A were successfully ported by replacing Ripes-specific syscalls with Linux-style `write` / `exit` and by providing pure-RV32I printing routines.
* For Quiz 3C, I implemented a table-based + Newton–Raphson `fast_rsqrt` function in Q16.16, measured its performance under `-O0`, `-Os`, and `-Ofast`, and compared the generated assembly.
* Precision analysis shows that `fast_rsqrt` is within ±1 LSB of the rounded Q16.16 reference for all tested inputs, with relative error below about 0.4%.
* A simple manual optimization that removes per-test I/O (while preserving a checksum to keep the loop alive) reduces the measured cycles and instructions by more than 70%, demonstrating how much overhead comes from formatted printing rather than from the core numeric algorithm.