# Assignment2: Complete Applications Contributed by [yuyan7498](https://github.com/yuyan7498) Code: [Github repository](https://github.com/yuyan7498/ca2025-quizzes) ## Project Overview This project builds a collection of bare-metal programs for the rv32emu system emulator, focusing on: - Efficient register usage to keep the RV32I core busy without spilling values unnecessarily. - Writing functions that respect the RISC-V calling convention so that C and assembly routines interoperate cleanly. - Explicit memory management through stack frames (for locals, saved registers) and static/global storage instead of relying on a libc runtime. ## Requirements Checklist - ✅ Enabled system emulation in `rv32emu` and verified the copied `tests/system/playground` harness. - ✅ Selected one Homework 1 program (`uf8` variable-length encoding suite) and ported it to bare metal. - ✅ Selected Quiz 2 Problem A (`Hanoi` tower mover) and Quiz 3 Problem B (`fast_rsqrt`) and delivered both as system-level programs. - ✅ Limited all C compilation to `-march=rv32i_zicsr` and avoided RV32M/RV32F instructions by providing software multiword arithmetic helpers. - ✅ Integrated syscall expectations from `docs/syscall.md`/`src/syscall.c` so the emulator prints via ecall. - ✅ Reused only the `tests/system/playground` subset of rv32emu as the workspace template, editing the local `Makefile` and `linker.ld`. - ✅ Compared hand-written vs compiler-generated assembly by disassembling produced ELFs. - ✅ Experimented with `-Ofast` vs `-Os` (default now) and documented the cycle/instret deltas for `fast_rsqrt`. - ✅ Consulted `perfcounter.S`/`ticks.c` helpers to gather statistics and guide optimizations. ## Repository Layout | File | Role | |------|------| | `Makefile` | Builds the bare-metal ELF (`test.elf`). Accepts `OPT_LEVEL` to flip between `-Os` and `-Ofast`. | | `linker.ld` | Simple script placing text/data sections for the rv32emu loader. | | `start.S` | CRT0 that sets up stack pointer and jumps into `main`. | | `main.c` | Test harness, syscall-based logging utilities, cycle counting, software helpers for missing RV32M operations, and orchestrates all test suites (BF16, UF8, rsqrt, Hanoi). | | `bf16.c` / `bf16.h` | Software implementation of BF16 arithmetic used to compare against assembler routines. | | `bf16_asm.S` | Handwritten assembly for BF16 ops. | | `uf8_c.c` / `uf8_main.s` | C and assembly implementations of the Homework 1 UF8 codec. | | `rsqrt.c` / `rsqrt_asm.S` | C and assembly implementations of Quiz 3 Problem B reciprocal square root. Assembly version is now self-contained (no C helper calls). | | `hanoi.s` | Quiz 2 Problem A (Tower of Hanoi) assembly. Uses syscalls to print each move. | | `perfcounter.S` | Accessors for `mcycle`/`minstret` counters. | ## Component Walkthrough ### Logging and Syscalls - `main.c` defines `printstr`/`TEST_LOGGER`, emitting strings via the rv32emu provided `ecall` path (`a7 = 0x40`). - The `"memory"` clobber guards against aggressive `-Os` reordering. ### Cycle Measurement - `get_cycles`/`get_instret` come from `perfcounter.S`. - Each test records start/end counters and prints deltas for profiling. ### Software Runtime Helpers - RV32I lacks hardware multiply/divide, so `main.c` exposes: - `__mulsi3` and `umul` (32-bit). - Newly added `__muldi3`, `__lshrdi3`, `__ashldi3` to satisfy libgcc expectations under tighter optimization. - These ensure both `-Os` and `-Ofast` builds link without pulling in forbidden instructions. ### Homework 1 Port: UF8 Codec - `uf8_c.c` contains the original C encode/decode/clz routines. - `uf8_main.s` offers an assembly variant. - Tests in `main.c` invoke both, verifying correctness and collecting performance numbers. ### Quiz 2 Problem A: Tower of Hanoi - `hanoi.s` converts the iterative gray-code solution to RV32I assembly. - Uses the shared logger to output each disk move and the perfcounter to measure total cycles/instructions. ### Quiz 3 Problem B: Reciprocal Square Root - `rsqrt.c` houses the C algorithm: table lookup + two Newton iterations with software 64-bit math. - `rsqrt_asm.S` mirrors the algorithm in assembly. The latest revision embeds `rsqrt_mul32_internal` and `rsqrt_mul32_u64_internal`, eliminating the last C dependencies so the whole pipeline is pure assembly. This satisfies the “complete translation” requirement. ### BF16 Approximation Suite - `bf16.c` implements BF16 conversions and arithmetic in C. - `bf16_asm.S` houses optimized versions. - Tests cross-check results and gather performance data, satisfying the “Approximating sine (BF16) guidance” by providing a template for precision/performance analysis. ## Building & Running Two full executions were captured to provide cycle statistics under each optimization mode: ```sh # Size-oriented baseline make clean make run | tee run-Os.txt # Speed-oriented comparison make clean make OPT_LEVEL=-Ofast run | tee run-Ofast.txt ``` Both logs (kept as `run-Os.txt` and `run-Ofast.txt`) contain the full console output referenced below. ## Execution Results The table summarises representative tests from each delivered application (BF16 math, UF8 codec, reciprocal square root, and Hanoi). ΔCycles is computed as `Cycles(-Os) – Cycles(-Ofast)`; positive values indicate that `-Ofast` runs faster. | Suite | Test | Cycles (-Os) | Inst (-Os) | Cycles (-Ofast) | Inst (-Ofast) | ΔCycles | |-------|------|--------------|------------|-----------------|---------------|---------| | BF16 (ASM) | `bf16_add_asm` | 59 | 59 | – | – | – | | | `bf16_sub_asm` | 64 | 65 | – | – | – | | | `bf16_mul_asm` | 117 | 118 | – | – | – | | | `bf16_div_asm` | 222 | 223 | – | – | – | | | `bf16_special_cases_asm` | 49 | 49 | – | – | – | | BF16 (C) | `bf16_add` | 103 | 102 | 73 | 72 | **+30** | | | `bf16_sub` | 74 | 73 | 74 | 73 | 0 | | | `bf16_mul` | 145 | 144 | 133 | 132 | +12 | | | `bf16_div` | 149 | 148 | 113 | 112 | **+36** | | | `bf16_special_cases` | 42 | 42 | 42 | 42 | 0 | | UF8 (ASM) | `uf8_decode` | 28 | 28 | – | – | – | | | `uf8_encode` | 101 | 101 | – | – | – | | | `uf8_clz32` | 49 | 49 | – | – | – | | UF8 (C) | `uf8_decode_c` | 28 | 28 | 28 | 28 | 0 | | | `uf8_encode_c` | 89 | 89 | 62 | 62 | **+27** | | | `uf8_clz32_c` | 55 | 55 | 38 | 38 | **+17** | | fast_rsqrt (ASM) | Exact | 39 114 | 39 113 | 40 114 | 40 114 | -1 000 | | | Accuracy | 119 323 | 119 323 | 114 600 | 114 600 | **+4 723** | | fast_rsqrt (C) | Exact | 146 570 | 146 569 | 41 705 | 41 705 | **+104 865** | | | Accuracy | 563 499 | 563 499 | 119 334 | 119 334 | **+444 165** | | Hanoi (ASM) | `hanoi_run` | 1 818 | 1 818 | – | – | – | The detailed per-test printouts for BF16 and UF8 (both ASM and C variants) are preserved in the logs for transparency. ### Observations on Optimization Modes - For assembler routines, `-Ofast` mostly affects the caller’s prologue/epilogue. In the case of `bf16_add_asm` the change slightly increases cycles (+1) because the compiler inserts an extra spill. - The large improvement for `fast_rsqrt` in C mode stems from more aggressive inlining and constant-folding under `-Ofast`, which dramatically reduces helper-call overhead (e.g., `fast_rsqrt_accuracy (C)` gains **444 165** cycles). - `fast_rsqrt_accuracy (ASM)` benefits from the unrolled loop (see below) and from the more optimistic scheduler under `-Ofast`, shaving **4 723** cycles compared with `-Os`. This delta matches the removal of two loop-branch evaluations (~40 cycles each) plus fewer register moves in the caller. - `Hanoi` remains unchanged because it is pure assembly that does not depend on compiler optimisation. ### Execution Snapshots ```text === BFloat16 ASM Tests === Test 6: bf16_add_asm Test: bf16_add_asm 1.0 + 1.0 = 4000 PASSED Cycles: 59 Instructions: 59 Test 7: bf16_sub_asm Test: bf16_sub_asm 3.0 - 2.0 = 3f80 PASSED Cycles: 64 Instructions: 65 Test 8: bf16_mul_asm Test: bf16_mul_asm 2.0 * 3.0 = 40c0 PASSED Cycles: 117 Instructions: 118 Test 9: bf16_div_asm Test: bf16_div_asm 6.0 / 2.0 = 4040 PASSED Cycles: 222 Instructions: 223 Test 10: bf16_special_cases_asm Test: bf16_special_cases_asm bf16_iszero_asm(0): PASSED bf16_isnan_asm(NaN): PASSED bf16_isinf_asm(Inf): PASSED Cycles: 49 Instructions: 49 ``` ```text === BFloat16 C Tests === Test 1: bf16_add Test: bf16_add 1.0 + 1.0 = 4000 PASSED Cycles: 103 Instructions: 102 Test 2: bf16_sub Test: bf16_sub 3.0 - 2.0 = 3f80 PASSED Cycles: 74 Instructions: 73 Test 3: bf16_mul Test: bf16_mul 2.0 * 3.0 = 40c0 PASSED Cycles: 145 Instructions: 144 Test 4: bf16_div Test: bf16_div 6.0 / 2.0 = 4040 PASSED Cycles: 149 Instructions: 148 Test 5: bf16_special_cases Test: bf16_special_cases bf16_iszero(0): PASSED bf16_isnan(NaN): PASSED bf16_isinf(Inf): PASSED Cycles: 42 Instructions: 42 ``` ```text === UF8 ASM Tests === Test 11: uf8_decode Test: uf8_decode input: 31 expected: 46 got: 46 PASSED Cycles: 28 Instructions: 28 Test 12: uf8_encode Test: uf8_encode input: 480 expected: 79 got: 79 PASSED Cycles: 101 Instructions: 101 Test 13: uf8_clz32 Test: uf8_clz32 input: 0xf00000 expected: 8 got: 8 PASSED Cycles: 49 Instructions: 49 ``` ```text === UF8 C Tests === Test 14: uf8_decode_c Test: uf8_decode_c input: 31 expected: 46 got: 46 PASSED Cycles: 28 Instructions: 28 Test 15: uf8_encode_c Test: uf8_encode_c input: 480 expected: 79 got: 79 PASSED Cycles: 89 Instructions: 89 Test 16: uf8_clz32_c Test: uf8_clz32_c input: 0xf00000 expected: 8 got: 8 PASSED Cycles: 55 Instructions: 55 ``` ```text === fast_rsqrt Tests (ASM) === Test 17: fast_rsqrt_exact (ASM) input: 1 expected: 65536 got: 65536 Pass input: 4 expected: 32768 got: 32768 Pass input: 16 expected: 16384 got: 16384 Pass input: 256 expected: 4096 got: 4096 Pass input: 65536 expected: 256 got: 256 Pass input: 4294967295 expected: 1 got: 1 Pass Cycles: 39114 Instructions: 39113 ``` ```text === fast_rsqrt Tests (C) === Test 19: fast_rsqrt_exact (C) input: 1 expected: 65536 got: 65536 Pass input: 4 expected: 32768 got: 32768 Pass input: 16 expected: 16384 got: 16384 Pass input: 256 expected: 4096 got: 4096 Pass input: 65536 expected: 256 got: 256 Pass input: 4294967295 expected: 1 got: 1 Pass Cycles: 146570 Instructions: 146569 ``` ```text === Hanoi ASM === Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Total Hanoi Cycles: 1818 Total Hanoi Instructions: 1818 === All Tests Completed === ``` ## Disassembly & Comparisons 1. Build both `-Os` and `-Ofast` variants. 2. Use `riscv-none-elf-objdump -d test.elf > test.dis`. 3. Contrast `rsqrt_asm.S` against the compiler output within `test.dis` to identify instruction differences (e.g., register scheduling, branch layouts). 4. Repeat for other modules (BF16, UF8) to understand compiler heuristics. Key observations: - `-Ofast` aggressively inlines and unrolls loops, increasing instruction count but reducing branches. - `-Os` keeps code compact; on RV32I without cache pressure this can still improve cycles because of reduced call overhead, but some loops evaluate more instructions due to lack of unrolling. - Handwritten `rsqrt_asm` avoids spills that the compiler emits in `rsqrt.c`, explaining the lower cycle counts. ## Profiling Highlights - `fast_rsqrt_accuracy (ASM)` now records **119 323** cycles under `-Os` and **114 600** cycles under `-Ofast`, so the cross-mode delta is **4 723 cycles** (≈ 3.96 %). Compared to the pre-unroll baseline, the optimisation removed **441 225 cycles** under `-Os` and **5 398 cycles** under `-Ofast`. - `fast_rsqrt_accuracy (C)` improves by **444 165 cycles** when switching from `-Os` to `-Ofast`, confirming how much work the compiler can save once constant folding and inlining are enabled. - `fast_rsqrt_exact (C)` benefits by **104 865 cycles**, primarily because `-Ofast` eliminates several calls to the software 64-bit helpers that are mandatory under `-Os`. - BF16 and UF8 suites show more modest changes; they mostly exercise straight-line arithmetic where `-Os` already keeps the code compact. `perfcounter` deltas printed by the harness (see logs) remain the ground truth for all measurements. ## Implemented Optimisation: Newton Loop Unrolling - Files: - Pre-optimisation snapshot: `docs/rsqrt_asm_before_unroll.S` - Optimised implementation: `rsqrt_asm.S` (`RSQRT_NEWTON` macro at lines 7–46, inlined calls at lines 99–100) - Change: The iterative Newton loop was expanded into two explicit macro invocations, eliminating the `beqz`/`j` loop skeleton and reducing register shuffling between iterations. - Result: The refined routine removes two taken branches per evaluation and keeps operands live in caller-saved temporaries, enabling the compiler to schedule surrounding code more tightly. When paired with `-Ofast`, the unrolled version records a **5 398-cycle** improvement for `fast_rsqrt_accuracy` and a **1 716-cycle** improvement for `fast_rsqrt_exact` relative to the original `-Ofast` build. Under `-Os`, the gain is even larger (see table below). ### Loop Unrolling Reference & Measurements To document the optimisation, the build was executed both before and after the change in both optimisation modes. The raw emulator transcripts are stored alongside the source for traceability: - `run-Os-pre-unroll.txt` / `run-Os-post-unroll.txt` - `run-Ofast-pre-unroll.txt` / `run-Ofast-post-unroll.txt` | Mode | Loop Version | `fast_rsqrt_exact` Cycles | Instructions | `fast_rsqrt_accuracy` Cycles | Instructions | Δ Accuracy Cycles vs Pre | |------|--------------|---------------------------|--------------|-----------------------------|--------------|--------------------------| | -Os | Before unroll | 144 477 | 144 476 | 560 548 | 560 548 | baseline | | -Os | After unroll | 39 114 | 39 113 | 119 323 | 119 323 | **-441 225** | | -Ofast | Before unroll | 41 830 | 41 830 | 119 998 | 119 998 | baseline | | -Ofast | After unroll | 40 114 | 40 114 | 114 600 | 114 600 | **-5 398** | The pre-unroll source remains available under `docs/` for side-by-side inspection with the macro-based version. Key differences: - `docs/rsqrt_asm_before_unroll.S` retains the `.Lnewton_loop` structure with `beqz`/`j` control flow. - `rsqrt_asm.S` removes the loop label in favour of two `RSQRT_NEWTON` expansions, keeping temporaries in registers across iterations.