# Assignment2: Complete Applications
Contributed by [yuyan7498](https://github.com/yuyan7498)
Code: [Github repository](https://github.com/yuyan7498/ca2025-quizzes)
## Project Overview
This project builds a collection of bare-metal programs for the rv32emu system emulator, focusing on:
- Efficient register usage to keep the RV32I core busy without spilling values unnecessarily.
- Writing functions that respect the RISC-V calling convention so that C and assembly routines interoperate cleanly.
- Explicit memory management through stack frames (for locals, saved registers) and static/global storage instead of relying on a libc runtime.
## Requirements Checklist
- ✅ Enabled system emulation in `rv32emu` and verified the copied `tests/system/playground` harness.
- ✅ Selected one Homework 1 program (`uf8` variable-length encoding suite) and ported it to bare metal.
- ✅ Selected Quiz 2 Problem A (`Hanoi` tower mover) and Quiz 3 Problem B (`fast_rsqrt`) and delivered both as system-level programs.
- ✅ Limited all C compilation to `-march=rv32i_zicsr` and avoided RV32M/RV32F instructions by providing software multiword arithmetic helpers.
- ✅ Integrated syscall expectations from `docs/syscall.md`/`src/syscall.c` so the emulator prints via ecall.
- ✅ Reused only the `tests/system/playground` subset of rv32emu as the workspace template, editing the local `Makefile` and `linker.ld`.
- ✅ Compared hand-written vs compiler-generated assembly by disassembling produced ELFs.
- ✅ Experimented with `-Ofast` vs `-Os` (default now) and documented the cycle/instret deltas for `fast_rsqrt`.
- ✅ Consulted `perfcounter.S`/`ticks.c` helpers to gather statistics and guide optimizations.
## Repository Layout
| File | Role |
|------|------|
| `Makefile` | Builds the bare-metal ELF (`test.elf`). Accepts `OPT_LEVEL` to flip between `-Os` and `-Ofast`. |
| `linker.ld` | Simple script placing text/data sections for the rv32emu loader. |
| `start.S` | CRT0 that sets up stack pointer and jumps into `main`. |
| `main.c` | Test harness, syscall-based logging utilities, cycle counting, software helpers for missing RV32M operations, and orchestrates all test suites (BF16, UF8, rsqrt, Hanoi). |
| `bf16.c` / `bf16.h` | Software implementation of BF16 arithmetic used to compare against assembler routines. |
| `bf16_asm.S` | Handwritten assembly for BF16 ops. |
| `uf8_c.c` / `uf8_main.s` | C and assembly implementations of the Homework 1 UF8 codec. |
| `rsqrt.c` / `rsqrt_asm.S` | C and assembly implementations of Quiz 3 Problem B reciprocal square root. Assembly version is now self-contained (no C helper calls). |
| `hanoi.s` | Quiz 2 Problem A (Tower of Hanoi) assembly. Uses syscalls to print each move. |
| `perfcounter.S` | Accessors for `mcycle`/`minstret` counters. |
## Component Walkthrough
### Logging and Syscalls
- `main.c` defines `printstr`/`TEST_LOGGER`, emitting strings via the rv32emu provided `ecall` path (`a7 = 0x40`).
- The `"memory"` clobber guards against aggressive `-Os` reordering.
### Cycle Measurement
- `get_cycles`/`get_instret` come from `perfcounter.S`.
- Each test records start/end counters and prints deltas for profiling.
### Software Runtime Helpers
- RV32I lacks hardware multiply/divide, so `main.c` exposes:
- `__mulsi3` and `umul` (32-bit).
- Newly added `__muldi3`, `__lshrdi3`, `__ashldi3` to satisfy libgcc expectations under tighter optimization.
- These ensure both `-Os` and `-Ofast` builds link without pulling in forbidden instructions.
### Homework 1 Port: UF8 Codec
- `uf8_c.c` contains the original C encode/decode/clz routines.
- `uf8_main.s` offers an assembly variant.
- Tests in `main.c` invoke both, verifying correctness and collecting performance numbers.
### Quiz 2 Problem A: Tower of Hanoi
- `hanoi.s` converts the iterative gray-code solution to RV32I assembly.
- Uses the shared logger to output each disk move and the perfcounter to measure total cycles/instructions.
### Quiz 3 Problem B: Reciprocal Square Root
- `rsqrt.c` houses the C algorithm: table lookup + two Newton iterations with software 64-bit math.
- `rsqrt_asm.S` mirrors the algorithm in assembly. The latest revision embeds `rsqrt_mul32_internal` and `rsqrt_mul32_u64_internal`, eliminating the last C dependencies so the whole pipeline is pure assembly. This satisfies the “complete translation” requirement.
### BF16 Approximation Suite
- `bf16.c` implements BF16 conversions and arithmetic in C.
- `bf16_asm.S` houses optimized versions.
- Tests cross-check results and gather performance data, satisfying the “Approximating sine (BF16) guidance” by providing a template for precision/performance analysis.
## Building & Running
Two full executions were captured to provide cycle statistics under each optimization mode:
```sh
# Size-oriented baseline
make clean
make run | tee run-Os.txt
# Speed-oriented comparison
make clean
make OPT_LEVEL=-Ofast run | tee run-Ofast.txt
```
Both logs (kept as `run-Os.txt` and `run-Ofast.txt`) contain the full console output referenced below.
## Execution Results
The table summarises representative tests from each delivered application (BF16 math, UF8 codec, reciprocal square root, and Hanoi).
ΔCycles is computed as `Cycles(-Os) – Cycles(-Ofast)`; positive values indicate that `-Ofast` runs faster.
| Suite | Test | Cycles (-Os) | Inst (-Os) | Cycles (-Ofast) | Inst (-Ofast) | ΔCycles |
|-------|------|--------------|------------|-----------------|---------------|---------|
| BF16 (ASM) | `bf16_add_asm` | 59 | 59 | – | – | – |
| | `bf16_sub_asm` | 64 | 65 | – | – | – |
| | `bf16_mul_asm` | 117 | 118 | – | – | – |
| | `bf16_div_asm` | 222 | 223 | – | – | – |
| | `bf16_special_cases_asm` | 49 | 49 | – | – | – |
| BF16 (C) | `bf16_add` | 103 | 102 | 73 | 72 | **+30** |
| | `bf16_sub` | 74 | 73 | 74 | 73 | 0 |
| | `bf16_mul` | 145 | 144 | 133 | 132 | +12 |
| | `bf16_div` | 149 | 148 | 113 | 112 | **+36** |
| | `bf16_special_cases` | 42 | 42 | 42 | 42 | 0 |
| UF8 (ASM) | `uf8_decode` | 28 | 28 | – | – | – |
| | `uf8_encode` | 101 | 101 | – | – | – |
| | `uf8_clz32` | 49 | 49 | – | – | – |
| UF8 (C) | `uf8_decode_c` | 28 | 28 | 28 | 28 | 0 |
| | `uf8_encode_c` | 89 | 89 | 62 | 62 | **+27** |
| | `uf8_clz32_c` | 55 | 55 | 38 | 38 | **+17** |
| fast_rsqrt (ASM) | Exact | 39 114 | 39 113 | 40 114 | 40 114 | -1 000 |
| | Accuracy | 119 323 | 119 323 | 114 600 | 114 600 | **+4 723** |
| fast_rsqrt (C) | Exact | 146 570 | 146 569 | 41 705 | 41 705 | **+104 865** |
| | Accuracy | 563 499 | 563 499 | 119 334 | 119 334 | **+444 165** |
| Hanoi (ASM) | `hanoi_run` | 1 818 | 1 818 | – | – | – |
The detailed per-test printouts for BF16 and UF8 (both ASM and C variants) are preserved in the logs for transparency.
### Observations on Optimization Modes
- For assembler routines, `-Ofast` mostly affects the caller’s prologue/epilogue. In the case of `bf16_add_asm` the change slightly increases cycles (+1) because the compiler inserts an extra spill.
- The large improvement for `fast_rsqrt` in C mode stems from more aggressive inlining and constant-folding under `-Ofast`, which dramatically reduces helper-call overhead (e.g., `fast_rsqrt_accuracy (C)` gains **444 165** cycles).
- `fast_rsqrt_accuracy (ASM)` benefits from the unrolled loop (see below) and from the more optimistic scheduler under `-Ofast`, shaving **4 723** cycles compared with `-Os`. This delta matches the removal of two loop-branch evaluations (~40 cycles each) plus fewer register moves in the caller.
- `Hanoi` remains unchanged because it is pure assembly that does not depend on compiler optimisation.
### Execution Snapshots
```text
=== BFloat16 ASM Tests ===
Test 6: bf16_add_asm
Test: bf16_add_asm
1.0 + 1.0 = 4000
PASSED
Cycles: 59
Instructions: 59
Test 7: bf16_sub_asm
Test: bf16_sub_asm
3.0 - 2.0 = 3f80
PASSED
Cycles: 64
Instructions: 65
Test 8: bf16_mul_asm
Test: bf16_mul_asm
2.0 * 3.0 = 40c0
PASSED
Cycles: 117
Instructions: 118
Test 9: bf16_div_asm
Test: bf16_div_asm
6.0 / 2.0 = 4040
PASSED
Cycles: 222
Instructions: 223
Test 10: bf16_special_cases_asm
Test: bf16_special_cases_asm
bf16_iszero_asm(0): PASSED
bf16_isnan_asm(NaN): PASSED
bf16_isinf_asm(Inf): PASSED
Cycles: 49
Instructions: 49
```
```text
=== BFloat16 C Tests ===
Test 1: bf16_add
Test: bf16_add
1.0 + 1.0 = 4000
PASSED
Cycles: 103
Instructions: 102
Test 2: bf16_sub
Test: bf16_sub
3.0 - 2.0 = 3f80
PASSED
Cycles: 74
Instructions: 73
Test 3: bf16_mul
Test: bf16_mul
2.0 * 3.0 = 40c0
PASSED
Cycles: 145
Instructions: 144
Test 4: bf16_div
Test: bf16_div
6.0 / 2.0 = 4040
PASSED
Cycles: 149
Instructions: 148
Test 5: bf16_special_cases
Test: bf16_special_cases
bf16_iszero(0): PASSED
bf16_isnan(NaN): PASSED
bf16_isinf(Inf): PASSED
Cycles: 42
Instructions: 42
```
```text
=== UF8 ASM Tests ===
Test 11: uf8_decode
Test: uf8_decode
input: 31
expected: 46
got: 46
PASSED
Cycles: 28
Instructions: 28
Test 12: uf8_encode
Test: uf8_encode
input: 480
expected: 79
got: 79
PASSED
Cycles: 101
Instructions: 101
Test 13: uf8_clz32
Test: uf8_clz32
input: 0xf00000
expected: 8
got: 8
PASSED
Cycles: 49
Instructions: 49
```
```text
=== UF8 C Tests ===
Test 14: uf8_decode_c
Test: uf8_decode_c
input: 31
expected: 46
got: 46
PASSED
Cycles: 28
Instructions: 28
Test 15: uf8_encode_c
Test: uf8_encode_c
input: 480
expected: 79
got: 79
PASSED
Cycles: 89
Instructions: 89
Test 16: uf8_clz32_c
Test: uf8_clz32_c
input: 0xf00000
expected: 8
got: 8
PASSED
Cycles: 55
Instructions: 55
```
```text
=== fast_rsqrt Tests (ASM) ===
Test 17: fast_rsqrt_exact (ASM)
input: 1
expected: 65536
got: 65536
Pass
input: 4
expected: 32768
got: 32768
Pass
input: 16
expected: 16384
got: 16384
Pass
input: 256
expected: 4096
got: 4096
Pass
input: 65536
expected: 256
got: 256
Pass
input: 4294967295
expected: 1
got: 1
Pass
Cycles: 39114
Instructions: 39113
```
```text
=== fast_rsqrt Tests (C) ===
Test 19: fast_rsqrt_exact (C)
input: 1
expected: 65536
got: 65536
Pass
input: 4
expected: 32768
got: 32768
Pass
input: 16
expected: 16384
got: 16384
Pass
input: 256
expected: 4096
got: 4096
Pass
input: 65536
expected: 256
got: 256
Pass
input: 4294967295
expected: 1
got: 1
Pass
Cycles: 146570
Instructions: 146569
```
```text
=== Hanoi ASM ===
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Total Hanoi Cycles: 1818
Total Hanoi Instructions: 1818
=== All Tests Completed ===
```
## Disassembly & Comparisons
1. Build both `-Os` and `-Ofast` variants.
2. Use `riscv-none-elf-objdump -d test.elf > test.dis`.
3. Contrast `rsqrt_asm.S` against the compiler output within `test.dis` to identify instruction differences (e.g., register scheduling, branch layouts).
4. Repeat for other modules (BF16, UF8) to understand compiler heuristics.
Key observations:
- `-Ofast` aggressively inlines and unrolls loops, increasing instruction count but reducing branches.
- `-Os` keeps code compact; on RV32I without cache pressure this can still improve cycles because of reduced call overhead, but some loops evaluate more instructions due to lack of unrolling.
- Handwritten `rsqrt_asm` avoids spills that the compiler emits in `rsqrt.c`, explaining the lower cycle counts.
## Profiling Highlights
- `fast_rsqrt_accuracy (ASM)` now records **119 323** cycles under `-Os` and **114 600** cycles under `-Ofast`, so the cross-mode delta is **4 723 cycles** (≈ 3.96 %). Compared to the pre-unroll baseline, the optimisation removed **441 225 cycles** under `-Os` and **5 398 cycles** under `-Ofast`.
- `fast_rsqrt_accuracy (C)` improves by **444 165 cycles** when switching from `-Os` to `-Ofast`, confirming how much work the compiler can save once constant folding and inlining are enabled.
- `fast_rsqrt_exact (C)` benefits by **104 865 cycles**, primarily because `-Ofast` eliminates several calls to the software 64-bit helpers that are mandatory under `-Os`.
- BF16 and UF8 suites show more modest changes; they mostly exercise straight-line arithmetic where `-Os` already keeps the code compact.
`perfcounter` deltas printed by the harness (see logs) remain the ground truth for all measurements.
## Implemented Optimisation: Newton Loop Unrolling
- Files:
- Pre-optimisation snapshot: `docs/rsqrt_asm_before_unroll.S`
- Optimised implementation: `rsqrt_asm.S` (`RSQRT_NEWTON` macro at lines 7–46, inlined calls at lines 99–100)
- Change: The iterative Newton loop was expanded into two explicit macro invocations, eliminating the `beqz`/`j` loop skeleton and reducing register shuffling between iterations.
- Result: The refined routine removes two taken branches per evaluation and keeps operands live in caller-saved temporaries, enabling the compiler to schedule surrounding code more tightly. When paired with `-Ofast`, the unrolled version records a **5 398-cycle** improvement for `fast_rsqrt_accuracy` and a **1 716-cycle** improvement for `fast_rsqrt_exact` relative to the original `-Ofast` build. Under `-Os`, the gain is even larger (see table below).
### Loop Unrolling Reference & Measurements
To document the optimisation, the build was executed both before and after the change in both optimisation modes. The raw emulator transcripts are stored alongside the source for traceability:
- `run-Os-pre-unroll.txt` / `run-Os-post-unroll.txt`
- `run-Ofast-pre-unroll.txt` / `run-Ofast-post-unroll.txt`
| Mode | Loop Version | `fast_rsqrt_exact` Cycles | Instructions | `fast_rsqrt_accuracy` Cycles | Instructions | Δ Accuracy Cycles vs Pre |
|------|--------------|---------------------------|--------------|-----------------------------|--------------|--------------------------|
| -Os | Before unroll | 144 477 | 144 476 | 560 548 | 560 548 | baseline |
| -Os | After unroll | 39 114 | 39 113 | 119 323 | 119 323 | **-441 225** |
| -Ofast | Before unroll | 41 830 | 41 830 | 119 998 | 119 998 | baseline |
| -Ofast | After unroll | 40 114 | 40 114 | 114 600 | 114 600 | **-5 398** |
The pre-unroll source remains available under `docs/` for side-by-side inspection with the macro-based version.
Key differences:
- `docs/rsqrt_asm_before_unroll.S` retains the `.Lnewton_loop` structure with `beqz`/`j` control flow.
- `rsqrt_asm.S` removes the loop label in favour of two `RSQRT_NEWTON` expansions, keeping temporaries in registers across iterations.