# Assignment2: Complete Applications
contributed by < [TWChris90](https://github.com/TWChris90/ca2025-quizzes) >
## Pick one complete program (with test suite) done in your Homework 1 and make it run in a bare-metal environment using rv32emu’s system emulation.
### Project Overview
This project implements `uf8_decode` and `uf8_encode` functions in RISC-V assembly, then runs them on the `rv32emu` emulator to verify encoding–decoding correctness.
### Enviroment Installation
* rv32emu is installed on WSL Ubuntu 22.04 LTS.
* I chose to implement the encoding and decoding for Quiz 1 Problem b on re32emu.
### Development and Debugging Process
#### Invalid `-march` Error
**Issue** : When running make run, I got `invalid -march=rv32izicsr`.
**Cause** : The system `as` (x86 assembler) was invoked instead of the RISC-V cross-assembler.
**Fix** : Ensures the RISC-V toolchain is used.
```
export PATH=$HOME/riscv-none-elf-gcc/bin:$PATH
make run AS=riscv-none-elf-as CC=riscv-none-elf-gcc
```
#### Global Label and Alignment Issue
**Issue** : At first the linker couldn’t find `pb_main`, and memory alignment errors occurred when accessing string data.
**Fix** : `.globl` makes the label visible to the linker, `.align 2` ensures 4-byte alignment to avoid misaligned access.
```
.globl pb_main
.align 2
```
#### Stack Frame Maintenance
**Issue** : Function returns jumped to wrong address because `ra` was not saved.
**Fix** : Proper stack save/restore solved the issue.
```
addi sp, sp, -4
sw ra, 0(sp)
lw ra, 0(sp)
addi sp, sp, 4
jr ra
```
### Core Implementation
#### uf8_decode
```
andi a1, a0, 0x0F # mantissa
srli a2, a0, 4 # exponent
addi a3, a2, 4
li a4, 1
sll a4, a4, a3 # 1 << (e+4)
addi a4, a4, -16 # offset
sll a3, a1, a2 # m << e
add a0, a3, a4 # value = (m<<e) + offset
jr ra
```
#### uf8_encode
```
add a0, a7, x0
jal ra, CLZ # Count leading zeros
li a1, 31
sub a1, a1, a0 # msb = 31 - lz
slli a1, a3, 4
or a0, a1, a2 # pack [eeee mmmm]
jr ra
```
The `test` function loops through codes 0 to 255, runs decode → encode, and verifies both equivalence and monotonic increase.
If any check fails, it prints error messages via `ecall`; otherwise, `pb_main` prints “All tests passed.”
### Disassemble elf.
```
Disassembly of section .text:
00010000 <_start>:
10000: 00008117 auipc sp,0x8
10004: ba010113 addi sp,sp,-1120 # 17ba0 <__stack_top>
10008: 00007297 auipc t0,0x7
1000c: b9828293 addi t0,t0,-1128 # 16ba0 <__bss_end>
10010: 00007317 auipc t1,0x7
10014: b9030313 addi t1,t1,-1136 # 16ba0 <__bss_end>
10018: 0062d863 bge t0,t1,10028 <_start+0x28>
1001c: 0002a023 sw zero,0(t0)
10020: 00428293 addi t0,t0,4
10024: ff5ff06f j 10018 <_start+0x18>
10028: 609010ef jal 11e30 <main>
1002c: 05d00893 li a7,93
10030: 00000513 li a0,0
10034: 00000073 ecall
10038: 0000006f j 10038 <_start+0x38>
...
```
### Execution Result

## Quiz 2 Problem A — Recursive Hanoi Tower
The goal of this task is to implement a recursive version of the Hanoi Tower algorithm in RISC-V assembly, printing each move step to the console using `ecall` system calls.
### Program Logic
* **problemA_main**
```
problemA_main:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
sw s1, 4(sp)
sw s2, 0(sp)
li a0, 3
li a1, 0
li a2, 2
li a3, 1
jal ra, hanoi
```
Creates a new stack frame and saves registers for recursion.
Sets parameters `a0=3`, `a1=0` (A), `a2=2` (C), `a3=1` (B).
Calls `hanoi(3, A, C, B)` to start recursion.
* **hanoi**(Recursive Function)
```
hanoi:
addi sp, sp, -20
sw ra, 16(sp)
sw a0, 12(sp)
sw a1, 8(sp)
sw a2, 4(sp)
sw a3, 0(sp)
beq a0, x0, h_ret
```
Each recursive call creates its own stack frame.
If `a0==0`, there are no disks to move — return.
Then, the recursive three-step process:
1. Moves the top n−1 disks to the auxiliary peg.
```
addi a0, a0, -1
mv t0, a2
mv a2, a3
mv a3, t0
jal ra, hanoi
```
2. Prints the move for the largest disk.
```
lw a0, 12(sp)
lw a1, 8(sp)
lw a2, 4(sp)
jal ra, print_move
```
3. Moves the n−1 disks from the auxiliary peg to the destination peg.
```
lw a0, 12(sp)
lw a1, 8(sp)
lw a2, 4(sp)
lw a3, 0(sp)
addi a0, a0, -1
mv t0, a1
mv a1, a3
mv a3, t0
jal ra, hanoi
```
4. Finally restores the stack and returns:
```
h_ret:
lw ra, 16(sp)
addi sp, sp, 20
jr ra
```
* **Print Function**
```
print_move:
addi sp, sp, -16
sw ra, 12(sp)
```
Retrieves the ASCII character for the disk number:
```
addi t0, a0, -1
la t1, disks
add t1, t1, t0
lbu t0, 0(t1)
```
Retrieves 'A', 'B', or 'C' based on from/to values:
```
la t2, pegs
add t3, t2, a1
lbu t1, 0(t3)
add t3, t2, a2
lbu t2, 0(t3)
```
Outputs full sentence with multiple `ecall`s:
```
li a0,1
la a1,str1
li a2,10
li a7,0x40
ecall
```
Prints `"Move Disk "`, then disk number, from peg, to peg, and newline.
### System Call Design
| Register | Function | Description |
|-----------|-----------|-------------|
| `a7 = 0x40` | Write string | To stdout |
| `a0 = 1` | File descriptor | stdout |
| `a1` | String address | Pointer to string |
| `a2` | String length | Output size |
Each full sentence requires 6 `ecall`s.
### Execution Result

## Quiz 3 Problem C — fast_rsqrt
The goal is to compute the reciprocal square root $y = \frac{1}{\sqrt{x}}$ efficiently on a pure RV32I platform without floating-point hardware. Since division and square root are expensive, we reformulate the problem as a fixed-point iteration that relies only on integer operations, lookup tables, and shifts.
### Algorithm Structure
The implementation follows a lookup-table based Newton–Raphson iteration.
* Find exponent `e = 31 - clz(x)`
* Get initial value `y` from the lookup table
* Linearly interpolate between two neighboring entries
* Perform two Newton–Raphson correction iterations
### Mathematical Foundation
We define the target function:
$$
f(y) = \frac{1}{y^2} - x
$$
and seek $f(y) = 0$.
The Newton–Raphson update rule is:
$$
y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)}
$$
Expanding $f'(y_n) = -\frac{2}{y_n^3}$, we obtain:
$$
y_{n+1} = y_n - \frac{\frac{1}{y_n^2} - x}{-2 / y_n^3} = y_n \left( \frac{3 - x y_n^2}{2} \right)
$$
Thus, the iterative formula becomes:
$$
y_{n+1} = y_n \times \frac{3 - (x \times y_n^2)}{2}
$$
### Fixed-point Implementation
To handle fractional precision, all variables are scaled by $2^{16}$.
Hence, the true value $y$ is represented as $y_{int} = y \times 2^{16}$.
All intermediate multiplications and corrections are shifted right by 16–17 bits to restore the fixed-point range.
The lookup table `rsqrt_table[32]` stores precomputed values of:
$$
2^{16} / \sqrt{2^e}
$$
for exponents $e = 0...31$.
When an input $x$ is given, we first find its leading exponent:
$$
e = 31 - \text{clz}(x)
$$
and interpolate between `rsqrt_table[e]` and `rsqrt_table[e+1]` using a linear ratio derived from the fractional part of $x$.
### Newton Iteration and Refinement
Two correction steps are performed:
```c
for (int k = 0; k < 2; ++k) {
uint64_t y2_64 = mul32(y, y);
uint32_t y2 = (uint32_t)y2_64;
uint64_t xy2_64 = mul32(x, y2);
uint32_t xy2 = (uint32_t)(xy2_64 >> 16);
uint32_t corr = (3u << 16) - xy2;
uint64_t update = mul32(y, corr);
y = (uint32_t)(update >> 17);
}
```
Each iteration approximately doubles the number of correct bits.
After two iterations, the result stabilizes for 16-bit precision.
### Integer Arithmetic Core
Since RV32I lacks a native 64-bit multiplier, a manual $32\times32 \to 64$ routine `mul32()` is implemented using bit-shift accumulation.
```c
static uint64_t mul32(uint32_t a, uint32_t b)
{
uint64_t acc = 0;
uint64_t va = a;
uint32_t vb = b;
while (vb) {
if (vb & 1u)
acc += va;
va <<= 1;
vb >>= 1;
}
return acc;
}
```
This eliminates the need for `__muldi3` and ensures correct execution in bare-metal mode.
### Experimental Behavior
The output shows perfect integer error (0%) for all test inputs, with small remainders indicating sub-1% deviation.
Cycle counts per call range from roughly 2600 to 3900 depending on input magnitude.

### Convergence Analysis
The Newton iteration exhibits quadratic convergence near the true value:
$$
e_{n+1} \approx C \cdot e_n^2
$$
where $e_n$ is the relative error.
Given an initial lookup error below 5%, two iterations are sufficient to reach fixed-point precision within $10^{-3}$.
### Disassembly Analysis
#### Observations on Disassembled Code
##### Function Structure
The disassembly shows the following functions:
* `_start`
* `memcpy`
* `udiv`, `umod`, `umul`
* `__mulsi3`
* `print_hex`, `print_dec`
* `fast_rsqrt`
* `run_tests`, `report_result`
Among these, `fast_rsqrt` is the core function under test.
##### fast_rsqrt — Mathematical Concept
`fast_rsqrt` computes an approximate reciprocal square root:
$$
x_{n+1} = x_n \cdot \frac{3 - a \cdot x_n^2}{2}
$$
It’s derived from the Newton–Raphson iteration for$f(x) = \frac{1}{\sqrt{a}}$
##### Instruction Patterns
| Version | Observation |
|----------|--------------|
| `-O0` | Contains many load/store operations (`lw`, `sw`) between stack and registers. |
| `-Ofast` | Fewer branches, some inlining; loop unrolled; register reuse optimized. |
| `-Os` | Maintains loops; compact instruction count; fewer temporaries. |
For example, under -Ofast, the compiler may replace:
```
mul a0, a0, a0
div a0, a1, a0
```
with a shift–multiply pattern like:
```
slli a5, a0, 1
mul a0, a1, a5
```
to avoid costly division.
##### Division and Multiplication Details
The disassembly shows calls to `udiv`, `umod`, and `umul`, all implemented manually since rv32i lacks hardware division.
In mathematical form:
$$
\text{udiv}(a, b) = \sum_{i=0}^{31} \frac{((a >> (31 - i)) \& 1)}{b}
$$
and
$$
\text{umul}(a, b) = \sum_{i=0}^{31} ((b_i = 1) ? (a << i) : 0)
$$
These emulate integer division and multiplication in software.
##### Reported Performance (Cycles & Instructions)
Example from RV32Emu output:
| Input | Ideal | Error | Cycles | Instructions |
|-------|--------|--------|----------|---------------|
| 1 | 65536 | 0% | 56 | 56 |
| 3 | 37836 | 0% | 3953 | 3964 |
| 9 | 21845 | 0% | 3749 | 3760 |
| 25 | 13107 | 0% | 3705 | 3716 |
| 64 | 8192 | 0% | 2606 | 2671 |
From the report, the**Total Instructions:** ≈ 21,233 and the **Total Cycles:** ≈ 21,167, showing a one-to-one correspondence with instruction count due to simple arithmetic.
## Quiz 3 Problem C -Os (Optimized for size)
In this task, the goal was to reimplement the original `fast_rsqrt` function using the -Os (optimized for size) compilation flag. The objective was to ensure correct execution in the RISC-V environment, resolve linker and optimization issues introduced by the compiler, and maintain correct output and performance statistics (cycles and instructions).
### Development Challenges
#### Linker Errors for 64-bit Shifts
The original C version of `fast_rsqrt` used `uint64_t` operations such as `(uint64_t)x << 16`. On an RV32I architecture, GCC automatically calls libgcc helpers like `__ashldi3` and `__lshrdi3` to handle 64-bit shifts.
However, since **rv32emu** does not include libgcc, this caused linker errors such as undefined reference to `__ashldi3` and `__lshrdi3`.
Fix:
For example:
```c
frac = ((x - (1u << exp)) << 16) >> exp;
```
All 64-bit operations were rewritten as 32-bit equivalents, using a custom `mul32()` to emulate 64-bit multiplication without libgcc.
#### Missing 32-bit Arithmetic Helpers
In functions such as print_dec() and error computation, the use of `/` and `%` caused GCC to inject calls to `__udivsi3`,` __umodsi3`, and `__mulsi3`.
Without libgcc, these functions were undefined at link time.
```
undefined reference to `__udivsi3'
undefined reference to `__umodsi3'
undefined reference to `__mulsi3'
```
Fix:
(Software-emulated arithmetic functions):
```
unsigned int __mulsi3(unsigned int a, unsigned int b) { ... }
unsigned int __udivsi3(unsigned int a, unsigned int b) { ... }
unsigned int __umodsi3(unsigned int a, unsigned int b) { ... }
```
These were reimplemented using only addition and bit shifting, fully removing dependency on libgcc.
#### Incorrect Output under -Os
When compiled with `-Os` or `-Ofast`, GCC incorrectly optimized away parts of the inline assembly `asm volatile("ecall")`, assuming it did not affect memory, causing garbled or missing output.
Fix:
Add `"memory"` to the clobber list:
```
asm volatile (... : : "r"(ptr), "r"(len) : "a0","a1","a2","a7","memory");
```
This explicitly tells the compiler that memory is affected, preventing unsafe optimizations.
#### ENABLE_ELF_LOADER Error
```
Error: ENABLE_ELF_LOADER=1 not set
```
`make run` checked for `ENABLE_ELF_LOADER=1` in the rv32emu config file. Commenting out that line in the Makefile allowed direct execution of `test.elf`.
### Key Modifications Summary
| Aspect | Original | Optimized for Size (-Os) |
|---------|-----------|--------------------------|
| 64-bit Operations | Used `uint64_t`, triggered libgcc | Replaced with 32-bit shifts and custom `mul32()` |
| Division | Relied on `__udivsi3` / `__umodsi3` | Reimplemented manual bit-shift division |
| Output | No `"memory"` clobber, caused corruption | Added `"memory"` to prevent optimization |
| Compilation Flags | Default `-O2` | Changed to `-Os` for size reduction |
| Result | Larger binary, correct output | Smaller binary, same accuracy (≈ 0% error) |
### Execution Results Comparison
| x | fast_rsqrt(x) | Ideal | Error (%) | Cycles | Instructions |
|---|---------------|--------|------------|---------|---------------|
| 1 | 65536 | 65536 | 0 | 40 | 40 |
| 3 | 37836 | 37837.23 | 0 | 2626 | 2637 |
| 9 | 21845 | 21845.33 | 0 | 2597 | 2604 |
| 25 | 13107 | 13107.20 | 0 | 2597 | 2604 |
| 64 | 8192 | 8192.00 | 0 | 1906 | 1913 |
| 123 | 5909 | 5909.18 | 0 | 2616 | 2623 |
| 500 | 2930 | 2930.86 | 0 | 2576 | 2583 |

Summary:
* Total Cycles: ≈ 14,958
* Total Instructions: ≈ 15,004
Compared with the original version (~21,167 cycles / 21,233 instructions),
the -Os version achieved roughly 29% fewer cycles and 28% fewer instructions.
### Disassembly Analysis
After compiling with `-Os`, the disassembled code (test_osize_dump.S) shows that
the compiler reduced instruction count, removed redundant register moves, and inlined small helper functions like `mul32`, `__umodsi3`, and `__udivsi3`.
#### Comparison Summary
| Aspect | Handwritten Assembly | Compiler `-Os` Assembly |
|---------|----------------------|--------------------------|
| Function structure | Each function has full prologue/epilogue with register save/restore | Small helper functions are inlined, fewer prologue/epilogue instructions |
| Instruction count | More instructions, explicitly structured | Fewer, denser instructions |
| Loop & branch | Explicit jump labels and branch targets | Combined conditional logic with reduced jumps |
| Arithmetic | Multi-step shifts and register operations | Rewritten as pure 32-bit arithmetic replacing 64-bit helpers |
| Performance | Larger code size, maintains correctness | Smaller binary, slightly faster execution |
#### Explanation
* The `-Os` flag tells GCC to optimize for size, not speed.
* It merges constant expressions, reuses registers efficiently, and inlines small functions to reduce jumps.
* The output code still preserves correctness with almost **0% numerical error**,
but occupies fewer bytes and fewer instructions than the handwritten version.
The disassembled `-Os` version shows that GCC can automatically achieve space-efficient assembly similar to manually optimized code.
It demonstrates how compiler-level optimization balances **code size**, **performance**, and **readability**.
## Quiz 3 Problem C -Ofast (Optimized for speed)
### Development Process & Issues
#### Link failed: libgcc division and remainder functions
At first, `print_dec` used `n / 10` and `n % 10` directly in C.
In bare-metal mode, the compiler lowers `/` and `%` to libgcc helper functions such as
`__udivsi3` and `__umodsi3`.
Since the standard libraries are not linked, the linker reported many
`undefined reference to '__udivsi3' / '__umodsi3'` errors.
fix:
**Completely remove `/` and `%` from the C source code.**
rewrote `print_dec` to emulate division by 10 with repeated subtraction:
A loop increments the quotient `q` while subtracting 10 from `t` until `t < 10`; then `t` is `n % 10` and `q` is `n / 10`.
```c
static void print_dec(uint32_t n)
{
char buf[16];
int i = 0;
if (n == 0) {
put_char('0');
return;
}
while (n && i < 16) {
uint32_t t = n;
uint32_t q = 0;
while (t >= 10u) {
t -= 10u;
q++;
}
buf[i] = (char)('0' + t); // t = n % 10
i++;
n = q; // n = n / 10
}
while (i > 0) {
i--;
put_bytes(&buf[i], 1);
}
}
```
#### `fast_rsqrt` multiple definition
Initially, fast_rsqrt was implemented both in `main.c` and in `rsqrt_org.c`.
The linker reported : multiple definition of 'fast_rsqrt'.
Fix:
Keep the implementation of `fast_rsqrt` only in `rsqrt_org.c`.
In `main.c`, keep only the declaration above and call `fast_rsqrt(x)` from there.
```c
extern uint32_t fast_rsqrt(uint32_t x);
```
#### RV32I No Multiplication Instruction: Avoiding `__mulsi3`
* The RV32I base ISA has no hardware multiply.
When the C code uses `a * b`, the compiler may emit a call to `__mulsi3`.
* In an earlier version, `fast_rsqrt` triggered
`undefined reference to '__mulsi3'`.
Fix:
In `rsqrt_org.c`, we rewrote `fast_rsqrt` to avoid 32-bit `*`,using shift-and-add style operations instead.
This makes the test self-contained: it relies only on RV32I instructions and the ecall interface.
### Optimization Principles & Effect of `-Ofast`
#### `-Os` vs `-Ofast` Difference Concept
* `-Os`: prioritizes code size, encouraging code sharing and discouraging aggressive inlining or unrolling.
* `-Ofast`: builds on `-O3` and enables more aggressive optimizations,
prioritizing **execution speed**, even if it increases instruction count or reorders instructions for pipeline behavior.
* In this assignment, we keep the C source code identical and only change the compiler flags`(-Os vs -Ofast)`, then compare the resulting cycles and instructions.
### Comparison: Hand-written, -Os, and -Ofast
| Version | Compile Option | Total Cycles | Total Instructions | Note |
|----------|----------------|--------------:|-------------------:|------|
| Hand-written `fast_rsqrt` | N/A | *(measured manually)* | *(measured manually)* | Assembly reference version |
| C version `fast_rsqrt` (-Os) | `-Os` | **14958** | **15004** | Optimized for code size |
| C version `fast_rsqrt` (-Ofast) | `-Ofast` | **14124** | **14139** | Optimized for execution speed |
From the table, with identical C code, the `-Ofast` build achieves about a
**6% reduction in total** cycles and total instructions compared to `-Os`.
### Speedup Formula
$$
\text{Speedup}_{\text{cycles}}
= \frac{\text{cycles}_{-Os}}{\text{cycles}_{-Ofast}}
= \frac{14958}{14124}
\approx 1.06
$$
$$
\text{Speedup}_{\text{inst}}
= \frac{\text{inst}_{-Os}}{\text{inst}_{-Ofast}}
= \frac{15004}{14139}
\approx 1.06
$$
Therefore, for this workload, -Ofast is roughly **6% faster** than `-Os`.
### Qualitative Comparison to Hand-written Version
* The hand-written version typically:
* Minimizes branches and memory accesses on the hot path
* Uses bit-level tricks and manual scheduling to match the pipeline.
* Although compiler optimizations (`-Os`, `-Ofast`) cannot perfectly match the level of manual control in assembly,they can still get close using inlining, dead-code elimination, and constant folding.
* Overall:
* Hand-written: fastest and tightest, but highest development cost.
* `-Ofast`: gives extra speed “for free” without changing the algorithm.
* `-Os`: sacrifices some performance to significantly reduce code size, which is useful when flash/ROM is limited.
### Execution Result

### Disassembly Analysis
#### Overall Structure
- The disassembly shows main sections: `_start`, `print_dec`, `memcpy`, `get_cycles`, `get_instret`, and `fast_rsqrt`.
- `_start` sets up the stack pointer and clears `.bss` before jumping to `main`, which is standard startup code unaffected by optimization.
#### Main Function Changes
- `main` initializes test data, stores registers, and measures cycles using `get_cycles` / `get_instret`.
- Under `-Ofast`, the compiler allocates more variables in registers and minimizes memory access.
- Branches and store/load operations are reduced, resulting in fewer instructions and better pipeline flow.
#### print_dec Optimization
- The `print_dec` function converts integers to ASCII digits and prints via `ecall`.
- Instead of a simple while loop, the compiler performs **loop unrolling**, reducing branch frequency.
- This technique increases code length but improves instruction-level parallelism.
| Optimization | Effect |
|---------------|---------|
| Loop unrolling | Fewer branches, smoother pipeline |
| Register reuse | Less memory access, faster execution |
#### memcpy Optimization
- `memcpy` checks if the address is 4-byte aligned.
- If aligned, it copies data in 32-bit words using `lw`/`sw`.
- Otherwise, it falls back to byte-wise copying with `lbu`/`sb`.
- This is a classic alignment optimization that improves memory throughput.
#### fast_rsqrt Algorithm
- Handles special cases early (`a0 == 0` or `a0 == 1`) to skip unnecessary computation.
- Uses lookup table access via `rsqrt_table` and exponent normalization.
- Employs **bitwise conditional add** instead of branching:
```c
result += (mask & value);
```