Assignment2: Complete Applications

# Assignment2: Complete Applications contributed by < [TWChris90](https://github.com/TWChris90/ca2025-quizzes) > ## Pick one complete program (with test suite) done in your Homework 1 and make it run in a bare-metal environment using rv32emu’s system emulation. ### Project Overview This project implements `uf8_decode` and `uf8_encode` functions in RISC-V assembly, then runs them on the `rv32emu` emulator to verify encoding–decoding correctness. ### Enviroment Installation * rv32emu is installed on WSL Ubuntu 22.04 LTS. * I chose to implement the encoding and decoding for Quiz 1 Problem b on re32emu. ### Development and Debugging Process #### Invalid `-march` Error **Issue** : When running make run, I got `invalid -march=rv32izicsr`. **Cause** : The system `as` (x86 assembler) was invoked instead of the RISC-V cross-assembler. **Fix** : Ensures the RISC-V toolchain is used. ``` export PATH=$HOME/riscv-none-elf-gcc/bin:$PATH make run AS=riscv-none-elf-as CC=riscv-none-elf-gcc ``` #### Global Label and Alignment Issue **Issue** : At first the linker couldn’t find `pb_main`, and memory alignment errors occurred when accessing string data. **Fix** : `.globl` makes the label visible to the linker, `.align 2` ensures 4-byte alignment to avoid misaligned access. ``` .globl pb_main .align 2 ``` #### Stack Frame Maintenance **Issue** : Function returns jumped to wrong address because `ra` was not saved. **Fix** : Proper stack save/restore solved the issue. ``` addi sp, sp, -4 sw ra, 0(sp) lw ra, 0(sp) addi sp, sp, 4 jr ra ``` ### Core Implementation #### uf8_decode ``` andi a1, a0, 0x0F # mantissa srli a2, a0, 4 # exponent addi a3, a2, 4 li a4, 1 sll a4, a4, a3 # 1 << (e+4) addi a4, a4, -16 # offset sll a3, a1, a2 # m << e add a0, a3, a4 # value = (m<<e) + offset jr ra ``` #### uf8_encode ``` add a0, a7, x0 jal ra, CLZ # Count leading zeros li a1, 31 sub a1, a1, a0 # msb = 31 - lz slli a1, a3, 4 or a0, a1, a2 # pack [eeee mmmm] jr ra ``` The `test` function loops through codes 0 to 255, runs decode → encode, and verifies both equivalence and monotonic increase. If any check fails, it prints error messages via `ecall`; otherwise, `pb_main` prints “All tests passed.” ### Disassemble elf. ``` Disassembly of section .text: 00010000 <_start>: 10000: 00008117 auipc sp,0x8 10004: ba010113 addi sp,sp,-1120 # 17ba0 <__stack_top> 10008: 00007297 auipc t0,0x7 1000c: b9828293 addi t0,t0,-1128 # 16ba0 <__bss_end> 10010: 00007317 auipc t1,0x7 10014: b9030313 addi t1,t1,-1136 # 16ba0 <__bss_end> 10018: 0062d863 bge t0,t1,10028 <_start+0x28> 1001c: 0002a023 sw zero,0(t0) 10020: 00428293 addi t0,t0,4 10024: ff5ff06f j 10018 <_start+0x18> 10028: 609010ef jal 11e30 <main> 1002c: 05d00893 li a7,93 10030: 00000513 li a0,0 10034: 00000073 ecall 10038: 0000006f j 10038 <_start+0x38> ... ``` ### Execution Result ![螢幕擷取畫面 2025-11-11 185313](https://hackmd.io/_uploads/rykaScglbl.png) ## Quiz 2 Problem A — Recursive Hanoi Tower The goal of this task is to implement a recursive version of the Hanoi Tower algorithm in RISC-V assembly, printing each move step to the console using `ecall` system calls. ### Program Logic * **problemA_main** ``` problemA_main: addi sp, sp, -16 sw ra, 12(sp) sw s0, 8(sp) sw s1, 4(sp) sw s2, 0(sp) li a0, 3 li a1, 0 li a2, 2 li a3, 1 jal ra, hanoi ``` Creates a new stack frame and saves registers for recursion. Sets parameters `a0=3`, `a1=0` (A), `a2=2` (C), `a3=1` (B). Calls `hanoi(3, A, C, B)` to start recursion. * **hanoi**(Recursive Function) ``` hanoi: addi sp, sp, -20 sw ra, 16(sp) sw a0, 12(sp) sw a1, 8(sp) sw a2, 4(sp) sw a3, 0(sp) beq a0, x0, h_ret ``` Each recursive call creates its own stack frame. If `a0==0`, there are no disks to move — return. Then, the recursive three-step process: 1. Moves the top n−1 disks to the auxiliary peg. ``` addi a0, a0, -1 mv t0, a2 mv a2, a3 mv a3, t0 jal ra, hanoi ``` 2. Prints the move for the largest disk. ``` lw a0, 12(sp) lw a1, 8(sp) lw a2, 4(sp) jal ra, print_move ``` 3. Moves the n−1 disks from the auxiliary peg to the destination peg. ``` lw a0, 12(sp) lw a1, 8(sp) lw a2, 4(sp) lw a3, 0(sp) addi a0, a0, -1 mv t0, a1 mv a1, a3 mv a3, t0 jal ra, hanoi ``` 4. Finally restores the stack and returns: ``` h_ret: lw ra, 16(sp) addi sp, sp, 20 jr ra ``` * **Print Function** ``` print_move: addi sp, sp, -16 sw ra, 12(sp) ``` Retrieves the ASCII character for the disk number: ``` addi t0, a0, -1 la t1, disks add t1, t1, t0 lbu t0, 0(t1) ``` Retrieves 'A', 'B', or 'C' based on from/to values: ``` la t2, pegs add t3, t2, a1 lbu t1, 0(t3) add t3, t2, a2 lbu t2, 0(t3) ``` Outputs full sentence with multiple `ecall`s: ``` li a0,1 la a1,str1 li a2,10 li a7,0x40 ecall ``` Prints `"Move Disk "`, then disk number, from peg, to peg, and newline. ### System Call Design | Register | Function | Description | |-----------|-----------|-------------| | `a7 = 0x40` | Write string | To stdout | | `a0 = 1` | File descriptor | stdout | | `a1` | String address | Pointer to string | | `a2` | String length | Output size | Each full sentence requires 6 `ecall`s. ### Execution Result ![image](https://hackmd.io/_uploads/rk7Rj6MfWx.png) ## Quiz 3 Problem C — fast_rsqrt The goal is to compute the reciprocal square root $y = \frac{1}{\sqrt{x}}$ efficiently on a pure RV32I platform without floating-point hardware. Since division and square root are expensive, we reformulate the problem as a fixed-point iteration that relies only on integer operations, lookup tables, and shifts. ### Algorithm Structure The implementation follows a lookup-table based Newton–Raphson iteration. * Find exponent `e = 31 - clz(x)` * Get initial value `y` from the lookup table * Linearly interpolate between two neighboring entries * Perform two Newton–Raphson correction iterations ### Mathematical Foundation We define the target function: $$ f(y) = \frac{1}{y^2} - x $$ and seek $f(y) = 0$. The Newton–Raphson update rule is: $$ y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} $$ Expanding $f'(y_n) = -\frac{2}{y_n^3}$, we obtain: $$ y_{n+1} = y_n - \frac{\frac{1}{y_n^2} - x}{-2 / y_n^3} = y_n \left( \frac{3 - x y_n^2}{2} \right) $$ Thus, the iterative formula becomes: $$ y_{n+1} = y_n \times \frac{3 - (x \times y_n^2)}{2} $$ ### Fixed-point Implementation To handle fractional precision, all variables are scaled by $2^{16}$. Hence, the true value $y$ is represented as $y_{int} = y \times 2^{16}$. All intermediate multiplications and corrections are shifted right by 16–17 bits to restore the fixed-point range. The lookup table `rsqrt_table[32]` stores precomputed values of: $$ 2^{16} / \sqrt{2^e} $$ for exponents $e = 0...31$. When an input $x$ is given, we first find its leading exponent: $$ e = 31 - \text{clz}(x) $$ and interpolate between `rsqrt_table[e]` and `rsqrt_table[e+1]` using a linear ratio derived from the fractional part of $x$. ### Newton Iteration and Refinement Two correction steps are performed: ```c for (int k = 0; k < 2; ++k) { uint64_t y2_64 = mul32(y, y); uint32_t y2 = (uint32_t)y2_64; uint64_t xy2_64 = mul32(x, y2); uint32_t xy2 = (uint32_t)(xy2_64 >> 16); uint32_t corr = (3u << 16) - xy2; uint64_t update = mul32(y, corr); y = (uint32_t)(update >> 17); } ``` Each iteration approximately doubles the number of correct bits. After two iterations, the result stabilizes for 16-bit precision. ### Integer Arithmetic Core Since RV32I lacks a native 64-bit multiplier, a manual $32\times32 \to 64$ routine `mul32()` is implemented using bit-shift accumulation. ```c static uint64_t mul32(uint32_t a, uint32_t b) { uint64_t acc = 0; uint64_t va = a; uint32_t vb = b; while (vb) { if (vb & 1u) acc += va; va <<= 1; vb >>= 1; } return acc; } ``` This eliminates the need for `__muldi3` and ensures correct execution in bare-metal mode. ### Experimental Behavior The output shows perfect integer error (0%) for all test inputs, with small remainders indicating sub-1% deviation. Cycle counts per call range from roughly 2600 to 3900 depending on input magnitude. ![image](https://hackmd.io/_uploads/H1BfuZEz-l.png) ### Convergence Analysis The Newton iteration exhibits quadratic convergence near the true value: $$ e_{n+1} \approx C \cdot e_n^2 $$ where $e_n$ is the relative error. Given an initial lookup error below 5%, two iterations are sufficient to reach fixed-point precision within $10^{-3}$. ### Disassembly Analysis #### Observations on Disassembled Code ##### Function Structure The disassembly shows the following functions: * `_start` * `memcpy` * `udiv`, `umod`, `umul` * `__mulsi3` * `print_hex`, `print_dec` * `fast_rsqrt` * `run_tests`, `report_result` Among these, `fast_rsqrt` is the core function under test. ##### fast_rsqrt — Mathematical Concept `fast_rsqrt` computes an approximate reciprocal square root: $$ x_{n+1} = x_n \cdot \frac{3 - a \cdot x_n^2}{2} $$ It’s derived from the Newton–Raphson iteration for$f(x) = \frac{1}{\sqrt{a}}$ ##### Instruction Patterns | Version | Observation | |----------|--------------| | `-O0` | Contains many load/store operations (`lw`, `sw`) between stack and registers. | | `-Ofast` | Fewer branches, some inlining; loop unrolled; register reuse optimized. | | `-Os` | Maintains loops; compact instruction count; fewer temporaries. | For example, under -Ofast, the compiler may replace: ``` mul a0, a0, a0 div a0, a1, a0 ``` with a shift–multiply pattern like: ``` slli a5, a0, 1 mul a0, a1, a5 ``` to avoid costly division. ##### Division and Multiplication Details The disassembly shows calls to `udiv`, `umod`, and `umul`, all implemented manually since rv32i lacks hardware division. In mathematical form: $$ \text{udiv}(a, b) = \sum_{i=0}^{31} \frac{((a >> (31 - i)) \& 1)}{b} $$ and $$ \text{umul}(a, b) = \sum_{i=0}^{31} ((b_i = 1) ? (a << i) : 0) $$ These emulate integer division and multiplication in software. ##### Reported Performance (Cycles & Instructions) Example from RV32Emu output: | Input | Ideal | Error | Cycles | Instructions | |-------|--------|--------|----------|---------------| | 1 | 65536 | 0% | 56 | 56 | | 3 | 37836 | 0% | 3953 | 3964 | | 9 | 21845 | 0% | 3749 | 3760 | | 25 | 13107 | 0% | 3705 | 3716 | | 64 | 8192 | 0% | 2606 | 2671 | From the report, the**Total Instructions:** ≈ 21,233 and the **Total Cycles:** ≈ 21,167, showing a one-to-one correspondence with instruction count due to simple arithmetic. ## Quiz 3 Problem C -Os (Optimized for size) In this task, the goal was to reimplement the original `fast_rsqrt` function using the -Os (optimized for size) compilation flag. The objective was to ensure correct execution in the RISC-V environment, resolve linker and optimization issues introduced by the compiler, and maintain correct output and performance statistics (cycles and instructions). ### Development Challenges #### Linker Errors for 64-bit Shifts The original C version of `fast_rsqrt` used `uint64_t` operations such as `(uint64_t)x << 16`. On an RV32I architecture, GCC automatically calls libgcc helpers like `__ashldi3` and `__lshrdi3` to handle 64-bit shifts. However, since **rv32emu** does not include libgcc, this caused linker errors such as undefined reference to `__ashldi3` and `__lshrdi3`. Fix: For example: ```c frac = ((x - (1u << exp)) << 16) >> exp; ``` All 64-bit operations were rewritten as 32-bit equivalents, using a custom `mul32()` to emulate 64-bit multiplication without libgcc. #### Missing 32-bit Arithmetic Helpers In functions such as print_dec() and error computation, the use of `/` and `%` caused GCC to inject calls to `__udivsi3`,` __umodsi3`, and `__mulsi3`. Without libgcc, these functions were undefined at link time. ``` undefined reference to `__udivsi3' undefined reference to `__umodsi3' undefined reference to `__mulsi3' ``` Fix: （Software-emulated arithmetic functions）： ``` unsigned int __mulsi3(unsigned int a, unsigned int b) { ... } unsigned int __udivsi3(unsigned int a, unsigned int b) { ... } unsigned int __umodsi3(unsigned int a, unsigned int b) { ... } ``` These were reimplemented using only addition and bit shifting, fully removing dependency on libgcc. #### Incorrect Output under -Os When compiled with `-Os` or `-Ofast`, GCC incorrectly optimized away parts of the inline assembly `asm volatile("ecall")`, assuming it did not affect memory, causing garbled or missing output. Fix: Add `"memory"` to the clobber list: ``` asm volatile (... : : "r"(ptr), "r"(len) : "a0","a1","a2","a7","memory"); ``` This explicitly tells the compiler that memory is affected, preventing unsafe optimizations. #### ENABLE_ELF_LOADER Error ``` Error: ENABLE_ELF_LOADER=1 not set ``` `make run` checked for `ENABLE_ELF_LOADER=1` in the rv32emu config file. Commenting out that line in the Makefile allowed direct execution of `test.elf`. ### Key Modifications Summary | Aspect | Original | Optimized for Size (-Os) | |---------|-----------|--------------------------| | 64-bit Operations | Used `uint64_t`, triggered libgcc | Replaced with 32-bit shifts and custom `mul32()` | | Division | Relied on `__udivsi3` / `__umodsi3` | Reimplemented manual bit-shift division | | Output | No `"memory"` clobber, caused corruption | Added `"memory"` to prevent optimization | | Compilation Flags | Default `-O2` | Changed to `-Os` for size reduction | | Result | Larger binary, correct output | Smaller binary, same accuracy (≈ 0% error) | ### Execution Results Comparison | x | fast_rsqrt(x) | Ideal | Error (%) | Cycles | Instructions | |---|---------------|--------|------------|---------|---------------| | 1 | 65536 | 65536 | 0 | 40 | 40 | | 3 | 37836 | 37837.23 | 0 | 2626 | 2637 | | 9 | 21845 | 21845.33 | 0 | 2597 | 2604 | | 25 | 13107 | 13107.20 | 0 | 2597 | 2604 | | 64 | 8192 | 8192.00 | 0 | 1906 | 1913 | | 123 | 5909 | 5909.18 | 0 | 2616 | 2623 | | 500 | 2930 | 2930.86 | 0 | 2576 | 2583 | ![image](https://hackmd.io/_uploads/r1p3GVVfWg.png) Summary: * Total Cycles: ≈ 14,958 * Total Instructions: ≈ 15,004 Compared with the original version (~21,167 cycles / 21,233 instructions), the -Os version achieved roughly 29% fewer cycles and 28% fewer instructions. ### Disassembly Analysis After compiling with `-Os`, the disassembled code (test_osize_dump.S) shows that the compiler reduced instruction count, removed redundant register moves, and inlined small helper functions like `mul32`, `__umodsi3`, and `__udivsi3`. #### Comparison Summary | Aspect | Handwritten Assembly | Compiler `-Os` Assembly | |---------|----------------------|--------------------------| | Function structure | Each function has full prologue/epilogue with register save/restore | Small helper functions are inlined, fewer prologue/epilogue instructions | | Instruction count | More instructions, explicitly structured | Fewer, denser instructions | | Loop & branch | Explicit jump labels and branch targets | Combined conditional logic with reduced jumps | | Arithmetic | Multi-step shifts and register operations | Rewritten as pure 32-bit arithmetic replacing 64-bit helpers | | Performance | Larger code size, maintains correctness | Smaller binary, slightly faster execution | #### Explanation * The `-Os` flag tells GCC to optimize for size, not speed. * It merges constant expressions, reuses registers efficiently, and inlines small functions to reduce jumps. * The output code still preserves correctness with almost **0% numerical error**, but occupies fewer bytes and fewer instructions than the handwritten version. The disassembled `-Os` version shows that GCC can automatically achieve space-efficient assembly similar to manually optimized code. It demonstrates how compiler-level optimization balances **code size**, **performance**, and **readability**. ## Quiz 3 Problem C -Ofast (Optimized for speed) ### Development Process & Issues #### Link failed: libgcc division and remainder functions At first, `print_dec` used `n / 10` and `n % 10` directly in C. In bare-metal mode, the compiler lowers `/` and `%` to libgcc helper functions such as `__udivsi3` and `__umodsi3`. Since the standard libraries are not linked, the linker reported many `undefined reference to '__udivsi3' / '__umodsi3'` errors. fix: **Completely remove `/` and `%` from the C source code.** rewrote `print_dec` to emulate division by 10 with repeated subtraction: A loop increments the quotient `q` while subtracting 10 from `t` until `t < 10`; then `t` is `n % 10` and `q` is `n / 10`. ```c static void print_dec(uint32_t n) { char buf[16]; int i = 0; if (n == 0) { put_char('0'); return; } while (n && i < 16) { uint32_t t = n; uint32_t q = 0; while (t >= 10u) { t -= 10u; q++; } buf[i] = (char)('0' + t); // t = n % 10 i++; n = q; // n = n / 10 } while (i > 0) { i--; put_bytes(&buf[i], 1); } } ``` #### `fast_rsqrt` multiple definition Initially, fast_rsqrt was implemented both in `main.c` and in `rsqrt_org.c`. The linker reported : multiple definition of 'fast_rsqrt'. Fix: Keep the implementation of `fast_rsqrt` only in `rsqrt_org.c`. In `main.c`, keep only the declaration above and call `fast_rsqrt(x)` from there. ```c extern uint32_t fast_rsqrt(uint32_t x); ``` #### RV32I No Multiplication Instruction: Avoiding `__mulsi3` * The RV32I base ISA has no hardware multiply. When the C code uses `a * b`, the compiler may emit a call to `__mulsi3`. * In an earlier version, `fast_rsqrt` triggered `undefined reference to '__mulsi3'`. Fix: In `rsqrt_org.c`, we rewrote `fast_rsqrt` to avoid 32-bit `*`,using shift-and-add style operations instead. This makes the test self-contained: it relies only on RV32I instructions and the ecall interface. ### Optimization Principles & Effect of `-Ofast` #### `-Os` vs `-Ofast` Difference Concept * `-Os`: prioritizes code size, encouraging code sharing and discouraging aggressive inlining or unrolling. * `-Ofast`: builds on `-O3` and enables more aggressive optimizations, prioritizing **execution speed**, even if it increases instruction count or reorders instructions for pipeline behavior. * In this assignment, we keep the C source code identical and only change the compiler flags`(-Os vs -Ofast)`, then compare the resulting cycles and instructions. ### Comparison: Hand-written, -Os, and -Ofast | Version | Compile Option | Total Cycles | Total Instructions | Note | |----------|----------------|--------------:|-------------------:|------| | Hand-written `fast_rsqrt` | N/A | *(measured manually)* | *(measured manually)* | Assembly reference version | | C version `fast_rsqrt` (-Os) | `-Os` | **14958** | **15004** | Optimized for code size | | C version `fast_rsqrt` (-Ofast) | `-Ofast` | **14124** | **14139** | Optimized for execution speed | From the table, with identical C code, the `-Ofast` build achieves about a **6% reduction in total** cycles and total instructions compared to `-Os`. ### Speedup Formula $$ \text{Speedup}_{\text{cycles}} = \frac{\text{cycles}_{-Os}}{\text{cycles}_{-Ofast}} = \frac{14958}{14124} \approx 1.06 $$ $$ \text{Speedup}_{\text{inst}} = \frac{\text{inst}_{-Os}}{\text{inst}_{-Ofast}} = \frac{15004}{14139} \approx 1.06 $$ Therefore, for this workload, -Ofast is roughly **6% faster** than `-Os`. ### Qualitative Comparison to Hand-written Version * The hand-written version typically: * Minimizes branches and memory accesses on the hot path * Uses bit-level tricks and manual scheduling to match the pipeline. * Although compiler optimizations (`-Os`, `-Ofast`) cannot perfectly match the level of manual control in assembly,they can still get close using inlining, dead-code elimination, and constant folding. * Overall: * Hand-written: fastest and tightest, but highest development cost. * `-Ofast`: gives extra speed “for free” without changing the algorithm. * `-Os`: sacrifices some performance to significantly reduce code size, which is useful when flash/ROM is limited. ### Execution Result ![image](https://hackmd.io/_uploads/HyazIrNfWg.png) ### Disassembly Analysis #### Overall Structure - The disassembly shows main sections: `_start`, `print_dec`, `memcpy`, `get_cycles`, `get_instret`, and `fast_rsqrt`. - `_start` sets up the stack pointer and clears `.bss` before jumping to `main`, which is standard startup code unaffected by optimization. #### Main Function Changes - `main` initializes test data, stores registers, and measures cycles using `get_cycles` / `get_instret`. - Under `-Ofast`, the compiler allocates more variables in registers and minimizes memory access. - Branches and store/load operations are reduced, resulting in fewer instructions and better pipeline flow. #### print_dec Optimization - The `print_dec` function converts integers to ASCII digits and prints via `ecall`. - Instead of a simple while loop, the compiler performs **loop unrolling**, reducing branch frequency. - This technique increases code length but improves instruction-level parallelism. | Optimization | Effect | |---------------|---------| | Loop unrolling | Fewer branches, smoother pipeline | | Register reuse | Less memory access, faster execution | #### memcpy Optimization - `memcpy` checks if the address is 4-byte aligned. - If aligned, it copies data in 32-bit words using `lw`/`sw`. - Otherwise, it falls back to byte-wise copying with `lbu`/`sb`. - This is a classic alignment optimization that improves memory throughput. #### fast_rsqrt Algorithm - Handles special cases early (`a0 == 0` or `a0 == 1`) to skip unnecessary computation. - Uses lookup table access via `rsqrt_table` and exponent normalization. - Employs **bitwise conditional add** instead of branching: ```c result += (mask & value); ```