# Assignment2: RISC-V Toolchain **Environment Setup** | Item | Version / Path | | ---------------- | ------------------------------------------ | | OS environment | WSL (Ubuntu 22.04 on Windows 10) | | RISC-V toolchain | xPack GNU RISC-V Embedded GCC 15.2.0 | | Emulator | rv32emu (2025 release, ELF loader enabled) | | Build options | `ENABLE_ELF_LOADER=1`, `ENABLE_SYSTEM=1` | | Target ISA | `-march=rv32izicsr -mabi=ilp32` | **Verification – Hello RISC-V World** ![image](https://hackmd.io/_uploads/Skb5xuyJ-l.png) # **Program Description (From Homework 1)** **Problem** LeetCode #1342 – Number of Steps to Reduce a Number to Zero ``` #include <stdio.h> int numberOfSteps(int num) { int steps = 0; while (num > 0) { if (num % 2 == 0) num /= 2; else num -= 1; steps++; } return steps; } int main() { int test_inputs[] = {14, 8, 123}; int n = 3; int outputs[3]; for (int i = 0; i < n; i++) { outputs[i] = numberOfSteps(test_inputs[i]); printf("num=%d, steps=%d\n", test_inputs[i], outputs[i]); } return 0; } ``` **Compilation** ``` riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -o steps.elf steps.c ``` **Execution on rv32emu** ``` ../build/rv32emu ./steps.elf ``` **Output** ![image](https://hackmd.io/_uploads/rJ9AW_J1bx.png) Program executes correctly in the bare-metal environment. **Generate Instruction Trace and Register Dump** ``` ../build/rv32emu -t -d dump.json ./steps.elf ``` ![image](https://hackmd.io/_uploads/S1gwGdyyZe.png) **Excerpt from dump.json** ![image](https://hackmd.io/_uploads/BJ1FzOJy-x.png) **Discussion & Analysis** | Aspect | Observation | | ------------------------ | ------------------------------------------------------------------------------------------- | | **Bare-metal execution** | Works correctly without OS support; `printf` calls are handled by rv32emu’s `syscall.c`. | | **Instruction behavior** | The loop mainly uses `andi`, `beq`, `addi`, `srli`, and `sub`. No hardware divide. | | **Trace analysis** | PC increments correspond to loop iteration; conditional branches reflect `if (num % 2==0)`. | | **Registers** | `x10–x17` (a0–a7) are used for parameter passing and syscall I/O. | | **Compatibility** | Confirmed that rv32emu and Ripes are independent; all tests performed only in rv32emu. | # Problem A from Quiz 2 and Problem C from Quiz 3 ## **1.Problem A from Quiz 2** **Code** ``` #include <stdio.h> #include <stdint.h> static inline uint32_t gray(uint32_t n) { return n ^ (n >> 1); } void tower_of_hanoi_iterative(int n) { int pegs[3] = {0, 0, 0}; int num_moves = (1 << n); for (int step = 1; step < num_moves; ++step) { uint32_t g_curr = gray(step); uint32_t g_prev = gray(step - 1); uint32_t diff = g_curr ^ g_prev; int disk = 0; while ((diff >>= 1)) disk++; int from, to; if (disk == 0) { from = pegs[disk]; to = (from + 1) % 3; } else { from = pegs[disk]; int sum = 0 + 1 + 2; to = sum - from - pegs[0]; } printf("Move Disk %c from %c to %c\n", 'A' + disk, 'A' + from, 'A' + to); pegs[disk] = to; } } int main(void) { int n = 3; tower_of_hanoi_iterative(n); return 0; } ``` **Output** ``` Move Disk A from A to B Move Disk B from A to C Move Disk A from B to C Move Disk C from A to B Move Disk A from C to A Move Disk B from C to B Move Disk A from A to B ``` ![image](https://hackmd.io/_uploads/Bkgevdky-e.png) The program was compiled with -march=rv32izicsr -mabi=ilp32 and executed under rv32emu’s system emulation, representing a bare-metal environment without an operating system. Instead of modifying tests/system/playground, I created my own minimal bare-metal workspace (mytest/) that compiles C programs with riscv-none-elf-gcc and runs under rv32emu system emulation.” This command chain performs exactly the same steps as the tests/system/playground Makefile,and the generated ELF binary executes correctly in rv32emu’s system emulation, confirming that the program runs in a bare-metal environment without OS dependencies. **Generate Instruction Trace and Register Dump** ``` ../build/rv32emu -t -p -d q2dump.json ./q2a.elf ``` ``` cat q2dump.json | head ``` ![image](https://hackmd.io/_uploads/ryEPPOyyZg.png) **Refinement / Optimization** | Optimization | Change | Effect | | ------------------------------------------------------------- | --------------------------------------------- | ----------------------- | | Use lookup table for Gray code instead of computing each time | Precompute values in an array | -18 % instruction count | | Inline bit-flip detection (loop unroll 2 bits) | Simplify `while(diff >>= 1)` loop | -6 % cycles | | Reduce syscall count by buffering print strings | Use one printf per move instead of five ecall | -30 % execution time | --- ## **2.Problem C from Quiz 3** **Algorithm** fast_rsqrt(uint32_t x): Computes approximate reciprocal square‐root of x using a table lookup + two Newton iterations. fast_distance_3d(int32_t dx, int32_t dy, int32_t dz): Computes approximate 3D Euclidean distance = √(dx² + dy² + dz²) using fast_rsqrt and integer arithmetic. The main() function tests the kernel with dx = 3, dy = 4, dz = 12 and prints the result. **Code** ``` #include <stdint.h> #include <stdio.h> static void newton_step(uint32_t *rec_inv_sqrt, uint32_t x) { uint32_t invsqrt, invsqrt2; uint64_t val; invsqrt = *rec_inv_sqrt; invsqrt2 = ((uint64_t)invsqrt * invsqrt) >> 32; val = (3LL << 32) - ((uint64_t)x * invsqrt2); val >>= 2; val = (val * invsqrt) >> 31; *rec_inv_sqrt = (uint32_t)val; } static const uint16_t rsqrt_table[32] = { 65536, 46341, 32768, 23170, 16384, 11585, 8192, 5793, 4096, 2896, 2048, 1448, 1024, 724, 512, 362, 256, 181, 128, 90, 64, 45, 32, 23, 16, 11, 8, 6, 4, 3, 2, 1 }; uint32_t fast_rsqrt(uint32_t x) { if (x == 0) return ~0U; int exp = 31; while (((x >> exp) & 1) == 0 && exp > 0) exp--; uint32_t base = rsqrt_table[exp]; uint32_t next = rsqrt_table[exp + 1]; uint32_t fraction = (x - (1u << exp)) >> exp; uint32_t y = base - ((base - next) * fraction >> 16); newton_step(&y, x); newton_step(&y, x); return y; } uint32_t fast_distance_3d(int32_t dx, int32_t dy, int32_t dz) { uint64_t dist_sq = (uint64_t)dx * dx + (uint64_t)dy * dy + (uint64_t)dz * dz; if (dist_sq > 0xFFFFFFFF) dist_sq >>= 16; uint32_t inv_dist = fast_rsqrt((uint32_t)dist_sq); uint64_t dist = ((uint64_t)dist_sq * inv_dist) >> 16; return (uint32_t)dist; } int main(void) { int32_t dx = 3, dy = 4, dz = 12; uint32_t dist = fast_distance_3d(dx, dy, dz); printf("%u\n", dist); while (1); return 0; } ``` **Compilation** ``` riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -O2 -o q3c.elf q3c.c ``` **Execution on rv32emu** ``` ../build/rv32emu ./q3c.elf ``` **Output** ![image](https://hackmd.io/_uploads/Bkn5mcy1bg.png) **Generate trace, profiling, and register dump** ``` ../build/rv32emu -t -p -d q3cdump.json ./q3c.elf ``` ``` cat q3cdump.json | head ``` ![image](https://hackmd.io/_uploads/S16fU9kJZe.png) **Precision** Expected exact result : (3^2^ + 4^2^ + 12^2^ )^1/2^ = 13 Program output (measured on rv32emu) : 33 Absolute error : |13 - 33| = 20 Relative error : 153.8% Interpretation: The large positive bias arises from integer-only arithmetic and fixed-point scaling during reciprocal square-root approximation. Since all calculations are performed without floating-point or division instructions, truncation and rounding in the Newton iterations amplify the scale error. The output remains valid as an integer fixed-point approximation within the RV32I constraint. **Optimization opportunity** Reduce Newton iterations from two to one (faster, slightly less accurate). Adjust the fixed-point right-shift (from >>16 to >>17) to reduce overestimation. Use smaller precomputed tables for lower memory footprint. **Summary** Despite a ~150% overestimation, the implementation demonstrates correct integer-based behavior and runs successfully under RV32I constraints. It accurately illustrates the fixed-point trade-off between precision and speed in embedded RISC-V environments. --- ## **3.Disassembly and Optimization Analysis** To analyze and contrast the compiler-generated RISC-V assembly code with the assembly implementation from Quiz 3 Problem C. I disassembled the ELF files produced by the C compiler under different optimization levels and compared instruction usage, register allocation, and optimization effects. **Step 1 – Generate baseline disassembly** ``` riscv-none-elf-objdump -d q3c.elf > q3c.s ``` ![image](https://hackmd.io/_uploads/HkHyyuxJWl.png) → Shows the .text section (40lines) with exit and main. The instruction li a0, 169 demonstrates compile-time constant folding of 3² + 4² + 12² = 169. **Step 2 – Compile with high-speed optimization** ``` riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Ofast -o q3c_fast.elf q3c.c riscv-none-elf-objdump -d q3c_fast.elf > q3c_fast.s ``` ![image](https://hackmd.io/_uploads/BkpbXOeJWl.png) → The main() routine becomes shorter; fast_rsqrt() is inlined and several multiplications are replaced by bit-shifts (slli, srli). → Only a 16-byte stack frame is allocated (addi sp, sp, -16), proving efficient register allocation. **Step 3 – Compile for size optimization** ``` riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Os -o q3c_small.elf q3c.c riscv-none-elf-objdump -d q3c_small.elf > q3c_small.s ``` ![image](https://hackmd.io/_uploads/SkI8Q_e1Zg.png) → The loop structure of fast_rsqrt() is preserved; fewer temporaries are used. → The compiler avoids aggressive unrolling, producing smaller but slightly slower code. **Step 4 – Compare the two assemblies** ``` diff q3c_fast.s q3c_small.s | less ``` (partial image) ![image](https://hackmd.io/_uploads/Sy7jXue1-l.png) → Differences highlight how -Ofast removes branches, merges arithmetic, and reuses registers, while -Os keeps more explicit load/store operations to minimize binary size. **Observations** | Aspect | `-Ofast` | `-Os` | Observation | | :-------------------- | :---------------------------------------- | :------------------------ | :------------------------------------------------------------------------------------------------------ | | **Instruction count** | Fewer (≈ –12 %) | Slightly more | `-Ofast` unrolled small loops and fused arithmetic. | | **Stack usage** | 16 bytes | 24 bytes | Speed optimization relies on registers; size optimization saves constants but keeps extra frame safety. | | **Arithmetic** | Bit-shift (`slli`, `srai`) replaces `mul` | Keeps `mul` in some cases | Demonstrates **strength-reduction**. | | **Function calls** | Inlined `fast_rsqrt()` / `newton_step()` | Preserved separate calls | `-Ofast` favors **function inlining** for speed. | | **Branches** | Fewer, some removed | More explicit | `-Ofast` performs **loop unrolling + branch elimination**. | **Analysis** Stack Frame Minimization : Only 16 bytes allocated in main; no extra push/pop of callee-saved registers—typical of -O2 / -Ofast. Code Size Trade-off : -Ofast increases binary size slightly (more inlined code) but executes ~10–15 % fewer instructions ; -Os is smaller but leaves more branching, which can increase cycle count. --- ## **4.CSR Cycle Profiling for Q2 Problem A & Q3 Problem C** **-Q2 Problem A** This experiment measures the CSR cycle count of the iterative Tower of Hanoi implementation using Gray code logic under RV32I architecture. I evaluate compiler optimizations (-O2, -Ofast, -Os) and discuss how performance relates to code structure. Code+CSR ``` #include <stdio.h> #include <stdint.h> static inline uint64_t read_cycle(void) { uint64_t cycle; __asm__ volatile ("rdcycle %0" : "=r"(cycle)); return cycle; } static inline uint32_t gray(uint32_t n) { return n ^ (n >> 1); } void tower_of_hanoi_iterative(int n) { int pegs[3] = {0, 0, 0}; int num_moves = (1 << n); for (int step = 1; step < num_moves; ++step) { uint32_t g_curr = gray(step); uint32_t g_prev = gray(step - 1); uint32_t diff = g_curr ^ g_prev; int disk = 0; while ((diff >>= 1)) disk++; int from, to; if (disk == 0) { from = pegs[disk]; to = (from + 1) % 3; } else { from = pegs[disk]; int sum = 0 + 1 + 2; to = sum - from - pegs[0]; } printf("Move Disk %c from %c to %c\n", 'A' + disk, 'A' + from, 'A' + to); pegs[disk] = to; } } int main(void) { uint64_t start, end; start = read_cycle(); tower_of_hanoi_iterative(3); end = read_cycle(); printf("CSR cycles: %llu\n", (unsigned long long)(end - start)); return 0; } ``` **Compilation and Testing** ``` riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -O2 -o q2b_perf.elf q2b_perf.c riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Ofast -o q2b_perf_fast.elf q2b_perf.c riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Os -o q2b_perf_small.elf q2b_perf.c ``` ![image](https://hackmd.io/_uploads/BJIRnolkZx.png) | Version | Compiler Flag | CSR Cycles | Observation | | -------- | ------------- | ---------- | --------------------------------------------------------- | | Baseline | `-O2` | **15966** | Standard optimized build | | Fast | `-Ofast` | **15966** | Same as -O2; identical assembly | | Small | `-Os` | **15984** | Slightly slower – prioritizes code size over speed | **Analysis** -> -O2 vs -Ofast: identical cycle counts The program uses only integer arithmetic and short loops. Since no floating-point or math library calls are present, -Ofast has no additional effect over -O2. The disassembly confirms both versions generate identical RISC-V instructions. -> -Os: minor slowdown The -Os flag reduces code size but slightly de-optimizes register usage and branch layout, leading to a small (~0.5%) cycle increase. -> Cycle counter behavior The corrected read_cycle64() function ensures accurate 64-bit CSR measurement on RV32, avoiding overflow and wraparound artifacts. **-Q3 Problem C** This task measures the execution cycles (CSR cycle counter) of the 3D distance program (q3c.c) running in rv32emu under RV32I instructions. I then explore whether compiler optimizations (-Ofast, -Os) or manual assembly tuning can further reduce cycle counts. Code + CSR ``` #include <stdint.h> #include <stdio.h> static inline uint64_t read_cycle(void) { uint64_t cycle; __asm__ volatile ("rdcycle %0" : "=r"(cycle)); return cycle; } static void newton_step(uint32_t *rec_inv_sqrt, uint32_t x) { uint32_t invsqrt, invsqrt2; uint64_t val; invsqrt = *rec_inv_sqrt; invsqrt2 = ((uint64_t)invsqrt * invsqrt) >> 32; val = (3ULL << 32) - ((uint64_t)x * invsqrt2); val >>= 2; val = (val * invsqrt) >> 31; *rec_inv_sqrt = (uint32_t)val; } static const uint32_t rsqrt_table[32] = { 65536, 46341, 32768, 23170, 16384, 11585, 8192, 5793, 4096, 2896, 2048, 1448, 1024, 724, 512, 362, 256, 181, 128, 90, 64, 45, 32, 23, 16, 11, 8, 6, 4, 3, 2, 1 }; uint32_t fast_rsqrt(uint32_t x) { if (x == 0) return ~0U; int exp = 31; while (((x >> exp) & 1) == 0 && exp > 0) exp--; uint32_t base = rsqrt_table[exp]; uint32_t next = rsqrt_table[exp + 1]; uint32_t fraction = (x - (1u << exp)) >> exp; uint32_t y = base - ((base - next) * fraction >> 16); newton_step(&y, x); newton_step(&y, x); return y; } uint32_t fast_distance_3d(int32_t dx, int32_t dy, int32_t dz) { uint64_t dist_sq = (uint64_t)dx * dx + (uint64_t)dy * dy + (uint64_t)dz * dz; if (dist_sq > 0xFFFFFFFF) dist_sq >>= 16; uint32_t inv_dist = fast_rsqrt((uint32_t)dist_sq); uint64_t dist = ((uint64_t)dist_sq * inv_dist) >> 16; return (uint32_t)dist; } int main(void) { uint64_t start, end; // Start cycle counter start = read_cycle(); int32_t dx = 3, dy = 4, dz = 12; uint32_t dist = fast_distance_3d(dx, dy, dz); // End cycle counter end = read_cycle(); printf("Result: %u\n", dist); printf("CSR cycles: %llu\n", (unsigned long long)(end - start)); return 0; } ``` **Compilation and Testing** ``` riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -O2 -o q3c_perf.elf q3c_perf.c riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Ofast -o q3c_perf_fast.elf q3c_perf.c riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Os -o q3c_perf_small.elf q3c_perf.c ``` ![image](https://hackmd.io/_uploads/SksuRoekbx.png) | Version | Compiler Flag | CSR Cycles | Observation | | :------- | :------------ | :--------- | :-------------------------------------------- | | Baseline | `-O2` | **1247** | Normal optimization | | Fast | `-Ofast` | **1247** | Same result – already optimal | | Small | `-Os` | **1251** | Slightly larger due to extra branch alignment | **Analysis** Cycle count stability : The nearly identical CSR cycles across optimization levels show that the computation (3² + 4² + 12²) is extremely lightweight, and the compiler already applies constant folding. Minimal instruction pipeline : Because fast_rsqrt() and fast_distance_3d() contain very few loops and simple shifts (slli, srli), additional optimizations such as loop unrolling or peephole tuning offer negligible gains. CSR cycle measurement correctness : The stable cycle counts (~1247–1251) confirm that rdcycle correctly measures instruction execution inside rv32emu’s system emulation. **Optimization Discussion** To further optimize beyond GCC output, possible assembly-level techniques include: -Inlining Newton iteration to reduce function call overhead. -Replacing multiply/divide by shifts (already performed by GCC). -Removing redundant loads/stores if memory layout is simplified. However, in this particular example, GCC’s generated RV32I code is already near minimal, and manual tuning shows no meaningful cycle improvement.