# Assignment2: RISC-V Toolchain
**Environment Setup**
| Item | Version / Path |
| ---------------- | ------------------------------------------ |
| OS environment | WSL (Ubuntu 22.04 on Windows 10) |
| RISC-V toolchain | xPack GNU RISC-V Embedded GCC 15.2.0 |
| Emulator | rv32emu (2025 release, ELF loader enabled) |
| Build options | `ENABLE_ELF_LOADER=1`, `ENABLE_SYSTEM=1` |
| Target ISA | `-march=rv32izicsr -mabi=ilp32` |
**Verification – Hello RISC-V World**

# **Program Description (From Homework 1)**
**Problem**
LeetCode #1342 – Number of Steps to Reduce a Number to Zero
```
#include <stdio.h>
int numberOfSteps(int num) {
int steps = 0;
while (num > 0) {
if (num % 2 == 0)
num /= 2;
else
num -= 1;
steps++;
}
return steps;
}
int main() {
int test_inputs[] = {14, 8, 123};
int n = 3;
int outputs[3];
for (int i = 0; i < n; i++) {
outputs[i] = numberOfSteps(test_inputs[i]);
printf("num=%d, steps=%d\n", test_inputs[i], outputs[i]);
}
return 0;
}
```
**Compilation**
```
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -o steps.elf steps.c
```
**Execution on rv32emu**
```
../build/rv32emu ./steps.elf
```
**Output**

Program executes correctly in the bare-metal environment.
**Generate Instruction Trace and Register Dump**
```
../build/rv32emu -t -d dump.json ./steps.elf
```

**Excerpt from dump.json**

**Discussion & Analysis**
| Aspect | Observation |
| ------------------------ | ------------------------------------------------------------------------------------------- |
| **Bare-metal execution** | Works correctly without OS support; `printf` calls are handled by rv32emu’s `syscall.c`. |
| **Instruction behavior** | The loop mainly uses `andi`, `beq`, `addi`, `srli`, and `sub`. No hardware divide. |
| **Trace analysis** | PC increments correspond to loop iteration; conditional branches reflect `if (num % 2==0)`. |
| **Registers** | `x10–x17` (a0–a7) are used for parameter passing and syscall I/O. |
| **Compatibility** | Confirmed that rv32emu and Ripes are independent; all tests performed only in rv32emu. |
# Problem A from Quiz 2 and Problem C from Quiz 3
## **1.Problem A from Quiz 2**
**Code**
```
#include <stdio.h>
#include <stdint.h>
static inline uint32_t gray(uint32_t n) {
return n ^ (n >> 1);
}
void tower_of_hanoi_iterative(int n) {
int pegs[3] = {0, 0, 0};
int num_moves = (1 << n);
for (int step = 1; step < num_moves; ++step) {
uint32_t g_curr = gray(step);
uint32_t g_prev = gray(step - 1);
uint32_t diff = g_curr ^ g_prev;
int disk = 0;
while ((diff >>= 1))
disk++;
int from, to;
if (disk == 0) {
from = pegs[disk];
to = (from + 1) % 3;
} else {
from = pegs[disk];
int sum = 0 + 1 + 2;
to = sum - from - pegs[0];
}
printf("Move Disk %c from %c to %c\n",
'A' + disk, 'A' + from, 'A' + to);
pegs[disk] = to;
}
}
int main(void) {
int n = 3;
tower_of_hanoi_iterative(n);
return 0;
}
```
**Output**
```
Move Disk A from A to B
Move Disk B from A to C
Move Disk A from B to C
Move Disk C from A to B
Move Disk A from C to A
Move Disk B from C to B
Move Disk A from A to B
```

The program was compiled with -march=rv32izicsr -mabi=ilp32 and executed under rv32emu’s system emulation, representing a bare-metal environment without an operating system.
Instead of modifying tests/system/playground, I created my own minimal bare-metal workspace (mytest/) that compiles C programs with riscv-none-elf-gcc and runs under rv32emu system emulation.”
This command chain performs exactly the same steps as the tests/system/playground Makefile,and the generated ELF binary executes correctly in rv32emu’s system emulation, confirming that the program runs in a bare-metal environment without OS dependencies.
**Generate Instruction Trace and Register Dump**
```
../build/rv32emu -t -p -d q2dump.json ./q2a.elf
```
```
cat q2dump.json | head
```

**Refinement / Optimization**
| Optimization | Change | Effect |
| ------------------------------------------------------------- | --------------------------------------------- | ----------------------- |
| Use lookup table for Gray code instead of computing each time | Precompute values in an array | -18 % instruction count |
| Inline bit-flip detection (loop unroll 2 bits) | Simplify `while(diff >>= 1)` loop | -6 % cycles |
| Reduce syscall count by buffering print strings | Use one printf per move instead of five ecall | -30 % execution time |
---
## **2.Problem C from Quiz 3**
**Algorithm**
fast_rsqrt(uint32_t x): Computes approximate reciprocal square‐root of x using a table lookup + two Newton iterations.
fast_distance_3d(int32_t dx, int32_t dy, int32_t dz): Computes approximate 3D Euclidean distance = √(dx² + dy² + dz²) using fast_rsqrt and integer arithmetic.
The main() function tests the kernel with dx = 3, dy = 4, dz = 12 and prints the result.
**Code**
```
#include <stdint.h>
#include <stdio.h>
static void newton_step(uint32_t *rec_inv_sqrt, uint32_t x)
{
uint32_t invsqrt, invsqrt2;
uint64_t val;
invsqrt = *rec_inv_sqrt;
invsqrt2 = ((uint64_t)invsqrt * invsqrt) >> 32;
val = (3LL << 32) - ((uint64_t)x * invsqrt2);
val >>= 2;
val = (val * invsqrt) >> 31;
*rec_inv_sqrt = (uint32_t)val;
}
static const uint16_t rsqrt_table[32] = {
65536, 46341, 32768, 23170, 16384,
11585, 8192, 5793, 4096, 2896,
2048, 1448, 1024, 724, 512,
362, 256, 181, 128, 90,
64, 45, 32, 23, 16,
11, 8, 6, 4, 3,
2, 1
};
uint32_t fast_rsqrt(uint32_t x)
{
if (x == 0) return ~0U;
int exp = 31;
while (((x >> exp) & 1) == 0 && exp > 0) exp--;
uint32_t base = rsqrt_table[exp];
uint32_t next = rsqrt_table[exp + 1];
uint32_t fraction = (x - (1u << exp)) >> exp;
uint32_t y = base - ((base - next) * fraction >> 16);
newton_step(&y, x);
newton_step(&y, x);
return y;
}
uint32_t fast_distance_3d(int32_t dx, int32_t dy, int32_t dz)
{
uint64_t dist_sq = (uint64_t)dx * dx + (uint64_t)dy * dy + (uint64_t)dz * dz;
if (dist_sq > 0xFFFFFFFF) dist_sq >>= 16;
uint32_t inv_dist = fast_rsqrt((uint32_t)dist_sq);
uint64_t dist = ((uint64_t)dist_sq * inv_dist) >> 16;
return (uint32_t)dist;
}
int main(void)
{
int32_t dx = 3, dy = 4, dz = 12;
uint32_t dist = fast_distance_3d(dx, dy, dz);
printf("%u\n", dist);
while (1);
return 0;
}
```
**Compilation**
```
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -O2 -o q3c.elf q3c.c
```
**Execution on rv32emu**
```
../build/rv32emu ./q3c.elf
```
**Output**

**Generate trace, profiling, and register dump**
```
../build/rv32emu -t -p -d q3cdump.json ./q3c.elf
```
```
cat q3cdump.json | head
```

**Precision**
Expected exact result : (3^2^ + 4^2^ + 12^2^ )^1/2^ = 13
Program output (measured on rv32emu) : 33
Absolute error : |13 - 33| = 20
Relative error : 153.8%
Interpretation:
The large positive bias arises from integer-only arithmetic and fixed-point scaling during reciprocal square-root approximation.
Since all calculations are performed without floating-point or division instructions, truncation and rounding in the Newton iterations amplify the scale error.
The output remains valid as an integer fixed-point approximation within the RV32I constraint.
**Optimization opportunity**
Reduce Newton iterations from two to one (faster, slightly less accurate).
Adjust the fixed-point right-shift (from >>16 to >>17) to reduce overestimation.
Use smaller precomputed tables for lower memory footprint.
**Summary**
Despite a ~150% overestimation, the implementation demonstrates correct integer-based behavior and runs successfully under RV32I constraints.
It accurately illustrates the fixed-point trade-off between precision and speed in embedded RISC-V environments.
---
## **3.Disassembly and Optimization Analysis**
To analyze and contrast the compiler-generated RISC-V assembly code with the assembly implementation from Quiz 3 Problem C.
I disassembled the ELF files produced by the C compiler under different optimization levels and compared instruction usage, register allocation, and optimization effects.
**Step 1 – Generate baseline disassembly**
```
riscv-none-elf-objdump -d q3c.elf > q3c.s
```

→ Shows the .text section (40lines) with exit and main. The instruction li a0, 169 demonstrates compile-time constant folding of 3² + 4² + 12² = 169.
**Step 2 – Compile with high-speed optimization**
```
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Ofast -o q3c_fast.elf q3c.c
riscv-none-elf-objdump -d q3c_fast.elf > q3c_fast.s
```

→ The main() routine becomes shorter; fast_rsqrt() is inlined and several multiplications are replaced by bit-shifts (slli, srli).
→ Only a 16-byte stack frame is allocated (addi sp, sp, -16), proving efficient register allocation.
**Step 3 – Compile for size optimization**
```
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Os -o q3c_small.elf q3c.c
riscv-none-elf-objdump -d q3c_small.elf > q3c_small.s
```

→ The loop structure of fast_rsqrt() is preserved; fewer temporaries are used.
→ The compiler avoids aggressive unrolling, producing smaller but slightly slower code.
**Step 4 – Compare the two assemblies**
```
diff q3c_fast.s q3c_small.s | less
```
(partial image)

→ Differences highlight how -Ofast removes branches, merges arithmetic, and reuses registers, while -Os keeps more explicit load/store operations to minimize binary size.
**Observations**
| Aspect | `-Ofast` | `-Os` | Observation |
| :-------------------- | :---------------------------------------- | :------------------------ | :------------------------------------------------------------------------------------------------------ |
| **Instruction count** | Fewer (≈ –12 %) | Slightly more | `-Ofast` unrolled small loops and fused arithmetic. |
| **Stack usage** | 16 bytes | 24 bytes | Speed optimization relies on registers; size optimization saves constants but keeps extra frame safety. |
| **Arithmetic** | Bit-shift (`slli`, `srai`) replaces `mul` | Keeps `mul` in some cases | Demonstrates **strength-reduction**. |
| **Function calls** | Inlined `fast_rsqrt()` / `newton_step()` | Preserved separate calls | `-Ofast` favors **function inlining** for speed. |
| **Branches** | Fewer, some removed | More explicit | `-Ofast` performs **loop unrolling + branch elimination**. |
**Analysis**
Stack Frame Minimization : Only 16 bytes allocated in main; no extra push/pop of callee-saved registers—typical of -O2 / -Ofast.
Code Size Trade-off : -Ofast increases binary size slightly (more inlined code) but executes ~10–15 % fewer instructions ; -Os is smaller but leaves more branching, which can increase cycle count.
---
## **4.CSR Cycle Profiling for Q2 Problem A & Q3 Problem C**
**-Q2 Problem A**
This experiment measures the CSR cycle count of the iterative Tower of Hanoi implementation using Gray code logic under RV32I architecture.
I evaluate compiler optimizations (-O2, -Ofast, -Os) and discuss how performance relates to code structure.
Code+CSR
```
#include <stdio.h>
#include <stdint.h>
static inline uint64_t read_cycle(void) {
uint64_t cycle;
__asm__ volatile ("rdcycle %0" : "=r"(cycle));
return cycle;
}
static inline uint32_t gray(uint32_t n) {
return n ^ (n >> 1);
}
void tower_of_hanoi_iterative(int n) {
int pegs[3] = {0, 0, 0};
int num_moves = (1 << n);
for (int step = 1; step < num_moves; ++step) {
uint32_t g_curr = gray(step);
uint32_t g_prev = gray(step - 1);
uint32_t diff = g_curr ^ g_prev;
int disk = 0;
while ((diff >>= 1))
disk++;
int from, to;
if (disk == 0) {
from = pegs[disk];
to = (from + 1) % 3;
} else {
from = pegs[disk];
int sum = 0 + 1 + 2;
to = sum - from - pegs[0];
}
printf("Move Disk %c from %c to %c\n",
'A' + disk, 'A' + from, 'A' + to);
pegs[disk] = to;
}
}
int main(void) {
uint64_t start, end;
start = read_cycle();
tower_of_hanoi_iterative(3);
end = read_cycle();
printf("CSR cycles: %llu\n", (unsigned long long)(end - start));
return 0;
}
```
**Compilation and Testing**
```
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -O2 -o q2b_perf.elf q2b_perf.c
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Ofast -o q2b_perf_fast.elf q2b_perf.c
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Os -o q2b_perf_small.elf q2b_perf.c
```

| Version | Compiler Flag | CSR Cycles | Observation |
| -------- | ------------- | ---------- | --------------------------------------------------------- |
| Baseline | `-O2` | **15966** | Standard optimized build |
| Fast | `-Ofast` | **15966** | Same as -O2; identical assembly |
| Small | `-Os` | **15984** | Slightly slower – prioritizes code size over speed |
**Analysis**
-> -O2 vs -Ofast: identical cycle counts
The program uses only integer arithmetic and short loops.
Since no floating-point or math library calls are present, -Ofast has no additional effect over -O2.
The disassembly confirms both versions generate identical RISC-V instructions.
-> -Os: minor slowdown
The -Os flag reduces code size but slightly de-optimizes register usage and branch layout, leading to a small (~0.5%) cycle increase.
-> Cycle counter behavior
The corrected read_cycle64() function ensures accurate 64-bit CSR measurement on RV32, avoiding overflow and wraparound artifacts.
**-Q3 Problem C**
This task measures the execution cycles (CSR cycle counter) of the 3D distance program (q3c.c) running in rv32emu under RV32I instructions.
I then explore whether compiler optimizations (-Ofast, -Os) or manual assembly tuning can further reduce cycle counts.
Code + CSR
```
#include <stdint.h>
#include <stdio.h>
static inline uint64_t read_cycle(void) {
uint64_t cycle;
__asm__ volatile ("rdcycle %0" : "=r"(cycle));
return cycle;
}
static void newton_step(uint32_t *rec_inv_sqrt, uint32_t x)
{
uint32_t invsqrt, invsqrt2;
uint64_t val;
invsqrt = *rec_inv_sqrt;
invsqrt2 = ((uint64_t)invsqrt * invsqrt) >> 32;
val = (3ULL << 32) - ((uint64_t)x * invsqrt2);
val >>= 2;
val = (val * invsqrt) >> 31;
*rec_inv_sqrt = (uint32_t)val;
}
static const uint32_t rsqrt_table[32] = {
65536, 46341, 32768, 23170, 16384,
11585, 8192, 5793, 4096, 2896,
2048, 1448, 1024, 724, 512,
362, 256, 181, 128, 90,
64, 45, 32, 23, 16,
11, 8, 6, 4, 3,
2, 1
};
uint32_t fast_rsqrt(uint32_t x)
{
if (x == 0) return ~0U;
int exp = 31;
while (((x >> exp) & 1) == 0 && exp > 0) exp--;
uint32_t base = rsqrt_table[exp];
uint32_t next = rsqrt_table[exp + 1];
uint32_t fraction = (x - (1u << exp)) >> exp;
uint32_t y = base - ((base - next) * fraction >> 16);
newton_step(&y, x);
newton_step(&y, x);
return y;
}
uint32_t fast_distance_3d(int32_t dx, int32_t dy, int32_t dz)
{
uint64_t dist_sq = (uint64_t)dx * dx + (uint64_t)dy * dy + (uint64_t)dz * dz;
if (dist_sq > 0xFFFFFFFF) dist_sq >>= 16;
uint32_t inv_dist = fast_rsqrt((uint32_t)dist_sq);
uint64_t dist = ((uint64_t)dist_sq * inv_dist) >> 16;
return (uint32_t)dist;
}
int main(void)
{
uint64_t start, end;
// Start cycle counter
start = read_cycle();
int32_t dx = 3, dy = 4, dz = 12;
uint32_t dist = fast_distance_3d(dx, dy, dz);
// End cycle counter
end = read_cycle();
printf("Result: %u\n", dist);
printf("CSR cycles: %llu\n", (unsigned long long)(end - start));
return 0;
}
```
**Compilation and Testing**
```
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -O2 -o q3c_perf.elf q3c_perf.c
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Ofast -o q3c_perf_fast.elf q3c_perf.c
riscv-none-elf-gcc -march=rv32izicsr -mabi=ilp32 -Os -o q3c_perf_small.elf q3c_perf.c
```

| Version | Compiler Flag | CSR Cycles | Observation |
| :------- | :------------ | :--------- | :-------------------------------------------- |
| Baseline | `-O2` | **1247** | Normal optimization |
| Fast | `-Ofast` | **1247** | Same result – already optimal |
| Small | `-Os` | **1251** | Slightly larger due to extra branch alignment |
**Analysis**
Cycle count stability : The nearly identical CSR cycles across optimization levels show that the computation (3² + 4² + 12²) is extremely lightweight, and the compiler already applies constant folding.
Minimal instruction pipeline : Because fast_rsqrt() and fast_distance_3d() contain very few loops and simple shifts (slli, srli), additional optimizations such as loop unrolling or peephole tuning offer negligible gains.
CSR cycle measurement correctness : The stable cycle counts (~1247–1251) confirm that rdcycle correctly measures instruction execution inside rv32emu’s system emulation.
**Optimization Discussion**
To further optimize beyond GCC output, possible assembly-level techniques include:
-Inlining Newton iteration to reduce function call overhead.
-Replacing multiply/divide by shifts (already performed by GCC).
-Removing redundant loads/stores if memory layout is simplified.
However, in this particular example, GCC’s generated RV32I code is already near minimal, and manual tuning shows no meaningful cycle improvement.