# Assignment2: Complete Applications
contributed by [< rara0857 >](https://github.com/rara0857/arch2025-homework2)
::: warning
**Disclaimer:** I used AI tools such as ChatGPT / Gemini to better understand algorithm and architecture descriptions.
:::
## RV32I emulator's system emulation
In this assignment, we are required to complete the code and run it in a **bare-metal environment** using **rv32emu’s system emulation**.
Since we run on a bare-metal environment, standard library functions are not available, and direct `ecall` printing is also restricted.
To output results, we will need to implement a custom print routine in `main.c` that works with rv32emu's system emulation environment.
I started from the sample code at `tests/system/playground` and modified it.
here is the following structure
:::success
```text
├── Makefile
├── chacha20_asm.S
├── linker.ld
├── main.c
├── perfcounter.S
└── start.S
```
:::

The **`linker.ld`** file defines how different program sections (`.text`, `.data`, `.bss`, `.stack`) are placed in memory.
### Note on Implementation
- **Printing:** Implemented in C, because bare-metal environment does not support standard `ecall` printing.
- **Functional Routines:** Implemented in Assembly (ASM) for performance optimization and instruction reduction.
---
## Problem B in Homework 1
### Descripition
Assume uf8 implements a logarithmic 8-bit codec that maps 20-bit unsigned integers $([0,1{,}015{,}792])$ to 8-bit symbols via logarithmic quantization, delivering 2.5:1 compression and ≤6.25% relative error.
**UF8_Decode:**
$$
D(b) = m \cdot 2^e + (2^e - 1) \cdot 16
$$
**UF8_Encode**
$$
E(v) =
\begin{cases}
v, & \text{if } v < 16 \\
16e + \lfloor (v - \text{offset}(e)) / 2^e \rfloor, & \text{otherwise}
\end{cases}
$$
---
**File structure:**
:::success
```text
├── Makefile
├── linker.ld
├── main.c
├── perfcounter.S
├── q1-uf8.S
├── q1-uf8_opt.S
├── q1-uf8_origin.c
└── start.S
:::
Before running the `Makefile`, we need to make the following modifications:
```diff
- OBJS = start.o perfcounter.o chacha20_asm.o main.o
+ OBJS = start.o perfcounter.o q1-uf8.o main.o
- CFLAGS = -g $(ARCH)
+ CFLAGS = -g $(ARCH) -Ofast // add optimization level (OS/O0/O1/O2/O3/Ofast)
```
After running `make run`, the output is as follows:
**Console output:**
```console.log
=== UF8 Codec===
UF8 Codec: ALL PASSED
Cycles: 34752
Instructions: 34752
=== All Tests Completed ===
```
**CLZ optimization:**
To reduce instructions, we optimized the CLZ routine by eliminating all loops and branches.
It now uses a fixed sequence of right shifts with `snez` masks, achieving **constant-time execution** while slightly increasing code size.
```s=
clz:
li t0, 32 # n = 32
# 16-bit
srli t2, a0, 16 # y = x >> 16
snez t3, t2 # y != 0
slli t1, t3, 4 # t1 = 16 or 0
sub t0, t0, t1 # n -= t1
srl a0, a0, t1 # x >>= t1
# 8-bit
srli t2, a0, 8
snez t3, t2
slli t1, t3, 3 # t1 = 8 or 0
sub t0, t0, t1
srl a0, a0, t1
# 4-bit
srli t2, a0, 4
snez t3, t2
slli t1, t3, 2 # t1 = 4 or 0
sub t0, t0, t1
srl a0, a0, t1
# 2-bit
srli t2, a0, 2
snez t3, t2
slli t1, t3, 1 # t1 = 2 or 0
sub t0, t0, t1
srl a0, a0, t1
# 1-bit
srli t2, a0, 1
snez t3, t2
sub t0, t0, t3 # n -= 1
srl a0, a0, t3
sub a0, t0, a0 # return n - x
ret
```
**Instructions Comparison:**
| | C Code | ASM | CLZ opt. |
|:----------------:|:------:|:-----:|:--------:|
| **Instructions** | 71316 | 34752 | 34372 |
**Size Comparision:**
`riscv-none-elf-size test.elf`
| | text | data | bss | dec | hex |
|:------------:|:----:|:----:|:----:|:----:|:----:|
| **C Code** | 3234 | 0 | 4110 | 7344 | 1cb0 |
| **ASM** | 2874 | 0 | 4102 | 6976 | 1b40 |
| **CLZ opt.** | 2942 | 0 | 4098 | 7040 | 1b80 |
## Q2_ProblemA
### Descripition
The **Tower of Hanoi** is a classic recursive problem that involves moving a stack of disks from one peg to another, following these rules:
* Only one disk can be moved at a time.
* A larger disk can never be placed on top of a smaller disk.
* All disks must eventually be moved from the source peg to the destination peg using an auxiliary peg.
### Method
To solve the problem, we can seperate into following steps:
1. Recursively Move N-1 disk from source to Auxiliary peg.
2. Move the last disk from source to destination.
3. Recursively Move N-1 disk from Auxiliary to destination peg.

**File structure:**
:::success
```text
├── Makefile
├── linker.ld
├── main.c
├── perfcounter.S
├── q2_hanoi.S
└── start.S
:::
**Console output:**
```console.log
=== Q2_A Test ===
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Hanoi Test Completed.
Cycles: 9628
Instructions: 9628
=== All Tests Completed ===
```
To reduce instructions, we can use optimization flags.
However, before enabling optimizations, to prevent print statements from being removed, we should modify `printstr` as follows:
```diff
#define printstr(ptr, length) \
do { \
asm volatile( \
"add a7, x0, 0x40;" \
"add a0, x0, 0x1;" /* stdout */ \
"add a1, x0, %0;" \
"mv a2, %1;" /* length character */ \
"ecall;" \
: \
: "r"(ptr), "r"(length) \
- : "a0", "a1", "a2", "a7");
+ : "a0", "a1", "a2", "a7","memory"); \
} while (0)
```
**Instructions Comparison:**
| Optimization | -O0 | -O1 | -O2 | -O3 / -Ofast |
|:------------:|:----:|:----:|:----:|:------------:|
| Instructions | 9628 | 4208 | 4404 | 4404 |
**Size Comparision:**
`riscv-none-elf-size test.elf`
| | text | data | bss | dec | hex |
|:----------:|:----:|:----:|:----:|:----:|:----:|
| -O0 | 2570 | 3 | 4099 | 6672 | 1a10 |
| -O1 | 1626 | 3 | 4099 | 5728 | 1660 |
| -O2 | 1606 | 3 | 4103 | 5712 | 1650 |
| -O3/-Ofast | 1606 | 3 | 4103 | 5712 | 1650 |
## Q3_ProblemC
### Descripition
`rsqrt` provides a fast reciprocal square root implementation optimized for RV32I.
* No hardware multiplier: Uses shift-add for 32×32 → 64-bit multiplication
* LUT + Newton iterations: Balances precision and performance
* Precision: ~3-8% error
Newton's method is an algorithm to compute the root of a function.

### Method
1. **Initial estimate (CLZ + lookup table)**
Use `clz(x)` to find the exponent range of the input, then fetch two nearby entries from the LUT.
A small linear interpolation gives a closer starting value for the reciprocal square root.
2. **Newton refinement**
Apply one Newton iteration:
$$
y_{\text{new}} = y \left( 1.5 - 0.5xy^2 \right)
$$ All operations are done with fixed-point shift–add math since RV32I has no hardware multiplier.
3. **Return fixed-point result**
Output the final value in Q16 format.
This keeps the computation fast while keeping the error roughly within 3–8%.
**File structure:**
:::success
```text
├── Makefile
├── linker.ld
├── main.c
├── perfcounter.S
├── q3-rsqrt-origin.c
├── q3-rsqrt.S
└── start.S
:::
:::spoiler Assembly Code
```s
.data
.align 2
rsqrt_table:
.word 65536, 46341, 32768, 23170, 16384
.word 11585, 8192, 5793, 4096, 2896
.word 2048, 1448, 1024, 724, 512
.word 362, 256, 181, 128, 90
.word 64, 45, 32, 23, 16
.word 11, 8, 6, 4, 3
.word 2, 1
.text
.globl clz
clz:
li t0, 32 # n = 32
# 16-bit
srli t2, a0, 16 # y = x >> 16
snez t3, t2 # y != 0
slli t1, t3, 4 # t1 = 16 or 0
sub t0, t0, t1 # n -= t1
srl a0, a0, t1 # x >>= t1
# 8-bit
srli t2, a0, 8
snez t3, t2
slli t1, t3, 3 # t1 = 8 or 0
sub t0, t0, t1
srl a0, a0, t1
# 4-bit
srli t2, a0, 4
snez t3, t2
slli t1, t3, 2 # t1 = 4 or 0
sub t0, t0, t1
srl a0, a0, t1
# 2-bit
srli t2, a0, 2
snez t3, t2
slli t1, t3, 1 # t1 = 2 or 0
sub t0, t0, t1
srl a0, a0, t1
# 1-bit
srli t2, a0, 1
snez t3, t2
sub t0, t0, t3 # n -= 1
srl a0, a0, t3
sub a0, t0, a0 # return n - x
ret
.globl __muldi3
__muldi3:
mv t0, a0 # a_lo
mv t1, a1 # a_hi
mv t2, a2 # b_lo
mv t3, a3 # b_hi
li a0, 0 # res_lo
li a1, 0 # res_hi
or t4, t2, t3
beqz t4, .L_muldi3_end # if (b == 0) skip loop
.L_muldi3_loop:
andi t5, t2, 1
beqz t5, .L_muldi3_skip_add
# res = res + a
add a0, a0, t0
sltu t5, a0, t0
add a1, a1, t1
add a1, a1, t5
.L_muldi3_skip_add:
# b >>= 1 (64-bit shift)
srli t2, t2, 1
slli t5, t3, 31
or t2, t2, t5
srli t3, t3, 1
# a <<= 1 (64-bit shift)
mv t5, t0
add t0, t0, t0
sltu t5, t0, t5
add t1, t1, t1
add t1, t1, t5
or t4, t2, t3
bnez t4, .L_muldi3_loop # continue if (b > 0)
.L_muldi3_end:
ret
.globl __lshrdi3
__lshrdi3:
andi a2, a2, 63
beqz a2, .L_lshr_ret
li t0, 32
blt a2, t0, .L_lshr_lt32
# b >= 32
sub a2, a2, t0
srl a0, a1, a2
li a1, 0
j .L_lshr_ret
.L_lshr_lt32:
# b < 32
sub t1, t0, a2 # t1 = 32 - b
sll t2, a1, t1 # t2 = in.hi << (32 - b)
srl t1, a0, a2 # t1 = in.lo >> b
srl a1, a1, a2 # out.hi = in.hi >> b
or a0, t2, t1 # out.lo = t2 | t1
.L_lshr_ret:
ret
.globl fast_rsqrt
fast_rsqrt:
addi sp, sp, -44
sw ra, 40(sp)
sw s0, 36(sp) # s0 = x
sw s1, 32(sp) # s1 = exp
sw s2, 28(sp) # s2 = y_base
sw s3, 24(sp) # s3 = y
sw s4, 20(sp) # s4 = y_next
sw s6, 16(sp) # s6 = y64_lo
sw s7, 12(sp) # s7 = y64_hi
sw s8, 8(sp) # s8 = term_lo
sw s9, 4(sp) # s9 = term_hi
sw s10, 0(sp) # s10 = tmp
mv s0, a0
bnez s0, .L_frsqrt_not_zero
li a0, -1
j .L_frsqrt_epilogue
.L_frsqrt_not_zero:
mv a0, s0
call clz
li t0, 31
sub s1, t0, a0
li t0, 32
bne s1, t0, .L_frsqrt_exp_ok
li s1, 31
.L_frsqrt_exp_ok:
la t0, rsqrt_table
slli t1, s1, 2
add t0, t0, t1
lw s2, 0(t0)
li t0, 31
bne s1, t0, .L_frsqrt_interp
mv s3, s2
j .L_frsqrt_newton
.L_frsqrt_interp:
la t0, rsqrt_table
addi t1, s1, 1
slli t1, t1, 2
add t0, t0, t1
lw s4, 0(t0)
# power_of_2 = 1U << exp
li t0, 1
sll t3, t0, s1 # t3 = power_of_2
# (uint64_t)x - power_of_2
sub a0, s0, t3
slt a1, s0, t3
neg a1, a1
# << 16
slli t0, a0, 16 # new_lo
srli t1, a0, 16 # new_hi part 1
slli t2, a1, 16 # new_hi part 2
or t1, t1, t2 # new_hi
mv a0, t0
mv a1, t1
mv a2, s1
call __lshrdi3
mv s10, a0 # s10 = fraction_q16 (low)
# delta_y = (uint64_t)y_base - y_next;
sub t0, s2, s4
li t1, 0
# (delta_y * fraction_q16)
mv a0, t0
mv a1, t1
mv a2, s10
li a3, 0
call __muldi3
# >> 16
srli t0, a0, 16 # new_lo part 1
slli t1, a1, 16 # new_lo part 2
or a0, t0, t1 # new_lo
srli a1, a1, 16 # new_hi
# y = y_base - (uint32_t)result
sub s3, s2, a0
.L_frsqrt_newton:
mv s6, s3
li s7, 0
# y2 = y64 * y64
mv a0, s6
mv a1, s7
mv a2, s6
mv a3, s7
call __muldi3
sw a0, 16(sp) # s6 (tmp) = y2_lo
sw a1, 12(sp) # s7 (tmp) = y2_hi
# xy2 = ((uint64_t)x * y2)
mv a0, s0
li a1, 0
lw a2, 16(sp)
lw a3, 12(sp)
call __muldi3
# >> 16
srli t0, a0, 16
slli t1, a1, 16
or a0, t0, t1
srli a1, a1, 16
# term = (3U << 16) - (uint32_t)xy2;
li s8, 196608
li s9, 0
sub a0, s8, a0
sltu t0, s8, a0
sub a1, s9, t0
mv s8, a0
mv s9, a1
# y = (uint32_t)((y64 * term) >> 17);
mv a0, s3
li a1, 0
mv a2, s8
mv a3, s9
call __muldi3
# >> 17
srli t0, a0, 17 # new_lo part 1
slli t1, a1, 15 # new_lo part 2 (32-17)
or a0, t0, t1 # new_lo
srli a1, a1, 17 # new_hi
.L_frsqrt_epilogue:
lw ra, 40(sp)
lw s0, 36(sp)
lw s1, 32(sp)
lw s2, 28(sp)
lw s3, 24(sp)
lw s4, 20(sp)
lw s6, 16(sp)
lw s7, 12(sp)
lw s8, 8(sp)
lw s9, 4(sp)
lw s10, 0(sp)
addi sp, sp, 44
ret
.globl newton_step
newton_step:
addi sp, sp, -20
sw ra, 16(sp)
sw s0, 12(sp) # s0 = *rec_inv_sqrt
sw s1, 8(sp) # s1 = x
sw s2, 4(sp) # s2 = invsqrt
sw s3, 0(sp) # s3 = invsqrt2
mv s0, a0
mv s1, a1
lw s2, 0(s0)
# invsqrt2 = ((uint64_t)invsqrt * invsqrt) >> 32;
mv a0, s2
li a1, 0
mv a2, s2
li a3, 0
call __muldi3
mv s3, a1
# val = (3LL << 32) - ((uint64_t)x * invsqrt2);
mv a0, s1
li a1, 0
mv a2, s3
srai a3, s3, 31
call __muldi3
li t0, 0
li t1, 3
sub a0, t0, a0
sltu t2, t0, a0
sub a1, t1, a1
sub a1, a1, t2
# val >>= 2
srli t0, a0, 2
slli t1, a1, 30 # 32-2
or a0, t0, t1
srli a1, a1, 2
# val = (val * invsqrt) >> 31
mv t0, a0
mv t1, a1
mv a0, s2
li a1, 0
mv a2, t0
mv a3, t1
call __muldi3
# >> 31
srli t0, a0, 31
slli t1, a1, 1 # 32-31
or a0, t0, t1
srli a1, a1, 31
sw a0, 0(s0)
lw ra, 16(sp)
lw s0, 12(sp)
lw s1, 8(sp)
lw s2, 4(sp)
lw s3, 0(sp)
addi sp, sp, 20
ret
.globl fast_distance_3d
fast_distance_3d:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 20(sp)
sw s1, 16(sp)
sw s2, 12(sp)
sw s3, 8(sp)
sw s4, 4(sp)
sw s5, 0(sp)
mv s0, a0
mv s1, a1
mv s2, a2
# dx*dx
mv a0, s0
srai a1, s0, 31
mv a2, s0
srai a3, s0, 31
call __muldi3
mv s3, a0
mv s4, a1
# dy*dy
mv a0, s1
srai a1, s1, 31
mv a2, s1
srai a3, s1, 31
call __muldi3
add a0, s3, a0
sltu t0, a0, s3
add a1, s4, a1
add a1, a1, t0
mv s3, a0
mv s4, a1
# dz*dz
mv a0, s2
srai a1, s2, 31
mv a2, s2
srai a3, s2, 31
call __muldi3
add a0, s3, a0
sltu t0, a0, s3
add a1, s4, a1
add a1, a1, t0
mv s3, a0
mv s4, a1
bnez s4, .L_fdist_shift
j .L_fdist_no_shift
.L_fdist_shift:
srli t0, s3, 16 # a0 is s3
slli t1, s4, 16 # a1 is s4
or a0, t0, t1
srli a1, s4, 16
mv s3, a0
mv s4, a1
.L_fdist_no_shift:
# inv_dist = fast_rsqrt((uint32_t)dist_sq);
mv a0, s3
call fast_rsqrt
mv s5, a0
# dist = ((uint64_t)dist_sq * inv_dist) >> 16;
mv a0, s3
mv a1, s4
mv a2, s5
li a3, 0
call __muldi3
# >> 16
srli t0, a0, 16
slli t1, a1, 16
or a0, t0, t1
srli a1, a1, 16
lw ra, 24(sp)
lw s0, 20(sp)
lw s1, 16(sp)
lw s2, 12(sp)
lw s3, 8(sp)
lw s4, 4(sp)
lw s5, 0(sp)
addi sp, sp, 28
ret
```
:::
**Console output:**
```console.log
=== Q3_C Test ===
Num: 16
Fast_rsqrt result (Q0.16): 16384
Cycles: 2217
Instructions: 2217
=== All Tests Completed ===
```
**Instructions Comparison:**
Optimization levels tested: `-O0 -O1 -O2 -O3 -Ofast`
| Optimization | -O0 | -O1 | -O2 | -O3 / -Ofast | ASM Version |
|:------------:|:----:|:----:|:----:|:------------:|:-----------:|
| Instructions | 2213 | 1212 | 1599 | 16 | 1021 |
From **-O2** to **-O3**, the instruction count drops from **1595** to only **14**, showing that the optimizer becomes extremely aggressive at higher levels. At this point, most computations are removed or constant-folded.
To prevent this behavior, the variable must be treated as externally visible. Using `volatile` forces the compiler to keep the actual computation:
```diff
- int num = 16;
+ volatile int num = 16;
```
| Optimization | -O0 | -O1 | -O2 | -O3 / -Ofast | ASM Version |
|:------------:|:----:|:----:|:----:|:------------:|:-----------:|
| Instructions | 2213 | 1212 | 1599 | 1084 | 1021 |
**Size Comparision:**
`riscv-none-elf-size test.elf`
| | text | data | bss | dec | hex |
|:----------:|:----:|:----:|:----:|:----:|:----:|
| -O0 | 4634 | 0 | 4102 | 8736 | 2220 |
| -O1 | 2250 | 0 | 4100 | 6350 | 18ce |
| -O2 | 2290 | 0 | 4108 | 6398 | 18fe |
| -O3/-Ofast | 3102 | 0 | 4096 | 7198 | 1c1e |
| ASM | 3210 | 128 | 4100 | 7438 | 1d0e |
---
## Reference
[Tower of Hanoi Approach](https://tutorialhorizon.com/algorithms/towers-of-hanoi/)
[Quiz2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz2-sol)
[Quiz3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz3-sol)