Assignment2: Complete Applications

# Assignment2: Complete Applications contributed by [< rara0857 >](https://github.com/rara0857/arch2025-homework2) ::: warning **Disclaimer:** I used AI tools such as ChatGPT / Gemini to better understand algorithm and architecture descriptions. ::: ## RV32I emulator's system emulation In this assignment, we are required to complete the code and run it in a **bare-metal environment** using **rv32emu’s system emulation**. Since we run on a bare-metal environment, standard library functions are not available, and direct `ecall` printing is also restricted. To output results, we will need to implement a custom print routine in `main.c` that works with rv32emu's system emulation environment. I started from the sample code at `tests/system/playground` and modified it. here is the following structure :::success ```text ├── Makefile ├── chacha20_asm.S ├── linker.ld ├── main.c ├── perfcounter.S └── start.S ``` ::: ![image](https://hackmd.io/_uploads/Hy_0ZXblbe.png =60%x) The **`linker.ld`** file defines how different program sections (`.text`, `.data`, `.bss`, `.stack`) are placed in memory. ### Note on Implementation - **Printing:** Implemented in C, because bare-metal environment does not support standard `ecall` printing. - **Functional Routines:** Implemented in Assembly (ASM) for performance optimization and instruction reduction. --- ## Problem B in Homework 1 ### Descripition Assume uf8 implements a logarithmic 8-bit codec that maps 20-bit unsigned integers $([0,1{,}015{,}792])$ to 8-bit symbols via logarithmic quantization, delivering 2.5:1 compression and ≤6.25% relative error. **UF8_Decode:** $$ D(b) = m \cdot 2^e + (2^e - 1) \cdot 16 $$ **UF8_Encode** $$ E(v) = \begin{cases} v, & \text{if } v < 16 \\ 16e + \lfloor (v - \text{offset}(e)) / 2^e \rfloor, & \text{otherwise} \end{cases} $$ --- **File structure:** :::success ```text ├── Makefile ├── linker.ld ├── main.c ├── perfcounter.S ├── q1-uf8.S ├── q1-uf8_opt.S ├── q1-uf8_origin.c └── start.S ::: Before running the `Makefile`, we need to make the following modifications: ```diff - OBJS = start.o perfcounter.o chacha20_asm.o main.o + OBJS = start.o perfcounter.o q1-uf8.o main.o - CFLAGS = -g $(ARCH) + CFLAGS = -g $(ARCH) -Ofast // add optimization level (OS/O0/O1/O2/O3/Ofast) ``` After running `make run`, the output is as follows: **Console output:** ```console.log === UF8 Codec=== UF8 Codec: ALL PASSED Cycles: 34752 Instructions: 34752 === All Tests Completed === ``` **CLZ optimization:** To reduce instructions, we optimized the CLZ routine by eliminating all loops and branches. It now uses a fixed sequence of right shifts with `snez` masks, achieving **constant-time execution** while slightly increasing code size. ```s= clz: li t0, 32 # n = 32 # 16-bit srli t2, a0, 16 # y = x >> 16 snez t3, t2 # y != 0 slli t1, t3, 4 # t1 = 16 or 0 sub t0, t0, t1 # n -= t1 srl a0, a0, t1 # x >>= t1 # 8-bit srli t2, a0, 8 snez t3, t2 slli t1, t3, 3 # t1 = 8 or 0 sub t0, t0, t1 srl a0, a0, t1 # 4-bit srli t2, a0, 4 snez t3, t2 slli t1, t3, 2 # t1 = 4 or 0 sub t0, t0, t1 srl a0, a0, t1 # 2-bit srli t2, a0, 2 snez t3, t2 slli t1, t3, 1 # t1 = 2 or 0 sub t0, t0, t1 srl a0, a0, t1 # 1-bit srli t2, a0, 1 snez t3, t2 sub t0, t0, t3 # n -= 1 srl a0, a0, t3 sub a0, t0, a0 # return n - x ret ``` **Instructions Comparison:** | | C Code | ASM | CLZ opt. | |:----------------:|:------:|:-----:|:--------:| | **Instructions** | 71316 | 34752 | 34372 | **Size Comparision:** `riscv-none-elf-size test.elf` | | text | data | bss | dec | hex | |:------------:|:----:|:----:|:----:|:----:|:----:| | **C Code** | 3234 | 0 | 4110 | 7344 | 1cb0 | | **ASM** | 2874 | 0 | 4102 | 6976 | 1b40 | | **CLZ opt.** | 2942 | 0 | 4098 | 7040 | 1b80 | ## Q2_ProblemA ### Descripition The **Tower of Hanoi** is a classic recursive problem that involves moving a stack of disks from one peg to another, following these rules: * Only one disk can be moved at a time. * A larger disk can never be placed on top of a smaller disk. * All disks must eventually be moved from the source peg to the destination peg using an auxiliary peg. ### Method To solve the problem, we can seperate into following steps: 1. Recursively Move N-1 disk from source to Auxiliary peg. 2. Move the last disk from source to destination. 3. Recursively Move N-1 disk from Auxiliary to destination peg. ![image](https://hackmd.io/_uploads/HJ3n3fWeZg.png =60%x) **File structure:** :::success ```text ├── Makefile ├── linker.ld ├── main.c ├── perfcounter.S ├── q2_hanoi.S └── start.S ::: **Console output:** ```console.log === Q2_A Test === Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Hanoi Test Completed. Cycles: 9628 Instructions: 9628 === All Tests Completed === ``` To reduce instructions, we can use optimization flags. However, before enabling optimizations, to prevent print statements from being removed, we should modify `printstr` as follows: ```diff #define printstr(ptr, length) \ do { \ asm volatile( \ "add a7, x0, 0x40;" \ "add a0, x0, 0x1;" /* stdout */ \ "add a1, x0, %0;" \ "mv a2, %1;" /* length character */ \ "ecall;" \ : \ : "r"(ptr), "r"(length) \ - : "a0", "a1", "a2", "a7"); + : "a0", "a1", "a2", "a7","memory"); \ } while (0) ``` **Instructions Comparison:** | Optimization | -O0 | -O1 | -O2 | -O3 / -Ofast | |:------------:|:----:|:----:|:----:|:------------:| | Instructions | 9628 | 4208 | 4404 | 4404 | **Size Comparision:** `riscv-none-elf-size test.elf` | | text | data | bss | dec | hex | |:----------:|:----:|:----:|:----:|:----:|:----:| | -O0 | 2570 | 3 | 4099 | 6672 | 1a10 | | -O1 | 1626 | 3 | 4099 | 5728 | 1660 | | -O2 | 1606 | 3 | 4103 | 5712 | 1650 | | -O3/-Ofast | 1606 | 3 | 4103 | 5712 | 1650 | ## Q3_ProblemC ### Descripition `rsqrt` provides a fast reciprocal square root implementation optimized for RV32I. * No hardware multiplier: Uses shift-add for 32×32 → 64-bit multiplication * LUT + Newton iterations: Balances precision and performance * Precision: ~3-8% error Newton's method is an algorithm to compute the root of a function. ![image](https://hackmd.io/_uploads/BylXiRWeZe.png =60%x) ### Method 1. **Initial estimate (CLZ + lookup table)** Use `clz(x)` to find the exponent range of the input, then fetch two nearby entries from the LUT. A small linear interpolation gives a closer starting value for the reciprocal square root. 2. **Newton refinement** Apply one Newton iteration: $$ y_{\text{new}} = y \left( 1.5 - 0.5xy^2 \right) $$ All operations are done with fixed-point shift–add math since RV32I has no hardware multiplier. 3. **Return fixed-point result** Output the final value in Q16 format. This keeps the computation fast while keeping the error roughly within 3–8%. **File structure:** :::success ```text ├── Makefile ├── linker.ld ├── main.c ├── perfcounter.S ├── q3-rsqrt-origin.c ├── q3-rsqrt.S └── start.S ::: :::spoiler Assembly Code ```s .data .align 2 rsqrt_table: .word 65536, 46341, 32768, 23170, 16384 .word 11585, 8192, 5793, 4096, 2896 .word 2048, 1448, 1024, 724, 512 .word 362, 256, 181, 128, 90 .word 64, 45, 32, 23, 16 .word 11, 8, 6, 4, 3 .word 2, 1 .text .globl clz clz: li t0, 32 # n = 32 # 16-bit srli t2, a0, 16 # y = x >> 16 snez t3, t2 # y != 0 slli t1, t3, 4 # t1 = 16 or 0 sub t0, t0, t1 # n -= t1 srl a0, a0, t1 # x >>= t1 # 8-bit srli t2, a0, 8 snez t3, t2 slli t1, t3, 3 # t1 = 8 or 0 sub t0, t0, t1 srl a0, a0, t1 # 4-bit srli t2, a0, 4 snez t3, t2 slli t1, t3, 2 # t1 = 4 or 0 sub t0, t0, t1 srl a0, a0, t1 # 2-bit srli t2, a0, 2 snez t3, t2 slli t1, t3, 1 # t1 = 2 or 0 sub t0, t0, t1 srl a0, a0, t1 # 1-bit srli t2, a0, 1 snez t3, t2 sub t0, t0, t3 # n -= 1 srl a0, a0, t3 sub a0, t0, a0 # return n - x ret .globl __muldi3 __muldi3: mv t0, a0 # a_lo mv t1, a1 # a_hi mv t2, a2 # b_lo mv t3, a3 # b_hi li a0, 0 # res_lo li a1, 0 # res_hi or t4, t2, t3 beqz t4, .L_muldi3_end # if (b == 0) skip loop .L_muldi3_loop: andi t5, t2, 1 beqz t5, .L_muldi3_skip_add # res = res + a add a0, a0, t0 sltu t5, a0, t0 add a1, a1, t1 add a1, a1, t5 .L_muldi3_skip_add: # b >>= 1 (64-bit shift) srli t2, t2, 1 slli t5, t3, 31 or t2, t2, t5 srli t3, t3, 1 # a <<= 1 (64-bit shift) mv t5, t0 add t0, t0, t0 sltu t5, t0, t5 add t1, t1, t1 add t1, t1, t5 or t4, t2, t3 bnez t4, .L_muldi3_loop # continue if (b > 0) .L_muldi3_end: ret .globl __lshrdi3 __lshrdi3: andi a2, a2, 63 beqz a2, .L_lshr_ret li t0, 32 blt a2, t0, .L_lshr_lt32 # b >= 32 sub a2, a2, t0 srl a0, a1, a2 li a1, 0 j .L_lshr_ret .L_lshr_lt32: # b < 32 sub t1, t0, a2 # t1 = 32 - b sll t2, a1, t1 # t2 = in.hi << (32 - b) srl t1, a0, a2 # t1 = in.lo >> b srl a1, a1, a2 # out.hi = in.hi >> b or a0, t2, t1 # out.lo = t2 | t1 .L_lshr_ret: ret .globl fast_rsqrt fast_rsqrt: addi sp, sp, -44 sw ra, 40(sp) sw s0, 36(sp) # s0 = x sw s1, 32(sp) # s1 = exp sw s2, 28(sp) # s2 = y_base sw s3, 24(sp) # s3 = y sw s4, 20(sp) # s4 = y_next sw s6, 16(sp) # s6 = y64_lo sw s7, 12(sp) # s7 = y64_hi sw s8, 8(sp) # s8 = term_lo sw s9, 4(sp) # s9 = term_hi sw s10, 0(sp) # s10 = tmp mv s0, a0 bnez s0, .L_frsqrt_not_zero li a0, -1 j .L_frsqrt_epilogue .L_frsqrt_not_zero: mv a0, s0 call clz li t0, 31 sub s1, t0, a0 li t0, 32 bne s1, t0, .L_frsqrt_exp_ok li s1, 31 .L_frsqrt_exp_ok: la t0, rsqrt_table slli t1, s1, 2 add t0, t0, t1 lw s2, 0(t0) li t0, 31 bne s1, t0, .L_frsqrt_interp mv s3, s2 j .L_frsqrt_newton .L_frsqrt_interp: la t0, rsqrt_table addi t1, s1, 1 slli t1, t1, 2 add t0, t0, t1 lw s4, 0(t0) # power_of_2 = 1U << exp li t0, 1 sll t3, t0, s1 # t3 = power_of_2 # (uint64_t)x - power_of_2 sub a0, s0, t3 slt a1, s0, t3 neg a1, a1 # << 16 slli t0, a0, 16 # new_lo srli t1, a0, 16 # new_hi part 1 slli t2, a1, 16 # new_hi part 2 or t1, t1, t2 # new_hi mv a0, t0 mv a1, t1 mv a2, s1 call __lshrdi3 mv s10, a0 # s10 = fraction_q16 (low) # delta_y = (uint64_t)y_base - y_next; sub t0, s2, s4 li t1, 0 # (delta_y * fraction_q16) mv a0, t0 mv a1, t1 mv a2, s10 li a3, 0 call __muldi3 # >> 16 srli t0, a0, 16 # new_lo part 1 slli t1, a1, 16 # new_lo part 2 or a0, t0, t1 # new_lo srli a1, a1, 16 # new_hi # y = y_base - (uint32_t)result sub s3, s2, a0 .L_frsqrt_newton: mv s6, s3 li s7, 0 # y2 = y64 * y64 mv a0, s6 mv a1, s7 mv a2, s6 mv a3, s7 call __muldi3 sw a0, 16(sp) # s6 (tmp) = y2_lo sw a1, 12(sp) # s7 (tmp) = y2_hi # xy2 = ((uint64_t)x * y2) mv a0, s0 li a1, 0 lw a2, 16(sp) lw a3, 12(sp) call __muldi3 # >> 16 srli t0, a0, 16 slli t1, a1, 16 or a0, t0, t1 srli a1, a1, 16 # term = (3U << 16) - (uint32_t)xy2; li s8, 196608 li s9, 0 sub a0, s8, a0 sltu t0, s8, a0 sub a1, s9, t0 mv s8, a0 mv s9, a1 # y = (uint32_t)((y64 * term) >> 17); mv a0, s3 li a1, 0 mv a2, s8 mv a3, s9 call __muldi3 # >> 17 srli t0, a0, 17 # new_lo part 1 slli t1, a1, 15 # new_lo part 2 (32-17) or a0, t0, t1 # new_lo srli a1, a1, 17 # new_hi .L_frsqrt_epilogue: lw ra, 40(sp) lw s0, 36(sp) lw s1, 32(sp) lw s2, 28(sp) lw s3, 24(sp) lw s4, 20(sp) lw s6, 16(sp) lw s7, 12(sp) lw s8, 8(sp) lw s9, 4(sp) lw s10, 0(sp) addi sp, sp, 44 ret .globl newton_step newton_step: addi sp, sp, -20 sw ra, 16(sp) sw s0, 12(sp) # s0 = *rec_inv_sqrt sw s1, 8(sp) # s1 = x sw s2, 4(sp) # s2 = invsqrt sw s3, 0(sp) # s3 = invsqrt2 mv s0, a0 mv s1, a1 lw s2, 0(s0) # invsqrt2 = ((uint64_t)invsqrt * invsqrt) >> 32; mv a0, s2 li a1, 0 mv a2, s2 li a3, 0 call __muldi3 mv s3, a1 # val = (3LL << 32) - ((uint64_t)x * invsqrt2); mv a0, s1 li a1, 0 mv a2, s3 srai a3, s3, 31 call __muldi3 li t0, 0 li t1, 3 sub a0, t0, a0 sltu t2, t0, a0 sub a1, t1, a1 sub a1, a1, t2 # val >>= 2 srli t0, a0, 2 slli t1, a1, 30 # 32-2 or a0, t0, t1 srli a1, a1, 2 # val = (val * invsqrt) >> 31 mv t0, a0 mv t1, a1 mv a0, s2 li a1, 0 mv a2, t0 mv a3, t1 call __muldi3 # >> 31 srli t0, a0, 31 slli t1, a1, 1 # 32-31 or a0, t0, t1 srli a1, a1, 31 sw a0, 0(s0) lw ra, 16(sp) lw s0, 12(sp) lw s1, 8(sp) lw s2, 4(sp) lw s3, 0(sp) addi sp, sp, 20 ret .globl fast_distance_3d fast_distance_3d: addi sp, sp, -28 sw ra, 24(sp) sw s0, 20(sp) sw s1, 16(sp) sw s2, 12(sp) sw s3, 8(sp) sw s4, 4(sp) sw s5, 0(sp) mv s0, a0 mv s1, a1 mv s2, a2 # dx*dx mv a0, s0 srai a1, s0, 31 mv a2, s0 srai a3, s0, 31 call __muldi3 mv s3, a0 mv s4, a1 # dy*dy mv a0, s1 srai a1, s1, 31 mv a2, s1 srai a3, s1, 31 call __muldi3 add a0, s3, a0 sltu t0, a0, s3 add a1, s4, a1 add a1, a1, t0 mv s3, a0 mv s4, a1 # dz*dz mv a0, s2 srai a1, s2, 31 mv a2, s2 srai a3, s2, 31 call __muldi3 add a0, s3, a0 sltu t0, a0, s3 add a1, s4, a1 add a1, a1, t0 mv s3, a0 mv s4, a1 bnez s4, .L_fdist_shift j .L_fdist_no_shift .L_fdist_shift: srli t0, s3, 16 # a0 is s3 slli t1, s4, 16 # a1 is s4 or a0, t0, t1 srli a1, s4, 16 mv s3, a0 mv s4, a1 .L_fdist_no_shift: # inv_dist = fast_rsqrt((uint32_t)dist_sq); mv a0, s3 call fast_rsqrt mv s5, a0 # dist = ((uint64_t)dist_sq * inv_dist) >> 16; mv a0, s3 mv a1, s4 mv a2, s5 li a3, 0 call __muldi3 # >> 16 srli t0, a0, 16 slli t1, a1, 16 or a0, t0, t1 srli a1, a1, 16 lw ra, 24(sp) lw s0, 20(sp) lw s1, 16(sp) lw s2, 12(sp) lw s3, 8(sp) lw s4, 4(sp) lw s5, 0(sp) addi sp, sp, 28 ret ``` ::: **Console output:** ```console.log === Q3_C Test === Num: 16 Fast_rsqrt result (Q0.16): 16384 Cycles: 2217 Instructions: 2217 === All Tests Completed === ``` **Instructions Comparison:** Optimization levels tested: `-O0 -O1 -O2 -O3 -Ofast` | Optimization | -O0 | -O1 | -O2 | -O3 / -Ofast | ASM Version | |:------------:|:----:|:----:|:----:|:------------:|:-----------:| | Instructions | 2213 | 1212 | 1599 | 16 | 1021 | From **-O2** to **-O3**, the instruction count drops from **1595** to only **14**, showing that the optimizer becomes extremely aggressive at higher levels. At this point, most computations are removed or constant-folded. To prevent this behavior, the variable must be treated as externally visible. Using `volatile` forces the compiler to keep the actual computation: ```diff - int num = 16; + volatile int num = 16; ``` | Optimization | -O0 | -O1 | -O2 | -O3 / -Ofast | ASM Version | |:------------:|:----:|:----:|:----:|:------------:|:-----------:| | Instructions | 2213 | 1212 | 1599 | 1084 | 1021 | **Size Comparision:** `riscv-none-elf-size test.elf` | | text | data | bss | dec | hex | |:----------:|:----:|:----:|:----:|:----:|:----:| | -O0 | 4634 | 0 | 4102 | 8736 | 2220 | | -O1 | 2250 | 0 | 4100 | 6350 | 18ce | | -O2 | 2290 | 0 | 4108 | 6398 | 18fe | | -O3/-Ofast | 3102 | 0 | 4096 | 7198 | 1c1e | | ASM | 3210 | 128 | 4100 | 7438 | 1d0e | --- ## Reference [Tower of Hanoi Approach](https://tutorialhorizon.com/algorithms/towers-of-hanoi/) [Quiz2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz2-sol) [Quiz3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz3-sol)