# Assignment2: Complete Applications

You can find the source code [here](https://github.com/mephen/2025_RISC_V_HW2.git).

## Pick a Complete Program of Homework 1

### uf8 Encoding and Decoding with Optimized CLZ Method

To port this program from Ripes to rv32emu, I made several adjustments in the playground folder:
- Added the uf8 assembly code from Homework 1 to the playground folder.
- Modified the uf8 assembly file to keep only the function definitions, while moving the test suite to main.c.
- Updated the Makefile to include the necessary object files for linking.

After these adjustments, the program produced the following results:

![image](https://hackmd.io/_uploads/H1sW2XUkZx.png)

### Comparison between Handwritten and Compiler-Optimized Assembly Listings

The disassembled assembly code produced from the ELF file is almost identical to the handwritten assembly.
- No reordering, inlining, or instruction reduction is observed.

This is because handwritten assembly files (.S) are processed directly by the GNU Assembler (as), which simply translates each line into machine code without applying any of GCC's optimization stages. Optimization options such as `-O0`, `-O2`, or `-Ofast` affect only C source files, not manually written assembly.

The small differences between the two listings come from low-level formatting rather than logic or structure:
- Pseudo-instruction expansion
  For example:
  - `li a0, 32` → `addi a0, zero, 32`
  - `mv t0, t1` → `addi t0, t1, 0`
  - `j label` → `jal x0, offset`
- Alignment NOPs
  The assembler inserts `nop` instructions to satisfy alignment requirements specified by directives such as `.align 3`.
- Immediate and label expansion
  Conditional branches like `beqz t0, label` are displayed as `beq t0, zero, offset` after label addresses are resolved.
## Adapt Problem A from Quiz 2

### Implementation

- The loop uses Gray code to decide which disk moves at each step

```asm=
    addi x5, x0, 8
    beq  x8, x5, finish_game
    srli x5, x8, 1
    xor  x6, x8, x5      # gray(n)
    addi x7, x8, -1
    srli x28, x7, 1
    xor  x7, x7, x28     # gray(n-1)
    xor  x5, x6, x7      # changed bit(s)
```

- For the smallest disk, the destination is `(src + 2) mod 3`; for larger disks it is `3 - src - other`.
- Each disk move prints seven times via `write`:
  - `"Move Disk "`
  - disk digit
  - `" from "`
  - source peg char
  - `" to "`
  - destination peg char
  - newline

  Lengths are recomputed each time. Peg characters are decoded each time from `obdata` by XOR and bias.

```asm=
display_move:
    la   x20, obdata
    add  x5, x20, x18        # BLANK 23: Load byte from obfuscated data
    lbu  x11, 0(x5)
    # BLANK 24: Decode constant (0x6F)
    li   x6, 0x6F
    xor  x11, x11, x6
    # BLANK 25: Final offset adjustment
    addi x11, x11, -0x12
    add  x7, x20, x19
    lbu  x12, 0(x7)
    xor  x12, x12, x6
    addi x12, x12, -0x12

    # rv32emu syscall to print move
    mv   t2, x11             # src_ch
    mv   t3, x12             # dst_ch

    # --- print "Move Disk " ---
    li   a7, 64              # SYS_write
    li   a0, 1               # STDOUT
    la   a1, str1
    la   a2, str1_end
    la   t0, str1
    sub  a2, a2, t0
    addi a2, a2, -1          # remove NUL
    ecall

    # --- print disk number (x9 = 0..2) as '1'..'3' ---
    addi t0, x9, 1
    addi t0, t0, 48          # '1'..'3'
    la   t1, tmp_ch
    sb   t0, 0(t1)
    li   a7, 64
    li   a0, 1
    la   a1, tmp_ch
    addi a2, x0, 1
    ecall

    # --- print " from " ---
    li   a7, 64
    li   a0, 1
    la   a1, str2
    la   a2, str2_end
    la   t0, str2
    sub  a2, a2, t0
    addi a2, a2, -1
    ecall

    # --- print source peg char ---
    la   t1, tmp_ch
    sb   t2, 0(t1)
    li   a7, 64
    li   a0, 1
    la   a1, tmp_ch
    addi a2, x0, 1
    ecall

    # --- print " to " ---
    li   a7, 64
    li   a0, 1
    la   a1, str3
    la   a2, str3_end
    la   t0, str3
    sub  a2, a2, t0
    addi a2, a2, -1
    ecall

    # --- print target peg char ---
    la   t1, tmp_ch
    sb   t3, 0(t1)
    li   a7, 64
    li   a0, 1
    la   a1, tmp_ch
    addi a2, x0, 1
    ecall

    # --- print newline ---
    li   t0, 10              # '\n'
    la   t1, tmp_ch
    sb   t0, 0(t1)
    li   a7, 64
    li   a0, 1
    la   a1, tmp_ch
    addi a2, x0, 1
    ecall
```

### Drawbacks

- High syscall frequency
  - Seven `ecall`s per move dominate runtime in an emulator.
- Redundant address/length work
  - Every print recalculates literal lengths and base addresses (`la`, `sub`, `addi`) inside the hot loop.
- Unnecessary per-move decoding
  - Peg chars are derived by XOR + bias for every move instead of a direct lookup.

### Optimization

- Single buffered print per disk move
  - Replace seven `write` system calls with a single one. Put a fixed template in `.data` and patch only the three changing bytes:
    - offset 10: disk digit '1'..'3'
    - offset 17: source peg char
    - offset 22: destination peg char

```asm=
.data
.balign 4
outbuf:    .ascii "Move Disk 0 from X to Y\n"
peg_chars: .byte 'A','B','C'
```

- Hoist invariants
  - Load `&outbuf` once into a callee-saved register and reuse it

```asm=
    lui  x20, %hi(outbuf)
    addi x20, x20, %lo(outbuf)   # x20 = &outbuf
```

- Table lookup for peg chars
  - No XOR/bias per move

```asm=
display_move:
    lui  x6, %hi(peg_chars)
    addi x6, x6, %lo(peg_chars)
    add  x5, x6, x18
    lbu  x11, 0(x5)              # src
    add  x7, x6, x19
    lbu  x12, 0(x7)              # dst
```

### Performance Improvement

As shown in the figures, after applying the optimization steps described above, the execution time (measured in CSR cycles) was reduced by approximately `(862-374)/862 ≈ 56%`, demonstrating a significant improvement in performance.

- Original
  ![image](https://hackmd.io/_uploads/S1DuyvDkZl.png)
- Optimization
  ![image](https://hackmd.io/_uploads/HyjYJwvyWx.png)

## Adapt Problem C from Quiz 3

### Implementation

- Q16 format and flow
  - Input `x` is an unsigned 32-bit integer
  - Output is Q16 fixed-point `y ≈ ⌊2^16 / sqrt(x)⌋`
  - Flow:
    1. find bucket `e = 31 - clz32(x)`
    2. LUT seed `rsqrt_table[e]`, `rsqrt_table[e+1]`
    3. linear interpolation in Q16
    4. two Newton steps to refine the output

```c=
static inline uint32_t clz32(uint32_t x)
{
    if (!x)
        return 32u;
    uint32_t n = 0;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 28) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 30) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 31) == 0) { n += 1; }
    return n;
}

static const uint16_t rsqrt_table[32] = {
    65535, 46341, 32768, 23170, 16384, 11585, 8192, 5793,
    4096,  2896,  2048,  1448,  1024,  724,   512,  362,
    256,   181,   128,   90,    64,    45,    32,   23,
    16,    11,    8,     6,     4,     3,     2,    1
};

...

uint32_t fast_rsqrt(uint32_t x)
{
    if (x == 0u)
        return 0u;

    /* 1) Find MSB position to locate range 2^exp..2^(exp+1) */
    uint32_t exp = 31u - clz32(x);

    /* 2) Get base and next table values (for exp=31 set y_next=1 to avoid overflow) */
    uint32_t y_base = rsqrt_table[exp];
    uint32_t y_next = (exp < 31u) ? rsqrt_table[exp + 1u] : 1u;

    /* 3) Linear interpolation: frac = ((x - 2^exp) << 16) >> exp ∈ [0, 2^16) */
    uint32_t one_exp = (exp < 31u) ? (1u << exp) : 0u; /* 1<<31 safe */
    uint32_t frac = (uint32_t)((((uint64_t)x - (uint64_t)one_exp) << 16) >> exp);
    uint32_t delta = y_base - y_next;
    uint32_t y = y_base - (uint32_t)(mul32_shift_add(delta, frac) >> 16);

    /* 4) One or two Newton iterations (two iterations usually reach 3–8% error) */
    y = q16_newton_step(y, x);
    y = q16_newton_step(y, x);

    return y; /* Q16 result: represents 2^16 / sqrt(x) */
}
```

### Drawbacks

- Expensive 64-bit arithmetic on RV32I
  - The code uses a generic "bit-by-bit" 32×32→64 multiplier and many 64-bit shifts. On RV32I these expand into long instruction sequences

```c=
static inline uint64_t mul32_shift_add(uint32_t a, uint32_t b)
{
    uint64_t acc = 0, aa = (uint64_t)a;
    while (b) {              // up to 32 iterations
        if (b & 1u)
            acc += aa;       // 64-bit add
        aa <<= 1;            // 64-bit shift
        b >>= 1;
    }
    return acc;
}
```

- Unnecessary 64-bit interpolation
  - Fraction computation and interpolation run through 64-bit left/right shifts, adding more cost on RV32I.
- Two Newton steps by default
  - With LUT + linear interpolation, the second step often gives only a small accuracy gain for a large cycle cost

### Optimization

- Replace the "bit loop" with a fixed-size 16×16 multiply
  - Use nibble-grouped shifts/adds. Cost is constant and fully 32-bit.

```c=
static inline uint32_t mul16x16_32(uint32_t a16, uint32_t b16)
{
    a16 &= 0xFFFFu;
    b16 &= 0xFFFFu;
    uint32_t acc = 0, a = a16;
#define NIBBLE(accum, nibshift) do {               \
        uint32_t nib = (b16 >> (nibshift)) & 0xFu; \
        uint32_t p = 0;                            \
        if (nib & 1u) p += a;                      \
        if (nib & 2u) p += (a << 1);               \
        if (nib & 4u) p += (a << 2);               \
        if (nib & 8u) p += (a << 3);               \
        accum += (p << (nibshift));                \
    } while (0)
    NIBBLE(acc, 0);
    NIBBLE(acc, 4);
    NIBBLE(acc, 8);
    NIBBLE(acc, 12);
#undef NIBBLE
    return acc; // exact 16x16 product in 32 bits
}
```

- Specialize `(x*y) >> 16` as 32×16 with decomposition
  - Avoid 64-bit arithmetic by splitting `x` into high/low halves.

```c=
// (x32 * y16) >> 16 using two 16x16 multiplies
static inline uint32_t mul32x16_shr16(uint32_t x32, uint32_t y16)
{
    uint32_t x_lo = x32 & 0xFFFFu;
    uint32_t x_hi = x32 >> 16;
    uint32_t p_hi = mul16x16_32(x_hi, y16);        // x_hi*y
    uint32_t p_lo = mul16x16_32(x_lo, y16) >> 16;  // high half of x_lo*y
    return p_hi + p_lo;                            // exact low-32 of (x*y)>>16
}
```

- Do the Newton step entirely in 32 bits
  - Correct scaling and no double-shift on `x*y^2`. Split the final product into a small "high part" and a 16×16 "low part."

```c=
// y <- y * (3 - x*y*y/2^16) / 2 , all Q16
static inline uint32_t q16_newton_step(uint32_t y, uint32_t x)
{
    y &= 0xFFFFu;                                    // Q16
    uint32_t y2 = mul16x16_32(y, y);                 // Q32
    uint32_t xy2_q16 = mul32x16_shr16(x, y2 >> 16);  // (x*y^2)>>16
    uint32_t term = (3u << 16) - xy2_q16;            // Q16
    uint32_t hi = term >> 16;                        // 0..3
    uint32_t lo = term & 0xFFFFu;

    // (y*hi)>>1 for hi in {0,1,2,3}
    uint32_t y_hi = (hi == 3u) ? ((y << 1) + y)
                  : (hi == 2u) ? (y << 1)
                  : (hi == 1u) ? y
                  : 0u;
    uint32_t t0 = y_hi >> 1;

    // (y*lo)>>17 via 16x16
    uint32_t t1 = mul16x16_32(y, lo) >> 17;
    return t0 + t1;                                  // Q16
}
```

- Keep interpolation 32-bit only
  - Compute `frac` with balanced shifts, no 64-bit

```c=
// frac = ((x - 2^e) << 16) >> e without 64-bit
uint32_t base = 1u << e;
uint32_t diff = x - base;
uint32_t frac = (e >= 16u) ? (diff >> (e - 16u))
                           : (diff << (16u - e));

// y = y0 - ((y0 - y1) * frac >> 16), all 32-bit
uint32_t dy = y0 - y1;
uint32_t y  = y0 - (mul16x16_32(dy, frac) >> 16);
```

### Performance Improvement

As shown in the figures, after applying the optimization steps described above, the execution time (measured in CSR cycles) was reduced by approximately `(79128-59729)/79128 ≈ 25%`, demonstrating a significant improvement in performance.

- Original
  ![image](https://hackmd.io/_uploads/Hyc5TcvkWl.png)
- Optimization
  ![image](https://hackmd.io/_uploads/SyGIiiwk-e.png)

## References

- [Assignment2 Requirement](https://hackmd.io/@sysprog/2025-arch-homework2#Requirements)
- [Lab2](https://hackmd.io/@sysprog/Sko2Ja5pel): RISC-V Instruction Set Simulator and System Emulator
- [RISC-V Privileged ISA](https://riscv.org/specifications/ratified/): CSR specifications and system mode
- [RISC-V ABI](https://github.com/riscv-non-isa/riscv-elf-psabi-doc): Calling conventions and register usage
- [GNU Linker Scripts](https://sourceware.org/binutils/docs/ld/Scripts.html): Linker script syntax reference
- This assignment was completed with assistance from ChatGPT for grammar checking, initial brainstorming, and idea refinement. All final analysis and conclusions are my own.