# Assignment2: Complete Applications contributed by < [kuohanwen](https://github.com/kuohanwen/2025_computer_architecture_hw2) > [Homework2 Requirment](https://hackmd.io/@sysprog/2025-arch-homework2) ## Installation of rv32emu Refer to [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel) OS environment : WSL (Ubuntu 24.04.3 LTS on Windows 11) RISC-V toolchain : xPack GNU RISC-V Embedded GCC x86_64 All Lab 2 tests have been successfully completed ## Pick one complete program (with test suite) done in your Homework 1 and make it run in a bare-metal environment using rv32emu’s system emulation. Refer to [Homework1 ProblemB](https://hackmd.io/@Hanwen0218/arch2025-homework1) ### Revision of File 1. <span style="background-color:#cce5ff; color:#004085; font-weight:bold; padding:2px 6px; border-radius:4px;">Makefile</span> Add `quiz1ProblemBuf8.o` in OBJS so that this object file is linked into the final program, allowing the Quiz1 Problem B assembly code to be used. 2. <span style="background-color:#cce5ff; color:#004085; font-weight:bold; padding:2px 6px; border-radius:4px;">main.c</span> Add `extern int test(void);` This allows main() to call test(), where test() serves as the entry point for assembly code. Then add this code in the final part of main.c, the return value indicates whether Quiz1 Problem B passes, and we can also see how many cycles the handwritten assembly code takes and how many instructions it executes. ```c= TEST_LOGGER("\n=== Quiz1 Problem B Tests ===\n"); // Record starting cycle and instruction counters start_cycles = get_cycles(); start_instret = get_instret(); // Call the assembly implementation of test() for Quiz1 Problem B { int pb_pass = test(); // Record ending cycle and instruction counters end_cycles = get_cycles(); end_instret = get_instret(); // Compute elapsed cycles and retired instructions cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret; // Print pass/fail result based on test() return value if (pb_pass) { TEST_LOGGER("Quiz1 Problem B: PASSED\n"); } else { TEST_LOGGER("Quiz1 Problem B: FAILED\n"); } // Print performance statistics for Quiz1 Problem B TEST_LOGGER(" Cycles: "); print_dec((unsigned long) cycles_elapsed); TEST_LOGGER(" Instructions: "); print_dec((unsigned long) instret_elapsed); TEST_LOGGER("\n"); } /* ===== End of Quiz1 Problem B tests ===== */ TEST_LOGGER("\n=== All Tests Completed ===\n"); return 0; } ``` 3. <span style="background-color:#cce5ff; color:#004085; font-weight:bold; padding:2px 6px; border-radius:4px;">Assembly Code</span> main → q1_main .globl main → .globl q1_main There is already a C main() function in main.c in the project. If the assembly file also defines main while the C code defines main as well, the linker will report a multiple definition of main error, or it will not know which one to use as the entry point. Therefore, we rename the main in the assembly file to q1_main, so that the only true program entry point is the C main(), avoiding any conflicts. A line `.globl test` was added before test. This tells the assembler/linker that the symbol test should be exported as a global symbol so that other object files can see it and call it, the Quiz1 Problem B test in main.c can be correctly linked to this assembly implementation. ### Disassemble Code Enter following command ``` cd ~/riscv-none-elf-gcc/rv32emu/tests/system/quiz1problemb riscv-none-elf-objdump -d ./test.elf ``` Disassemble elf(binary → assembly): ``` Disassembly of section .text: 00010000 <_start>: 10000: 00008117 auipc sp,0x8 10004: be010113 addi sp,sp,-1056 # 17be0 <__stack_top> 10008: 00007297 auipc t0,0x7 1000c: bcc28293 addi t0,t0,-1076 # 16bd4 <__bss_end> 10010: 00007317 auipc t1,0x7 10014: bc430313 addi t1,t1,-1084 # 16bd4 <__bss_end> 10018: 0062d863 bge t0,t1,10028 <_start+0x28> 1001c: 0002a023 sw zero,0(t0) 10020: 00428293 addi t0,t0,4 10024: ff5ff06f j 10018 <_start+0x18> 10028: 609010ef jal 11e30 <main> 1002c: 05d00893 li a7,93 10030: 00000513 li a0,0 10034: 00000073 ecall 10038: 0000006f j 10038 <_start+0x38> 0001003c <memcpy>: 1003c: fd010113 addi sp,sp,-48 10040: 02112623 sw ra,44(sp) 10044: 02812423 sw s0,40(sp) 10048: 03010413 addi s0,sp,48 1004c: fca42e23 sw a0,-36(s0) 10050: fcb42c23 sw a1,-40(s0) 10054: fcc42a23 sw a2,-44(s0) 10058: fdc42783 lw a5,-36(s0) 1005c: fef42623 sw a5,-20(s0) 10060: fd842783 lw a5,-40(s0) 10064: fef42423 sw a5,-24(s0) 10068: 0240006f j 1008c <memcpy+0x50> 1006c: fe842703 lw a4,-24(s0) 10070: 00170793 addi a5,a4,1 10074: fef42423 sw a5,-24(s0) 10078: fec42783 lw a5,-20(s0) 1007c: 00178693 addi a3,a5,1 10080: fed42623 sw a3,-20(s0) 10084: 00074703 lbu a4,0(a4) 10088: 00e78023 sb a4,0(a5) 1008c: fd442783 lw a5,-44(s0) 10090: fff78713 addi a4,a5,-1 10094: fce42a23 sw a4,-44(s0) 10098: fc079ae3 bnez a5,1006c <memcpy+0x30> 1009c: fdc42783 lw a5,-36(s0) 100a0: 00078513 mv a0,a5 100a4: 02c12083 lw ra,44(sp) 100a8: 02812403 lw s0,40(sp) 100ac: 03010113 addi sp,sp,48 100b0: 00008067 ret 000100b4 <udiv>: 100b4: fd010113 addi sp,sp,-48 100b8: 02112623 sw ra,44(sp) 100bc: 02812423 sw s0,40(sp) 100c0: 03010413 addi s0,sp,48 ... ``` ### Final Result By using this command ``` make clean make run ``` ![螢幕擷取畫面 2025-12-07 230723](https://hackmd.io/_uploads/BJKSdM7GWe.png) By using Ripes ![螢幕擷取畫面 2025-12-07 223719](https://hackmd.io/_uploads/BkuN-MXzZg.png) Cycles in **rv32emu's system** is 24231 Cycles in **Ripes** is 35402 ## Adapt Problem A from Quiz 2, and make it run in a bare-metal environment using rv32emu’s system emulation. refer to [Problem A of Quiz 2](https://hackmd.io/@sysprog/arch2025-quiz2-sol) ### Overview The goal of this problem is to implement the **Tower of Hanoi** algorithm for three disks using RISC-V assembly in a bare-metal environment. Instead of using recursion, we must adopt an iterative approach based on 3-bit Gray code to generate the sequence of moves. The assembly function `test` should: 1️⃣ Maintain the current peg index (0 = A, 1 = B, 2 = C) for each disk in memory. 2️⃣ For each step n = 1..7, compute gray(n) and gray(n-1), use their XOR to determine which disk moves. 3️⃣ Update the disk’s peg according to the Tower of Hanoi movement rules: - The smallest disk (disk 0) always moves in a fixed cyclic direction (mod 3). - Larger disks move to the remaining peg based on the sum of all peg indices. 4️⃣ For every move, print a line in the format `Move Disk <k> from <X> to <Y>` using the system call interface provided by rv32emu (Linux-style write syscall). 5️⃣ After finishing all moves, restore the saved registers, return to the C test harness, and return 1 in a0 to indicate that the sequence is correct. The C program will call `test`, capture the printed output, and measure the number of cycles and instructions for this assembly implementation. :::spoiler Original Assembly Code ```c= .text .globl main main: addi x2, x2, -32 sw x8, 0(x2) sw x9, 4(x2) sw x18, 8(x2) sw x19, 12(x2) sw x20, 16(x2) li x5, 0x15 sw x5, 20(x2) sw x5, 24(x2) sw x5, 28(x2) # Fix disk positions (BLANK 1-3: neutralize x5 effect) # BLANK 1: Fix position at x2+20 sw x0, 20(x2) # BLANK 2: Fix position at x2+24 sw x0, 24(x2) # BLANK 3: Fix position at x2+28 sw x0, 28(x2) addi x8, x0, 1 game_loop: # BLANK 4: Check loop termination (2^3 moves) addi x5, x0, 8 beq x8, x5, finish_game # Gray code formula: gray(n) = n XOR (n >> k) # BLANK 5: What is k for Gray code? srli x5, x8, 1 # BLANK 6: Complete Gray(n) calculation xor x6, x8, x5 # BLANK 7-8: Calculate previous value and its shift addi x7, x8, -1 srli x28, x7, 1 # BLANK 9: Generate Gray(n-1) xor x7, x7, x28 # BLANK 10: Which bits changed? xor x5, x6, x7 # Initialize disk number addi x9, x0, 0 # BLANK 11: Mask for testing LSB andi x6, x5, 1 # BLANK 12: Branch if disk 0 moves bne x6, x0, disk_found # BLANK 13: Set disk 1 addi x9, x0, 1 # BLANK 14: Test second bit with proper mask andi x6, x5, 2 bne x6, x0, disk_found # BLANK 15: Last disk number addi x9, x0, 2 disk_found: # BLANK 16: Check impossible pattern (multiple bits) andi x30, x5, 5 addi x31, x0, 5 beq x30, x31, pattern_match jal x0, continue_move pattern_match: continue_move: # BLANK 17: Word-align disk index (multiply by what?) slli x5, x9, 2 # BLANK 18: Base offset for disk array addi x5, x5, 20 add x5, x2, x5 lw x18, 0(x5) bne x9, x0, handle_large # BLANK 19: Small disk moves by how many positions? addi x19, x18, 2 # BLANK 20: Number of pegs addi x6, x0, 3 blt x19, x6, display_move sub x19, x19, x6 jal x0, display_move handle_large: # BLANK 21: Load reference disk position lw x6, 20(x2) # BLANK 22: Sum of all peg indices (0+1+2) addi x19, x0, 3 sub x19, x19, x18 sub x19, x19, x6 display_move: la x20, obdata add x5, x20, x18 # BLANK 23: Load byte from obfuscated data lbu x11, 0(x5) # BLANK 24: Decode constant (0x6F) li x6, 0x6F xor x11, x11, x6 # BLANK 25: Final offset adjustment addi x11, x11, -0x12 add x7, x20, x19 lbu x12, 0(x7) xor x12, x12, x6 addi x12, x12, -0x12 la x10, str1 addi x17, x0, 4 ecall addi x10, x9, 1 addi x17, x0, 1 ecall la x10, str2 addi x17, x0, 4 ecall addi x10, x11, 0 addi x17, x0, 11 ecall la x10, str3 addi x17, x0, 4 ecall addi x10, x12, 0 addi x17, x0, 11 ecall addi x10, x0, 10 addi x17, x0, 11 ecall # BLANK 26: Calculate storage offset slli x5, x9, 2 addi x5, x5, 20 add x5, x2, x5 # BLANK 27: Update disk position sw x19, 0(x5) # BLANK 28-29: Increment counter and loop addi x8, x8, 1 jal x0, game_loop finish_game: lw x8, 0(x2) lw x9, 4(x2) lw x18, 8(x2) lw x19, 12(x2) lw x20, 16(x2) addi x2, x2, 32 addi x17, x0, 10 ecall .data obdata: .byte 0x3c, 0x3b, 0x3a str1: .asciz "Move Disk " str2: .asciz " from " str3: .asciz " to " ``` ::: ### Modified Assembly Code ```c= .text .globl test test: # Create stack frame, reserve 32 bytes addi x2, x2, -32 # sp -= 32 # Save callee-saved registers sw x8, 0(x2) sw x9, 4(x2) sw x18, 8(x2) sw x19, 12(x2) sw x20, 16(x2) # Temporarily set all three disk positions to 0x15 li x5, 0x15 sw x5, 20(x2) # disk0 position sw x5, 24(x2) # disk1 position sw x5, 28(x2) # disk2 position # BLANK 1-3: neutralize the effect → initialize them to 0 sw x0, 20(x2) # disk0 starts on peg 0 (A) sw x0, 24(x2) # disk1 starts on peg 0 (A) sw x0, 28(x2) # disk2 starts on peg 0 (A) # x8 = move counter, start from 1 addi x8, x0, 1 game_loop: # BLANK 4: loop termination for 3 disks → 2^3 = 8 states addi x5, x0, 8 beq x8, x5, finish_game # stop when n == 8 # Gray code formula: gray(n) = n XOR (n >> 1) # BLANK 5: k = 1 srli x5, x8, 1 # x5 = n >> 1 # BLANK 6: compute gray(n) xor x6, x8, x5 # x6 = gray(n) # BLANK 7-8: compute n-1 and (n-1)>>1 addi x7, x8, -1 # x7 = n-1 srli x28, x7, 1 # x28 = (n-1)>>1 # BLANK 9: gray(n-1) xor x7, x7, x28 # x7 = gray(n-1) # BLANK 10: bits that changed between steps xor x5, x6, x7 # x5 = changed bits # Decide which disk moves addi x9, x0, 0 # assume disk 0 first # BLANK 11: test LSB to see if disk 0 moves andi x6, x5, 1 # x6 = x5 & 1 # BLANK 12: if LSB != 0 → disk 0 moves bne x6, x0, disk_found # BLANK 13: otherwise consider disk 1 addi x9, x0, 1 # BLANK 14: test second bit (mask = 2) andi x6, x5, 2 bne x6, x0, disk_found # if bit1 set → disk 1 # BLANK 15: otherwise it must be disk 2 addi x9, x0, 2 disk_found: # BLANK 16: check impossible pattern (multiple bits): 0b101 andi x30, x5, 5 # keep bits 0 and 2 addi x31, x0, 5 # 0b101 beq x30, x31, pattern_match jal x0, continue_move # skip special handling pattern_match: continue_move: # BLANK 17: word-align disk index → multiply by 4 slli x5, x9, 2 # x5 = disk * 4 # BLANK 18: base offset of disk array (frame offset 20) addi x5, x5, 20 # x5 = 4*disk + 20 add x5, x2, x5 # &disk_pos[disk] lw x18, 0(x5) # x18 = current peg of this disk # If disk != 0 (not the smallest disk), use large-disk rule bne x9, x0, handle_large # BLANK 19: smallest disk always moves by +2 (mod 3) addi x19, x18, 2 # tentative new peg = old + 2 # BLANK 20: number of pegs = 3 addi x6, x0, 3 # x6 = 3 blt x19, x6, display_move # if <3, no wrap sub x19, x19, x6 # wrap around: new -= 3 jal x0, display_move handle_large: # BLANK 21: load smallest disk position as reference lw x6, 20(x2) # x6 = disk0 position # BLANK 22: sum of peg indices is 0+1+2 = 3 # large disk must move to the remaining peg: # new = 3 - self - smallest addi x19, x0, 3 # x19 = 3 sub x19, x19, x18 # x19 = 3 - self sub x19, x19, x6 # x19 = 3 - self - min_disk display_move: # Decode from/to peg letters (A/B/C) from obdata table la x20, obdata add x5, x20, x18 # index for "from" lbu x11, 0(x5) li x6, 0x6F xor x11, x11, x6 addi x11, x11, -0x12 # x11 = from char ('A'..'C') add x7, x20, x19 # index for "to" lbu x12, 0(x7) xor x12, x12, x6 addi x12, x12, -0x12 # x12 = to char ('A'..'C') # Save chars into temporaries so a0/a1/a2 can be used by syscalls mv x29, x11 # from char mv x30, x12 # to char # print "Move Disk " li a0, 1 # fd = 1 (stdout) la a1, str1 # buffer = "Move Disk " li a2, 10 # length = 10 li a7, 64 # write syscall ecall # print disk number (1/2/3) addi x5, x9, 1 # disk index: 1..3 addi x5, x5, 48 # convert to '1' / '2' / '3' la x6, charbuf sb x5, 0(x6) # charbuf[0] = digit li a0, 1 mv a1, x6 # buffer = charbuf li a2, 1 # length = 1 li a7, 64 ecall # print " from " li a0, 1 la a1, str2 # " from " li a2, 6 # length = 6 li a7, 64 ecall # print source peg la x6, charbuf sb x29, 0(x6) # charbuf[0] = from char li a0, 1 mv a1, x6 li a2, 1 li a7, 64 ecall # print " to " li a0, 1 la a1, str3 # " to " li a2, 4 # length = 4 li a7, 64 ecall # print destination peg la x6, charbuf sb x30, 0(x6) # charbuf[0] = to char li a0, 1 mv a1, x6 li a2, 1 li a7, 64 ecall # print newline li x5, 10 # '\n' la x6, charbuf sb x5, 0(x6) li a0, 1 mv a1, x6 li a2, 1 li a7, 64 ecall # update disk position and go to next move # BLANK 26: compute storage offset for this disk slli x5, x9, 2 # x5 = 4 * disk addi x5, x5, 20 # offset within frame add x5, x2, x5 # &disk_pos[disk] # BLANK 27: write back new peg position sw x19, 0(x5) # BLANK 28-29: increment step counter and loop addi x8, x8, 1 # n++ jal x0, game_loop finish_game: # Restore saved registers and stack pointer lw x8, 0(x2) lw x9, 4(x2) lw x18, 8(x2) lw x19, 12(x2) lw x20, 16(x2) addi x2, x2, 32 # Return 1 in a0 to C code (indicates PASS) addi x10, x0, 1 ret .data obdata: .byte 0x3c, 0x3b, 0x3a # XOR-obfuscated peg letters str1: .asciz "Move Disk " # prefix str2: .asciz " from " # middle string str3: .asciz " to " # middle string charbuf: .byte 0 # scratch buffer for a single char ``` ### Result Enter this command ``` cd ~/riscv-none-elf-gcc/rv32emu/tests/system/quiz2problema make clean make run ``` ![螢幕擷取畫面 2025-12-15 232122](https://hackmd.io/_uploads/HyV5wsazWx.png) ### Optimize Method with -Os and -Ofast **Optimize for Size** `-Os` optimizes for code size. It is roughly `-O2` but disables optimizations that bloat the binary, producing smaller code while mostly preserving standard-compliant behavior. **Aggressive Speed Optimization** `-Ofast` aims for maximum speed, similar to `-O3` plus more aggressive transforms (e.g., fast-math). It can be much faster, but may break strict IEEE FP and language standards, and usually generates larger binaries. <span style="background-color:#d4edda; color:#155724; font-weight:bold; padding:2px 6px; border-radius:4px;">Revision of Makefile</span> ```c= # Version of -O0 CFLAGS = -g -march=rv32i_zicsr # Version of -Os CFLAGS = -g -march=rv32i_zicsr -Os # Version of -Ofast CFLAGS = -g -march=rv32i_zicsr -Ofast ``` <span style="background-color:#d4edda; color:#155724; font-weight:bold; padding:2px 6px; border-radius:4px;">Revision of main.c</span> 1. Rewrote the printing helper for rv32emu’s `write` syscall - Redefined `printstr` so the inline assembly explicitly uses `li a7, 64` (SYS_write), `li a0, 1` (stdout), `mv a1, %0` (buffer), `mv a2, %1` (length), then ecall. - Wrapped it with `TEST_OUTPUT` / `TEST_LOGGER` macros to print constant messages easily. 2. Added a decimal-print helper `print_dec32` - Introduced `static void print_dec32(uint32_t val)`, which converts a 32-bit integer to decimal using `/` and `%` of C Program, reverses the digits in a local buffer, appends `\n`, then calls `printstr`. - This function is used to print the numeric values of cycles and instructions. 3. Call the custom assembly `test()` - Added `extern int test(void)`; to match .globl test in quiz2problema.s. - `main()` calls `test()` once and uses the return value to print Result: PASSED or FAILED. 4. Measure and print cycles / instructions in main() - Around the call to `test()`, `main()` now reads `get_cycles()` and `get_instret()` before and after, then computes: ``` cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret;. ``` - Finally, `main()` prints in the following order: ``` Result: PASSED / FAILED Cycles: <print_dec32(cycles_elapsed)> Instructions: <print_dec32(instret_elapsed)> === All Tests Completed === ``` Enter this command ``` riscv-none-elf-objdump -d test.elf > disasm.s wc -l disasm.s ``` can see the number of lines for different version. |Version|number of lines| |--|--| |with `-O0`|462| |with `-Os`|316| |with `-Ofast`|438| - With `-O0` (no optimization), the compiler almost directly translates C into machine code. Many redundant loads/stores and unoptimized branches remain, so the disassembly is the largest: 462 lines. - `-Os` optimizes for code size. It removes dead code, merges computations and avoids inlining/unrolling that would bloat the binary, so it produces the smallest code: 316 lines, about 32% fewer than `-O0` and almost 28% fewer than `-Ofast`. - `-Ofast` aggressively optimizes for speed. It adds inlining and loop unrolling and reorders computations; this can make some parts shorter but expands others. As a result, the code is slightly smaller than `-O0` (438 lines), but much larger than `-Os`. - Conclusion: `-Os` gives the smallest assembly, `-Ofast` is in the middle, and `-O0` is the largest, showing the trade-off between size-oriented and speed-oriented optimizations. ### Final Result **Test with -O0** ![螢幕擷取畫面 2025-12-21 002549](https://hackmd.io/_uploads/ryOQCBNm-x.png) **Test with -Os** ![螢幕擷取畫面 2025-12-21 002340](https://hackmd.io/_uploads/B1bsaBE7bl.png) **Test with -Ofast** ![螢幕擷取畫面 2025-12-21 002735](https://hackmd.io/_uploads/BkaKRB47-e.png) These experiments show that both `-Os` and `-Ofast` produce slightly fewer cycles and instructions than `-O0`. This means the compiler optimizations removed redundant work and instructions, allowing the same algorithm to finish with a smaller instruction stream. Overall, as long as optimization is enabled with `-Os` or `-Ofast`, this Tower of Hanoi implementation runs more efficiently than the unoptimized `-O0` version. ## Adapt Problem C from Quiz 3, and make it run in a bare-metal environment using rv32emu’s system emulation. refer to [Problem C of Quiz3](https://hackmd.io/@sysprog/arch2025-quiz3-sol) ### Overview To implement a **fast reciprocal square root** routine `fast_rsqrt(x)` on RV32I without a hardware multiplier or FPU. The goal is to approximate $\frac{1}{\sqrt{x}}$ using only integer arithmetic, with about 3–8% error but very low latency. 1️⃣ **Fixed-point representation (2¹⁶ scaling)** - Instead of storing the real value $\frac{1}{\sqrt{x}}$, the code stores $y \approx \frac{2^{16}}{\sqrt{x}}$ as a 32-bit integer. - Values are interpreted as `y / 2^16`, which is similar to a Q0.16 fixed-point format. 2️⃣ **Lookup table (LUT) based on MSB of x** - A 32-entry table `rsqrt_table[32]` is indexed by the most significant bit (MSB) position of x. - Entry `rsqrt_table[exp]` approximates $2^{16} / \sqrt{2^{exp}}$, giving a coarse initial guess for all x in $[2^{exp}, 2^{exp+1})$. 3️⃣ **Linear interpolation between table entries** - For non–power-of-two inputs, the implementation linearly interpolates between `rsqrt_table[exp]` and `rsqrt_table[exp+1]`. - The $\text{fraction} = \frac{x - 2^{exp}}{2^{exp}}$ measures how far x is between the two endpoints, it can make the initial estimate closer to the true $\frac{1}{\sqrt{x}}$, and is used to refine the initial estimate before Newton iterations. 4️⃣ **Newton–Raphson iteration in fixed-point** - Starting from an initial guess $y_n$, Newton’s method for $$ z = \frac{1}{\sqrt{x}}\quad\Rightarrow\quad z_{n+1} = z_n\Bigl(\frac{3} {2} - \frac{x z_n^2}{2}\Bigr) $$ is rewritten under 2¹⁶ scaling as an integer update: $$ y_{n+1} = y_n \cdot \frac{3 - x \cdot y_n^2 / 2^{16}}{2}. $$ - In the code, the Newton update is implemented using integer multiplication and bit shifts (`>>16`, `>>17`). ``` y2 = (y * y) /* Compute y^2 in 2^32 scaling */ xy2 = (x * y2) >> 16 /* Compute x * y^2 / 2^16 */ y = (y * ((3<<16) - xy2)) >> 17 /* Complete Newton step */ ``` The `>>16` compensates for the 2¹⁶ scaling, and the extra `>>1` corresponds to the `/2` in the formula, which together give `>>17`. With two iterations, this reduces the error from about 20% down to 3–8%. ### Fast Reciprocal Square Root Algorithm ```c= #include <stdint.h> #include "rsqrt_org.h" // Lookup table for initial rsqrt estimates; entry i ≈ 2^16 / sqrt(2^i). static const uint32_t rsqrt_table[32] = { 65536, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */ 11585, 8192, 5793, 4096, 2896, /* 2^5 to 2^9 */ 2048, 1448, 1024, 724, 512, /* 2^10 to 2^14 */ 362, 256, 181, 128, 90, /* 2^15 to 2^19 */ 64, 45, 32, 23, 16, /* 2^20 to 2^24 */ 11, 8, 6, 4, 3, /* 2^25 to 2^29 */ 2, 1 /* 2^30, 2^31 */ }; // Count leading zeros of a 32-bit unsigned integer (software CLZ). static int clz(uint32_t x) { if (x == 0) return 32; int n = 0; if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; } if ((x & 0xFF000000u) == 0) { n += 8; x <<= 8; } if ((x & 0xF0000000u) == 0) { n += 4; x <<= 4; } if ((x & 0xC0000000u) == 0) { n += 2; x <<= 2; } if ((x & 0x80000000u) == 0) { n += 1; } return n; } // 32-bit × 32-bit -> 64-bit multiplication using shift-and-add (no hardware MUL). static uint64_t mul32(uint32_t a, uint32_t b) { uint64_t r = 0; uint64_t a64 = (uint64_t)a; // Use 64-bit for shifting to avoid overflow. while (b) { if (b & 1u) r += a64; a64 <<= 1; b >>= 1; } return r; } // fast_rsqrt(x): approximate 2^16 / sqrt(x) for 32-bit unsigned x using LUT + interpolation + 2 Newton iterations. uint32_t fast_rsqrt(uint32_t x) { // Special cases. if (x == 0) return 0xFFFFFFFFu; // Treat rsqrt(0) as +Inf. if (x == 1) return 65536u; // 2^16 / sqrt(1) = 2^16. // Step 1: find MSB position exp = floor(log2(x)). int exp = 31 - clz(x); uint32_t base = 1u << exp; // Initial estimate from LUT: y ≈ 2^16 / sqrt(2^exp). uint32_t y = rsqrt_table[exp]; // Step 2: linear interpolation between rsqrt_table[exp] and rsqrt_table[exp + 1]. if (x > base) { uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0u; uint32_t delta = y - y_next; uint32_t t = x - base; // 0 <= t < 2^exp. // Compute frac = ((x - 2^exp) / 2^exp) * 2^16 in Q0.16 without 64-bit shifts. uint32_t frac; if (exp >= 16) frac = t >> (exp - 16); else frac = t << (16 - exp); // frac is Q0.16; interp = delta * frac / 2^16. uint32_t interp = (delta * frac) >> 16; y -= interp; } // Constant 3 in Q0.16 format. const uint32_t THREE_Q16 = 3u << 16; // Step 3: two Newton–Raphson iterations in fixed-point. // Real step: z_{n+1} = z_n * (3/2 - x*z_n^2/2); here y = 2^16 * z. for (int iter = 0; iter < 2; iter++) { uint32_t y2 = y * y; // y^2 fits in 32 bits for our range. uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); // xy2 ≈ 2^16 * x * z^2. uint32_t factor = THREE_Q16 - xy2; // factor ≈ 2^16 * (3 - x*z^2). y = (uint32_t)(mul32(y, factor) >> 17); // y = y * factor / 2 (>>16 for scaling, +1 bit for /2). } return y; } ``` Why is `mul32` designed this way? - On RV32I there is no hardware multiply, so 32×32 multiplication must be implemented with shift and add. - In binary, if $$ b = \sum_{i=0}^{31} b_i 2^i,\quad b_i \in \{0,1\} $$ then $$ a \cdot b = \sum_{i=0}^{31} b_i \cdot (a \cdot 2^i) $$ The loop `while (b)` with `b >>= 1` processes one bit of b per iteration, from LSB to MSB. - `if (b & 1u) r += a64;` adds the current term `a * 2^i` to the result whenever bit $b_i$ is 1. - a64 is a 64-bit copy of a, so `a64 <<= 1` implements `a * 2^i` without overflowing 32 bits. - The final r is the full 64-bit product, which is then used in fixed-point formulas. ### Final Result Enter this command ``` cd ~/riscv-none-elf-gcc/rv32emu/tests/system/quiz3problemc make clean make run ``` ![螢幕擷取畫面 2025-12-17 172433](https://hackmd.io/_uploads/ByxxwlxXbx.png)