# Assignment2: Complete Applications
contributed by < [kuohanwen](https://github.com/kuohanwen/2025_computer_architecture_hw2) >
[Homework2 Requirment](https://hackmd.io/@sysprog/2025-arch-homework2)
## Installation of rv32emu
Refer to [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel)
OS environment : WSL (Ubuntu 24.04.3 LTS on Windows 11)
RISC-V toolchain : xPack GNU RISC-V Embedded GCC x86_64
All Lab 2 tests have been successfully completed
## Pick one complete program (with test suite) done in your Homework 1 and make it run in a bare-metal environment using rv32emu’s system emulation.
Refer to [Homework1 ProblemB](https://hackmd.io/@Hanwen0218/arch2025-homework1)
### Revision of File
1. <span style="background-color:#cce5ff; color:#004085; font-weight:bold; padding:2px 6px; border-radius:4px;">Makefile</span>
Add `quiz1ProblemBuf8.o` in OBJS so that this object file is linked into the final program, allowing the Quiz1 Problem B assembly code to be used.
2. <span style="background-color:#cce5ff; color:#004085; font-weight:bold; padding:2px 6px; border-radius:4px;">main.c</span>
Add `extern int test(void);`
This allows main() to call test(), where test() serves
as the entry point for assembly code.
Then add this code in the final part of main.c, the return value indicates whether Quiz1 Problem B passes, and we can also see how many cycles the handwritten assembly code takes and how many instructions it executes.
```c=
TEST_LOGGER("\n=== Quiz1 Problem B Tests ===\n");
// Record starting cycle and instruction counters
start_cycles = get_cycles();
start_instret = get_instret();
// Call the assembly implementation of test() for Quiz1 Problem B
{
int pb_pass = test();
// Record ending cycle and instruction counters
end_cycles = get_cycles();
end_instret = get_instret();
// Compute elapsed cycles and retired instructions
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
// Print pass/fail result based on test() return value
if (pb_pass) {
TEST_LOGGER("Quiz1 Problem B: PASSED\n");
} else {
TEST_LOGGER("Quiz1 Problem B: FAILED\n");
}
// Print performance statistics for Quiz1 Problem B
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
TEST_LOGGER("\n");
}
/* ===== End of Quiz1 Problem B tests ===== */
TEST_LOGGER("\n=== All Tests Completed ===\n");
return 0;
}
```
3. <span style="background-color:#cce5ff; color:#004085; font-weight:bold; padding:2px 6px; border-radius:4px;">Assembly Code</span>
main → q1_main
.globl main → .globl q1_main
There is already a C main() function in main.c in the project. If the assembly file also defines main while the C code defines main as well, the linker will report a multiple definition of main error, or it will not know which one to use as the entry point. Therefore, we rename the main in the assembly file to q1_main, so that the only true program entry point is the C main(), avoiding any conflicts.
A line `.globl test` was added before test.
This tells the assembler/linker that the symbol test should be exported as a global symbol so that other object files can see it and call it, the Quiz1 Problem B test in main.c can be correctly linked to this assembly implementation.
### Disassemble Code
Enter following command
```
cd ~/riscv-none-elf-gcc/rv32emu/tests/system/quiz1problemb
riscv-none-elf-objdump -d ./test.elf
```
Disassemble elf(binary → assembly):
```
Disassembly of section .text:
00010000 <_start>:
10000: 00008117 auipc sp,0x8
10004: be010113 addi sp,sp,-1056 # 17be0 <__stack_top>
10008: 00007297 auipc t0,0x7
1000c: bcc28293 addi t0,t0,-1076 # 16bd4 <__bss_end>
10010: 00007317 auipc t1,0x7
10014: bc430313 addi t1,t1,-1084 # 16bd4 <__bss_end>
10018: 0062d863 bge t0,t1,10028 <_start+0x28>
1001c: 0002a023 sw zero,0(t0)
10020: 00428293 addi t0,t0,4
10024: ff5ff06f j 10018 <_start+0x18>
10028: 609010ef jal 11e30 <main>
1002c: 05d00893 li a7,93
10030: 00000513 li a0,0
10034: 00000073 ecall
10038: 0000006f j 10038 <_start+0x38>
0001003c <memcpy>:
1003c: fd010113 addi sp,sp,-48
10040: 02112623 sw ra,44(sp)
10044: 02812423 sw s0,40(sp)
10048: 03010413 addi s0,sp,48
1004c: fca42e23 sw a0,-36(s0)
10050: fcb42c23 sw a1,-40(s0)
10054: fcc42a23 sw a2,-44(s0)
10058: fdc42783 lw a5,-36(s0)
1005c: fef42623 sw a5,-20(s0)
10060: fd842783 lw a5,-40(s0)
10064: fef42423 sw a5,-24(s0)
10068: 0240006f j 1008c <memcpy+0x50>
1006c: fe842703 lw a4,-24(s0)
10070: 00170793 addi a5,a4,1
10074: fef42423 sw a5,-24(s0)
10078: fec42783 lw a5,-20(s0)
1007c: 00178693 addi a3,a5,1
10080: fed42623 sw a3,-20(s0)
10084: 00074703 lbu a4,0(a4)
10088: 00e78023 sb a4,0(a5)
1008c: fd442783 lw a5,-44(s0)
10090: fff78713 addi a4,a5,-1
10094: fce42a23 sw a4,-44(s0)
10098: fc079ae3 bnez a5,1006c <memcpy+0x30>
1009c: fdc42783 lw a5,-36(s0)
100a0: 00078513 mv a0,a5
100a4: 02c12083 lw ra,44(sp)
100a8: 02812403 lw s0,40(sp)
100ac: 03010113 addi sp,sp,48
100b0: 00008067 ret
000100b4 <udiv>:
100b4: fd010113 addi sp,sp,-48
100b8: 02112623 sw ra,44(sp)
100bc: 02812423 sw s0,40(sp)
100c0: 03010413 addi s0,sp,48
...
```
### Final Result
By using this command
```
make clean
make run
```

By using Ripes

Cycles in **rv32emu's system** is 24231
Cycles in **Ripes** is 35402
## Adapt Problem A from Quiz 2, and make it run in a bare-metal environment using rv32emu’s system emulation.
refer to [Problem A of Quiz 2](https://hackmd.io/@sysprog/arch2025-quiz2-sol)
### Overview
The goal of this problem is to implement the **Tower of Hanoi** algorithm for three disks using RISC-V assembly in a bare-metal environment.
Instead of using recursion, we must adopt an iterative approach based on 3-bit Gray code to generate the sequence of moves.
The assembly function `test` should:
1️⃣ Maintain the current peg index (0 = A, 1 = B, 2 = C) for each disk in memory.
2️⃣ For each step n = 1..7, compute gray(n) and gray(n-1), use their XOR to determine which disk moves.
3️⃣ Update the disk’s peg according to the Tower of Hanoi movement rules:
- The smallest disk (disk 0) always moves in a fixed cyclic direction (mod 3).
- Larger disks move to the remaining peg based on the sum of all peg indices.
4️⃣ For every move, print a line in the format
`Move Disk <k> from <X> to <Y>`
using the system call interface provided by rv32emu (Linux-style write syscall).
5️⃣ After finishing all moves, restore the saved registers, return to the C test harness, and return 1 in a0 to indicate that the sequence is correct.
The C program will call `test`, capture the printed output, and measure the number of cycles and instructions for this assembly implementation.
:::spoiler Original Assembly Code
```c=
.text
.globl main
main:
addi x2, x2, -32
sw x8, 0(x2)
sw x9, 4(x2)
sw x18, 8(x2)
sw x19, 12(x2)
sw x20, 16(x2)
li x5, 0x15
sw x5, 20(x2)
sw x5, 24(x2)
sw x5, 28(x2)
# Fix disk positions (BLANK 1-3: neutralize x5 effect)
# BLANK 1: Fix position at x2+20
sw x0, 20(x2)
# BLANK 2: Fix position at x2+24
sw x0, 24(x2)
# BLANK 3: Fix position at x2+28
sw x0, 28(x2)
addi x8, x0, 1
game_loop:
# BLANK 4: Check loop termination (2^3 moves)
addi x5, x0, 8
beq x8, x5, finish_game
# Gray code formula: gray(n) = n XOR (n >> k)
# BLANK 5: What is k for Gray code?
srli x5, x8, 1
# BLANK 6: Complete Gray(n) calculation
xor x6, x8, x5
# BLANK 7-8: Calculate previous value and its shift
addi x7, x8, -1
srli x28, x7, 1
# BLANK 9: Generate Gray(n-1)
xor x7, x7, x28
# BLANK 10: Which bits changed?
xor x5, x6, x7
# Initialize disk number
addi x9, x0, 0
# BLANK 11: Mask for testing LSB
andi x6, x5, 1
# BLANK 12: Branch if disk 0 moves
bne x6, x0, disk_found
# BLANK 13: Set disk 1
addi x9, x0, 1
# BLANK 14: Test second bit with proper mask
andi x6, x5, 2
bne x6, x0, disk_found
# BLANK 15: Last disk number
addi x9, x0, 2
disk_found:
# BLANK 16: Check impossible pattern (multiple bits)
andi x30, x5, 5
addi x31, x0, 5
beq x30, x31, pattern_match
jal x0, continue_move
pattern_match:
continue_move:
# BLANK 17: Word-align disk index (multiply by what?)
slli x5, x9, 2
# BLANK 18: Base offset for disk array
addi x5, x5, 20
add x5, x2, x5
lw x18, 0(x5)
bne x9, x0, handle_large
# BLANK 19: Small disk moves by how many positions?
addi x19, x18, 2
# BLANK 20: Number of pegs
addi x6, x0, 3
blt x19, x6, display_move
sub x19, x19, x6
jal x0, display_move
handle_large:
# BLANK 21: Load reference disk position
lw x6, 20(x2)
# BLANK 22: Sum of all peg indices (0+1+2)
addi x19, x0, 3
sub x19, x19, x18
sub x19, x19, x6
display_move:
la x20, obdata
add x5, x20, x18
# BLANK 23: Load byte from obfuscated data
lbu x11, 0(x5)
# BLANK 24: Decode constant (0x6F)
li x6, 0x6F
xor x11, x11, x6
# BLANK 25: Final offset adjustment
addi x11, x11, -0x12
add x7, x20, x19
lbu x12, 0(x7)
xor x12, x12, x6
addi x12, x12, -0x12
la x10, str1
addi x17, x0, 4
ecall
addi x10, x9, 1
addi x17, x0, 1
ecall
la x10, str2
addi x17, x0, 4
ecall
addi x10, x11, 0
addi x17, x0, 11
ecall
la x10, str3
addi x17, x0, 4
ecall
addi x10, x12, 0
addi x17, x0, 11
ecall
addi x10, x0, 10
addi x17, x0, 11
ecall
# BLANK 26: Calculate storage offset
slli x5, x9, 2
addi x5, x5, 20
add x5, x2, x5
# BLANK 27: Update disk position
sw x19, 0(x5)
# BLANK 28-29: Increment counter and loop
addi x8, x8, 1
jal x0, game_loop
finish_game:
lw x8, 0(x2)
lw x9, 4(x2)
lw x18, 8(x2)
lw x19, 12(x2)
lw x20, 16(x2)
addi x2, x2, 32
addi x17, x0, 10
ecall
.data
obdata: .byte 0x3c, 0x3b, 0x3a
str1: .asciz "Move Disk "
str2: .asciz " from "
str3: .asciz " to "
```
:::
### Modified Assembly Code
```c=
.text
.globl test
test:
# Create stack frame, reserve 32 bytes
addi x2, x2, -32 # sp -= 32
# Save callee-saved registers
sw x8, 0(x2)
sw x9, 4(x2)
sw x18, 8(x2)
sw x19, 12(x2)
sw x20, 16(x2)
# Temporarily set all three disk positions to 0x15
li x5, 0x15
sw x5, 20(x2) # disk0 position
sw x5, 24(x2) # disk1 position
sw x5, 28(x2) # disk2 position
# BLANK 1-3: neutralize the effect → initialize them to 0
sw x0, 20(x2) # disk0 starts on peg 0 (A)
sw x0, 24(x2) # disk1 starts on peg 0 (A)
sw x0, 28(x2) # disk2 starts on peg 0 (A)
# x8 = move counter, start from 1
addi x8, x0, 1
game_loop:
# BLANK 4: loop termination for 3 disks → 2^3 = 8 states
addi x5, x0, 8
beq x8, x5, finish_game # stop when n == 8
# Gray code formula: gray(n) = n XOR (n >> 1)
# BLANK 5: k = 1
srli x5, x8, 1 # x5 = n >> 1
# BLANK 6: compute gray(n)
xor x6, x8, x5 # x6 = gray(n)
# BLANK 7-8: compute n-1 and (n-1)>>1
addi x7, x8, -1 # x7 = n-1
srli x28, x7, 1 # x28 = (n-1)>>1
# BLANK 9: gray(n-1)
xor x7, x7, x28 # x7 = gray(n-1)
# BLANK 10: bits that changed between steps
xor x5, x6, x7 # x5 = changed bits
# Decide which disk moves
addi x9, x0, 0 # assume disk 0 first
# BLANK 11: test LSB to see if disk 0 moves
andi x6, x5, 1 # x6 = x5 & 1
# BLANK 12: if LSB != 0 → disk 0 moves
bne x6, x0, disk_found
# BLANK 13: otherwise consider disk 1
addi x9, x0, 1
# BLANK 14: test second bit (mask = 2)
andi x6, x5, 2
bne x6, x0, disk_found # if bit1 set → disk 1
# BLANK 15: otherwise it must be disk 2
addi x9, x0, 2
disk_found:
# BLANK 16: check impossible pattern (multiple bits): 0b101
andi x30, x5, 5 # keep bits 0 and 2
addi x31, x0, 5 # 0b101
beq x30, x31, pattern_match
jal x0, continue_move # skip special handling
pattern_match:
continue_move:
# BLANK 17: word-align disk index → multiply by 4
slli x5, x9, 2 # x5 = disk * 4
# BLANK 18: base offset of disk array (frame offset 20)
addi x5, x5, 20 # x5 = 4*disk + 20
add x5, x2, x5 # &disk_pos[disk]
lw x18, 0(x5) # x18 = current peg of this disk
# If disk != 0 (not the smallest disk), use large-disk rule
bne x9, x0, handle_large
# BLANK 19: smallest disk always moves by +2 (mod 3)
addi x19, x18, 2 # tentative new peg = old + 2
# BLANK 20: number of pegs = 3
addi x6, x0, 3 # x6 = 3
blt x19, x6, display_move # if <3, no wrap
sub x19, x19, x6 # wrap around: new -= 3
jal x0, display_move
handle_large:
# BLANK 21: load smallest disk position as reference
lw x6, 20(x2) # x6 = disk0 position
# BLANK 22: sum of peg indices is 0+1+2 = 3
# large disk must move to the remaining peg:
# new = 3 - self - smallest
addi x19, x0, 3 # x19 = 3
sub x19, x19, x18 # x19 = 3 - self
sub x19, x19, x6 # x19 = 3 - self - min_disk
display_move:
# Decode from/to peg letters (A/B/C) from obdata table
la x20, obdata
add x5, x20, x18 # index for "from"
lbu x11, 0(x5)
li x6, 0x6F
xor x11, x11, x6
addi x11, x11, -0x12 # x11 = from char ('A'..'C')
add x7, x20, x19 # index for "to"
lbu x12, 0(x7)
xor x12, x12, x6
addi x12, x12, -0x12 # x12 = to char ('A'..'C')
# Save chars into temporaries so a0/a1/a2 can be used by syscalls
mv x29, x11 # from char
mv x30, x12 # to char
# print "Move Disk "
li a0, 1 # fd = 1 (stdout)
la a1, str1 # buffer = "Move Disk "
li a2, 10 # length = 10
li a7, 64 # write syscall
ecall
# print disk number (1/2/3)
addi x5, x9, 1 # disk index: 1..3
addi x5, x5, 48 # convert to '1' / '2' / '3'
la x6, charbuf
sb x5, 0(x6) # charbuf[0] = digit
li a0, 1
mv a1, x6 # buffer = charbuf
li a2, 1 # length = 1
li a7, 64
ecall
# print " from "
li a0, 1
la a1, str2 # " from "
li a2, 6 # length = 6
li a7, 64
ecall
# print source peg
la x6, charbuf
sb x29, 0(x6) # charbuf[0] = from char
li a0, 1
mv a1, x6
li a2, 1
li a7, 64
ecall
# print " to "
li a0, 1
la a1, str3 # " to "
li a2, 4 # length = 4
li a7, 64
ecall
# print destination peg
la x6, charbuf
sb x30, 0(x6) # charbuf[0] = to char
li a0, 1
mv a1, x6
li a2, 1
li a7, 64
ecall
# print newline
li x5, 10 # '\n'
la x6, charbuf
sb x5, 0(x6)
li a0, 1
mv a1, x6
li a2, 1
li a7, 64
ecall
# update disk position and go to next move
# BLANK 26: compute storage offset for this disk
slli x5, x9, 2 # x5 = 4 * disk
addi x5, x5, 20 # offset within frame
add x5, x2, x5 # &disk_pos[disk]
# BLANK 27: write back new peg position
sw x19, 0(x5)
# BLANK 28-29: increment step counter and loop
addi x8, x8, 1 # n++
jal x0, game_loop
finish_game:
# Restore saved registers and stack pointer
lw x8, 0(x2)
lw x9, 4(x2)
lw x18, 8(x2)
lw x19, 12(x2)
lw x20, 16(x2)
addi x2, x2, 32
# Return 1 in a0 to C code (indicates PASS)
addi x10, x0, 1
ret
.data
obdata: .byte 0x3c, 0x3b, 0x3a # XOR-obfuscated peg letters
str1: .asciz "Move Disk " # prefix
str2: .asciz " from " # middle string
str3: .asciz " to " # middle string
charbuf: .byte 0 # scratch buffer for a single char
```
### Result
Enter this command
```
cd ~/riscv-none-elf-gcc/rv32emu/tests/system/quiz2problema
make clean
make run
```

### Optimize Method with -Os and -Ofast
**Optimize for Size**
`-Os` optimizes for code size. It is roughly `-O2` but disables optimizations that bloat the binary, producing smaller code while mostly preserving standard-compliant behavior.
**Aggressive Speed Optimization**
`-Ofast` aims for maximum speed, similar to `-O3` plus more aggressive transforms (e.g., fast-math). It can be much faster, but may break strict IEEE FP and language standards, and usually generates larger binaries.
<span style="background-color:#d4edda; color:#155724; font-weight:bold; padding:2px 6px; border-radius:4px;">Revision of Makefile</span>
```c=
# Version of -O0
CFLAGS = -g -march=rv32i_zicsr
# Version of -Os
CFLAGS = -g -march=rv32i_zicsr -Os
# Version of -Ofast
CFLAGS = -g -march=rv32i_zicsr -Ofast
```
<span style="background-color:#d4edda; color:#155724; font-weight:bold; padding:2px 6px; border-radius:4px;">Revision of main.c</span>
1. Rewrote the printing helper for rv32emu’s `write` syscall
- Redefined `printstr` so the inline assembly explicitly uses `li a7, 64` (SYS_write), `li a0, 1` (stdout), `mv a1, %0` (buffer), `mv a2, %1` (length), then ecall.
- Wrapped it with `TEST_OUTPUT` / `TEST_LOGGER` macros to print constant messages easily.
2. Added a decimal-print helper `print_dec32`
- Introduced `static void print_dec32(uint32_t val)`, which converts a 32-bit integer to decimal using `/` and `%` of C Program, reverses the digits in a local buffer, appends `\n`, then calls `printstr`.
- This function is used to print the numeric values of cycles and instructions.
3. Call the custom assembly `test()`
- Added `extern int test(void)`; to match .globl test in quiz2problema.s.
- `main()` calls `test()` once and uses the return value to print Result: PASSED or FAILED.
4. Measure and print cycles / instructions in main()
- Around the call to `test()`, `main()` now reads `get_cycles()` and `get_instret()` before and after, then computes:
```
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;.
```
- Finally, `main()` prints in the following order:
```
Result: PASSED / FAILED
Cycles: <print_dec32(cycles_elapsed)>
Instructions: <print_dec32(instret_elapsed)>
=== All Tests Completed ===
```
Enter this command
```
riscv-none-elf-objdump -d test.elf > disasm.s
wc -l disasm.s
```
can see the number of lines for different version.
|Version|number of lines|
|--|--|
|with `-O0`|462|
|with `-Os`|316|
|with `-Ofast`|438|
- With `-O0` (no optimization), the compiler almost directly translates C into machine code. Many redundant loads/stores and unoptimized branches remain, so the disassembly is the largest: 462 lines.
- `-Os` optimizes for code size. It removes dead code, merges computations and avoids inlining/unrolling that would bloat the binary, so it produces the smallest code: 316 lines, about 32% fewer than `-O0` and almost 28% fewer than `-Ofast`.
- `-Ofast` aggressively optimizes for speed. It adds inlining and loop unrolling and reorders computations; this can make some parts shorter but expands others. As a result, the code is slightly smaller than `-O0` (438 lines), but much larger than `-Os`.
- Conclusion: `-Os` gives the smallest assembly, `-Ofast` is in the middle, and `-O0` is the largest, showing the trade-off between size-oriented and speed-oriented optimizations.
### Final Result
**Test with -O0**

**Test with -Os**

**Test with -Ofast**

These experiments show that both `-Os` and `-Ofast` produce slightly fewer cycles and instructions than `-O0`. This means the compiler optimizations removed redundant work and instructions, allowing the same algorithm to finish with a smaller instruction stream. Overall, as long as optimization is enabled with `-Os` or `-Ofast`, this Tower of Hanoi implementation runs more efficiently than the unoptimized `-O0` version.
## Adapt Problem C from Quiz 3, and make it run in a bare-metal environment using rv32emu’s system emulation.
refer to [Problem C of Quiz3](https://hackmd.io/@sysprog/arch2025-quiz3-sol)
### Overview
To implement a **fast reciprocal square root** routine `fast_rsqrt(x)` on RV32I without a hardware multiplier or FPU. The goal is to approximate $\frac{1}{\sqrt{x}}$ using only integer arithmetic, with about 3–8% error but very low latency.
1️⃣ **Fixed-point representation (2¹⁶ scaling)**
- Instead of storing the real value $\frac{1}{\sqrt{x}}$, the code stores $y \approx \frac{2^{16}}{\sqrt{x}}$ as a 32-bit integer.
- Values are interpreted as `y / 2^16`, which is similar to a Q0.16 fixed-point format.
2️⃣ **Lookup table (LUT) based on MSB of x**
- A 32-entry table `rsqrt_table[32]` is indexed by the most significant bit (MSB) position of x.
- Entry `rsqrt_table[exp]` approximates $2^{16} / \sqrt{2^{exp}}$, giving a coarse initial guess for all x in $[2^{exp}, 2^{exp+1})$.
3️⃣ **Linear interpolation between table entries**
- For non–power-of-two inputs, the implementation linearly interpolates between
`rsqrt_table[exp]` and `rsqrt_table[exp+1]`.
- The $\text{fraction} = \frac{x - 2^{exp}}{2^{exp}}$ measures how far x is between the two endpoints, it can make the initial estimate closer to the true $\frac{1}{\sqrt{x}}$, and is used to refine the initial estimate before Newton iterations.
4️⃣ **Newton–Raphson iteration in fixed-point**
- Starting from an initial guess $y_n$, Newton’s method for
$$
z = \frac{1}{\sqrt{x}}\quad\Rightarrow\quad z_{n+1} = z_n\Bigl(\frac{3} {2} - \frac{x z_n^2}{2}\Bigr)
$$ is rewritten under 2¹⁶ scaling as an integer update:
$$
y_{n+1} = y_n \cdot \frac{3 - x \cdot y_n^2 / 2^{16}}{2}.
$$
- In the code, the Newton update is implemented using integer multiplication and bit shifts (`>>16`, `>>17`).
```
y2 = (y * y) /* Compute y^2 in 2^32 scaling */
xy2 = (x * y2) >> 16 /* Compute x * y^2 / 2^16 */
y = (y * ((3<<16) - xy2)) >> 17 /* Complete Newton step */
```
The `>>16` compensates for the 2¹⁶ scaling, and the extra `>>1` corresponds to the `/2` in the formula, which together give `>>17`. With two iterations, this reduces the error from about 20% down to 3–8%.
### Fast Reciprocal Square Root Algorithm
```c=
#include <stdint.h>
#include "rsqrt_org.h"
// Lookup table for initial rsqrt estimates; entry i ≈ 2^16 / sqrt(2^i).
static const uint32_t rsqrt_table[32] = {
65536, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */
11585, 8192, 5793, 4096, 2896, /* 2^5 to 2^9 */
2048, 1448, 1024, 724, 512, /* 2^10 to 2^14 */
362, 256, 181, 128, 90, /* 2^15 to 2^19 */
64, 45, 32, 23, 16, /* 2^20 to 2^24 */
11, 8, 6, 4, 3, /* 2^25 to 2^29 */
2, 1 /* 2^30, 2^31 */
};
// Count leading zeros of a 32-bit unsigned integer (software CLZ).
static int clz(uint32_t x)
{
if (x == 0)
return 32;
int n = 0;
if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
if ((x & 0xFF000000u) == 0) { n += 8; x <<= 8; }
if ((x & 0xF0000000u) == 0) { n += 4; x <<= 4; }
if ((x & 0xC0000000u) == 0) { n += 2; x <<= 2; }
if ((x & 0x80000000u) == 0) { n += 1; }
return n;
}
// 32-bit × 32-bit -> 64-bit multiplication using shift-and-add (no hardware MUL).
static uint64_t mul32(uint32_t a, uint32_t b)
{
uint64_t r = 0;
uint64_t a64 = (uint64_t)a; // Use 64-bit for shifting to avoid overflow.
while (b) {
if (b & 1u)
r += a64;
a64 <<= 1;
b >>= 1;
}
return r;
}
// fast_rsqrt(x): approximate 2^16 / sqrt(x) for 32-bit unsigned x using LUT + interpolation + 2 Newton iterations.
uint32_t fast_rsqrt(uint32_t x)
{
// Special cases.
if (x == 0)
return 0xFFFFFFFFu; // Treat rsqrt(0) as +Inf.
if (x == 1)
return 65536u; // 2^16 / sqrt(1) = 2^16.
// Step 1: find MSB position exp = floor(log2(x)).
int exp = 31 - clz(x);
uint32_t base = 1u << exp;
// Initial estimate from LUT: y ≈ 2^16 / sqrt(2^exp).
uint32_t y = rsqrt_table[exp];
// Step 2: linear interpolation between rsqrt_table[exp] and rsqrt_table[exp + 1].
if (x > base) {
uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0u;
uint32_t delta = y - y_next;
uint32_t t = x - base; // 0 <= t < 2^exp.
// Compute frac = ((x - 2^exp) / 2^exp) * 2^16 in Q0.16 without 64-bit shifts.
uint32_t frac;
if (exp >= 16)
frac = t >> (exp - 16);
else
frac = t << (16 - exp);
// frac is Q0.16; interp = delta * frac / 2^16.
uint32_t interp = (delta * frac) >> 16;
y -= interp;
}
// Constant 3 in Q0.16 format.
const uint32_t THREE_Q16 = 3u << 16;
// Step 3: two Newton–Raphson iterations in fixed-point.
// Real step: z_{n+1} = z_n * (3/2 - x*z_n^2/2); here y = 2^16 * z.
for (int iter = 0; iter < 2; iter++) {
uint32_t y2 = y * y; // y^2 fits in 32 bits for our range.
uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); // xy2 ≈ 2^16 * x * z^2.
uint32_t factor = THREE_Q16 - xy2; // factor ≈ 2^16 * (3 - x*z^2).
y = (uint32_t)(mul32(y, factor) >> 17); // y = y * factor / 2 (>>16 for scaling, +1 bit for /2).
}
return y;
}
```
Why is `mul32` designed this way?
- On RV32I there is no hardware multiply, so 32×32 multiplication must be implemented with shift and add.
- In binary, if
$$
b = \sum_{i=0}^{31} b_i 2^i,\quad b_i \in \{0,1\}
$$ then
$$
a \cdot b = \sum_{i=0}^{31} b_i \cdot (a \cdot 2^i)
$$ The loop `while (b)` with `b >>= 1` processes one bit of b per iteration, from LSB to MSB.
- `if (b & 1u) r += a64;` adds the current term `a * 2^i` to the result whenever bit $b_i$ is 1.
- a64 is a 64-bit copy of a, so `a64 <<= 1` implements `a * 2^i` without overflowing 32 bits.
- The final r is the full 64-bit product, which is then used in fixed-point formulas.
### Final Result
Enter this command
```
cd ~/riscv-none-elf-gcc/rv32emu/tests/system/quiz3problemc
make clean
make run
```
