# Assignment 2: Complete Applications

## PART 1

For this part, we need to follow the instructions from [LAB2: RISC-V](https://hackmd.io/@sysprog/Sko2Ja5pel) to set up the environment and verify that **system emulation is enabled in rv32emu** and that `tests/system/playground` passes all tests successfully. The following sections, ***Setup*** and ***System Emulation***, describe the key parts involved.

#### Prerequisites
Before starting, it is necessary to install a supported operating system. In this assignment, we use **WSL** to install Linux on Windows, and the operating system used is **Ubuntu 24.04.3 LTS**.

---

### Setup
First, we follow the instructions in [LAB2](https://hackmd.io/@sysprog/Sko2Ja5pel) to complete the necessary installations and basic verification.

#### 1. Install RISC-V GNU Toolchain
Install the **xPack GNU RISC-V Embedded GCC** toolchain, then configure and load the environment, and finally check the installation with `riscv-none-elf-gcc -v`. Expected output:
```
gcc version 14.2.0 (xPack GNU RISC-V Embedded GCC x86_64)
```
In addition, it is important to note that the environment needs to **be loaded every time** a new shell is opened:
```
source $HOME/riscv-none-elf-gcc/setenv
```
The content of `$HOME/riscv-none-elf-gcc/setenv` is:
```
export PATH=/home/xyh/riscv-none-elf-gcc/bin:$PATH
```

#### 2. Install Dependencies
To support graphics and audio applications, install the **SDL2 dependencies** according to the operating system. In this assignment, SDL2 is needed only for demonstration purposes (e.g., running classic games such as Doom) and is not used in the remaining requirements.

#### 3. Build rv32emu
Use `git clone` to get the source code. Then, **build rv32emu** from source to generate the executable file. Once this step is completed, rv32emu is ready for use.

#### 4. Run Tests
**Run the built-in basic tests** to confirm that the emulator works correctly and that the environment is properly configured. Expected output:
```
Running hello.elf ... [OK]
Running puzzle.elf ... [OK]
Running pi.elf ... [OK]
```

---

### System Emulation
**rv32emu** supports **system emulation**, which allows us to run either a Linux kernel or bare-metal programs. System mode requires specific build configurations. After completing the ***Setup*** section above, we continued with the *System Emulation* part of [LAB2](https://hackmd.io/@sysprog/Sko2Ja5pel) to perform the required configuration and verification.

#### Rebuilding for System Mode
Clean the previous build and rebuild rv32emu with **system emulation enabled**:
```
make -C tests/system/alignment
make ENABLE_ELF_LOADER=1 ENABLE_EXT_C=0 ENABLE_SYSTEM=1 misalign-in-blk-emu
```
Then, check the configuration file `build/.config` for the expected output:
```
ENABLE_ELF_LOADER=1
ENABLE_SYSTEM=1
```
This confirms that both **system emulation** and the **ELF loader** are correctly enabled.

#### Playground Test Suite
Next, get the source code for **playground**, which demonstrates a comprehensive **bare-metal test suite**. After building, a `test.elf` file is generated. Running it with `make run` produces the following output, which shows that both the **ChaCha20** and **BF16** tests passed successfully and lists performance metrics such as **cycle counts**.

![require1](https://hackmd.io/_uploads/rkDhpSJxWx.png)

The results match the expected output, confirming that we successfully executed the **bare-metal tests** under **rv32emu’s system emulation**.
--- ## PART 2 For this part, we select the assembly implementation of Problem B from Quiz 1, namely [uf8.s](https://github.com/Micelearner/ca2025-quizzes/blob/main/uf8.s) , which has done in my Homework 1, and make it run in a **bare-metal environment** using **rv32emu’s system emulation**. --- ### Related Work We do not duplicate workspaces or the entire repository from rv32emu. As a starting point, copy the `tests/system/playground` directory using `cp -a` instead, then enter the corresponding directory. After running `make clean` and removing all unrelated files, we place `uf8.s` into the directory. #### About program In Homework 1, all completed assembly code runs on the **Ripes simulator**. However, according to the assignment description, there is **potential incompatibility** between **rv32emu and Ripes**, so we need to consider modifying `uf8.s` before running it. After comparing the **supported ecalls** of rv32emu and Ripes, we found that the ecalls currently used by `uf8.s` (`a7=4/1/34/10`) are **not compatible** with **rv32emu**, and therefore **cannot be used directly**. Therefore, we need to modify all the **Ripes-specific ecall** instructions in `uf8.s` , replacing them with the system calls (`write` and `exit`) **supported by rv32emu**, and implement the two helper functions (`print_hex` and `print_int`). In this way, the program can produce the same output while running correctly using rv32emu’s system emulation. The corresponding additional code and modified part of `uf8.s` are shown below: ```= .equ SYS_WRITE, 64 # similar in Ripes = 4 .equ SYS_EXIT, 93 # similar in Ripes = 10 .equ STDOUT, 1 # file descriptor 1 = stdout .data msg_pass: .ascii "All tests passed.\n" msg_pass_end: msg_err1: .ascii ": produces value " msg_err1_end: msg_err2: .ascii " but encodes back to " msg_err2_end: msg_err3: .ascii ": value " msg_err3_end: msg_err4: .ascii " <= previous_value " msg_err4_end: nextline: .ascii "\n" nextline_end: minus_char: .ascii "-" minus_char_end: intmin_str: .ascii "-2147483648" intmin_str_end: # --- for print_hex --- hex_table: .ascii "0123456789abcdef" hexbuf: .space 10 "0x" (2 bytes) + up to 8 hex digits # --- for print_int --- .align 2 dec_divisors: .word 1000000000 .word 100000000 .word 10000000 .word 1000000 .word 100000 .word 10000 .word 1000 .word 100 .word 10 .word 1 decbuf: .space 10 .text print_hex: la t0, hexbuf # t0 = &hexbuf[0] li t1, '0' sb t1, 0(t0) li t1, 'x' sb t1, 1(t0) mv t2, a0 # t2 = value beqz t2, phx_zero addi t1, t0, 2 li t3, 7 # nibble index = 7..0 li t4, 0 phx_loop: slli t5, t3, 2 srl t6, t2, t5 andi t6, t6, 0xF beqz t4, phx_check_leading j phx_output_digit phx_check_leading: beqz t6, phx_next_nib li t4, 1 phx_output_digit: la a1, hex_table add a1, a1, t6 lbu a0, 0(a1) sb a0, 0(t1) addi t1, t1, 1 phx_next_nib: addi t3, t3, -1 bgez t3, phx_loop # len = t1 - hexbuf sub a2, t1, t0 li a0, STDOUT la a1, hexbuf li a7, SYS_WRITE ecall jr ra phx_zero: # special case: 0 -> "0x0" li t1, '0' sb t1, 2(t0) # "0x0" li a0, STDOUT la a1, hexbuf li a2, 3 # len = 3 li a7, SYS_WRITE ecall jr ra print_int: mv t0, a0 # t0 = x la t1, decbuf # t1 = buffer base li t2, 0 # t2 = out_len li t3, 0 # t3 = seen_nonzero flag # special case: (-2147483648) = 0x80000000 li t4, 0x80000000 beq t0, t4, pi_intmin # x == 0 ? beqz t0, pi_zero # x < 0 ? 
bltz t0, pi_negative # x > 0:positive path j pi_positive pi_zero: li t4, '0' sb t4, 0(t1) li t2, 1 j pi_write pi_negative: # print '-' first li a0, STDOUT la a1, minus_char la a2, minus_char_end sub a2, a2, a1 li a7, SYS_WRITE ecall sub t0, x0, t0 # t0 = -t0 pi_positive: la t4, dec_divisors # t4 = &divisors[0] li t5, 10 # 10 divisor pi_outer: beqz t5, pi_write lw t6, 0(t4) # t6 = divisor mv a1, t0 # a1 = remaining li a2, 0 # a2 = digit pi_inner: blt a1, t6, pi_inner_done sub a1, a1, t6 addi a2, a2, 1 j pi_inner pi_inner_done: mv t0, a1 beqz t3, pi_check_first j pi_output_digit pi_check_first: beqz a2, pi_next_div li t3, 1 pi_output_digit: addi a2, a2, '0' # digit -> ASCII add t6, t1, t2 # decbuf[out_len] sb a2, 0(t6) addi t2, t2, 1 pi_next_div: addi t4, t4, 4 # divisor++ addi t5, t5, -1 j pi_outer pi_write: li a0, STDOUT mv a1, t1 mv a2, t2 li a7, SYS_WRITE ecall jr ra # ---- ---- pi_intmin: li a0, STDOUT la a1, intmin_str la a2, intmin_str_end sub a2, a2, a1 li a7, SYS_WRITE ecall jr ra .globl main main: jal ra, Test beqz a0, Fail #------------ li a0, STDOUT # fd = 1 la a1, msg_pass # buf = &msg_pass la t0, msg_pass_end sub a2, t0, a1 # len = end - start li a7, SYS_WRITE ecall li a0, 0 # exit(0) li a7, SYS_EXIT # 93 ecall Fail: li a0, 1 # exit(1) li a7, SYS_EXIT ecall #------------ # ------------------------------------------------------------ # Test function # ------------------------------------------------------------ Test: addi sp, sp, -4 sw ra, 0(sp) li s1, -1 li s2, 1 li s3, 0 li s7, 256 round_Trip: beq s3, s7, test_End mv s4, s3 mv a0, s4 jal ra, UF8_decode mv s5, a0 jal ra, UF8_encode mv s6, a0 beq s4, s6, check_Inc #------------ mv a0, s4 jal ra, print_hex li a0, STDOUT la a1, msg_err1 la t0, msg_err1_end sub a2, t0, a1 li a7, SYS_WRITE ecall mv a0, s5 jal ra, print_int li a0, STDOUT la a1, msg_err2 la t0, msg_err2_end sub a2, t0, a1 li a7, SYS_WRITE ecall mv a0, s6 jal ra, print_hex la a1, nextline la t0, nextline_end sub a2, t0, a1 li a7, SYS_WRITE ecall #------------ li s2, 0 check_Inc: bgt s5, s1, check_End #------------ mv a0, s4 jal ra, print_hex li a0, STDOUT la a1, msg_err3 la t0, msg_err3_end sub a2, t0, a1 li a7, SYS_WRITE ecall mv a0, s5 jal ra, print_int li a0, STDOUT la a1, msg_err4 la t0, msg_err4_end sub a2, t0, a1 li a7, SYS_WRITE ecall mv a0, s1 jal ra, print_int la a1, nextline la t0, nextline_end sub a2, t0, a1 li a7, SYS_WRITE ecall #------------ li s2, 0 check_End: mv s1, s5 addi s3, s3, 1 j round_Trip test_End: mv a0, s2 lw ra, 0(sp) addi sp, sp, 4 jr ra ``` Where: - **SYS_WRITE:** The system call number **64** used by rv32emu to print the buffer as a string to the specified file descriptor. - **SYS_EXIT:** The system call number **93** used by rv32emu to terminate the program, and the exit status code is passed in `a0`. - **STDOUT:** The **file descriptor 1**, which represents the standard output stream (stdout). All messages in this program are written to this descriptor. Besides modifying `uf8.s`, we also need to update the `Makefile` with `OBJS = start.o uf8.o`. This update ensure that `make` will assemble the `uf8.s` into `uf8.o`, link it together with the **bare-metal startup** code `start.o`, and produce an executable that can be loaded and run by **rv32emu’s system emulation**. #### About `write` system call According to [docs/syscall.md](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) , the `write` system call (with `a7` = `SYS_WRITE` in rv32emu) expects the following arguments: 1. The file descriptor `1` (stdout) in `a0`. 2. 
The starting address of the buffer (string to be printed) in `a1`.
3. The length of the string (number of bytes to write) in `a2`.

In particular, all string literals in the `.data` section are changed from `.asciz` to `.ascii`, so no automatic `'\0'` terminator is appended. Therefore, the string length can be computed simply as the difference between the address of the end label (e.g., `msg_pass_end`) and the address of the beginning label (e.g., `msg_pass`).

#### About print_hex
The `print_hex` function is designed to print a 32-bit value in hexadecimal, in a similar style to Ripes ecall 34 (lowercase hex, no leading zeros, prefixed with `"0x"`). The procedure of `print_hex` is as follows:
1. First, it initializes the `hexbuf` buffer by writing the prefix `"0x"` into the first two bytes. If the input value in `a0` is zero, it directly appends `'0'` to form the string `"0x0"` and calls `SYS_WRITE` once to print this special case.
2. For a non-zero value, it copies `a0` into `t2` and then iterates over the 8 hexadecimal nibbles from the most significant nibble to the least significant one. In each iteration, it shifts and masks `t2` to extract the current nibble (4 bits), and uses it as an index into `hex_table` to obtain the corresponding hexadecimal character.
3. A flag (`t4`) is used to skip leading zeros: before the first non-zero nibble is seen, all zero nibbles are ignored. Once the first non-zero nibble is encountered, every subsequent nibble (including zeros) is written sequentially into `hexbuf` after the `"0x"` prefix.
4. After all nibbles have been processed, the function computes the total string length and calls `SYS_WRITE` with `a0 = STDOUT`, `a1 = hexbuf`, and `a2 = length` to print the final hexadecimal string.

#### About print_int
The `print_int` function is designed to print a signed 32-bit integer in decimal, which is similar to Ripes ecall 1 (signed decimal, no newline). The procedure of `print_int` is as follows:
1. It first copies the input value from `a0` to `t0` and initializes a decimal buffer `decbuf`. Then it checks for two special cases:
    - If `t0` is equal to `0x80000000` (i.e., `INT_MIN = -2147483648`), it directly prints the pre-defined string `"-2147483648"` using `SYS_WRITE` and returns.
    - If `t0` is zero, it writes `'0'` into `decbuf`, sets the output length to 1, and jumps to the final write step.
2. For negative values (excluding `INT_MIN`), it first prints a minus sign `'-'` to `STDOUT`, then negates `t0` (`t0 = -t0`) so that the remaining logic only needs to handle the absolute value `|x|`. For positive inputs, it directly proceeds with `t0 = |x|`.
3. To convert the value into decimal digits, it uses the precomputed `dec_divisors` array containing `10^9, 10^8, ..., 10^0`. For each divisor, it repeatedly subtracts the divisor from the remaining value to obtain the current digit, and also uses a flag (`t3`) to skip leading zeros before the first non-zero digit. Once the first non-zero digit is found, every subsequent digit (including zeros) is written into `decbuf` and the length `t2` is updated.
4. After all 10 divisors have been processed, it prints the decimal representation of `|x|`. If the original value was negative, the `'-'` printed earlier ensures the overall output is a correct signed decimal number.

---

### Result
After completing all the modifications mentioned in ***Related Work***, we use `make run` to build and execute the program in **rv32emu’s system emulation** environment.
The corresponding output is shown below: ![require2](https://hackmd.io/_uploads/HJay-pxgbx.png) The results match the expected output, confirming that we have successfully executed the **UF8** assembly program, with all tests passing and the output result is consistent with the original Ripes simulation. --- ## PART 3 For this part, we need to adapt Problem A from Quiz 2 and Problem C from Quiz 3, and make them run in a **bare-metal** environment using **rv32emu’s system emulation**. --- ### About Problem A in Quiz2 In the original [Problem A](https://hackmd.io/@sysprog/arch2025-quiz2-sol#Problem-A) description in Quiz 2, it already provides a complete assembly implementation named `hanoi.s`, which can run directly on the Ripes simulator. Therefore, similar to what we did in ***PART 2***, first is to modify all the **Ripes-specific ecall** instructions in `hanoi.s` , replacing them with the `write` and `exit` system calls supported by rv32emu. This allows the same program logic and output format to be preserved, while making the code runnable under **rv32emu’s system emulation**. Next, we reuse the helper function `print_int` from above *PART 2* to print the disk number in decimal, and we introduce a small one-byte buffer `chbuf` for printing single characters (the source and destination pegs). Finally, we need to update `Makefile` with `OBJS = start.o hanoi.o`, so that `make` assembles `hanoi.s` into `hanoi.o`, links it together with `start.o`, and produces **bare-metal** executable that can be loaded by rv32emu. The corresponding modified `hanoi.s` is shown as follows: ```= .equ SYS_WRITE, 64 .equ SYS_EXIT, 93 .equ STDOUT, 1 .text .globl main main: addi x2, x2, -32 sw x8, 0(x2) sw x9, 4(x2) sw x18, 8(x2) sw x19, 12(x2) sw x20, 16(x2) li x5, 0x15 sw x5, 20(x2) sw x5, 24(x2) sw x5, 28(x2) # Fix disk positions (BLANK 1-3: neutralize x5 effect) # BLANK 1: Fix position at x2+20 sw x0, 20(x2) # BLANK 2: Fix position at x2+24 sw x0, 24(x2) # BLANK 3: Fix position at x2+28 sw x0, 28(x2) addi x8, x0, 1 game_loop: # BLANK 4: Check loop termination (2^3 moves) addi x5, x0, 8 beq x8, x5, finish_game # Gray code formula: gray(n) = n XOR (n >> k) # BLANK 5: What is k for Gray code? srli x5, x8, 1 # BLANK 6: Complete Gray(n) calculation xor x6, x8, x5 # BLANK 7-8: Calculate previous value and its shift addi x7, x8, -1 srli x28, x7, 1 # BLANK 9: Generate Gray(n-1) xor x7, x7, x28 # BLANK 10: Which bits changed? xor x5, x6, x7 # Initialize disk number addi x9, x0, 0 # BLANK 11: Mask for testing LSB andi x6, x5, 1 # BLANK 12: Branch if disk 0 moves bne x6, x0, disk_found # BLANK 13: Set disk 1 addi x9, x0, 1 # BLANK 14: Test second bit with proper mask andi x6, x5, 2 bne x6, x0, disk_found # BLANK 15: Last disk number addi x9, x0, 2 disk_found: # BLANK 16: Check impossible pattern (multiple bits) andi x30, x5, 5 addi x31, x0, 5 beq x30, x31, pattern_match jal x0, continue_move pattern_match: continue_move: # BLANK 17: Word-align disk index (multiply by what?) slli x5, x9, 2 # BLANK 18: Base offset for disk array addi x5, x5, 20 add x5, x2, x5 lw x18, 0(x5) bne x9, x0, handle_large # BLANK 19: Small disk moves by how many positions? 
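    # (dest = (src + 2) mod 3: the smallest disk always advances two pegs)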
addi x19, x18, 2 # BLANK 20: Number of pegs addi x6, x0, 3 blt x19, x6, display_move sub x19, x19, x6 jal x0, display_move handle_large: # BLANK 21: Load reference disk position lw x6, 20(x2) # BLANK 22: Sum of all peg indices (0+1+2) addi x19, x0, 3 sub x19, x19, x18 sub x19, x19, x6 # ------ display_move: la x20, obdata add x5, x20, x18 # BLANK 23: Load byte from obfuscated data lbu x11, 0(x5) # BLANK 24: Decode constant (0x6F) li x6, 0x6F xor x11, x11, x6 # BLANK 25: Final offset adjustment addi x11, x11, -0x12 # x11 = source peg char ('A', 'B', or 'C') add x7, x20, x19 lbu x12, 0(x7) xor x12, x12, x6 addi x12, x12, -0x12 # x12 = dest peg char # Small Gotcha: store peg characters into registers that will not be overwritten by print_int addi x21, x11, 0 # x21(s5) = source peg char addi x22, x12, 0 # x22(s6) = dest peg char # print "Move Disk " li a0, STDOUT la a1, str1 la a2, str1_end sub a2, a2, a1 # len = str1_end - str1 li a7, SYS_WRITE ecall # print disk number (x9 + 1) addi a0, x9, 1 jal ra, print_int # print " from " li a0, STDOUT la a1, str2 la a2, str2_end sub a2, a2, a1 li a7, SYS_WRITE ecall # print source peg char (use x21 instead of original x11) li a0, STDOUT la a1, chbuf # chbuf[0] = source char sb x21, 0(a1) li a2, 1 li a7, SYS_WRITE ecall # print " to " li a0, STDOUT la a1, str3 la a2, str3_end sub a2, a2, a1 li a7, SYS_WRITE ecall # print dest peg char (use x22 instead of original x12) li a0, STDOUT la a1, chbuf # chbuf[0] = dest char sb x22, 0(a1) li a2, 1 li a7, SYS_WRITE ecall # print '\n' li a0, STDOUT la a1, chbuf li a2, 1 li a3, 10 # '\n' sb a3, 0(a1) li a7, SYS_WRITE ecall # ------ # BLANK 26: Calculate storage offset slli x5, x9, 2 addi x5, x5, 20 add x5, x2, x5 # BLANK 27: Update disk position sw x19, 0(x5) # BLANK 28-29: Increment counter and loop addi x8, x8, 1 jal x0, game_loop finish_game: lw x8, 0(x2) lw x9, 4(x2) lw x18, 8(x2) lw x19, 12(x2) lw x20, 16(x2) addi x2, x2, 32 li a0, 0 # exit(0) li a7, SYS_EXIT # 93 ecall print_int: mv t0, a0 # t0 = x la t1, decbuf # t1 = buffer base li t2, 0 # t2 = out_len li t3, 0 # t3 = seen_nonzero flag li t4, 0x80000000 beq t0, t4, pi_intmin beqz t0, pi_zero bltz t0, pi_negative j pi_positive pi_zero: li t4, '0' sb t4, 0(t1) li t2, 1 j pi_write pi_negative: li a0, STDOUT la a1, minus_char la a2, minus_char_end sub a2, a2, a1 li a7, SYS_WRITE ecall sub t0, x0, t0 pi_positive: la t4, dec_divisors li t5, 10 pi_outer: beqz t5, pi_write lw t6, 0(t4) mv a1, t0 li a2, 0 pi_inner: blt a1, t6, pi_inner_done sub a1, a1, t6 addi a2, a2, 1 j pi_inner pi_inner_done: mv t0, a1 beqz t3, pi_check_first j pi_output_digit pi_check_first: beqz a2, pi_next_div li t3, 1 pi_output_digit: addi a2, a2, '0' add t6, t1, t2 sb a2, 0(t6) addi t2, t2, 1 pi_next_div: addi t4, t4, 4 addi t5, t5, -1 j pi_outer pi_write: li a0, STDOUT mv a1, t1 mv a2, t2 li a7, SYS_WRITE ecall jr ra pi_intmin: li a0, STDOUT la a1, intmin_str la a2, intmin_str_end sub a2, a2, a1 li a7, SYS_WRITE ecall jr ra .data obdata: .byte 0x3c, 0x3b, 0x3a str1: .ascii "Move Disk " str1_end: str2: .ascii " from " str2_end: str3: .ascii " to " str3_end: minus_char: .ascii "-" minus_char_end: intmin_str: .ascii "-2147483648" intmin_str_end: chbuf: .space 1 # buffer for printing a single character # ---- For print_int ---- .align 2 dec_divisors: .word 1000000000 .word 100000000 .word 10000000 .word 1000000 .word 100000 .word 10000 .word 1000 .word 100 .word 10 .word 1 decbuf: .space 10 ``` When running this modified `hanoi.s` under rv32emu’s system emulation, program prints all 7 moves of 
the 3-disk *Tower of Hanoi* in the **same format as on Ripes**, and then terminates with exit status `0`.

---

#### Alternative method
According to [docs/syscall.md](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md), rv32emu provides a fundamental subset of the **system calls required by newlib**, enabling programs compiled with **newlib** to run in the rv32emu emulator.
>[!Note]
>- **newlib** is a **C standard library implementation** designed for embedded systems; it provides common functions such as `malloc`, `printf`, `exit`, etc.
>- **rv32emu** provides system calls compatible with **newlib**, so programs linked with newlib can run under the emulator’s **system mode** without having to hand-code ecalls with specific syscall IDs.

In other words, if you compile a program using `printf()` or `malloc()` with `riscv-none-elf-gcc`, these functions will depend on newlib, and rv32emu can make these functions work properly in the emulator by supporting the necessary syscalls. Therefore, we can add a `call_help.c` as follows:
```C=
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

void help_puts(const char *s) { fputs(s, stdout); }
void help_putu(uint32_t x) { printf("%u", x); }
void help_putch(int c) { putchar(c); }
void help_exit_success(void) { exit(0); }
```
Where:
- **fputs():** It writes a null-terminated string to `stdout`, which will eventually be printed.
- **printf():** Here it prints numbers in decimal (`%u`) format.
- **putchar():** It writes a single character (passed as `int c`) to the standard output stream, which is eventually printed.
- **exit():** Here it terminates execution with a status code (`0` indicates success). In rv32emu, this corresponds to the `exit` system call that terminates the program.

All of them are C standard library functions **provided by newlib**. Next, we let `hanoi.s` call the functions (`help_*`) in `call_help.c` via `jal`. These functions in turn call `printf/fputs/exit`, which are finally handled, at the lower layer, through the required **syscalls** supported by rv32emu (such as `write` and `exit`). The corresponding modified `hanoi.s` is shown as follows:
```=
.text
.globl main
main:
    addi x2, x2, -32
    sw x8, 0(x2)
    sw x9, 4(x2)
    sw x18, 8(x2)
    sw x19, 12(x2)
    sw x20, 16(x2)

    li x5, 0x15
    sw x5, 20(x2)
    sw x5, 24(x2)
    sw x5, 28(x2)

    # Fix disk positions (BLANK 1-3: neutralize x5 effect)
    # BLANK 1: Fix position at x2+20
    sw x0, 20(x2)
    # BLANK 2: Fix position at x2+24
    sw x0, 24(x2)
    # BLANK 3: Fix position at x2+28
    sw x0, 28(x2)

    addi x8, x0, 1

game_loop:
    # BLANK 4: Check loop termination (2^3 moves)
    addi x5, x0, 8
    beq x8, x5, finish_game

    # Gray code formula: gray(n) = n XOR (n >> k)
    # BLANK 5: What is k for Gray code?
    srli x5, x8, 1
    # BLANK 6: Complete Gray(n) calculation
    xor x6, x8, x5

    # BLANK 7-8: Calculate previous value and its shift
    addi x7, x8, -1
    srli x28, x7, 1
    # BLANK 9: Generate Gray(n-1)
    xor x7, x7, x28

    # BLANK 10: Which bits changed?
    xor x5, x6, x7

    # Initialize disk number
    addi x9, x0, 0

    # BLANK 11: Mask for testing LSB
    andi x6, x5, 1
    # BLANK 12: Branch if disk 0 moves
    bne x6, x0, disk_found

    # BLANK 13: Set disk 1
    addi x9, x0, 1
    # BLANK 14: Test second bit with proper mask
    andi x6, x5, 2
    bne x6, x0, disk_found

    # BLANK 15: Last disk number
    addi x9, x0, 2

disk_found:
    # BLANK 16: Check impossible pattern (multiple bits)
    andi x30, x5, 5
    addi x31, x0, 5
    beq x30, x31, pattern_match
    jal x0, continue_move

pattern_match:

continue_move:
    # BLANK 17: Word-align disk index (multiply by what?)
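    # (multiply by 4: each disk position is stored as a 32-bit word on the stack)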
slli x5, x9, 2 # BLANK 18: Base offset for disk array addi x5, x5, 20 add x5, x2, x5 lw x18, 0(x5) bne x9, x0, handle_large # BLANK 19: Small disk moves by how many positions? addi x19, x18, 2 # BLANK 20: Number of pegs addi x6, x0, 3 blt x19, x6, display_move sub x19, x19, x6 jal x0, display_move handle_large: # BLANK 21: Load reference disk position lw x6, 20(x2) # BLANK 22: Sum of all peg indices (0+1+2) addi x19, x0, 3 sub x19, x19, x18 sub x19, x19, x6 display_move: #------------ # print "Move Disk " la a0, str1 # x10 = a0 jal ra, help_puts # print disk number (x9 + 1) addi a0, x9, 1 jal ra, help_putu # print " from " la a0, str2 jal ra, help_puts # find src pos la t0, obdata add t0, t0, x18 # Load byte from obfuscated data lbu t1, 0(t0) # Decode constant (0x6F) li t2, 0x6F xor t1, t1, t2 # Final offset adjustment addi t1, t1, -0x12 # t1 = 'A'/'B'/'C' # print source peg char in t1/x11 mv a0, t1 jal ra, help_putch # print " to " la a0, str3 jal ra, help_puts # find dst pos la t0, obdata add t0, t0, x19 # x19 is dest peg index lbu t1, 0(t0) li t2, 0x6F xor t1, t1, t2 addi t1, t1, -0x12 # print dest peg char in t1/x11 mv a0, t1 jal ra, help_putch # print '\n' li a0, 10 jal ra, help_putch #------------ # BLANK 26: Calculate storage offset slli x5, x9, 2 addi x5, x5, 20 add x5, x2, x5 # BLANK 27: Update disk position sw x19, 0(x5) # BLANK 28-29: Increment counter and loop addi x8, x8, 1 jal x0, game_loop #------------ finish_game: lw x8, 0(x2) lw x9, 4(x2) lw x18, 8(x2) lw x19, 12(x2) lw x20, 16(x2) addi x2, x2, 32 jal ra, help_exit #------------ .data obdata: .byte 0x3c, 0x3b, 0x3a str1: .asciz "Move Disk " str2: .asciz " from " str3: .asciz " to " ``` Where: - **help_puts():** Used to print a full string to the screen (without an automatic newline). - **help_putu():** Used to print an unsigned integer in decimal format. - **help_putch():** Used to print a single character to the screen. - **help_exit():** Indicates tests passed and terminates the program. To cooperate the modifications mentioned above, we need to modify the `Makefile` to ensure correct compilation, linking, and successful execution under **rv32emu’s system emulation**. The minimal necessary updates for `Makefile` are as follows: - Update `OBJS` to include only the actual files used (`start.o`, `hanoi.o`, `call_help.o`). - Ensure that both C files and `.s`/`.S` assembly files can be compiled properly. - Switch to using `gcc` for **linking** instead pure `ld`, so that `printf/puts/exit` within `call_help.c` can be automatically linked to **newlib**. - Use `-nostartfiles` to exclude the default startup files, while keeping the original `start.o` and `linker.ld` to preserve the **bare-metal** startup process. 
The corresponding modified `Makefile` is shown below: ``` include ../../../mk/toolchain.mk ARCH = -march=rv32izicsr LINKER_SCRIPT = linker.ld EMU ?= ../../../build/rv32emu AFLAGS = -g $(ARCH) CFLAGS = -g $(ARCH) LDFLAGS = -T $(LINKER_SCRIPT) EXEC = test.elf CC = $(CROSS_COMPILE)gcc AS = $(CROSS_COMPILE)as OBJDUMP = $(CROSS_COMPILE)objdump OBJS = start.o hanoi.o call_help.o .PHONY: all run dump clean all: $(EXEC) $(EXEC): $(OBJS) $(LINKER_SCRIPT) $(CC) $(CFLAGS) $(LDFLAGS) -nostartfiles -o $@ $(OBJS) %.o: %.S $(AS) $(AFLAGS) -c $< -o $@ %.o: %.s $(AS) $(AFLAGS) -c $< -o $@ %.o: %.c $(CC) $(CFLAGS) $< -o $@ -c run: $(EXEC) @test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1) @grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1) @grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1) $(EMU) $< dump: $(EXEC) $(OBJDUMP) -Ds $< | less clean: rm -f $(EXEC) $(OBJS) ``` #### A Small Gotcha For example, one of the assembly code segments that needed modification in the original `hanoi.s` is shown below: ```= display_move: la x20, obdata add x5, x20, x18 # BLANK 23: Load byte from obfuscated data lbu x11, 0(x5) # BLANK 24: Decode constant (0x6F) li x6, 0x6F xor x11, x11, x6 # BLANK 25: Final offset adjustment addi x11, x11, -0x12 add x7, x20, x19 lbu x12, 0(x7) xor x12, x12, x6 addi x12, x12, -0x12 la x10, str1 addi x17, x0, 4 ecall addi x10, x9, 1 addi x17, x0, 1 ecall la x10, str2 addi x17, x0, 4 ecall addi x10, x11, 0 addi x17, x0, 11 ecall la x10, str3 addi x17, x0, 4 ecall addi x10, x12, 0 addi x17, x0, 11 ecall addi x10, x0, 10 addi x17, x0, 11 ecall ``` If we simply replaced original `ecall` instructions with direct calls to `help_puts`、 `help_putu` or `help_putch` following the same program structure, we would encounter an issue. The corresponding modified version is shown as follows: ```= display_move: la x20, obdata add x5, x20, x18 # BLANK 23: Load byte from obfuscated data lbu x11, 0(x5) # BLANK 24: Decode constant (0x6F) li x6, 0x6F xor x11, x11, x6 # BLANK 25: Final offset adjustment addi x11, x11, -0x12 add x7, x20, x19 lbu x12, 0(x7) xor x12, x12, x6 addi x12, x12, -0x12 # print "Move Disk " la a0, str1 # x10 = a0 jal ra, help_puts # print disk number (x9 + 1) addi a0, x9, 1 jal ra, help_putu # print " from " la a0, str2 jal ra, help_puts # print source peg char in x11 mv a0, x11 jal ra, help_putch # print " to " la a0, str3 jal ra, help_puts # print dest peg char in x12 mv a0, x12 jal ra, help_putch # print '\n' li a0, 10 jal ra, help_putch ``` Then, when we run it, the corresponding result output becomes: ![quiz2_haveError](https://hackmd.io/_uploads/rkyNygGgWl.png) Clearly, the **characters** representing the **peg positions** (`A/B/C`) are **not printed correctly**. The reason characters are not printed correctly is that the values temporarily stored in `x11/x12` (`=a1/a2`) are **overwritten** by the subsequent `jal help_puts` / `jal help_putu` calls. >[!Note] >- **Caller-saved registers:** `a0–a7` (`x10–x17`), `t0–t6` (`x5–x7`, `x28–x31`), etc. A **callee** may overwrite them, and their values are **not preserved** across calls. >- **Callee-saved registers:** `s0–s11` (`x8/x18–x27`) must be preserved by the **callee** if used, and restored before returning. The current flow of operation is: 1. Compute the source/destination peg characters into `x11` and `x12`. 2. Then call `help_puts` and `help_putu` (which may use `a0–a7` inside). 3. 
Finally, we move `x11/x12` into `a0` to print the characters.

However, `x11/x12` may be **overwritten** during *Step 2*, so by the time we print them in *Step 3*, the values may **no longer be correct** (such as `0`), appearing as **blank**. To fix this, the peg characters should be computed later, **immediately before printing**, to ensure that their values **remain valid** (fresh and not overwritten by earlier calls). The corrected version of this code segment is the one shown earlier in the modified `hanoi.s`.

---

#### Result
After completing all the modifications mentioned above (with either of the two methods), running the program with `make run` produces the following output:

![quiz2_correct](https://hackmd.io/_uploads/HJEfrgGg-l.png)

The results match the expected output, confirming that we have successfully executed the program in rv32emu’s system emulation, with the correct move sequence.

---

### About Problem C in Quiz3
According to the original [Problem C](https://hackmd.io/@sysprog/arch2025-quiz3-sol#Problem-C) description in the [Problem set](https://drive.google.com/file/d/1RZBh-ifFZcb9Rcv4tPE0CTbkVdxGuv3l/view) for Quiz 3, the C code for `fast_rsqrt` is already given, providing a **fast reciprocal square root implementation** optimized for **RV32I**.
$$
fast\_rsqrt(x) \approx \tfrac{2^{16}}{\sqrt{x}}
$$
The function `fast_rsqrt(x)` efficiently computes an **approximation** of the reciprocal square root. It takes a 32-bit integer `x` as input and returns an approximation of the reciprocal square root of `x`, scaled by $2^{16}$ (16-bit fixed-point format).
>[!Note]
>According to the [Problem C](https://hackmd.io/@sysprog/arch2025-quiz3-sol#Problem-C) description:
>- For integer inputs `x>=1`, the reciprocal square root (`<=1`) must be represented in either floating-point or fixed-point format.
>- Since we assume no floating-point unit (FPU), we use a **fixed-point representation**.
>- Specifically, the actual implementation in `fast_rsqrt(x)` uses a $2^{16}$ scaling factor to represent fractional values using pure integer arithmetic, so the actual reciprocal square root value equals `fast_rsqrt(x)/65536`.

The original C code implementation is shown below:
```c=
#include <stdio.h>
#include <stdint.h>

/* clz: count leading zeros */
static int clz(uint32_t x) {
    if(!x) return 32; //Special case: no bits are set
    int n = 0;
    if(!(x & 0xFFFF0000)) { n += 16; x <<= 16; }
    if(!(x & 0xFF000000)) { n += 8; x <<= 8; }
    if(!(x & 0xF0000000)) { n += 4; x <<= 4; }
    if(!(x & 0xC0000000)) { n += 2; x <<= 2; }
    if(!(x & 0x80000000)) { n += 1; }
    return n;
}

/* implementation of 32x32 -> 64-bit multiplication without the hardware MUL instruction */
static uint64_t mul32(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for(int i = 0; i < 32; i++) {
        if(b & (1U << i)) {
            r += (uint64_t)a << i;
        }
    }
    return r;
}

/* Lookup table: initial estimates for 65536 / sqrt(2^n) */
static const uint16_t rsqrt_table[32] = {
    65535, 46341, 32768, 23170, 16384, 11585, 8192, 5793,
    4096, 2896, 2048, 1448, 1024, 724, 512, 362,
    256, 181, 128, 90, 64, 45, 32, 23,
    16, 11, 8, 6, 4, 3, 2, 1
};

/*
 * Fast reciprocal square root: 65536 / sqrt(x)
 * Computes an approximation of 1/sqrt(x) scaled by 2^16.
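 * For example, fast_rsqrt(4) should be close to 65536/sqrt(4) = 32768.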
*/ uint32_t fast_rsqrt(uint32_t x) { /* Handle edge cases */ if(x == 0) return 0xFFFFFFFF; // Infinity representation if(x == 1) return 65536; // Exact result for x=1 /* Steps 1: Find MSB position */ int exp = 31 - clz(x); // Exponent of highest set bit /* Step 2: Get initial estimate from LUT */ uint32_t y = rsqrt_table[exp]; // Initial estimate from LUT /* Step 3: Linear Interpolation */ if(x > (1U << exp)) { // Bound check: exp = 31 would access sqrt_table[32] uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; uint32_t delta = y - y_next; uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) >> exp); y -= (uint32_t) ((delta * frac) >> 16); } /* Step 4: Newton-Raphson Iterations */ for(int iter = 0; iter < 2; iter++) { uint64_t y2 = (uint32_t)mul32(y, y); uint64_t xy2 = (uint32_t)(mul32(x, y2) >> 16); y = (uint32_t)(mul32(y, (3U << 16) - xy2) >> 17); } return y; } ``` Where: - **clz():** Used to **count** the number of **leading zeros** in a 32-bit unsigned integer. - **mul32():** Using **shift-add** operation to implement 32×32 → 64-bit multiplication, because RV32I does not include the hardware `MUL` instruction. - **rsqrt_table[ ]:** a **Lookup table** that stores initial estimates for $\tfrac{65536}{\sqrt{2^{n}}}$ . After using `clz()` to locate the MSB of `x`, the corresponding table entry gives an initial guess. - **fast_rsqrt():** Used to compute the **approximation** of $\tfrac{1}{\sqrt{x}}$ scaled by $2^{16}$ using: 1. MSB-based **lookup table**. 2. **Linear interpolation** for refinement. 3. Two **Newton–Raphson** iterations for final accuracy. #### About Analyzing Precision and Performance In the following section, we simply perform a rough precision and performance analysis for the assignment requirement. First, to evaluate the **precision** of function `fast_rsqrt(x)`, we must construct a **reference implementation** that computes the high-accuracy value for comparison, defined as: $$ ref\_rsqrt(x) = \lfloor \sqrt{\tfrac{2^{32}}{x}} \rfloor = \lfloor 2^{16} / \sqrt{x} \rfloor $$This reference value is computed using pure **integer arithmetic** so that its output format **exactly matches** the output range of `fast_rsqrt(x)`. With `ref_rsqrt(x)` available, we can measure the numerical error of `fast_rsqrt(x)`. Next, for **performance** measurement, we write a small test inside `main()`. To record **cycle counts**, we also use the `get_cycles()` function provided in `perfcounter.S` from the original `playground` directory, which reads the 64-bit hardware cycle counter. Putting everything together, the additional C code is show as follows: ```C= static uint32_t isqrt_u64(uint64_t v) { uint64_t lo = 0, hi = 1ull << 32; while (lo + 1 < hi) { uint64_t mid = (lo + hi) >> 1; if (mid * mid <= v) lo = mid; else hi = mid; } return (uint32_t)lo; } static uint32_t ref_rsqrt(uint32_t x) { if (x == 0) return 0xFFFFFFFFu; uint64_t q = (1ULL << 32) / x; return isqrt_u64(q); } static void evaluate(void) { uint32_t max_rel_err_ppm = 0; uint64_t sum_rel_err_ppm = 0; uint32_t samples = 0; for (uint32_t x = 1; x <= 100000; x++) { uint32_t y = fast_rsqrt(x); uint32_t y0 = ref_rsqrt(x); uint32_t abs_err = (y > y0) ? (y - y0) : (y0 - y); uint32_t rel_ppm = (y0 == 0) ? 
0 : (uint32_t)((abs_err * 1000000ull) / y0);
        if (rel_ppm > max_rel_err_ppm) max_rel_err_ppm = rel_ppm;
        sum_rel_err_ppm += rel_ppm;
        samples++;
    }
    printf("[eval] samples=%u, max_rel_err=%.3f%%, avg_rel_err=%.3f%%\n",
           samples, max_rel_err_ppm / 10000.0,
           (sum_rel_err_ppm / (double)samples) / 10000.0);
}

extern unsigned long long get_cycles(void);

int main() {
    unsigned long long c0 = get_cycles();
    uint32_t test_values[] = {1, 4, 16, 100, 1024, 65535, 123456789, 4294967295};
    int num = sizeof(test_values)/sizeof(test_values[0]);
    for (int i = 0; i < num; i++) {
        uint32_t x = test_values[i];
        uint32_t y = fast_rsqrt(x);
        printf("fast_rsqrt(%u) = %u\n", x, y);
    }
    unsigned long long c1 = get_cycles();
    printf("cycles = %llu\n", c1 - c0);
    evaluate();
    return 0;
}
```
> The above C code was written with the support of AI tools.

Where:
- **isqrt_u64():** Implements the **integer square root** of a 64-bit value using binary search. It relies entirely on integer arithmetic to ensure correctness under RV32I constraints.
- **ref_rsqrt():** Computes the accurate fixed-point reference value of $2^{16} / \sqrt{x}$, matching the exact scaling convention used by `fast_rsqrt()`.
- **evaluate():** Sweeps through a range of input values (`1~100000`), computes the **relative errors**, and reports both the **maximum** and **average** relative error statistics in `%`.
- **main():** It initially runs several test cases to measure **cycle counts** for **performance** evaluation, then calls `evaluate()` to summarize the **precision** statistics.

Before starting, we can first temporarily modify `main()` by replacing the computation in the `for` loop with the reference implementation and commenting out the call to `evaluate()`. The corresponding execution output is shown below:

![pure-ref_rsqrt](https://hackmd.io/_uploads/HyMa6GPlZe.png)

In the above result, we can observe the total **cycle count** required to compute `ref_rsqrt(x)`. This serves as a **baseline**, which is expected to be slow but accurate, since it performs high-accuracy integer square-root computations.

Now, let us restore the original `main()` and begin the analysis by breaking `fast_rsqrt(x)` into three different versions:
1. **LUT only**
2. **LUT + linear interpolation**
3. **LUT + linear interpolation + two Newton–Raphson iterations** (the original `fast_rsqrt(x)`)

The first two versions can be obtained by simply removing the unused later steps from the original `fast_rsqrt(x)` implementation (the third version). A simple analysis of precision and performance for all three versions is given below.

##### 1. LUT only (`v1`)
The corresponding execution result is shown below:

![fr_v1](https://hackmd.io/_uploads/BkMaO7DgZl.png)

Compared with `ref_rsqrt(x)`:
- Cycle count reduced by approximately $59$%
- Maximum relative error: $41.436$%
- Average relative error: $18.727$%

Although `v1` is extremely fast, its precision is poor, with **large errors** across the input range; such accuracy is unacceptable in practice.

##### 2. LUT + linear interpolation (`v2`)
The corresponding execution result is shown below:

![fr_v2](https://hackmd.io/_uploads/SJdqkVweWe.png)

Compared with `ref_rsqrt(x)`:
- Cycle count reduced by approximately $58$%
- Maximum relative error: $5.314$%
- Average relative error: $3.429$%

With nearly the same total cycle cost as `v1`, both the maximum and average relative errors of `v2` are significantly **lower**. Linear interpolation significantly improves precision, but further refinement is still desirable.

##### 3. LUT + linear interpolation + two Newton–Raphson iterations (`v3`)
The corresponding execution result is shown below:

![fr_v3(ori)](https://hackmd.io/_uploads/S1fQNEwx-g.png)

Compared with `ref_rsqrt(x)`:
- Cycle count reduced by approximately $41$%
- Maximum relative error: $0.481$%
- Average relative error: $0.002$%

Although `v3` is slightly slower than `v1` and `v2`, its accuracy is dramatically higher. After the **Newton–Raphson iterations**, both the maximum and average relative errors are **extremely small**. You may notice that our measured **maximum relative error** (~0.48%) is significantly lower than the 3–8% error mentioned in the original Problem C description.
>[!Note]
>- In the original [Problem C](https://hackmd.io/@sysprog/arch2025-quiz3-sol#Problem-C) description, the "~3–8% error" likely refers to **relative error**, and specifically to the **maximum error** (worst case) rather than the average error.
>- The reason is that the output values of `fast_rsqrt` can vary in size from tens to thousands of times, so absolute error is not meaningful, and the average relative error is typically much smaller than the range stated in the problem description.

This discrepancy can be attributed to several factors. One explanation is that our test range covers **only x = 1 ~ 100000** (for time-efficiency reasons). Within this interval, the values of `ref_rsqrt(x)` are relatively large (the minimum is around 207). In such a scenario, if the difference between `fast_rsqrt(x)` and `ref_rsqrt(x)` is often just **1 LSB**, then the **worst-case** relative error is approximately ~0.5% ($1 / 207$), which matches closely with our observed **maximum** relative error of ~0.48%.

#### Conclusion
In summary, we successfully executed both the original `fast_rsqrt(x)` implementation and our extended simple precision and performance analysis in a **bare-metal environment** using **rv32emu’s system emulation**, as demonstrated in the earlier output screenshots.

---

## PART 4

#### Prerequisites
We reuse the `fast_rsqrt.c` from ***PART 3: Section 2*** and apply modifications. The core code of the current `fast_rsqrt.c` is shown below:
```c=
extern uint64_t get_cycles(void);
extern uint64_t get_instret(void);

/*
 * clz: count leading zeros
 */
static int clz(uint32_t x) {
    if(!x) return 32; //Special case: no bits are set
    int n = 0;
    if(!(x & 0xFFFF0000)) { n += 16; x <<= 16; }
    if(!(x & 0xFF000000)) { n += 8; x <<= 8; }
    if(!(x & 0xF0000000)) { n += 4; x <<= 4; }
    if(!(x & 0xC0000000)) { n += 2; x <<= 2; }
    if(!(x & 0x80000000)) { n += 1; }
    return n;
}

/*
 * implementation of 32x32 -> 64-bit multiplication without the hardware MUL instruction.
 * Algorithm: For each set bit i in multiplier b, add (a << i) to the result.
 *
 * Example: 5 * 3
 *   Binary: 5 = 101, 3 = 011
 *   3 has bits set at positions 0 and 1
 *   Result = (5 << 0) + (5 << 1) = 5 + 10 = 15
 *
 * Hint: Use a loop from i=0 to i=31, check if bit i is set in b,
 * then add (a << i) to the result.
 */
static uint64_t mul32(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for(int i = 0; i < 32; i++) {
        if(b & (1U << i)) {
            r += (uint64_t)a << i;
        }
    }
    return r;
}

/*
 * Lookup table: initial estimates for 65536 / sqrt(2^n)
 * Provides starting approximation indexed by MSB position.
* Usage: * For input x, find MSB position exp = 31 - clz(x) * Initial estimate = y = rsqrt_table[exp] * Examples: * x = 1 (2^0) -> exp = 0 -> y = 65536(exact: 65536/sqrt(1)) * x = 16 (2^4) -> exp = 4 -> y = 16384(exact: 65536/sqrt(16)) * x = 1024 -> exp = 10 -> y = 2048(exact: 65536/sqrt(1024)) * Each entry computed as: round(65536 / sqrt(2^n)) */ static const uint16_t rsqrt_table[32] = { 65535, 46341, 32768, 23170, 16384, 11585, 8192, 5793, 4096, 2896, 2048, 1448, 1024, 724, 512, 362, 256, 181, 128, 90, 64, 45, 32, 23, 16, 11, 8, 6, 4, 3, 2, 1 }; /* * Fast reciprocal square root: 65536 / sqrt(x) * Computs approximation of 1/sqrt(x) scaled by 2^16. * Input: x - any uint32_t value * Output: y = 65536 / sqrt(x), with 3~8% relative error * * Edge cases: * x = 0 -> 0xFFFFFFFF (represents infinity) * x = 1 -> 65535 (exact) * x = 2^n -> accurate (0~2% error) * x = Max_U32 -> 1 (minimum non-zero result) * * Algorithm overview: * 1. LUT lookup: ~20% error * 2. + Interpolation: ~10% error * 3. + 2 Newton: ~3~8% error */ uint32_t fast_rsqrt(uint32_t x) { /* Handle edge cases */ if(x == 0) return 0xFFFFFFFF; // Infinity representation if(x == 1) return 65536; // Exact result for x=1 /* Steps 1: Find MSB position */ int exp = 31 - clz(x); // Exponent of highest set bit /* Step 2: Get initial estimate from LUT */ uint32_t y = rsqrt_table[exp]; // Initial estimate from LUT /* Step 3: Linear Interpolation */ if(x > (1U << exp)) { // Bound check: exp = 31 would access sqrt_table[32] uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; uint32_t delta = y - y_next; uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) >> exp); y -= (uint32_t) ((delta * frac) >> 16); } /* Step 4: Newton-Raphson Iterations */ for(int iter = 0; iter < 2; iter++) { uint64_t y2 = (uint32_t)mul32(y, y); uint64_t xy2 = (uint32_t)(mul32(x, y2) >> 16); y = (uint32_t)(mul32(y, (3U << 16) - xy2) >> 17); } return y; } int main(void) { uint64_t c0 = get_cycles(); uint64_t i0 = get_instret(); uint32_t test_values[] = {1, 4, 16, 100, 1024, 65535, 123456789}; int num = sizeof(test_values)/sizeof(test_values[0]); for (int i = 0; i < num; i++) { uint32_t x = test_values[i]; uint32_t y = fast_rsqrt(x); TEST_LOGGER("fast_rsqrt("); print_dec((unsigned long)x); TEST_LOGGER(") = "); print_dec((unsigned long)y); TEST_LOGGER("\n"); } uint64_t c1 = get_cycles(); uint64_t i1 = get_instret(); TEST_LOGGER("cycles = "); print_dec((unsigned long)(c1 - c0)); TEST_LOGGER("\n"); TEST_LOGGER("instret = "); print_dec((unsigned long)(i1 - i0)); TEST_LOGGER("\n"); return 0; } ``` Here, we mainly make two changes: - We remove the **precision-analysis** part in the code, which originally scanned a large range of inputs and summarized error statistics. But we still remain `test_values[]` list for functional validation. - Since we run in this **bare-metal & rv32emu system emulation** environment and may not rely on the standard C library `printf` reliably (the Instructor mentioned this), we replace the original `printf` outputs in `main()` with the functions provided by the original `playground/main.c` (`printstr` + `TEST_LOGGER` + `print_dec`). In addition, we also implement a handwritten `fast_rsqrt.s`, which performs print function and termination similar to `uf8.s` (from ***PART 2***), and reads the performance counters via `get_cycles()` and `get_instret()` from `perfcounter.S`. 
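On RV32, a 64-bit counter can only be read as two 32-bit halves, and a robust reader re-checks the high half so that a rollover between the two reads cannot produce a torn value. The following C sketch illustrates the idea (the helper name is hypothetical, and `perfcounter.S` may implement this differently, e.g. with `mcycle`/`mcycleh` when running in machine mode):
```c
#include <stdint.h>

/* Hypothetical illustration only: read a 64-bit cycle count on RV32 by
 * combining the low/high counter CSRs, re-reading the high half to
 * detect a carry between the two reads. */
static uint64_t read_cycles64(void)
{
    uint32_t hi, lo, hi2;
    do {
        asm volatile("csrr %0, cycleh" : "=r"(hi));
        asm volatile("csrr %0, cycle" : "=r"(lo));
        asm volatile("csrr %0, cycleh" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
```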
The corresponding handwritten `fast_rsqrt.s` is shown below: :::info <details> <summary>The handwritten fast_rsqrt.s</summary> ```= .equ SYS_WRITE, 64 # RISC-V Linux ABI: sys_write .equ SYS_EXIT, 93 # RISC-V Linux ABI: sys_exit .equ STDOUT, 1 # file descriptor 1 = stdout .data msg_cycle: .ascii "cycles = " msg_cycle_end: msg_instret: .ascii "instret = " msg_instret_end: msg_fr1: .ascii "fast_rsqrt(" msg_fr1_end: msg_fr2: .ascii ") = " msg_fr2_end: next_line: .ascii "\n" next_line_end: minus_char: .ascii "-" minus_char_end: intmin_str: .ascii "-2147483648" intmin_str_end: .align 2 test_values: .word 1, 4, 16, 100, 1024, 65535, 123456789 .align 2 rsqrt_table: .half 65535, 46341, 32768, 23170, 16384 .half 11585, 8192, 5793, 4096, 2896 .half 2048, 1448, 1024, 724, 512 .half 362, 256, 181, 128, 90 .half 64, 45, 32, 23, 16 .half 11, 8, 6, 4, 3 .half 2, 1 # for print_int .align 2 dec_divisors: .word 1000000000 .word 100000000 .word 10000000 .word 1000000 .word 100000 .word 10000 .word 1000 .word 100 .word 10 .word 1 decbuf: .space 10 .text .globl main main: addi sp, sp, -4 sw ra, 0(sp) # Order:cycles -> instret jal ra, get_cycles # a1=hi, a0=lo mv s4, a0 # cycles_lo_start mv s5, a1 # cycles_hi_start jal ra, get_instret # a1=hi, a0=lo mv s6, a0 # instret_lo_start mv s7, a1 # instret_hi_start # Test of fast_rsqrt la s0, test_values li s1, 7 # s1 = sizes of test_values li s2, 0 # index = 0 loop_tests: bge s2, s1, after_tests slli t0, s2, 2 add s3, s0, t0 li a0, STDOUT la a1, msg_fr1 la t0, msg_fr1_end sub a2, t0, a1 li a7, SYS_WRITE ecall lw a0, 0(s3) jal ra, print_int li a0, STDOUT la a1, msg_fr2 la t0, msg_fr2_end sub a2, t0, a1 li a7, SYS_WRITE ecall lw a0, 0(s3) jal ra, fast_rsqrt jal ra, print_int li a0, STDOUT la a1, next_line la t0, next_line_end sub a2, t0, a1 li a7, SYS_WRITE ecall addi s2, s2, 1 j loop_tests after_tests: # Order:instret -> cycles jal ra, get_instret # a1=hi, a0=lo mv t0, a0 # instret_lo_end mv t1, a1 # instret_hi_end jal ra, get_cycles # a1=hi, a0=lo mv t2, a0 # cycles_lo_end mv t3, a1 # cycles_hi_end # diff_cycles = (t3:t2) - (s5:s4) sub t4, t2, s4 # lo_diff sltu t5, t2, s4 # borrow sub t6, t3, s5 sub t6, t6, t5 # hi_diff mv s0, t4 # reserve cycles_lo_diff mv s1, t6 # reserve cycles_hi_diff # diff_instret = (t1:t0) - (s7:s6) sub t4, t0, s6 # lo_diff sltu t5, t0, s6 # borrow sub t6, t1, s7 sub t6, t6, t5 # hi_diff mv s2, t4 # reserve instret_lo_diff mv s3, t6 # reserve instret_hi_diff # print cycles li a0, STDOUT la a1, msg_cycle la t0, msg_cycle_end sub a2, t0, a1 li a7, SYS_WRITE ecall mv a0, s0 # cycles low 32 jal ra, print_int li a0, STDOUT la a1, next_line la t0, next_line_end sub a2, t0, a1 li a7, SYS_WRITE ecall # print instret li a0, STDOUT la a1, msg_instret la t0, msg_instret_end sub a2, t0, a1 li a7, SYS_WRITE ecall mv a0, s2 # instret low 32 jal ra, print_int li a0, STDOUT la a1, next_line la t0, next_line_end sub a2, t0, a1 li a7, SYS_WRITE ecall main_End: lw ra, 0(sp) addi sp, sp, 4 # jal ra, io_exit_success li a0, 0 # exit(0) li a7, SYS_EXIT # 93 ecall fast_rsqrt: addi sp, sp, -16 sw ra, 12(sp) sw s0, 8(sp) sw s1, 4(sp) sw s2, 0(sp) beqz a0, ret_inf # x==0 -> 0xFFFFFFFF li t0, 1 beq a0, t0, ret_x_1 # x==1 -> 65536 # exp = 31 - clz(x) mv s0, a0 # s0 = x jal ra, clz li t0, 31 sub s1, t0, a0 # s1 = exp # y = rsqrt_table[exp] (uint16_t) la t0, rsqrt_table slli t1, s1, 1 # offset = exp * 2 add t0, t0, t1 lhu s2, 0(t0) # s2 = y (zero-extend) # if (x > (1<<exp)) -> interpolation li t0, 1 # t0 = 1 sll t0, t0, s1 bleu s0, t0, newton # if x <= (1<<exp) skip # y_next = 
(exp < 31) ? table[exp+1] : 0 li t1, 31 bge s1, t1, yn_zero addi t2, s1, 1 la t3, rsqrt_table slli t2, t2, 1 add t3, t3, t2 lhu t2, 0(t3) # t2 = y_next j delta yn_zero: li t2, 0 delta: # delta = y - y_next sub t1, s2, t2 # 32-bit # frac = (((x - (1<<exp)) << 16) >> exp) sub t2, s0, t0 # x - (1<<exp) slli t2, t2, 16 srl t2, t2, s1 # y -= (delta * frac) >> 16 mv a0, t1 # a = delta (32) mv a1, t2 # b = frac (32) jal ra, mul32 # (a1:a0) = 64-bit product # (hi:lo) >> 16 => new_lo = (hi<<16)|(lo>>16) srli t0, a0, 16 slli t1, a1, 16 or t1, t1, t0 # t1 = (delta*frac)>>16 sub s2, s2, t1 newton: # 2x Newton li a6, 2 newton_loop: # y2 = (uint32_t) mul32(y,y) mv a0, s2 mv a1, s2 jal ra, mul32 # (a1:a0)= y*y mv t5, a0 # t5 = y2 # xy2 = (uint32_t)( (x*y2) >> 16 ) mv a0, s0 # a = x mv a1, t5 # b = y2 (lo32) jal ra, mul32 # (a1:a0) = x*y2 srli a0, a0, 16 slli a1, a1, 16 or t6, a1, a0 # t6 = xy2 # y = (uint32_t)( (y * ((3<<16) - xy2)) >> 17 ) li t0, 3 slli t0, t0, 16 # t0 = 3 << 16 sub t0, t0, t6 # (3<<16) - xy2 mv a0, s2 # a = y mv a1, t0 # b = ... jal ra, mul32 # (a1:a0) = y * (...) srli t0, a0, 16 slli t1, a1, 16 or t1, t1, t0 # 64->32 after >>16 srli s2, t1, 1 # and >>1 so totally >>17 addi a6, a6, -1 bnez a6, newton_loop fr_done: mv a0, s2 j fr_End ret_inf: li a0, -1 # 0xFFFFFFFF j fr_End ret_x_1: li a0, 65536 fr_End: lw ra, 12(sp) lw s0, 8(sp) lw s1, 4(sp) lw s2, 0(sp) addi sp, sp, 16 ret # uint64_t mul32(uint32_t a, uint32_t b) # return: a1=high 32bits, a0=low 32bits mul32: li t0, 0 # lo_r = 0 li t1, 0 # hi_r = 0 li t2, 0 # i = 0 li t3, 32 mul32_loop: beq t2, t3, mul32_End # if (b & (1 << i)) { (hi:lo) += (a << i) } li t4, 1 sll t4, t4, t2 and t4, a1, t4 beqz t4, loop_next # a<<i -> 64: (hi_a:lo_a) mv a2, a0 # lo_a = a li a3, 0 # hi_a = 0 mv a4, t2 # a4 = i shift_loop: beqz a4, add_r srli a5, a2, 31 # a5= carry out of lo slli a2, a2, 1 slli a3, a3, 1 or a3, a3, a5 addi a4, a4, -1 j shift_loop add_r: add t0, t0, a2 sltu a5, t0, a2 # carry add t1, t1, a3 add t1, t1, a5 loop_next: addi t2, t2, 1 j mul32_loop mul32_End: mv a0, t0 # return lo_r mv a1, t1 # return hi_r ret # uint32_t clz(uint32_t x) clz: bnez a0, clz_chk_16 li t0, 32 j clz_End clz_chk_16: mv t0, zero li t1, 0xFFFF0000 and t1, a0, t1 bnez t1, clz_chk_8 addi t0, t0, 16 slli a0, a0, 16 clz_chk_8: li t1, 0xFF000000 and t1, a0, t1 bnez t1, clz_chk_4 addi t0, t0, 8 slli a0, a0, 8 clz_chk_4: li t1, 0xF0000000 and t1, a0, t1 bnez t1, clz_chk_2 addi t0, t0, 4 slli a0, a0, 4 clz_chk_2: li t1, 0xC0000000 and t1, a0, t1 bnez t1, clz_chk_1 addi t0, t0, 2 slli a0, a0, 2 clz_chk_1: li t1, 0x80000000 and t1, a0, t1 bnez t1, clz_End addi t0, t0, 1 clz_End: mv a0, t0 ret print_int: mv t0, a0 # t0 = x la t1, decbuf # t1 = buffer base li t2, 0 # t2 = out_len li t3, 0 # t3 = seen_nonzero flag # special case: (-2147483648) = 0x80000000 li t4, 0x80000000 beq t0, t4, pi_intmin # x == 0 ? beqz t0, pi_zero # x < 0 ? 
    bltz t0, pi_negative
    # x > 0: positive path
    j pi_positive

pi_zero:
    li t4, '0'
    sb t4, 0(t1)
    li t2, 1
    j pi_write

pi_negative:
    # print '-' first
    li a0, STDOUT
    la a1, minus_char
    la a2, minus_char_end
    sub a2, a2, a1
    li a7, SYS_WRITE
    ecall
    sub t0, x0, t0              # t0 = -t0

pi_positive:
    la t4, dec_divisors         # t4 = &divisors[0]
    li t5, 10                   # 10 divisors

pi_outer:
    beqz t5, pi_write
    lw t6, 0(t4)                # t6 = divisor
    mv a1, t0                   # a1 = remaining
    li a2, 0                    # a2 = digit

pi_inner:
    blt a1, t6, pi_inner_done
    sub a1, a1, t6
    addi a2, a2, 1
    j pi_inner

pi_inner_done:
    mv t0, a1
    beqz t3, pi_check_first
    j pi_output_digit

pi_check_first:
    beqz a2, pi_next_div
    li t3, 1

pi_output_digit:
    addi a2, a2, '0'            # digit -> ASCII
    add t6, t1, t2              # decbuf[out_len]
    sb a2, 0(t6)
    addi t2, t2, 1

pi_next_div:
    addi t4, t4, 4              # divisor++
    addi t5, t5, -1
    j pi_outer

pi_write:
    li a0, STDOUT
    mv a1, t1
    mv a2, t2
    li a7, SYS_WRITE
    ecall
    jr ra

pi_intmin:
    li a0, STDOUT
    la a1, intmin_str
    la a2, intmin_str_end
    sub a2, a2, a1
    li a7, SYS_WRITE
    ecall
    jr ra
```

</details>
:::

Running this handwritten `fast_rsqrt.s` with `make run` produces the following output:
```shell
../../../build/rv32emu test.elf
19:36:15 INFO src/riscv.c:552: Log level: TRACE
19:36:15 INFO src/riscv.c:565: test.elf ELF loaded
19:36:15 INFO src/main.c:315: RISC-V emulator is created and ready to run
fast_rsqrt(1) = 65536
fast_rsqrt(4) = 32768
fast_rsqrt(16) = 16384
fast_rsqrt(100) = 6553
fast_rsqrt(1024) = 2048
fast_rsqrt(65535) = 255
fast_rsqrt(123456789) = 5
cycles = 23878
instret = 23878
19:36:15 INFO src/main.c:338: RISC-V emulator is destroyed
```
The output shows that `fast_rsqrt(x)` in this `fast_rsqrt.s` matches the expected results for all test inputs. This confirms the **functional correctness** of the handwritten `fast_rsqrt.s`, and its results are consistent with the current `fast_rsqrt.c`. In addition, the reported `cycles` and `instret` can be used for the performance comparisons that follow.

---

### Comparisons and Observations
We first compare the performance of `fast_rsqrt.c` under different compilation options (e.g., `-Ofast` optimized for speed and `-Os` optimized for size), and then compare them with the handwritten `fast_rsqrt.s`. Concretely, for `fast_rsqrt.c`, we use three compilation options in the `Makefile`:
- `CFLAGS = -g $(ARCH)`, `CFLAGS = -g $(ARCH) -Ofast` and `CFLAGS = -g $(ARCH) -Os`

We then measured **cycles** and **retired instructions** (`instret`) for each version; the corresponding results are shown below:

<div style="margin: auto; width: fit-content;">

| <div style="width:300px">`fast_rsqrt` program</div>| <div style="width:150px">Cycles</div> | <div style="width:150px">Instret</div> |
| :-----: | :-----: | :-----: |
| **Handwritten assembly** | **23878** | **23878** |
| **C program (no `-O*`)**| **85442** | **85442** |
| **C program with `-Ofast` (speed)** | **36394** | **36394** |
| **C program with `-Os` (size)**| **34095** | **34095** |

</div>

Across all versions, we found that **cycles and instret are identical**. This indicates that the runtime is **largely dominated** by the number of **retired** instructions (i.e., the **dynamic instruction count**) rather than the static instruction count. In other words, the program becomes slower mainly because it **executes more instructions** (function calls, loops, etc.). It is also clear that the handwritten assembly version is **faster** than all three compiler-generated versions.
The main reason is that our handwritten version is more targeted: we explicitly control register usage and dataflow, and have already applied some optimization (such as modify version of `clz` with loop unrolling). Therefore, the handwritten version **retires fewer instructions and runs faster** for the same work. Below are several observations: #### Observation 1: stack frame/register saving and `mul32` differences may significantly affect the dynamic instruction count **(1) stack frame / register saving** In the C program (no `-O*`), the `<fast_rsqrt>` prologue is heavy, it allocates a large stack frame and saves many registers (multiple `s*` plus `ra`): ```= 00010494 <fast_rsqrt>: 10494: f9010113 addi sp,sp,-112 10498: 06112623 sw ra,108(sp) 1049c: 06812423 sw s0,104(sp) 104a0: 07212223 sw s2,100(sp) 104a4: 07312023 sw s3,96(sp) 104a8: 05412e23 sw s4,92(sp) 104ac: 05512c23 sw s5,88(sp) 104b0: 05612a23 sw s6,84(sp) 104b4: 05712823 sw s7,80(sp) ...... ``` In contrast, with `-Ofast`, the compiler splits `<fast_rsqrt>` into a small wrapper and plus a `fast_rsqrt.part.0` for the main computation: ```= 000103b0 <fast_rsqrt>: 103b0: 00050863 beqz a0,103c0 <fast_rsqrt+0x10> 103b4: 00100713 li a4,1 103b8: 00e50863 beq a0,a4,103c8 <fast_rsqrt+0x18> 103bc: d71ff06f j 1012c <fast_rsqrt.part.0> 103c0: fff00513 li a0,-1 103c4: 00008067 ret 103c8: 00010537 lui a0,0x10 103cc: 00008067 ret ``` Moreover, the `fast_rsqrt.part.0` uses a much smaller stack frame and saves only a small set of registers: ```= 0001012c <fast_rsqrt.part.0>: 1012c: ff010113 addi sp,sp,-16 10130: 00912223 sw s1,4(sp) 10134: 00112623 sw ra,12(sp) 10138: 00812423 sw s0,8(sp) 1013c: 00050493 mv s1,a0 10140: 26050463 beqz a0,103a8 <fast_rsqrt.part.0+0x27c> 10144: 000107b7 lui a5,0x10 10148: 16f56c63 bltu a0,a5,102c0 <fast_rsqrt.part.0+0x194> 1014c: 00050793 mv a5,a0 ...... ``` And for `-Os`, the prologue/epilogue overhead is in-between, it still saves several registers, but the stack frame is clearly smaller than the no `-O*` version: ```= 00010130 <fast_rsqrt>: 10130: fd010113 addi sp,sp,-48 10134: 02912223 sw s1,36(sp) 10138: 02112623 sw ra,44(sp) 1013c: 02812423 sw s0,40(sp) 10140: 03212023 sw s2,32(sp) 10144: 01312e23 sw s3,28(sp) 10148: 01412c23 sw s4,24(sp) 1014c: 01512a23 sw s5,20(sp) 10150: 01612823 sw s6,16(sp) ...... ``` Finally, for the handwritten assembly, the control flow and register usage are explicitly planned, also resulting in the smallest prologue/epilogue overhead: ```= 00010208 <fast_rsqrt>: 10208: ff010113 addi sp,sp,-16 1020c: 00112623 sw ra,12(sp) 10210: 00812423 sw s0,8(sp) 10214: 00912223 sw s1,4(sp) 10218: 01212023 sw s2,0(sp) 1021c: 0e050a63 beqz a0,10310 <ret_inf> 10220: 00100293 li t0,1 10224: 0e550a63 beq a0,t0,10318 <ret_x_1> 10228: 00050413 mv s0,a0 1022c: 178000ef jal 103a4 <clz> ...... ``` Overall, these differences can directly contribute to the dynamic instruction count: a larger stack frames and saving/restoring more registers will increase the number of instructions executed on function prologue/epilogue, which can accumulate into noticeable instret and cycles when the function is invoked repeatedly. **(2) `mul32`: relative heavy in no `-O*`, while `-Ofast/-Os` use `__mulsi3`** In the no `-O*` version, `mul32` stores parameters (`a0/a1`) and intermediates to the stack, and the loop repeatedly performs `lw/sw` to reload/store values. 
```= 000103a0 <mul32>: 103a0: fd010113 addi sp,sp,-48 103a4: 02112623 sw ra,44(sp) 103a8: 02812423 sw s0,40(sp) 103ac: 03010413 addi s0,sp,48 103b0: fca42e23 sw a0,-36(s0) 103b4: fcb42c23 sw a1,-40(s0) 103b8: 00000513 li a0,0 103bc: 00000593 li a1,0 103c0: fea42423 sw a0,-24(s0) 103c4: feb42623 sw a1,-20(s0) 103c8: fe042223 sw zero,-28(s0) 103cc: 09c0006f j 10468 <mul32+0xc8> 103d0: fe442583 lw a1,-28(s0) 103d4: 00100513 li a0,1 103d8: 00b51533 sll a0,a0,a1 103dc: fd842583 lw a1,-40(s0) 103e0: 00b575b3 and a1,a0,a1 103e4: 06058c63 beqz a1,1045c <mul32+0xbc> 103e8: fdc42583 lw a1,-36(s0) 103ec: 00058613 mv a2,a1 103f0: 00000693 li a3,0 103f4: fe442583 lw a1,-28(s0) 103f8: fe058593 addi a1,a1,-32 103fc: 0005c863 bltz a1,1040c <mul32+0x6c> 10400: 00b617b3 sll a5,a2,a1 10404: 00000713 li a4,0 10408: 02c0006f j 10434 <mul32+0x94> 1040c: 00165513 srli a0,a2,0x1 10410: 01f00813 li a6,31 10414: fe442583 lw a1,-28(s0) 10418: 40b805b3 sub a1,a6,a1 1041c: 00b555b3 srl a1,a0,a1 10420: fe442503 lw a0,-28(s0) 10424: 00a697b3 sll a5,a3,a0 10428: 00f5e7b3 or a5,a1,a5 1042c: fe442583 lw a1,-28(s0) 10430: 00b61733 sll a4,a2,a1 10434: fe842803 lw a6,-24(s0) 10438: fec42883 lw a7,-20(s0) 1043c: 00e80533 add a0,a6,a4 10440: 00050313 mv t1,a0 10444: 01033333 sltu t1,t1,a6 10448: 00f885b3 add a1,a7,a5 1044c: 00b30833 add a6,t1,a1 10450: 00080593 mv a1,a6 10454: fea42423 sw a0,-24(s0) 10458: feb42623 sw a1,-20(s0) 1045c: fe442583 lw a1,-28(s0) 10460: 00158593 addi a1,a1,1 10464: feb42223 sw a1,-28(s0) 10468: fe442503 lw a0,-28(s0) 1046c: 01f00593 li a1,31 10470: f6a5d0e3 bge a1,a0,103d0 <mul32+0x30> 10474: fe842703 lw a4,-24(s0) 10478: fec42783 lw a5,-20(s0) 1047c: 00070513 mv a0,a4 10480: 00078593 mv a1,a5 10484: 02c12083 lw ra,44(sp) 10488: 02812403 lw s0,40(sp) 1048c: 03010113 addi sp,sp,48 10490: 00008067 ret ``` In contrast, in both `-Ofast` and `-Os`, we no longer find the `<mul32>` symbol. Instead, we observe a direct call to `<__mulsi3>`: ```= # -Os, in <fast_rsqrt> 10230: 140000ef jal 10370 <__mulsi3> ``` This implies that the compiler avoids the original shift-add `mul32()` implementation and switches to a more efficient software-multiplication helper, which may reduce the dynamic instruction cost related to multiplication. #### Observation 2: printing is expensive In the no `-O*` version, `print_dec` will calls `umod` and `udiv`. Since `udiv/umod` implement bit-level long division, printing a single decimal number may result in expensive cost. ```= # no -O*, in <print_dec> ...... 10234: ef5ff0ef jal 10128 <umod> ...... 10264: e01ff0ef jal 10064 <udiv> ...... 
```

In contrast, in both `-Ofast` and `-Os`, we found that `print_dec` no longer calls `udiv`/`umod`:

```=
# -Os
00010064 <print_dec>:
   10064: fe010113   addi sp,sp,-32
   10068: 00a00793   li   a5,10
   1006c: 00f10fa3   sb   a5,31(sp)
   10070: 0a051663   bnez a0,1011c <print_dec+0xb8>
   10074: 03000793   li   a5,48
   10078: 00f10f23   sb   a5,30(sp)
   1007c: 01d10713   addi a4,sp,29
   10080: 00170713   addi a4,a4,1
   10084: 01f10793   addi a5,sp,31
   10088: 40e787b3   sub  a5,a5,a4
   1008c: 04000893   li   a7,64
   10090: 00100513   li   a0,1
   10094: 00070593   mv   a1,a4
   10098: 00078613   mv   a2,a5
   1009c: 00000073   ecall
   100a0: 02010113   addi sp,sp,32
   100a4: 00008067   ret
   100a8: 00d55633   srl  a2,a0,a3
   100ac: 00179793   slli a5,a5,0x1
   100b0: 00167613   andi a2,a2,1
   100b4: 00f667b3   or   a5,a2,a5
   100b8: 00f87463   bgeu a6,a5,100c0 <print_dec+0x5c>
   100bc: ff678793   addi a5,a5,-10
   100c0: fff68693   addi a3,a3,-1
   100c4: ff1692e3   bne  a3,a7,100a8 <print_dec+0x44>
   100c8: 03078793   addi a5,a5,48
   100cc: 00f70023   sb   a5,0(a4)
   100d0: 00000613   li   a2,0
   100d4: fff70713   addi a4,a4,-1
   100d8: 01f00693   li   a3,31
   100dc: 00000793   li   a5,0
   100e0: 00d555b3   srl  a1,a0,a3
   100e4: 00179793   slli a5,a5,0x1
   100e8: 0015f593   andi a1,a1,1
   100ec: 00f5e7b3   or   a5,a1,a5
   100f0: 00f87863   bgeu a6,a5,10100 <print_dec+0x9c>
   100f4: 00d315b3   sll  a1,t1,a3
   100f8: ff678793   addi a5,a5,-10
   100fc: 00b66633   or   a2,a2,a1
   10100: fff68693   addi a3,a3,-1
   10104: fd169ee3   bne  a3,a7,100e0 <print_dec+0x7c>
   10108: f6060ce3   beqz a2,10080 <print_dec+0x1c>
   1010c: 00060513   mv   a0,a2
   10110: 01f00693   li   a3,31
   10114: 00000793   li   a5,0
   10118: f91ff06f   j    100a8 <print_dec+0x44>
   1011c: 01e10713   addi a4,sp,30
   10120: 00900813   li   a6,9
   10124: fff00893   li   a7,-1
   10128: 00100313   li   t1,1
   1012c: fe5ff06f   j    10110 <print_dec+0xac>
```

In the handwritten assembly, printing is done via the `print_int` function (which is based on a divisor table). It avoids the expensive `udiv`/`umod` long division, which also reduces the printing overhead.

Furthermore, we want to verify whether printing dominates the overall execution time. To isolate the compute cost of `fast_rsqrt`, we removed the per-test printing from `main()` (in `fast_rsqrt.c`), and the cycles and instret become:

- No `-O*`: **cycles/instret = 19741**
- With `-Ofast`: **cycles/instret = 9946**
- With `-Os`: **cycles/instret = 8943**

Compared with the original measurements, removing printing dramatically reduces cycles/instret. This strongly suggests that printing is a major contributor to the program’s overall execution time.

#### Observation 3: `.text` size does not directly predict execution time

We also compared `.text` sizes (in bytes) via `riscv-none-elf-size test.elf`:

- No `-O*`: **`.text` size = 2591**
- With `-Ofast`: **`.text` size = 1715**
- With `-Os`: **`.text` size = 1923**
- Handwritten assembly: **`.text` size = 1296**

Interestingly, `-Ofast` produces a smaller `.text` than `-Os`, while in our measurements `-Os` achieves slightly lower cycles/instret than `-Ofast`. This highlights that static code size (`.text` bytes) does not directly predict execution time; for a specific workload, control flow and other factors may matter more than `.text` bytes.

---

### Optimization

Since the handwritten `fast_rsqrt.s` is clearly faster than the three compiler-generated versions, our next goal is to optimize a compiler-generated version. Among the three C versions, the no `-O*` build is the slowest, so in this section we focus on optimizing the no `-O*` version to reduce cycles/instret without changing the original program's functionality.
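Before picking an optimization target, the numbers from ***Observation 2*** already bound how much any compute-side optimization can help while the printing code is kept. As a rough estimate of ours (derived only from the measurements reported above, and assuming the printing cost stays fixed), the non-printing portion of the no `-O*` run accounts for about $\tfrac{19741}{85442} \approx 23\%$ of all retired instructions, so by Amdahl's law even an infinitely fast compute part cannot shrink the total by more than roughly that share:

$$
p \approx \frac{19741}{85442} \approx 0.231, \qquad
\text{Speedup}_{\max} = \left.\frac{1}{(1-p) + p/s}\right|_{s \to \infty} = \frac{1}{1-p} \approx 1.30
$$

This is also why the comparison below is repeated with the per-test printing removed: only then can an improvement to `mul32` show up clearly in the overall cycles/instret.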
Based on our observations in the previous section, we choose the function `mul32` from the no `-O*` disassembly as the optimization target. This is because `mul32` is invoked multiple times inside `fast_rsqrt` (also `fast_rsqrt` is invoked multiple times in `main()`), and the no `-O*` `mul32` is clearly heavy (frequent `lw/sw` to stack), making it suitable for optimization. To ensure that any performance difference comes only from changes to `mul32`, we use the following workflow to create a baseline and set up a clean replacement: 1. We first extract `mul32` function from the no `-O*` disassembly, apply minimal necessary edits (e.g., replacing hard-coded branch targets with labels), and save it as `mul32_ori.s` in the same directory as `fast_rsqrt.c`. 2. We remove the original `mul32` definition from `fast_rsqrt.c` and replace it with an external declaration: ```c= extern uint64_t mul32(uint32_t a, uint32_t b); ``` We also update the `Makefile` to include the new object file: - `OBJS = start.o perfcounter.o fast_rsqrt.o mul32_ori.o`. After running the program, we confirmed that replacing the original C `mul32` function with `mul32_ori.s` produces the same overall cycles/instret as the original no `-O*` version (only `fast_rsqrt.c`). This validates that our replacement setup is correct, and later performance changes can be attributed solely to `mul32` optimization. Therefore, in the following optimizations, we only need to modify `mul32_ori.s` and re-measure cycles/instret for comparison. #### Optimize `Mul32` The original `mul32` C function (in `fast_rsqrt.c`) is shown below: ```c= static uint64_t mul32(uint32_t a, uint32_t b) { uint64_t r = 0; for(int i = 0; i < 32; i++) { if(b & (1U << i)) { r += (uint64_t)a << i; } } return r; } ``` And the corresponding current `mul32_ori.s` (**baseline**, which is extracted from the no `-O*` disassembly) is shown below: ```= .text .globl mul32 mul32: addi sp, sp, -48 sw ra, 44(sp) sw s0, 40(sp) addi s0, sp, 48 sw a0, -36(s0) # store a sw a1, -40(s0) # store b li a0, 0 li a1, 0 sw a0, -24(s0) # lo_acc sw a1, -20(s0) # hi_acc sw zero, -28(s0) # i = 0 j L_check L_loop: lw a1, -28(s0) # i li a0, 1 sll a0, a0, a1 # mask = 1 << i lw a1, -40(s0) # b and a1, a0, a1 beqz a1, L_loop_next lw a1, -36(s0) # a mv a2, a1 # a2 = a li a3, 0 # a3 = 0 (hi part) lw a1, -28(s0) # i addi a1, a1, -32 bltz a1, L_neg_shift sll a5, a2, a1 # a5 = (a << (i-32)) (hi) li a4, 0 # a4 = 0 (lo) j L_add_accum L_neg_shift: srli a0, a2, 1 li a6, 31 lw a1, -28(s0) sub a1, a6, a1 srl a1, a0, a1 lw a0, -28(s0) sll a5, a3, a0 or a5, a1, a5 lw a1, -28(s0) sll a4, a2, a1 L_add_accum: lw a6, -24(s0) # lo_acc lw a7, -20(s0) # hi_acc add a0, a6, a4 # lo_new mv t1, a0 sltu t1, t1, a6 # carry from lo add a1, a7, a5 # hi_partial add a6, t1, a1 # hi_new mv a1, a6 sw a0, -24(s0) sw a1, -20(s0) L_loop_next: lw a1, -28(s0) addi a1, a1, 1 sw a1, -28(s0) L_check: lw a0, -28(s0) li a1, 31 bge a1, a0, L_loop lw a4, -24(s0) lw a5, -20(s0) mv a0, a4 # return lo in a0 mv a1, a5 # return hi in a1 lw ra, 44(sp) lw s0, 40(sp) addi sp, sp, 48 ret ``` From the `mul32_ori.s`, we can see several obvious inefficient points: **(1) Too many values are kept on the stack and repeatedly loaded/stored** For example, `a(a0)`, `b(b0)`, `i`, `lo_acc`, `hi_acc` are all stored on the stack. As a result, in each loop iteration performs multiple `lw/sw`, making the loop body dominated by memory accesses rather than computation. 
**(2) The mask `mask = 1 << i` is recomputed every iteration** In `mul32_ori.s`, the `mask` is used to test whether bit `i` of `b` is set `1` or `0`. However, it is recomputed in each iteration: ```= L_loop: ...... li a0, 1 sll a0, a0, a1 # mask = 1 << i ...... ``` **(3) The computation of `(uint64_t)a << i` is unintuitive** The corresponding code segment is shown below: ```= lw a1, -28(s0) # i addi a1, a1, -32 bltz a1, L_neg_shift sll a5, a2, a1 # a5 = (a << (i-32)) (hi) li a4, 0 # a4 = 0 (lo) j L_add_accum L_neg_shift: srli a0, a2, 1 li a6, 31 lw a1, -28(s0) sub a1, a6, a1 srl a1, a0, a1 lw a0, -28(s0) sll a5, a3, a0 or a5, a1, a5 lw a1, -28(s0) sll a4, a2, a1 ``` It will first checks whether `i-32` is negative and then goes through a less intuitive `hi/lo` computation way. Moreover, since `i` is always in **0~31**, the `i-32` check is unnecessary and introduces extra control flow and instructions. Therefore, we try to address these issues in `mul32_ori.s`. The corresponding optimized version **`mul32_opt.s`** is shown below: ```= .text .globl mul32 mul32: # The usage of registers: # t0 = lo_acc # t1 = hi_acc # t2 = mask # t3 = i # t4 = intermediates # t5 = hi_add # t6 = lo_add / carry li t0, 0 # lo_acc = 0 li t1, 0 # hi_acc = 0 li t2, 1 # mask = 1 li t3, 0 # i = 0 L_loop: and t4, a1, t2 beqz t4, L_next # lo_add = a << i sll t6, a0, t3 # hi_add = (i == 0) ? 0 : (a >> (32 - i)) beqz t3, L_hi_zero li t4, 32 sub t4, t4, t3 # t4 = 32 - i srl t5, a0, t4 # hi_add j L_add64 L_hi_zero: li t5, 0 L_add64: add t4, t0, t6 # new_lo sltu t6, t4, t0 # carry = (new_lo < old_lo) add t1, t1, t5 add t1, t1, t6 mv t0, t4 L_next: addi t3, t3, 1 # i++ slli t2, t2, 1 # mask <<= 1 li t4, 32 blt t3, t4, L_loop # while (i < 32) mv a0, t0 mv a1, t1 ret ``` Compared with `mul32_ori.s` (**baseline**), the key changes in `mul32_opt.s` are: 1. **Remove frequent `lw/sw` inside the loop** The baseline keeps `a/b/i/acc` on the stack and reloads/stores them repeatedly. Thus, we keep `lo_acc/hi_acc/mask/i` in registers, so the loop body avoids repeated memory accesses. 2. **Do not recompute `mask = 1 << i` each iteration** We initialize `mask = 1` once and update it with `mask <<= 1` per iteration, reducing repeated setup work. 3. **A simpler `hi/lo` computation for `(uint64_t)(a << i)`** We compute `lo_add = a << i` and `hi_add = (i == 0) ? 0: a >> (32-i)`, removing unnecessary control flow and instructions. We also estimate the **static instruction count** of `mul32` (excluding labels/comments): - Baseline `mul32_ori.s`: **61 instructions** - Optimized `mul32_opt.s`: **25 instructions** Next, we replace `mul32_ori.s` with `mul32_opt.s` and run: ```shell. ../../../build/rv32emu test.elf 16:57:25 INFO src/riscv.c:552: Log level: TRACE 16:57:25 INFO src/riscv.c:565: test.elf ELF loaded 16:57:25 INFO src/main.c:315: RISC-V emulator is created and ready to run fast_rsqrt(1) = 65536 fast_rsqrt(4) = 32768 fast_rsqrt(16) = 16384 fast_rsqrt(100) = 6553 fast_rsqrt(1024) = 2048 fast_rsqrt(65535) = 255 fast_rsqrt(123456789) = 5 cycles = 75995 instret = 75995 16:57:25 INFO src/main.c:338: RISC-V emulator is destroyed ``` The outputs match the expected results, confirming that `mul32_opt.s` is functionally correct. 
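Besides comparing the program output, the replacement can also be cross-checked cheaply on the host. The sketch below is ours (it is not part of the assignment's test suite): `mul32_opt_c` is a C rendering of the `mul32_opt.s` loop described above, compared against the widening multiply that a native compiler already provides.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* C rendering of the mul32_opt.s loop: for each set bit i of b, add
 * (uint64_t)a << i, split into lo_add = a << i and
 * hi_add = (i == 0) ? 0 : a >> (32 - i), with the carry handled manually. */
static uint64_t mul32_opt_c(uint32_t a, uint32_t b)
{
    uint32_t lo = 0, hi = 0;
    for (uint32_t i = 0, mask = 1; i < 32; i++, mask <<= 1) {
        if (b & mask) {
            uint32_t lo_add = a << i;
            uint32_t hi_add = (i == 0) ? 0 : a >> (32 - i);
            uint32_t new_lo = lo + lo_add;
            hi += hi_add + (new_lo < lo);   /* carry out of the low word */
            lo = new_lo;
        }
    }
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    /* corner cases plus a small cross sweep */
    uint32_t cases[] = {0, 1, 2, 3, 0xFFFFu, 0x10000u, 0x7FFFFFFFu, 0xFFFFFFFFu};
    for (unsigned i = 0; i < sizeof cases / sizeof cases[0]; i++)
        for (unsigned j = 0; j < sizeof cases / sizeof cases[0]; j++)
            assert(mul32_opt_c(cases[i], cases[j]) ==
                   (uint64_t)cases[i] * cases[j]);
    puts("mul32_opt_c matches the reference product");
    return 0;
}
```

This is only a host-side sanity check of the algorithm; on the target we still rely on the `fast_rsqrt` outputs shown above.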
Compared with the baseline:

- **Cycles/Instret reduced by approximately $1 - \tfrac{75995}{85442} \approx 11.1\%$**
- **Code size (static instruction count of `mul32`) reduced by approximately $1 - \tfrac{25}{61} \approx 59.0\%$**

However, based on ***Observation 2*** in the previous section, printing dominates the program's overall execution time, so this improvement is likely partially masked by the printing cost. Therefore, we again removed the per-test printing from `main()` (in `fast_rsqrt.c`); the cycles and instret become:

```shell
../../../build/rv32emu test.elf
16:26:54 INFO src/riscv.c:552: Log level: TRACE
16:26:54 INFO src/riscv.c:565: test.elf ELF loaded
16:26:54 INFO src/main.c:315: RISC-V emulator is created and ready to run
cycles = 10294
instret = 10294
16:26:54 INFO src/main.c:338: RISC-V emulator is destroyed
```

Compared with the printing-removed baseline:

- **Cycles/Instret reduced by approximately $1 - \tfrac{10294}{19741} \approx 47.9\%$**

That is to say, under the printing-removed setup, the optimized `mul32_opt.s` provides **roughly a $1.9\times$ speedup (close to $2\times$)** over the baseline `mul32_ori.s`, which is already close to the performance we observed for the `-Ofast` version.

---

> Refer to <[Assignment2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/2025-arch-homework2)>

***References***

- [Assignment2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/2025-arch-homework2)
- [LAB2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel)
- [RISC-V Assembly Programmer’s Manual](https://github.com/riscv-non-isa/riscv-asm-manual/blob/main/src/asm-manual.adoc)
- [rv32emu: System Calls (docs/syscall.md)](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md)
- [Quiz2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz2-sol)
- [Quiz3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz3-sol)