Assignment 2: Complete Applications

# Assignment 2: Complete Applications contributed by < [`jin11109`](https://github.com/jin11109/ca2025-assigment2) > ## Introduction ### Assignment Overview The assignment required adapting [Assignment 1](https://hackmd.io/@sysprog/2025-arch-homework1) and [Quiz 2A](https://hackmd.io/@sysprog/arch2025-quiz2-sol#Problem-A)/[Quiz 3C](https://hackmd.io/@sysprog/arch2025-quiz3-sol#Problem-C) programs to run on rv32emu in a bare-metal environment using only RV32I instructions, measuring execution cycles, and analyzing precision. Handwritten and compiler-optimized assembly were also compared, with the goal of exploring further optimizations. ### My Implementation I adapted my [assignment 1’s `uf8_encode`](https://hackmd.io/@jin11109/arch2025-homework1) and Quiz 2A/3C to run in rv32emu, replacing unsupported syscalls. I discussed the precision of Quiz 3C, compared handwritten and compiler-generated assembly for uf8_encode, and explored potential improvements based on the compiler-optimized assembly. ## Run in rv32emu ### Verifying [`tests/system/playground`](https://gist.github.com/jserv/5f682ac880773cab69e3564f4f20d60a) First, we compile and test the **system emulation** environment: ```console $ make ENABLE_ELF_LOADER=1 ENABLE_EXT_C=0 ENABLE_SYSTEM=1 misalign-in-blk-emu Linux image is found. Skipping downloading. Running misalign.elf ... [OK] ``` We can verify that the configuration options are correctly enabled: ```console $ grep -E "ENABLE_SYSTEM|ENABLE_ELF_LOADER" build/.config ENABLE_ELF_LOADER=1 ENABLE_SYSTEM=1 ``` Next, run the **playground** test suite: ```console $ make run ../../../build/rv32emu test.elf 10:37:51 INFO src/riscv.c:552: Log level: TRACE 10:37:51 INFO src/riscv.c:565: test.elf ELF loaded 10:37:51 INFO src/main.c:315: RISC-V emulator is created and ready to run === ChaCha20 Tests === Test 0: ChaCha20 (RISC-V Assembly) Test: ChaCha20 ChaCha20 RFC 7539: PASSED Cycles: 6797 Instructions: 6797 === BFloat16 Tests === Test 1: bf16_add Test: bf16_add 1.0 + 1.0 = 4000 PASSED Cycles: 432 Instructions: 432 Test 2: bf16_sub Test: bf16_sub 3.0 - 2.0 = 3f80 PASSED Cycles: 373 Instructions: 373 Test 3: bf16_mul Test: bf16_mul 2.0 * 3.0 = 40c0 PASSED Cycles: 464 Instructions: 464 Test 4: bf16_div Test: bf16_div 6.0 / 2.0 = 4040 PASSED Cycles: 624 Instructions: 624 Test 5: bf16_special_cases Test: bf16_special_cases bf16_iszero(0): PASSED bf16_isnan(NaN): PASSED bf16_isinf(Inf): PASSED Cycles: 239 Instructions: 239 === All Tests Completed === 10:37:51 INFO src/main.c:338: RISC-V emulator is destroyed ``` After verifying the playground test suite, the next section discusses how to construct the essential components of a **bare-metal program**. ### linker In a bare-metal environment, we must manually specify how each section of the program (such as `.text`, `.data`, and `.bss`) is placed in the output ELF file. This is because there is no operating system to handle memory layout or ELF parsing. Here, we directly reuse the `linker.ld` script from the playground: ```linker OUTPUT_ARCH( "riscv" ) ENTRY(_start) SECTIONS { . = 0x10000; .text : { *(.text._start) *(.text) } .data : { *(.data) } .bss : { __bss_start = .; *(.bss) __bss_end = .; } .stack (NOLOAD) : { . = ALIGN(16); . += 4096; __stack_top = .; } } ``` The linker script defines the memory layout for the `.text`, `.data`, `.bss`, and `.stack` sections, as well as the **program entry point**. A notable detail is that the `.text` section explicitly places `_start` at the beginning.This ensures that the ELF file starts execution at the `_start` label — the very first instruction of our program. ### Startup Code As the entry point of the program, `_start` is responsible for setting up the stack, clearing the `.bss` section, and then jumping to `main`: ```asm # Startup code for bare metal RISC-V .section .text._start .globl _start .type _start, @function _start: # Set up stack pointer la sp, __stack_top # Clear BSS la t0, __bss_start la t1, __bss_end 1: bge t0, t1, 2f sw zero, 0(t0) addi t0, t0, 4 j 1b 2: # Call main call main # Exit syscall (if main returns) li a7, 93 # exit syscall number li a0, 0 # exit code ecall # Infinite loop (should never reach here) 3: j 3b .size _start, .-_start # Provide BSS markers if linker script doesn't define them .weak __bss_start .weak __bss_end .weak __stack_top ``` ### System Call Since a bare-metal environment lacks both an operating system and the standard C library, we must directly implement system calls following the rv32emu [syscall documentation](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) For example, to perform `write` and `exit` calls: ```asm # write li a0, STDOUT # file descriptor la a1, msg # address of string li a2, 7 # length of string li a7, WRITE # syscall number for write ecall # exit li a0, 0 # 0 signals success li a7, EXIT ecall ``` ### Run in rv32emu: Assignment 1 — q1-uf8 > [source code](https://github.com/jin11109/ca2025-assigment2/tree/main/q1-uf8-assigment1) After reusing the playground’s linker script, startup code, and Makefile, the following error occurred: ```console riscv-none-elf-as -o q1-uf8.o q1-uf8.s riscv-none-elf-ld -T linker.ld -o test.elf q1-uf8.o riscv-none-elf-ld: warning: test.elf has a LOAD segment with RWX permissions ../../../build/rv32emu test.elf 21:21:02 INFO src/riscv.c:552: Log level: TRACE 21:21:02 INFO src/riscv.c:565: test.elf ELF loaded 21:21:02 INFO src/main.c:315: RISC-V emulator is created and ready to run 21:21:02 FATAL src/syscall.c:538: Unknown syscall: 4 rv32emu: src/emulate.c:1272: __trap_handler: Assertion `insn' failed. make: *** [Makefile:38: run] Aborted (core dumped) ``` This error occurred because the program used an unsupported syscall number. Unlike Ripes, where syscall `4` is used for string output, rv32emu uses different syscall identifiers and argument conventions—for example, `WRITE` (syscall 7) instead of syscall 4. In addition, my implementation from Assignment 1 did not properly exit the program using a system call. Previously, the code simply jumped to the bottom of the file: ```asm j end_of_file ... end_of_file: nop ``` While this approach works in **Ripes**, it causes the emulator to hang in rv32emu, since the program never terminates. It must be replaced with an explicit exit call: ```asm li a0, 0 # 0 signals success li a7, EXIT ecall ``` #### Execution Result ```console $ make run ../../../build/rv32emu test.elf 11:20:44 INFO src/riscv.c:552: Log level: TRACE 11:20:44 INFO src/riscv.c:565: test.elf ELF loaded 11:20:44 INFO src/main.c:315: RISC-V emulator is created and ready to run All tests passed 11:20:44 INFO src/main.c:338: RISC-V emulator is destroyed ``` ### Run in rv32emu: Quiz2-A > [source code](https://github.com/jin11109/ca2025-assigment2/tree/main/quiz2-A) Quiz 2-A also requires modification of the system call interface.However, it uses multiple syscall types for different kinds of output: - syscall `4`: print **string** - syscall `1`: print **integer** - syscall `11`: print **character** Syscall 4 was already discussed in [Run in rv32emu: Assignment1 q1-uf8](#Run-in-rv32emu:-Assignment1-q1-uf8) Below, we focus on how to rewrite the other two using rv32emu’s `write` syscall. #### Print Character Since rv32emu’s `write` expects the address of a string as input, a single character must first be stored in memory. Here, we use the **stack** as temporary storage for that character. - Original version: ```asm addi x10, x11, 0 addi x17, x0, 11 # print character ecall ``` - Modified version: ```asm li a0, STDOUT addi a1, x14, 0 sw a1, 32(sp) # Put the character we want to print in stack addi a1, sp, 32 # Set address li a2, 1 li a7, WRITE ecall ``` #### Print Integer Printing integers is slightly more complicated because each digit must be converted into its ASCII representation before output. In the **playground** environment, this is handled by a helper function like `print_dec`, which converts a number into decimal digits stored in memory. However, for this assignment, we only need to print disk indices (e.g., `1`, `2`, `3`), so a simpler approach suffices. We can adapt the character printing logic: by adding the ASCII offset of '0' (which is `48`), a single-digit integer can be displayed correctly. - Original version: ```asm addi x10, x9, 1 addi x17, x0, 1 # print integer ecall ``` - Modified version: ```asm li a0, STDOUT addi a1, x9, 49 # Transfer integer to character, additionly add bass 48 sw a1, 32(sp) # Put the integer we want to print in stack addi a1, sp, 32 # Set address li a2, 1 li a7, WRITE ecall ``` #### Execution Result ```console $ make run ../../../build/rv32emu test.elf 11:52:32 INFO src/riscv.c:552: Log level: TRACE 11:52:32 INFO src/riscv.c:565: test.elf ELF loaded 11:52:32 INFO src/main.c:315: RISC-V emulator is created and ready to run Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C 11:52:32 INFO src/main.c:338: RISC-V emulator is destroyed ``` ### Run in rv32emu: Quiz3-C > [source code](https://github.com/jin11109/ca2025-assigment2/tree/main/quiz3-C) For Quiz 3-C, full support for both hexadecimal and decimal output is required to observe the results of the `fast_rsqrt` computation. Therefore, this version directly reuses the `print_hex` and `print_dec` helper functions from the playground’s main program. #### Execution Result Each pair of lines below shows x (in decimal) and its corresponding $\dfrac{1}{\sqrt{x}}$ value (in Q16.16 fixed-point hex): ```console $ make run riscv-none-elf-gcc -g -march=rv32i_zicsr quiz3-C.c -o quiz3-C.o -c riscv-none-elf-ld -T linker.ld -o test.elf quiz3-C.o start.o utils.o riscv-none-elf-ld: warning: test.elf has a LOAD segment with RWX permissions ../../../build/rv32emu test.elf 12:12:17 INFO src/riscv.c:552: Log level: TRACE 12:12:17 INFO src/riscv.c:565: test.elf ELF loaded 12:12:17 INFO src/main.c:315: RISC-V emulator is created and ready to run 1 10000 2 b505 3 93cc 4 8000 5 727c 6 6882 7 60c2 8 5a82 9 5555 10 50f4 12:12:17 INFO src/main.c:338: RISC-V emulator is destroyed ``` ## Evaluation of Precision > [source code](https://github.com/jin11109/ca2025-assigment2/tree/main/quiz3-C) I selected Quiz3-C to evaluate the precision of the **fast reciprocal square root** (`fast_rsqrt`) implemented in the **Q16.16** fixed-point format. ### Look up Table and Newton’s Iteration I was curious about why additional operations are needed for the table lookup values, yet only two iterations of Newton’s method are sufficient to achieve high precision. To explore this, I visualized the relative error by python. The `fast_rsqrt()` function first obtains an initial estimate of $\dfrac {1}{\sqrt x}$ using the following lookup table: ```c static const uint16_t rsqrt_table[32] = { 65536, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */ 11585, 8192, 5793, 4096, 2896, /* 2^5 to 2^9 */ 2048, 1448, 1024, 724, 512, /* 2^10 to 2^14 */ 362, 256, 181, 128, 90, /* 2^15 to 2^19 */ 64, 45, 32, 23, 16, /* 2^20 to 2^24 */ 11, 8, 6, 4, 3, /* 2^25 to 2^29 */ 2, 1 /* 2^30, 2^31 */ }; ``` Each table entry corresponds to an initial guess of \begin{gather*}y=\dfrac{1}{\sqrt{2^x}}\end{gather*}for values within the range \begin{gather*}\dfrac{1}{\sqrt{2^x}} \leq y \leq \dfrac{1}{\sqrt{2^{x + 1} - 1}}\end{gather*} However, without further refinement, this lookup method introduces up to **40% relative error** at the upper bound of each interval. To mitigate this issue, `fast_rsqrt()` applies **linear interpolation** to adjust the estimated value proportionally within its interval, significantly improving the initial accuracy. ```c if (x > (1u << exp)) { uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; uint32_t delta = y - y_next; uint32_t frac = (uint32_t) ((((uint64_t) x - (1UL << exp)) << 16) >> exp); y -= (uint32_t) (mul32(delta, frac) >> 16); } ``` ![image](https://hackmd.io/_uploads/BkJPez-gbg.png) --- Even if Newton’s iteration is applied directly **without interpolation**, the resulting precision remains significantly lower compared to the interpolated version. ![Screenshot from 2025-11-11 00-58-37](https://hackmd.io/_uploads/SJf-qckg-g.png) With interpolation, however, only **two iterations** of Newton’s method are enough to achieve high accuracy: ![Figure_1](https://hackmd.io/_uploads/HJXvOOklbx.png) ## Handwritten and Compiler-Optimized Implementations In this section, we discuss the differences between our handwritten assembly code from Assignment 1 and the compiler-generated assembly produced using optimization flags. ### Disassembly Recall that in Assignment 1, we manually wrote the assembly code. To analyze and compare it with the compiler’s optimized output, we first compile the original C source code with optimization enabled, and then inspect the disassembly using `gdb`. This process helps us locate relevant code segments and understand how compiler optimizations differ from our hand-tuned implementation. ```console $ riscv-none-elf-gcc -o q1-uf8 q1-uf8.c -g -o3 -march=rv32i_zicsr $ riscv-none-elf-gdb q1-uf8 ``` In `gdb`, use: ```gdb (gdb) disassemble /m uf8_encode ``` --- ### `uf8_encode`: While Loop (Adjusting if the Estimate Was Off) When optimizing the `uf8_encode()` function in Assignment 1, I initially could not find an effective way to optimize the following while loop. Thus, I translated it directly from C into assembly: ```c /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } ``` ```asm while_loop: ble t1, x0, while_loop_end bge a0, t2, while_loop_end addi t2, t2, -16 # overflow += -16 srli t2, t2, 1 # overflow >>= 1 addi t1, t1, -1 j while_loop while_loop_end: ``` However, in the assembly code generated by GCC with `-O3` option, I noticed that the compiler unrolled this loop — a result that caught my attention. GCC replaced the simple while loop with multiple **conditional branches**, effectively expanding the iterations in-line for better performance. ```assemble 56 /* Adjust if estimate was off */ 57 while (exponent > 0 && value < overflow) { 0x00010484 <+428>: li a2,15 0x00010490 <+440>: li a0,-16 0x0001052c <+596>: lui a5,0x80 0x00010530 <+600>: addi a5,a5,-17 # 0x7ffef 0x00010534 <+604>: bltu a5,a1,0x10870 <uf8_encode+1432> 0x00010538 <+608>: lui a5,0x40 0x00010540 <+616>: addi a5,a5,-17 0x00010548 <+624>: bltu a5,a1,0x10774 <uf8_encode+1180> 0x0001055c <+644>: beqz a3,0x10708 <uf8_encode+1072> 0x00010560 <+648>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x00010574 <+668>: beqz a3,0x10708 <uf8_encode+1072> 0x00010578 <+672>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x0001058c <+692>: beqz a3,0x10708 <uf8_encode+1072> 0x00010590 <+696>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x000105a4 <+716>: beqz a3,0x10708 <uf8_encode+1072> 0x000105a8 <+720>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x000105bc <+740>: beqz a3,0x10708 <uf8_encode+1072> 0x000105c0 <+744>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x000105d4 <+764>: beqz a3,0x10708 <uf8_encode+1072> 0x000105d8 <+768>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x000105ec <+788>: beqz a3,0x10708 <uf8_encode+1072> 0x000105f0 <+792>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x00010604 <+812>: beqz a3,0x10708 <uf8_encode+1072> 0x00010608 <+816>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x0001061c <+836>: beqz a3,0x10708 <uf8_encode+1072> 0x00010620 <+840>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x00010634 <+860>: beqz a3,0x10708 <uf8_encode+1072> 0x00010638 <+864>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x0001064c <+884>: beqz a3,0x10708 <uf8_encode+1072> 0x00010650 <+888>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x00010664 <+908>: beqz a3,0x10708 <uf8_encode+1072> 0x00010668 <+912>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x00010670 <+920>: li a3,14 0x00010678 <+928>: beq a4,a3,0x10840 <uf8_encode+1384> 0x0001067c <+932>: bgeu a1,a5,0x10858 <uf8_encode+1408> 0x000106c4 <+1004>: bgeu a1,a0,0x106dc <uf8_encode+1028> 0x000106d0 <+1016>: bltu a1,a0,0x1054c <uf8_encode+628> 0x00010780 <+1192>: li a3,15 0x0001078c <+1204>: bgeu a3,a1,0x10688 <uf8_encode+944> 0x00010870 <+1432>: lui a5,0x80 0x00010874 <+1436>: addi a5,a5,-16 # 0x7fff0 0x00010878 <+1440>: j 0x10484 <uf8_encode+428> 58 overflow = (overflow - 16) >> 1; 0x0001053c <+612>: lui a0,0x40 0x00010544 <+620>: addi a0,a0,-16 0x00010558 <+640>: addi a5,a5,-8 0x00010570 <+664>: addi a5,a0,-12 0x00010588 <+688>: addi a5,a5,-8 0x000105a0 <+712>: addi a5,a5,-8 0x000105b8 <+736>: addi a5,a5,-8 0x000105d0 <+760>: addi a5,a5,-8 0x000105e8 <+784>: addi a5,a5,-8 0x00010600 <+808>: addi a5,a5,-8 0x00010618 <+832>: addi a5,a5,-8 0x00010630 <+856>: addi a5,a5,-8 0x00010648 <+880>: addi a5,a5,-8 0x00010660 <+904>: addi a5,a5,-8 0x0001066c <+916>: srli a5,a5,0x1 0x00010674 <+924>: addi a5,a5,-8 0x000106c8 <+1008>: srli a0,a0,0x1 0x000106cc <+1012>: addi a0,a0,-8 0x00010778 <+1184>: mv a5,a0 0x0001077c <+1188>: j 0x10688 <uf8_encode+944> 59 exponent--; 0x0001054c <+628>: addi a3,a2,-2 0x00010550 <+632>: srli a5,a0,0x1 0x00010554 <+636>: zext.b a3,a3 0x00010564 <+652>: addi a3,a2,-3 0x00010568 <+656>: srli a0,a0,0x2 0x0001056c <+660>: zext.b a3,a3 0x0001057c <+676>: addi a3,a2,-4 0x00010580 <+680>: srli a5,a5,0x1 0x00010584 <+684>: zext.b a3,a3 0x00010594 <+700>: addi a3,a2,-5 0x00010598 <+704>: srli a5,a5,0x1 0x0001059c <+708>: zext.b a3,a3 0x000105ac <+724>: addi a3,a2,-6 0x000105b0 <+728>: srli a5,a5,0x1 0x000105b4 <+732>: zext.b a3,a3 0x000105c4 <+748>: addi a3,a2,-7 0x000105c8 <+752>: srli a5,a5,0x1 0x000105cc <+756>: zext.b a3,a3 0x000105dc <+772>: addi a3,a2,-8 0x000105e0 <+776>: srli a5,a5,0x1 0x000105e4 <+780>: zext.b a3,a3 0x000105f4 <+796>: addi a3,a2,-9 0x000105f8 <+800>: srli a5,a5,0x1 0x000105fc <+804>: zext.b a3,a3 0x0001060c <+820>: addi a3,a2,-10 0x00010610 <+824>: srli a5,a5,0x1 0x00010614 <+828>: zext.b a3,a3 0x00010624 <+844>: addi a3,a2,-11 0x00010628 <+848>: srli a5,a5,0x1 0x0001062c <+852>: zext.b a3,a3 0x0001063c <+868>: addi a3,a2,-12 0x00010640 <+872>: srli a5,a5,0x1 0x00010644 <+876>: zext.b a3,a3 0x00010654 <+892>: addi a3,a2,-13 0x00010658 <+896>: srli a5,a5,0x1 0x0001065c <+900>: zext.b a3,a3 0x000106d4 <+1020>: addi a2,a2,-1 0x000106d8 <+1024>: zext.b a2,a2 0x00010774 <+1180>: mv a2,a3 60 } ``` #### Why Can the While Loop Be Unrolled? The key observation is that the variable `exponent` is of type `uint8_t`, and before entering the while loop, the following condition restricts its range to a maximum of 15: ```c if (exponent > 15) exponent = 15; ``` Because `exponent` decreases with every iteration, the loop can execute **at most 15 times**. Therefore, the compiler recognizes it as a **bounded loop**, which makes it eligible for **loop unrolling optimization**. #### How Is the Loop Unrolled? > [!note] Regarding `(gdb) disassemble /m` > Although `/m` is useful for mapping C source lines to assembly instructions, it rearranges instruction order for readability. > This can sometimes be misleading when analyzing the control flow in detail — for example, it appeared that multiple conditional branches corresponded to a single while condition. > Therefore, I removed the `/m` flag in later analyses to examine the raw instruction sequence. By inspecting the middle portion of the unrolled loop (to avoid initialization or final clean-up instructions), I observed the following pattern: ```asm ... 0x0001055c <+644>: beqz a3,0x10708 <uf8_encode+1072> 0x00010560 <+648>: bgeu a1,a5,0x1071c <uf8_encode+1092> 0x00010564 <+652>: addi a3,a2,-3 0x00010568 <+656>: srli a0,a0,0x2 0x0001056c <+660>: zext.b a3,a3 0x00010570 <+664>: addi a5,a0,-12 ... ``` - The `beqz` and `bgeu` instructions correspond to the two conditions in the while loop. - The instruction `addi a3, a2, -3` directly updates `exponent`, eliminating dependency on the previous iteration — effectively part of **loop unrolling** and **dependency elimination**. - Subsequent instructions handle the `overflow` adjustment. #### Comparison with My Implementation The main difference between my hand-translated assembly and GCC’s optimized version lies in the loop analysis. While I treated the while loop as inherently iterative, GCC identified it as a bounded loop and applied loop unrolling. This observation taught me an important lesson: Optimization is not only about instruction-level efficiency, but also about analyzing variable constraints to expose opportunities for structural optimizations such as unrolling and dependency reduction. ### `uf8_encode`: For Loop (Calculate Overflow for Estimated Exponent) #### Optimization in Assignment 1 [In Assignment 1](https://hackmd.io/ay0lR_mBRrq5nQsHUKYa2w#Version2), I focused particularly on optimizing the following `for` loop. By examining its recursive behavior, I found that it could be reduced to a single arithmetic expression. Original `for` loop: ```c for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16; ``` Optimized version from Assignment 1: ```c overflow = (overflow << exponent) + (((1 << exponent) - 1) << 4); ``` Translated into RISC-V assembly: ```asm li t3, 1 sll t3, t3, t1 addi t3, t3, -1 slli t3, t3, 4 sll t2, t2, t1 add t2, t2, t3 ``` This transformation eliminates the iterative loop structure by computing the result directly through bit manipulation, significantly reducing control-flow overhead. #### Comparison with GCC-Generated Assembly To examine whether GCC adopted a similar optimization strategy, I analyzed the compiler-generated assembly. ```gdb 52 /* Calculate overflow for estimated exponent */ 53 for (uint8_t e = 0; e < exponent; e++) 0x000104b4 <+476>: li a5,1 0x000104b8 <+480>: beq a4,a5,0x10780 <uf8_encode+1192> 0x000104bc <+484>: li a5,2 0x000104c0 <+488>: beq a4,a5,0x106c0 <uf8_encode+1000> 0x000104c4 <+492>: li a5,3 0x000104c8 <+496>: beq a4,a5,0x107bc <uf8_encode+1252> 0x000104cc <+500>: li a5,4 0x000104d0 <+504>: beq a4,a5,0x107cc <uf8_encode+1268> 0x000104d4 <+508>: li a5,5 0x000104d8 <+512>: beq a4,a5,0x107d4 <uf8_encode+1276> 0x000104dc <+516>: li a5,6 0x000104e0 <+520>: beq a4,a5,0x107e8 <uf8_encode+1296> 0x000104e4 <+524>: li a5,7 0x000104e8 <+528>: beq a4,a5,0x107a8 <uf8_encode+1232> 0x000104ec <+532>: li a5,8 0x000104f0 <+536>: beq a4,a5,0x107f0 <uf8_encode+1304> 0x000104f4 <+540>: li a5,9 0x000104f8 <+544>: beq a4,a5,0x107fc <uf8_encode+1316> 0x000104fc <+548>: li a5,10 0x00010500 <+552>: beq a4,a5,0x10808 <uf8_encode+1328> 0x00010504 <+556>: li a5,11 0x00010508 <+560>: beq a4,a5,0x10814 <uf8_encode+1340> 0x0001050c <+564>: li a5,12 0x00010510 <+568>: beq a4,a5,0x10820 <uf8_encode+1352> 0x00010514 <+572>: li a5,13 0x00010518 <+576>: beq a4,a5,0x1082c <uf8_encode+1364> 0x00010520 <+584>: li a3,14 0x00010528 <+592>: beq a4,a3,0x106c4 <uf8_encode+1004> 0x000106ac <+980>: li a5,1 0x000106b0 <+984>: bne a4,a5,0x104bc <uf8_encode+484> 0x000106b4 <+988>: j 0x10780 <uf8_encode+1192> 54 overflow = (overflow << 1) + 16; 0x0001051c <+580>: lui a0,0x40 0x00010524 <+588>: addi a0,a0,-16 0x000106c0 <+1000>: li a0,48 0x000107a8 <+1232>: li a0,2032 0x000107ac <+1236>: j 0x106c4 <uf8_encode+1004> 0x000107b0 <+1240>: mv a5,a0 0x000107b4 <+1244>: li a2,14 0x000107b8 <+1248>: j 0x10688 <uf8_encode+944> 0x000107bc <+1252>: li a0,112 0x000107c0 <+1256>: j 0x106c4 <uf8_encode+1004> 0x000107cc <+1268>: li a0,240 0x000107d0 <+1272>: j 0x106c4 <uf8_encode+1004> 0x000107d4 <+1276>: li a0,496 0x000107d8 <+1280>: j 0x106c4 <uf8_encode+1004> 0x000107e8 <+1296>: li a0,1008 0x000107ec <+1300>: j 0x106c4 <uf8_encode+1004> 0x000107f0 <+1304>: lui a0,0x1 0x000107f4 <+1308>: addi a0,a0,-16 # 0xff0 0x000107f8 <+1312>: j 0x106c4 <uf8_encode+1004> 0x000107fc <+1316>: lui a0,0x2 0x00010800 <+1320>: addi a0,a0,-16 # 0x1ff0 0x00010804 <+1324>: j 0x106c4 <uf8_encode+1004> 0x00010808 <+1328>: lui a0,0x4 0x0001080c <+1332>: addi a0,a0,-16 # 0x3ff0 0x00010810 <+1336>: j 0x106c4 <uf8_encode+1004> 0x00010814 <+1340>: lui a0,0x8 0x00010818 <+1344>: addi a0,a0,-16 # 0x7ff0 0x0001081c <+1348>: j 0x106c4 <uf8_encode+1004> 0x00010820 <+1352>: lui a0,0x10 0x00010824 <+1356>: addi a0,a0,-16 # 0xfff0 0x00010828 <+1360>: j 0x106c4 <uf8_encode+1004> 0x0001082c <+1364>: lui a0,0x20 0x00010830 <+1368>: addi a0,a0,-16 # 0x1fff0 <__umoddi3+1132> 0x00010834 <+1372>: j 0x106c4 <uf8_encode+1004> 0x00010880 <+1448>: li a5,16 0x00010884 <+1452>: j 0x10688 <uf8_encode+944> ``` The generated code can roughly be divided into **two sections**: 1. The **first part** checks the value of `exponent` and branches to the corresponding code region. 2. The **second part** directly assigns precomputed constant values. This behavior resembles a **lookup table**, except that instead of reading constants from memory using offsets, GCC uses **multiple conditional branches** to jump to specific code paths. This design avoids extra memory accesses and eliminates the need for additional storage space. However, it introduces many branch instructions, which may negatively affect branch prediction and instruction flow. --- #### Comparison My optimization uses a closed-form arithmetic expression to replace the loop. This approach computes the result on demand without pre-stored constants, which makes it more general but also introduces trade-offs: - Advantages: - The loop is replaced by a compact, branch-free formula. - No need to unroll or maintain multiple branch conditions. - Disadvantages: - Each execution requires arithmetic computation, whereas GCC’s method directly selects a constant result. - It introduces severe data dependency, since almost every instruction depends on the result of the previous one, potentially limiting instruction-level parallelism. ## Optimize q1-uf8: `uf8_encode()` > [source code](https://github.com/jin11109/ca2025-assigment2/tree/main/q1-uf8) In the previous section, [Handwritten and Compiler-optimized](#Handwritten-and-Compiler-optimized) , we analyzed the differences between manually written assembly code and the compiler-generated code under the `-O3` optimization level. Beyond that comparison, I found that directly optimizing the compiler-generated assembly was extremely difficult. The main reason is that the compiler tends to **split logical expressions into smaller instruction groups** and rearrange them to minimize data dependencies. As a result, even when using `gdb` with the `/m` disassembly option, it is still very challenging to trace the complete logic of the original code. Therefore, my optimization approach focuses on **modifying the C source code directly**, applying `-O3` optimization afterward, and comparing the performance and correctness across multiple optimization strategies. ### Performance Measurement Based on the reference from [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel#Performance-Counters) and the **ChaCha** implementation, we can measure the number of CPU cycles required for execution using the following assembly routine: ```asm get_cycles: rdcycleh a1 # Read high 32 bits rdcycle a0 # Read low 32 bits rdcycleh a2 # Re-read high (detect overflow) bne a1, a2, get_cycles # Retry if overflow occurred ret ``` As shown in [tick.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) the execution time of a function can be obtained by measuring the cycle count before and after execution: ```c ticks t0 = getticks(); fib(19); ticks t1 = getticks(); printf("elapsed cycle: %" PRIu64 "\n", t1 - t0); ``` ### Testing Procedure The `main()` function calls a test function to verify both correctness and monotonicity of the encoding and decoding operations. ```c /* Test encode/decode round-trip */ static bool test(void) { int32_t previous_value = -1; bool passed = true; for (int i = 0; i < 256; i++) { uint8_t fl = i; int32_t value = uf8_decode(fl); uint8_t fl2 = uf8_encode(value); if (fl != fl2) { passed = false; } if (value <= previous_value) { passed = false; } previous_value = value; } return passed; } ``` The `main()` function measures the total number of cycles consumed by this test: ```c int main(void) { uint64_t c1 = get_cycles(); bool pase = test(); uint64_t c2 = get_cycles(); if (pase) { printstr("All tests passed.\n", 19); } else { printstr("Failed.\n", 9); } printstr("Cycles: ", 9); print_dec(c2 - c1); return 0; } ``` This setup ensures that all optimization experiments are evaluated under identical workloads, allowing us to accurately compare the performance impact of different code variants. ### Incorporating Assignment 1 Optimizations In Assignment 1, I applied two main optimizations in `uf8_encode`. The first optimization uses a **lookup table** to accelerate the `clz` operation, and the second converts the `for` loop that computes `overflow` into a **closed-form expression**, eliminating the loop entirely. These optimizations were also discussed in [Handwritten and Compiler-optimized](#Handwritten-and-Compiler-optimized) `clz` ```c const uint8_t clz_tab[256] = { 0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; static inline unsigned clz(uint32_t x) { uint32_t a; // used for offset if (x < (1u << 16)) { // 0x0000_XXXX if (x < (1u << 8)) { // 0x0000_00XX a = 0; } else { // 0x0000_XX00 a = 8; } } else { // 0xXXXX_0000 if (x < (1u << 24)) { // 0x00XX_0000 a = 16; } else { // 0xXX00_0000 a = 24; } } /** * clz_tab is used to get the position of leading one of the selected 8-bit * block(XX) */ return 32 - (clz_tab[x >> a] + a); } ``` `uf8_encode()` ```c= /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; overflow = (overflow << exponent) + (((1 << exponent) - 1) << 4); /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` #### Comparison of results: - Naive version with `-O3`: ``` All tests passed. Cycles: 19315 ``` - With Assignment 1 optimizations, also compiled with `-O3`: ``` All tests passed. Cycles: 14591 ``` ### Attempt 1 From the previous observations of the `-O3` results, I realized that manually optimizing the assembly code would not outperform the compiler. Therefore, I decided to restructure the C code, focusing on the numeric ranges to potentially simplify computations. I observed that the condition on line 16, `if (msb >= 5)`, handles exponent estimation for large numbers, but it does nothing for small numbers. My initial idea was to add an else branch to handle small numbers: ```c else { exponent = 1; overflow = 16; } ``` Since numbers less than 16 are already returned early, and numbers with `msb < 5` have a maximum value of 31, we can compute the `exponent` and `overflow` directly, bypassing the while loops on lines 33–39. This allows us to move the while loops into the scope of the `if (msb >= 5)` branch. ```c= /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; /* Calculate overflow for estimated exponent */ overflow = (overflow << exponent) + (((1 << exponent) - 1) << 4); /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } } else { exponent = 1; overflow = 16; } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` #### Test result: ``` All tests passed. Cycles: 14379 ``` ### Attempt 2 In Attempt 1, `exponent` and `overflow` were initialized before the `if (msb >= 5)` branch. Since we already modified the small-number handling, we can reconsider initializing these variables inside the `if` block, using the `else` values as defaults. This removes the need for a separate else branch. However, notice that on line 23 (Attempt 1), only the calculation of `overflow` depends on its previous value. Because `overflow` was originally initialized to 0, we can safely remove self-dependency: ```c overflow = ((1 << exponent) - 1) << 4; ``` Unfortunately, this optimization had no effect on cycles compared to Attempt 1. ``` All tests passed. Cycles: 14379 ``` ### Attempt 3 In Attempt 2, we removed the dependency in the dynamic calculation of overflow. Further, since the `if (msb >= 5)` branch excludes numbers less than 32, we know for these small numbers we do not need `clz` to decide whether to perform dynamic calculation. Thus, we can move the calculation of `msb` lines (lines 8–10) into the dynamic calculation branch, skipping `clz` entirely for small numbers. :::warning However, the test result is worse than Attempt 2. A possible reason is that small numbers are relatively rare in the test data, so skipping `clz` has minimal benefit. The slight increase in cycle count may require further investigation. ``` All tests passed. Cycles: 15082 ``` ::: Currently, `uf8_encode` can be seen as a combination of **direct assignment** and **dynamic calculation**. The question is how many values should be assigned directly and how many should be computed dynamically, depending on the application scenario and test dataset. ::: warning Below, I list different value ranges and their impact on cycle counts. However, the current test methodology does not allow us to determine the relative merits quantitatively. ::: - **All values computed dynamically:** ```c /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Start from a good initial guess */ uint8_t exponent; uint32_t overflow; /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; // overflow = (overflow << exponent) + (((1 << exponent) - 1) << 4); overflow = ((1 << exponent) - 1) << 4; /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` ``` All tests passed. Cycles: 15098 ``` - **Direct assignment for numbers < 48** (not 32 because values 16 ≤ value < 48 share the same assigned result). ```c /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Start from a good initial guess */ uint8_t exponent; uint32_t overflow; if (value < 48) { exponent = 1; overflow = 16; } else { /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; // overflow = (overflow << exponent) + (((1 << exponent) - 1) << 4); overflow = ((1 << exponent) - 1) << 4; /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` ``` All tests passed. Cycles: 14809 ``` - **Direct assignment for numbers < 496** ```c /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Start from a good initial guess */ uint8_t exponent; uint32_t overflow; if (value < 48) { exponent = 1; overflow = 16; } else if (value < 112) { exponent = 2; overflow = 48; } else if (value < 240) { exponent = 3; overflow = 112; } else if (value < 496){ exponent = 4; overflow = 240; } else { // if (value >= 32) { //if (msb >= 5) { /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; // overflow = (overflow << exponent) + (((1 << exponent) - 1) << 4); overflow = ((1 << exponent) - 1) << 4; /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` ``` All tests passed. Cycles: 12268 ```