Assignment 2: Complete Applications

# Assignment 2: Complete Applications contributed by < [`Shih-Hsuan`](https://github.com/Shih-Hsuan) > --- ## AI Tools Usage I used Genimi for : * Conceptual Explanation and Code debugging * Translation and Technical Writing (Polishing) --- ## Execution Environment * **Host System:** * **VM:** Oracle VirtualBox * **Guest OS:** Ubuntu 24.04 LTS (All compilation and simulation commands were run within this environment). * **Emulator:** * **Name:** `rv32emu` * **Mode:** Compiled with **System Emulation Mode** enabled (`ENABLE_SYSTEM=1`). * **Capabilities:** This mode provides a bare-metal hardware simulation, correctly handling privileged instructions essential for this analysis: * **`ecall`:** Used to invoke system calls, like `ecall 64` (write) for console output. * **`csrr`:** Used to read Control and Status Registers (CSRs), enabling cycle-accurate performance profiling via the `get_cycles` function. * **Target Architecture:** * **ISA:** `RV32I` (Base Integer Instruction Set) * **Key Constraint:** The environment does **not** support the `M` (Integer Multiplication/Division) or `F` (Floating-Point) hardware extensions. * **Toolchain:** * **Compiler Suite:** xPack GNU RISC-V Embedded GCC (v14.2.0-3) * **Process:** The `riscv-none-elf-gcc` compiler and `as` assembler were used (via a `Makefile`) to compile all C (`.c`) and assembly (`.S`) source files. These were then linked via `ld` into a single `test.elf` executable based on a custom linker script (`linker.ld`). --- ## Verifying the System Emulation Environment To support the privileged instructions required by bare-metal programs, the `rv32emu` simulator was first recompiled using the `ENABLE_SYSTEM=1` and `ENABLE_ELF_LOADER=1` flags. This step is essential as it activates the simulation of Machine Mode (M-Mode), which enables the privileged `ecall` (system call) and `csrr` (CSR read) instructions. The [`tests/system/playground`](https://hackmd.io/@sysprog/Sko2Ja5pel#System-Emulation) suite is a comprehensive bare-metal test designed to validate this new environment. It confirms two critical components: * **Privileged Instructions:** It successfully calls `csrr` (e.g., `rdcycle`) via the `perfcounter.S` module, confirming that `ENABLE_SYSTEM=1` is active. * **Core ISA Logic:** It executes the `ChaCha20` encryption algorithm, performing a high-intensity test on the core `RV32I` integer and logical instructions (like `add`, `xor`, `sll`). The `PASSED` result from this suite (shown below) confirms that our test platform—the `rv32emu` simulator—is correctly functional and reliable for the subsequent analysis. **Test Results :** ``` === ChaCha20 Tests === Test 0: ChaCha20 (RISC-V Assembly) Test: ChaCha20 ChaCha20 RFC 7539: PASSED Cycles: 6797 Instructions: 6797 === BFloat16 Tests === Test 1: bf16_add Test: bf16_add 1.0 + 1.0 = 4000 PASSED Cycles: 432 Instructions: 432 Test 2: bf16_sub Test: bf16_sub 3.0 - 2.0 = 3f80 PASSED Cycles: 373 Instructions: 373 Test 3: bf16_mul Test: bf16_mul 2.0 * 3.0 = 40c0 PASSED Cycles: 464 Instructions: 464 Test 4: bf16_div Test: bf16_div 6.0 / 2.0 = 4040 PASSED Cycles: 624 Instructions: 624 Test 5: bf16_special_cases Test: bf16_special_cases bf16_iszero(0): PASSED bf16_isnan(NaN): PASSED bf16_isinf(Inf): PASSED Cycles: 239 Instructions: 239 === All Tests Completed === ``` --- ## 1. HW1 Code Migration: Analysis and Implementation ### Count Leading Zeros Counts the number of consecutive zeros in the binary representation of a 32-bit unsigned integer, starting from the most significant bit (MSB). ```c= static inline unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } ``` This function's core logic is a highly efficient **binary search** strategy, which operates on the 32-bit width of the integer. * **Progressive Halving:** Instead of iterating bit-by-bit, the algorithm progressively halves the search space. It checks chunks of bits in 5 iterations: first `16` bits, then `8`, then `4`, `2`, and finally `1`. * **Search Mechanism:** 1. It tests the **top half** (e.g., the 16 most significant bits) of the current search space. 2. **If a '1' is found** (the half is non-zero): The search narrows down to this top half, and the potential zero count (`n`) is adjusted accordingly. 3. **If it's all zeros:** That half is discarded, and the search continues on the **bottom half** in the next iteration. * **Variables:** The `c` variable represents the search step size (16, 8, 4, 2, 1), and `n` tracks the potential zero count as the search space is refined. This approach finds the most significant '`1`' in logarithmic time (`O(log N)` where `N` is the bit-width), making it extremely fast. #### 1.1 Handwritten Assembly Code :::spoiler clz full code with three test cases ```s= # ---------------------------------------------------- # Definition of Constant # ---------------------------------------------------- .equ STDOUT, 1 .equ WRITE, 64 # ---------------------------------------------------- # Read-Only Data Section (Define all test strings) # ---------------------------------------------------- .section .rodata .align 2 # --- Test 1 Strings --- msg_pass1: .ascii "Test 1 (clz: 0x80000000): PASSED\n" .equ len_pass1, . - msg_pass1 # Calculate string length msg_fail1: .ascii "Test 1 (clz: 0x80000000): FAILED\n" .equ len_fail1, . - msg_fail1 # Calculate string length # --- Test 2 Strings --- msg_pass2: .ascii "Test 2 (clz: 0x00000001): PASSED\n" .equ len_pass2, . - msg_pass2 msg_fail2: .ascii "Test 2 (clz: 0x00000001): FAILED\n" .equ len_fail2, . - msg_fail2 # --- Test 3 Strings --- msg_pass3: .ascii "Test 3 (clz: 0x001A0000): PASSED\n" .equ len_pass3, . - msg_pass3 msg_fail3: .ascii "Test 3 (clz: 0x001A0000): FAILED\n" .equ len_fail3, . - msg_fail3 # ---------------------------------------------------- # Code Section # ---------------------------------------------------- .section .text .align 2 .global main .global clz # ---------------------------------------------------- # Main function # ---------------------------------------------------- main: # Allocate a 16-byte stack frame. # 'sp' is guaranteed to be 16-byte aligned on entry. # Subtracting 16 maintains this alignment. addi sp, sp, -16 sw ra, 12(sp) # store return address (ra) to stack # ------------------- # --- Test Case 1 --- # ------------------- li a0, 0x80000000 # input parameter jal ra, clz # call clz li a1, 0 # expected result beq a0, a1, test1_passed test1_failed: # print FAILED li a0, STDOUT la a1, msg_fail1 li a2, len_fail1 li a7, WRITE ecall j test2_start # jump to next test case test1_passed: # 印出 PASSED li a0, STDOUT la a1, msg_pass1 li a2, len_pass1 li a7, WRITE ecall j test2_start # jump to next test case # ------------------- # --- Test Case 2 --- # ------------------- test2_start: li a0, 0x00000001 # input parameter jal ra, clz # call clz li a1, 31 # expected result beq a0, a1, test2_passed test2_failed: li a0, STDOUT la a1, msg_fail2 li a2, len_fail2 li a7, WRITE ecall j test3_start # jump to next test case test2_passed: li a0, STDOUT la a1, msg_pass2 li a2, len_pass2 li a7, WRITE ecall j test3_start # jump to next test case # ------------------- # --- Test Case 3 --- # ------------------- test3_start: li a0, 0x001A0000 # input parameter jal ra, clz # call clz li a1, 11 # expected result beq a0, a1, test3_passed test3_failed: li a0, STDOUT la a1, msg_fail3 li a2, len_fail3 li a7, WRITE ecall j end_main test3_passed: li a0, STDOUT la a1, msg_pass3 li a2, len_pass3 li a7, WRITE ecall end_main: lw ra, 12(sp) # load return address from stack addi sp, sp, 16 ret # return to start.S # ---------------------------------------------------- # CLZ function # ---------------------------------------------------- clz: addi t0, a0, 0 # t0 = x (we will operate on the value of x in t0) addi t1, x0, 32 # t1 = n = 32 addi t2, x0, 16 # t2 = c = 16 loop_start: srl t3, t0, t2 # t3 = y = x >> c beqz t3, skip_if # if y (t3) == 0, skip the if-block sub t1, t1, t2 # n = n - c addi t0, t3, 0 # x = y skip_if: srli t2, t2, 1 # c = c / 2 bnez t2, loop_start # if c (t2) != 0, continue the loop sub a0, t1, t0 # Store the result (n-x) into a0 for return ret # Return to the caller ``` ::: #### Syscall `write` example ```s= # --- Prepare Arguments for 'write' syscall --- # Argument 1 (fd): STDOUT li a0, 1 # Argument 2 (buf): Load the memory address of the string 'my_string' into a1. la a1, my_string # Argument 3 (count): Load the exact length of the string (12 bytes) into a2. li a2, 12 # --- Prepare Service Code (Syscall Number) --- # Load the service code for 'write' (64) into a7. li a7, 64 # --- Execute Syscall --- # Make the environment call to trigger the 'write' operation. ecall ``` ##### Verification * Modify `Makefile` 1. Change Object files `OBJS = start.o main.o perfcounter.o` 2. Modified the file paths in the Makefile to point to the correct toolchain. (`e.g., ../mk/toolchain.mk`) * **Test Results :** `make run` ALL test cases PASSED ![image](https://hackmd.io/_uploads/rJsi3N8k-x.png =500x) #### 1.2 Compiler Analysis I modified the `CFLAGS` in the `Makefile` to compile the code with different optimization levels (e.g., `-O0`, `-Ofast`, `-Os`). To analyze the compiler's results, I then used the `make dump` target, which disassembles the `test.elf` file and allows for a direct comparison of the generated assembly. ##### 1.2.1 CFLAGS = `-O0` My observation is that the `-O0` flag produces a very literal, "1-to-1" translation of the original C code. The compiler's behavior is clearly prioritized for **debuggability** over efficiency, which is evident in several ways: * **No Register Caching:** The compiler makes no attempt to keep C variables in registers. * **Immediate Stack Spilling:** All C variables are immediately "spilled" from their initial registers (e.g., `a0`) to dedicated locations on the stack frame. * **Constant Memory Access:** After almost every minor operation (often just 1-2 instructions), the result is written back to the stack (`sw`), only to be read back (`lw`) for the very next operation. * **Debug Convenience:** This behavior is highly advantageous for debugging. It ensures that the "source of truth" for every variable is always on the stack, allowing a debugger (like GDB) to pause at any instruction and inspect the value of any variable reliably. * **The Trade-off:** The clear sacrifice for this convenience is a significant loss in both **performance** (due to the high cost of constant, slow memory access) and **code space** (due to the large number of redundant load/store instructions). :::spoiler -O0 version assembly Code ```s= 00010174 <clz_c_func>: # 1. Construct stack frame 10174: fd010113 addi sp,sp,-48 10178: 02112623 sw ra,44(sp) 1017c: 02812423 sw s0,40(sp) 10180: 03010413 addi s0,sp,48 10184: fca42e23 sw a0,-36(s0) # 2. Initialize variables 10188: 02000793 li a5,32 1018c: fef42623 sw a5,-20(s0) 10190: 01000793 li a5,16 10194: fef42423 sw a5,-24(s0) # 3. Do-while loop 10198: fe842783 lw a5,-24(s0) 1019c: fdc42703 lw a4,-36(s0) 101a0: 00f757b3 srl a5,a4,a5 101a4: fef42223 sw a5,-28(s0) 101a8: fe442783 lw a5,-28(s0) 101ac: 00078e63 beqz a5,101c8 <clz_c_func+0x54> 101b0: fec42703 lw a4,-20(s0) 101b4: fe842783 lw a5,-24(s0) 101b8: 40f707b3 sub a5,a4,a5 101bc: fef42623 sw a5,-20(s0) 101c0: fe442783 lw a5,-28(s0) 101c4: fcf42e23 sw a5,-36(s0) 101c8: fe842783 lw a5,-24(s0) 101cc: 4017d793 srai a5,a5,0x1 101d0: fef42423 sw a5,-24(s0) 101d4: fe842783 lw a5,-24(s0) 101d8: fc0790e3 bnez a5,10198 <clz_c_func+0x24> # 4. Calculate the return value 101dc: fec42703 lw a4,-20(s0) 101e0: fdc42783 lw a5,-36(s0) 101e4: 40f707b3 sub a5,a4,a5 101e8: 00078513 mv a0,a5 # 5. Restore stack 101ec: 02c12083 lw ra,44(sp) 101f0: 02812403 lw s0,40(sp) 101f4: 03010113 addi sp,sp,48 101f8: 00008067 ret ``` ::: ##### 1.2.2 CFLAGS = `-Ofast` My analysis of the `-Ofast` output revealed several powerful optimization techniques employed by the compiler. First, I observed that the compiler implemented **full Loop Unrolling**. It recognized that the `do-while` loop had a fixed iteration count (exactly 5 times) and completely eliminated the loop structure. Instead, it "unrolled" the loop by generating five sequential copies of the logic block, thereby removing all loop control overhead (like counters and conditional branches). Second, I identified a clear instance of **Constant Folding**. During the first iteration (where `c=16`), the compiler "pre-calculated" the result of `n -= c`. It knew at compile-time that `n` was initialized to `32`, so it calculated `32 - 16 = 16` on its own. Consequently, rather than generating a `sub` (subtract) instruction, it directly loaded the result `16` into the register using a single `li` (Load Immediate) instruction. Compared to both my handwritten assembly and the verbose `-O0` version, this optimized code achieves a drastic reduction in CPU cycles by eliminating memory access and loop overhead. For me, as someone new to compiler optimization, this was a novel and insightful discovery into how modern compilers achieve high performance. :::spoiler -Ofast version assembly Code ```s= 00010174 <clz_c_func>: 10174: 01055713 srli a4,a0,0x10 10178: 02000793 li a5,32 1017c: 00070663 beqz a4,10188 <clz_c_func+0x14> # if (y == 0) skip 10180: 00070513 mv a0,a4 10184: 01000793 li a5,16 10188: 00855713 srli a4,a0,0x8 1018c: 00070663 beqz a4,10198 <clz_c_func+0x24> # if (y == 0) skip 10190: ff878793 addi a5,a5,-8 10194: 00070513 mv a0,a4 10198: 00455713 srli a4,a0,0x4 1019c: 00070663 beqz a4,101a8 <clz_c_func+0x34> 101a0: ffc78793 addi a5,a5,-4 101a4: 00070513 mv a0,a4 101a8: 00255713 srli a4,a0,0x2 101ac: 00070663 beqz a4,101b8 <clz_c_func+0x44> 101b0: ffe78793 addi a5,a5,-2 101b4: 00070513 mv a0,a4 101b8: 00155713 srli a4,a0,0x1 101bc: 00070663 beqz a4,101c8 <clz_c_func+0x54> 101c0: fff78793 addi a5,a5,-1 101c4: 00070513 mv a0,a4 101c8: 00008067 ret ``` ::: ##### 1.2.3 CFLAGS = `-Os` This optimization level provides a clear contrast to `-Ofast`. Instead of unrolling the loop (which would increase code size), the `-Os` flag **preserves the loop structure**. This is a deliberate trade-off, prioritizing a minimal binary size over the absolute maximum execution speed. Of all the compiled versions, I found this one to be the **most similar to my own handwritten assembly** code. However, I noted one key difference in its implementation: rather than checking `c != 0` to terminate the loop (as in the original C code), the compiler implemented a **fixed 5-iteration counter** (`li a3, 5` and `addi a3, a3, -1`) to control the loop's execution. :::spoiler -Os version assembly Code ```s= 00010174 <clz_c_func>: 10174: 00500693 li a3,5 10178: 01000793 li a5,16 1017c: 02000713 li a4,32 # Loop Body 10180: 00f55633 srl a2,a0,a5 10184: 00060663 beqz a2,10190 <clz_c_func+0x1c> 10188: 40f70733 sub a4,a4,a5 1018c: 00060513 mv a0,a2 # Loop control 10190: fff68693 addi a3,a3,-1 10194: 4017d793 srai a5,a5,0x1 10198: fe0694e3 bnez a3,10180 <clz_c_func+0xc> 1019c: 40a70533 sub a0,a4,a0 101a0: 00008067 ret ``` ::: #### 1.3 Performance Measurement ##### 1.3.1 Test Assembly Code : * Use `get_cycles` function in `perfcounter.S`， and then count the two `clz` function CPU cycles in `main.S`. ```s= main: addi sp, sp, -16 sw ra, 12(sp) # ra : return address sw s0, 8(sp) # s0 : start time # Performance Testing jal ra, get_cycles mv s0, a0 # s0 = start jal ra, clz_c_func # call C function mv t4, a0 # t4 = return value from C function jal ra, get_cycles mv t1, a0 # t1 = end sub t2, t1, s0 # t2 = end - start = CPU cycles ``` ##### 1.3.2 Test Results : * `CFLAGS = -O0` ![image](https://hackmd.io/_uploads/By6DfkB1Wx.png =500x) * `CFLAGS = -Ofast` ![image](https://hackmd.io/_uploads/SJnOGJBJWx.png =500x) * `CFLAGS = -Os` ![image](https://hackmd.io/_uploads/BkR9fJH1-g.png =500x) | Compilation Target | `CFLAGS` | Cycles [decimal] | Cycles [hexcimal] | | :--- | :--- | :--- | :--- | | Unoptimized | `-O0` | **94** | `0x5E` | | Size Optimized | `-Os` | **44** | `0x2C` | | **Handwritten ASM** | (N/A) | **39** | `0x27` | | Speed Optimized | `-Ofast` | **27** | `0x1B` | ##### 1.3.3 Conclusion * **Can we write faster programs in RISC-V assembly ?** * In this specific case, it is highly unlikely. The reason is that the compiler has already performed excellent optimizations: 1. **Loop Unrolling**: Most importantly, by eliminating the conditional `bnez` branch, it prevents any potential pipeline stalls (or "branch penalties") that can occur if the CPU's branch predictor guesses wrong. 2. **Peephole Optimization & Constant Folding**: The compiler "pre-calculated" `n - c` (e.g., `32 - 16`) during the unrolled iterations, replacing a `sub` instruction with a simpler `li` (Load Immediate) instruction. --- ## 2. Quiz 2 --- ## Reference * [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel#Playground-Test-Suite) * [rv32emu syscall](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) * [DAY1: RISC-V: 大綱與特權模式](https://ithelp.ithome.com.tw/m/articles/10289289)