# Assignment 2: Complete Applications
contributed by < [`Shih-Hsuan`](https://github.com/Shih-Hsuan) >
---
## AI Tools Usage
I used Genimi for :
* Conceptual Explanation and Code debugging
* Translation and Technical Writing (Polishing)
---
## Execution Environment
* **Host System:**
* **VM:** Oracle VirtualBox
* **Guest OS:** Ubuntu 24.04 LTS
(All compilation and simulation commands were run within this environment).
* **Emulator:**
* **Name:** `rv32emu`
* **Mode:** Compiled with **System Emulation Mode** enabled (`ENABLE_SYSTEM=1`).
* **Capabilities:** This mode provides a bare-metal hardware simulation, correctly handling privileged instructions essential for this analysis:
* **`ecall`:** Used to invoke system calls, like `ecall 64` (write) for console output.
* **`csrr`:** Used to read Control and Status Registers (CSRs), enabling cycle-accurate performance profiling via the `get_cycles` function.
* **Target Architecture:**
* **ISA:** `RV32I` (Base Integer Instruction Set)
* **Key Constraint:** The environment does **not** support the `M` (Integer Multiplication/Division) or `F` (Floating-Point) hardware extensions.
* **Toolchain:**
* **Compiler Suite:** xPack GNU RISC-V Embedded GCC (v14.2.0-3)
* **Process:** The `riscv-none-elf-gcc` compiler and `as` assembler were used (via a `Makefile`) to compile all C (`.c`) and assembly (`.S`) source files. These were then linked via `ld` into a single `test.elf` executable based on a custom linker script (`linker.ld`).
---
## Verifying the System Emulation Environment
To support the privileged instructions required by bare-metal programs, the `rv32emu` simulator was first recompiled using the `ENABLE_SYSTEM=1` and `ENABLE_ELF_LOADER=1` flags. This step is essential as it activates the simulation of Machine Mode (M-Mode), which enables the privileged `ecall` (system call) and `csrr` (CSR read) instructions.
The [`tests/system/playground`](https://hackmd.io/@sysprog/Sko2Ja5pel#System-Emulation) suite is a comprehensive bare-metal test designed to validate this new environment. It confirms two critical components:
* **Privileged Instructions:** It successfully calls `csrr` (e.g., `rdcycle`) via the `perfcounter.S` module, confirming that `ENABLE_SYSTEM=1` is active.
* **Core ISA Logic:** It executes the `ChaCha20` encryption algorithm, performing a high-intensity test on the core `RV32I` integer and logical instructions (like `add`, `xor`, `sll`).
The `PASSED` result from this suite (shown below) confirms that our test platform—the `rv32emu` simulator—is correctly functional and reliable for the subsequent analysis.
**Test Results :**
```
=== ChaCha20 Tests ===
Test 0: ChaCha20 (RISC-V Assembly)
Test: ChaCha20
ChaCha20 RFC 7539: PASSED
Cycles: 6797
Instructions: 6797
=== BFloat16 Tests ===
Test 1: bf16_add
Test: bf16_add
1.0 + 1.0 = 4000
PASSED
Cycles: 432
Instructions: 432
Test 2: bf16_sub
Test: bf16_sub
3.0 - 2.0 = 3f80
PASSED
Cycles: 373
Instructions: 373
Test 3: bf16_mul
Test: bf16_mul
2.0 * 3.0 = 40c0
PASSED
Cycles: 464
Instructions: 464
Test 4: bf16_div
Test: bf16_div
6.0 / 2.0 = 4040
PASSED
Cycles: 624
Instructions: 624
Test 5: bf16_special_cases
Test: bf16_special_cases
bf16_iszero(0): PASSED
bf16_isnan(NaN): PASSED
bf16_isinf(Inf): PASSED
Cycles: 239
Instructions: 239
=== All Tests Completed ===
```
---
## 1. HW1 Code Migration: Analysis and Implementation
### Count Leading Zeros
Counts the number of consecutive zeros in the binary representation of a 32-bit unsigned integer, starting from the most significant bit (MSB).
```c=
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
```
This function's core logic is a highly efficient **binary search** strategy, which operates on the 32-bit width of the integer.
* **Progressive Halving:** Instead of iterating bit-by-bit, the algorithm progressively halves the search space. It checks chunks of bits in 5 iterations: first `16` bits, then `8`, then `4`, `2`, and finally `1`.
* **Search Mechanism:**
1. It tests the **top half** (e.g., the 16 most significant bits) of the current search space.
2. **If a '1' is found** (the half is non-zero): The search narrows down to this top half, and the potential zero count (`n`) is adjusted accordingly.
3. **If it's all zeros:** That half is discarded, and the search continues on the **bottom half** in the next iteration.
* **Variables:** The `c` variable represents the search step size (16, 8, 4, 2, 1), and `n` tracks the potential zero count as the search space is refined.
This approach finds the most significant '`1`' in logarithmic time (`O(log N)` where `N` is the bit-width), making it extremely fast.
#### 1.1 Handwritten Assembly Code
:::spoiler clz full code with three test cases
```s=
# ----------------------------------------------------
# Definition of Constant
# ----------------------------------------------------
.equ STDOUT, 1
.equ WRITE, 64
# ----------------------------------------------------
# Read-Only Data Section (Define all test strings)
# ----------------------------------------------------
.section .rodata
.align 2
# --- Test 1 Strings ---
msg_pass1:
.ascii "Test 1 (clz: 0x80000000): PASSED\n"
.equ len_pass1, . - msg_pass1 # Calculate string length
msg_fail1:
.ascii "Test 1 (clz: 0x80000000): FAILED\n"
.equ len_fail1, . - msg_fail1 # Calculate string length
# --- Test 2 Strings ---
msg_pass2:
.ascii "Test 2 (clz: 0x00000001): PASSED\n"
.equ len_pass2, . - msg_pass2
msg_fail2:
.ascii "Test 2 (clz: 0x00000001): FAILED\n"
.equ len_fail2, . - msg_fail2
# --- Test 3 Strings ---
msg_pass3:
.ascii "Test 3 (clz: 0x001A0000): PASSED\n"
.equ len_pass3, . - msg_pass3
msg_fail3:
.ascii "Test 3 (clz: 0x001A0000): FAILED\n"
.equ len_fail3, . - msg_fail3
# ----------------------------------------------------
# Code Section
# ----------------------------------------------------
.section .text
.align 2
.global main
.global clz
# ----------------------------------------------------
# Main function
# ----------------------------------------------------
main:
# Allocate a 16-byte stack frame.
# 'sp' is guaranteed to be 16-byte aligned on entry.
# Subtracting 16 maintains this alignment.
addi sp, sp, -16
sw ra, 12(sp) # store return address (ra) to stack
# -------------------
# --- Test Case 1 ---
# -------------------
li a0, 0x80000000 # input parameter
jal ra, clz # call clz
li a1, 0 # expected result
beq a0, a1, test1_passed
test1_failed: # print FAILED
li a0, STDOUT
la a1, msg_fail1
li a2, len_fail1
li a7, WRITE
ecall
j test2_start # jump to next test case
test1_passed: # 印出 PASSED
li a0, STDOUT
la a1, msg_pass1
li a2, len_pass1
li a7, WRITE
ecall
j test2_start # jump to next test case
# -------------------
# --- Test Case 2 ---
# -------------------
test2_start:
li a0, 0x00000001 # input parameter
jal ra, clz # call clz
li a1, 31 # expected result
beq a0, a1, test2_passed
test2_failed:
li a0, STDOUT
la a1, msg_fail2
li a2, len_fail2
li a7, WRITE
ecall
j test3_start # jump to next test case
test2_passed:
li a0, STDOUT
la a1, msg_pass2
li a2, len_pass2
li a7, WRITE
ecall
j test3_start # jump to next test case
# -------------------
# --- Test Case 3 ---
# -------------------
test3_start:
li a0, 0x001A0000 # input parameter
jal ra, clz # call clz
li a1, 11 # expected result
beq a0, a1, test3_passed
test3_failed:
li a0, STDOUT
la a1, msg_fail3
li a2, len_fail3
li a7, WRITE
ecall
j end_main
test3_passed:
li a0, STDOUT
la a1, msg_pass3
li a2, len_pass3
li a7, WRITE
ecall
end_main:
lw ra, 12(sp) # load return address from stack
addi sp, sp, 16
ret # return to start.S
# ----------------------------------------------------
# CLZ function
# ----------------------------------------------------
clz:
addi t0, a0, 0 # t0 = x (we will operate on the value of x in t0)
addi t1, x0, 32 # t1 = n = 32
addi t2, x0, 16 # t2 = c = 16
loop_start:
srl t3, t0, t2 # t3 = y = x >> c
beqz t3, skip_if # if y (t3) == 0, skip the if-block
sub t1, t1, t2 # n = n - c
addi t0, t3, 0 # x = y
skip_if:
srli t2, t2, 1 # c = c / 2
bnez t2, loop_start # if c (t2) != 0, continue the loop
sub a0, t1, t0 # Store the result (n-x) into a0 for return
ret # Return to the caller
```
:::
#### Syscall `write` example
```s=
# --- Prepare Arguments for 'write' syscall ---
# Argument 1 (fd): STDOUT
li a0, 1
# Argument 2 (buf): Load the memory address of the string 'my_string' into a1.
la a1, my_string
# Argument 3 (count): Load the exact length of the string (12 bytes) into a2.
li a2, 12
# --- Prepare Service Code (Syscall Number) ---
# Load the service code for 'write' (64) into a7.
li a7, 64
# --- Execute Syscall ---
# Make the environment call to trigger the 'write' operation.
ecall
```
##### Verification
* Modify `Makefile`
1. Change Object files
`OBJS = start.o main.o perfcounter.o`
2. Modified the file paths in the Makefile to point to the correct toolchain.
(`e.g., ../mk/toolchain.mk`)
* **Test Results :** `make run`
ALL test cases PASSED

#### 1.2 Compiler Analysis
I modified the `CFLAGS` in the `Makefile` to compile the code with different optimization levels (e.g., `-O0`, `-Ofast`, `-Os`). To analyze the compiler's results, I then used the `make dump` target, which disassembles the `test.elf` file and allows for a direct comparison of the generated assembly.
##### 1.2.1 CFLAGS = `-O0`
My observation is that the `-O0` flag produces a very literal, "1-to-1" translation of the original C code. The compiler's behavior is clearly prioritized for **debuggability** over efficiency, which is evident in several ways:
* **No Register Caching:** The compiler makes no attempt to keep C variables in registers.
* **Immediate Stack Spilling:** All C variables are immediately "spilled" from their initial registers (e.g., `a0`) to dedicated locations on the stack frame.
* **Constant Memory Access:** After almost every minor operation (often just 1-2 instructions), the result is written back to the stack (`sw`), only to be read back (`lw`) for the very next operation.
* **Debug Convenience:** This behavior is highly advantageous for debugging. It ensures that the "source of truth" for every variable is always on the stack, allowing a debugger (like GDB) to pause at any instruction and inspect the value of any variable reliably.
* **The Trade-off:** The clear sacrifice for this convenience is a significant loss in both **performance** (due to the high cost of constant, slow memory access) and **code space** (due to the large number of redundant load/store instructions).
:::spoiler -O0 version assembly Code
```s=
00010174 <clz_c_func>:
# 1. Construct stack frame
10174: fd010113 addi sp,sp,-48
10178: 02112623 sw ra,44(sp)
1017c: 02812423 sw s0,40(sp)
10180: 03010413 addi s0,sp,48
10184: fca42e23 sw a0,-36(s0)
# 2. Initialize variables
10188: 02000793 li a5,32
1018c: fef42623 sw a5,-20(s0)
10190: 01000793 li a5,16
10194: fef42423 sw a5,-24(s0)
# 3. Do-while loop
10198: fe842783 lw a5,-24(s0)
1019c: fdc42703 lw a4,-36(s0)
101a0: 00f757b3 srl a5,a4,a5
101a4: fef42223 sw a5,-28(s0)
101a8: fe442783 lw a5,-28(s0)
101ac: 00078e63 beqz a5,101c8 <clz_c_func+0x54>
101b0: fec42703 lw a4,-20(s0)
101b4: fe842783 lw a5,-24(s0)
101b8: 40f707b3 sub a5,a4,a5
101bc: fef42623 sw a5,-20(s0)
101c0: fe442783 lw a5,-28(s0)
101c4: fcf42e23 sw a5,-36(s0)
101c8: fe842783 lw a5,-24(s0)
101cc: 4017d793 srai a5,a5,0x1
101d0: fef42423 sw a5,-24(s0)
101d4: fe842783 lw a5,-24(s0)
101d8: fc0790e3 bnez a5,10198 <clz_c_func+0x24>
# 4. Calculate the return value
101dc: fec42703 lw a4,-20(s0)
101e0: fdc42783 lw a5,-36(s0)
101e4: 40f707b3 sub a5,a4,a5
101e8: 00078513 mv a0,a5
# 5. Restore stack
101ec: 02c12083 lw ra,44(sp)
101f0: 02812403 lw s0,40(sp)
101f4: 03010113 addi sp,sp,48
101f8: 00008067 ret
```
:::
##### 1.2.2 CFLAGS = `-Ofast`
My analysis of the `-Ofast` output revealed several powerful optimization techniques employed by the compiler.
First, I observed that the compiler implemented **full Loop Unrolling**. It recognized that the `do-while` loop had a fixed iteration count (exactly 5 times) and completely eliminated the loop structure. Instead, it "unrolled" the loop by generating five sequential copies of the logic block, thereby removing all loop control overhead (like counters and conditional branches).
Second, I identified a clear instance of **Constant Folding**. During the first iteration (where `c=16`), the compiler "pre-calculated" the result of `n -= c`. It knew at compile-time that `n` was initialized to `32`, so it calculated `32 - 16 = 16` on its own. Consequently, rather than generating a `sub` (subtract) instruction, it directly loaded the result `16` into the register using a single `li` (Load Immediate) instruction.
Compared to both my handwritten assembly and the verbose `-O0` version, this optimized code achieves a drastic reduction in CPU cycles by eliminating memory access and loop overhead. For me, as someone new to compiler optimization, this was a novel and insightful discovery into how modern compilers achieve high performance.
:::spoiler -Ofast version assembly Code
```s=
00010174 <clz_c_func>:
10174: 01055713 srli a4,a0,0x10
10178: 02000793 li a5,32
1017c: 00070663 beqz a4,10188 <clz_c_func+0x14> # if (y == 0) skip
10180: 00070513 mv a0,a4
10184: 01000793 li a5,16
10188: 00855713 srli a4,a0,0x8
1018c: 00070663 beqz a4,10198 <clz_c_func+0x24> # if (y == 0) skip
10190: ff878793 addi a5,a5,-8
10194: 00070513 mv a0,a4
10198: 00455713 srli a4,a0,0x4
1019c: 00070663 beqz a4,101a8 <clz_c_func+0x34>
101a0: ffc78793 addi a5,a5,-4
101a4: 00070513 mv a0,a4
101a8: 00255713 srli a4,a0,0x2
101ac: 00070663 beqz a4,101b8 <clz_c_func+0x44>
101b0: ffe78793 addi a5,a5,-2
101b4: 00070513 mv a0,a4
101b8: 00155713 srli a4,a0,0x1
101bc: 00070663 beqz a4,101c8 <clz_c_func+0x54>
101c0: fff78793 addi a5,a5,-1
101c4: 00070513 mv a0,a4
101c8: 00008067 ret
```
:::
##### 1.2.3 CFLAGS = `-Os`
This optimization level provides a clear contrast to `-Ofast`. Instead of unrolling the loop (which would increase code size), the `-Os` flag **preserves the loop structure**. This is a deliberate trade-off, prioritizing a minimal binary size over the absolute maximum execution speed.
Of all the compiled versions, I found this one to be the **most similar to my own handwritten assembly** code.
However, I noted one key difference in its implementation: rather than checking `c != 0` to terminate the loop (as in the original C code), the compiler implemented a **fixed 5-iteration counter** (`li a3, 5` and `addi a3, a3, -1`) to control the loop's execution.
:::spoiler -Os version assembly Code
```s=
00010174 <clz_c_func>:
10174: 00500693 li a3,5
10178: 01000793 li a5,16
1017c: 02000713 li a4,32
# Loop Body
10180: 00f55633 srl a2,a0,a5
10184: 00060663 beqz a2,10190 <clz_c_func+0x1c>
10188: 40f70733 sub a4,a4,a5
1018c: 00060513 mv a0,a2
# Loop control
10190: fff68693 addi a3,a3,-1
10194: 4017d793 srai a5,a5,0x1
10198: fe0694e3 bnez a3,10180 <clz_c_func+0xc>
1019c: 40a70533 sub a0,a4,a0
101a0: 00008067 ret
```
:::
#### 1.3 Performance Measurement
##### 1.3.1 Test Assembly Code :
* Use `get_cycles` function in `perfcounter.S`,
and then count the two `clz` function CPU cycles in `main.S`.
```s=
main:
addi sp, sp, -16
sw ra, 12(sp) # ra : return address
sw s0, 8(sp) # s0 : start time
# Performance Testing
jal ra, get_cycles
mv s0, a0 # s0 = start
jal ra, clz_c_func # call C function
mv t4, a0 # t4 = return value from C function
jal ra, get_cycles
mv t1, a0 # t1 = end
sub t2, t1, s0 # t2 = end - start = CPU cycles
```
##### 1.3.2 Test Results :
* `CFLAGS = -O0`

* `CFLAGS = -Ofast`

* `CFLAGS = -Os`

| Compilation Target | `CFLAGS` | Cycles [decimal] | Cycles [hexcimal] |
| :--- | :--- | :--- | :--- |
| Unoptimized | `-O0` | **94** | `0x5E` |
| Size Optimized | `-Os` | **44** | `0x2C` |
| **Handwritten ASM** | (N/A) | **39** | `0x27` |
| Speed Optimized | `-Ofast` | **27** | `0x1B` |
##### 1.3.3 Conclusion
* **Can we write faster programs in RISC-V assembly ?**
* In this specific case, it is highly unlikely.
The reason is that the compiler has already performed excellent optimizations:
1. **Loop Unrolling**: Most importantly, by eliminating the conditional `bnez` branch, it prevents any potential pipeline stalls (or "branch penalties") that can occur if the CPU's branch predictor guesses wrong.
2. **Peephole Optimization & Constant Folding**: The compiler "pre-calculated" `n - c` (e.g., `32 - 16`) during the unrolled iterations, replacing a `sub` instruction with a simpler `li` (Load Immediate) instruction.
---
## 2. Quiz 2
---
## Reference
* [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel#Playground-Test-Suite)
* [rv32emu syscall](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md)
* [DAY1: RISC-V: 大綱與特權模式](https://ithelp.ithome.com.tw/m/articles/10289289)