Assignment 2 - HackMD

<h1> Assignment 2: Complete Applications</h1> contributed by <[`hsuhsuhs`](https://github.com/hsuhsuhs/2025_NCKU_Computer_Architecture/tree/main/Assignment2)> (The complete source code is available on GitHub) --- [toc] ## Environment Setup This project requires a RISC-V (RV32I) toolchain for cross-compilation and a compatible simulator to execute the bare-metal code. ### a. RISC-V GNU Toolchain This is the cross-compiler, linker, and assembler needed to turn C and Assembly source code into a RISC-V ELF executable. * **Toolchain:** `riscv-none-elf-gcc` * **Installation:** I downloaded the pre-built binary for macOS (x86_64-apple-darwin) provided by the course. * **Configuration:** After unzipping the toolchain to `$HOME/riscv-none-elf-gcc`, I updated my shell's `PATH` environment variable to include it. This step is crucial for the `make` command to find the correct compiler. ```bash # This command must be run in every new terminal session source $HOME/riscv-none-elf-gcc/setenv ``` ### b. RISC-V Simulator (rv32emu) This is the "virtual hardware" that will run our `test.elf` executable. * **Simulator:** `rv32emu` * **Installation:** I cloned the `rv32emu` repository from GitHub: ```bash git clone https://github.com/sysprog21/rv32emu.git cd rv32emu ``` * **Configuration :** The project requires System Emulation (for `ecall` support, enabling `printf`) and **Cycle-Accurate Profiling** (for `rdcycle` / `get_cycles()`). These are not enabled by default. I built the "deluxe" version of the emulator using the following command, as required by Lab 2 and the project specifications: ```bash # Clean any previous builds make distclean # Build with System and ELF Loader features enabled make ENABLE_SYSTEM=1 ENABLE_ELF_LOADER=1 ``` * **Verification:** This process created the simulator executable at `build/rv32emu`. I verified the complete setup by successfully compiling and running the `tests/system/playground` example. # HOMEWORK1_B ## hw1_B.c This section describes the process of porting `hw1_B.c` to the `rv32emu` bare-metal environment and establishing a baseline performance metric. ### a. Porting Process To run `hw1_B.c` in a bare-metal environment, several modifications were required, as there is no C Standard Library (libc). 1. **Test Harness Creation:** * The `main()` function in `my_project/main.c` was designated as the Test Harness. * `hw1_B.c`'s original `main()` was renamed to `hw1_B_main()` to prevent a linker conflict. 2. **Makefile Configuration:** * The `Makefile` was updated to include `hw1_B.o` in the `OBJS` list. * All `playground`-specific relative paths (e.g., `../../../build`) were corrected to `../build` to match the `my_project` directory structure. 3. **Bare-Metal LibC Stubbing :** * The linker would fail because standard functions (`printf`, `puts`, `strlen`) are not available. * To solve this, I implemented "stubs" in `main.c` for: * **`printf(const char *format, ...)`**: A simplified version that only prints the format string. * **`puts(const char *s)`**: A version that prints the string and a newline. * **`strlen(const char *s)`**: A manual implementation to support `printf` and `puts`. * I also included `#include <stddef.h>` to provide the necessary `size_t` type. 4. **Software Multiplication (RV32M Emulation):** * The project is compiled with `-march=rv32i_zicsr`, disabling the `M` extension (hardware multiplier). * `gcc` automatically converts `*` operations into calls to a helper function, `__mulsi3`. * To satisfy the linker, I retained the `umul` and `__mulsi3` software multiplication functions from the `playground`'s `main.c`. ### b. Baseline Performance After porting, the program was compiled with the default **`-g` (debug)** flag to establish a non-optimized baseline. 1. **Compiler Flags:** `CFLAGS = -g -march=rv32i_zicsr` 2. **Result:** The following output shows the program running correctly and the resulting cycle count. * cycle counts = $71506$ ![截圖 2025-11-03 晚上11.29.17](https://hackmd.io/_uploads/rkJP9r8kWx.png) ### c-1. Optimization (optimized for speed) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed). 1. **Compiler Flags:** `CFLAGS = -Ofast -march=rv32i_zicsr` 2. **Result:** After running `make clean` to force a re-compilation, the new results were as follows. * cycle counts = $18997$ ![截圖 2025-11-04 凌晨12.36.01](https://hackmd.io/_uploads/S1NZ9I81Ze.png) #### Analysis 1: Performance Comparison This simple flag change resulted in a dramatic performance improvement: | CFLAGS | Total Cycles | Speedup | | :--- | :--- | :--- | | `-g` (Baseline) | 71,506 | 1.00x | | `-Ofast` (Optimized speed) | 18,997 | 3.76x | #### Analysis 2: Over-Optimization Challenge (The `volatile` Fix) A significant challenge occurred during the `-Ofast` test. The compiler was so aggressive that it identified my performance-measuring code (`get_cycles()`, `TEST_LOGGER`) as "Dead Code" and **optimized it away entirely**, resulting in no cycle count output. Two fixes were required in `main.c` to prevent this: 1. **`volatile uint64_t`**: I added the `volatile` keyword to `start_cycles` and `end_cycles`. This tells the compiler that these variables are important and must not be optimized away. 2. **`asm "memory"` clber**: I added `"memory"` to the clobber list of the `printstr` assembly macro. This tells the compiler that the `ecall` has "side effects" (I/O) and that any code related to it (like `TEST_LOGGER`) must not be deleted. #### Analysis 3: Assembly-Level Comparison (-g vs -Ofast) To understand *how* the 3.76x speedup was achieved, I saved the assembly output for both versions and compared the `hw1_B_main` function. **Observation 1: Register Allocation** * **`-g` (Original):** The debug build is bloated with `sw` (store) and `lw` (load) instructions, especially at the start and end of the function (the "prologue" and "epilogue"). It spills *all* C variables from fast registers to the slow stack (memory) for debugging. * **`-Ofast` (Optimized):** These `sw`/`lw` instructions are gone. The compiler keeps variables (like the loop counter `i` and `previous_value`) in fast CPU registers, eliminating slow memory acce*ss. This is a primary source of the speedup. **Observation 2: Function Inlining** * **`-g` (Original):** The `hw1_B.c` source code has a `for` loop that runs 256 times. Inside this loop, it calls `uf8_decode()` and `uf8_encode()` every single time. This results in at least `256 * 2 = 512` slow `jal` (call) instructions. * **`-Ofast` (Optimized):** These `jal` instructions are gone. The compiler performed aggressive **"inlining"**—it "pasted" the assembly code from `test()`, `clz()`, `uf8_decode()`, and `uf8_encode()` directly into `hw1_B_main`. ### c-2. Optimization (optimize for Size) To measure the impact of compiler optimization, I modified the Makefile to change the CFLAGS from -g (debug) to -Os (optimize for size). This flag instructs the compiler to prioritize making the final `test.elf` binary as small as possible, sometimes at the cost of speed. 1. **Compiler Flags:** `CFLAGS = -Os -march=rv32i_zicsr` 2. **Result : Program Failure** After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly. * Infinite Loop ![截圖 2025-11-04 凌晨2.36.44](https://hackmd.io/_uploads/rJ_ILOUJZl.png) #### Analysis 1: This is the most significant finding. The program successfully printed `=== [Running hw1B] ===` (from `main.c`) but never returned from the `hw1_B_main()` call. This proves that `gcc -Os` generated faulty assembly code for `hw1_B.c`. #### Analysis 2: The Compiler Optimization Bug By saving the assembly (`dump_hw1_B_size_Os.txt`) and analyzing `hw1_B_main`, I discovered the cause: 1. **`-Os` (Optimize for Size)** decided NOT to "inline" the `uf8_decode` and `uf8_encode` functions because inlining would increase the file size. Instead, it kept the slow `jal` (call) instructions inside the 256-iteration `for` loop. 2. At the same time, to save space, `-Os` also removed the "safe" but "bloated" register-saving code (`sw`/`lw`) that `-g` uses. 3. **This created a bug:** * The `uf8_encode` function (the "callee") uses the same registers as `hw1_B_main` (the "caller"). When `uf8_encode` runs, it corrupts the registers that `hw1_B_main` was using for its loop counter (`i`). * When the `uf8_encode` function returns, the loop counter `i` is no longer correct, causing the loop condition (`i < 256`) to never be met, resulting in an **infinite loop**. ## hw1_B.S This section describes the process of porting `hw1_B.S` to the rv32emu bare-metal environment and establishing a baseline performance metric. ### a. Integration Process 1. **Assembly Interface**: * The `.global hw1_B_main` directive was used in `hw1_B.S` to export the label, making it visible to the linker. * The original `j` and `ecall` instructions at the end of the logic were replaced with a `ret` instruction. This allows control to be passed back to the `main()` C function. 2. **Makefile Configuration:** * Makefile was updated by add `hw1_B.o` in the OBJS list. ### b. Baseline Performance After integration, the program was compiled with the Makefile's default `-g` flag to establish a non-optimized baseline for `hw1_B` * **Compiler Flags:** `CFLAGS = -g -march=rv32i_zicsr` * **Result:** The following output shows the `hw1_B` program running correctly and the resulting baseline cycle count. * cycle counts = $7134$ * instruction counts = $7134$ ![截圖 2025-11-08 清晨5.31.16](https://hackmd.io/_uploads/BJ0SUxnkWx.png) ### c-1. Optimization (optimized for speed) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed). * **Compiler Flags:** `CFLAGS = -Ofast -march=rv32i_zicsr` * **Result:** The following output shows the `hw1_B` program running correctly and the resulting baseline cycle count. * cycle counts = $1881$ * instruction counts = $1881$ ![截圖 2025-11-08 清晨6.47.39](https://hackmd.io/_uploads/B1j7wlnyZe.png) #### Analysis 1: Performance Comparison This simple flag change resulted in a dramatic performance improvement: | CFLAGS | Total Cycles | Speedup | | :--- | :--- | :--- | | `-g` (Baseline) | 7,134 | 1.00x | | `-Ofast` (Optimized speed) | 1,881 | 3.79x | ## Conclusion | CFLAGS | Total Cycles | Speedup | |:---------------------------------------------------------- |:------------ |:--------- | | `-g` (**hw1_B.c**) | 71,506 | **1.00x** | | `-Ofast` (**hw1_B.c**) | 18,997 | **3.76x** | | `-Os` (**hw1_B.c**) | fail | **fail** | | - | - | - | | `-g` (**hw1_B.S**) | 7,134 | **1.00x** | | `-Ofast` (**hw1_B.S**) | 1,881 | **3.79x** | | `-Os` (**hw1_B.S**) | fail | **fail** | 1. **C vs. Assembly**: The C code was 10x slower, not because C is slow, but because the compiler was forced to use an extremely slow software fallback. 2. **Compiler Intelligence**:`-Ofast` shows a key value of compilers : eliminating the cost of function calls. * Similar ~3.7x Speedup 3. On a highly constrained hardware target, a superior algorithm written in assembly can easily outperform an optimized C compiler that is forced to rely on a generic, slow, software fallback. --- # QUIZ2_A ## qz2_A.c This section describes the process of porting Quiz2 A program (`qz2_A.c`) to the rv32emu bare-metal environment and establishing its baseline performance. ### a. Porting Process Porting `qz2_A.c` required a *different* set of modifications than `hw1_B.c` due to its use of C intrinsics and formatted printing. * **Test Harness Creation:** * The `main()` function in `main.c` was reconfigured to act as the test harness for Quiz2_A. * `qz2_A.c`'s original `main()` was renamed to `qz2_A_main()` to prevent a linker conflict. * **Makefile Configuration:** * The Makefile was updated to include `quiz2_A.o` in the `OBJS` list. * **Bare-Metal LibC Stubbing (The "printf" Challenge):** * `qz2_A.c`'s `print_move` function originally used `printf("Move Disk %d from '%c' to '%c'\n", ...)` for formatted output. * Bare-metal `printf` stub cannot handle format specifiers (`%d`, `%c`). * **Solution:** 1. **Upgrading `main.c`:** I added two new global helper functions, `print_int()` (by removing `static` from `print_dec`) and `print_char()`. 2. **Refactoring `qz2_A.c`:** I manually rewrote the `print_move` function to "unroll" the single `printf` call into 7 separate calls to our simpler `printf` (for strings), `print_int` (for the disk number), and `print_char` (for the peg letters). * **Software Popcount (`libgcc` Stubbing):** * The `qz2_A.c` code uses the `__builtin_popcount()` intrinsic (a GCC-specific function) to identify which disk to move. * Because we are on `RV32I` and are not linking the standard C library (`libgcc`), this created an "undefined reference to `__popcountsi2`" linker error. * **Solution:** To satisfy the linker, I added a manual C implementation of the `__popcountsi2` function directly to `main.c`. ### b. Baseline Performance After porting, the program was compiled with the `-g` (debug) flag to establish the non-optimized baseline for `quiz2_A`. * **Compiler Flags:** `CFLAGS = -g -march=rv32i_zicsr` * **Result:** The following output shows the `qz2_A` program running correctly and the resulting baseline cycle count. * cycle counts = $12817$ ![截圖 2025-11-06 凌晨1.24.58](https://hackmd.io/_uploads/Bk7wnbKJbe.png) ### c-1. Optimization (optimized for speed) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed). 1. **Compiler Flags:** `CFLAGS = -Ofast -march=rv32i_zicsr` 2. **Result:** After running `make clean` to force a re-compilation, the new results were as follows. * cycle counts = $6400$ ![截圖 2025-11-06 凌晨1.50.01](https://hackmd.io/_uploads/H1x0X0YyWg.png) #### Analysis 1: Performance Comparison The optimization provided a significant **2.00x** speedup for the `qz2_A` program. | CFLAGS | Total Cycles | Speedup | | :--- | :--- | :--- | | `-g` (Baseline) | 12,817 | 1.00x | | `-Ofast` (Optimized speed) | 6,400 | 2.00x | #### Analysis 2: Assembly-Level Comparison (`-g` vs `-Ofast`) To understand how the 2.00x speedup was (and was not) achieved, I compared the `dump_qz2_A_original_g.txt` and `dump_qz2_A_optimized_Ofast.txt` files. **Observation 1: Register Allocation** * **`-g` (Original):** The `dump_qz2_A_original_g.txt` shows `qz2_A_main` (at `10770`) is extremely bloated. It allocates 112 bytes on the stack and is filled with `sw` and `lw` instructions inside the main `for (n_moves ...)` loop and the nested `for (p ...)` / `for (d ...)` loops. It spills all C variables to the slow stack on every iteration. * **`-Ofast` (Optimized):** The `dump_qz2_A_optimized_Ofast.txt` is much cleaner. `gcc -Ofast` completely removed the `sw`/`lw` instructions inside the loops. It performed aggressive Register Allocation, keeping all loop counters and state variables permanently in fast CPU registers. This is the source of the 2.00x speedup. **Observation 2: The Inlining Bottleneck (What was "NOT" optimized)** * **The Problem:** The `qz2_A` (Hanoi) program is very "chatty". Its main bottleneck is `(7 moves * 7 print calls) + 7 popcounts = 56` `jal` instructions inside its main loop. * **The `Ofast` Dump:** When looking at `dump_qz2_A_optimized_Ofast.txt`, these `jal` calls are still present: ```assembly 10374: e8dff0ef jal 10200 <printf> 1037c: cf1ff0ef jal 1006c <print_int> 1038c: da1ff0ef jal 1012c <print_char> 103dc: e95ff0ef jal 10270 <__popcountsi2> ### c-2. Optimization (optimized for size) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Os` (optimize for size). 1. **Compiler Flags:** `CFLAGS = -Os -march=rv32i_zicsr` 2. **Result : Program Failure** After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly. * Infinite Loop ![截圖 2025-11-08 晚上9.06.58](https://hackmd.io/_uploads/BkodPa2yZx.png) #### Analysis 1: This is the most significant finding. The program successfully printed `=== [Running qz2_A] ===` but never returned from the `qz2_A_main()` call. This proves that `gcc -Os` generated faulty assembly code by breaking the RISC-V calling convention. ## qz2_A.S This section describes the process of porting `qz2_A.S` to the rv32emu bare-metal environment and establishing a baseline performance metric. ### a. Integration Process 1. **Assembly Interface**: * The `.global qz2_A_main` directive was used in `qz2_A.S` to export the label, making it visible to the linker. 2. **Makefile Configuration:** * Makefile was updated by add `qz2_A.o` in the OBJS list. ### b. Baseline Performance After integration, the program was compiled with the Makefile's default `-g` flag to establish a non-optimized baseline for `hw1_B` * **Compiler Flags:** `CFLAGS = -g -march=rv32i_zicsr` * **Result:** The following output shows the `hw1_B` program running correctly and the resulting baseline cycle count. * cycle counts = $1219$ * instruction counts = $1219$ ![截圖 2025-11-08 晚上11.11.05](https://hackmd.io/_uploads/S16JkJTJ-g.png) ### c-1. Optimization (optimized for speed) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed). 1. **Compiler Flags:** `CFLAGS = -Ofast -march=rv32i_zicsr` 2. **Result:** After running `make clean` to force a re-compilation, the new results were as follows. * cycle counts = $450$ * instruction counts = $450$ ![截圖 2025-11-08 晚上11.20.25](https://hackmd.io/_uploads/B1rggya1We.png) ### c-2. Optimization (optimized for size) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Os` (optimize for size). 1. **Compiler Flags:** `CFLAGS = -Os -march=rv32i_zicsr` 2. **Result : Program Failure** After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly. * Infinite Loop ![截圖 2025-11-08 晚上11.24.08](https://hackmd.io/_uploads/BJ2oxkpyWl.png) ## Conclusion | CFLAGS | Total Cycles | Speedup | |:---------------------------------------------------------- |:------------ |:--------- | | `-g` (**qz2_A.c**) | 12,817 | **1.00x** | | `-Ofast` (**qz2_A.c**) | 6,400 | **2.00x** | | `-Os` (**qz2_A.c**) | fail | **fail** | | - | - | - | | `-g` (**qz2_A.S**) | 1,219 | **1.00x** | | `-Ofast` (**qz2_A.S**) | 450 | **2.70x** | | `-Os` (**qz2_A.S**) | fail | **fail** | # QUIZ3_C ## qz3_C.c This section describes the process of porting Quiz2 A program (`qz3_C.c`) to the rv32emu bare-metal environment and establishing its baseline performance. （Similar to the previous approach, the test is performed directly here.） ### a. Baseline Performance After integration, the program was compiled with the Makefile's default `-g` flag to establish a non-optimized baseline for `qz3_C` * **Compiler Flags:** `CFLAGS = -g -march=rv32i_zicsr` * **Result:** The following output shows the `qz3_C` program running correctly and the resulting baseline cycle count. * cycle counts = $31001$ * instruction counts = $31001$ ![截圖 2025-11-09 凌晨1.55.32](https://hackmd.io/_uploads/rJbGHbp1Wx.png) ### b-1. Optimization (optimized for speed) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed). 1. **Compiler Flags:** `CFLAGS = -Ofast -march=rv32i_zicsr` 2. **Result:** After running `make clean` to force a re-compilation, the new results were as follows. * cycle counts = $12962$ * instruction counts = $12962$ ![截圖 2025-11-09 凌晨2.00.34](https://hackmd.io/_uploads/SkcLBZT1Wx.png) ### b-2. Optimization (optimized for size) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Os` (optimize for size). 1. **Compiler Flags:** `CFLAGS = -Os -march=rv32i_zicsr` 2. **Result : Program Failure** After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly. * Infinite Loop ![截圖 2025-11-09 凌晨2.01.42](https://hackmd.io/_uploads/HyKqH-TkWx.png) ## qz3_C.S This section describes the process of porting `qz3_C.S` to the rv32emu bare-metal environment and establishing a baseline performance metric. ### a. Baseline Performance After integration, the program was compiled with the Makefile's default `-g` flag to establish a non-optimized baseline for `qz3_C` * **Compiler Flags:** `CFLAGS = -g -march=rv32i_zicsr` * **Result:** The following output shows the `qz3_C` program running correctly and the resulting baseline cycle count. * cycle counts = $26757$ * instruction counts = $26757$ ![截圖 2025-11-09 凌晨2.34.12](https://hackmd.io/_uploads/BJyCTb6Jbg.png) ### c-1. Optimization (optimized for speed) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed). 1. **Compiler Flags:** `CFLAGS = -Ofast -march=rv32i_zicsr` 2. **Result:** After running `make clean` to force a re-compilation, the new results were as follows. * cycle counts = $12332$ * instruction counts = $12332$ ![截圖 2025-11-09 凌晨2.34.52](https://hackmd.io/_uploads/SyGxC-T1-l.png) ### b-2. Optimization (optimized for size) To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Os` (optimize for size). 1. **Compiler Flags:** `CFLAGS = -Os -march=rv32i_zicsr` 2. **Result : Program Failure** After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly. * Infinite Loop ![截圖 2025-11-09 凌晨2.35.22](https://hackmd.io/_uploads/rkyMRZpJWl.png) ## Conclusion | CFLAGS | Total Cycles | Speedup | |:---------------------------------------------------------- |:------------ |:--------- | | `-g` (**qz3_C.c**) | 31.001 | **1.00x** | | `-Ofast` (**qz3_C.c**) | 12,962 | **2.39x** | | `-Os` (**qz3_C.c**) | fail | **fail** | | - | - | - | | `-g` (**qz3_C.S**) | 26,757 | **1.00x** | | `-Ofast` (**qz3_C.S**) | 12,332 | **2.16x** | | `-Os` (**qz3_C.S**) | fail | **fail** | --- --- Reference 1. [Assignment2: Complete Applications](https://hackmd.io/@sysprog/2025-arch-homework2)