<h1> Assignment 2: Complete Applications</h1>
contributed by <[`hsuhsuhs`](https://github.com/hsuhsuhs/2025_NCKU_Computer_Architecture/tree/main/Assignment2)>
(The complete source code is available on GitHub)
---
[toc]
## Environment Setup
This project requires a RISC-V (RV32I) toolchain for cross-compilation and a compatible simulator to execute the bare-metal code.
### a. RISC-V GNU Toolchain
This is the cross-compiler, linker, and assembler needed to turn C and Assembly source code into a RISC-V ELF executable.
* **Toolchain:** `riscv-none-elf-gcc`
* **Installation:** I downloaded the pre-built binary for macOS (x86_64-apple-darwin) provided by the course.
* **Configuration:** After unzipping the toolchain to `$HOME/riscv-none-elf-gcc`, I updated my shell's `PATH` environment variable to include it. This step is crucial for the `make` command to find the correct compiler.
```bash
# This command must be run in every new terminal session
source $HOME/riscv-none-elf-gcc/setenv
```
### b. RISC-V Simulator (rv32emu)
This is the "virtual hardware" that will run our `test.elf` executable.
* **Simulator:** `rv32emu`
* **Installation:** I cloned the `rv32emu` repository from GitHub:
```bash
git clone https://github.com/sysprog21/rv32emu.git
cd rv32emu
```
* **Configuration :**
The project requires System Emulation (for `ecall` support, enabling `printf`) and **Cycle-Accurate Profiling** (for `rdcycle` / `get_cycles()`). These are not enabled by default.
I built the "deluxe" version of the emulator using the following command, as required by Lab 2 and the project specifications:
```bash
# Clean any previous builds
make distclean
# Build with System and ELF Loader features enabled
make ENABLE_SYSTEM=1 ENABLE_ELF_LOADER=1
```
* **Verification:** This process created the simulator executable at `build/rv32emu`. I verified the complete setup by successfully compiling and running the `tests/system/playground` example.
# HOMEWORK1_B
## hw1_B.c
This section describes the process of porting `hw1_B.c` to the `rv32emu` bare-metal environment and establishing a baseline performance metric.
### a. Porting Process
To run `hw1_B.c` in a bare-metal environment, several modifications were required, as there is no C Standard Library (libc).
1. **Test Harness Creation:**
* The `main()` function in `my_project/main.c` was designated as the Test Harness.
* `hw1_B.c`'s original `main()` was renamed to `hw1_B_main()` to prevent a linker conflict.
2. **Makefile Configuration:**
* The `Makefile` was updated to include `hw1_B.o` in the `OBJS` list.
* All `playground`-specific relative paths (e.g., `../../../build`) were corrected to `../build` to match the `my_project` directory structure.
3. **Bare-Metal LibC Stubbing :**
* The linker would fail because standard functions (`printf`, `puts`, `strlen`) are not available.
* To solve this, I implemented "stubs" in `main.c` for:
* **`printf(const char *format, ...)`**: A simplified version that only prints the format string.
* **`puts(const char *s)`**: A version that prints the string and a newline.
* **`strlen(const char *s)`**: A manual implementation to support `printf` and `puts`.
* I also included `#include <stddef.h>` to provide the necessary `size_t` type.
4. **Software Multiplication (RV32M Emulation):**
* The project is compiled with `-march=rv32i_zicsr`, disabling the `M` extension (hardware multiplier).
* `gcc` automatically converts `*` operations into calls to a helper function, `__mulsi3`.
* To satisfy the linker, I retained the `umul` and `__mulsi3` software multiplication functions from the `playground`'s `main.c`.
### b. Baseline Performance
After porting, the program was compiled with the default **`-g` (debug)** flag to establish a non-optimized baseline.
1. **Compiler Flags:**
`CFLAGS = -g -march=rv32i_zicsr`
2. **Result:**
The following output shows the program running correctly and the resulting cycle count.
* cycle counts = $71506$

### c-1. Optimization (optimized for speed)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed).
1. **Compiler Flags:**
`CFLAGS = -Ofast -march=rv32i_zicsr`
2. **Result:**
After running `make clean` to force a re-compilation, the new results were as follows.
* cycle counts = $18997$

#### Analysis 1: Performance Comparison
This simple flag change resulted in a dramatic performance improvement:
| CFLAGS | Total Cycles | Speedup |
| :--- | :--- | :--- |
| `-g` (Baseline) | 71,506 | 1.00x |
| `-Ofast` (Optimized speed) | 18,997 | 3.76x |
#### Analysis 2: Over-Optimization Challenge (The `volatile` Fix)
A significant challenge occurred during the `-Ofast` test. The compiler was so aggressive that it identified my performance-measuring code (`get_cycles()`, `TEST_LOGGER`) as "Dead Code" and **optimized it away entirely**, resulting in no cycle count output.
Two fixes were required in `main.c` to prevent this:
1. **`volatile uint64_t`**: I added the `volatile` keyword to `start_cycles` and `end_cycles`. This tells the compiler that these variables are important and must not be optimized away.
2. **`asm "memory"` clber**: I added `"memory"` to the clobber list of the `printstr` assembly macro. This tells the compiler that the `ecall` has "side effects" (I/O) and that any code related to it (like `TEST_LOGGER`) must not be deleted.
#### Analysis 3: Assembly-Level Comparison (-g vs -Ofast)
To understand *how* the 3.76x speedup was achieved, I saved the assembly output for both versions and compared the `hw1_B_main` function.
**Observation 1: Register Allocation**
* **`-g` (Original):** The debug build is bloated with `sw` (store) and `lw` (load) instructions, especially at the start and end of the function (the "prologue" and "epilogue"). It spills *all* C variables from fast registers to the slow stack (memory) for debugging.
* **`-Ofast` (Optimized):** These `sw`/`lw` instructions are gone. The compiler keeps variables (like the loop counter `i` and `previous_value`) in fast CPU registers, eliminating slow memory acce*ss. This is a primary source of the speedup.
**Observation 2: Function Inlining**
* **`-g` (Original):** The `hw1_B.c` source code has a `for` loop that runs 256 times. Inside this loop, it calls `uf8_decode()` and `uf8_encode()` every single time. This results in at least `256 * 2 = 512` slow `jal` (call) instructions.
* **`-Ofast` (Optimized):** These `jal` instructions are gone. The compiler performed aggressive **"inlining"**—it "pasted" the assembly code from `test()`, `clz()`, `uf8_decode()`, and `uf8_encode()` directly into `hw1_B_main`.
### c-2. Optimization (optimize for Size)
To measure the impact of compiler optimization, I modified the Makefile to change the CFLAGS from -g (debug) to -Os (optimize for size). This flag instructs the compiler to prioritize making the final `test.elf` binary as small as possible, sometimes at the cost of speed.
1. **Compiler Flags:**
`CFLAGS = -Os -march=rv32i_zicsr`
2. **Result : Program Failure**
After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly.
* Infinite Loop

#### Analysis 1:
This is the most significant finding. The program successfully printed `=== [Running hw1B] ===` (from `main.c`) but never returned from the `hw1_B_main()` call.
This proves that `gcc -Os` generated faulty assembly code for `hw1_B.c`.
#### Analysis 2: The Compiler Optimization Bug
By saving the assembly (`dump_hw1_B_size_Os.txt`) and analyzing `hw1_B_main`, I discovered the cause:
1. **`-Os` (Optimize for Size)** decided NOT to "inline" the `uf8_decode` and `uf8_encode` functions because inlining would increase the file size. Instead, it kept the slow `jal` (call) instructions inside the 256-iteration `for` loop.
2. At the same time, to save space, `-Os` also removed the "safe" but "bloated" register-saving code (`sw`/`lw`) that `-g` uses.
3. **This created a bug:**
* The `uf8_encode` function (the "callee") uses the same registers as `hw1_B_main` (the "caller"). When `uf8_encode` runs, it corrupts the registers that `hw1_B_main` was using for its loop counter (`i`).
* When the `uf8_encode` function returns, the loop counter `i` is no longer correct, causing the loop condition (`i < 256`) to never be met, resulting in an **infinite loop**.
## hw1_B.S
This section describes the process of porting `hw1_B.S` to the rv32emu bare-metal environment and establishing a baseline performance metric.
### a. Integration Process
1. **Assembly Interface**:
* The `.global hw1_B_main` directive was used in `hw1_B.S` to export the label, making it visible to the linker.
* The original `j` and `ecall` instructions at the end of the logic were replaced with a `ret` instruction. This allows control to be passed back to the `main()` C function.
2. **Makefile Configuration:**
* Makefile was updated by add `hw1_B.o` in the OBJS list.
### b. Baseline Performance
After integration, the program was compiled with the Makefile's default `-g` flag to establish a non-optimized baseline for `hw1_B`
* **Compiler Flags:**
`CFLAGS = -g -march=rv32i_zicsr`
* **Result:**
The following output shows the `hw1_B` program running correctly and the resulting baseline cycle count.
* cycle counts = $7134$
* instruction counts = $7134$

### c-1. Optimization (optimized for speed)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed).
* **Compiler Flags:**
`CFLAGS = -Ofast -march=rv32i_zicsr`
* **Result:**
The following output shows the `hw1_B` program running correctly and the resulting baseline cycle count.
* cycle counts = $1881$
* instruction counts = $1881$

#### Analysis 1: Performance Comparison
This simple flag change resulted in a dramatic performance improvement:
| CFLAGS | Total Cycles | Speedup |
| :--- | :--- | :--- |
| `-g` (Baseline) | 7,134 | 1.00x |
| `-Ofast` (Optimized speed) | 1,881 | 3.79x |
## Conclusion
| CFLAGS | Total Cycles | Speedup |
|:---------------------------------------------------------- |:------------ |:--------- |
| `-g` (<span style="color:brown">**hw1_B.c**</span>) | 71,506 | **1.00x** |
| `-Ofast` (<span style="color:brown">**hw1_B.c**</span>) | 18,997 | **3.76x** |
| `-Os` (<span style="color:brown">**hw1_B.c**</span>) | fail | **fail** |
| - | - | - |
| `-g` (<span style="color:darkblue">**hw1_B.S**</span>) | 7,134 | **1.00x** |
| `-Ofast` (<span style="color:darkblue">**hw1_B.S**</span>) | 1,881 | **3.79x** |
| `-Os` (<span style="color:darkblue">**hw1_B.S**</span>) | fail | **fail** |
1. **C vs. Assembly**: The C code was 10x slower, not because C is slow, but because the compiler was forced to use an extremely slow software fallback.
2. **Compiler Intelligence**:`-Ofast` shows a key value of compilers : eliminating the cost of function calls.
* Similar ~3.7x Speedup
3. On a highly constrained hardware target, a superior algorithm written in assembly can easily outperform an optimized C compiler that is forced to rely on a generic, slow, software fallback.
---
# QUIZ2_A
## qz2_A.c
This section describes the process of porting Quiz2 A program (`qz2_A.c`) to the rv32emu bare-metal environment and establishing its baseline performance.
### a. Porting Process
Porting `qz2_A.c` required a *different* set of modifications than `hw1_B.c` due to its use of C intrinsics and formatted printing.
* **Test Harness Creation:**
* The `main()` function in `main.c` was reconfigured to act as the test harness for Quiz2_A.
* `qz2_A.c`'s original `main()` was renamed to `qz2_A_main()` to prevent a linker conflict.
* **Makefile Configuration:**
* The Makefile was updated to include `quiz2_A.o` in the `OBJS` list.
* **Bare-Metal LibC Stubbing (The "printf" Challenge):**
* `qz2_A.c`'s `print_move` function originally used `printf("Move Disk %d from '%c' to '%c'\n", ...)` for formatted output.
* Bare-metal `printf` stub cannot handle format specifiers (`%d`, `%c`).
* **Solution:**
1. **Upgrading `main.c`:** I added two new global helper functions, `print_int()` (by removing `static` from `print_dec`) and `print_char()`.
2. **Refactoring `qz2_A.c`:** I manually rewrote the `print_move` function to "unroll" the single `printf` call into 7 separate calls to our simpler `printf` (for strings), `print_int` (for the disk number), and `print_char` (for the peg letters).
* **Software Popcount (`libgcc` Stubbing):**
* The `qz2_A.c` code uses the `__builtin_popcount()` intrinsic (a GCC-specific function) to identify which disk to move.
* Because we are on `RV32I` and are not linking the standard C library (`libgcc`), this created an "undefined reference to `__popcountsi2`" linker error.
* **Solution:** To satisfy the linker, I added a manual C implementation of the `__popcountsi2` function directly to `main.c`.
### b. Baseline Performance
After porting, the program was compiled with the `-g` (debug) flag to establish the non-optimized baseline for `quiz2_A`.
* **Compiler Flags:**
`CFLAGS = -g -march=rv32i_zicsr`
* **Result:**
The following output shows the `qz2_A` program running correctly and the resulting baseline cycle count.
* cycle counts = $12817$

### c-1. Optimization (optimized for speed)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed).
1. **Compiler Flags:**
`CFLAGS = -Ofast -march=rv32i_zicsr`
2. **Result:**
After running `make clean` to force a re-compilation, the new results were as follows.
* cycle counts = $6400$

#### Analysis 1: Performance Comparison
The optimization provided a significant **2.00x** speedup for the `qz2_A` program.
| CFLAGS | Total Cycles | Speedup |
| :--- | :--- | :--- |
| `-g` (Baseline) | 12,817 | 1.00x |
| `-Ofast` (Optimized speed) | 6,400 | 2.00x |
#### Analysis 2: Assembly-Level Comparison (`-g` vs `-Ofast`)
To understand how the 2.00x speedup was (and was not) achieved, I compared the `dump_qz2_A_original_g.txt` and `dump_qz2_A_optimized_Ofast.txt` files.
**Observation 1: Register Allocation**
* **`-g` (Original):** The `dump_qz2_A_original_g.txt` shows `qz2_A_main` (at `10770`) is extremely bloated. It allocates 112 bytes on the stack and is filled with `sw` and `lw` instructions inside the main `for (n_moves ...)` loop and the nested `for (p ...)` / `for (d ...)` loops. It spills all C variables to the slow stack on every iteration.
* **`-Ofast` (Optimized):** The `dump_qz2_A_optimized_Ofast.txt` is much cleaner. `gcc -Ofast` completely removed the `sw`/`lw` instructions inside the loops. It performed aggressive Register Allocation, keeping all loop counters and state variables permanently in fast CPU registers. This is the source of the 2.00x speedup.
**Observation 2: The Inlining Bottleneck (What was "NOT" optimized)**
* **The Problem:** The `qz2_A` (Hanoi) program is very "chatty". Its main bottleneck is `(7 moves * 7 print calls) + 7 popcounts = 56` `jal` instructions inside its main loop.
* **The `Ofast` Dump:** When looking at `dump_qz2_A_optimized_Ofast.txt`, these `jal` calls are still present:
```assembly
10374: e8dff0ef jal 10200 <printf>
1037c: cf1ff0ef jal 1006c <print_int>
1038c: da1ff0ef jal 1012c <print_char>
103dc: e95ff0ef jal 10270 <__popcountsi2>
### c-2. Optimization (optimized for size)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Os` (optimize for size).
1. **Compiler Flags:**
`CFLAGS = -Os -march=rv32i_zicsr`
2. **Result : Program Failure**
After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly.
* Infinite Loop

#### Analysis 1:
This is the most significant finding. The program successfully printed `=== [Running qz2_A] ===` but never returned from the `qz2_A_main()` call.
This proves that `gcc -Os` generated faulty assembly code by breaking the RISC-V calling convention.
## qz2_A.S
This section describes the process of porting `qz2_A.S` to the rv32emu bare-metal environment and establishing a baseline performance metric.
### a. Integration Process
1. **Assembly Interface**:
* The `.global qz2_A_main` directive was used in `qz2_A.S` to export the label, making it visible to the linker.
2. **Makefile Configuration:**
* Makefile was updated by add `qz2_A.o` in the OBJS list.
### b. Baseline Performance
After integration, the program was compiled with the Makefile's default `-g` flag to establish a non-optimized baseline for `hw1_B`
* **Compiler Flags:**
`CFLAGS = -g -march=rv32i_zicsr`
* **Result:**
The following output shows the `hw1_B` program running correctly and the resulting baseline cycle count.
* cycle counts = $1219$
* instruction counts = $1219$

### c-1. Optimization (optimized for speed)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed).
1. **Compiler Flags:**
`CFLAGS = -Ofast -march=rv32i_zicsr`
2. **Result:**
After running `make clean` to force a re-compilation, the new results were as follows.
* cycle counts = $450$
* instruction counts = $450$

### c-2. Optimization (optimized for size)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Os` (optimize for size).
1. **Compiler Flags:**
`CFLAGS = -Os -march=rv32i_zicsr`
2. **Result : Program Failure**
After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly.
* Infinite Loop

## Conclusion
| CFLAGS | Total Cycles | Speedup |
|:---------------------------------------------------------- |:------------ |:--------- |
| `-g` (<span style="color:brown">**qz2_A.c**</span>) | 12,817 | **1.00x** |
| `-Ofast` (<span style="color:brown">**qz2_A.c**</span>) | 6,400 | **2.00x** |
| `-Os` (<span style="color:brown">**qz2_A.c**</span>) | fail | **fail** |
| - | - | - |
| `-g` (<span style="color:darkblue">**qz2_A.S**</span>) | 1,219 | **1.00x** |
| `-Ofast` (<span style="color:darkblue">**qz2_A.S**</span>) | 450 | **2.70x** |
| `-Os` (<span style="color:darkblue">**qz2_A.S**</span>) | fail | **fail** |
# QUIZ3_C
## qz3_C.c
This section describes the process of porting Quiz2 A program (`qz3_C.c`) to the rv32emu bare-metal environment and establishing its baseline performance.
(Similar to the previous approach, the test is performed directly here.)
### a. Baseline Performance
After integration, the program was compiled with the Makefile's default `-g` flag to establish a non-optimized baseline for `qz3_C`
* **Compiler Flags:**
`CFLAGS = -g -march=rv32i_zicsr`
* **Result:**
The following output shows the `qz3_C` program running correctly and the resulting baseline cycle count.
* cycle counts = $31001$
* instruction counts = $31001$

### b-1. Optimization (optimized for speed)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed).
1. **Compiler Flags:**
`CFLAGS = -Ofast -march=rv32i_zicsr`
2. **Result:**
After running `make clean` to force a re-compilation, the new results were as follows.
* cycle counts = $12962$
* instruction counts = $12962$

### b-2. Optimization (optimized for size)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Os` (optimize for size).
1. **Compiler Flags:**
`CFLAGS = -Os -march=rv32i_zicsr`
2. **Result : Program Failure**
After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly.
* Infinite Loop

## qz3_C.S
This section describes the process of porting `qz3_C.S` to the rv32emu bare-metal environment and establishing a baseline performance metric.
### a. Baseline Performance
After integration, the program was compiled with the Makefile's default `-g` flag to establish a non-optimized baseline for `qz3_C`
* **Compiler Flags:**
`CFLAGS = -g -march=rv32i_zicsr`
* **Result:**
The following output shows the `qz3_C` program running correctly and the resulting baseline cycle count.
* cycle counts = $26757$
* instruction counts = $26757$

### c-1. Optimization (optimized for speed)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Ofast` (optimize for speed).
1. **Compiler Flags:**
`CFLAGS = -Ofast -march=rv32i_zicsr`
2. **Result:**
After running `make clean` to force a re-compilation, the new results were as follows.
* cycle counts = $12332$
* instruction counts = $12332$

### b-2. Optimization (optimized for size)
To measure the impact of compiler optimization, I modified the `Makefile` to change the `CFLAGS` from `-g` (debug) to `-Os` (optimize for size).
1. **Compiler Flags:**
`CFLAGS = -Os -march=rv32i_zicsr`
2. **Result : Program Failure**
After running `make clean`, the program was re-compiled with `-Os`. However, this version **failed to execute** correctly.
* Infinite Loop

## Conclusion
| CFLAGS | Total Cycles | Speedup |
|:---------------------------------------------------------- |:------------ |:--------- |
| `-g` (<span style="color:brown">**qz3_C.c**</span>) | 31.001 | **1.00x** |
| `-Ofast` (<span style="color:brown">**qz3_C.c**</span>) | 12,962 | **2.39x** |
| `-Os` (<span style="color:brown">**qz3_C.c**</span>) | fail | **fail** |
| - | - | - |
| `-g` (<span style="color:darkblue">**qz3_C.S**</span>) | 26,757 | **1.00x** |
| `-Ofast` (<span style="color:darkblue">**qz3_C.S**</span>) | 12,332 | **2.16x** |
| `-Os` (<span style="color:darkblue">**qz3_C.S**</span>) | fail | **fail** |
---
---
<b>Reference</b>
1. [Assignment2: Complete Applications](https://hackmd.io/@sysprog/2025-arch-homework2)