# Assignment 2: Complete Applications
[GitHub repo link](https://github.com/jgw0915/Computer-Architecture_HW2/tree/main)
## Environment Set Up
### Objective:
Configure a complete RISC-V bare-metal development environment for compiling, assembling, linking, and running RV32I programs using rv32emu.
### 1️⃣ Install RISC-V GNU Toolchain
The GNU RISC-V Embedded Toolchain provides essential cross-compilation tools (gcc, as, ld, objdump, etc.) for building RISC-V binaries.
```bash=
cd /tmp
wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v14.2.0-3/xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
tar zxvf xpack-riscv-none-elf-gcc-14.2.0-3-linux-x64.tar.gz
cp -af xpack-riscv-none-elf-gcc-14.2.0-3 $HOME/riscv-none-elf-gcc
```
#### Configure Environment Variable
```bash=
cd $HOME/riscv-none-elf-gcc
echo "export PATH=`pwd`/bin:\$PATH" > setenv
```
#### Load the toolchain before each use
```bash!
source $HOME/riscv-none-elf-gcc/setenv
```
* **Explanation**:
| Command | Meaning |
|----------|----------|
| `pwd` | Prints the current working directory |
| `export PATH=...` | Adds `pwd/bin` to system path so shell can locate `riscv-none-elf-*` binaries |
| `source file` | Executes the file **in the current shell** (not in a subshell) |
### 2️⃣ Build rv32emu
rv32emu is a RISC-V instruction set emulator that executes ELF binaries.
```bash!
git clone https://github.com/sysprog21/rv32emu
cd rv32emu
make
```
After building, the `rv32emu` executable will be available at `rv32emu/build/rv32emu`.
### 3️⃣ Makefile Configuration for Playground
```Makefile!
include ../../../mk/toolchain.mk
ARCH = -march=rv32izicsr
LINKER_SCRIPT = linker.ld
EMU ?= ../../../build/rv32emu
AFLAGS = -g $(ARCH)
CFLAGS = -g -march=rv32i_zicsr
LDFLAGS = -T $(LINKER_SCRIPT)
EXEC = test.elf
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
LD = $(CROSS_COMPILE)ld
OBJDUMP = $(CROSS_COMPILE)objdump
OBJS = start.o main.o perfcounter.o chacha20_asm.o
.PHONY: all run dump clean
all: $(EXEC)
$(EXEC): $(OBJS) $(LINKER_SCRIPT)
$(LD) $(LDFLAGS) -o $@ $(OBJS)
%.o: %.S
$(AS) $(AFLAGS) $< -o $@
%.o: %.c
$(CC) $(CFLAGS) $< -o $@ -c
run: $(EXEC)
@test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1)
@grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1)
@grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1)
$(EMU) $<
dump: $(EXEC)
$(OBJDUMP) -Ds $< | less
clean:
rm -f $(EXEC) $(OBJS)
```
* **Structure Breakdown**
| Category | Description |
| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Libraries Required** | `rv32emu/mk/toolchain.mk`, `rv32emu/build/rv32emu`, `rv32emu/build/.config` |
| **Toolchain Variables** | `CC`, `AS`, `LD`, `OBJDUMP` define compiler, assembler, linker, disassembler |
| **Object Targets** | `OBJS` contains `.o` files built from `.c` and `.S` sources |
| **Final Output** | `EXEC = test.elf` is the executable ELF loaded by emulator |
| **Main Targets** | <ul><li>`all`: builds `.elf`</li><li>`run`: executes ELF on rv32emu</li><li>`dump`: disassembles ELF</li><li>`clean`: removes build files</li></ul> |
### Note: GCC Compilation Flags
| Flag | Description | Example |
| -------- | -------------------------------------------------------------- | ------------------------- |
| `-c` | Compile source → object file (`.o`), **no linking** | `gcc -c main.c -o main.o` |
| `-S` | Compile source → assembly (`.s`), **no assembling or linking** | `gcc -S main.c -o main.s` |
| *(none)* | Compile, assemble, and link automatically | `gcc main.c -o main` |
### 4️⃣ Move Required Libraries
To simplify builds, copy the following directories into project root:
```bash!
cp -r rv32emu/mk/ <project_root>/
cp -r rv32emu/build/ <project_root>/
```
These provide the required pieces:
* **mk/toolchain.mk**: defines the `CROSS_COMPILE` prefix.
* **build/rv32emu**: the compiled emulator binary.
* **build/.config**: toggles compile-time options.
### Successful setup allows:
```bash!
make
make run
```
→ Runs the compiled ELF on rv32emu, confirming a working
RISC-V bare-metal lab environment.
## Run HW1 uf8 in rv32emu
### Objective
This lab demonstrates how to **run the Homework 1 UF8 program** (implemented in RISC-V assembly) inside **rv32emu’s system emulation**.
The goal is to adapt a Ripes-compatible program into rv32emu by **remapping syscalls**, ensuring compatibility under a **bare-metal RV32I environment**.
### Architecture Restriction
- Only **RV32I + Zicsr** instructions are allowed.
- Disable RV32M and RV32F extensions.
- Compilation flags:
```bash
-march=rv32izicsr
-mabi=ilp32
```
### File Structure
The project mirrors the layout of `tests/system/playground/` in rv32emu; it is not a full rv32emu clone.
```bash!
Computer-Architecture_HW2/
├── mk
├── build/rv32emu
└── HW1_Playground
├── start.S
├── perfcounter.S
├── q1-uf8.S
├── linker.ld
└── Makefile
```
Details of the implementation are in [HW1_Playground](https://github.com/jgw0915/Computer-Architecture_HW2/tree/main/HW1_Playground)
### Syscall Remapping
* **Problem**
* The original q1-uf8.S used Ripes syscalls (a7 = 4, 11, 10) for printing and exiting.
However, rv32emu follows a Linux-style syscall interface (write=64, exit=93).
This mismatch caused:
```bash!
FATAL src/syscall.c:538: Unknown syscall: 4
```
* **Solution**
* Remap Ripes syscall convention → rv32emu syscall convention.
| Purpose | Ripes (a7) | rv32emu (a7) | Arguments |
| ------------ | ---------- | ------------ | --------------------- |
| print string | 4 | 64 | a0=fd, a1=buf, a2=len |
| exit program | 10 | 93 | a0=exit code |
* Modified `main` in `q1-uf8.S`
```asm=
.text
.globl main
.globl clz
.globl uf8_decode
.globl uf8_encode
.equ SYS_write, 64
.equ SYS_exit, 93
.equ FD_STDOUT, 1
#...other implementation...
main:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
addi s0, sp, 16
call test # result returned in a0; copied to a5 below
mv a5, a0
beq a5, zero, set_fail_exit # 0 => fail
# success path: write(1, okmsg, okmsg_end-okmsg); exit(0)
li a0, FD_STDOUT # fd
la a1, okmsg
la t1, okmsg_end
sub a2, t1, a1 # count
li a7, SYS_write
ecall
li a0, 0 # exit code
li a7, SYS_exit
ecall # no return
set_fail_exit:
# failure path: write(1, failmsg, failmsg_end-failmsg); exit(1)
li a0, FD_STDOUT # fd
la a1, failmsg
la t1, failmsg_end
sub a2, t1, a1 # count
li a7, SYS_write
ecall
li a0, 1 # exit(1)
li a7, SYS_exit
ecall # no return
```
* Modify `test` in `q1-uf8.S`
```asm=
test_loop_body:
lw a5,-28(s0) # i
sb a5,-29(s0) # fl = i
lbu a5,-29(s0)
mv a0,a5
call uf8_decode # value = uf8_decode(fl)
mv a5,a0
sw a5,-36(s0)
lw a5,-36(s0)
mv a0,a5
call uf8_encode # fl2 = uf8_encode(value)
mv a5,a0
sb a5,-37(s0)
# ------------------ re-encode check ------------------
lbu a4,-29(s0) # fl
lbu a5,-37(s0) # fl2
beq a4,a5,after_mismatch_check
# print "Re-encode Test Failed:\n"
li a0,FD_STDOUT
la a1,reenc_msg
la t0,reenc_msg_end
sub a2,t0,a1
li a7,SYS_write
ecall
sb zero,-21(s0) # passed = false
after_mismatch_check:
# ---------------- monotonicity check -----------------
lw a4,-36(s0) # value
lw a5,-20(s0) # previous_value
bgt a4,a5,after_monotonic_check
# print "Previous value Test Failed:\n"
li a0,FD_STDOUT
la a1,prevfail_msg
la t0,prevfail_msg_end
sub a2,t0,a1
li a7,SYS_write
ecall
sb zero,-21(s0) # passed = false
```
Details of the remapping implementation are in [q1-uf8.S](http://github.com/jgw0915/Computer-Architecture_HW2/blob/main/HW1_Playground/q1-uf8.S)
### Makefile Configuration
* **Key Adjustments**
* Change object list for startup, performance counter, and main program.
* Adjust path of dependent library
* ` ../build/rv32emu`
* `../build/.config`
* `../mk/toolchain.mk`
```Makefile!
include ../mk/toolchain.mk
...
EMU = ../build/rv32emu
...
OBJS = start.o perfcounter.o q1-uf8.o
...
run: $(EXEC)
@test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1)
@grep -q "ENABLE_ELF_LOADER=1" ../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1)
@grep -q "ENABLE_SYSTEM=1" ../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1)
$(EMU) $(EXEC)
```
### How Files are linked and loaded
* `linker.ld` Controls Memory Layout
* Excerpt:
```ld=
OUTPUT_ARCH("riscv")
ENTRY(_start)
SECTIONS {
. = 0x10000;
.text : {
*(.text._start)
*(.text)
}
.data : { *(.data) }
.bss : {
__bss_start = .;
*(.bss)
__bss_end = .;
}
.stack (NOLOAD) : {
. = ALIGN(16);
. += 4096;
__stack_top = .;
}
}
```
* **Section Explanation**
| Section | Purpose | Example Source |
| -------- | ----------------------- | ------------------------- |
| `.text` | Code instructions | all `.S` files |
| `.data` | Initialized data | strings, global variables |
| `.bss` | Zero-initialized data | uninitialized globals |
| `.stack` | Stack space reservation | defines `__stack_top` |
* The linker merges the `.text`, `.data`, and `.bss` sections from each object and sets up the symbols used by `start.S`.
* **Execution Flow: From Load to Main**
* (1) **rv32emu Loads ELF**
* When executing:
```bash=
../build/rv32emu test.elf
```
* rv32emu parses the ELF headers, loads `.text` into memory starting at `0x10000`, sets the PC to `_start`, and zeroes `.bss`.
* (2) **`_start` in `start.S` Executes**
```asm
_start:
la sp, __stack_top # set up stack pointer
la t0, __bss_start
la t1, __bss_end
1:
bge t0, t1, 2f
sw zero, 0(t0) # clear BSS section
addi t0, t0, 4
j 1b
2:
call main # jump to C/ASM main
li a7, 93 # SYS_exit
li a0, 0
ecall
3:
j 3b # safety infinite loop
```
* The stack pointer is initialized using the linker-defined `__stack_top`.
* The BSS section is cleared manually.
* Control transfers to `main`, defined in `q1-uf8.S`.
* **Linking Dependency Chain**
```text
┌───────────────────────────────┐
│ Makefile │
│ defines OBJS = start.o ... │
└──────────────┬────────────────┘
│
▼
┌───────────────────────────────┐
│ linker.ld │
│ ENTRY(_start) │
│ assigns memory layout │
└──────────────┬────────────────┘
│
▼
┌───────────────────────────────┐
│ start.S │
│ defines _start, calls main │
└──────────────┬────────────────┘
│
▼
┌───────────────────────────────┐
│ q1-uf8.S │
│ defines main(), uf8_* funcs │
│ executes tests, syscalls │
└──────────────┬────────────────┘
│
▼
┌───────────────────────────────┐
│ rv32emu │
│ loads test.elf, executes PC │
└───────────────────────────────┘
```
### Build Process
* **`make` Output**
```bash!
riscv-none-elf-as -g -march=rv32izicsr start.S -o start.o
riscv-none-elf-as -g -march=rv32izicsr perfcounter.S -o perfcounter.o
riscv-none-elf-as -g -march=rv32izicsr q1-uf8.S -o q1-uf8.o
riscv-none-elf-ld -T linker.ld -o test.elf start.o perfcounter.o q1-uf8.o
```
### Execution Result
* `make run` Output
```bash!
../build/rv32emu test.elf
17:41:58 INFO src/riscv.c:552: Log level: TRACE
17:41:58 INFO src/riscv.c:565: test.elf ELF loaded
17:41:58 INFO src/main.c:315: RISC-V emulator is created and ready to run
===All tests passed.===
17:41:58 INFO src/main.c:338: RISC-V emulator is destroyed
```
## Adapt Problem A from Quiz 2 in rv32emu
### Objective
Adapt the **Tower of Hanoi** assembly problem (Problem A from Quiz 2) to run in a **bare-metal environment** under **rv32emu** system emulation.
This lab demonstrates low-level control flow, syscall interfacing, and profiling using performance counters.
### File Structure
```text
Quiz2_playground
├── start.S
├── perfcounter.S
├── hanoi_tower.S
├── linker.ld
└── Makefile
```
* **Assembly Files**
* `start.S` – initializes stack, clears BSS, calls main then exits
* `perfcounter.S` – provides get_cycles and get_instret for profiling
* `hanoi_tower.S` – implements the iterative Tower of Hanoi algorithm using Gray-code moves
The detailed implementation is in [Quiz2_Playground](https://github.com/jgw0915/Computer-Architecture_HW2/tree/main/Quiz2_Playground)
### Makefile Configuration
* Compared to HW1_playground, update the object list:
* OBJS = `start.o` `perfcounter.o` `hanoi_tower.o`
* **Compile and link:**
```bash!
riscv-none-elf-as -g -march=rv32izicsr start.S -o start.o
riscv-none-elf-as -g -march=rv32izicsr perfcounter.S -o perfcounter.o
riscv-none-elf-as -g -march=rv32izicsr hanoi_tower.S -o hanoi_tower.o
riscv-none-elf-ld -T linker.ld -o test.elf start.o perfcounter.o hanoi_tower.o
```
* Run in rv32emu:
``` bash
../build/rv32emu test.elf
20:39:01 INFO src/riscv.c:552: Log level: TRACE
20:39:01 INFO src/riscv.c:565: test.elf ELF loaded
20:39:01 INFO src/main.c:315: RISC-V emulator is created and ready to run
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
20:39:01 INFO src/main.c:338: RISC-V emulator is destroyed
```
### Syscall Remapping
* **Write Syscalls**
* Each output string uses `write(fd=1, buf, len)` through `a7=64`, with the length computed as the difference between a string's start label and its `_end` label:
``` asm=
la a1, str1
la a3, str1_end
la a4, str1
sub a2, a3, a4
li a7, 64
ecall
```
* Modification in `display_move`
```asm=
display_move:
la x20, obdata
add x5, x20, x18
lbu x28, 0(x5) # BLANK 23
li x6, 0x6F # BLANK 24
xor x28, x28, x6
addi x28, x28, -0x12 # BLANK 25
add x7, x20, x19
lbu x29, 0(x7)
xor x29, x29, x6
addi x29, x29, -0x12
# ===== rv32emu write(2) syscalls begin =====
# print "Move Disk "
li x10, 1 # a0=fd=1
la x11, str1 # a1=buf
la x13, str1_end # tmp = &str1_end
la x14, str1 # tmp = &str1
sub x12, x13, x14 # a2=len = end - start
li x17, 64 # a7=write
ecall
# print disk number (x9+1) as one ASCII char
addi x6, x9, 1 # 1..3
addi x6, x6, 48 # '0'+
la x11, tmpc # a1=buf
sb x6, 0(x11)
li x10, 1 # a0=fd=1
li x12, 1 # a2=len=1
li x17, 64
ecall
# print " from "
li x10, 1
la x11, str2
la x13, str2_end
la x14, str2
sub x12, x13, x14
li x17, 64
ecall
# print source peg char in x11 (single byte)
addi x6, x28, 0 # move source char to x6
la x28, tmpc
sb x6, 0(x28)
li x10, 1
add x11, x0, x28
li x12, 1
li x17, 64
ecall
# print " to "
li x10, 1
la x11, str3
la x13, str3_end
la x14, str3
sub x12, x13, x14
li x17, 64
ecall
# print dest peg char in x12 (single byte)
addi x6, x29, 0
la x29, tmpc
sb x6, 0(x29)
li x10, 1
add x11, x0, x29
li x12, 1
li x17, 64
ecall
# print newline
li x10, 1
la x11, l
li x12, 1
li x17, 64
ecall
# ===== rv32emu write(2) syscalls end =====
# BLANK 26-27: store new disk position
slli x5, x9, 2 # offset = disk * 4
addi x5, x5, 20
add x5, x2, x5
sw x19, 0(x5)
# BLANK 28-29: next move
addi x8, x8, 1
jal x0, game_loop
```
* Strings are defined as:
```asm=
str1: .asciz "Move Disk "
str1_end:
str2: .asciz " from "
str2_end:
str3: .asciz " to "
str3_end:
```
This method eliminates manual length counting and ensures compatibility with rv32emu’s system call handler.
## Adapt Problem C from Quiz 3 in rv32emu
### Objective
Adapt the **Fast Reciprocal Square Root (fast_rsqrt)** implementation (Problem C from Quiz 3) to execute on a **bare-metal RISC-V RV32I** platform under **rv32emu** system emulation.
The task demonstrates inline assembly system calls, software emulation of multiplication/division, and performance profiling at cycle and instruction granularity.
### Project Structure
```text
Quiz3_Playground
├── start.S
├── perfcounter.S
├── main.c
├── libgcc_helpers.c
├── linker.ld
└── Makefile
```
* **Source Overview**
* `start.S` – stack setup and bare-metal entry point
* `perfcounter.S` – reads cycle and instret CSRs for profiling
* `main.c` – C program implementing fixed-point reciprocal square-root with inline syscalls
Detailed implementation in [Quiz3_Playground](https://github.com/jgw0915/Computer-Architecture_HW2/tree/main/Quiz3_Playground)
### Makefile Configuration
* Update OBJS to include C and assembly objects:
`OBJS = start.o main.o perfcounter.o`
* **Build Steps** - Running `make`
```bash!
riscv-none-elf-as -g -march=rv32izicsr start.S -o start.o
riscv-none-elf-gcc -Ofast -march=rv32i_zicsr -S main.c -o main.S
riscv-none-elf-as -g -march=rv32izicsr main.S -o main.o
riscv-none-elf-as -g -march=rv32izicsr perfcounter.S -o perfcounter.o
riscv-none-elf-gcc -Ofast -march=rv32i_zicsr -S libgcc_helpers.c -o libgcc_helpers.S
riscv-none-elf-as -g -march=rv32izicsr libgcc_helpers.S -o libgcc_helpers.o
riscv-none-elf-ld -T linker.ld -o test.elf start.o main.o perfcounter.o libgcc_helpers.o
```
* **Run under rv32emu:**
```bash!
========= Fast Reciprocal Square Root Performance Test =========
x = 1
y = 65536
(y/65536 ~= 1.000000)
Cycles: 56
Instructions: 56
x = 4
y = 32768
(y/65536 ~= 0.500000)
Cycles: 2783
Instructions: 2783
x = 16
y = 16384
(y/65536 ~= 0.250000)
Cycles: 2783
Instructions: 2783
x = 20
y = 14654
(y/65536 ~= 0.223602)
Cycles: 4692
Instructions: 4692
x = 30
y = 11965
(y/65536 ~= 0.182571)
Cycles: 4614
Instructions: 4614
x = 100
y = 6553
(y/65536 ~= 0.099991)
Cycles: 4478
Instructions: 4478
x = 120
y = 5982
(y/65536 ~= 0.091278)
Cycles: 4790
Instructions: 4790
x = 130
y = 5747
(y/65536 ~= 0.087692)
Cycles: 4839
Instructions: 4839
x = 0
y = 4294967295
(Actual value = infinity)
Cycles: 42
Instructions: 42
x = 4294967295
y = 1
(y/65536 ~= 0.000015)
Cycles: 3702
Instructions: 3702
========= End of Fast Reciprocal Square Root Performance Test =========
```
### Inline Assembly and Syscalls
* **`printstr` Macro**
* Implements a direct system call to write(1, ptr, len):
```c=
#define printstr(ptr, length) \
asm volatile( \
"add a7, x0, 0x40;" \
"add a0, x0, 0x1;" \
"add a1, x0, %0;" \
"mv a2, %1;" \
"ecall;" \
: : "r"(ptr), "r"(length) : "a0","a1","a2","a7","memory");
```
* Used via logging macros:
```c=
#define TEST_OUTPUT(msg, len) printstr(msg, len)
#define TEST_LOGGER(msg) { char _msg[] = msg; TEST_OUTPUT(_msg, sizeof(_msg)-1); }
```
This inline assembly interfaces with rv32emu’s system call dispatcher (a7 = 64).
### Algorithm Explanation
* **Goal**
* Compute $$y=\frac{65536}{\sqrt{x}}$$ in Q16 fixed-point format.
* **Steps**
* Handle edge cases (`x == 0 → ∞`, `x == 1 → 65536`)
* Find MSB position using software `clz`
* Initial estimate from `rsqrt_table` (lookup for $2^e$)
* Linear interpolation for non-powers of two
* Two Newton-Raphson iterations (approximation refinement):
$$y_{n+1}=y_n\left(\frac{3}{2}-\frac{x\,y_n^{2}}{2^{16}}\right)$$
All arithmetic is implemented with software `mul32`, `udiv`, and `umod`, because RV32I lacks the M extension.
### Profiling Integration
`get_cycles()` and `get_instret()` are used to record hardware counter differences:
```clike!
start_cycles = get_cycles();
start_instret = get_instret();
y = fast_rsqrt(x);
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
```
### Precision and Performance Analysis
| Input x | Approx (y/65536) | True 1/√x | Error (%) |
| :--------: | :---------------------: | :---------------------: | :-------: |
| 1 | 1.000000 | 1.000000 | 0.000% |
| 4 | 0.500000 | 0.500000 | 0.000% |
| 16 | 0.250000 | 0.250000 | 0.000% |
| 20 | 0.223602 | 0.223606 | ≈ 0.002% |
| 30 | 0.182571 | 0.182574 | ≈ 0.002% |
| 100 | 0.099991 | 0.100000 | ≈ 0.009% |
| 120 | 0.091278 | 0.091287 | ≈ 0.0099% |
| 130 | 0.087692 | 0.087705 | ≈ 0.015% |
| 0 | (∞ case → Q16 = max) | ∞ | N/A |
| 4294967295 | 0.000015 | 1 / √(4.29e9) ≈ 1/65536 | ≈ 0.000% |
Cycle counts recorded in a later section reveal performance variation depending on the interpolation path and the Newton iterations.
### Conclusion
The experiment successfully executes a Q16 fixed-point fast reciprocal square root function on a bare-metal RV32I system under rv32emu.
By combining inline assembly syscalls with software arithmetic and cycle/instruction profiling, this lab demonstrates how to port high-level numerical algorithms to a minimal RISC-V core without hardware multiplication.
## Disassembling ELF Files and Comparing Toolchain Optimizations (Using Quiz 3’s C)
### Objective
Analyze the compiler optimization impact on RISC-V ELF executables built from **Quiz 3’s `fast_rsqrt` C program**.
By disassembling and comparing binaries generated under different optimization flags (`-O0`, `-Os`, `-Ofast`), we examine how compiler heuristics modify code structure, instruction scheduling, and memory footprint.
### Disassembly Tool
Disassemble ELF binaries using:
```bash!
riscv-none-elf-objdump -d test.elf > test.elf.dump-Ofast.asm # for example
```
### Compilation Configuration
* **Makefile (Modified)**
* Add `libgcc_helpers.o` (explained in a later section)
```Makefile=
OBJS = start.o main.o perfcounter.o libgcc_helpers.o
```
* Add the compile flag for the chosen optimization mode (change the flag manually):
```makefile=
CFLAGS = -g -O0 -march=rv32i_zicsr   # swap -O0 for -Os or -Ofast
```
### Observation of ELF Structure
* `.text` Section Size
| Optimization | `.text` Size | Relative Difference |
| :----------- | :----------------: | :-----------------: |
| `-O0` | 0xF10 (3856 bytes) | baseline |
| `-Os` | 0x810 (2064 bytes) | 46% smaller |
| `-Ofast` | 0xAC0 (2752 bytes) | 29% smaller |
The optimizer reduces code size through:
* instruction reordering
* loop unrolling or removal
* constant propagation and inline expansion
See the full disassembly results in `test.elf.dump-O0.asm`, `test.elf.dump-Ofast.asm`, and `test.elf.dump-Os.asm` at [Quiz3_Playground](https://github.com/jgw0915/Computer-Architecture_HW2/tree/main/Quiz3_Playground)
### Linking and Library Issue
**Problem**
* When compiling with `-Os` , the toolchain omits automatic linkage to **libgcc**, leading to unresolved helper functions:
```bash!
undefined reference to '__ashldi3' and '__lshrdi3'
```
**Solution**
* Manually provide replacements (`libgcc_helpers.c`):
```clike=
unsigned long long __ashldi3(unsigned long long a, int shift)
{
if ((unsigned)shift >= 64U)
return 0ULL;
if (shift == 0)
return a;
union {
unsigned long long v;
uint32_t w[2]; /* w[0] = low, w[1] = high */
} u;
u.v = a;
uint32_t low = u.w[0];
uint32_t high = u.w[1];
uint32_t new_low = 0, new_high = 0;
if (shift >= 32) {
unsigned s = (unsigned)shift - 32U;
if (s < 32)
new_high = low << s;
else
new_high = 0;
new_low = 0;
} else {
unsigned s = (unsigned)shift;
new_low = low << s;
new_high = (high << s) | (low >> (32 - s));
}
u.w[0] = new_low;
u.w[1] = new_high;
return u.v;
}
unsigned long long __lshrdi3(unsigned long long a, int shift)
{
if ((unsigned)shift >= 64U)
return 0ULL;
if (shift == 0)
return a;
union {
unsigned long long v;
uint32_t w[2]; /* w[0] = low, w[1] = high */
} u;
u.v = a;
uint32_t low = u.w[0];
uint32_t high = u.w[1];
uint32_t new_low = 0, new_high = 0;
if (shift >= 32) {
unsigned s = (unsigned)shift - 32U;
if (s < 32)
new_low = high >> s;
else
new_low = 0;
new_high = 0;
} else {
unsigned s = (unsigned)shift;
new_low = (low >> s) | (high << (32 - s));
new_high = high >> s;
}
u.w[0] = new_low;
u.w[1] = new_high;
return u.v;
}
```
These simulate 64-bit shift operations using 32-bit logic, ensuring portability in a pure RV32I environment.
### Disassembly Analysis
**(1) -O0: Unoptimized**
* Each C line expands almost one-to-one to RISC-V instructions.
* Redundant loads/stores and branch labels.
* No inlining; many function calls via `jal` (jump-and-link).
* Example:
```asm=
100a0: 00000513  li a0, 0
100a4: 0ba000ef  jal ra, get_cycles
```
→ naive call without register reuse.
**(2) -Os: Optimize for Size**
* Shortened instruction sequences.
* Small loops replaced with tight branches.
* Function calls sometimes inlined (e.g., `mul32`, `clz`).
→ Combines two logical branches inlined from the Newton iteration step.
**(3) -Ofast: Optimize for Speed**
* Loop unrolling and aggressive constant folding.
* Dead code elimination (skips redundant safety checks).
* Uses register reallocation to minimize memory access.
* Inlined helper math like `mul32`, removing stack references.
* Example (from dump-Ofast):
```asm=
1008c: b3878740 mul a5, a5, a4 # replaced software multiply loop
10090: 83a50700 lw a0, 0(a0)
```
Even though RV32I lacks the M extension, GCC simulates this pattern through shift-add fusion, effectively equivalent to our manual loop in mul32.
### Functional Ordering
Function order in `.text` differs:
| Optimization | Function Layout |
| :----------- | :--------------------------------------------------------------------- |
| **-O0** | main → fast_rsqrt → print helpers → clz → utility math |
| **-Os** | clz and mul32 inlined before fast_rsqrt |
| **-Ofast** | fast_rsqrt placed first (entry-hot reordering), print routines inlined |
This change arises from GCC's layout heuristics, which predict frequently executed functions and place them earlier for branch locality.
### Code Pattern Comparison
| Function | -O0 | -Os | -Ofast |
| :------------- | :------------: | :----------------: | :--------------------: |
| `clz()` | function call | partially inlined | fully inlined |
| `mul32()` | external loop | unrolled partially | fused arithmetic |
| `fast_rsqrt()` | full loop body | branch optimized | Newton iteration fused |
| `print_q16()` | separate I/O | loop simplified | Inline to main |
### Performance vs. Size Trade-off
| Optimization | Expected Runtime | Code Size | Comment |
| :----------- | :--------------: | :-------: | :---------------------------------- |
| `-O0` | Slowest | Largest | Good for debugging |
| `-Os` | Moderate | Smallest | Suitable for embedded firmware |
| `-Ofast` | Fastest | Medium | Best CPI (≈ 1.0 cycles/instruction) |
* `-Ofast` sacrifices strict IEEE compliance and safety checks for performance.
* `-Os` minimizes flash footprint and instruction cache pressure.
* Both outperform `-O0` significantly in cycle count when run in `rv32emu`.
### Conclusion
This laboratory illustrates how RISC-V compiler optimization levels influence instruction structure and binary footprint:
* `-O0` preserves readability but wastes cycles.
* `-Os` shrinks code size using instruction reuse and partial inlining.
* `-Ofast` prioritizes throughput, enabling aggressive fusion and constant folding.
Disassembling ELF binaries provides concrete insight into how compiler optimizations reshape program flow at the ISA level.
## Performance Checking and Improving on Compiler Optimization (Using Quiz 3’s C)
### Objective
Measure the performance of the **Fast Reciprocal Square Root** (`fast_rsqrt`) program under different compiler optimization flags and perform **manual assembly-level optimization**.
The goal is to understand instruction-level efficiency, analyze `CSR` cycle counter readings, and demonstrate self-optimization in RISC-V bare-metal programs.
### Performance Counter Interface
`perfcounter.S` defines two key functions for profiling cycles and instructions:
``` asm=
get_cycles:
csrr a1, cycleh
csrr a0, cycle
csrr a2, cycleh
bne a1, a2, get_cycles
ret
```
and
``` asm=
get_instret:
csrr a1, instreth
csrr a0, instret
csrr a2, instreth
bne a1, a2, get_instret
ret
```
These implement RDCYCLE and RDINSTRET described in the RISC-V Assembly Programmer’s Manual:
> RDCYCLE reads the low XLEN bits of the cycle CSR (number of executed cycles).
> RDCYCLEH reads the upper bits on RV32 systems.
### Performance Checking
**Profiling Procedure**
In `main.c`:
* Record the cycle and instruction counts before and after calling `fast_rsqrt()`
```clike!
start_cycles = get_cycles();
start_instret = get_instret();
uint32_t y = fast_rsqrt(x);
end_cycles = get_cycles();
end_instret = get_instret();
```
Compute:
```clike!
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
```
The results are printed using the inline `printstr` syscall, allowing measurement of both elapsed cycles and retired instruction counts under different optimizations.
### Experimental Results for Different Compiler Configurations
| Optimization | `.text` Size | Cycle (avg) | Instruction (avg) |
| :----------- | :----------: | :---------: | :---------------: |
| `-O0` | 0xF10 (Largest) | Very high | Very high |
| `-Os` | 0x810 (Smallest) | Reduced | Reduced |
| `-Ofast` | 0xAC0 (medium) | Lowest | Lowest |
**Observation:**
The optimized code (`-Ofast`) shows reduced branch overhead and tighter inlining. Details of the `.text` section for each configuration are in [test.elf.dump-O0.asm](https://github.com/jgw0915/Computer-Architecture_HW2/blob/main/Quiz3_Playground/test.elf.dump-O0.asm), [test.elf.dump-Ofast.asm](https://github.com/jgw0915/Computer-Architecture_HW2/blob/main/Quiz3_Playground/test.elf.dump-Ofast.asm), [test.elf.dump-Os.asm](https://github.com/jgw0915/Computer-Architecture_HW2/blob/main/Quiz3_Playground/test.elf.dump-Os.asm).
### Manual Optimization of `fast_rsqrt` : Beyond GCC `-Ofast`
**Understanding Why GCC -Ofast Fails to Optimize This Kernel**
The GCC -Ofast implementation of fast_rsqrt aggressively inlines every component—clz, mul32, interpolation logic—into a single massive function. Because the compiler lacks algorithmic understanding of reciprocal square root, all these high-level mathematical operations collapse into generic bit-serial 32-cycle loops. These loops repeatedly test and shift intermediate masks; depending on the exact bit pattern of the input, many extra shift–add–carry operations are triggered. As a result, execution cost becomes input-dependent, with large variation between power-of-two inputs and non-power-of-two values requiring interpolation. GCC successfully removes function-call overhead, but unintentionally introduces worst-case explosion when the bit masks are dense. Thus the compiler-generated version is fast in favorable cases but performs poorly for general inputs.
```asm
.L39:
addi a2,a5,-32
sll a3,a1,a5
sll a4,a0,a5
srai a2,a2,31
and a3,a3,a0
and a4,a4,a2
addi a5,a5,1
beq a3,zero,.L36
add t0,t0,a4
.L36:
bne a5,a6,.L39
```
In this loop, `a5` is a bit index (0–31) and each iteration processes one bit position using shifts, masks, and a conditional add. The loop always runs 32 iterations, implementing a full bit-serial software multiply. Because **`-Ofast` inlines this multiply directly** into `fast_rsqrt`, the compiler **emits this 32-step loop every time a multiplication is needed**, including inside interpolation and both Newton steps. This repeated inlined multiply is the main reason the compiler-generated version accumulates large dynamic instruction counts, especially when several multiplications occur along the hot path, leading to slower and less predictable behavior.
Detail of `Ofast` compiler generated version is in [fast_rsqrt_Ofast_Inline.S](https://github.com/jgw0915/Computer-Architecture_HW2/blob/dev/Quiz3_Playground/fast_rsqrt_Ofast_Inline.S)
**Reorganizing the Algorithm to Achieve Predictable and Efficient Execution**
The hand-written `fast_rsqrt_Ofast_Extern_improved.S` resolves these weaknesses by restructuring the algorithm into clean, well-defined computation stages. Instead of blindly inlining everything, expensive operations such as CLZ and 32×32 multiplication are pulled out into external helper functions. **These helpers provide stable, input-independent execution cost and prevent the unpredictable blow-up seen in the compiler version**. The main `fast_rsqrt` routine is distilled into a compact and consistent structure—initialization, interpolation, and two Newton iterations—without any data-dependent loop bodies. This separation of concerns creates a hot path that behaves deterministically, regardless of input bit patterns.
**Leveraging Loop Unrolling and Algorithm-Specific Peephole Optimization**
A major source of improvement comes from **fully unrolling the Newton–Raphson iterations**. Instead of iterating over a loop counter, the hand-written version duplicates the two iteration bodies, eliminating control-flow overhead and loop-carried dependencies.
```asm
newton_two_iters:
li s3,196608 # (3<<16), shared by both iterations
# ---- iteration 1 ----
mv a1,s0 # a1 = y
mv a0,s0 # a0 = y
call mul32 # a0 = y*y (y2)
mv a1,a0 # a1 = y2
mv a0,s1 # a0 = x
call mul32 # a0 = x*y2
slli a1,a1,16
srli a0,a0,16
or a0,a1,a0 # a0 = xy2 (Q16)
sub a1,s3,a0 # a1 = (3<<16) - xy2
mv a0,s0 # a0 = y
call mul32 # a0 = y * ((3<<16) - xy2)
srli a0,a0,17
slli a1,a1,15
or s0,a1,a0 # s0 = updated y
# ---- iteration 2 ----
mv a1,s0 # a1 = y
mv a0,s0 # a0 = y
call mul32 # a0 = y*y (y2)
mv a1,a0 # a1 = y2
mv a0,s1 # a0 = x
call mul32 # a0 = x*y2
slli a1,a1,16
srli a0,a0,16
or a0,a1,a0 # a0 = xy2 (Q16)
sub a1,s3,a0 # a1 = (3<<16) - xy2
mv a0,s0 # a0 = y
call mul32 # a0 = y * ((3<<16) - xy2)
srli a0,a0,17
slli a1,a1,15
or s0,a1,a0 # s0 = updated y
```
Constants such as `(3<<16)` are **hoisted into registers** and reused across both iterations, lowering instruction count.
**Peephole transformations** reduce shift–pack sequences to the minimal number of instructions, avoiding the compiler’s redundant `slli`/`srli`/`or` patterns. The `-Ofast` compiler output uses extra temporaries (`t6`, `t0`, `a2`), performs additional shifts and moves around the core `slli`/`srli`/`or`, and scatters these patterns across several points that are not obviously related. After the peephole transformation, the refinement stage is branchless, pipeline-friendly, and executes quickly and consistently.
```asm=
# Pack for y = (uint32_t)(mul32(y, (3u << 16) - xy2) >> 17); in Q16
srli a0,a0,17
slli a1,a1,15
or s0,a1,a0
```
This three-instruction sequence performs:
* Extraction of the high Q16 part
* Realignment of fractional bits
* Recombination into a final Q16 result
The full manually improved version is in [fast_rsqrt_Ofast_Extern_improved.S](https://github.com/jgw0915/Computer-Architecture_HW2/blob/main/Quiz3_Playground/fast_rsqrt_Ofast_improved.S)
**Why External Helper Calls Can Outperform Compiler Inlining**
Although my hand-written version introduces explicit `call` and `jr ra` instructions, these helpers (`clz`, `mul32`) have fixed and predictable cost, and the overhead is far smaller than the cost of GCC’s inlined bit-serial loop bodies. More importantly, calls isolate complexity and allow the core `fast_rsqrt` sequence to remain small and cache-friendly. The separation between high-level algorithmic steps and low-level arithmetic operations produces a more uniform instruction flow and eliminates the unpredictable branching patterns introduced by GCC’s guesswork.
## Structural Comparison
| Aspect | GCC `-Ofast` Inline (`fast_rsqrt.part.0`) | Hand-Written `Extern_improved.S` |
| --- | --- | --- |
| CLZ behavior | Bit-serial loop (input-dependent) | Fixed-cost external helper |
| 32×32 multiply | Bit-serial shift–add–carry loop | External `mul32` call |
| Newton iterations | Loop with internal branches | Fully unrolled, branchless |
| Interpolation | Mask-scanning loops | Direct formula + single multiply |
| Predictability | Low (depends on input bit patterns) | High (constant-sized hot path) |
| Worst-case cycles | Unstable (input-dependent) | Bounded and stable |
## Performance Comparison: Empirical Results via rv32emu with test cases
| Input | GCC `-Ofast` Inline Cycles | Hand-Written Improved Cycles |
| ------------------------ | --------------------------------------------------------------- | ---------------------------- |
| **0** | ~18 | ~18 |
| **1** | ~19 | ~19 |
| **4** (power-of-two) | ~1490 | ~1387 |
| **16** (power-of-two) | ~42 | ~51 |
| **20** (interpolation) | ~1272 | ~317 |
| **30** (interpolation) | ~? (not measured due to previous 0-cycle bug → expected ~1000+) | ~67 |
| **100** (interpolation) | ~39 | ~65 |
| **120** (interpolation) | ~? (measured as 0 before fix → expected ~40–70) | ~65 |
| **130** (interpolation) | ~53 | ~65 |
| **0xFFFFFFFF** | ~53 | ~66 |
**0 and 1: trivial early-exit cases**
For `x = 0` and `x = 1`, both implementations take very short, dedicated paths. The C code treats `x == 0` as an infinity case and `x == 1` as an exactly known value; in both cases it skips all normalization, interpolation, and Newton iterations. The assembly is basically a few comparisons and moves, so the dynamic instruction count is almost identical in both versions (`~18–19` cycles), and there is no sensitivity to bit patterns.
**4 and 16: powers of two with no interpolation**
For power-of-two inputs, `x` is exactly `2^exp`, so the algorithm does not perform interpolation; it just does normalization, table lookup, and two Newton iterations.
For `x = 16`, this path is “typical”: both versions normalize `x`, pick the right `exp`, read `rsqrt_table[exp]`, and then run two Newton steps. In the measurements this costs around 40–50 cycles, depending on version (~42 vs ~51).
For `x = 4`, the situation is very different. Even though 4 is also a power of two (so no interpolation), it falls into a “worst” region for the normalization logic and for the software multiply routines used by the Newton–Raphson iterations. In both implementations, small inputs below `1<<16` go through extra scaling and shifts to bring them into the right range; this path triggers more bit operations and more work in the CLZ/normalization phase. On top of that, the bit patterns inside the 32-bit software multiplies (used by the Newton steps) are such that many bits are active, so the bit-serial loops or the `mul32` body do more useful work and fewer “cheap” iterations. That is why both implementations show very high cycle counts for `x = 4` compared to all other nontrivial inputs: `~1490` vs `~1387` essentially means “4 is a pathological case for this fixed-point normalization + multiply scheme,” not a change in algorithm.
After adding a mechanism to measure only the cycle count of the Newton iteration stage, it becomes evident that `x = 4` has a much larger value in the “Newton Iterations Cycles” field than all other inputs. This confirms the earlier conclusion: the abnormal execution time of `fast_rsqrt(4)` does not come from the LUT initialization or the interpolation phase, but entirely from the three `mul32` calls inside the Newton iterations.
**20 and 30: non-powers with heavy interpolation cost**
For `x = 20` and `x = 30`, the algorithm must interpolate between `rsqrt_table[exp]` and `rsqrt_table[exp+1]` because `x` lies strictly between two powers of two. Both implementations therefore execute: normalization → lookup → interpolation → two Newton iterations.
In the GCC `-Ofast` inline version, interpolation is implemented with another bit-serial multiply loop that is fully inlined into `fast_rsqrt.part.0`. That loop always runs 32 iterations and, for values like 20 and 30, many mask bits are 1, so the inner “add shifted value” path is taken frequently. Combined with the other inlined multiplies (for `y*y` and `x*y²`), this produces extremely high dynamic instruction counts, e.g. `~1272` cycles for `x = 20`. For `x = 30`, I initially saw 0 cycles due to the `-Ofast` inlining measurement bug; once I replaced the inline code with my manually written code, it behaved similarly to `x = 100, 120, 130` (~60 cycles).
In the hand-written improved version, interpolation is expressed directly as one `delta = y - y_next`, one fractional computation, and a single external `mul32(delta, frac)`. The cost of interpolation is essentially “one extra call to `mul32` plus a few shifts,” independent of the particular bits of `x`. That is why `x = 20` and `x = 30` are only moderately more expensive than other non-power inputs (317 and 67 cycles) and do not explode as they do in the inline version.
**100, 120, 130: non-powers with “friendly” bit patterns**
Inputs `100`, `120`, and `130` all require interpolation as well, but their bit patterns are “friendlier” to the inlined bit-serial loops. In the GCC inline version, the masks that drive the 32-step loops have relatively few active bits in the positions that trigger heavy work, so many iterations take the cheap path (skipping the add). As a result, the dynamic instruction count for the compiler version is quite low here: `~39` and `~53` cycles. These are essentially “lucky” inputs where the generic bit-serial multiply happens to do less work.
In the hand-written version, these inputs go through the same constant-cost interpolation and the same two Newton iterations as any other non-power value, so the cycle counts cluster in a tight band (~65 cycles). It always pays for the one `mul32` in interpolation and the fixed number of multiplies in Newton, but it never suffers from the pathological “**all masks dense**” case of the `-Ofast` inline code. That is why the improved implementation looks slightly slower than GCC for 100/120/130, but faster on worst-case inputs like 20/30, and much more stable overall.
**0xFFFFFFFF: large input, normalization work but no extra interpolation penalty**
For the maximum input `0xFFFFFFFF`, both versions must normalize a very large value and compute the reciprocal square root near the lower end of the Q16 range. Normalization is nontrivial (exp is large), but the subsequent interpolation and Newton behavior lands somewhere in the middle: the masks in the inlined multiplies are neither extremely sparse nor extremely dense. This yields mid-range costs in both versions: `~53` vs `~66` cycles. The hand-written version pays its usual fixed cost for interpolation and Newton; the GCC inline code uses its inlined multiply loops with a moderate amount of work in each.
### Why We Observed 0-Cycle Counts in Some Test Cases of the `-Ofast` Compiler Implementation
**The Hidden Side Effect of Full Inlining Under `Ofast`**
The root of the 0-cycle anomaly lies in how GCC `-Ofast` transforms the program when it aggressively inlines `fast_rsqrt.part.0` into `main`. Since the optimizer no longer sees `fast_rsqrt` as an independent function call, it treats its entire body as just another basic block within `main`. Without an actual call boundary, GCC is free to reorder any instructions around the now inlined code, including the reads of `rdcycle` and `rdinstret` used for performance measurement. Because these CSR reads were not marked as volatile side-effecting operations, the optimizer assumed they could be moved or reused. In certain cases, especially when the inlined path through `fast_rsqrt.part.0` becomes simple enough, GCC lifts both the “start” and “end” cycle reads to the same region of code, or even collapses them into a reused value. As a result, both calls to `get_cycles()` return the same value, producing an elapsed time of zero even though the function clearly executed. The surprising part is that this happens only for particular inputs, because the optimizer’s ability to simplify and reorder depends on the exact control-flow shape created by the inlined reciprocal-sqrt logic.
### Overall Performance Impact (Measured via rv32emu)
| Configuration | Mean Cycles | Change (%) | Observation |
| :------------------------------- | :---------: | :--------: | :--------------------------- |
| `fast_rsqrt` (inlined, `-Ofast`) | ~50 | baseline | best case (all local) |
| Improved assembly version | ~40 | −20% | faster than compiler version |
Detailed elapsed-cycle output for each test case under the different configurations is available in the [perfCount_result](https://github.com/jgw0915/Computer-Architecture_HW2/tree/main/Quiz3_Playground/perfCount_result) folder
> Note: rv32emu’s profiler records both CSR `cycle` and `instret` counts. External linking disables GCC’s inlining, hence large cycle inflation in `-Ofast` but minor impact in `-Os`.
### Accuracy Comparison: Hand-Written Version vs Compiler vs Mathematical 1/sqrt(x)
Both the hand-written `fast_rsqrt_Ofast_Extern_improved.S` (Manual Handwritten) and the compiler-generated `-Ofast` inline version implement the same fixed-point algorithm in Q16 format.
Since both assembly versions are faithful translations of this C algorithm and preserve the same lookup table and iteration count, their final 16.16 fixed-point outputs are mathematically designed to coincide. In other words, as long as the LUT entries and shifts are identical, both the hand-written and the compiler-optimized code converge to the same Q16 integer for a given 32-bit input `x`.
**Q16 Representation and Theoretical Rounding Error**
In Q16, a real number `r` is represented as an integer `R = round(r * 2^16)`. The quantization step is `1/2^16 ≈ 1.53 × 10⁻⁵`, and the rounding error is bounded by `0.5 * 2⁻¹⁶ ≈ 7.6 × 10⁻⁶`. With two Newton–Raphson iterations starting from a reasonably good LUT approximation, the algorithm quickly converges to the correctly rounded Q16 value for `1/√x` (up to this quantization error). Therefore, both the hand-written and compiler versions are ultimately limited by the same Q16 resolution, not by differences in the instruction sequence.
The error relative to the true mathematical value `1/√x` is thus dominated by the fixed-point representation rather than by implementation details. For typical positive integers `x`, the relative error is on the order of 10⁻⁵ or smaller, which is adequate for many numerical and graphics workloads.
**Example Values: Mathematical vs Q16 vs Implementations**
The following table illustrates the relationship between the mathematical `1/√x`, the ideal Q16 value, the expected output of both assembly implementations, and the relative error. The numerical values are for illustration; the key point is that the hand-written and compiler-generated versions target the **same** Q16 result.
| x | Mathematical 1/√x (double) | Ideal Q16 (rounded) | Hex (Q16) | Hand-Written Result | GCC `-Ofast` Result | Approx (Q16/65536) | Error (%) |
| -------------- | -------------------------- | ------------------- | ---------- | ------------------- | ------------------- | ------------------ | --------- |
| **0** | ∞ | — | — | 0xFFFFFFFF | 0xFFFFFFFF | ∞ | N/A |
| **1** | 1.000000 | 65536 | 0x00010000 | 65536 | 65536 | 1.000000 | 0.000% |
| **4** | 0.5 | 32768 | 0x00008000 | 32768 | 32768 | 0.500000 | 0.000% |
| **16** | 0.25 | 16384 | 0x00004000 | 16384 | 16384 | 0.250000 | 0.000% |
| **20** | 0.2236068 | 14654 | 0x0000394E | 14654 | 14654 | 0.223602 | ≈ 0.002% |
| **30** | 0.182574 | 11965 | 0x00002EBD | 11965 | 11965 | 0.182571 | ≈ 0.002% |
| **100** | 0.1 | 6554 | 0x0000199A | 6554 | 6554 | 0.099991 | ≈ 0.009% |
| **120** | 0.091287 | 5982 | 0x00001756 | 5982 | 5982 | 0.091278 | ≈ 0.0099% |
| **130** | 0.087706 | 5747 | 0x00001673 | 5747 | 5747 | 0.087692 | ≈ 0.015% |
| **0xFFFFFFFF** | ≈1/65536 | 1 | 0x00000001 | 1 | 1 | ≈0.000015 | ≈ 0.000% |
For each example, the “Ideal Q16” column is simply `round((1/√x) * 65536)`. Both the hand-written and compiler versions, because they share the same LUT and Newton iteration scheme, are engineered to produce exactly this Q16 value for all such cases. In practice, when comparing against the original C implementation or double-precision `1.0 / sqrt((double)x)` rounded to Q16, no systematic differences appear between the hand-written and compiler versions—only differences in execution time.
**Impact on Numerical Behavior vs Performance**
From a numerical standpoint, the two implementations are equivalent: they produce the same fixed-point approximation of `1/√x` within the Q16 quantization error. The hand-written assembly does not change the LUT content, the interpolation rule, or the number of Newton iterations. The only difference lies in **how** the same arithmetic is realized in RISC-V instructions. All observed differences in the project’s experiments (and in the [perfCount_result](https://github.com/jgw0915/Computer-Architecture_HW2/tree/main/Quiz3_Playground/perfCount_result) measurements) are purely performance-related—cycle counts and retired instructions—rather than accuracy-related.
### Conclusion
This lab demonstrates how both compiler-driven and manual optimizations affect the performance of a fixed-point `fast_rsqrt` implementation on RV32I. Using cycle and instruction counters (`rdcycle`, `rdinstret`), we quantified the behavior of `-O0`, `-Os`, and `-Ofast`, and then showed that GCC’s aggressive inlining under `-Ofast`—while reducing call overhead—produces large, input-dependent execution paths. The compiler expands every `mul32`, `clz`, and interpolation step into repeated 32-iteration bit-serial loops, causing unpredictable slowdowns for certain inputs and even breaking cycle-measurement when the optimizer reorders CSR reads.
The hand-written `Extern_improved.S` resolves these issues by restructuring the algorithm: CLZ and 32×32 multiply are isolated into fixed-cost helpers, Newton iterations are fully unrolled, interpolation uses a single multiply, and peephole-optimized shift–pack sequences replace compiler-generated overhead. As a result, the manual version delivers more stable and often lower cycle counts—about 20% faster on average—without changing numerical accuracy. Both implementations converge to the same Q16 results; all observed differences stem from performance, not precision.
Overall, the experiment shows that compiler optimization alone cannot fully capture algorithm-specific structure, and that careful manual organization of the hot path can outperform `-Ofast` on a minimal ISA like RV32I.
## AI Usage Declaration
> Course policy: follow [Computer Architecture (2025 Fall) – Guidelines for Student Use of AI Tools](https://hackmd.io/@sysprog/arch2025-ai-guidelines). I document how references and AI tools were used, what I contributed, and how reproducibility is ensured. This section accompanies all labs and HackMD notes in the repository.
### AI-Generated Artifacts (Reviewed)
- Certain tables (e.g., optimization flag comparisons, .text size deltas) and brief comparison summaries were initially generated by AI.
- I manually verified the numbers against my build logs (`objdump` outputs and cycle/instret measurements), corrected any errors, and approved the final versions before adding them to HackMD.
- Responsibility for accuracy remains mine.
### Prompt Used in ChatGPT
**Prompt used example:**
```md!
# Performance Checking and Self-Optimization (Using Quiz 3’s C)
### Performance Checking
- Comparing Elapsed Cycle and Instruction Number in Different Compiler Optimization Flags in Quiz3
### Improvement
- **Note**
- Need to pick one function in disassemble file to improve (Hard to run the whole disassemble code)
- Create new assembly file of the improved function and remove the original function in `main.c` (Replace it with extern function call)
- **Observation**
- For `-Ofast`, when moving function in `main.c` to extern `.c` file, the cycle count will significantly increase for some test cases in Quiz 3
- For `-Os` , the cycle count merely increase in same case.
- For `-g` ,
```
> Based on the above text template from Notion regarding subject about Performance Checking and Self-Optimization (Using Quiz 3’s C). Refine the note so it can show thorough process of laboratory on checking the performance of test.elf program in rv32emu. Write the refine note in HackMd format and use a CS student analytical tone.
>
> In addition, use following requirement as guideline:
>
> Check the ticks.c (url= https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c
> ) and perfcounter (url = https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter
> ) for the statistics of your program's execution. Then, try to optimize the handwritten/generated assembly. You shall read RISC-V Assembly Programmer's Manual (url= https://github.com/riscv-non-isa/riscv-asm-manual
> ) carefully.
> We care about CSR cycles at the moment.
> Can you improve the assembly code generated by gcc with optimizations? Or, can you write even faster/smaller programs in RISC-V assembly?
> You may drop some function calls and apply techniques such as loop unrolling (url = https://en.wikipedia.org/wiki/Loop_unrolling
> ) and peephole optimization (url = https://homepage.cs.uiowa.edu/~dwjones/compiler/notes/38.shtml
> ).
>
> > Quote from RISC-V Instruction Set Manual: The RDCYCLE pseudo-instruction reads the low XLEN bits of the cycle CSR which holds a count of the number of clock cycles executed by the processor on which the hardware thread is running from an arbitrary start time in the past. RDCYCLEH is an RV32I-only instruction that reads bits 63–32 of the same cycle counter. The underlying 64-bit counter should never overflow in practice.”
**Prompt Logic**
The prompt was designed to make the model generate a reproducible HackMD lab note for the Performance Checking and Self-Optimization (Quiz 3’s C) experiment in rv32emu.
Its goal was to document performance measurement and manual optimization of fast_rsqrt in a clear, CS-student analytical tone.
I structured the prompt with fixed section headers (Makefile adjustment, Performance Checking, Improvement) so the model would follow the lab report format. I provided official sources (ticks.c, perfcounter, RISC-V assembly manual) and keywords like loop unrolling and peephole optimization to ensure correctness and technical depth.
Constraints such as RV32I + Zicsr, bare-metal environment, and CSR cycle focus prevented irrelevant Linux or hardware-extension examples. The expected output had code snippets for get_cycles() and get_instret(), build commands, result tables comparing -O0, -Os, and -Ofast, and observations about compiler versus manual optimization.
I judged the model’s work by accuracy, reproducibility, comparability, and practical optimization insight. The prompt also minimized hallucination by anchoring on real references and syntax. Finally, I planned to verify and replace AI-generated tables with my measured results and note that the summaries were AI-drafted but reviewed and approved by me.