# Assignment2: Complete Applications
> contributed by < [Shaoen-Lin](https://github.com/Shaoen-Lin) >
## Prerequisites
### rv32emu
[rv32emu](https://github.com/sysprog21/rv32emu) is a RISC-V RV32 emulator that faithfully implements the RISC-V instruction set architecture. It serves as both:
* An educational tool for understanding modern RISC processor operations
* A starting point for customization and experimentation
Developed at NCKU. See VMIL'24 paper: "[Accelerate RISC-V Instruction Set Simulation by Tiered JIT Compilation](https://dl.acm.org/doi/10.1145/3689490.3690399)."
### Operating System
Ubuntu Linux 24.04 LTS or later (virtualization environments like VirtualBox are supported) and macOS 15+.
After we have those prerequisites, we can follow [Lab2](https://hackmd.io/@sysprog/Sko2Ja5pel) to perform a quick environmental test and verify that your RISC-V toolchain and emulator are working correctly.
## Environment Setup
| Item | Configuration |
| ----------- | -------------------------------------- |
| Emulator | rv32emu |
| Toolchain | riscv-none-elf-gcc @ xPack GNU RISC-V |
| ISA | RV32I + Zicsr |
| Build Flags | `-march=rv32i_zicsr -nostdlib -static` |
| Entry Point | `main` function in main.c |
| Syscall | 64 (write) for stdout printing only |
To achieve executing the code on the system, all components must comply with the following constraints:
* Only **RV32I/Zicsr** instructions are allowed, no **RV32M** (multiply and divide) or **RV32F** (single-precision floating point) extension instructions are allowed.
* **No system calls** and **no FPU** (Floating-Point Unit) are permitted.
* All I/O operations must be performed through simulate system call - `ecall 64` (simulating write to stdout).
## What we should do before simulation?
Before modifying or extending this system, it is important to understand the roles of each part — Makefile, perfcounter, and so on... included how they interact with the RISC-V bare-metal environment.
### 1. Makefile
The Makefile defines how to compile and link all source files into the final RISC-V ELF executable.
#### Key points to modify :
* `CFLAGS = -g -march=rv32i_zicsr` - This flag specifies the target architecture (RV32I + Zicsr) and includes debugging information (-g), which enables symbol and memory inspection in emulators such as rv32emu.
We can also control compiler optimization via the OPT variable:
```c
make OPT=1 # equivalent to -O1
make OPT=2 # equivalent to -O2
make OPT=3 # equivalent to -O3
make OPT=fast # equivalent to -Ofast
```
The flag -O$(OPT) is appended to CFLAGS to dynamically select the optimization level.
* `OBJS = start.o main.o ...` — This line lists all object files (.o) that will be linked together to generate test.elf.
To include new assembly or C files in the final executable, simply add their corresponding .o filenames here.
* `LDFLAGS = -T $(LINKER_SCRIPT) -nostartfiles -nostdlib -nodefaultlibs` —
These flags are required for running in a bare-metal environment (no OS).
* `-nostartfiles` skips the default startup code (crt0.o) since we already define our own _start in start.S.
* `-nostdlib` and `-nodefaultlibs` disable linking to the standard and built-in GCC libraries (like libc, libgcc), because we provide our own ECALL-based I/O and runtime.
Together, they ensure the program contains only our custom code and can run directly on rv32emu without any OS dependency.
* `$(CC) $(LDFLAGS) -o $@ $(OBJS)` — Using $(CC) instead of $(LD) lets the compiler handle linking with the same architecture and optimization settings (-O1, -O2, etc.).
gcc automatically passes correct ABI flags and links both C and assembly objects consistently, avoiding manual linker setup errors.
This ensures unified optimization across all modules, though it does not directly speed up assembly code itself.
* ```%.o: %.S $(CC) $(CFLAGS) -x assembler-with-cpp $< -o $@ -c``` — This option allows .S files to use the C preprocessor (#define, #ifdef, etc.) before assembling.
It helps assembly code adapt to different architectures or optimization levels but does not perform real instruction-level optimization.
#### Modified `Makefile`
```c
include ../../../mk/toolchain.mk
ARCH = -march=rv32izicsr
LINKER_SCRIPT = linker.ld
EMU ?= ../../../build/rv32emu
# -------------------------------------------------
# Optimization Level
# -------------------------------------------------
OPT ?= 0
AFLAGS = -g $(ARCH)
CFLAGS = -g -O$(OPT) -march=rv32i_zicsr
LDFLAGS = -T $(LINKER_SCRIPT) -nostartfiles -nostdlib -nodefaultlibs
EXEC = test.elf
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
LD = $(CROSS_COMPILE)ld
OBJDUMP = $(CROSS_COMPILE)objdump
OBJS = start.o main.o perfcounter.o single_number_3.o hanoi.o fast_rsqrt.o
.PHONY: all run dump clean
all: $(EXEC)
$(EXEC): $(OBJS) $(LINKER_SCRIPT)
$(CC) $(LDFLAGS) -o $@ $(OBJS)
%.o: %.S
$(CC) $(CFLAGS) -x assembler-with-cpp $< -o $@ -c
%.o: %.c
$(CC) $(CFLAGS) $< -o $@ -c
run: $(EXEC)
@test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1)
@grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1)
@grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1)
$(EMU) $<
dump: $(EXEC)
$(OBJDUMP) -Ds $< | less
clean:
rm -f $(EXEC) $(OBJS)
```
### 2. perfcounter
The `perfcounter.S` module provides two low-level RISC-V assembly functions that read hardware performance counters directly from **CSR (Control and Status Registers)**.
These counters measure how many CPU cycles have elapsed and how many instructions have been executed since system startup.
This program implements two performance counter reading functions used for measurement:
* get_cycles: Reads the current number of CPU clock cycles executed (cycle counter).
* get_instret: Reads the current number of instructions that have been completed (instructions retired).
Since this is a 32-bit RISC-V architecture, the hardware provides 64-bit counters, but they can only be accessed through two 32-bit registers (cycle/cycleh, instret/instreth).
Therefore, care must be taken to **avoid overflow** issues during reading.
> For example, when we execute `csrr a1, cycleh` while the counter is at cycle = 0x00000000_11111111, we first get cycleh = 0x00000000.
>
> However, if the counter increments right before the next instruction — so that the value becomes cycle = 0x00000001_00000000 — then executing `csrr a0, cycle` will return cycle = 0x00000000.
>
>When we combine these values (cycleh = 0x00000000 and cycle = 0x00000000),
the final 64-bit result becomes 0x00000000_00000000, which is completely incorrect and causes a serious performance counter error.
#### Original `perfcounter.s`
We use the original [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter) to implement the CSR.
```asm
# Performance counter functions for RISC-V
# Read 64-bit cycle counter
.text
.globl get_cycles
.align 2
get_cycles:
csrr a1, cycleh
csrr a0, cycle
csrr a2, cycleh
bne a1, a2, get_cycles
ret
.size get_cycles,.-get_cycles
# Read 64-bit instruction retired counter
.globl get_instret
.align 2
get_instret:
csrr a1, instreth
csrr a0, instret
csrr a2, instreth
bne a1, a2, get_instret
ret
.size get_instret,.-get_instret
```
### 3. RISC-V Instructions/Registers Tools
When performing performance analysis on RISC-V ELF programs, it is essential to first understand the distribution of instructions and register usage at the assembly level.
To accomplish this, we use the RISC-V Instructions/Registers Tool, which performs a static analysis on compiled ELF files.
Before running the analysis, the tool must be built as part of the [rv32emu](https://github.com/sysprog21/rv32emu?tab=readme-ov-file) project.
**Step 1. Navigate to the rv32emu directory**
```c
cd ../rv32emu
```
**Step 2. Build the tool**
```c
make tool
```
**Step 3. Navigate to your ELF directory**
```
cd ../[your_folder_containing_elf_file]
```
**Step.2 Run the analysis**
```c
../rv32emu/build/rv_histogram [-r] [target_program_path]
```
### 4. Linker
#### Key points to modify :
The `.rodata` (read-only data) section is added to the linker script to store constant data such as **string literals**, **lookup tables**, and fixed constants defined with **const**.
Without this section, the linker may incorrectly place read-only constants into the `.data` segment, **causing unnecessary memory writes or read-only violations in bare-metal programs**.
By explicitly defining `.rodata`, we ensure all constants are stored in ROM/flash and remain immutable during program execution.
#### Modified `linker.ld`
```c
SECTIONS
{
. = 0x10000;
.text : {
*(.text._start)
*(.text)
}
.rodata : {
*(.rodata*)
}
.data : {
*(.data)
}
.bss : {
__bss_start = .;
*(.bss)
__bss_end = .;
}
.stack (NOLOAD) : {
. = ALIGN(16);
. += 4096;
__stack_top = .;
}
}
```
### 5. Main
This file implements the bare-metal testing framework that executes various **RISC-V** programs under the system emulator **rv32emu**.
It provides basic I/O routines, performance counters, and test cases for multiple problem modules such as *Single Number III*, *Hanoi*, and *Fast Reciprocal Square Root (rsqrt)*.
#### Key points to modify :
* `extern` declarations — To use a function that is defined in another file, we must declare it with the extern keyword.
This allows the linker to connect the function reference in main.c with its actual implementation in another source file (e.g. assembly or C).
* `print_q16()` — This custom function was added to convert and print **Q0.16 fixed-point numbers** into a human-readable decimal format.
It is mainly used in the `Quiz3_C` test to display the results of the reciprocal square root (rsqrt) function.
* `printstr()` macro updates — The printstr() macro now includes "memory" and "cc" in the clobber list of the inline assembly.
This prevents the compiler from reordering or optimizing away memory accesses during **ECALL** operations.
* `main()` function — We can add our own function calls in the `main()` section to execute and test newly implemented code modules.
Each test block can measure performance using `get_cycles()` and `get_instret()` to compare execution efficiency.
#### Modified `main.c`
```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#define printstr(ptr, length) \
do { \
asm volatile( \
"add a7, x0, 64\n\t" \
"add a0, x0, 1\n\t" /* stdout */ \
"mv a1, %0\n\t" /* buffer */ \
"mv a2, %1\n\t" /* len */ \
"ecall\n\t" \
: \
: "r"(ptr), "r"(length) \
: "a0", "a1", "a2", "a7", "memory", "cc");\
} while (0)
#define TEST_OUTPUT(msg, length) printstr(msg, length)
#define TEST_LOGGER(msg) \
{ \
static const char _msg[] = msg; \
TEST_OUTPUT(_msg, sizeof(_msg) - 1); \
}
extern uint64_t get_cycles(void);
extern uint64_t get_instret(void);
extern void single_number_3(void);
extern void hanoi(void);
extern uint32_t fast_rsqrt(uint32_t x);
/* Bare metal memcpy implementation */
void *memcpy(void *dest, const void *src, size_t n)
{
uint8_t *d = (uint8_t *) dest;
const uint8_t *s = (const uint8_t *) src;
while (n--)
*d++ = *s++;
return dest;
}
/* Software division for RV32I (no M extension) */
static unsigned long udiv(unsigned long dividend, unsigned long divisor)
{
if (divisor == 0)
return 0;
unsigned long quotient = 0;
unsigned long remainder = 0;
for (int i = 31; i >= 0; i--) {
remainder <<= 1;
remainder |= (dividend >> i) & 1;
if (remainder >= divisor) {
remainder -= divisor;
quotient |= (1UL << i);
}
}
return quotient;
}
static unsigned long umod(unsigned long dividend, unsigned long divisor)
{
if (divisor == 0)
return 0;
unsigned long remainder = 0;
for (int i = 31; i >= 0; i--) {
remainder <<= 1;
remainder |= (dividend >> i) & 1;
if (remainder >= divisor) {
remainder -= divisor;
}
}
return remainder;
}
/* Software multiplication for RV32I (no M extension) */
static uint32_t umul(uint32_t a, uint32_t b)
{
uint32_t result = 0;
while (b) {
if (b & 1)
result += a;
a <<= 1;
b >>= 1;
}
return result;
}
uint32_t __mulsi3(uint32_t a, uint32_t b)
{
return umul(a, b);
}
/* Simple integer to hex string conversion */
static void print_hex(unsigned long val)
{
char buf[20];
char *p = buf + sizeof(buf) - 1;
*p = '\n';
p--;
if (val == 0) {
*p = '0';
p--;
} else {
while (val > 0) {
int digit = val & 0xf;
*p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10);
p--;
val >>= 4;
}
}
p++;
printstr(p, (buf + sizeof(buf) - p));
}
/* Simple integer to decimal string conversion */
static void print_dec(unsigned long val)
{
char buf[20];
char *p = buf + sizeof(buf) - 1;
*p = '\n';
p--;
if (val == 0) {
*p = '0';
p--;
} else {
while (val > 0) {
*p = '0' + umod(val, 10);
p--;
val = udiv(val, 10);
}
}
p++;
printstr(p, (buf + sizeof(buf) - p));
}
/* ============= ADDED print_q16 function ============= */
/* Print fixed-point Q0.16 number in human-readable decimal format */
static void print_q16(uint32_t val)
{
unsigned long int_part = val >> 16; // Extract integer part (upper 16 bits)
// Compute fractional part using 64-bit multiplication for precision
uint64_t tmp = (uint64_t)(val & 0xFFFF) * 1000000ULL; // Scale to 6 decimal digits
unsigned long frac_part = (unsigned long)(tmp >> 16); // Convert back to 32-bit range (0..999999)
// Print integer part
char buf_int[20];
char *p = buf_int + sizeof(buf_int);
*--p = '\0';
unsigned long t = int_part;
if (t == 0) *--p = '0';
else {
while (t) { *--p = '0' + umod(t, 10); t = udiv(t, 10); } // Convert integer digits
}
printstr(p, (unsigned)(buf_int + sizeof(buf_int) - 1 - p));
printstr(".", 1); // Print decimal point
// Print fractional part (always 6 digits)
char buf_frac[6];
for (int i = 5; i >= 0; --i) {
buf_frac[i] = '0' + umod(frac_part, 10);
frac_part = udiv(frac_part, 10);
}
printstr(buf_frac, 6);
printstr("\n", 1); // New line
}
int main(void)
{
uint64_t start_cycles, end_cycles, cycles_elapsed;
uint64_t start_instret, end_instret, instret_elapsed;
TEST_LOGGER("\n=== LeetCode 260. Single Number III Tests ===\n\n");
TEST_LOGGER("Test 6: Single Number III\n");
start_cycles = get_cycles();
start_instret = get_instret();
single_number_3();
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
TEST_LOGGER("\n");
/*
TEST_LOGGER("\n=== Quiz2_A Tests ===\n\n");
TEST_LOGGER("Test 7: Quiz2_A\n");
start_cycles = get_cycles();
start_instret = get_instret();
hanoi();
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
TEST_LOGGER("\n");
*/
/*
TEST_LOGGER("\n=== Quiz3_C Tests === \n\n");
TEST_LOGGER("Test 8: Quiz3_C\n");
start_cycles = get_cycles();
start_instret = get_instret();
uint32_t test[9] = {1, 2, 3, 4, 10, 100, 1000, 10000, 0xFFFFFFFF};
for (int i = 0; i < 9; i++)
{
TEST_LOGGER("Test case ");
print_dec((unsigned long) (i+1));
uint32_t inv = fast_rsqrt(test[i]);
TEST_LOGGER(" Input: ");
print_dec((unsigned long) test[i]);
TEST_LOGGER(" rsqrt: ");
print_q16(inv);
}
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
*/
return 0;
}
```
## RISC-V Cross-Compilation and Execution Workflow
* **Step 0 - Navigate to the working directory**
> Make sure the function you want to execute is properly called inside `main.c`.
* **Step 1 - Navigate to the working directory**
> cd $HOME / riscv-none-elf-gcc / rv32emu / tests / system / hw2
* **Step 2 - Build and execute the ELF file**
> make clear
> make [ OPT= x ] (Input the ... you want)
> make run
* **Step 3 - Check ELF file size**
> riscv-none-elf-size test.elf
* **Step 4 - Check the ELF header**
> riscv-none-elf-readelf -h test.elf
* **Step 5 - Disassemble the ELF file**
> riscv-none-elf-objdump -d test > test.txt
* **Step 6 - Generate instruction/register histogram**
> .../rv32emu/build/rv_histogram test.elf
## Simulaion test : LeetCode 260. Single Number III
I choose one complete program (with test suite) from < [Shaoen-Lin](https://github.com/Shaoen-Lin) > and make it run in a bare-metal environment using rv32emu’s system emulation.
For successfully executing the code, We modified few places in the original source code.
### Code Modification Summary
#### 1. Removed OS dependencies
Original Linux syscalls (ecall 4, ecall 1, ecall 10) were replaced with ecall 64 for simulated stdout output.
#### 2. Auto-calculated string lengths
Used `.set msg_xxx_size, . - msg_xxx` to automatically compute string sizes for `SYS_write`.
#### 3. Custom print POS/NEG decimal interger function
Implemented a `print_int` function using pure RV32I instructions, simulating division by 10 via repeated subtraction (no div/rem).
The function checks the sign bit to detect negative numbers, converts them to positive for digit extraction, and prepends a '-' before printing the final result.
```asm
# =====================================================
# print_int(a0)
# Convert integer a0 to ASCII and print (pure RV32I, no div/rem)
# =====================================================
.global print_int
print_int:
addi sp, sp, -92
sw ra, 88(sp)
sw t0, 84(sp)
sw t1, 80(sp)
sw t2, 76(sp)
sw t3, 72(sp)
sw t4, 68(sp)
sw t5, 64(sp)
mv t0, a0
addi t1, sp, 64
li t2, 10
li t5, 0
# === handle negative numbers ===
bltz t0, make_positive
j convert_loop_i
make_positive:
neg t0, t0
li t5, 1
# === simulate divide by 10 ===
convert_loop_i:
# simulate quotient = t0 / 10, remainder = t0 - 10*quotient
mv t3, x0
sub_loop:
blt t0, t2, div_done
sub t0, t0, t2
addi t3, t3, 1
j sub_loop
div_done:
mv t4, t0
mv t0, t3
addi t4, t4, 48
addi t1, t1, -1
sb t4, 0(t1)
bnez t0, convert_loop_i
# if negative, prepend '-'
beqz t5, print_number_i
addi t1, t1, -1
li t3, '-'
sb t3, 0(t1)
# === print the result ===
print_number_i:
li a0, 1
mv a1, t1
add a2, sp, 64
sub a2, a2, t1
li a7, 64
ecall
lw ra, 88(sp)
lw t0, 84(sp)
lw t1, 80(sp)
lw t2, 76(sp)
lw t3, 72(sp)
lw t4, 68(sp)
lw t5, 64(sp)
addi sp, sp, 92
ret
```
### Modified `single_number_3.S`
```asm
.data
nums1: .word 2, 2, 3, 3, 4, 4, 0, 1, 100, 100, 99, 99
nums1_size: .word 12
ans1: .word 1, 0
nums2: .word 101, 17, 102, 102, -98, 0, 1, 101, 0, 1, 99, -98, 100, 17
nums2_size: .word 14
ans2: .word 100, 99
nums3: .word -2, -2, -2, 2, 2, 2, -6, -9, -2, 2, 2, -5, 2, -6, -2, -10, -11, -10, -11, -2, -6, -9
nums3_size: .word 22
ans3: .word -5, -6
test_cases: .word nums1, nums2, nums3
test_sizes: .word nums1_size, nums2_size, nums3_size
test_ans: .word ans1, ans2, ans3
ressize: .word 0
result: .word 0, 0
msg_case: .string "Test case "
.set msg_case_size, . - msg_case # auto calculate length
msg_input: .string " Input: "
.set msg_input_size, . - msg_input
msg_output: .string " Output: "
.set msg_output_size, . - msg_output
msg_expect: .string " Expect: "
.set msg_expect_size, . - msg_expect
msg_ok: .string " PASSED\n"
.set msg_ok_size, . - msg_ok
msg_wrong: .string " FAILED\n"
.set msg_wrong_size, . - msg_wrong
space: .string " "
.set space_size, . - space
nl: .string "\n"
.set nl_size, . - nl
.text
.global single_number_3
single_number_3:
addi sp, sp, -36
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
sw s6, 28(sp)
sw s7, 32(sp)
# initialize iterators for case tables
la s5, test_cases
la s6, test_sizes
la s7, test_ans
li s3, 3
li s4, 1
loop_cases:
beqz s3, end_main
# load current pointers
lw s0, 0(s5)
lw s1, 0(s6)
lw s2, 0(s7)
# print "Test case i"
# === print "Test case " ===
li a0, 1
la a1, msg_case
la a2, msg_case_size
li a7, 64
ecall
# === print case number s4 ===
mv a0, s4
jal ra, print_int
# === newline ===
li a0, 1
la a1, nl
la a2, nl_size
li a7, 64
ecall
# print input array
li a0, 1
la a1, msg_input
la a2, msg_input_size
li a7, 64
ecall
lw t3, 0(s1)
li t4, 0
print_input_loop:
bge t4, t3, print_input_done
slli t5, t4, 2
add t6, s0, t5
lw a0, 0(t6)
jal ra, print_int
li a0, 1
la a1, space
la a2, space_size
li a7, 64
ecall
addi t4, t4, 1
j print_input_loop
print_input_done:
li a0, 1
la a1, nl
la a2, nl_size
li a7, 64
ecall
# save registers before call
addi sp, sp, -24
sw ra, 20(sp)
sw s0, 16(sp)
sw s1, 12(sp)
sw s2, 8(sp)
sw s3, 4(sp)
sw s4, 0(sp)
# call singleNumber(a0=nums, a1=size, a2=&ressize)
mv a0, s0
lw a1, 0(s1)
la a2, ressize
jal singleNumber
mv t6, a0
# restore registers
lw ra, 20(sp)
lw s0, 16(sp)
lw s1, 12(sp)
lw s2, 8(sp)
lw s3, 4(sp)
lw s4, 0(sp)
addi sp, sp, 24
# print output
li a0, 1
la a1, msg_output
la a2, msg_output_size
li a7, 64
ecall
lw t0, 0(t6)
mv a0, t0
jal ra, print_int
li a0, 1
la a1, space
la a2, space_size
li a7, 64
ecall
lw t1, 4(t6)
mv a0, t1
jal ra, print_int
li a0, 1
la a1, nl
la a2, nl_size
li a7, 64
ecall
# print expected result
li a0, 1
la a1, msg_expect
la a2, msg_expect_size
li a7, 64
ecall
lw t2, 0(s2)
mv a0, t2
jal ra, print_int
li a0, 1
la a1, space
la a2, space_size
li a7, 64
ecall
lw t3, 4(s2)
mv a0, t3
jal ra, print_int
li a0, 1
la a1, nl
la a2, nl_size
li a7, 64
ecall
# compare results
lw t4, 0(t6)
lw t5, 4(t6)
beq t4, t2, check_second
j print_wrong
check_second:
beq t5, t3, print_ok
j print_wrong
print_ok:
li a0, 1
la a1, msg_ok
la a2, msg_ok_size
li a7, 64
ecall
j next_case
print_wrong:
li a0, 1
la a1, msg_wrong
la a2, msg_wrong_size
li a7, 64
ecall
next_case:
addi s5, s5, 4
addi s6, s6, 4
addi s7, s7, 4
addi s4, s4, 1
addi s3, s3, -1
j loop_cases
end_main:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
lw s6, 28(sp)
lw s7, 32(sp)
addi sp, sp, 36
ret
# =====================================================
# print_int(a0)
# Convert integer a0 to ASCII and print (pure RV32I, no div/rem)
# =====================================================
.global print_int
print_int:
addi sp, sp, -92
sw ra, 88(sp)
sw t0, 84(sp)
sw t1, 80(sp)
sw t2, 76(sp)
sw t3, 72(sp)
sw t4, 68(sp)
sw t5, 64(sp)
mv t0, a0
addi t1, sp, 64
li t2, 10
li t5, 0
# === handle negative numbers ===
bltz t0, make_positive
j convert_loop_i
make_positive:
neg t0, t0
li t5, 1
# === simulate divide by 10 ===
convert_loop_i:
# simulate quotient = t0 / 10, remainder = t0 - 10*quotient
mv t3, x0
sub_loop:
blt t0, t2, div_done
sub t0, t0, t2
addi t3, t3, 1
j sub_loop
div_done:
mv t4, t0
mv t0, t3
addi t4, t4, 48
addi t1, t1, -1
sb t4, 0(t1)
bnez t0, convert_loop_i
# Negative situation
beqz t5, print_number_i
addi t1, t1, -1
li t3, '-'
sb t3, 0(t1)
# === print the result ===
print_number_i:
li a0, 1
mv a1, t1
add a2, sp, 64
sub a2, a2, t1
li a7, 64
ecall
lw ra, 88(sp)
lw t0, 84(sp)
lw t1, 80(sp)
lw t2, 76(sp)
lw t3, 72(sp)
lw t4, 68(sp)
lw t5, 64(sp)
addi sp, sp, 92
ret
# ================================
# Count Leading Zeros (clz)
# a0 = input, return a0 = clz(x)
# ================================
clz:
li s0, 32
li s1, 16
clz_loop:
srl t0, a0, s1
bnez t0, clz_if
srli s1, s1, 1
j clz_check
clz_if:
sub s0, s0, s1
mv a0, t0
clz_check:
bnez s1, clz_loop
sub a0, s0, a0
ret
# ==========================================
# singleNumber
# a0 = nums ptr, a1 = size, a2 = &returnSize
# return a0 = &result
# ==========================================
singleNumber:
li s2, 0
li t0, 0
# first loop: xor all numbers
for1_cond:
blt t0, a1, for1_body
li t1, 31
addi sp, sp, -8
sw ra, 0(sp)
sw a0, 4(sp)
mv a0, s2
jal ra, clz
sub s3, t1, a0
lw a0, 4(sp)
lw ra, 0(sp)
addi sp, sp, 8
li t1, 1
sll s4, t1, s3 # mask = 1U << shift
li t1, 0
li t2, 0
li t0, 0
j for2_cond
for1_body:
slli t1, t0, 2
add t1, a0, t1
lw t1, 0(t1)
xor s2, s2, t1
addi t0, t0, 1
j for1_cond
# second loop: split by mask bit
for2_cond:
blt t0, a1, for2_body
la a0, result
sw t1, 0(a0)
sw t2, 4(a0)
li t3, 2
sw t3, 0(a2)
ret
for2_body:
slli t3, t0, 2
add t3, a0, t3
lw t3, 0(t3)
and t4, t3, s4
bnez t4, for2_if
xor t2, t2, t3
j for2_next
for2_if:
xor t1, t1, t3
for2_next:
addi t0, t0, 1
j for2_cond
```
### Execution Results under -O0 Optimization Level
```c
Test: Single Number III
Test case 1
Input: 2 2 3 3 4 4 0 1 100 100 99 99
Output: 1 0
Expect: 1 0
PASSED
Test case 2
Input: 101 17 102 102 -98 0 1 101 0 1 99 -98 100 17
Output: 100 99
Expect: 100 99
PASSED
Test case 3
Input: -2 -2 -2 2 2 2 -6 -9 -2 2 2 -5 2 -6 -2 -10 -11 -10 -11 -2 -6 -9
Output: -5 -6
Expect: -5 -6
PASSED
Cycles: 5399
Instructions: 5399
```
In the followings, we adapt Problem A from Quiz 2 and Problem C from Quiz 3, and make them run in a bare-metal environment using rv32emu’s system emulation for measuring and refining the implementation based on profiling results.
## Problem `A` in Quiz2
### Code Modification Summary
#### 1. Added `print_char` Function
Implemented a `print_char` subroutine using pure RV32I instructions and `ecall 64` to print a single ASCII character.
The function temporarily stores the character on the stack, sets up the buffer pointer and length, and calls the write syscall (fd=1 for stdout).
It restores the stack pointer after output, ensuring safe nested calls within `hanoi.S`.
```asm
# ======================================================
# print_char(x10)
# Print a single character to stdout
# Argument:
# x10 = ASCII value of the character (e.g., 65 = 'A')
# ======================================================
.global print_char
print_char:
addi x2, x2, -4
sw x10, 0(x2)
li x10, 1
add x11, x2, x0
li x12, 1
li x17, 64
ecall
addi x2, x2, 4
ret
```
### Modified `hanoi.S`
```asm
.text
.globl hanoi
hanoi:
addi x2, x2, -4
sw x1, 0(x2)
addi x2, x2, -32
sw x8, 0(x2)
sw x9, 4(x2)
sw x18, 8(x2)
sw x19, 12(x2)
sw x20, 16(x2)
li x5, 0x15
sw x5, 20(x2)
sw x5, 24(x2)
sw x5, 28(x2)
# Fix disk positions (BLANK 1–3: neutralize x5 effect)
sw x0, 20(x2)
sw x0, 24(x2)
sw x0, 28(x2)
addi x8, x0, 1
game_loop:
addi x5, x0, 8
beq x8, x5, finish_game
srli x5, x8, 1
xor x6, x8, x5
addi x7, x8, -1
srli x28, x7, 1
xor x7, x7, x28
xor x5, x6, x7
addi x9, x0, 0
andi x6, x5, 1
bne x6, x0, disk_found
addi x9, x0, 1
andi x6, x5, 2
bne x6, x0, disk_found
addi x9, x0, 2
disk_found:
andi x30, x5, 5
addi x31, x0, 5
beq x30, x31, pattern_match
jal x0, continue_move
pattern_match:
continue_move:
slli x5, x9, 2
addi x5, x5, 20
add x5, x2, x5
lw x18, 0(x5)
bne x9, x0, handle_large
addi x19, x18, 2
addi x6, x0, 3
blt x19, x6, display_move
sub x19, x19, x6
jal x0, display_move
handle_large:
lw x6, 20(x2)
addi x19, x0, 3
sub x19, x19, x18
sub x19, x19, x6
display_move:
la x20, obdata
add x5, x20, x18
lbu x11, 0(x5)
li x6, 0x6F
xor x11, x11, x6
addi x11, x11, -0x12
addi x27, x11, 0
add x7, x20, x19
lbu x12, 0(x7)
xor x12, x12, x6
addi x12, x12, -0x12
addi x26, x12, 0
li x10, 1
la x11, str1
la x12, str1_size
li x17, 64
ecall
addi x10, x9, 49
addi x2, x2, -1
sb x10, 0(x2)
li x10, 1
mv x11, x2
li x12, 1
li x17, 64
ecall
addi x2, x2, 1
li x10, 1
la x11, str2
la x12, str2_size
li x17, 64
ecall
addi x10, x27, 0
jal x1, print_char
li x10, 1
la x11, str3
la x12, str3_size
li x17, 64
ecall
addi x10, x26, 0
jal x1, print_char
addi x5, x0, 10
addi x2, x2, -1
sb x5, 0(x2)
li x10, 1
mv x11, x2
li x12, 1
li x17, 64
ecall
addi x2, x2, 1
slli x5, x9, 2
addi x5, x5, 20
add x5, x2, x5
sw x19, 0(x5)
addi x8, x8, 1
jal x0, game_loop
finish_game:
lw x8, 0(x2)
lw x9, 4(x2)
lw x18, 8(x2)
lw x19, 12(x2)
lw x20, 16(x2)
addi x2, x2, 32
lw x1, 0(x2)
addi x2, x2, 4
ret
# ======================================================
# print_char(x10)
# Print a single character to stdout
# Argument:
# x10 = ASCII value of the character (e.g., 65 = 'A')
# ======================================================
.global print_char
print_char:
addi x2, x2, -4
sw x10, 0(x2)
li x10, 1
add x11, x2, x0
li x12, 1
li x17, 64
ecall
addi x2, x2, 4
ret
.data
obdata: .byte 0x3c, 0x3b, 0x3a
str1: .asciz " Move Disk "
.set str1_size, . - str1
str2: .asciz " from "
.set str2_size, . - str2
str3: .asciz " to "
.set str3_size, . - str3
```
### Execution Results under -O0 Optimization Level
```c
Test: Quiz2_A
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 782
Instructions: 782
```
## Problem `C` in Quiz3
### Description
This problem implements a **fast reciprocal square root** (`fast_rsqrt`) function using **lookup table initialization**, **linear interpolation**, and **Newton–Raphson iteration** under fixed-point arithmetic (**Q0.16 scaling**).
The implementation improves computational accuracy from the initial **20 % error** (table only) to **around 3 – 8 % after two Newton iterations**.
### Code Modification Summary
#### 1. Lookup Table
To ensure that all lookup table values fit within the `uint16_t` data type, the first entry was **adjusted from 65536 to 65535** (the closest representable value). This minor modification may slightly affect precision, but it allows the program to execute correctly without overflow.
#### 2. `print_q16` in main.c
A new helper function `print_q16()` was implemented to convert and display fixed-point **Q0.16** numbers in a readable decimal format.
To implement `print_q16()`, we reused existing utility functions — `printstr`, `umod`, and `udiv` — which were already provided in the original main.c.
These functions handle basic string output and unsigned division/modulo operations, allowing us to build the integer and fractional parts of the Q0.16 number entirely with integer arithmetic.
By separating the upper 16 bits as the integer part and scaling the lower 16 bits by 10^6 using 64-bit arithmetic, `print_q16()` accurately prints six decimal places without relying on floating-point instructions.
```c
/* Print fixed-point Q0.16 number in human-readable decimal format */
static void print_q16(uint32_t val)
{
unsigned long int_part = val >> 16; // Extract integer part (upper 16 bits)
// Compute fractional part using 64-bit multiplication for precision
uint64_t tmp = (uint64_t)(val & 0xFFFF) * 1000000ULL; // Scale to 6 decimal digits
unsigned long frac_part = (unsigned long)(tmp >> 16); // Convert back to 32-bit range (0..999999)
// Print integer part
char buf_int[20];
char *p = buf_int + sizeof(buf_int);
*--p = '\0';
unsigned long t = int_part;
if (t == 0) *--p = '0';
else {
while (t) { *--p = '0' + umod(t, 10); t = udiv(t, 10); } // Convert integer digits
}
printstr(p, (unsigned)(buf_int + sizeof(buf_int) - 1 - p));
printstr(".", 1); // Print decimal point
// Print fractional part (always 6 digits)
char buf_frac[6];
for (int i = 5; i >= 0; --i) {
buf_frac[i] = '0' + umod(frac_part, 10);
frac_part = udiv(frac_part, 10);
}
printstr(buf_frac, 6);
printstr("\n", 1); // New line
}
```
### Modified `fast_rsqrt.c`
```c
#include <stdio.h>
#include <stdint.h>
static const uint16_t rsqrt_table[32] = {
65535, 46341, 32768, 23170, 16384, 11585, 8192, 5792,
4096, 2896, 2048, 1448, 1024, 724, 512, 362,
256, 181, 128, 91, 64, 45, 32, 23,
16, 11, 8, 6, 4, 3, 2, 1};
/* =========================================================
* Optimized CLZ (Count Leading Zeros) — binary search style
* ========================================================= */
static int clz(uint32_t x)
{
if (!x) return 32; // special case: all bits are 0
int n = 0;
if (!(x & 0xFFFF0000u)) { n += 16; x <<= 16; }
if (!(x & 0xFF000000u)) { n += 8; x <<= 8; }
if (!(x & 0xF0000000u)) { n += 4; x <<= 4; }
if (!(x & 0xC0000000u)) { n += 2; x <<= 2; }
if (!(x & 0x80000000u)) { n += 1; }
return n;
}
/* =========================================================
* Software 32×32 → 64 bit multiplication (no MUL)
* ========================================================= */
static uint64_t mul32(uint32_t a, uint32_t b)
{
uint64_t r = 0;
for (int i = 0; i < 32; i++) {
if (b & (1u << i))
r += ((uint64_t)a << i);
}
return r;
}
/* =========================================================
* Fast inverse square root (fixed-point Q0.16)
* ========================================================= */
uint32_t fast_rsqrt(uint32_t x)
{
if (x == 0)
return 0xFFFFFFFF;
if (x == 1)
return 65536;
/* Step 1: MSB position */
int exp = 31 - clz(x);
/* Step 2: Lookup + interpolation */
uint32_t y_base = rsqrt_table[exp];
uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
uint32_t diff = y_base - y_next;
uint32_t base = 1u << exp;
uint32_t frac = ((x - base) << 16) >> exp; // Q0.16 fraction
uint32_t y = y_base - (uint32_t)(mul32(diff, frac) >> 16);
/* Step 3: Newton-Raphson refinement (2 iterations) */
for (int i = 0; i < 2; i++) {
uint32_t y2 = (uint32_t)(mul32(y, y) );
uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16);
uint32_t term = (3u << 16) - xy2;
uint32_t prod = (uint32_t)(mul32(y, term) >> 16);
y = prod >> 1;
}
return y;
}
```
### Execution Result under -O0 Optimization Level
```c
Test: Quiz3_C
Test case 1
Input: 1
rsqrt: 1.000000
Test case 2
Input: 2
rsqrt: 0.707107
Test case 3
Input: 3
rsqrt: 0.577331
Test case 4
Input: 4
rsqrt: 0.500000
Test case 5
Input: 10
rsqrt: 0.316223
Test case 6
Input: 100
rsqrt: 0.099990
Test case 7
Input: 1000
rsqrt: 0.031616
Test case 8
Input: 10000
rsqrt: 0.009994
Test case 9
Input: 4294967295
rsqrt: 0.000015
Cycles: 149642
Instructions: 149642
```
## Optimization
Below is the observation of compiler optimization effects.
We implemented and tested `hanoi.S` and `fast_rsqrt.c` in four different optimization levels: -O0, -O1, -O2, -O3, and -Ofast.
For each optimization level, we performed a detailed analysis that includes:
* **Program Size** — comparing the ELF file sizes to evaluate code compactness.
* **ELF Header Inspection** — examining memory section layout and alignment differences.
* **Cycles & Instructions Count** — measuring execution performance by recording the total number of CPU cycles and retired instructions, allowing comparison of efficiency across optimization levels.
* **Instruction Frequency Histogram** — analyzing how compiler optimizations affect instruction distribution and overall instruction count.
Through these observations, we aim to understand how increasing optimization levels impact execution performance, code efficiency, and instruction behavior at the assembly level.
## -O0 Optimization
### Cycles & Instructions count
| | Cycles| Instructions |
| -------- | -------- | -------- |
| hanoi.elf | 782 | 782 |
| fast_rsqrt.S | 149642 | 149642 |
### Program Size
| text | data | bss | dec | hex | filename |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 5058 | 304 | 4108 | 9470 | 24f | hanoi.elf |
| 26000 | 336 | 4096 | 30432 | 76e0 | fast_rsqrt.elf|
### ELF Header Inspection
#### hanoi.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 21880 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 18
Section header string table index: 17
```
#### fast_rsqrt.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 52632 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 18
Section header string table index: 17
```
### Instruction Frequency Histogram
#### hanoi.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 33.69% [408 ] ████████████████████████████████████████████████████████
2. lw 18.83% [228 ] ███████████████████████████████▎
3. sw 14.45% [175 ] ████████████████████████
4. jal 5.37% [65 ] ████████▉
5. ecall 2.64% [32 ] ████▍
6. sub 2.56% [31 ] ████▎
7. auipc 2.31% [28 ] ███▊
8. slli 2.31% [28 ] ███▊
9. bne 1.82% [22 ] ███
10. srli 1.73% [21 ] ██▉
11. jalr 1.65% [20 ] ██▋
12. andi 1.49% [18 ] ██▍
13. add 1.49% [18 ] ██▍
14. lui 1.24% [15 ] ██
15. sb 1.24% [15 ] ██
```
#### fast_rsqrt.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. xor 27.27% [1334 ] █████████████████████████████████████████████████████
2. add 14.46% [707 ] ████████████████████████████
3. slli 14.01% [685 ] ███████████████████████████▏
4. srli 13.51% [661 ] ██████████████████████████▎
5. addi 9.47% [463 ] ██████████████████▍
6. lw 7.69% [376 ] ██████████████▉
7. sw 4.97% [243 ] █████████▋
8. jal 2.09% [102 ] ████
```
## -O1 Optimization
### Cycles & Instructions count
| | Cycles| Instructions |
| -------- | -------- | -------- |
| hanoi.elf | 780 | 780 |
| fast_rsqrt.elf | 59856 | 59856 |
### Program Size
| text | data | bss | dec | hex | filename |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 2954| 304 |4100| 7358 | 1cbe | hanoi.elf |
| 3798| 304 | 4104 | 8206 | 200e | fast_rsqrt.elf |
### ELF Header Inspection
#### hanoi.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 20888 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 20
Section header string table index: 19
```
#### fast_rsqrt.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 23628 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 20
Section header string table index: 19
```
### Instruction Frequency Histogram
#### hanoi.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 41.92% [288 ] ████████████████████████████████████████████████████████
2. jal 7.28% [50 ] █████████▋
3. lw 7.28% [50 ] █████████▋
4. sw 6.55% [45 ] ████████▊
5. auipc 4.08% [28 ] █████▍
6. ecall 3.93% [27 ] █████▎
7. sub 3.06% [21 ] ████
8. slli 2.91% [20 ] ███▉
9. add 2.47% [17 ] ███▎
10. jalr 2.18% [15 ] ██▉
11. beq 2.18% [15 ] ██▉
12. bne 1.89% [13 ] ██▌
13. lui 1.75% [12 ] ██▎
14. srli 1.31% [9 ] █▊
15. sb 1.16% [8 ] █▌
16. xor 1.16% [8 ] █▌
17. blt 1.02% [7 ] █▎
```
#### fast_rsqrt.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 41.56% [362 ] █████████████████████████████████████████████████████
2. lw 8.38% [73 ] ██████████▋
3. sw 7.58% [66 ] █████████▋
4. jal 6.89% [60 ] ████████▊
5. ecall 3.90% [34 ] ████▉
6. slli 3.33% [29 ] ████▏
7. sub 3.33% [29 ] ████▏
8. auipc 3.21% [28 ] ████
9. jalr 2.18% [19 ] ██▊
10. add 2.18% [19 ] ██▊
11. beq 1.95% [17 ] ██▍
12. bne 1.95% [17 ] ██▍
13. lui 1.72% [15 ] ██▏
14. srli 1.72% [15 ] ██▏
15. sb 1.38% [12 ] █▊
16. or 1.15% [10 ] █▍
```
## -O2 Optimization
### Cycles & Instructions count
| | Cycles| Instructions |
| -------- | -------- | -------- |
| hanoi.elf | 780 | 780 |
| fast_rsqrt.elf | 62408 | 62408 |
### Program Size
| text | data | bss | dec | hex | filename |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 3066 | 304 | 4100| 7470 | 1d2e | hanoi.elf |
| 3970| 304 | 4108 | 8382 | 20be | fast_rsqrt.elf |
### ELF Header Inspection
#### hanoi.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 21984 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 21
Section header string table index: 20
```
#### fast_rsqrt.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 26276 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 21
Section header string table index: 20
```
### Instruction Frequency Histogram
#### hanoi.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 41.40% [296 ] ████████████████████████████████████████████████████████
2. jal 6.43% [46 ] ████████▋
3. lw 6.29% [45 ] ████████▌
4. sw 5.73% [41 ] ███████▊
5. auipc 3.92% [28 ] █████▎
6. ecall 3.78% [27 ] █████
7. add 3.22% [23 ] ████▎
8. sub 3.08% [22 ] ████▏
9. beq 2.80% [20 ] ███▊
10. slli 2.66% [19 ] ███▌
11. jalr 2.24% [16 ] ███
12. bne 2.10% [15 ] ██▊
13. sll 1.96% [14 ] ██▋
14. lui 1.68% [12 ] ██▎
15. srli 1.54% [11 ] ██
16. sb 1.12% [8 ] █▌
17. xor 1.12% [8 ] █▌
```
#### fast_rsqrt.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 40.59% [371 ] ████████████████████████████████████████████████████████
2. lw 7.33% [67 ] ██████████
3. sw 6.67% [61 ] █████████▏
4. jal 5.58% [51 ] ███████▋
5. ecall 3.72% [34 ] █████▏
6. slli 3.50% [32 ] ████▊
7. auipc 3.06% [28 ] ████▏
8. sub 3.06% [28 ] ████▏
9. add 2.74% [25 ] ███▊
10. beq 2.41% [22 ] ███▎
11. bne 2.30% [21 ] ███▏
12. srli 1.86% [17 ] ██▌
13. lui 1.75% [16 ] ██▍
14. jalr 1.75% [16 ] ██▍
15. sll 1.75% [16 ] ██▍
16. or 1.75% [16 ] ██▍
17. sb 1.31% [12 ] █▊
18. srl 1.20% [11 ] █▋
19. andi 1.09% [10 ] █▌
```
## -O3 Optimization
### Cycles & Instructions count
| | Cycles| Instructions |
| -------- | -------- | -------- |
| hanoi.elf | 780 | 780 |
| fast_rsqrt.elf | 62408 | 62408 |
### Program Size
| text | data | bss | dec | hex | filename |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 3202 | 304 | 4108 | 7614 | 1dbe | hanoi.elf |
| 4106 | 304 | 4100 | 8510 | 213e | fast_rsqrt.elf |
### ELF Header Inspection
#### hanoi.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 22424 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 21
Section header string table index: 20
```
#### fast_rsqrt.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 26708 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 21
Section header string table index: 20
```
### Instruction Frequency Histogram
#### hanoi.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 40.85% [306 ] ████████████████████████████████████████████████████████
2. jal 6.14% [46 ] ████████▍
3. lw 6.14% [46 ] ████████▍
4. sw 5.61% [42 ] ███████▋
5. auipc 3.74% [28 ] █████
6. ecall 3.60% [27 ] ████▉
7. add 3.34% [25 ] ████▌
8. sub 3.20% [24 ] ████▍
9. beq 3.07% [23 ] ████▏
10. slli 2.54% [19 ] ███▍
11. bne 2.40% [18 ] ███▎
12. jalr 2.27% [17 ] ███
13. sll 1.87% [14 ] ██▌
14. lui 1.60% [12 ] ██▏
15. sb 1.47% [11 ] ██
16. srli 1.47% [11 ] ██
17. andi 1.07% [8 ] █▍
18. xor 1.07% [8 ] █▍
```
#### fast_rsqrt.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 40.19% [381 ] ████████████████████████████████████████████████████████
2. lw 7.17% [68 ] █████████▉
3. sw 6.54% [62 ] █████████
4. jal 5.38% [51 ] ███████▍
5. ecall 3.59% [34 ] ████▉
6. slli 3.38% [32 ] ████▋
7. sub 3.16% [30 ] ████▍
8. auipc 2.95% [28 ] ████
9. add 2.85% [27 ] ███▉
10. beq 2.64% [25 ] ███▋
11. bne 2.53% [24 ] ███▌
12. jalr 1.79% [17 ] ██▍
13. srli 1.79% [17 ] ██▍
14. or 1.79% [17 ] ██▍
15. lui 1.69% [16 ] ██▎
16. sll 1.69% [16 ] ██▎
17. sb 1.58% [15 ] ██▏
18. andi 1.27% [12 ] █▊
19. srl 1.16% [11 ] █▌
20. bgeu 1.05% [10 ] █▍
```
## -Ofast Optimization
### Cycles & Instructions count
| | Cycles| Instructions |
| -------- | -------- | -------- |
| hanoi.elf | 780 | 780 |
| fast_rsqrt.S | 62408 | 62408 |
### Program Size
| text | data | bss | dec | hex | filename |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 3202 | 304 | 4108 | 7614 | 1dbe | hanoi.elf |
| 4108 | 304 | 4100 | 8512 | 2140 | fast_rsqrt.elf |
### ELF Header Inspection
#### hanoi.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 22424 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 21
Section header string table index: 20
```
#### fast_rsqrt.elf
```c
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10000
Start of program headers: 52 (bytes into file)
Start of section headers: 26676 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 20
Section header string table index: 19
```
### Instruction Frequency Histogram
#### hanoi.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 40.85% [306 ] ████████████████████████████████████████████████████████
2. jal 6.14% [46 ] ████████▍
3. lw 6.14% [46 ] ████████▍
4. sw 5.61% [42 ] ███████▋
5. auipc 3.74% [28 ] █████
6. ecall 3.60% [27 ] ████▉
7. add 3.34% [25 ] ████▌
8. sub 3.20% [24 ] ████▍
9. beq 3.07% [23 ] ████▏
10. slli 2.54% [19 ] ███▍
11. bne 2.40% [18 ] ███▎
12. jalr 2.27% [17 ] ███
13. sll 1.87% [14 ] ██▌
14. lui 1.60% [12 ] ██▏
15. sb 1.47% [11 ] ██
16. srli 1.47% [11 ] ██
17. andi 1.07% [8 ] █▍
18. xor 1.07% [8 ] █▍
```
#### fast_rsqrt.elf
```c
+---------------------------------------------+
| RV32 Target Instruction Frequency Histogram |
+---------------------------------------------+
1. addi 40.19% [381 ] ████████████████████████████████████████████████████████
2. lw 7.17% [68 ] █████████▉
3. sw 6.54% [62 ] █████████
4. jal 5.38% [51 ] ███████▍
5. ecall 3.59% [34 ] ████▉
6. slli 3.38% [32 ] ████▋
7. sub 3.16% [30 ] ████▍
8. auipc 2.95% [28 ] ████
9. add 2.85% [27 ] ███▉
10. beq 2.64% [25 ] ███▋
11. bne 2.53% [24 ] ███▌
12. jalr 1.79% [17 ] ██▍
13. srli 1.79% [17 ] ██▍
14. or 1.79% [17 ] ██▍
15. lui 1.69% [16 ] ██▎
16. sll 1.69% [16 ] ██▎
17. sb 1.58% [15 ] ██▏
18. andi 1.27% [12 ] █▊
19. srl 1.16% [11 ] █▌
20. bgeu 1.05% [10 ] █▍
```
## Observation & Analysis
### 1. General Overview
In this experiment, we evaluated the compiler optimization effects on two test programs — `hanoi.S` (assembly-based) and `fast_rsqrt.c` (C-based).
Each program was compiled and analyzed under five optimization levels: -O0, -O1, -O2, -O3, and -Ofast.
As optimization levels increase, redundant computations and memory accesses are eliminated, code size decreases, and performance improves.
**However, beyond -O2, both performance and code size stabilize, showing that -O3 and -Ofast bring minimal further benefit.**
### 2. Cycles and Instructions Count
| Optimization | hanoi.elf (Cycles / Instructions) | fast_rsqrt.elf (Cycles / Instructions) |
| :----------- | :-------------------------------: | :------------------------------------: |
| -O0 | 782 / 782 | 149,642 / 149,642 |
| -O1 | 780 / 780 | 59,856 / 59,856 |
| -O2 | 780 / 780 | 62,408 / 62,408 |
| -O3 | 780 / 780 | 62,408 / 62,408 |
| -Ofast | 780 / 780 | 62,408 / 62,408 |
#### Analysis:
* For `hanoi.elf`, cycles and instructions **remain almost constant (782 → 780) across all levels**, indicating the assembly code is already manually optimized and compiler optimization has minimal effect.
* For `fast_rsqrt.elf`, the improvement is dramatic: from 149k to 59k cycles **(a 2.5× performance gain)** when moving from -O0 to -O1.
* Starting from -O2, results plateau — all higher levels (-O2, -O3, -Ofast) perform identically, suggesting the compiler reached its maximum effective optimization.
### 3. Program Size Comparison
| Optimization | hanoi.elf (text / data / bss / total bytes) | fast_rsqrt.elf (text / data / bss / total bytes) |
| :----------- | :-----------------------------------------: | :----------------------------------------------: |
| -O0 | 5058 / 304 / 4108 / 9470 | 26000 / 336 / 4096 / 30432 |
| -O1 | 2954 / 304 / 4100 / 7358 | 3798 / 304 / 4104 / 8206 |
| -O2 | 3066 / 304 / 4100 / 7470 | 3970 / 304 / 4108 / 8382 |
| -O3 | 3202 / 304 / 4108 / 7614 | 4106 / 304 / 4100 / 8510 |
| -Ofast | 3202 / 304 / 4108 / 7614 | 4108 / 304 / 4100 / 8512 |
#### Analysis:
* The unoptimized -O0 binaries are the largest, showing no compression or inlining.
* Under -O1, the `fast_rsqrt` text section **shrinks from 26,000 → 3,798 bytes (≈85% reduction)**, demonstrating aggressive dead-code elimination and loop optimization.
* From -O2 onward, the program size remains stable, indicating the compiler has **reached an optimization saturation point**.
* -O3 and -Ofast show nearly identical binary sizes; their changes mainly involve code layout and instruction alignment rather than size reduction.
### 4. ELF Header Inspection
All ELF files share the same general properties:
* Architecture: RISC-V (RV32)
* Endian: Little Endian
* File Type: EXEC (Executable file)
* OS ABI: UNIX - System V
Notable differences:
* **The start of section headers** gradually shifts closer to the program header as optimization levels increase — **indicating smaller overall binaries**.
* **The number of section headers** increases slightly (from 18 at -O0 to 21 at -O2/-O3), reflecting **additional sections for symbol tables and inlined code during higher optimizations**.
### 5. Instruction Frequency Histogram Analysis
#### hanoi.elf
| Optimization | Key Instruction Observations |
| :----------- | :------------------------------------------------------------------------------------------------------------------------------------------------- |
| -O0 | Dominated by `addi`, `lw`, and `sw` (>65%), showing many redundant memory accesses and jumps. |
| -O1 | Instruction count decreases significantly; `addi` rises to ~42%, meaning more immediate arithmetic and fewer load/store operations. |
| -O2 ~ -Ofast | Very stable distribution — `addi` ≈ 40%, `jal` ≈ 6%, `lw/sw` ≈ 6% — indicating the compiler has reached optimal scheduling and minimal redundancy. |
#### fast_rsqrt.elf
| Optimization | Key Instruction Observations |
| :----------------- | :--------------------------------------------------------------------------------------------------------------- |
| -O0 | Heavy use of `xor`, `slli`, and `srli`, indicating unoptimized bit manipulation. |
| -O1 | `addi` becomes dominant (≈41%), with fewer `lw/sw`, showing constant folding and loop simplification. |
| -O2 / -O3 / -Ofast | Almost identical profiles — `addi` ≈ 40%, `lw` ≈ 7%, `sw` ≈ 6%, `jal` ≈ 5% — confirming optimization saturation. |
### Summary
Compiler optimizations significantly enhance performance for C-based programs like `fast_rsqrt.c`, with -O1 already providing major speed-ups and -O2 achieving the best overall efficiency.
For manually tuned assembly programs such as `hanoi.S`, optimization has minimal impact since the code is already close to optimal.
## Reference
* [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel)
* [Assignment1: RISC-V Assembly and Instruction Pipeline](https://hackmd.io/@Shaoen/arch2025-homework1)
* [Assignment2: Complete Applications](https://hackmd.io/@sysprog/2025-arch-homework2)