# Assignment2: Complete Applications > contributed by < [Shaoen-Lin](https://github.com/Shaoen-Lin) > ## Prerequisites ### rv32emu [rv32emu](https://github.com/sysprog21/rv32emu) is a RISC-V RV32 emulator that faithfully implements the RISC-V instruction set architecture. It serves as both: * An educational tool for understanding modern RISC processor operations * A starting point for customization and experimentation Developed at NCKU. See VMIL'24 paper: "[Accelerate RISC-V Instruction Set Simulation by Tiered JIT Compilation](https://dl.acm.org/doi/10.1145/3689490.3690399)." ### Operating System Ubuntu Linux 24.04 LTS or later (virtualization environments like VirtualBox are supported) and macOS 15+. After we have those prerequisites, we can follow [Lab2](https://hackmd.io/@sysprog/Sko2Ja5pel) to perform a quick environmental test and verify that your RISC-V toolchain and emulator are working correctly. ## Environment Setup | Item | Configuration | | ----------- | -------------------------------------- | | Emulator | rv32emu | | Toolchain | riscv-none-elf-gcc @ xPack GNU RISC-V | | ISA | RV32I + Zicsr | | Build Flags | `-march=rv32i_zicsr -nostdlib -static` | | Entry Point | `main` function in main.c | | Syscall | 64 (write) for stdout printing only | To achieve executing the code on the system, all components must comply with the following constraints: * Only **RV32I/Zicsr** instructions are allowed, no **RV32M** (multiply and divide) or **RV32F** (single-precision floating point) extension instructions are allowed. * **No system calls** and **no FPU** (Floating-Point Unit) are permitted. * All I/O operations must be performed through simulate system call - `ecall 64` (simulating write to stdout). ## What we should do before simulation? Before modifying or extending this system, it is important to understand the roles of each part — Makefile, perfcounter, and so on... included how they interact with the RISC-V bare-metal environment. ### 1. Makefile The Makefile defines how to compile and link all source files into the final RISC-V ELF executable. #### Key points to modify : * `CFLAGS = -g -march=rv32i_zicsr` - This flag specifies the target architecture (RV32I + Zicsr) and includes debugging information (-g), which enables symbol and memory inspection in emulators such as rv32emu. We can also control compiler optimization via the OPT variable: ```c make OPT=1 # equivalent to -O1 make OPT=2 # equivalent to -O2 make OPT=3 # equivalent to -O3 make OPT=fast # equivalent to -Ofast ``` The flag -O$(OPT) is appended to CFLAGS to dynamically select the optimization level. * `OBJS = start.o main.o ...` — This line lists all object files (.o) that will be linked together to generate test.elf. To include new assembly or C files in the final executable, simply add their corresponding .o filenames here. * `LDFLAGS = -T $(LINKER_SCRIPT) -nostartfiles -nostdlib -nodefaultlibs` — These flags are required for running in a bare-metal environment (no OS). * `-nostartfiles` skips the default startup code (crt0.o) since we already define our own _start in start.S. * `-nostdlib` and `-nodefaultlibs` disable linking to the standard and built-in GCC libraries (like libc, libgcc), because we provide our own ECALL-based I/O and runtime. Together, they ensure the program contains only our custom code and can run directly on rv32emu without any OS dependency. * `$(CC) $(LDFLAGS) -o $@ $(OBJS)` — Using $(CC) instead of $(LD) lets the compiler handle linking with the same architecture and optimization settings (-O1, -O2, etc.). gcc automatically passes correct ABI flags and links both C and assembly objects consistently, avoiding manual linker setup errors. This ensures unified optimization across all modules, though it does not directly speed up assembly code itself. * ```%.o: %.S $(CC) $(CFLAGS) -x assembler-with-cpp $< -o $@ -c``` — This option allows .S files to use the C preprocessor (#define, #ifdef, etc.) before assembling. It helps assembly code adapt to different architectures or optimization levels but does not perform real instruction-level optimization. #### Modified `Makefile` ```c include ../../../mk/toolchain.mk ARCH = -march=rv32izicsr LINKER_SCRIPT = linker.ld EMU ?= ../../../build/rv32emu # ------------------------------------------------- # Optimization Level # ------------------------------------------------- OPT ?= 0 AFLAGS = -g $(ARCH) CFLAGS = -g -O$(OPT) -march=rv32i_zicsr LDFLAGS = -T $(LINKER_SCRIPT) -nostartfiles -nostdlib -nodefaultlibs EXEC = test.elf CC = $(CROSS_COMPILE)gcc AS = $(CROSS_COMPILE)as LD = $(CROSS_COMPILE)ld OBJDUMP = $(CROSS_COMPILE)objdump OBJS = start.o main.o perfcounter.o single_number_3.o hanoi.o fast_rsqrt.o .PHONY: all run dump clean all: $(EXEC) $(EXEC): $(OBJS) $(LINKER_SCRIPT) $(CC) $(LDFLAGS) -o $@ $(OBJS) %.o: %.S $(CC) $(CFLAGS) -x assembler-with-cpp $< -o $@ -c %.o: %.c $(CC) $(CFLAGS) $< -o $@ -c run: $(EXEC) @test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1) @grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1) @grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1) $(EMU) $< dump: $(EXEC) $(OBJDUMP) -Ds $< | less clean: rm -f $(EXEC) $(OBJS) ``` ### 2. perfcounter The `perfcounter.S` module provides two low-level RISC-V assembly functions that read hardware performance counters directly from **CSR (Control and Status Registers)**. These counters measure how many CPU cycles have elapsed and how many instructions have been executed since system startup. This program implements two performance counter reading functions used for measurement: * get_cycles: Reads the current number of CPU clock cycles executed (cycle counter). * get_instret: Reads the current number of instructions that have been completed (instructions retired). Since this is a 32-bit RISC-V architecture, the hardware provides 64-bit counters, but they can only be accessed through two 32-bit registers (cycle/cycleh, instret/instreth). Therefore, care must be taken to **avoid overflow** issues during reading. > For example, when we execute `csrr a1, cycleh` while the counter is at cycle = 0x00000000_11111111, we first get cycleh = 0x00000000. > > However, if the counter increments right before the next instruction — so that the value becomes cycle = 0x00000001_00000000 — then executing `csrr a0, cycle` will return cycle = 0x00000000. > >When we combine these values (cycleh = 0x00000000 and cycle = 0x00000000), the final 64-bit result becomes 0x00000000_00000000, which is completely incorrect and causes a serious performance counter error. #### Original `perfcounter.s` We use the original [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter) to implement the CSR. ```asm # Performance counter functions for RISC-V # Read 64-bit cycle counter .text .globl get_cycles .align 2 get_cycles: csrr a1, cycleh csrr a0, cycle csrr a2, cycleh bne a1, a2, get_cycles ret .size get_cycles,.-get_cycles # Read 64-bit instruction retired counter .globl get_instret .align 2 get_instret: csrr a1, instreth csrr a0, instret csrr a2, instreth bne a1, a2, get_instret ret .size get_instret,.-get_instret ``` ### 3. RISC-V Instructions/Registers Tools When performing performance analysis on RISC-V ELF programs, it is essential to first understand the distribution of instructions and register usage at the assembly level. To accomplish this, we use the RISC-V Instructions/Registers Tool, which performs a static analysis on compiled ELF files. Before running the analysis, the tool must be built as part of the [rv32emu](https://github.com/sysprog21/rv32emu?tab=readme-ov-file) project. **Step 1. Navigate to the rv32emu directory** ```c cd ../rv32emu ``` **Step 2. Build the tool** ```c make tool ``` **Step 3. Navigate to your ELF directory** ``` cd ../[your_folder_containing_elf_file] ``` **Step.2 Run the analysis** ```c ../rv32emu/build/rv_histogram [-r] [target_program_path] ``` ### 4. Linker #### Key points to modify : The `.rodata` (read-only data) section is added to the linker script to store constant data such as **string literals**, **lookup tables**, and fixed constants defined with **const**. Without this section, the linker may incorrectly place read-only constants into the `.data` segment, **causing unnecessary memory writes or read-only violations in bare-metal programs**. By explicitly defining `.rodata`, we ensure all constants are stored in ROM/flash and remain immutable during program execution. #### Modified `linker.ld` ```c SECTIONS { . = 0x10000; .text : { *(.text._start) *(.text) } .rodata : { *(.rodata*) } .data : { *(.data) } .bss : { __bss_start = .; *(.bss) __bss_end = .; } .stack (NOLOAD) : { . = ALIGN(16); . += 4096; __stack_top = .; } } ``` ### 5. Main This file implements the bare-metal testing framework that executes various **RISC-V** programs under the system emulator **rv32emu**. It provides basic I/O routines, performance counters, and test cases for multiple problem modules such as *Single Number III*, *Hanoi*, and *Fast Reciprocal Square Root (rsqrt)*. #### Key points to modify : * `extern` declarations — To use a function that is defined in another file, we must declare it with the extern keyword. This allows the linker to connect the function reference in main.c with its actual implementation in another source file (e.g. assembly or C). * `print_q16()` — This custom function was added to convert and print **Q0.16 fixed-point numbers** into a human-readable decimal format. It is mainly used in the `Quiz3_C` test to display the results of the reciprocal square root (rsqrt) function. * `printstr()` macro updates — The printstr() macro now includes "memory" and "cc" in the clobber list of the inline assembly. This prevents the compiler from reordering or optimizing away memory accesses during **ECALL** operations. * `main()` function — We can add our own function calls in the `main()` section to execute and test newly implemented code modules. Each test block can measure performance using `get_cycles()` and `get_instret()` to compare execution efficiency. #### Modified `main.c` ```c #include <stdbool.h> #include <stdint.h> #include <string.h> #define printstr(ptr, length) \ do { \ asm volatile( \ "add a7, x0, 64\n\t" \ "add a0, x0, 1\n\t" /* stdout */ \ "mv a1, %0\n\t" /* buffer */ \ "mv a2, %1\n\t" /* len */ \ "ecall\n\t" \ : \ : "r"(ptr), "r"(length) \ : "a0", "a1", "a2", "a7", "memory", "cc");\ } while (0) #define TEST_OUTPUT(msg, length) printstr(msg, length) #define TEST_LOGGER(msg) \ { \ static const char _msg[] = msg; \ TEST_OUTPUT(_msg, sizeof(_msg) - 1); \ } extern uint64_t get_cycles(void); extern uint64_t get_instret(void); extern void single_number_3(void); extern void hanoi(void); extern uint32_t fast_rsqrt(uint32_t x); /* Bare metal memcpy implementation */ void *memcpy(void *dest, const void *src, size_t n) { uint8_t *d = (uint8_t *) dest; const uint8_t *s = (const uint8_t *) src; while (n--) *d++ = *s++; return dest; } /* Software division for RV32I (no M extension) */ static unsigned long udiv(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long quotient = 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; quotient |= (1UL << i); } } return quotient; } static unsigned long umod(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; } } return remainder; } /* Software multiplication for RV32I (no M extension) */ static uint32_t umul(uint32_t a, uint32_t b) { uint32_t result = 0; while (b) { if (b & 1) result += a; a <<= 1; b >>= 1; } return result; } uint32_t __mulsi3(uint32_t a, uint32_t b) { return umul(a, b); } /* Simple integer to hex string conversion */ static void print_hex(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { int digit = val & 0xf; *p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10); p--; val >>= 4; } } p++; printstr(p, (buf + sizeof(buf) - p)); } /* Simple integer to decimal string conversion */ static void print_dec(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { *p = '0' + umod(val, 10); p--; val = udiv(val, 10); } } p++; printstr(p, (buf + sizeof(buf) - p)); } /* ============= ADDED print_q16 function ============= */ /* Print fixed-point Q0.16 number in human-readable decimal format */ static void print_q16(uint32_t val) { unsigned long int_part = val >> 16; // Extract integer part (upper 16 bits) // Compute fractional part using 64-bit multiplication for precision uint64_t tmp = (uint64_t)(val & 0xFFFF) * 1000000ULL; // Scale to 6 decimal digits unsigned long frac_part = (unsigned long)(tmp >> 16); // Convert back to 32-bit range (0..999999) // Print integer part char buf_int[20]; char *p = buf_int + sizeof(buf_int); *--p = '\0'; unsigned long t = int_part; if (t == 0) *--p = '0'; else { while (t) { *--p = '0' + umod(t, 10); t = udiv(t, 10); } // Convert integer digits } printstr(p, (unsigned)(buf_int + sizeof(buf_int) - 1 - p)); printstr(".", 1); // Print decimal point // Print fractional part (always 6 digits) char buf_frac[6]; for (int i = 5; i >= 0; --i) { buf_frac[i] = '0' + umod(frac_part, 10); frac_part = udiv(frac_part, 10); } printstr(buf_frac, 6); printstr("\n", 1); // New line } int main(void) { uint64_t start_cycles, end_cycles, cycles_elapsed; uint64_t start_instret, end_instret, instret_elapsed; TEST_LOGGER("\n=== LeetCode 260. Single Number III Tests ===\n\n"); TEST_LOGGER("Test 6: Single Number III\n"); start_cycles = get_cycles(); start_instret = get_instret(); single_number_3(); end_cycles = get_cycles(); end_instret = get_instret(); cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret; TEST_LOGGER(" Cycles: "); print_dec((unsigned long) cycles_elapsed); TEST_LOGGER(" Instructions: "); print_dec((unsigned long) instret_elapsed); TEST_LOGGER("\n"); /* TEST_LOGGER("\n=== Quiz2_A Tests ===\n\n"); TEST_LOGGER("Test 7: Quiz2_A\n"); start_cycles = get_cycles(); start_instret = get_instret(); hanoi(); end_cycles = get_cycles(); end_instret = get_instret(); cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret; TEST_LOGGER(" Cycles: "); print_dec((unsigned long) cycles_elapsed); TEST_LOGGER(" Instructions: "); print_dec((unsigned long) instret_elapsed); TEST_LOGGER("\n"); */ /* TEST_LOGGER("\n=== Quiz3_C Tests === \n\n"); TEST_LOGGER("Test 8: Quiz3_C\n"); start_cycles = get_cycles(); start_instret = get_instret(); uint32_t test[9] = {1, 2, 3, 4, 10, 100, 1000, 10000, 0xFFFFFFFF}; for (int i = 0; i < 9; i++) { TEST_LOGGER("Test case "); print_dec((unsigned long) (i+1)); uint32_t inv = fast_rsqrt(test[i]); TEST_LOGGER(" Input: "); print_dec((unsigned long) test[i]); TEST_LOGGER(" rsqrt: "); print_q16(inv); } end_cycles = get_cycles(); end_instret = get_instret(); cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret; TEST_LOGGER(" Cycles: "); print_dec((unsigned long) cycles_elapsed); TEST_LOGGER(" Instructions: "); print_dec((unsigned long) instret_elapsed); */ return 0; } ``` ## RISC-V Cross-Compilation and Execution Workflow * **Step 0 - Navigate to the working directory** > Make sure the function you want to execute is properly called inside `main.c`. * **Step 1 - Navigate to the working directory** > cd $HOME / riscv-none-elf-gcc / rv32emu / tests / system / hw2 * **Step 2 - Build and execute the ELF file** > make clear > make [ OPT= x ] (Input the ... you want) > make run * **Step 3 - Check ELF file size** > riscv-none-elf-size test.elf * **Step 4 - Check the ELF header** > riscv-none-elf-readelf -h test.elf * **Step 5 - Disassemble the ELF file** > riscv-none-elf-objdump -d test > test.txt * **Step 6 - Generate instruction/register histogram** > .../rv32emu/build/rv_histogram test.elf ## Simulaion test : LeetCode 260. Single Number III I choose one complete program (with test suite) from < [Shaoen-Lin](https://github.com/Shaoen-Lin) > and make it run in a bare-metal environment using rv32emu’s system emulation. For successfully executing the code, We modified few places in the original source code. ### Code Modification Summary #### 1. Removed OS dependencies Original Linux syscalls (ecall 4, ecall 1, ecall 10) were replaced with ecall 64 for simulated stdout output. #### 2. Auto-calculated string lengths Used `.set msg_xxx_size, . - msg_xxx` to automatically compute string sizes for `SYS_write`. #### 3. Custom print POS/NEG decimal interger function Implemented a `print_int` function using pure RV32I instructions, simulating division by 10 via repeated subtraction (no div/rem). The function checks the sign bit to detect negative numbers, converts them to positive for digit extraction, and prepends a '-' before printing the final result. ```asm # ===================================================== # print_int(a0) # Convert integer a0 to ASCII and print (pure RV32I, no div/rem) # ===================================================== .global print_int print_int: addi sp, sp, -92 sw ra, 88(sp) sw t0, 84(sp) sw t1, 80(sp) sw t2, 76(sp) sw t3, 72(sp) sw t4, 68(sp) sw t5, 64(sp) mv t0, a0 addi t1, sp, 64 li t2, 10 li t5, 0 # === handle negative numbers === bltz t0, make_positive j convert_loop_i make_positive: neg t0, t0 li t5, 1 # === simulate divide by 10 === convert_loop_i: # simulate quotient = t0 / 10, remainder = t0 - 10*quotient mv t3, x0 sub_loop: blt t0, t2, div_done sub t0, t0, t2 addi t3, t3, 1 j sub_loop div_done: mv t4, t0 mv t0, t3 addi t4, t4, 48 addi t1, t1, -1 sb t4, 0(t1) bnez t0, convert_loop_i # if negative, prepend '-' beqz t5, print_number_i addi t1, t1, -1 li t3, '-' sb t3, 0(t1) # === print the result === print_number_i: li a0, 1 mv a1, t1 add a2, sp, 64 sub a2, a2, t1 li a7, 64 ecall lw ra, 88(sp) lw t0, 84(sp) lw t1, 80(sp) lw t2, 76(sp) lw t3, 72(sp) lw t4, 68(sp) lw t5, 64(sp) addi sp, sp, 92 ret ``` ### Modified `single_number_3.S` ```asm .data nums1: .word 2, 2, 3, 3, 4, 4, 0, 1, 100, 100, 99, 99 nums1_size: .word 12 ans1: .word 1, 0 nums2: .word 101, 17, 102, 102, -98, 0, 1, 101, 0, 1, 99, -98, 100, 17 nums2_size: .word 14 ans2: .word 100, 99 nums3: .word -2, -2, -2, 2, 2, 2, -6, -9, -2, 2, 2, -5, 2, -6, -2, -10, -11, -10, -11, -2, -6, -9 nums3_size: .word 22 ans3: .word -5, -6 test_cases: .word nums1, nums2, nums3 test_sizes: .word nums1_size, nums2_size, nums3_size test_ans: .word ans1, ans2, ans3 ressize: .word 0 result: .word 0, 0 msg_case: .string "Test case " .set msg_case_size, . - msg_case # auto calculate length msg_input: .string " Input: " .set msg_input_size, . - msg_input msg_output: .string " Output: " .set msg_output_size, . - msg_output msg_expect: .string " Expect: " .set msg_expect_size, . - msg_expect msg_ok: .string " PASSED\n" .set msg_ok_size, . - msg_ok msg_wrong: .string " FAILED\n" .set msg_wrong_size, . - msg_wrong space: .string " " .set space_size, . - space nl: .string "\n" .set nl_size, . - nl .text .global single_number_3 single_number_3: addi sp, sp, -36 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) sw s4, 20(sp) sw s5, 24(sp) sw s6, 28(sp) sw s7, 32(sp) # initialize iterators for case tables la s5, test_cases la s6, test_sizes la s7, test_ans li s3, 3 li s4, 1 loop_cases: beqz s3, end_main # load current pointers lw s0, 0(s5) lw s1, 0(s6) lw s2, 0(s7) # print "Test case i" # === print "Test case " === li a0, 1 la a1, msg_case la a2, msg_case_size li a7, 64 ecall # === print case number s4 === mv a0, s4 jal ra, print_int # === newline === li a0, 1 la a1, nl la a2, nl_size li a7, 64 ecall # print input array li a0, 1 la a1, msg_input la a2, msg_input_size li a7, 64 ecall lw t3, 0(s1) li t4, 0 print_input_loop: bge t4, t3, print_input_done slli t5, t4, 2 add t6, s0, t5 lw a0, 0(t6) jal ra, print_int li a0, 1 la a1, space la a2, space_size li a7, 64 ecall addi t4, t4, 1 j print_input_loop print_input_done: li a0, 1 la a1, nl la a2, nl_size li a7, 64 ecall # save registers before call addi sp, sp, -24 sw ra, 20(sp) sw s0, 16(sp) sw s1, 12(sp) sw s2, 8(sp) sw s3, 4(sp) sw s4, 0(sp) # call singleNumber(a0=nums, a1=size, a2=&ressize) mv a0, s0 lw a1, 0(s1) la a2, ressize jal singleNumber mv t6, a0 # restore registers lw ra, 20(sp) lw s0, 16(sp) lw s1, 12(sp) lw s2, 8(sp) lw s3, 4(sp) lw s4, 0(sp) addi sp, sp, 24 # print output li a0, 1 la a1, msg_output la a2, msg_output_size li a7, 64 ecall lw t0, 0(t6) mv a0, t0 jal ra, print_int li a0, 1 la a1, space la a2, space_size li a7, 64 ecall lw t1, 4(t6) mv a0, t1 jal ra, print_int li a0, 1 la a1, nl la a2, nl_size li a7, 64 ecall # print expected result li a0, 1 la a1, msg_expect la a2, msg_expect_size li a7, 64 ecall lw t2, 0(s2) mv a0, t2 jal ra, print_int li a0, 1 la a1, space la a2, space_size li a7, 64 ecall lw t3, 4(s2) mv a0, t3 jal ra, print_int li a0, 1 la a1, nl la a2, nl_size li a7, 64 ecall # compare results lw t4, 0(t6) lw t5, 4(t6) beq t4, t2, check_second j print_wrong check_second: beq t5, t3, print_ok j print_wrong print_ok: li a0, 1 la a1, msg_ok la a2, msg_ok_size li a7, 64 ecall j next_case print_wrong: li a0, 1 la a1, msg_wrong la a2, msg_wrong_size li a7, 64 ecall next_case: addi s5, s5, 4 addi s6, s6, 4 addi s7, s7, 4 addi s4, s4, 1 addi s3, s3, -1 j loop_cases end_main: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) lw s6, 28(sp) lw s7, 32(sp) addi sp, sp, 36 ret # ===================================================== # print_int(a0) # Convert integer a0 to ASCII and print (pure RV32I, no div/rem) # ===================================================== .global print_int print_int: addi sp, sp, -92 sw ra, 88(sp) sw t0, 84(sp) sw t1, 80(sp) sw t2, 76(sp) sw t3, 72(sp) sw t4, 68(sp) sw t5, 64(sp) mv t0, a0 addi t1, sp, 64 li t2, 10 li t5, 0 # === handle negative numbers === bltz t0, make_positive j convert_loop_i make_positive: neg t0, t0 li t5, 1 # === simulate divide by 10 === convert_loop_i: # simulate quotient = t0 / 10, remainder = t0 - 10*quotient mv t3, x0 sub_loop: blt t0, t2, div_done sub t0, t0, t2 addi t3, t3, 1 j sub_loop div_done: mv t4, t0 mv t0, t3 addi t4, t4, 48 addi t1, t1, -1 sb t4, 0(t1) bnez t0, convert_loop_i # Negative situation beqz t5, print_number_i addi t1, t1, -1 li t3, '-' sb t3, 0(t1) # === print the result === print_number_i: li a0, 1 mv a1, t1 add a2, sp, 64 sub a2, a2, t1 li a7, 64 ecall lw ra, 88(sp) lw t0, 84(sp) lw t1, 80(sp) lw t2, 76(sp) lw t3, 72(sp) lw t4, 68(sp) lw t5, 64(sp) addi sp, sp, 92 ret # ================================ # Count Leading Zeros (clz) # a0 = input, return a0 = clz(x) # ================================ clz: li s0, 32 li s1, 16 clz_loop: srl t0, a0, s1 bnez t0, clz_if srli s1, s1, 1 j clz_check clz_if: sub s0, s0, s1 mv a0, t0 clz_check: bnez s1, clz_loop sub a0, s0, a0 ret # ========================================== # singleNumber # a0 = nums ptr, a1 = size, a2 = &returnSize # return a0 = &result # ========================================== singleNumber: li s2, 0 li t0, 0 # first loop: xor all numbers for1_cond: blt t0, a1, for1_body li t1, 31 addi sp, sp, -8 sw ra, 0(sp) sw a0, 4(sp) mv a0, s2 jal ra, clz sub s3, t1, a0 lw a0, 4(sp) lw ra, 0(sp) addi sp, sp, 8 li t1, 1 sll s4, t1, s3 # mask = 1U << shift li t1, 0 li t2, 0 li t0, 0 j for2_cond for1_body: slli t1, t0, 2 add t1, a0, t1 lw t1, 0(t1) xor s2, s2, t1 addi t0, t0, 1 j for1_cond # second loop: split by mask bit for2_cond: blt t0, a1, for2_body la a0, result sw t1, 0(a0) sw t2, 4(a0) li t3, 2 sw t3, 0(a2) ret for2_body: slli t3, t0, 2 add t3, a0, t3 lw t3, 0(t3) and t4, t3, s4 bnez t4, for2_if xor t2, t2, t3 j for2_next for2_if: xor t1, t1, t3 for2_next: addi t0, t0, 1 j for2_cond ``` ### Execution Results under -O0 Optimization Level ```c Test: Single Number III Test case 1 Input: 2 2 3 3 4 4 0 1 100 100 99 99 Output: 1 0 Expect: 1 0 PASSED Test case 2 Input: 101 17 102 102 -98 0 1 101 0 1 99 -98 100 17 Output: 100 99 Expect: 100 99 PASSED Test case 3 Input: -2 -2 -2 2 2 2 -6 -9 -2 2 2 -5 2 -6 -2 -10 -11 -10 -11 -2 -6 -9 Output: -5 -6 Expect: -5 -6 PASSED Cycles: 5399 Instructions: 5399 ``` In the followings, we adapt Problem A from Quiz 2 and Problem C from Quiz 3, and make them run in a bare-metal environment using rv32emu’s system emulation for measuring and refining the implementation based on profiling results. ## Problem `A` in Quiz2 ### Code Modification Summary #### 1. Added `print_char` Function Implemented a `print_char` subroutine using pure RV32I instructions and `ecall 64` to print a single ASCII character. The function temporarily stores the character on the stack, sets up the buffer pointer and length, and calls the write syscall (fd=1 for stdout). It restores the stack pointer after output, ensuring safe nested calls within `hanoi.S`. ```asm # ====================================================== # print_char(x10) # Print a single character to stdout # Argument: # x10 = ASCII value of the character (e.g., 65 = 'A') # ====================================================== .global print_char print_char: addi x2, x2, -4 sw x10, 0(x2) li x10, 1 add x11, x2, x0 li x12, 1 li x17, 64 ecall addi x2, x2, 4 ret ``` ### Modified `hanoi.S` ```asm .text .globl hanoi hanoi: addi x2, x2, -4 sw x1, 0(x2) addi x2, x2, -32 sw x8, 0(x2) sw x9, 4(x2) sw x18, 8(x2) sw x19, 12(x2) sw x20, 16(x2) li x5, 0x15 sw x5, 20(x2) sw x5, 24(x2) sw x5, 28(x2) # Fix disk positions (BLANK 1–3: neutralize x5 effect) sw x0, 20(x2) sw x0, 24(x2) sw x0, 28(x2) addi x8, x0, 1 game_loop: addi x5, x0, 8 beq x8, x5, finish_game srli x5, x8, 1 xor x6, x8, x5 addi x7, x8, -1 srli x28, x7, 1 xor x7, x7, x28 xor x5, x6, x7 addi x9, x0, 0 andi x6, x5, 1 bne x6, x0, disk_found addi x9, x0, 1 andi x6, x5, 2 bne x6, x0, disk_found addi x9, x0, 2 disk_found: andi x30, x5, 5 addi x31, x0, 5 beq x30, x31, pattern_match jal x0, continue_move pattern_match: continue_move: slli x5, x9, 2 addi x5, x5, 20 add x5, x2, x5 lw x18, 0(x5) bne x9, x0, handle_large addi x19, x18, 2 addi x6, x0, 3 blt x19, x6, display_move sub x19, x19, x6 jal x0, display_move handle_large: lw x6, 20(x2) addi x19, x0, 3 sub x19, x19, x18 sub x19, x19, x6 display_move: la x20, obdata add x5, x20, x18 lbu x11, 0(x5) li x6, 0x6F xor x11, x11, x6 addi x11, x11, -0x12 addi x27, x11, 0 add x7, x20, x19 lbu x12, 0(x7) xor x12, x12, x6 addi x12, x12, -0x12 addi x26, x12, 0 li x10, 1 la x11, str1 la x12, str1_size li x17, 64 ecall addi x10, x9, 49 addi x2, x2, -1 sb x10, 0(x2) li x10, 1 mv x11, x2 li x12, 1 li x17, 64 ecall addi x2, x2, 1 li x10, 1 la x11, str2 la x12, str2_size li x17, 64 ecall addi x10, x27, 0 jal x1, print_char li x10, 1 la x11, str3 la x12, str3_size li x17, 64 ecall addi x10, x26, 0 jal x1, print_char addi x5, x0, 10 addi x2, x2, -1 sb x5, 0(x2) li x10, 1 mv x11, x2 li x12, 1 li x17, 64 ecall addi x2, x2, 1 slli x5, x9, 2 addi x5, x5, 20 add x5, x2, x5 sw x19, 0(x5) addi x8, x8, 1 jal x0, game_loop finish_game: lw x8, 0(x2) lw x9, 4(x2) lw x18, 8(x2) lw x19, 12(x2) lw x20, 16(x2) addi x2, x2, 32 lw x1, 0(x2) addi x2, x2, 4 ret # ====================================================== # print_char(x10) # Print a single character to stdout # Argument: # x10 = ASCII value of the character (e.g., 65 = 'A') # ====================================================== .global print_char print_char: addi x2, x2, -4 sw x10, 0(x2) li x10, 1 add x11, x2, x0 li x12, 1 li x17, 64 ecall addi x2, x2, 4 ret .data obdata: .byte 0x3c, 0x3b, 0x3a str1: .asciz " Move Disk " .set str1_size, . - str1 str2: .asciz " from " .set str2_size, . - str2 str3: .asciz " to " .set str3_size, . - str3 ``` ### Execution Results under -O0 Optimization Level ```c Test: Quiz2_A Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Cycles: 782 Instructions: 782 ``` ## Problem `C` in Quiz3 ### Description This problem implements a **fast reciprocal square root** (`fast_rsqrt`) function using **lookup table initialization**, **linear interpolation**, and **Newton–Raphson iteration** under fixed-point arithmetic (**Q0.16 scaling**). The implementation improves computational accuracy from the initial **20 % error** (table only) to **around 3 – 8 % after two Newton iterations**. ### Code Modification Summary #### 1. Lookup Table To ensure that all lookup table values fit within the `uint16_t` data type, the first entry was **adjusted from 65536 to 65535** (the closest representable value). This minor modification may slightly affect precision, but it allows the program to execute correctly without overflow. #### 2. `print_q16` in main.c A new helper function `print_q16()` was implemented to convert and display fixed-point **Q0.16** numbers in a readable decimal format. To implement `print_q16()`, we reused existing utility functions — `printstr`, `umod`, and `udiv` — which were already provided in the original main.c. These functions handle basic string output and unsigned division/modulo operations, allowing us to build the integer and fractional parts of the Q0.16 number entirely with integer arithmetic. By separating the upper 16 bits as the integer part and scaling the lower 16 bits by 10^6 using 64-bit arithmetic, `print_q16()` accurately prints six decimal places without relying on floating-point instructions. ```c /* Print fixed-point Q0.16 number in human-readable decimal format */ static void print_q16(uint32_t val) { unsigned long int_part = val >> 16; // Extract integer part (upper 16 bits) // Compute fractional part using 64-bit multiplication for precision uint64_t tmp = (uint64_t)(val & 0xFFFF) * 1000000ULL; // Scale to 6 decimal digits unsigned long frac_part = (unsigned long)(tmp >> 16); // Convert back to 32-bit range (0..999999) // Print integer part char buf_int[20]; char *p = buf_int + sizeof(buf_int); *--p = '\0'; unsigned long t = int_part; if (t == 0) *--p = '0'; else { while (t) { *--p = '0' + umod(t, 10); t = udiv(t, 10); } // Convert integer digits } printstr(p, (unsigned)(buf_int + sizeof(buf_int) - 1 - p)); printstr(".", 1); // Print decimal point // Print fractional part (always 6 digits) char buf_frac[6]; for (int i = 5; i >= 0; --i) { buf_frac[i] = '0' + umod(frac_part, 10); frac_part = udiv(frac_part, 10); } printstr(buf_frac, 6); printstr("\n", 1); // New line } ``` ### Modified `fast_rsqrt.c` ```c #include <stdio.h> #include <stdint.h> static const uint16_t rsqrt_table[32] = { 65535, 46341, 32768, 23170, 16384, 11585, 8192, 5792, 4096, 2896, 2048, 1448, 1024, 724, 512, 362, 256, 181, 128, 91, 64, 45, 32, 23, 16, 11, 8, 6, 4, 3, 2, 1}; /* ========================================================= * Optimized CLZ (Count Leading Zeros) — binary search style * ========================================================= */ static int clz(uint32_t x) { if (!x) return 32; // special case: all bits are 0 int n = 0; if (!(x & 0xFFFF0000u)) { n += 16; x <<= 16; } if (!(x & 0xFF000000u)) { n += 8; x <<= 8; } if (!(x & 0xF0000000u)) { n += 4; x <<= 4; } if (!(x & 0xC0000000u)) { n += 2; x <<= 2; } if (!(x & 0x80000000u)) { n += 1; } return n; } /* ========================================================= * Software 32×32 → 64 bit multiplication (no MUL) * ========================================================= */ static uint64_t mul32(uint32_t a, uint32_t b) { uint64_t r = 0; for (int i = 0; i < 32; i++) { if (b & (1u << i)) r += ((uint64_t)a << i); } return r; } /* ========================================================= * Fast inverse square root (fixed-point Q0.16) * ========================================================= */ uint32_t fast_rsqrt(uint32_t x) { if (x == 0) return 0xFFFFFFFF; if (x == 1) return 65536; /* Step 1: MSB position */ int exp = 31 - clz(x); /* Step 2: Lookup + interpolation */ uint32_t y_base = rsqrt_table[exp]; uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; uint32_t diff = y_base - y_next; uint32_t base = 1u << exp; uint32_t frac = ((x - base) << 16) >> exp; // Q0.16 fraction uint32_t y = y_base - (uint32_t)(mul32(diff, frac) >> 16); /* Step 3: Newton-Raphson refinement (2 iterations) */ for (int i = 0; i < 2; i++) { uint32_t y2 = (uint32_t)(mul32(y, y) ); uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); uint32_t term = (3u << 16) - xy2; uint32_t prod = (uint32_t)(mul32(y, term) >> 16); y = prod >> 1; } return y; } ``` ### Execution Result under -O0 Optimization Level ```c Test: Quiz3_C Test case 1 Input: 1 rsqrt: 1.000000 Test case 2 Input: 2 rsqrt: 0.707107 Test case 3 Input: 3 rsqrt: 0.577331 Test case 4 Input: 4 rsqrt: 0.500000 Test case 5 Input: 10 rsqrt: 0.316223 Test case 6 Input: 100 rsqrt: 0.099990 Test case 7 Input: 1000 rsqrt: 0.031616 Test case 8 Input: 10000 rsqrt: 0.009994 Test case 9 Input: 4294967295 rsqrt: 0.000015 Cycles: 149642 Instructions: 149642 ``` ## Optimization Below is the observation of compiler optimization effects. We implemented and tested `hanoi.S` and `fast_rsqrt.c` in four different optimization levels: -O0, -O1, -O2, -O3, and -Ofast. For each optimization level, we performed a detailed analysis that includes: * **Program Size** — comparing the ELF file sizes to evaluate code compactness. * **ELF Header Inspection** — examining memory section layout and alignment differences. * **Cycles & Instructions Count** — measuring execution performance by recording the total number of CPU cycles and retired instructions, allowing comparison of efficiency across optimization levels. * **Instruction Frequency Histogram** — analyzing how compiler optimizations affect instruction distribution and overall instruction count. Through these observations, we aim to understand how increasing optimization levels impact execution performance, code efficiency, and instruction behavior at the assembly level. ## -O0 Optimization ### Cycles & Instructions count | | Cycles| Instructions | | -------- | -------- | -------- | | hanoi.elf | 782 | 782 | | fast_rsqrt.S | 149642 | 149642 | ### Program Size | text | data | bss | dec | hex | filename | | -------- | -------- | -------- | -------- | -------- | -------- | | 5058 | 304 | 4108 | 9470 | 24f | hanoi.elf | | 26000 | 336 | 4096 | 30432 | 76e0 | fast_rsqrt.elf| ### ELF Header Inspection #### hanoi.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 21880 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 18 Section header string table index: 17 ``` #### fast_rsqrt.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 52632 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 18 Section header string table index: 17 ``` ### Instruction Frequency Histogram #### hanoi.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 33.69% [408 ] ████████████████████████████████████████████████████████ 2. lw 18.83% [228 ] ███████████████████████████████▎ 3. sw 14.45% [175 ] ████████████████████████ 4. jal 5.37% [65 ] ████████▉ 5. ecall 2.64% [32 ] ████▍ 6. sub 2.56% [31 ] ████▎ 7. auipc 2.31% [28 ] ███▊ 8. slli 2.31% [28 ] ███▊ 9. bne 1.82% [22 ] ███ 10. srli 1.73% [21 ] ██▉ 11. jalr 1.65% [20 ] ██▋ 12. andi 1.49% [18 ] ██▍ 13. add 1.49% [18 ] ██▍ 14. lui 1.24% [15 ] ██ 15. sb 1.24% [15 ] ██ ``` #### fast_rsqrt.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. xor 27.27% [1334 ] █████████████████████████████████████████████████████ 2. add 14.46% [707 ] ████████████████████████████ 3. slli 14.01% [685 ] ███████████████████████████▏ 4. srli 13.51% [661 ] ██████████████████████████▎ 5. addi 9.47% [463 ] ██████████████████▍ 6. lw 7.69% [376 ] ██████████████▉ 7. sw 4.97% [243 ] █████████▋ 8. jal 2.09% [102 ] ████ ``` ## -O1 Optimization ### Cycles & Instructions count | | Cycles| Instructions | | -------- | -------- | -------- | | hanoi.elf | 780 | 780 | | fast_rsqrt.elf | 59856 | 59856 | ### Program Size | text | data | bss | dec | hex | filename | | -------- | -------- | -------- | -------- | -------- | -------- | | 2954| 304 |4100| 7358 | 1cbe | hanoi.elf | | 3798| 304 | 4104 | 8206 | 200e | fast_rsqrt.elf | ### ELF Header Inspection #### hanoi.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 20888 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 20 Section header string table index: 19 ``` #### fast_rsqrt.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 23628 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 20 Section header string table index: 19 ``` ### Instruction Frequency Histogram #### hanoi.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 41.92% [288 ] ████████████████████████████████████████████████████████ 2. jal 7.28% [50 ] █████████▋ 3. lw 7.28% [50 ] █████████▋ 4. sw 6.55% [45 ] ████████▊ 5. auipc 4.08% [28 ] █████▍ 6. ecall 3.93% [27 ] █████▎ 7. sub 3.06% [21 ] ████ 8. slli 2.91% [20 ] ███▉ 9. add 2.47% [17 ] ███▎ 10. jalr 2.18% [15 ] ██▉ 11. beq 2.18% [15 ] ██▉ 12. bne 1.89% [13 ] ██▌ 13. lui 1.75% [12 ] ██▎ 14. srli 1.31% [9 ] █▊ 15. sb 1.16% [8 ] █▌ 16. xor 1.16% [8 ] █▌ 17. blt 1.02% [7 ] █▎ ``` #### fast_rsqrt.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 41.56% [362 ] █████████████████████████████████████████████████████ 2. lw 8.38% [73 ] ██████████▋ 3. sw 7.58% [66 ] █████████▋ 4. jal 6.89% [60 ] ████████▊ 5. ecall 3.90% [34 ] ████▉ 6. slli 3.33% [29 ] ████▏ 7. sub 3.33% [29 ] ████▏ 8. auipc 3.21% [28 ] ████ 9. jalr 2.18% [19 ] ██▊ 10. add 2.18% [19 ] ██▊ 11. beq 1.95% [17 ] ██▍ 12. bne 1.95% [17 ] ██▍ 13. lui 1.72% [15 ] ██▏ 14. srli 1.72% [15 ] ██▏ 15. sb 1.38% [12 ] █▊ 16. or 1.15% [10 ] █▍ ``` ## -O2 Optimization ### Cycles & Instructions count | | Cycles| Instructions | | -------- | -------- | -------- | | hanoi.elf | 780 | 780 | | fast_rsqrt.elf | 62408 | 62408 | ### Program Size | text | data | bss | dec | hex | filename | | -------- | -------- | -------- | -------- | -------- | -------- | | 3066 | 304 | 4100| 7470 | 1d2e | hanoi.elf | | 3970| 304 | 4108 | 8382 | 20be | fast_rsqrt.elf | ### ELF Header Inspection #### hanoi.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 21984 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 21 Section header string table index: 20 ``` #### fast_rsqrt.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 26276 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 21 Section header string table index: 20 ``` ### Instruction Frequency Histogram #### hanoi.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 41.40% [296 ] ████████████████████████████████████████████████████████ 2. jal 6.43% [46 ] ████████▋ 3. lw 6.29% [45 ] ████████▌ 4. sw 5.73% [41 ] ███████▊ 5. auipc 3.92% [28 ] █████▎ 6. ecall 3.78% [27 ] █████ 7. add 3.22% [23 ] ████▎ 8. sub 3.08% [22 ] ████▏ 9. beq 2.80% [20 ] ███▊ 10. slli 2.66% [19 ] ███▌ 11. jalr 2.24% [16 ] ███ 12. bne 2.10% [15 ] ██▊ 13. sll 1.96% [14 ] ██▋ 14. lui 1.68% [12 ] ██▎ 15. srli 1.54% [11 ] ██ 16. sb 1.12% [8 ] █▌ 17. xor 1.12% [8 ] █▌ ``` #### fast_rsqrt.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 40.59% [371 ] ████████████████████████████████████████████████████████ 2. lw 7.33% [67 ] ██████████ 3. sw 6.67% [61 ] █████████▏ 4. jal 5.58% [51 ] ███████▋ 5. ecall 3.72% [34 ] █████▏ 6. slli 3.50% [32 ] ████▊ 7. auipc 3.06% [28 ] ████▏ 8. sub 3.06% [28 ] ████▏ 9. add 2.74% [25 ] ███▊ 10. beq 2.41% [22 ] ███▎ 11. bne 2.30% [21 ] ███▏ 12. srli 1.86% [17 ] ██▌ 13. lui 1.75% [16 ] ██▍ 14. jalr 1.75% [16 ] ██▍ 15. sll 1.75% [16 ] ██▍ 16. or 1.75% [16 ] ██▍ 17. sb 1.31% [12 ] █▊ 18. srl 1.20% [11 ] █▋ 19. andi 1.09% [10 ] █▌ ``` ## -O3 Optimization ### Cycles & Instructions count | | Cycles| Instructions | | -------- | -------- | -------- | | hanoi.elf | 780 | 780 | | fast_rsqrt.elf | 62408 | 62408 | ### Program Size | text | data | bss | dec | hex | filename | | -------- | -------- | -------- | -------- | -------- | -------- | | 3202 | 304 | 4108 | 7614 | 1dbe | hanoi.elf | | 4106 | 304 | 4100 | 8510 | 213e | fast_rsqrt.elf | ### ELF Header Inspection #### hanoi.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 22424 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 21 Section header string table index: 20 ``` #### fast_rsqrt.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 26708 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 21 Section header string table index: 20 ``` ### Instruction Frequency Histogram #### hanoi.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 40.85% [306 ] ████████████████████████████████████████████████████████ 2. jal 6.14% [46 ] ████████▍ 3. lw 6.14% [46 ] ████████▍ 4. sw 5.61% [42 ] ███████▋ 5. auipc 3.74% [28 ] █████ 6. ecall 3.60% [27 ] ████▉ 7. add 3.34% [25 ] ████▌ 8. sub 3.20% [24 ] ████▍ 9. beq 3.07% [23 ] ████▏ 10. slli 2.54% [19 ] ███▍ 11. bne 2.40% [18 ] ███▎ 12. jalr 2.27% [17 ] ███ 13. sll 1.87% [14 ] ██▌ 14. lui 1.60% [12 ] ██▏ 15. sb 1.47% [11 ] ██ 16. srli 1.47% [11 ] ██ 17. andi 1.07% [8 ] █▍ 18. xor 1.07% [8 ] █▍ ``` #### fast_rsqrt.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 40.19% [381 ] ████████████████████████████████████████████████████████ 2. lw 7.17% [68 ] █████████▉ 3. sw 6.54% [62 ] █████████ 4. jal 5.38% [51 ] ███████▍ 5. ecall 3.59% [34 ] ████▉ 6. slli 3.38% [32 ] ████▋ 7. sub 3.16% [30 ] ████▍ 8. auipc 2.95% [28 ] ████ 9. add 2.85% [27 ] ███▉ 10. beq 2.64% [25 ] ███▋ 11. bne 2.53% [24 ] ███▌ 12. jalr 1.79% [17 ] ██▍ 13. srli 1.79% [17 ] ██▍ 14. or 1.79% [17 ] ██▍ 15. lui 1.69% [16 ] ██▎ 16. sll 1.69% [16 ] ██▎ 17. sb 1.58% [15 ] ██▏ 18. andi 1.27% [12 ] █▊ 19. srl 1.16% [11 ] █▌ 20. bgeu 1.05% [10 ] █▍ ``` ## -Ofast Optimization ### Cycles & Instructions count | | Cycles| Instructions | | -------- | -------- | -------- | | hanoi.elf | 780 | 780 | | fast_rsqrt.S | 62408 | 62408 | ### Program Size | text | data | bss | dec | hex | filename | | -------- | -------- | -------- | -------- | -------- | -------- | | 3202 | 304 | 4108 | 7614 | 1dbe | hanoi.elf | | 4108 | 304 | 4100 | 8512 | 2140 | fast_rsqrt.elf | ### ELF Header Inspection #### hanoi.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 22424 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 21 Section header string table index: 20 ``` #### fast_rsqrt.elf ```c ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10000 Start of program headers: 52 (bytes into file) Start of section headers: 26676 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 20 Section header string table index: 19 ``` ### Instruction Frequency Histogram #### hanoi.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 40.85% [306 ] ████████████████████████████████████████████████████████ 2. jal 6.14% [46 ] ████████▍ 3. lw 6.14% [46 ] ████████▍ 4. sw 5.61% [42 ] ███████▋ 5. auipc 3.74% [28 ] █████ 6. ecall 3.60% [27 ] ████▉ 7. add 3.34% [25 ] ████▌ 8. sub 3.20% [24 ] ████▍ 9. beq 3.07% [23 ] ████▏ 10. slli 2.54% [19 ] ███▍ 11. bne 2.40% [18 ] ███▎ 12. jalr 2.27% [17 ] ███ 13. sll 1.87% [14 ] ██▌ 14. lui 1.60% [12 ] ██▏ 15. sb 1.47% [11 ] ██ 16. srli 1.47% [11 ] ██ 17. andi 1.07% [8 ] █▍ 18. xor 1.07% [8 ] █▍ ``` #### fast_rsqrt.elf ```c +---------------------------------------------+ | RV32 Target Instruction Frequency Histogram | +---------------------------------------------+ 1. addi 40.19% [381 ] ████████████████████████████████████████████████████████ 2. lw 7.17% [68 ] █████████▉ 3. sw 6.54% [62 ] █████████ 4. jal 5.38% [51 ] ███████▍ 5. ecall 3.59% [34 ] ████▉ 6. slli 3.38% [32 ] ████▋ 7. sub 3.16% [30 ] ████▍ 8. auipc 2.95% [28 ] ████ 9. add 2.85% [27 ] ███▉ 10. beq 2.64% [25 ] ███▋ 11. bne 2.53% [24 ] ███▌ 12. jalr 1.79% [17 ] ██▍ 13. srli 1.79% [17 ] ██▍ 14. or 1.79% [17 ] ██▍ 15. lui 1.69% [16 ] ██▎ 16. sll 1.69% [16 ] ██▎ 17. sb 1.58% [15 ] ██▏ 18. andi 1.27% [12 ] █▊ 19. srl 1.16% [11 ] █▌ 20. bgeu 1.05% [10 ] █▍ ``` ## Observation & Analysis ### 1. General Overview In this experiment, we evaluated the compiler optimization effects on two test programs — `hanoi.S` (assembly-based) and `fast_rsqrt.c` (C-based). Each program was compiled and analyzed under five optimization levels: -O0, -O1, -O2, -O3, and -Ofast. As optimization levels increase, redundant computations and memory accesses are eliminated, code size decreases, and performance improves. **However, beyond -O2, both performance and code size stabilize, showing that -O3 and -Ofast bring minimal further benefit.** ### 2. Cycles and Instructions Count | Optimization | hanoi.elf (Cycles / Instructions) | fast_rsqrt.elf (Cycles / Instructions) | | :----------- | :-------------------------------: | :------------------------------------: | | -O0 | 782 / 782 | 149,642 / 149,642 | | -O1 | 780 / 780 | 59,856 / 59,856 | | -O2 | 780 / 780 | 62,408 / 62,408 | | -O3 | 780 / 780 | 62,408 / 62,408 | | -Ofast | 780 / 780 | 62,408 / 62,408 | #### Analysis: * For `hanoi.elf`, cycles and instructions **remain almost constant (782 → 780) across all levels**, indicating the assembly code is already manually optimized and compiler optimization has minimal effect. * For `fast_rsqrt.elf`, the improvement is dramatic: from 149k to 59k cycles **(a 2.5× performance gain)** when moving from -O0 to -O1. * Starting from -O2, results plateau — all higher levels (-O2, -O3, -Ofast) perform identically, suggesting the compiler reached its maximum effective optimization. ### 3. Program Size Comparison | Optimization | hanoi.elf (text / data / bss / total bytes) | fast_rsqrt.elf (text / data / bss / total bytes) | | :----------- | :-----------------------------------------: | :----------------------------------------------: | | -O0 | 5058 / 304 / 4108 / 9470 | 26000 / 336 / 4096 / 30432 | | -O1 | 2954 / 304 / 4100 / 7358 | 3798 / 304 / 4104 / 8206 | | -O2 | 3066 / 304 / 4100 / 7470 | 3970 / 304 / 4108 / 8382 | | -O3 | 3202 / 304 / 4108 / 7614 | 4106 / 304 / 4100 / 8510 | | -Ofast | 3202 / 304 / 4108 / 7614 | 4108 / 304 / 4100 / 8512 | #### Analysis: * The unoptimized -O0 binaries are the largest, showing no compression or inlining. * Under -O1, the `fast_rsqrt` text section **shrinks from 26,000 → 3,798 bytes (≈85% reduction)**, demonstrating aggressive dead-code elimination and loop optimization. * From -O2 onward, the program size remains stable, indicating the compiler has **reached an optimization saturation point**. * -O3 and -Ofast show nearly identical binary sizes; their changes mainly involve code layout and instruction alignment rather than size reduction. ### 4. ELF Header Inspection All ELF files share the same general properties: * Architecture: RISC-V (RV32) * Endian: Little Endian * File Type: EXEC (Executable file) * OS ABI: UNIX - System V Notable differences: * **The start of section headers** gradually shifts closer to the program header as optimization levels increase — **indicating smaller overall binaries**. * **The number of section headers** increases slightly (from 18 at -O0 to 21 at -O2/-O3), reflecting **additional sections for symbol tables and inlined code during higher optimizations**. ### 5. Instruction Frequency Histogram Analysis #### hanoi.elf | Optimization | Key Instruction Observations | | :----------- | :------------------------------------------------------------------------------------------------------------------------------------------------- | | -O0 | Dominated by `addi`, `lw`, and `sw` (>65%), showing many redundant memory accesses and jumps. | | -O1 | Instruction count decreases significantly; `addi` rises to ~42%, meaning more immediate arithmetic and fewer load/store operations. | | -O2 ~ -Ofast | Very stable distribution — `addi` ≈ 40%, `jal` ≈ 6%, `lw/sw` ≈ 6% — indicating the compiler has reached optimal scheduling and minimal redundancy. | #### fast_rsqrt.elf | Optimization | Key Instruction Observations | | :----------------- | :--------------------------------------------------------------------------------------------------------------- | | -O0 | Heavy use of `xor`, `slli`, and `srli`, indicating unoptimized bit manipulation. | | -O1 | `addi` becomes dominant (≈41%), with fewer `lw/sw`, showing constant folding and loop simplification. | | -O2 / -O3 / -Ofast | Almost identical profiles — `addi` ≈ 40%, `lw` ≈ 7%, `sw` ≈ 6%, `jal` ≈ 5% — confirming optimization saturation. | ### Summary Compiler optimizations significantly enhance performance for C-based programs like `fast_rsqrt.c`, with -O1 already providing major speed-ups and -O2 achieving the best overall efficiency. For manually tuned assembly programs such as `hanoi.S`, optimization has minimal impact since the code is already close to optimal. ## Reference * [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel) * [Assignment1: RISC-V Assembly and Instruction Pipeline](https://hackmd.io/@Shaoen/arch2025-homework1) * [Assignment2: Complete Applications](https://hackmd.io/@sysprog/2025-arch-homework2)