# Assignment2: Bare-Metal Programming
## Lab2: RISC-V Instruction Set Simulator and System Emulator

contributed by < [hbshub](https://github.com/hbshub/ca2025-hw2) >

>[!Note] AI tools usage
>I used ChatGPT for code explanations, grammar revision, background research, and code summaries.

Following the instructions in [Lab2](https://hackmd.io/@sysprog/Sko2Ja5pel), I built the rv32emu simulator with system emulation enabled and made sure `tests/system/playground` is fully verified.

![image](https://hackmd.io/_uploads/Sy7AOZWl-l.png =40%x)
![image](https://hackmd.io/_uploads/SkXedbZxZe.png =40%x)

In this lab, I learned how to run programs in a bare-metal environment. The playground directory contains these files:

```
start.S       - Startup code with BSS initialization
linker.ld     - Arranges the entry point and the .text/.data/.bss/.stack sections in memory
perfcounter.S - CSR access routines for the cycle/instret counters
Makefile      - Build rules plus the run/dump/clean targets
```

`start.S`:
1. Sets up `sp` (the stack pointer).
2. Clears the `bss` (block started by symbol) section, which holds uninitialized global variables: a loop writes zero to every word, so all C globals start at 0.
3. Calls `main` to transfer control to the C program; if `main` returns, it exits by executing `ecall`.

```asm=
# Startup code for bare metal RISC-V
.section .text._start
.globl _start
.type _start, @function
_start:
    # Set up stack pointer
    la sp, __stack_top

    # Clear BSS
    la t0, __bss_start
    la t1, __bss_end
1:
    bge t0, t1, 2f
    sw zero, 0(t0)
    addi t0, t0, 4
    j 1b
2:
    # Call main
    call main

    # Exit syscall (if main returns)
    li a7, 93              # exit syscall number
    li a0, 0               # exit code
    ecall

    # Infinite loop (should never reach here)
3:
    j 3b
.size _start, .-_start

# Provide BSS markers if linker script doesn't define them
.weak __bss_start
.weak __bss_end
.weak __stack_top
```

`linker.ld`:
1. Sets the output architecture to riscv and the program entry point to `_start`.
2. Places the program at memory address 0x10000.
3. Code section (`.text`): places the `_start` function's code first, followed by all other program code.
4. Data section (`.data`): groups all initialized global variables.
5. BSS section (`.bss`): defines the symbols `__bss_start` and `__bss_end`, which `start.S` uses to know which memory region to clear (zero out).
6. Stack section (`.stack`): reserves 4096 bytes of memory for the stack and defines the `__stack_top` symbol, which `start.S` uses to initialize the stack pointer (`sp`).

```asm=
OUTPUT_ARCH( "riscv" )
ENTRY(_start)

SECTIONS
{
    . = 0x10000;

    .text : {
        *(.text._start)
        *(.text)
    }

    .data : {
        *(.data)
    }

    .bss : {
        __bss_start = .;
        *(.bss)
        __bss_end = .;
    }

    .stack (NOLOAD) : {
        . = ALIGN(16);
        . += 4096;
        __stack_top = .;
    }
}
```

`perfcounter.S`:
1. Provides functions to safely read the 64-bit performance counters (`cycle` and `instret`) on a 32-bit RISC-V system.
2. The code uses a "read-high, read-low, read-high-again" loop, retrying whenever the high half changes between the two reads.
3. Each function returns the consistent 64-bit value in the `a1:a0` register pair, as per the 32-bit ABI.

```asm=
# Performance counter functions for RISC-V

# Read 64-bit cycle counter
.text
.globl get_cycles
.align 2
get_cycles:
    csrr a1, cycleh
    csrr a0, cycle
    csrr a2, cycleh
    bne a1, a2, get_cycles
    ret
.size get_cycles,.-get_cycles

# Read 64-bit instruction retired counter
.globl get_instret
.align 2
get_instret:
    csrr a1, instreth
    csrr a0, instret
    csrr a2, instreth
    bne a1, a2, get_instret
    ret
.size get_instret,.-get_instret
```
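Since each routine returns its 64-bit result in the `a1:a0` pair, C code can call them as plain functions returning `uint64_t` under the ILP32 ABI. A minimal sketch of the assumed C-side declarations (the actual header in this project may differ):

```c
#include <stdint.h>

/* Assumed declarations for the routines in perfcounter.S: under the
 * RV32 ILP32 ABI a uint64_t return value travels in a1:a0, which is
 * exactly where the assembly leaves the counter value. */
extern uint64_t get_cycles(void);
extern uint64_t get_instret(void);
```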
`Makefile`:
1. Builds the `test.elf` executable from the C and assembly sources and provides a `run` target (declared `.PHONY`) to execute it in the rv32emu emulator.
2. The `run`, `clean`, and `dump` targets are auxiliary commands.
3. Contains the rules for compilation, assembly, and linking.

```bash=
RV32EMU_PATH = /home/beta10/riscv-none-elf-gcc/rv32emu
include $(RV32EMU_PATH)/mk/toolchain.mk

ARCH = -march=rv32izicsr
LINKER_SCRIPT = linker.ld
EMU ?= $(RV32EMU_PATH)/build/rv32emu

AFLAGS = -g $(ARCH)
CFLAGS = -g -march=rv32i_zicsr
LDFLAGS = -T $(LINKER_SCRIPT)

EXEC = test.elf
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
LD = $(CROSS_COMPILE)ld
OBJDUMP = $(CROSS_COMPILE)objdump

OBJS = start.o main.o perfcounter.o q1-uf8.o

.PHONY: all run dump clean

all: $(EXEC)

$(EXEC): $(OBJS) $(LINKER_SCRIPT)
	$(LD) $(LDFLAGS) -o $@ $(OBJS)

%.o: %.S
	$(AS) $(AFLAGS) $< -o $@

%.o: %.c
	$(CC) $(CFLAGS) $< -o $@ -c

run: $(EXEC)
	@test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1)
	@grep -q "ENABLE_ELF_LOADER=1" $(RV32EMU_PATH)/build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1)
	@grep -q "ENABLE_SYSTEM=1" $(RV32EMU_PATH)/build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1)
	$(EMU) $<

dump: $(EXEC)
	$(OBJDUMP) -Ds $< | less

clean:
	rm -f $(EXEC) $(OBJS)
```

## Programs run in Bare-Metal

## Quiz1 Problem B [link](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
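The test exercises the uf8 round-trip: decode every 8-bit code, re-encode it, and check that the decoded values are strictly increasing. For reference, the decode step that the assembly inlines into `test` corresponds to this C shape (a sketch derived from the inlined instructions, not the project's original source):

```c
#include <stdint.h>

/* uf8 decode as inlined in the assembly below: a 4-bit exponent in the
 * high nibble, a 4-bit mantissa in the low nibble, with a +16 bias. */
static uint32_t uf8_decode(uint8_t fl) {
    uint32_t exp = fl >> 4;      /* srli t0, a0, 4    */
    uint32_t mant = fl & 0x0F;   /* andi a0, a0, 0x0F */
    return ((mant + 16) << exp) - 16;
}
```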
:::spoiler asm code
```asm=
.data
str1: .string ": produces value "
.equ str1_len, . - str1 - 1
str2: .string " but encodes back to "
.equ str2_len, . - str2 - 1
str3: .string ": value "
.equ str3_len, . - str3 - 1
str4: .string " <= previous_value "
.equ str4_len, . - str4 - 1
str5: .string "\n"
.equ str5_len, . - str5 - 1
str6: .string "All tests passed\n"
.equ str6_len, . - str6 - 1

.text
.globl run_q1_uf8          # Declare global symbol
.extern print_dec          # Declare external C function

run_q1_uf8:
    # ABI: Allocate 32 bytes (16-byte aligned) for 6 registers
    addi sp, sp, -32
    sw ra, 28(sp)          # Save ra (return address to C)
    sw s0, 24(sp)          # Save s0
    sw s1, 20(sp)          # Save s1
    sw s2, 16(sp)          # Save s2
    sw s3, 12(sp)          # Save s3
    sw s5, 8(sp)           # Save s5

    li s0, 0               # s0 = i / fl (0..255)
    li s1, -1              # s1 = previous_value
    li s2, 256             # s2 = remaining count
    li s3, 1               # s3 = passed (1=ok, 0=fail)
    li s5, 15              # s5 = cmp value

    jal test               # run test
    bne a0, x0, print_suc_msg

main_c:
    # ABI: Restore all saved registers
    lw ra, 28(sp)
    lw s0, 24(sp)
    lw s1, 20(sp)
    lw s2, 16(sp)
    lw s3, 12(sp)
    lw s5, 8(sp)
    addi sp, sp, 32        # Restore stack
    ret

# round-trip test
test:
    # ABI: 'test' is a callee, must save callee-saved registers (s4)
    addi sp, sp, -16       # 16-byte align
    sw ra, 12(sp)          # Save ra
    sw s4, 8(sp)           # Save s4

loop:
    beqz s2, test_end
    andi a0, s0, 255       # load fl = (uint8_t)i

    # --- INLINED uf8_decode ---
    # This replaces 'jal uf8_decode' to save cycles
    srli t0, a0, 4
    andi a0, a0, 0x0F
    addi a0, a0, 16
    sll a0, a0, t0
    addi a0, a0, -16
    # --- End Inlined uf8_decode ---

    addi s4, a0, 0         # s4 = value = uf8_decode(fl)
    jal uf8_encode         # Call optimized uf8_encode
    andi a0, a0, 255
    bne s0, a0, set_fail   # fl != fl2 ? fail

check_mono_inc:
    blt s1, s4, ok         # previous_value < value ? ok
    li s3, 0               # passed = 0
    li t3, 1               # for debug
    jal print_fail_msg

ok:
    addi s1, s4, 0         # previous_value = value
    addi s0, s0, 1         # i++
    addi s2, s2, -1        # remaining--
    j loop

set_fail:
    li s3, 0               # passed = 0
    li t3, 0               # for debug
    mv t4, a0              # t4 = fl2
    jal print_fail_msg
    j check_mono_inc

test_end:
    addi a0, s3, 0         # a0 = return value (1=pass, 0=fail)
    lw ra, 12(sp)          # Restore ra
    lw s4, 8(sp)           # Restore s4
    addi sp, sp, 16        # Restore sp
    ret

# ------------------------------------------------------------
# Note: uf8_decode function was removed and inlined into 'test'
# ------------------------------------------------------------

uf8_encode:
    li t3, 16
    bge a0, t3, e_not_zero
    ret

e_not_zero:
    # We must save 'ra' because we 'jal clz'.
    # We use 't4' to save 'a0', avoiding sw/lw for it.
    addi sp, sp, -16
    sw ra, 12(sp)
    mv t4, a0
    jal ra, clz
    li t3, 31
    sub t0, t3, a0
    mv a0, t4
    lw ra, 12(sp)
    addi sp, sp, 16

    li t1, 0
    li t2, 0
    li t3, 5
    blt t0, t3, find_exa_exp
    addi t1, t0, -4

    # Replaced O(n) loop with O(1) formula & removed s5 dependency.
    sltiu t3, t1, 16
    beqz t3, set_exp_15    # if (t1 >= 16), set exp = 15
    j build_of_once

set_exp_15:
    li t1, 15

build_of_once:
    # Build 'of' using O(1) formula: of = ((1<<exp) - 1) << 4
    li t2, 1
    sll t2, t2, t1
    addi t2, t2, -1
    slli t2, t2, 4
    j adj_exp

adj_exp:
    ble t1, x0, find_exa_exp
    bge a0, t2, find_exa_exp
    addi t2, t2, -16
    srli t2, t2, 1
    addi t1, t1, -1
    j adj_exp

find_exa_exp:
    bge t1, s5, calc_m
    slli t0, t2, 1
    addi t0, t0, 16
    blt a0, t0, calc_m
    mv t2, t0
    addi t1, t1, 1
    j find_exa_exp

calc_m:
    sub t0, a0, t2
    srl t0, t0, t1
    ble t0, s5, cmb_num
    li t0, 15

cmb_num:
    slli t1, t1, 4
    or a0, t1, t0
    ret

# ------------------------------------------------------------
# clz(x) for RV32I, binary-search unrolled (shift-only)
# ------------------------------------------------------------
clz:
    beq a0, x0, clz_zero   # Handle special case: x == 0
    li t0, 0               # n = 0 (leading zero count)

    # 1. Check top 16 bits (bits 31-16)
    srli t1, a0, 16
    bnez t1, check_8_bits
    addi t0, t0, 16
    slli a0, a0, 16

check_8_bits:
    # 2. Check top 8 bits (bits 23-16 of original, now 31-24)
    srli t1, a0, 24
    bnez t1, check_4_bits
    addi t0, t0, 8
    slli a0, a0, 8

check_4_bits:
    # 3. Check top 4 bits (bits 27-24 of original, now 31-28)
    srli t1, a0, 28
    bnez t1, check_2_bits
    addi t0, t0, 4
    slli a0, a0, 4

check_2_bits:
    # 4. Check top 2 bits (bits 29-28 of original, now 31-30)
    srli t1, a0, 30
    bnez t1, check_1_bit
    addi t0, t0, 2
    slli a0, a0, 2

check_1_bit:
    # 5. Check top 1 bit (bit 30 of original, now 31)
    srli t1, a0, 31
    bnez t1, clz_finish
    addi t0, t0, 1

clz_finish:
    mv a0, t0
    ret

clz_zero:
    li a0, 32
    ret
# --- End clz function ---

# ------------------------------------------------------------
# --- Helper Functions ---
# ------------------------------------------------------------

# Helper: printstr_asm(a0 = string_addr, a1 = string_len)
# Uses ecall 64 (sys_write)
printstr_asm:
    mv t1, a0
    mv t2, a1
    li a7, 64
    li a0, 1
    mv a1, t1
    mv a2, t2
    ecall
    ret

# Helper: void print_fail_msg(void)
print_fail_msg:
    # ABI: 'test' relies on s0, s1, s4.
    # Must save them before calling C function 'print_dec'.
    addi sp, sp, -16
    sw ra, 12(sp)
    sw s0, 8(sp)
    sw s1, 4(sp)
    sw s4, 0(sp)

    bnez t3, t3_1          # Error 0: fl != fl2 (t3 = 0)
    mv a0, s0
    jal print_dec
    la a0, str1
    li a1, str1_len
    jal printstr_asm
    mv a0, s4
    jal print_dec
    la a0, str2
    li a1, str2_len
    jal printstr_asm
    mv a0, t4
    jal print_dec
    la a0, str5
    li a1, str5_len
    jal printstr_asm
    j print_fail_end

t3_1:                      # Error 1: previous_value >= value (t3 = 1)
    mv a0, s0
    jal print_dec
    la a0, str3
    li a1, str3_len
    jal printstr_asm
    mv a0, s4
    jal print_dec
    la a0, str4
    li a1, str4_len
    jal printstr_asm
    mv a0, s1
    jal print_dec
    la a0, str5
    li a1, str5_len
    jal printstr_asm

print_fail_end:
    lw ra, 12(sp)
    lw s0, 8(sp)
    lw s1, 4(sp)
    lw s4, 0(sp)
    addi sp, sp, 16
    ret

# Helper: void print_suc_msg(void)
print_suc_msg:
    la a0, str6
    li a1, str6_len
    jal printstr_asm
    j main_c
```
:::

:::spoiler main.c code snippets
```c=
start_cycles = get_cycles();
start_instret = get_instret();

int passed = run_q1_uf8();

end_cycles = get_cycles();
end_instret = get_instret();
```
:::

**make with no flag**
![image](https://hackmd.io/_uploads/SklO33-e-g.png)

**riscv-none-elf-size test.elf**
![image](https://hackmd.io/_uploads/BkmLERWgZe.png)

**make with -Ofast**
![image](https://hackmd.io/_uploads/B1BGkCWx-g.png)

**riscv-none-elf-size test_Ofast.elf**
![image](https://hackmd.io/_uploads/r1jcNCWx-x.png)

**make with -Os**
![image](https://hackmd.io/_uploads/B1BGkCWx-g.png)

**riscv-none-elf-size test_Os.elf**
![image](https://hackmd.io/_uploads/ry3p40Wlbg.png)

### Analysis of print_dec Assembly: no flag vs. -Os

This analysis compares the assembly output of the `print_dec` C function under two different compilation settings. The goal is to observe the optimization strategies used by the gcc compiler.

1. (No flag) Version

```asm=
000103d8 <print_dec>:
   103d8: fc010113  addi   sp,sp,-64
   103dc: 02112e23  sw     ra,60(sp)
   103e0: 02812c23  sw     s0,56(sp)
   103e4: 04010413  addi   s0,sp,64
   103e8: fca42623  sw     a0,-52(s0)
   103ec: fd840793  addi   a5,s0,-40
   103f0: 01378793  addi   a5,a5,19
   103f4: fef42623  sw     a5,-20(s0)
   103f8: fcc42783  lw     a5,-52(s0)
   103fc: 06079063  bnez   a5,1045c <print_dec+0x84>
   10400: fec42783  lw     a5,-20(s0)
   10404: 03000713  li     a4,48
   10408: 00e78023  sb     a4,0(a5)
   1040c: fec42783  lw     a5,-20(s0)
   10410: fff78793  addi   a5,a5,-1
   10414: fef42623  sw     a5,-20(s0)
   10418: 04c0006f  j      10464 <print_dec+0x8c>
   1041c: 00a00593  li     a1,10
   10420: fcc42503  lw     a0,-52(s0)
   10424: d55ff0ef  jal    10178 <umod>
   10428: 00050793  mv     a5,a0
   1042c: 0ff7f793  zext.b a5,a5
   10430: 03078793  addi   a5,a5,48
   10434: 0ff7f713  zext.b a4,a5
   10438: fec42783  lw     a5,-20(s0)
   1043c: 00e78023  sb     a4,0(a5)
   10440: fec42783  lw     a5,-20(s0)
   10444: fff78793  addi   a5,a5,-1
   10448: fef42623  sw     a5,-20(s0)
   1044c: 00a00593  li     a1,10
   10450: fcc42503  lw     a0,-52(s0)
   10454: c61ff0ef  jal    100b4 <udiv>
   10458: fca42623  sw     a0,-52(s0)
   1045c: fcc42783  lw     a5,-52(s0)
   10460: fa079ee3  bnez   a5,1041c <print_dec+0x44>
   10464: fec42783  lw     a5,-20(s0)
   10468: 00178793  addi   a5,a5,1
   1046c: fef42623  sw     a5,-20(s0)
   10470: fd840793  addi   a5,s0,-40
   10474: 01478793  addi   a5,a5,20
   10478: fec42703  lw     a4,-20(s0)
   1047c: 40e78733  sub    a4,a5,a4
   10480: fec42783  lw     a5,-20(s0)
   10484: 04000893  li     a7,64
   10488: 00100513  li     a0,1
   1048c: 00f005b3  add    a1,zero,a5
   10490: 00070613  mv     a2,a4
   10494: 00000073  ecall
   10498: 00000013  nop
   1049c: 03c12083  lw     ra,60(sp)
   104a0: 03812403  lw     s0,56(sp)
   104a4: 04010113  addi   sp,sp,64
   104a8: 00008067  ret
```

- This version is a literal, unoptimized translation of the C code.
- Heavy Stack & Memory Use: The function allocates a large 64-byte stack frame (`addi sp,sp,-64`) and constantly saves (`sw`) and loads (`lw`) C variables (like `val` and `p`) to and from the stack inside the loop, which is extremely inefficient.
- External Function Calls: It makes expensive `jal` calls to the **`umod`** and **`udiv`** functions on every iteration of the loop.
- Code Structure: The assembly is easy to follow because it maps directly onto the C code's `if (val == 0)` and `while (val > 0)` structure.
- Conclusion: This version is large, slow, and wastes cycles by prioritizing literal translation over performance.
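The C source of `print_dec` is not listed in this report, but its shape can be reconstructed from the -O0 disassembly above (the 20-byte buffer filled backwards, the `val == 0` special case, the `udiv`/`umod` helpers, and the final write via ecall 64). A hedged sketch, where `sys_write1` is an illustrative wrapper name:

```c
extern unsigned udiv(unsigned n, unsigned d); /* assumed signatures for the */
extern unsigned umod(unsigned n, unsigned d); /* helpers seen in the listing */

/* Write len bytes to stdout via ecall 64, as the disassembly does. */
static void sys_write1(const char *buf, unsigned len) {
    register long a0 asm("a0") = 1;          /* fd 1 (stdout) */
    register const char *a1 asm("a1") = buf;
    register unsigned a2 asm("a2") = len;
    register long a7 asm("a7") = 64;         /* sys_write */
    asm volatile("ecall" : "+r"(a0) : "r"(a1), "r"(a2), "r"(a7) : "memory");
}

void print_dec(unsigned val) {
    char buf[20];
    char *p = &buf[19];                      /* fill digits from the end */
    if (val == 0)
        *p-- = '0';
    while (val > 0) {
        *p-- = (char)('0' + umod(val, 10));
        val = udiv(val, 10);
    }
    p++;
    sys_write1(p, (unsigned)(&buf[20] - p));
}
```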
2. -Os (Optimized for Size) Version

```asm=
0001008c <print_dec>:
   1008c: fe010113  addi  sp,sp,-32
   10090: 0a051663  bnez  a0,1013c <print_dec+0xb0>
   10094: 03000793  li    a5,48
   10098: 00f10fa3  sb    a5,31(sp)
   1009c: 01e10713  addi  a4,sp,30
   100a0: 00170713  addi  a4,a4,1
   100a4: 02010793  addi  a5,sp,32
   100a8: 40e787b3  sub   a5,a5,a4
   100ac: 04000893  li    a7,64
   100b0: 00100513  li    a0,1
   100b4: 00e005b3  add   a1,zero,a4
   100b8: 00078613  mv    a2,a5
   100bc: 00000073  ecall
   100c0: 02010113  addi  sp,sp,32
   100c4: 00008067  ret
   100c8: 00d55633  srl   a2,a0,a3
   100cc: 00179793  slli  a5,a5,0x1
   100d0: 00167613  andi  a2,a2,1
   100d4: 00f667b3  or    a5,a2,a5
   100d8: 00f87463  bgeu  a6,a5,100e0 <print_dec+0x54>
   100dc: ff678793  addi  a5,a5,-10
   100e0: fff68693  addi  a3,a3,-1
   100e4: ff1692e3  bne   a3,a7,100c8 <print_dec+0x3c>
   100e8: 03078793  addi  a5,a5,48
   100ec: 00f70023  sb    a5,0(a4)
   100f0: 00000613  li    a2,0
   100f4: fff70713  addi  a4,a4,-1
   100f8: 01f00693  li    a3,31
   100fc: 00000793  li    a5,0
   10100: 00d555b3  srl   a1,a0,a3
   10104: 00179793  slli  a5,a5,0x1
   10108: 0015f593  andi  a1,a1,1
   1010c: 00f5e7b3  or    a5,a1,a5
   10110: 00f87863  bgeu  a6,a5,10120 <print_dec+0x94>
   10114: 00d315b3  sll   a1,t1,a3
   10118: ff678793  addi  a5,a5,-10
   1011c: 00b66633  or    a2,a2,a1
   10120: fff68693  addi  a3,a3,-1
   10124: fd169ee3  bne   a3,a7,10100 <print_dec+0x74>
   10128: f6060ce3  beqz  a2,100a0 <print_dec+0x14>
   1012c: 00060513  mv    a0,a2
   10130: 01f00693  li    a3,31
   10134: 00000793  li    a5,0
   10138: f91ff06f  j     100c8 <print_dec+0x3c>
   1013c: 01f10713  addi  a4,sp,31
   10140: 00900813  li    a6,9
   10144: fff00893  li    a7,-1
   10148: 00100313  li    t1,1
   1014c: fe5ff06f  j     10130 <print_dec+0xa4>
```

- This version is highly optimized and restructured; the compiler has completely rewritten the function's logic to be smaller and faster.
- Minimal Stack Use: It allocates only a 32-byte stack frame (`addi sp,sp,-32`), used purely as the `char buf[20]` buffer. It does not save `ra` or `s0`.
- Function Inlining: This is the most significant optimization. The compiler has inlined the **`umod`** and **`udiv`** functions: the division and modulo logic is built directly into the `print_dec` assembly, eliminating all the expensive `jal` calls.
- Register-Based Logic: All variables (`val`, `p`, the remainder) are kept in registers (`a0`-`a7`, `t1`). The slow `lw`/`sw` stack accesses are almost completely gone, replaced by fast register operations.
- Code Structure: The code is hard to read and no longer resembles the original C `if`/`while` structure. The compiler has rebuilt it into a more efficient "goto-style" flow (using `bnez` and `beqz`) to minimize branches and instructions.
- Conclusion: This version is far superior: smaller and much faster, because it avoids function calls through inlining and avoids slow memory accesses by keeping everything in registers.

### Analysis of memcpy Assembly: -Os vs. -Ofast vs. no flag

This analysis compares the assembly output of the `memcpy` C function, a fundamental memory operation, under three different compilation settings. It clearly highlights the trade-offs between optimizing for code size and optimizing for speed.
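For reference, the C source presumably being compiled is a plain byte-by-byte copy; the -O0 listing below maps onto exactly this shape (a sketch, since the report does not list the original source):

```c
#include <stddef.h>

/* Plain byte-by-byte memcpy; the presumed source behind all three
 * disassemblies discussed in this section. */
void *memcpy(void *dest, const void *src, size_t n) {
    char *d = dest;
    const char *s = src;
    while (n--)
        *d++ = *s++;
    return dest;
}
```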
1. -Os (Optimized for Size) Version

```asm=
0001003c <memcpy>:
   1003c: 00000793  li    a5,0
   10040: 00f61463  bne   a2,a5,10048 <memcpy+0xc>
   10044: 00008067  ret
   10048: 00f58733  add   a4,a1,a5
   1004c: 00074683  lbu   a3,0(a4)
   10050: 00f50733  add   a4,a0,a5
   10054: 00178793  addi  a5,a5,1
   10058: 00d70023  sb    a3,0(a4)
   1005c: fe5ff06f  j     10040 <memcpy+0x4>
```

- This version prioritizes minimal code size for `memcpy`.
- Strategy: Simple Byte-by-Byte Copy: The compiler generates a very straightforward loop that copies data one byte at a time.
  - It checks whether any bytes remain to copy (`bne a2,a5,10048`), returning immediately when `n` is zero.
  - Inside the loop (`10048` to `1005c`), it uses `lbu` (load byte unsigned) to fetch one byte from the source and `sb` (store byte) to write it to the destination.
  - The loop continues until all bytes are copied (`j 10040`).
- Code Size: Extremely compact, only 9 instructions in total.
- Performance: While compact, this method is inherently slow for larger data blocks, since it pays byte-level operations plus loop overhead for every single byte.
- Complexity: Very low.
- Conclusion: This version perfectly exemplifies optimization for size, achieving a small footprint at the cost of execution speed for bulk operations.

2. -Ofast (Optimized for Speed) Version

```asm=
0001003c <memcpy>:
   1003c: fff60813  addi  a6,a2,-1
   10040: 08060e63  beqz  a2,100dc <memcpy+0xa0>
   10044: 00600793  li    a5,6
   10048: 00158713  addi  a4,a1,1
   1004c: 0707fa63  bgeu  a5,a6,100c0 <memcpy+0x84>
   10050: 00b567b3  or    a5,a0,a1
   10054: 0037f793  andi  a5,a5,3
   10058: 00158713  addi  a4,a1,1
   1005c: 06079263  bnez  a5,100c0 <memcpy+0x84>
   10060: 40e507b3  sub   a5,a0,a4
   10064: 0037b793  sltiu a5,a5,3
   10068: 04079c63  bnez  a5,100c0 <memcpy+0x84>
   1006c: ffc67893  andi  a7,a2,-4
   10070: 011586b3  add   a3,a1,a7
   10074: 00050793  mv    a5,a0
   10078: 0005a703  lw    a4,0(a1)
   1007c: 00458593  addi  a1,a1,4
   10080: 00478793  addi  a5,a5,4
   10084: fee7ae23  sw    a4,-4(a5)
   10088: feb698e3  bne   a3,a1,10078 <memcpy+0x3c>
   1008c: 41180833  sub   a6,a6,a7
   10090: 011507b3  add   a5,a0,a7
   10094: 05160463  beq   a2,a7,100dc <memcpy+0xa0>
   10098: 0006c703  lbu   a4,0(a3)
   1009c: 00e78023  sb    a4,0(a5)
   100a0: 02080e63  beqz  a6,100dc <memcpy+0xa0>
   100a4: 0016c603  lbu   a2,1(a3)
   100a8: 00100713  li    a4,1
   100ac: 00c780a3  sb    a2,1(a5)
   100b0: 02e80663  beq   a6,a4,100dc <memcpy+0xa0>
   100b4: 0026c703  lbu   a4,2(a3)
   100b8: 00e78123  sb    a4,2(a5)
   100bc: 00008067  ret
   100c0: 00c50633  add   a2,a0,a2
   100c4: 00050793  mv    a5,a0
   100c8: fff74683  lbu   a3,-1(a4)
   100cc: 00178793  addi  a5,a5,1
   100d0: 00170713  addi  a4,a4,1
   100d4: fed78fa3  sb    a3,-1(a5)
   100d8: fec798e3  bne   a5,a2,100c8 <memcpy+0x8c>
   100dc: 00008067  ret
```

- This version prioritizes maximum execution speed for `memcpy`, even at the cost of significantly larger and more complex code.
- Strategy: Aggressive Word-by-Word Copy with Alignment Handling: The compiler uses a much more sophisticated approach to leverage the processor's ability to move data in 4-byte (word) chunks.
- Alignment Checks: The function starts with logic (`10044` to `10068`) that checks whether the source and destination addresses are 4-byte aligned; misaligned accesses can be slower or require special handling.
- Fast Path (Word-by-Word Copy): If both addresses are aligned, the code enters a highly optimized main loop (`10078` to `10088`) that uses `lw` (load word, 4 bytes) and `sw` (store word, 4 bytes), which is significantly faster than byte-by-byte copies.
- Partial Copy / Cleanup Logic: After the main word-by-word loop, additional logic (`10098` to `100bc`) handles any remaining 1, 2, or 3 bytes that could not be copied in full 4-byte chunks. There is also a "slow path" (`100c0`) that copies byte-by-byte when the addresses are misaligned.
- Code Size: Substantially larger (over 40 instructions) due to the added alignment checks, branching, and partial-copy handling.
- Performance: Much faster for copying large blocks of data, because it maximizes use of the 32-bit data path.
- Complexity: Very high.
- Conclusion: This version is a prime example of optimization for speed, trading significant code size and complexity for superior execution performance in data-intensive tasks.
3. -O0 (No Optimization) Version

```asm=
0001003c <memcpy>:
   1003c: fd010113  addi  sp,sp,-48
   10040: 02112623  sw    ra,44(sp)
   10044: 02812423  sw    s0,40(sp)
   10048: 03010413  addi  s0,sp,48
   1004c: fca42e23  sw    a0,-36(s0)
   10050: fcb42c23  sw    a1,-40(s0)
   10054: fcc42a23  sw    a2,-44(s0)
   10058: fdc42783  lw    a5,-36(s0)
   1005c: fef42623  sw    a5,-20(s0)
   10060: fd842783  lw    a5,-40(s0)
   10064: fef42423  sw    a5,-24(s0)
   10068: 0240006f  j     1008c <memcpy+0x50>
   1006c: fe842703  lw    a4,-24(s0)
   10070: 00170793  addi  a5,a4,1
   10074: fef42423  sw    a5,-24(s0)
   10078: fec42783  lw    a5,-20(s0)
   1007c: 00178693  addi  a3,a5,1
   10080: fed42623  sw    a3,-20(s0)
   10084: 00074703  lbu   a4,0(a4)
   10088: 00e78023  sb    a4,0(a5)
   1008c: fd442783  lw    a5,-44(s0)
   10090: fff78713  addi  a4,a5,-1
   10094: fce42a23  sw    a4,-44(s0)
   10098: fc079ae3  bnez  a5,1006c <memcpy+0x30>
   1009c: fdc42783  lw    a5,-36(s0)
   100a0: 00078513  mv    a0,a5
   100a4: 02c12083  lw    ra,44(sp)
   100a8: 02812403  lw    s0,40(sp)
   100ac: 03010113  addi  sp,sp,48
   100b0: 00008067  ret
```

- This version prioritizes literal translation and debuggability, resulting in the slowest and most inefficient code.
- Strategy: Literal Byte-by-Byte, Stack-Heavy: The compiler generates a naive translation of the C code. It allocates a 48-byte stack frame and spills all C arguments (`dest`, `src`, `n`) to memory before starting.
- Alignment Checks: None; the logic is too simple to consider alignment.
- Fast Path (Word-by-Word Copy): None; the function contains only a "slow path."
- Partial Copy / Cleanup Logic: The entire function is the slow path. The loop (`1006c` to `10098`) loads and stores its variables on the stack for every single byte, performing one `lbu` and one `sb` at a time.
- Code Size: Fairly large (around 30 instructions), not because of complex logic, but because of the constant, repetitive `lw`/`sw` traffic to the stack.
- Performance: Extremely slow. The loop performs multiple high-cost memory accesses (stack `lw`/`sw`) for every byte copied, wasting many cycles.
- Complexity: The logic is simple, but the implementation is highly inefficient.
- Conclusion: This version is a prime example of no optimization: slow and bloated, prioritizing a literal, debuggable translation over any form of performance.

## Quiz2 Problem A [link](https://hackmd.io/@sysprog/arch2025-quiz2-sol)
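For orientation before the full listing: the assembly solves 3-disk Towers of Hanoi iteratively using Gray codes; the disk to move at step n is the bit that changes between Gray(n-1) and Gray(n). A host-side C rendering of that core loop (a reference sketch, not the report's source):

```c
#include <stdio.h>

/* Gray-code iteration for 3-disk Hanoi: 7 moves, matching the logic in
 * the assembly below (the smallest disk cycles forward; any other disk
 * has exactly one legal destination). */
int main(void) {
    int pos[3] = {0, 0, 0};                 /* current peg of each disk */
    for (unsigned n = 1; n < 8; n++) {
        unsigned delta = (n ^ (n >> 1)) ^ ((n - 1) ^ ((n - 1) >> 1));
        int disk = (delta & 1) ? 0 : (delta & 2) ? 1 : 2;
        int from = pos[disk];
        int to = (disk == 0) ? (from + 2) % 3     /* smallest disk cycles */
                             : 3 - from - pos[0]; /* only legal move */
        printf("Move Disk %d from %c to %c\n", disk + 1, 'A' + from, 'A' + to);
        pos[disk] = to;
    }
    return 0;
}
```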
:::spoiler original asm code
```asm=
# --- External Function Declarations ---
.extern print_dec          # Declare external C function 'print_dec' to be linked

# --- Text Section ---
.text
.globl run_q2_game_hanoi

run_q2_game_hanoi:
    # 1. PROLOGUE (Modified)
    # We need to call an external function (jal print_dec), so we must save ra (x1)
    # Allocate 48 bytes (16-byte aligned)
    # 0-19:  s0-s4 (x8, x9, x18, x19, x20)
    # 20-31: Disk array
    # 32:    ra (x1)
    # 36-37: Temp char buffer for 'A'/'B'/'C'
    addi x2, x2, -48
    sw x8, 0(x2)
    sw x9, 4(x2)
    sw x18, 8(x2)
    sw x19, 12(x2)
    sw x20, 16(x2)
    sw x1, 32(x2)          # <-- Key: Save the return address (ra)

    # ... (Disk position initialization is unchanged) ...
    sw x0, 20(x2)
    sw x0, 24(x2)
    sw x0, 28(x2)
    addi x8, x0, 1

game_loop:
    addi x5, x0, 8
    beq x8, x5, finish_game

    # --- (Gray code and move logic is unchanged) ---
    srli x5, x8, 1
    xor x6, x8, x5
    addi x7, x8, -1
    srli x28, x7, 1
    xor x7, x7, x28
    xor x5, x6, x7

    addi x9, x0, 0
    andi x6, x5, 1
    bne x6, x0, disk_found
    addi x9, x0, 1
    andi x6, x5, 2
    bne x6, x0, disk_found
    addi x9, x0, 2

disk_found:
    andi x30, x5, 5
    addi x31, x0, 5
    beq x30, x31, pattern_match
    jal x0, continue_move
pattern_match:
continue_move:
    slli x5, x9, 2
    addi x5, x5, 20
    add x5, x2, x5
    lw x18, 0(x5)          # x18 = source peg (0, 1, 2)

    bne x9, x0, handle_large
    addi x19, x18, 2
    addi x6, x0, 3
    blt x19, x6, display_move
    sub x19, x19, x6
    jal x0, display_move

handle_large:
    lw x6, 20(x2)
    addi x19, x0, 3
    sub x19, x19, x18
    sub x19, x19, x6       # x19 = destination peg (0, 1, 2)

    # x9 = disk_id, x18 = from_peg, x19 = to_peg
    # --- 2. DISPLAY SECTION (Rewritten to use sys_write and print_dec) ---
display_move:
    # 1. Print "Move Disk " (using sys_write, a7=64)
    # This is identical to the asm volatile in the C macro
    add a7, x0, 0x40       # a7 = 64 (sys_write)
    add a0, x0, 1          # a0 = 1 (stdout)
    la a1, str1            # a1 = pointer to string
    addi a2, x0, 11        # a2 = length of "Move Disk "
    ecall

    # 2. Print the disk number (call C function)
    addi a0, x9, 1         # a0 = argument (x9+1)
    jal ra, print_dec      # Call C function

    # 3. Print " from " (using sys_write)
    add a7, x0, 0x40
    add a0, x0, 1
    la a1, str2
    addi a2, x0, 6         # length of " from "
    ecall

    # 4. Print 'from' peg (A/B/C)
    la x20, obdata
    add x5, x20, x18       # x18 = 'from' index
    lbu x11, 0(x5)
    li x6, 0x6F
    xor x11, x11, x6
    addi x11, x11, -0x12   # x11 = 'A', 'B', or 'C'

    # Store the character 'A'/'B'/'C' into the temp buffer on the stack
    sb x11, 36(x2)

    # Print that 1-byte buffer
    add a7, x0, 0x40
    add a0, x0, 1
    add a1, x2, 36         # a1 = pointer to stack buffer
    addi a2, x0, 1         # a2 = length (1)
    ecall

    # 5. Print " to " (using sys_write)
    add a7, x0, 0x40
    add a0, x0, 1
    la a1, str3
    addi a2, x0, 4         # length of " to "
    ecall

    # 6. Print 'to' peg (A/B/C)
    add x5, x20, x19       # x19 = 'to' index
    lbu x12, 0(x5)
    xor x12, x12, x6
    addi x12, x12, -0x12   # x12 = 'A', 'B', or 'C'
    sb x12, 37(x2)         # Store to a different location on the stack

    add a7, x0, 0x40
    add a0, x0, 1
    add a1, x2, 37         # a1 = pointer
    addi a2, x0, 1         # a2 = length (1)
    ecall

    # 7. Print newline character '\n' (using sys_write)
    add a7, x0, 0x40
    add a0, x0, 1
    la a1, str_nl
    addi a2, x0, 1         # length (1)
    ecall

    # --- (State update logic is unchanged) ---
    slli x5, x9, 2
    addi x5, x5, 20
    add x5, x2, x5
    sw x19, 0(x5)          # Update disk position
    addi x8, x8, 1
    jal x0, game_loop
    # --- 3. FINISH SECTION (Modified) ---
finish_game:
    # Restore all registers
    lw x8, 0(x2)
    lw x9, 4(x2)
    lw x18, 8(x2)
    lw x19, 12(x2)
    lw x20, 16(x2)
    lw x1, 32(x2)          # <-- Key: Restore the return address (ra)

    # Deallocate the stack
    addi x2, x2, 48

    # [Modified] Use ret instruction to return to C code
    ret

.data
obdata: .byte 0x3c, 0x3b, 0x3a
str1: .asciz "Move Disk "  # length 11
str2: .asciz " from "      # length 6
str3: .asciz " to "        # length 4
str_nl: .byte 10           # Newline (ASCII 10)
```
:::

### Optimization Method

#### Detailed Explanation and Cycle Comparison

:::info
1. Data De-obfuscation
- Concept: The original code used complex logic (the `obdata` array combined with XOR `0x6F` and a `-0x12` correction) to convert the numbers 0, 1, and 2 into the characters 'A', 'B', and 'C'. This required multiple arithmetic instructions in every loop iteration.
- Optimization: We created a direct lookup table (`peg_names: .asciz "ABC"`) in the `.data` section. The code now simply adds the index to the base address to fetch the character directly, eliminating all the redundant arithmetic inside the loop.

```asm
.data
# obdata: .byte 0x3c, 0x3b, 0x3a
peg_names: .asciz "ABC"    # stored directly in ASCII

.text
# removed arithmetic in the display_move loop:
# la x20, obdata
# add x5, x20, x18         # x18 = index
# lbu x11, 0(x5)           # read obdata byte
# li x6, 0x6F
# xor x11, x11, x6         # XOR decode
# addi x11, x11, -0x12     # subtract correction -> get 'A'/'B'/'C'
```
:::

:::info
2. Direct I/O Addressing
- Concept: The original approach for printing peg names ('A'/'B'/'C') first read the character from `.data` (`lbu`), wrote it to a temporary buffer on the stack (`sb`), and then passed the stack address to the system call, creating unnecessary memory reads and writes.
- Optimization: sys_write only needs a memory address, so we compute the character's original address inside `.data` (`peg_names` base + index) and pass it directly in `a1`. This removes the intermediate copy to the stack, reducing both the instruction count and the number of memory accesses.

```asm
# assume x11 stores 'A'
# sb x11, 44(x2)           # write to stack
# add a1, x2, 44           # a1 points to stack

# x25 (s9) stores the peg_names base address, x18 = index
# no sb and no stack operation
add a1, x25, x18           # a1 points directly at 'A' in .data
li a0, 1
li a7, 64
li a2, 1
ecall
```
:::

:::info
3. Loop-Invariant Code Motion
- Concept: In assembly, `la` (Load Address) is a pseudo-instruction that typically expands into 2 actual instructions (`auipc` + `addi`). The original code executed `la` inside every loop iteration to reload the addresses of strings (like "Move Disk "), even though these addresses never change.
- Optimization: We moved the loading of these constant string addresses into the prologue (before the `game_loop` label) and kept them in s (saved) registers. Inside the loop, a single `mv` instruction accesses each address, reducing the instruction count per iteration.

```asm
run_q2_game_hanoi:
    # --- Prologue ---
    la x21, str1           # keep addr in s5 (x21)
    la x22, str2           # keep addr in s6 (x22)
    # ... other strings ...

game_loop:
    # ...
display_move:
    # previously, la (auipc + addi) executed in each iteration:
    # la a1, str1
    mv a1, x21             # use s5: just 1 instruction
    li a0, 1
    li a7, 64
    li a2, 11
    ecall
    # other strings ...
```
:::
:::info
4. Redundant Code Elimination
- Concept: The original code contained a check after the `disk_found` label to see whether `x5` equals 5 (using `andi`, `addi`, `beq`). But by the mathematical properties of Gray code, `x5` is the result of Gray(n) XOR Gray(n-1), which is always a power of 2 (1, 2, 4, ...), so it can never equal 5.
- Optimization: We deleted this block that checks an impossible condition, saving 4 useless instructions per iteration and purely reducing the CPU burden.

```asm
disk_found:
    # removed 4 never-taken instructions
    # andi x30, x5, 5
    # addi x31, x0, 5
    # beq x30, x31, pattern_match
    # jal x0, continue_move
    # pattern_match:
    # continue_move:
    slli x5, x9, 2
    # ...
```
:::

| Method (built with no flags) | cycles | improvement |
|:---------------------------- |:------:|:-----------:|
| **none (baseline)**          | 9289   | -           |
| **opt 1 + 2**                | 9226   | 0.7%        |
| **opt 3**                    | 9274   | 0.16%       |
| **opt 4**                    | 9261   | 0.3%        |
| **opt 1 + 2 + 3 + 4**        | 9149   | 1.5%        |
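The improvement column is measured against the 9289-cycle baseline; for example, for the fully combined version:

$$
\frac{9289 - 9149}{9289} \approx 1.5\%
$$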
5. Optimized Code

```asm=
# --- External Function Declarations ---
.extern print_dec          # Declare external C function 'print_dec' to be linked

# --- Text Section ---
.text
.globl run_q2_game_hanoi

run_q2_game_hanoi:
    # 1. PROLOGUE (Expanded)
    # Allocate 64 bytes (16-byte aligned) to store:
    #   s0-s4 (x8, x9, x18, x19, x20)  (5 regs * 4B = 20B)
    #   ra (x1)                        (1 reg * 4B = 4B)
    #   s5-s9 (x21-x25)                (5 regs * 4B = 20B)
    #   disk array                     (3 words * 4B = 12B)
    # Total = 20 + 4 + 20 + 12 = 56B. Align to 16B, allocate 64B.
    addi x2, x2, -64
    sw x8, 0(x2)           # s0 (loop_count)
    sw x9, 4(x2)           # s1 (disk_id)
    sw x18, 8(x2)          # s2 (from_peg)
    sw x19, 12(x2)         # s3 (to_peg)
    sw x20, 16(x2)         # s4 (disk_array_base)
    sw x1, 20(x2)          # ra (because we will jal print_dec)
    sw x21, 24(x2)         # s5 (ptr: str1)
    sw x22, 28(x2)         # s6 (ptr: str2)
    sw x23, 32(x2)         # s7 (ptr: str3)
    sw x24, 36(x2)         # s8 (ptr: str_nl)
    sw x25, 40(x2)         # s9 (ptr: peg_names)
    # The base of the disk position array will be at 48(x2)

    # 【Optimization 2】: Loop-invariant code motion
    la x21, str1
    la x22, str2
    la x23, str3
    la x24, str_nl
    la x25, peg_names

    # 【Optimization 1】
    # Set s4 (x20) as the base address for the disk array
    addi x20, x2, 48

    # ... (Disk position initialization) ...
    sw x0, 0(x20)          # Disk 0 position
    sw x0, 4(x20)          # Disk 1 position
    sw x0, 8(x20)          # Disk 2 position
    addi x8, x0, 1

game_loop:
    addi x5, x0, 8
    beq x8, x5, finish_game

    # --- (Gray code and move logic) ---
    srli x5, x8, 1
    xor x6, x8, x5
    addi x7, x8, -1
    srli x28, x7, 1
    xor x7, x7, x28
    xor x5, x6, x7

    addi x9, x0, 0
    andi x6, x5, 1
    bne x6, x0, disk_found
    addi x9, x0, 1
    andi x6, x5, 2
    bne x6, x0, disk_found
    addi x9, x0, 2

disk_found:
    # 【Optimization 3】: Removed 4 redundant instructions
    # x5 stores the result of Gray(n) XOR Gray(n-1)
    # the result must be 1, 2, 4, 8 (a power of 2)
    # BLANK 16: Check impossible pattern (multiple bits)
    # andi x30, x5, 5
    # addi x31, x0, 5
    # beq x30, x31, pattern_match
    # jal x0, continue_move
    # pattern_match:
    # continue_move:
    slli x5, x9, 2
    add x5, x20, x5
    lw x18, 0(x5)          # x18 = source peg (0, 1, 2)

    bne x9, x0, handle_large
    # Disk 0 logic
    addi x19, x18, 2
    addi x6, x0, 3
    blt x19, x6, display_move
    sub x19, x19, x6
    jal x0, display_move

handle_large:
    # Disk 1/2 logic
    lw x6, 0(x20)
    addi x19, x0, 3
    sub x19, x19, x18
    sub x19, x19, x6       # x19 = destination peg (0, 1, 2)
    # Falls through to display_move

display_move:
    # --- 2. DISPLAY SECTION (Optimized) ---
    # 1. Print "Move Disk "
    add a7, x0, 0x40
    add a0, x0, 1
    mv a1, x21             # 【Optimization 2】: Use s5
    addi a2, x0, 11
    ecall

    # 2. Print the disk number (call C function)
    addi a0, x9, 1
    jal ra, print_dec

    # 3. Print " from "
    add a7, x0, 0x40
    add a0, x0, 1
    mv a1, x22             # 【Optimization 2】: Use s6
    addi a2, x0, 6
    ecall

    # 4. Print 'from' peg (A/B/C)
    # 【Optimization 4】: Use the peg_names address directly, no more copying to the stack
    add a1, x25, x18       # a1 = peg_names + from_index
    add a7, x0, 0x40
    add a0, x0, 1
    addi a2, x0, 1         # a2 = length (1)
    ecall

    # 5. Print " to "
    add a7, x0, 0x40
    add a0, x0, 1
    mv a1, x23             # 【Optimization 2】: Use s7
    addi a2, x0, 4
    ecall

    # 6. Print 'to' peg (A/B/C)
    # 【Optimization 4】: Use the peg_names address directly, no more copying to the stack
    add a1, x25, x19       # a1 = peg_names + to_index
    add a7, x0, 0x40
    add a0, x0, 1
    addi a2, x0, 1         # a2 = length (1)
    ecall

    # 7. Print newline '\n'
    add a7, x0, 0x40
    add a0, x0, 1
    mv a1, x24             # 【Optimization 2】: Use s8
    addi a2, x0, 1
    ecall

    # --- (State update logic) ---
    slli x5, x9, 2
    add x5, x20, x5
    sw x19, 0(x5)          # Update disk position
    addi x8, x8, 1
    jal x0, game_loop

    # --- 3. FINISH SECTION (Epilogue) ---
finish_game:
    # Restore all registers
    lw x8, 0(x2)
    lw x9, 4(x2)
    lw x18, 8(x2)
    lw x19, 12(x2)
    lw x20, 16(x2)
    lw x1, 20(x2)          # Restore ra
    lw x21, 24(x2)         # Restore s5
    lw x22, 28(x2)         # Restore s6
    lw x23, 32(x2)         # Restore s7
    lw x24, 36(x2)         # Restore s8
    lw x25, 40(x2)         # Restore s9

    # Deallocate the stack
    addi x2, x2, 64
    ret

.data
# 【Optimization 1】: Removed obdata, using a direct lookup table
peg_names: .asciz "ABC"
str1: .asciz "Move Disk "  # length 11
str2: .asciz " from "      # length 6
str3: .asciz " to "        # length 4
str_nl: .byte 10           # Newline (ASCII 10)
```

Run it in the bare-metal environment: `main.c` calls the external assembly function.

```c=
start_cycles = get_cycles();
start_instret = get_instret();

int passed = run_q2_game_hanoi();

end_cycles = get_cycles();
end_instret = get_instret();
```

### make run result

(1) Before Optimization
![image](https://hackmd.io/_uploads/rkqlLEzx-x.png =62%x)
![image](https://hackmd.io/_uploads/HJl0e_VMgbg.png)

(2) Optimized (make with no flag)
![image](https://hackmd.io/_uploads/HJ4mH4fx-g.png =62%x)
![image](https://hackmd.io/_uploads/H1TjDVzg-g.png)

(3) Optimized (make with -Ofast)
![image](https://hackmd.io/_uploads/Hyd_OVfe-x.png =62%x)
![image](https://hackmd.io/_uploads/B1ciu4zxWl.png)

(4) Optimized (make with -Os)
![image](https://hackmd.io/_uploads/Hyo0d4GxZg.png =62%x)
![image](https://hackmd.io/_uploads/BytZFEfx-x.png)

| method                        | cycles | improvement |
|:----------------------------- |:------:|:-----------:|
| **(1) before opt**            | 9289   | -           |
| **(2) after handwritten opt** | 9149   | 1.5%        |
| **(3) -Ofast**                | 3988   | 57%         |
| **(4) -Os**                   | 3876   | 58%         |
### Observation Analysis

1. From (1) to (2): My Own Optimization
   - My optimized hanoi.S (e.g., removing `obdata`, removing redundant instructions).
   - The C library was unchanged (both compiled with default settings).
   - Result: my handwritten assembly optimization saved 9289 - 9149 = 140 cycles.
2. From (2) to (3): The C Library Optimization
   - hanoi.S was unchanged.
   - The C library (containing `print_dec`) went from no optimization to -Ofast (aggressive optimization).
   - Result: the C library's efficiency improved massively, saving 9149 - 3988 = 5161 cycles.
   - This shows that in tests (1) and (2), the vast majority of the inefficiency came from the unoptimized C library (especially `print_dec`).
3. From (3) to (4): The Battle of Optimization Flags
   - hanoi.S was unchanged.
   - The C library was compiled with -Os instead of -Ofast.
   - Result: the machine code for `print_dec` (and the other C startup code) generated by -Os executes fewer instructions in total than the version generated by -Ofast: 3988 - 3876 = 112 fewer cycles.
4. -Ofast's aggressive optimization was not helpful in this particular situation: it produced a C library that needs more instructions to execute, so the total cycle count ends up higher.

## Quiz3 Problem C [link](https://hackmd.io/@sysprog/arch2025-quiz3-sol)

In Quiz 3 Problem C, we implement the **Fast Inverse Square Root** ( $\frac{65536}{\sqrt{x}}$ ) algorithm and run a test suite in the bare-metal environment.

First, we implement the `fast_rsqrt` algorithm following the steps provided by the instructor:
1. Lookup table initialization: we use `clz` to find the position of the most significant bit (MSB) of the input value `x`, then use that index to fetch the initial guess from `rsqrt_table`.
2. Linear interpolation: if `x` isn't an exact power of two, we refine the initial guess by interpolating between table entries to reduce the error.
3. **Newton–Raphson iteration**: we perform two iterations to further improve the accuracy of the result.

```c
static const uint16_t rsqrt_table[32] = {
    65536, 46341, 32768, 23170, 16384, /* 2^0  to 2^4  */
    11585, 8192, 5793, 4096, 2896,     /* 2^5  to 2^9  */
    2048, 1448, 1024, 724, 512,        /* 2^10 to 2^14 */
    362, 256, 181, 128, 90,            /* 2^15 to 2^19 */
    64, 45, 32, 23, 16,                /* 2^20 to 2^24 */
    11, 8, 6, 4, 3,                    /* 2^25 to 2^29 */
    2, 1                               /* 2^30, 2^31   */
};

uint32_t fast_rsqrt(uint32_t x)
{
    if (x == 0) return 0xFFFFFFFF; // Handle zero case
    if (x == 1) return 65536;      // Handle exact case for 1

    // Step 1: Find MSB position
    int exp = 31 - clz(x);         // Count leading zeros
    uint32_t y = rsqrt_table[exp]; // Initial estimate

    // Step 2: Linear interpolation for non-power-of-2 inputs
    if (x > (1u << exp)) {
        uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; // Next estimate
        uint32_t delta = y - y_next; // Difference between estimates
        uint32_t frac = (uint32_t)((((uint64_t)x - (1UL << exp)) << 16) >> exp);
        y -= (uint32_t)((delta * frac) >> 16); // Interpolate
    }

    // Step 3: Newton-Raphson iterations
    for (int iter = 0; iter < 2; iter++) {
        uint32_t y2 = (uint32_t)mul32(y, y);              // y^2 in Q0.32
        uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16);    // x * y^2 in Q16.16
        y = (uint32_t)(mul32(y, (3u << 16) - xy2) >> 17); // Newton step
    }
    return y;
}
```

### Examples: `fast_rsqrt(1024)` and `fast_rsqrt(100)`

To illustrate how the algorithm works, walk through two example inputs: `fast_rsqrt(1024)` and `fast_rsqrt(100)`.

1. **`fast_rsqrt(1024)`**

* **Step 1: Finding the MSB (Exponent)**
    * The code uses `clz(1024)` to determine the position of the Most Significant Bit (MSB), which gives the exponent `exp = 10`.
    * We look up the corresponding value in the table: `y = 2048` (which is $65536 / \sqrt{2^{10}}$).
* **Step 2: The Execution Check (Skipping the Math)**
    * The code checks the condition `if (x > (1u << exp))`.
    * The test is `if (1024 > 1024)`, which is **False**.
* **Step 3: Execution Path**
    * The entire block containing the complex calculations (interpolation and Newton-Raphson iterations) is **skipped**. Since the table value for $2^{exp}$ is exact, no refinement is needed.
* **Result**
    * The function returns the precise table value: **2048**.
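A quick sanity check of that value:

$$
\frac{65536}{\sqrt{1024}} = \frac{65536}{32} = 2048
$$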
---

2. **`fast_rsqrt(100)`**

This input takes the full calculation path because **100 falls between two powers of two** ($2^6$ and $2^7$).

* **Step 1: Finding the MSB (Exponent)**
    * `clz(100)` tells us the MSB is at position 6, so the initial exponent is `exp = 6` ($2^6 = 64$).
    * The code takes the table value for $2^6$ as the starting point: `y = 8192`.
* **Step 2: The Execution Check (Entering the Math Block)**
    * The code checks the condition `if (x > (1u << exp))`.
    * The test is `if (100 > 64)`, which is **True**. This triggers the full calculation.
* **Step 3: Execution Path**
    * The code **enters** the calculation block to refine the estimate:
        1. **Linear Interpolation:** A fractional term is calculated (based on the distance between 100 and 64) and used to adjust `y` downward from 8192.
        2. **Newton-Raphson:** Two iterations are run on the already-improved `y` to ensure high final accuracy.
* **Result**
    * The function returns the fully refined value, approximately **6553**.

I wrote a test-suite function to check several case values:
* **run_q3_rsqrt()**: the test-suite function.
* **check_exact()**: checks whether the two values are exactly the same.
* **check_approx()**: not only compares the values but also checks whether their difference exceeds a given tolerance. (A minimal sketch of both helpers appears after the results below.)

```c
int run_q3_rsqrt() {
    int all_passed = 1;

    // Edge/Special Cases
    check_exact("rsqrt(0)", fast_rsqrt(0), 0xFFFFFFFF, 0, &all_passed);
    check_exact("rsqrt(1)", fast_rsqrt(1), 65536, 0, &all_passed);
    check_exact("rsqrt(0xFFFFFFFF)", fast_rsqrt(0xFFFFFFFF), 1, 0, &all_passed);

    // Powers of 2
    check_exact("rsqrt(4)", fast_rsqrt(4), 32768, 0, &all_passed);
    check_exact("rsqrt(16)", fast_rsqrt(16), 16384, 0, &all_passed);
    check_exact("rsqrt(1024)", fast_rsqrt(1024), 2048, 0, &all_passed);
    check_exact("rsqrt(65536)", fast_rsqrt(65536), 256, 0, &all_passed);
    check_exact("rsqrt(1048576)", fast_rsqrt(1048576), 64, 0, &all_passed);

    // General Cases (10% tolerance)
    check_approx("rsqrt(100)", fast_rsqrt(100), 6554, 10, 0, &all_passed);
    check_approx("rsqrt(2)", fast_rsqrt(2), 46341, 10, 0, &all_passed);
    check_approx("rsqrt(10)", fast_rsqrt(10), 20723, 10, 0, &all_passed);
    check_approx("rsqrt(42)", fast_rsqrt(42), 10103, 10, 0, &all_passed);
    check_approx("rsqrt(12345)", fast_rsqrt(12345), 590, 10, 0, &all_passed);
    check_approx("rsqrt(1000000)", fast_rsqrt(1000000), 66, 10, 0, &all_passed);
    check_approx("rsqrt(2000000000)", fast_rsqrt(2000000000), 1, 10, 0, &all_passed);

    return all_passed;
}
```

Run it in the bare-metal environment and observe the result:

```bash
=== HW2 FastRsqrt Tests in Bare Metal ===

Running fast_rsqrt test suite...
Testing edge cases...
-> Calling rsqrt(0)...
[PASS] rsqrt(0) | Cycles: 45
-> Calling rsqrt(1)...
[PASS] rsqrt(1) | Cycles: 48
-> Calling rsqrt(0xFFFFFFFF)...
[PASS] rsqrt(0xFFFFFFFF) | Cycles: 3141
Testing powers of 2...
[PASS] rsqrt(4) | Cycles: 114
[PASS] rsqrt(16) | Cycles: 114
[PASS] rsqrt(1024) | Cycles: 108
[PASS] rsqrt(65536) | Cycles: 114
[PASS] rsqrt(1048576) | Cycles: 108
Testing general cases (10% tolerance)...
[PASS] rsqrt(100) (Got: 6553) | Cycles: 4225
[PASS] rsqrt(2) (Got: 46341) | Cycles: 117
[PASS] rsqrt(10) (Got: 20724) | Cycles: 4473
[PASS] rsqrt(42) (Got: 10112) | Cycles: 4295
[PASS] rsqrt(12345) (Got: 589) | Cycles: 4157
[PASS] rsqrt(1000000) (Got: 65) | Cycles: 3615
[PASS] rsqrt(2000000000) (Got: 1) | Cycles: 3414

q3-rsqrt Test Suite: PASSED
Total Cycles: 150174
Total Instructions: 150174

=== All Tests Completed ===
```

We can see that when the value **isn't** a power of two or one of the edge cases, the computation takes many more cycles.
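The two check helpers are not listed above; here is a minimal sketch of what they do (the argument order matches the calls in `run_q3_rsqrt`, but the parameter names are assumptions, and the real versions also print the `[PASS]/[FAIL] ... | Cycles: N` lines shown in the log):

```c
#include <stdint.h>

/* Hypothetical minimal helpers matching the calls in run_q3_rsqrt. */
static void check_exact(const char *name, uint32_t got, uint32_t expected,
                        int flags, int *all_passed) {
    if (got != expected)
        *all_passed = 0; /* real version also reports PASS/FAIL and cycles */
    (void)name; (void)flags;
}

static void check_approx(const char *name, uint32_t got, uint32_t expected,
                         uint32_t tol_percent, int flags, int *all_passed) {
    uint32_t diff = (got > expected) ? got - expected : expected - got;
    /* allow a deviation of tol_percent of the expected value */
    if ((uint64_t)diff * 100 > (uint64_t)expected * tol_percent)
        *all_passed = 0;
    (void)name; (void)flags;
}
```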
My test suite mainly calls `fast_rsqrt`, which itself relies on `clz` and `mul32`. If we want to reduce the cycle count, we need to implement those computations at a lower level.

### Rewrite `clz` in assembly

In my implementation, I replaced the large constant masks with `srli`, because loading a big constant on RISC-V usually takes two instructions (`lui` + `addi`), while a shift needs only one. For the final bit-31 check, I used `bltz` to test the sign bit, so there is no need for another `srli` and constant load.

```asm
.global clz
# -----------------------------------------------------------------------
# clz (Count Leading Zeros)
# Input:  a0
# Output: a0
# -----------------------------------------------------------------------
clz:
    # 1. Handle special case: input is 0
    beqz a0, is_zero       # If a0 == 0, jump to is_zero

    # 2. Initialize counter (t0)
    li t0, 0

    # --- Check upper 16 bits ---
    srli t1, a0, 16
    bnez t1, check_8       # If upper 16 bits are not zero, go check upper 8 bits
    addi t0, t0, 16        # Otherwise n += 16
    slli a0, a0, 16        # x <<= 16

check_8:
    # --- Check upper 8 bits ---
    srli t1, a0, 24
    bnez t1, check_4
    addi t0, t0, 8
    slli a0, a0, 8

check_4:
    # --- Check upper 4 bits ---
    srli t1, a0, 28
    bnez t1, check_2
    addi t0, t0, 4
    slli a0, a0, 4

check_2:
    # --- Check upper 2 bits ---
    srli t1, a0, 30
    bnez t1, check_1
    addi t0, t0, 2
    slli a0, a0, 2

check_1:
    # --- Check highest bit (Bit 31) ---
    bltz a0, done          # If a0 < 0 (Bit 31 == 1), done
    addi t0, t0, 1         # Otherwise n += 1

done:
    mv a0, t0              # Move result into a0
    ret

is_zero:
    li a0, 32              # Special return value when input is zero
    ret
```

I also tested a branchless version of clz, but it was slower than the branching version in every single case, so I ended up sticking with the branching one. I suspect my branchless implementation might be somewhat inefficient.

```asm
.global clz_brless
# -----------------------------------------------------------------------
# clz_branchless_binary
# Branchless CLZ implementation using a binary search pattern.
# Input:  a0
# Output: a0
# -----------------------------------------------------------------------
clz_brless:
    # Preserve original input for zero correction
    mv t5, a0
    # Initialize counter
    li t0, 0

    # Check upper 16 bits
    srli t1, a0, 16
    sltiu t2, t1, 1        # t2 = (t1 == 0)
    slli t3, t2, 4         # shift amount = 16 or 0
    add t0, t0, t3
    sll a0, a0, t3

    # Check upper 8 bits
    srli t1, a0, 24
    sltiu t2, t1, 1
    slli t3, t2, 3         # shift amount = 8 or 0
    add t0, t0, t3
    sll a0, a0, t3

    # Check upper 4 bits
    srli t1, a0, 28
    sltiu t2, t1, 1
    slli t3, t2, 2         # shift amount = 4 or 0
    add t0, t0, t3
    sll a0, a0, t3

    # Check upper 2 bits
    srli t1, a0, 30
    sltiu t2, t1, 1
    slli t3, t2, 1         # shift amount = 2 or 0
    add t0, t0, t3
    sll a0, a0, t3

    # Check MSB
    srli t1, a0, 31
    sltiu t2, t1, 1
    add t0, t0, t2

    # Zero-input adjustment (CLZ(0) = 32)
    seqz t4, t5
    add a0, t0, t4
    ret
```
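For reference, both assembly versions implement the same binary-search pattern, which looks like this in C (a sketch for comparison, not necessarily the project's original source):

```c
#include <stdint.h>

/* Binary-search count-leading-zeros: halve the search window at each
 * step, exactly the structure of the two assembly versions above. */
static int clz_ref(uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 28) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 30) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 31) == 0) { n += 1; }
    return n;
}
```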
### Rewrite `mul32` in assembly

Compared with the C code, my implementation has some advantages:
1. **Early Exit Mechanism**: Unlike the original code, which always runs 32 iterations, my code checks whether the remaining multiplier is zero and terminates the loop immediately when it is.
2. **Zero Skip Logic**: I also implemented a "zero skip" check inside the loop: if the current bit is zero, it bypasses the addition entirely and only performs the shift, saving unnecessary cycles.
3. **Optimized Shifting**: Instead of computing a large shift (`<< i`) from scratch each time, I use an efficient incremental shift by 1 in each step.
4. **Leaf Function Efficiency**: Finally, it is a leaf function that runs purely on registers; there is no stack usage or memory-access overhead at all.

```c
static uint64_t mul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++) {
        if (b & (1U << i))
            r += (uint64_t)a << i;
    }
    return r;
}
```

```asm
.global mul32
# -----------------------------------------------------------------------
# mul32: Unsigned 32x32 -> 64-bit multiplication
# Input:  a0 = a, a1 = b
# Output: a1:a0 = result (hi:lo)
# -----------------------------------------------------------------------
mul32:
    # Clear result accumulators
    # t0 = res_lo, t1 = res_hi
    li t0, 0
    li t1, 0
    # Extend 'a' to 64-bit. t2 holds the upper 32 bits.
    li t2, 0

mul32_loop:
    # Early exit if multiplier is empty
    beqz a1, mul32_done

    # Check LSB. Skip add if bit is 0.
    andi t3, a1, 1
    beqz t3, mul32_shift

    # --- Add 'a' to result (64-bit) ---
    add t0, t0, a0         # res_lo += a_lo
    sltu t3, t0, a0        # Carry bit? (1 if wrapped around, else 0)
    add t1, t1, t2         # res_hi += a_hi
    add t1, t1, t3         # res_hi += carry

mul32_shift:
    # --- Shift 'a' left by 1 (64-bit) ---
    # Need to bridge the MSB of a_lo to the LSB of a_hi
    srli t3, a0, 31        # Grab MSB from low part
    slli a0, a0, 1         # a_lo << 1
    slli t2, t2, 1         # a_hi << 1
    or t2, t2, t3          # Merge MSB into high part

    # Shift multiplier right
    srli a1, a1, 1
    j mul32_loop

mul32_done:
    # Set return values
    mv a0, t0
    mv a1, t1
    ret
```

After rewriting and testing **`clz`**, **`clz_brless`**, and **`mul32`**, we can look at the cycle counts again to see how they perform.

| Input | (1) original | (2) clz | (3) clz_brless | (4) mul32 | (5) clz + mul32 | (1) vs (5) |
|:------------------- |:---- |:---- |:---- |:---- |:---- |:----- |
| `rsqrt(0)`          | 45   | 45   | 45   | 45   | 45   | 0%    |
| `rsqrt(1)`          | 48   | 48   | 48   | 48   | 48   | 0%    |
| `rsqrt(0xFFFFFFFF)` | 3141 | 3126 | 3141 | 925  | 910  | 71%   |
| `rsqrt(4)`          | 114  | 85   | 93   | 114  | 85   | 25%   |
| `rsqrt(16)`         | 114  | 85   | 93   | 114  | 85   | 25%   |
| `rsqrt(1024)`       | 108  | 83   | 93   | 108  | 83   | 23%   |
| `rsqrt(65536)`      | 114  | 85   | 93   | 114  | 85   | 25%   |
| `rsqrt(1048576)`    | 108  | 83   | 93   | 108  | 83   | 23%   |
| `rsqrt(100)`        | 4225 | 4200 | 4210 | 1720 | 1695 | 60%   |
| `rsqrt(2)`          | 117  | 86   | 93   | 117  | 86   | 26.5% |
| `rsqrt(10)`         | 4473 | 4446 | 4455 | 1838 | 1811 | 60%   |
| `rsqrt(42)`         | 4295 | 4268 | 4277 | 1760 | 1733 | 60%   |
| `rsqrt(12345)`      | 4157 | 4134 | 4145 | 1538 | 1515 | 64%   |
| `rsqrt(1000000)`    | 3615 | 3592 | 3603 | 1296 | 1273 | 65%   |
| `rsqrt(2000000000)` | 3414 | 3397 | 3411 | 974  | 957  | 72%   |

The bottleneck here is `mul32`, the 32-bit software multiplication. Unless we hit an edge case or a value handled entirely by the lookup table, we see a cycle reduction of at least **60%**. On the other hand, `clz` doesn't help much performance-wise, since it is such a lightweight calculation, though it still offers a slight improvement.

### Compile with -Ofast and -Os Flags

Let's check how the compiler's output performs against what we wrote by hand.
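The per-input cycle numbers in these tables can be collected with a small harness around each call (a sketch; `cycles_for` is an illustrative name, and the actual test code may time things differently):

```c
#include <stdint.h>

extern uint64_t get_cycles(void);          /* from perfcounter.S */
extern uint32_t fast_rsqrt(uint32_t x);

/* Hypothetical helper: cycle count of a single fast_rsqrt call. */
static uint64_t cycles_for(uint32_t x) {
    uint64_t t0 = get_cycles();
    (void)fast_rsqrt(x);
    return get_cycles() - t0;
}
```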
#### Overview

| Input Case            | -O0 (Handwritten Asm) | -Ofast (Speed) | -Os (Size) |
|:--------------------- |:--------------------- |:-------------- |:---------- |
| **Code Size (Bytes)** | 9146                  | 6680           | **5492**   |
| `rsqrt(0)`            | 45                    | 8              | 8          |
| `rsqrt(1)`            | 48                    | 8              | 8          |
| `rsqrt(0xFFFFFFFF)`   | 910                   | 1636           | 1486       |
| `rsqrt(4)`            | 85                    | 52             | 57         |
| `rsqrt(16)`           | 85                    | 52             | 57         |
| `rsqrt(1024)`         | 83                    | 50             | 55         |
| `rsqrt(65536)`        | 85                    | 53             | 60         |
| `rsqrt(1048576)`      | 83                    | 51             | 58         |
| `rsqrt(100)`          | 1695                  | 1957           | 2425       |
| `rsqrt(2)`            | 86                    | 53             | 59         |
| `rsqrt(10)`           | 1811                  | 2055           | 2619       |
| `rsqrt(42)`           | 1733                  | 1981           | 2482       |
| `rsqrt(12345)`        | 1515                  | 1972           | 2353       |
| `rsqrt(1000000)`      | 1273                  | 1812           | 1915       |
| `rsqrt(2000000000)`   | 957                   | 1752           | 1753       |

---

#### Analysis & Observations

- **Code Size:**
    - Comparing `-Os` and `-Ofast`: `-Os` gives the smallest code size, and `-Ofast` is still reasonable. Our handwritten version is much bigger simply because we did not strip out the unused code.
- **Bottleneck Analysis:**
    - `-Ofast` is slightly less efficient than our handwritten version on the `mul32`-heavy inputs, but it handles the edge cases and table lookups better. `-Os` takes a larger hit on `mul32`, while on the edge cases it is roughly on par with `-Ofast`.