--- tags: CA2025 --- # Assignment2: Complete Applications contributed by [<wilson0828>](https://github.com/wilson0828) >Refer to [Assignment2](https://hackmd.io/@sysprog/2025-arch-homework2) ## Part 1 - Environment setup This project challenges to develop a highly practical system using RISC-V assembly language. We need to set up the experimental environment before moving on to the implementation. Follow [Lab2: RISC-V](https://hackmd.io/@sysprog/Sko2Ja5pel#Overview) to build rv32emu, which is a RISC-V RV32 emulator that faithfully implements the RISC-V instruction set architecture. After following the instructions in previous lab, we use Makefile to validate whether `tests/system/playground` is fully verified. ``` blz@localhost:~/riscv-none-elf-gcc/rv32emu/tests/system/playground$ make run ../../../build/rv32emu test.elf ``` This validates: 1. Emulator binary exists 2. `ENABLE_SYSTEM=1` is configured 3. `ENABLE_ELF_LOADER=1` is configured, meaning that we can load `.elf` file to execute directly. - output ``` === ChaCha20 Tests === Test 0: ChaCha20 (RISC-V Assembly) Test: ChaCha20 ChaCha20 RFC 7539: PASSED Cycles: 6797 Instructions: 6797 === BFloat16 Tests === Test 1: bf16_add Test: bf16_add 1.0 + 1.0 = 4000 PASSED Cycles: 432 Instructions: 432 Test 2: bf16_sub Test: bf16_sub 3.0 - 2.0 = 3f80 PASSED Cycles: 373 Instructions: 373 Test 3: bf16_mul Test: bf16_mul 2.0 * 3.0 = 40c0 PASSED Cycles: 464 Instructions: 464 Test 4: bf16_div Test: bf16_div 6.0 / 2.0 = 4040 PASSED Cycles: 624 Instructions: 624 Test 5: bf16_special_cases Test: bf16_special_cases bf16_iszero(0): PASSED bf16_isnan(NaN): PASSED bf16_isinf(Inf): PASSED Cycles: 239 Instructions: 239 === All Tests Completed === ``` We have completed the environment setup part - Note Just to clarify remember to `source $HOME/riscv-none-elf-gcc/setenv` everytime you open a new shell, so the correct xPack RISC-V toolchain is added to your PATH and Make will pick the right compiler/assembler. Refer to [Lab2](https://hackmd.io/@sysprog/Sko2Ja5pel) ## Part 2 - Run on bare-metal environment In this section, we need to run our complete program on in a bare-metal environment using rv32emu’s system emulation. Here, I choose to reuse the assembly of `bf16_sqrt`, `bf16_add` and `bf16_mul`, that we implemented in [Assignment1](https://hackmd.io/fBYPlyHETiGzC1sBdhYm0w?view#Part-2) <details> <summary>bf16_hw1</summary> ```asm .section .rodata .align 4 clz8_lut: .byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4 .byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3 .byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2 .byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2 .byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 .byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 .byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 .byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .text .align 2 clz8: andi a0, a0, 0xFF la t0, clz8_lut add t0, t0, a0 lbu a0, 0(t0) ret mul8x8_to16: andi a0, a0, 0xFF andi a1, a1, 0xFF mv t1, a0 mv t2, a1 li t0, 0 li t3, 8 mul8x8_to16_loop: andi t4, t2, 1 beqz t4, mul8x8_to16_skip add t0, t0, t1 mul8x8_to16_skip: slli t1, t1, 1 srli t2, t2, 1 addi t3, t3, -1 bnez t3, mul8x8_to16_loop mv a0, t0 ret .globl bf16_add bf16_add: addi sp, sp, -28 sw ra, 24(sp) sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) srli s0, a0, 15 andi s0, s0, 1 # s0 = sign_a srli s1, a1, 15 andi s1, s1, 1 # s1 = sign_b srli s2, a0, 7 andi s2, s2, 0xFF # s2 = exp_a srli s3, a1, 7 andi s3, s3, 0xFF # s3 = exp_b andi s4, a0, 0x7F # s4 = mant_a andi s5, a1, 0x7F # s5 = mant_b li t0, 0xFF bne s2, t0, bf16_add_chk bnez s4, bf16_add_ret_a bne s3, t0, bf16_add_ret_a bnez s5, bf16_add_ret_b bne s0, s1, bf16_add_ret_nan j bf16_add_ret_b bf16_add_chk: beq s3, t0, bf16_add_ret_b li t0, 0x7FFF and t1, a0, t0 beq t1, x0, bf16_add_ret_b and t1, a1, t0 beq t1, x0, bf16_add_ret_a beq s2, x0, bf16_add_a_den_done ori s4, s4, 0x80 bf16_add_a_den_done: beq s3, x0, bf16_add_b_den_done ori s5, s5, 0x80 bf16_add_b_den_done: sub t1, s2, s3 blt x0, t1, bf16_add_grt beq t1, x0, bf16_add_equ mv t2, s3 li t0, -8 blt t1, t0, bf16_add_ret_b sub t0, x0, t1 srl s4, s4, t0 j bf16_add_exp_dif bf16_add_grt: mv t2, s2 li t0, 8 blt t0, t1, bf16_add_ret_a srl s5, s5, t1 j bf16_add_exp_dif bf16_add_equ: mv t2, s2 bf16_add_exp_dif: bne s0, s1, bf16_add_diff_signs mv t3, s0 add t4, s4, s5 li t0, 0x100 and t1, t4, t0 beq t1, x0, bf16_add_pack srli t4, t4, 1 addi t2, t2, 1 li t0, 0xFF blt t2, t0, bf16_add_pack slli a0, t3, 15 li t5, 0x7F80 or a0, a0, t5 j bf16_add_ans bf16_add_diff_signs: blt s4, s5, bf16_add_gt_ma mv t3, s0 sub t4, s4, s5 j bf16_add_norm bf16_add_gt_ma: mv t3, s1 sub t4, s5, s4 bf16_add_norm: beq t4, x0, bf16_add_ret_zero mv a0, t4 jal ra, clz8 mv t0, a0 sll t4, t4, t0 sub t2, t2, t0 blt t2, x0, bf16_add_ret_zero beq t2, x0, bf16_add_ret_zero j bf16_add_pack bf16_add_ret_zero: li a0, 0x0000 j bf16_add_ans bf16_add_pack: slli a0, t3, 15 slli t1, t2, 7 or a0, a0, t1 andi t4, t4, 0x7F or a0, a0, t4 j bf16_add_ans bf16_add_ret_b: mv a0, a1 j bf16_add_ans bf16_add_ret_nan: li a0, 0x7FC0 j bf16_add_ans bf16_add_ret_a: j bf16_add_ans bf16_add_ans: lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) lw s4, 16(sp) lw s5, 20(sp) lw ra, 24(sp) addi sp, sp, 28 ret .globl bf16_mul bf16_mul: addi sp, sp, -28 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) sw ra, 24(sp) srli s0, a0, 15 andi s0, s0, 1 srli s1, a1, 15 andi s1, s1, 1 srli s2, a0, 7 andi s2, s2, 0xFF srli s3, a1, 7 andi s3, s3, 0xFF andi s4, a0, 0x7F andi s5, a1, 0x7F li t0, 0xff xor t1, s0, s1 bne s2, t0, bf16_mul_a_exp bne s4, x0, bf16_mul_ret_b bne s3, x0, bf16_mul_inf1 beq s5, x0, bf16_mul_ret_nan bf16_mul_inf1: slli t2, t1, 15 li t3, 0x7F80 or a0, t2, t3 j bf16_mul_ans bf16_mul_a_exp: bne s3, t0, bf16_mul_b_exp bne s5, x0, bf16_mul_ret_b bne s2, x0, bf16_mul_inf2 beq s4, x0, bf16_mul_ret_nan bf16_mul_inf2: slli t2, t1, 15 li t3, 0x7F80 or a0, t2, t3 j bf16_mul_ans bf16_mul_b_exp: bne s2, x0, bf16_mul_skip1 beq s4, x0, bf16_mul_zero_ret bf16_mul_skip1: bne s3, x0, bf16_mul_skip2 bne s4, x0, bf16_mul_skip2 bf16_mul_zero_ret: srli a0, t1, 15 j bf16_mul_ans bf16_mul_skip2: li t2, 0 bne s2, x0, bf16_mul_else_a mv a0, s4 jal ra, clz8 mv t0, a0 sll s4, s4, t0 sub t2, t2, t0 li s2, 1 bf16_mul_else_a: ori s4, s4, 0x80 bne s3, x0, bf16_mul_else_b mv a0, s5 jal ra, clz8 mv t0, a0 sll s5, s5, t0 sub t2, t2, t0 li s3, 1 bf16_mul_else_b: ori s5, s5, 0x80 mv a0, s4 mv a1, s5 jal ra, mul8x8_to16 mv t3, a0 xor t1, s0, s1 add t4, s2, s3 addi t4, t4, -127 add t4, t4, t2 li t5, 0x8000 and t0, t3, t5 beq t0, x0, bf16_mul_l2 srli t3, t3, 8 andi t3, t3, 0x7F addi t4, t4, 1 j bf16_mul_mant bf16_mul_l2: srli t3, t3, 7 andi t3, t3, 0x7F bf16_mul_mant: li t0, 0xFF blt t4, t0, bf16_mul_skip3 slli a0, t1, 15 li t0, 0x7F80 or a0, a0, t0 j bf16_mul_ans bf16_mul_skip3: blt x0, t4, bf16_mul_pack addi t0, x0, -6 blt t4, t0, bf16_mul_underflow li t0, 1 sub t0, t0, t4 srl t3, t3, t0 li t4, 0 j bf16_mul_pack bf16_mul_underflow: srli a0, t1, 15 j bf16_mul_ans bf16_mul_pack: andi t1, t1, 1 slli t1, t1, 15 andi t4, t4, 0xFF slli t4, t4, 7 andi t3, t3, 0x7F or a0, t1, t4 or a0, a0, t3 li t0, 0xFFFF and a0, a0, t0 j bf16_mul_ans bf16_mul_ret_b: mv a0, a1 j bf16_mul_ans bf16_mul_ret_nan: li a0, 0x7FC0 j bf16_mul_ans bf16_mul_ret_a: j bf16_mul_ans bf16_mul_ans: lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) lw s4, 16(sp) lw s5, 20(sp) lw ra, 24(sp) addi sp, sp, 28 ret isqrt16_pow4: li t0, 0 li t1, 16384 isqrt16_pow4_loop: beqz t1, isqrt16_pow4_done add t2, t0, t1 bgeu a0, t2, isqrt16_pow4_ge srli t0, t0, 1 srli t1, t1, 2 j isqrt16_pow4_loop isqrt16_pow4_ge: sub a0, a0, t2 srli t0, t0, 1 add t0, t0, t1 srli t1, t1, 2 j isqrt16_pow4_loop isqrt16_pow4_done: mv a0, t0 ret .globl bf16_sqrt bf16_sqrt: addi sp, sp, -32 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) sw s6, 24(sp) sw ra, 28(sp) srli s0, a0, 15 andi s0, s0, 1 # s0 = sign_a srli s1, a0, 7 andi s1, s1, 0xFF # s1 = exp_a andi s2, a0, 0x7F # s2 = mant_a li t0, 0xFF bne s1, t0, bf16_sqrt_a_exp beq s2, x0, bf16_sqrt_a_mant j bf16_sqrt_ret_a bf16_sqrt_a_mant: beq s0, x0, bf16_sqrt_a_sign li a0, 0x7FC0 j bf16_sqrt_ans bf16_sqrt_a_sign: j bf16_sqrt_ret_a bf16_sqrt_a_exp: bne s1, x0, bf16_sqrt_skip bne s2, x0, bf16_sqrt_skip li a0, 0x0000 j bf16_sqrt_ans bf16_sqrt_skip: beq s0, x0, bf16_sqrt_negative_skip li a0, 0x7FC0 j bf16_sqrt_ans bf16_sqrt_negative_skip: bne s1, x0, bf16_sqrt_denormals_skip li a0, 0x0000 j bf16_sqrt_ans bf16_sqrt_denormals_skip: addi t1, s1, -127 ori t2, s2, 0x80 andi t0, t1, 1 beq t0, x0, bf16_sqrt_else slli t2, t2, 1 addi t0, t1, -1 srai t0, t0, 1 addi t3, t0, 127 j bf16_sqrt_end_if bf16_sqrt_else: srai t0, t1, 1 addi t3, t0, 127 bf16_sqrt_end_if: mv s6, t3 slli a0, t2, 7 addi a0, a0, 127 jal isqrt16_pow4 mv s5, a0 mv t3, s6 j bf16_sqrt_l3 bf16_sqrt_l3: andi t5, s5, 0x7F li t0, 0xFF blt t3, t0, bf16_sqrt_no_overflow li a0, 0x7F80 j bf16_sqrt_ans bf16_sqrt_no_overflow: bgt t3, x0, bf16_sqrt_no_underflow li a0, 0 j bf16_sqrt_ans bf16_sqrt_no_underflow: andi t3, t3, 0xff slli t3, t3, 7 or a0, t3, t5 j bf16_sqrt_ans bf16_sqrt_ret_a: j bf16_sqrt_ans bf16_sqrt_ans: lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) lw s4, 16(sp) lw s5, 20(sp) lw s6, 24(sp) lw ra, 28(sp) addi sp, sp, 32 ret ``` </details> - Copy the directory ```bash= cd ~/riscv-none-elf-gcc/rv32emu cp -r tests/system/playground tests/system/bf16 cd tests/system/bf16 ``` - Modify `main.c` Since we are running in a bare-metal environment using rv32emu’s system emulation, which means we can only use some [system call](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) provided by rv32emu. In this case, I remove unneccsary part from original `main.c` and adapted it for our test setup. This allows us to test our own assembly of `bf16_sqrt`, `bf16_add` and `bf16_mul`. We replace the old C implementations with the bf16 routines in `bf16_hw1.o`, while still reusing the existing test functions in main.c to validate our bf16 add, mul, and sqrt. <details> <summary>main.c</summary> ```clike #include <stdbool.h> #include <stdint.h> #include <string.h> #define printstr(ptr, length) \ do { \ asm volatile( \ "add a7, x0, 0x40;" \ "add a0, x0, 0x1;" /* stdout */ \ "add a1, x0, %0;" \ "mv a2, %1;" /* length character */ \ "ecall;" \ : \ : "r"(ptr), "r"(length) \ : "a0", "a1", "a2", "a7"); \ } while (0) #define TEST_OUTPUT(msg, length) printstr(msg, length) #define TEST_LOGGER(msg) \ { \ char _msg[] = msg; \ TEST_OUTPUT(_msg, sizeof(_msg) - 1); \ } extern uint64_t get_cycles(void); extern uint64_t get_instret(void); /* Bare metal memcpy implementation */ void *memcpy(void *dest, const void *src, size_t n) { uint8_t *d = (uint8_t *) dest; const uint8_t *s = (const uint8_t *) src; while (n--) *d++ = *s++; return dest; } /* Software division for RV32I (no M extension) */ static unsigned long udiv(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long quotient = 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; quotient |= (1UL << i); } } return quotient; } static unsigned long umod(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; } } return remainder; } /* Software multiplication for RV32I (no M extension) */ static uint32_t umul(uint32_t a, uint32_t b) { uint32_t result = 0; while (b) { if (b & 1) result += a; a <<= 1; b >>= 1; } return result; } /* Provide __mulsi3 for GCC */ uint32_t __mulsi3(uint32_t a, uint32_t b) { return umul(a, b); } /* Simple integer to hex string conversion */ static void print_hex(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { int digit = val & 0xf; *p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10); p--; val >>= 4; } } p++; printstr(p, (buf + sizeof(buf) - p)); } /* Simple integer to decimal string conversion */ static void print_dec(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { *p = '0' + umod(val, 10); p--; val = udiv(val, 10); } } p++; printstr(p, (buf + sizeof(buf) - p)); } typedef struct { uint16_t bits; } bf16_t; uint32_t bf16_add(uint32_t a_bf16, uint32_t b_bf16); uint32_t bf16_mul(uint32_t a_bf16, uint32_t b_bf16); uint32_t bf16_sqrt(uint32_t a_bf16); static inline bf16_t bf16_add_w(bf16_t a, bf16_t b) { bf16_t r = { .bits = (uint16_t)bf16_add(a.bits, b.bits) }; return r; } static inline bf16_t bf16_mul_w(bf16_t a, bf16_t b) { bf16_t r = { .bits = (uint16_t)bf16_mul(a.bits, b.bits) }; return r; } static inline bf16_t bf16_sqrt_w(bf16_t a) { bf16_t r = { .bits = (uint16_t)bf16_sqrt(a.bits) }; return r; } static void test_bf16_add(void) { TEST_LOGGER("Test: bf16_add\n"); /* 1.0 + 1.0 = 2.0 */ bf16_t a = {.bits = 0x3F80}; /* 1.0 */ bf16_t b = {.bits = 0x3F80}; /* 1.0 */ bf16_t result = bf16_add_w(a, b); TEST_LOGGER(" 1.0 + 1.0 = "); print_hex(result.bits); /* Expected: 0x4000 (2.0) */ if (result.bits == 0x4000) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x4000)\n"); } } static void test_bf16_mul(void) { TEST_LOGGER("Test: bf16_mul\n"); /* 2.0 * 3.0 = 6.0 */ bf16_t a = {.bits = 0x4000}; /* 2.0 */ bf16_t b = {.bits = 0x4040}; /* 3.0 */ bf16_t result = bf16_mul_w(a, b); TEST_LOGGER(" 2.0 * 3.0 = "); print_hex(result.bits); /* Expected: 0x40C0 (6.0) */ if (result.bits == 0x40C0) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x40C0)\n"); } } static void test_bf16_sqrt(void) { TEST_LOGGER("Test: bf16_sqrt\n"); // sqrt(-1.0) = NaN bf16_t a = {.bits = 0xBF80}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(-1.0) = "); print_hex(result.bits); /* Expected: 0x7FC0 (NaN) */ if (result.bits == 0x7FC0) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x7FC0)\n"); } } int main(void) { uint64_t start_cycles, end_cycles, cycles_elapsed; uint64_t start_instret, end_instret, instret_elapsed; TEST_LOGGER("\n=== BFloat16 Tests ===\n\n"); TEST_LOGGER("Test 1: bf16_add\n"); start_cycles = get_cycles(); start_instret = get_instret(); test_bf16_add(); end_cycles = get_cycles(); end_instret = get_instret(); cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret; TEST_LOGGER(" Cycles: "); print_dec((unsigned long) cycles_elapsed); TEST_LOGGER(" Instructions: "); print_dec((unsigned long) instret_elapsed); TEST_LOGGER("\n"); end_cycles = get_cycles(); end_instret = get_instret(); cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret; TEST_LOGGER("Test 3: bf16_mul\n"); start_cycles = get_cycles(); start_instret = get_instret(); test_bf16_mul(); end_cycles = get_cycles(); end_instret = get_instret(); cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret; TEST_LOGGER(" Cycles: "); print_dec((unsigned long) cycles_elapsed); TEST_LOGGER(" Instructions: "); print_dec((unsigned long) instret_elapsed); TEST_LOGGER("\n"); TEST_LOGGER("Test 4: bf16_sqrt\n"); start_cycles = get_cycles(); start_instret = get_instret(); test_bf16_sqrt(); end_cycles = get_cycles(); end_instret = get_instret(); cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret; TEST_LOGGER(" Cycles: "); print_dec((unsigned long) cycles_elapsed); TEST_LOGGER(" Instructions: "); print_dec((unsigned long) instret_elapsed); TEST_LOGGER("\n"); TEST_LOGGER("\n=== All Tests Completed ===\n"); return 0; } ``` </details> Since we placed a `.globl` directive in front of each bf16 operations, these symbols are visible and callable from main.c. Also we need Makefile to 1. Compile / Assemble: Convert source files like main.c, start.S, and bf16_hw1.s/.S into `.o` object files. 2. Link: Use `linker.ld` to combine all `.o` files into an ELF executable file. 3. Excute 4. Disassemble - Makefile ```asm= include ../../../mk/toolchain.mk ARCH = -march=rv32izicsr LINKER_SCRIPT = linker.ld EMU ?= ../../../build/rv32emu AFLAGS = -g $(ARCH) CFLAGS = -g -march=rv32i_zicsr LDFLAGS = -T $(LINKER_SCRIPT) EXEC = test.elf CC = $(CROSS_COMPILE)gcc AS = $(CROSS_COMPILE)as LD = $(CROSS_COMPILE)ld OBJDUMP = $(CROSS_COMPILE)objdump OBJS = start.o main.o perfcounter.o chacha20_asm.o bf16_hw1.o .PHONY: all run dump clean all: $(EXEC) $(EXEC): $(OBJS) $(LINKER_SCRIPT) $(LD) $(LDFLAGS) -o $@ $(OBJS) %.o: %.S $(AS) $(AFLAGS) $< -o $@ %.o: %.s $(AS) $(AFLAGS) $< -o $@ %.o: %.c $(CC) $(CFLAGS) $< -o $@ -c run: $(EXEC) @test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1) @grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1) @grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1) $(EMU) $< dump: $(EXEC) $(OBJDUMP) -Ds $< | less clean: rm -f $(EXEC) $(OBJS) ``` We only add `bf16_hw1.o` to OBJS and convert `.s` file to `.o` file from original Makefile. `OBJS` is the list of object files that will be fed into the linker to produce the final `.elf` file. - command ```bash= make clean make make run ``` - Result ``` === BFloat16 Tests === Test 1: bf16_add Test: bf16_add 1.0 + 1.0 = 4000 PASSED Cycles: 299 Instructions: 299 Test 3: bf16_mul Test: bf16_mul 2.0 * 3.0 = 40c0 PASSED Cycles: 361 Instructions: 361 Test 4: bf16_sqrt Test: bf16_sqrt sqrt(-1.0) = 7fc0 PASSED Cycles: 260 Instructions: 260 === All Tests Completed === ``` ## Part 3 - two problem ### Problem A from [Quiz 2](https://hackmd.io/@sysprog/arch2025-quiz2) - Makefile ``` include ../../../mk/toolchain.mk ARCH = -march=rv32izicsr LINKER_SCRIPT = linker.ld EMU ?= ../../../build/rv32emu AFLAGS = -g $(ARCH) CFLAGS = -g $(ARCH) LDFLAGS = -T $(LINKER_SCRIPT) EXEC = test.elf CROSS_COMPILE = riscv-none-elf- CC = $(CROSS_COMPILE)gcc AS = $(CROSS_COMPILE)as OBJDUMP = $(CROSS_COMPILE)objdump OBJS = start.o main.o perfcounter.o hanoi.o .PHONY: all run dump clean all: $(EXEC) $(EXEC): $(OBJS) $(LINKER_SCRIPT) $(CC) $(CFLAGS) $(LDFLAGS) -nostartfiles -o $@ $(OBJS) %.o: %.S $(AS) $(AFLAGS) -c $< -o $@ %.o: %.c $(CC) $(CFLAGS) $< -o $@ -c run: $(EXEC) @test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1) @grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1) @grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1) $(EMU) $< dump: $(EXEC) $(OBJDUMP) -Ds $< | less clean: rm -f $(EXEC) $(OBJS) ``` First, let’s look at the original display_move function in `hanoi.S` and the `.data` section: ```c display_move: la x20, obdata add x5, x20, x18 # BLANK 23: Load byte from obfuscated data lbu x11, 0(x5) # BLANK 24: Decode constant (0x6F) li x6, 0x6F xor x11, x11, x6 # BLANK 25: Final offset adjustment addi x11, x11, -0x12 add x7, x20, x19 lbu x12, 0(x7) xor x12, x12, x6 addi x12, x12, -0x12 ... .data obdata: .byte 0x3c, 0x3b, 0x3a str1: .asciz "Move Disk " str2: .asciz " from " str3: .asciz " to " ``` If you decode it, you can see that obdata is actually an encrypted version of 'A', 'B', 'C': 0x3c ^ 0x6F = 0x53, 0x53 - 0x12 = 0x41 → 'A' 0x3b ^ 0x6F = 0x54, 0x54 - 0x12 = 0x42 → 'B' 0x3a ^ 0x6F = 0x55, 0x55 - 0x12 = 0x43 → 'C' In the Tower of Hanoi code `x9` holds the disk index, `x18` holds the current peg index and `x19` holds the target peg index. Then we moved the display logic into a C function `hanoi_move()` that computes peg letters with 'A' + index and uses `TEST_LOGGER` + `print_dec` + `printstr`. By doing these two steps, we can cleanly print all the disk moves in a bare-metal environment while keeping the assembly focused on the actual Hanoi algorithm. - add a C function hanoi_move() ```clike= void hanoi_move(unsigned disk, unsigned from, unsigned to) { TEST_LOGGER("Move Disk "); print_dec(disk); TEST_LOGGER(" from "); char peg_from = 'A' + (from & 3u); printstr(&peg_from, 1); TEST_LOGGER(" to "); char peg_to = 'A' + (to & 3u); printstr(&peg_to, 1); TEST_LOGGER("\n"); } ``` - modify display_move ```c display_move: addi a0, x9, 1 mv a1, x18 mv a2, x19 jal ra, hanoi_move # BLANK 26: Calculate storage offset slli x5, x9, 2 addi x5, x5, 20 add x5, x2, x5 ... .data obdata: .byte 0x3c, 0x3b, 0x3a str1: .asciz "Move Disk " str2: .asciz " from " str3: .asciz " to " ``` - result ![image](https://hackmd.io/_uploads/BJ7Yj10bZg.png) ### Problem C from [Quiz 3](https://hackmd.io/@sysprog/arch2025-quiz3) - Makefile ``` include ../../../mk/toolchain.mk ARCH = -march=rv32izicsr LINKER_SCRIPT = linker.ld EMU ?= ../../../build/rv32emu AFLAGS = -g $(ARCH) CFLAGS = -g $(ARCH) LDFLAGS = -T $(LINKER_SCRIPT) EXEC = test.elf CROSS_COMPILE = riscv-none-elf- CC = $(CROSS_COMPILE)gcc AS = $(CROSS_COMPILE)as OBJDUMP = $(CROSS_COMPILE)objdump OBJS = start.o main.o perfcounter.o fast_rsqrt.o .PHONY: all run dump clean all: $(EXEC) $(EXEC): $(OBJS) $(LINKER_SCRIPT) $(CC) $(CFLAGS) $(LDFLAGS) -nostartfiles -o $@ $(OBJS) -lm %.o: %.S $(AS) $(AFLAGS) -c $< -o $@ %.o: %.c $(CC) $(CFLAGS) $< -o $@ -c run: $(EXEC) @test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1) @grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1) @grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1) $(EMU) $< dump: $(EXEC) $(OBJDUMP) -Ds $< | less clean: rm -f $(EXEC) $(OBJS) ``` To evaluate our integer-based fast reciprocal square root (rsqrt) implementations Quiz 3, we first wrote a baseline `rsqrt_baseline(x)` as a reference. We use a Q16 fixed-point format, where $( y \approx \frac{1}{\sqrt{x}} \cdot 2^{16} )$. The baseline computes an exact integer square root with binary search, then uses 64-bit integer division to get a high-precision reciprocal and convert it to Q16. - baseline ```clike= static uint32_t isqrt_u32(uint32_t x) { uint32_t left = 0, right = 65535; while (left <= right) { uint32_t mid = (left + right) >> 1; uint64_t sq = (uint64_t)mid * mid; if (sq == x) return mid; if (sq < x) left = mid + 1; else right = mid - 1; } return right; } static uint32_t rsqrt_baseline(uint32_t x) { if (x == 0) return 0; uint32_t s = isqrt_u32(x); if (s == 0) return 0xFFFFFFFFu; uint64_t num = (1ull << 32); uint32_t inv_s = (uint32_t)(num / s); uint32_t y = (inv_s + (1u << 15)) >> 16; return y; } ``` With this baseline in place, we run `fast_rsqrt` on rv32emu’s bare-metal system, compare each result against the baseline (absolute and relative error), and report the max and average error. - result ![image](https://hackmd.io/_uploads/HJa0hzA-bl.png) - analyzing precision and performance (with 1000 test cases) | | fast_rsqrt | fast_rsqrt_1iter(LUT + once Newton) | fast_rsqrt_lut(only LUT) | | -------------- | ---------- |:-----------------------------------:| ------------------------ | | Cycle | 21925181 | 17446433 | 12630288 | | Instructions | 21925181 | 17446433 | 12630288 | | mean_rel_error | 0.96% | 1.11% | 2.17% | From the above table, we can tell it's a tradeoff between cycle count and error rate when choosing a LUT+Newton approach versus a fully LUT-free method, so it either takes more cycles to reach comparable accuracy or settles for a higher error rate. ## Part 4 - Disassemble In this seciotn, we disassemble the ELF binaries generated by the C compiler and compare them with our handwritten implementation to analyze the differences in the resulting assembly code. First, I reused the RV32I friendly bf16_sqrt assembly implementation from [Assignment 1](https://hackmd.io/@IBxdYttpQ5-JUEW6_KypVg/arch2025-homework1#Part-2) - rv32i_friendly_bf16_sqrt ```asm .data .align 4 .text isqrt16_pow4: li t0, 0 # res li t1, 16384 # bit = 1<<14 (2^16 = 65536 > 65535) isqrt_loop: beqz t1, isqrt_done add t2, t0, t1 # tmp = res + bit bgeu a0, t2, isqrt_ge srli t0, t0, 1 srli t1, t1, 2 j isqrt_loop isqrt_ge: sub a0, a0, t2 srli t0, t0, 1 add t0, t0, t1 srli t1, t1, 2 j isqrt_loop isqrt_done: mv a0, t0 ret bf16_sqrt: addi sp, sp, -32 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) sw s4, 16(sp) sw s5, 20(sp) sw s6, 24(sp) sw ra, 28(sp) srli s0, a0, 15 andi s0, s0, 1 # s0 = sign_a srli s1, a0, 7 andi s1, s1, 0xFF # s1 = exp_a andi s2, a0, 0x7F # s2 = mant_a li t0, 0xFF bne s1, t0, a_exp beq s2, x0, a_mant j ret_a a_mant: beq s0, x0, a_sign li a0, 0x7FC0 j ans a_sign: j ret_a a_exp: bne s1, x0, skip bne s2, x0, skip li a0, 0x0000 j ans skip: beq s0, x0, negative_skip li a0, 0x7FC0 j ans negative_skip: bne s1, x0, denormals_skip li a0, 0x0000 j ans denormals_skip: addi t1, s1, -127 # t1 = e ori t2, s2, 0x80 # t2 = m (implicit 1) andi t0, t1, 1 beq t0, x0, else slli t2, t2, 1 # m <<= 1 (odd exponent) addi t0, t1, -1 # t0 = e - 1 srai t0, t0, 1 # t0 = (e-1)>>1 addi t3, t0, 127 # t3 = new_exp j end_if else: srai t0, t1, 1 # t0 = e>>1 addi t3, t0, 127 # t3 = new_exp end_if: mv s6, t3 slli a0, t2, 7 # a0 = m<<7 addi a0, a0, 127 # a0 = (m<<7) + 127 jal isqrt16_pow4 # a0 = result mv s5, a0 # s5 = result mv t3, s6 j l3 l3: andi t5, s5, 0x7F # t5 = new_mant li t0, 0xFF blt t3, t0, no_overflow li a0, 0x7F80 j ans no_overflow: bgt t3, x0, no_underflow li a0, 0 j ans no_underflow: andi t3, t3, 0xff slli t3, t3, 7 or a0, t3, t5 j ans ret_a: j ans ans: lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) lw s4, 16(sp) lw s5, 20(sp) lw s6, 24(sp) lw ra, 28(sp) addi sp, sp, 32 ret ``` - For comparsion we use original c code from [quiz1](https://hackmd.io/@sysprog/arch2025-quiz1-sol) (remove `static inline`) ```c #include <stdbool.h> #include <stdint.h> typedef struct { uint16_t bits; } bf16_t; #define BF16_SIGN_MASK 0x8000U #define BF16_EXP_MASK 0x7F80U #define BF16_MANT_MASK 0x007FU #define BF16_EXP_BIAS 127 #define BF16_NAN() ((bf16_t) {.bits = 0x7FC0}) #define BF16_ZERO() ((bf16_t) {.bits = 0x0000}) bf16_t bf16_sqrt(bf16_t a) { uint16_t sign = (a.bits >> 15) & 1; int16_t exp = ((a.bits >> 7) & 0xFF); uint16_t mant = a.bits & 0x7F; /* Handle special cases */ if (exp == 0xFF) { if (mant) return a; /* NaN propagation */ if (sign) return BF16_NAN(); /* sqrt(-Inf) = NaN */ return a; /* sqrt(+Inf) = +Inf */ } /* sqrt(0) = 0 (handle both +0 and -0) */ if (!exp && !mant) return BF16_ZERO(); /* sqrt of negative number is NaN */ if (sign) return BF16_NAN(); /* Flush denormals to zero */ if (!exp) return BF16_ZERO(); /* Direct bit manipulation square root algorithm */ /* For sqrt: new_exp = (old_exp - bias) / 2 + bias */ int32_t e = exp - BF16_EXP_BIAS; int32_t new_exp; /* Get full mantissa with implicit 1 */ uint32_t m = 0x80 | mant; /* Range [128, 256) representing [1.0, 2.0) */ /* Adjust for odd exponents: sqrt(2^odd * m) = 2^((odd-1)/2) * sqrt(2*m) */ if (e & 1) { m <<= 1; /* Double mantissa for odd exponent */ new_exp = ((e - 1) >> 1) + BF16_EXP_BIAS; } else { new_exp = (e >> 1) + BF16_EXP_BIAS; } /* Now m is in range [128, 256) or [256, 512) if exponent was odd */ /* Binary search for integer square root */ /* We want result where result^2 = m * 128 (since 128 represents 1.0) */ uint32_t low = 90; /* Min sqrt (roughly sqrt(128)) */ uint32_t high = 256; /* Max sqrt (roughly sqrt(512)) */ uint32_t result = 128; /* Default */ /* Binary search for square root of m */ while (low <= high) { uint32_t mid = (low + high) >> 1; uint32_t sq = (mid * mid) / 128; /* Square and scale */ if (sq <= m) { result = mid; /* This could be our answer */ low = mid + 1; } else { high = mid - 1; } } /* result now contains sqrt(m) * sqrt(128) / sqrt(128) = sqrt(m) */ /* But we need to adjust the scale */ /* Since m is scaled where 128=1.0, result should also be scaled same way */ /* Normalize to ensure result is in [128, 256) */ if (result >= 256) { result >>= 1; new_exp++; } else if (result < 128) { while (result < 128 && new_exp > 1) { result <<= 1; new_exp--; } } /* Extract 7-bit mantissa (remove implicit 1) */ uint16_t new_mant = result & 0x7F; /* Check for overflow/underflow */ if (new_exp >= 0xFF) return (bf16_t) {.bits = 0x7F80}; /* +Inf */ if (new_exp <= 0) return BF16_ZERO(); return (bf16_t) {.bits = ((new_exp & 0xFF) << 7) | new_mant}; } ``` Additionally we reuse main.c from Part 2 and do a little change (remove unnecessary part like test file for other bf16 operations) - main.c ```c #include <stdbool.h> #include <stdint.h> #include <string.h> #define printstr(ptr, length) \ do { \ asm volatile( \ "add a7, x0, 0x40;" \ "add a0, x0, 0x1;" /* stdout */ \ "add a1, x0, %0;" \ "mv a2, %1;" /* length character */ \ "ecall;" \ : \ : "r"(ptr), "r"(length) \ : "a0", "a1", "a2", "a7"); \ } while (0) #define TEST_OUTPUT(msg, length) printstr(msg, length) #define TEST_LOGGER(msg) \ { \ char _msg[] = msg; \ TEST_OUTPUT(_msg, sizeof(_msg) - 1); \ } extern uint64_t get_cycles(void); extern uint64_t get_instret(void); /* Bare metal memcpy implementation */ void *memcpy(void *dest, const void *src, size_t n) { uint8_t *d = (uint8_t *) dest; const uint8_t *s = (const uint8_t *) src; while (n--) *d++ = *s++; return dest; } /* Software division for RV32I (no M extension) */ static unsigned long udiv(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long quotient = 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; quotient |= (1UL << i); } } return quotient; } static unsigned long umod(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; } } return remainder; } /* Software multiplication for RV32I (no M extension) */ static uint32_t umul(uint32_t a, uint32_t b) { uint32_t result = 0; while (b) { if (b & 1) result += a; a <<= 1; b >>= 1; } return result; } /* Provide __mulsi3 for GCC */ uint32_t __mulsi3(uint32_t a, uint32_t b) { return umul(a, b); } /* Simple integer to hex string conversion */ static void print_hex(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { int digit = val & 0xf; *p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10); p--; val >>= 4; } } p++; printstr(p, (buf + sizeof(buf) - p)); } /* Simple integer to decimal string conversion */ static void print_dec(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { *p = '0' + umod(val, 10); p--; val = udiv(val, 10); } } p++; printstr(p, (buf + sizeof(buf) - p)); } typedef struct { uint16_t bits; } bf16_t; uint32_t bf16_sqrt(uint32_t a_bf16); static inline bf16_t bf16_sqrt_w(bf16_t a) { bf16_t r = { .bits = (uint16_t)bf16_sqrt(a.bits) }; return r; } static void test_bf16_sqrt(void) { TEST_LOGGER("Test: bf16_sqrt\n"); // sqrt(-1.0) = NaN bf16_t a = {.bits = 0xBF80}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(-1.0) = "); print_hex(result.bits); /* Expected: 0x7FC0 (NaN) */ if (result.bits == 0x7FC0) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x7FC0)\n"); } } int main(void) { uint64_t start_cycles, end_cycles, cycles_elapsed; uint64_t start_instret, end_instret, instret_elapsed; TEST_LOGGER("Test: bf16_sqrt\n"); start_cycles = get_cycles(); start_instret = get_instret(); test_bf16_sqrt(); end_cycles = get_cycles(); end_instret = get_instret(); cycles_elapsed = end_cycles - start_cycles; instret_elapsed = end_instret - start_instret; TEST_LOGGER(" Cycles: "); print_dec((unsigned long) cycles_elapsed); TEST_LOGGER(" Instructions: "); print_dec((unsigned long) instret_elapsed); TEST_LOGGER("\n"); TEST_LOGGER("\n=== All Tests Completed ===\n"); return 0; } ``` For Makefile we mainly change three things 1. Add `OPT` to control GCC’s optimization level(-O0 / -Os / -Ofast) and `BF_IMPL` selects which bf16_sqrt implementation to use (asm or c) 2. Add the optimization flag into the C compile flags `CFLAGS = -g -march=rv32i_zicsr $(OPT)` 3. Use `ifeq` to choose which BF16 object file gets linked - Makefile ```asm include ../../../mk/toolchain.mk ARCH = -march=rv32izicsr LINKER_SCRIPT = linker.ld EMU ?= ../../../build/rv32emu OPT ?= -O0 # -Ofast / -Os / -O2 BF_IMPL ?= asm # asm / c AFLAGS = -g $(ARCH) CFLAGS = -g -march=rv32i_zicsr $(OPT) LDFLAGS = -T $(LINKER_SCRIPT) EXEC = test.elf CC = $(CROSS_COMPILE)gcc AS = $(CROSS_COMPILE)as LD = $(CROSS_COMPILE)ld OBJDUMP = $(CROSS_COMPILE)objdump ifeq ($(BF_IMPL),asm) BF_OBJ = bf16_sqrt.o else BF_OBJ = bf16_sqrt_base.o endif OBJS = start.o main.o perfcounter.o chacha20_asm.o $(BF_OBJ) .PHONY: all run dump clean all: $(EXEC) $(EXEC): $(OBJS) $(LINKER_SCRIPT) $(LD) $(LDFLAGS) -o $@ $(OBJS) %.o: %.S $(AS) $(AFLAGS) $< -o $@ %.o: %.s $(AS) $(AFLAGS) $< -o $@ %.o: %.c $(CC) $(CFLAGS) -c $< -o $@ run: $(EXEC) @test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1) @grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1) @grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1) $(EMU) $< dump: $(EXEC) $(OBJDUMP) -Ds $< | less clean: rm -f $(EXEC) $(OBJS) ``` - command ```bash= make clean make BF_IMPL=asm OPT=-O0 # or BF_IMPL=c OPT=-Os, etc. riscv-none-elf-objdump -Ds test.elf > asm_O0.dump ``` By following the steps above, we can obtain four disassembled versions of bf16_sqrt: the handwritten assembly implementation, and the compiler-generated implementations built with `-O0`, `-Os`, and `-Ofast`. First, assembler/compiler will turn Source code(asm/c) into object code. Then Linker will combine all object code into ELF (Executable and linkable) file, which is likely contain with code section `.text`, data section `.data/.bss`, symbol table and entry point. However, an ELF is still mostly raw machine code, which humans can’t read. That’s why we disassemble it to turn machine code back into assembly so we can inspect what the compiler produced, compare it with handwritten assembly, and observe differences under different optimization flags. <details> <summary>asm.dump</summary> ``` 00014154 <isqrt16_pow4>: 14154: 00000293 li t0,0 14158: 00004337 lui t1,0x4 0001415c <isqrt_loop>: 1415c: 02030663 beqz t1,14188 <isqrt_done> 14160: 006283b3 add t2,t0,t1 14164: 00757863 bgeu a0,t2,14174 <isqrt_ge> 14168: 0012d293 srli t0,t0,0x1 1416c: 00235313 srli t1,t1,0x2 14170: fedff06f j 1415c <isqrt_loop> 00014174 <isqrt_ge>: 14174: 40750533 sub a0,a0,t2 14178: 0012d293 srli t0,t0,0x1 1417c: 006282b3 add t0,t0,t1 14180: 00235313 srli t1,t1,0x2 14184: fd9ff06f j 1415c <isqrt_loop> 00014188 <isqrt_done>: 14188: 00028513 mv a0,t0 1418c: 00008067 ret 00014190 <bf16_sqrt>: 14190: fe010113 addi sp,sp,-32 14194: 00812023 sw s0,0(sp) 14198: 00912223 sw s1,4(sp) 1419c: 01212423 sw s2,8(sp) 141a0: 01312623 sw s3,12(sp) 141a4: 01412823 sw s4,16(sp) 141a8: 01512a23 sw s5,20(sp) 141ac: 01612c23 sw s6,24(sp) 141b0: 00112e23 sw ra,28(sp) 141b4: 00f55413 srli s0,a0,0xf 141b8: 00147413 andi s0,s0,1 141bc: 00755493 srli s1,a0,0x7 141c0: 0ff4f493 zext.b s1,s1 141c4: 07f57913 andi s2,a0,127 141c8: 0ff00293 li t0,255 141cc: 02549063 bne s1,t0,141ec <a_exp> 141d0: 00090463 beqz s2,141d8 <a_mant> 141d4: 0c00006f j 14294 <ret_a> 000141d8 <a_mant>: 141d8: 00040863 beqz s0,141e8 <a_sign> 141dc: 00008537 lui a0,0x8 141e0: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040> 141e4: 0b40006f j 14298 <ans> 000141e8 <a_sign>: 141e8: 0ac0006f j 14294 <ret_a> 000141ec <a_exp>: 141ec: 00049863 bnez s1,141fc <skip> 141f0: 00091663 bnez s2,141fc <skip> 141f4: 00000513 li a0,0 141f8: 0a00006f j 14298 <ans> 000141fc <skip>: 141fc: 00040863 beqz s0,1420c <negative_skip> 14200: 00008537 lui a0,0x8 14204: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040> 14208: 0900006f j 14298 <ans> 0001420c <negative_skip>: 1420c: 00049663 bnez s1,14218 <denormals_skip> 14210: 00000513 li a0,0 14214: 0840006f j 14298 <ans> 00014218 <denormals_skip>: 14218: f8148313 addi t1,s1,-127 1421c: 08096393 ori t2,s2,128 14220: 00137293 andi t0,t1,1 14224: 00028c63 beqz t0,1423c <else> 14228: 00139393 slli t2,t2,0x1 1422c: fff30293 addi t0,t1,-1 # 3fff <_start-0xc001> 14230: 4012d293 srai t0,t0,0x1 14234: 07f28e13 addi t3,t0,127 14238: 00c0006f j 14244 <end_if> 0001423c <else>: 1423c: 40135293 srai t0,t1,0x1 14240: 07f28e13 addi t3,t0,127 00014244 <end_if>: 14244: 000e0b13 mv s6,t3 14248: 00739513 slli a0,t2,0x7 1424c: 07f50513 addi a0,a0,127 14250: f05ff0ef jal 14154 <isqrt16_pow4> 14254: 00050a93 mv s5,a0 14258: 000b0e13 mv t3,s6 1425c: 0040006f j 14260 <l3> 00014260 <l3>: 14260: 07faff13 andi t5,s5,127 14264: 0ff00293 li t0,255 14268: 005e4863 blt t3,t0,14278 <no_overflow> 1426c: 00008537 lui a0,0x8 14270: f8050513 addi a0,a0,-128 # 7f80 <_start-0x8080> 14274: 0240006f j 14298 <ans> 00014278 <no_overflow>: 14278: 01c04663 bgtz t3,14284 <no_underflow> 1427c: 00000513 li a0,0 14280: 0180006f j 14298 <ans> 00014284 <no_underflow>: 14284: 0ffe7e13 zext.b t3,t3 14288: 007e1e13 slli t3,t3,0x7 1428c: 01ee6533 or a0,t3,t5 14290: 0080006f j 14298 <ans> 00014294 <ret_a>: 14294: 0040006f j 14298 <ans> 00014298 <ans>: 14298: 00012403 lw s0,0(sp) 1429c: 00412483 lw s1,4(sp) 142a0: 00812903 lw s2,8(sp) 142a4: 00c12983 lw s3,12(sp) 142a8: 01012a03 lw s4,16(sp) 142ac: 01412a83 lw s5,20(sp) 142b0: 01812b03 lw s6,24(sp) 142b4: 01c12083 lw ra,28(sp) 142b8: 02010113 addi sp,sp,32 142bc: 00008067 ret ``` </details> <details> <summary>c_O0.dump</summary> ``` 00014154 <bf16_sqrt>: 14154: fb010113 addi sp,sp,-80 14158: 04112623 sw ra,76(sp) 1415c: 04812423 sw s0,72(sp) 14160: 05010413 addi s0,sp,80 14164: faa41e23 sh a0,-68(s0) 14168: fbc45783 lhu a5,-68(s0) 1416c: 00f7d793 srli a5,a5,0xf 14170: fcf41d23 sh a5,-38(s0) 14174: fbc45783 lhu a5,-68(s0) 14178: 0077d793 srli a5,a5,0x7 1417c: 01079793 slli a5,a5,0x10 14180: 0107d793 srli a5,a5,0x10 14184: 01079793 slli a5,a5,0x10 14188: 4107d793 srai a5,a5,0x10 1418c: 0ff7f793 zext.b a5,a5 14190: fcf41c23 sh a5,-40(s0) 14194: fbc45783 lhu a5,-68(s0) 14198: 07f7f793 andi a5,a5,127 1419c: fcf41b23 sh a5,-42(s0) 141a0: fd841703 lh a4,-40(s0) 141a4: 0ff00793 li a5,255 141a8: 02f71863 bne a4,a5,141d8 <bf16_sqrt+0x84> 141ac: fd645783 lhu a5,-42(s0) 141b0: 00078663 beqz a5,141bc <bf16_sqrt+0x68> 141b4: fbc45783 lhu a5,-68(s0) 141b8: 2280006f j 143e0 <bf16_sqrt+0x28c> 141bc: fda45783 lhu a5,-38(s0) 141c0: 00078863 beqz a5,141d0 <bf16_sqrt+0x7c> 141c4: ffff87b7 lui a5,0xffff8 141c8: fc07c793 xori a5,a5,-64 141cc: 2140006f j 143e0 <bf16_sqrt+0x28c> 141d0: fbc45783 lhu a5,-68(s0) 141d4: 20c0006f j 143e0 <bf16_sqrt+0x28c> 141d8: fd841783 lh a5,-40(s0) 141dc: 00079a63 bnez a5,141f0 <bf16_sqrt+0x9c> 141e0: fd645783 lhu a5,-42(s0) 141e4: 00079663 bnez a5,141f0 <bf16_sqrt+0x9c> 141e8: 00000793 li a5,0 141ec: 1f40006f j 143e0 <bf16_sqrt+0x28c> 141f0: fda45783 lhu a5,-38(s0) 141f4: 00078863 beqz a5,14204 <bf16_sqrt+0xb0> 141f8: ffff87b7 lui a5,0xffff8 141fc: fc07c793 xori a5,a5,-64 14200: 1e00006f j 143e0 <bf16_sqrt+0x28c> 14204: fd841783 lh a5,-40(s0) 14208: 00079663 bnez a5,14214 <bf16_sqrt+0xc0> 1420c: 00000793 li a5,0 14210: 1d00006f j 143e0 <bf16_sqrt+0x28c> 14214: fd841783 lh a5,-40(s0) 14218: f8178793 addi a5,a5,-127 # ffff7f81 <__stack_top+0xfffe2ae1> 1421c: fcf42823 sw a5,-48(s0) 14220: fd645783 lhu a5,-42(s0) 14224: 0807e793 ori a5,a5,128 14228: 01079793 slli a5,a5,0x10 1422c: 0107d793 srli a5,a5,0x10 14230: fef42423 sw a5,-24(s0) 14234: fd042783 lw a5,-48(s0) 14238: 0017f793 andi a5,a5,1 1423c: 02078463 beqz a5,14264 <bf16_sqrt+0x110> 14240: fe842783 lw a5,-24(s0) 14244: 00179793 slli a5,a5,0x1 14248: fef42423 sw a5,-24(s0) 1424c: fd042783 lw a5,-48(s0) 14250: fff78793 addi a5,a5,-1 14254: 4017d793 srai a5,a5,0x1 14258: 07f78793 addi a5,a5,127 1425c: fef42623 sw a5,-20(s0) 14260: 0140006f j 14274 <bf16_sqrt+0x120> 14264: fd042783 lw a5,-48(s0) 14268: 4017d793 srai a5,a5,0x1 1426c: 07f78793 addi a5,a5,127 14270: fef42623 sw a5,-20(s0) 14274: 05a00793 li a5,90 14278: fef42223 sw a5,-28(s0) 1427c: 10000793 li a5,256 14280: fef42023 sw a5,-32(s0) 14284: 08000793 li a5,128 14288: fcf42e23 sw a5,-36(s0) 1428c: 0600006f j 142ec <bf16_sqrt+0x198> 14290: fe442703 lw a4,-28(s0) 14294: fe042783 lw a5,-32(s0) 14298: 00f707b3 add a5,a4,a5 1429c: 0017d793 srli a5,a5,0x1 142a0: fcf42423 sw a5,-56(s0) 142a4: fc842583 lw a1,-56(s0) 142a8: fc842503 lw a0,-56(s0) 142ac: fe9fb0ef jal 10294 <__mulsi3> 142b0: 00050793 mv a5,a0 142b4: 0077d793 srli a5,a5,0x7 142b8: fcf42223 sw a5,-60(s0) 142bc: fc442703 lw a4,-60(s0) 142c0: fe842783 lw a5,-24(s0) 142c4: 00e7ee63 bltu a5,a4,142e0 <bf16_sqrt+0x18c> 142c8: fc842783 lw a5,-56(s0) 142cc: fcf42e23 sw a5,-36(s0) 142d0: fc842783 lw a5,-56(s0) 142d4: 00178793 addi a5,a5,1 142d8: fef42223 sw a5,-28(s0) 142dc: 0100006f j 142ec <bf16_sqrt+0x198> 142e0: fc842783 lw a5,-56(s0) 142e4: fff78793 addi a5,a5,-1 142e8: fef42023 sw a5,-32(s0) 142ec: fe442703 lw a4,-28(s0) 142f0: fe042783 lw a5,-32(s0) 142f4: f8e7fee3 bgeu a5,a4,14290 <bf16_sqrt+0x13c> 142f8: fdc42703 lw a4,-36(s0) 142fc: 0ff00793 li a5,255 14300: 02e7f063 bgeu a5,a4,14320 <bf16_sqrt+0x1cc> 14304: fdc42783 lw a5,-36(s0) 14308: 0017d793 srli a5,a5,0x1 1430c: fcf42e23 sw a5,-36(s0) 14310: fec42783 lw a5,-20(s0) 14314: 00178793 addi a5,a5,1 14318: fef42623 sw a5,-20(s0) 1431c: 0440006f j 14360 <bf16_sqrt+0x20c> 14320: fdc42703 lw a4,-36(s0) 14324: 07f00793 li a5,127 14328: 02e7ec63 bltu a5,a4,14360 <bf16_sqrt+0x20c> 1432c: 01c0006f j 14348 <bf16_sqrt+0x1f4> 14330: fdc42783 lw a5,-36(s0) 14334: 00179793 slli a5,a5,0x1 14338: fcf42e23 sw a5,-36(s0) 1433c: fec42783 lw a5,-20(s0) 14340: fff78793 addi a5,a5,-1 14344: fef42623 sw a5,-20(s0) 14348: fdc42703 lw a4,-36(s0) 1434c: 07f00793 li a5,127 14350: 00e7e863 bltu a5,a4,14360 <bf16_sqrt+0x20c> 14354: fec42703 lw a4,-20(s0) 14358: 00100793 li a5,1 1435c: fce7cae3 blt a5,a4,14330 <bf16_sqrt+0x1dc> 14360: fdc42783 lw a5,-36(s0) 14364: 01079793 slli a5,a5,0x10 14368: 0107d793 srli a5,a5,0x10 1436c: 07f7f793 andi a5,a5,127 14370: fcf41723 sh a5,-50(s0) 14374: fec42703 lw a4,-20(s0) 14378: 0fe00793 li a5,254 1437c: 00e7d863 bge a5,a4,1438c <bf16_sqrt+0x238> 14380: ffff87b7 lui a5,0xffff8 14384: f807c793 xori a5,a5,-128 14388: 0580006f j 143e0 <bf16_sqrt+0x28c> 1438c: fec42783 lw a5,-20(s0) 14390: 00f04663 bgtz a5,1439c <bf16_sqrt+0x248> 14394: 00000793 li a5,0 14398: 0480006f j 143e0 <bf16_sqrt+0x28c> 1439c: fec42783 lw a5,-20(s0) 143a0: 01079793 slli a5,a5,0x10 143a4: 4107d793 srai a5,a5,0x10 143a8: 00779793 slli a5,a5,0x7 143ac: 01079713 slli a4,a5,0x10 143b0: 41075713 srai a4,a4,0x10 143b4: 000087b7 lui a5,0x8 143b8: f8078793 addi a5,a5,-128 # 7f80 <_start-0x8080> 143bc: 00f777b3 and a5,a4,a5 143c0: 01079713 slli a4,a5,0x10 143c4: 41075713 srai a4,a4,0x10 143c8: fce41783 lh a5,-50(s0) 143cc: 00f767b3 or a5,a4,a5 143d0: 01079793 slli a5,a5,0x10 143d4: 4107d793 srai a5,a5,0x10 143d8: 01079793 slli a5,a5,0x10 143dc: 0107d793 srli a5,a5,0x10 143e0: 00078513 mv a0,a5 143e4: 04c12083 lw ra,76(sp) 143e8: 04812403 lw s0,72(sp) 143ec: 05010113 addi sp,sp,80 143f0: 00008067 ret ``` </details> <details> <summary>c_Os.dump</summary> ``` 0001399c <bf16_sqrt>: 1399c: 01051513 slli a0,a0,0x10 139a0: 01055513 srli a0,a0,0x10 139a4: fe010113 addi sp,sp,-32 139a8: 00755793 srli a5,a0,0x7 139ac: 01212823 sw s2,16(sp) 139b0: 00112e23 sw ra,28(sp) 139b4: 00812c23 sw s0,24(sp) 139b8: 00912a23 sw s1,20(sp) 139bc: 01312623 sw s3,12(sp) 139c0: 01412423 sw s4,8(sp) 139c4: 01512223 sw s5,4(sp) 139c8: 0ff7f793 zext.b a5,a5 139cc: 0ff00693 li a3,255 139d0: 00f55713 srli a4,a0,0xf 139d4: 07f57913 andi s2,a0,127 139d8: 04d79063 bne a5,a3,13a18 <bf16_sqrt+0x7c> 139dc: 00091c63 bnez s2,139f4 <bf16_sqrt+0x58> 139e0: 00008537 lui a0,0x8 139e4: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040> 139e8: 00071663 bnez a4,139f4 <bf16_sqrt+0x58> 139ec: 00008537 lui a0,0x8 139f0: f8050513 addi a0,a0,-128 # 7f80 <_start-0x8080> 139f4: 01c12083 lw ra,28(sp) 139f8: 01812403 lw s0,24(sp) 139fc: 01412483 lw s1,20(sp) 13a00: 01012903 lw s2,16(sp) 13a04: 00c12983 lw s3,12(sp) 13a08: 00812a03 lw s4,8(sp) 13a0c: 00412a83 lw s5,4(sp) 13a10: 02010113 addi sp,sp,32 13a14: 00008067 ret 13a18: 04079a63 bnez a5,13a6c <bf16_sqrt+0xd0> 13a1c: 00000513 li a0,0 13a20: fc090ae3 beqz s2,139f4 <bf16_sqrt+0x58> 13a24: 00008537 lui a0,0x8 13a28: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040> 13a2c: 40e00733 neg a4,a4 13a30: 00e57533 and a0,a0,a4 13a34: fc1ff06f j 139f4 <bf16_sqrt+0x58> 13a38: 40145413 srai s0,s0,0x1 13a3c: 07f40413 addi s0,s0,127 13a40: 0580006f j 13a98 <bf16_sqrt+0xfc> 13a44: fff98a93 addi s5,s3,-1 13a48: 0800006f j 13ac8 <bf16_sqrt+0x12c> 13a4c: 07f00793 li a5,127 13a50: 0897e663 bltu a5,s1,13adc <bf16_sqrt+0x140> 13a54: 00100713 li a4,1 13a58: 00149493 slli s1,s1,0x1 13a5c: fff40413 addi s0,s0,-1 13a60: 0697ee63 bltu a5,s1,13adc <bf16_sqrt+0x140> 13a64: fee41ae3 bne s0,a4,13a58 <bf16_sqrt+0xbc> 13a68: 0740006f j 13adc <bf16_sqrt+0x140> 13a6c: 00008537 lui a0,0x8 13a70: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040> 13a74: f80710e3 bnez a4,139f4 <bf16_sqrt+0x58> 13a78: f8178413 addi s0,a5,-127 13a7c: 00147713 andi a4,s0,1 13a80: 08096913 ori s2,s2,128 13a84: fa070ae3 beqz a4,13a38 <bf16_sqrt+0x9c> 13a88: f8078793 addi a5,a5,-128 13a8c: 4017d793 srai a5,a5,0x1 13a90: 00191913 slli s2,s2,0x1 13a94: 07f78413 addi s0,a5,127 13a98: 08000493 li s1,128 13a9c: 10000a93 li s5,256 13aa0: 05a00a13 li s4,90 13aa4: 015a09b3 add s3,s4,s5 13aa8: 0019d993 srli s3,s3,0x1 13aac: 00098593 mv a1,s3 13ab0: 00098513 mv a0,s3 13ab4: e34fc0ef jal 100e8 <__mulsi3> 13ab8: 00755513 srli a0,a0,0x7 13abc: f8a964e3 bltu s2,a0,13a44 <bf16_sqrt+0xa8> 13ac0: 00198a13 addi s4,s3,1 13ac4: 00098493 mv s1,s3 13ac8: fd4afee3 bgeu s5,s4,13aa4 <bf16_sqrt+0x108> 13acc: 0ff00793 li a5,255 13ad0: f697fee3 bgeu a5,s1,13a4c <bf16_sqrt+0xb0> 13ad4: 0014d493 srli s1,s1,0x1 13ad8: 00140413 addi s0,s0,1 13adc: 00741513 slli a0,s0,0x7 13ae0: 07f4f493 andi s1,s1,127 13ae4: 00956533 or a0,a0,s1 13ae8: 01051513 slli a0,a0,0x10 13aec: 01055513 srli a0,a0,0x10 13af0: f05ff06f j 139f4 <bf16_sqrt+0x58> ``` </details> <details> <summary>c_Ofast.dump</summary> ``` 00013bc4 <main>: 13bc4: fc010113 addi sp,sp,-64 13bc8: 02812c23 sw s0,56(sp) 13bcc: 03312623 sw s3,44(sp) 13bd0: 02112e23 sw ra,60(sp) 13bd4: 02912a23 sw s1,52(sp) 13bd8: 03212823 sw s2,48(sp) 13bdc: 00010413 mv s0,sp 13be0: 01000993 li s3,16 13be4: 04000893 li a7,64 13be8: 00100513 li a0,1 13bec: 002005b3 add a1,zero,sp 13bf0: 00098613 mv a2,s3 13bf4: 00000073 ecall 13bf8: da0fc0ef jal 10198 <get_cycles> 13bfc: 00050913 mv s2,a0 13c00: dacfc0ef jal 101ac <get_instret> 13c04: 00050493 mv s1,a0 13c08: 04000893 li a7,64 13c0c: 00100513 li a0,1 13c10: 002005b3 add a1,zero,sp 13c14: 00098613 mv a2,s3 13c18: 00000073 ecall 13c1c: 0000c537 lui a0,0xc 13c20: f8050513 addi a0,a0,-128 # bf80 <_start-0x4080> 13c24: df9ff0ef jal 13a1c <bf16_sqrt> 13c28: 01051693 slli a3,a0,0x10 13c2c: 0106d693 srli a3,a3,0x10 13c30: 00f00793 li a5,15 13c34: 04000893 li a7,64 13c38: 00100513 li a0,1 13c3c: 002005b3 add a1,zero,sp 13c40: 00078613 mv a2,a5 13c44: 00000073 ecall 13c48: 01110793 addi a5,sp,17 13c4c: 00068c63 beqz a3,13c64 <main+0xa0> 13c50: 00068713 mv a4,a3 13c54: 01210793 addi a5,sp,18 13c58: 00475713 srli a4,a4,0x4 13c5c: fff78793 addi a5,a5,-1 13c60: fe071ce3 bnez a4,13c58 <main+0x94> 13c64: 00178793 addi a5,a5,1 13c68: 01410713 addi a4,sp,20 13c6c: 40f70733 sub a4,a4,a5 13c70: 04000893 li a7,64 13c74: 00100513 li a0,1 13c78: 00f005b3 add a1,zero,a5 13c7c: 00070613 mv a2,a4 13c80: 00000073 ecall 13c84: 000087b7 lui a5,0x8 13c88: fc078793 addi a5,a5,-64 # 7fc0 <_start-0x8040> 13c8c: 0af68e63 beq a3,a5,13d48 <main+0x184> 13c90: 01b00793 li a5,27 13c94: 04000893 li a7,64 13c98: 00100513 li a0,1 13c9c: 008005b3 add a1,zero,s0 13ca0: 00078613 mv a2,a5 13ca4: 00000073 ecall 13ca8: cf0fc0ef jal 10198 <get_cycles> 13cac: 00050993 mv s3,a0 13cb0: cfcfc0ef jal 101ac <get_instret> 13cb4: 409504b3 sub s1,a0,s1 13cb8: 00a00793 li a5,10 13cbc: 04000893 li a7,64 13cc0: 00100513 li a0,1 13cc4: 008005b3 add a1,zero,s0 13cc8: 00078613 mv a2,a5 13ccc: 00000073 ecall 13cd0: 41298533 sub a0,s3,s2 13cd4: b68fc0ef jal 1003c <print_dec> 13cd8: 01000793 li a5,16 13cdc: 04000893 li a7,64 13ce0: 00100513 li a0,1 13ce4: 008005b3 add a1,zero,s0 13ce8: 00078613 mv a2,a5 13cec: 00000073 ecall 13cf0: 00048513 mv a0,s1 13cf4: b48fc0ef jal 1003c <print_dec> 13cf8: 00100793 li a5,1 13cfc: 04000893 li a7,64 13d00: 00100513 li a0,1 13d04: 008005b3 add a1,zero,s0 13d08: 00078613 mv a2,a5 13d0c: 00000073 ecall 13d10: 01d00793 li a5,29 13d14: 04000893 li a7,64 13d18: 00100513 li a0,1 13d1c: 008005b3 add a1,zero,s0 13d20: 00078613 mv a2,a5 13d24: 00000073 ecall 13d28: 03c12083 lw ra,60(sp) 13d2c: 03812403 lw s0,56(sp) 13d30: 03412483 lw s1,52(sp) 13d34: 03012903 lw s2,48(sp) 13d38: 02c12983 lw s3,44(sp) 13d3c: 00000513 li a0,0 13d40: 04010113 addi sp,sp,64 13d44: 00008067 ret 13d48: 00900793 li a5,9 13d4c: 04000893 li a7,64 13d50: 00100513 li a0,1 13d54: 008005b3 add a1,zero,s0 13d58: 00078613 mv a2,a5 13d5c: 00000073 ecall 13d60: f49ff06f j 13ca8 <main+0xe4> ``` </details> ### Observation 1: Stack usage - compiler original version Stack frame is the largest of all version (e.g.,`addi sp,sp,-80`). And that's because of complier perfer to save lots of loacl variable to stack such as sign, exp, mant and so on. As the result, the generated code contains many unnecessary lw/sw instructions for loading and storing temporaries. Moreover, disassembly shows frequent lw/sw traffic to spill and reload values inside the binary-search loop, which produce extra memory accesses and increases instruction count. - compiler with flag `-Os`(size) version With `-Os`, this time compiler reduces stack usage(e.g.,`addi sp,sp,-32`) and tries to keep more live values in registers instead of spilling them to memory. - compiler with flag `-Ofast` (spead) version In this disassembly, `-Ofast` also drops the stack pointer by 32 right away, but its register saving strategy feels a bit more staggered. It starts by saving just a few registers (like ra and s4). It waits to push the rest (s0-s5) until it actually hits the binary search path. By this way in some serval cases like `NaN`, `Inf`, `zero` or `negative values`, the `-Ofast` can perform much better than `-Os`, since it don't need to save all registers from the start of the program. ### Observation 2: Binary Search part Since we are using bf16_sqrt of RV32I friendly from [Assignment1](https://hackmd.io/@IBxdYttpQ5-JUEW6_KypVg/arch2025-homework1), which don't require any multiplication at all. Instead we use other operations like add, shift, sub and compare to replace it. On the other hand, no matter which optimization flag the compiler uses, the C version still has to implement `mid * mid`, so it ends up calling `__mulsi3` (GCC's runtime helper) on RV32I (no M extension) and keeping the binary-search loop structure. ## Part 5 - Optimization In this section, we aim to improve the assembly code generated by GCC, and we use [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) and [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter) to collect performance statistics. - Performance metric - Cycle: the number of clock cycles elapsed during execution - Instructions: the number of instructions actually executed In terms of implementation, I choose the compiler generated code withuot any flag to optimize. Let's look at the `main.c` (from Part 4). In order to analyze the performance of program more directly. We simply extend the test function from single to 10 cases. ```clike= static void test_bf16_sqrt(void) { TEST_LOGGER("Test: bf16_sqrt\n"); // 1) sqrt(-1.0) = NaN (0x7FC0) { bf16_t a = {.bits = 0xBF80}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(-1.0) = "); print_hex(result.bits); if (result.bits == 0x7FC0) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x7FC0)\n"); } } // 2) sqrt(+0.0) = +0.0 { bf16_t a = {.bits = 0x0000}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(+0.0) = "); print_hex(result.bits); if (result.bits == 0x0000) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x0000)\n"); } } // 3) sqrt(-0.0) = +0.0 { bf16_t a = {.bits = 0x8000}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(-0.0) = "); print_hex(result.bits); if (result.bits == 0x0000) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x0000)\n"); } } // 4) sqrt(+Inf) = +Inf { bf16_t a = {.bits = 0x7F80}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(+Inf) = "); print_hex(result.bits); if (result.bits == 0x7F80) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x7F80)\n"); } } // 5) sqrt(-Inf) = NaN (0x7FC0) { bf16_t a = {.bits = 0xFF80}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(-Inf) = "); print_hex(result.bits); if (result.bits == 0x7FC0) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x7FC0)\n"); } } // 6) sqrt(NaN canonical 0x7FC0) = NaN (expect 0x7FC0) { bf16_t a = {.bits = 0x7FC0}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(NaN 0x7FC0) = "); print_hex(result.bits); if (result.bits == 0x7FC0) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x7FC0)\n"); } } // 7) sqrt(1.0) = 1.0 { bf16_t a = {.bits = 0x3F80}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(1.0) = "); print_hex(result.bits); if (result.bits == 0x3F80) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x3F80)\n"); } } // 8) sqrt(4.0) = 2.0 { bf16_t a = {.bits = 0x4080}; // 4.0 in bf16 bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(4.0) = "); print_hex(result.bits); if (result.bits == 0x4000) { // 2.0 in bf16 TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x4000)\n"); } } // 9) sqrt(0.25) = 0.5 { bf16_t a = {.bits = 0x3E80}; // 0.25 in bf16 bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(0.25) = "); print_hex(result.bits); if (result.bits == 0x3F00) { // 0.5 in bf16 TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x3F00)\n"); } } // 10) sqrt(2.25) = 1.5 { bf16_t a = {.bits = 0x4010}; bf16_t result = bf16_sqrt_w(a); TEST_LOGGER(" sqrt(2.25) = "); print_hex(result.bits); if (result.bits == 0x3FC0) { TEST_LOGGER(" PASSED\n"); } else { TEST_LOGGER(" FAILED (expected 0x3FC0)\n"); } } } ``` Before we go further, we have to transfer compiler generated c code to assembly first, since we need to improve it. ```bash= riscv-none-elf-gcc -S -march=rv32i_zicsr -mabi=ilp32 -O0 bf16_sqrt_base.c -o bf16_sqrt_gcc_O0.s ``` - Makefile ``` ... OBJS = start.o main.o perfcounter.o chacha20_asm.o bf16_sqrt_gcc_O0.o ... ``` - command ```bash= make clean make make run ``` - Result ```bash= Test: bf16_sqrt sqrt(-1.0) = 7fc0 PASSED sqrt(+0.0) = 0 PASSED sqrt(-0.0) = 0 PASSED sqrt(+Inf) = 7f80 PASSED sqrt(-Inf) = 7fc0 PASSED sqrt(NaN 0x7FC0) = 7fc0 PASSED sqrt(1.0) = 3f80 PASSED sqrt(4.0) = 4000 PASSED sqrt(0.25) = 3f00 PASSED sqrt(2.25) = 3fc0 PASSED Cycles: 6745 Instructions: 6745 === All Tests Completed === ``` Without any optimization: - Cycle: 6745 - Instruction: 6745 ### Improvement1: Peephole inside loop Like we mentiond before, complier (without any optimization flag) perfer to save lots of loacl variable to stack. And that give us a lot of room for improvement. Since every loop iteration we have to perform lots of unnecessary lw/sw, our strategy here is simply using sepcial registers instead of spill to memory for binary search loop. Because the loop contains a function call `__mulsi3`, t/a registers might get overwritten. To avoid saving/restoring them every iteration, we use s registers to hold the loop state. As we require sepcial registers for binary search loop, we have to allocate memory space for them first. - origin ``` bf16_sqrt: addi sp,sp,-80 sw ra,76(sp) sw s0,72(sp) addi s0,sp,80 ``` - after ``` bf16_sqrt: addi sp,sp,-96 sw ra,92(sp) sw s0,88(sp) sw s1,84(sp) sw s2,80(sp) sw s3,76(sp) sw s4,72(sp) addi s0,sp,96 ``` - origin ``` .L20: mv a0,a5 lw ra,76(sp) lw s0,72(sp) addi sp,sp,80 jr ra ``` - after ``` .L20: mv a0,a5 lw s4,72(sp) lw s3,76(sp) lw s2,80(sp) lw s1,84(sp) lw s0,88(sp) lw ra,92(sp) addi sp,sp,96 jr ra ``` Now, let’s rewrite the binary search loop to use registers only. - origin ```asm= .L10: li a5,90 sw a5,-28(s0) # low li a5,256 sw a5,-32(s0) # high li a5,128 sw a5,-36(s0) # best j .L11 .L13: lw a4,-28(s0) lw a5,-32(s0) add a5,a4,a5 srli a5,a5,1 sw a5,-56(s0) # mid lw a1,-56(s0) lw a0,-56(s0) call __mulsi3 mv a5,a0 srli a5,a5,7 sw a5,-60(s0) lw a4,-60(s0) lw a5,-24(s0) # mant bgtu a4,a5,.L12 lw a5,-56(s0) sw a5,-36(s0) lw a5,-56(s0) addi a5,a5,1 sw a5,-28(s0) j .L11 .L12: lw a5,-56(s0) addi a5,a5,-1 sw a5,-32(s0) .L11: lw a4,-28(s0) lw a5,-32(s0) bleu a4,a5,.L13 ``` - after ```asm= .L10: li s1,90 # low li s2,256 # high li s3,128 # best lw s4,-24(s0) # mant j .L11 .L13: add t0,s1,s2 srli t0,t0,1 # mid mv a0,t0 mv a1,t0 call __mulsi3 srli t1,a0,7 # mid2 = (mid*mid)>>7 bgtu t1,s4,.L12 mv s3,t0 # best = mid addi s1,t0,1 # low = mid+1 j .L11 .L12: addi s2,t0,-1 # high = mid-1 .L11: bleu s1,s2,.L13 sw s3,-36(s0) ``` - Result ``` Test: bf16_sqrt sqrt(-1.0) = 7fc0 PASSED sqrt(+0.0) = 0 PASSED sqrt(-0.0) = 0 PASSED sqrt(+Inf) = 7f80 PASSED sqrt(-Inf) = 7fc0 PASSED sqrt(NaN 0x7FC0) = 7fc0 PASSED sqrt(1.0) = 3f80 PASSED sqrt(4.0) = 4000 PASSED sqrt(0.25) = 3f00 PASSED sqrt(2.25) = 3fc0 PASSED Cycles: 6397 Instructions: 6397 === All Tests Completed === ``` ### Improvement2: Drop some function calls In order to calculate `mid*mid` for binary search loop, we have to call function call `__mulsi3`, which is quite heavy for our cycle. We observed that `high` and `low` are initialized to 256 and 90, respectively, so `mid` is always bounded within this small interval. As a result, `mid` only needs about 9 bits to represent (since the maximum value is 256), and computing `mid * mid` does not require a full 32-bit multiplication routine. Therefore we replaced function call `__mulsi3` with a shift and add multiplier (9 iterations). - origin ```asm= mv a0,t0 mv a1,t0 call __mulsi3 srli t1,a0,7 ``` - after ```asm= # t0 = mid mv t2, t0 # multiplicand (a) mv t3, t0 # multiplier (b) li t1, 0 # acc = 0 andi t4, t3, 1 sub t4, zero, t4 # t4 = 0 or -1 and t5, t2, t4 # t5 = (b&1)?a:0 add t1, t1, t5 slli t2, t2, 1 srli t3, t3, 1 andi t4, t3, 1 sub t4, zero, t4 and t5, t2, t4 add t1, t1, t5 slli t2, t2, 1 srli t3, t3, 1 andi t4, t3, 1 sub t4, zero, t4 and t5, t2, t4 add t1, t1, t5 slli t2, t2, 1 srli t3, t3, 1 andi t4, t3, 1 sub t4, zero, t4 and t5, t2, t4 add t1, t1, t5 slli t2, t2, 1 srli t3, t3, 1 andi t4, t3, 1 sub t4, zero, t4 and t5, t2, t4 add t1, t1, t5 slli t2, t2, 1 srli t3, t3, 1 andi t4, t3, 1 sub t4, zero, t4 and t5, t2, t4 add t1, t1, t5 slli t2, t2, 1 srli t3, t3, 1 andi t4, t3, 1 sub t4, zero, t4 and t5, t2, t4 add t1, t1, t5 slli t2, t2, 1 srli t3, t3, 1 andi t4, t3, 1 sub t4, zero, t4 and t5, t2, t4 add t1, t1, t5 slli t2, t2, 1 srli t3, t3, 1 andi t4, t3, 1 sub t4, zero, t4 and t5, t2, t4 add t1, t1, t5 srli t1, t1, 7 ``` To better understand the iterations, we analyze one loop iteration in detail ```asm= andi t4, t3, 1 sub t4, zero, t4 # t4 = 0 or -1 (0xFFFFFFFF) and t5, t2, t4 # t5 = a & 0 = 0 or t5 = a & 0xFFFFFFFF = a add t1, t1, t5 slli t2, t2, 1 srli t3, t3, 1 ``` > t3 = multiplier, t2 = multiplicand, t1 = temp Basically we check if LSB of multiplier is 1 every round. If it is, then we just add the current multiplicand to the accumulator. Then we left-shift the multiplicand and right-shift the multiplier for the next round. - result ``` bash= Test: bf16_sqrt sqrt(-1.0) = 7fc0 PASSED sqrt(+0.0) = 0 PASSED sqrt(-0.0) = 0 PASSED sqrt(+Inf) = 7f80 PASSED sqrt(-Inf) = 7fc0 PASSED sqrt(NaN 0x7FC0) = 7fc0 PASSED sqrt(1.0) = 3f80 PASSED sqrt(4.0) = 4000 PASSED sqrt(0.25) = 3f00 PASSED sqrt(2.25) = 3fc0 PASSED Cycles: 4093 Instructions: 4093 === All Tests Completed === ``` After above two improvement our assembly can reduce about 39.3% !