Assignment 2: Complete Applications

# Assignment 2: Complete Applications > ==Github: [dorobouharu133-CA25Assignment 2](https://github.com/dorobouharu133/CA25-Assignment-2.git)== >[!Note] AI Tools Usage >I used Google Gemini to refine the grammar of this article and to debug code within the compiler environment. ## Requirement 2 In Requirement 2, I chose Quiz 2 Problem B to run in the bare-metal environment. ### ==Makefile== > [tests/system/playground/main.c](https://gist.github.com/jserv/5f682ac880773cab69e3564f4f20d60a) Below is the "Makefile" I used. ```makefile include ../../../mk/toolchain.mk ARCH = -march=rv32izicsr LINKER_SCRIPT = linker.ld EMU ?= ../../../build/rv32emu AFLAGS = -g $(ARCH) CFLAGS = -g -march=rv32i_zicsr LDFLAGS = -T $(LINKER_SCRIPT) EXEC = test.elf CC = $(CROSS_COMPILE)gcc AS = $(CROSS_COMPILE)as LD = $(CROSS_COMPILE)ld OBJDUMP = $(CROSS_COMPILE)objdump OBJS = start.o uf8_decode.o main.o .PHONY: all run dump clean all: $(EXEC) $(EXEC): $(OBJS) $(LINKER_SCRIPT) $(LD) $(LDFLAGS) -o $@ $(OBJS) %.o: %.S $(AS) $(AFLAGS) $< -o $@ %.o: %.c $(CC) $(CFLAGS) $< -o $@ -c run: $(EXEC) @test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1) @grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1) @grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1) $(EMU) $< dump: $(EXEC) $(OBJDUMP) -Ds $< | less clean: rm -f $(EXEC) $(OBJS) ``` Switching build targets or adding modules is easy--just update the `OBJS` line. For instance, to include encoding features, simply add the corresponding `.o` files to the list. ``` OBJS = start.o clz.o uf8_decode.o uf8_encode.o main.o ``` ### ==Quiz2 Problem B== #### clz <details> <summary> CLZ Full Code </summary> ```asm= # clz .equ STDOUT, 1 .equ WRITE, 64 .equ EXIT, 93 .section .rodata .align 2 undefined: .ascii "Undefined Behavior!\n" .section .text .global clz # a0: x # t0: n # t1: c # t2: y clz: beq a0, x0, clz_equal_zero addi t0, x0, 32 # n = 32 addi t1, x0, 16 # c = 16 clz_while: srl t2, a0, t1 # y = x >> c beq t2, x0, clz_skip sub t0, t0, t1 # n -= c addi a0, t2, 0 # x = y clz_skip: srli t1, t1, 1 # c >>= 1 bnez t1, clz_while # while (c != 0) clz_ret: sub a0, t0, a0 # return n - x jr ra clz_equal_zero: # print undefined behavior li a0, STDOUT la a1, undefined li a2, 20 li a7, WRITE ecall li a0, -1 jr ra ``` </details> ``` > Different from what I have wrote in Homework 1, I add `undefined behavior` check to detect whether the input is all zero, if so, print "Undefined Behavior!" and return "-1". In the following cases, I use ==main.c== with ==clz.S== to run in rv32emu bare-metal environment. And so are ==uf8_decode.S== and ==uf8_encode.S==. ```c= //main.c #include <stdio.h> int clz(int x); int main() { int nums[] = {0x0000FFFF, 0x00F00000, 0x80000000, 0x00000001, 0}; int n = sizeof(nums) / sizeof(nums[0]); for(int i = 0; i < n; i++) { int x = nums[i]; int leading_zeros = clz(x); printf("x = 0x%08X, clz = %d\n", x, leading_zeros); } return 0; } ``` > Since `printf` does not work in the RV32emu bare-metal environment, I referred to [RV32emu syscall.md](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) and modified `main.S` to avoid using `printf`. <details> <summary>modified main.S</summary> ```asm= .equ STDOUT, 1 .equ WRITE, 64 .equ EXIT, 93 .section .rodata .align 2 pass_msg: .ascii "Tests Passed!\n" fail_msg: .ascii "Tests Failed!\n" undefined: .ascii "Undefined Behavior!\n" test_values_1: .word 0x0000FFFF test_values_2: .word 0x00F00000 test_values_3: .word 0x80000000 test_values_4: .word 0x00000001 test_values_5: .word 0x00000000 .section .text .align 2 .globl clz .globl main main: test_1: la t0, test_values_1 lw a0, 0(t0) jal clz addi t0, a0, 0 li t1, 16 bne t0, t1, test_failed test_2: la t0, test_values_2 lw a0, 0(t0) jal clz addi t0, a0, 0 li t1, 8 bne t0, t1, test_failed test_3: la t0, test_values_3 lw a0, 0(t0) jal clz addi t0, a0, 0 li t1, 0 bne t0, t1, test_failed test_4: la t0, test_values_4 lw a0, 0(t0) jal clz addi t0, a0, 0 li t1, 31 bne t0, t1, test_failed test_5: la t0, test_values_5 lw a0, 0(t0) jal clz addi t0, a0, 0 li t1, -1 bne t0, t1, test_failed test_passed: li a0, STDOUT la a1, pass_msg li a2, 14 li a7, WRITE ecall jal test_end test_failed: li a0, STDOUT la a1, fail_msg li a2, 14 li a7, WRITE ecall test_end: li a0, 0 li a7, EXIT ecall # a0: x # t0: n # t1: c # t2: y clz: beq a0, x0, clz_equal_zero addi t0, x0, 32 # n = 32 addi t1, x0, 16 # c = 16 clz_while: srl t2, a0, t1 # y = x >> c beq t2, x0, clz_skip sub t0, t0, t1 # n -= c addi a0, t2, 0 # x = y clz_skip: srli t1, t1, 1 # c >>= 1 bnez t1, clz_while # while (c != 0) clz_ret: sub a0, t0, a0 # return n - x jr ra clz_equal_zero: # print undefined behavior li a0, STDOUT la a1, undefined li a2, 20 li a7, WRITE ecall li a0, -1 jr ra ``` </details> ``` #### Test Cases 1. `0x0000FFFF` 2. `0x00F00000` 3. `0x80000000` 4. `0x00000001` 5. `0x00000000` Before Modification:(with `printf`) ``` x = 0x0000FFFF, clz = 16 x = 0x00F00000, clz = 8 x = 0x80000000, clz = 0 x = 0x00000001, clz = 31 Undefined Behavior! x = 0x00000000, clz = -1 ``` After modification:(without `printf`) :::success Undefined Behavior! Tests Passed! ::: #### uf8_decode <details> <summary>uf8_decode</summary> ```asm= # uf8_decode.S .section .text .global uf8_decode # a0: f1 # t0: mantissa # t1: exponent # t2: offset uf8_decode: andi t0, a0, 0x0f # mantissa = f1 & 0x0f srli t1, a0, 4 # exponent = f1 >> 4 addi t2, x0, 15 sub t2, t2, t1 addi t3, x0, 0x7F # replace pseudo instruction li t3, 0x7FFF slli t3, t3, 8 ori t3, t3, 0xFF srl t3, t3, t2 slli t3, t3, 4 sll a0, t0, t1 add a0, a0, t3 jr ra ``` </details> ``` ```c= //main.c #include <stdio.h> int uf8_decode(int f1); int main() { unsigned char test_vals[] = {0x00, 0x10, 0x2F, 0x8A, 0xFF}; int expected[] = { 0, // uf8_decode(0x00) 16, // uf8_decode(0x10) 108, // uf8_decode(0x2F) 6640, // uf8_decode(0x8A) 1015792 // uf8_decode(0xFF) }; int n = sizeof(test_vals) / sizeof(test_vals[0]); int pass = 0; printf("==== UF8 Decode Test ====\n"); for (int i = 0; i < n; i++) { int f1 = test_vals[i]; int out = uf8_decode(f1); printf("Test %d: f1=0x%02X -> decoded=0x%08X expected=0x%08X : %s\n", i, f1, out, expected[i], (out == expected[i] ? "PASS" : "FAIL")); if (out == expected[i]) pass++; } printf("\n==== %d/%d PASSED ====\n", pass, n); return 0; } ``` ``` gffg3@jiahui:~/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/homework1$ make riscv-none-elf-as -g -march=rv32izicsr start.S -o start.o start.S: Assembler messages: start.S: Warning: end of file not at end of a line; newline inserted riscv-none-elf-as -g -march=rv32izicsr uf8_decode.S -o uf8_decode.o riscv-none-elf-gcc -g -march=rv32i_zicsr main.c -o main.o -c riscv-none-elf-ld -T linker.ld -o test.elf start.o uf8_decode.o main.o riscv-none-elf-ld: warning: test.elf has a LOAD segment with RWX permissions riscv-none-elf-ld: main.o: in function `main': /home/gffg3/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/homework1/main.c:19:(.text+0x64): undefined reference to `puts' riscv-none-elf-ld: /home/gffg3/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/homework1/main.c:24:(.text+0xf0): undefined reference to `printf' riscv-none-elf-ld: /home/gffg3/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/homework1/main.c:31:(.text+0x148): undefined reference to `printf' make: *** [Makefile:25: test.elf] Error 1 ``` > `printf` can not be used in bare-metal environment #### print We need to modify `main.c` in order not to use `printf`, and the below code is useful for us to replace `printf`. > [tests/system/playground/main.c](https://gist.github.com/jserv/5f682ac880773cab69e3564f4f20d60a) ```c= #include <stdbool.h> #include <stdint.h> #include <string.h> #define printstr(ptr, length) \ do { \ asm volatile( \ "add a7, x0, 0x40;" \ "add a0, x0, 0x1;" /* stdout */ \ "add a1, x0, %0;" \ "mv a2, %1;" /* length character */ \ "ecall;" \ : \ : "r"(ptr), "r"(length) \ : "a0", "a1", "a2", "a7"); \ } while (0) #define TEST_OUTPUT(msg, length) printstr(msg, length) #define TEST_LOGGER(msg) \ { \ char _msg[] = msg; \ TEST_OUTPUT(_msg, sizeof(_msg) - 1); \ } extern uint64_t get_cycles(void); extern uint64_t get_instret(void); /* Bare metal memcpy implementation */ void *memcpy(void *dest, const void *src, size_t n) { uint8_t *d = (uint8_t *) dest; const uint8_t *s = (const uint8_t *) src; while (n--) *d++ = *s++; return dest; } /* Software division for RV32I (no M extension) */ static unsigned long udiv(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long quotient = 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; quotient |= (1UL << i); } } return quotient; } static unsigned long umod(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; } } return remainder; } /* Software multiplication for RV32I (no M extension) */ static uint32_t umul(uint32_t a, uint32_t b) { uint32_t result = 0; while (b) { if (b & 1) result += a; a <<= 1; b >>= 1; } return result; } /* Provide __mulsi3 for GCC */ uint32_t __mulsi3(uint32_t a, uint32_t b) { return umul(a, b); } /* Simple integer to hex string conversion */ static void print_hex(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { int digit = val & 0xf; *p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10); p--; val >>= 4; } } p++; printstr(p, (buf + sizeof(buf) - p)); } /* Simple integer to decimal string conversion */ static void print_dec(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { *p = '0' + umod(val, 10); p--; val = udiv(val, 10); } } p++; printstr(p, (buf + sizeof(buf) - p)); } unsigned uf8_decode(uint32_t x); int main() { char *pass_msg = "Test Passed!\n"; char *fail_msg = "Test Failed!\n"; uint8_t test_value [5] = {0x00, 0x10, 0x2F, 0x8A, 0xFF}; int32_t correct_value [5] = {0, 16, 108, 6640, 1015792}; bool all_passed = true; for(int i = 0; i < 5; i++) { int32_t decode_value = uf8_decode(test_value[i]); printstr("Case ", 5); if(decode_value == correct_value[i]) { print_dec(i+1); printstr(" Passed!\n", 10); } else { all_passed = false; printstr(" Failed!\n", 8); printstr("Expected: ", 11); print_dec(correct_value[i]); printstr("Actual: ", 8); print_dec(decode_value); } } if(all_passed) { printstr(pass_msg, 13); } else { printstr(fail_msg, 13); } return 0; } ``` #### Test Cases 1. `0x00`, expected value should be `0` 2. `0x10`, expected value should be `16` 3. `0x2F`, expected value should be `108` 4. `0x8A`, expected value should be `6640` 5. `0xFF`, expected value should be `1015792` ``` gffg3@jiahui:~/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/homework1$ make run ../../../build/rv32emu test.elf 03:27:27 INFO src/riscv.c:570: Log level: TRACE 03:27:27 INFO src/riscv.c:583: test.elf ELF loaded 03:27:27 INFO src/main.c:325: RISC-V emulator is created and ready to run Case 1 Passed! Case 2 Passed! Case 3 Passed! Case 4 Passed! Case 5 Passed! Test Passed! 03:27:27 INFO src/main.c:368: Peak memory usage: 192 KB (0 MB) 03:27:27 INFO src/main.c:370: RISC-V emulator is destroyed ``` > All test cases passed. However, there's a small bug that prints a newline right after the integer. I'll fixed it next. #### uf8_encode ```asm= # uf8_encode.S .section .text .global uf8_encode # a0: value # t0: lz # t1: msb # t2: exponent # t3: overflow # t4: temp # t5: mantissa # t6: next_overflow uf8_encode: addi t4, x0, 16 blt a0, t4, encode_return_lt16 # if(value < 16) then return value encode_ge16: addi sp, sp, -8 sw ra, 0(sp) sw s0, 4(sp) addi s0, a0, 0 # save value, help us not to access the memory jal clz addi t0, a0, 0 # lz addi t1, x0, 31 sub t1, t1, t0 # msb = 31 - lz addi t2, x0, 0 # exponent = 0 addi t3, x0, 0 # overflow = 0 addi t4, x0, 5 encode_if_1: blt t1, t4, encode_while_2 # if(msb < 5) goto encode_while_2 addi t2, t1, -4 # exponent = msb - 4 encode_if_2: addi t4, x0, 15 ble t2, t4, encode_for_1 # if (exponent <= 15) goto encode_for_1 li t2, 15 encode_for_1: li t4, 1 sll t4, t4, t2 addi t4, t4, -1 slli t4, t4, 4 mv t3, t4 encode_while_1: beq t2, x0, encode_while_2 bge s0, t3, encode_while_2 addi t3, t3, -16 srli t3, t3, 1 addi t2, t2, -1 jal x0, encode_while_1 encode_while_2: li t4, 15 bge t2, t4, encode_return slli t6, t3, 1 addi t6, t6, 16 encode_if_3: blt s0, t6, encode_return # if(value < nextoverflow) then break addi t3, t6, 0 addi t2, t2, 1 jal x0, encode_while_2 encode_return: sub t5, s0, t3 srl t5, t5, t2 # mantissa = (value - overflow) >> exponent slli a0, t2, 4 or a0, a0, t5 lw s0, 4(sp) lw ra, 0(sp) addi sp, sp, 8 jr ra # can be omitted encode_return_lt16: jr ra ``` The below ==main.c== is adapted from **quiz 1**. ```c= // main.c #include <stdbool.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> typedef uint8_t uf8; uint32_t uf8_decode(uf8 fl); uf8 uf8_encode(uint32_t value); static bool test(void) { int32_t previous_value = -1; bool passed = true; printf("==== UF8 Encode/Decode Test ====\n"); for (int i = 0; i < 256; i++) { uint8_t fl = i; int32_t value = uf8_decode(fl); uint8_t fl2 = uf8_encode(value); printf("Test %3d | UF8 = 0x%02X | Decode = %7u | Encode = 0x%02X \n ", i, fl, value, fl2); if (fl != fl2) { printf("0x%02X: decoded=%d but encodes back to 0x%02X\n", fl, value, fl2); passed = false; } if (value <= previous_value) { printf("0x%02X: decoded=%d <= previous %d\n", fl, value, previous_value); passed = false; } previous_value = value; } return passed; } int main(void) { if (test()) { printf("==== All tests PASSED ====\n"); return 0; } else { printf("==== SOME TESTS FAILED ====\n"); return 1; } } ``` Originally, `print_dec` and `print_hex` appended a newline character to every output, which broke the formatting of the test logs. ``` // Before modification ... Test 253 | UF8 = fd | Decode = 950256 | Encode =fd Test 254 | UF8 = fe | Decode = 983024 | Encode =fe Test 255 | UF8 = ff | Decode = 1015792 | Encode =ff ==== All tests PASSED ==== ``` I modified these functions to only output the numeric string itself. This allows for printing multiple values on a single line. ```c= static void print_dec(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf); if (val == 0) { *--p = '0'; } else { while (val > 0) { *--p = '0' + umod(val, 10); val = udiv(val, 10); } } printstr(p, (buf + sizeof(buf) - p)); } static void print_hex(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf); if (val == 0) { *--p = '0'; } else { while (val > 0) { int digit = val & 0xf; *--p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10); val >>= 4; } } printstr(p, (buf + sizeof(buf) - p)); } ``` ```c= #include <stdbool.h> #include <stdint.h> #include <string.h> #include <stdlib.h> #define printstr(ptr, length) \ do { \ asm volatile( \ "add a7, x0, 0x40;" \ "add a0, x0, 0x1;" /* stdout */ \ "add a1, x0, %0;" \ "mv a2, %1;" /* length character */ \ "ecall;" \ : \ : "r"(ptr), "r"(length) \ : "a0", "a1", "a2", "a7"); \ } while (0) #define TEST_OUTPUT(msg, length) printstr(msg, length) #define TEST_LOGGER(msg) \ { \ char _msg[] = msg; \ TEST_OUTPUT(_msg, sizeof(_msg) - 1); \ } extern uint64_t get_cycles(void); extern uint64_t get_instret(void); /* Bare metal memcpy implementation */ void *memcpy(void *dest, const void *src, size_t n) { uint8_t *d = (uint8_t *) dest; const uint8_t *s = (const uint8_t *) src; while (n--) *d++ = *s++; return dest; } /* Software division for RV32I (no M extension) */ static unsigned long udiv(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long quotient = 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; quotient |= (1UL << i); } } return quotient; } static unsigned long umod(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; } } return remainder; } /* Software multiplication for RV32I (no M extension) */ static uint32_t umul(uint32_t a, uint32_t b) { uint32_t result = 0; while (b) { if (b & 1) result += a; a <<= 1; b >>= 1; } return result; } /* Provide __mulsi3 for GCC */ uint32_t __mulsi3(uint32_t a, uint32_t b) { return umul(a, b); } /* Simple integer to hex string conversion */ static void print_hex(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf); if (val == 0) { *--p = '0'; } else { while (val > 0) { int digit = val & 0xf; *--p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10); val >>= 4; } } printstr(p, (buf + sizeof(buf) - p)); } /* Simple integer to decimal string conversion */ static void print_dec(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf); if (val == 0) { *--p = '0'; } else { while (val > 0) { *--p = '0' + umod(val, 10); val = udiv(val, 10); } } printstr(p, (buf + sizeof(buf) - p)); } typedef uint8_t uf8; uint32_t uf8_decode(uf8 f1); uf8 uf8_encode(uint32_t value); static bool test(void) { int32_t previous_value = -1; bool passed = true; printstr("==== UF8 Encode/Decode Test ====\n", 33); for (int i = 0; i < 256; i++) { uint8_t f1 = i; int32_t value = uf8_decode(f1); uint8_t f12 = uf8_encode(value); printstr("Test ", 5); print_dec(i); printstr(" | UF8 = ", 9); print_hex(f1); printstr(" | Decode = ", 12); print_dec(value); printstr(" | Encode = ", 11); print_hex(f12); printstr("\n", 1); if (f1 != f12) { printstr("0x", 2); print_hex(f1); printstr(": decoded=", 10); print_dec(value); printstr(" but encodes back to 0x", 23); print_hex(f12); printstr("\n", 1); passed = false; } if (value <= previous_value) { printstr("0x", 2); print_hex(f1); printstr(": decoded=", 10); print_dec(value); printstr(" <= previous ", 13); print_dec(previous_value); printstr("\n", 1); passed = false; } previous_value = value; } return passed; } int main(void) { if(test()) { printstr("==== All tests PASSED ====\n", 27); return 0; } else { printstr("==== Some tests FAILED ====\n", 27); return 1; } } ``` #### Test Cases The test function iterated through the range 0 to 255 to validate the encoding logic. All 256 cases were verified successfully. ``` ... Test 250 | UF8 = fa | Decode = 851952 | Encode =fa Test 251 | UF8 = fb | Decode = 884720 | Encode =fb Test 252 | UF8 = fc | Decode = 917488 | Encode =fc Test 253 | UF8 = fd | Decode = 950256 | Encode =fd Test 254 | UF8 = fe | Decode = 983024 | Encode =fe Test 255 | UF8 = ff | Decode = 1015792 | Encode =ff ==== All tests PASSED ==== ``` ## Requirement 3 ### ==Quiz 2 Problem A== The code for **Quiz 2 Problem A** cannot run in the rv32emu bare-metal environment because the original system calls are specifically for Ripes. I need to modify the code to use the [System Calls](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) instead. #### exit system call And the `exit` system call need to be modified. ```asm= addi x17, x0, 10 ecall ``` modified to ```asm= li a0, 0 ret ``` or ```asm= li a7, EXIT # 93 ecall ``` And the print system call also need to be modified. Because the [newlib](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) system call does't support print integer directly, we have to figure some approach to do it. This means that [newlib](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) can only print string and char through the `write` system call. ```asm= la x10, str1 addi x17, x0, 4 # print string ecall addi x10, x9, 1 addi x17, x0, 1 # print int ecall la x10, str2 addi x17, x0, 4 # print string ecall addi x10, x11, 0 addi x17, x0, 11 # print char ecall la x10, str3 addi x17, x0, 4 # print string ecall addi x10, x12, 0 addi x17, x0, 11 # print char ecall addi x10, x0, 10 addi x17, x0, 11 # print char ecall ``` Alothough the system call dose not support print integer directly, we still can do it indirectly. Get inspired from `printstr`, we can store the integer in memory by `sw`, and use the print string system call to print the integer as a string. Below is the modified code: ```asm= display_move: la x20, obdata add x5, x20, x18 # BLANK 23: Load byte from obfuscated data lbu x11, 0(x5) # BLANK 24: Decode constant (0x6F) li x6, 0x6F xor x11, x11, x6 # BLANK 25: Final offset adjustment addi x11, x11, -0x12 add x7, x20, x19 lbu x12, 0(x7) xor x12, x12, x6 addi x12, x12, -0x12 addi sp, sp, -8 sw x11, 0(sp) sw x12, 4(sp) li a0, 1 # STDOUT la a1, str1 # address of string li a2, 10 # length of string li a7, 64 # WRITE ecall addi x10, x9, 1 addi x10, x10, 0x30 # convert to ASCII la a1, buffer sb x10, 0(a1) li a0, 1 li a2, 1 li a7, 64 ecall li a0, 1 la a1, str2 li a2, 6 li a7, 64 ecall lw x12, 0(sp) la a1, buffer sb x12, 0(a1) li a0, 1 li a2, 1 li a7, 64 ecall li a0, 1 la a1, str3 li a2, 4 li a7, 64 ecall lw x12, 4(sp) la a1, buffer sb x12, 0(a1) li a0, 1 li a2, 1 li a7, 64 ecall addi x10, x0, 10 la a1, buffer sb x10, 0(a1) li a0, 1 li a2, 1 li a7, 64 ecall addi sp, sp, 8 # BLANK 26: Calculate storage offset slli x5, x9, 2 addi x5, x5, 20 add x5, x2, x5 # BLANK 27: Update disk position sw x19, 0(x5) # BLANK 28-29: Increment counter and loop addi x8, x8, 1 jal x0, game_loop finish_game: lw x8, 0(x2) lw x9, 4(x2) lw x18, 8(x2) lw x19, 12(x2) lw x20, 16(x2) addi x2, x2, 32 li a0, 0 li a7, 93 # EXIT ecall .data obdata: .byte 0x3c, 0x3b, 0x3a str1: .asciz "Move Disk " str2: .asciz " from " str3: .asciz " to " buffer: .space 4 ``` #### Test Cases ``` Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C ``` ::: success The test passed successfully. ::: ___ ### ==Quiz 3 Problem C== First, I encountered a Compiler Warning. ``` main.c:38:5: warning: unsigned conversion from 'int' to 'short unsigned int' changes value from '65536' to '0' [-Woverflow] 38 | 65536, 46341, 32768, 23170, 16384, | ^~~~~ ``` This is because the range of `unsigned short int` is $[0, 65535]$, and `rsqrt_table[0]` (which is `65536`) is larger than the upper bound, ```c= static const uint16_t rsqrt_table[32] = { 65536, 46341, 32768, 23170, 16384, 11585, 8192, 5793, 4096, 2896, 2048, 1448, 1024, 724, 512, 362, 256, 181, 128, 90, 64, 45, 32, 23, 16, 11, 8, 6, 4, 3, 2, 1 }; ``` Therefore I modify `65536` to `65535` in the below code. ```c= static const uint16_t rsqrt_table[32] = { 65535, 46341, 32768, 23170, 16384, 11585, 8192, 5793, 4096, 2896, 2048, 1448, 1024, 724, 512, 362, 256, 181, 128, 90, 64, 45, 32, 23, 16, 11, 8, 6, 4, 3, 2, 1 }; ``` I reused the print marco, and wrote the `main` function to test, and the test cases are from **Quiz 3**. ```c= #include <stdbool.h> #include <stdint.h> #include <string.h> #define printstr(ptr, length) \ do { \ asm volatile( \ "add a7, x0, 0x40;" \ "add a0, x0, 0x1;" /* stdout */ \ "add a1, x0, %0;" \ "mv a2, %1;" /* length character */ \ "ecall;" \ : \ : "r"(ptr), "r"(length) \ : "a0", "a1", "a2", "a7"); \ } while (0) #define TEST_OUTPUT(msg, length) printstr(msg, length) #define TEST_LOGGER(msg) \ { \ char _msg[] = msg; \ TEST_OUTPUT(_msg, sizeof(_msg) - 1); \ } extern uint64_t get_cycles(void); extern uint64_t get_instret(void); /* Bare metal memcpy implementation */ void *memcpy(void *dest, const void *src, size_t n) { uint8_t *d = (uint8_t *) dest; const uint8_t *s = (const uint8_t *) src; while (n--) *d++ = *s++; return dest; } /* Software division for RV32I (no M extension) */ static unsigned long udiv(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long quotient = 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; quotient |= (1UL << i); } } return quotient; } static unsigned long umod(unsigned long dividend, unsigned long divisor) { if (divisor == 0) return 0; unsigned long remainder = 0; for (int i = 31; i >= 0; i--) { remainder <<= 1; remainder |= (dividend >> i) & 1; if (remainder >= divisor) { remainder -= divisor; } } return remainder; } /* Software multiplication for RV32I (no M extension) */ static uint32_t umul(uint32_t a, uint32_t b) { uint32_t result = 0; while (b) { if (b & 1) result += a; a <<= 1; b >>= 1; } return result; } /* Provide __mulsi3 for GCC */ uint32_t __mulsi3(uint32_t a, uint32_t b) { return umul(a, b); } /* Simple integer to hex string conversion */ static void print_hex(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { int digit = val & 0xf; *p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10); p--; val >>= 4; } } p++; printstr(p, (buf + sizeof(buf) - p)); } /* Simple integer to decimal string conversion */ static void print_dec(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { *p = '0' + umod(val, 10); p--; val = udiv(val, 10); } } p++; printstr(p, (buf + sizeof(buf) - p)); } static uint64_t mul32(uint32_t a, uint32_t b); static void newton_step(uint32_t *rec_inv_sqrt, uint32_t x); uint32_t fast_rsqrt(uint32_t x); static int clz(uint32_t x); uint32_t fast_distance_3d(int32_t dx, int32_t dy, int32_t dz); #define REC_INV_SQRT_CACHE (16) static const uint32_t inv_sqrt_cache[REC_INV_SQRT_CACHE] = { ~0U, ~0U, 3037000500, 2479700525, 2147483647, 1920767767, 1753413056, 1623345051, 1518500250, 1431655765, 1358187914, 1294981364, 1239850263, 1191209601, 1147878294, 1108955788 }; /* * Newton iteration: new_y = y * (3/2 - x * y^2 / 2) * Here, y is a Q0.32 fixed-point number (< 1.0) */ static void newton_step(uint32_t *rec_inv_sqrt, uint32_t x) { uint32_t invsqrt, invsqrt2; uint64_t val; invsqrt = *rec_inv_sqrt; /* Dereference pointer */ invsqrt2 = ((uint64_t)invsqrt * invsqrt) >> 32; val = (3LL << 32) - ((uint64_t)x * invsqrt2); val >>= 2; /* Avoid overflow in following multiply */ val = (val * invsqrt) >> 31; /* Right shift by 31 = (32 - 2 + 1) */ *rec_inv_sqrt = (uint32_t)val; } static const uint16_t rsqrt_table[32] = { 65535, 46341, 32768, 23170, 16384, 11585, 8192, 5793, 4096, 2896, 2048, 1448, 1024, 724, 512, 362, 256, 181, 128, 90, 64, 45, 32, 23, 16, 11, 8, 6, 4, 3, 2, 1 }; uint32_t fast_rsqrt(uint32_t x) { if (x == 0) return 0xFFFFFFFF; if (x == 1) return 65536; int exp = 31 - clz(x); uint32_t y = rsqrt_table[exp]; if(x > (1u << exp)) { uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; uint32_t delta = y - y_next; uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) >> exp); y -= (uint32_t) ((delta * frac) >> 16); } for(int iter = 0; iter < 2; iter++) { uint32_t y2 = (uint32_t)mul32(y, y); uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); y = (uint32_t)(mul32(y, (3u << 16) - xy2) >> 17); } return y; } static uint64_t mul32(uint32_t a, uint32_t b) { uint64_t r = 0; for(int i = 0; i < 32; i++) { if(b & (1U << i)) r += (uint64_t)a << i; } return r; } static int clz(uint32_t x) { if(!x) return 32; int n = 0; if(!(x & 0xFFFF0000)) { n += 16; x <<= 16;} if(!(x & 0xFF000000)) { n += 8; x <<= 8;} if(!(x & 0xF0000000)) { n += 4; x <<= 4;} if(!(x & 0xC0000000)) { n += 2; x <<= 2;} if(!(x & 0x80000000)) { n += 1;} return n; } /* Calculate distance using rsqrt * * Faster than computing sqrt directly: * sqrt(x) = x * (1/sqrt(x)) */ uint32_t fast_distance_3d(int32_t dx, int32_t dy, int32_t dz) { uint64_t dist_sq = (uint64_t)dx * dx + (uint64_t)dy * dy + (uint64_t)dz * dz; if (dist_sq > 0xFFFFFFFF) dist_sq >>= 16; uint32_t inv_dist = fast_rsqrt((uint32_t)dist_sq); /* sqrt(x) = x / sqrt(x) = x * (1/sqrt(x)) */ uint64_t dist = ((uint64_t)dist_sq * inv_dist) >> 16; return (uint32_t)dist; } int main() { TEST_LOGGER("=== RV32 Bare-metal rsqrt test ===\n"); /* Test Case 1: 1 */ /* 1/sqrt(1) = 1.00 */ TEST_LOGGER("Input: 1 (Expect 65536)\n"); uint32_t res = fast_rsqrt(1); TEST_LOGGER("Result: "); print_dec(res); /* Test Case 2: 4 */ /* 1/sqrt(4) = 0.50 */ TEST_LOGGER("Input: 4 (Expect 32768)\n"); res = fast_rsqrt(4); TEST_LOGGER("Result: "); print_dec(res); /* Test Case 3: 16 */ /* 1/sqrt(16) = 0.25. Q16.16 format: 0.25 * 65536 = 16384 */ TEST_LOGGER("Input: 16 (Expect ~16384)\n"); res = fast_rsqrt(16); TEST_LOGGER("Result: "); print_dec(res); /* Test Case 4: 100 */ /* 1/sqrt(100) = 0.1. Q16.16 format: 0.1 * 65536 = 6553.6 -> 6553 */ TEST_LOGGER("Input: 100 (Expect ~6553)\n"); res = fast_rsqrt(100); TEST_LOGGER("Result: "); print_dec(res); /* Test Case 5: 4096 */ /* 1/sqrt(4096) = 1/64 = 0.015625. Q16.16: 1024 */ TEST_LOGGER("Input: 4096 (Expect ~1024)\n"); res = fast_rsqrt(4096); TEST_LOGGER("Result: "); print_dec(res); return 0; } ``` ::: success Ran successfully. ::: ``` gffg3@jiahui:~/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/quiz3_problem_c$ make run ../../../build/rv32emu test.elf 04:53:40 INFO src/riscv.c:570: Log level: TRACE 04:53:40 INFO src/riscv.c:583: test.elf ELF loaded 04:53:40 INFO src/main.c:325: RISC-V emulator is created and ready to run === RV32 Bare-metal rsqrt test === Input: 1 (Expect 65536) Result: 65536 Input: 4 (Expect 32768) Result: 32768 Input: 16 (Expect ~16384) Result: 16384 Input: 100 (Expect ~6553) Result: 6553 Input: 4096 (Expect ~1024) Result: 1024 04:53:40 INFO src/main.c:368: Peak memory usage: 192 KB (0 MB) 04:53:40 INFO src/main.c:370: RISC-V emulator is destroyed ``` #### ==Makefile== Below is the Makefile I used in this case. ```makefile include ../../../mk/toolchain.mk ARCH = -march=rv32izicsr LINKER_SCRIPT = linker.ld EMU ?= ../../../build/rv32emu AFLAGS = -g $(ARCH) CFLAGS = -g $(ARCH) LDFLAGS = -T $(LINKER_SCRIPT) EXEC = test.elf CC = $(CROSS_COMPILE)gcc AS = $(CROSS_COMPILE)as OBJDUMP = $(CROSS_COMPILE)objdump OBJS = start.o main.o .PHONY: all run dump clean all: $(EXEC) $(EXEC): $(OBJS) $(LINKER_SCRIPT) $(CC) $(CFLAGS) $(LDFLAGS) -nostartfiles -o $@ $(OBJS) %.o: %.S $(AS) $(AFLAGS) -c $< -o $@ %.o: %.c $(CC) $(CFLAGS) $< -o $@ -c run: $(EXEC) @test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1) @grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1) @grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1) $(EMU) $< dump: $(EXEC) $(OBJDUMP) -Ds $< | less clean: rm -f $(EXEC) $(OBJS) ``` Since the previous Makefile suffered from **undefined reference errors** when using `ld`, I modified the build process to **link using `gcc`**, which automatically resulves the dependency on `__muldi3`. ``` riscv-none-elf-as -g -march=rv32izicsr start.S -o start.o start.S: Assembler messages: start.S: Warning: end of file not at end of a line; newline inserted riscv-none-elf-gcc -g -march=rv32i_zicsr main.c -o main.o -c riscv-none-elf-ld -T linker.ld -o test.elf start.o main.o riscv-none-elf-ld: warning: test.elf has a LOAD segment with RWX permissions riscv-none-elf-ld: main.o: in function `newton_step': /home/gffg3/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/quiz3_problem_c/main.c:170:(.text+0x4f4): undefined reference to `__muldi3' riscv-none-elf-ld: /home/gffg3/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/quiz3_problem_c/main.c:171:(.text+0x538): undefined reference to `__muldi3' riscv-none-elf-ld: /home/gffg3/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/quiz3_problem_c/main.c:174:(.text+0x5b8): undefined reference to `__muldi3' riscv-none-elf-ld: main.o: in function `fast_distance_3d': /home/gffg3/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/quiz3_problem_c/main.c:241:(.text+0xaec): undefined reference to `__muldi3' riscv-none-elf-ld: /home/gffg3/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/quiz3_problem_c/main.c:242:(.text+0xb34): undefined reference to `__muldi3' riscv-none-elf-ld: main.o:/home/gffg3/riscv-none-elf-gcc/rv32emu0102/rv32emu/tests/system/quiz3_problem_c/main.c:243: more undefined references to `__muldi3' follow make: *** [Makefile:25: test.elf] Error 1 ``` ## Requirement 4 In Requirement 4, I will conduct a comparative analysis between my handwritten RISC-V assembly code (from Quiz 1 Problems B and C) and the assembly code generated by the compiler with optimization, specifically **-Ofast** and **-Os**. We can disassembly the `\*.elf` by running `make dump` as defined in Makefile. ``` make dump ``` Furthermore, using `make dump > main.asm` to save the generated code as a file. > GitHub: [my handwritten version](https://github.com/dorobouharu133/ca2025-quizzes/tree/main) ___ ### Quiz 1 Problem B > GitHub: [Quiz 1 Problem B](https://github.com/dorobouharu133/CA25-Assignment-2/tree/main/quiz1_problem_b) > GitHub: [quiz 1 problem b baseline](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/quiz1_problem_b/main.asm) I used ```make dump > main.asm``` to get the baseline, and below is the partial code section, including 3 functions (uf8_encode, uf8_decode and clz). ```asm= 0001047c <clz>: 1047c: fd010113 addi sp,sp,-48 10480: 02112623 sw ra,44(sp) 10484: 02812423 sw s0,40(sp) 10488: 03010413 addi s0,sp,48 1048c: fca42e23 sw a0,-36(s0) 10490: 02000793 li a5,32 10494: fef42623 sw a5,-20(s0) 10498: 01000793 li a5,16 1049c: fef42423 sw a5,-24(s0) 104a0: fe842783 lw a5,-24(s0) 104a4: fdc42703 lw a4,-36(s0) 104a8: 00f757b3 srl a5,a4,a5 104ac: fef42223 sw a5,-28(s0) 104b0: fe442783 lw a5,-28(s0) 104b4: 00078e63 beqz a5,104d0 <clz+0x54> 104b8: fec42703 lw a4,-20(s0) 104bc: fe842783 lw a5,-24(s0) 104c0: 40f707b3 sub a5,a4,a5 104c4: fef42623 sw a5,-20(s0) 104c8: fe442783 lw a5,-28(s0) 104cc: fcf42e23 sw a5,-36(s0) 104d0: fe842783 lw a5,-24(s0) 104d4: 4017d793 srai a5,a5,0x1 104d8: fef42423 sw a5,-24(s0) 104dc: fe842783 lw a5,-24(s0) 104e0: fc0790e3 bnez a5,104a0 <clz+0x24> 104e4: fec42703 lw a4,-20(s0) 104e8: fdc42783 lw a5,-36(s0) 104ec: 40f707b3 sub a5,a4,a5 104f0: 00078513 mv a0,a5 104f4: 02c12083 lw ra,44(sp) 104f8: 02812403 lw s0,40(sp) 104fc: 03010113 addi sp,sp,48 10500: 00008067 ret 00010504 <uf8_decode>: 10504: fd010113 addi sp,sp,-48 10508: 02112623 sw ra,44(sp) 1050c: 02812423 sw s0,40(sp) 10510: 03010413 addi s0,sp,48 10514: 00050793 mv a5,a0 10518: fcf40fa3 sb a5,-33(s0) 1051c: fdf44783 lbu a5,-33(s0) 10520: 00f7f793 andi a5,a5,15 10524: fef42623 sw a5,-20(s0) 10528: fdf44783 lbu a5,-33(s0) 1052c: 0047d793 srli a5,a5,0x4 10530: fef405a3 sb a5,-21(s0) 10534: feb44783 lbu a5,-21(s0) 10538: 00f00713 li a4,15 1053c: 40f707b3 sub a5,a4,a5 10540: 00008737 lui a4,0x8 10544: fff70713 addi a4,a4,-1 # 7fff <_start-0x8001> 10548: 40f757b3 sra a5,a4,a5 1054c: 00479793 slli a5,a5,0x4 10550: fef42223 sw a5,-28(s0) 10554: feb44783 lbu a5,-21(s0) 10558: fec42703 lw a4,-20(s0) 1055c: 00f71733 sll a4,a4,a5 10560: fe442783 lw a5,-28(s0) 10564: 00f707b3 add a5,a4,a5 10568: 00078513 mv a0,a5 1056c: 02c12083 lw ra,44(sp) 10570: 02812403 lw s0,40(sp) 10574: 03010113 addi sp,sp,48 10578: 00008067 ret 0001057c <uf8_encode>: 1057c: fc010113 addi sp,sp,-64 10580: 02112e23 sw ra,60(sp) 10584: 02812c23 sw s0,56(sp) 10588: 04010413 addi s0,sp,64 1058c: fca42623 sw a0,-52(s0) 10590: fcc42703 lw a4,-52(s0) 10594: 00f00793 li a5,15 10598: 00e7e863 bltu a5,a4,105a8 <uf8_encode+0x2c> 1059c: fcc42783 lw a5,-52(s0) 105a0: 0ff7f793 zext.b a5,a5 105a4: 1440006f j 106e8 <uf8_encode+0x16c> 105a8: fcc42503 lw a0,-52(s0) 105ac: ed1ff0ef jal 1047c <clz> 105b0: 00050793 mv a5,a0 105b4: fef42023 sw a5,-32(s0) 105b8: 01f00713 li a4,31 105bc: fe042783 lw a5,-32(s0) 105c0: 40f707b3 sub a5,a4,a5 105c4: fcf42e23 sw a5,-36(s0) 105c8: fe0407a3 sb zero,-17(s0) 105cc: fe042423 sw zero,-24(s0) 105d0: fdc42703 lw a4,-36(s0) 105d4: 00400793 li a5,4 105d8: 0ce7d063 bge a5,a4,10698 <uf8_encode+0x11c> 105dc: fdc42783 lw a5,-36(s0) 105e0: 0ff7f793 zext.b a5,a5 105e4: ffc78793 addi a5,a5,-4 105e8: fef407a3 sb a5,-17(s0) 105ec: fef44703 lbu a4,-17(s0) 105f0: 00f00793 li a5,15 105f4: 00e7f663 bgeu a5,a4,10600 <uf8_encode+0x84> 105f8: 00f00793 li a5,15 105fc: fef407a3 sb a5,-17(s0) 10600: fe0403a3 sb zero,-25(s0) 10604: 0200006f j 10624 <uf8_encode+0xa8> 10608: fe842783 lw a5,-24(s0) 1060c: 00179793 slli a5,a5,0x1 10610: 01078793 addi a5,a5,16 10614: fef42423 sw a5,-24(s0) 10618: fe744783 lbu a5,-25(s0) 1061c: 00178793 addi a5,a5,1 10620: fef403a3 sb a5,-25(s0) 10624: fe744703 lbu a4,-25(s0) 10628: fef44783 lbu a5,-17(s0) 1062c: fcf76ee3 bltu a4,a5,10608 <uf8_encode+0x8c> 10630: 0200006f j 10650 <uf8_encode+0xd4> 10634: fe842783 lw a5,-24(s0) 10638: ff078793 addi a5,a5,-16 1063c: 0017d793 srli a5,a5,0x1 10640: fef42423 sw a5,-24(s0) 10644: fef44783 lbu a5,-17(s0) 10648: fff78793 addi a5,a5,-1 1064c: fef407a3 sb a5,-17(s0) 10650: fef44783 lbu a5,-17(s0) 10654: 04078263 beqz a5,10698 <uf8_encode+0x11c> 10658: fcc42703 lw a4,-52(s0) 1065c: fe842783 lw a5,-24(s0) 10660: fcf76ae3 bltu a4,a5,10634 <uf8_encode+0xb8> 10664: 0340006f j 10698 <uf8_encode+0x11c> 10668: fe842783 lw a5,-24(s0) 1066c: 00179793 slli a5,a5,0x1 10670: 01078793 addi a5,a5,16 10674: fcf42c23 sw a5,-40(s0) 10678: fcc42703 lw a4,-52(s0) 1067c: fd842783 lw a5,-40(s0) 10680: 02f76463 bltu a4,a5,106a8 <uf8_encode+0x12c> 10684: fd842783 lw a5,-40(s0) 10688: fef42423 sw a5,-24(s0) 1068c: fef44783 lbu a5,-17(s0) 10690: 00178793 addi a5,a5,1 10694: fef407a3 sb a5,-17(s0) 10698: fef44703 lbu a4,-17(s0) 1069c: 00e00793 li a5,14 106a0: fce7f4e3 bgeu a5,a4,10668 <uf8_encode+0xec> 106a4: 0080006f j 106ac <uf8_encode+0x130> 106a8: 00000013 nop 106ac: fcc42703 lw a4,-52(s0) 106b0: fe842783 lw a5,-24(s0) 106b4: 40f70733 sub a4,a4,a5 106b8: fef44783 lbu a5,-17(s0) 106bc: 00f757b3 srl a5,a4,a5 106c0: fcf40ba3 sb a5,-41(s0) 106c4: fef40783 lb a5,-17(s0) 106c8: 00479793 slli a5,a5,0x4 106cc: 01879713 slli a4,a5,0x18 106d0: 41875713 srai a4,a4,0x18 106d4: fd740783 lb a5,-41(s0) 106d8: 00f767b3 or a5,a4,a5 106dc: 01879793 slli a5,a5,0x18 106e0: 4187d793 srai a5,a5,0x18 106e4: 0ff7f793 zext.b a5,a5 106e8: 00078513 mv a0,a5 106ec: 03c12083 lw ra,60(sp) 106f0: 03812403 lw s0,56(sp) 106f4: 04010113 addi sp,sp,64 106f8: 00008067 ret ``` I observed that the baseline version is quite different from the optmized implementations (my handwritten code, `-Ofast` and `-Os`). Specifically, the baseline has to **create a stack frame** to save its registers, while the others do not. #### ==-Ofast== (optimized for speed) By modifying `CFLAGS` in `Makefile`, we can get the `-Ofast` version. ```makefile= CFLAGS = -g -march=rv32i_zicsr -Ofast ``` > Github: [quiz1 problem b Ofast](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/quiz1_problem_b/main_Ofast.asm) I observed that there is no explicit `<clz>` label in the assembly code. Based on my analysis, the `-Ofast` version employs **loop unrolling** to eliminate **branches** and the loop. ```asm= 0001003c <uf8_encode.part.0>: 1003c: 01055793 srli a5,a0,0x10 10040: 00050593 mv a1,a0 10044: 01000713 li a4,16 10048: 00079663 bnez a5,10054 <uf8_encode.part.0+0x18> 1004c: 00050793 mv a5,a0 10050: 02000713 li a4,32 10054: 0087d693 srli a3,a5,0x8 10058: 00068663 beqz a3,10064 <uf8_encode.part.0+0x28> 1005c: ff870713 addi a4,a4,-8 10060: 00068793 mv a5,a3 10064: 0047d693 srli a3,a5,0x4 10068: 00068663 beqz a3,10074 <uf8_encode.part.0+0x38> 1006c: ffc70713 addi a4,a4,-4 10070: 00068793 mv a5,a3 10074: 0027d693 srli a3,a5,0x2 10078: 00068663 beqz a3,10084 <uf8_encode.part.0+0x48> 1007c: ffe70713 addi a4,a4,-2 10080: 00068793 mv a5,a3 10084: 0017d693 srli a3,a5,0x1 10088: 00068663 beqz a3,10094 <uf8_encode.part.0+0x58> 1008c: fff70713 addi a4,a4,-1 ``` `Ofast` unrolled the `for` loop and mixed it directly into `uf8_encode`. In my original approach, I used a loop to count leading zeros. Howevere, the compiler unrolled this logic, directly using the immediates **0x10, 0x8, 0x4, 0x2 and 0x1** in the assembly. Although this sturcture looks similar to **branchless implementation**, it uses `bnez` and `beqz` instructions for early termination. ```asm= srli a5,a0,0x10 # Check the upper 16 bits mv a1,a0 # save li a4,16 # Load the shift amount bnez a5,10054 <uf8_encode.part.0+0x18> # branch if upper 16 bits are zero ``` For example, the code snippet above performs the same **binary search** logic as the **branchless CLZ**. It first checks whether the upper 16 bits are all zeros. If they are, it proceeds to check the lower 16 bits; otherwise, it focuses on the upper half and repeats the check for the next 8 bits. ```asm= 000106c0 <uf8_decode>: 106c0: 00455693 srli a3,a0,0x4 106c4: 00f00713 li a4,15 106c8: 000087b7 lui a5,0x8 106cc: 40d70733 sub a4,a4,a3 106d0: fff78793 addi a5,a5,-1 # 7fff <_start-0x8001> 106d4: 40e7d7b3 sra a5,a5,a4 106d8: 00f57513 andi a0,a0,15 106dc: 00d51533 sll a0,a0,a3 106e0: 00479793 slli a5,a5,0x4 106e4: 00a78533 add a0,a5,a0 106e8: 00008067 ret ``` I observed that the way to generate `0x7FFF` is different. The `-Ofast` version uses just 2 instructions: ```asm= # -Ofast lui a5, 0x8 addi a5,a5,-1 ``` whereas my code requires 3 instructions: ```asm= # Mine addi t3, x0, 0x7F slli t3, t3, 8 ori t3, t3, 0xFF ``` Although the difference in instruction count is slight, it subtly changes the way I think about handling hexadecimal constants. ___ > Assignment 1: [uf8_encode improvement](https://hackmd.io/H-fawT_uSA6Gn8skMknOUw?view) Let's talk about `uf8_encode`, I focused on optimizing the **overflow calculation**. By analyzing the original `for` loop, I discovered a consistent bit pattern that allowed me to replace this loop with bitwise operations, effectively eliminating branch instructions. ```c= /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; /* Calculate overflow for estimated exponent */ for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16; /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } } ``` By unrolling the loop, I observed the binary representation of `overflow`: ``` iteration 1(e = 1): 0001 0000 (16) iteration 2(e = 2): 0011 0000 (48) iteration 3(e = 3): 0111 0000 (112) ... ``` I found that `overflow` always consists of $e$ **consecutive 1s followed by 4 zeros** (where $e$ is the exponent). Mathematically, this loop calculates the sum of a Geometric Series: $$ \text{Overflow} = \sum_{i=0}^{e-1} 16 \times 2^i = 16 \times (2^e - 1) $$ In binary, ($2^{e}-1$) creates a mask of $e$ consecutive 1s (e.g., $2^{3}-1=7=111_2$). And multiplying by 16 is equivalent to shift left by 4 bits. Overall, I replaced the $O(N)$ loop with just 4 constant-time instructions: ```asm= # t0: exponent # t1: overflow addi t1, x0, 1 sll t1, t1, t0 # get 2^exponent addi t1, t1, -1 # get leading 1's slli t1, t1. 4 # get overflow ``` Compared with `-Ofast` version, I found that the compiler takes a "brute-force" method to get `overflow`. Since the range of `exponent` is $[0, 15]$, `-Ofast` unrolls this logic into a dicision tree to reach early termination. Below are the relevant code snippets illustrating how the `-Ofast` version implemented. ```asm= 1020c: 00f00793 li a5,15 10210: 0ff77613 zext.b a2,a4 10214: 1ee7ec63 bltu a5,a4,1040c <uf8_encode.part.0+0x3d0> 10218: 00100793 li a5,1 1021c: 2cf70463 beq a4,a5,104e4 <uf8_encode.part.0+0x4a8> 10220: 00200793 li a5,2 10224: 20f70063 beq a4,a5,10424 <uf8_encode.part.0+0x3e8> 10228: 00300793 li a5,3 1022c: 2ef70a63 beq a4,a5,10520 <uf8_encode.part.0+0x4e4> 10230: 00400793 li a5,4 10234: 2ef70e63 beq a4,a5,10530 <uf8_encode.part.0+0x4f4> 10238: 00500793 li a5,5 1023c: 2ef70e63 beq a4,a5,10538 <uf8_encode.part.0+0x4fc> 10240: 00600793 li a5,6 10244: 30f70463 beq a4,a5,1054c <uf8_encode.part.0+0x510> 10248: 00700793 li a5,7 1024c: 2cf70063 beq a4,a5,1050c <uf8_encode.part.0+0x4d0> 10250: 00800793 li a5,8 10254: 30f70063 beq a4,a5,10554 <uf8_encode.part.0+0x518> 10258: 00900793 li a5,9 1025c: 30f70263 beq a4,a5,10560 <uf8_encode.part.0+0x524> 10260: 00a00793 li a5,10 10264: 30f70463 beq a4,a5,1056c <uf8_encode.part.0+0x530> 10268: 00b00793 li a5,11 1026c: 30f70663 beq a4,a5,10578 <uf8_encode.part.0+0x53c> 10270: 00c00793 li a5,12 10274: 30f70863 beq a4,a5,10584 <uf8_encode.part.0+0x548> 10278: 00d00793 li a5,13 1027c: 30f70a63 beq a4,a5,10590 <uf8_encode.part.0+0x554> 10280: 00040537 lui a0,0x40 10284: 00e00693 li a3,14 10288: ff050513 addi a0,a0,-16 # 3fff0 <__stack_top+0x2e0f0> 1028c: 18d70e63 beq a4,a3,10428 <uf8_encode.part.0+0x3ec> ``` ```asm= 10504: 03000713 li a4,48 10508: f8dff06f j 10494 <uf8_encode.part.0+0x458> 1050c: 7f000513 li a0,2032 10510: f19ff06f j 10428 <uf8_encode.part.0+0x3ec> 10514: 00050793 mv a5,a0 10518: 00e00613 li a2,14 1051c: ed1ff06f j 103ec <uf8_encode.part.0+0x3b0> 10520: 07000513 li a0,112 10524: f05ff06f j 10428 <uf8_encode.part.0+0x3ec> 10528: 00200613 li a2,2 1052c: ec1ff06f j 103ec <uf8_encode.part.0+0x3b0> 10530: 0f000513 li a0,240 10534: ef5ff06f j 10428 <uf8_encode.part.0+0x3ec> 10538: 1f000513 li a0,496 1053c: eedff06f j 10428 <uf8_encode.part.0+0x3ec> 10540: 00070793 mv a5,a4 10544: 00100613 li a2,1 10548: ea5ff06f j 103ec <uf8_encode.part.0+0x3b0> 1054c: 3f000513 li a0,1008 ``` Let's take a closer look at the instruction: ```asm= 10230: 00400793 li a5,4 # if e = 4, jump to load 240 10234: 2ef70e63 beq a4,a5,10530 10238: 00500793 li a5,5 1023c: 2ef70e63 beq a4,a5,10538 # if e = 5, jump to load 496 ... ``` ```asm= 1050c: 7f000513 li a0,2032 ... 10520: 07000513 li a0,112 ... 10530: 0f000513 li a0,240 ... 10538: 1f000513 li a0,496 ... 1054c: 3f000513 li a0,1008 ``` * exponent = 3: $(2^{3} - 1) \times 16 = 112$ * exponent = 4: $(2^{4} - 1) \times 16 = 240$ * exponent = 5: $(2^{5} - 1) \times 16 = 496$ * exponent = 6: $(2^{6} - 1) \times 16 = 1008$ * exponent = 7: $(2^{7} - 1) \times 16 = 2032$ The compiler **pre-calculated** all possible results of `overflow` and embedded them directly as immediate values (e.g., `li a0,496`). This means that the code size will become much large than mine. **Observation:** Since this is a **Space-Time Trade-off**: the compiler sacrifices instruction memory to gain execution speed. My implementation takes a different approach: I use a mathematical formula to calculate the value at runtime. Thsi avoids the code size increase, making my solution much friendlier to **memory-constrained** device without sacrificing performance (still $O(1)$). --- #### ==-Os== (optimized for size) Similarly, I modified `CFLAGS` to get the `-Os` version. ```Makefile CFLAGS = -g -march=rv32i_zicsr -Os ``` > GitHub: [quiz1 problem b Os](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/quiz1_problem_b/main_Os.asm) ```asm= 00010150 <uf8_decode>: 10150: 00455693 srli a3,a0,0x4 10154: 00f00713 li a4,15 10158: 000087b7 lui a5,0x8 1015c: 40d70733 sub a4,a4,a3 10160: fff78793 addi a5,a5,-1 # 7fff <_start-0x8001> 10164: 40e7d7b3 sra a5,a5,a4 10168: 00f57513 andi a0,a0,15 1016c: 00479793 slli a5,a5,0x4 10170: 00d51533 sll a0,a0,a3 10174: 00a78533 add a0,a5,a0 10178: 00008067 ret 0001017c <uf8_encode>: 1017c: 00f00793 li a5,15 10180: 00a7e663 bltu a5,a0,1018c <uf8_encode+0x10> 10184: 0ff57513 zext.b a0,a0 10188: 00008067 ret 1018c: 00050793 mv a5,a0 10190: 00500713 li a4,5 10194: 01000693 li a3,16 10198: 02000613 li a2,32 1019c: 00d7d5b3 srl a1,a5,a3 101a0: 00058663 beqz a1,101ac <uf8_encode+0x30> 101a4: 40d60633 sub a2,a2,a3 101a8: 00058793 mv a5,a1 101ac: fff70713 addi a4,a4,-1 101b0: 4016d693 srai a3,a3,0x1 101b4: fe0714e3 bnez a4,1019c <uf8_encode+0x20> 101b8: 40c787b3 sub a5,a5,a2 101bc: 01f78693 addi a3,a5,31 101c0: 00400613 li a2,4 101c4: 00000793 li a5,0 101c8: 02d65663 bge a2,a3,101f4 <uf8_encode+0x78> 101cc: ffc68793 addi a5,a3,-4 101d0: 0ff7f693 zext.b a3,a5 101d4: 00f00613 li a2,15 101d8: 0ff7f793 zext.b a5,a5 101dc: 00f67463 bgeu a2,a5,101e4 <uf8_encode+0x68> 101e0: 0ff67693 zext.b a3,a2 101e4: 00000793 li a5,0 101e8: 02f69463 bne a3,a5,10210 <uf8_encode+0x94> 101ec: 00078463 beqz a5,101f4 <uf8_encode+0x78> 101f0: 02e56a63 bltu a0,a4,10224 <uf8_encode+0xa8> 101f4: 00f00613 li a2,15 101f8: 04c79063 bne a5,a2,10238 <uf8_encode+0xbc> 101fc: 40e50533 sub a0,a0,a4 10200: 00f55533 srl a0,a0,a5 10204: 00479793 slli a5,a5,0x4 10208: 00f56533 or a0,a0,a5 1020c: f79ff06f j 10184 <uf8_encode+0x8> 10210: 00171713 slli a4,a4,0x1 10214: 00178793 addi a5,a5,1 10218: 01070713 addi a4,a4,16 1021c: 0ff7f793 zext.b a5,a5 10220: fc9ff06f j 101e8 <uf8_encode+0x6c> 10224: 00175713 srli a4,a4,0x1 10228: fff78793 addi a5,a5,-1 1022c: ff870713 addi a4,a4,-8 10230: 0ff7f793 zext.b a5,a5 10234: fb9ff06f j 101ec <uf8_encode+0x70> 10238: 00171693 slli a3,a4,0x1 1023c: 01068693 addi a3,a3,16 10240: fad56ee3 bltu a0,a3,101fc <uf8_encode+0x80> 10244: 00178793 addi a5,a5,1 10248: 0ff7f793 zext.b a5,a5 1024c: 00068713 mv a4,a3 10250: fa9ff06f j 101f8 <uf8_encode+0x7c> ``` Similar to the `-Ofast` version, `-Os` inlined `<clz>` logic into `uf8_encode`. However, instead of an "Unrolled Decision Tree", `-Os` uses a "**Binary Search Loop**". ```asm= # clz 10190: 00500713 li a4,5 # note 10194: 01000693 li a3,16 # c = 16 10198: 02000613 li a2,32 # n = 32 1019c: 00d7d5b3 srl a1,a5,a3 # y = x >> c 101a0: 00058663 beqz a1,101ac # if (y == 0) skip 101a4: 40d60633 sub a2,a2,a3 # n -= c 101a8: 00058793 mv a5,a1 # x = y 101ac: fff70713 addi a4,a4,-1 # decrement loop counter 101b0: 4016d693 srai a3,a3,0x1 # c >> 1 101b4: fe0714e3 bnez a4,1019c # loop back ``` > **Note:** The compiler smartly uses `li a4, 5` as a loop counter. This is the 32-bit binary search needs $log_2(32) = 5$ loop times (`32->16->8->4->2->1`) . It's interesting that the compiler knows that the clz loop need 5 times so used 5 as immediate. I compared the `-Os` `clz` with my code, and the implementations are almost identical. However, there is a subtle differenct: `-Os` uses **a additional register as a loop counter**, whereas my approach uses **the shfit amount directly to determine the loop termination condition**. ```asm= # -Os 10190: 00500713 li a4,5 10194: 01000693 li a3,16 # c = 16 # ... 101ac: fff70713 addi a4,a4,-1 # decrement loop counter 101b0: 4016d693 srai a3,a3,0x1 # c >> 1 ``` ```asm= # Mine addi t0, x0, 32 # n = 32 addi t1, x0, 16 # c = 16 # ... srli t1, t1, 1 # c >>= 1 ``` `-Os` uses seperate registers for the loop counter (`a4`) and the shift amount (`a3`). This requires extra instructions to initialize and decrement the counter, and my implementation **resuing the shift amount as the loop counter**. Regarding `uf8_decode`, my approach shares the same logic as `-Os`, but the compiler's code is more concise. As noted in `-Ofast` comparison, I used 3 instructions to get immediate `0x7FFF`, while `-Os` achieves this with only 2 instructions: ```asm= # -Os 101c0: 67a1 lui a5,0x8 # ... 101c4: 17fd addi a5,a5,-1 # 7fff <exit-0x80b5> ``` Let's talk about `uf8_encoding`, unlike `-Os` which unconditionally saves stack and registers, my implementation is more aggressive. Since I know exactly which registers are used, I treat `uf8_encoding` as a leaf function for small inputs and **bypass the calling convention**. This requires me to be fully aware of the data flow (one mistake could corrupt the registers), but the performance gain is worth the risk. And just like we discussed in the `-Ofast` part, I used $O(1)$ time to calculate `overflow` while `-Os` used many branches, which takes $O(N)$. Although `-Os` does not have performance enhancement, it is used to minimize the binary machine code generated by the compiler. > **Note:** The following part is used to show the difference of `-Os` and `-Ofast`. So `rv32ic` is used. ==**AI Usage**==[Gemini - Comparing Compiler Optimizations: -Ofast vs -Os](https://gemini.google.com/share/3e7407cc10fa) To illustrate the impact on code size, I analyzed the usage of [RVC (RISC-V Compressed)](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-209.pdf) instructions across different implementations. The results are shown below: > GitHub: [-Ofast rv32ic assembly](https://github.com/dorobouharu133/CA25-Assignment-2/blob/dc1bc2d6e06ca99fde2cd2f8bd3434be036d9fc3/quiz1_problem_b/main_Ofast.asm), [-Os rv32ic assembly](https://github.com/dorobouharu133/CA25-Assignment-2/blob/db2d1e0c45dd832512baecb3d29cf3b953d34f2c/quiz1_problem_b/main_Os.asm) |Function| -Ofast (RVC %)| -Os (RVC %)| |---|---|---| |clz|16/23 (69.5%)|9/12 (75.0%)| |uf8_decode|8/11 (72.7%)|8/11 (72.7%)| |uf8_encode|161/388 (41.4%)|37/56 (66.1%)| * Numerator: Number of 16-bit instructions. * Denominator: Total number of instructions > **Note:** This is a **static** code size analysis. The impact on dynamic execution **might be even more significant**. **Observation:** In contrast, `-Os` achieves higher compression ratio, with approximately **68%** of instructions compressed to 16-bit. It's insteresting that `-Ofast` also uses RVC quite a bit, but since it focuses on execution speed, it often **trades code size for performance**. ___ ### Quiz 1 Problem C > GitHub: [Quiz 1 Problem C](https://github.com/dorobouharu133/CA25-Assignment-2/tree/main/quiz1_problem_c) #### Baseline > Github: [quiz1 problem c baseline](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/quiz1_problem_c/main.asm) ``` make dump > main.asm ``` ___ Instead of discussing the baseline, I will focus on comparing my handwritten implementation against the compiler-optimized versions. #### ==-Ofast== (optimized for speed) > GitHub: [quiz1 problem c -Ofast](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/quiz1_problem_c/main_Ofast.asm) ##### 1. bf16_add > GitHub: [my implementation: bf16_add](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_add/bf16_add.S) I implemented a **Branchless Swap** using XOR operations and masks to enforce $|A| \ge |B|$. ```asm= bf16_add_align_branchless: # a0: a # a1: b # s0: unsigned_a # s1: unsigned_b # t6: temp, used as mask addi t6, x0, 0x7F slli t6, t6, 8 ori t6, t6, 0xFF # mask = 0x7FFF and s0, a0, t6 # unsigned_a = a.bits & mask and s1, a1, t6 # unsigned_b = b.bits & mask slt t6, s0, s1 # t6 = (|A| < |B|) ? 1 : 0 sub t6, x0, t6 # t6 = (|A| < |B|) ? 0xFFFF : 0x0 bf16_add_branchless_swap_sign: # swap sign bit # t0: sign_a # t1: sign_b # t6: mask (0xFFFF or 0x0000) # s0: sign_a ^ sign_b xor s0, t0, t1 # s0 = sign_a ^ sign_b and s0, s0, t6 # use mask to check swap or not xor t0, t0, s0 xor t1, t1, s0 bf16_add_branchless_swap_exponent: # swap exponent # t2: exp_a # t3: exp_b # t6: mask (0xFFFF or 0x0000) # s0: exp_a ^ exp_b xor s0, t2, t3 # s0 = exp_a ^ exp_b and s0, s0, t6 xor t2, t2, s0 xor t3, t3, s0 bf16_add_branchless_swap_mantissa: # swap mantissa # t4: mant_a # t5: mant_b # t6: mask (0xFFFF or 0x0000) # s0: mant_a ^ mant_b xor s0, t4, t5 # s0 = mant_a ^ mant_b and s0, s0, t6 xor t4, t4, s0 xor t5, t5, s0 addi s1, t2, 0 # initialize s1 = exp_a (bigger) # s3: exp_diff # t6: mask sub s3, t2, t3 # exp_diff = exp_a - exp_b addi t6, x0, 8 slt t6, t6, s3 sub t6, x0, t6 # t6 = (8 < exp_diff) ? 0xFFFF : 0 xori t6, t6, -1 # t6 = ~t6 # align mant_b sra t5, t5, s3 # mant_b >>= exp_diff and t5, t5, t6 # mant_b &= ~mask ``` By **Branchless Swap**, I eliminated all the branch logic in the C source code: ```c= int16_t exp_diff = exp_a - exp_b; uint16_t result_sign; int16_t result_exp; uint32_t result_mant; if (exp_diff > 0) { result_exp = exp_a; if (exp_diff > 8) return a; mant_b >>= exp_diff; } else if (exp_diff < 0) { result_exp = exp_b; if (exp_diff < -8) return b; mant_a >>= -exp_diff; } else { result_exp = exp_a; } ``` And then the following comparation in the C source code can be skipped due to **Branchless Swap**. ```c= // comparison can be skipped if (sign_a == sign_b) { result_sign = sign_a; result_mant = (uint32_t) mant_a + mant_b; // ... } ``` Whereas `-Ofast` version retains the conditional logic from the C source code. ```asm= # a2: mant_a # a6: mant_b 10120: 19067e63 bgeu a2,a6,102bc <bf16_add+0x1f4> ``` `-Ofast` basically performs a **direct translation** of the C conditional logic and uses [**instruction scheduling**](https://en.wikipedia.org/wiki/Instruction_scheduling) to fill in pipeline. My **Branchless** version provides several benefits to Pipeline: * **Eliminate Misprediction penalties:** Once a misprediction occurs, the CPU must flush the pipeline and discard mis-fetched instruction, this results in several clock cycle stalls. * **Deterministic Execution Time:** In the `-Ofast` version, execution time depends on the input data. In contrast, my implementation is **input-independent**, which means that it executes the exact same instructions sequence regardless of the input values. > **Reference: ["Branch_(computer_science)" Branch-free code](https://en.wikipedia.org/wiki/Branch_(computer_science)) (Wikipedia)** In fact, **branch-free code is a must for cryptography due to timing attacks**. While `bf16_add` is not cryptography, the benefits remains, removing branches brings predictable performance, which is critical for a real-time system. However, I faced a challenge when I implemented branchless swap, I had to manually implement **Guard Bits** (the original precision is not sufficient). **Observation:** Although my handwritten implementation works better than the naive implementation, this performance gain was not free. I had to deal with additional tasks (such as guard bits problem) since I optimized the Code, while `-Ofast` can just translate directly. My approach worth only when `bf16_add` executes frequently enough. ##### 2. bf16_sub > GitHub: [my implementation: bf16_sub](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sub/bf16_sub.S) ```asm= # -Ofast 00010248 <bf16_sub>: 10248: ffff87b7 lui a5,0xffff8 1024c: 00b7c5b3 xor a1,a5,a1 10250: e79ff06f j 100c8 <bf16_add> ``` ```asm= # my implementation bf16_sub: addi t0, x0, 0x80 slli t0, t0, 8 addi sp, sp, -8 sw ra, 4(sp) sw a1, 0(sp) xor a1, a1, t0 # negate b jal bf16_add lw a1, 0(sp) lw ra, 4(sp) addi sp, sp, 8 jr ra ``` The `-Ofast` version is quite shorter than mine, this is because I store the registers and return address, and the `-Ofast` used a very simple instruction `lui a5,0xffff8` to get the negative `b`. ##### 3. bf16_div > GitHub: [my implementation: bf16_div](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_div/bf16_div.S) In my implementation, I employs a **branchless 16-bit** `clz` as a function to normalize `quotient`. ```asm= # a0: mant_a # t0: x # t1: s # t2: shift # t3: temp # t4: temp2 branchless_16_bit_clz: addi sp, sp, -20 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) sw t3, 12(sp) sw t4, 16(sp) addi t0, a0, 0 # x = mant_a addi t1, x0, 0 # s = 0 addi t2, x0, 0 # shift = 0 addi t3, x0, 0x00FF slt t1, t3, t0 # s = (x > 0x00FF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x00FF) ? 1 : 0 slli t4, t1, 3 add t2, t2, t4 # shift += (s << 3) sll t0, t0, t4 # x <<= (s << 3) addi t3, x0, 0xF slli t3, t3, 8 ori t3, t3, 0xFF slt t1, t3, t0 # s = (x > 0x0FFF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x0FFF) ? 1 : 0 slli t4, t1, 2 add t2, t2, t4 # shift += (s << 2) sll t0, t0, t4 # x <<= (s << 2) addi t3, x0, 0x3F slli t3, t3, 8 ori t3, t3, 0xFF slt t1, t3, t0 # s = (x > 0x3FFF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x3FFF) ? 1 : 0 slli t4, t1, 1 add t2, t2, t4 # shift += (s << 1) sll t0, t0, t4 # x <<= (s << 1) addi t3, x0, 0x7F slli t3, t3, 8 ori t3, t3, 0xFF slt t1, t3, t0 # s = (x > 0x7FFF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x7FFF) ? 1 : 0 slli t4, t1, 0 add t2, t2, t4 # shift += (s << 0) sll t0, t0, t4 # x <<= (s << 0) addi a0, t2, 0 # return shift lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) lw t3, 12(sp) lw t4, 16(sp) addi sp, sp, 20 jr ra ``` Where as the `-Ofast` uses a loop in `bf16_div` to normalize, this loop shifts the quotient and decrements the exponent until the leading bit is set. ```asm= # -Ofast quotient normalization 10510: 00e6da63 bge a3,a4,10524 10514: 00179793 slli a5,a5,0x1 10518: 01079613 slli a2,a5,0x10 1051c: fff70713 addi a4,a4,-1 10520: fe0658e3 bgez a2,10510 ``` Surprisingly, `-Ofast` used a Loop to normalize `quotient` instead of Loop Unrolling. Typically, `-Ofast` is expected to trade off code size for execution speed. However, `-Ofast` seems to have done the opposite. Therefore, I want to unroll this loop and check the performance in each cases in next section. --- #### ==-Os== (optimized for size) > GitHub: [quiz1 problem c -Os](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/quiz1_problem_c/main_Os.asm) ##### 1. bf16_add > GitHub: [my implementation: bf16_add](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_add/bf16_add.S) ```asm= 10138: 0807e793 ori a5,a5,128 # a5 = mant_b 1013c: 41170733 sub a4,a4,a7 # a4 = exp_a - exp_b 10140: 01071e13 slli t3,a4,0x10 10144: 010e5e13 srli t3,t3,0x10 ``` The above code is used to implement the C source code below: ```c= int16_t exp_diff = exp_a - exp_b; ``` Since the type of `exp_diff` is `int16_t`, therefore the compiler used two shift instructions to truncate the result of `exp_a - exp_b`. (Upper 16 bits are all zeros) ```asm= 10140: 01071e13 slli t3,a4,0x10 10144: 010e5e13 srli t3,t3,0x10 ``` ##### 2. bf16_sub > GitHub: [my implementation: bf16_sub](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sub/bf16_sub.S) ```asm= 00010248 <bf16_sub>: 10248: ffff87b7 lui a5,0xffff8 1024c: 00b7c5b3 xor a1,a5,a1 10250: e79ff06f j 100c8 <bf16_add> ``` As the `-Ofast` version, `-Os` uses 3 instructions to perform `bf16_sub` and dese not save the stack and the registers. ==**AI Usage**==[Tail Call Optimization (TCO)](https://gemini.google.com/share/7bd7d3e953c3) **Observation:** I checked the **RISC-V Calling Convention**. Although `a1` and `a5` are **Caller-saved** registers, `bf16_sub` does not save them to the stack. This is due to [Tail Call](https://en.wikipedia.org/wiki/Tail_call) Optimization. `bf16_sub` uses `a5` as a mask and modifies `a1` to pass it as an argument to `bf16_add`. Since `bf16_sub` teminates immediately after `j 100c8 <bf16_add>`, there is no need to be saved (included `ra`). `bf16_sub` can be considered as just an intermeidate state of calling `bf16_add`. |Register|ABI Name|Description|Saver| |---|---|---|---| |x0|zero|Hard-wired zero|| |x1|r1|Return address|Caller| |x2|sp|Stack pointer|Callee| |x3|gp|Global pointer|| |x4|tp|Thread pointer|| |x5-7|t0-2|Temporaries|Caller| |x8|s0/fp|Saved register/frame pointer|Callee| |x9|s1|Saved register|Callee| |x10-11|a0-1|Function arguments/return values|Caller| |x12-17|a2-7|Function arguments|Caller| |x18-27|s2-11|Saved registers|Callee| |x28-31|t3-6|Temporaries|Caller| > **Reference:** [Calling Convention](https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf), [RISC-V Calling Conventions ](https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc) ##### 3. bf16_div > GitHub: [my implementation: bf16_div](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_div/bf16_div.S) ```asm=1016 10510: 00e6da63 bge a3,a4,10524 <bf16_div+0x108> 10514: 00179793 slli a5,a5,0x1 10518: 01079613 slli a2,a5,0x10 1051c: fff70713 addi a4,a4,-1 10520: fe0658e3 bgez a2,10510 <bf16_div+0xf4> ``` To minimize the code size, `-Os` uses a loop to iteratively count leading zeros. In contrast, my implementation employed a branchless 16-bit CLZ function. ```asm=1007 104ec: 40c706b3 sub a3,a4,a2 104f0: 00e03733 snez a4,a4 104f4: 00d70733 add a4,a4,a3 104f8: 07e70713 addi a4,a4,126 104fc: 00163613 seqz a2,a2 10500: 01079693 slli a3,a5,0x10 10504: 00c70733 add a4,a4,a2 ``` I also noticed that `-Os` uses a clever branchless implementation to calculate the `exponent`, whereas I used condition check for the implicit one: ``` bf16_div_a_explicit_one: beq t2, x0, bf16_div_b_explicit_one ori t4, t4, 0x80 # mant_a |= 0x80 ``` In conclusion, I found that my implementation isn't fully branchless yet, and there is still room to optimize. ### Quiz 2 Problem C > GitHub: [Quiz 2 Problem C disassembly code](https://github.com/dorobouharu133/CA25-Assignment-2/tree/main/quiz3_problem_c) ## Requirement 5 ### [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) ```c= #include <inttypes.h> #include <stdint.h> #include <stdio.h> typedef uint64_t ticks; static inline ticks getticks(void) { uint64_t result; uint32_t l, h, h2; asm volatile( "rdcycleh %0\n" "rdcycle %1\n" "rdcycleh %2\n" "sub %0, %0, %2\n" "seqz %0, %0\n" "sub %0, zero, %0\n" "and %1, %1, %0\n" : "=r"(h), "=r"(l), "=r"(h2)); result = (((uint64_t) h) << 32) | ((uint64_t) l); return result; } static uint64_t fib(uint64_t n) { if (n <= 1) return n; return fib(n - 1) + fib(n - 2); } int main() { ticks t0 = getticks(); fib(19); ticks t1 = getticks(); printf("elapsed cycle: %" PRIu64 "\n", t1 - t0); return 0; } ``` `tick.s` provides a straightward method for measuring execution performance. And all we needs here is as below: ```c= typedef uint64_t ticks; static inline ticks getticks(void) { // ... } ``` For example: ```c= int main(void) { ticks t0 = getticks(); clz(0x0000FFFF); clz(0x00F00000); clz(0x80000000); clz(0x00000001); clz(0x00000000); ticks t1 = getticks(); printstr("clz elapsed cycle: ", 19); print_dec((unsigned long)(t1 - t0)); printstr("\n", 1); return 0; } ``` ### [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter) ```asm= # get_cycles.S .text .globl get_cycles .align 2 get_cycles: csrr a1, cycleh csrr a0, cycle csrr a2, cycleh bne a1, a2, get_cycles ret .size get_cycles,.-get_cycles ``` Meanwhile, `get_cycles.S` uses the `csrr` instructions to access hardware counters (`cycleh` and `cycle`). By integrating these two implementations, we can effectively evaluate and compare the performance of both C and assembly code. ```asm= .text .globl get_instret .align 2 get_instret: csrr a1, instreth csrr a0, instret csrr a2, instreth bne a1, a2, get_instret ret .size get_instret,.-get_instret ``` We can use the above assembly code to get **intruction retired**. ### CSRR (CSR Read) > Quote from [riscv-isa-manual](https://github.com/riscv/riscv-isa-manual/blob/3d19e7cf49a7e706d2eced98d26a8007ab4a795f/src/zicsr.adoc#L198): The assembler pseudoinstruction to read a CSR, CSRR rd, csr, is encoded as CSRRS rd, csr, x0. > The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the value of the CSR, zero-extends the value to XLEN bits, and writes it to integer register rd. The initial value in integer register rs1 is treated as a bit mask that specifies bit positions to be set in the CSR. Any bit that is high in rs1 will cause the corresponding bit to be set in the CSR, if that CSR bit is writable. ### Comparison (Before Optimized) #### Quiz 1 Problem B Below is the `main.c` I used to get the cycle counts. ```c= int main(void) { ticks t0 = getticks(); clz(0x0000FFFF); clz(0x00F00000); clz(0x80000000); clz(0x00000001); clz(0x00000000); ticks t1 = getticks(); printstr("clz elapsed cycle: ", 19); print_dec((unsigned long)(t1 - t0)); printstr("\n", 1); ticks t2 = getticks(); uf8_decode(0x00); uf8_decode(0x10); uf8_decode(0x2F); uf8_decode(0x8A); uf8_decode(0xFF); ticks t3 = getticks(); printstr("uf8_decode elapsed cycle: ", 26); print_dec((unsigned long)(t3 - t2)); printstr("\n", 1); ticks t4 = getticks(); uf8_encode(0); uf8_encode(16); uf8_encode(17); uf8_encode(1015792); uf8_encode(1048560); ticks t5 = getticks(); printstr("uf8_encode elapsed cycle: ", 26); print_dec((unsigned long)(t5 - t4)); printstr("\n", 1); return 0; } ``` |function|My Approach|Baseline|`-Ofast`|`-Os`| |---|---|---|---|---| |`clz`|228|468|92|186| |`uf8_decode`|177|267|58|102| |`uf8_encode`|374|1031|250|524| > Unit: Clock Cycle #### Quiz 1 Problem C Below is the `main.S` I used to get the cycle counts and the instruction retired. > GitHub: [main.S for Quiz 1 Problem C](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/main.S) |function|My Approach|`-O0`Baseline|`-Ofast`|`-Os`| |---|---|---|---|---| |`bf16_add`|254|411|186|196| |`bf16_sub`|317|451|186|189| |`bf16_div`|405|1035|296|371| > Unit: Clock Cycles |function|My Approach|`-O0`Baseline|`-Ofast`|`-Os`| |---|---|---|---|---| |`bf16_add`|253|410|185|195| |`bf16_sub`|316|450|185|188| |`bf16_div`|404|1034|295|370| > Unit: Instruction Retired ### Optimization #### My Handwritten Approach ##### `bf16_add` > GitHub: [my handwritten bf16_add](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_add/bf16_add.S) 1. **Replace the instructions to get immediate with `li` and `mv`** For example, I used the below instructions to get `0x7F80` in my original code: ```asm= addi t6, x0, 0x7F slli t6, t6, 8 ori t6, t6, 0x80 ``` I previously avoided using pseudo-instructions like `li` and `mv` because they are not raw RV32I instructions. However, I will now use `li t6, 0x7F80` to simplify the code. 2. **Simplfy the Branchless Swap logic** In my original implementation, I divided the `bf16` numbers into 3 parts (Sign, Exponent and Mantissa) and applied a branchless swap to each part individually, which required **3 swaps**. However, we can optimize this to just **1 swap**. Let me explain this logic: We can first mask the sign bit, and then compare the remaining magnitude directly. ``` 1. Get abs(A) and abs(B) 2. Compare abs(A) and abs(B) 3. Swap A, B (only 1 swap) 4. Unpacked A and B ``` This ahieves the same effect as the original approach but requires only 1 swap. Therefore, we have to swap A and B at the begining. ```asm= # a0: a # a1: b bf16_add_swap_magnitude: li t6, 0x7FFF # use t6 as mask to get magnitude and s0, a0, t6 # s0 = |A| and s1, a1, t6 # s1 = |B| slt t6, s0, s1 # t6 = (|A| < |B|) ? 1 : 0 neg t6, t6 # t6 = (|A| < |B|) ? 0xFFFFFFFF : 0 xor t5, a0, a1 # swap logic and t5, t5, t6 xor a0, a0, t5 xor a1, a1, t5 ``` And the `bf16_add_align_branchless` is modified to: ```asm= bf16_add_align: # t2: exp_a # t3: exp_b sub s3, t2, t3 # exp_diff = exp_a - exp_b li t6, 8 blt t6, s3, bf16_add_return_a sra t5, t5, s3 # mant_b >>= exp_diff ``` > GitHub: [optimized bf16_add.S](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/bf16_add_optmized.S) #### `-Ofast` Optimization ##### `bf16_div` As I mentioned in Requirement 4 `-Ofast` discussion, I want to unroll the below loop in `bf16_div`. ```asm= # -Ofast quotient normalization 10510: 00e6da63 bge a3,a4,10524 # check exp <= 1 ? 10514: 00179793 slli a5,a5,0x1 # mantissa << 1 10518: 01079613 slli a2,a5,0x10 # exponent-- 1051c: fff70713 addi a4,a4,-1 # shift bit 15 to bit 31 10520: fe0658e3 bgez a2,10510 # check bit 31 ``` The above code implements the `quotient` normalization in the C source code: ```c= else { while (!(quotient & 0x8000) && result_exp > 1) { quotient <<= 1; result_exp--; } quotient >>= 8; } ``` Since the above assembly code of `bf16_div` used absolute address, this make me hard to modify. Therefore, I asked Gemini to get the absolute-address-free assembly code. ==**AI Usage**==[Get the assembly code of bf16_div](https://gemini.google.com/share/77965dcfcc48) > GitHub: [-Ofast bf16_div.S with label](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/labeled_Ofast_bf16_div.S) ```asm= L_10510: bge a3, a4, L_10524 slli a5, a5, 0x1 slli a2, a5, 0x10 addi a4, a4, -1 bgez a2, L_10510 ``` **Clock Cycles:** 538 (7 Test Cases) 2-way loop unrolling: ```asm= L_10510: bge a3, a4, L_10524 slli a5, a5, 0x1 slli a2, a5, 0x10 addi a4, a4, -1 bltz a2, L_10524 bge a3, a4, L_10524 slli a5, a5, 0x1 slli a2, a5, 0x10 addi a4, a4, -1 bltz a2, L_10524 j L_10510 ``` **Clock Cycles:** 538 (7 Test Cases) 4-way loop unrolling: ```asm= L_10510: bge a3, a4, L_10524 slli a5, a5, 0x1 slli a2, a5, 0x10 addi a4, a4, -1 bltz a2, L_10524 bge a3, a4, L_10524 slli a5, a5, 0x1 slli a2, a5, 0x10 addi a4, a4, -1 bltz a2, L_10524 bge a3, a4, L_10524 slli a5, a5, 0x1 slli a2, a5, 0x10 addi a4, a4, -1 bltz a2, L_10524 bge a3, a4, L_10524 slli a5, a5, 0x1 slli a2, a5, 0x10 addi a4, a4, -1 bltz a2, L_10524 j L_10510 ``` 8-way loop unrolling: **Clock Cycles:** 538 (7 Test Cases) 16-way loop unrolling: **Clock Cycles:** 538 (7 Test Cases) I found that no matter how many way we unroll this loop, **the cycle counts are still the same**. In my observation, this `quotient` normalization is not the bottleneck. And I found an another loop that may be the bottleneck. ```asm= L_104d0: sll a0, a7, a1 slli a5, a5, 0x1 bltu a3, a0, L_104e4 sub a3, a3, a0 ori a5, a5, 1 L_104e4: addi a1, a1, -1 bne a1, t3, L_104d0 ``` **Clock Cycles:** 1135 (11 Test Cases) The above code is the implementation of the below C source code: ```c= uint32_t dividend = (uint32_t) mant_a << 15; uint32_t divisor = mant_b; uint32_t quotient = 0; for (int i = 0; i < 16; i++) { quotient <<= 1; if (dividend >= (divisor << (15 - i))) { dividend -= (divisor << (15 - i)); quotient |= 1; } } ``` 2-way loop unrolling: ```asm= L_104c0: slli a3, a5, 0xf # dividend li a5, 0 # quotient li a1, 15 # loop counter li t3, -1 L_104d0: sll a0, a7, a1 # divisor << current_i slli a5, a5, 1 # quotient << 1 bltu a3, a0, .L_skip_odd sub a3, a3, a0 # dividend -= shifted_divisor ori a5, a5, 1 # quotient |= 1 .L_skip_odd: addi a1, a1, -1 sll a0, a7, a1 # divisor << current_i slli a5, a5, 1 # quotient << 1 bltu a3, a0, .L_skip_even sub a3, a3, a0 ori a5, a5, 1 .L_skip_even: addi a1, a1, -1 bne a1, t3, L_104d0 ``` **Clock Cycles:** 1079 (11 Test Cases) <details> <summary>4-way unrolling</summary> ```asm= L_104c0: slli a3, a5, 0xf li a1, 15 li a5, 0 li t3, -1 L_104d0: sll a0, a7, a1 # divisor << current_i slli a5, a5, 1 # quotient << 1 bltu a3, a0, .L_skip_1 sub a3, a3, a0 # dividend -= shifted_divisor ori a5, a5, 1 # quotient |= 1 .L_skip_1: addi a1, a1, -1 # Counter-- sll a0, a7, a1 # divisor << a1 slli a5, a5, 1 bltu a3, a0, .L_skip_2 sub a3, a3, a0 ori a5, a5, 1 .L_skip_2: addi a1, a1, -1 # Counter-- sll a0, a7, a1 # divisor << a1 slli a5, a5, 1 bltu a3, a0, .L_skip_3 sub a3, a3, a0 ori a5, a5, 1 .L_skip_3: addi a1, a1, -1 # Counter-- sll a0, a7, a1 # divisor << a1 slli a5, a5, 1 bltu a3, a0, .L_skip_4 sub a3, a3, a0 ori a5, a5, 1 .L_skip_4: addi a1, a1, -1 # Counter-- bne a1, t3, L_104d0 ``` </details> **Clock Cycles:** 1051 (11 Test Cases) <details> <summary>8-way unrolling</summary> ```asm= L_104c0: slli a3, a5, 0xf # dividend li a5, 0 # quotient li a1, 15 # loop counter li t3, -1 L_104d0: sll a0, a7, a1 # divisor << a1 slli a5, a5, 1 # quotient << 1 bltu a3, a0, .L_skip_1 sub a3, a3, a0 # dividend -= shifted_divisor ori a5, a5, 1 # quotient |= 1 .L_skip_1: addi a1, a1, -1 # Counter-- sll a0, a7, a1 slli a5, a5, 1 bltu a3, a0, .L_skip_2 sub a3, a3, a0 ori a5, a5, 1 .L_skip_2: addi a1, a1, -1 # Counter-- sll a0, a7, a1 slli a5, a5, 1 bltu a3, a0, .L_skip_3 sub a3, a3, a0 ori a5, a5, 1 .L_skip_3: addi a1, a1, -1 # Counter-- sll a0, a7, a1 slli a5, a5, 1 bltu a3, a0, .L_skip_4 sub a3, a3, a0 ori a5, a5, 1 .L_skip_4: addi a1, a1, -1 # Counter-- sll a0, a7, a1 slli a5, a5, 1 bltu a3, a0, .L_skip_5 sub a3, a3, a0 ori a5, a5, 1 .L_skip_5: addi a1, a1, -1 # Counter-- sll a0, a7, a1 slli a5, a5, 1 bltu a3, a0, .L_skip_6 sub a3, a3, a0 ori a5, a5, 1 .L_skip_6: addi a1, a1, -1 # Counter-- sll a0, a7, a1 slli a5, a5, 1 bltu a3, a0, .L_skip_7 sub a3, a3, a0 ori a5, a5, 1 .L_skip_7: addi a1, a1, -1 # Counter-- sll a0, a7, a1 slli a5, a5, 1 bltu a3, a0, .L_skip_8 sub a3, a3, a0 ori a5, a5, 1 .L_skip_8: addi a1, a1, -1 # Counter-- bne a1, t3, L_104d0 ``` </details> **Clock Cycles:** 1051 (11 Test Cases) In my observation, unrolled this loop to 8 ways is enough. ### Comparison (After Optimized) #### Quiz 1 Problem C Since the original test cases of `bf16_add` couldn't properly show my improvement, I added 2 more cases. ```asm= # 2.0 + 1.0 = 3.0 (No Swap needed, Simple Add) bf16_add_test5: li a0, 0x4000 # 2.0 li a1, 0x3F80 # 1.0 jal ra, bf16_add li t1, 0x4040 # 3.0 (1.1 in binary * 2^1) bne a0, t1, test_fail # 1.0 + 2.0 = 3.0 (Swap Needed) bf16_add_test6: li a0, 0x3F80 # 1.0 li a1, 0x4000 # 2.0 jal ra, bf16_add li t1, 0x4040 # 3.0 bne a0, t1, test_fail ``` In test case 6 (Swap Need), I successfully reduced the cycle count from 163 to 140. --- In `bf16_div`, I also added 2 more cases (total 6 cases): ```asm= # 0.000000 / 0.000000 = NaN (0x7FC0) bf16_div_test_5: addi a0, x0, 0x0000 # +0 addi a1, x0, 0x0000 # +0 jal ra, bf16_div addi s0, a0, 0 li t0, 0x7FC0 bne s0, t0, test_fail # -6.0 / 2.0 = -3.0 bf16_div_test_6: li a0, 0xC0C0 li a1, 0x4000 jal ra, bf16_div addi s0, a0, 0 li t0, 0xC040 bne s0, t0, test_fail ``` > GitHub: [Unrolled bf16_div](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/unrolled%20bf16_div.S), [bf16_div_test](https://github.com/dorobouharu133/CA25-Assignment-2/blob/main/bf16_div_test.S) |function|My Approach|`-O0`Baseline|`-Ofast`|`-Os`| |---|---|---|---|---| |`bf16_add`|413|802|348|368| |`bf16_div`|790|1948|620|668| > Unit: Clock Cycle |function|My Approach|`-O0`Baseline|`-Ofast`|`-Os`| |---|---|---|---|---| |`bf16_add`|412|801|347|678| |`bf16_div`|789|1947|619|667| > Unit: Instruction Retired * My Approach (6 test cases): - `bf16_add` : $475 \to 413$ Clock Cycles * -Ofast (6 test cases): - `bf16_div` : $680 \to 620$ Clock Cycles **Conclusion:** Although I tried to improve my handwritten approach, the cycle count is still slightly higher than the compiler-optimized `-Ofast` and `-Os` versions. This means that there is still room for optimization in my assembly code. However, I found that even the `-Ofast` is not perfect, as I was able to further reduce its cycle count ## AI Usage * [Gemini - Comparing Compiler Optimizations: -Ofast vs -Os](https://gemini.google.com/share/f43d49e38603) * [Gemini - Tail Call Optimization (TCO)](https://gemini.google.com/share/7bd7d3e953c3) * [Get the assembly code of bf16_div](https://gemini.google.com/share/77965dcfcc48)