# Assignment 1: RISC-V Assembly and Instruction Pipeline

> ==Github: [dorobouharu133](https://github.com/dorobouharu133/ca2025-quizzes/tree/main)==

>[!Note] AI Tools Usage
>I used ChatGPT, Gemini and Copilot to find inspiration, refine the grammar of this article, explain the original C source code, debug, and help generate the corresponding test data.

# Requirement 2

## Problem B

There are some patterns we can exploit in `uf8_encode`. My first thought was to translate the C code to RISC-V instructions line by line.

```c=
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;

if (msb >= 5) {
    /* Estimate exponent - the formula is empirical */
    exponent = msb - 4;
    if (exponent > 15)
        exponent = 15;

    /* Calculate overflow for estimated exponent */
    for (uint8_t e = 0; e < exponent; e++)
        overflow = (overflow << 1) + 16;

    /* Adjust if estimate was off */
    while (exponent > 0 && value < overflow) {
        overflow = (overflow - 16) >> 1;
        exponent--;
    }
}
```

During the translation, I found that the for loop does not need to be translated to RISC-V instructions directly; a few bit operations produce the same result.

Observing the code above, the range of `exponent` inside the for loop is [1, 15]: the loop is reached only when `msb >= 5`, so `exponent = msb - 4` is at least 1, and the preceding if statement caps its upper bound at 15.

Now consider the for loop itself. It starts with `uint8_t e = 0` and runs while `e < exponent`, so with `exponent` in [1, 15] the body executes 1 to 15 times. The following steps describe my optimization.

```c=
for (uint8_t e = 0; e < exponent; e++)
    overflow = (overflow << 1) + 16;
```

The initial value of `overflow` is 0.

#### Pass 1
`overflow << 1` = 0 << 1 = 0, so `overflow` = 0 + 16 (decimal) = 0 + 10000 (binary) = `10000` (binary).
After the first pass, `overflow` = `10000` (binary).

#### Pass 2
`overflow << 1` = 10000 << 1 = 100000 (binary), and `(overflow << 1) + 16` = 100000 + 10000 = `110000` (binary).
After the second pass, `overflow` = `110000` (binary).

---

#### Observation
The overflow values follow a pattern: `overflow` is the concatenation of `exponent` leading 1's followed by a fixed group of four 0's. Therefore the for loop does not need to be translated at all; the same pattern can be produced with a few bit operations.

#### Bit Operation
`overflow` consists of `exponent` leading 1's followed by four 0's. How do we generate `exponent` leading 1's? Suppose `exponent` is 5, so `overflow` should be `111110000`; we first want `11111`. Note that `11111` (binary) = 31 (decimal) = 32 - 1, i.e. $2^5 - 1$. In general, `exponent` leading 1's can be obtained as $2^{\text{exponent}} - 1$, and `overflow` follows by shifting that result 4 bits to the left.

```asm=
# t0: exponent
# t1: overflow
addi t1, x0, 1
sll  t1, t1, t0     # get 2^exponent
addi t1, t1, -1     # get leading 1's
slli t1, t1, 4      # get overflow
```

Approach 1, shown after the quick check below, uses these bit operations in place of the for loop.
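To double-check this identity before committing it to assembly, here is a small standalone C test (my own sketch, not part of the submitted code) that compares the original loop with the closed form $((1 \ll \text{exponent}) - 1) \ll 4$ for every exponent in [1, 15]:

```c=
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    for (uint8_t exponent = 1; exponent <= 15; exponent++) {
        /* original loop from uf8_encode */
        uint32_t overflow = 0;
        for (uint8_t e = 0; e < exponent; e++)
            overflow = (overflow << 1) + 16;

        /* closed form: exponent leading 1's followed by four 0's */
        uint32_t closed_form = ((1u << exponent) - 1) << 4;

        assert(overflow == closed_form);
        printf("exponent=%2u overflow=0x%X\n",
               (unsigned) exponent, (unsigned) overflow);
    }
    return 0;
}
```

Every iteration asserts that the loop result and the closed form agree, so replacing the loop is safe.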
### ==Approach 1== > [Quiz 1 Problem B Approach 1](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/quiz1_approach/q1-uf8-approach-1.S) ```asm= .data pass_msg: .string "All tests passed!\n" fail_msg: .string "Test failed for value: " test_msg: .string "Testing value: " arrow_msg: .string " -> decoded: " encoded_msg: .string " -> re-encoded: " newline: .string "\n" .align 2 .text .globl main main: # s0 will be our loop counter (0-255) # s1 will store the original value # s2 will store the decoded value # s3 will store the re-encoded value addi s0, x0, 0 test_loop: # Print current test value la a0, test_msg addi a7, x0, 4 ecall addi a0, s0, 0 addi a7, x0, 1 ecall # First decode the value addi a0, s0, 0 addi s1, a0, 0 # Save original value jal uf8_decode addi s2, a0, 0 # Save decoded value # Print decoded value la a0, arrow_msg addi a7, x0, 4 ecall addi a0, s2, 0 addi a7, x0, 1 ecall # Now encode the decoded value addi a0, s2, 0 jal uf8_encode addi s3, a0, 0 # Save re-encoded value # Print re-encoded value la a0, encoded_msg addi a7, x0, 4 ecall addi a0, s3, 0 addi a7, x0, 1 ecall la a0, newline addi a7, x0, 4 ecall # Check if re-encoded value matches original bne s3, s1, test_fail # Increment counter and continue addi s0, s0, 1 addi t0, x0, 256 blt s0, t0, test_loop # All tests passed la a0, pass_msg addi a7, x0, 4 ecall j exit test_fail: la a0, fail_msg addi a7, x0, 4 ecall addi a0, s1, 0 addi a7, x0, 1 ecall la a0, newline addi a7, x0, 4 ecall exit: addi a7, x0, 10 # exit ecall # a0: x # t0: n # t1: c # t2: y clz: addi t0, x0, 32 # n = 32 addi t1, x0, 16 # c = 16 clz_while: srl t2, a0, t1 # y = x >> c beq t2, x0, clz_skip sub t0, t0, t1 # n -= c addi a0, t2, 0 # x = y clz_skip: srli t1, t1, 1 # c >>= 1 bnez t1, clz_while # while (c != 0) clz_ret: sub a0, t0, a0 # return n - x jr ra # a0: f1 # t0: mantissa # t1: exponent # t2: offset uf8_decode: andi t0, a0, 0x0f # mantissa = f1 & 0x0f srli t1, a0, 4 # exponent = f1 >> 4 addi t2, x0, 15 sub t2, t2, t1 addi t3, x0, 0x7F # replace pseudo instruction li t3, 0x7FFF slli t3, t3, 8 ori t3, t3, 0xFF srl t3, t3, t2 slli t3, t3, 4 sll a0, t0, t1 add a0, a0, t3 jr ra # a0: value # t0: lz # t1: msb # t2: exponent # t3: overflow # t4: temp # t5: mantissa # t6: next_overflow uf8_encode: addi t4, x0, 16 blt a0, t4, encode_return_lt16 # if(value < 16) then return value encode_ge16: addi sp, sp, -8 sw ra, 0(sp) sw s0, 4(sp) addi s0, a0, 0 # save value, help us not to access the memory jal clz addi t0, a0, 0 # lz addi t1, x0, 31 sub t1, t1, t0 # msb = 31 - lz addi t2, x0, 0 # exponent = 0 addi t3, x0, 0 # overflow = 0 addi t4, x0, 5 encode_if_1: blt t1, t4, encode_while_2 # if(msb < 5) goto encode_while_2 addi t2, t1, -4 # exponent = msb - 4 encode_if_2: addi t4, x0, 15 ble t2, t4, encode_for_1 # if (exponent <= 15) goto encode_for_1 addi t2, x0, 15 encode_for_1: addi t4, x0, 1 sll t4, t4, t2 addi t4, t4, -1 slli t4, t4, 4 addi t3, t4, 0 encode_while_1: beq t2, x0, encode_while_2 bge s0, t3, encode_while_2 addi t3, t3, -16 srli t3, t3, 1 addi t2, t2, -1 jal x0, encode_while_1 encode_while_2: li t4, 15 bge t2, t4, encode_return slli t6, t3, 1 addi t6, t6, 16 encode_if_3: blt s0, t6, encode_return # if(value < nextoverflow) then break addi t3, t6, 0 addi t2, t2, 1 jal x0, encode_while_2 encode_return: sub t5, s0, t3 srl t5, t5, t2 # mantissa = (value - overflow) >> exponent slli a0, t2, 4 or a0, a0, t5 lw s0, 4(sp) lw ra, 0(sp) addi sp, sp, 8 jr ra # can be omitted encode_return_lt16: jr ra ``` #### Test cases In this case, I want to use the test 
function in the original C code as the test cases. The test function encodes and decodes 0 to 255 to bf16 format. And I will encode from 0 to 255 and verify that whether the format is correct. All test passed! Program exited with code: 0 According to the test, the functionality of decode and encode are correct. #### Compare to Compiler Below is the execution info of my RISC-V code. (Run in 5-stage pipeline CPU) ``` // My approach 1 Cycles: 40004 Instrs. retired: 28164 CPI: 1.42 IPC: 0.74 Clock rate: 0Hz ``` And below is the execution info of the compiler(Compiler Explorer -O2) (Run in 5-stage pipeline CPU) > GitHub: [q1-uf8-complier.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/quiz1_approach/q1-uf8-compiler.S) ``` // Compiler-optimized Cycles: 106898 Instrs. retired: 73350 CPI: 1.46 IPC: 0.686 Clock rate: 152.73 Hz ``` **Performance:** Our method is about **2.67x faster** than the compiler's. **Code Size:** Compared to the compiler-optmized version, my improved approach reduces the code size by nearly **52%**. ___ ### ==Approach 2== > [Quiz 1 Problem B Approach 2](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/quiz1_approach/q1-uf8-approach-2.S) ```asm= .data overflow_1: .word 0x10 overflow_2: .word 0x30 overflow_3: .word 0x70 overflow_4: .word 0xF0 overflow_5: .word 0x1F0 overflow_6: .word 0x3F0 overflow_7: .word 0x7F0 overflow_8: .word 0xFF0 overflow_9: .word 0x1FF0 overflow_10: .word 0x3FF0 overflow_11: .word 0x7FF0 overflow_12: .word 0xFFF0 overflow_13: .word 0x1FFF0 overflow_14: .word 0x3FFF0 overflow_15: .word 0x7FFF0 pass_msg: .string "All test passed!" newline: .string "\n" .align 2 .text .globl main main: # li a0, 54372 lui a0, 13 addi a0, a0, 1124 jal uf8_encode # call function add s0, x0, a0 # save result addi a7, x0, 1 # print integer addi a0, s0, 0 ecall la a0, newline addi a7, x0, 4 ecall addi a7, x0, 10 # exit ecall # a0: x # t0: n # t1: c # t2: y clz: addi t0, x0, 32 # n = 32 addi t1, x0, 16 # c = 16 clz_while: srl t2, a0, t1 # y = x >> c beq t2, x0, clz_skip sub t0, t0, t1 # n -= c addi a0, t2, 0 # x = y clz_skip: srli t1, t1, 1 # c >>= 1 bnez t1, clz_while # while (c != 0) clz_ret: sub a0, t0, a0 # return n - x jr ra # a0: f1 # t0: mantissa # t1: exponent # t2: offset uf8_decode: andi t0, a0, 0x0f # mantissa = f1 & 0x0f srli t1, a0, 4 # exponent = f1 >> 4 addi t2, x0, 15 sub t2, t2, t1 addi t3, x0, 0x7F slli t3, t3, 8 ori t3, t3, 0xFF srl t3, t3, t2 slli t3, t3, 4 sll a0, t0, t1 add a0, a0, t3 jr ra # a0: value # t0: lz # t1: msb # t2: exponent # t3: overflow # t4: temp # t5: mantissa # t6: next_overflow uf8_encode: addi t4, x0, 16 blt a0, t4, encode_return_lt16 # if(value < 16) return value encode_ge16: # function call, need to save ra, a0 addi sp, sp, -8 sw ra, 0(sp) sw s0, 4(sp) addi s0, a0, 0 # save value jal clz addi t0, a0, 0 # lz addi t1, x0, 31 sub t1, t1, t0 # msb = 31 - lz addi t2, x0, 0 # exponent = 0 addi t3, x0, 0 # overflow = 0 addi t4, x0, 5 encode_if_1: blt t1, t4, encode_while_2 # if(msb < 5) goto encode_while_2 addi t2, t1, -4 # exponent = msb - 4 encode_if_2: addi t4, x0, 15 ble t2, t4, encode_for_1 # if (exponent <= 15) goto encode_for_1 addi t2, x0, 15 encode_overflow: # calculate overflow using memory lookup la, t3, overflow_1 # t3 = base address of table addi t4, t2, -1 # t4 = exponent - 1 slli t4, t4, 2 add t3, t3, t4 # t3 = &overflow_{exponent} lw t3, 0(t3) # t3 = overflow encode_while_1: # adjust exponent # need to be optimized beq t2, x0, encode_while_2 slt t4, s0, t3 # t4 = (value < overflow) beq t4, x0, 
encode_while_2 addi t3, t3, -16 srli t3, t3, 1 addi t2, t2, -1 jal x0, encode_while_1 encode_while_2: # optimize later # find exact exponent addi t4, x0, 15 bge t2, t4, encode_return slli t6, t3, 1 addi t6, t6, 16 encode_if_3: blt s0, t6, encode_return addi t3, t6, 0 addi t2, t2, 1 jal x0, encode_while_2 encode_return: sub t5, s0, t3 srl t5, t5, t2 # mantissa = (value - overflow) >> exponent slli a0, t2, 4 or a0, a0, t5 lw s0, 4(sp) lw ra, 0(sp) addi sp, sp, 8 # restore stack jr ra encode_return_lt16: jr ra ``` In my approach 2, I used lookup table to accelerate the procedure. As we had discussed in my approach 1, the overflow has some pattern in the for loop, thus we can put all patterns into memory, and we can quickly get the overflow. ```asm= encode_overflow: # calculate overflow using memory lookup la t3, overflow_1 # t3 = base address of table addi t4, t2, -1 # t4 = exponent - 1 slli t4, t4, 2 add t3, t3, t4 # t3 = &overflow_{exponent} lw t3, 0(t3) # t3 = overflow ``` As we do in the first approach, the overflow can be calculated by exponent bits leading 1's and fixed 4 bits 0's. So we access the memory table with the offset of the number of leading 1's. ```asm= .data overflow_1: .word 0x10 overflow_2: .word 0x30 overflow_3: .word 0x70 overflow_4: .word 0xF0 overflow_5: .word 0x1F0 overflow_6: .word 0x3F0 overflow_7: .word 0x7F0 overflow_8: .word 0xFF0 overflow_9: .word 0x1FF0 overflow_10: .word 0x3FF0 overflow_11: .word 0x7FF0 overflow_12: .word 0xFFF0 overflow_13: .word 0x1FFF0 overflow_14: .word 0x3FFF0 overflow_15: .word 0x7FFF0 ``` Above is the lookup table, I list overflow in accending order. #### Test Cases All tests passed! Program exited with code: 0 My approach 2 passes the test too. #### Compare to compiler The below figure is the execution info of my approach2. (Run in 5-stage pipeline CPU) ``` // My approach 2 Cycles: 40486 Instrs. retired: 28628 CPI: 1.41 IPC: 0.707 Clock rate: 0Hz ``` **Performance:** My approach 2 is about **2.64x faser** than the compiler-optimized version. **Code Size:** Compared to the compiler-optmized version, my approach 2 reduces the code size by nearly **42%**. According to the clock cycles, the approach 2 is slower than the approach 1 slightly. But I think this is a good try for me to come up with the lookup table method, even though it does't perform better than my approach 1. --- ## Problem C ### ==bf16_sqrt== > GitHub: [bf16_sqrt](https://github.com/dorobouharu133/ca2025-quizzes/tree/main/bf16_sqrt) > Because the `mul` instruction is not included in the [RV32I instructions](https://docs.riscv.org/reference/isa/unpriv/rv32.html), I build up the architecture of the RISC-V code with the `mul` instruction, and I will develop some way to replace the `mul` instruction. > I translate the C code to RISC-V code line-by-line first, and use `mul`. In the following steps, I will optimize the RISC-V code. #### bf16_sqrt_with_mul > GitHub: [bf16_sqrt_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/8052ba042ba0f1216df9d70e5d97e2f7e00ebeb1/bf16_sqrt.S) ```asm= .data newline: .string "\n" pass_msg: .string "bf16_sqrt All tests passed!" fail_msg: .string "Tests failed!" 
.text .globl bf16_sqrt .globl __mulsi3 .globl main main: # sqrt(+inf) = +inf test_1: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0x80 jal ra, bf16_sqrt addi s0, a0, 0 addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0x80 bne s0, t0, test_failed # sqrt(NaN) = NaN test_2: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0xC0 jal ra, bf16_sqrt addi s0, a0, 0 addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0xC0 bne s0, t0, test_failed # sqrt(-1.000000) = NaN test_3: addi a0, x0, 0xBF slli a0, a0, 8 ori a0, a0, 0x80 jal ra, bf16_sqrt addi s0, a0, 0 addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0xC0 bne s0, t0, test_failed # sqrt(2.000000) = 1.414062 test_4: addi a0, x0, 0x40 slli a0, a0, 8 jal ra, bf16_sqrt addi s0, a0, 0 addi t0, x0, 0x3F slli t0, t0, 8 ori t0, t0, 0xB5 bne s0, t0, test_failed # sqrt(1.000000) = 1.000000 test_5: addi a0, x0, 0x3F slli a0, a0, 8 ori a0, a0, 0x80 jal ra, bf16_sqrt addi t0, x0, 0x3F slli t0, t0, 8 ori t0, t0, 0x80 bne a0, t0, test_failed # sqrt(0.000000) = 0.000000 test_6: addi a0, x0, 0 jal ra, bf16_sqrt addi t0, x0, 0x0000 bne a0, t0, test_failed # sqrt(2648071397852762090550553127902248960.000000) = 1630303065108119552.000000 test_7: addi a0, x0, 0x7B slli a0, a0, 8 ori a0, a0, 0xFF jal ra, bf16_sqrt addi t0, x0, 0x5D slli t0, t0, 8 ori t0, t0, 0xB5 bne a0, t0, test_failed test_passed: la a0, pass_msg addi a7, x0, 4 ecall jal x0, end test_failed: la a0, fail_msg addi a7, x0, 4 ecall end: la a0, newline addi a7, x0, 4 ecall addi a7, x0, 10 ecall # a0: a # t0: sign # t1: exp # t2: mant # t3: temp # t4: e # t5: m # t6: new_exp # s0: low # s1: high # s2: result # s3: mid # s4: sq # s5: new_mant bf16_sqrt: srli t0, a0, 15 srli t1, a0, 7 andi t1, t1, 0xFF # exp = (a.bits >> 7) & 0xFF andi t2, a0, 0x7F # mant = a.bits & 0x7F sqrt_special_case: addi t3, x0, 0xFF bne t1, t3, sqrt_skip_1 # if(exp != 0xFF) then goto sqrt_skip_1 bne t2, x0, sqrt_return_a bne t0, x0, sqrt_return_bf16_nan jal x0, sqrt_return_a sqrt_skip_1: bne t1, x0, sqrt_skip_2 beq t2, x0, sqrt_return_bf16_zero sqrt_skip_2: bne t0, x0, sqrt_return_bf16_nan sqrt_skip_3: beq t1, x0, sqrt_return_bf16_zero addi t4, t1, -127 # e = exp - BF16_EXP_BIAS ori t5, t2, 0x80 # m = 0x80 | mant adjust_for_odd_exponents_if: andi t3, t4, 1 beq t3, x0, adjust_for_odd_exponents_else slli t5, t5, 1 addi t3, t4, -1 # t4 = (e - 1) srli t3, t3, 1 addi t6, t3, 127 # new_exp = ((e-1) >> 1) + BF16_EXP_BIAS beq x0, x0, sqrt_binary_search_initialize adjust_for_odd_exponents_else: srli t3, t4, 1 addi t6, t3, 127 sqrt_binary_search_initialize: addi s0, x0, 90 # low = 90 addi s1, x0, 256 # high = 256 addi s2, x0, 128 # result = 128 sqrt_while_loop: # binary_search bgt s0, s1, sqrt_normalize_greater_equal add s3, s0, s1 # s3 = (low + high) srli s3, s3, 1 # mid = (low + high) >> 1 sqrt_mul: ## mul is not allowed, need to be modified !! # s4: sq # mid*mid implement mul s4, s3, s3 # sq = mid * mid, need to be replaced srli s4, s4, 7 sqrt_while_loop_if: bgt s4, t5, sqrt_while_loop_else addi s2, s3, 0 # result = mid addi s0, s3, 1 # low = mid + 1 jal x0, sqrt_while_loop sqrt_while_loop_else: addi s1, s3, -1 # high = mid - 1 jal x0, sqrt_while_loop sqrt_normalize_greater_equal: # branchless addi t3, x0, 256 sltu t3, s2, t3 # t3 = (result < 256) ? 1 : 0 xori t3, t3, 1 # t3 = (t3) ? 
0 : 1, reverse t3 srl s2, s2, t3 add t6, t6, t3 # t6: new_exp jal x0, sqrt_normalize_done sqrt_normalize_less: addi s6, x0, 128 sqrt_normalize_less_loop: bge s2, s6, sqrt_normalize_done addi t3, x0, 1 ble t6, t3, sqrt_normalize_done slli s2, s2, 1 addi t6, t6, -1 jal x0, sqrt_normalize_less_loop sqrt_normalize_done: andi s5, s2, 0x7F # new_mant = result & 0x7F, use t2 to store new_mant addi t3, x0, 0xFF bge t6, t3, sqrt_return_pInf ble t6, x0, sqrt_return_bf16_zero andi t3, t6, 0xFF slli t3, t3, 7 or a0, t3, s5 # .bits = (new_exp & 0xFF) << 7 | new_mant jr ra sqrt_return_a: jr ra sqrt_return_bf16_nan: li a0, 0x7FC0 jr ra sqrt_return_bf16_zero: andi a0, x0, 0 jr ra sqrt_return_pInf: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0x80 jr ra ``` > **Note**: The code above serves as a **baseline** for implementing multiplication without the `mul` instruction. #### bf16_sqrt Test Cases Using the `mul` instruction, I want to verify whether the functionality of my approach is correct or not first. Let's choose some corner cases as the test cases first. ``` 1. sqrt(`0x7F80`) = `0x7F80`, sqrt(+inf) = +inf 2. sqrt(`0x7FC0`) = `0x7FC0`, sqrt(NaN) = NaN 3. sqrt(`0xBF80`) = `0x7FC0`, sqrt(-1.000000) = NaN 4. sqrt(`0x4000`) = `0x3FB5`, sqrt(2.000000) = 1.414062 5. sqrt(`0x3F80`) = `0x3F80`, sqrt(1.000000) = 1.000000 6. sqrt(`0x0000`) = `0x0000`, sqrt(0.000000) = 0.000000 7. sqrt(`0x7BFF`) = `0x5DB5`, sqrt(2648071397852762090550553127902248960.000000) = 1630303065108119552.000000 ``` (Run in 5-stage pipeline CPU) > GitHub: [bf16_sqrt_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/8052ba042ba0f1216df9d70e5d97e2f7e00ebeb1/bf16_sqrt.S) ``` // My original approach for bf16_sqrt Cycles: 555 Instrs. retired: 403 CPI: 1.38 IPC: 0.726 Clock rate: 584.21 Hz ``` ___ #### ==bf16_sqrt_shift_add== The next step to develop an alternative method to replace the `mul` instruction in the `bf16_sqrt_with_mul` function. A common approach is to manually implement multiplication, which we can motivate by revisting the classic [**long multiplication**](https://mathworld.wolfram.com/LongMultiplication.html). The [**Shift-and-Add Multiplication**](https://users.utcluj.ro/%7Ebaruch/book_ssce/SSCE-Shift-Mult.pdf) method provides a strightforward way to realize the concept of [**long multiplication**](https://mathworld.wolfram.com/LongMultiplication.html) in software. We can use a simple binary multiplication example to illustrate how the shift-and-add method is implemented. For example, `0x81` $\times$ `0xC3` $=$ `0x6243` ($129 \times 195 = 25{,}155$). In binary `1000 0001`(Multiplicand) $\times$ `1100 0011` (Multiplier) $=$ `110001001000011`. ``` 1000 0001 <- Multiplicand x 1100 0011 <- Multiplier ------------------------------------------- pass0: 1000 0001 (Multiplier bit 0 is 1: Add) pass1: 1000 0001 (Multiplier bit 1 is 1: Add) pass2: 0000 0000 (Multiplier bit 2 is 0: Skip) pass3: 0000 0000 (Multiplier bit 3 is 0: Skip) pass4: 0000 0000 (Multiplier bit 4 is 0: Skip) pass5: 0000 0000 (Multiplier bit 5 is 0: Skip) pass6: 1000 0001 (Multiplier bit 6 is 1: Add) pass7: 1000 0001 (Multiplier bit 7 is 1: Add) ------------------------------------------- Result:1100 0100 100 0011 (0x6243) ``` **Algorithm Logic**: In each pass, we examine the **LSB** of the **Multiplier**: - If bit 0 is 1, add the current Multiplicand to the partial product. - If bit 0 is 0, skip the addition. After each pass, we **shift the Multiplicand 1 bit to the left** and **shift the Multiplier 1 bit to the right**. 
This allows us to consistently check the LSB of the Multiplier in every iteration. The process continues until the Multiplier becomes zero. I implement the **shift-and-add** logic as the below code: ```asm= # a0: multiplicand # a1: multiplier # t0: accumulator # t1: multiplier_copy # t2: bit0 shift_and_add: # save stack if needed addi t0, x0, 0 addi t1, a1, 0 andi t2, a1, 1 shift_and_add_for_loop: andi t2, t1, 1 beq t2, x0, shift_and_add_for_skip_add add t0, t0, a0 shift_and_add_for_skip_add: slli a0, a0, 1 srli t1, t1, 1 bne t1, x0, shift_and_add_for_loop shfit_and_add_end: # restore stack if needed addi a0, t0, 0 jr ra ``` And I implement a test to verify the correctness of the algorithm. ```asm= .data newline: .string "\n" pass_msg: .string "SHIFT-AND-ADD implementation successful!" fail_msg: .string "SHIFT-AND-ADD implementation failed!" .text .global shift_and_add .global main main: addi a0, x0, 0x80 addi a1, x0, 0xC0 jal ra, shift_and_add addi s0, a0, 0 addi t0, x0, 0x60 slli t0, t0, 8 bne s0, t0, test_failed test_passed: la a0, pass_msg addi a7, x0, 4 ecall jal x0, end test_failed: la a0, fail_msg addi a7, x0, 4 ecall end: la a0, newline addi a7, x0, 4 ecall addi a7, x0, 10 ecall # a0: multiplicand # a1: multiplier # t0: accumulator # t1: multiplier_copy # t2: bit0 shift_and_add: # save stack if needed addi t0, x0, 0 addi t1, a1, 0 andi t2, a1, 1 shift_and_add_for_loop: andi t2, t1, 1 beq t2, x0, shift_and_add_for_skip_add add t0, t0, a0 shift_and_add_for_skip_add: slli a0, a0, 1 srli t1, t1, 1 bne t1, x0, shift_and_add_for_loop shfit_and_add_end: # restore stack if needed addi a0, t0, 0 jr ra ``` SHIFT-AND-ADD implementation successful! We can now replace the `mul` instruction in the original code (note that the code must be modified to save and restore the stack). > GitHub: [bf16_sqrt.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/7e584d6ae8203d226c6e8bb119ce17bd55b7bdeb/bf16_sqrt.S) ```asm= .data newline: .string "\n" pass_msg: .string "All tests passed!" fail_msg: .string "Tests failed!" 
.text .global bf16_sqrt .global main main: # sqrt(+inf) = +inf test_1: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0x80 jal ra, bf16_sqrt addi s0, a0, 0 addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0x80 bne s0, t0, test_failed # sqrt(NaN) = NaN test_2: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0xC0 jal ra, bf16_sqrt addi s0, a0, 0 addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0xC0 bne s0, t0, test_failed # sqrt(-1.000000) = NaN test_3: addi a0, x0, 0xBF slli a0, a0, 8 ori a0, a0, 0x80 jal ra, bf16_sqrt addi s0, a0, 0 addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0xC0 bne s0, t0, test_failed # sqrt(2.000000) = 1.414062 test_4: addi a0, x0, 0x40 slli a0, a0, 8 jal ra, bf16_sqrt addi s0, a0, 0 addi t0, x0, 0x3F slli t0, t0, 8 ori t0, t0, 0xB5 bne s0, t0, test_failed # sqrt(1.000000) = 1.000000 test_5: addi a0, x0, 0x3F slli a0, a0, 8 ori a0, a0, 0x80 jal ra, bf16_sqrt addi t0, x0, 0x3F slli t0, t0, 8 ori t0, t0, 0x80 bne a0, t0, test_failed # sqrt(0.000000) = 0.000000 test_6: addi a0, x0, 0 jal ra, bf16_sqrt addi t0, x0, 0x0000 bne a0, t0, test_failed # sqrt(2648071397852762090550553127902248960.000000) = 1630303065108119552.000000 test_7: addi a0, x0, 0x7B slli a0, a0, 8 ori a0, a0, 0xFF jal ra, bf16_sqrt addi t0, x0, 0x5D slli t0, t0, 8 ori t0, t0, 0xB5 bne a0, t0, test_failed test_passed: la a0, pass_msg la a0, pass_msg addi a7, x0, 4 ecall jal x0, end test_failed: la a0, fail_msg addi a7, x0, 4 ecall end: la a0, newline addi a7, x0, 4 ecall addi a7, x0, 10 ecall # a0: a # t0: sign # t1: exp # t2: mant # t3: temp # t4: e # t5: m # t6: new_exp # s0: low # s1: high # s2: result # s3: mid # s4: sq # s5: new_mant bf16_sqrt: srli t0, a0, 15 srli t1, a0, 7 andi t1, t1, 0xFF # exp = (a.bits >> 7) & 0xFF andi t2, a0, 0x7F # mant = a.bits & 0x7F sqrt_special_case: addi t3, x0, 0xFF bne t1, t3, sqrt_skip_1 # if(exp != 0xFF) then goto sqrt_skip_1 bne t2, x0, sqrt_return_a bne t0, x0, sqrt_return_bf16_nan jal x0, sqrt_return_a sqrt_skip_1: bne t1, x0, sqrt_skip_2 beq t2, x0, sqrt_return_bf16_zero sqrt_skip_2: bne t0, x0, sqrt_return_bf16_nan sqrt_skip_3: beq t1, x0, sqrt_return_bf16_zero addi t4, t1, -127 # e = exp - BF16_EXP_BIAS ori t5, t2, 0x80 # m = 0x80 | mant adjust_for_odd_exponents_if: andi t3, t4, 1 beq t3, x0, adjust_for_odd_exponents_else slli t5, t5, 1 addi t3, t4, -1 # t4 = (e - 1) srli t3, t3, 1 addi t6, t3, 127 # new_exp = ((e-1) >> 1) + BF16_EXP_BIAS beq x0, x0, sqrt_binary_search_initialize adjust_for_odd_exponents_else: srli t3, t4, 1 addi t6, t3, 127 sqrt_binary_search_initialize: addi s0, x0, 90 # low = 90 addi s1, x0, 256 # high = 256 addi s2, x0, 128 # result = 128 sqrt_while_loop: # binary_search bgt s0, s1, sqrt_normalize_greater_equal add s3, s0, s1 # s3 = (low + high) srli s3, s3, 1 # mid = (low + high) >> 1 sqrt_mul: # SHIFT-AND-ADD # s4: sq # mid*mid implemented by shift-and-add addi sp, sp, -12 sw ra, 0(sp) sw a0, 4(sp) sw a1, 8(sp) addi a0, s3, 0 addi a1, s3, 0 jal ra, shift_and_add addi a3, a0, 0 lw ra, 0(sp) lw a0, 4(sp) lw a1, 8(sp) addi sp, sp, 12 addi s4, a3, 0 srli s4, s4, 7 sqrt_while_loop_if: bgt s4, t5, sqrt_while_loop_else addi s2, s3, 0 # result = mid addi s0, s3, 1 # low = mid + 1 jal x0, sqrt_while_loop sqrt_while_loop_else: addi s1, s3, -1 # high = mid - 1 jal x0, sqrt_while_loop sqrt_normalize_greater_equal: # branchless addi t3, x0, 256 sltu t3, s2, t3 # t3 = (result < 256) ? 1 : 0 xori t3, t3, 1 # t3 = (t3) ? 
0 : 1, reverse t3 srl s2, s2, t3 add t6, t6, t3 # t6: new_exp jal x0, sqrt_normalize_done sqrt_normalize_less: addi s6, x0, 128 sqrt_normalize_less_loop: bge s2, s6, sqrt_normalize_done addi t3, x0, 1 ble t6, t3, sqrt_normalize_done slli s2, s2, 1 addi t6, t6, -1 jal x0, sqrt_normalize_less_loop sqrt_normalize_done: andi s5, s2, 0x7F # new_mant = result & 0x7F, use t2 to store new_mant addi t3, x0, 0xFF bge t6, t3, sqrt_return_pInf ble t6, x0, sqrt_return_bf16_zero andi t3, t6, 0xFF slli t3, t3, 7 or a0, t3, s5 # .bits = (new_exp & 0xFF) << 7 | new_mant jr ra sqrt_return_a: jr ra sqrt_return_bf16_nan: li a0, 0x7FC0 jr ra sqrt_return_bf16_zero: andi a0, x0, 0 jr ra sqrt_return_pInf: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0x80 jr ra # a0: multiplicand # a1: multiplier # t0: accumulator # t1: multiplier_copy # t2: bit0 shift_and_add: addi sp, sp, -12 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) addi t0, x0, 0 addi t1, a1, 0 andi t2, a1, 1 shift_and_add_for_loop: andi t2, t1, 1 beq t2, x0, shift_and_add_for_skip_add add t0, t0, a0 shift_and_add_for_skip_add: slli a0, a0, 1 srli t1, t1, 1 bne t1, x0, shift_and_add_for_loop shfit_and_add_end: addi a1, t0, 0 lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) addi sp, sp, 12 addi a0, a1, 0 jr ra ``` #### bf16_sqrt Test Cases (Run in 5-stage pipeline CPU) > GitHub: [bf16_sqrt_test.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sqrt/bf16_sqrt_test.S) ``` // My improved approach for bf16_sqrt Cycles: 2726 Instrs. retired: 2008 CPI: 1.36 IPC: 0.737 Clock rate: 259.26 Hz ``` > GitHub: [bf16_sqrt_compiler.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sqrt/bf16_sqrt_compiler.S) ``` // bf16_sqrt compiler version Cycles: 3315 Instrs. retired: 2278 CPI: 1.46 IPC: 0.687 Clock rate: 7.37 KHz ``` **Performance:** My improved approach is **1.21x faster** than the compiler-optimized version. **Code Size:** Compared to the compiler-optimized version, my improved approach reduces the code size by nearly **39%**. While shift-and-add is an alternative to the mul instruction, its performance is significantly lower; it is approximately **5.1x slower** than the native hardware approach. ___ ### ==bf16_add== > GitHub: [bf16_add](https://github.com/dorobouharu133/ca2025-quizzes/tree/main/bf16_add) The following code is the initial template for my **bf16_add** implementation. I will introduce the optimized version and provide the details in the next section. #### [bf16_add_original code](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_add/bf16_add_original.S) The link above is the implementation of my original approach. #### bf16_add Test Cases ``` 1. `0x3F80` + `0xBF80`, +1.000000 + (-1.000000) = 0.000000 2. `0x7F7F` + `0x7F7F`, maximum finite positive number + minimal normalized number = +inf 3. `0x3F80` + `0x3B80`, 1.0000000 + 2^-8 = 1.000000 4. `0x7F80` + `0xFF80`, inf + (-inf) = NaN ``` > GitHub: [bf16_add_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_add/bf16_add_original.S) (Run in 5-stage pipeline CPU) ``` // My original approach for bf16_add Cycles: 247 Instrs. retired: 189 CPI: 1.31 IPC: 0.765 Clock rate: 228.28 Hz ``` The above figure is the execution info of my approach, and the below is the execution info of compiler. (Run in 5-stage pipeline CPU) > GitHub: [bf16_add_compiler.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_add/bf16_add_compiler.S) ``` // Compiler-optmized bf16_add Cycles: 259 Instrs. 
retired: 207 CPI: 1.25 IPC: 0.799 Clock rate: 254.92 Hz
```

My original approach is about **1.049x faster** than the compiler-optimized version (**O2-optimized**).

#### bf16_add_improved

**Observation:** This section implements the **alignment** logic.

```c=
if (exp_diff > 0) {
    result_exp = exp_a;
    if (exp_diff > 8)
        return a;
    mant_b >>= exp_diff;
} else if (exp_diff < 0) {
    result_exp = exp_b;
    if (exp_diff < -8)
        return b;
    mant_a >>= -exp_diff;
} else {
    result_exp = exp_a;
}
```

In floating-point addition, numbers with different exponents cannot be summed directly. For example, to add $3.0 \times 10^{4}$ and $1.5 \times 10^{2}$, we cannot simply add the mantissas ($3.0 + 1.5$); the **exponents** must be aligned first.

1. **Alignment Process:** The number with the smaller exponent is adjusted to match the larger exponent.
$$
1.5 \times 10^{2} \to 0.15 \times 10^{3} \to 0.015 \times 10^{4}
$$
2. **Addition:** Now we can sum the two numbers:
$$
3.0 \times 10^{4} + 0.015 \times 10^{4} = (3.0 + 0.015) \times 10^{4} = 3.015 \times 10^{4}
$$

> **Note:** After addition, it is necessary to check whether normalization is required. In this case, the result is already normalized, so no further adjustment is needed.

I implement a **branchless swap** to ensure that the **magnitude of operand A** is always greater than or equal to that of **operand B** (i.e., $|A| \ge |B|$). Below I explain how branchless logic swaps A and B. Since we only require $|A| \ge |B|$, only the magnitudes matter at this point; `result_exp` can be determined later.

The first step is to take the absolute values, which we can do with a mask.

```asm=
addi t0, x0, 0x7F
slli t0, t0, 8
ori  t0, t0, 0xFF   # mask = 0x7FFF
```

Because `0x7FFF` = `0111 1111 1111 1111` in binary, ANDing with it clears the sign bit of a and b. This is implemented in my code as below:

```asm=
bf16_add_align_branchless:
    # a0: a
    # a1: b
    # s0: unsigned_a
    # s1: unsigned_b
    # t6: temp, used as mask
    addi t6, x0, 0x7F
    slli t6, t6, 8
    ori  t6, t6, 0xFF   # mask = 0x7FFF
    and  s0, a0, t6     # unsigned_a = a.bits & mask
    and  s1, a1, t6     # unsigned_b = b.bits & mask
    slt  t6, s0, s1     # t6 = (|A| < |B|) ? 1 : 0
    sub  t6, x0, t6     # t6 = (|A| < |B|) ? 0xFFFF : 0x0
```

Here `t6` serves as a mask:
- if $|A| < |B|$, then t6 = `0xFFFF`. (**Need to swap**)
- if $|A| \ge |B|$, then t6 = `0x0000`. (**No need to swap**)

This mask (t6) can then be used to swap the sign, exponent, and mantissa simultaneously. The swap relies on XOR; the relevant properties are:
$$
\begin{aligned}
A \oplus 0 &= A \\
(A \oplus B) \oplus A &= B
\end{aligned}
$$
$$
\begin{aligned}
\text{If} \quad & C = A \oplus B \\
\text{then} \quad & C \oplus A = B \\
\text{and} \quad & C \oplus B = A
\end{aligned}
$$

`and s0, s0, t6` decides whether the swap happens or not. If $C = A \oplus B$, then
- $C \oplus \text{0x0000} = C$
- $C \oplus \text{0xFFFF}$ = the 1's complement of $C$

> **Note:** A, B, C here are all 16-bit binary numbers.

```asm=
# Swap sign bit
# t0: sign_a
# t1: sign_b
# t6: mask (0xFFFF or 0x0000)
# s0: sign_a ^ sign_b
xor s0, t0, t1      # s0 = sign_a ^ sign_b
and s0, s0, t6      # use mask to decide whether to swap
xor t0, t0, s0
xor t1, t1, s0
```

For example, suppose the mask (t6) is `0xFFFF`, exp_a (t0) = `0x1001`, and exp_b (t1) = `0x1011`.

> **Note:** If the mask is `0xFFFF`, i.e. a swap is needed, exp_a is smaller than exp_b.
``` s0 = 0x1001 ^ 0x1011 = 0x0010 s0 & t6 = 0x0010 & 0xFFFF = 0x0010 t0 ^ 0x0010 = 0x1001 ^ 0x0010 = 0x1011 t1 ^ 0x0010 = 0x1011 ^ 0x0010 = 0x1001 ``` And another example, mask is `0x0000`, which means that there is no need to swap, and exp_a = `0x1100` and exp_b = `0x0011`. ``` s0 = 0x1100 ^ 0x0011 = 0x1111 s0 & t6 = 0x1111 & 0x0000 = 0x0000 t0 ^ 0x0000 = 0x1100 ^ 0x0000 = 0x1100 t1 ^ 0x0000 = 0x0011 ^ 0x0000 = 0x0011 ``` Repeat the above procedure three times to implement branchless swap. Below is the complete code of branchless swap implementation: ```asm= bf16_add_align_branchless: # a0: a # a1: b # s0: unsigned_a # s1: unsigned_b # t6: temp, used as mask addi t6, x0, 0x7F slli t6, t6, 8 ori t6, t6, 0xFF # mask = 0x7FFF andi s0, a0, t6 # unsigned_a = a.bits & mask andi s1, a1, t6 # unsigned_b = b.bits & mask slt t6, s0, s1 # t6 = (|A| < |B|) ? 1 : 0 sub t6, x0, t6 # t6 = (|A| < |B|) ? 0xFFFF : 0x0 bf16_add_branchless_swap_sign: # swap sign bit # t0: sign_a # t1: sign_b # t6: mask (0xFFFF or 0x0000) # s0: sign_a ^ sign_b xor s0, t0, t1 # s0 = sign_a ^ sign_b and s0, s0, t6 # use mask to check swap or not xor t0, t0, s0 xor t1, t1, s0 bf16_add_branchless_swap_exponent: # swap exponent # t2: exp_a # t3: exp_b # t6: mask (0xFFFF or 0x0000) # s0: exp_a ^ exp_b xor s0, t2, t3 # s0 = exp_a ^ exp_b and s0, s0, t6 xor t2, t2, s0 xor t3, t3, s0 bf16_add_branchless_swap_mantissa: # swap mantissa # t4: mant_a # t5: mant_b # t6: mask (0xFFFF or 0x0000) # s0: mant_a ^ mant_b xor s0, t4, t5 # s0 = mant_a ^ mant_b and s0, s0, t6 xor t4, t4, s0 xor t5, t5, s0 ``` We have to determine `exp_diff` and align the smaller part (i.e., `mant_b`) in the remaining part. ```c= int16_t exp_diff = exp_a - exp_b; uint16_t result_sign; int16_t result_exp; uint32_t result_mant; if (exp_diff > 0) { result_exp = exp_a; if (exp_diff > 8) return a; mant_b >>= exp_diff; } else if (exp_diff < 0) { result_exp = exp_b; if (exp_diff < -8) return b; mant_a >>= -exp_diff; } else { result_exp = exp_a; } ``` We also want to convert to above code into branchless logic, because the `exp_a` and `mant_a` are always larger than or euqal to the corresponding parts, we can **ignore** the part that deal with `exp_a` and `mant_a`. The above code can be simplified as below: ```c= int16_t exp_diff = exp_a - exp_b; uint16_t result_sign; int16_t result_exp; uint32_t result_mant; if(exp_diff > 0) { result_exp = exp_a; if (exp_diff > 8) return a; mant_b >>= exp_diff; } else if (exp_diff == 0) { result_exp = exp_a; } ``` > **Note:** `exp_diff` = `exp_a` - `exp_b` $\ge$ 0. We can use `srl` instruction to alignment `mant_b`, and - if exp_diff > 8, then return a. - else if 0 <= exp_diff and 8 >= exp_diff, retain `mant_b`. > **Note:** `exp_diff` > 8 means that b is negligible relative to a, thus mant_b can be zeroed out by the mask. ```asm= # s3: exp_diff # t6: mask sub s3, t2, t3 # exp_diff = exp_a - exp_b addi t6, x0, 8 slt t6, t6, s3 sub t6, x0, t6 # t6 = (8 < exp_diff) ? 0xFFFFFFFF : 0x00000000 xori t6, t6, -1 # t6 = (8 < exp_diff) ? 
0x00000000 : 0xFFFFFFFF # align mant_b sra t5, t5, s3 # mant_b >>= exp_diff and t5, t5, t6 # mant_b &= ~mask ``` **Observation:** Logic Simplification ```c= else { if (mant_a >= mant_b) { result_sign = sign_a; result_mant = mant_a - mant_b; } else { result_sign = sign_b; result_mant = mant_b - mant_a; } if (!result_mant) return BF16_ZERO(); while (!(result_mant & 0x80)) { result_mant <<= 1; if (--result_exp <= 0) return BF16_ZERO(); } } ``` Since we have enforeced $|A| \ge |B|$ using the **branchless swap**, this section can be significantly simplifed. ```c= // This conditional block is removed in my approach if (mant_a >= mant_b) { result_sign = sign_a; result_mant = mant_a - mant_b; } else { result_sign = sign_b; result_mant = mant_b - mant_a; } ``` In my approach, this condition check block is reduced to a single subtraction: `result_mant = mant_a - mant_b` as $|A| \ge |B|$ is guaranteed. Implement the above logic into branchless code is as below: ```asm= bf16_add_same_sign: add s3, t4, t5 # result_mant = mant_a + mant_b andi t2, s3, 0x100 # result & 0x100, check if carry out beq t2, x0, bf16_add_return srli s3, s3, 1 # result_mant >>= 1 addi s1, s1, 1 # result_exp += 1 addi t6, x0, 0xFF # check for overflow bge s1, t6, bf16_add_return_overflow jal x0, bf16_add_return bf16_add_diff_sign: sub s3, t4, t5 # result_mant = mant_a - mant_b beq s3, x0, bf16_add_return_bf16_zero bf16_add_normalize_loop: # normalize, maybe can optimize later andi t6, s3, 0x80 bne t6, x0, bf16_add_return slli s3, s3, 1 # result_mant <<= 1 addi s1, s1, -1 # result_exp -= 1 blt s1, x0, bf16_add_return_bf16_zero jal x0, bf16_add_normalize_loop ``` ##### Bug Fixed (Guard Bits) > GitHub: [bf16_add_with_guard_bits](https://github.com/dorobouharu133/ca2025-quizzes/blob/b724ad166d63b936c33c774758720f2d98571cb7/bf16_add.S) ``` // Commit message Fixed three logic errors in bf16_add improved approach: 1. Initialize result exponent s1 correctly before calculation loop. 2. Left shift mantissa 3 bits to protect precision. 3. Updated overflow mask from 0x100 to 0x800 due to point 2. ``` ``` Test Case 3: 1.0000000 + 2^-8 = 1.000000 ``` Since the above code faced a precision problem, I preserved 3 bits as guard bits. ```asm= # t4: mant_a # t5: mant_b bf16_add_skip2: slli t4, t4, 3 # for further rounding (guard bits) slli t5, t5, 3 # for further rounding (guard bits) ``` Because I preserved 3 bits by shift left 3 bits, the below procedure also needed to be modify. ```asm= # t2: mask, to check carry out # s3: result_mant addi t2, x0, 0x10 slli t2, t2, 4 and t2, s3, t2 # result & 0x100, check if carry out ``` `0x100` is not enough to check overflow, we have to check **bit 11**(0x800) instead of bit 8 (0x100). > **Reason:** Originally, the implicit 1 (hidden bit) is at **bit 7**, so a carry-out reaches **bit 8**. After shifting left 3 bits, the implicit 1 moves to **bit 10**. Therefore, adding two 10-bit mantissas may cause a carry-out to **bit 11**, so we have to use the mask **0x800**. Since mantissa is 7 bits (bit 0 to bit 6), after shifting left 3 bit, the mantissa is 10 bits. Therefore, we use mask 0x800 to check whether the sum (including **the hidden bit at bit 10**) carries out to **bit 11** or not. ``` 0x100 -> 0x800 ``` ```asm= addi t2, x0, 0x80 slli t2, t2, 4 and t2, s3, t2 # result & 0x800, check if carry out ``` #### bf16_add Test Cases Original version with 5-stage pipeline CPU ``` // My original approach for bf16_add Cycles: 247 Instrs. 
retired: 189 CPI: 1.31 IPC: 0.765 Clock rate: 228.28 Hz ``` **Branchless** version with 5-stage pipeline CPU ``` // Branchless Cycles: 313 Instrs. retired: 265 CPI: 1.18 IPC: 0.847 Clock rate: 333.69 Hz ``` Compiler version with 5-stage pipeline CPU ``` // Compiler Cycles: 259 Instrs. retired: 207 CPI: 1.25 IPC: 0.799 Clock rate: 254.92 Hz ``` **Performance:** My improved approach is about **1.21x slower** than the compiler-optimized version. **Code Size:** Compared to the compiler-optimized version, my improved approach reduces the code size by nearly **49%**. ___ ### ==bf16_sub== ```asm= bf16_sub: addi t0, x0, 0x80 slli t0, t0, 8 addi sp, sp, -8 sw ra, 4(sp) sw a1, 0(sp) xor a1, a1, t0 # negate b jal bf16_add lw a1, 0(sp) lw ra, 4(sp) addi sp, sp, 8 jr ra ``` #### bf16_sub Test Cases In the test of sub, I use 6 test cases. ``` 1. `0x7F7F` - `0x0080`, maximal finite positive number - minimal normalized number 2. `0x7FC0` - `0x3F80`, NaN - 1.000000 3. `0x3F80` - `0x3B80`, 1.000000 - 2^(-8) 4. `0x7F80` - `0xFF80`, +inf - (-inf) 5. `0x4040` - `0x4000`, 3.000000 - 2.000000 6. `0x7F7F` - `0xFF7F`, maximal finite positive numbe - maximal finite negative number ``` #### Bug fixed (Precision) ``` # 1.0000000 - 2^(-8) = 1.000000 ``` The correct answer should be `0x3B80`, but my approach gets `0x3B7F`. In order to get the correct answer, we have to add **the guard bits** to my approach. > GitHub: [bf16_sub_adjust_exp_diff](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sub/bf16_sub.S) ``` // My approach Cycles: 586 Instrs. retired: 492 CPI: 1.19 IPC: 0.84 Clock rate: 350.06 Hz ``` > GitHub: [bf16_sub_compiler.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sub/bf16_sub_compiler.S) ``` // Compiler-optimized Cycles: 1004 Instrs. retired: 750 CPI: 1.34 IPC: 0.747 Clock rate: 1.06 KHz ``` **Code Size:** Compared to the compiler-optimized version, my improved approach reduces the code size by nearly **53.6%**. **Performance:** My approach is **1.71x faster** than the compiler-optimized version. > Because `bf16_sub` is essentially derived from `bf16_add`, an additional `bf16_sub_improved` implementation is not required. ___ ### ==bf16_mul== > GitHub: [bf16_mul.S](https://github.com/dorobouharu133/ca2025-quizzes/tree/main/bf16_mul) As before, I'll start with an **original version** that acts as a foundation for the enhanced version I will introduce later. #### bf16_mul original > GitHub: [bf16_mul_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul_original.S) #### bf16_mul Test Cases And I choose these test cases since: ``` 1. `0x7F7F` x `0x0080`, `0x407F`, maximal finite positive number x minimal normalized number = 3.984375 2. `0x7FC0` x `0x3F80`, `0x7FC0`, NaN x 1.000000 -> NaN 3. `0x7F80` x `0xFF80`, `0x7F80`, +inf x (-inf) -> +inf 4. `0x4040` x `0x4000`, `0x40C0`, 3.000000 x 2.000000 = 6.000000 5. `0x3F80` x `0x3F80`, 1.000000 x 1.000000 = 1.000000 6. `0x0000` x `0x8000`, 0.000000 x (-0.000000) = (-0.000000) 7. `0x7F80` x `0x0000`, +inf x 0.000000 = NaN 8. `0x7F7F` x `0x4000`, maximal finite positive number x 2.000000 = +inf 9. `0x0080` x `0x0080`, minimal finite normalized positive number x minimal finite normalized positive number = 0.000000 10. `0x7FC0` x `0x3F80`, NaN x 1.000000 = NaN 11. `0xFF80` x `0x7F80`, (-inf) x +inf = -inf 12. `0x3F81` x `0x3F81`, 1.007812 x 1.007812 = 1.015625 13. `0xC000` x `0x4040`, -2.000000 x 3.000000 = -6.000000 14. 
`0x0040` x `0x0040`, minimal finite denormalized positive number x minimal finite denormalized positive number = 0.000000 15. `0x3F81` x `0x4000`, 1.007812 x 2.000000 = 2.015625 16. `0x4040` x `0x3F00`, 3.000000 x 0.500000 = 1.500000 17. `0x40A0` x `0x3F00`, 5.000000 x 0.500000 = 2.500000 18. `0x4010` x `0x4010`, 2.250000 x 2.250000 = 5.062500 ``` (Run in 5-stage pipeline CPU) > GitHub: [bf16_mul_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul_original.S) ``` // My original approach for bf16_mul Cycles: 1095 Instrs. retired: 817 CPI: 1.34 IPC: 0.746 Clock rate: 1.22 KHz ``` The above figure is the execution info of my approach, and the below figure is the execution info of compiler. (Run in 5-stage pipeline CPU) > GitHub: [bf16_mul_compiler.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul_compiler.S) ``` // Compiler-optimized bf16_mul Cycles: 3696 Instrs. retired: 2609 CPI: 1.42 IPC: 0.706 Clock rate: 1.60 KHz ``` **Performance:** My original approach is **3.37x faster** than the compiler-optimized version. ___ #### [bf16_mul_improved](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul.S) > AI Usage: [bf16_mul improvement thought](https://gemini.google.com/share/60cc3a43aa7d) ```c= int16_t exp_adjust = 0; if (!exp_a) { while (!(mant_a & 0x80)) { mant_a <<= 1; exp_adjust--; } exp_a = 1; } else mant_a |= 0x80; ``` The goal here is to find the leading bit of the **mantissa** for **normalization**. Instead of using a loop, we can implement a **branchless 8-bit CLZ** approach to make the `bf16_mul` operation more efficient. Since the mantissa is 7-bit (8-bit if implicit 1 included), we can use a simpler way of CLZ. #### Branchless 8-bit CLZ **The below C code of 8-bit CLZ is [AI-generated](https://gemini.google.com/share/60cc3a43aa7d).** ```c= // Gemini uint32_t x = mant_a; uint32_t s, shift = 0; s = (x <= 0xF); shift += (s << 2); x <<= (s << 2); s = (x <= 0x3F); shift += (s << 1); x <<= (s << 1); s = (x <= 0x7F); shift += (s << 0); x <<= (s << 0); // Now shift is the number of Leading Zeros ``` I implemented the above code as the RISC-V instruction below: ```asm= bf16_mul_exp_adjust_a: # normalize a addi s1, t4, 0 # s1 = x = mant_a addi s2, x0, 0 # s2 = shift addi s3, x0, 0 # s3 = s addi t6, x0, 0xF slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0 slli t6, s3, 2 add s2, s2, t6 # shift += (s << 2) sll s1, s1, t6 # x <<= (s << 2) # addi t6, x0, 0x3F slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0 slli t6, s3, 1 add s2, s2, t6 # shift += (s << 1) sll s1, s1, t6 # x <<= (s << 1) # addi t6, x0, 0x7F slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0 # s2: shift = # of leading zeros # s4: exp_adjust_a sltiu t6, t2, 1 # t6 = (exp_a == 0) ? 1 : 0 or t2, t2, t6 # exp_a = (exp_a == 0) ? 
1 : exp_a addi t6, t6, -1 xori t6, t6, -1 and s4, s2, t6 and s2, s2, t6 sll t4, t4, s2 ori t4, t4, 0x80 # mant_a |= 0x80 ``` Below is my improved code with partial branchless: ```asm= # a0: a # a1: b # t0: sign_a # t1: sigb_b # t2: exp_a # t3: exp_b # t4: mant_a # t5: mant_b # t6: temp # s0: result_sign bf16_mul: srli t0, a0, 15 andi t0, t0, 1 # t0 = (a.bits >> 15) & 1 srli t1, a1, 15 andi t1, t1, 1 # t1 = (b.bits >> 15) & 1 srli t2, a0, 7 andi t2, t2, 0xFF # t2 = (a.bits >> 7) & 0xFF srli t3, a1, 7 andi t3, t3, 0xFF # t3 = (b.bits >> 7) & 0xFF andi t4, a0, 0x7F # t4 = a.bits & 0x7F andi t5, a1, 0x7F # t5 = a.bits & 0x7F xor s0, t0, t1 # result_sign = sign_a ^ sign_b bf16_mul_check_a_0xFF: addi t6, x0, 0xFF bne t2, t6, bf16_mul_check_b_0xFF bne t4, x0, bf16_mul_return_a bne t5, x0, bf16_mul_return_pInf bne t3, x0, bf16_mul_return_pInf jal x0, bf16_mul_return_NaN bf16_mul_check_b_0xFF: addi t6, x0, 0xFF bne t3, t6, bf16_mul_check_zero bne t4, x0, bf16_mul_return_b bne t2, x0, bf16_mul_return_pInf bne t5, x0, bf16_mul_return_pInf jal x0, bf16_mul_return_NaN bf16_mul_check_zero: or t6, t2, t4 bne t6, x0, bf16_mul_exp_adjust or t6, t3, t5 bne t6, x0, bf16_mul_exp_adjust jal x0, bf16_mul_return_zero bf16_mul_exp_adjust: bf16_mul_exp_adjust_a: # normalize a addi s1, t4, 0 # s1 = x = mant_a addi s2, x0, 0 # s2 = shift addi s3, x0, 0 # s3 = s addi t6, x0, 0xF slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0 slli t6, s3, 2 add s2, s2, t6 # shift += (s << 2) sll s1, s1, t6 # x <<= (s << 2) # addi t6, x0, 0x3F slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0 slli t6, s3, 1 add s2, s2, t6 # shift += (s << 1) sll s1, s1, t6 # x <<= (s << 1) # addi t6, x0, 0x7F slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0 # s2: shift = # of leading zeros # s4: exp_adjust_a sltiu t6, t2, 1 # t6 = (exp_a == 0) ? 1 : 0 or t2, t2, t6 # exp_a = (exp_a == 0) ? 1 : exp_a addi t6, t6, -1 xori t6, t6, -1 and s4, s2, t6 and s2, s2, t6 sll t4, t4, s2 ori t4, t4, 0x80 # mant_a |= 0x80 bf16_mul_exp_adjust_b: # normalize b addi s1, t4, 0 # s1 = x = mant_b addi s2, x0, 0 # s2 = shift addi s3, x0, 0 # s3 = s addi t6, x0, 0xF slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0 slli t6, s3, 2 add s2, s2, t6 # shift += (s << 2) sll s1, s1, t6 # x <<= (s << 2) # addi t6, x0, 0x3F slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0 slli t6, s3, 1 add s2, s2, t6 # shift += (s << 1) sll s1, s1, t6 # x <<= (s << 1) # addi t6, x0, 0x7F slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0 # s2: shift = # of leading zeros # s5: exp_adjust_b sltiu t6, t3, 1 # t6 = (exp_a == 0) ? 1 : 0 or t3, t3, t6 # exp_a = (exp_a == 0) ? 
1 : exp_a addi t6, t6, -1 xori t6, t6, -1 and s5, s2, t6 and s2, s2, t6 sll t5, t5, s2 ori t5, t5, 0x80 # mant_b |= 0x80 # s1: exp_adjust = exp_adjust_a + exp_adjust_b add s1, s4, s5 bf16_mul_normalize: # s2: result_mant # s3: result_exp mul s2, t4, t5 # result_mant = mant_a * mant_b, need to modify later add s3, t2, t3 # result_exp = exp_a + exp_b add s3, s3, s5 # result_exp += exp_adjust addi s3, s3, -127 # result_exp -= BF16_EXP_BIAS bf16_mul_normalize_check: # branchless later bf16_mul_normalize_check_result_mant_if: addi t6, x0, 0x80 slli t6, t6, 8 and t6, s2, t6 beq t6, x0, bf16_mul_normalize_check_result_mant_else srli t6, s2, 8 andi s2, t6, 0xFF addi s3, s3, 1 jal x0, bf16_mul_normalize_check_result_exp_if_ge_0xFF bf16_mul_normalize_check_result_mant_else: srli t6, s2, 7 andi s2, t6, 0x7F bf16_mul_normalize_check_result_exp_if_ge_0xFF: addi t6, x0, 0xFF bge s3, t6, bf16_mul_return_pInf bf16_mul_normalize_check_result_exp_if_le_0: bgt s3, x0, bf16_mul_return addi t6, x0, -6 bgt s3, t6, bf16_mul_return_zero addi t6, x0, 1 sub t6, t6, s3 # t6 = 1 - result_exp sll s2, s2, t6 addi s3, x0, 0 bf16_mul_return: slli a0, s0, 15 andi s3, s3, 0xFF slli s3, s3, 7 or a0, a0, s3 andi s2, s2, 0x7F or a0, a0, s2 jr ra bf16_mul_return_a: jr ra bf16_mul_return_b: addi a0, a1, 0 jr ra bf16_mul_return_pInf: addi a0, s0, 0 slli a0, a0, 15 addi t6, x0, 0x7F slli t6, t6, 8 ori t6, t6, 0x80 or a0, a0, t6 jr ra bf16_mul_return_NaN: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0xC0 jr ra bf16_mul_return_zero: addi a0, s0, 0 slli a0, a0, 15 jr ra ``` I’ve finished converting the first section of the logic to be branchless. Next, I’ll apply the same optimization to the remaining parts of the code. > AI Usage: [bf16_mul test cases & debug](https://gemini.google.com/share/7f33de0eeaf6) **Debug Update:** Found an error in the BF16 test data. For $2.25 \times 2.25$, the precise result is $5.0625$, which rounds to `0x402A` in BF16. The initial expected value of `0x4020` was a typo. The hardware implementation is actually correct; no precision fix is needed. ```c= uint32_t result_mant = (uint32_t) mant_a * mant_b; int32_t result_exp = (int32_t) exp_a + exp_b - BF16_EXP_BIAS + exp_adjust; if (result_mant & 0x8000) { result_mant = (result_mant >> 8) & 0x7F; result_exp++; } else result_mant = (result_mant >> 7) & 0x7F; ``` The concept of the above code is pretty simple, we want to check whether the result_mant is 15 bits or 16 bits, - 16 bits means there is carry out for mant_a * mant_b, therefore result_mant & 0x8000 = 1 (using mask 0x8000 to check bit 15). - 15 bits means there is no carry out for mant_a * mant_b, therefore result_mant & 0x8000 = 0. Then use shift right 8 (or 7) bits to truncate `result_mant` and retain the upper 8-bits. Note that if the result_mant is 16 bits, it has to be normalized, therefore result_exp is incremented 1. I implement the above logic into RISC-V instructions. Instead of serval codes and branch instrucion in the original approach, this section is quite simple. Since we only want to check whether **result_mant is 16 bits or not**, we can **shift right logical 15 bits** to get bit 15, because the condition of shfit right result_mant is related to bit 15, we can determine the shift amount and result_mant by: - `result_mant >> 15` == 1, shift right `7 + (result_mant>>15)` = 8 bits - `result_mant >> 15` == 0, shfit right `7 + (result_mant>>15)` = 7 bits And result_exp can be also determined by: `result_exp += (result_mant>>15)`. 
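As a sanity check, here is a minimal C sketch of this branchless normalization (the helper name `normalize_product` and its signature are my own, not from the repository); it assumes `mant_a` and `mant_b` already include the hidden bit, so their product fits in 16 bits with the leading 1 in bit 14 or bit 15:

```c=
#include <stdint.h>

/* Branchless normalization of the mantissa product in bf16 multiplication.
 * carry is 1 exactly when the product needs 16 bits, so the shift amount
 * becomes 7 + carry and the exponent is bumped by carry. */
static inline uint16_t normalize_product(uint32_t product, int32_t *result_exp)
{
    uint32_t carry = product >> 15;                       /* result_mant & 0x8000 ? 1 : 0 */
    *result_exp += (int32_t) carry;                       /* result_exp += (result_mant >> 15) */
    return (uint16_t) ((product >> (7 + carry)) & 0x7F);  /* shift by 7 or 8, keep 7 bits */
}
```

Shifting by `7 + carry` selects between the `>> 7` and `>> 8` paths of the original C code without a branch.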
> **Note:** `result_mant >> 15` here is effective to `result_mant & 0x8000`. ```asm= bf16_mul_normalize: # s2: result_mant # s3: result_exp mul s2, t4, t5 # result_mant = mant_a * mant_b, need to modify later add s3, t2, t3 # result_exp = exp_a + exp_b sub s3, s3, s5 # result_exp - exp_adjust_b sub s3, s3, s4 # result_exp - exp_adjust_a - exp_adjust_b addi s3, s3, -127 # result_exp -= BF16_EXP_BIAS bf16_mul_normalize_check: srli t6, s2, 15 addi t0, t6, 7 # t0 = (result_mant == 1) ? 8 : 7, shift_amount srl s2, s2, t0 # result_mant >>= (shift_amount) add s3, s3, t6 # result_exp += t6 ``` The next step is to replace `mul` instruction with the [**shift-and-add**](##==bf16_sqrt_shift_add==) method used in bf16_sqrt. Below is the complete code of bf16_mul: ```asm= .global bf16_mul .global shift_and_add # a0: a # a1: b # t0: sign_a # t1: sigb_b # t2: exp_a # t3: exp_b # t4: mant_a # t5: mant_b # t6: temp # s0: result_sign bf16_mul: srli t0, a0, 15 andi t0, t0, 1 # t0 = (a.bits >> 15) & 1 srli t1, a1, 15 andi t1, t1, 1 # t1 = (b.bits >> 15) & 1 srli t2, a0, 7 andi t2, t2, 0xFF # t2 = (a.bits >> 7) & 0xFF srli t3, a1, 7 andi t3, t3, 0xFF # t3 = (b.bits >> 7) & 0xFF andi t4, a0, 0x7F # t4 = a.bits & 0x7F andi t5, a1, 0x7F # t5 = a.bits & 0x7F xor s0, t0, t1 # result_sign = sign_a ^ sign_b bf16_mul_check_a_0xFF: addi t6, x0, 0xFF bne t2, t6, bf16_mul_check_b_0xFF bne t4, x0, bf16_mul_return_a bne t5, x0, bf16_mul_return_pInf bne t3, x0, bf16_mul_return_pInf jal x0, bf16_mul_return_NaN bf16_mul_check_b_0xFF: addi t6, x0, 0xFF bne t3, t6, bf16_mul_check_zero bne t4, x0, bf16_mul_return_b bne t2, x0, bf16_mul_return_pInf bne t5, x0, bf16_mul_return_pInf jal x0, bf16_mul_return_NaN bf16_mul_check_zero: or t6, t2, t4 bne t6, x0, bf16_mul_exp_adjust or t6, t3, t5 bne t6, x0, bf16_mul_exp_adjust jal x0, bf16_mul_return_zero bf16_mul_exp_adjust: bf16_mul_exp_adjust_a: # normalize a addi s1, t4, 0 # s1 = x = mant_a addi s2, x0, 0 # s2 = shift addi s3, x0, 0 # s3 = s addi t6, x0, 0xF slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0 slli t6, s3, 2 add s2, s2, t6 # shift += (s << 2) sll s1, s1, t6 # x <<= (s << 2) # addi t6, x0, 0x3F slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0 slli t6, s3, 1 add s2, s2, t6 # shift += (s << 1) sll s1, s1, t6 # x <<= (s << 1) # addi t6, x0, 0x7F slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0 # s2: shift = # of leading zeros # s4: exp_adjust_a sltiu t6, t2, 1 # t6 = (exp_a == 0) ? 1 : 0 or t2, t2, t6 # exp_a = (exp_a == 0) ? 1 : exp_a addi t6, t6, -1 xori t6, t6, -1 and s4, s2, t6 and s2, s2, t6 sll t4, t4, s2 ori t4, t4, 0x80 # mant_a |= 0x80 bf16_mul_exp_adjust_b: # normalize b addi s1, t4, 0 # s1 = x = mant_b addi s2, x0, 0 # s2 = shift addi s3, x0, 0 # s3 = s addi t6, x0, 0xF slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0 slli t6, s3, 2 add s2, s2, t6 # shift += (s << 2) sll s1, s1, t6 # x <<= (s << 2) # addi t6, x0, 0x3F slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0 slli t6, s3, 1 add s2, s2, t6 # shift += (s << 1) sll s1, s1, t6 # x <<= (s << 1) # addi t6, x0, 0x7F slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0 xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0 # s2: shift = # of leading zeros # s5: exp_adjust_b sltiu t6, t3, 1 # t6 = (exp_a == 0) ? 1 : 0 or t3, t3, t6 # exp_a = (exp_a == 0) ? 
1 : exp_a addi t6, t6, -1 xori t6, t6, -1 and s5, s2, t6 and s2, s2, t6 sll t5, t5, s2 ori t5, t5, 0x80 # mant_b |= 0x80 # s1: exp_adjust = exp_adjust_a + exp_adjust_b add s1, s4, s5 bf16_mul_normalize: # s2: result_mant # s3: result_exp # mul s2, t4, t5 # result_mant = mant_a * mant_b addi sp, sp, -12 sw ra, 0(sp) sw a0, 4(sp) sw a1, 8(sp) addi a0, t4, 0 addi a1, t5, 0 jal ra, shift_and_add addi s2, a0, 0 lw ra, 0(sp) lw a0, 4(sp) lw a1, 8(sp) addi sp, sp, 12 add s3, t2, t3 # result_exp = exp_a + exp_b sub s3, s3, s5 # result_exp - exp_adjust_b sub s3, s3, s4 # result_exp - exp_adjust_a - exp_adjust_b addi s3, s3, -127 # result_exp -= BF16_EXP_BIAS bf16_mul_normalize_check: srli t6, s2, 15 addi t0, t6, 7 # t0 = (result_mant == 1) ? 8 : 7, shift_amount srl s2, s2, t0 # result_mant >>= (shift_amount) add s3, s3, t6 # result_exp += t6 bf16_mul_normalize_check_result_exp_if_ge_0xFF: addi t6, x0, 0xFF bge s3, t6, bf16_mul_return_pInf bf16_mul_normalize_check_result_exp_if_le_0: bgt s3, x0, bf16_mul_return addi t6, x0, -6 blt s3, t6, bf16_mul_return_zero addi t6, x0, 1 sub t6, t6, s3 # t6 = 1 - result_exp srl s2, s2, t6 addi s3, x0, 0 bf16_mul_return: slli a0, s0, 15 andi s3, s3, 0xFF slli s3, s3, 7 or a0, a0, s3 andi s2, s2, 0x7F or a0, a0, s2 jr ra bf16_mul_return_a: jr ra bf16_mul_return_b: addi a0, a1, 0 jr ra bf16_mul_return_pInf: addi a0, s0, 0 slli a0, a0, 15 addi t6, x0, 0x7F slli t6, t6, 8 ori t6, t6, 0x80 or a0, a0, t6 jr ra bf16_mul_return_NaN: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0xC0 jr ra bf16_mul_return_zero: addi a0, s0, 0 slli a0, a0, 15 jr ra # a0: multiplicand # a1: multiplier # t0: accumulator # t1: multiplier_copy # t2: bit0 shift_and_add: addi sp, sp, -12 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) addi t0, x0, 0 addi t1, a1, 0 andi t2, a1, 1 shift_and_add_for_loop: andi t2, t1, 1 beq t2, x0, shift_and_add_for_skip_add add t0, t0, a0 shift_and_add_for_skip_add: slli a0, a0, 1 srli t1, t1, 1 bne t1, x0, shift_and_add_for_loop shfit_and_add_end: addi a1, t0, 0 lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) addi sp, sp, 12 addi a0, a1, 0 jr ra ``` #### bf16_mul Test Cases > GitHub: [bf16_mul.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul.S) ``` Cycles: 2715 Instrs. retired: 2141 CPI: 1.27 IPC: 0.789 Clock rate: 2.26 KHz ``` **Performance:** My improverd approach is **1.36x faster** than the compiler-optmized version. **Code Size:** Compared to the compiler-optimized version, my approach **reduces** the code size by nearly **39%**. ___ ### ==bf16_div== #### bf16_div original > GitHub: [bf16_div original approach](https://github.com/dorobouharu133/ca2025-quizzes/blob/ddfb96a8105e4fa1adb2c97fd85c5c548076904b/bf16_div.S) #### bf16_div Test Cases ``` 1. `0x7F7F` / `0x0080`, maximal finite positive number / minimal normalized number 2. `0x7FC0` / `0x3F80`, NaN / 1.000000 3. `0x7F80` / `0xFF80`, +inf / (-inf) 4. `0x4040` / `0x4000`, 3.000000 / 2.000000 ``` ``` // Original Approach Cycles: 643 Instrs. retired: 465 CPI: 1.38 IPC: 0.723 Clock rate: 595.37 Hz ``` The above figure is the execution info of my approach, and the below is the execution info of the compiler-optimized version. ``` // Compiler Cycles: 1479 Instrs. retired: 1028 CPI: 1.44 IPC: 0.695 Clock rate: 1.91 KHz ``` My original approach is **2.3x faster** than compiler's version. ___ #### bf16_div_improved There are two loops in the original approach, we can improve it. 
```c= for (int i = 0; i < 16; i++) { quotient <<= 1; if (dividend >= (divisor << (15 - i))) { dividend -= (divisor << (15 - i)); quotient |= 1; } } ``` In my original approach, I translated it line-by-line: ```asm= # Original approach bf16_div_division: # t6: loop times # s1: dividend # s2: divisor # s3: quotient # s4: counter slli s1, t4, 15 # dividend = mant_a << 15 addi s2, t5, 0 # divisor = mant_b addi s3, x0, 0 # quotient = 0 addi t6, x0, 16 # loop times = 16 addi s4, x0, 0 # counter = 0 bf16_div_division_loop: bge s4, t6, bf16_div_division_loop_end slli s3, s3, 1 # quotient <<= 1 addi t4, x0, 15 sub t4, t4, s4 # t4 = 15 - i sll t4, s2, t4 # temp = divisor << (15 - i) blt s1, t4, bf16_div_division_loop_sub_skip bf16_div_division_loop_sub: sub s1, s1, t4 # dividend -= (divisor << (15 - i)) ori s3, s3, 1 # quotient |= 1 bf16_div_division_loop_sub_skip: addi s4, s4, 1 # i++ jal x0, bf16_div_division_loop ``` However, this code is performing restoring division bit by bit, and I revisit this loop in the division improvement section below. --- ```c= while (!(quotient & 0x8000) && result_exp > 1) { quotient <<= 1; result_exp--; } quotient >>= 8; ``` The purpose of the above code is to normalize the quotient, and my original line-by-line translation is shown below. ```asm= # Original approach bf16_div_normalize_quotient_while_loop: addi t4, x0, 0x80 slli t4, t4, 8 # t4 = 0x8000 and t4, s3, t4 # t4 = quotient & 0x8000 bne t4, x0, bf16_div_normalize_quotient_while_loop_end addi t6, x0, 1 ble s5, t6, bf16_div_normalize_quotient_while_loop_end slli s3, s3, 1 # quotient <<= 1 addi s5, s5, -1 # result_exp -= 1 jal x0, bf16_div_normalize_quotient_while_loop bf16_div_normalize_quotient_while_loop_end: srli s3, s3, 8 # quotient >>= 8 ``` Just like what we did before, **this while loop effectively counts leading zeros to normalize the quotient**. Note that **result_exp must remain at least 1 at the end of the loop**, because the loop only keeps shifting while `result_exp > 1`. Therefore, we can use the [**clz concept**](###Branchless-8-bit-CLZ) that was used in the previous section. ```c= // Gemini uint32_t x = mant_a; uint32_t s, shift = 0; s = (x <= 0xF); shift += (s << 2); x <<= (s << 2); s = (x <= 0x3F); shift += (s << 1); x <<= (s << 1); s = (x <= 0x7F); shift += (s << 0); x <<= (s << 0); // Now shift is the number of Leading Zeros ``` **Observe the bit mask:** ``` # 8-bit clz 0xF = 0000 1111 0x3F = 0011 1111 0x7F = 0111 1111 ``` We can observe that **clz always divides the remaining range into two parts**. At each step, it checks the upper half and then repeats the same operation on the selected half. Based on this observation, I implemented the same logic to develop a 16-bit branchless `clz`. **16-bit branchless clz mask:** ``` # 16-bit clz # divided into 2 parts 0000 0000 1111 1111 -> 0x00FF 0000 1111 1111 1111 -> 0x0FFF 0011 1111 1111 1111 -> 0x3FFF 0111 1111 1111 1111 -> 0x7FFF ``` > **Note:** Since $16 = 2^{4}$, the 16-bit CLZ needs 4 halving steps, with shift amounts of 8, 4, 2, and 1.
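To make the halving pattern concrete, each mask simply keeps the lower `16 - step` bits for the step sizes 8, 4, 2, 1. A small illustrative C snippet (not part of the submitted code) that prints the masks:

```c=
#include <stdio.h>

int main(void)
{
    /* step 8 -> 0x00FF, step 4 -> 0x0FFF, step 2 -> 0x3FFF, step 1 -> 0x7FFF */
    for (int step = 8; step >= 1; step >>= 1) {
        unsigned mask = (1u << (16 - step)) - 1u;
        printf("step %d -> mask 0x%04X\n", step, mask);
    }
    return 0;
}
```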
> GitHub: [Branchless 16-bit CLZ](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_16_bit_clz.c) So we can implement it in C code: ```c= // My own approach uint32_t x = mant_a; uint32_t s, shift = 0; // 0000 0000 1111 1111 -> 0x00FF // 0000 1111 1111 1111 -> 0x0FFF // 0011 1111 1111 1111 -> 0x3FFF // 0111 1111 1111 1111 -> 0x7FFF s = (x <= 0x00FF); shift += (s << 3); x <<= (s << 3); s = (x <= 0x0FFF); shift += (s << 2); x <<= (s << 2); s = (x <= 0x3FFF); shift += (s << 1); x <<= (s << 1); s = (x <= 0x7FFF); shift += (s << 0); x <<= (s << 0); // Now shift is the number of Leading Zeros ``` > **Note:** Validation code is included in [GitHub](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_16_bit_clz.c). > GitHub: [Branchless 32-bit CLZ](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_32_bit_clz.c) ```c= // My own approach uint32_t x = mant_a; uint32_t s, shift = 0; // 0000 0000 0000 0000 1111 1111 1111 1111 -> 0x0000FFFF // 0000 0000 1111 1111 1111 1111 1111 1111 -> 0x00FFFFFF // 0000 1111 1111 1111 1111 1111 1111 1111 -> 0x0FFFFFFF // 0011 1111 1111 1111 1111 1111 1111 1111 -> 0x3FFFFFFF // 0111 1111 1111 1111 1111 1111 1111 1111 -> 0x7FFFFFFF s = (x <= 0x0000FFFF); shift += (s << 4); x <<= (s << 4); s = (x <= 0x00FFFFFF); shift += (s << 3); x <<= (s << 3); s = (x <= 0x0FFFFFFF); shift += (s << 2); x <<= (s << 2); s = (x <= 0x3FFFFFFF); shift += (s << 1); x <<= (s << 1); s = (x <= 0x7FFFFFFF); shift += (s << 0); x <<= (s << 0); // Now shift is the number of Leading Zeros ``` > **Note:** Validation code is included in [GitHub](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_32_bit_clz.c). **Note:** Although the original branchless 8-bit CLZ implementation was AI-generated, **extending it to 16-bit and 32-bit was my own work**. Although I also developed a 32-bit branchless CLZ, the 16-bit version is the one used in this case, since the **quotient is 16-bit**. ```c= while (!(quotient & 0x8000) && result_exp > 1) { quotient <<= 1; result_exp--; } ``` I will introduce how I improved this logic. Since the purpose of this code segment is to normalize the quotient, we can use the **branchless clz** instead of a while loop that iteratively searches for the leading 1. Also note the second condition, `result_exp > 1`: it means the maximum shift amount is **result_exp - 1**. Therefore, we have to compare the number of leading zeros with `result_exp - 1`. That is: - if (result_exp - 1) < the number of leading zeros, - result_exp = 1 - quotient <<= (result_exp - 1) - if (result_exp - 1) >= the number of leading zeros, - result_exp -= the number of leading zeros - quotient <<= the number of leading zeros > GitHub: [bf16_div.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_div/bf16_div.S) ```asm= .data newline: .string "\n" pass_msg: .string "All tests passed!" fail_msg: .string "Tests failed!"
.text .globl bf16_div .globl branchless_16_bit_clz .globl main main: # maximal finite positive number / 0.000000 = +inf test_1: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0x7F addi a1, x0, 0x80 jal ra, bf16_div addi s0, a0, 0 addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0x80 bne s0, t0, test_failed test_passed: la a0, pass_msg addi a7, x0, 4 ecall jal x0, end test_failed: la a0, fail_msg addi a7, x0, 4 ecall end: la a0, newline addi a7, x0, 4 ecall addi a7, x0, 10 ecall # a0: a # a1: b # t0: sign_a # t1: sign_b # t2: exp_a # t3: exp_b # t4: mant_a # t5: mant_b # t6: temp # s0: result_sign bf16_div: srli t0, a0, 15 andi t0, t0, 1 # t0 = (a.bits >> 15) & 1, sign_a srli t1, a1, 15 andi t1, t1, 1 # t1 = (b.bits >> 15) & 1, sign_b srli t2, a0, 7 andi t2, t2, 0xFF # t2 = (a.bits >> 7) & 0xFF, exp_a srli t3, a1, 7 andi t3, t3, 0xFF # t3 = (b.bits >> 7) & 0xFF, exp_b andi t4, a0, 0x7F # t4 = a.bits & 0x7F, mant_a andi t5, a1, 0x7F # t5 = b.bits & 0x7F, mant_b xor s0, t0, t1 # result_sign = sign_a ^ sign_b bf16_div_check_exp_b_0xFF: # maybe optimize later addi t6, x0, 0xFF bne t3, t6, bf16_div_check_b_NaN_inf bne t5, x0, bf16_div_return_b bne t2, t6, bf16_div_return_signed_zero bne t4, x0, bf16_div_return_signed_zero jal x0, bf16_div_return_NaN bf16_div_check_b_NaN_inf: # maybe optimize later bne t3, x0, bf16_div_check_exp_a_0xFF bne t5, x0, bf16_div_check_exp_a_0xFF beq t2, x0, bf16_div_return_NaN beq t4, x0, bf16_div_return_NaN jal x0, bf16_div_return_signed_zero bf16_div_check_exp_a_0xFF: # maybe optimize later bne t2, t6, bf16_div_check_a_NaN_inf bne t4, x0, bf16_div_return_a jal x0, bf16_div_return_signed_inf bf16_div_check_a_NaN_inf: # maybe optimize later bne t2, x0, bf16_div_a_explicit_one bne t4, x0, bf16_div_a_explicit_one jal x0, bf16_div_return_signed_zero bf16_div_a_explicit_one: beq t2, x0, bf16_div_b_explicit_one ori t4, t4, 0x80 # mant_a |= 0x80 bf16_div_b_explicit_one: beq t3, x0, bf16_div_division ori t5, t5, 0x80 # mant_b |= 0x80 bf16_div_division: # should be optimized # t6: loop times # s1: dividend # s2: divisor # s3: quotient # s4: counter slli s1, t4, 15 # dividend = mant_a << 15 addi s2, t5, 0 # divisor = mant_b addi s3, x0, 0 # quotient = 0 addi t6, x0, 16 # loop times = 16 addi s4, x0, 0 # counter = 0 bf16_div_division_loop: bge s4, t6, bf16_div_division_loop_end slli s3, s3, 1 # quotient <<= 1 addi t4, x0, 15 sub t4, t4, s4 # t4 = 15 - i sll t4, s2, t4 # temp = divisor << (15 - i) blt s1, t4, bf16_div_division_loop_sub_skip bf16_div_division_loop_sub: sub s1, s1, t4 # dividend -= (divisor << (15 - i)) ori s3, s3, 1 # quotient |= 1 bf16_div_division_loop_sub_skip: addi s4, s4, 1 # i++ jal x0, bf16_div_division_loop bf16_div_division_loop_end: # s5: result_exp add s5, t2, x0 # result_exp = exp_a sub s5, s5, t3 # result_exp -= exp_b addi s5, s5, 127 # result_exp += bias (127) bf16_div_normalize_exp_a: bne t2, x0, bf16_div_normalize_exp_b addi s5, s5, -1 bf16_div_normalize_exp_b: bne t3, x0, bf16_div_normalize_quotient_if addi s5, s5, 1 bf16_div_normalize_quotient_if: addi t6, x0, 0x80 slli t6, t6, 8 # t6 = 0x8000 and t4, s3, t6 # t4 = quotient & 0x8000 beq t4, x0, bf16_div_normalize_quotient_else srli s3, s3, 8 # quotient >>= 8 jal x0, bf16_div_normalize_quotient_end bf16_div_normalize_quotient_else: bf16_div_normalize_quotient_clz: addi sp, sp, -8 sw ra, 0(sp) sw a0, 4(sp) addi a0, s3, 0 jal ra, branchless_16_bit_clz addi t6, a0, 0 # t6 = # of leading zeros lw ra, 0(sp) lw a0, 4(sp) addi sp, sp, 8 addi t4, s5, -1 # t4 = result_exp - 1 bge t6, t4, 
bf16_div_normalize_quotient_clz_sub_result_exp_minus_1 # optimize to whole branchless later bf16_div_normalize_quotient_clz_sub_clz: sub s5, s5, t6 # result_exp -= leading_zeros sll s3, s3, t6 # quotient <<= leading_zeros jal x0, bf16_div_normalize_quotient_clz_end bf16_div_normalize_quotient_clz_sub_result_exp_minus_1: addi s5, x0, 1 sll s3, s3, t4 bf16_div_normalize_quotient_clz_end: srli s3, s3, 8 # quotient >>= 8 bf16_div_normalize_quotient_end: andi s3, s3, 0x7F # quotient & 0x7F bf16_div_return: addi t6, x0, 0xFF bge s5, t6, bf16_div_return_signed_inf ble s5, x0, bf16_div_return_signed_zero slli a0, s0, 15 # result_sign << 15 andi s5, s5, 0xFF slli s5, s5, 7 # result_exp << 7 or a0, a0, s5 andi s3, s3, 0x7F or a0, a0, s3 jr ra bf16_div_return_a: jr ra bf16_div_return_b: addi a0, a1, 0 jr ra bf16_div_return_signed_zero: slli a0, s0, 15 # result_sign << 15 jr ra bf16_div_return_signed_inf: slli a0, s0, 15 # result_sign << 15 addi t6, x0, 0x7F slli t6, t6, 8 ori t6, t6, 0x80 or a0, a0, t6 # return Inf jr ra bf16_div_return_NaN: addi a0, x0, 0x7F slli a0, a0, 8 ori a0, a0, 0xC0 jr ra # a0: mant_a # t0: x # t1: s # t2: shift # t3: temp # t4: temp2 branchless_16_bit_clz: addi sp, sp, -20 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) sw t3, 12(sp) sw t4, 16(sp) addi t0, a0, 0 # x = mant_a addi t1, x0, 0 # s = 0 addi t2, x0, 0 # shift = 0 addi t3, x0, 0x00FF slt t1, t3, t0 # s = (x > 0x00FF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x00FF) ? 1 : 0 slli t4, t1, 3 add t2, t2, t4 # shift += (s << 3) sll t0, t0, t4 # x <<= (s << 3) addi t3, x0, 0xF slli t3, t3, 8 ori t3, t3, 0xFF slt t1, t3, t0 # s = (x > 0x0FFF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x0FFF) ? 1 : 0 slli t4, t1, 2 add t2, t2, t4 # shift += (s << 2) sll t0, t0, t4 # x <<= (s << 2) addi t3, x0, 0x3F slli t3, t3, 8 ori t3, t3, 0xFF slt t1, t3, t0 # s = (x > 0x3FFF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x3FFF) ? 1 : 0 slli t4, t1, 1 add t2, t2, t4 # shift += (s << 1) sll t0, t0, t4 # x <<= (s << 1) addi t3, x0, 0x7F slli t3, t3, 8 ori t3, t3, 0xFF slt t1, t3, t0 # s = (x > 0x7FFF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x7FFF) ? 1 : 0 slli t4, t1, 0 add t2, t2, t4 # shift += (s << 0) sll t0, t0, t4 # x <<= (s << 0) addi a0, t2, 0 # return shift lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) lw t3, 12(sp) lw t4, 16(sp) addi sp, sp, 20 jr ra ``` #### bf16_div Test Cases > GitHub: [bf16_div_test](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_div/bf16_div_test.S) ``` // Improved Approach Cycles: 643 Instrs. retired: 465 CPI: 1.38 IPC: 0.723 Clock rate: 912.06 Hz ``` **Performance:** My improved approach is about **2.3x faster** than the compiler-optimized version. **Code Size:** Compared to the compiler-optimized version, my improved approach reduces the code size by nearly **36%**. #### division improvement ```c= for (int i = 0; i < 16; i++) { quotient <<= 1; if (dividend >= (divisor << (15 - i))) { dividend -= (divisor << (15 - i)); quotient |= 1; } } ``` **Observe:** In my previous improved approach, this part shifts the divisor by a variable amount `(15 - i)` at each iteration, but this logic can be simplified as below. ```c= for (int i = 0; i < 16; i++) { quotient <<= 1; dividend <<= 1; if (dividend >= divisor) { dividend -= divisor; quotient |= 1; } } ``` We can shift the dividend left in each iteration instead of shifting the divisor. This allows us to **avoid calculating** `(15-i)` and performing a dynamic shift.
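Before translating the new loop into assembly, a quick sanity check helps: the following standalone C program (hypothetical test code of my own, not part of the submission) verifies that shifting the dividend is equivalent to shifting the divisor for all normalized mantissa pairs.

```c=
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Original form: compare against the divisor shifted by (15 - i). */
static uint16_t div_shift_divisor(uint32_t mant_a, uint32_t mant_b)
{
    uint32_t dividend = mant_a << 15, divisor = mant_b, quotient = 0;
    for (int i = 0; i < 16; i++) {
        quotient <<= 1;
        if (dividend >= (divisor << (15 - i))) {
            dividend -= (divisor << (15 - i));
            quotient |= 1;
        }
    }
    return (uint16_t)quotient;
}

/* Improved form: shift the dividend left instead; divisor pre-shifted by 1. */
static uint16_t div_shift_dividend(uint32_t mant_a, uint32_t mant_b)
{
    uint32_t dividend = mant_a, divisor = mant_b << 1, quotient = 0;
    for (int i = 0; i < 16; i++) {
        quotient <<= 1;
        dividend <<= 1;
        if (dividend >= divisor) {
            dividend -= divisor;
            quotient |= 1;
        }
    }
    return (uint16_t)quotient;
}

int main(void)
{
    /* Normalized bf16 mantissas with the hidden bit set: 0x80 .. 0xFF. */
    for (uint32_t a = 0x80; a <= 0xFF; a++)
        for (uint32_t b = 0x80; b <= 0xFF; b++)
            assert(div_shift_divisor(a, b) == div_shift_dividend(a, b));
    printf("both division loops agree\n");
    return 0;
}
```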
```asm= bf16_div_division: # should be optimized # t6: loop times # s1: dividend # s2: divisor # s3: quotient # s4: counter slli s1, t4, 15 # dividend = mant_a << 15 addi s2, t5, 0 # divisor = mant_b addi s3, x0, 0 # quotient = 0 addi t6, x0, 16 # loop times = 16 addi s4, x0, 0 # counter = 0 bf16_div_division_loop: bge s4, t6, bf16_div_division_loop_end slli s3, s3, 1 # quotient <<= 1 addi t4, x0, 15 sub t4, t4, s4 # t4 = 15 - i sll t4, s2, t4 # temp = divisor << (15 - i) blt s1, t4, bf16_div_division_loop_sub_skip bf16_div_division_loop_sub: sub s1, s1, t4 # dividend -= (divisor << (15 - i)) ori s3, s3, 1 # quotient |= 1 bf16_div_division_loop_sub_skip: addi s4, s4, 1 # i++ jal x0, bf16_div_division_loop ``` In order to shift the dividend, we have to adjust the initialization for the dividend and the divisor. - The dividend: Initialize with `mant_a`. > We shift left **gradually inside the loop** instead of shifting 15 bits **all at once**. - The divisor: Initialize with `mant_b << 1`. > Since the dividend gets shifted left **left** inside the loop, we have to pre-shift the divisor by 1 bit. Otherwise, the comparision **will be off**. ```asm= bf16_div_division: # t6: loop times # s1: dividend # s2: divisor # s3: quotient # s4: counter addi s1, t4, 0 # dividend = mant_a slli s2, t5, 1 # divisor = mant_b << 1 addi s3, x0, 0 # quotient = 0 addi t6, x0, 16 # loop times = 16 addi s4, x0, 0 # counter = 0 bf16_div_division_loop: bge s4, t6, bf16_div_division_loop_end slli s3, s3, 1 # quotient <<= 1 slli s1, s1, 1 # dividend <<= 1 blt s1, s2, bf16_div_division_loop_sub_skip bf16_div_division_loop_sub: sub s1, s1, s2 # dividend -= divisor ori s3, s3, 1 # quotient |= 1 bf16_div_division_loop_sub_skip: addi s4, s4, 1 # i++ jal x0, bf16_div_division_loop ``` I reduced the loop body from 15 to 13 instructions. Since the loop **iterates 16 times**, this results in a **total reduction of 32 dynamic instructions**. > GitHub: [bf16_div_loop_improved](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_div/bf16_div.S) ``` // Loop Improved Approach Cycles: 579 Instrs. retired: 401 CPI: 1.44 IPC: 0.693 Clock rate: 619.25 Hz ``` **Performance:** My latest improved approach is about **1.11x faster** than the previous improved version, and **2.55x faster** than the compiler-optmized version. 
___ ### ==Other functions== ```asm= .data newline: .string "\n" .text .globl main .globl bf16_isnan .globl bf16_isinf main: addi a0, x0, 0x40 slli a0, a0, 8 ori a0, a0, 0x49 slli a0, a0, 8 ori a0, a0, 0x0F slli a0, a0, 8 ori a0, a0, 0xD0 jal ra, f32_to_bf16 addi s0, a0, 0 addi a7, x0, 1 addi a0, s0, 0 ecall la a0, newline addi a7, x0, 4 ecall addi a7, x0, 10 ecall bf16_isnan: addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0x80 # BF16_EXP_MASK and t1, a0, t0 # a.bits & BF16_EXP_MASK bne t1, t0, bf16_isnan_return_false andi t1, a0, 0x007F # a.bits & BF16_MANT_MASK beq t1, x0, bf16_isnan_return_false # mantissa == 0 bf16_isnan_return_true: addi a0, x0, 1 jr ra bf16_isnan_return_false: addi a0, x0, 0 jr ra bf16_isinf: addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0x80 # BF16_EXP_MASK and t1, a0, t0 # a.bits & BF16_EXP_MASK bne t1, t0, bf16_isinf_return_false andi t1, a0, 0x007F # a.bits & BF16_MANT_MASK bne t1, x0, bf16_isinf_return_false # mantissa != 0 bf16_isinf_return_true: addi a0, x0, 1 jr ra bf16_isinf_return_false: addi a0, x0, 0 jr ra bf16_iszero: addi t0, x0, 0x7F slli t0, t0, 8 ori t0, t0, 0xFF and t1, a0, t0 beq t1, x0, bf16_iszero_return_true bf16_iszero_return_false: addi a0, x0, 0 jr ra bf16_iszero_return_true: addi a0, x0, 1 jr ra f32_to_bf16: srli t0, a0, 23 # f32bits >> 23 andi t0, t0, 0xFF addi t1, x0, 0xFF beq t0, t1, f32_to_bf16_return_if_equal # if(((f32bits >> 23) & 0xFF) == 0xFF) return f32_to_bf16_return: srli t0, a0, 16 andi t0, t0, 1 addi t1, x0, 0x7F slli t1, t1, 8 ori t1, t1, 0xFF add t0, t0, t1 add a0, a0, t0 srli a0, a0, 16 and a0, a0, t1 jr ra f32_to_bf16_return_if_equal: srli a0, a0, 16 and a0, a0, t1 jr ra ``` > I think this piece of code can be directly translated into RISC-V instructions, since the original C code is already very concise. #### Test Cases ``` 1. bf16_isnan: 0x7FC1 2. bf16_isinf: 0x7F80 3. bf16_iszero: 0x0000 4. f32_to_bf16: 0x40490FD0 ``` ``` // My approach Cycles: 80 Instrs. retired: 80 CPI: 1 IPC: 1 Clock rate: 0Hz ``` Above is the execution info of my approach. Below is the execution info of compiler. ``` // Compiler-optimized Cycles: 132 Instrs. retired: 132 CPI: 1 IPC: 1 Clock rate: 0Hz ``` My approach is **1.65x faster** than compiler's version. ## Use Cases ### [LeetCode #973 K Closest Points to Origin](https://leetcode.com/problems/k-closest-points-to-origin/description/) In [LeetCode #973](https://leetcode.com/problems/k-closest-points-to-origin/description/), we have to process a large amount of coordinate data. In real-world scenarios, such as on a drone with limited memory, using **uf8-encoding** allows us to convert 32-bit data into 8-bit. This reduces memory usage greatly, allowing the device to **store 4 times more data** in the same amount of Memory. # Requirement 3 ## [Branchless 16 bit CLZ](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_clz/branchless_16_bit_clz.S) ```asm= li a0, 0x30 jal branchless_16_bit_clz # a0: mant_a # t0: x # t1: s # t2: shift # t3: temp # t4: temp2 branchless_16_bit_clz: addi sp, sp, -20 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) sw t3, 12(sp) sw t4, 16(sp) addi t0, a0, 0 # x = mant_a addi t1, x0, 0 # s = 0 addi t2, x0, 0 # shift = 0 addi t3, x0, 0x00FF slt t1, t3, t0 # s = (x > 0x00FF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x00FF) ? 1 : 0 slli t4, t1, 3 add t2, t2, t4 # shift += (s << 3) sll t0, t0, t4 # x <<= (s << 3) addi t3, x0, 0xF slli t3, t3, 8 ori t3, t3, 0xFF slt t1, t3, t0 # s = (x > 0x0FFF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x0FFF) ? 
1 : 0 slli t4, t1, 2 add t2, t2, t4 # shift += (s << 2) sll t0, t0, t4 # x <<= (s << 2) addi t3, x0, 0x3F slli t3, t3, 8 ori t3, t3, 0xFF slt t1, t3, t0 # s = (x > 0x3FFF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x3FFF) ? 1 : 0 slli t4, t1, 1 add t2, t2, t4 # shift += (s << 1) sll t0, t0, t4 # x <<= (s << 1) addi t3, x0, 0x7F slli t3, t3, 8 ori t3, t3, 0xFF slt t1, t3, t0 # s = (x > 0x7FFF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x7FFF) ? 1 : 0 slli t4, t1, 0 add t2, t2, t4 # shift += (s << 0) sll t0, t0, t4 # x <<= (s << 0) addi a0, t2, 0 # return shift lw t0, 0(sp) lw t1, 4(sp) lw t2, 8(sp) lw t3, 12(sp) lw t4, 16(sp) addi sp, sp, 20 jr ra ``` ```asm= slt t1, t3, t0 # s = (x > 0x00FF) ? 1 : 0 xori t1, t1, 1 # s = (x <= 0x00FF) ? 1 : 0 slli t4, t1, 3 ``` ![image](https://hackmd.io/_uploads/SJ4r2mk4Ze.png) **IF (Instruction Fetch):** > PC Enable: 1 (High), since there is no pipeline stall > Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch. **ID (Instruction Decode):** > Reg Wr En: 1 (High), because the instruction in **WB** stage is `addi`, `addi` will write data back to the register file (x5 register). **EX (Execute):** > ALUOp1: from register file or forwarding path. > ALUOp2: from immediate. > ForwardA: (no forwarding), from ID/EX pipeline register. > ForwardB: (no forwarding), from ID/EX pipeline register. > Branch: not taken. **MEM (Memory):** > Data Memory Wr En: The instruction in **MEM** stage is `addi`, `addi` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**. **WB (Write Back):** > WB Mux: from ALU result, since the instruction in WB is `addi`. --- ![image](https://hackmd.io/_uploads/ry-5bXkEZe.png) **IF:** > PC Enable: 1 (High), since there is no pipeline stall > Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch. **ID:** > Reg Wr En: 1 (High), because the instruction in **WB** stage is `addi`, `addi` will write data back to the register file (x6 register). **EX:** > ALUOp1: from register file or forwarding path. > ALUOp2: from immediate. > ForwardA: (no forwarding), from ID/EX pipeline register. > ForwardB: (no forwarding), from ID/EX pipeline register. > Branch: not taken. **MEM:** > Data Memory Wr En: The instruction in **MEM** stage is `addi`, `addi` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**. **WB:** > WB Mux: from ALU result, since the instruction in WB is `addi`. --- ![image](https://hackmd.io/_uploads/SyaX7Nk4bl.png) **IF:** > PC Enable: 1 (High), since there is no pipeline stall > Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch. **ID:** > Reg Wr En: 1 (High), because the instruction in **WB** stage is `addi`, `addi` will write data back to the register file (x7 register). **EX:** (Forwarding Happens) > ALUOp1: from register file or forwarding path. > ALUOp2: from register file or forwarding path. > ForwardA: (**forwarding**), from **EX/MEM** pipeline register. > ForwardB: (no forwarding), from ID/EX pipeline register. > Branch: not taken. **MEM:** > Data Memory Wr En: The instruction in **MEM** stage is `addi`, `addi` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**. **WB:** > WB Mux: from ALU result, since the instruction in WB is `addi`. ![image](https://hackmd.io/_uploads/BkEZNEkEbg.png) ![image](https://hackmd.io/_uploads/Sk7NEN14Zl.png) ![image](https://hackmd.io/_uploads/ry4SVEJNbg.png) > Because x28 (0x000000FF) is not less than x5 (0x00000030), the `slt` result is 0, and it will be written back to the register file (x6) when the `slt` instruction is in the WB stage.
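Before continuing with the remaining cycles, the same computation can be cross-checked in software. The short C trace below (illustrative only, not part of the submitted code) walks the branchless 16-bit CLZ through the test input `0x30`; each `s` here corresponds to an `slt`+`xori` pair in the assembly, so the first `s = 1` matches the later check that `x6` becomes 1, and the total shift count is 10.

```c=
#include <stdio.h>

int main(void)
{
    unsigned x = 0x30, s, shift = 0;

    /* The same four steps as branchless_16_bit_clz, printed one by one. */
    s = (x <= 0x00FF); shift += (s << 3); x <<= (s << 3);
    printf("step 1: s=%u shift=%u x=0x%04X\n", s, shift, x);  /* s=1 shift=8  x=0x3000 */
    s = (x <= 0x0FFF); shift += (s << 2); x <<= (s << 2);
    printf("step 2: s=%u shift=%u x=0x%04X\n", s, shift, x);  /* s=0 shift=8  x=0x3000 */
    s = (x <= 0x3FFF); shift += (s << 1); x <<= (s << 1);
    printf("step 3: s=%u shift=%u x=0x%04X\n", s, shift, x);  /* s=1 shift=10 x=0xC000 */
    s = (x <= 0x7FFF); shift += (s << 0); x <<= (s << 0);
    printf("step 4: s=%u shift=%u x=0x%04X\n", s, shift, x);  /* s=0 shift=10 x=0xC000 */

    printf("clz16(0x0030) = %u\n", shift);                    /* expected: 10 */
    return 0;
}
```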
--- ![image](https://hackmd.io/_uploads/r1Mh-myE-x.png) **IF:** > PC Enable: 1 (High), since there is no pipeline stall > Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch. **ID:** > Reg Wr En: 1 (High), because the instruction in **WB** stage is `addi`, `addi` will write data back to the register file (x28 register). **EX:** (Forwarding happens) > ALUOp1: from register file or forwarding path. > ALUOp2: from immediate. > ForwardA: (**forwarding**), from **EX/MEM** pipeline register. > ForwardB: (no forwarding), from ID/EX pipeline register. > Branch: not taken. **MEM:** > Data Memory Wr En: The instruction in **MEM** stage is `slt`, `slt` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**. **WB:** > WB Mux: from ALU result, since the instruction in WB is `addi`. > With forwarding from the EX/MEM pipeline register, `xori x6, x6, 1` flips the low bit of `x6`; since the forwarded `slt` result is 0, `0 xor 1 = 1`. Therefore, `1` will be written back to x6 when the `xori` instruction is in the **WB stage**. --- ![image](https://hackmd.io/_uploads/Hkp3WXkN-x.png) **IF:** > PC Enable: 1 (High), since there is no pipeline stall > Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch. **ID:** > Reg Wr En: 1 (High), because the instruction in **WB** stage is `slt`, `slt` will write data back to the register file (x6 register). **EX:** (forwarding happens) > ALUOp1: from register file or forwarding path. > ALUOp2: from immediate. > ForwardA: (**forwarding**), from **EX/MEM** pipeline register. > ForwardB: (no forwarding), from ID/EX pipeline register. > Branch: not taken. **MEM:** > Data Memory Wr En: The instruction in **MEM** stage is `xori`, `xori` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**. **WB:** > WB Mux: from ALU result, since the instruction in WB is `slt`. --- ![image](https://hackmd.io/_uploads/SJ26bQJVZx.png) **IF:** > PC Enable: 1 (High), since there is no pipeline stall > Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch. **ID:** > Reg Wr En: 1 (High), because the instruction in **WB** stage is `xori`, `xori` will write data back to the register file (x6 register). **EX:** > ALUOp1: from register file or forwarding path. > ALUOp2: from register file or forwarding path. > ForwardA: (no forwarding), from ID/EX pipeline register. > ForwardB: (**forwarding**), from **EX/MEM** pipeline register. > Branch: not taken. **MEM:** > Data Memory Wr En: The instruction in **MEM** stage is `slli`, `slli` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**. **WB:** > WB Mux: from ALU result, since the instruction in WB is `xori`. > As mentioned before, the result of `xori x6, x6, 1` will be written back to the register file. ![image](https://hackmd.io/_uploads/SyzuOVJV-x.png) In the next cycle, `x6` will be `0x00000001`. --- ![image](https://hackmd.io/_uploads/BJaC-X14be.png) **IF:** > PC Enable: 1 (High), since there is no pipeline stall > Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch. **ID:** > Reg Wr En: 1 (High), because the instruction in **WB** stage is `slli`, `slli` will write data back to the register file (x29 register). **EX:** (forwarding happens) > ALUOp1: from register file or forwarding path. > ALUOp2: from register file or forwarding path. > ForwardA: (no forwarding), from ID/EX pipeline register. > ForwardB: (**forwarding**), from **MEM/WB** pipeline register. > Branch: not taken.
**MEM:** > Data Memory Wr En: The instruction in **MEM** stage is `add`, `add` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**. **WB:** > WB Mux: from ALU result, since the instruction in WB is `slli`. ![image](https://hackmd.io/_uploads/Sk8hdEJVWl.png) Check: `x6` is `1`, the function is correct. # References [RISCV branchless coding - *stackoverflow*](https://stackoverflow.com/questions/72340698/riscv-branchless-coding) [RISCV assembly 32bit multiplication without MUL giving incorrect result - *stackoverflow*](https://stackoverflow.com/questions/79503254/riscv-assembly-32bit-multiplication-without-mul-giving-incorrect-result) [Intel® 64 and IA-32 Architectures Optimization Reference Manual Volume 1 ](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html) # AI Usage [Copilot](https://copilot.microsoft.com/shares/iWjxqjKuh8HA6NGhFfUZD) [Copilot, correct my test function to output the encode and decode values](https://copilot.microsoft.com/shares/pAcS6KRjp4fjSTUFdP3KN) [ChatGPT](https://chatgpt.com/share/68f2559b-b4f0-8006-a075-d2c62339c95f) [Copilot, bf16_mul debug](https://copilot.microsoft.com/shares/dz4MNC8W3mtAYefiprkGN) [ChatGPT, bf16_mul debug](https://chatgpt.com/share/68f79767-7fb0-8006-8ac0-e30787814397) [Gemini, bf16_add proofreading](https://gemini.google.com/share/8f267b63f019) [Gemini, bf16_add test cases](https://gemini.google.com/share/f34265e85ffd) [Gemini, bf16_sqrt proofreading](https://gemini.google.com/share/a882f07040c2) [Gemini, bf16_mul proofreading](https://gemini.google.com/share/8ccf0e0f70a8) [Gemini, Branchless Programming](https://gemini.google.com/share/f0e56a89944b)