# Assignment 1: RISC-V Assembly and Instruction Pipeline
> ==Github: [dorobouharu133](https://github.com/dorobouharu133/ca2025-quizzes/tree/main)==
>[!Note] AI Tools Usage
>I used ChatGPT, Gemini and Copilot to find inspiration, refine the grammar of this article, explain the original C source code, debug and assist in generating the corresponding test data.
# Requirement 2
## Problem B
In uf8_encode, we can find some patterns in this function. In my first thought, I wanted to translate the C code to RISC-V instructions line-by-line.
```c=
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
    /* Estimate exponent - the formula is empirical */
    exponent = msb - 4;
    if (exponent > 15)
        exponent = 15;
    /* Calculate overflow for estimated exponent */
    for (uint8_t e = 0; e < exponent; e++)
        overflow = (overflow << 1) + 16;
    /* Adjust if estimate was off */
    while (exponent > 0 && value < overflow) {
        overflow = (overflow - 16) >> 1;
        exponent--;
    }
}
```
During the translation, I found that the for loop did not need to be translated into RISC-V instructions directly; a few bit operations give the same result.
Observing the code above, the range of `exponent` inside the for loop is [1, 15]: the loop is reached only when `msb >= 5`, so `exponent = msb - 4` is at least 1, and the following if statement caps it at 15.
Now look at the for loop itself. It starts with `uint8_t e = 0` and terminates when `e < exponent` fails, so with `exponent` in [1, 15] the loop runs 1 to 15 times.
The following steps describe my optimization.
```c=
for (uint8_t e = 0; e < exponent; e++)
    overflow = (overflow << 1) + 16;
```
The initial value of `overflow` is 0.
#### Pass 1
`overflow << 1` = `0 << 1` = 0, and adding 16 (decimal), i.e., `10000` (binary), gives `overflow` = `10000` (binary).
So after the first pass, `overflow` = `10000` (binary).
#### Pass 2
`overflow << 1` = `10000 << 1` = `100000` (binary), and `(overflow << 1) + 16` = `100000` + `10000` = `110000` (binary).
So after the second pass, `overflow` = `110000` (binary).
---
#### Observation
I observe that `overflow` follows a pattern: it is `exponent` leading 1 bits followed by a fixed group of four 0 bits.
Therefore, translating the for loop is unnecessary; a few bit operations produce the same pattern.
#### Bit Operation
`overflow` consists of `exponent` leading 1 bits followed by four 0 bits.
How can we generate a run of `exponent` 1 bits?
For example, assume `exponent` is 5; then `overflow` should be `111110000`, so we first want to obtain `11111`.
Here we observe that `11111` (binary) = 31 (decimal) = 32 - 1 (decimal), which is $2^5 - 1$.
So a run of `exponent` leading 1 bits is $2^{\text{exponent}} - 1$, and `overflow` is obtained by shifting that result left by 4 bits.
```asm=
# t0: exponent
# t1: overflow
addi t1, x0, 1
sll t1, t1, t0 # get 2^exponent
addi t1, t1, -1 # get leading 1's
slli t1, t1, 4 # get overflow
```
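As a quick sanity check (a minimal C sketch of my own, not part of the submitted assembly), the closed form can be compared against the original loop for every exponent in [1, 15]:
```c=
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    for (uint8_t exponent = 1; exponent <= 15; exponent++) {
        /* Original loop from uf8_encode */
        uint32_t overflow = 0;
        for (uint8_t e = 0; e < exponent; e++)
            overflow = (overflow << 1) + 16;
        /* Closed form used in the assembly: (2^exponent - 1) << 4 */
        uint32_t closed = ((1u << exponent) - 1) << 4;
        assert(overflow == closed);
    }
    printf("Closed form matches the loop for exponent 1..15\n");
    return 0;
}
```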
Below is my approach 1, which uses bit operations to replace the for loop.
### ==Approach 1==
> [Quiz 1 Problem B Approach 1](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/quiz1_approach/q1-uf8-approach-1.S)
```asm=
.data
pass_msg: .string "All tests passed!\n"
fail_msg: .string "Test failed for value: "
test_msg: .string "Testing value: "
arrow_msg: .string " -> decoded: "
encoded_msg: .string " -> re-encoded: "
newline: .string "\n"
.align 2
.text
.globl main
main:
# s0 will be our loop counter (0-255)
# s1 will store the original value
# s2 will store the decoded value
# s3 will store the re-encoded value
addi s0, x0, 0
test_loop:
# Print current test value
la a0, test_msg
addi a7, x0, 4
ecall
addi a0, s0, 0
addi a7, x0, 1
ecall
# First decode the value
addi a0, s0, 0
addi s1, a0, 0 # Save original value
jal uf8_decode
addi s2, a0, 0 # Save decoded value
# Print decoded value
la a0, arrow_msg
addi a7, x0, 4
ecall
addi a0, s2, 0
addi a7, x0, 1
ecall
# Now encode the decoded value
addi a0, s2, 0
jal uf8_encode
addi s3, a0, 0 # Save re-encoded value
# Print re-encoded value
la a0, encoded_msg
addi a7, x0, 4
ecall
addi a0, s3, 0
addi a7, x0, 1
ecall
la a0, newline
addi a7, x0, 4
ecall
# Check if re-encoded value matches original
bne s3, s1, test_fail
# Increment counter and continue
addi s0, s0, 1
addi t0, x0, 256
blt s0, t0, test_loop
# All tests passed
la a0, pass_msg
addi a7, x0, 4
ecall
j exit
test_fail:
la a0, fail_msg
addi a7, x0, 4
ecall
addi a0, s1, 0
addi a7, x0, 1
ecall
la a0, newline
addi a7, x0, 4
ecall
exit:
addi a7, x0, 10 # exit
ecall
# a0: x
# t0: n
# t1: c
# t2: y
clz:
addi t0, x0, 32 # n = 32
addi t1, x0, 16 # c = 16
clz_while:
srl t2, a0, t1 # y = x >> c
beq t2, x0, clz_skip
sub t0, t0, t1 # n -= c
addi a0, t2, 0 # x = y
clz_skip:
srli t1, t1, 1 # c >>= 1
bnez t1, clz_while # while (c != 0)
clz_ret:
sub a0, t0, a0 # return n - x
jr ra
# a0: f1
# t0: mantissa
# t1: exponent
# t2: offset
uf8_decode:
andi t0, a0, 0x0f # mantissa = f1 & 0x0f
srli t1, a0, 4 # exponent = f1 >> 4
addi t2, x0, 15
sub t2, t2, t1
addi t3, x0, 0x7F # replace pseudo instruction li t3, 0x7FFF
slli t3, t3, 8
ori t3, t3, 0xFF
srl t3, t3, t2
slli t3, t3, 4
sll a0, t0, t1
add a0, a0, t3
jr ra
# a0: value
# t0: lz
# t1: msb
# t2: exponent
# t3: overflow
# t4: temp
# t5: mantissa
# t6: next_overflow
uf8_encode:
addi t4, x0, 16
blt a0, t4, encode_return_lt16 # if(value < 16) then return value
encode_ge16:
addi sp, sp, -8
sw ra, 0(sp)
sw s0, 4(sp)
addi s0, a0, 0 # save value, help us not to access the memory
jal clz
addi t0, a0, 0 # lz
addi t1, x0, 31
sub t1, t1, t0 # msb = 31 - lz
addi t2, x0, 0 # exponent = 0
addi t3, x0, 0 # overflow = 0
addi t4, x0, 5
encode_if_1:
blt t1, t4, encode_while_2 # if(msb < 5) goto encode_while_2
addi t2, t1, -4 # exponent = msb - 4
encode_if_2:
addi t4, x0, 15
ble t2, t4, encode_for_1 # if (exponent <= 15) goto encode_for_1
addi t2, x0, 15
encode_for_1:
addi t4, x0, 1
sll t4, t4, t2
addi t4, t4, -1
slli t4, t4, 4
addi t3, t4, 0
encode_while_1:
beq t2, x0, encode_while_2
bge s0, t3, encode_while_2
addi t3, t3, -16
srli t3, t3, 1
addi t2, t2, -1
jal x0, encode_while_1
encode_while_2:
li t4, 15
bge t2, t4, encode_return
slli t6, t3, 1
addi t6, t6, 16
encode_if_3:
blt s0, t6, encode_return # if(value < nextoverflow) then break
addi t3, t6, 0
addi t2, t2, 1
jal x0, encode_while_2
encode_return:
sub t5, s0, t3
srl t5, t5, t2 # mantissa = (value - overflow) >> exponent
slli a0, t2, 4
or a0, a0, t5
lw s0, 4(sp)
lw ra, 0(sp)
addi sp, sp, 8
jr ra # can be omitted
encode_return_lt16:
jr ra
```
#### Test cases
In this case, I reuse the test function from the original C code as my test cases.
The test decodes every uf8 code from 0 to 255 and then re-encodes the decoded value.
Finally it verifies that the re-encoded byte matches the original code, i.e., the round trip is lossless.
The output shows `All tests passed!` and the program exits with code 0.
According to this test, the decode and encode functions are both correct.
#### Compare to Compiler
Below is the execution info of my RISC-V code.
(Run in 5-stage pipeline CPU)
```
// My approach 1
Cycles: 40004
Instrs. retired: 28164
CPI: 1.42
IPC: 0.74
Clock rate: 0Hz
```
And below is the execution info of the compiler-generated code (Compiler Explorer, -O2).
(Run in 5-stage pipeline CPU)
> GitHub: [q1-uf8-compiler.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/quiz1_approach/q1-uf8-compiler.S)
```
// Compiler-optimized
Cycles: 106898
Instrs. retired: 73350
CPI: 1.46
IPC: 0.686
Clock rate: 152.73 Hz
```
**Performance:** My approach 1 is about **2.67x faster** than the compiler-optimized version (40,004 vs. 106,898 cycles).
**Code Size:** Compared to the compiler-optimized version, my approach 1 reduces the code size by nearly **52%**.
___
### ==Approach 2==
> [Quiz 1 Problem B Approach 2](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/quiz1_approach/q1-uf8-approach-2.S)
```asm=
.data
overflow_1: .word 0x10
overflow_2: .word 0x30
overflow_3: .word 0x70
overflow_4: .word 0xF0
overflow_5: .word 0x1F0
overflow_6: .word 0x3F0
overflow_7: .word 0x7F0
overflow_8: .word 0xFF0
overflow_9: .word 0x1FF0
overflow_10: .word 0x3FF0
overflow_11: .word 0x7FF0
overflow_12: .word 0xFFF0
overflow_13: .word 0x1FFF0
overflow_14: .word 0x3FFF0
overflow_15: .word 0x7FFF0
pass_msg: .string "All test passed!"
newline: .string "\n"
.align 2
.text
.globl main
main:
# li a0, 54372
lui a0, 13
addi a0, a0, 1124
jal uf8_encode # call function
add s0, x0, a0 # save result
addi a7, x0, 1 # print integer
addi a0, s0, 0
ecall
la a0, newline
addi a7, x0, 4
ecall
addi a7, x0, 10 # exit
ecall
# a0: x
# t0: n
# t1: c
# t2: y
clz:
addi t0, x0, 32 # n = 32
addi t1, x0, 16 # c = 16
clz_while:
srl t2, a0, t1 # y = x >> c
beq t2, x0, clz_skip
sub t0, t0, t1 # n -= c
addi a0, t2, 0 # x = y
clz_skip:
srli t1, t1, 1 # c >>= 1
bnez t1, clz_while # while (c != 0)
clz_ret:
sub a0, t0, a0 # return n - x
jr ra
# a0: f1
# t0: mantissa
# t1: exponent
# t2: offset
uf8_decode:
andi t0, a0, 0x0f # mantissa = f1 & 0x0f
srli t1, a0, 4 # exponent = f1 >> 4
addi t2, x0, 15
sub t2, t2, t1
addi t3, x0, 0x7F
slli t3, t3, 8
ori t3, t3, 0xFF
srl t3, t3, t2
slli t3, t3, 4
sll a0, t0, t1
add a0, a0, t3
jr ra
# a0: value
# t0: lz
# t1: msb
# t2: exponent
# t3: overflow
# t4: temp
# t5: mantissa
# t6: next_overflow
uf8_encode:
addi t4, x0, 16
blt a0, t4, encode_return_lt16 # if(value < 16) return value
encode_ge16: # function call, need to save ra, a0
addi sp, sp, -8
sw ra, 0(sp)
sw s0, 4(sp)
addi s0, a0, 0 # save value
jal clz
addi t0, a0, 0 # lz
addi t1, x0, 31
sub t1, t1, t0 # msb = 31 - lz
addi t2, x0, 0 # exponent = 0
addi t3, x0, 0 # overflow = 0
addi t4, x0, 5
encode_if_1:
blt t1, t4, encode_while_2 # if(msb < 5) goto encode_while_2
addi t2, t1, -4 # exponent = msb - 4
encode_if_2:
addi t4, x0, 15
ble t2, t4, encode_for_1 # if (exponent <= 15) goto encode_for_1
addi t2, x0, 15
encode_overflow: # calculate overflow using memory lookup
la t3, overflow_1 # t3 = base address of table
addi t4, t2, -1 # t4 = exponent - 1
slli t4, t4, 2
add t3, t3, t4 # t3 = &overflow_{exponent}
lw t3, 0(t3) # t3 = overflow
encode_while_1: # adjust exponent # need to be optimized
beq t2, x0, encode_while_2
slt t4, s0, t3 # t4 = (value < overflow)
beq t4, x0, encode_while_2
addi t3, t3, -16
srli t3, t3, 1
addi t2, t2, -1
jal x0, encode_while_1
encode_while_2: # optimize later # find exact exponent
addi t4, x0, 15
bge t2, t4, encode_return
slli t6, t3, 1
addi t6, t6, 16
encode_if_3:
blt s0, t6, encode_return
addi t3, t6, 0
addi t2, t2, 1
jal x0, encode_while_2
encode_return:
sub t5, s0, t3
srl t5, t5, t2 # mantissa = (value - overflow) >> exponent
slli a0, t2, 4
or a0, a0, t5
lw s0, 4(sp)
lw ra, 0(sp)
addi sp, sp, 8 # restore stack
jr ra
encode_return_lt16:
jr ra
```
In my approach 2, I use a lookup table to speed up the procedure. As discussed for approach 1, the overflow values produced by the for loop follow a fixed pattern, so we can store all of them in memory and fetch the right one directly.
```asm=
encode_overflow: # calculate overflow using memory lookup
la t3, overflow_1 # t3 = base address of table
addi t4, t2, -1 # t4 = exponent - 1
slli t4, t4, 2
add t3, t3, t4 # t3 = &overflow_{exponent}
lw t3, 0(t3) # t3 = overflow
```
As in the first approach, overflow is `exponent` leading 1 bits followed by four fixed 0 bits, so instead of computing it we index the table in memory with the byte offset `(exponent - 1) * 4`.
```asm=
.data
overflow_1: .word 0x10
overflow_2: .word 0x30
overflow_3: .word 0x70
overflow_4: .word 0xF0
overflow_5: .word 0x1F0
overflow_6: .word 0x3F0
overflow_7: .word 0x7F0
overflow_8: .word 0xFF0
overflow_9: .word 0x1FF0
overflow_10: .word 0x3FF0
overflow_11: .word 0x7FF0
overflow_12: .word 0xFFF0
overflow_13: .word 0x1FFF0
overflow_14: .word 0x3FFF0
overflow_15: .word 0x7FFF0
```
Above is the lookup table; the overflow values are listed in ascending order of exponent.
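For reference, a small C sketch (an illustration of my own, not the submitted code) that checks the table entries against the closed form from approach 1:
```c=
#include <assert.h>
#include <stdint.h>

/* overflow_table[e - 1] corresponds to the words overflow_1 .. overflow_15 above */
static const uint32_t overflow_table[15] = {
    0x10,   0x30,   0x70,    0xF0,    0x1F0,
    0x3F0,  0x7F0,  0xFF0,   0x1FF0,  0x3FF0,
    0x7FF0, 0xFFF0, 0x1FFF0, 0x3FFF0, 0x7FFF0,
};

int main(void) {
    for (uint8_t e = 1; e <= 15; e++)
        assert(overflow_table[e - 1] == (((1u << e) - 1) << 4));
    return 0;
}
```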
#### Test Cases
The output again shows `All tests passed!` with exit code 0, so my approach 2 passes the tests too.
#### Compare to compiler
Below is the execution info of my approach 2.
(Run in 5-stage pipeline CPU)
```
// My approach 2
Cycles: 40486
Instrs. retired: 28628
CPI: 1.41
IPC: 0.707
Clock rate: 0Hz
```
**Performance:** My approach 2 is about **2.64x faster** than the compiler-optimized version.
**Code Size:** Compared to the compiler-optimized version, my approach 2 reduces the code size by nearly **42%**.
In terms of clock cycles, approach 2 is slightly slower than approach 1. Still, coming up with the lookup-table method was a worthwhile exercise, even though it does not outperform approach 1.
---
## Problem C
### ==bf16_sqrt==
> GitHub: [bf16_sqrt](https://github.com/dorobouharu133/ca2025-quizzes/tree/main/bf16_sqrt)
> Because the `mul` instruction is not part of the [RV32I base instruction set](https://docs.riscv.org/reference/isa/unpriv/rv32.html), I first build the RISC-V code using `mul` and later develop a way to replace it.
> I translate the C code to RISC-V line by line first, still using `mul`; in the following steps I optimize the RISC-V code.
#### bf16_sqrt_with_mul
> GitHub: [bf16_sqrt_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/8052ba042ba0f1216df9d70e5d97e2f7e00ebeb1/bf16_sqrt.S)
```asm=
.data
newline: .string "\n"
pass_msg: .string "bf16_sqrt All tests passed!"
fail_msg: .string "Tests failed!"
.text
.globl bf16_sqrt
.globl __mulsi3
.globl main
main:
# sqrt(+inf) = +inf
test_1:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0x80
jal ra, bf16_sqrt
addi s0, a0, 0
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0x80
bne s0, t0, test_failed
# sqrt(NaN) = NaN
test_2:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0xC0
jal ra, bf16_sqrt
addi s0, a0, 0
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0xC0
bne s0, t0, test_failed
# sqrt(-1.000000) = NaN
test_3:
addi a0, x0, 0xBF
slli a0, a0, 8
ori a0, a0, 0x80
jal ra, bf16_sqrt
addi s0, a0, 0
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0xC0
bne s0, t0, test_failed
# sqrt(2.000000) = 1.414062
test_4:
addi a0, x0, 0x40
slli a0, a0, 8
jal ra, bf16_sqrt
addi s0, a0, 0
addi t0, x0, 0x3F
slli t0, t0, 8
ori t0, t0, 0xB5
bne s0, t0, test_failed
# sqrt(1.000000) = 1.000000
test_5:
addi a0, x0, 0x3F
slli a0, a0, 8
ori a0, a0, 0x80
jal ra, bf16_sqrt
addi t0, x0, 0x3F
slli t0, t0, 8
ori t0, t0, 0x80
bne a0, t0, test_failed
# sqrt(0.000000) = 0.000000
test_6:
addi a0, x0, 0
jal ra, bf16_sqrt
addi t0, x0, 0x0000
bne a0, t0, test_failed
# sqrt(2648071397852762090550553127902248960.000000) = 1630303065108119552.000000
test_7:
addi a0, x0, 0x7B
slli a0, a0, 8
ori a0, a0, 0xFF
jal ra, bf16_sqrt
addi t0, x0, 0x5D
slli t0, t0, 8
ori t0, t0, 0xB5
bne a0, t0, test_failed
test_passed:
la a0, pass_msg
addi a7, x0, 4
ecall
jal x0, end
test_failed:
la a0, fail_msg
addi a7, x0, 4
ecall
end:
la a0, newline
addi a7, x0, 4
ecall
addi a7, x0, 10
ecall
# a0: a
# t0: sign
# t1: exp
# t2: mant
# t3: temp
# t4: e
# t5: m
# t6: new_exp
# s0: low
# s1: high
# s2: result
# s3: mid
# s4: sq
# s5: new_mant
bf16_sqrt:
srli t0, a0, 15
srli t1, a0, 7
andi t1, t1, 0xFF # exp = (a.bits >> 7) & 0xFF
andi t2, a0, 0x7F # mant = a.bits & 0x7F
sqrt_special_case:
addi t3, x0, 0xFF
bne t1, t3, sqrt_skip_1 # if(exp != 0xFF) then goto sqrt_skip_1
bne t2, x0, sqrt_return_a
bne t0, x0, sqrt_return_bf16_nan
jal x0, sqrt_return_a
sqrt_skip_1:
bne t1, x0, sqrt_skip_2
beq t2, x0, sqrt_return_bf16_zero
sqrt_skip_2:
bne t0, x0, sqrt_return_bf16_nan
sqrt_skip_3:
beq t1, x0, sqrt_return_bf16_zero
addi t4, t1, -127 # e = exp - BF16_EXP_BIAS
ori t5, t2, 0x80 # m = 0x80 | mant
adjust_for_odd_exponents_if:
andi t3, t4, 1
beq t3, x0, adjust_for_odd_exponents_else
slli t5, t5, 1
addi t3, t4, -1 # t4 = (e - 1)
srli t3, t3, 1
addi t6, t3, 127 # new_exp = ((e-1) >> 1) + BF16_EXP_BIAS
beq x0, x0, sqrt_binary_search_initialize
adjust_for_odd_exponents_else:
srli t3, t4, 1
addi t6, t3, 127
sqrt_binary_search_initialize:
addi s0, x0, 90 # low = 90
addi s1, x0, 256 # high = 256
addi s2, x0, 128 # result = 128
sqrt_while_loop: # binary_search
bgt s0, s1, sqrt_normalize_greater_equal
add s3, s0, s1 # s3 = (low + high)
srli s3, s3, 1 # mid = (low + high) >> 1
sqrt_mul: ## mul is not allowed, need to be modified !!
# s4: sq
# mid*mid implement
mul s4, s3, s3 # sq = mid * mid, need to be replaced
srli s4, s4, 7
sqrt_while_loop_if:
bgt s4, t5, sqrt_while_loop_else
addi s2, s3, 0 # result = mid
addi s0, s3, 1 # low = mid + 1
jal x0, sqrt_while_loop
sqrt_while_loop_else:
addi s1, s3, -1 # high = mid - 1
jal x0, sqrt_while_loop
sqrt_normalize_greater_equal: # branchless
addi t3, x0, 256
sltu t3, s2, t3 # t3 = (result < 256) ? 1 : 0
xori t3, t3, 1 # t3 = (t3) ? 0 : 1, reverse t3
srl s2, s2, t3
add t6, t6, t3 # t6: new_exp
jal x0, sqrt_normalize_done
sqrt_normalize_less:
addi s6, x0, 128
sqrt_normalize_less_loop:
bge s2, s6, sqrt_normalize_done
addi t3, x0, 1
ble t6, t3, sqrt_normalize_done
slli s2, s2, 1
addi t6, t6, -1
jal x0, sqrt_normalize_less_loop
sqrt_normalize_done:
andi s5, s2, 0x7F # new_mant = result & 0x7F, use t2 to store new_mant
addi t3, x0, 0xFF
bge t6, t3, sqrt_return_pInf
ble t6, x0, sqrt_return_bf16_zero
andi t3, t6, 0xFF
slli t3, t3, 7
or a0, t3, s5 # .bits = (new_exp & 0xFF) << 7 | new_mant
jr ra
sqrt_return_a:
jr ra
sqrt_return_bf16_nan:
li a0, 0x7FC0
jr ra
sqrt_return_bf16_zero:
andi a0, x0, 0
jr ra
sqrt_return_pInf:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0x80
jr ra
```
> **Note**: The code above serves as a **baseline** for implementing multiplication without the `mul` instruction.
#### bf16_sqrt Test Cases
Using the `mul` instruction first, I want to verify that the functionality of my approach is correct.
Let's start with some corner cases as test cases.
```
1. sqrt(`0x7F80`) = `0x7F80`, sqrt(+inf) = +inf
2. sqrt(`0x7FC0`) = `0x7FC0`, sqrt(NaN) = NaN
3. sqrt(`0xBF80`) = `0x7FC0`, sqrt(-1.000000) = NaN
4. sqrt(`0x4000`) = `0x3FB5`, sqrt(2.000000) = 1.414062
5. sqrt(`0x3F80`) = `0x3F80`, sqrt(1.000000) = 1.000000
6. sqrt(`0x0000`) = `0x0000`, sqrt(0.000000) = 0.000000
7. sqrt(`0x7BFF`) = `0x5DB5`, sqrt(2648071397852762090550553127902248960.000000) = 1630303065108119552.000000
```
(Run in 5-stage pipeline CPU)
> GitHub: [bf16_sqrt_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/8052ba042ba0f1216df9d70e5d97e2f7e00ebeb1/bf16_sqrt.S)
```
// My original approach for bf16_sqrt
Cycles: 555
Instrs. retired: 403
CPI: 1.38
IPC: 0.726
Clock rate: 584.21 Hz
```
___
#### ==bf16_sqrt_shift_add==
The next step is to develop an alternative that replaces the `mul` instruction in the `bf16_sqrt_with_mul` function.
A common approach is to implement multiplication manually, which we can motivate by revisiting classic [**long multiplication**](https://mathworld.wolfram.com/LongMultiplication.html).
The [**shift-and-add multiplication**](https://users.utcluj.ro/%7Ebaruch/book_ssce/SSCE-Shift-Mult.pdf) method provides a straightforward way to realize [**long multiplication**](https://mathworld.wolfram.com/LongMultiplication.html) in software.
We can use a simple binary multiplication example to illustrate how the shift-and-add method is implemented.
For example, `0x81` $\times$ `0xC3` $=$ `0x6243` ($129 \times 195 = 25{,}155$).
In binary, `1000 0001` (multiplicand) $\times$ `1100 0011` (multiplier) $=$ `110 0010 0100 0011`.
```
           1000 0001  <- Multiplicand (0x81)
   x       1100 0011  <- Multiplier   (0xC3)
   ------------------------------------------
   pass0:          1000 0001   (multiplier bit 0 is 1: add multiplicand << 0)
   pass1:        1 0000 0010   (multiplier bit 1 is 1: add multiplicand << 1)
   pass2:                  0   (multiplier bit 2 is 0: skip)
   pass3:                  0   (multiplier bit 3 is 0: skip)
   pass4:                  0   (multiplier bit 4 is 0: skip)
   pass5:                  0   (multiplier bit 5 is 0: skip)
   pass6:   10 0000 0100 0000  (multiplier bit 6 is 1: add multiplicand << 6)
   pass7:  100 0000 1000 0000  (multiplier bit 7 is 1: add multiplicand << 7)
   ------------------------------------------
   Result: 110 0010 0100 0011  (0x6243)
```
**Algorithm Logic**:
In each pass, we examine the **LSB** of the **Multiplier**:
- If bit 0 is 1, add the current Multiplicand to the partial product.
- If bit 0 is 0, skip the addition.
After each pass, we **shift the Multiplicand 1 bit to the left** and **shift the Multiplier 1 bit to the right**. This allows us to consistently check the LSB of the Multiplier in every iteration.
The process continues until the Multiplier becomes zero.
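Before writing the assembly, the same procedure can be written in a few lines of C (a sketch used only to cross-check the worked example above):
```c=
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiplication: test the multiplier LSB, conditionally add the
 * multiplicand, then shift the multiplicand left and the multiplier right. */
static uint32_t shift_and_add(uint32_t multiplicand, uint32_t multiplier) {
    uint32_t acc = 0;
    while (multiplier != 0) {
        if (multiplier & 1)
            acc += multiplicand;
        multiplicand <<= 1;
        multiplier >>= 1;
    }
    return acc;
}

int main(void) {
    assert(shift_and_add(0x81, 0xC3) == 0x6243); /* 129 * 195 = 25155 */
    return 0;
}
```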
I implement the **shift-and-add** logic in the RISC-V code below:
```asm=
# a0: multiplicand
# a1: multiplier
# t0: accumulator
# t1: multiplier_copy
# t2: bit0
shift_and_add:
# save stack if needed
addi t0, x0, 0
addi t1, a1, 0
andi t2, a1, 1
shift_and_add_for_loop:
andi t2, t1, 1
beq t2, x0, shift_and_add_for_skip_add
add t0, t0, a0
shift_and_add_for_skip_add:
slli a0, a0, 1
srli t1, t1, 1
bne t1, x0, shift_and_add_for_loop
shift_and_add_end:
# restore stack if needed
addi a0, t0, 0
jr ra
```
And I implement a test to verify the correctness of the algorithm.
```asm=
.data
newline: .string "\n"
pass_msg: .string "SHIFT-AND-ADD implementation successful!"
fail_msg: .string "SHIFT-AND-ADD implementation failed!"
.text
.global shift_and_add
.global main
main:
addi a0, x0, 0x80
addi a1, x0, 0xC0
jal ra, shift_and_add
addi s0, a0, 0
addi t0, x0, 0x60
slli t0, t0, 8
bne s0, t0, test_failed
test_passed:
la a0, pass_msg
addi a7, x0, 4
ecall
jal x0, end
test_failed:
la a0, fail_msg
addi a7, x0, 4
ecall
end:
la a0, newline
addi a7, x0, 4
ecall
addi a7, x0, 10
ecall
# a0: multiplicand
# a1: multiplier
# t0: accumulator
# t1: multiplier_copy
# t2: bit0
shift_and_add:
# save stack if needed
addi t0, x0, 0
addi t1, a1, 0
andi t2, a1, 1
shift_and_add_for_loop:
andi t2, t1, 1
beq t2, x0, shift_and_add_for_skip_add
add t0, t0, a0
shift_and_add_for_skip_add:
slli a0, a0, 1
srli t1, t1, 1
bne t1, x0, shift_and_add_for_loop
shift_and_add_end:
# restore stack if needed
addi a0, t0, 0
jr ra
```
Running this test prints `SHIFT-AND-ADD implementation successful!`, so the multiplication routine works.
We can now replace the `mul` instruction in the original code (note that the caller has to save and restore registers on the stack around the call).
> GitHub: [bf16_sqrt.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/7e584d6ae8203d226c6e8bb119ce17bd55b7bdeb/bf16_sqrt.S)
```asm=
.data
newline: .string "\n"
pass_msg: .string "All tests passed!"
fail_msg: .string "Tests failed!"
.text
.global bf16_sqrt
.global main
main:
# sqrt(+inf) = +inf
test_1:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0x80
jal ra, bf16_sqrt
addi s0, a0, 0
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0x80
bne s0, t0, test_failed
# sqrt(NaN) = NaN
test_2:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0xC0
jal ra, bf16_sqrt
addi s0, a0, 0
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0xC0
bne s0, t0, test_failed
# sqrt(-1.000000) = NaN
test_3:
addi a0, x0, 0xBF
slli a0, a0, 8
ori a0, a0, 0x80
jal ra, bf16_sqrt
addi s0, a0, 0
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0xC0
bne s0, t0, test_failed
# sqrt(2.000000) = 1.414062
test_4:
addi a0, x0, 0x40
slli a0, a0, 8
jal ra, bf16_sqrt
addi s0, a0, 0
addi t0, x0, 0x3F
slli t0, t0, 8
ori t0, t0, 0xB5
bne s0, t0, test_failed
# sqrt(1.000000) = 1.000000
test_5:
addi a0, x0, 0x3F
slli a0, a0, 8
ori a0, a0, 0x80
jal ra, bf16_sqrt
addi t0, x0, 0x3F
slli t0, t0, 8
ori t0, t0, 0x80
bne a0, t0, test_failed
# sqrt(0.000000) = 0.000000
test_6:
addi a0, x0, 0
jal ra, bf16_sqrt
addi t0, x0, 0x0000
bne a0, t0, test_failed
# sqrt(2648071397852762090550553127902248960.000000) = 1630303065108119552.000000
test_7:
addi a0, x0, 0x7B
slli a0, a0, 8
ori a0, a0, 0xFF
jal ra, bf16_sqrt
addi t0, x0, 0x5D
slli t0, t0, 8
ori t0, t0, 0xB5
bne a0, t0, test_failed
test_passed:
la a0, pass_msg
addi a7, x0, 4
ecall
jal x0, end
test_failed:
la a0, fail_msg
addi a7, x0, 4
ecall
end:
la a0, newline
addi a7, x0, 4
ecall
addi a7, x0, 10
ecall
# a0: a
# t0: sign
# t1: exp
# t2: mant
# t3: temp
# t4: e
# t5: m
# t6: new_exp
# s0: low
# s1: high
# s2: result
# s3: mid
# s4: sq
# s5: new_mant
bf16_sqrt:
srli t0, a0, 15
srli t1, a0, 7
andi t1, t1, 0xFF # exp = (a.bits >> 7) & 0xFF
andi t2, a0, 0x7F # mant = a.bits & 0x7F
sqrt_special_case:
addi t3, x0, 0xFF
bne t1, t3, sqrt_skip_1 # if(exp != 0xFF) then goto sqrt_skip_1
bne t2, x0, sqrt_return_a
bne t0, x0, sqrt_return_bf16_nan
jal x0, sqrt_return_a
sqrt_skip_1:
bne t1, x0, sqrt_skip_2
beq t2, x0, sqrt_return_bf16_zero
sqrt_skip_2:
bne t0, x0, sqrt_return_bf16_nan
sqrt_skip_3:
beq t1, x0, sqrt_return_bf16_zero
addi t4, t1, -127 # e = exp - BF16_EXP_BIAS
ori t5, t2, 0x80 # m = 0x80 | mant
adjust_for_odd_exponents_if:
andi t3, t4, 1
beq t3, x0, adjust_for_odd_exponents_else
slli t5, t5, 1
addi t3, t4, -1 # t4 = (e - 1)
srli t3, t3, 1
addi t6, t3, 127 # new_exp = ((e-1) >> 1) + BF16_EXP_BIAS
beq x0, x0, sqrt_binary_search_initialize
adjust_for_odd_exponents_else:
srli t3, t4, 1
addi t6, t3, 127
sqrt_binary_search_initialize:
addi s0, x0, 90 # low = 90
addi s1, x0, 256 # high = 256
addi s2, x0, 128 # result = 128
sqrt_while_loop: # binary_search
bgt s0, s1, sqrt_normalize_greater_equal
add s3, s0, s1 # s3 = (low + high)
srli s3, s3, 1 # mid = (low + high) >> 1
sqrt_mul: # SHIFT-AND-ADD
# s4: sq
# mid*mid implemented by shift-and-add
addi sp, sp, -12
sw ra, 0(sp)
sw a0, 4(sp)
sw a1, 8(sp)
addi a0, s3, 0
addi a1, s3, 0
jal ra, shift_and_add
addi a3, a0, 0
lw ra, 0(sp)
lw a0, 4(sp)
lw a1, 8(sp)
addi sp, sp, 12
addi s4, a3, 0
srli s4, s4, 7
sqrt_while_loop_if:
bgt s4, t5, sqrt_while_loop_else
addi s2, s3, 0 # result = mid
addi s0, s3, 1 # low = mid + 1
jal x0, sqrt_while_loop
sqrt_while_loop_else:
addi s1, s3, -1 # high = mid - 1
jal x0, sqrt_while_loop
sqrt_normalize_greater_equal: # branchless
addi t3, x0, 256
sltu t3, s2, t3 # t3 = (result < 256) ? 1 : 0
xori t3, t3, 1 # t3 = (t3) ? 0 : 1, reverse t3
srl s2, s2, t3
add t6, t6, t3 # t6: new_exp
jal x0, sqrt_normalize_done
sqrt_normalize_less:
addi s6, x0, 128
sqrt_normalize_less_loop:
bge s2, s6, sqrt_normalize_done
addi t3, x0, 1
ble t6, t3, sqrt_normalize_done
slli s2, s2, 1
addi t6, t6, -1
jal x0, sqrt_normalize_less_loop
sqrt_normalize_done:
andi s5, s2, 0x7F # new_mant = result & 0x7F, use t2 to store new_mant
addi t3, x0, 0xFF
bge t6, t3, sqrt_return_pInf
ble t6, x0, sqrt_return_bf16_zero
andi t3, t6, 0xFF
slli t3, t3, 7
or a0, t3, s5 # .bits = (new_exp & 0xFF) << 7 | new_mant
jr ra
sqrt_return_a:
jr ra
sqrt_return_bf16_nan:
li a0, 0x7FC0
jr ra
sqrt_return_bf16_zero:
andi a0, x0, 0
jr ra
sqrt_return_pInf:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0x80
jr ra
# a0: multiplicand
# a1: multiplier
# t0: accumulator
# t1: multiplier_copy
# t2: bit0
shift_and_add:
addi sp, sp, -12
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
addi t0, x0, 0
addi t1, a1, 0
andi t2, a1, 1
shift_and_add_for_loop:
andi t2, t1, 1
beq t2, x0, shift_and_add_for_skip_add
add t0, t0, a0
shift_and_add_for_skip_add:
slli a0, a0, 1
srli t1, t1, 1
bne t1, x0, shift_and_add_for_loop
shift_and_add_end:
addi a1, t0, 0
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
addi sp, sp, 12
addi a0, a1, 0
jr ra
```
#### bf16_sqrt Test Cases
(Run in 5-stage pipeline CPU)
> GitHub: [bf16_sqrt_test.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sqrt/bf16_sqrt_test.S)
```
// My improved approach for bf16_sqrt
Cycles: 2726
Instrs. retired: 2008
CPI: 1.36
IPC: 0.737
Clock rate: 259.26 Hz
```
> GitHub: [bf16_sqrt_compiler.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sqrt/bf16_sqrt_compiler.S)
```
// bf16_sqrt compiler version
Cycles: 3315
Instrs. retired: 2278
CPI: 1.46
IPC: 0.687
Clock rate: 7.37 KHz
```
**Performance:** My improved approach is **1.21x faster** than the compiler-optimized version.
**Code Size:** Compared to the compiler-optimized version, my improved approach reduces the code size by nearly **39%**.
While shift-and-add is a workable alternative to the `mul` instruction, its performance is significantly lower: this version is roughly **4.9x slower** than the `mul`-based baseline (2726 vs. 555 cycles).
___
### ==bf16_add==
> GitHub: [bf16_add](https://github.com/dorobouharu133/ca2025-quizzes/tree/main/bf16_add)
The code linked below is the initial template for my **bf16_add** implementation. I will introduce the optimized version and provide the details in the next section.
#### [bf16_add_original code](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_add/bf16_add_original.S)
The link above is the implementation of my original approach.
#### bf16_add Test Cases
```
1. `0x3F80` + `0xBF80`, +1.000000 + (-1.000000) = 0.000000
2. `0x7F7F` + `0x7F7F`, maximum finite positive number + maximum finite positive number = +inf
3. `0x3F80` + `0x3B80`, 1.000000 + 2^-8 = 1.000000
4. `0x7F80` + `0xFF80`, inf + (-inf) = NaN
```
> GitHub: [bf16_add_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_add/bf16_add_original.S)
(Run in 5-stage pipeline CPU)
```
// My original approach for bf16_add
Cycles: 247
Instrs. retired: 189
CPI: 1.31
IPC: 0.765
Clock rate: 228.28 Hz
```
The above figure is the execution info of my approach, and the below is the execution info of compiler.
(Run in 5-stage pipeline CPU)
> GitHub: [bf16_add_compiler.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_add/bf16_add_compiler.S)
```
// Compiler-optimized bf16_add
Cycles: 259
Instrs. retired: 207
CPI: 1.25
IPC: 0.799
Clock rate: 254.92 Hz
```
My original approach is about **1.05x faster** than the compiler-optimized version (**-O2**).
#### bf16_add_improved
**Observation:** This section implements the **alignment** logic.
```c=
if (exp_diff > 0) {
    result_exp = exp_a;
    if (exp_diff > 8)
        return a;
    mant_b >>= exp_diff;
} else if (exp_diff < 0) {
    result_exp = exp_b;
    if (exp_diff < -8)
        return b;
    mant_a >>= -exp_diff;
} else {
    result_exp = exp_a;
}
```
In floating-point addition, numbers with different exponents cannot be summed directly. For example, to add $3.0 \times 10^{4}$ and $1.5 \times 10^{2}$, we cannot simply add the mantissas ($3.0 + 1.5$). The **exponents** must be aligned first.
1. **Alignment Process:** The number with the smaller exponent is adjusted to match the larger exponent.
$$
1.5 \times 10^{2} \to 0.15 \times 10^{3} \to 0.015 \times 10^{4}
$$
2. **Addition:** Now, we can sum the two numbers:
$$
3.0 \times 10^{4} + 0.015 \times 10^{4} = (3.0 + 0.015) \times 10^{4} = 3.015 \times 10^{4}
$$
> **Note:** After addition, it is necessary to check whether normalization is required. In this case, the result is already normalized, so no further adjustment is needed.
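To make the alignment step concrete for bf16, here is a small C sketch (my own illustration; the function name and signature are invented for this example). It assumes both operands are normal and that $|A| \ge |B|$, which the branchless swap described next guarantees:
```c=
#include <stdint.h>

/* Align b's mantissa to a's exponent before the mantissas are added.
 * Assumes normal inputs and |a| >= |b|, so exp_diff is never negative. */
void bf16_align(uint16_t a, uint16_t b,
                uint16_t *mant_a, uint16_t *mant_b, int16_t *result_exp) {
    int16_t exp_a = (a >> 7) & 0xFF;
    int16_t exp_b = (b >> 7) & 0xFF;
    int16_t exp_diff = exp_a - exp_b;

    *mant_a = (a & 0x7F) | 0x80;   /* restore the hidden bit */
    *mant_b = (b & 0x7F) | 0x80;
    *result_exp = exp_a;           /* the larger exponent wins */
    /* b is negligible beyond 8 bits of shift; otherwise shift it into position */
    *mant_b = (exp_diff > 8) ? 0 : (uint16_t)(*mant_b >> exp_diff);
}
```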
I implement a **branchless swap** to ensure that the **magnitude of operand A** is always greater than or equal to that of **operand B** (i.e., $|A| \ge |B|$).
Below, I explain how to implement the swap of A and B with branchless logic.
Since we want $|A|$ to always be greater than or equal to $|B|$, only the magnitudes matter here; the `result_exp` can be determined later.
The first step is to obtain the magnitude, which we can do with a mask.
```asm=
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0xFF # mask = 0x7FFF
```
Because `0x7FFF` = `0111 1111 1111 1111` in binary, ANDing with this mask clears the sign bit of a and b.
This is implemented in my code as below:
```asm=
bf16_add_align_branchless:
# a0: a
# a1: b
# s0: unsigned_a
# s1: unsigned_b
# t6: temp, used as mask
addi t6, x0, 0x7F
slli t6, t6, 8
ori t6, t6, 0xFF # mask = 0x7FFF
and s0, a0, t6 # unsigned_a = a.bits & mask
and s1, a1, t6 # unsigned_b = b.bits & mask
slt t6, s0, s1 # t6 = (|A| < |B|) ? 1 : 0
sub t6, x0, t6 # t6 = (|A| < |B|) ? 0xFFFF : 0x0
```
Here we can get `t6` as a mask:
- if $|A| < |B|$, then t6 = `0xFFFF`. (**Need to swap**)
- if $|A| >= |B|$, then t6 = `0x0000`. (**No need to swap**)
Then the mask (t6) can be used to swap the sign, exponent, and mantissa simultaneously.
Let me explain the principle of using XOR to swap; below are the relevant properties of the XOR operation:
$$
\begin{aligned}
A \oplus 0 &= A && \\
(A \oplus B) \oplus A &= B &&
\end{aligned}
$$
$$
\begin{aligned}
\text{If} \quad & C = A \oplus B \\
\text{then} \quad & C \oplus A = B \\
\text{and} \quad & C \oplus B = A
\end{aligned}
$$
`and s0, s0, t6` is what decides whether the swap actually happens.
if $\text{C} = \text{A} \oplus \text{B}$, then
- $\text{C} \oplus \text{0x0000} = \text{C}$
- and $\text{C} \oplus \text{0xFFFF}$ = the 1's complement of $\text{C}$
> **Note:** A, B, C here are all 16-bit binary numbers.
```asm=
# Swap sign bit
# t0: sign_a
# t1: sign_b
# t6: mask (0xFFFF or 0x0000)
# s0: sign_a ^ sign_b
xor s0, t0, t1 # s0 = sign_a ^ sign_b
and s0, s0, t6 # use mask to check swap or not
xor t0, t0, s0
xor t1, t1, s0
```
For example, suppose the mask (t6) is `0xFFFF` and the two fields to be swapped are t0 = `0x1001` and t1 = `0x1011`.
> Note that when the mask is `0xFFFF`, i.e., a swap is needed, $|A| < |B|$, so in this example t0 holds the smaller value.
```
s0 = 0x1001 ^ 0x1011 = 0x0010
s0 & t6 = 0x0010 & 0xFFFF = 0x0010
t0 ^ 0x0010 = 0x1001 ^ 0x0010 = 0x1011
t1 ^ 0x0010 = 0x1011 ^ 0x0010 = 0x1001
```
Another example: the mask is `0x0000`, meaning no swap is needed, with t0 = `0x1100` and t1 = `0x0011`.
```
s0 = 0x1100 ^ 0x0011 = 0x1111
s0 & t6 = 0x1111 & 0x0000 = 0x0000
t0 ^ 0x0000 = 0x1100 ^ 0x0000 = 0x1100
t1 ^ 0x0000 = 0x0011 ^ 0x0000 = 0x0011
```
Repeating the above procedure three times (for the sign, the exponent, and the mantissa) implements the full branchless swap.
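The whole trick also fits in a few lines of C (a sketch of my own; the function and pointer parameters are just for illustration, with `mask` playing the role of t6):
```c=
#include <stdint.h>

/* Branchless conditional swap: if |a| < |b|, swap every extracted field. */
void bf16_maybe_swap(uint16_t a, uint16_t b,
                     uint16_t *sign_a, uint16_t *sign_b,
                     uint16_t *exp_a,  uint16_t *exp_b,
                     uint16_t *mant_a, uint16_t *mant_b) {
    /* mask = all ones when |a| < |b|, all zeros otherwise */
    uint32_t mask = -(uint32_t)((a & 0x7FFF) < (b & 0x7FFF));
    uint32_t d;

    d = (*sign_a ^ *sign_b) & mask;  *sign_a ^= d;  *sign_b ^= d;
    d = (*exp_a  ^ *exp_b)  & mask;  *exp_a  ^= d;  *exp_b  ^= d;
    d = (*mant_a ^ *mant_b) & mask;  *mant_a ^= d;  *mant_b ^= d;
}
```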
Below is the complete RISC-V code of the branchless swap implementation:
```asm=
bf16_add_align_branchless:
# a0: a
# a1: b
# s0: unsigned_a
# s1: unsigned_b
# t6: temp, used as mask
addi t6, x0, 0x7F
slli t6, t6, 8
ori t6, t6, 0xFF # mask = 0x7FFF
and s0, a0, t6 # unsigned_a = a.bits & mask
and s1, a1, t6 # unsigned_b = b.bits & mask
slt t6, s0, s1 # t6 = (|A| < |B|) ? 1 : 0
sub t6, x0, t6 # t6 = (|A| < |B|) ? 0xFFFF : 0x0
bf16_add_branchless_swap_sign:
# swap sign bit
# t0: sign_a
# t1: sign_b
# t6: mask (0xFFFF or 0x0000)
# s0: sign_a ^ sign_b
xor s0, t0, t1 # s0 = sign_a ^ sign_b
and s0, s0, t6 # use mask to check swap or not
xor t0, t0, s0
xor t1, t1, s0
bf16_add_branchless_swap_exponent:
# swap exponent
# t2: exp_a
# t3: exp_b
# t6: mask (0xFFFF or 0x0000)
# s0: exp_a ^ exp_b
xor s0, t2, t3 # s0 = exp_a ^ exp_b
and s0, s0, t6
xor t2, t2, s0
xor t3, t3, s0
bf16_add_branchless_swap_mantissa:
# swap mantissa
# t4: mant_a
# t5: mant_b
# t6: mask (0xFFFF or 0x0000)
# s0: mant_a ^ mant_b
xor s0, t4, t5 # s0 = mant_a ^ mant_b
and s0, s0, t6
xor t4, t4, s0
xor t5, t5, s0
```
In the remaining code, we determine `exp_diff` and align the mantissa of the smaller operand (which, thanks to the swap, is always `mant_b`).
```c=
int16_t exp_diff = exp_a - exp_b;
uint16_t result_sign;
int16_t result_exp;
uint32_t result_mant;
if (exp_diff > 0) {
    result_exp = exp_a;
    if (exp_diff > 8)
        return a;
    mant_b >>= exp_diff;
} else if (exp_diff < 0) {
    result_exp = exp_b;
    if (exp_diff < -8)
        return b;
    mant_a >>= -exp_diff;
} else {
    result_exp = exp_a;
}
}
```
We also want to convert the above code into branchless logic. Because `exp_a` and `mant_a` now always belong to the operand with the larger magnitude, `exp_diff` is never negative and we can **drop** the branch that handles `exp_a` and `mant_a`.
The above code can be simplified as below:
```c=
int16_t exp_diff = exp_a - exp_b;
uint16_t result_sign;
int16_t result_exp;
uint32_t result_mant;
if (exp_diff > 0) {
    result_exp = exp_a;
    if (exp_diff > 8)
        return a;
    mant_b >>= exp_diff;
} else if (exp_diff == 0) {
    result_exp = exp_a;
}
```
> **Note:** `exp_diff` = `exp_a` - `exp_b` $\ge$ 0.
We can use the `srl` instruction to align `mant_b`:
- if `exp_diff` > 8, then return a;
- otherwise (0 <= `exp_diff` <= 8), shift `mant_b` right by `exp_diff` and keep it.
> **Note:** `exp_diff` > 8 means that b is negligible relative to a, thus mant_b can be zeroed out by the mask.
```asm=
# s3: exp_diff
# t6: mask
sub s3, t2, t3 # exp_diff = exp_a - exp_b
addi t6, x0, 8
slt t6, t6, s3
sub t6, x0, t6 # t6 = (8 < exp_diff) ? 0xFFFFFFFF : 0x00000000
xori t6, t6, -1 # t6 = (8 < exp_diff) ? 0x00000000 : 0xFFFFFFFF
# align mant_b
sra t5, t5, s3 # mant_b >>= exp_diff
and t5, t5, t6 # mant_b &= ~mask
```
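For reference, the same masking trick written in C (a sketch assuming 32-bit variables, mirroring the register usage above):
```c=
#include <stdint.h>

/* Branchless alignment of mant_b: shift right by exp_diff and force the result
 * to zero when exp_diff > 8, without taking a branch. */
uint32_t align_mant_b(uint32_t mant_b, int32_t exp_diff) {
    /* keep = 0x00000000 if exp_diff > 8, 0xFFFFFFFF otherwise */
    uint32_t keep = ~((uint32_t)-(exp_diff > 8));
    /* "& 31" mirrors RV32 shift semantics (only the low 5 bits count) */
    return (mant_b >> (exp_diff & 31)) & keep;
}
```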
**Observation:** Logic Simplification
```c=
else {
    if (mant_a >= mant_b) {
        result_sign = sign_a;
        result_mant = mant_a - mant_b;
    } else {
        result_sign = sign_b;
        result_mant = mant_b - mant_a;
    }
    if (!result_mant)
        return BF16_ZERO();
    while (!(result_mant & 0x80)) {
        result_mant <<= 1;
        if (--result_exp <= 0)
            return BF16_ZERO();
    }
}
```
Since we have enforced $|A| \ge |B|$ using the **branchless swap**, this section can be significantly simplified.
```c=
// This conditional block is removed in my approach
if (mant_a >= mant_b) {
    result_sign = sign_a;
    result_mant = mant_a - mant_b;
} else {
    result_sign = sign_b;
    result_mant = mant_b - mant_a;
}
```
In my approach, this condition check block is reduced to a single subtraction: `result_mant = mant_a - mant_b` as $|A| \ge |B|$ is guaranteed.
The above logic is implemented in RISC-V as below:
```asm=
bf16_add_same_sign:
add s3, t4, t5 # result_mant = mant_a + mant_b
andi t2, s3, 0x100 # result & 0x100, check if carry out
beq t2, x0, bf16_add_return
srli s3, s3, 1 # result_mant >>= 1
addi s1, s1, 1 # result_exp += 1
addi t6, x0, 0xFF # check for overflow
bge s1, t6, bf16_add_return_overflow
jal x0, bf16_add_return
bf16_add_diff_sign:
sub s3, t4, t5 # result_mant = mant_a - mant_b
beq s3, x0, bf16_add_return_bf16_zero
bf16_add_normalize_loop: # normalize, maybe can optimize later
andi t6, s3, 0x80
bne t6, x0, bf16_add_return
slli s3, s3, 1 # result_mant <<= 1
addi s1, s1, -1 # result_exp -= 1
blt s1, x0, bf16_add_return_bf16_zero
jal x0, bf16_add_normalize_loop
```
##### Bug Fixed (Guard Bits)
> GitHub: [bf16_add_with_guard_bits](https://github.com/dorobouharu133/ca2025-quizzes/blob/b724ad166d63b936c33c774758720f2d98571cb7/bf16_add.S)
```
// Commit message
Fixed three logic errors in bf16_add improved approach:
1. Initialize result exponent s1 correctly before calculation loop.
2. Left shift mantissa 3 bits to protect precision.
3. Updated overflow mask from 0x100 to 0x800 due to point 2.
```
```
Test Case 3:
1.0000000 + 2^-8 = 1.000000
```
Since the above code faced a precision problem, I preserved 3 bits as guard bits.
```asm=
# t4: mant_a
# t5: mant_b
bf16_add_skip2:
slli t4, t4, 3 # for further rounding (guard bits)
slli t5, t5, 3 # for further rounding (guard bits)
```
Because I preserve 3 extra bits by shifting left by 3, the procedure below also needs to be modified.
```asm=
# t2: mask, to check carry out
# s3: result_mant
addi t2, x0, 0x10
slli t2, t2, 4
and t2, s3, t2 # result & 0x100, check if carry out
```
`0x100` is no longer the right mask for the carry check; we have to test **bit 11** (`0x800`) instead of bit 8 (`0x100`).
> **Reason:** Originally, the implicit 1 (hidden bit) sits at **bit 7**, so a carry-out reaches **bit 8**. After shifting left by 3 bits, the implicit 1 moves to **bit 10**, so adding two such values can carry out into **bit 11**; hence the mask **0x800**.
Since the stored mantissa is 7 bits (bits 0 to 6), after the 3-bit left shift it occupies bits 3 to 9, with the hidden bit at bit 10. We therefore use the mask `0x800` to check whether the sum (including **the hidden bit at bit 10**) carries out into **bit 11**.
```
0x100 -> 0x800
```
```asm=
addi t2, x0, 0x80
slli t2, t2, 4
and t2, s3, t2 # result & 0x800, check if carry out
```
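A C sketch of the same idea (3 guard bits, carry detected at bit 11); this is an illustration only, assuming `mant_a`/`mant_b` already contain the hidden bit and `result_exp` already holds the aligned exponent:
```c=
#include <stdint.h>

/* Same-sign addition with 3 guard bits: after the left shift the hidden bit sits
 * at bit 10, so a carry out of the sum shows up at bit 11 (mask 0x800). */
void add_with_guard_bits(uint16_t mant_a, uint16_t mant_b,
                         uint32_t *result_mant, int16_t *result_exp) {
    uint32_t ga = (uint32_t)mant_a << 3;   /* guard bits */
    uint32_t gb = (uint32_t)mant_b << 3;
    uint32_t sum = ga + gb;

    if (sum & 0x800) {                     /* carry out of bit 10 */
        sum >>= 1;
        (*result_exp)++;
    }
    *result_mant = sum;                    /* guard bits are dropped when packing */
}
```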
#### bf16_add Test Cases
Original version with 5-stage pipeline CPU
```
// My original approach for bf16_add
Cycles: 247
Instrs. retired: 189
CPI: 1.31
IPC: 0.765
Clock rate: 228.28 Hz
```
**Branchless** version with 5-stage pipeline CPU
```
// Branchless
Cycles: 313
Instrs. retired: 265
CPI: 1.18
IPC: 0.847
Clock rate: 333.69 Hz
```
Compiler version with 5-stage pipeline CPU
```
// Compiler
Cycles: 259
Instrs. retired: 207
CPI: 1.25
IPC: 0.799
Clock rate: 254.92 Hz
```
**Performance:** My improved approach is about **1.21x slower** than the compiler-optimized version.
**Code Size:** Compared to the compiler-optimized version, my improved approach reduces the code size by nearly **49%**.
___
### ==bf16_sub==
```asm=
bf16_sub:
addi t0, x0, 0x80
slli t0, t0, 8
addi sp, sp, -8
sw ra, 4(sp)
sw a1, 0(sp)
xor a1, a1, t0 # negate b
jal bf16_add
lw a1, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
jr ra
```
#### bf16_sub Test Cases
In the test of sub, I use 6 test cases.
```
1. `0x7F7F` - `0x0080`, maximal finite positive number - minimal normalized number
2. `0x7FC0` - `0x3F80`, NaN - 1.000000
3. `0x3F80` - `0x3B80`, 1.000000 - 2^(-8)
4. `0x7F80` - `0xFF80`, +inf - (-inf)
5. `0x4040` - `0x4000`, 3.000000 - 2.000000
6. `0x7F7F` - `0xFF7F`, maximal finite positive number - maximal finite negative number
```
#### Bug fixed (Precision)
```
# 1.0000000 - 2^(-8) = 1.000000
```
The correct answer should be `0x3B80`, but my approach gets `0x3B7F`.
In order to get the correct answer, we have to add **the guard bits** to my approach.
> GitHub: [bf16_sub_adjust_exp_diff](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sub/bf16_sub.S)
```
// My approach
Cycles: 586
Instrs. retired: 492
CPI: 1.19
IPC: 0.84
Clock rate: 350.06 Hz
```
> GitHub: [bf16_sub_compiler.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_sub/bf16_sub_compiler.S)
```
// Compiler-optimized
Cycles: 1004
Instrs. retired: 750
CPI: 1.34
IPC: 0.747
Clock rate: 1.06 KHz
```
**Code Size:** Compared to the compiler-optimized version, my improved approach reduces the code size by nearly **53.6%**.
**Performance:** My approach is **1.71x faster** than the compiler-optimized version.
> Because `bf16_sub` is essentially derived from `bf16_add`, an additional `bf16_sub_improved` implementation is not required.
___
### ==bf16_mul==
> GitHub: [bf16_mul.S](https://github.com/dorobouharu133/ca2025-quizzes/tree/main/bf16_mul)
As before, I'll start with an **original version** that acts as a foundation for the enhanced version I will introduce later.
#### bf16_mul original
> GitHub: [bf16_mul_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul_original.S)
#### bf16_mul Test Cases
I chose the following test cases:
```
1. `0x7F7F` x `0x0080`, `0x407F`, maximal finite positive number x minimal normalized number = 3.984375
2. `0x7FC0` x `0x3F80`, `0x7FC0`, NaN x 1.000000 -> NaN
3. `0x7F80` x `0xFF80`, `0x7F80`, +inf x (-inf) -> +inf
4. `0x4040` x `0x4000`, `0x40C0`, 3.000000 x 2.000000 = 6.000000
5. `0x3F80` x `0x3F80`, 1.000000 x 1.000000 = 1.000000
6. `0x0000` x `0x8000`, 0.000000 x (-0.000000) = (-0.000000)
7. `0x7F80` x `0x0000`, +inf x 0.000000 = NaN
8. `0x7F7F` x `0x4000`, maximal finite positive number x 2.000000 = +inf
9. `0x0080` x `0x0080`, minimal finite normalized positive number x minimal finite normalized positive number = 0.000000
10. `0x7FC0` x `0x3F80`, NaN x 1.000000 = NaN
11. `0xFF80` x `0x7F80`, (-inf) x +inf = -inf
12. `0x3F81` x `0x3F81`, 1.007812 x 1.007812 = 1.015625
13. `0xC000` x `0x4040`, -2.000000 x 3.000000 = -6.000000
14. `0x0040` x `0x0040`, denormalized positive number x denormalized positive number = 0.000000
15. `0x3F81` x `0x4000`, 1.007812 x 2.000000 = 2.015625
16. `0x4040` x `0x3F00`, 3.000000 x 0.500000 = 1.500000
17. `0x40A0` x `0x3F00`, 5.000000 x 0.500000 = 2.500000
18. `0x4010` x `0x4010`, 2.250000 x 2.250000 = 5.062500
```
(Run in 5-stage pipeline CPU)
> GitHub: [bf16_mul_original.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul_original.S)
```
// My original approach for bf16_mul
Cycles: 1095
Instrs. retired: 817
CPI: 1.34
IPC: 0.746
Clock rate: 1.22 KHz
```
The above is the execution info of my approach, and below is the execution info of the compiler-optimized version.
(Run in 5-stage pipeline CPU)
> GitHub: [bf16_mul_compiler.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul_compiler.S)
```
// Compiler-optimized bf16_mul
Cycles: 3696
Instrs. retired: 2609
CPI: 1.42
IPC: 0.706
Clock rate: 1.60 KHz
```
**Performance:** My original approach is **3.37x faster** than the compiler-optimized version.
___
#### [bf16_mul_improved](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul.S)
> AI Usage: [bf16_mul improvement thought](https://gemini.google.com/share/60cc3a43aa7d)
```c=
int16_t exp_adjust = 0;
if (!exp_a) {
    while (!(mant_a & 0x80)) {
        mant_a <<= 1;
        exp_adjust--;
    }
    exp_a = 1;
} else
    mant_a |= 0x80;
```
The goal here is to find the leading set bit of the **mantissa** for **normalization**. Instead of using a loop, we can implement a **branchless 8-bit CLZ** to make `bf16_mul` more efficient. Since the mantissa is only 7 bits (8 bits with the implicit 1 included), a small 8-bit CLZ is sufficient.
#### Branchless 8-bit CLZ
**The below C code of 8-bit CLZ is [AI-generated](https://gemini.google.com/share/60cc3a43aa7d).**
```c=
// Gemini
uint32_t x = mant_a;
uint32_t s, shift = 0;
s = (x <= 0xF); shift += (s << 2); x <<= (s << 2);
s = (x <= 0x3F); shift += (s << 1); x <<= (s << 1);
s = (x <= 0x7F); shift += (s << 0); x <<= (s << 0);
// Now shift is the number of Leading Zeros
```
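To double-check the snippet, it can be wrapped as a function and compared against a naive loop (a quick sketch of my own, not part of the submission):
```c=
#include <assert.h>
#include <stdint.h>

/* Branchless count-leading-zeros for an 8-bit value (x != 0 assumed). */
static uint32_t clz8(uint32_t x) {
    uint32_t s, shift = 0;
    s = (x <= 0x0F); shift += s << 2; x <<= s << 2;
    s = (x <= 0x3F); shift += s << 1; x <<= s << 1;
    s = (x <= 0x7F); shift += s;
    return shift;
}

int main(void) {
    for (uint32_t x = 1; x <= 0xFF; x++) {
        uint32_t ref = 0;
        while (!((x << ref) & 0x80))   /* naive reference: shift until bit 7 is set */
            ref++;
        assert(clz8(x) == ref);
    }
    return 0;
}
```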
I implemented the above code as the RISC-V instructions below:
```asm=
bf16_mul_exp_adjust_a: # normalize a
addi s1, t4, 0 # s1 = x = mant_a
addi s2, x0, 0 # s2 = shift
addi s3, x0, 0 # s3 = s
addi t6, x0, 0xF
slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0
slli t6, s3, 2
add s2, s2, t6 # shift += (s << 2)
sll s1, s1, t6 # x <<= (s << 2)
#
addi t6, x0, 0x3F
slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0
slli t6, s3, 1
add s2, s2, t6 # shift += (s << 1)
sll s1, s1, t6 # x <<= (s << 1)
#
addi t6, x0, 0x7F
slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0
add s2, s2, s3 # shift += (s << 0)
# s2: shift = # of leading zeros
# s4: exp_adjust_a
sltiu t6, t2, 1 # t6 = (exp_a == 0) ? 1 : 0
or t2, t2, t6 # exp_a = (exp_a == 0) ? 1 : exp_a
addi t6, t6, -1
xori t6, t6, -1
and s4, s2, t6
and s2, s2, t6
sll t4, t4, s2
ori t4, t4, 0x80 # mant_a |= 0x80
```
Below is my improved code with partial branchless:
```asm=
# a0: a
# a1: b
# t0: sign_a
# t1: sign_b
# t2: exp_a
# t3: exp_b
# t4: mant_a
# t5: mant_b
# t6: temp
# s0: result_sign
bf16_mul:
srli t0, a0, 15
andi t0, t0, 1 # t0 = (a.bits >> 15) & 1
srli t1, a1, 15
andi t1, t1, 1 # t1 = (b.bits >> 15) & 1
srli t2, a0, 7
andi t2, t2, 0xFF # t2 = (a.bits >> 7) & 0xFF
srli t3, a1, 7
andi t3, t3, 0xFF # t3 = (b.bits >> 7) & 0xFF
andi t4, a0, 0x7F # t4 = a.bits & 0x7F
andi t5, a1, 0x7F # t5 = b.bits & 0x7F
xor s0, t0, t1 # result_sign = sign_a ^ sign_b
bf16_mul_check_a_0xFF:
addi t6, x0, 0xFF
bne t2, t6, bf16_mul_check_b_0xFF
bne t4, x0, bf16_mul_return_a
bne t5, x0, bf16_mul_return_pInf
bne t3, x0, bf16_mul_return_pInf
jal x0, bf16_mul_return_NaN
bf16_mul_check_b_0xFF:
addi t6, x0, 0xFF
bne t3, t6, bf16_mul_check_zero
bne t4, x0, bf16_mul_return_b
bne t2, x0, bf16_mul_return_pInf
bne t5, x0, bf16_mul_return_pInf
jal x0, bf16_mul_return_NaN
bf16_mul_check_zero:
or t6, t2, t4
bne t6, x0, bf16_mul_exp_adjust
or t6, t3, t5
bne t6, x0, bf16_mul_exp_adjust
jal x0, bf16_mul_return_zero
bf16_mul_exp_adjust:
bf16_mul_exp_adjust_a: # normalize a
addi s1, t4, 0 # s1 = x = mant_a
addi s2, x0, 0 # s2 = shift
addi s3, x0, 0 # s3 = s
addi t6, x0, 0xF
slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0
slli t6, s3, 2
add s2, s2, t6 # shift += (s << 2)
sll s1, s1, t6 # x <<= (s << 2)
#
addi t6, x0, 0x3F
slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0
slli t6, s3, 1
add s2, s2, t6 # shift += (s << 1)
sll s1, s1, t6 # x <<= (s << 1)
#
addi t6, x0, 0x7F
slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0
add s2, s2, s3 # shift += (s << 0)
# s2: shift = # of leading zeros
# s4: exp_adjust_a
sltiu t6, t2, 1 # t6 = (exp_a == 0) ? 1 : 0
or t2, t2, t6 # exp_a = (exp_a == 0) ? 1 : exp_a
addi t6, t6, -1
xori t6, t6, -1
and s4, s2, t6
and s2, s2, t6
sll t4, t4, s2
ori t4, t4, 0x80 # mant_a |= 0x80
bf16_mul_exp_adjust_b: # normalize b
addi s1, t5, 0 # s1 = x = mant_b
addi s2, x0, 0 # s2 = shift
addi s3, x0, 0 # s3 = s
addi t6, x0, 0xF
slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0
slli t6, s3, 2
add s2, s2, t6 # shift += (s << 2)
sll s1, s1, t6 # x <<= (s << 2)
#
addi t6, x0, 0x3F
slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0
slli t6, s3, 1
add s2, s2, t6 # shift += (s << 1)
sll s1, s1, t6 # x <<= (s << 1)
#
addi t6, x0, 0x7F
slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0
add s2, s2, s3 # shift += (s << 0)
# s2: shift = # of leading zeros
# s5: exp_adjust_b
sltiu t6, t3, 1 # t6 = (exp_b == 0) ? 1 : 0
or t3, t3, t6 # exp_b = (exp_b == 0) ? 1 : exp_b
addi t6, t6, -1
xori t6, t6, -1
and s5, s2, t6
and s2, s2, t6
sll t5, t5, s2
ori t5, t5, 0x80 # mant_b |= 0x80
# s1: exp_adjust = exp_adjust_a + exp_adjust_b
add s1, s4, s5
bf16_mul_normalize:
# s2: result_mant
# s3: result_exp
mul s2, t4, t5 # result_mant = mant_a * mant_b, need to modify later
add s3, t2, t3 # result_exp = exp_a + exp_b
add s3, s3, s5 # result_exp += exp_adjust
addi s3, s3, -127 # result_exp -= BF16_EXP_BIAS
bf16_mul_normalize_check: # branchless later
bf16_mul_normalize_check_result_mant_if:
addi t6, x0, 0x80
slli t6, t6, 8
and t6, s2, t6
beq t6, x0, bf16_mul_normalize_check_result_mant_else
srli t6, s2, 8
andi s2, t6, 0xFF
addi s3, s3, 1
jal x0, bf16_mul_normalize_check_result_exp_if_ge_0xFF
bf16_mul_normalize_check_result_mant_else:
srli t6, s2, 7
andi s2, t6, 0x7F
bf16_mul_normalize_check_result_exp_if_ge_0xFF:
addi t6, x0, 0xFF
bge s3, t6, bf16_mul_return_pInf
bf16_mul_normalize_check_result_exp_if_le_0:
bgt s3, x0, bf16_mul_return
addi t6, x0, -6
bgt s3, t6, bf16_mul_return_zero
addi t6, x0, 1
sub t6, t6, s3 # t6 = 1 - result_exp
sll s2, s2, t6
addi s3, x0, 0
bf16_mul_return:
slli a0, s0, 15
andi s3, s3, 0xFF
slli s3, s3, 7
or a0, a0, s3
andi s2, s2, 0x7F
or a0, a0, s2
jr ra
bf16_mul_return_a:
jr ra
bf16_mul_return_b:
addi a0, a1, 0
jr ra
bf16_mul_return_pInf:
addi a0, s0, 0
slli a0, a0, 15
addi t6, x0, 0x7F
slli t6, t6, 8
ori t6, t6, 0x80
or a0, a0, t6
jr ra
bf16_mul_return_NaN:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0xC0
jr ra
bf16_mul_return_zero:
addi a0, s0, 0
slli a0, a0, 15
jr ra
```
I’ve finished converting the first section of the logic to be branchless. Next, I’ll apply the same optimization to the remaining parts of the code.
> AI Usage: [bf16_mul test cases & debug](https://gemini.google.com/share/7f33de0eeaf6)
**Debug Update:**
Found an error in the BF16 test data. For $2.25 \times 2.25$, the precise result is $5.0625$, which rounds to `0x402A` in BF16. The initial expected value of `0x4020` was a typo. The hardware implementation is actually correct; no precision fix is needed.
```c=
uint32_t result_mant = (uint32_t) mant_a * mant_b;
int32_t result_exp = (int32_t) exp_a + exp_b - BF16_EXP_BIAS + exp_adjust;
if (result_mant & 0x8000) {
    result_mant = (result_mant >> 8) & 0x7F;
    result_exp++;
} else
    result_mant = (result_mant >> 7) & 0x7F;
```
The concept of the above code is pretty simple: we want to check whether `result_mant` occupies 15 bits or 16 bits.
- 16 bits means the product `mant_a * mant_b` carried out, so `result_mant & 0x8000` is non-zero (the mask `0x8000` tests bit 15).
- 15 bits means there was no carry, so `result_mant & 0x8000` is 0.
Then we shift `result_mant` right by 8 (or 7) bits to truncate it and keep the upper bits.
Note that if `result_mant` is 16 bits wide, it has to be renormalized, so `result_exp` is incremented by 1.
I implement the above logic in RISC-V instructions.
Compared with the several instructions and the branch in the original approach, this section becomes quite simple.
Since we only need to know whether **result_mant is 16 bits wide**, a **logical right shift by 15 bits** extracts bit 15. Because the shift amount applied to `result_mant` depends only on that bit, we can compute it directly:
- `result_mant >> 15` == 1: shift right by `7 + (result_mant >> 15)` = 8 bits
- `result_mant >> 15` == 0: shift right by `7 + (result_mant >> 15)` = 7 bits
`result_exp` can also be updated directly: `result_exp += (result_mant >> 15)`.
> **Note:** `result_mant >> 15` here is equivalent to `result_mant & 0x8000` being non-zero.
```asm=
bf16_mul_normalize:
# s2: result_mant
# s3: result_exp
mul s2, t4, t5 # result_mant = mant_a * mant_b, need to modify later
add s3, t2, t3 # result_exp = exp_a + exp_b
sub s3, s3, s5 # result_exp - exp_adjust_b
sub s3, s3, s4 # result_exp - exp_adjust_a - exp_adjust_b
addi s3, s3, -127 # result_exp -= BF16_EXP_BIAS
bf16_mul_normalize_check:
srli t6, s2, 15
addi t0, t6, 7 # t0 = (result_mant == 1) ? 8 : 7, shift_amount
srl s2, s2, t0 # result_mant >>= (shift_amount)
add s3, s3, t6 # result_exp += t6
```
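The same check reads naturally in C as well (a sketch; as in the assembly, the final `& 0x7F` is applied later when the result is packed):
```c=
#include <stdint.h>

/* Branchless normalization of the 8x8-bit mantissa product: if bit 15 is set,
 * shift by 8 and bump the exponent; otherwise shift by 7. */
void normalize_product(uint32_t *result_mant, int32_t *result_exp) {
    uint32_t carry = (*result_mant >> 15) & 1;   /* same as result_mant & 0x8000 */
    *result_mant >>= 7 + carry;
    *result_exp += (int32_t)carry;
}
```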
The next step is to replace `mul` instruction with the [**shift-and-add**](##==bf16_sqrt_shift_add==) method used in bf16_sqrt.
Below is the complete code of bf16_mul:
```asm=
.global bf16_mul
.global shift_and_add
# a0: a
# a1: b
# t0: sign_a
# t1: sign_b
# t2: exp_a
# t3: exp_b
# t4: mant_a
# t5: mant_b
# t6: temp
# s0: result_sign
bf16_mul:
srli t0, a0, 15
andi t0, t0, 1 # t0 = (a.bits >> 15) & 1
srli t1, a1, 15
andi t1, t1, 1 # t1 = (b.bits >> 15) & 1
srli t2, a0, 7
andi t2, t2, 0xFF # t2 = (a.bits >> 7) & 0xFF
srli t3, a1, 7
andi t3, t3, 0xFF # t3 = (b.bits >> 7) & 0xFF
andi t4, a0, 0x7F # t4 = a.bits & 0x7F
andi t5, a1, 0x7F # t5 = b.bits & 0x7F
xor s0, t0, t1 # result_sign = sign_a ^ sign_b
bf16_mul_check_a_0xFF:
addi t6, x0, 0xFF
bne t2, t6, bf16_mul_check_b_0xFF
bne t4, x0, bf16_mul_return_a
bne t5, x0, bf16_mul_return_pInf
bne t3, x0, bf16_mul_return_pInf
jal x0, bf16_mul_return_NaN
bf16_mul_check_b_0xFF:
addi t6, x0, 0xFF
bne t3, t6, bf16_mul_check_zero
bne t4, x0, bf16_mul_return_b
bne t2, x0, bf16_mul_return_pInf
bne t5, x0, bf16_mul_return_pInf
jal x0, bf16_mul_return_NaN
bf16_mul_check_zero:
or t6, t2, t4
bne t6, x0, bf16_mul_exp_adjust
or t6, t3, t5
bne t6, x0, bf16_mul_exp_adjust
jal x0, bf16_mul_return_zero
bf16_mul_exp_adjust:
bf16_mul_exp_adjust_a: # normalize a
addi s1, t4, 0 # s1 = x = mant_a
addi s2, x0, 0 # s2 = shift
addi s3, x0, 0 # s3 = s
addi t6, x0, 0xF
slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0
slli t6, s3, 2
add s2, s2, t6 # shift += (s << 2)
sll s1, s1, t6 # x <<= (s << 2)
#
addi t6, x0, 0x3F
slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0
slli t6, s3, 1
add s2, s2, t6 # shift += (s << 1)
sll s1, s1, t6 # x <<= (s << 1)
#
addi t6, x0, 0x7F
slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0
add s2, s2, s3 # shift += (s << 0)
# s2: shift = # of leading zeros
# s4: exp_adjust_a
sltiu t6, t2, 1 # t6 = (exp_a == 0) ? 1 : 0
or t2, t2, t6 # exp_a = (exp_a == 0) ? 1 : exp_a
addi t6, t6, -1
xori t6, t6, -1
and s4, s2, t6
and s2, s2, t6
sll t4, t4, s2
ori t4, t4, 0x80 # mant_a |= 0x80
bf16_mul_exp_adjust_b: # normalize b
addi s1, t5, 0 # s1 = x = mant_b
addi s2, x0, 0 # s2 = shift
addi s3, x0, 0 # s3 = s
addi t6, x0, 0xF
slt s3, t6, s1 # s3 = (x > 0xF) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0xF) ? 1 : 0
slli t6, s3, 2
add s2, s2, t6 # shift += (s << 2)
sll s1, s1, t6 # x <<= (s << 2)
#
addi t6, x0, 0x3F
slt s3, t6, s1 # s3 = (x > 0x3F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x3F) ? 1 : 0
slli t6, s3, 1
add s2, s2, t6 # shift += (s << 1)
sll s1, s1, t6 # x <<= (s << 1)
#
addi t6, x0, 0x7F
slt s3, t6, s1 # s3 = (x > 0x7F) ? 1 : 0
xori s3, s3, 1 # s3 = (x <= 0x7F) ? 1 : 0
add s2, s2, s3 # shift += (s << 0)
# s2: shift = # of leading zeros
# s5: exp_adjust_b
sltiu t6, t3, 1 # t6 = (exp_b == 0) ? 1 : 0
or t3, t3, t6 # exp_b = (exp_b == 0) ? 1 : exp_b
addi t6, t6, -1
xori t6, t6, -1
and s5, s2, t6
and s2, s2, t6
sll t5, t5, s2
ori t5, t5, 0x80 # mant_b |= 0x80
# s1: exp_adjust = exp_adjust_a + exp_adjust_b
add s1, s4, s5
bf16_mul_normalize:
# s2: result_mant
# s3: result_exp
# mul s2, t4, t5 # result_mant = mant_a * mant_b
addi sp, sp, -12
sw ra, 0(sp)
sw a0, 4(sp)
sw a1, 8(sp)
addi a0, t4, 0
addi a1, t5, 0
jal ra, shift_and_add
addi s2, a0, 0
lw ra, 0(sp)
lw a0, 4(sp)
lw a1, 8(sp)
addi sp, sp, 12
add s3, t2, t3 # result_exp = exp_a + exp_b
sub s3, s3, s5 # result_exp - exp_adjust_b
sub s3, s3, s4 # result_exp - exp_adjust_a - exp_adjust_b
addi s3, s3, -127 # result_exp -= BF16_EXP_BIAS
bf16_mul_normalize_check:
srli t6, s2, 15
addi t0, t6, 7 # t0 = (result_mant == 1) ? 8 : 7, shift_amount
srl s2, s2, t0 # result_mant >>= (shift_amount)
add s3, s3, t6 # result_exp += t6
bf16_mul_normalize_check_result_exp_if_ge_0xFF:
addi t6, x0, 0xFF
bge s3, t6, bf16_mul_return_pInf
bf16_mul_normalize_check_result_exp_if_le_0:
bgt s3, x0, bf16_mul_return
addi t6, x0, -6
blt s3, t6, bf16_mul_return_zero
addi t6, x0, 1
sub t6, t6, s3 # t6 = 1 - result_exp
srl s2, s2, t6
addi s3, x0, 0
bf16_mul_return:
slli a0, s0, 15
andi s3, s3, 0xFF
slli s3, s3, 7
or a0, a0, s3
andi s2, s2, 0x7F
or a0, a0, s2
jr ra
bf16_mul_return_a:
jr ra
bf16_mul_return_b:
addi a0, a1, 0
jr ra
bf16_mul_return_pInf:
addi a0, s0, 0
slli a0, a0, 15
addi t6, x0, 0x7F
slli t6, t6, 8
ori t6, t6, 0x80
or a0, a0, t6
jr ra
bf16_mul_return_NaN:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0xC0
jr ra
bf16_mul_return_zero:
addi a0, s0, 0
slli a0, a0, 15
jr ra
# a0: multiplicand
# a1: multiplier
# t0: accumulator
# t1: multiplier_copy
# t2: bit0
shift_and_add:
addi sp, sp, -12
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
addi t0, x0, 0
addi t1, a1, 0
andi t2, a1, 1
shift_and_add_for_loop:
andi t2, t1, 1
beq t2, x0, shift_and_add_for_skip_add
add t0, t0, a0
shift_and_add_for_skip_add:
slli a0, a0, 1
srli t1, t1, 1
bne t1, x0, shift_and_add_for_loop
shift_and_add_end:
addi a1, t0, 0
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
addi sp, sp, 12
addi a0, a1, 0
jr ra
```
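For reference, the `shift_and_add` helper above is the classic software multiplication loop (used here presumably because the target is RV32I without the M extension). A C sketch of the same idea, not taken from the repository:
```c=
#include <stdint.h>

/* C sketch of the shift_and_add routine above:
 * classic shift-and-add multiplication of two unsigned integers. */
uint32_t shift_and_add(uint32_t multiplicand, uint32_t multiplier)
{
    uint32_t acc = 0;
    while (multiplier != 0) {
        if (multiplier & 1)         /* add the multiplicand where a bit is set */
            acc += multiplicand;
        multiplicand <<= 1;         /* move to the next bit weight */
        multiplier >>= 1;
    }
    return acc;
}
```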
#### bf16_mul Test Cases
> GitHub: [bf16_mul.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_mul/bf16_mul.S)
```
Cycles: 2715
Instrs. retired: 2141
CPI: 1.27
IPC: 0.789
Clock rate: 2.26 KHz
```
**Performance:** My improved approach is **1.36x faster** than the compiler-optimized version.
**Code Size:** Compared to the compiler-optimized version, my approach **reduces** the code size by nearly **39%**.
___
### ==bf16_div==
#### bf16_div original
> GitHub: [bf16_div original approach](https://github.com/dorobouharu133/ca2025-quizzes/blob/ddfb96a8105e4fa1adb2c97fd85c5c548076904b/bf16_div.S)
#### bf16_div Test Cases
```
1. `0x7F7F` / `0x0080`, maximal finite positive number / minimal normalized number
2. `0x7FC0` / `0x3F80`, NaN / 1.000000
3. `0x7F80` / `0xFF80`, +inf / (-inf)
4. `0x4040` / `0x4000`, 3.000000 / 2.000000
```
```
// Original Approach
Cycles: 643
Instrs. retired: 465
CPI: 1.38
IPC: 0.723
Clock rate: 595.37 Hz
```
The block above shows the execution info of my approach, and the block below shows the compiler-optimized version.
```
// Compiler
Cycles: 1479
Instrs. retired: 1028
CPI: 1.44
IPC: 0.695
Clock rate: 1.91 KHz
```
My original approach is **2.3x faster** than the compiler's version.
___
#### bf16_div_improved
There are two loops in the original approach, and both can be improved.
```c=
for (int i = 0; i < 16; i++) {
quotient <<= 1;
if (dividend >= (divisor << (15 - i))) {
dividend -= (divisor << (15 - i));
quotient |= 1;
}
}
```
In my original approach, I translated it line-by-line:
```asm=
# Original approach
bf16_div_division:
# t6: loop times
# s1: dividend
# s2: divisor
# s3: quotient
# s4: counter
slli s1, t4, 15 # dividend = mant_a << 15
addi s2, t5, 0 # divisor = mant_b
addi s3, x0, 0 # quotient = 0
addi t6, x0, 16 # loop times = 16
addi s4, x0, 0 # counter = 0
bf16_div_division_loop:
bge s4, t6, bf16_div_division_loop_end
slli s3, s3, 1 # quotient <<= 1
addi t4, x0, 15
sub t4, t4, s4 # t4 = 15 - i
sll t4, s2, t4 # temp = divisor << (15 - i)
blt s1, t4, bf16_div_division_loop_sub_skip
bf16_div_division_loop_sub:
sub s1, s1, t4 # dividend -= (divisor << (15 - i))
ori s3, s3, 1 # quotient |= 1
bf16_div_division_loop_sub_skip:
addi s4, s4, 1 # i++
jal x0, bf16_div_division_loop
```
However, this loop is simply performing a bit-by-bit (shift-and-subtract) division of `mant_a << 15` by `mant_b`; I come back to it in the *division improvement* section below.
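To convince myself, here is a quick sanity check (a sketch, not code from the repository) that, for normalized mantissas, the shift-and-subtract loop produces exactly `(mant_a << 15) / mant_b`:
```c=
#include <assert.h>
#include <stdint.h>

/* The loop from the original approach, copied as a helper. */
static uint16_t div_loop(uint32_t mant_a, uint32_t mant_b)
{
    uint32_t dividend = mant_a << 15;
    uint32_t quotient = 0;
    for (int i = 0; i < 16; i++) {
        quotient <<= 1;
        if (dividend >= (mant_b << (15 - i))) {
            dividend -= (mant_b << (15 - i));
            quotient |= 1;
        }
    }
    return (uint16_t)quotient;
}

int main(void)
{
    /* Normalized mantissas are 0x80..0xFF, so the quotient fits in 16 bits. */
    for (uint32_t a = 0x80; a <= 0xFF; a++)
        for (uint32_t b = 0x80; b <= 0xFF; b++)
            assert(div_loop(a, b) == (a << 15) / b);
    return 0;
}
```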
---
```c=
while (!(quotient & 0x8000) && result_exp > 1) {
quotient <<= 1;
result_exp--;
}
quotient >>= 8;
```
The purpose of the above code is to normalize the quotient so that its leading 1 ends up in bit 15 before the final `quotient >>= 8`.
```asm=
# Original approach
bf16_div_normalize_quotient_while_loop:
addi t4, x0, 0x80
slli t4, t4, 8 # t4 = 0x8000
and t4, s3, t4 # t4 = quotient & 0x8000
bne t4, x0, bf16_div_normalize_quotient_while_loop_end
addi t6, x0, 1
ble s5, t6, bf16_div_normalize_quotient_while_loop_end
slli s3, s3, 1 # quotient <<= 1
addi s5, s5, -1 # result_exp -= 1
jal x0, bf16_div_normalize_quotient_while_loop
bf16_div_normalize_quotient_while_loop_end:
srli s3, s3, 8 # quotient >>= 8
```
Just like what we did before, **this while loop performs a CLZ and normalizes the quotient**.
Note that **result_exp must remain greater than or equal to one at the end of the loop**, because the loop only shifts while `result_exp > 1`.
Therefore, we can reuse the [**CLZ concept**](#Branchless-8-bit-CLZ) from the previous section.
```c=
// Gemini
uint32_t x = mant_a;
uint32_t s, shift = 0;
s = (x <= 0xF); shift += (s << 2); x <<= (s << 2);
s = (x <= 0x3F); shift += (s << 1); x <<= (s << 1);
s = (x <= 0x7F); shift += (s << 0); x <<= (s << 0);
// Now shift is the number of Leading Zeros
```
**Observe the bit masks:**
```
# 8-bit clz
0xF = 0000 1111
0x3F = 0011 1111
0x7F = 0111 1111
```
We can observe that **clz always divides the remaining range into two parts**. At each step, it checks the upper half and then repeats the same operation on the selected half.
Based on this observation, I extended the logic to implement a 16-bit branchless `clz`.
**16-bit branchless clz mask:**
```
# 16-bit clz
# divided into 2 parts
0000 0000 1111 1111 -> 0x00FF
0000 1111 1111 1111 -> 0x0FFF
0011 1111 1111 1111 -> 0x3FFF
0111 1111 1111 1111 -> 0x7FFF
```
> **Note:** Since $16 = 2^{4}$, the 16-bit CLZ needs four compare-and-shift steps, and the largest single-step shift is $2^{3} = 8$ bits.
> GitHub: [Branchless 16-bit CLZ](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_16_bit_clz.c)
So we can implement it in C:
```c=
// My own approach
uint32_t x = mant_a;
uint32_t s, shift = 0;
// 0000 0000 1111 1111 -> 0x00FF
// 0000 1111 1111 1111 -> 0x0FFF
// 0011 1111 1111 1111 -> 0x3FFF
// 0111 1111 1111 1111 -> 0x7FFF
s = (x <= 0x00FF); shift += (s << 3); x <<= (s << 3);
s = (x <= 0x0FFF); shift += (s << 2); x <<= (s << 2);
s = (x <= 0x3FFF); shift += (s << 1); x <<= (s << 1);
s = (x <= 0x7FFF); shift += (s << 0); x <<= (s << 0);
// Now shift is the number of Leading Zeros
```
> **Note:** Validation code is included in [GitHub](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_16_bit_clz.c).
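As a quick cross-check of the 16-bit cascade (a sketch using GCC/Clang's `__builtin_clz`; this is separate from the validation code linked above):
```c=
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* The four-step cascade from the block above. */
static uint32_t clz16(uint32_t x)
{
    uint32_t s, shift = 0;
    s = (x <= 0x00FF); shift += (s << 3); x <<= (s << 3);
    s = (x <= 0x0FFF); shift += (s << 2); x <<= (s << 2);
    s = (x <= 0x3FFF); shift += (s << 1); x <<= (s << 1);
    s = (x <= 0x7FFF); shift += (s << 0); x <<= (s << 0);
    return shift;
}

int main(void)
{
    /* v == 0 is excluded: __builtin_clz(0) is undefined. */
    for (uint32_t v = 1; v <= 0xFFFF; v++)
        assert(clz16(v) == (uint32_t)__builtin_clz(v) - 16);
    puts("16-bit CLZ matches __builtin_clz");
    return 0;
}
```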
> GitHub: [Branchless 32-bit CLZ](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_32_bit_clz.c)
```c=
// My own approach
uint32_t x = mant_a;
uint32_t s, shift = 0;
// 0000 0000 0000 0000 1111 1111 1111 1111 -> 0x0000FFFF
// 0000 0000 1111 1111 1111 1111 1111 1111 -> 0x00FFFFFF
// 0000 1111 1111 1111 1111 1111 1111 1111 -> 0x0FFFFFFF
// 0011 1111 1111 1111 1111 1111 1111 1111 -> 0x3FFFFFFF
// 0111 1111 1111 1111 1111 1111 1111 1111 -> 0x7FFFFFFF
s = (x <= 0x0000FFFF); shift += (s << 4); x <<= (s << 4);
s = (x <= 0x00FFFFFF); shift += (s << 3); x <<= (s << 3);
s = (x <= 0x0FFFFFFF); shift += (s << 2); x <<= (s << 2);
s = (x <= 0x3FFFFFFF); shift += (s << 1); x <<= (s << 1);
s = (x <= 0x7FFFFFFF); shift += (s << 0); x <<= (s << 0);
// Now shift is the number of Leading Zeros
```
> **Note:** Validation code is included in [GitHub](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_32_bit_clz.c).
**Note:** Although the original branchless 8-bit CLZ implementation was AI-generated, **extending it to 16-bit and 32-bit was my own work**.
Although I also developed a 32-bit branchless CLZ, the 16-bit version is the one used in this case, since the **quotient is 16-bit**.
```c=
while (!(quotient & 0x8000) && result_exp > 1) {
quotient <<= 1;
result_exp--;
}
```
I will now explain how I improved this logic.
Since the purpose of this code segment is to normalize the quotient,
we can use the **branchless clz** instead of a while loop that searches for the leading 1 iteratively.
Also note the second condition, `result_exp > 1`: it means the maximum allowed shift is **result_exp - 1**. Therefore, we have to compare the number of leading zeros with `result_exp - 1` and shift by the smaller of the two (see the C sketch after the list below).
That is:
- if (result_exp - 1) < the number of leading zeros,
- result_exp = 1
- quotient <<= (result_exp - 1)
- if (result_exp - 1) >= the number of leading zeros,
- result_exp -= the number of leading zeros
- quotient <<= the number of leading zeros
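A compact C sketch of this selection (hypothetical helper and variable names, not the repository code); it shifts by `min(leading zeros, result_exp - 1)` so that `result_exp` never drops below 1:
```c=
#include <stdint.h>

/* Sketch of the improved normalization step for a 16-bit quotient. */
void normalize_quotient(uint32_t *quotient, int32_t *result_exp)
{
    /* 16-bit branchless CLZ, same cascade as above */
    uint32_t x = *quotient, s, lz = 0;
    s = (x <= 0x00FF); lz += (s << 3); x <<= (s << 3);
    s = (x <= 0x0FFF); lz += (s << 2); x <<= (s << 2);
    s = (x <= 0x3FFF); lz += (s << 1); x <<= (s << 1);
    s = (x <= 0x7FFF); lz += (s << 0); x <<= (s << 0);

    uint32_t limit = (*result_exp > 1) ? (uint32_t)(*result_exp - 1) : 0;
    uint32_t shift = (lz < limit) ? lz : limit;   /* min(lz, result_exp - 1) */

    *quotient   <<= shift;
    *result_exp  -= (int32_t)shift;
}
```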
> GitHub: [bf16_div.S](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_div/bf16_div.S)
```asm=
.data
newline: .string "\n"
pass_msg: .string "All tests passed!"
fail_msg: .string "Tests failed!"
.text
.globl bf16_div
.globl branchless_16_bit_clz
.globl main
main:
# maximal finite positive number / minimal normalized number = +inf (overflow)
test_1:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0x7F
addi a1, x0, 0x80
jal ra, bf16_div
addi s0, a0, 0
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0x80
bne s0, t0, test_failed
test_passed:
la a0, pass_msg
addi a7, x0, 4
ecall
jal x0, end
test_failed:
la a0, fail_msg
addi a7, x0, 4
ecall
end:
la a0, newline
addi a7, x0, 4
ecall
addi a7, x0, 10
ecall
# a0: a
# a1: b
# t0: sign_a
# t1: sign_b
# t2: exp_a
# t3: exp_b
# t4: mant_a
# t5: mant_b
# t6: temp
# s0: result_sign
bf16_div:
srli t0, a0, 15
andi t0, t0, 1 # t0 = (a.bits >> 15) & 1, sign_a
srli t1, a1, 15
andi t1, t1, 1 # t1 = (b.bits >> 15) & 1, sign_b
srli t2, a0, 7
andi t2, t2, 0xFF # t2 = (a.bits >> 7) & 0xFF, exp_a
srli t3, a1, 7
andi t3, t3, 0xFF # t3 = (b.bits >> 7) & 0xFF, exp_b
andi t4, a0, 0x7F # t4 = a.bits & 0x7F, mant_a
andi t5, a1, 0x7F # t5 = b.bits & 0x7F, mant_b
xor s0, t0, t1 # result_sign = sign_a ^ sign_b
bf16_div_check_exp_b_0xFF: # maybe optimize later
addi t6, x0, 0xFF
bne t3, t6, bf16_div_check_b_NaN_inf
bne t5, x0, bf16_div_return_b
bne t2, t6, bf16_div_return_signed_zero
bne t4, x0, bf16_div_return_signed_zero
jal x0, bf16_div_return_NaN
bf16_div_check_b_NaN_inf: # maybe optimize later
bne t3, x0, bf16_div_check_exp_a_0xFF
bne t5, x0, bf16_div_check_exp_a_0xFF
beq t2, x0, bf16_div_return_NaN
beq t4, x0, bf16_div_return_NaN
jal x0, bf16_div_return_signed_zero
bf16_div_check_exp_a_0xFF: # maybe optimize later
bne t2, t6, bf16_div_check_a_NaN_inf
bne t4, x0, bf16_div_return_a
jal x0, bf16_div_return_signed_inf
bf16_div_check_a_NaN_inf: # maybe optimize later
bne t2, x0, bf16_div_a_explicit_one
bne t4, x0, bf16_div_a_explicit_one
jal x0, bf16_div_return_signed_zero
bf16_div_a_explicit_one:
beq t2, x0, bf16_div_b_explicit_one
ori t4, t4, 0x80 # mant_a |= 0x80
bf16_div_b_explicit_one:
beq t3, x0, bf16_div_division
ori t5, t5, 0x80 # mant_b |= 0x80
bf16_div_division: # should be optimized
# t6: loop times
# s1: dividend
# s2: divisor
# s3: quotient
# s4: counter
slli s1, t4, 15 # dividend = mant_a << 15
addi s2, t5, 0 # divisor = mant_b
addi s3, x0, 0 # quotient = 0
addi t6, x0, 16 # loop times = 16
addi s4, x0, 0 # counter = 0
bf16_div_division_loop:
bge s4, t6, bf16_div_division_loop_end
slli s3, s3, 1 # quotient <<= 1
addi t4, x0, 15
sub t4, t4, s4 # t4 = 15 - i
sll t4, s2, t4 # temp = divisor << (15 - i)
blt s1, t4, bf16_div_division_loop_sub_skip
bf16_div_division_loop_sub:
sub s1, s1, t4 # dividend -= (divisor << (15 - i))
ori s3, s3, 1 # quotient |= 1
bf16_div_division_loop_sub_skip:
addi s4, s4, 1 # i++
jal x0, bf16_div_division_loop
bf16_div_division_loop_end:
# s5: result_exp
add s5, t2, x0 # result_exp = exp_a
sub s5, s5, t3 # result_exp -= exp_b
addi s5, s5, 127 # result_exp += bias (127)
bf16_div_normalize_exp_a:
bne t2, x0, bf16_div_normalize_exp_b
addi s5, s5, -1
bf16_div_normalize_exp_b:
bne t3, x0, bf16_div_normalize_quotient_if
addi s5, s5, 1
bf16_div_normalize_quotient_if:
addi t6, x0, 0x80
slli t6, t6, 8 # t6 = 0x8000
and t4, s3, t6 # t4 = quotient & 0x8000
beq t4, x0, bf16_div_normalize_quotient_else
srli s3, s3, 8 # quotient >>= 8
jal x0, bf16_div_normalize_quotient_end
bf16_div_normalize_quotient_else:
bf16_div_normalize_quotient_clz:
addi sp, sp, -8
sw ra, 0(sp)
sw a0, 4(sp)
addi a0, s3, 0
jal ra, branchless_16_bit_clz
addi t6, a0, 0 # t6 = # of leading zeros
lw ra, 0(sp)
lw a0, 4(sp)
addi sp, sp, 8
addi t4, s5, -1 # t4 = result_exp - 1
bge t6, t4, bf16_div_normalize_quotient_clz_sub_result_exp_minus_1 # optimize to whole branchless later
bf16_div_normalize_quotient_clz_sub_clz:
sub s5, s5, t6 # result_exp -= leading_zeros
sll s3, s3, t6 # quotient <<= leading_zeros
jal x0, bf16_div_normalize_quotient_clz_end
bf16_div_normalize_quotient_clz_sub_result_exp_minus_1:
addi s5, x0, 1
sll s3, s3, t4
bf16_div_normalize_quotient_clz_end:
srli s3, s3, 8 # quotient >>= 8
bf16_div_normalize_quotient_end:
andi s3, s3, 0x7F # quotient & 0x7F
bf16_div_return:
addi t6, x0, 0xFF
bge s5, t6, bf16_div_return_signed_inf
ble s5, x0, bf16_div_return_signed_zero
slli a0, s0, 15 # result_sign << 15
andi s5, s5, 0xFF
slli s5, s5, 7 # result_exp << 7
or a0, a0, s5
andi s3, s3, 0x7F
or a0, a0, s3
jr ra
bf16_div_return_a:
jr ra
bf16_div_return_b:
addi a0, a1, 0
jr ra
bf16_div_return_signed_zero:
slli a0, s0, 15 # result_sign << 15
jr ra
bf16_div_return_signed_inf:
slli a0, s0, 15 # result_sign << 15
addi t6, x0, 0x7F
slli t6, t6, 8
ori t6, t6, 0x80
or a0, a0, t6 # return Inf
jr ra
bf16_div_return_NaN:
addi a0, x0, 0x7F
slli a0, a0, 8
ori a0, a0, 0xC0
jr ra
# a0: mant_a
# t0: x
# t1: s
# t2: shift
# t3: temp
# t4: temp2
branchless_16_bit_clz:
addi sp, sp, -20
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
sw t3, 12(sp)
sw t4, 16(sp)
addi t0, a0, 0 # x = mant_a
addi t1, x0, 0 # s = 0
addi t2, x0, 0 # shift = 0
addi t3, x0, 0x00FF
slt t1, t3, t0 # s = (x > 0x00FF) ? 1 : 0
xori t1, t1, 1 # s = (x <= 0x00FF) ? 1 : 0
slli t4, t1, 3
add t2, t2, t4 # shift += (s << 3)
sll t0, t0, t4 # x <<= (s << 3)
addi t3, x0, 0xF
slli t3, t3, 8
ori t3, t3, 0xFF
slt t1, t3, t0 # s = (x > 0x0FFF) ? 1 : 0
xori t1, t1, 1 # s = (x <= 0x0FFF) ? 1 : 0
slli t4, t1, 2
add t2, t2, t4 # shift += (s << 2)
sll t0, t0, t4 # x <<= (s << 2)
addi t3, x0, 0x3F
slli t3, t3, 8
ori t3, t3, 0xFF
slt t1, t3, t0 # s = (x > 0x3FFF) ? 1 : 0
xori t1, t1, 1 # s = (x <= 0x3FFF) ? 1 : 0
slli t4, t1, 1
add t2, t2, t4 # shift += (s << 1)
sll t0, t0, t4 # x <<= (s << 1)
addi t3, x0, 0x7F
slli t3, t3, 8
ori t3, t3, 0xFF
slt t1, t3, t0 # s = (x > 0x7FFF) ? 1 : 0
xori t1, t1, 1 # s = (x <= 0x7FFF) ? 1 : 0
slli t4, t1, 0
add t2, t2, t4 # shift += (s << 0)
sll t0, t0, t4 # x <<= (s << 0)
addi a0, t2, 0 # return shift
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
lw t3, 12(sp)
lw t4, 16(sp)
addi sp, sp, 20
jr ra
```
#### bf16_div Test Cases
> GitHub: [bf16_div_test](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_div/bf16_div_test.S)
```
// Improved Approach
Cycles: 643
Instrs. retired: 465
CPI: 1.38
IPC: 0.723
Clock rate: 912.06 Hz
```
**Performance:** My improved approach is about **2.3x faster** than the compiler-optimized version.
**Code Size:** Compared to the compiler-optimized version, my improved approach reduces the code size by nearly **36%**.
#### division improvement
```c=
for (int i = 0; i < 16; i++) {
quotient <<= 1;
if (dividend >= (divisor << (15 - i))) {
dividend -= (divisor << (15 - i));
quotient |= 1;
}
}
```
**Observe:** In my previous improved approach, the divisor is shifted by a different amount, `(15 - i)`, in every iteration; this logic can be simplified as below.
```c=
for (int i = 0; i < 16; i++) {
quotient <<= 1;
dividend <<= 1;
if (dividend >= divisor) {
dividend -= divisor;
quotient |= 1;
}
}
```
We can shift the dividend left in each iteration instead of shifting the divisor. This allows us to **avoid calculating** `(15 - i)` and performing a variable-amount shift.
```asm=
bf16_div_division: # should be optimized
# t6: loop times
# s1: dividend
# s2: divisor
# s3: quotient
# s4: counter
slli s1, t4, 15 # dividend = mant_a << 15
addi s2, t5, 0 # divisor = mant_b
addi s3, x0, 0 # quotient = 0
addi t6, x0, 16 # loop times = 16
addi s4, x0, 0 # counter = 0
bf16_div_division_loop:
bge s4, t6, bf16_div_division_loop_end
slli s3, s3, 1 # quotient <<= 1
addi t4, x0, 15
sub t4, t4, s4 # t4 = 15 - i
sll t4, s2, t4 # temp = divisor << (15 - i)
blt s1, t4, bf16_div_division_loop_sub_skip
bf16_div_division_loop_sub:
sub s1, s1, t4 # dividend -= (divisor << (15 - i))
ori s3, s3, 1 # quotient |= 1
bf16_div_division_loop_sub_skip:
addi s4, s4, 1 # i++
jal x0, bf16_div_division_loop
```
In order to shift the dividend, we have to adjust the initialization for the dividend and the divisor.
- The dividend: Initialize with `mant_a`.
> We shift left **gradually inside the loop** instead of shifting 15 bits **all at once**.
- The divisor: Initialize with `mant_b << 1`.
> Since the dividend gets shifted **left** at the top of the loop body, we have to pre-shift the divisor by 1 bit; otherwise the comparison **would be off** (a C cross-check follows the assembly below).
```asm=
bf16_div_division:
# t6: loop times
# s1: dividend
# s2: divisor
# s3: quotient
# s4: counter
addi s1, t4, 0 # dividend = mant_a
slli s2, t5, 1 # divisor = mant_b << 1
addi s3, x0, 0 # quotient = 0
addi t6, x0, 16 # loop times = 16
addi s4, x0, 0 # counter = 0
bf16_div_division_loop:
bge s4, t6, bf16_div_division_loop_end
slli s3, s3, 1 # quotient <<= 1
slli s1, s1, 1 # dividend <<= 1
blt s1, s2, bf16_div_division_loop_sub_skip
bf16_div_division_loop_sub:
sub s1, s1, s2 # dividend -= divisor
ori s3, s3, 1 # quotient |= 1
bf16_div_division_loop_sub_skip:
addi s4, s4, 1 # i++
jal x0, bf16_div_division_loop
```
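To double-check that the reformulated loop computes the same quotient as the original `(15 - i)` version, here is a quick exhaustive comparison (a sketch, not code from the repository):
```c=
#include <assert.h>
#include <stdint.h>

/* Original formulation: shift the divisor by (15 - i) every iteration. */
static uint16_t div_old(uint32_t a, uint32_t b)
{
    uint32_t dividend = a << 15, q = 0;
    for (int i = 0; i < 16; i++) {
        q <<= 1;
        if (dividend >= (b << (15 - i))) {
            dividend -= (b << (15 - i));
            q |= 1;
        }
    }
    return (uint16_t)q;
}

/* Reformulated: shift the dividend left once per iteration instead. */
static uint16_t div_new(uint32_t a, uint32_t b)
{
    uint32_t dividend = a, divisor = b << 1, q = 0;
    for (int i = 0; i < 16; i++) {
        q <<= 1;
        dividend <<= 1;
        if (dividend >= divisor) {
            dividend -= divisor;
            q |= 1;
        }
    }
    return (uint16_t)q;
}

int main(void)
{
    for (uint32_t a = 0; a <= 0xFF; a++)
        for (uint32_t b = 1; b <= 0xFF; b++)   /* b == 0 is rejected earlier */
            assert(div_old(a, b) == div_new(a, b));
    return 0;
}
```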
I reduced the division routine from 15 to 13 instructions; the two removed instructions were in the loop body, which **iterates 16 times**, so this saves a **total of 32 dynamic instructions**.
> GitHub: [bf16_div_loop_improved](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/bf16_div/bf16_div.S)
```
// Loop Improved Approach
Cycles: 579
Instrs. retired: 401
CPI: 1.44
IPC: 0.693
Clock rate: 619.25 Hz
```
**Performance:** My latest improved approach is about **1.11x faster** than the previous improved version, and **2.55x faster** than the compiler-optimized version.
___
### ==Other functions==
```asm=
.data
newline: .string "\n"
.text
.globl main
.globl bf16_isnan
.globl bf16_isinf
main:
addi a0, x0, 0x40
slli a0, a0, 8
ori a0, a0, 0x49
slli a0, a0, 8
ori a0, a0, 0x0F
slli a0, a0, 8
ori a0, a0, 0xD0
jal ra, f32_to_bf16
addi s0, a0, 0
addi a7, x0, 1
addi a0, s0, 0
ecall
la a0, newline
addi a7, x0, 4
ecall
addi a7, x0, 10
ecall
bf16_isnan:
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0x80 # BF16_EXP_MASK
and t1, a0, t0 # a.bits & BF16_EXP_MASK
bne t1, t0, bf16_isnan_return_false
andi t1, a0, 0x007F # a.bits & BF16_MANT_MASK
beq t1, x0, bf16_isnan_return_false # mantissa == 0
bf16_isnan_return_true:
addi a0, x0, 1
jr ra
bf16_isnan_return_false:
addi a0, x0, 0
jr ra
bf16_isinf:
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0x80 # BF16_EXP_MASK
and t1, a0, t0 # a.bits & BF16_EXP_MASK
bne t1, t0, bf16_isinf_return_false
andi t1, a0, 0x007F # a.bits & BF16_MANT_MASK
bne t1, x0, bf16_isinf_return_false # mantissa != 0
bf16_isinf_return_true:
addi a0, x0, 1
jr ra
bf16_isinf_return_false:
addi a0, x0, 0
jr ra
bf16_iszero:
addi t0, x0, 0x7F
slli t0, t0, 8
ori t0, t0, 0xFF
and t1, a0, t0
beq t1, x0, bf16_iszero_return_true
bf16_iszero_return_false:
addi a0, x0, 0
jr ra
bf16_iszero_return_true:
addi a0, x0, 1
jr ra
f32_to_bf16:
srli t0, a0, 23 # f32bits >> 23
andi t0, t0, 0xFF
addi t1, x0, 0xFF
beq t0, t1, f32_to_bf16_return_if_equal # if(((f32bits >> 23) & 0xFF) == 0xFF) return
f32_to_bf16_return:
srli t0, a0, 16
andi t0, t0, 1
addi t1, x0, 0x7F
slli t1, t1, 8
ori t1, t1, 0xFF
add t0, t0, t1
add a0, a0, t0
srli a0, a0, 16 # (f32bits + bias) >> 16 already fits in 16 bits on RV32, so no mask is needed
jr ra
f32_to_bf16_return_if_equal:
srli a0, a0, 16 # NaN/Inf: return the upper 16 bits unchanged
jr ra
```
> I think this piece of code can be directly translated into RISC-V instructions, since the original C code is already very concise.
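For reference, the rounding that `f32_to_bf16` implements is round-to-nearest-even. A C sketch of the conversion (my reconstruction of the logic, not necessarily the exact quiz source):
```c=
#include <stdint.h>
#include <string.h>

/* Sketch of the float -> bf16 conversion implemented above
 * (round to nearest, ties to even). */
uint16_t f32_to_bf16_bits(float val)
{
    uint32_t f32bits;
    memcpy(&f32bits, &val, sizeof f32bits);      /* reinterpret the bits */

    if (((f32bits >> 23) & 0xFF) == 0xFF)        /* Inf or NaN: truncate */
        return (uint16_t)(f32bits >> 16);

    uint32_t lsb = (f32bits >> 16) & 1;          /* ties-to-even bit */
    return (uint16_t)((f32bits + 0x7FFF + lsb) >> 16);
}
```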
#### Test Cases
```
1. bf16_isnan: 0x7FC1
2. bf16_isinf: 0x7F80
3. bf16_iszero: 0x0000
4. f32_to_bf16: 0x40490FD0
```
```
// My approach
Cycles: 80
Instrs. retired: 80
CPI: 1
IPC: 1
Clock rate: 0Hz
```
Above is the execution info of my approach.
Below is the execution info of the compiler-optimized version.
```
// Compiler-optimized
Cycles: 132
Instrs. retired: 132
CPI: 1
IPC: 1
Clock rate: 0Hz
```
My approach is **1.65x faster** than the compiler's version.
## Use Cases
### [LeetCode #973 K Closest Points to Origin](https://leetcode.com/problems/k-closest-points-to-origin/description/)
In [LeetCode #973](https://leetcode.com/problems/k-closest-points-to-origin/description/), we have to process a large amount of coordinate data. In real-world scenarios, such as on a drone with limited memory, **uf8 encoding** lets us compress each 32-bit value into 8 bits. This greatly reduces memory usage, allowing the device to **store 4 times as much data** in the same amount of memory.
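For illustration, a sketch of what this could look like (the `uf8_encode`/`uf8_decode` prototypes are assumed to match the quiz code; `compress_dist2` and its arrays are hypothetical):
```c=
#include <stdint.h>
#include <stddef.h>

/* Assumed prototypes of the quiz's encoder/decoder (linked separately). */
uint8_t  uf8_encode(uint32_t value);
uint32_t uf8_decode(uint8_t fl);

/* Store squared distances as 1-byte uf8 codes instead of 4-byte integers. */
void compress_dist2(const uint32_t *dist2, uint8_t *packed, size_t n)
{
    for (size_t i = 0; i < n; i++)
        packed[i] = uf8_encode(dist2[i]);   /* 4 bytes -> 1 byte per point */
}
```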
# Requirement 3
## [Branchless 16 bit CLZ](https://github.com/dorobouharu133/ca2025-quizzes/blob/main/branchless_clz/branchless_16_bit_clz.S)
```asm=
li a0, 0x30
jal branchless_16_bit_clz
# a0: mant_a
# t0: x
# t1: s
# t2: shift
# t3: temp
# t4: temp2
branchless_16_bit_clz:
addi sp, sp, -20
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
sw t3, 12(sp)
sw t4, 16(sp)
addi t0, a0, 0 # x = mant_a
addi t1, x0, 0 # s = 0
addi t2, x0, 0 # shift = 0
addi t3, x0, 0x00FF
slt t1, t3, t0 # s = (x > 0x00FF) ? 1 : 0
xori t1, t1, 1 # s = (x <= 0x00FF) ? 1 : 0
slli t4, t1, 3
add t2, t2, t4 # shift += (s << 3)
sll t0, t0, t4 # x <<= (s << 3)
addi t3, x0, 0xF
slli t3, t3, 8
ori t3, t3, 0xFF
slt t1, t3, t0 # s = (x > 0x0FFF) ? 1 : 0
xori t1, t1, 1 # s = (x <= 0x0FFF) ? 1 : 0
slli t4, t1, 2
add t2, t2, t4 # shift += (s << 2)
sll t0, t0, t4 # x <<= (s << 2)
addi t3, x0, 0x3F
slli t3, t3, 8
ori t3, t3, 0xFF
slt t1, t3, t0 # s = (x > 0x3FFF) ? 1 : 0
xori t1, t1, 1 # s = (x <= 0x3FFF) ? 1 : 0
slli t4, t1, 1
add t2, t2, t4 # shift += (s << 1)
sll t0, t0, t4 # x <<= (s << 1)
addi t3, x0, 0x7F
slli t3, t3, 8
ori t3, t3, 0xFF
slt t1, t3, t0 # s = (x > 0x7FFF) ? 1 : 0
xori t1, t1, 1 # s = (x <= 0x7FFF) ? 1 : 0
slli t4, t1, 0
add t2, t2, t4 # shift += (s << 0)
sll t0, t0, t4 # x <<= (s << 0)
addi a0, t2, 0 # return shift
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
lw t3, 12(sp)
lw t4, 16(sp)
addi sp, sp, 20
jr ra
```
```asm=
slt t1, t3, t0 # s = (x > 0x00FF) ? 1 : 0
xori t1, t1, 1 # s = (x <= 0x00FF) ? 1 : 0
slli t4, t1, 3
```

**IF (Instruction Fetch):**
> PC Enable: 1 (High), since there is no pipeline stall
> Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch.
**ID (Instruction Decode):**
> Reg Wr En: 1 (High), because the instruction in the **WB** stage is `addi`, which writes its result back to the register file (x5).
**EX (Execute):**
> ALUOp1: from register file or forwarding path.
> ALUOp2: from immediate.
> ForwardA: (no forwarding), from ID/EX pipeline register.
> ForwardB: (no forwarding), from ID/EX pipeline register.
> Branch: not taken.
**MEM (Memory):**
> Data Memory Wr En: The instruction in **MEM** stage is `addi`, `addi` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**.
**WB (Write Back):**
> WB Mux: from ALU result, since the instruction in WB is `addi`.
---

**IF:**
> PC Enable: 1 (High), since there is no pipeline stall
> Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch.
**ID:**
> Reg Wr En: 1 (High), because the instruction in the **WB** stage is `addi`, which writes its result back to the register file (x6).
**EX:**
> ALUOp1: from register file or forwarding path.
> ALUOp2: from immediate.
> ForwardA: (no forwarding), from ID/EX pipeline register.
> ForwardB: (no forwarding), from ID/EX pipeline register.
> Branch: not taken.
**MEM:**
> Data Memory Wr En: The instruction in **MEM** stage is `addi`, `addi` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**.
**WB:**
> WB Mux: from ALU result, since the instruction in WB is `addi`.
---

**IF:**
> PC Enable: 1 (High), since there is no pipeline stall
> Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch.
**ID:**
> Reg Wr En: 1 (High), because the instruction in the **WB** stage is `addi`, which writes its result back to the register file (x7).
**EX:** (Forwarding Happens)
> ALUOp1: from register file or forwarding path.
> ALUOp2: from register file or forwarding path.
> ForwardA: (**forwarding**), from **EX/MEM** pipeline register.
> ForwardB: (no forwarding), from ID/EX pipeline register.
> Branch: not taken.
**MEM:**
> Data Memory Wr En: The instruction in **MEM** stage is `addi`, `addi` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**.
**WB:**
> WB Mux: from ALU result, since the instruction in WB is `addi`.



> Because x28 (0x00FF) is **not** less than x5 (0x30), the result of `slt` is `0`, and it will be written back to the register file (x6) when the `slt` instruction reaches the WB stage.
---

**IF:**
> PC Enable: 1 (High), since there is no pipeline stall
> Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch.
**ID:**
> Reg Wr En: 1 (High), because the instruction in the **WB** stage is `addi`, which writes its result back to the register file (x28).
**EX:** (Forwarding happens)
> ALUOp1: from register file or forwarding path.
> ALUOp2: from immediate
> ForwardA: (**forwarding**), from **EX/MEM** pipeline register.
> ForwardB: (no forwarding), from ID/EX pipeline register.
> Branch: not taken.
**MEM:**
> Data Memory Wr En: The instruction in **MEM** stage is `slt`, `slt` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**.
**WB:**
> WB Mux: from ALU result, since the instruction in WB is `addi`.
> With forwarding from the EX/MEM pipeline register, `xori x6, x6, 1` flips the low bit of `x6`: `0 xor 1 = 1`. Therefore, `1` will be written back to x6 when the `xori` instruction reaches the **WB stage**.
---

**IF:**
> PC Enable: 1 (High), since there is no pipeline stall
> Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch.
**ID:**
> Reg Wr En: 1 (High), because the instruction in the **WB** stage is `slt`, which writes its result back to the register file (x6).
**EX:** (forwarding happens)
> ALUOp1: from register file or forwarding path.
> ALUOp2: from immediate.
> ForwardA: (**forwarding**), from **EX/MEM** pipeline register.
> ForwardB: (no forwarding), from ID/EX pipeline register.
> Branch: not taken.
**MEM:**
> Data Memory Wr En: The instruction in **MEM** stage is `xori`, `xori` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**.
**WB:**
> WB Mux: from ALU result, since the instruction in WB is `slt`.
---

**IF:**
> PC Enable: 1 (High), since there is no pipeline stall
> Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch.
**ID:**
> Reg Wr En: 1 (High), because the instruction in the **WB** stage is `xori`, which writes its result back to the register file (x6).
**EX:**
> ALUOp1: from register file or forwarding path.
> ALUOp2: from register file or forwarding path.
> ForwardA: (no forwarding), from ID/EX pipeline register.
> ForwardB: (**forwarding**), from **EX/MEM** pipeline register.
> Branch: not taken.
**MEM:**
> Data Memory Wr En: The instruction in **MEM** stage is `slli`, `slli` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**.
**WB:**
> WB Mux: from ALU result, since the instruction in WB is `xori`.
> As mentioned before, the result of `xori x6, x6, 1` will write back to register file.

In the next cycle, `x6` will be `0x00000001`.
---

**IF:**
> PC Enable: 1 (High), since there is no pipeline stall
> Next PC Mux: 0 (Low), PC = PC + 4, because the instruction in ID is not a branch.
**ID:**
> Reg Wr En: 1 (High), because the instruction in the **WB** stage is `slli`, which writes its result back to the register file (x29).
**EX:** (forwarding happens)
> ALUOp1: from register file or forwarding path.
> ALUOp2: from register file or forwarding path.
> ForwardA: (no forwarding), from ID/EX pipeline register.
> ForwardB: (**forwarding**), from **MEM/WB** pipeline register.
> Branch: not taken.
**MEM:**
> Data Memory Wr En: The instruction in **MEM** stage is `add`, `add` does not access data memory, therefore **Data Memory Wr En is 0 (Low)**.
**WB:**
> WB Mux: from ALU result, since the instruction in WB is `slli`.

Check: `x6` is `1`, so the function is correct.
# References
[RISCV branchless coding - *stackoverflow*](https://stackoverflow.com/questions/72340698/riscv-branchless-coding)
[RISCV assembly 32bit multiplication without MUL giving incorrect result - *stackoverflow*](https://stackoverflow.com/questions/79503254/riscv-assembly-32bit-multiplication-without-mul-giving-incorrect-result)
[Intel® 64 and IA-32 Architectures Optimization Reference Manual Volume 1](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html)
# AI Usage
[Copilot](https://copilot.microsoft.com/shares/iWjxqjKuh8HA6NGhFfUZD)
[Copilot, correct my test function to output the encode and decode values](https://copilot.microsoft.com/shares/pAcS6KRjp4fjSTUFdP3KN)
[ChatGPT](https://chatgpt.com/share/68f2559b-b4f0-8006-a075-d2c62339c95f)
[Copilot, bf16_mul debug](https://copilot.microsoft.com/shares/dz4MNC8W3mtAYefiprkGN)
[ChatGPT, bf16_mul debug](https://chatgpt.com/share/68f79767-7fb0-8006-8ac0-e30787814397)
[Gemini, bf16_add proofreading](https://gemini.google.com/share/8f267b63f019)
[Gemini, bf16_add test cases](https://gemini.google.com/share/f34265e85ffd)
[Gemini, bf16_sqrt proofreading](https://gemini.google.com/share/a882f07040c2)
[Gemini, bf16_mul proofreading](https://gemini.google.com/share/8ccf0e0f70a8)
[Gemini, Branchless Programming](https://gemini.google.com/share/f0e56a89944b)