---
tags: CA2025
---
# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by [<wilson0828>](https://github.com/wilson0828)
>Refer to [Assignment1](https://hackmd.io/@sysprog/2025-arch-homework1)
## Problem B
Refer to [Quiz1 of Computer Architecture (2025 Fall) Problem `B`](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
### clz
#### C code
```c
static inline unsigned clz(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}
```
#### Assembly implementation
- Goal: Compute the number of leading zeros of a 32-bit unsigned integer.
- Algorithm: Binary-search style with shifts `c=16,8,4,2,1`; if `y=x>>c` is nonzero then `n-=c, x=y`; loop while `c!=0`.
- Works for all x, including x = 0 (which returns 32).

Since `c` is initialized to 16 (non-zero), switching from do–while to `while (c)` preserves behavior and simplifies the branch structure in RV32I.
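In C, the equivalent while-loop form looks like this (a sketch; behavior is identical since `c` starts non-zero, so the body runs at least once either way). The assembly below implements this form:
```c
static inline unsigned clz_while(uint32_t x)
{
    int n = 32, c = 16;
    while (c) {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    }
    return n - x;
}
```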
```asm
clz:
li t0, 32 # n = 32
li t1, 16 # c = 16
Lwhile:
beq t1, x0, ans
srl t2, a0, t1
beq t2, x0, skip
sub t0, t0, t1
mv a0, t2
skip:
srli t1, t1, 1
j Lwhile
ans:
sub a0, t0, a0
ret
```
### uf8_decode
#### C code
```c
uint32_t uf8_decode(uf8 fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint8_t exponent = fl >> 4;
    uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}
```
#### Assembly implementation
Rather than a line-by-line C translation, I use the closed-form offset `offset(e) = (2^e - 1) * 16`, so decode is simply `(m << e) + offset`. As a result, we get a much simpler version of `uf8_decode`.
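In C terms, the simplification looks like this (a sketch; the closed form replaces the `0x7FFF` shift trick and is equivalent, since `(0x7FFF >> (15 - e)) << 4 = ((2^e - 1)) << 4`):
```c
#include <stdint.h>

uint32_t uf8_decode_closed(uint8_t fl)
{
    uint32_t m = fl & 0x0F;
    uint32_t e = fl >> 4;
    uint32_t offset = ((1u << e) - 1) << 4; /* (2^e - 1) * 16 */
    return (m << e) + offset;
}
```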
```asm
uf8_decode:
andi t0, a0, 0x0F # mantissa = fl & 0x0F
srli t1, a0, 4 # exponent = fl >> 4
li t2, 1
sll t2, t2, t1 # t2 = 1 << e
addi t2, t2, -1 # t2 = (1<<e) - 1
slli t2, t2, 4 # offset = (2^e -1)*16
sll t0, t0, t1 # mantissa << e
add a0, t0, t2 # return
ret
```
### uf8_encode
#### Assembly implementation
Function signature
- Param: a0 = value (32-bit unsigned).
- Return: a0 = uf8 (4-bit exponent | 4-bit mantissa).
Caller/callee saves
- Saves ra on stack before jal clz and restores after.
- Uses only t temporaries for working regs.
Temporaries usage
- t1 → msb (31 - clz(value)).
- t2 → exponent e.
- t3 → overflow = offset(e).
- t4 → next_overflow candidate.
- t5 → scratch to build next_overflow.
- t0 → general scratch for compares.
- a1 → holds clz(value) result temporarily after clz.
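For reference, here is a C sketch of the control flow the assembly below implements (reconstructed from the register plan above for illustration; it is not from the original source):
```c
#include <stdint.h>

/* clz is the routine from the previous section */
unsigned uf8_encode(uint32_t value)
{
    if (value < 16)
        return value;                        /* e = 0, mantissa = value */
    int msb = 31 - clz(value);
    int e = 0;
    uint32_t overflow = 0;
    if (msb >= 5) {
        e = msb - 4;
        if (e > 15)
            e = 15;
        overflow = ((1u << e) - 1) << 4;     /* offset(e) */
        while (e != 0 && value < overflow) { /* walk e down */
            overflow = (overflow - 16) >> 1;
            e--;
        }
    }
    while (e < 15) {                         /* walk e up */
        uint32_t next = (overflow << 1) + 16;
        if (value < next)
            break;
        overflow = next;
        e++;
    }
    return (e << 4) | ((value - overflow) >> e);
}
```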
<details>
<summary>uf8_encode</summary>
```asm
uf8_encode:
slti t0, a0, 16
bne t0, x0, if1
addi sp, sp, -8
sw a0, 0(sp)
sw ra, 4(sp)
jal clz
mv a1, a0 # leading zero
lw a0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
li t0, 31
sub t1, t0, a1 # msb
li t2, 0 # exponent
li t3, 0 # overflow = offset
slti t0, t1, 5
bne t0, x0, if2
addi t2, t1, -4
slti t0, t2, 15
bne t0, x0, if3
li t2, 15
if3:
li t0, 1
sll t0, t0, t2 # t0 = 1 << e
addi t0, t0, -1 # t0 = (1<<e) - 1
slli t3, t0, 4 # overflow = t3 = (2^e -1)*16
wloop:
beq t2, x0, if2
bgeu a0, t3, if2
addi t3, t3, -16 # overflow = overflow - 16
srli t3, t3, 1
addi t2, t2, -1
j wloop
if2:
slti t0, t2, 15
beq t0, x0, wdone
slli t5, t3, 1
addi t5, t5, 16
mv t4, t5 # next_overflow = (overflow << 1) + 16;
sltu t0, a0, t4
bne t0, x0, wdone
mv t3, t4
addi t2, t2, 1
j if2
wdone:
sub a0, a0, t3 # a0 = value - overflow
srl a0, a0, t2 # a0 >>= exponent
slli t2, t2, 4 # t2 = exponent << 4
or a0, t2, a0 # a0 = (e<<4) | mantissa
if1:
ret
```
</details>
### Validation
Here I used [Compiler Explorer](https://godbolt.org/) with RISC-V (32-bits) gcc (trunk) and with flag -O2 to generate the following assembly.
<details>
<summary>Compiler generated code</summary>
```asm
uf8_decode:
srli a3,a0,4
li a4,15
li a5,32768
sub a4,a4,a3
addi a5,a5,-1
sra a5,a5,a4
andi a0,a0,15
sll a0,a0,a3
slli a5,a5,4
add a0,a5,a0
ret
uf8_encode:
li a5,15
bleu a0,a5,L25
mv a1,a0
li a5,5
li a4,16
li a2,32
L4:
srl a3,a1,a4
addi a5,a5,-1
beq a3,zero,L6
sub a2,a2,a4
mv a1,a3
L6:
srai a4,a4,1
bne a5,zero,L4
sub a2,a2,a1
li a4,26
bgt a2,a4,L16
li a4,31
sub a4,a4,a2
andi a4,a4,0xff
li a3,4
beq a4,a3,L16
addi a4,a4,-4
andi a2,a4,0xff
li a3,15
bgtu a2,a3,L26
L10:
andi a4,a4,0xff
li a3,0
L11:
addi a3,a3,1
slli a5,a5,1
andi a3,a3,0xff
addi a5,a5,16
bgtu a4,a3,L11
bgeu a0,a5,L12
L13:
srli a5,a5,1
addi a4,a4,-1
addi a5,a5,-8
andi a4,a4,0xff
snez a3,a4
sltu a2,a0,a5
and a3,a3,a2
bne a3,zero,L13
L12:
li a3,14
bgtu a4,a3,L24
L8:
li a1,15
j L14
L27:
andi a4,a2,0xff
beq a4,a1,L24
L14:
mv a3,a5
slli a5,a5,1
addi a5,a5,16
addi a2,a4,1
bgeu a0,a5,L27
sub a0,a0,a3
srl a0,a0,a4
slli a4,a4,4
or a0,a0,a4
L25:
andi a0,a0,0xff
ret
L16:
li a4,0
j L8
L24:
mv a3,a5
sub a0,a0,a3
srl a0,a0,a4
slli a4,a4,4
or a0,a0,a4
j L25
L26:
mv a4,a3
j L10
```
</details>
For validation, I used six test cases each for `encode (uint32 → uf8)` and `decode (uf8 → uint32)`, chosen to hit zero, normal ranges, and the tricky exponent boundaries. I then ran both the compiler-generated code and my hand-written version on the Ripes simulator and compared the results.
Encode tests (uint32 → uf8 byte): 0, 15, 16, 46, 524272, 1015792
Decode tests (uf8 byte → uint32): 0x00, 0x0F, 0x10, 0x1F, 0xF0, 0xFF
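Incidentally, the encode inputs are exactly the decoded values of the six uf8 bytes, so the two lists form a round trip. A quick host-side check in C (a sketch; it uses the reference `uf8_decode` above and the illustrative `uf8_encode` sketch from the encode section):
```c
#include <assert.h>
#include <stdint.h>

uint32_t uf8_decode(uint8_t fl);
unsigned uf8_encode(uint32_t value);

int main(void)
{
    const uint8_t bytes[] = {0x00, 0x0F, 0x10, 0x1F, 0xF0, 0xFF};
    for (int i = 0; i < 6; i++) {
        /* decoded values: 0, 15, 16, 46, 524272, 1015792 */
        uint32_t v = uf8_decode(bytes[i]);
        assert(uf8_encode(v) == bytes[i]); /* exact round trip */
    }
    return 0;
}
```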
#### Console

### Analysis
- Cycle count

| mine | RISC-V (32-bits) gcc |
|:--:|:--:|
- Code size (real instructions only)
    - mine: 65 lines
    - compiler-generated: 79 lines
### Improvement
- My assembly runs about 13.5% faster than the compiler-generated code.
- My code is about 17.7% smaller than the compiler-generated code.
## Problem C
Refer to [Quiz1 of Computer Architecture (2025 Fall) Problem `C`](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
### Part 1
In this section, I implement all bf16 operations except `bf16_sqrt`.
#### Assembly Implementation
#### bf16_add
In the add path, when the two operands have opposite signs, large cancellation can occur. That knocks the leading 1 out of the mantissa, so we have to renormalize. I switched to a LUT-based `clz` to grab the leading-zero count in one shot, then shift the mantissa left and subtract that amount from the exponent. This is simple and also avoids a slow loop. In addition, I swapped `if (!exp && !mant)` for `if ((bits & 0x7FFF) == 0)` as a fast zero check; it catches both +0 and −0.
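A small C illustration of the two tweaks (the names here are mine, not from the original source):
```c
#include <stdint.h>

extern const uint8_t clz8_lut[256]; /* the table in the listing below */

/* Renormalize after cancellation: one LUT lookup instead of a
   bit-by-bit loop. */
static inline void renorm(uint16_t *mant, int16_t *exp)
{
    int sh = clz8_lut[*mant & 0xFF];
    *mant <<= sh;
    *exp -= sh;
}

/* Fast zero check: masking off the sign bit catches +0 and -0
   in a single compare. */
static inline int bf16_is_zero_bits(uint16_t bits)
{
    return (bits & 0x7FFF) == 0;
}
```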
<details>
<summary>bf16_add</summary>
```asm
.data
.align 4
clz8_lut:
.byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4
.byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
clz8:
andi a0, a0, 0xFF
la t0, clz8_lut
add t0, t0, a0
lbu a0, 0(t0)
ret
bf16_add:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
li t0, 0xFF
bne s2, t0, chk
bnez s4, ret_a
bne s3, t0, ret_a
bnez s5, ret_b
bne s0, s1, ret_nan
j ret_b
chk:
beq s3, t0, ret_b
li t0, 0x7FFF
and t1, a0, t0
beq t1, x0, ret_b
and t1, a1, t0
beq t1, x0, ret_a
beq s2, x0, l1
ori s4, s4, 0x80
l1:
beq s3, x0, l2
ori s5, s5, 0x80
l2:
sub t1, s2, s3 # t1 = exp_diff
blt x0, t1, grt
beq t1, x0, equ
mv t2, s3 # t2 = result_exp
li t0, -8
blt t1, t0, ret_b
sub t0, x0, t1
srl s4, s4, t0
j exp_dif
grt:
mv t2, s2
li t0, 8
blt t0, t1, ret_a
srl s5, s5, t1
j exp_dif
equ:
mv t2, s2
exp_dif:
bne s0, s1, diff_signs
mv t3, s0 # t3 = result_sign
add t4, s4, s5 # t4 = rm
li t0, 0x100
and t1, t4, t0
beq t1, x0, pack
srli t4, t4, 1
addi t2, t2, 1
li t0, 0xFF
blt t2, t0, pack
slli a0, t3, 15
li t5, 0x7F80
or a0, a0, t5
j ans
diff_signs:
blt s4, s5, gt_ma
mv t3, s0 # result_sign = sign_a
sub t4, s4, s5
j l3
gt_ma:
mv t3, s1 # result_sign = sign_b
sub t4, s5, s4
l3:
beq t4, x0, ret_zero
mv a0, t4
jal ra, clz8 # a0 = clz8(rm)
mv t0, a0
sll t4, t4, t0 # rm <<= sh
sub t2, t2, t0 # result_exp -= sh
blt t2, x0, ret_zero
beq t2, x0, ret_zero
j pack
ret_zero:
li a0, 0x0000 # BF16_ZERO()
j ans
pack:
slli a0, t3, 15
slli t1, t2, 7
or a0, a0, t1
andi t4, t4, 0x7F
or a0, a0, t4
j ans
ret_b:
mv a0, a1
j ans
ret_nan:
li a0, 0x7FC0
j ans
ret_a:
j ans
ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
```
</details>
#### bf16_sub
Instead of implementing a brand-new `bf16_sub`, we simply flip the sign bit of `b` and call the `bf16_add` above.
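In C terms (a minimal sketch, treating bf16 values as raw 16-bit patterns):
```c
#include <stdint.h>

uint16_t bf16_add(uint16_t a, uint16_t b); /* the routine above */

/* Subtraction is addition with b's sign bit (bit 15) flipped. */
static inline uint16_t bf16_sub(uint16_t a, uint16_t b)
{
    return bf16_add(a, b ^ 0x8000);
}
```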
<details>
<summary>bf16_sub</summary>
```asm
bf16_sub:
addi sp, sp, -8
sw ra, 4(sp)
li t0, 0x8000
xor a1, a1, t0
jal ra, bf16_add
lw ra, 4(sp)
addi sp, sp, 8
ret
```
</details>
#### bf16_div
<details>
<summary>bf16_div</summary>
```asm
bf16_div:
addi sp, sp, -24
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
xor t1, s0, s1 # t1 = result_sign
li t0, 0xff
bne s3, t0, exp_b_f
bne s5, x0, ret_b
bne s2, t0, l1
bne s4, x0, l1
j ret_nan
l1:
slli a0, t1, 15
j ans
exp_b_f:
bne s3, x0, skip
bne s5, x0, skip
bne s2, x0, skip2
beq s4, x0, ret_nan
skip2:
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j ans
skip:
bne s2, t0, exp_a_f
bne s4, x0, ret_a
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j ans
exp_a_f:
beq s2, x0, exp_a_is_zero
j l2
exp_a_is_zero:
beq s4, x0, a_is_zero_return
j l2
a_is_zero_return:
slli a0, t1, 15
j ans
l2:
beq s2, x0, l3
ori s4, s4, 0x80
l3:
beq s3, x0, l4
ori s5, s5, 0x80
l4:
slli t2, s4, 15 # t2 = dividend
mv t3, s5 # t3 = divisor
li t4, 0 # t4 = counter
li t5, 0 # t5 = quotient
div_loop:
li t6, 16
bge t4, t6, out_loop
slli t5, t5, 1
sub t0, x0, t4 # t0 = -i
addi t0, t0, 15
sll t1, t3, t0
bltu t2, t1, cant_div
sub t2, t2, t1
ori t5, t5, 1
cant_div:
addi t4, t4, 1
j div_loop
out_loop:
sub t2, s2, s3
addi t2, t2, 127 # t2 = result_exp
bne s2, x0, l5
addi t2, t2, -1
l5:
bne s3, x0, l6
addi t2, t2, 1
l6:
li t0, 0x8000
and t3, t5, t0
bne t3, x0, set
norm_loop:
and t3, t5, t0
bne t3, x0, norm_done
li t6, 2
blt t2, t6, norm_done
slli t5, t5, 1
addi t2, t2, -1
j norm_loop
norm_done:
srli t5, t5, 8
j l7
set:
srli t5, t5, 8
l7:
andi t5, t5, 0x7F
li t0, 0xFF
bge t2, t0, ret_inf
blt t2, x0, ret_zero
beq t2, x0, ret_zero
slli a0, t1, 15
andi t2, t2, 0xFF
slli t2, t2, 7
or a0, a0, t2
or a0, a0, t5
j ans
ret_inf:
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j ans
ret_zero:
slli a0, t1, 15
j ans
ret_b:
mv a0, a1
j ans
ret_nan:
li a0, 0x7FC0
j ans
ret_a:
j ans
ans:
li t0, 0xFFFF
and a0, a0, t0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
addi sp, sp, 24
ret
```
</details>
#### bf16_mul
For mul, I reuse the same `LUT-based clz` idea from add. When checking whether a or b is normalized, I call `clz8` to get the leading-zero count in one shot, then left-shift the mantissa by that amount and track the exponent adjustment. It's a quick way to snap subnormals into the `1.xxxxxx` form without a slow loop.
Also, to stay RV32I-only (no hardware multiply), I dropped the `mul` instruction and went with a classic shift-and-add routine, wrapped as `mul8x8_to16`: it loops over the bits of the multiplier, adds the shifted multiplicand whenever a bit is set, and builds the 16-bit product.
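The C equivalent of that routine is roughly (a sketch mirroring the loop in the listing below):
```c
#include <stdint.h>

/* 8x8 -> 16-bit multiply via shift-and-add; base RV32I has no mul. */
static uint16_t mul8x8_to16(uint8_t a, uint8_t b)
{
    uint16_t acc = 0;
    uint16_t m = a; /* multiplicand, shifted left each round */
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            acc += m; /* add when the multiplier bit is set */
        m <<= 1;
        b >>= 1;
    }
    return acc;
}
```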
<details>
<summary>bf16_mul</summary>
```asm
.data
.align 4
clz8_lut:
.byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4
.byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
clz8:
andi a0, a0, 0xFF
la t0, clz8_lut
add t0, t0, a0
lbu a0, 0(t0)
ret
mul8x8_to16:
andi a0, a0, 0xFF
andi a1, a1, 0xFF
mv t1, a0 # multiplicand
mv t2, a1 # multiplier
li t0, 0
li t3, 8
label1:
andi t4, t2, 1
beqz t4, label2
add t0, t0, t1 # acc += multiplicand
label2:
slli t1, t1, 1 # multiplicand <<= 1
srli t2, t2, 1 # multiplier >>= 1
addi t3, t3, -1
bnez t3, label1
mv a0, t0
ret
bf16_mul:
addi sp, sp, -28
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw ra, 24(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
li t0, 0xff
xor t1, s0, s1 # t1 = result_sign
bne s2, t0, a_exp
bne s4, x0, ret_b
bne s3, x0, inf1
beq s5, x0, ret_nan
inf1:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j ans
a_exp:
bne s3, t0, b_exp
bne s5, x0, ret_b
bne s2, x0, inf2
beq s4, x0, ret_nan
inf2:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j ans
b_exp:
bne s2, x0, skip1
beq s4, x0, l1
skip1:
bne s3, x0, skip2
bne s5, x0, skip2
l1:
srli a0, t1, 15
j ans
skip2:
li t2, 0 # t2 = exp_adjust
bne s2, x0, else_a
mv a0, s4
jal ra, clz8 # a0 = clz8(rm)
mv t0, a0
sll s4, s4, t0
sub t2, t2, t0
li s2, 1
else_a:
ori s4, s4, 0x80
bne s3, x0, else_b
mv a0, s5
jal ra, clz8 # a0 = clz8(rm)
mv t0, a0
sll s5, s5, t0
sub t2, t2, t0
li s3, 1
else_b:
ori s5, s5, 0x80
mv a0, s4
mv a1, s5
jal mul8x8_to16
mv t3, a0 # t3 = result_mant = product
xor t1, s0, s1
add t4, s2, s3
addi t4, t4, -127
add t4, t4, t2 # t4 = result_exp
li t5, 0x8000
and t0, t3, t5
beq t0, x0, l2
srli t3, t3, 8
andi t3, t3, 0x7F
addi t4, t4, 1
j mant
l2:
srli t3, t3, 7
andi t3, t3, 0x7F
mant:
li t0, 0xFF
blt t4, t0, skip3
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j ans
skip3:
blt x0, t4, l3
addi t0, x0, -6
blt t4, t0, l4
li t0, 1
sub t0, t0, t4
srl t3, t3, t0
li t4, 0
j l3
l4:
srli a0, t1, 15
j ans
l3:
andi t1, t1, 1
slli t1, t1, 15
andi t4, t4, 0xFF
slli t4, t4, 7
andi t3, t3, 0x7F
or a0, t1, t4
or a0, a0, t3
li t0, 0xFFFF
and a0, a0, t0
j ans
ret_inf:
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j ans
ret_zero:
slli a0, t1, 15
j ans
ret_b:
mv a0, a1
j ans
ret_nan:
li a0, 0x7FC0
j ans
ret_a:
j ans
ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
```
</details>
<details>
<summary>others</summary>
```asm
bf16_isnan:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, nan_false
andi t1, a0, 0x007F
beq t1, x0, nan_false
li a0, 1
ret
nan_false:
li a0, 0
ret
bf16_isinf:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, inf_false
andi t1, a0, 0x007F
bne t1, x0, inf_false
li a0, 1
ret
inf_false:
li a0, 0
ret
bf16_iszero:
li t0, 0x7FFF
and t1, a0, t0
bne t1, x0, zero_false
li a0, 1
ret
zero_false:
li a0, 0
ret
f32_to_bf16:
srli t1, a0, 23
andi t1, t1, 0xFF
li t2, 0xFF
bne t1, t2, L1
srli a0, a0, 16
ret
L1:
srli t1, a0, 16
andi t1, t1, 1
add a0, a0, t1
li t3, 0x7FFF
add a0, a0, t3
srli a0, a0, 16
ret
bf16_to_f32:
slli a0, a0, 16
ret
```
</details>
### Part 2
#### bf16_sqrt
First, I give a purely mechanical, line-by-line translation of the C version.
<details>
<summary>Version 1</summary>
```asm
.data
.align 4
.text
mul8x8_to16:
andi a0, a0, 0xFF
andi a1, a1, 0xFF
mv t1, a0 # multiplicand
mv t2, a1 # multiplier
li t0, 0
li t3, 8
label1:
andi t4, t2, 1
beqz t4, label2
add t0, t0, t1 # acc += multiplicand
label2:
slli t1, t1, 1 # multiplicand <<= 1
srli t2, t2, 1 # multiplier >>= 1
addi t3, t3, -1
bnez t3, label1
mv a0, t0
ret
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, a_exp
beq s2, x0, a_mant
j ret_a
a_mant:
beq s0, x0, a_sign
li a0, 0x7FC0
j ans
a_sign:
j ret_a
a_exp:
bne s1, x0, skip
bne s2, x0, skip
li a0, 0x0000
j ans
skip:
beq s0, x0, negative_skip
li a0, 0x7FC0
j ans
negative_skip:
bne s1, x0, denormals_skip
li a0, 0x0000
j ans
denormals_skip:
addi t1, s1, -127 #t1 = e
ori t2, s2, 0x80 #t2 = m
andi t0, t1, 1
beq t0, x0, else
slli t2, t2, 1
addi t0, t1, -1
srai t0, t0, 1
addi t3, t0, 127 # t3 = new_exp
j end_if
else:
srai t0, t1, 1
addi t3, t0, 127 # t3 = new_exp
end_if:
li s3, 90 # s3 = low
li s4, 255 # s4 = high
li s5, 128 # s5 = result
mv s6, t3
loop:
bgtu s3, s4, loop_done
add t0, s3, s4
srli t1, t0, 1 # t1 = mid
mv a0, t1
mv a1, t1
mv t5, t1 # protect
mv t6, t2
jal mul8x8_to16
mv t0, a0
mv t1, t5
mv t2, t6
srli t0, t0, 7 #t0 = sq
bleu t0, t2, do_if
addi s4, t1, -1
j end_if2
do_if:
mv s5, t1
addi s3, t1, 1
end_if2:
j loop
loop_done:
mv t3, s6
li t0, 256
bltu s5, t0, l1
srli s5, s5, 1
addi t3, t3, 1
j l3
l1: li t0, 128
bgeu s5, t0, l3
l2: li t0, 128
bgeu s5, t0, l3
slti t2, t3, 2
bne t2, x0, l3
slli s5, s5, 1
addi t3, t3, -1
j l2
l3:
andi t5, s5, 0x7F # t5 = new_mant
li t0, 0xFF
blt t3, t0, no_overflow
li a0, 0x7F80
j ans
no_overflow:
bgt t3, x0, no_underflow
li a0, 0
j ans
no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j ans
ret_a:
j ans
ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
</details>
#### Improvement Strategy: Binary search
To improve the cycle count and code size, I mainly target the binary-search part. Because we implement in RV32I, there is no `mul` instruction, so every multiplication has to be emulated with a shift-and-add loop.
In the original `bf16_sqrt`, the binary search recomputes `mid * mid` on every iteration. This not only adds many instructions per step but also introduces extra branches, which significantly increases the total cycle count in Ripes.
#### Implement Strategy
- Rewrite the comparison
In the original binary search we look for the largest `mid` satisfying `floor(mid^2/128) <= m`. This is equivalent to `mid^2 <= (m<<7) + 127`, since `⌊x/128⌋ ≤ m ⟺ x < 128(m+1) ⟺ x ≤ 128m + 127` (the +127 absorbs the floor). In other words, we want the result
$$
\mathrm{mid}=\left\lfloor \sqrt{(m \ll 7)+127} \right\rfloor
$$
which produces the same mid as the original binary search but avoids repeated multiplication inside the loop.
- Implement sqrt in RV32I (using only shift / add / sub / compare)
- Main concept: decide one bit of the square root each round.
We maintain three variables (`n`, `res`, and `bit`), where `bit` starts at the highest power of four (`1 << 14`) and shifts right by two each round. In each round, we test `n >= res + bit` to decide whether this bit can be included: if it can, we subtract `res + bit` and update `res`; otherwise we only shift `res`.
- C implementation of sqrt without multiplication or division
```clike=
static inline uint16_t isqrt16(uint32_t n)
{
    uint32_t res = 0;
    uint32_t bit = 1u << 14; // 16384
    while (bit != 0) {
        uint32_t tmp = res + bit;
        if (n >= tmp) {
            n -= tmp;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint16_t)res;
}
```
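For example, for an input of 1.0 the exponent is even and the normalized mantissa is `m = 0x80`, so we compute `isqrt16((0x80 << 7) + 127) = isqrt16(16511)`; since 128² = 16384 ≤ 16511 < 129² = 16641, the result is 128 (`0x80`), which packs back to exactly 1.0 (`0x3F80`).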
<details>
<summary>Version 2</summary>
```asm
.data
.align 4
.text
isqrt16_pow4:
li t0, 0 # res
li t1, 16384 # bit = 1<<14, the highest power of four below 2^16
isqrt_loop:
beqz t1, isqrt_done
add t2, t0, t1 # tmp = res + bit
bgeu a0, t2, isqrt_ge
srli t0, t0, 1
srli t1, t1, 2
j isqrt_loop
isqrt_ge:
sub a0, a0, t2
srli t0, t0, 1
add t0, t0, t1
srli t1, t1, 2
j isqrt_loop
isqrt_done:
mv a0, t0
ret
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, a_exp
beq s2, x0, a_mant
j ret_a
a_mant:
beq s0, x0, a_sign
li a0, 0x7FC0
j ans
a_sign:
j ret_a
a_exp:
bne s1, x0, skip
bne s2, x0, skip
li a0, 0x0000
j ans
skip:
beq s0, x0, negative_skip
li a0, 0x7FC0
j ans
negative_skip:
bne s1, x0, denormals_skip
li a0, 0x0000
j ans
denormals_skip:
addi t1, s1, -127 # t1 = e
ori t2, s2, 0x80 # t2 = m (implicit 1)
andi t0, t1, 1
beq t0, x0, else
slli t2, t2, 1 # m <<= 1 (odd exponent)
addi t0, t1, -1 # t0 = e - 1
srai t0, t0, 1 # t0 = (e-1)>>1
addi t3, t0, 127 # t3 = new_exp
j end_if
else:
srai t0, t1, 1 # t0 = e>>1
addi t3, t0, 127 # t3 = new_exp
end_if:
mv s6, t3
slli a0, t2, 7 # a0 = m<<7
addi a0, a0, 127 # a0 = (m<<7) + 127
jal isqrt16_pow4 # a0 = result
mv s5, a0 # s5 = result
mv t3, s6
j l3
l3:
andi t5, s5, 0x7F # t5 = new_mant
li t0, 0xFF
blt t3, t0, no_overflow
li a0, 0x7F80
j ans
no_overflow:
bgt t3, x0, no_underflow
li a0, 0
j ans
no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j ans
ret_a:
j ans
ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
</details>
### Validation & Analysis
- In this part, I compared all operations against the compiler-generated versions using the test code below. It includes three to five cases per Part 1 operation, plus about twenty cases for `bf16_sqrt`, covering most situations.
<details>
<summary>BF16 test </summary>
```asm
main:
# BF16 MUL (1~5)
#1 Inf * 0 = NaN
li a0, 0x7F80
li a1, 0x0000
jal ra, bf16_mul
li t6, 0x7FC0
bne a0, t6, fail
li t0, 1
mv s0, t0
#2 0 * 3 = 0
li a0, 0x0000
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
#3 2 * 3 = 6
li a0, 0x4000
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x40C0
bne a0, t6, fail
#4 -2 * 3 = -6
li a0, 0xC000
li a1, 0x4040
jal ra, bf16_mul
li t6, 0xC0C0
bne a0, t6, fail
#5 1.5 * 2 = 3
li a0, 0x3FC0
li a1, 0x4000
jal ra, bf16_mul
li t6, 0x4040
bne a0, t6, fail
# BF16 ADD (6~10)
#6 1 + 1 = 2
li a0, 0x3F80
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
#7 1 + 0.5 = 1.5
li a0, 0x3F80
li a1, 0x3F00
jal ra, bf16_add
li t6, 0x3FC0
bne a0, t6, fail
#8 2 + (-0.5) = 1.5
li a0, 0x4000
li a1, 0xBF00
jal ra, bf16_add
li t6, 0x3FC0
bne a0, t6, fail
#9 -1 + 1 = 0
li a0, 0xBF80
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x0000
bne a0, t6, fail
#10 +Inf + (-Inf) = NaN
li a0, 0x7F80
li a1, 0xFF80
jal ra, bf16_add
li t6, 0x7FC0
bne a0, t6, fail
# BF16 SUB (11~15)
#11 3 - 1 = 2
li a0, 0x4040
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x4000
bne a0, t6, fail
#12 1 - 1 = 0
li a0, 0x3F80
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x0000
bne a0, t6, fail
#13 1 - (-1) = 2
li a0, 0x3F80
li a1, 0xBF80
jal ra, bf16_sub
li t6, 0x4000
bne a0, t6, fail
#14 -2 - 3 = -5
li a0, 0xC000
li a1, 0x4040
jal ra, bf16_sub
li t6, 0xC0A0
bne a0, t6, fail
#15 +Inf - +Inf = NaN
li a0, 0x7F80
li a1, 0x7F80
jal ra, bf16_sub
li t6, 0x7FC0
bne a0, t6, fail
# BF16 DIV (16~20)
#16 3 / 2 = 1.5
li a0, 0x4040
li a1, 0x4000
jal ra, bf16_div
li t6, 0x3FC0
bne a0, t6, fail
#17 1 / 2 = 0.5
li a0, 0x3F80
li a1, 0x4000
jal ra, bf16_div
li t6, 0x3F00
bne a0, t6, fail
#18 0 / 3 = 0
li a0, 0x0000
li a1, 0x4040
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
#19 1 / 0 = +Inf
li a0, 0x3F80
li a1, 0x0000
jal ra, bf16_div
li t6, 0x7F80
bne a0, t6, fail
#20 0 / 0 = NaN
li a0, 0x0000
li a1, 0x0000
jal ra, bf16_div
li t6, 0x7FC0
bne a0, t6, fail
# BF16 ISNAN test (21~23)
#21 isnan(+qNaN) = 1
li a0, 0x7FC1
jal ra, bf16_isnan
li t6, 1
bne a0, t6, fail
#22 isnan(+sNaN-ish) = 1
li a0, 0x7F81
jal ra, bf16_isnan
li t6, 1
bne a0, t6, fail
#23 isnan(+Inf) = 0
li a0, 0x7F80
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
# BF16 ISINF test (24~26)
#24 isinf(+Inf) = 1
li a0, 0x7F80
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
#25 isinf(-Inf) = 1
li a0, 0xFF80
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
#26 isinf(NaN) = 0
li a0, 0x7FC0
jal ra, bf16_isinf
li t6, 0
bne a0, t6, fail
# BF16 ISZERO (27~29)
#27 iszero(+0) = 1
li a0, 0x0000
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
#28 iszero(-0) = 1
li a0, 0x8000
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
#29 iszero(subnormal != 0) = 0
li a0, 0x0001
jal ra, bf16_iszero
li t6, 0
bne a0, t6, fail
# f32_to_bf16 (30~32)
#30 1.0f -> 0x3F80
li a0, 0x3F800000
jal ra, f32_to_bf16
li t6, 0x3F80
bne a0, t6, fail
#31 0x3F7F8000 to even -> 0x3F80
li a0, 0x3F7F8000
jal ra, f32_to_bf16
li t6, 0x3F80
bne a0, t6, fail
#32 NaN keeps the high 16 bits -> 0x7FC0
li a0, 0x7FC00001
jal ra, f32_to_bf16
li t6, 0x7FC0
bne a0, t6, fail
# bf16_to_f32 test (33~35)
#33 0x3F80 -> 1.0f
li a0, 0x3F80
jal ra, bf16_to_f32
li t6, 0x3F800000
bne a0, t6, fail
#34 0x7F80 -> +Inf
li a0, 0x7F80
jal ra, bf16_to_f32
li t6, 0x7F800000
bne a0, t6, fail
#35 0xC000 -> -2.0f
li a0, 0xC000
jal ra, bf16_to_f32
li t6, 0xC0000000
bne a0, t6, fail
# BF16 SQRT (36~55)
#36 sqrt(+Inf) = +Inf
li a0, 0x7F80
jal ra, bf16_sqrt
li t6, 0x7F80
bne a0, t6, fail
#37 sqrt(-Inf) = NaN (canonical 0x7FC0)
li a0, 0xFF80
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
#38 NaN(payload) propagates (return original)
li a0, 0x7FC1
jal ra, bf16_sqrt
li t6, 0x7FC1
bne a0, t6, fail
#39 NaN(canonical) propagates (return original)
li a0, 0x7FC0
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
#40 sqrt(+0) = +0
li a0, 0x0000
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
#41 sqrt(-0) = +0 (returns BF16_ZERO)
li a0, 0x8000
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
#42 denorm(min) flush to 0
li a0, 0x0001
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
#43 denorm(max) flush to 0
li a0, 0x007F
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
#44 sqrt(0.25) = 0.5
li a0, 0x3E80
jal ra, bf16_sqrt
li t6, 0x3F00
bne a0, t6, fail
#45 sqrt(0.5) ≈ 0.70703125 -> 0x3F35
li a0, 0x3F00
jal ra, bf16_sqrt
li t6, 0x3F35
bne a0, t6, fail
#46 sqrt(1.0) = 1.0
li a0, 0x3F80
jal ra, bf16_sqrt
li t6, 0x3F80
bne a0, t6, fail
#47 sqrt(1.5) -> 0x3F9D
li a0, 0x3FC0
jal ra, bf16_sqrt
li t6, 0x3F9D
bne a0, t6, fail
#48 sqrt(2.0) ≈ 1.4140625 -> 0x3FB5
li a0, 0x4000
jal ra, bf16_sqrt
li t6, 0x3FB5
bne a0, t6, fail
#49 sqrt(3.0) -> 0x3FDD
li a0, 0x4040
jal ra, bf16_sqrt
li t6, 0x3FDD
bne a0, t6, fail
#50 sqrt(4.0) = 2.0
li a0, 0x4080
jal ra, bf16_sqrt
li t6, 0x4000
bne a0, t6, fail
#51 sqrt(9.0) = 3.0
li a0, 0x4110
jal ra, bf16_sqrt
li t6, 0x4040
bne a0, t6, fail
#52 sqrt(16.0) = 4.0
li a0, 0x4180
jal ra, bf16_sqrt
li t6, 0x4080
bne a0, t6, fail
#53 sqrt(min normal 0x0080) -> 0x2000
li a0, 0x0080
jal ra, bf16_sqrt
li t6, 0x2000
bne a0, t6, fail
#54 sqrt(max finite 0x7F7F) -> 0x5F7F
li a0, 0x7F7F
jal ra, bf16_sqrt
li t6, 0x5F7F
bne a0, t6, fail
#55 sqrt(-1.0) = NaN (canonical 0x7FC0)
li a0, 0xBF80
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
ok:
li a0, 0 # all passed
li a7, 10
ecall
fail:
li a7, 10
ecall
```
</details>
<details>
<summary>BF16 operation with sqrt v1 </summary>
```asm
.data
.align 4
clz8_lut:
.byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4
.byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
.globl clz8
clz8:
andi a0, a0, 0xFF
la t0, clz8_lut
add t0, t0, a0
lbu a0, 0(t0)
ret
.globl mul8x8_to16
mul8x8_to16:
andi a0, a0, 0xFF
andi a1, a1, 0xFF
mv t1, a0
mv t2, a1
li t0, 0
li t3, 8
mul8x8_to16_loop:
andi t4, t2, 1
beqz t4, mul8x8_to16_skip
add t0, t0, t1
mul8x8_to16_skip:
slli t1, t1, 1
srli t2, t2, 1
addi t3, t3, -1
bnez t3, mul8x8_to16_loop
mv a0, t0
ret
.globl bf16_add
bf16_add:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
li t0, 0xFF
bne s2, t0, bf16_add_chk
bnez s4, bf16_add_ret_a
bne s3, t0, bf16_add_ret_a
bnez s5, bf16_add_ret_b
bne s0, s1, bf16_add_ret_nan
j bf16_add_ret_b
bf16_add_chk:
beq s3, t0, bf16_add_ret_b
li t0, 0x7FFF
and t1, a0, t0
beq t1, x0, bf16_add_ret_b
and t1, a1, t0
beq t1, x0, bf16_add_ret_a
beq s2, x0, bf16_add_a_den_done
ori s4, s4, 0x80
bf16_add_a_den_done:
beq s3, x0, bf16_add_b_den_done
ori s5, s5, 0x80
bf16_add_b_den_done:
sub t1, s2, s3
blt x0, t1, bf16_add_grt
beq t1, x0, bf16_add_equ
mv t2, s3
li t0, -8
blt t1, t0, bf16_add_ret_b
sub t0, x0, t1
srl s4, s4, t0
j bf16_add_exp_dif
bf16_add_grt:
mv t2, s2
li t0, 8
blt t0, t1, bf16_add_ret_a
srl s5, s5, t1
j bf16_add_exp_dif
bf16_add_equ:
mv t2, s2
bf16_add_exp_dif:
bne s0, s1, bf16_add_diff_signs
mv t3, s0
add t4, s4, s5
li t0, 0x100
and t1, t4, t0
beq t1, x0, bf16_add_pack
srli t4, t4, 1
addi t2, t2, 1
li t0, 0xFF
blt t2, t0, bf16_add_pack
slli a0, t3, 15
li t5, 0x7F80
or a0, a0, t5
j bf16_add_ans
bf16_add_diff_signs:
blt s4, s5, bf16_add_gt_ma
mv t3, s0
sub t4, s4, s5
j bf16_add_norm
bf16_add_gt_ma:
mv t3, s1
sub t4, s5, s4
bf16_add_norm:
beq t4, x0, bf16_add_ret_zero
mv a0, t4
jal ra, clz8
mv t0, a0
sll t4, t4, t0
sub t2, t2, t0
blt t2, x0, bf16_add_ret_zero
beq t2, x0, bf16_add_ret_zero
j bf16_add_pack
bf16_add_ret_zero:
li a0, 0x0000
j bf16_add_ans
bf16_add_pack:
slli a0, t3, 15
slli t1, t2, 7
or a0, a0, t1
andi t4, t4, 0x7F
or a0, a0, t4
j bf16_add_ans
bf16_add_ret_b:
mv a0, a1
j bf16_add_ans
bf16_add_ret_nan:
li a0, 0x7FC0
j bf16_add_ans
bf16_add_ret_a:
j bf16_add_ans
bf16_add_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
.globl bf16_sub
bf16_sub:
addi sp, sp, -8
sw ra, 4(sp)
li t0, 0x8000
xor a1, a1, t0
jal ra, bf16_add
lw ra, 4(sp)
addi sp, sp, 8
ret
.globl bf16_div
bf16_div:
addi sp, sp, -24
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a1, 15
andi s1, s1, 1
srli s2, a0, 7
andi s2, s2, 0xFF
srli s3, a1, 7
andi s3, s3, 0xFF
andi s4, a0, 0x7F
andi s5, a1, 0x7F
xor t1, s0, s1
li t0, 0xff
bne s3, t0, bf16_div_exp_b_f
bne s5, x0, bf16_div_ret_b
bne s2, t0, bf16_div_l1
bne s4, x0, bf16_div_l1
j bf16_div_ret_nan
bf16_div_l1:
slli a0, t1, 15
j bf16_div_ans
bf16_div_exp_b_f:
bne s3, x0, bf16_div_skip
bne s5, x0, bf16_div_skip
bne s2, x0, bf16_div_skip2
beq s4, x0, bf16_div_ret_nan
bf16_div_skip2:
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j bf16_div_ans
bf16_div_skip:
bne s2, t0, bf16_div_exp_a_f
bne s4, x0, bf16_div_ret_a
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j bf16_div_ans
bf16_div_exp_a_f:
beq s2, x0, bf16_div_exp_a_is_zero
j bf16_div_l2
bf16_div_exp_a_is_zero:
beq s4, x0, bf16_div_a_is_zero_return
j bf16_div_l2
bf16_div_a_is_zero_return:
slli a0, t1, 15
j bf16_div_ans
bf16_div_l2:
beq s2, x0, bf16_div_l3
ori s4, s4, 0x80
bf16_div_l3:
beq s3, x0, bf16_div_l4
ori s5, s5, 0x80
bf16_div_l4:
slli t2, s4, 15
mv t3, s5
li t4, 0
li t5, 0
bf16_div_div_loop:
li t6, 16
bge t4, t6, bf16_div_out_loop
slli t5, t5, 1
sub t0, x0, t4
addi t0, t0, 15
sll t1, t3, t0
bltu t2, t1, bf16_div_cant_div
sub t2, t2, t1
ori t5, t5, 1
bf16_div_cant_div:
addi t4, t4, 1
j bf16_div_div_loop
bf16_div_out_loop:
sub t2, s2, s3
addi t2, t2, 127
bne s2, x0, bf16_div_l5
addi t2, t2, -1
bf16_div_l5:
bne s3, x0, bf16_div_l6
addi t2, t2, 1
bf16_div_l6:
li t0, 0x8000
and t3, t5, t0
bne t3, x0, bf16_div_set
bf16_div_norm_loop:
and t3, t5, t0
bne t3, x0, bf16_div_norm_done
li t6, 2
blt t2, t6, bf16_div_norm_done
slli t5, t5, 1
addi t2, t2, -1
j bf16_div_norm_loop
bf16_div_norm_done:
srli t5, t5, 8
j bf16_div_l7
bf16_div_set:
srli t5, t5, 8
bf16_div_l7:
andi t5, t5, 0x7F
li t0, 0xFF
bge t2, t0, bf16_div_ret_inf
blt t2, x0, bf16_div_ret_zero
beq t2, x0, bf16_div_ret_zero
slli a0, t1, 15
andi t2, t2, 0xFF
slli t2, t2, 7
or a0, a0, t2
or a0, a0, t5
j bf16_div_ans
bf16_div_ret_inf:
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j bf16_div_ans
bf16_div_ret_zero:
slli a0, t1, 15
j bf16_div_ans
bf16_div_ret_b:
mv a0, a1
j bf16_div_ans
bf16_div_ret_nan:
li a0, 0x7FC0
j bf16_div_ans
bf16_div_ret_a:
j bf16_div_ans
bf16_div_ans:
li t0, 0xFFFF
and a0, a0, t0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
addi sp, sp, 24
ret
.globl bf16_mul
bf16_mul:
addi sp, sp, -28
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw ra, 24(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a1, 15
andi s1, s1, 1
srli s2, a0, 7
andi s2, s2, 0xFF
srli s3, a1, 7
andi s3, s3, 0xFF
andi s4, a0, 0x7F
andi s5, a1, 0x7F
li t0, 0xff
xor t1, s0, s1
bne s2, t0, bf16_mul_a_exp
bne s4, x0, bf16_mul_ret_b
bne s3, x0, bf16_mul_inf1
beq s5, x0, bf16_mul_ret_nan
bf16_mul_inf1:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_a_exp:
bne s3, t0, bf16_mul_b_exp
bne s5, x0, bf16_mul_ret_b
bne s2, x0, bf16_mul_inf2
beq s4, x0, bf16_mul_ret_nan
bf16_mul_inf2:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_b_exp:
bne s2, x0, bf16_mul_skip1
beq s4, x0, bf16_mul_zero_ret
bf16_mul_skip1:
bne s3, x0, bf16_mul_skip2
bne s5, x0, bf16_mul_skip2
bf16_mul_zero_ret:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_skip2:
li t2, 0
bne s2, x0, bf16_mul_else_a
mv a0, s4
jal ra, clz8
mv t0, a0
sll s4, s4, t0
sub t2, t2, t0
li s2, 1
bf16_mul_else_a:
ori s4, s4, 0x80
bne s3, x0, bf16_mul_else_b
mv a0, s5
jal ra, clz8
mv t0, a0
sll s5, s5, t0
sub t2, t2, t0
li s3, 1
bf16_mul_else_b:
ori s5, s5, 0x80
mv a0, s4
mv a1, s5
jal ra, mul8x8_to16
mv t3, a0
xor t1, s0, s1
add t4, s2, s3
addi t4, t4, -127
add t4, t4, t2
li t5, 0x8000
and t0, t3, t5
beq t0, x0, bf16_mul_l2
srli t3, t3, 8
andi t3, t3, 0x7F
addi t4, t4, 1
j bf16_mul_mant
bf16_mul_l2:
srli t3, t3, 7
andi t3, t3, 0x7F
bf16_mul_mant:
li t0, 0xFF
blt t4, t0, bf16_mul_skip3
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j bf16_mul_ans
bf16_mul_skip3:
blt x0, t4, bf16_mul_pack
addi t0, x0, -6
blt t4, t0, bf16_mul_underflow
li t0, 1
sub t0, t0, t4
srl t3, t3, t0
li t4, 0
j bf16_mul_pack
bf16_mul_underflow:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_pack:
andi t1, t1, 1
slli t1, t1, 15
andi t4, t4, 0xFF
slli t4, t4, 7
andi t3, t3, 0x7F
or a0, t1, t4
or a0, a0, t3
li t0, 0xFFFF
and a0, a0, t0
j bf16_mul_ans
bf16_mul_ret_b:
mv a0, a1
j bf16_mul_ans
bf16_mul_ret_nan:
li a0, 0x7FC0
j bf16_mul_ans
bf16_mul_ret_a:
j bf16_mul_ans
bf16_mul_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
.globl bf16_isnan
bf16_isnan:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, bf16_isnan_false
andi t1, a0, 0x007F
beq t1, x0, bf16_isnan_false
li a0, 1
ret
bf16_isnan_false:
li a0, 0
ret
.globl bf16_isinf
bf16_isinf:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, bf16_isinf_false
andi t1, a0, 0x007F
bne t1, x0, bf16_isinf_false
li a0, 1
ret
bf16_isinf_false:
li a0, 0
ret
.globl bf16_iszero
bf16_iszero:
li t0, 0x7FFF
and t1, a0, t0
bne t1, x0, bf16_iszero_false
li a0, 1
ret
bf16_iszero_false:
li a0, 0
ret
.globl f32_to_bf16
f32_to_bf16:
srli t1, a0, 23
andi t1, t1, 0xFF
li t2, 0xFF
bne t1, t2, f32_to_bf16_L1
srli a0, a0, 16
ret
f32_to_bf16_L1:
srli t1, a0, 16
andi t1, t1, 1
add a0, a0, t1
li t3, 0x7FFF
add a0, a0, t3
srli a0, a0, 16
ret
.globl bf16_to_f32
bf16_to_f32:
slli a0, a0, 16
ret
.globl bf16_sqrt
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, bf16_sqrt_a_exp
beq s2, x0, bf16_sqrt_a_mant
j bf16_sqrt_ret_a
bf16_sqrt_a_mant:
beq s0, x0, bf16_sqrt_a_sign
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_a_sign:
j bf16_sqrt_ret_a
bf16_sqrt_a_exp:
bne s1, x0, bf16_sqrt_skip
bne s2, x0, bf16_sqrt_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_skip:
beq s0, x0, bf16_sqrt_negative_skip
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_negative_skip:
bne s1, x0, bf16_sqrt_denormals_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_denormals_skip:
addi t1, s1, -127
ori t2, s2, 0x80
andi t0, t1, 1
beq t0, x0, bf16_sqrt_else
slli t2, t2, 1
addi t0, t1, -1
srai t0, t0, 1
addi t3, t0, 127
j bf16_sqrt_end_if
bf16_sqrt_else:
srai t0, t1, 1
addi t3, t0, 127
bf16_sqrt_end_if:
li s3, 90
li s4, 255
li s5, 128
mv s6, t3
bf16_sqrt_loop:
bgtu s3, s4, bf16_sqrt_loop_done
add t0, s3, s4
srli t1, t0, 1
mv a0, t1
mv a1, t1
mv t5, t1
mv t6, t2
jal mul8x8_to16
mv t0, a0
mv t1, t5
mv t2, t6
srli t0, t0, 7
bleu t0, t2, bf16_sqrt_do_if
addi s4, t1, -1
j bf16_sqrt_loop
bf16_sqrt_do_if:
mv s5, t1
addi s3, t1, 1
j bf16_sqrt_loop
bf16_sqrt_loop_done:
mv t3, s6
li t0, 256
bltu s5, t0, bf16_sqrt_l3
srli s5, s5, 1
addi t3, t3, 1
j bf16_sqrt_l3
bf16_sqrt_l3:
andi t5, s5, 0x7F
li t0, 0xFF
blt t3, t0, bf16_sqrt_no_overflow
li a0, 0x7F80
j bf16_sqrt_ans
bf16_sqrt_no_overflow:
bgt t3, x0, bf16_sqrt_no_underflow
li a0, 0
j bf16_sqrt_ans
bf16_sqrt_no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j bf16_sqrt_ans
bf16_sqrt_ret_a:
j bf16_sqrt_ans
bf16_sqrt_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
</details>
<details>
<summary>BF16 operation with sqrt v2 </summary>
```asm
.data
.align 4
clz8_lut:
.byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4
.byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
.globl clz8
clz8:
andi a0, a0, 0xFF
la t0, clz8_lut
add t0, t0, a0
lbu a0, 0(t0)
ret
.globl mul8x8_to16
mul8x8_to16:
andi a0, a0, 0xFF
andi a1, a1, 0xFF
mv t1, a0
mv t2, a1
li t0, 0
li t3, 8
mul8x8_to16_loop:
andi t4, t2, 1
beqz t4, mul8x8_to16_skip
add t0, t0, t1
mul8x8_to16_skip:
slli t1, t1, 1
srli t2, t2, 1
addi t3, t3, -1
bnez t3, mul8x8_to16_loop
mv a0, t0
ret
.globl bf16_add
bf16_add:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
li t0, 0xFF
bne s2, t0, bf16_add_chk
bnez s4, bf16_add_ret_a
bne s3, t0, bf16_add_ret_a
bnez s5, bf16_add_ret_b
bne s0, s1, bf16_add_ret_nan
j bf16_add_ret_b
bf16_add_chk:
beq s3, t0, bf16_add_ret_b
li t0, 0x7FFF
and t1, a0, t0
beq t1, x0, bf16_add_ret_b
and t1, a1, t0
beq t1, x0, bf16_add_ret_a
beq s2, x0, bf16_add_a_den_done
ori s4, s4, 0x80
bf16_add_a_den_done:
beq s3, x0, bf16_add_b_den_done
ori s5, s5, 0x80
bf16_add_b_den_done:
sub t1, s2, s3
blt x0, t1, bf16_add_grt
beq t1, x0, bf16_add_equ
mv t2, s3
li t0, -8
blt t1, t0, bf16_add_ret_b
sub t0, x0, t1
srl s4, s4, t0
j bf16_add_exp_dif
bf16_add_grt:
mv t2, s2
li t0, 8
blt t0, t1, bf16_add_ret_a
srl s5, s5, t1
j bf16_add_exp_dif
bf16_add_equ:
mv t2, s2
bf16_add_exp_dif:
bne s0, s1, bf16_add_diff_signs
mv t3, s0
add t4, s4, s5
li t0, 0x100
and t1, t4, t0
beq t1, x0, bf16_add_pack
srli t4, t4, 1
addi t2, t2, 1
li t0, 0xFF
blt t2, t0, bf16_add_pack
slli a0, t3, 15
li t5, 0x7F80
or a0, a0, t5
j bf16_add_ans
bf16_add_diff_signs:
blt s4, s5, bf16_add_gt_ma
mv t3, s0
sub t4, s4, s5
j bf16_add_norm
bf16_add_gt_ma:
mv t3, s1
sub t4, s5, s4
bf16_add_norm:
beq t4, x0, bf16_add_ret_zero
mv a0, t4
jal ra, clz8
mv t0, a0
sll t4, t4, t0
sub t2, t2, t0
blt t2, x0, bf16_add_ret_zero
beq t2, x0, bf16_add_ret_zero
j bf16_add_pack
bf16_add_ret_zero:
li a0, 0x0000
j bf16_add_ans
bf16_add_pack:
slli a0, t3, 15
slli t1, t2, 7
or a0, a0, t1
andi t4, t4, 0x7F
or a0, a0, t4
j bf16_add_ans
bf16_add_ret_b:
mv a0, a1
j bf16_add_ans
bf16_add_ret_nan:
li a0, 0x7FC0
j bf16_add_ans
bf16_add_ret_a:
j bf16_add_ans
bf16_add_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
.globl bf16_sub
bf16_sub:
addi sp, sp, -8
sw ra, 4(sp)
li t0, 0x8000
xor a1, a1, t0
jal ra, bf16_add
lw ra, 4(sp)
addi sp, sp, 8
ret
.globl bf16_div
bf16_div:
addi sp, sp, -24
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a1, 15
andi s1, s1, 1
srli s2, a0, 7
andi s2, s2, 0xFF
srli s3, a1, 7
andi s3, s3, 0xFF
andi s4, a0, 0x7F
andi s5, a1, 0x7F
xor t1, s0, s1
li t0, 0xff
bne s3, t0, bf16_div_exp_b_f
bne s5, x0, bf16_div_ret_b
bne s2, t0, bf16_div_l1
bne s4, x0, bf16_div_l1
j bf16_div_ret_nan
bf16_div_l1:
slli a0, t1, 15
j bf16_div_ans
bf16_div_exp_b_f:
bne s3, x0, bf16_div_skip
bne s5, x0, bf16_div_skip
bne s2, x0, bf16_div_skip2
beq s4, x0, bf16_div_ret_nan
bf16_div_skip2:
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j bf16_div_ans
bf16_div_skip:
bne s2, t0, bf16_div_exp_a_f
bne s4, x0, bf16_div_ret_a
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j bf16_div_ans
bf16_div_exp_a_f:
beq s2, x0, bf16_div_exp_a_is_zero
j bf16_div_l2
bf16_div_exp_a_is_zero:
beq s4, x0, bf16_div_a_is_zero_return
j bf16_div_l2
bf16_div_a_is_zero_return:
slli a0, t1, 15
j bf16_div_ans
bf16_div_l2:
beq s2, x0, bf16_div_l3
ori s4, s4, 0x80
bf16_div_l3:
beq s3, x0, bf16_div_l4
ori s5, s5, 0x80
bf16_div_l4:
slli t2, s4, 15
mv t3, s5
li t4, 0
li t5, 0
bf16_div_div_loop:
li t6, 16
bge t4, t6, bf16_div_out_loop
slli t5, t5, 1
sub t0, x0, t4
addi t0, t0, 15
sll t1, t3, t0
bltu t2, t1, bf16_div_cant_div
sub t2, t2, t1
ori t5, t5, 1
bf16_div_cant_div:
addi t4, t4, 1
j bf16_div_div_loop
bf16_div_out_loop:
sub t2, s2, s3
addi t2, t2, 127
bne s2, x0, bf16_div_l5
addi t2, t2, -1
bf16_div_l5:
bne s3, x0, bf16_div_l6
addi t2, t2, 1
bf16_div_l6:
li t0, 0x8000
and t3, t5, t0
bne t3, x0, bf16_div_set
bf16_div_norm_loop:
and t3, t5, t0
bne t3, x0, bf16_div_norm_done
li t6, 2
blt t2, t6, bf16_div_norm_done
slli t5, t5, 1
addi t2, t2, -1
j bf16_div_norm_loop
bf16_div_norm_done:
srli t5, t5, 8
j bf16_div_l7
bf16_div_set:
srli t5, t5, 8
bf16_div_l7:
andi t5, t5, 0x7F
li t0, 0xFF
bge t2, t0, bf16_div_ret_inf
blt t2, x0, bf16_div_ret_zero
beq t2, x0, bf16_div_ret_zero
slli a0, t1, 15
andi t2, t2, 0xFF
slli t2, t2, 7
or a0, a0, t2
or a0, a0, t5
j bf16_div_ans
bf16_div_ret_inf:
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j bf16_div_ans
bf16_div_ret_zero:
slli a0, t1, 15
j bf16_div_ans
bf16_div_ret_b:
mv a0, a1
j bf16_div_ans
bf16_div_ret_nan:
li a0, 0x7FC0
j bf16_div_ans
bf16_div_ret_a:
j bf16_div_ans
bf16_div_ans:
li t0, 0xFFFF
and a0, a0, t0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
addi sp, sp, 24
ret
.globl bf16_mul
bf16_mul:
addi sp, sp, -28
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw ra, 24(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a1, 15
andi s1, s1, 1
srli s2, a0, 7
andi s2, s2, 0xFF
srli s3, a1, 7
andi s3, s3, 0xFF
andi s4, a0, 0x7F
andi s5, a1, 0x7F
li t0, 0xff
xor t1, s0, s1
bne s2, t0, bf16_mul_a_exp
bne s4, x0, bf16_mul_ret_b
bne s3, x0, bf16_mul_inf1
beq s5, x0, bf16_mul_ret_nan
bf16_mul_inf1:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_a_exp:
bne s3, t0, bf16_mul_b_exp
bne s5, x0, bf16_mul_ret_b
bne s2, x0, bf16_mul_inf2
beq s4, x0, bf16_mul_ret_nan
bf16_mul_inf2:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_b_exp:
bne s2, x0, bf16_mul_skip1
beq s4, x0, bf16_mul_zero_ret
bf16_mul_skip1:
bne s3, x0, bf16_mul_skip2
bne s5, x0, bf16_mul_skip2
bf16_mul_zero_ret:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_skip2:
li t2, 0
bne s2, x0, bf16_mul_else_a
mv a0, s4
jal ra, clz8
mv t0, a0
sll s4, s4, t0
sub t2, t2, t0
li s2, 1
bf16_mul_else_a:
ori s4, s4, 0x80
bne s3, x0, bf16_mul_else_b
mv a0, s5
jal ra, clz8
mv t0, a0
sll s5, s5, t0
sub t2, t2, t0
li s3, 1
bf16_mul_else_b:
ori s5, s5, 0x80
mv a0, s4
mv a1, s5
jal ra, mul8x8_to16
mv t3, a0
xor t1, s0, s1
add t4, s2, s3
addi t4, t4, -127
add t4, t4, t2
li t5, 0x8000
and t0, t3, t5
beq t0, x0, bf16_mul_l2
srli t3, t3, 8
andi t3, t3, 0x7F
addi t4, t4, 1
j bf16_mul_mant
bf16_mul_l2:
srli t3, t3, 7
andi t3, t3, 0x7F
bf16_mul_mant:
li t0, 0xFF
blt t4, t0, bf16_mul_skip3
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j bf16_mul_ans
bf16_mul_skip3:
blt x0, t4, bf16_mul_pack
addi t0, x0, -6
blt t4, t0, bf16_mul_underflow
li t0, 1
sub t0, t0, t4
srl t3, t3, t0
li t4, 0
j bf16_mul_pack
bf16_mul_underflow:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_pack:
andi t1, t1, 1
slli t1, t1, 15
andi t4, t4, 0xFF
slli t4, t4, 7
andi t3, t3, 0x7F
or a0, t1, t4
or a0, a0, t3
li t0, 0xFFFF
and a0, a0, t0
j bf16_mul_ans
bf16_mul_ret_b:
mv a0, a1
j bf16_mul_ans
bf16_mul_ret_nan:
li a0, 0x7FC0
j bf16_mul_ans
bf16_mul_ret_a:
j bf16_mul_ans
bf16_mul_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
.globl bf16_isnan
bf16_isnan:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, bf16_isnan_false
andi t1, a0, 0x007F
beq t1, x0, bf16_isnan_false
li a0, 1
ret
bf16_isnan_false:
li a0, 0
ret
.globl bf16_isinf
bf16_isinf:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, bf16_isinf_false
andi t1, a0, 0x007F
bne t1, x0, bf16_isinf_false
li a0, 1
ret
bf16_isinf_false:
li a0, 0
ret
.globl bf16_iszero
bf16_iszero:
li t0, 0x7FFF
and t1, a0, t0
bne t1, x0, bf16_iszero_false
li a0, 1
ret
bf16_iszero_false:
li a0, 0
ret
.globl f32_to_bf16
f32_to_bf16:
srli t1, a0, 23
andi t1, t1, 0xFF
li t2, 0xFF
bne t1, t2, f32_to_bf16_L1
srli a0, a0, 16
ret
f32_to_bf16_L1:
srli t1, a0, 16
andi t1, t1, 1
add a0, a0, t1
li t3, 0x7FFF
add a0, a0, t3
srli a0, a0, 16
ret
.globl bf16_to_f32
bf16_to_f32:
slli a0, a0, 16
ret
.globl isqrt16_pow4
isqrt16_pow4:
li t0, 0
li t1, 16384
isqrt16_pow4_loop:
beqz t1, isqrt16_pow4_done
add t2, t0, t1
bgeu a0, t2, isqrt16_pow4_ge
srli t0, t0, 1
srli t1, t1, 2
j isqrt16_pow4_loop
isqrt16_pow4_ge:
sub a0, a0, t2
srli t0, t0, 1
add t0, t0, t1
srli t1, t1, 2
j isqrt16_pow4_loop
isqrt16_pow4_done:
mv a0, t0
ret
.globl bf16_sqrt
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, bf16_sqrt_a_exp
beq s2, x0, bf16_sqrt_a_mant
j bf16_sqrt_ret_a
bf16_sqrt_a_mant:
beq s0, x0, bf16_sqrt_a_sign
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_a_sign:
j bf16_sqrt_ret_a
bf16_sqrt_a_exp:
bne s1, x0, bf16_sqrt_skip
bne s2, x0, bf16_sqrt_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_skip:
beq s0, x0, bf16_sqrt_negative_skip
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_negative_skip:
bne s1, x0, bf16_sqrt_denormals_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_denormals_skip:
addi t1, s1, -127
ori t2, s2, 0x80
andi t0, t1, 1
beq t0, x0, bf16_sqrt_else
slli t2, t2, 1
addi t0, t1, -1
srai t0, t0, 1
addi t3, t0, 127
j bf16_sqrt_end_if
bf16_sqrt_else:
srai t0, t1, 1
addi t3, t0, 127
bf16_sqrt_end_if:
mv s6, t3
slli a0, t2, 7
addi a0, a0, 127
jal isqrt16_pow4
mv s5, a0
mv t3, s6
j bf16_sqrt_l3
bf16_sqrt_l3:
andi t5, s5, 0x7F
li t0, 0xFF
blt t3, t0, bf16_sqrt_no_overflow
li a0, 0x7F80
j bf16_sqrt_ans
bf16_sqrt_no_overflow:
bgt t3, x0, bf16_sqrt_no_underflow
li a0, 0
j bf16_sqrt_ans
bf16_sqrt_no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j bf16_sqrt_ans
bf16_sqrt_ret_a:
j bf16_sqrt_ans
bf16_sqrt_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
</details>
- Code size (excluding labels and blank lines)
    - bf16 operations with sqrt v1: 526 lines
    - bf16 operations with sqrt v2: 521 lines
    - compiler-generated: 1007 lines
- Cycle count
    - bf16 operations with sqrt v1

    - bf16 operations with sqrt v2

    - RISC-V (32-bits) gcc

### Improvement
- My `bf16-v2` assembly runs about 78.4% faster than the compiler-generated code.
- My `bf16-v2` code is about 48.26% smaller than the compiler-generated code.

The code-size difference between the two versions of the sqrt routine is about 30 lines. Since the redundant multiply loop is no longer needed, v2 runs at nearly 256% of v1's speed.
### Use Case
[LeetCode 688. Knight Probability in Chessboard](https://leetcode.com/problems/knight-probability-in-chessboard/description/)
Many LeetCode dynamic-programming solutions allocate a 2D table `dp[m][n]`. When `m` and `n` are large, the DP table becomes memory-bound due to high bandwidth demand. In our use case, we store `dp` entries in bfloat16 instead of 32-bit integers/floats, which immediately halves the memory requirement. Arithmetic and normalization are implemented in RV32I using our bf16 add/sub/mul/div routines, as sketched below.
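A minimal sketch of one DP relaxation step under this scheme (`N`, the move table, and the constants here are illustrative assumptions; the bf16 routines are the ones implemented above):
```c
#include <stdint.h>

typedef uint16_t bf16_t;
bf16_t bf16_add(bf16_t a, bf16_t b); /* routines from Part 1 */
bf16_t bf16_mul(bf16_t a, bf16_t b);

#define N 25
static const int moves[8][2] = {
    {1, 2}, {2, 1}, {2, -1}, {1, -2}, {-1, -2}, {-2, -1}, {-2, 1}, {-1, 2}
};

/* dp_new[r][c] = sum over the 8 knight moves of dp_old[pr][pc] * 1/8,
   with every entry kept as a 2-byte bf16 instead of a 4-byte float. */
void knight_step(bf16_t dp_new[N][N], const bf16_t dp_old[N][N])
{
    const bf16_t eighth = 0x3E00; /* 0.125 in bf16 */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            bf16_t acc = 0x0000; /* +0.0 */
            for (int k = 0; k < 8; k++) {
                int pr = r + moves[k][0], pc = c + moves[k][1];
                if (pr < 0 || pr >= N || pc < 0 || pc >= N)
                    continue;
                acc = bf16_add(acc, bf16_mul(dp_old[pr][pc], eighth));
            }
            dp_new[r][c] = acc;
        }
}
```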
## Explanations for both program functionality and instruction-level operations using the Ripes simulator
Here I use `uf8_decode` assembly from problem b to demonstrate.
```asm
.text
.globl main
main:
li a0, 0x7F
jal ra, uf8_decode
li a7, 10
ecall
uf8_decode:
andi t0, a0, 0x0F # mantissa = fl & 0x0F
srli t1, a0, 4 # exponent = fl >> 4
li t2, 1
sll t2, t2, t1 # t2 = 1 << e
addi t2, t2, -1 # t2 = (1<<e) - 1
slli t2, t2, 4 # offset = (2^e -1)*16
sll t0, t0, t1 # mantissa << e
add a0, t0, t2 # return
ret
```
Since `li t2, 1` is a pseudo-instruction, the assembler expands it to the real RV32I instruction `addi t2, x0, 1`.
In RISC-V, register names like `t0` and `s0` are ABI aliases for the physical registers `x0`–`x31`. For example, `t0` maps to `x5`, and `s0` maps to `x8`.
### Program Functionality
Now let's step through the pipeline with a few instructions.

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 0 (low), since there's no jump instruction in the pipeline EX stage and the `Branch taken` signal is low
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `addi x7, x7, -1` is in the WB stage and needs to be written back to `x7`
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from Reg file or forwarding path
- ForwardA: no forwarding (no data dependency with previous instruction)
- ForwardB: no forwarding (no data dependency with previous instruction)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: from ALU result, since the instruction in WB stage is not link/return instruction (PC+4) or load word instruction (Memory Data).

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 0 (low), since there's no jump instruction in the pipeline EX stage and the `Branch taken` signal is low
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `slli x7, x7, 4` is in the WB stage and needs to be written back to `x7`
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from Reg file or forwarding path
- ForwardA: from the EX/MEM pipeline register (`x5`)
- ForwardB: from the MEM/WB pipeline register (`x7`)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: from ALU result, since the instruction in WB stage is not link/return instruction (PC+4) or load word instruction (Memory Data).

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 1 (high), since there is a jump instruction in the pipeline EX stage, so the next PC won't be PC+4 as before.
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `sll x5, x5, x6` is in the WB stage and needs to be written back to `x5`
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from immediate
- ForwardA: no forwarding (no data dependency with previous instruction)
- ForwardB: no forwarding (no data dependency with previous instruction)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: from ALU result, since the instruction in WB stage is not link/return instruction (PC+4) or load word instruction (Memory Data).
When `jalr` enters the EX stage, `IF/ID clear` and `ID/EX clear` go high to flush the pipeline. Because a jump is unconditional, any instructions behind it are invalid and must be removed. As a result, we get a two-cycle penalty.

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 0 (low), since there's no jump instruction in the pipeline EX stage and the `Branch taken` signal is low
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `add x10, x5, x7` is in the WB stage and needs to be written back to `x10`
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from Reg file or forwarding path
- ForwardA: no forwarding (no data dependency with previous instruction)
- ForwardB: no forwarding (no data dependency with previous instruction)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: from ALU result, since the instruction in WB stage is not link/return instruction (PC+4) or load word instruction (Memory Data).
As mentioned above, `IF/ID clear` and `ID/EX clear` went high in the previous cycle, so the two younger instructions in the IF and ID stages are flushed and replaced with NOPs in this cycle.
Another thing to note is that the final instruction of `uf8_decode`, `add x10, x5, x7`, is now in the WB stage, so the decoded result is written back to `a0` (`x10`) in this cycle.

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 0 (low), since there's no jump instruction in the pipeline EX stage and the `Branch taken` signal is low
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `jalr x0, x1, 0` is in the WB stage. Its destination is `x0`, so the written value is discarded.
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from Reg file or forwarding path
- ForwardA: no forwarding (no data dependency with previous instruction)
- ForwardB: no forwarding (no data dependency with previous instruction)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: PC+4, since the instruction in the WB stage is a link/return instruction.
We first look at the value of register `a0`. As mentioned above, the final decode result has been stored in `a0`; we validate that it is correct below.

uf8_decode.c
```clike=
uint32_t uf8_decode(uf8 fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint8_t exponent = fl >> 4;
    uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}
```
Our input is `0x7F`:
- mantissa: m = a0 & 0x0F = 0xF
- exponent: e = a0 >> 4 = 0x7 = 7
- offset: o = (0x7FFF >> (15 - 7)) << 4 = 0x7F << 4 = 0x7F0
- return: (0xF << 7) + 0x7F0 = 0x780 + 0x7F0 = `0xF70`

This matches the result from the Ripes simulator exactly, so we can conclude that the program functionality is correct!