---
tags: CA2025
---
# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by [<wilson0828>](https://github.com/wilson0828)
>Refer to [Assignment1](https://hackmd.io/@sysprog/2025-arch-homework1)
## Problem B
Refer to [Quiz1 of Computer Architecture (2025 Fall) Problem `B`](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
### clz
#### C code
```c
static inline unsigned clz(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}
```
#### Assembly implementation
- Goal: Compute the number of leading zeros of a 32-bit unsigned integer.
- Algorithm: Binary-search style with shifts `c=16,8,4,2,1`; if `y=x>>c` is nonzero then `n-=c, x=y`; loop while `c!=0`.
- Works for all x, including x = 0 (which returns 32).

Since `c` is initialized to 16 (non-zero), switching from do–while to `while (c)` preserves behavior and simplifies the branch structure in RV32I.
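In C, the equivalent while-loop form looks like this (a sketch; behavior is identical since `c` starts non-zero, so the body runs at least once either way). The assembly below implements this form:
```c
static inline unsigned clz_while(uint32_t x)
{
    int n = 32, c = 16;
    while (c) {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    }
    return n - x;
}
```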
```asm
clz:
li t0, 32 # n = 32
li t1, 16 # c = 16
Lwhile:
beq t1, x0, ans
srl t2, a0, t1
beq t2, x0, skip
sub t0, t0, t1
mv a0, t2
skip:
srli t1, t1, 1
j Lwhile
ans:
sub a0, t0, a0
ret
```
### uf8_decode
#### C code
```c
uint32_t uf8_decode(uf8 fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint8_t exponent = fl >> 4;
    uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}
```
#### Assembly implementation
Rather than a line-by-line C translation, I use the closed-form offset `offset(e) = (2^e - 1) * 16`, so decode is simply `(m << e) + offset`. As a result, we get a much simpler version of `uf8_decode`.
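In C terms, the simplification looks like this (a sketch; the closed form replaces the `0x7FFF` shift trick and is equivalent, since `(0x7FFF >> (15 - e)) << 4 = ((2^e - 1)) << 4`):
```c
#include <stdint.h>

uint32_t uf8_decode_closed(uint8_t fl)
{
    uint32_t m = fl & 0x0F;
    uint32_t e = fl >> 4;
    uint32_t offset = ((1u << e) - 1) << 4; /* (2^e - 1) * 16 */
    return (m << e) + offset;
}
```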
```asm
uf8_decode:
andi t0, a0, 0x0F # mantissa = fl & 0x0F
srli t1, a0, 4 # exponent = fl >> 4
li t2, 1
sll t2, t2, t1 # t2 = 1 << e
addi t2, t2, -1 # t2 = (1<<e) - 1
slli t2, t2, 4 # offset = (2^e -1)*16
sll t0, t0, t1 # mantissa << e
add a0, t0, t2 # return
ret
```
### uf8_encode
#### Assembly implementation
Function signature
- Param: a0 = value (32-bit unsigned).
- Return: a0 = uf8 (4-bit exponent | 4-bit mantissa).
Caller/callee saves
- Saves ra on stack before jal clz and restores after.
- Uses only t temporaries for working regs.
Temporaries usage
- t1 → msb (31 - clz(value)).
- t2 → exponent e.
- t3 → overflow = offset(e).
- t4 → next_overflow candidate.
- t5 → scratch to build next_overflow.
- t0 → general scratch for compares.
- a1 → holds clz(value) result temporarily after clz.
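For reference, here is a C sketch of the control flow the assembly below implements (reconstructed from the register plan above for illustration; it is not from the original source):
```c
#include <stdint.h>

/* clz is the routine from the previous section */
unsigned uf8_encode(uint32_t value)
{
    if (value < 16)
        return value;                        /* e = 0, mantissa = value */
    int msb = 31 - clz(value);
    int e = 0;
    uint32_t overflow = 0;
    if (msb >= 5) {
        e = msb - 4;
        if (e > 15)
            e = 15;
        overflow = ((1u << e) - 1) << 4;     /* offset(e) */
        while (e != 0 && value < overflow) { /* walk e down */
            overflow = (overflow - 16) >> 1;
            e--;
        }
    }
    while (e < 15) {                         /* walk e up */
        uint32_t next = (overflow << 1) + 16;
        if (value < next)
            break;
        overflow = next;
        e++;
    }
    return (e << 4) | ((value - overflow) >> e);
}
```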
<details>
<summary>uf8_encode</summary>
```asm
uf8_encode:
slti t0, a0, 16
bne t0, x0, if1
addi sp, sp, -8
sw a0, 0(sp)
sw ra, 4(sp)
jal clz
mv a1, a0 # leading zero
lw a0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
li t0, 31
sub t1, t0, a1 # msb
li t2, 0 # exponent
li t3, 0 # overflow = offset
slti t0, t1, 5
bne t0, x0, if2
addi t2, t1, -4
slti t0, t2, 15
bne t0, x0, if3
li t2, 15
if3:
li t0, 1
sll t0, t0, t2 # t0 = 1 << e
addi t0, t0, -1 # t0 = (1<<e) - 1
slli t3, t0, 4 # overflow = t3 = (2^e -1)*16
wloop:
beq t2, x0, if2
bgeu a0, t3, if2
addi t3, t3, -16 # overflow = overflow - 16
srli t3, t3, 1
addi t2, t2, -1
j wloop
if2:
slti t0, t2, 15
beq t0, x0, wdone
slli t5, t3, 1
addi t5, t5, 16
mv t4, t5 # next_overflow = (overflow << 1) + 16;
sltu t0, a0, t4
bne t0, x0, wdone
mv t3, t4
addi t2, t2, 1
j if2
wdone:
sub a0, a0, t3 # a0 = value - overflow
srl a0, a0, t2 # a0 >>= exponent
slli t2, t2, 4 # t2 = exponent << 4
or a0, t2, a0 # a0 = (e<<4) | mantissa
if1:
ret
```
</details>
### Validation
Here I used [Compiler Explorer](https://godbolt.org/) with RISC-V (32-bits) gcc (trunk) and with flag -O2 to generate the following assembly.
<details>
<summary>Compiler generated code</summary>
```asm
uf8_decode:
srli a3,a0,4
li a4,15
li a5,32768
sub a4,a4,a3
addi a5,a5,-1
sra a5,a5,a4
andi a0,a0,15
sll a0,a0,a3
slli a5,a5,4
add a0,a5,a0
ret
uf8_encode:
li a5,15
bleu a0,a5,L25
mv a1,a0
li a5,5
li a4,16
li a2,32
L4:
srl a3,a1,a4
addi a5,a5,-1
beq a3,zero,L6
sub a2,a2,a4
mv a1,a3
L6:
srai a4,a4,1
bne a5,zero,L4
sub a2,a2,a1
li a4,26
bgt a2,a4,L16
li a4,31
sub a4,a4,a2
andi a4,a4,0xff
li a3,4
beq a4,a3,L16
addi a4,a4,-4
andi a2,a4,0xff
li a3,15
bgtu a2,a3,L26
L10:
andi a4,a4,0xff
li a3,0
L11:
addi a3,a3,1
slli a5,a5,1
andi a3,a3,0xff
addi a5,a5,16
bgtu a4,a3,L11
bgeu a0,a5,L12
L13:
srli a5,a5,1
addi a4,a4,-1
addi a5,a5,-8
andi a4,a4,0xff
snez a3,a4
sltu a2,a0,a5
and a3,a3,a2
bne a3,zero,L13
L12:
li a3,14
bgtu a4,a3,L24
L8:
li a1,15
j L14
L27:
andi a4,a2,0xff
beq a4,a1,L24
L14:
mv a3,a5
slli a5,a5,1
addi a5,a5,16
addi a2,a4,1
bgeu a0,a5,L27
sub a0,a0,a3
srl a0,a0,a4
slli a4,a4,4
or a0,a0,a4
L25:
andi a0,a0,0xff
ret
L16:
li a4,0
j L8
L24:
mv a3,a5
sub a0,a0,a3
srl a0,a0,a4
slli a4,a4,4
or a0,a0,a4
j L25
L26:
mv a4,a3
j L10
```
</details>
For validation, I used six test cases each for `encode (uint32 → uf8)` and `decode (uf8 → uint32)`, chosen to hit zero, normal ranges, and the tricky exponent boundaries. I then ran both the compiler-generated code and my hand-written version on the Ripes simulator and compared the results.
Encode tests (uint32 → uf8 byte): 0, 15, 16, 46, 524272, 1015792
Decode tests (uf8 byte → uint32): 0x00, 0x0F, 0x10, 0x1F, 0xF0, 0xFF
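Incidentally, the encode inputs are exactly the decoded values of the six uf8 bytes, so the two lists form a round trip. A quick host-side check in C (a sketch; it uses the reference `uf8_decode` above and the illustrative `uf8_encode` sketch from the encode section):
```c
#include <assert.h>
#include <stdint.h>

uint32_t uf8_decode(uint8_t fl);
unsigned uf8_encode(uint32_t value);

int main(void)
{
    const uint8_t bytes[] = {0x00, 0x0F, 0x10, 0x1F, 0xF0, 0xFF};
    for (int i = 0; i < 6; i++) {
        /* decoded values: 0, 15, 16, 46, 524272, 1015792 */
        uint32_t v = uf8_decode(bytes[i]);
        assert(uf8_encode(v) == bytes[i]); /* exact round trip */
    }
    return 0;
}
```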
#### Console

### Analysis
- Cycle count

| mine | RISC-V (32-bits) gcc |
|:--:|:--:|
- Code size (real instructions only)
    - mine: 65 lines
    - compiler-generated: 79 lines
### Improvement
- My assembly runs about 13.5% faster than the compiler-generated code.
- My code is about 17.7% smaller than the compiler-generated code.
## Problem C
Refer to [Quiz1 of Computer Architecture (2025 Fall) Problem `C`](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
### Part 1
In this section, I implement all bf16 operations except `bf16_sqrt`.
#### Assembly Implementation
#### bf16_add
In the add path, when the two operands have opposite signs, large cancellation can occur. That knocks the leading 1 out of the mantissa, so we have to renormalize. I switched to a LUT-based `clz` to grab the leading-zero count in one shot, then shift the mantissa left and subtract that amount from the exponent. This is simple and also avoids a slow loop. In addition, I swapped `if (!exp && !mant)` for `if ((bits & 0x7FFF) == 0)` as a fast zero check; it catches both +0 and −0.
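A small C illustration of the two tweaks (the names here are mine, not from the original source):
```c
#include <stdint.h>

extern const uint8_t clz8_lut[256]; /* the table in the listing below */

/* Renormalize after cancellation: one LUT lookup instead of a
   bit-by-bit loop. */
static inline void renorm(uint16_t *mant, int16_t *exp)
{
    int sh = clz8_lut[*mant & 0xFF];
    *mant <<= sh;
    *exp -= sh;
}

/* Fast zero check: masking off the sign bit catches +0 and -0
   in a single compare. */
static inline int bf16_is_zero_bits(uint16_t bits)
{
    return (bits & 0x7FFF) == 0;
}
```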
<details>
<summary>bf16_add</summary>
```asm
.data
.align 4
clz8_lut:
.byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4
.byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
clz8:
andi a0, a0, 0xFF
la t0, clz8_lut
add t0, t0, a0
lbu a0, 0(t0)
ret
bf16_add:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
li t0, 0xFF
bne s2, t0, chk
bnez s4, ret_a
bne s3, t0, ret_a
bnez s5, ret_b
bne s0, s1, ret_nan
j ret_b
chk:
beq s3, t0, ret_b
li t0, 0x7FFF
and t1, a0, t0
beq t1, x0, ret_b
and t1, a1, t0
beq t1, x0, ret_a
beq s2, x0, l1
ori s4, s4, 0x80
l1:
beq s3, x0, l2
ori s5, s5, 0x80
l2:
sub t1, s2, s3 # t1 = exp_diff
blt x0, t1, grt
beq t1, x0, equ
mv t2, s3 # t2 = result_exp
li t0, -8
blt t1, t0, ret_b
sub t0, x0, t1
srl s4, s4, t0
j exp_dif
grt:
mv t2, s2
li t0, 8
blt t0, t1, ret_a
srl s5, s5, t1
j exp_dif
equ:
mv t2, s2
exp_dif:
bne s0, s1, diff_signs
mv t3, s0 # t3 = result_sign
add t4, s4, s5 # t4 = rm
li t0, 0x100
and t1, t4, t0
beq t1, x0, pack
srli t4, t4, 1
addi t2, t2, 1
li t0, 0xFF
blt t2, t0, pack
slli a0, t3, 15
li t5, 0x7F80
or a0, a0, t5
j ans
diff_signs:
blt s4, s5, gt_ma
mv t3, s0 # result_sign = sign_a
sub t4, s4, s5
j l3
gt_ma:
mv t3, s1 # result_sign = sign_b
sub t4, s5, s4
l3:
beq t4, x0, ret_zero
mv a0, t4
jal ra, clz8 # a0 = clz8(rm)
mv t0, a0
sll t4, t4, t0 # rm <<= sh
sub t2, t2, t0 # result_exp -= sh
blt t2, x0, ret_zero
beq t2, x0, ret_zero
j pack
ret_zero:
li a0, 0x0000 # BF16_ZERO()
j ans
pack:
slli a0, t3, 15
slli t1, t2, 7
or a0, a0, t1
andi t4, t4, 0x7F
or a0, a0, t4
j ans
ret_b:
mv a0, a1
j ans
ret_nan:
li a0, 0x7FC0
j ans
ret_a:
j ans
ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
```
</details>
#### bf16_sub
Instead of implementing a brand-new `bf16_sub`, we simply flip the sign bit of `b` and call the `bf16_add` above.
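In C terms (a minimal sketch, treating bf16 values as raw 16-bit patterns):
```c
#include <stdint.h>

uint16_t bf16_add(uint16_t a, uint16_t b); /* the routine above */

/* Subtraction is addition with b's sign bit (bit 15) flipped. */
static inline uint16_t bf16_sub(uint16_t a, uint16_t b)
{
    return bf16_add(a, b ^ 0x8000);
}
```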
<details>
<summary>bf16_sub</summary>
```asm
bf16_sub:
addi sp, sp, -8
sw ra, 4(sp)
li t0, 0x8000
xor a1, a1, t0
jal ra, bf16_add
lw ra, 4(sp)
addi sp, sp, 8
ret
```
</details>
#### bf16_div
<details>
<summary>bf16_div</summary>
```asm
bf16_div:
addi sp, sp, -24
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
xor t1, s0, s1 # t1 = result_sign
li t0, 0xff
bne s3, t0, exp_b_f
bne s5, x0, ret_b
bne s2, t0, l1
bne s4, x0, l1
j ret_nan
l1:
slli a0, t1, 15
j ans
exp_b_f:
bne s3, x0, skip
bne s5, x0, skip
bne s2, x0, skip2
beq s4, x0, ret_nan
skip2:
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j ans
skip:
bne s2, t0, exp_a_f
bne s4, x0, ret_a
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j ans
exp_a_f:
beq s2, x0, exp_a_is_zero
j l2
exp_a_is_zero:
beq s4, x0, a_is_zero_return
j l2
a_is_zero_return:
slli a0, t1, 15
j ans
l2:
beq s2, x0, l3
ori s4, s4, 0x80
l3:
beq s3, x0, l4
ori s5, s5, 0x80
l4:
slli t2, s4, 15 # t2 = dividend
mv t3, s5 # t3 = divisor
li t4, 0 # t4 = counter
li t5, 0 # t5 = quotient
div_loop:
li t6, 16
bge t4, t6, out_loop
slli t5, t5, 1
sub t0, x0, t4 # t0 = -i
addi t0, t0, 15
sll t1, t3, t0
bltu t2, t1, cant_div
sub t2, t2, t1
ori t5, t5, 1
cant_div:
addi t4, t4, 1
j div_loop
out_loop:
sub t2, s2, s3
addi t2, t2, 127 # t2 = result_exp
bne s2, x0, l5
addi t2, t2, -1
l5:
bne s3, x0, l6
addi t2, t2, 1
l6:
li t0, 0x8000
and t3, t5, t0
bne t3, x0, set
norm_loop:
and t3, t5, t0
bne t3, x0, norm_done
li t6, 2
blt t2, t6, norm_done
slli t5, t5, 1
addi t2, t2, -1
j norm_loop
norm_done:
srli t5, t5, 8
j l7
set:
srli t5, t5, 8
l7:
andi t5, t5, 0x7F
li t0, 0xFF
bge t2, t0, ret_inf
blt t2, x0, ret_zero
beq t2, x0, ret_zero
slli a0, t1, 15
andi t2, t2, 0xFF
slli t2, t2, 7
or a0, a0, t2
or a0, a0, t5
j ans
ret_inf:
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j ans
ret_zero:
slli a0, t1, 15
j ans
ret_b:
mv a0, a1
j ans
ret_nan:
li a0, 0x7FC0
j ans
ret_a:
j ans
ans:
li t0, 0xFFFF
and a0, a0, t0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
addi sp, sp, 24
ret
```
</details>
#### bf16_mul
For mul, I reuse the same `LUT-based clz` idea from add. When checking whether a or b is normalized, I call `clz8` to get the leading-zero count in one shot, then left-shift the mantissa by that amount and track the exponent adjustment. It's a quick way to snap subnormals into the `1.xxxxxx` form without a slow loop.
Also, to stay RV32I-only (no hardware multiply), I dropped the `mul` instruction and went with a classic shift-and-add routine, wrapped as `mul8x8_to16`: it loops over the bits of the multiplier, adds the shifted multiplicand whenever a bit is set, and builds the 16-bit product.
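The C equivalent of that routine is roughly (a sketch mirroring the loop in the listing below):
```c
#include <stdint.h>

/* 8x8 -> 16-bit multiply via shift-and-add; base RV32I has no mul. */
static uint16_t mul8x8_to16(uint8_t a, uint8_t b)
{
    uint16_t acc = 0;
    uint16_t m = a; /* multiplicand, shifted left each round */
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            acc += m; /* add when the multiplier bit is set */
        m <<= 1;
        b >>= 1;
    }
    return acc;
}
```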
<details>
<summary>bf16_mul</summary>
```asm
.data
.align 4
clz8_lut:
.byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4
.byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
clz8:
andi a0, a0, 0xFF
la t0, clz8_lut
add t0, t0, a0
lbu a0, 0(t0)
ret
mul8x8_to16:
andi a0, a0, 0xFF
andi a1, a1, 0xFF
mv t1, a0 # multiplicand
mv t2, a1 # multiplier
li t0, 0
li t3, 8
label1:
andi t4, t2, 1
beqz t4, label2
add t0, t0, t1 # acc += multiplicand
label2:
slli t1, t1, 1 # multiplicand <<= 1
srli t2, t2, 1 # multiplier >>= 1
addi t3, t3, -1
bnez t3, label1
mv a0, t0
ret
bf16_mul:
addi sp, sp, -28
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw ra, 24(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
li t0, 0xff
xor t1, s0, s1 # t1 = result_sign
bne s2, t0, a_exp
bne s4, x0, ret_b
bne s3, x0, inf1
beq s5, x0, ret_nan
inf1:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j ans
a_exp:
bne s3, t0, b_exp
bne s5, x0, ret_b
bne s2, x0, inf2
beq s4, x0, ret_nan
inf2:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j ans
b_exp:
bne s2, x0, skip1
beq s4, x0, l1
skip1:
bne s3, x0, skip2
bne s5, x0, skip2
l1:
srli a0, t1, 15
j ans
skip2:
li t2, 0 # t2 = exp_adjust
bne s2, x0, else_a
mv a0, s4
jal ra, clz8 # a0 = clz8(rm)
mv t0, a0
sll s4, s4, t0
sub t2, t2, t0
li s2, 1
else_a:
ori s4, s4, 0x80
bne s3, x0, else_b
mv a0, s5
jal ra, clz8 # a0 = clz8(rm)
mv t0, a0
sll s5, s5, t0
sub t2, t2, t0
li s3, 1
else_b:
ori s5, s5, 0x80
mv a0, s4
mv a1, s5
jal mul8x8_to16
mv t3, a0 # t3 = result_mant = product
xor t1, s0, s1
add t4, s2, s3
addi t4, t4, -127
add t4, t4, t2 # t4 = result_exp
li t5, 0x8000
and t0, t3, t5
beq t0, x0, l2
srli t3, t3, 8
andi t3, t3, 0x7F
addi t4, t4, 1
j mant
l2:
srli t3, t3, 7
andi t3, t3, 0x7F
mant:
li t0, 0xFF
blt t4, t0, skip3
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j ans
skip3:
blt x0, t4, l3
addi t0, x0, -6
blt t4, t0, l4
li t0, 1
sub t0, t0, t4
srl t3, t3, t0
li t4, 0
j l3
l4:
srli a0, t1, 15
j ans
l3:
andi t1, t1, 1
slli t1, t1, 15
andi t4, t4, 0xFF
slli t4, t4, 7
andi t3, t3, 0x7F
or a0, t1, t4
or a0, a0, t3
li t0, 0xFFFF
and a0, a0, t0
j ans
ret_inf:
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j ans
ret_zero:
slli a0, t1, 15
j ans
ret_b:
mv a0, a1
j ans
ret_nan:
li a0, 0x7FC0
j ans
ret_a:
j ans
ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
```
</details>
<details>
<summary>others</summary>
```asm
bf16_isnan:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, nan_false
andi t1, a0, 0x007F
beq t1, x0, nan_false
li a0, 1
ret
nan_false:
li a0, 0
ret
bf16_isinf:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, inf_false
andi t1, a0, 0x007F
bne t1, x0, inf_false
li a0, 1
ret
inf_false:
li a0, 0
ret
bf16_iszero:
li t0, 0x7FFF
and t1, a0, t0
bne t1, x0, zero_false
li a0, 1
ret
zero_false:
li a0, 0
ret
f32_to_bf16:
srli t1, a0, 23
andi t1, t1, 0xFF
li t2, 0xFF
bne t1, t2, L1
srli a0, a0, 16
ret
L1:
srli t1, a0, 16
andi t1, t1, 1
add a0, a0, t1
li t3, 0x7FFF
add a0, a0, t3
srli a0, a0, 16
ret
bf16_to_f32:
slli a0, a0, 16
ret
```
</details>
### Part 2
#### bf16_sqrt
First, I give a purely mechanical, line-by-line translation of the C version.
<details>
<summary>Version 1</summary>
```asm
.data
.align 4
.text
mul8x8_to16:
andi a0, a0, 0xFF
andi a1, a1, 0xFF
mv t1, a0 # multiplicand
mv t2, a1 # multiplier
li t0, 0
li t3, 8
label1:
andi t4, t2, 1
beqz t4, label2
add t0, t0, t1 # acc += multiplicand
label2:
slli t1, t1, 1 # multiplicand <<= 1
srli t2, t2, 1 # multiplier >>= 1
addi t3, t3, -1
bnez t3, label1
mv a0, t0
ret
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, a_exp
beq s2, x0, a_mant
j ret_a
a_mant:
beq s0, x0, a_sign
li a0, 0x7FC0
j ans
a_sign:
j ret_a
a_exp:
bne s1, x0, skip
bne s2, x0, skip
li a0, 0x0000
j ans
skip:
beq s0, x0, negative_skip
li a0, 0x7FC0
j ans
negative_skip:
bne s1, x0, denormals_skip
li a0, 0x0000
j ans
denormals_skip:
addi t1, s1, -127 #t1 = e
ori t2, s2, 0x80 #t2 = m
andi t0, t1, 1
beq t0, x0, else
slli t2, t2, 1
addi t0, t1, -1
srai t0, t0, 1
addi t3, t0, 127 # t3 = new_exp
j end_if
else:
srai t0, t1, 1
addi t3, t0, 127 # t3 = new_exp
end_if:
li s3, 90 # s3 = low
li s4, 255 # s4 = high
li s5, 128 # s5 = result
mv s6, t3
loop:
bgtu s3, s4, loop_done
add t0, s3, s4
srli t1, t0, 1 # t1 = mid
mv a0, t1
mv a1, t1
mv t5, t1 # protect
mv t6, t2
jal mul8x8_to16
mv t0, a0
mv t1, t5
mv t2, t6
srli t0, t0, 7 #t0 = sq
bleu t0, t2, do_if
addi s4, t1, -1
j end_if2
do_if:
mv s5, t1
addi s3, t1, 1
end_if2:
j loop
loop_done:
mv t3, s6
li t0, 256
bltu s5, t0, l1
srli s5, s5, 1
addi t3, t3, 1
j l3
l1: li t0, 128
bgeu s5, t0, l3
l2: li t0, 128
bgeu s5, t0, l3
slti t2, t3, 2
bne t2, x0, l3
slli s5, s5, 1
addi t3, t3, -1
j l2
l3:
andi t5, s5, 0x7F # t5 = new_mant
li t0, 0xFF
blt t3, t0, no_overflow
li a0, 0x7F80
j ans
no_overflow:
bgt t3, x0, no_underflow
li a0, 0
j ans
no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j ans
ret_a:
j ans
ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
</details>
#### Improvement Strategy: Binary search
To improve the cycle count and code size, I mainly target the binary-search part. Because we implement in RV32I, there is no `mul` instruction, so every multiplication has to be emulated with a shift-and-add loop.
In the original `bf16_sqrt`, the binary search recomputes `mid * mid` on every iteration. This not only adds many instructions per step but also introduces extra branches, which significantly increases the total cycle count in Ripes.
#### Implement Strategy
- Rewrite the comparison
In the original binary search we look for the largest `mid` satisfying `floor(mid^2/128) <= m`. This is equivalent to `mid^2 <= (m<<7) + 127`, since `⌊x/128⌋ ≤ m ⟺ x < 128(m+1) ⟺ x ≤ 128m + 127` (the +127 absorbs the floor). In other words, we want the result
$$
\mathrm{mid}=\left\lfloor \sqrt{(m \ll 7)+127} \right\rfloor
$$
which produces the same mid as the original binary search but avoids repeated multiplication inside the loop.
- Implement sqrt in RV32I (using only shift / add / sub / compare)
- Main concept: decide one bit of the square root each round.
We maintain three variables (`n`, `res`, and `bit`), where `bit` starts at the highest power of four (`1 << 14`) and shifts right by two each round. In each round, we test `n >= res + bit` to decide whether this bit can be included: if it can, we subtract `res + bit` and update `res`; otherwise we only shift `res`.
- C implementation of sqrt without multiplication or division
```clike=
static inline uint16_t isqrt16(uint32_t n)
{
    uint32_t res = 0;
    uint32_t bit = 1u << 14; // 16384
    while (bit != 0) {
        uint32_t tmp = res + bit;
        if (n >= tmp) {
            n -= tmp;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint16_t)res;
}
```
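For example, for an input of 1.0 the exponent is even and the normalized mantissa is `m = 0x80`, so we compute `isqrt16((0x80 << 7) + 127) = isqrt16(16511)`; since 128² = 16384 ≤ 16511 < 129² = 16641, the result is 128 (`0x80`), which packs back to exactly 1.0 (`0x3F80`).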
<details>
<summary>Version 2</summary>
```asm
.data
.align 4
.text
isqrt16_pow4:
li t0, 0 # res
li t1, 16384 # bit = 1<<14, the highest power of four below 2^16
isqrt_loop:
beqz t1, isqrt_done
add t2, t0, t1 # tmp = res + bit
bgeu a0, t2, isqrt_ge
srli t0, t0, 1
srli t1, t1, 2
j isqrt_loop
isqrt_ge:
sub a0, a0, t2
srli t0, t0, 1
add t0, t0, t1
srli t1, t1, 2
j isqrt_loop
isqrt_done:
mv a0, t0
ret
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, a_exp
beq s2, x0, a_mant
j ret_a
a_mant:
beq s0, x0, a_sign
li a0, 0x7FC0
j ans
a_sign:
j ret_a
a_exp:
bne s1, x0, skip
bne s2, x0, skip
li a0, 0x0000
j ans
skip:
beq s0, x0, negative_skip
li a0, 0x7FC0
j ans
negative_skip:
bne s1, x0, denormals_skip
li a0, 0x0000
j ans
denormals_skip:
addi t1, s1, -127 # t1 = e
ori t2, s2, 0x80 # t2 = m (implicit 1)
andi t0, t1, 1
beq t0, x0, else
slli t2, t2, 1 # m <<= 1 (odd exponent)
addi t0, t1, -1 # t0 = e - 1
srai t0, t0, 1 # t0 = (e-1)>>1
addi t3, t0, 127 # t3 = new_exp
j end_if
else:
srai t0, t1, 1 # t0 = e>>1
addi t3, t0, 127 # t3 = new_exp
end_if:
mv s6, t3
slli a0, t2, 7 # a0 = m<<7
addi a0, a0, 127 # a0 = (m<<7) + 127
jal isqrt16_pow4 # a0 = result
mv s5, a0 # s5 = result
mv t3, s6
j l3
l3:
andi t5, s5, 0x7F # t5 = new_mant
li t0, 0xFF
blt t3, t0, no_overflow
li a0, 0x7F80
j ans
no_overflow:
bgt t3, x0, no_underflow
li a0, 0
j ans
no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j ans
ret_a:
j ans
ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
</details>
### Validation & Analysis
- In this part, I compared all operations against the compiler-generated versions using the test code below. It includes three to five cases per Part 1 operation, plus about twenty cases for `bf16_sqrt`, covering most situations.
<details>
<summary>BF16 test </summary>
```asm
main:
# BF16 MUL (1~5)
#1 Inf * 0 = NaN
li a0, 0x7F80
li a1, 0x0000
jal ra, bf16_mul
li t6, 0x7FC0
bne a0, t6, fail
li t0, 1
mv s0, t0
#2 0 * 3 = 0
li a0, 0x0000
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
#3 2 * 3 = 6
li a0, 0x4000
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x40C0
bne a0, t6, fail
#4 -2 * 3 = -6
li a0, 0xC000
li a1, 0x4040
jal ra, bf16_mul
li t6, 0xC0C0
bne a0, t6, fail
#5 1.5 * 2 = 3
li a0, 0x3FC0
li a1, 0x4000
jal ra, bf16_mul
li t6, 0x4040
bne a0, t6, fail
# BF16 ADD (6~10)
#6 1 + 1 = 2
li a0, 0x3F80
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
#7 1 + 0.5 = 1.5
li a0, 0x3F80
li a1, 0x3F00
jal ra, bf16_add
li t6, 0x3FC0
bne a0, t6, fail
#8 2 + (-0.5) = 1.5
li a0, 0x4000
li a1, 0xBF00
jal ra, bf16_add
li t6, 0x3FC0
bne a0, t6, fail
#9 -1 + 1 = 0
li a0, 0xBF80
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x0000
bne a0, t6, fail
#10 +Inf + (-Inf) = NaN
li a0, 0x7F80
li a1, 0xFF80
jal ra, bf16_add
li t6, 0x7FC0
bne a0, t6, fail
# BF16 SUB (11~15)
#11 3 - 1 = 2
li a0, 0x4040
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x4000
bne a0, t6, fail
#12 1 - 1 = 0
li a0, 0x3F80
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x0000
bne a0, t6, fail
#13 1 - (-1) = 2
li a0, 0x3F80
li a1, 0xBF80
jal ra, bf16_sub
li t6, 0x4000
bne a0, t6, fail
#14 -2 - 3 = -5
li a0, 0xC000
li a1, 0x4040
jal ra, bf16_sub
li t6, 0xC0A0
bne a0, t6, fail
#15 +Inf - +Inf = NaN
li a0, 0x7F80
li a1, 0x7F80
jal ra, bf16_sub
li t6, 0x7FC0
bne a0, t6, fail
# BF16 DIV (16~20)
#16 3 / 2 = 1.5
li a0, 0x4040
li a1, 0x4000
jal ra, bf16_div
li t6, 0x3FC0
bne a0, t6, fail
#17 1 / 2 = 0.5
li a0, 0x3F80
li a1, 0x4000
jal ra, bf16_div
li t6, 0x3F00
bne a0, t6, fail
#18 0 / 3 = 0
li a0, 0x0000
li a1, 0x4040
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
#19 1 / 0 = +Inf
li a0, 0x3F80
li a1, 0x0000
jal ra, bf16_div
li t6, 0x7F80
bne a0, t6, fail
#20 0 / 0 = NaN
li a0, 0x0000
li a1, 0x0000
jal ra, bf16_div
li t6, 0x7FC0
bne a0, t6, fail
# BF16 ISNAN test (21~23)
#21 isnan(+qNaN) = 1
li a0, 0x7FC1
jal ra, bf16_isnan
li t6, 1
bne a0, t6, fail
#22 isnan(+sNaN-ish) = 1
li a0, 0x7F81
jal ra, bf16_isnan
li t6, 1
bne a0, t6, fail
#23 isnan(+Inf) = 0
li a0, 0x7F80
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
# BF16 ISINF test (24~26)
#24 isinf(+Inf) = 1
li a0, 0x7F80
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
#25 isinf(-Inf) = 1
li a0, 0xFF80
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
#26 isinf(NaN) = 0
li a0, 0x7FC0
jal ra, bf16_isinf
li t6, 0
bne a0, t6, fail
# BF16 ISZERO (27~29)
#27 iszero(+0) = 1
li a0, 0x0000
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
#28 iszero(-0) = 1
li a0, 0x8000
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
#29 iszero(subnormal != 0) = 0
li a0, 0x0001
jal ra, bf16_iszero
li t6, 0
bne a0, t6, fail
# f32_to_bf16 (30~32)
#30 1.0f -> 0x3F80
li a0, 0x3F800000
jal ra, f32_to_bf16
li t6, 0x3F80
bne a0, t6, fail
#31 0x3F7F8000 to even -> 0x3F80
li a0, 0x3F7F8000
jal ra, f32_to_bf16
li t6, 0x3F80
bne a0, t6, fail
#32 NaN keeps the high 16 bits -> 0x7FC0
li a0, 0x7FC00001
jal ra, f32_to_bf16
li t6, 0x7FC0
bne a0, t6, fail
# bf16_to_f32 test (33~35)
#33 0x3F80 -> 1.0f
li a0, 0x3F80
jal ra, bf16_to_f32
li t6, 0x3F800000
bne a0, t6, fail
#34 0x7F80 -> +Inf
li a0, 0x7F80
jal ra, bf16_to_f32
li t6, 0x7F800000
bne a0, t6, fail
#35 0xC000 -> -2.0f
li a0, 0xC000
jal ra, bf16_to_f32
li t6, 0xC0000000
bne a0, t6, fail
# BF16 SQRT (36~55)
#36 sqrt(+Inf) = +Inf
li a0, 0x7F80
jal ra, bf16_sqrt
li t6, 0x7F80
bne a0, t6, fail
#37 sqrt(-Inf) = NaN (canonical 0x7FC0)
li a0, 0xFF80
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
#38 NaN(payload) propagates (return original)
li a0, 0x7FC1
jal ra, bf16_sqrt
li t6, 0x7FC1
bne a0, t6, fail
#39 NaN(canonical) propagates (return original)
li a0, 0x7FC0
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
#40 sqrt(+0) = +0
li a0, 0x0000
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
#41 sqrt(-0) = +0 (returns BF16_ZERO)
li a0, 0x8000
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
#42 denorm(min) flush to 0
li a0, 0x0001
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
#43 denorm(max) flush to 0
li a0, 0x007F
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
#44 sqrt(0.25) = 0.5
li a0, 0x3E80
jal ra, bf16_sqrt
li t6, 0x3F00
bne a0, t6, fail
#45 sqrt(0.5) ≈ 0.70703125 -> 0x3F35
li a0, 0x3F00
jal ra, bf16_sqrt
li t6, 0x3F35
bne a0, t6, fail
#46 sqrt(1.0) = 1.0
li a0, 0x3F80
jal ra, bf16_sqrt
li t6, 0x3F80
bne a0, t6, fail
#47 sqrt(1.5) -> 0x3F9D
li a0, 0x3FC0
jal ra, bf16_sqrt
li t6, 0x3F9D
bne a0, t6, fail
#48 sqrt(2.0) ≈ 1.4140625 -> 0x3FB5
li a0, 0x4000
jal ra, bf16_sqrt
li t6, 0x3FB5
bne a0, t6, fail
#49 sqrt(3.0) -> 0x3FDD
li a0, 0x4040
jal ra, bf16_sqrt
li t6, 0x3FDD
bne a0, t6, fail
#50 sqrt(4.0) = 2.0
li a0, 0x4080
jal ra, bf16_sqrt
li t6, 0x4000
bne a0, t6, fail
#51 sqrt(9.0) = 3.0
li a0, 0x4110
jal ra, bf16_sqrt
li t6, 0x4040
bne a0, t6, fail
#52 sqrt(16.0) = 4.0
li a0, 0x4180
jal ra, bf16_sqrt
li t6, 0x4080
bne a0, t6, fail
#53 sqrt(min normal 0x0080) -> 0x2000
li a0, 0x0080
jal ra, bf16_sqrt
li t6, 0x2000
bne a0, t6, fail
#54 sqrt(max finite 0x7F7F) -> 0x5F7F
li a0, 0x7F7F
jal ra, bf16_sqrt
li t6, 0x5F7F
bne a0, t6, fail
#55 sqrt(-1.0) = NaN (canonical 0x7FC0)
li a0, 0xBF80
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
ok:
li a0, 0 # all passed
li a7, 10
ecall
fail:
li a7, 10
ecall
```
</details>
<details>
<summary>BF16 operation with sqrt v1 </summary>
```asm
.data
.align 4
clz8_lut:
.byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4
.byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
.globl clz8
clz8:
andi a0, a0, 0xFF
la t0, clz8_lut
add t0, t0, a0
lbu a0, 0(t0)
ret
.globl mul8x8_to16
mul8x8_to16:
andi a0, a0, 0xFF
andi a1, a1, 0xFF
mv t1, a0
mv t2, a1
li t0, 0
li t3, 8
mul8x8_to16_loop:
andi t4, t2, 1
beqz t4, mul8x8_to16_skip
add t0, t0, t1
mul8x8_to16_skip:
slli t1, t1, 1
srli t2, t2, 1
addi t3, t3, -1
bnez t3, mul8x8_to16_loop
mv a0, t0
ret
.globl bf16_add
bf16_add:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
li t0, 0xFF
bne s2, t0, bf16_add_chk
bnez s4, bf16_add_ret_a
bne s3, t0, bf16_add_ret_a
bnez s5, bf16_add_ret_b
bne s0, s1, bf16_add_ret_nan
j bf16_add_ret_b
bf16_add_chk:
beq s3, t0, bf16_add_ret_b
li t0, 0x7FFF
and t1, a0, t0
beq t1, x0, bf16_add_ret_b
and t1, a1, t0
beq t1, x0, bf16_add_ret_a
beq s2, x0, bf16_add_a_den_done
ori s4, s4, 0x80
bf16_add_a_den_done:
beq s3, x0, bf16_add_b_den_done
ori s5, s5, 0x80
bf16_add_b_den_done:
sub t1, s2, s3
blt x0, t1, bf16_add_grt
beq t1, x0, bf16_add_equ
mv t2, s3
li t0, -8
blt t1, t0, bf16_add_ret_b
sub t0, x0, t1
srl s4, s4, t0
j bf16_add_exp_dif
bf16_add_grt:
mv t2, s2
li t0, 8
blt t0, t1, bf16_add_ret_a
srl s5, s5, t1
j bf16_add_exp_dif
bf16_add_equ:
mv t2, s2
bf16_add_exp_dif:
bne s0, s1, bf16_add_diff_signs
mv t3, s0
add t4, s4, s5
li t0, 0x100
and t1, t4, t0
beq t1, x0, bf16_add_pack
srli t4, t4, 1
addi t2, t2, 1
li t0, 0xFF
blt t2, t0, bf16_add_pack
slli a0, t3, 15
li t5, 0x7F80
or a0, a0, t5
j bf16_add_ans
bf16_add_diff_signs:
blt s4, s5, bf16_add_gt_ma
mv t3, s0
sub t4, s4, s5
j bf16_add_norm
bf16_add_gt_ma:
mv t3, s1
sub t4, s5, s4
bf16_add_norm:
beq t4, x0, bf16_add_ret_zero
mv a0, t4
jal ra, clz8
mv t0, a0
sll t4, t4, t0
sub t2, t2, t0
blt t2, x0, bf16_add_ret_zero
beq t2, x0, bf16_add_ret_zero
j bf16_add_pack
bf16_add_ret_zero:
li a0, 0x0000
j bf16_add_ans
bf16_add_pack:
slli a0, t3, 15
slli t1, t2, 7
or a0, a0, t1
andi t4, t4, 0x7F
or a0, a0, t4
j bf16_add_ans
bf16_add_ret_b:
mv a0, a1
j bf16_add_ans
bf16_add_ret_nan:
li a0, 0x7FC0
j bf16_add_ans
bf16_add_ret_a:
j bf16_add_ans
bf16_add_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
.globl bf16_sub
bf16_sub:
addi sp, sp, -8
sw ra, 4(sp)
li t0, 0x8000
xor a1, a1, t0
jal ra, bf16_add
lw ra, 4(sp)
addi sp, sp, 8
ret
.globl bf16_div
bf16_div:
addi sp, sp, -24
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a1, 15
andi s1, s1, 1
srli s2, a0, 7
andi s2, s2, 0xFF
srli s3, a1, 7
andi s3, s3, 0xFF
andi s4, a0, 0x7F
andi s5, a1, 0x7F
xor t1, s0, s1
li t0, 0xff
bne s3, t0, bf16_div_exp_b_f
bne s5, x0, bf16_div_ret_b
bne s2, t0, bf16_div_l1
bne s4, x0, bf16_div_l1
j bf16_div_ret_nan
bf16_div_l1:
slli a0, t1, 15
j bf16_div_ans
bf16_div_exp_b_f:
bne s3, x0, bf16_div_skip
bne s5, x0, bf16_div_skip
bne s2, x0, bf16_div_skip2
beq s4, x0, bf16_div_ret_nan
bf16_div_skip2:
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j bf16_div_ans
bf16_div_skip:
bne s2, t0, bf16_div_exp_a_f
bne s4, x0, bf16_div_ret_a
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j bf16_div_ans
bf16_div_exp_a_f:
beq s2, x0, bf16_div_exp_a_is_zero
j bf16_div_l2
bf16_div_exp_a_is_zero:
beq s4, x0, bf16_div_a_is_zero_return
j bf16_div_l2
bf16_div_a_is_zero_return:
slli a0, t1, 15
j bf16_div_ans
bf16_div_l2:
beq s2, x0, bf16_div_l3
ori s4, s4, 0x80
bf16_div_l3:
beq s3, x0, bf16_div_l4
ori s5, s5, 0x80
bf16_div_l4:
slli t2, s4, 15
mv t3, s5
li t4, 0
li t5, 0
bf16_div_div_loop:
li t6, 16
bge t4, t6, bf16_div_out_loop
slli t5, t5, 1
sub t0, x0, t4
addi t0, t0, 15
sll t1, t3, t0
bltu t2, t1, bf16_div_cant_div
sub t2, t2, t1
ori t5, t5, 1
bf16_div_cant_div:
addi t4, t4, 1
j bf16_div_div_loop
bf16_div_out_loop:
sub t2, s2, s3
addi t2, t2, 127
bne s2, x0, bf16_div_l5
addi t2, t2, -1
bf16_div_l5:
bne s3, x0, bf16_div_l6
addi t2, t2, 1
bf16_div_l6:
li t0, 0x8000
and t3, t5, t0
bne t3, x0, bf16_div_set
bf16_div_norm_loop:
and t3, t5, t0
bne t3, x0, bf16_div_norm_done
li t6, 2
blt t2, t6, bf16_div_norm_done
slli t5, t5, 1
addi t2, t2, -1
j bf16_div_norm_loop
bf16_div_norm_done:
srli t5, t5, 8
j bf16_div_l7
bf16_div_set:
srli t5, t5, 8
bf16_div_l7:
andi t5, t5, 0x7F
li t0, 0xFF
bge t2, t0, bf16_div_ret_inf
blt t2, x0, bf16_div_ret_zero
beq t2, x0, bf16_div_ret_zero
slli a0, t1, 15
andi t2, t2, 0xFF
slli t2, t2, 7
or a0, a0, t2
or a0, a0, t5
j bf16_div_ans
bf16_div_ret_inf:
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j bf16_div_ans
bf16_div_ret_zero:
slli a0, t1, 15
j bf16_div_ans
bf16_div_ret_b:
mv a0, a1
j bf16_div_ans
bf16_div_ret_nan:
li a0, 0x7FC0
j bf16_div_ans
bf16_div_ret_a:
j bf16_div_ans
bf16_div_ans:
li t0, 0xFFFF
and a0, a0, t0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
addi sp, sp, 24
ret
.globl bf16_mul
bf16_mul:
addi sp, sp, -28
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw ra, 24(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a1, 15
andi s1, s1, 1
srli s2, a0, 7
andi s2, s2, 0xFF
srli s3, a1, 7
andi s3, s3, 0xFF
andi s4, a0, 0x7F
andi s5, a1, 0x7F
li t0, 0xff
xor t1, s0, s1
bne s2, t0, bf16_mul_a_exp
bne s4, x0, bf16_mul_ret_b
bne s3, x0, bf16_mul_inf1
beq s5, x0, bf16_mul_ret_nan
bf16_mul_inf1:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_a_exp:
bne s3, t0, bf16_mul_b_exp
bne s5, x0, bf16_mul_ret_b
bne s2, x0, bf16_mul_inf2
beq s4, x0, bf16_mul_ret_nan
bf16_mul_inf2:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_b_exp:
bne s2, x0, bf16_mul_skip1
beq s4, x0, bf16_mul_zero_ret
bf16_mul_skip1:
bne s3, x0, bf16_mul_skip2
bne s5, x0, bf16_mul_skip2
bf16_mul_zero_ret:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_skip2:
li t2, 0
bne s2, x0, bf16_mul_else_a
mv a0, s4
jal ra, clz8
mv t0, a0
sll s4, s4, t0
sub t2, t2, t0
li s2, 1
bf16_mul_else_a:
ori s4, s4, 0x80
bne s3, x0, bf16_mul_else_b
mv a0, s5
jal ra, clz8
mv t0, a0
sll s5, s5, t0
sub t2, t2, t0
li s3, 1
bf16_mul_else_b:
ori s5, s5, 0x80
mv a0, s4
mv a1, s5
jal ra, mul8x8_to16
mv t3, a0
xor t1, s0, s1
add t4, s2, s3
addi t4, t4, -127
add t4, t4, t2
li t5, 0x8000
and t0, t3, t5
beq t0, x0, bf16_mul_l2
srli t3, t3, 8
andi t3, t3, 0x7F
addi t4, t4, 1
j bf16_mul_mant
bf16_mul_l2:
srli t3, t3, 7
andi t3, t3, 0x7F
bf16_mul_mant:
li t0, 0xFF
blt t4, t0, bf16_mul_skip3
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j bf16_mul_ans
bf16_mul_skip3:
blt x0, t4, bf16_mul_pack
addi t0, x0, -6
blt t4, t0, bf16_mul_underflow
li t0, 1
sub t0, t0, t4
srl t3, t3, t0
li t4, 0
j bf16_mul_pack
bf16_mul_underflow:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_pack:
andi t1, t1, 1
slli t1, t1, 15
andi t4, t4, 0xFF
slli t4, t4, 7
andi t3, t3, 0x7F
or a0, t1, t4
or a0, a0, t3
li t0, 0xFFFF
and a0, a0, t0
j bf16_mul_ans
bf16_mul_ret_b:
mv a0, a1
j bf16_mul_ans
bf16_mul_ret_nan:
li a0, 0x7FC0
j bf16_mul_ans
bf16_mul_ret_a:
j bf16_mul_ans
bf16_mul_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
.globl bf16_isnan
bf16_isnan:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, bf16_isnan_false
andi t1, a0, 0x007F
beq t1, x0, bf16_isnan_false
li a0, 1
ret
bf16_isnan_false:
li a0, 0
ret
.globl bf16_isinf
bf16_isinf:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, bf16_isinf_false
andi t1, a0, 0x007F
bne t1, x0, bf16_isinf_false
li a0, 1
ret
bf16_isinf_false:
li a0, 0
ret
.globl bf16_iszero
bf16_iszero:
li t0, 0x7FFF
and t1, a0, t0
bne t1, x0, bf16_iszero_false
li a0, 1
ret
bf16_iszero_false:
li a0, 0
ret
.globl f32_to_bf16
f32_to_bf16:
srli t1, a0, 23
andi t1, t1, 0xFF
li t2, 0xFF
bne t1, t2, f32_to_bf16_L1
srli a0, a0, 16
ret
f32_to_bf16_L1:
srli t1, a0, 16
andi t1, t1, 1
add a0, a0, t1
li t3, 0x7FFF
add a0, a0, t3
srli a0, a0, 16
ret
.globl bf16_to_f32
bf16_to_f32:
slli a0, a0, 16
ret
.globl bf16_sqrt
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, bf16_sqrt_a_exp
beq s2, x0, bf16_sqrt_a_mant
j bf16_sqrt_ret_a
bf16_sqrt_a_mant:
beq s0, x0, bf16_sqrt_a_sign
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_a_sign:
j bf16_sqrt_ret_a
bf16_sqrt_a_exp:
bne s1, x0, bf16_sqrt_skip
bne s2, x0, bf16_sqrt_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_skip:
beq s0, x0, bf16_sqrt_negative_skip
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_negative_skip:
bne s1, x0, bf16_sqrt_denormals_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_denormals_skip:
addi t1, s1, -127
ori t2, s2, 0x80
andi t0, t1, 1
beq t0, x0, bf16_sqrt_else
slli t2, t2, 1
addi t0, t1, -1
srai t0, t0, 1
addi t3, t0, 127
j bf16_sqrt_end_if
bf16_sqrt_else:
srai t0, t1, 1
addi t3, t0, 127
bf16_sqrt_end_if:
li s3, 90
li s4, 255
li s5, 128
mv s6, t3
bf16_sqrt_loop:
bgtu s3, s4, bf16_sqrt_loop_done
add t0, s3, s4
srli t1, t0, 1
mv a0, t1
mv a1, t1
mv t5, t1
mv t6, t2
jal mul8x8_to16
mv t0, a0
mv t1, t5
mv t2, t6
srli t0, t0, 7
bleu t0, t2, bf16_sqrt_do_if
addi s4, t1, -1
j bf16_sqrt_loop
bf16_sqrt_do_if:
mv s5, t1
addi s3, t1, 1
j bf16_sqrt_loop
bf16_sqrt_loop_done:
mv t3, s6
li t0, 256
bltu s5, t0, bf16_sqrt_l3
srli s5, s5, 1
addi t3, t3, 1
j bf16_sqrt_l3
bf16_sqrt_l3:
andi t5, s5, 0x7F
li t0, 0xFF
blt t3, t0, bf16_sqrt_no_overflow
li a0, 0x7F80
j bf16_sqrt_ans
bf16_sqrt_no_overflow:
bgt t3, x0, bf16_sqrt_no_underflow
li a0, 0
j bf16_sqrt_ans
bf16_sqrt_no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j bf16_sqrt_ans
bf16_sqrt_ret_a:
j bf16_sqrt_ans
bf16_sqrt_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
</details>
<details>
<summary>BF16 operation with sqrt v2 </summary>
```asm
.data
.align 4
clz8_lut:
.byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4
.byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
.globl clz8
clz8:
andi a0, a0, 0xFF
la t0, clz8_lut
add t0, t0, a0
lbu a0, 0(t0)
ret
.globl mul8x8_to16
mul8x8_to16:
andi a0, a0, 0xFF
andi a1, a1, 0xFF
mv t1, a0
mv t2, a1
li t0, 0
li t3, 8
mul8x8_to16_loop:
andi t4, t2, 1
beqz t4, mul8x8_to16_skip
add t0, t0, t1
mul8x8_to16_skip:
slli t1, t1, 1
srli t2, t2, 1
addi t3, t3, -1
bnez t3, mul8x8_to16_loop
mv a0, t0
ret
.globl bf16_add
bf16_add:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
li t0, 0xFF
bne s2, t0, bf16_add_chk
bnez s4, bf16_add_ret_a
bne s3, t0, bf16_add_ret_a
bnez s5, bf16_add_ret_b
bne s0, s1, bf16_add_ret_nan
j bf16_add_ret_b
bf16_add_chk:
beq s3, t0, bf16_add_ret_b
li t0, 0x7FFF
and t1, a0, t0
beq t1, x0, bf16_add_ret_b
and t1, a1, t0
beq t1, x0, bf16_add_ret_a
beq s2, x0, bf16_add_a_den_done
ori s4, s4, 0x80
bf16_add_a_den_done:
beq s3, x0, bf16_add_b_den_done
ori s5, s5, 0x80
bf16_add_b_den_done:
sub t1, s2, s3
blt x0, t1, bf16_add_grt
beq t1, x0, bf16_add_equ
mv t2, s3
li t0, -8
blt t1, t0, bf16_add_ret_b
sub t0, x0, t1
srl s4, s4, t0
j bf16_add_exp_dif
bf16_add_grt:
mv t2, s2
li t0, 8
blt t0, t1, bf16_add_ret_a
srl s5, s5, t1
j bf16_add_exp_dif
bf16_add_equ:
mv t2, s2
bf16_add_exp_dif:
bne s0, s1, bf16_add_diff_signs
mv t3, s0
add t4, s4, s5
li t0, 0x100
and t1, t4, t0
beq t1, x0, bf16_add_pack
srli t4, t4, 1
addi t2, t2, 1
li t0, 0xFF
blt t2, t0, bf16_add_pack
slli a0, t3, 15
li t5, 0x7F80
or a0, a0, t5
j bf16_add_ans
bf16_add_diff_signs:
blt s4, s5, bf16_add_gt_ma
mv t3, s0
sub t4, s4, s5
j bf16_add_norm
bf16_add_gt_ma:
mv t3, s1
sub t4, s5, s4
bf16_add_norm:
beq t4, x0, bf16_add_ret_zero
mv a0, t4
jal ra, clz8
mv t0, a0
sll t4, t4, t0
sub t2, t2, t0
blt t2, x0, bf16_add_ret_zero
beq t2, x0, bf16_add_ret_zero
j bf16_add_pack
bf16_add_ret_zero:
li a0, 0x0000
j bf16_add_ans
bf16_add_pack:
slli a0, t3, 15
slli t1, t2, 7
or a0, a0, t1
andi t4, t4, 0x7F
or a0, a0, t4
j bf16_add_ans
bf16_add_ret_b:
mv a0, a1
j bf16_add_ans
bf16_add_ret_nan:
li a0, 0x7FC0
j bf16_add_ans
bf16_add_ret_a:
j bf16_add_ans
bf16_add_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
.globl bf16_sub
bf16_sub:
addi sp, sp, -8
sw ra, 4(sp)
li t0, 0x8000
xor a1, a1, t0
jal ra, bf16_add
lw ra, 4(sp)
addi sp, sp, 8
ret
.globl bf16_div
bf16_div:
addi sp, sp, -24
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a1, 15
andi s1, s1, 1
srli s2, a0, 7
andi s2, s2, 0xFF
srli s3, a1, 7
andi s3, s3, 0xFF
andi s4, a0, 0x7F
andi s5, a1, 0x7F
xor t1, s0, s1
li t0, 0xff
bne s3, t0, bf16_div_exp_b_f
bne s5, x0, bf16_div_ret_b
bne s2, t0, bf16_div_l1
bne s4, x0, bf16_div_l1
j bf16_div_ret_nan
bf16_div_l1:
slli a0, t1, 15
j bf16_div_ans
bf16_div_exp_b_f:
bne s3, x0, bf16_div_skip
bne s5, x0, bf16_div_skip
bne s2, x0, bf16_div_skip2
beq s4, x0, bf16_div_ret_nan
bf16_div_skip2:
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j bf16_div_ans
bf16_div_skip:
bne s2, t0, bf16_div_exp_a_f
bne s4, x0, bf16_div_ret_a
slli t1, t1, 15
li t2, 0x7F80
or a0, t1, t2
j bf16_div_ans
bf16_div_exp_a_f:
beq s2, x0, bf16_div_exp_a_is_zero
j bf16_div_l2
bf16_div_exp_a_is_zero:
beq s4, x0, bf16_div_a_is_zero_return
j bf16_div_l2
bf16_div_a_is_zero_return:
slli a0, t1, 15
j bf16_div_ans
bf16_div_l2:
beq s2, x0, bf16_div_l3
ori s4, s4, 0x80
bf16_div_l3:
beq s3, x0, bf16_div_l4
ori s5, s5, 0x80
bf16_div_l4:
slli t2, s4, 15
mv t3, s5
li t4, 0
li t5, 0
bf16_div_div_loop:
li t6, 16
bge t4, t6, bf16_div_out_loop
slli t5, t5, 1
sub t0, x0, t4
addi t0, t0, 15
sll t1, t3, t0
bltu t2, t1, bf16_div_cant_div
sub t2, t2, t1
ori t5, t5, 1
bf16_div_cant_div:
addi t4, t4, 1
j bf16_div_div_loop
bf16_div_out_loop:
sub t2, s2, s3
addi t2, t2, 127
bne s2, x0, bf16_div_l5
addi t2, t2, -1
bf16_div_l5:
bne s3, x0, bf16_div_l6
addi t2, t2, 1
bf16_div_l6:
li t0, 0x8000
and t3, t5, t0
bne t3, x0, bf16_div_set
bf16_div_norm_loop:
and t3, t5, t0
bne t3, x0, bf16_div_norm_done
li t6, 2
blt t2, t6, bf16_div_norm_done
slli t5, t5, 1
addi t2, t2, -1
j bf16_div_norm_loop
bf16_div_norm_done:
srli t5, t5, 8
j bf16_div_l7
bf16_div_set:
srli t5, t5, 8
bf16_div_l7:
andi t5, t5, 0x7F
li t0, 0xFF
bge t2, t0, bf16_div_ret_inf
blt t2, x0, bf16_div_ret_zero
beq t2, x0, bf16_div_ret_zero
slli a0, t1, 15
andi t2, t2, 0xFF
slli t2, t2, 7
or a0, a0, t2
or a0, a0, t5
j bf16_div_ans
bf16_div_ret_inf:
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j bf16_div_ans
bf16_div_ret_zero:
slli a0, t1, 15
j bf16_div_ans
bf16_div_ret_b:
mv a0, a1
j bf16_div_ans
bf16_div_ret_nan:
li a0, 0x7FC0
j bf16_div_ans
bf16_div_ret_a:
j bf16_div_ans
bf16_div_ans:
li t0, 0xFFFF
and a0, a0, t0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
addi sp, sp, 24
ret
.globl bf16_mul
bf16_mul:
addi sp, sp, -28
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw ra, 24(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a1, 15
andi s1, s1, 1
srli s2, a0, 7
andi s2, s2, 0xFF
srli s3, a1, 7
andi s3, s3, 0xFF
andi s4, a0, 0x7F
andi s5, a1, 0x7F
li t0, 0xff
xor t1, s0, s1
bne s2, t0, bf16_mul_a_exp
bne s4, x0, bf16_mul_ret_b
bne s3, x0, bf16_mul_inf1
beq s5, x0, bf16_mul_ret_nan
bf16_mul_inf1:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_a_exp:
bne s3, t0, bf16_mul_b_exp
bne s5, x0, bf16_mul_ret_b
bne s2, x0, bf16_mul_inf2
beq s4, x0, bf16_mul_ret_nan
bf16_mul_inf2:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_b_exp:
bne s2, x0, bf16_mul_skip1
beq s4, x0, bf16_mul_zero_ret
bf16_mul_skip1:
bne s3, x0, bf16_mul_skip2
bne s5, x0, bf16_mul_skip2
bf16_mul_zero_ret:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_skip2:
li t2, 0
bne s2, x0, bf16_mul_else_a
mv a0, s4
jal ra, clz8
mv t0, a0
sll s4, s4, t0
sub t2, t2, t0
li s2, 1
bf16_mul_else_a:
ori s4, s4, 0x80
bne s3, x0, bf16_mul_else_b
mv a0, s5
jal ra, clz8
mv t0, a0
sll s5, s5, t0
sub t2, t2, t0
li s3, 1
bf16_mul_else_b:
ori s5, s5, 0x80
mv a0, s4
mv a1, s5
jal ra, mul8x8_to16
mv t3, a0
xor t1, s0, s1
add t4, s2, s3
addi t4, t4, -127
add t4, t4, t2
li t5, 0x8000
and t0, t3, t5
beq t0, x0, bf16_mul_l2
srli t3, t3, 8
andi t3, t3, 0x7F
addi t4, t4, 1
j bf16_mul_mant
bf16_mul_l2:
srli t3, t3, 7
andi t3, t3, 0x7F
bf16_mul_mant:
li t0, 0xFF
blt t4, t0, bf16_mul_skip3
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j bf16_mul_ans
bf16_mul_skip3:
blt x0, t4, bf16_mul_pack
addi t0, x0, -6
blt t4, t0, bf16_mul_underflow
li t0, 1
sub t0, t0, t4
srl t3, t3, t0
li t4, 0
j bf16_mul_pack
bf16_mul_underflow:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_pack:
andi t1, t1, 1
slli t1, t1, 15
andi t4, t4, 0xFF
slli t4, t4, 7
andi t3, t3, 0x7F
or a0, t1, t4
or a0, a0, t3
li t0, 0xFFFF
and a0, a0, t0
j bf16_mul_ans
bf16_mul_ret_b:
mv a0, a1
j bf16_mul_ans
bf16_mul_ret_nan:
li a0, 0x7FC0
j bf16_mul_ans
bf16_mul_ret_a:
j bf16_mul_ans
bf16_mul_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
.globl bf16_isnan
bf16_isnan:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, bf16_isnan_false
andi t1, a0, 0x007F
beq t1, x0, bf16_isnan_false
li a0, 1
ret
bf16_isnan_false:
li a0, 0
ret
.globl bf16_isinf
bf16_isinf:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, bf16_isinf_false
andi t1, a0, 0x007F
bne t1, x0, bf16_isinf_false
li a0, 1
ret
bf16_isinf_false:
li a0, 0
ret
.globl bf16_iszero
bf16_iszero:
li t0, 0x7FFF
and t1, a0, t0
bne t1, x0, bf16_iszero_false
li a0, 1
ret
bf16_iszero_false:
li a0, 0
ret
.globl f32_to_bf16
f32_to_bf16:
srli t1, a0, 23
andi t1, t1, 0xFF
li t2, 0xFF
bne t1, t2, f32_to_bf16_L1
srli a0, a0, 16
ret
f32_to_bf16_L1:
srli t1, a0, 16
andi t1, t1, 1
add a0, a0, t1
li t3, 0x7FFF
add a0, a0, t3
srli a0, a0, 16
ret
.globl bf16_to_f32
bf16_to_f32:
slli a0, a0, 16
ret
.globl isqrt16_pow4
isqrt16_pow4:
li t0, 0
li t1, 16384
isqrt16_pow4_loop:
beqz t1, isqrt16_pow4_done
add t2, t0, t1
bgeu a0, t2, isqrt16_pow4_ge
srli t0, t0, 1
srli t1, t1, 2
j isqrt16_pow4_loop
isqrt16_pow4_ge:
sub a0, a0, t2
srli t0, t0, 1
add t0, t0, t1
srli t1, t1, 2
j isqrt16_pow4_loop
isqrt16_pow4_done:
mv a0, t0
ret
.globl bf16_sqrt
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, bf16_sqrt_a_exp
beq s2, x0, bf16_sqrt_a_mant
j bf16_sqrt_ret_a
bf16_sqrt_a_mant:
beq s0, x0, bf16_sqrt_a_sign
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_a_sign:
j bf16_sqrt_ret_a
bf16_sqrt_a_exp:
bne s1, x0, bf16_sqrt_skip
bne s2, x0, bf16_sqrt_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_skip:
beq s0, x0, bf16_sqrt_negative_skip
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_negative_skip:
bne s1, x0, bf16_sqrt_denormals_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_denormals_skip:
addi t1, s1, -127
ori t2, s2, 0x80
andi t0, t1, 1
beq t0, x0, bf16_sqrt_else
slli t2, t2, 1
addi t0, t1, -1
srai t0, t0, 1
addi t3, t0, 127
j bf16_sqrt_end_if
bf16_sqrt_else:
srai t0, t1, 1
addi t3, t0, 127
bf16_sqrt_end_if:
mv s6, t3
slli a0, t2, 7
addi a0, a0, 127
jal isqrt16_pow4
mv s5, a0
mv t3, s6
j bf16_sqrt_l3
bf16_sqrt_l3:
andi t5, s5, 0x7F
li t0, 0xFF
blt t3, t0, bf16_sqrt_no_overflow
li a0, 0x7F80
j bf16_sqrt_ans
bf16_sqrt_no_overflow:
bgt t3, x0, bf16_sqrt_no_underflow
li a0, 0
j bf16_sqrt_ans
bf16_sqrt_no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j bf16_sqrt_ans
bf16_sqrt_ret_a:
j bf16_sqrt_ans
bf16_sqrt_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
</details>
- Code size (excluding labels and blank lines)
    - bf16 operations with sqrt v1: 526 lines
    - bf16 operations with sqrt v2: 521 lines
    - compiler-generated: 1007 lines
- Cycle count
    - bf16 operations with sqrt v1

    - bf16 operations with sqrt v2

    - RISC-V (32-bits) gcc

### Improvement
- My `bf16-v2` assembly runs about 78.4% faster than the compiler-generated code.
- My `bf16-v2` code is about 48.26% smaller than the compiler-generated code.

The code-size difference between the two versions of the sqrt routine is about 30 lines. Since the redundant multiply loop is no longer needed, v2 runs at nearly 256% of v1's speed.
### Use Case
[LeetCode 688. Knight Probability in Chessboard](https://leetcode.com/problems/knight-probability-in-chessboard/description/)
Many LeetCode dynamic-programming solutions allocate a 2D table `dp[m][n]`. When `m` and `n` are large, the DP table becomes memory-bound due to high bandwidth demand. In our use case, we store `dp` entries in bfloat16 instead of 32-bit integers/floats, which immediately halves the memory requirement. Arithmetic and normalization are implemented in RV32I using our bf16 add/sub/mul/div routines, as sketched below.
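A minimal sketch of one DP relaxation step under this scheme (`N`, the move table, and the constants here are illustrative assumptions; the bf16 routines are the ones implemented above):
```c
#include <stdint.h>

typedef uint16_t bf16_t;
bf16_t bf16_add(bf16_t a, bf16_t b); /* routines from Part 1 */
bf16_t bf16_mul(bf16_t a, bf16_t b);

#define N 25
static const int moves[8][2] = {
    {1, 2}, {2, 1}, {2, -1}, {1, -2}, {-1, -2}, {-2, -1}, {-2, 1}, {-1, 2}
};

/* dp_new[r][c] = sum over the 8 knight moves of dp_old[pr][pc] * 1/8,
   with every entry kept as a 2-byte bf16 instead of a 4-byte float. */
void knight_step(bf16_t dp_new[N][N], const bf16_t dp_old[N][N])
{
    const bf16_t eighth = 0x3E00; /* 0.125 in bf16 */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            bf16_t acc = 0x0000; /* +0.0 */
            for (int k = 0; k < 8; k++) {
                int pr = r + moves[k][0], pc = c + moves[k][1];
                if (pr < 0 || pr >= N || pc < 0 || pc >= N)
                    continue;
                acc = bf16_add(acc, bf16_mul(dp_old[pr][pc], eighth));
            }
            dp_new[r][c] = acc;
        }
}
```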
## Explanations for both program functionality and instruction-level operations using the Ripes simulator
Here I use `uf8_decode` assembly from problem b to demonstrate.
```asm
.text
.globl main
main:
li a0, 0x7F
jal ra, uf8_decode
li a7, 10
ecall
uf8_decode:
andi t0, a0, 0x0F # mantissa = fl & 0x0F
srli t1, a0, 4 # exponent = fl >> 4
li t2, 1
sll t2, t2, t1 # t2 = 1 << e
addi t2, t2, -1 # t2 = (1<<e) - 1
slli t2, t2, 4 # offset = (2^e -1)*16
sll t0, t0, t1 # mantissa << e
add a0, t0, t2 # return
ret
```
Since `li t2, 1` is a pseudo-instruction, the assembler expands it to the real RV32I instruction `addi t2, x0, 1`.
In RISC-V, register names like `t0` and `s0` are ABI aliases for the physical registers `x0`–`x31`. For example, `t0` maps to `x5`, and `s0` maps to `x8`.
### Program Functionality
Now let's step through the pipeline with a few instructions.

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 0 (low), since there's no jump instruction in the pipeline EX stage and the `Branch taken` signal is low
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `addi x7, x7, -1` is in the WB stage and needs to be written back to `x7`
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from Reg file or forwarding path
- ForwardA: no forwarding (no data dependency with previous instruction)
- ForwardB: no forwarding (no data dependency with previous instruction)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: from ALU result, since the instruction in WB stage is not link/return instruction (PC+4) or load word instruction (Memory Data).

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 0 (low), since there's no jump instruction in the pipeline EX stage and the `Branch taken` signal is low
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `slli x7, x7, 4` is in the WB stage and needs to be written back to `x7`
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from Reg file or forwarding path
- ForwardA: from the EX/MEM pipeline register (`x5`)
- ForwardB: from the MEM/WB pipeline register (`x7`)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: from ALU result, since the instruction in WB stage is not link/return instruction (PC+4) or load word instruction (Memory Data).

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 1 (high), since there is a jump instruction in the pipeline EX stage, so the next PC won't be PC+4 as before.
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `sll x5, x5, x6` is in the WB stage and needs to be written back to `x5`
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from immediate
- ForwardA: no forwarding (no data dependency with previous instruction)
- ForwardB: no forwarding (no data dependency with previous instruction)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: from ALU result, since the instruction in WB stage is not link/return instruction (PC+4) or load word instruction (Memory Data).
When `jalr` enters the EX stage, `IF/ID clear` and `ID/EX clear` go high to flush the pipeline. Because a jump is unconditional, any instructions behind it are invalid and must be removed. As a result, we get a two-cycle penalty.

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 0 (low), since there's no jump instruction in the pipeline EX stage and the `Branch taken` signal is low
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `add x10, x5, x7` is in the WB stage and needs to be written back to `x10`
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from Reg file or forwarding path
- ForwardA: no forwarding (no data dependency with previous instruction)
- ForwardB: no forwarding (no data dependency with previous instruction)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: from ALU result, since the instruction in WB stage is not link/return instruction (PC+4) or load word instruction (Memory Data).
As mentioned above, `IF/ID clear` and `ID/EX clear` went high in the previous cycle, so the two younger instructions in the IF and ID stages are flushed and replaced with NOPs in this cycle.
Another thing to note is that the final instruction of `uf8_decode`, `add x10, x5, x7`, is now in the WB stage, so the decoded result is written back to `a0` (`x10`) in this cycle.

#### IF (Instruction Fetch)
- PC Enable: 1 (high), since there's no pipeline stall
- Next PC Mux: 0 (low), since there's no jump instruction in the pipeline EX stage and the `Branch taken` signal is low
#### ID (Instruction Decode)
- Reg Wr En: 1 (high); the instruction `jalr x0, x1, 0` is in the WB stage. Its destination is `x0`, so the written value is discarded.
#### EX (Execute)
- ALUOp1: from Reg file or forwarding path
- ALUOp2: from Reg file or forwarding path
- ForwardA: no forwarding (no data dependency with previous instruction)
- ForwardB: no forwarding (no data dependency with previous instruction)
- Branch taken: not taken
#### MEM (Memory)
- Data Memory Wr En: 0 (low), since the instruction in MEM stage is not `sw`.
#### WB (Write Back)
- WB Mux: PC+4, since the instruction in the WB stage is a link/return instruction.
We first look at the value of register `a0`. As mentioned above, the final decode result has been stored in `a0`; we validate that it is correct below.

uf8_decode.c
```clike=
uint32_t uf8_decode(uf8 fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint8_t exponent = fl >> 4;
    uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}
```
Our input is `0x7F`:
- mantissa: m = a0 & 0x0F = 0xF
- exponent: e = a0 >> 4 = 0x7 = 7
- offset: o = (0x7FFF >> (15 - 7)) << 4 = 0x7F << 4 = 0x7F0
- return: (0xF << 7) + 0x7F0 = 0x780 + 0x7F0 = `0xF70`

This matches the result from the Ripes simulator exactly, so we can conclude that the program functionality is correct!