# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <ukp66482>
[TOC]
## Leetcode No.1342 Number of Steps to Reduce a Number to Zero
### Problem Goal
The goal of this problem is to calculate the number of steps required to reduce a given integer to zero.
In each step:
- If the number is even, divide it by 2.
- If the number is odd, subtract 1.
The process continues until the number becomes zero, and the program returns the total number of steps taken.
### Solution 1 Loop based
A simple while-loop repeatedly checks if the number is even or odd and updates it accordingly until it reaches zero.
#### C Code
```c
int numberOfSteps_simple(uint32_t num) {
int steps = 0;
while (num != 0u) {
if ((num & 1u) == 0u)
num >>= 1; // even → divide by 2
else
num -= 1u; // odd → subtract 1
steps++;
}
return steps;
}
```
#### Assembly Code
``` asm
NUMBEROFSTEPS_SIMPLE:
addi sp, sp, -4
sw ra, 0(sp)
li t0, 0 # steps = 0
WHILE_LOOP:
beqz a0, DONE # while(num != 0)
andi t1, a0, 1 # check LSB (num % 2)
beqz t1, EVEN
addi a0, a0, -1 # num -= 1
addi t0, t0, 1 # steps++
jal x0, WHILE_LOOP
EVEN:
srli a0, a0, 1 # num /= 2
addi t0, t0, 1 # steps++
jal x0, WHILE_LOOP
DONE:
mv a0, t0 # return steps
lw ra, 0(sp)
addi sp, sp, 4
ret
```
### Solution 2 CLZ + POPCOUNT
The total number of steps can be expressed as:
**(bit length − 1) + number of 1 bits**
#### Why?
1. Each 1 bit causes one “subtract 1” operation (odd → even).
2. Each division by 2 corresponds to a right shift — that is, the number of bit positions.
3. However, the most significant 1 is only subtracted once at the end and is not divided again, so we subtract 1 from the total.
#### C code
```c
int numberOfSteps(int num) {
if (num == 0) return 0;
int bits = 32 - __builtin_clz(num);
int ones = __builtin_popcount(num);
return (bits - 1) + ones;
}
```
`numberOfSteps` function calculates how many steps are needed to reduce a number to zero.
It utilizes two GCC/Clang built-in functions:
* `__builtin_clz()` counts the number of leading zeros in the 32-bit binary form of input, helping determine the position of the most significant bit.
* `__builtin_popcount()` counts how many bits are set to 1 in num, representing how many subtraction steps occur.
Both are compiler intrinsics provided by GCC and Clang, which are often translated directly into efficient CPU instructions
#### Assembly Code
#### CLZ
``` asm
#////////////////////////////////////////////
#
# CLZ Function
#
# input a0 = x, return a0 = orig x, a1 = clz_result
#
#////////////////////////////////////////////
CLZ:
addi sp, sp, -8
sw ra, 0(sp)
sw a0, 4(sp)
addi t0, x0, 32 # n = 32
addi t1, x0, 16 # c = 16
CLZ_LOOP:
srl t2, a0, t1 # t2 = x >> c
beq t2, x0, CLZ_LOWER # if(t2 == 0) goto lower
sub t0, t0, t1 # n -= c
mv a0, t2 # x = y
CLZ_LOWER:
srli t1, t1, 1
bne t1, x0, CLZ_LOOP
sub a1, t0, a0 # return = n - x
lw ra, 0(sp)
lw a0, 4(sp)
addi sp, sp, 8
ret
```
##### Key Idea
This is a **binary-search–style** implementation of CLZ — it determines how many zeros precede the most significant 1 bit in the number efficiently **without looping through all bits**.
#### POPCOUNT
``` asm
#////////////////////////////////////////////
#
# POPCOUNT Function
#
# input a0 = x, return a0 = popcount result
#
#////////////////////////////////////////////
POPCOUNT:
li t1, 0x55555555 # t1 = 0x55555555
and t2, a0, t1 # t2 = u & 0x55555555
srli t3, a0, 1 # t3 = u >> 1
and t3, t3, t1 # t3 = (u >> 1) & 0x55555555
add a0, t2, t3 # u = (u & 0x55555555) + ((u >> 1) & 0x55555555)
li t1, 0x33333333 # t1 = 0x33333333
and t2, a0, t1 # t2 = u & 0x33333333
srli t3, a0, 2 # t3 = u >> 2
and t3, t3, t1 # t3 = (u >> 2) & 0x33333333
add a0, t2, t3 # u = (u & 0x33333333) + ((u >> 2) & 0x33333333)
li t1, 0x0F0F0F0F # t1 = 0x0F0F0F0F
and t2, a0, t1 # t2 = u & 0x0F0F0F0F
srli t3, a0, 4 # t3 = u >> 4
and t3, t3, t1 # t3 = (u >> 4) & 0x0F0F0F
add a0, t2, t3 # u = (u & 0x0F0F0F0F) + ((u >> 4) & 0x0F0F0F0F)
li t1, 0x00FF00FF # t1 = 0x00FF00FF
and t2, a0, t1 # t2 = u & 0x00FF00FF
srli t3, a0, 8 # t3 = u >> 8
and t3, t3, t1 # t3 = (u >> 8) & 0x00FF00FF
add a0, t2, t3 # u = (u & 0x00FF00FF) + ((u >> 8) & 0x00FF00FF)
li t1, 0x0000FFFF # t1 = 0x0000FFFF
and t2, a0, t1 # t2 = u & 0x0000FFFF
srli t3, a0, 16 # t3 = u >> 16
and t3, t3, t1 # t3 = (u >> 16) & 0x0000FFFF
add a0, t2, t3 # u = (u & 0x0000FFFF) + ((u >> 16) & 0x0000FFFF)
ret
```
##### POPCOUNT Algorithm Explanation
It implements the parallel bit count (Hamming weight) algorithm, which sums bits in groups **using bitwise operations — without using loops**.
1. Pairwise summation (every 2 bits):
u = (u & 0x55555555) + ((u >> 1) & 0x55555555)
→ counts 1s in every 2-bit group.
2. Group 4 bits:
u = (u & 0x33333333) + ((u >> 2) & 0x33333333)
→ accumulates results over 4-bit chunks.
3. Group 8 bits:
u = (u & 0x0F0F0F0F) + ((u >> 4) & 0x0F0F0F0F)
4. Group 16 bits:
u = (u & 0x00FF00FF) + ((u >> 8) & 0x00FF00FF)
5. Group 32 bits:
u = (u & 0x0000FFFF) + ((u >> 16) & 0x0000FFFF)
→ popcount final result.
>Example: u = (u & 0x55555555) + ((u >> 1) & 0x55555555)
>
>1: (u & 0x55555555)
>
>Keeps only the bits in even positions (0, 2, 4, 6, ...).
The mask 0x55555555 in binary is 0b_0101_0101_0101...,
so this filters out every other bit (the ones in odd positions are cleared to 0).
>
>2: (u >> 1)
>
>Shifts all bits in u right by 1 position.
This moves each odd-position bit into the place of its neighboring even bit,
allowing them to be added together.
>
>3: ((u >> 1) & 0x55555555)
>
>After shifting, applies the same mask to keep only the shifted bits
that were originally in odd positions.
>
>4: Add the two results together:
>
>Sums the counts of bits in each pair (2-bit group).
Each 2-bit group now contains the number of 1s in those two bits
(possible results per group: 00, 01, 10).
#### NumberOfStep
``` asm
NUMBEROFSTEPS:
addi sp, sp, -4
sw ra, 0(sp) # save return address
beqz a0, RETURN_ZERO
jal ra, CLZ # a0 = original num, a1 = clz_result
addi a2, x0, 32
sub a1, a2, a1 # a1 = 32 - clz(num)
jal ra, POPCOUNT # a0 = popcount(num)
add a0, a0, a1 # steps = popcount(num) + (32 - clz(num))
addi a0, a0, -1 # steps -= 1
lw ra, 0(sp) # restore return address
addi sp, sp, 4
ret
RETURN_ZERO:
lw ra, 0(sp) # restore return address
addi sp, sp, 4
addi a0, x0, 0 # steps = 0
ret
```
### Test Function
``` asm
#////////////////////////////////////////////
#
# TEST Function
#
#////////////////////////////////////////////
TEST:
addi sp, sp, 4
sw ra, 0(sp) # save return address
la s0, TEST_DATA
la s1, TEST_RESULT
li s2, 20 # number of test cases
li s3, 0 # current test case index
TEST_LOOP:
beq s3, s2, TEST_ALL_PASSED # if (i == number of test cases) all passed
lw a0, 0(s0) # load test data
jal ra, NUMBEROFSTEPS # call numberOfSteps
lw t0, 0(s1) # load expected result
bne a0, t0, TEST_FAIL # if (result != expected) fail
addi s0, s0, 4 # move to next test data
addi s1, s1, 4 # move to next expected result
addi s3, s3, 1 # i++
jal x0, TEST_LOOP
TEST_ALL_PASSED:
la a0, TEST_OK_MSG
li a7, 4
ecall
lw ra, 0(sp) # restore return address
addi sp, sp, 4
ret
TEST_FAIL:
la a0, TEST_FAIL_MSG
li a7, 4
ecall
lw ra, 0(sp) # restore return address
addi sp, sp, 4
ret
```
### Test Case
| Test Input | Expected Output |
|------------|-----------------|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 3 |
| 7 | 5 |
| 8 | 4 |
| 14 | 6 |
| 15 | 7 |
| 16 | 5 |
| 31 | 9 |
| 32 | 6 |
| 63 | 11 |
| 64 | 7 |
| 123 | 12 |
| 255 | 15 |
| 256 | 9 |
| 1023 | 19 |
| 1024 | 11 |
| 0xFFFFFFFF | 63 |
### Performance Comparison Between Two Solutions
Both Solution 1 (Loop-Based) and Solution 2 (CLZ + POPCOUNT) were tested under the same test function and identical test cases.
This ensures a fair comparison, as both implementations were evaluated using the same input data, expected outputs, and verification process.
Therefore, the performance difference observed — in terms of cycles, CPI, and IPC — reflects only the efficiency of each algorithm’s logic, not variations in testing conditions.
#### Solution 1: Loop Based

#### Solution 2: CLZ + POPCOUNT

#### Conclusion
Solution 2 (CLZ + POPCOUNT) executes fewer cycles and has a lower CPI, showing higher efficiency.
However, Solution 1 (loop-based) is simpler and easier to understand, though slower in performance.
## 5-Stage Pipeline CPU
In this section, we analyze how a RISC-V instruction is executed in a 5-stage pipelined CPU.
The five classic pipeline stages are:
1. IF (Instruction Fetch) – Retrieve instruction from instruction memory.
2. ID (Instruction Decode / Register Fetch) – Decode the instruction and read source registers.
3. EX (Execute) – Perform arithmetic or logical operations in the ALU.
4. MEM (Memory Access) – Access data memory if needed (load/store).
5. WB (Write Back) – Write the computation result back to the destination register.
To demonstrate how each stage operates, I use the CLZ function as an example.The function utilizes several R-type and I-type instructions, including logical shifts, branches, and arithmetic operations.
By examining these instructions, we can clearly observe how data flows through each pipeline stage, and how hazards may occur and be resolved.
### Assembly Code
Below is the assembly implementation of the CLZ function :
``` asm
main:
addi a0, x0, 0xff
jal ra, CLZ
mv a0, a1
li a7, 1
ecall
li a7, 10
ecall
#////////////////////////////////////////////
#
# CLZ Function
#
# input a0 = x, output a0 = orig x, a1 = clz_result
#
#////////////////////////////////////////////
CLZ:
addi sp, sp, -8
sw ra, 0(sp)
sw a0, 4(sp)
addi t0, x0, 32 # n = 32
addi t1, x0, 16 # c = 16
CLZ_LOOP:
srl t2, a0, t1 # t2 = x >> c
beq t2, x0, CLZ_LOWER # if(t2 == 0) goto lower
sub t0, t0, t1 # n -= c
mv a0, t2 # x = y
CLZ_LOWER:
srli t1, t1, 1
bne t1, x0, CLZ_LOOP
sub a1, t0, a0 # return = n - x
lw ra, 0(sp)
lw a0, 4(sp)
addi sp, sp, 8
ret
```
### Ripes Executable Code
The following machine-level disassembly was generated by Ripes after assembling the CLZ program.
Each line shows the instruction address, encoded 32-bit machine code, and the corresponding assembly instruction recognized by the simulator.
``` asm
00000000 <main>:
0: 0ff00513 addi x10 x0 255
4: 018000ef jal x1 24 <CLZ>
8: 00058513 addi x10 x11 0
c: 00100893 addi x17 x0 1
10: 00000073 ecall
14: 00a00893 addi x17 x0 10
18: 00000073 ecall
0000001c <CLZ>:
1c: ff810113 addi x2 x2 -8
20: 00112023 sw x1 0 x2
24: 00a12223 sw x10 4 x2
28: 02000293 addi x5 x0 32
2c: 01000313 addi x6 x0 16
00000030 <CLZ_LOOP>:
30: 006553b3 srl x7 x10 x6
34: 00038663 beq x7 x0 12 <CLZ_LOWER>
38: 406282b3 sub x5 x5 x6
3c: 00038513 addi x10 x7 0
00000040 <CLZ_LOWER>:
40: 00135313 srli x6 x6 1
44: fe0316e3 bne x6 x0 -20 <CLZ_LOOP>
48: 40a285b3 sub x11 x5 x10
4c: 00012083 lw x1 0 x2
50: 00412503 lw x10 4 x2
54: 00810113 addi x2 x2 8
58: 00008067 jalr x0 x1 0
```
### Instruction Analysis Example – sw ra, 0(sp)
In the CLZ function, the instruction
``` asm
sw ra, 0(sp)
```
This instruction stores the return address (ra) into the memory location pointed to by the stack pointer (sp).
This ensures that the function can later restore the return address before executing ret.
It is an S-type (Store Word) instruction, which writes a 32-bit word from a source register into memory using a base-address-plus-offset addressing mode.
#### Instruction Type
Format: `sw rs2, imm(rs1)`
Type: `S-type (Store)`
Main purpose: Save data from a register into memory.
Field Bits Value Meaning:
```
31 25 24 20 19 15 14 12 11 7 6 0
+-----------+-------+-------+------+----------+---------+
| imm[11:5] | rs2 | rs1 |funct3| imm[4:0] | opcode |
+-----------+-------+-------+------+----------+---------+
| 0000000 | x1 | x2 | 010 | 00000 | 0100011 |
ra sp SW imm=0 store
```
#### IF

In this stage, the CPU fetches the instruction from memory using the current program counter (PC).
From the diagram, the PC = `0x00000020`, and the instruction read from Instr. memory is `0x00112023`, which corresponds to `sw x1, 0(x2)` in the disassembly.
The Compressed Decoder checks whether the instruction is a 16-bit compressed type. Since the input flag is `0x1`, it indicates a normal 32-bit instruction, so no decompression is required.
At the same time, the adder calculates the next PC as 0x00000024, preparing for the next fetch.
Finally, the IF/ID pipeline register stores the current PC `0x00000020`, the next PC `0x00000024`, and the fetched instruction `0x00112023`, which will be used for decoding in the next stage.
#### ID

In the Instruction Decode (ID) stage, the instruction `0x00112023` is decoded as `sw x1, 0(x2)`.
The Decode unit extracts the opcode `0x23`, indicating a `store-type instruction`.
It also checks the `funct3` field, which has the value `0x2`, confirming that this specific instruction is `sw (store word)`.
The Control unit then generates the appropriate control signals for the store operation — enabling `MemWrite`
The lower control signal labeled `IMM` corresponds to ALUSrc, which tells the ALU to use the immediate value (offset) as its second operand instead of reading a second register.
In addition, the ALU operation control is set to `ADD`, instructing the ALU to perform an addition to calculate the effective memory address for the store instruction.
The Register File reads the values of the two registers:
- rs1: x2 (sp) = `0x7FFFFFF0` → base address for memory access
- rs2: x1 (ra) = `0x00000008` → data to be stored
The Immediate Generator produces the offset value `0x00000000` from the instruction’s immediate field.
All of this information — including the register data, control signals, and immediate — is then passed into the ID/EX pipeline register, ready for the address computation in the EX stage.
#### EXE

In the Execute (EX) stage, the CPU performs the address calculation required for the store instruction.
From the diagram, the ALU receives two inputs:
- Op1 = x2 = `0x7FFFFFF0` (base address)
- Op2 = Immediate = `0x00000000` (offset)
In last stage, The control signal, set to `IMM`, ensures that the ALU uses the immediate value instead of a second register.
The ALU control signal specifies an add operation, producing the effective memory address
```
0x7FFFFFF0 + 0x00000000 = 0x7FFFFFF0.
```
The computed result `0x7FFFFFF0` will be used as the memory address in the next stage `MEM`,while the data to be stored (x1 = 0x00000008) is also forwarded through the pipeline to the EX/MEM register.
#### MEM

In the MEM stage, the CPU performs the actual data memory operation.
From the diagram, the Data Memory module receives two inputs:
- Address = 0x7FFFFFF0, which comes from the ALU result calculated in the previous EX stage.
- Data in = 0x00000008, the value stored in register x1 (ra).
The MemWrite control signal is enabled `WrEn` = 1, allowing the processor to write the data into memory at the specified address.
Since this instruction is a store operation, there is no data read from memory `Read out` = 0x00000000.
The updated memory content is then forwarded through the MEM/WB pipeline register, although for sw, the write-back path remains unused.
#### WB

In the WB stage, the processor determines whether data should be written back to the register file.
For the `sw (store word)` instruction, there is no register destination, since the operation only writes data to memory and does not produce a value that needs to be stored in a register.
From the diagram, the multiplexer selects the ALU result `0x7FFFFFF0` as the output path, but the RegWrite control signal is disabled (0x0).
As a result, no data is written back to the register file.
## UF8
### CLZ
``` asm
#////////////////////////////////////////////
#
# CLZ Function
#
# input a0 = x, return a0 = clz_result, a1 = original x
#
#////////////////////////////////////////////
CLZ:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp) # input x
addi t0, x0, 32 # t0 = n
addi t1, x0, 16 # t1 = c
CLZ_LOOP:
srl t2, a0, t1 # t2 = x >> c = y
beq t2, x0, CLZ_MSB_IN_LOWER_HALF # if y == 0
sub t0, t0, t1 # n = n - c
add a0, t2, x0 # x = y
CLZ_MSB_IN_LOWER_HALF:
srli t1, t1, 1
bne t1, x0, CLZ_LOOP
sub a0, t0, a0 # return
lw ra, 4(sp)
lw a1, 0(sp) # input x
addi sp, sp, 8
ret # a0 clz_result, a1 input
```
### UF8_ENCODE
``` asm
#////////////////////////////////////////////
#
# UF8_ENCODE Function
#
# input a0 = x, output a0 = encode_result
#
#////////////////////////////////////////////
UF8_ENCODE:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
addi t0, x0, 16
blt a0, t0, ENCODE_RET # if x < 16 return a0
jal ra, CLZ
addi t0, x0, 31
sub t0, t0, a0 # t0 = msb = 31 - clz_result
lw a0, 0(sp) # reload a0 = value
addi t1, x0, 0 # t1 = exponent = 0
addi t4, x0, 0 # t4 = overflow = 0
addi t3, x0, 5
blt t0, t3, ENCODE_FIND_E # if msb < 5 goto ENCODE_FIND
addi t1, t0, -4 # exponent = msb - 4
addi t3, x0, 15
blt t1, t3, SKIP # if exponent < 15 goto SKIP
addi t1, x0, 15 # exponent = 15
SKIP:
addi t5, x0, 0 # t5 = e = 0
lw a0, 0(sp) # reload a0 = value
ENCODE_CAL_OVERFLOW:
bge t5, t1, ENCODE_ADJUST # if e >= exponent
slli t4, t4, 1 # overflow = overflow << 1
addi t4, t4, 16 # overflow = overflow + 16
addi t5, t5, 1 # e++
jal x0, ENCODE_CAL_OVERFLOW
ENCODE_ADJUST:
bge x0, t1, ENCODE_FIND_E # if 0 >= exponent
bge a0, t4, ENCODE_FIND_E # if value >= overflow
addi t4, t4, -16 # overflow = overflow - 16
srli t4, t4, 1 # overflow = overflow >> 1
addi t1, t1, -1 # exponent--
jal x0, ENCODE_ADJUST
ENCODE_FIND_E:
addi t3, x0, 15
bge t1, t3, ENCODE_COMBINE # if exponent >= 15 break
slli t6, t4, 1 # t6 = next_overflow = overflow << 1
addi t6, t6, 16 # next_overflow += 16
blt a0, t6, ENCODE_COMBINE # if value < next_overflow break
addi t1, t1, 1 # exponent++
addi t4, t6, 0 # overflow = next_overflow
jal x0, ENCODE_FIND_E
ENCODE_COMBINE:
sub t2, a0, t4 # t2 = value - overflow
srl t2, t2, t1 # t2 = mantissa = (value - overflow) >> exponent
slli t1, t1, 4 # t1 = exponent << 4
or a0, t1, t2
ENCODE_RET:
lw ra, 4(sp)
lw a1, 0(sp)
addi sp, sp, 8
ret
```
### UF8_DECODE
``` asm
#////////////////////////////////////////////
#
# UF8_DECODE Function
#
# input a0 = x, return a0 = decode_result
#
#////////////////////////////////////////////
UF8_DECODE:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
andi t0, a0, 0x0f # t0 = mantissa
srli t1, a0, 4 # t1 = exponent
addi t2, x0, 15
sub t2, t2, t1 # t2 = 15 - exponent
lui t3, 0x8 # 0x8 << 12 = 0x8000 upper 20 bits
addi t3, t3, -1 # 0x8000 - 1 = 0x7FFF
srl t3, t3, t2 # t3 = 0x7FFF >> (15 - exponent)
slli t3, t3, 4 # t3 = t3 << 4 (offset)
sll t0, t0, t1
add t0, t0, t3
addi a0, t0, 0
lw ra, 4(sp)
lw a1, 0(sp)
addi sp, sp, 8
ret # a0 decode_result, a1 input
```
### Test Function
``` asm
#////////////////////////////////////////////
#
# Test for UF8_ENCODE and UF8_DECODE
# (exhaustive for all 256 possible inputs)
#
#////////////////////////////////////////////
main:
addi a3, x0, 0 # i = 0
addi a4, x0, 256 # limit = 256
addi s3, x0, 1 # passed = true (1)
LOOP:
bge a3, a4, END # if i >= 256 stop
mv a0, a3 # fl = i
jal ra, UF8_DECODE
mv s1, a0 # value
mv a0, s1
jal ra, UF8_ENCODE
mv s2, a0 # fl2
bne s2, a3, SET_FAIL # if fl2 != i fail
NEXT_ITER:
addi a3, a3, 1
jal x0, LOOP
SET_FAIL:
addi s3, x0, 0 # passed = false
addi a3, a4, 0 # force break (i = limit)
jal x0, LOOP
END:
beq s3, x0, EXIT # if !passed skip message
la a0, okmsg
li a7, 4
ecall
EXIT:
li a7, 10
ecall
```
#### How the Test Works
The goal of this program is to exhaustively test `UF8_ENCODE` and `UF8_DECODE` for all possible 8-bit inputs (0–255).
The program ensures that:
For every 8-bit input i, UF8_ENCODE(UF8_DECODE(i)) == i.
This verifies the correctness of the encode/decode pair across the full input space.
#### Test Result


## BFloat16
### Conversion Function
#### F32_TO_BF16
``` asm
#////////////////////////////////////////////
#
# F32_TO_BF16 function
#
# input: a0 = f32 bits, return bf16 bits in a0
#
#////////////////////////////////////////////
F32_TO_BF16:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
srli t0, a0, 23 # t0 = exponent + mantissa
andi t0, t0, 0xFF # t0 = exponent + 0x7F
addi t1, x0, 0xFF # t1 = 0xFF
beq t0, t1, F32_TO_BF16_INF_NAN # if exponent == 0xFF, go to INF_NAN
srli t0, a0, 16 # t0 = f32bits >> 16
andi t0, t0, 1 # t0 = (f32bits >> 16) & 1 = tmp
add a0, a0, t0 # a0 = f32bits + tmp
li t1, BF16_ALL_MASK # t1 = 0x7FFF
add a0, a0, t1 # a0 = f32bits + tmp + 0x7FFF
srli a0, a0, 16 # a0 = f32bits >> 16
lw ra, 4(sp)
addi sp, sp, 8
ret
F32_TO_BF16_INF_NAN:
srli a0, a0, 16 # a0 = f32bits >> 16
lw ra, 4(sp)
addi sp, sp, 8
ret
```
#### BF16_TO_F32
``` asm
#////////////////////////////////////////////
#
# BF16_TO_F32 function
#
# input: a0 = bf16 bits, return f32 bits in a0
#
#////////////////////////////////////////////
BF16_TO_F32:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
slli a0, a0, 16 # a0 = bf16 << 16
lw ra, 4(sp)
addi sp, sp, 8
ret
```
### Special Value Check Function
#### BF16_EXP_ALL1
``` asm
#////////////////////////////////////////////
#
# BF16_EXP_ALL1 function
#
# input: a0 = x, return 1 if exponent is all 1 else 0
#
#////////////////////////////////////////////
BF16_EXP_ALL1:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
li t0, BF16_EXP_MASK
and t1, a0, t0 # t1 = a0 & BF16_EXP_MASK
bne t1, t0, BF16_EXP_NOTALL1 # if (a0 & BF16_EXP_MASK) != BF16_EXP_MASK, go to NOTALL1
addi a0, x0, 1 # exponent is all 1 return 1
lw ra, 4(sp)
addi sp, sp, 8
ret
BF16_EXP_NOTALL1:
addi a0, x0, 0 # exponent is not all 1 return 0
lw ra, 4(sp)
addi sp, sp, 8
ret
```
#### BF16_MANT_NOT0
``` asm
#////////////////////////////////////////////
#
# BF16_MANT_NOT0 function
#
# input: a0 = x, return 1 if mantissa is not 0 else 0
#
#////////////////////////////////////////////
BF16_MANT_NOT0:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
li t0, BF16_MANT_MASK # t0 = BF16_MANT_MASK
and t1, a0, t0 # t1 = a0 & BF16_MANT_MASK
beq t1, x0, BF16_MANT0
addi a0, x0, 1 # mantissa is not 0 return 1
lw ra, 4(sp)
addi sp, sp, 8
ret
BF16_MANT0:
lw ra, 4(sp)
addi sp, sp, 8
addi a0, x0, 0 # mantissa is 0 return 0
ret
```
#### BF16_ISNAN
``` asm
#////////////////////////////////////////////
#
# BF16_ISNAN function
#
# input: a0 = x, return 1 if x is NaN else 0
#
#////////////////////////////////////////////
BF16_ISNAN:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
jal ra, BF16_EXP_ALL1
beq a0, x0, BF16_NOTNAN # if exponent is not all 1, go to NOTNAN
lw a0, 0(sp) # load original a0
jal ra, BF16_MANT_NOT0
beq a0, x0, BF16_NOTNAN # if mantissa is 0, go to NOTNAN
addi a0, x0, 1 # is NaN return 1
lw ra, 4(sp)
addi sp, sp, 8
ret
BF16_NOTNAN:
addi a0, x0, 0 # not NaN return 0
lw ra, 4(sp)
addi sp, sp, 8
ret
```
#### BF16_INF
``` asm
#////////////////////////////////////////////
#
# BF16_ISINF function
#
# input: a0 = x, return 1 if x is Inf else 0
#
#////////////////////////////////////////////
BF16_ISINF:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
jal ra, BF16_EXP_ALL1
beq a0, x0, BF16_NOTINF # if exponent is not all 1, go to NOTINF
lw a0, 0(sp) # load original a0
jal ra, BF16_MANT_NOT0 # a0 = 1 if mantissa is not 0 else 0
bne a0, x0, BF16_NOTINF # if mantissa is not 0, go to NOTINF
addi a0, x0, 1 # is Inf return 1
lw ra, 4(sp)
addi sp, sp, 8
ret
BF16_NOTINF:
addi a0, x0, 0 # not Inf return 0
lw ra, 4(sp)
addi sp, sp, 8
ret
```
#### BF16_ISZERO
``` asm
#////////////////////////////////////////////
#
# BF16_ISZERO function
#
# input: a0 = x, return 1 if x is zero else 0
#
#////////////////////////////////////////////
BF16_ISZERO:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp)
li t0, BF16_ALL_MASK
and t1, a0, t0 # t1 = a0 & BF16_ALL_MASK
bne t1, x0, BF16_NOTZERO # if (a0 & BF16_ALL_MASK) != 0, go to NOTZERO
addi a0, x0, 1 # is zero return 1
lw ra, 4(sp)
addi sp, sp, 8
ret
BF16_NOTZERO:
addi a0, x0, 0 # not zero return 0
lw ra, 4(sp)
addi sp, sp, 8
ret
```
### Arithmetic Function
#### BF16_ADD
``` asm
#////////////////////////////////////////////
#
# BF16_ADD function
#
# input: a0 = a, a1 = b
# return: a0 = a + b
#
#////////////////////////////////////////////
BF16_ADD:
addi sp, sp, -12
sw ra, 8(sp)
sw a1, 4(sp) # 4(sp) = b
sw a0, 0(sp) # 0(sp) = a
srli t1, a0, 15 # t1 = sign_a
srli t2, a1, 15 # t2 = sign_b
srli t3, a0, 7
srli t4, a1, 7
andi t3, t3, 0xFF # t3 = exp_a
andi t4, t4, 0xFF # t4 = exp_b
andi t5, a0, 0x7F # t5 = mant_a
andi t6, a1, 0x7F # t6 = mant_b
addi t0, x0, 0xFF
beq t3, t0, 1f # if exp_a == 0xFF, go to 1
beq t4, t0, RETURN_B # if exp_b == 0xFF, go to RETURN_B
beq t3, x0, 3f # if exp_a == 0, go to 3
ADD_A_NOT_ZERO:
beq t4, x0, 4f # if exp_b == 0, go to 4
ADD_B_NOT_ZERO:
bne t3, x0, 5f # if exp_a != 0, go to 5
ADD_EXP_A_NOT_ZERO:
bne t4, x0, 6f # if exp_b != 0, go to 6
jal x0, ADD_EXP_B_NOT_ZERO
1:
bne t5, x0, RETURN_A # if mant_a != 0, go to RETURN_A
beq t6, t0, 2f # if mant_b == 0xFF, go to 2
jal x0, RETURN_A # else return a
2:
bne t6, x0, RETURN_B # if mant_b != 0, go to RETURN_B
beq t1, t2, RETURN_B # if sign_a == sign_b, go to RETURN_B
jal x0, RETURN_NAN # else return NAN
3:
beq t5, x0, RETURN_B # if mant_a == 0, go to RETURN_B
jal x0, ADD_A_NOT_ZERO
4:
beq t6, x0, RETURN_A # if mant_b == 0, go to RETURN_A
jal x0, ADD_B_NOT_ZERO
5:
ori t5, t5, 0x80 # mant_a = mant_a | 0x80
jal x0, ADD_EXP_A_NOT_ZERO
6:
ori t6, t6, 0x80 # mant_b = mant_b | 0x80
jal x0, ADD_EXP_B_NOT_ZERO
ADD_EXP_B_NOT_ZERO:
sub a0, t3, t4 # a0 = exp_a - exp_b
bgt a0, x0, 1f # if exp_a > exp_b, go to 1
blt a0, x0, 2f # if exp_a < exp_b, go to 2
addi a3, t3, 0 # a3 = result_exp = exp_a
jal x0, ADD_EXP_DONE
1:
addi a3, t3, 0 # a3 = result_exp = exp_a
addi t0, x0, 8
bgt a0, t0, RETURN_A # if exp_a - exp_b > 8, go to RETURN_A
srl t6, t6, a0 # mant_b = mant_b >> (exp_a - exp_b)
jal x0, ADD_EXP_DONE
2:
addi a3, t4, 0
addi t0, x0, -8
blt a0, t0, RETURN_B # if exp_a - exp_b < -8, go to RETURN_B
sub a0, x0, a0 # a0 = -a0
srl t5, t5, a0 # mant_a = mant_a >> (exp_b - exp_a)
jal x0, ADD_EXP_DONE
ADD_EXP_DONE:
bne t1, t2, ADD_DIFF_SIGN # if sign_a != sign_b, go to ADD_DIFF_SIGN
ADD_SAME_SIGN:
addi a2, t1, 0 # a2 = result_sign = sign_a
add a4, t5, t6 # a4 = mant_a + mant_b = result_mant
andi t0, a4, 0x100 # t0 = (mant_a + mant_b) & 0x100
bne t0, x0, 1f
jal x0, RETURN_ADD
1:
srli a4, a4, 1 # result_mant = result_mant >> 1
addi a3, a3, 1 # result_exp = result_exp + 1
addi t0, x0,0xFF
beq a3, t0, 2f # if result_exp == 0xFF, go to 2
bgt a3, t0, 2f # if result_exp >= 0xFF, go to 2
jal x0, RETURN_ADD
2:
slli a0, a2, 15 # return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
li t0, BF16_EXP_MASK
or a0, a0, t0
lw ra, 8(sp)
addi sp, sp, 12
ret
ADD_DIFF_SIGN:
beq t5, t6, 2f # if mant_a == mant_b, go to 2
bgt t5, t6, 2f # if mant_a > mant_b, go to 2
addi a2, t2, 0 # a2 = result_sign = sign_b
sub a4, t6, t5 # a4 = mant_b - mant_a = result_mant
1:
beq a4, x0, RETURN_ZERO # if result_mant == 0, go to RETURN_ZERO
3:
andi t0, a4, 0x80 # t0 = result_mant & 0x80
beq t0, x0, 4f
jal x0, RETURN_ADD
4:
slli a4, a4, 1 # result_mant = result_mant << 1
addi a3, a3, -1 # result_exp = result_exp - 1
bgt a3, x0, 3b # if result_exp > 0, go to 3
jal x0, RETURN_ZERO # else go to RETURN_ZERO
2:
addi a2, t1, 0 # a2 = result_sign = sign_a
sub a4, t5, t6 # a4 = mant_a - mant_b = result_mant
jal x0, 1b
lw ra, 8(sp)
addi sp, sp, 12
ret
RETURN_A:
lw a0, 0(sp)
lw ra, 8(sp)
addi sp, sp, 12
ret
RETURN_B:
lw a0, 4(sp)
lw ra, 8(sp)
addi sp, sp, 12
ret
RETURN_NAN:
li a0, 0x7FC0
lw ra, 8(sp)
addi sp, sp, 12
ret
RETURN_ZERO:
addi a0, x0, 0
lw ra, 8(sp)
addi sp, sp, 12
ret
RETURN_ADD:
slli a0, a2, 15 # a0 = result_sign << 15
andi t0, a3, 0xFF
slli t0, t0, 7
or a0, a0, t0 # a0 = (result_sign << 15) | result_exp
andi t0, a4, 0x7F
or a0, a0, t0 # a0 = (result_sign << 15) | (result_exp << 7) | result_mant
lw ra, 8(sp)
addi sp, sp, 12
ret
```
#### BF16_SUB
``` asm
#////////////////////////////////////////////
#
# BF16_SUB function
#
# input: a0 = a, a1 = b
# return: a0 = a - b
#
#////////////////////////////////////////////
BF16_SUB:
li t0, BF16_SIGN_MASK
xor a1, a1, t0 # b = b ^ 0x8000
jal x0, BF16_ADD
```
#### BF16_MUL
``` asm
#////////////////////////////////////////////
#
# BF16_MUL function
#
# input: a0 = a, a1 = b
# return: a0 = a * b
#
#////////////////////////////////////////////
BF16_MUL:
addi sp, sp, -12
sw ra, 8(sp)
sw a1, 4(sp) # 4(sp) = b
sw a0, 0(sp) # 0(sp) = a
srli t1, a0, 15 # t1 = sign_a
srli t2, a1, 15 # t2 = sign_b
srli t3, a0, 7
srli t4, a1, 7
andi t3, t3, 0xFF # t3 = exp_a
andi t4, t4, 0xFF # t4 = exp_b
andi t5, a0, 0x7F # t5 = mant_a
andi t6, a1, 0x7F # t6 = mant_b
xor a2, t1, t2 # a2 = result_sign = sign_a ^ sign_b
addi t0, x0, 0xFF
beq t3, t0, 1f # if exp_a == 0xFF, go to 1
jal x0, CHECK_B
1: # exp_a == 0xFF
bne t5, x0, RETURN_A # a is NaN
beq t4, x0, 2f # exp_b == 0 → check mant_b
beq t4, t0, 3f # exp_b == 0xFF → check mant_b
jal x0, RETURN_INF
2: # exp_b == 0
beq t6, x0, RETURN_NAN # Inf * 0 -> NaN
jal x0, RETURN_INF # Inf * subnormal(nonzero) -> Inf
3: # exp_b == 0xFF
bne t6, x0, RETURN_B # b == NaN
jal x0, RETURN_INF # b == INF
CHECK_B:
beq t4, t0, 1f # if exp_b == 0xFF, go to 1
jal x0, CHECK_ZERO
1:
bne t6, x0, RETURN_B # if mant_b != 0, go to RETURN_B
beq t3, t0, 2f # if exp_a == 0, go to 2
jal x0, RETURN_INF
2:
beq t5, x0, RETURN_NAN # if mant_a == 0, go to RETURN_NAN
jal x0, RETURN_INF
CHECK_ZERO:
beq t3, x0, CHECK_A_ZERO # if exp_a == 0, check mant_a
beq t4, x0, CHECK_B_ZERO # if exp_b == 0, check mant_b
jal x0, CONT_MUL
CHECK_A_ZERO:
beq t5, x0, RETURN_ZERO_SIGN # if mant_a == 0 -> return 0
jal x0, CONT_MUL
CHECK_B_ZERO:
beq t6, x0, RETURN_ZERO_SIGN # if mant_b == 0 -> return 0
jal x0, CONT_MUL
RETURN_ZERO_SIGN:
slli a0, a2, 15 # a0 = result_sign << 15
lw ra, 8(sp)
addi sp, sp, 12
ret
CONT_MUL:
addi a3, x0, 0 # a3 = exp_adjust = 0
beq t3, x0, 1f # if exp_a == 0, go to 1
ori t5, t5, 0x80 # mant_a = mant_a | 0x80
jal x0, 3f
1:
andi t0, t5, 0x80 # t0 = mant_a & 0x80
beq t0, x0, 2f # if mant_a & 0x80 == 0, go to 2
addi t3, x0, 1 # exp_a = 1
jal x0, 3f
2:
slli t5, t5, 1 # mant_a = mant_a << 1
addi a3, a3, -1 # exp_adjust = exp_adjust - 1
jal x0, 1b # go to 1
3:
beq t4, x0, 4f # if exp_b == 0, go to 4
ori t6, t6, 0x80 # mant_b = mant_b | 0x80
jal x0, SHIFT_ADD
4:
andi t0, t6, 0x80 # t0 = mant_b & 0x80
beq t0, x0, 5f # if mant_b & 0x80 == 0, go to 5
addi t4, x0, 1 # exp_b = 1
jal x0, SHIFT_ADD
5:
slli t6, t6, 1 # mant_b = mant_b << 1
addi a3, a3, -1 # exp_adjust = exp_adjust - 1
jal x0, 4b # go to 4
SHIFT_ADD:
addi a4, x0, 0 # a4 = result_mant = 0
addi t0, x0, 8 # counter = 8 = mant_b 8-bit
SHIFT_LOOP:
andi t1, t6, 1 # t1 = mant_b & 1
beq t1, x0, NO_ADD # if (mant_b & 1) == 0, skip add
add a4, a4, t5 # result_mant += mant_a
NO_ADD:
slli t5, t5, 1 # mant_a <<= 1
srli t6, t6, 1 # mant_b >>= 1
addi t0, t0, -1 # counter--
bne t0, x0, SHIFT_LOOP # loop until counter = 0
add a3, a3, t3
add a3, a3, t4 # result_exp = exp_a + exp_b + exp_adjust
addi a3, a3, -BF16_EXP_BIAS # result_exp -= 127
li t0, BF16_SIGN_MASK
and t0, t0, a4 # t0 = result_mant & 0x8000
bne t0, x0, 1f
srli a4, a4, 7 # result_mant >>= 7
andi a4, a4, 0x7F # result_mant &= 0x7F
jal x0, 2f
1:
srli a4, a4, 8 # result_mant >>= 8
andi a4, a4, 0x7F # result_mant &= 0x7F
addi a3, a3, 1 # result_exp++
2:
addi t0, x0, 0xFF
beq a3, t0, RETURN_INF # if result_exp == 0xFF, go to RETURN_INF
bgt a3, t0, RETURN_INF # if result_exp > 0xFF, go to RETURN_INF
3:
beq a3, x0, 4f # if result_exp == 0, go to 4
blt a3, x0, 4f # if result_exp < 0, go to 4
jal x0, RETURN_MUL
4:
addi a0, x0, -6 # a0 = -6
blt a3, a0, RETURN_ZERO_SIGN # if result_exp < -6, go to RETURN_ZERO_SIGN
addi a3, x0, 0 # result_exp = 0
addi t0, x0, 1
sub t0, t0, a3 # t0 = 1 - result_exp
srl a4, a4, t0 # result_mant >>= (1 - result_exp)
RETURN_INF: #return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
slli a0, a2, 15
li t0, BF16_EXP_MASK
or a0, a0, t0
lw ra, 8(sp)
addi sp, sp, 12
ret
RETURN_MUL:
slli a0, a2, 15 # a0 = result_sign << 15
andi t0, a3, 0xFF
slli t0, t0, 7
or a0, a0, t0 # a0 = (result_sign << 15) | (result_exp << 7)
andi t0, a4, 0x7F
or a0, a0, t0 # a0 = (result_sign << 15) | (result_exp << 7) | result_mant
lw ra, 8(sp)
addi sp, sp, 12
ret
```
Since RV32I has no hardware mul instruction, multiplication is implemented using a shift-add algorithm.
In this method, the multiplier’s least significant bit (LSB) is checked each cycle:
* If it is 1, the multiplicand is added to the partial result.
* Then the multiplicand is shifted left (<< 1) and the multiplier is shifted right (>> 1).
* This repeats until all bits of the multiplier are processed.
#### BF16_DIV
``` asm
#////////////////////////////////////////////
#
# BF16_DIV function
#
# input: a0 = a (dividend), a1 = b (divisor)
# return: a0 = a / b
#
#////////////////////////////////////////////
BF16_DIV:
addi sp, sp, -12
sw ra, 8(sp)
sw a1, 4(sp) # 4(sp) = b
sw a0, 0(sp) # 0(sp) = a
srli t1, a0, 15 # t1 = sign_a
srli t2, a1, 15 # t2 = sign_b
srli t3, a0, 7
srli t4, a1, 7
andi t3, t3, 0xFF # t3 = exp_a
andi t4, t4, 0xFF # t4 = exp_b
andi t5, a0, 0x7F # t5 = mant_a
andi t6, a1, 0x7F # t6 = mant_b
xor a2, t1, t2 # a2 = result_sign = sign_a ^ sign_b
addi t0, x0, 0xFF
beq t4, t0, 1f # if exp_b == 0xFF, go to 1
beq t4, x0, 3f # if exp_b == 0, go to 3
8:
beq t3, t0, 6f # if exp_a == 0xFF, go to 6
beq t3, x0, 7f # if exp_a == 0, go to 7
jal x0, CONT_DIV
1:
bne t6, x0, RETURN_B # if mant_b != 0, go to RETURN_B
beq t3, t0, 2f # if exp_a == 0xFF, go to 2
jal x0, RETURN_ZERO_SIGN # return 0
2:
bne t5, x0, RETURN_NAN # if mant_a != 0, go to RETURN_NAN
jal x0, RETURN_NAN
3:
beq t6, x0, 4f # if mant_b == 0, go to 4
jal x0, 8b # go to 8
4:
beq t3, x0, 5f # if exp_a == 0, go to 5
jal x0, RETURN_INF # return inf
5:
beq t5, x0, RETURN_NAN # if mant_a == 0, go to return_nan
jal x0, RETURN_INF # return inf
6:
bne t5, x0, RETURN_A # if mant_a != 0, go to RETURN_A
jal x0, RETURN_INF # return inf
7:
beq t5, x0, RETURN_ZERO_SIGN # if mant_a == 0, go to RETURN_ZERO_SIGN
CONT_DIV:
bne t3, x0, 1f # if exp_a != 0, go to 1
3:
bne t4, x0, 2f # if exp_b != 0, go to 2
jal x0, DIV_SHIFT
1:
ori t5, t5, 0x80 # mant_a = mant_a | 0x80
jal x0, 3b
2:
ori t6, t6, 0x80 # mant_b = mant_b | 0x80
DIV_SHIFT:
slli t1, t5, 15 # t1 = dividend = mant_a << 15
addi t5, t6, 0 # t5 = divisor = mant_b
addi t6, x0, 0 # t6 = quotient = 0
addi t0, x0, 0 # t0 = i = 0
DIV_LOOP:
addi t2, x0, 16
beq t0, t2, DIV_END
slli t6, t6, 1 # quotient <<= 1
addi t2, x0, 15
sub t2, t2, t0 # t2 = 15 - i
sll t3, t5, t2 # t3 = divisor << (15 - i)
blt t1, t3, DIV_SKIP
sub t1, t1, t3 # dividend -= (divisor << (15 - i))
ori t6, t6, 1 # quotient |= 1
DIV_SKIP:
addi t0, t0, 1
jal x0, DIV_LOOP
DIV_END:
srli t3, a0, 7
andi t3, t3, 0xFF # t3 = exp_a
sub a3, t3, t4 # a3 = exp_a - exp_b
addi a3, a3, BF16_EXP_BIAS # a3 = result_exp = exp_a - exp_b + 127
beq t3, x0, 1f # if exp_a == 0, go to 1
2:
beq t4, x0, 3f # if exp_b == 0, go to 3
jal x0, DIV_EXP_DONE
1:
addi a3, a3, -1 # result_exp--
jal x0, 2b # go to 2
3:
addi a3, a3, 1 # result_exp++
DIV_EXP_DONE:
li t0, BF16_SIGN_MASK
and t1, t6, t0 # quotient & 0x8000
bne t1, x0, DIV_SHIFT_DONE
3:
beq t6, x0, 1f
jal x0, DIV_SHIFT_DONE
1:
addi t1, x0, 1 # t1 = 1
bgt a3, t1, 2f # if result_exp > 1, go to 2
jal x0, DIV_SHIFT_DONE
2:
slli t6, t6, 1 # quotient <<= 1
addi a3, a3, -1 # result_exp--
jal x0, 3b
DIV_SHIFT_DONE:
srli t6, t6, 8 # quotient >>= 8
andi t6, t6, 0x7F # t6 = result_mant = quotient & 0x7F
addi t0, x0, 0xFF
beq a3, t0, RETURN_INF # if result_exp == 0xFF, go to RETURN_INF
bgt a3, t0, RETURN_INF # if result_exp > 0xFF, go to RETURN_INF
beq a3, x0, RETURN_ZERO_SIGN # if result_exp == 0, go to RETURN_ZERO_SIGN
blt a3, x0, RETURN_ZERO_SIGN # if result_exp < 0, go to RETURN_ZERO_SIGN
RETURN_DIV:
slli a0, a2, 15 # a0 = result_sign << 15
andi t0, a3, 0xFF
slli t0, t0, 7
or a0, a0, t0 # a0 = (result_sign << 15) | (result_exp << 7)
andi t0, t6, 0x7F # t0 = result_mant = quotient & 0x7F
or a0, a0, t0 # a0 = (result_sign << 15) | (result_exp << 7) | result_mant
lw ra, 8(sp)
addi sp, sp, 12
ret
```
#### BF16_SQRT
``` asm
#////////////////////////////////////////////
#
# BF16_SQRT function
#
# input: a0(old), return: a0(result)
#
#////////////////////////////////////////////
BF16_SQRT:
addi sp, sp, -8
sw ra, 4(sp)
sw a0, 0(sp) # 0(sp) = a
srli t1, a0, 15 # t1 = sign
srli t2, a0, 7 # t2 = exp
andi t2, t2, 0xFF # t2 = exp
andi t3, a0, 0x7F # t3 = mant
addi t0, x0, 0xFF
beq t2, t0, 1f # if exp == 0xFF, go to 1
jal x0, 2f
1:
bne t3, x0, RETURN_A_SQRT # if mant != 0, go to RETURN_A
bne t1, x0, RETURN_NAN_SQRT # if sign != 0, go to RETURN_NAN
jal x0, RETURN_A_SQRT # else return a
2:
beq t2, x0, 3f # if exp == 0, go to 3
jal x0, 4f
3:
beq t1, x0, RETURN_ZERO_SQRT # if sign == 0, go to RETURN_ZERO_SQRT
4:
bne t1, x0, RETURN_NAN_SQRT # if sign != 0, go to RETURN_NAN_SQRT
beq t2, x0, RETURN_ZERO_SQRT # if exp == 0, go to RETURN_ZERO_SQRT
addi t0, x0, BF16_EXP_BIAS
sub t2, t2, t0 # t2 = e = exp - 127
ori t3, t3, 0x80 # t3 = m = mant | 0x80
andi t4, t2, 1 # t4 = e & 1
bne t4, x0, 1f # if (e & 1) != 0, go to 1
jal x0, 2f # else go to 2
1:
slli t3, t3, 1 # t3 = m = mant << 1
addi t2, t2, -1 # e = e - 1
srli t2, t2, 1 # e = e >> 1
addi t2, t2, BF16_EXP_BIAS # t2 = new_exp = e + 127
jal x0, ADJUST_DONE
2:
srli t2, t2, 1 # e = e >> 1
addi t2, t2, BF16_EXP_BIAS # t2 = new_exp = e + 127
ADJUST_DONE:
# t3 = m, t2 = new_exp, t1 = sign
addi t0, x0, 90 # low
addi t1, x0, 256 # high
addi a2, x0, 128 # result
BS_LOOP:
bgt t0, t1, BS_END
add t4, t0, t1
srli t4, t4, 1 # t4 = mid
mv a3, t4 # a3 = mid
addi a0, x0, 0 # acc = 0
mv t6, a3 # multiplicand = mid
mv t5, a3 # multiplier = mid
addi a1, x0, 9 # counter = 9
BS_MUL_LOOP:
andi t4, t5, 1 # t4 = (multiplier & 1)
beq t4, x0, BS_NO_ADD
add a0, a0, t6
BS_NO_ADD:
slli t6, t6, 1
srli t5, t5, 1
addi a1, a1, -1
bne a1, x0, BS_MUL_LOOP
srli a0, a0, 7 # sq
BS_UPDATE:
bgeu t3, a0, BS_IF
j BS_ELSE
BS_IF:
mv a2, a3 # result = mid
addi t0, a3, 1 # low = mid + 1
j BS_LOOP
BS_ELSE:
addi t1, a3, -1 # high = mid - 1
j BS_LOOP
BS_END:
addi a0, x0, 256
bgeu a2, a0, 1f # if (result >= 256), go to 1
addi a0, x0, 128
addi a1, x0, 1
blt a2, a0, 2f # if (result < 128), go to 2
jal x0, 5f
1:
srli a2, a2, 1 # result >>= 1
addi t2, t2, 1 # new_exp++
jal x0, 5f
2:
blt a2, a0, 3f # if (result < 0), go to 3
jal x0, 5f
3: bgt t2, a1, 4f
jal x0, 5f
4:
slli a2, a2, 1 # result <<= 1
addi t2, t2, -1 # new_exp--
jal x0, 2b
5:
andi t3, a2, 0x7F # result_mant = result & 0x7F
addi a0, x0, 0xFF
bge t2, a0, RETURN_INF_SQRT # if new_exp >= 0xFF, go to RETURN_INF
blt t2, x0, RETURN_ZERO_SQRT # if new_exp < 0, go to RETURN_ZERO
beq t2, x0, RETURN_ZERO_SQRT # if new_exp == 0, go to RETURN_ZERO
andi t2, t2, 0xFF # new_exp = new_exp & 0xFF
slli t2, t2, 7
or a0, t2, t3 # a0 = (new_exp << 7) | result_mant
lw ra, 4(sp)
addi sp, sp, 8
ret
RETURN_INF_SQRT: #return (bf16_t) {.bits = (sign << 15) | 0x7F80};
li a0, BF16_EXP_MASK
lw ra, 4(sp)
addi sp, sp, 8
ret
RETURN_A_SQRT:
lw a0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
ret
RETURN_NAN_SQRT:
li a0, 0x7FC0
lw ra, 4(sp)
addi sp, sp, 8
ret
RETURN_ZERO_SQRT:
addi a0, x0, 0
lw ra, 4(sp)
addi sp, sp, 8
ret
```
The `bf16_sqrt()` function computes the square root using a **binary search algorithm** on integer values.
During each iteration, it needs to calculate `mid * mid` to compare with the mantissa value.
Instead of using hardware multiplication, it performs this operation using the **shift-and-add method**, which adds and shifts bits repeatedly to emulate integer multiplication.
```c
/* Binary search for square root of m */
while (low <= high) {
uint32_t mid = (low + high) >> 1;
uint32_t sq = (mid * mid) / 128; /* Square and scale */
if (sq <= m) {
result = mid; /* This could be our answer */
low = mid + 1;
} else {
high = mid - 1;
}
}
```
### Comparison Function
#### BF16_EQ
``` asm
#////////////////////////////////////////////
#
# BF16_Equal function
#
# input: a0 = a, a1 = b, return 1 if a == b else 0
#
#////////////////////////////////////////////
BF16_EQ:
addi sp, sp, -12
sw ra, 8(sp)
sw a1, 4(sp) # 4(sp) = b
sw a0, 0(sp) # 0(sp) = a
jal ra, BF16_ISNAN
bne a0, x0, 1f # if a is NaN, go to Not equal
lw a0, 4(sp) # load b
jal ra, BF16_ISNAN
bne a0, x0, 1f # if b is NaN, go to Not equal
lw a0, 0(sp) # load a
jal ra, BF16_ISZERO
beq a0, x0, 3f # if a is not zero, go to compare a and b
lw a0, 4(sp) # load b
jal ra, BF16_ISZERO
beq a0, x0, 3f # if b is not zero, go to compare a and b
jal x0, 2f # both a and b are zero, go to Equal
1:
addi a0, x0, 0 # Not Equal return 0
lw ra, 8(sp)
addi sp, sp, 12
ret
2:
addi a0, x0, 1 # Equal return 1
lw ra, 8(sp)
addi sp, sp, 12
ret
3:
lw a0, 0(sp) # load a
lw a1, 4(sp) # load b
beq a0, a1, 2b # if a == b, go to Equal
jal x0, 1b # else go to Not Equal
```
#### BF16_LT
``` asm
#////////////////////////////////////////////
#
# BF16_Less_Than function
#
# input: a0 = a, a1 = b, return 1 if a < b else 0
#
#////////////////////////////////////////////
BF16_LT:
addi sp, sp, -12
sw ra, 8(sp)
sw a1, 4(sp) # 4(sp) = b
sw a0, 0(sp) # 0(sp) = a
jal ra, BF16_ISNAN
bne a0, x0, 1f # if a is NaN, go to Not Less
lw a0, 4(sp) # load b
jal ra, BF16_ISNAN
bne a0, x0, 1f # if b is NaN, go to Not Less
lw a0, 0(sp) # load a
jal ra, BF16_ISZERO
beq a0, x0, 3f # if a is not zero, go to check a and b signed bit
lw a0, 4(sp) # load b
jal ra, BF16_ISZERO
beq a0, x0, 3f # if b is not zero, go to check a and b signed bit
jal x0, 1f # both a and b are zero, go to Not Less
1:
addi a0, x0, 0 # Not Less return 0
lw ra, 8(sp)
addi sp, sp, 12
ret
2:
addi a0, x0, 1 # Less return 1
lw ra, 8(sp)
addi sp, sp, 12
ret
3:
lw a0, 0(sp) # load a
lw a1, 4(sp) # load b
srli t0, a0, 15 # t0 = sign_a
srli t1, a1, 15 # t1 = sign_b
bne t0, t1, 4f # if sign_a != sign_b, go to sign bit compare
jal x0, 5f # if sign_a == sign_b
4:
bgt t0, t1, 2b # if sign_a > sign_b, go to Less
jal x0, 1b # else go to Not Less
5:
beq t0, x0, 7f # if a_sign == 0, go to both pos
jal x0, 6f # if a_sign == 1, go to both neg
6:
bgt a0, a1, 2b # if a > b, go to Less
jal x0, 1b # else go to Not Less
7:
blt a0, a1, 2b # if a < b, go to Less
jal x0, 1b # else go to Not Less
```
#### BF16_GT
``` asm
#////////////////////////////////////////////
#
# BF16_Greater_Than function
#
# input: a0 = a, a1 = b, return 1 if a > b else 0
#
#////////////////////////////////////////////
BF16_GT:
mv t0, a0
mv a0, a1
mv a1, t0
jal x0, BF16_LT
```
### Test Case
#### FP32 / BF16 Conversion
| # | FP32 (Original) | Value | BF16 (Original) | Value | Converted FP32 |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | `0x00000000` | 0.0 | `0x0000` | 0.0 | `0x00000000` |
| 2 | `0x3F800000` | 1.0 | `0x3F80` | 1.0 | `0x3F800000` |
| 3 | `0xBF800000` | -1.0 | `0xBF80` | -1.0 | `0xBF800000` |
| 4 | `0x40000000` | 2.0 | `0x4000` | 2.0 | `0x40000000` |
| 5 | `0xC0000000` | -2.0 | `0xC000` | -2.0 | `0xC0000000` |
| 6 | `0x3F000000` | 0.5 | `0x3F00` | 0.5 | `0x3F000000` |
| 7 | `0xBF000000` | -0.5 | `0xBF00` | -0.5 | `0xBF000000` |
| 8 | `0x40490FD0` | 3.14159 | `0x4049` | 3.14159 | `0x40490000` |
| 9 | `0xC0490FD0` | -3.14159 | `0xC049` | -3.14159 | `0xC0490000` |
| 10 | `0x501502F9` | 1e10 | `0x5015` | 1e10 | `0x50150000` |
| 11 | `0xD01502F9` | -1e10 | `0xD015` | -1e10 | `0xD0150000` |
#### Special Values
| # | BF16 (Hex) | Description | isNaN | isInf | isZero |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | `0x7F80` | +Inf | 0 | 1 | 0 |
| 2 | `0xFF80` | -Inf | 0 | 1 | 0 |
| 3 | `0x7FC0` | NaN | 1 | 0 | 0 |
| 4 | `0x0000` | +0.0 | 0 | 0 | 1 |
| 5 | `0x8000` | -0.0 | 0 | 0 | 1 |
---
#### BF16 ADD
| # | Operand A (Hex) | Operand B (Hex) | Expression | Expected Result (BF16 Hex) | Result (Value) |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | `0x3F80` | `0x4000` | 1.0 + 2.0 | `0x4040` | 3.0 |
| 2 | `0x4049` | `0x402E` | 3.140625 + 2.71875 | `0x40BB` | 5.84375 |
| 3 | `0x3F80` | `0xC000` | 1.0 + (-2.0) | `0xBF80` | -1.0 |
| 4 | `0xC000` | `0x3F80` | -2.0 + 1.0 | `0xBF80` | -1.0 |
| 5 | `0x0000` | `0x3F80` | 0.0 + 1.0 | `0x3F80` | 1.0 |
| 6 | `0x3F80` | `0x0000` | 1.0 + 0.0 | `0x3F80` | 1.0 |
| 7 | `0x7F80` | `0x3F80` | +Inf + 1.0 | `0x7F80` | +Inf |
| 8 | `0x3F80` | `0x7F80` | 1.0 + +Inf | `0x7F80` | +Inf |
| 9 | `0xFF80` | `0x3F80` | -Inf + 1.0 | `0xFF80` | -Inf |
| 10 | `0x3F80` | `0xFF80` | 1.0 + -Inf | `0xFF80` | -Inf |
| 11 | `0x7F62` | `0x7F62` | 3e38 + 3e38 (f32→bf16 overflow) | `0x7F80` | +Inf |
---
#### BF16 SUB
| # | Operand A (Hex) | Operand B (Hex) | Expression | Expected Result (BF16 Hex) | Result (Value) |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | `0x4000` | `0x3F80` | 2.0 − 1.0 | `0x3F80` | 1.0 |
| 2 | `0x4049` | `0x402E` | 3.140625 − 2.71875 | `0x3ED8` | 0.421875 |
| 3 | `0x3F80` | `0xC000` | 1.0 − (−2.0) | `0x4040` | 3.0 |
---
#### BF16 MUL
| # | Operand A (Hex) | Operand B (Hex) | Expression | Expected Result (BF16 Hex) | Result (Value) |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | `0x4040` | `0x4080` | 3.0 × 4.0 | `0x4140` | 12.0 |
| 2 | `0x4049` | `0x402E` | 3.140625 × 2.71875 | `0x4108` | 8.5 |
| 3 | `0x7F80` | `0x3F80` | +Inf × 1.0 | `0x7F80` | +Inf |
| 4 | `0x7F80` | `0x0000` | +Inf × 0.0 | `0x7FC0` | NaN |
---
#### BF16 DIV
| # | Operand A (Hex) | Operand B (Hex) | Expression | Expected Result (BF16 Hex) | Result (Value) |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | `0x4120` | `0x4000` | 10.0 ÷ 2.0 | `0x40A0` | 5.0 |
| 2 | `0xC120` | `0x4000` | −10.0 ÷ 2.0 | `0xC0A0` | −5.0 |
| 3 | `0x4128` | `0x3F00` | 10.5 ÷ 0.5 | `0x41A8` | 21.0 |
| 4 | `0x7F80` | `0x7F80` | +Inf ÷ +Inf | `0x7FC0` | NaN |
| 5 | `0x3F80` | `0x7F80` | 1.0 ÷ +Inf | `0x0000` | 0.0 |
| 6 | `0x7F80` | `0x3F80` | +Inf ÷ 1.0 | `0x7F80` | +Inf |
| 7 | `0x3F80` | `0x0000` | 1.0 ÷ 0.0 | `0x7F80` | +Inf |
---
#### BF16 SQRT
| # | Operand (Hex) | Expression | Expected Result (BF16 Hex) | Result (Value) |
|:-:|:-:|:-:|:-:|:-:|
| 1 | `0x4080` | √(4.0) | `0x4000` | 2.0 |
| 2 | `0x4110` | √(9.0) | `0x4040` | 3.0 |
| 3 | `0x42A2` | √(81.0) | `0x4110` | 9.0 |
---
#### BF16 COMPARISON
##### Operands
| Symbol | FP32 (Hex) | Value |
|:--:|:--:|:--:|
| a | `0x3F800000` | 1.0 |
| b | `0x40000000` | 2.0 |
| c | `0x3F800000` | 1.0 |
| nan | `0x7FC0` | NaN |
##### Equality Tests (`BF16_EQ`)
| Case | Expression | Expected Result |
|:--:|:--:|:--:|
| 1 | a == c → 1.0 == 1.0 | **True (1)** |
| 2 | a == b → 1.0 == 2.0 | **False (0)** |
| 3 | nan == nan | **False (0)** |
##### Less-Than Tests (`BF16_LT`)
| Case | Expression | Expected Result |
|:--:|:--:|:--:|
| 4 | a < b → 1.0 < 2.0 | **True (1)** |
| 5 | b < a → 2.0 < 1.0 | **False (0)** |
| 6 | a < c → 1.0 < 1.0 | **False (0)** |
| 7 | nan < a | **False (0)** |
##### Greater-Than Tests (`BF16_GT`)
| Case | Expression | Expected Result |
|:--:|:--:|:--:|
| 8 | b > a → 2.0 > 1.0 | **True (1)** |
| 9 | a > b → 1.0 > 2.0 | **False (0)** |
| 10 | nan > a | **False (0)** |
---
#### BF16 Edge Case Conversion
##### Input Operands
| Symbol | FP32 (Hex) | Description |
|:--:|:--:|:--:|
| v₁ | `0x00000001` | 1 × 10⁻⁴⁵ |
| v₂ | `0x7E967699` | 1 × 10³⁸ |
| v₃ | `0x006CE3EE` | 1 × 10⁻³⁸ |
##### Expected Results
| Case | Operation | Expected Result (FP32 Hex) | Meaning |
|:--:|:--:|:--:|:--:|
| 1 | `F32_TO_BF16(v₁)` | `0x00000000` | Underflow → 0 |
| 2 | `F32_TO_BF16(v₂)` → `× 10.0` → `BF16_TO_F32` | `0x7F800000` | Overflow → +Inf |
| 3 | `F32_TO_BF16(1e10)` ÷ `F32_TO_BF16(v₃)` → `BF16_TO_F32` | `0x00000000` | Underflow → 0 |
---
#### BF16 Rounding
##### Input Operands
| Symbol | FP32 (Hex) | Description |
|:--:|:--:|:--:|
| r₁ | `0x3FC00000` | 1.5 (exactly halfway, should round to nearest even) |
| r₂ | `0x3F800347` | 1.0001 (tiny fraction beyond 1.0) |
##### Expected Results
| Case | Operation | Expected Result (FP32 Hex) | Meaning |
|:--:|:--:|:--:|:--:|
| 1 | `F32_TO_BF16(r₁)` → `BF16_TO_F32` | `0x3FC00000` | Stays **1.5**, correctly rounded |
| 2 | `F32_TO_BF16(r₂)` → `BF16_TO_F32` | `0x3F800000` | Rounds down to **1.0** (nearest representable) |
---
### Test Result

