[toc]
# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by [< jningmin >](https://github.com/jningmin/2025_computer_architecture)
## [Quiz 1](https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=6284%3A2518%3A0%3A1759746369%3AseANJk) problem B
<h3>Implementation_Uf8 and C code </h3>
In this problem, we are asked to implement and test a simplified floating-point encoding scheme called UF8 (Unsigned Float 8-bit).
The goal is to convert between a 32-bit unsigned integer (uint32_t) and a compact 8-bit representation (uf8), while preserving approximate numeric relationships.
C code:
https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=6284%3A2518%3A0%3A1759746369%3AseANJk
<h3>Assembly code</h3>
My test cases are: `A: 15, B: 125, C: 225`.
The main functions are shown below; for the full version of the code, please scroll to the bottom.
:::info
#### Step-by-step Explanation:
:::
:::info
1. The program calls `Decode`, which converts the UF8 value (in register a1) into a decoded integer returned in a0.
2. After restoring the stack, the decoded integer is stored in temporary register t0.
3. The result is printed as the decoded output using system calls (`ecall`) for string and integer printing.
4. The `Encode` subroutine is called to re-encode the integer back into UF8 format.
5. After encoding, the program checks:
   - whether the re-encoded value matches the original input (`bne t0, a1, error`), and
   - whether the encoded value maintains strict monotonicity (`blt t0, t2, error`).
6. If both conditions are satisfied, the encoded value is printed in hexadecimal by calling `print_hex`.
   The routine then restores the stack pointer and returns control to the main program.
:::
**Decoding**
>In the decoding process,
>→mantissa is obtained by masking the lower four bits of the input.
>→exponent is extracted through a right-shift operation.
The exponent is then negated (two's complement) and 15 is added, giving 15 − exponent. The offset is generated by shifting the constant 0x7FFF right by that amount and then left by 4, which equals $(2^{\text{exponent}} - 1) \times 16$ and places the decoded number at the start of the range covered by that exponent. The mantissa is scaled by left-shifting it by the exponent, and the final decoded integer is the sum of the scaled mantissa and the offset:
$\text{value = ( mantissa ≪ exponent ) + offset}$
```s
Decode:
andi t0,a1,0x0f ### store mantissa in t0
srli t1,a1,4 ### store expo in t1
not t2,t1
addi t2,t2,1 ### two's complement of expo
addi t2,t2,15
li t3,0x7fff
srl t3,t3,t2
slli t3,t3,4 ### store offset in t3
sll t4,t0,t1
add a0,t4,t3 ### store trans_num in a0
xor t0,t0,t0
xor t3,t3,t3
ret
```
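The decoding steps above can be sketched in C as follows. This is a minimal sketch, not the quiz's reference code: the name `uf8_decode_sketch` is hypothetical, and the offset is built the same way as in the assembly (0x7FFF shifted right by 15 − exponent, then left by 4).

```c
#include <stdint.h>

/* Hypothetical helper: decode one UF8 byte.
 * Low nibble = mantissa, high nibble = exponent. */
static uint32_t uf8_decode_sketch(uint8_t fl)
{
    uint32_t mantissa = fl & 0x0F;
    uint32_t exponent = fl >> 4;
    /* offset = (2^exponent - 1) * 16 */
    uint32_t offset = (0x7FFFu >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}
```

For the three test inputs used here (15, 125, 225, i.e. 0x0F, 0x7D, 0xE1), this sketch yields 15, 3696, and 278512.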
**Encoding**
>The encoding routine uses a logarithmic quantization approach similar to floating-point representation.
It first estimates the exponent by locating the position of the most significant bit (MSB), found through a Count-Leading-Zero (CLZ) operation.
This exponent effectively determines the dynamic range in which the integer lies.
>Once the exponent is estimated, the algorithm constructs an overflow threshold that represents the smallest value encodable with that exponent.
This threshold is iteratively adjusted until it properly bounds the input value.
The mantissa is then derived by measuring how far the value exceeds this overflow boundary, scaled by the exponent’s power of two.
>Mathematically, the overflow is built as an accumulated sum of power-of-two increments:$$
\text{overflow} = \sum_{i=0}^{e-1} 2^i \times 16
$$
Finally, the mantissa and exponent are combined into a single 8-bit quantity:$$
\text{UF8 = ( exponent ≪ 4 ) ∣ mantissa}
$$
```s
Encode:
addi sp, sp, -16
sw ra, 12(sp) # store
add s4,a0,x0 ### s4=value
li t0,16
bltu a0,t0,no_need_more_op ### a0=value
jal ra,clz
mv t1,a0 ### lz = clz(value)
li t2,31
sub s1,t2,t1 ### store msb in s1
li s2,0 ### store exponent in s2
li s3,0 ### store overflow in s3
li t6,5
bltu s1,t6,find_exact_exponent
addi s2,s1,-4
li t6,15
bltu s2,t6, Calculate_overflow ### keep the exponent if it is < 15, otherwise clamp it to 15 below
li s2,15
Calculate_overflow: ###for loop
li s8,0 ###counting
loop1:
beq s8,s2,Adjust_if_estimate
slli s3,s3,1
addi s3,s3,16
addi s8,s8,1
j loop1
Adjust_if_estimate:
bgtz s2,check1
j find_exact_exponent
check1:
bltu s4,s3,check2
j find_exact_exponent
check2:
addi s3,s3,-16
srli s3,s3,1
addi s2,s2,-1
j Adjust_if_estimate
find_exact_exponent:
li s7,20
bge s2, s7, end_encode
li t6,15
bgtu s2,t6,end_encode
slli s5,s3,1 ### 7f0 should be in s3 overflow
addi s5,s5,16 ### next_overflow store in s5
bltu s4,s5,end_encode
mv s3,s5
addi s2,s2,1
j find_exact_exponent
end_encode:
sub s6,s4,s3
srl s6,s6,s2 ### store other mantissa in s6
slli t6,s2,4
add a0,s6,t6
lw ra, 12(sp)
addi sp, sp, 16
ret
```
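The encoding idea can also be sketched in C. This is a simplified sketch under stated assumptions, not the routine above: it skips the CLZ-based estimate and scans the exponents upward from 0, and the names `uf8_offset` and `uf8_encode_sketch`, as well as the final clamp of the mantissa to 15, are hypothetical additions for illustration.

```c
#include <stdint.h>

/* Smallest value encodable with exponent e: sum_{i=0}^{e-1} 2^i * 16. */
static uint32_t uf8_offset(unsigned e)
{
    return ((1u << e) - 1u) << 4;
}

/* Hypothetical helper: encode a 32-bit value as UF8 by linear exponent scan. */
static uint8_t uf8_encode_sketch(uint32_t value)
{
    if (value < 16)
        return (uint8_t) value;          /* exponent 0: value is the mantissa */

    unsigned e = 0;
    while (e < 15 && value >= uf8_offset(e + 1))
        e++;                             /* largest e with offset(e) <= value */

    uint32_t mantissa = (value - uf8_offset(e)) >> e;
    if (mantissa > 15)
        mantissa = 15;                   /* assumption: saturate out-of-range inputs */
    return (uint8_t) ((e << 4) | mantissa);
}
```

The linear scan costs at most 15 iterations; the CLZ estimate in the assembly exists to shorten that search.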
**Counting leading zeros**
>Initialize n = 32 and c = 16.
Iteratively shift the input right by c bits and check whether the result is non-zero.
If non-zero, decrement n by c and update x with the shifted value.
Halve c each iteration to perform a binary search over bit positions.
Once c reaches zero, compute the final count as n - x, yielding the total number of leading zeros.
Return this count in a0.
```s
clz:
li t0,32 ### n=32
li t1,16 ### c=16
mv t3,a0
loop0:
beqz t1,return0
srl t2,t3,t1 ### y = x >> c
beqz t2,devide_c
sub t0,t0,t1
mv t3,t2
j loop0
devide_c:
srli t1,t1,1
j loop0
return0:
sub t0,t0,t3
mv a0,t0
ret
no_need_more_op:
lw ra, 12(sp) # restore ra stored at start of Encode
addi sp, sp, 16 # restore stack
ret
```
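The binary-search CLZ described above can be written in C as a short sketch (the name `clz32_sketch` is hypothetical). Unlike the assembly loop, this version halves `c` on every iteration, whether or not the shifted value was non-zero.

```c
#include <stdint.h>

/* Hypothetical helper: count leading zeros by binary search over bit positions. */
static unsigned clz32_sketch(uint32_t x)
{
    unsigned n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {          /* upper part is non-zero: discard the lower c bits */
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;         /* x is now 0 or 1 */
}
```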
`print_hex` displays the encoded value in human-readable hexadecimal form using only integer instructions, without hardware floating-point instructions or registers.
>`print_hex` splits the 8-bit number into two 4-bit nibbles:
the high nibble `(a0 >> 4)` and
the low nibble `(a0 & 0xF)`.
Each nibble is passed to the `print_nibble` routine for ASCII conversion:
values 0–9 are mapped to characters '0'–'9', and
values 10–15 are mapped to 'a'–'f'.
The system call `ecall` with code `11` prints the ASCII character to the console.
```s
print_hex:
addi sp, sp, -16
sw ra, 8(sp) # store
srli t0, a0, 4 # high nibble
andi t1, a0, 0xF # low nibble
la a0, msg6
li a7, 4
ecall
mv a0, t0
jal ra, print_nibble
mv a0, t1
jal ra, print_nibble
lw ra,8(sp)
addi sp,sp,16
ret
print_nibble:
li t2,10
bltu a0, t2, digit
addi a0, a0, 87
j out
digit:
addi a0, a0, 48
out:
li a7, 11 # print_char
ecall
ret
```
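The nibble-to-ASCII mapping can be illustrated with a small C sketch (the names `nibble_to_hex` and `byte_to_hex` are hypothetical). The constants match the assembly: adding 48 gives '0'–'9', and adding 87 (that is, 'a' − 10) gives 'a'–'f'.

```c
#include <stdint.h>

/* Hypothetical helper: map a 4-bit value to its lowercase hex character. */
static char nibble_to_hex(unsigned nibble)
{
    nibble &= 0xF;
    return (char) (nibble < 10 ? nibble + 48 : nibble + 87);
}

/* Hypothetical helper: write the two hex digits of a byte plus a terminator. */
static void byte_to_hex(uint8_t value, char out[3])
{
    out[0] = nibble_to_hex(value >> 4);   /* high nibble */
    out[1] = nibble_to_hex(value & 0xF);  /* low nibble */
    out[2] = '\0';
}
```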
:::info
Result displayed on the console:
:::

After all these stages are done, the registers are updated like this:

:::info
Memory viewer:
:::
Below denotes the data section of memory.

:::info
The clock cycles of program:
:::

## [Quiz 1](https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=6284%3A2518%3A0%3A1759746369%3AseANJk) problem C
<h3>Implementation_bfloat16 and C code </h3>
In this project, we implemented and verified a simplified 16-bit floating-point format called Bfloat16 (BF16) entirely in RISC-V assembly language.
This project demonstrates the ability to simulate IEEE-like floating-point behavior using only integer and bitwise instructions.
C code:
https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=10087%3A7233%3A0%3A1759746474%3Ap5GiEp
https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=18828%3A2804%3A0%3A1759746490%3ABFJGmJ
<h3>Assembly code</h3>
The main functions are shown below; for the full version of the assembly code, please scroll down to the bottom.
:::info
Six main test functions
→ test basic conversions
→ test special values
→ test arithmetic
→ test comparisons
→ test edge cases
→ test rounding
:::
**test_basic_conversions()**
>**f32_to_bf16**:
This function converts a 32-bit IEEE single-precision float (f32) into a 16-bit bfloat16 (bf16) representation.
It first extracts the exponent and mantissa, then handles special cases such as NaN and infinity.
The conversion keeps the sign and exponent bits and drops the lower 16 bits of the mantissa.
Round-to-nearest-even is applied before truncation by adding 0x7FFF plus the lowest kept bit (bit 16).
>**bf16_to_f32**:
This reverses the previous operation.
It converts a 16-bit bfloat16 number back into 32-bit float representation by shifting left 16 bits.
The lower bits of the mantissa are filled with zeros.
```s
f32_to_bf16:
mv t1,a0
# ((f32bits >> 23) & 0xFF)
srli t2, t1, 23
andi t2, t2, 0xFF #store expo in t2
li t3, 0xFF
beq t2,t3,is_nan_inf
srli t4,t1,16
andi t4,t4,1
li t3,0x7FFF
add t4,t4,t3
add t1,t1,t4
srli t2,t1,16
mv a0,t2
ret
bf16_to_f32:
slli t3,a0,16
mv a0,t3
ret
```
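The conversion logic can be sketched in C on raw bit patterns. This is a minimal sketch: the names `f32_bits_to_bf16` and `bf16_to_f32_bits` are hypothetical, and working on `uint32_t` bit patterns rather than `float` is an assumption made to mirror the integer-only assembly.

```c
#include <stdint.h>

/* Hypothetical helper: convert f32 bits to bf16 bits with round-to-nearest-even. */
static uint16_t f32_bits_to_bf16(uint32_t bits)
{
    if (((bits >> 23) & 0xFF) == 0xFF)      /* NaN or Inf: keep the upper half */
        return (uint16_t) (bits >> 16);
    bits += ((bits >> 16) & 1) + 0x7FFF;    /* ties go to the even result */
    return (uint16_t) (bits >> 16);
}

/* Hypothetical helper: widen bf16 bits back to f32 bits (low mantissa bits = 0). */
static uint32_t bf16_to_f32_bits(uint16_t h)
{
    return (uint32_t) h << 16;
}
```

For example, 0x3F808000 lies exactly halfway between two BF16 values and rounds down to the even 0x3F80, while 0x3F818000 rounds up to the even 0x3F82.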
**test_special_values()**
>This test routine verifies that special BF16 values (Inf, -Inf, NaN, +0, -0) are correctly detected.
It calls helper functions like `bf16_isinf`, `bf16_isnan`, `bf16_iszero`, and `bf16_is_neg_zero`.
Each case prints a message if it fails the expected check.
Essentially, it ensures the system correctly interprets bit patterns for special cases.
```s
test_special_values:
la a0, msg_start_special
li a7, 4
ecall
li a0, 0x7F80 #Test +Inf
jal ra, bf16_isinf
beqz a0, fail_posinf
li a0, 0x7F80
jal ra, bf16_isnan
bnez a0, fail_inf_nan
li a0, 0xFF80 #Test -Inf
jal ra, bf16_isinf
beqz a0, fail_neginf
li a0, 0x7FC0 #Test NaN
jal ra, bf16_isnan
beqz a0, fail_nan
li a0, 0x7FC0
jal ra, bf16_isinf
bnez a0, fail_nan_inf
li a0, 0x0000 # Test +0
jal ra, bf16_iszero
beqz a0, fail_zero
li a0, 0x8000 # Test -0
jal ra, bf16_is_neg_zero
beqz a0, fail_negzero
la a0, Special_msg
li a7, 4
ecall
```
**test_arithmetic()**
>**Addition :**
>`bf16_add` is the main arithmetic routine; it performs addition directly and is reused by `bf16_sub` for subtraction.
>It follows IEEE-style floating-point addition logic:
>1. Extract signs, exponents, and mantissas.
>2. Align the operand with the smaller exponent by shifting its mantissa right.
>3. Depending on the signs:
>   - If they are the same → add the mantissas, normalize, and adjust the exponent.
>   - If they differ → subtract the smaller mantissa from the larger one and normalize the result.
>4. Detect overflow (→ set to Inf) and underflow (→ set to 0).
>5. Reconstruct the final 16-bit BF16 result: `(sign << 15) | (exp << 7) | mantissa`.
```s
bf16_add:
addi sp,sp,-8
sw ra,4(sp)
li t5,0xFF
li t6,0x7F
srli t0, a0, 15 # sign_a in t0
andi t0, t0, 1
srli t1, a1, 15 # sign_b in t1
andi t1, t1, 1
srli t2, a0, 7 # exp_a in t2
and t2, t2, t5
srli t3, a1, 7 # exp_b in t3
and t3, t3, t5
and t4, a0, t6 # mant_a in t4
and t5, a1, t6 # mant_b in t5
#s6 = result_sign, s7 = result_expo, s8 = mantissa
# if a is zero (exp_a==0 && mant_a==0) => return b
beqz a0,return_b
beqz a1,return_a
beqz t2, skip_a_norm
ori t4, t4, 0x80 # mant_a |= 0x80
skip_a_norm:
beqz t3, skip_b_norm
ori t5, t5, 0x80 # mant_b |= 0x80
skip_b_norm:
sub s9, t2, t3 # exp_diff = exp_a - exp_b
bgtz s9, exp_a_bigger
bltz s9, exp_b_bigger
j exp_equal
exp_a_bigger:
mv s7, t2 # result_exp = exp_a
li t6, 8
bgt s9, t6, return_a
srl t5, t5, s9 # mant_b >>= exp_diff
j continue_add
exp_b_bigger:
neg s9, s9 # exp_diff = -exp_diff (now positive)
mv s7, t3 # result_exp = exp_b
li t6, 8
bgt s9, t6, return_b # |exp_diff| > 8: a is too small to affect b
srl t4, t4, s9 # mant_a >>= -exp_diff
j continue_add
exp_equal:
mv s7, t2
continue_add:
beq t0, t1, same_sign
j diff_sign
same_sign:
mv s6, t0 # result_sign = sign_a
add s8, t4, t5
li t6, 0x100
and s9, s8, t6
beqz s9, normalize_end
srli s8, s8, 1 # => mantissa >> 1
addi s7, s7, 1 # exponent++
li t6, 0xFF
blt s7, t6, normalize_end
slli t6, s6, 15
li s9, 0x7F80
or t6, t6, s9
mv a0, t6
j add_exit
diff_sign:
bgeu t4, t5, mant_a_ge
mv s6, t1 # result_sign = sign_b
sub s8, t5, t4 # result_mant = mant_b - mant_a
j normalize_check
mant_a_ge:
mv s6, t0
sub s8, t4, t5
normalize_check:
beqz s8, return_zero
normalize_loop:
li t6, 0x80
and s9, s8, t6
bnez s9, normalize_end
slli s8, s8, 1
addi s7, s7, -1 # exponent--
blez s7, return_zero
j normalize_loop
normalize_end:
slli t6, s6, 15 # (sign << 15)
andi s9, s7, 0xFF
slli s9, s9, 7 # (exp << 7)
or t6, t6, s9
andi s9, s8, 0x7F # mantissa (7 bits)
or t6, t6, s9
mv a0, t6
j add_exit
return_a:
mv a0, a0
j add_exit
return_b:
mv a0, a1
j add_exit
return_zero:
li a0, 0
j add_exit
add_exit:
lw ra, 4(sp)
addi sp, sp, 8
ret
```
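The addition flow above can be summarized by the following C sketch. It is an illustration under stated assumptions, not the quiz's reference implementation: the name `bf16_add_sketch` is hypothetical, NaN and Inf inputs are not handled, and a result whose exponent reaches zero during normalization is returned as +0, as in the assembly.

```c
#include <stdint.h>

/* Hypothetical helper: simplified BF16 addition on raw bit patterns. */
static uint16_t bf16_add_sketch(uint16_t a, uint16_t b)
{
    unsigned sign_a = a >> 15, sign_b = b >> 15;
    int exp_a = (a >> 7) & 0xFF, exp_b = (b >> 7) & 0xFF;
    unsigned mant_a = a & 0x7F, mant_b = b & 0x7F;

    if (!exp_a && !mant_a) return b;      /* a is zero */
    if (!exp_b && !mant_b) return a;      /* b is zero */
    if (exp_a) mant_a |= 0x80;            /* restore the hidden bit */
    if (exp_b) mant_b |= 0x80;

    int exp, diff = exp_a - exp_b;
    if (diff > 0) {
        exp = exp_a;
        if (diff > 8) return a;           /* b is negligible */
        mant_b >>= diff;
    } else if (diff < 0) {
        exp = exp_b;
        if (diff < -8) return b;          /* a is negligible */
        mant_a >>= -diff;
    } else {
        exp = exp_a;
    }

    unsigned sign, mant;
    if (sign_a == sign_b) {
        sign = sign_a;
        mant = mant_a + mant_b;
        if (mant & 0x100) {               /* carry out: renormalize */
            mant >>= 1;
            if (++exp >= 0xFF)
                return (uint16_t) ((sign << 15) | 0x7F80);
        }
    } else {
        if (mant_a >= mant_b) { sign = sign_a; mant = mant_a - mant_b; }
        else                  { sign = sign_b; mant = mant_b - mant_a; }
        if (!mant) return 0;
        while (!(mant & 0x80)) {          /* shift left until the hidden bit returns */
            mant <<= 1;
            if (--exp <= 0) return 0;
        }
    }
    return (uint16_t) ((sign << 15) | ((unsigned) (exp & 0xFF) << 7) | (mant & 0x7F));
}
```

With this sketch, 1.0 (0x3F80) plus 2.0 (0x4000) gives 3.0 (0x4040), and subtraction follows by flipping bit 15 of the second operand before the call.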
>**Subtraction :**
Subtraction is implemented by flipping the sign bit of the second operand (XOR with 0x8000) and then calling bf16_add.
This reuses the same addition logic since subtraction is just addition with a negated operand.
```s
bf16_sub:
addi sp, sp, -4
sw ra, 0(sp)
li t0, 0x8000
xor a1, a1, t0
jal ra,bf16_add
lw ra,0(sp)
addi sp,sp,4
ret
```
>**Multiplication :**
>1. Extract sign, exponent, and mantissa from both operands.
>2. Result sign = `sign_a XOR sign_b`.
>3. Result exponent = `exp_a + exp_b − bias`, with bias = 127.
>4. Multiply the mantissas (8-bit normalized, producing up to 16 bits).
>5. Normalize the result mantissa (shift if overflow/underflow).
>6. Handle special cases:
>   - Overflow → return `±Inf`.
>   - Underflow → return `0`.
>   - Otherwise → combine sign, exponent, and mantissa to form the BF16 result.
```s
bf16_mul:
addi sp, sp, -4
sw ra, 0(sp)
# constants
li t6, 0xFF
li s4, 0x7F
li s5, 127 # BF16_EXP_BIAS
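beqz a0, mul_done # a == +0: a0 already holds the +0 result
bnez a1, mul_operands_nonzero
li a0, 0 # b == +0: return +0
j mul_done # stay inside bf16_mul so its own 4-byte frame is restored
mul_operands_nonzero: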
# extract sign bits
srli t0, a0, 15
andi t0, t0, 1 # sign_a (t0)
srli t1, a1, 15
andi t1, t1, 1 # sign_b(t1)
# extract exponents (8 bits)
srli t2, a0, 7
and t2, t2, t6 # exp_a (t2)
srli t3, a1, 7
and t3, t3, t6 # exp_b (t3)
# extract mantissas (7 bits)
and t4, a0, s4 # mant_a (t4)
and t5, a1, s4 # mant_b (t5)
# result sign = sign_a ^ sign_b (s7)
xor s7, t0, t1
# result expo (s8)
# result mant (s9)
# exp_adjust = 0
li s6, 0
# normalize mant_a
beqz t2, denorm_a
norm_a:
ori t4, t4, 0x80
j mant_a_done
denorm_a:
beqz t4, mant_a_done
denorm_a_loop:
andi t0, t4, 0x80
bnez t0, mant_a_done
slli t4, t4, 1
addi s6, s6, -1
j denorm_a_loop
mant_a_done:
# normalize mant_b
beqz t3, denorm_b
norm_b:
ori t5, t5, 0x80
j mant_b_done
denorm_b:
beqz t5, mant_b_done
denorm_b_loop:
andi t0, t5, 0x80
bnez t0, mant_b_done
slli t5, t5, 1
addi s6, s6, -1
j denorm_b_loop
mant_b_done:
# mantissa multiply (8x8 = 16-bit)
mul s9, t4, t5
# result_exp = exp_a + exp_b - bias + exp_adjust
add s8, t2, t3
add s8, s8, s6
addi s8, s8, -127
# normalize mantissa
li t0, 0x8000
and t1, s9, t0
bnez t1, shift8
# no overflow: shift right 7 bits
srli s9, s9, 7
andi s9, s9, 0x7F
j norm_done
shift8:
srli s9, s9, 8
andi s9, s9, 0x7F
addi s8, s8, 1
norm_done:
# overflow check
li t0, 0xFF
bge s8, t0, set_inf
# underflow check
blez s8, underflow
# ===== normal result =====
slli t0, s7, 15
andi t1, s8, 0xFF
slli t1, t1, 7
or t0, t0, t1
andi t1, s9, 0x7F
or a0, t0, t1
j mul_done
# underflow case
underflow:
li t0, -6
blt s8, t0, return_zero_mul
li t1, 1
sub t1, t1, s8
srl s9, s9, t1
li s8, 0
slli t0, s7, 15
andi t1, s8, 0xFF
slli t1, t1, 7
or t0, t0, t1
andi t1, s9, 0x7F
or a0, t0, t1
j mul_done
# overflow (Inf)
set_inf:
slli a0, s7, 15
li t6,0x7F80
or a0, a0, t6
j mul_done
# zero result
return_zero_mul:
slli a0, s7, 15
# finish
mul_done:
lw ra, 0(sp)
addi sp, sp, 4
ret
```
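The core of the multiplication can be sketched in C as below. This is a reduced sketch under stated assumptions, not a port of the routine above: the name `bf16_mul_sketch` is hypothetical, NaN and Inf inputs are not handled, and zero or subnormal operands and underflowing results are flushed to a signed zero instead of being denormalized.

```c
#include <stdint.h>

/* Hypothetical helper: simplified BF16 multiplication for normal operands. */
static uint16_t bf16_mul_sketch(uint16_t a, uint16_t b)
{
    uint16_t sign = (uint16_t) ((a ^ b) & 0x8000);
    int exp_a = (a >> 7) & 0xFF, exp_b = (b >> 7) & 0xFF;
    if (exp_a == 0 || exp_b == 0)
        return sign;                              /* assumption: flush to signed zero */

    uint32_t prod = (uint32_t) ((a & 0x7F) | 0x80) * (uint32_t) ((b & 0x7F) | 0x80);
    int exp = exp_a + exp_b - 127;
    uint32_t mant;
    if (prod & 0x8000) {                          /* product in [2, 4): shift one more */
        mant = prod >> 8;
        exp++;
    } else {                                      /* product in [1, 2) */
        mant = prod >> 7;
    }

    if (exp >= 0xFF) return (uint16_t) (sign | 0x7F80);   /* overflow -> Inf */
    if (exp <= 0)    return sign;                         /* underflow -> 0 */
    return (uint16_t) (sign | ((unsigned) exp << 7) | (mant & 0x7F));
}
```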
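>**Division :**
>1. Result sign = `sign_a XOR sign_b`.
>2. Result exponent = `exp_a − exp_b + bias`.
>3. Mantissa division (approximated with integer division).
>4. Normalize the result if `mantissa >= 0x100`.
>5. Recombine the bits into the final BF16 format.
>
>This is a simplified model of floating-point division, so precision loss is expected. Zero, Inf, NaN, and subnormal operands are not handled, and because `(mant_a << 7) / mant_b` always lies in 0x40–0xFF, a quotient below 0x80 (which occurs when mant_a < mant_b) is not renormalized.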
```s
bf16_div:
li t6, 0xFF
li s4, 0x7F
li s5, 127 # BF16_EXP_BIAS
# extract sign bits
srli t0, a0, 15
andi t0, t0, 1 # sign_a
srli t1, a1, 15
andi t1, t1, 1 # sign_b
# extract exponents
srli t2, a0, 7
and t2, t2, t6 # exp_a
srli t3, a1, 7
and t3, t3, t6 # exp_b
# extract mantissas
and t4, a0, s4 # mant_a
and t5, a1, s4 # mant_b
# result sign = sign_a ^ sign_b
xor s7, t0, t1
# add hidden 1-bit
ori t4, t4, 0x80 # mant_a |= 0x80
ori t5, t5, 0x80 # mant_b |= 0x80
# result_exp = exp_a - exp_b + bias
sub s8, t2, t3
add s8, s8, s5 # s8 = exp_a - exp_b + 127
# mantissa division (approx)
slli s9, t4, 7 # mant_a << 7
divu s9, s9, t5 # result_mant = mant_a / mant_b
# normalization (if mantissa >= 0x100)
li t0, 0x100
and t1, s9, t0
beqz t1, skip_norm
srli s9, s9, 1
addi s8, s8, 1
skip_norm:
# pack result bits
slli t0, s7, 15 # sign << 15
slli t1, s8, 7 # exp << 7
or t0, t0, t1
and s9, s9, s4 # mant & 0x7F
or a0, t0, s9
ret
```
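For comparison, here is a C sketch of a division that renormalizes the quotient. It deliberately differs from the assembly above: it pre-shifts the dividend by 8 bits instead of 7 and lowers the exponent when the quotient falls below 0x100. The name `bf16_div_sketch` is hypothetical, and treating a zero or subnormal divisor as division by zero (returning Inf) is an assumption of this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: simplified BF16 division for normal operands. */
static uint16_t bf16_div_sketch(uint16_t a, uint16_t b)
{
    uint16_t sign = (uint16_t) ((a ^ b) & 0x8000);
    int exp_a = (a >> 7) & 0xFF, exp_b = (b >> 7) & 0xFF;
    if (exp_b == 0) return (uint16_t) (sign | 0x7F80);  /* assumption: x / 0 -> Inf */
    if (exp_a == 0) return sign;                        /* assumption: 0 / x -> 0 */

    unsigned mant_a = (a & 0x7F) | 0x80, mant_b = (b & 0x7F) | 0x80;
    int exp = exp_a - exp_b + 127;

    unsigned q = (mant_a << 8) / mant_b;   /* ratio * 256, in 0x80..0x1FE */
    if (q >= 0x100)
        q >>= 1;                           /* ratio in [1, 2): back to 8 bits */
    else
        exp--;                             /* ratio in (0.5, 1): borrow from exponent */

    if (exp >= 0xFF) return (uint16_t) (sign | 0x7F80);
    if (exp <= 0)    return sign;
    return (uint16_t) (sign | ((unsigned) exp << 7) | (q & 0x7F));
}
```

With this sketch, 10.0 / 2.0 (0x4120 / 0x4000) gives 5.0 (0x40A0), and 1.0 / 1.5 gives 0x3F2A, about 0.664.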
>**Square root :**
>1. Extract the exponent and mantissa.
>2. Adjust the exponent:
>   - If the unbiased exponent is odd → shift the mantissa left by one and reduce the exponent by 1.
>   - Then halve the exponent (since $\sqrt{2^e} = 2^{\frac{e}{2}}$).
>3. Binary-search the range 90–256 for the largest mantissa whose square, divided by 128, does not exceed the adjusted mantissa.
>4. Normalize and reassemble the BF16 result.
>
>This approximates the square root using integer arithmetic only.
```s
bf16_sqrt:
li s5, 127 # BF16_EXP_BIAS
li t6, 0xFF
li s4, 0x7F
# exponent and mantissa
srli t0, a0, 7
and t0, t0, t6 # exp
and t1, a0, s4 # mant
# e = exp - bias
addi t2, t0, -127 # e = exp - 127
li t3, 1
and t4, t2, t3 # t4 = e & 1
ori t5, t1, 0x80 # m = 0x80 | mant
beqz t4, sqrt_even_exp
slli t5, t5, 1 # m <<= 1
addi t2, t2, -1
sqrt_even_exp:
srai t6, t2, 1
add t6, t6, s5 # new_exp = (e>>1)+bias
#binary search
li s0, 90 # low
li s1, 256 # high
li s2, 128 # result
sqrt_loop:
bgt s0, s1, sqrt_done
add s3, s0, s1
srli s3, s3, 1 # mid = (low + high) >> 1
mul s4, s3, s3
srli s4, s4, 7 # sq = (mid*mid)/128
ble s4, t5, sqrt_le
addi s1, s3, -1 # high = mid - 1
j sqrt_loop
sqrt_le:
mv s2, s3 # result = mid
addi s0, s3, 1 # low = mid + 1
j sqrt_loop
sqrt_done:
li t0, 256
blt s2, t0, sqrt_check_low
srli s2, s2, 1
addi t6, t6, 1 # new_exp++
j sqrt_pack
sqrt_check_low:
li t1, 128
bge s2, t1, sqrt_pack
sqrt_shift_up:
blt t6, zero, sqrt_pack
slli s2, s2, 1
addi t6, t6, -1
blt s2, t1, sqrt_shift_up
sqrt_pack:
andi s2, s2, 0x7F # new_mant = result & 0x7F
slli t6, t6, 7
or a0, t6, s2
ret
```
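The square-root steps can be sketched in C as follows. This is a minimal sketch assuming a positive, normal input: the name `bf16_sqrt_sketch` is hypothetical, and returning 0 for zero or subnormal inputs is an assumption. Negative inputs, Inf, and NaN are not handled.

```c
#include <stdint.h>

/* Hypothetical helper: BF16 square root via integer binary search on the mantissa. */
static uint16_t bf16_sqrt_sketch(uint16_t a)
{
    int exp = (a >> 7) & 0xFF;
    if (exp == 0)
        return 0;                          /* assumption: zero/subnormal -> 0 */

    uint32_t m = (a & 0x7F) | 0x80;        /* mantissa with hidden bit, scale 128 */
    int e = exp - 127;
    if ((exp & 1) == 0) {                  /* bias is odd, so even exp means odd e */
        m <<= 1;
        e -= 1;
    }
    int new_exp = e / 2 + 127;             /* e is even here, so e / 2 is exact */

    uint32_t low = 90, high = 256, result = 128;
    while (low <= high) {
        uint32_t mid = (low + high) >> 1;
        uint32_t sq = (mid * mid) >> 7;    /* mid^2 / 128 */
        if (sq <= m) {
            result = mid;                  /* mid is still small enough */
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    if (result >= 256) {                   /* keep the mantissa within 8 bits */
        result >>= 1;
        new_exp++;
    }
    return (uint16_t) (((unsigned) new_exp << 7) | (result & 0x7F));
}
```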
**test_comparisons()**
>`bf16_eq` : Checks equality. Returns false if either operand is `NaN`. Treats `+0` and `-0` as `equal`.
`bf16_lt` : Returns true if a < b, considering sign bits and magnitude.
`bf16_gt` : Simply calls bf16_lt(b, a) (reversed operands).
```s
bf16_eq:
addi sp, sp, -16
sw ra, 8(sp)
mv t0, a0 #store a in t0
jal ra, bf16_isnan
bnez a0, eq_false
# b NaN or not
mv t1, a1 #store b in t1
mv a0, t1
jal ra, bf16_isnan
bnez a0, eq_false
# both zero or not
mv a0, t0 # a0 = a
jal ra, bf16_iszero
mv t5, a0 # t5 = iszero(a); bf16_iszero overwrites t1 and t2, so keep it in t5
mv a0, a1
jal ra, bf16_iszero
and t2, t5, a0 # t2 = iszero(a) && iszero(b)
bnez t2, eq_true
# bit equality
beq t0, a1, eq_true
eq_false:
li a0, 0
j eq_exit
eq_true:
li a0, 1
eq_exit:
lw ra, 8(sp)
addi sp, sp, 16
ret
bf16_lt:
addi sp, sp, -16
sw ra, 12(sp)
mv t0, a0
jal ra, bf16_isnan
bnez a0, lt_false
mv a0, a1
jal ra, bf16_isnan
bnez a0, lt_false
# check zero
mv a0, t0
jal ra, bf16_iszero
mv t5, a0 # t5 = iszero(a); bf16_iszero overwrites t1 and t2, so keep it in t5
mv a0, a1
jal ra, bf16_iszero
and t2, t5, a0 # t2 = iszero(a) && iszero(b)
bnez t2, lt_false
# sign_a = (a >> 15) & 1
srli t3, t0, 15
andi t3, t3, 1
# sign_b = (b >> 15) & 1
srli t4, a1, 15
andi t4, t4, 1
# sign_a != sign_b ?
bne t3, t4, sign_diff
# same sign
beqz t3, both_pos # if sign = 0 , positive
j both_neg
both_pos:
blt t0, a1, lt_true # both pos compare with numbers
j lt_false
both_neg:
bgt t0, a1, lt_true
j lt_false
sign_diff:
# sign_a > sign_b ? (1 > 0 ?? neg < pos)
bgt t3, t4, lt_true
j lt_false
lt_true:
li a0, 1
j lt_exit
lt_false:
li a0, 0
lt_exit:
lw ra, 12(sp)
addi sp, sp, 16
ret
bf16_gt:
addi sp, sp, -16
sw ra, 4(sp)
mv t0, a0 # store a
mv a0, a1 # bf16_lt(b, a)
mv a1, t0
jal ra, bf16_lt
lw ra, 4(sp)
addi sp, sp, 16
ret
```
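The comparison rules can be sketched in C as follows. All names carrying the `_sketch` suffix are hypothetical, and the helpers are repeated here only to keep the example self-contained.

```c
#include <stdint.h>

static int bf16_isnan_sketch(uint16_t x)
{
    return (x & 0x7F80) == 0x7F80 && (x & 0x007F) != 0;
}

static int bf16_iszero_sketch(uint16_t x)
{
    return (x & 0x7FFF) == 0;             /* matches both +0 and -0 */
}

/* Equality: NaN never compares equal; +0 and -0 compare equal. */
static int bf16_eq_sketch(uint16_t a, uint16_t b)
{
    if (bf16_isnan_sketch(a) || bf16_isnan_sketch(b)) return 0;
    if (bf16_iszero_sketch(a) && bf16_iszero_sketch(b)) return 1;
    return a == b;
}

/* Less-than: compare signs first, then magnitudes. */
static int bf16_lt_sketch(uint16_t a, uint16_t b)
{
    if (bf16_isnan_sketch(a) || bf16_isnan_sketch(b)) return 0;
    if (bf16_iszero_sketch(a) && bf16_iszero_sketch(b)) return 0;
    unsigned sign_a = a >> 15, sign_b = b >> 15;
    if (sign_a != sign_b) return sign_a > sign_b;   /* negative < positive */
    return sign_a ? a > b : a < b;                  /* negatives order in reverse */
}

/* Greater-than is less-than with the operands swapped. */
static int bf16_gt_sketch(uint16_t a, uint16_t b)
{
    return bf16_lt_sketch(b, a);
}
```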
**test_edge_cases();**
>`Tiny values` — Check if underflowed to zero properly.
`Overflow` — Multiply large values to ensure result saturates to ±Inf.
`Underflow` — Divide very small by large number, expect zero.
It validates BF16 arithmetic correctness near numerical boundaries.
```s
# Test 1: Tiny value handling
la a0, msg_testing_edges
li a7, 4
ecall
li a0, 0x00000001 # tiny 1e-45f
jal ra, f32_to_bf16 # -> a0 = bf_tiny(bits)
mv s0, a0
jal ra, bf16_to_f32 # -> a0 = tiny_val(bits)
mv s1, a0
# bf16_iszero(bf_tiny)?
mv a0, s0
jal ra, bf16_iszero
bnez a0, test1_pass
# abs(tiny_val)
li t3, 0x7FFFFFFF
and t4, s1, t3
# load threshold (1e-37)
li t5, 0x02081CEA # 1e-37f
# compare abs(tiny_val) < threshold ?
bltu t4, t5, test1_pass
# fail
la a0, msg_fail_tiny
li a7, 4
ecall
li a0, 1
j test_edge_finish
test1_pass: # Test 2: Overflow → Inf
li a0, 0x7E967699 # 1e38f
jal ra, f32_to_bf16
mv s2, a0 # s2 = bf_huge
li a0, 0x41200000 # 10.0f
jal ra, f32_to_bf16
mv s3, a0 # s3 = bf10
mv a0, s2
mv a1, s3
jal ra, bf16_mul
mv s2, a0 # s2 = bf_huge2
jal ra, bf16_isinf
beqz a0, fail_huge
j test2_pass
fail_huge:
la a0, msg_fail_huge
li a7, 4
ecall
li a0, 1
j test_edge_finish
test2_pass: # Test 3: Underflow
li a0, 0x006CE3EE # 1e-38f
jal ra, f32_to_bf16
mv s0, a0 # s0 = bf_small
li a0, 0x501502F9 # 1e10f
jal ra, f32_to_bf16
mv s1, a0 # s1 = bf_1e10
mv a0, s0
mv a1, s1
jal ra, bf16_div
mv s2, a0 # s2 = smaller
mv a0, s2
jal ra, bf16_to_f32
mv t4, a0 # t4 = smaller_val f32 bits
mv a0, s2 # bf16_iszero expects the bf16 bits, not the f32 bits
jal ra, bf16_iszero
bnez a0, test3_pass
li t3, 0x7FFFFFFF
and t4, t4, t3 # clear sign
li t6, 0x00000001 # 1e-45f f32 bits
bltu t4, t6, test3_pass
la a0, msg_fail_underflow
li a7, 4
ecall
li a0, 1
j test_edge_finish
test3_pass:
la a0, Edge_msg
li a7, 4
ecall
li a0, 0
test_edge_finish:
```
**test_rounding();**
>1.Checks that 1.5f is exactly representable.
2.Verifies that converting 1.0001f to BF16 and back yields minimal rounding error (< 0.001).
Ensures conversion maintains acceptable accuracy within expected precision.
```s
la a0, msg_rounding
li a7, 4
ecall
li a0, 0x3FC00000 # 1.5f
jal ra, f32_to_bf16
mv s0, a0 # s0 = bf_exact
# back_exact = bf16_to_f32(bf_exact)
jal ra, bf16_to_f32
mv t0, a0 # t0 = back_exact f32 bits
# check exact representation preserved
li t1, 0x3FC00000 # 1.5f bits
bne t0, t1, rounding_fail
pass_test_rounding_1:
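li a0, 0x3F800347 # 1.0001f bits
jal ra, f32_to_bf16
mv s1, a0 # s1 = bf
jal ra, bf16_to_f32
mv t2, a0 # t2 = back f32 bits
# diff2 = val - back, taken on the raw bit patterns (an integer proxy for the float difference)
li t3, 0x3F800347 # val bits
sub t4, t3, t2 # t4 = difference of the bit patterns
# take the absolute value (branchless two's-complement abs)
srai t5, t4, 31
xor t4, t4, t5
sub t4, t4, t5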
# check rounding error < 0.001
li t6, 0x3A83126F # 0.001f bits
bltu t4, t6, rounding_pass
rounding_fail:
la a0, msg_fail_rounding
li a7, 4
ecall
li a0, 1
j test_rounding_end
rounding_pass:
la a0, Rounding_msg
li a7, 4
ecall
test_rounding_end:
j end
```
>A helper that computes the absolute value of a signed integer and checks if it’s less than 10.
Used as a verification step for correctness tests (e.g., absolute error check).
```s
abs:
bltz s3, abs_neg
j abs_positive
abs_neg:
neg s3, s3 # s3 = -s3
abs_positive:
li t1, 10
blt s3, t1, abs_pass
li a0, 1 # fail
ret
abs_pass:
li a0, 0 # pass
ret
```
>These are exception detection helpers that check BF16 bit patterns:
`bf16_isinf`: exponent = 0xFF, mantissa = 0 → `Inf`
`bf16_isnan`: exponent = 0xFF, mantissa ≠ 0 → `NaN`
`bf16_iszero`: exponent = 0, mantissa = 0 → `Zero`
`bf16_is_neg_zero`: Zero with sign bit = 1
Used throughout all test functions to handle edge and invalid cases correctly.
```s
bf16_isinf:
mv t1,a0
li t2,0x7F80
and t3,t1,t2
bne t3,t2, not_inf
li t2,0x007F
and t4,t1,t2
bnez t4, not_inf
li a0,1
ret
not_inf:
li a0,0
ret
is_nan_inf:
srli t4,t1,16 # NaN/Inf: return the upper 16 bits of the f32 pattern (t1) unchanged
li t3,0xFFFF
and t4,t4,t3
mv a0,t4
ret
bf16_isnan:
mv t1,a0
li t2,0x7F80
and t3,t1,t2
bne t3,t2, not_nan
li t2,0x007F
and t4,t1,t2
beqz t4, not_nan
li a0,1
ret
not_nan:
li a0,0
ret
bf16_iszero:
mv t1,a0
li t2,0x7FFF
and t1,t1,t2
beqz t1, is_zero
li a0,0
ret
is_zero:
li a0,1
ret
bf16_is_neg_zero:
li t1,0x8000
beq a0,t1,is_negzero
li a0,0
ret
is_negzero:
li a0,1
ret
```
`print_hex` displays a value in human-readable hexadecimal form using only integer instructions, without hardware floating-point instructions or registers; a0 holds the value and a1 the number of hex digits to print.
```s
print_hex:
mv t0, a0 # val
la a0,hex
li a7,4
ecall
mv t1, a1 # digits
la t2, hexchars # hex
print_hex_loop:
beqz t1, print_hex_done
addi t1, t1, -1
slli t3, t1, 2
srl t4, t0, t3 # val >> (4*pos)
andi t4, t4, 0xF
add t4, t2, t4
lbu a0, 0(t4)
li a7, 11 # print_char
ecall
j print_hex_loop
print_hex_done:
li a0, 10 # newline
li a7, 11
ecall
ret
fail:
la a0, fail_msg
li a7, 4
ecall
li a0, 1 # return 1 if fail
ret
end:
la a0, all_test_pass_msg
li a7, 4
ecall
li a7,10
ecall
```
:::info
Results on console
:::

:::info
The clock cycles of program:
:::

:::info
Memory viewer:
:::

## Leetcode
## [leetcode_70_climbing_stairs](https://leetcode.com/problems/climbing-stairs/description/)
You are climbing a staircase. It takes n steps to reach the top.
Each time you can either climb 1 or 2 steps. In how many distinct ways can you climb to the top?
<h3>Implementation</h3>
Formally, it solves the recurrence:
${f(n)=f(n-1)+f(n-2)}$
which is similar to the Fibonacci sequence in concept.
>Example A:
Input: n = 2
Output: 2
Explanation: There are two ways to climb to the top.
1. 1 step + 1 step
2. 2 steps
>Example B:
Input: n = 3
Output: 3
Explanation: There are three ways to climb to the top.
1. 1 step + 1 step + 1 step
2. 1 step + 2 steps
3. 2 steps + 1 step
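
The recurrence can be evaluated iteratively when only the number of ways is needed. The sketch below is illustrative: the name `climb_stairs_count` is hypothetical, and it counts the combinations rather than printing them, which is what the implementations in this note do.

```c
#include <stdint.h>

/* Hypothetical helper: number of ways to climb n stairs with steps of 1 or 2. */
static uint32_t climb_stairs_count(unsigned n)
{
    uint32_t prev = 1, curr = 1;          /* f(0) = 1, f(1) = 1 */
    for (unsigned i = 2; i <= n; i++) {
        uint32_t next = prev + curr;      /* f(i) = f(i-1) + f(i-2) */
        prev = curr;
        curr = next;
    }
    return curr;
}
```

For the test cases used in this note (n = 3, 4, 5), the counts are 3, 5, and 8, which equals the number of combinations the recursive enumeration prints.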
<h4>
C code
</h4>
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define VECTOR_MIN_SIZE 16
typedef struct {
void **data;
size_t size; /* Allocated size */
size_t count; /* Number of elements */
size_t free_slot; /* Index of a known hole */
} vector_t;
/* -------- Basic vector operations -------- */
void vector_init(vector_t *v)
{
v->data = NULL;
v->size = 0;
v->count = 0;
v->free_slot = 0;
}
int32_t vector_push(vector_t *v, void *ptr)
{
if (!v->size) {
v->size = VECTOR_MIN_SIZE;
v->data = calloc(v->size, sizeof(void *));
}
if (v->free_slot && v->free_slot < v->count) {
size_t idx = v->free_slot;
v->data[idx] = ptr;
v->free_slot = 0;
return idx;
}
if (v->count == v->size) {
v->size *= 2;
v->data = realloc(v->data, v->size * sizeof(void *));
memset(v->data + v->count, 0, (v->size - v->count) * sizeof(void *));
}
v->data[v->count] = ptr;
return v->count++;
}
void *vector_pop(vector_t *v)
{
if (!v->count)
return NULL;
void *last = v->data[--v->count];
v->data[v->count] = NULL;
return last;
}
void vector_free(vector_t *v)
{
if (!v->data)
return;
free(v->data);
v->data = NULL;
v->size = 0;
v->count = 0;
v->free_slot = 0;
}
/* -------- Recursively enumerate the combinations -------- */
void generate_combinations(int n, vector_t *current) {
if (n == 0) {
printf("[");
for (size_t i = 0; i < current->count; i++) {
printf("%ld", (long) (intptr_t) current->data[i]);
if (i < current->count - 1) printf(",");
}
printf("]\n");
return;
}
if (n >= 1) {
vector_push(current, (void *)(intptr_t)1);
generate_combinations(n - 1, current);
vector_pop(current);
}
if (n >= 2) {
vector_push(current, (void *)(intptr_t)2);
generate_combinations(n - 2, current);
vector_pop(current);
}
}
void stairs(int n) {
if (n <= 0) {
printf("No stairs to climb.\n");
return;
}
printf("Climbing %d stairs:\n", n);
vector_t path;
vector_init(&path);
generate_combinations(n, &path);
vector_free(&path);
}
/* -------- Tests -------- */
void test_case1()
{
stairs(3);
}
void test_case2()
{
stairs(4);
}
void test_case3()
{
stairs(5);
}
int main()
{
test_case1();
test_case2();
test_case3();
printf("All tests passed !\n");
return 0;
}
```
<h3>
Assembly code : version_1
</h3>
My test cases are: A: 3, B: 4, C: 5.
>Preparing the output strings
```s
.data
msg: .string "How many stairs?\n"
msg1: .string "Step combinations:\n"
msg2: .string ": { "
msg3: .string "}\n"
msg4: .string " "
msg5: .string "\n"
testcase_a: .string "n = 3"
testcase_b: .string "n = 4"
testcase_c: .string "n = 5"
steps: .word 256 # one word holding 256 (not 256 words); the path buffer relies on being last in .data
.text
```
>main function
```s
main:
la a0,msg
li a7,4
ecall
##------------------A---------------------------
la a0,testcase_a
li a7,4
ecall
la a0,msg5
li a7,4
ecall
li a2, 3
la a0,msg1
li a7,4
ecall
jal ra,start_stairs
##------------------B---------------------------
la a0,testcase_b
li a7,4
ecall
la a0,msg5
li a7,4
ecall
li a2, 4
la a0,msg1
li a7,4
ecall
jal ra,start_stairs
##------------------C---------------------------
la a0,testcase_c
li a7,4
ecall
la a0,msg5
li a7,4
ecall
li a2, 5
la a0,msg1
li a7,4
ecall
jal ra,start_stairs
end:
li a7,10
ecall
```
**start_stairs :**
>Prepares the environment (stack, registers, and data pointers) for recursive computation of combinations.
```s
start_stairs:
mv s2,x0 ### s2 is the countCombination (how many ways to do the combination)
mv a3,x0 ### a3 is the StepSize
la a1, steps
beqz a2,end
addi sp, sp, -4
sw ra, 0(sp)
jal ra,printSteps
lw ra, 0(sp)
addi sp, sp, 4
ret
```
**printSteps:**
>Implements recursion
If a2 == 0, print one full combination (printSteps_0).
Otherwise, try taking 1 step or 2 steps recursively.
Concept:
This is the core recursive routine for generating all combinations of steps that sum to n.
It explores both possibilities — “take 1 step” and “take 2 steps” — until the total matches the target.
```s
printSteps:
bnez a2,L1
addi sp,sp,-12
sw ra,0(sp)
sw a2,4(sp)
sw a3,8(sp)
jal ra,printSteps_0
lw ra,0(sp)
lw a2,4(sp)
lw a3,8(sp)
addi sp,sp,12
L1:
li t0,1
bltu a2,t0,R
addi sp,sp,-12
sw ra,0(sp)
sw a2,4(sp)
sw a3,8(sp)
jal ra,printSteps_1
lw ra,0(sp)
lw a2,4(sp)
lw a3,8(sp)
addi sp,sp,12
L2:
li t0,2
bltu a2,t0,R
addi sp,sp,-12
sw ra,0(sp)
sw a2,4(sp)
sw a3,8(sp)
jal ra,printSteps_2
lw ra,0(sp)
lw a2,4(sp)
lw a3,8(sp)
addi sp,sp,12
R: ret
```
>Acts as the base case in recursion.
When the sum of steps equals n, this routine prints the sequence { 1 2 ... }
```s
printSteps_0:
addi s2,s2,1
mv a0,s2
li a7,1
ecall
la a0,msg2
li a7,4
ecall
addi sp,sp,-4
sw ra,0(sp)
li t0,0
loop_0:
slli t2,t0,2
add t1,a1,t2 ###step[index]
lw t3,0(t1)
mv a0,t3
li a7,1
ecall
la a0,msg4
li a7,4
ecall
addi t0,t0,1
bltu t0,a3,loop_0
la a0,msg3
li a7,4
ecall
lw ra,0(sp)
addi sp,sp,4
ret
```
**printSteps_1 & printSteps_2 :**
>Appends a 1-step or 2-step to the current combination, then calls printSteps again.
These represent recursive branching — the algorithm explores both paths by appending different step sizes.
```s
printSteps_1:
slli t2,a3,2
add t1,a1,t2
li t6,1
sw t6,0(t1)
addi a2,a2,-1
addi a3,a3,1
addi sp,sp,-12
sw ra,0(sp)
sw a2,4(sp)
sw a3,8(sp)
jal ra,printSteps
lw ra,0(sp)
lw a2,4(sp)
lw a3,8(sp)
addi sp,sp,12
ret
printSteps_2:
slli t2,a3,2
add t1,a1,t2
li t6,2
sw t6,0(t1)
addi a2,a2,-2
addi a3,a3,1
addi sp,sp,-12
sw ra,0(sp)
sw a2,4(sp)
sw a3,8(sp)
jal ra,printSteps
lw ra,0(sp)
lw a2,4(sp)
lw a3,8(sp)
addi sp,sp,12
ret
```
<h3>
Assembly code : version_2
</h3>
Version 2 inlines the set-up code for each test case (the body of `start_stairs`) and the print routine (`printSteps_0`) instead of calling them. The only difference between version_1 and version_2 is where these function bodies are placed, but removing the extra calls, returns, and stack saves reduces the cycle count considerably.
>At first I wanted to try dynamic programming for this program, but because every step combination has to be printed, it became complicated, so I chose inlining to optimize the program.
```s
.data
msg: .string "How many stairs?\n"
msg1: .string "Step combinations:\n"
msg2: .string ": { "
msg3: .string "}\n"
msg4: .string " "
msg5: .string "\n"
testcase_a: .string "n = 3"
testcase_b: .string "n = 4"
testcase_c: .string "n = 5"
steps: .word 256 # one word holding 256 (not 256 words); the path buffer relies on being last in .data
.text
main:
la a0,msg
li a7,4
ecall
##------------------A---------------------------
la a0,testcase_a
li a7,4
ecall
la a0,msg5
li a7,4
ecall
li a2, 3 ###
la a0,msg1
li a7,4
ecall
mv s2,x0 ### s2 is the countCombination (how many ways to do the combination)
mv a3,x0 ### a3 is the StepSize
la a1, steps
beqz a2,end
jal ra,printSteps
##------------------B---------------------------
la a0,testcase_b
li a7,4
ecall
la a0,msg5
li a7,4
ecall
li a2, 4 ###
la a0,msg1
li a7,4
ecall
mv s2,x0 ### s2 is the countCombination (how many ways to do the combination)
mv a3,x0 ### a3 is the StepSize
la a1, steps
beqz a2,end
jal ra,printSteps
##------------------C---------------------------
la a0,testcase_c
li a7,4
ecall
la a0,msg5
li a7,4
ecall
li a2, 5
la a0,msg1
li a7,4
ecall
mv s2,x0 ### s2 is the countCombination (how many ways to do the combination)
mv a3,x0 ### a3 is the StepSize
la a1, steps
beqz a2,end
jal ra,printSteps
end: li a7,10
ecall
printSteps:
bnez a2,L1
printSteps_0:
addi s2,s2,1
mv a0,s2
li a7,1
ecall
la a0,msg2
li a7,4
ecall
addi sp,sp,-4
sw ra,0(sp)
li t0,0
loop_0:
slli t2,t0,2
add t1,a1,t2 ###step[index]
lw t3,0(t1)
mv a0,t3
li a7,1
ecall
la a0,msg4
li a7,4
ecall
addi t0,t0,1
bltu t0,a3,loop_0
la a0,msg3
li a7,4
ecall
lw ra,0(sp)
addi sp,sp,4
L1: li t0,1
bltu a2,t0,R
addi sp,sp,-12
sw ra,0(sp)
sw a2,4(sp)
sw a3,8(sp)
jal ra,printSteps_1
lw ra,0(sp)
lw a2,4(sp)
lw a3,8(sp)
addi sp,sp,12
L2: li t0,2
bltu a2,t0,R
addi sp,sp,-12
sw ra,0(sp)
sw a2,4(sp)
sw a3,8(sp)
jal ra,printSteps_2
lw ra,0(sp)
lw a2,4(sp)
lw a3,8(sp)
addi sp,sp,12
R: ret
printSteps_1: slli t2,a3,2
add t1,a1,t2
li t6,1
sw t6,0(t1)
addi a2,a2,-1
addi a3,a3,1
addi sp,sp,-12
sw ra,0(sp)
sw a2,4(sp)
sw a3,8(sp)
jal ra,printSteps
lw ra,0(sp)
lw a2,4(sp)
lw a3,8(sp)
addi sp,sp,12
ret
printSteps_2: slli t2,a3,2
add t1,a1,t2
li t6,2
sw t6,0(t1)
addi a2,a2,-2
addi a3,a3,1
addi sp,sp,-12
sw ra,0(sp)
sw a2,4(sp)
sw a3,8(sp)
jal ra,printSteps
lw ra,0(sp)
lw a2,4(sp)
lw a3,8(sp)
addi sp,sp,12
ret
```
:::info
Result in console:
:::
After all these stages are done, the console output looks like this:

:::info
Structure
:::
Take n = 3 as an example:
printSteps(3)
├→ walk `one` stair → printSteps(2)
│ ├→ walk `one` stair → printSteps(1)
│ │ ├→ walk `one` stair → printSteps(0) → `print` {1,1,1}
│ │ └→ walk `two` stairs (not enough)
│ └→ walk `two` stairs → printSteps(0) → `print` {1,2}
└→ walk `two` stairs → printSteps(1)
  └→ walk `one` stair → printSteps(0) → `print` {2,1}
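
The call tree above can be modelled in C. This is only a minimal sketch: `remaining`, `size`, and `count` are assumed to play the roles of registers a2, a3, and s2, and the names `print_steps`, `count_combinations`, and `MAX_STEPS` are illustrative and do not appear in the original program.

```c
#include <stdio.h>

#define MAX_STEPS 64 /* assumption: large enough for the small n used here */

static int steps[MAX_STEPS]; /* plays the role of the steps array (a1) */
static int count;            /* plays the role of s2 (combination counter) */

/* remaining = stairs left (a2), size = entries used in steps[] (a3) */
static void print_steps(int remaining, int size)
{
    if (remaining == 0) {
        /* Base case: one complete combination has been built. */
        printf("%d: { ", ++count);
        for (int i = 0; i < size; i++)
            printf("%d ", steps[i]);
        printf("}\n");
        return;
    }
    if (remaining >= 1) { /* walk one stair */
        steps[size] = 1;
        print_steps(remaining - 1, size + 1);
    }
    if (remaining >= 2) { /* walk two stairs */
        steps[size] = 2;
        print_steps(remaining - 2, size + 1);
    }
}

/* Hypothetical helper: reset the counter, enumerate, return the count. */
static int count_combinations(int n)
{
    count = 0;
    print_steps(n, 0);
    return count;
}
```

Calling `count_combinations(3)` prints the three combinations from the tree in the same order, and `n = 4` and `n = 5` give 5 and 8 combinations respectively.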
:::info
After all these stages are done, the registers are updated as follows:
:::

:::info
Memory viewer:
:::
The table below denotes the data section of memory.

:::info
The clock cycles of program version_1:
:::
Because I used recursion, my cycle count is much higher than that of non-recursive approaches.

:::info
The clock cycles of program version_2:
:::
This version of the program takes `254` fewer cycles than version_1.

## 5-stage pipelined processor
The processor I use in Ripes is the 5-stage pipelined processor with hazard detection and forwarding (not the single-cycle model). Its block diagram looks like this:

The five stages are:
>Instruction fetch (IF)
The processor fetches the instruction from memory and updates the program counter (PC) to the next instruction.
>Instruction decode and register fetch (ID)
The instruction is decoded to determine its type, and the required operands are read from the register file.
>Execute (EX)
The ALU performs the necessary operation (such as arithmetic, logic, or address calculation).
>Memory access (MEM)
If the instruction requires reading from or writing to memory (e.g., lw or sw), the memory is accessed here.
>Register write back (WB)
The result from the ALU or memory is written back to the destination register, completing the instruction’s execution.
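
To make the overlap between the five stages concrete, here is a minimal C sketch of an ideal pipeline. It assumes one instruction enters IF per cycle and treats stalls as a plain number supplied by the caller, so it does not model the actual hazard or forwarding logic in Ripes. The names `stage_at`, `ideal_cycles`, and `print_diagram` are illustrative.

```c
#include <stdio.h>

enum { NUM_STAGES = 5 };
static const char *const STAGE_NAMES[NUM_STAGES] = {"IF", "ID", "EX", "MEM", "WB"};

/* Stage index (0..4) occupied by instruction `instr` at `cycle`
 * (both 0-based), or -1 if the instruction is not in the pipeline. */
static int stage_at(int instr, int cycle)
{
    int stage = cycle - instr;
    return (stage >= 0 && stage < NUM_STAGES) ? stage : -1;
}

/* Cycles needed for n_instr instructions: the first one takes
 * NUM_STAGES cycles, each further one finishes one cycle later,
 * and every stall adds one cycle. */
static int ideal_cycles(int n_instr, int stall_cycles)
{
    if (n_instr <= 0)
        return 0;
    return NUM_STAGES + (n_instr - 1) + stall_cycles;
}

/* Print a stall-free pipeline diagram, one row per instruction. */
static void print_diagram(int n_instr)
{
    int total = ideal_cycles(n_instr, 0);
    for (int i = 0; i < n_instr; i++) {
        printf("i%d:", i);
        for (int c = 0; c < total; c++) {
            int s = stage_at(i, c);
            printf(" %-3s", s < 0 ? "." : STAGE_NAMES[s]);
        }
        printf("\n");
    }
}
```

For example, three independent instructions need `ideal_cycles(3, 0)` = 7 cycles, and `print_diagram(3)` shows each instruction starting one cycle after the previous one.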
## Reference
https://hackmd.io/@sysprog/arch2025-quiz1-sol
https://hackmd.io/@sysprog/SkkbXLJRR
https://leetcode.com/problems/climbing-stairs/description/
https://github.com/sysprog21/ca2025-quizzes
## Full version of the assembly code for problem B
```s
.data
msg: .string "start decode\n"
msg1: .string "start encode\n"
msg2: .string "\n"
msg3: .string "decode->"
msg4: .string "\nencode->"
msg5: .string "fail encoding"
msg6: .string "0x"
testcase_a: .string "test_case_a:n = 0x0f" ###15
testcase_b: .string "test_case_b:n = 0x7d" ###125
testcase_c: .string "test_case_c:n = 0xe1" ###225
pass: .string "All testcases passed"
.text
main:
la a0,msg
li a7,4
ecall
##------------------A---------------------------
la a0,testcase_a
li a7,4
ecall
la a0,msg2
li a7,4
ecall
li a1, 0x0f ###n=0x0f
jal ra,start_decode_encode
##------------------B---------------------------
la a0,msg2
li a7,4
ecall
la a0,testcase_b
li a7,4
ecall
la a0,msg2
li a7,4
ecall
li a1, 0x7d ###n=0x7d
jal ra,start_decode_encode
##------------------C---------------------------
la a0,msg2
li a7,4
ecall
la a0,testcase_c
li a7,4
ecall
la a0,msg2
li a7,4
ecall
li a1, 0x0e1 ###n=0xe1
jal ra,start_decode_encode
j end
end:
li a7,10
ecall
start_decode_encode:
addi sp,sp,-4
sw ra, 0(sp)
jal ra,Decode
lw ra, 0(sp)
addi sp,sp,4
mv t0,a0 ###store the return value
la a0,msg3
li a7,4
ecall
mv a0,t0
li a7,1
ecall
addi sp,sp,-4
sw ra, 0(sp)
jal ra,Encode
lw ra, 0(sp)
addi sp,sp,4
mv t0,a0 ### store the return value
li t2,-1 ### previous_num
bne t0,a1,error
blt t0,t2,error
mv t2,t0 ### previous_value = value
la a0,msg4
li a7,4
ecall
mv a0,t2
addi sp,sp,-4
sw ra, 0(sp)
jal ra,print_hex
lw ra, 0(sp)
addi sp,sp,4
ret
error:
la a0,msg5
li a7,4
ecall
j end ### stop on failure instead of falling through into print_hex
print_hex:
addi sp, sp, -16
sw ra, 8(sp) # store
srli t0, a0, 4 # high nibble
andi t1, a0, 0xF # low nibble
la a0, msg6
li a7, 4
ecall
mv a0, t0
jal ra, print_nibble
mv a0, t1
jal ra, print_nibble
lw ra,8(sp)
addi sp,sp,16
ret
print_nibble:
li t2,10
bltu a0, t2, digit
addi a0, a0, 87
j out
digit:
addi a0, a0, 48
out:
li a7, 11 # print_char
ecall
ret
Decode:
andi t0,a1,0x0f ### store mantissa in t0
srli t1,a1,4 ### store expo in t1
not t2,t1
addi t2,t2,1 ### two's complement of expo
addi t2,t2,15
li t3,0x7fff
srl t3,t3,t2
slli t3,t3,4 ### store offset in t3
sll t4,t0,t1
add a0,t4,t3 ### store the decoded value in a0
xor t0,t0,t0
xor t3,t3,t3
ret
Encode:
addi sp, sp, -16
sw ra, 12(sp) # store
add s4,a0,x0 ### s4=value
li t0,16
bltu a0,t0,no_need_more_op ### a0=value
jal ra,clz
mv t1,a0 ### lz = clz(value)
li t2,31
sub s1,t2,t1 ### store msb in s1
li s2,0 ### store exponent in s2
li s3,0 ### store overflow in s3
li t6,5
bltu s1,t6,find_exact_exponent
addi s2,s1,-4
li t6,15
bltu s2,t6, Calculate_overflow ### exponent < 15: keep it; otherwise clamp to 15 below
li s2,15
Calculate_overflow: ###for loop
li s8,0 ###counting
loop1:
beq s8,s2,Adjust_if_estimate
slli s3,s3,1
addi s3,s3,16
addi s8,s8,1
j loop1
Adjust_if_estimate:
bgtz s2,check1
j find_exact_exponent
check1:
bltu s4,s3,check2
j find_exact_exponent
check2:
addi s3,s3,-16
srli s3,s3,1
addi s2,s2,-1
j Adjust_if_estimate
find_exact_exponent:
li s7,20
bge s2, s7, end_encode
li t6,15
bgtu s2,t6,end_encode
slli s5,s3,1 ### 7f0 should be in s3 overflow
addi s5,s5,16 ### next_overflow store in s5
bltu s4,s5,end_encode
mv s3,s5
addi s2,s2,1
j find_exact_exponent
end_encode:
sub s6,s4,s3
srl s6,s6,s2 ### store other mantissa in s6
slli t6,s2,4
add a0,s6,t6
lw ra, 12(sp)
addi sp, sp, 16
ret
clz:
li t0,32 ### n=32
li t1,16 ### c=16
mv t3,a0
loop0:
beqz t1,return0
srl t2,t3,t1 ### y = x >> c
beqz t2,devide_c
sub t0,t0,t1
mv t3,t2
j loop0
devide_c:
srli t1,t1,1
j loop0
return0:
sub t0,t0,t3
mv a0,t0
ret
no_need_more_op:
lw ra, 12(sp) # restore ra stored at start of Encode
addi sp, sp, 16 # restore stack
ret
```
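
For cross-checking the assembly output, a small C reference model can help. The sketch below follows the decode formula described earlier (offset = `(0x7FFF >> (15 - e)) << 4`, value = `(m << e) + offset`). The encoder is a simplified linear search over the exponent, not the clz-based estimate used in the assembly, and the names `uf8_decode` and `uf8_encode` are chosen here for illustration.

```c
#include <stdint.h>

/* Decode an 8-bit UF8 code: high nibble = exponent, low nibble = mantissa. */
static uint32_t uf8_decode(uint8_t code)
{
    uint32_t mantissa = code & 0x0Fu;
    uint32_t exponent = code >> 4;
    uint32_t offset = (0x7FFFu >> (15u - exponent)) << 4;
    return (mantissa << exponent) + offset;
}

/* Encode by walking the exponent upwards until the next range start
 * would exceed the value (simplified: no clz-based first guess). */
static uint8_t uf8_encode(uint32_t value)
{
    if (value < 16)
        return (uint8_t)value; /* exponent 0: stored as-is */

    uint32_t exponent = 0;
    uint32_t overflow = 0; /* start of the current exponent range */
    while (exponent < 15) {
        uint32_t next = (overflow << 1) + 16; /* start of the next range */
        if (value < next)
            break;
        overflow = next;
        exponent++;
    }
    uint32_t mantissa = (value - overflow) >> exponent;
    return (uint8_t)((exponent << 4) | mantissa);
}
```

With this model, the three test inputs decode to `uf8_decode(0x0f) = 15`, `uf8_decode(0x7d) = 3696`, and `uf8_decode(0xe1) = 278512`, and re-encoding each decoded value returns the original code.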
## Full version of the assembly code for problem C
```s
.data
Start_msg: .string "Testing basic conversions...\n"
Basic_msg: .string "Basic conversions: PASS\n"
Special_msg: .string "Special values: PASS\n"
Arithmetic_msg: .string "Arithmetic: PASS\n"
Comparisons_msg: .string "Comparisons: PASS\n"
Edge_msg: .string "Edge cases: PASS\n"
Rounding_msg: .string "Rounding: PASS\n"
all_test_pass_msg: .string "=== ALL TESTS PASSED ==="
#----------------------for_special------------------
msg_start_special: .string "Testing special values...\n"
msg_fail_posinf: .string "Positive infinity not detected\n"
msg_fail_inf_nan:.string "Infinity detected as NaN\n"
msg_fail_neginf: .string "Negative infinity not detected\n"
msg_fail_nan: .string "NaN not detected\n"
msg_fail_nan_inf:.string "NaN detected as infinity\n"
msg_fail_zero: .string "Zero not detected\n"
msg_fail_negzero:.string "Negative zero not detected\n"
#---------------------for_comparison---------------
msg_start_comparison: .string "Testing comparison...\n"
msg_fail_eq: .string "Equality test failed\n"
msg_fail_ineq: .string "Inequality test failed\n"
msg_fail_lt: .string "Less than test failed\n"
msg_fail_nlt: .string "Not less than test failed\n"
msg_fail_enlt: .string "Equal not less than test failed\n"
msg_fail_gt: .string "Greater than test failed\n"
msg_fail_ngt: .string "Not greater than test failed\n"
msg_fail_naneq: .string "NaN equality test failed\n"
msg_fail_nanlt: .string "Nan less than test failed\n"
msg_fail_nangt: .string "Nan greater than test failed\n"
#---------------------for_arithmetic---------------
msg_start_arithmetic: .string "Testing arithmetic operations...\n"
msg_result: .string "Result = "
msg_add_fail:.string "Addition failed\n"
msg_add_case1:.string "Addition :1.0f + 2.0f \n"
msg_pass_add: .string "PASSED\n"
msg_sub_fail:.string "Subtraction failed\n"
msg_sub_case1:.string "Subtraction :2.0f - 1.0f \n"
msg_mul_fail:.string "Multiplication failed\n"
msg_mul_case1:.string "Multiplication :3.0f x 4.0f \n"
msg_div_fail:.string "Division failed\n"
msg_div_case1:.string "Division :10.0f / 2.0f \n"
msg_sqrt_case1: .string "Sqrt :4.0f \n"
msg_sqrt_case2: .string "Sqrt :9.0f \n"
msg_sqrt4_fail:.string "sqrt(4) failed\n"
msg_sqrt9_fail:.string "sqrt(9) failed\n"
#---------------------for_edge_cases---------------
msg_testing_edges: .string "Testing edge cases...\n"
msg_fail_tiny: .string "Tiny value handling\n"
msg_fail_huge: .string "Overflow should produce infinity\n"
msg_fail_underflow:.string "Underflow should produce zero or denormal\n"
#---------------------for_rounding_cases---------------
msg_rounding: .string "Testing rounding behavior...\n"
msg_pass_rounding: .string " Rounding: PASS\n"
msg_fail_rounding: .string " Rounding: FAIL\n"
#-------------------
newline: .string "\n"
fail_msg: .string "fail"
fail_for_pos_inf_msg: .string "Positive infinity not detected"
msg1: .asciz "f32 -> bf16 = "
msg2: .asciz "bf16 -> f32 = "
hexchars: .asciz "0123456789ABCDEF\n"
hex: .string "0x"
testcase_a: .string "n = 0.0f\n" ###
testcase_b: .string "n = 1.0f\n" ###
testcase_c: .string "n = -1.0f\n" ###
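.align 2 # keep the words below 4-byte aligned after the strings
test_values: # kept in .data so the words are never fetched as instructions
.word 0x00000000 # 0.0f
.word 0x3f800000 # 1.0f
.word 0xbf800000 # -1.0f
.text
main:
la t0, test_values
la a0,Start_msg
li a7,4
ecall
la a0,testcase_a
li a7,4
ecall
lw s1,0(t0) # 0.0f (base register is t0, loaded above)
lw s2,4(t0) # 1.0f
lw s3,8(t0) # -1.0f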
la a0,msg1
li a7,4
ecall
mv a0,s1 ##test_case
###----------------------------test_f32<->bf16-------------------------------------
jal ra,f32_to_bf16
add s4,a0,x0 # store orig in s4
li a1, 4 # 16-bit = 4 hex digits
jal ra, print_hex
la a0,msg2
li a7,4
ecall
mv a0,s4
jal ra,bf16_to_f32
add s5,a0,x0
li a1, 8 # 32-bit = 8 hex digits
jal ra, print_hex
la a0,testcase_b
li a7,4
ecall
la a0,msg1
li a7,4
ecall
mv a0,s2 ##test_case
jal ra,f32_to_bf16
add s4,a0,x0 # store orig in s4
li a1, 4 # 16-bit = 4 hex digits
jal ra, print_hex
la a0,msg2
li a7,4
ecall
mv a0,s4
jal ra,bf16_to_f32
add s5,a0,x0
li a1, 8 # 32-bit = 8 hex digits
jal ra, print_hex
la a0,testcase_c
li a7,4
ecall
la a0,msg1
li a7,4
ecall
mv a0,s3 ##test_case
jal ra,f32_to_bf16
add s4,a0,x0 # store orig in s4
li a1, 4 # 16-bit = 4 hex digits
jal ra, print_hex
la a0,msg2
li a7,4
ecall
mv a0,s4
jal ra,bf16_to_f32
add s5,a0,x0
li a1, 8 # 32-bit = 8 hex digits
jal ra, print_hex
###----------------------------test_basic_conversion-------------------------------------
srli t4,s4,15 # sign bit of the bf16 value (bit 15)
andi t4,t4,1
srli t5,s5,31 # sign bit of the round-tripped f32 value (bit 31)
bne t4,t5,fail # sign mismatch
beq s4,x0,pass_basic_msg
srli t3,s5,16
beq s4,t3,pass_basic_msg ##because easy test_cases
back_from_basic_msg:
###----------------------------test_special_value-------------------------------------
test_special_values:
la a0, msg_start_special
li a7, 4
ecall
li a0, 0x7F80 #Test +Inf
jal ra, bf16_isinf
beqz a0, fail_posinf
li a0, 0x7F80
jal ra, bf16_isnan
bnez a0, fail_inf_nan
li a0, 0xFF80 #Test -Inf
jal ra, bf16_isinf
beqz a0, fail_neginf
li a0, 0x7FC0 #Test NaN
jal ra, bf16_isnan
beqz a0, fail_nan
li a0, 0x7FC0
jal ra, bf16_isinf
bnez a0, fail_nan_inf
li a0, 0x0000 # Test +0
jal ra, bf16_iszero
beqz a0, fail_zero
li a0, 0x8000 # Test -0
jal ra, bf16_is_neg_zero
beqz a0, fail_negzero
la a0, Special_msg
li a7, 4
ecall
###----------------------------test_comparison-------------------------------------
li s1,0x3F80
li s2,0x4000
li s3,0x3F80
la a0, msg_start_comparison
li a7, 4
ecall
#equality_test
mv a0,s1
mv a1,s3
jal ra,bf16_eq
beqz a0, fail_eq
mv a0,s1
mv a1,s2
jal ra,bf16_eq
bnez a0, fail_ineq
#less_than_test
mv a0,s1
mv a1,s2
jal ra,bf16_lt
beqz a0, fail_lt
mv a0,s2
mv a1,s1
jal ra,bf16_lt
bnez a0, fail_nlt
mv a0,s1
mv a1,s3
jal ra,bf16_lt
bnez a0, fail_enlt
#greater_than_test
mv a0,s2
mv a1,s1
jal ra,bf16_gt
beqz a0, fail_gt
mv a0,s1
mv a1,s2
jal ra,bf16_gt
bnez a0, fail_ngt
li t1,0x7FC0 # bf16 NaN bit pattern
mv a0,t1
mv a1,t1
jal ra,bf16_eq
bnez a0, fail_naneq
mv a0,t1
mv a1,s1
jal ra,bf16_lt
bnez a0, fail_nanlt
mv a0,t1
mv a1,s1
jal ra,bf16_gt
bnez a0, fail_nangt
la a0, Comparisons_msg
li a7, 4
ecall
#------------------------test_arithmetic--------------------------
la a0, msg_start_arithmetic
li a7, 4
ecall
li s1,0x3f80 #a
li s2,0x4000 #b
#test_add
la a0, msg_add_case1 # 1+2
li a7, 4
ecall
mv a0,s1
mv a1,s2
jal ra,bf16_add
jal ra,bf16_to_f32 # return a0 = result 40400000
mv t2,a0
li t1,0x40400000
sub s3,t2,t1 #store diff in s3
bltz s3, abs
bnez s3,fail_add
mv a0,t2
jal ra, print_result
#test_sub
la a0, msg_sub_case1 # 2-1
li a7, 4
ecall
mv a0,s2
mv a1,s1
jal ra,bf16_sub
jal ra,bf16_to_f32 # return a0 = result
mv t2,a0
li t1,0x3f800000
sub s3,t2,t1 #store diff in s3
bltz s3, abs
bnez s3,fail_sub
mv a0,t2
jal ra, print_result
#test_mul
la a0, msg_mul_case1 # 3 * 4
li a7, 4
ecall
li s1,0x4040
li s2,0x4080
mv a0,s1
mv a1,s2
jal ra,bf16_mul
jal ra,bf16_to_f32 # return a0 = result
mv t2,a0
li t1,0x41400000 #12
sub s3,a0,t1 #store diff in s3
bltz s3, abs
bnez s3,fail_mul
mv a0,t2
jal ra, print_result
#test_div
la a0, msg_div_case1 # 10 / 2
li a7, 4
ecall
li s1,0x4120 #10
li s2,0x4000 #2
mv a0,s1
mv a1,s2
jal ra,bf16_div
jal ra,bf16_to_f32 # return a0 = result
mv t2,a0
li t1,0x40a00000 #5
sub s3,t2,t1 #store diff in s3
bltz s3, abs
bnez s3,fail_div
mv a0,t2
jal ra, print_result
#test_sqrt4
la a0, msg_sqrt_case1 # 4
li a7, 4
ecall
li s1,0x4080
mv a0,s1
jal ra,bf16_sqrt
jal ra,bf16_to_f32 # return a0 = result
mv t2,a0
li t1,0x40000000 #2
sub s3,a0,t1 #store diff in s3
bltz s3, abs
bnez s3,fail_sqrt4
mv a0,t2
jal ra, print_result
#test_sqrt9
la a0, msg_sqrt_case2 # 9
li a7, 4
ecall
li s1,0x4110
mv a0,s1
jal ra,bf16_sqrt
jal ra,bf16_to_f32 # return a0 = result
mv t2,a0
li t1,0x40400000 #3
sub s3,a0,t1 #store diff in s3
bltz s3, abs
bnez s3,fail_sqrt9
mv a0,t2
jal ra, print_result
la a0, Arithmetic_msg
li a7, 4
ecall
#------------------------test_edge_cases--------------------------
# Test 1: Tiny value handling
la a0, msg_testing_edges
li a7, 4
ecall
li a0, 0x00000001 # tiny 1e-45f
jal ra, f32_to_bf16 # -> a0 = bf_tiny(bits)
mv s0, a0
jal ra, bf16_to_f32 # -> a0 = tiny_val(bits)
mv s1, a0
# bf16_iszero(bf_tiny)?
mv a0, s0
jal ra, bf16_iszero
bnez a0, test1_pass
# abs(tiny_val)
li t3, 0x7FFFFFFF
and t4, s1, t3
# load threshold (1e-37)
li t5, 0x02081CEA # 1e-37f bits
# compare abs(tiny_val) < threshold ?
bltu t4, t5, test1_pass
# fail
la a0, msg_fail_tiny
li a7, 4
ecall
li a0, 1
j test_edge_finish
test1_pass: # Test 2: Overflow -> Inf
li a0, 0x7E967699 # 1e38f
jal ra, f32_to_bf16
mv s2, a0 # s2 = bf_huge
li a0, 0x41200000 # 10.0f
jal ra, f32_to_bf16
mv s3, a0 # s3 = bf10
mv a0, s2
mv a1, s3
jal ra, bf16_mul
mv s2, a0 # s2 = bf_huge2
jal ra, bf16_isinf
beqz a0, fail_huge
j test2_pass
fail_huge:
la a0, msg_fail_huge
li a7, 4
ecall
li a0, 1
j test_edge_finish
test2_pass: # Test 3: Underflow
li a0, 0x006CE3EE # 1e-38f bits (subnormal in f32)
jal ra, f32_to_bf16
mv s0, a0 # s0 = bf_small
li a0, 0x501502F9 # 1e10f
jal ra, f32_to_bf16
mv s1, a0 # s1 = bf_1e10
mv a0, s0
mv a1, s1
jal ra, bf16_div
mv s2, a0 # s2 = smaller
mv a0, s2
jal ra, bf16_to_f32
mv t4, a0 # t4 = smaller_val f32 bits
jal ra, bf16_iszero # NOTE: a0 still holds the f32 bits from bf16_to_f32 here, not the bf16 value in s2
bnez a0, test3_pass
li t3, 0x7FFFFFFF
and t4, t4, t3 # clear sign
li t6, 0x00000001 # 1e-45f f32 bits
bltu t4, t6, test3_pass
la a0, msg_fail_underflow
li a7, 4
ecall
li a0, 1
j test_edge_finish
test3_pass:
la a0, Edge_msg
li a7, 4
ecall
li a0, 0
test_edge_finish:
#------------------------test_rounding------------------------
la a0, msg_rounding
li a7, 4
ecall
li a0, 0x3FC00000 # 1.5f
jal ra, f32_to_bf16
mv s0, a0 # s0 = bf_exact
# back_exact = bf16_to_f32(bf_exact)
jal ra, bf16_to_f32
mv t0, a0 # t0 = back_exact f32 bits
# check exact representation preserved
li t1, 0x3FC00000 # 1.5f bits
bne t0, t1, rounding_fail
pass_test_rounding_1:
li a0, 0x3F800347 # 1.0001f bits
jal ra, f32_to_bf16
mv s1, a0 # s1 = bf
jal ra, bf16_to_f32
mv t2, a0 # t2 = back f32 bits
# diff2 = val - back (difference of the raw bit patterns)
li t3, 0x3F800347 # val bits
sub t4, t3, t2 # t4 = diff2 bits
# clear the sign bit of the difference
li t5, 0x7FFFFFFF
and t4, t4, t5
# check rounding error < 0.001
li t6, 0x3A83126F # 0.001f bits
bltu t4, t6, rounding_pass
rounding_fail:
la a0, msg_fail_rounding
li a7, 4
ecall
li a0, 1
j test_rounding_end
rounding_pass:
la a0, Rounding_msg
li a7, 4
ecall
test_rounding_end:
j end
print_result:
addi sp,sp,-4
sw ra,0(sp)
mv t1,a0
la a0, msg_result # "Result: "
li a7, 4
ecall
mv a0,t1
li a1,8
jal ra,print_hex
pass_add:
la a0, msg_pass_add
li a7, 4
ecall
lw ra,0(sp)
addi sp,sp,4
ret
end_add:
ret
abs:
bltz s3, abs_neg
j abs_positive
abs_neg:
neg s3, s3 # s3 = -s3
abs_positive:
li t1, 10
blt s3, t1, abs_pass
li a0, 1 # fail
ret
abs_pass:
li a0, 0 # pass
ret
bf16_add:
addi sp,sp,-8
sw ra,4(sp)
li t5,0xFF
li t6,0x7F
srli t0, a0, 15 # sign_a in t0
andi t0, t0, 1
srli t1, a1, 15 # sign_b in t1
andi t1, t1, 1
srli t2, a0, 7 # exp_a in t2
and t2, t2, t5
srli t3, a1, 7 # exp_b in t3
and t3, t3, t5
and t4, a0, t6 # mant_a in t4
and t5, a1, t6 # mant_b in t5
#s6 = result_sign, s7 = result_expo, s8 = mantissa
# if a is zero (exp_a==0 && mant_a==0) => return b
beqz a0,return_b
beqz a1,return_a
beqz t2, skip_a_norm
ori t4, t4, 0x80 # mant_a |= 0x80
skip_a_norm:
beqz t3, skip_b_norm
ori t5, t5, 0x80 # mant_b |= 0x80
skip_b_norm:
sub s9, t2, t3 # exp_diff = exp_a - exp_b
bgtz s9, exp_a_bigger
bltz s9, exp_b_bigger
j exp_equal
exp_a_bigger:
mv s7, t2 # result_exp = exp_a
li t6, 8
bgt s9, t6, return_a
srl t5, t5, s9 # mant_b >>= exp_diff
j continue_add
exp_b_bigger:
neg s9, s9 # exp_diff = -exp_diff
mv s7, t3 # result_exp = exp_b
li t6, 8
bgt s9, t6, return_b # s9 was negated above, so it is positive here
srl t4, t4, s9 # mant_a >>= exp_diff
j continue_add
exp_equal:
mv s7, t2
continue_add:
beq t0, t1, same_sign
j diff_sign
same_sign:
mv s6, t0 # result_sign = sign_a
add s8, t4, t5
li t6, 0x100
and s9, s8, t6
beqz s9, normalize_end # no carry into bit 8: mantissa is already normalized
srli s8, s8, 1 # => mantissa >> 1
addi s7, s7, 1 # exponent++
li t6, 0xFF
blt s7, t6, normalize_end
slli t6, s6, 15
li s9, 0x7F80
or t6, t6, s9
mv a0, t6
j add_exit
diff_sign:
bgeu t4, t5, mant_a_ge
mv s6, t1 # result_sign = sign_b
sub s8, t5, t4 # result_mant = mant_b - mant_a
j normalize_check
mant_a_ge:
mv s6, t0
sub s8, t4, t5
normalize_check:
beqz s8, return_zero
normalize_loop:
li t6, 0x80
and s9, s8, t6
bnez s9, normalize_end
slli s8, s8, 1
addi s7, s7, -1 # exponent--
blez s7, return_zero
j normalize_loop
normalize_end:
slli t6, s6, 15 # (sign << 15)
andi s9, s7, 0xFF
slli s9, s9, 7 # (exp << 7)
or t6, t6, s9
andi s9, s8, 0x7F # mantissa (7 bits)
or t6, t6, s9
mv a0, t6
j add_exit
return_a:
mv a0, a0
j add_exit
return_b:
mv a0, a1
j add_exit
return_zero:
li a0, 0
j add_exit
add_exit:
lw ra, 4(sp)
addi sp, sp, 8
ret
bf16_sub:
addi sp, sp, -4 # allocate the frame (stack grows downwards)
sw ra, 0(sp)
li t0, 0x8000
xor a1, a1, t0
jal ra,bf16_add
lw ra,0(sp)
addi sp,sp,4
ret
bf16_mul:
addi sp, sp, -4
sw ra, 0(sp)
# constants
li t6, 0xFF
li s4, 0x7F
li s5, 127 # BF16_EXP_BIAS
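# return_zero belongs to bf16_add and unwinds an 8-byte frame, so handle zero inputs locally
bnez a0, mul_a_nonzero
li a0, 0 # a == +0: result is zero
j mul_done
mul_a_nonzero:
bnez a1, mul_b_nonzero
li a0, 0 # b == +0: result is zero
j mul_done
mul_b_nonzero: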
# extract sign bits
srli t0, a0, 15
andi t0, t0, 1 # sign_a (t0)
srli t1, a1, 15
andi t1, t1, 1 # sign_b(t1)
# extract exponents (8 bits)
srli t2, a0, 7
and t2, t2, t6 # exp_a (t2)
srli t3, a1, 7
and t3, t3, t6 # exp_b (t3)
# extract mantissas (7 bits)
and t4, a0, s4 # mant_a (t4)
and t5, a1, s4 # mant_b (t5)
# result sign = sign_a ^ sign_b (s7)
xor s7, t0, t1
# result expo (s8)
# result mant (s9)
# exp_adjust = 0
li s6, 0
# === normalize mant_a ===
beqz t2, denorm_a
norm_a:
ori t4, t4, 0x80
j mant_a_done
denorm_a:
beqz t4, mant_a_done
denorm_a_loop:
andi t0, t4, 0x80
bnez t0, mant_a_done
slli t4, t4, 1
addi s6, s6, -1
j denorm_a_loop
mant_a_done:
# === normalize mant_b ===
beqz t3, denorm_b
norm_b:
ori t5, t5, 0x80
j mant_b_done
denorm_b:
beqz t5, mant_b_done
denorm_b_loop:
andi t0, t5, 0x80
bnez t0, mant_b_done
slli t5, t5, 1
addi s6, s6, -1
j denorm_b_loop
mant_b_done:
# mantissa multiply (8x8 = 16-bit)
mul s9, t4, t5
# result_exp = exp_a + exp_b - bias + exp_adjust
add s8, t2, t3
add s8, s8, s6
addi s8, s8, -127
# normalize mantissa
li t0, 0x8000
and t1, s9, t0
bnez t1, shift8
# no overflow: shift right 7 bits
srli s9, s9, 7
andi s9, s9, 0x7F
j norm_done
shift8:
srli s9, s9, 8
andi s9, s9, 0x7F
addi s8, s8, 1
norm_done:
# overflow check
li t0, 0xFF
bge s8, t0, set_inf
# underflow check
blez s8, underflow
# ===== normal result =====
slli t0, s7, 15
andi t1, s8, 0xFF
slli t1, t1, 7
or t0, t0, t1
andi t1, s9, 0x7F
or a0, t0, t1
j mul_done
# ===== underflow case =====
underflow:
li t0, -6
blt s8, t0, return_zero_mul
li t1, 1
sub t1, t1, s8
srl s9, s9, t1
li s8, 0
slli t0, s7, 15
andi t1, s8, 0xFF
slli t1, t1, 7
or t0, t0, t1
andi t1, s9, 0x7F
or a0, t0, t1
j mul_done
# ===== overflow (Inf) =====
set_inf:
slli a0, s7, 15
li t6,0x7F80
or a0, a0, t6
j mul_done
# ===== zero result =====
return_zero_mul:
slli a0, s7, 15
# ===== finish =====
mul_done:
lw ra, 0(sp)
addi sp, sp, 4
ret
bf16_div:
li t6, 0xFF
li s4, 0x7F
li s5, 127 # BF16_EXP_BIAS
# extract sign bits
srli t0, a0, 15
andi t0, t0, 1 # sign_a
srli t1, a1, 15
andi t1, t1, 1 # sign_b
# extract exponents
srli t2, a0, 7
and t2, t2, t6 # exp_a
srli t3, a1, 7
and t3, t3, t6 # exp_b
# extract mantissas
and t4, a0, s4 # mant_a
and t5, a1, s4 # mant_b
# result sign = sign_a ^ sign_b
xor s7, t0, t1
# add hidden 1-bit
ori t4, t4, 0x80 # mant_a |= 0x80
ori t5, t5, 0x80 # mant_b |= 0x80
# result_exp = exp_a - exp_b + bias
sub s8, t2, t3
add s8, s8, s5 # s8 = exp_a - exp_b + 127
# mantissa division (approx)
slli s9, t4, 7 # mant_a << 7
divu s9, s9, t5 # result_mant = mant_a / mant_b
# normalization (if mantissa >= 0x100)
li t0, 0x100
and t1, s9, t0
beqz t1, skip_norm
srli s9, s9, 1
addi s8, s8, 1
skip_norm:
# pack result bits
slli t0, s7, 15 # sign << 15
slli t1, s8, 7 # exp << 7
or t0, t0, t1
and s9, s9, s4 # mant & 0x7F
or a0, t0, s9
ret
bf16_sqrt:
li s5, 127 # BF16_EXP_BIAS
li t6, 0xFF
li s4, 0x7F
# extract exponent and mantissa
srli t0, a0, 7
and t0, t0, t6 # exp
and t1, a0, s4 # mant
# e = exp - bias
addi t2, t0, -127 # e = exp - 127
li t3, 1
and t4, t2, t3 # t4 = e & 1
ori t5, t1, 0x80 # m = 0x80 | mant
beqz t4, sqrt_even_exp
slli t5, t5, 1 # m <<= 1
addi t2, t2, -1
sqrt_even_exp:
srai t6, t2, 1
add t6, t6, s5 # new_exp = (e>>1)+bias
# binary search for the square root of the mantissa
li s0, 90 # low
li s1, 256 # high
li s2, 128 # result
sqrt_loop:
bgt s0, s1, sqrt_done
add s3, s0, s1
srli s3, s3, 1 # mid = (low + high) >> 1
mul s4, s3, s3
srli s4, s4, 7 # sq = (mid*mid)/128
ble s4, t5, sqrt_le
addi s1, s3, -1 # high = mid - 1
j sqrt_loop
sqrt_le:
mv s2, s3 # result = mid
addi s0, s3, 1 # low = mid + 1
j sqrt_loop
sqrt_done:
li t0, 256
blt s2, t0, sqrt_check_low
srli s2, s2, 1
addi t6, t6, 1 # new_exp++
j sqrt_pack
sqrt_check_low:
li t1, 128
bge s2, t1, sqrt_pack
sqrt_shift_up:
blt t6, zero, sqrt_pack
slli s2, s2, 1
addi t6, t6, -1
blt s2, t1, sqrt_shift_up
sqrt_pack:
andi s2, s2, 0x7F # new_mant = result & 0x7F
slli t6, t6, 7
or a0, t6, s2
ret
#------------------------test_edge_classes--------------------------
bf16_eq:
addi sp, sp, -16
sw ra, 8(sp)
mv t0, a0 #store a in t0
jal ra, bf16_isnan
bnez a0, eq_false
# b NaN or not
mv t1, a1 #store b in t1
mv a0, t1
jal ra, bf16_isnan
bnez a0, eq_false
# both zero or not
mv a0, t0 # a0 = a
jal ra, bf16_iszero
sw a0, 0(sp) # save iszero(a): bf16_iszero overwrites t1
mv a0, a1
jal ra, bf16_iszero
lw t1, 0(sp) # t1 = iszero(a)
and t2, t1, a0
bnez t2, eq_true
# bit equality
beq t0, a1, eq_true
eq_false:
li a0, 0
j eq_exit
eq_true:
li a0, 1
eq_exit:
lw ra, 8(sp)
addi sp, sp, 16
ret
bf16_lt:
addi sp, sp, -16
sw ra, 12(sp)
mv t0, a0
jal ra, bf16_isnan
bnez a0, lt_false
mv a0, a1
jal ra, bf16_isnan
bnez a0, lt_false
# check zero
mv a0, t0
jal ra, bf16_iszero
sw a0, 0(sp) # save iszero(a): bf16_iszero overwrites t1
mv a0, a1
jal ra, bf16_iszero
lw t1, 0(sp) # t1 = iszero(a)
and t2, t1, a0
bnez t2, lt_false
# sign_a = (a >> 15) & 1
srli t3, t0, 15
andi t3, t3, 1
# sign_b = (b >> 15) & 1
srli t4, a1, 15
andi t4, t4, 1
# sign_a != sign_b ?
bne t3, t4, sign_diff
# same sign
beqz t3, both_pos # if sign = 0 positive
j both_neg
both_pos:
blt t0, a1, lt_true # both pos compare with numbers
j lt_false
both_neg:
bgt t0, a1, lt_true # neg_num need reverse
j lt_false
sign_diff:
# sign_a > sign_b ?
bgt t3, t4, lt_true
j lt_false
lt_true:
li a0, 1
j lt_exit
lt_false:
li a0, 0
lt_exit:
lw ra, 12(sp)
addi sp, sp, 16
ret
bf16_gt:
addi sp, sp, -16
sw ra, 4(sp)
mv t0, a0 # store a
mv a0, a1 # bf16_lt(b, a)
mv a1, t0
jal ra, bf16_lt
lw ra, 4(sp)
addi sp, sp, 16
ret
# ---------------- fail_msg_for_special----------------
fail_posinf:
la a0, msg_fail_posinf
li a7, 4
ecall
li a0, 1
ret
fail_inf_nan:
la a0, msg_fail_inf_nan
li a7, 4
ecall
li a0, 1
ret
fail_neginf:
la a0, msg_fail_neginf
li a7, 4
ecall
li a0, 1
ret
fail_nan:
la a0, msg_fail_nan
li a7, 4
ecall
li a0, 1
ret
fail_nan_inf:
la a0, msg_fail_nan_inf
li a7, 4
ecall
li a0, 1
ret
fail_zero:
la a0, msg_fail_zero
li a7, 4
ecall
li a0, 1
ret
fail_negzero:
la a0, msg_fail_negzero
li a7, 4
ecall
li a0, 1
ret
### ---------------- fail_for_comparison----------------
fail_naneq:
la a0, msg_fail_naneq
li a7, 4
ecall
li a0, 1
ret
fail_nanlt:
la a0, msg_fail_nanlt
li a7, 4
ecall
li a0, 1
ret
fail_nangt:
la a0, msg_fail_nangt
li a7, 4
ecall
li a0, 1
ret
fail_eq:
la a0, msg_fail_eq
li a7, 4
ecall
li a0, 1
ret
fail_ineq:
la a0, msg_fail_ineq
li a7, 4
ecall
li a0, 1
ret
fail_lt:
la a0, msg_fail_lt
li a7, 4
ecall
li a0, 1
ret
fail_nlt:
la a0, msg_fail_nlt
li a7, 4
ecall
li a0, 1
ret
fail_enlt:
la a0, msg_fail_enlt
li a7, 4
ecall
li a0, 1
ret
fail_gt:
la a0, msg_fail_gt
li a7, 4
ecall
li a0, 1
ret
fail_ngt:
la a0, msg_fail_ngt
li a7, 4
ecall
li a0, 1
ret
#---------------------fail_arithmetic-----------------
fail_add:
la a0, msg_add_fail
li a7, 4
ecall
ret
fail_sub:
la a0, msg_sub_fail
li a7, 4
ecall
ret
fail_mul:
la a0, msg_mul_fail
li a7, 4
ecall
ret
fail_div:
la a0, msg_div_fail
li a7, 4
ecall
ret
fail_sqrt4:
la a0, msg_sqrt4_fail
li a7, 4
ecall
ret
fail_sqrt9:
la a0, msg_sqrt9_fail
li a7, 4
ecall
ret
bf16_isinf:
mv t1,a0
li t2,0x7F80
and t3,t1,t2
bne t3,t2, not_inf
li t2,0x007F
and t4,t1,t2
bnez t4, not_inf
li a0,1
ret
not_inf:
li a0,0
ret
bf16_isnan:
mv t1,a0
li t2,0x7F80
and t3,t1,t2
bne t3,t2, not_nan
li t2,0x007F
and t4,t1,t2
beqz t4, not_nan
li a0,1
ret
not_nan:
li a0,0
ret
bf16_iszero:
mv t1,a0
li t2,0x7FFF
and t1,t1,t2
beqz t1, is_zero
li a0,0
ret
is_zero:
li a0,1
ret
bf16_is_neg_zero:
li t1,0x8000
beq a0,t1,is_negzero
li a0,0
ret
is_negzero:
li a0,1
ret
pass_basic_msg:
la a0,Basic_msg
li a7,4
ecall
j back_from_basic_msg ## continue with the remaining tests
print_hex:
mv t0, a0 # val
la a0,hex
li a7,4
ecall
mv t1, a1 # digits
la t2, hexchars # hex
print_hex_loop:
beqz t1, print_hex_done
addi t1, t1, -1
slli t3, t1, 2
srl t4, t0, t3 # val >> (4*pos)
andi t4, t4, 0xF
add t4, t2, t4
lbu a0, 0(t4)
li a7, 11 # print_char
ecall
j print_hex_loop
print_hex_done:
li a0, 10 # newline
li a7, 11
ecall
ret
fail:
la a0, fail_msg
li a7, 4
ecall
li a0, 1 # return 1 if fail
ret
end:
la a0, all_test_pass_msg
li a7, 4
ecall
li a7,10
ecall
f32_to_bf16:
mv t1,a0
# ((f32bits >> 23) & 0xFF)
srli t2, t1, 23
andi t2, t2, 0xFF #store expo in t2
li t3, 0xFF
beq t2,t3,is_nan_inf
srli t4,t1,16
andi t4,t4,1
li t3,0x7FFF
add t4,t4,t3
add t1,t1,t4
srli t2,t1,16
mv a0,t2
ret
bf16_to_f32:
slli t3,a0,16
mv a0,t3
ret
is_nan_inf:
srli t4,t1,16 # Inf/NaN: truncate, keep the upper 16 bits of the f32 value
li t3,0xffff
and t4,t4,t3
mv a0,t4
ret
```
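
The conversion and classification routines can be checked against a short C model. The sketch below works directly on raw bit patterns, as the assembly does, and assumes the same round-to-nearest-even rule as `f32_to_bf16` above. The function names are chosen here for illustration and mirror the assembly labels only loosely.

```c
#include <stdint.h>

/* Convert f32 bits to bf16 bits with round-to-nearest, ties-to-even. */
static uint16_t f32_bits_to_bf16(uint32_t bits)
{
    if (((bits >> 23) & 0xFFu) == 0xFFu)
        return (uint16_t)(bits >> 16); /* Inf/NaN: truncate, no rounding */
    bits += ((bits >> 16) & 1u) + 0x7FFFu; /* rounding bias */
    return (uint16_t)(bits >> 16);
}

/* bf16 -> f32 is exact: the 16 bits become the upper half of the word. */
static uint32_t bf16_to_f32_bits(uint16_t h)
{
    return (uint32_t)h << 16;
}

/* Exponent all ones and non-zero mantissa. */
static int bf16_isnan(uint16_t h)
{
    return (h & 0x7F80u) == 0x7F80u && (h & 0x007Fu) != 0;
}

/* Exponent all ones and zero mantissa (either sign). */
static int bf16_isinf(uint16_t h)
{
    return (h & 0x7F80u) == 0x7F80u && (h & 0x007Fu) == 0;
}

/* +0 or -0: everything except the sign bit is zero. */
static int bf16_iszero(uint16_t h)
{
    return (h & 0x7FFFu) == 0;
}
```

For the three basic test values, the model maps `0x00000000`, `0x3f800000`, and `0xbf800000` to `0x0000`, `0x3f80`, and `0xbf80`, which can be compared with the hex values printed by the program.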