Assignment1: RISC-V Assembly and Instruction Pipeline

[toc]

# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by [< jningmin >](https://github.com/jningmin/2025_computer_architecture)

## [Quiz 1](https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=6284%3A2518%3A0%3A1759746369%3AseANJk) problem B
<h3>Implementation_Uf8 and C code </h3>

In this problem, we are asked to implement and test a simplified floating-point encoding scheme called UF8 (Unsigned Float 8-bit). The goal is to convert between a 32-bit unsigned integer (uint32_t) and a compact 8-bit representation (uf8), while preserving approximate numeric relationships.

C code: https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=6284%3A2518%3A0%3A1759746369%3AseANJk

<h3>Assembly code</h3>

My test cases are: `A : 15, B : 125, C : 225.`
The main functions are shown below; the full version of the code is at the bottom of this page.

:::info
#### Step-by-step Explanation:
:::
:::info
1. The program calls `Decode`, which converts the UF8 value (in register a1) into a decoded integer representation returned in a0.
2. After restoring the stack, the decoded integer result is stored in temporary register t0.
3. The result is printed as the decoded output using the system call (ecall) for string and integer printing.
4. The `Encode` subroutine is called to re-encode the integer value back into UF8 format.
5. After encoding, the program checks two conditions:
   - whether the re-encoded value matches the original input (`bne t0, a1, error`), and
   - whether the encoded value maintains monotonicity (`blt t0, t2, error`).
6. If both conditions are satisfied, the encoded value is printed in hexadecimal form by calling `print_hex`. The routine then restores the stack pointer and returns control to the main program.
:::

**Decoding**
>In the decoding process,
>→ the mantissa is obtained by masking the lower four bits of the input.
>→ the exponent is extracted through a right-shift operation. The exponent is then negated (two's complement) and 15 is added, giving 15 − exponent, to prepare for the offset adjustment.
An offset value is generated by shifting a constant (0x7FFF) right by the adjusted exponent and then left by four bits, which aligns the decoded number within the representable dynamic range. The mantissa is subsequently scaled by left-shifting it according to the exponent, and the final decoded integer is produced by adding this scaled mantissa to the computed offset, as expressed by $\text{value} = (\text{mantissa} \ll \text{exponent}) + \text{offset}$

```s
Decode:
    andi t0,a1,0x0f     ### store mantissa in t0
    srli t1,a1,4        ### store expo in t1
    not  t2,t1
    addi t2,t2,1        ### two's complement of expo
    addi t2,t2,15       ### t2 = 15 - expo
    li   t3,0x7fff
    srl  t3,t3,t2
    slli t3,t3,4        ### store offset in t3
    sll  t4,t0,t1
    add  a0,t4,t3       ### store trans_num in a0 (return value)
    xor  t0,t0,t0
    xor  t3,t3,t3
    ret
```

**Encoding**
>The encoding routine uses a logarithmic quantization approach similar to floating-point representation. It first estimates the exponent by locating the position of the most significant bit (MSB), found through a Count-Leading-Zero (CLZ) operation. This exponent effectively determines the dynamic range in which the integer lies.
>Once the exponent is estimated, the algorithm constructs an overflow threshold that represents the smallest value encodable with that exponent. This threshold is iteratively adjusted until it properly bounds the input value. The mantissa is then derived by measuring how far the value exceeds this overflow boundary, scaled by the exponent's power of two.
>Mathematically, the overflow is built as an accumulated sum of power-of-two increments:
$$
\text{overflow} = \sum_{i=0}^{e-1} 2^i \times 16
$$
Finally, the mantissa and exponent are combined into a single 8-bit quantity:
$$
\text{UF8} = (\text{exponent} \ll 4) \mid \text{mantissa}
$$

```s
Encode:
    addi sp, sp, -16
    sw   ra, 12(sp)                  # store
    add  s4,a0,x0                    ### s4=value
    li   t0,16
    bltu a0,t0,no_need_more_op       ### a0=value (values below 16 encode to themselves)
    jal  ra,clz
    mv   t1,a0                       ### lz = clz(value)
    li   t2,31
    sub  s1,t2,t1                    ### store msb in s1
    li   s2,0                        ### store exponent in s2
    li   s3,0                        ### store overflow in s3
    li   t6,5
    bltu s1,t6,find_exact_exponent
    addi s2,s1,-4
    li   t6,15
    bltu s2,t6, Calculate_overflow   ### keep the estimate if it is below 15
    li   s2,15                       ### otherwise clamp it to 15
Calculate_overflow:                  ### for loop
    li   s8,0                        ### counting
loop1:
    beq  s8,s2,Adjust_if_estimate
    slli s3,s3,1
    addi s3,s3,16
    addi s8,s8,1
    j    loop1
Adjust_if_estimate:
    bgtz s2,check1
    j    find_exact_exponent
check1:
    bltu s4,s3,check2
    j    find_exact_exponent
check2:
    addi s3,s3,-16
    srli s3,s3,1
    addi s2,s2,-1
    j    Adjust_if_estimate
find_exact_exponent:
    li   s7,20
    bge  s2, s7, end_encode
    li   t6,15
    bgeu s2,t6,end_encode            ### stop once the exponent reaches 15 (4-bit field)
    slli s5,s3,1                     ### 7f0 should be in s3 overflow
    addi s5,s5,16                    ### next_overflow store in s5
    bltu s4,s5,end_encode
    mv   s3,s5
    addi s2,s2,1
    j    find_exact_exponent
end_encode:
    sub  s6,s4,s3
    srl  s6,s6,s2                    ### store other mantissa in s6
    slli t6,s2,4
    add  a0,s6,t6
    lw   ra, 12(sp)
    addi sp, sp, 16
    ret
```

**Counting leading zeros**
>Initialize n = 32 and c = 16. Iteratively shift the input right by c bits and check whether the result is non-zero. If it is non-zero, decrement n by c and update x with the shifted value. Halve c as the search proceeds, so that it performs a binary search over bit positions. Once c reaches zero, compute the final count as n - x, yielding the total number of leading zeros. Return this count in a0.
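The binary-search procedure described above can be sketched in C as follows. This is a minimal illustration rather than the quiz's reference code; the function name `clz32` is an assumption chosen for this example.

```c
#include <stdint.h>

/* Count leading zeros of a 32-bit value by binary search over bit positions.
 * n starts at 32 and c at 16; whenever the upper part (x >> c) is non-zero,
 * the answer shrinks by c and the search continues in that upper part. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 32, c = 16;
    while (c) {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    }
    return n - x; /* x is now 0 or 1 */
}
```

With this helper, the MSB position used by the encoder is `31 - clz32(value)`; for example, 225 has 24 leading zeros, so its MSB is bit 7.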
```s
clz:
    li   t0,32          ### n=32
    li   t1,16          ### c=16
    mv   t3,a0
loop0:
    beqz t1,return0
    srl  t2,t3,t1       ### y = x >> c
    beqz t2,devide_c
    sub  t0,t0,t1
    mv   t3,t2
    j    loop0
devide_c:
    srli t1,t1,1
    j    loop0
return0:
    sub  t0,t0,t3
    mv   a0,t0
    ret
no_need_more_op:        # reached from Encode when value < 16
    lw   ra, 12(sp)     # restore ra stored at start of Encode
    addi sp, sp, 16     # restore stack
    ret
```

`print_hex` displays the encoded value in human-readable hexadecimal form without using hardware floating-point instructions or registers.
>print_hex divides the 8-bit number into two 4-bit nibbles:
>- the high nibble `(a0 >> 4)`
>- the low nibble `(a0 & 0xF)`
>
>Each nibble is passed to the print_nibble routine for ASCII conversion: values 0–9 are mapped to characters '0'–'9', and values 10–15 are mapped to 'a'–'f'. System call ecall with code `11` is used to print the `ASCII` character to the console.

```s
print_hex:
    addi sp, sp, -16
    sw   ra, 8(sp)      # store
    srli t0, a0, 4      # high nibble
    andi t1, a0, 0xF    # low nibble
    la   a0, msg6
    li   a7, 4
    ecall
    mv   a0, t0
    jal  ra, print_nibble
    mv   a0, t1
    jal  ra, print_nibble
    lw   ra,8(sp)
    addi sp,sp,16
    ret
print_nibble:
    li   t2,10
    bltu a0, t2, digit
    addi a0, a0, 87     # 10..15 -> 'a'..'f'
    j    out
digit:
    addi a0, a0, 48     # 0..9 -> '0'..'9'
out:
    li   a7, 11         # print_char
    ecall
    ret
```

:::info
Result displayed on the console:
:::
![image](https://hackmd.io/_uploads/SkgbMZzaxl.png =250x)

After all these stages are done, the registers are updated as follows:
![image](https://hackmd.io/_uploads/SkPa9GZpxx.png =250x)![image](https://hackmd.io/_uploads/SkKliGZ6eg.png =250x)

:::info
Memory viewer:
:::
The data section of memory is shown below.
![image](https://hackmd.io/_uploads/rJONszWpxx.png)

:::info
The clock cycles of program:
:::
![image](https://hackmd.io/_uploads/SJUz7efagx.png)

## [Quiz 1](https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=6284%3A2518%3A0%3A1759746369%3AseANJk) problem C
<h3>Implementation_bfloat16 and C code </h3>

In this project, we implemented and verified a simplified 16-bit floating-point format called Bfloat16 (BF16) entirely in RISC-V assembly language.
This project demonstrates the ability to simulate IEEE-like floating-point behavior using only integer and bitwise instructions.

C code:
https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=10087%3A7233%3A0%3A1759746474%3Ap5GiEp
https://hackmd.io/@sysprog/arch2025-quiz1-sol?stext=18828%3A2804%3A0%3A1759746490%3ABFJGmJ

<h3>Assembly code</h3>

The main functions are shown below; the full version of the assembly code is at the bottom of this page.

:::info
Six main test functions
→ test basic conversions
→ test special values
→ test arithmetic
→ test comparisons
→ test edge cases
→ test rounding
:::

**test_basic_conversions()**
>**f32_to_bf16**: This function converts a 32-bit IEEE single-precision float (f32) into a 16-bit bfloat16 (bf16) representation. It first extracts the exponent, then handles special cases such as NaN and infinity. The conversion keeps the sign and exponent bits and drops the lower 16 bits of the 32-bit word. Before truncation, rounding to nearest (ties to even) is applied by adding 0x7FFF plus the lowest kept bit (bit 16) to the word.
>**bf16_to_f32**: This reverses the previous operation. It converts a 16-bit bfloat16 number back into the 32-bit float representation by shifting it left by 16 bits. The lower bits of the mantissa are filled with zeros.

```s
f32_to_bf16:
    mv   t1,a0
    # ((f32bits >> 23) & 0xFF)
    srli t2, t1, 23
    andi t2, t2, 0xFF   # store expo in t2
    li   t3, 0xFF
    beq  t2,t3,is_nan_inf
    srli t4,t1,16
    andi t4,t4,1        # lowest kept bit (bit 16)
    li   t3,0x7FFF
    add  t4,t4,t3
    add  t1,t1,t4       # f32bits += ((f32bits >> 16) & 1) + 0x7FFF
    srli t2,t1,16
    mv   a0,t2
    ret
bf16_to_f32:
    slli t3,a0,16
    mv   a0,t3
    ret
```

**test_special_values()**
>This test routine verifies that special BF16 values (Inf, -Inf, NaN, +0, -0) are correctly detected. It calls helper functions such as `bf16_isinf`, `bf16_isnan`, `bf16_iszero`, and `bf16_is_neg_zero`. Each case prints a message if it fails the expected check. Essentially, it ensures the system correctly interprets the bit patterns for special cases.
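The bit-pattern checks exercised by this test can be sketched in C as follows. This is an illustrative sketch, not the quiz's reference code: the function names mirror the assembly helpers, while the two mask macros are names assumed for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define BF16_EXP_MASK  0x7F80u /* 8 exponent bits (assumed macro name) */
#define BF16_MANT_MASK 0x007Fu /* 7 mantissa bits (assumed macro name) */

/* Inf: exponent all ones, mantissa zero. */
static bool bf16_isinf(uint16_t b)
{
    return (b & BF16_EXP_MASK) == BF16_EXP_MASK && (b & BF16_MANT_MASK) == 0;
}

/* NaN: exponent all ones, mantissa non-zero. */
static bool bf16_isnan(uint16_t b)
{
    return (b & BF16_EXP_MASK) == BF16_EXP_MASK && (b & BF16_MANT_MASK) != 0;
}

/* Zero: everything except the sign bit is clear. */
static bool bf16_iszero(uint16_t b)
{
    return (b & 0x7FFFu) == 0;
}

/* Negative zero: only the sign bit is set. */
static bool bf16_is_neg_zero(uint16_t b)
{
    return b == 0x8000u;
}
```

The constants used by the test routine (0x7F80, 0xFF80, 0x7FC0, 0x0000, 0x8000) map onto these predicates directly.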
```s
test_special_values:
    la   a0, msg_start_special
    li   a7, 4
    ecall
    li   a0, 0x7F80     # Test +Inf
    jal  ra, bf16_isinf
    beqz a0, fail_posinf
    li   a0, 0x7F80
    jal  ra, bf16_isnan
    bnez a0, fail_inf_nan
    li   a0, 0xFF80     # Test -Inf
    jal  ra, bf16_isinf
    beqz a0, fail_neginf
    li   a0, 0x7FC0     # Test NaN
    jal  ra, bf16_isnan
    beqz a0, fail_nan
    li   a0, 0x7FC0
    jal  ra, bf16_isinf
    bnez a0, fail_nan_inf
    li   a0, 0x0000     # Test +0
    jal  ra, bf16_iszero
    beqz a0, fail_zero
    li   a0, 0x8000     # Test -0
    jal  ra, bf16_is_neg_zero
    beqz a0, fail_negzero
    la   a0, Special_msg
    li   a7, 4
    ecall
```

**test_arithmetic()**
>**Addition :**
`bf16_add` is the main arithmetic routine; it is shared by addition and subtraction. This function performs addition (or subtraction) between two BF16 numbers. It follows IEEE-style floating-point addition logic:
1. Extract signs, exponents, and mantissas.
2. Align the operand with the smaller exponent by shifting its mantissa right.
3. Depending on the signs:
   → If they are the same → add the mantissas, then normalize and adjust the exponent.
   → If they differ → subtract the smaller mantissa from the larger one, then renormalize by shifting left.
4. Detect overflow (→ set to Inf) and underflow (→ set to 0).
5. Reconstruct the final 16-bit BF16 result: `(sign << 15) | (exp << 7) | mantissa`.
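The steps above can be sketched in C for the simplest case. This is a minimal sketch under stated assumptions: the helper name `bf16_add_same_sign` is hypothetical, and it only covers two normal operands (exponent 1..254) with the same sign, so the subtraction and denormal paths are left out.

```c
#include <stdint.h>

/* Hypothetical helper: add two normal BF16 values that share the same sign. */
static uint16_t bf16_add_same_sign(uint16_t a, uint16_t b)
{
    /* Step 1: extract the fields and restore the implicit leading 1. */
    unsigned sign = a >> 15;
    int exp_a = (a >> 7) & 0xFF;
    int exp_b = (b >> 7) & 0xFF;
    unsigned mant_a = (a & 0x7F) | 0x80;
    unsigned mant_b = (b & 0x7F) | 0x80;

    /* Step 2: align the operand with the smaller exponent. */
    int diff = exp_a - exp_b;
    int exp = exp_a;
    if (diff > 0) {
        if (diff > 8)
            return a; /* b is too small to affect the result */
        mant_b >>= diff;
    } else if (diff < 0) {
        if (diff < -8)
            return b; /* a is too small to affect the result */
        mant_a >>= -diff;
        exp = exp_b;
    }

    /* Step 3 (same sign): add, then renormalize on a carry into bit 8. */
    unsigned mant = mant_a + mant_b;
    if (mant & 0x100) {
        mant >>= 1;
        if (++exp >= 0xFF) /* Step 4: overflow saturates to Inf */
            return (uint16_t)((sign << 15) | 0x7F80);
    }

    /* Step 5: repack the fields. */
    return (uint16_t)((sign << 15) | ((unsigned)exp << 7) | (mant & 0x7F));
}
```

For example, 1.0 (0x3F80) + 2.0 (0x4000) aligns the smaller mantissa to 0x40, sums to 0xC0, and repacks as 0x4040, which is 3.0.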
```s
bf16_add:
    addi sp,sp,-8
    sw   ra,4(sp)
    li   t5,0xFF
    li   t6,0x7F
    srli t0, a0, 15         # sign_a in t0
    andi t0, t0, 1
    srli t1, a1, 15         # sign_b in t1
    andi t1, t1, 1
    srli t2, a0, 7          # exp_a in t2
    and  t2, t2, t5
    srli t3, a1, 7          # exp_b in t3
    and  t3, t3, t5
    and  t4, a0, t6         # mant_a in t4
    and  t5, a1, t6         # mant_b in t5
    # s6 = result_sign, s7 = result_expo, s8 = mantissa
    # if a is +0 (all bits clear) => return b; if b is +0 => return a
    beqz a0,return_b
    beqz a1,return_a
    beqz t2, skip_a_norm
    ori  t4, t4, 0x80       # mant_a |= 0x80
skip_a_norm:
    beqz t3, skip_b_norm
    ori  t5, t5, 0x80       # mant_b |= 0x80
skip_b_norm:
    sub  s9, t2, t3         # exp_diff = exp_a - exp_b
    bgtz s9, exp_a_bigger
    bltz s9, exp_b_bigger
    j    exp_equal
exp_a_bigger:
    mv   s7, t2             # result_exp = exp_a
    li   t6, 8
    bgt  s9, t6, return_a
    srl  t5, t5, s9         # mant_b >>= exp_diff
    j    continue_add
exp_b_bigger:
    neg  s9, s9             # exp_diff = -exp_diff (now positive)
    mv   s7, t3             # result_exp = exp_b
    li   t6, 8
    bgt  s9, t6, return_b   # compare after negation, so the bound is +8
    srl  t4, t4, s9         # mant_a >>= -exp_diff
    j    continue_add
exp_equal:
    mv   s7, t2
continue_add:
    beq  t0, t1, same_sign
    j    diff_sign
same_sign:
    mv   s6, t0             # result_sign = sign_a
    add  s8, t4, t5
    li   t6, 0x100
    and  s9, s8, t6
    beqz s9, normalize_end
    srli s8, s8, 1          # => mantissa >> 1
    addi s7, s7, 1          # exponent++
    li   t6, 0xFF
    blt  s7, t6, normalize_end
    slli t6, s6, 15         # overflow: return signed Inf
    li   s9, 0x7F80
    or   t6, t6, s9
    mv   a0, t6
    j    add_exit
diff_sign:
    bgeu t4, t5, mant_a_ge
    mv   s6, t1             # result_sign = sign_b
    sub  s8, t5, t4         # result_mant = mant_b - mant_a
    j    normalize_check
mant_a_ge:
    mv   s6, t0
    sub  s8, t4, t5
normalize_check:
    beqz s8, return_zero
normalize_loop:
    li   t6, 0x80
    and  s9, s8, t6
    bnez s9, normalize_end
    slli s8, s8, 1
    addi s7, s7, -1         # exponent--
    blez s7, return_zero
    j    normalize_loop
normalize_end:
    slli t6, s6, 15         # (sign << 15)
    andi s9, s7, 0xFF
    slli s9, s9, 7          # (exp << 7)
    or   t6, t6, s9
    andi s9, s8, 0x7F       # mantissa (7 bits)
    or   t6, t6, s9
    mv   a0, t6
    j    add_exit
return_a:
    mv   a0, a0             # a is already in a0
    j    add_exit
return_b:
    mv   a0, a1
    j    add_exit
return_zero:
    li   a0, 0
    j    add_exit
add_exit:
    lw   ra, 4(sp)
    addi sp, sp, 8
    ret
```

>**Subtraction :**
Subtraction is implemented by flipping the sign bit of the second operand (XOR with 0x8000) and then calling bf16_add. This reuses the same addition logic since subtraction is just addition with a negated operand.

```s
bf16_sub:
    addi sp, sp, -4     # allocate the frame (the stack grows downward)
    sw   ra, 0(sp)
    li   t0, 0x8000
    xor  a1, a1, t0     # b = -b
    jal  ra,bf16_add
    lw   ra,0(sp)
    addi sp,sp,4
    ret
```

>**Multiplication :**
1. Extract sign, exponent, and mantissa from both operands.
2. Result sign = XOR(sign_a, sign_b).
3. Result `exponent = exp_a + exp_b − bias (127)`.
4. Multiply mantissas (8-bit normalized, producing up to 16 bits).
5. Normalize result mantissa (shift if overflow/underflow).
6. Handle special cases:
   - Overflow → return `±Inf`.
   - Underflow → return `0`.
   - Otherwise → combine sign, exponent, mantissa to form BF16 result.

```s
bf16_mul:
    addi sp, sp, -4
    sw   ra, 0(sp)
    # constants
    li   t6, 0xFF
    li   s4, 0x7F
    li   s5, 127            # BF16_EXP_BIAS
    # extract sign bits
    srli t0, a0, 15
    andi t0, t0, 1          # sign_a (t0)
    srli t1, a1, 15
    andi t1, t1, 1          # sign_b (t1)
    # extract exponents (8 bits)
    srli t2, a0, 7
    and  t2, t2, t6         # exp_a (t2)
    srli t3, a1, 7
    and  t3, t3, t6         # exp_b (t3)
    # extract mantissas (7 bits)
    and  t4, a0, s4         # mant_a (t4)
    and  t5, a1, s4         # mant_b (t5)
    # result sign = sign_a ^ sign_b (s7)
    xor  s7, t0, t1
    # zero operand => signed zero; return_zero_mul restores this function's own
    # 4-byte frame (bf16_add's return_zero would pop an 8-byte frame instead)
    beqz a0, return_zero_mul
    beqz a1, return_zero_mul
    # result expo (s8)
    # result mant (s9)
    # exp_adjust = 0
    li   s6, 0
    # normalize mant_a
    beqz t2, denorm_a
norm_a:
    ori  t4, t4, 0x80
    j    mant_a_done
denorm_a:
    beqz t4, mant_a_done
denorm_a_loop:
    andi t0, t4, 0x80
    bnez t0, mant_a_done
    slli t4, t4, 1
    addi s6, s6, -1
    j    denorm_a_loop
mant_a_done:
    # normalize mant_b
    beqz t3, denorm_b
norm_b:
    ori  t5, t5, 0x80
    j    mant_b_done
denorm_b:
    beqz t5, mant_b_done
denorm_b_loop:
    andi t0, t5, 0x80
    bnez t0, mant_b_done
    slli t5, t5, 1
    addi s6, s6, -1
    j    denorm_b_loop
mant_b_done:
    # mantissa multiply (8x8 = 16-bit)
    mul  s9, t4, t5
    # result_exp = exp_a + exp_b - bias + exp_adjust
    add  s8, t2, t3
    add  s8, s8, s6
    addi s8, s8, -127
    # normalize mantissa
    li   t0, 0x8000
    and  t1, s9, t0
    bnez t1, shift8
    # no overflow: shift right 7 bits
    srli s9, s9, 7
    andi s9, s9, 0x7F
    j    norm_done
shift8:
    srli s9, s9, 8
    andi s9, s9, 0x7F
    addi s8, s8, 1
norm_done:
    # overflow check
    li   t0, 0xFF
    bge  s8, t0, set_inf
    # underflow check
    blez s8, underflow
    # ===== normal result =====
    slli t0, s7, 15
    andi t1, s8, 0xFF
    slli t1, t1, 7
    or   t0, t0, t1
    andi t1, s9, 0x7F
    or   a0, t0, t1
    j    mul_done
    # underflow case
underflow:
    li   t0, -6
    blt  s8, t0, return_zero_mul
    li   t1, 1
    sub  t1, t1, s8
    srl  s9, s9, t1
    li   s8, 0
    slli t0, s7, 15
    andi t1, s8, 0xFF
    slli t1, t1, 7
    or   t0, t0, t1
    andi t1, s9, 0x7F
    or   a0, t0, t1
    j    mul_done
    # overflow (Inf)
set_inf:
    slli a0, s7, 15
    li   t6,0x7F80
    or   a0, a0, t6
    j    mul_done
    # zero result
return_zero_mul:
    slli a0, s7, 15
    # finish
mul_done:
    lw   ra, 0(sp)
    addi sp, sp, 4
    ret
```

>**Division :**
1. Result `sign = XOR(sign_a, sign_b)`.
2. Result `exponent = exp_a − exp_b + bias`.
3. Mantissa division (approximated with integer division).
4. Normalize result if `mantissa >= 0x100`.
5. Recombine bits into final BF16 format.

This is a simplified model of floating-point division, so some precision loss is expected.
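The division steps can be sketched in C as follows. This is a hedged variation, not the author's exact routine: the name `bf16_div_sketch` is hypothetical, both operands are assumed to be normal and non-zero, and two details are assumptions of this sketch. First, because the quotient `(mant_a << 7) / mant_b` lies between 64 and 255, the sketch renormalizes when it drops below 0x80 instead of testing for 0x100. Second, out-of-range exponents are clamped to signed zero or signed Inf.

```c
#include <stdint.h>

/* Hypothetical sketch: truncating BF16 division for normal, non-zero operands. */
static uint16_t bf16_div_sketch(uint16_t a, uint16_t b)
{
    unsigned sign = ((unsigned)(a ^ b) >> 15) & 1;                 /* step 1 */
    int exp = (int)((a >> 7) & 0xFF) - (int)((b >> 7) & 0xFF) + 127; /* step 2 */
    unsigned mant_a = (a & 0x7F) | 0x80; /* restore the implicit leading 1 */
    unsigned mant_b = (b & 0x7F) | 0x80;

    /* Step 3: integer division of the scaled mantissas. */
    unsigned q = (mant_a << 7) / mant_b;

    /* Step 4 (assumed variant): mant_a < mant_b leaves bit 7 clear, so
     * recompute with one more bit of scale and lower the exponent. */
    if (q < 0x80) {
        q = (mant_a << 8) / mant_b;
        exp--;
    }

    if (exp <= 0)      /* underflow: signed zero (assumption of this sketch) */
        return (uint16_t)(sign << 15);
    if (exp >= 0xFF)   /* overflow: signed Inf (assumption of this sketch) */
        return (uint16_t)((sign << 15) | 0x7F80);

    /* Step 5: repack; the result is truncated, not rounded. */
    return (uint16_t)((sign << 15) | ((unsigned)exp << 7) | (q & 0x7F));
}
```

For instance, 6.0 / 2.0 (0x40C0 / 0x4000) yields 0x4040 (3.0), while 1.0 / 3.0 yields the truncated pattern 0x3EAA.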
```s
bf16_div:
    li   t6, 0xFF
    li   s4, 0x7F
    li   s5, 127            # BF16_EXP_BIAS
    # extract sign bits
    srli t0, a0, 15
    andi t0, t0, 1          # sign_a
    srli t1, a1, 15
    andi t1, t1, 1          # sign_b
    # extract exponents
    srli t2, a0, 7
    and  t2, t2, t6         # exp_a
    srli t3, a1, 7
    and  t3, t3, t6         # exp_b
    # extract mantissas
    and  t4, a0, s4         # mant_a
    and  t5, a1, s4         # mant_b
    # result sign = sign_a ^ sign_b
    xor  s7, t0, t1
    # add hidden 1-bit
    ori  t4, t4, 0x80       # mant_a |= 0x80
    ori  t5, t5, 0x80       # mant_b |= 0x80
    # result_exp = exp_a - exp_b + bias
    sub  s8, t2, t3
    add  s8, s8, s5         # s8 = exp_a - exp_b + 127
    # mantissa division (approx)
    slli s9, t4, 7          # mant_a << 7
    divu s9, s9, t5         # result_mant = (mant_a << 7) / mant_b
    # normalization (if mantissa >= 0x100)
    li   t0, 0x100
    and  t1, s9, t0
    beqz t1, skip_norm
    srli s9, s9, 1
    addi s8, s8, 1
skip_norm:
    # added guard: an out-of-range exponent would corrupt the packed bits,
    # so underflow returns signed zero and overflow returns signed Inf
    blez s8, div_zero
    li   t0, 0xFF
    bge  s8, t0, div_inf
    # pack result bits
    slli t0, s7, 15         # sign << 15
    slli t1, s8, 7          # exp << 7
    or   t0, t0, t1
    and  s9, s9, s4         # mant & 0x7F
    or   a0, t0, s9
    ret
div_zero:
    slli a0, s7, 15
    ret
div_inf:
    slli a0, s7, 15
    li   t0, 0x7F80
    or   a0, a0, t0
    ret
```

>**Square root :**
1. Extract exponent and mantissa.
2. Adjust exponent:
   - If the exponent is odd → shift the mantissa left and reduce the exponent by 1.
   - Then halve the exponent (since $\sqrt{2^e} = 2^{\frac{e}{2}}$).
3. Perform a binary search between 90 and 256 to find the largest candidate whose square, divided by 128, does not exceed the adjusted mantissa.
4. Normalize and reassemble the BF16 result.

This mimics the behavior of floating-point sqrt hardware using an integer approximation.
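The square-root steps above can be sketched in C. This is a minimal sketch assuming a positive, normal input; the name `bf16_sqrt_sketch` is hypothetical, and the special cases (zero, negative, Inf, NaN, denormals) are intentionally left out.

```c
#include <stdint.h>

/* Hypothetical sketch: BF16 square root for positive normal inputs only. */
static uint16_t bf16_sqrt_sketch(uint16_t a)
{
    int e = (int)((a >> 7) & 0xFF) - 127;  /* unbiased exponent */
    unsigned m = (a & 0x7F) | 0x80;        /* mantissa with implicit 1 */

    /* Make the exponent even so that it can be halved exactly. */
    if (e & 1) {
        m <<= 1;
        e -= 1;
    }
    int new_exp = e / 2 + 127;

    /* Binary search for the largest mid with (mid * mid) / 128 <= m. */
    unsigned low = 90, high = 256, result = 128;
    while (low <= high) {
        unsigned mid = (low + high) >> 1;
        unsigned sq = (mid * mid) / 128;
        if (sq <= m) {
            result = mid;
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }

    /* Renormalize if the search result left the [128, 256) range. */
    if (result >= 256) {
        result >>= 1;
        new_exp++;
    }
    return (uint16_t)(((unsigned)new_exp << 7) | (result & 0x7F));
}
```

As a check, 4.0 (0x4080) maps to 2.0 (0x4000), and 2.0 (0x4000) maps to 0x3FB5, the BF16 pattern closest to 1.414 from below.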
```s
bf16_sqrt:
    li   s5, 127            # BF16_EXP_BIAS
    li   t6, 0xFF
    li   s4, 0x7F
    # exponent and mantissa
    srli t0, a0, 7
    and  t0, t0, t6         # exp
    and  t1, a0, s4         # mant
    # e = exp - bias
    addi t2, t0, -127       # e = exp - 127
    li   t3, 1
    and  t4, t2, t3         # t4 = e & 1
    ori  t5, t1, 0x80       # m = 0x80 | mant
    beqz t4, sqrt_even_exp
    slli t5, t5, 1          # m <<= 1
    addi t2, t2, -1
sqrt_even_exp:
    srai t6, t2, 1
    add  t6, t6, s5         # new_exp = (e>>1)+bias
    # binary search
    li   s0, 90             # low
    li   s1, 256            # high
    li   s2, 128            # result
sqrt_loop:
    bgt  s0, s1, sqrt_done
    add  s3, s0, s1
    srli s3, s3, 1          # mid = (low + high) >> 1
    mul  s4, s3, s3
    srli s4, s4, 7          # sq = (mid*mid)/128
    ble  s4, t5, sqrt_le
    addi s1, s3, -1         # high = mid - 1
    j    sqrt_loop
sqrt_le:
    mv   s2, s3             # result = mid
    addi s0, s3, 1          # low = mid + 1
    j    sqrt_loop
sqrt_done:
    li   t0, 256
    blt  s2, t0, sqrt_check_low
    srli s2, s2, 1
    addi t6, t6, 1          # new_exp++
    j    sqrt_pack
sqrt_check_low:
    li   t1, 128
    bge  s2, t1, sqrt_pack
sqrt_shift_up:
    blt  t6, zero, sqrt_pack
    slli s2, s2, 1
    addi t6, t6, -1
    blt  s2, t1, sqrt_shift_up
sqrt_pack:
    andi s2, s2, 0x7F       # new_mant = result & 0x7F
    slli t6, t6, 7
    or   a0, t6, s2
    ret
```

**test_comparisons()**
>`bf16_eq` : Checks equality. Returns false if either operand is `NaN`. Treats `+0` and `-0` as equal.
`bf16_lt` : Returns true if a < b, considering sign bits and magnitude.
`bf16_gt` : Simply calls bf16_lt(b, a) (reversed operands).
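The comparison rules above can be sketched in C as follows. This is an illustrative sketch rather than the quiz's reference code; the function names mirror the assembly routines, and operands are passed as raw 16-bit patterns.

```c
#include <stdbool.h>
#include <stdint.h>

static bool bf16_isnan(uint16_t b)
{
    return (b & 0x7F80u) == 0x7F80u && (b & 0x007Fu) != 0;
}

static bool bf16_iszero(uint16_t b)
{
    return (b & 0x7FFFu) == 0;
}

/* Equality: NaN never compares equal; +0 and -0 do. */
static bool bf16_eq(uint16_t a, uint16_t b)
{
    if (bf16_isnan(a) || bf16_isnan(b))
        return false;
    if (bf16_iszero(a) && bf16_iszero(b))
        return true;
    return a == b;
}

/* Less-than: compare signs first, then magnitudes as unsigned integers. */
static bool bf16_lt(uint16_t a, uint16_t b)
{
    if (bf16_isnan(a) || bf16_isnan(b))
        return false;
    if (bf16_iszero(a) && bf16_iszero(b))
        return false;
    unsigned sign_a = a >> 15, sign_b = b >> 15;
    if (sign_a != sign_b)
        return sign_a > sign_b;       /* negative < positive */
    return sign_a ? a > b : a < b;    /* for negatives, larger bits mean smaller value */
}

/* Greater-than is less-than with the operands swapped. */
static bool bf16_gt(uint16_t a, uint16_t b)
{
    return bf16_lt(b, a);
}
```

Note that the two zero checks must both survive until they are combined, which in assembly means keeping the first result in a register that the second helper call does not overwrite.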
```s
bf16_eq:
    addi sp, sp, -16
    sw   ra, 8(sp)
    mv   t0, a0             # store a in t0
    jal  ra, bf16_isnan
    bnez a0, eq_false
    # b NaN or not
    mv   a0, a1
    jal  ra, bf16_isnan
    bnez a0, eq_false
    # both zero or not
    mv   a0, t0             # a0 = a
    jal  ra, bf16_iszero
    mv   t5, a0             # t5 = iszero(a); bf16_iszero overwrites t1/t2, so keep it in t5
    mv   a0, a1
    jal  ra, bf16_iszero
    and  t2, t5, a0
    bnez t2, eq_true
    # bit equality
    beq  t0, a1, eq_true
eq_false:
    li   a0, 0
    j    eq_exit
eq_true:
    li   a0, 1
eq_exit:
    lw   ra, 8(sp)
    addi sp, sp, 16
    ret
bf16_lt:
    addi sp, sp, -16
    sw   ra, 12(sp)
    mv   t0, a0
    jal  ra, bf16_isnan
    bnez a0, lt_false
    mv   a0, a1
    jal  ra, bf16_isnan
    bnez a0, lt_false
    # check zero
    mv   a0, t0
    jal  ra, bf16_iszero
    mv   t5, a0             # t5 = iszero(a); bf16_iszero overwrites t1/t2
    mv   a0, a1
    jal  ra, bf16_iszero
    and  t2, t5, a0
    bnez t2, lt_false
    # sign_a = (a >> 15) & 1
    srli t3, t0, 15
    andi t3, t3, 1
    # sign_b = (b >> 15) & 1
    srli t4, a1, 15
    andi t4, t4, 1
    # sign_a != sign_b ?
    bne  t3, t4, sign_diff
    # same sign
    beqz t3, both_pos       # if sign = 0, positive
    j    both_neg
both_pos:
    blt  t0, a1, lt_true    # both positive: compare the bit patterns
    j    lt_false
both_neg:
    bgt  t0, a1, lt_true
    j    lt_false
sign_diff:
    # sign_a > sign_b ? (1 > 0 means neg < pos)
    bgt  t3, t4, lt_true
    j    lt_false
lt_true:
    li   a0, 1
    j    lt_exit
lt_false:
    li   a0, 0
lt_exit:
    lw   ra, 12(sp)
    addi sp, sp, 16
    ret
bf16_gt:
    addi sp, sp, -16
    sw   ra, 4(sp)
    mv   t0, a0             # store a
    mv   a0, a1             # bf16_lt(b, a)
    mv   a1, t0
    jal  ra, bf16_lt
    lw   ra, 4(sp)
    addi sp, sp, 16
    ret
```

**test_edge_cases();**
>`Tiny values`: check that a tiny value underflows to zero properly.
`Overflow`: multiply large values to ensure the result saturates to ±Inf.
`Underflow`: divide a very small number by a large one and expect zero.
This validates BF16 arithmetic correctness near numerical boundaries.

```s
# Test 1: Tiny value handling
    la   a0, msg_testing_edges
    li   a7, 4
    ecall
    li   a0, 0x00000001     # tiny 1e-45f
    jal  ra, f32_to_bf16    # -> a0 = bf_tiny(bits)
    mv   s0, a0
    jal  ra, bf16_to_f32    # -> a0 = tiny_val(bits)
    mv   s1, a0
    # bf16_iszero(bf_tiny)?
    mv   a0, s0
    jal  ra, bf16_iszero
    bnez a0, test1_pass
    # abs(tiny_val)
    li   t3, 0x7FFFFFFF
    and  t4, s1, t3
    # load threshold (1e-37)
    li   t5, 0x02081CEA     # 1e-37f
    # compare abs(tiny_val) < threshold ?
    bltu t4, t5, test1_pass
    # fail
    la   a0, msg_fail_tiny
    li   a7, 4
    ecall
    li   a0, 1
    j    test_edge_finish
test1_pass:
# Test 2: Overflow → Inf
    li   a0, 0x7E967699     # 1e38f
    jal  ra, f32_to_bf16
    mv   s2, a0             # s2 = bf_huge
    li   a0, 0x41200000     # 10.0f
    jal  ra, f32_to_bf16
    mv   s3, a0             # s3 = bf10
    mv   a0, s2
    mv   a1, s3
    jal  ra, bf16_mul
    mv   s2, a0             # s2 = bf_huge2
    jal  ra, bf16_isinf
    beqz a0, fail_huge
    j    test2_pass
fail_huge:
    la   a0, msg_fail_huge
    li   a7, 4
    ecall
    li   a0, 1
    j    test_edge_finish
test2_pass:
# Test 3: Underflow
    li   a0, 0x006CE3EE     # 1e-38f
    jal  ra, f32_to_bf16
    mv   s0, a0             # s0 = bf_small
    li   a0, 0x501502F9     # 1e10f
    jal  ra, f32_to_bf16
    mv   s1, a0             # s1 = bf_1e10
    mv   a0, s0
    mv   a1, s1
    jal  ra, bf16_div
    mv   s2, a0             # s2 = smaller
    mv   a0, s2
    jal  ra, bf16_to_f32
    mv   t4, a0             # t4 = smaller_val f32 bits
    # note: a0 still holds the f32 bits here, whose low 16 bits are zero
    jal  ra, bf16_iszero
    bnez a0, test3_pass
    li   t3, 0x7FFFFFFF
    and  t4, t4, t3         # clear sign
    li   t6, 0x00000001     # 1e-45f f32 bits
    bltu t4, t6, test3_pass
    la   a0, msg_fail_underflow
    li   a7, 4
    ecall
    li   a0, 1
    j    test_edge_finish
test3_pass:
    la   a0, Edge_msg
    li   a7, 4
    ecall
    li   a0, 0
test_edge_finish:
```

**test_rounding();**
>1. Checks that 1.5f is exactly representable.
2. Verifies that converting 1.0001f to BF16 and back yields a small rounding error (< 0.001).

This ensures the conversion maintains acceptable accuracy within the expected precision.
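The two rounding checks can be reproduced with a small C sketch of the conversions. This is an illustration under stated assumptions: the function names mirror the assembly routines, and `memcpy` is used to move between `float` and its bit pattern, which assumes `float` is an IEEE 754 binary32 value.

```c
#include <stdint.h>
#include <string.h>

/* f32 -> bf16 with round-to-nearest, ties-to-even. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if (((bits >> 23) & 0xFF) == 0xFF)      /* Inf/NaN: keep the upper half as is */
        return (uint16_t)(bits >> 16);
    bits += ((bits >> 16) & 1) + 0x7FFF;    /* bias by the lowest kept bit */
    return (uint16_t)(bits >> 16);
}

/* bf16 -> f32: the low 16 mantissa bits become zero. */
static float bf16_to_f32(uint16_t b)
{
    uint32_t bits = (uint32_t)b << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With these helpers, 1.5f converts to 0x3FC0 and back without loss, while 1.0001f rounds to 0x3F80 (1.0), an error of about 0.0001.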
```s
    la   a0, msg_rounding
    li   a7, 4
    ecall
    li   a0, 0x3FC00000     # 1.5f
    jal  ra, f32_to_bf16
    mv   s0, a0             # s0 = bf_exact
    # back_exact = bf16_to_f32(bf_exact)
    jal  ra, bf16_to_f32
    mv   t0, a0             # t0 = back_exact f32 bits
    # check exact representation preserved
    li   t1, 0x3FC00000     # 1.5f bits
    bne  t0, t1, rounding_fail
pass_test_rounding_1:
    li   a0, 0x3F800347     # 1.0001f bits
    jal  ra, f32_to_bf16
    mv   s1, a0             # s1 = bf
    jal  ra, bf16_to_f32
    mv   t2, a0             # t2 = back f32 bits
    # diff2 = back - val (difference of the raw bit patterns)
    li   t3, 0x3F800347     # val bits
    sub  t4, t3, t2         # t4 = diff2 bits
    # take the absolute value
    li   t5, 0x7FFFFFFF
    and  t4, t4, t5
    # check rounding error < 0.001
    li   t6, 0x3A83126F     # 0.001f bits
    bltu t4, t6, rounding_pass
rounding_fail:
    la   a0, msg_fail_rounding
    li   a7, 4
    ecall
    li   a0, 1
    j    test_rounding_end
rounding_pass:
    la   a0, Rounding_msg
    li   a7, 4
    ecall
test_rounding_end:
    j    end
```

>A helper that computes the absolute value of a signed integer and checks whether it is less than 10. It is used as a verification step in the correctness tests (e.g., an absolute error check).

```s
abs:
    bltz s3, abs_neg
    j    abs_positive
abs_neg:
    neg  s3, s3             # s3 = -s3
abs_positive:
    li   t1, 10
    blt  s3, t1, abs_pass
    li   a0, 1              # fail
    ret
abs_pass:
    li   a0, 0              # pass
    ret
```

>These are exception detection helpers that check BF16 bit patterns:
`bf16_isinf`: exponent = 0xFF, mantissa = 0 → `Inf`
`bf16_isnan`: exponent = 0xFF, mantissa ≠ 0 → `NaN`
`bf16_iszero`: exponent = 0, mantissa = 0 → `Zero`
`bf16_is_neg_zero`: Zero with sign bit = 1
They are used throughout the test functions to handle edge and invalid cases correctly.
```s
bf16_isinf:
    mv   t1,a0
    li   t2,0x7F80
    and  t3,t1,t2
    bne  t3,t2, not_inf
    li   t2,0x007F
    and  t4,t1,t2
    bnez t4, not_inf
    li   a0,1
    ret
not_inf:
    li   a0,0
    ret
is_nan_inf:                 # reached from f32_to_bf16 with t1 = f32bits
    srli t4,t1,16           # keep the upper 16 bits of the f32 word
    li   t3,0xFFFF
    and  t4,t4,t3
    mv   a0,t4
    ret
bf16_isnan:
    mv   t1,a0
    li   t2,0x7F80
    and  t3,t1,t2
    bne  t3,t2, not_nan
    li   t2,0x007F
    and  t4,t1,t2
    beqz t4, not_nan
    li   a0,1
    ret
not_nan:
    li   a0,0
    ret
bf16_iszero:
    mv   t1,a0
    li   t2,0x7FFF
    and  t1,t1,t2
    beqz t1, is_zero
    li   a0,0
    ret
is_zero:
    li   a0,1
    ret
bf16_is_neg_zero:
    li   t1,0x8000
    beq  a0,t1,is_negzero
    li   a0,0
    ret
is_negzero:
    li   a0,1
    ret
```

`print_hex` displays a value in human-readable hexadecimal form without using hardware floating-point instructions or registers.

```s
print_hex:
    mv   t0, a0             # val
    la   a0,hex
    li   a7,4
    ecall
    mv   t1, a1             # digits
    la   t2, hexchars       # hex
print_hex_loop:
    beqz t1, print_hex_done
    addi t1, t1, -1
    slli t3, t1, 2
    srl  t4, t0, t3         # val >> (4*pos)
    andi t4, t4, 0xF
    add  t4, t2, t4
    lbu  a0, 0(t4)
    li   a7, 11             # print_char
    ecall
    j    print_hex_loop
print_hex_done:
    li   a0, 10             # newline
    li   a7, 11
    ecall
    ret
fail:
    la   a0, fail_msg
    li   a7, 4
    ecall
    li   a0, 1              # return 1 if fail
    ret
end:
    la   a0, all_test_pass_msg
    li   a7, 4
    ecall
    li   a7,10
    ecall
```

:::info
Results on console
:::
![image](https://hackmd.io/_uploads/rJN9oROaxx.png)

:::info
The clock cycles of program:
:::
![image](https://hackmd.io/_uploads/B1yhjROael.png)

:::info
Memory viewer:
:::
![image](https://hackmd.io/_uploads/H15JhCO6gx.png)

## Leetcode
## [leetcode_70_climbing_stairs](https://leetcode.com/problems/climbing-stairs/description/)

You are climbing a staircase. It takes n steps to reach the top.
Each time you can either climb 1 or 2 steps. In how many distinct ways can you climb to the top?

<h3>Implementation</h3>

Formally, it solves the recurrence $f(n)=f(n-1)+f(n-2)$, which is the same recurrence as the Fibonacci sequence.

>Example A:
Input: n = 2
Output: 2
Explanation: There are two ways to climb to the top.
1. 1 step + 1 step
2. 
2 steps

>Example B:
Input: n = 3
Output: 3
Explanation: There are three ways to climb to the top.
1. 1 step + 1 step + 1 step
2. 1 step + 2 steps
3. 2 steps + 1 step

<h4> C code </h4>

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define VECTOR_MIN_SIZE 16

typedef struct {
    void **data;
    size_t size;      /* Allocated size */
    size_t count;     /* Number of elements */
    size_t free_slot; /* Index of a known hole */
} vector_t;

/* -------- Vector basics -------- */
void vector_init(vector_t *v)
{
    v->data = NULL;
    v->size = 0;
    v->count = 0;
    v->free_slot = 0;
}

int32_t vector_push(vector_t *v, void *ptr)
{
    if (!v->size) {
        v->size = VECTOR_MIN_SIZE;
        v->data = calloc(v->size, sizeof(void *));
    }
    if (v->free_slot && v->free_slot < v->count) {
        size_t idx = v->free_slot;
        v->data[idx] = ptr;
        v->free_slot = 0;
        return idx;
    }
    if (v->count == v->size) {
        v->size *= 2;
        v->data = realloc(v->data, v->size * sizeof(void *));
        memset(v->data + v->count, 0, (v->size - v->count) * sizeof(void *));
    }
    v->data[v->count] = ptr;
    return v->count++;
}

void *vector_pop(vector_t *v)
{
    if (!v->count)
        return NULL;
    void *last = v->data[--v->count];
    v->data[v->count] = NULL;
    return last;
}

void vector_free(vector_t *v)
{
    if (!v->data)
        return;
    free(v->data);
    v->data = NULL;
    v->size = 0;
    v->count = 0;
    v->free_slot = 0;
}

/* -------- Recursively enumerate the combinations -------- */
void generate_combinations(int n, vector_t *current)
{
    if (n == 0) {
        printf("[");
        for (size_t i = 0; i < current->count; i++) {
            /* cast to long so the value matches the %ld format */
            printf("%ld", (long) (intptr_t) current->data[i]);
            if (i < current->count - 1)
                printf(",");
        }
        printf("]\n");
        return;
    }
    if (n >= 1) {
        vector_push(current, (void *)(intptr_t)1);
        generate_combinations(n - 1, current);
        vector_pop(current);
    }
    if (n >= 2) {
        vector_push(current, (void *)(intptr_t)2);
        generate_combinations(n - 2, current);
        vector_pop(current);
    }
}

void stairs(int n)
{
    if (n <= 0) {
        printf("No stairs to climb.\n");
        return;
    }
    printf("Climbing %d stairs:\n", n);
    vector_t path;
    vector_init(&path);
    generate_combinations(n, &path);
    vector_free(&path);
}

/* -------- Tests -------- */
void test_case1() { stairs(3); }
void test_case2() { stairs(4); }
void test_case3() { stairs(5); }

int main()
{
    test_case1();
    test_case2();
    test_case3();
    printf("All tests passed !\n");
    return 0;
}
```

<h3> Assembly code : version_1 </h3>

My test cases are: A : 3, B : 4, C : 5

>Preparing the output strings

```s
.data
msg: .string "How many stairs?\n"
msg1: .string "Step combinations:\n"
msg2: .string ": { "
msg3: .string "}\n"
msg4: .string " "
msg5: .string "\n"
testcase_a: .string "n = 3"
testcase_b: .string "n = 4"
testcase_c: .string "n = 5"
steps: .word 256        # base address of the step buffer (last item in .data)
.text
```

>main function

```s
main:
    la   a0,msg
    li   a7,4
    ecall
##------------------A---------------------------
    la   a0,testcase_a
    li   a7,4
    ecall
    la   a0,msg5
    li   a7,4
    ecall
    li   a2, 3
    la   a0,msg1
    li   a7,4
    ecall
    jal  ra,start_stairs
##------------------B---------------------------
    la   a0,testcase_b
    li   a7,4
    ecall
    la   a0,msg5
    li   a7,4
    ecall
    li   a2, 4
    la   a0,msg1
    li   a7,4
    ecall
    jal  ra,start_stairs
##------------------C---------------------------
    la   a0,testcase_c
    li   a7,4
    ecall
    la   a0,msg5
    li   a7,4
    ecall
    li   a2, 5
    la   a0,msg1
    li   a7,4
    ecall
    jal  ra,start_stairs
end:
    li   a7,10
    ecall
```

**start_stairs :**
>Prepares the environment (stack, registers, and data pointers) for the recursive computation of combinations.

```s
start_stairs:
    mv   s2,x0          ### s2 is the countCombination (how many ways to do the combination)
    mv   a3,x0          ### a3 is the StepSize
    la   a1, steps
    beqz a2,end
    addi sp, sp, -4
    sw   ra, 0(sp)
    jal  ra,printSteps
    lw   ra, 0(sp)
    addi sp, sp, 4
    ret
```

**printSteps:**
>Implements the recursion. If a2 == 0, it prints one full combination (printSteps_0). Otherwise, it tries taking 1 step or 2 steps recursively.
Concept: this is the core recursive routine for generating all combinations of steps that sum to n. It explores both possibilities, "take 1 step" and "take 2 steps", until the total matches the target.
```s
printSteps:
    bnez a2,L1
    addi sp,sp,-12
    sw   ra,0(sp)
    sw   a2,4(sp)
    sw   a3,8(sp)
    jal  ra,printSteps_0
    lw   ra,0(sp)
    lw   a2,4(sp)
    lw   a3,8(sp)
    addi sp,sp,12
L1:
    li   t0,1
    bltu a2,t0,R
    addi sp,sp,-12
    sw   ra,0(sp)
    sw   a2,4(sp)
    sw   a3,8(sp)
    jal  ra,printSteps_1
    lw   ra,0(sp)
    lw   a2,4(sp)
    lw   a3,8(sp)
    addi sp,sp,12
L2:
    li   t0,2
    bltu a2,t0,R
    addi sp,sp,-12
    sw   ra,0(sp)
    sw   a2,4(sp)
    sw   a3,8(sp)
    jal  ra,printSteps_2
    lw   ra,0(sp)
    lw   a2,4(sp)
    lw   a3,8(sp)
    addi sp,sp,12
R:
    ret
```

>Acts as the base case of the recursion. When the sum of steps equals n, this routine prints the sequence { 1 2 ... }

```s
printSteps_0:
    addi s2,s2,1
    mv   a0,s2
    li   a7,1
    ecall
    la   a0,msg2
    li   a7,4
    ecall
    addi sp,sp,-4
    sw   ra,0(sp)
    li   t0,0
loop_0:
    slli t2,t0,2
    add  t1,a1,t2       ### step[index]
    lw   t3,0(t1)
    mv   a0,t3
    li   a7,1
    ecall
    la   a0,msg4
    li   a7,4
    ecall
    addi t0,t0,1
    bltu t0,a3,loop_0
    la   a0,msg3
    li   a7,4
    ecall
    lw   ra,0(sp)
    addi sp,sp,4
    ret
```

**printSteps_1 & printSteps_2 :**
>Appends a 1-step or a 2-step to the current combination, then calls printSteps again. These represent the recursive branching: the algorithm explores both paths by appending different step sizes.

```s
printSteps_1:
    slli t2,a3,2
    add  t1,a1,t2
    li   t6,1
    sw   t6,0(t1)       # steps[StepSize] = 1
    addi a2,a2,-1
    addi a3,a3,1
    addi sp,sp,-12
    sw   ra,0(sp)
    sw   a2,4(sp)
    sw   a3,8(sp)
    jal  ra,printSteps
    lw   ra,0(sp)
    lw   a2,4(sp)
    lw   a3,8(sp)
    addi sp,sp,12
    ret
printSteps_2:
    slli t2,a3,2
    add  t1,a1,t2
    li   t6,2
    sw   t6,0(t1)       # steps[StepSize] = 2
    addi a2,a2,-2
    addi a3,a3,1
    addi sp,sp,-12
    sw   ra,0(sp)
    sw   a2,4(sp)
    sw   a3,8(sp)
    jal  ra,printSteps
    lw   ra,0(sp)
    lw   a2,4(sp)
    lw   a3,8(sp)
    addi sp,sp,12
    ret
```

<h3> Assembly code : version_2 </h3>

In version_2 the helper calls are inlined: the per-test-case setup from `start_stairs` is placed directly in `main`, and the body of `printSteps_0` is placed directly inside `printSteps`. The only difference between version_1 and version_2 is where this code is located, yet it saves a noticeable number of cycles.
>At first I wanted to use dynamic programming for this program, but because every step combination has to be printed, that became complicated, so I chose inlining to optimize the program instead.
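For comparison, counting the combinations without printing them is the part where dynamic programming is straightforward. The following is a minimal C sketch, assuming a hypothetical helper named `count_ways`; it only evaluates the recurrence $f(n)=f(n-1)+f(n-2)$ bottom-up and does not enumerate the step sequences that the assembly program prints.

```c
#include <stdint.h>

/* Hypothetical helper: number of ways to climb n stairs with steps of 1 or 2.
 * Only the two previous values are kept, so it needs O(1) extra space. */
static uint64_t count_ways(unsigned n)
{
    uint64_t prev = 1; /* f(0): one way to stay at the bottom */
    uint64_t curr = 1; /* f(1) */
    for (unsigned i = 2; i <= n; i++) {
        uint64_t next = prev + curr;
        prev = curr;
        curr = next;
    }
    return curr;
}
```

For the three test cases n = 3, 4, 5 this gives 3, 5, and 8, which should equal the number of combinations each run prints.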
```s
.data
msg: .string "How many stairs?\n"
msg1: .string "Step combinations:\n"
msg2: .string ": { "
msg3: .string "}\n"
msg4: .string " "
msg5: .string "\n"
testcase_a: .string "n = 3"
testcase_b: .string "n = 4"
testcase_c: .string "n = 5"
steps: .word 256
.text
main:
    la   a0,msg
    li   a7,4
    ecall
##------------------A---------------------------
    la   a0,testcase_a
    li   a7,4
    ecall
    la   a0,msg5
    li   a7,4
    ecall
    li   a2, 3
###
    la   a0,msg1
    li   a7,4
    ecall
    mv   s2,x0          ### s2 is the countCombination (how many ways to do the combination)
    mv   a3,x0          ### a3 is the StepSize
    la   a1, steps
    beqz a2,end
    jal  ra,printSteps
##------------------B---------------------------
    la   a0,testcase_b
    li   a7,4
    ecall
    la   a0,msg5
    li   a7,4
    ecall
    li   a2, 4
###
    la   a0,msg1
    li   a7,4
    ecall
    mv   s2,x0          ### s2 is the countCombination (how many ways to do the combination)
    mv   a3,x0          ### a3 is the StepSize
    la   a1, steps
    beqz a2,end
    jal  ra,printSteps
##------------------C---------------------------
    la   a0,testcase_c
    li   a7,4
    ecall
    la   a0,msg5
    li   a7,4
    ecall
    li   a2, 5
    la   a0,msg1
    li   a7,4
    ecall
    mv   s2,x0          ### s2 is the countCombination (how many ways to do the combination)
    mv   a3,x0          ### a3 is the StepSize
    la   a1, steps
    beqz a2,end
    jal  ra,printSteps
end:
    li   a7,10
    ecall
printSteps:
    bnez a2,L1
printSteps_0:
    addi s2,s2,1
    mv   a0,s2
    li   a7,1
    ecall
    la   a0,msg2
    li   a7,4
    ecall
    addi sp,sp,-4
    sw   ra,0(sp)
    li   t0,0
loop_0:
    slli t2,t0,2
    add  t1,a1,t2       ### step[index]
    lw   t3,0(t1)
    mv   a0,t3
    li   a7,1
    ecall
    la   a0,msg4
    li   a7,4
    ecall
    addi t0,t0,1
    bltu t0,a3,loop_0
    la   a0,msg3
    li   a7,4
    ecall
    lw   ra,0(sp)
    addi sp,sp,4
L1:
    li   t0,1
    bltu a2,t0,R
    addi sp,sp,-12
    sw   ra,0(sp)
    sw   a2,4(sp)
    sw   a3,8(sp)
    jal  ra,printSteps_1
    lw   ra,0(sp)
    lw   a2,4(sp)
    lw   a3,8(sp)
    addi sp,sp,12
L2:
    li   t0,2
    bltu a2,t0,R
    addi sp,sp,-12
    sw   ra,0(sp)
    sw   a2,4(sp)
    sw   a3,8(sp)
    jal  ra,printSteps_2
    lw   ra,0(sp)
    lw   a2,4(sp)
    lw   a3,8(sp)
    addi sp,sp,12
R:
    ret
printSteps_1:
    slli t2,a3,2
    add  t1,a1,t2
    li   t6,1
    sw   t6,0(t1)
    addi a2,a2,-1
    addi a3,a3,1
    addi sp,sp,-12
    sw   ra,0(sp)
    sw   a2,4(sp)
    sw   a3,8(sp)
    jal  ra,printSteps
    lw   ra,0(sp)
    lw   a2,4(sp)
    lw a3,8(sp)
    addi sp,sp,12
    ret

printSteps_2:
    slli t2,a3,2
    add t1,a1,t2
    li t6,2
    sw t6,0(t1)
    addi a2,a2,-2
    addi a3,a3,1
    addi sp,sp,-12
    sw ra,0(sp)
    sw a2,4(sp)
    sw a3,8(sp)
    jal ra,printSteps
    lw ra,0(sp)
    lw a2,4(sp)
    lw a3,8(sp)
    addi sp,sp,12
    ret
```
:::info
Result in console:
:::
After all these stages are done, the console shows the following:
![image](https://hackmd.io/_uploads/B1W_DXWaxe.png =250x)
:::info
Structure
:::
Take n = 3 as an example:
printSteps(3)
├→ walk `one` stair → printSteps(2)
│ ├→ walk `one` stair → printSteps(1)
│ │ ├→ walk `one` stair → printSteps(0) → `print` {1,1,1}
│ │ └→ walk `two` stairs (not enough)
│ └→ walk `two` stairs → printSteps(0) → `print` {1,2}
└→ walk `two` stairs → printSteps(1)
    ├→ walk `one` stair → printSteps(0) → `print` {2,1}
:::info
After all these stages are done, the registers are updated like this:
:::
![image](https://hackmd.io/_uploads/rkoO8Qbagl.png =250x)![image](https://hackmd.io/_uploads/By3FI7bTge.png =280x)
:::info
Memory viewer:
:::
The screenshot below shows the data section of memory.
![image](https://hackmd.io/_uploads/H1LNwQWTgg.png)
:::info
The clock cycles of program version_1:
:::
Because my implementation is recursive, it takes far more cycles than a non-recursive solution would.
![image](https://hackmd.io/_uploads/Syy0mgMTeg.png)
:::info
The clock cycles of program version_2:
:::
This version of the program takes `254` fewer cycles.
![image](https://hackmd.io/_uploads/rkefwrqaex.png)

## 5-stage pipelined processor
The 5-stage pipelined processor I use in Ripes comes with hazard detection and forwarding. Its block diagram looks like this:
![image](https://hackmd.io/_uploads/BkeUG8Kpgl.png)
The five stages are:
>Instruction fetch (IF)
The processor fetches the instruction from memory and updates the program counter (PC) to point to the next instruction.
>Instruction decode and register fetch (ID)
The instruction is decoded to determine its type, and the required operands are read from the register file.
>Execute (EX)
The ALU performs the necessary operation (such as arithmetic, logic, or address calculation).
>Memory access (MEM)
If the instruction reads from or writes to memory (e.g., `lw` or `sw`), the memory is accessed here.
>Register write back (WB)
The result from the ALU or memory is written back to the destination register, completing the instruction’s execution.

## Reference
- https://hackmd.io/@sysprog/arch2025-quiz1-sol
- https://hackmd.io/@sysprog/SkkbXLJRR
- https://leetcode.com/problems/climbing-stairs/description/
- https://github.com/sysprog21/ca2025-quizzes

## Full version assembly code of problem B
```s
.data
msg: .string "start decode\n"
msg1: .string "start encode\n"
msg2: .string "\n"
msg3: .string "decode->"
msg4: .string "\nencode->"
msg5: .string "fail encoding"
msg6: .string "0x"
testcase_a: .string "test_case_a:n = 0x0f" ###15
testcase_b: .string "test_case_b:n = 0x7d" ###125
testcase_c: .string "test_case_c:n = 0xe1" ###225
pass: .string "All testcases passed"

.text
main:
    la a0,msg
    li a7,4
    ecall
##------------------A---------------------------
    la a0,testcase_a
    li a7,4
    ecall
    la a0,msg2
    li a7,4
    ecall
    li a1, 0x0f ###n=0x0f
    jal ra,start_decode_encode
##------------------B---------------------------
    la a0,msg2
    li a7,4
    ecall
    la a0,testcase_b
    li a7,4
    ecall
    la a0,msg2
    li a7,4
    ecall
    li a1, 0x7d ###n=0x7d
    jal ra,start_decode_encode
##------------------C---------------------------
    la a0,msg2
    li a7,4
    ecall
    la a0,testcase_c
    li a7,4
    ecall
    la a0,msg2
    li a7,4
    ecall
    li a1, 0x0e1 ###n=0xe1
    jal ra,start_decode_encode
    j end
end:
    li a7,10
    ecall

start_decode_encode:
    addi sp,sp,-4
    sw ra, 0(sp)
    jal ra,Decode
    lw ra, 0(sp)
    addi sp,sp,4
    mv t0,a0 ### store the return value
    la a0,msg3
    li a7,4
    ecall
    mv a0,t0
    li a7,1
    ecall
    addi sp,sp,-4
    sw ra, 0(sp)
    jal ra,Encode
    lw ra, 0(sp)
    addi sp,sp,4
    mv t0,a0 ### store the return value
    li t2,-1 ### previous_num
    bne t0,a1,error
    blt t0,t2,error
    mv t2,t0 ### previous_value = value
    la a0,msg4
    li a7,4
    ecall
    mv a0,t2
    addi sp,sp,-4
    sw ra, 0(sp)
    jal ra,print_hex
    lw ra, 0(sp)
    addi sp,sp,4
    ret
error:
    la a0,msg5
    li a7,4
    ecall
    ret ### return to the caller instead of falling through into print_hex

print_hex:
    addi sp, sp, -16
    sw ra, 8(sp) # save ra
    srli t0, a0, 4 # high nibble
    andi t1, a0, 0xF # low nibble
    la a0, msg6
    li a7, 4
    ecall
    mv a0, t0
    jal ra, print_nibble
    mv a0, t1
    jal ra, print_nibble
    lw ra,8(sp)
    addi sp,sp,16
    ret
print_nibble:
    li t2,10
    bltu a0, t2, digit
    addi a0, a0, 87 # 10..15 -> 'a'..'f'
    j out
digit:
    addi a0, a0, 48 # 0..9 -> '0'..'9'
out:
    li a7, 11 # print_char
    ecall
    ret

Decode:
    andi t0,a1,0x0f ### store mantissa in t0
    srli t1,a1,4 ### store expo in t1
    not t2,t1
    addi t2,t2,1 ### two's complement of expo
    addi t2,t2,15 ### t2 = 15 - expo
    li t3,0x7fff
    srl t3,t3,t2
    slli t3,t3,4 ### store offset in t3
    sll t4,t0,t1
    add a0,t4,t3 ### store trans_num in a0
    xor t0,t0,t0
    xor t3,t3,t3
    ret

Encode:
    addi sp, sp, -16
    sw ra, 12(sp) # save ra
    add s4,a0,x0 ### s4=value
    li t0,16
    bltu a0,t0,no_need_more_op ### a0=value
    jal ra,clz
    mv t1,a0 ### lz = clz(value)
    li t2,31
    sub s1,t2,t1 ### store msb in s1
    li s2,0 ### store exponent in s2
    li s3,0 ### store overflow in s3
    li t6,5
    bltu s1,t6,find_exact_exponent
    addi s2,s1,-4
    li t6,15
    bltu s2,t6, Calculate_overflow ### keep the estimate if it is below 15
    li s2,15 ### otherwise clamp the exponent to 15
Calculate_overflow: ###for loop
    li s8,0 ###counting
loop1:
    beq s8,s2,Adjust_if_estimate
    slli s3,s3,1
    addi s3,s3,16
    addi s8,s8,1
    j loop1
Adjust_if_estimate:
    bgtz s2,check1
    j find_exact_exponent
check1:
    bltu s4,s3,check2
    j find_exact_exponent
check2:
    addi s3,s3,-16
    srli s3,s3,1
    addi s2,s2,-1
    j Adjust_if_estimate
find_exact_exponent:
    li s7,20
    bge s2, s7, end_encode
    li t6,15
    bgtu s2,t6,end_encode
    slli s5,s3,1 ### 7f0 should be in s3 overflow
    addi s5,s5,16 ### next_overflow store in s5
    bltu s4,s5,end_encode
    mv s3,s5
    addi s2,s2,1
    j find_exact_exponent
end_encode:
    sub s6,s4,s3
    srl s6,s6,s2 ### store other mantissa in s6
    slli t6,s2,4
    add a0,s6,t6
    lw ra, 12(sp)
    addi sp, sp, 16
    ret

clz:
    li t0,32 ### n=32
    li t1,16 ### c=16
    mv t3,a0
loop0:
    beqz t1,return0
    srl t2,t3,t1 ### y = x >> c
    beqz t2,devide_c
    sub t0,t0,t1
    mv t3,t2
    j loop0
devide_c:
    srli t1,t1,1
    j loop0
return0:
    sub t0,t0,t3
    mv a0,t0
    ret
no_need_more_op:
    lw ra, 12(sp) # restore ra stored at start of Encode
    addi sp, sp, 16 # restore stack
    ret
```

## Full version assembly code of problem C
```s
.data
Start_msg: .string "Testing basic conversions...\n"
Basic_msg: .string "Basic conversions: PASS\n"
Special_msg: .string "Special values: PASS\n"
Arithmetic_msg: .string "Arithmetic: PASS\n"
Comparisons_msg: .string "Comparisons: PASS\n"
Edge_msg: .string "Edge cases: PASS\n"
Rounding_msg: .string "Rounding: PASS\n"
all_test_pass_msg: .string "=== ALL TESTS PASSED ==="
#----------------------for_special------------------
msg_start_special: .string "Testing special values...\n"
msg_fail_posinf: .string "Positive infinity not detected\n"
msg_fail_inf_nan: .string "Infinity detected as NaN\n"
msg_fail_neginf: .string "Negative infinity not detected\n"
msg_fail_nan: .string "NaN not detected\n"
msg_fail_nan_inf: .string "NaN detected as infinity\n"
msg_fail_zero: .string "Zero not detected\n"
msg_fail_negzero: .string "Negative zero not detected\n"
#---------------------for_comparison---------------
msg_start_comparison: .string "Testing comparison...\n"
msg_fail_eq: .string "Equality test failed\n"
msg_fail_ineq: .string "Inequality test failed\n"
msg_fail_lt: .string "Less than test failed\n"
msg_fail_nlt: .string "Not less than test failed\n"
msg_fail_enlt: .string "Equal not less than test failed\n"
msg_fail_gt: .string "Greater than test failed\n"
msg_fail_ngt: .string "Not greater than test failed\n"
msg_fail_naneq: .string "NaN equality test failed\n"
msg_fail_nanlt: .string "Nan less than test failed\n"
msg_fail_nangt: .string "Nan greater than test failed\n"
#---------------------for_arithmetic---------------
msg_start_arithmetic: .string "Testing arithmetic operations...\n"
msg_result: .string "Result = "
msg_add_fail: .string "Addition failed\n"
msg_add_case1: .string "Addition :1.0f + 2.0f \n"
msg_pass_add: .string "PASSED\n"
msg_sub_fail: .string "Subtraction failed\n"
msg_sub_case1: .string "Subtraction :2.0f - 1.0f \n"
msg_mul_fail: .string "Multiplication failed\n"
msg_mul_case1: .string "Multiplication :3.0f x 4.0f \n"
msg_div_fail: .string "Division failed\n"
msg_div_case1: .string "Division :10.0f / 2.0f \n"
msg_sqrt_case1: .string "Sqrt :4.0f \n"
msg_sqrt_case2: .string "Sqrt :9.0f \n"
msg_sqrt4_fail: .string "sqrt(4) failed\n"
msg_sqrt9_fail: .string "sqrt(9) failed\n"
#---------------------for_edge_cases---------------
msg_testing_edges: .string "Testing edge cases...\n"
msg_fail_tiny: .string "Tiny value handling\n"
msg_fail_huge: .string "Overflow should produce infinity\n"
msg_fail_underflow: .string "Underflow should produce zero or denormal\n"
#---------------------for_rounding_cases---------------
msg_rounding: .string "Testing rounding behavior...\n"
msg_pass_rounding: .string " Rounding: PASS\n"
msg_fail_rounding: .string " Rounding: FAIL\n"
#-------------------
newline: .string "\n"
fail_msg: .string "fail"
fail_for_pos_inf_msg: .string "Positive infinity not detected"
msg1: .asciz "f32 -> bf16 = "
msg2: .asciz "bf16 -> f32 = "
hexchars: .asciz "0123456789ABCDEF\n"
hex: .string "0x"
testcase_a: .string "n = 0.0f\n"
testcase_b: .string "n = 1.0f\n"
testcase_c: .string "n = -1.0f\n"

.text
test_values:
    .word 0x00000000 # 0.0f
    .word 0x3f800000 # 1.0f
    .word 0xbf800000 # -1.0f
main:
    la t0, test_values
    la a0,Start_msg
    li a7,4
    ecall
    la a0,testcase_a
    li a7,4
    ecall
    lw s1,0(t0) # 0.0f (t0 holds the address of test_values)
    lw s2,4(t0) # 1.0f
    lw s3,8(t0) # -1.0f
    la a0,msg1
    li a7,4
    ecall
    mv a0,s1 ##test_case
###----------------------------test_f32<->bf16-------------------------------------
    jal ra,f32_to_bf16
    add s4,a0,x0 # store bf16 result in s4
    li a1, 4 # 16-bit = 4 hex digits
    jal ra, print_hex
    la a0,msg2
    li a7,4
    ecall
    mv a0,s4
    jal ra,bf16_to_f32
    add s5,a0,x0
    li a1, 8 # 32-bit = 8 hex digits
    jal ra, print_hex
    la a0,testcase_b
    li a7,4
    ecall
    la a0,msg1
    li a7,4
    ecall
    mv a0,s2 ##test_case
    jal ra,f32_to_bf16
    add s4,a0,x0 # store bf16 result in s4
    li a1, 4 # 16-bit = 4 hex digits
    jal ra, print_hex
    la a0,msg2
    li a7,4
    ecall
    mv a0,s4
    jal ra,bf16_to_f32
    add s5,a0,x0
    li a1, 8 # 32-bit = 8 hex digits
    jal ra, print_hex
    la a0,testcase_c
    li a7,4
    ecall
    la a0,msg1
    li a7,4
    ecall
    mv a0,s3 ##test_case
    jal ra,f32_to_bf16
    add s4,a0,x0 # store bf16 result in s4
    li a1, 4 # 16-bit = 4 hex digits
    jal ra, print_hex
    la a0,msg2
    li a7,4
    ecall
    mv a0,s4
    jal ra,bf16_to_f32
    add s5,a0,x0
    li a1, 8 # 32-bit = 8 hex digits
    jal ra, print_hex
###----------------------------test_basic_conversion-------------------------------------
    li t3,0x8000
    slli t4,s4,16
    and t4,t4,t3
    and t5,s5,t3
    bne t4,t5,fail
    beq s4,x0,pass_basic_msg
    srli t3,s5,16
    beq s4,t3,pass_basic_msg ##because easy test_cases
back_from_basic_msg:
###----------------------------test_special_value-------------------------------------
test_special_values:
    la a0, msg_start_special
    li a7, 4
    ecall
    li a0, 0x7F80 # Test +Inf
    jal ra, bf16_isinf
    beqz a0, fail_posinf
    li a0, 0x7F80
    jal ra, bf16_isnan
    bnez a0, fail_inf_nan
    li a0, 0xFF80 # Test -Inf
    jal ra, bf16_isinf
    beqz a0, fail_neginf
    li a0, 0x7FC0 # Test NaN
    jal ra, bf16_isnan
    beqz a0, fail_nan
    li a0, 0x7FC0
    jal ra, bf16_isinf
    bnez a0, fail_nan_inf
    li a0, 0x0000 # Test +0
    jal ra, bf16_iszero
    beqz a0, fail_zero
    li a0, 0x8000 # Test -0
    jal ra, bf16_is_neg_zero
    beqz a0, fail_negzero
    la a0, Special_msg
    li a7, 4
    ecall
###----------------------------test_comparison-------------------------------------
    li s1,0x3F80
    li s2,0x4000
    li s3,0x3F80
    la a0, msg_start_comparison
    li a7, 4
    ecall
# equality test
    mv a0,s1
    mv a1,s3
    jal ra,bf16_eq
    beqz a0, fail_eq
    mv a0,s1
    mv a1,s2
    jal ra,bf16_eq
    bnez a0, fail_ineq
# less-than test
    mv a0,s1
    mv a1,s2
    jal ra,bf16_lt
    beqz a0, fail_lt
    mv a0,s2
    mv a1,s1
    jal ra,bf16_lt
    bnez a0, fail_nlt
    mv a0,s1
    mv a1,s3
    jal ra,bf16_lt
    bnez a0, fail_enlt
# greater-than test
    mv a0,s2
    mv a1,s1
    jal ra,bf16_gt
    beqz a0, fail_gt
    mv a0,s1
    mv a1,s2
    jal ra,bf16_gt
    bnez a0, fail_ngt
    li t1,0x7FC0 # NaN (bf16 bits)
    mv a0,t1
    mv a1,t1
    jal ra,bf16_eq
    bnez a0, fail_naneq
    mv a0,t1
    mv a1,s1
    jal ra,bf16_lt
    bnez a0, fail_nanlt
    mv a0,t1
    mv a1,s1
    jal ra,bf16_gt
    bnez a0, fail_nangt
    la a0, Comparisons_msg
    li a7, 4
    ecall
#------------------------test_arithmetic--------------------------
    la a0, msg_start_arithmetic
    li a7, 4
    ecall
    li s1,0x3f80 # a = 1.0
    li s2,0x4000 # b = 2.0
# test_add
    la a0, msg_add_case1 # 1 + 2
    li a7, 4
    ecall
    mv a0,s1
    mv a1,s2
    jal ra,bf16_add
    jal ra,bf16_to_f32 # return a0 = result 40400000
    mv t2,a0
    li t1,0x40400000 # 3
    sub s3,t2,t1 # store diff in s3
    bltz s3, abs
    bnez s3,fail_add
    mv a0,t2
    jal ra, print_result
# test_sub
    la a0, msg_sub_case1 # 2 - 1
    li a7, 4
    ecall
    mv a0,s2
    mv a1,s1
    jal ra,bf16_sub
    jal ra,bf16_to_f32 # return a0 = result
    mv t2,a0
    li t1,0x3f800000 # 1
    sub s3,t2,t1 # store diff in s3
    bltz s3, abs
    bnez s3,fail_sub
    mv a0,t2
    jal ra, print_result
# test_mul
    la a0, msg_mul_case1 # 3 * 4
    li a7, 4
    ecall
    li s1,0x4040 # 3
    li s2,0x4080 # 4
    mv a0,s1
    mv a1,s2
    jal ra,bf16_mul
    jal ra,bf16_to_f32 # return a0 = result
    mv t2,a0
    li t1,0x41400000 # 12
    sub s3,a0,t1 # store diff in s3
    bltz s3, abs
    bnez s3,fail_mul
    mv a0,t2
    jal ra, print_result
# test_div
    la a0, msg_div_case1 # 10 / 2
    li a7, 4
    ecall
    li s1,0x4120 # 10
    li s2,0x4000 # 2
    mv a0,s1
    mv a1,s2
    jal ra,bf16_div
    jal ra,bf16_to_f32 # return a0 = result
    mv t2,a0
    li t1,0x40a00000 # 5
    sub s3,t2,t1 # store diff in s3
    bltz s3, abs
    bnez s3,fail_div
    mv a0,t2
    jal ra, print_result
# test_sqrt4
    la a0, msg_sqrt_case1 # 4
    li a7, 4
    ecall
    li s1,0x4080
    mv a0,s1
    jal ra,bf16_sqrt
    jal ra,bf16_to_f32 # return a0 = result
    mv t2,a0
    li t1,0x40000000 # 2
    sub s3,a0,t1 # store diff in s3
    bltz s3, abs
    bnez s3,fail_sqrt4
    mv a0,t2
    jal ra, print_result
# test_sqrt9
    la a0, msg_sqrt_case2 # 9
    li a7, 4
    ecall
    li s1,0x4110
    mv a0,s1
    jal ra,bf16_sqrt
    jal ra,bf16_to_f32 # return a0 = result
    mv t2,a0
    li t1,0x40400000 # 3
    sub s3,a0,t1 # store diff in s3
    bltz s3, abs
    bnez s3,fail_sqrt9
    mv a0,t2
    jal ra, print_result
    la a0, Arithmetic_msg
    li a7, 4
    ecall
#------------------------test_edge_cases--------------------------
# Test 1: Tiny value handling
    la a0, msg_testing_edges
    li a7, 4
    ecall
    li a0, 0x00000001 # tiny 1e-45f
    jal ra, f32_to_bf16 # -> a0 = bf_tiny(bits)
    mv s0, a0
    jal ra, bf16_to_f32 # -> a0 = tiny_val(bits)
    mv s1, a0
# bf16_iszero(bf_tiny)?
    mv a0, s0
    jal ra, bf16_iszero
    bnez a0, test1_pass
# abs(tiny_val)
    li t3, 0x7FFFFFFF
    and t4, s1, t3
# load threshold (1e-37)
    li t5, 0x0C2CF59E # 1e-37f
# compare abs(tiny_val) < threshold ?
    bltu t4, t5, test1_pass
# fail
    la a0, msg_fail_tiny
    li a7, 4
    ecall
    li a0, 1
    j test_edge_finish
test1_pass:
# Test 2: Overflow -> Inf
    li a0, 0x7E967699 # 1e38f
    jal ra, f32_to_bf16
    mv s2, a0 # s2 = bf_huge
    li a0, 0x41200000 # 10.0f
    jal ra, f32_to_bf16
    mv s3, a0 # s3 = bf10
    mv a0, s2
    mv a1, s3
    jal ra, bf16_mul
    mv s2, a0 # s2 = bf_huge2
    jal ra, bf16_isinf
    beqz a0, fail_huge
    j test2_pass
fail_huge:
    la a0, msg_fail_huge
    li a7, 4
    ecall
    li a0, 1
    j test_edge_finish
test2_pass:
# Test 3: Underflow
    li a0, 0x007CE666 # 1e-38f
    jal ra, f32_to_bf16
    mv s0, a0 # s0 = bf_small
    li a0, 0x501502F9 # 1e10f
    jal ra, f32_to_bf16
    mv s1, a0 # s1 = bf_1e10
    mv a0, s0
    mv a1, s1
    jal ra, bf16_div
    mv s2, a0 # s2 = smaller
    mv a0, s2
    jal ra, bf16_to_f32
    mv t4, a0 # t4 = smaller_val f32 bits
    jal ra, bf16_iszero
    bnez a0, test3_pass
    li t3, 0x7FFFFFFF
    and t4, t4, t3 # clear sign
    li t6, 0x00000001 # 1e-45f f32 bits
    bltu t4, t6, test3_pass
    la a0, msg_fail_underflow
    li a7, 4
    ecall
    li a0, 1
    j test_edge_finish
test3_pass:
    la a0, Edge_msg
    li a7, 4
    ecall
    li a0, 0
test_edge_finish:
#------------------------test_rounding------------------------
    la a0, msg_rounding
    li a7, 4
    ecall
    li a0, 0x3FC00000 # 1.5f
    jal ra, f32_to_bf16
    mv s0, a0 # s0 = bf_exact
# back_exact = bf16_to_f32(bf_exact)
    jal ra, bf16_to_f32
    mv t0, a0 # t0 = back_exact f32 bits
# check exact representation preserved
    li t1, 0x3FC00000 # 1.5f bits
    bne t0, t1, rounding_fail
pass_test_rounding_1:
    li a0, 0x3F800066 # 1.0001f bits
    jal ra, f32_to_bf16
    mv s1, a0 # s1 = bf
    jal ra, bf16_to_f32
    mv t2, a0 # t2 = back f32 bits
# diff2 = back - val
    li t3, 0x3F800066 # val bits
    sub t4, t3, t2 # t4 = diff2 bits
# absolute value of the difference (clear the sign bit)
    li t5, 0x7FFFFFFF
    and t4, t4, t5
# check rounding error < 0.001
    li t6, 0x3A83126F # 0.001f bits
    bltu t4, t6, rounding_pass
rounding_fail:
    la a0, msg_fail_rounding
    li a7, 4
    ecall
    li a0, 1
    j test_rounding_end
rounding_pass:
    la a0, Rounding_msg
    li a7, 4
    ecall
test_rounding_end:
    j end

print_result:
    addi sp,sp,-4
    sw ra,0(sp)
    mv t1,a0
    la a0, msg_result # "Result = "
    li a7, 4
    ecall
    mv a0,t1
    li a1,8
    jal ra,print_hex
pass_add:
    la a0, msg_pass_add
    li a7, 4
    ecall
    lw ra,0(sp)
    addi sp,sp,4
    ret
end_add:
    ret

abs:
    bltz s3, abs_neg
    j abs_positive
abs_neg:
    neg s3, s3 # s3 = -s3
abs_positive:
    li t1, 10
    blt s3, t1, abs_pass
    li a0, 1 # fail
    ret
abs_pass:
    li a0, 0 # pass
    ret

bf16_add:
    addi sp,sp,-8
    sw ra,4(sp)
    li t5,0xFF
    li t6,0x7F
    srli t0, a0, 15 # sign_a in t0
    andi t0, t0, 1
    srli t1, a1, 15 # sign_b in t1
    andi t1, t1, 1
    srli t2, a0, 7 # exp_a in t2
    and t2, t2, t5
    srli t3, a1, 7 # exp_b in t3
    and t3, t3, t5
    and t4, a0, t6 # mant_a in t4
    and t5, a1, t6 # mant_b in t5
# s6 = result_sign, s7 = result_expo, s8 = mantissa
# if a is zero (exp_a==0 && mant_a==0) => return b
    beqz a0,return_b
    beqz a1,return_a
    beqz t2, skip_a_norm
    ori t4, t4, 0x80 # mant_a |= 0x80
skip_a_norm:
    beqz t3, skip_b_norm
    ori t5, t5, 0x80 # mant_b |= 0x80
skip_b_norm:
    sub s9, t2, t3 # exp_diff = exp_a - exp_b
    bgtz s9, exp_a_bigger
    bltz s9, exp_b_bigger
    j exp_equal
exp_a_bigger:
    mv s7, t2 # result_exp = exp_a
    li t6, 8
    bgt s9, t6, return_a
    srl t5, t5, s9 # mant_b >>= exp_diff
    j continue_add
exp_b_bigger:
    neg s9, s9 # exp_diff = -exp_diff (now positive)
    mv s7, t3 # result_exp = exp_b
    li t6, 8
    bgt s9, t6, return_b # compare after negation, so test against +8
    srl t4, t4, s9 # mant_a >>= -exp_diff
    j continue_add
exp_equal:
    mv s7, t2
continue_add:
    beq t0, t1, same_sign
    j diff_sign
same_sign:
    mv s6, t0 # result_sign = sign_a
    add s8, t4, t5
    li t6, 0x100
    and s9, s8, t6
    beqz s9, normalize_end
# mantissa overflow: renormalize
    srli s8, s8, 1 # => mantissa >> 1
    addi s7, s7, 1 # exponent++
    li t6, 0xFF
    blt s7, t6, normalize_end
    slli t6, s6, 15
    li s9, 0x7F80
    or t6, t6, s9
    mv a0, t6
    j add_exit
diff_sign:
    bgeu t4, t5, mant_a_ge
    mv s6, t1 # result_sign = sign_b
    sub s8, t5, t4 # result_mant = mant_b - mant_a
    j normalize_check
mant_a_ge:
    mv s6, t0
    sub s8, t4, t5
normalize_check:
    beqz s8, return_zero
normalize_loop:
    li t6, 0x80
    and s9, s8, t6
    bnez s9, normalize_end
    slli s8, s8, 1
    addi s7, s7, -1 # exponent--
    blez s7, return_zero
    j normalize_loop
normalize_end:
    slli t6, s6, 15 # (sign << 15)
    andi s9, s7, 0xFF
    slli s9, s9, 7 # (exp << 7)
    or t6, t6, s9
    andi s9, s8, 0x7F # mantissa (7 bits)
    or t6, t6, s9
    mv a0, t6
    j add_exit
return_a:
    mv a0, a0
    j add_exit
return_b:
    mv a0, a1
    j add_exit
return_zero:
    li a0, 0
    j add_exit
add_exit:
    lw ra, 4(sp)
    addi sp, sp, 8
    ret

bf16_sub:
    addi sp, sp, -4 # allocate the stack slot (grow downwards)
    sw ra, 0(sp)
    li t0, 0x8000
    xor a1, a1, t0 # flip the sign of b, then add
    jal ra,bf16_add
    lw ra,0(sp)
    addi sp,sp,4
    ret

bf16_mul:
    addi sp, sp, -4
    sw ra, 0(sp)
# constants
    li t6, 0xFF
    li s4, 0x7F
    li s5, 127 # BF16_EXP_BIAS
# zero operand: leave through bf16_mul's own exit so the stack stays balanced
    beqz a0,mul_zero_input
    beqz a1,mul_zero_input
# extract sign bits
    srli t0, a0, 15
    andi t0, t0, 1 # sign_a (t0)
    srli t1, a1, 15
    andi t1, t1, 1 # sign_b (t1)
# extract exponents (8 bits)
    srli t2, a0, 7
    and t2, t2, t6 # exp_a (t2)
    srli t3, a1, 7
    and t3, t3, t6 # exp_b (t3)
# extract mantissas (7 bits)
    and t4, a0, s4 # mant_a (t4)
    and t5, a1, s4 # mant_b (t5)
# result sign = sign_a ^ sign_b (s7)
    xor s7, t0, t1
# result expo (s8)
# result mant (s9)
# exp_adjust = 0
    li s6, 0
# === normalize mant_a ===
    beqz t2, denorm_a
norm_a:
    ori t4, t4, 0x80
    j mant_a_done
denorm_a:
    beqz t4, mant_a_done
denorm_a_loop:
    andi t0, t4, 0x80
    bnez t0, mant_a_done
    slli t4, t4, 1
    addi s6, s6, -1
    j denorm_a_loop
mant_a_done:
# === normalize mant_b ===
    beqz t3, denorm_b
norm_b:
    ori t5, t5, 0x80
    j mant_b_done
denorm_b:
    beqz t5, mant_b_done
denorm_b_loop:
    andi t0, t5, 0x80
    bnez t0, mant_b_done
    slli t5, t5, 1
    addi s6, s6, -1
    j denorm_b_loop
mant_b_done:
# mantissa multiply (8x8 = 16-bit)
    mul s9, t4, t5
# result_exp = exp_a + exp_b - bias + exp_adjust
    add s8, t2, t3
    add s8, s8, s6
    addi s8, s8, -127
# normalize mantissa
    li t0, 0x8000
    and t1, s9, t0
    bnez t1, shift8
# no overflow: shift right 7 bits
    srli s9, s9, 7
    andi s9, s9, 0x7F
    j norm_done
shift8:
    srli s9, s9, 8
    andi s9, s9, 0x7F
    addi s8, s8, 1
norm_done:
# overflow check
    li t0, 0xFF
    bge s8, t0, set_inf
# underflow check
    blez s8, underflow
# ===== normal result =====
    slli t0, s7, 15
    andi t1, s8, 0xFF
    slli t1, t1, 7
    or t0, t0, t1
    andi t1, s9, 0x7F
    or a0, t0, t1
    j mul_done
# ===== underflow case =====
underflow:
    li t0, -6
    blt s8, t0, return_zero_mul
    li t1, 1
    sub t1, t1, s8
    srl s9, s9, t1
    li s8, 0
    slli t0, s7, 15
    andi t1, s8, 0xFF
    slli t1, t1, 7
    or t0, t0, t1
    andi t1, s9, 0x7F
    or a0, t0, t1
    j mul_done
# ===== overflow (Inf) =====
set_inf:
    slli a0, s7, 15
    li t6,0x7F80
    or a0, a0, t6
    j mul_done
# ===== zero operand =====
mul_zero_input:
    li a0, 0
    j mul_done
# ===== zero result =====
return_zero_mul:
    slli a0, s7, 15
# ===== finish =====
mul_done:
    lw ra, 0(sp)
    addi sp, sp, 4
    ret

bf16_div:
    li t6, 0xFF
    li s4, 0x7F
    li s5, 127 # BF16_EXP_BIAS
# extract sign bits
    srli t0, a0, 15
    andi t0, t0, 1 # sign_a
    srli t1, a1, 15
    andi t1, t1, 1 # sign_b
# extract exponents
    srli t2, a0, 7
    and t2, t2, t6 # exp_a
    srli t3, a1, 7
    and t3, t3, t6 # exp_b
# extract mantissas
    and t4, a0, s4 # mant_a
    and t5, a1, s4 # mant_b
# result sign = sign_a ^ sign_b
    xor s7, t0, t1
# add hidden 1-bit
    ori t4, t4, 0x80 # mant_a |= 0x80
    ori t5, t5, 0x80 # mant_b |= 0x80
# result_exp = exp_a - exp_b + bias
    sub s8, t2, t3
    add s8, s8, s5 # s8 = exp_a - exp_b + 127
# mantissa division (approx)
    slli s9, t4, 7 # mant_a << 7
    divu s9, s9, t5 # result_mant = (mant_a << 7) / mant_b
# normalization (if mantissa >= 0x100)
    li t0, 0x100
    and t1, s9, t0
    beqz t1, skip_norm
    srli s9, s9, 1
    addi s8, s8, 1
skip_norm:
# pack result bits
    slli t0, s7, 15 # sign << 15
    slli t1, s8, 7 # exp << 7
    or t0, t0, t1
    and s9, s9, s4 # mant & 0x7F
    or a0, t0, s9
    ret

bf16_sqrt:
    li s5, 127 # BF16_EXP_BIAS
    li t6, 0xFF
    li s4, 0x7F
# extract exponent and mantissa
    srli t0, a0, 7
    and t0, t0, t6 # exp
    and t1, a0, s4 # mant
# e = exp - bias
    addi t2, t0, -127 # e = exp - 127
    li t3, 1
    and t4, t2, t3 # t4 = e & 1
    ori t5, t1, 0x80 # m = 0x80 | mant
    beqz t4, sqrt_even_exp
    slli t5, t5, 1 # m <<= 1
    addi t2, t2, -1
sqrt_even_exp:
    srai t6, t2, 1
    add t6, t6, s5 # new_exp = (e>>1)+bias
# binary search for the square root of the mantissa
    li s0, 90 # low
    li s1, 256 # high
    li s2, 128 # result
sqrt_loop:
    bgt s0, s1, sqrt_done
    add s3, s0, s1
    srli s3, s3, 1 # mid = (low + high) >> 1
    mul s4, s3, s3
    srli s4, s4, 7 # sq = (mid*mid)/128
    ble s4, t5, sqrt_le
    addi s1, s3, -1 # high = mid - 1
    j sqrt_loop
sqrt_le:
    mv s2, s3 # result = mid
    addi s0, s3, 1 # low = mid + 1
    j sqrt_loop
sqrt_done:
    li t0, 256
    blt s2, t0, sqrt_check_low
    srli s2, s2, 1
    addi t6, t6, 1 # new_exp++
    j sqrt_pack
sqrt_check_low:
    li t1, 128
    bge s2, t1, sqrt_pack
sqrt_shift_up:
    blt t6, zero, sqrt_pack
    slli s2, s2, 1
    addi t6, t6, -1
    blt s2, t1, sqrt_shift_up
sqrt_pack:
    andi s2, s2, 0x7F # new_mant = result & 0x7F
    slli t6, t6, 7
    or a0, t6, s2
    ret

#------------------------comparison_functions--------------------------
bf16_eq:
    addi sp, sp, -16
    sw ra, 8(sp)
    mv t0, a0 # store a in t0
    jal ra, bf16_isnan
    bnez a0, eq_false
# b NaN or not
    mv t1, a1 # store b in t1
    mv a0, t1
    jal ra, bf16_isnan
    bnez a0, eq_false
# both zero or not
    mv a0, t0 # a0 = a
    jal ra, bf16_iszero
    mv t5, a0 # t5 = iszero(a); bf16_iszero overwrites t1 and t2
    mv a0, a1
    jal ra, bf16_iszero
    and t2, t5, a0
    bnez t2, eq_true
# bit equality
    beq t0, a1, eq_true
eq_false:
    li a0, 0
    j eq_exit
eq_true:
    li a0, 1
eq_exit:
    lw ra, 8(sp)
    addi sp, sp, 16
    ret

bf16_lt:
    addi sp, sp, -16
    sw ra, 12(sp)
    mv t0, a0
    jal ra, bf16_isnan
    bnez a0, lt_false
    mv a0, a1
    jal ra, bf16_isnan
    bnez a0, lt_false
# check zero
    mv a0, t0
    jal ra, bf16_iszero
    mv t5, a0 # t5 = iszero(a); bf16_iszero overwrites t1 and t2
    mv a0, a1
    jal ra, bf16_iszero
    and t2, t5, a0
    bnez t2, lt_false
# sign_a = (a >> 15) & 1
    srli t3, t0, 15
    andi t3, t3, 1
# sign_b = (b >> 15) & 1
    srli t4, a1, 15
    andi t4, t4, 1
# sign_a != sign_b ?
    bne t3, t4, sign_diff
# same sign
    beqz t3, both_pos # if sign = 0 positive
    j both_neg
both_pos:
    blt t0, a1, lt_true # both positive: compare the bit patterns directly
    j lt_false
both_neg:
    bgt t0, a1, lt_true # both negative: the comparison is reversed
    j lt_false
sign_diff:
# sign_a > sign_b ?
    bgt t3, t4, lt_true
    j lt_false
lt_true:
    li a0, 1
    j lt_exit
lt_false:
    li a0, 0
lt_exit:
    lw ra, 12(sp)
    addi sp, sp, 16
    ret

bf16_gt:
    addi sp, sp, -16
    sw ra, 4(sp)
    mv t0, a0 # store a
    mv a0, a1 # bf16_lt(b, a)
    mv a1, t0
    jal ra, bf16_lt
    lw ra, 4(sp)
    addi sp, sp, 16
    ret

# ---------------- fail_msg_for_special----------------
fail_posinf:
    la a0, msg_fail_posinf
    li a7, 4
    ecall
    li a0, 1
    ret
fail_inf_nan:
    la a0, msg_fail_inf_nan
    li a7, 4
    ecall
    li a0, 1
    ret
fail_neginf:
    la a0, msg_fail_neginf
    li a7, 4
    ecall
    li a0, 1
    ret
fail_nan:
    la a0, msg_fail_nan
    li a7, 4
    ecall
    li a0, 1
    ret
fail_nan_inf:
    la a0, msg_fail_nan_inf
    li a7, 4
    ecall
    li a0, 1
    ret
fail_zero:
    la a0, msg_fail_zero
    li a7, 4
    ecall
    li a0, 1
    ret
fail_negzero:
    la a0, msg_fail_negzero
    li a7, 4
    ecall
    li a0, 1
    ret
### ---------------- fail_for_comparison----------------
fail_naneq:
    la a0, msg_fail_naneq
    li a7, 4
    ecall
    li a0, 1
    ret
fail_nanlt:
    la a0, msg_fail_nanlt
    li a7, 4
    ecall
    li a0, 1
    ret
fail_nangt:
    la a0, msg_fail_nangt
    li a7, 4
    ecall
    li a0, 1
    ret
fail_eq:
    la a0, msg_fail_eq
    li a7, 4
    ecall
    li a0, 1
    ret
fail_ineq:
    la a0, msg_fail_ineq
    li a7, 4
    ecall
    li a0, 1
    ret
fail_lt:
    la a0, msg_fail_lt
    li a7, 4
    ecall
    li a0, 1
    ret
fail_nlt:
    la a0, msg_fail_nlt
    li a7, 4
    ecall
    li a0, 1
    ret
fail_enlt:
    la a0, msg_fail_enlt
    li a7, 4
    ecall
    li a0, 1
    ret
fail_gt:
    la a0, msg_fail_gt
    li a7, 4
    ecall
    li a0, 1
    ret
fail_ngt:
    la a0, msg_fail_ngt
    li a7, 4
    ecall
    li a0, 1
    ret
#---------------------fail_arithmetic-----------------
fail_add:
    la a0, msg_add_fail
    li a7, 4
    ecall
    ret
fail_sub:
    la a0, msg_sub_fail
    li a7, 4
    ecall
    ret
fail_mul:
    la a0, msg_mul_fail
    li a7, 4
    ecall
    ret
fail_div:
    la a0, msg_div_fail
    li a7, 4
    ecall
    ret
fail_sqrt4:
    la a0, msg_sqrt4_fail
    li a7, 4
    ecall
    ret
fail_sqrt9:
    la a0, msg_sqrt9_fail
    li a7, 4
    ecall
    ret

bf16_isinf:
    mv t1,a0
    li t2,0x7F80
    and t3,t1,t2
    bne t3,t2, not_inf # exponent must be all ones
    li t2,0x007F
    and t4,t1,t2
    bnez t4, not_inf # mantissa must be zero
    li a0,1
    ret
not_inf:
    li a0,0
    ret
bf16_isnan:
    mv t1,a0
    li t2,0x7F80
    and t3,t1,t2
    bne t3,t2, not_nan # exponent must be all ones
    li t2,0x007F
    and t4,t1,t2
    beqz t4, not_nan # mantissa must be non-zero
    li a0,1
    ret
not_nan:
    li a0,0
    ret
bf16_iszero:
    mv t1,a0
    li t2,0x7FFF
    and t1,t1,t2 # ignore the sign bit
    beqz t1, is_zero
    li a0,0
    ret
is_zero:
    li a0,1
    ret
bf16_is_neg_zero:
    li t1,0x8000
    beq a0,t1,is_negzero
    li a0,0
    ret
is_negzero:
    li a0,1
    ret

pass_basic_msg:
    la a0,Basic_msg
    li a7,4
    ecall
    j back_from_basic_msg ## not sure ------------------------------------
    j end

print_hex:
    mv t0, a0 # val
    la a0,hex
    li a7,4
    ecall
    mv t1, a1 # digits
    la t2, hexchars # hex
print_hex_loop:
    beqz t1, print_hex_done
    addi t1, t1, -1
    slli t3, t1, 2
    srl t4, t0, t3 # val >> (4*pos)
    andi t4, t4, 0xF
    add t4, t2, t4
    lbu a0, 0(t4)
    li a7, 11 # print_char
    ecall
    j print_hex_loop
print_hex_done:
    li a0, 10 # newline
    li a7, 11
    ecall
    ret

fail:
    la a0, fail_msg
    li a7, 4
    ecall
    li a0, 1 # return 1 if fail
    ret
end:
    la a0, all_test_pass_msg
    li a7, 4
    ecall
    li a7,10
    ecall

f32_to_bf16:
    mv t1,a0
# ((f32bits >> 23) & 0xFF)
    srli t2, t1, 23
    andi t2, t2, 0xFF # store expo in t2
    li t3, 0xFF
    beq t2,t3,is_nan_inf
    srli t4,t1,16
    andi t4,t4,1 # lowest bit that is kept
    li t3,0x7FFF
    add t4,t4,t3
    add t1,t1,t4 # round to nearest, ties to even
    srli t2,t1,16
    mv a0,t2
    ret
bf16_to_f32:
    slli t3,a0,16
    mv a0,t3
    ret
is_nan_inf:
    srli t4,t1,16 # Inf/NaN: keep the upper 16 bits of the f32 input (t1)
    li t3, 0xffff
    and t4,t4,t3
    mv a0,t4
    ret
```