arch2025-homework1

# Assignment1: RISC-V Assembly and Instruction Pipeline Contributed by <[moonchin819](https://github.com/moonchin819)> ## Implementation :::info **🛈 AI Tools Usage** I utilized AI tools to support my work in this assignment. Specifically, Claude was used for code explanations, and Perplexity for clarifying technical concepts. ::: Here's the [source code](https://github.com/moonchin819/ca2025-quizzes). ## [Problem B](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B) ### uf8_decode 1. Extract last four bits as mantissa, and the high fours bit as the exponent part by shift logically right. 2. Generate 1's according to the number of exponent * e.g. exponent = 3, 0x7FFF >> 12 = 0x0007, and there's actual three 1's * And then move logically left for four bits to align offset to the mantissa part, to avoid overlapping and the continuity of the decode results \begin{gather*}D(b) = m \cdot 2^e + (2^e - 1) \cdot 16\end{gather*} 3. Mantissa controls the detail in every segments of exponent ### Method 1 - Arithmetic Computation ```asm= uf8_decode: andi t0, a0, 0x0F # mantissa = f1 & 0xF srli t1, a0, 4 # exponent = f1 >> 4 li t2, 15 sub t2, t2, t1 # 15 - exponent li t3, 0x7FFF srl t3, t3, t2 # 0x7fff >> (15 - exponent) slli t3, t3, 4 # offset << 4 sll t2, t0, t1 # mantissa << exponent add a0, t2, t3 # mantissa + offset ret ``` ### Method 2 - Read-only Memory (ROM) Lookup Table 1. Calculate the value of each exponent and place into a table * In this problem, the range of data is not so large, so I can try sacrifice some memory space to have better execution efficiency 2. We won't have to calculate the offset in the execution, using the value of exponent as the index to look up the table. \begin{gather*}\text{Offset}(e) =(2^{e}-1)\cdot16 \end{gather*} | Exponent | Offset | Range |Step Size| |:---------:|:-------:|:----------------------|:-----:| | 0 | 0 | [0, 15] |1| | 1 | 16 | [16, 47] |2| | 2 | 48 | [48, 111] |4| | 3 | 112 | [112, 239] |8| | 4 | 240 | [240, 495] |16| | 5 | 496 | [496, 1007] |32| |...|...|...|$2^{e}$| | 15 | 524272 | [524272, 1048575] |32768| ```asm= offset_table: .word 0, 16, 48, 112, 240, 496, 1008, 2032, 4080, 8176, 16368, 32752, 65520, 131056, 262128, 524272 uf8_decode: andi t0, a0, 0x0F # mantissa = f1 & 0xF srli t1, a0, 4 # exponent = f1 >> 4 la t2, offset_table slli t3, t1, 2 # offset << 4 add t2, t2, t3 lw t2, 0(t2) sll t0, t0, t1 # mantissa << exponent add a0, t0, t2 # mantissa + offset ret ``` ### Pros and Cons Pros : * It has faster access time * It's ideal for frequent reuse Cons : * It requires more memory space * Not suitable for very large data ranges ### clz * n represents "how many leading zeros are still there" in current prediction (set 32 to assume there's at most 32 0's) * c is how many bits to shift right now (set 16 to test first) * Use a loop to check the higher 16 bits first to check whether all of them are zero -> if not, check the second half part. * Use the core logic of binary * If the result of shifting right ```y!=0``` -> indicates that the leading zeros are less than what is expected -> ```n-=c``` in the mean time, update ```x=y``` to keep on dealing with the right half * If ```y==0```, it means the part that have been moved out are all zero, so we don't have to minus ```n``` * ```c >>=1``` implies we narrow down the search range by half each time, and it's the main concept of binary search. ### Method 1 - Directly Translation ```asm= clz: li t0, 32 # n = 32 li t1, 16 # c = 16 clz_loop: srl t2, a0, t1 # y = x >> c beqz t2, clz_skip # if (y == 0) skip sub t0, t0, t1 # n -= c mv a0, t2 # x = y clz_skip: srli t1, t1, 1 # c >>= 1 bnez t1, clz_loop # while (c != 0) sub a0, t0, a0 # return n - x ret ``` ### Method 2 - Loop Expansion * Since the loop is only executed 5 times (c decades from 16 to 1), we could expand the loop to eliminate all the cost of branches * Implement the code from 16 to 1 in order, keep dividing c by 2 until c be equal to 0 then exit the loop and return the result value. ```asm= clz: li t0, 32 # n = 32 # c = 16 srli t2, a0, 16 # y = x >> 16 beqz t2, c_equal_8 # branch if y is equal to zero addi t0, t0, -16 # n -= 16 mv a0, t2 # x = y c_equal_8: srli t2, a0, 8 # y = x >> 8 beqz t2, c_equal_4 addi t0, t0, -8 # n -= 8 mv a0, t2 # x = y c_equal_4: srli t2, a0, 4 # y = x >> 4 beqz t2, c_equal_2 addi t0, t0, -4 # n -= 4 mv a0, t2 # x = y c_equal_2: srli t2, a0, 2 # y = x >> 2 beqz t2, c_equal_1 addi t0, t0, -2 # n -= 2 mv a0, t2 # x = y c_equal_1: srli t2, a0, 1 # y = x >> 1 beqz t2, iter_end addi t0, t0, -1 # n -= 1 mv a0, t2 # x = y iter_end: sub a0, t0, a0 # return n - x ret ``` * Although number of code lines has more than doubled, this approach reduces approximately 5000 cycles. ![image](https://hackmd.io/_uploads/SycWsssTeg.png) ### uf8_encode :::spoiler Assembly Code ```asm= uf8_encode: addi sp, sp, -4 sw ra, 0(sp) # will call clz later mv t5, a0 # t5 = value li t0, 16 blt t5, t0, ret_value # branch if value is less than 16 jal ra, clz # call function clz, a0 = lz li t0, 31 sub t0, t0, a0 # msb = 31 - lz li t1, 0 # exponent = 0 li t2, 0 # overflow = 0 li t3, 5 blt t0, t3, find_exact_exp # branch to find exact exponent if msb is less than 5 addi t1, t0, -4 # exponent = msb - 4 li t3, 15 ble t1, t3, over # branch to the nearest 1 function if exponent is less than or equal to 15 li t1, 15 # exponent = 15 over: li t3, 0 # e (counter) overflow_loop: slli t6, t2, 1 # overflow << 1 addi t2, t6, 16 # (overflow << 1) + 16 addi t3, t3, 1 # e++ blt t3, t1, overflow_loop # branch to the begining of the loop if e is still less than exponent adjust_loop: blez t1, find_exact_exp # branch if exponent is less than or equal to zero bge t5, t2, find_exact_exp # branch if overflow is greater than or equal to value (unsigned) addi t6, t5, -16 # overflow - 16 srli t5, t6, 1 # (overflow - 16) >> 1 addi t1, t1, -1 # exponent-- j adjust_loop find_exact_exp: li t6, 15 find_exact_exp_loop: bge t1, t6, exact_done # branch if exponent is greater than or equal to 15 slli t3, t2, 1 # overflow << 1 addi t3, t3, 16 # next_overflow = (oveflow << 1) + 16 blt t5, t3, exact_done # branch if value is less than next_overflow mv t2, t3 # overflow = next_overflow addi t1, t1, 1 # exponent++ j find_exact_exp_loop exact_done: sub t0, t5, t2 # value - overflow srl t0, t0, t1 # (value - overflow) >> exponent slli t3, t1, 4 # exponent << 4 or a0, t3, t0 # (expoment << 4) | mantissa ret_value: lw ra, 0(sp) addi sp, sp, 4 jr ra ``` ::: ### Full Assembly Code :::spoiler Assembly Code ```asm= .data msg1: .asciz ": produce value " msg2: .asciz " but encodes back to " msg3: .asciz ": value\n" msg4: .asciz " <= previous_value " msg5: .asciz "All tests passed.\n" msg6: .asciz "Test failed.\n" line_break: .asciz "\n" .align 2 .text .global main main: jal ra, test # start testing beq a0, x0, fail # didn't pass the test la a0, msg5 li a7, 4 # test passed ecall li a7, 10 # set exit code li a0, 0 ecall # successful if exit code is 0 fail: la a0, msg6 li a7, 4 ecall li a7, 10 li a0, 1 ecall # unsuccessful if exit code is 1 test: addi sp, sp, -4 sw ra, 0(sp) li s0, -1 # s0 = previous_value li s1, 1 # s1 = passed (true) li s2, 0 # s2 = i, counter initialized li s3, 256 # end of counter begin_test_loop: mv a0, s2 jal ra, uf8_decode mv s4, a0 # return the value of uf8_decode mv a0, s4 jal ra, uf8_encode mv s5, a0 # return the value of uf8_encode test_if1: beq s2, s5, test_if2 # branch if fl==fl2 mv a0, s2 # printf(fl) li a7, 34 ecall la a0, msg1 li a7, 4 ecall mv a0, s4 li a7, 1 ecall la a0, msg2 li a7, 4 ecall mv a0, s5 li a7, 34 ecall la a0, line_break li a7, 4 ecall li s1, 0 # passed = false test_if2: bgt s4, s0, end_test_loop mv a0, s2 # printf(fl) li a7, 34 ecall la a0, msg3 li a7, 4 ecall mv a0, s4 # printf(value) li a7, 1 ecall la a0, msg4 li a7, 4 ecall mv a0, s0 # printf(previous_value) li a7, 34 ecall la a0, line_break li a7, 4 ecall li s1, 0 # passed = false end_test_loop: mv s0, s4 # previous_value = value addi s2, s2, 1 blt s2, s3, begin_test_loop mv a0, s1 # return passed lw ra, 0(sp) addi sp, sp, 4 jr ra clz: li t0, 32 # n = 32 li t1, 16 # c = 16 clz_loop: srl t2, a0, t1 # y = x >> c beqz t2, clz_skip # if (y == 0) skip sub t0, t0, t1 # n -= c mv a0, t2 # x = y clz_skip: srli t1, t1, 1 # c >>= 1 bnez t1, clz_loop # while (c != 0) sub a0, t0, a0 # return n - x ret offset_table: .word 0, 16, 48, 112, 240, 496, 1008, 2032, 4080, 8176, 16368, 32752, 65520, 131056, 262128, 524272 uf8_decode: andi t0, a0, 0x0F # mantissa = f1 & 0xF srli t1, a0, 4 # exponent = f1 >> 4 la t2, offset_table slli t3, t1, 2 # offset << 4 add t2, t2, t3 lw t2, 0(t2) sll t0, t0, t1 # mantissa << exponent add a0, t0, t2 # mantissa + offset ret uf8_encode: addi sp, sp, -4 sw ra, 0(sp) # will call clz later mv t5, a0 # t5 = value li t0, 16 blt t5, t0, ret_value # branch if value is less than 16 jal ra, clz # call function clz, a0 = lz li t0, 31 sub t0, t0, a0 # msb = 31 - lz li t1, 0 # exponent = 0 li t2, 0 # overflow = 0 li t3, 5 blt t0, t3, find_exact_exp # branch to find exact exponent if msb is less than 5 addi t1, t0, -4 # exponent = msb - 4 li t3, 15 ble t1, t3, over # branch to the nearest 1 function if exponent is less than or equal to 15 li t1, 15 # exponent = 15 over: li t3, 0 # e (counter) overflow_loop: slli t6, t2, 1 # overflow << 1 addi t2, t6, 16 # (overflow << 1) + 16 addi t3, t3, 1 # e++ blt t3, t1, overflow_loop # branch to the begining of the loop if e is still less than exponent adjust_loop: blez t1, find_exact_exp # branch if exponent is less than or equal to zero bge t5, t2, find_exact_exp # branch if overflow is greater than or equal to value (unsigned) addi t6, t5, -16 # overflow - 16 srli t5, t6, 1 # (overflow - 16) >> 1 addi t1, t1, -1 # exponent-- j adjust_loop find_exact_exp: li t6, 15 find_exact_exp_loop: bge t1, t6, exact_done # branch if exponent is greater than or equal to 15 slli t3, t2, 1 # overflow << 1 addi t3, t3, 16 # next_overflow = (oveflow << 1) + 16 blt t5, t3, exact_done # branch if value is less than next_overflow mv t2, t3 # overflow = next_overflow addi t1, t1, 1 # exponent++ j find_exact_exp_loop exact_done: sub t0, t5, t2 # value - overflow srl t0, t0, t1 # (value - overflow) >> exponent slli t3, t1, 4 # exponent << 4 or a0, t3, t0 # (expoment << 4) | mantissa ret_value: lw ra, 0(sp) addi sp, sp, 4 jr ra ``` ::: ### Analysis In this assignment we use [Ripes simulator](https://ripes.me/) to test our code. * Method 1 - Arithmetic Computation * Assembly code size : 165 lines Cycle counts : 40107 ![image](https://hackmd.io/_uploads/HJSPsijalx.png =50%x) * Method 2.1 - ROM Lookup Table without Loop Expansion * Assembly code size : 166 lines Cycle counts : 39851 ![image](https://hackmd.io/_uploads/BJ_cjiiTle.png =50%x) * Method 2.2 - ROM Lookup Table with Loop Expansion * Assembly code size : 182 lines Cycle counts : 35291 ![image](https://hackmd.io/_uploads/SJQ0ioipxe.png =50%x) Method 2.2 ## [Problem C](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C) BFloat16 is a 16-bit floating point format designed for keeping the dynamic range of float32 but reduce the precision. The value of BFloat16 is calculated as : \begin{gather*}v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right)\end{gather*} Determine it's positive or negative by sign, and decide multiple by exponent, then finetune the precision of decimals part by mantissa. ### Initial Part :::spoiler Assembly Code ```asm= .globl bf16_isnan bf16_isnan: li t0, 0x7F80 and t1, a0, t0 bne t1, t0, isnan_false li t2, 0x007F and t3, a0, t2 snez a0, t3 ret isnan_false: li a0, 0 ret bf16_isinf: li t0, 0x7F80 and t1, a0, t0 bne t1, t0, isinf_false li t2, 0x007F and t3, a0, t2 seqz a0, t3 ret isinf_false: li a0, 0 ret bf16_iszero: li t0, 0x7FFF and t1, a0, t0 seqz a0, t1 ret .globl f32_to_bf16 f32_to_bf16: addi sp, sp, -4 sw s0, 0(sp) mv s0, a0 srli t0, s0, 23 andi t0, t0, 0xFF li t1, 0xFF bne t0, t1, unspecial srli a0, s0, 16 li t0, 0xFFFF and a0, a0, t0 j f32_to_bf16_done unspecial: srli t0, s0, 16 andi t0, t0, 1 li t1, 0x7FFF add t0, t0, t1 add s0, s0, t0 srli a0, s0, 16 f32_to_bf16_done: lw s0, 0(sp) addi sp, sp, 4 ret .globl bf16_to_f32 bf16_to_f32: slli a0, a0, 16 ret .globl BF16_NAN BF16_NAN: li a0, 0x7FC0 ret .globl BF16_ZERO BF16_ZERO: li a0, 0x0000 ret ``` ::: ### Add ![image](https://hackmd.io/_uploads/SyLS1pPCle.png) Result with three test cases :::spoiler Assembly Code ```asm= .globl bf16_add bf16_add: srli t0, a0, 15 # a.bits >> 15 andi t0, t0, 1 # & 1 srli t1, a1, 15 # b.bits >> 15 andi t1, t0, 1 # & 1 srli t2, a0, 7 # a.bits >> 7 andi t2, t2, 0xFF # & 0xFF srli t3, a1, 7 # b.bits >> 7 andi t3, t3, 0xFF # & 0xFF li t6, 0xFF beq t2, t6, exp_a_checkall j check_exp_b exp_a_checkall: beqz t4, exp_a_not_nan mv a0, a0 ret exp_a_not_nan: beq t3, t6, check1 mv a0, a0 ret check1: bnez t5, return_b1 beq t0, t1, return_b1 li a0, 0x7FC0 ret return_b1: mv a0, a1 ret check_exp_b: beq t3, t6, return_b2 j check_0_a return_b2: mv a0, a1 ret check_0_a: bnez t2, check_0_b bnez t4, check_0_b mv a0, a1 ret check_0_b: bnez t3, norm_a bnez t5, norm_a mv a0, a0 ret norm_a: beqz t2, norm_b ori t4, t4, 0x80 norm_b: beqz t3, end_check1 ori t5, t5, 0x80 end_check1: addi sp, sp, -20 sw s0, 16(sp) # for exp_diff sw s1, 12(sp) # for result_sign sw s2, 8(sp) # for result_exp sw s3, 4(sp) # for result_mant sw s4, 0(sp) sub s0, t2, t3 # exp_diff = exp_a - exp_b blez s0, diff_neg # branch if exp_diff <= 0 mv s2, t2 # result_exp = exp_a li t6, 8 bgt s0, t6, return_a srl t5, t5, s0 j exp_done diff_neg: bgez s0, diff_else mv s2, t3 li t6, -8 blt s0, t6, return_b3 neg s4, s0 srl t4, t4, s4 j exp_done diff_else: mv s2, t2 return_a: mv a0, a0 ret return_b3: mv a0, a1 ret exp_done: bne t0, t1, diff_sign # branch if sign_a != sign_b same_sign: mv s1, t0 add s3, t4, t5 andi t6, s3, 0x100 beqz t6, norm_end srli s3, s3, 1 addi s2, s2, 1 li t6, 0xFF bge s1, t6, overflow_inf j norm_end overflow_inf: slli a0, s1, 15 li t6, 0x7F80 or a0, a0, t6 ret diff_sign: bge t4, t5, manta_>_mantb mv s1, t1 sub s3, t5, t4 j mant_result manta_>_mantb: mv s1, t0 sub s3, t4, t5 mant_result: beqz s3, return_zero norm_loop: andi t6, s3, 0x80 bnez t6, norm_end slli s3, s3, 1 addi s2, s2, -1 blez s2, return_zero j norm_loop norm_end: slli t0, s1, 15 andi t1, s2, 0xFF slli t1, t1, 7 or t0, t0, t1 andi t1, s3, 0x7F or a0, t0, t1 ret return_zero: li a0, 0x0000 ret ``` ::: ### Sub ![image](https://hackmd.io/_uploads/H1cuy6PAgx.png) Result with three test cases :::spoiler Assembly Code ```asm= .globl bf16_sub bf16_sub: li t0, 0x8000 xor a1, a1, t0 j bf16_add ``` ::: ### Mul ![image](https://hackmd.io/_uploads/rJzi1aw0gx.png) Result with three test cases :::spoiler Assembly Code ```asm= .globl bf16_mul bf16_mul: addi sp, sp, -16 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw ra, 12(sp) srli t0, a0, 15 # a.bits >> 15 andi t0, t0, 1 # t0 = sign_a srli t1, a1, 15 # b.bits >> 15 andi t1, t0, 1 # t1 = sign_b srli t2, a0, 7 # a.bits >> 7 andi t2, t2, 0xFF # t2 = exp_a srli t3, a1, 7 # b.bits >> 7 andi t3, t3, 0xFF # t3 = exp_b andi t4, a0, 0x7F # t4 = mant_a andi t5, a1, 0x7F # t5 = mant_b xor s0, t0, t1 # s0 = result_sign li t6, 0xFF bne t2, t6, check_b_exp bnez t4, return_a bnez t3, result_1 bnez t5, result_1 result_1: slli a0, s0, 15 li t6, 0x7F80 or a0, a0, t6 j quit check_b_exp: li t6, 0xFF bne t3, t6, check_0 bnez t5, return_b bnez t2, result_2 bnez t4, result_2 j return_nan result_2: slli a0, s0, 15 li t6, 0x7F80 or a0, a0, t6 j quit check_0: bnez t2, a_not_zero bnez t4, a_not_zero j return_0 a_not_zero: bnez t3, norm_mant bnez t5, norm_mant return_0: slli a0, s0, 15 j quit norm_mant: li s1, 0 # s1 : exp_adjust = 0 bnez t2, norm_a_else # if(!exp_a) norm_loop_a: andi t6, t4, 0x80 bnez t6, norm_loop_a_done slli t4, t4, 1 addi s1, s1, -1 j norm_loop_a norm_loop_a_done: li t2, 1 j check_exp_b_norm norm_a_else: ori t4, t4, 0x80 check_exp_b_norm: bnez t3, else_norm_b norm_loop_b: andi t6, t5, 0x80 bnez t6, norm_b_done slli t5, t5, 1 addi s1, s1, -1 j norm_loop_b norm_b_done: li t3, 1 j mul_mant else_norm_b: ori t5, t5, 0x80 mul_mant: mul s2, t4, t5 # s2 = result_mant add t6, t2, t3 addi t6, t6, -127 add t6, t6, s1 mv s1, t6 # s1 = result_exp li t6, 0x8000 and t6, s2, t6 beqz t6, mult_else srli s2, s2, 8 andi s2, s2, 0x7F addi s1, s1, 1 j check_exp_overflow mult_else: srli s2, s2, 7 addi s2, s2, 0x7F check_exp_overflow: li t6, 0xFF blt s1, t6, underflow_check slli a0, s0, 15 li t6, 0x7F80 or a0, a0, t6 j quit underflow_check: bgt s1, zero, final li t6, -6 blt s1, t6, return_0_udflow li t6, 1 sub t6, t6, s1 srl s2, s2, t6 li s1, 0 j final return_0_udflow: slli a0, s0, 15 j quit final: slli a0, s0, 15 andi s1, s1, 0xFF slli s1, s1, 7 andi s2, s2, 0x7F or a0, a0, s1 or a0, a0, s2 j quit return_a: mv a0, a0 ret return_b: mv a0, a1 ret return_nan: li a0, 0x7FC0 j quit quit: lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw ra, 12(sp) addi sp, sp, 16 ret ``` ::: ### Div ![image](https://hackmd.io/_uploads/HJGIsTwCeg.png) Result with three test cases :::spoiler Assembly Code ```asm= .globl bf16_div bf16_div: addi sp, sp, -16 sw s0, 0(sp) sw s1, 4(sp) sw s2, 8(sp) sw s3, 12(sp) srli t0, a0, 15 # a.bits >> 15 andi t0, t0, 1 # & 1 # t0 = sign_a srli t1, a1, 15 # b.bits >> 15 andi t1, t1, 1 # & 1 # t1 = sign_b srli t2, a0, 7 # a.bits >> 7 andi t2, t2, 0xFF # & 0xFF # t2 = exp_a srli t3, a1, 7 # b.bits >> 7 andi t3, t3, 0xFF # & 0xFF # t3 = exp_b andi t4, a0, 0x7F # t4 = mant_a andi t5, a1, 0x7F # t5 = mant_b xor s0, t0, t1 # s0 = result_sign = sign_a ^ sign_b li t6, 0xFF # t6 is for temporal usage bne t3, t6, check_zero # branch if(exp_b != 0xFF) beqz t5, check_inf # branch if mant = 0 => return b mv a0, a1 # return b j recover check_inf: li t6, 0xFF bne t2, t6, result_sign_1 bnez t4, result_sign_1 li a0, 0x7FC0 # return Nan j recover result_sign_1: slli a0, s0, 15 j recover check_zero: bnez t3, check_2_inf bnez t5, check_2_inf bnez t2, result_sign_2 bnez t4, result_sign_2 li a0, 0x7FC0 # return Nan j recover result_sign_2: slli a0, s0, 15 li t6, 0x7F80 or a0, a0, t6 j recover check_2_inf: li t6, 0xFF bne t2, t6, check_div_zero beqz t4, result_3 mv a0, a0 # return a j recover result_3: slli a0, s0, 15 li t6, 0x7F80 or a0, a0, t6 j recover check_div_zero: bnez t2, norm bnez t4, norm slli a0, s0, 15 j recover norm: beqz t2, norm_b ori t4, t4, 0x80 norm_b: beqz t3, norm_end ori t5, t5, 0x80 norm_end: slli s1, t4, 15 # s1 : dividend = mant_a << 15 mv s2, t5 # s2 : divisor = mant_b li s3, 0 # s3 : quotient = 0 li t6, 0 # t6 : i=0 div_loop: li a2, 16 # a3 = 16 bge t6, a2, end_div_loop # branch if i >= 16 slli s3, s3, 1 # quotient <<= 1 li a3, 15 sub a3, a3, t6 # a2 = 15 - i sll a4, s2, a3 # a3 = divisor << (15-i) bltu s1, a4, skip_sub # branch if dividend is less than sub s1, s1, a4 # dividend -= (divisor << (15-i)) ori s3, s3, 1 # quotient |= 1 skip_sub: addi t6, t6, 1 # i++ j div_loop end_div_loop: sub a2, t2, t3 # a2 = exp_a - exp_b addi a2, a2, 127 # a2 = result_exp = exp_a - exp_b + BF16_EXP_BIAS bnez t2, res_b # branch if exp_a != 0 addi a2, a2, -1 res_b: bnez t3, q_check # branch if exp_b != 0 addi a2, a2, 1 q_check: li t6, 0x8000 and a4, s3, t6 # quotient & 0x8000 beqz a4, q_else srli s3, s3, 8 j check_overflow q_else: q_loop: li t6, 0x8000 and a4, s3, t6 # quotient & 0x8000 bnez a4, q_loop_done li t6, 1 ble a2, t6, q_loop_done slli s3, s3, 1 addi a2, a2, -1 j q_loop q_loop_done: srli s3, s3, 8 check_overflow: andi s3, s3, 0x7F li t6, 0xFF bge a2, t6, overflow j check_un overflow: slli a0, s0, 15 li t6, 0x7F80 or a0, a0, t6 j recover check_un: bgt a2, x0, final_result slli a0, s0, 15 j recover final_result: slli a0, s0, 15 andi a2, a2, 0xFF slli a2, a2, 7 or a0, a0, a2 andi s3, s3, 0x7F or a0, a0, s3 recover: lw s0, 0(sp) lw s1, 4(sp) lw s2, 8(sp) lw s3, 12(sp) addi sp, sp, 16 ret ``` ::: ### Square Root ![image](https://hackmd.io/_uploads/B1IvaawRgx.png) Result with four test cases :::spoiler Assembly Code ```asm= bf16_sqrt: addi sp, sp, -32 sw ra, 28(sp) sw s0, 24(sp) sw s1, 20(sp) sw s2, 16(sp) sw s3, 12(sp) sw s4, 8(sp) sw s5, 4(sp) sw s6, 0(sp) srli t0, a0, 15 # Shift right 15 andi t0, t0, 1 # Extract sign bit by and operation with 1 srli t1, a0, 7 # Shift right 7 andi t1, t1, 0xFF # Extract exponent by and operation with 0xFF (8 bits) andi t2, a0, 0x7F # And operation with 0x7F to extract mantissa li t3, 0xFF bne t1, t3, check_zero # Branch if exp is not equal to 0xFF bnez t2, return_a # Branch if mantissa is not equal to zero bnez t0, return_nan # Branch if sign bit is not equal to zero j return_a # Return the value of a (input) check_zero: # First special case or t3, t1, t2 # bnez t3, check_negative j return_zero check_negative: # Second special case bnez t0, return_nan bnez t1, compute_sqrt j return_zero compute_sqrt: addi s0, t1, -127 ori s1, t2, 0x80 andi t3, s0, 1 beqz t3, even_exp slli s1, s1, 1 addi t4, s0, -1 srai t4, t4, 1 addi s2, t4, 127 j binary_search even_exp: srai t4, s0, 1 addi s2, t4, 127 binary_search: li s3, 90 li s4, 256 li s5, 128 search_loop: bgt s3, s4, search_done add t3, s3, s4 srli t3, t3, 1 mv a1, t3 mv a2, t3 jal multiply mv t4, a0 srli t4, t4, 7 bgt t4, s1, search_high mv s5, t3 addi s3, t3, 1 j search_loop search_high: addi s4, t3, -1 j search_loop search_done: li t3, 256 blt s5, t3, check_low srli s5, s5, 1 addi s2, s2, 1 j extract_mant check_low: li t3, 128 bge s5, t3, extract_mant norm_loop: li t3, 128 bge s5, t3, extract_mant li t3, 1 ble s2, t3, extract_mant slli s5, s5, 1 addi s2, s2, -1 j norm_loop extract_mant: andi s6, s5, 0x7F li t3, 0xFF bge s2, t3, return_inf blez s2, return_zero andi t3, s2, 0xFF slli t3, t3, 7 or a0, t3, s6 j cleanup return_zero: li a0, 0 j cleanup return_nan: li a0, 0x7FC0 j cleanup return_inf: li a0, 0x7F80 j cleanup return_a: cleanup: lw s6, 0(sp) lw s5, 4(sp) lw s4, 8(sp) lw s3, 12(sp) lw s2, 16(sp) lw s1, 20(sp) lw s0, 24(sp) lw ra, 28(sp) addi sp, sp, 32 ret multiply: li a0, 0 beqz a2, mult_done mult_loop: andi t0, a2, 1 beqz t0, mult_skip add a0, a0, a1 mult_skip: slli a1, a1, 1 srli a2, a2, 1 bnez a2, mult_loop mult_done: ret ``` ::: ## 5-Stage Pipeline ![image](https://hackmd.io/_uploads/rJE_Eu5axl.png) Here's the block diagram of the standard RISC-V pipeline, it divides instruction execution into **5 stages**, allowing multiple instructions to overlap in execution, one instruction in each stage. This mechanism improves throughput by executing several instructions simultaneously. ### Pipeline Principle In this pipeline mechanism, each instruction passes through five stages in order: IF -> ID -> EX -> MEM -> WB Let's go through each stage and see how it works. ### I-type First, I choose the first instruction after calling the test , ```addi sp, sp, -4``` function to discuss. 1. Instruction Fetch (IF) ![image](https://hackmd.io/_uploads/Hyeditqael.png =40%x) * For Program Counter: Currently holds the value **0x00000040**, which points to the address of the current instruction. * The value is then sent to the **Instruction Memory** as the address input. * Instruction memory reads the 32-bit instruction stored at **0x00000040**. And **0xFFC10113** is the fetched instruction corresponding to the instruction ```addi sp, sp, -4```. * Since `addi` is not a branch or jump instruction, the MUX selects the sequential path PC+4. The PC register is updated to 0x00000044 on the next clock cycle. 2. Instruction Decode (ID) ![image](https://hackmd.io/_uploads/ryxbx55peg.png =45%x) * Instruction is decoded to * Opcode : `0x13`or`ADDI` * r1 idx : 0x02 (x2) (sp) * r2 idx is unused in I-type because I-type instruction adds an immediate to source register. * rd, rs1 : x2 * Immediate : 0xfffffffc (-4) * Reg1 = 0x7FFFFFF0 is the value of current stack pointer, and Reg2 is 0x00000000 stands for unused. * The immediate will be sent into ALU with the output of register. * All the data was then sent to ID/EX pipeline register waiting for being used in next stage. 3. Execute (EX) ![image](https://hackmd.io/_uploads/HJRo8ccpgl.png =45%x) * Since it's `ADDI` I-type instruction, we need not access to memory or jump, just receive the value from register file and execute addition. * Input source, using two MUX to select the inputs of ALU as register value and immediate value: * Op1 : 0x7FFFFFF0 (from register x2 (xp)), represents the original stack pointer * Op2 : 0xFFFFFFFC (equals to immediate -4) * ALU executes the addition according to the control signal ALUOp, and the result is sp-4 = 0x7FFFFFEC * And there's a Branch Unit in the figure, but branch taken is set as 0 since this instruction is not a branch instruction, therefore, PC won't change the direction and keep on executing next instruction. * After EX stage is over, ALU result, destination register and control signals are sent into EX/MEM register. 4. Memory Access (MEM) ![image](https://hackmd.io/_uploads/SJGivc5Tll.png =35%x) * Data Memory module determine whether access or not depends on the control signals. * For this `ADDI` instruction case, we won't access to memory since it's a pure arithmetic instruction, so the WrEn is red and the output of Data Memory module is 0 since no memory data was read. * Pipeline register sends the results to the next stage, which meas it's a pass-through stage. 5. Write Back (WB) ![image](https://hackmd.io/_uploads/HyEI_q5Tlg.png =25%x) * In this stage, there's a crucial MUX to decide whether the write back data is from ALU results or the value read by Data Memory. * Since `ADDI` is a arithmetic instruction, the control signal is 0 and MUX chose ALU result as input then write back to register file. * And we can see the WrEn turns into 1 which means it's available to write in, so the value of sp is updated to 0x7FFFFFEC.