Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by `< liangchingyun >` ## [uf8](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B) > [Quiz review](https://hackmd.io/kmvVhoHMSkCnxtcOoRheqQ#2025-Problem-B) Use 8 bits (`uint8_t`) to store 20-bit unsigned integer (0 ~ 1,015,792) * Decoding $$ D(b) = m \cdot 2^e + (2^e - 1) \cdot 16 $$ Where $e = \lfloor b/16 \rfloor$ and $m = b \bmod 16$ * Encoding $$ E(v) = \begin{cases} v, & \text{if } v < 16 \\ 16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise} \end{cases} $$ Where $\text{offset}(e) = (2^e - 1) \cdot 16$ | Exponent | Range | Offset | Step Size | | -------- | ---------------------------- | ------- | --------- | | 0 | $[0, 15]$ | 0 | 1 | | 1 | $[16, 46]$ | 16 | 2 | | 2 | $[48, 108]$ | 48 | 4 | | 3 | $[112, 232]$ | 112 | 8 | | ... | $...$ | ... | $2^e$ | | 5 | $[524{,}272, 1{,}015{,}792]$ | 524,272 | 32,768 | ### `uf8_decode` #### C code ```c /* Decode uf8 to uint32_t */ uint32_t uf8_decode(uf8 fl) { uint32_t mantissa = fl & 0x0f; // low 4 bits uint8_t exponent = fl >> 4; // high 4 bits uint32_t offset = (0x7FFF >> (15 - exponent)) << 4; // (2^e - 1)·16 return (mantissa << exponent) + offset; } ``` #### Assembly code > Show iterative refinement (reducing code size and runtime overhead) ```riscv uf8_decode: addi sp, sp, -16 # allocate stack space for t0, t1, t2, ra sw ra, 12(sp) # save return address srai t0, a0, 4 # extract exponent t0 = a0 >> 4 andi t1, a0, 0x0F # extract mantissa t1 = a0 & 0x0F # calculate offset = (2^e -1) << 4 li t2, 1 sll t2, t2, t0 # t2 = 2^e addi t2, t2, -1 # t2 = 2^e -1 slli t2, t2, 4 # offset = (2^e-1)*16 # calculate mantissa shifted: t1 << e sll t1, t1, t0 # final value = mantissa<<e + offset add a0, t1, t2 lw ra, 12(sp) addi sp, sp, 16 ret ``` #### Use case > Three test data cases > Including internal validation > Compare against compiler output for verification and discussion ::: spoiler Test code ```riscv= .data str: .string "\n The decoded value is " .text # t0 = exponent # t1 = mantissa # t2 = offset main: # Test data 1 la a0, str li a7, 4 ecall li a0, 0x2F jal ra, uf8_decode li a7, 1 ecall # Test data 2 la a0, str li a7, 4 ecall li a0, 0x5A jal ra, uf8_decode li a7, 1 ecall # Test data 3 la a0, str li a7, 4 ecall li a0, 0xF0 jal ra, uf8_decode li a7, 1 ecall # End li a7, 10 ecall ``` ::: Input: * Test data 1: `0x2F` * Test data 2: `0x5A` * Test data 3: `0xF0` Output: ``` The decoded value is 108 The decoded value is 816 The decoded value is 524272 ``` Compiler output: ``` The decoded value of 0x2F is 108 The decoded value of 0x5A is 816 The decoded value of 0xF0 is 524272 ``` #### Analysis ##### Pseudo instruction ``` 00000000 <main>: 0: 10000517 auipc x10 0x10000 4: 00050513 addi x10 x10 0 8: 00400893 addi x17 x0 4 c: 00000073 ecall 10: 02f00513 addi x10 x0 47 14: 054000ef jal x1 84 <uf8_decode> 18: 00100893 addi x17 x0 1 1c: 00000073 ecall 20: 10000517 auipc x10 0x10000 24: fe050513 addi x10 x10 -32 28: 00400893 addi x17 x0 4 2c: 00000073 ecall 30: 05a00513 addi x10 x0 90 34: 034000ef jal x1 52 <uf8_decode> 38: 00100893 addi x17 x0 1 3c: 00000073 ecall 40: 10000517 auipc x10 0x10000 44: fc050513 addi x10 x10 -64 48: 00400893 addi x17 x0 4 4c: 00000073 ecall 50: 0f000513 addi x10 x0 240 54: 014000ef jal x1 20 <uf8_decode> 58: 00100893 addi x17 x0 1 5c: 00000073 ecall 60: 00a00893 addi x17 x0 10 64: 00000073 ecall 00000068 <uf8_decode>: 68: ff010113 addi x2 x2 -16 6c: 00112623 sw x1 12 x2 70: 40455293 srai x5 x10 4 74: 00f57313 andi x6 x10 15 78: 00100393 addi x7 x0 1 7c: 005393b3 sll x7 x7 x5 80: fff38393 addi x7 x7 -1 84: 00439393 slli x7 x7 4 88: 00531333 sll x6 x6 x5 8c: 00730533 add x10 x6 x7 90: 00c12083 lw x1 12 x2 94: 01010113 addi x2 x2 16 98: 00008067 jalr x0 x1 0 ``` ##### 5-stage pipelined processor > Demonstrate each stage: IF, ID, IE, MEM, WB. Explain memory update steps and their correctness. ![image](https://hackmd.io/_uploads/HycooFQTxg.png) ### `uf8_encode` #### C code ```c /* Encode uint32_t to uf8 */ uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) // If less than 16, return directly → no compression needed. return value; /* Find appropriate exponent using CLZ hint */ int lz = clz(value); // count leading zeros int msb = 31 - lz; /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; // Initial estimate if (exponent > 15) exponent = 15; /* Calculate overflow for estimated exponent */ // Interval start value for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16; // overflow = offset(e) = (2^e−1)⋅16 /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { // value < overflow → exponent guessed too large. overflow = (overflow - 16) >> 1; exponent--; } } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) // value falls within the current exponent’s interval → no need to increase the exponent. break; overflow = next_overflow; exponent++; // exponent too small. } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` #### Assembly code > Show iterative refinement (reducing code size and runtime overhead) ```riscv uf8_encode: addi sp, sp, -16 sw ra, 12(sp) # check if value < 16 li t0, 16 blt a0, t0, encode_return_value # compute MSB via clz loop li t1, 31 # bit position mv t2, a0 # copy value clz_loop: srli t3, t2, 31 bnez t3, clz_done addi t1, t1, -1 slli t2, t2, 1 j clz_loop clz_done: # t1 = msb li t2, 5 blt t1, t2, find_exact_exponent_loop addi t0, t1, -4 # t0 = exponent li t2, 15 blt t2, t0, estimate_exponent li t0, 15 estimate_exponent: li t2, 0 # t2 = overflow li t3, 0 # t3 = e calculate_overflow_loop: bge t3, t0, adjust_loop slli t2, t2, 1 # overflow <<= 1 addi t2, t2, 16 # overflow += 16 addi t3, t3, 1 # e++ j calculate_overflow_loop adjust_loop: blez t0, find_exact_exponent_loop # if exponent <= 0, exit bge a0, t2, find_exact_exponent_loop # if value >= overflow, exit addi t2, t2, -16 # overflow -= 16 srli t2, t2, 1 # overflow >>= 1 addi t0, t0, -1 # exponent-- j adjust_loop find_exact_exponent_loop: li t4, 15 bge t0, t4, refine_exp_done # if exponent >= 15, exit loop slli t3, t2, 1 # t3 = next_overflow = overflow << 1 addi t3, t3, 16 # next_overflow += 16 blt a0, t3, refine_exp_done # if value < next_overflow, break mv t2, t3 # overflow = next_overflow addi t0, t0, 1 # exponent++ j find_exact_exponent_loop refine_exp_done: sub t4, a0, t2 # t4 = mantissa = value - overflow srl t4, t4, t0 # mantissa >>= exponent slli t0, t0, 4 # exponent << 4 or a0, t0, t4 # a0 = (exponent << 4) | mantissa encode_return_value: lw ra, 12(sp) addi sp, sp, 16 ret ``` #### Use case > Three test data cases > Including internal validation > Compare against compiler output for verification and discussion ::: spoiler Test code ```riscv= .data str: .string "\n The encoded value is " .text main: # Test data 1 la a0, str li a7, 4 ecall li a0, 108 jal ra, uf8_encode li a7, 1 ecall # Test data 2 la a0, str li a7, 4 ecall li a0, 816 jal ra, uf8_encode li a7, 1 ecall # Test data 3 la a0, str li a7, 4 ecall li a0, 524272 jal ra, uf8_encode li a7, 1 ecall # End li a7, 10 ecall ``` ::: Input: * Test data 1: `108` * Test data 2: `816` * Test data 3: `524272` Output: ``` The encoded value is 47 The encoded value is 90 The encoded value is 240 ``` Compiler output: ``` The encoded value of 108 is 47 The encoded value of 816 is 90 The encoded value of 524272 is 240 ``` #### Analysis ##### Pseudo instruction ``` 00000000 <main>: 0: 10000517 auipc x10 0x10000 4: 00050513 addi x10 x10 0 8: 00400893 addi x17 x0 4 c: 00000073 ecall 10: 06c00513 addi x10 x0 108 14: 058000ef jal x1 88 <uf8_encode> 18: 00100893 addi x17 x0 1 1c: 00000073 ecall 20: 10000517 auipc x10 0x10000 24: fe050513 addi x10 x10 -32 28: 00400893 addi x17 x0 4 2c: 00000073 ecall 30: 33000513 addi x10 x0 816 34: 038000ef jal x1 56 <uf8_encode> 38: 00100893 addi x17 x0 1 3c: 00000073 ecall 40: 10000517 auipc x10 0x10000 44: fc050513 addi x10 x10 -64 48: 00400893 addi x17 x0 4 4c: 00000073 ecall 50: 00080537 lui x10 0x80 54: ff050513 addi x10 x10 -16 58: 014000ef jal x1 20 <uf8_encode> 5c: 00100893 addi x17 x0 1 60: 00000073 ecall 64: 00a00893 addi x17 x0 10 68: 00000073 ecall 0000006c <uf8_encode>: 6c: ff010113 addi x2 x2 -16 70: 00112623 sw x1 12 x2 74: 01000293 addi x5 x0 16 78: 08554e63 blt x10 x5 156 <encode_return_value> 7c: 01f00313 addi x6 x0 31 80: 00050393 addi x7 x10 0 00000084 <clz_loop>: 84: 01f3de13 srli x28 x7 31 88: 000e1863 bne x28 x0 16 <clz_done> 8c: fff30313 addi x6 x6 -1 90: 00139393 slli x7 x7 1 94: ff1ff06f jal x0 -16 <clz_loop> 00000098 <clz_done>: 98: 00500393 addi x7 x0 5 9c: 04734463 blt x6 x7 72 <find_exact_exponent_loop> a0: ffc30293 addi x5 x6 -4 a4: 00f00393 addi x7 x0 15 a8: 0053c463 blt x7 x5 8 <estimate_exponent> ac: 00f00293 addi x5 x0 15 000000b0 <estimate_exponent>: b0: 00000393 addi x7 x0 0 b4: 00000e13 addi x28 x0 0 000000b8 <calculate_overflow_loop>: b8: 005e5a63 bge x28 x5 20 <adjust_loop> bc: 00139393 slli x7 x7 1 c0: 01038393 addi x7 x7 16 c4: 001e0e13 addi x28 x28 1 c8: ff1ff06f jal x0 -16 <calculate_overflow_loop> 000000cc <adjust_loop>: cc: 00505c63 bge x0 x5 24 <find_exact_exponent_loop> d0: 00755a63 bge x10 x7 20 <find_exact_exponent_loop> d4: ff038393 addi x7 x7 -16 d8: 0013d393 srli x7 x7 1 dc: fff28293 addi x5 x5 -1 e0: fedff06f jal x0 -20 <adjust_loop> 000000e4 <find_exact_exponent_loop>: e4: 00f00e93 addi x29 x0 15 e8: 01d2de63 bge x5 x29 28 <refine_exp_done> ec: 00139e13 slli x28 x7 1 f0: 010e0e13 addi x28 x28 16 f4: 01c54863 blt x10 x28 16 <refine_exp_done> f8: 000e0393 addi x7 x28 0 fc: 00128293 addi x5 x5 1 100: fe5ff06f jal x0 -28 <find_exact_exponent_loop> 00000104 <refine_exp_done>: 104: 40750eb3 sub x29 x10 x7 108: 005edeb3 srl x29 x29 x5 10c: 00429293 slli x5 x5 4 110: 01d2e533 or x10 x5 x29 00000114 <encode_return_value>: 114: 00c12083 lw x1 12 x2 118: 01010113 addi x2 x2 16 11c: 00008067 jalr x0 x1 0 ``` ##### 5-stage pipelined processor ![image](https://hackmd.io/_uploads/rJ0lFh76ll.png) ## [bfloat16](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C) The bfloat16 format (16-bit, from Google Brain) preserves float32’s dynamic range by keeping the same 8-bit exponent, but reduces precision to a 7-bit significand (vs. 23). Value: $$ v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right) $$ where * $S \in \{0, 1\}$ is the sign bit * $E \in [1, 254]$ is the biased exponent * $M \in [0, 127]$ is the mantissa value Special Cases: * Zero: $E = 0, M = 0 \Rightarrow v = (-1)^S \times 0$ * Infinity: $E = 255, M = 0 \Rightarrow v = (-1)^S \times \infty$ * NaN: $E = 255, M \neq 0 \Rightarrow v = \text{NaN}$ * Denormals: Not supported (flush to zero) ### Implementation ```c #include <stdbool.h> #include <stdint.h> #include <string.h> typedef struct { uint16_t bits; } bf16_t; #define BF16_SIGN_MASK 0x8000U // bit15 #define BF16_EXP_MASK 0x7F80U // bits 14..7 = exponent mask #define BF16_MANT_MASK 0x007FU // bits 6..0 = mantissa mask #define BF16_EXP_BIAS 127 #define BF16_NAN() ((bf16_t) {.bits = 0x7FC0}) // canonical qNaN (exp=0xFF, mant=0x40) #define BF16_ZERO() ((bf16_t) {.bits = 0x0000}) // +0 ``` #### Determination ##### C code ```c /* Check for NaN: exponent all 1s and mantissa ≠ 0 */ static inline bool bf16_isnan(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && (a.bits & BF16_MANT_MASK); } /* Check for Inf: exponent all 1s and mantissa == 0 */ static inline bool bf16_isinf(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && !(a.bits & BF16_MANT_MASK); } /* Check for zero: mask out the sign bit using a.bits & 0x7FFF */ static inline bool bf16_iszero(bf16_t a) { return !(a.bits & 0x7FFF); } ``` ##### Assembly code > Show iterative refinement (reducing code size and runtime overhead) > Compare against compiler output for verification and discussion ```riscv main: li t3, 0x7F80 # t3 = BF16_EXP_MASK li t4, 0x007F # t4 = BF16_MANT_MASK li t6, 0x7FFF bf16_isnan: and t5, t3, a0 bne t5, t3, return_0 and t5, t4, a0 snez a0, t5 # a0 = 1 if mantissa != 0 ret bf16_isinf: and t5, t3, a0 bne t5, t3, return_0 and t5, t4, a0 seqz a0, t5 # a0 = 1 if mantissa == 0 ret bf16_iszero: and t5, t6, a0 seqz a0, t5 ret ``` ##### Use case > Three test data cases > Including internal validation ::: spoiler Test code ```riscv= .data bf16_list: .half 0x7FC1, 0x7F80, 0x0000, 0x8000 # NaN, +Inf, 0, -0 str1: .string "\n NAN : " str2: .string "\n Inf : " str3: .string "\n Zero : " .text main: isnan_start: la t0, bf16_list li t1, 0 li t2, 4 la a0, str1 li a7, 4 ecall isnan_loop: beq t1, t2, isinf_start lh a0, 0(t0) jal ra, bf16_isnan li a7, 1 ecall addi t0, t0, 2 # Next half (16 bits) addi t1, t1, 1 # i++ j isnan_loop isinf_start: la t0, bf16_list li t1, 0 li t2, 4 la a0, str2 li a7, 4 ecall isinf_loop: beq t1, t2, iszero_start lh a0, 0(t0) jal ra, bf16_isinf li a7, 1 ecall addi t0, t0, 2 # Next half (16 bits) addi t1, t1, 1 # i++ j isinf_loop iszero_start: la t0, bf16_list li t1, 0 li t2, 4 la a0, str3 li a7, 4 ecall iszero_loop: beq t1, t2, end lh a0, 0(t0) jal ra, bf16_iszero li a7, 1 ecall addi t0, t0, 2 # Next half (16 bits) addi t1, t1, 1 # i++ j iszero_loop return_0: li a0, 0 ret end: li a7, 10 ecall ``` ::: Input: `0x7FC1, 0x7F80, 0x0000, 0x8000` (NaN, +Inf, 0, -0) Output: ``` NAN : 1000 Inf : 0100 Zero : 0011 ``` The results match the expected output; therefore, the comparison with the compiler output is omitted. #### Format convertion ##### C code ```c static inline bf16_t f32_to_bf16(float val) { uint32_t f32bits; memcpy(&f32bits, &val, sizeof(float)); if (((f32bits >> 23) & 0xFF) == 0xFF) // exp all-ones return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF}; /* round-to-nearest-even */ f32bits += ((f32bits >> 16) & 1) + 0x7FFF; // 0x7FFF = 2^15 - 1 return (bf16_t) {.bits = f32bits >> 16}; } ``` ```c static inline float bf16_to_f32(bf16_t val) { /* Convert bf16_t to float32 by placing bf16 in the high 16 bits */ uint32_t f32bits = ((uint32_t) val.bits) << 16; float result; memcpy(&result, &f32bits, sizeof(float)); return result; } ``` ##### Assembly code > Show iterative refinement (reducing code size and runtime overhead) > Compare against compiler output for verification and discussion ```riscv f32_to_bf16: # check exp == 0xFF srli t0, a0, 23 andi t0, t0, 0xFF li t1, 0xFF beq t0, t1, f32_to_bf16_done # round-to-nearest-even srli t2, a0, 16 andi t2, t2, 1 li t3, 0x7FFF add a0, a0, t2 add a0, a0, t3 f32_to_bf16_done: srli a0, a0, 16 ret bf16_to_f32: slli a0, a0, 16 ret ``` ##### Use case > Three test data cases > Including internal validation ::: spoiler Test code ```riscv= .data str1: .string "\n The bf16 of 0x3FC00000 is " str2: .string "\n The bf16 of 0xC02C0000 is " str3: .string "\n The bf16 of 0x477FE000 is " str4: .string "\n The f32 of 0x3FC0 is " str5: .string "\n The f32 of 0xC02C is " str6: .string "\n The f32 of 0x4780 is " f32_input1: .word 0x3FC00000 f32_input2: .word 0xC02C0000 f32_input3: .word 0x477FE000 bf16_input1: .word 0x3FC0 bf16_input2: .word 0xC02C bf16_input3: .word 0x4780 .text main: # Test f32_to_bf16 # data 1 la a0, str1 li a7, 4 ecall lw a0, f32_input1 jal ra, f32_to_bf16 li a7, 34 ecall # data 2 la a0, str2 li a7, 4 ecall lw a0, f32_input2 jal ra, f32_to_bf16 li a7, 34 ecall # data 3 la a0, str3 li a7, 4 ecall lw a0, f32_input3 jal ra, f32_to_bf16 li a7, 34 ecall # Test bf16_to_f32 # data 1 la a0, str4 li a7, 4 ecall lw a0, bf16_input1 jal ra, bf16_to_f32 li a7, 34 ecall # data 2 la a0, str5 li a7, 4 ecall lw a0, bf16_input2 jal ra, bf16_to_f32 li a7, 34 ecall # data 3 la a0, str6 li a7, 4 ecall lw a0, bf16_input3 jal ra, bf16_to_f32 li a7, 34 ecall # End li a7, 10 ecall ``` ::: Input: * Test data for f32_to_bf16: `[0x3FC00000, 0xC02C0000, 0x477FE000]` * Test data for bf16_to_f32: `[0x3FC0, 0xC030, 0x4780]` Output: ``` The bf16 of 0x3FC00000 is 0x3fc0 The bf16 of 0xC02C0000 is 0xc02c The bf16 of 0x477FE000 is 0x4780 The f32 of 0x3FC0 is 0x3fc00000 The f32 of 0xC02C is 0xc02c0000 The f32 of 0x4780 is 0x47800000 ``` Compiler output: ``` The bf16 of 0x3fc00000 is 0x3fc0 The bf16 of 0xc02c0000 is 0xc02c The bf16 of 0x477fe000 is 0x4780 The f32 of 0x3fc0 is 0x3fc00000 The f32 of 0xc030 is 0xc0300000 The f32 of 0x4780 is 0x47800000 ``` #### Calculation ##### C code ```c static inline bf16_t bf16_add(bf16_t a, bf16_t b) { uint16_t sign_a = (a.bits >> 15) & 1; uint16_t sign_b = (b.bits >> 15) & 1; int16_t exp_a = ((a.bits >> 7) & 0xFF); int16_t exp_b = ((b.bits >> 7) & 0xFF); uint16_t mant_a = a.bits & 0x7F; uint16_t mant_b = b.bits & 0x7F; if (exp_a == 0xFF) { if (mant_a) return a; if (exp_b == 0xFF) return (mant_b || sign_a == sign_b) ? b : BF16_NAN(); return a; } if (exp_b == 0xFF) return b; if (!exp_a && !mant_a) // a == 0 -> return b return b; if (!exp_b && !mant_b) // b == 0 -> return a return a; /* restore implicit 1 */ if (exp_a) mant_a |= 0x80; if (exp_b) mant_b |= 0x80; int16_t exp_diff = exp_a - exp_b; uint16_t result_sign; int16_t result_exp; uint32_t result_mant; if (exp_diff > 0) { result_exp = exp_a; if (exp_diff > 8) // too far away -> b negligible return a; mant_b >>= exp_diff; // align mantissa } else if (exp_diff < 0) { result_exp = exp_b; if (exp_diff < -8) // too far away -> a negligible return b; mant_a >>= -exp_diff; } else { result_exp = exp_a; } if (sign_a == sign_b) { result_sign = sign_a; result_mant = (uint32_t) mant_a + mant_b; if (result_mant & 0x100) { // carry beyond 8 bits result_mant >>= 1; /* overflow -> Inf */ if (++result_exp >= 0xFF) return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; } } else { // different signs -> Subtract if (mant_a >= mant_b) { result_sign = sign_a; result_mant = mant_a - mant_b; } else { result_sign = sign_b; result_mant = mant_b - mant_a; } if (!result_mant) return BF16_ZERO(); /* normalize */ while (!(result_mant & 0x80)) { result_mant <<= 1; if (--result_exp <= 0) return BF16_ZERO(); } } return (bf16_t) { .bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) | (result_mant & 0x7F), }; } ``` ```c static inline bf16_t bf16_sub(bf16_t a, bf16_t b) { b.bits ^= BF16_SIGN_MASK; // XOR return bf16_add(a, b); } ``` ```c static inline bf16_t bf16_mul(bf16_t a, bf16_t b) { uint16_t sign_a = (a.bits >> 15) & 1; uint16_t sign_b = (b.bits >> 15) & 1; int16_t exp_a = ((a.bits >> 7) & 0xFF); int16_t exp_b = ((b.bits >> 7) & 0xFF); uint16_t mant_a = a.bits & 0x7F; uint16_t mant_b = b.bits & 0x7F; uint16_t result_sign = sign_a ^ sign_b; // XOR: same sign → positive, different sign → negative if (exp_a == 0xFF) { if (mant_a) return a; // NaN if (!exp_b && !mant_b) return BF16_NAN(); // Inf * 0 = NaN return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; // Inf } if (exp_b == 0xFF) { if (mant_b) return b; if (!exp_a && !mant_a) return BF16_NAN(); return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; } /* If either number is zero */ if ((!exp_a && !mant_a) || (!exp_b && !mant_b)) return (bf16_t) {.bits = result_sign << 15}; /* subnormal number (exponent = 0，mantissa ≠ 0) → normalize */ int16_t exp_adjust = 0; if (!exp_a) { while (!(mant_a & 0x80)) { mant_a <<= 1; exp_adjust--; } exp_a = 1; } else mant_a |= 0x80; if (!exp_b) { while (!(mant_b & 0x80)) { mant_b <<= 1; exp_adjust--; } exp_b = 1; } else mant_b |= 0x80; uint32_t result_mant = (uint32_t) mant_a * mant_b; int32_t result_exp = (int32_t) exp_a + exp_b - BF16_EXP_BIAS + exp_adjust; /* Normalization */ if (result_mant & 0x8000) { result_mant = (result_mant >> 8) & 0x7F; result_exp++; } else result_mant = (result_mant >> 7) & 0x7F; /* overflow and underflow */ if (result_exp >= 0xFF) return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; if (result_exp <= 0) { if (result_exp < -6) return (bf16_t) {.bits = result_sign << 15}; result_mant >>= (1 - result_exp); result_exp = 0; } return (bf16_t) {.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) | (result_mant & 0x7F)}; } ``` ```c static inline bf16_t bf16_div(bf16_t a, bf16_t b) { uint16_t sign_a = (a.bits >> 15) & 1; uint16_t sign_b = (b.bits >> 15) & 1; int16_t exp_a = ((a.bits >> 7) & 0xFF); int16_t exp_b = ((b.bits >> 7) & 0xFF); uint16_t mant_a = a.bits & 0x7F; uint16_t mant_b = b.bits & 0x7F; uint16_t result_sign = sign_a ^ sign_b; if (exp_b == 0xFF) { if (mant_b) return b; // NaN /* Inf/Inf = NaN */ if (exp_a == 0xFF && !mant_a) return BF16_NAN(); // Inf / Inf = NaN return (bf16_t) {.bits = result_sign << 15}; // 有限數 / Inf = 0 } /* If denominator is 0 */ if (!exp_b && !mant_b) { if (!exp_a && !mant_a) return BF16_NAN(); // 0/0 = NaN return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; // x/0 = ±Inf } /* If numerator is Inf */ if (exp_a == 0xFF) { if (mant_a) return a; // NaN return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; // ±Inf / Finite number = ±Inf } /* If numerator is 0 */ if (!exp_a && !mant_a) return (bf16_t) {.bits = result_sign << 15}; // 0/x = 0 if (exp_a) mant_a |= 0x80; if (exp_b) mant_b |= 0x80; uint32_t dividend = (uint32_t) mant_a << 15; // dividend uint32_t divisor = mant_b; // divisor uint32_t quotient = 0; // quotient /* Loop 16 times to build the quotient bit by bit */ for (int i = 0; i < 16; i++) { // Shift divisor to align with dividend; subtract if possible and set quotient bit. quotient <<= 1; if (dividend >= (divisor << (15 - i))) { dividend -= (divisor << (15 - i)); quotient |= 1; } } int32_t result_exp = (int32_t) exp_a - exp_b + BF16_EXP_BIAS; if (!exp_a) result_exp--; if (!exp_b) result_exp++; /* Normalization */ if (quotient & 0x8000) quotient >>= 8; else { while (!(quotient & 0x8000) && result_exp > 1) { quotient <<= 1; result_exp--; } quotient >>= 8; } quotient &= 0x7F; /* overflow and underflow */ if (result_exp >= 0xFF) return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; if (result_exp <= 0) return (bf16_t) {.bits = result_sign << 15}; return (bf16_t) {.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) | (quotient & 0x7F)}; } ``` ##### Assembly code > Show iterative refinement (reducing code size and runtime overhead) > Compare against compiler output for verification and discussion ```riscv bf16_add: addi sp, sp, -16 sw ra, 12(sp) # --- extract sign, exp, mantissa --- srli a2, a0, 15 # a2 = sign_a srli a3, a1, 15 # a3 = sign_b srli a4, a0, 7 andi a4, a4, 0xFF # a4 = exp_a srli a5, a1, 7 andi a5, a5, 0xFF # a5 = exp_b andi a6, a0, 0x7F # a6 = mant_a andi a7, a1, 0x7F # a7 = mant_b # --- handle special cases: NaN, Inf, zero --- li t3, 0xFF bne a4, t3, finish_first_case bnez a6, return_a bne a5, t3, return_a # (mant_b || sign_a == sign_b) ? b : NaN bnez a7, return_b beq a2, a3, return_b li a0, 0x7FC0 ret finish_first_case: beq a5, t3, return_b beqz a4, return_b beqz a5, return_a # --- restore implicit 1 for normalized numbers --- bf16_add_restore_a: beqz a4, bf16_add_restore_b ori a6, a6, 0x80 # mant_a |= 0x80 bf16_add_restore_b: beqz a5, exp_diff_start ori a7, a7, 0x80 # mant_b |= 0x80 exp_diff_start: sub t3, a4, a5 # t3 = exp_diff li t4, 8 # t4 = 8 blez t3, exp_diff_le0 mv t5, a4 # t5 = result_exp bgt t3, t4, return_a srl a7, a7, t3 j sign_start exp_diff_le0: bgez t3, exp_diff_eq0 mv t5, a5 # t5 = result_exp neg t3, t3 # t3 = -exp_diff bgt t3, t4, return_b srl a6, a6, t3 j sign_start exp_diff_eq0: mv t5, a4 # t5 = result_exp sign_start: bne a2, a3, diff_sign_case same_sign_case: mv t3, a2 # t3 = result_sign add t4, a6, a7 # t4 = result_mant andi t6, t4, 0x100 # t6 = result_mant & 0x100 beqz t6, bf16_add_end srli t4, t4, 1 addi t5, t5, 1 addi t6, t5, -0xFF # t6 = ++result_exp - 0xFF bgez t6, overflow_to_inf j bf16_add_end overflow_to_inf: slli a0, t3, 15 li t6, 0x7F80 or a0, a0, t6 j bf16_add_end diff_sign_case: blt a6, a7, mant_a_smaller mv t3, a2 # t3 = result_sign sub t4, a6, a7 # t4 = result_mant j check_zero mant_a_smaller: mv t3, a3 # t3 = result_sign sub t4, a7, a6 # t4 = result_mant check_zero: beqz t4, return_zero normalize_loop: andi t6, t4, 0x80 # t6 = result_mant & 0x80 bnez t6, bf16_add_end slli t4, t4, 1 addi t5, t5, -1 blez t5 return_zero j normalize_loop bf16_add_end: slli a0, t3, 15 # a0 = result_sign << 15 andi t6, t5, 0xFF # t1 = result_exp & 0xFF slli t6, t6, 7 # t6 << 7 or a0, a0, t6 # a0 |= (result_exp & 0xFF) << 7 andi t6, t4, 0x7F # t6 = result_mant & 0x7F or a0, a0, t6 # a0 |= result_mant & 0x7F ret return_a: mv a0, a0 ret return_b: mv a0, a1 ret return_zero: li a0, 0 ret ``` ```riscv bf16_sub: addi sp, sp, -16 sw ra, 12(sp) li t3, 0x8000 xor a1, a1, t3 jal bf16_add lw ra, 12(sp) addi sp, sp, 16 ret ``` ##### Use case > Three test data cases > Including internal validation :::spoiler Test code ```riscv= .data bf16_vals: .half 0x3FC0 .half 0xC030 .half 0x4780 .half 0x3FC0 str1: .string "\nbf16_add result: " str2: .string "\n\nbf16_sub result: " str: .string " \n" .text main: add_start: la t0, bf16_vals # t0 = pointer to list li t1, 0 # loop counter li t2, 2 la a0, str1 li a7, 4 ecall add_loop: bgt t1, t2, sub_start la a0, str li a7, 4 ecall lhu a0, 0(t0) # a = bf16_vals[i] lhu a1, 2(t0) # b = bf16_vals[i+1] jal ra, bf16_add li a7, 34 ecall addi t0, t0, 2 addi t1, t1, 1 j add_loop sub_start: la t0, bf16_vals # t0 = pointer to list li t1, 0 # loop counter li t2, 2 la a0, str2 li a7, 4 ecall sub_loop: bgt t1, t2, end la a0, str li a7, 4 ecall lhu a0, 0(t0) # a = bf16_vals[i] lhu a1, 2(t0) # b = bf16_vals[i+1] jal bf16_sub li a7, 34 ecall addi t0, t0, 2 addi t1, t1, 1 j sub_loop end: li a7, 10 ecall ``` ::: Input: `[0x3FC0, 0xC030, 0x4780]`` Output: ``` bf16_add result: 0xbfa0 0x4780 0x4780 bf16_sub result: 0x4088 0xc780 0x4780 ``` Compiler output: ``` bf16_add result: 0xBFA0 0x4780 0x4780 bf16_sub result: 0x4088 0xC780 0x4780 ``` ### Analysis #### 5-stage pipelined processor > Demonstrate each stage: IF, ID, IE, MEM, WB. Explain memory update steps and their correctness.