# Assignment 1: RISC-V Assembly and Instruction Pipeline
## Problem B
### Introduction
#### About UF8
UF8 is an 8-bit **logarithmic encoding format** that approximately represents a 20-bit non-negative integer (ranging from 0 to 1,015,792). Each UF8 byte is split into two fields: the **upper 4 bits** hold the **exponent `e`**, and the **lower 4 bits** hold the **mantissa `m`**, as shown in the figure below:

Each exponent `e` corresponds to a **range**, so there are 16 ranges in total. Each range has its corresponding **offset(e)** as the starting value.
Within the same range, the value spacing (i.e., the difference between two adjacent representable values) is fixed, and the step size is $2^e$.
#### UF8 Decoding
To decode an 8-bit UF8 value `b` into the 32-bit integer it represents, the following formula can be used:
$$
e = \lfloor b / 16 \rfloor, \quad m = b \bmod 16
$$
$$
D(b) = m \cdot 2^e + (2^e - 1) \cdot 16
$$
Here, the term $(2^e - 1) \cdot 16$ represents the corresponding **offset(e)**.
You can better understand the UF8 structure by observing the **starting value** and **step size** in the following examples:
- **e = 0:** offset = 0, representable range = 0, 1, 2, …, 15 → step = 1
- **e = 1:** offset = 16, representable range = 16, 18, 20, …, 46 → step = 2
- **e = 2:** offset = 48, representable range = 48, 52, 56, …, 108 → step = 4
Additionally, when the mantissa $m = 15$, moving to the beginning of the next range (i.e., $m = 0$ with exponent $e + 1$) results in a value difference exactly equal to $2^e$.
Thus, the overall value distribution forms a smooth staircase-like progression, ensuring there are no discontinuities or gaps between adjacent ranges.
---
### Related work
In this part, we need to translate the original C code provided in **Problem B** into a **RISC-V assembly program**, using only **RV32I instructions** (without the M extension or the F/D floating-point extensions), and ensure that it runs correctly on the **Ripes simulator**.
The original C code for Problem B is as follows:
```C=
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
typedef uint8_t uf8;
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
uint32_t mantissa = fl & 0x0f;
uint8_t exponent = fl >> 4;
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
return (mantissa << exponent) + offset;
}
/* Encode uint32_t to uf8 */
uf8 uf8_encode(uint32_t value)
{
/* Use CLZ for fast exponent calculation */
if (value < 16)
return value;
/* Find appropriate exponent using CLZ hint */
int lz = clz(value);
int msb = 31 - lz;
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
/* Estimate exponent - the formula is empirical */
exponent = msb - 4;
if (exponent > 15)
exponent = 15;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
/* Adjust if estimate was off */
while (exponent > 0 && value < overflow) {
overflow = (overflow - 16) >> 1;
exponent--;
}
}
/* Find exact exponent */
while (exponent < 15) {
uint32_t next_overflow = (overflow << 1) + 16;
if (value < next_overflow)
break;
overflow = next_overflow;
exponent++;
}
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
}
/* Test encode/decode round-trip */
static bool test(void)
{
int32_t previous_value = -1;
bool passed = true;
for (int i = 0; i < 256; i++) {
uint8_t fl = i;
int32_t value = uf8_decode(fl);
uint8_t fl2 = uf8_encode(value);
if (fl != fl2) {
printf("%02x: produces value %d but encodes back to %02x\n", fl,
value, fl2);
passed = false;
}
if (value <= previous_value) {
printf("%02x: value %d <= previous_value %d\n", fl, value,
previous_value);
passed = false;
}
previous_value = value;
}
return passed;
}
int main(void)
{
if (test()) {
printf("All tests passed.\n");
return 0;
}
return 1;
}
```
#### 1. uf8_decode
The corresponding assembly code is as follows:
```C=
UF8_decode:
andi t1, a0, 0x0F # m = fl & 0x0F
srli t2, a0, 4 # e = fl >> 4
li t3, 1
sll t3, t3, t2 # 1 << e
addi t3, t3, -1 # (1 << e) - 1
slli t3, t3, 4 # offset = ((1 << e)-1) << 4
sll t1, t1, t2 # m << e
add a0, t1, t3 # return (m << e) + offset
jr ra
```
#### 2. clz
**clz** is a function used by uf8_encode. Its purpose is to return **the number of leading zeros** in a 32-bit unsigned integer, starting from the most significant bit (MSB).
If we directly translate the original C code of **clz** into assembly, the corresponding RISC-V assembly code would be as follows:
```C=
CLZ:
li t1, 32 # n
li t2, 16 # c
clz_For:
srl t3, a0, t2
beqz t3, clz_Skip
sub t1, t1, t2
mv a0, t3
clz_Skip:
srli t2, t2, 1
bnez t2, clz_For
clz_End:
sub a0, t1, a0
jr ra
```
Using a general-case input `a0 = 0x28ad94ca` for testing, the corresponding **Execution Information** is shown below:

Next, we optimize the above directly translated assembly code of **clz**, as shown below:
```C=
CLZ:
li t1, 32 # n = 32
clz_For1:
srli t2, a0, 16 # y = x >> 16, c = 16
beqz t2, clz_For2 # if (y == 0) skip
addi t1, t1, -16 # n -= 16
mv a0, t2 # x = y
clz_For2:
srli t2, a0, 8
beqz t2, clz_For3
addi t1, t1, -8
mv a0, t2
clz_For3:
srli t2, a0, 4
beqz t2, clz_For4
addi t1, t1, -4
mv a0, t2
clz_For4:
srli t2, a0, 2
beqz t2, clz_For5
addi t1, t1, -2
mv a0, t2
clz_For5:
srli t2, a0, 1
beqz t2, clz_End
addi t1, t1, -1
mv a0, t2
clz_End:
sub a0, t1, a0 # return n - x
jr ra
```
Here, **loop unrolling** is used for optimization. Although this increases the code size, it reduces the number of branches and the total execution cycles.
Using the same input example as before, the **Execution Information** for this optimized version is shown below:

It is clear that the **execution cycles** were reduced by approximately 33%.
#### 3. uf8_encode
**uf8_encode** is a function that compresses a 20-bit unsigned integer into an 8-bit encoded value. It estimates the exponent (logarithmic scale) based on the magnitude of the input, calculates the mantissa (fine-grained bits), and finally combines them into a single 8-bit uf8 encoded result.
The corresponding assembly code is as follows:
```C=
UF8_encode:
addi sp, sp, -4
sw ra, 0(sp)
li t0, 16 # if value < 16
blt a0, t0, small_End
mv t5, a0 # t5 = value (original value)
# lz = clz(value); msb = 31 - lz
jal ra, CLZ # a0 = clz(value)
li t6, 31
sub t6, t6, a0 # t6 = msb
# exponent = 0; overflow = 0
li t1, 0 # t1 = exponent
li t2, 0 # t2 = overflow = offset(e)
# if (msb >= 5) { exponent = msb - 4; clamp to 15; build offset(e) }
li t0, 5
blt t6, t0, find_Exact # if (msb < 5) → skip
addi t1, t6, -4 # exponent = msb - 4
li t0, 15
ble t1, t0, calculate_Off
li t1, 15
# overflow = offset(exponent)
# off(0)=0; off(e+1) = (off(e) << 1) + 16
calculate_Off:
li t2, 0 # overflow = 0
li t3, 0 # counter
off_For:
beq t3, t1, adjust_Down
slli t2, t2, 1 # overflow <<= 1
addi t2, t2, 16 # overflow += 16
addi t3, t3, 1
j off_For
# while (exponent > 0 && value < overflow)
# { overflow = (overflow - 16) >> 1; exponent--; }
adjust_Down:
beqz t1, find_Exact
bge t5, t2, find_Exact
addi t2, t2, -16
srli t2, t2, 1
addi t1, t1, -1
j adjust_Down
# find exponent range s.t. offset(e) <= value < offset(e+1)
find_Exact:
li t0, 15
bge t1, t0, Mantissa # clamp at 15
up_For:
slli t3, t2, 1
addi t3, t3, 16 # t3 = next_overflow = offset(e+1)
blt t5, t3, Mantissa # if (value < next_overflow) break
mv t2, t3 # overflow = next_overflow
addi t1, t1, 1 # e++
blt t1, t0, up_For
Mantissa:
# m = (value - offset(e)) >> e
sub t3, t5, t2 # value - overflow
srl t3, t3, t1 # >> e
encode_End:
# b = (e<<4) | m
slli t1, t1, 4 # e<<4
andi t3, t3, 0x0F # m (4 bits)
or a0, t1, t3
lw ra, 0(sp)
addi sp, sp, 4
jr ra
# value < 16
small_End:
andi a0, a0, 0xFF
lw ra, 0(sp)
addi sp, sp, 4
jr ra
```
#### 4. main() + test()
The **test()** function is designed to ensure that **uf8_encode** and **uf8_decode** correctly convert between each other. It iterates through all possible 8-bit uf8 codes (a total of 256), decodes each one into its original integer value, re-encodes it back to uf8, and checks whether the result matches the original code. Additionally, it verifies that the decoded values are **strictly increasing** to confirm the monotonicity of the logarithmic quantization.
The **main()** function is responsible for invoking test(). If all checks pass without error, it prints the message "**All tests passed.**" to indicate that the entire encoding system is functioning correctly.
The corresponding assembly code is as follows:
```C=
.data
msg_pass: .asciz "All tests passed.\n"
msg_err1: .asciz ": produces value "
msg_err2: .asciz " but encodes back to "
msg_err3: .asciz ": value "
msg_err4: .asciz " <= previous_value "
nextline: .asciz "\n"
.text
.globl main
main:
jal ra, Test
beqz a0, Fail
la a0, msg_pass
li a7, 4
ecall
li a0, 0
li a7, 10
ecall
Fail:
li a0, 1
li a7, 10
ecall
# ------------------------------------------------------------
# Test function
# ------------------------------------------------------------
Test:
addi sp, sp, -4
sw ra, 0(sp)
li s1, -1
li s2, 1
li s3, 0
li s7, 256
round_Trip:
beq s3, s7, test_End
mv s4, s3
mv a0, s4
jal ra, UF8_decode
mv s5, a0
jal ra, UF8_encode
mv s6, a0
beq s4, s6, check_Inc
mv a0, s4
li a7, 34
ecall
la a0, msg_err1
li a7, 4
ecall
mv a0, s5
li a7, 1
ecall
la a0, msg_err2
li a7, 4
ecall
mv a0, s6
li a7, 34
ecall
la a0, nextline
li a7, 4
ecall
li s2, 0
check_Inc:
bgt s5, s1, check_End
mv a0, s4
li a7, 34
ecall
la a0, msg_err3
li a7, 4
ecall
mv a0, s5
li a7, 1
ecall
la a0, msg_err4
li a7, 4
ecall
mv a0, s1
li a7, 1
ecall
la a0, nextline
li a7, 4
ecall
li s2, 0
check_End:
mv s1, s5
addi s3, s3, 1
j round_Trip
test_End:
mv a0, s2
lw ra, 0(sp)
addi sp, sp, 4
jr ra
```
---
### Result Console
In this assignment, all assembly programs are run on a **RISC-V 32-bit 5-stage processor** and output to the **console** using environment calls.
The **result** of executing the assembly program composed of the previously described functions is shown below:

As shown in the image above, the program executes correctly and outputs the expected prompt messages.
---
### Execution Information
This section demonstrates the **Execution Information** when running the assembly code on the **Ripes simulator**, in order to compare it with the compiler-generated version (compiled with `-O1`) and observe the performance directly.
#### 1. Compiler's version:

* Corresponding **code size: 267**
#### 2. Improved version:

* Corresponding **code size: 185**
---
### Improvement
From the **Execution Information** above, compared with the compiler’s version:
* **Code size reduced by approximately 31%**
* **Execution cycles reduced by approximately 11%**
* **Actual number of executed instructions reduced by approximately 17%**
---
## Problem C
### Introduction
**bfloat16** (Brain Float 16-bit) is a **16-bit** floating-point format proposed by Google. Its design philosophy is to **retain the same numerical range as FP32 while sacrificing precision**. The structural composition is as follows:

bfloat16 differs from the IEEE-754 standard half precision format (16-bit float, also known as float16). IEEE float16 consists of 1 sign bit + 5 exponent bits + 10 mantissa bits, offering higher precision but smaller representable range. In contrast, bfloat16 uses **1 sign bit + 8 exponent bits + 7 mantissa bits**, resulting in lower precision but a representable range similar to single-precision (float32), due to both formats having 8-bit exponents.
---
### Related Work
We need to translate the original C code provided in **Problem C** (divided into two parts) into a **RISC-V assembly program**, using only **RV32I instructions** (without the M extension or the F/D floating-point extensions), and ensure that it runs correctly on the **Ripes simulator**.
#### Part I
This part implements the **core operations** on **bfloat16**, including addition, subtraction, multiplication, division, conversion, and special value processing.
The original C code is as follows:
```C=
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
typedef struct {
uint16_t bits;
} bf16_t;
#define BF16_SIGN_MASK 0x8000U
#define BF16_EXP_MASK 0x7F80U
#define BF16_MANT_MASK 0x007FU
#define BF16_EXP_BIAS 127
#define BF16_NAN() ((bf16_t) {.bits = 0x7FC0})
#define BF16_ZERO() ((bf16_t) {.bits = 0x0000})
static inline bool bf16_isnan(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_isinf(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
!(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_iszero(bf16_t a)
{
return !(a.bits & 0x7FFF);
}
static inline bf16_t f32_to_bf16(float val)
{
uint32_t f32bits;
memcpy(&f32bits, &val, sizeof(float));
if (((f32bits >> 23) & 0xFF) == 0xFF)
return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF};
f32bits += ((f32bits >> 16) & 1) + 0x7FFF;
return (bf16_t) {.bits = f32bits >> 16};
}
static inline float bf16_to_f32(bf16_t val)
{
uint32_t f32bits = ((uint32_t) val.bits) << 16;
float result;
memcpy(&result, &f32bits, sizeof(float));
return result;
}
static inline bf16_t bf16_add(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
if (exp_a == 0xFF) {
if (mant_a)
return a;
if (exp_b == 0xFF)
return (mant_b || sign_a == sign_b) ? b : BF16_NAN();
return a;
}
if (exp_b == 0xFF)
return b;
if (!exp_a && !mant_a)
return b;
if (!exp_b && !mant_b)
return a;
if (exp_a)
mant_a |= 0x80;
if (exp_b)
mant_b |= 0x80;
int16_t exp_diff = exp_a - exp_b;
uint16_t result_sign;
int16_t result_exp;
uint32_t result_mant;
if (exp_diff > 0) {
result_exp = exp_a;
if (exp_diff > 8)
return a;
mant_b >>= exp_diff;
} else if (exp_diff < 0) {
result_exp = exp_b;
if (exp_diff < -8)
return b;
mant_a >>= -exp_diff;
} else {
result_exp = exp_a;
}
if (sign_a == sign_b) {
result_sign = sign_a;
result_mant = (uint32_t) mant_a + mant_b;
if (result_mant & 0x100) {
result_mant >>= 1;
if (++result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
} else {
if (mant_a >= mant_b) {
result_sign = sign_a;
result_mant = mant_a - mant_b;
} else {
result_sign = sign_b;
result_mant = mant_b - mant_a;
}
if (!result_mant)
return BF16_ZERO();
while (!(result_mant & 0x80)) {
result_mant <<= 1;
if (--result_exp <= 0)
return BF16_ZERO();
}
}
return (bf16_t) {
.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(result_mant & 0x7F),
};
}
static inline bf16_t bf16_sub(bf16_t a, bf16_t b)
{
b.bits ^= BF16_SIGN_MASK;
return bf16_add(a, b);
}
static inline bf16_t bf16_mul(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
uint16_t result_sign = sign_a ^ sign_b;
if (exp_a == 0xFF) {
if (mant_a)
return a;
if (!exp_b && !mant_b)
return BF16_NAN();
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
if (exp_b == 0xFF) {
if (mant_b)
return b;
if (!exp_a && !mant_a)
return BF16_NAN();
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
if ((!exp_a && !mant_a) || (!exp_b && !mant_b))
return (bf16_t) {.bits = result_sign << 15};
int16_t exp_adjust = 0;
if (!exp_a) {
while (!(mant_a & 0x80)) {
mant_a <<= 1;
exp_adjust--;
}
exp_a = 1;
} else
mant_a |= 0x80;
if (!exp_b) {
while (!(mant_b & 0x80)) {
mant_b <<= 1;
exp_adjust--;
}
exp_b = 1;
} else
mant_b |= 0x80;
uint32_t result_mant = (uint32_t) mant_a * mant_b;
int32_t result_exp = (int32_t) exp_a + exp_b - BF16_EXP_BIAS + exp_adjust;
if (result_mant & 0x8000) {
result_mant = (result_mant >> 8) & 0x7F;
result_exp++;
} else
result_mant = (result_mant >> 7) & 0x7F;
if (result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
if (result_exp <= 0) {
if (result_exp < -6)
return (bf16_t) {.bits = result_sign << 15};
result_mant >>= (1 - result_exp);
result_exp = 0;
}
return (bf16_t) {.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(result_mant & 0x7F)};
}
static inline bf16_t bf16_div(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
uint16_t result_sign = sign_a ^ sign_b;
if (exp_b == 0xFF) {
if (mant_b)
return b;
/* Inf/Inf = NaN */
if (exp_a == 0xFF && !mant_a)
return BF16_NAN();
return (bf16_t) {.bits = result_sign << 15};
}
if (!exp_b && !mant_b) {
if (!exp_a && !mant_a)
return BF16_NAN();
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
if (exp_a == 0xFF) {
if (mant_a)
return a;
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
if (!exp_a && !mant_a)
return (bf16_t) {.bits = result_sign << 15};
if (exp_a)
mant_a |= 0x80;
if (exp_b)
mant_b |= 0x80;
uint32_t dividend = (uint32_t) mant_a << 15;
uint32_t divisor = mant_b;
uint32_t quotient = 0;
for (int i = 0; i < 16; i++) {
quotient <<= 1;
if (dividend >= (divisor << (15 - i))) {
dividend -= (divisor << (15 - i));
quotient |= 1;
}
}
int32_t result_exp = (int32_t) exp_a - exp_b + BF16_EXP_BIAS;
if (!exp_a)
result_exp--;
if (!exp_b)
result_exp++;
if (quotient & 0x8000)
quotient >>= 8;
else {
while (!(quotient & 0x8000) && result_exp > 1) {
quotient <<= 1;
result_exp--;
}
quotient >>= 8;
}
quotient &= 0x7F;
if (result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
if (result_exp <= 0)
return (bf16_t) {.bits = result_sign << 15};
return (bf16_t) {.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(quotient & 0x7F)};
}
```
<details>
<summary>The corresponding assembly code (include test part)</summary>
```C=
.equ BF16_SIGN_MASK, 0x8000
.equ BF16_EXP_MASK, 0x7F80
.equ BF16_MANT_MASK, 0x007F
.equ BF16_EXP_BIAS, 127
.data
msg_pass: .asciz "All tests passed\n"
msg_fail: .asciz "Tests failed\n"
.text
.globl main
main:
# ... run tests ...
# --------------------
# bf16_isnan(a0)
# --------------------
li a0, 0x7FC1 # NaN
jal ra, bf16_isnan
li t6, 1
bne a0, t6, fail
li a0, 0x7F80 # +Inf
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
li a0, 0x4000 # +2.0
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
# --------------------
# bf16_isinf(a0)
# --------------------
li a0, 0x7F80 # +Inf
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
li a0, 0xFF80 # -Inf
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
li a0, 0x7FC1 # NaN
jal ra, bf16_isinf
li t6, 0
bne a0, t6, fail
# --------------------
# bf16_iszero(a0)
# --------------------
li a0, 0x0000 # +0
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
li a0, 0x8000 # -0
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
li a0, 0x4000 # 2.0
jal ra, bf16_iszero
li t6, 0
bne a0, t6, fail
# --------------------
# f32_to_bf16(a0: f32 bits) -> a0: bf16 bits
# --------------------
li a0, 0x3F800000 # 1.0f
jal ra, f32_to_bf16
li t6, 0x3F80 # 1.0 (bf16)
bne a0, t6, fail
li a0, 0x40400000 # 3.0f
jal ra, f32_to_bf16
li t6, 0x4040 # 3.0 (bf16)
bne a0, t6, fail
li a0, 0x7FC00000 # quiet NaN
jal ra, f32_to_bf16
li t6, 0x7FC0
bne a0, t6, fail
# --------------------
# bf16_to_f32(a0: bf16 bits) -> a0: f32 bits
# --------------------
li a0, 0x3F80 # 1.0 (bf16)
jal ra, bf16_to_f32
li t6, 0x3F800000
bne a0, t6, fail
li a0, 0x4040 # 3.0 (bf16)
jal ra, bf16_to_f32
li t6, 0x40400000
bne a0, t6, fail
li a0, 0x7FC1 # NaN with payload -> preserved in upper 16 bits
jal ra, bf16_to_f32
li t6, 0x7FC10000
bne a0, t6, fail
# --------------------
# bf16_add(a0, a1) -> a0
# --------------------
li a0, 0x3F80 # 1.0 + 2.0 = 3.0
li a1, 0x4000
jal ra, bf16_add
li t6, 0x4040
bne a0, t6, fail
li a0, 0x4040 # 3.0 + (-1.0) = 2.0
li a1, 0xBF80
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
li a0, 0x0000 # 0 + 1.0 = 1.0
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x3F80
bne a0, t6, fail
li a0, 0x7F80 # +Inf + -Inf = NaN
li a1, 0xFF80
jal ra, bf16_add
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC1 # NaN + 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x7FC1
bne a0, t6, fail
li a0, 0x4000 # 2.0 + tiny subnormal (>= 9 ulps apart) -> unchanged
li a1, 0x0001 # smallest subnormal
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
# --------------------
# bf16_sub(a0, a1) -> a0
# --------------------
li a0, 0x4040 # 3.0 - 1.0 = 2.0
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x4000
bne a0, t6, fail
li a0, 0x3F80 # 1.0 - 3.0 = -2.0
li a1, 0x4040
jal ra, bf16_sub
li t6, 0xC000
bne a0, t6, fail
li a0, 0x0000 # 0 - 1.0 = -1.0
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0xBF80
bne a0, t6, fail
li a0, 0x7F80 # +Inf - +Inf = NaN
li a1, 0x7F80
jal ra, bf16_sub
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC2 # NaN - 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x7FC2
bne a0, t6, fail
li a0, 0x3F80 # 1.0 - 0 = 1.0
li a1, 0x0000
jal ra, bf16_sub
li t6, 0x3F80
bne a0, t6, fail
# --------------------
# bf16_mul(a0, a1) -> a0
# --------------------
li a0, 0x4000 # 2.0 * 3.0 = 6.0
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x40C0 # 6.0
bne a0, t6, fail
li a0, 0x3F80 # 1.0 * -2.0 = -2.0
li a1, 0xC000
jal ra, bf16_mul
li t6, 0xC000
bne a0, t6, fail
li a0, 0x0000 # 0 * 3.0 = 0
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
li a0, 0x7F80 # Inf * 0 = NaN
li a1, 0x0000
jal ra, bf16_mul
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC1 # NaN * 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_mul
li t6, 0x7FC1
bne a0, t6, fail
li a0, 0x0002 # subnormal * 1.0 -> 0 (mul flush behavior here)
li a1, 0x3F80
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
# --------------------
# bf16_div(a0, a1) -> a0
# --------------------
li a0, 0x4080 # 4 / 2 = 2
li a1, 0x4000
jal ra, bf16_div
li t6, 0x4000
bne a0, t6, fail
li a0, 0x4000 # 2 / 4 = 0.5
li a1, 0x4080
jal ra, bf16_div
li t6, 0x3F00
bne a0, t6, fail
li a0, 0x4040 # 3 / 3 = 1
li a1, 0x4040
jal ra, bf16_div
li t6, 0x3F80
bne a0, t6, fail
li a0, 0x3F80 # x / 0 = +Inf (x > 0)
li a1, 0x0000
jal ra, bf16_div
li t6, 0x7F80
bne a0, t6, fail
li a0, 0x0000 # 0 / x = 0
li a1, 0x3F80
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
li a0, 0x0001 # subnormal / 1.0 = 0 (div flush-to-zero here)
li a1, 0x3F80
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
# --------------------
# All passed -> print and exit
# --------------------
all_pass:
la a0, msg_pass
li a7, 4
ecall
li a0, 0 # return code 0
li a7, 10
ecall
# Any failure -> print and exit with code=1
fail:
la a0, msg_fail
li a7, 4
ecall
li a0, 1
li a7, 10
ecall
bf16_div:
srli t0, a0, 15
andi t0, t0, 1
srli t1, a1, 15
andi t1, t1, 1
srli t2, a0, 7
andi t2, t2, 0xFF
srli t3, a1, 7
andi t3, t3, 0xFF
andi t4, a0, 0x7F
andi t5, a1, 0x7F
xor a2, t0, t1
li t6, 0xFF
bne t3, t6, chk_b_zero
bnez t5, ret_b
bne t2, t6, ret_sign_zero
bnez t4, ret_sign_zero
j ret_nan
chk_b_zero:
or t6, t3, t5
bnez t6, chk_a_inf
or t6, t2, t4
beqz t6, ret_nan
j ret_inf
chk_a_inf:
li t6, 0xFF
bne t2, t6, chk_a_zero
bnez t4, ret_a
j ret_inf
chk_a_zero:
or t6, t2, t4
bnez t6, prep_mant
j ret_sign_zero
prep_mant:
beqz t2, a_is_norm
ori t4, t4, 0x80
a_is_norm:
beqz t3, div_a_b
ori t5, t5, 0x80
div_a_b:
mv a3, t4
li a6, 0
bgeu a3, t5, div_no_scale
slli a3, a3, 1
li a6, 1
div_no_scale:
sub a3, a3, t5
li a4, 0
li t6, 7
div_frac_for:
beqz t6, div_frac_done
slli a3, a3, 1
bltu a3, t5, div_bit0
sub a3, a3, t5
slli a4, a4, 1
ori a4, a4, 1
addi t6, t6, -1
j div_frac_for
div_bit0:
slli a4, a4, 1
addi t6, t6, -1
j div_frac_for
div_frac_done:
li t6, BF16_EXP_BIAS
sub a5, t2, t3
add a5, a5, t6
bnez t2, chk_incr_exp
addi a5, a5, -1
chk_incr_exp:
bnez t3, div_result
addi a5, a5, 1
beqz a6, div_result
addi a5, a5, -1
div_result:
li t6, 0xFF
bge a5, t6, ret_inf
blez a5, ret_sign_zero
slli a2, a2, 15
andi a5, a5, 0xFF
slli a5, a5, 7
andi a4, a4, 0x7F
or a0, a2, a5
or a0, a0, a4
jr ra
bf16_mul:
srli t0, a0, 15
andi t0, t0, 1
srli t1, a1, 15
andi t1, t1, 1
srli t2, a0, 7
andi t2, t2, 0xFF
srli t3, a1, 7
andi t3, t3, 0xFF
andi t4, a0, 0x7F
andi t5, a1, 0x7F
xor a2, t0, t1
li t6, 0xFF
bne t2, t6, chk_exp_b_full
bnez t4, ret_a
or t6, t3, t5
beqz t6, ret_nan
j ret_inf
chk_exp_b_full:
li t6, 0xFF
bne t3, t6, chk_if_zero
bnez t5, ret_b
or t6, t2, t4
beqz t6, ret_nan
j ret_inf
chk_if_zero:
or t6, t2, t4
beqz t6, ret_sign_zero
or t6, t3, t5
beqz t6, ret_sign_zero
exp_adjust:
li a3, 0
bnez t2, skip_a_norm
norm_a_for:
andi t6, t4, 0x80
bnez t6, break_1
slli t4, t4, 1
addi a3, a3, -1
j norm_a_for
break_1:
li t2, 1
j chk_b_norm
skip_a_norm:
ori t4, t4, 0x80
chk_b_norm:
bnez t3, b_is_norm
norm_b_for:
andi t6, t5, 0x80
bnez t6, break_2
slli t5, t5, 1
addi a3, a3, -1
j norm_b_for
break_2:
li t3, 1
j mult_a_b
b_is_norm:
ori t5, t5, 0x80
mult_a_b:
mv t6, t4
mv a6, t5
li a4, 0
mult_for:
beqz a6, mult_done
andi a5, a6, 1
beqz a5, skip_add
add a4, a4, t6
skip_add:
srli a6, a6, 1
slli t6, t6, 1
j mult_for
mult_done:
li t6, BF16_EXP_BIAS
add a5, t2, t3
sub a5, a5, t6
add a5, a5, a3
norm_result_mant:
li t6, 0x8000
li a6, 7
and t6, a4, t6
beqz t6, skip_1
addi a6, a6, 1
addi a5, a5, 1
skip_1:
srl a4, a4, a6
andi a4, a4, 0x7F
li t6, 0xFF
bge a5, t6, ret_inf
bgtz a5, mult_result
li t6, -6
blt a5, t6, ret_sign_zero
li t6, 1
sub t6, t6, a5
srl a4, a4, t6
li a5, 0
mult_result:
slli a2, a2, 15
andi a5, a5, 0xFF
slli a5, a5, 7
andi a4, a4, 0x7F
or a0, a2, a5
or a0, a0, a4
jr ra
bf16_sub:
addi sp, sp, -4
sw ra, 0(sp)
li t0, BF16_SIGN_MASK
xor a1, a1, t0
jal ra, bf16_add
lw ra, 0(sp)
addi sp, sp, 4
jr ra
bf16_add:
srli t0, a0, 15
andi t0, t0, 1
srli t1, a1, 15
andi t1, t1, 1
srli t2, a0, 7
andi t2, t2, 0xFF
srli t3, a1, 7
andi t3, t3, 0xFF
andi t4, a0, 0x7F
andi t5, a1, 0x7F
li t6, 0xFF
bne t2, t6, chk_b_inf_or_nan
bnez t4, ret_a
bne t3, t6, ret_a
bnez t5, ret_b
beq t0, t1, ret_b
j ret_nan
chk_b_inf_or_nan:
beq t3, t6, ret_b
check_zero:
or t6, t2, t4
beqz t6, ret_b
or t6, t3, t5
beqz t6, ret_a
check_normal:
beqz t2, skip_a
ori t4, t4, 0x80
skip_a:
beqz t3, adjust_exp
ori t5, t5, 0x80
adjust_exp:
sub a2, t2, t3
li t6, 9
blez a2, not_greater
mv a3, t2
blt a2, t6, shift_b
j ret_a
shift_b:
srl t5, t5, a2
j aligned
not_greater:
beqz a2, equal_diff
mv a3, t3
sub a6, x0, a2
blt a6, t6, shift_a
j ret_b
shift_a:
srl t4, t4, a6
j aligned
equal_diff:
mv a3, t2
aligned:
bne t0, t1, diff_sign
mv a4, t0
add a5, t4, t5
andi t6, a5, 0x100
beqz t6, compose_result
srli a5, a5, 1
addi a3, a3, 1
li t6, 0xFF
bge a3, t6, add_ret_inf
j compose_result
diff_sign:
blt t4, t5, mant_a_small
mv a4, t0
sub a5, t4, t5
j check_mant_zero
mant_a_small:
mv a4, t1
sub a5, t5, t4
check_mant_zero:
beqz a5, ret_zero
norm_for:
andi t6, a5, 0x80
bnez t6, compose_result
slli a5, a5, 1
addi a3, a3, -1
blez a3, ret_zero
j norm_for
compose_result:
slli a4, a4, 15
andi t6, a3, 0xFF
slli t6, t6, 7
andi a5, a5, 0x7F
or a0, a4, t6
or a0, a0, a5
jr ra
ret_a:
jr ra
ret_b:
mv a0, a1
jr ra
ret_zero:
li a0, 0x0000
jr ra
ret_sign_zero:
slli a0, a2, 15
jr ra
ret_nan:
li a0, 0x7FC0
jr ra
ret_inf:
slli a0, a2, 15
li t6, 0x7F80
or a0, a0, t6
jr ra
add_ret_inf:
slli a0, a4, 15
li t6, 0x7F80
or a0, a0, t6
jr ra
bf16_to_f32:
slli a0, a0, 16
jr ra
f32_to_bf16:
mv t0, a0
srli t1, t0, 23
li t2, 0xFF
and t1, t1, t2
beq t1, t2, is_NaN_or_Inf
srli t1, t0, 16
andi t1, t1, 1
li t2, 0x7FFF
add t1, t1, t2
add t0, t0, t1
srli a0, t0, 16
jr ra
is_NaN_or_Inf:
srli a0, t0, 16
jr ra
bf16_isnan:
li t1, BF16_EXP_MASK
and t0, a0, t1
xor t0, t0, t1
slti t0, t0, 1
li t2, BF16_MANT_MASK
and t1, a0, t2
snez t1, t1
and a0, t0, t1
jr ra
bf16_isinf:
li t1, BF16_EXP_MASK
and t0, a0, t1
xor t0, t0, t1
slti t0, t0, 1
li t2, BF16_MANT_MASK
and t1, a0, t2
slti t1, t1, 1
and a0, t0, t1
jr ra
bf16_iszero:
li t0, 0x7FFF
and t0, a0, t0
slti a0, t0, 1
jr ra
```
</details>
##### 1. About main
All **function tests** are integrated into `main` and executed sequentially. Each test calls the target function via `jal ra, <func>`, loads the expected result into `t6`, and compares it with the return value in `a0` using `bne a0, t6, fail`. Any mismatch jumps to `fail`, prints "**Tests failed**", and exits with code 1. If all tests pass, it prints "**All tests passed**".
<details>
<summary>The corresponding assembly code (test part)</summary>
```c=
.equ BF16_SIGN_MASK, 0x8000
.equ BF16_EXP_MASK, 0x7F80
.equ BF16_MANT_MASK, 0x007F
.equ BF16_EXP_BIAS, 127
.data
msg_pass: .asciz "All tests passed\n"
msg_fail: .asciz "Tests failed\n"
.text
.globl main
main:
# ... run tests ...
# bf16_isnan(a0)
li a0, 0x7FC1 # NaN
jal ra, bf16_isnan
li t6, 1
bne a0, t6, fail
li a0, 0x7F80 # +Inf
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
li a0, 0x4000 # +2.0
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
# bf16_isinf(a0)
li a0, 0x7F80 # +Inf
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
li a0, 0xFF80 # -Inf
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
li a0, 0x7FC1 # NaN
jal ra, bf16_isinf
li t6, 0
bne a0, t6, fail
# bf16_iszero(a0)
li a0, 0x0000 # +0
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
li a0, 0x8000 # -0
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
li a0, 0x4000 # 2.0
jal ra, bf16_iszero
li t6, 0
bne a0, t6, fail
# f32_to_bf16(a0: f32 bits) -> a0: bf16 bits
li a0, 0x3F800000 # 1.0f
jal ra, f32_to_bf16
li t6, 0x3F80 # 1.0 (bf16)
bne a0, t6, fail
li a0, 0x40400000 # 3.0f
jal ra, f32_to_bf16
li t6, 0x4040 # 3.0 (bf16)
bne a0, t6, fail
li a0, 0x7FC00000 # quiet NaN
jal ra, f32_to_bf16
li t6, 0x7FC0
bne a0, t6, fail
# bf16_to_f32(a0: bf16 bits) -> a0: f32 bits
li a0, 0x3F80 # 1.0 (bf16)
jal ra, bf16_to_f32
li t6, 0x3F800000
bne a0, t6, fail
li a0, 0x4040 # 3.0 (bf16)
jal ra, bf16_to_f32
li t6, 0x40400000
bne a0, t6, fail
li a0, 0x7FC1 # NaN with payload -> preserved in upper 16 bits
jal ra, bf16_to_f32
li t6, 0x7FC10000
bne a0, t6, fail
# bf16_add(a0, a1) -> a0
li a0, 0x3F80 # 1.0 + 2.0 = 3.0
li a1, 0x4000
jal ra, bf16_add
li t6, 0x4040
bne a0, t6, fail
li a0, 0x4040 # 3.0 + (-1.0) = 2.0
li a1, 0xBF80
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
li a0, 0x0000 # 0 + 1.0 = 1.0
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x3F80
bne a0, t6, fail
li a0, 0x7F80 # +Inf + -Inf = NaN
li a1, 0xFF80
jal ra, bf16_add
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC1 # NaN + 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x7FC1
bne a0, t6, fail
li a0, 0x4000 # 2.0 + tiny subnormal (>= 9 ulps apart) -> unchanged
li a1, 0x0001 # smallest subnormal
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
# bf16_sub(a0, a1) -> a0
li a0, 0x4040 # 3.0 - 1.0 = 2.0
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x4000
bne a0, t6, fail
li a0, 0x3F80 # 1.0 - 3.0 = -2.0
li a1, 0x4040
jal ra, bf16_sub
li t6, 0xC000
bne a0, t6, fail
li a0, 0x0000 # 0 - 1.0 = -1.0
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0xBF80
bne a0, t6, fail
li a0, 0x7F80 # +Inf - +Inf = NaN
li a1, 0x7F80
jal ra, bf16_sub
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC2 # NaN - 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x7FC2
bne a0, t6, fail
li a0, 0x3F80 # 1.0 - 0 = 1.0
li a1, 0x0000
jal ra, bf16_sub
li t6, 0x3F80
bne a0, t6, fail
# bf16_mul(a0, a1) -> a0
li a0, 0x4000 # 2.0 * 3.0 = 6.0
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x40C0 # 6.0
bne a0, t6, fail
li a0, 0x3F80 # 1.0 * -2.0 = -2.0
li a1, 0xC000
jal ra, bf16_mul
li t6, 0xC000
bne a0, t6, fail
li a0, 0x0000 # 0 * 3.0 = 0
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
li a0, 0x7F80 # Inf * 0 = NaN
li a1, 0x0000
jal ra, bf16_mul
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC1 # NaN * 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_mul
li t6, 0x7FC1
bne a0, t6, fail
li a0, 0x0002 # subnormal * 1.0 -> 0 (mul flush behavior here)
li a1, 0x3F80
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
# bf16_div(a0, a1) -> a0
li a0, 0x4080 # 4 / 2 = 2
li a1, 0x4000
jal ra, bf16_div
li t6, 0x4000
bne a0, t6, fail
li a0, 0x4000 # 2 / 4 = 0.5
li a1, 0x4080
jal ra, bf16_div
li t6, 0x3F00
bne a0, t6, fail
li a0, 0x4040 # 3 / 3 = 1
li a1, 0x4040
jal ra, bf16_div
li t6, 0x3F80
bne a0, t6, fail
li a0, 0x3F80 # x / 0 = +Inf (x > 0)
li a1, 0x0000
jal ra, bf16_div
li t6, 0x7F80
bne a0, t6, fail
li a0, 0x0000 # 0 / x = 0
li a1, 0x3F80
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
li a0, 0x0001 # subnormal / 1.0 = 0 (div flush-to-zero here)
li a1, 0x3F80
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
# All passed -> print and exit
all_pass:
la a0, msg_pass
li a7, 4
ecall
li a0, 0 # return code 0
li a7, 10
ecall
# Any failure -> print and exit with code=1
fail:
la a0, msg_fail
li a7, 4
ecall
li a0, 1
li a7, 10
ecall
```
</details>
The **test coverage** includes:
- **bf16_isnan / bf16_isinf / bf16_iszero:** checks NaN, ±Inf, ±0, and normal numbers.
- **f32_to_bf16 / bf16_to_f32:** verifies correct conversion for 1.0, 3.0, and NaN.
- **bf16_add / bf16_sub:** tests same-sign and opposite-sign operations, zeros, Inf, and NaN.
- **bf16_mul / bf16_div:** covers regular arithmetic, zeros, Inf, NaN, subnormals, and divide-by-zero.
The program prints only one summary line ("All tests passed" or "Tests failed") using **ecall**, ensuring minimal and clear output as required.
##### 2. About bf16_mul
According to the assignment requirements, since the RV32M extension instructions such as `mul` are **not allowed**, the multiplication must be implemented using **only RV32I base instructions**.
In this implementation, the classical **Egyptian multiplication** is adopted. This method emulates multiplication through bit shifting and conditional addition, completing the calculation of `mant_a × mant_b` in `bf16_mul`.
The procedure of the multiplication is as follows:
1. Let **b** (`mant_b`) be the multiplier, and **a** (`mant_a`) be the multiplicand.
2. While `b ≠ 0`, perform the following operations repeatedly:
- If **the least significant bit** of `b` is 1, add the current value of `a` to the **accumulator**.
- Then perform `a <<= 1` (double the multiplicand) and `b >>= 1` (halve the multiplier).
3. When `b = 0`, the accumulator contains the final product.
Below is the corresponding assembly code segment from `bf16_mul`:
```C=
mult_a_b:
mv t6, t4
mv a6, t5
li a4, 0
mult_for:
beqz a6, mult_done
andi a5, a6, 1
beqz a5, skip_add
add a4, a4, t6
skip_add:
srli a6, a6, 1
slli t6, t6, 1
j mult_for
mult_done:
```
where: `t4=mant_a`, `t5=mant_b`, `t6=multiplicand`, `a6=multiplier`, and `a4=accumulator`.
##### 3. About bf16_div
In `bf16_div`, the key operation is the **mantissa division**. Since division extensions are not allowed, the division must be implemented purely using **shift-and-subtraction** operations.
The original version adopts a **16-round restoring long division** algorithm. It repeatedly shifts and compares the operands to determine each **quotient** bit, one per round.
The corresponding original C code is as follows:
```C=
uint32_t dividend = (uint32_t) mant_a << 15;
uint32_t divisor = mant_b;
uint32_t quotient = 0;
for (int i = 0; i < 16; i++) {
quotient <<= 1;
if (dividend >= (divisor << (15 - i))) {
dividend -= (divisor << (15 - i));
quotient |= 1;
}
}
int32_t result_exp = (int32_t) exp_a - exp_b + BF16_EXP_BIAS;
if (!exp_a)
result_exp--;
if (!exp_b)
result_exp++;
if (quotient & 0x8000)
quotient >>= 8;
else {
while (!(quotient & 0x8000) && result_exp > 1) {
quotient <<= 1;
result_exp--;
}
quotient >>= 8;
}
quotient &= 0x7F;
```
If we directly translate the original C code into assembly, the corresponding implementation is shown below:
```C=
div_a_b:
slli a3, t4, 15
li a4, 0
li a6, 15
div_for:
slli a4, a4, 1
sll t6, t5, a6
blt a3, t6, div_no_sub
sub a3, a3, t6
ori a4, a4, 1
div_no_sub:
addi a6, a6, -1
bgez a6, div_for
li t6, BF16_EXP_BIAS
sub a5, t2, t3
add a5, a5, t6
bnez t2, chk_incr_exp
addi a5, a5, -1
chk_incr_exp:
bnez t3, norm_quot
addi a5, a5, 1
norm_quot:
li t6, 0x8000
and t6, a4, t6
bnez t6, norm_quot_done
norm_quot_for:
li t6, 1
ble a5, t6, norm_quot_done
li t6, 0x8000
and t6, a4, t6
bnez t6, norm_quot_done
slli a4, a4, 1
addi a5, a5, -1
j norm_quot_for
norm_quot_done:
srli a4, a4, 8
andi a4, a4, 0x7F
```
Running with test cases for `bf16_div` in `main`, the corresponding **Execution Information** is shown below:

This algorithm performs **16 iterations**, each computing one quotient bit. After all rounds, the result must be normalized so that the mantissa lies within `[1.0, 2.0)`.
However, since the original `bf16_div` does not implement **rounding** and only the top 7 fraction bits are needed, it is unnecessary to compute all 16 bits of precision.
Thus, the process can be optimized by using a **7-round fractional long division**, which directly produces the normalized mantissa bits.
In this improved version, the procedure is as follows:
1. **Integer-part correction:**
Compare the two mantissas first.
- If `mant_a ≥ mant_b`, the quotient’s integer part is 1,
so subtract once: `remainder = mant_a - mant_b`.
- Otherwise, the dividend is smaller than the divisor, so we cannot subtract yet.
Set `remainder = mant_a` and later shift it left once (×2) to bring the integer part to 1, while the exponent `result_exp` will later be decreased by 1 to compensate.
2. **Seven-round fractional division:**
Each round determines one fractional bit (total 7 bits):
- Shift `remainder <<= 1` (to multiply by 2) and `mant <<= 1`.
- If `remainder ≥ mant_b`, subtract once `remainder -= mant_b` and set the least bit to 1 (`mant |= 1`).
- Otherwise, leave the least bit 0.
The modified C code is shown as follows:
```C=
uint16_t remainder;
uint16_t mant = 0;
uint16_t flag = 0;
if (mant_a >= mant_b) {
remainder = (uint16_t)(mant_a - mant_b);
} else {
remainder = mant_a;
flag = 1;
}
for (int i = 0; i < 7; i++) {
remainder <<= 1;
mant <<= 1;
if (remainder >= mant_b) {
remainder -= mant_b;
mant |= 1;
}
}
int32_t result_exp = (int32_t)exp_a - exp_b + BF16_EXP_BIAS;
if (!exp_a) result_exp--;
if (!exp_b) result_exp++;
if (flag) result_exp--;
```
The corresponding assembly implementation is shown below:
```C=
div_a_b:
mv a3, t4
li a6, 0
bgeu a3, t5, div_no_scale
slli a3, a3, 1
li a6, 1
div_no_scale:
sub a3, a3, t5
li a4, 0
li t6, 7
div_frac_for:
beqz t6, div_frac_done
slli a3, a3, 1
bltu a3, t5, div_bit0
sub a3, a3, t5
slli a4, a4, 1
ori a4, a4, 1
addi t6, t6, -1
j div_frac_for
div_bit0:
slli a4, a4, 1
addi t6, t6, -1
j div_frac_for
div_frac_done:
li t6, BF16_EXP_BIAS
sub a5, t2, t3
add a5, a5, t6
bnez t2, chk_incr_exp
addi a5, a5, -1
chk_incr_exp:
bnez t3, chk_flag
addi a5, a5, 1
chk_flag:
beqz a6, div_result
addi a5, a5, -1
div_result:
```
where: `t4 = mant_a`, `t5 = mant_b`, `a3 = remainder`, `a4 = mant`, `a5 = result_exp`, `a6 = flag`, and `t6 = 7` for loop control.
Running with test cases for `bf16_div` in `main`, the corresponding **Execution Information** of this improved version is shown below:

It is clear that the **execution cycles** were reduced by approximately $30$%.
##### 4. Others
For example, in the original C code, you can often see some program fragments similar to the following:
```C=
if (!x && !y)
< operation >
......
```
If we translate it directly, the corresponding assembly code would be:
```C=
bnez a0, Skip
bnez a1, Skip
< operation >
Skip:
......
```
Here, `a0 = x` and `a1 = y`.
However, this version requires **two conditional branches**, which can be further optimized. By combining the operands with the **`or`** instruction, we can achieve the same logic with a single test: `x | y` is zero exactly when **both `x` and `y` are zero**, so `<operation>` executes only in that case. (Note that `xor` would not work here, since it is also zero whenever `x == y`, even for nonzero values.)
The optimized assembly code is as follows:
```C=
or t0, a0, a1
bnez t0, Skip
< operation >
Skip:
......
```
This approach **eliminates one branch instruction**, making the program shorter and slightly faster during execution.
---
#### Part II
This part is about the implementation of **square root operation** in **bfloat16** format. The original C code is as follows:
```C=
static inline bf16_t bf16_sqrt(bf16_t a)
{
uint16_t sign = (a.bits >> 15) & 1;
int16_t exp = ((a.bits >> 7) & 0xFF);
uint16_t mant = a.bits & 0x7F;
/* Handle special cases */
if (exp == 0xFF) {
if (mant)
return a; /* NaN propagation */
if (sign)
return BF16_NAN(); /* sqrt(-Inf) = NaN */
return a; /* sqrt(+Inf) = +Inf */
}
/* sqrt(0) = 0 (handle both +0 and -0) */
if (!exp && !mant)
return BF16_ZERO();
/* sqrt of negative number is NaN */
if (sign)
return BF16_NAN();
/* Flush denormals to zero */
if (!exp)
return BF16_ZERO();
/* Direct bit manipulation square root algorithm */
/* For sqrt: new_exp = (old_exp - bias) / 2 + bias */
int32_t e = exp - BF16_EXP_BIAS;
int32_t new_exp;
/* Get full mantissa with implicit 1 */
uint32_t m = 0x80 | mant; /* Range [128, 256) representing [1.0, 2.0) */
/* Adjust for odd exponents: sqrt(2^odd * m) = 2^((odd-1)/2) * sqrt(2*m) */
if (e & 1) {
m <<= 1; /* Double mantissa for odd exponent */
new_exp = ((e - 1) >> 1) + BF16_EXP_BIAS;
} else {
new_exp = (e >> 1) + BF16_EXP_BIAS;
}
/* Now m is in range [128, 256) or [256, 512) if exponent was odd */
/* Binary search for integer square root */
/* We want result where result^2 = m * 128 (since 128 represents 1.0) */
uint32_t low = 90; /* Min sqrt (roughly sqrt(128)) */
uint32_t high = 256; /* Max sqrt (roughly sqrt(512)) */
uint32_t result = 128; /* Default */
/* Binary search for square root of m */
while (low <= high) {
uint32_t mid = (low + high) >> 1;
uint32_t sq = (mid * mid) / 128; /* Square and scale */
if (sq <= m) {
result = mid; /* This could be our answer */
low = mid + 1;
} else {
high = mid - 1;
}
}
/* result now contains sqrt(m) * sqrt(128) / sqrt(128) = sqrt(m) */
/* But we need to adjust the scale */
/* Since m is scaled where 128=1.0, result should also be scaled same way */
/* Normalize to ensure result is in [128, 256) */
if (result >= 256) {
result >>= 1;
new_exp++;
} else if (result < 128) {
while (result < 128 && new_exp > 1) {
result <<= 1;
new_exp--;
}
}
/* Extract 7-bit mantissa (remove implicit 1) */
uint16_t new_mant = result & 0x7F;
/* Check for overflow/underflow */
if (new_exp >= 0xFF)
return (bf16_t) {.bits = 0x7F80}; /* +Inf */
if (new_exp <= 0)
return BF16_ZERO();
return (bf16_t) {.bits = ((new_exp & 0xFF) << 7) | new_mant};
}
```
The corresponding assembly code (include test part) is as follows:
```C=
.equ BF16_SIGN_MASK, 0x8000
.equ BF16_EXP_MASK, 0x7F80
.equ BF16_MANT_MASK, 0x007F
.equ BF16_EXP_BIAS, 127
.data
msg_pass: .asciz "All tests passed\n"
msg_fail: .asciz "Tests failed\n"
.text
.globl main
main:
# bf16_sqrt(a0) -> a0
li a0, 0x7FA5
jal ra, bf16_sqrt
li t6, 0x7FA5
bne a0, t6, fail
# +Inf
li a0, 0x7F80
jal ra, bf16_sqrt
li t6, 0x7F80
bne a0, t6, fail
# -Inf -> NaN
li a0, 0xFF80
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
# +0
li a0, 0x0000
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
# -0 treated as +0
li a0, 0x8000
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
# Negative finite number -> NaN
li a0, 0xC000 # -2.0
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
# Subnormal input (flush to zero)
li a0, 0x0001 # smallest subnormal
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
li a0, 0x007F # largest subnormal
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
# Even exponent: do not double mantissa
li a0, 0x4080 # 4.0
jal ra, bf16_sqrt
li t6, 0x4000 # 2.0
bne a0, t6, fail
li a0, 0x4180 # 16.0
jal ra, bf16_sqrt
li t6, 0x4080 # 4.0
bne a0, t6, fail
li a0, 0x4110 # 9.0
jal ra, bf16_sqrt
li t6, 0x4040 # 3.0
bne a0, t6, fail
# Odd exponent: shift mantissa left before halving
li a0, 0x4000 # 2.0
jal ra, bf16_sqrt
li t6, 0x3FB5 # 1.4142 (truncated)
bne a0, t6, fail
li a0, 0x4040 # 3.0
jal ra, bf16_sqrt
li t6, 0x3FDD # 1.732 (truncated)
bne a0, t6, fail
# Need left normalization (result < 128)
li a0, 0x3F00 # 0.5
jal ra, bf16_sqrt
li t6, 0x3F35 # 0.7071 (truncated)
bne a0, t6, fail
# All tests passed
all_pass:
la a0, msg_pass
li a7, 4
ecall
li a0, 0
li a7, 10
ecall
# Any test failed
fail:
la a0, msg_fail
li a7, 4
ecall
li a0, 1
li a7, 10
ecall
bf16_sqrt:
addi sp, sp, -4
sw ra, 0(sp)
srli t0, a0, 15
andi t0, t0, 1
srli t1, a0, 7
andi t1, t1, 0xFF
andi t2, a0, 0x7F
li t6, 0xFF
bne t1, t6, chk_zero
bnez t2, ret_a
bnez t0, ret_nan
j ret_a
chk_zero:
or t6, t1, t2
beqz t6, ret_zero
bnez t0, ret_nan
beqz t1, ret_zero
li t6, 127
sub a2, t1, t6
li a3, 0x80
or a3, a3, t2
andi t6, a2, 1
beqz t6, get_new_exp
odd_exp:
slli a3, a3, 1
addi a2, a2, -1
get_new_exp:
srai a2, a2, 1
li t6, 127
add t3, a2, t6
bin_search:
li a4, 90
li a5, 256
li t4, 128
search_for:
bgtu a4, a5, norm_result
add a6, a4, a5
srli a6, a6, 1
mv a0, a6
mv a1, a6
jal ra, mul_aux
srli t6, a0, 7
bgeu a3, t6, le_ok
addi a5, a6, -1
j search_for
le_ok:
mv t4, a6
addi a4, a6, 1
j search_for
norm_result:
li t6, 256
blt t4, t6, chk_if_norm_up
srli t4, t4, 1
addi t3, t3, 1
chk_if_norm_up:
li t6, 128
bge t4, t6, mant_ok
norm_up:
li t6, 128
bge t4, t6, mant_ok
li t6, 1
ble t3, t6, mant_ok
slli t4, t4, 1
addi t3, t3, -1
j norm_up
mant_ok:
andi t4, t4, 0x7F
li t6, 0xFF
bgeu t3, t6, ret_inf
blez t3, ret_zero
sqrt_result:
andi t3, t3, 0xFF
slli t3, t3, 7
or a0, t3, t4
lw ra, 0(sp)
addi sp, sp, 4
jr ra
ret_a:
lw ra, 0(sp)
addi sp, sp, 4
jr ra
ret_zero:
li a0, 0x0000
lw ra, 0(sp)
addi sp, sp, 4
jr ra
ret_inf:
li a0, 0x7F80
lw ra, 0(sp)
addi sp, sp, 4
jr ra
ret_nan:
li a0, 0x7FC0
lw ra, 0(sp)
addi sp, sp, 4
jr ra
mul_aux:
li t6, 0
mul_for:
beqz a1, mul_done
andi t2, a1, 1
beqz t2, skip_add
add t6, t6, a0
skip_add:
srli a1, a1, 1
slli a0, a0, 1
j mul_for
mul_done:
mv a0, t6
jr ra
```
##### 1. Notes
- In the **binary search loop**, compute `mid*mid` via `mul_aux` (Egyptian multiplication) to avoid hardware multiply.
- Since `128 = 2^7`, compute `sq = (mid*mid)/128` with a right shift by 7 (`srli`), avoiding a more expensive general division.
##### 2. About main
All **tests** for `bf16_sqrt` are integrated in `main`. Each case calls `bf16_sqrt` via `jal ra, bf16_sqrt`, and the return value is in `a0`. We compare it with the expected value in `t6` using `bne a0, t6, fail`.
**Test coverage** includes:
- **NaN**: `a0=0x7FA5` .
- **Infinities**: `+Inf` → `+Inf`, `-Inf` → `NaN`.
- **Negative finite**: `-2.0 → NaN`.
- **Subnormals (flush to zero)**: `0x0001` (min) and `0x007F` (max) → `0`.
- **Even exponent (no mantissa doubling)**: `4.0` → `2.0`, `16.0` → `4.0`, `9.0` → `3.0`.
- **Odd exponent (mantissa doubled first)**: `2.0` → ≈ `1.4142 (0x3FB5)`, `3.0` → ≈ `1.732 (0x3FDD)`.
- **Needs left normalization**: `0.5` → ≈ `0.7071 (0x3F35)`.
Any mismatch jumps to `fail`, prints "**Tests failed**", and exits. If all pass, the program prints "**All tests passed**".
---
### Result Console
All assembly programs are run on a **RISC-V 32-bit 5-stage processor** and output to the **console** using environment calls.
#### Part I
The result of execution is shown below:

#### Part II
The result of execution is shown below:

As shown in the images for **Part I** and **Part II** above, the program executes correctly and outputs the expected prompt messages.
---
### Execution Information
This section demonstrates the **Execution Information** when running the assembly code on the **Ripes simulator**, in order to compare with the compiler-generated version (without `-O1` optimization) and directly observe the performance difference.
#### Part I
##### 1. Compiler's version:

* Corresponding **code size: 1128**
##### 2. Improved version:

* Corresponding **code size: 557**
#### Part II
##### 1. Compiler's version:

* Corresponding **code size: 368**
##### 2. Improved version:

* Corresponding **code size: 191**
---
### Improvement
From the **Execution Information** above, compared with the compiler’s version:
#### Part I
* **Code size reduced by approximately $51$%**
* **Execution cycles reduced by approximately $62$%**
* **Actual number of executed instructions reduced by approximately $62$%**
#### Part II
* **Code size reduced by approximately $48$%**
* **Execution cycles reduced by approximately $11$%**
* **Actual number of executed instructions reduced by approximately $12$%**
---
## Explanation Using Ripes Simulator
In this part, we use **UF8** (introduced in Problem B) as an example to describe a possible use case, and provide explanations for both program functionality and instruction-level operations about `uf8_decode` using the Ripes simulator.
---
### Use case for UF8
For example, in [**LeetCode 347**](https://leetcode.com/problems/top-k-frequent-elements/), we may need to count frequencies for a large number of **elements/events**. The frequency of each element may range from 1 to tens of thousands (or even higher), resulting in a very wide range.
In such a scenario, if we store each element’s frequency value using **UF8** representation (as approximate value), we can replace a `uint32_t` / `int` frequency value with a `uint8_t` one. This allows us to store more data under the same memory capacity, or support a larger hash table for counting.
When producing the final top-K output, we can simply **decode** the frequencies back into approximate integers and then sort based on the decoded values. However, since sorting is performed on decoded (approximate) values, this approach is best suited to scenarios where the most frequent elements have much higher counts than the rest, so that the relative ordering remains stable despite the approximation.
---
### Explanation for Instruction-Level Operations
Here, we use `uf8_decode` as the simple example, value `0x00000021` as the test input, and the corresponding assembly program is shown below:
```=
.data
.text
.globl main
main:
li a0, 0x00000021
UF8_decode:
andi t1, a0, 0x0F # m = fl & 0x0F
srli t2, a0, 4 # e = fl >> 4
li t3, 1
sll t3, t3, t2 # 1 << e
addi t3, t3, -1 # (1 << e) - 1
slli t3, t3, 4 # offset = ((1 << e)-1) << 4
sll t1, t1, t2 # m << e
add a0, t1, t3 # result = (m << e) + offset
Exit:
li a7, 10
ecall
```
For the explanation of **instruction-level operations**, let's look at `li a0, 0x00000021`, which is the first instruction executed in the program.
#### 1. Instruction in IF-stage

In the IF-stage, the CPU fetches the instruction from the instruction memory using the PC, while also computing the next PC. Since `li` is a pseudo-instruction, the assembler expands it into a real RV32I instruction; in this case, it becomes `addi x10, x0, 33`.
>#### In IF-stage:
>- PC Enable: 1 (High), since there's no pipeline stall.
>- Next PC Mux: 0 (Low), since no branch/jump is taken.
#### 2. `addi x10, x0, 33` in ID-stage

In the ID-stage, the decoder identifies the instruction as an I-type `addi` and generates the corresponding control signals. Meanwhile, the register file reads `rs1 = x0` (which is always `0`), and the `Imm.` component produces `imm = 33`.
Note that the register write enable (Wr En) is still 0 at this moment, because the write-back happens in the WB-stage.
>#### In ID-stage:
>- Reg Wr En: 0 (Low), not in WB yet.
#### 3. `addi x10, x0, 33` in EX-stage

In the EX-stage, the ALU takes the value of `rs1` (`x0 = 0`) as **Op1** and the sign-extended immediate `33` as **Op2**, and performs **addition**. The expected ALU result is `33` (`0x21`).
>#### In EX-stage:
>- ALUOp1: from ID/EX pipeline register or forwarding path.
>- ALUOp2: from the immediate.
>- ForwardA: no forwarding, from ID/EX pipeline register.
>- ForwardB: no forwarding.
>- Branch taken: not taken.
#### 4. `addi x10, x0, 33` in MEM-stage

In the MEM-stage, since `addi` does not access the data memory, the ALU result is simply carried into the MEM/WB pipeline register.
>#### In MEM-stage:
>- Data Memory Wr En: 0 (Low), since the `addi` instruction does not write memory.
#### 5. `addi x10, x0, 33` in WB-stage

In the WB-stage, the ALU result should be written back to the register file as `x10`.
>#### In WB-stage:
>- Reg Wr En (in ID-stage): 1 (High), write back to `x10`.
>- WB Mux: select ALU result.
#### 6. `addi x10, x0, 33` exits the pipeline
In the next cycle, instruction `addi x10, x0, 33` has completed and exited the pipeline, so the value of `x10` should now be `0x21` (`33`). We check the value of `x10` (`a0`):

The value of `x10` matches the expected result, indicating that pseudo-instruction `li a0, 0x00000021` (expanded into `addi`) executed correctly.
---
### Explanation for Program Functionality
For the explanation of **program functionality**, we continue from the cycle right after step 6 in the previous section: at this point, `li a0, 0x00000021` (expanded into `addi x10, x0, 33`) has already completed and exited the pipeline, so `x10` has been correctly set to `0x21`.
The program then enters the body of the `UF8_decode` function, computing the mantissa `m`, the exponent `e`, the `offset`, and finally the decoded result.
>Since this is a **5-stage pipeline processor**, **multiple instructions overlap in the same cycle**. Therefore, a single snapshot of the pipeline can show the different instructions simultaneously residing in different stages.

>#### In IF-stage: `addi x28, x28, -1`
>- PC Enable: 1 (High), since there's no pipeline stall.
>- Next PC Mux: 0 (Low), since no branch/jump is taken.
>#### In ID-stage: `sll x28, x28, x7`
>- Reg Wr En (for `andi` in WB-stage): 1 (High), write back to `x6`.
>#### In EX-stage: `addi x28, x0, 1`
>- ALUOp1: from ID/EX pipeline register or forwarding path.
>- ALUOp2: from the immediate.
>- ForwardA: no forwarding, from ID/EX pipeline register.
>- ForwardB: no forwarding.
>- Branch taken: not taken.
>#### In MEM-stage: `srli x7, x10, 4`
>- Data Memory Wr En: 0 (Low), since `srli` does not write memory.
>#### In WB-stage: `andi x6, x10, 15`
>- WB Mux: select ALU result.

>#### In IF-stage: `ecall`
>- PC Enable: 1 (High), since there's no pipeline stall.
>- Next PC Mux: 0 (Low), since no branch/jump is taken.
>#### In ID-stage: `addi x17, x0, 10`
>- Reg Wr En (for `slli` in WB-stage): 1 (High), write back to `x28`.
>#### In EX-stage: `add x10, x6, x28`
>- ALUOp1: from ID/EX pipeline register or forwarding path.
>- ALUOp2: from ID/EX pipeline register or forwarding path.
>- ForwardA: use forwarding, from MEM-stage.
>- ForwardB: use forwarding, from WB-stage.
>- Branch taken: not taken.
>#### In MEM-stage: `sll x6, x6, x7`
>- Data Memory Wr En: 0 (Low), since `sll` does not write memory.
>#### In WB-stage: `slli x28, x28, 4`
>- WB Mux: select ALU result.
Notice that in the **EX-stage**, `add x10, x6, x28` has **data hazards** on both source operands: `x6` is produced by the **MEM-stage** instruction `sll x6, x6, x7`, and `x28` is produced by the **WB-stage** instruction `slli x28, x28, 4`.
Therefore, **forwarding must be enabled** (ForwardA from the MEM-stage for the updated `x6`, and ForwardB from the WB-stage for the updated `x28`) so that the `add x10, x6, x28` instruction can **use the latest values without stalling the pipeline**.
Returning to the program functionality: according to the behavior of `uf8_decode`, for the input `0x00000021`:
- `m = 0x21 & 0x0F = 1`
- `e = 0x21 >> 4 = 2`
- `offset = ((1 << 2)-1) << 4 = (4-1) << 4 = 3 << 4 = 48`
- `m << e = 1 << 2 = 4`
- `result = 4 + 48 = 52 (0x34)`
Therefore, the decoded value should be `0x34` (`52`), and according to the program flow it should end up in `x10` (`a0`). After the program finishes, we check the value of `x10` (`a0`):

The value of `x10` matches the expected result, confirming that the program executed successfully, and validating the functionality (correctness) of `uf8_decode`.
---
> Refer to <[Assignment1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/2025-arch-homework1)>
> Contributed by <[`Micelearner`](https://github.com/Micelearner/ca2025-quizzes)>
***Reference***
- [Assignment1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/2025-arch-homework1)
- [Quiz1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
- [RISC-V Assembly Programmer’s Manual](https://github.com/riscv-non-isa/riscv-asm-manual/blob/main/src/asm-manual.adoc)
- [Lab1: RV32I Simulator](https://hackmd.io/@sysprog/H1TpVYMdB)
- [Ripes: Supported Environment Calls](https://github.com/mortbopet/Ripes/blob/master/docs/ecalls.md)
- [CS61C: RISC-V 5-Stage Pipeline](https://docs.google.com/presentation/d/1v-Squx8lK-oOrflFOwBZh-ue94seVHZudqDOgJmFf5Q/edit?slide=id.g2fa6143ce8c_0_133#slide=id.g2fa6143ce8c_0_133)