Assignment 1: RISC-V Assembly and Instruction Pipeline

# Assignment 1: RISC-V Assembly and Instruction Pipeline #### contributed by <`bbchen`> # Problem B in [Quiz 1](https://hackmd.io/@sysprog/arch2025-quiz1-sol) ### Abstract This problem is to understand and implement a custom 8-bit logarithmic codec called uf8, which compresses a 20-bit unsigned integer (range: 0 ~ 1,015,792) into only 8 bits, achieving a 2.5:1 compression ratio while guaranteeing a relative error ≤ 6.25%. [The complete C code and assembly for Problem B](https://github.com/CHENSHIBO0000/ca2025-quizzes/tree/main/HW1/q1-uf8) --- The following illustration is about each function and its corresponding RISC-Vassembly in Problem B of the quiz1. ### clz This function `clz` computes the count of leading zeros in a 32-bit unsigned integer using a binary search method. It starts with the maximum (32) and checks progressively smaller bit ranges (16, 8, 4, 2, 1). By shifting and adjusting the count, it quickly finds the position of the most significant 1-bit. The result is the exact number of leading zeros, with clz(0) = 32. #### C code of clz ```c static inline unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } ``` 1. Start with `n = 32` (maximum possible leading zeros) and a probe size `c = 16`. 2. Compute `y = x >> c`. - If `y != 0`, then the highest 1-bit lies within the upper `32 - c` bits. Reduce the count: `n -= c`, and zoom in: `x = y`. - Otherwise, the highest 1-bit is in the lower `c` bits; just shrink the probe. 3. Halve `c` (`c >>= 1`) to test ranges of 8, 4, 2, 1 bits — **exactly 5 iterations total**. 4. After the loop, `n - x` yields the exact number of leading zeros. (During the search, `x` is reduced to the range `[1, 2)` in fixed-point terms, so subtracting `x` corrects the off-by-one accumulated in `n`.) #### Assembly code of clz ```assembly clz: addi t0, zero, 32 #t0 = n addi t1, zero, 16 #t1 = c add t3, a0, zero #t3 = x clz_Loop: srl t2, t3, t1 #t2 = y beq t2, zero, clz_y_0 sub t0, t0, t1 add t3, t2, zero #t3 = x clz_y_0: srli t1,t1,1 # c >> 1 bne t1, x0, clz_Loop sub a0,t0,t3 ret ``` --- ### uf8_decode The decoding formula is given as $$ D(b) = m \cdot 2^e + (2^e - 1) \cdot 16 $$ Where $$e = \lfloor b/16 \rfloor \quad \text{and} \quad m = b \bmod 16$$ #### C code of decode ```c /* * uf8 byte layout (8 bits) * * +--------+--------+ * | e e e e| m m m m| * +--------+--------+ * 7 .. 4 3 .. 0 * * e : 4-bit exponent * m : 4-bit mantissa */ uint32_t uf8_decode(uf8 fl) { uint32_t mantissa = fl & 0x0f; uint8_t exponent = fl >> 4; uint32_t offset = (0x7FFF >> (15 - exponent)) << 4; return (mantissa << exponent) + offset; } ``` 1. `mantissa = fl & 0x0F` → extract the 4-bit mantissa **m**. 2. `exponent = fl >> 4` → extract the 4-bit exponent **e**. 3. `offset = (0x7FFF >> (15 - exponent)) << 4` - Shifting right by `(15 - e)` yields `(2^e - 1)`. - Shifting left by 4 multiplies by 16 → `(2^e - 1) * 16`. 4. `return (mantissa << exponent) + offset` - `(mantissa << exponent)` is `m * 2^e`. - Adding `offset` implements the uf8 decode formula. #### Assembly code of uf8_decode ```assembly uf8_decode: addi sp, sp, -4 sw ra, 0(sp) andi t0, a0, 0x0f #t0 = mantissa srli t1, a0, 4 #t1 = exponent # offset = ((1 << exponent) - 1) << 4 addi t2, zero, 1 sll t2, t2, t1 addi t2, t2, -1 slli t2, t2, 4 #t2 = offset sll a0,t0, t1 add a0, a0 t2 lw ra 0(sp) # restore return addr addi sp, sp, 4 ret ``` --- ### uf8_encode The encoding formula is given as $$ E(v) =\begin{cases}v, & \text{if } v < 16 \\16e + \left\lfloor \dfrac{v - \text{offset}(e)}{2^e} \right\rfloor, & \text{otherwise}\end{cases} $$ Where $$\text{offset}(e) = (2^e - 1) \cdot 16$$ #### C code of encode ```c uf8 uf8_encode(uint32_t value) { /* Use CLZ for fast exponent calculation */ if (value < 16) return value; /* Find appropriate exponent using CLZ hint */ int lz = clz(value); int msb = 31 - lz; /* Start from a good initial guess */ uint8_t exponent = 0; uint32_t overflow = 0; if (msb >= 5) { /* Estimate exponent - the formula is empirical */ exponent = msb - 4; if (exponent > 15) exponent = 15; /* Calculate overflow for estimated exponent */ for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16; /* Adjust if estimate was off */ while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } } /* Find exact exponent */ while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; overflow = next_overflow; exponent++; } uint8_t mantissa = (value - overflow) >> exponent; return (exponent << 4) | mantissa; } ``` 1. If the value is small (<16), it returns it directly. 2. Otherwise, it uses `clz` to find the highest 1-bit, estimates the exponent, and adjusts it to fit the value. 3. Then it computes the mantissa from the remaining part. 4. Finally, it packs the `exponent (high 4 bits)` and `mantissa (low 4 bits)` into the 8-bit `uf8`code. #### Assembly code of uf8_encode ```assembly uf8_encode: li t0, 16 slt t0, a0 ,t0 bne t0, zero, return_value addi sp, sp, -28 sw s0, 24(sp) sw s1, 20(sp) sw s2, 16(sp) sw s3, 12(sp) sw s4, 8(sp) sw s5, 4(sp) sw ra, 0(sp) add s0, a0, zero #s0 = value #Find appropriate exponent using CLZ hint jal ra, clz add s1, a0, x0 #s1 = lz li t0, 31 sub s2, t0, s1 #s2 = msb add s3, zero, zero #s3 = exponent add s4, zero, zero #s4 = overflow #if(msb >= 5) li t0, 5 blt t0, s2, 1f addi s3, s2, -4 #if(exponent > 15) li t0, 15 ble s3, t0, 2f li s3, 15 # Calculate overflow for estimated exponent 2: li s5, 0 3: bge s5, s3, 4f slli s4, s4, 1 addi s4, s4, 16 addi s5, s5, 1 j 3b # Adjust if estimate was off 4: slt t0, x0, s3 slt t1, s0, s4 and t0, t0, t1 beq t0, x0, 1f addi s4, s4, -16 srli, s4, s4, 1 addi s3, s3, -1 j 4b #Find exact exponent 1: addi t0, x0, 15 bge s3, t0, encode_done slli s6, s4, 1 addi s6, s6, 16 blt s0, s6, encode_done mv s4, s6 addi s3, s3, 1 j 1b encode_done: sub s7, s0, s4 srl s7, s7, s3 slli a0, s3, 4 or a0, a0, s7 lw ra, 0(sp) lw s5, 4(sp) lw s4, 8(sp) lw s3, 12(sp) lw s2, 16(sp) lw s1, 20(sp) lw s0, 24(sp) addi sp, sp, 28 return_value: ret ``` --- ### Test #### C code of test ```c static bool test(void) { int32_t previous_value = -1; bool passed = true; for (int i = 0; i < 256; i++) { uint8_t fl = i; int32_t value = uf8_decode(fl); uint8_t fl2 = uf8_encode(value); if (fl != fl2) { printf("%02x: produces value %d but encodes back to %02x\n", fl, value, fl2); passed = false; } if (value <= previous_value) { printf("%02x: value %d <= previous_value %d\n", fl, value, previous_value); passed = false; } previous_value = value; } return passed; } ``` The `test()` function verifies the correctness of the uf8 encode/decode implementation. 1. Iterate over all `256` possible uf8 codes. For each i from 0 to 255, treat it as a uf8 code fl. 2. Round-trip check (decode → encode) - `value = uf8_decode(fl)` decodes the uf8 code into an integer. - `fl2 = uf8_encode(value)` encodes that integer back into uf8. - If `fl != fl2`, it outputs an error message and marks the test as failed. This ensures decoding followed by encoding preserves the original code. 3. Keeps track of the previous decode value - Ensures that each new decoded value is strictly greater than the previous one. - Updates previous_value on each iteration. 4. Final result: - Returns `true` only if all checks pass; otherwise returns `false`. #### Assembly code of test ```assembly Test: addi sp, sp, -4 sw ra, 0(sp) li s0, -1 li s1, 1 li s2, 0 test_for_loop: mv s3, s2 #f1 = i mv a0, s3 # jal ra, uf8_decode mv s4, a0 jal ra, uf8_encode mv s5, a0 beq s3, s5, 1f li s1, 0 1: bgt s4, s0, 2f li s1, 0 2: #previous_value mv s0, s4 addi s2, s2, 1 li t0, 256 blt s2, t0, test_for_loop # if (i < 256), to 1 done: mv a1, s1 # return passed lw ra 0(sp) # restore return addr addi sp, sp, 4 ret ``` --- ### Result ![image](https://hackmd.io/_uploads/HyX2BeZaxg.png) ![image](https://hackmd.io/_uploads/HykaSxZaxl.png) --- # Problem C in [Quiz 1](https://hackmd.io/@sysprog/arch2025-quiz1-sol) ## Abstract This problem implements a complete BF16 floating-point library. It includes conversions between 32-bit float and 16-bit BF16, classification functions (NaN, Inf, Zero), and implements a bfloat16 (BF16) arithmetic library. It also implements comparison functions for BF16. [The complete C code and assembly for Problem C](https://github.com/CHENSHIBO0000/ca2025-quizzes/tree/main/HW1/q1-bfloat16) --- ## Classification #### C code ```c #define BF16_NAN() ((bf16_t) {.bits = 0x7FC0}) #define BF16_ZERO() ((bf16_t) {.bits = 0x0000}) static inline bool bf16_isnan(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && (a.bits & BF16_MANT_MASK); } static inline bool bf16_isinf(bf16_t a) { return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) && !(a.bits & BF16_MANT_MASK); } static inline bool bf16_iszero(bf16_t a) { return !(a.bits & 0x7FFF); } ``` These functions help detect `NaN`, `infinity`, and `±0` via exponent and mantissa patterns. #### Assembly code ```assembly bf16_isnan: mv t0, a0 li t1, BF16_EXP_MASK and t2, t0, t1 li t3, BF16_MANT_MASK and t4, t0, t3 bne t2, t1, bf16_isnan_false beq t4, x0, bf16_isnan_false li a0, 1 ret bf16_isnan_false: li a0, 0 ret bf16_isinf: mv t0, a0 li t1, BF16_EXP_MASK and t2, t0, t1 li t3, BF16_MANT_MASK and t4, t0, t3 bne t2, t1, bf16_isinf_false bne t4, x0, bf16_isinf_false li a0, 1 ret bf16_isinf_false: li a0, 0 ret bf16_iszero: li t0, 0x7FFF and t1, a0, t0 beq t1, x0, bf16_iszero_true li a0, 0 ret bf16_iszero_true: li a0, 1 ret ``` --- ## Basic conversion #### C code of basic conversions ```c static inline bf16_t f32_to_bf16(float val) { uint32_t f32bits; memcpy(&f32bits, &val, sizeof(float)); if (((f32bits >> 23) & 0xFF) == 0xFF) return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF}; f32bits += ((f32bits >> 16) & 1) + 0x7FFF; return (bf16_t) {.bits = f32bits >> 16}; } static inline float bf16_to_f32(bf16_t val) { uint32_t f32bits = ((uint32_t) val.bits) << 16; float result; memcpy(&result, &f32bits, sizeof(float)); return result; } ``` These two functions handle conversion between 32-bit float and 16-bit bfloat16 formats: - `f32_to_bf16` Extracts the 32-bit bit pattern of a float .If the exponent is `0xFF` (NaN or Inf), it directly truncates the upper 16 bits. Otherwise, it performs round-to-nearest-even by adding `0x7FFF` plus the least significant bit of the truncated part before shifting right by 16 bits. - `bf16_to_f32` Shifts the 16-bit BF16 value into the upper half of a 32-bit word and fills the lower bits with zeros, then reinterprets it as a float. #### Assembly code of basic conversions ```assembly f32_to_bf16: mv t0, a0 srli t1, t0, 23 andi t1, t1, 0xFF li t2, 0xFF beq t1, t2, 1f srli t1, t0, 16 andi t1, t1, 1 li t2, 0x7FFF add t3, t1, t2 add t0, t0, t3 srli t0, t0, 16 j 2f 1: srli t0, t0, 16 li t1, 0xFFFF and t0, t0, t1 2: mv a0, t0 ret bf16_to_f32: slli a0, a0, 16 ret ``` ## Comparison operations #### C code of comparison operations ```c static inline bool bf16_eq(bf16_t a, bf16_t b) { if (bf16_isnan(a) || bf16_isnan(b)) return false; if (bf16_iszero(a) && bf16_iszero(b)) return true; return a.bits == b.bits; } static inline bool bf16_lt(bf16_t a, bf16_t b) { if (bf16_isnan(a) || bf16_isnan(b)) return false; if (bf16_iszero(a) && bf16_iszero(b)) return false; bool sign_a = (a.bits >> 15) & 1, sign_b = (b.bits >> 15) & 1; if (sign_a != sign_b) return sign_a > sign_b; return sign_a ? a.bits > b.bits : a.bits < b.bits; } static inline bool bf16_gt(bf16_t a, bf16_t b) { return bf16_lt(b, a); } ``` - `bf16_eq(a, b)` Returns `false` if either operand is `NaN`. Treats `±0` as equal. Otherwise, compares raw 16-bit patterns for exact equality. - `bf16_lt(a, b)` Returns `false` if either is `NaN` or both are zeros. Implements signed ordering: • If `signs differ`, negative is less than positive. • If `signs are the same`, compares the 16-bit patterns. - `bf16_gt(a, b)` Swap the operands and reuse the `less-than` logic. #### Assembly code of comparison operations <details> <summary>Click to show code</summary> ```asm bf16_eq: addi sp, sp, -16 sw ra, 12(sp) sw s0, 8(sp) sw s1, 4(sp) sw s2, 0(sp) mv s0, a0 mv s1, a1 jal ra, bf16_isnan mv s2, a0 mv a0, s1 jal ra, bf16_isnan or s2, s2, a0 bne s2, x0, bf16_eq_false mv a0, s0 jal ra, bf16_iszero beq a0, x0, bf16_eq_check mv a0, s1 jal ra, bf16_iszero beq a0, x0, bf16_eq_check bf16_eq_true: li a0, 1 j bf16_eq_end bf16_eq_check: beq s0, s1, bf16_eq_true bf16_eq_false: li a0, 0 j bf16_eq_end bf16_eq_end: lw ra, 12(sp) lw s0, 8(sp) lw s1, 4(sp) lw s2, 0(sp) addi sp, sp, 16 ret bf16_lt: addi sp, sp, -24 sw ra, 20(sp) sw s0, 16(sp) sw s1, 12(sp) sw s2, 8(sp) sw s3, 4(sp) sw s4, 0(sp) mv s0, a0 mv s1, a1 jal ra, bf16_isnan mv s4, a0 mv a0, s1 jal ra, bf16_isnan or s4, s4, a0 bne s4, x0, bf16_lt_false mv a0, s0 jal ra, bf16_iszero beq a0, x0, bf16_lt_check_sign mv a0, s1 jal ra, bf16_iszero beq a0, x0, bf16_lt_check_sign bf16_lt_false: li a0, 0 j bf16_lt_end bf16_lt_check_sign: srli s2, s0, 15 # sign_a srli s3, s1, 15 # sign_b andi s2, s2, 1 andi s3, s3, 1 bne s2, s3, bf16_lt_sign_diff bgtz s2, 1f blt s0, s1, bf16_lt_true j bf16_lt_false 1: blt s0, s1, bf16_lt_false j bf16_lt_true bf16_lt_true: li a0, 1 j bf16_lt_end bf16_lt_end: lw ra, 20(sp) lw s0, 16(sp) lw s1, 12(sp) lw s2, 8(sp) lw s3, 4(sp) lw s4, 0(sp) addi sp, sp, 24 ret bf16_lt_sign_diff: blt s2, s3, bf16_lt_false j bf16_lt_true bf16_gt: addi sp, sp, -4 sw ra, 0(sp) mv t0, a0 mv a0, a1 mv a1, t0 jal ra, bf16_lt lw ra, 0(sp) addi sp, sp, 4 ret ``` </details> ## Arithmetic [The C code of Arithmetic](https://github.com/CHENSHIBO0000/ca2025-quizzes/blob/main/HW1/q1-bfloat16/q1-bfloat16.c) ### Add - First, the 16-bit BF16 values are split into `sign`, `exponent`, and `mantissa`. - Special cases: `NaN`, `infinity`, and `zero` are handled first according to IEEE 754 rules. - Right-shifted the mantissa of the smaller exponent, so both numbers share the same exponent. - Addition or subtraction: If the signs are the same → add the mantissas and then normalize if needed . If the signs differ → subtract the smaller mantissa from the larger, then left-shift until normalization. - Repack the result: Combine the final sign, exponent, and mantissa into a new BF16 value. #### Assembly code of add function <details> <summary>Click to show code</summary> ```assembly bf16_add: addi sp, sp, -44 sw ra, 40(sp) sw s0, 36(sp) sw s1, 32(sp) sw s2, 28(sp) sw s3, 24(sp) sw s4, 20(sp) sw s5, 16(sp) sw s6, 12(sp) sw s7, 8(sp) sw s8, 4(sp) sw s9, 0(sp) srli s0, a0, 15 # Extract sign of a srli s1, a1, 15 # Extract sign of b andi s0, s0, 0x1 andi s1, s1, 0x1 srli s2, a0, 7 # Extract exponent of a srli s3, a1, 7 # Extract exponent of b andi s2, s2, 0xFF andi s3, s3, 0xFF andi s4, a0, 0x7F # Extract mantissa of a andi s5, a1, 0x7F # Extract mantissa of b ##if (exp_a == 0xFF) li t0, 0XFF bne s2, t0, EXP_A_NOT_EQUAL_255 beq s4, x0, 1f j return_a 1: bne s3, t0, return_a bnez s5, return_b beq s0, s1, return_b li a0, BF16_NAN j add_done EXP_A_NOT_EQUAL_255: ##if (exp_b == 0xFF) bne s3, t0, EXP_B_NOT_EQUAL_255 j return_b EXP_B_NOT_EQUAL_255: ##if(!exp_a && !mant_a) bne s2, x0, CHECK_B_ZERO beq s4, x0, return_b CHECK_B_ZERO: ##if(!exp_b && !mant_b) bne s3, x0, CHECK_A_EXPONENTS beq s5, x0, return_a CHECK_A_EXPONENTS: ##if (exp_a == 0) beq s2, x0, CHECK_B_EXPONENTS ori s4, s4, 0x80 CHECK_B_EXPONENTS: ##if (exp_b == 0) beq s3, x0, RESULT_EXP ori s5, s5, 0x80 #result_exp RESULT_EXP: sub s6, s2, s3 # exp_diff = exp_a - exp_b ##if (exp_diff > 0) blt x0, s6, EXP_DIFF_POS blt s6, x0, EXP_DIFF_NEG #result_exp = exp_a; mv s8, s2 # result_exp = exp_a j BOTH_SIGN_CHECK EXP_DIFF_POS: mv s8, s2 # result_exp = exp_a li t0, 8 blt t0, s6, return_a srl s5, s5, s6 # mant_b >>= exp_diff j BOTH_SIGN_CHECK EXP_DIFF_NEG: mv s8, s3 # result_exp = exp_b li t0, -8 blt s6, t0, return_b neg t0, s6 srl s4, s4, t0 # mant_a >>= -exp_diff j BOTH_SIGN_CHECK BOTH_SIGN_CHECK: beq s0, s1, SAME_SIGN bgeu s4, s5, L_ge # if (mant_a >= mant_b) goto L_ge; mv s7, s1 # result_sign = sign_b sub s9, s5, s4 # result_mant = mant_b - mant_a j WHILE_LOOP L_ge: mv s7, s0 # result_sign = sign_a sub s9, s4, s5 # result_mant = mant_a - mant_b WHILE_LOOP: ## if (!result_mant) return 0 beqz s9, return_zero normalize_loop: andi t0, s9, 0x80 bnez t0, COMBINE_RESULT slli s9, s9, 1 # result_mant <<= 1 addi s8, s8, -1 # --result_exp bge x0, s8, return_zero # if (result_exp <= 0) j normalize_loop return_zero: mv a0, x0 j add_done SAME_SIGN: mv s7, s0 # result_sign = sign_a add s9, s4, s5 # result_mant = mant_a + mant_b andi t0, s9, 0x100 beq t0, x0, COMBINE_RESULT srli s9, s9, 1 # result_mant >>=1 addi s8, s8, 1 # result_exp++ li t2, 0xFF bge s8, t2, INFINITY COMBINE_RESULT: mv a0, s7 slli a0, a0, 15 andi s8, s8, 0xFF slli s8, s8, 7 or a0, a0, s8 andi s9, s9, 0x7F or a0, a0, s9 j add_done INFINITY: mv a0, s7 slli a0, a0, 15 li t0, 0X7F80 or a0, a0, t0 j add_done return_a: j add_done return_b: mv a0, a1 j add_done add_done: lw s9, 0(sp) lw s8, 4(sp) lw s7, 8(sp) lw s6, 12(sp) lw s5, 16(sp) lw s4, 20(sp) lw s3, 24(sp) lw s2, 28(sp) lw s1, 32(sp) lw s0, 36(sp) lw ra, 40(sp) addi sp, sp, 44 ret ``` </details> --- ### Sub This BF16 subtraction works in a very simple way: - Flip the sign bit of operand b by `XOR` with `BF16_SIGN_MASK (0x8000)`. - Then it simply calls `bf16_add(a, b)`, reusing the addition logic. #### Assembly code of sub function <details> <summary>Click to show code</summary> ```assembly bf16_sub: addi sp, sp, -4 sw ra, 0(sp) li t0,0x8000 xor a1, a1, t0 jal ra, bf16_add lw ra, 0(sp) addi sp, sp, 4 ret ``` </details> --- ### Mul The Mulitiply funcion of BF16 work as follow: - Field extraction: Get sign, 8-bit exponent, and 7-bit mantissa; result sign = XOR of signs. - Special cases: Handle `NaN`, `Inf`, and `±0` (including 0×Inf → NaN). - Normalization: For subnormals, left-shift mantissa until MSB=1 and track exp_adjust; for normals, just add the hidden 1-bit. - Computation: Multiply mantissas; add exponents, subtract bias, and apply `exp_adjust`. - Normalization of result: Shift right by 7 or 8 bits depending on overflow, update exponent (truncation only, no rounding). - Edge handling: Exponent overflow : `Inf`; underflow : `subnormal or 0` . - Repack result: Combine as `sign | (exp << 7) | mant`. #### Assembly code of mul function <details> <summary>Click to show code</summary> ```Assembly bf16_mul: addi sp, sp, -44 sw ra, 40(sp) sw s0, 36(sp) sw s1, 32(sp) sw s2, 28(sp) sw s3, 24(sp) sw s4, 20(sp) sw s5, 16(sp) sw s6, 12(sp) sw s7, 8(sp) sw s8, 4(sp) sw s9, 0(sp) srli s0, a0, 15 # Extract sign of a srli s1, a1, 15 # Extract sign of b andi s0, s0, 0x1 andi s1, s1, 0x1 srli s2, a0, 7 # Extract exponent of a srli s3, a1, 7 # Extract exponent of b andi s2, s2, 0xFF andi s3, s3, 0xFF andi s4, a0, 0x7F # Extract mantissa of a andi s5, a1, 0x7F # Extract mantissa of b xor s6, s0, s1 # Compute result sign ##if (exp_a == 0xFF) li t0, 0XFF bne s2, t0, CHECK_EXP_B_FF bne s4, x0, RETURN_A_MUL ## if (!exp_b && !mant_b) #bne s3, x0, RETURN_INFINITY_MUL #bne s5, x0, RETURN_INFINITY_MUL or t0, s3, s5 beqz t0, RETURN_NAN_MUL j RETURN_INFINITY_MUL CHECK_EXP_B_FF: ##if (exp_b == 0xFF) li t0, 0XFF bne s3, t0, CHECK_ZERO bne s5, x0, RETURN_B_MUL ## if (!exp_b && !mant_b) #bne s2, x0, RETURN_INFINITY_MUL #bne s4, x0, RETURN_INFINITY_MUL or t0, s2, s4 beqz t0, RETURN_NAN_MUL j RETURN_INFINITY_MUL CHECK_ZERO: ##if ((!exp_a && !mant_a) || (!exp_b && !mant_b)) or t0, s2, s4 beqz t0, RETURN_ZERO_MUL or t1, s3, s5 beqz t1, RETURN_ZERO_MUL ##int16_t exp_adjust = 0; li s7, 0 ##Check_exp_a beqz s2, A_SUBNORMAL ori s4, s4, 0x80 j CHECK_B_SUBNORMAL A_SUBNORMAL: WHILE_A: andi t0, s4, 0X80 bnez t0, 1f slli s4, s4, 1 addi s7, s7, -1 j WHILE_A 1: li s2, 1 j CHECK_B_SUBNORMAL ##Check_exp_b CHECK_B_SUBNORMAL: beqz s3, B_SUBNORMAL ori s5, s5, 0x80 j MUL B_SUBNORMAL: WHILE_B: andi t0, s5, 0X80 bnez t0, 2f slli s5, s5, 1 addi s7, s7, -1 j WHILE_B 2: li s3, 1 MUL: # s4 = A (16-bit), s5 = B (16-bit) → s8 = A*B (32-bit) li s8, 0 mv t0, s4 mv t1, s5 li t3, 16 1: andi t2, t1, 1 beqz t2, 2f add s8, s8, t0 2: slli t0, t0, 1 srli t1, t1, 1 addi t3, t3, -1 bnez t3, 1b add s9, s2, s3 li t0, 127 sub s9, s9, t0 add s9, s9, s7 ## if (result_mant & 0x8000) li t0,0x8000 and t0, s8, t0 beqz t0, 3f srli s8, s8, 8 andi s8, s8, 0x7F addi s9, s9, 1 j CHECK_OVERFLOW 3: srli s8, s8, 7 andi s8, s8, 0x7F ## if (result_exp >= 0xFF) CHECK_OVERFLOW: li t0, 0XFF blt s9, t0, no_overflow slli t1, s6, 15 li t0, 0x7F80 or a0, t1, t0 j mul_done no_overflow: bgt s9, x0, FINISH_COMBINE_RESULT_MUL li t0, -6 blt s9, t0, RETURN_ZERO_MUL li t0, 1 sub t0, t0, s9 srl s8, s8, t0 li s9, 0 FINISH_COMBINE_RESULT_MUL: slli s6, s6, 15 andi s9, s9, 0xFF slli s9, s9, 7 andi s8, s8, 0x7F or a0, s6, s9 or a0, a0, s8 j mul_done RETURN_NAN_MUL: li a0, BF16_NAN j mul_done RETURN_INFINITY_MUL: slli a0, s6, 15 li t0, 0x7F80 or a0, a0, t0 j mul_done RETURN_A_MUL: j mul_done RETURN_B_MUL: mv a0, a1 j mul_done RETURN_ZERO_MUL: mv a0, s6 slli a0, a0, 15 j mul_done mul_done: lw s9, 0(sp) lw s8, 4(sp) lw s7, 8(sp) lw s6, 12(sp) lw s5, 16(sp) lw s4, 20(sp) lw s3, 24(sp) lw s2, 28(sp) lw s1, 32(sp) lw s0, 36(sp) lw ra, 40(sp) addi sp, sp, 44 ret ``` </details> --- ### Div - First check the special cases. - Hidden bit: If exponent ≠ 0, set hidden 1 `(mant |= 0x80)` for both operands. - Restoring division: Build a 16-bit quotient by long division: `dividend = mant_a << 15` ; `divisor = mant_b`; loop 16 steps for shifting/subtracting. - Exponent math: `result_exp = exp_a − exp_b + BF16_EXP_BIAS`, then adjust for subnormals. - Normalize quotient and check overflow/underflow. - Pack result. #### Assembly code of div function <details> <summary>Click to show code</summary> ```Assembly bf16_div: addi sp, sp, -48 sw s10, 44(sp) sw ra, 40(sp) sw s0, 36(sp) sw s1, 32(sp) sw s2, 28(sp) sw s3, 24(sp) sw s4, 20(sp) sw s5, 16(sp) sw s6, 12(sp) sw s7, 8(sp) sw s8, 4(sp) sw s9, 0(sp) srli s0, a0, 15 # Extract sign of a srli s1, a1, 15 # Extract sign of b andi s0, s0, 0x1 andi s1, s1, 0x1 srli s2, a0, 7 # Extract exponent of a srli s3, a1, 7 # Extract exponent of b andi s2, s2, 0xFF andi s3, s3, 0xFF andi s4, a0, 0x7F # Extract mantissa of a andi s5, a1, 0x7F # Extract mantissa of b xor s6, s0, s1 # Compute result sign ##if (exp_b == 0xFF) li t0, 0XFF bne s3, t0, CHECK_ZERO_B bne s5, x0, RETURN_B_DIV bne s2, t0, RETURN_ZERO_DIV bne s4, x0, RETURN_ZERO_DIV j RETURN_NAN_DIV CHECK_ZERO_B: ## if (!exp_b && !mant_b) or t0, s3, s5 #### if (!exp_a && !mant_a) or t1, s2, s4 or t2, t0, t1 beqz t2, RETURN_NAN_DIV beqz t0, RETURN_INFINITY_DIV ##if (exp_a == 0xFF) li t0, 0XFF bne s2, t0, CHECK_ZERO_A bne s4, x0, RETURN_A_DIV j RETURN_INFINITY_DIV CHECK_ZERO_A: ## if (!exp_a && !mant_a) or t0, s2, s4 beqz t0, RETURN_ZERO_DIV beq s2, x0, 1f ori s4, s4, 0x80 1: beq s3, x0, 2f ori s5, s5, 0x80 2: slli s7, s4, 15 mv s8, s5 li s9, 0 li t0, 16 li t1, 0 li t2, 15 for: slli s9, s9, 1 sub t3, t2, t1 sll t4, s8, t3 bltu s7, t4, 3f sub s7, s7, t4 ori s9, s9, 1 3: addi t1, t1, 1 bne t1, t0, for #result_exp sub s10, s2, s3 addi s10, s10, 127 bne s2, x0, 4f addi s10, s10, -1 4: bne s3, x0, 5f addi s10, s10, 1 5: #t1:1 #t2:0x8000 li t1, 1 li t2, 0x8000 and t0, s9, t2 beqz t0, 6f srli s9, s9, 8 j 7f 6: #t0:quotient & 0x8000 #result_exp > 1 while_loop: and t0, s9, t2 bnez t0, end_while ble s10, t1, end_while slli s9, s9, 1 addi s10, s10, -1 j while_loop end_while: srli s9, s9, 8 7: andi s9, s9, 0x7F li t0, 0xFF blt s10, t0, 8f j RETURN_INFINITY_DIV 8: blez s10, RETURN_ZERO_DIV j FINISH_COMBINE_RESULT_DIV #return result RETURN_INFINITY_DIV: slli a0, s6, 15 li t0, 0x7F80 or a0, a0, t0 j div_done RETURN_ZERO_DIV: slli s6, s6, 15 mv a0, s6 j div_done RETURN_NAN_DIV: li a0, BF16_NAN j div_done RETURN_B_DIV: mv a0, a1 j div_done RETURN_A_DIV: j div_done FINISH_COMBINE_RESULT_DIV: slli s6, s6, 15 andi s10, s10, 0xFF slli s10, s10, 7 andi s9, s9, 0x7F or a0, s6, s10 or a0, a0, s9 j div_done div_done: lw s9, 0(sp) lw s8, 4(sp) lw s7, 8(sp) lw s6, 12(sp) lw s5, 16(sp) lw s4, 20(sp) lw s3, 24(sp) lw s2, 28(sp) lw s1, 32(sp) lw s0, 36(sp) lw ra, 40(sp) lw s10, 44(sp) addi sp, sp, 48 ret ``` </details> --- ### Sqrt The square root is computed as $$ \sqrt{a} = \sqrt{2^{e_a} \times m_a} = 2^{e_a/2} \times \sqrt{m_a} $$ When computing the square root, the `bf16_sqrt` function first handles all special cases—such as zero, infinity, NaN, and negative inputs—to remain IEEE-754 compliant. Then, it processes the exponent by dividing it by two; if the exponent is odd, the mantissa is adjusted so it stays within the normalized range. Next, the algorithm performs a `binary search` in integer space to approximate the square root of the mantissa, using a scaling factor of 128 to represent the implicit 1.0. Finally, during normalization, the mantissa is adjusted to stay within [128, 256), the exponent is corrected if needed, and the hidden bit is removed to produce a 7-bit mantissa. This method achieves an efficient and monotonic square root computation using only integer arithmetic operations. #### Assembly code of sqrt function <details> <summary>Click to show code</summary> ```Assembly bf16_sqrt: addi sp, sp, -44 sw ra, 40(sp) sw s0, 36(sp) sw s1, 32(sp) sw s2, 28(sp) sw s3, 24(sp) sw s4, 20(sp) sw s5, 16(sp) sw s6, 12(sp) sw s7, 8(sp) sw s8, 4(sp) sw s9, 0(sp) srli s0, a0, 15 # Extract sign andi s0, s0, 0x1 srli s1, a0, 7 # Extract exponent andi s1, s1, 0xFF andi s2, a0, 0x7F # Extract mantissa ## Handle special cases li t0, 0XFF bne s1, t0, 1f bne s2, x0, RETURN_A_SQRT bne s0, x0, RETURN_NAN_SQRT j RETURN_A_SQRT 1: or t0, s1, s2 beq t0, x0, RETURN_ZERO_SQRT bne s0, x0, RETURN_NAN_SQRT beq s1, x0, RETURN_ZERO_SQRT addi s8, s1, -127 ori s3, s2, 0x80 andi t0, s8, 0x1 beq t0, x0, EVEN_EXP slli s3, s3, 1 addi s9, s8, -1 srai s9, s9, 1 addi s9, s9, 127 j BINARY_SEARCH EVEN_EXP: srai s9, s8, 1 addi s9, s9, 127 j BINARY_SEARCH BINARY_SEARCH: li s4, 90 # low = 90 li s5, 256 # high = 256 li s6, 128 # result = 128 ## t0: mid ## t1: sq BINARY_SEARCH_LOOP: bgt s4, s5, NORMALIZE add t0, s4, s5 srli t0, t0, 1 # mid = (low + high) >> 1 li t1, 0 mv t2, t0 BINARY_SEARCH_LOOP_MUL: add t1, t1, t0 addi t2, t2, -1 bne t2, x0, BINARY_SEARCH_LOOP_MUL srli t1, t1, 7 # sq = (mid * mid) >> 7 bgtu t1, s3, 2f mv s6, t0 addi s4, t0, 1 j BINARY_SEARCH_LOOP 2: addi s5, t0, -1 j BINARY_SEARCH_LOOP NORMALIZE: li t0, 256 blt s6, t0, CHECK_LOWER_BOUND srli s6, s6, 1 addi s9, s9, 1 j CHECK_OVER_UNDER_FLOW CHECK_LOWER_BOUND: li t0, 128 bge s6, t0, CHECK_OVER_UNDER_FLOW li t1, 1 NORMALIZE_LOOP: bge s6, t0, CHECK_OVER_UNDER_FLOW ble s9, t1 , CHECK_OVER_UNDER_FLOW slli s6, s6, 1 addi s9, s9, -1 j NORMALIZE_LOOP CHECK_OVER_UNDER_FLOW: andi s7, s6, 0x7F li t0, 0xFF bge s9, t0, RETURN_INFINITY_SQRT ble s9, x0, RETURN_ZERO_SQRT j FINISH_COMBINE_RESULT_SQRT #RETURN result RETURN_A_SQRT: j sqrt_done RETURN_NAN_SQRT: li a0, BF16_NAN j sqrt_done RETURN_ZERO_SQRT: li a0, BF16_ZERO j sqrt_done RETURN_INFINITY_SQRT: li a0, 0x7F80 j sqrt_done FINISH_COMBINE_RESULT_SQRT: andi a0, s9, 0XFF slli a0, a0, 7 or a0, a0, s7 j sqrt_done sqrt_done: lw s9, 0(sp) lw s8, 4(sp) lw s7, 8(sp) lw s6, 12(sp) lw s5, 16(sp) lw s4, 20(sp) lw s3, 24(sp) lw s2, 28(sp) lw s1, 32(sp) lw s0, 36(sp) lw ra, 40(sp) addi sp, sp, 44 ret ``` </details> ## Test ### Float32 ⇄ BF16 Conversion: | # | orig_f32 (Hex) | conv_bf16 (Hex) | conv_f32 (Hex) | Description | | :-: | :------------: | :-------------: | :------------: | :---------- | | 1 | 0x00000000 | 0x0000 | 0x00000000 | 0.0 | | 2 | 0x3f800000 | 0x3f80 | 0x3f800000 | 1.0 | | 3 | 0xbf800000 | 0xbf80 | 0xbf800000 | -1.0 | | 4 | 0x3f000000 | 0x3f00 | 0x3f000000 | 0.5 | | 5 | 0xbf000000 | 0xbf00 | 0xbf000000 | -0.5 | | 6 | 0x40490fd0 | 0x4049 | 0x40490000 | 3.140625 | | 7 | 0xc0490fd0 | 0xc049 | 0xc0490000 | -3.140625 | | 8 | 0x501502f9 | 0x5015 | 0x50150000 | 1e10 | | 9 | 0xd01502f9 | 0xd015 | 0xd0150000 | -1e10 | ### BF16 Addition Test: | # | Input1 (Hex) | Input2 (Hex) | Expected Output (Hex) | Description | | :-: | :----------: | :----------: | :-------------------: | :--------------------------- | | 1 | 0x3f80 | 0x4000 | 0x4040 | 1.0 + 2.0 = 3.0 | | 2 | 0x4049 | 0x402e | 0x40bb | 3.140625 + 2.71875 = 5.84375 | | 3 | 0x3f80 | 0xc000 | 0xbf80 | 1.0 + (-2.0) = -1.0 | | 4 | 0xc000 | 0x3f80 | 0xbf80 | -2.0 + 1.0 = -1.0 | | 5 | 0x0000 | 0x3f80 | 0x3f80 | 0.0 + 1.0 = 1.0 | | 6 | 0x3f80 | 0x0000 | 0x3f80 | 1.0 + 0.0 = 1.0 | | 7 | 0x7f80 | 0x3f80 | 0x7f80 | +Inf + 1.0 = +Inf | | 8 | 0x3f80 | 0x7f80 | 0x7f80 | 1.0 + +Inf = +Inf | | 9 | 0xff80 | 0x3f80 | 0xff80 | -Inf + 1.0 = -Inf | | 10 | 0x3f80 | 0xff80 | 0xff80 | 1.0 + -Inf = -Inf | | 11 | 0x7f62 | 0x7f62 | 0x7f80 | 3e38 + 3e38 = +Inf | ### BF16 Subtraction Test: | # | Input1 (Hex) | Input2 (Hex) | Expected Output (Hex) | Description | | :-: | :----------: | :----------: | :-------------------: | :---------------------------- | | 1 | 0x4000 | 0x3f80 | 0x3f80 | 2.0 - 1.0 = 1.0 | | 2 | 0x4049 | 0x402e | 0x3ed8 | 3.140625 - 2.71875 = 0.421875 | | 3 | 0x3f80 | 0xc000 | 0x4040 | 1.0 - (-2.0) = 3.0 | ### BF16 Multiplication Test: | # | Input1 (Hex) | Input2 (Hex) | Expected Output (Hex) | Description | | :-: | :----------: | :----------: | :-------------------: | :----------------------- | | 1 | 0x4040 | 0x4080 | 0x4140 | 3.0 × 4.0 = 12.0 | | 2 | 0x4049 | 0x402e | 0x4108 | 3.140625 × 2.71875 = 8.5 | | 3 | 0x7f80 | 0x3f80 | 0x7f80 | +Inf × 1.0 = +Inf | | 4 | 0x7f80 | 0x0000 | 0x7fc0 | +Inf × 0.0 = NaN | ### BF16 Division Test: | # | Input1 (Hex) | Input2 (Hex) | Expected Output (Hex) | Description | | :-: | :----------: | :----------: | :-------------------: | :----------------- | | 1 | 0x4120 | 0x4000 | 0x40a0 | 10.0 / 2.0 = 5.0 | | 2 | 0xc120 | 0x4000 | 0xc0a0 | -10.0 / 2.0 = -5.0 | | 3 | 0x4128 | 0x3f00 | 0x41a8 | 10.5 / 0.5 = 21.0 | | 4 | 0x7f80 | 0x7f80 | 0x7fc0 | +Inf / +Inf = NaN | | 5 | 0x3f80 | 0x7f80 | 0x0000 | 1.0 / +Inf = 0.0 | | 6 | 0x7f80 | 0x3f80 | 0x7f80 | +Inf / 1.0 = +Inf | | 7 | 0x3f80 | 0x0000 | 0x7f80 | 1.0 / 0.0 = +Inf | ### BF16 Square Root Test: | # | Input (Hex) | Expected Output (Hex) | Description | | :-: | :---------: | :-------------------: | :---------- | | 1 | 0x4080 | 0x4000 | √4.0 = 2.0 | | 2 | 0x4110 | 0x4040 | √9.0 = 3.0 | | 3 | 0x42a2 | 0x4110 | √81.0 = 9.0 | ### BF16 Rounding Tests: | # | Input f32 (Hex) | Rounded Result f32 (Hex) | Notes | | :-: | :-------------: | :----------------------: | :--------------------------------------- | | 1 | `0x3FC00000` | `0x3FC00000` | Exactly representable (1.5) | | 2 | `0x3F800347` | `0x3F800000` | Near 1.0; rounds to nearest even → 1.0. | [The assembly code of TEST](https://github.com/CHENSHIBO0000/ca2025-quizzes/blob/main/HW1/q1-bfloat16/q1_bfloat16.S) The tests begin with Float32–BF16 conversion. And this dataset also designed to verify the accuracy and consistency of BF16 arithmetic operations, including addition, subtraction, multiplication, division, and square root. ## Result ![image](https://hackmd.io/_uploads/rkc985GTxe.png) ![image](https://hackmd.io/_uploads/ByYkwqzpxl.png) # Compute Hamming Distance with clz ### Motivation I chose `Hamming distance` as my topic because it is a simple yet powerful concept that connects both software and hardware design. It provides an intuitive way to measure the difference between binary patterns, which is essential in many real-world systems. From error detection in communication to similarity measurement in machine learning and bit-level operations in computer architecture, Hamming distance offers a clear and practical foundation for understanding data integrity and pattern comparison. To implement this problem efficiently, I utilized the `CLZ` instruction. `CLZ` allows us to quickly locate the position of the most significant bit (MSB) in a binary number, which can then be used to optimize bitwise operations. By combining CLZ with XOR operations, we can calculate the Hamming distance between two integers efficiently. [The complete code](https://github.com/CHENSHIBO0000/ca2025-quizzes/tree/main/HW1/leetcode) --- ### Problem [leetcode 461. Hamming Distance](https://leetcode.com/problems/hamming-distance/) The **Hamming distance** between two integers is the number of positions at which the corresponding bits are different. Given two integers `x` and `y`, return the **Hamming distance** between them. ``` Input: x = 1, y = 4 Output: 2 Explanation: 1 (0 0 0 1) 4 (0 1 0 0) ↑ ↑ The above arrows point to positions where the corresponding bits are different. ``` --- ### Hamming distance without CLZ #### C code ```c int hammingDistance(uint32_t x, uint32_t y) { uint32_t diff = x ^ y; int hamming_count = 0; if(diff==0) return 0; // Count remaining '1's for (int i = 0; i <= 32 ; i++) { diff <<= 1; if (diff & 0x80000000) hamming_count++; if(!diff) break; } ``` This function calculates the **Hamming distance** by first using XOR (`x ^ y`) to find which bits differ between the two numbers. Then it shifts the bits left one by one and counts how many `1`s appear in the most significant bit (MSB). The total count gives the number of different bits — the Hamming distance. #### Assembly code ```assembly hammingDistance: addi sp, sp -24 sw ra, 20(sp) sw s0, 16(sp) sw s1, 12(sp) sw s2, 8(sp) sw s3, 4(sp) sw s4, 0(sp) mv s0, a0 mv s1, a1 xor s2, s1, s0 li s3, 0 beq s2, x0, RETURN_ZERO li t0, 0 li t1, 32 li t3, 0x80000000 for: and t2, s2, t3 beq t2, x0, ZERO addi s3, s3, 1 ZERO: beq s2, x0, RETURN_DIS addi t0, t0, 1 slli s2, s2, 1 blt t0, t1, for j RETURN_DIS RETURN_ZERO: li a0, 0 j RETURN RETURN_DIS: mv a0, s3 j RETURN RETURN: lw s4, 0(sp) lw s3, 4(sp) lw s2, 8(sp) lw s1, 12(sp) lw s0, 16(sp) lw ra, 20(sp) addi sp, sp, 24 ret ``` --- ### Hamming distance with CLZ #### C code ```c int hammingDistance(uint32_t x, uint32_t y) { uint32_t diff = x ^ y; int hamming_count = 1; // Shift out the leading zeros first uint32_t leading_zeros = clz(diff); if(diff==0) return 0; diff <<= leading_zeros; // Count remaining '1's for (int i = 0; i < 31 - leading_zeros; i++) { diff <<= 1; if (diff & 0x80000000) hamming_count++; if(!diff) break; } return hamming_count; } static inline unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } ``` #### This version does: 1. Compute `diff = x ^ y` so that each `1` marks a bit position where `x` and `y` differ. 2. Use `clz(diff)` to get the number of leading zeros. This locates the most-significant `1` quickly. 3. Left-shift `diff` by `leading_zeros` so the highest `1` moves into the MSB (bit 31). 4. Initialize the count to `1` (to account for that MSB), then keep left-shifting and increment the count whenever the MSB is `1`. Stop early if `diff` becomes `0`. 5. Return the total count = number of differing bits (the Hamming distance). #### Assembly code ```assembly hammingDistance: addi sp, sp -24 sw ra, 20(sp) sw s0, 16(sp) sw s1, 12(sp) sw s2, 8(sp) sw s3, 4(sp) sw s4, 0(sp) mv s0, a0 mv s1, a1 xor s2, s1, s0 li s3, 1 beq s2, x0, RETURN_ZERO mv a0, s2 jal ra,clz mv s4, a0 sll s2, s2, s4 li t0, 31 sub t1, t0, s4 li t0, 0 li t3, 0x80000000 for: slli s2, s2, 1 and t2, s2, t3 beq t2, x0, ZERO addi s3, s3, 1 ZERO: beq s2, x0, RETURN_DIS addi t0, t0, 1 blt t0, t1, for j RETURN_DIS RETURN_ZERO: li a0, 0 j RETURN RETURN_DIS: mv a0, s3 j RETURN RETURN: lw s4, 0(sp) lw s3, 4(sp) lw s2, 8(sp) lw s1, 12(sp) lw s0, 16(sp) lw ra, 20(sp) addi sp, sp, 24 ret ######################################### # # # # # clz # # # # # ######################################### clz: addi sp, sp -4 sw ra, 0(sp) li t0, 32 li t1, 16 mv t3, a0 clz_Loop: srl t2, t3, t1 beq t2, x0, SHIFT sub t0, t0, t1 # t 0 = t1 - t2 mv t3, t2 SHIFT: srli t1, t1, 1 bne t1, x0, clz_Loop sub a0, t0, t3 lw ra, 0(sp) addi sp, sp,4 ret ``` ### Test This table provides a wide coverage of bit-difference cases to verify the correctness of the Hamming distance function. - **Basic checks**: identical values, single-bit differences at LSB and MSB, and all-bit differences (0 vs FFFFFFFF). - **Alternating patterns**: cases like `0xAAAAAAAA` vs `0x55555555` to test alternating 1–0 bit flips. - **Partial and shifted patterns**: examples with nibbles, byte shifts, or swapped high/low 16 bits. - **Random pairs**: values such as `0xDEADBEEF` and `0xFEEDFACE` to ensure general correctness. - **Edge and boundary cases**: maximum positive vs negative (`0x7FFFFFFF` vs `0x80000000`), adjacent bits, and grouped 1’s. Using these 25 patterns to ensure full functional coverage of the Hamming distance logic. | # | X (Hex) | Y (Hex) | Hamming Distance | Description | |----|---------------|---------------|------------------|---------------------------------------| | 1 | 0x00000000 | 0x00000000 | 0 | identical | | 2 | 0x00000001 | 0x00000000 | 1 | one LSB differs | | 3 | 0x80000000 | 0x00000000 | 1 | one MSB differs | | 4 | 0xFFFFFFFF | 0x00000000 | 32 | all bits differ | | 5 | 0xAAAAAAAA | 0x55555555 | 32 | alternating 1010 vs 0101 | | 6 | 0xF0F0F0F0 | 0x0F0F0F0F | 32 | nibble-wise alternation | | 7 | 0x12345678 | 0x12345678 | 0 | identical random value | | 8 | 0x12345678 | 0x12345670 | 1 | single-bit delta | | 9 | 0x0000FFFF | 0xFFFF0000 | 32 | lower/upper 16 bits swapped | | 10 | 0x00000000 | 0xFFFFFFFF | 32 | all-zero vs all-one | | 11 | 0x7FFFFFFF | 0xFFFFFFFF | 1 | only MSB differs | | 12 | 0x7FFFFFFF | 0x80000000 | 32 | +max vs min-int | | 13 | 0x00010000 | 0x00000001 | 2 | two distant ones | | 14 | 0x00F00000 | 0x00000000 | 4 | four consecutive ones | | 15 | 0x00F00000 | 0x0000F000 | 8 | two groups of four ones | | 16 | 0xDEADBEEF | 0xFEEDFACE | 6 | random pair | | 17 | 0xCAFEBABE | 0x8BADF00D | 14 | random pair | | 18 | 0x01010101 | 0x10101010 | 8 | 0x01 pattern vs 0x10 pattern | | 19 | 0xAAAAAAAA | 0xAAAAAAAA | 0 | identical alternating | | 20 | 0x00000003 | 0x00000001 | 1 | adjacent low bits | | 21 | 0x0000000F | 0x000000F0 | 8 | eight consecutive bits differ | | 22 | 0x00000008 | 0x00000010 | 2 | two adjacent bits | | 23 | 0x40000000 | 0x80000000 | 2 | top two bits differ | | 24 | 0x80000001 | 0x00000001 | 1 | only MSB differs | | 25 | 0x00000002 | 0x00000001 | 2 | LSB and next bit differ | ```assembly main: li s0, 0 la s1, orig_X la s2, orig_Y la s3, golden li s6, NUM_TEST_VALUES bge s0, s6, Passed 1: lw a0, 0(s1) lw a1, 0(s2) lw s4, 0(s3) jal ra hammingDistance mv s5, a0 bne s4, s5, 2f addi s0, s0, 1 addi s1, s1, 4 addi s2, s2, 4 addi s3, s3, 4 blt s0, s6, 1b Passed: la a0, Passed_msg li a7, 4 ecall # Print passed message li a0, 0 j done # go to return #failed 2: la a0, Failed_msg li a7, 4 ecall # Print passed message j done # go to return done: li a7, 93 # system call: exit2 li a0, 1 # exit value ecall # exit 1 ``` ### Result - Without clz ![image](https://hackmd.io/_uploads/BkMtTebpxx.png) ![image](https://hackmd.io/_uploads/Hk0Fplbpex.png) - With clz ![image](https://hackmd.io/_uploads/Hk4CpgW6lg.png) ![image](https://hackmd.io/_uploads/ByP_kZ-6xe.png) ### Performance Comparison Both versions passed all functional tests, but the **CLZ-based implementation** performed better. By using the `CLZ` instruction to skip leading zeros, the loop runs fewer iterations, reducing total cycles and retired instructions. As shown in the results: - **Without CLZ:** 6365 cycles - **With CLZ:** 5345 cycles This demonstrates that the CLZ version is more efficient, achieving the same correctness with lower execution time. # 5-Stage RISC-V Processor ![image](https://hackmd.io/_uploads/BknYgdm6ll.png) In this work, I use a 5-stage RISC-V processor that executes programs using only the RV32I instruction set. The RISC-V 5-stage pipeline divides instruction execution into five sequential stages, each responsible for a specific part of the instruction flow. This structure allows multiple instructions to be processed simultaneously, improving throughput and efficiency. 1. Instruction Fetch (IF) - In this stage, the processor fetches the instruction from the instruction memory using the Program Counter (PC). - Update the PC to the next sequential instruction by adding 4. 2. Instruction Decode and Register Fetch (ID) - The fetched instruction is decoded to identify the operation type and register operands. - The control unit interprets fields such as opcode, funct3, funct7, rs1, rs2, and rd, generating the required control signals. - The Register File provides values from rs1 and rs2, and any immediate value is extracted. 3. Execute / Address Calculation (EX) - The ALU performs the required computation — arithmetic, logic, comparison, or address calculation. - For branch or memory instructions, this stage also computes the target or effective address, and determines whether a branch should be taken. 4. Memory Access (MEM) - If the instruction involves memory, processor reads or writes data from/to the data memory in this stage. - `load` instructions read data into a buffer. - `store` instructions write a register’s value to memory. 5. Write Back (WB) - The final stage writes the result (either from the ALU or memory) back into the Register File at the `rd`. --- Take `andi x10, x25, 255` (`andi a0, s9, 0XFF`) this instruction, which address is at 0xF00 and see how it works in each stage. ## 1. IF ![IF](https://hackmd.io/_uploads/BJc67qVTxl.png) The program counter currently holds the value `0x00000F00`, indicating that the next instruction to be executed is stored at memory address `0x00000F00`. In the next step of the instruction cycle, the CPU will fetch the instruction from this address. Simultaneously, PC will be added by 4 to obtain the next instruction address which is 0x00000F04. ## 2. ID ![ID](https://hackmd.io/_uploads/r1ABT54age.png) In this stage, the fetched instruction `0x0FFCF513` is decoded. The decoded result generates the necessary control signals, instructing the `Register File` to provide the values from `rs1 (x25)` (no rs2 is used for I-type instruction.). imm[11:0] | rs1 | funct3 | rd | opcode| ----------|-------|--------|-------|-------| 000011111111 | 11001 | 111 | 01010 | 0010011| Identify opcode and fields: | Field | Bits | Value (bin) | Value (hex) | Meaning | | ------------- | ------- | -------------- | ----------- | ---------------------- | | **opcode** | [6:0] | `0010011` | 0x13 | I-type (immediate ALU) | | **rd** | [11:7] | `01010` | 0x0A | **Name : x10 → Alias : a0** | | **funct3** | [14:12] | `111` | 0x7 | **ANDI** | | **rs1** | [19:15] | `11001` | 0x19 | **Name : x25 → Alias : s9** | | **imm[11:0]** | [31:20] | `000011111111` | 0x0FF | **Immediate = 0xFF** | The decoder extracts `imm[11:0]` and extends it to a 32-bit value for ALU computation, where the immediate is `0x000000FF`. The Register File provides values from `rs1(x25)`. ![Register](https://hackmd.io/_uploads/r19p0qVpex.png) ## 3. EX ![EX](https://hackmd.io/_uploads/BJDVFjVpeg.png) In this stage, the ALU performs a bitwise AND on the two operands selected by the input MUXs. - Input - op1 : `0X00000080 (x25)` - op2 : `0X000000FF (Imm)` - Output - 0X00000080 The `rd (0x0A)` address will be sent into next stage. ## 4. MEM ![MEM](https://hackmd.io/_uploads/Hk2gZ3Vael.png) In `I-type` instruction, no need for memory `store/load`, so set Data memory `wr_en` low. ## 5. WB ![WB](https://hackmd.io/_uploads/ryG7fnEpge.png) In the `WB` stage, MUXs select the write-back source (ALU result or memory data). For this I-type instruction, the ALU result `0x00000080` is written to the destination register `rd = x10 (index 0x0A)`. ## Register update - Before: ![image](https://hackmd.io/_uploads/S1_tN24Txl.png) - After: ![image](https://hackmd.io/_uploads/r1jSV3VTge.png) It can be seen that the value stored in register x10(a0) has changed from `0x00004080` to `0x00000080`.