# Assignment 1: RISC-V Assembly and Instruction Pipeline
#### contributed by <`bbchen`>
# Problem B in [Quiz 1](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
### Abstract
This problem is to understand and implement a custom 8-bit logarithmic codec called uf8, which compresses a 20-bit unsigned integer (range: 0 ~ 1,015,792) into only 8 bits, achieving a 2.5:1 compression ratio while guaranteeing a relative error ≤ 6.25%.
[The complete C code and assembly for Problem B](https://github.com/CHENSHIBO0000/ca2025-quizzes/tree/main/HW1/q1-uf8)
---
The following illustration is about each function and its corresponding RISC-Vassembly in Problem B of the quiz1.
### clz
This function `clz` computes the count of leading zeros in a 32-bit unsigned integer using a binary search method. It starts with the maximum (32) and checks progressively smaller bit ranges (16, 8, 4, 2, 1). By shifting and adjusting the count, it quickly finds the position of the most significant 1-bit. The result is the exact number of leading zeros, with clz(0) = 32.
#### C code of clz
```c
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
```
1. Start with `n = 32` (maximum possible leading zeros) and a probe size `c = 16`.
2. Compute `y = x >> c`.
- If `y != 0`, then the highest 1-bit lies within the upper `32 - c` bits.
Reduce the count: `n -= c`, and zoom in: `x = y`.
- Otherwise, the highest 1-bit is in the lower `c` bits; just shrink the probe.
3. Halve `c` (`c >>= 1`) to test ranges of 8, 4, 2, 1 bits — **exactly 5 iterations total**.
4. After the loop, `n - x` yields the exact number of leading zeros.
(During the search, `x` is reduced to the range `[1, 2)` in fixed-point terms, so subtracting `x` corrects the off-by-one accumulated in `n`.)
#### Assembly code of clz
```assembly
clz:
addi t0, zero, 32 #t0 = n
addi t1, zero, 16 #t1 = c
add t3, a0, zero #t3 = x
clz_Loop:
srl t2, t3, t1 #t2 = y
beq t2, zero, clz_y_0
sub t0, t0, t1
add t3, t2, zero #t3 = x
clz_y_0:
srli t1,t1,1 # c >> 1
bne t1, x0, clz_Loop
sub a0,t0,t3
ret
```
---
### uf8_decode
The decoding formula is given as
$$
D(b) = m \cdot 2^e + (2^e - 1) \cdot 16
$$
Where $$e = \lfloor b/16 \rfloor \quad \text{and} \quad m = b \bmod 16$$
#### C code of decode
```c
/*
* uf8 byte layout (8 bits)
*
* +--------+--------+
* | e e e e| m m m m|
* +--------+--------+
* 7 .. 4 3 .. 0
*
* e : 4-bit exponent
* m : 4-bit mantissa
*/
uint32_t uf8_decode(uf8 fl)
{
uint32_t mantissa = fl & 0x0f;
uint8_t exponent = fl >> 4;
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
return (mantissa << exponent) + offset;
}
```
1. `mantissa = fl & 0x0F` → extract the 4-bit mantissa **m**.
2. `exponent = fl >> 4` → extract the 4-bit exponent **e**.
3. `offset = (0x7FFF >> (15 - exponent)) << 4`
- Shifting right by `(15 - e)` yields `(2^e - 1)`.
- Shifting left by 4 multiplies by 16 → `(2^e - 1) * 16`.
4. `return (mantissa << exponent) + offset`
- `(mantissa << exponent)` is `m * 2^e`.
- Adding `offset` implements the uf8 decode formula.
#### Assembly code of uf8_decode
```assembly
uf8_decode:
addi sp, sp, -4
sw ra, 0(sp)
andi t0, a0, 0x0f #t0 = mantissa
srli t1, a0, 4 #t1 = exponent
# offset = ((1 << exponent) - 1) << 4
addi t2, zero, 1
sll t2, t2, t1
addi t2, t2, -1
slli t2, t2, 4 #t2 = offset
sll a0,t0, t1
add a0, a0 t2
lw ra 0(sp) # restore return addr
addi sp, sp, 4
ret
```
---
### uf8_encode
The encoding formula is given as
$$
E(v) =\begin{cases}v, & \text{if } v < 16 \\16e + \left\lfloor \dfrac{v - \text{offset}(e)}{2^e} \right\rfloor, & \text{otherwise}\end{cases}
$$
Where $$\text{offset}(e) = (2^e - 1) \cdot 16$$
#### C code of encode
```c
uf8 uf8_encode(uint32_t value)
{
/* Use CLZ for fast exponent calculation */
if (value < 16)
return value;
/* Find appropriate exponent using CLZ hint */
int lz = clz(value);
int msb = 31 - lz;
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
/* Estimate exponent - the formula is empirical */
exponent = msb - 4;
if (exponent > 15)
exponent = 15;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
/* Adjust if estimate was off */
while (exponent > 0 && value < overflow) {
overflow = (overflow - 16) >> 1;
exponent--;
}
}
/* Find exact exponent */
while (exponent < 15) {
uint32_t next_overflow = (overflow << 1) + 16;
if (value < next_overflow)
break;
overflow = next_overflow;
exponent++;
}
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
}
```
1. If the value is small (<16), it returns it directly.
2. Otherwise, it uses `clz` to find the highest 1-bit, estimates the exponent, and adjusts it to fit the value.
3. Then it computes the mantissa from the remaining part.
4. Finally, it packs the `exponent (high 4 bits)` and `mantissa (low 4 bits)` into the 8-bit `uf8`code.
#### Assembly code of uf8_encode
```assembly
uf8_encode:
li t0, 16
slt t0, a0 ,t0
bne t0, zero, return_value
addi sp, sp, -28
sw s0, 24(sp)
sw s1, 20(sp)
sw s2, 16(sp)
sw s3, 12(sp)
sw s4, 8(sp)
sw s5, 4(sp)
sw ra, 0(sp)
add s0, a0, zero #s0 = value
#Find appropriate exponent using CLZ hint
jal ra, clz
add s1, a0, x0 #s1 = lz
li t0, 31
sub s2, t0, s1 #s2 = msb
add s3, zero, zero #s3 = exponent
add s4, zero, zero #s4 = overflow
#if(msb >= 5)
li t0, 5
blt t0, s2, 1f
addi s3, s2, -4
#if(exponent > 15)
li t0, 15
ble s3, t0, 2f
li s3, 15
# Calculate overflow for estimated exponent
2:
li s5, 0
3:
bge s5, s3, 4f
slli s4, s4, 1
addi s4, s4, 16
addi s5, s5, 1
j 3b
# Adjust if estimate was off
4:
slt t0, x0, s3
slt t1, s0, s4
and t0, t0, t1
beq t0, x0, 1f
addi s4, s4, -16
srli, s4, s4, 1
addi s3, s3, -1
j 4b
#Find exact exponent
1:
addi t0, x0, 15
bge s3, t0, encode_done
slli s6, s4, 1
addi s6, s6, 16
blt s0, s6, encode_done
mv s4, s6
addi s3, s3, 1
j 1b
encode_done:
sub s7, s0, s4
srl s7, s7, s3
slli a0, s3, 4
or a0, a0, s7
lw ra, 0(sp)
lw s5, 4(sp)
lw s4, 8(sp)
lw s3, 12(sp)
lw s2, 16(sp)
lw s1, 20(sp)
lw s0, 24(sp)
addi sp, sp, 28
return_value:
ret
```
---
### Test
#### C code of test
```c
static bool test(void)
{
int32_t previous_value = -1;
bool passed = true;
for (int i = 0; i < 256; i++) {
uint8_t fl = i;
int32_t value = uf8_decode(fl);
uint8_t fl2 = uf8_encode(value);
if (fl != fl2) {
printf("%02x: produces value %d but encodes back to %02x\n", fl,
value, fl2);
passed = false;
}
if (value <= previous_value) {
printf("%02x: value %d <= previous_value %d\n", fl, value,
previous_value);
passed = false;
}
previous_value = value;
}
return passed;
}
```
The `test()` function verifies the correctness of the uf8 encode/decode implementation.
1. Iterate over all `256` possible uf8 codes.
For each i from 0 to 255, treat it as a uf8 code fl.
2. Round-trip check (decode → encode)
- `value = uf8_decode(fl)` decodes the uf8 code into an integer.
- `fl2 = uf8_encode(value)` encodes that integer back into uf8.
- If `fl != fl2`, it outputs an error message and marks the test as failed.
This ensures decoding followed by encoding preserves the original code.
3. Keeps track of the previous decode value
- Ensures that each new decoded value is strictly greater than the previous one.
- Updates previous_value on each iteration.
4. Final result:
- Returns `true` only if all checks pass; otherwise returns `false`.
#### Assembly code of test
```assembly
Test:
addi sp, sp, -4
sw ra, 0(sp)
li s0, -1
li s1, 1
li s2, 0
test_for_loop:
mv s3, s2 #f1 = i
mv a0, s3 #
jal ra, uf8_decode
mv s4, a0
jal ra, uf8_encode
mv s5, a0
beq s3, s5, 1f
li s1, 0
1:
bgt s4, s0, 2f
li s1, 0
2: #previous_value
mv s0, s4
addi s2, s2, 1
li t0, 256
blt s2, t0, test_for_loop # if (i < 256), to 1
done:
mv a1, s1 # return passed
lw ra 0(sp) # restore return addr
addi sp, sp, 4
ret
```
---
### Result


---
# Problem C in [Quiz 1](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
## Abstract
This problem implements a complete BF16 floating-point library. It includes conversions between 32-bit float and 16-bit BF16, classification functions (NaN, Inf, Zero), and implements a bfloat16 (BF16) arithmetic library. It also implements comparison functions for BF16.
[The complete C code and assembly for Problem C](https://github.com/CHENSHIBO0000/ca2025-quizzes/tree/main/HW1/q1-bfloat16)
---
## Classification
#### C code
```c
#define BF16_NAN() ((bf16_t) {.bits = 0x7FC0})
#define BF16_ZERO() ((bf16_t) {.bits = 0x0000})
static inline bool bf16_isnan(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_isinf(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
!(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_iszero(bf16_t a)
{
return !(a.bits & 0x7FFF);
}
```
These functions help detect `NaN`, `infinity`, and `±0` via exponent and mantissa patterns.
#### Assembly code
```assembly
bf16_isnan:
mv t0, a0
li t1, BF16_EXP_MASK
and t2, t0, t1
li t3, BF16_MANT_MASK
and t4, t0, t3
bne t2, t1, bf16_isnan_false
beq t4, x0, bf16_isnan_false
li a0, 1
ret
bf16_isnan_false:
li a0, 0
ret
bf16_isinf:
mv t0, a0
li t1, BF16_EXP_MASK
and t2, t0, t1
li t3, BF16_MANT_MASK
and t4, t0, t3
bne t2, t1, bf16_isinf_false
bne t4, x0, bf16_isinf_false
li a0, 1
ret
bf16_isinf_false:
li a0, 0
ret
bf16_iszero:
li t0, 0x7FFF
and t1, a0, t0
beq t1, x0, bf16_iszero_true
li a0, 0
ret
bf16_iszero_true:
li a0, 1
ret
```
---
## Basic conversion
#### C code of basic conversions
```c
static inline bf16_t f32_to_bf16(float val)
{
uint32_t f32bits;
memcpy(&f32bits, &val, sizeof(float));
if (((f32bits >> 23) & 0xFF) == 0xFF)
return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF};
f32bits += ((f32bits >> 16) & 1) + 0x7FFF;
return (bf16_t) {.bits = f32bits >> 16};
}
static inline float bf16_to_f32(bf16_t val)
{
uint32_t f32bits = ((uint32_t) val.bits) << 16;
float result;
memcpy(&result, &f32bits, sizeof(float));
return result;
}
```
These two functions handle conversion between 32-bit float and 16-bit bfloat16 formats:
- `f32_to_bf16`
Extracts the 32-bit bit pattern of a float .If the exponent is `0xFF` (NaN or Inf), it directly truncates the upper 16 bits.
Otherwise, it performs round-to-nearest-even by adding `0x7FFF` plus the least significant bit of the truncated part before shifting right by 16 bits.
- `bf16_to_f32`
Shifts the 16-bit BF16 value into the upper half of a 32-bit word and fills the lower bits with zeros, then reinterprets it as a float.
#### Assembly code of basic conversions
```assembly
f32_to_bf16:
mv t0, a0
srli t1, t0, 23
andi t1, t1, 0xFF
li t2, 0xFF
beq t1, t2, 1f
srli t1, t0, 16
andi t1, t1, 1
li t2, 0x7FFF
add t3, t1, t2
add t0, t0, t3
srli t0, t0, 16
j 2f
1:
srli t0, t0, 16
li t1, 0xFFFF
and t0, t0, t1
2:
mv a0, t0
ret
bf16_to_f32:
slli a0, a0, 16
ret
```
## Comparison operations
#### C code of comparison operations
```c
static inline bool bf16_eq(bf16_t a, bf16_t b)
{
if (bf16_isnan(a) || bf16_isnan(b))
return false;
if (bf16_iszero(a) && bf16_iszero(b))
return true;
return a.bits == b.bits;
}
static inline bool bf16_lt(bf16_t a, bf16_t b)
{
if (bf16_isnan(a) || bf16_isnan(b))
return false;
if (bf16_iszero(a) && bf16_iszero(b))
return false;
bool sign_a = (a.bits >> 15) & 1, sign_b = (b.bits >> 15) & 1;
if (sign_a != sign_b)
return sign_a > sign_b;
return sign_a ? a.bits > b.bits : a.bits < b.bits;
}
static inline bool bf16_gt(bf16_t a, bf16_t b)
{
return bf16_lt(b, a);
}
```
- `bf16_eq(a, b)`
Returns `false` if either operand is `NaN`. Treats `±0` as equal. Otherwise, compares raw 16-bit patterns for exact equality.
- `bf16_lt(a, b)`
Returns `false` if either is `NaN` or both are zeros. Implements signed ordering:
• If `signs differ`, negative is less than positive.
• If `signs are the same`, compares the 16-bit patterns.
- `bf16_gt(a, b)`
Swap the operands and reuse the `less-than` logic.
#### Assembly code of comparison operations
<details>
<summary>Click to show code</summary>
```asm
bf16_eq:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
sw s1, 4(sp)
sw s2, 0(sp)
mv s0, a0
mv s1, a1
jal ra, bf16_isnan
mv s2, a0
mv a0, s1
jal ra, bf16_isnan
or s2, s2, a0
bne s2, x0, bf16_eq_false
mv a0, s0
jal ra, bf16_iszero
beq a0, x0, bf16_eq_check
mv a0, s1
jal ra, bf16_iszero
beq a0, x0, bf16_eq_check
bf16_eq_true:
li a0, 1
j bf16_eq_end
bf16_eq_check:
beq s0, s1, bf16_eq_true
bf16_eq_false:
li a0, 0
j bf16_eq_end
bf16_eq_end:
lw ra, 12(sp)
lw s0, 8(sp)
lw s1, 4(sp)
lw s2, 0(sp)
addi sp, sp, 16
ret
bf16_lt:
addi sp, sp, -24
sw ra, 20(sp)
sw s0, 16(sp)
sw s1, 12(sp)
sw s2, 8(sp)
sw s3, 4(sp)
sw s4, 0(sp)
mv s0, a0
mv s1, a1
jal ra, bf16_isnan
mv s4, a0
mv a0, s1
jal ra, bf16_isnan
or s4, s4, a0
bne s4, x0, bf16_lt_false
mv a0, s0
jal ra, bf16_iszero
beq a0, x0, bf16_lt_check_sign
mv a0, s1
jal ra, bf16_iszero
beq a0, x0, bf16_lt_check_sign
bf16_lt_false:
li a0, 0
j bf16_lt_end
bf16_lt_check_sign:
srli s2, s0, 15 # sign_a
srli s3, s1, 15 # sign_b
andi s2, s2, 1
andi s3, s3, 1
bne s2, s3, bf16_lt_sign_diff
bgtz s2, 1f
blt s0, s1, bf16_lt_true
j bf16_lt_false
1:
blt s0, s1, bf16_lt_false
j bf16_lt_true
bf16_lt_true:
li a0, 1
j bf16_lt_end
bf16_lt_end:
lw ra, 20(sp)
lw s0, 16(sp)
lw s1, 12(sp)
lw s2, 8(sp)
lw s3, 4(sp)
lw s4, 0(sp)
addi sp, sp, 24
ret
bf16_lt_sign_diff:
blt s2, s3, bf16_lt_false
j bf16_lt_true
bf16_gt:
addi sp, sp, -4
sw ra, 0(sp)
mv t0, a0
mv a0, a1
mv a1, t0
jal ra, bf16_lt
lw ra, 0(sp)
addi sp, sp, 4
ret
```
</details>
## Arithmetic
[The C code of Arithmetic](https://github.com/CHENSHIBO0000/ca2025-quizzes/blob/main/HW1/q1-bfloat16/q1-bfloat16.c)
### Add
- First, the 16-bit BF16 values are split into `sign`, `exponent`, and `mantissa`.
- Special cases: `NaN`, `infinity`, and `zero` are handled first according to IEEE 754 rules.
- Right-shifted the mantissa of the smaller exponent, so both numbers share the same exponent.
- Addition or subtraction:
If the signs are the same → add the mantissas and then normalize if needed .
If the signs differ → subtract the smaller mantissa from the larger, then left-shift until normalization.
- Repack the result: Combine the final sign, exponent, and mantissa into a new BF16 value.
#### Assembly code of add function
<details>
<summary>Click to show code</summary>
```assembly
bf16_add:
addi sp, sp, -44
sw ra, 40(sp)
sw s0, 36(sp)
sw s1, 32(sp)
sw s2, 28(sp)
sw s3, 24(sp)
sw s4, 20(sp)
sw s5, 16(sp)
sw s6, 12(sp)
sw s7, 8(sp)
sw s8, 4(sp)
sw s9, 0(sp)
srli s0, a0, 15 # Extract sign of a
srli s1, a1, 15 # Extract sign of b
andi s0, s0, 0x1
andi s1, s1, 0x1
srli s2, a0, 7 # Extract exponent of a
srli s3, a1, 7 # Extract exponent of b
andi s2, s2, 0xFF
andi s3, s3, 0xFF
andi s4, a0, 0x7F # Extract mantissa of a
andi s5, a1, 0x7F # Extract mantissa of b
##if (exp_a == 0xFF)
li t0, 0XFF
bne s2, t0, EXP_A_NOT_EQUAL_255
beq s4, x0, 1f
j return_a
1:
bne s3, t0, return_a
bnez s5, return_b
beq s0, s1, return_b
li a0, BF16_NAN
j add_done
EXP_A_NOT_EQUAL_255:
##if (exp_b == 0xFF)
bne s3, t0, EXP_B_NOT_EQUAL_255
j return_b
EXP_B_NOT_EQUAL_255:
##if(!exp_a && !mant_a)
bne s2, x0, CHECK_B_ZERO
beq s4, x0, return_b
CHECK_B_ZERO:
##if(!exp_b && !mant_b)
bne s3, x0, CHECK_A_EXPONENTS
beq s5, x0, return_a
CHECK_A_EXPONENTS:
##if (exp_a == 0)
beq s2, x0, CHECK_B_EXPONENTS
ori s4, s4, 0x80
CHECK_B_EXPONENTS:
##if (exp_b == 0)
beq s3, x0, RESULT_EXP
ori s5, s5, 0x80
#result_exp
RESULT_EXP:
sub s6, s2, s3 # exp_diff = exp_a - exp_b
##if (exp_diff > 0)
blt x0, s6, EXP_DIFF_POS
blt s6, x0, EXP_DIFF_NEG
#result_exp = exp_a;
mv s8, s2 # result_exp = exp_a
j BOTH_SIGN_CHECK
EXP_DIFF_POS:
mv s8, s2 # result_exp = exp_a
li t0, 8
blt t0, s6, return_a
srl s5, s5, s6 # mant_b >>= exp_diff
j BOTH_SIGN_CHECK
EXP_DIFF_NEG:
mv s8, s3 # result_exp = exp_b
li t0, -8
blt s6, t0, return_b
neg t0, s6
srl s4, s4, t0 # mant_a >>= -exp_diff
j BOTH_SIGN_CHECK
BOTH_SIGN_CHECK:
beq s0, s1, SAME_SIGN
bgeu s4, s5, L_ge # if (mant_a >= mant_b) goto L_ge;
mv s7, s1 # result_sign = sign_b
sub s9, s5, s4 # result_mant = mant_b - mant_a
j WHILE_LOOP
L_ge:
mv s7, s0 # result_sign = sign_a
sub s9, s4, s5 # result_mant = mant_a - mant_b
WHILE_LOOP:
## if (!result_mant) return 0
beqz s9, return_zero
normalize_loop:
andi t0, s9, 0x80
bnez t0, COMBINE_RESULT
slli s9, s9, 1 # result_mant <<= 1
addi s8, s8, -1 # --result_exp
bge x0, s8, return_zero # if (result_exp <= 0)
j normalize_loop
return_zero:
mv a0, x0
j add_done
SAME_SIGN:
mv s7, s0 # result_sign = sign_a
add s9, s4, s5 # result_mant = mant_a + mant_b
andi t0, s9, 0x100
beq t0, x0, COMBINE_RESULT
srli s9, s9, 1 # result_mant >>=1
addi s8, s8, 1 # result_exp++
li t2, 0xFF
bge s8, t2, INFINITY
COMBINE_RESULT:
mv a0, s7
slli a0, a0, 15
andi s8, s8, 0xFF
slli s8, s8, 7
or a0, a0, s8
andi s9, s9, 0x7F
or a0, a0, s9
j add_done
INFINITY:
mv a0, s7
slli a0, a0, 15
li t0, 0X7F80
or a0, a0, t0
j add_done
return_a:
j add_done
return_b:
mv a0, a1
j add_done
add_done:
lw s9, 0(sp)
lw s8, 4(sp)
lw s7, 8(sp)
lw s6, 12(sp)
lw s5, 16(sp)
lw s4, 20(sp)
lw s3, 24(sp)
lw s2, 28(sp)
lw s1, 32(sp)
lw s0, 36(sp)
lw ra, 40(sp)
addi sp, sp, 44
ret
```
</details>
---
### Sub
This BF16 subtraction works in a very simple way:
- Flip the sign bit of operand b by `XOR` with `BF16_SIGN_MASK (0x8000)`.
- Then it simply calls `bf16_add(a, b)`, reusing the addition logic.
#### Assembly code of sub function
<details>
<summary>Click to show code</summary>
```assembly
bf16_sub:
addi sp, sp, -4
sw ra, 0(sp)
li t0,0x8000
xor a1, a1, t0
jal ra, bf16_add
lw ra, 0(sp)
addi sp, sp, 4
ret
```
</details>
---
### Mul
The Mulitiply funcion of BF16 work as follow:
- Field extraction: Get sign, 8-bit exponent, and 7-bit mantissa; result sign = XOR of signs.
- Special cases: Handle `NaN`, `Inf`, and `±0` (including 0×Inf → NaN).
- Normalization: For subnormals, left-shift mantissa until MSB=1 and track exp_adjust; for normals, just add the hidden 1-bit.
- Computation: Multiply mantissas; add exponents, subtract bias, and apply `exp_adjust`.
- Normalization of result: Shift right by 7 or 8 bits depending on overflow, update exponent (truncation only, no rounding).
- Edge handling: Exponent overflow : `Inf`; underflow : `subnormal or 0` .
- Repack result: Combine as `sign | (exp << 7) | mant`.
#### Assembly code of mul function
<details>
<summary>Click to show code</summary>
```Assembly
bf16_mul:
addi sp, sp, -44
sw ra, 40(sp)
sw s0, 36(sp)
sw s1, 32(sp)
sw s2, 28(sp)
sw s3, 24(sp)
sw s4, 20(sp)
sw s5, 16(sp)
sw s6, 12(sp)
sw s7, 8(sp)
sw s8, 4(sp)
sw s9, 0(sp)
srli s0, a0, 15 # Extract sign of a
srli s1, a1, 15 # Extract sign of b
andi s0, s0, 0x1
andi s1, s1, 0x1
srli s2, a0, 7 # Extract exponent of a
srli s3, a1, 7 # Extract exponent of b
andi s2, s2, 0xFF
andi s3, s3, 0xFF
andi s4, a0, 0x7F # Extract mantissa of a
andi s5, a1, 0x7F # Extract mantissa of b
xor s6, s0, s1 # Compute result sign
##if (exp_a == 0xFF)
li t0, 0XFF
bne s2, t0, CHECK_EXP_B_FF
bne s4, x0, RETURN_A_MUL
## if (!exp_b && !mant_b)
#bne s3, x0, RETURN_INFINITY_MUL
#bne s5, x0, RETURN_INFINITY_MUL
or t0, s3, s5
beqz t0, RETURN_NAN_MUL
j RETURN_INFINITY_MUL
CHECK_EXP_B_FF:
##if (exp_b == 0xFF)
li t0, 0XFF
bne s3, t0, CHECK_ZERO
bne s5, x0, RETURN_B_MUL
## if (!exp_b && !mant_b)
#bne s2, x0, RETURN_INFINITY_MUL
#bne s4, x0, RETURN_INFINITY_MUL
or t0, s2, s4
beqz t0, RETURN_NAN_MUL
j RETURN_INFINITY_MUL
CHECK_ZERO:
##if ((!exp_a && !mant_a) || (!exp_b && !mant_b))
or t0, s2, s4
beqz t0, RETURN_ZERO_MUL
or t1, s3, s5
beqz t1, RETURN_ZERO_MUL
##int16_t exp_adjust = 0;
li s7, 0
##Check_exp_a
beqz s2, A_SUBNORMAL
ori s4, s4, 0x80
j CHECK_B_SUBNORMAL
A_SUBNORMAL:
WHILE_A:
andi t0, s4, 0X80
bnez t0, 1f
slli s4, s4, 1
addi s7, s7, -1
j WHILE_A
1:
li s2, 1
j CHECK_B_SUBNORMAL
##Check_exp_b
CHECK_B_SUBNORMAL:
beqz s3, B_SUBNORMAL
ori s5, s5, 0x80
j MUL
B_SUBNORMAL:
WHILE_B:
andi t0, s5, 0X80
bnez t0, 2f
slli s5, s5, 1
addi s7, s7, -1
j WHILE_B
2:
li s3, 1
MUL:
# s4 = A (16-bit), s5 = B (16-bit) → s8 = A*B (32-bit)
li s8, 0
mv t0, s4
mv t1, s5
li t3, 16
1:
andi t2, t1, 1
beqz t2, 2f
add s8, s8, t0
2:
slli t0, t0, 1
srli t1, t1, 1
addi t3, t3, -1
bnez t3, 1b
add s9, s2, s3
li t0, 127
sub s9, s9, t0
add s9, s9, s7
## if (result_mant & 0x8000)
li t0,0x8000
and t0, s8, t0
beqz t0, 3f
srli s8, s8, 8
andi s8, s8, 0x7F
addi s9, s9, 1
j CHECK_OVERFLOW
3:
srli s8, s8, 7
andi s8, s8, 0x7F
## if (result_exp >= 0xFF)
CHECK_OVERFLOW:
li t0, 0XFF
blt s9, t0, no_overflow
slli t1, s6, 15
li t0, 0x7F80
or a0, t1, t0
j mul_done
no_overflow:
bgt s9, x0, FINISH_COMBINE_RESULT_MUL
li t0, -6
blt s9, t0, RETURN_ZERO_MUL
li t0, 1
sub t0, t0, s9
srl s8, s8, t0
li s9, 0
FINISH_COMBINE_RESULT_MUL:
slli s6, s6, 15
andi s9, s9, 0xFF
slli s9, s9, 7
andi s8, s8, 0x7F
or a0, s6, s9
or a0, a0, s8
j mul_done
RETURN_NAN_MUL:
li a0, BF16_NAN
j mul_done
RETURN_INFINITY_MUL:
slli a0, s6, 15
li t0, 0x7F80
or a0, a0, t0
j mul_done
RETURN_A_MUL:
j mul_done
RETURN_B_MUL:
mv a0, a1
j mul_done
RETURN_ZERO_MUL:
mv a0, s6
slli a0, a0, 15
j mul_done
mul_done:
lw s9, 0(sp)
lw s8, 4(sp)
lw s7, 8(sp)
lw s6, 12(sp)
lw s5, 16(sp)
lw s4, 20(sp)
lw s3, 24(sp)
lw s2, 28(sp)
lw s1, 32(sp)
lw s0, 36(sp)
lw ra, 40(sp)
addi sp, sp, 44
ret
```
</details>
---
### Div
- First check the special cases.
- Hidden bit: If exponent ≠ 0, set hidden 1 `(mant |= 0x80)` for both operands.
- Restoring division: Build a 16-bit quotient by long division:
`dividend = mant_a << 15` ; `divisor = mant_b`; loop 16 steps for shifting/subtracting.
- Exponent math: `result_exp = exp_a − exp_b + BF16_EXP_BIAS`, then adjust for subnormals.
- Normalize quotient and check overflow/underflow.
- Pack result.
#### Assembly code of div function
<details>
<summary>Click to show code</summary>
```Assembly
bf16_div:
addi sp, sp, -48
sw s10, 44(sp)
sw ra, 40(sp)
sw s0, 36(sp)
sw s1, 32(sp)
sw s2, 28(sp)
sw s3, 24(sp)
sw s4, 20(sp)
sw s5, 16(sp)
sw s6, 12(sp)
sw s7, 8(sp)
sw s8, 4(sp)
sw s9, 0(sp)
srli s0, a0, 15 # Extract sign of a
srli s1, a1, 15 # Extract sign of b
andi s0, s0, 0x1
andi s1, s1, 0x1
srli s2, a0, 7 # Extract exponent of a
srli s3, a1, 7 # Extract exponent of b
andi s2, s2, 0xFF
andi s3, s3, 0xFF
andi s4, a0, 0x7F # Extract mantissa of a
andi s5, a1, 0x7F # Extract mantissa of b
xor s6, s0, s1 # Compute result sign
##if (exp_b == 0xFF)
li t0, 0XFF
bne s3, t0, CHECK_ZERO_B
bne s5, x0, RETURN_B_DIV
bne s2, t0, RETURN_ZERO_DIV
bne s4, x0, RETURN_ZERO_DIV
j RETURN_NAN_DIV
CHECK_ZERO_B:
## if (!exp_b && !mant_b)
or t0, s3, s5
#### if (!exp_a && !mant_a)
or t1, s2, s4
or t2, t0, t1
beqz t2, RETURN_NAN_DIV
beqz t0, RETURN_INFINITY_DIV
##if (exp_a == 0xFF)
li t0, 0XFF
bne s2, t0, CHECK_ZERO_A
bne s4, x0, RETURN_A_DIV
j RETURN_INFINITY_DIV
CHECK_ZERO_A:
## if (!exp_a && !mant_a)
or t0, s2, s4
beqz t0, RETURN_ZERO_DIV
beq s2, x0, 1f
ori s4, s4, 0x80
1:
beq s3, x0, 2f
ori s5, s5, 0x80
2:
slli s7, s4, 15
mv s8, s5
li s9, 0
li t0, 16
li t1, 0
li t2, 15
for:
slli s9, s9, 1
sub t3, t2, t1
sll t4, s8, t3
bltu s7, t4, 3f
sub s7, s7, t4
ori s9, s9, 1
3:
addi t1, t1, 1
bne t1, t0, for
#result_exp
sub s10, s2, s3
addi s10, s10, 127
bne s2, x0, 4f
addi s10, s10, -1
4:
bne s3, x0, 5f
addi s10, s10, 1
5:
#t1:1
#t2:0x8000
li t1, 1
li t2, 0x8000
and t0, s9, t2
beqz t0, 6f
srli s9, s9, 8
j 7f
6:
#t0:quotient & 0x8000
#result_exp > 1
while_loop:
and t0, s9, t2
bnez t0, end_while
ble s10, t1, end_while
slli s9, s9, 1
addi s10, s10, -1
j while_loop
end_while:
srli s9, s9, 8
7:
andi s9, s9, 0x7F
li t0, 0xFF
blt s10, t0, 8f
j RETURN_INFINITY_DIV
8:
blez s10, RETURN_ZERO_DIV
j FINISH_COMBINE_RESULT_DIV
#return result
RETURN_INFINITY_DIV:
slli a0, s6, 15
li t0, 0x7F80
or a0, a0, t0
j div_done
RETURN_ZERO_DIV:
slli s6, s6, 15
mv a0, s6
j div_done
RETURN_NAN_DIV:
li a0, BF16_NAN
j div_done
RETURN_B_DIV:
mv a0, a1
j div_done
RETURN_A_DIV:
j div_done
FINISH_COMBINE_RESULT_DIV:
slli s6, s6, 15
andi s10, s10, 0xFF
slli s10, s10, 7
andi s9, s9, 0x7F
or a0, s6, s10
or a0, a0, s9
j div_done
div_done:
lw s9, 0(sp)
lw s8, 4(sp)
lw s7, 8(sp)
lw s6, 12(sp)
lw s5, 16(sp)
lw s4, 20(sp)
lw s3, 24(sp)
lw s2, 28(sp)
lw s1, 32(sp)
lw s0, 36(sp)
lw ra, 40(sp)
lw s10, 44(sp)
addi sp, sp, 48
ret
```
</details>
---
### Sqrt
The square root is computed as
$$
\sqrt{a} = \sqrt{2^{e_a} \times m_a} = 2^{e_a/2} \times \sqrt{m_a}
$$
When computing the square root, the `bf16_sqrt` function first handles all special cases—such as zero, infinity, NaN, and negative inputs—to remain IEEE-754 compliant. Then, it processes the exponent by dividing it by two; if the exponent is odd, the mantissa is adjusted so it stays within the normalized range.
Next, the algorithm performs a `binary search` in integer space to approximate the square root of the mantissa, using a scaling factor of 128 to represent the implicit 1.0.
Finally, during normalization, the mantissa is adjusted to stay within [128, 256), the exponent is corrected if needed, and the hidden bit is removed to produce a 7-bit mantissa.
This method achieves an efficient and monotonic square root computation using only integer arithmetic operations.
#### Assembly code of sqrt function
<details>
<summary>Click to show code</summary>
```Assembly
bf16_sqrt:
addi sp, sp, -44
sw ra, 40(sp)
sw s0, 36(sp)
sw s1, 32(sp)
sw s2, 28(sp)
sw s3, 24(sp)
sw s4, 20(sp)
sw s5, 16(sp)
sw s6, 12(sp)
sw s7, 8(sp)
sw s8, 4(sp)
sw s9, 0(sp)
srli s0, a0, 15 # Extract sign
andi s0, s0, 0x1
srli s1, a0, 7 # Extract exponent
andi s1, s1, 0xFF
andi s2, a0, 0x7F # Extract mantissa
## Handle special cases
li t0, 0XFF
bne s1, t0, 1f
bne s2, x0, RETURN_A_SQRT
bne s0, x0, RETURN_NAN_SQRT
j RETURN_A_SQRT
1:
or t0, s1, s2
beq t0, x0, RETURN_ZERO_SQRT
bne s0, x0, RETURN_NAN_SQRT
beq s1, x0, RETURN_ZERO_SQRT
addi s8, s1, -127
ori s3, s2, 0x80
andi t0, s8, 0x1
beq t0, x0, EVEN_EXP
slli s3, s3, 1
addi s9, s8, -1
srai s9, s9, 1
addi s9, s9, 127
j BINARY_SEARCH
EVEN_EXP:
srai s9, s8, 1
addi s9, s9, 127
j BINARY_SEARCH
BINARY_SEARCH:
li s4, 90 # low = 90
li s5, 256 # high = 256
li s6, 128 # result = 128
## t0: mid
## t1: sq
BINARY_SEARCH_LOOP:
bgt s4, s5, NORMALIZE
add t0, s4, s5
srli t0, t0, 1 # mid = (low + high) >> 1
li t1, 0
mv t2, t0
BINARY_SEARCH_LOOP_MUL:
add t1, t1, t0
addi t2, t2, -1
bne t2, x0, BINARY_SEARCH_LOOP_MUL
srli t1, t1, 7 # sq = (mid * mid) >> 7
bgtu t1, s3, 2f
mv s6, t0
addi s4, t0, 1
j BINARY_SEARCH_LOOP
2:
addi s5, t0, -1
j BINARY_SEARCH_LOOP
NORMALIZE:
li t0, 256
blt s6, t0, CHECK_LOWER_BOUND
srli s6, s6, 1
addi s9, s9, 1
j CHECK_OVER_UNDER_FLOW
CHECK_LOWER_BOUND:
li t0, 128
bge s6, t0, CHECK_OVER_UNDER_FLOW
li t1, 1
NORMALIZE_LOOP:
bge s6, t0, CHECK_OVER_UNDER_FLOW
ble s9, t1 , CHECK_OVER_UNDER_FLOW
slli s6, s6, 1
addi s9, s9, -1
j NORMALIZE_LOOP
CHECK_OVER_UNDER_FLOW:
andi s7, s6, 0x7F
li t0, 0xFF
bge s9, t0, RETURN_INFINITY_SQRT
ble s9, x0, RETURN_ZERO_SQRT
j FINISH_COMBINE_RESULT_SQRT
#RETURN result
RETURN_A_SQRT:
j sqrt_done
RETURN_NAN_SQRT:
li a0, BF16_NAN
j sqrt_done
RETURN_ZERO_SQRT:
li a0, BF16_ZERO
j sqrt_done
RETURN_INFINITY_SQRT:
li a0, 0x7F80
j sqrt_done
FINISH_COMBINE_RESULT_SQRT:
andi a0, s9, 0XFF
slli a0, a0, 7
or a0, a0, s7
j sqrt_done
sqrt_done:
lw s9, 0(sp)
lw s8, 4(sp)
lw s7, 8(sp)
lw s6, 12(sp)
lw s5, 16(sp)
lw s4, 20(sp)
lw s3, 24(sp)
lw s2, 28(sp)
lw s1, 32(sp)
lw s0, 36(sp)
lw ra, 40(sp)
addi sp, sp, 44
ret
```
</details>
## Test
### Float32 ⇄ BF16 Conversion:
| # | orig_f32 (Hex) | conv_bf16 (Hex) | conv_f32 (Hex) | Description |
| :-: | :------------: | :-------------: | :------------: | :---------- |
| 1 | 0x00000000 | 0x0000 | 0x00000000 | 0.0 |
| 2 | 0x3f800000 | 0x3f80 | 0x3f800000 | 1.0 |
| 3 | 0xbf800000 | 0xbf80 | 0xbf800000 | -1.0 |
| 4 | 0x3f000000 | 0x3f00 | 0x3f000000 | 0.5 |
| 5 | 0xbf000000 | 0xbf00 | 0xbf000000 | -0.5 |
| 6 | 0x40490fd0 | 0x4049 | 0x40490000 | 3.140625 |
| 7 | 0xc0490fd0 | 0xc049 | 0xc0490000 | -3.140625 |
| 8 | 0x501502f9 | 0x5015 | 0x50150000 | 1e10 |
| 9 | 0xd01502f9 | 0xd015 | 0xd0150000 | -1e10 |
### BF16 Addition Test:
| # | Input1 (Hex) | Input2 (Hex) | Expected Output (Hex) | Description |
| :-: | :----------: | :----------: | :-------------------: | :--------------------------- |
| 1 | 0x3f80 | 0x4000 | 0x4040 | 1.0 + 2.0 = 3.0 |
| 2 | 0x4049 | 0x402e | 0x40bb | 3.140625 + 2.71875 = 5.84375 |
| 3 | 0x3f80 | 0xc000 | 0xbf80 | 1.0 + (-2.0) = -1.0 |
| 4 | 0xc000 | 0x3f80 | 0xbf80 | -2.0 + 1.0 = -1.0 |
| 5 | 0x0000 | 0x3f80 | 0x3f80 | 0.0 + 1.0 = 1.0 |
| 6 | 0x3f80 | 0x0000 | 0x3f80 | 1.0 + 0.0 = 1.0 |
| 7 | 0x7f80 | 0x3f80 | 0x7f80 | +Inf + 1.0 = +Inf |
| 8 | 0x3f80 | 0x7f80 | 0x7f80 | 1.0 + +Inf = +Inf |
| 9 | 0xff80 | 0x3f80 | 0xff80 | -Inf + 1.0 = -Inf |
| 10 | 0x3f80 | 0xff80 | 0xff80 | 1.0 + -Inf = -Inf |
| 11 | 0x7f62 | 0x7f62 | 0x7f80 | 3e38 + 3e38 = +Inf |
### BF16 Subtraction Test:
| # | Input1 (Hex) | Input2 (Hex) | Expected Output (Hex) | Description |
| :-: | :----------: | :----------: | :-------------------: | :---------------------------- |
| 1 | 0x4000 | 0x3f80 | 0x3f80 | 2.0 - 1.0 = 1.0 |
| 2 | 0x4049 | 0x402e | 0x3ed8 | 3.140625 - 2.71875 = 0.421875 |
| 3 | 0x3f80 | 0xc000 | 0x4040 | 1.0 - (-2.0) = 3.0 |
### BF16 Multiplication Test:
| # | Input1 (Hex) | Input2 (Hex) | Expected Output (Hex) | Description |
| :-: | :----------: | :----------: | :-------------------: | :----------------------- |
| 1 | 0x4040 | 0x4080 | 0x4140 | 3.0 × 4.0 = 12.0 |
| 2 | 0x4049 | 0x402e | 0x4108 | 3.140625 × 2.71875 = 8.5 |
| 3 | 0x7f80 | 0x3f80 | 0x7f80 | +Inf × 1.0 = +Inf |
| 4 | 0x7f80 | 0x0000 | 0x7fc0 | +Inf × 0.0 = NaN |
### BF16 Division Test:
| # | Input1 (Hex) | Input2 (Hex) | Expected Output (Hex) | Description |
| :-: | :----------: | :----------: | :-------------------: | :----------------- |
| 1 | 0x4120 | 0x4000 | 0x40a0 | 10.0 / 2.0 = 5.0 |
| 2 | 0xc120 | 0x4000 | 0xc0a0 | -10.0 / 2.0 = -5.0 |
| 3 | 0x4128 | 0x3f00 | 0x41a8 | 10.5 / 0.5 = 21.0 |
| 4 | 0x7f80 | 0x7f80 | 0x7fc0 | +Inf / +Inf = NaN |
| 5 | 0x3f80 | 0x7f80 | 0x0000 | 1.0 / +Inf = 0.0 |
| 6 | 0x7f80 | 0x3f80 | 0x7f80 | +Inf / 1.0 = +Inf |
| 7 | 0x3f80 | 0x0000 | 0x7f80 | 1.0 / 0.0 = +Inf |
### BF16 Square Root Test:
| # | Input (Hex) | Expected Output (Hex) | Description |
| :-: | :---------: | :-------------------: | :---------- |
| 1 | 0x4080 | 0x4000 | √4.0 = 2.0 |
| 2 | 0x4110 | 0x4040 | √9.0 = 3.0 |
| 3 | 0x42a2 | 0x4110 | √81.0 = 9.0 |
### BF16 Rounding Tests:
| # | Input f32 (Hex) | Rounded Result f32 (Hex) | Notes |
| :-: | :-------------: | :----------------------: | :--------------------------------------- |
| 1 | `0x3FC00000` | `0x3FC00000` | Exactly representable (1.5) |
| 2 | `0x3F800347` | `0x3F800000` | Near 1.0; rounds to nearest even → 1.0. |
[The assembly code of TEST](https://github.com/CHENSHIBO0000/ca2025-quizzes/blob/main/HW1/q1-bfloat16/q1_bfloat16.S)
The tests begin with Float32–BF16 conversion. And this dataset also designed to verify the accuracy and consistency of BF16 arithmetic operations, including addition, subtraction, multiplication, division, and square root.
## Result


# Compute Hamming Distance with clz
### Motivation
I chose `Hamming distance` as my topic because it is a simple yet powerful concept that connects both software and hardware design. It provides an intuitive way to measure the difference between binary patterns, which is essential in many real-world systems. From error detection in communication to similarity measurement in machine learning and bit-level operations in computer architecture, Hamming distance offers a clear and practical foundation for understanding data integrity and pattern comparison.
To implement this problem efficiently, I utilized the `CLZ` instruction. `CLZ` allows us to quickly locate the position of the most significant bit (MSB) in a binary number, which can then be used to optimize bitwise operations. By combining CLZ with XOR operations, we can calculate the Hamming distance between two integers efficiently.
[The complete code](https://github.com/CHENSHIBO0000/ca2025-quizzes/tree/main/HW1/leetcode)
---
### Problem
[leetcode 461. Hamming Distance](https://leetcode.com/problems/hamming-distance/)
The **Hamming distance** between two integers is the number of positions at which the corresponding bits are different.
Given two integers `x` and `y`, return the **Hamming distance** between them.
```
Input: x = 1, y = 4
Output: 2
Explanation:
1 (0 0 0 1)
4 (0 1 0 0)
↑ ↑
The above arrows point to positions where the corresponding bits are different.
```
---
### Hamming distance without CLZ
#### C code
```c
int hammingDistance(uint32_t x, uint32_t y) {
uint32_t diff = x ^ y;
int hamming_count = 0;
if(diff==0)
return 0;
// Count remaining '1's
for (int i = 0; i <= 32 ; i++) {
diff <<= 1;
if (diff & 0x80000000)
hamming_count++;
if(!diff)
break;
}
```
This function calculates the **Hamming distance** by first using XOR (`x ^ y`) to find which bits differ between the two numbers.
Then it shifts the bits left one by one and counts how many `1`s appear in the most significant bit (MSB).
The total count gives the number of different bits — the Hamming distance.
#### Assembly code
```assembly
hammingDistance:
addi sp, sp -24
sw ra, 20(sp)
sw s0, 16(sp)
sw s1, 12(sp)
sw s2, 8(sp)
sw s3, 4(sp)
sw s4, 0(sp)
mv s0, a0
mv s1, a1
xor s2, s1, s0
li s3, 0
beq s2, x0, RETURN_ZERO
li t0, 0
li t1, 32
li t3, 0x80000000
for:
and t2, s2, t3
beq t2, x0, ZERO
addi s3, s3, 1
ZERO:
beq s2, x0, RETURN_DIS
addi t0, t0, 1
slli s2, s2, 1
blt t0, t1, for
j RETURN_DIS
RETURN_ZERO:
li a0, 0
j RETURN
RETURN_DIS:
mv a0, s3
j RETURN
RETURN:
lw s4, 0(sp)
lw s3, 4(sp)
lw s2, 8(sp)
lw s1, 12(sp)
lw s0, 16(sp)
lw ra, 20(sp)
addi sp, sp, 24
ret
```
---
### Hamming distance with CLZ
#### C code
```c
int hammingDistance(uint32_t x, uint32_t y) {
uint32_t diff = x ^ y;
int hamming_count = 1;
// Shift out the leading zeros first
uint32_t leading_zeros = clz(diff);
if(diff==0)
return 0;
diff <<= leading_zeros;
// Count remaining '1's
for (int i = 0; i < 31 - leading_zeros; i++) {
diff <<= 1;
if (diff & 0x80000000)
hamming_count++;
if(!diff)
break;
}
return hamming_count;
}
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
```
#### This version does:
1. Compute `diff = x ^ y` so that each `1` marks a bit position where `x` and `y` differ.
2. Use `clz(diff)` to get the number of leading zeros. This locates the most-significant `1` quickly.
3. Left-shift `diff` by `leading_zeros` so the highest `1` moves into the MSB (bit 31).
4. Initialize the count to `1` (to account for that MSB), then keep left-shifting and
increment the count whenever the MSB is `1`. Stop early if `diff` becomes `0`.
5. Return the total count = number of differing bits (the Hamming distance).
#### Assembly code
```assembly
hammingDistance:
addi sp, sp -24
sw ra, 20(sp)
sw s0, 16(sp)
sw s1, 12(sp)
sw s2, 8(sp)
sw s3, 4(sp)
sw s4, 0(sp)
mv s0, a0
mv s1, a1
xor s2, s1, s0
li s3, 1
beq s2, x0, RETURN_ZERO
mv a0, s2
jal ra,clz
mv s4, a0
sll s2, s2, s4
li t0, 31
sub t1, t0, s4
li t0, 0
li t3, 0x80000000
for:
slli s2, s2, 1
and t2, s2, t3
beq t2, x0, ZERO
addi s3, s3, 1
ZERO:
beq s2, x0, RETURN_DIS
addi t0, t0, 1
blt t0, t1, for
j RETURN_DIS
RETURN_ZERO:
li a0, 0
j RETURN
RETURN_DIS:
mv a0, s3
j RETURN
RETURN:
lw s4, 0(sp)
lw s3, 4(sp)
lw s2, 8(sp)
lw s1, 12(sp)
lw s0, 16(sp)
lw ra, 20(sp)
addi sp, sp, 24
ret
#########################################
# #
# #
# clz #
# #
# #
#########################################
clz:
addi sp, sp -4
sw ra, 0(sp)
li t0, 32
li t1, 16
mv t3, a0
clz_Loop:
srl t2, t3, t1
beq t2, x0, SHIFT
sub t0, t0, t1 # t 0 = t1 - t2
mv t3, t2
SHIFT:
srli t1, t1, 1
bne t1, x0, clz_Loop
sub a0, t0, t3
lw ra, 0(sp)
addi sp, sp,4
ret
```
### Test
This table provides a wide coverage of bit-difference cases to verify the correctness of the Hamming distance function.
- **Basic checks**: identical values, single-bit differences at LSB and MSB, and all-bit differences (0 vs FFFFFFFF).
- **Alternating patterns**: cases like `0xAAAAAAAA` vs `0x55555555` to test alternating 1–0 bit flips.
- **Partial and shifted patterns**: examples with nibbles, byte shifts, or swapped high/low 16 bits.
- **Random pairs**: values such as `0xDEADBEEF` and `0xFEEDFACE` to ensure general correctness.
- **Edge and boundary cases**: maximum positive vs negative (`0x7FFFFFFF` vs `0x80000000`), adjacent bits, and grouped 1’s.
Using these 25 patterns to ensure full functional coverage of the Hamming distance logic.
| # | X (Hex) | Y (Hex) | Hamming Distance | Description |
|----|---------------|---------------|------------------|---------------------------------------|
| 1 | 0x00000000 | 0x00000000 | 0 | identical |
| 2 | 0x00000001 | 0x00000000 | 1 | one LSB differs |
| 3 | 0x80000000 | 0x00000000 | 1 | one MSB differs |
| 4 | 0xFFFFFFFF | 0x00000000 | 32 | all bits differ |
| 5 | 0xAAAAAAAA | 0x55555555 | 32 | alternating 1010 vs 0101 |
| 6 | 0xF0F0F0F0 | 0x0F0F0F0F | 32 | nibble-wise alternation |
| 7 | 0x12345678 | 0x12345678 | 0 | identical random value |
| 8 | 0x12345678 | 0x12345670 | 1 | single-bit delta |
| 9 | 0x0000FFFF | 0xFFFF0000 | 32 | lower/upper 16 bits swapped |
| 10 | 0x00000000 | 0xFFFFFFFF | 32 | all-zero vs all-one |
| 11 | 0x7FFFFFFF | 0xFFFFFFFF | 1 | only MSB differs |
| 12 | 0x7FFFFFFF | 0x80000000 | 32 | +max vs min-int |
| 13 | 0x00010000 | 0x00000001 | 2 | two distant ones |
| 14 | 0x00F00000 | 0x00000000 | 4 | four consecutive ones |
| 15 | 0x00F00000 | 0x0000F000 | 8 | two groups of four ones |
| 16 | 0xDEADBEEF | 0xFEEDFACE | 6 | random pair |
| 17 | 0xCAFEBABE | 0x8BADF00D | 14 | random pair |
| 18 | 0x01010101 | 0x10101010 | 8 | 0x01 pattern vs 0x10 pattern |
| 19 | 0xAAAAAAAA | 0xAAAAAAAA | 0 | identical alternating |
| 20 | 0x00000003 | 0x00000001 | 1 | adjacent low bits |
| 21 | 0x0000000F | 0x000000F0 | 8 | eight consecutive bits differ |
| 22 | 0x00000008 | 0x00000010 | 2 | two adjacent bits |
| 23 | 0x40000000 | 0x80000000 | 2 | top two bits differ |
| 24 | 0x80000001 | 0x00000001 | 1 | only MSB differs |
| 25 | 0x00000002 | 0x00000001 | 2 | LSB and next bit differ |
```assembly
main:
li s0, 0
la s1, orig_X
la s2, orig_Y
la s3, golden
li s6, NUM_TEST_VALUES
bge s0, s6, Passed
1:
lw a0, 0(s1)
lw a1, 0(s2)
lw s4, 0(s3)
jal ra hammingDistance
mv s5, a0
bne s4, s5, 2f
addi s0, s0, 1
addi s1, s1, 4
addi s2, s2, 4
addi s3, s3, 4
blt s0, s6, 1b
Passed:
la a0, Passed_msg
li a7, 4
ecall # Print passed message
li a0, 0
j done # go to return
#failed
2:
la a0, Failed_msg
li a7, 4
ecall # Print passed message
j done # go to return
done:
li a7, 93 # system call: exit2
li a0, 1 # exit value
ecall # exit 1
```
### Result
- Without clz


- With clz


### Performance Comparison
Both versions passed all functional tests, but the **CLZ-based implementation** performed better.
By using the `CLZ` instruction to skip leading zeros, the loop runs fewer iterations, reducing total cycles and retired instructions.
As shown in the results:
- **Without CLZ:** 6365 cycles
- **With CLZ:** 5345 cycles
This demonstrates that the CLZ version is more efficient, achieving the same correctness with lower execution time.
# 5-Stage RISC-V Processor

In this work, I use a 5-stage RISC-V processor that executes programs using only the RV32I instruction set.
The RISC-V 5-stage pipeline divides instruction execution into five sequential stages, each responsible for a specific part of the instruction flow. This structure allows multiple instructions to be processed simultaneously, improving throughput and efficiency.
1. Instruction Fetch (IF)
- In this stage, the processor fetches the instruction from the instruction memory using the Program Counter (PC).
- Update the PC to the next sequential instruction by adding 4.
2. Instruction Decode and Register Fetch (ID)
- The fetched instruction is decoded to identify the operation type and register operands.
- The control unit interprets fields such as opcode, funct3, funct7, rs1, rs2, and rd, generating the required control signals.
- The Register File provides values from rs1 and rs2, and any immediate value is extracted.
3. Execute / Address Calculation (EX)
- The ALU performs the required computation — arithmetic, logic, comparison, or address calculation.
- For branch or memory instructions, this stage also computes the target or effective address, and determines whether a branch should be taken.
4. Memory Access (MEM)
- If the instruction involves memory, processor reads or writes data from/to the data memory in this stage.
- `load` instructions read data into a buffer.
- `store` instructions write a register’s value to memory.
5. Write Back (WB)
- The final stage writes the result (either from the ALU or memory) back into the Register File at the `rd`.
---
Take `andi x10, x25, 255` (`andi a0, s9, 0XFF`) this instruction, which address is at 0xF00 and see how it works in each stage.
## 1. IF

The program counter currently holds the value `0x00000F00`, indicating that the next instruction to be executed is stored at memory address `0x00000F00`. In the next step of the instruction cycle, the CPU will fetch the instruction from this address.
Simultaneously, PC will be added by 4 to obtain the next instruction address which is 0x00000F04.
## 2. ID

In this stage, the fetched instruction `0x0FFCF513` is decoded.
The decoded result generates the necessary control signals, instructing the `Register File` to provide the values from `rs1 (x25)` (no rs2 is used for I-type instruction.).
imm[11:0] | rs1 | funct3 | rd | opcode|
----------|-------|--------|-------|-------|
000011111111 | 11001 | 111 | 01010 | 0010011|
Identify opcode and fields:
| Field | Bits | Value (bin) | Value (hex) | Meaning |
| ------------- | ------- | -------------- | ----------- | ---------------------- |
| **opcode** | [6:0] | `0010011` | 0x13 | I-type (immediate ALU) |
| **rd** | [11:7] | `01010` | 0x0A | **Name : x10 → Alias : a0** |
| **funct3** | [14:12] | `111` | 0x7 | **ANDI** |
| **rs1** | [19:15] | `11001` | 0x19 | **Name : x25 → Alias : s9** |
| **imm[11:0]** | [31:20] | `000011111111` | 0x0FF | **Immediate = 0xFF** |
The decoder extracts `imm[11:0]` and extends it to a 32-bit value for ALU computation, where the immediate is `0x000000FF`.
The Register File provides values from `rs1(x25)`.

## 3. EX

In this stage, the ALU performs a bitwise AND on the two operands selected by the input MUXs.
- Input
- op1 : `0X00000080 (x25)`
- op2 : `0X000000FF (Imm)`
- Output
- 0X00000080
The `rd (0x0A)` address will be sent into next stage.
## 4. MEM

In `I-type` instruction, no need for memory `store/load`, so set Data memory `wr_en` low.
## 5. WB

In the `WB` stage, MUXs select the write-back source (ALU result or memory data).
For this I-type instruction, the ALU result `0x00000080` is written to the destination register `rd = x10 (index 0x0A)`.
## Register update
- Before:

- After:

It can be seen that the value stored in register x10(a0) has changed from `0x00004080` to `0x00000080`.