## AI Collaboration (Transparency)
1. AI was used for translation and note-taking.
2. AI provided debugging suggestions for parts of the code.
3. AI helped improve the formatting and visual presentation of the notes.
# UF8 Encoding/Decoding Tester
### Program Walkthrough (Key Sections)
#### Important Registers (Internal Conventions)
* `s1`: current UF8 value being tested (0..255)
* `s2`: `previous_value`, the previously decoded value (initialized to -1)
* `s3` = 15, `s4` = 0x7FFF, `s5` = 256: frequently used constants
* `s6`: pass flag (1 = all tests passed, 0 = any failure)
* `s7` = &LC0, `s8` = &LC1: pointers to the two error-message format strings
* `s0`: temporary register for the decoded value
* `s9`: temporary register for the current UF8 byte
### Decode
```
e = uf8 >> 4
m = uf8 & 0xF
base(e) = ((0x7FFF >> (15 - e)) << 4)
value = base(e) + (m << e)
```
* Implemented using **bit shifts and addition** (no multiplication or division).
* Since the test program decodes right at the beginning, **inlining/merging the decode into the test loop** removes an extra call/jump step.
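For reference, the decode formula can be written as a compact C helper (a sketch with an assumed name `uf8_decode`; the inlined assembly below is the actual implementation):
```c
#include <stdint.h>

/* Decode one UF8 byte into its integer value (sketch of the formula above). */
static inline uint32_t uf8_decode(uint8_t uf8)
{
    uint32_t e = uf8 >> 4;                      /* exponent: top 4 bits */
    uint32_t m = uf8 & 0xF;                     /* mantissa: low 4 bits */
    uint32_t base = (0x7FFFu >> (15 - e)) << 4; /* start of segment e */
    return base + (m << e);
}
```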
```
# -------------------------
# Main loop (inline decoding)
# -------------------------
decoding:
andi s9, s1, 0xff # s9 = current UF8 byte (force 0..255)
srli a4, s9, 4 # a4 = exponent = uf8 >> 4
sub s0, s3, a4 # s0 = 15 - exponent
sra s0, s4, s0 # s0 = (0x7FFF >> (15 - exponent))
andi a5, s9, 15 # a5 = mantissa = uf8 & 0xF
slli s0, s0, 4 # s0 = ((0x7FFF >> (15 - e)) << 4) -- decode "offset" part
sll a5, a5, a4 # a5 = (mantissa << exponent)
add s0, s0, a5 # s0 = decoded value (value)
```
### Encode
Since encoding always requires CLZ, we implement it in the same routine to avoid an extra branch/call.
First, find the appropriate exponent `E`:
```
mv t1, a0 # x
li t2, 32 # n = 32
li t3, 16 # c = 16
1: srl t4, t1, t3 # y = x >> c
beqz t4, 2f
sub t2, t2, t3 # n -= c
mv t1, t4 # x = y
2: srai t3, t3, 1 # c >>= 1
bnez t3, 1b
sub t2, t2, t1 # n -= x
li t5, 31
```
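This is the classic binary-search CLZ from Hacker's Delight; a C equivalent (assumed helper name `clz32`) may make the loop easier to follow:
```c
#include <stdint.h>

/* Count leading zeros of a 32-bit value (binary-search formulation). */
static inline int clz32(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) { n -= c; x = y; } /* top half non-zero: narrow the search */
        c >>= 1;
    } while (c);
    return n - x;                 /* x ends up as 0 or 1 */
}
```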
$$
\boldsymbol{E}=(31-\mathrm{clz}(\mathrm{value}))-4
$$
This step roughly means: **take the position of the most significant bit (MSB) and subtract 4**, while ensuring that the result is **clamped to the range [0, 15]**.
```
sub t5, t5, t2 # t5 = 31 - clz(value)
addi t5, t5, -4 # t5 = E
bltz t5, 3f
li t6, 15
bleu t5, t6, 4f
mv t5, t6 # clamp max 15
j 4f
3: li t5, 0
```
Compared to the instructor’s approach, we encode directly via the inverse mapping. We use a CLZ routine to estimate the candidate exponent E, then compute base(E) = ((0x7FFF >> (15 - E)) << 4). Next, we derive the mantissa (denoted M) by simple rearrangement:
$$
base(E) = \left( (0x7FFF \gg (15 - E)) \ll 4 \right)
$$
Calculate the mantissa:
$$
M = \min\left(15,\ \frac{\textit{value}-\textit{base}(E)}{2^{E}}\right)
$$
Combine both parts to form the final 8-bit UF8 value:
$$
uf8 = (E \ll 4) \mid M
$$
As noted above, debugging revealed that for certain values near the segment boundaries, the exponent `E` obtained directly from the CLZ function can be off by one.
**Example:** when `value = 241`
1. **Find CLZ and estimate E**
241 in binary: `0b...11110001`
The highest bit is at bit 7 → `clz(241) = 24`
Estimated `E = (31 − 24) − 4 = 3`
2. **Compute the segment start base(E)**
base(3) = ((0x7FFF >> (15−3)) << 4) = ((0x7FFF >> 12) << 4) = 7 << 4 = **112**
Next segment:
base(4) = ((0x7FFF >> (15-4)) << 4) = ((0x7FFF >> 11) << 4) = 15 << 4 = **240**
From this example, the correct segment for `241` should actually be at **E = 4**.
To fix this, an additional check mechanism was introduced to enforce
**base(E) ≤ value < base(E+1)**
so that the exponent `E` is adjusted correctly in boundary cases.
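Putting the CLZ estimate, the boundary correction, and the mantissa computation together, a C sketch of the whole encode path (using the `clz32` helper above; the assembly below is the actual code) looks like this:
```c
#include <stdint.h>

/* Encode an integer into UF8 (sketch; mirrors the assembly below). */
static inline uint8_t uf8_encode(uint32_t value)
{
    int e = (31 - clz32(value)) - 4;              /* candidate exponent */
    if (e < 0)  e = 0;                            /* clamp to [0, 15] */
    if (e > 15) e = 15;
    uint32_t base = (0x7FFFu >> (15 - e)) << 4;
    if (value < base && e > 0) {                  /* estimate one segment too high */
        e--;
        base = (0x7FFFu >> (15 - e)) << 4;
    }
    if (e < 15) {                                 /* estimate one segment too low */
        uint32_t next = (0x7FFFu >> (15 - (e + 1))) << 4;
        if (value >= next) { e++; base = next; }
    }
    uint32_t m = (value - base) >> e;             /* mantissa */
    if (m > 15) m = 15;                           /* clamp to 4 bits */
    return (uint8_t)((e << 4) | m);
}
```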
```
4:
# ---- base(E) = ((0x7FFF >> (15-E)) << 4) ----
# a1 = base(E)
li a1, 15
sub a1, a1, t5 # a1 = 15 - E
li a2, 32768
addi a2, a2, -1 # a2 = 0x7FFF
sra a2, a2, a1
slli a2, a2, 4 # a2 = base(E)
# ---- Segment alignment adjustment: if value < base(E) → E-- (then recompute base) ----
bltu a0, a2, 5f
j 6f
5: beqz t5, 6f # If E == 0, do not decrement
addi t5, t5, -1 # E--
li a1, 15
sub a1, a1, t5
li a2, 32768
addi a2, a2, -1
sra a2, a2, a1
slli a2, a2, 4 # a2 = base(E) (recomputed)
6:
# ---- If value >= base(E+1) → E++ (then recompute base(E)) ----
# Compute base(E+1) into a3
li a3, 15
addi t1, t5, 1 # t1 = E + 1
bgtu t1, a3, 7f # If E == 15, there is no next segment
sub a3, a3, t1 # a3 = 15 - (E + 1)
li t0, 32768
addi t0, t0, -1 # t0 = 0x7FFF
sra t0, t0, a3
slli t0, t0, 4 # t0 = base(E+1)
bltu a0, t0, 7f # if (value < base(E+1)) keep E
addi t5, t5, 1 # E++
mv a2, t0 # a2 = base(E) = base(E+1)
7:
# ---- M = ((value - base(E)) >> E), clamped to 0..15 ----
sub t2, a0, a2 # value - base(E) (guaranteed non-negative)
srl t2, t2, t5 # M
li t3, 15
bleu t2, t3, 8f
mv t2, t3
8:
slli t5, t5, 4 # (E << 4)
or a0, t5, t2 # uf8 = (E << 4) | M
andi a0, a0, 0xFF
ret
```
### Cycle Count
Using data provided by the [**Compiler Explorer**](https://godbolt.org/) website for comparison,
the **cycle count** of the assembly code generated there is shown below:

The optimized results are shown below:

The cycle count was reduced by almost 34% using the table approach.
This part of the code can be found on my [GitHub](https://github.com/Stanley0915/ca2025-quizzes/blob/413676c647bbe4f5977e54b87f6b2b096532d2c3/prombleB.s).
[Compiler-generated version of Problem B](https://github.com/Stanley0915/ca2025-quizzes/blob/2cecf56ff70cdd4f3806fc4e469bd7172ff39da9/compiler_problemB.s)
---
# Problem C
### Test whether a bf16 bit-pattern is a special value (Zero, Infinity, NaN)
**Zero:**
$$
E = 0, M = 0 \Rightarrow v = (-1)^S \times 0
$$
**Infinity:**
$$
E = 255, M = 0 \Rightarrow v = (-1)^S \times \infty
$$
**NaN:**
$$
E = 255, M \neq 0 \Rightarrow v = \text{NaN}
$$
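Before the assembly, here are the three tests as C one-liners (a sketch using the mask values defined in the block below; `bf16_t` wraps the 16-bit pattern, matching the `BF16_NAN()` definition there):
```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint16_t bits; } bf16_t;

/* exponent all ones and mantissa non-zero → NaN */
static inline bool bf16_isnan(bf16_t a)
{
    return ((a.bits & 0x7F80) == 0x7F80) && (a.bits & 0x007F) != 0;
}

/* exponent all ones and mantissa zero → ±Inf */
static inline bool bf16_isinf(bf16_t a)
{
    return ((a.bits & 0x7F80) == 0x7F80) && (a.bits & 0x007F) == 0;
}

/* everything except the sign bit is zero → ±0 */
static inline bool bf16_iszero(bf16_t a)
{
    return (a.bits & 0x7FFF) == 0;
}
```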
To keep the needed masks readily available, the code uses `s` registers; since these are callee-saved, their previous contents are first spilled to the stack.
```
addi sp,sp,-48
sw ra,44(sp)
sw s0,40(sp)
sw s1,36(sp)
sw s2,32(sp)
sw s3,28(sp)
sw s4,24(sp)
sw s5,20(sp)
sw s6,16(sp)
sw s7,12(sp)
sw s8,8(sp)
sw s9,4(sp)
li s1,0x8000 #define BF16_SIGN_MASK 0x8000U
li s2,0x7F80 #define BF16_EXP_MASK 0x7F80U
li s3,0x007F #define BF16_MANT_MASK 0x007FU
li s4,127 #define BF16_EXP_BIAS 127
li s5,0x7FC0 #define BF16_NAN() ((bf16_t){ .bits = 0x7FC0 })
li s6,0x0000 #define BF16_ZERO() ((bf16_t){ .bits = 0x0000 })
bool_bf16_isnan:
and a1,s2,a0
bne s2,a1,label0 # exp field not all ones → not NaN → return 0
and a2,s3,a0
bne a2,x0,label1 # exp all ones and mant != 0 → NaN → return 1
label0:
li a0,0 # Inf (exp all ones, mant == 0) also lands here
ret
label1:
li a0,1
ret
bool_bf16_isinf:
and a1,s2,a0
bne s2,a1,label2 # exp field not all ones → not Inf → return 0
and a2,s3,a0
bne a2,x0,label2 # mant != 0 → NaN, not Inf → return 0
li a0,1
ret
label2:
li a0,0
ret
bf16_iszero:
li t1, 0x7FFF # load immediate 0x7FFF
and a1, a0, t1 # a1 = a0 & t1
beq a1,x0,label3
li a0,0
ret
label3:
li a0,1
ret
```
---
This `bf16_add()` function is the **core algorithm** for bfloat16 floating-point addition. It mimics the IEEE-754 addition procedure but implements it using only 16 bits (1 sign + 8 exponent + 7 mantissa).
---
### `bf16_add()` Function Overview
This routine implements addition of two bfloat16 numbers. Because bfloat16 lacks hardware floating-point support, the routine **simulates all floating-point steps using integer bit operations**. The overall logic is divided into **5 stages**:
---
#### Field Extraction
The routine first parses three primary fields from the 16-bit inputs:
* **sign** (bit 15): sign bit
* **exp** (bits [14:7]): exponent with bias 127
* **mantissa** (bits [6:0]): mantissa
It uses shifts and masks (`>>`, `&`) to obtain each field so they can be manipulated independently later.
```
bf16_add:
srli t0,a0,15 # uint16_t sign_a = (a.bits >> 15) & 1;
andi t0,t0,1
srli t1,a1,15 # uint16_t sign_b = (b.bits >> 15) & 1;
andi t1,t1,1
srli t2,a0,7 # int16_t exp_a = ((a.bits >> 7) & 0xFF);
andi t2,t2,0xFF
srli t3,a1,7 # int16_t exp_b = ((b.bits >> 7) & 0xFF);
andi t3,t3,0xFF
andi t4,a0,0x7F # uint16_t mant_a = a.bits & 0x7F;
andi t5,a1,0x7F # uint16_t mant_b = b.bits & 0x7F;
```
---
#### Handling Special Values
If the exponent is all ones (`0xFF`), the number is **NaN or Infinity**:
* If mantissa ≠ 0 → NaN (return `a` directly)
* If the other operand is Inf but with opposite sign → return NaN
* Otherwise return `a` or `b` as is (Inf + Inf of same sign is still Inf)
It also handles zeros: if both exponent and mantissa are 0, directly return the other operand.
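As a C sketch, the `exp_a == 0xFF` branch looks like this (a fragment operating on the fields extracted above; `BF16_NAN()` as defined in the earlier mask block):
```c
if (exp_a == 0xFF) {
    if (mant_a) return a;                        /* a is NaN → propagate a */
    if (exp_b != 0xFF) return a;                 /* a is Inf, b finite → Inf */
    if (mant_b) return b;                        /* b is NaN → propagate b */
    return (sign_a == sign_b) ? b : BF16_NAN();  /* Inf + Inf vs. Inf - Inf */
}
```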
```
li t6, 0xFF
bne t2, t6, L_expA_not_FF # exp_a != 0xFF → skip this block
# exp_a == 0xFF
bnez t4, L_ret_a # if (mant_a) return a;
# mant_a == 0
bne t3, t6, L_ret_a # if (exp_b != 0xFF) return a;
# exp_b == 0xFF
bnez t5, L_ret_b # if (mant_b) return b; // b is NaN
# both are Inf: return ( (sign_a == sign_b) ? b : BF16_NAN() );
bne t0, t1, L_ret_nan # sign_a != sign_b → NaN
j L_ret_b # otherwise return b (Inf with same sign)
L_ret_a:
# a0 already holds a.bits
ret
L_ret_b:
mv a0, a1 # return b.bits
ret
L_ret_nan:
li a0, 0x7FC0 # BF16_NAN()
ret
```
---
#### Aligning Mantissas
To add the two values on a common scale, the routine computes:
```c
exp_diff = exp_a - exp_b;
```
Then it right-shifts the mantissa of the operand with the **smaller** exponent by `|exp_diff|`.
Example: for 1.0 (Exp=127) + 0.5 (Exp=126), the mantissa of 0.5 shifts right by 1 to align.
If the difference exceeds 8 (limited bfloat16 precision), it returns the larger operand directly, effectively discarding the tiny contribution.
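In C, the alignment stage reads roughly as follows (a fragment over the extracted fields; `result_exp` corresponds to register `a4` in the assembly below):
```c
int16_t exp_diff = exp_a - exp_b;
int16_t result_exp;
if (exp_diff > 0) {                  /* a has the larger exponent */
    result_exp = exp_a;
    if (exp_diff > 8) return a;      /* b is too small to contribute */
    mant_b >>= exp_diff;
} else if (exp_diff < 0) {           /* b has the larger exponent */
    result_exp = exp_b;
    if (exp_diff < -8) return b;     /* a is too small to contribute */
    mant_a >>= -exp_diff;
} else {
    result_exp = exp_a;              /* already aligned */
}
```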
```
sub a3, t2, t3 # a3 = exp_diff
bge x0, a3, skip_exp_diffbigger0 # if (0 >= a3) → skip this block
mv a4, t2
li a5, 8
blt a5, a3, exp_diffbigger8 # if (8 < a3) → return a
srl t5, t5, a3
j L_done_block
exp_diffbigger8:
ret
skip_exp_diffbigger0:
L_done_block:
bge a3, x0, skip_exp_difflitte0 # if (exp_diff >= 0) → skip this branch
mv a4, t3
li t6, -8
blt a3, t6, exp_difflittle_8 # if a3 < -8 → return b
sub t6, x0, a3 # t6 = -exp_diff
srl t4, t4, t6 # mant_a >>= t6 (variable shift via srl)
j L_done_block2
exp_difflittle_8:
mv a0, a1 # return b.bits
ret
skip_exp_difflitte0:
mv a4, t2 # result_exp = exp_a;
L_done_block2:
```
---
#### Perform Add/Sub and Normalize
Depending on whether the signs match, there are two paths:
* **Same-sign addition:**
Add mantissas and check for **carry** (bit 8 = 1).
If carry occurs, shift mantissa right by 1 and increment exponent to keep the 1.x format.
If the exponent overflows to 255 → return ∞.
* **Different-sign subtraction:**
Subtract the smaller mantissa from the larger. If the result is 0 → return 0.
Otherwise, enter a while-loop to **left-shift** the mantissa (normalization) until the top bit returns to bit 7, decrementing the exponent each shift.
If the exponent drops below 0, it’s underflow → return 0.
This corresponds to the floating-point “**align → add/sub → normalize**” sequence.
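A C sketch of both paths (a fragment over the fields above; the Inf/zero returns are written out inline because the corresponding labels sit outside this excerpt):
```c
uint16_t result_sign, result_mant;
if (sign_a == sign_b) {                           /* same sign: magnitude add */
    result_sign = sign_a;
    result_mant = mant_a + mant_b;
    if (result_mant & 0x100) {                    /* carry out of bit 7 */
        result_mant >>= 1;
        if (++result_exp >= 0xFF)                 /* exponent overflow → ±Inf */
            return (bf16_t){ .bits = (uint16_t)((result_sign << 15) | 0x7F80) };
    }
} else {                                          /* different sign: subtract */
    if (mant_a >= mant_b) { result_sign = sign_a; result_mant = mant_a - mant_b; }
    else                  { result_sign = sign_b; result_mant = mant_b - mant_a; }
    if (result_mant == 0) return (bf16_t){ .bits = 0 };  /* exact cancellation */
    while (!(result_mant & 0x80)) {               /* renormalize to 1.x */
        result_mant <<= 1;
        if (--result_exp < 1)
            return (bf16_t){ .bits = 0 };         /* exponent underflow → 0 */
    }
}
```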
```
bne t0, t1, sign_a_notequ_sign_b
mv s1,t1 # s1 = result_sign
add s2,t4,t5 # s2 = result_mant
# t2,t3 are free for reuse now
andi t2,s2,0x100
beqz t2, not_equ_1
srli s2,s2,1
addi a4, a4, 1 # ++result_exp
li t3, 0xFF
bgeu a4, t3, L_ret_inf # result_exp >= 0xFF → ±Inf
not_equ_1:
j result
sign_a_notequ_sign_b:
bgeu t4, t5, mant_a_notbigger_mant_b # if (mant_a >= mant_b)
mv s1, t1 # result_sign = sign_b
sub s2, t5, t4 # result_mant = mant_b - mant_a
j mant_a_bigger_mant_b
mant_a_notbigger_mant_b:
mv s1, t0 # result_sign = sign_a
sub s2, t4, t5 # result_mant = mant_a - mant_b
mant_a_bigger_mant_b:
beqz s2, L_ret_zero
while_loop:
andi t2, s2, 0x80 # is bit7 = 1?
bnez t2, while_loopout # already normalized → break
slli s2, s2, 1 # mant <<= 1
addi a4, a4, -1 # --result_exp
addi t3, x0, 1
blt a4, t3, L_ret_zero # result_exp <= 0 → underflow → 0
j while_loop
while_loopout:
```
---
#### Repack the Result
Finally, repack the result into bfloat16 format:
```
bits = (sign << 15) | ((exp & 0xFF) << 7) | (mant & 0x7F);
```
This yields a new bfloat16 value with correct sign, exponent, and mantissa fields.
```
result:
slli t0, s1, 15 # t0 = (result_sign << 15)
andi t1, a4, 0xFF # t1 = result_exp & 0xFF
slli t1, t1, 7 # t1 = (result_exp & 0xFF) << 7
andi t2, s2, 0x7F # t2 = result_mant & 0x7F
or a0, t0, t1 # a0 = sign | exponent part
or a0, a0, t2 # a0 |= mantissa
li s1,0x8000 # restore BF16_SIGN_MASK (s1 was reused as result_sign)
li s2,0x7F80 # restore BF16_EXP_MASK (s2 was reused as result_mant)
ret
```
---
### `bf16_sub()` Function Overview
For subtraction, compute it as addition of the negated subtrahend: flip the sign bit of the second operand with a mask, then reuse the addition routine above.
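In C this is a two-liner (a sketch mirroring the assembly below):
```c
bf16_t bf16_sub(bf16_t a, bf16_t b)
{
    b.bits ^= 0x8000;       /* flip the sign of the subtrahend */
    return bf16_add(a, b);  /* then reuse addition */
}
```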
```
bf16_sub:
li t0, 0x8000 # BF16_SIGN_MASK
xor a1, a1, t0 # b.bits ^= 0x8000 → flip sign bit
bf16_add:
# ... falls through into the addition routine above
```
---
### `bf16_mul()` Function Overview
This routine implements **bfloat16 multiplication** entirely using **integer bit operations**, since there’s no hardware floating-point support.
The overall logic is divided into **9 stages**:
---
#### 1) Field Extraction
Parse the 16-bit input into three fields and compute the result sign.
* **sign** (bit 15)
* **exponent** (bits [14:7], bias = **127**)
* **mantissa** (bits [6:0]; normalized numbers will later add the hidden “1” bit)
```
srli t0, a0, 15 # sign_a
andi t0, t0, 1
srli t1, a1, 15 # sign_b
andi t1, t1, 1
srli t2, a0, 7 # exp_a
andi t2, t2, 0xFF
srli t3, a1, 7 # exp_b
andi t3, t3, 0xFF
andi t4, a0, 0x7F # mant_a
andi t5, a1, 0x7F # mant_b
xor s1, t0, t1 # result_sign = sign_a ^ sign_b
```
---
#### 2) Special-Value Detection (NaN / Inf)
Handle NaN and Inf early to avoid unnecessary multiplications.
Key logic:
* `exp==0xFF && mant!=0` → **NaN** (return that operand)
* `Inf * 0 → NaN`
* `Inf * non-zero → Inf`
```
li t6, 0xFF
# a special?
bne t2, t6, mul_check_b_special
bnez t4, mul_ret_a # a is NaN
bnez t3, mul_ret_inf # b non-zero → ±Inf
bnez t5, mul_ret_inf # b subnormal non-zero → ±Inf
j mul_ret_nan # Inf * 0 = NaN
# b special?
mul_check_b_special:
bne t3, t6, mul_check_zero
bnez t5, mul_ret_b # mant_b != 0 → b is NaN
# mant_b == 0 → b is Inf
bnez t2, mul_ret_inf # a non-zero → ±Inf
bnez t4, mul_ret_inf # a subnormal non-zero → ±Inf
j mul_ret_nan # 0 * Inf = NaN
```
---
#### 3) Zero Detection
If either operand is ±0 → directly return **signed zero**.
```
mul_check_zero:
bnez t2, mul_check_b_zero
bnez t4, mul_check_b_zero
j mul_ret_signed_zero # a == 0
mul_check_b_zero:
bnez t3, mul_normalize
bnez t5, mul_normalize
j mul_ret_signed_zero # b == 0
```
---
#### 4) Mantissa Normalization (Handling Subnormal + Hidden Bit)
* For **normalized numbers** (`exp != 0`): set the hidden bit `mant |= 0x80`.
* For **subnormal numbers** (`exp == 0`): left-shift until MSB (bit7) becomes 1, decrement `exp_adjust` each shift, then set `exp = 1`.
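A C sketch for operand a (the same pattern repeats for b; `exp_adjust` is shared between both operands, matching register `t6` below):
```c
int exp_adjust = 0;
if (exp_a) {
    mant_a |= 0x80;              /* normal number: restore the hidden 1 */
} else {                         /* subnormal (zero was ruled out earlier) */
    while (!(mant_a & 0x80)) {   /* shift the MSB up to bit 7 */
        mant_a <<= 1;
        exp_adjust--;            /* each shift costs one exponent step */
    }
    exp_a = 1;
}
```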
```
mul_normalize:
li t6, 0 # exp_adjust = 0
# A
bnez t2, mul_norm_a_done
mul_norm_a_loop:
andi s7, t4, 0x80
bnez s7, mul_norm_a_subnormal
slli t4, t4, 1
addi t6, t6, -1 # exp_adjust--
j mul_norm_a_loop
mul_norm_a_subnormal:
li t2, 1
j mul_norm_b
mul_norm_a_done:
ori t4, t4, 0x80
# B
mul_norm_b:
bnez t3, mul_norm_b_done
mul_norm_b_loop:
andi s7, t5, 0x80
bnez s7, mul_norm_b_subnormal
slli t5, t5, 1
addi t6, t6, -1 # exp_adjust--
j mul_norm_b_loop
mul_norm_b_subnormal:
li t3, 1
j mul_mantissa
mul_norm_b_done:
ori t5, t5, 0x80
```
---
#### 5) Mantissa Multiplication (8-bit × 8-bit, Shift-Add)
Implements an 8×8 multiplication using the shift-add algorithm, producing a 16-bit product stored in `s2`.
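The same loop in C (a fragment; both mantissas already carry the hidden bit, so the 16-bit product fits comfortably in 32 bits):
```c
uint32_t result_mant  = 0;
uint32_t multiplicand = mant_a;
uint32_t multiplier   = mant_b;
for (int i = 0; i < 8; i++) {        /* one iteration per multiplier bit */
    if (multiplier & 1)
        result_mant += multiplicand; /* add when the current bit is set */
    multiplicand <<= 1;
    multiplier   >>= 1;
}
```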
```
mul_mantissa:
li s2, 0 # result_mant = 0
mv s7, t4 # multiplicand = mant_a
mv a5, t5 # multiplier = mant_b
li a4, 8 # loop count = 8
mul_shift_add_loop:
andi a3, a5, 1
beqz a3, mul_shift_only
add s2, s2, s7 # if LSB=1, add multiplicand
mul_shift_only:
slli s7, s7, 1
srli a5, a5, 1
addi a4, a4, -1
bnez a4, mul_shift_add_loop
```
---
#### 6) Exponent Calculation (Unbiased Form)
`result_exp = (exp_a + exp_b − 127) + exp_adjust`
```
mul_calc_exp:
add a4, t2, t3 # exp_a + exp_b
add a4, a4, t6 # + exp_adjust
addi a4, a4, -127 # − bias (127)
```
---
#### 7) Product Normalization ([1.0, 2.0) Range)
Depending on the highest bit of the product:
* If `bit15=1` → range ≈ **[2.0, 4.0)** → shift right by **8**, increment exponent by **1**.
* Otherwise → range ≈ **[1.0, 2.0)** → shift right by **7**.
```
mul_product_normalize:
li t6, 0x8000
and s7, s2, t6
beqz s7, mul_product_small
# [2.0,4.0) → >>8, exp+1
srli s2, s2, 8
andi s2, s2, 0x7F
addi a4, a4, 1
j mul_check_overflow
# [1.0,2.0) → >>7
mul_product_small:
srli s2, s2, 7
andi s2, s2, 0x7F
```
---
#### 8) Overflow and Underflow Handling
* **Overflow** (`exp >= 255`) → ±Inf
* **Normal** (`exp >= 1`) → pack and return
* **Underflow** (`exp <= 0`) → convert to subnormal by right-shifting `(1−exp)` bits; if too small (`exp < −6`), return ±0.
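In C (a fragment; note the comparison against 0xFF must be signed, since `result_exp` can be negative here):
```c
if (result_exp >= 0xFF)                              /* overflow → ±Inf */
    return (bf16_t){ .bits = (uint16_t)((result_sign << 15) | 0x7F80) };
if (result_exp < 1) {                                /* underflow → subnormal */
    if (result_exp < -6)                             /* too small even for that */
        return (bf16_t){ .bits = (uint16_t)(result_sign << 15) };  /* ±0 */
    result_mant >>= (1 - result_exp);                /* denormalize */
    result_exp = 0;
}
```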
```
mul_check_overflow:
li t6, 0xFF
bge a4, t6, mul_ret_inf # overflow → ±Inf (signed compare: exp may be negative)
li t6, 1
bge a4, t6, mul_pack_result # normal → pack
# underflow
mul_handle_underflow:
li s7, -6
blt a4, s7, mul_ret_signed_zero # too small → 0
li t6, 1
sub s7, t6, a4 # shift = 1 - exp
srl s2, s2, s7 # mant >>= shift
li a4, 0 # exp = 0
```
---
#### 9) Result Packing
Reassemble the 16-bit result: `(sign<<15) | (exp<<7) | mant7`
```
mul_pack_result:
slli t0, s1, 15 # sign
andi t1, a4, 0xFF # exp
slli t1, t1, 7
andi t2, s2, 0x7F # mant
or a0, t0, t1
or a0, a0, t2
ret
```
---
#### Special Return Paths (Clean Exit Points)
```
mul_ret_inf:
slli t0, s1, 15
li t1, 0x7F80 # ±Inf
or a0, t0, t1
ret
mul_ret_nan:
li a0, 0x7FC0 # quiet NaN (canonical)
ret
mul_ret_a:
ret # a0 already holds a.bits
mul_ret_b:
mv a0, a1 # return b.bits
ret
mul_ret_signed_zero:
slli a0, s1, 15 # ±0
ret
```
---
### `bf16_div()` Function Overview
This routine implements **division** of two **bfloat16** numbers. Since bfloat16 lacks hardware floating-point support, the routine simulates floating-point division entirely with **integer bit operations**. The overall logic is divided into **5 stages**:
---
#### 1) Field Extraction
Parse the 16-bit inputs into three fields:
* **sign** (bit 15): sign bit
* **exp** (bits [14:7]): exponent with **127** bias
* **mant** (bits [6:0]): mantissa (normalized numbers will later add the hidden 1)
Use shifts and masks (>>, &) to obtain each field, and compute the result sign.
```
# Extract fields
srli t0, a0, 15
andi t0, t0, 1 # sign_a
srli t1, a1, 15
andi t1, t1, 1 # sign_b
srli t2, a0, 7
andi t2, t2, 0xFF # exp_a
srli t3, a1, 7
andi t3, t3, 0xFF # exp_b
andi t4, a0, 0x7F # mant_a
andi t5, a1, 0x7F # mant_b
# result_sign = sign_a ^ sign_b
xor s7, t0, t1
```
---
#### 2) Special Values (NaN / Inf / 0)
Handle easy cases first to avoid entering the expensive long-division path.
* **b is NaN / Inf**:
* `exp_b == 0xFF` and `mant_b != 0` → return **b** (NaN)
* `exp_b == 0xFF` and `mant_b == 0` (b = Inf):
* if a is NaN → return **a**
* if a is not Inf → result is **0** (a/Inf = 0)
* if a = Inf → **NaN** (Inf/Inf)
* **b is 0**:
* if a is also 0 → **NaN** (0/0)
* otherwise result is **Inf** (with the computed sign)
* **a is NaN / Inf**:
* `exp_a == 0xFF` and `mant_a != 0` → return **a** (NaN)
* `exp_a == 0xFF` and `mant_a == 0` (a = Inf, b not special and nonzero) → **Inf**
* **a is 0**: directly return **±0** (with the computed sign)
```
li t6, 0xFF
# Check b first
bne t3, t6, div_b_not_special
bnez t5, div_ret_b # b is NaN
bne t2, t6, div_ret_zero # a is not Inf → 0
bnez t4, div_ret_a # a is NaN
j div_ret_nan # Inf/Inf = NaN
div_b_not_special:
or s8, t3, t5
beqz s8, div_b_zero # b==0 → a/0
# Then check a
bne t2, t6, div_a_not_special
bnez t4, div_ret_a # a is NaN
j div_ret_inf # Inf / b = Inf
div_a_not_special:
or s8, t2, t4
beqz s8, div_ret_zero # a==0 → 0
```
---
#### 3) Mantissa Normalization (add hidden bit / handle subnormals)
* Normal numbers (`exp != 0`): **add the hidden 1** (`mant |= 0x80`)
* Subnormals (`exp == 0`): this version uses a simplified model, setting **exp = 1** and **not adding** the hidden bit for the division
```
# A
beqz t2, div_norm_a
ori t4, t4, 0x80
j div_norm_a_done
div_norm_a:
li t2, 1
div_norm_a_done:
# B
beqz t3, div_norm_b
ori t5, t5, 0x80
j div_norm_b_done
div_norm_b:
li t3, 1
div_norm_b_done:
```
---
#### 4) Long Division (restoring division): mant_a / mant_b
Implement a 16-step restoring division using shift-subtract to obtain **quotient** and **remainder**:
* `dividend = mant_a << 15`
* stepwise compare `dividend` with `divisor << shift`; subtract when possible and set the quotient bit
* after 16 iterations, get a **16-bit quotient** in `a4`
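A C sketch of the loop (the quotient is built MSB-first, so the shift count must run from 15 down to 0):
```c
uint32_t dividend = (uint32_t)mant_a << 15;  /* 1.15 fixed-point numerator */
uint32_t divisor  = mant_b;
uint32_t quotient = 0;
for (int i = 15; i >= 0; i--) {
    quotient <<= 1;
    if (dividend >= (divisor << i)) {        /* trial subtraction fits */
        dividend -= divisor << i;
        quotient |= 1;                       /* set this quotient bit */
    }
}
```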
```
slli s8, t4, 15 # dividend
mv s9, t5 # divisor
li a4, 0 # quotient
li a5, 16 # 16 iterations
div_loop:
slli a4, a4, 1
addi t6, a5, -1 # shift = i, counting 15 down to 0
sll a7, s9, t6 # divisor << shift
bgeu s8, a7, div_can_sub
j div_next
div_can_sub:
sub s8, s8, a7 # dividend -= (divisor<<shift)
ori a4, a4, 1 # quotient |= 1
div_next:
addi a5, a5, -1
bnez a5, div_loop
```
Compute the **result exponent**: `exp = (exp_a - exp_b) + bias` (bias = 127)
```
sub a5, t2, t3 # exp_a - exp_b
addi a5, a5, 127 # + bias
```
---
#### 5) Normalize Quotient, Check Over/Underflow, Pack Result
##### 5.1 Normalize the Quotient
* If the quotient MSB (bit15) is 1 → range ≈ **[1.0, 2.0)**, shift right **8** to get a 7-bit mantissa
* Otherwise (value < 1.0) → **left-shift until MSB=1**, decrement exponent by **1** per shift, then shift right 8 to get the mantissa
```
li t6, 0x8000
and s8, a4, t6
beqz s8, div_quo_small
# quotient >= 0x8000 → [1,2]
srli a4, a4, 8
andi a4, a4, 0x7F
j div_check_overflow
div_quo_small:
# Left-shift until MSB=1 (only shift if exp>1 to avoid excessive underflow)
div_norm_quo:
li t6, 0x8000
and s8, a4, t6
bnez s8, div_norm_quo_done
li t6, 1
ble a5, t6, div_norm_quo_done
slli a4, a4, 1
addi a5, a5, -1
j div_norm_quo
div_norm_quo_done:
srli a4, a4, 8
andi a4, a4, 0x7F
```
##### 5.2 Overflow / Underflow
* **Overflow** (`exp >= 255`): return **±Inf**
* **Underflow** (`exp <= 0`): this version directly returns **±0** (simplified; a stricter implementation would output **subnormals** with rounding)
```
li t6, 0xFF
bge a5, t6, div_ret_inf # overflow → Inf (signed compare: exp may be negative)
blez a5, div_ret_zero # underflow → 0
```
##### 5.3 Pack the Result
Fill sign, exponent, and mantissa back into the bfloat16 layout:
```
slli t0, s7, 15 # sign
andi t1, a5, 0xFF # exp
slli t1, t1, 7
andi t2, a4, 0x7F # mant
or a0, t0, t1
or a0, a0, t2
ret
```
---
#### Special Return Paths (clean fast-exits)
```
div_b_zero:
or s8, t2, t4
beqz s8, div_ret_nan # 0/0 = NaN
j div_ret_inf # a/0 = Inf
div_ret_inf:
slli a0, s7, 15
li t0, 0x7F80
or a0, a0, t0
ret
div_ret_nan:
li a0, 0x7FC0 # canonical qNaN
ret
div_ret_zero:
slli a0, s7, 15 # ±0
ret
div_ret_a:
ret # a0 already holds a
div_ret_b:
mv a0, a1 # return b
ret
```
---
Due to time constraints, the square-root (sqrt) routine could not be completed, so the testing section omits sqrt from the comparisons for now.


The cycle count after omitting the sqrt part is 1517.
This part of the code can be found on my [GitHub](https://github.com/Stanley0915/ca2025-quizzes/blob/413676c647bbe4f5977e54b87f6b2b096532d2c3/problemC.s).
---
# Leetcode 283 Move Zeroes
Given an integer array nums, move all 0's to the end of it while maintaining the relative order of the non-zero elements.
Note that you must do this in-place without making a copy of the array.
Example 1:
Input: nums = [0,1,0,3,12]
Output: [1,3,12,0,0]
Example 2:
Input: nums = [0]
Output: [0]
### C program
```c
void moveZeroes(int* nums, int numsSize) {
    int zerohead = 0;
    for (int i = 0; i < numsSize; i++) {
        if (nums[i] != 0) {
            if (i != zerohead) {
                nums[zerohead] = nums[i];  /* move non-zero forward */
                nums[i] = 0;               /* leave a zero behind */
            }
            zerohead++;
        }
    }
}
```
### Assembly code (RISC-V)
```
.data
test1: .word 0, 1, 0, 3, 12
test1_size: .word 5
test1_msg: .string "Test 1: [0, 1, 0, 3, 12] => "
test2: .word 0, 0, 0, 0
test2_size: .word 4
test2_msg: .string "Test 2: [0, 0, 0, 0] => "
test3: .word 5, 7, 2, 9, 1
test3_size: .word 5
test3_msg: .string "Test 3: [5, 7, 2, 9, 1] => "
newline: .string "\n"
space: .string " "
bracket_open: .string "["
bracket_close: .string "]"
.text
.globl main
main:
addi sp, sp, -4
sw ra, 0(sp)
# test 1
la a0, test1_msg
li a7, 4
ecall
la a0, test1
lw a1, test1_size
jal moveZeroes
la a0, test1
lw a1, test1_size
jal print_array
# test 2
la a0, test2_msg
li a7, 4
ecall
la a0, test2
lw a1, test2_size
jal moveZeroes
la a0, test2
lw a1, test2_size
jal print_array
# test 3
la a0, test3_msg
li a7, 4
ecall
la a0, test3
lw a1, test3_size
jal moveZeroes
la a0, test3
lw a1, test3_size
jal print_array
lw ra, 0(sp)
addi sp, sp, 4
li a7, 10
ecall
# moveZeroes function
# Input: a0 = pointer to the nums array, a1 = numsSize
# Registers used:
# t0 = zerohead
# t1 = i
# t2 = nums[i]
# t4 = base address of the nums array
moveZeroes:
li t0, 0
li t1, 0
mv t4, a0
loop:
bge t1, a1, end
slli t3, t1, 2 # t3 = i * 4 (int is 4 bytes)
add t3, t4, t3 # t3 = &nums[i]
lw t2, 0(t3) # t2 = nums[i]
beqz t2, continue
beq t1, t0, skip_swap
slli t5, t0, 2
add t5, t4, t5
sw t2, 0(t5)
sw zero, 0(t3)
skip_swap:
addi t0, t0, 1
continue:
addi t1, t1, 1
j loop
end:
ret
print_array:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
sw s1, 4(sp)
sw s2, 0(sp)
mv s0, a0 # s0 = array pointer
mv s1, a1 # s1 = array size
li s2, 0 # s2 = index
# print "["
la a0, bracket_open
li a7, 4
ecall
print_loop:
bge s2, s1, print_end
slli t0, s2, 2
add t0, s0, t0
lw a0, 0(t0)
li a7, 1
ecall
# print a space if this is not the last element
addi t0, s1, -1
bge s2, t0, skip_space
la a0, space
li a7, 4
ecall
skip_space:
addi s2, s2, 1
j print_loop
print_end:
# print "]"
la a0, bracket_close
li a7, 4
ecall
# print a newline
la a0, newline
li a7, 4
ecall
lw ra, 12(sp)
lw s0, 8(sp)
lw s1, 4(sp)
lw s2, 0(sp)
addi sp, sp, 16
ret
```
Result:

---
# Ripes Simulator
Using the 5-stage pipelined processor in Ripes.
```
00000074 <bool_bf16_isnan>:
74: 00a975b3 and x11 x18 x10
78: 00b91a63 bne x18 x11 20 <label1>
7c: 00a9f633 and x12 x19 x10
80: 00061663 bne x12 x0 12 <label1>
84: 00000513 addi x10 x0 0
88: 00008067 jalr x0 x1 0
```
Take `74: 00a975b3 and x11 x18 x10` as an example through the five pipeline stages below.
---
**IF**

* The program counter (PC) is updated each cycle to fetch the next instruction.
* The current instruction memory address is 0x00000074, and the instruction read from that address is 0x00a975b3.
* Since RISC-V instructions are 4 bytes (32 bits) long, the next instruction address is 0x00000078, 4 bytes past the current one.
---
**ID**

* Decode the instruction 0x00a975b3 into binary: 0000000 01010 10010 111 01011 0110011
funct7 = 0000000
rs2 = 01010 (x10)
rs1 = 10010 (x18)
funct3 = 111 (AND)
rd = 01011 (x11)
opcode = 0110011 (OP)
The instruction 0x00a975b3 therefore decodes to `and x11, x18, x10`.
---
**EXE**
* In the EX (Execute) stage, the ALU performs a bitwise AND between the value in register x18 and the value in register x10. The result is passed down the pipeline to be written back into register x11.
---
**MEM**

* In the MEM (memory) stage, the processor checks whether the instruction reads from or writes to memory. The `and` instruction does neither.
---
**WB**

* In the WB (write-back) stage, the result is written back to register x11.