## AI Collaboration (Transparency)

1. Used AI for translation and note-taking.
2. Used AI for debugging suggestions on parts of the code.
3. Used AI to improve the formatting and visual presentation of the notes.

# UF8 Encoding/Decoding Tester Program Walkthrough (Key Sections)

### Important Registers (Internal Conventions)

* s1: current UF8 value being tested (0..255)
* s2: previous_value — the previously decoded value (initialized to -1)
* s3 = 15, s4 = 0x7FFF, s5 = 256 — frequently used constants
* s6: pass flag — indicates whether all tests passed (1 = all passed, 0 = any failure)
* s7 = &LC0, s8 = &LC1 — pointers to the two error message format strings
* s0: temporary register for the decoded value
* s9: temporary register for the current UF8 byte

### Decode

```
e = uf8 >> 4
m = uf8 & 0xF
base(e) = ((0x7FFF >> (15 - e)) << 4)
value = base(e) + (m << e)
```

* Implemented using **bit shifts and addition** (no multiplication or division).
* Since the test program decodes right at the beginning, **inlining/merging the decode into the test loop** removes an extra call/jump step.

```
# -------------------------
# Main loop (inline decoding)
# -------------------------
decoding:
    andi s9, s1, 0xff      # s9 = current UF8 byte (force 0..255)
    srli a4, s9, 4         # a4 = exponent = uf8 >> 4
    sub  s0, s3, a4        # s0 = 15 - exponent
    sra  s0, s4, s0        # s0 = (0x7FFF >> (15 - exponent))
    andi a5, s9, 15        # a5 = mantissa = uf8 & 0xF
    slli s0, s0, 4         # s0 = ((0x7FFF >> (15 - e)) << 4) -- decode "offset" part
    sll  a5, a5, a4        # a5 = (mantissa << exponent)
    add  s0, s0, a5        # s0 = decoded value (value)
```

### Encode

Since encoding always requires CLZ, we implement it in the same routine to avoid an extra branch/call. First, find the candidate exponent E:

```
    mv   t1, a0            # x
    li   t2, 32            # n = 32
    li   t3, 16            # c = 16
1:
    srl  t4, t1, t3        # y = x >> c
    beqz t4, 2f
    sub  t2, t2, t3        # n -= c
    mv   t1, t4            # x = y
2:
    srai t3, t3, 1         # c >>= 1
    bnez t3, 1b
    sub  t2, t2, t1        # n -= x
    li   t5, 31
```

$$
E = (31 - \mathrm{clz}(\mathrm{value})) - 4
$$

This step roughly means: **take the position of the most significant bit (MSB) and subtract 4**, while ensuring that the result is **clamped within the range [0, 15]**.

```
    sub  t5, t5, t2        # t5 = 31 - clz(value)
    addi t5, t5, -4        # t5 = E
    bltz t5, 3f
    li   t6, 15
    bleu t5, t6, 4f
    mv   t5, t6            # clamp max 15
    j    4f
3:
    li   t5, 0
```

Compared to the instructor's approach, we encode directly via the inverse mapping. We use a CLZ routine to estimate the candidate exponent E, then compute the segment start:

$$
base(E) = \left( (\texttt{0x7FFF} \gg (15 - E)) \ll 4 \right)
$$

Calculate the mantissa:

$$
M = \min\left(15,\ \frac{\textit{value}-\textit{base}(E)}{2^{E}}\right)
$$

Combine both parts to form the final 8-bit UF8 value:

$$
uf8 = (E \ll 4) \mid M
$$

As noted in the previous section, debugging revealed that for certain values near the segment boundaries, the exponent `E` obtained directly from the CLZ function can be incorrect.

**Example:** when `value = 241`

1. **Find CLZ and estimate E**
   241 in binary: `0b...11110001`
   The highest bit is at bit 7 → `clz(241) = 24`
   Estimated `E = (31 − 24) − 4 = 3`
2. **Compute the segment start base(E)**
   base(3) = ((0x7FFF >> (15−3)) << 4) = ((0x7FFF >> 12) << 4) = 7 << 4 = **112**
   Next segment: base(4) = ((0x7FFF >> (15−4)) << 4) = ((0x7FFF >> 11) << 4) = 15 << 4 = **240**

Since 241 ≥ base(4) = 240, the correct segment for `241` is actually **E = 4**, one higher than the CLZ estimate.
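To make the boundary behavior easy to check, the segment layout can be reproduced with a small standalone C sketch (not part of the submitted assembly; `uf8_base` is an illustrative helper name):

```c
#include <stdio.h>

/* Segment start for exponent e, mirroring the assembly:
 * base(e) = ((0x7FFF >> (15 - e)) << 4)                 */
static unsigned uf8_base(int e) { return (0x7FFFu >> (15 - e)) << 4; }

int main(void) {
    for (int e = 0; e <= 15; e++)          /* print every segment boundary */
        printf("base(%2d) = %u\n", e, uf8_base(e));

    /* Boundary case from the text: CLZ estimates E = 3 for value = 241,
     * but 241 >= base(4) = 240, so the exponent must be bumped to 4.    */
    unsigned value = 241;
    int E = 3;                             /* CLZ estimate                */
    if (E < 15 && value >= uf8_base(E + 1))
        E++;                               /* boundary fix-up             */
    printf("value = %u -> E = %d\n", value, E);
    return 0;
}
```

Running it prints base(3) = 112 and base(4) = 240, confirming that 241 belongs to segment E = 4.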
To fix this, an additional check was introduced to ensure that **base(E) ≤ value < base(E+1)**, so that the exponent `E` is adjusted correctly in boundary cases.

```
4:
    # ---- base(E) = ((0x7FFF >> (15-E)) << 4) ----
    # a1 = base(E)
    li   a1, 15
    sub  a1, a1, t5        # a1 = 15 - E
    li   a2, 32768
    addi a2, a2, -1        # a2 = 0x7FFF
    sra  a2, a2, a1
    slli a2, a2, 4         # a2 = base(E)

    # ---- Segment alignment adjustment: if value < base(E) → E-- (then recompute base) ----
    bltu a0, a2, 5f
    j    6f
5:
    beqz t5, 6f            # If E == 0, do not decrement
    addi t5, t5, -1        # E--
    li   a1, 15
    sub  a1, a1, t5
    li   a2, 32768
    addi a2, a2, -1
    sra  a2, a2, a1
    slli a2, a2, 4         # a2 = base(E) (recomputed)
6:
    # ---- If value >= base(E+1) → E++ (then recompute base(E)) ----
    # Compute base(E+1) into a3
    li   a3, 15
    addi t1, t5, 1         # t1 = E + 1
    bgtu t1, a3, 7f        # If E == 15, there is no next segment
    sub  a3, a3, t1        # a3 = 15 - (E + 1)
    li   t0, 32768
    addi t0, t0, -1        # t0 = 0x7FFF
    sra  t0, t0, a3
    slli t0, t0, 4         # t0 = base(E+1)
    bltu a0, t0, 7f        # if (value < base(E+1)) keep E
    addi t5, t5, 1         # E++
    mv   a2, t0            # a2 = base(E) = base(E+1)
7:
    # ---- M = ((value - base(E)) >> E), clamped to 0..15 ----
    sub  t2, a0, a2        # value - base(E) (guaranteed non-negative)
    srl  t2, t2, t5        # M
    li   t3, 15
    bleu t2, t3, 8f
    mv   t2, t3
8:
    slli t5, t5, 4         # (E << 4)
    or   a0, t5, t2        # uf8 = (E << 4) | M
    andi a0, a0, 0xFF
    ret
```

### Cycle Count

For comparison, the cycle count of the assembly code generated by [**Compiler Explorer**](https://godbolt.org/) is shown below:

![image](https://hackmd.io/_uploads/BkuqALvpeg.png)

The optimized results are shown below:

![image](https://hackmd.io/_uploads/rJcipIwple.png)

The table-based approach reduces the cycle count by almost 34%.

This part of the code can be found on my [github](https://github.com/Stanley0915/ca2025-quizzes/blob/413676c647bbe4f5977e54b87f6b2b096532d2c3/prombleB.s)
[compiler part of problem B](https://github.com/Stanley0915/ca2025-quizzes/blob/2cecf56ff70cdd4f3806fc4e469bd7172ff39da9/compiler_problemB.s)

---

# Problem C

### Test whether a bf16 bit-pattern is a special value (Zero, Infinity, NaN)

**Zero:**
$$
E = 0,\ M = 0 \Rightarrow v = (-1)^S \times 0
$$

**Infinity:**
$$
E = 255,\ M = 0 \Rightarrow v = (-1)^S \times \infty
$$

**NaN:**
$$
E = 255,\ M \neq 0 \Rightarrow v = \text{NaN}
$$

To keep the frequently needed masks at hand, they are loaded into `s` registers; the original contents of those registers are first spilled to the stack.

```
    addi sp, sp, -48
    sw   ra, 44(sp)
    sw   s0, 40(sp)
    sw   s1, 36(sp)
    sw   s2, 32(sp)
    sw   s3, 28(sp)
    sw   s4, 24(sp)
    sw   s5, 20(sp)
    sw   s6, 16(sp)
    sw   s7, 12(sp)
    sw   s8, 8(sp)
    sw   s9, 4(sp)
    li   s1, 0x8000        #define BF16_SIGN_MASK 0x8000U
    li   s2, 0x7F80        #define BF16_EXP_MASK  0x7F80U
    li   s3, 0x007F        #define BF16_MANT_MASK 0x007FU
    li   s4, 127           #define BF16_EXP_BIAS  127
    li   s5, 0x7FC0        #define BF16_NAN()  ((bf16_t){ .bits = 0x7FC0 })
    li   s6, 0x0000        #define BF16_ZERO() ((bf16_t){ .bits = 0x0000 })

bool_bf16_isnan:
    and  a1, s2, a0        # a1 = bits & BF16_EXP_MASK
    bne  s2, a1, label1    # exp != 0xFF → not NaN
    and  a2, s3, a0        # a2 = bits & BF16_MANT_MASK
    beq  a2, x0, label1    # mant == 0 → Inf, not NaN
    li   a0, 1
    ret
label1:
    li   a0, 0
    ret

bool_bf16_isinf:
    and  a1, s2, a0
    bne  s2, a1, label2    # exp != 0xFF → not Inf
    and  a2, s3, a0
    bne  a2, x0, label2    # mant != 0 → NaN, not Inf
    li   a0, 1
    ret
label2:
    li   a0, 0
    ret

bf16_iszero:
    li   t1, 0x7FFF        # everything except the sign bit
    and  a1, a0, t1        # a1 = bits & 0x7FFF
    beq  a1, x0, label3
    li   a0, 0
    ret
label3:
    li   a0, 1
    ret
```
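The intended behavior of the three predicates is compact enough to state in C as a cross-check (a sketch mirroring the masks above; the helper names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define BF16_SIGN_MASK 0x8000u
#define BF16_EXP_MASK  0x7F80u
#define BF16_MANT_MASK 0x007Fu

/* NaN: exponent all ones, mantissa non-zero */
static bool bf16_isnan(uint16_t bits) {
    return (bits & BF16_EXP_MASK) == BF16_EXP_MASK &&
           (bits & BF16_MANT_MASK) != 0;
}

/* Infinity: exponent all ones, mantissa zero */
static bool bf16_isinf(uint16_t bits) {
    return (bits & BF16_EXP_MASK) == BF16_EXP_MASK &&
           (bits & BF16_MANT_MASK) == 0;
}

/* Zero: everything except the sign bit is zero */
static bool bf16_iszero(uint16_t bits) {
    return (bits & 0x7FFFu) == 0;
}
```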
---

This `bf16_add()` function is the **core algorithm** for bfloat16 floating-point addition. It mimics the IEEE-754 addition procedure but implements it using only 16 bits (1 sign + 8 exponent + 7 mantissa).

---

### `bf16_add()` Function Overview

This routine implements addition of two bfloat16 numbers. Because bfloat16 lacks hardware floating-point support, the routine **simulates all floating-point steps using integer bit operations**. The overall logic is divided into **5 stages**:

---

#### Field Extraction

The routine first parses three primary fields from the 16-bit inputs:

* **sign** (bit 15): sign bit
* **exp** (bits [14:7]): exponent with bias 127
* **mantissa** (bits [6:0]): mantissa

It uses shifts and masks (`>>`, `&`) to obtain each field so they can be manipulated independently later.

```
bf16_add:
    srli t0, a0, 15        # uint16_t sign_a = (a.bits >> 15) & 1;
    andi t0, t0, 1
    srli t1, a1, 15        # uint16_t sign_b = (b.bits >> 15) & 1;
    andi t1, t1, 1
    srli t2, a0, 7         # int16_t exp_a = ((a.bits >> 7) & 0xFF);
    andi t2, t2, 0xFF
    srli t3, a1, 7         # int16_t exp_b = ((b.bits >> 7) & 0xFF);
    andi t3, t3, 0xFF
    andi t4, a0, 0x7F      # uint16_t mant_a = a.bits & 0x7F;
    andi t5, a1, 0x7F      # uint16_t mant_b = b.bits & 0x7F;
```

---

#### Handling Special Values

If the exponent is all ones (`0xFF`), the number is **NaN or Infinity**:

* If mantissa ≠ 0 → NaN (return `a` directly)
* If the other operand is Inf but with opposite sign → return NaN
* Otherwise return `a` or `b` as is (Inf + Inf of the same sign is still Inf)

It also handles zeros: if both exponent and mantissa are 0, directly return the other operand.

```
    li   t6, 0xFF
    bne  t2, t6, L_expA_not_FF   # exp_a != 0xFF → skip this block

    # exp_a == 0xFF
    bnez t4, L_ret_a             # if (mant_a) return a;

    # mant_a == 0
    bne  t3, t6, L_ret_a         # if (exp_b != 0xFF) return a;

    # exp_b == 0xFF
    bnez t5, L_ret_b             # if (mant_b) return b;  // b is NaN

    # both are Inf: return ((sign_a == sign_b) ? b : BF16_NAN());
    bne  t0, t1, L_ret_nan       # sign_a != sign_b → NaN
    j    L_ret_b                 # otherwise return b (Inf with same sign)

L_ret_a:
    # a0 already holds a.bits
    ret
L_ret_b:
    mv   a0, a1                  # return b.bits
    ret
L_ret_nan:
    li   a0, 0x7FC0              # BF16_NAN()
    ret
```

---

#### Aligning Mantissas

To add the two values on a common scale, the routine computes:

```c
exp_diff = exp_a - exp_b;
```

Then it right-shifts the mantissa of the operand with the **smaller** exponent by `|exp_diff|`. For example, for 1.0 (exp = 127) + 0.5 (exp = 126), the mantissa of 0.5 shifts right by 1 to align. If the difference exceeds 8 (beyond bfloat16 precision), the routine returns the larger operand directly, effectively discarding the tiny contribution.

```
    sub  a3, t2, t3                      # a3 = exp_diff
    bge  x0, a3, skip_exp_diffbigger0    # if (0 >= a3) → skip this block
    mv   a4, t2                          # result_exp = exp_a
    li   a5, 8
    blt  a5, a3, exp_diffbigger8         # if (8 < a3) → return a
    srl  t5, t5, a3                      # mant_b >>= exp_diff
    j    L_done_block
exp_diffbigger8:
    ret
skip_exp_diffbigger0:
L_done_block:
    bge  a3, x0, skip_exp_difflittle0    # if (exp_diff >= 0) → skip this branch
    mv   a4, t3                          # result_exp = exp_b
    li   t6, -8
    blt  a3, t6, exp_difflittle_8        # if a3 < -8 → return b
    sub  t6, x0, a3                      # t6 = -exp_diff
    srl  t4, t4, t6                      # mant_a >>= t6 (variable shift via srl)
    j    L_done_block2
exp_difflittle_8:
    mv   a0, a1                          # return b.bits
    ret
skip_exp_difflittle0:
    mv   a4, t2                          # result_exp = exp_a;
L_done_block2:
```
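The same alignment step, condensed into C (a sketch; `bf16_align` is an illustrative helper name, and the fields are assumed to be already extracted):

```c
#include <stdint.h>

/* Returns 0 on success; returns 1 or 2 when the exponent gap exceeds 8,
 * meaning operand a (1) or b (2) dominates and should be returned as-is. */
static int bf16_align(int16_t exp_a, int16_t exp_b,
                      uint16_t *mant_a, uint16_t *mant_b,
                      int16_t *result_exp) {
    int16_t exp_diff = exp_a - exp_b;
    if (exp_diff > 0) {
        if (exp_diff > 8) return 1;    /* b is negligible -> return a */
        *result_exp = exp_a;
        *mant_b >>= exp_diff;          /* shift the smaller operand   */
    } else if (exp_diff < 0) {
        if (exp_diff < -8) return 2;   /* a is negligible -> return b */
        *result_exp = exp_b;
        *mant_a >>= -exp_diff;
    } else {
        *result_exp = exp_a;           /* exponents already equal     */
    }
    return 0;
}
```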
---

#### Perform Add/Sub and Normalize

Depending on whether the signs match, there are two paths:

* **Same-sign addition:** Add the mantissas and check for a **carry** (bit 8 = 1). If a carry occurs, shift the mantissa right by 1 and increment the exponent to keep the 1.x format. If the exponent overflows to 255 → return ∞.
* **Different-sign subtraction:** Subtract the smaller mantissa from the larger. If the result is 0 → return 0. Otherwise, enter a while-loop to **left-shift** the mantissa (normalization) until the top bit returns to bit 7, decrementing the exponent on each shift. If the exponent drops below 1, it is an underflow → return 0.

This corresponds to the floating-point "**align → add/sub → normalize**" sequence.

```
    bne  t0, t1, sign_a_notequ_sign_b
    mv   s1, t1                     # s1 = result_sign
    add  s2, t4, t5                 # s2 = result_mant
    # t2, t3 are free for reuse now
    andi t2, s2, 0x100              # carry into bit 8?
    beqz t2, not_equ_1
    srli s2, s2, 1
    addi a4, a4, 1                  # ++result_exp
    li   t3, 0xFF
    bgeu a4, t3, L_ret_inf          # result_exp >= 0xFF → ±Inf
not_equ_1:
    j    result

sign_a_notequ_sign_b:
    bgeu t4, t5, mant_a_ge_mant_b   # if (mant_a >= mant_b)
    mv   s1, t1                     # result_sign = sign_b
    sub  s2, t5, t4                 # result_mant = mant_b - mant_a
    j    mant_sub_done
mant_a_ge_mant_b:
    mv   s1, t0                     # result_sign = sign_a
    sub  s2, t4, t5                 # result_mant = mant_a - mant_b
mant_sub_done:
    beqz s2, L_ret_zero

while_loop:
    andi t2, s2, 0x80               # is bit 7 = 1?
    bnez t2, while_loopout          # already normalized → break
    slli s2, s2, 1                  # mant <<= 1
    addi a4, a4, -1                 # --result_exp
    addi t3, x0, 1
    blt  a4, t3, L_ret_zero         # result_exp < 1 → underflow → 0
    j    while_loop
while_loopout:
```

---

#### Repack the Result

Finally, repack the result into the bfloat16 format:

```
bits = (sign << 15) | ((exp & 0xFF) << 7) | (mant & 0x7F);
```

This yields a new bfloat16 value with correct sign, exponent, and mantissa fields.

```
result:
    slli t0, s1, 15                 # t0 = (result_sign << 15)
    andi t1, a4, 0xFF               # t1 = result_exp & 0xFF
    slli t1, t1, 7                  # t1 = (result_exp & 0xFF) << 7
    andi t2, s2, 0x7F               # t2 = result_mant & 0x7F
    or   a0, t0, t1                 # a0 = sign | exponent part
    or   a0, a0, t2                 # a0 |= mantissa
    li   s1, 0x8000                 # restore BF16_SIGN_MASK (s1 was reused above)
    li   s2, 0x7F80                 # restore BF16_EXP_MASK  (s2 was reused above)
    ret
```

---

### `bf16_sub()` Function Overview

For subtraction, compute it as addition of the negated subtrahend: flip the sign bit of the second operand with a mask, then fall through into the addition routine above.

```
bf16_sub:
    li   t0, 0x8000                 # BF16_SIGN_MASK
    xor  a1, a1, t0                 # b.bits ^= 0x8000 → flip sign bit
bf16_add:
    # ... (continues with the addition routine above)
```
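Putting the two paths together, the add/sub core reads as follows in C (a condensed sketch, assuming special values and zeros were already filtered out, the operands are already aligned, and the 8-bit significands already include the implicit leading 1; `bf16_add_core` is an illustrative name):

```c
#include <stdint.h>

static uint16_t bf16_add_core(uint16_t sign_a, uint16_t sign_b,
                              int16_t result_exp,
                              uint16_t mant_a, uint16_t mant_b) {
    uint16_t result_sign, result_mant;
    if (sign_a == sign_b) {                     /* same sign: add      */
        result_sign = sign_a;
        result_mant = mant_a + mant_b;
        if (result_mant & 0x100) {              /* carry into bit 8    */
            result_mant >>= 1;
            if (++result_exp >= 0xFF)           /* exponent overflow   */
                return (result_sign << 15) | 0x7F80;  /* +-Inf         */
        }
    } else {                                    /* different sign: sub */
        if (mant_a >= mant_b) {
            result_sign = sign_a;
            result_mant = mant_a - mant_b;
        } else {
            result_sign = sign_b;
            result_mant = mant_b - mant_a;
        }
        if (result_mant == 0) return 0;         /* exact cancellation  */
        while (!(result_mant & 0x80)) {         /* renormalize         */
            result_mant <<= 1;
            if (--result_exp < 1) return 0;     /* underflow -> 0      */
        }
    }
    return (result_sign << 15) | ((result_exp & 0xFF) << 7)
                               | (result_mant & 0x7F);
}
```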
---

### `bf16_mul()` Function Overview

This routine implements **bfloat16 multiplication** entirely with **integer bit operations**, since there is no hardware floating-point support. The overall logic is divided into **9 stages**:

---

#### 1) Field Extraction

Parse the 16-bit inputs into three fields and compute the result sign.

* **sign** (bit 15)
* **exponent** (bits [14:7], bias = **127**)
* **mantissa** (bits [6:0]; normalized numbers will later add the hidden "1" bit)

```
    srli t0, a0, 15        # sign_a
    andi t0, t0, 1
    srli t1, a1, 15        # sign_b
    andi t1, t1, 1
    srli t2, a0, 7         # exp_a
    andi t2, t2, 0xFF
    srli t3, a1, 7         # exp_b
    andi t3, t3, 0xFF
    andi t4, a0, 0x7F      # mant_a
    andi t5, a1, 0x7F      # mant_b
    xor  s1, t0, t1        # result_sign = sign_a ^ sign_b
```

---

#### 2) Special-Value Detection (NaN / Inf)

Handle NaN and Inf early to avoid unnecessary multiplications. Key logic:

* `exp == 0xFF && mant != 0` → **NaN** (return that operand)
* `Inf * 0` → **NaN**
* `Inf * non-zero` → **Inf**

```
    li   t6, 0xFF

    # a special?
    bne  t2, t6, mul_check_b_special
    bnez t4, mul_ret_a           # a is NaN
    bnez t3, mul_ret_inf         # b non-zero → ±Inf
    bnez t5, mul_ret_inf         # b subnormal non-zero → ±Inf
    j    mul_ret_nan             # Inf * 0 = NaN

    # b special?
mul_check_b_special:
    bne  t3, t6, mul_check_zero
    beqz t5, 1f                  # mant_b == 0 → b is Inf
    bnez t5, mul_ret_b           # b is NaN
1:
    bnez t2, mul_ret_inf         # a non-zero → ±Inf
    bnez t4, mul_ret_inf         # a subnormal non-zero → ±Inf
    j    mul_ret_nan             # 0 * Inf = NaN
```

---

#### 3) Zero Detection

If either operand is ±0 → directly return **signed zero**.

```
mul_check_zero:
    bnez t2, mul_check_b_zero
    bnez t4, mul_check_b_zero
    j    mul_ret_signed_zero     # a == 0
mul_check_b_zero:
    bnez t3, mul_normalize
    bnez t5, mul_normalize
    j    mul_ret_signed_zero     # b == 0
```

---

#### 4) Mantissa Normalization (Handling Subnormals + Hidden Bit)

* For **normalized numbers** (`exp != 0`): set the hidden bit `mant |= 0x80`.
* For **subnormal numbers** (`exp == 0`): left-shift until the MSB (bit 7) becomes 1, decrementing `exp_adjust` on each shift, then set `exp = 1`.

```
mul_normalize:
    li   t6, 0                   # exp_adjust = 0

    # A
    bnez t2, mul_norm_a_done
mul_norm_a_loop:
    andi s7, t4, 0x80
    bnez s7, mul_norm_a_subnormal
    slli t4, t4, 1
    addi t6, t6, -1              # exp_adjust--
    j    mul_norm_a_loop
mul_norm_a_subnormal:
    li   t2, 1
    j    mul_norm_b
mul_norm_a_done:
    ori  t4, t4, 0x80

    # B
mul_norm_b:
    bnez t3, mul_norm_b_done
mul_norm_b_loop:
    andi s7, t5, 0x80
    bnez s7, mul_norm_b_subnormal
    slli t5, t5, 1
    addi t6, t6, -1              # exp_adjust--
    j    mul_norm_b_loop
mul_norm_b_subnormal:
    li   t3, 1
    j    mul_mantissa
mul_norm_b_done:
    ori  t5, t5, 0x80
```

---

#### 5) Mantissa Multiplication (8-bit × 8-bit, Shift-Add)

Implements an 8×8 multiplication using the shift-add algorithm, producing a 16-bit product stored in `s2`.

```
mul_mantissa:
    li   s2, 0                   # result_mant = 0
    mv   s7, t4                  # multiplicand = mant_a
    mv   a5, t5                  # multiplier = mant_b
    li   a4, 8                   # loop count = 8
mul_shift_add_loop:
    andi a3, a5, 1
    beqz a3, mul_shift_only
    add  s2, s2, s7              # if LSB = 1, add multiplicand
mul_shift_only:
    slli s7, s7, 1
    srli a5, a5, 1
    addi a4, a4, -1
    bnez a4, mul_shift_add_loop
```
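The same shift-add multiply in C (a sketch; `mul8x8` is an illustrative name):

```c
#include <stdint.h>

/* 8x8 -> 16-bit shift-add multiply: the product of the two 8-bit
 * significands (hidden bit included) is accumulated bit by bit
 * instead of using a hardware multiply instruction.              */
static uint16_t mul8x8(uint8_t multiplicand, uint8_t multiplier) {
    uint16_t product = 0;
    uint16_t m = multiplicand;      /* shifted left each round     */
    for (int i = 0; i < 8; i++) {
        if (multiplier & 1)         /* LSB set -> add partial term */
            product += m;
        m <<= 1;
        multiplier >>= 1;
    }
    return product;
}
```

For example, 0x80 × 0x80 = 0x4000 (1.0 × 1.0) and 0xFF × 0xFF = 0xFE01, so the product always fits in 16 bits.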
---

#### 6) Exponent Calculation (Unbiased Form)

`result_exp = (exp_a + exp_b − 127) + exp_adjust`

```
mul_calc_exp:
    add  a4, t2, t3              # exp_a + exp_b
    add  a4, a4, t6              # + exp_adjust
    addi a4, a4, -127            # − bias (127)
```

---

#### 7) Product Normalization (into the [1.0, 2.0) Range)

Depending on the highest bit of the product:

* If `bit15 = 1` → range ≈ **[2.0, 4.0)** → shift right by **8**, increment the exponent by **1**.
* Otherwise → range ≈ **[1.0, 2.0)** → shift right by **7**.

```
mul_product_normalize:
    li   t6, 0x8000
    and  s7, s2, t6
    beqz s7, mul_product_small

    # [2.0, 4.0) → >>8, exp+1
    srli s2, s2, 8
    andi s2, s2, 0x7F
    addi a4, a4, 1
    j    mul_check_overflow

    # [1.0, 2.0) → >>7
mul_product_small:
    srli s2, s2, 7
    andi s2, s2, 0x7F
```

---

#### 8) Overflow and Underflow Handling

* **Overflow** (`exp >= 255`) → ±Inf
* **Normal** (`exp >= 1`) → pack and return
* **Underflow** (`exp <= 0`) → convert to a subnormal by right-shifting `(1 − exp)` bits; if too small (`exp < −6`), return ±0.

```
mul_check_overflow:
    li   t6, 0xFF
    bge  a4, t6, mul_ret_inf     # overflow → ±Inf (signed compare: exp may be negative)
    li   t6, 1
    bge  a4, t6, mul_pack_result # normal → pack

    # underflow
mul_handle_underflow:
    li   s7, -6
    blt  a4, s7, mul_ret_signed_zero  # too small → 0
    li   t6, 1
    sub  s7, t6, a4              # shift = 1 - exp
    srl  s2, s2, s7              # mant >>= shift
    li   a4, 0                   # exp = 0
```

---

#### 9) Result Packing

Reassemble the 16-bit result: `(sign << 15) | (exp << 7) | mant7`

```
mul_pack_result:
    slli t0, s1, 15              # sign
    andi t1, a4, 0xFF            # exp
    slli t1, t1, 7
    andi t2, s2, 0x7F            # mant
    or   a0, t0, t1
    or   a0, a0, t2
    ret
```

---

#### Special Return Paths (Clean Exit Points)

```
mul_ret_inf:
    slli t0, s1, 15
    li   t1, 0x7F80              # ±Inf
    or   a0, t0, t1
    ret
mul_ret_nan:
    li   a0, 0x7FC0              # quiet NaN (canonical)
    ret
mul_ret_a:
    ret                          # a0 already holds a.bits
mul_ret_b:
    mv   a0, a1                  # return b.bits
    ret
mul_ret_signed_zero:
    slli a0, s1, 15              # ±0
    ret
```
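Stages 6 through 9 condense into a few lines of C (a sketch; `mul_finish` is an illustrative helper; note the signed exponent comparisons, which is why the assembly above uses `bge` rather than `bgeu` for the overflow check):

```c
#include <stdint.h>

static uint16_t mul_finish(uint16_t sign, int exp_a, int exp_b,
                           int exp_adjust, uint16_t product) {
    int exp = exp_a + exp_b - 127 + exp_adjust;    /* stage 6 */
    uint16_t mant;
    if (product & 0x8000) {           /* stage 7: product in [2.0, 4.0) */
        mant = (product >> 8) & 0x7F;
        exp += 1;
    } else {                          /* product in [1.0, 2.0)          */
        mant = (product >> 7) & 0x7F;
    }
    if (exp >= 0xFF)                  /* stage 8: overflow -> +-Inf     */
        return (sign << 15) | 0x7F80;
    if (exp <= 0) {                   /* underflow -> subnormal or 0    */
        if (exp < -6) return sign << 15;
        mant >>= (1 - exp);
        exp = 0;
    }
    return (sign << 15) | ((exp & 0xFF) << 7) | mant;   /* stage 9 */
}
```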
* **b is NaN / Inf**:
  * `exp_b == 0xFF` and `mant_b != 0` → return **b** (NaN)
  * `exp_b == 0xFF` and `mant_b == 0` (b = Inf):
    * if a is NaN → return **a**
    * if a is not Inf → result is **0** (a/Inf = 0)
    * if a = Inf → **NaN** (Inf/Inf)
* **b is 0**:
  * if a is also 0 → **NaN** (0/0)
  * otherwise the result is **Inf** (with the computed sign)
* **a is NaN / Inf**:
  * `exp_a == 0xFF` and `mant_a != 0` → return **a** (NaN)
  * `exp_a == 0xFF` and `mant_a == 0` (a = Inf, b not special and non-zero) → **Inf**
* **a is 0**: directly return **±0** (with the computed sign)

```
    li   t6, 0xFF

    # Check b first
    bne  t3, t6, div_b_not_special
    bnez t5, div_ret_b           # b is NaN
    bne  t2, t6, div_ret_zero    # a is not Inf → 0
    bnez t4, div_ret_a           # a is NaN
    j    div_ret_nan             # Inf/Inf = NaN

div_b_not_special:
    or   s8, t3, t5
    beqz s8, div_b_zero          # b == 0 → a/0

    # Then check a
    bne  t2, t6, div_a_not_special
    bnez t4, div_ret_a           # a is NaN
    j    div_ret_inf             # Inf / b = Inf

div_a_not_special:
    or   s8, t2, t4
    beqz s8, div_ret_zero        # a == 0 → 0
```

---

#### 3) Mantissa Normalization (add hidden bit / handle subnormals)

* Normal numbers (`exp != 0`): **add the hidden 1** (`mant |= 0x80`)
* Subnormals (`exp == 0`): this version uses a simplified model — **set exp = 1** and **do not add** the hidden bit for the division

```
    # A
    beqz t2, div_norm_a
    ori  t4, t4, 0x80
    j    div_norm_a_done
div_norm_a:
    li   t2, 1
div_norm_a_done:

    # B
    beqz t3, div_norm_b
    ori  t5, t5, 0x80
    j    div_norm_b_done
div_norm_b:
    li   t3, 1
div_norm_b_done:
```

---

#### 4) Long Division (Restoring Division): mant_a / mant_b

Implement a 16-step restoring division using shift-subtract to obtain the **quotient** and **remainder**:

* `dividend = mant_a << 15`
* at each step, compare `dividend` with `divisor << shift` (shift running from 15 down to 0); subtract when possible and set the quotient bit
* after 16 iterations, a **16-bit quotient** is left in `a4`

```
    slli s8, t4, 15        # dividend
    mv   s9, t5            # divisor
    li   a4, 0             # quotient
    li   a5, 16            # 16 iterations
div_loop:
    slli a4, a4, 1
    addi t6, a5, -1        # shift = 15, 14, ..., 0 (MSB first)
    sll  a7, s9, t6        # divisor << shift
    bgeu s8, a7, div_can_sub
    j    div_next
div_can_sub:
    sub  s8, s8, a7        # dividend -= (divisor << shift)
    ori  a4, a4, 1         # quotient |= 1
div_next:
    addi a5, a5, -1
    bnez a5, div_loop
```

Compute the **result exponent**: `exp = (exp_a - exp_b) + bias` (bias = 127)

```
    sub  a5, t2, t3        # exp_a - exp_b
    addi a5, a5, 127       # + bias
```
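The restoring-division loop above, condensed into C (a sketch; `div_restoring` is an illustrative name). The shift must run from 15 down to 0 so that the quotient bits are produced MSB first:

```c
#include <stdint.h>

/* quotient = (mant_a << 15) / mant_b, one bit per iteration */
static uint16_t div_restoring(uint16_t mant_a, uint16_t mant_b) {
    uint32_t dividend = (uint32_t)mant_a << 15;
    uint16_t quotient = 0;
    for (int shift = 15; shift >= 0; shift--) {
        quotient <<= 1;
        uint32_t term = (uint32_t)mant_b << shift;  /* divisor << shift */
        if (dividend >= term) {       /* subtract when possible          */
            dividend -= term;
            quotient |= 1;            /* set the quotient bit            */
        }
    }
    return quotient;
}
```

For normalized inputs (both significands carry the hidden bit, so they lie in [0x80, 0xFF]), the quotient's MSB lands at bit 14 or 15, which is exactly what the normalization step in stage 5.1 expects.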
---

#### 5) Normalize Quotient, Check Over/Underflow, Pack Result

##### 5.1 Normalize the Quotient

* If the quotient MSB (bit 15) is 1 → range ≈ **[1.0, 2.0)**, shift right by **8** to get a 7-bit mantissa
* Otherwise (value < 1.0) → **left-shift until the MSB = 1**, decrementing the exponent by **1** per shift, then shift right by 8 to get the mantissa

```
    li   t6, 0x8000
    and  s8, a4, t6
    beqz s8, div_quo_small

    # quotient >= 0x8000 → [1.0, 2.0)
    srli a4, a4, 8
    andi a4, a4, 0x7F
    j    div_check_overflow

div_quo_small:
    # Left-shift until MSB = 1 (only shift while exp > 1 to avoid excessive underflow)
div_norm_quo:
    li   t6, 0x8000
    and  s8, a4, t6
    bnez s8, div_norm_quo_done
    li   t6, 1
    ble  a5, t6, div_norm_quo_done
    slli a4, a4, 1
    addi a5, a5, -1
    j    div_norm_quo
div_norm_quo_done:
    srli a4, a4, 8
    andi a4, a4, 0x7F
```

##### 5.2 Overflow / Underflow

* **Overflow** (`exp >= 255`): return **±Inf**
* **Underflow** (`exp <= 0`): this version directly returns **±0** (simplified; a stricter implementation would output **subnormals** with rounding)

```
div_check_overflow:
    li   t6, 0xFF
    bge  a5, t6, div_ret_inf     # overflow → Inf (signed compare: exp may be negative)
    blez a5, div_ret_zero        # underflow → 0
```

##### 5.3 Pack the Result

Fill the sign, exponent, and mantissa back into the bfloat16 layout:

```
    slli t0, s7, 15        # sign
    andi t1, a5, 0xFF      # exp
    slli t1, t1, 7
    andi t2, a4, 0x7F      # mant
    or   a0, t0, t1
    or   a0, a0, t2
    ret
```

---

#### Special Return Paths (clean fast-exits)

```
div_b_zero:
    or   s8, t2, t4
    beqz s8, div_ret_nan         # 0/0 = NaN
    j    div_ret_inf             # a/0 = Inf

div_ret_inf:
    slli a0, s7, 15
    li   t0, 0x7F80
    or   a0, a0, t0
    ret
div_ret_nan:
    li   a0, 0x7FC0              # canonical qNaN
    ret
div_ret_zero:
    slli a0, s7, 15              # ±0
    ret
div_ret_a:
    ret                          # a0 already holds a
div_ret_b:
    mv   a0, a1                  # return b
    ret
```

---

Because of time constraints and limited personal capability, the square root (sqrt) part could not be completed. Therefore, in the testing section, the sqrt function is omitted from the comparisons for now.

![image](https://hackmd.io/_uploads/r1RYYdq6ee.png)
![image](https://hackmd.io/_uploads/BkAcKOcTel.png)

The cycle count after omitting the sqrt part is 1517.

This part of the code can be found on my [github](https://github.com/Stanley0915/ca2025-quizzes/blob/413676c647bbe4f5977e54b87f6b2b096532d2c3/problemC.s)

---

# LeetCode 283: Move Zeroes

Given an integer array `nums`, move all 0's to the end of it while maintaining the relative order of the non-zero elements. Note that you must do this in-place, without making a copy of the array.

Example 1:
Input: nums = [0,1,0,3,12]
Output: [1,3,12,0,0]

Example 2:
Input: nums = [0]
Output: [0]

### C program

```c
void moveZeroes(int *nums, int numsSize) {
    int zerohead = 0;                    /* next slot for a non-zero element */
    for (int i = 0; i < numsSize; i++) {
        if (nums[i] != 0) {
            if (i != zerohead) {         /* move only when positions differ  */
                nums[zerohead] = nums[i];
                nums[i] = 0;
            }
            zerohead++;
        }
    }
}
```

### Assembly code (RISC-V)

```
.data
test1:      .word 0, 1, 0, 3, 12
test1_size: .word 5
test1_msg:  .string "Test 1: [0, 1, 0, 3, 12] => "
test2:      .word 0, 0, 0, 0
test2_size: .word 4
test2_msg:  .string "Test 2: [0, 0, 0, 0] => "
test3:      .word 5, 7, 2, 9, 1
test3_size: .word 5
test3_msg:  .string "Test 3: [5, 7, 2, 9, 1] => "
newline:    .string "\n"
space:      .string " "
bracket_open:  .string "["
bracket_close: .string "]"

.text
.globl main
main:
    addi sp, sp, -4
    sw   ra, 0(sp)

    # test 1
    la   a0, test1_msg
    li   a7, 4
    ecall
    la   a0, test1
    lw   a1, test1_size
    jal  moveZeroes
    la   a0, test1
    lw   a1, test1_size
    jal  print_array

    # test 2
    la   a0, test2_msg
    li   a7, 4
    ecall
    la   a0, test2
    lw   a1, test2_size
    jal  moveZeroes
    la   a0, test2
    lw   a1, test2_size
    jal  print_array

    # test 3
    la   a0, test3_msg
    li   a7, 4
    ecall
    la   a0, test3
    lw   a1, test3_size
    jal  moveZeroes
    la   a0, test3
    lw   a1, test3_size
    jal  print_array

    lw   ra, 0(sp)
    addi sp, sp, 4
    li   a7, 10
    ecall

# moveZeroes function
# Input: a0 = pointer to nums, a1 = numsSize
# Registers used:
#   t0 = zerohead
#   t1 = i
#   t2 = nums[i]
#   t4 = base address of nums
moveZeroes:
    li   t0, 0
    li   t1, 0
    mv   t4, a0
loop:
    bge  t1, a1, end
    slli t3, t1, 2         # t3 = i * 4 (an int is 4 bytes)
    add  t3, t4, t3        # t3 = &nums[i]
    lw   t2, 0(t3)         # t2 = nums[i]
    beqz t2, continue
    beq  t1, t0, skip_swap
    slli t5, t0, 2
    add  t5, t4, t5
    sw   t2, 0(t5)
    sw   zero, 0(t3)
skip_swap:
    addi t0, t0, 1
continue:
    addi t1, t1, 1
    j    loop
end:
    ret

print_array:
    addi sp, sp, -16
    sw   ra, 12(sp)
    sw   s0, 8(sp)
    sw   s1, 4(sp)
    sw   s2, 0(sp)
    mv   s0, a0            # s0 = array pointer
    mv   s1, a1            # s1 = array size
    li   s2, 0             # s2 = index

    # print "["
    la   a0, bracket_open
    li   a7, 4
    ecall
print_loop:
    bge  s2, s1, print_end
    slli t0, s2, 2
    add  t0, s0, t0
    lw   a0, 0(t0)
    li   a7, 1
    ecall

    # if this is not the last element, print a space
    addi t0, s1, -1
    bge  s2, t0, skip_space
    la   a0, space
    li   a7, 4
    ecall
skip_space:
    addi s2, s2, 1
    j    print_loop
print_end:
    # print "]"
    la   a0, bracket_close
    li   a7, 4
    ecall

    # print newline
    la   a0, newline
    li   a7, 4
    ecall

    lw   ra, 12(sp)
    lw   s0, 8(sp)
    lw   s1, 4(sp)
    lw   s2, 0(sp)
    addi sp, sp, 16
    ret
```

Result:

![image](https://hackmd.io/_uploads/H1a4Gdjagx.png)
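Before porting to assembly, the C version can be spot-checked on the host with a minimal harness (illustrative; it reuses the two LeetCode examples):

```c
#include <assert.h>
#include <stdio.h>

void moveZeroes(int *nums, int numsSize);   /* the function above */

int main(void) {
    int t1[] = {0, 1, 0, 3, 12}, e1[] = {1, 3, 12, 0, 0};
    int t2[] = {0};
    moveZeroes(t1, 5);
    moveZeroes(t2, 1);
    for (int i = 0; i < 5; i++)
        assert(t1[i] == e1[i]);     /* expect [1, 3, 12, 0, 0] */
    assert(t2[0] == 0);             /* expect [0]              */
    puts("all moveZeroes tests passed");
    return 0;
}
```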
---

# Ripes

Simulating the program on the 5-stage pipelined processor in Ripes:

```
00000074 <bool_bf16_isnan>:
    74: 00a975b3    and x11 x18 x10
    78: 00b91a63    bne x18 x11 20 <label1>
    7c: 00a9f633    and x12 x19 x10
    80: 00060663    beq x12 x0 12 <label1>
    84: 00100513    addi x10 x0 1
    88: 00008067    jalr x0 x1 0
```

Take `74: 00a975b3  and x11 x18 x10` as the example and follow it through the five pipeline stages.

---

**IF**

![image](https://hackmd.io/_uploads/rJnwFmKagg.png)

* The program counter is updated every cycle so that a new instruction is fetched.
* The current instruction memory address is 0x00000074, and the instruction read from that address is 0x00a975b3.
* Since RISC-V instructions are 4 bytes (32 bits) long, the next instruction address is 0x00000078, 4 bytes after the current one.

---

**ID**

![image](https://hackmd.io/_uploads/Sk4O5mFagx.png)

* The instruction 0x00a975b3 is decoded; in binary it splits into `0000000 01010 10010 111 01011 0110011`:
  * funct7 = 0000000
  * rs2 = 01010 (x10)
  * rs1 = 10010 (x18)
  * funct3 = 111 (AND)
  * rd = 01011 (x11)
  * opcode = 0110011 (OP)
* The instruction 0x00a975b3 has therefore been decoded as `and x11, x18, x10`.

---

**EXE**

![image](https://hackmd.io/_uploads/rk3o5XYaxe.png)

* In the EX (execute) stage, the ALU performs a bitwise AND between the value in register x18 and the value in register x10. The result is then carried down the pipeline to be written back into register x11.

---

**MEM**

![image](https://hackmd.io/_uploads/BJkC9XKpxg.png)

* In the MEM (memory) stage, the processor checks whether the instruction reads from or writes to memory. The `and` instruction needs neither, so this stage does nothing for it.

---

**WB**

![image](https://hackmd.io/_uploads/S14ksmtTgx.png)

* In the WB (write-back) stage, the result is written back to register x11.
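To double-check the ID stage's field split by hand, the same decoding can be done with a few shifts and masks in C (a standalone sketch, not part of the project code):

```c
#include <stdint.h>
#include <stdio.h>

/* Split a 32-bit R-type instruction word into its fields,
 * mirroring what the decode stage does in hardware.       */
int main(void) {
    uint32_t inst   = 0x00A975B3;         /* and x11, x18, x10 */
    unsigned opcode = inst & 0x7F;        /* bits [6:0]   */
    unsigned rd     = (inst >> 7)  & 0x1F;/* bits [11:7]  */
    unsigned funct3 = (inst >> 12) & 0x7; /* bits [14:12] */
    unsigned rs1    = (inst >> 15) & 0x1F;/* bits [19:15] */
    unsigned rs2    = (inst >> 20) & 0x1F;/* bits [24:20] */
    unsigned funct7 = (inst >> 25) & 0x7F;/* bits [31:25] */
    printf("opcode=0x%02X rd=x%u funct3=0x%X rs1=x%u rs2=x%u funct7=0x%02X\n",
           opcode, rd, funct3, rs1, rs2, funct7);
    return 0;
}
```

Running it prints `opcode=0x33 rd=x11 funct3=0x7 rs1=x18 rs2=x10 funct7=0x00`, matching the field breakdown shown in the ID stage.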