# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`Shih-Hsuan`](https://github.com/Shih-Hsuan) > ## AI Tools Usage I used Genimi for : * Conceptual Explanation and Code debugging * Code Annotation * Translation and Technical Writing (Polishing) ## Quiz1 Problem B :::info For clarity and to keep the document organized, the test code is available on [GitHub](https://github.com/Shih-Hsuan/ca2025-quizzes). ::: ### 1. Count Leading Zeros Counts the number of consecutive zeros in the binary representation of a 32-bit unsigned integer , starting from the most significant bit (MSB). #### 1.1 Assembly code * Arguments : * `a0` : The 32-bit unsigned integer x. * Returns : * `a0` : The number of leading zeros in x. * Registers used: * **t0 : Stores variable x.** * Holds the input value being analyzed. This value is progressively shifted down during the search process to narrow in on the most significant bit. * **t1 : Stores variable n.** * A counter that is initialized to 32 and is reduced by the step size `c` to track the position of the most significant bit. * **t2 : Stores variable c.** * The step size for the binary search, which is also the amount to shift `x` by. It is halved in each iteration (16, 8, 4, 2, 1). * **t3 : Stores variable y.** * A temporary variable that holds the result of the shift (`x >> c`) in order to check if any set bits ('1's) were found in the upper half of `x`. ```s= clz: addi t0, a0, 0 # t0 = x (we will operate on the value of x in t0) addi t1, x0, 32 # t1 = n = 32 addi t2, x0, 16 # t2 = c = 16 loop_start: srl t3, t0, t2 # t3 = y = x >> c beqz t3, skip_if # if y (t3) == 0, skip the if-block sub t1, t1, t2 # n = n - c addi t0, t3, 0 # x = y skip_if: srli t2, t2, 1 # c = c / 2 bnez t2, loop_start # if c (t2) != 0, continue the loop sub a0, t1, t0 # Store the result (n-x) into a0 for return jalr x0, ra, 0 # Return to the caller ``` #### 1.2 Test results * Rispes Console ![image](https://hackmd.io/_uploads/SygoRjX6gg.png =300x) ### 2. uf8_decode The **uf8** encoding scheme is a logarithmic compression method that **sacrifices numerical precision to represent a vast range of values** within just 8 bits, making it ideal for applications like sensor data and computer graphics where magnitude is more important than exactness. * Decoding formula $$ D(b) = \underbrace{m \cdot 2^e}_{\text{Values in Range}} + \underbrace{(2^e - 1) \cdot 16}_{\text{Offset}} $$ Where $e = \lfloor b/16 \rfloor$ and $m = b \bmod 16$ #### 2.1 Assembly code * Arguments : * `a0` : The 8-bit uf8 input value. * Returns : * `a0` : The decoded 32-bit unsigned integer. * Registers used : * `t0` : Stores the mantissa. * `t1` : Stores the exponent. * `t2` : Stores the offset. * `t3`, `t4` : Used for intermediate calculations. ```s= uf8_decode: andi t0, a0, 0xf # Eextract lower 4 bits ( mantissa ) srli t1, a0, 4 # Eextract upper 4 bits ( exponent ) # offset = (0x7FFF >> (15 - exponent)) << 4; addi t3, x0, 15 sub t3, t3, t1 # Calculate the right shift amount lui t4, 0x8 # 0x8 << 12 addi t4, t4, -1 # t4 = 0x7FFF srl t4, t4, t3 # t4 = (0x7FFF >> (15 - exponent)) slli t2, t4, 4 # offset = t4 << 4 # return (mantissa << exponent) + offset; sll t0, t0, t1 # t0 = mantissa << exponent add a0, t0, t2 # Store the final result in a0 for return jalr x0, ra, 0 # Return to the caller function ``` #### 2.2 Test results * Rispes Console ![image](https://hackmd.io/_uploads/SJ5rnWfpxg.png =300x) ### 3. uf8_encode * Encoding formula $$ E(v) = \begin{cases} v, & \text{if } v < 16 \\ 16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise} \end{cases} $$ Where $\text{offset}(e) = (2^e - 1) \cdot 16$ #### 3.1 Assembly code * Arguments : * `a0` : The 32-bit unsigned integer input value. * Returns : * `a0` : The encoded 8-bit uf8. * Registers used : * `s0` : Stores the input value. * `t0` : Stores the msb. * `t1` : Stores the exponent. * `t2` : Stores the overflow. ```s= uf8_encode: addi sp, sp, -8 # Allocate 8 bytes on the stack sw ra, 4(sp) # Save the return address (ra) because we will call a subroutine sw s0, 0(sp) # Save s0 (callee-saved, used here to store 'value') addi s0, a0, 0 # s0 = value (preserve the input argument) addi t0, x0, 16 blt s0, t0, return_trivial # if value < 16, jump to return # --- Find appropriate exponent using CLZ hint --- addi a0, s0, 0 # Prepare argument for clz jal ra, clz # Call clz, the result will be in a0 addi t0, x0, 31 sub t0, t0, a0 # t0 = msb addi t1, x0, 0 # t1 = exponent addi t2, x0, 0 # t2 = overflow addi t3, x0, 5 blt t0, t3, while_final_cond # if msb < 5, skip the guess block addi t1, t0, -4 # exponent = msb - 4 addi t3, x0, 15 ble t1, t3, exp_ok addi t1, x0, 15 # if (exponent > 15) exponent = 15 exp_ok: addi t3, x0, 0 # t3 = e (loop counter) for_loop_start: # calculate overflow bge t3, t1, while_adjust_cond slli t2, t2, 1 # overflow = overflow << 1 addi t2, t2, 16 # overflow = overflow + 16 addi t3, t3, 1 # e++ jal x0, for_loop_start while_adjust_cond: # Adjust if estimate was off blez t1, while_final_cond # if exponent <= 0, end bge s0, t2, while_final_cond # if value >= overflow, end addi t2, t2, -16 # overflow = overflow - 16 srli t2, t2, 1 # overflow = overflow >> 1 addi t1, t1, -1 # exponent-- jal x0, while_adjust_cond while_final_cond: addi t3, x0, 15 bge t1, t3, while_final_end slli t4, t2, 1 addi t4, t4, 16 # t4 = next_overflow blt s0, t4, while_final_end # if value < next_overflow, break addi t2, t4, 0 # overflow = next_overflow addi t1, t1, 1 # exponent++ jal x0, while_final_cond while_final_end: sub t0, s0, t2 # mantissa = (value - overflow) srl a0, t0, t1 # a0 = mantissa >> exponent slli t1, t1, 4 or a0, a0, t1 # a0 = (exponent << 4) | mantissa return_trivial: # return value lw s0, 0(sp) # Restore s0 lw ra, 4(sp) # Restore ra addi sp, sp, 8 # Deallocate stack space jalr x0, ra, 0 # Return to caller clz: addi t0, a0, 0 addi t1, x0, 32 addi t2, x0, 16 clz_loop_start: srl t3, t0, t2 beqz t3, clz_skip_if sub t1, t1, t2 addi t0, t3, 0 clz_skip_if: srli t2, t2, 1 bnez t2, clz_loop_start sub a0, t1, t0 # Store the result (n-x) into a0 for return jalr x0, ra, 0 # Return to the caller ``` #### 3.2 Test results * Rispes Console ![image](https://hackmd.io/_uploads/BkSdEBEpgg.png =300x) --- ## Quiz1 Problem C `bfloat16` Bit Layout ``` ┌─────────┬──────────────┬──────────────┐ │Sign (1) │ Exponent (8) │ Mantissa (7) │ └─────────┴──────────────┴──────────────┘ 15 14 7 6 0 S: Sign bit (0 = positive, 1 = negative) E: Exponent bits (8 bits, bias = 127) M: Mantissa/fraction bits (7 bits) ``` The value 𝑣 of a BFloat16 number is calculated as: $$ v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right) $$ where: * $S \in \{0, 1\}$ is the sign bit * $E \in [1, 254]$ is the biased exponent * $M \in [0, 127]$ is the mantissa value Special Cases * Zero : $E = 0, M = 0 \Rightarrow v = (-1)^S \times 0$ * Infinity : $E = 255, M = 0 \Rightarrow v = (-1)^S \times \infty$ * NaN : $E = 255, M \neq 0 \Rightarrow v = \text{NaN}$ * Denormals : Not supported (flush to zero) ### 1. `f32_to_bf16` : Convert float to bfloat16 `IEEE 754 floating point` Bit Layout ``` ┌─────────┬──────────────┬───────────────┐ │Sign (1) │ Exponent (8) │ Mantissa (23) │ └─────────┴──────────────┴───────────────┘ 31 30 23 22 0 S: Sign bit (0 = positive, 1 = negative) E: Exponent bits (8 bits, bias = 127) M: Mantissa/fraction bits (23 bits) ``` #### Round to Nearest, Ties to Even ##### Code Analysis First, let's look at this line of code again and break down its components: ```c f32bits += ((f32bits >> 16) & 1) + 0x7FFF; ``` This is equivalent to : `f32bits = f32bits + [Tie-Breaker Bit] + 0x7FFF` * **`0x7FFF`** : This is the main rounding value, which is slightly less than half (`0x8000`). * **`Tie-Breaker Bit`** : `((f32bits >> 16) & 1)`. This part is only used for handling the "exactly halfway" case to decide whether to round up or down. ##### Example 1: Rounding Up This occurs when the lower 16 bits to be discarded are **greater than** `0x8000` (halfway). * **Input `f32bits`**: `0x3F808001` * **Upper 16 bits ( part to keep )**: `0x3F80` * **Lower 16 bits ( part to discard )**: `0x8001` ( **\> 0x8000** ) **Calculation Steps:** 1. **Calculate Tie-Breaker Bit:** * `(0x3F808001 >> 16) & 1` * `0x00003F80 & 1` (The LSB of the upper 16 bits is 0) * The result is `0`. 2. **Calculate Total Value to Add:** * `0 (Tie-Breaker) + 0x7FFF = 0x7FFF` 3. **Perform Addition:** ``` 0x3F808001 + 0x00007FFF ----------------- 0x3F810000 ``` Because `0x8001 + 0x7FFF = 0x10000`, a **carry** is generated into the upper 16 bits. **Result**: The value of `f32bits` becomes `0x3F810000`. The subsequent `>> 16` operation will extract `0x3F81`, successfully rounding up. ##### Example 2: Rounding Down This occurs when the lower 16 bits to be discarded are **less than** `0x8000`. * **Input `f32bits`**: `0x40001234` * **Upper 16 bits ( part to keep )**: `0x4000` * **Lower 16 bits ( part to discard )**: `0x1234` ( **\< 0x8000** ) **Calculation Steps:** 1. **Calculate Tie-Breaker Bit:** * `(0x40001234 >> 16) & 1` * `0x00004000 & 1` (The LSB of the upper 16 bits is 0) * The result is `0`. 2. **Calculate Total Value to Add:** * `0 (Tie-Breaker) + 0x7FFF = 0x7FFF` 3. **Perform Addition:** ``` 0x40001234 + 0x00007FFF ----------------- 0x40009233 ``` Because `0x1234 + 0x7FFF = 0x9233`, this result does **not** exceed `0xFFFF`, so **no carry is generated**. **Result**: The value of `f32bits` becomes `0x40009233`. The subsequent `>> 16` operation will extract `0x4000`, successfully rounding down. ##### Example 3: The "Ties to Even" Tie-Breaker This occurs when the lower 16 bits to be discarded are **exactly equal to** `0x8000`. The Tie-Breaker Bit comes into play, and the rule is to round the result to an even number. **Sub-example 3a: The part to keep is odd, requiring it to round up to become even** * **Input `f32bits`**: `0x40018000` * **Upper 16 bits ( part to keep )**: `0x4001` (The last bit is 1, so it is **odd**) * **Lower 16 bits ( part to discard )**: `0x8000` (exactly halfway) **Calculation Steps:** 1. **Calculate Tie-Breaker Bit:** * `(0x40018000 >> 16) & 1` * `0x00004001 & 1` ( The LSB of the upper 16 bits is 1 ) * The result is `1`. 2. **Calculate Total Value to Add:** * `1 (Tie-Breaker) + 0x7FFF = 0x8000` 3. **Perform Addition:** ``` 0x40018000 + 0x00008000 ----------------- 0x40020000 ``` `0x8000 + 0x8000 = 0x10000`, which generates a **carry**. **Result**: `f32bits` becomes `0x40020000`. The subsequent extraction yields `0x4002` (**even**), successfully rounding the odd number up to an even one. **Sub-example 3b: The part to keep is even, so it rounds down (truncates)** * **Input `f32bits`**: `0x40008000` * **Upper 16 bits ( part to keep )**: `0x4000` (The last bit is 0, so it is **even**) * **Lower 16 bits ( part to discard )**: `0x8000` (exactly halfway) **Calculation Steps:** 1. **Calculate Tie-Breaker Bit:** * `(0x40008000 >> 16) & 1` * `0x00004000 & 1` (The LSB of the upper 16 bits is 0) * The result is `0`. 2. **Calculate Total Value to Add:** * `0 (Tie-Breaker) + 0x7FFF = 0x7FFF` 3. **Perform Addition:** ``` 0x40008000 + 0x00007FFF ----------------- 0x4000FFFF ``` `0x8000 + 0x7FFF = 0xFFFF`, so **no carry is generated**. **Result**: `f32bits` becomes `0x4000FFFF`. The subsequent extraction yields `0x4000` (**even**), preserving the original even number. #### 1.1 Assembly code * Argument * `a0` : a 32-bit float bit pattern * Returns * `a0` : a 16-bit bfloat16 bit pattern ```s= f32_to_bf16: # --- Check for Infinity or NaN --- srli t0, a0, 23 # t0 = a0 >> 23 (isolate exponent and sign) andi t0, t0, 0xFF # t0 = t0 & 0xFF (keep only the 8 exponent bits) addi t1, x0, 0xFF # t1 = 255 (0xFF), the pattern for Inf/NaN exponent bne t0, t1, normal_path # If exponent is not 0xFF,branch to rounding path # is_special_path (Inf/NaN): jal x0, final_shift normal_path: # --- Rounding for normal numbers --- lui t0, 0x8 addi t0, t0, -1 # t0 = 0x7FFF srli t1, a0, 16 # t1 = a0 >> 16 andi t1, t1, 1 # t1 = t1 & 1 (tie-breaker bit) add t0, t0, t1 # t0 = 0x7FFF + tie_breaker_bit add a0, a0, t0 # a0 = a0 + rounding_value final_shift: # --- Truncate to 16 bits --- srli a0, a0, 16 # a0 = a0 >> 16 jalr x0, ra, 0 ``` #### 1.2 Test results * Rispes Console ![image](https://hackmd.io/_uploads/Bky4up4Tle.png =300x) * Case 1: `0x3F808001` -> Rounds UP to `0x3F81` * Case 2: `0x7F800000` -> Special value (+Inf), truncates to `0x7F80` * Case 3: `0x40001234` -> Rounds DOWN to `0x4000` * Case 4: `0x40018000` -> Rounds UP to `0x4002` ### 2. `bf16_to_f32` : Convert bfloat16 to float #### 2.1 Assembly code * Argument * `a0` : a bf16_t value (bits are in the lower 16 bits of a0) * Returns : * `a0` : a 32-bit float bit pattern ```s= bf16_to_f32: slli a0, a0, 16 # a0 = a0 << 16 jalr x0, ra, 0 ``` #### 2.2 Test results * Rispes Console * Case 1: `0x4048` -> Normal value `3.125` * Case 2: `0x0000` -> `Zero` * Case 3: `0x7FC1` -> A `NaN` value ![image](https://hackmd.io/_uploads/Bk5q-8S6gx.png =300x) ### 3. bf16_add #### Handling Special Cases Floating-point operations must prioritize the handling of special values, such as **Infinity**, **NaN (Not a Number)**, and **Zero**. * **If either input is NaN**: The result is NaN. * **If one input is Infinity and the other is not**: The result is Infinity. * **If both inputs are Infinity**: * Same sign (`+Inf` + `+Inf`): The result is `+Inf`. * Opposite signs (`+Inf` + `-Inf`): The result is NaN (undefined). * **If either input is Zero**: The result is the other number. #### 3.1 Assembly code * Argument * `a0` : a bf16_t value ( `a.bit` ) * `a1` : a bf16_t value ( `b.bit` ) * Returns * `a0` : the result of `a.bit` + `b.bit` * Save registers * `s0` = sign_a * `s1` = exp_a * `s2` = mant_a * `s3` = sign_b * `s4` = exp_b * `s5` = mant_b * `s6` = result_sign * `s7` = result_exp * `s8` = result_mant ```s= bf16_add: # --- Prologue: Save registers --- addi sp, sp, -40 sw ra, 0(sp) sw s0, 4(sp) # s0 = sign_a sw s1, 8(sp) # s1 = exp_a sw s2, 12(sp) # s2 = mant_a sw s3, 16(sp) # s3 = sign_b sw s4, 20(sp) # s4 = exp_b sw s5, 24(sp) # s5 = mant_b sw s6, 28(sp) # s6 = result_sign sw s7, 32(sp) # s7 = result_exp sw s8, 36(sp) # s8 = result_mant # --- 1. Decode inputs --- srli s0, a0, 15 # s0 = sign_a srli t0, a0, 7 andi s1, t0, 0xFF # s1 = exp_a andi s2, a0, 0x7F # s2 = mant_a srli s3, a1, 15 # s3 = sign_b srli t0, a1, 7 andi s4, t0, 0xFF # s4 = exp_b andi s5, a1, 0x7F # s5 = mant_b # --- 2. Handle special cases --- addi t0, x0 0xFF bne s1, t0, check_b_special # if (exp_a != 0xFF) bnez s2, return_a # a is NaN, return a bne s4, t0, return_a # a is Inf, b is normal, return a bnez s5, return_b # a is Inf, b is NaN, return b beq s0, s3, return_a # a,b are Inf with same sign, return a lui a0, 0x8 addi a0, x0, -64 # a,b are Inf with different signs -> NaN jal x0, epilogue check_b_special: addi t0, x0, 0xFF bne s4, t0, check_zeros # if (exp_b != 0xFF) j return_b # b is Inf or NaN, return b check_zeros: or t0, s1, s2 bnez t0, check_b_zero jal x0, return_b # a is zero, return b check_b_zero: or t0, s4, s5 bnez t0, normalize jal x0, return_a # b is zero, return a # --- 3. Normalization (add implicit 1) --- normalize: bnez s1, add_implicit_a jal x0, check_implicit_b add_implicit_a: ori s2, s2, 0x80 check_implicit_b: bnez s4, add_implicit_b jal x0, align_exponents add_implicit_b: ori s5, s5, 0x80 # --- 4. Align Exponents --- align_exponents: sub t0, s1, s4 # t0 = exp_diff bgez t0, exp_a_ge_b addi s7, s4, 0 # result_exp = exp_b sub t0, x0, t0 addi t1, x0, 8 bge t1, t0, shift_a # if exp_diff < -8 ?? jal x0, return_b shift_a: srl s2, s2, t0 jal x0, arithmetic exp_a_ge_b: addi s7, s1, 0 # result_exp = exp_a beqz t0, arithmetic addi t1, x0, 8 bge t1, t0, shift_b # if exp_diff > 8 ?? jal x0, return_a shift_b: srl s5, s5, t0 # --- 5. Main Arithmetic --- arithmetic: beq s0, s3, addition_path jal x0, subtraction_path addition_path: addi s6, s0, 0 # result_sign = sign_a add s8, s2, s5 # result_mant = mant_a + mant_b andi t0, s8, 0x100 # check for carry-out beqz t0, encode_result srli s8, s8, 1 addi s7, s7, 1 addi t0, x0, 0xFF blt s7, t0, encode_result # Overflow -> Infinity lui t0, 0x8 addi t0, t0, -128 # t0 = 0x7F80 slli s6, s6, 15 or a0, s6, t0 jal x0, epilogue subtraction_path: blt s2, s5, b_minus_a add s6, s0, x0 # result_sign = sign_a sub s8, s2, s5 # result_mant = mant_a - mant_b jal x0, sub_normalize b_minus_a: add s6, s3, x0 # result_sign = sign_b sub s8, s5, s2 # result_mant = mant_b - mant_a sub_normalize: beqz s8, return_zero sub_norm_loop: andi t0, s8, 0x80 bnez t0, encode_result slli s8, s8, 1 addi s7, s7, -1 blez s7, return_zero jal x0, sub_norm_loop # --- 6. Encode Result --- encode_result: slli s6, s6, 15 # result_sign << 15 andi s7, s7, 0xFF # result_exp & 0xFF slli s7, s7, 7 andi s8, s8, 0x7F # remove implicit 1 or a0, s6, s7 or a0, a0, s8 jal x0, epilogue return_a: jal x0, epilogue return_b: addi a0, a1, 0 # a0 is the return value jal x0, epilogue return_zero: addi a0, x0, 0 jal x0, epilogue # --- Restore registers and return --- epilogue: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) lw s6, 28(sp) lw s7, 32(sp) lw s8, 36(sp) addi sp, sp, 40 jalr x0, ra, 0 ``` #### 3.2 Test results * Rispes Console * Case 1: 1.5 + 2.5 = `4.0` * Case 2: 5.0 + (-3.0) = `2.0` * Case 3: +Inf + 5.0 = `+Inf` ![image](https://hackmd.io/_uploads/ByvB-8S6el.png =300x) ### 4. bf16_sub * The `bf16_sub` function is implemented based on the mathematical identity that subtraction is equivalent to adding the negative of a number. In other words, the operation `a - b` is identical to `a + (-b)`. * In floating-point formats like `bfloat16` that follow the IEEE 754 standard, negating a number (changing `b` to `-b`) is achieved very efficiently by simply flipping its sign bit. #### 4.1 Assembly code ```s= lui t2, 0x8 xor a1, a1, t2 # convert sign bit jal ra, bf16_add ``` #### 4.2 Test results * Rispes Console * Case 1: 1.5 - 2.5 = `-1.0` * Case 2: 5.0 - (-3.0) = `8.0` * Case 3: +Inf - 5.0 = `+Inf` ![image](https://hackmd.io/_uploads/rkwFS8BTxx.png =300x) ### 5. bf16_mul The core principle of floating-point multiplication is : **multiply the mantissas and add the exponents**. #### Handling Special Cases Prioritize the handling of **NaN**, **Infinity**, and **Zero**. * **Any number multiplied by NaN** : The result is NaN. * **Infinity multiplied by 0** : The result is NaN (undefined). * **Infinity multiplied by any non-zero number** : The result is Infinity, with the sign determined by the `result_sign`. * **Any number multiplied by 0** : The result is 0, with the sign also determined by `result_sign` (resulting in `+0` or `-0`). #### Shift-and-Add Multiplication This RISC-V assembly code performs an 8-bit multiplication using a classic **shift-and-add** algorithm. It calculates the product of two numbers, a multiplicand (`mant_a` in `s2`) and a multiplier (`mant_b` in `s5`), storing the final result in `s8`. The process works as follows: 1. **Initialization**: A register `s8` is cleared to zero to act as an accumulator for the result. A counter `t0` is set to 8 for the loop. 2. **Iteration**: The code loops 8 times, once for each bit of the multiplier. 3. **Bit Check**: In each loop, it inspects the least significant bit of the multiplier (`s5`). 4. **Conditional Addition**: * If the LSB is **1**, the multiplicand (`s2`) is added to the accumulator (`s8`). * If the LSB is **0**, the addition is skipped. 5. **Shifting**: After the check, the multiplicand (`s2`) is shifted **left** by one bit (doubling its value to align with the next bit position), and the multiplier (`s5`) is shifted **right** by one bit (to expose the next bit for the next iteration). This loop repeats until all 8 bits of the multiplier have been examined, leaving the final product in register `s8`. #### 5.1 Assembly code * Argument * `a0` : a bf16_t value ( `a.bit` ) * `a1` : a bf16_t value ( `b.bit` ) * Returns * `a0` : the result of `a.bit` * `b.bit` * Save registers * `s0` = sign_a * `s1` = exp_a * `s2` = mant_a * `s3` = sign_b * `s4` = exp_b * `s5` = mant_b * `s6` = result_sign * `s7` = result_exp * `s8` = result_mant ```s= bf16_mul: addi sp, sp, -40 sw ra, 0(sp) sw s0, 4(sp) # s0 = sign_a sw s1, 8(sp) # s1 = exp_a sw s2, 12(sp) # s2 = mant_a sw s3, 16(sp) # s3 = sign_b sw s4, 20(sp) # s4 = exp_b sw s5, 24(sp) # s5 = mant_b sw s6, 28(sp) # s6 = result_sign sw s7, 32(sp) # s7 = result_exp sw s8, 36(sp) # s8 = result_mant # --- 1. Decode inputs & Calc Sign --- srli s0, a0, 15 # s0 = sign_a srli t0, a0, 7 andi s1, t0, 0xFF # s1 = exp_a andi s2, a0, 0x7F # s2 = mant_a srli s3, a1, 15 # s3 = sign_b srli t0, a1, 7 andi s4, t0, 0xFF # s4 = exp_b andi s5, a1, 0x7F # s5 = mant_b xor s6, s0, s3 # s6 = result_sign # --- 2. Handle special cases --- addi t0, x0, 0xFF bne s1, t0, check_b_special # a is NaN or Inf bnez s2, return_a # a is NaN # a is Inf. Check if b is zero or t0, s4, s5 bnez t0, a_inf_b_not_zero lui a0, 0x8 addi a0, x0, -64 jal x0, epilogue # Inf * 0 = NaN a_inf_b_not_zero: # The result depends on b's sign bit slli a0, s6, 15 lui t0, 0x8 addi t0, t0, -128 # t0 = Inf or a0, a0, t0 # a0 = +Inf or -Inf jal x0, epilogue check_b_special: addi t0, x0, 0xFF bne s4, t0, check_zeros # if ( b is not NaN or Inf ) # b is NaN or Inf bnez s5, return_b # b is NaN # b is Inf. Check if a is zero or t0, s1, s2 bnez t0, b_inf_a_not_zero lui a0, 0x8 addi a0, x0, -64 jal x0, epilogue # 0 * Inf = NaN b_inf_a_not_zero: # The result depends on a's sign bit slli a0, s6, 15 lui t0, 0x8 addi t0, t0, -128 # t0 = Inf or a0, a0, t0 # a0 = +Inf or -Inf jal x0, epilogue check_zeros: or t0, s1, s2 beqz t0, return_signed_zero # Any number multiplied by 0 or t0, s4, s5 bnez t0, handle_denormals return_signed_zero: slli a0, s6, 15 jal x0, epilogue # --- 3. Handle Denormals --- handle_denormals: addi s7, x0, 0 # exp_adjust = 0 bnez s1, a_is_normal denorm_loop_a: # a is denormal, normalize it andi t0, s2, 0x80 bnez t0, a_denorm_done slli s2, s2, 1 addi s7, s7, -1 jal x0, denorm_loop_a a_denorm_done: addi s1, x0, 1 a_is_normal: bnez s4, b_is_normal denorm_loop_b: # b is denormal, normalize it andi t0, s5, 0x80 bnez t0, b_denorm_done slli s5, s5, 1 addi s7, s7, -1 jal x0, denorm_loop_b b_denorm_done: addi s4, x0, 1 b_is_normal: # Add implicit 1 for normal numbers ori s2, s2, 0x80 ori s5, s5, 0x80 # --- 4. Core Arithmetic --- addi s8, x0, 0 # result_mant = 0 addi t0, x0, 8 # Loop 8 times for 8 bits mul_loop: andi t1, s5, 1 # Check the LSB of the multiplier (mant_b) beqz t1, skip_add # If LSB is 0, skip the addition add s8, s8, s2 # result += multiplicand (mant_a) skip_add: slli s2, s2, 1 # multiplicand <<= 1 srli s5, s5, 1 # multiplier >>= 1 addi t0, t0, -1 # Decrement loop counter bnez t0, mul_loop # Continue if counter is not zero add s7, s7, s1 add s7, s7, s4 addi s7, s7, -127 # result_exp # --- 5. Normalize Product --- lui t1, 0x8 and t0, s8, t1 beqz t0, norm_shift_7 srli s8, s8, 8 addi s7, s7, 1 jal x0, check_overflow norm_shift_7: srli s8, s8, 7 # --- 6. Check Overflow/Underflow --- check_overflow: andi s8, s8, 0x7F # Final mantissa part addi t0, x0, 0xFF blt s7, t0, check_underflow slli a0, s6, 15 lui t1, 0x8 addi t1, t1, -128 and a0, a0, t1 jal x0, epilogue # Overflow check_underflow: blt x0, s7, encode_result # Underflow logic addi t0, x0, -6 bge s7, t0, handle_underflow_shift jal x0, return_signed_zero # Too small, flush to zero handle_underflow_shift: addi t0, x0, 1 sub t0, t0, s7 srl s8, s8, t0 addi s7, x0, 0 # --- 7. Encode Result --- encode_result: slli s6, s6, 15 # result_sign << 15 andi s7, s7, 0xFF # result_exp & 0xFF slli s7, s7, 7 andi s8, s8, 0x7F # remove implicit 1 or a0, s6, s7 or a0, a0, s8 jal x0, epilogue return_a: jal x0, epilogue return_b: addi a0, a1, 0 # a0 is the return value jal x0, epilogue # --- Restore registers and return --- epilogue: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) lw s6, 28(sp) lw s7, 32(sp) lw s8, 36(sp) addi sp, sp, 40 jalr x0, ra, 0 ``` #### 5.2 Test results 1. **Simple multiplication** : `2.0 * 3.0 = 6.0` * `2.0` (`0x4000`) \* `3.0` (`0x4040`) the result will be `6.0` (`0x40C0`)。 2. **Multiplication of negative numbers**: `-1.5 * 4.0 = -6.0` * `-1.5` (`0xBFC0`) \* `4.0` (`0x4080`) the result will be `-6.0` (`0xC0C0`)。 3. **Special case**: `+Infinity * -2.0 = -Infinity` * `+Inf` (`0x7F80`) \* `-2.0` (`0xC000`) the result will be `-Inf` (`0xFF80`)。 * Rispes Console ![image](https://hackmd.io/_uploads/HJbZ57L6xe.png =300x) ### 6. bf16_div The core principle of floating-point division is: **Divide the mantissas and subtract the exponents** #### Handling Special Cases Division has many special edge cases that must be handled first: * **Division by NaN** : `Any Number / NaN = NaN`. * **Division by Infinity** : * `Finite Number / Inf = 0`. * `Inf / Inf = NaN` (undefined). * **Division by Zero**: * `Non-zero Number / 0 = Infinity`. * `0 / 0 = NaN` (undefined). * **The Dividend is a Special Value**: * `Inf / Finite Number = Inf`. * `NaN / Finite Number = NaN`. * `0 / Non-zero Number = 0`. #### 6.1 Assembly code * Argument * `a0` : a bf16_t value ( `a.bit` ) * `a1` : a bf16_t value ( `b.bit` ) * Returns * `a0` : the result of `a.bit` / `b.bit` * Save registers * `s0` = sign_a * `s1` = exp_a * `s2` = mant_a * `s3` = sign_b * `s4` = exp_b * `s5` = mant_b * `s6` = result_sign * `s7` = result_exp * `s8` = result_mant * `s9` = result_mant ```s= bf16_div: addi sp, sp, -44 sw ra, 0(sp) sw s0, 4(sp) # s0 = sign_a sw s1, 8(sp) # s1 = exp_a sw s2, 12(sp) # s2 = mant_a sw s3, 16(sp) # s3 = sign_b sw s4, 20(sp) # s4 = exp_b sw s5, 24(sp) # s5 = mant_b sw s6, 28(sp) # s6 = result_sign sw s7, 32(sp) # s7 = result_exp sw s8, 36(sp) # s8 = result_mant sw s9, 40(sp) # s9 = div loop increment # --- 1. Decode inputs & Calc Sign --- srli s0, a0, 15 # s0 = sign_a srli t0, a0, 7 andi s1, t0, 0xFF # s1 = exp_a andi s2, a0, 0x7F # s2 = mant_a srli s3, a1, 15 # s3 = sign_b srli t0, a1, 7 andi s4, t0, 0xFF # s4 = exp_b andi s5, a1, 0x7F # s5 = mant_b xor s6, s0, s3 # s6 = result_sign # --- 2. Handle special cases --- addi t0, x0, 0xFF bne s4, t0, check_div_by_zero # if exp_b != Inf/NaN bnez s5, return_b # b is NaN, return b # b is Inf bne s1, t0, div_by_inf_ret_zero # a is not Inf beqz s2, return_nan # a is Inf -> Inf/Inf = NaN div_by_inf_ret_zero: slli a0, s6, 15 jal x0, epilogue # finite/Inf = 0 check_div_by_zero: or t0, s4, s5 bnez t0, check_a_special # b is not zero # b is Zero or t0, s1, s2 beqz t0, return_nan # 0/0 = NaN slli a0, s6, 15 lui t0, 0x8 addi t0, t0, -128 or a0, a0, t0 jal x0, epilogue # non-zero/0 = Inf check_a_special: addi t0, x0, 0xFF bne s1, t0, check_a_is_zero bnez s2, return_a # a is NaN slli a0, s6, 15 lui t0, 0x8 addi t0, t0, -128 or a0, a0, t0 jal x0, epilogue # Inf/finite = Inf check_a_is_zero: or t0, s1, s2 bnez t0, normalize_inputs slli a0, s6, 15 jal x0, epilogue # 0/finite = 0 # --- 3. Normalize Inputs --- normalize_inputs: beq s1, x0, check_b_is_normal ori s2, s2, 0x80 check_b_is_normal: beq s4, x0, do_division ori s5, s5, 0x80 # --- 4. Core Division Loop --- do_division: slli s2, s2, 15 # dividend = mant_a << 15 addi s8, x0, 0 # quotient = 0 addi s9, x0, 0 # i = 0 div_loop: addi t0, x0, 16 bge s9, t0, exp_calc # for i < 16 addi t0, x0, 15 sub t0, t0, s9 # 15 - i sll t1, s5, t0 # divisor << (15 - i) slli s8, s8, 1 # quotient <<= 1 blt s2, t1, loop_continue # if dividend < shifted_divisor sub s2, s2, t1 # dividend -= shifted_divisor ori s8, s8, 1 # quotient |= 1 loop_continue: addi s9, s9, 1 jal x0, div_loop # --- 5. Exponent Calculation --- exp_calc: sub s7, s1, s4 addi s7, s7, 127 # result_exp bne s1, x0, check_exp_b addi s7, s7, -1 check_exp_b: bne s4, x0, norm_q_loop addi s7, s7, 1 # --- 6. Normalize Quotient --- norm_q_loop: lui t0, 0x8 and t0, s8, t0 bnez t0, norm_shift_8 addi t0, x0, 1 bge t0, s7, norm_shift_8 # while (result_exp > 1) slli s8, s8, 1 addi s7, s7, -1 jal x0, norm_q_loop norm_shift_8: srli s8, s8, 8 andi s8, s8, 0x7F # quotient &= 0x7F # --- 7. Check Under/Overflow --- addi t0, x0, 0xFF blt s7, t0, check_underflow slli a0, s6, 15 lui t0, 0x8 addi t1, t0, -128 or a0, a0, t1 jal x0, epilogue # Overflow check_underflow: bgtz s7, encode_result slli a0, s6, 15 jal x0, epilogue # Underflow -> zero # --- 8. Encode Result --- encode_result: slli s6, s6, 15 # result_sign << 15 andi s7, s7, 0xFF # result_exp & 0xFF slli s7, s7, 7 andi s8, s8, 0x7F # remove implicit 1 or a0, s6, s7 or a0, a0, s8 jal x0, epilogue return_nan: lui a0, 0x8 addi a0, a0, -64 jal x0, epilogue return_a: jal x0, epilogue return_b: addi a0, a1, 0 # a0 is the return value jal x0, epilogue # --- Restore registers and return --- epilogue: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) lw s6, 28(sp) lw s7, 32(sp) lw s8, 36(sp) lw s9, 40(sp) addi sp, sp, 44 jalr x0, ra, 0 ``` #### 6.2 Test results 1. **Simple division** : `6.0 / 2.0 = 3.0` * `6.0` (`0x40C0`) / `2.0` (`0x4000`) the result will be `3.0` (`0x4040`)。 2. **Result \< 1.0** : `3.0 / 5.0 = 0.6` * `3.0` (`0x4040`) / `5.0` (`0x40A0`) the result will be `0.6`, bfloat16 approximately `0x3F19`。 3. **Special case** : `10.0 / 0.0 = +Infinity` * `10.0` (`0x4120`) / `0.0` (`0x0000`) the result will be `+Inf` (`0x7F80`)。 * Rispes Console ![image](https://hackmd.io/_uploads/rJ_ts4UTel.png =300x) ### 7. bf16_sqrt $$ \sqrt{a} = \sqrt{2^{e_a} \times m_a} = 2^{e_a/2} \times \sqrt{m_a} $$ This tells us that taking the square root of a floating-point number ( composed of an exponent $e_a$ and a mantissa $m_a$ ) can be broken down into two independent tasks: * **For the exponent part**: **Divide** the original exponent $e_a$ **by 2**. * **For the mantissa part**: Calculate the **square root** of the mantissa $m_a$. 1. Special Case Handling (IEEE-754 compliant): * $\sqrt{+0} = +0$ * $\sqrt{-0} = 0$ * $\sqrt{+\infty} = +\infty$ * $\sqrt{-\infty} = \text{NaN}$ ( 負數沒有實數平方根 ) * $\sqrt{\text{NaN}} = \text{NaN}$ * $x < 0$ 2. Exponent Processing: For even exponents (𝑒~𝑎~ is even): $$ e_r = \frac{e_a}{2}, \quad m' = m_a $$ For even exponents (𝑒~𝑎~ is odd): $$ e_r = \frac{e_a - 1}{2}, \quad m' = 2 \times m_a $$ * **proof $e_a$ is odd**: * Leveraging the mathematical property $2^{e_a} = 2^{e_a - 1} \times 2^1$. * Thus, the square root becomes $\sqrt{2^{e_a - 1} \times (2 \times m_a)}$. * The new exponent part becomes $(e_a - 1) / 2$, since $e_a - 1$ is now guaranteed to be an even number. * The trade-off is that the mantissa must be multiplied by 2, becoming $m' = 2 \times m_a$. This ensures the mantissa remains in the normalized range $[1, 2)$ or $[2,4)$ for processing. 3. Mantissa Square Root via Binary Search: The algorithm finds $𝑟$ such that $r^2 \approx m' \times 128$ using binary search. The scaling factor 128 represents the implicit 1.0 in the mantissa representation. ```c= /* * The target mantissa 'm' is scaled by 128 (where 128 represents 1.0). * If the original exponent was even, m is in the range [128, 256). * If the original exponent was odd, m was doubled and is in the range [256, 512). */ uint32_t low = 90; // A safe lower bound for the search (approx. sqrt(128*1.0)) uint32_t high = 256; // A safe upper bound for the search (approx. sqrt(512*1.0)) uint32_t result = 128; // Initialize with a default value (1.0) /* * Perform a binary search to find the integer square root of 'm'. * The goal is to find a 'result' such that (result/128)^2 is close to (m/128), * which simplifies to finding a 'result' where (result*result)/128 is close to 'm'. */ while (low <= high) { uint32_t mid = (low + high) >> 1; // Square the guess 'mid' and scale it back down for comparison with 'm'. // Squaring a value on the 128x scale results in a 128*128x scale. // Dividing by 128 brings it back to the same 128x scale as 'm'. uint32_t sq = (mid * mid) / 128; if (sq <= m) { // The guess is good, or the actual root is slightly larger. result = mid; // Store this as our best guess so far. low = mid + 1; // Try searching in the upper half. } else { // The guess was too high. high = mid - 1; // Search in the lower half. } } ``` 4. Normalization: * Ensure result mantissa is in range $[128, 256)$ * Adjust exponent if necessary * Extract 7-bit mantissa ```c= /* The 'result' from the binary search is an integer approximation of sqrt(m). This entire algorithm operates on a scaled integer representation where the floating-point value 1.0 corresponds to the integer 128. The following block normalizes this integer 'result' to ensure it falls within the standard significand range of [128, 256), which represents the floating-point range [1.0, 2.0). The exponent is adjusted simultaneously to maintain the correct overall value. */ if (result >= 256) { // The result is too large (>= 2.0), so we need to scale it down. result >>= 1; // Halve the significand to bring it into the [128, 256) range. new_exp++; // Double the exponent part to compensate for halving the significand. } else if (result < 128) { // The result is too small (< 1.0), so we need to scale it up. while (result < 128 && new_exp > 1) { result <<= 1; // Double the significand. new_exp--; // Halve the exponent part to compensate. } } /* Extract the final 7-bit mantissa. This is done by masking off the implicit leading '1', which now resides at the 8th bit position (value 128). */ uint16_t new_mant = result & 0x7F; ``` #### 7.1 Assembly code ```s= _mul32: add t0, x0, a0 add t1, x0, a0 add a0, x0, x0 addi t2, x0, 32 _mul32_loop: andi t3, t1, 1 beqz t3, _mul32_skip_add add a0, a0, t0 _mul32_skip_add: srli t1, t1, 1 slli t0, t0, 1 addi t2, t2, -1 bnez t2, _mul32_loop jalr x0, x1, 0 bf16_sqrt: addi sp, sp, -32 sw ra, 0(sp) sw s0, 4(sp) # s0 = sign_a sw s1, 8(sp) # s1 = exp_a sw s2, 12(sp) # s2 = mant_a sw s3, 16(sp) sw s4, 20(sp) sw s5, 24(sp) # low sw s6, 28(sp) # high # --- 1. Decode inputs & Special cases --- srli s0, a0, 15 # s0 = sign_a srli t0, a0, 7 andi s1, t0, 0xFF # s1 = exp_a andi s2, a0, 0x7F # s2 = mant_a addi t0, x0, 0xFF bne s1, t0, check_zero bnez s2, return_a bnez s0, return_nan jal x0, return_a check_zero: or t0, s1, s2 bnez t0, check_negative addi a0, x0, 0 jal x0, epilogue check_negative: bnez s0, return_nan beqz s1, return_zero # --- 2. Exponent & Mantissa Prep --- addi s3, s1, -127 # e ori s4, s2, 0x80 # m andi t0, s3, 1 beqz t0, exp_is_even slli s4, s4, 1 addi s3, s3, -1 exp_is_even: srai s3, s3, 1 addi s3, s3, 127 # --- 3. Binary Search --- addi s5, x0, 90 # low addi s6, x0, 256 # high addi a0, x0, 128 # result bs_loop: bgt s5, s6, bs_end # if( low > high ) add t0, s5, s6 srai t0, t0, 1 # Save registers to STACK before call addi sp, sp, -8 sw a0, 0(sp) # Store result to Stack sw t0, 4(sp) # Store mid to Stack # Call _mul32 add a0, x0, t0 # a0 = mid jal ra, _mul32 srli a0, a0, 7 # a0 = sq add s2, x0, a0 # s2 = sq # Restore registers from STACK after call lw t1, 0(sp) # t1 = result lw t0, 4(sp) # t0 = mid addi sp, sp, 8 add a0, x0, t1 # a0 = result # Compare and update bounds bgt s2, s4, bs_else # sq > m add a0, x0, t0 # result = mid addi s5, t0, 1 # low = mid + 1 jal x0, bs_loop bs_else: addi s6, t0, -1 # high = mid - 1 jal x0, bs_loop bs_end: # --- 4. Normalize Result --- add s2, x0, a0 # s2 = result addi t0, x0, 256 blt s2, t0, check_res_low srli s2, s2, 1 addi s3, s3, 1 jal x0, get_new_mant check_res_low: addi t0, x0, 128 bge s2, t0, get_new_mant norm_res_loop: addi t0, x0, 128 bge s2, t0, get_new_mant addi t0, x0, 1 ble s3, t0, get_new_mant slli s2, s2, 1 addi s3, s3, -1 jal x0, norm_res_loop get_new_mant: andi s2, s2, 0x7F # --- 5. Final Checks & Encode --- addi t0, x0, 0xFF blt s3, t0, check_final_underflow lui a0, 0x80 addi a0, a0, -128 jal x0, epilogue check_final_underflow: blez s3, return_zero andi s3, s3, 0xFF slli s3, s3, 7 or a0, s3, s2 jal x0, epilogue return_nan: lui a0, 0x8 addi a0, a0, -64 jal x0, epilogue return_zero: addi a0, x0, 0 jal x0, epilogue return_a: jal x0, epilogue epilogue: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) lw s6, 28(sp) addi sp, sp, 32 jalr x0, ra, 0 ``` #### 7.2 Test results 1. **Perfect Square**: `sqrt(4.0) = 2.0` * `4.0` (`0x4080`) -\> `2.0` (`0x4000`)。 2. **Odd exponent**: `sqrt(2.0) ≈ 1.414` * `2.0` (`0x4000`) -\> `~1.414` (bfloat16 約為 `0x3Fb5`)。 3. **Special case**: `sqrt(-4.0) = NaN` * `-4.0` (`0xC080`) -\> `NaN` (`0x7FC0`)。 * Rispes Console ![image](https://hackmd.io/_uploads/rJ1bqqLTel.png =300x) --- ## [Leetcode 342. Power of Four](https://leetcode.com/problems/power-of-four/description/) ### Introduction For this assignment, I have selected the topic "**Implement log2 with a branchless clz**". To demonstrate the practical application and performance benefits of this approach, I have implemented a solution for **LeetCode problem 342, "Power of Four"**. I think this problem is a good choice as its logic requires determining if the base-2 logarithm of a number is an even integer. So, we can compare the efficiency of a `log2` function optimized with Count Leading Zeros (clz) against a traditional, loop-based implementation. ### Optimizing `log2` Computations with `clz` The core principle behind this optimization is a fundamental mathematical relationship: for any positive integer `n`, the value of `floor(log₂(n))` is identical to the zero-based index of its Most Significant Bit (MSB) in its binary representation. * **Example 1**: * For `n = 15` (binary `1111`), `log₂(15) ≈ 3.9`, so `floor(log₂(15))` is **3**. * The MSB of `1111` is at index **3**. * **Example 2**: * For `n = 16` (binary `10000`), `log₂(16) = 4`. * The MSB of `10000` is at index **4**. This relationship transforms the problem of calculating `log₂(n)` into a new problem: **How to efficiently find the position of the MSB ?** #### Approach 1: The Traditional Loop-Based Method ( Unoptimized ) Without `clz`, the most intuitive way to compute `log₂(n)` is through a loop that repeatedly right-shifts the number until it is reduced to zero. The number of shifts performed is the result. ```c int log2_loop(uint32_t n) { if (n == 0) return -1; // log(0) is undefined int log = 0; while ((n >> 1) > 0) { n >>= 1; // n = n / 2 log++; // Increment the log result } return log; } ``` The primary drawback of this method is that its performance is directly tied to the magnitude of the input `n`. The number of loop iterations is equal to the position of the MSB. * If `n = 16` (MSB at index 4), the loop runs 4 times. * If `n = 1,000,000,000` (MSB at index 29), the loop must run **29 times**. This bit-by-bit check is effectively a **linear search**, which is inefficient for large numbers. #### Approach 2: The `clz`-Optimized Method The `clz` function provides a highly efficient shortcut to find the MSB's position. **1. The `clz` Algorithm Core: Binary Search** Instead of a bit-by-bit linear search, the `clz` algorithm employs a **binary search** on the bit width of the integer. It works by checking large blocks of bits at a time, not just one. * **Step 1**: With `c = 16`, it asks, "Is the MSB in the upper 16 bits?" If so, it immediately discards the lower 16 bits from consideration. * **Step 2**: With `c = 8`, it asks, "Is the MSB in the upper 8 bits of the remaining range?" This process continues, halving the search space in each step. This strategy ensures that the `do-while` loop in the `clz` function always executes a **fixed number of times** (5 iterations for c = 16, 8, 4, 2, 1). Consequently, its execution time is **constant (O(1))**, regardless of the input value `n`'s magnitude. **2. Converting the `clz` Result to `log₂`** Once `clz(n)` has rapidly determined the number of leading zeros, calculating the MSB's position is a single subtraction: `msb_position = 31 - clz(n)` And since we know `log₂(n) = msb_position`, the final optimized formula is: `log₂(n) = 31 - clz(n)` ### Implementation #### Problem description Given an integer `n`, return `true` if it is a power of four. Otherwise, return `false`. An integer `n` is a power of four, if there exists an integer `x` such that `n` == $4^{x}$. * Constraints: * -2^31^ <= `n` <= 2^31^ - 1 #### The Traditional Loop-Based Method in C ```c= #include <stdbool.h> #include <stdint.h> #include <stdio.h> // Calculates floor(log2(n)) using a traditional loop (linear search). int log2_loop(uint32_t n) { if (n == 0) return -1; // Log of 0 is undefined int log = 0; while ((n >> 1) > 0) { n >>= 1; log++; } return log; // The integer base-2 logarithm of n } // Checks if a number is a power of four. bool isPowerOfFour_with_log_loop(int n) { // A power of four must be positive. if (n <= 0) { return false; // n isn't a power of four } // Condition 1: It must be a power of two (only one bit is set). // This is a fast bitwise check to filter out most numbers. bool is_power_of_two = (n & (n - 1)) == 0; if (!is_power_of_two) { return false; // n isn't a power of four } // Condition 2: The base-2 logarithm must be an even number. int log_val = log2_loop((uint32_t)n); return (log_val % 2) == 0; } // Structure to hold a test case and its expected result typedef struct { int input; bool expected; } TestCase; int main() { printf("--- Testing isPowerOfFour_with_log_loop ---\n\n"); // Define 6 test cases covering different scenarios TestCase tests[] = { {16, true}, // Case 1: A standard power of four (4^2) {1, true}, // Case 2: The base case (4^0) {8, false}, // Case 3: A power of two that is NOT a power of four {0, false}, // Case 4: The zero edge case {-256, false}, // Case 5: A negative number {1073741824, true} // Case 6: A large power of four (4^15) }; int num_tests = sizeof(tests) / sizeof(tests[0]); // Loop through all test cases for (int i = 0; i < num_tests; i++) { int n = tests[i].input; bool expected_result = tests[i].expected; // Call the function to get the actual result bool actual_result = isPowerOfFour_with_log_loop(n); // Print results printf("Testing n = %d:\n", n); printf(" Expected: %s\n", expected_result ? "true" : "false"); printf(" Actual: %s\n", actual_result ? "true" : "false"); // Compare and print pass/fail status if (actual_result == expected_result) { printf(" Result: PASS\n\n"); } else { printf(" Result: FAIL\n\n"); } } return 0; } ``` #### The Traditional Loop-Based Method in RISC-V assembly language ```s= .data # Strings for the test harness output msg_testing: .string "Testing n = " msg_expected: .string "\n Expected: " msg_actual: .string "\n Actual: " msg_pass: .string "\n Result: PASS\n\n" msg_fail: .string "\n Result: FAIL\n\n" str_true: .string "true" str_false: .string "false" .text .globl main main: # --- Test Case 1: n = 16, Expected = true --- li a0, 16 # Correct: First, load the input value li a1, 1 # Correct: Second, load the expected result jal ra, run_test_case # Correct: Third, call the function # --- Test Case 2: n = 1, Expected = true --- li a0, 1 li a1, 1 jal ra, run_test_case # --- Test Case 3: n = 8, Expected = false --- li a0, 8 li a1, 0 # Expected result (0 for false) jal ra, run_test_case # --- Test Case 4: n = 0, Expected = false --- li a0, 0 li a1, 0 jal ra, run_test_case # --- Test Case 5: n = -256, Expected = false --- li a0, -256 li a1, 0 jal ra, run_test_case # --- Test Case 6: n = 1073741824 (4^15), Expected = true --- li a0, 1073741824 li a1, 1 jal ra, run_test_case # --- End of program --- li a7, 10 ecall # Run a single test case # a0: Input value for the test # a1: Expected result (1 for true, 0 for false) run_test_case: # Save context addi sp, sp, -12 sw ra, 8(sp) sw s0, 4(sp) # s0 will store the input value sw s1, 0(sp) # s1 will store the expected result mv s0, a0 mv s1, a1 # Print "Testing n = [input]" la a0, msg_testing li a7, 4 ecall mv a0, s0 li a7, 1 ecall # Print "Expected: [true/false]" la a0, msg_expected li a7, 4 ecall beqz s1, print_expected_false la a0, str_true j print_expected_done print_expected_false: la a0, str_false print_expected_done: li a7, 4 ecall # Print "Actual: " la a0, msg_actual li a7, 4 ecall # Call isPowerOfFour with the input value mv a0, s0 jal ra, isPowerOfFour mv s0, a0 # Save actual result in s0 # Print actual result string beqz s0, print_actual_false la a0, str_true j print_actual_done print_actual_false: la a0, str_false print_actual_done: li a7, 4 ecall # Compare actual (s0) with expected (s1) and print PASS/FAIL bne s0, s1, print_fail la a0, msg_pass j print_done print_fail: la a0, msg_fail print_done: li a7, 4 ecall # Restore context and return lw s1, 0(sp) lw s0, 4(sp) lw ra, 8(sp) addi sp, sp, 12 ret # isPowerOfFour(int n) -> bool # param a0: The integer to check. # return a0: 1 if n is a power of four, 0 otherwise. isPowerOfFour: # Save ra because this function calls another function (log2_loop) addi sp, sp, -4 sw ra, 0(sp) blez a0, return_false # A power of four must be positive # Condition 1: is_power_of_two = (n & (n - 1)) == 0; addi t0, a0, -1 and t1, a0, t0 bnez t1, return_false # If not a power of two, return false # Condition 2: log_val = log2_loop(n); # Argument n is already in a0 jal ra, log2_loop andi t0, a0, 1 # Check if log_val is even (LSB is 0) beqz t0, return_true # return (log_val % 2) == 0 return_false: addi a0, x0, 0 jal x0 isPowerOfFour_exit return_true: addi a0, x0, 1 isPowerOfFour_exit: lw ra, 0(sp) addi sp, sp, 4 jalr x0, x1, 0 # log2_loop(uint32_t n) -> int # param a0: The input number. # return a0: The integer base-2 logarithm of n. log2_loop: # This is a leaf function, no need to save ra. addi t0, x0, 0 # t0 = log = 0 log_loop_start: srli t1, a0, 1 # t1 = n >> 1 beqz t1, log_loop_end # Log of 0 is undefined srli a0, a0, 1 # n >>= 1 addi t0, t0, 1 # log++ jal x0 log_loop_start log_loop_end: add a0, x0, t0 # Move result to a0 for return jalr x0, x1, 0 ``` * Ripes Console ![image](https://hackmd.io/_uploads/rkegA_vpxx.png =150x) #### Using `clz` to optimize the `log` version in C ```c= #include <stdbool.h> #include <stdint.h> #include <stdio.h> // Counts the leading zeros of a 32-bit integer using binary search. static inline unsigned int clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } // Calculates floor(log2(n)) using the highly efficient clz function. int log2_clz(uint32_t n) { if (n == 0) return -1; // Log of 0 is undefined // The position of the most significant bit is 31 minus the number of leading zeros. return 31 - clz(n); } // Checks if a number is a power of four. bool isPowerOfFour_with_clz(int n) { // A power of four must be positive. if (n <= 0) { return false; } // Condition 1: It must be a power of two (only one bit is set). bool is_power_of_two = (n & (n - 1)) == 0; if (!is_power_of_two) { return false; } // Condition 2: The base-2 logarithm must be an even number. // Call the optimized, constant-time log2 function. int log_val = log2_clz((uint32_t)n); return (log_val % 2) == 0; } // Structure to hold a test case and its expected result typedef struct { int input; bool expected; } TestCase; int main() { printf("--- Testing isPowerOfFour_with_log_loop ---\n\n"); // Define 6 test cases covering different scenarios TestCase tests[] = { {16, true}, // Case 1: A standard power of four (4^2) {1, true}, // Case 2: The base case (4^0) {8, false}, // Case 3: A power of two that is NOT a power of four {0, false}, // Case 4: The zero edge case {-256, false}, // Case 5: A negative number {1073741824, true} // Case 6: A large power of four (4^15) }; int num_tests = sizeof(tests) / sizeof(tests[0]); // Loop through all test cases for (int i = 0; i < num_tests; i++) { int n = tests[i].input; bool expected_result = tests[i].expected; // Call the function to get the actual result bool actual_result = isPowerOfFour_with_log_loop(n); // Print results printf("Testing n = %d:\n", n); printf(" Expected: %s\n", expected_result ? "true" : "false"); printf(" Actual: %s\n", actual_result ? "true" : "false"); // Compare and print pass/fail status if (actual_result == expected_result) { printf(" Result: PASS\n\n"); } else { printf(" Result: FAIL\n\n"); } } return 0; } ``` #### Using `clz` to optimize the `log` version in RISC-V assembly language ```s= .data # Strings for the test harness output msg_title: .string "--- Testing isPowerOfFour_with_clz (Optimized Version) ---\n\n" msg_testing: .string "Testing n = " msg_expected: .string "\n Expected: " msg_actual: .string "\n Actual: " msg_pass: .string "\n Result: PASS\n\n" msg_fail: .string "\n Result: FAIL\n\n" str_true: .string "true" str_false: .string "false" .text .globl main main: # Print the title for this test run la a0, msg_title li a7, 4 ecall # --- Test Case 1: n = 16, Expected = true --- li a0, 16 li a1, 1 jal ra, run_test_case # --- Test Case 2: n = 1, Expected = true --- li a0, 1 li a1, 1 jal ra, run_test_case # --- Test Case 3: n = 8, Expected = false --- li a0, 8 li a1, 0 jal ra, run_test_case # --- Test Case 4: n = 0, Expected = false --- li a0, 0 li a1, 0 jal ra, run_test_case # --- Test Case 5: n = -256, Expected = false --- li a0, -256 li a1, 0 jal ra, run_test_case # --- Test Case 6: n = 1073741824, Expected = true --- li a0, 1073741824 li a1, 1 jal ra, run_test_case # --- End of program --- li a7, 10 ecall # Run a single test case # a0: Input value for the test # a1: Expected result (1 for true, 0 for false) run_test_case: addi sp, sp, -12 sw ra, 8(sp) sw s0, 4(sp) sw s1, 0(sp) mv s0, a0 mv s1, a1 la a0, msg_testing li a7, 4 ecall mv a0, s0 li a7, 1 ecall la a0, msg_expected li a7, 4 ecall beqz s1, print_expected_false la a0, str_true j print_expected_done print_expected_false: la a0, str_false print_expected_done: li a7, 4 ecall la a0, msg_actual li a7, 4 ecall mv a0, s0 jal ra, isPowerOfFour mv s0, a0 beqz s0, print_actual_false la a0, str_true j print_actual_done print_actual_false: la a0, str_false print_actual_done: li a7, 4 ecall bne s0, s1, print_fail la a0, msg_pass j print_done print_fail: la a0, msg_fail print_done: li a7, 4 ecall lw s1, 0(sp) lw s0, 4(sp) lw ra, 8(sp) addi sp, sp, 12 ret # isPowerOfFour(int n) -> bool # param a0: The integer to check. # return a0: 1 if n is a power of four, 0 otherwise. isPowerOfFour: addi sp, sp, -4 sw ra, 0(sp) blez a0, return_false addi t0, a0, -1 and t1, a0, t0 bnez t1, return_false jal ra, log2_clz # Optimize log operation with clz andi t0, a0, 1 beqz t0, return_true return_false: addi a0, x0, 0 jal x0, isPowerOfFour_exit return_true: addi a0, x0, 1 isPowerOfFour_exit: lw ra, 0(sp) addi sp, sp, 4 jalr x0, x1, 0 # log2_loop(uint32_t n) -> int # param a0: The input number. # return a0: The integer base-2 logarithm of n. log2_clz: addi sp, sp, -4 sw ra, 0(sp) beqz a0, return_minus_one # Check for n=0 jal ra, clz # Call clz, result is in a0 addi t0, x0, 31 sub a0, t0, a0 # return 31 - clz(n) jal x0, log2_clz_exit return_minus_one: li a0, -1 log2_clz_exit: lw ra, 0(sp) addi sp, sp, 4 jalr x0 x1 0 # clz(uint32_t x) -> unsigned int # param a0: The 32-bit unsigned integer x. # return a0: The number of leading zeros in x. clz: add t0, x0, a0 # t0 = x addi t1, x0, 32 # t1 = n addi t2, x0, 16 # t2 = c clz_loop_start: srl t3, t0, t2 # t3 = y = x >> c beqz t3, clz_skip_if sub t1, t1, t2 # n -= c add t0, x0, t3 # x = y clz_skip_if: srli t2, t2, 1 # c >>= 1 bnez t2, clz_loop_start sub a0, t1, t0 # return n - x jalr x0 x1 0 ``` * Ripes Console ![image](https://hackmd.io/_uploads/SyJ61cwpll.png =350x) ### Analysis * Both codes are excute in 5-stage processor * After Optimization with `clz` function : * Reduce CPU cycles by 5 % * Reduced number of instructions executed by 8 % * Although CPI has increased, I think it is because the algorithm is more complicated. | Loop-Based Method | Using `clz` to Optimize | |:---:| :---: | | ![image](https://hackmd.io/_uploads/Sy2nIvPTeg.png =250x) | ![image](https://hackmd.io/_uploads/S1YruDDpge.png =250x) | ### 5-stage pipelined processor * There are six categories in the rv32i instruction set * the same category instruction will have similar datapath ![image](https://hackmd.io/_uploads/HkMW2MBheg.png =500x) ![image](https://hackmd.io/_uploads/BkeG2fH2xe.png =500x) * With the instruction set mentioned above, We can use them to build this five-stage pipeline processor. ![image](https://hackmd.io/_uploads/BJHErqwTeg.png =650x) * This is the classic computer 5-stage processor pipeline architecture. | Stage | Description | | -------- | -------- | | IF | Instruction Fetch | | ID | Instruction Decode | | EX | Execution | | MEM | Data Memory Access | | WB | Write Back | #### Tracing a Single Instruction Through the Datapath In our Power of Four solution, this instruction is used to calculate `n - 1`. It takes the input number `n` from register `a0`, adds `-1` to it, and places the result of the operation into the destination register `t0`. The 32-bit machine code for this `addi` instruction is `0xFFF50293`. ##### 1. Instruction Fetch * Why is `0xFFF50293` ? `140: fff50293 addi t0, a0, -1` * Here is I-Type Instruction Format in RV32I ![image](https://hackmd.io/_uploads/SyAaJvBhgg.png =600x) * imm[11:0] : 1111_1111_1111 ( `-1` ) * rs1 : 01010 ( `a0`, `x10` ) * funct3 : 000 ( `add` ) * rd : 00101 ( `t0`, `x5` ) * opcode : 0010011 ( `OP-IMM` ) * result = `addi t0, a0, -1` ![image](https://hackmd.io/_uploads/SJA9SaPpee.png =300x) * During the Instruction Fetch stage, the processor performs two primary operations: * The Program Counter ( PC ) is incremented by 4 to calculate the address of the next sequential instruction. * The instruction is fetched from Instruction Memory using the current value of the PC as the address. ##### 2. Instruction Decode * In the Instruction Decode stage, the instruction `0xFFF50293`, fetched in the previous stage, is decoded by the control unit. * By examining the `opcode` and `funct3` fields, the control unit identifies the operation as `addi`. Simultaneously, the `rs1` field (which specifies the address of register 1) is sent to the Register File, which reads out the value of the corresponding register. In this specific case, the value is `0x0000_0010`, representing the number 16 from our "Power of Four" test. Being an I-type instruction, the `rs2` field is ignored. * The immediate field (`FFF`) is sign-extended to its 32-bit value, which is `0xFFFF_FFFF` (representing -1). Finally, these decoded components—the register value, the sign-extended immediate, and various control signals—are passed to the next stage for execution. * A key observation here is that the `RegWrite` control signal is asserted, indicating that a previous instruction is currently in the Write-Back (WB) stage. However, this does not create a data hazard. Because our register file is designed to be **read on the clock's rising edge and written on the falling edge**, the read for the current instruction and the write for the previous instruction happen at different points in the clock cycle, preventing any conflict. ![image](https://hackmd.io/_uploads/HybHy0vTlg.png =300x) ##### 3. Execution * In the Execution stage, the Arithmetic Logic Unit (ALU) performs the core computation. The ALU receives its two inputs: the first operand (`OP1`) is sourced from the register file value (`Reg1`), and the second operand (`OP2`) comes from the sign-extended immediate (`IMM`). As specified by the ALUOp control signal, which is derived from the instruction's opcode, the ALU is configured to perform an addition. The resulting output value is `0x0000_000F`, which is the correct outcome of `0x0000_0010 + 0xFFFF_FFFF`. ![image](https://hackmd.io/_uploads/B1sivCv6le.png =300x) ##### 4. Data Memory Access * In the Data Memory Access stage, an arithmetic instruction like `addi` performs no actual operation. Since it is neither a `load` nor a `store`, the data memory unit remains idle and no memory read or write signals are asserted. The primary function of this stage is simply to act as a pass-through, forwarding the ALU result (`0x0000_000F`) and the destination register's address from the previous stage to the Write-Back stage. ![image](https://hackmd.io/_uploads/BkzkcCPagg.png =250x) ##### 5. Write Back * In the final Write-Back stage, the result from the ALU is written to the Register File. As shown in the diagram, the `RegWrite` control signal is asserted (set to 1), which enables the value `0x0000_000F` to be correctly stored into the destination register, `x5`. ![image](https://hackmd.io/_uploads/SJUej0Paxl.png =550x) * We can see that, the address of the destination register is identified in the Instruction Decode stage and is then propagated through the pipeline registers from one stage to the next, ultimately being used in the final Write-Back stage. ![image](https://hackmd.io/_uploads/Byrzi0wale.png =550x) * Final, the instruction result `0x0000_000F` store in register `x5` ![image](https://hackmd.io/_uploads/ryz7iCPpex.png =250x) --- ## Reference * [Assignment 1: RISC-V Assembly and Instruction Pipeline](https://hackmd.io/@sysprog/2025-arch-homework1) * [The RISC-V Instruction Set Manual: Volume I: Unprivileged ISA](https://riscv.org/specifications/ratified/) * [RISC-V Instruction Set Specifications](https://msyksphinz-self.github.io/riscv-isadoc/html/index.html) * [RISC-V Pseudo Instruction](https://www.cnblogs.com/sureZ-learning/p/18402878) * [How to use environment call in Ripes](https://youtu.be/rlB8aeXDpc0?si=qQUMJMcTHUsK0EEL&t=181) * [Find whether a given number is a power of 4 or not](https://www.geeksforgeeks.org/dsa/find-whether-a-given-number-is-a-power-of-4-or-not/)