# Assignment 1: RISC-V Assembly and Instruction Pipeline contributed by < `HeatCrab` > ## AI Development Process Documentation > Using [Claude](https://claude.ai/new) & [Gemini](https://gemini.google.com/app?hl=zh-TW) 1. Clarifying assignment requirements and offering directional guidance. 2. Assisting with the translation and polishing of documentation. 3. Generating automated tests for assembly code to quickly validate logic and catch errors. 4. Serving as a discussion partner for C code analysis and exploration. 5. Assisting with assembly code conversion, focusing on providing writing suggestions and discussing feasibility. 6. Helping to draft GitHub commit messages. 7. Brainstorming and providing ideas for use cases. ## Quiz1 - UF8 ### UF8 UF8 (unsigned float 8-bit) is a custom 8-bit floating-point format designed to represent a large range of unsigned integers with limited bits. It divides 8 bits into two parts: ```mermaid graph LR A[8-bit UF8] --> B[Exponent 4 bits] A --> C[Mantissa 4 bits] B --> D[Range: 0-15] C --> E[Range: 0-15] style A fill:#e1f5ff style B fill:#ffe1e1 style C fill:#e1ffe1 ``` The high 4 bits store the exponent, and the low 4 bits store the mantissa. UF8 has no sign bit, so it can only represent non-negative numbers. UF8 can represent values from 0 to 1,015,792, providing a compression ratio of 2.5:1. Error analysis is core to understanding UF8 encoding. Due to logarithmic quantization, the interval (step size) between adjacent representable values is $2^e$: ```mermaid graph TD A[UF8 Exponent Range & Step Size] --> B[e=0: 0-15, step 1] A --> C[e=1: 16-46, step 2] A --> D[e=2: 48-108, step 4] A --> E[e=3: 112-232, step 8] A --> F[...] 
A --> G[e=15: 524272-1015792, step 32768]
style A fill:#e1f5ff
style B fill:#e1ffe1
style G fill:#ffe1e1
```

Error characteristics:

- **Absolute error**: $\Delta_{\max} = 2^e - 1$; within the range of exponent $e$, the maximum rounding error approaches the step size
- **Relative error**: $\varepsilon_{\max} = 1/16 = 6.25\%$, because the mantissa has 4 bits (16 possible values)
- **Expected error**: $\mathbb{E}[\varepsilon] \approx 3\%$; assuming uniformly distributed input values, the average error is approximately half the maximum

For example, when $e = 10$, the step size is $2^{10} = 1024$: in the range $[16{,}368, 31{,}728]$, only 16 discrete values can be represented. Adjacent values differ by 1024, so the absolute error reaches up to 1023, but the relative error stays within 6.25%.

UF8 uses 8 bits to achieve efficiency close to the theoretical minimum of 7.6 bits (about 95%), which is already quite good under a 6.25% error bound. This design sacrifices absolute precision in high-value regions for a representation range exceeding one million, making it particularly suitable for sensor data (temperature, distance), graphics applications (LOD distance, fog density), and exponential backoff timers.

### Informative Use Case

Modern IoT edge devices collecting environmental data face several critical constraints:

- Memory: Limited RAM for buffering sensor readings
- Power: Battery operation demands minimal CPU cycles per operation
- Bandwidth: Wireless transmission costs penalize large payloads

UF8 encoding provides an efficient alternative with a 4x compression ratio (8-bit vs 32-bit representation), wide range (values from 0 to 1,015,792), and controlled error (maximum 6.25% relative error). For applications like trend monitoring or anomaly detection where absolute precision isn't critical, this compression reduces both memory footprint and transmission overhead. However, efficient implementation of UF8 encoding remains a challenge for resource-constrained embedded systems.
This assignment explores optimization techniques in RISC-V assembly to make UF8 practical for real-time edge computing applications. ### Algorithm Analysis First, we define a `clz()` function to calculate the number of leading zeros in a number. This function mainly uses binary search to reduce computational cost, with the key being the line `c >>= 1`, which halves the search range each time until finding the highest bit 1. Next, we define the decode function. According to the [detailed explanation in quiz1][q1-de-link], we know the decoding formula is as follows: $$D(b) = m \cdot 2^e + (2^e - 1) \cdot 16$$ Where $e = \lfloor b/16 \rfloor$ and $m = b \bmod 16$ ```c uint32_t mantissa = fl & 0x0f; uint8_t exponent = fl >> 4; ``` The implementation of exponent $e$ is very straightforward, simply shift right by four bits. As for mantissa $m$, we can verify through actual computation (using $b = fl = 45$ as an example): ``` 0010 1101 & 0000 1111 = 0000 1101 (13) ← This is m ``` We can use the bitwise operation `& 0x0f` to complete this. ```c uint32_t offset = (0x7FFF >> (15 - exponent)) << 4; ``` Next, the key for the latter half is how to implement $(2^e - 1)$. We know that $e$ has a maximum of 15, which means the maximum result will be `0111 1111 1111 1111`. Moving down one bit, when $e$ is 14, the result is `0011 1111 1111 1111`, and so on. Observing all the way down to when $e$ is 0, we find it's exactly `0000 0000 0000 0000`, so it's actually just the process of shifting `0x7FFF` all the way right to 0, and we can find that the shift amount is exactly $(15 - e)$. Finally, according to the formula, we need to multiply by 16 again, which means shifting left by 4 bits. The result of combining these two is the latter half of the formula. The multiplication $m \cdot 2^e$ is also cleverly done using left shift operations, reducing CPU cost. 
Where $(2^e - 1) \cdot 16$ represents the offset, and its design concept is to ensure that each exponent range is continuous and non-overlapping. Finally, we have the definition of the encode function: $$E(v) = \begin{cases} v, & \text{if } v < 16 \\ 16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise} \end{cases}$$ The first half is very simple: ```c if (value < 16) return value; ``` When $v$ is less than 16, we simply return $v$, at which point $e = 0$, $m = v$. Next, we handle the latter half $16e + \lfloor(v - \text{offset}(e))/2^e\rfloor$. First, we use the `clz()` function defined at the very beginning for quick estimation: ```c int lz = clz(value); int msb = 31 - lz; ``` Then we need to verify how to find the appropriate exponent $e$. From the previous decoding formula, we can derive that when $m$ varies from 0 to 15, for a fixed exponent $e$: Minimum value ($m = 0$): $$\begin{align} \text{value}_{\min} &= 0 \cdot 2^e + (2^e - 1) \cdot 16\\ &= (2^e - 1) \cdot 16\\ \end{align}$$ Maximum value ($m = 15$): $$\begin{align} \text{value}_{\max} &= 15 \cdot 2^e + (2^e - 1) \cdot 16\\ &= 31 \cdot 2^e - 16\\ \end{align}$$ So the range of exponent $e$ is: $[(2^e - 1) \cdot 16, \quad 31 \cdot 2^e - 16]$ We define $\text{offset}(e) = (2^e - 1) \cdot 16$, representing the minimum value that exponent $e$ can represent. 
Next, we continue to derive $\text{offset}(e+1)$: $$\begin{align} \text{offset}(e+1) &= (2^{e+1} - 1) \cdot 16 \\ &= (2 \cdot 2^e - 1) \cdot 16 \\ &= 2 \cdot 2^e \cdot 16 - 16 \\ &= 2 \cdot (2^e - 1) \cdot 16 + 2 \cdot 16 - 16 \\ &= 2 \cdot \text{offset}(e) + 16 \end{align}$$ We get the final result: $$\text{offset}(e+1) = 2 \cdot \text{offset}(e) + 16$$ This recursive relationship corresponds to the code: ```c for (uint8_t e = 0; e < exponent; e++) overflow = (overflow << 1) + 16; ``` However, since errors may still occur, we must check again and use the formula $$\text{offset}(e) = (\text{offset}(e+1) - 16) / 2$$ derived backwards from $$\text{offset}(e+1) = 2 \cdot \text{offset}(e) + 16$$ for downward fine-tuning: ```c while (exponent > 0 && value < overflow) { overflow = (overflow - 16) >> 1; exponent--; } ``` After verifying all of this, we return to the initial goal: finding the exponent $e$ corresponding to value $v$. If $v$ belongs to exponent $e$, it must satisfy: $$\text{offset}(e) \leq v < \text{offset}(e+1)$$ We can further verify whether adjacent exponent ranges are continuous and non-overlapping. The maximum value of exponent $e$ is $31 \cdot 2^e - 16$, while the minimum value of exponent $e+1$ is: $$\begin{align} \text{offset}(e+1) &= (2^{e+1} - 1) \cdot 16\\ &= 32 \cdot 2^e - 16\\ \end{align}$$ We can find: $$31 \cdot 2^e - 16 < 32 \cdot 2^e - 16$$ This indicates that the starting point of $e+1$ is greater than the endpoint of $e$, so adjacent exponent ranges are continuous but non-overlapping. We reach the final conclusion: if $\text{offset}(e) \leq v < \text{offset}(e+1)$, then value $v$ belongs to the range of exponent $e$. This is the mathematical foundation for this logic in the code: ```c while (exponent < 15) { uint32_t next_overflow = (overflow << 1) + 16; if (value < next_overflow) break; // Found it! 
overflow = next_overflow;
    exponent++;
}
```

Finally, we first calculate $\lfloor(v - \text{offset}(e))/2^e\rfloor$ according to the formula:

```c
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
```

The left shift by 4 bits implements $16e$; since the mantissa occupies only the low 4 bits, the bitwise OR acts as addition here, completing the encoding process.

After understanding the encoding and decoding implementation, let's look at the design logic of the test function. At first, I was confused about the test function: since UF8 encoding produces errors, why does the test function check whether the result of "decode then re-encode" matches the original value?

```c
static bool test(void)
{
    int32_t previous_value = -1;
    bool passed = true;

    for (int i = 0; i < 256; i++) {
        uint8_t fl = i;
        int32_t value = uf8_decode(fl);
        uint8_t fl2 = uf8_encode(value);
        if (fl != fl2) {
            printf("%02x: produces value %d but encodes back to %02x\n",
                   fl, value, fl2);
            passed = false;
        }
        // omitted
    }
```

The key lies in understanding the starting point of the test: this test starts from all 256 possible UF8 encoded values, not from arbitrary integers. When we start from a UF8 encoded value `fl`:

1. First decode to get integer `value`
2. This `value` must be a value that can be precisely represented by UF8
3. Encoding this precisely representable value back should theoretically yield the original `fl`

The scenario where errors occur is: when you take an arbitrary integer (e.g., 12345) and encode it, it gets mapped to a nearby representable value (here 12272, since $e = 9$, $m = 8$ gives $8 \cdot 512 + 8176 = 12272$). But if you start from a UF8 encoding, the decoded value (12272) must be one of the 256 precisely representable values, so encoding it back will yield the original UF8 encoding. This test actually verifies three things:

1. Whether the encode and decode functions are inverse functions of each other (for all 256 representable values)
2. Each UF8 encoded value corresponds to a unique integer value
3.
These 256 representable integer values are monotonically increasing (checked via `value <= previous_value`) In other words, the test ensures the completeness of the UF8 encoding space: all 256 encodings are valid and non-overlapping, and each encoding can correctly round-trip convert. ### Assembly Code First, I decided to convert the code completely into assembly code before considering optimization strategies. > [Commit cf2328c][commit1]. Essentially, it's a direct translation from C code to assembly code, extending the `test()` function from the original [q1-uf8.c][q1-b] into automated testing. Here are the Console results after executing in Ripes: ```bash All tests passed! Program exited with code: 0 ``` Execution info: | Metric | Value | | -------- | ------: | | Cycles | 34999 | | Instrs. retired | 34999 | | CPI | 1 | | IPC | 1 | |Clock rate | 0 Hz | This serves as our baseline, and optimization work begins from here. In the original implementation, the `uf8_encode()` function directly mimics the original code, using many loops and calling the `clz()` function for estimation. However, this approach in assembly implementation results in excessive computations, cycles, and function call overhead. Revisiting our previous mathematical analysis of the algorithm: $$\text{offset}(e) \leq v < \text{offset}(e+1)$$ Expanding this inequality: $$(2^e - 1) \cdot 16 \leq v < (2^{e+1} - 1) \cdot 16$$ Simplifying by dividing by 16: $$2^e - 1 \leq \frac{v}{16} < 2^{e+1} - 1$$ Although RV32I lacks division instructions, we can achieve the same effect through bit manipulation. Using the right shift operation `v >> 4` efficiently computes $\frac{v}{16}$ and determines the position of the most significant bit (MSB). 
This approach theoretically allows direct exponent estimation through a simple formula without calling `clz()`: ```assembly # s0 = value srli t0, s0, 4 # t0 = value / 16 li s1, 0 # s1 = exponent estimate estimate_exp_loop: beqz t0, estimate_done # If t0 == 0, done srli t0, t0, 1 # t0 >>= 1 addi s1, s1, 1 # exponent++ j estimate_exp_loop ``` However, a problem emerges: the iteration count equals the MSB position of `(value >> 4)`, potentially creating a worst case scenario. Comparing iteration counts: | value Range | value >> 4 | estimate_exp_loop Iterations | CLZ Iterations | |-------------|:----------:|:---------------------------:|:--------------:| | 16-31 | 1 | 1 | 5 | | 32-63 | 2-3 | 2 | 5 | | 64-127 | 4-7 | 3 | 5 | | 128-255 | 8-15 | 4 | 5 | | 256-1023 | 16-63 | 5-6 | 5 | | 1024-65535 | 64-4095 | 7-12 | 5 | | Maximum | ~67M | 26 | 5 | Analysis shows that the simple loop approach can require up to 26 iterations in the worst case, far exceeding the constant 5 iterations of `clz()` using binary search. The binary search algorithm already achieves optimal O(log n) complexity. Therefore, my optimization strategy shifts focus to the real bottleneck: the redundant loops for offset computation in the original implementation. The optimization process: 1. Retain `clz()` for efficient MSB calculation, ensuring performance 2. Apply CLZ to `value >> 4` instead of the full value for a better initial estimate 3. Eliminate iterative offset accumulation using the direct formula: `offset = ((1 << e) - 1) << 4` 4. 
Reduce remaining work to fine-tuning, only requiring exponent adjustments of ±1 After obtaining the initial exponent estimate, the fine-tuning logic also uses the mathematical result we derived earlier: $$\text{offset}(e) \leq v < \text{offset}(e+1)$$ ```assembly fine_tune_down: # Check if value < offset (exponent too high) bge s0, s2, fine_tune_up beqz s1, calc_mantissa # Can't go lower than 0 # Adjust down: e = e - 1, recalculate offset addi s1, s1, -1 li t1, 1 sll t1, t1, s1 addi t1, t1, -1 slli s2, t1, 4 j fine_tune_down fine_tune_up: # Check if value >= next_offset (should go higher) li t2, 15 bge s1, t2, calc_mantissa # Already at maximum slli t3, s2, 1 # next_offset = (offset << 1) + 16 addi t3, t3, 16 blt s0, t3, calc_mantissa # Current exponent is correct # Adjust up: e = e + 1 addi s1, s1, 1 mv s2, t3 j fine_tune_up ``` This verifies the constraint and adjusts when necessary. This optimization eliminates three separate loops in the original implementation (accumulating overflow from 0, adjusting downward, and searching upward), replacing them with a single fine-tuning loop that typically requires only 1-2 iterations. >[Commit a36232b][commit2]. Optimized Execution info: | Metric | Baseline | Optimized | Improvement | | -------- | -------: | -------: | -------: | | Cycles | 34,999 | 23,005 | **-34.3%** | | Instrs. retired | 34,999 | 23,005 | -34.3% | | CPI | 1 | 1 | - | | IPC | 1 | 1 | - | The improvement saves approximately 12,000 cycles, primarily from: - Eliminating three redundant loop structures - Using direct calculation instead of iterative offset accumulation - Fine-tune logic typically requiring only 1-2 adjustments Next, examining the `clz()` and `uf8_decode()` functions, I found no obvious optimization opportunities. For instance, `clz()`'s binary search is already optimal, and `uf8_decode()`'s implementation is already quite streamlined. I decided to finalize the current modifications as the optimized version. 
[q1-b]:https://github.com/HeatCrab/ca2025-quizzes/blob/main/q1-uf8.c [q1-c]:https://github.com/HeatCrab/ca2025-quizzes/blob/main/q1-bfloat16.c [commit1]:https://github.com/HeatCrab/ca2025-quizzes/commit/cf2328c5543a701282c5338c08222471c17ef6e3 [commit2]:https://github.com/HeatCrab/ca2025-quizzes/commit/a36232b9e74c9d019ecca904bbe2254fe165c5da ### Pipeline Analysis :::info TBD ::: ## Quiz1 - Bfloat16 ### BFLOAT16 According to the [detailed explanation in quiz1][q1-de-link], we know that bfloat16 is a data type from Google Brain. It preserves float32's dynamic range by keeping the same 8-bit exponent, but reduces precision to a 7-bit significand, compared to the 23-bit version in float32. The format is as follows: ```mermaid graph LR A["Bit 15<br/>Sign (1 bit)"] --> B["Bits 14-7<br/>Exponent (8 bits)"] --> C["Bits 6-0<br/>Mantissa (7 bits)"] style A fill:#ffcccc style B fill:#ccffcc style C fill:#ccccff ``` In this format, the Sign bit determines positive or negative (0 for positive, 1 for negative), the Exponent is 8 bits with a bias of 127, and the Mantissa is the 7-bit fractional part. The value of a BFloat16 number is calculated as: $$v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right)$$ where $S \in \{0, 1\}$ is the sign bit, $E \in [0, 255]$ is the complete range of the 8-bit exponent, but the **range for valid normalized numbers** is $E \in [1, 254]$, and $M \in [0, 127]$ is the mantissa value. According to the IEEE-754 standard, bfloat16 also defines several special cases: 1. **Zero ($E = 0, M = 0$)**: Represents zero. Interestingly, IEEE 754 allows both +0 and -0 to exist, so the code doesn't check the sign bit and treats both zeros as zero. This is implemented in [q1-bfloat16.c][q1-c] as: ```c return !(a.bits & 0x7FFF); ``` 2. **Infinity ($E = 255, M = 0$)**: Represents infinity, with all exponent bits set to 1 and mantissa set to 0. The sign bit determines whether it's positive or negative infinity. 3. 
**NaN ($E = 255, M \neq 0$)**: Short for Not-a-Number, with all exponent bits set to 1 and a non-zero mantissa, used to represent undefined or unrepresentable results.
4. **Denormals ($E = 0, M \neq 0$)**: Denormalized numbers are supported in the bfloat16 standard, but in practice, many hardware implementations adopt a flush-to-zero strategy, directly treating denormal numbers as zero to simplify hardware design.

### Informative Use Case

### Algorithm Analysis

First, let's analyze the conversions and rounding behavior, which is the float32 $\leftrightarrow$ bfloat16 conversion part. In the conversion, the `memcpy` function is used because a direct pointer cast would violate the strict aliasing rules. The float32 format is

```mermaid
graph LR
A["Bit 31<br/>Sign (1 bit)"] --> B["Bits 30-23<br/>Exponent (8 bits)"] --> C["Bits 22-0<br/>Mantissa (23 bits)"]
style A fill:#ffcccc
style B fill:#ccffcc
style C fill:#ccccff
```

so the high 16 bits of float32 contain the complete sign bit and exponent as well as the first 7 bits of the mantissa, which corresponds exactly to the bfloat16 format; a simple 16-bit right shift extracts them. We first extract the exponent and compare it against `0xFF` to check for a special value (NaN/Inf). If so, we can directly take the high 16 bits. Otherwise, we need to proceed to the next step: rounding.

In the rounding process, the first decisive key is the low 16 bits of float32. By adding `0x7FFF`:

- If the low 16 bits $<$ `0x8000` $\Rightarrow$ after addition, there is no carry into the high 16 bits
- If the low 16 bits $>$ `0x8000` $\Rightarrow$ after addition, there is always a carry into the high 16 bits

Then what happens if the low 16 bits are exactly `0x8000`? Naive rounding would always round this 0.5 case up, which introduces bias. In our design, this is the second key: the `& 1`. This step is the classic IEEE-754 round-to-nearest-even approach. We check whether the LSB of the high part is odd or even. If it's even, then $LSB = 0$, and there will be no carry.
Otherwise, it carries. We can summarize the rules in the following table (the amount added is always `0x7FFF` $+$ LSB):

| Low 16 bits | Amount added | Result |
|:---------:|:----------------|:-----|
| $<$ `0x8000` | `0x7FFF` or `0x8000` | No carry (round down) |
| $=$ `0x8000` | `0x7FFF` or `0x8000` | Carry only if $LSB=1$ (round to even) |
| $>$ `0x8000` | `0x7FFF` or `0x8000` | Always carries (round up) |

As for converting back to float32, it's quite simple: just zero-fill the low 16 bits.

Next is arithmetic. In the implementation of [q1-bfloat16.c][q1-c], all four basic operations are included. For ease of calculation, the incoming bfloat16 numbers are first decomposed into sign bit, exponent bits, and mantissa bits. Let's first discuss addition and subtraction. After confirming whether there are special values, the implicit bit is added to normalized numbers:

```c
if (exp_a)
    mant_a |= 0x80;
if (exp_b)
    mant_b |= 0x80;
```

Then we enter the key step: checking whether the exponents are aligned and adjusting them:

```c
int16_t exp_diff = exp_a - exp_b;
```

Through simple subtraction, we first get the difference between the two exponents. The key here is to determine whether `exp_diff` exceeds 8 (7 mantissa bits + 1 implicit bit). If it does, the magnitudes are too far apart for the smaller operand to contribute any precision, so we directly return the larger operand as the result of the addition or subtraction. Otherwise, we continue to align the two numbers. After alignment is complete, we can finally perform the addition or subtraction. After the addition, we first check whether overflow occurred:

```c
if (result_mant & 0x100) {
    result_mant >>= 1;
    if (++result_exp >= 0xFF)
        return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
```

Here appears a key concept: normalization. Normalization ensures that the highest bit of the mantissa is always 1 (the implicit bit), keeping the value in the range $[1.0, 2.0)$. Why is this needed? Because a unified representation maximizes precision utilization and simplifies hardware design.
The operation itself is simple: when the value is too large (overflow), right-shift the mantissa and increment the exponent by 1; when the value is too small, left-shift the mantissa and decrement the exponent by 1. We'll see this pattern repeatedly in various operations. Back to the code, using `0x100` we can clearly determine whether a carry occurred, which is overflow. If so, we directly right-shift by 1 bit and increment the exponent by one to maintain the normalized result. But at this point, we still need to check whether incrementing the exponent would cause an overflow to infinity. If it's greater than or equal to `0xFF`, it means the result of addition or subtraction is infinity, so we perform a bitwise OR operation with `0x7F80` (all exponent bits set to 1, all mantissa bits set to 0) to mark it as infinity. In the case of subtraction, we need to normalize the result through a while loop: left-shift (multiply by 2) the mantissa and decrement the exponent by one. If the exponent is decremented to zero first, we directly flush to zero. ```c while (!(result_mant & 0x80)) { result_mant <<= 1; if (--result_exp <= 0) return BF16_ZERO(); } ``` Finally, we assemble the sign bit, exponent bits, and mantissa bits to get the result of addition or subtraction. The mantissa performs a bitwise AND with `0x7F` to remove the implicit bit. Next is multiplication. The implementation of multiplication is based on the mathematical formula: $$(m_a × 2^{e_a}) × (m_b × 2^{e_b}) = (m_a × m_b) × 2^{e_a + e_b}$$ However, in the implementation we find: ```c uint32_t result_mant = (uint32_t) mant_a * mant_b; int32_t result_exp = (int32_t) exp_a + exp_b - BF16_EXP_BIAS + exp_adjust; ``` `BF16_EXP_BIAS`, which is 127, is subtracted. This is because $e_a$ is actually $e_a + 127$, and $e_b$ is actually $e_b + 127$. Adding them together would add 127 twice, so we need to subtract it once. Actually, before the operation, we also decompose and confirm special values. 
There's another crucial step: checking and handling denormalized numbers. ```c int16_t exp_adjust = 0; if (!exp_a) { while (!(mant_a & 0x80)) { mant_a <<= 1; exp_adjust--; } exp_a = 1; } else mant_a |= 0x80; ``` By using `0x80` and the implicit bit, we can obtain the correction value `exp_adjust` needed after normalization to reduce calculation errors. After completing the multiplication, we also need to check for normalization. Let's first analyze the range of results from multiplying two 8-bit numbers: - Minimum: $1.0000000 × 1.0000000 = 1.00000000000000$, with a total of 14 $0$s - Maximum: $1.1111111 × 1.1111111 \approx 3.99999...$, approaching 4 So the result range is in $[1.0, 4.0)$, in binary positions: - $[1, 2)$: $0$ 1.xxxxxxxxxxxxxx, bit 15=0, bit 14=1 - $[2, 4)$: $1$ x.xxxxxxxxxxxxxx, bit 15=1 So there are two cases. If `& 0x8000` is 1, it means the result $\ge 2.0$ and needs to be right-shifted by 8 bits. Otherwise, the result $< 2.0$ and only needs to be right-shifted by 7 bits. Finally, division. Since it's similar to multiplication, let's focus on the key in implementation: simulating manual division. ```c uint32_t dividend = (uint32_t) mant_a << 15; ``` First, left-shift `dividend` by 15 bits. If we divide directly, we can only get the integer part. Left-shifting by 15 bits is equivalent to magnifying the dividend by $2^{15}$ times, and the quotient will be magnified accordingly, including the fractional part. This reserves enough precision space for the mantissa and increases calculation accuracy. Then implement long division through a for loop: ```c for (int i = 0; i < 16; i++) { quotient <<= 1; if (dividend >= (divisor << (15 - i))) { dividend -= (divisor << (15 - i)); quotient |= 1; } } ``` The logic is quite simple: check whether the current dividend is greater than or equal to the divisor. If so, subtract the divisor and set the current quotient bit to 1; otherwise, set it to 0. 
After the mantissa calculation is complete, the exponent part is quite simple. According to the mathematical formula, we know:

$$(m_a × 2^{e_a}) ÷ (m_b × 2^{e_b}) = (m_a / m_b) × 2^{e_a - e_b}$$

Just subtract directly. But, as with multiplication, the bias needs correcting: subtracting the stored exponents cancels the 127 entirely, so we must add it back. The subsequent normalization and assembly steps complete the process.

After the four basic operations, there's another arithmetic function: square root. According to the [detailed explanation][q1-de-link], we know the implementation formula is as follows:

$$\sqrt{a} = \sqrt{2^{e_a} \times m_a} = 2^{e_a/2} \times \sqrt{m_a}$$

After confirming special cases, we implement the operation according to the formula. First, we need to process $e$:

```c
if (e & 1) {
    m <<= 1;
    new_exp = ((e - 1) >> 1) + BF16_EXP_BIAS;
} else {
    new_exp = (e >> 1) + BF16_EXP_BIAS;
}
```

There are two cases depending on whether $e$ is odd or even:

- If $e$ is even, we can write $e = 2k$. Substituting into the formula, we get $\sqrt{2^{2k} \times m} = 2^{k} \times \sqrt{m}$, so $exp_{new}$ is $k$.
- If $e$ is odd, we can write $e = 2k + 1$. Substituting into the formula, we get $\sqrt{2^{2k+1} \times m} = \sqrt{2^{2k} \times 2m} = 2^{k} \times \sqrt{2m}$. Again $exp_{new}$ is $k$, but the mantissa becomes $2m$, which is why we need to left-shift $m$.

This ensures the radicand stays in $[1, 2)$ or $[2, 4)$, so the root itself lands back in $[1, 2)$.

Next, we enter the square root calculation, which uses binary search. The key in the implementation is to find:

$$result^2 \approx m \times 128$$

Why 128? Because 128 represents $1.0$, the normalization standard. From the previous step, the range of $m$ is now $[128, 512)$, so the answer $\sqrt{m \times 128}$ lies in $[128, 256]$: $\sqrt{128 \times 128} = 128$ and $\sqrt{512 \times 128} = 256$. The code therefore uses 90 as a deliberately loose lower bound, 256 as the upper bound, and 128 as the default result.
```c uint32_t low = 90; uint32_t high = 256; uint32_t result = 128; ``` Then we start the binary search process. The key is guessing the midpoint. The approach is to square $mid$ and then divide by 128, performing the same scaling operation. We judge the current position and decide whether to adjust up or down. Through continuous approximation, we find the maximum `mid` value that satisfies $sq \le m$. This is our square root estimate. Subsequent normalization adjustments complete the process. After that, there are three comparison functions: greater than, equal to, and less than. Since bfloat16 also has a sign bit, we need to check whether the signs of the two numbers are the same. ```c bool sign_a = (a.bits >> 15) & 1, sign_b = (b.bits >> 15) & 1; ``` If they're the same, we compare normally and return. Otherwise, we use the sign bit to determine the magnitude. Looking back at the entire implementation, we can discover an interesting pattern: regardless of which operation, we eventually see the normalization step. After addition, we might need to right-shift by one bit; after subtraction, we might need to left-shift by multiple bits; multiplication needs to decide the shift amount based on the result range; division and square root also have their own normalization logic. These seemingly different processes are actually doing the same thing: pulling the result back to the standard range of $[1.0, 2.0)$. However, thinking carefully, the normalization operation itself is quite simple—check the highest bit, and if it's not right, shift and adjust. It's that intuitive. What's truly complex is **when normalization is needed**: addition needs to check for carry, subtraction needs to worry about leading zeros, multiplication results might be in the range $[1, 4)$, and division and square root each have their own situations. The normalization requirements for each operation are different, but the processing method is unified. 
This design of "complex judgment but unified operation" allows hardware implementations to share the same normalization circuit. This also explains why bfloat16 adopts this design approach. ### Assembly Code First, I converted the conversion functions and directly used C code for the conversion. For the `f32_to_bf16()` function, I found that regardless of whether it's a special value, the return value always implements the `f32bits >> 16` operation. So in my design, for cleanliness and to reduce code duplication, I wrote the assembly as follows: ```assembly f32_to_bf16: addi sp, sp, -4 sw t0, 0(sp) # Check if NaN/Inf srli t0, a0, 23 andi t0, t0, 0xFF li t1, 0xFF beq t0, t1, f32_shift # omitted f32_shift: srli a0, a0, 16 bne t0, t1, finish_f32_to_bf16 # if not NaN/Inf, finish # If NaN/Inf, just mask to 16 bits lui t0, 0x10 addi t0, t0, -1 and a0, a0, t0 finish_f32_to_bf16: lw t0, 0(sp) addi sp, sp, 4 ret ``` I first converted the test function for conversion and rounding from C code into assembly code to test whether this function runs smoothly. > [Commit 62d571b][commit3] Next, I implemented special value detection. Basically, it's just converted from the C code. I noticed that the code separately checks both exponent and mantissa when confirming whether it's infinity or NaN. I originally considered writing it as a helper function for calling. However, this didn't save computation costs and would increase writing difficulty, so I abandoned that approach. I chose to write them separately and inline them directly. > [Commit 770ddb7][commit4] Before implementing the most complex four arithmetic operations, I first completed all the comparison functions. Again, I initially considered whether to implement additional helper functions, using `call function_name` to save code size for repeated conditional checks. 
```c
if (bf16_isnan(a) || bf16_isnan(b))
```

```c
if (bf16_iszero(a) && bf16_iszero(b))
```

However, after completing it, I found that it wasn't as practical as I had imagined, and it even added extra calling overhead. So I ultimately decided to directly convert it following the C code.

> [Commit 3947618][commit5]

Recording the cycles up to this point:

| Metric | Value |
| -------- | ------: |
| Cycles | 1695 |
| Instrs. retired | 1695 |
| CPI | 1 |
| IPC | 1 |
| Clock rate | 0 Hz |

Finally, the main event: the four arithmetic operations. Initially, I was thinking about whether to first implement the repeated parts as helper functions, but ultimately decided to write all the code first before deciding. First, I implemented the addition and subtraction parts by directly converting the C code. I encountered a problem during implementation:

```c
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
```

I habitually reached for the `ori` instruction to perform the bitwise operation with `0x7F80`. However, `0x7F80` exceeds the 12-bit immediate range, so I changed it to:

```assembly
li t4, 0x7F80
or a0, t1, t4
```

> [Commit 97cd325][commit6]

Next, I implemented the multiplication part. When encountering the following conditional statement, I thought it could be transformed using De Morgan's law, which would make it easier to write:

```c
if ((!exp_a && !mant_a) || (!exp_b && !mant_b))
    return (bf16_t) {.bits = result_sign << 15};
```

So the original conditional becomes `!((exp_a || mant_a) && (exp_b || mant_b))`, which can be implemented in assembly code as:

```assembly
or t2, s3, s4    # s3 = exp_a, s4 = mant_a
or t3, s6, s7    # s6 = exp_b, s7 = mant_b
and t0, t2, t3
```

Next, I encountered mantissa multiplication in the code.

```c
uint32_t result_mant = (uint32_t) mant_a * mant_b;
```

Since RV32I itself doesn't have a multiplication instruction, we need to implement it ourselves. The simplest approach is shift-and-add multiplication.
The concept mimics the long multiplication we learned in elementary school.

```assembly
    li t3, 0                       # Initialize result to 0
    li t4, 8                       # Counter = 8 (because the mantissa is 8 bits)

bf16_mul_mant_loop:
    beqz t4, bf16_mul_mant_done    # If counter = 0, done
    andi t0, s7, 1                 # Get LSB of multiplier mant_b
    beqz t0, bf16_mul_mant_skip    # If it is 0, skip the addition since this partial product is 0
    add t3, t3, s4                 # If it is 1, accumulate the multiplicand mant_a

bf16_mul_mant_skip:
    slli s4, s4, 1                 # Left-shift the multiplicand, mimicking long multiplication where partial products shift left
    srli s7, s7, 1                 # Right-shift the multiplier to examine the next bit
    addi t4, t4, -1                # Counter - 1
    j bf16_mul_mant_loop           # Continue the loop

bf16_mul_mant_done:
    # t3 now holds mant_a × mant_b, i.e. result_mant
```

However, when running the test data, I hit a problem. The test multiplied $0.5f \times 4.0f$, expecting `0x4000`, which is $2.0f$, but my output kept being `0x0000`. After repeatedly failing to find the issue myself, I located it with help from Claude Code. The error was in my earlier implementation:

```assembly
or t2, s3, s4    # s3 = exp_a, s4 = mant_a
or t3, s6, s7    # s6 = exp_b, s7 = mant_b
and t0, t2, t3
```

The `and` instruction is a bitwise operation, unlike the logical AND in the C code. For example, with `t2 = 0x80` and `t3 = 0x01`, both values are non-zero (logically true), but their bitwise AND is 0. This led to an incorrect judgment: the code took the early return `return (bf16_t) {.bits = result_sign << 15};`, producing wrong results. The correct approach is not to combine the two values with `and`, but to check them separately:

```assembly
beqz t3, bf16_mul_return_shift
beqz t2, bf16_mul_return_shift
```

> [Commit 8467fdb][commit7]

Next, I implemented division. Since most of it is similar to the multiplication implementation, I completed it quickly.
Recording the cycles up to this point:

| Metric | Value |
| -------- | ------: |
| Cycles | 5441 |
| Instrs. retired | 5441 |
| CPI | 1 |
| IPC | 1 |
| Clock rate | 0 Hz |

> [Commit c1320ac][commit8]

Then came the final arithmetic operation: square root. After implementing it, I tested it with the test value $0.25f$ and hit a problem. For $0.25f$, the `exp` field is 125, so:

```c
int32_t e = exp - BF16_EXP_BIAS;
```

yields a negative `e`. Since I had used logical shift instructions in my implementation, my subsequent adjustment of $e$:

```c
if (e & 1) {
    m <<= 1; /* Double mantissa for odd exponent */
    new_exp = ((e - 1) >> 1) + BF16_EXP_BIAS;
} else {
    new_exp = (e >> 1) + BF16_EXP_BIAS;
}
```

produced incorrect results, leaving `new_exp` wrong. So I adjusted it as follows:

```diff
 # Check if e is odd
 andi t0, t3, 1
 beqz t0, bf16_sqrt_even_exp

 # Odd exponent
 slli s6, s6, 1
 addi t0, t3, -1
-srli t1, t0, 1
+srai t1, t0, 1    # Use arithmetic shift for signed division
 addi t1, t1, 127
 j bf16_sqrt_binary_search

bf16_sqrt_even_exp:
-srli t1, t3, 1
+srai t1, t3, 1    # Use arithmetic shift for signed division
 addi t1, t1, 127
```

> [Commit 08b5ecd][commit9]

At this point, the entire assembly conversion is complete. Total cycles so far:

| Metric | Value |
| -------- | ------: |
| Cycles | 8517 |
| Instrs. retired | 8517 |
| CPI | 1 |
| IPC | 1 |
| Clock rate | 0 Hz |

[commit3]: https://github.com/HeatCrab/ca2025-quizzes/commit/62d571b63e0adc06f03bda8554dfd0f3253be9dc
[commit4]: https://github.com/HeatCrab/ca2025-quizzes/commit/770ddb7d8bbe93dfa0cbc77580a00609cd01b7aa
[commit5]: https://github.com/HeatCrab/ca2025-quizzes/commit/3947618e828b8c05ebce73146957502cfd48b2f8
[commit6]: https://github.com/HeatCrab/ca2025-quizzes/commit/97cd325df0bc6981626f6d0265b85b36b624266d
[commit7]: https://github.com/HeatCrab/ca2025-quizzes/commit/8467fdb8b98d2f45876092fe57008fdb8e86398a
[commit8]: https://github.com/HeatCrab/ca2025-quizzes/commit/c1320ac4c3fbc519ba444960578c04eb0a4796d3
[commit9]: https://github.com/HeatCrab/ca2025-quizzes/commit/08b5ecdff90a5d35d342db13a26b20e8bae3ee11
[q1-de-link]: https://hackmd.io/@sysprog/arch2025-quiz1-sol