# Assignment 1: RISC-V Assembly and Instruction Pipeline

contributed by < [geumbo](https://github.com/geumbo/ca2025-quizzes) >

> [!Note]
> **AI tools usage**: I used ChatGPT to fix code issues and enhance the readability of my writing.

## Problem B

### Optimize overflow calculation from uf8_encode

The original C code calculates `overflow` using a for loop. The execution time of this loop is directly proportional to the value of the input `exponent`.

Original C Code:
```c=
uint32_t overflow = 0;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
    overflow = (overflow << 1) + 16;
```

Direct RV32I Assembly Translation:
```c=
mv t2, x0                      # t2 = overflow = 0
mv t3, x0                      # t3 = e = 0
calc_overflow_loop:
    bge t3, t1, adjust         # if e >= exponent, goto adjust (t1 = exponent)
    slli t2, t2, 1             # overflow << 1
    addi t2, t2, 16            # overflow = (overflow << 1) + 16
    addi t3, t3, 1             # e = e + 1
    j calc_overflow_loop
```

Let $O_k$ be the value of `overflow` after the $k$-th iteration. The recurrence is:
$$O_k = 2 \times O_{k-1} + 16$$
The initial condition is $O_0 = 0$, representing the value before the first iteration.
- After 1 iteration (exponent = 1): $O_1 = 2 \times O_0 + 16 = 2 \times 0 + 16 = 16$ - After 2 iterations (exponent = 2): $O_2 = 2 \times O_1 + 16 = 2 \times (16) + 16 = 16 \times 2^1 + 16 \times 2^0$ - After 3 iterations (exponent = 3): $O_3 = 2 \times O_2 + 16 = 2 \times (16 \times 2^1 + 16 \times 2^0) + 16 = 16 \times 2^2 + 16 \times 2^1 + 16 \times 2^0$ We can generalize the formula for `n` iterations: $$O_n = 16 \times 2^{n-1} + 16 \times 2^{n-2} + \dots + 16 \times 2^1 + 16 \times 2^0$$ We can factor out the constant `16` and express the pattern using summation notation: $$O_n = 16 \times (2^{n-1} + 2^{n-2} + \dots + 2^1 + 2^0) = 16 \times \sum_{i=0}^{n-1} 2^i $$ The summation part, $\sum_{i=0}^{n-1} 2^i$, is a standard geometric series with: - First term, $a = 2^0 = 1$ - Common ratio, $r = 2$ - Number of terms, $n$ The formula for the sum of a geometric series is: $$S_n = a \frac{r^n - 1}{r - 1}$$ Substituting our values into the formula: $$S_n = 1 \times \frac{2^n - 1}{2 - 1} = 2^n - 1$$ We substitute this simplified sum back into our expression for $O_n$: $$O_n = 16 \times (2^n - 1)$$ The derived formula can be implemented with just a few efficient bitwise operations, completely eliminating the need for a loop. 
Optimized C Code:
```c=
uint32_t overflow = ((1u << exponent) - 1u) << 4;
```

Optimized RV32I Assembly:
```c=
calc_overflow:
    li t2, 1
    sll t2, t2, t1             # t2 = 1 << exponent
    addi t2, t2, -1            # t2 = (1 << exponent) - 1
    slli t2, t2, 4             # overflow = ((1 << exponent) - 1) << 4
    j adjust
```

#### Evaluation (Ripes 5-stage processor)

Original method cycles: 46,709
> Commit: [2bfcc07](https://github.com/geumbo/ca2025-quizzes/commit/2bfcc07986bb5cdd45197d389d668ef0b5b2825e)

![CleanShot 2025-10-12 at 15.18.21@2x](https://hackmd.io/_uploads/H1ZIURuall.png)

Improved method cycles: 34,147 (-26.9%)
> Commit: [e21b192](https://github.com/geumbo/ca2025-quizzes/commit/e21b192c41da0320b2e3c8483c94719511aae158)

![CleanShot 2025-10-12 at 15.25.27@2x](https://hackmd.io/_uploads/rkYlORdaex.png)

#### Conclusion

The original loop has a time complexity of $O(n)$, where `n` is the value of `exponent`, so its execution time grows linearly with the input. In contrast, the optimized version executes a fixed number of instructions, achieving a constant time complexity of $O(1)$.

The optimized version also produces more compact machine code: it eliminates the loop-management instructions, replacing them with a short sequence of arithmetic/logical instructions.

### Optimize exponent calculation from uf8_encode

We already established the relationship between `overflow` and `exponent` as:
$$\text{overflow} = 16 \times (2^{\text{exponent}} - 1)$$

The encoding rule requires us to find the largest `exponent` such that `value` is greater than or equal to the corresponding `overflow`. This means we need to solve the following inequality for `exponent`:
$$\text{value} \ge 16 \times (2^{\text{exponent}} - 1)$$

1. Divide both sides by 16:
$$\frac{\text{value}}{16} \ge 2^{\text{exponent}} - 1$$
2. Add 1 to both sides:
$$\frac{\text{value}}{16} + 1 \ge 2^{\text{exponent}}$$
3. Take the $\log_2$ of both sides:
$$\log_2\left(\frac{\text{value}}{16} + 1\right) \ge \text{exponent}$$

Since we are looking for the **largest integer** `exponent` that satisfies this inequality, the solution is obtained by taking the floor of the left-hand side expression:
$$\text{exponent} = \left\lfloor \log_2\left(\frac{\text{value}}{16} + 1\right) \right\rfloor$$

For any positive 32-bit integer `x`, the value of `floor(log₂(x))` is equivalent to `31 - clz(x)`. The base-2 logarithm `log₂(x)` represents the power to which 2 must be raised to obtain `x`; taking the floor, `floor(log₂(x))`, gives the **index of the most significant set bit (MSB)** in the binary representation of `x`. The function `clz(x)` returns the number of leading zero bits in the 32-bit binary representation of `x`. Therefore, subtracting this count from 31 (the highest bit index of a 32-bit value) yields the position of the MSB, which is exactly `floor(log₂(x))`.

Optimized C Code:
```c=
uf8 uf8_encode(uint32_t value)
{
    if (value < 16)
        return value;
    // Directly calculate the optimal exponent using the formula
    uint8_t exponent = 31 - clz((value >> 4) + 1);
    if (exponent > 15)
        exponent = 15;
    // Use the formulaic approach to calculate the corresponding overflow
    uint32_t overflow = ((1u << exponent) - 1u) << 4;
    uint8_t mantissa = (value - overflow) >> exponent;
    return (exponent << 4) | mantissa;
}
```

Optimized RV32I Assembly:
```c=
encode:
    addi sp, sp, -16
    sw ra, 12(sp)
    sw s0, 8(sp)
    sw s1, 4(sp)
    sw s2, 0(sp)
    mv s0, a0                  # s0 = value
    li t0, 16
    bltu s0, t0, encode_return_value  # if value < 16, return value
    # Calculate exponent = 31 - clz((value >> 4) + 1)
    srli a0, s0, 4             # value >> 4
    addi a0, a0, 1             # (value >> 4) + 1
    jal ra, clz                # call clz
    li t0, 31
    sub s1, t0, a0             # exponent = 31 - clz
    # if (exponent > 15) exponent = 15
    li t0, 15
    bge t0, s1, calc_overflow  # skip the clamp if exponent <= 15
    mv s1, t0                  # s1 = exponent = 15
calc_overflow:
    # overflow = ((1u << exponent) - 1u) << 4
    li t0, 1
    sll t0, t0, s1             # t0 = 1 << exponent
    addi t0, t0, -1            # t0 = (1 << exponent) - 1
    slli s2, t0, 4             # s2 = overflow = ((1 << exponent) - 1) << 4
calc_mantissa:
    # mantissa = (value - overflow) >> exponent
    sub t0, s0, s2             # t0 = value - overflow
    srl t0, t0, s1             # mantissa = (value - overflow) >> exponent
    slli s1, s1, 4             # exponent << 4
    or a0, s1, t0              # return (exponent << 4) | mantissa
    j encode_end
encode_return_value:
    mv a0, s0                  # return value
encode_end:
    lw ra, 12(sp)
    lw s0, 8(sp)
    lw s1, 4(sp)
    lw s2, 0(sp)
    addi sp, sp, 16
    ret
```

#### Evaluation (Ripes 5-stage processor)

Original method cycles: 34,147 (after optimizing the overflow calculation)
> Commit: [e21b192](https://github.com/geumbo/ca2025-quizzes/commit/e21b192c41da0320b2e3c8483c94719511aae158)

![CleanShot 2025-10-12 at 15.25.27@2x](https://hackmd.io/_uploads/rkYlORdaex.png)

Improved method cycles: 29,753 (-12.9%)
> Commit: [07dea06](https://github.com/geumbo/ca2025-quizzes/commit/07dea06525ee72df8084e8a485da0e5a73115cfd)

![CleanShot 2025-10-12 at 17.47.48@2x](https://hackmd.io/_uploads/BJKwKeYpgl.png)

#### Conclusion

Through two stages of mathematical optimization, the `uf8_encode` function was transformed from an iterative algorithm into a direct computation. As a result, the total execution cycles decreased from **46,709** to **29,753**, an overall performance improvement of **36.3%**.

### Optimize clz

The `clz` function counts the number of consecutive zero bits starting from the most significant bit (MSB) of an integer. The original C code implements `clz` with a binary search: it repeatedly halves the search space to quickly locate the most significant '1' bit.
Original C Code:
```c=
static inline unsigned clz(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}
```

Direct RV32I Assembly Translation:
```c=
clz:
    li t0, 32                  # t0 = n = 32
    li t1, 16                  # t1 = c = 16
    mv t2, a0                  # t2 = x
clz_1:
    srl t3, t2, t1             # t3 = y = x >> c
    beq t3, x0, clz_2          # if y == 0, goto clz_2
    sub t0, t0, t1             # n = n - c
    mv t2, t3                  # x = y
clz_2:
    srli t1, t1, 1             # c = c >> 1
    bne t1, x0, clz_1          # if c != 0, goto clz_1
    sub a0, t0, t2             # return n - x
    ret
```

Since the loop runs for a fixed number of iterations (5 times, for `c = 16, 8, 4, 2, 1`), we can **unroll** it into a linear sequence of instructions. The new algorithm checks progressively smaller blocks of bits from the most significant side. If a block is all zeros, its size is added to a counter, and the input value is shifted left to expose the next block of bits for examination.

Optimized C Code:
```c=
static inline unsigned clz(uint32_t x)
{
    if (x == 0)
        return 32;
    unsigned n = 0;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 28) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 30) == 0) { n += 2;  x <<= 2;  }
    if ((int32_t) x >= 0) { n += 1; }
    return n;
}
```

Optimized RV32I Assembly:
```c=
clz:
    # if (x == 0) return 32
    bne a0, x0, clz_start
    li a0, 32
    ret
clz_start:
    li a1, 0                   # a1 = n = 0
    srli t0, a0, 16
    bne t0, x0, check_8
    addi a1, a1, 16            # n += 16
    slli a0, a0, 16            # x <<= 16
check_8:
    srli t0, a0, 24
    bne t0, x0, check_4
    addi a1, a1, 8             # n += 8
    slli a0, a0, 8             # x <<= 8
check_4:
    srli t0, a0, 28
    bne t0, x0, check_2
    addi a1, a1, 4             # n += 4
    slli a0, a0, 4             # x <<= 4
check_2:
    srli t0, a0, 30
    bne t0, x0, check_1
    addi a1, a1, 2             # n += 2
    slli a0, a0, 2             # x <<= 2
check_1:
    blt a0, x0, clz_end        # if (x < 0), MSB is 1, we are done
    addi a1, a1, 1             # n += 1
clz_end:
    mv a0, a1                  # move result to a0 for return
    ret
```

#### Evaluation (Ripes 5-stage processor)

Original method cycles: 29,753 (after optimizing the overflow and exponent calculations)
> Commit: [07dea06](https://github.com/geumbo/ca2025-quizzes/commit/07dea06525ee72df8084e8a485da0e5a73115cfd)

![CleanShot 2025-10-12 at 17.47.48@2x](https://hackmd.io/_uploads/BJKwKeYpgl.png)

Improved method cycles: 25,321 (-14.9%)
> Commit: [d513160](https://github.com/geumbo/ca2025-quizzes/commit/d513160cba15191d32a8d82da732eca3748abe04)

![CleanShot 2025-10-12 at 23.36.35@2x](https://hackmd.io/_uploads/rJ1msrtpxx.png)

#### Conclusion

Unrolling the fixed five-step search removes the loop-control overhead. On the Ripes 5-stage pipeline, cycles decrease from **29,753** to **25,321** (**−14.9%**).

### 5-stage Pipelined Processor

A 5-stage pipelined processor is an architecture that divides instruction execution into five sequential stages — **Fetch**, **Decode**, **Execute**, **Memory Access**, and **Write Back**. Each stage performs a distinct operation on the instruction before passing it to the next stage. By allowing multiple instructions to occupy different stages simultaneously, the pipeline enables overlapping execution: while one instruction is being executed, another can be decoded, and a third can be fetched. This parallelism significantly increases instruction throughput, improving overall processor performance by completing more instructions per clock cycle.

1. Instruction Fetch (IF)
![CleanShot 2025-10-12 at 21.09.20@1x](https://hackmd.io/_uploads/BJEc_mYpel.png)
- The `PC` initially holds the value `0x00000000`, which is used as the address for the **Instruction Memory**.
- At address `0x00000000`, the instruction memory outputs the 32-bit machine code `0xff010113`, which corresponds to the instruction `addi x2, x2, -16`, so the signal `instr` = `0xff010113`.
- The `instr` is sent to the decode stage for further processing in the next pipeline cycle.
- The adder connected to the `PC` automatically increments its value by 4.
- The multiplexer selects the sequential PC update path, since there is no branch or jump in this stage.

2. Instruction Decode (ID)
![CleanShot 2025-10-12 at 21.19.08@2x](https://hackmd.io/_uploads/SkT0qQKpxx.png)
Instruction `0xff010113` is decoded into four parts:
- `opcode` = `ADDI`
- `R1 idx` = `0x02` → selects **x2** as the source register.
- `Wr idx` = `0x02` → sets **x2** as the destination.
- `imm` = `0xfffffff0` = `-16`

Together these mean:
> Take the value from register **x2**,
> add the immediate constant **−16**,
> and write the result back to **x2**.

The Registers block outputs the current value of `x2` (`0x7ffffff0`), which is sent to the next stage (**Execute**) for arithmetic computation.

3. Execute (EX)
![CleanShot 2025-10-12 at 21.52.09@2x](https://hackmd.io/_uploads/S1Kqz4Kalg.png)
Inputs to the ALU:
- **Op1** (Operand 1): `0x7FFFFFF0` — the value read from source register **x2**
- **Op2** (Operand 2): `0xFFFFFFF0` — the sign-extended immediate value **−16**

The **ALU result (Res)**, `0x7FFFFFE0`, along with the destination register index (**x2**), is forwarded to the next pipeline stage (**MEM**) for subsequent memory access or write-back operations.

4. Memory Access (MEM)
![CleanShot 2025-10-12 at 22.02.46@2x](https://hackmd.io/_uploads/r1FMSVtpel.png)
- Addr: `0x7FFFFFE0` – forwarded from the ALU result
- Data in: `0x00000000` – no data to be written
- Write enable (Wr en): `0` – memory write is **disabled**

Since `ADDI` is an arithmetic instruction that does not perform any memory access, the **Data Memory** remains inactive.

5. Write Back (WB)
![CleanShot 2025-10-12 at 22.07.06@2x](https://hackmd.io/_uploads/B1nGUNFagx.png)
- Destination register: `0x02` → **x2**
- Write data: `0x7FFFFFE0`
- Write enable: active (`1`)

Thus, the value `0x7FFFFFE0` is written back into **register x2**, updating its contents.
## Problem C

> [Full RV32I Assembly](https://github.com/geumbo/ca2025-quizzes/blob/main/q1-bfloat16.s)

The implementation successfully runs on Ripes and passes all test suites, with a total cycle count of **10,186**.

![CleanShot 2025-10-13 at 09.44.54@2x](https://hackmd.io/_uploads/HkwnYCtpel.png)

### Optimize normalization process by replacing loop with `clz`

In the `bf16_add` function, when two numbers with different signs are subtracted, the resulting mantissa (`result_mant`) may become denormalized (its MSB is not set). The original code uses a while loop to correct this, repeatedly left-shifting the mantissa until it is normalized. The goal is to replace this iterative search with a direct, constant-time calculation using the `clz` function.

Original C Code:
```c=
if (!result_mant)
    return BF16_ZERO();
while (!(result_mant & 0x80)) {
    result_mant <<= 1;
    if (--result_exp <= 0)
        return BF16_ZERO();
}
```

Direct RV32I Assembly Translation:
```c=
normalize_loop:
    # while (!(result_mant & 0x80))
    andi t0, s8, 0x80
    bne t0, x0, build_result
    slli s8, s8, 1             # result_mant <<= 1
    addi s7, s7, -1            # result_exp--
    # if (--result_exp <= 0)
    bge x0, s7, return_zero
    j normalize_loop
```

The number of left shifts required for normalization equals the number of leading zeros within the mantissa's 8-bit window. Since a normalized mantissa has its MSB at bit 7, a 32-bit `clz` sees exactly 24 leading zeros for such a value, so we can calculate the shift in a single step:
$$\text{shift\_amount} = \text{clz}(\text{result\_mant}) - 24$$

- If `result_exp <= shift_amount`, normalization would underflow → return zero.
- Otherwise, `result_exp -= shift_amount; result_mant <<= shift_amount;`.

This replaces the iterative normalization loop with a constant-time normalization process.
Optimized C Code:
```c=
if (!result_mant)
    return BF16_ZERO();
int shift_amount = clz(result_mant) - 24;
if (result_exp <= shift_amount)
    return BF16_ZERO();
result_exp -= shift_amount;
result_mant <<= shift_amount;
```

Optimized RV32I Assembly:
```c=
beq s8, x0, return_zero        # if (!result_mant) return BF16_ZERO()
mv a0, s8                      # a0 = result_mant
jal ra, clz                    # a0 = clz(result_mant)
addi t0, a0, -24               # t0 = shift_amount = clz(result_mant) - 24
ble s7, t0, return_zero        # if (result_exp <= shift_amount) return zero
sub s7, s7, t0                 # result_exp -= shift_amount
sll s8, s8, t0                 # result_mant <<= shift_amount
j build_result
```

## Reference

- [How to Find the Sum of a Geometric Series](https://www.geeksforgeeks.org/maths/how-to-find-the-sum-of-a-geometric-series/)
- [Floor log2 n in O(1) for C++](https://codeforces.com/blog/entry/89220)
- [2017q3 Homework4 (improving clz)](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz)