# Assignment 1: RISC-V Assembly and Instruction Pipeline
contributed by < [geumbo](https://github.com/geumbo/ca2025-quizzes) >
> [!Note] AI tools usage
> I used ChatGPT to fix code issues and enhance the readability of my writing.
## Problem B
### Optimize overflow calculation from uf8_encode
The original C code calculates the `overflow` using a for loop. The execution time of this loop is directly proportional to the value of the input `exponent`.
Original C Code:
```c=
uint32_t overflow = 0;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
    overflow = (overflow << 1) + 16;
```
Direct RV32I Assembly Translation:
```c=
mv t2, x0 # t2 = overflow = 0
mv t3, x0 # t3 = e = 0
calc_overflow_loop:
bge t3, t1, adjust # if e >= exp, goto adjust (t1 = exp)
slli t2, t2, 1 # overflow << 1
addi t2, t2, 16 # overflow = (overflow << 1) + 16
addi t3, t3, 1 # e + 1
j calc_overflow_loop
```
Let $O_k$ be the value of `overflow` after the $k$-th iteration. The relation is:
$$O_k = 2 \times O_{k-1} + 16$$
The initial condition is $O_0 = 0$, representing the value before the first iteration.
- After 1 iteration (exponent = 1):
$O_1 = 2 \times O_0 + 16 = 2 \times 0 + 16 = 16$
- After 2 iterations (exponent = 2): $O_2 = 2 \times O_1 + 16 = 2 \times (16) + 16 = 16 \times 2^1 + 16 \times 2^0$
- After 3 iterations (exponent = 3):
$O_3 = 2 \times O_2 + 16 = 2 \times (16 \times 2^1 + 16 \times 2^0) + 16 = 16 \times 2^2 + 16 \times 2^1 + 16 \times 2^0$
We can generalize the formula for `n` iterations:
$$O_n = 16 \times 2^{n-1} + 16 \times 2^{n-2} + \dots + 16 \times 2^1 + 16 \times 2^0$$
We can factor out the constant `16` and express the pattern using summation notation:
$$O_n = 16 \times (2^{n-1} + 2^{n-2} + \dots + 2^1 + 2^0) = 16 \times \sum_{i=0}^{n-1} 2^i
$$
The summation part, $\sum_{i=0}^{n-1} 2^i$, is a standard geometric series with:
- First term, $a = 2^0 = 1$
- Common ratio, $r = 2$
- Number of terms, $n$
The formula for the sum of a geometric series is:
$$S_n = a \frac{r^n - 1}{r - 1}$$
Substituting our values into the formula:
$$S_n = 1 \times \frac{2^n - 1}{2 - 1} = 2^n - 1$$
We substitute this simplified sum back into our expression for $O_n$:
$$O_n = 16 \times (2^n - 1)$$
The derived formula can be implemented with just a few efficient bitwise operations, completely eliminating the need for a loop.
Optimized C Code:
```c=
uint32_t overflow = ((1u << exponent) - 1u) << 4;
```
Optimized RV32I Assembly:
```c=
calc_overflow:
li t2, 1
sll t2, t2, t1 # t2 = 1 << exponent
addi t2, t2, -1 # t2 = (1 << exponent) - 1
slli t2, t2, 4 # overflow = ((1 << exponent) - 1) << 4
j adjust
```
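As a quick sanity check, here is a minimal C sketch (a hypothetical test harness, not part of the submitted code) that compares the original loop against the closed form for every possible `exponent` value:
```c=
#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (uint8_t exponent = 0; exponent < 16; exponent++) {
        /* Original loop-based computation */
        uint32_t loop_overflow = 0;
        for (uint8_t e = 0; e < exponent; e++)
            loop_overflow = (loop_overflow << 1) + 16;

        /* Closed form: 16 * (2^exponent - 1) */
        uint32_t formula_overflow = ((1u << exponent) - 1u) << 4;

        assert(loop_overflow == formula_overflow);
    }
    return 0;
}
```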
#### Evaluation (Ripes 5-stage processor)
Original method cycles: 46,709
> Commit: [2bfcc07](https://github.com/geumbo/ca2025-quizzes/commit/2bfcc07986bb5cdd45197d389d668ef0b5b2825e)

Improved method cycles: 34,147 (-26.9%)
> Commit: [e21b192](https://github.com/geumbo/ca2025-quizzes/commit/e21b192c41da0320b2e3c8483c94719511aae158)

#### Conclusion
The original loop has a time complexity of $O(n)$, where `n` is the value of `exponent`. Its execution time grows linearly with the input. In contrast, the optimized version executes a fixed number of instructions, achieving a constant time complexity of $O(1)$.
The optimized version also results in more compact machine code. It eliminates the need for instructions related to loop management, replacing them with simple arithmetic/logical instructions.
### Optimize exponent calculation from uf8_encode
We already established the relationship between `overflow` and `exponent` as:
$$\text{overflow} = 16 \times (2^{\text{exponent}} - 1)$$
The encoding rule requires finding the largest `exponent` such that `value` is greater than or equal to the corresponding `overflow`. This means we need to solve the following inequality for `exponent`:
$$\text{value} \ge 16 \times (2^{\text{exponent}} - 1)$$
1. Divide both sides by 16:
$$\frac{\text{value}}{16} \ge 2^{\text{exponent}} - 1$$
2. Add 1 to both sides:
$$\frac{\text{value}}{16} + 1 \ge 2^{\text{exponent}}$$
3. Take the $\log_2$ of both sides:
$$\log_2(\frac{\text{value}}{16} + 1) \ge \text{exponent}$$
Since we are looking for the **largest integer** `exponent` that satisfies this inequality, the solution is obtained by taking the floor of the left-hand side expression.
$$\text{exponent} = \lfloor \log_2(\frac{\text{value}}{16} + 1) \rfloor$$
For any positive 32-bit integer `x`, the value of `floor(log₂(x))` is equal to `31 - clz(x)`. The base-2 logarithm `log₂(x)` represents the power to which 2 must be raised to obtain `x`. Taking the floor, `floor(log₂(x))`, gives the **index of the most significant set bit (MSB)** in the binary representation of `x`. The function `clz(x)` returns the number of leading zero bits in the 32-bit binary representation of `x`. Therefore, by subtracting this count from 31 (since we are dealing with a 32-bit value), we obtain the position of the MSB, which is exactly `floor(log₂(x))`.
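For illustration, here is a small sketch (assuming GCC/Clang's `__builtin_clz` as a stand-in for the `clz` routine discussed later) that checks the identity `31 - clz(x) == floor(log₂(x))` on a few sample values:
```c=
#include <assert.h>
#include <stdint.h>

/* Reference floor(log2(x)) computed by repeated shifting */
static unsigned floor_log2_ref(uint32_t x)
{
    unsigned r = 0;
    while (x >>= 1)
        r++;
    return r;
}

int main(void)
{
    const uint32_t samples[] = {1, 2, 3, 15, 16, 17, 255, 4096, 0x7fffffffu, 0xffffffffu};
    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        uint32_t x = samples[i];
        /* __builtin_clz is only defined for nonzero inputs */
        assert(31u - (unsigned) __builtin_clz(x) == floor_log2_ref(x));
    }
    return 0;
}
```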
Optimized C Code:
```c=
uf8 uf8_encode(uint32_t value)
{
    if (value < 16)
        return value;
    // Directly calculate the optimal exponent using the formula
    uint8_t exponent = 31 - clz((value >> 4) + 1);
    if (exponent > 15)
        exponent = 15;
    // Use the formulaic approach to calculate the corresponding overflow
    uint32_t overflow = ((1u << exponent) - 1u) << 4;
    uint8_t mantissa = (value - overflow) >> exponent;
    return (exponent << 4) | mantissa;
}
```
Optimized RV32I Assembly:
```c=
encode:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
sw s1, 4(sp)
sw s2, 0(sp)
mv s0, a0 # s0 = a0 = value
li t0, 16 # t0 = 16 (threshold)
bltu s0, t0, encode_return_value # if value < 16, return value
# Calculate exponent = 31 - clz((value >> 4) + 1)
srli a0, s0, 4 # value >> 4
addi a0, a0, 1 # (value >> 4) + 1
jal ra, clz # call clz
li t0, 31
sub s1, t0, a0 # exponent = 31 - clz
# if (exponent > 15) exponent = 15
li t0, 15
bge t0, s1, calc_overflow # if exponent <= 15, no clamp needed
mv s1, t0 # clamp exponent to 15
calc_overflow:
# overflow = ((1u << exponent) - 1u) << 4
li t0, 1
sll t0, t0, s1 # t0 = 1 << exponent
addi t0, t0, -1 # t0 = (1 << exponent) - 1
slli s2, t0, 4 # s2 = overflow = ((1 << exponent) - 1) << 4
calc_mantissa:
# mantissa = (value - overflow) >> exponent
sub t0, s0, s2 # t0 = mantissa = value - overflow
srl t0, t0, s1 # mantissa >> exp
slli s1, s1, 4 # exp << 4
or a0, s1, t0 # return (mantissa | (exp << 4))
j encode_end
encode_return_value:
mv a0, s0 # return value (a0)
encode_end:
lw ra, 12(sp)
lw s0, 8(sp)
lw s1, 4(sp)
lw s2, 0(sp)
addi sp, sp, 16
ret
```
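To convince myself the formula-based encoder behaves like the original search, the sketch below runs a decode/encode round trip over all 256 codes. The `uf8_decode` here is my reading of the format implied by the encoder (mantissa shifted by the exponent, plus the same offset), and `clz` is again replaced by `__builtin_clz`; both are assumptions rather than the assignment's exact sources. For instance, value 100 encodes to exponent 2, mantissa 13, i.e. code `0x2D`.
```c=
#include <assert.h>
#include <stdint.h>

typedef uint8_t uf8;

/* Stand-in for the clz used by uf8_encode (compiler builtin, x must be nonzero) */
static inline unsigned clz(uint32_t x) { return (unsigned) __builtin_clz(x); }

/* uf8_encode exactly as in the optimized C code above */
static uf8 uf8_encode(uint32_t value)
{
    if (value < 16)
        return value;
    uint8_t exponent = 31 - clz((value >> 4) + 1);
    if (exponent > 15)
        exponent = 15;
    uint32_t overflow = ((1u << exponent) - 1u) << 4;
    uint8_t mantissa = (value - overflow) >> exponent;
    return (exponent << 4) | mantissa;
}

/* Assumed decode consistent with the encoder: value = (mantissa << exp) + offset */
static uint32_t uf8_decode(uf8 f)
{
    uint8_t exponent = f >> 4;
    uint8_t mantissa = f & 0x0f;
    uint32_t offset = ((1u << exponent) - 1u) << 4;
    return ((uint32_t) mantissa << exponent) + offset;
}

int main(void)
{
    /* Every 8-bit code should survive a decode/encode round trip */
    for (unsigned b = 0; b < 256; b++)
        assert(uf8_encode(uf8_decode((uf8) b)) == b);
    assert(uf8_encode(100) == 0x2d);
    return 0;
}
```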
#### Evaluation (Ripes 5-stage processor)
Original method cycles: 34,147 (overflow calculation already optimized)
> Commit: [e21b192](https://github.com/geumbo/ca2025-quizzes/commit/e21b192c41da0320b2e3c8483c94719511aae158)

Improved method cycles: 29,753 (-12.9%)
> Commit: [07dea06](https://github.com/geumbo/ca2025-quizzes/commit/07dea06525ee72df8084e8a485da0e5a73115cfd)

#### Conclusion
Through two stages of mathematical optimization, the `uf8_encode` function was transformed from an iterative algorithm into a direct computation. As a result, the total execution cycles decreased from **46,709** to **29,753**, achieving an overall performance improvement of **36.3%**.
### Optimize clz
The `clz` function counts the number of consecutive zero bits starting from the most significant bit (MSB) of a 32-bit integer.
The original C code implements `clz` using a binary search algorithm. It repeatedly halves the search space to quickly locate the most significant '1' bit.
Original C Code:
```c=
static inline unsigned clz(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}
```
Direct RV32I Assembly Translation:
```c=
clz:
li t0, 32 # t0 = n = 32
li t1, 16 # t1 = c = 16
mv t2, a0 # t2 = x
clz_1:
srl t3, t2, t1 # t3 = y = x >> c
beq t3, x0, clz_2 # if y == 0, goto clz_2
sub t0, t0, t1 # n = n - c
mv t2, t3 # x = y
clz_2:
srli t1, t1, 1 # c = c >> 1
bne t1, x0, clz_1 # if c != 0, goto clz_1
sub a0, t0, t2 # return n - x
ret
```
Since the loop runs for a fixed number of iterations (5 times, for `c = 16, 8, 4, 2, 1`), we can **unroll** it into a linear sequence of instructions.
The new algorithm works by checking progressively smaller blocks of bits from the most significant side. If a block is found to be all zeros, its size is added to a counter, and the input value is shifted left to expose the next block of bits for examination.
Optimized C Code:
```c=
static inline unsigned clz(uint32_t x)
{
    if (x == 0) return 32;
    unsigned n = 0;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8; x <<= 8; }
    if ((x >> 28) == 0) { n += 4; x <<= 4; }
    if ((x >> 30) == 0) { n += 2; x <<= 2; }
    if ((int32_t) x >= 0) { n += 1; }
    return n;
}
```
Optimized RV32I Assembly:
```c=
clz:
# if (x == 0) return 32
bne a0, x0, clz_start
li a0, 32
ret
clz_start:
li a1, 0 # a1 = n = 0
srli t0, a0, 16
bne t0, x0, check_8
addi a1, a1, 16 # n += 16
slli a0, a0, 16 # x <<= 16
check_8:
srli t0, a0, 24
bne t0, x0, check_4
addi a1, a1, 8 # n += 8
slli a0, a0, 8 # x <<= 8
check_4:
srli t0, a0, 28
bne t0, x0, check_2
addi a1, a1, 4 # n += 4
slli a0, a0, 4 # x <<= 4
check_2:
srli t0, a0, 30
bne t0, x0, check_1
addi a1, a1, 2 # n += 2
slli a0, a0, 2 # x <<= 2
check_1:
blt a0, x0, clz_end # if (x < 0), MSB is 1, we are done
addi a1, a1, 1 # n += 1
clz_end:
mv a0, a1 # move result to a0 for return
ret
```
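The two C versions can be cross-checked with a short test sketch (hypothetical, not part of the submitted code) that runs both implementations over all single-bit values plus a few edge cases:
```c=
#include <assert.h>
#include <stdint.h>

/* Original binary-search clz (as shown above) */
static unsigned clz_ref(uint32_t x)
{
    int n = 32, c = 16;
    do {
        uint32_t y = x >> c;
        if (y) {
            n -= c;
            x = y;
        }
        c >>= 1;
    } while (c);
    return n - x;
}

/* Unrolled clz (as shown above) */
static unsigned clz_unrolled(uint32_t x)
{
    if (x == 0) return 32;
    unsigned n = 0;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8; x <<= 8; }
    if ((x >> 28) == 0) { n += 4; x <<= 4; }
    if ((x >> 30) == 0) { n += 2; x <<= 2; }
    if ((int32_t) x >= 0) { n += 1; }
    return n;
}

int main(void)
{
    for (int i = 0; i < 32; i++)
        assert(clz_unrolled(1u << i) == clz_ref(1u << i));

    const uint32_t extra[] = {0, 1, 3, 0x80000000u, 0xffffffffu, 0x00ff00ffu};
    for (unsigned i = 0; i < sizeof(extra) / sizeof(extra[0]); i++)
        assert(clz_unrolled(extra[i]) == clz_ref(extra[i]));
    return 0;
}
```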
#### Evaluation (Ripes 5-stage processor)
Original method cycles: 29,753 (overflow and exponent calculations already optimized)
> Commit: [07dea06](https://github.com/geumbo/ca2025-quizzes/commit/07dea06525ee72df8084e8a485da0e5a73115cfd)

Improved method cycles: 25,321 (-14.9%)
> Commit: [d513160](https://github.com/geumbo/ca2025-quizzes/commit/d513160cba15191d32a8d82da732eca3748abe04)

#### Conclusion
Unrolling the fixed five-step search removes loop-control overhead. On the Ripes 5-stage pipeline, cycles decrease from **29,753** to **25,321** (**−14.9%**).
### 5-stage Pipelined Processor
A 5-stage pipelined processor is an architecture that divides instruction execution into five sequential stages: **Fetch**, **Decode**, **Execute**, **Memory Access**, and **Write Back**. Each stage performs a distinct operation on an instruction before passing it to the next stage. By allowing multiple instructions to occupy different stages simultaneously, the pipeline enables overlapping execution: while one instruction is being executed, another can be decoded and a third fetched. This parallelism significantly increases instruction throughput, so the processor completes more instructions per clock cycle.
1. Instruction Fetch (IF)

- The `PC` initially holds the value `0x00000000`, which is used as the address for the **Instruction Memory**.
- At address `0x00000000`, the instruction memory outputs the 32-bit machine code `0xff010113`, which corresponds to the instruction `addi x2, x2, -16`, so the signal `instr` = `0xff010113`.
- The `instr` is sent to the decode stage for further processing in the next pipeline cycle.
- The adder connected to the `PC` automatically increments its value by 4.
- The multiplexer selects the sequential PC update path, since there is no branch or jump in this stage.
2. Instruction Decode (ID)

The instruction `0xff010113` is decoded into four parts (see the decode sketch after this walkthrough):
- `opcode` = `ADDI`
- `R1 idx` = `0x02` → selects **x2** as the source register.
- `Wr idx` = `0x02` → sets **x2** as the destination.
- `imm` = `0xfffffff0` = `-16`
Together, these fields mean:
> Take the value from register **x2**,
> add the immediate constant **−16**,
> and write the result back to **x2**.
- The Registers block outputs the current value of `x2` (`0x7ffffff0`), which will be sent to the next stage (**Execute**) for arithmetic computation.
3. Execute (EX)

Inputs to the ALU:
- **Op1** (Operand 1): `0x7FFFFFF0` — the value read from source register **x2**
- **Op2** (Operand 2): `0xFFFFFFF0` — the sign-extended immediate value **−16**
The **ALU result (Res)**, `0x7FFFFFE0`, along with the destination register index (**x2**), is forwarded to the next pipeline stage (**MEM**) for subsequent memory access or write-back operations.
4. Memory Access (MEM)

- Addr: `0x7FFFFFE0` – forwarded from the ALU result
- Data in: `0x00000000` – no data to be written
- Write enable (Wr en): `0` – memory write is **disabled**
Since `ADDI` is an arithmetic instruction that does not perform any memory access, the **Data Memory** remains inactive.
5. Write Back (WB)

- Destination register: `0x02` → **x2**
- Write data: `0x7FFFFFE0`
- Write enable: active (`1`)
Thus, the value `0x7FFFFFE0` is written back into **register x2**, updating its contents.
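To make the ID stage concrete, here is a minimal C sketch (not part of the assignment code) that extracts the I-type fields from the machine word `0xff010113` and sign-extends the immediate, reproducing the signals listed above:
```c=
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t instr = 0xff010113; /* addi x2, x2, -16 */

    unsigned opcode = instr & 0x7f;          /* bits [6:0]   = 0x13 (OP-IMM) */
    unsigned rd     = (instr >> 7) & 0x1f;   /* bits [11:7]  = 2 -> x2       */
    unsigned funct3 = (instr >> 12) & 0x7;   /* bits [14:12] = 0 -> ADDI     */
    unsigned rs1    = (instr >> 15) & 0x1f;  /* bits [19:15] = 2 -> x2       */
    int32_t imm = (int32_t) (instr >> 20);   /* bits [31:20]                 */
    if (imm & 0x800)                         /* sign-extend the 12-bit field */
        imm -= 0x1000;

    printf("opcode=0x%02x funct3=%u rd=x%u rs1=x%u imm=%d\n",
           opcode, funct3, rd, rs1, imm);
    /* Prints: opcode=0x13 funct3=0 rd=x2 rs1=x2 imm=-16 */
    return 0;
}
```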
## Problem C
> [Full RV32I Assembly](https://github.com/geumbo/ca2025-quizzes/blob/main/q1-bfloat16.s)
The implementation successfully runs on Ripes and passes all test suites, with a total cycle count of **10,186**.

### Optimize normalization process by replacing loop with `clz`
In the `bf16_add` function, when two numbers with different signs are subtracted, the resulting mantissa (`result_mant`) may become denormalized (its MSB is not set). The original code uses a while loop to correct this by repeatedly left-shifting the mantissa until it is normalized. The goal is to replace this iterative search with a direct, constant-time calculation using the `clz` function.
Original C Code:
```c=
if (!result_mant)
    return BF16_ZERO();
while (!(result_mant & 0x80)) {
    result_mant <<= 1;
    if (--result_exp <= 0)
        return BF16_ZERO();
}
```
Direct RV32I Assembly Translation:
```c=
normalize_loop:
# while (!(result_mant & 0x80))
andi t0, s8, 0x80
bne t0, x0, build_result
slli s8, s8, 1 # result_mant <<= 1
addi s7, s7, -1 # result_exp--
# if (--result_exp <= 0)
bge x0, s7, return_zero
j normalize_loop
```
The number of left shifts required for normalization equals the number of leading zeros within the mantissa's 8-bit window. Since `result_mant` occupies the low 8 bits of a 32-bit word, the top 24 bits are always zero, so the shift count can be computed in a single step with a 32-bit `clz`:
$$\text{shift\_amount} = \operatorname{clz}(\text{result\_mant}) - 24$$
For example, `result_mant = 0x20` has `clz = 26`, so `shift_amount = 2`, matching the two left shifts the loop would perform.
- If `result_exp <= shift_amount`, normalization would underflow → return zero.
- Otherwise, `result_exp -= shift_amount; result_mant <<= shift_amount;`.
This replaces the iterative normalization loop with a constant-time normalization process.
Optimized C Code:
```c=
if (!result_mant)
    return BF16_ZERO();
int shift_amount = clz(result_mant) - 24;
if (result_exp <= shift_amount)
    return BF16_ZERO();
result_exp -= shift_amount;
result_mant <<= shift_amount;
```
Optimized RV32I Assembly:
```c=
beq s8, x0, return_zero # if (!result_mant) return BF16_ZERO
mv a0, s8 # a0 = result_mant
jal ra, clz # shift_amount = clz(result_mant)
addi t0, a0, -24 # t0 = clz(result_mant) - 24
ble s7, t0, return_zero # if (result_exp <= shift_amount) return zero
sub s7, s7, t0 # result_exp -= shift_amount
sll s8, s8, t0 # result_mant <<= shift_amount
j build_result
```
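As a sanity check on the `shift_amount` formula, the sketch below (a hypothetical test; `BF16_ZERO()` is reduced to a sentinel and `clz` is `__builtin_clz`) compares the original normalization loop against the clz-based version for every 8-bit mantissa and every exponent from 1 to 255, which is the range I assume reaches this code path in `bf16_add`:
```c=
#include <assert.h>
#include <stdint.h>

#define ZERO_SENTINEL (-1) /* stands in for BF16_ZERO() */

/* Original loop-based normalization: returns (exp << 16) | mant, or the sentinel */
static int normalize_loop(int exp, uint32_t mant)
{
    if (!mant)
        return ZERO_SENTINEL;
    while (!(mant & 0x80)) {
        mant <<= 1;
        if (--exp <= 0)
            return ZERO_SENTINEL;
    }
    return (exp << 16) | (int) mant;
}

/* clz-based normalization using shift_amount = clz(mant) - 24 */
static int normalize_clz(int exp, uint32_t mant)
{
    if (!mant)
        return ZERO_SENTINEL;
    int shift = __builtin_clz(mant) - 24;
    if (exp <= shift)
        return ZERO_SENTINEL;
    exp -= shift;
    mant <<= shift;
    return (exp << 16) | (int) mant;
}

int main(void)
{
    for (int exp = 1; exp <= 255; exp++)
        for (uint32_t mant = 0; mant <= 0xff; mant++)
            assert(normalize_loop(exp, mant) == normalize_clz(exp, mant));
    return 0;
}
```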
## Reference
[How to Find the Sum of Geometric Series](https://www.geeksforgeeks.org/maths/how-to-find-the-sum-of-a-geometric-series/)
[Floor log2 n in O(1) for C++](https://codeforces.com/blog/entry/89220)
[2017q3 Homework4 (改善 clz)](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/Bk-uxCYxz)