# Assignment 1: RISC-V Assembly and Instruction Pipeline
contributed by < [`jin11109`](https://github.com/jin11109/ca2025-quizzes) >
> This assignment was completed with assistance from [ChatGPT](https://chatgpt.com/) for grammar checking, translation and initial brainstorming. All final analysis and conclusions are my own.
## [Problem B](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B) from Quiz1
### Abstract
Problem B discusses a logarithmic 8-bit codec that maps 20-bit unsigned integers to 8-bit symbols through logarithmic quantization.
In my work, I optimized the `clz` (count leading zeros) function and translated all the C code from Problem B into RISC-V assembly. Additionally, I applied several low-level optimizations, such as branchless operations, to improve performance.
### Inspiration
>We used ChatGPT to understand the compilation chain behind GCC's `__builtin_clz()` and to locate the software fallback implementation that GCC emits when a target CPU does not provide a hardware CLZ instruction.
#### CLZ optimization
CLZ is a performance-critical primitive in many low-level algorithms. Because this primitive can be implemented either by a single hardware instruction or by a software fallback, I prioritize optimizing its software path when hardware support is absent.
GCC exposes a family of [built-in helpers](https://gcc.gnu.org/onlinedocs/gcc/Bit-Operation-Builtins.html) (e.g. `__builtin_clz()`). If the target CPU lacks a dedicated CLZ instruction, the compiler will arrange for a library helper to be used at runtime. My approach is to inspect the software fallback implementations provided by GCC and adapt or optimize them for our target.
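As a small illustration (a hypothetical snippet; on an RV32I target without the Zbb extension the builtin cannot be expanded inline, so GCC routes it to the libgcc helper `__clzsi2`):
```c
#include <stdint.h>

unsigned leading_zeros(uint32_t x) {
    /* Undefined for x == 0; expands to a native clz where available,
       otherwise to a call into libgcc. */
    return __builtin_clz(x);
}
```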
#### Tracing GCC's `__builtin_clz()`
When the frontend encounters `__builtin_clz()` it lowers that call to an internal helper (for example, `clzsi2`). If the backend can emit a native `clz` instruction, the backend will expand the helper into that instruction. Otherwise the generated code will call a libgcc helper such as `__clzSI2`, which lives in [`gcc/libgcc/libgcc2.c`](https://github.com/gcc-mirror/gcc/blob/master/libgcc/libgcc2.c). An example snippet from that file looks like:
```c
#ifdef L_clzsi2
#undef int
int
__clzSI2 (UWtype x)
{
Wtype ret;
count_leading_zeros (ret, x);
return ret;
}
#endif
```
In [`gcc/include/longlong.h`](https://github.com/gcc-mirror/gcc/blob/master/include/longlong.h), GCC defines `count_leading_zeros()` in an architecture-specific way. Targets that provide a native CLZ (for example RISC-V with the Zbb extension) typically define the macro to use the builtin or direct instruction:
```c
#ifdef __riscv_zbb
#if W_TYPE_SIZE == 32
#define count_leading_zeros(COUNT, X) ((COUNT) = __builtin_clz (X))
#define count_trailing_zeros(COUNT, X) ((COUNT) = __builtin_ctz (X))
#define COUNT_LEADING_ZEROS_0 32
#endif /* W_TYPE_SIZE == 32 */
```
Finally, if no architecture provides a `count_leading_zeros()` macro, the header falls back to a generic software implementation at the end of the file — this is the fallback routine I want to study and optimize:
```c
#if !defined (count_leading_zeros)
#define count_leading_zeros(count, x) \
do { \
UWtype __xr = (x); \
UWtype __a; \
\
if (W_TYPE_SIZE <= 32) \
{ \
__a = __xr < ((UWtype)1<<2*__BITS4) \
? (__xr < ((UWtype)1<<__BITS4) ? 0 : __BITS4) \
: (__xr < ((UWtype)1<<3*__BITS4) ? 2*__BITS4 : 3*__BITS4); \
} \
else \
{ \
for (__a = W_TYPE_SIZE - 8; __a > 0; __a -= 8) \
if (((__xr >> __a) & 0xff) != 0) \
break; \
} \
\
(count) = W_TYPE_SIZE - (__clz_tab[__xr >> __a] + __a); \
} while (0)
#define COUNT_LEADING_ZEROS_0 W_TYPE_SIZE
#endif
```
This fallback locates the highest non-zero byte (or scans by bytes for larger word sizes), then consults a small lookup table (`__clz_tab`) for the leading-zero count of that byte and adds the byte-offset to produce the full count. That approach reduces per-bit work compared with naive bit-by-bit scanning and is a good starting point for further micro-optimizations.
#### Rewriting the Problem B `clz` function
```c
const uint8_t clz_tab[256] = {
0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
static inline unsigned clz(uint32_t x) {
uint32_t a; // used for offset
if (x < (1u << 16)) { // 0x0000_XXXX
if (x < (1u << 8)) { // 0x0000_00XX
a = 0;
} else { // 0x0000_XX00
a = 8;
}
} else { // 0xXXXX_0000
if (x < (1u << 24)) { // 0x00XX_0000
a = 16;
} else { // 0xXX00_0000
a = 24;
}
}
/**
* clz_tab is used to get the position of leading one of the selected 8-bit
* block(XX)
*/
return 32 - (clz_tab[x >> a] + a);
}
```
This version of `clz` uses only two branches and one table lookup, whereas the original version uses 10 branches to find the result.
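A quick sanity check of the rewritten routine (a standalone sketch using the `clz()` and `clz_tab[]` defined above):
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumes clz() and clz_tab[] from the listing above are in scope. */
int main(void) {
    /* 0x00012345: highest set bit is bit 16, so 31 - 16 = 15 leading zeros. */
    assert(clz(0x00012345u) == 15);
    /* 0x000F1234: highest set bit is bit 19, so 12 leading zeros. */
    assert(clz(0x000F1234u) == 12);
    /* Smallest nonzero input: bit 0 set, 31 leading zeros. */
    assert(clz(1u) == 31);
    printf("clz sanity checks passed\n");
    return 0;
}
```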
## Solution
### Idea for problem solving
1. **Version 1**: After optimizing the `clz` function, I perform a direct, line-by-line translation of the original C code into RISC-V assembly to establish a correctness baseline.
2. **Version 2**: Based on observations of the Ripes pipeline, I apply low-level assembly optimizations such as instruction scheduling and branchless implementations to improve throughput and reduce pipeline stalls.
### Version 1 (Naive):
**UF8 Decode:**
```asm
uf8_decode:
andi t0, a0, 0x0f   # mantissa = fl & 0x0f
srli t1, a0, 4      # exponent = fl >> 4
li t2, 0x7fff
li t3, 15
sub t3, t3, t1
srl t2, t2, t3      # (1 << exponent) - 1
slli t2, t2, 4      # offset = ((1 << exponent) - 1) << 4
sll a0, t0, t1      # mantissa << exponent
add a0, a0, t2      # result = (mantissa << exponent) + offset
jr ra
```
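In C terms, the routine corresponds to the following (reconstructed from the assembly above; not necessarily the verbatim Problem B source):
```c
#include <stdint.h>

static uint32_t uf8_decode(uint8_t fl) {
    uint32_t mantissa = fl & 0x0f;
    uint32_t exponent = fl >> 4;
    uint32_t offset = (0x7fffu >> (15 - exponent)) << 4; /* ((1 << e) - 1) * 16 */
    return (mantissa << exponent) + offset;
}
```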
**UF8 Encode:**
This naive version of the assembly does not consider the pipeline and simply translates the C code line by line. The only optimization kept here is inlining the `clz` function.
```asm
uf8_encode:
li t0, 16
blt a0, t0, return1 # if (value < 16) return value
# Find appropriate exponent using CLZ hint
clz:
li t0, 0x10000
bgt a0, t0, 1f
li t0, 0x100
bgt a0, t0, 3f
li t1, 0
j 2f
3:
li t1, 8
j 2f
1:
li t0, 0x1000000
bgt a0, t0, 3f
li t1, 16
j 2f
3:
li t1, 24
2:
li t0, 32
la t2, clz_tab
srl t3, a0, t1
add t2, t2, t3
lb t2, 0(t2)
sub t0, t0, t2
sub t0, t0, t1
clz_done:
addi t0, t0, -31
sub t0, x0, t0 # msb = 31 - lz
# Start from a good initial guess
addi t1, x0, 0 # exponent = 0
addi t2, x0, 0 # overflow = 0
li t3, 5
blt t0, t3, estimate_end # if msb < 5 then estimate_end
addi t1, t0, -4 # exponent = msb - 4
li t3, 15
ble t1, t3, skip_set_exp # if exponent <= 15
li t1, 15 # exponent = 15
skip_set_exp:
# Calculate overflow for estimated exponent
li t3, 0
for_loop:
bge t3, t1, for_loop_end # e < exponent
slli t2, t2, 1 # overflow <<= 1
addi t2, t2, 16 # overflow += 16
addi t3, t3, 1 # e++
j for_loop
for_loop_end:
# Adjust if estimate was off
while_loop:
ble t1, x0, while_loop_end
bge a0, t2, while_loop_end
addi t2, t2, -16 # overflow += -16
srli t2, t2, 1 # overflow >>= 1
addi t1, t1, -1
j while_loop
while_loop_end:
estimate_end:
# Find exact exponent
li t3, 15
while_exact_exp:
bge t1, t3, return2 # while (exponent < 15)
slli t4, t2, 1
addi t4, t4, 16
blt a0, t4, return2 # if (value < next_overflow) break;
mv t2, t4
addi t1, t1, 1
j while_exact_exp
return1:
jr ra
return2:
sub t3, a0, t2
srl t3, t3, t1
slli a0, t1, 4
or a0, a0, t3
jr ra
```
**CLZ lookup table:**
```asm
.data
clz_tab:
.byte 0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5,
.byte 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
.byte 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
.byte 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
.byte 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
.byte 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
.byte 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
```
**Main Test Function:**
Recall that the maximum absolute error for UF8 is defined as $\Delta_{\max} = 2^e - 1$. Based on this property, I designed a verification test: for any given input value, after being encoded and then decoded by the UF8 codec, the absolute error of the reconstructed result should not exceed $\Delta_{\max}$. If the observed error exceeds this bound, either the encoder or the decoder is implemented incorrectly.
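Expressed in C, each test case performs roughly this check (a sketch; the function name is mine, and `uf8_encode()`/`uf8_decode()` are the Problem B routines):
```c
#include <stdbool.h>
#include <stdint.h>

static bool check_roundtrip(uint32_t value, uint32_t max_abs_err) {
    uint32_t decoded = uf8_decode(uf8_encode(value));
    uint32_t err = decoded > value ? decoded - value : value - decoded;
    return err <= max_abs_err;
}
```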
```asm
.data
# test data, max absolute error (2^e - 1)
test_case1: .word 0x3, 0
test_case2: .word 0x50, 0x3
test_case3: .word 0x12345, 0xfff
test_case4: .word 0xf1234, 0x7fff
test_case_end:
msg_pass: .asciz "All tests passed\n"
msg_failed: .asciz "Failed\n"
```
```asm
main:
# Prologue
addi sp, sp, -24
sw ra, 20(sp)
sw s0, 16(sp)
sw s1, 12(sp)
sw s2, 8(sp)
sw s3, 4(sp)
sw s4, 0(sp)
la s2, test_case1 # test case start address
la s3, test_case_end
addi s4, x0, 1 # init result of all cases
# Encode then decode each value, and check that the absolute error of the
# result does not exceed the maximum absolute error.
test_loop:
beq s2, s3, done
lw s0, 0(s2)
mv a0, s0
jal uf8_encode
jal uf8_decode
lw s1, 4(s2)
# Find absolute error
sub s0, a0, s0
bgez s0, 1f
sub s0, x0, s0
1:
ble s0, s1, 1f
andi s4, s4, 0
1:
# next_case
addi s2, s2, 8
j test_loop
done:
# Print result
bnez s4, print_pass
la a0, msg_failed
li a7, 4
ecall
j 1f
print_pass:
la a0, msg_pass
li a7, 4
ecall
1:
# Epilogue
lw ra, 20(sp)
lw s0, 16(sp)
lw s1, 12(sp)
lw s2, 8(sp)
lw s3, 4(sp)
lw s4, 0(sp)
addi sp, sp, 24
j end_of_file
```
#### Ripes RISC-V simulator
All test cases pass in Ripes.


Through observation of the pipeline behavior, I found that each iteration of the loop causes a pipeline flush due to the branch instruction.
To address this issue, I applied optimizations in Version 2 to eliminate or reduce branch dependencies. For example, in the `uf8_encode()` function, the `for` loop used to calculate overflow for the estimated exponent was identified as a major source of control-flow stalls, as shown below:

### Version 2:
- **Branchless operation and unrolling**
**`for` loop in `uf8_encode()`:**
Recall the C source and the corresponding assembly from Version 1.
```c
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
```
```asm
# Calculate overflow for estimated exponent
li t3, 0
for_loop:
bge t3, t1, for_loop_end # e < exponent
slli t2, t2, 1 # overflow <<= 1
addi t2, t2, 16 # overflow += 16
addi t3, t3, 1 # e++
j for_loop
for_loop_end:
```
Through inspection of the loop, we observe the recurrence:
\begin{gather*}\text{overflow}_{n+1} = (\text{overflow}_n \ll 1) + 16\end{gather*}
| exponent | overflow |
| -------- | -------- |
| 0 | overflow |
| 1 | (overflow << 1) + 16 |
| 2 | (overflow << 2) + (16 << 1) + 16 |
| 3 | (overflow << 3) + (16 << 2) + (16 << 1) + 16 |
| n | (overflow << n) + 16 * ((1 << n) - 1) |
Because RV32I does not provide a hardware multiply instruction, computing the term $16 \times ((1 \ll n) - 1)$ with a multiply would require either a software multiply routine (extra branches and overhead) or a loop. Instead, we compute the term purely with shifts.
Therefore the branchless closed-form calculation is:
\begin{gather*}\text{overflow}_n = (\text{overflow}_0 \ll n) + (((1 \ll n) - 1) \ll 4)\end{gather*}
This can then be written in C and assembly:
```c
/* Calculate overflow for estimated exponent */
overflow = (overflow << exponent) + (((1 << exponent) - 1) << 4);
```
```asm
# Calculate overflow for estimated exponent
li t3, 1
sll t3, t3, t1
addi t3, t3, -1
slli t3, t3, 4
sll t2, t2, t1
add t2, t2, t3
```
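As a cross-check, a small standalone program can compare the original loop against the closed form for every exponent and a few initial overflow values (a sketch):
```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    const uint32_t inits[] = {0, 16, 48};
    for (unsigned k = 0; k < 3; k++) {
        for (uint32_t exponent = 0; exponent <= 15; exponent++) {
            /* Loop form from Version 1. */
            uint32_t loop_overflow = inits[k];
            for (uint32_t e = 0; e < exponent; e++)
                loop_overflow = (loop_overflow << 1) + 16;
            /* Closed form used in Version 2. */
            uint32_t closed = (inits[k] << exponent) + (((1u << exponent) - 1) << 4);
            assert(loop_overflow == closed);
        }
    }
    return 0;
}
```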
- **Instruction Scheduling**
After completing the section [Ripes simulator: instruction-level operations](#Ripes-simulator:-instruction-level-operations), I realized that instruction scheduling can be applied to reduce CPU **stalls** caused by data hazards.
Consider the final part of the `clz` function, which involves the lookup table and the computation of the return value:
```asm
li t0, 32
la t2, clz_tab
srl t3, a0, t1
add t2, t2, t3
lb t2, 0(t2)
sub t0, t0, t2
sub t0, t0, t1
```
As discussed in the final section, `sub t0, t0, t2` must **stall** because it depends on the value of `t2` loaded by the preceding `lb`. By applying instruction scheduling, we can reorder instructions to hide this load-use latency. Specifically, swapping:
```asm
sub t0, t0, t2
sub t0, t0, t1
```
so that `sub t0, t0, t1` executes before `sub t0, t0, t2` lets the CPU do useful work while the `lb` completes.
This simple reordering removes the load-use stall and improves execution efficiency without changing the program's semantics.
**Before** instruction scheduling.

**After** instruction scheduling.

#### Ripes RISC-V simulator
After further optimization, the total cycle count drops from `544` to `344`.

All test cases still pass.

## [Problem C](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C) from Quiz 1
### Abstract
Problem C involves multiple operations on the bfloat16 data type.
In my work, I first studied the hardware multiplier and derived several insights based on its behavior. After that, I translated all the C code from Problem C into RISC-V assembly. Based on observations from the Ripes simulator, I focused on the most computationally expensive operations and applied several low-level optimizations to improve performance.
### Inspiration
I found that the implementation of the `bf16_sqrt()` operation uses **multiplication** and **division** inside a binary search, as shown below:
```c
uint32_t low = 90; /* Min sqrt (roughly sqrt(128)) */
uint32_t high = 256; /* Max sqrt (roughly sqrt(512)) */
uint32_t result = 128; /* Default */
/* Binary search for square root of m */
while (low <= high) {
uint32_t mid = (low + high) >> 1;
uint32_t sq = (mid * mid) / 128; /* Square and scale */
if (sq <= m) {
result = mid; /* This could be our answer */
low = mid + 1;
} else {
high = mid - 1;
}
}
```
Since these operations are more computationally expensive than others, I first investigated how to implement fast multiplication. Because RV32I does not include the mul instruction, multiplication must be implemented in software.
(The division by 128 in the binary search can be replaced with a right shift, >> 7, so only multiplication needs to be optimized.)
I began by studying [Hardware Multipliers](https://web.archive.org/web/20000818145531/http://www.andraka.com/multipli.htm#Booth%20Recoding) to identify techniques that could be implemented in assembly — such as look-up table multipliers or scaling accumulator multipliers.
However, in this binary search, the variable `mid` always lies between `low` and `high` (i.e., from 90 to 256). Moreover, the operation is a **square** operation, and the result is always divided by `128`. Therefore, a full 32-bit multiplier is unnecessary. Instead, we can precompute and store the results for these 167 possible values in a lookup table, inspired by the concept of look-up table multipliers.
```c
uint16_t sq_table[167] = {
63, 64, 66, 67, 69, 70, 72, 73, 75, 76, 78, 79, 81, 82,
84, 86, 87, 89, 91, 92, 94, 96, 98, 99, 101, 103, 105, 106,
108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134,
136, 138, 140, 142, 144, 146, 148, 150, 153, 155, 157, 159, 162, 164,
166, 168, 171, 173, 175, 178, 180, 182, 185, 187, 190, 192, 195, 197,
200, 202, 205, 207, 210, 212, 215, 217, 220, 223, 225, 228, 231, 233,
236, 239, 242, 244, 247, 250, 253, 255, 258, 261, 264, 267, 270, 273,
276, 279, 282, 285, 288, 291, 294, 297, 300, 303, 306, 309, 312, 315,
318, 321, 325, 328, 331, 334, 338, 341, 344, 347, 351, 354, 357, 361,
364, 367, 371, 374, 378, 381, 385, 388, 392, 395, 399, 402, 406, 409,
413, 416, 420, 424, 427, 431, 435, 438, 442, 446, 450, 453, 457, 461,
465, 468, 472, 476, 480, 484, 488, 492, 496, 500, 504, 508, 512};
```
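The entries can be reproduced with a short generator (a sketch; each entry is `(mid * mid) / 128` for `mid` from 90 to 256):
```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Entry i of sq_table holds (mid * mid) >> 7 for mid = i + 90. */
    for (uint32_t mid = 90; mid <= 256; mid++)
        printf("%u%s", (mid * mid) >> 7, mid == 256 ? "\n" : ", ");
    return 0;
}
```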
```c
uint32_t low = 90; /* Min sqrt (roughly sqrt(128)) */
uint32_t high = 256; /* Max sqrt (roughly sqrt(512)) */
uint32_t result = 128; /* Default */
/* Binary search for square root of m */
while (low <= high) {
uint32_t mid = (low + high) >> 1;
uint32_t sq = sq_table[mid - 90]; /* Square and scale */
if (sq <= m) {
result = mid; /* This could be our answer */
low = mid + 1;
} else {
high = mid - 1;
}
}
```
The lookup table requires 334 bytes of storage.
However, the benefit is that it completely eliminates the need for a hardware or software multiplier, relying only on one table lookup and one addition instruction.
>Because I want to evaluate the benefit of this optimization, I will include it in Version 2, while Version 1 will use a general 32-bit software multiplier.
## Solution
### Idea for problem solving
1. **Version 1**: Translate the C code line by line as a baseline implementation.
2. **Version 2**: Replace the `bf16_sqrt()` multiplier with the lookup-table approach.
Based on the Ripes pipeline observation, apply assembly-level optimizations such as instruction scheduling and branchless operations.
### Version 1 (Naive):
**f32_to_bf16** and **bf16_to_f32**:
```asm
# ----------------------------------------
# Function: f32_to_bf16
# Input: a0 = float
# Output: a0 = bf16
# ----------------------------------------
f32_to_bf16:
srli t0, a0, 23
andi t0, t0, 0xff
li t1, 0xff
bne t0, t1, 1f
srli a0, a0, 16
li t4, 0xffff
and a0, t4, a0
jr ra
1:
li t1, 0x7fff
srli t0, a0, 16
andi t0, t0, 1
add t0, t0, t1
add a0, a0, t0
srli a0, a0, 16
jr ra
# ----------------------------------------
# Function: bf16_to_f32
# Input: a0 = bf16
# Output: a0 = float
# ----------------------------------------
bf16_to_f32:
slli a0, a0, 16
jr ra
```
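For reference, the rounding path of `f32_to_bf16` above corresponds to this C (a sketch operating on the raw IEEE-754 bit pattern; the function name is mine):
```c
#include <stdint.h>

static uint16_t f32_bits_to_bf16(uint32_t bits) {
    if (((bits >> 23) & 0xff) == 0xff)      /* Inf/NaN: just truncate */
        return (uint16_t)(bits >> 16);
    /* Round to nearest even: add 0x7fff plus the LSB of the kept part. */
    uint32_t round = ((bits >> 16) & 1) + 0x7fff;
    return (uint16_t)((bits + round) >> 16);
}
```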
**bf16_add**:
```asm
# ----------------------------------------
# Function: bf16_add
# Input: a0 = augend, a1 = addend
# Output: a0 = sum
# ----------------------------------------
bf16_add:
# Prologue
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 20(sp)
sw s1, 16(sp)
sw s2, 12(sp)
sw s3, 8(sp)
sw s4, 4(sp)
sw s5, 0(sp)
li t0, 0xffff
and a0, a0, t0
and a1, a1, t0
# Find sign, exponent and mantissa
srli t0, a0, 15
srli t1, a1, 15
andi s0, t0, 1
andi s1, t1, 1
srli t0, a0, 7
andi s2, t0, 0xff
srli t0, a1, 7
andi s3, t0, 0xff
andi s4, a0, 0x7f
andi s5, a1, 0x7f
# Handle special cases
li t0, 0xff
# Test if exp_a == 0xff
bne s2, t0, 1f
beqz s4, 2f
j add_return
2:
bne s3, t0, 2f
bnez s5, 3f
beq s0, s1, 3f
li t6, 0x7FC0
add a0, t6, x0
j add_return
3:
add a0, a1, x0
j add_return
2:
j add_return
1:
# Test if exp_b == 0xff
bne s3, t0, 1f
add a0, a1, x0
j add_return
1:
# Test if a is zero
bnez s2, 1f
bnez s4, 1f
add a0, a1, x0
j add_return
1:
# Test if b is zero
bnez s3, 1f
bnez s5, 1f
j add_return
1:
# Normalize mantissas
beqz s2, 1f
ori s4, s4, 0x80
1:
beqz s3, 1f
ori s5, s5, 0x80
1:
# Align exponents
sub t0, s2, s3
ble t0, x0, 1f
add t2, s2, x0
li t6, 8
ble t0, t6, 2f
j add_return
2:
srl s5, s5, t0
j 3f
1:
bge t0, x0, 1f
add t2, s3, x0
li t6, -8
bge t0, t6, 2f
add a0, a1, x0
j add_return
2:
sub t6, x0, t0
srl s4, s4, t6
j 3f
1:
add t2, x0, s2
3:
# Perform addition or subtraction depending on sign
bne s0, s1, 1f
add t1, s0, x0
add t3, x0, s4
add t3, t3, s5
andi t6, t3, 0x100
beqz t6, add_end
srli t3, t3, 1
addi t2, t2, 1
li t6, 0xff
blt t2, t6, add_end
slli t1, t1, 15
li t6, 0x7f80
or a0, t6, t1
j add_return
1:
blt s4, s5, 2f
add t1, x0, s0
sub t3, s4, s5
j 3f
2:
add t1, x0, s1
sub t3, s5, s4
3:
bnez t3, 1f
li a0, 0x0
j add_return
1:
andi t6, t3, 0x80
bnez t6, add_end
slli t3, t3, 1
addi t2, t2, -1
bgt t2, x0, 1b
li a0, 0x0
j add_return
# Pack final result (sign | exponent | mantissa)
add_end:
slli a0, t1, 15
andi t2, t2, 0xff
slli t2, t2, 7
andi t3, t3, 0x7f
or a0, t2, a0
or a0, t3, a0
j add_return
# Epilogue
add_return:
lw ra, 24(sp)
lw s0, 20(sp)
lw s1, 16(sp)
lw s2, 12(sp)
lw s3, 8(sp)
lw s4, 4(sp)
lw s5, 0(sp)
addi sp, sp, 28
jr ra
```
**bf16_sub**:
```asm
# ----------------------------------------
# Function: bf16_sub
# Input: a0 = minuend, a1 = subtrahend
# Output: a0 = difference
# ----------------------------------------
bf16_sub:
li t0, 0x8000
xor a1, a1, t0
j bf16_add # tail call: return bf16_add(a, b with its sign flipped)
```
**bf16_mul**:
```asm
# ----------------------------------------
# Function: bf16_mul
# Input: a0 = multiplicand, a1 = multiplier
# Output: a0 = product
# ----------------------------------------
bf16_mul:
# Prologue
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 20(sp)
sw s1, 16(sp)
sw s2, 12(sp)
sw s3, 8(sp)
sw s4, 4(sp)
sw s5, 0(sp)
li t0, 0xffff
and a0, a0, t0
and a1, a1, t0
# Find sign, exponent and mantissa
srli t0, a0, 15
srli t1, a1, 15
andi s0, t0, 1
andi s1, t1, 1
srli t0, a0, 7
andi s2, t0, 0xff
srli t0, a1, 7
andi s3, t0, 0xff
andi s4, a0, 0x7f
andi s5, a1, 0x7f
xor s0, s0, s1
# Handle special cases
# If exp_a is 0xff
li t6, 0xff
bne s2, t6, 1f
beqz s4, 2f
j mul_return
2:
bnez s3, 2f
bnez s5, 2f
li a0, 0x7fc0
j mul_return
2:
li a0, 0x7f80
slli s0, s0, 15
or a0, s0, a0
j mul_return
1:
# If exp_b is 0xff
bne s3, t6, 1f
beqz s5, 2f
add a0, a1, x0
j mul_return
2:
bnez s2, 2f
bnez s4, 2f
li a0, 0x7fc0
j mul_return
2:
li a0, 0x7f80
slli s0, s0, 15
or a0, s0, a0
j mul_return
1:
# Test if a or b is zero
bnez s2, 1f
bnez s4, 1f
j 2f
1:
bnez s3, 1f
bnez s5, 1f
2:
slli a0, s0, 15
j mul_return
1:
# Normalize operands
li s1, 0
bnez s2, 1f
3:
andi t0, s4, 0x80
bnez t0, 3f
slli s4, s4, 1
addi s1, s1, -1
j 3b
3:
li s2, 1
j 2f
1:
ori s4, s4, 0x80
2:
bnez s3, 1f
3:
andi t0, s5, 0x80
bnez t0, 3f
slli s5, s5, 1
addi s1, s1, -1
j 3b
3:
li s3, 1
j 2f
1:
ori s5, s5, 0x80
2:
# Multiply mantissas using integer arithmetic
mv a0, s4
mv a1, s5
jal mul32u
mv s4, a0
# Compute result exponent with bias adjustment
add t0, s2, s3
addi t0, t0, -127
add s1, s1, t0
# Normalize mantissa result and handle possible overflow
li t0, 0x8000
and t0, t0, s4
beqz t0, 1f
srli s4, s4, 8
andi s4, s4, 0x7f
addi s1, s1, 1
j 2f
1:
srli s4, s4, 7
andi s4, s4, 0x7f
2:
# Handle overflow and underflow for exponent
li t6, 0xff
blt s1, t6, 1f
li a0, 0x7f80
slli s0, s0, 15
or a0, a0, s0
j mul_return
1:
bgtz s1, 1f
li t6, -6
bge s1, t6, 2f
slli a0, s0, 15
j mul_return
2:
li t0, 1
sub t0, t0, s1
srl s4, s4, t0
li s1, 0
1:
# Pack final result (sign | exponent | mantissa)
slli a0, s0, 15
andi s1, s1, 0xff
slli s1, s1, 7
andi s4, s4, 0x7f
or a0, a0, s1
or a0, a0, s4
j mul_return
# Epilogue
mul_return:
lw ra, 24(sp)
lw s0, 20(sp)
lw s1, 16(sp)
lw s2, 12(sp)
lw s3, 8(sp)
lw s4, 4(sp)
lw s5, 0(sp)
addi sp, sp, 28
jr ra
```
**bf16_div**:
```asm
# ----------------------------------------
# Function: bf16_div
# Input: a0 = dividend, a1 = divisor
# Output: a0 = quotient
# ----------------------------------------
bf16_div:
# Prologue
addi sp, sp, -24
sw ra, 20(sp)
sw s0, 16(sp)
sw s1, 12(sp)
sw s2, 8(sp)
sw s3, 4(sp)
sw s4, 0(sp)
li t0, 0xffff
and a0, a0, t0
and a1, a1, t0
# Find sign, exponent and mantissa
srli t0, a0, 15
srli t1, a1, 15
andi t0, t0, 1
andi t1, t1, 1
xor s0, t0, t1
srli t0, a0, 7
andi s1, t0, 0xff
srli t0, a1, 7
andi s2, t0, 0xff
andi s3, a0, 0x7f
andi s4, a1, 0x7f
li t6, 0xff
beq s2, t6, div_return1
# Handle special cases
bnez s2, 1f
bnez s4, 1f
j div_return2
1:
beq s1, t6, div_return3
seqz t0, s1
seqz t1, s3
and t0, t0, t1
bnez t0, div_return4
# Normalize mantissas
beqz s1, 1f
ori s3, s3, 0x80
1:
beqz s2, 1f
ori s4, s4, 0x80
1:
# Perform fixed-point division
slli s3, s3, 15
li t0, 0
# Bitwise long division loop
li t1, 0
li t2, 16
2:
bge t1, t2, 2f
slli t0, t0, 1
addi t3, t2, -1
sub t3, t3, t1
sll t4, s4, t3
blt s3, t4, 1f
sub s3, s3, t4
ori t0, t0, 1
1:
addi t1, t1, 1
j 2b
2:
# Adjust exponent based on input exponents and bias
sub t1, s1, s2
addi t1, t1, 127
bnez s1, 1f
addi t1, t1, -1
1:
bnez s2, 1f
addi t1, t1, 1
1:
# Normalize quotient and align to 8-bit mantissa format
li t2, 0x8000
and t3, t2, t0
beqz t3, 1f
srli t0, t0, 8
j 2f
1:
li t4, -1
and t3, t2, t0
xor t5, t4, t3
li t6, 1
slt t6, t6, t1
and t6, t6, t5
bnez t6, 1f
slli t0, t0, 1
addi t1, t1, -1
1:
srli t0, t0, 8
2:
andi t0, t0, 0x7f
# Handle overflow and underflow conditions
li t2, 0xff
blt t1, t2, 1f
slli a0, s0, 15
li t2, 0x7f80
or a0, a0, t2
j div_return
1:
bgt t1, x0, 1f
slli a0, s0, 15
j div_return
1:
# Pack final result (sign | exponent | mantissa)
slli a0, s0, 15
andi t1, t1, 0xff
slli t1, t1, 7
andi t0, t0, 0x7f
or a0, a0, t1
or a0, a0, t0
j div_return
# Return result of special cases
div_return1:
beqz s4, 1f
add a0, x0, a1
j div_return
1:
bne s1, t6, 1f
bnez s3, 1f
li a0, 0x7fc0
j div_return
1:
slli a0, s0, 15
j div_return
div_return2:
bnez s1, 1f
bnez s3, 1f
li a0, 0x7fc0
j div_return
1:
li t2, 0x7f80
slli a0, s0, 15
or a0, a0, t2
j div_return
div_return3:
beqz s3, 1f
j div_return
1:
li t2, 0x7f80
slli a0, s0, 15
or a0, t2, a0
j div_return
div_return4:
slli a0, s0, 15
j div_return
# Epilogue
div_return:
lw ra, 20(sp)
lw s0, 16(sp)
lw s1, 12(sp)
lw s2, 8(sp)
lw s3, 4(sp)
lw s4, 0(sp)
addi sp, sp, 24
jr ra
```
**bf16_sqrt**:
```asm
# ----------------------------------------
# Function: bf16_sqrt
# Input: a0 = num
# Output: a0 = square root
# ----------------------------------------
bf16_sqrt:
# Prologue
addi sp, sp, -20
sw ra, 16(sp)
sw s0, 12(sp)
sw s1, 8(sp)
sw s2, 4(sp)
sw s3, 0(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a0, 7
andi s1, s1, 0xff
andi s2, a0, 0x7f
# handle special cases
li t6, 0xff
bne s1, t6, 1f
beqz s2, 2f
j sqrt_return
2:
beqz s0, 2f
li a0, 0x7fc0
j sqrt_return
2:
j sqrt_return
1:
# sqrt(+/-0) = 0
bnez s1, 1f
bnez s2, 1f
li a0, 0x0
j sqrt_return
1:
# sqrt of negative number is NaN
beqz s0, 1f
li a0, 0x7fc0
j sqrt_return
1:
# Flush denormals to zero
bnez s1, 1f
li a0, 0x0
j sqrt_return
1:
addi s1, s1, -127
ori s2, s2, 0x80
andi t0, s1, 1
beqz t0, 1f
slli s2, s2, 1
addi s0, s1, -1
srai s0, s0, 1
addi s0, s0, 127
j 2f
1:
srai s0, s1, 1
addi s0, s0, 127
2:
# Binary search for square root
li t0, 90
li t1, 256
li t2, 128
1:
bgt t0, t1, 1f
add t3, t0, t1
srli t3, t3, 1
addi sp, sp, -20
sw ra, 16(sp)
sw t0, 12(sp)
sw t1, 8(sp)
sw t2, 4(sp)
sw t3, 0(sp)
mv a0, t3
mv a1, t3
jal mul32u
srli t4, a0, 7
lw ra, 16(sp)
lw t0, 12(sp)
lw t1, 8(sp)
lw t2, 4(sp)
lw t3, 0(sp)
addi sp, sp, 20
bgt t4, s2, 2f
mv t2, t3
addi t0, t3, 1
j 1b
2:
addi t1, t3, -1
j 1b
1:
# Normalize to ensure result is in
li t6, 256
blt t2, t6, 1f
srli t2, t2, 1
addi s0, s0, 1
j 2f
1:
li t6, 128
li t5, 1
bge t2, t6, 2f
3:
bge t2, t6, 2f
ble s0, t5, 2f
slli t2, t2, 1
addi s0, s0, -1
j 3b
2:
# Extract 7-bit mantissa (remove implicit 1)
andi t2, t2, 0x7f
# Check for overflow/underflow
li t6, 0xff
blt s0, t6, 1f
li a0, 0x7f80
j sqrt_return
1:
bgtz s0, 1f
li a0, 0x0
j sqrt_return
1:
# Pack final result (sign | exponent | mantissa)
andi a0, s0, 0xff
slli a0, a0, 7
or a0, a0, t2
j sqrt_return
# Epilogue
sqrt_return:
lw ra, 16(sp)
lw s0, 12(sp)
lw s1, 8(sp)
lw s2, 4(sp)
lw s3, 0(sp)
addi sp, sp, 20
jr ra
```
**mul32u**:
```asm
# ----------------------------------------
# Function: mul32u
# Input: a0 = multiplicand
# a1 = multiplier
# Output: a0 = 32-bit product
# ----------------------------------------
mul32u:
li t1, 0
mul_loop:
beqz a1, mul_done
andi t0, a1, 1
beqz t0, mul_skip_add
add t1, t1, a0
mul_skip_add:
slli a0, a0, 1
srli a1, a1, 1
j mul_loop
mul_done:
mv a0, t1
jr ra
```
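For reference, `mul32u` is a plain shift-and-add multiplier; in C it corresponds to roughly:
```c
#include <stdint.h>

static uint32_t mul32u(uint32_t a, uint32_t b) {
    uint32_t product = 0;
    while (b) {               /* examine multiplier bits from LSB up */
        if (b & 1)
            product += a;     /* add the shifted multiplicand when the bit is set */
        a <<= 1;
        b >>= 1;
    }
    return product;           /* low 32 bits of the product */
}
```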
**Main test function**:
For each test case, two floating-point numbers and their expected results after every operation are provided. The program first uses `f32_to_bf16()` to convert the `float` values into the `bf16` format and then performs each operation. After that, the result is converted back to the `float` type for comparison with the expected answer.
If the results of all operations are correct, the program prints **“Passed”** for that case; otherwise, it prints **“Failed.”** Therefore, seeing four “Passed” messages indicates that all test cases have passed successfully.
```asm
.data
# test data1, test data2, div result, add result, mul result, sqrt result
# 25.0, 5.0, 5.0, 30.0, 125.0, 5.0
test_case1: .word 0x41c80000, 0x40a00000, 0x40a00000, 0x41f00000, 0x42fa0000, 0x40a00000
# -2.5, -5, 0.5, -7.5, 12.5, NaN
test_case2: .word 0xc0200000, 0xc0a00000, 0x3f000000, 0xc0f00000, 0x41480000, 0x7fc00000
# -4.0, 0, -inf, -4.0, -0, NaN
test_case3: .word 0xc0800000, 0x00000000, 0xff800000, 0xc0800000, 0x80000000, 0x7fc00000
# inf, inf, NaN, inf, inf, inf
test_case4: .word 0x7f800000, 0x7f800000, 0x7fc00000, 0x7f800000, 0x7f800000, 0x7f800000
test_case_end:
msg_pass: .asciz "Passed\n"
msg_failed: .asciz "Failed\n"
.text
main:
# Prologue
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 20(sp)
sw s1, 16(sp)
sw s2, 12(sp)
sw s3, 8(sp)
sw s4, 4(sp)
sw s5, 0(sp)
la s3, test_case1 # test case start address
la s4, test_case_end
addi s5, x0, 1 # init result of all cases
test_loop:
beq s3, s4, done
lw a0, 0(s3)
jal f32_to_bf16
add s0, a0, x0
lw a0, 4(s3)
jal f32_to_bf16
add s1, a0, x0
# Test div operation
add a0, s0, x0
add a1, s1, x0
jal bf16_div
jal bf16_to_f32
lw s2, 8(s3)
xor t0, a0, s2
seqz t0, t0
and s5, s5, t0
# Test add operation
add a0, s0, x0
add a1, s1, x0
jal bf16_add
jal bf16_to_f32
lw s2, 12(s3)
xor t0, a0, s2
seqz t0, t0
and s5, s5, t0
# Test mul operation
add a0, s0, x0
add a1, s1, x0
jal bf16_mul
jal bf16_to_f32
lw s2, 16(s3)
xor t0, a0, s2
seqz t0, t0
and s5, s5, t0
# Test sqrt operation
add a0, s0, x0
jal bf16_sqrt
jal bf16_to_f32
lw s2, 20(s3)
xor t0, a0, s2
seqz t0, t0
and s5, s5, t0
# Print result
bnez s5, print_pass
j print_failed
print_pass:
la a0, msg_pass
li a7, 4
ecall
j next_case
print_failed:
la a0, msg_failed
li a7, 4
ecall
next_case:
addi s3, s3, 24
j test_loop
done:
# Epilogue
lw ra, 24(sp)
lw s0, 20(sp)
lw s1, 16(sp)
lw s2, 12(sp)
lw s3, 8(sp)
lw s4, 4(sp)
lw s5, 0(sp)
addi sp, sp, 28
j end_of_file
```
#### Ripes RISC-V simulator
All test cases pass in Ripes.


### Version 2:
- **Remove multiplication from `bf16_sqrt()`**
As mentioned in the **Inspiration** section, I implemented this optimization in RISC-V assembly. Recall the binary search in **Version 1**: because the algorithm calls `mul32u`, it is necessary to save caller-saved registers beforehand. This additional step increases the execution time and makes the operation more costly.
```asm
# Binary search for square root
li t0, 90
li t1, 256
li t2, 128
1:
bgt t0, t1, 1f
add t3, t0, t1
srli t3, t3, 1
addi sp, sp, -20
sw ra, 16(sp)
sw t0, 12(sp)
sw t1, 8(sp)
sw t2, 4(sp)
sw t3, 0(sp)
mv a0, t3
mv a1, t3
jal mul32u
srli t4, a0, 7
lw ra, 16(sp)
lw t0, 12(sp)
lw t1, 8(sp)
lw t2, 4(sp)
lw t3, 0(sp)
addi sp, sp, 20
bgt t4, s2, 2f
mv t2, t3
addi t0, t3, 1
j 1b
2:
addi t1, t3, -1
j 1b
1:
```
Now we use `sq_table` instead:
```asm
.data
sq_table: .half 63, 64, 66, 67, 69, 70, 72, 73, 75, 76, 78, 79, 81, 82
.half 84, 86, 87, 89, 91, 92, 94, 96, 98, 99, 101, 103, 105, 106
.half 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134
.half 136, 138, 140, 142, 144, 146, 148, 150, 153, 155, 157, 159, 162, 164
.half 166, 168, 171, 173, 175, 178, 180, 182, 185, 187, 190, 192, 195, 197
.half 200, 202, 205, 207, 210, 212, 215, 217, 220, 223, 225, 228, 231, 233
.half 236, 239, 242, 244, 247, 250, 253, 255, 258, 261, 264, 267, 270, 273
.half 276, 279, 282, 285, 288, 291, 294, 297, 300, 303, 306, 309, 312, 315
.half 318, 321, 325, 328, 331, 334, 338, 341, 344, 347, 351, 354, 357, 361
.half 364, 367, 371, 374, 378, 381, 385, 388, 392, 395, 399, 402, 406, 409
.half 413, 416, 420, 424, 427, 431, 435, 438, 442, 446, 450, 453, 457, 461
.half 465, 468, 472, 476, 480, 484, 488, 492, 496, 500, 504, 508, 512
```
```asm
# Binary search for square root
li t0, 90
li t1, 256
li t2, 128
1:
bgt t0, t1, 1f
add t3, t0, t1
srli t3, t3, 1
# Load square and scale from table
addi t4, t3, -90
slli t4, t4, 1
la t6, sq_table
add t4, t4, t6
lh t4, 0(t4)
bgt t4, s2, 2f
mv t2, t3
addi t0, t3, 1
j 1b
2:
addi t1, t3, -1
j 1b
1:
```
- **`for` loop in `bf16_div()`**
When running the Version 1 code in Ripes, we found that the `for` loop in `bf16_div()` consumes a significant amount of execution time.
The loop in Version 1 can be expressed as:
```asm
# Bitwise long division loop
li t1, 0
li t2, 16
2:
bge t1, t2, 2f
slli t0, t0, 1
addi t3, t2, -1
sub t3, t3, t1
sll t4, s4, t3
blt s3, t4, 1f
sub s3, s3, t4
ori t0, t0, 1
1:
addi t1, t1, 1
j 2b
2:
```
and the corresponding C source code:
```c
for (int i = 0; i < 16; i++) {
quotient <<= 1;
if (dividend >= (divisor << (15 - i))) {
dividend -= (divisor << (15 - i));
quotient |= 1;
}
}
```
We can implement a **branchless version** using the `slt` instruction.
Since `slt` returns `1` or `0`, subtracting `1` from the result produces a mask (`0x0` for true, `0xFFFF_FFFF` for false) that can be used to replace the branch:
```asm
# Bitwise long division loop
li t1, 0
li t2, 16
li t5, 15
2:
bge t1, t2, 2f
slli t0, t0, 1
sub t3, t5, t1 # 15 - i
sll t4, s4, t3 # divisor << (15 - i)
slt t6, s3, t4
addi t6, t6, -1 # mask for branch
and t4, t4, t6
andi t3, t6, 1
sub s3, s3, t4 # dividend -= (divisor << (15 - i));
or t0, t0, t3 # quotient |= 1;
addi t1, t1, 1
j 2b
2:
```
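In C terms, the branchless body corresponds to roughly the following (a sketch; the function name is mine, and the mask plays the role of the `slt`/`addi` pair):
```c
#include <stdint.h>

static uint32_t divide_mantissa(uint32_t dividend, uint32_t divisor) {
    uint32_t quotient = 0;
    for (int i = 0; i < 16; i++) {
        quotient <<= 1;
        uint32_t shifted = divisor << (15 - i);
        uint32_t mask = (dividend < shifted) ? 0u : ~0u; /* all-ones when we subtract */
        dividend -= shifted & mask;
        quotient |= mask & 1;
    }
    return quotient;
}
```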
Moreover, since this loop has a fixed number of iterations, it is a good candidate for **loop unrolling** to further improve performance:
```asm
# Bitwise long division loop
li t5, 15
# first loop
slli t0, t0, 1
addi t3, t5, 0 # change here for 0~-15
sll t4, s4, t3
slt t6, s3, t4
addi t6, t6, -1
and t4, t4, t6
andi t3, t6, 1
sub s3, s3, t4
or t0, t0, t3
# second loop
slli t0, t0, 1
addi t3, t5, -1
sll t4, s4, t3
slt t6, s3, t4
addi t6, t6, -1
and t4, t4, t6
andi t3, t6, 1
sub s3, s3, t4
or t0, t0, t3
.
.
.
# final loop
slli t0, t0, 1
addi t3, t5, -15
sll t4, s4, t3
slt t6, s3, t4
addi t6, t6, -1
and t4, t4, t6
andi t3, t6, 1
sub s3, s3, t4
or t0, t0, t3
```
#### Ripes RISC-V simulator
After further optimization, the total cycle count drops from `2941` to `2075`.

All test cases still pass.

## Ripes simulator: instruction-level operations
Take `lb t2, 0(t2)` in the `clz` function as an example.
```asm
li a0, 0x00100000
clz:
addi t0, x0, 1
slli t0, t0, 16
bgt a0, t0, 1f
srli t0, t0, 8 # t0 = 0x100
bgt a0, t0, 3f
li t1, 0
j 2f
3:
li t1, 8
j 2f
1:
slli t0, t0, 8 # t0 = 0x1000000
bgt a0, t0, 3f
li t1, 16
j 2f
3:
li t1, 24
2:
li t0, 32
la t2, clz_tab
srl t3, a0, t1
add t2, t2, t3
lb t2, 0(t2)
sub t0, t0, t2
sub a0, t0, t1
clz_done:
```
### Program functionality
The `clz` routine computes the number of leading zeros of a 32-bit integer.
1. It uses a short sequence of compare-and-branch operations to locate the highest non-zero byte and derive the byte offset.
2. It then indexes the lookup table `clz_tab` with that byte to obtain the position of its leading one.
3. Finally, the result is computed and stored in `a0`.
### Instruction-level operations
We focus on the `lb t2, 0(t2)` instruction and analyze its pipeline behavior, memory access, and hazard handling:
1. Instruction Fetch (IF)
- The processor fetches the instruction from instruction memory and places it in the IF/ID pipeline register.
- The fetch stage does not yet execute the instruction; it only retrieves the binary encoding.
2. Instruction decode (ID):
- In this stage, the processor decodes the instruction: identifying source and destination registers, immediate values, and control signals.
- These decoded values are prepared for the EX stage.
3. Execute (EX):
- The `lb` instruction computes the effective address by adding the base register (`t2`, i.e. `x7`) and the immediate offset (`0`).
- Because the base register was just produced by the previous instruction (`add x7, x7, x28`, i.e. `add t2, t2, t3`), **data forwarding** lets the EX stage use the latest value **without stalling**.
- This is done via a forwarding circuit connecting MEM/WB results back to the EX stage.
- The ALU then adds the forwarded base value to the immediate offset to form the effective address.
4. Memory (MEM):
- The memory stage performs the byte load from the calculated address.
- We can see that the EX stage stalls at this point: the next instruction, `sub x5, x5, x7` (i.e. `sub t0, t0, t2`), depends on `x7`, but `x7` is not available until the data-memory read completes, so forwarding cannot deliver it to the EX stage in time.
- Memory read results are stored in the MEM/WB pipeline register for the write-back stage.

5. Write-back (WB):
- The loaded byte is written back to the destination register `x7`.
- Thanks to the forwarding paths, subsequent instructions depending on `x7` can access the updated value without additional stalls.