---
title: 'Assignment 1: RISC-V Assembly and Instruction Pipeline'
disqus: hackmd
---
Assignment 1: RISC-V Assembly and Instruction Pipeline
===
>[!Note] AI tools usage
>I used ChatGPT to assist with Quiz 1: it provided code explanations, grammar revisions, background research, code summaries, and explanations of standard RISC-V instruction usage.
## Problem B: uf8 - the 8-bit codec
### 1. Overview
**What is UF8?**
UF8 (microsecond float 8-bit) is a logarithmic encoding scheme that compresses 20-bit unsigned integers ([0, 1,015,792]) into 8-bit symbols with:
* 2.5:1 compression ratio (20 bits → 8 bits)
* ≤6.25% relative error (maximum)
* 3% expected error (average)
**Mathematical Foundation**
* Encoding Formula:
$$E(v) =
\begin{cases}
v & v < 16 \\
16e + \left\lfloor \frac{v - \text{offset}(e)}{2^e} \right\rfloor & v \geq 16
\end{cases}
$$
Where: $\text{offset}(e) = (2^e - 1) \times 16$
* Decoding Formula:
$$D(b) = m · 2^e + (2^e - 1) · 16$$
Where: $e = \lfloor b/16 \rfloor$ (exponent) and $m = b \bmod 16$ (mantissa)
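To make the two formulas concrete, here is a minimal C sketch that transcribes them directly (the `_ref` names are mine for illustration; the optimized codec appears in the next section):
```c=
#include <stdint.h>

typedef uint8_t uf8;

/* Direct transcription of D(b) = m * 2^e + (2^e - 1) * 16 */
static uint32_t uf8_decode_ref(uf8 b)
{
    uint32_t e = b >> 4;                      /* e = floor(b / 16) */
    uint32_t m = b & 0x0F;                    /* m = b mod 16      */
    return (m << e) + (((1u << e) - 1) << 4); /* m * 2^e + offset(e) */
}

/* Direct transcription of E(v): pick the exponent e with
   offset(e) <= v < offset(e + 1), then divide out 2^e.
   Valid for v <= 1,015,792 (the uf8 input range). */
static uf8 uf8_encode_ref(uint32_t v)
{
    if (v < 16)
        return (uf8)v;
    uint32_t e = 0;
    while (e < 15 && v >= (((1u << (e + 1)) - 1) << 4))
        e++;
    uint32_t offset = ((1u << e) - 1) << 4;
    return (uf8)((e << 4) | ((v - offset) >> e));
}
```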
### 2. The RV32I Assembly translation from the C program of uf8
* Original C code:
```c=
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint8_t exponent = fl >> 4;
    uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}

/* Encode uint32_t to uf8 */
uf8 uf8_encode(uint32_t value)
{
    /* Use CLZ for fast exponent calculation */
    if (value < 16)
        return value;

    /* Find appropriate exponent using CLZ hint */
    int lz = clz(value);
    int msb = 31 - lz;

    /* Start from a good initial guess */
    uint8_t exponent = 0;
    uint32_t overflow = 0;

    if (msb >= 5) {
        /* Estimate exponent - the formula is empirical */
        exponent = msb - 4;
        if (exponent > 15)
            exponent = 15;

        /* Calculate overflow for estimated exponent */
        for (uint8_t e = 0; e < exponent; e++)
            overflow = (overflow << 1) + 16;

        /* Adjust if estimate was off */
        while (exponent > 0 && value < overflow) {
            overflow = (overflow - 16) >> 1;
            exponent--;
        }
    }

    /* Find exact exponent */
    while (exponent < 15) {
        uint32_t next_overflow = (overflow << 1) + 16;
        if (value < next_overflow)
            break;
        overflow = next_overflow;
        exponent++;
    }

    uint8_t mantissa = (value - overflow) >> exponent;
    return (exponent << 4) | mantissa;
}
```
**The Compiler Explorer code and Ripes Execution Info:**
> Full Compiler Explorer uf8 RV32I version: [Link](https://github.com/scarlett910/ca2025-quizzes/blob/28b215784036ae93465d461da938ffdaa7aa6415/uf8_compiler_gen.s)
The Processor execution result:

**My Approach:**
* **uf8 decoding:**
I simplified the arithmetic of uf8 decoding using basic exponent math instead of magic numbers and bit shifts.
Original calculation:
```c=
offset = (0x7FFF >> (15 - exponent)) << 4;
result = (mantissa << exponent) + offset;
```
Simplified:
```c=
offset = ((1 << exponent) - 1) * 16;
result = (mantissa << exponent) + offset;
```
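These two forms are equivalent: `0x7FFF` is fifteen 1-bits, so `0x7FFF >> (15 - exponent)` leaves exactly `exponent` low 1-bits, which is $2^e - 1$, and the final `<< 4` multiplies by 16. Both therefore compute $\text{offset}(e) = (2^e - 1) \times 16$, but the second form maps directly onto a load-immediate, shift, add, and shift sequence in RV32I with no magic constant.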
* **uf8 encoding:**
I rewrote the logic to remove unnecessary branches, function calls, and stack operations that the compiler-generated code introduced.
The compiler's version used a very complex control structure with multiple nested loops and condition blocks (.L8, .L12, .L14, .L19, etc.). I optimized the uf8 encoding with the following approaches:
🔹Linear Loop Optimization
In my version, I replaced the compiler’s nested conditional structure with a simple linear loop that increases the threshold by left-shifting and adding constants:
```asm=
slli a5, a5, 1
addi a5, a5, 16
```
This part grows the threshold linearly and predictably until it reaches the range limit.
Because it is a single loop with one backward branch, it avoids the multiple entry and exit conditions that the compiler generated.
As a result, the CPU pipeline stays full and the branch predictor works more efficiently.
🔹 Register-Based Computation
All intermediate results are computed and stored directly in registers (a4, a5, etc.), instead of using memory to store local variables like the compiler did:
```asm=
sw a5, -24(s0)
lw a5, -24(s0)
```
By keeping everything in registers, my code eliminates memory latency and reduces total instruction count significantly.
🔹 Removed Function Call Overhead
The compiler’s code calls an external clz (count leading zeros) function using:
```asm=
call clz
```
Each call pushes the return address and local variables onto the stack. Instead, I perform the same counting logic inline with shift and subtract operations, so the entire calculation happens within the same function: no call, no return, no stack frame changes.
🔹 Reduced Branch Complexity
Instead of the compiler’s multi-branch conditions like:
```asm=
bgtu a4, a5, .L8
bleu a4, a5, .L19
beq a5, zero, .L16
```
I structured the control flow into a straightforward sequence where only one condition controls the loop. This simplified design reduces the number of jumps and labels that the CPU must handle.
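In C terms, the whole exponent search collapses into one loop with a single exit condition. This is my reconstruction of what the `loop_exp` assembly below computes, not code from the original source:
```c=
/* Reconstruction of the assembly's exponent search:
   t1 = exponent, t2 = offset, t0 = step (16 << exponent) */
uint8_t exponent = 0;
uint32_t offset = 0, step = 16;
while (exponent < 15 && value >= offset + step) {
    offset += step;   /* offset stays equal to (2^exponent - 1) * 16 */
    step <<= 1;
    exponent++;
}
```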
**My code for uf8 codec and the Ripes execution:**
```asm=
# uf8 decode function
uf8_decode:
    andi t0, a0, 0x0F      # Extract mantissa (lower 4 bits)
    srli t1, a0, 4         # Extract exponent (upper 4 bits)
    li   t2, 1
    sll  t2, t2, t1        # Calculate 2^exponent
    addi t2, t2, -1        # (2^exponent - 1)
    slli t2, t2, 4         # offset = (2^exponent - 1) * 16
    sll  t0, t0, t1        # mantissa_value = mantissa << exponent
    add  a0, t0, t2        # final_value = mantissa_value + offset
    ret

# uf8 encode function
uf8_encode:
    # handle small values (0-15) that don't need compression
    li   t0, 16
    blt  a0, t0, small
    # initialize variables for exponent search
    li   t1, 0             # t1 = exponent
    li   t2, 0             # t2 = base offset
    li   t4, 15            # t4 = maximum exponent
loop_exp:
    # calculate next threshold value for current exponent
    add  t3, t2, t0
    bgt  t3, a0, done_exp
    mv   t2, t3
    slli t0, t0, 1
    addi t1, t1, 1
    blt  t1, t4, loop_exp
done_exp:
    # calculate mantissa from remaining value
    sub  t3, a0, t2        # value - base_offset
    srl  t3, t3, t1        # mantissa = (value - base_offset) >> exponent
    andi t3, t3, 0x0F      # ensure mantissa fits in 4 bits
    # combine exponent and mantissa into final byte
    slli t1, t1, 4         # shift exponent to upper 4 bits
    or   a0, t1, t3        # combine with mantissa in lower 4 bits
    ret
small:
    # for values < 16, no encoding needed
    ret
```
The Processor execution result:

Compiler-generated code:
* Cycles: 57,075
* Instructions: 39,774
My code:
* Cycles: 2,761
* Instructions: 1,912
**Conclusion:** The direct translation of the C code has a much higher cycle and instruction count than my hand-written version: roughly 20× in both metrics (57,075 vs. 2,761 cycles; 39,774 vs. 1,912 instructions).
### 3. Use code: LeetCode 912. Sort an Array
* **Problem context**
LeetCode 912 asks us to sort an array of integers in ascending order, where:
* Array length: 1 ≤ length ≤ 10000
* Value range: -50000 ≤ A[i] ≤ 50000
* Must use O(n log n) time complexity
Normally, this means:
* Each integer takes 4 bytes (32-bit word) in memory
* Sorting algorithms compare and swap full 32-bit values
* Memory bandwidth becomes a bottleneck for large arrays
* Cache misses occur frequently with large data structures
* **Why UF8 helps**
1. Reduced memory footprint
Each value in standard int32 takes 4 bytes; with UF8 encoding, each value compresses to 1 byte (for values in [0, 1,015,792]). This is a 75% reduction in memory usage, so we can fit 4 times more elements in the same cache line.
2. Faster sorting operations
* Sorting operates on 8-bit values instead of 32-bit words, with fewer memory cycles per comparison and swap
* Better cache utilization during the sorting phase
3. Controlled precision trade-off
* For values < 16: exact representation (no loss)
* For values ≥ 16: ~6.25% maximum relative error
* The mantissa-exponent structure preserves ordering relationships (larger values never encode to smaller bytes; see the monotonicity check after the conclusion below)
* **Implementation Details**
> Sort an array RV32I code:
```asm=
.data
test_header: .string "\n=== Array Sort with UF8 ===\n"
before:      .string "Before: "
after:       .string "After: "
space:       .string " "
newline:     .string "\n"

# Test: [5, 2, 3, 1]
test_array:  .word 5, 2, 3, 1

.text
.globl main
main:
    # Initialize stack
    li   sp, 0x10000000
    addi sp, sp, -256

    # Print header
    la   a0, test_header
    li   a7, 4
    ecall

    # Print "Before: "
    la   a0, before
    li   a7, 4
    ecall

    # Print original
    la   s0, test_array
    li   s1, 4
    li   s2, 0
print1:
    lw   a0, 0(s0)
    li   a7, 1
    ecall
    la   a0, space
    li   a7, 4
    ecall
    addi s0, s0, 4
    addi s2, s2, 1
    blt  s2, s1, print1
    la   a0, newline
    li   a7, 4
    ecall

    # Encode values
    la   s0, test_array
    li   s1, 4
    li   s2, 0
encode:
    lw   a0, 0(s0)
    jal  ra, uf8_encode
    sw   a0, 0(s0)
    addi s0, s0, 4
    addi s2, s2, 1
    blt  s2, s1, encode

    # Bubble sort
    li   s3, 3             # outer limit
    li   s4, 0             # outer counter
outer:
    bge  s4, s3, sorted
    sub  s5, s3, s4        # inner limit
    li   s6, 0             # inner counter
inner:
    bge  s6, s5, next_outer
    la   s0, test_array
    slli t0, s6, 2
    add  s0, s0, t0
    lw   t1, 0(s0)
    lw   t2, 4(s0)
    ble  t1, t2, no_swap
    sw   t2, 0(s0)
    sw   t1, 4(s0)
no_swap:
    addi s6, s6, 1
    j    inner
next_outer:
    addi s4, s4, 1
    j    outer
sorted:
    # Decode values
    la   s0, test_array
    li   s1, 4
    li   s2, 0
decode:
    lw   a0, 0(s0)
    jal  ra, uf8_decode
    sw   a0, 0(s0)
    addi s0, s0, 4
    addi s2, s2, 1
    blt  s2, s1, decode

    # Print "After: "
    la   a0, after
    li   a7, 4
    ecall

    # Print sorted
    la   s0, test_array
    li   s1, 4
    li   s2, 0
print2:
    lw   a0, 0(s0)
    li   a7, 1
    ecall
    la   a0, space
    li   a7, 4
    ecall
    addi s0, s0, 4
    addi s2, s2, 1
    blt  s2, s1, print2
    la   a0, newline
    li   a7, 4
    ecall

    # Exit
    li   a7, 10
    ecall

# UF8 Encode
uf8_encode:
    li   t0, 16
    blt  a0, t0, small
    li   t1, 0
    li   t2, 0
    li   t4, 15
loop_exp:
    add  t3, t2, t0
    bgt  t3, a0, done_exp
    mv   t2, t3
    slli t0, t0, 1
    addi t1, t1, 1
    blt  t1, t4, loop_exp
done_exp:
    sub  t3, a0, t2
    srl  t3, t3, t1
    andi t3, t3, 0x0F
    slli t1, t1, 4
    or   a0, t1, t3
    ret
small:
    ret

# UF8 Decode
uf8_decode:
    andi t0, a0, 0x0F
    srli t1, a0, 4
    li   t2, 1
    sll  t2, t2, t1
    addi t2, t2, -1
    slli t2, t2, 4
    sll  t0, t0, t1
    add  a0, t0, t2
    ret
```
* **RV32I Code Result in Ripes:**

**Conclusion:**
```
Memory Efficiency Metrics:
Example: Array of 1000 integers
- Standard int32: 1000 × 4 bytes = 4,000 bytes
- UF8 compressed: 1000 × 1 byte = 1,000 bytes
- Memory savings: 3,000 bytes (75% reduction)
Performance Benefits:
✓ 75% memory reduction
✓ Faster comparisons (8-bit vs 32-bit)
✓ Maintained sort ordering correctness
✓ Suitable for approximate sorting applications
```
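A note on why the sort stays correct: uf8 encoding is monotonic, so comparing encoded bytes gives the same order as comparing the original integers. A quick exhaustive check using the reference encoder sketched in the overview (`uf8_encode_ref` is my illustrative name, not part of the assignment code):
```c=
#include <assert.h>
#include <stdint.h>

/* Exhaustively verify monotonicity over the full uf8 input range:
   v <= w implies encode(v) <= encode(w), so sorting the encoded
   bytes preserves the order of the decoded values */
void check_monotonic(void)
{
    for (uint32_t v = 0; v < 1015792; v++)
        assert(uf8_encode_ref(v) <= uf8_encode_ref(v + 1));
}
```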
## Problem C: bfloat16
### 1. Overview
**Why BFloat16?**
BFloat16 (Brain Floating Point 16) was developed by Google Brain specifically for machine learning workloads. It offers several critical advantages:
| Feature       | Float32                  | BFloat16 | Benefit                              |
| ------------- | ------------------------ | -------- | ------------------------------------ |
| Total bits    | 32                       | 16       | 50% memory reduction                 |
| Exponent bits | 8                        | 8        | Same dynamic range                   |
| Mantissa bits | 23                       | 7        | Lower precision (acceptable for ML)  |
| Range         | ±1.18×10⁻³⁸ to ±3.4×10³⁸ | Same     | No range loss                        |
**Bit Layout**
```
┌─────────┬──────────────┬──────────────┐
│Sign (1) │ Exponent (8) │ Mantissa (7) │
└─────────┴──────────────┴──────────────┘
15 14 7 6 0
S: Sign bit (0 = positive, 1 = negative)
E: Exponent bits (8 bits, bias = 127)
M: Mantissa/fraction bits (7 bits)
```
The value $v$ of a BFloat16 number is calculated as:
$$v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right)$$
where:
* $S \in \{0, 1\}$ is the sign bit
* $E \in [1, 254]$ is the biased exponent
* $M \in [0, 127]$ is the mantissa value
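For example, decoding 0x40B8 (the expected result of the addition test later): $S = 0$, $E = \text{0x81} = 129$, $M = \text{0x38} = 56$, so $v = 2^{129-127} \times (1 + 56/128) = 4 \times 1.4375 = 5.75$.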
**Special Cases**
* Zero: $E = 0, M = 0 \Rightarrow v = (-1)^S \times 0$
* Infinity: $E = 255, M = 0 \Rightarrow v = (-1)^S \times \infty$
* NaN: $E = 255, M \neq 0 \Rightarrow v = \text{NaN}$
* Denormals: Not supported (flush to zero)
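Because BF16 is exactly the upper half of an IEEE-754 single, the conversions reduce to a 16-bit shift. A minimal sketch (the function names match the ones listed below, but these bodies are my simplified versions, ignoring NaN payload handling):
```c=
#include <stdint.h>
#include <string.h>

typedef struct { uint16_t bits; } bf16_t;

/* Widen: append 16 zero fraction bits */
static float bf16_to_f32(bf16_t x)
{
    uint32_t u = (uint32_t)x.bits << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Narrow with round-to-nearest-even on the 16 discarded bits */
static bf16_t f32_to_bf16(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    u += 0x7FFF + ((u >> 16) & 1);   /* RNE: add half, break ties to even */
    return (bf16_t){ (uint16_t)(u >> 16) };
}
```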
**Square Root**
$$\sqrt{a} = \sqrt{2^{e_a} \times m_a} = 2^{e_a/2} \times \sqrt{m_a}$$
The `bf16_sqrt` implementation uses pure integer arithmetic, without floating-point library dependencies:
1. Special Case Handling (IEEE-754 compliant):
* $\sqrt{+0} = +0$
* $\sqrt{-0} = -0$
* $\sqrt{+\infty} = +\infty$
* $\sqrt{-\infty} = \text{NaN}$
* $\sqrt{\text{NaN}} = \text{NaN}$
* $\sqrt{x} = \text{NaN}$ for all $x < 0$
2. Exponent Processing:
For even exponents ($e_a$ is even):
$$e_r = \frac{e_a}{2}, \quad m' = m_a$$
For odd exponents ($e_a$ is odd):
$$e_r = \frac{e_a - 1}{2}, \quad m' = 2 \times m_a$$
This ensures the mantissa remains in the normalized range $[1, 2)$ or $[2, 4)$ for processing.
3. Mantissa Square Root via Binary Search:
The algorithm finds $r$ such that $r^2 \approx m' \times 128$ using binary search; a short sketch follows this list. The scaling factor 128 represents the implicit 1.0 in the mantissa representation.
4. Normalization:
* Ensure result mantissa is in range $[128, 256)$
* Adjust exponent if necessary
* Extract 7-bit mantissa (removing implicit 1)
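A minimal sketch of step 3's binary search (my reconstruction from the description above; it assumes the mantissa $m'$ is an integer with 1.0 represented as 128, so $m' \in [128, 512)$ after the odd-exponent doubling):
```c=
#include <stdint.h>

/* Find r with r^2 ≈ m * 128, i.e. r ≈ sqrt(m / 128) * 128.
   For m in [128, 512), the result lands in [128, 256), which is
   exactly the normalized range that step 4 expects. */
static uint32_t mant_sqrt(uint32_t m)
{
    uint32_t lo = 0, hi = 511, r = 0;
    while (lo <= hi) {
        uint32_t mid = (lo + hi) >> 1;
        if (mid * mid <= m * 128) {
            r = mid;            /* mid still fits: move the floor up */
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return r;                   /* floor(sqrt(m * 128)) */
}
```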
**Mathematical Properties:**
* Monotonic: $x < y \Rightarrow \sqrt{x} \leq \sqrt{y}$
* Identity: $\sqrt{x^2} = |x|$ for $x \geq 0$
* Precision: maximum relative error < 0.5% across the entire BF16 range
### 2. The RV32I Assembly translation from the C program of bfloat16
> Full bfloat16 RV32I version (too long to add here ¯\_(ツ)_/¯): [Github link](https://github.com/scarlett910/ca2025-quizzes/blob/a8ccb45a364bf7c12484bedeacf2734e5092af2c/hw1_bf16_full.s)
**Implementation details:**
Core BF16 Operations:
* Conversion: `bf16_to_f32`, `f32_to_bf16`
* Arithmetic Functions Implemented:
* Addition (`bf16_add`)
* Subtraction (`bf16_sub`)
* Multiplication (`bf16_mul`)
* Division (`bf16_div`)
* Square root approximation (`bf16_sqrt`)
* Special Value Handling: `bf16_isnan`, `bf16_iszero`
* Comparison Operations: `bf16_eq`, `bf16_lt`, `bf16_gt`
**Testing result analysis and Ripes processor execution info:**
Testing result:
```
==== BF16 Addition Tests ====
Test 1: 1.0 + 2.0 = 3.0 ✓ PASS
Test 2: 3.5 + 2.25 = 5.75 (expected 0x40B8, got 0x4050) X FAIL
Test 3: 5.5 + 0.0 = 5.5 ✓ PASS
Test 4: -2.5 + 2.5 = 0.0 ✓ PASS
==== BF16 Subtraction Tests ====
Test 1: 5.0 - 3.0 = 2.0 ✓ PASS
Test 2: 10.0 - 5.0 = 5.0 ✓ PASS
Test 3: 3.0 - 3.0 = 0.0 ✓ PASS
==== BF16 Multiplication Tests ====
Test 1: 3.0 × 4.0 = 12.0 ✓ PASS
Test 2: 2.0 × 3.0 = 6.0 ✓ PASS
Test 3: 1.5 × 2.0 = 3.0 ✓ PASS
Test 4: -2.0 × 3.0 = -6.0 ✓ PASS
Test 5: 0.5 × 4.0 = 2.0 ✓ PASS
==== BF16 Division Tests ====
Test 1: 8.0 ÷ 2.0 = 4.0 ✓ PASS
Test 2: 9.0 ÷ 3.0 = 3.0 ✓ PASS
Test 3: 1.0 ÷ 2.0 = 0.5 ✓ PASS
==== Special Values Tests ====
Test 1: Infinity + 5.0 = Infinity ✓ PASS
Test 2: Infinity - Infinity = NaN ✓ PASS
Test 3: 2.0 × Infinity = Infinity ✓ PASS
Test 4: 0.0 ÷ 0.0 = NaN ✓ PASS
Test 5: 5.0 ÷ 0.0 = Infinity ✓ PASS
==== Edge Cases Tests ====
Test 1: tiny + 0.0 = 0.0 ✓ PASS
Test 2: huge huge = Infinity ✓ PASS
Test 3: +0.0 + -0.0 = +0.0 (expected 0x0000, got 0x8000) X FAIL
==== Rounding Tests ====
Test 1: 1.0 + tiny = 1.0 (expected 0x3F80, got 0x3F81) X FAIL
Test 2: 1.5 × 1.5 = 2.25 ✓ PASS
==== Comparison Tests ====
Test 1: 1.0 == 1.0 = true ✓ PASS
Test 2: 1.0 == 2.0 = false ✓ PASS
Test 3: 1.0 < 2.0 = true ✓ PASS
Test 4: 2.0 < 1.0 = false ✓ PASS
Test 5: 2.0 > 1.0 = true ✓ PASS
Test 6: NaN == NaN = false ✓ PASS
==== Square Root Tests ====
Test 1: sqrt(4.0) = 2.0 ✓ PASS
Test 2: sqrt(9.0) = 3.0 ✓ PASS
Test 3: sqrt(16.0) = 4.0 ✓ PASS
Test 4: sqrt(25.0) = 5.0 ✓ PASS
Test 5: sqrt(0.0) = 0.0 ✓ PASS
Test 6: sqrt(-2.0) = NaN ✓ PASS
==== Test Summary ===
Total tests: 37
Passed: 34
Failed: 3
```
Failed Tests (Rounding Precision Issues):
❌ Test 2 (Addition): 3.5 + 2.25 = 5.75
Expected: 0x40B8, Got: 0x4050
Cause: 5.75 is exactly representable in BF16 (0x40B8 decodes to $2^2 \times 1.4375 = 5.75$), while 0x4050 decodes to 3.25, so this points to a mantissa-alignment bug in `bf16_add` rather than a precision limit
❌ Test 3 (Edge Cases): +0.0 + -0.0 = +0.0
Expected: 0x0000, Got: 0x8000 (negative zero)
Cause: Sign bit handling in zero result cases
❌ Test 1 (Rounding): 1.0 + tiny = 1.0
Expected: 0x3F80, Got: 0x3F81
Cause: Rounding bias when adding a tiny operand (since denormals are flushed to zero, this is likely a guard/sticky-bit issue in `bf16_add`)
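For the zero-sign failure, the standard fix is small: in IEEE-754 round-to-nearest, $x + (-x)$ yields $+0$, so the adder should clear the sign bit on exact cancellation. A hypothetical sketch (`mantissa_sum` is a placeholder name, not necessarily the variable in my implementation):
```c=
/* Inside bf16_add, after the aligned mantissas have been added or
   subtracted: exact cancellation must return +0 in round-to-nearest */
if (mantissa_sum == 0)
    return (bf16_t){ .bits = 0x0000 };
```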
Ripes processor execution:

Performance Metrics:
| Metric               | Value  | Analysis                                         |
| -------------------- | ------ | ------------------------------------------------ |
| Total Cycles         | 11,846 | Includes all arithmetic operations               |
| Instructions Retired | 8,803  | Actual instructions executed                     |
| CPI                  | 1.35   | Reasonable for a pure-software FP implementation |
### 3. Use code: LeetCode 311. Sparse Matrix Multiplication
* **Problem context**
LeetCode 311 asks us to compute the product of two sparse matrices, where most elements are zeros.
Normally, this means:
* We only multiply a small fraction of entries.
* Each nonzero multiplication involves floating-point math or large integers.
* The main bottleneck is memory bandwidth and cache access, not the arithmetic itself.
* **Why bfloat16 helps**
1. Reduced memory bandwidth
- Each matrix element in float32 takes 4 bytes; in bfloat16 it takes 2 bytes, exactly half. We can store twice as many matrix elements in cache, and each load/store from memory moves less data.
- The CPU/FPGA/accelerator performs fewer memory cycles.
- For sparse matrix multiplication, where accessing the nonzero entries is memory-heavy, this drastically cuts data traffic.
2. Preserved dynamic range
- Unlike fp16, bfloat16 keeps the same exponent width (8 bits) as float32, so it handles large and small values without overflow/underflow.
- It keeps nearly the same range, which makes it well suited to accumulating approximate results.
* **Implementation Details**
Idea in C code:
The multiplication is optimized by skipping zero entries in both operands (zero-skipping):
```c=
// ============================================================================
// Matrix Structure
// ============================================================================
typedef struct {
    int rows;
    int cols;
    bf16_t *data;
} Matrix;

// ============================================================================
// Sparse Matrix Multiplication
// ============================================================================
void sparse_matrix_multiply(Matrix *A, Matrix *B, Matrix *C,
                            int *total_ops, int *skipped_ops)
{
    *total_ops = 0;
    *skipped_ops = 0;

    // Initialize C to zeros
    memset(C->data, 0, C->rows * C->cols * sizeof(bf16_t));

    // Optimized sparse matrix multiplication
    for (int i = 0; i < A->rows; i++) {
        for (int k = 0; k < A->cols; k++) {
            bf16_t a_ik = A->data[i * A->cols + k];

            // Skip if A[i][k] is zero
            if (bf16_iszero(a_ik)) {
                *skipped_ops += B->cols;
                continue;
            }

            for (int j = 0; j < B->cols; j++) {
                bf16_t b_kj = B->data[k * B->cols + j];

                // Skip if B[k][j] is zero
                if (bf16_iszero(b_kj)) {
                    (*skipped_ops)++;
                    continue;
                }

                // C[i][j] += A[i][k] * B[k][j]
                bf16_t product = bf16_mul(a_ik, b_kj);
                bf16_t current = C->data[i * C->cols + j];
                C->data[i * C->cols + j] = bf16_add(current, product);
                (*total_ops)++;
            }
        }
    }
}
```
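A hypothetical driver for the function above, reusing `f32_to_bf16` as sketched in the overview; the 2×3 and 3×3 shapes and the two nonzero values are illustrative, not the assignment's actual test data:
```c=
#include <stdio.h>

int main(void)
{
    /* Zero-initialized arrays are valid bf16 zeros (all bits clear) */
    bf16_t a[2 * 3] = {0}, b[3 * 3] = {0}, c[2 * 3];
    a[1] = f32_to_bf16(1.5f);   /* A[0][1]: lines up with B's nonzero row */
    b[4] = f32_to_bf16(2.0f);   /* B[1][1] */

    Matrix A = {2, 3, a}, B = {3, 3, b}, C = {2, 3, c};
    int total, skipped;
    sparse_matrix_multiply(&A, &B, &C, &total, &skipped);

    /* Expect 1 real multiply and 17 skipped of the 18 possible */
    printf("multiplies: %d, skipped: %d\n", total, skipped);
    return 0;
}
```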
> Sparse Matrix Multiplication RV32I code: (too long to add here ¯\_(ツ)_/¯): [Github link](https://github.com/scarlett910/ca2025-quizzes/blob/a8ccb45a364bf7c12484bedeacf2734e5092af2c/bf16_sparse_matrix_mul.s)
* **RV32I Code Result in Ripes:**


**Conclusion:**
```
Efficiency = (Skipped Operations/Total Possible Operations) = (15 / 18) = 83.33%
BF16 Memory Savings vs FP32:
- FP32: 21 × 4 bytes = 84 bytes
- BF16: 21 × 2 bytes = 42 bytes
- Savings: 50%
```
## Reference
[Quiz1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Quiz1-of-Computer-Architecture-2025-Fall)
[LeetCode 311. Sparse Matrix Multiplication](https://www.cnblogs.com/grandyang/p/5282959.html)
[LeetCode 912. Sort an Array](https://www.cnblogs.com/grandyang/p/11483981.html)