# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [jgw0915](https://github.com/jgw0915/Computer-Architecture_HW1) >
## Problem (B) UF8
### UF8 Introduction
- **1. Encoding Mathematics**
- **Encoding Function**
$$
E(v) =
\begin{cases}
v, & \text{if } v < 16 \\
16e + \lfloor (v - \text{offset}(e)) / 2^e \rfloor, & \text{otherwise}
\end{cases}
$$
- **Offset Definition**
$$
\text{offset}(e) = (2^e - 1) \times 16
$$
---
From the formula, we can see that offset(e) is constant for each exponent e.
- **Example 1**
Try v = 16 :
- e = 1
- offset(1) = 16
$$
E(16) = 16 \times 1 + \lfloor (16 - 16) / 2^1 \rfloor = 16
$$
- **Example 2**
Try v = 17 :
- e = 1
- offset(1) = 16
$$
E(17) = 16 \times 1 + \lfloor (17 - 16) / 2^1 \rfloor = 16
$$
→ Both 16 and 17 have the same offset and map to the same encoding value **16**.
---
- **Finding Exponent \( e \)**
$$
\text{offset}(e) \le v < \text{offset}(e+1)
$$
- **Example:**
Try v = 51 :
- offset(1) = 16
- offset(2) = 48
- offset(3) = 112
→ The exponent of 51 is 2
---
- **Mapping Summary**
| Exponent (binary) | Range Start | Encoded Value |
|-------------------|-------------|----------------|
| 0001 | 16 | 16 |
| 0010              | 48          | 32             |
For a range start (an input equal to its offset(e)), the uf8 encoding is $16 \times e$.
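Putting the pieces together, here is a minimal reference encoder that follows the formulas above literally (a sketch; `uf8_offset` and `uf8_encode_ref` are hypothetical names, not the assignment's functions):
```clike=
#include <stdint.h>

/* offset(e) = (2^e - 1) * 16 */
static uint32_t uf8_offset(int e)
{
    return ((1u << e) - 1) * 16;
}

/* E(v): find e with offset(e) <= v < offset(e+1), then
 * E(v) = 16*e + floor((v - offset(e)) / 2^e).           */
static uint8_t uf8_encode_ref(uint32_t v)
{
    if (v < 16)
        return (uint8_t) v;
    int e = 0;
    while (e < 15 && v >= uf8_offset(e + 1))
        e++;
    return (uint8_t) (16 * e + ((v - uf8_offset(e)) >> e));
}
```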
---
- **2. Decoding Mathematics**
- **Decoding Formula**
$$
D(b) = m \times 2^e + (2^e - 1) \times 16
$$
where
$$
e = \lfloor b / 16 \rfloor, \quad m = b \bmod 16
$$
---
- **Example**
Try b = 31 :
- e = 1 , m = 15
$$
D(31) = 15 \times 2^1 + (2^1 - 1) \times 16 = 30 + 16 = 46
$$
⚠️ **Observation:**
$\;\;\;\;$ Values like 46 and 47 share the same encoding (31) and both decode back to the same value 46.
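A direct C transcription of the decoding formula (a sketch; `uf8_decode_ref` is a hypothetical name):
```clike=
#include <stdint.h>

/* D(b) = m * 2^e + (2^e - 1) * 16, with e = floor(b / 16), m = b mod 16 */
static uint32_t uf8_decode_ref(uint8_t b)
{
    uint32_t e = b >> 4;   /* e = floor(b / 16) */
    uint32_t m = b & 0x0f; /* m = b mod 16      */
    return (m << e) + (((1u << e) - 1) << 4);
}
```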
---
- **Proof 1 - (Derive Decoding Formula by Encoding Formula)**
$$
E(v) = 16e + \lfloor (v - (2^e - 1)\times16) / 2^e \rfloor
$$
$$
\rightarrow b = 16e + \lfloor (D(b) - (2^e - 1)\times16) / 2^e \rfloor
$$
$$
\rightarrow D(b) = 2^e \times b - 16 \times e \times 2^e + (2^e - 1)\times16
$$
$$
\rightarrow D(b) = 2^e \times ( b - 16 \times e ) + (2^e - 1)\times16
$$
$$
\rightarrow D(b) = 2^e \times m + (2^e - 1) \times 16
$$
---
- **Proof 1.1 - (Modulo Definition)**
$$
b \bmod 16 = b - 16 \times \lfloor b / 16 \rfloor
$$
---
- **3. Error Analysis**
- **Absolute Error (max)**
- For numbers with the same exponent $e$, all values sharing the same quotient when divided by $2^e$ map to the same decoded value.
Thus, the maximal absolute error within the exponent-$e$ range is the largest number minus the smallest number sharing that quotient, which is:
$$
\text{Max Absolute Error} = 2^{e} - 1
$$
---
- **Relative Error (Max)**
- Theoretically, the maximal relative error is
$$
\text{Relative Error}_{\max}(e) = \frac{\text{Max Absolute Error}}{\text{smallest value in the group, i.e., offset}(e)\text{ (the decoded result)}}
$$
- Applying the formulas for the max absolute error and offset(e):
$$
\text{Relative Error}_{\max} = \frac{2^e - 1}{(2^e - 1)\times16} = \frac{1}{16} = 0.0625
$$
This is the theoretical upper bound; the errors observed in the actual test cases are smaller.
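The bound can also be checked exhaustively; a minimal sketch, assuming the `uf8_offset`/`uf8_encode_ref`/`uf8_decode_ref` reference sketches above are compiled in the same unit:
```clike=
#include <stdint.h>
#include <stdio.h>

/* Exhaustively measure the worst relative error of v -> encode -> decode,
 * using the reference sketches from earlier in this section.              */
int main(void)
{
    double worst = 0.0;
    for (uint32_t v = 16; v < uf8_offset(15); v++) {
        uint32_t d = uf8_decode_ref(uf8_encode_ref(v));
        double rel = (double) (v - d) / (double) v;
        if (rel > worst)
            worst = rel;
    }
    /* Stays below the 0.0625 bound (worst case is (2^e - 1) / (offset(e) + 2^e - 1) = 1/17). */
    printf("max relative error = %f\n", worst);
    return 0;
}
```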
---
### Implementation - CLZ (Count Leading Zero)
#### Source C code in Problem B
- **Algorithm Logic**
- Initialize n = 32, c = 16.
- **Iteration for `c`**
- `16 → 8 → 4 → 2 → 1 → 0`
- **While-loop logic**
- For each `c`, right-shift `x` by `c` bits.
- If `(x >> c) > 0`:
- Subtract `c` from `n` (`n -= c`) to tighten the bound on the number of leading zeros; the subtracted `c` values accumulate across iterations, stepping `n` down from `32`.
- Replace `x = x >> c` so the active bit-width of `x` shrinks accordingly (again, the shifts accumulate across iterations).
- Terminate when `c = 0`.
- Return `n - x`, where:
- `n` is the count of leading zeros plus one (It includes the first non-zero bit).
- `x` becomes `1` if the original input was nonzero, otherwise `x = 0`.
```clike=
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
// printf(" n = %d , x = %u \n", n, x);
return n - x;
}
```
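A quick usage check of the halving search, assuming the `clz` listing above is compiled in the same unit:
```clike=
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* 0x00010000: only bit 16 is set -> 15 leading zeros.
     * Trace: c=16 gives y=1 (n=16, x=1); c=8,4,2,1 give y=0; return 16 - 1 = 15. */
    printf("%u\n", clz(0x00010000u)); /* 15 */
    printf("%u\n", clz(1u));          /* 31 */
    printf("%u\n", clz(0u));          /* 32 (x stays 0, n stays 32) */
    return 0;
}
```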
#### RV32 Assembly - Compiler Generated (rv32-gcc (trunk))
- The following RV32 code for the `clz` function is generated by rv32-gcc (trunk) on [Compiler Explorer](https://godbolt.org/)
- The total code size of the `clz` function is 34 instructions
```gas=
clz:
# a0 -> function arg x on entry; return value on exit.
# ra -> return address (saved to stack).
# s0 -> frame pointer.
# a4, a5 -> temporaries.
# Stack frame (at s0):
# -36(s0) -> x (current working value).
# -20(s0) -> n.
# -24(s0) -> c.
# -28(s0) -> y (temporary).
addi sp,sp,-48 # prologue: make 48B stack frame
sw ra,44(sp) # save ra
sw s0,40(sp) # save s0
addi s0,sp,48 # s0 = frame pointer
sw a0,-36(s0) # x = arg
li a5,32
sw a5,-20(s0) # n = 32
li a5,16
sw a5,-24(s0) # c = 16
# do { ... } while (c);
L3:
lw a5,-24(s0) # a5 = c
lw a4,-36(s0) # a4 = x
srl a5,a4,a5 # a5 = x >> c
sw a5,-28(s0) # y = x >> c
lw a5,-28(s0) # if (y)
beq a5,zero,L2 # skip if y == 0
lw a4,-20(s0) # n -= c;
lw a5,-24(s0)
sub a5,a4,a5
sw a5,-20(s0)
lw a5,-28(s0) # x = y;
sw a5,-36(s0)
L2:
lw a5,-24(s0) # c >>= 1;
srai a5,a5,1
sw a5,-24(s0)
lw a5,-24(s0) # while (c) loop back
bne a5,zero,L3
lw a4,-20(s0) # return n - x;
lw a5,-36(s0)
sub a5,a4,a5
mv a0,a5 # a0 = result
lw ra,44(sp) # epilogue
lw s0,40(sp)
addi sp,sp,48
jr ra
```
#### RV32 Assembly - Optimized by Using Temp Registers
```gas=
clz:
# a0 -> function arg x on entry; return value on exit.
# ra -> return address (saved to stack).
# s0 -> frame pointer.
# t0 -> x (current working value).
# t1 -> n.
# t2 -> c.
# t3 -> y (temporary, store x >> c).
# initialize stack frame
addi sp,sp,-8 # prologue: make 8B stack frame
sw ra,4(sp) # save ra
sw s0,0(sp) # save s0
addi s0,sp,8 # s0 = frame pointer
add t0,x0,a0 # x = arg
addi t1,x0,32 # n = 32
addi t2,x0,16 # c = 16
# do { ... } while (c);
WHILE_LOOP:
srl t3,t0,t2 # y = x >> c
beq t3,x0,SKIP # if (y) skip if
sub t1,t1,t2 # n -= c;
add t0,x0,t3 # x = y;
SKIP:
srai t2,t2,1 # c >>= 1;
bne t2,x0,WHILE_LOOP # while (c) loop
sub a0,t1,t0 # return n - x;
lw ra,4(sp) # epilogue
lw s0,0(sp)
addi sp,sp,8
jr ra
```
---
### Implementation - uf8_decode
#### Source C code in Problem B
```clike=
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
uint32_t mantissa = fl & 0x0f;
uint8_t exponent = fl >> 4;
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
return (mantissa << exponent) + offset;
}
```
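The `(0x7FFF >> (15 - exponent)) << 4` expression is just a bit-shift form of the closed-form offset(e) = (2^e − 1) × 16 from the math section; a minimal sketch to check the equivalence:
```clike=
#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (int e = 0; e <= 15; e++) {
        uint32_t trick  = (uint32_t) (0x7FFF >> (15 - e)) << 4;
        uint32_t closed = ((1u << e) - 1) * 16;
        assert(trick == closed); /* both equal offset(e) */
    }
    return 0;
}
```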
#### RV32 Assembly - Compiler Generated (rv32-gcc (trunk))
```gas=
uf8_decode:
# Register roles:
# a0 = argument fl on entry; return value on exit
# a4 = temporary (builds 0x7FFF and holds (mantissa<<exponent))
# a5 = temporary (holds fl, exponent, intermediate math, final sum)
# s0 = frame pointer (FP)
# ra = return address
#-------------------------------------------------------
# Stack frame (relative to s0):
# -33(s0) : byte fl
# -21(s0) : byte exponent
# -20(s0) : word mantissa
# -28(s0) : word offset
addi sp,sp,-48 # prologue: allocate 48B frame
sw ra,44(sp) # save ra
sw s0,40(sp) # save s0
addi s0,sp,48 # s0 = sp + 48 (set FP)
mv a5,a0 # a5 = fl (from a0)
sb a5,-33(s0) # store fl (byte)
lbu a5,-33(s0) # a5 = fl
andi a5,a5,15 # a5 = fl & 0x0f
sw a5,-20(s0) # mantissa = a5
lbu a5,-33(s0) # a5 = fl
srli a5,a5,4 # a5 = fl >> 4
sb a5,-21(s0) # exponent = a5 (byte)
# offset = (0x7FFF >> (15 - exponent)) << 4
lbu a5,-21(s0) # a5 = exponent
li a4,15 # a4 = 15
sub a5,a4,a5 # a5 = 15 - exponent
li a4,32768 # a4 = 0x8000
addi a4,a4,-1 # a4 = 0x7FFF
sra a5,a4,a5 # a5 = 0x7FFF >> (15 - exponent)
slli a5,a5,4 # a5 = a5 << 4
sw a5,-28(s0) # offset = a5
# return (mantissa << exponent) + offset
lbu a5,-21(s0) # a5 = exponent
lw a4,-20(s0) # a4 = mantissa
sll a4,a4,a5 # a4 = mantissa << exponent
lw a5,-28(s0) # a5 = offset
add a5,a4,a5 # a5 = (mantissa<<exponent) + offset
mv a0,a5 # return value in a0
# epilogue
lw ra,44(sp) # restore ra
lw s0,40(sp) # restore s0
addi sp,sp,48 # free frame
jr ra # return
```
#### RV32 Assembly - Optimized hand-written
```gas=
uf8_decode:
# Register role:
# a0 = input argument fl on entry; return value on exit
# s0 = frame pointer (FP)
# ra = return address
# t0 = mantissa (holds fl & 0x0f)
# t1 = exponent (holds fl >> 4)
# t2 = offset (holds (0x7FFF >> (15 - exponent)) << 4)
# t3 = scratch (holds the constant 0x7FFF)
#------------------------------------------------------
# Stack frame (relative to s0):
# -4(s0) : saved ra
# -8(s0) : saved s0
addi sp,sp,-8 # prologue: allocate 8B frame
sw ra,4(sp) # save ra
sw s0,0(sp) # save s0
addi s0,sp,8 # s0 = sp + 8 (set FP)
andi t0,a0,15 # mantissa = fl & 0x0f
srli t1,a0,4 # exponent = fl >> 4
li t2,15 # t2 = 15
sub t2,t2,t1 # t2 = 15 - exponent
li t3,0x7FFF # t3 = constant 0x7FFF (scratch)
srl t2,t3,t2 # t2 = 0x7FFF >> (15 - exponent)
slli t2,t2,4 # offset = t2 = t2 << 4
sll a0,t0,t1 # a0 = mantissa << exponent
add a0,a0,t2 # return (mantissa << exponent) + offset
lw ra,4(sp) # epilogue
lw s0,0(sp) # restore s0
addi sp,sp,8 # free frame
jr ra # return
```
----
### Implementation - uf8_encode
#### Original C code with Test
```clike=
uf8 uf8_encode(uint32_t value)
{
/* Use CLZ for fast exponent calculation */
if (value < 16)
return value;
/* Find appropriate exponent using CLZ hint */
int lz = clz(value);
int msb = 31 - lz;
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
/* Estimate exponent - the formula is empirical */
exponent = msb - 4;
if (exponent > 15)
exponent = 15;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
/* Adjust if estimate was off */
while (exponent > 0 && value < overflow) {
overflow = (overflow - 16) >> 1;
exponent--;
}
}
/* Find exact exponent */
while (exponent < 15) {
uint32_t next_overflow = (overflow << 1) + 16;
if (value < next_overflow)
break;
overflow = next_overflow;
exponent++;
}
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
}
```
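A small usage sketch tying the code back to the worked examples in the math section (assumes `clz`, `uf8_decode`, and the `uf8` typedef from the earlier listings are in the same unit):
```clike=
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    printf("%u\n", (unsigned) uf8_encode(16)); /* 16: e=1, mantissa 0           */
    printf("%u\n", (unsigned) uf8_encode(17)); /* 16: same code as 16 (lossy)   */
    printf("%u\n", (unsigned) uf8_encode(51)); /* 32: e=2, offset(2)=48         */
    printf("%u\n", uf8_decode(31));            /* 46: e=1, m=15                 */
    return 0;
}
```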
#### RV32 Assembly - Compiler Generated (rv32-gcc (trunk))
```gas=
uf8_encode:
# Register roles:
# a0 = arg value on entry; return uf8 on exit
# a4 = temporary (ALU, comparisons, shifted intermediates)
# a5 = primary temporary (loop vars, intermediates, final byte)
# s0 = frame pointer (FP)
# ra = return address
#-------------------------------------------------------------------
# Stack frame (relative to s0):
# -52(s0) : uint32_t value
# -32(s0) : int lz
# -36(s0) : int msb
# -24(s0) : uint32_t overflow
# -17(s0) : uint8_t exponent
# -25(s0) : uint8_t e (loop counter for initial overflow build)
# -40(s0) : uint32_t next_overflow
# -41(s0) : uint8_t mantissa
# Initialize stack frame
addi sp,sp,-64 # prologue: allocate 64B
sw ra,60(sp) # save ra
sw s0,56(sp) # save s0
addi s0,sp,64 # set FP
sw a0,-52(s0) # store input value
# Quick return for value <= 15
lw a4,-52(s0) # a4 = value
li a5,15 # a5 = 15
bgtu a4,a5,L6 # if (value > 15) goto L6
lw a5,-52(s0) # a5 = value
andi a5,a5,0xff # zero-extend to byte
j L7 # return value
# value >= 16 path
L6:
lw a0,-52(s0) # a0 = value
call clz # a0 = clz(value)
mv a5,a0 # a5 = lz
sw a5,-32(s0) # lz = a5
li a4,31 # a4 = 31
lw a5,-32(s0) # a5 = lz
sub a5,a4,a5 # a5 = 31 - lz
sw a5,-36(s0) # msb = a5
sb zero,-17(s0) # exponent = 0
sw zero,-24(s0) # overflow = 0
lw a4,-36(s0) # a4 = msb
li a5,4 # a5 = 4
ble a4,a5,L14 # if (msb <= 4) skip estimate block
# exponent = msb - 4; if (exponent > 15) exponent = 15;
lw a5,-36(s0) # a5 = msb
andi a5,a5,0xff # keep low byte
addi a5,a5,-4 # a5 = msb - 4
sb a5,-17(s0) # exponent = a5
lbu a4,-17(s0) # a4 = exponent
li a5,15 # a5 = 15
bleu a4,a5,L9 # if (exponent <= 15) ok
li a5,15 # a5 = 15
sb a5,-17(s0) # exponent = 15
L9:
sb zero,-25(s0) # e = 0 (byte)
# for (e=0; e<exponent; e++) overflow = (overflow<<1)+16;
j L10
L11:
lw a5,-24(s0) # a5 = overflow
slli a5,a5,1 # a5 = overflow << 1
addi a5,a5,16 # a5 += 16
sw a5,-24(s0) # overflow = a5
lbu a5,-25(s0) # a5 = e
addi a5,a5,1 # e++
sb a5,-25(s0) # store e
L10:
lbu a4,-25(s0) # a4 = e
lbu a5,-17(s0) # a5 = exponent
bltu a4,a5,L11 # loop while e < exponent
# while (exponent > 0 && value < overflow) { overflow = (overflow-16)>>1; exponent--; }
j L12
L13:
lw a5,-24(s0) # a5 = overflow
addi a5,a5,-16 # a5 = overflow - 16
srli a5,a5,1 # a5 = (overflow - 16) >> 1
sw a5,-24(s0) # overflow = a5
lbu a5,-17(s0) # a5 = exponent
addi a5,a5,-1 # exponent--
sb a5,-17(s0) # store exponent
L12:
lbu a5,-17(s0) # a5 = exponent
beq a5,zero,L14 # if (exponent == 0) stop
lw a4,-52(s0) # a4 = value
lw a5,-24(s0) # a5 = overflow
bltu a4,a5,L13 # if (value < overflow) adjust down
j L14
# while (exponent < 15) { next_overflow=(overflow<<1)+16; if (value < next_overflow) break; overflow=next_overflow; exponent++; }
L17:
lw a5,-24(s0) # a5 = overflow
slli a5,a5,1 # a5 = overflow << 1
addi a5,a5,16 # a5 += 16
sw a5,-40(s0) # next_overflow = a5
lw a4,-52(s0) # a4 = value
lw a5,-40(s0) # a5 = next_overflow
bltu a4,a5,L18 # if (value < next_overflow) break
lw a5,-40(s0) # a5 = next_overflow
sw a5,-24(s0) # overflow = next_overflow
lbu a5,-17(s0) # a5 = exponent
addi a5,a5,1 # exponent++
sb a5,-17(s0) # store exponent
L14:
lbu a4,-17(s0) # a4 = exponent
li a5,14 # a5 = 14
bleu a4,a5,L17 # loop while exponent <= 14 (i.e., < 15)
j L16
L18:
nop # break
# mantissa = (value - overflow) >> exponent; return (exponent<<4)|mantissa;
L16:
lw a4,-52(s0) # a4 = value
lw a5,-24(s0) # a5 = overflow
sub a4,a4,a5 # a4 = value - overflow
lbu a5,-17(s0) # a5 = exponent
srl a5,a4,a5 # a5 = (value - overflow) >> exponent
sb a5,-41(s0) # mantissa = a5 (byte)
lb a5,-17(s0) # a5 = sign-extended exponent byte
slli a5,a5,4 # a5 = exponent << 4
slli a4,a5,24 # pack to 32 then
srai a4,a4,24 # sign-correct to low byte
lb a5,-41(s0) # a5 = mantissa (sign-extended byte)
or a5,a4,a5 # a5 = (exponent<<4) | mantissa
slli a5,a5,24 # zero-extend to 8-bit in 32-bit reg
srai a5,a5,24
andi a5,a5,0xff # ensure 8-bit result
L7:
mv a0,a5 # return in a0
lw ra,60(sp) # epilogue: restore ra
lw s0,56(sp) # restore s0
addi sp,sp,64 # free frame
jr ra # return
```
#### RV32 Assembly - Optimized hand-written
```gas=
uf8_encode:
# register roles:
# a0 : uint32_t value (input)
# s0 : frame pointer
# t0 : input value copy
# t1 : msb (most significant bit position)
# t2 : exponent
# t3 : overflow (offset for current exponent)
# t4 : e (temporary for exponent calculation)
# t5 : next_overflow (offset for next exponent)
# t6 : mantissa
# a1 : temporary constant
# stack frame (from s0, growing down):
# -4(s0) : ra
# -8(s0) : saved s0
# -12(s0) : uint32_t value (spill)
# initialize stack frame
addi sp,sp,-12
sw ra,8(sp)
sw s0,4(sp)
addi s0,sp,12
# quick return for value <= 15
addi a1,zero,15 # a1 = 15
bgtu a0,a1,init_exponent_guess # if (value > 15) go encode; else fall through and return value
j uf8_encode_return
init_exponent_guess:
# initialize variables for exponent guessing loop
sw a0,0(sp) # save value across call
call clz # call clz(value)
li t1,31 # t1 = msb = 31
lw t0,0(sp) # t0 = value
sub t1,t1,a0 # msb = 31 - clz(value)
li t2,0 # t2 = exponent = 0
li t3,0 # t3 = overflow = 0
# Estimate exponent
addi a1,x0,4 # a1 = 4
ble t1,a1,find_exact_exponent # if (msb <= 4) goto find_exact_exponent
addi t2,t1,-4 # exponent = msb - 4
# Handle exponent overflow
addi a1,x0,15 # a1 = 15
bleu t2,a1,estimate_exponent_and_overflow # if (exponent <= 15) goto estimate_exponent
addi t2,x0,15 # exponent = 15
estimate_exponent_and_overflow:
# Calculate overflow for current exponent
addi t4,x0,0 # t4 = e = 0
j Loop_condition_1
Loop_1:
slli t3,t3,1 # overflow <<= 1
addi t3,t3,16 # overflow += 16
addi t4,t4,1 # e += 1
Loop_condition_1:
bltu t4,t2,Loop_1
adjust_e_o:
# adjust exponent and overflow
beq t2,x0,find_exact_exponent # if (exponent == 0) goto find_exact_exponent
bge t0,t3,find_exact_exponent # if (value >= overflow) stop adjusting
addi t3,t3,-16 # overflow -= 16
srli t3,t3,1 # overflow >>= 1
addi t2,t2,-1 # exponent -= 1
j adjust_e_o
find_exact_exponent:
# Find exact exponent
addi a1,x0,14 # a1 = 14
bgtu t2,a1,finalize # if (exponent > 14) goto finalize
slli t5,t3,1 # next_overflow = overflow << 1
addi t5,t5,16 # next_overflow += 16
bltu t0,t5,finalize # if (value < next_overflow) goto finalize
addi t3,t5,0 # overflow = next_overflow
addi t2,t2,1 # exponent += 1
j find_exact_exponent
finalize:
# Calculate mantissa
sub t6,t0,t3 # mantissa = value - overflow
srl t6,t6,t2 # mantissa >>= exponent
andi t6,t6,15 # mantissa &= 0x0f
# Combine exponent and mantissa
slli t2,t2,4 # exponent <<= 4
slli t2,t2,24
srai t2,t2,24
or a0,t2,t6 # return (exponent << 4) | mantissa
slli a0,a0,24
srai a0,a0,24
uf8_encode_return:
lw ra,8(sp) # epilogue: restore ra
lw s0,4(sp) # restore s0
addi sp,sp,12 # free frame
jr ra # return
```
---
### Analysis
#### Improvement comparing to compiler
The following comparisons run each function with its own test case; the comparison targets the optimization of the RV32 instructions inside the function, not the test harness.
* **For CLZ function**

* **For Decoding**

* **For Encoding**

* **For whole code in problem B**

#### Snapshot context (from Ripes banners)

- **IF**: `bge x5 x28 0x20 <find_exact_exponent>` ; `value ≥ overflow`
- **ID**: `beq x7 x0 0x24 <find_exact_exponent>` ; `exponent == 0`
- **EX**: `bltu x29 x7 -0x0c <Loop_1>` ; `e < exponent`
- **MEM**: `addi x29 x29 1` ; `e += 1`
- **WB**: `addi x28 x28 16` ; `overflow += 16` (paired with prior `slli`)
This is inside `estimate_exponent_and_overflow`: builds
`overflow = 16 * (2^e − 1)` via repeated `slli` and `addi`.
---
#### 5-Stage Pipeline Processor
This section analyzes [q1-uf8.s](https://github.com/jgw0915/Computer-Architecture/blob/main/uf8/q1-uf8.s) from the processor, memory, and instruction-level views.

* **IF (Instruction Fetch)**
- **Role**: Drive `PC` into instruction memory.
- **Key signals**: `PCSrc` mux selects `PC+4` or EX branch target. `PC.en=1` unless branch flush.
- **Here**: `bltu` in EX is taken → `PCSrc=branch`, IF fetches `Loop_1`.

* **ID (Instruction Decode)**
- **Role**: Decode control, read regs, form immediates.
- **Key signals**: `RegFile.rs1/rs2`, `ImmGen` (B/I type), `Branch`, `RegWrite`.
- **Here**: `beq x7,x0,+0x24` reads `x7` (exponent) and `x0`; `Branch=1`. No stall; forwarding covers hazards.

* **EX (Execution)**
- **Role**: ALU ops, branch compare, target add.
- **Key signals**: `ALUSrc`, compare unit, `BranchTaken`, target adder.
- **Here**: `bltu x29,x7` sets `BranchTaken=1` since `e < exponent`.

* **MEM (Memory Read/Write)**
- **Role**: Data memory for loads/stores.
- **Key signals**: `MemRead`, `MemWrite`.
- **Here**: For `addi/branch` both 0.
Actual writes appear in prologues/epilogues (`sw/lw`) and test locals (`sb/sw` at `s0` offsets).

* **WB (Register Write Back)**
- **Role**: Write ALU or load result to `rd`.
- **Key signals**: `RegWrite=1` for `addi`, WB mux selects ALU path.
- **Here**: `x29 ← x29+1` and `x28 ← x28+16` update loop state.

---
#### Instruction-level intent in uf8_encode

- **Estimate**
- If `msb ≤ 4` → go refine; else `t2 = msb − 4`, cap at 15.
- **Build overflow** (`estimate_exponent_and_overflow`)
- Loop while `t4 < t2`: `t3 = (t3<<1) + 16; t4++`.
- **Adjust** (`adjust_e_o`)
- If `t2 == 0` or `value ≥ overflow` → stop adjusting and move on to the refine step.
- Else: `t3 = (t3 − 16) >> 1; t2--` and repeat.
- **Refine** (`find_exact_exponent`)
- `t5 = (t3<<1) + 16`.
- If `value ≥ t5` and `t2 ≤ 14` → `t3=t5; t2++`; repeat.
- **Finalize**
- `t6 = ((value − overflow) >> t2) & 0x0f`.
- Return `(t2<<4) | t6` (sign-extended pairs trimmed by `slli/srai`).
---
#### Memory updates and correctness

- **Stack frames**
- `clz`: `addi sp,-8` → `sw ra,s0` → restore with `lw` and `addi sp,+8`.
- `uf8_decode/encode`: similar `-8/−12` frames; balanced restores. No leaks.
- **Test locals**
- `-29(s0)` current byte, `-21(s0)` flag, `-20(s0)` previous value, `-28(s0)` index.
- Accessed via matching `lbu/sb` and `lw/sw`. No aliasing.
- **.data strings**
- Read-only, printed by `ecall 4`. No stores.
----
## Problem \(C\) BFloat16
### Implementation - Special Cases Detection
#### Source C code in Problem C
```clike=
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
typedef struct {
uint16_t bits;
} bf16_t;
#define BF16_EXP_MASK 0x7F80U
#define BF16_MANT_MASK 0x007FU
static inline bool bf16_isnan(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_isinf(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
!(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_iszero(bf16_t a)
{
return !(a.bits & 0x7FFF);
}
```
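Usage sketch with the standard BF16 bit patterns (0x7F80 is +Inf, 0x7FC0 is a quiet NaN, 0x8000 is −0); it assumes the listing above is in the same unit:
```clike=
#include <stdio.h>

int main(void)
{
    bf16_t pos_inf  = {.bits = 0x7F80}; /* exponent all ones, mantissa 0    */
    bf16_t qnan     = {.bits = 0x7FC0}; /* exponent all ones, mantissa != 0 */
    bf16_t neg_zero = {.bits = 0x8000}; /* only the sign bit set            */
    printf("%d %d %d\n", bf16_isinf(pos_inf), bf16_isnan(pos_inf), bf16_iszero(pos_inf));    /* 1 0 0 */
    printf("%d %d %d\n", bf16_isinf(qnan), bf16_isnan(qnan), bf16_iszero(qnan));             /* 0 1 0 */
    printf("%d %d %d\n", bf16_isinf(neg_zero), bf16_isnan(neg_zero), bf16_iszero(neg_zero)); /* 0 0 1 */
    return 0;
}
```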
#### RV32 Assembly - Compiler Generated (rv32-gcc (trunk))
```gas=
# RV32 register-variable mapping (across functions)
# a0 : function argument / return value
# a4 : temporary for masked bits or compares
# a5 : primary temporary / boolean result
# s0 : frame pointer
# ra : return address
# stack frame layout (offsets from s0)
# -20 : local bf16 input
# -24 : s0 saved
# -28 : ra saved
# bf16_isnan(bf16 a): return ((a & EXP)==EXP) && ((a & MANT)!=0)
bf16_isnan:
addi sp,sp,-32
sw ra,28(sp)
sw s0,24(sp)
addi s0,sp,32
# Load input into stack-local; compute (a & 0x7F80) then test equals 0x7F80; then test mantissa != 0; return 1/0
sh a0,-20(s0)
lhu a5,-20(s0)
mv a4,a5
li a5,32768
addi a5,a5,-128
and a4,a4,a5
li a5,32768
addi a5,a5,-128
bne a4,a5,L2
lhu a5,-20(s0)
andi a5,a5,127
beq a5,zero,L2
li a5,1
j L3
L2:
li a5,0
L3:
# Normalize boolean to 0/1 and return
andi a5,a5,1
andi a5,a5,0xff
mv a0,a5
lw ra,28(sp)
lw s0,24(sp)
addi sp,sp,32
jr ra
# bf16_isinf(bf16 a): return ((a & EXP)==EXP) && ((a & MANT)==0)
bf16_isinf:
addi sp,sp,-32
sw ra,28(sp)
sw s0,24(sp)
addi s0,sp,32
# Load input; test exponent all-ones; then test mantissa equals zero; return 1/0
sh a0,-20(s0)
lhu a5,-20(s0)
mv a4,a5
li a5,32768
addi a5,a5,-128
and a4,a4,a5
li a5,32768
addi a5,a5,-128
bne a4,a5,L6
lhu a5,-20(s0)
andi a5,a5,127
bne a5,zero,L6
li a5,1
j L7
L6:
li a5,0
L7:
# Normalize boolean to 0/1 and return
andi a5,a5,1
andi a5,a5,0xff
mv a0,a5
lw ra,28(sp)
lw s0,24(sp)
addi sp,sp,32
jr ra
# bf16_iszero(bf16 a): return (a & 0x7FFF) == 0
bf16_iszero:
addi sp,sp,-32
sw ra,28(sp)
sw s0,24(sp)
addi s0,sp,32
# Load input; clear sign bit; test zero; return 1/0
sh a0,-20(s0)
lhu a5,-20(s0)
slli a5,a5,17
srli a5,a5,17
seqz a5,a5
andi a5,a5,0xff
mv a0,a5
lw ra,28(sp)
lw s0,24(sp)
addi sp,sp,32
jr ra
```
#### RV32 Assembly - Optimized hand-written
```gas=
# RV32 register-variable mapping (across functions)
# a0: input bfloat16 number a and return value (1 if NaN, 0 otherwise)
# s0: stack frame pointer
# ra: return address
# Stack frame layout (offsets from s0, across functions):
# -8 : saved s0
# -4 : saved ra
bf16_isnan:
# Register Role:
# t0: store result of a & Exp_mask (0x7F80)
# t1: store result of a & Mant_mask (0x007F)
# t2: store Exp_mask (0x7F80)
addi sp, sp, -8
sw ra, 4(sp)
sw s0, 0(sp)
addi s0, sp, 8
li t2,32640 # Load Exp_mask into t2
and t0, a0, t2 # t0 = a & Exp_mask
beq t0, t2, isnan_check_mantissa # If (a & Exp_mask) == Exp_mask, check mantissa
li a0, 0 # Not NaN
j end_bf16_isnan
isnan_check_mantissa:
andi t1, a0, 127 # t1 = a & Mant_mask
bne t1, zero, is_nan # If (a & Mant_mask) != 0, it is NaN
li a0, 0 # Not NaN
j end_bf16_isnan
is_nan:
li a0, 1 # Is NaN
end_bf16_isnan:
lw ra, 4(sp)
lw s0, 0(sp)
addi sp, sp, 8
jr ra
bf16_isinf:
# Register Role:
# t0: store result of a & Exp_mask (0x7F80)
# t1: store result of !(a & Mant_mask (0x007F))
# t2: store Exp_mask (0x7F80)
addi sp, sp, -8
sw ra, 4(sp)
sw s0, 0(sp)
addi s0, sp, 8
li t2,32640 # Load Exp_mask into t2
and t0, a0, t2 # t0 = a & Exp_mask
beq t0, t2, isinf_check_mantissa # If (a & Exp_mask) == Exp_mask, check mantissa
li a0, 0 # Not Inf
j end_bf16_isinf
isinf_check_mantissa:
andi t1, a0, 127 # t1 = a & Mant_mask
beq t1, zero, is_inf # If (a & Mant_mask) == 0, it is Inf
li a0, 0 # Not Inf
j end_bf16_isinf
is_inf:
li a0, 1 # Is Inf
end_bf16_isinf:
lw ra, 4(sp)
lw s0, 0(sp)
addi sp, sp, 8
jr ra
bf16_iszero:
# Register Role:
# t0: store result of !(a & ~Sign_mask (0x7FFF))
addi sp, sp, -8
sw ra, 4(sp)
sw s0, 0(sp)
addi s0, sp, 8
li t0, 32767 # Load ~Sign_mask into t0
and t0, a0, t0 # t0 = a & ~Sign_mask
seqz t0, t0 # t0 = 1 if (a & 0x7FFF) == 0, else 0
add a0, t0, zero # Move result to a0
lw ra, 4(sp)
lw s0, 0(sp)
addi sp, sp, 8
jr ra
```
---
### Implementation - BF16 and FP16 Conversion
#### Source C code in Problem C
```clike=
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <math.h>
#include <string.h> /* for memcpy */
typedef struct {
uint16_t bits;
} bf16_t;
static inline bf16_t f32_to_bf16(float val)
{
uint32_t f32bits;
memcpy(&f32bits, &val, sizeof(float));
if (((f32bits >> 23) & 0xFF) == 0xFF)
return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF};
f32bits += ((f32bits >> 16) & 1) + 0x7FFF; // For Rounding to Nearest, Ties to Even
return (bf16_t) {.bits = f32bits >> 16};
}
static inline float bf16_to_f32(bf16_t val)
{
uint32_t f32bits = ((uint32_t) val.bits) << 16;
float result;
memcpy(&result, &f32bits, sizeof(float));
return result;
}
```
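Usage sketch (assumes the listing above is in the same unit): 1.0f has an exact bf16 encoding, and 1 + 2⁻⁸ = 1.00390625 sits exactly halfway between the bf16 neighbours 1.0 and 1.0078125, so round-to-nearest-even picks the even encoding 0x3F80 = 1.0:
```clike=
#include <stdio.h>

int main(void)
{
    bf16_t one = f32_to_bf16(1.0f);
    printf("0x%04X -> %f\n", one.bits, bf16_to_f32(one)); /* 0x3F80 -> 1.000000 */

    bf16_t tie = f32_to_bf16(1.00390625f); /* halfway case, rounds to the even neighbour (1.0) */
    printf("0x%04X -> %f\n", tie.bits, bf16_to_f32(tie)); /* 0x3F80 -> 1.000000 */
    return 0;
}
```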
#### RV32 Assembly - Compiler Generated (rv32-gc clang (trunk))
```gas=
# -----------------------------------------------------------------------------
# float bf16_to_f32(bf16 val)
# Build 32-bit pattern by placing bf16 in high 16 bits, then bit-cast to float.
# Registers:
# a0 : arg bf16.bits (low 16 in a0) / returns f32bits
# Stack frame (32B):
# -10(s0) : bf16.bits (int16_t local copy)
# -16(s0) : f32bits (uint32_t)
# -20(s0) : result copy (uint32_t)
bf16_to_f32:
addi sp, sp, -32
sw ra, 28(sp)
sw s0, 24(sp)
addi s0, sp, 32
# f32bits = (uint32)val.bits << 16; return as float bit pattern
sh a0, -10(s0)
lh a0, -10(s0)
slli a0, a0, 16
sw a0, -16(s0)
lw a0, -16(s0)
sw a0, -20(s0)
lw a0, -20(s0)
lw ra, 28(sp)
lw s0, 24(sp)
addi sp, sp, 32
ret
# -----------------------------------------------------------------------------
# bf16_t f32_to_bf16(float val)
# If exp==0xFF → just truncate top 16 bits (Inf/NaN preserve payload).
# Else round-to-nearest-even: add LSB of cut part plus 0x7FFF, then take top 16.
# Registers:
# a0 : arg f32bits / returns bf16.bits (in low 16)
# a1 : temps (exp calc, constants)
# Stack frame (32B):
# -16(s0) : f32bits
# -20(s0) : f32bits (working)
# -10(s0) : bf16.bits (uint16_t out)
# -18(s0) : scratch (unused load/store pair for flow)
f32_to_bf16:
addi sp, sp, -32
sw ra, 28(sp)
sw s0, 24(sp)
addi s0, sp, 32
# spill input bits
sw a0, -16(s0)
lw a0, -16(s0)
sw a0, -20(s0)
# if (((f32bits>>23)&0xFF)==0xFF) → fast path: return high16
lw a0, -20(s0)
slli a0, a0, 1
srli a0, a0, 24
li a1, 255
bne a0, a1, LBB2_2
LBB2_1:
# Inf/NaN path: return (f32bits >> 16)
lhu a0, -18(s0) # (compiler artifact; final move happens below)
sh a0, -10(s0)
j LBB2_3
LBB2_2:
# RNE rounding path:
# f32bits += ((f32bits>>16)&1) + 0x7FFF
lw a1, -20(s0)
slli a0, a1, 15
srli a0, a0, 31 # ((f32bits>>16)&1)
add a0, a0, a1
lui a1, 8
addi a1, a1, -1 # 0x7FFF
add a0, a0, a1
sw a0, -20(s0)
# produce top 16 bits
lhu a0, -18(s0) # (compiler artifact; replaced by final load)
sh a0, -10(s0)
LBB2_3:
# return bf16 bits (top 16 of updated f32bits)
lhu a0, -10(s0)
lw ra, 28(sp)
lw s0, 24(sp)
addi sp, sp, 32
ret
```
#### RV32 Assembly - Optimized hand-written
```gas=
bf16_to_f32:
# Initialize stack frame (8B)
addi sp, sp, -8
sw ra, 4(sp)
sw s0, 0(sp)
addi s0, sp, 8
# Shifting 16 bits left to convert bf16 to f32
slli a0, a0, 16
# epilogue
lw ra, 4(sp)
lw s0, 0(sp)
addi sp, sp, 8
jr ra
f32_to_bf16:
# Register Roles:
# t0: exponent field, (f32bits >> 23) & 0xFF
# t1: constant 0xFF
# t2: rounding bit ((f32bits >> 16) & 1), then the constant 0x7FFF
# Initialize stack frame (8B)
addi sp, sp, -8
sw ra, 4(sp)
sw s0, 0(sp)
addi s0, sp, 8
add t0, a0, zero
# Calculate (f32bits >> 23) & 0xFF
srli t0, t0, 23
andi t0, t0, 0xFF
li t1, 0xFF
beq t0, t1, handling_inf_nan
# Not Inf or NaN case
# Rounding to Nearest, Ties to Even
srli t2, a0, 16 # t2 = f32bits >> 16
andi t2, t2, 1 # t2 = (f32bits >> 16) & 1 (rounding bit)
add a0, a0, t2 # f32bits += rounding bit
li t2, 32767 # t2 = 0x7FFF
add a0, a0, t2 # f32bits += 0x7FFF
# Shift right 16 to convert f32 to bf16
srli a0, a0, 16
j f32_to_bf16_epilogue
handling_inf_nan:
# Inf or NaN case: just shift right 16 bits to convert f32 to bf16 (upper bits are already zero, so no extra mask is needed)
srli a0, a0, 16
f32_to_bf16_epilogue:
# epilogue
lw ra, 4(sp)
lw s0, 0(sp)
addi sp, sp, 8
jr ra
```
---
### Implementation - BF16 Addition and Subtraction
#### Source C code in Problem C
```clike=
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
typedef struct {
uint16_t bits;
} bf16_t;
#define BF16_SIGN_MASK 0x8000U
#define BF16_NAN() ((bf16_t) {.bits = 0x7FC0})
#define BF16_ZERO() ((bf16_t) {.bits = 0x0000})
static inline bf16_t bf16_add(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
if (exp_a == 0xFF) {
if (mant_a)
return a;
if (exp_b == 0xFF)
return (mant_b || sign_a == sign_b) ? b : BF16_NAN();
return a;
}
if (exp_b == 0xFF)
return b;
if (!exp_a && !mant_a)
return b;
if (!exp_b && !mant_b)
return a;
if (exp_a)
mant_a |= 0x80;
if (exp_b)
mant_b |= 0x80;
int16_t exp_diff = exp_a - exp_b;
uint16_t result_sign;
int16_t result_exp;
uint32_t result_mant;
if (exp_diff > 0) {
result_exp = exp_a;
if (exp_diff > 8)
// When b is too small relative to a to affect the result, return a
return a;
mant_b >>= exp_diff;
} else if (exp_diff < 0) {
result_exp = exp_b;
if (exp_diff < -8)
return b;
mant_a >>= -exp_diff;
} else {
result_exp = exp_a;
}
if (sign_a == sign_b) {
result_sign = sign_a;
result_mant = (uint32_t) mant_a + mant_b;
if (result_mant & 0x100) {
result_mant >>= 1;
if (++result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
} else {
if (mant_a >= mant_b) {
result_sign = sign_a;
result_mant = mant_a - mant_b;
} else {
result_sign = sign_b;
result_mant = mant_b - mant_a;
}
if (!result_mant)
return BF16_ZERO();
while (!(result_mant & 0x80)) {
result_mant <<= 1;
if (--result_exp <= 0)
return BF16_ZERO();
}
}
return (bf16_t) {
.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(result_mant & 0x7F),
};
}
static inline bf16_t bf16_sub(bf16_t a, bf16_t b)
{
b.bits ^= BF16_SIGN_MASK;
return bf16_add(a, b);
}
```
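Usage sketch with raw bit patterns (in bf16: 1.5 = 0x3FC0, 2.5 = 0x4020, 4.0 = 0x4080, 1.0 = 0x3F80); assumes the listing above is in the same unit:
```clike=
#include <stdio.h>

int main(void)
{
    bf16_t a = {.bits = 0x3FC0}; /* 1.5 */
    bf16_t b = {.bits = 0x4020}; /* 2.5 */

    bf16_t sum  = bf16_add(a, b); /* 1.5 + 2.5: mantissa carry bumps the exponent -> 0x4080 (4.0) */
    bf16_t diff = bf16_sub(b, a); /* 2.5 - 1.5: cancellation renormalizes         -> 0x3F80 (1.0) */
    printf("sum  = 0x%04X\n", sum.bits);
    printf("diff = 0x%04X\n", diff.bits);
    return 0;
}
```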
#### RV32 Assembly - Compiler Generated (rv32-gc clang (trunk))
* Addition
```gas=
############################################################
# bf16_add
# ABI: a0=input a.bits (u16), a1=input b.bits (u16), a0=return bits (u16)
# Callee-saved: s0. Caller-saved: a0–a7.
#
# Stack frame (48 bytes):
# [sp+44] ra
# [sp+40] s0
# Locals (relative to s0):
# -10: u16 result_bits
# -12: u16 a_bits
# -14: u16 b_bits
# -16: u16 sign_a
# -18: u16 sign_b
# -20: i16 exp_a
# -22: i16 exp_b
# -24: u16 mant_a (7-bit field, later with implicit 1)
# -26: u16 mant_b (7-bit field, later with implicit 1)
# -28: i16 exp_diff
# -30: u16 result_sign
# -32: i16 result_exp
# -36..-33: u32 result_mant (temp, uses word ops)
#
# Sections:
# Prologue: create frame, spill a/b.
############################################################
bf16_add:
addi sp, sp, -48
sw ra, 44(sp)
sw s0, 40(sp)
addi s0, sp, 48
sh a0, -12(s0)
sh a1, -14(s0)
############################################################
# Unpack fields:
# sign_a/sign_b = bits>>15 & 1
# exp_a/exp_b = bits[14:7]
# mant_a/mant_b = bits[6:0]
############################################################
lhu a0, -12(s0)
srli a0, a0, 15
sh a0, -16(s0)
lhu a0, -14(s0)
srli a0, a0, 15
sh a0, -18(s0)
lhu a0, -12(s0)
slli a0, a0, 17
srli a0, a0, 24
sh a0, -20(s0)
lhu a0, -14(s0)
slli a0, a0, 17
srli a0, a0, 24
sh a0, -22(s0)
lhu a0, -12(s0)
andi a0, a0, 127
sh a0, -24(s0)
lhu a0, -14(s0)
andi a0, a0, 127
sh a0, -26(s0)
############################################################
# Special cases on a:
# if exp_a==0xFF:
# if mant_a!=0 -> return a (NaN)
# else if exp_b==0xFF:
# if mant_b!=0 or sign_a==sign_b -> return b
# else -> return qNaN
# else -> return a (Inf)
############################################################
lh a0, -20(s0)
li a1, 255
bne a0, a1, LBB1_10
LBB1_1:
lhu a0, -24(s0)
beqz a0, LBB1_3
LBB1_2:
lh a0, -12(s0)
sh a0, -10(s0)
j LBB1_50
LBB1_3:
lh a0, -22(s0)
li a1, 255
bne a0, a1, LBB1_9
LBB1_4:
lhu a0, -26(s0)
bnez a0, LBB1_6
LBB1_5:
lhu a0, -16(s0)
lhu a1, -18(s0)
bne a0, a1, LBB1_7
LBB1_6:
lh a0, -14(s0) # return b (Inf or NaN)
sh a0, -10(s0)
j LBB1_8
LBB1_7:
lui a0, 8 # return qNaN 0x7FC0
addi a0, a0, -64
sh a0, -10(s0)
LBB1_8:
j LBB1_50
LBB1_9:
lh a0, -12(s0) # return a (Inf)
sh a0, -10(s0)
j LBB1_50
############################################################
# Special cases on b:
# if exp_b==0xFF -> return b
# if a==0 -> return b
# if b==0 -> return a
# If normalized, set implicit 1: mant|=0x80
############################################################
LBB1_10:
lh a0, -22(s0)
li a1, 255
bne a0, a1, LBB1_12
LBB1_11:
lh a0, -14(s0) # return b
sh a0, -10(s0)
j LBB1_50
LBB1_12:
lhu a0, -20(s0)
bnez a0, LBB1_15
LBB1_13:
lhu a0, -24(s0)
bnez a0, LBB1_15
LBB1_14:
lh a0, -14(s0) # a==0 -> return b
sh a0, -10(s0)
j LBB1_50
LBB1_15:
lhu a0, -22(s0)
bnez a0, LBB1_18
LBB1_16:
lhu a0, -26(s0)
bnez a0, LBB1_18
LBB1_17:
lh a0, -12(s0) # b==0 -> return a
sh a0, -10(s0)
j LBB1_50
LBB1_18:
lhu a0, -20(s0)
beqz a0, LBB1_20
LBB1_19:
lh a0, -24(s0) # mant_a |= 0x80
ori a0, a0, 128
sh a0, -24(s0)
LBB1_20:
lhu a0, -22(s0)
beqz a0, LBB1_22
LBB1_21:
lh a0, -26(s0) # mant_b |= 0x80
ori a0, a0, 128
sh a0, -26(s0)
############################################################
# Exponent alignment:
# exp_diff = exp_a - exp_b
# If exp_diff>0:
# result_exp=exp_a; if exp_diff>=9 -> return a; shift mant_b>>=exp_diff
# Else if exp_diff<0:
# result_exp=exp_b; if exp_diff<=-9 -> return b; shift mant_a>>=-exp_diff
# Else result_exp=exp_a
############################################################
LBB1_22:
lh a0, -20(s0)
lh a1, -22(s0)
sub a0, a0, a1
sh a0, -28(s0)
lh a1, -28(s0)
li a0, 0
bge a0, a1, LBB1_26 # exp_diff <= 0?
LBB1_23: # exp_diff > 0
lh a0, -20(s0)
sh a0, -32(s0) # result_exp = exp_a
lh a0, -28(s0)
li a1, 9
blt a0, a1, LBB1_25
LBB1_24: # gap >= 9 -> return a
lh a0, -12(s0)
sh a0, -10(s0)
j LBB1_50
LBB1_25:
lh a1, -28(s0)
lhu a0, -26(s0)
srl a0, a0, a1 # mant_b >>= exp_diff
sh a0, -26(s0)
j LBB1_32
LBB1_26: # exp_diff <= 0
lh a0, -28(s0)
bgez a0, LBB1_30 # exp_diff == 0
LBB1_27: # exp_diff < 0
lh a0, -22(s0)
sh a0, -32(s0) # result_exp = exp_b
lh a1, -28(s0)
li a0, -9
blt a0, a1, LBB1_29
LBB1_28: # gap <= -9 -> return b
lh a0, -14(s0)
sh a0, -10(s0)
j LBB1_50
LBB1_29:
lh a1, -28(s0)
li a0, 0
sub a1, a0, a1 # -exp_diff
lhu a0, -24(s0)
srl a0, a0, a1 # mant_a >>= -exp_diff
sh a0, -24(s0)
j LBB1_31
LBB1_30: # exp_diff == 0
lh a0, -20(s0)
sh a0, -32(s0) # result_exp = exp_a
LBB1_31:
j LBB1_32
############################################################
# Same-sign path:
# if sign_a==sign_b:
# result_sign=sign_a
# result_mant=mant_a+mant_b
# if carry -> shift right, result_exp++
# if result_exp>=0xFF -> return signed Inf
############################################################
LBB1_32:
lhu a0, -16(s0)
lhu a1, -18(s0)
bne a0, a1, LBB1_38
LBB1_33:
lh a0, -16(s0)
sh a0, -30(s0) # result_sign
lhu a0, -24(s0)
lhu a1, -26(s0)
add a0, a0, a1
sw a0, -36(s0) # result_mant
lbu a0, -35(s0)
andi a0, a0, 1
beqz a0, LBB1_37 # no carry
LBB1_34: # carry path
lw a0, -36(s0)
srli a0, a0, 1 # mant >>= 1
sw a0, -36(s0)
lh a1, -32(s0)
slli a0, a1, 16
addi a1, a1, 1 # result_exp++
sh a1, -32(s0)
lui a1, 16
add a0, a0, a1
srai a0, a0, 16
li a1, 255
blt a0, a1, LBB1_36
LBB1_35: # overflow -> signed Inf
lh a0, -30(s0)
slli a0, a0, 15
lui a1, 8
addi a1, a1, -128 # 0x7F80
or a0, a0, a1
sh a0, -10(s0)
j LBB1_50
LBB1_36:
j LBB1_37
LBB1_37:
j LBB1_49
############################################################
# Different-sign path:
# subtract larger mantissa from smaller
# result_sign = sign of larger
# if zero -> return +0
# normalize left until bit7 set, decrement result_exp each shift
# if result_exp <= 0 -> return 0
############################################################
LBB1_38:
lhu a0, -24(s0)
lhu a1, -26(s0)
blt a0, a1, LBB1_40
LBB1_39: # mant_a >= mant_b
lh a0, -16(s0)
sh a0, -30(s0) # result_sign = sign_a
lhu a0, -24(s0)
lhu a1, -26(s0)
sub a0, a0, a1
sw a0, -36(s0) # result_mant
j LBB1_41
LBB1_40: # mant_b > mant_a
lh a0, -18(s0)
sh a0, -30(s0) # result_sign = sign_b
lhu a0, -26(s0)
lhu a1, -24(s0)
sub a0, a0, a1
sw a0, -36(s0)
LBB1_41:
lw a0, -36(s0)
bnez a0, LBB1_43 # non-zero
LBB1_42:
li a0, 0 # exact cancel -> +0
sh a0, -10(s0)
j LBB1_50
LBB1_43: # normalize left
LBB1_44:
lbu a0, -36(s0)
andi a0, a0, 128 # test bit7
bnez a0, LBB1_48
LBB1_45:
lw a0, -36(s0)
slli a0, a0, 1
sw a0, -36(s0)
lh a1, -32(s0)
slli a0, a1, 16
addi a1, a1, -1 # result_exp--
sh a1, -32(s0)
lui a1, 1048560 # bounds check (<=0) -> zero
add a0, a0, a1
srai a1, a0, 16
li a0, 0
blt a0, a1, LBB1_47
LBB1_46:
li a0, 0
sh a0, -10(s0)
j LBB1_50
LBB1_47:
j LBB1_44
############################################################
# Pack result:
# bits = (result_sign<<15) | (result_exp<<7) | (result_mant & 0x7F)
############################################################
LBB1_48:
j LBB1_49
LBB1_49:
lh a0, -30(s0)
slli a0, a0, 15
lbu a1, -32(s0)
slli a1, a1, 7
or a0, a0, a1
lw a1, -36(s0)
andi a1, a1, 127
or a0, a0, a1
sh a0, -10(s0)
############################################################
# Epilogue: move result to a0 and return.
############################################################
LBB1_50:
lhu a0, -10(s0)
lw ra, 44(sp)
lw s0, 40(sp)
addi sp, sp, 48
ret
```
* Subtraction
```gas=
############################################################
# bf16_sub
# ABI: a0=a.bits, a1=b.bits. Returns a0 = (a - b).bits.
#
# Stack frame (16 bytes):
# [sp+12] ra
# [sp+8] s0
# Locals:
# -12: u16 a_bits
# -14: u16 b_bits (to be negated)
# -10: u16 result_bits
#
# Sections:
# Prologue: save inputs.
# Flip sign of b (xor with 0x8000).
# Call bf16_add(a, -b).
# Epilogue: return result.
############################################################
bf16_sub:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
addi s0, sp, 16
sh a0, -12(s0)
sh a1, -14(s0)
lh a0, -14(s0)
lui a1, 8 # 0x8000
xor a0, a0, a1 # negate b
sh a0, -14(s0)
lhu a0, -12(s0)
lhu a1, -14(s0)
call bf16_add
sh a0, -10(s0)
lhu a0, -10(s0)
lw ra, 12(sp)
lw s0, 8(sp)
addi sp, sp, 16
ret
```
#### RV32 Assembly - Optimized hand-written
* Addition
```gas=
# bf16_add
# ABI:
# a0=input a.bits (u16), a1=input b.bits (u16), a0=return bits (u16)
# Register Role:
# t0: sign_a
# t1: sign_b
# t2: exp_a
# t3: exp_b
# t4: mant_a
# t5: mant_b
# t6: exp_diff
# a2: result sign
# a3: result exp
# a4: result mant
# a5: temporary constant/ result
bf16_add:
addi sp, sp, -8
sw ra, 4(sp)
sw s0, 0(sp)
addi s0, sp, 8
# unpack
srli t0,a0,15
andi t0,t0,1
srli t1,a1,15
andi t1,t1,1
srli t2,a0,7
andi t2,t2,0xFF
srli t3,a1,7
andi t3,t3,0xFF
andi t4,a0,0x7F
andi t5,a1,0x7F
# ---- Special: exp_a==0xFF ----
li a5,0xFF
bne t2,a5, L_check_b_ff
# exp_a==0xFF
bnez t4, L_ret_a # mant_a!=0 => NaN, return a
# mant_a==0 => a=Inf
bne t3,a5, L_ret_a # b not Special => return a (Inf)
# both Special
bnez t5, L_ret_b # b NaN => return b
# mant_b==0, both Inf: if sign_a==sign_b return b else qNaN
bne t0,t1, L_ret_qnan
j L_ret_b
# ---- Special: exp_b==0xFF ----
L_check_b_ff:
li a5,0xFF
bne t3,a5, L_check_a_zero
L_ret_b:
add a0,zero,a1
j L_epilogue
L_ret_a:
add a0,zero,a0
j L_epilogue
L_ret_qnan:
li a0,0x7FC0
j L_epilogue
# ---- a==0? (!exp_a && !mant_a) -> b ----
L_check_a_zero:
beqz t2, 1f
j L_check_b_zero
1:
beqz t4, L_ret_b
j L_check_b_zero
# ---- b==0? (!exp_b && !mant_b) -> a ----
L_check_b_zero:
beqz t3, 2f
j L_set_implicit
2:
beqz t5, L_ret_a
# ---- implicit leading 1 when normalized ----
L_set_implicit:
beqz t2, 3f
ori t4,t4,0x80
3:
beqz t3, 4f
ori t5,t5,0x80
4:
# ---- exponent alignment ----
sub t6,t2,t3 # exp_diff
bgtz t6, L_diff_pos
bltz t6, L_diff_neg
add a3,zero,t2 # diff==0
j L_sign_check
L_diff_pos:
add a3,zero,t2
li a5,9
bge t6,a5, L_ret_a # diff>=9 -> return a
srl t5,t5,t6 # mant_b >>= diff
j L_sign_check
L_diff_neg:
add a3,zero,t3
li a5,-8
blt t6,a5, L_ret_b # diff<-8 -> return b
neg a5,t6 # a5 = -diff
srl t4,t4,a5 # mant_a >>= -diff
# ---- sign check ----
L_sign_check:
bne t0,t1, L_sign_diff
# same sign: add
add a2,zero,t0
add a4,t4,t5 # 0..510
andi a5,a4,0x100 # carry?
beqz a5, L_pack
srli a4,a4,1 # >>1
addi a3,a3,1 # exp++
li a5,0xFF
blt a3,a5, L_pack # if exp<255 ok
# overflow -> signed Inf
slli a0,a2,15
li a5,0x7F80
or a0,a0,a5
j L_epilogue
# different sign: subtract larger mant
L_sign_diff:
blt t4,t5, 5f
add a2,zero,t0
sub a4,t4,t5
j 6f
5:
add a2,zero,t1
sub a4,t5,t4
6:
beqz a4, L_ret_zero
# normalize left until bit7 set; exp--
L_norm:
andi a5,a4,0x80
bnez a5, L_pack
slli a4,a4,1
addi a3,a3,-1
blez a3, L_ret_zero
j L_norm
# pack result
L_pack:
slli a0,a2,15
andi a3,a3,0xFF
slli a5,a3,7
or a0,a0,a5
andi a4,a4,0x7F
or a0,a0,a4
j L_epilogue
L_ret_zero:
add a0,zero,zero
L_epilogue:
lw ra, 4(sp)
lw s0, 0(sp)
addi sp, sp, 8
ret
```
* Subtraction
```gas=
bf16_sub:
addi sp, sp, -8
sw ra, 4(sp)
sw s0, 0(sp)
addi s0, sp, 8
li a5, 0x8000 # sign-bit mask (xori's 12-bit immediate cannot hold 0x8000)
xor a1, a1, a5 # flip the sign of b
call bf16_add
lw ra, 4(sp)
lw s0, 0(sp)
addi sp, sp, 8
ret
```
---
### Implementation - BF16 Multiplication, Division, Square Root
* Due to HackMD's character limit, I only show the bf16_sqrt optimization code and process here; the full code is in [bfloat16](https://github.com/jgw0915/Computer-Architecture/tree/dev/bfloat16)
#### Source C code in Problem C
* Square Root
```clike=
static inline bf16_t bf16_sqrt(bf16_t a)
{
uint16_t sign = (a.bits >> 15) & 1;
int16_t exp = ((a.bits >> 7) & 0xFF);
uint16_t mant = a.bits & 0x7F;
if (exp == 0xFF) {
if (mant) return a; /* NaN */
if (sign) return BF16_NAN(); /* sqrt(-Inf) = NaN */
return a; /* +Inf */
}
if (!exp && !mant) return BF16_ZERO(); /* sqrt(0) = 0 */
if (sign) return BF16_NAN(); /* sqrt(negative) = NaN */
if (!exp) return BF16_ZERO(); /* flush denormals */
int32_t e = exp - BF16_EXP_BIAS;
int32_t new_exp;
uint32_t m = 0x80 | mant;
if (e & 1) {
m <<= 1;
new_exp = ((e - 1) >> 1) + BF16_EXP_BIAS;
} else {
new_exp = (e >> 1) + BF16_EXP_BIAS;
}
uint32_t low = 90, high = 256, result = 128;
while (low <= high) {
uint32_t mid = (low + high) >> 1;
uint32_t sq = mul16x16_u32((uint16_t)mid, (uint16_t)mid) >> 7;
if (sq <= m) { result = mid; low = mid + 1; }
else { high = mid - 1; }
}
if (result >= 256) {
result >>= 1;
new_exp++;
} else if (result < 128) {
while (result < 128 && new_exp > 1) {
result <<= 1;
new_exp--;
}
}
uint16_t new_mant = result & 0x7F;
if (new_exp >= 0xFF) return (bf16_t){ .bits = 0x7F80 }; /* +Inf */
if (new_exp <= 0) return BF16_ZERO();
/* Normal return */
return (bf16_t){
.bits = ((uint16_t)0 << 15) | (((uint16_t)new_exp & 0xFF) << 7) | new_mant
};
}
```
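`mul16x16_u32` is a helper from the full listing linked above and is not shown in this excerpt; a minimal sketch consistent with how `bf16_sqrt` uses it (16 × 16 → 32-bit unsigned product) would be:
```clike=
#include <stdint.h>

/* 16-bit x 16-bit -> 32-bit unsigned multiply, as used by the binary
 * search in bf16_sqrt: sq = mul16x16_u32(mid, mid) >> 7.             */
static inline uint32_t mul16x16_u32(uint16_t a, uint16_t b)
{
    return (uint32_t) a * (uint32_t) b;
}
```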
#### RV32 Assembly - Compiler Generated (rv32-gcc (trunk))
* Square Root
```gas=
bf16_sqrt:
addi sp, sp, -64
sw ra, 60(sp)
sw s0, 56(sp)
addi s0, sp, 64
sh a0, -12(s0)
lhu a0, -12(s0)
srli a0, a0, 15
sh a0, -14(s0)
lhu a0, -12(s0)
slli a0, a0, 17
srli a0, a0, 24
sh a0, -16(s0)
lhu a0, -12(s0)
andi a0, a0, 127
sh a0, -18(s0)
lh a0, -16(s0)
li a1, 255
bne a0, a1, LBB3_6
j LBB3_1
LBB3_1:
lhu a0, -18(s0)
beqz a0, LBB3_3
j LBB3_2
LBB3_2:
lh a0, -12(s0)
sh a0, -10(s0)
j LBB3_37
LBB3_3:
lhu a0, -14(s0)
beqz a0, LBB3_5
j LBB3_4
LBB3_4:
lui a0, 8
addi a0, a0, -64
sh a0, -10(s0)
j LBB3_37
LBB3_5:
lh a0, -12(s0)
sh a0, -10(s0)
j LBB3_37
LBB3_6:
lhu a0, -16(s0)
bnez a0, LBB3_9
j LBB3_7
LBB3_7:
lhu a0, -18(s0)
bnez a0, LBB3_9
j LBB3_8
LBB3_8:
li a0, 0
sh a0, -10(s0)
j LBB3_37
LBB3_9:
lhu a0, -14(s0)
beqz a0, LBB3_11
j LBB3_10
LBB3_10:
lui a0, 8
addi a0, a0, -64
sh a0, -10(s0)
j LBB3_37
LBB3_11:
lhu a0, -16(s0)
bnez a0, LBB3_13
j LBB3_12
LBB3_12:
li a0, 0
sh a0, -10(s0)
j LBB3_37
LBB3_13:
lh a0, -16(s0)
addi a0, a0, -127
sw a0, -24(s0)
lhu a0, -18(s0)
ori a0, a0, 128
sw a0, -32(s0)
lbu a0, -24(s0)
andi a0, a0, 1
beqz a0, LBB3_15
j LBB3_14
LBB3_14:
lw a0, -32(s0)
slli a0, a0, 1
sw a0, -32(s0)
lw a0, -24(s0)
addi a0, a0, -1
srai a0, a0, 1
addi a0, a0, 127
sw a0, -28(s0)
j LBB3_16
LBB3_15:
lw a0, -24(s0)
srai a0, a0, 1
addi a0, a0, 127
sw a0, -28(s0)
j LBB3_16
LBB3_16:
li a0, 90
sw a0, -36(s0)
li a0, 256
sw a0, -40(s0)
li a0, 128
sw a0, -44(s0)
j LBB3_17
LBB3_17:
lw a1, -36(s0)
lw a0, -40(s0)
bltu a0, a1, LBB3_22
j LBB3_18
LBB3_18:
lw a0, -36(s0)
lw a1, -40(s0)
add a0, a0, a1
srli a0, a0, 1
sw a0, -48(s0)
lhu a1, -48(s0)
mv a0, a1
call mul16x16_u32
srli a0, a0, 7
sw a0, -52(s0)
lw a1, -52(s0)
lw a0, -32(s0)
bltu a0, a1, LBB3_20
j LBB3_19
LBB3_19:
lw a0, -48(s0)
sw a0, -44(s0)
lw a0, -48(s0)
addi a0, a0, 1
sw a0, -36(s0)
j LBB3_21
LBB3_20:
lw a0, -48(s0)
addi a0, a0, -1
sw a0, -40(s0)
j LBB3_21
LBB3_21:
j LBB3_17
LBB3_22:
lw a0, -44(s0)
li a1, 256
bltu a0, a1, LBB3_24
j LBB3_23
LBB3_23:
lw a0, -44(s0)
srli a0, a0, 1
sw a0, -44(s0)
lw a0, -28(s0)
addi a0, a0, 1
sw a0, -28(s0)
j LBB3_32
LBB3_24:
lw a1, -44(s0)
li a0, 127
bltu a0, a1, LBB3_31
j LBB3_25
LBB3_25:
j LBB3_26
LBB3_26:
lw a1, -44(s0)
li a2, 0
li a0, 127
sw a2, -60(s0)
bltu a0, a1, LBB3_28
j LBB3_27
LBB3_27:
lw a0, -28(s0)
slti a0, a0, 2
xori a0, a0, 1
sw a0, -60(s0)
j LBB3_28
LBB3_28:
lw a0, -60(s0)
andi a0, a0, 1
beqz a0, LBB3_30
j LBB3_29
LBB3_29:
lw a0, -44(s0)
slli a0, a0, 1
sw a0, -44(s0)
lw a0, -28(s0)
addi a0, a0, -1
sw a0, -28(s0)
j LBB3_26
LBB3_30:
j LBB3_31
LBB3_31:
j LBB3_32
LBB3_32:
lw a0, -44(s0)
andi a0, a0, 127
sh a0, -54(s0)
lw a0, -28(s0)
li a1, 255
blt a0, a1, LBB3_34
j LBB3_33
LBB3_33:
lui a0, 8
addi a0, a0, -128
sh a0, -10(s0)
j LBB3_37
LBB3_34:
lw a1, -28(s0)
li a0, 0
blt a0, a1, LBB3_36
j LBB3_35
LBB3_35:
li a0, 0
sh a0, -10(s0)
j LBB3_37
LBB3_36:
lbu a0, -28(s0)
slli a0, a0, 7
lh a1, -54(s0)
or a0, a0, a1
sh a0, -10(s0)
j LBB3_37
LBB3_37:
lhu a0, -10(s0)
lw ra, 60(sp)
lw s0, 56(sp)
addi sp, sp, 64
ret
```
#### RV32 Assembly - Optimized hand-written
```gas=
bf16_sqrt:
addi sp,sp,-16
sw ra,12(sp)
mv a6,a0 # keep original a for NaN/+Inf returns
# Extract fields
srli t0,a0,15 # sign
andi t0,t0,1
srli t1,a0,7 # exp
andi t1,t1,255
andi t2,a0,127 # mant
# exp == 0xFF ?
li t3,255
beq t1,t3, case_exp_ff
# zero? (exp==0 && mant==0)
or t3,t1,t2
beqz t3, ret_zero
# negative finite -> NaN
bnez t0, ret_nan
# denormal -> 0
beqz t1, ret_zero
# e = exp - 127 ; m = 0x80 | mant
addi t3,t1,-127 # e
ori t4,t2,0x80 # m
andi t2,t3,1
beqz t2, exp_even
# e odd: m <<= 1 ; new_exp = ((e-1)>>1)+127
slli t4,t4,1
addi t3,t3,-1
srai t5,t3,1 # new_exp
addi t5,t5,127
j exp_ready
exp_even:
srai t5,t3,1
addi t5,t5,127
exp_ready:
# Binary search on [90,256], result=128
li a6,90 # low
li a7,256 # high
li a5,128 # result
bin_loop_cond:
bltu a7,a6, bin_done # while (low <= high)
add a4,a6,a7
srli a4,a4,1 # mid
mv a0,a4 # sq = (mid*mid)>>7
mv a1,a4
jal mul16x16_u32
srli a0,a0,7
bltu t4,a0, bin_high # if (m < sq) high=mid-1
mv a5,a4 # else result=mid, low=mid+1
addi a6,a4,1
j bin_loop_cond
bin_high:
addi a7,a4,-1
j bin_loop_cond
bin_done:
# Post-adjust result and exponent
li t1,256
bltu a5,t1, check_small
srli a5,a5,1 # result >>= 1
addi t5,t5,1 # new_exp++
j post_norm
check_small:
li t1,128
bgeu a5,t1, post_norm
norm_small_loop: # while (result<128 && new_exp>1)
bgeu a5,t1, post_norm
slti t2,t5,2
bnez t2, post_norm
slli a5,a5,1
addi t5,t5,-1
j norm_small_loop
post_norm:
andi a5,a5,127 # new_mant
# Overflow/underflow on exponent
li t1,255
bge t5,t1, ret_pos_inf # new_exp >= 255
slti t1,t5,1
bnez t1, ret_zero # new_exp <= 0
# Pack: sign=0
slli t5,t5,7
or a0,t5,a5
lw ra,12(sp)
addi sp,sp,16
ret
# --- Special cases ---
case_exp_ff: # exp == 0xFF
bnez t2, ret_a # NaN payload -> return a
bnez t0, ret_nan # -Inf -> NaN
mv a0,a6 # +Inf -> return a
lw ra,12(sp)
addi sp,sp,16
ret
ret_a:
mv a0,a6
lw ra,12(sp)
addi sp,sp,16
ret
ret_pos_inf:
li a0,0x7F80
lw ra,12(sp)
addi sp,sp,16
ret
ret_zero:
li a0,0
lw ra,12(sp)
addi sp,sp,16
ret
ret_nan:
li a0,0x7FC0
lw ra,12(sp)
addi sp,sp,16
ret
```
----
### Analysis
#### Improvement comparing to compiler
* **BF16 Special Cases Detection**

* **BF16, FP32 Conversion**

* **BF16 add/sub**

* **BF16 mul/div/sqrt**

#### 5-Stage Pipeline Processor
This section analyzes [bfloat16_mul_div_sqrt.s](https://github.com/jgw0915/Computer-Architecture/blob/main/bfloat16/bfloat16_mul_div_sqrt.s) from the processor, memory, and instruction-level views.
* **Context**
- Program: drives three BF16 kernels: `bf16_mul`, `bf16_div`, `bf16_sqrt`.
It writes temporaries on the stack, checks ranges, and returns pass/fail in `a0`.
- Snapshot (from Ripes, bf16_div inner loop at `no_subtract`):

- **WB** `0x00000370`: `srli x14, x14, 1`
- **MEM** `0x00000374`: `addi x12, x12, -1`
- **EX** `0x00000378`: `bne x12, x0, -24 <div_loop>`
- **ID** `0x0000037c`: `sub x6, x7, x28`
- **IF** `0x00000380`: `addi x6, x6, 127`
* **IF (Instruction Fetch)**
- Inputs: `PC+4` vs branch target from EX via `PCSrc` mux.
- Behavior: EX asserts **BranchTaken=1** for `bne`, so IF fetches `0x00000360` (target of `div_loop`).
- Signals: `PC.en=1`, `PCSrc=branch`, I-mem read valid.

* **ID (Instruction Decode)**
- Current: `sub x6,x7,x28` (computing `result_exp = exp_a - exp_b`).
- Actions: read `x7(t2)`, `x28(t3)`; generate R-type control; `RegWrite=1` scheduled; immediates idle.
- Hazards: forwarding covers `x12` loop counter; no stall shown.

* **EX (Execution)**
- Current: `bne x12,x0,-24`.
- ALU/Branch:
- Comparator sees `x12(a2) ≠ 0` → **BranchTaken=1**.
- Target = `ID/EX.PC + B-imm(-24)` = `0x378 + (-0x18)` = `0x360`.
- Control: `RegWrite=0`, `MemRead=0`, `MemWrite=0`.

* **MEM (Memory Read/Write)**
- Current: `addi x12,x12,-1` in MEM stage. No D-mem access; WB source = ALU.
- Elsewhere in program (actual memory writers):
- Prologues: `sw ra,s0,...` create stack frames in `test_mul_div_sqrt`, `bf16_mul`, `bf16_sqrt`.
- Temporaries: `sh`/`sb` to `-12,-14,-16,-18,-20,-22,-9(s0)` store halfwords/byte results and flags.
- Epilogues: `lw`/`lhu`/`lbu` read back the same offsets.
- Correctness: stores are matched by loads at identical offsets; stack frames are balanced by symmetric restores.

* **WB (Register Write Back)**
- Current: `srli x14,x14,1` writes normalized divisor (`x14=a4 >> 1`) back to `x14`.
- WB mux: ALU result path; `RegWrite=1`; `rd=x14`.
- Nearby writes (next cycles): `addi x12,x12,-1` writes the loop counter; `sub x6,x7,x28` then `addi x6,127` build exponent.

##### Instruction-level intent (bf16_div inner loop)

- Loop state:
- `x14` (divisor_shifted) >>= 1 each iter.
- `x12` (iteration counter) += −1; `bne x12,x0,-24` repeats 16 times.
- After loop:
- `sub x6,x7,x28` computes `exp_a - exp_b`; `addi x6,127` adds bias.
- Ripes branch widget shows **taken** while `x12>0`, matching algorithm.
---
##### Memory map and updates (from Ripes “Memory viewer”)

- `.text` at `0x00000000–0x00000657` (size ≈ 1624). `.data` and `.bss` empty in this build.
- Stack writes:
- `test_mul_div_sqrt`: `sw ra,28(sp)`, `sw s0,24(sp)`, locals at `-32..-9(s0)`.
- `bf16_mul/bf16_sqrt`: save/restore `ra,s0,s1,s2`; temporaries only in registers.
- Endianness: little-endian; `sh` writes appear as low/high byte ordering in viewer.
- Correctness: Every `addi sp, -N` is paired with `addi sp, +N`; all callee-saved regs restored before `ret`.