# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by `< liangchingyun >`
## [uf8](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B)
> [Quiz review](https://hackmd.io/kmvVhoHMSkCnxtcOoRheqQ#2025-Problem-B)
Use 8 bits (`uint8_t`) to store 20-bit unsigned integer (0 ~ 1,015,792)
* Decoding
$$
D(b) = m \cdot 2^e + (2^e - 1) \cdot 16
$$
Where $e = \lfloor b/16 \rfloor$ and $m = b \bmod 16$
* Encoding
$$
E(v) = \begin{cases}
v, & \text{if } v < 16 \\
16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise}
\end{cases}
$$
Where $\text{offset}(e) = (2^e - 1) \cdot 16$
| Exponent | Range | Offset | Step Size |
| -------- | ---------------------------- | ------- | --------- |
| 0 | $[0, 15]$ | 0 | 1 |
| 1 | $[16, 46]$ | 16 | 2 |
| 2 | $[48, 108]$ | 48 | 4 |
| 3 | $[112, 232]$ | 112 | 8 |
| ... | $...$ | ... | $2^e$ |
| 5 | $[524{,}272, 1{,}015{,}792]$ | 524,272 | 32,768 |
### `uf8_decode`
#### C code
```c
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
uint32_t mantissa = fl & 0x0f; // low 4 bits
uint8_t exponent = fl >> 4; // high 4 bits
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4; // (2^e - 1)·16
return (mantissa << exponent) + offset;
}
```
#### Assembly code
> Show iterative refinement (reducing code size and runtime overhead)
```riscv
uf8_decode:
addi sp, sp, -16 # allocate stack space for t0, t1, t2, ra
sw ra, 12(sp) # save return address
srai t0, a0, 4 # extract exponent t0 = a0 >> 4
andi t1, a0, 0x0F # extract mantissa t1 = a0 & 0x0F
# calculate offset = (2^e -1) << 4
li t2, 1
sll t2, t2, t0 # t2 = 2^e
addi t2, t2, -1 # t2 = 2^e -1
slli t2, t2, 4 # offset = (2^e-1)*16
# calculate mantissa shifted: t1 << e
sll t1, t1, t0
# final value = mantissa<<e + offset
add a0, t1, t2
lw ra, 12(sp)
addi sp, sp, 16
ret
```
#### Use case
> Three test data cases
> Including internal validation
> Compare against compiler output for verification and discussion
::: spoiler Test code
```riscv=
.data
str: .string "\n The decoded value is "
.text
# t0 = exponent
# t1 = mantissa
# t2 = offset
main:
# Test data 1
la a0, str
li a7, 4
ecall
li a0, 0x2F
jal ra, uf8_decode
li a7, 1
ecall
# Test data 2
la a0, str
li a7, 4
ecall
li a0, 0x5A
jal ra, uf8_decode
li a7, 1
ecall
# Test data 3
la a0, str
li a7, 4
ecall
li a0, 0xF0
jal ra, uf8_decode
li a7, 1
ecall
# End
li a7, 10
ecall
```
:::
Input:
* Test data 1: `0x2F`
* Test data 2: `0x5A`
* Test data 3: `0xF0`
Output:
```
The decoded value is 108
The decoded value is 816
The decoded value is 524272
```
Compiler output:
```
The decoded value of 0x2F is 108
The decoded value of 0x5A is 816
The decoded value of 0xF0 is 524272
```
#### Analysis
##### Pseudo instruction
```
00000000 <main>:
0: 10000517 auipc x10 0x10000
4: 00050513 addi x10 x10 0
8: 00400893 addi x17 x0 4
c: 00000073 ecall
10: 02f00513 addi x10 x0 47
14: 054000ef jal x1 84 <uf8_decode>
18: 00100893 addi x17 x0 1
1c: 00000073 ecall
20: 10000517 auipc x10 0x10000
24: fe050513 addi x10 x10 -32
28: 00400893 addi x17 x0 4
2c: 00000073 ecall
30: 05a00513 addi x10 x0 90
34: 034000ef jal x1 52 <uf8_decode>
38: 00100893 addi x17 x0 1
3c: 00000073 ecall
40: 10000517 auipc x10 0x10000
44: fc050513 addi x10 x10 -64
48: 00400893 addi x17 x0 4
4c: 00000073 ecall
50: 0f000513 addi x10 x0 240
54: 014000ef jal x1 20 <uf8_decode>
58: 00100893 addi x17 x0 1
5c: 00000073 ecall
60: 00a00893 addi x17 x0 10
64: 00000073 ecall
00000068 <uf8_decode>:
68: ff010113 addi x2 x2 -16
6c: 00112623 sw x1 12 x2
70: 40455293 srai x5 x10 4
74: 00f57313 andi x6 x10 15
78: 00100393 addi x7 x0 1
7c: 005393b3 sll x7 x7 x5
80: fff38393 addi x7 x7 -1
84: 00439393 slli x7 x7 4
88: 00531333 sll x6 x6 x5
8c: 00730533 add x10 x6 x7
90: 00c12083 lw x1 12 x2
94: 01010113 addi x2 x2 16
98: 00008067 jalr x0 x1 0
```
##### 5-stage pipelined processor
> Demonstrate each stage: IF, ID, IE, MEM, WB.
Explain memory update steps and their correctness.

### `uf8_encode`
#### C code
```c
/* Encode uint32_t to uf8 */
uf8 uf8_encode(uint32_t value)
{
/* Use CLZ for fast exponent calculation */
if (value < 16) // If less than 16, return directly → no compression needed.
return value;
/* Find appropriate exponent using CLZ hint */
int lz = clz(value); // count leading zeros
int msb = 31 - lz;
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
/* Estimate exponent - the formula is empirical */
exponent = msb - 4; // Initial estimate
if (exponent > 15)
exponent = 15;
/* Calculate overflow for estimated exponent */ // Interval start value
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16; // overflow = offset(e) = (2^e−1)⋅16
/* Adjust if estimate was off */
while (exponent > 0 && value < overflow) { // value < overflow → exponent guessed too large.
overflow = (overflow - 16) >> 1;
exponent--;
}
}
/* Find exact exponent */
while (exponent < 15) {
uint32_t next_overflow = (overflow << 1) + 16;
if (value < next_overflow) // value falls within the current exponent’s interval → no need to increase the exponent.
break;
overflow = next_overflow;
exponent++; // exponent too small.
}
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
}
```
#### Assembly code
> Show iterative refinement (reducing code size and runtime overhead)
```riscv
uf8_encode:
addi sp, sp, -16
sw ra, 12(sp)
# check if value < 16
li t0, 16
blt a0, t0, encode_return_value
# compute MSB via clz loop
li t1, 31 # bit position
mv t2, a0 # copy value
clz_loop:
srli t3, t2, 31
bnez t3, clz_done
addi t1, t1, -1
slli t2, t2, 1
j clz_loop
clz_done:
# t1 = msb
li t2, 5
blt t1, t2, find_exact_exponent_loop
addi t0, t1, -4 # t0 = exponent
li t2, 15
blt t2, t0, estimate_exponent
li t0, 15
estimate_exponent:
li t2, 0 # t2 = overflow
li t3, 0 # t3 = e
calculate_overflow_loop:
bge t3, t0, adjust_loop
slli t2, t2, 1 # overflow <<= 1
addi t2, t2, 16 # overflow += 16
addi t3, t3, 1 # e++
j calculate_overflow_loop
adjust_loop:
blez t0, find_exact_exponent_loop # if exponent <= 0, exit
bge a0, t2, find_exact_exponent_loop # if value >= overflow, exit
addi t2, t2, -16 # overflow -= 16
srli t2, t2, 1 # overflow >>= 1
addi t0, t0, -1 # exponent--
j adjust_loop
find_exact_exponent_loop:
li t4, 15
bge t0, t4, refine_exp_done # if exponent >= 15, exit loop
slli t3, t2, 1 # t3 = next_overflow = overflow << 1
addi t3, t3, 16 # next_overflow += 16
blt a0, t3, refine_exp_done # if value < next_overflow, break
mv t2, t3 # overflow = next_overflow
addi t0, t0, 1 # exponent++
j find_exact_exponent_loop
refine_exp_done:
sub t4, a0, t2 # t4 = mantissa = value - overflow
srl t4, t4, t0 # mantissa >>= exponent
slli t0, t0, 4 # exponent << 4
or a0, t0, t4 # a0 = (exponent << 4) | mantissa
encode_return_value:
lw ra, 12(sp)
addi sp, sp, 16
ret
```
#### Use case
> Three test data cases
> Including internal validation
> Compare against compiler output for verification and discussion
::: spoiler Test code
```riscv=
.data
str: .string "\n The encoded value is "
.text
main:
# Test data 1
la a0, str
li a7, 4
ecall
li a0, 108
jal ra, uf8_encode
li a7, 1
ecall
# Test data 2
la a0, str
li a7, 4
ecall
li a0, 816
jal ra, uf8_encode
li a7, 1
ecall
# Test data 3
la a0, str
li a7, 4
ecall
li a0, 524272
jal ra, uf8_encode
li a7, 1
ecall
# End
li a7, 10
ecall
```
:::
Input:
* Test data 1: `108`
* Test data 2: `816`
* Test data 3: `524272`
Output:
```
The encoded value is 47
The encoded value is 90
The encoded value is 240
```
Compiler output:
```
The encoded value of 108 is 47
The encoded value of 816 is 90
The encoded value of 524272 is 240
```
#### Analysis
##### Pseudo instruction
```
00000000 <main>:
0: 10000517 auipc x10 0x10000
4: 00050513 addi x10 x10 0
8: 00400893 addi x17 x0 4
c: 00000073 ecall
10: 06c00513 addi x10 x0 108
14: 058000ef jal x1 88 <uf8_encode>
18: 00100893 addi x17 x0 1
1c: 00000073 ecall
20: 10000517 auipc x10 0x10000
24: fe050513 addi x10 x10 -32
28: 00400893 addi x17 x0 4
2c: 00000073 ecall
30: 33000513 addi x10 x0 816
34: 038000ef jal x1 56 <uf8_encode>
38: 00100893 addi x17 x0 1
3c: 00000073 ecall
40: 10000517 auipc x10 0x10000
44: fc050513 addi x10 x10 -64
48: 00400893 addi x17 x0 4
4c: 00000073 ecall
50: 00080537 lui x10 0x80
54: ff050513 addi x10 x10 -16
58: 014000ef jal x1 20 <uf8_encode>
5c: 00100893 addi x17 x0 1
60: 00000073 ecall
64: 00a00893 addi x17 x0 10
68: 00000073 ecall
0000006c <uf8_encode>:
6c: ff010113 addi x2 x2 -16
70: 00112623 sw x1 12 x2
74: 01000293 addi x5 x0 16
78: 08554e63 blt x10 x5 156 <encode_return_value>
7c: 01f00313 addi x6 x0 31
80: 00050393 addi x7 x10 0
00000084 <clz_loop>:
84: 01f3de13 srli x28 x7 31
88: 000e1863 bne x28 x0 16 <clz_done>
8c: fff30313 addi x6 x6 -1
90: 00139393 slli x7 x7 1
94: ff1ff06f jal x0 -16 <clz_loop>
00000098 <clz_done>:
98: 00500393 addi x7 x0 5
9c: 04734463 blt x6 x7 72 <find_exact_exponent_loop>
a0: ffc30293 addi x5 x6 -4
a4: 00f00393 addi x7 x0 15
a8: 0053c463 blt x7 x5 8 <estimate_exponent>
ac: 00f00293 addi x5 x0 15
000000b0 <estimate_exponent>:
b0: 00000393 addi x7 x0 0
b4: 00000e13 addi x28 x0 0
000000b8 <calculate_overflow_loop>:
b8: 005e5a63 bge x28 x5 20 <adjust_loop>
bc: 00139393 slli x7 x7 1
c0: 01038393 addi x7 x7 16
c4: 001e0e13 addi x28 x28 1
c8: ff1ff06f jal x0 -16 <calculate_overflow_loop>
000000cc <adjust_loop>:
cc: 00505c63 bge x0 x5 24 <find_exact_exponent_loop>
d0: 00755a63 bge x10 x7 20 <find_exact_exponent_loop>
d4: ff038393 addi x7 x7 -16
d8: 0013d393 srli x7 x7 1
dc: fff28293 addi x5 x5 -1
e0: fedff06f jal x0 -20 <adjust_loop>
000000e4 <find_exact_exponent_loop>:
e4: 00f00e93 addi x29 x0 15
e8: 01d2de63 bge x5 x29 28 <refine_exp_done>
ec: 00139e13 slli x28 x7 1
f0: 010e0e13 addi x28 x28 16
f4: 01c54863 blt x10 x28 16 <refine_exp_done>
f8: 000e0393 addi x7 x28 0
fc: 00128293 addi x5 x5 1
100: fe5ff06f jal x0 -28 <find_exact_exponent_loop>
00000104 <refine_exp_done>:
104: 40750eb3 sub x29 x10 x7
108: 005edeb3 srl x29 x29 x5
10c: 00429293 slli x5 x5 4
110: 01d2e533 or x10 x5 x29
00000114 <encode_return_value>:
114: 00c12083 lw x1 12 x2
118: 01010113 addi x2 x2 16
11c: 00008067 jalr x0 x1 0
```
##### 5-stage pipelined processor

## [bfloat16](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C)
The bfloat16 format (16-bit, from Google Brain) preserves float32’s dynamic range by keeping the same 8-bit exponent, but reduces precision to a 7-bit significand (vs. 23).
Value:
$$
v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right)
$$
where
* $S \in \{0, 1\}$ is the sign bit
* $E \in [1, 254]$ is the biased exponent
* $M \in [0, 127]$ is the mantissa value
Special Cases:
* Zero: $E = 0, M = 0 \Rightarrow v = (-1)^S \times 0$
* Infinity: $E = 255, M = 0 \Rightarrow v = (-1)^S \times \infty$
* NaN: $E = 255, M \neq 0 \Rightarrow v = \text{NaN}$
* Denormals: Not supported (flush to zero)
### Implementation
```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
typedef struct {
uint16_t bits;
} bf16_t;
#define BF16_SIGN_MASK 0x8000U // bit15
#define BF16_EXP_MASK 0x7F80U // bits 14..7 = exponent mask
#define BF16_MANT_MASK 0x007FU // bits 6..0 = mantissa mask
#define BF16_EXP_BIAS 127
#define BF16_NAN() ((bf16_t) {.bits = 0x7FC0}) // canonical qNaN (exp=0xFF, mant=0x40)
#define BF16_ZERO() ((bf16_t) {.bits = 0x0000}) // +0
```
#### Determination
##### C code
```c
/* Check for NaN: exponent all 1s and mantissa ≠ 0 */
static inline bool bf16_isnan(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
(a.bits & BF16_MANT_MASK);
}
/* Check for Inf: exponent all 1s and mantissa == 0 */
static inline bool bf16_isinf(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
!(a.bits & BF16_MANT_MASK);
}
/* Check for zero: mask out the sign bit using a.bits & 0x7FFF */
static inline bool bf16_iszero(bf16_t a)
{
return !(a.bits & 0x7FFF);
}
```
##### Assembly code
> Show iterative refinement (reducing code size and runtime overhead)
> Compare against compiler output for verification and discussion
```riscv
main:
li t3, 0x7F80 # t3 = BF16_EXP_MASK
li t4, 0x007F # t4 = BF16_MANT_MASK
li t6, 0x7FFF
bf16_isnan:
and t5, t3, a0
bne t5, t3, return_0
and t5, t4, a0
snez a0, t5 # a0 = 1 if mantissa != 0
ret
bf16_isinf:
and t5, t3, a0
bne t5, t3, return_0
and t5, t4, a0
seqz a0, t5 # a0 = 1 if mantissa == 0
ret
bf16_iszero:
and t5, t6, a0
seqz a0, t5
ret
```
##### Use case
> Three test data cases
> Including internal validation
::: spoiler Test code
```riscv=
.data
bf16_list: .half 0x7FC1, 0x7F80, 0x0000, 0x8000 # NaN, +Inf, 0, -0
str1: .string "\n NAN : "
str2: .string "\n Inf : "
str3: .string "\n Zero : "
.text
main:
isnan_start:
la t0, bf16_list
li t1, 0
li t2, 4
la a0, str1
li a7, 4
ecall
isnan_loop:
beq t1, t2, isinf_start
lh a0, 0(t0)
jal ra, bf16_isnan
li a7, 1
ecall
addi t0, t0, 2 # Next half (16 bits)
addi t1, t1, 1 # i++
j isnan_loop
isinf_start:
la t0, bf16_list
li t1, 0
li t2, 4
la a0, str2
li a7, 4
ecall
isinf_loop:
beq t1, t2, iszero_start
lh a0, 0(t0)
jal ra, bf16_isinf
li a7, 1
ecall
addi t0, t0, 2 # Next half (16 bits)
addi t1, t1, 1 # i++
j isinf_loop
iszero_start:
la t0, bf16_list
li t1, 0
li t2, 4
la a0, str3
li a7, 4
ecall
iszero_loop:
beq t1, t2, end
lh a0, 0(t0)
jal ra, bf16_iszero
li a7, 1
ecall
addi t0, t0, 2 # Next half (16 bits)
addi t1, t1, 1 # i++
j iszero_loop
return_0:
li a0, 0
ret
end:
li a7, 10
ecall
```
:::
Input: `0x7FC1, 0x7F80, 0x0000, 0x8000` (NaN, +Inf, 0, -0)
Output:
```
NAN : 1000
Inf : 0100
Zero : 0011
```
The results match the expected output; therefore, the comparison with the compiler output is omitted.
#### Format convertion
##### C code
```c
static inline bf16_t f32_to_bf16(float val)
{
uint32_t f32bits;
memcpy(&f32bits, &val, sizeof(float));
if (((f32bits >> 23) & 0xFF) == 0xFF) // exp all-ones
return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF};
/* round-to-nearest-even */
f32bits += ((f32bits >> 16) & 1) + 0x7FFF; // 0x7FFF = 2^15 - 1
return (bf16_t) {.bits = f32bits >> 16};
}
```
```c
static inline float bf16_to_f32(bf16_t val)
{
/* Convert bf16_t to float32 by placing bf16 in the high 16 bits */
uint32_t f32bits = ((uint32_t) val.bits) << 16;
float result;
memcpy(&result, &f32bits, sizeof(float));
return result;
}
```
##### Assembly code
> Show iterative refinement (reducing code size and runtime overhead)
> Compare against compiler output for verification and discussion
```riscv
f32_to_bf16:
# check exp == 0xFF
srli t0, a0, 23
andi t0, t0, 0xFF
li t1, 0xFF
beq t0, t1, f32_to_bf16_done
# round-to-nearest-even
srli t2, a0, 16
andi t2, t2, 1
li t3, 0x7FFF
add a0, a0, t2
add a0, a0, t3
f32_to_bf16_done:
srli a0, a0, 16
ret
bf16_to_f32:
slli a0, a0, 16
ret
```
##### Use case
> Three test data cases
> Including internal validation
::: spoiler Test code
```riscv=
.data
str1: .string "\n The bf16 of 0x3FC00000 is "
str2: .string "\n The bf16 of 0xC02C0000 is "
str3: .string "\n The bf16 of 0x477FE000 is "
str4: .string "\n The f32 of 0x3FC0 is "
str5: .string "\n The f32 of 0xC02C is "
str6: .string "\n The f32 of 0x4780 is "
f32_input1: .word 0x3FC00000
f32_input2: .word 0xC02C0000
f32_input3: .word 0x477FE000
bf16_input1: .word 0x3FC0
bf16_input2: .word 0xC02C
bf16_input3: .word 0x4780
.text
main:
# Test f32_to_bf16
# data 1
la a0, str1
li a7, 4
ecall
lw a0, f32_input1
jal ra, f32_to_bf16
li a7, 34
ecall
# data 2
la a0, str2
li a7, 4
ecall
lw a0, f32_input2
jal ra, f32_to_bf16
li a7, 34
ecall
# data 3
la a0, str3
li a7, 4
ecall
lw a0, f32_input3
jal ra, f32_to_bf16
li a7, 34
ecall
# Test bf16_to_f32
# data 1
la a0, str4
li a7, 4
ecall
lw a0, bf16_input1
jal ra, bf16_to_f32
li a7, 34
ecall
# data 2
la a0, str5
li a7, 4
ecall
lw a0, bf16_input2
jal ra, bf16_to_f32
li a7, 34
ecall
# data 3
la a0, str6
li a7, 4
ecall
lw a0, bf16_input3
jal ra, bf16_to_f32
li a7, 34
ecall
# End
li a7, 10
ecall
```
:::
Input:
* Test data for f32_to_bf16: `[0x3FC00000, 0xC02C0000, 0x477FE000]`
* Test data for bf16_to_f32: `[0x3FC0, 0xC030, 0x4780]`
Output:
```
The bf16 of 0x3FC00000 is 0x3fc0
The bf16 of 0xC02C0000 is 0xc02c
The bf16 of 0x477FE000 is 0x4780
The f32 of 0x3FC0 is 0x3fc00000
The f32 of 0xC02C is 0xc02c0000
The f32 of 0x4780 is 0x47800000
```
Compiler output:
```
The bf16 of 0x3fc00000 is 0x3fc0
The bf16 of 0xc02c0000 is 0xc02c
The bf16 of 0x477fe000 is 0x4780
The f32 of 0x3fc0 is 0x3fc00000
The f32 of 0xc030 is 0xc0300000
The f32 of 0x4780 is 0x47800000
```
#### Calculation
##### C code
```c
static inline bf16_t bf16_add(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
if (exp_a == 0xFF) {
if (mant_a)
return a;
if (exp_b == 0xFF)
return (mant_b || sign_a == sign_b) ? b : BF16_NAN();
return a;
}
if (exp_b == 0xFF)
return b;
if (!exp_a && !mant_a) // a == 0 -> return b
return b;
if (!exp_b && !mant_b) // b == 0 -> return a
return a;
/* restore implicit 1 */
if (exp_a)
mant_a |= 0x80;
if (exp_b)
mant_b |= 0x80;
int16_t exp_diff = exp_a - exp_b;
uint16_t result_sign;
int16_t result_exp;
uint32_t result_mant;
if (exp_diff > 0) {
result_exp = exp_a;
if (exp_diff > 8) // too far away -> b negligible
return a;
mant_b >>= exp_diff; // align mantissa
} else if (exp_diff < 0) {
result_exp = exp_b;
if (exp_diff < -8) // too far away -> a negligible
return b;
mant_a >>= -exp_diff;
} else {
result_exp = exp_a;
}
if (sign_a == sign_b) {
result_sign = sign_a;
result_mant = (uint32_t) mant_a + mant_b;
if (result_mant & 0x100) { // carry beyond 8 bits
result_mant >>= 1;
/* overflow -> Inf */
if (++result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
} else { // different signs -> Subtract
if (mant_a >= mant_b) {
result_sign = sign_a;
result_mant = mant_a - mant_b;
} else {
result_sign = sign_b;
result_mant = mant_b - mant_a;
}
if (!result_mant)
return BF16_ZERO();
/* normalize */
while (!(result_mant & 0x80)) {
result_mant <<= 1;
if (--result_exp <= 0)
return BF16_ZERO();
}
}
return (bf16_t) {
.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(result_mant & 0x7F),
};
}
```
```c
static inline bf16_t bf16_sub(bf16_t a, bf16_t b)
{
b.bits ^= BF16_SIGN_MASK; // XOR
return bf16_add(a, b);
}
```
```c
static inline bf16_t bf16_mul(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
uint16_t result_sign = sign_a ^ sign_b; // XOR: same sign → positive, different sign → negative
if (exp_a == 0xFF) {
if (mant_a)
return a; // NaN
if (!exp_b && !mant_b)
return BF16_NAN(); // Inf * 0 = NaN
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; // Inf
}
if (exp_b == 0xFF) {
if (mant_b)
return b;
if (!exp_a && !mant_a)
return BF16_NAN();
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
/* If either number is zero */
if ((!exp_a && !mant_a) || (!exp_b && !mant_b))
return (bf16_t) {.bits = result_sign << 15};
/* subnormal number (exponent = 0,mantissa ≠ 0) → normalize */
int16_t exp_adjust = 0;
if (!exp_a) {
while (!(mant_a & 0x80)) {
mant_a <<= 1;
exp_adjust--;
}
exp_a = 1;
} else
mant_a |= 0x80;
if (!exp_b) {
while (!(mant_b & 0x80)) {
mant_b <<= 1;
exp_adjust--;
}
exp_b = 1;
} else
mant_b |= 0x80;
uint32_t result_mant = (uint32_t) mant_a * mant_b;
int32_t result_exp = (int32_t) exp_a + exp_b - BF16_EXP_BIAS + exp_adjust;
/* Normalization */
if (result_mant & 0x8000) {
result_mant = (result_mant >> 8) & 0x7F;
result_exp++;
} else
result_mant = (result_mant >> 7) & 0x7F;
/* overflow and underflow */
if (result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
if (result_exp <= 0) {
if (result_exp < -6)
return (bf16_t) {.bits = result_sign << 15};
result_mant >>= (1 - result_exp);
result_exp = 0;
}
return (bf16_t) {.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(result_mant & 0x7F)};
}
```
```c
static inline bf16_t bf16_div(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
uint16_t result_sign = sign_a ^ sign_b;
if (exp_b == 0xFF) {
if (mant_b)
return b; // NaN
/* Inf/Inf = NaN */
if (exp_a == 0xFF && !mant_a)
return BF16_NAN(); // Inf / Inf = NaN
return (bf16_t) {.bits = result_sign << 15}; // 有限數 / Inf = 0
}
/* If denominator is 0 */
if (!exp_b && !mant_b) {
if (!exp_a && !mant_a)
return BF16_NAN(); // 0/0 = NaN
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; // x/0 = ±Inf
}
/* If numerator is Inf */
if (exp_a == 0xFF) {
if (mant_a)
return a; // NaN
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80}; // ±Inf / Finite number = ±Inf
}
/* If numerator is 0 */
if (!exp_a && !mant_a)
return (bf16_t) {.bits = result_sign << 15}; // 0/x = 0
if (exp_a)
mant_a |= 0x80;
if (exp_b)
mant_b |= 0x80;
uint32_t dividend = (uint32_t) mant_a << 15; // dividend
uint32_t divisor = mant_b; // divisor
uint32_t quotient = 0; // quotient
/* Loop 16 times to build the quotient bit by bit */
for (int i = 0; i < 16; i++) { // Shift divisor to align with dividend; subtract if possible and set quotient bit.
quotient <<= 1;
if (dividend >= (divisor << (15 - i))) {
dividend -= (divisor << (15 - i));
quotient |= 1;
}
}
int32_t result_exp = (int32_t) exp_a - exp_b + BF16_EXP_BIAS;
if (!exp_a)
result_exp--;
if (!exp_b)
result_exp++;
/* Normalization */
if (quotient & 0x8000)
quotient >>= 8;
else {
while (!(quotient & 0x8000) && result_exp > 1) {
quotient <<= 1;
result_exp--;
}
quotient >>= 8;
}
quotient &= 0x7F;
/* overflow and underflow */
if (result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
if (result_exp <= 0)
return (bf16_t) {.bits = result_sign << 15};
return (bf16_t) {.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(quotient & 0x7F)};
}
```
##### Assembly code
> Show iterative refinement (reducing code size and runtime overhead)
> Compare against compiler output for verification and discussion
```riscv
bf16_add:
addi sp, sp, -16
sw ra, 12(sp)
# --- extract sign, exp, mantissa ---
srli a2, a0, 15 # a2 = sign_a
srli a3, a1, 15 # a3 = sign_b
srli a4, a0, 7
andi a4, a4, 0xFF # a4 = exp_a
srli a5, a1, 7
andi a5, a5, 0xFF # a5 = exp_b
andi a6, a0, 0x7F # a6 = mant_a
andi a7, a1, 0x7F # a7 = mant_b
# --- handle special cases: NaN, Inf, zero ---
li t3, 0xFF
bne a4, t3, finish_first_case
bnez a6, return_a
bne a5, t3, return_a
# (mant_b || sign_a == sign_b) ? b : NaN
bnez a7, return_b
beq a2, a3, return_b
li a0, 0x7FC0
ret
finish_first_case:
beq a5, t3, return_b
beqz a4, return_b
beqz a5, return_a
# --- restore implicit 1 for normalized numbers ---
bf16_add_restore_a:
beqz a4, bf16_add_restore_b
ori a6, a6, 0x80 # mant_a |= 0x80
bf16_add_restore_b:
beqz a5, exp_diff_start
ori a7, a7, 0x80 # mant_b |= 0x80
exp_diff_start:
sub t3, a4, a5 # t3 = exp_diff
li t4, 8 # t4 = 8
blez t3, exp_diff_le0
mv t5, a4 # t5 = result_exp
bgt t3, t4, return_a
srl a7, a7, t3
j sign_start
exp_diff_le0:
bgez t3, exp_diff_eq0
mv t5, a5 # t5 = result_exp
neg t3, t3 # t3 = -exp_diff
bgt t3, t4, return_b
srl a6, a6, t3
j sign_start
exp_diff_eq0:
mv t5, a4 # t5 = result_exp
sign_start:
bne a2, a3, diff_sign_case
same_sign_case:
mv t3, a2 # t3 = result_sign
add t4, a6, a7 # t4 = result_mant
andi t6, t4, 0x100 # t6 = result_mant & 0x100
beqz t6, bf16_add_end
srli t4, t4, 1
addi t5, t5, 1
addi t6, t5, -0xFF # t6 = ++result_exp - 0xFF
bgez t6, overflow_to_inf
j bf16_add_end
overflow_to_inf:
slli a0, t3, 15
li t6, 0x7F80
or a0, a0, t6
j bf16_add_end
diff_sign_case:
blt a6, a7, mant_a_smaller
mv t3, a2 # t3 = result_sign
sub t4, a6, a7 # t4 = result_mant
j check_zero
mant_a_smaller:
mv t3, a3 # t3 = result_sign
sub t4, a7, a6 # t4 = result_mant
check_zero:
beqz t4, return_zero
normalize_loop:
andi t6, t4, 0x80 # t6 = result_mant & 0x80
bnez t6, bf16_add_end
slli t4, t4, 1
addi t5, t5, -1
blez t5 return_zero
j normalize_loop
bf16_add_end:
slli a0, t3, 15 # a0 = result_sign << 15
andi t6, t5, 0xFF # t1 = result_exp & 0xFF
slli t6, t6, 7 # t6 << 7
or a0, a0, t6 # a0 |= (result_exp & 0xFF) << 7
andi t6, t4, 0x7F # t6 = result_mant & 0x7F
or a0, a0, t6 # a0 |= result_mant & 0x7F
ret
return_a:
mv a0, a0
ret
return_b:
mv a0, a1
ret
return_zero:
li a0, 0
ret
```
```riscv
bf16_sub:
addi sp, sp, -16
sw ra, 12(sp)
li t3, 0x8000
xor a1, a1, t3
jal bf16_add
lw ra, 12(sp)
addi sp, sp, 16
ret
```
##### Use case
> Three test data cases
> Including internal validation
:::spoiler Test code
```riscv=
.data
bf16_vals:
.half 0x3FC0
.half 0xC030
.half 0x4780
.half 0x3FC0
str1: .string "\nbf16_add result: "
str2: .string "\n\nbf16_sub result: "
str: .string " \n"
.text
main:
add_start:
la t0, bf16_vals # t0 = pointer to list
li t1, 0 # loop counter
li t2, 2
la a0, str1
li a7, 4
ecall
add_loop:
bgt t1, t2, sub_start
la a0, str
li a7, 4
ecall
lhu a0, 0(t0) # a = bf16_vals[i]
lhu a1, 2(t0) # b = bf16_vals[i+1]
jal ra, bf16_add
li a7, 34
ecall
addi t0, t0, 2
addi t1, t1, 1
j add_loop
sub_start:
la t0, bf16_vals # t0 = pointer to list
li t1, 0 # loop counter
li t2, 2
la a0, str2
li a7, 4
ecall
sub_loop:
bgt t1, t2, end
la a0, str
li a7, 4
ecall
lhu a0, 0(t0) # a = bf16_vals[i]
lhu a1, 2(t0) # b = bf16_vals[i+1]
jal bf16_sub
li a7, 34
ecall
addi t0, t0, 2
addi t1, t1, 1
j sub_loop
end:
li a7, 10
ecall
```
:::
Input: `[0x3FC0, 0xC030, 0x4780]``
Output:
```
bf16_add result:
0xbfa0
0x4780
0x4780
bf16_sub result:
0x4088
0xc780
0x4780
```
Compiler output:
```
bf16_add result:
0xBFA0
0x4780
0x4780
bf16_sub result:
0x4088
0xC780
0x4780
```
### Analysis
#### 5-stage pipelined processor
> Demonstrate each stage: IF, ID, IE, MEM, WB.
Explain memory update steps and their correctness.