# Assignment 1: RISC-V Assembly and Instruction Pipeline
## Problem B
### Introduction
#### About UF8
UF8 is an 8-bit **logarithmic encoding format** that approximately represents a 20-bit non-negative integer (ranging from 0 to 1,015,792). Each UF8 byte is split into two fields: the **upper 4 bits** hold the **exponent `e`**, and the **lower 4 bits** hold the **mantissa `m`**, as shown in the figure below:

Each exponent `e` corresponds to a **range**, so there are 16 ranges in total. Each range has its corresponding **offset(e)** as the starting value.
Within the same range, the value spacing (i.e., the difference between two adjacent representable values) is fixed, and the step size is $2^e$.
#### UF8 Decoding
To decode an 8-bit UF8 value `b` into the 32-bit integer it represents, the following formula can be used:
$$
e = \lfloor b / 16 \rfloor, \quad m = b \bmod 16
$$
$$
D(b) = m \cdot 2^e + (2^e - 1) \cdot 16
$$
Here, the term $(2^e - 1) \cdot 16$ represents the corresponding **offset(e)**.
You can better understand the UF8 structure by observing the **starting value** and **step size** in the following examples:
- **e = 0:** offset = 0, representable range = 0, 1, 2, …, 15 → step = 1
- **e = 1:** offset = 16, representable range = 16, 18, 20, …, 46 → step = 2
- **e = 2:** offset = 48, representable range = 48, 52, 56, …, 108 → step = 4
Additionally, when the mantissa $m = 15$, moving to the beginning of the next range (i.e., $m = 0$ with exponent $e + 1$) results in a value difference exactly equal to $2^e$.
Thus, the overall value distribution forms a smooth staircase-like progression, ensuring there are no discontinuities or gaps between adjacent ranges.
---
### Related work
In this part, we need to translate the original C code provided in **Problem B** into a **RISC-V assembly program**, using only **RV32I instructions** (without the M extension or the F/D floating-point extensions), and ensure that it runs correctly on the **Ripes simulator**.
The original C code for Problem B is as follows:
```C=
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
typedef uint8_t uf8;
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
uint32_t mantissa = fl & 0x0f;
uint8_t exponent = fl >> 4;
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
return (mantissa << exponent) + offset;
}
/* Encode uint32_t to uf8 */
uf8 uf8_encode(uint32_t value)
{
/* Use CLZ for fast exponent calculation */
if (value < 16)
return value;
/* Find appropriate exponent using CLZ hint */
int lz = clz(value);
int msb = 31 - lz;
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
/* Estimate exponent - the formula is empirical */
exponent = msb - 4;
if (exponent > 15)
exponent = 15;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
/* Adjust if estimate was off */
while (exponent > 0 && value < overflow) {
overflow = (overflow - 16) >> 1;
exponent--;
}
}
/* Find exact exponent */
while (exponent < 15) {
uint32_t next_overflow = (overflow << 1) + 16;
if (value < next_overflow)
break;
overflow = next_overflow;
exponent++;
}
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
}
/* Test encode/decode round-trip */
static bool test(void)
{
int32_t previous_value = -1;
bool passed = true;
for (int i = 0; i < 256; i++) {
uint8_t fl = i;
int32_t value = uf8_decode(fl);
uint8_t fl2 = uf8_encode(value);
if (fl != fl2) {
printf("%02x: produces value %d but encodes back to %02x\n", fl,
value, fl2);
passed = false;
}
if (value <= previous_value) {
printf("%02x: value %d <= previous_value %d\n", fl, value,
previous_value);
passed = false;
}
previous_value = value;
}
return passed;
}
int main(void)
{
if (test()) {
printf("All tests passed.\n");
return 0;
}
return 1;
}
```
#### 1. uf8_decode
The corresponding assembly code is as follows:
```C=
UF8_decode:
andi t1, a0, 0x0F # m = fl & 0x0F
srli t2, a0, 4 # e = fl >> 4
li t3, 1
sll t3, t3, t2 # 1 << e
addi t3, t3, -1 # (1 << e) - 1
slli t3, t3, 4 # offset = ((1 << e)-1) << 4
sll t1, t1, t2 # m << e
add a0, t1, t3 # return (m << e) + offset
jr ra
```
#### 2. clz
**clz** is a function used by uf8_encode. Its purpose is to return **the number of leading zeros** in a 32-bit unsigned integer, starting from the most significant bit (MSB).
If we directly translate the original C code of **clz** into assembly, the corresponding RISC-V assembly code would be as follows:
```C=
CLZ:
li t1, 32 # n
li t2, 16 # c
clz_For:
srl t3, a0, t2
beqz t3, clz_Skip
sub t1, t1, t2
mv a0, t3
clz_Skip:
srli t2, t2, 1
bnez t2, clz_For
clz_End:
sub a0, t1, a0
jr ra
```
Using a general-case input `a0 = 0x28ad94ca` for testing, the corresponding **Execution Information** is shown below:

Next, we optimize the above directly translated assembly code of **clz**, as shown below:
```C=
CLZ:
li t1, 32 # n = 32
clz_For1:
srli t2, a0, 16 # y = x >> 16, c = 16
beqz t2, clz_For2 # if (y == 0) skip
addi t1, t1, -16 # n -= 16
mv a0, t2 # x = y
clz_For2:
srli t2, a0, 8
beqz t2, clz_For3
addi t1, t1, -8
mv a0, t2
clz_For3:
srli t2, a0, 4
beqz t2, clz_For4
addi t1, t1, -4
mv a0, t2
clz_For4:
srli t2, a0, 2
beqz t2, clz_For5
addi t1, t1, -2
mv a0, t2
clz_For5:
srli t2, a0, 1
beqz t2, clz_End
addi t1, t1, -1
mv a0, t2
clz_End:
sub a0, t1, a0 # return n - x
jr ra
```
Here, **loop unrolling** is used for optimization. Although this increases the code size, it reduces the number of branches and the total execution cycles.
Using the same input example as before, the **Execution Information** for this optimized version is shown below:

It is clear that the **execution cycles** were reduced by approximately 33%.
#### 3. uf8_encode
**uf8_encode** is a function that compresses a 20-bit unsigned integer into an 8-bit encoded value. It estimates the exponent (logarithmic scale) based on the magnitude of the input, calculates the mantissa (fine-grained bits), and finally combines them into a single 8-bit uf8 encoded result.
The corresponding assembly code is as follows:
```C=
UF8_encode:
addi sp, sp, -4
sw ra, 0(sp)
li t0, 16 # if value < 16
blt a0, t0, small_End
mv t5, a0 # t5 = value (original value)
# lz = clz(value); msb = 31 - lz
jal ra, CLZ # a0 = clz(value)
li t6, 31
sub t6, t6, a0 # t6 = msb
# exponent = 0; overflow = 0
li t1, 0 # t1 = exponent
li t2, 0 # t2 = overflow = offset(e)
# if (msb >= 5) { exponent = msb - 4; clamp to 15; build offset(e) }
li t0, 5
blt t6, t0, find_Exact # if (msb < 5) → skip
addi t1, t6, -4 # exponent = msb - 4
li t0, 15
ble t1, t0, calculate_Off
li t1, 15
# overflow = offset(exponent)
# off(0)=0; off(e+1) = (off(e) << 1) + 16
calculate_Off:
li t2, 0 # overflow = 0
li t3, 0 # counter
off_For:
beq t3, t1, adjust_Down
slli t2, t2, 1 # overflow <<= 1
addi t2, t2, 16 # overflow += 16
addi t3, t3, 1
j off_For
# while (exponent > 0 && value < overflow)
# { overflow = (overflow - 16) >> 1; exponent--; }
adjust_Down:
beqz t1, find_Exact
bge t5, t2, find_Exact
addi t2, t2, -16
srli t2, t2, 1
addi t1, t1, -1
j adjust_Down
# find exponent range s.t. offset(e) <= value < offset(e+1)
find_Exact:
li t0, 15
bge t1, t0, Mantissa # clamp at 15
up_For:
slli t3, t2, 1
addi t3, t3, 16 # t3 = next_overflow = offset(e+1)
blt t5, t3, Mantissa # if (value < next_overflow) break
mv t2, t3 # overflow = next_overflow
addi t1, t1, 1 # e++
blt t1, t0, up_For
Mantissa:
# m = (value - offset(e)) >> e
sub t3, t5, t2 # value - overflow
srl t3, t3, t1 # >> e
encode_End:
# b = (e<<4) | m
slli t1, t1, 4 # e<<4
andi t3, t3, 0x0F # m (4 bits)
or a0, t1, t3
lw ra, 0(sp)
addi sp, sp, 4
jr ra
# value < 16
small_End:
andi a0, a0, 0xFF
lw ra, 0(sp)
addi sp, sp, 4
jr ra
```
#### 4. main() + test()
The **test()** function is designed to ensure that **uf8_encode** and **uf8_decode** correctly convert between each other. It iterates through all possible 8-bit uf8 codes (a total of 256), decodes each one into its original integer value, re-encodes it back to uf8, and checks whether the result matches the original code. Additionally, it verifies that the decoded values are **strictly increasing** to confirm the monotonicity of the logarithmic quantization.
The **main()** function is responsible for invoking test(). If all checks pass without error, it prints the message "**All tests passed.**" to indicate that the entire encoding system is functioning correctly.
The corresponding assembly code is as follows:
```C=
.data
msg_pass: .asciz "All tests passed.\n"
msg_err1: .asciz ": produces value "
msg_err2: .asciz " but encodes back to "
msg_err3: .asciz ": value "
msg_err4: .asciz " <= previous_value "
nextline: .asciz "\n"
.text
.globl main
main:
jal ra, Test
beqz a0, Fail
la a0, msg_pass
li a7, 4
ecall
li a0, 0
li a7, 10
ecall
Fail:
li a0, 1
li a7, 10
ecall
# ------------------------------------------------------------
# Test function
# ------------------------------------------------------------
Test:
addi sp, sp, -4
sw ra, 0(sp)
li s1, -1
li s2, 1
li s3, 0
li s7, 256
round_Trip:
beq s3, s7, test_End
mv s4, s3
mv a0, s4
jal ra, UF8_decode
mv s5, a0
jal ra, UF8_encode
mv s6, a0
beq s4, s6, check_Inc
mv a0, s4
li a7, 34
ecall
la a0, msg_err1
li a7, 4
ecall
mv a0, s5
li a7, 1
ecall
la a0, msg_err2
li a7, 4
ecall
mv a0, s6
li a7, 34
ecall
la a0, nextline
li a7, 4
ecall
li s2, 0
check_Inc:
bgt s5, s1, check_End
mv a0, s4
li a7, 34
ecall
la a0, msg_err3
li a7, 4
ecall
mv a0, s5
li a7, 1
ecall
la a0, msg_err4
li a7, 4
ecall
mv a0, s1
li a7, 1
ecall
la a0, nextline
li a7, 4
ecall
li s2, 0
check_End:
mv s1, s5
addi s3, s3, 1
j round_Trip
test_End:
mv a0, s2
lw ra, 0(sp)
addi sp, sp, 4
jr ra
```
---
### Result Console
In this assignment, all assembly programs are run on a **RISC-V 32-bit 5-stage processor** and output to the **console** using environment calls.
The **result** of executing the assembly program composed of the previously described functions is shown below:

As shown in the image above, the program executes correctly and outputs the expected prompt messages.
---
### Execution Information
This section demonstrates the **Execution Information** when running the assembly code on the **Ripes simulator**, in order to compare it with the compiler-generated version (compiled with `-O1`) and observe the performance directly.
#### 1. Compiler's version:

* Corresponding **code size: 267**
#### 2. Improved version:

* Corresponding **code size: 185**
---
### Improvement
From the **Execution Information** above, compared with the compiler’s version:
* **Code size reduced by approximately 31%**
* **Execution cycles reduced by approximately 11%**
* **Actual number of executed instructions reduced by approximately 17%**
---
## Problem C
### Introduction
**bfloat16** (Brain Float 16-bit) is a **16-bit** floating-point format proposed by Google. Its design philosophy is to **retain the same numerical range as FP32 while sacrificing precision**. The structural composition is as follows:

bfloat16 differs from the IEEE-754 standard half precision format (16-bit float, also known as float16). IEEE float16 consists of 1 sign bit + 5 exponent bits + 10 mantissa bits, offering higher precision but smaller representable range. In contrast, bfloat16 uses **1 sign bit + 8 exponent bits + 7 mantissa bits**, resulting in lower precision but a representable range similar to single-precision (float32), due to both formats having 8-bit exponents.
---
### Related Work
We need to translate the original C code provided in **Problem C** (divided into two parts) into a **RISC-V assembly program**, using only **RV32I instructions** (without the M extension or the F/D floating-point extensions), and ensure that it runs correctly on the **Ripes simulator**.
#### Part I
This part implements the **core operations** on **bfloat16**, including addition, subtraction, multiplication, division, conversion, and special value processing.
The original C code is as follows:
```C=
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
typedef struct {
uint16_t bits;
} bf16_t;
#define BF16_SIGN_MASK 0x8000U
#define BF16_EXP_MASK 0x7F80U
#define BF16_MANT_MASK 0x007FU
#define BF16_EXP_BIAS 127
#define BF16_NAN() ((bf16_t) {.bits = 0x7FC0})
#define BF16_ZERO() ((bf16_t) {.bits = 0x0000})
static inline bool bf16_isnan(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_isinf(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
!(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_iszero(bf16_t a)
{
return !(a.bits & 0x7FFF);
}
static inline bf16_t f32_to_bf16(float val)
{
uint32_t f32bits;
memcpy(&f32bits, &val, sizeof(float));
if (((f32bits >> 23) & 0xFF) == 0xFF)
return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF};
f32bits += ((f32bits >> 16) & 1) + 0x7FFF;
return (bf16_t) {.bits = f32bits >> 16};
}
static inline float bf16_to_f32(bf16_t val)
{
uint32_t f32bits = ((uint32_t) val.bits) << 16;
float result;
memcpy(&result, &f32bits, sizeof(float));
return result;
}
static inline bf16_t bf16_add(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
if (exp_a == 0xFF) {
if (mant_a)
return a;
if (exp_b == 0xFF)
return (mant_b || sign_a == sign_b) ? b : BF16_NAN();
return a;
}
if (exp_b == 0xFF)
return b;
if (!exp_a && !mant_a)
return b;
if (!exp_b && !mant_b)
return a;
if (exp_a)
mant_a |= 0x80;
if (exp_b)
mant_b |= 0x80;
int16_t exp_diff = exp_a - exp_b;
uint16_t result_sign;
int16_t result_exp;
uint32_t result_mant;
if (exp_diff > 0) {
result_exp = exp_a;
if (exp_diff > 8)
return a;
mant_b >>= exp_diff;
} else if (exp_diff < 0) {
result_exp = exp_b;
if (exp_diff < -8)
return b;
mant_a >>= -exp_diff;
} else {
result_exp = exp_a;
}
if (sign_a == sign_b) {
result_sign = sign_a;
result_mant = (uint32_t) mant_a + mant_b;
if (result_mant & 0x100) {
result_mant >>= 1;
if (++result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
} else {
if (mant_a >= mant_b) {
result_sign = sign_a;
result_mant = mant_a - mant_b;
} else {
result_sign = sign_b;
result_mant = mant_b - mant_a;
}
if (!result_mant)
return BF16_ZERO();
while (!(result_mant & 0x80)) {
result_mant <<= 1;
if (--result_exp <= 0)
return BF16_ZERO();
}
}
return (bf16_t) {
.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(result_mant & 0x7F),
};
}
static inline bf16_t bf16_sub(bf16_t a, bf16_t b)
{
b.bits ^= BF16_SIGN_MASK;
return bf16_add(a, b);
}
static inline bf16_t bf16_mul(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
uint16_t result_sign = sign_a ^ sign_b;
if (exp_a == 0xFF) {
if (mant_a)
return a;
if (!exp_b && !mant_b)
return BF16_NAN();
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
if (exp_b == 0xFF) {
if (mant_b)
return b;
if (!exp_a && !mant_a)
return BF16_NAN();
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
if ((!exp_a && !mant_a) || (!exp_b && !mant_b))
return (bf16_t) {.bits = result_sign << 15};
int16_t exp_adjust = 0;
if (!exp_a) {
while (!(mant_a & 0x80)) {
mant_a <<= 1;
exp_adjust--;
}
exp_a = 1;
} else
mant_a |= 0x80;
if (!exp_b) {
while (!(mant_b & 0x80)) {
mant_b <<= 1;
exp_adjust--;
}
exp_b = 1;
} else
mant_b |= 0x80;
uint32_t result_mant = (uint32_t) mant_a * mant_b;
int32_t result_exp = (int32_t) exp_a + exp_b - BF16_EXP_BIAS + exp_adjust;
if (result_mant & 0x8000) {
result_mant = (result_mant >> 8) & 0x7F;
result_exp++;
} else
result_mant = (result_mant >> 7) & 0x7F;
if (result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
if (result_exp <= 0) {
if (result_exp < -6)
return (bf16_t) {.bits = result_sign << 15};
result_mant >>= (1 - result_exp);
result_exp = 0;
}
return (bf16_t) {.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(result_mant & 0x7F)};
}
static inline bf16_t bf16_div(bf16_t a, bf16_t b)
{
uint16_t sign_a = (a.bits >> 15) & 1;
uint16_t sign_b = (b.bits >> 15) & 1;
int16_t exp_a = ((a.bits >> 7) & 0xFF);
int16_t exp_b = ((b.bits >> 7) & 0xFF);
uint16_t mant_a = a.bits & 0x7F;
uint16_t mant_b = b.bits & 0x7F;
uint16_t result_sign = sign_a ^ sign_b;
if (exp_b == 0xFF) {
if (mant_b)
return b;
/* Inf/Inf = NaN */
if (exp_a == 0xFF && !mant_a)
return BF16_NAN();
return (bf16_t) {.bits = result_sign << 15};
}
if (!exp_b && !mant_b) {
if (!exp_a && !mant_a)
return BF16_NAN();
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
if (exp_a == 0xFF) {
if (mant_a)
return a;
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
}
if (!exp_a && !mant_a)
return (bf16_t) {.bits = result_sign << 15};
if (exp_a)
mant_a |= 0x80;
if (exp_b)
mant_b |= 0x80;
uint32_t dividend = (uint32_t) mant_a << 15;
uint32_t divisor = mant_b;
uint32_t quotient = 0;
for (int i = 0; i < 16; i++) {
quotient <<= 1;
if (dividend >= (divisor << (15 - i))) {
dividend -= (divisor << (15 - i));
quotient |= 1;
}
}
int32_t result_exp = (int32_t) exp_a - exp_b + BF16_EXP_BIAS;
if (!exp_a)
result_exp--;
if (!exp_b)
result_exp++;
if (quotient & 0x8000)
quotient >>= 8;
else {
while (!(quotient & 0x8000) && result_exp > 1) {
quotient <<= 1;
result_exp--;
}
quotient >>= 8;
}
quotient &= 0x7F;
if (result_exp >= 0xFF)
return (bf16_t) {.bits = (result_sign << 15) | 0x7F80};
if (result_exp <= 0)
return (bf16_t) {.bits = result_sign << 15};
return (bf16_t) {.bits = (result_sign << 15) | ((result_exp & 0xFF) << 7) |
(quotient & 0x7F)};
}
```
<details>
<summary>The corresponding assembly code (include test part)</summary>
```C=
.equ BF16_SIGN_MASK, 0x8000
.equ BF16_EXP_MASK, 0x7F80
.equ BF16_MANT_MASK, 0x007F
.equ BF16_EXP_BIAS, 127
.data
msg_pass: .asciz "All tests passed\n"
msg_fail: .asciz "Tests failed\n"
.text
.globl main
main:
# ... run tests ...
# --------------------
# bf16_isnan(a0)
# --------------------
li a0, 0x7FC1 # NaN
jal ra, bf16_isnan
li t6, 1
bne a0, t6, fail
li a0, 0x7F80 # +Inf
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
li a0, 0x4000 # +2.0
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
# --------------------
# bf16_isinf(a0)
# --------------------
li a0, 0x7F80 # +Inf
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
li a0, 0xFF80 # -Inf
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
li a0, 0x7FC1 # NaN
jal ra, bf16_isinf
li t6, 0
bne a0, t6, fail
# --------------------
# bf16_iszero(a0)
# --------------------
li a0, 0x0000 # +0
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
li a0, 0x8000 # -0
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
li a0, 0x4000 # 2.0
jal ra, bf16_iszero
li t6, 0
bne a0, t6, fail
# --------------------
# f32_to_bf16(a0: f32 bits) -> a0: bf16 bits
# --------------------
li a0, 0x3F800000 # 1.0f
jal ra, f32_to_bf16
li t6, 0x3F80 # 1.0 (bf16)
bne a0, t6, fail
li a0, 0x40400000 # 3.0f
jal ra, f32_to_bf16
li t6, 0x4040 # 3.0 (bf16)
bne a0, t6, fail
li a0, 0x7FC00000 # quiet NaN
jal ra, f32_to_bf16
li t6, 0x7FC0
bne a0, t6, fail
# --------------------
# bf16_to_f32(a0: bf16 bits) -> a0: f32 bits
# --------------------
li a0, 0x3F80 # 1.0 (bf16)
jal ra, bf16_to_f32
li t6, 0x3F800000
bne a0, t6, fail
li a0, 0x4040 # 3.0 (bf16)
jal ra, bf16_to_f32
li t6, 0x40400000
bne a0, t6, fail
li a0, 0x7FC1 # NaN with payload -> preserved in upper 16 bits
jal ra, bf16_to_f32
li t6, 0x7FC10000
bne a0, t6, fail
# --------------------
# bf16_add(a0, a1) -> a0
# --------------------
li a0, 0x3F80 # 1.0 + 2.0 = 3.0
li a1, 0x4000
jal ra, bf16_add
li t6, 0x4040
bne a0, t6, fail
li a0, 0x4040 # 3.0 + (-1.0) = 2.0
li a1, 0xBF80
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
li a0, 0x0000 # 0 + 1.0 = 1.0
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x3F80
bne a0, t6, fail
li a0, 0x7F80 # +Inf + -Inf = NaN
li a1, 0xFF80
jal ra, bf16_add
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC1 # NaN + 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x7FC1
bne a0, t6, fail
li a0, 0x4000 # 2.0 + tiny subnormal (>= 9 ulps apart) -> unchanged
li a1, 0x0001 # smallest subnormal
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
# --------------------
# bf16_sub(a0, a1) -> a0
# --------------------
li a0, 0x4040 # 3.0 - 1.0 = 2.0
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x4000
bne a0, t6, fail
li a0, 0x3F80 # 1.0 - 3.0 = -2.0
li a1, 0x4040
jal ra, bf16_sub
li t6, 0xC000
bne a0, t6, fail
li a0, 0x0000 # 0 - 1.0 = -1.0
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0xBF80
bne a0, t6, fail
li a0, 0x7F80 # +Inf - +Inf = NaN
li a1, 0x7F80
jal ra, bf16_sub
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC2 # NaN - 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x7FC2
bne a0, t6, fail
li a0, 0x3F80 # 1.0 - 0 = 1.0
li a1, 0x0000
jal ra, bf16_sub
li t6, 0x3F80
bne a0, t6, fail
# --------------------
# bf16_mul(a0, a1) -> a0
# --------------------
li a0, 0x4000 # 2.0 * 3.0 = 6.0
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x40C0 # 6.0
bne a0, t6, fail
li a0, 0x3F80 # 1.0 * -2.0 = -2.0
li a1, 0xC000
jal ra, bf16_mul
li t6, 0xC000
bne a0, t6, fail
li a0, 0x0000 # 0 * 3.0 = 0
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
li a0, 0x7F80 # Inf * 0 = NaN
li a1, 0x0000
jal ra, bf16_mul
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC1 # NaN * 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_mul
li t6, 0x7FC1
bne a0, t6, fail
li a0, 0x0002 # subnormal * 1.0 -> 0 (mul flush behavior here)
li a1, 0x3F80
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
# --------------------
# bf16_div(a0, a1) -> a0
# --------------------
li a0, 0x4080 # 4 / 2 = 2
li a1, 0x4000
jal ra, bf16_div
li t6, 0x4000
bne a0, t6, fail
li a0, 0x4000 # 2 / 4 = 0.5
li a1, 0x4080
jal ra, bf16_div
li t6, 0x3F00
bne a0, t6, fail
li a0, 0x4040 # 3 / 3 = 1
li a1, 0x4040
jal ra, bf16_div
li t6, 0x3F80
bne a0, t6, fail
li a0, 0x3F80 # x / 0 = +Inf (x > 0)
li a1, 0x0000
jal ra, bf16_div
li t6, 0x7F80
bne a0, t6, fail
li a0, 0x0000 # 0 / x = 0
li a1, 0x3F80
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
li a0, 0x0001 # subnormal / 1.0 = 0 (div flush-to-zero here)
li a1, 0x3F80
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
# --------------------
# All passed -> print and exit
# --------------------
all_pass:
la a0, msg_pass
li a7, 4
ecall
li a0, 0 # return code 0
li a7, 10
ecall
# Any failure -> print and exit with code=1
fail:
la a0, msg_fail
li a7, 4
ecall
li a0, 1
li a7, 10
ecall
bf16_div:
srli t0, a0, 15
andi t0, t0, 1
srli t1, a1, 15
andi t1, t1, 1
srli t2, a0, 7
andi t2, t2, 0xFF
srli t3, a1, 7
andi t3, t3, 0xFF
andi t4, a0, 0x7F
andi t5, a1, 0x7F
xor a2, t0, t1
li t6, 0xFF
bne t3, t6, chk_b_zero
bnez t5, ret_b
bne t2, t6, ret_sign_zero
bnez t4, ret_sign_zero
j ret_nan
chk_b_zero:
or t6, t3, t5
bnez t6, chk_a_inf
or t6, t2, t4
beqz t6, ret_nan
j ret_inf
chk_a_inf:
li t6, 0xFF
bne t2, t6, chk_a_zero
bnez t4, ret_a
j ret_inf
chk_a_zero:
or t6, t2, t4
bnez t6, prep_mant
j ret_sign_zero
prep_mant:
beqz t2, a_is_norm
ori t4, t4, 0x80
a_is_norm:
beqz t3, div_a_b
ori t5, t5, 0x80
div_a_b:
mv a3, t4
li a6, 0
bgeu a3, t5, div_no_scale
slli a3, a3, 1
li a6, 1
div_no_scale:
sub a3, a3, t5
li a4, 0
li t6, 7
div_frac_for:
beqz t6, div_frac_done
slli a3, a3, 1
bltu a3, t5, div_bit0
sub a3, a3, t5
slli a4, a4, 1
ori a4, a4, 1
addi t6, t6, -1
j div_frac_for
div_bit0:
slli a4, a4, 1
addi t6, t6, -1
j div_frac_for
div_frac_done:
li t6, BF16_EXP_BIAS
sub a5, t2, t3
add a5, a5, t6
bnez t2, chk_incr_exp
addi a5, a5, -1
chk_incr_exp:
bnez t3, div_result
addi a5, a5, 1
beqz a6, div_result
addi a5, a5, -1
div_result:
li t6, 0xFF
bge a5, t6, ret_inf
blez a5, ret_sign_zero
slli a2, a2, 15
andi a5, a5, 0xFF
slli a5, a5, 7
andi a4, a4, 0x7F
or a0, a2, a5
or a0, a0, a4
jr ra
bf16_mul:
srli t0, a0, 15
andi t0, t0, 1
srli t1, a1, 15
andi t1, t1, 1
srli t2, a0, 7
andi t2, t2, 0xFF
srli t3, a1, 7
andi t3, t3, 0xFF
andi t4, a0, 0x7F
andi t5, a1, 0x7F
xor a2, t0, t1
li t6, 0xFF
bne t2, t6, chk_exp_b_full
bnez t4, ret_a
or t6, t3, t5
beqz t6, ret_nan
j ret_inf
chk_exp_b_full:
li t6, 0xFF
bne t3, t6, chk_if_zero
bnez t5, ret_b
or t6, t2, t4
beqz t6, ret_nan
j ret_inf
chk_if_zero:
or t6, t2, t4
beqz t6, ret_sign_zero
or t6, t3, t5
beqz t6, ret_sign_zero
exp_adjust:
li a3, 0
bnez t2, skip_a_norm
norm_a_for:
andi t6, t4, 0x80
bnez t6, break_1
slli t4, t4, 1
addi a3, a3, -1
j norm_a_for
break_1:
li t2, 1
j chk_b_norm
skip_a_norm:
ori t4, t4, 0x80
chk_b_norm:
bnez t3, b_is_norm
norm_b_for:
andi t6, t5, 0x80
bnez t6, break_2
slli t5, t5, 1
addi a3, a3, -1
j norm_b_for
break_2:
li t3, 1
j mult_a_b
b_is_norm:
ori t5, t5, 0x80
mult_a_b:
mv t6, t4
mv a6, t5
li a4, 0
mult_for:
beqz a6, mult_done
andi a5, a6, 1
beqz a5, skip_add
add a4, a4, t6
skip_add:
srli a6, a6, 1
slli t6, t6, 1
j mult_for
mult_done:
li t6, BF16_EXP_BIAS
add a5, t2, t3
sub a5, a5, t6
add a5, a5, a3
norm_result_mant:
li t6, 0x8000
li a6, 7
and t6, a4, t6
beqz t6, skip_1
addi a6, a6, 1
addi a5, a5, 1
skip_1:
srl a4, a4, a6
andi a4, a4, 0x7F
li t6, 0xFF
bge a5, t6, ret_inf
bgtz a5, mult_result
li t6, -6
blt a5, t6, ret_sign_zero
li t6, 1
sub t6, t6, a5
srl a4, a4, t6
li a5, 0
mult_result:
slli a2, a2, 15
andi a5, a5, 0xFF
slli a5, a5, 7
andi a4, a4, 0x7F
or a0, a2, a5
or a0, a0, a4
jr ra
bf16_sub:
addi sp, sp, -4
sw ra, 0(sp)
li t0, BF16_SIGN_MASK
xor a1, a1, t0
jal ra, bf16_add
lw ra, 0(sp)
addi sp, sp, 4
jr ra
bf16_add:
srli t0, a0, 15
andi t0, t0, 1
srli t1, a1, 15
andi t1, t1, 1
srli t2, a0, 7
andi t2, t2, 0xFF
srli t3, a1, 7
andi t3, t3, 0xFF
andi t4, a0, 0x7F
andi t5, a1, 0x7F
li t6, 0xFF
bne t2, t6, chk_b_inf_or_nan
bnez t4, ret_a
bne t3, t6, ret_a
bnez t5, ret_b
beq t0, t1, ret_b
j ret_nan
chk_b_inf_or_nan:
beq t3, t6, ret_b
check_zero:
or t6, t2, t4
beqz t6, ret_b
or t6, t3, t5
beqz t6, ret_a
check_normal:
beqz t2, skip_a
ori t4, t4, 0x80
skip_a:
beqz t3, adjust_exp
ori t5, t5, 0x80
adjust_exp:
sub a2, t2, t3
li t6, 9
blez a2, not_greater
mv a3, t2
blt a2, t6, shift_b
j ret_a
shift_b:
srl t5, t5, a2
j aligned
not_greater:
beqz a2, equal_diff
mv a3, t3
sub a6, x0, a2
blt a6, t6, shift_a
j ret_b
shift_a:
srl t4, t4, a6
j aligned
equal_diff:
mv a3, t2
aligned:
bne t0, t1, diff_sign
mv a4, t0
add a5, t4, t5
andi t6, a5, 0x100
beqz t6, compose_result
srli a5, a5, 1
addi a3, a3, 1
li t6, 0xFF
bge a3, t6, add_ret_inf
j compose_result
diff_sign:
blt t4, t5, mant_a_small
mv a4, t0
sub a5, t4, t5
j check_mant_zero
mant_a_small:
mv a4, t1
sub a5, t5, t4
check_mant_zero:
beqz a5, ret_zero
norm_for:
andi t6, a5, 0x80
bnez t6, compose_result
slli a5, a5, 1
addi a3, a3, -1
blez a3, ret_zero
j norm_for
compose_result:
slli a4, a4, 15
andi t6, a3, 0xFF
slli t6, t6, 7
andi a5, a5, 0x7F
or a0, a4, t6
or a0, a0, a5
jr ra
ret_a:
jr ra
ret_b:
mv a0, a1
jr ra
ret_zero:
li a0, 0x0000
jr ra
ret_sign_zero:
slli a0, a2, 15
jr ra
ret_nan:
li a0, 0x7FC0
jr ra
ret_inf:
slli a0, a2, 15
li t6, 0x7F80
or a0, a0, t6
jr ra
add_ret_inf:
slli a0, a4, 15
li t6, 0x7F80
or a0, a0, t6
jr ra
bf16_to_f32:
slli a0, a0, 16
jr ra
f32_to_bf16:
mv t0, a0
srli t1, t0, 23
li t2, 0xFF
and t1, t1, t2
beq t1, t2, is_NaN_or_Inf
srli t1, t0, 16
andi t1, t1, 1
li t2, 0x7FFF
add t1, t1, t2
add t0, t0, t1
srli a0, t0, 16
jr ra
is_NaN_or_Inf:
srli a0, t0, 16
jr ra
bf16_isnan:
li t1, BF16_EXP_MASK
and t0, a0, t1
xor t0, t0, t1
slti t0, t0, 1
li t2, BF16_MANT_MASK
and t1, a0, t2
snez t1, t1
and a0, t0, t1
jr ra
bf16_isinf:
li t1, BF16_EXP_MASK
and t0, a0, t1
xor t0, t0, t1
slti t0, t0, 1
li t2, BF16_MANT_MASK
and t1, a0, t2
slti t1, t1, 1
and a0, t0, t1
jr ra
bf16_iszero:
li t0, 0x7FFF
and t0, a0, t0
slti a0, t0, 1
jr ra
```
</details>
##### 1. About main
All **function tests** are integrated into `main` and executed sequentially. Each test calls the target function via `jal ra, <func>`, loads the expected result into `t6`, and compares it with the return value in `a0` using `bne a0, t6, fail`. Any mismatch jumps to `fail`, prints "**Tests failed**", and exits with code 1. If all tests pass, it prints "**All tests passed**".
<details>
<summary>The corresponding assembly code (test part)</summary>
```c=
.equ BF16_SIGN_MASK, 0x8000
.equ BF16_EXP_MASK, 0x7F80
.equ BF16_MANT_MASK, 0x007F
.equ BF16_EXP_BIAS, 127
.data
msg_pass: .asciz "All tests passed\n"
msg_fail: .asciz "Tests failed\n"
.text
.globl main
main:
# ... run tests ...
# bf16_isnan(a0)
li a0, 0x7FC1 # NaN
jal ra, bf16_isnan
li t6, 1
bne a0, t6, fail
li a0, 0x7F80 # +Inf
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
li a0, 0x4000 # +2.0
jal ra, bf16_isnan
li t6, 0
bne a0, t6, fail
# bf16_isinf(a0)
li a0, 0x7F80 # +Inf
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
li a0, 0xFF80 # -Inf
jal ra, bf16_isinf
li t6, 1
bne a0, t6, fail
li a0, 0x7FC1 # NaN
jal ra, bf16_isinf
li t6, 0
bne a0, t6, fail
# bf16_iszero(a0)
li a0, 0x0000 # +0
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
li a0, 0x8000 # -0
jal ra, bf16_iszero
li t6, 1
bne a0, t6, fail
li a0, 0x4000 # 2.0
jal ra, bf16_iszero
li t6, 0
bne a0, t6, fail
# f32_to_bf16(a0: f32 bits) -> a0: bf16 bits
li a0, 0x3F800000 # 1.0f
jal ra, f32_to_bf16
li t6, 0x3F80 # 1.0 (bf16)
bne a0, t6, fail
li a0, 0x40400000 # 3.0f
jal ra, f32_to_bf16
li t6, 0x4040 # 3.0 (bf16)
bne a0, t6, fail
li a0, 0x7FC00000 # quiet NaN
jal ra, f32_to_bf16
li t6, 0x7FC0
bne a0, t6, fail
# bf16_to_f32(a0: bf16 bits) -> a0: f32 bits
li a0, 0x3F80 # 1.0 (bf16)
jal ra, bf16_to_f32
li t6, 0x3F800000
bne a0, t6, fail
li a0, 0x4040 # 3.0 (bf16)
jal ra, bf16_to_f32
li t6, 0x40400000
bne a0, t6, fail
li a0, 0x7FC1 # NaN with payload -> preserved in upper 16 bits
jal ra, bf16_to_f32
li t6, 0x7FC10000
bne a0, t6, fail
# bf16_add(a0, a1) -> a0
li a0, 0x3F80 # 1.0 + 2.0 = 3.0
li a1, 0x4000
jal ra, bf16_add
li t6, 0x4040
bne a0, t6, fail
li a0, 0x4040 # 3.0 + (-1.0) = 2.0
li a1, 0xBF80
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
li a0, 0x0000 # 0 + 1.0 = 1.0
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x3F80
bne a0, t6, fail
li a0, 0x7F80 # +Inf + -Inf = NaN
li a1, 0xFF80
jal ra, bf16_add
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC1 # NaN + 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_add
li t6, 0x7FC1
bne a0, t6, fail
li a0, 0x4000 # 2.0 + tiny subnormal (>= 9 ulps apart) -> unchanged
li a1, 0x0001 # smallest subnormal
jal ra, bf16_add
li t6, 0x4000
bne a0, t6, fail
# bf16_sub(a0, a1) -> a0
li a0, 0x4040 # 3.0 - 1.0 = 2.0
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x4000
bne a0, t6, fail
li a0, 0x3F80 # 1.0 - 3.0 = -2.0
li a1, 0x4040
jal ra, bf16_sub
li t6, 0xC000
bne a0, t6, fail
li a0, 0x0000 # 0 - 1.0 = -1.0
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0xBF80
bne a0, t6, fail
li a0, 0x7F80 # +Inf - +Inf = NaN
li a1, 0x7F80
jal ra, bf16_sub
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC2 # NaN - 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_sub
li t6, 0x7FC2
bne a0, t6, fail
li a0, 0x3F80 # 1.0 - 0 = 1.0
li a1, 0x0000
jal ra, bf16_sub
li t6, 0x3F80
bne a0, t6, fail
# bf16_mul(a0, a1) -> a0
li a0, 0x4000 # 2.0 * 3.0 = 6.0
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x40C0 # 6.0
bne a0, t6, fail
li a0, 0x3F80 # 1.0 * -2.0 = -2.0
li a1, 0xC000
jal ra, bf16_mul
li t6, 0xC000
bne a0, t6, fail
li a0, 0x0000 # 0 * 3.0 = 0
li a1, 0x4040
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
li a0, 0x7F80 # Inf * 0 = NaN
li a1, 0x0000
jal ra, bf16_mul
li t6, 0x7FC0
bne a0, t6, fail
li a0, 0x7FC1 # NaN * 1.0 = NaN (propagate payload)
li a1, 0x3F80
jal ra, bf16_mul
li t6, 0x7FC1
bne a0, t6, fail
li a0, 0x0002 # subnormal * 1.0 -> 0 (mul flush behavior here)
li a1, 0x3F80
jal ra, bf16_mul
li t6, 0x0000
bne a0, t6, fail
# bf16_div(a0, a1) -> a0
li a0, 0x4080 # 4 / 2 = 2
li a1, 0x4000
jal ra, bf16_div
li t6, 0x4000
bne a0, t6, fail
li a0, 0x4000 # 2 / 4 = 0.5
li a1, 0x4080
jal ra, bf16_div
li t6, 0x3F00
bne a0, t6, fail
li a0, 0x4040 # 3 / 3 = 1
li a1, 0x4040
jal ra, bf16_div
li t6, 0x3F80
bne a0, t6, fail
li a0, 0x3F80 # x / 0 = +Inf (x > 0)
li a1, 0x0000
jal ra, bf16_div
li t6, 0x7F80
bne a0, t6, fail
li a0, 0x0000 # 0 / x = 0
li a1, 0x3F80
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
li a0, 0x0001 # subnormal / 1.0 = 0 (div flush-to-zero here)
li a1, 0x3F80
jal ra, bf16_div
li t6, 0x0000
bne a0, t6, fail
# All passed -> print and exit
all_pass:
la a0, msg_pass
li a7, 4
ecall
li a0, 0 # return code 0
li a7, 10
ecall
# Any failure -> print and exit with code=1
fail:
la a0, msg_fail
li a7, 4
ecall
li a0, 1
li a7, 10
ecall
```
</details>
The **test coverage** includes:
- **bf16_isnan / bf16_isinf / bf16_iszero:** checks NaN, ±Inf, ±0, and normal numbers.
- **f32_to_bf16 / bf16_to_f32:** verifies correct conversion for 1.0, 3.0, and NaN.
- **bf16_add / bf16_sub:** tests same-sign and opposite-sign operations, zeros, Inf, and NaN.
- **bf16_mul / bf16_div:** covers regular arithmetic, zeros, Inf, NaN, subnormals, and divide-by-zero.
The program prints only one summary line ("All tests passed" or "Tests failed") using **ecall**, ensuring minimal and clear output as required.
##### 2. About bf16_mul
According to the assignment requirements, since the RV32M extension instructions such as `mul` are **not allowed**, the multiplication must be implemented using **only RV32I base instructions**.
In this implementation, the classical **Egyptian multiplication** is adopted. This method emulates multiplication through bit shifting and conditional addition, completing the calculation of `mant_a × mant_b` in `bf16_mul`.
The procedure of the multiplication is as follows:
1. Let **b** (`mant_b`) be the multiplier, and **a** (`mant_a`) be the multiplicand.
2. While `b ≠ 0`, perform the following operations repeatedly:
- If **the least significant bit** of `b` is 1, add the current value of `a` to the **accumulator**.
- Then perform `a <<= 1` (double the multiplicand) and `b >>= 1` (halve the multiplier).
3. When `b = 0`, the accumulator contains the final product.
Below is the corresponding assembly code segment from `bf16_mul`:
```C=
mult_a_b:
mv t6, t4
mv a6, t5
li a4, 0
mult_for:
beqz a6, mult_done
andi a5, a6, 1
beqz a5, skip_add
add a4, a4, t6
skip_add:
srli a6, a6, 1
slli t6, t6, 1
j mult_for
mult_done:
```
where: `t4=mant_a`, `t5=mant_b`, `t6=multiplicand`, `a6=multiplier`, and `a4=accumulator`.
##### 3. About bf16_div
In `bf16_div`, the key operation is the **mantissa division**. Since division extensions are not allowed, the division must be implemented purely using **shift-and-subtraction** operations.
The original version adopts a **16-round restoring long division** algorithm. It repeatedly shifts and compares the operands to determine each **quotient** bit, one per round.
The corresponding original C code is as follows:
```C=
uint32_t dividend = (uint32_t) mant_a << 15;
uint32_t divisor = mant_b;
uint32_t quotient = 0;
for (int i = 0; i < 16; i++) {
quotient <<= 1;
if (dividend >= (divisor << (15 - i))) {
dividend -= (divisor << (15 - i));
quotient |= 1;
}
}
int32_t result_exp = (int32_t) exp_a - exp_b + BF16_EXP_BIAS;
if (!exp_a)
result_exp--;
if (!exp_b)
result_exp++;
if (quotient & 0x8000)
quotient >>= 8;
else {
while (!(quotient & 0x8000) && result_exp > 1) {
quotient <<= 1;
result_exp--;
}
quotient >>= 8;
}
quotient &= 0x7F;
```
If we directly translate the original C code into assembly, the corresponding implementation is shown below:
```C=
div_a_b:
slli a3, t4, 15
li a4, 0
li a6, 15
div_for:
slli a4, a4, 1
sll t6, t5, a6
blt a3, t6, div_no_sub
sub a3, a3, t6
ori a4, a4, 1
div_no_sub:
addi a6, a6, -1
bgez a6, div_for
li t6, BF16_EXP_BIAS
sub a5, t2, t3
add a5, a5, t6
bnez t2, chk_incr_exp
addi a5, a5, -1
chk_incr_exp:
bnez t3, norm_quot
addi a5, a5, 1
norm_quot:
li t6, 0x8000
and t6, a4, t6
bnez t6, norm_quot_done
norm_quot_for:
li t6, 1
ble a5, t6, norm_quot_done
li t6, 0x8000
and t6, a4, t6
bnez t6, norm_quot_done
slli a4, a4, 1
addi a5, a5, -1
j norm_quot_for
norm_quot_done:
srli a4, a4, 8
andi a4, a4, 0x7F
```
Running with test cases for `bf16_div` in `main`, the corresponding **Execution Information** is shown below:

This algorithm performs **16 iterations**, each computing one quotient bit. After all rounds, the result must be normalized so that the mantissa lies within `[1.0, 2.0)`.
However, since the original `bf16_div` does not implement **rounding** and only the top 7 fraction bits are needed, it is unnecessary to compute all 16 bits of precision.
Thus, the process can be optimized by using a **7-round fractional long division**, which directly produces the normalized mantissa bits.
In this improved version, the procedure is as follows:
1. **Integer-part correction:**
Compare the two mantissas first.
- If `mant_a ≥ mant_b`, the quotient’s integer part is 1,
so subtract once: `remainder = mant_a - mant_b`.
- Otherwise, the dividend is smaller than the divisor, so we cannot subtract yet.
Set `remainder = mant_a` and later shift it left once (×2) to bring the integer part to 1, while the exponent `result_exp` will later be decreased by 1 to compensate.
2. **Seven-round fractional division:**
Each round determines one fractional bit (total 7 bits):
- Shift `remainder <<= 1` (to multiply by 2) and `mant <<= 1`.
- If `remainder ≥ mant_b`, subtract once `remainder -= mant_b` and set the least bit to 1 (`mant |= 1`).
- Otherwise, leave the least bit 0.
The modified C code is shown as follows:
```C=
uint16_t remainder;
uint16_t mant = 0;
uint16_t flag = 0;
if (mant_a >= mant_b) {
remainder = (uint16_t)(mant_a - mant_b);
} else {
remainder = mant_a;
flag = 1;
}
for (int i = 0; i < 7; i++) {
remainder <<= 1;
mant <<= 1;
if (remainder >= mant_b) {
remainder -= mant_b;
mant |= 1;
}
}
int32_t result_exp = (int32_t)exp_a - exp_b + BF16_EXP_BIAS;
if (!exp_a) result_exp--;
if (!exp_b) result_exp++;
if (flag) result_exp--;
```
The corresponding assembly implementation is shown below:
```C=
div_a_b:
mv a3, t4
li a6, 0
bgeu a3, t5, div_no_scale
slli a3, a3, 1
li a6, 1
div_no_scale:
sub a3, a3, t5
li a4, 0
li t6, 7
div_frac_for:
beqz t6, div_frac_done
slli a3, a3, 1
bltu a3, t5, div_bit0
sub a3, a3, t5
slli a4, a4, 1
ori a4, a4, 1
addi t6, t6, -1
j div_frac_for
div_bit0:
slli a4, a4, 1
addi t6, t6, -1
j div_frac_for
div_frac_done:
li t6, BF16_EXP_BIAS
sub a5, t2, t3
add a5, a5, t6
bnez t2, chk_incr_exp
addi a5, a5, -1
chk_incr_exp:
bnez t3, chk_flag
addi a5, a5, 1
chk_flag:
beqz a6, div_result
addi a5, a5, -1
div_result:
```
where: `t4 = mant_a`, `t5 = mant_b`, `a3 = remainder`, `a4 = mant`, `a5 = result_exp`, `a6 = flag`, and `t6 = 7` for loop control.
Running with test cases for `bf16_div` in `main`, the corresponding **Execution Information** of this improved version is shown below:

It is clear that the **execution cycles** were reduced by approximately $30$%.
##### 4. Others
For example, in the original C code, you can often see some program fragments similar to the following:
```C=
if (!x && !y)
< operation >
......
```
If we translate it directly, the corresponding assembly code would be:
```C=
bnez a0, Skip
bnez a1, Skip
< operation >
Skip:
......
```
Here, `a0 = x` and `a1 = y`.
However, this version requires **two conditional branches**, which can be further optimized. By combining the operands with the **`or`** instruction, we can achieve the same logic with a single test: `x | y` is zero exactly when **both `x` and `y` are zero**, so `<operation>` executes only in that case. (Note that `xor` would not work here, since it is also zero whenever `x == y`, even for nonzero values.)
The optimized assembly code is as follows:
```C=
or t0, a0, a1
bnez t0, Skip
< operation >
Skip:
......
```
This approach **eliminates one branch instruction**, making the program shorter and slightly faster during execution.
---
#### Part II
This part is about the implementation of **square root operation** in **bfloat16** format. The original C code is as follows:
```C=
static inline bf16_t bf16_sqrt(bf16_t a)
{
uint16_t sign = (a.bits >> 15) & 1;
int16_t exp = ((a.bits >> 7) & 0xFF);
uint16_t mant = a.bits & 0x7F;
/* Handle special cases */
if (exp == 0xFF) {
if (mant)
return a; /* NaN propagation */
if (sign)
return BF16_NAN(); /* sqrt(-Inf) = NaN */
return a; /* sqrt(+Inf) = +Inf */
}
/* sqrt(0) = 0 (handle both +0 and -0) */
if (!exp && !mant)
return BF16_ZERO();
/* sqrt of negative number is NaN */
if (sign)
return BF16_NAN();
/* Flush denormals to zero */
if (!exp)
return BF16_ZERO();
/* Direct bit manipulation square root algorithm */
/* For sqrt: new_exp = (old_exp - bias) / 2 + bias */
int32_t e = exp - BF16_EXP_BIAS;
int32_t new_exp;
/* Get full mantissa with implicit 1 */
uint32_t m = 0x80 | mant; /* Range [128, 256) representing [1.0, 2.0) */
/* Adjust for odd exponents: sqrt(2^odd * m) = 2^((odd-1)/2) * sqrt(2*m) */
if (e & 1) {
m <<= 1; /* Double mantissa for odd exponent */
new_exp = ((e - 1) >> 1) + BF16_EXP_BIAS;
} else {
new_exp = (e >> 1) + BF16_EXP_BIAS;
}
/* Now m is in range [128, 256) or [256, 512) if exponent was odd */
/* Binary search for integer square root */
/* We want result where result^2 = m * 128 (since 128 represents 1.0) */
uint32_t low = 90; /* Min sqrt (roughly sqrt(128)) */
uint32_t high = 256; /* Max sqrt (roughly sqrt(512)) */
uint32_t result = 128; /* Default */
/* Binary search for square root of m */
while (low <= high) {
uint32_t mid = (low + high) >> 1;
uint32_t sq = (mid * mid) / 128; /* Square and scale */
if (sq <= m) {
result = mid; /* This could be our answer */
low = mid + 1;
} else {
high = mid - 1;
}
}
/* result now contains sqrt(m) * sqrt(128) / sqrt(128) = sqrt(m) */
/* But we need to adjust the scale */
/* Since m is scaled where 128=1.0, result should also be scaled same way */
/* Normalize to ensure result is in [128, 256) */
if (result >= 256) {
result >>= 1;
new_exp++;
} else if (result < 128) {
while (result < 128 && new_exp > 1) {
result <<= 1;
new_exp--;
}
}
/* Extract 7-bit mantissa (remove implicit 1) */
uint16_t new_mant = result & 0x7F;
/* Check for overflow/underflow */
if (new_exp >= 0xFF)
return (bf16_t) {.bits = 0x7F80}; /* +Inf */
if (new_exp <= 0)
return BF16_ZERO();
return (bf16_t) {.bits = ((new_exp & 0xFF) << 7) | new_mant};
}
```
The corresponding assembly code (include test part) is as follows:
```C=
.equ BF16_SIGN_MASK, 0x8000
.equ BF16_EXP_MASK, 0x7F80
.equ BF16_MANT_MASK, 0x007F
.equ BF16_EXP_BIAS, 127
.data
msg_pass: .asciz "All tests passed\n"
msg_fail: .asciz "Tests failed\n"
.text
.globl main
main:
# bf16_sqrt(a0) -> a0
li a0, 0x7FA5
jal ra, bf16_sqrt
li t6, 0x7FA5
bne a0, t6, fail
# +Inf
li a0, 0x7F80
jal ra, bf16_sqrt
li t6, 0x7F80
bne a0, t6, fail
# -Inf -> NaN
li a0, 0xFF80
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
# +0
li a0, 0x0000
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
# -0 treated as +0
li a0, 0x8000
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
# Negative finite number -> NaN
li a0, 0xC000 # -2.0
jal ra, bf16_sqrt
li t6, 0x7FC0
bne a0, t6, fail
# Subnormal input (flush to zero)
li a0, 0x0001 # smallest subnormal
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
li a0, 0x007F # largest subnormal
jal ra, bf16_sqrt
li t6, 0x0000
bne a0, t6, fail
# Even exponent: do not double mantissa
li a0, 0x4080 # 4.0
jal ra, bf16_sqrt
li t6, 0x4000 # 2.0
bne a0, t6, fail
li a0, 0x4180 # 16.0
jal ra, bf16_sqrt
li t6, 0x4080 # 4.0
bne a0, t6, fail
li a0, 0x4110 # 9.0
jal ra, bf16_sqrt
li t6, 0x4040 # 3.0
bne a0, t6, fail
# Odd exponent: shift mantissa left before halving
li a0, 0x4000 # 2.0
jal ra, bf16_sqrt
li t6, 0x3FB5 # 1.4142 (truncated)
bne a0, t6, fail
li a0, 0x4040 # 3.0
jal ra, bf16_sqrt
li t6, 0x3FDD # 1.732 (truncated)
bne a0, t6, fail
# Need left normalization (result < 128)
li a0, 0x3F00 # 0.5
jal ra, bf16_sqrt
li t6, 0x3F35 # 0.7071 (truncated)
bne a0, t6, fail
# All tests passed
all_pass:
la a0, msg_pass
li a7, 4
ecall
li a0, 0
li a7, 10
ecall
# Any test failed
fail:
la a0, msg_fail
li a7, 4
ecall
li a0, 1
li a7, 10
ecall
bf16_sqrt:
addi sp, sp, -4
sw ra, 0(sp)
srli t0, a0, 15
andi t0, t0, 1
srli t1, a0, 7
andi t1, t1, 0xFF
andi t2, a0, 0x7F
li t6, 0xFF
bne t1, t6, chk_zero
bnez t2, ret_a
bnez t0, ret_nan
j ret_a
chk_zero:
or t6, t1, t2
beqz t6, ret_zero
bnez t0, ret_nan
beqz t1, ret_zero
li t6, 127
sub a2, t1, t6
li a3, 0x80
or a3, a3, t2
andi t6, a2, 1
beqz t6, get_new_exp
odd_exp:
slli a3, a3, 1
addi a2, a2, -1
get_new_exp:
srai a2, a2, 1
li t6, 127
add t3, a2, t6
bin_search:
li a4, 90
li a5, 256
li t4, 128
search_for:
bgtu a4, a5, norm_result
add a6, a4, a5
srli a6, a6, 1
mv a0, a6
mv a1, a6
jal ra, mul_aux
srli t6, a0, 7
bgeu a3, t6, le_ok
addi a5, a6, -1
j search_for
le_ok:
mv t4, a6
addi a4, a6, 1
j search_for
norm_result:
li t6, 256
blt t4, t6, chk_if_norm_up
srli t4, t4, 1
addi t3, t3, 1
chk_if_norm_up:
li t6, 128
bge t4, t6, mant_ok
norm_up:
li t6, 128
bge t4, t6, mant_ok
li t6, 1
ble t3, t6, mant_ok
slli t4, t4, 1
addi t3, t3, -1
j norm_up
mant_ok:
andi t4, t4, 0x7F
li t6, 0xFF
bgeu t3, t6, ret_inf
blez t3, ret_zero
sqrt_result:
andi t3, t3, 0xFF
slli t3, t3, 7
or a0, t3, t4
lw ra, 0(sp)
addi sp, sp, 4
jr ra
ret_a:
lw ra, 0(sp)
addi sp, sp, 4
jr ra
ret_zero:
li a0, 0x0000
lw ra, 0(sp)
addi sp, sp, 4
jr ra
ret_inf:
li a0, 0x7F80
lw ra, 0(sp)
addi sp, sp, 4
jr ra
ret_nan:
li a0, 0x7FC0
lw ra, 0(sp)
addi sp, sp, 4
jr ra
mul_aux:
li t6, 0
mul_for:
beqz a1, mul_done
andi t2, a1, 1
beqz t2, skip_add
add t6, t6, a0
skip_add:
srli a1, a1, 1
slli a0, a0, 1
j mul_for
mul_done:
mv a0, t6
jr ra
```
##### 1. Notes
- In the **binary search loop**, compute `mid*mid` via `mul_aux` (Egyptian multiplication) to avoid hardware multiply.
- Since `128 = 2^7`, compute `sq = (mid*mid)/128` with a right shift by 7 (`srli`), avoiding a more expensive general division.
##### 2. About main
All **tests** for `bf16_sqrt` are integrated in `main`. Each case calls `bf16_sqrt` via `jal ra, bf16_sqrt`, and the return value is in `a0`. We compare it with the expected value in `t6` using `bne a0, t6, fail`.
**Test coverage** includes:
- **NaN**: `a0=0x7FA5` .
- **Infinities**: `+Inf` → `+Inf`, `-Inf` → `NaN`.
- **Negative finite**: `-2.0 → NaN`.
- **Subnormals (flush to zero)**: `0x0001` (min) and `0x007F` (max) → `0`.
- **Even exponent (no mantissa doubling)**: `4.0` → `2.0`, `16.0` → `4.0`, `9.0` → `3.0`.
- **Odd exponent (mantissa doubled first)**: `2.0` → ≈ `1.4142 (0x3FB5)`, `3.0` → ≈ `1.732 (0x3FDD)`.
- **Needs left normalization**: `0.5` → ≈ `0.7071 (0x3F35)`.
Any mismatch jumps to `fail`, prints "**Tests failed**", and exits. If all pass, the program prints "**All tests passed**".
---
### Result Console
All assembly programs are run on a **RISC-V 32-bit 5-stage processor** and output to the **console** using environment calls.
#### Part I
The result of execution is shown below:

#### Part II
The result of execution is shown below:

As shown in the images for **Part I** and **Part II** above, the program executes correctly and outputs the expected prompt messages.
---
### Execution Information
This section demonstrates the **Execution Information** when running the assembly code on the **Ripes simulator**, in order to compare with the compiler-generated version (without `-O1` optimization) and directly observe the performance difference.
#### Part I
##### 1. Compiler's version:

* Corresponding **code size: 1128**
##### 2. Improved version:

* Corresponding **code size: 557**
#### Part II
##### 1. Compiler's version:

* Corresponding **code size: 368**
##### 2. Improved version:

* Corresponding **code size: 191**
---
### Improvement
From the **Execution Information** above, compared with the compiler’s version:
#### Part I
* **Code size reduced by approximately $51$%**
* **Execution cycles reduced by approximately $62$%**
* **Actual number of executed instructions reduced by approximately $62$%**
#### Part II
* **Code size reduced by approximately $48$%**
* **Execution cycles reduced by approximately $11$%**
* **Actual number of executed instructions reduced by approximately $12$%**
---
## Explanation Using Ripes Simulator
In this part, we use **UF8** (introduced in Problem B) as an example to describe a possible use case, and provide explanations for both program functionality and instruction-level operations about `uf8_decode` using the Ripes simulator.
---
### Use case for UF8
For example, in [**LeetCode 347**](https://leetcode.com/problems/top-k-frequent-elements/), we may need to count frequencies for a large number of **elements/events**. The frequency of each element may range from 1 to tens of thousands (or even higher), resulting in a very wide range.
In such a scenario, if we store each element’s frequency value using **UF8** representation (as approximate value), we can replace a `uint32_t` / `int` frequency value with a `uint8_t` one. This allows us to store more data under the same memory capacity, or support a larger hash table for counting.
When producing the final top-K output, we can simply **decode** the frequencies back into approximate integers and then sort based on the decoded values. However, since sorting is performed on decoded (approximate) values, this approach is best suited to scenarios where the most frequent elements have much higher counts than the rest, so that the relative ordering remains stable despite the approximation.
---
### Explanation for Instruction-Level Operations
Here, we use `uf8_decode` as the simple example, value `0x00000021` as the test input, and the corresponding assembly program is shown below:
```=
.data
.text
.globl main
main:
li a0, 0x00000021
UF8_decode:
andi t1, a0, 0x0F # m = fl & 0x0F
srli t2, a0, 4 # e = fl >> 4
li t3, 1
sll t3, t3, t2 # 1 << e
addi t3, t3, -1 # (1 << e) - 1
slli t3, t3, 4 # offset = ((1 << e)-1) << 4
sll t1, t1, t2 # m << e
add a0, t1, t3 # result = (m << e) + offset
Exit:
li a7, 10
ecall
```
For the explanation of **instruction-level operations**, let's look at `li a0, 0x00000021`, which is the first instruction executed in the program.
#### 1. Instruction in IF-stage

In the IF-stage, the CPU fetches the instruction from the instruction memory using the PC, while also computing the next PC. Since `li` is a pseudo-instruction, the assembler expands it into a real RV32I instruction; in this case, it becomes `addi x10, x0, 33`.
>#### In IF-stage:
>- PC Enable: 1 (High), since there's no pipeline stall.
>- Next PC Mux: 0 (Low), since no branch/jump is taken.
#### 2. `addi x10, x0, 33` in ID-stage

In the ID-stage, the decoder identifies the instruction as an I-type `addi` and generates the corresponding control signals. Meanwhile, the register file reads `rs1 = x0` (which is always `0`), and the `Imm.` component produces `imm = 33`.
Note that the register write enable (Wr En) is still 0 at this moment, because the write-back happens in the WB-stage.
>#### In ID-stage:
>- Reg Wr En: 0 (Low), not in WB yet.
#### 3. `addi x10, x0, 33` in EX-stage

In the EX-stage, the ALU takes the value of `rs1` (`x0 = 0`) as **Op1** and the sign-extended immediate `33` as **Op2**, and performs **addition**. The expected ALU result is `33` (`0x21`).
>#### In EX-stage:
>- ALUOp1: from ID/EX pipeline register or forwarding path.
>- ALUOp2: from the immediate.
>- ForwardA: no forwarding, from ID/EX pipeline register.
>- ForwardB: no forwarding.
>- Branch taken: not taken.
#### 4. `addi x10, x0, 33` in MEM-stage

In the MEM-stage, since `addi` does not access the data memory, the ALU result is simply carried into the MEM/WB pipeline register.
>#### In MEM-stage:
>- Data Memory Wr En: 0 (Low), since the `addi` instruction does not write memory.
#### 5. `addi x10, x0, 33` in WB-stage

In the WB-stage, the ALU result should be written back to the register file as `x10`.
>#### In WB-stage:
>- Reg Wr En (in ID-stage): 1 (High), write back to `x10`.
>- WB Mux: select ALU result.
#### 6. `addi x10, x0, 33` exits the pipeline
In the next cycle, instruction `addi x10, x0, 33` has completed and exited the pipeline, so the value of `x10` should now be `0x21` (`33`). We check the value of `x10` (`a0`):

The value of `x10` matches the expected result, indicating that pseudo-instruction `li a0, 0x00000021` (expanded into `addi`) executed correctly.
---
### Explanation for Program Functionality
For the explanation of **program functionality**, we continue from the cycle right after step 6 in the previous section: at this point, `li a0, 0x00000021` (expanded into `addi x10, x0, 33`) has already completed and exited the pipeline, so `x10` has been correctly set to `0x21`.
The program then enters the body of the `UF8_decode` function, computing the mantissa `m`, the exponent `e`, the `offset`, and finally the decoded result.
>Since this is a **5-stage pipeline processor**, **multiple instructions overlap in the same cycle**. Therefore, a single snapshot of the pipeline can show the different instructions simultaneously residing in different stages.

>#### In IF-stage: `addi x28, x28, -1`
>- PC Enable: 1 (High), since there's no pipeline stall.
>- Next PC Mux: 0 (Low), since no branch/jump is taken.
>#### In ID-stage: `sll x28, x28, x7`
>- Reg Wr En (for `andi` in WB-stage): 1 (High), write back to `x6`.
>#### In EX-stage: `addi x28, x0, 1`
>- ALUOp1: from ID/EX pipeline register or forwarding path.
>- ALUOp2: from the immediate.
>- ForwardA: no forwarding, from ID/EX pipeline register.
>- ForwardB: no forwarding.
>- Branch taken: not taken.
>#### In MEM-stage: `srli x7, x10, 4`
>- Data Memory Wr En: 0 (Low), since `srli` does not write memory.
>#### In WB-stage: `andi x6, x10, 15`
>- WB Mux: select ALU result.

>#### In IF-stage: `ecall`
>- PC Enable: 1 (High), since there's no pipeline stall.
>- Next PC Mux: 0 (Low), since no branch/jump is taken.
>#### In ID-stage: `addi x17, x0, 10`
>- Reg Wr En (for `slli` in WB-stage): 1 (High), write back to `x28`.
>#### In EX-stage: `add x10, x6, x28`
>- ALUOp1: from ID/EX pipeline register or forwarding path.
>- ALUOp2: from ID/EX pipeline register or forwarding path.
>- ForwardA: use forwarding, from MEM-stage.
>- ForwardB: use forwarding, from WB-stage.
>- Branch taken: not taken.
>#### In MEM-stage: `sll x6, x6, x7`
>- Data Memory Wr En: 0 (Low), since `sll` does not write memory.
>#### In WB-stage: `slli x28, x28, 4`
>- WB Mux: select ALU result.
Notice that in the **EX-stage**, `add x10, x6, x28` has **data hazards** on both source operands: `x6` is produced by the **MEM-stage** instruction `sll x6, x6, x7`, and `x28` is produced by the **WB-stage** instruction `slli x28, x28, 4`.
Therefore, **forwarding must be enabled** (ForwardA from the MEM-stage for the updated `x6`, and ForwardB from the WB-stage for the updated `x28`) so that the `add x10, x6, x28` instruction can **use the latest values without stalling the pipeline**.
Returning to the program functionality: according to the behavior of `uf8_decode`, for the input `0x00000021`:
- `m = 0x21 & 0x0F = 1`
- `e = 0x21 >> 4 = 2`
- `offset = ((1 << 2)-1) << 4 = (4-1) << 4 = 3 << 4 = 48`
- `m << e = 1 << 2 = 4`
- `result = 4 + 48 = 52 (0x34)`
Therefore, the decoded value should be `0x34` (`52`), and according to the program flow it should end up in `x10` (`a0`). After the program finishes, we check the value of `x10` (`a0`):

The value of `x10` matches the expected result, confirming that the program executed successfully, and validating the functionality (correctness) of `uf8_decode`.
---
> Refer to <[Assignment1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/2025-arch-homework1)>
> Contributed by <[`Micelearner`](https://github.com/Micelearner/ca2025-quizzes)>
***Reference***
- [Assignment1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/2025-arch-homework1)
- [Quiz1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
- [RISC-V Assembly Programmer’s Manual](https://github.com/riscv-non-isa/riscv-asm-manual/blob/main/src/asm-manual.adoc)
- [Lab1: RV32I Simulator](https://hackmd.io/@sysprog/H1TpVYMdB)
- [Ripes: Supported Environment Calls](https://github.com/mortbopet/Ripes/blob/master/docs/ecalls.md)
- [CS61C: RISC-V 5-Stage Pipeline](https://docs.google.com/presentation/d/1v-Squx8lK-oOrflFOwBZh-ue94seVHZudqDOgJmFf5Q/edit?slide=id.g2fa6143ce8c_0_133#slide=id.g2fa6143ce8c_0_133)