# Assignment1: RISC-V Assembly and Instruction Pipeline
## bfloat16 Arithmetic
The bfloat16 is a type representing floating point number. It has the same range as the IEEE 754 standard float32 but requires fewer spaces. Consequently, the resolution of bfloat16 is degraded because of 7-bit mantissa field.
To implement bfloat16 and its test suites, we need to introduce the original floating point results calculated with float32. In our assembly code, we extend the functions and make them reconfigurable. That is, we can pass specific parameters to change their behaviors and produces float32 or bfloat16 operations. For example, when we pass a mantissa mask such as ```0x7F``` which indicates the operation is related to bfloat16. On the other hand, mantissa mask ```0x7FFFFF``` used in procedure represents float32 is under going.
First, conversion assembly code is listed as follows.
```clike
# === float32_to_bfloat16 ===
f32_to_bf16:
add x28, x0, a0 # move argument f32 to x28
srli x29, a0, 16 # make shared x29 for (f32>>16)
srli x28, a0, 23 # make exponent field at LSB
andi x28, x28, 0xFF # make value uneffected by sign bit
addi x28, x28, -0xFF # result==0 equivant to exponent==0xFF
bne x28, zero, exp_non_ff
slli x28, x29, 16 # eliminate upper 16-bits
srli a0, x28, 16 # a0 = (bits>>16) & 0xFFFF
ret
exp_non_ff:
andi x28, x29, 1 # (f32>>16)&1
srli x29, t1, 17 # make constant 0x7FFF
add x28, x28, x29
add a0, x28, a0
srli a0, a0, 16 # f32 >> 16
ret
# === bfloat16_to_float32 ===
bf16_to_f32:
slli a0, a0, 16
ret
```
Here, I naively complete the conversion through masking and branching.
Then, the emphasized part of my work is
- my_add
- my_sub
- my_mul
- my_div
- my_sqrt
-
As mentioned above, the evaluations of these function are involved with float32 (f32) calculations. With the results calculated by f32_my_add, f32_my_sub, f32_my_mul, ..., etc, we can use f32_my_sub to produce the distance or the difference between bf16_my_add, bf16_my_sub, bf16_my_mul, ..., etc.
The original code are written as [1705cf7](https://github.com/sysprog21/ca2025-quizzes/commit/1705cf7d560d661435182eb0d8d9a308a7bdbd83), [576d4f6](https://github.com/sysprog21/ca2025-quizzes/commit/576d4f6afdb11196a9bd689dee2e3d5fdbd898c7), [943324e](https://github.com/sysprog21/ca2025-quizzes/commit/943324edfcb64971d825225294492c396ed67e6e), [aa54328](https://github.com/sysprog21/ca2025-quizzes/commit/aa543288920e011183079f98f1959852a767ad93), [5b1c8b9](https://github.com/sysprog21/ca2025-quizzes/commit/5b1c8b93224421daa7cfe9ea41de4cd37a38a536).
After the modification, the functions supports f32 according to [13bd5df](https://github.com/sysprog21/ca2025-quizzes/commit/13bd5df502cbcd6600e2e2b274e7e2daddb95bfa) and [d92c0af](https://github.com/sysprog21/ca2025-quizzes/commit/d92c0af4ef9aa246013c927d840f65efca8fb206).
The key point is at a3, a4, a5, a6. If we can assign specific value when we call procedure, we can achieve different behavior in my assembly codes. Besides, because bfoat16 and float32 have the same exponent field, I use immediate type instruction to deal with this.
```clike
# --- bf16_add ---
lw a0, 0(t0)
lw a1, 4(t0)
addi a3, x0, 25 # for producing mantissa mask
addi a4, x0, 7 # for mantissa shift
addi a5, x0, 15 # for sign shifting
addi a6, x0, 48 # for resolution (for my_add, it is not used.)
jal ra, my_add
li a7, 34
# --- f32_add ---
lw a0, 0(t0)
lw a1, 4(t0)
addi a3, x0, 9 # for producing mantissa mask
addi a4, x0, 23 # for mantissa shift
addi a5, x0, 31 # for sign shifting
addi a6, x0, 48 # for resolution (for my_add, it is not used.)
jal ra, my_add
li a7, 34
ecall
```
I sractch the my_mul to observe the configuration. Among these, ```t1``` stores ```0xFFFFFFFF```. Therefore, we can use the shift right logic ```srl``` with mantissa shifting offset to produce mantissa mask of bf16 or f32. That is, ```srl x29, t1, a3```.
```clike=
my_mul:
# x28, x29 tmp // x30, x31 sign // x26, x27 exp // x24, x25 mant
# x23 result sign, x22 result exp, x21 result mant, x20 exp_adjust
srl x30, a0, a5 # extract sign bit
andi x30, x30, 1 # extract sign bit masking
srl x31, a1, a5 # extract sign bit
andi x31, x31, 1 # extract sign bit masking
srl x26, a0, a4 # extract exponent
andi x26, x26, 0xFF # extract exponent masking
srl x27, a1, a4 # extract exponent
andi x27, x27, 0xFF # extract exponent masking
srl x29, t1, a3 # mantissa mask
and x24, a0, x29 # extract mantissa masking
and x25, a1, x29 # extract mantissa masking
xor x23, x30, x31 # result sign
```
Another issue I want to discuss is square root operation. In course, we found that there is some redundant code in sqrt written in C. Thus, after evalution, I modified it and change into my assembly code. I also found after the modification, my assembly code produces preciser results.
Redundant code in original C code locates here. The ```else if``` condition is not necessary because the mantissa range is (128, 512]. It is impossible for ```result < 128``` to happen.
```c
if (result >= 256) {
result >>= 1;
new_exp++;
} else if (result < 128) {
while (result < 128 && new_exp > 1) {
result <<= 1;
new_exp--;
}
}
```
My assembly code not only eliminates the unnecessary part but also adds a new condition for juding infinity at the end of the code (line 16).
```clike=
sqrt_normalize:
addi x29, x31, -256
blt x29, zero, sqrt_cal
srli x31, x31, 1
addi x22, x22, 1
sqrt_cal:
addi x29, x22, -0xFF
bge x29, zero, sqrt_cal_case
bge zero, x22, rt_zero
and a0, x22, x28
slli a0, a0, 7
andi x29, x31, 0x7F
or a0, a0, x29
ret
sqrt_cal_case:
bne x29, zero, rt_inf
andi x29, x31, 0x7F # new mant
bne x29, zero, rt_nan
j rt_inf
```
## Hero's Formular (海龍公式)
I choose the Hero's Formula to framework my bfloat16 implementations.
Given $a, b, c$ being lengths or a triangle. The formula is defined as
## $\sqrt{s(s-a)(s-b)(s-c)}$ , where $s=\frac{a+b+c}{2}$
The straightforward implementation is calculated each required element and then perform multiplication and square root, as the commit of my code here [8f13de6](https://github.com/sysprog21/ca2025-quizzes/commit/8f13de6a25226d8b292b246f74424007a4d62ccc).
```clike=
main_loop:
lw a0, 0(t0) # length a
jal ra, f32_to_bf16
sw a0, 0(sp)
lw a0, 4(t0) # length b
jal ra, f32_to_bf16
sw a0, 4(sp)
lw a0, 8(t0) # length c
jal ra, f32_to_bf16
sw a0, 8(sp)
lw a0, div_value # /2
jal ra, f32_to_bf16
sw a0, 12(sp)
addi a3, x0, 25
addi a4, x0, 7
addi a5, x0, 15
addi a6, x0, 15
lw a0, 0(sp)
lw a1, 4(sp)
jal ra, my_add # a+b
lw a1, 8(sp)
jal ra, my_add # a+b+c
lw a1, 12(sp)
jal ra, my_div
sw a0, 16(sp) # S=(a+b+c)/2
lw a1, 0(sp)
jal ra, my_sub
sw a0, 20(sp) # s-a
lw a0, 16(sp)
lw a1, 4(sp)
jal ra, my_sub
sw a0, 24(sp) # s-b
lw a0, 16(sp)
lw a1, 8(sp)
jal ra, my_sub
sw a0, 28(sp) # s-c
lw a0, 16(sp)
lw a1, 20(sp)
jal ra, my_mul # s*(s-a)
lw a1, 24(sp)
jal ra, my_mul # s*(s-a)*(s-b)
lw a1, 28(sp)
jal ra, my_mul # s*(s-a)*(s-b)*(s-c)
jal ra, my_sqrt # (s*(s-a)*(s-b)*(s-c))^0.5
jal ra, bf16_to_f32
li a7, 34
ecall
addi sp, sp, 32
li a7, 10 # Syscall for Exit program
ecall
```
However, I found that the implementation is inefficient for its code size and memory usage, so we can re-write the formula as
## $\sqrt{\frac{a+b+c}{2}\times\frac{b+c-a}{2}\times\frac{a+c-b}{2}\times\frac{a+b-c}{2}}$
We can find that the numerator is the combinations of $a, b, c$ with additions and subtractions. we can re-use the storage that stores $a, b, c$ for reducing the memory usage.
The optimized code is here [6c730bc](https://github.com/sysprog21/ca2025-quizzes/commit/6c730bc7cb24261f33132ff66b655386e2128991)
We can see that our stack storage is changed from 28 to 20, saving 2 4 bytes.
```clike=
# === Main Function ===
main:
addi sp, sp, -32
lw t1, mask_value # useful mask register 0xFFFFFFFF
la t0, test_values
main_loop:
lw a0, 0(t0) # length a
jal ra, f32_to_bf16
sw a0, 0(sp)
lw a0, 4(t0) # length b
jal ra, f32_to_bf16
sw a0, 4(sp)
lw a0, 8(t0) # length c
jal ra, f32_to_bf16
sw a0, 8(sp)
lw a0, div_value # /2
jal ra, f32_to_bf16
sw a0, 12(sp)
addi a3, x0, 25
addi a4, x0, 7
addi a5, x0, 15
addi a6, x0, 15
lw a0, 0(sp)
lw a1, 4(sp)
jal ra, my_add # a+b
lw a1, 8(sp)
jal ra, my_add # a+b+c
lw a1, 12(sp)
jal ra, my_div
sw a0, 16(sp) # S=(a+b+c)/2
lw a1, 0(sp)
jal ra, my_sub
lw a1, 16(sp)
jal ra, my_mul # s-a
sw a0, 20(sp)
lw a0, 16(sp)
lw a1, 4(sp)
jal ra, my_sub
lw a1, 20(sp)
jal ra, my_mul
sw a0, 20(sp) # s-b
lw a0, 16(sp)
lw a1, 8(sp)
jal ra, my_sub
lw a1, 20(sp)
jal ra, my_mul # s-c
jal ra, my_sqrt # (s*(s-a)*(s-b)*(s-c))^0.5
jal ra, bf16_to_f32
li a7, 34
ecall
addi sp, sp, 32
li a7, 10 # Syscall for Exit program
ecall
```