# Assignment 1: RISC-V Assembly and Instruction Pipeline
contributed by <`Nsly0204`>
[homework1](https://hackmd.io/@sysprog/2025-arch-homework1)
## Problem B
> [Commit 9d4e89a](https://github.com/sysprog21/ca2025-quizzes/commit/9d4e89ad09b3f2856ac516648c655f1c9ddef661)
Fundemental Logic to improve performance:
1. Store reusable constants (e.g. 0xFF) in specific register avoid repetitively using `li` instruction.
2. Using loop instead of using recursive function (e.g. clz, multiply) to avoid caller save and callee save. ==However, whether the miss in branch prediction or saving register causes more resourse is up to discuss.)==
## Problem C
>
>* convertion & bf16_add & bf16_sub
>[Commit 1b2689a](https://github.com/sysprog21/ca2025-quizzes/commit/1b2689a964e13024c70986d7518e72e40e714c98)
>* bf16_div and bf16_mul
>[Commit 86e4b5d](https://github.com/sysprog21/ca2025-quizzes/commit/86e4b5d1a6134f8ede4545e341c021ffbdfc8c3a)
>* bf16_sqrt and test function
>[Commit e38b9fd](https://github.com/sysprog21/ca2025-quizzes/commit/e38b9fd5559516e32614499a80956a91a7aae2f8)

```c=
static inline bf16_t f32_to_bf16(float val)
{
uint32_t f32bits;
memcpy(&f32bits, &val, sizeof(float));
if (((f32bits >> 23) & 0xFF) == 0xFF)
return (bf16_t) {.bits = (f32bits >> 16) & 0xFFFF};
f32bits += ((f32bits >> 16) & 1) + 0x7FFF;
return (bf16_t) {.bits = f32bits >> 16};
}
```
The overall process can be broken down into three parts:
* exception detection
* rounding (carrier calculation)
* f32 to bf16 convertion
During exception detection, fp16 is store in a 32 bits risc-v register; therefore, the `& 0xFFFF` operation can be ignored in assembly code.
`(f32bits >> 16) & 1)` in line 7 calculates the 17-th bit of `f32bits` the result will be either 1 or 0. This is the implementation of round-to-nearest-even, meaning that when the truncated-part of mentissa is exactly half the percision `0x8000`, it should be round to the nearest even:
* if the number after truncation is even, `(f32bits >> 16) & 1)` = 0.When the truncated-part of mentissa is `0x8000`, the carrier equals to 0 because `0x8000 + (f32bits >> 16) & 1) + 0x7FFF` = `0x0FFFF`
* if the number after truncation is odd $\rightarrow$ `(f32bits >> 16) & 1)` = 1. When the truncated-part of mentissa is `0x8000`, the carrier equals to 1 because `(f32bits >> 16) & 1) + 0x7FFF` = `0x10000`
There are underlying exceptions in the provided convertion method. For instance, `0x7F800001`, which represents NaN in fp32, will be convert into `0x7F80`, which represents infinity in bf16. However, such cases are inevitable due to information loss during convertion. The test program should take this phonomenon into consider.
#### assembly code
:::spoiler
```asm
f32_to_bf16:
####
## input argument
# a0 = f32
## output argument
# a0 = traslated bf16
####
# callee save
addi sp, sp, -8
sw ra, 0(sp)
sw s0, 4(sp)
# get input argument
mv s0, a0
# NaN or Inf check
li t0, 0xFF # t0 = 0xFF (exp mask)
srli t2, s0, 23 # (f32bits >> 23)
and t1, t2, t0 # (exp of fp32) t1 = (f32bits >> 23) & 0xFF
beq t1, t0, exception # if (f32bits >> 23) & 0xFF == 0xFF
# convert
srli t0, s0, 16 # t0 = fp32 >> 16
# round to even
addi t1, x0, 1 # t1 = 1
and t0, t0, t1 # t0 = (fp32 >> 16) & 1, the 16-th bit in mantissa
li t1, 0x7FFF # t1 = 0x7FFF
add t0, t0, t1 # t0 = ((fp32 >> 16) & 1) + 0x7FFF, t0 = carry
add t0, s0, t0 # t0 = fp32 + carry
srli a0, t0, 16 # a0 = (fp32 + carry) >> 16
beq x0, x0, end_f32_to_bf16
exception:
srli a0, s0, 16 # a0 = fp32 >> 16
end_f32_to_bf16:
# retrieve ra and callee save
lw ra, 0(sp)
lw s0, 4(sp)
addi sp, sp, 8
ret
```
:::
### TODO: using sqrt function with bit operation only to enhance performance
The following is a common impelment of square root function with bit operation only. It's a digit-by-digit algorithm origined from long division square root method. For more detail one can refer [Wikipedia-Square root algorithms](https://en.wikipedia.org/wiki/Square_root_algorithms)
```c
ix0 += ix0;
int32_t q = s = 0; /* q = sqrt(x) */
uint32_t r = 0x01000000; /* r = moving bit from right to left */
while(r!=0) {
t = s+r; // temp = sqrt + 1 bit
if(t <= ix0) { // if remain > add
s = t+r; // s = (s+r) + r = s + 2r
ix0 -= t; // remain -add
q += r; // quotient add one bit
}
ix0 += ix0; // append one more bits
r >>= 1;
}
```