# Assignment1: RISC-V Assembly and Instruction Pipeline ## bfloat16 Arithmetic The bfloat16 is a type representing floating point number. It has the same range as the IEEE 754 standard float32 but requires fewer spaces. Consequently, the resolution of bfloat16 is degraded because of 7-bit mantissa field. To implement bfloat16 and its test suites, we need to introduce the original floating point results calculated with float32. In our assembly code, we extend the functions and make them reconfigurable. That is, we can pass specific parameters to change their behaviors and produces float32 or bfloat16 operations. For example, when we pass a mantissa mask such as ```0x7F``` which indicates the operation is related to bfloat16. On the other hand, mantissa mask ```0x7FFFFF``` used in procedure represents float32 is under going. First, conversion assembly code is listed as follows. ```clike # === float32_to_bfloat16 === f32_to_bf16: add x28, x0, a0 # move argument f32 to x28 srli x29, a0, 16 # make shared x29 for (f32>>16) srli x28, a0, 23 # make exponent field at LSB andi x28, x28, 0xFF # make value uneffected by sign bit addi x28, x28, -0xFF # result==0 equivant to exponent==0xFF bne x28, zero, exp_non_ff slli x28, x29, 16 # eliminate upper 16-bits srli a0, x28, 16 # a0 = (bits>>16) & 0xFFFF ret exp_non_ff: andi x28, x29, 1 # (f32>>16)&1 srli x29, t1, 17 # make constant 0x7FFF add x28, x28, x29 add a0, x28, a0 srli a0, a0, 16 # f32 >> 16 ret # === bfloat16_to_float32 === bf16_to_f32: slli a0, a0, 16 ret ``` Here, I naively complete the conversion through masking and branching. Then, the emphasized part of my work is - my_add - my_sub - my_mul - my_div - my_sqrt - As mentioned above, the evaluations of these function are involved with float32 (f32) calculations. With the results calculated by f32_my_add, f32_my_sub, f32_my_mul, ..., etc, we can use f32_my_sub to produce the distance or the difference between bf16_my_add, bf16_my_sub, bf16_my_mul, ..., etc. The original code are written as [1705cf7](https://github.com/sysprog21/ca2025-quizzes/commit/1705cf7d560d661435182eb0d8d9a308a7bdbd83), [576d4f6](https://github.com/sysprog21/ca2025-quizzes/commit/576d4f6afdb11196a9bd689dee2e3d5fdbd898c7), [943324e](https://github.com/sysprog21/ca2025-quizzes/commit/943324edfcb64971d825225294492c396ed67e6e), [aa54328](https://github.com/sysprog21/ca2025-quizzes/commit/aa543288920e011183079f98f1959852a767ad93), [5b1c8b9](https://github.com/sysprog21/ca2025-quizzes/commit/5b1c8b93224421daa7cfe9ea41de4cd37a38a536). After the modification, the functions supports f32 according to [13bd5df](https://github.com/sysprog21/ca2025-quizzes/commit/13bd5df502cbcd6600e2e2b274e7e2daddb95bfa) and [d92c0af](https://github.com/sysprog21/ca2025-quizzes/commit/d92c0af4ef9aa246013c927d840f65efca8fb206). The key point is at a3, a4, a5, a6. If we can assign specific value when we call procedure, we can achieve different behavior in my assembly codes. Besides, because bfoat16 and float32 have the same exponent field, I use immediate type instruction to deal with this. ```clike # --- bf16_add --- lw a0, 0(t0) lw a1, 4(t0) addi a3, x0, 25 # for producing mantissa mask addi a4, x0, 7 # for mantissa shift addi a5, x0, 15 # for sign shifting addi a6, x0, 48 # for resolution (for my_add, it is not used.) jal ra, my_add li a7, 34 # --- f32_add --- lw a0, 0(t0) lw a1, 4(t0) addi a3, x0, 9 # for producing mantissa mask addi a4, x0, 23 # for mantissa shift addi a5, x0, 31 # for sign shifting addi a6, x0, 48 # for resolution (for my_add, it is not used.) jal ra, my_add li a7, 34 ecall ``` I sractch the my_mul to observe the configuration. Among these, ```t1``` stores ```0xFFFFFFFF```. Therefore, we can use the shift right logic ```srl``` with mantissa shifting offset to produce mantissa mask of bf16 or f32. That is, ```srl x29, t1, a3```. ```clike= my_mul: # x28, x29 tmp // x30, x31 sign // x26, x27 exp // x24, x25 mant # x23 result sign, x22 result exp, x21 result mant, x20 exp_adjust srl x30, a0, a5 # extract sign bit andi x30, x30, 1 # extract sign bit masking srl x31, a1, a5 # extract sign bit andi x31, x31, 1 # extract sign bit masking srl x26, a0, a4 # extract exponent andi x26, x26, 0xFF # extract exponent masking srl x27, a1, a4 # extract exponent andi x27, x27, 0xFF # extract exponent masking srl x29, t1, a3 # mantissa mask and x24, a0, x29 # extract mantissa masking and x25, a1, x29 # extract mantissa masking xor x23, x30, x31 # result sign ``` Another issue I want to discuss is square root operation. In course, we found that there is some redundant code in sqrt written in C. Thus, after evalution, I modified it and change into my assembly code. I also found after the modification, my assembly code produces preciser results. Redundant code in original C code locates here. The ```else if``` condition is not necessary because the mantissa range is (128, 512]. It is impossible for ```result < 128``` to happen. ```c if (result >= 256) { result >>= 1; new_exp++; } else if (result < 128) { while (result < 128 && new_exp > 1) { result <<= 1; new_exp--; } } ``` My assembly code not only eliminates the unnecessary part but also adds a new condition for juding infinity at the end of the code (line 16). ```clike= sqrt_normalize: addi x29, x31, -256 blt x29, zero, sqrt_cal srli x31, x31, 1 addi x22, x22, 1 sqrt_cal: addi x29, x22, -0xFF bge x29, zero, sqrt_cal_case bge zero, x22, rt_zero and a0, x22, x28 slli a0, a0, 7 andi x29, x31, 0x7F or a0, a0, x29 ret sqrt_cal_case: bne x29, zero, rt_inf andi x29, x31, 0x7F # new mant bne x29, zero, rt_nan j rt_inf ``` ## Hero's Formular (海龍公式) I choose the Hero's Formula to framework my bfloat16 implementations. Given $a, b, c$ being lengths or a triangle. The formula is defined as ## $\sqrt{s(s-a)(s-b)(s-c)}$ , where $s=\frac{a+b+c}{2}$ The straightforward implementation is calculated each required element and then perform multiplication and square root, as the commit of my code here [8f13de6](https://github.com/sysprog21/ca2025-quizzes/commit/8f13de6a25226d8b292b246f74424007a4d62ccc). ```clike= main_loop: lw a0, 0(t0) # length a jal ra, f32_to_bf16 sw a0, 0(sp) lw a0, 4(t0) # length b jal ra, f32_to_bf16 sw a0, 4(sp) lw a0, 8(t0) # length c jal ra, f32_to_bf16 sw a0, 8(sp) lw a0, div_value # /2 jal ra, f32_to_bf16 sw a0, 12(sp) addi a3, x0, 25 addi a4, x0, 7 addi a5, x0, 15 addi a6, x0, 15 lw a0, 0(sp) lw a1, 4(sp) jal ra, my_add # a+b lw a1, 8(sp) jal ra, my_add # a+b+c lw a1, 12(sp) jal ra, my_div sw a0, 16(sp) # S=(a+b+c)/2 lw a1, 0(sp) jal ra, my_sub sw a0, 20(sp) # s-a lw a0, 16(sp) lw a1, 4(sp) jal ra, my_sub sw a0, 24(sp) # s-b lw a0, 16(sp) lw a1, 8(sp) jal ra, my_sub sw a0, 28(sp) # s-c lw a0, 16(sp) lw a1, 20(sp) jal ra, my_mul # s*(s-a) lw a1, 24(sp) jal ra, my_mul # s*(s-a)*(s-b) lw a1, 28(sp) jal ra, my_mul # s*(s-a)*(s-b)*(s-c) jal ra, my_sqrt # (s*(s-a)*(s-b)*(s-c))^0.5 jal ra, bf16_to_f32 li a7, 34 ecall addi sp, sp, 32 li a7, 10 # Syscall for Exit program ecall ``` However, I found that the implementation is inefficient for its code size and memory usage, so we can re-write the formula as ## $\sqrt{\frac{a+b+c}{2}\times\frac{b+c-a}{2}\times\frac{a+c-b}{2}\times\frac{a+b-c}{2}}$ We can find that the numerator is the combinations of $a, b, c$ with additions and subtractions. we can re-use the storage that stores $a, b, c$ for reducing the memory usage. The optimized code is here [6c730bc](https://github.com/sysprog21/ca2025-quizzes/commit/6c730bc7cb24261f33132ff66b655386e2128991) We can see that our stack storage is changed from 28 to 20, saving 2 4 bytes. ```clike= # === Main Function === main: addi sp, sp, -32 lw t1, mask_value # useful mask register 0xFFFFFFFF la t0, test_values main_loop: lw a0, 0(t0) # length a jal ra, f32_to_bf16 sw a0, 0(sp) lw a0, 4(t0) # length b jal ra, f32_to_bf16 sw a0, 4(sp) lw a0, 8(t0) # length c jal ra, f32_to_bf16 sw a0, 8(sp) lw a0, div_value # /2 jal ra, f32_to_bf16 sw a0, 12(sp) addi a3, x0, 25 addi a4, x0, 7 addi a5, x0, 15 addi a6, x0, 15 lw a0, 0(sp) lw a1, 4(sp) jal ra, my_add # a+b lw a1, 8(sp) jal ra, my_add # a+b+c lw a1, 12(sp) jal ra, my_div sw a0, 16(sp) # S=(a+b+c)/2 lw a1, 0(sp) jal ra, my_sub lw a1, 16(sp) jal ra, my_mul # s-a sw a0, 20(sp) lw a0, 16(sp) lw a1, 4(sp) jal ra, my_sub lw a1, 20(sp) jal ra, my_mul sw a0, 20(sp) # s-b lw a0, 16(sp) lw a1, 8(sp) jal ra, my_sub lw a1, 20(sp) jal ra, my_mul # s-c jal ra, my_sqrt # (s*(s-a)*(s-b)*(s-c))^0.5 jal ra, bf16_to_f32 li a7, 34 ecall addi sp, sp, 32 li a7, 10 # Syscall for Exit program ecall ```