---
tags: CA2025
---
# Assignment2: Complete Applications
contributed by [<wilson0828>](https://github.com/wilson0828)
>Refer to [Assignment2](https://hackmd.io/@sysprog/2025-arch-homework2)
## Part 1 - Environment setup
This project challenges to develop a highly practical system using RISC-V assembly language. We need to set up the experimental environment before moving on to the implementation.
Follow [Lab2: RISC-V](https://hackmd.io/@sysprog/Sko2Ja5pel#Overview) to build rv32emu, which is a RISC-V RV32 emulator that faithfully implements the RISC-V instruction set architecture.
After following the instructions in previous lab, we use Makefile to validate whether `tests/system/playground` is fully verified.
```
blz@localhost:~/riscv-none-elf-gcc/rv32emu/tests/system/playground$ make run
../../../build/rv32emu test.elf
```
This validates:
1. Emulator binary exists
2. `ENABLE_SYSTEM=1` is configured
3. `ENABLE_ELF_LOADER=1` is configured, meaning that we can load `.elf` file to execute directly.
- output
```
=== ChaCha20 Tests ===
Test 0: ChaCha20 (RISC-V Assembly)
Test: ChaCha20
ChaCha20 RFC 7539: PASSED
Cycles: 6797
Instructions: 6797
=== BFloat16 Tests ===
Test 1: bf16_add
Test: bf16_add
1.0 + 1.0 = 4000
PASSED
Cycles: 432
Instructions: 432
Test 2: bf16_sub
Test: bf16_sub
3.0 - 2.0 = 3f80
PASSED
Cycles: 373
Instructions: 373
Test 3: bf16_mul
Test: bf16_mul
2.0 * 3.0 = 40c0
PASSED
Cycles: 464
Instructions: 464
Test 4: bf16_div
Test: bf16_div
6.0 / 2.0 = 4040
PASSED
Cycles: 624
Instructions: 624
Test 5: bf16_special_cases
Test: bf16_special_cases
bf16_iszero(0): PASSED
bf16_isnan(NaN): PASSED
bf16_isinf(Inf): PASSED
Cycles: 239
Instructions: 239
=== All Tests Completed ===
```
We have completed the environment setup part
- Note
Just to clarify remember to `source $HOME/riscv-none-elf-gcc/setenv` everytime you open a new shell, so the correct xPack RISC-V toolchain is added to your PATH and Make will pick the right compiler/assembler. Refer to [Lab2](https://hackmd.io/@sysprog/Sko2Ja5pel)
## Part 2 - Run on bare-metal environment
In this section, we need to run our complete program on in a bare-metal environment using rv32emu’s system emulation.
Here, I choose to reuse the assembly of `bf16_sqrt`, `bf16_add` and `bf16_mul`, that we implemented in [Assignment1](https://hackmd.io/fBYPlyHETiGzC1sBdhYm0w?view#Part-2)
<details>
<summary>bf16_hw1</summary>
```asm
.section .rodata
.align 4
clz8_lut:
.byte 8,7,6,6,5,5,5,5,4,4,4,4,4,4,4,4
.byte 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
.text
.align 2
clz8:
andi a0, a0, 0xFF
la t0, clz8_lut
add t0, t0, a0
lbu a0, 0(t0)
ret
mul8x8_to16:
andi a0, a0, 0xFF
andi a1, a1, 0xFF
mv t1, a0
mv t2, a1
li t0, 0
li t3, 8
mul8x8_to16_loop:
andi t4, t2, 1
beqz t4, mul8x8_to16_skip
add t0, t0, t1
mul8x8_to16_skip:
slli t1, t1, 1
srli t2, t2, 1
addi t3, t3, -1
bnez t3, mul8x8_to16_loop
mv a0, t0
ret
.globl bf16_add
bf16_add:
addi sp, sp, -28
sw ra, 24(sp)
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a1, 15
andi s1, s1, 1 # s1 = sign_b
srli s2, a0, 7
andi s2, s2, 0xFF # s2 = exp_a
srli s3, a1, 7
andi s3, s3, 0xFF # s3 = exp_b
andi s4, a0, 0x7F # s4 = mant_a
andi s5, a1, 0x7F # s5 = mant_b
li t0, 0xFF
bne s2, t0, bf16_add_chk
bnez s4, bf16_add_ret_a
bne s3, t0, bf16_add_ret_a
bnez s5, bf16_add_ret_b
bne s0, s1, bf16_add_ret_nan
j bf16_add_ret_b
bf16_add_chk:
beq s3, t0, bf16_add_ret_b
li t0, 0x7FFF
and t1, a0, t0
beq t1, x0, bf16_add_ret_b
and t1, a1, t0
beq t1, x0, bf16_add_ret_a
beq s2, x0, bf16_add_a_den_done
ori s4, s4, 0x80
bf16_add_a_den_done:
beq s3, x0, bf16_add_b_den_done
ori s5, s5, 0x80
bf16_add_b_den_done:
sub t1, s2, s3
blt x0, t1, bf16_add_grt
beq t1, x0, bf16_add_equ
mv t2, s3
li t0, -8
blt t1, t0, bf16_add_ret_b
sub t0, x0, t1
srl s4, s4, t0
j bf16_add_exp_dif
bf16_add_grt:
mv t2, s2
li t0, 8
blt t0, t1, bf16_add_ret_a
srl s5, s5, t1
j bf16_add_exp_dif
bf16_add_equ:
mv t2, s2
bf16_add_exp_dif:
bne s0, s1, bf16_add_diff_signs
mv t3, s0
add t4, s4, s5
li t0, 0x100
and t1, t4, t0
beq t1, x0, bf16_add_pack
srli t4, t4, 1
addi t2, t2, 1
li t0, 0xFF
blt t2, t0, bf16_add_pack
slli a0, t3, 15
li t5, 0x7F80
or a0, a0, t5
j bf16_add_ans
bf16_add_diff_signs:
blt s4, s5, bf16_add_gt_ma
mv t3, s0
sub t4, s4, s5
j bf16_add_norm
bf16_add_gt_ma:
mv t3, s1
sub t4, s5, s4
bf16_add_norm:
beq t4, x0, bf16_add_ret_zero
mv a0, t4
jal ra, clz8
mv t0, a0
sll t4, t4, t0
sub t2, t2, t0
blt t2, x0, bf16_add_ret_zero
beq t2, x0, bf16_add_ret_zero
j bf16_add_pack
bf16_add_ret_zero:
li a0, 0x0000
j bf16_add_ans
bf16_add_pack:
slli a0, t3, 15
slli t1, t2, 7
or a0, a0, t1
andi t4, t4, 0x7F
or a0, a0, t4
j bf16_add_ans
bf16_add_ret_b:
mv a0, a1
j bf16_add_ans
bf16_add_ret_nan:
li a0, 0x7FC0
j bf16_add_ans
bf16_add_ret_a:
j bf16_add_ans
bf16_add_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
.globl bf16_mul
bf16_mul:
addi sp, sp, -28
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw ra, 24(sp)
srli s0, a0, 15
andi s0, s0, 1
srli s1, a1, 15
andi s1, s1, 1
srli s2, a0, 7
andi s2, s2, 0xFF
srli s3, a1, 7
andi s3, s3, 0xFF
andi s4, a0, 0x7F
andi s5, a1, 0x7F
li t0, 0xff
xor t1, s0, s1
bne s2, t0, bf16_mul_a_exp
bne s4, x0, bf16_mul_ret_b
bne s3, x0, bf16_mul_inf1
beq s5, x0, bf16_mul_ret_nan
bf16_mul_inf1:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_a_exp:
bne s3, t0, bf16_mul_b_exp
bne s5, x0, bf16_mul_ret_b
bne s2, x0, bf16_mul_inf2
beq s4, x0, bf16_mul_ret_nan
bf16_mul_inf2:
slli t2, t1, 15
li t3, 0x7F80
or a0, t2, t3
j bf16_mul_ans
bf16_mul_b_exp:
bne s2, x0, bf16_mul_skip1
beq s4, x0, bf16_mul_zero_ret
bf16_mul_skip1:
bne s3, x0, bf16_mul_skip2
bne s4, x0, bf16_mul_skip2
bf16_mul_zero_ret:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_skip2:
li t2, 0
bne s2, x0, bf16_mul_else_a
mv a0, s4
jal ra, clz8
mv t0, a0
sll s4, s4, t0
sub t2, t2, t0
li s2, 1
bf16_mul_else_a:
ori s4, s4, 0x80
bne s3, x0, bf16_mul_else_b
mv a0, s5
jal ra, clz8
mv t0, a0
sll s5, s5, t0
sub t2, t2, t0
li s3, 1
bf16_mul_else_b:
ori s5, s5, 0x80
mv a0, s4
mv a1, s5
jal ra, mul8x8_to16
mv t3, a0
xor t1, s0, s1
add t4, s2, s3
addi t4, t4, -127
add t4, t4, t2
li t5, 0x8000
and t0, t3, t5
beq t0, x0, bf16_mul_l2
srli t3, t3, 8
andi t3, t3, 0x7F
addi t4, t4, 1
j bf16_mul_mant
bf16_mul_l2:
srli t3, t3, 7
andi t3, t3, 0x7F
bf16_mul_mant:
li t0, 0xFF
blt t4, t0, bf16_mul_skip3
slli a0, t1, 15
li t0, 0x7F80
or a0, a0, t0
j bf16_mul_ans
bf16_mul_skip3:
blt x0, t4, bf16_mul_pack
addi t0, x0, -6
blt t4, t0, bf16_mul_underflow
li t0, 1
sub t0, t0, t4
srl t3, t3, t0
li t4, 0
j bf16_mul_pack
bf16_mul_underflow:
srli a0, t1, 15
j bf16_mul_ans
bf16_mul_pack:
andi t1, t1, 1
slli t1, t1, 15
andi t4, t4, 0xFF
slli t4, t4, 7
andi t3, t3, 0x7F
or a0, t1, t4
or a0, a0, t3
li t0, 0xFFFF
and a0, a0, t0
j bf16_mul_ans
bf16_mul_ret_b:
mv a0, a1
j bf16_mul_ans
bf16_mul_ret_nan:
li a0, 0x7FC0
j bf16_mul_ans
bf16_mul_ret_a:
j bf16_mul_ans
bf16_mul_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
isqrt16_pow4:
li t0, 0
li t1, 16384
isqrt16_pow4_loop:
beqz t1, isqrt16_pow4_done
add t2, t0, t1
bgeu a0, t2, isqrt16_pow4_ge
srli t0, t0, 1
srli t1, t1, 2
j isqrt16_pow4_loop
isqrt16_pow4_ge:
sub a0, a0, t2
srli t0, t0, 1
add t0, t0, t1
srli t1, t1, 2
j isqrt16_pow4_loop
isqrt16_pow4_done:
mv a0, t0
ret
.globl bf16_sqrt
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, bf16_sqrt_a_exp
beq s2, x0, bf16_sqrt_a_mant
j bf16_sqrt_ret_a
bf16_sqrt_a_mant:
beq s0, x0, bf16_sqrt_a_sign
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_a_sign:
j bf16_sqrt_ret_a
bf16_sqrt_a_exp:
bne s1, x0, bf16_sqrt_skip
bne s2, x0, bf16_sqrt_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_skip:
beq s0, x0, bf16_sqrt_negative_skip
li a0, 0x7FC0
j bf16_sqrt_ans
bf16_sqrt_negative_skip:
bne s1, x0, bf16_sqrt_denormals_skip
li a0, 0x0000
j bf16_sqrt_ans
bf16_sqrt_denormals_skip:
addi t1, s1, -127
ori t2, s2, 0x80
andi t0, t1, 1
beq t0, x0, bf16_sqrt_else
slli t2, t2, 1
addi t0, t1, -1
srai t0, t0, 1
addi t3, t0, 127
j bf16_sqrt_end_if
bf16_sqrt_else:
srai t0, t1, 1
addi t3, t0, 127
bf16_sqrt_end_if:
mv s6, t3
slli a0, t2, 7
addi a0, a0, 127
jal isqrt16_pow4
mv s5, a0
mv t3, s6
j bf16_sqrt_l3
bf16_sqrt_l3:
andi t5, s5, 0x7F
li t0, 0xFF
blt t3, t0, bf16_sqrt_no_overflow
li a0, 0x7F80
j bf16_sqrt_ans
bf16_sqrt_no_overflow:
bgt t3, x0, bf16_sqrt_no_underflow
li a0, 0
j bf16_sqrt_ans
bf16_sqrt_no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j bf16_sqrt_ans
bf16_sqrt_ret_a:
j bf16_sqrt_ans
bf16_sqrt_ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
</details>
- Copy the directory
```bash=
cd ~/riscv-none-elf-gcc/rv32emu
cp -r tests/system/playground tests/system/bf16
cd tests/system/bf16
```
- Modify `main.c`
Since we are running in a bare-metal environment using rv32emu’s system emulation, which means we can only use some [system call](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) provided by rv32emu. In this case, I remove unneccsary part from original `main.c` and adapted it for our test setup. This allows us to test our own assembly of `bf16_sqrt`, `bf16_add` and `bf16_mul`. We replace the old C implementations with the bf16 routines in `bf16_hw1.o`, while still reusing the existing test functions in main.c to validate our bf16 add, mul, and sqrt.
<details>
<summary>main.c</summary>
```clike
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#define printstr(ptr, length) \
do { \
asm volatile( \
"add a7, x0, 0x40;" \
"add a0, x0, 0x1;" /* stdout */ \
"add a1, x0, %0;" \
"mv a2, %1;" /* length character */ \
"ecall;" \
: \
: "r"(ptr), "r"(length) \
: "a0", "a1", "a2", "a7"); \
} while (0)
#define TEST_OUTPUT(msg, length) printstr(msg, length)
#define TEST_LOGGER(msg) \
{ \
char _msg[] = msg; \
TEST_OUTPUT(_msg, sizeof(_msg) - 1); \
}
extern uint64_t get_cycles(void);
extern uint64_t get_instret(void);
/* Bare metal memcpy implementation */
void *memcpy(void *dest, const void *src, size_t n)
{
uint8_t *d = (uint8_t *) dest;
const uint8_t *s = (const uint8_t *) src;
while (n--)
*d++ = *s++;
return dest;
}
/* Software division for RV32I (no M extension) */
static unsigned long udiv(unsigned long dividend, unsigned long divisor)
{
if (divisor == 0)
return 0;
unsigned long quotient = 0;
unsigned long remainder = 0;
for (int i = 31; i >= 0; i--) {
remainder <<= 1;
remainder |= (dividend >> i) & 1;
if (remainder >= divisor) {
remainder -= divisor;
quotient |= (1UL << i);
}
}
return quotient;
}
static unsigned long umod(unsigned long dividend, unsigned long divisor)
{
if (divisor == 0)
return 0;
unsigned long remainder = 0;
for (int i = 31; i >= 0; i--) {
remainder <<= 1;
remainder |= (dividend >> i) & 1;
if (remainder >= divisor) {
remainder -= divisor;
}
}
return remainder;
}
/* Software multiplication for RV32I (no M extension) */
static uint32_t umul(uint32_t a, uint32_t b)
{
uint32_t result = 0;
while (b) {
if (b & 1)
result += a;
a <<= 1;
b >>= 1;
}
return result;
}
/* Provide __mulsi3 for GCC */
uint32_t __mulsi3(uint32_t a, uint32_t b)
{
return umul(a, b);
}
/* Simple integer to hex string conversion */
static void print_hex(unsigned long val)
{
char buf[20];
char *p = buf + sizeof(buf) - 1;
*p = '\n';
p--;
if (val == 0) {
*p = '0';
p--;
} else {
while (val > 0) {
int digit = val & 0xf;
*p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10);
p--;
val >>= 4;
}
}
p++;
printstr(p, (buf + sizeof(buf) - p));
}
/* Simple integer to decimal string conversion */
static void print_dec(unsigned long val)
{
char buf[20];
char *p = buf + sizeof(buf) - 1;
*p = '\n';
p--;
if (val == 0) {
*p = '0';
p--;
} else {
while (val > 0) {
*p = '0' + umod(val, 10);
p--;
val = udiv(val, 10);
}
}
p++;
printstr(p, (buf + sizeof(buf) - p));
}
typedef struct {
uint16_t bits;
} bf16_t;
uint32_t bf16_add(uint32_t a_bf16, uint32_t b_bf16);
uint32_t bf16_mul(uint32_t a_bf16, uint32_t b_bf16);
uint32_t bf16_sqrt(uint32_t a_bf16);
static inline bf16_t bf16_add_w(bf16_t a, bf16_t b) {
bf16_t r = { .bits = (uint16_t)bf16_add(a.bits, b.bits) };
return r;
}
static inline bf16_t bf16_mul_w(bf16_t a, bf16_t b) {
bf16_t r = { .bits = (uint16_t)bf16_mul(a.bits, b.bits) };
return r;
}
static inline bf16_t bf16_sqrt_w(bf16_t a) {
bf16_t r = { .bits = (uint16_t)bf16_sqrt(a.bits) };
return r;
}
static void test_bf16_add(void)
{
TEST_LOGGER("Test: bf16_add\n");
/* 1.0 + 1.0 = 2.0 */
bf16_t a = {.bits = 0x3F80}; /* 1.0 */
bf16_t b = {.bits = 0x3F80}; /* 1.0 */
bf16_t result = bf16_add_w(a, b);
TEST_LOGGER(" 1.0 + 1.0 = ");
print_hex(result.bits);
/* Expected: 0x4000 (2.0) */
if (result.bits == 0x4000) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x4000)\n");
}
}
static void test_bf16_mul(void)
{
TEST_LOGGER("Test: bf16_mul\n");
/* 2.0 * 3.0 = 6.0 */
bf16_t a = {.bits = 0x4000}; /* 2.0 */
bf16_t b = {.bits = 0x4040}; /* 3.0 */
bf16_t result = bf16_mul_w(a, b);
TEST_LOGGER(" 2.0 * 3.0 = ");
print_hex(result.bits);
/* Expected: 0x40C0 (6.0) */
if (result.bits == 0x40C0) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x40C0)\n");
}
}
static void test_bf16_sqrt(void)
{
TEST_LOGGER("Test: bf16_sqrt\n");
// sqrt(-1.0) = NaN
bf16_t a = {.bits = 0xBF80};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(-1.0) = ");
print_hex(result.bits);
/* Expected: 0x7FC0 (NaN) */
if (result.bits == 0x7FC0) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x7FC0)\n");
}
}
int main(void)
{
uint64_t start_cycles, end_cycles, cycles_elapsed;
uint64_t start_instret, end_instret, instret_elapsed;
TEST_LOGGER("\n=== BFloat16 Tests ===\n\n");
TEST_LOGGER("Test 1: bf16_add\n");
start_cycles = get_cycles();
start_instret = get_instret();
test_bf16_add();
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
TEST_LOGGER("\n");
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER("Test 3: bf16_mul\n");
start_cycles = get_cycles();
start_instret = get_instret();
test_bf16_mul();
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
TEST_LOGGER("\n");
TEST_LOGGER("Test 4: bf16_sqrt\n");
start_cycles = get_cycles();
start_instret = get_instret();
test_bf16_sqrt();
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
TEST_LOGGER("\n");
TEST_LOGGER("\n=== All Tests Completed ===\n");
return 0;
}
```
</details>
Since we placed a `.globl` directive in front of each bf16 operations, these symbols are visible and callable from main.c.
Also we need Makefile to
1. Compile / Assemble: Convert source files like main.c, start.S, and bf16_hw1.s/.S into `.o` object files.
2. Link: Use `linker.ld` to combine all `.o` files into an ELF executable file.
3. Excute
4. Disassemble
- Makefile
```asm=
include ../../../mk/toolchain.mk
ARCH = -march=rv32izicsr
LINKER_SCRIPT = linker.ld
EMU ?= ../../../build/rv32emu
AFLAGS = -g $(ARCH)
CFLAGS = -g -march=rv32i_zicsr
LDFLAGS = -T $(LINKER_SCRIPT)
EXEC = test.elf
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
LD = $(CROSS_COMPILE)ld
OBJDUMP = $(CROSS_COMPILE)objdump
OBJS = start.o main.o perfcounter.o chacha20_asm.o bf16_hw1.o
.PHONY: all run dump clean
all: $(EXEC)
$(EXEC): $(OBJS) $(LINKER_SCRIPT)
$(LD) $(LDFLAGS) -o $@ $(OBJS)
%.o: %.S
$(AS) $(AFLAGS) $< -o $@
%.o: %.s
$(AS) $(AFLAGS) $< -o $@
%.o: %.c
$(CC) $(CFLAGS) $< -o $@ -c
run: $(EXEC)
@test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1)
@grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1)
@grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1)
$(EMU) $<
dump: $(EXEC)
$(OBJDUMP) -Ds $< | less
clean:
rm -f $(EXEC) $(OBJS)
```
We only add `bf16_hw1.o` to OBJS and convert `.s` file to `.o` file from original Makefile.
`OBJS` is the list of object files that will be fed into the linker to produce the final `.elf` file.
- command
```bash=
make clean
make
make run
```
- Result
```
=== BFloat16 Tests ===
Test 1: bf16_add
Test: bf16_add
1.0 + 1.0 = 4000
PASSED
Cycles: 299
Instructions: 299
Test 3: bf16_mul
Test: bf16_mul
2.0 * 3.0 = 40c0
PASSED
Cycles: 361
Instructions: 361
Test 4: bf16_sqrt
Test: bf16_sqrt
sqrt(-1.0) = 7fc0
PASSED
Cycles: 260
Instructions: 260
=== All Tests Completed ===
```
## Part 3 - two problem
### Problem A from [Quiz 2](https://hackmd.io/@sysprog/arch2025-quiz2)
- Makefile
```
include ../../../mk/toolchain.mk
ARCH = -march=rv32izicsr
LINKER_SCRIPT = linker.ld
EMU ?= ../../../build/rv32emu
AFLAGS = -g $(ARCH)
CFLAGS = -g $(ARCH)
LDFLAGS = -T $(LINKER_SCRIPT)
EXEC = test.elf
CROSS_COMPILE = riscv-none-elf-
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
OBJDUMP = $(CROSS_COMPILE)objdump
OBJS = start.o main.o perfcounter.o hanoi.o
.PHONY: all run dump clean
all: $(EXEC)
$(EXEC): $(OBJS) $(LINKER_SCRIPT)
$(CC) $(CFLAGS) $(LDFLAGS) -nostartfiles -o $@ $(OBJS)
%.o: %.S
$(AS) $(AFLAGS) -c $< -o $@
%.o: %.c
$(CC) $(CFLAGS) $< -o $@ -c
run: $(EXEC)
@test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1)
@grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1)
@grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1)
$(EMU) $<
dump: $(EXEC)
$(OBJDUMP) -Ds $< | less
clean:
rm -f $(EXEC) $(OBJS)
```
First, let’s look at the original display_move function in `hanoi.S` and the `.data` section:
```c
display_move:
la x20, obdata
add x5, x20, x18
# BLANK 23: Load byte from obfuscated data
lbu x11, 0(x5)
# BLANK 24: Decode constant (0x6F)
li x6, 0x6F
xor x11, x11, x6
# BLANK 25: Final offset adjustment
addi x11, x11, -0x12
add x7, x20, x19
lbu x12, 0(x7)
xor x12, x12, x6
addi x12, x12, -0x12
...
.data
obdata: .byte 0x3c, 0x3b, 0x3a
str1: .asciz "Move Disk "
str2: .asciz " from "
str3: .asciz " to "
```
If you decode it, you can see that obdata is actually an encrypted version of 'A', 'B', 'C':
0x3c ^ 0x6F = 0x53, 0x53 - 0x12 = 0x41 → 'A'
0x3b ^ 0x6F = 0x54, 0x54 - 0x12 = 0x42 → 'B'
0x3a ^ 0x6F = 0x55, 0x55 - 0x12 = 0x43 → 'C'
In the Tower of Hanoi code `x9` holds the disk index, `x18` holds the current peg index and `x19` holds the target peg index. Then we moved the display logic into a C function `hanoi_move()` that computes peg letters with 'A' + index and uses `TEST_LOGGER` + `print_dec` + `printstr`. By doing these two steps, we can cleanly print all the disk moves in a bare-metal environment while keeping the assembly focused on the actual Hanoi algorithm.
- add a C function hanoi_move()
```clike=
void hanoi_move(unsigned disk, unsigned from, unsigned to)
{
TEST_LOGGER("Move Disk ");
print_dec(disk);
TEST_LOGGER(" from ");
char peg_from = 'A' + (from & 3u);
printstr(&peg_from, 1);
TEST_LOGGER(" to ");
char peg_to = 'A' + (to & 3u);
printstr(&peg_to, 1);
TEST_LOGGER("\n");
}
```
- modify display_move
```c
display_move:
addi a0, x9, 1
mv a1, x18
mv a2, x19
jal ra, hanoi_move
# BLANK 26: Calculate storage offset
slli x5, x9, 2
addi x5, x5, 20
add x5, x2, x5
...
.data
obdata: .byte 0x3c, 0x3b, 0x3a
str1: .asciz "Move Disk "
str2: .asciz " from "
str3: .asciz " to "
```
- result

### Problem C from [Quiz 3](https://hackmd.io/@sysprog/arch2025-quiz3)
- Makefile
```
include ../../../mk/toolchain.mk
ARCH = -march=rv32izicsr
LINKER_SCRIPT = linker.ld
EMU ?= ../../../build/rv32emu
AFLAGS = -g $(ARCH)
CFLAGS = -g $(ARCH)
LDFLAGS = -T $(LINKER_SCRIPT)
EXEC = test.elf
CROSS_COMPILE = riscv-none-elf-
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
OBJDUMP = $(CROSS_COMPILE)objdump
OBJS = start.o main.o perfcounter.o fast_rsqrt.o
.PHONY: all run dump clean
all: $(EXEC)
$(EXEC): $(OBJS) $(LINKER_SCRIPT)
$(CC) $(CFLAGS) $(LDFLAGS) -nostartfiles -o $@ $(OBJS) -lm
%.o: %.S
$(AS) $(AFLAGS) -c $< -o $@
%.o: %.c
$(CC) $(CFLAGS) $< -o $@ -c
run: $(EXEC)
@test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1)
@grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1)
@grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1)
$(EMU) $<
dump: $(EXEC)
$(OBJDUMP) -Ds $< | less
clean:
rm -f $(EXEC) $(OBJS)
```
To evaluate our integer-based fast reciprocal square root (rsqrt) implementations Quiz 3, we first wrote a baseline `rsqrt_baseline(x)` as a reference. We use a Q16 fixed-point format, where $( y \approx \frac{1}{\sqrt{x}} \cdot 2^{16} )$. The baseline computes an exact integer square root with binary search, then uses 64-bit integer division to get a high-precision reciprocal and convert it to Q16.
- baseline
```clike=
static uint32_t isqrt_u32(uint32_t x)
{
uint32_t left = 0, right = 65535;
while (left <= right) {
uint32_t mid = (left + right) >> 1;
uint64_t sq = (uint64_t)mid * mid;
if (sq == x) return mid;
if (sq < x) left = mid + 1;
else right = mid - 1;
}
return right;
}
static uint32_t rsqrt_baseline(uint32_t x)
{
if (x == 0) return 0;
uint32_t s = isqrt_u32(x);
if (s == 0) return 0xFFFFFFFFu;
uint64_t num = (1ull << 32);
uint32_t inv_s = (uint32_t)(num / s);
uint32_t y = (inv_s + (1u << 15)) >> 16;
return y;
}
```
With this baseline in place, we run `fast_rsqrt` on rv32emu’s bare-metal system, compare each result against the baseline (absolute and relative error), and report the max and average error.
- result

- analyzing precision and performance (with 1000 test cases)
| | fast_rsqrt | fast_rsqrt_1iter(LUT + once Newton) | fast_rsqrt_lut(only LUT) |
| -------------- | ---------- |:-----------------------------------:| ------------------------ |
| Cycle | 21925181 | 17446433 | 12630288 |
| Instructions | 21925181 | 17446433 | 12630288 |
| mean_rel_error | 0.96% | 1.11% | 2.17% |
From the above table, we can tell it's a tradeoff between cycle count and error rate when choosing a LUT+Newton approach versus a fully LUT-free method, so it either takes more cycles to reach comparable accuracy or settles for a higher error rate.
## Part 4 - Disassemble
In this seciotn, we disassemble the ELF binaries generated by the C compiler and compare them with our handwritten implementation to analyze the differences in the resulting assembly code.
First, I reused the RV32I friendly bf16_sqrt assembly implementation from [Assignment 1](https://hackmd.io/@IBxdYttpQ5-JUEW6_KypVg/arch2025-homework1#Part-2)
- rv32i_friendly_bf16_sqrt
```asm
.data
.align 4
.text
isqrt16_pow4:
li t0, 0 # res
li t1, 16384 # bit = 1<<14 (2^16 = 65536 > 65535)
isqrt_loop:
beqz t1, isqrt_done
add t2, t0, t1 # tmp = res + bit
bgeu a0, t2, isqrt_ge
srli t0, t0, 1
srli t1, t1, 2
j isqrt_loop
isqrt_ge:
sub a0, a0, t2
srli t0, t0, 1
add t0, t0, t1
srli t1, t1, 2
j isqrt_loop
isqrt_done:
mv a0, t0
ret
bf16_sqrt:
addi sp, sp, -32
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
sw s6, 24(sp)
sw ra, 28(sp)
srli s0, a0, 15
andi s0, s0, 1 # s0 = sign_a
srli s1, a0, 7
andi s1, s1, 0xFF # s1 = exp_a
andi s2, a0, 0x7F # s2 = mant_a
li t0, 0xFF
bne s1, t0, a_exp
beq s2, x0, a_mant
j ret_a
a_mant:
beq s0, x0, a_sign
li a0, 0x7FC0
j ans
a_sign:
j ret_a
a_exp:
bne s1, x0, skip
bne s2, x0, skip
li a0, 0x0000
j ans
skip:
beq s0, x0, negative_skip
li a0, 0x7FC0
j ans
negative_skip:
bne s1, x0, denormals_skip
li a0, 0x0000
j ans
denormals_skip:
addi t1, s1, -127 # t1 = e
ori t2, s2, 0x80 # t2 = m (implicit 1)
andi t0, t1, 1
beq t0, x0, else
slli t2, t2, 1 # m <<= 1 (odd exponent)
addi t0, t1, -1 # t0 = e - 1
srai t0, t0, 1 # t0 = (e-1)>>1
addi t3, t0, 127 # t3 = new_exp
j end_if
else:
srai t0, t1, 1 # t0 = e>>1
addi t3, t0, 127 # t3 = new_exp
end_if:
mv s6, t3
slli a0, t2, 7 # a0 = m<<7
addi a0, a0, 127 # a0 = (m<<7) + 127
jal isqrt16_pow4 # a0 = result
mv s5, a0 # s5 = result
mv t3, s6
j l3
l3:
andi t5, s5, 0x7F # t5 = new_mant
li t0, 0xFF
blt t3, t0, no_overflow
li a0, 0x7F80
j ans
no_overflow:
bgt t3, x0, no_underflow
li a0, 0
j ans
no_underflow:
andi t3, t3, 0xff
slli t3, t3, 7
or a0, t3, t5
j ans
ret_a:
j ans
ans:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
lw s6, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
```
- For comparsion we use original c code from [quiz1](https://hackmd.io/@sysprog/arch2025-quiz1-sol) (remove `static inline`)
```c
#include <stdbool.h>
#include <stdint.h>
typedef struct {
uint16_t bits;
} bf16_t;
#define BF16_SIGN_MASK 0x8000U
#define BF16_EXP_MASK 0x7F80U
#define BF16_MANT_MASK 0x007FU
#define BF16_EXP_BIAS 127
#define BF16_NAN() ((bf16_t) {.bits = 0x7FC0})
#define BF16_ZERO() ((bf16_t) {.bits = 0x0000})
bf16_t bf16_sqrt(bf16_t a)
{
uint16_t sign = (a.bits >> 15) & 1;
int16_t exp = ((a.bits >> 7) & 0xFF);
uint16_t mant = a.bits & 0x7F;
/* Handle special cases */
if (exp == 0xFF) {
if (mant)
return a; /* NaN propagation */
if (sign)
return BF16_NAN(); /* sqrt(-Inf) = NaN */
return a; /* sqrt(+Inf) = +Inf */
}
/* sqrt(0) = 0 (handle both +0 and -0) */
if (!exp && !mant)
return BF16_ZERO();
/* sqrt of negative number is NaN */
if (sign)
return BF16_NAN();
/* Flush denormals to zero */
if (!exp)
return BF16_ZERO();
/* Direct bit manipulation square root algorithm */
/* For sqrt: new_exp = (old_exp - bias) / 2 + bias */
int32_t e = exp - BF16_EXP_BIAS;
int32_t new_exp;
/* Get full mantissa with implicit 1 */
uint32_t m = 0x80 | mant; /* Range [128, 256) representing [1.0, 2.0) */
/* Adjust for odd exponents: sqrt(2^odd * m) = 2^((odd-1)/2) * sqrt(2*m) */
if (e & 1) {
m <<= 1; /* Double mantissa for odd exponent */
new_exp = ((e - 1) >> 1) + BF16_EXP_BIAS;
} else {
new_exp = (e >> 1) + BF16_EXP_BIAS;
}
/* Now m is in range [128, 256) or [256, 512) if exponent was odd */
/* Binary search for integer square root */
/* We want result where result^2 = m * 128 (since 128 represents 1.0) */
uint32_t low = 90; /* Min sqrt (roughly sqrt(128)) */
uint32_t high = 256; /* Max sqrt (roughly sqrt(512)) */
uint32_t result = 128; /* Default */
/* Binary search for square root of m */
while (low <= high) {
uint32_t mid = (low + high) >> 1;
uint32_t sq = (mid * mid) / 128; /* Square and scale */
if (sq <= m) {
result = mid; /* This could be our answer */
low = mid + 1;
} else {
high = mid - 1;
}
}
/* result now contains sqrt(m) * sqrt(128) / sqrt(128) = sqrt(m) */
/* But we need to adjust the scale */
/* Since m is scaled where 128=1.0, result should also be scaled same way */
/* Normalize to ensure result is in [128, 256) */
if (result >= 256) {
result >>= 1;
new_exp++;
} else if (result < 128) {
while (result < 128 && new_exp > 1) {
result <<= 1;
new_exp--;
}
}
/* Extract 7-bit mantissa (remove implicit 1) */
uint16_t new_mant = result & 0x7F;
/* Check for overflow/underflow */
if (new_exp >= 0xFF)
return (bf16_t) {.bits = 0x7F80}; /* +Inf */
if (new_exp <= 0)
return BF16_ZERO();
return (bf16_t) {.bits = ((new_exp & 0xFF) << 7) | new_mant};
}
```
Additionally we reuse main.c from Part 2 and do a little change (remove unnecessary part like test file for other bf16 operations)
- main.c
```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#define printstr(ptr, length) \
do { \
asm volatile( \
"add a7, x0, 0x40;" \
"add a0, x0, 0x1;" /* stdout */ \
"add a1, x0, %0;" \
"mv a2, %1;" /* length character */ \
"ecall;" \
: \
: "r"(ptr), "r"(length) \
: "a0", "a1", "a2", "a7"); \
} while (0)
#define TEST_OUTPUT(msg, length) printstr(msg, length)
#define TEST_LOGGER(msg) \
{ \
char _msg[] = msg; \
TEST_OUTPUT(_msg, sizeof(_msg) - 1); \
}
extern uint64_t get_cycles(void);
extern uint64_t get_instret(void);
/* Bare metal memcpy implementation */
void *memcpy(void *dest, const void *src, size_t n)
{
uint8_t *d = (uint8_t *) dest;
const uint8_t *s = (const uint8_t *) src;
while (n--)
*d++ = *s++;
return dest;
}
/* Software division for RV32I (no M extension) */
static unsigned long udiv(unsigned long dividend, unsigned long divisor)
{
if (divisor == 0)
return 0;
unsigned long quotient = 0;
unsigned long remainder = 0;
for (int i = 31; i >= 0; i--) {
remainder <<= 1;
remainder |= (dividend >> i) & 1;
if (remainder >= divisor) {
remainder -= divisor;
quotient |= (1UL << i);
}
}
return quotient;
}
static unsigned long umod(unsigned long dividend, unsigned long divisor)
{
if (divisor == 0)
return 0;
unsigned long remainder = 0;
for (int i = 31; i >= 0; i--) {
remainder <<= 1;
remainder |= (dividend >> i) & 1;
if (remainder >= divisor) {
remainder -= divisor;
}
}
return remainder;
}
/* Software multiplication for RV32I (no M extension) */
static uint32_t umul(uint32_t a, uint32_t b)
{
uint32_t result = 0;
while (b) {
if (b & 1)
result += a;
a <<= 1;
b >>= 1;
}
return result;
}
/* Provide __mulsi3 for GCC */
uint32_t __mulsi3(uint32_t a, uint32_t b)
{
return umul(a, b);
}
/* Simple integer to hex string conversion */
static void print_hex(unsigned long val)
{
char buf[20];
char *p = buf + sizeof(buf) - 1;
*p = '\n';
p--;
if (val == 0) {
*p = '0';
p--;
} else {
while (val > 0) {
int digit = val & 0xf;
*p = (digit < 10) ? ('0' + digit) : ('a' + digit - 10);
p--;
val >>= 4;
}
}
p++;
printstr(p, (buf + sizeof(buf) - p));
}
/* Simple integer to decimal string conversion */
static void print_dec(unsigned long val)
{
char buf[20];
char *p = buf + sizeof(buf) - 1;
*p = '\n';
p--;
if (val == 0) {
*p = '0';
p--;
} else {
while (val > 0) {
*p = '0' + umod(val, 10);
p--;
val = udiv(val, 10);
}
}
p++;
printstr(p, (buf + sizeof(buf) - p));
}
typedef struct {
uint16_t bits;
} bf16_t;
uint32_t bf16_sqrt(uint32_t a_bf16);
static inline bf16_t bf16_sqrt_w(bf16_t a) {
bf16_t r = { .bits = (uint16_t)bf16_sqrt(a.bits) };
return r;
}
static void test_bf16_sqrt(void)
{
TEST_LOGGER("Test: bf16_sqrt\n");
// sqrt(-1.0) = NaN
bf16_t a = {.bits = 0xBF80};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(-1.0) = ");
print_hex(result.bits);
/* Expected: 0x7FC0 (NaN) */
if (result.bits == 0x7FC0) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x7FC0)\n");
}
}
int main(void)
{
uint64_t start_cycles, end_cycles, cycles_elapsed;
uint64_t start_instret, end_instret, instret_elapsed;
TEST_LOGGER("Test: bf16_sqrt\n");
start_cycles = get_cycles();
start_instret = get_instret();
test_bf16_sqrt();
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
TEST_LOGGER("\n");
TEST_LOGGER("\n=== All Tests Completed ===\n");
return 0;
}
```
For Makefile we mainly change three things
1. Add `OPT` to control GCC’s optimization level(-O0 / -Os / -Ofast) and `BF_IMPL` selects which bf16_sqrt implementation to use (asm or c)
2. Add the optimization flag into the C compile flags `CFLAGS = -g -march=rv32i_zicsr $(OPT)`
3. Use `ifeq` to choose which BF16 object file gets linked
- Makefile
```asm
include ../../../mk/toolchain.mk
ARCH = -march=rv32izicsr
LINKER_SCRIPT = linker.ld
EMU ?= ../../../build/rv32emu
OPT ?= -O0 # -Ofast / -Os / -O2
BF_IMPL ?= asm # asm / c
AFLAGS = -g $(ARCH)
CFLAGS = -g -march=rv32i_zicsr $(OPT)
LDFLAGS = -T $(LINKER_SCRIPT)
EXEC = test.elf
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
LD = $(CROSS_COMPILE)ld
OBJDUMP = $(CROSS_COMPILE)objdump
ifeq ($(BF_IMPL),asm)
BF_OBJ = bf16_sqrt.o
else
BF_OBJ = bf16_sqrt_base.o
endif
OBJS = start.o main.o perfcounter.o chacha20_asm.o $(BF_OBJ)
.PHONY: all run dump clean
all: $(EXEC)
$(EXEC): $(OBJS) $(LINKER_SCRIPT)
$(LD) $(LDFLAGS) -o $@ $(OBJS)
%.o: %.S
$(AS) $(AFLAGS) $< -o $@
%.o: %.s
$(AS) $(AFLAGS) $< -o $@
%.o: %.c
$(CC) $(CFLAGS) -c $< -o $@
run: $(EXEC)
@test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1)
@grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1)
@grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1)
$(EMU) $<
dump: $(EXEC)
$(OBJDUMP) -Ds $< | less
clean:
rm -f $(EXEC) $(OBJS)
```
- command
```bash=
make clean
make BF_IMPL=asm OPT=-O0 # or BF_IMPL=c OPT=-Os, etc.
riscv-none-elf-objdump -Ds test.elf > asm_O0.dump
```
By following the steps above, we can obtain four disassembled versions of bf16_sqrt: the handwritten assembly implementation, and the compiler-generated implementations built with `-O0`, `-Os`, and `-Ofast`.
First, assembler/compiler will turn Source code(asm/c) into object code. Then Linker will combine all object code into ELF (Executable and linkable) file, which is likely contain with code section `.text`, data section `.data/.bss`, symbol table and entry point.
However, an ELF is still mostly raw machine code, which humans can’t read. That’s why we disassemble it to turn machine code back into assembly so we can inspect what the compiler produced, compare it with handwritten assembly, and observe differences under different optimization flags.
<details>
<summary>asm.dump</summary>
```
00014154 <isqrt16_pow4>:
14154: 00000293 li t0,0
14158: 00004337 lui t1,0x4
0001415c <isqrt_loop>:
1415c: 02030663 beqz t1,14188 <isqrt_done>
14160: 006283b3 add t2,t0,t1
14164: 00757863 bgeu a0,t2,14174 <isqrt_ge>
14168: 0012d293 srli t0,t0,0x1
1416c: 00235313 srli t1,t1,0x2
14170: fedff06f j 1415c <isqrt_loop>
00014174 <isqrt_ge>:
14174: 40750533 sub a0,a0,t2
14178: 0012d293 srli t0,t0,0x1
1417c: 006282b3 add t0,t0,t1
14180: 00235313 srli t1,t1,0x2
14184: fd9ff06f j 1415c <isqrt_loop>
00014188 <isqrt_done>:
14188: 00028513 mv a0,t0
1418c: 00008067 ret
00014190 <bf16_sqrt>:
14190: fe010113 addi sp,sp,-32
14194: 00812023 sw s0,0(sp)
14198: 00912223 sw s1,4(sp)
1419c: 01212423 sw s2,8(sp)
141a0: 01312623 sw s3,12(sp)
141a4: 01412823 sw s4,16(sp)
141a8: 01512a23 sw s5,20(sp)
141ac: 01612c23 sw s6,24(sp)
141b0: 00112e23 sw ra,28(sp)
141b4: 00f55413 srli s0,a0,0xf
141b8: 00147413 andi s0,s0,1
141bc: 00755493 srli s1,a0,0x7
141c0: 0ff4f493 zext.b s1,s1
141c4: 07f57913 andi s2,a0,127
141c8: 0ff00293 li t0,255
141cc: 02549063 bne s1,t0,141ec <a_exp>
141d0: 00090463 beqz s2,141d8 <a_mant>
141d4: 0c00006f j 14294 <ret_a>
000141d8 <a_mant>:
141d8: 00040863 beqz s0,141e8 <a_sign>
141dc: 00008537 lui a0,0x8
141e0: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040>
141e4: 0b40006f j 14298 <ans>
000141e8 <a_sign>:
141e8: 0ac0006f j 14294 <ret_a>
000141ec <a_exp>:
141ec: 00049863 bnez s1,141fc <skip>
141f0: 00091663 bnez s2,141fc <skip>
141f4: 00000513 li a0,0
141f8: 0a00006f j 14298 <ans>
000141fc <skip>:
141fc: 00040863 beqz s0,1420c <negative_skip>
14200: 00008537 lui a0,0x8
14204: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040>
14208: 0900006f j 14298 <ans>
0001420c <negative_skip>:
1420c: 00049663 bnez s1,14218 <denormals_skip>
14210: 00000513 li a0,0
14214: 0840006f j 14298 <ans>
00014218 <denormals_skip>:
14218: f8148313 addi t1,s1,-127
1421c: 08096393 ori t2,s2,128
14220: 00137293 andi t0,t1,1
14224: 00028c63 beqz t0,1423c <else>
14228: 00139393 slli t2,t2,0x1
1422c: fff30293 addi t0,t1,-1 # 3fff <_start-0xc001>
14230: 4012d293 srai t0,t0,0x1
14234: 07f28e13 addi t3,t0,127
14238: 00c0006f j 14244 <end_if>
0001423c <else>:
1423c: 40135293 srai t0,t1,0x1
14240: 07f28e13 addi t3,t0,127
00014244 <end_if>:
14244: 000e0b13 mv s6,t3
14248: 00739513 slli a0,t2,0x7
1424c: 07f50513 addi a0,a0,127
14250: f05ff0ef jal 14154 <isqrt16_pow4>
14254: 00050a93 mv s5,a0
14258: 000b0e13 mv t3,s6
1425c: 0040006f j 14260 <l3>
00014260 <l3>:
14260: 07faff13 andi t5,s5,127
14264: 0ff00293 li t0,255
14268: 005e4863 blt t3,t0,14278 <no_overflow>
1426c: 00008537 lui a0,0x8
14270: f8050513 addi a0,a0,-128 # 7f80 <_start-0x8080>
14274: 0240006f j 14298 <ans>
00014278 <no_overflow>:
14278: 01c04663 bgtz t3,14284 <no_underflow>
1427c: 00000513 li a0,0
14280: 0180006f j 14298 <ans>
00014284 <no_underflow>:
14284: 0ffe7e13 zext.b t3,t3
14288: 007e1e13 slli t3,t3,0x7
1428c: 01ee6533 or a0,t3,t5
14290: 0080006f j 14298 <ans>
00014294 <ret_a>:
14294: 0040006f j 14298 <ans>
00014298 <ans>:
14298: 00012403 lw s0,0(sp)
1429c: 00412483 lw s1,4(sp)
142a0: 00812903 lw s2,8(sp)
142a4: 00c12983 lw s3,12(sp)
142a8: 01012a03 lw s4,16(sp)
142ac: 01412a83 lw s5,20(sp)
142b0: 01812b03 lw s6,24(sp)
142b4: 01c12083 lw ra,28(sp)
142b8: 02010113 addi sp,sp,32
142bc: 00008067 ret
```
</details>
<details>
<summary>c_O0.dump</summary>
```
00014154 <bf16_sqrt>:
14154: fb010113 addi sp,sp,-80
14158: 04112623 sw ra,76(sp)
1415c: 04812423 sw s0,72(sp)
14160: 05010413 addi s0,sp,80
14164: faa41e23 sh a0,-68(s0)
14168: fbc45783 lhu a5,-68(s0)
1416c: 00f7d793 srli a5,a5,0xf
14170: fcf41d23 sh a5,-38(s0)
14174: fbc45783 lhu a5,-68(s0)
14178: 0077d793 srli a5,a5,0x7
1417c: 01079793 slli a5,a5,0x10
14180: 0107d793 srli a5,a5,0x10
14184: 01079793 slli a5,a5,0x10
14188: 4107d793 srai a5,a5,0x10
1418c: 0ff7f793 zext.b a5,a5
14190: fcf41c23 sh a5,-40(s0)
14194: fbc45783 lhu a5,-68(s0)
14198: 07f7f793 andi a5,a5,127
1419c: fcf41b23 sh a5,-42(s0)
141a0: fd841703 lh a4,-40(s0)
141a4: 0ff00793 li a5,255
141a8: 02f71863 bne a4,a5,141d8 <bf16_sqrt+0x84>
141ac: fd645783 lhu a5,-42(s0)
141b0: 00078663 beqz a5,141bc <bf16_sqrt+0x68>
141b4: fbc45783 lhu a5,-68(s0)
141b8: 2280006f j 143e0 <bf16_sqrt+0x28c>
141bc: fda45783 lhu a5,-38(s0)
141c0: 00078863 beqz a5,141d0 <bf16_sqrt+0x7c>
141c4: ffff87b7 lui a5,0xffff8
141c8: fc07c793 xori a5,a5,-64
141cc: 2140006f j 143e0 <bf16_sqrt+0x28c>
141d0: fbc45783 lhu a5,-68(s0)
141d4: 20c0006f j 143e0 <bf16_sqrt+0x28c>
141d8: fd841783 lh a5,-40(s0)
141dc: 00079a63 bnez a5,141f0 <bf16_sqrt+0x9c>
141e0: fd645783 lhu a5,-42(s0)
141e4: 00079663 bnez a5,141f0 <bf16_sqrt+0x9c>
141e8: 00000793 li a5,0
141ec: 1f40006f j 143e0 <bf16_sqrt+0x28c>
141f0: fda45783 lhu a5,-38(s0)
141f4: 00078863 beqz a5,14204 <bf16_sqrt+0xb0>
141f8: ffff87b7 lui a5,0xffff8
141fc: fc07c793 xori a5,a5,-64
14200: 1e00006f j 143e0 <bf16_sqrt+0x28c>
14204: fd841783 lh a5,-40(s0)
14208: 00079663 bnez a5,14214 <bf16_sqrt+0xc0>
1420c: 00000793 li a5,0
14210: 1d00006f j 143e0 <bf16_sqrt+0x28c>
14214: fd841783 lh a5,-40(s0)
14218: f8178793 addi a5,a5,-127 # ffff7f81 <__stack_top+0xfffe2ae1>
1421c: fcf42823 sw a5,-48(s0)
14220: fd645783 lhu a5,-42(s0)
14224: 0807e793 ori a5,a5,128
14228: 01079793 slli a5,a5,0x10
1422c: 0107d793 srli a5,a5,0x10
14230: fef42423 sw a5,-24(s0)
14234: fd042783 lw a5,-48(s0)
14238: 0017f793 andi a5,a5,1
1423c: 02078463 beqz a5,14264 <bf16_sqrt+0x110>
14240: fe842783 lw a5,-24(s0)
14244: 00179793 slli a5,a5,0x1
14248: fef42423 sw a5,-24(s0)
1424c: fd042783 lw a5,-48(s0)
14250: fff78793 addi a5,a5,-1
14254: 4017d793 srai a5,a5,0x1
14258: 07f78793 addi a5,a5,127
1425c: fef42623 sw a5,-20(s0)
14260: 0140006f j 14274 <bf16_sqrt+0x120>
14264: fd042783 lw a5,-48(s0)
14268: 4017d793 srai a5,a5,0x1
1426c: 07f78793 addi a5,a5,127
14270: fef42623 sw a5,-20(s0)
14274: 05a00793 li a5,90
14278: fef42223 sw a5,-28(s0)
1427c: 10000793 li a5,256
14280: fef42023 sw a5,-32(s0)
14284: 08000793 li a5,128
14288: fcf42e23 sw a5,-36(s0)
1428c: 0600006f j 142ec <bf16_sqrt+0x198>
14290: fe442703 lw a4,-28(s0)
14294: fe042783 lw a5,-32(s0)
14298: 00f707b3 add a5,a4,a5
1429c: 0017d793 srli a5,a5,0x1
142a0: fcf42423 sw a5,-56(s0)
142a4: fc842583 lw a1,-56(s0)
142a8: fc842503 lw a0,-56(s0)
142ac: fe9fb0ef jal 10294 <__mulsi3>
142b0: 00050793 mv a5,a0
142b4: 0077d793 srli a5,a5,0x7
142b8: fcf42223 sw a5,-60(s0)
142bc: fc442703 lw a4,-60(s0)
142c0: fe842783 lw a5,-24(s0)
142c4: 00e7ee63 bltu a5,a4,142e0 <bf16_sqrt+0x18c>
142c8: fc842783 lw a5,-56(s0)
142cc: fcf42e23 sw a5,-36(s0)
142d0: fc842783 lw a5,-56(s0)
142d4: 00178793 addi a5,a5,1
142d8: fef42223 sw a5,-28(s0)
142dc: 0100006f j 142ec <bf16_sqrt+0x198>
142e0: fc842783 lw a5,-56(s0)
142e4: fff78793 addi a5,a5,-1
142e8: fef42023 sw a5,-32(s0)
142ec: fe442703 lw a4,-28(s0)
142f0: fe042783 lw a5,-32(s0)
142f4: f8e7fee3 bgeu a5,a4,14290 <bf16_sqrt+0x13c>
142f8: fdc42703 lw a4,-36(s0)
142fc: 0ff00793 li a5,255
14300: 02e7f063 bgeu a5,a4,14320 <bf16_sqrt+0x1cc>
14304: fdc42783 lw a5,-36(s0)
14308: 0017d793 srli a5,a5,0x1
1430c: fcf42e23 sw a5,-36(s0)
14310: fec42783 lw a5,-20(s0)
14314: 00178793 addi a5,a5,1
14318: fef42623 sw a5,-20(s0)
1431c: 0440006f j 14360 <bf16_sqrt+0x20c>
14320: fdc42703 lw a4,-36(s0)
14324: 07f00793 li a5,127
14328: 02e7ec63 bltu a5,a4,14360 <bf16_sqrt+0x20c>
1432c: 01c0006f j 14348 <bf16_sqrt+0x1f4>
14330: fdc42783 lw a5,-36(s0)
14334: 00179793 slli a5,a5,0x1
14338: fcf42e23 sw a5,-36(s0)
1433c: fec42783 lw a5,-20(s0)
14340: fff78793 addi a5,a5,-1
14344: fef42623 sw a5,-20(s0)
14348: fdc42703 lw a4,-36(s0)
1434c: 07f00793 li a5,127
14350: 00e7e863 bltu a5,a4,14360 <bf16_sqrt+0x20c>
14354: fec42703 lw a4,-20(s0)
14358: 00100793 li a5,1
1435c: fce7cae3 blt a5,a4,14330 <bf16_sqrt+0x1dc>
14360: fdc42783 lw a5,-36(s0)
14364: 01079793 slli a5,a5,0x10
14368: 0107d793 srli a5,a5,0x10
1436c: 07f7f793 andi a5,a5,127
14370: fcf41723 sh a5,-50(s0)
14374: fec42703 lw a4,-20(s0)
14378: 0fe00793 li a5,254
1437c: 00e7d863 bge a5,a4,1438c <bf16_sqrt+0x238>
14380: ffff87b7 lui a5,0xffff8
14384: f807c793 xori a5,a5,-128
14388: 0580006f j 143e0 <bf16_sqrt+0x28c>
1438c: fec42783 lw a5,-20(s0)
14390: 00f04663 bgtz a5,1439c <bf16_sqrt+0x248>
14394: 00000793 li a5,0
14398: 0480006f j 143e0 <bf16_sqrt+0x28c>
1439c: fec42783 lw a5,-20(s0)
143a0: 01079793 slli a5,a5,0x10
143a4: 4107d793 srai a5,a5,0x10
143a8: 00779793 slli a5,a5,0x7
143ac: 01079713 slli a4,a5,0x10
143b0: 41075713 srai a4,a4,0x10
143b4: 000087b7 lui a5,0x8
143b8: f8078793 addi a5,a5,-128 # 7f80 <_start-0x8080>
143bc: 00f777b3 and a5,a4,a5
143c0: 01079713 slli a4,a5,0x10
143c4: 41075713 srai a4,a4,0x10
143c8: fce41783 lh a5,-50(s0)
143cc: 00f767b3 or a5,a4,a5
143d0: 01079793 slli a5,a5,0x10
143d4: 4107d793 srai a5,a5,0x10
143d8: 01079793 slli a5,a5,0x10
143dc: 0107d793 srli a5,a5,0x10
143e0: 00078513 mv a0,a5
143e4: 04c12083 lw ra,76(sp)
143e8: 04812403 lw s0,72(sp)
143ec: 05010113 addi sp,sp,80
143f0: 00008067 ret
```
</details>
<details>
<summary>c_Os.dump</summary>
```
0001399c <bf16_sqrt>:
1399c: 01051513 slli a0,a0,0x10
139a0: 01055513 srli a0,a0,0x10
139a4: fe010113 addi sp,sp,-32
139a8: 00755793 srli a5,a0,0x7
139ac: 01212823 sw s2,16(sp)
139b0: 00112e23 sw ra,28(sp)
139b4: 00812c23 sw s0,24(sp)
139b8: 00912a23 sw s1,20(sp)
139bc: 01312623 sw s3,12(sp)
139c0: 01412423 sw s4,8(sp)
139c4: 01512223 sw s5,4(sp)
139c8: 0ff7f793 zext.b a5,a5
139cc: 0ff00693 li a3,255
139d0: 00f55713 srli a4,a0,0xf
139d4: 07f57913 andi s2,a0,127
139d8: 04d79063 bne a5,a3,13a18 <bf16_sqrt+0x7c>
139dc: 00091c63 bnez s2,139f4 <bf16_sqrt+0x58>
139e0: 00008537 lui a0,0x8
139e4: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040>
139e8: 00071663 bnez a4,139f4 <bf16_sqrt+0x58>
139ec: 00008537 lui a0,0x8
139f0: f8050513 addi a0,a0,-128 # 7f80 <_start-0x8080>
139f4: 01c12083 lw ra,28(sp)
139f8: 01812403 lw s0,24(sp)
139fc: 01412483 lw s1,20(sp)
13a00: 01012903 lw s2,16(sp)
13a04: 00c12983 lw s3,12(sp)
13a08: 00812a03 lw s4,8(sp)
13a0c: 00412a83 lw s5,4(sp)
13a10: 02010113 addi sp,sp,32
13a14: 00008067 ret
13a18: 04079a63 bnez a5,13a6c <bf16_sqrt+0xd0>
13a1c: 00000513 li a0,0
13a20: fc090ae3 beqz s2,139f4 <bf16_sqrt+0x58>
13a24: 00008537 lui a0,0x8
13a28: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040>
13a2c: 40e00733 neg a4,a4
13a30: 00e57533 and a0,a0,a4
13a34: fc1ff06f j 139f4 <bf16_sqrt+0x58>
13a38: 40145413 srai s0,s0,0x1
13a3c: 07f40413 addi s0,s0,127
13a40: 0580006f j 13a98 <bf16_sqrt+0xfc>
13a44: fff98a93 addi s5,s3,-1
13a48: 0800006f j 13ac8 <bf16_sqrt+0x12c>
13a4c: 07f00793 li a5,127
13a50: 0897e663 bltu a5,s1,13adc <bf16_sqrt+0x140>
13a54: 00100713 li a4,1
13a58: 00149493 slli s1,s1,0x1
13a5c: fff40413 addi s0,s0,-1
13a60: 0697ee63 bltu a5,s1,13adc <bf16_sqrt+0x140>
13a64: fee41ae3 bne s0,a4,13a58 <bf16_sqrt+0xbc>
13a68: 0740006f j 13adc <bf16_sqrt+0x140>
13a6c: 00008537 lui a0,0x8
13a70: fc050513 addi a0,a0,-64 # 7fc0 <_start-0x8040>
13a74: f80710e3 bnez a4,139f4 <bf16_sqrt+0x58>
13a78: f8178413 addi s0,a5,-127
13a7c: 00147713 andi a4,s0,1
13a80: 08096913 ori s2,s2,128
13a84: fa070ae3 beqz a4,13a38 <bf16_sqrt+0x9c>
13a88: f8078793 addi a5,a5,-128
13a8c: 4017d793 srai a5,a5,0x1
13a90: 00191913 slli s2,s2,0x1
13a94: 07f78413 addi s0,a5,127
13a98: 08000493 li s1,128
13a9c: 10000a93 li s5,256
13aa0: 05a00a13 li s4,90
13aa4: 015a09b3 add s3,s4,s5
13aa8: 0019d993 srli s3,s3,0x1
13aac: 00098593 mv a1,s3
13ab0: 00098513 mv a0,s3
13ab4: e34fc0ef jal 100e8 <__mulsi3>
13ab8: 00755513 srli a0,a0,0x7
13abc: f8a964e3 bltu s2,a0,13a44 <bf16_sqrt+0xa8>
13ac0: 00198a13 addi s4,s3,1
13ac4: 00098493 mv s1,s3
13ac8: fd4afee3 bgeu s5,s4,13aa4 <bf16_sqrt+0x108>
13acc: 0ff00793 li a5,255
13ad0: f697fee3 bgeu a5,s1,13a4c <bf16_sqrt+0xb0>
13ad4: 0014d493 srli s1,s1,0x1
13ad8: 00140413 addi s0,s0,1
13adc: 00741513 slli a0,s0,0x7
13ae0: 07f4f493 andi s1,s1,127
13ae4: 00956533 or a0,a0,s1
13ae8: 01051513 slli a0,a0,0x10
13aec: 01055513 srli a0,a0,0x10
13af0: f05ff06f j 139f4 <bf16_sqrt+0x58>
```
</details>
<details>
<summary>c_Ofast.dump</summary>
```
00013bc4 <main>:
13bc4: fc010113 addi sp,sp,-64
13bc8: 02812c23 sw s0,56(sp)
13bcc: 03312623 sw s3,44(sp)
13bd0: 02112e23 sw ra,60(sp)
13bd4: 02912a23 sw s1,52(sp)
13bd8: 03212823 sw s2,48(sp)
13bdc: 00010413 mv s0,sp
13be0: 01000993 li s3,16
13be4: 04000893 li a7,64
13be8: 00100513 li a0,1
13bec: 002005b3 add a1,zero,sp
13bf0: 00098613 mv a2,s3
13bf4: 00000073 ecall
13bf8: da0fc0ef jal 10198 <get_cycles>
13bfc: 00050913 mv s2,a0
13c00: dacfc0ef jal 101ac <get_instret>
13c04: 00050493 mv s1,a0
13c08: 04000893 li a7,64
13c0c: 00100513 li a0,1
13c10: 002005b3 add a1,zero,sp
13c14: 00098613 mv a2,s3
13c18: 00000073 ecall
13c1c: 0000c537 lui a0,0xc
13c20: f8050513 addi a0,a0,-128 # bf80 <_start-0x4080>
13c24: df9ff0ef jal 13a1c <bf16_sqrt>
13c28: 01051693 slli a3,a0,0x10
13c2c: 0106d693 srli a3,a3,0x10
13c30: 00f00793 li a5,15
13c34: 04000893 li a7,64
13c38: 00100513 li a0,1
13c3c: 002005b3 add a1,zero,sp
13c40: 00078613 mv a2,a5
13c44: 00000073 ecall
13c48: 01110793 addi a5,sp,17
13c4c: 00068c63 beqz a3,13c64 <main+0xa0>
13c50: 00068713 mv a4,a3
13c54: 01210793 addi a5,sp,18
13c58: 00475713 srli a4,a4,0x4
13c5c: fff78793 addi a5,a5,-1
13c60: fe071ce3 bnez a4,13c58 <main+0x94>
13c64: 00178793 addi a5,a5,1
13c68: 01410713 addi a4,sp,20
13c6c: 40f70733 sub a4,a4,a5
13c70: 04000893 li a7,64
13c74: 00100513 li a0,1
13c78: 00f005b3 add a1,zero,a5
13c7c: 00070613 mv a2,a4
13c80: 00000073 ecall
13c84: 000087b7 lui a5,0x8
13c88: fc078793 addi a5,a5,-64 # 7fc0 <_start-0x8040>
13c8c: 0af68e63 beq a3,a5,13d48 <main+0x184>
13c90: 01b00793 li a5,27
13c94: 04000893 li a7,64
13c98: 00100513 li a0,1
13c9c: 008005b3 add a1,zero,s0
13ca0: 00078613 mv a2,a5
13ca4: 00000073 ecall
13ca8: cf0fc0ef jal 10198 <get_cycles>
13cac: 00050993 mv s3,a0
13cb0: cfcfc0ef jal 101ac <get_instret>
13cb4: 409504b3 sub s1,a0,s1
13cb8: 00a00793 li a5,10
13cbc: 04000893 li a7,64
13cc0: 00100513 li a0,1
13cc4: 008005b3 add a1,zero,s0
13cc8: 00078613 mv a2,a5
13ccc: 00000073 ecall
13cd0: 41298533 sub a0,s3,s2
13cd4: b68fc0ef jal 1003c <print_dec>
13cd8: 01000793 li a5,16
13cdc: 04000893 li a7,64
13ce0: 00100513 li a0,1
13ce4: 008005b3 add a1,zero,s0
13ce8: 00078613 mv a2,a5
13cec: 00000073 ecall
13cf0: 00048513 mv a0,s1
13cf4: b48fc0ef jal 1003c <print_dec>
13cf8: 00100793 li a5,1
13cfc: 04000893 li a7,64
13d00: 00100513 li a0,1
13d04: 008005b3 add a1,zero,s0
13d08: 00078613 mv a2,a5
13d0c: 00000073 ecall
13d10: 01d00793 li a5,29
13d14: 04000893 li a7,64
13d18: 00100513 li a0,1
13d1c: 008005b3 add a1,zero,s0
13d20: 00078613 mv a2,a5
13d24: 00000073 ecall
13d28: 03c12083 lw ra,60(sp)
13d2c: 03812403 lw s0,56(sp)
13d30: 03412483 lw s1,52(sp)
13d34: 03012903 lw s2,48(sp)
13d38: 02c12983 lw s3,44(sp)
13d3c: 00000513 li a0,0
13d40: 04010113 addi sp,sp,64
13d44: 00008067 ret
13d48: 00900793 li a5,9
13d4c: 04000893 li a7,64
13d50: 00100513 li a0,1
13d54: 008005b3 add a1,zero,s0
13d58: 00078613 mv a2,a5
13d5c: 00000073 ecall
13d60: f49ff06f j 13ca8 <main+0xe4>
```
</details>
### Observation 1: Stack usage
- compiler original version
Stack frame is the largest of all version (e.g.,`addi sp,sp,-80`). And that's because of complier perfer to save lots of loacl variable to stack such as sign, exp, mant and so on. As the result, the generated code contains many unnecessary lw/sw instructions for loading and storing temporaries. Moreover, disassembly shows frequent lw/sw traffic to spill and reload values inside the binary-search loop, which produce extra memory accesses and increases instruction count.
- compiler with flag `-Os`(size) version
With `-Os`, this time compiler reduces stack usage(e.g.,`addi sp,sp,-32`) and tries to keep more live values in registers instead of spilling them to memory.
- compiler with flag `-Ofast` (spead) version
In this disassembly, `-Ofast` also drops the stack pointer by 32 right away, but its register saving strategy feels a bit more staggered. It starts by saving just a few registers (like ra and s4). It waits to push the rest (s0-s5) until it actually hits the binary search path.
By this way in some serval cases like `NaN`, `Inf`, `zero` or `negative values`, the `-Ofast` can perform much better than `-Os`, since it don't need to save all registers from the start of the program.
### Observation 2: Binary Search part
Since we are using bf16_sqrt of RV32I friendly from [Assignment1](https://hackmd.io/@IBxdYttpQ5-JUEW6_KypVg/arch2025-homework1), which don't require any multiplication at all. Instead we use other operations like add, shift, sub and compare to replace it. On the other hand, no matter which optimization flag the compiler uses, the C version still has to implement `mid * mid`, so it ends up calling `__mulsi3` (GCC's runtime helper) on RV32I (no M extension) and keeping the binary-search loop structure.
## Part 5 - Optimization
In this section, we aim to improve the assembly code generated by GCC, and we use [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c) and [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter) to collect performance statistics.
- Performance metric
- Cycle: the number of clock cycles elapsed during execution
- Instructions: the number of instructions actually executed
In terms of implementation, I choose the compiler generated code withuot any flag to optimize.
Let's look at the `main.c` (from Part 4). In order to analyze the performance of program more directly. We simply extend the test function from single to 10 cases.
```clike=
static void test_bf16_sqrt(void)
{
TEST_LOGGER("Test: bf16_sqrt\n");
// 1) sqrt(-1.0) = NaN (0x7FC0)
{
bf16_t a = {.bits = 0xBF80};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(-1.0) = ");
print_hex(result.bits);
if (result.bits == 0x7FC0) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x7FC0)\n");
}
}
// 2) sqrt(+0.0) = +0.0
{
bf16_t a = {.bits = 0x0000};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(+0.0) = ");
print_hex(result.bits);
if (result.bits == 0x0000) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x0000)\n");
}
}
// 3) sqrt(-0.0) = +0.0
{
bf16_t a = {.bits = 0x8000};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(-0.0) = ");
print_hex(result.bits);
if (result.bits == 0x0000) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x0000)\n");
}
}
// 4) sqrt(+Inf) = +Inf
{
bf16_t a = {.bits = 0x7F80};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(+Inf) = ");
print_hex(result.bits);
if (result.bits == 0x7F80) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x7F80)\n");
}
}
// 5) sqrt(-Inf) = NaN (0x7FC0)
{
bf16_t a = {.bits = 0xFF80};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(-Inf) = ");
print_hex(result.bits);
if (result.bits == 0x7FC0) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x7FC0)\n");
}
}
// 6) sqrt(NaN canonical 0x7FC0) = NaN (expect 0x7FC0)
{
bf16_t a = {.bits = 0x7FC0};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(NaN 0x7FC0) = ");
print_hex(result.bits);
if (result.bits == 0x7FC0) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x7FC0)\n");
}
}
// 7) sqrt(1.0) = 1.0
{
bf16_t a = {.bits = 0x3F80};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(1.0) = ");
print_hex(result.bits);
if (result.bits == 0x3F80) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x3F80)\n");
}
}
// 8) sqrt(4.0) = 2.0
{
bf16_t a = {.bits = 0x4080}; // 4.0 in bf16
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(4.0) = ");
print_hex(result.bits);
if (result.bits == 0x4000) { // 2.0 in bf16
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x4000)\n");
}
}
// 9) sqrt(0.25) = 0.5
{
bf16_t a = {.bits = 0x3E80}; // 0.25 in bf16
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(0.25) = ");
print_hex(result.bits);
if (result.bits == 0x3F00) { // 0.5 in bf16
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x3F00)\n");
}
}
// 10) sqrt(2.25) = 1.5
{
bf16_t a = {.bits = 0x4010};
bf16_t result = bf16_sqrt_w(a);
TEST_LOGGER(" sqrt(2.25) = ");
print_hex(result.bits);
if (result.bits == 0x3FC0) {
TEST_LOGGER(" PASSED\n");
} else {
TEST_LOGGER(" FAILED (expected 0x3FC0)\n");
}
}
}
```
Before we go further, we have to transfer compiler generated c code to assembly first, since we need to improve it.
```bash=
riscv-none-elf-gcc -S -march=rv32i_zicsr -mabi=ilp32 -O0 bf16_sqrt_base.c -o bf16_sqrt_gcc_O0.s
```
- Makefile
```
...
OBJS = start.o main.o perfcounter.o chacha20_asm.o bf16_sqrt_gcc_O0.o
...
```
- command
```bash=
make clean
make
make run
```
- Result
```bash=
Test: bf16_sqrt
sqrt(-1.0) = 7fc0
PASSED
sqrt(+0.0) = 0
PASSED
sqrt(-0.0) = 0
PASSED
sqrt(+Inf) = 7f80
PASSED
sqrt(-Inf) = 7fc0
PASSED
sqrt(NaN 0x7FC0) = 7fc0
PASSED
sqrt(1.0) = 3f80
PASSED
sqrt(4.0) = 4000
PASSED
sqrt(0.25) = 3f00
PASSED
sqrt(2.25) = 3fc0
PASSED
Cycles: 6745
Instructions: 6745
=== All Tests Completed ===
```
Without any optimization:
- Cycle: 6745
- Instruction: 6745
### Improvement1: Peephole inside loop
Like we mentiond before, complier (without any optimization flag) perfer to save lots of loacl variable to stack. And that give us a lot of room for improvement.
Since every loop iteration we have to perform lots of unnecessary lw/sw, our strategy here is simply using sepcial registers instead of spill to memory for binary search loop.
Because the loop contains a function call `__mulsi3`, t/a registers might get overwritten. To avoid saving/restoring them every iteration, we use s registers to hold the loop state.
As we require sepcial registers for binary search loop, we have to allocate memory space for them first.
- origin
```
bf16_sqrt:
addi sp,sp,-80
sw ra,76(sp)
sw s0,72(sp)
addi s0,sp,80
```
- after
```
bf16_sqrt:
addi sp,sp,-96
sw ra,92(sp)
sw s0,88(sp)
sw s1,84(sp)
sw s2,80(sp)
sw s3,76(sp)
sw s4,72(sp)
addi s0,sp,96
```
- origin
```
.L20:
mv a0,a5
lw ra,76(sp)
lw s0,72(sp)
addi sp,sp,80
jr ra
```
- after
```
.L20:
mv a0,a5
lw s4,72(sp)
lw s3,76(sp)
lw s2,80(sp)
lw s1,84(sp)
lw s0,88(sp)
lw ra,92(sp)
addi sp,sp,96
jr ra
```
Now, let’s rewrite the binary search loop to use registers only.
- origin
```asm=
.L10:
li a5,90
sw a5,-28(s0) # low
li a5,256
sw a5,-32(s0) # high
li a5,128
sw a5,-36(s0) # best
j .L11
.L13:
lw a4,-28(s0)
lw a5,-32(s0)
add a5,a4,a5
srli a5,a5,1
sw a5,-56(s0) # mid
lw a1,-56(s0)
lw a0,-56(s0)
call __mulsi3
mv a5,a0
srli a5,a5,7
sw a5,-60(s0)
lw a4,-60(s0)
lw a5,-24(s0) # mant
bgtu a4,a5,.L12
lw a5,-56(s0)
sw a5,-36(s0)
lw a5,-56(s0)
addi a5,a5,1
sw a5,-28(s0)
j .L11
.L12:
lw a5,-56(s0)
addi a5,a5,-1
sw a5,-32(s0)
.L11:
lw a4,-28(s0)
lw a5,-32(s0)
bleu a4,a5,.L13
```
- after
```asm=
.L10:
li s1,90 # low
li s2,256 # high
li s3,128 # best
lw s4,-24(s0) # mant
j .L11
.L13:
add t0,s1,s2
srli t0,t0,1 # mid
mv a0,t0
mv a1,t0
call __mulsi3
srli t1,a0,7 # mid2 = (mid*mid)>>7
bgtu t1,s4,.L12
mv s3,t0 # best = mid
addi s1,t0,1 # low = mid+1
j .L11
.L12:
addi s2,t0,-1 # high = mid-1
.L11:
bleu s1,s2,.L13
sw s3,-36(s0)
```
- Result
```
Test: bf16_sqrt
sqrt(-1.0) = 7fc0
PASSED
sqrt(+0.0) = 0
PASSED
sqrt(-0.0) = 0
PASSED
sqrt(+Inf) = 7f80
PASSED
sqrt(-Inf) = 7fc0
PASSED
sqrt(NaN 0x7FC0) = 7fc0
PASSED
sqrt(1.0) = 3f80
PASSED
sqrt(4.0) = 4000
PASSED
sqrt(0.25) = 3f00
PASSED
sqrt(2.25) = 3fc0
PASSED
Cycles: 6397
Instructions: 6397
=== All Tests Completed ===
```
### Improvement2: Drop some function calls
In order to calculate `mid*mid` for binary search loop, we have to call function call `__mulsi3`, which is quite heavy for our cycle.
We observed that `high` and `low` are initialized to 256 and 90, respectively, so `mid` is always bounded within this small interval. As a result, `mid` only needs about 9 bits to represent (since the maximum value is 256), and computing `mid * mid` does not require a full 32-bit multiplication routine.
Therefore we replaced function call `__mulsi3` with a shift and add multiplier (9 iterations).
- origin
```asm=
mv a0,t0
mv a1,t0
call __mulsi3
srli t1,a0,7
```
- after
```asm=
# t0 = mid
mv t2, t0 # multiplicand (a)
mv t3, t0 # multiplier (b)
li t1, 0 # acc = 0
andi t4, t3, 1
sub t4, zero, t4 # t4 = 0 or -1
and t5, t2, t4 # t5 = (b&1)?a:0
add t1, t1, t5
slli t2, t2, 1
srli t3, t3, 1
andi t4, t3, 1
sub t4, zero, t4
and t5, t2, t4
add t1, t1, t5
slli t2, t2, 1
srli t3, t3, 1
andi t4, t3, 1
sub t4, zero, t4
and t5, t2, t4
add t1, t1, t5
slli t2, t2, 1
srli t3, t3, 1
andi t4, t3, 1
sub t4, zero, t4
and t5, t2, t4
add t1, t1, t5
slli t2, t2, 1
srli t3, t3, 1
andi t4, t3, 1
sub t4, zero, t4
and t5, t2, t4
add t1, t1, t5
slli t2, t2, 1
srli t3, t3, 1
andi t4, t3, 1
sub t4, zero, t4
and t5, t2, t4
add t1, t1, t5
slli t2, t2, 1
srli t3, t3, 1
andi t4, t3, 1
sub t4, zero, t4
and t5, t2, t4
add t1, t1, t5
slli t2, t2, 1
srli t3, t3, 1
andi t4, t3, 1
sub t4, zero, t4
and t5, t2, t4
add t1, t1, t5
slli t2, t2, 1
srli t3, t3, 1
andi t4, t3, 1
sub t4, zero, t4
and t5, t2, t4
add t1, t1, t5
srli t1, t1, 7
```
To better understand the iterations, we analyze one loop iteration in detail
```asm=
andi t4, t3, 1
sub t4, zero, t4 # t4 = 0 or -1 (0xFFFFFFFF)
and t5, t2, t4 # t5 = a & 0 = 0 or t5 = a & 0xFFFFFFFF = a
add t1, t1, t5
slli t2, t2, 1
srli t3, t3, 1
```
> t3 = multiplier, t2 = multiplicand, t1 = temp
Basically we check if LSB of multiplier is 1 every round. If it is, then we just add the current multiplicand to the accumulator. Then we left-shift the multiplicand and right-shift the multiplier for the next round.
- result
``` bash=
Test: bf16_sqrt
sqrt(-1.0) = 7fc0
PASSED
sqrt(+0.0) = 0
PASSED
sqrt(-0.0) = 0
PASSED
sqrt(+Inf) = 7f80
PASSED
sqrt(-Inf) = 7fc0
PASSED
sqrt(NaN 0x7FC0) = 7fc0
PASSED
sqrt(1.0) = 3f80
PASSED
sqrt(4.0) = 4000
PASSED
sqrt(0.25) = 3f00
PASSED
sqrt(2.25) = 3fc0
PASSED
Cycles: 4093
Instructions: 4093
=== All Tests Completed ===
```
After above two improvement our assembly can reduce about 39.3% !