# Assignment2: RISC-V Toolchain
Contributed by [andy891023](https://github.com/andy891023)
## Rewrite Implementing FP32 Operations by Applying FP32 to Bfloat16 Conversion Algorithm
I rewrote the assembly code for the FP32 to bfloat16 conversion algorithm from [brian049](https://github.com/brian049). I chose this project because the algorithm reduces the memory needed to store each value: bfloat16 keeps only the upper 16 bits of an FP32 number, so storage is halved at the cost of reduced precision.
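To make the storage trade-off concrete, here is a minimal C sketch of how a bfloat16 value relates to its FP32 source, assuming `float` is IEEE 754 single precision. The helper names `fp32_bits`, `bf16_pack_truncate`, and `bf16_unpack` are hypothetical and chosen only for illustration. This version simply truncates, without any rounding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of an FP32 value as a 32-bit unsigned integer. */
uint32_t fp32_bits(float x)
{
    uint32_t u;
    assert(sizeof u == sizeof x); /* assumption: float is 32 bits wide */
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Hypothetical helper: keep only the upper 16 bits of the FP32 pattern
 * (1 sign bit, 8 exponent bits, 7 mantissa bits). Truncates, no rounding. */
uint16_t bf16_pack_truncate(float x)
{
    return (uint16_t) (fp32_bits(x) >> 16);
}

/* Hypothetical helper: widen a stored bfloat16 back to FP32 by
 * zero-filling the 16 mantissa bits that were dropped. */
float bf16_unpack(uint16_t h)
{
    uint32_t u = (uint32_t) h << 16;
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}
```

For example, `bf16_pack_truncate(1.2f)` yields `0x3F99`, and `bf16_unpack(0x3F99)` gives back 1.1953125: two bytes are stored instead of four, and the low mantissa bits are lost.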
## Makefile
Build the project:
```shell
$ make
```
You should see the following messages:
```shell
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Ofast -Wall -c -o getcycles.o getcycles.S
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Ofast -Wall -c -o getinstret.o getinstret.S
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Ofast -Wall -c -o hello.o hello.s
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Ofast -Wall -c -o main.o main.c
riscv-none-elf-gcc -o perfcount.elf getcycles.o getinstret.o hello.o main.o
```
This produces ==perfcount.elf==, which you can run with the following command:
```
$ ../../build/rv32emu perfcount.elf
```
Expected output:
```
Input FP32:3f99999a
Output bfloat16:3fd90000
cycle count: 3443
instret: 2be
inferior exit code 0
```
## Modified RISC-V code
The modified RISC-V code, which runs on [rv32emu](https://github.com/sysprog21/rv32emu), is shown below. It takes a 32-bit FP32 value as input and produces a bfloat16 value, returned in the upper 16 bits of a 32-bit word.
```c
.data
#.align 8
.global start
.set SYSEXIT, 93
.set SYSWRITE, 64
sign_mask: .word 0x80000000
exp_mask: .word 0x7F800000
man_mask: .word 0x007FFFFF
man16_mask: .word 0x007F0000
r_mask: .word 0xFF800000
divisor: .word 0x100
mul_use: .word 0x00800000
mul_use2: .word 0x01000000
mul_use3: .word 0x3F800000
str1: .ascii "zero\n"
.set str1_size, .-str1
str2: .ascii "infinity or NaN\n"
.set str2_size, .-str2
.text
start:
add x5, a0, x0 # x5 = a0 (input FP32 bit pattern)
addi x30, x0, 1 # x30 = loop counter (one conversion)
is_it_zero_or_infinity_or_NaN:
# Load the exponent and mantissa masks into a0 and a1
lw a0, exp_mask
lw a1, man_mask
and x6, x5, a0 # exp
and x7, x5, a1 # man
# Check for zero: both the exponent and the mantissa fields must be zero
or t3, x6, x7
beqz t3, zero_case
Normalize:
# Check for infinity or NaN
li a0, 0x7F800000
beq x6, a0, infinity_nan_case
# Normalized number
add a0, x0, x5
add x6, a0, x0
lw a0, r_mask
and x6, x6, a0 # r_mask
# r /= 0x100 (done here as a logical right shift of the bit pattern)
srli x6, x6, 8
add a0, x0, x5
add x5, a0, x6 # y = x + r
# Clear the lower 16 bits of y
li t6, 0xFFFF0000
and x5, x5, t6
sw x5, 0(x8)
addi x8, x8, 8
j done
zero_case:
sw x5, 0(x8)
addi x8, x8, 8
li a7, SYSWRITE
li a0, 1
la a1, str1
li a2, str1_size
ecall
j done
infinity_nan_case:
sw x5, 0(x8)
addi x8, x8, 8
li a7, SYSWRITE
li a0, 1
la a1, str2
li a2, str2_size
ecall
j done
done:
addi x30, x30, -1
bnez x30, is_it_zero_or_infinity_or_NaN
add t1, x0, x0 # initialize t1
addi x8, x8, -8
mv a0, x5
ret
```
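The comments in the listing (`r /= 0x100`, `y = x + r`) describe a round-to-nearest step carried out in floating-point arithmetic. As a point of comparison, here is a minimal C sketch of that step, assuming `float` is IEEE 754 single precision. The function name `fp32_to_bf16_bits` is hypothetical, and this is not necessarily the original author's code. Under these assumptions the input `0x3f99999a` maps to `0x3f9a0000`, whereas shifting the integer bit pattern with `srli`, as the listing does, gives the `3fd90000` recorded in the expected output.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reference: takes an FP32 bit pattern and returns the
 * bfloat16 result in the upper 16 bits of a 32-bit word. */
uint32_t fp32_to_bf16_bits(uint32_t x_bits)
{
    uint32_t exp = x_bits & 0x7F800000u;
    uint32_t man = x_bits & 0x007FFFFFu;

    if (exp == 0 && man == 0)   /* zero: return unchanged */
        return x_bits;
    if (exp == 0x7F800000u)     /* infinity or NaN: return unchanged */
        return x_bits;

    float x, r, y;
    uint32_t r_bits = x_bits & 0xFF800000u; /* same sign and exponent as x */
    uint32_t y_bits;

    memcpy(&x, &x_bits, sizeof x);
    memcpy(&r, &r_bits, sizeof r);

    r /= 0x100;  /* floating-point division: lowers the exponent by 8 */
    y = x + r;   /* adds half a unit of the last kept mantissa bit */

    memcpy(&y_bits, &y, sizeof y_bits);
    return y_bits & 0xFFFF0000u; /* drop the lower 16 mantissa bits */
}
```

Because `r` has the same sign and exponent as `x`, adding `r / 256` bumps the mantissa at the bit just below the seven that bfloat16 keeps, so the final mask rounds to nearest instead of truncating.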
## Compare performance
| Level  | cycle count | instret (hex) | text  | data | bss  | dec   | hex  |
| ------ | ----------- | ------------- | ----- | ---- | ---- | ----- | ---- |
| -O0    | 3446        | 2be           | 51418 | 1932 | 1528 | 54878 | d65e |
| -O1    | 3443        | 2be           | 51346 | 1932 | 1528 | 54806 | d616 |
| -O2    | 3443        | 2be           | 51346 | 1932 | 1528 | 54806 | d616 |
| -O3    | 3443        | 2be           | 51346 | 1932 | 1528 | 54806 | d616 |
| -Ofast | 3443        | 2be           | 51346 | 1932 | 1528 | 54806 | d616 |
:::warning
You should improve the assembly implementation and then compare instead of changing optimization levels which affect C code.
:notes: jserv
:::