# Assignment2: RISC-V Toolchain
Contributed by [andy891023](https://github.com/andy891023)

## Rewrite Implementing FP32 Operations by Applying FP32 to Bfloat16 Conversion Algorithm
I rewrote the assembly code for "Applying FP32 to Bfloat16 Conversion Algorithm" from [brian049](https://github.com/brian049). I chose this project because the algorithm reduces the memory needed to store each value. Accuracy decreases, but in exchange the storage space shrinks.

## Makefile
Build with `make`:
```shell
$ make
```
You should see the following messages:
```shell
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Ofast -Wall -c -o getcycles.o getcycles.S
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Ofast -Wall -c -o getinstret.o getinstret.S
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Ofast -Wall -c -o hello.o hello.s
riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -Ofast -Wall -c -o main.o main.c
riscv-none-elf-gcc -o perfcount.elf getcycles.o getinstret.o hello.o main.o
```
After that, ==perfcount.elf== is produced, and you can run it with the following command.
```
$ ../../build/rv32emu perfcount.elf
```
Expected output:
```
Input FP32:3f99999a
Output bfloat16:3fd90000
cycle count: 3443
instret: 2be
inferior exit code 0
```

## Modified RISC-V code
The modified RISC-V code, which runs in [rv32emu](https://github.com/sysprog21/rv32emu), is shown below. It takes a 32-bit number as input and outputs a 16-bit number.
```c
.data
#.align 8
.global start
.set SYSEXIT, 93
.set SYSWRITE, 64
sign_mask:  .word 0x80000000
exp_mask:   .word 0x7F800000
man_mask:   .word 0x007FFFFF
man16_mask: .word 0x007F0000
r_mask:     .word 0xFF800000
divisor:    .word 0x100
mul_use:    .word 0x00800000
mul_use2:   .word 0x01000000
mul_use3:   .word 0x3F800000
str1: .ascii "zero\n"
      .set str1_size, .-str1
str2: .ascii "infinity or NaN\n"
      .set str2_size, .-str2

.text
start:
    add  x5, a0, x0          # x5 = a0 (input FP32 bit pattern)
    addi x30, x0, 1          # loop counter: one conversion

is_it_zero_or_infinity_or_NaN:
    # Load the exponent and mantissa masks into a0 and a1
    lw   a0, exp_mask
    lw   a1, man_mask
    and  x6, x5, a0          # exp
    and  x7, x5, a1          # man
    # Check for zero.
    # Note: this branches when exp == 0 OR man == 0, so values with a
    # zero mantissa (such as 1.0) also take the zero path.
    beqz x6, zero_case
    beqz x7, zero_case

Normalize:
    # Check for infinity or NaN
    li   a0, 0x7F800000
    beq  x6, a0, infinity_nan_case
    # Normalized number
    add  a0, x0, x5
    add  x6, a0, x0
    lw   a0, r_mask
    and  x6, x6, a0          # r = x & r_mask (sign and exponent)
    # r /= 0x100, done here as an integer shift of the bit pattern
    srli x6, x6, 8
    add  a0, x0, x5
    add  x5, a0, x6          # y = x + r (integer add of the bit patterns)
    # Mask the lower 16 bits of y
    li   t6, 0xFFFF0000
    and  x5, x5, t6
    sw   x5, 0(x8)
    addi x8, x8, 8
    j    done

zero_case:
    sw   x5, 0(x8)
    addi x8, x8, 8
    li   a7, SYSWRITE        # write(1, str1, str1_size)
    li   a0, 1
    la   a1, str1
    li   a2, str1_size
    ecall
    j    done

infinity_nan_case:
    sw   x5, 0(x8)
    addi x8, x8, 8
    li   a7, SYSWRITE        # write(1, str2, str2_size)
    li   a0, 1
    la   a1, str2
    li   a2, str2_size
    ecall
    j    done

done:
    addi x30, x30, -1
    bnez x30, is_it_zero_or_infinity_or_NaN
    add  t1, x0, x0          # initialize t1
    addi x8, x8, -8          # restore x8
    mv   a0, x5              # return the result in a0
    ret
```

## Compare performance
| Level  | cycle count | instret | text  | data | bss  | dec   | hex  |
| ------ | ----------- | ------- | ----- | ---- | ---- | ----- | ---- |
| -O0    | 3446        | 2be     | 51418 | 1932 | 1528 | 54878 | d65e |
| -O1    | 3443        | 2be     | 51346 | 1932 | 1528 | 54806 | d616 |
| -O2    | 3443        | 2be     | 51346 | 1932 | 1528 | 54806 | d616 |
| -O3    | 3443        | 2be     | 51346 | 1932 | 1528 | 54806 | d616 |
| -Ofast | 3443        | 2be     | 51346 | 1932 | 1528 | 54806 | d616 |

:::warning
You should improve the assembly implementation and then compare instead of changing optimization levels which affect C code.
:notes: jserv
:::
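
For reference, the steps named in the assembly comments (`r_mask`, `r /= 0x100`, `y = x + r`, masking the lower 16 bits) can be sketched in C. This is only a minimal sketch under stated assumptions, not the original author's code: the function name `fp32_to_bf16_bits` and the two helpers are hypothetical, and the division and addition are assumed to be floating-point operations. Because of that assumption, the result for a given input may differ from the integer shift-and-add used in the assembly listing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: reinterpret a bit pattern as a float and back.
 * memcpy avoids strict-aliasing problems with pointer casts. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Hypothetical name. Takes an FP32 bit pattern and returns the rounded
 * value with the lower 16 bits cleared (bfloat16 in the upper half). */
uint32_t fp32_to_bf16_bits(uint32_t x_bits)
{
    uint32_t exp = x_bits & 0x7F800000u;
    uint32_t man = x_bits & 0x007FFFFFu;

    /* Zero: both exponent and mantissa are zero. */
    if (exp == 0 && man == 0)
        return x_bits;
    /* Infinity or NaN: exponent is all ones; returned unchanged. */
    if (exp == 0x7F800000u)
        return x_bits;

    /* r keeps the sign and exponent of x, with the mantissa cleared. */
    float x = bits_to_float(x_bits);
    float r = bits_to_float(x_bits & 0xFF800000u);
    r /= 256.0f;                 /* r /= 0x100 (assumed floating-point) */
    float y = x + r;             /* y = x + r, nudges x up for rounding */

    /* Mask the lower 16 bits of y. */
    return float_to_bits(y) & 0xFFFF0000u;
}
```

With this sketch, calling `fp32_to_bf16_bits(0x3f99999a)` returns `0x3f9a0000`, and an input of `0x3f800000` (1.0) is returned unchanged rather than being treated as zero.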