# Assignment 2
contributed by [dcciou](https://github.com/dcciou)
:::danger
Improve the structure of this document!
:notes: jserv
:::
## Motivation
- I want to learn how to convert FP32 to BF16.
- His idea makes a good comparison with my previous work ([CLZ](https://hackmd.io/@yayaoh/By7B-QKlT)).
- His code is easy for me to understand.

That's why I chose the problem [Convert FP32 to BF16 and Count the Number of Ones in the Binary Representation](https://hackmd.io/@n-g2ouCxQbmy_er1MvIKhQ/rkCz0Wega) from [蔡忠翰](https://hackmd.io/@n-g2ouCxQbmy_er1MvIKhQ).
## C
I think his code is clear and efficient enough, so I don't want to spoil that beautiful work.
## Assembly
### Improvements
- I found that his program has to be edited to load a different dataset. Running the 3 datasets that way is not straightforward, so I added a loop.
- Although he already handles every special case that can occur, his data does not contain any of them. That seems like a waste, so I added some special cases to the data.
- In the original program, execution ends as soon as a special case is detected, without telling you which case caused it. I find that inconvenient, so I changed the program to keep running until all data has been converted. In addition, my program tells you why the number of ones could not be computed.
- The clock rate is higher while CPI and clocks do not change. Surprisingly, my program does much more work but takes almost the same time as the original program.
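The control flow described above can be sketched in C. This is only an illustration, not a transcription of the assembly: the names `bf16_status`, `fp32_bits_to_bf16`, `count_ones`, and `report_all` are hypothetical, and the rounding step (adding `0x8000` to the raw bits before masking) is my assumed integer stand-in for the `y = x + r` step.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical outcome codes, one per path in the modified program. */
enum bf16_status { BF16_OK, BF16_ZERO, BF16_INF_OR_NAN };

/* Classify the FP32 bit pattern x and, for ordinary values, store the
 * BF16 result (kept in the upper 16 bits) in *out.
 * Assumption: rounding is done by adding 0x8000 (half of the discarded
 * range) to the raw bits, then clearing the lower 16 bits. */
enum bf16_status fp32_bits_to_bf16(uint32_t x, uint32_t *out)
{
    uint32_t exp = x & 0x7F800000u;
    uint32_t man = x & 0x007FFFFFu;

    if (exp == 0 && man == 0)
        return BF16_ZERO;           /* +0 or -0 */
    if (exp == 0x7F800000u)
        return BF16_INF_OR_NAN;     /* all exponent bits set */

    *out = (x + 0x8000u) & 0xFFFF0000u;
    return BF16_OK;
}

/* Clear the lowest set bit until none remain, counting the steps. */
int count_ones(uint32_t v)
{
    int count = 0;
    while (v != 0) {
        v &= v - 1;
        count++;
    }
    return count;
}

/* Process every input instead of stopping at the first special case,
 * and say which special case was hit. */
void report_all(const uint32_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t bf16;
        switch (fp32_bits_to_bf16(data[i], &bf16)) {
        case BF16_ZERO:
            puts("data = 0");
            break;
        case BF16_INF_OR_NAN:
            puts("data = NaN or infinity");
            break;
        default:
            printf("How many '1's are there in BF16 (binary)? ANS : %d\n",
                   count_ones(bf16));
            break;
        }
    }
}
```

Calling `report_all` on an array holding `0`, `1067057152`, and `1075052544` would walk through the same three inputs as the modified program's `test` data.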
#### original
```c
.data
test0: .word 1067030938 #true t0=0x3f99999a
test1: .word 1067057152 #true t0=0x3f9a0000
test2: .word 1075052544 #true t0=0x40140000
str: .string "How many '1's are there in BF16 (binary)? ANS : "
.text
.globl main
main:
lw a0, test0 #x=16
mv a1, x0 #count=0
mv t0, a0 #t0=y
li t1, 0x7F800000
and t2, t1, a0 #t2=exp
li t3, 0x007FFFFF
and t4, t3, a0 #t4=man
beq a0, t1, end #infinity or NaN
beq t4, zero, end #zero
beq t4, zero, end #zero
mv t5, a0 #t5=r
li t6, 0xFF800000 #r has the same exp as x
and t5, t6, t5
mv s3, t4 #find y = x + r ; r/=0x100
srli s3, s3, 8
li s4, 0x8000
or s3, s3, s4 #obtain r_man when r_exp no change
add s5, s3, t4 #s5=y_man
add t0, s5, t2 #value y
li t1, 0xFFFF0000
and t0, t0, t1 #transfer to bf16
#t0=y
count_ones:
addi t1, t0, -1 # *p - 1
and t0, t0, t1 #*p &= (*p - 1)
addi a1, a1, 1 #count++
bne t0, x0, count_ones #if t0!=0 goto loop #goto ra
la a0, str
li a7, 4
ecall
mv a0, a1
end:
li a7, 1
ecall
```
#### modified
```c
.data
test: .word 0,1067057152,1075052544 #0 (special case), 0x3f9a0000, 0x40140000
str: .string "How many '1's are there in BF16 (binary)? ANS : "
str_z: .string "data = 0 \n"
str_ion: .string "data= Nan or infinity \n"
.text
main:
la a1, test
addi t0, zero,3 #set count =3
loop:
lw a2, 0(a1) #load data
li t1, 0x7F800000
and a3, t1, a2 #a3=exp
li t3, 0x007FFFFF
and a4, t3, a2 #a4=man
addi t0, t0, -1 #count-1
addi a1, a1, 4 #offset data address
or t5, a3, a4
beq t5, zero, end_z #check if zero
beq a3,t1, end_ion #check if infinity or NaN
#Start conversion, we don't need t1,t3,t5 anymore, while t2,t4 is unused
li t1, 0xFF800000 #r has the same exp as x(not used)
and t3, t1, a2
srli t3, a4, 8 #r_man/=0x100, since t3 won't used, just recover it by new data
li t2, 0x8000
or t3, t3, t2 #obtain r_man when r_exp no change
#find y = x + r ; r/=0x100
add a5, t3, a4 #a5=y_man
add t6, a5, a3 #t6=value y = exp + y_man (a3 holds exp; adding a2 would count the mantissa twice)
li t1, 0xFFFF0000
and t6, t6, t1 #transfer to bf16, t6=y
addi a6, zero, 0 #reset count
count_ones:
addi t1, t6, -1 # *p - 1
and t6, t6, t1 #*p &= (*p - 1)
addi a6, a6, 1 #count++,a6
bne t6, zero, count_ones #if t6!=0 goto loop
la a0, str #print How many '1's are there in BF16 (binary)? ANS :
li a7, 4
ecall
mv a0, a6 #print how many 1
li a7, 1
ecall
bnez t0, loop
beqz t0, end
end_z:
la a0, str_z
li a7, 4
ecall
bnez t0, loop
beqz t0, end
end_ion:
la a0, str_ion
li a7, 4
ecall
bnez t0, loop
beqz t0, end
end:
nop
```
#### original vs modified
The left side shows my code, and the right side shows the original code.
Although my cycle count is 3x larger, my clock rate is also 3x faster.
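
As a rough check of why the run time stays about the same, here is the usual relation between execution time, cycle count, and clock rate. The symbols $T$, $N$, and $f$ are my own notation, and the factor of 3 is the approximate ratio stated above, not an exact measurement:

$$
T = \frac{N}{f}, \qquad T_{\text{mine}} \approx \frac{3N}{3f} = \frac{N}{f} = T_{\text{original}}
$$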

## RV32emu & Riscv-none-elf-gcc
Since I had never used Linux before, the setup took me a lot of time.
### Riscv-none-elf-gcc part
I decided to convert my .s file to an .elf file first.
- ***Q1:*** After I set up the gcc path, I got stuck. I had no idea how to make gcc read my file. At first, I tried to run gcc on my code file but got "command not found".

- Solution: Type ==**ls**== to see what is in the current directory, then put my code file there.

- ***Q2:*** When I tried to convert .s to .elf, an error occurred.

- Solution: split the command into two steps:
```
riscv-none-elf-as -march=rv32i -mabi=ilp32 hw2.s -o hw2.o
riscv-none-elf-ld hw2.o -o hw2.elf
```
- ***Q3:*** After I solved Q2, a new problem appeared:

- Solution: add a few lines at the beginning of the assembly code:
```
.globl _start
_start:
j main
```
After solving the problems above, I finally got the status at the O0 level from my handwritten assembly code:

### rv32emu part
After getting the status, I checked whether my code could run. I also added a cycle counter.
- step1. Convert .c to .elf with O0 level optimization.
```
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 hw2c.c -o hw2c.elf
```
- step2. Run the program to get the number of cycles.
```
./rv32emu/build/rv32emu hw2c.elf
```
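
The cycle counter itself is not shown above. Below is a minimal sketch of one way to read it from C, assuming the target supports the `rdcycle`/`rdcycleh` pseudo-instructions (some toolchain versions may need `zicsr` added to `-march`; I have not verified this for this setup). `read_cycles` is a hypothetical name, and the `clock()` fallback exists only so the snippet also builds on a non-RISC-V host.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: returns a 64-bit tick count that never decreases.
 * On RV32 the counter is split into two 32-bit halves, so the high half
 * is read twice to detect a carry between the two reads. */
static inline uint64_t read_cycles(void)
{
#if defined(__riscv) && (__riscv_xlen == 32)
    uint32_t hi, lo, hi2;
    do {
        __asm__ volatile("rdcycleh %0" : "=r"(hi));
        __asm__ volatile("rdcycle %0" : "=r"(lo));
        __asm__ volatile("rdcycleh %0" : "=r"(hi2));
    } while (hi != hi2);    /* retry if the low half wrapped around */
    return ((uint64_t) hi << 32) | lo;
#else
    /* Host fallback (assumption): processor time, not real cycles. */
    return (uint64_t) clock();
#endif
}
```

Reading the counter once before and once after the code being measured, then subtracting the two values, gives the cycles spent in between.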

### Compare size & cycle
| | text | data | bss | dec | hex | cycle |
| -- | ----- | ---- | --- | --- | --- | --- |
| O0 | 79528 | 2320 | 1528 | 83376 | 145b0 | 38655 |
| O1 | 78540 | 2360 | 1528 | 82428 | 141fc | 37843 |
| O2 | 78552 | 2360 | 1528 | 82440 | 14208 | 37849 |
| O3 | 78560 | 2360 | 1528 | 82448 | 14210 | 37850 |
| Os | 78536 | 2360 | 1528 | 82424 | 141f8 | 37872 |
| Ofast | 78556 | 2360 | 1528 | 82444 | 1420c | 37850 |
### Something you have to know
- When you open a new terminal, you have to set up the gcc path again.
```
cd $HOME/riscv-none-elf-gcc
echo "export PATH=`pwd`/bin:$PATH" > setenv
cd $HOME
source riscv-none-elf-gcc/setenv
```
- Find size
```
riscv-none-elf-size hw2c.elf
```