# Assignment2: GNU Toolchain
contributed by < [`eeeXun`](https://github.com/eeeXun) >
## Install [xPack GUN RISC-V Embedded GCC](https://xpack.github.io/riscv-none-embed-gcc/) and [rv32emu](https://github.com/sysprog21/rv32emu)
### riscv-none-embed-gcc
```sh
wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v13.2.0-2/xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz
```
```sh
tar zxvf xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz
```
```sh
cp -r xpack-riscv-none-elf-gcc-13.2.0-2 $HOME/.local/share/riscv-none-elf-gcc
```
Because I like to put the executable file under `$HOME/.local/bin`, which I'v already changed `$PATH` in `.profile` and `$.zprofile`,
```sh
export PATH="$PATH:$HOME/.local/bin"
```
so I symbolic link all files under `$HOME/.local/share/riscv-none-elf-gcc/bin` to `$HOME/.local/bin`.
### rv32emu
```sh
git clone https://github.com/sysprog21/rv32emu
cd rv32emu
make
mv build/rv32emu $HOME/.local/bin
```
### rv_histogram
```sh
make tool
mv build/rv_histogram $HOME/.local/bin
```
## Question Selection
### Question
I choose the question **[Reducing memory usage with bfloat and bfloat multiplication](https://hackmd.io/@PWCheng/CAHW01)** by [Brian Cheng](https://github.com/BrianCheng-TheLegend).
### Motivation
I want to learn how to deal with floating point multiplication in assembly, which is different from my previous [homework 1](https://hackmd.io/jU6XFUUWT2SFW9aR-0iOfA?view) topic. And the documentation done by [Brian Cheng](https://github.com/BrianCheng-TheLegend) is quite simple and precise that I can understand quickly.
## Compile C code
Compile the following C code from [Brian Cheng](https://github.com/BrianCheng-TheLegend) with `O0`, `O1`, `O2`, `O3`, `Os` and `Ofast` flags.
```c!
#include <stdio.h>
float fp32_to_bf16(float x)
{
float y = x;
int* p = (int*)&y;
unsigned int exp = *p & 0x7F800000;
unsigned int man = *p & 0x007FFFFF;
if (exp == 0 && man == 0) /* zero */
return x;
if (exp == 0x7F800000) /* infinity or NaN */
return x;
/* Normalized number */
/* round to nearest */
float r = x;
int* pr = (int*)&r;
*pr &= 0xFF800000; /* r has the same exp as x */
r /= 0X100;
y = x + r;
*p &= 0xFFFF0000;
return y;
}
// encoder : encode two bfloat number in one memory
int encoder(int* a, int* b)
{
int c = 0;
c = *a | (*b >> 16);
return c;
}
// decoder : decode one memory number in two bfloat
void decoder(int c, void* n1, void* n2)
{
*(int*)n1 = c & 0xffff0000;
*(int*)n2 = (c & 0x0000ffff) << 16;
}
int main()
{
// definition of num1 and transfer it to bfloat
float num1 = -12.123;
int* np1 = (int*)&num1;
num1 = fp32_to_bf16(num1);
// definition of num2 and transfer it to bfloat
float num2 = 45.568;
int* np2 = (int*)&num2;
num2 = fp32_to_bf16(num2);
float add;
int* p = (int*)&add;
*p = 0;
// show num1 binary form and it's value
printf("0x%x\n", *np1);
printf("%f\n", num1);
// show num2 binary form and it's value
printf("0x%x\n", *np2);
printf("%f\n", num2);
// add two number together and print the binary form
*p = encoder(np1, np2);
decoder(*p, &num1, &num2);
float mul_num;
mul_num = num1 * num2;
printf("%f\n", mul_num);
return 0;
}
```
```sh
for i in 0 1 2 3 s fast; do riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O$i hw2.c -o hw2.o$i; done
```
### Compare size
Observe the output from `riscv-none-elf-readelf` and `riscv-none-elf-size`
| | O0 | O1 | O2 | O3 | Os | Ofast |
| :----------------------: | :---: | :---: | :---: | :---: | :---: | :---: |
| Start of section headers | 98920 | 94748 | 94772 | 94772 | 94780 | 94772 |
| text | 79848 | 77720 | 77724 | 77724 | 77732 | 77724 |
| data | 2320 | 2348 | 2348 | 2348 | 2348 | 2348 |
| bss | 1528 | 1528 | 1528 | 1528 | 1528 | 1528 |
| dec | 83696 | 81596 | 81600 | 81600 | 81608 | 81600 |
### Compare instructions
Look into the instructions histogram generated by `rv_histogram`
| | O0 | O1 | O2 | O3 | Os | Ofast |
| :----: | :------------: | :------------: | :------------: | :------------: | :------------: | :-----------: |
| R-Type | 1310 (6.88%) | 1281 (6.91%) | 1281 (6.91%) | 1281 (6.91%) | 1281 (6.91%) | 1281 (6.91%) |
| I-Type | 10313 (54.18%) | 10017 (54.02%) | 10018 (54.02%) | 10018 (54.02%) | 10018 (53.99%) | 10018 (54.02 |
| S-Type | 1961 (10.3%) | 1918 (10.34%) | 1918 (10.34%) | 1918 (10.34%) | 1918 (10.34%) | 1918 (10.34%) |
| U-Type | 439 (2.3%) | 424 (2.29%) | 424 (2.29%) | 424 (2.29%) | 424 (2.29%) | 424 (2.29%) |
| B-Type | 2105 (11.05%) | 2072 (11.17%) | 2072 (11.17%) | 2072 (11.17%) | 2073 (11.17%) | 2072 (11.17%) |
| J-Type | 1701 (8.94%) | 1657 (8.93%) | 1657 (8.93%) | 1657 (8.93%) | 1658 (8.94%) | 1657 (8.93%) |
The biggest difference in above table is compiling from `O0` to `O1`. `O1`, `O2`, `O3`, `Os` and `Ofast` are nearly identical.
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::
### RDCYCLE
Take the `getcycles.s` from [rv32emu repo](https://github.com/sysprog21/rv32emu/blob/master/tests/perfcounter/getcycles.S). Add it to the C program to count the cycle.
```diff!
@@ -1,5 +1,8 @@
+#include <stdint.h>
#include <stdio.h>
+extern uint64_t get_cycles();
+
float fp32_to_bf16(float x)
{
float y = x;
@@ -41,6 +44,7 @@ void decoder(int c, void* n1, void* n2)
int main()
{
+ uint64_t oldcount = get_cycles();
// definition of num1 and transfer it to bfloat
float num1 = -12.123;
int* np1 = (int*)&num1;
@@ -65,5 +69,7 @@ int main()
float mul_num;
mul_num = num1 * num2;
printf("%f\n", mul_num);
+ uint64_t cyclecount = get_cycles() - oldcount;
+ printf("cyle count: %u\n", (unsigned int)cyclecount);
return 0;
}
```
Assemble `getcycle.s` to object file.
```sh
riscv-none-elf-as -march=rv32i_zicsr_zifencei -mabi=ilp32 getcycles.s -o getcycles.o
```
Compile C code and link it with object file of `getcycle`.
```sh
for i in 0 1 2 3 s fast; do riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -O$i -c hw2.c -o hw2.o$i; riscv-none-elf-gcc getcycles.o hw2.o$i -o hw2.elf$i; done
```
Use `rv32emu` to run the elf file.
| | O0 | O1 | O2 | O3 | Os | Ofast |
| :---------: | :---: | :---: | :---: | :---: | :---: | :---: |
| cycle count | 23578 | 22846 | 22846 | 22846 | 22846 | 22846 |
## Assembly
### [improve 1](https://github.com/eeeXun/computer_architecture/commit/7868165ca885bed07fd638666357722fb40b49bc)
To deal with the mask, like `0x7F800000`, `0x800000` and `0x3F800000` etc. The code stored these mask values in data section and did a bunch of `lw` operation. Because these value load from the memory was later used to mask other value. It cause data hazard, the cpu need to waste a cycle to wait for the data.
I change the `lw` operation into `li`, which may convert to `addi` or `lui`. Then this reduce the pipeline stall.
```diff!
@@ -3,14 +3,6 @@
test0: .word 0x4141f9a7,0x423645a2
test1: .word 0x3fa66666,0x42c63333
test2: .word 0x43e43a5e,0x42b1999a
-# mask
-# mask0 for exponent ,fraction
-# ( 0 ,4 ,8 ,12 ,16 ,20 ,24 )
-mask0: .word 0x7F800000,0x007FFFFF,0x800000,0x8000,0x7f,0x3F800000,0x80000000
-# mask1 for round
-mask1: .word 0x8000
-# mask2 for decoder
-mask2: .word 0xFFFF0000,0x0000FFFF
#string
str: .string "\n"
@@ -37,14 +29,20 @@
add a0,x0,s5 # set a0 as s5
ecall # ecall
- jal ra,cl # change line
+ # change line
+ li a7,4 # set a7 as string mode
+ la a0,str # load str to a0
+ ecall # ecall
# Output first bfloat after decoder
li a7,2 # set a7 as float mode
add a0,x0,s6 # set a0 as s6
ecall # ecall
- jal ra,cl # change line
+ # change line
+ li a7,4 # set a7 as string mode
+ la a0,str # load str to a0
+ ecall # ecall
# Output Multiplication result
li a7,2 # set a7 as float mode
@@ -57,43 +55,42 @@
f32_b16_p1:
sw a6,0(sp)
add t0,a6,x0 # a6 will be only for this funtion to access
- la a3,mask0 # load mask0 address to a3
# exponent
- lw t6,0(a3) # load mask 0x7F800000 to t6
+ li t6,0x7F800000
and t1,t0,t6 # let exponent save to t1
# fraction
- lw t6,4(a3) # load 0x007FFFFF to t6
+ li t6,0x007FFFFF
and t2,t0,t6 # let fraction save to t2
# check this number if 0 or inf (exponent + fraction)
- lw t6,0(a3) # load mask 0x7F800000 to t6
+ li t6,0x7F800000
beq t1,t6,inf_or_zero # exp == 0x7F800000
or t3,t1,t2
beq t3,x0,inf_or_zero # exp == 0 && man == 0
# add integer to fraction
- lw t6,8(a3) # load integer
+ li t6,0x800000
or t2,t2,t6 # add integer
# round to nearest for fraction
- lw t6,12(a3) # load the round number
+ li t6,0x8000
add t2,t2,t6 # add round number
srli t5,t2,24 # shift left 24 to t5
beq t5,x0,no_overflow # if t5 equal to 0 move to no_overflow
# if overflow
- lw t6,8(a3) # load mask 0x007FFFFF
+ li t6,0x800000
add t1,t1,t6 # add 1 to exponent
srli t2,t2,17 # shift t2 to left 1 integer and 7 fraction
- lw t6,16(a3) # load mask 0x7f
+ li t6,0x7f
and t2,t2,t6 # let t2 only have integer
slli t2,t2,16 # shift right 16
j f32_b16_p2
# if not overflow
no_overflow:
srli t2,t2,16 # shift t2 to left 1 integer and 7 fraction
- lw t6,16(a3) # load mask 0x7f
+ li t6,0x7f
and t2,t2,t6 # let t2 only have integer
slli t2,t2,16 # shift right 16
#f32_b16 end function
@@ -124,23 +121,14 @@
### decode two bfloat on one register to two registers
decoder:
add t0,s9,x0 # load s9(data encode) to t0
- la a1,mask2 # load mask2 address
- lw s2,0(a1) # load mask 0xFFFF0000
+ li s2,0xFFFF0000
and t1,t0,s2 # use mask to specification bfloat 1
- lw s2,4(a1) # load mask 0x0000FFFF
+ li s2,0x0000FFFF
and t2,t0,s2 # use mask to specification bfloat 2
slli t2,t2,16 # shift to left to let bfloat peform like original float
add s6,t1,x0 # store t1(bfloat 1) to s6
add s5,t2,x0 # store t2(bfloat 2) to s5
ret # return to main
-
-### change line
-cl:
- li a7,4 # set a7 as string mode
- la a0,str # load str to a0
- ecall # ecall
- ret # return to main
-
### Multiplication with bfloat in one register
Multi_bfloat:
@@ -149,12 +137,12 @@
# decoder function output is s5,s6
add t0,s5,x0 # store s5(bfloat 2) to t0
add t1,s6,x0 # store s6(bfloat 1) to t1
- lw t6,0(a3) # load mask0 mask 0x7F800000
+ li t6,0x7F800000
# get exponent to t2,t3
and t3,t0,t6 # use mask 0x7F800000 to get t0 exponent
and t2,t1,t6 # use mask 0x7F800000 to get t1 exponent
add t3,t3,t2 # add two exponent to t3
- lw t6,20(a3) # load mask0 mask 0x3F800000
+ li t6,0x3F800000
sub t3,t3,t6 # sub 127 to exponent
# get sign
@@ -170,13 +158,13 @@
or t0,t3,t0
# get fraction to t2 and t3
- lw t6,16(a3) # load mask0 mask 0x7F
+ li t6,0x7f
slli t6,t6,16 # shift mask to 0x7F0000
and t2,t0,t6 # use mask 0x7F0000 get fraction
and t3,t1,t6 # use mask 0x7F0000 get fraction
slli t2,t2,9 # shift left let no leading 0
srli t2,t2,1 # shift right let leading has one 0
- lw t6,24(a3) # load mask0 mask 0x80000000
+ li t6,0x80000000
or t2,t2,t6 # use mask 0x80000000 to add integer
srli t2,t2,1 # shift right to add space for overflow
@@ -187,7 +175,7 @@
add s11,x0,x0 # set a counter and 0
addi s10,x0,8 # set a end condition
add t1,x0,x0 # reset t1 to 0 and let this register be result
- lw t6,24(a3) # load mask0 mask 0x80000000
+ li t6,0x80000000
loop:
addi s11,s11,1 # add 1 at counter every loop
@@ -202,7 +190,7 @@
# end of loop
# check if overflow
- lw t6,24(a3) # load mask0 mask 0x80000000 to t6
+ li t6,0x80000000
and t4,t1,t6 # get t1 max bit
# if t4 max bit equal to 0 will not overflow
@@ -210,7 +198,7 @@
# if overflow
slli t1,t1,1 # shift left 1 bits to remove integer
- lw t6,8(a3) # load mask0 mask 0x800000
+ li t6,0x800000
add t0,t0,t6 # exponent add 1 if overflow
j Mult_end # jump to Mult_end
```
original:

improve 1:

### [improve 2](https://github.com/eeeXun/computer_architecture/commit/f2fe0813f9bcd79b54e7d1a9369dfd9b428d1849)
The below are trivil improments. For example, mask a value with `0x0000FFFF` and then shift left 16 bits. This can be simply done by shift left 16 bits.
```diff!
@@ -123,9 +123,7 @@ decoder:
add t0,s9,x0 # load s9(data encode) to t0
li s2,0xFFFF0000
and t1,t0,s2 # use mask to specification bfloat 1
- li s2,0x0000FFFF
- and t2,t0,s2 # use mask to specification bfloat 2
- slli t2,t2,16 # shift to left to let bfloat peform like original float
+ slli t2, t0, 16
add s6,t1,x0 # store t1(bfloat 1) to s6
add s5,t2,x0 # store t2(bfloat 2) to s5
ret # return to main
@@ -158,12 +156,10 @@ Multi_bfloat:
or t0,t3,t0
# get fraction to t2 and t3
- li t6,0x7f
- slli t6,t6,16 # shift mask to 0x7F0000
+ li t6,0x7F0000 # shift mask to 0x7F0000
and t2,t0,t6 # use mask 0x7F0000 get fraction
and t3,t1,t6 # use mask 0x7F0000 get fraction
- slli t2,t2,9 # shift left let no leading 0
- srli t2,t2,1 # shift right let leading has one 0
+ slli, t2,t2,8
li t6,0x80000000
or t2,t2,t6 # use mask 0x80000000 to add integer
srli t2,t2,1 # shift right to add space for overflow
```
improve 2:

### [improve 3](https://github.com/eeeXun/computer_architecture/commit/fc4558bb5ce2516520cc5b927322368a6613ac03)
The code also implement bfloat16 multiplication. Here is the picture of algorithm from Brian Cheng.

$$
\begin{array}{}
&&1&1&0&1&1&1&0&0 \\
\times &&1&1&1&1&1&1&1&1 \\
\hline
+&&1&1&0&1&1&1&0&0 \\
+&&&1&1&0&1&1&1&0&0 \\
+&&&&1&1&0&1&1&1&0&0 \\
+&&&&&1&1&0&1&1&1&0&0 \\
+&&&&&&1&1&0&1&1&1&0&0 \\
+&&&&&&&1&1&0&1&1&1&0&0 \\
+&&&&&&&&1&1&0&1&1&1&0&0 \\
+&&&&&&&&&1&1&0&1&1&1&0&0 \\
\hline
&1&1&0&1&1&0&0&0&1
\end{array}
$$
He mentioned that the multiplication can be done by 8 times of addition. 1 integer and 7 mantissa.
The following is the implementation done by Brian Cheng.
```asm!
addi s10,x0,8 # set a end condition
add t1,x0,x0 # reset t1 to 0 and let this register be result
li t6,0x80000000
loop:
addi s11,s11,1 # add 1 at counter every loop
srli t6,t6,1 # shift right at 1 every loop
and t4,t2,t6 # use mask to specified number at that place
beq t4,x0,not_add # jump if t4 equal to 0
add t1,t1,t3 # add t3 to t1
not_add:
srli t3,t3,1 # shift left 1 bit to t3
bne s11,s10,loop # if the condition not satisfy return to loop
```
I want to unroll the loop to make it branchless. In order to do that, here is one condiction that I have to handle. If the selected bit of multiplier is 1, then we need to add multiplicand to result. The following is how I deal with it. Register `t4` is selected bit of multiplier. If it is greater than 1, than set `s0` to `0xFFFFFFFF`. Otherwise, set it to 0. Then `and` multiplicand with `s0`. Add the value to the result. This means we only add multiplicand when `t4` is greater than 1. Otherwise, we add 0.
```asm!
sltu s0,zero,t4 # s0 = 1 if t4 > 0, otherwise 0
sub s0,zero,s0 # s0 = 0xFFFFFFFF if s0 = 1, otherwise 0
and s1,s0,t3
add t1,t1,s1 # add t3 to s1
```
improve 3:
