# Assignment2: GNU Toolchain contributed by < [`eeeXun`](https://github.com/eeeXun) > ## Install [xPack GUN RISC-V Embedded GCC](https://xpack.github.io/riscv-none-embed-gcc/) and [rv32emu](https://github.com/sysprog21/rv32emu) ### riscv-none-embed-gcc ```sh wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v13.2.0-2/xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz ``` ```sh tar zxvf xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz ``` ```sh cp -r xpack-riscv-none-elf-gcc-13.2.0-2 $HOME/.local/share/riscv-none-elf-gcc ``` Because I like to put the executable file under `$HOME/.local/bin`, which I'v already changed `$PATH` in `.profile` and `$.zprofile`, ```sh export PATH="$PATH:$HOME/.local/bin" ``` so I symbolic link all files under `$HOME/.local/share/riscv-none-elf-gcc/bin` to `$HOME/.local/bin`. ### rv32emu ```sh git clone https://github.com/sysprog21/rv32emu cd rv32emu make mv build/rv32emu $HOME/.local/bin ``` ### rv_histogram ```sh make tool mv build/rv_histogram $HOME/.local/bin ``` ## Question Selection ### Question I choose the question **[Reducing memory usage with bfloat and bfloat multiplication](https://hackmd.io/@PWCheng/CAHW01)** by [Brian Cheng](https://github.com/BrianCheng-TheLegend). ### Motivation I want to learn how to deal with floating point multiplication in assembly, which is different from my previous [homework 1](https://hackmd.io/jU6XFUUWT2SFW9aR-0iOfA?view) topic. And the documentation done by [Brian Cheng](https://github.com/BrianCheng-TheLegend) is quite simple and precise that I can understand quickly. ## Compile C code Compile the following C code from [Brian Cheng](https://github.com/BrianCheng-TheLegend) with `O0`, `O1`, `O2`, `O3`, `Os` and `Ofast` flags. ```c! #include <stdio.h> float fp32_to_bf16(float x) { float y = x; int* p = (int*)&y; unsigned int exp = *p & 0x7F800000; unsigned int man = *p & 0x007FFFFF; if (exp == 0 && man == 0) /* zero */ return x; if (exp == 0x7F800000) /* infinity or NaN */ return x; /* Normalized number */ /* round to nearest */ float r = x; int* pr = (int*)&r; *pr &= 0xFF800000; /* r has the same exp as x */ r /= 0X100; y = x + r; *p &= 0xFFFF0000; return y; } // encoder : encode two bfloat number in one memory int encoder(int* a, int* b) { int c = 0; c = *a | (*b >> 16); return c; } // decoder : decode one memory number in two bfloat void decoder(int c, void* n1, void* n2) { *(int*)n1 = c & 0xffff0000; *(int*)n2 = (c & 0x0000ffff) << 16; } int main() { // definition of num1 and transfer it to bfloat float num1 = -12.123; int* np1 = (int*)&num1; num1 = fp32_to_bf16(num1); // definition of num2 and transfer it to bfloat float num2 = 45.568; int* np2 = (int*)&num2; num2 = fp32_to_bf16(num2); float add; int* p = (int*)&add; *p = 0; // show num1 binary form and it's value printf("0x%x\n", *np1); printf("%f\n", num1); // show num2 binary form and it's value printf("0x%x\n", *np2); printf("%f\n", num2); // add two number together and print the binary form *p = encoder(np1, np2); decoder(*p, &num1, &num2); float mul_num; mul_num = num1 * num2; printf("%f\n", mul_num); return 0; } ``` ```sh for i in 0 1 2 3 s fast; do riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O$i hw2.c -o hw2.o$i; done ``` ### Compare size Observe the output from `riscv-none-elf-readelf` and `riscv-none-elf-size` | | O0 | O1 | O2 | O3 | Os | Ofast | | :----------------------: | :---: | :---: | :---: | :---: | :---: | :---: | | Start of section headers | 98920 | 94748 | 94772 | 94772 | 94780 | 94772 | | text | 79848 | 77720 | 77724 | 77724 | 77732 | 77724 | | data | 2320 | 2348 | 2348 | 2348 | 2348 | 2348 | | bss | 1528 | 1528 | 1528 | 1528 | 1528 | 1528 | | dec | 83696 | 81596 | 81600 | 81600 | 81608 | 81600 | ### Compare instructions Look into the instructions histogram generated by `rv_histogram` | | O0 | O1 | O2 | O3 | Os | Ofast | | :----: | :------------: | :------------: | :------------: | :------------: | :------------: | :-----------: | | R-Type | 1310 (6.88%) | 1281 (6.91%) | 1281 (6.91%) | 1281 (6.91%) | 1281 (6.91%) | 1281 (6.91%) | | I-Type | 10313 (54.18%) | 10017 (54.02%) | 10018 (54.02%) | 10018 (54.02%) | 10018 (53.99%) | 10018 (54.02 | | S-Type | 1961 (10.3%) | 1918 (10.34%) | 1918 (10.34%) | 1918 (10.34%) | 1918 (10.34%) | 1918 (10.34%) | | U-Type | 439 (2.3%) | 424 (2.29%) | 424 (2.29%) | 424 (2.29%) | 424 (2.29%) | 424 (2.29%) | | B-Type | 2105 (11.05%) | 2072 (11.17%) | 2072 (11.17%) | 2072 (11.17%) | 2073 (11.17%) | 2072 (11.17%) | | J-Type | 1701 (8.94%) | 1657 (8.93%) | 1657 (8.93%) | 1657 (8.93%) | 1658 (8.94%) | 1657 (8.93%) | The biggest difference in above table is compiling from `O0` to `O1`. `O1`, `O2`, `O3`, `Os` and `Ofast` are nearly identical. :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv ::: ### RDCYCLE Take the `getcycles.s` from [rv32emu repo](https://github.com/sysprog21/rv32emu/blob/master/tests/perfcounter/getcycles.S). Add it to the C program to count the cycle. ```diff! @@ -1,5 +1,8 @@ +#include <stdint.h> #include <stdio.h> +extern uint64_t get_cycles(); + float fp32_to_bf16(float x) { float y = x; @@ -41,6 +44,7 @@ void decoder(int c, void* n1, void* n2) int main() { + uint64_t oldcount = get_cycles(); // definition of num1 and transfer it to bfloat float num1 = -12.123; int* np1 = (int*)&num1; @@ -65,5 +69,7 @@ int main() float mul_num; mul_num = num1 * num2; printf("%f\n", mul_num); + uint64_t cyclecount = get_cycles() - oldcount; + printf("cyle count: %u\n", (unsigned int)cyclecount); return 0; } ``` Assemble `getcycle.s` to object file. ```sh riscv-none-elf-as -march=rv32i_zicsr_zifencei -mabi=ilp32 getcycles.s -o getcycles.o ``` Compile C code and link it with object file of `getcycle`. ```sh for i in 0 1 2 3 s fast; do riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -O$i -c hw2.c -o hw2.o$i; riscv-none-elf-gcc getcycles.o hw2.o$i -o hw2.elf$i; done ``` Use `rv32emu` to run the elf file. | | O0 | O1 | O2 | O3 | Os | Ofast | | :---------: | :---: | :---: | :---: | :---: | :---: | :---: | | cycle count | 23578 | 22846 | 22846 | 22846 | 22846 | 22846 | ## Assembly ### [improve 1](https://github.com/eeeXun/computer_architecture/commit/7868165ca885bed07fd638666357722fb40b49bc) To deal with the mask, like `0x7F800000`, `0x800000` and `0x3F800000` etc. The code stored these mask values in data section and did a bunch of `lw` operation. Because these value load from the memory was later used to mask other value. It cause data hazard, the cpu need to waste a cycle to wait for the data. I change the `lw` operation into `li`, which may convert to `addi` or `lui`. Then this reduce the pipeline stall. ```diff! @@ -3,14 +3,6 @@ test0: .word 0x4141f9a7,0x423645a2 test1: .word 0x3fa66666,0x42c63333 test2: .word 0x43e43a5e,0x42b1999a -# mask -# mask0 for exponent ,fraction -# ( 0 ,4 ,8 ,12 ,16 ,20 ,24 ) -mask0: .word 0x7F800000,0x007FFFFF,0x800000,0x8000,0x7f,0x3F800000,0x80000000 -# mask1 for round -mask1: .word 0x8000 -# mask2 for decoder -mask2: .word 0xFFFF0000,0x0000FFFF #string str: .string "\n" @@ -37,14 +29,20 @@ add a0,x0,s5 # set a0 as s5 ecall # ecall - jal ra,cl # change line + # change line + li a7,4 # set a7 as string mode + la a0,str # load str to a0 + ecall # ecall # Output first bfloat after decoder li a7,2 # set a7 as float mode add a0,x0,s6 # set a0 as s6 ecall # ecall - jal ra,cl # change line + # change line + li a7,4 # set a7 as string mode + la a0,str # load str to a0 + ecall # ecall # Output Multiplication result li a7,2 # set a7 as float mode @@ -57,43 +55,42 @@ f32_b16_p1: sw a6,0(sp) add t0,a6,x0 # a6 will be only for this funtion to access - la a3,mask0 # load mask0 address to a3 # exponent - lw t6,0(a3) # load mask 0x7F800000 to t6 + li t6,0x7F800000 and t1,t0,t6 # let exponent save to t1 # fraction - lw t6,4(a3) # load 0x007FFFFF to t6 + li t6,0x007FFFFF and t2,t0,t6 # let fraction save to t2 # check this number if 0 or inf (exponent + fraction) - lw t6,0(a3) # load mask 0x7F800000 to t6 + li t6,0x7F800000 beq t1,t6,inf_or_zero # exp == 0x7F800000 or t3,t1,t2 beq t3,x0,inf_or_zero # exp == 0 && man == 0 # add integer to fraction - lw t6,8(a3) # load integer + li t6,0x800000 or t2,t2,t6 # add integer # round to nearest for fraction - lw t6,12(a3) # load the round number + li t6,0x8000 add t2,t2,t6 # add round number srli t5,t2,24 # shift left 24 to t5 beq t5,x0,no_overflow # if t5 equal to 0 move to no_overflow # if overflow - lw t6,8(a3) # load mask 0x007FFFFF + li t6,0x800000 add t1,t1,t6 # add 1 to exponent srli t2,t2,17 # shift t2 to left 1 integer and 7 fraction - lw t6,16(a3) # load mask 0x7f + li t6,0x7f and t2,t2,t6 # let t2 only have integer slli t2,t2,16 # shift right 16 j f32_b16_p2 # if not overflow no_overflow: srli t2,t2,16 # shift t2 to left 1 integer and 7 fraction - lw t6,16(a3) # load mask 0x7f + li t6,0x7f and t2,t2,t6 # let t2 only have integer slli t2,t2,16 # shift right 16 #f32_b16 end function @@ -124,23 +121,14 @@ ### decode two bfloat on one register to two registers decoder: add t0,s9,x0 # load s9(data encode) to t0 - la a1,mask2 # load mask2 address - lw s2,0(a1) # load mask 0xFFFF0000 + li s2,0xFFFF0000 and t1,t0,s2 # use mask to specification bfloat 1 - lw s2,4(a1) # load mask 0x0000FFFF + li s2,0x0000FFFF and t2,t0,s2 # use mask to specification bfloat 2 slli t2,t2,16 # shift to left to let bfloat peform like original float add s6,t1,x0 # store t1(bfloat 1) to s6 add s5,t2,x0 # store t2(bfloat 2) to s5 ret # return to main - -### change line -cl: - li a7,4 # set a7 as string mode - la a0,str # load str to a0 - ecall # ecall - ret # return to main - ### Multiplication with bfloat in one register Multi_bfloat: @@ -149,12 +137,12 @@ # decoder function output is s5,s6 add t0,s5,x0 # store s5(bfloat 2) to t0 add t1,s6,x0 # store s6(bfloat 1) to t1 - lw t6,0(a3) # load mask0 mask 0x7F800000 + li t6,0x7F800000 # get exponent to t2,t3 and t3,t0,t6 # use mask 0x7F800000 to get t0 exponent and t2,t1,t6 # use mask 0x7F800000 to get t1 exponent add t3,t3,t2 # add two exponent to t3 - lw t6,20(a3) # load mask0 mask 0x3F800000 + li t6,0x3F800000 sub t3,t3,t6 # sub 127 to exponent # get sign @@ -170,13 +158,13 @@ or t0,t3,t0 # get fraction to t2 and t3 - lw t6,16(a3) # load mask0 mask 0x7F + li t6,0x7f slli t6,t6,16 # shift mask to 0x7F0000 and t2,t0,t6 # use mask 0x7F0000 get fraction and t3,t1,t6 # use mask 0x7F0000 get fraction slli t2,t2,9 # shift left let no leading 0 srli t2,t2,1 # shift right let leading has one 0 - lw t6,24(a3) # load mask0 mask 0x80000000 + li t6,0x80000000 or t2,t2,t6 # use mask 0x80000000 to add integer srli t2,t2,1 # shift right to add space for overflow @@ -187,7 +175,7 @@ add s11,x0,x0 # set a counter and 0 addi s10,x0,8 # set a end condition add t1,x0,x0 # reset t1 to 0 and let this register be result - lw t6,24(a3) # load mask0 mask 0x80000000 + li t6,0x80000000 loop: addi s11,s11,1 # add 1 at counter every loop @@ -202,7 +190,7 @@ # end of loop # check if overflow - lw t6,24(a3) # load mask0 mask 0x80000000 to t6 + li t6,0x80000000 and t4,t1,t6 # get t1 max bit # if t4 max bit equal to 0 will not overflow @@ -210,7 +198,7 @@ # if overflow slli t1,t1,1 # shift left 1 bits to remove integer - lw t6,8(a3) # load mask0 mask 0x800000 + li t6,0x800000 add t0,t0,t6 # exponent add 1 if overflow j Mult_end # jump to Mult_end ``` original: ![](https://hackmd.io/_uploads/HJy8_pQGp.png) improve 1: ![](https://hackmd.io/_uploads/S1ZddTXf6.png) ### [improve 2](https://github.com/eeeXun/computer_architecture/commit/f2fe0813f9bcd79b54e7d1a9369dfd9b428d1849) The below are trivil improments. For example, mask a value with `0x0000FFFF` and then shift left 16 bits. This can be simply done by shift left 16 bits. ```diff! @@ -123,9 +123,7 @@ decoder: add t0,s9,x0 # load s9(data encode) to t0 li s2,0xFFFF0000 and t1,t0,s2 # use mask to specification bfloat 1 - li s2,0x0000FFFF - and t2,t0,s2 # use mask to specification bfloat 2 - slli t2,t2,16 # shift to left to let bfloat peform like original float + slli t2, t0, 16 add s6,t1,x0 # store t1(bfloat 1) to s6 add s5,t2,x0 # store t2(bfloat 2) to s5 ret # return to main @@ -158,12 +156,10 @@ Multi_bfloat: or t0,t3,t0 # get fraction to t2 and t3 - li t6,0x7f - slli t6,t6,16 # shift mask to 0x7F0000 + li t6,0x7F0000 # shift mask to 0x7F0000 and t2,t0,t6 # use mask 0x7F0000 get fraction and t3,t1,t6 # use mask 0x7F0000 get fraction - slli t2,t2,9 # shift left let no leading 0 - srli t2,t2,1 # shift right let leading has one 0 + slli, t2,t2,8 li t6,0x80000000 or t2,t2,t6 # use mask 0x80000000 to add integer srli t2,t2,1 # shift right to add space for overflow ``` improve 2: ![](https://hackmd.io/_uploads/r1wwqTXfT.png) ### [improve 3](https://github.com/eeeXun/computer_architecture/commit/fc4558bb5ce2516520cc5b927322368a6613ac03) The code also implement bfloat16 multiplication. Here is the picture of algorithm from Brian Cheng. ![](https://hackmd.io/_uploads/Sy52XJ4G6.png) $$ \begin{array}{} &&1&1&0&1&1&1&0&0 \\ \times &&1&1&1&1&1&1&1&1 \\ \hline +&&1&1&0&1&1&1&0&0 \\ +&&&1&1&0&1&1&1&0&0 \\ +&&&&1&1&0&1&1&1&0&0 \\ +&&&&&1&1&0&1&1&1&0&0 \\ +&&&&&&1&1&0&1&1&1&0&0 \\ +&&&&&&&1&1&0&1&1&1&0&0 \\ +&&&&&&&&1&1&0&1&1&1&0&0 \\ +&&&&&&&&&1&1&0&1&1&1&0&0 \\ \hline &1&1&0&1&1&0&0&0&1 \end{array} $$ He mentioned that the multiplication can be done by 8 times of addition. 1 integer and 7 mantissa. The following is the implementation done by Brian Cheng. ```asm! addi s10,x0,8 # set a end condition add t1,x0,x0 # reset t1 to 0 and let this register be result li t6,0x80000000 loop: addi s11,s11,1 # add 1 at counter every loop srli t6,t6,1 # shift right at 1 every loop and t4,t2,t6 # use mask to specified number at that place beq t4,x0,not_add # jump if t4 equal to 0 add t1,t1,t3 # add t3 to t1 not_add: srli t3,t3,1 # shift left 1 bit to t3 bne s11,s10,loop # if the condition not satisfy return to loop ``` I want to unroll the loop to make it branchless. In order to do that, here is one condiction that I have to handle. If the selected bit of multiplier is 1, then we need to add multiplicand to result. The following is how I deal with it. Register `t4` is selected bit of multiplier. If it is greater than 1, than set `s0` to `0xFFFFFFFF`. Otherwise, set it to 0. Then `and` multiplicand with `s0`. Add the value to the result. This means we only add multiplicand when `t4` is greater than 1. Otherwise, we add 0. ```asm! sltu s0,zero,t4 # s0 = 1 if t4 > 0, otherwise 0 sub s0,zero,s0 # s0 = 0xFFFFFFFF if s0 = 1, otherwise 0 and s1,s0,t3 add t1,t1,s1 # add t3 to s1 ``` improve 3: ![](https://hackmd.io/_uploads/rymgKk4Ma.png)