# Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by <[`st10740`](https://github.com/st10740/Computer-Architecture-HW/tree/main/HW1)>

## Linear convolution using bfloat16

### Idea

#### bfloat16

The bfloat16 (BF16) format was developed by Google Brain for artificial intelligence workloads to accelerate computation and save storage. It uses only half the number of bits of FP32 to represent a floating-point number, and the smaller width speeds up arithmetic operations such as addition and multiplication. Below is a visual comparison of FP32 and BF16.

- FP32

| sign (1 bit) | exponent (8 bit) | mantissa (23 bit) |
| ---- | --------- | -------------------------------- |
| 31 | 30 ----------- 23 | 22 --------------------- 0 |

- BF16

| sign (1 bit) | exponent (8 bit) | mantissa (7 bit) |
| ---- | --------- | --------------------- |
| 15 | 14 ----------- 7 | 6 ----------- 0 |

After conversion from FP32 to BF16, the exponent bits are preserved, so BF16 can represent the same range of values as FP32. The mantissa, however, shrinks from 23 bits to 7 bits, which costs precision but saves storage.

#### Linear convolution

Linear convolution is a fundamental concept in Digital Signal Processing (DSP) with a wide range of applications. It plays a pivotal role in analyzing, processing, and transforming signals in various domains, such as audio, image, and communication systems. Given two functions f(n) and h(n), their linear convolution is defined as:

$$f(n) * h(n) = \sum_{k=-\infty}^{\infty} f(k) \cdot h(n-k)$$

We can notice that this formula involves plenty of floating-point additions and multiplications. To speed up the computation, I want to try computing it with bfloat16.
## Implementation

### C program

`fp32_to_bf16` is forked from [Quiz1 - Problem B](https://hackmd.io/@sysprog/arch2023-quiz1-sol).

#### Test cases

| signal | impulse | Case |
| -------- | -------- | -------- |
| [1.2, 1.3, 1.4] | [2.1, 3.2, 2.1] | normal |
| [1.1, 1.2, 1.3] | [0.0, 0.0, 0.0] | contains zero |
| [1.4, 1.5, 1.6] | [1.7, 1.8, 1.9] | mantissa may carry after multiplication |

#### Code

```c!=
#include <stdio.h>
#include <stdlib.h>

float fp32_to_bf16(float x)
{
    float y = x;
    int *p = (int *) &y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    if (exp == 0 && man == 0) /* zero */
        return x;
    if (exp == 0x7F800000) /* infinity or NaN */
        return x;

    /* Normalized number: round to nearest */
    float r = x;
    int *pr = (int *) &r;
    *pr &= 0xFF800000; /* r has the same exponent as x */
    r /= 0x100;
    y = x + r;
    *p &= 0xFFFF0000;
    return y;
}

void fp32_to_bf16_in_arr(float *arr, int len)
{
    while (len) {
        *arr = fp32_to_bf16(*arr);
        arr += 1;
        len--;
    }
}

void printResult(float *signal, int len)
{
    for (int i = 0; i < len; i++)
        printf("%f ", *(signal + i));
    printf("\n");
}

float *convolution_with_bf16(float *signal, float *impulse, int sLen, int impulseLen)
{
    int l = sLen + impulseLen - 1;
    float *output = (float *) malloc(sizeof(float) * l);

    /* convert signal and impulse from FP32 format to BF16 format */
    fp32_to_bf16_in_arr(signal, sLen);
    fp32_to_bf16_in_arr(impulse, impulseLen);

    /* do linear convolution; (n - k) must stay strictly below impulseLen
     * to avoid reading past the end of the impulse array */
    for (int n = 0; n < l; n++) {
        output[n] = 0.0;
        for (int k = 0; k < sLen; k++) {
            if ((n - k) >= 0 && (n - k) < impulseLen)
                output[n] += signal[k] * impulse[n - k];
        }
    }
    return output;
}

int main()
{
    float signal1[3] = {1.2, 1.3, 1.4};
    float impulse1[3] = {2.1, 3.2, 2.1};
    float *result1 = convolution_with_bf16(signal1, impulse1, 3, 3);
    printResult(result1, 3 + 3 - 1);
    free(result1);
    return 0;
}
```

### Assembly code

Below is the assembly implementation of `fp32_to_bf16`, together with test cases to ensure the program works as expected.
#### Test Cases

| HEX | binary | Case |
| -------- | -------- | -------- |
| 0x0 | 00000000000000000000000000000000 | exponent=0x0 & mantissa=0x0 |
| 0x7F800000 | 01111111100000000000000000000000 | exponent=0xFF |
| 0x700AB000 | 01110000000010101011000000000000 | normalized positive number |
| 0xF00AB000 | 11110000000010101011000000000000 | normalized negative number |
| 0x707F8FFF | 01110000011111111000111111111111 | normalized number that overflows after rounding to nearest |

#### Code

```asm!=
.data
test1: .word 0x0          # exponent=0x0 & mantissa=0x0
test2: .word 0x7F800000   # exponent=0xFF (infinity or NaN)
test3: .word 0x700AB000   # normalized positive number
test4: .word 0xF00AB000   # normalized negative number
test5: .word 0x707F8FFF   # normalized number that overflows after rounding to nearest
exp_mask: .word 0x7F800000
man_mask: .word 0x7FFFFF
round: .word 0x8000
bfloat16_man_plus_one_mask: .word 0x7F8000
bfloat16_mask: .word 0xFFFF0000
msg_before_str: .string "before convert to bfloat16: "
msg_after_str: .string "after convert to bfloat16: "
end_str: .string "\n"

.text
main:
    lw a0, test1
    jal do_fp32_to_bf16_and_print
    lw a0, test2
    jal do_fp32_to_bf16_and_print
    lw a0, test3
    jal do_fp32_to_bf16_and_print
    lw a0, test4
    jal do_fp32_to_bf16_and_print
    lw a0, test5
    jal do_fp32_to_bf16_and_print
    li a7, 10                        # exit
    ecall

do_fp32_to_bf16_and_print:
    addi sp, sp, -8
    sw ra, 0(sp)
    sw a0, 4(sp)
    la a0, msg_before_str
    li a7, 4
    ecall                            # print "before convert to bfloat16: "
    lw a0, 4(sp)
    li a7, 2
    ecall                            # print FP32 format
    la a0, end_str
    li a7, 4
    ecall                            # print "\n"
    la a0, msg_after_str
    li a7, 4
    ecall                            # print "after convert to bfloat16: "
    lw a0, 4(sp)
    jal fp32_to_bf16
    li a7, 2
    ecall                            # print BF16 format
    la a0, end_str
    li a7, 4
    ecall                            # print "\n"
    lw ra, 0(sp)
    lw a0, 4(sp)
    addi sp, sp, 8
    jr ra

fp32_to_bf16:
    addi sp, sp, -4
    sw ra, 0(sp)
    add t0, a0, zero                 # t0: copy of a0
    lw t1, exp_mask
    and t2, t0, t1                   # t2: result of applying exp mask to t0
    beq t1, t2, return_fp32_to_bf16  # t0 is infinity or NaN, just return it
    beq t2, zero, exp_is_zero
normalized_number:
    lw t1, bfloat16_man_plus_one_mask
    lw t2, round
    and t3, t0, t1                   # t3: applying bfloat16_man_plus_one_mask to t0
    beq t1, t3, is_overflow          # rounding to nearest would overflow
    add t0, t0, t2                   # do round to nearest
    lw t1, bfloat16_mask
    and a0, t0, t1                   # a0: applying bfloat16_mask to t0
return_fp32_to_bf16:
    lw ra, 0(sp)
    addi sp, sp, 4
    jr ra
exp_is_zero:
    lw t1, man_mask
    and t1, t1, t0                   # t1: applying man_mask to t0
    bne t1, zero, normalized_number  # nonzero mantissa: handle as a normal number
    j return_fp32_to_bf16            # else t0 is zero, just return it
is_overflow:
    lw t1, bfloat16_mask
    and a0, t0, t1                   # round down instead, to avoid overflow
    j return_fp32_to_bf16
```

The trickiest part is converting the following lines into assembly:

```c!
*pr &= 0xFF800000; /* r has the same exp as x */
r /= 0x100;
y = x + r;
```

These lines need floating-point division and addition, but only RV32I instructions are allowed in this homework. To overcome this, I took a closer look at the purpose of the three lines and found that the only thing they do is add 1 at the eighth binary place after the binary point, for rounding to nearest.

Suppose $x = (1.xxx)_2 \times 2^N$; since `r` is initialized from `x`, $r = (1.xxx)_2 \times 2^N$ as well.

- After `*pr &= 0xFF800000`, $r = (1.0)_2 \times 2^N$
- After `r /= 0x100`, $r = (1.0)_2 \times 2^{N-8}$

So

$$y = x + r = (1.xxx)_2 \times 2^N + (1.0)_2 \times 2^{N-8} = (1.xxx)_2 \times 2^N + (0.00000001)_2 \times 2^N = (1.xxx + 0.00000001)_2 \times 2^N$$

Therefore I replace the division and addition with the strategy below: adding the constant `0x8000`, which sets the same bit (bit 15 of the FP32 encoding) directly in the integer representation. (`t0` stores the floating-point number.)

```asm!
.data
round: .word 0x8000

.text
lw t2, round
add t0, t0, t2
```

---

Actually, I haven't finished converting the other parts of the C code to assembly yet because I'm stuck on implementing bfloat16 addition and multiplication. I will try to finish these parts as soon as possible.

## Analysis

I use the [Ripes](https://github.com/mortbopet/Ripes) simulator for the analysis.

### Five stage RISC pipeline

1. **Instruction fetch (IF)**: the instruction is fetched from memory.
2. **Instruction decode and register fetch (ID)**: the instruction is decoded.
3. **Execute (EX)**: the operation specified by the instruction is performed.
4. **Memory access (MEM)**: data is loaded from memory or stored into memory.
5. **Register write back (WB)**: the result is stored in the destination register.

![](https://hackmd.io/_uploads/H1toR6M-6.png)

I take `add t0, t0, t2` at line 82 as an example to explain each stage.

#### Instruction fetch (IF)

- The instruction `0x007282b3` is fetched from instruction memory into the IF/ID pipeline register.
- PC is updated from `0x000000f4` to `0x000000f8` (PC+4) to fetch the next instruction.

![](https://hackmd.io/_uploads/rJfblRG-6.png)

#### Instruction decode and register fetch (ID)

- The instruction is decoded for the following stages.
- The decoded opcode indicates that it is an `ADD` instruction.
- The upper `0x05` indicates that source register 1 is register number `0x05`, i.e. `x5` or `t0`. The `0x07` indicates that source register 2 is register number `0x07`, i.e. `x7` or `t2`. The lower `0x05` indicates that the destination register is register number `0x05`, i.e. `x5` or `t0`.
- We can notice that source register 1 and the destination register are the same register.

![](https://hackmd.io/_uploads/H1ivOCfZp.png)

#### Execute (EX)

- `ADD` is executed by the ALU in this stage.
- The value `0x700ab000` in source register `x5` and the value `0x00001f40` in source register `x7` are added by the ALU, whose output is the result `0x700acf40`.

![](https://hackmd.io/_uploads/ByVLJ1QW6.png)

#### Memory access (MEM)

- Since this instruction is R-type, it does nothing in this stage.
- The data memory has two inputs, the address of the data to be loaded or stored and the value to be written, and one output, the value loaded from the given address.
- Although R-type instructions do nothing in this stage, the illustration shows that the ALU result and the value of source register 2 are still wired into the data memory inputs.

![](https://hackmd.io/_uploads/B1M35JX-6.png)

#### Register write back (WB)

- The result `0x700acf40` calculated by the ALU is written back to the destination register `x5`.

![](https://hackmd.io/_uploads/HJoJZxm-6.png)

### Memory update

As mentioned above, the source register and destination register of `add x5, x5, x7` are the same register. Before `add x5, x5, x7` completes, the value stored in `x5` is `0x700ab000`. After it completes, the value in `x5` is updated to `0x700acf40`.

#### `add x5, x5, x7` in the WB stage

![](https://hackmd.io/_uploads/rJXffeXZp.png)

#### `add x5, x5, x7` after the WB stage

![](https://hackmd.io/_uploads/SkSPflXZT.png)

## Assembly optimization

### Original execution info

![](https://hackmd.io/_uploads/HJaH69GWT.png)

### Duplicate code

- **Observation**: `main` contains a lot of near-identical code; only the test data differ.
- **Implementation**: put all test data into one array and iterate over it with a loop.

**Before**

```asm!=
.data
test1: .word 0x0
test2: .word 0x7F800000
test3: .word 0x700AB000
test4: .word 0xF00AB000
test5: .word 0x707F8FFF

.text
main:
    lw a0, test1
    jal do_fp32_to_bf16_and_print
    lw a0, test2
    jal do_fp32_to_bf16_and_print
    lw a0, test3
    jal do_fp32_to_bf16_and_print
    lw a0, test4
    jal do_fp32_to_bf16_and_print
    lw a0, test5
    jal do_fp32_to_bf16_and_print
    li a7, 10                  # exit
    ecall
```

**After**

```asm!=
.data
test: .word 0x00000000, 0x7F800000, 0x700AB000, 0xF00AB000, 0x707F8FFF

.text
main:
    la s0, test
    li s1, 0                   # i
    li s2, 5                   # n
test_loop:
    beq s1, s2, exit           # if i==n, break
    lw a0, 0(s0)               # a0: test[i]
    jal do_fp32_to_bf16_and_print
    addi s0, s0, 4             # s0: address of test[i+1]
    addi s1, s1, 1             # i++
    j test_loop
exit:
    li a7, 10                  # exit
    ecall
```

We can see the execution info below.
Although the number of cycles increases after this change, the code becomes more compact and easier to extend.

![](https://hackmd.io/_uploads/SJ75gam-p.png)

## Reference

- https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
- https://en.wikipedia.org/wiki/Convolution
- https://www.geeksforgeeks.org/linear-convolution-using-c-and-matlab/