# Generate 3×3 Filter Box in BF16 Format and Application > Contributed by < [DarrenHuang0411](https://github.com/DarrenHuang0411/NCKU_CA_2023/tree/Assignment-1) > ## 1. Introduction/Motivation Utilizing the bfloat16 (BF16) format for Gaussian filter calculations is driven by a strategic balance between computational efficiency and numerical precision. Gaussian filters are fundamental tools in a wide range of applications, including image processing, computer vision, and various fields where tasks such as blurring, smoothing, and noise reduction are essential. These operations frequently entail extensive convolution or filtering procedures that demand substantial computational resources. ## 2. Process ### **a. Bulid a 3x3 filter by c code** The formula calculates the value of the Gaussian filter at a given (x, y) coordinate by applying the Gaussian function to that point. As you move away from the center (x=0, y=0), the value of the Gaussian filter decreases, creating a smooth, blurring effect when this filter is applied to an image. The amount of smoothing is controlled by the standard deviation (σ), with a larger σ resulting in more significant smoothing. ![Gaussian Filter](https://hackmd.io/_uploads/B1WSm-MWa.jpg) - **σ (sigma):** This is the standard deviation of the Gaussian distribution. It controls the spread or width of the Gaussian curve. A larger σ results in a wider blur. * The completed code is at-[Generate_fp32_filter.c](https://github.com/DarrenHuang0411/NCKU_CA_2023/blob/Assignment-1/Generate_fp32_filter.c) ```javascript=31 float img2[image_size][image_size] = {}; //float img2[image_size+2][image_size+2] = {}; float img3[image_size][image_size] = {}; float sum = 0; int i, j; for (i = 0; i < smooth_kernel_size; i++) { for (j = 0; j < smooth_kernel_size; j++) { float x = i - (smooth_kernel_size - 1) / 2.0; float y = j - (smooth_kernel_size - 1) / 2.0; gauss[i][j] = 0.16*(1/sigma) * exp(((pow(x, 2) + pow(y, 2)) / ((2 * pow(sigma, 2)))) * (-1)); sum += gauss[i][j]; } } printf("The Filter Generated by Testcase\n "); for (i = 0; i < smooth_kernel_size; i++) { for (j = 0; j < smooth_kernel_size; j++) { gauss[i][j] /= sum; printf("%f ", gauss[i][j]); } printf("\n"); } ``` The result | Case | Sigma | output | | ---- | ----- |:------------------------------------:| | | | 0.044904 0.122097 0.044904 (test1) | | test | 0.707 | 0.122097 0.331996 0.122097 (test2) | | | | 0.044904 0.122097 0.044904 (test3) | | | | | | 0.092717 0.119061 0.092717 | | test | 1.414 | 0.119061 0.152888 0.119061 | | | | 0.092717 0.119061 0.092717 | | | | | | 0.102890 0.114985 0.102890 | | test | 2.121 | 0.114985 0.128502 0.114985 | | | | 0.102890 0.114985 0.102890 | ### **b.Change the 32 bit floating point in filter to Bfloat16 format** **Improved Computational Speed:** BF16 arithmetic operations typically execute faster than their FP32 counterparts. This can lead to quicker filter calculations, making it ideal for real-time or high-throughput processing tasks. * The completed code is at-[Generate_fp32_filter.c](https://github.com/DarrenHuang0411/NCKU_CA_2023/blob/Assignment-1/Generate_fp32_filter.c) * Please refer to the corresponding assembly code below.- [Assembly Code](#a-Change-the-32-bit-floating-point-in-filter-to-Bfloat16-format) ```javascript= float fp32_to_bf16(float x){ float y = x; int *p = (int *) &y; printf("%d", *p); unsigned int exp = *p & 0x7F800000; unsigned int man = *p & 0x007FFFFF; //printf("%x ", exp); //printf("%x ", man); if (exp == 0 && man == 0) /* zero */ return x; if (exp == 0x7F800000) /* infinity or NaN */ return x; /* Normalized number */ /* round to nearest */ float r = x; int *pr = (int *) &r; *pr &= 0xFF800000; /* r has the same exp as x */ r /= 0x100; y = x + r; *p &= 0xFFFF0000; return *p; } ``` The result we used to as testbanch | Sigma | output | Float Point 754 | BF16 | |:-----:|:----------------------------:|:-----------------------------------:|:----------------------------------------:| | | 0.044904 0.122097 0.044904 | 0x3d37ed42 0x3dfa0dfe 0x3d37ed42 | 3d380000 3dfa0000 3d380000 | | 0.707 | 0.122097 0.331996 0.122097 | 0x3dfa0dfe 0x3ea9fb61 0x3dfa0dfe | 3dfa0e3f 3ea9fb55 3dfa0e3f | | | 0.044904 0.122097 0.044904 | 0x3d37ed42 0x3dfa0dfe 0x3d37ed42 | 3d37ecd2 3dfa0e3f 3d37ecd2 | | | | | 0.092717 0.119061 0.092717 | 0x3dbde269, 0x3df3d641, 0x3dbde269 | 3dbe0000 3df40000 3dbe0000 | | 1.414 | 0.119061 0.152888 0.119061 | 0x3df3d641, 0x3e1c8eac, 0x3df3d641 | 3df40000 3e1d0000 3df40000 | | | 0.092717 0.119061 0.092717 | 0x3dbde269, 0x3df3d641, 0x3dbde269 | 3dbe0000 3df40000 3dbe0000 | | | | | | 0.102890 0.114985 0.102890 | 0x3dd2b7fe, 0x3deb7d41, 0x3dd2b7fe | 3dd30000 3deb0000 3dd30000 | | 2.121 | 0.114985 0.128502 0.114985 | 0x3deb7d41, 0x3e039607, 0x3deb7d41 | 3deb0000 3e040000 3deb0000 | | | 0.102890 0.114985 0.102890 | 0x3dd2b7fe, 0x3deb7d41 , 0x3dd2b7fe | 3dd30000 3deb0000 3dd30000 | ### **c. Pixel Image Change to Bfloat16 format** In Gaussian filtering operations, the original range of input image values is typically set between 0 and 255. Therefore, here, it is assumed that the test image values are floating-point numbers ranging from 0 to 255 | Input Num. | Float Point 754 | |:-----------:|:----------------------------------------:| | 30, 65, 138 | 0x41f00000 0x42820000 0x430a0000 (test1) | | 53, 230, 240 | 0x42540000 0x43660000 0x43700000(test2) | | 59, 87, 14 | 0x426c0000 0x42ae0000 0x41600000(test3) | ### **d. Pixel Image and Filter Kernel are executed Multiplication (Convolution Concept)** Convolution in image processing is an essential concept used to process digital images and extract features from them. It is a mathematical operation typically employed to filter and manipulate images, aiming to enhance image quality, identify objects, or perform other tasks related to image processing. The concepts of Convolution Operation is a process in which a convolution kernel is applied to each pixel position in an image. It involves aligning the center of the convolution kernel with each pixel in the image and computing the element-wise multiplication of overlapping regions. This process entails sliding the convolution kernel across the entire image, resulting in a newly processed image. * The completed code is at-[HW1_v1.c](https://github.com/DarrenHuang0411/NCKU_CA_2023/blob/Assignment-1/HW1_v1.c) ![](https://hackmd.io/_uploads/Byw_oI4-6.png) :::spoiler #### **d. 1 Fmul32** `Fmul32 Function` * Please refer to the corresponding assembly code below.- [Assembly Code](#b1-Fmul32) ```javascript=80 float fmul32(float a, float b) { //*(int32_t*) &*(int32_t*) & /* TODO: Special values like NaN and INF */ int32_t ia = a, ib = b; printf("ia=%x ib=%x", ia,ib); /* sign */ int sa = ia >> 31; int sb = ib >> 31; /* mantissa */ int32_t ma = (ia & 0x7FFFFF) | 0x800000; int32_t mb = (ib & 0x7FFFFF) | 0x800000; /* exponent */ int32_t ea = ((ia >> 23) & 0xFF); int32_t eb = ((ib >> 23) & 0xFF); /* 'r' = result */ //int64_t mrtmp = imul32(ma, mb) >> 23; int32_t mrtmp = imul32(ma, mb); int mshift = getbit(mrtmp, 24); int64_t mr = mrtmp >> mshift; int32_t ertmp = ea + eb - 127; int32_t er = mshift ? inc(ertmp) : ertmp; /* TODO: Overflow ^ */ int sr = sa ^ sb; int32_t r = (sr << 31) | ((er & 0xFF) << 23) | (mr & 0x7FFFFF); printf("result = %x ", r); return *(float *) &r; } ``` ::: :::spoiler #### **d. 2 imul32** `imul32 Function` * Please refer to the corresponding assembly code below.- [Assembly Code]() ```javascript=66 int32_t imul32(int32_t a, int32_t b) { int32_t r = 0; while(b != 0){ if (b & 1){ r += a; } b = b >> 1; r = r >> 1; } r = r << 1; return r; } ``` ::: :::spoiler #### **d. 3 getbit/inc/mask_zero** `getbit/inc/mask_zero Function` * Please refer to the corresponding assembly code below.- [Assembly Code]() ```javascript=61 uint64_t mask_lowest_zero(uint64_t x) { uint64_t mask = x; mask &= (mask << 1) | 0x1; mask &= (mask << 2) | 0x3; mask &= (mask << 4) | 0xF; mask &= (mask << 8) | 0xFF; mask &= (mask << 16) | 0xFFFF; mask &= (mask << 32) | 0xFFFFFFFF; return mask; } int64_t inc(int64_t x) { if (~x == 0) return 0; /* TODO: Carry flag */ int64_t mask = mask_lowest_zero(x); int64_t z1 = mask ^ ((mask << 1) | 1); return (x & ~mask) | z1; } static inline int64_t getbit(int64_t value, int n) { return (value >> n) & 1; } ``` ::: The result will be generated following: | Filter Kernal | Image Pixel | Result | |:----------------------------:|:------------------:|:-------------------------:| | 0.044904 0.122097 0.044904 | 30, 65, 138(test1) | 1.34033 7.93457 6.16553 | | 0.122097 0.331996 0.122097 | 53,230, 240(test2) | 6.46973 75.918 29.2969 | | 0.044904 0.122097 0.044904 | 59, 87, 14(test3) | 2.63399 10.6201 0.625488 | | Filter Kernal | Image Pixel | Result | | 0.092717 0.119061 0.092717 | 30, 65, 138 | 2.76855 7.7124 12.7354 | | 0.119061 0.152888 0.119061 | 53,230, 240 | 6.28857 35.0391 28.4766 | | 0.092717 0.119061 0.092717 | 59, 87, 14 | 5.44482 10.3228 1.29199 | | Filter Kernal | Image Pixel | Result | | 0.102890 0.114985 0.102890 | 30, 65, 138 | 3.07617, 7.4585, 14.1504 | | 0.114985 0.128502 0.114985 | 53,230, 240 | 6.08154, 29.4238, 27.5391 | | 0.102890 0.114985 0.102890 | 59, 87, 14 | 6.0498, 9.98291, 1.43555 | The example result of `sigma=1.414` generated by C code is shown below: ![](https://hackmd.io/_uploads/SJwCefB-6.png) ## 3. Assembly Code ### a. Change the 32 bit floating point in filter to Bfloat16 format * The completed code is at-[Generate_fp32_filter.s](https://github.com/DarrenHuang0411/NCKU_CA_2023/blob/Assignment-1/HW1_v1_32to16.s) :::spoiler **fp32_to_bf16** ```javascript=89 fp32_to_bf16: #alloc addi sp, sp, -24 sw ra, 0(sp) sw s2, 4(sp) sw s3, 8(sp) sw s4, 12(sp) sw s5, 16(sp) sw s6, 20(sp) lw t6, 0(a0) li t1, 0x7F800000 and s2, t6, t1 #exp li t2, 0x007FFFFF and s3, t6, t2 #man beq a0, zero, return_x #zero beq s2, t1,return_x #inf or nan ###### Normalize mv s4, t6 #r=s4 li t3, 0xFF800000 and t4, t3, t4 #t4 = *pr srli s4, s4, 8 li t4, 0x8000 add s5, s4, t6 or s6, s5, s2 #s6=y(32) li t5, 0xFFFF0000 and t5, t6, t5 #a0=y(16) lw ra, 0(sp) lw s2, 4(sp) lw s3, 8(sp) lw s4, 12(sp) lw s5, 16(sp) lw s6, 20(sp) addi sp, sp, 24 ret ``` ::: The result of BF16 format filter generated by assembly code is shown below: ![Only for FP32 to BF16](https://hackmd.io/_uploads/Hkt2i2m-p.png) * The CPU and Cache Analysis are shown at - [Only FP32 to BF16](#a-CPU-and-Cache-Analysis--Only-for-FP32-to-BF16) ### b. Pixel Image and Filter Kernel are executed Multiplication (Convolution Concept) * The completed code is at -[HW1_v1_all.s](https://github.com/DarrenHuang0411/NCKU_CA_2023/blob/Assignment-1/HW1_v1_32to16.s) :::spoiler #### **b. 1 Loop Address** ![](https://hackmd.io/_uploads/H1lrEPrZp.png) ```javascript=33 GF: #alloc addi sp, sp, -20 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) addi a1, a1, -8 #addi a3, a3, -4 #address initial li s0, 0 #i li s1, 0 #j li s2, 3 li s3, 0 #result_12*i+j Loop_o: bge s0, s2, Loop_o_d # if s0 >= s2 then target li s1, 0 # j=0 Loop_i: bge s1, s2, Loop_i_d #addr.offset cal slli t0, s0, 3 slli t1, s0, 2 add t2, t0, t1 #t2 = 12*i slli t3, s1, 2 add s3, t2 ,t3 #s10 = addr.= 12*i + 4*j ``` ::: :::spoiler #### **b. 2 fmul32** ```javascript=156 fmul32: addi sp, sp, -48 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) sw s2, 12(sp) sw s3, 16(sp) sw s4, 20(sp) sw s5, 24(sp) sw s6, 28(sp) sw s7, 32(sp) sw s8, 36(sp) sw s9, 40(sp) sw s10, 44(sp) lw s2, 0(a1) #s2 is filter bf16 lw s3, 4(a1) #s3 is img bf16 #srli s2, s2, 16 #srli s3, s3, 16 #sign srli s4, s2 , 31 srli s5, s3, 31 #man li t1, 0x7FFFFF li t2, 0x800000 and s6, t1, s2 or s6, t2, s6 and s7, t1, s3 or s7, t2, s7 #exp srli s8, s2, 23 srli s9, s3, 23 andi s8, s8, 0xFF andi s9, s9, 0xFF #mr call imul li t4, 24 call get_bit srl t3, t3, t0 #t0 now is mshift mv s6, t3 #er mv t4, s8 #t4 now is ertmp add t4, t4, s9 addi t4, t4, -127 #check mshift bne t0, zero, inc mv s8, t4 mshift_done: #Overflow check #sr xor s4, s4, s5 #result sr slli s4, s4, 31 or s10, s10, s4 #s10=result #result er andi s8, s8, 0xFF slli s8, s8, 23 or s10, s10, s8 #result mr li t1, 0x7FFFFF and s6, s6, t1 or s10, s10, s6 #slli s10, s10, 16 mv a4, s10 lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) lw s6, 28(sp) lw s7, 32(sp) lw s8, 36(sp) lw s9, 40(sp) lw s10, 44(sp) addi sp, sp, 48 ret ``` :::: :::spoiler #### **b. 3 imul32** ```javascript=239 imul: addi sp, sp, -4 sw ra, 0(sp) mv t4, s6 #t4 = a mv t5, s7 #t5 = b li t6, 0 #t6 = r loop_imul: beq t5, zero, imul_done andi t0, t5, 1 #getbit beq t0, zero, imul_ans0 add t6, t6, t4 imul_ans0: srli t5, t5, 1 srli t6, t6, 1 j loop_imul imul_done: slli t3, t6, 1 #imul_result back t3 lw ra, 0(sp) addi sp, sp, 4 ret ``` ::: :::spoiler #### **b. 4 getbit/inc/mask_zero** ```javascript=263 get_bit: addi sp, sp, -4 sw ra, 0(sp) srl t6, t3, t4 #mr = mrtmp >> mshift; andi t0, t6, 1 #t0 use to getbit--> now is mshift lw ra, 0(sp) addi sp, sp, 4 ret inc: li t2, 0xFF mv a6, t4 bne a6, t2, inc_start mv a6, zero j mshift_done inc_start: call mask_zero slli t6, t6, 1 #t6 now is the sec. mask ori t6, t6, 0x1 xor t6, a5, t6 #t6 now is z1 #return value mv t5, t4 li t2, 0xFFFFFFFF xor a5, a5, t2 #a5=(~mask) and t4, t4, a5 #t4= (x & ~mask) or t4, t4, t6 mv s8, t4 j mshift_done mask_zero: addi sp, sp, -4 sw ra, 0(sp) mv t5, t4 # t5=mask (ertmp) #m1 slli t6, t5, 1 ori t6, t6, 0x1 and t5, t5, t6 #m2 slli t6, t5, 2 ori t6, t6, 0x3 and t5, t5, t6 #m3 slli t6, t5, 4 ori t6, t6, 0xF and t5, t5, t6 #m4 slli t6, t5, 8 ori t6, t6, 0xFF and t5, t5, t6 #m5 slli t6, t5, 16 li t2, 0xFFFF or t6, t6, t2 and t5, t5, t6 mv a5, t5 mv t6, t5 lw ra, 0(sp) addi sp, sp, 4 ret ``` ::: The result of **`sigma=0.707`** ![](https://hackmd.io/_uploads/BkZCdkHbp.png) The result of **`sigma=1.414`** ![](https://hackmd.io/_uploads/SJt0oWHZp.png) The result of **`sigma=2.121`** ![](https://hackmd.io/_uploads/HyBbMmBZa.png) * The CPU and Cache Analysis are shown at - [FP32 to BF16 and mul](#b-CPU-and-Cache-Analysis--FP32-to-BF16-and-mul) ## 4. Analysis ### **a. CPU and Cache Analysis -(Only for FP32 to BF16)** ![](https://hackmd.io/_uploads/SJ8uCyHW6.png) ### **b. CPU and Cache Analysis -(FP32 to BF16 and mul)** ![](https://hackmd.io/_uploads/rJ8IpJH-T.png) ## 5. Improvement ### **a. The Loop about filter to Bfloat16 format** * The completed code is at-[HW1_v2_improve.s](https://github.com/DarrenHuang0411/NCKU_CA_2023/blob/Assignment-1/HW1_v2_improve.s) Due to the presence of duplicate elements in the 3x3 Gaussian filter, there are actually only three unique elements. By modifying this portion of the code, the number of redundant calculations can be reduced. ![](https://hackmd.io/_uploads/BJVRJtIb6.png) The results indicate that modifying the code can reduce `4568 - 4241 = 327 `cycles. ![](https://hackmd.io/_uploads/B1KA6vv-a.png) ### **b. fmul32 change to fmul16** --- ## 6. 5-stage pipelined processor A 5-stage pipeline processor is a type of microprocessor design that breaks down the execution of instructions into five distinct stages, each of which is responsible for a specific aspect of instruction processing. This pipelining technique is commonly used in modern microprocessors to improve their performance by allowing multiple instructions to be in various stages of execution simultaneously. The five stages in a typical 5-stage pipeline processor are: **Fetch (IF - Instruction Fetch):** In this stage, the processor retrieves the instruction from memory. This typically involves reading the instruction from the instruction cache or main memory. **Decode (ID - Instruction Decode):** The fetched instruction is decoded in this stage. The processor determines the operation to be performed and identifies the registers or memory locations involved. **Execute (EX - Execute):** In this stage, the actual execution of the instruction takes place. This might involve arithmetic and logical operations, data transfers, or any other operation specified by the instruction. **Memory (MEM - Memory Access):** If the instruction requires access to memory (e.g., load or store operations), this stage is responsible for performing memory read or write operations. **Write-back (WB - Write-back):** The result of the instruction's execution is written back to the appropriate register or memory location in this stage. This stage updates the processor's internal state. ### **a. Total Structure** The instruction we analysis in this program is `'beq x10 x0 396<return_x>'`. Explain the `'beq x10 x0 396<return_x>'`(code 130) command in a 5-stage RISC-V CPU ![](https://hackmd.io/_uploads/H1YaNdSWp.png) ### **b. IF stage/ID stage** **Instruction fetch (IF stage)** **(left picture)** * `beq x10 x0 396<return_x>`instruction is located at address `0x0000016c` in the instruction memory. * mechine code about `beq x10 x0 396<return_x>` is `0x24050a63`. **Instruction decode (ID stage)** **(right picture)** * Read two source registers from Register File - **x10 (a0)** register : `0x10000000` **x0 (zero)** register : `0x00e70afe` ![](https://hackmd.io/_uploads/rkfNONU-p.png) ### **c. EXE stage/MEM stage** #### **Execute (EXE stage) (left picture)** * ALU Src1 value is `0x10000000` * ALU Src2 value is `0x00000000` * ALU_OP1 (choose)value is `0x0000016c` * ALU_OP2 (choose)value is `0x00000254` * ALU output : `0x000003C0` * ALU output = ALU_OP1 + ALU_OP2 #### **Memory access (MEM stage) (right picture)** Since the `'beq'` instruction only operates up to the `'exe'` stage, both the `'mem'` stage and the `'wb'` stage do not have any data transfer for the `'beq'` instruction. ![](https://hackmd.io/_uploads/r1b9mHUZ6.png) ### **d. Write back (WB stage)** * Select `0x000003C0`(ALU output) to write back register file. * There are no any data transfer for the `'beq'` instruction. ![](https://hackmd.io/_uploads/BJgPSKIb6.png) ## 7. Reference [1] [[Python]Gaussian Filter-概念與實作](https://medium.com/@bob800530/python-gaussian-filter-%E6%A6%82%E5%BF%B5%E8%88%87%E5%AF%A6%E4%BD%9C-676aac52ea17) [2] [RV32I, RV64I Instructions](https://msyksphinz-self.github.io/riscv-isadoc/html/rvi.html#ecall) [3] [RISC-V 指令集架構介紹 - RV32I](https://tclin914.github.io/16df19b4/) [4] [Apply a Gauss filter to an image with Python](https://www.geeksforgeeks.org/apply-a-gauss-filter-to-an-image-with-python/)