Assignment2: RISC-V Toolchain

# Assignment2: RISC-V Toolchain contributed by < [`ChengChaoChun`](https://github.com/ChengChaoChun) > ## Rewrite [Matrix multiplication using bfloat16](https://hackmd.io/@HKWg9_3IT2aEDv2v6T-Z0Q/SJb0JeVg6) Motivation: I choose the problem [Matrix multiplication using bfloat16](https://hackmd.io/@HKWg9_3IT2aEDv2v6T-Z0Q/SJb0JeVg6) from [魏彥庭](https://hackmd.io/@HKWg9_3IT2aEDv2v6T-Z0Q/SJb0JeVg6). Matrix multiplication is a fundamental operation in many scientific and engineering fields, aiding in the solution of complex mathematical and computational problems. Therefore, I am attempting to increase the efficiency of matrix multiplication. Original C code: ```c #include <stdio.h> #include <stdint.h> extern uint64_t get_cycles(); //Define a union for the conversion between floating-point numbers and integers. typedef union { float f; // The floating-point part unsigned int i; // The integer part } float_to_int; //Define a function to convert single-precision floating-point numbers to the bfloat16 format float fp32_to_bf16(float x) { float y = x; // Copy the input value to y int *p = (int *) &y; // Create an integer pointer to access the binary representation of a floating-point number unsigned int exp = *p & 0x7F800000; // Extract the exponent part of the floating-point number. unsigned int man = *p & 0x007FFFFF; // Extract the mantissa part of the floating-point number // If it is zero or infinity/NaN, simply return the input value x if (exp == 0 && man == 0) return x; if (exp == 0x7F800000) return x; // Perform rounding on normalized numbers. // First, create a new floating-point number "r" for rounding. float r = x; int *pr = (int *) &r; *pr &= 0xFF800000; // The exponent part of "r" is the same as that of "x" r /= 0x100; // Right-shift by 8 bits, which is equivalent to dividing by 256, to implement rounding y = x + r; // Add the rounded value back to "x" *p &= 0xFFFF0000; // Clear the fractional part of "y" and retain the integer part return y; // Return the bfloat16 value } int main() { uint64_t oldcount = get_cycles(); // Define two 2x2 matrices, A and B, as well as a result matrix C, all initialized to zero float A[2][2] = { {1.2345, 2.3456}, {3.4567, 4.5678} }; float B[2][2] = { {0.1234, 0.2345}, {0.3456, 0.4567} }; float C[2][2] = { {0.0, 0.0}, {0.0, 0.0} }; for (int i = 0; i < 2; i++) { for (int j = 0; j < 2; j++) { for (int k = 0; k < 2; k++) { // Use the fp32_to_bf16 function to convert to bfloat16 // Perform matrix multiplication and accumulate the results into matrix C C[i][j] += fp32_to_bf16(A[i][k]) * fp32_to_bf16(B[k][j]); } } } printf("Result Matrix C:\n"); for (int i = 0; i < 2; i++) { for (int j = 0; j < 2; j++) { printf("%f ", C[i][j]); } printf("\n"); } uint64_t cyclecount = get_cycles() - oldcount; printf("cycle count: %u\n", (unsigned int) cyclecount); return 0; } ``` Modefied C code: ```c #include <stdio.h> #include <stdint.h> extern uint64_t get_cycles(); //Define a union for the conversion between floating-point numbers and integers typedef union { float f; // The floating-point part unsigned int i; // The integer part } float_to_int; //Define a function to convert single-precision floating-point numbers to the bfloat16 format. float fp32_to_bf16(float x) { float y = x; // Copy the input value to y int *p = (int *) &y; // Create an integer pointer to access the binary representation of a floating-point number unsigned int exp = *p & 0x7F800000; // Extract the exponent part of the floating-point number unsigned int man = *p & 0x007FFFFF; // Extract the mantissa part of the floating-point number // If it is zero or infinity/NaN, simply return the input value x if (exp == 0 && man == 0) return x; if (exp == 0x7F800000) return x; // Perform rounding on normalized numbers. // First, create a new floating-point number "r" for rounding. float r = x; int *pr = (int *) &r; *pr &= 0xFF800000; // The exponent part of "r" is the same as that of "x" r /= 0x100; // Right-shift by 8 bits, which is equivalent to dividing by 256, to implement rounding y = x + r; // Add the rounded value back to "x" *p &= 0xFFFF0000; // Clear the fractional part of "y" and retain the integer part return y; // Return the bfloat16 value } int main() { uint64_t oldcount = get_cycles(); // Define two 2x2 matrices, A and B, as well as a result matrix C, all initialized to zero float A[2][2] = { {1.2345, 2.3456}, {3.4567, 4.5678} }; float B[2][2] = { {0.1234, 0.2345}, {0.3456, 0.4567} }; float C[2][2] = { {0.0, 0.0}, {0.0, 0.0} }; C[0][0] += fp32_to_bf16(A[0][0]) * fp32_to_bf16(B[0][0]); C[0][0] += fp32_to_bf16(A[0][1]) * fp32_to_bf16(B[1][0]); C[0][1] += fp32_to_bf16(A[0][0]) * fp32_to_bf16(B[0][1]); C[0][1] += fp32_to_bf16(A[0][1]) * fp32_to_bf16(B[1][1]); C[1][0] += fp32_to_bf16(A[1][0]) * fp32_to_bf16(B[0][0]); C[1][0] += fp32_to_bf16(A[1][1]) * fp32_to_bf16(B[1][0]); C[1][1] += fp32_to_bf16(A[1][0]) * fp32_to_bf16(B[0][1]); C[1][1] += fp32_to_bf16(A[1][1]) * fp32_to_bf16(B[1][1]); printf("Result Matrix C:\n"); printf("%f ", C[0][0]); printf("%f \n", C[0][1]); printf("%f ", C[1][0]); printf("%f ", C[1][1]); uint64_t cyclecount = get_cycles() - oldcount; printf("cycle count: %u\n", (unsigned int) cyclecount); return 0; } ``` I Added a cycle calculation function call code in both the original C code and the modified C code. The difference between the modified C code and the original C code is that the original C code multiplies matrix A with one column of matrix B at a time, whereas the Modify C code using loop unrolling techniques to perform matrix multiplication. Here is the assembly code for cycle count.The assembly code is at the rv32emu/tests/perfcounter directory. ``` .text .globl get_cycles .align 2 get_cycles: csrr a1, cycleh csrr a0, cycle csrr a2, cycleh bne a1, a2, get_cycles ret .size get_cycles,.-get_cycles ``` ## Optimization Optimization level -O0 from original c code : ``` $ ../../build/rv32emu ori_mm_0.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 29268 inferior exit code 0 ``` ``` $ riscv-none-elf-size ori_mm_0.elf text data bss dec hex filename 55284 1876 1528 58688 e540 ori_mm_0.elf ``` from modified c code : ``` $ ../../build/rv32emu modi_mm_0.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 28515 inferior exit code 0 ``` ``` $ riscv-none-elf-size modi_mm_0.elf text data bss dec hex filename 55306 1876 1528 58710 e556 modi_mm_0.elf ``` The cycle count of the modified C code is slightly lower than that of the original C code, while the text size of the modified C code is slightly larger. In this example, we're only dealing with a 2x2 matrix multiplication. If it were a 10x10 matrix multiplication, theoretically, the modified version could have even fewer cycles than the original version, but the size would also become significantly larger. Optimization level -O1 from original c code : ``` $ ../../build/rv32emu ori_mm_1.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 27433 inferior exit code 0 ``` ``` $ riscv-none-elf-size ori_mm_1.elf text data bss dec hex filename 53950 1912 1528 57390 e02e ori_mm_1.elf ``` from modified c code : ``` $ ../../build/rv32emu modi_mm_1.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 23377 inferior exit code 0 ``` ``` 2$ riscv-none-elf-size modi_mm_1.elf text data bss dec hex filename 53132 1912 1528 56572 dcfc modi_mm_1.elf ``` The cycle count of both the modified C code and the original C code is less than the O0 version. Optimization level -O2 from original c code : ``` $ ../../build/rv32emu ori_mm_2.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 28110 inferior exit code 0 ``` ``` $ riscv-none-elf-size ori_mm_2.elf text data bss dec hex filename 54116 1912 1528 57556 e0d4 ori_mm_2.elf ``` from modified c code : ``` $ ../../build/rv32emu modi_mm_2.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 23377 inferior exit code 0 ``` ``` $ riscv-none-elf-size modi_mm_2.elf text data bss dec hex filename 53136 1912 1528 56576 dd00 modi_mm_2.elf ``` In O2 optimization, the original C code unexpectedly has a higher cycle count compared to the O1 version, while the modified C code has about 5,000 fewer cycle counts than the original c code. Optimization level -O3 from original c code : ``` $ ../../build/rv32emu ori_mm_3.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 23699 inferior exit code 0 ``` ``` $ riscv-none-elf-size ori_mm_3.elf text data bss dec hex filename 53472 1912 1528 56912 de50 ori_mm_3.elf ``` from modified c code : ``` $ ../../build/rv32emu modi_mm_3.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 23377 inferior exit code 0 ``` ``` $ riscv-none-elf-size modi_mm_3.elf text data bss dec hex filename 53136 1912 1528 56576 dd00 modi_mm_3.elf ``` In O3 optimization, the cycle count of the original C code is already lower than that of the modified C code, indicating that the modified C code seems to have been optimized to its limit. Optimization level -Ofast from original c code : ``` $ ../../build/rv32emu ori_mm_fast.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 23699 inferior exit code 0 ``` ``` $ riscv-none-elf-size ori_mm_fast.elf text data bss dec hex filename 53472 1912 1528 56912 de50 ori_mm_fast.elf ``` Display the ELF file header ``` $ riscv-none-elf-readelf -h ori_mm_fast.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10168 Start of program headers: 52 (bytes into file) Start of section headers: 70240 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` from modified c code : ``` $ ../../build/rv32emu modi_mm_fast.elf Result Matrix C: 0.962730 1.360474 2.003853 2.894531 cycle count: 23377 inferior exit code 0 ``` ``` $ riscv-none-elf-size modi_mm_fast.elf text data bss dec hex filename 53136 1912 1528 56576 dd00 modi_mm_fast.elf ``` Display the ELF file header ``` $ riscv-none-elf-readelf -h modi_mm_fast.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x1015c Start of program headers: 52 (bytes into file) Start of section headers: 69680 (bytes into file) Flags: 0x1, RVC, soft-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` ## Conclusion In high-dimensional matrix multiplication, using the loop unrolling technique can reduce the cycle count, but it will also increase the code size. Additional information: In the original C code for matrix multiplication, B[k][j] constantly jumps in memory. This leads to a low cache hit rate, as the loop continuously transfers data from memory to the cache, resulting in reduced efficiency. ```c for (int i = 0; i < 2; i++) { for (int j = 0; j < 2; j++) { for (int k = 0; k < 2; k++) { //B[k][j] constantly jumps in memory. C[i][j] += A[i][k] * B[k][j]; } } } ``` To increase program efficiency, we can change the order of matrix element multiplication. Below is the modified code. ```c for (int i = 0; i < 2; i++) { for (int j = 0; j < 2; j++) { int temp = A[j][i]; for (int k = 0; k < 2; k++) { C[j][k] += temp * B[i][k]; } } } ``` First, in the innermost loop, as k increments by 1, both C[j][k] and B[i][k] only increase by 1 each time. This aligns with the principle of spatial locality, meaning that memory reads occur sequentially without significant jumps. Secondly, A[j][i] jumps within the middle loop, but the middle loop doesn't execute as frequently as the innermost loop. Additionally, we assign A[j][i] to a local variable 'temp'. During the compilation process when the assembly code is generated, the local variable 'temp' should be stored in CPU registers, and the time it takes for the innermost loop to read from registers is almost negligible. ![](https://hackmd.io/_uploads/HJKp-dafT.png) Each different color represents one execution of the for loop that increments k. :::warning TODO: Revise the assembly implementation to see if you can improve. :notes: jserv :::

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.