# Assignment2: RISC-V Toolchain
contributed by < [`ChengChaoChun`](https://github.com/ChengChaoChun) >
## Rewrite [Matrix multiplication using bfloat16](https://hackmd.io/@HKWg9_3IT2aEDv2v6T-Z0Q/SJb0JeVg6)
Motivation:
I choose the problem [Matrix multiplication using bfloat16](https://hackmd.io/@HKWg9_3IT2aEDv2v6T-Z0Q/SJb0JeVg6) from [魏彥庭](https://hackmd.io/@HKWg9_3IT2aEDv2v6T-Z0Q/SJb0JeVg6). Matrix multiplication is a fundamental operation in many scientific and engineering fields, aiding in the solution of complex mathematical and computational problems. Therefore, I am attempting to increase the efficiency of matrix multiplication.
Original C code:
```c
#include <stdio.h>
#include <stdint.h>
extern uint64_t get_cycles();
//Define a union for the conversion between floating-point numbers and integers.
typedef union {
float f; // The floating-point part
unsigned int i; // The integer part
} float_to_int;
//Define a function to convert single-precision floating-point numbers to the bfloat16 format
float fp32_to_bf16(float x) {
float y = x; // Copy the input value to y
int *p = (int *) &y; // Create an integer pointer to access the binary representation of a floating-point number
unsigned int exp = *p & 0x7F800000; // Extract the exponent part of the floating-point number.
unsigned int man = *p & 0x007FFFFF; // Extract the mantissa part of the floating-point number
// If it is zero or infinity/NaN, simply return the input value x
if (exp == 0 && man == 0)
return x;
if (exp == 0x7F800000)
return x;
// Perform rounding on normalized numbers.
// First, create a new floating-point number "r" for rounding.
float r = x;
int *pr = (int *) &r;
*pr &= 0xFF800000; // The exponent part of "r" is the same as that of "x"
r /= 0x100; // Right-shift by 8 bits, which is equivalent to dividing by 256, to implement rounding
y = x + r; // Add the rounded value back to "x"
*p &= 0xFFFF0000; // Clear the fractional part of "y" and retain the integer part
return y; // Return the bfloat16 value
}
int main() {
uint64_t oldcount = get_cycles();
// Define two 2x2 matrices, A and B, as well as a result matrix C, all initialized to zero
float A[2][2] = {
{1.2345, 2.3456},
{3.4567, 4.5678}
};
float B[2][2] = {
{0.1234, 0.2345},
{0.3456, 0.4567}
};
float C[2][2] = {
{0.0, 0.0},
{0.0, 0.0}
};
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j++) {
for (int k = 0; k < 2; k++) {
// Use the fp32_to_bf16 function to convert to bfloat16
// Perform matrix multiplication and accumulate the results into matrix C
C[i][j] += fp32_to_bf16(A[i][k]) * fp32_to_bf16(B[k][j]);
}
}
}
printf("Result Matrix C:\n");
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j++) {
printf("%f ", C[i][j]);
}
printf("\n");
}
uint64_t cyclecount = get_cycles() - oldcount;
printf("cycle count: %u\n", (unsigned int) cyclecount);
return 0;
}
```
Modefied C code:
```c
#include <stdio.h>
#include <stdint.h>
extern uint64_t get_cycles();
//Define a union for the conversion between floating-point numbers and integers
typedef union {
float f; // The floating-point part
unsigned int i; // The integer part
} float_to_int;
//Define a function to convert single-precision floating-point numbers to the bfloat16 format.
float fp32_to_bf16(float x) {
float y = x; // Copy the input value to y
int *p = (int *) &y; // Create an integer pointer to access the binary representation of a floating-point number
unsigned int exp = *p & 0x7F800000; // Extract the exponent part of the floating-point number
unsigned int man = *p & 0x007FFFFF; // Extract the mantissa part of the floating-point number
// If it is zero or infinity/NaN, simply return the input value x
if (exp == 0 && man == 0)
return x;
if (exp == 0x7F800000)
return x;
// Perform rounding on normalized numbers.
// First, create a new floating-point number "r" for rounding.
float r = x;
int *pr = (int *) &r;
*pr &= 0xFF800000; // The exponent part of "r" is the same as that of "x"
r /= 0x100; // Right-shift by 8 bits, which is equivalent to dividing by 256, to implement rounding
y = x + r; // Add the rounded value back to "x"
*p &= 0xFFFF0000; // Clear the fractional part of "y" and retain the integer part
return y; // Return the bfloat16 value
}
int main() {
uint64_t oldcount = get_cycles();
// Define two 2x2 matrices, A and B, as well as a result matrix C, all initialized to zero
float A[2][2] = {
{1.2345, 2.3456},
{3.4567, 4.5678}
};
float B[2][2] = {
{0.1234, 0.2345},
{0.3456, 0.4567}
};
float C[2][2] = {
{0.0, 0.0},
{0.0, 0.0}
};
C[0][0] += fp32_to_bf16(A[0][0]) * fp32_to_bf16(B[0][0]);
C[0][0] += fp32_to_bf16(A[0][1]) * fp32_to_bf16(B[1][0]);
C[0][1] += fp32_to_bf16(A[0][0]) * fp32_to_bf16(B[0][1]);
C[0][1] += fp32_to_bf16(A[0][1]) * fp32_to_bf16(B[1][1]);
C[1][0] += fp32_to_bf16(A[1][0]) * fp32_to_bf16(B[0][0]);
C[1][0] += fp32_to_bf16(A[1][1]) * fp32_to_bf16(B[1][0]);
C[1][1] += fp32_to_bf16(A[1][0]) * fp32_to_bf16(B[0][1]);
C[1][1] += fp32_to_bf16(A[1][1]) * fp32_to_bf16(B[1][1]);
printf("Result Matrix C:\n");
printf("%f ", C[0][0]);
printf("%f \n", C[0][1]);
printf("%f ", C[1][0]);
printf("%f ", C[1][1]);
uint64_t cyclecount = get_cycles() - oldcount;
printf("cycle count: %u\n", (unsigned int) cyclecount);
return 0;
}
```
I Added a cycle calculation function call code in both the original C code and the modified C code.
The difference between the modified C code and the original C code is that the original C code multiplies matrix A with one column of matrix B at a time, whereas the Modify C code using loop unrolling techniques to perform matrix multiplication.
Here is the assembly code for cycle count.The assembly code is at the rv32emu/tests/perfcounter directory.
```
.text
.globl get_cycles
.align 2
get_cycles:
csrr a1, cycleh
csrr a0, cycle
csrr a2, cycleh
bne a1, a2, get_cycles
ret
.size get_cycles,.-get_cycles
```
## Optimization
Optimization level -O0
from original c code :
```
$ ../../build/rv32emu ori_mm_0.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531
cycle count: 29268
inferior exit code 0
```
```
$ riscv-none-elf-size ori_mm_0.elf
text data bss dec hex filename
55284 1876 1528 58688 e540 ori_mm_0.elf
```
from modified c code :
```
$ ../../build/rv32emu modi_mm_0.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531 cycle count: 28515
inferior exit code 0
```
```
$ riscv-none-elf-size modi_mm_0.elf
text data bss dec hex filename
55306 1876 1528 58710 e556 modi_mm_0.elf
```
The cycle count of the modified C code is slightly lower than that of the original C code, while the text size of the modified C code is slightly larger. In this example, we're only dealing with a 2x2 matrix multiplication. If it were a 10x10 matrix multiplication, theoretically, the modified version could have even fewer cycles than the original version, but the size would also become significantly larger.
Optimization level -O1
from original c code :
```
$ ../../build/rv32emu ori_mm_1.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531
cycle count: 27433
inferior exit code 0
```
```
$ riscv-none-elf-size ori_mm_1.elf
text data bss dec hex filename
53950 1912 1528 57390 e02e ori_mm_1.elf
```
from modified c code :
```
$ ../../build/rv32emu modi_mm_1.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531 cycle count: 23377
inferior exit code 0
```
```
2$ riscv-none-elf-size modi_mm_1.elf
text data bss dec hex filename
53132 1912 1528 56572 dcfc modi_mm_1.elf
```
The cycle count of both the modified C code and the original C code is less than the O0 version.
Optimization level -O2
from original c code :
```
$ ../../build/rv32emu ori_mm_2.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531
cycle count: 28110
inferior exit code 0
```
```
$ riscv-none-elf-size ori_mm_2.elf
text data bss dec hex filename
54116 1912 1528 57556 e0d4 ori_mm_2.elf
```
from modified c code :
```
$ ../../build/rv32emu modi_mm_2.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531 cycle count: 23377
inferior exit code 0
```
```
$ riscv-none-elf-size modi_mm_2.elf
text data bss dec hex filename
53136 1912 1528 56576 dd00 modi_mm_2.elf
```
In O2 optimization, the original C code unexpectedly has a higher cycle count compared to the O1 version, while the modified C code has about 5,000 fewer cycle counts than the original c code.
Optimization level -O3
from original c code :
```
$ ../../build/rv32emu ori_mm_3.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531
cycle count: 23699
inferior exit code 0
```
```
$ riscv-none-elf-size ori_mm_3.elf
text data bss dec hex filename
53472 1912 1528 56912 de50 ori_mm_3.elf
```
from modified c code :
```
$ ../../build/rv32emu modi_mm_3.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531 cycle count: 23377
inferior exit code 0
```
```
$ riscv-none-elf-size modi_mm_3.elf
text data bss dec hex filename
53136 1912 1528 56576 dd00 modi_mm_3.elf
```
In O3 optimization, the cycle count of the original C code is already lower than that of the modified C code, indicating that the modified C code seems to have been optimized to its limit.
Optimization level -Ofast
from original c code :
```
$ ../../build/rv32emu ori_mm_fast.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531
cycle count: 23699
inferior exit code 0
```
```
$ riscv-none-elf-size ori_mm_fast.elf
text data bss dec hex filename
53472 1912 1528 56912 de50 ori_mm_fast.elf
```
Display the ELF file header
```
$ riscv-none-elf-readelf -h ori_mm_fast.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10168
Start of program headers: 52 (bytes into file)
Start of section headers: 70240 (bytes into file)
Flags: 0x1, RVC, soft-float ABI
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
from modified c code :
```
$ ../../build/rv32emu modi_mm_fast.elf
Result Matrix C:
0.962730 1.360474
2.003853 2.894531 cycle count: 23377
inferior exit code 0
```
```
$ riscv-none-elf-size modi_mm_fast.elf
text data bss dec hex filename
53136 1912 1528 56576 dd00 modi_mm_fast.elf
```
Display the ELF file header
```
$ riscv-none-elf-readelf -h modi_mm_fast.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x1015c
Start of program headers: 52 (bytes into file)
Start of section headers: 69680 (bytes into file)
Flags: 0x1, RVC, soft-float ABI
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
## Conclusion
In high-dimensional matrix multiplication, using the loop unrolling technique can reduce the cycle count, but it will also increase the code size.
Additional information:
In the original C code for matrix multiplication, B[k][j] constantly jumps in memory. This leads to a low cache hit rate, as the loop continuously transfers data from memory to the cache, resulting in reduced efficiency.
```c
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j++) {
for (int k = 0; k < 2; k++) {
//B[k][j] constantly jumps in memory.
C[i][j] += A[i][k] * B[k][j];
}
}
}
```
To increase program efficiency, we can change the order of matrix element multiplication. Below is the modified code.
```c
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j++) {
int temp = A[j][i];
for (int k = 0; k < 2; k++) {
C[j][k] += temp * B[i][k];
}
}
}
```
First, in the innermost loop, as k increments by 1, both C[j][k] and B[i][k] only increase by 1 each time. This aligns with the principle of spatial locality, meaning that memory reads occur sequentially without significant jumps.
Secondly, A[j][i] jumps within the middle loop, but the middle loop doesn't execute as frequently as the innermost loop. Additionally, we assign A[j][i] to a local variable 'temp'. During the compilation process when the assembly code is generated, the local variable 'temp' should be stored in CPU registers, and the time it takes for the innermost loop to read from registers is almost negligible.

Each different color represents one execution of the for loop that increments k.
:::warning
TODO: Revise the assembly implementation to see if you can improve.
:notes: jserv
:::