Reference \:
> * **Agner Fog's VCL (Vector Class Library Installation and Overview)**
https://www.youtube.com/watch?v=TKjYdLIMTrI&list=PLKK11LigqithMn_3ipTSSTZdbLHSh5Iy3&index=3
*This is an old note from the early days (2022\/06\/11). Please refer to the other posts if you want to use intrinsics directly.*
## Vector Class Library
Compiler intrinsics can be used to take advantage of the SIMD registers, but their syntax is difficult to read and maintain. VCL provides SIMD vector classes of 128, 256, or 512 bits, with integer elements from 8 to 64 bits and 32-bit or 64-bit floating-point elements. When the target instruction set does not support a vector size, VCL emulates it with smaller vectors; for example, a 512-bit vector is emulated with two 256-bit vectors when AVX512 is not available.
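To make the readability point concrete, here is a minimal sketch that adds two vectors of four 32-bit integers with raw SSE2 intrinsics. The helper name `add4` is made up for this note, and the scalar fallback is only there so the sketch still builds on non-x86 targets. The rough VCL equivalent is shown in the trailing comment.

```cpp
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical helper: out[0..3] = a[0..3] + b[0..3]
void add4(const std::int32_t* a, const std::int32_t* b, std::int32_t* out)
{
#if defined(__SSE2__) || defined(_M_X64)
    // Raw intrinsics: pointer casts and type-suffixed function names
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc = _mm_add_epi32(va, vb);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), vc);
#else
    // Scalar fallback for targets without SSE2
    for (int k = 0; k < 4; ++k)
        out[k] = a[k] + b[k];
#endif
    // With VCL, the same operation reads roughly as:
    //   Vec4i va, vb;
    //   va.load(a);
    //   vb.load(b);
    //   (va + vb).store(out);
}
```

The element type and width are encoded in the class name (`Vec4i`) rather than in every function name, which is most of what makes the VCL version easier to maintain.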
:::info
:bulb: **Compiler Settings**
Check which architecture gcc selects for `-march=native`:
***gcc -march=native -Q --help=target | grep march***
> https://stackoverflow.com/questions/52653025/why-is-march-native-used-so-rarely
Add the VCL directory to the default include path of the C++ compiler:
***export CPLUS_INCLUDE_PATH="/home/erebus/VCL"***
Additionally, the program must be compiled as 64-bit, with the C++ standard set to C++17:
***g++ -std=c++17 -m64 -march=native -o binaryname filename***
:::
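As a quick sanity check of what `-march=native` enabled, the compiler's predefined macros can be inspected from inside a program. This is a small sketch that assumes GCC or Clang on x86; the helper name `widest_simd_target` is made up for this note.

```cpp
#include <string>

// Hypothetical helper: reports the widest x86 SIMD extension
// that the compiler was allowed to target for this build.
std::string widest_simd_target()
{
#if defined(__AVX512F__)
    return "AVX512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected";
#endif
}
```

Printing the result from `main` once with and once without `-march=native` shows whether the flag changed the target.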
## Examples
This program performs a simple reduction: it sums all elements of an array whose size is not a multiple of the vector size.
First, set the data size and compute the part that is a multiple of the vector size.
```cpp
/*
The data size is not always a multiple of the vector size.
Therefore, either pad the array so that it fills whole vector registers,
or handle the remaining data separately.
*/
#include <iostream>
#include <vectorclass.h>
const int datasize = 134;
const int vectorsize = 8;
const int regularpart = datasize & (-vectorsize); // = 128
// (AND-ing with -vectorsize rounds down to the nearest
// lower multiple of vectorsize. This works only if vectorsize
// is a power of 2.)
```
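The rounding trick can be checked at compile time. The sketch below wraps the two expressions used in this note in helper functions (`round_down` and `round_up` are names made up here) and verifies the values 128 and 136 for a data size of 134.

```cpp
// Round n down to a multiple of v (v must be a power of 2)
constexpr int round_down(int n, int v) { return n & (-v); }

// Round n up to a multiple of v (v must be a power of 2)
constexpr int round_up(int n, int v) { return (n + v - 1) & (-v); }

static_assert(round_down(134, 8) == 128, "regular part");
static_assert(round_up(134, 8) == 136, "padded array size");
static_assert(134 - round_down(134, 8) == 6, "leftover elements");
static_assert(round_down(128, 8) == 128, "exact multiples are unchanged");

// With a vector size that is not a power of 2, the mask no longer
// yields a multiple of that size.
static_assert(round_down(134, 6) % 6 != 0, "6 is not a power of 2");
```

In two's complement, `-8` has all bits set except the lowest three, so the AND simply clears the bits that hold the remainder.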
### Method \#1
Handle the remaining data with a smaller vector size, then finish with a scalar loop.
```cpp
if (method_select == '0')
{
    std::cout << "method 0" << std::endl;
    int i;
    int mydata[datasize];
    for (i = 0; i < datasize; ++i) // initialize mydata
        mydata[i] = i;
    Vec8i sum1(0), temp;
    int sum = 0;
    // loop for 8 numbers at a time
    for (i = 0; i < regularpart; i += vectorsize)
    {
        temp.load(mydata + i); // load 8 elements
        sum1 += temp;          // add 8 elements
    }
    sum = horizontal_add(sum1); // sum of first 128 numbers
    if (datasize - i >= 4)
    {
        // get four more numbers
        Vec4i sum2;
        sum2.load(mydata + i);
        i += 4;
        sum += horizontal_add(sum2);
    }
    // loop for the remaining 2 numbers
    for (; i < datasize; i++)
    {
        sum += mydata[i];
    }
}
```
### Method \#2
Use a partial load for the last vector.
```cpp
// Use partial load for the last vector
else if (method_select == '1')
{
    std::cout << "method 1" << std::endl;
    int i;
    int mydata[datasize];
    for (i = 0; i < datasize; ++i) // initialize mydata
        mydata[i] = i;
    Vec8i sum1(0), temp;
    // loop for 8 numbers at a time
    for (i = 0; i < regularpart; i += vectorsize)
    {
        temp.load(mydata + i); // load 8 elements
        sum1 += temp;          // add 8 elements
    }
    // load the last 6 elements; the unused 2 lanes are set to zero
    temp.load_partial(datasize - regularpart, mydata + regularpart);
    sum1 += temp;                   // add last 6 elements
    int sum = horizontal_add(sum1); // vector sum
}
```
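To see why the partial load gives the right total, the sketch below models it in plain C++ without VCL. It assumes that `load_partial(n, p)` reads the first `n` elements and sets the remaining lanes to zero; `load_partial_model` and `sum_with_partial_tail` are names made up for this note, and a `std::array<int, 8>` stands in for `Vec8i`.

```cpp
#include <array>

constexpr int kLanes = 8; // number of lanes in a Vec8i

// Scalar model of load_partial(n, p): read n elements, zero the rest
std::array<int, kLanes> load_partial_model(int n, const int* p)
{
    std::array<int, kLanes> v{}; // all lanes start at zero
    for (int k = 0; k < n && k < kLanes; ++k)
        v[k] = p[k];
    return v;
}

// Sum n elements using full 8-lane steps plus one partial step
int sum_with_partial_tail(const int* data, int n)
{
    const int regular = n & (-kLanes); // largest multiple of 8 <= n
    std::array<int, kLanes> acc{};
    for (int i = 0; i < regular; i += kLanes)
        for (int k = 0; k < kLanes; ++k)
            acc[k] += data[i + k];

    // The tail never touches memory beyond data[n - 1]
    const auto tail = load_partial_model(n - regular, data + regular);
    for (int k = 0; k < kLanes; ++k)
        acc[k] += tail[k];

    int sum = 0;
    for (int v : acc) // plays the role of horizontal_add
        sum += v;
    return sum;
}
```

Because the unused lanes are zero, adding the tail vector to the accumulator cannot change the result, and no element outside the array is read.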
### Method \#3
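Read past the end of the array and ignore the excess data. The last `load` touches memory beyond `mydata`, so this is only safe when that memory is known to be readable.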
```cpp
// Read past the end of the array and ignore excess data
else if (method_select == '2') // potential hazard: reads beyond mydata
{
    std::cout << "method 2" << std::endl;
    int i;
    int mydata[datasize];
    for (i = 0; i < datasize; ++i) // initialize mydata
        mydata[i] = i;
    Vec8i sum1(0), temp;
    // loop for 8 numbers at a time, reading 136 numbers
    for (i = 0; i < datasize; i += vectorsize)
    {
        temp.load(mydata + i); // load 8 elements (last load overruns by 2)
        if (datasize - i < vectorsize)
        {
            // set excess data to zero
            // (this may be faster than load_partial)
            temp.cutoff(datasize - i);
        }
        sum1 += temp; // add 8 elements
    }
    int sum = horizontal_add(sum1); // vector sum
}
```
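The difference from the partial load is that all 8 lanes are read from memory first and only then masked. The sketch below models this in plain C++, assuming that `cutoff(n)` keeps the first `n` lanes and zeroes the others; `cutoff_model` and `sum_block_with_cutoff` are names made up for this note. Unlike the code above, the model is always given a block with 8 readable elements, so it does not overrun anything.

```cpp
#include <array>

constexpr int kLanes = 8; // number of lanes in a Vec8i

// Scalar model of cutoff(n): keep the first n lanes, zero the others
void cutoff_model(std::array<int, kLanes>& v, int n)
{
    for (int k = (n < 0 ? 0 : n); k < kLanes; ++k)
        v[k] = 0;
}

// Full 8-lane load followed by cutoff.
// p must point to at least 8 readable ints, even if only `valid`
// of them hold meaningful data.
int sum_block_with_cutoff(const int* p, int valid)
{
    std::array<int, kLanes> v;
    for (int k = 0; k < kLanes; ++k)
        v[k] = p[k]; // every lane is read, including the excess ones
    cutoff_model(v, valid);

    int sum = 0;
    for (int x : v)
        sum += x;
    return sum;
}
```

Whatever values sit in the excess lanes are discarded, so the result is correct as long as the read itself is allowed. In standard C++, reading beyond the end of `mydata` is undefined behavior, which is the hazard flagged in the code comment.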
### Method \#4
Make the array bigger and set the excess data to zero.
```cpp
// Make array bigger and set excess data to zero
else if (method_select == '3')
{
    std::cout << "method 3" << std::endl;
    const int arraysize = (datasize + vectorsize - 1) & (-vectorsize); // = 136
    int i;
    int mydata[arraysize];
    for (i = 0; i < datasize; ++i) // initialize mydata
        mydata[i] = i;
    // set excess data to zero
    for (i = datasize; i < arraysize; ++i)
        mydata[i] = 0;
    Vec8i sum1(0), temp;
    // loop for 8 numbers at a time, reading 136 numbers
    for (i = 0; i < arraysize; i += vectorsize)
    {
        temp.load(mydata + i); // load 8 elements
        sum1 += temp;          // add 8 elements
    }
    int sum = horizontal_add(sum1); // vector sum
}
```
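None of the four branches prints or checks `sum`. Since the array holds 0, 1, ..., `datasize - 1`, the expected total has the closed form n(n - 1)/2, which is 8911 for n = 134. A small checker along these lines could be called at the end of each branch; `expected_sum` and `report_sum` are names made up for this note.

```cpp
#include <iostream>

// Closed-form value of 0 + 1 + ... + (n - 1)
constexpr int expected_sum(int n) { return n * (n - 1) / 2; }

static_assert(expected_sum(134) == 8911, "sum of 0..133");

// Hypothetical helper: prints the result and whether it matches
bool report_sum(const char* label, int sum, int n)
{
    const bool ok = (sum == expected_sum(n));
    std::cout << label << ": sum = " << sum
              << (ok ? " (ok)" : " (MISMATCH)") << '\n';
    return ok;
}
```

For example, `report_sum("method 0", sum, datasize);` placed before the closing brace of the first branch would confirm that the tail handling did not drop or double-count any element.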
## Matrix Multiplication Implementation
*2024\/08\/01*
### Implementation with Intrinsics
Please refer to the other post for this version
Compile with `-g`
```clike
./main
Compute Time: 0.031000
```
Compile with `-O3`
```clike
./main
Compute Time: 0.035000
```
https://hackmd.io/@Erebustsai/HkdXPx-rh
### Implementation with VCL2
Compile with `-g`
```clike
./main
Compute Time: 0.096000
```
Compile with `-O3`
```clike
./main
Compute Time: 0.036000
```
Using `.load_partial()` might be what causes the performance difference seen in the `-g` builds, since the intrinsics version uses a mask instead. Further investigation is required.