SIMD Instructions and the Vector Class Library from Agner Fog

Reference \:
> * **Agner Fog's VCL (Vector Class Library Installation and Overview)**
> https://www.youtube.com/watch?v=TKjYdLIMTrI&list=PLKK11LigqithMn_3ipTSSTZdbLHSh5Iy3&index=3

*This is an old note from the early days 2022\/06\/11. Please refer to other posts if you want to use intrinsics directly.*

## Vector Class Library

Compiler intrinsics can be used to take advantage of the SIMD registers. However, the syntax is difficult to read and maintain. VCL provides SIMD vector classes of 128, 256, or 512 bits for integers from 8 to 64 bits and for 32/64-bit floating point operations. When the target instruction set does not support AVX512, VCL emulates a 512-bit vector using two 256-bit registers.

:::info
:bulb: **Compiler Settings**

Check the default option selected by the gcc compiler
***gcc -march=native -Q --help=target | grep march***
> https://stackoverflow.com/questions/52653025/why-is-march-native-used-so-rarely

Add the VCL directory to the default include path of the C++ compiler
***export CPLUS_INCLUDE_PATH="/home/erebus/VCL"***

Additionally, the program needs to be compiled in 64 bits and the C++ standard should be set to c++17
***g++ -std=c++17 -m64 -march=native -o binaryname filename***
:::

## Examples

This program does a simple reduction: it calculates the sum of all elements in an array whose size is not a multiple of the vector size. First, set the data size and calculate the part that is a multiple of the vector size.

```cpp
/*
    The data size of an array is not always a multiple of the vector size.
    Therefore, either increase the array size so that it fits into whole
    vector registers, or handle the remaining data separately.
*/
#include <iostream>
#include <vectorclass.h>

const int datasize = 134;
const int vectorsize = 8;
const int regularpart = datasize & (-vectorsize); // = 128
// (AND-ing with -vectorsize will round down to the nearest
// lower multiple of vectorsize. This works only if vectorsize
// is a power of 2)
```

### Method \#1

Handle the remaining data with a smaller vector size.

```cpp
if (method_select == '0')
{
    std::cout << "method 0" << std::endl;
    int i;
    int mydata[datasize];
    for (i = 0; i < datasize; ++i) // initialize mydata
        mydata[i] = i;

    Vec8i sum1(0), temp;
    int sum = 0;
    // loop for 8 numbers at a time
    for (i = 0; i < regularpart; i += vectorsize) {
        temp.load(mydata + i); // load 8 elements
        sum1 += temp;          // add 8 elements
    }
    sum = horizontal_add(sum1); // sum of first 128 numbers
    if (datasize - i >= 4) {
        // get four more numbers
        Vec4i sum2;
        sum2.load(mydata + i);
        i += 4;
        sum += horizontal_add(sum2);
    }
    // loop for the remaining 2 numbers
    for (; i < datasize; i++) {
        sum += mydata[i];
    }
}
```

### Method \#2

Use partial load for the last vector

```cpp
// Use partial load for the last vector
else if (method_select == '1')
{
    std::cout << "method 1" << std::endl;
    int i;
    int mydata[datasize];
    for (i = 0; i < datasize; ++i) // initialize mydata
        mydata[i] = i;

    Vec8i sum1(0), temp;
    // loop for 8 numbers at a time
    for (i = 0; i < regularpart; i += vectorsize) {
        temp.load(mydata + i); // load 8 elements
        sum1 += temp;          // add 8 elements
    }
    // load the last 6 elements
    temp.load_partial(datasize - regularpart, mydata + regularpart);
    sum1 += temp;                   // add last 6 elements
    int sum = horizontal_add(sum1); // vector sum
}
```

### Method \#3

Read past the end of the array and ignore excess data

```cpp
// method 3
// Read past the end of the array and ignore excess data
else if (method_select == '2') // potential hazard: the last load reads 2 ints past the end of mydata
{
    std::cout << "method 2" << std::endl;
    int i;
    int mydata[datasize];
    for (i = 0; i < datasize; ++i) // initialize mydata
        mydata[i] = i;

    Vec8i sum1(0), temp;
    // loop for 8 numbers at a time, reading 136 numbers
    for (i = 0; i < datasize; i += vectorsize) {
        temp.load(mydata + i); // load 8 elements
        if (datasize - i < vectorsize) {
            // set excess data to zero
            // (this may be faster than load_partial)
            temp.cutoff(datasize - i);
        }
        sum1 += temp; // add 8 elements
    }
    int sum = horizontal_add(sum1); // vector sum
}
```

### Method \#4

Make array bigger and set excess data to zero

```cpp
// Make array bigger and set excess data to zero
else if (method_select == '3')
{
    std::cout << "method 3" << std::endl;
    // round up to the next multiple of vectorsize
    const int arraysize = (datasize + vectorsize - 1) & (-vectorsize); // = 136
    int i;
    int mydata[arraysize];
    for (i = 0; i < datasize; ++i) // initialize mydata
        mydata[i] = i;
    // set excess data to zero
    for (i = datasize; i < arraysize; ++i)
        mydata[i] = 0;

    Vec8i sum1(0), temp;
    // loop for 8 numbers at a time, reading 136 numbers
    for (i = 0; i < arraysize; i += vectorsize) {
        temp.load(mydata + i); // load 8 elements
        sum1 += temp;          // add 8 elements
    }
    int sum = horizontal_add(sum1); // vector sum
}
```

## Matrix Multiplication Implementation

*2024\/08\/01*

### Implementation with Intrinsics

Please refer to the other post for this version
> https://hackmd.io/@Erebustsai/HkdXPx-rh

Compile with `-g`
```clike
./main
Compute Time: 0.031000
```

Compile with `-O3`
```clike
./main
Compute Time: 0.035000
```

### Implementation with VCL2

Compile with `-g`
```clike
./main
Compute Time: 0.096000
```

Compile with `-O3`
```clike
./main
Compute Time: 0.036000
```

The gap shows up only in the `-g` build (0.096 vs 0.031); with `-O3` the two versions are on par (0.036 vs 0.035). Using `.load_partial()` might be what causes the performance difference, since the intrinsics version uses a mask. Further investigation is required.
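
As a starting point for that investigation, the sketch below shows one way to time a kernel several times and keep the best run, so that a `.load_partial()` variant and a masked variant could be compared under the same conditions. It is only a sketch under assumptions, not the harness that produced the numbers above: `best_time_seconds` and `matmul_scalar` are hypothetical names, and the plain scalar multiply is a stand-in for the VCL or intrinsics kernel.

```cpp
#include <chrono>
#include <limits>
#include <vector>

// Hypothetical helper (not part of VCL): runs `kernel` `repeats` times
// (repeats should be >= 1) and returns the shortest wall-clock time in seconds.
template <typename Kernel>
double best_time_seconds(Kernel&& kernel, int repeats)
{
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        kernel();
        const auto stop = std::chrono::steady_clock::now();
        const double seconds =
            std::chrono::duration<double>(stop - start).count();
        if (seconds < best)
            best = seconds;
    }
    return best;
}

// Scalar stand-in for the kernel under test (assumption: square n x n
// row-major matrices). Computes c = a * b. Swap in the VCL or
// intrinsics version here to compare them.
void matmul_scalar(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, int n)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}
```

A call such as `best_time_seconds([&] { matmul_scalar(a, b, c, n); }, 5)` returns the best of five runs. Taking the minimum instead of a single measurement reduces noise from other processes, which matters when the timings being compared are as close as the `-O3` results.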