# Books & Courses & Documents : General Performance Engineering
> * Algorithmica
> https://en.algorithmica.org/
> * 計算機系統開發與優化實戰
> https://www.tenlong.com.tw/products/9787115592880
> * Parallel and Distributed Algorithms by MobyLab
> https://mobylab.docs.crescdi.pub.ro/en/docs/parallelAndDistributed/introduction/
> * How To Befriend NUMA *Ruud van der Pas*
> https://www.youtube.com/watch?v=0v1FGLrlt9k
> * 10710 Prof. 周志遠 | Parallel Programming
> https://www.youtube.com/playlist?list=PLS0SUwlYe8cxqw70UHOE5n4Lm-mXFXbZT
>
# System Hardware Info
**System**
*This system is built entirely from old computer parts on a Chinese knock-off motherboard that supports dual CPUs but has an unusual PCIe slot layout. The GPUs running in this system are an Nvidia Quadro T400 2GB and a Tesla P40 24GB. This system is mainly used for testing and experiments. \(12/24/2023\)*
**CPU(s)** \: 2 \* Intel Xeon E5-2696 V3
**MB Chipset** \: Intel Wellsburg C612, Intel Haswell-EP
**OS** \: Windows 10 Professional & WSL2
```bash
Coreinfo v3.6 - Dump information on system CPU and memory topology
Copyright (C) 2008-2022 Mark Russinovich
Sysinternals - www.sysinternals.com
Logical Processor to NUMA Node Map:
NUMA Node 0:
************************************
------------------------------------
NUMA Node 1:
------------------------------------
************************************
```
:::info
:information_source: **CPU flags from E5-2696V3 & EPYC 7742**
Article providing information on common CPU flags.
> https://www.baeldung.com/linux/proc-cpuinfo-flags
***Intel(R) Xeon(R) CPU E5-2696 v3 @ 2.30GHz***
```bash
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt md_clear flush_l1d arch_capabilities
```
***AMD EPYC 7742 64-Core Processor***
```bash
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
```
:::
**Memory** \:
*Quad channel on both CPUs; each socket has 64GB with a 16GB module in each slot. The modules run at 2133, not 2666.*

## NUMA System

:::info
:bulb: **Distributed Memory : Multi-Computer**
* A communication network is needed \(InfiniBand or Ethernet\)
* Memory is siloed per node, so one node modifying its own memory has no effect on the other nodes.
:::
# Algorithmica
*In this section, only additional sources and add-on content are listed. Please read the original book and use this note as supporting material.*
> https://en.algorithmica.org/hpc/architecture/layout/
## Machine Code Layout
### Pipeline of a CPU
:::success
:information_source: **CPU Detailed Arch**
The CPU used in this section is the E5-2696V3 \(an OEM version of the E5-2699V3\). The following document can be referred to for architecture details.
https://pages.cs.wisc.edu/~rajwar/papers/ieee_micro_haswell.pdf
:::
### Instruction Cache
:::info
:bulb: **Harvard Architecture V.S. Von Neumann Architecture**
Building Your Own CPU from Scratch!! Week 1, Group 1 After-class KIM Tutorial \(2022/07/07\)
https://hackmd.io/@accomdemy/HyBd36Wiq#:~:text=%E5%8E%9F%E6%9C%AC%E7%B4%94%E9%A6%AE%E7%B4%90%E6%9B%BC%E6%9E%B6%E6%A7%8B,%E6%9C%89%E6%9B%B4%E5%A5%BD%E7%9A%84%E8%A1%A8%E7%8F%BE%E3%80%82
Why do computers still use the von Neumann architecture instead of the Harvard architecture?
https://www.getit01.com/p201711163710/
:::
### Unequal Branches
In this section, Algorithmica provides an example of using the compiler hint `[[unlikely]]`. However, as I researched further, this usage is described as obsolete in a post \(https://stackoverflow.com/questions/2738835/learning-sample-of-likely-and-unlikely-compiler-hints\). Refer to the following article: *How branches influence the performance of your code and what can you do about it?* \(https://johnysswlab.com/how-branches-influence-the-performance-of-your-code-and-what-can-you-do-about-it/\)
### Instruction-Level Parallelism
**Branchless Programming**
***Example Code***
```cpp
// layer_conv2d.cpp in mpMLFramework
...
size_t framework::conv2d::setParam(const float* buffer)
{
    wMEM = make_shared<float[]>(weightSize);
    if (wMEM == nullptr) {
        printf("Unable to allocate wMEM for %s layer", this->name.c_str());
        exit(EXIT_FAILURE);
    }
    memcpy(wMEM.get(), buffer, weightSize * sizeof(float));

    if (useBias) {
        bMEM = make_shared<float[]>(biasSize);
        if (bMEM == nullptr) {
            printf("Unable to allocate bMEM for %s layer", this->name.c_str());
            exit(EXIT_FAILURE);
        }
        memcpy(bMEM.get(), buffer + weightSize, biasSize * sizeof(float));
    }
    return weightSize + static_cast<int>(useBias) * biasSize;
}
```
In the above code, the `return` statement uses `static_cast` to convert the `bool` `useBias` into an `int` and multiply it by `biasSize`, so no branch is needed to decide how many elements were consumed. Please refer to my GitHub for the full code.
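To illustrate the pattern in isolation, here is a minimal sketch \(my own example, not part of mpMLFramework\) comparing a branchy and a branchless way of computing the same return value:
```cpp
#include <cstddef>

// Branchy version: the compiler may emit a conditional jump.
size_t consumedBranchy(bool useBias, size_t weightSize, size_t biasSize)
{
    if (useBias)
        return weightSize + biasSize;
    return weightSize;
}

// Branchless version: the bool converts to 0 or 1, so a multiply replaces the branch.
size_t consumedBranchless(bool useBias, size_t weightSize, size_t biasSize)
{
    return weightSize + static_cast<size_t>(useBias) * biasSize;
}
```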
## Compilation
:::info
:information_source: **Compiler Settings**
**SIMD Compiler Options**
https://stackoverflow.com/questions/71229343/what-exactly-do-the-gcc-compiler-switches-mavx-mavx2-mavx512f-do
:::
### Likeliness of Branches
It's better to inspect the assembly when using this hint. The following article provides an example \(https://stackoverflow.com/questions/72471900/why-isn-t-my-code-with-c20-likely-unlikely-attributes-faster\).
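As a reminder of the syntax, here is a minimal usage sketch of the C++20 attribute \(my own example; whether it actually changes the generated code should be verified in the assembly, as the post above shows\):
```cpp
// The error path is annotated so the compiler can lay out the hot path first.
int scale(int value)
{
    if (value < 0) [[unlikely]] {
        return -1; // rarely-taken error path
    }
    return value * 2;
}
```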
### Profile-Guided Optimization
Using PGO in Visual Studio.
https://learn.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-170
### Precomputation
Since C++14, `constexpr` functions can be very useful for building lookup tables or for any computation whose inputs are known at compile time.
https://tjsw.medium.com/%E6%BD%AE-c-constexpr-ac1bb2bdc5e2
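A minimal sketch of a compile-time lookup table \(my own example; the table contents are arbitrary\). In C++14 a `constexpr` function may contain loops, so the table can be filled entirely by the compiler:
```cpp
#include <cstdio>

struct SquareTable {
    int value[256];
};

// Computed at compile time; no runtime work is needed to fill the table.
constexpr SquareTable makeSquareTable()
{
    SquareTable t{};
    for (int i = 0; i < 256; ++i)
        t.value[i] = i * i;
    return t;
}

constexpr SquareTable squareTable = makeSquareTable();

int main()
{
    printf("%d\n", squareTable.value[12]); // prints 144
    return 0;
}
```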
## Profiling
Profiling methods at different scopes :
* Instrumentation : Program level \(timing, record events, ...\)
* Statistical Profiling : Assembly level \(branch prediction, cache misses, ...\)
* Program Simulation : Hardware level
### Linux Perf Tool
### Valgrind
### Profiling with Visual Studio
## Arithmetic
:::info
:bulb: **Please check out another article of mine**
AI Embedded Systems Algorithm Optimization and Implementation
https://hackmd.io/JAZbc0TpTjSH-EqTowxYUQ
\[TODO : Edit the above URL so that guest access via a shared link is possible\]
:::
## Rounding Error
**Accumulate floating point**
```cpp
float X = 0.0f;
for (int i = 0; i < (1 << 24) + 1; ++i) {
    X++;
}
printf("%f\n", X);
```
In the above code, `X` ends up equal to `1 << 24` instead of `(1 << 24) + 1`, because a `float` has a 24-bit mantissa and cannot represent `(1 << 24) + 1`; the value is rounded back down to `1 << 24`. In other words, once `X` reaches `1 << 24`, every further increment of 1 is rounded away, so the accumulation stays stuck at `1 << 24` for all remaining iterations.
:::info
:bulb: **Check if Two Floating Point Equal**
The following function checks whether the difference between two floating point numbers is within the epsilon of `float32`.
```cpp
inline bool fequ(const float& a, const float& b)
{
    constexpr float epsilon = std::numeric_limits<float>::epsilon();
    if (std::abs(a - b) <= epsilon)
        return true;
    return false;
}
```
:::
## External Memory
### External Memory Model
:::info
:information_source: **Standard RAM Model v.s. External Memory Model**
**Standard RAM Model** is used in most bachelor-level algorithms classes to estimate the big-O of an algorithm. This model ignores the facts that *different operations take different amounts of time to complete* and that *operations on data stored in different types of memory can take significantly different amounts of time*.
**External Memory Model** ignores every operation that is not an I/O operation.
:::
In the article, the following assumptions are made.
**Size of Dataset \(in external memory\)** : N
**Read/Write in a Batch** : B \(reading a whole block takes the same time as reading a single element in it\)
**Internal Memory Size** : M \(We can store up to M elements in internal memory\)
**N >> M >> B**
### External Sorting
**Tall-Cache Assumption**
https://www.youtube.com/watch?v=DQzWB7SJIOQ
*In this section, I will only include code snippets and write down how I understand them. Please refer to the article for more detail.*
**Step 1 : Read input.bin file and split it into part-xxx.bin files**
```cpp
int* part = new int[M];

while (true) {
    // static int part[M]; // in original program
    int n = fread(part, 4, M, input);
    if (n == 0)
        break;
    std::sort(part, part + n);

    char fpart[sizeof("part-999.bin")];
    sprintf(fpart, "part-%03d.bin", (int)parts.size()); // cast: %d expects int
    printf("Writing %d elements into %s...\n", n, fpart);

    FILE* file = fopen(fpart, "wb");
    fwrite(part, 4, n, file);
    fclose(file);

    file = fopen(fpart, "rb");
    parts.push_back(file);
}
delete[] part;
```
**Step 2 : Use priority_queue to select the smallest element out of all part**
Create a `Pointer` data structure that stores an element's value and the part it came from.
```cpp
bool operator<(const Pointer& other) const {
    return key > other.key; // reversed [std::priority_queue is a max-heap by default]
}
```
The `priority_queue` holds only **one** element from each part at a time. `l` marks which element of each part should be processed next.
```cpp
std::priority_queue<Pointer> q;
const int nparts = parts.size();
auto buffers = new int[nparts][B];
int* l = new int[nparts]; // index of the next element to consume from each part
int* r = new int[nparts]; // number of elements read into each part's buffer
for (int part = 0; part < nparts; part++) {
    l[part] = 1;
    r[part] = fread(buffers[part], 4, B, parts[part]);
    q.push({ buffers[part][0], part });
}
}
```
Finally, after all parts have been processed, whatever remains in the output buffer must be written to the file.
```cpp
fwrite(outBuffer, 4, buffered, output);
```
In this algorithm we only maintain the input buffers, one output buffer, `l`, `r`, and a `priority_queue` whose size equals the number of parts \(one element per part\). *The amount of extra memory used for the merge is therefore small.*
### Cache-Oblivious Algorithms
> https://en.algorithmica.org/hpc/external-memory/oblivious/
* Cache-aware algorithms \: algorithms that are efficient for known B and M.
* Cache-oblivious algorithms \: algorithms that are efficient for any B and M.
:bulb: **Matrix Transposition**

The article provides the following implementation.
> https://en.algorithmica.org/hpc/external-memory/oblivious/
```cpp
void matrixTranspose(int* X, size_t N)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < i; j++)
            std::swap(X[j * N + i], X[i * N + j]);
}
```
**Blocking Matrix Transpose**
In this section I will refer to the following Stack Overflow post.
> https://stackoverflow.com/questions/16737298/what-is-the-fastest-way-to-transpose-a-matrix-in-c
```cpp
void matrixTransposeScalarBlock(int blockIdxY, int blockIdxX, const int N, const int M, const int blockSize, float* src, float* dst)
{
    int borderY = (blockIdxY + blockSize) < N ? (blockIdxY + blockSize) : N;
    int borderX = (blockIdxX + blockSize) < M ? (blockIdxX + blockSize) : M;
    for (int i = blockIdxY; i < borderY; i++) {
        for (int j = blockIdxX; j < borderX; j++) {
            dst[i + j * N] = src[i * M + j];
        }
    }
}

void matrixTransposeBlocks(float* src, float* dst, const int N, const int M, const int blockSize)
{
    for (int blockIdxY = 0; blockIdxY < N; blockIdxY += blockSize) {
        for (int blockIdxX = 0; blockIdxX < M; blockIdxX += blockSize) {
            matrixTransposeScalarBlock(blockIdxY, blockIdxX, N, M, blockSize, src, dst);
        }
    }
}
```
**SSE & Improved Version with Visual Studio MSVC**
We can use a macro defined in `<xmmintrin.h>` in MSVC. A visual guide to this macro is provided below.
```cpp
void matrixTransposeScalarBlock(int blockIdxY, int blockIdxX, const int N, const int M, const int blockSize, float* src, float* dst)
{
    if (blockSize != 4) // TODO : Move the checking out
        exit(EXIT_FAILURE);

    __m128 R0 = _mm_loadu_ps((src + ((blockIdxY + 0) * M + blockIdxX)));
    __m128 R1 = _mm_loadu_ps((src + ((blockIdxY + 1) * M + blockIdxX)));
    __m128 R2 = _mm_loadu_ps((src + ((blockIdxY + 2) * M + blockIdxX)));
    __m128 R3 = _mm_loadu_ps((src + ((blockIdxY + 3) * M + blockIdxX)));

    _MM_TRANSPOSE4_PS(R0, R1, R2, R3);

    _mm_storeu_ps((dst + ((blockIdxX + 0) * N + blockIdxY)), R0);
    _mm_storeu_ps((dst + ((blockIdxX + 1) * N + blockIdxY)), R1);
    _mm_storeu_ps((dst + ((blockIdxX + 2) * N + blockIdxY)), R2);
    _mm_storeu_ps((dst + ((blockIdxX + 3) * N + blockIdxY)), R3);
}

void matrixTransposeBorderBlock(int blockIdxY, int blockIdxX, const int N, const int M, const int blockSize, float* src, float* dst)
{
    int borderY = (blockIdxY + blockSize) < N ? (blockIdxY + blockSize) : N;
    int borderX = (blockIdxX + blockSize) < M ? (blockIdxX + blockSize) : M;
    for (int i = blockIdxY; i < borderY; i++) {
        for (int j = blockIdxX; j < borderX; j++) {
            dst[i + j * N] = src[i * M + j];
        }
    }
}

void matrixTransposeBlocks(float* src, float* dst, const int N, const int M, const int blockSize)
{
    for (int blockIdxY = 0; blockIdxY < N; blockIdxY += blockSize) {
        for (int blockIdxX = 0; blockIdxX < M; blockIdxX += blockSize) {
            if (blockIdxY + blockSize <= N && blockIdxX + blockSize <= M) {
                matrixTransposeScalarBlock(blockIdxY, blockIdxX, N, M, blockSize, src, dst);
            }
            else {
                matrixTransposeBorderBlock(blockIdxY, blockIdxX, N, M, blockSize, src, dst);
            }
        }
    }
}
```
**Implementation of _MM_TRANSPOSE4_PS(R0, R1, R2, R3)**
```cpp
#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) {          \
    __m128 _Tmp3, _Tmp2, _Tmp1, _Tmp0;                        \
                                                              \
    _Tmp0 = _mm_shuffle_ps((row0), (row1), 0x44);             \
    _Tmp2 = _mm_shuffle_ps((row0), (row1), 0xEE);             \
    _Tmp1 = _mm_shuffle_ps((row2), (row3), 0x44);             \
    _Tmp3 = _mm_shuffle_ps((row2), (row3), 0xEE);             \
                                                              \
    (row0) = _mm_shuffle_ps(_Tmp0, _Tmp1, 0x88);              \
    (row1) = _mm_shuffle_ps(_Tmp0, _Tmp1, 0xDD);              \
    (row2) = _mm_shuffle_ps(_Tmp2, _Tmp3, 0x88);              \
    (row3) = _mm_shuffle_ps(_Tmp2, _Tmp3, 0xDD);              \
}
```
When using `_mm_shuffle_ps`, we can use another macro `_MM_SHUFFLE(0, 0, 0, 0)` to help us create a mask for it.
```cpp
printf("0x%02x\n", _MM_SHUFFLE(1, 0, 1, 0));
/*
    Note that the lo-to-hi direction inside a register is the opposite of how
    we usually write an array: loading {1, 2, 3, 4} from memory gives lo-to-hi
    {1, 2, 3, 4}, which is often drawn hi-to-lo as {4, 3, 2, 1} for clarity.

    c = _mm_shuffle_ps(a, b, _MM_SHUFFLE(l, k, j, i));
    // result, hi-to-lo: c3, c2, c1, c0
    // c0, c1 come from a; c2, c3 come from b
    c0 = a[i]
    c1 = a[j]
    c2 = b[k]
    c3 = b[l]
    with i, j, k, l in {0, 1, 2, 3}
*/
// Output
0x44
```
**Walk-through of the entire shuffle stage.** Keep in mind that each register is shown from hi to lo to match the order used by `_MM_SHUFFLE(l, k, j, i)`.

**AVX2 Version with Visual Studio MSVC**
Please refer to the following post for the code, or to my GitHub, as I have included it in my implementation.
> Implementation Reference : https://stackoverflow.com/questions/16941098/fast-memory-transpose-with-sse-avx-and-openmp
**_mm256_unpacklo_ps example**
> Reference : https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX
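A minimal sketch \(my own, not from the reference\) showing what `_mm256_unpacklo_ps` produces: it interleaves the two low elements of each 128-bit lane of its operands.
```cpp
#include <immintrin.h>
#include <cstdio>

int main()
{
    __m256 a = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    __m256 b = _mm256_setr_ps(10, 11, 12, 13, 14, 15, 16, 17);
    __m256 c = _mm256_unpacklo_ps(a, b);

    float out[8];
    _mm256_storeu_ps(out, c);
    for (int i = 0; i < 8; ++i)
        printf("%g ", out[i]); // expected: 0 10 1 11 4 14 5 15
    printf("\n");
    return 0;
}
```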

**Hand Tracing the code**

**_mm256_permute2f128_ps**
```cpp
DEFINE SELECT4(src1, src2, control) {
    CASE(control[1:0]) OF
    0: tmp[127:0] := src1[127:0]
    1: tmp[127:0] := src1[255:128]
    2: tmp[127:0] := src2[127:0]
    3: tmp[127:0] := src2[255:128]
    ESAC
    IF control[3]
        tmp[127:0] := 0
    FI
    RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0
```
The above explanation is from the Intel Intrinsics Guide. It is actually easy to read; just remember that **the control bits are always read from right \(lo\) to left \(hi\)**. A concrete example is shown below.
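For example \(my own sketch\), with the control byte `0x20` the low 128 bits of the result come from the low lane of `a` \(`imm8[3:0] = 0`\) and the high 128 bits from the low lane of `b` \(`imm8[7:4] = 2`\):
```cpp
#include <immintrin.h>
#include <cstdio>

int main()
{
    __m256 a = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    __m256 b = _mm256_setr_ps(10, 11, 12, 13, 14, 15, 16, 17);
    __m256 c = _mm256_permute2f128_ps(a, b, 0x20);

    float out[8];
    _mm256_storeu_ps(out, c);
    for (int i = 0; i < 8; ++i)
        printf("%g ", out[i]); // expected: 0 1 2 3 10 11 12 13
    printf("\n");
    return 0;
}
```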
:::info
:bulb: **An integer version of blocking Transpose with SSE**
https://hackmd.io/@Rance/Byp2oH1AM?type=view
:::
:::success
:warning: **Swap Implementation v.s. Bit Hack**
In the following post there is a discussion comparing the standard library implementation of `std::swap` to an `xor` version.
https://stackoverflow.com/questions/10549155/why-swap-doesnt-use-xor-operation-in-c
The following code snippet is an excerpt from MSVC's `std::swap` implementation with `/std:c++20` set.
```cpp
_CONSTEXPR20 void swap(_Ty& _Left, _Ty& _Right) noexcept(
    is_nothrow_move_constructible_v<_Ty>&& is_nothrow_move_assignable_v<_Ty>) {
    _Ty _Tmp = _STD move(_Left);
    _Left = _STD move(_Right);
    _Right = _STD move(_Tmp);
}
```
:::
:::info
:bulb: **Rounding up to the nearest multiple of a number**
The following Stack Overflow post provides a discussion; see \#176.
> https://stackoverflow.com/questions/3407012/rounding-up-to-the-nearest-multiple-of-a-number
```cpp
inline int roundUp(int X, int M) {
    return ((X + (M - 1)) / M) * M;
}

// only valid when M is a power of two
inline int roundUp_PW2(int X, int M) {
    return ((X + (M - 1)) & -M);
}

// rounds up (toward +infinity) to the nearest multiple of M, also correct for negative X
inline int roundUp_negative(int X, int M) {
    int isPositive = static_cast<int>(X >= 0);
    return ((X + (isPositive * (M - 1))) / M) * M;
}
...
std::cout << roundUp(3001, 5) << "\n\n";
std::cout << roundUp_PW2(3000, 16) << "\n\n";
std::cout << roundUp_negative(-3001, 5) << "\n\n";
```
Output of the above code
```bash
3005
3008
-3000
```
:::
**Matrix Multiplication**
* Matrix Multiplication with SIMD
Please refer to another post ***SIMD Programming Note***
https://hackmd.io/@Erebustsai/HkdXPx-rh
* Matrix Multiplication with Quantization
Please refer to another post ***AI Embedded Systems Algorithm Optimization and Implementation***
https://hackmd.io/@Erebustsai/Sk39malXa
### Spatial and Temporal Locality \[TODO\]
## RAM & CPU Caches
### Cache Line
> https://en.algorithmica.org/hpc/cpu-cache/cache-lines/
> The important practical lesson when designing and analyzing memory-bound algorithms is to count the number of cache lines accessed and not just the total count of memory reads and writes.
Cache Line Size of E5-2696v3
```bash
xxxxxx@DESKTOP-XXXXXXX:/sys/devices/system/cpu/cpu0/cache/index0$ cat coherency_line_size
64
```
Cache Line Size of EPYC 7742
```bash
xxxxxx@xxxxxxxxxxxxxx:/sys/devices/system/cpu/cpu0/cache/index0$ cat coherency_line_size
64
```
The cache line size above is reported in **bytes**, meaning the unit of a read/write to/from the cache is 64 bytes. This number appears to be consistent across x86 systems, considering that these two CPUs come from different vendors and different generations. *\(Intel E5-2696v3 2015 v.s. AMD EPYC 7742 2019\)*
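Instead of reading sysfs by hand, the same value can be queried programmatically. A minimal sketch \(my own; `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension of `sysconf`\):
```cpp
#include <unistd.h>
#include <cstdio>

int main()
{
    // glibc extension; returns -1 or 0 if the value is unavailable.
    long lineSize = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1D cache line size: %ld bytes\n", lineSize); // 64 on both CPUs above
    return 0;
}
```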
### Prefetching
The article shows how to prefetch data using the GCC intrinsic `__builtin_prefetch(&a[k])`. However, MSVC has no such intrinsic; we can use `_mm_prefetch()` to achieve a similar effect \(a minimal sketch follows the reference below\). In this section, I will go through the following tutorial to produce an implementation of **Matrix Transpose** with cache prefetching and SIMD.
> https://hackmd.io/@dbdb/S1us_k_0?type=view
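Before that, here is a minimal sketch of `_mm_prefetch` itself \(my own example, not the tutorial's code\): prefetching one cache line ahead while streaming over an array.
```cpp
#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

float sumWithPrefetch(const float* data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        // Hint the hardware to bring the line 16 floats (64 bytes) ahead
        // into all cache levels; harmless if the line is already resident.
        if (i + 16 < n)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + 16), _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```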
# Algorithm Case Studies
## Exponentiating by Squaring
https://oi-wiki.org/math/binary-exponentiation/
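The reference covers the theory; below is a minimal sketch of the iterative version \(my own example\) computing `base^exp mod m`, assuming `m` is small enough that `base * base` fits in 64 bits.
```cpp
#include <cstdint>

uint64_t powMod(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1;
    base %= m;
    while (exp > 0) {
        if (exp & 1)                  // this bit of the exponent is set
            result = result * base % m;
        base = base * base % m;       // square for the next bit
        exp >>= 1;
    }
    return result;
}
```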
## Binary GCD
> Reference of GCD
> https://jason-chen-1992.weebly.com/home/-euclidean-algorithm
> https://en.algorithmica.org/hpc/algorithms/gcd/
Please refer to the above articles for the observations that lead to Binary GCD; a minimal sketch follows.
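A minimal sketch of Binary GCD \(my own write-up of the classic algorithm, not the article's optimized version\), using GCC's `__builtin_ctzll` to strip trailing zero bits:
```cpp
#include <algorithm>
#include <cstdint>

uint64_t binaryGcd(uint64_t a, uint64_t b)
{
    if (a == 0) return b;
    if (b == 0) return a;

    int shift = __builtin_ctzll(a | b); // common factors of two
    a >>= __builtin_ctzll(a);           // a is now odd
    do {
        b >>= __builtin_ctzll(b);       // b is now odd
        if (a > b)
            std::swap(a, b);
        b -= a;                         // odd - odd = even (or zero)
    } while (b != 0);
    return a << shift;
}
```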
:::info
:bulb: **GCC Builtin Functions**
These GCC built-in functions can provide optimized implementations of certain operations. The following articles cover the most commonly used built-ins.
However, I would only consider using these functions **when optimizing code, not during development**. Code should run **correctly before it runs fast**.
* 6.60 Other Built-in Functions Provided by GCC
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
* There is a kind of function called a built-in function \(有一種函數,叫內建函數\)
https://www.zhaixue.cc/c-arm/c-arm-builtin.html
* An introduction to gcc's __builtin_ functions \(gcc的__builtin_函數介紹\)
https://blog.csdn.net/jasonchen_gbd/article/details/44948523
* An introduction to GCC builtin functions and how they are implemented \(1\)
https://blog.csdn.net/weixin_46804051/article/details/125134414
* Builtin functions of GCC compiler
https://www.geeksforgeeks.org/builtin-functions-gcc-compiler/
:::
## Article Study \: Beating NumPy's Matrix Multiplication in 150 Lines of C Code
> Reference \:
> * Article
> https://salykova.github.io/matmul-cpu
> * Github Repository
> https://github.com/salykova/matmul.c
### Python Numpy Backend Information
```clike
>>> import numpy as np
>>> np.show_config()
Build Dependencies:
blas:
detection method: pkgconfig
found: true
include directory: /usr/local/include
lib directory: /usr/local/lib
name: openblas64
openblas configuration: USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= HASWELL MAX_THREADS=2
pc file directory: /usr/local/lib/pkgconfig
version: 0.3.23.dev
lapack:
detection method: internal
found: true
include directory: unknown
lib directory: unknown
name: dep139863411681952
openblas configuration: unknown
pc file directory: unknown
version: 1.26.4
Compilers:
c:
args: -fno-strict-aliasing
commands: cc
linker: ld.bfd
linker args: -Wl,--strip-debug, -fno-strict-aliasing
name: gcc
version: 10.2.1
c++:
commands: c++
linker: ld.bfd
linker args: -Wl,--strip-debug
name: gcc
version: 10.2.1
cython:
commands: cython
linker: cython
name: cython
version: 3.0.8
Machine Information:
build:
cpu: x86_64
endian: little
family: x86_64
system: linux
host:
cpu: x86_64
endian: little
family: x86_64
system: linux
Python Information:
path: /opt/python/cp310-cp310/bin/python
version: '3.10'
SIMD Extensions:
baseline:
- SSE
- SSE2
- SSE3
found:
- SSSE3
- SSE41
- POPCNT
- SSE42
- AVX
- F16C
- FMA3
- AVX2
not found:
- AVX512F
- AVX512CD
- AVX512_KNL
- AVX512_KNM
- AVX512_SKX
- AVX512_CLX
- AVX512_CNL
- AVX512_ICL
```
The above output is from an EPYC server with an **EPYC 7742**, which does not support AVX-512 and only supports up to AVX2. My git repository \[https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/AlgorithmRepo_SIMDx86/GEMM\] for SIMD x86 matrix multiplication only uses AVX2 because I currently \(2024_07_22\) do not have any CPU that supports AVX-512.
:::info
:information_source: **Hardware Detail**
The following post is the detail record of my homelab and the setting of the homelab.
https://hackmd.io/EhDRrZWEQ4it2FMZZPVE4w?view
:::
### Naive Implementation
:::success
:bulb: **Loop Body Optimization**
https://hackmd.io/JAZbc0TpTjSH-EqTowxYUQ#Common-Software-Optimization
:::
```cpp
void matrixMull(int** a, int** b, int** c, size_t size)
{
    for (int i = 0; i < size; ++i)
        for (int j = 0; j < size; ++j)
            for (int k = 0; k < size; ++k)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

void matrixMull_opt(int** a, int** b, int** c, size_t size)
{
    for (int i = 0; i < size; ++i)
        for (int k = 0; k < size; ++k)
            for (int j = 0; j < size; ++j)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
```
In the above code, the three `for` loops can be reordered freely. Therefore, the ordering that best exploits cache locality is the faster version: with `k` in the middle loop, the innermost loop walks `b[k][j]` and `c[i][j]` sequentially along rows while `a[i][k]` stays constant.
:::success
:bulb: **Matrix Multiplication with AVX2**
https://hackmd.io/6h4V3xR6TIaa5xx6dzML1Q#Matrix-Multiplication-256-bits
:::
### Caching
*The main difference between the above AVX2 version and this article is the use of cache blocking and multi-core utilization.*
:::success
:bulb: **Cache \& Ram**
https://hackmd.io/FiPAsD0pQfempm4gPABisQ?view#Spatial-and-Temporal-Locality-TODO
:::

### Multithreading
```cpp
#pragma omp parallel for num_threads(NTHREADS) schedule(static)
```
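A minimal sketch \(my own, not the article's code\) of how such a pragma is typically applied, parallelizing the outer loop of the i-k-j ordering shown earlier; `NTHREADS` is assumed to be defined elsewhere:
```cpp
#include <omp.h>

void matrixMull_omp(int** a, int** b, int** c, size_t size)
{
    // Each thread gets a static, contiguous chunk of output rows,
    // so no two threads ever write the same c[i][j].
    #pragma omp parallel for num_threads(NTHREADS) schedule(static)
    for (int i = 0; i < static_cast<int>(size); ++i)
        for (int k = 0; k < static_cast<int>(size); ++k)
            for (int j = 0; j < static_cast<int>(size); ++j)
                c[i][j] += a[i][k] * b[k][j];
}
```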
:::info
:information_source: **OpenMP Scheduling Strategy**
* OpenMP - Scheduling(static, dynamic, guided, runtime, auto)
https://610yilingliu.github.io/2020/07/15/ScheduleinOpenMP/
* OpenMP Dynamic vs Guided Scheduling
https://stackoverflow.com/questions/42970700/openmp-dynamic-vs-guided-scheduling
The second post above, from Stack Overflow, provides a good discussion of how to choose a scheduling strategy.
:::
https://github.com/flame/blis/blob/master/docs/Multithreading.md
# Study : Optimizing software in C++ An optimization guide for Windows, Linux, and Mac platforms by Agner Fog
> Reference \: https://agner.org/optimize/optimizing_cpp.pdf
*This section, as usual, will only contain extra material or things I looked up online. Please refer to the author's website for more optimization techniques. \[https://agner.org/?e=0#0\]*
## Compute Library
Since I only have an AMD EPYC 7002-series CPU and a Xeon E5 v3 CPU, I will be able to use both the Intel Math Kernel Library and the AMD Optimizing CPU Libraries.
* AMD Optimizing CPU Libraries (AOCL)
https://www.amd.com/zh-tw/developer/aocl.html
* Intel Math Kernel Library
*TODO \: Since the Windows testing server is only used through WSL, a hardware update might be needed before the Windows programming environment can be unlocked again.*
## User Interface
*Since I mostly do not work on programs with a GUI, I will focus only on how to design a good command-line interface.*
I will study through the following guide.
:::success
https://clig.dev/
:::
## The Efficiency of Different C++ Constructs
:::success
\# 1
In assembly, function parameters are pushed onto the stack. The stack resides in memory, not in the CPU; only the stack pointer \(the address of the top of the stack\) is stored in a CPU register.
:::
:::success
\# 2
> Floating point variables use a different kind of registers. There are eight floating point registers available in 32-bit operating systems, sixteen in 64-bit operating systems, and thirty two when the AVX512 instruction set is enabled in 64-bit mode.
:::
:::success
\#3
`volatile` can be used to prevent the compiler from optimizing away a variable whose value can be changed by other threads.
> The effect of the keyword volatile is that it makes sure the variable is stored in memory rather than in a register and prevents all optimizations on the variable.
:::
:::success
\#4
> Thread-local storage is inefficient because it is accessed through a pointer stored in a thread environment block. Static thread-local storage should be avoided, if possible, and replaced by storage on the thread's own stack
> The need for static thread-local storage can be avoided in most cases by declaring the variables or objects inside the thread function
**thread_local Storage in C++ 11**
https://www.geeksforgeeks.org/thread_local-storage-in-cpp-11/
:::
:::success
\#5 :bulb:
> The pre-increment operator ++i and the post-increment operator i++ are as fast as additions. When used simply to increment an integer variable, it makes no difference whether you use pre-increment or post-increment. The effect is simply identical. For example, `for (i=0; i<n; i++)` is the same as `for (i=0; i<n; ++i)`. But when the result of the expression is used, then there may be a difference in efficiency. For example, `x = array[i++]` is more efficient than `x = array[++i]` because in the latter case, the calculation of the address of the array element has to wait for the new value of `i` which will delay the availability of `x` for approximately two clock cycles. Obviously, the initial value of `i` must be adjusted if you change pre-increment to post-increment. There are also situations where pre-increment is more efficient than post-increment. For example, in the case `a = ++b;` the compiler will recognize that the values of a and b are the same after this statement so that it can use the same register for both, while the expression `a = b++;` will make the values of a and b different so that they cannot use the same register.
:mag: The only special case that matters to me is using increment or decrement inside an array index: making the array access wait for the new index value is what needs attention.
**\[5-Minute Exploration\] Are i++ and ++i the same in functionality and performance? An introduction with code examples \(C\#\)**
https://nwpie.blogspot.com/2017/07/5-i-i.html
:::
:::success
\# 6
> Reference \: https://agner.org/optimize/optimizing_cpp.pdf
*Please refer to chapter 7\.3 for the detailed advantages and disadvantages of using x87 registers versus vector registers for floating point operations.*
**x87 Registers**
> The original method of doing floating point operations involves eight floating point registers organized as a register stack, called x87 registers. These registers have long double precision (80 bits).
**Vector Registers \(SIMD\)**
> A newer method of doing floating point operations involves vector registers (XMM, YMM, or ZMM) which can be used for multiple purposes. Floating point operations are done with single or double precision, and intermediate results are always calculated with the same precision as the operands.
**Summary**
> Modern compilers will use vector registers for floating point calculations whenever they are available, i.e. in 64-bit mode or when the SSE or higher instruction set is enabled. Few compilers are able to mix the two types of floating point operations and choose the type that is optimal for each calculation.
* **Did any compiler fully use Intel x87 80-bit floating point?**
https://retrocomputing.stackexchange.com/questions/9751/did-any-compiler-fully-use-intel-x87-80-bit-floating-point
:::
:::success
\# 7
Smart pointers are useful when one function allocates memory and another function needs to deallocate it. Using them has a cost compared to raw pointers.
> If a program uses many small dynamically allocated objects with each their smart pointer then you may consider if the cost of this solution is too high. It may be more efficient to pool all the objects together into a single container, preferably with contiguous memory.
:::
## Optimization in Compilers
*In this section, the following MIT course is reviewed while reading the book chapter. Mapping these two sources of material onto each other is the goal of this section.*
:::info
:information_source: **MIT 6.172 Performance Engineering of Software Systems**
*9. What Compilers Can and Cannot Do*
https://www.youtube.com/watch?v=ulJm7_aTiQM&list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&index=9
:::
### How Compiler Optimize
> Reference \: https://agner.org/optimize/optimizing_cpp.pdf
> Chapter 8\.1
* Function inlining
* Constant folding and constant propagation
* > An expression or subexpression containing only constants will be replaced by the calculated result.
* Pointer elimination
* Common subexpression elimination
* Register variables
* The most frequently used variables are kept in registers instead of memory.
* Live range analysis
* This optimization is basically about register reuse
* Eliminate jumps
* Loop unrolling
* Loop invariant code motion
* Statements whose results do not depend on the loop iteration can be moved outside of the loop.
* Vectorization
* Induction variables
* Scheduling
* Instruction Reorder
* Algebraic reductions
* Devirtualization
* > An optimizing compiler can bypass the virtual table lookup for a virtual function call if it is known which version of the virtual function is needed
### Compilers Comparison
> Reference \: https://agner.org/optimize/optimizing_cpp.pdf
> Chapter 8\.2
*In the guide, the author provides a very detailed compiler comparison across many different statements. Please refer to the guide.*
**Summary**
> The Clang and Gnu compilers are the ones that optimize best, while the Microsoft compiler has inferior performance.
### Obstacles That Prevent Compiler Optimization
* Cannot optimize across modules
* Pointer aliasing
* Use `__restrict__` on pointers that are not aliased \(see the sketch after this list\)
* Dynamic memory allocation
* Pure functions
* TODO
* Floating point induction variables
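A minimal sketch of the pointer-aliasing obstacle \(my own example, using the GCC/Clang spelling `__restrict__`; MSVC spells it `__restrict`\). Without the qualifier, the compiler must assume `dst` and `src` may overlap and reload `src[0]` on every iteration; with it, the load can be hoisted and the loop vectorized more aggressively.
```cpp
void scaleByFirst(float* __restrict__ dst, const float* __restrict__ src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[0] * dst[i]; // src[0] can now stay in a register
}
```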
## Optimizing Memory Access
> A cache is a proxy for the main memory in a computer.
### Cache Organization
:::info
:information_source: **Good Article About Cache**
https://hackmd.io/@sysprog/HkW3Dr1Rb
:::
## Specific Optimization Topics
* Use Lookup Tables
* Reading values from a predefined table, which can be created while programming, eliminates the calculation completely.
* Bounds Checking
* The following two ways of bounds-checking an array index are equivalent, since a negative number cast to `unsigned` becomes a very large value.
```cpp
float list[size];

if (i < 0 || i >= size) {
    ;
}

if ((unsigned int)i >= (unsigned int)size) {
    ;
}
```
* Use Bitwise Operators for Checking Multiple Values at Once
* Use bitwise operators on integers as Boolean vectors. *TODO*
* Integer Multiplication
* Bit shifting can replace multiplication by a power of two \(please refer to my GitHub for an example \[https://github.com/Chen-KaiTsai/ZeroDCE_JetsonNano\]\)
* Integer Division
* Floating Point Division
* Do Not Mix Float and Double
* Conversions Between Floating Point Numbers and Integers
* Using Integer Operations for Manipulating Floating Point Variables
* Mathematical Functions
* Static Versus Dynamic Libraries
* Position-independent Code
* *TODO*
* System Programming
* *TODO*
## Overview of Compiler Options
> Reference \: https://agner.org/optimize/optimizing_cpp.pdf
> Chapter 18
# 計算機系統開發與優化實戰
> Reference \:
> https://www.tenlong.com.tw/products/9787115592880
:::info
:information_source: **MMU \(Memory Management Unit\)**
When the CPU accesses system memory, it sees all memory addresses as virtual addresses, and the MMU is responsible for translating virtual addresses into physical addresses.
https://en.wikipedia.org/wiki/Memory_management_unit
:::
# Parallel and Distributed Algorithms by MobyLab
> Reference \:
> https://mobylab.docs.crescdi.pub.ro/en/docs/parallelAndDistributed/introduction/
# Others
## The Hoard Memory Allocator
> https://github.com/emeryberger/Hoard
A drop-in replacement for `malloc`.
Benchmark result on the E5-2696v3 with 4-channel 64GB 2133 memory in WSL2 \(*on a NUMA system, WSL2 will only use one CPU socket; there is no viable solution for this yet, 12/23/2023*\)
`~/Hoard/benchmarks/linux-scalability/linux-scalability 1048576 1024 18`
```bash
Object size: 1048576, Iterations: 1024, Threads: 18
Starting test...
Average execution time = 1.336873 seconds, standard deviation = 0.019881.
```
## Understand Hyper-Threading
Reference \:
> * Hyperthreading & SIMD
> https://comp.anu.edu.au/courses/comp4300/assets/lectures/Lecture_5_SIMD_4up.pdf
> * Is Hyper-Threading useless? An explanation of how Hyper-Threading works | Hyper-Threading | CPU | Intel
> https://www.youtube.com/watch?v=ztVZ3Mdw7vI
> * What is hyper-threading and how does it work?
> https://superuser.com/questions/122536/what-is-hyper-threading-and-how-does-it-work#:~:text=The%20point%20of%20hyperthreading%20is,of%20the%20processor%20in%20parallel.
> * SIMD register inside a core
> https://community.intel.com/t5/Processors/SIMD-register-inside-a-core/m-p/1563032?profile.language=zh-TW
### The Answer to How SIMD Registers Are Shared Between HT Cores
> The Intel® Core™ i9-10900K Processor has a total of 16 SIMD registers and each hyperthreading (HT) core will see its own 16 SIMD registers. As for the example provided the answer is no, they are not sharing, each hyperthreading core will see its own 16 SIMD registers. ***In general, all the Instruction Set Architecture (ISA) defined registers need to be duplicated per-HT context so each HT can behave as a logic processor with no interference with the other HT cores***.
## Larry Bank's Blog Post for Optimization
https://bitbanksoftware.blogspot.com/