# 2016q3 Homework3 (software-pipelining)
contributed by <`carolc0708`>

[Assignment A08: Software Pipelining](https://hackmd.io/s/ry7eqDEC)

## Prefetching
### References
* [Data Prefetch Mechanisms](http://www.cs.ucy.ac.cy/courses/EPL605/Fall2014Files/DataPrefetchSurvey.pdf)
* [Caches and Prefetching](http://web.cecs.pdx.edu/~alaa/courses/ece587/spring2012/notes/prefetching.pdf)
* [Prefetching](http://www.cc.gatech.edu/~hyesoon/fall11/lec_pref.pdf)

---

* Key terms
    * Prefetch hit: a prefetched line that is hit in the cache before being replaced (a miss avoided)
    * Prefetch miss: a prefetched line that is replaced before ever being accessed
    * Prefetch rate: prefetches per instruction (or per 1000 instructions)
    * Accuracy: the percentage of all prefetches that are prefetch hits
    * Coverage: the percentage of misses avoided thanks to prefetching
        * 100 x (Prefetch Hits / (Prefetch Hits + Cache Misses))

## [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
### Finished reading the whole paper; full notes (in Chinese) are [here](https://hackmd.io/s/rk8ivCT0)
* software prefetching
* hardware prefetching

> I still don't quite get the difference between SW prefetching and HW prefetching @@
>> SW prefetching is the `_mm_prefetch` the instructor wrote — it is under our control. HW prefetching is done directly by the hardware, which usually infers the next line to prefetch from the access history, so a HW prefetcher needs training. [name=Yen-Kuan Wu]

### 1. INTRODUCTION
* First, there are few rigorous guidelines on how best to insert prefetch intrinsics. Second, the complexity of the interaction between software and hardware prefetching is not well understood.
==> The goal is to provide guidelines for inserting prefetch intrinsics in the presence of a hardware prefetcher and, more importantly, to identify promising new directions for automated cooperative software/hardware prefetching schemes.
* Baseline: only compiler-inserted software prefetches, without any hardware prefetching.
    (1) What are the limitations and overheads of software prefetching?
    (2) What are the limitations and overheads of hardware prefetching?
    (3) When is it beneficial to use software and/or hardware prefetching?
When software prefetching targets short array streams, irregular memory address patterns, and L1 cache miss reduction, there is an overall positive impact, which the paper demonstrates with code examples. Note that, in this article, unit-stride cache-line accesses are called *streams*, and access-stride distances greater than two cache lines are called *strided*.

> stream & stride?
> HW prefetching ==> fetches in a stream or at an arbitrary stride
> A *stream* means the prefetcher grabs consecutive cache lines, while a *stride* means each prefetch skips a fixed number of cache lines ahead. [name=Yen-Kuan Wu]

### 2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
#### 2.1 Software Prefetch Intrinsics
#### 2.2 Prefetch Classification
#### 2.3 Software Prefetch Distance
* **software prefetch distance**: the distance ahead of which a prefetch should be requested on a memory address.
![](https://i.imgur.com/aCfmkqd.png)
*where l is the prefetch latency and s is the length of the shortest path through the loop body*
* If the distance is too large, prefetched data may evict useful cache blocks, and the elements at the beginning of the array may not be prefetched, leading to less coverage and more cache misses. Hence the prefetch distance has a significant effect on the overall performance of the prefetcher.
#### 2.4 Direct and Indirect Memory Indexing
* Direct memory indexing can be easily prefetched by hardware, since the memory addresses show regular stream/stride behavior, but indirect memory indexing requires special hardware prefetching mechanisms. Indirect indexing is relatively simpler to compute in software, so we expect software prefetching to be more effective than hardware prefetching in the indirect-indexing case.

> I don't quite get this @@
> Simply put, it is `a[i]` versus `a[b[i]]`: the first is plainly direct, the second indirect — the addresses it fetches jump around. This also answers your earlier question: stream vs. arbitrary stride. [name=Yen-Kuan Wu]
> Many thanks!!! ^^ [name=Carol Chen]

---
### 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
#### 3.1 Benefits of SW prefetching
* Large number of streams: the number of streams a stream prefetcher can track is limited by hardware resources.
* Short streams: if a stream is extremely short, there are not enough cache misses to train a hardware prefetcher and load useful cache blocks.
* Irregular memory accesses: irregular data structures.
* Cache locality hints: greater flexibility in placing data at the right cache level; each prefetch request carries a prefetch hint (Table II).
* Loop bounds: can be handled with little effort via loop unrolling, software pipelining, or branch instructions.
#### 3.2 Negative impacts of SW prefetching
* Increased instruction count: prefetches consume fetch and execution bandwidth and machine resources.
* Static insertion: cannot adapt to runtime behavioral changes.
* Code structure change
#### 3.3 Synergistic effects of SW & HW prefetching
* Handling multiple streams: e.g., a hardware prefetcher might cover the regular streams while software prefetching covers the irregular ones.
* Positive training: if a block prefetched by software arrives late, a trained hardware prefetcher can improve prefetch timeliness.
#### 3.4 Antagonistic effects of SW & HW prefetching
* Negative training:
    * If blocks prefetched by software hide part of one or more streams, the hardware prefetcher will not be trained properly.
    * Software prefetch instructions can trigger overly aggressive hardware prefetches, resulting in early requests.
* Harmful software prefetching: when software prefetch requests are incorrect or early, the increased stress on cache and memory bandwidth further reduces the effectiveness of hardware prefetching.

---
### 4. EXPERIMENTAL METHODOLOGY
#### 4.1 Prefetch Intrinsic Insertion Algorithm
* Prefetch candidate: any load whose L1 misses per thousand instructions (MPKI) exceeds 0.05.
* Prefetch distance: ![](https://i.imgur.com/gnnjX6l.png)
  where K is a constant factor, L is the average memory latency, IPC_bench is the profiled average IPC of each benchmark, and W_loop is the average instruction count in one loop iteration.
    * Chose K = 4, L = 300; IPC and W_loop come from profile data ==> profile-driven
    * Not applicable to complex data structures
#### 4.2 Simulation Methodology
* MacSim: evaluates software and hardware prefetching effects
* Pinpoints: selects a representative simulation region for each benchmark using the reference input set, so the same section of code is simulated across different binaries

---
### 5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
#### 5.1 Overhead and Limitations of Software Prefetching
* Instruction overhead
    * intrinsic instructions + extra instructions for handling indirect memory accesses and index calculations
    * GemsFDTD's instruction count increases by more than 50%, yet the performance impact is still positive
* Software prefetching overhead
    * To evaluate it, **eliminate** in turn the overhead of cache pollution (SW+P), bandwidth consumption (SW+B, between core and memory), memory access latency (SW+L), redundant prefetches (SW+R), and instruction overhead (SW+I)
    * SW+P (small): caused by early or incorrect prefetches
    * SW+B (small): current machines provide enough bandwidth for single-threaded applications
    * SW+L (large): software prefetching does not completely hide the memory latency
    * SW+R (small): the negative effect of redundant prefetch instructions is generally negligible
    * SW+I (small): even when the instruction count increases a lot

> What is cache pollution (caused by early or incorrect prefetches)?
> A prefetch replaces a line that is requested later [name=Carol Chen]
>> Simply put, the cache gets filled with ill-timed data. *Early* means the data was fetched too soon, so it has already been replaced by the time it is needed; *incorrect* means data that is never used was loaded. Both increase cache misses — they pollute the cache. [name=Yen-Kuan Wu]

* The effect of prefetch distance
![](https://i.imgur.com/MWe4YIx.png)
The x-axis shows the prefetch distance relative to the base prefetch distance.
> Why use IPC for performance evaluation?

![](https://i.imgur.com/Uo9JRId.png)
* Static distance vs. machine configuration
    * the optimal distance may vary by machine
    * for evaluation, compare the best-performing static distance across three machine configurations (base, less-aggressive, and aggressive processors)
    * observation: the static prefetch distance does not impact performance significantly, even if the machine configuration changes at runtime
* Cache-level insertion policy
    * Since software prefetching is generally accurate, prefetched blocks can safely be inserted directly into the L1 cache without much risk of polluting it.
    * **T0** always performs better than the other hints in all evaluated benchmarks
> "In general, an out-of-order processor can easily tolerate L1 misses, so hardware prefetchers usually insert prefetched blocks into the last-level cache to reduce the chance of polluting the L1 cache." ==> why??
#### 5.2 Effect of Using Hardware and Software Prefetching Together
* Hardware prefetcher training effects
    * NT (SW+GHB, SW+STR): the hardware prefetcher's training algorithm ignores software prefetch requests; it is trained only by demand cache misses.
    * Train (SW+GHB+T, SW+STR+T): the hardware prefetcher's training algorithm includes software prefetch requests, in addition to the usual demand misses.
> I don't quite get the concept of the training effect @@

* Prefetch coverage
    * coverage: the ratio of useful prefetches to total L2 misses
    * Overall, the coverage of software prefetching is higher than that of the hardware prefetchers, but for benchmarks in the neutral and negative groups the software coverage is lower ==> ==lower coverage== is the main reason for the performance loss in the neutral and negative groups
* Prefetching classification
    * Eliminating the latency overhead (SW+L) improves the performance of mcf by 9%.
#### 5.3 Using the GCC Compiler
> I don't quite understand how this step was carried out @@
* The gcc compiler inserts more software prefetches than icc, except for mcf and lbm.
#### 5.4 Real System Experiments
#### 5.5 Profile-Guided Optimization (PGO)
#### 5.6 Hardware Prefetcher for Short Streams
* Evaluate ASD (an abstraction of a HW prefetcher) along with software and hardware prefetching
* Although ASD can predict short-length streams, it still has to observe the first instances of a memory stream.
* Software prefetching is much more effective at prefetching short streams than ASD.
#### 5.7 Content Directed Prefetching
* CDP: targets linked and other irregular data structures
#### 5.8 Manual Tuning
* Remove useless prefetches, adjust the prefetch distance of each software prefetch, and apply additional overhead-reduction techniques, such as *loop unrolling* and *prologue/epilogue transformations*, for each benchmark.
> prologue/epilogue transformations??
* The insertion algorithm is already good enough to serve as a guideline for inserting software prefetch intrinsics.
#### 5.9 Summary of New Findings
(1) Hardware prefetchers can under-exploit even regular access patterns, such as streams. Short streams are particularly problematic, and software prefetching is frequently more effective in such cases.
(2) The software prefetching distance is relatively ==insensitive to the hardware configuration==.
The prefetch distance does need to be set carefully, but as long as it is greater than the minimum distance, most applications are not sensitive to it.
(3) Although most L1 cache misses can be tolerated through out-of-order execution, ==when the L1 cache miss rate is much higher than 20%, reducing L1 cache misses by prefetching into the L1 cache can be effective.==
(4) The overhead of useless prefetch instructions is not very significant. Most applications that have prefetch instructions are typically limited by memory operations, so a few extra (software prefetch) instructions do not increase execution time by much.
(5) Software prefetching can be used to train a hardware prefetcher and thereby yield some performance improvement. However, it can also degrade performance severely, and therefore must be done judiciously, if at all.

---
### 6. CASE STUDIES: SOURCE CODE ANALYSIS OF INDIVIDUAL BENCHMARKS
![](https://i.imgur.com/PLIFfor.png)
#### 6.1 Positive Group
#### 6.2 Neutral Group
#### 6.3 Negative Group

---
### 7. RELATED WORK

---
### 8. CONCLUSION AND FUTURE WORK

## Performance differences among naive_, sse_, and sse_prefetch_
* Modified the Makefile to evaluate performance with perf:

`naive_transpose`:
```
 Performance counter stats for './main' (100 runs):

        13,664,142 cache-misses        #  75.001 % of all cache refs   ( +- 0.03% )
        18,218,644 cache-references                                    ( +- 0.15% )
     1,565,542,213 instructions        #   0.97  insns per cycle       ( +- 0.01% )
     1,608,864,478 cycles                                              ( +- 0.27% )

       0.464768873 seconds time elapsed                                ( +- 0.29% )
```
`sse_transpose`:
```
 Performance counter stats for './main_sse' (100 runs):

         4,470,699 cache-misses        #  79.313 % of all cache refs   ( +- 0.17% )
         5,636,790 cache-references                                    ( +- 0.53% )
     1,343,263,111 instructions        #   1.27  insns per cycle       ( +- 0.08% )
     1,055,228,135 cycles                                              ( +- 0.35% )

       0.304981652 seconds time elapsed                                ( +- 0.37% )
```
`sse_prefetch_transpose`:
```
 Performance counter stats for './main_pre' (100 runs):

         4,519,376 cache-misses        #  78.251 % of all cache refs   ( +- 0.12% )
         5,775,457 cache-references                                    ( +- 0.13% )
     1,384,727,378 instructions        #   1.64  insns per cycle       ( +- 0.00% )
       843,211,849 cycles                                              ( +- 0.38% )

       0.243784674 seconds time elapsed                                ( +- 0.41% )
```
* Observations
    * Besides the higher IPC (instructions per cycle), the prefetching version also shows a large drop in cycle count.
    * Cache misses barely differ, whether compared against `naive` or against `sse_transpose`.
    ==> This got me thinking about the relationship between prefetching and cache misses, and about two suspicious knobs: the locality hint `_MM_HINT_T1` and `PFDIST`.
    ==> I went back and reread the earlier references carefully; my understanding is recorded below @@
* About prefetch in SSE:
    * `_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1)` is defined in the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,AVX,AVX2&text=prefetch&expand=4060,4078,4061,3861,3877,4060) as follows:
:::info
void _mm_prefetch (char const* p, int i)

**Synopsis**
void _mm_prefetch (char const* p, int i)
#include "xmmintrin.h"

**Instruction**: prefetchnta / prefetcht0 / prefetcht1 / prefetcht2

**CPUID Flags**: SSE

**Description**
Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
:::
* The argument `p` is the memory address of the data to prefetch ==> once this is clear, the meaning of `PFDIST` in the original program is obvious!!
* The argument `i` is the locality hint, one of 4 values: `_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2`, and `_MM_HINT_NTA`.
    * `T0 (temporal data)` -- prefetch data into all cache levels.
    * `T1 (temporal data with respect to first level cache)` -- prefetch data into all cache levels except the 0th (i.e., everywhere but the L1 cache).
    * `T2 (temporal data with respect to second level cache)` -- prefetch data into all cache levels except the 0th and 1st.
    * `NTA (non-temporal data with respect to all cache levels)` -- prefetch data into a non-temporal cache structure. (This hint can be used to minimize cache pollution.)
* About `PFDIST`:
    * In the original code, `PFDIST` is defined slightly differently from the paper's prefetch distance, but the concept is the same. Here `PFDIST` is 8, and each loop iteration processes 4 ints, so the real effect is prefetching 8/4 = 2 iterations ahead.
    * The paper defines the (software) prefetch distance as "the distance ahead of which a prefetch should be requested on a memory address."
![](https://i.imgur.com/aCfmkqd.png)
*where **l** is the **prefetch latency** and **s** is **the length of the shortest path through the loop body***
    * This figure from the references makes it very clear:
![](https://i.imgur.com/OuyQWUj.png)
    * In the figure, loading an element takes twice as long as one iteration's computation (to get these times in practice, look up the memory latency in cycles and the cycle count of the instructions inside the loop).
    * So a prefetch distance of 2 means prefetching 2 iterations before the data is needed, loading it from main memory into the cache ahead of time. The distance must be at least 2 here for the data to arrive by the iteration that actually uses it.
    ==> Of course, prefetching too early is also a problem — the cache pollution the paper mentions (currently useless data wasting cache space), or later loads evicting data before it is ever used, causing cache misses all over again.
* Conclusions and hypotheses to test:
    * Investigate how the different ==locality hints== perform at run time
    * Look up the memory latency and the execution time of the instructions in the loop to ==estimate an effective prefetch distance==. Is 2 iterations ahead optimal? How much difference does prefetching too early or too late make?
==> I don't know whether the original "2 iterations ahead" was actually calculated, but I feel any change to it should be backed by some reasoning @@

## Using AVX to improve performance
* Converted the SSE version to AVX (in the intuitive way), cross-checking with the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)

`avx_transpose()`:
```C=
void avx_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T2 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
            I0 = _mm256_unpacklo_epi64(T0, T1);
            I1 = _mm256_unpackhi_epi64(T0, T1);
            I2 = _mm256_unpacklo_epi64(T2, T3);
            I3 = _mm256_unpackhi_epi64(T2, T3);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```

`avx_prefetch_transpose()`:
* Found that dedicated prefetch intrinsics only appear in AVX-512, but my CPU supports only up to AVX2, so I borrow the SSE `_mm_prefetch` for now @@
* [Checking CPU info](http://www.binarytides.com/linux-cpu-information/): find out which instruction sets your own CPU supports

> Doing this feels shaky — I'm not sure whether it has flaws @@
>> In the earlier clz assignment you should already have built a mechanism to verify output correctness; present it here as well. You can use the pure-C program's input and output to validate the SSE/AVX implementations [name=jserv]
>> Thanks! I'll try it later ^^ [name=Carol Chen]

* Following the instructor's advice, I added a `transpose_verify()` to check whether the SSE and AVX transposes are wrong
    * It simply calls `naive_transpose()` and compares the results
    * All results verify as correct~~~

```C=
void avx_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T2 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
            I0 = _mm256_unpacklo_epi64(T0, T1);
            I1 = _mm256_unpackhi_epi64(T0, T1);
            I2 = _mm256_unpacklo_epi64(T2, T3);
            I3 = _mm256_unpackhi_epi64(T2, T3);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```

* Performance analysis after converting to AVX

`avx_transpose`:
```
 Performance counter stats for './main_avx' (100 runs):

         5,308,277 cache-misses        #  79.449 % of all cache refs   ( +- 0.08% )
         6,681,339 cache-references                                    ( +- 0.18% )
     1,340,262,129 instructions        #   1.23  insns per cycle       ( +- 0.03% )
     1,087,092,611 cycles                                              ( +- 0.36% )

       0.314026976 seconds time elapsed                                ( +- 0.38% )
```
`avx_prefetch_transpose`:
```
 Performance counter stats for './main_avx_pre' (100 runs):

         5,605,493 cache-misses        #  77.948 % of all cache refs   ( +- 0.10% )
         7,191,278 cache-references                                    ( +- 0.14% )
     1,393,902,294 instructions        #   1.45  insns per cycle       ( +- 0.00% )
       961,599,446 cycles                                              ( +- 0.35% )

       0.277883340 seconds time elapsed                                ( +- 0.38% )
```
The results are not much different from the SSE version @@

---
* Referring to [周曠宇's notes](https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh) on the AVX transpose, I realized my original version did not exploit AVX's ability to process 256 bits at a time at all — it merely re-implemented the SSE version with AVX instructions.
* It took some time to understand the code; notes below:

This figure from the instructor's reference [4x4 transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html):
![](https://i.imgur.com/uebouDZ.png)
makes it clear why the SIMD transpose is written this way.
* `__m256i T0 = _mm256_unpacklo_epi32(I0, I1)` interleaves the low 128 bits of I0 and I1 into T0 in 32-bit units; `unpackhi` handles the high 128 bits.

Moving on to an 8x8 matrix, the conversion is understandably no longer that simple:

```C=
for (int x = 0; x < w; x += 8) {
    for (int y = 0; y < h; y += 8) {
        __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
        __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
        __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
        __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
        __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
        __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
        __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
        __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
        __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
        __m256i T1 = _mm256_unpackhi_epi32(I0, I1);
        __m256i T2 = _mm256_unpacklo_epi32(I2, I3);
        __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
        __m256i T4 = _mm256_unpacklo_epi32(I4, I5);
        __m256i T5 = _mm256_unpackhi_epi32(I4, I5);
        __m256i T6 = _mm256_unpacklo_epi32(I6, I7);
        __m256i T7 = _mm256_unpackhi_epi32(I6, I7);
        I0 = _mm256_unpacklo_epi64(T0, T2);
        I1 = _mm256_unpackhi_epi64(T0, T2);
        I2 = _mm256_unpacklo_epi64(T1, T3);
        I3 = _mm256_unpackhi_epi64(T1, T3);
        I4 = _mm256_unpacklo_epi64(T4, T6);
        I5 = _mm256_unpackhi_epi64(T4, T6);
        I6 = _mm256_unpacklo_epi64(T5, T7);
        I7 = _mm256_unpackhi_epi64(T5, T7);
        T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
        T1 = _mm256_permute2x128_si256(I1, I5, 0x20);
        T2 = _mm256_permute2x128_si256(I2, I6, 0x20);
        T3 = _mm256_permute2x128_si256(I3, I7, 0x20);
        T4 = _mm256_permute2x128_si256(I0, I4, 0x31);
        T5 = _mm256_permute2x128_si256(I1, I5, 0x31);
        T6 = _mm256_permute2x128_si256(I2, I6, 0x31);
        T7 = _mm256_permute2x128_si256(I3, I7, 0x31);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7);
    }
}
```

I tried working this code out on paper, but still don't quite see why it is done this way @@ — especially the result of `permute()`
![](https://i.imgur.com/psEgfuw.png)
> T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
> My understanding is that the low 128 bits of I0 go into T0's low 128 bits,
> and the low 128 bits of I4 go into T0's high 128 bits,
> but the results differ a lot @@

So I looked for another [approach](http://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2) on Stack Overflow:
```C=
__t0 = _mm256_unpacklo_ps(row0, row1);
__t1 = _mm256_unpackhi_ps(row0, row1);
__t2 = _mm256_unpacklo_ps(row2, row3);
__t3 = _mm256_unpackhi_ps(row2, row3);
__t4 = _mm256_unpacklo_ps(row4, row5);
__t5 = _mm256_unpackhi_ps(row4, row5);
__t6 = _mm256_unpacklo_ps(row6, row7);
__t7 = _mm256_unpackhi_ps(row6, row7);
__tt0 = _mm256_shuffle_ps(__t0,__t2,_MM_SHUFFLE(1,0,1,0));
__tt1 = _mm256_shuffle_ps(__t0,__t2,_MM_SHUFFLE(3,2,3,2));
__tt2 = _mm256_shuffle_ps(__t1,__t3,_MM_SHUFFLE(1,0,1,0));
__tt3 = _mm256_shuffle_ps(__t1,__t3,_MM_SHUFFLE(3,2,3,2));
__tt4 = _mm256_shuffle_ps(__t4,__t6,_MM_SHUFFLE(1,0,1,0));
__tt5 = _mm256_shuffle_ps(__t4,__t6,_MM_SHUFFLE(3,2,3,2));
__tt6 = _mm256_shuffle_ps(__t5,__t7,_MM_SHUFFLE(1,0,1,0));
__tt7 = _mm256_shuffle_ps(__t5,__t7,_MM_SHUFFLE(3,2,3,2));
row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);
row1 = _mm256_permute2f128_ps(__tt1, __tt5, 0x20);
row2 = _mm256_permute2f128_ps(__tt2, __tt6, 0x20);
row3 = _mm256_permute2f128_ps(__tt3, __tt7, 0x20);
row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);
row5 = _mm256_permute2f128_ps(__tt1, __tt5, 0x31);
row6 = _mm256_permute2f128_ps(__tt2, __tt6, 0x31);
row7 = _mm256_permute2f128_ps(__tt3, __tt7, 0x31);
```
> I'll study this part later @@

* Results of the corrected AVX version:
    * Importantly, to compare against SSE, `PFDIST` must be adjusted accordingly.
    * To get an effect equivalent to the SSE version here, ==`PFDIST` should change from 8 to 16==: in the SSE version `PFDIST = 8`, and 8/4 (four ints handled per iteration) = 2, i.e., prefetching 2 iterations ahead; the AVX version handles 8 ints per iteration, so prefetching 2 iterations ahead requires `PFDIST` = 8*2 = 16.

`avx_transpose`:
```
 Performance counter stats for './main_avx' (100 runs):

         3,660,681 cache-misses        #  69.870 % of all cache refs   ( +- 0.11% )
         5,239,311 cache-references                                    ( +- 0.10% )
     1,261,624,058 instructions        #   1.44  insns per cycle       ( +- 0.00% )
       875,580,047 cycles                                              ( +- 0.25% )

       0.253157364 seconds time elapsed                                ( +- 0.27% )
```
`avx_prefetch_transpose`:
```
 Performance counter stats for './main_avx_pre' (100 runs):

         3,863,180 cache-misses        #  69.442 % of all cache refs   ( +- 0.14% )
         5,563,190 cache-references                                    ( +- 0.10% )
     1,284,699,571 instructions        #   1.52  insns per cycle       ( +- 0.00% )
       846,756,431 cycles                                              ( +- 0.29% )

       0.245009690 seconds time elapsed                                ( +- 0.33% )
```
* Observations:
    * Compared with the non-prefetching `avx_transpose()`, the cycle count drops by about 30,000,000.
    * Both corrected AVX versions really do improve performance; their cache-miss rate is about 10 percentage points lower than the SSE versions.

## Tuning attempt 1: changing the locality hint
* locality hints (see the `_mm_prefetch` notes above): `T0`, `T1`, `T2`, `NTA`

## Tuning attempt 2: adjusting PFDIST
* To know what prefetch distance to set, we need the memory latency and the execution time of the instructions in the loop.
* Not sure it can really be measured, but I was curious how memory latency is measured; searching turned up two methods:
    * Install i2c-tools, then run `decode-dimms` (see [here](http://www.richud.com/wiki/Ubuntu_See_Live_RAM_Timings_Decode_DIMMS)). This failed for me~~ @@
    * Use the [Intel® Memory Latency Checker v3.1a](https://software.intel.com/en-us/articles/intelr-memory-latency-checker): usage is documented quite clearly on the official page.
      Download `mlcv3.1a.tgz` from the official site and extract it with `$ tar -zxvf mlcv3.1a.tgz`.
      Under `Linux/` there are two executables, `mlc` and `mlc_avx512`; my CPU does not support AVX-512, so I use `mlc`.
      `$ sudo ./mlc` gives an error message @@
```
tried to open /dev/cpu/0/msr
Cannot access MSR module.
Possibilities include not having root privileges, or MSR module not inserted
(try 'modprobe msr', and refer README)
```
As it suggests, after `$ sudo modprobe msr`, running `mlc` again works. Results:
```
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
	Memory node
Socket	     0
     0	  53.4

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        : 10770.2
3:1 Reads-Writes : 10047.3
2:1 Reads-Writes : 9765.2
1:1 Reads-Writes : 9274.6
Stream-triad like: 9805.0

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
	Memory node
Socket	     0
     0	11037.1

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	177.28	 10649.6
 00002	192.86	 10994.8
 00008	167.25	 10948.4
 00015	159.44	 10612.9
 00050	154.25	 10795.7
 00100	140.75	 10143.9
 00200	128.83	  7893.9
 00300	105.37	  5952.7
 00400	100.31	  4776.8
 00500	 86.84	  4003.8
 00700	 88.70	  3163.5
 01000	 86.94	  2478.6
 01300	 85.03	  2088.2
 01700	 82.19	  1816.9
 02500	 82.56	  1500.1
 03500	 72.52	  1369.3
 05000	 66.17	  1288.9
 09000	 88.03	   929.9
 20000	 78.31	   905.1

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	26.4
Local Socket L2->L2 HITM latency	25.7
```

---
* Try running `./mlc` while the transpose program is executing. Results:
```
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
	Memory node
Socket	     0
     0	  77.9

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        : 9720.8
3:1 Reads-Writes : 8869.7
2:1 Reads-Writes : 8909.7
1:1 Reads-Writes : 8313.2
Stream-triad like: 9235.2

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
	Memory node
Socket	     0
     0	9383.3

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	323.97	  9325.1
 00002	187.46	  8582.3
 00008	337.34	  9331.5
 00015	177.40	  7845.7
 00050	246.49	  8772.4
 00100	296.18	  7506.0
 00200	243.75	  5463.3
 00300	221.81	  4152.5
 00400	154.30	  3104.2
 00500	221.95	  2651.5
 00700	190.65	  1968.7
 01000	158.13	  1544.2
 01300	235.15	  1242.4
 01700	153.25	  1097.1
 02500	149.58	   901.2
 03500	163.42	   734.9
 05000	 91.15	   885.4
 09000	191.48	   472.6
 20000	222.20	   354.6

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	32.5
Local Socket L2->L2 HITM latency	37.6
```
The numbers changed~~ it really is measuring the running program!!
If I'm not mistaken, the memory latency I wanted is roughly between `90 (ns) ~ 320 (ns)` ==> treat the average as `200 (ns)` for calculations.
> I'm really not sure whether I missed anything @@; please point out any flaws~ ><

* Going back to the program's perf results: cache misses jumped overall by more than 10 percentage points!! `naive_transpose()` even reached 92%. This seems no accident — it was never that high over all the previous runs.
* My guess is what the official page mentions: MLC **disables the hardware prefetcher** during its run via the MSR HW Prefetcher Control, so a performance drop is to be expected.
==> So this is the legendary hardware prefetcher; worth digging up more material on it.
==> Official reference: [Disclosure of H/W prefetcher control on some Intel processors](https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors): roughly explains how the MSR controls the hardware prefetcher.
* But the performance drop never recovered: after MLC finished, I expected the cache-miss rate to return to around 70%, yet it stayed at 80-90%.
==> Two possibilities: either the HW prefetcher really was changed (not re-enabled afterwards, or not fully), or it relates to the training effect when HW & SW prefetchers are used together (could that really be it?!)
==> Will read up on HW prefetchers and study this part further @@

---
* Back to the original question — still missing the execution time (cycle count) of each loop iteration ==> function execution time / number of loop iterations = time per iteration.
* Execution times of all versions, with average time per loop iteration [note: 1 μs = $10^3$ ns]:
```
naive: 266798 us
naive per iteration: 0.015902 us
```
```
sse: 119329 us
sse per iteration: 0.113801 us
```
==> With an average memory latency of about `200 (ns)` and an SSE loop iteration of about `100 (ns)`, the computed prefetch distance really is about 2!!
==> Prefetching 2 iterations ahead is justified~~
```
sse prefetch: 63210 us
sse prefetch per iteration: 0.060282 us
```
```
avx: 62812 us
avx per iteration: 0.239609 us
```
==> By the same computation, AVX could prefetch just 1 iteration ahead; keeping `PFDIST` at 8 suffices.
```
avx prefetch: 66318 us
avx prefetch per iteration: 0.252983 us
```

## Intel Intrinsics Prefetch usage and concepts
* Supplementary notes from my reading, even though my CPU doesn't support AVX-512 ><
> Material to be added — I haven't understood this part yet @@
### AVX-512 Prefetch
![](https://i.imgur.com/Cr2Suqu.png)
==> Seeing a pile of scatter and gather intrinsics, with no idea how to pass the parameters
* First, understand what scatter and gather are. From the [Intel® 64 and IA-32 Architectures Optimization Reference Manual](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf):
    * Data gather
        * The gather operation reads elements from an input buffer based on indexes specified in an index buffer.
        * The gathered elements are written to an output buffer.
![](https://i.imgur.com/4fHYOrM.png)
    * Data scatter
        * The scatter operation reads elements from an input buffer sequentially.
        * It then writes them to an output buffer based on indexes specified in an index buffer.
![](https://i.imgur.com/whRVC3S.png)