# 2017q1 Homework3 (software-pipelining)
contributed by <`chenweiii`>

## Hardware Environment
* Operating System: Ubuntu 16.04.2 LTS (64-bit)
* CPU: [Intel i7-6700 @ 3.4 GHz](http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-6700.html)
* Cache:
    * L1d cache: 32 KB, 8-way set associative
    * L1i cache: 32 KB, 8-way set associative
    * L2 cache: 256 KB, 4-way set associative
    * L3 cache: 8 MB, 16-way set associative
    * cache alignment: 64 B
* Memory: 32 GB

## Paper Reading
> Parts of this section are excerpted from [論文 When Prefetching Works, When It Doesn’t, and Why 重點提示和解說](https://hackmd.io/s/HJtfT3icx#)

### Summary
![](https://i.imgur.com/pwHxvmJ.png)
* Note how speedup is computed: for each benchmark, the best-performing `SW + HW prefetcher` configuration is compared against the `HW prefetcher` alone. The benchmarks are therefore ordered, left to right, by how well each performs when software and hardware prefetchers coexist.
* Except for gcc, bzip2, and soplex, adding intrinsics improves performance substantially on every benchmark.
> The text says the experiments use three kinds of binaries and three simulated hardware prefetching schemes, but the chart only shows GHB & STR, without a third scheme; likewise only the baseline and SW binaries appear, so I do not know what the third binary is.
* Intel's optimization guidelines state that SW prefetch has positive effects for short arrays, continuous but irregular memory accesses, and reducing L1 cache misses, which is consistent with the results analyzed in this paper.
> But I do not recall the paper specifically analyzing these three cases; perhaps I missed it.
* The goal of this paper is to explore the interaction between SW and HW prefetchers. It also observes that in some benchmarks the SW prefetcher over-trains the HW prefetcher, causing a large negative impact.

<br>

### Background on SW & HW prefetching
![](https://i.imgur.com/yN0hMjp.png)
* Different data structures suit different prefetching mechanisms. For example, Recursive Data Structures (RDS) are better handled by a SW prefetcher than by a HW prefetcher.
* Classification of cache-line accesses:
    * stream: unit-stride cache-line accesses
    * strided: access stride distances greater than two cache lines
* Stream prefetchers, GHB prefetchers, and content-based prefetchers are all hardware-based prefetching mechanisms, but this experiment adopts only the first two, which have been successfully commercialized.
> Not sure whether my understanding here is correct. I am still reading articles about GHB.

#### Prefetch Classification
![](https://i.imgur.com/JCu3psd.png)
* A prefetch is classified by where the demand access time falls on this axis.

<br>

![](https://i.imgur.com/dFYj7Eo.png)
* Redundant_dc: the cache block to be prefetched is already in the cache.
> How can the counts of each kind of prefetch be observed?
* Miss Status Holding Registers (MSHR)
    * See the [Cornell memory-system lecture notes](https://jontse.com/courses/files/cornell/ece5730/Lecture04.pdf)
    * When a miss occurs, the MSHRs are looked up to determine whether the cache block is already being fetched.
    * Each MSHR holds the information for one primary miss.
    * On an MSHR hit, a secondary miss has occurred (a primary miss to the same line is outstanding).
> I still do not quite understand the role MSHRs play in a miss-under-miss cache design, and what the process looks like. Stuck here... but the rest of the paper does not discuss MSHRs, so I will look for more articles to read later.

#### Software Prefetch Distance
* Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency.
* If the prefetch distance is too large, prefetched data can evict useful cache blocks, and the elements at the beginning of the array may not be prefetched, leading to less coverage and more cache misses.
* Prefetch coverage
    * the ratio of useful prefetches to total L2 misses
    * i.e. the fraction of cache misses avoided by prefetching
    * 100 × (Prefetch Hits / (Prefetch Hits + Cache Misses))
> How can the prefetch-hit counts be observed?

#### Direct and Indirect Memory Indexing
* **Direct memory indexing** can easily be prefetched by hardware, since the memory addresses exhibit regular stream/stride behavior.
* **Indirect indexing** is relatively simpler to compute in software, but it usually has a higher overhead than direct memory indexing.

### Pros & Cons of SW Prefetching, and the Effects of HW + SW
> I am curious whether this summary comes from other literature or is drawn from the experimental results of this paper.

#### Benefits of SW prefetching over HW
* Large number of streams (limited hardware resources)
* Short streams
* Irregular memory accesses
* Cache locality hints
    * Most hardware prefetchers place data in the lower cache levels (L2 or L3), whereas a software prefetcher can place data directly into the L1 cache according to its hint.
    * Lower-level prefetch block insertion greatly reduces higher-level cache pollution, but the L2-to-L1 latency can degrade performance significantly.
    * So the SW prefetching mechanism offers greater flexibility in placing data at the right cache level.
* Loop bounds
    * Several methods, such as loop unrolling, software pipelining, and using branch instructions, can **prevent generating prefetch requests beyond array bounds** in software.
    * However, overly aggressive prefetching results in early or incorrect prefetch requests, in addition to **consuming memory bandwidth**.

#### Negative Impacts of Software Prefetching
* Increased instruction count
* Static insertion
    * The decision to prefetch and the choice of parameters, such as **the data to be prefetched and the corresponding prefetch distance**, are made statically, and therefore cannot adapt to runtime behavioral changes such as **varying memory latency, effective cache size, and bandwidth**, especially on heterogeneous architectures.
* Code structure change

#### Synergistic Effects when using HW + SW prefetching
```Synergistic: acting together, mutually reinforcing```
* Handling multiple streams
    * The HW prefetcher covers regular streams.
    * The SW prefetcher covers irregular streams.
* Positive training
    * e.g. if a block prefetched by a software prefetcher arrives late, a trained hardware prefetcher can improve prefetch timeliness.

#### Antagonistic Effects
```Antagonistic: acting against each other```
* Negative training
    * Software prefetch requests can slow down the hardware prefetcher's training.
    * e.g. if the blocks prefetched by software hide part of one or more streams, the hardware prefetcher will not be trained properly.
    * [Note] I do not quite understand what this example is trying to convey.
    * Software prefetch instructions can trigger overly aggressive hardware prefetches, which results in early requests.
* Harmful software prefetching
    * When software prefetch requests are incorrect or early, the increased stress on cache and memory bandwidth can further reduce the effectiveness of hardware prefetching.
<br>

## Learning SSE
Reference: [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#)
> Start by learning the instructions used in the program.
> I searched all over but could not find a good SSE tutorial; it turns out the Intrinsics Guide already provides the clearest explanations.

### First, read the SSE introduction written on Hotball's homepage
* Naming convention of SSE intrinsics:
    * \_mm_<opcode>_<suffix>
    * <opcode> is the operation class, and <suffix> denotes the data type.
    * SSE floating-point instructions have only two suffixes: ps and ss.
        * ps means Packed Single-precision: the instruction operates on all four single-precision floats in the register.
        * ss means Scalar Single-precision: the instruction operates only on DATA0 of the register.
> I am still unclear about the meaning of the <suffix> in the three instructions below, e.g. si128, epi32.
* Before using SSE intrinsics, remember to include the xmmintrin.h header.
* Most SSE instructions that access memory require the address to be a multiple of 16 (i.e. aligned on a 16-byte boundary).
    * If the data is not 16-byte aligned, `_mm_load_ps` cannot be used; `_mm_loadu_ps`, which is designed specifically for unaligned data, must be used instead, but it is slower.
> ~~Found a possible direction for improvement, but how do I make the matrix start at an address that is a multiple of 16?~~
* Based on the computation result, the appropriate flags (divide-by-zero, invalid, etc.) are set, or an exception is raised. This is configured via the MXCSR register, a 32-bit flag register that controls whether each kind of exception is raised and records what happened during the last computation.
* Besides arithmetic instructions, SSE also supports cache-control instructions: prefetch and movntps. The prefetch instruction actually comes in four variants — prefetch0, prefetch1, prefetch2, and prefetchnta — all expressed by the same intrinsic, `_mm_prefetch`.
* The main purpose of the prefetch instruction is to let the CPU load data needed by later computation ahead of time, usually by telling the CPU to load the next batch of data before operating on the current one. **This lets the current computation and the load of the next batch proceed concurrently.** If the computation is complex enough, the main-memory read latency can be hidden completely. The different prefetch variants tell the CPU which cache level to load the data into. The most commonly used is prefetchnta, which loads data into the cache closest to the CPU (usually the L1 or L2 cache) and is suitable when the data will be used within a short time.

### SSE intrinsics used in [impl.c]
#### void _mm_prefetch (char const* p, int i)
* Description
    * Fetch **the line of data** from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
#### __m128i _mm_loadu_si128 (__m128i const* mem_addr)
* Description
    * Load 128 bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
* Operation
    * ```dst[127:0] := MEM[mem_addr+127:mem_addr]```
#### __m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)
* Description
    * Unpack and interleave 32-bit integers from the low half of a and b, and store the results in dst.
* Operation
    * Split the 128 bits in half: take the low 64 bits of both a and b and interleave them into the destination.
    * Replacing lo with hi does the analogous operation on the high half.
#### void _mm_storeu_si128 (__m128i* mem_addr, __m128i a)
* Description
    * Store 128 bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.
> After working through the instructor's transpose code with pen and paper — it is amazing!

<br>

## Raw counter lookup
Consult the [Intel® 64 and IA-32 Architectures Developer’s Manual: Vol. 3B](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html); my CPU is a 6th-generation (Skylake) processor.
> Searching for raw counters here was frustrating; I could not tell which one would help my analysis. I tried to find the same two events as <`yenWu`>, but could not find them.

<br>

## Initial Observations
### Execution time of the three versions
> Each version sampled 100 times

```
naive:        233908 us
sse:          154229 us
sse prefetch:  42235 us
```

### Using perf to analyze cache misses
```$ perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./...```
```$ sudo sh -c "echo 0 > /proc/sys/kernel/kptr_restrict"```

#### naive impl.
```
 Performance counter stats for './naive' (100 runs):

        41,213,342      cache-misses              #   85.814 % of all cache refs      ( +- 0.29% )
        48,026,565      cache-references                                              ( +- 0.25% )
     1,448,367,681      instructions              #    1.55  insn per cycle           ( +- 0.00% )
       936,303,948      cycles                                                        ( +- 0.30% )

       0.320454039 seconds time elapsed                                               ( +- 0.45% )
```

#### sse impl.
```
 Performance counter stats for './sse' (100 runs):

        14,535,369      cache-misses              #   75.659 % of all cache refs      ( +- 0.07% )
        19,211,776      cache-references                                              ( +- 0.05% )
     1,236,449,412      instructions              #    1.79  insn per cycle           ( +- 0.00% )
       690,682,519      cycles                                                        ( +- 0.15% )

       0.257315101 seconds time elapsed                                               ( +- 0.24% )
```

#### sse_prefetch impl.
```
 Performance counter stats for './sse_prefetch' (100 runs):

         8,599,805      cache-misses              #   65.805 % of all cache refs      ( +- 0.29% )
        13,068,657      cache-references                                              ( +- 0.18% )
     1,282,489,200      instructions              #    2.40  insn per cycle           ( +- 0.00% )
       534,001,192      cycles                                                        ( +- 0.14% )

       0.140952357 seconds time elapsed                                               ( +- 0.40% )
```

<br>

## Improving the SSE version
> Improve the alignment of the SSE loads. I later found that <`yenWu`> also wrote a section on this problem, but apparently without results so far.

### memalign - allocate aligned memory
#### int posix_memalign(void **memptr, size_t alignment, size_t size);
* Description
    * The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *). If size is 0, then posix_memalign() returns either NULL, or a unique pointer value that can later be successfully passed to free(3).
#### void *memalign(size_t alignment, size_t size);
* Description
    * The obsolete function memalign() allocates size bytes and returns a pointer to the allocated memory. The memory address will be a multiple of alignment, which must be a power of two.
### Modified code
```clike=
/* main.c: allocate 16-byte-aligned buffers (memalign is declared in <malloc.h>) */
int *src = (int *) memalign(16, sizeof(int) * TEST_W * TEST_H);
int *out = (int *) memalign(16, sizeof(int) * TEST_W * TEST_H);

/* impl.c: aligned loads/stores instead of the unaligned variants */
__m128i I0 = _mm_load_si128((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_load_si128((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_load_si128((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_load_si128((__m128i *)(src + (y + 3) * w + x));
_mm_store_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_store_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_store_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_store_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
```

### Results
```
sse:          155314 us
sse prefetch:  44440 us
```
No significant improvement.

```
 Performance counter stats for './sse' (100 runs):

        15,797,948      cache-misses              #   82.011 % of all cache refs      ( +- 0.33% )
        19,263,250      cache-references                                              ( +- 0.07% )
     1,236,505,999      instructions              #    1.52  insn per cycle           ( +- 0.00% )
       813,709,554      cycles                                                        ( +- 0.66% )

       0.273664634 seconds time elapsed                                               ( +- 0.40% )

 Performance counter stats for './sse_prefetch' (100 runs):

        10,186,595      cache-misses              #   72.837 % of all cache refs      ( +- 0.59% )
        13,985,518      cache-references                                              ( +- 0.24% )
     1,282,511,076      instructions              #    2.06  insn per cycle           ( +- 0.00% )
       623,511,430      cycles                                                        ( +- 0.49% )

       0.168184185 seconds time elapsed                                               ( +- 0.55% )
```

The cache-miss ratios of both sse and sse_prefetch are about 7 percentage points higher than in the original versions.
> I do not know why the cache misses increased. Following the phonebook assignment, I inspected which places are cache-miss hotspots, but could not see anything. orz

<br>

## AVX version
> First, try not to look at other people's code; browse the Intel Intrinsics Guide to find the corresponding instructions. With a width of 256 bits, each loop iteration handles an 8×8 block, so the original verify apparently cannot be used... [time=Fri, Mar 31, 2017 5:27 PM]
> Working it out on paper went smoothly, but there seems to be no unpack that operates on 128-bit units... [time=Fri, Mar 31, 2017 5:27 PM]
> Not as easy as imagined, since AVX unpack behaves differently from SSE2, but after some fiddling I finally pieced it together (huh?)
[time=Fri, Mar 31, 2017 7:41 PM]
> Writing the AVX version involved heavy use of gdb for debugging [time=Fri, Mar 31, 2017 8:29 PM]
>
> [breakpoint](http://www.delorie.com/gnu/docs/gdb/gdb_29.html)

### Added code
```clike=
/* verify whether these two matrices are identical */
static int equal(int *a, int *b, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            if (*(a + x * h + y) != *(b + x * h + y))
                return -1;
    return 1;
}
```
Used to compare against naive_transpose and verify that the result of avx_transpose is correct.

### AVX intrinsics used
#### __m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
* Description
    * Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
> This one really tripped me up: it cannot be inferred directly from SSE2; read the manual carefully.
#### __m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
* Description
    * Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.
    * Adjust imm8 as needed; the Operation section of the manual explains how to set it.
> Luckily <`yenWu`>'s notes mentioned this intrinsic; otherwise I would not have known to pick it.

### Results
```
avx: 64688 us
```
Speedup relative to the sse version: 2.4

```
 Performance counter stats for './avx' (100 runs):

        10,980,390      cache-misses              #   72.052 % of all cache refs      ( +- 0.44% )
        15,239,534      cache-references                                              ( +- 0.09% )
     1,148,666,522      instructions              #    2.07  insn per cycle           ( +- 0.00% )
       556,088,254      cycles                                                        ( +- 0.80% )

       0.174747969 seconds time elapsed                                               ( +- 0.74% )
```

<br>

## Analyzing the impact of prefetch distance on performance
> Trying to reproduce <`kaizsv`>'s experiment [time=Fri, Mar 31, 2017 8:23 PM]
> A quick test shows different prefetch distances have a big impact; this chart has to be made! [time=Fri, Mar 31, 2017 8:24 PM]
> One difficulty: to test PFDIST it must become a passed-in parameter, but that changes the uniform transpose interface. What is the best way to handle this?
[time=Fri, Mar 31, 2017 8:45 PM]
> In the end I crudely rewrote the prefetch interface, just to get the chart drawn first. [time=Fri, Mar 31, 2017 9:17 PM]

![](https://i.imgur.com/FmGaVbA.png)

## Reference
- [ ] [kaizsv](https://hackmd.io/s/r1IWtb9R#)
- [ ] [yenWu](https://hackmd.io/s/ryTASBCT#)
- [x] [SSE 介紹](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_1.html)
- [ ] [Good SIMD lecture notes from NTU](http://www.csie.ntu.edu.tw/~cyy/courses/assembly/10fall/lectures/handouts/lec17_x86SIMD.pdf)