2017q1 Homework3 (software-pipelining)

# 2017q1 Homework3 (software-pipelining) contributed by <`claaaaassic`> ###### tags: `claaaaassic` [B05: software-pipelining](https://hackmd.io/s/rks62p1sl) ## 預期目標 * 學習計算機結構並且透過實驗來驗證所學 * 理解 prefetch 對 cache 的影響，從而設計實驗來釐清相關議題 * 論文閱讀和思考 ## 作業要求 * 閱讀論文 "[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)" (務必事先閱讀論文，否則貿然看程式碼，只是陷入一片茫然!)，紀錄你的認知和提問 * 在 Linux/x86_64 (注意，要用 64-bit 系統，不能透過虛擬機器執行) 上編譯並執行 [prefetcher](https://github.com/sysprog21/prefetcher) * 說明 `naive_transpose`, `sse_transpose`, `sse_prefetch_transpose` 之間的效能差異，以及 prefetcher 對 cache 的影響 * 在 github 上 fork [prefetcher](https://github.com/sysprog21/prefetcher)，參照下方「見賢思齊：共筆選讀」的實驗，嘗試用 AVX 進一步提昇效能。分析的方法可參見 [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx) * 修改 `Makefile`，產生新的執行檔，分別對應於 `naive_transpose`, `sse_transpose`, `sse_prefetch_transpose` (學習 [phonebook](s/S1RVdgza) 的做法)，學習 [你所不知道的 C 語言：物件導向程式設計篇](https://hackmd.io/s/HJLyQaQMl) 提到的封裝技巧，以物件導向的方式封裝轉置矩陣的不同實作，得以透過一致的介面 (interface) 存取個別方法並且評估效能 * 用 perf 分析 cache miss/hit * 學習 `perf stat` 的 raw counter 命令 * 參考 [Performance of SSE and AVX Instruction Sets](http://arxiv.org/pdf/1211.0820.pdf)，用 SSE/AVX intrinsic 來改寫程式碼 ## 論文筆記 ### Prefetch * 不同資料結構適合的 SW prefetch 和 HW prefetch 策略(TABLE I) * prefetch 分類(Fig. 2): 其中若為 timely 效果最佳 ### Direct and Indirect Memory Indexing * HW Prefetch 處理 indirect memory indexing 較困難，需要特殊硬體設計 * SW Predetch 處理較為容易 direct address prefetches: 2個指令/intrinsic indirect memory addresses prefetch: 4個指令/intrinsic prefetch indirect memory 的消耗比 direct 多 ### Software Prefetch Compiler 或人工加入 prefetch intrinsic 作為 prefetch 的依據 Complier : icc 、 gcc 人工：x86 SSE SIMD extensions。e.g. _mm_prefetch(char *ptr, int hint) #### Distance * prefetch distance 必須要大到能夠隱藏 memory latency * distance 太大可能造成不良的後果，如coverage更小 * Coverage：100 x (Prefetch Hits / (Prefetch Hits + Cache Misses)) #### Software Prefetch 的優點 * Large Number of Streams： * HW prefetcher 受限於硬體資源如 stream detectors 和 book-keeping mechanisms，一次能 prefetch 的數量有限，SW prefetcher 的靈活度較高。 * Short Streams： * HW prefetch 在 training 上需要時間 * Irregular Memory Access： * SW prefetcher(插入 intrinsic 的方式)較能對複雜的資料結構做prefetch * Cache Locality Hint： * 程式開發者使用 SW prefetcher 時，能自行調整 locality hint * Loop Bounds： * SW 可在開發時直接加入 #### Software Prefetch 的缺點 * Increased Instruction Count：因為需要對指令 fetch 以及 execution * Static Insertion： * 因為會事先決定好 Distance ，但 latency 會隨時間改變，無法像 HW Prefetch 一樣靈活 * 在異質系統(heterogeneous architectures)會更嚴重 * Code Structure Change：因為對於迴圈中較少的指令數量可能會產生來不及 fetch 的問題，這時候可能就要改變 loop 的結構，例如說把部份迴圈拆掉 ### Hardware Prefetch 依過去紀錄來作為 Prefetch 的根據，需要 training #### GHB * stride prefetcher( 一次跳幾格去 prefetch cache line ) #### STR * stream prefetcher( 連續的 prefetch cache line ) ### Synergistic Effects：SW + HW Prefetch * Handling Multiple Streams： * HW prefetcher 處理 regular stream、SW prefetcher 處理 irregular stream * Positive Training： * 可用 SW 訓練 HW ### Antagonistic Effects：SW + HW Prefetch * 若 SW 表現不佳時反而會讓 HW 做出錯誤的 Prefetch ## 實驗 ### 電腦規格 ``` Architecture / Microarchitecture Microarchitecture Sandy Bridge Processor core Sandy Bridge Core stepping J1 (Q1SD, SR04B) CPUID 206A7 (SR04B) Manufacturing process 0.032 micron High-K metal gate process Data width 64 bit The number of CPU cores 2 The number of threads 4 Floating Point Unit Integrated Level 1 cache size 2 x 32 KB 8-way set associative instruction caches 2 x 32 KB 8-way set associative data caches Level 2 cache size 2 x 256 KB 8-way set associative caches Level 3 cache size 3 MB 12-way set associative shared cache Cache latency 3 (L1 cache) 10 (L2 cache) 22 (L3 cache) Physical memory 16 GB Multiprocessing Not supported Features MMX instructions SSE / Streaming SIMD Extensions SSE2 / Streaming SIMD Extensions 2 SSE3 / Streaming SIMD Extensions 3 SSSE3 / Supplemental Streaming SIMD Extensions 3 SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4 AES / Advanced Encryption Standard instructions AVX / Advanced Vector Extensions EM64T / Extended Memory 64 technology / Intel 64 NX / XD / Execute disable bit HT / Hyper-Threading technology TBT 2.0 / Turbo Boost technology 2.0 VT-x / Virtualization technology Low power features Enhanced SpeedStep technology ``` ### OOP 首先改寫成以物件導向的方式封裝轉置矩陣的不同實作，可是太醜陋了但不確定改成怎樣比較好看，晚點再來改的更好 ```shell typedef struct { void (*transpose)(int *src, int *dst, int w, int h); }transpose_interface; #ifdef NAIVE void transpose_init (transpose_interface *transpose_interface) { transpose_interface->transpose = naive_transpose; } #endif ``` 三種方法的效能在下方，有個解說 sse 轉置矩陣說的蠻好的 [ierosodin](https://hackmd.io/s/rkX95E-il#how-sse-transpose-matrix) ```shell naive: 261739 us Performance counter stats for './naive_transpose' (10 runs): 18,324,202 cache-misses # 91.207 % of all cache refs ( +- 0.04% ) 20,090,813 cache-references ( +- 0.02% ) 1,448,648,908 instructions # 1.20 insn per cycle ( +- 0.00% ) 1,209,328,679 cycles ( +- 1.23% ) 0.486415549 seconds time elapsed ( +- 2.04% ) sse: 119548 us Performance counter stats for './sse_transpose' (10 runs): 6,073,358 cache-misses # 80.934 % of all cache refs ( +- 0.07% ) 7,504,114 cache-references ( +- 0.04% ) 1,236,665,480 instructions # 1.44 insn per cycle ( +- 0.00% ) 861,231,896 cycles ( +- 0.54% ) 0.344926233 seconds time elapsed ( +- 2.70% ) sse_prefetch: 63448 us Performance counter stats for './sse_prefetch_transpose' (10 runs): 6,026,915 cache-misses # 80.284 % of all cache refs ( +- 0.08% ) 7,506,949 cache-references ( +- 0.03% ) 1,282,716,243 instructions # 1.79 insn per cycle ( +- 0.00% ) 716,413,130 cycles ( +- 0.45% ) 0.293630597 seconds time elapsed ( +- 2.55% ) ``` 這邊可以看出 sse 跟 sse_prefetch 的執行時間相差幾乎一倍，但是 cache-misses 卻差不多，可以比較一下有沒有 prefetch 的差別 sse_prefetch 使用的是 SW Prefetch ，這邊參考[baimao8437](https://hackmd.io/s/r1IWtb9R#perf-raw-counter)找到這兩個參數，用來觀察 SW 與 HW 的使用情形 |Event Num. |Umask Value |Event Mask Mnemonic| |:---:|:---:|:---:| |4CH| 01H| LOAD_HIT_PRE.SW_PF| |4CH| 02H| LOAD_HIT_PRE.HW_PF| 下面可以看出 sse_prefetch_transpose 確實有使用 sw prefetch，他們也都還有使用 HW Prefetch ```shell Performance counter stats for './naive_transpose' (10 runs): 3,745 r014c ( +- 3.52% ) (83.68%) 386,261 r024c ( +- 7.34% ) (83.50%) Performance counter stats for './sse_transpose' (10 runs): 3,428 r014c ( +- 1.42% ) (83.91%) 440,137 r024c ( +- 1.62% ) (83.28%) Performance counter stats for './sse_prefetch_transpose' (10 runs): 947,640 r014c ( +- 2.85% ) (83.97%) 397,503 r024c ( +- 8.06% ) (83.48%) ``` 不過關於 Prefetch 對於 cache-misses 上的影響這邊還看不出差異，接下來列出一些可能的差異來分析看看，這邊參考[jack81306](https://hackmd.io/s/rJRG6D4ix)使用兩種觀察 L1 cache miss，這裡的描述應該是load的指令並且有hit 的數量 |Event Num. |Umask Value |Event Mask Mnemonic|Description| |:---:|:---:|:---:|:---:| |D1H| 01H| MEM_LOAD_UOPS_RETIRED.L1_HIT|Retired load uops with L1 cache hits as data sources.| |D1H| 02H| MEM_LOAD_UOPS_RETIRED.L2_HIT|Retired load uops with L2 cache hits as data sources.| ```shell Performance counter stats for './naive_transpose' (10 runs): 18,819,052 cache-misses # 96.449 % of all cache refs ( +- 1.33% ) (19.61%) 19,511,844 cache-references ( +- 1.87% ) (20.47%) 1,406,820,862 instructions # 1.20 insn per cycle ( +- 0.48% ) (31.15%) 1,170,377,147 cycles ( +- 1.39% ) (41.30%) 92,178 L1-icache-load-misses ( +- 4.80% ) (51.33%) 20,204,217 L1-dcache-load-misses ( +- 0.75% ) (51.24%) 533,762,313 r01d1 ( +- 0.87% ) (38.76%) 24,312 r02d1 ( +- 13.81% ) (38.19%) Performance counter stats for './sse_transpose' (10 runs): 5,982,773 cache-misses # 84.739 % of all cache refs ( +- 1.65% ) (19.55%) 7,060,276 cache-references ( +- 1.51% ) (21.10%) 1,179,462,760 instructions # 1.43 insn per cycle ( +- 0.80% ) (31.84%) 823,323,897 cycles ( +- 0.83% ) (42.14%) 76,290 L1-icache-load-misses ( +- 3.06% ) (52.19%) 8,269,203 L1-dcache-load-misses ( +- 1.08% ) (52.10%) 439,601,890 r01d1 ( +- 1.13% ) (37.42%) 23,614 r02d1 ( +- 18.48% ) (36.84%) Performance counter stats for './sse_prefetch_transpose' (10 runs): 6,454,148 cache-misses # 83.139 % of all cache refs ( +- 5.15% ) (20.10%) 7,763,062 cache-references ( +- 3.85% ) (21.42%) 1,225,180,064 instructions # 1.77 insn per cycle ( +- 0.99% ) (32.06%) 693,957,480 cycles ( +- 0.71% ) (42.10%) 80,995 L1-icache-load-misses ( +- 4.40% ) (52.02%) 7,940,568 L1-dcache-load-misses ( +- 3.99% ) (51.50%) 466,446,812 r01d1 ( +- 1.26% ) (37.19%) 4,199,010 r02d1 ( +- 6.29% ) (36.46%) ``` 從有使用 prefetch 的差別應該是在使用了之後 L2 cache hits 的數量增加(r02d1那列)，所以讓效能變快，不過這邊沒有找到 L3 cache 相關的 event 可以使用，還有待改進 > 有時間應該還可以再分析一下，在關閉 HW Prefetch 下效能的改變，還沒有查如何關閉 ### AVX 我電腦只有支援到 AVX ，就用 sse 的版本改來看效能上有沒有改變這邊不懂怎麼轉換成 avx 的版本，看了一下同學的共筆有找到兩種辦法，[jack81306](https://hackmd.io/s/rJRG6D4ix#使用-avx-進行改寫) 用 `_mm256_permute2x128_si256` 這個指令來改寫，以及 [twzjwang](https://hackmd.io/s/HJDVmyZje#實驗一-88-矩陣轉置，使用-avx) 使用`_mm256_shuffle_ps` + `_mm256_permute2f128_ps` 這邊就用 `_mm256_permute2x128_si256` 來實做，但目前一直卡在 `illegal hardware instruction (core dumped)`的錯誤訊息上，然後查了一段時間發現我電腦根本不支援 avx2 所以一堆指令不能用要再找找從網路上找到的avx指令能用的型態都不同，光是一堆轉換型態的搞死我了qq，可能要之後再補方法了 ```info The __m256 data type is used to represent the contents of the extended SSE register - the YMM register, used by the Intel® AVX intrinsics. The __m256 data type can hold eight 32-bit floating-point values. The __m256d data type can hold four 64-bit double precision floating-point values. The __m256i data type can hold thirty-two 8-bit, sixteen 16-bit, eight 32-bit, or four 64-bit integer values. ``` ## 參考資料 [kaizsv 的共筆](https://hackmd.io/s/r1IWtb9R)