# 2017q1 Homework3 (software-pipelining)
contributed by < [Chihsiang](https://github.com/ChiHsiang/prefetcher) >

[Assignment wiki](https://hackmd.io/s/rks62p1sl)
[Prefetch paper](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
[Chinese summary of the paper](https://hackmd.io/s/HJtfT3icx#名詞解釋)

# Studying the paper
First, look for the answers to the three questions the paper raises:
1. What are the limitations and costs of software prefetching?
2. What are the limitations and costs of hardware prefetching?
3. When does software or hardware prefetching give the best result?

The paper first gives a table of which data structures are currently covered by software and hardware prefetching.
![](https://i.imgur.com/hrlWa8W.png)
From this table, the widely used Array and Hash structures both have prefetching support.
![](https://i.imgur.com/U0649pq.png)
![](https://i.imgur.com/DOIvHix.png)
The tables above classify prefetches and place them on a timeline. ==I do not fully understand the timeline figure yet==; from the text

> Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency

I can only take away that the request must be issued early enough to hide the memory latency.

* Prefetch distance condition (a concrete sketch follows at the end of this section)
![](https://i.imgur.com/yrmVLf6.png)
  The condition is D ≥ ⌈ l / s ⌉, where
    * D is the prefetch distance
    * l is the latency of bringing the data in from memory
    * s is the length of the shortest path through the loop body
* Prefetch distance used by the insertion algorithm
![](http://i.imgur.com/BxZ0Yc1.png)
    * D is the prefetch distance
    * L is the average memory latency
    * K is a constant
    * IPCb is the profiled average IPC of each benchmark
    * Wloop is the average number of instructions per loop iteration
* Cases where software prefetching beats hardware prefetching
    * Large Number of Streams
        - A hardware prefetcher can only track a limited number of streams, while software can insert a prefetch into every stream as needed.
    * Short Streams
        - A hardware prefetcher needs at least two cache misses of training before it can detect the direction of a stream or stride.
    * Irregular Memory Access
        - Software prefetching copes better with complex data structures.
    * Cache Locality Hint
        - With software prefetching the programmer can choose the locality hint directly.
        - Hardware prefetchers are mostly designed to place prefetched data in a lower-level cache (L2 or L3).
        - Advantage: less cache pollution in the L1 cache.
        - Drawback: the latency of moving data from the lower-level cache into L1 costs performance.
    * Loop Bounds
        - Software prefetching can be bounded so that it never issues requests beyond the number of loop iterations, e.g. via loop unrolling, software pipelining, or branch instructions.
        - Hardware prefetching has no such control; an aggressive prefetcher (large prefetch distance, high prefetch rate) may even waste memory bandwidth.
* Drawbacks of software prefetching
    * Instruction count
    * Static insertion
    * Code structure change
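To make the prefetch-distance and loop-bound points concrete, here is a minimal sketch — not taken from the paper or the handout — of software prefetching in a simple reduction loop. The function name, the distance `D = 16`, and the locality hint are arbitrary choices for illustration.

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical example: sum an array while prefetching D iterations ahead.
 * The request for a[i + D] is issued while a[i] is being consumed, so by the
 * time the loop reaches index i + D the line should already be in the cache,
 * provided D is large enough to cover the memory latency. */
long sum_with_prefetch(const int *a, int n)
{
    enum { D = 16 };            /* prefetch distance, chosen arbitrarily here */
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + D < n)          /* software can respect the loop bound */
            _mm_prefetch((const char *) &a[i + D], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
```

The sketch also shows two of the drawbacks listed above: extra instructions (the prefetch and the bound check) and a distance that is fixed at insertion time.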
# Improving the program
* Environment info `$ lscpu`
```shell=
⚡ lscpu
Architecture:          x86_64
CPU 作業模式:          32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
每核心執行緒數:        2
每通訊端核心數:        4
Socket(s):             1
NUMA 節點:             1
供應商識別號:          GenuineIntel
CPU 家族:              6
型號:                  58
Model name:            Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
製程:                  9
CPU MHz:               1279.431
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              4589.27
虛擬:                  VT-x
L1d 快取:              32K
L1i 快取:              32K
L2 快取:               256K
L3 快取:               6144K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
```
* Analysis of the original program
    - Execution time
```
sse prefetch: 54074 us
sse: 107184 us
naive: 224151 us
```
    - perf stat
        - naive
```
naive: 241652 us
 Performance counter stats for './naive_transpose':

        18,172,399      cache-misses              #   93.396 % of all cache refs      (43.87%)
        19,457,264      cache-references                                              (43.86%)
        20,157,827      L1-dcache-load-misses     #    3.77% of all L1-dcache hits    (44.24%)
       534,896,289      L1-dcache-loads                                               (44.73%)
            22,319      L1-dcache-prefetch-misses                                     (23.05%)
         4,204,752      L1-dcache-store-misses                                        (22.85%)
            32,932      L1-icache-load-misses                                         (33.98%)
       282,353,767      branch-instructions                                           (44.91%)
           563,433      branch-misses             #    0.20% of all branches          (44.05%)

       0.463622418 seconds time elapsed
```
        - sse
```
sse: 116348 us
 Performance counter stats for './sse_transpose':

         5,864,063      cache-misses              #   82.944 % of all cache refs      (45.02%)
         7,069,928      cache-references                                              (45.68%)
         8,338,277      L1-dcache-load-misses     #    1.90% of all L1-dcache hits    (46.32%)
       439,214,122      L1-dcache-loads                                               (44.80%)
           108,383      L1-dcache-prefetch-misses                                     (22.49%)
         4,283,935      L1-dcache-store-misses                                        (21.30%)
            35,892      L1-icache-load-misses                                         (31.90%)
       278,020,930      branch-instructions                                           (42.50%)
           571,414      branch-misses             #    0.21% of all branches          (42.43%)

       0.340159594 seconds time elapsed
```
        - sse_prefetch
```
sse prefetch: 55429 us
 Performance counter stats for './sse_prefetch_transpose':

         7,835,410      cache-misses              #   92.816 % of all cache refs      (45.05%)
         8,441,831      cache-references                                              (45.39%)
         8,624,095      L1-dcache-load-misses     #    1.84% of all L1-dcache hits    (45.40%)
       469,033,285      L1-dcache-loads                                               (42.97%)
            17,002      L1-dcache-prefetch-misses                                     (21.98%)
         3,474,969      L1-dcache-store-misses                                        (23.26%)
            24,527      L1-icache-load-misses                                         (34.40%)
       259,320,698      branch-instructions                                           (45.21%)
           509,994      branch-misses             #    0.20% of all branches          (44.58%)

       0.293630274 seconds time elapsed
```
From these numbers, the naive version has both the highest cache-miss rate and count and the longest run time; the SSE version without prefetching also misses more than the one with it, and the prefetching version takes roughly half the time of plain SSE.
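The roughly 2x gap between `sse` and `sse prefetch` comes from issuing prefetches for a tile several rows ahead of the one currently being transposed. The handout's real implementation lives in the repository; the fragment below is only a sketch of that pattern, with the helper name `prefetch_tile` and the distance `PFDIST = 8` assumed for illustration.

```c
#include <xmmintrin.h>  /* _mm_prefetch */

#define PFDIST 8  /* prefetch distance in rows; a tuning knob, not a fixed value */

/* Sketch: before loading the current 4x4 tile, request the tile PFDIST rows
 * ahead, so its cache lines arrive while the current tile is being shuffled.
 * Prefetching past the end of the array is harmless: prefetch never faults. */
static inline void prefetch_tile(const int *src, int x, int y, int w)
{
    for (int i = 0; i < 4; i++)
        _mm_prefetch((const char *) (src + (y + PFDIST + i) * w + x),
                     _MM_HINT_T1);
}
```

Using `_MM_HINT_T1` targets the L2/L3 caches rather than L1, which is exactly the Cache Locality Hint trade-off described in the paper section above.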
* perf raw counter
Following [kaizsv's notes](https://hackmd.io/s/r1IWtb9R), first find the umask and event numbers for my own CPU architecture.
My CPU, an [Intel-Core-i7-3615QM](http://www.intel.com/content/www/us/en/processors/core/core-technical-resources.html), is ==3rd generation==, so the raw counter encodings are listed in the [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html).
perf raw counter example:
```
Example: If the Intel docs for a QM720 Core i7 describe an event as:

  Event  Umask  Event Mask
  Num.   Value  Mnemonic            Description                        Comment

  A8H    01H    LSD.UOPS            Counts the number of micro-ops     Use cmask=1 and
                                    delivered by loop stream detector  invert to count
                                                                       cycles

raw encoding of 0x1A8 can be used:

  perf stat -e r1a8 -a sleep 1
  perf record -e r1a8 ...
```
To use it, look up the event you want to measure and then combine ==Umask value + Event num = 01 + A8==.
- Looking up SW-prefetch in the manual:

|Event|Umask|Event Mask Mnemonic|Description|
|:---:|:---:|:--------:|:---------:|
|4CH|01H|LOAD_HIT_PRE.SW_PF|Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch.|
|4CH|01H|LOAD_HIT_PRE.HW_PF|Non-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch.|

- ==perf stat -e r014C ./naive==
```
naive: 224461 us
 Performance counter stats for './naive_transpose':

               402      r014C

       0.450811263 seconds time elapsed
```
- ==perf stat -e r014C ./sse==
```
sse: 107869 us
 Performance counter stats for './sse_transpose':

               339      r014C

       0.346157841 seconds time elapsed
```
- ==perf stat -e r014C ./sse_prefetch==
```
sse prefetch: 55197 us
 Performance counter stats for './sse_prefetch_transpose':

         1,798,956      r014C

       0.279139593 seconds time elapsed
```
These raw counters give more fine-grained event counts; there are many of them and I do not yet fully understand what kind of performance analysis each count supports.

* Verify mechanism
I wanted an automated way to check whether the transposed result is correct, so I modelled it on classmate [twzjwang](https://github.com/twzjwang/prefetcher/blob/master/verify.c)'s implementation (a minimal sketch of the idea is shown after the AVX code below).
p.s. [my implementation](https://github.com/ChiHsiang/prefetcher/blob/feature_avx/verify.c)

* AVX Transpose
    - Following the SSE version, implement the transpose on 8*8 tiles
        * _mm256_loadu_ps loads 256 bits of data into the destination register (the `u` variant does not require alignment)
        * _mm256_unpacklo_ps treats sources a and b as two 128-bit lanes and interleaves the low 64 bits of each lane into the destination
        * _mm256_shuffle_ps picks elements from the sources according to the imm8 control byte and stores them into the destination
          p.s. [imm8 wiki](https://en.wikibooks.org/wiki/X86_Assembly/SSE#IMM8_control_byte_description_2)
        * _mm256_permute2f128_ps selects whole 128-bit lanes from the sources according to imm8 and stores them into the destination
```c=
void avx_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            __m256 r0, r1, r2, r3, r4, r5, r6, r7;
            __m256 t0, t1, t2, t3, t4, t5, t6, t7;

            /* load one 8x8 tile, one row per register */
            r0 = _mm256_loadu_ps((float *) (src + (y + 0) * w + x));
            r1 = _mm256_loadu_ps((float *) (src + (y + 1) * w + x));
            r2 = _mm256_loadu_ps((float *) (src + (y + 2) * w + x));
            r3 = _mm256_loadu_ps((float *) (src + (y + 3) * w + x));
            r4 = _mm256_loadu_ps((float *) (src + (y + 4) * w + x));
            r5 = _mm256_loadu_ps((float *) (src + (y + 5) * w + x));
            r6 = _mm256_loadu_ps((float *) (src + (y + 6) * w + x));
            r7 = _mm256_loadu_ps((float *) (src + (y + 7) * w + x));

            /* interleave pairs of rows within each 128-bit lane */
            t0 = _mm256_unpacklo_ps(r0, r1);
            t1 = _mm256_unpackhi_ps(r0, r1);
            t2 = _mm256_unpacklo_ps(r2, r3);
            t3 = _mm256_unpackhi_ps(r2, r3);
            t4 = _mm256_unpacklo_ps(r4, r5);
            t5 = _mm256_unpackhi_ps(r4, r5);
            t6 = _mm256_unpacklo_ps(r6, r7);
            t7 = _mm256_unpackhi_ps(r6, r7);

            /* gather 2-element groups across row pairs */
            r0 = _mm256_shuffle_ps(t0, t2, 0x44);
            r1 = _mm256_shuffle_ps(t0, t2, 0xEE);
            r2 = _mm256_shuffle_ps(t1, t3, 0x44);
            r3 = _mm256_shuffle_ps(t1, t3, 0xEE);
            r4 = _mm256_shuffle_ps(t4, t6, 0x44);
            r5 = _mm256_shuffle_ps(t4, t6, 0xEE);
            r6 = _mm256_shuffle_ps(t5, t7, 0x44);
            r7 = _mm256_shuffle_ps(t5, t7, 0xEE);

            /* swap the 128-bit lanes to finish the transpose */
            t0 = _mm256_permute2f128_ps(r0, r4, 0x20);
            t1 = _mm256_permute2f128_ps(r1, r5, 0x20);
            t2 = _mm256_permute2f128_ps(r2, r6, 0x20);
            t3 = _mm256_permute2f128_ps(r3, r7, 0x20);
            t4 = _mm256_permute2f128_ps(r0, r4, 0x31);
            t5 = _mm256_permute2f128_ps(r1, r5, 0x31);
            t6 = _mm256_permute2f128_ps(r2, r6, 0x31);
            t7 = _mm256_permute2f128_ps(r3, r7, 0x31);

            /* store the transposed tile, one row at a time */
            _mm256_storeu_ps((float *) (dst + ((x + 0) * h) + y), t0);
            _mm256_storeu_ps((float *) (dst + ((x + 1) * h) + y), t1);
            _mm256_storeu_ps((float *) (dst + ((x + 2) * h) + y), t2);
            _mm256_storeu_ps((float *) (dst + ((x + 3) * h) + y), t3);
            _mm256_storeu_ps((float *) (dst + ((x + 4) * h) + y), t4);
            _mm256_storeu_ps((float *) (dst + ((x + 5) * h) + y), t5);
            _mm256_storeu_ps((float *) (dst + ((x + 6) * h) + y), t6);
            _mm256_storeu_ps((float *) (dst + ((x + 7) * h) + y), t7);
        }
    }
}
```
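The linked verify.c is not reproduced here. As a minimal sketch of the idea — compute a reference transpose the obvious way and compare it against the SIMD output — something like the following would do; the function name and calling convention are assumptions, not the actual code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build the reference transpose the slow-but-obvious
 * way, then compare it byte-for-byte against the SIMD result. */
static int verify_transpose(const int *src, const int *result, int w, int h)
{
    int *expect = malloc(sizeof(int) * w * h);
    if (!expect)
        return 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            expect[x * h + y] = src[y * w + x];
    int ok = (memcmp(expect, result, sizeof(int) * w * h) == 0);
    printf("transpose result is %s\n", ok ? "correct" : "WRONG");
    free(expect);
    return ok;
}
```

The same check works for the SSE, SSE-prefetch, and AVX variants, since they all write into a destination buffer of the same layout.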
* Execution time
```shell=
naive: 231006 us
sse: 114900 us
avx: 65514 us
sse prefetch: 54519 us
```
From the timings, the AVX version processes twice as many elements per tile as SSE, and the speedup is indeed close to 2x.

* Adding prefetch
AVX does not add a prefetch instruction of its own, so the SSE `_mm_prefetch` is used instead.
```c=
Synopsis
    void _mm_prefetch (char const* p, int i)
    #include "xmmintrin.h"
Instruction: prefetchnta mprefetch
             prefetcht0 mprefetch
             prefetcht1 mprefetch
             prefetcht2 mprefetch
CPUID Flags: SSE
Description
    Fetch the line of data from memory that contains address p to a location in the
    cache hierarchy specified by the locality hint i.
```
* AVX Prefetch: [code](https://github.com/ChiHsiang/prefetcher/blob/feature_avx/impl.c#LC131)
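Since the AVX prefetch version is only linked above, the following is just a sketch of where the prefetches would go if they are inserted the same way as in sse_prefetch_transpose: before loading each 8x8 tile, request the tile `PFDIST` rows ahead. `PFDIST = 8` and the loop structure are assumptions based on the SSE version, not a copy of the linked impl.c.

```c
#include <immintrin.h>  /* AVX intrinsics */
#include <xmmintrin.h>  /* _mm_prefetch */

#define PFDIST 8  /* prefetch distance in rows, same knob as the SSE version */

/* Sketch: same 8x8 tiling as avx_transpose above, with prefetches for the
 * tile PFDIST rows ahead issued before the current tile is handled. */
void avx_prefetch_transpose_sketch(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            /* prefetch never faults, so running past the last row is harmless */
            for (int i = 0; i < 8; i++)
                _mm_prefetch((const char *) (src + (y + PFDIST + i) * w + x),
                             _MM_HINT_T1);
            /* ... the 8x8 load / unpack / shuffle / permute / store sequence
             * from avx_transpose above goes here, unchanged ... */
        }
    }
}
```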
# Main analysis
* Perf Stat
    * SSE
```
sse: 114395 us
 Performance counter stats for './sse_transpose':

         6,613,088      cache-misses              #   81.627 % of all cache refs      (44.81%)
         8,101,572      cache-references                                              (44.81%)
         8,963,416      L1-dcache-load-misses     #    2.03% of all L1-dcache hits    (44.80%)
       441,279,258      L1-dcache-loads                                               (42.83%)
            19,481      L1-dcache-prefetch-misses                                     (22.27%)
         4,194,679      L1-dcache-store-misses                                        (23.04%)
            29,409      L1-icache-load-misses                                         (34.18%)
       279,189,951      branch-instructions                                           (45.05%)
           555,463      branch-misses             #    0.20% of all branches          (44.55%)

       0.362906873 seconds time elapsed
```
    * AVX
```
avx: 64653 us
 Performance counter stats for './avx_transpose':

         5,649,660      cache-misses              #   80.898 % of all cache refs      (45.06%)
         6,983,688      cache-references                                              (45.80%)
         7,684,501      L1-dcache-load-misses     #    1.86% of all L1-dcache hits    (45.84%)
       412,369,472      L1-dcache-loads                                               (43.41%)
            23,589      L1-dcache-prefetch-misses                                     (21.68%)
         4,697,455      L1-dcache-store-misses                                        (23.25%)
            26,099      L1-icache-load-misses                                         (34.38%)
       248,622,769      branch-instructions                                           (45.20%)
           518,603      branch-misses             #    0.21% of all branches          (44.57%)

       0.295939411 seconds time elapsed
```
What stands out to me in the numbers above is ==L1-dcache-prefetch-misses and L1-dcache-store-misses==: the AVX version surprisingly produces more of these misses.
    * SSE prefetch (D = 8)
```
sse prefetch: 54383 us
 Performance counter stats for './sse_prefetch_transpose':

         6,208,199      cache-misses              #   98.981 % of all cache refs      (43.46%)
         6,272,119      cache-references                                              (43.45%)
         6,500,214      L1-dcache-load-misses     #    1.46% of all L1-dcache hits    (44.17%)
       444,322,322      L1-dcache-loads                                               (44.87%)
            24,325      L1-dcache-prefetch-misses                                     (23.58%)
         4,606,491      L1-dcache-store-misses                                        (23.23%)
            26,308      L1-icache-load-misses                                         (34.35%)
       254,746,423      branch-instructions                                           (45.06%)
           542,430      branch-misses             #    0.21% of all branches          (43.65%)

       0.283370894 seconds time elapsed
```
    * AVX prefetch (D = 8)
```
avx prefetch: 58723 us
 Performance counter stats for './avx_prefetch_transpose':

         6,158,483      cache-misses              #   86.765 % of all cache refs      (44.96%)
         7,097,877      cache-references                                              (44.94%)
         7,500,835      L1-dcache-load-misses     #    1.76% of all L1-dcache hits    (44.94%)
       425,937,839      L1-dcache-loads                                               (42.50%)
            23,246      L1-dcache-prefetch-misses                                     (22.29%)
         4,833,529      L1-dcache-store-misses                                        (23.26%)
            29,966      L1-icache-load-misses                                         (34.40%)
       247,753,365      branch-instructions                                           (45.21%)
           515,668      branch-misses             #    0.21% of all branches          (44.58%)

       0.291183267 seconds time elapsed
```
The measurements do not yet cover the full range of D values or the number of prefetch requests; I will add those comparisons later.

* Memory Latency
Following classmate [carolc0708](https://hackmd.io/s/B1lDZtO0#%E5%84%AA%E5%8C%96%E5%98%97%E8%A9%A62-%E8%AA%BF%E6%95%B4pfdist#優化嘗試2-調整pfdist)'s notes, I also wanted to measure the memory latency of my own machine.
Test:
```
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
		Memory node
 Socket	     0
     0	  60.6

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	22924.6
3:1 Reads-Writes :	21408.4
2:1 Reads-Writes :	21079.7
1:1 Reads-Writes :	20016.7
Stream-triad like:	20424.8

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
		Memory node
 Socket	     0
     0	22919.5

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	135.19	  23058.9
 00002	134.82	  23053.4
 00008	133.35	  23043.2
 00015	132.27	  23037.6
 00050	126.29	  22870.4
 00100	 82.99	  20031.6
 00200	 64.48	  11700.4
 00300	 61.22	   8385.7
 00400	 61.37	   6637.3
 00500	 60.76	   5573.2
 00700	 60.60	   4321.9
 01000	 61.27	   3353.8
 01300	 61.46	   2828.7
 01700	 61.51	   2413.9
 02500	 61.51	   1978.3
 03500	 61.88	   1682.8
 05000	 61.84	   1492.8
 09000	 61.96	   1285.1
 20000	 61.83	   1150.6

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	19.1
Local Socket L2->L2 HITM latency	22.8
```
* Running `mlc` while avx prefetch is running
```
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
		Memory node
 Socket	     0
     0	  60.8

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	22224.3
3:1 Reads-Writes :	20991.3
2:1 Reads-Writes :	20937.0
1:1 Reads-Writes :	19920.7
Stream-triad like:	20050.8

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
		Memory node
 Socket	     0
     0	22883.8

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	141.15	  22473.7
 00002	136.83	  22856.0
 00008	135.33	  22841.5
 00015	136.96	  22582.6
 00050	131.97	  22277.7
 00100	 86.87	  19832.6
 00200	 67.08	  11648.3
 00300	 61.98	   8370.2
 00400	 62.60	   6614.3
 00500	 62.30	   5439.4
 00700	 70.03	   3344.9
 01000	 70.56	   2536.9
 01300	 62.81	   2642.3
 01700	 63.33	   2192.8
 02500	 62.48	   1885.3
 03500	 61.62	   1704.1
 05000	 62.17	   1473.5
 09000	 82.62	    917.5
 20000	 64.81	   1086.5

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	21.0
Local Socket L2->L2 HITM latency	23.2
```
The numbers do change under load, but I am still working out how to control this experiment properly.

# References
* Transpose
    - [Transpose an 8x8 float using AVX/AVX2](http://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2)
    - [Fast memory transpose with SSE, AVX, and OpenMP](https://stackoverflow.com/questions/16941098/fast-memory-transpose-with-sse-avx-and-openmp)
* Classmates' notes
    - [yenWu](https://hackmd.io/EYMwbAhgDArAnAFgLQmGATEhB2AppiCAZhCQBMBGXGBGXCkCMEIA?view)
    - [kaizsv](https://hackmd.io/s/r1IWtb9R)
    - [carolc0708](https://hackmd.io/s/B1lDZtO0#2016q3-homework3-software-pipelining)