# 2017q1 Homework3 (software-pipelining)
contributed by < `laochanlam` >
###### tags: `laochanlam`
## Development Environment
- Ubuntu 16.10 (64-bit)
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Model name: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
Stepping: 4
CPU MHz: 2400.878
CPU max MHz: 3000.0000
CPU min MHz: 500.0000
BogoMIPS: 4788.90
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
```
## Related Links
- [2017 Spring assignment guidelines](https://hackmd.io/s/Bk-1zqIte#)
- [2017q1 Homework3 (assignment area)](https://hackmd.io/IYUxBMGYEZxBaArAJnAM3gFgJwDZfzC4Dsw8AHJgEYzDTTIDGVuQA===#)
- [B06: software-pipelining assignment requirements](https://hackmd.io/s/rks62p1sl)
- [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
- [When Prefetching Works, When It Doesn’t, and Why: key points and commentary](https://hackmd.io/s/HJtfT3icx)
- [Course schedule and open resources wiki](http://wiki.csie.ncku.edu.tw/sysprog/schedule)
- [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx)
## Reading Notes
### Terms and Their Friends
- SW
    - compiler + intrinsics inserted into the program
- GHB
    - (stride prefetcher)
    - [Global History Buffer](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf)
    - compiler + HW prefetcher
- STR
    - (stream prefetcher)
    - compiler + HW prefetcher
---
### 1. INTRODUCTION
![Imgur](http://i.imgur.com/pSuHzeN.png)
The impact of different kinds of SW/HW prefetchers, and their combinations, on execution time.
- Y-axis: execution time
- X-axis: [CPU 2006 benchmarks](http://baike.baidu.com/item/SPEC%20CPU%202006)
### Stride vs. Stream
>Note that, in this article, we refer to unit-stride cache-line accesses as **streams** and access **stride** distances greater than two cache lines as strided.
---
### 2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
### Prefetch distance
![](https://i.imgur.com/JsV1VpJ.png)
*l* stands for the **prefetch latency**.
*s* stands for **the length of the shortest path through the loop body**.
>we define prefetch distance as the distance ahead of which a prefetch should be requested on a memory address
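Using these definitions, the formula in the figure boils down to requiring the prefetch distance $D$ (in loop iterations) to be large enough to hide the prefetch latency:

$$D \ge \left\lceil \frac{l}{s} \right\rceil$$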
### Direct and Indirect Memory Indexing
- **Direct memory indexing** is easy for a hardware prefetcher to handle.
- In contrast, **indirect memory indexing** is better served by software prefetching (see the sketch below).
    - Hardware would need extra mechanisms to follow data-dependent addresses.
- **Indirect** indexing always carries more overhead than **direct** indexing.
    - The number of load instructions required is different.
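
A minimal sketch contrasting the two indexing patterns (the arrays `a`, `b`, `idx` and the distance `PD` are hypothetical, purely for illustration): hardware can predict the direct case on its own, while the indirect case needs an extra load of `idx[i + PD]` just to form the prefetch address, which is where the additional load instructions come from.
```C
#include <xmmintrin.h>

#define PD 16 /* hypothetical prefetch distance, for illustration only */

/* Direct indexing: the address of a[i] is a linear function of i, so a
 * hardware stride/stream prefetcher can predict it by itself. */
long sum_direct(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Indirect indexing: the address of b[idx[i]] depends on loaded data, which
 * hardware cannot predict, but software can still prefetch it.  Note the
 * extra load of idx[i + PD] needed only to compute the prefetch address. */
long sum_indirect(const int *b, const int *idx, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PD < n)
            _mm_prefetch((const char *) &b[idx[i + PD]], _MM_HINT_T0);
        sum += b[idx[i]];
    }
    return sum;
}
```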
---
### 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
### Benefits of Software Prefetching over Hardware Prefetching
1. HW cannot track too many streams at once without being overloaded; SW can handle each stream independently.
2. If a stream is too short, HW never gets the chance to train on it (training usually takes at least two cache misses).
3. HW has trouble with complex data structures.
4. SW can attach a cache locality hint (keep the data in L1, L2, or L3); HW usually drops prefetched data into L2 or L3, which can cost performance.
5. HW cannot handle loop bounds effectively, whereas SW can be told the loop bound directly.
### Negative Impacts of Software Prefetching
1. SW increases the instruction count.
2. Because SW prefetches are inserted by hand, their behavior cannot be adjusted dynamically while the program runs (they lack adaptivity); modern HW has feedback-driven mechanisms for this, but SW does not yet.
3. Code structure change (see the loop-splitting sketch below).
>Need to read a bit more before finishing this part
>[name=laochanlam][color=blue]
> Keep it up
> [name=洪正皇][color=#cb64e5]
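For point 3, here is a minimal sketch of loop splitting, one typical code-structure change that software prefetching forces on the source (the array and the distance `PD` are hypothetical): the loop is split so the main body can prefetch unconditionally, and only the short epilogue runs without prefetches and without a per-iteration bounds check.
```C
#include <xmmintrin.h>

#define PD 16 /* hypothetical prefetch distance, for illustration only */

long sum_with_prefetch(const int *a, int n)
{
    long sum = 0;
    int i = 0;

    /* main loop: a[i + PD] is always in bounds here,
     * so the prefetch needs no per-iteration check */
    for (; i < n - PD; i++) {
        _mm_prefetch((const char *) &a[i + PD], _MM_HINT_T0);
        sum += a[i];
    }

    /* epilogue: the last PD elements, with no prefetches issued */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```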
### Synergistic Effects when Using Software and Hardware Prefetching Together
1. HW handles the regular streams, while SW handles the irregular streams.
2. SW prefetches can be used to train the HW prefetcher.
### Antagonistic Effects when Using Software and Hardware Prefetching Together
1. SW may not cover some streams, causing the HW prefetcher to be trained incorrectly.
2. Over-training the HW prefetcher can make it overly aggressive, so prefetches arrive too early.
---
## Development Log
I have more or less finished the paper. Next I will restructure the program in an object-oriented style, produce ```naive_transpose```, ```sse_transpose```, and ```sse_prefetch_transpose```, and evaluate their performance with perf.
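The measurements below were collected roughly like this (the event list mirrors the counters shown in the output, and ```-r 20``` matches the 20 runs reported; the exact invocation in my scripts may differ slightly):
```
$ perf stat -e cache-misses,cache-references,instructions,cycles -r 20 ./naive_transpose
```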
### naive_transpose
```C
void naive_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x++)
for (int y = 0; y < h; y++)
*(dst + x * h + y) = *(src + y * w + x);
}
```
```naive_transpose``` is the plain, straightforward transpose: ```src``` is treated as an h-by-w row-major matrix and ```dst``` receives the w-by-h transposed result (a small driver sketch follows).
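A minimal driver sketch, just to make the layout explicit (the matrix size and the random initialization are assumptions; the course skeleton's ```main.c``` does its own setup, timing, and verification):
```C
#include <stdio.h>
#include <stdlib.h>

#define TEST_W 4096 /* assumed matrix width */
#define TEST_H 4096 /* assumed matrix height */

void naive_transpose(int *src, int *dst, int w, int h); /* from the code above */

int main(void)
{
    int *src = malloc(sizeof(int) * TEST_W * TEST_H);
    int *dst = malloc(sizeof(int) * TEST_W * TEST_H);
    if (!src || !dst)
        return -1;

    /* src holds TEST_H rows of TEST_W ints, stored row-major */
    for (int y = 0; y < TEST_H; y++)
        for (int x = 0; x < TEST_W; x++)
            src[y * TEST_W + x] = rand();

    /* dst ends up as TEST_W rows of TEST_H ints */
    naive_transpose(src, dst, TEST_W, TEST_H);

    /* element (0, 1) of src should now sit at (1, 0) of dst */
    printf("%d %d\n", src[1], dst[TEST_H]);

    free(src);
    free(dst);
    return 0;
}
```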
Below are the perf results:
```
Performance counter stats for './naive_transpose' (20 runs):
25,303,321 cache-misses # 94.060 % of all cache refs ( +- 0.92% )
26,901,122 cache-references ( +- 0.68% )
1,466,156,276 instructions # 1.09 insn per cycle ( +- 0.02% )
1,341,339,997 cycles ( +- 2.14% )
0.465588938 seconds time elapsed ( +- 2.32% )
```
---
### sse_transpose
```C
void sse_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
__m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```
I recommend [ierosodin's notes](https://hackmd.io/MYNgzA7ArMAMCmBaAHATgCYCNEBYSwEMUIcJEAmAMxwEZR0dYopMg===?both), which explain this in great detail!
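To sketch what the unpack sequence does to one 4×4 tile: if the four loaded rows are I0 = (a0, a1, a2, a3), I1 = (b0, b1, b2, b3), I2 = (c0, c1, c2, c3), and I3 = (d0, d1, d2, d3), then ```_mm_unpacklo_epi32(I0, I1)``` produces T0 = (a0, b0, a1, b1) and ```_mm_unpacklo_epi32(I2, I3)``` produces T1 = (c0, d0, c1, d1); the following ```_mm_unpacklo_epi64(T0, T1)``` merges their low 64-bit halves into (a0, b0, c0, d0), which is exactly the first column of the tile, and the remaining combinations yield the other three columns, ready to be stored as rows of ```dst```.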
```
Performance counter stats for './sse_transpose' (20 runs):
11,237,627 cache-misses # 89.150 % of all cache refs ( +- 0.56% )
12,605,267 cache-references ( +- 0.19% )
1,253,550,522 instructions # 1.50 insn per cycle ( +- 0.00% )
834,419,437 cycles ( +- 1.51% )
0.290138817 seconds time elapsed ( +- 1.71% )
```
---
### sse_prefetch_transpose
```C
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
#define PFDIST 8
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
__m128i I0 = _mm_loadu_si128 ((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128 ((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128 ((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128 ((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```
```sse_prefetch_transpose``` adds prefetching on top of ```sse_transpose```.
#### _mm_prefetch(char const* p, int i)
The ```_mm_prefetch(char const* p, int i)``` used here is the same intrinsic discussed in the paper: ```p``` is the memory address to prefetch, and ```i``` is the locality hint.
#### Locality hint
![Imgur](http://i.imgur.com/kZ81msp.png)
T0, T1, T2, or NTA can be passed in the ```i``` position as needed (see the sketch below).
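A minimal sketch of the four hint constants from ```<xmmintrin.h>```; which cache level each hint really lands in is microarchitecture dependent, and the comments only reflect the documented intent.
```C
#include <xmmintrin.h>

static void prefetch_with_every_hint(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);  /* temporal data: all cache levels */
    _mm_prefetch(p, _MM_HINT_T1);  /* into L2 and beyond, not L1 */
    _mm_prefetch(p, _MM_HINT_T2);  /* into L3 (last level) and beyond */
    _mm_prefetch(p, _MM_HINT_NTA); /* non-temporal: minimize cache pollution */
}
```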
#### PFDIST
PFDIST corresponds to the paper's prefetch distance and is set to 8 here. Since each iteration of the inner loop advances ```y``` by 4 rows (each 128-bit load covers 4 ints), prefetching 8 rows ahead means the prefetch runs roughly two iterations ahead of the loads.
>This does not seem to account for the latency and other factors; corrections are welcome.
>[name=laochanlam][color=blue]
Performance results:
```
Performance counter stats for './sse_prefetch_transpose' (20 runs):
8,221,632 cache-misses # 83.978 % of all cache refs ( +- 1.02% )
9,790,247 cache-references ( +- 0.45% )
1,299,672,492 instructions # 1.99 insn per cycle ( +- 0.01% )
651,782,650 cycles ( +- 1.52% )
0.224777752 seconds time elapsed ( +- 1.78% )
```
---
Following the approach used in the phonebook assignment, I first plotted the execution time of the three implementations with perf and gnuplot:
![Imgur](http://i.imgur.com/iP7mCej.png)
It is clear how much room for improvement my everyday code has, orz.
---
### Experiment 1: Varying PFDIST
Modify PFDIST to check the formula's prediction and to see how much early and late prefetches affect performance (a sketch of how the value is swept is below).
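One simple way to sweep the value without editing the source each time is to let the build pass it in, for example with ```-DPFDIST=10``` on the compiler command line; this is only a sketch of the idea, not necessarily how the branch actually does it.
```C
/* keep 8 only as the default when the build does not define PFDIST */
#ifndef PFDIST
#define PFDIST 8
#endif
```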
Execution time with PFDIST set to 2, 4, 6, 8, 10, 12, and 14:
![Imgur](http://i.imgur.com/JoOyMzi.png)
**Surprisingly, PFDIST = 10 turns out to be the fastest!**
I will have to figure out how to determine the prefetch latency and rethink my estimate of the "shortest path through the loop body".
:::info
The code for this experiment is on the [PFDIST_Experiment](https://github.com/laochanlam/prefetcher/tree/PFDIST_Experiment) branch.
:::
---
### Experiment 2: Varying the Locality Hint
![Imgur](http://i.imgur.com/hdv5d3U.png)
Apart from NTA, the hints T0, T1, and T2 show no significant difference; possibly the amount of prefetching and work is too small for the effect to show up.
>Needs verification!
>[name=laochanlam][color=blue]
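As a sketch of the setup (the branch's actual diff may differ): the hint can be made a compile-time macro so each build uses one of T0/T1/T2/NTA, and the four prefetch lines in ```sse_prefetch_transpose``` then read:
```C
/* select the locality hint at build time, e.g. -DPF_HINT=_MM_HINT_NTA */
#ifndef PF_HINT
#define PF_HINT _MM_HINT_T1
#endif

_mm_prefetch(src + (y + PFDIST + 0) * w + x, PF_HINT);
_mm_prefetch(src + (y + PFDIST + 1) * w + x, PF_HINT);
_mm_prefetch(src + (y + PFDIST + 2) * w + x, PF_HINT);
_mm_prefetch(src + (y + PFDIST + 3) * w + x, PF_HINT);
```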
:::info
The code for this experiment is on the [LOCALITY_HINT_Experiment](https://github.com/laochanlam/prefetcher/tree/LOCALITY_HINT_Experiment) branch.
:::
---
### AVX Optimization
Following [a senior student's write-up](https://embedded2015.hackpad.com/ep/pad/static/VGN4PI1cUxh), I implemented AVX and AVX-with-prefetch versions (PFDIST tentatively set to 16 here).
The execution times are shown below:
![Imgur](http://i.imgur.com/Cwh1yB8.png)
It turns out that SSE with prefetching is actually faster than AVX, ~~and I started questioning my life.~~
>Needs verification++!
>[name=laochanlam][color=blue]
At one point I forgot to pass the ```-mavx2``` flag when compiling, which kept me stuck for quite a while.
>Me too!!!
>[name=洪正皇][color=#f4b000]
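For reference, this is the shape of the 8×8 AVX2 transpose kernel I mean by "AVX": a sketch following the usual unpack plus ```_mm256_permute2x128_si256``` pattern. It assumes ```w``` and ```h``` are multiples of 8 and needs ```-mavx2```; the code I actually benchmarked may differ in details.
```C
#include <immintrin.h>

void avx_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            __m256i r[8], t[8], u[8];
            for (int i = 0; i < 8; i++)
                r[i] = _mm256_loadu_si256((__m256i *)(src + (y + i) * w + x));

            /* interleave 32-bit elements within each 128-bit lane */
            for (int i = 0; i < 8; i += 2) {
                t[i]     = _mm256_unpacklo_epi32(r[i], r[i + 1]);
                t[i + 1] = _mm256_unpackhi_epi32(r[i], r[i + 1]);
            }
            /* interleave 64-bit pairs within each lane */
            u[0] = _mm256_unpacklo_epi64(t[0], t[2]);
            u[1] = _mm256_unpackhi_epi64(t[0], t[2]);
            u[2] = _mm256_unpacklo_epi64(t[1], t[3]);
            u[3] = _mm256_unpackhi_epi64(t[1], t[3]);
            u[4] = _mm256_unpacklo_epi64(t[4], t[6]);
            u[5] = _mm256_unpackhi_epi64(t[4], t[6]);
            u[6] = _mm256_unpacklo_epi64(t[5], t[7]);
            u[7] = _mm256_unpackhi_epi64(t[5], t[7]);
            /* finally swap the 128-bit lanes across the two halves */
            for (int i = 0; i < 4; i++) {
                r[i]     = _mm256_permute2x128_si256(u[i], u[i + 4], 0x20);
                r[i + 4] = _mm256_permute2x128_si256(u[i], u[i + 4], 0x31);
            }

            for (int i = 0; i < 8; i++)
                _mm256_storeu_si256((__m256i *)(dst + (x + i) * h + y), r[i]);
        }
    }
}
```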
---
### A Small Summary
Without enough background knowledge, I cannot come up with a reasonable explanation even when the experimental results do not match my expectations; I really do need to go back and study the related material again.
## Supplementary Topics
- [ ] GHB
- [x] loop splitting
---
## GHB (Global History Buffer)
#### References
[Data Cache Prefetching Using a Global History Buffer](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf)
---