# 2017q3 Homework2 (software-pipelining) contributed by <`HMKRL`> ### Reviewed by `zhanyangch` * SIMD 的指令可以參考 [intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) ,比用 MSDN 方便查詢。 * 已經對不同的_MM_HINT作實驗,若能增加對 PDIST 修改的實驗,與論文中 D 值的計算做比較,可使報告更完整。 * 對 gef 介紹可以更詳盡,或是列出相關的參考資料,方便其它同學也能夠學習這項工具。 ## 論文閱讀:[When Prefetching Works, When It Doesn’t, and Why](https://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf) ![](https://i.imgur.com/2ZOadps.png) 由文中的實驗結果圖可以發現,當 Software prefetching (例如 `_mm_prefetch()`) 與 Hardware prefetching (例如 [GHB](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf)) 同時使用時,並沒有明確的效能表現規則,本文重點為探索此兩種 prefetcher 交互使用的效果 其中有三項探討重點: 1. What are the limitations and overheads of software prefetching? 2. What are the limitations and overheads of hardware prefetching? 3. When is it beneficial to use software and/or hardware prefetching? ![](https://i.imgur.com/PSrAban.png) Table 1 為文中對不同資料結構適用的 prefetch 方式整理,其中關於 prefetch distance (D),可以參考圖二的時間線: ![](https://i.imgur.com/PeVlOvu.png) 可以發現 發出 prefetch request 後,資料需要一段時間才能進入 cache ,但若過太久仍未被使用,則會被移出 cache 因此採用 software prefetch 時,若太晚執行 prefetch ,則無法透過這次 prefetch 消除 memory latency 的影響,太早執行則會讓資料在被使用前就移出 cache ,都無法有效增進效能(甚至會因為 prefetch 本身的成本造成效能低落),適合的 D 值可以這樣計算: $$ D\geq[\frac{l}{s}] $$ 其中 $l$ 是 prefetch 的延遲, $s$ 則是一次迴圈的最短路徑 論文中 5.1.3 對不同 D 值的影響進行分析: ![](https://i.imgur.com/ecfc6va.png) ![](https://i.imgur.com/ZqewxX0.png) ### Memory Indexing Memory indexing 可以分為兩種:direct 與 indirect 其中 direct 型(例如一維陣列) 較容易做 prefetch (由 index 可以直接得知資料所在的位置) indirect 型(例如 Linked-List )則較不容易 圖3以簡化的組合語言指令比較了兩種 indexing 模式的差異,可以發現 indirect indexing 必須額外計算實際位址 ![](https://i.imgur.com/yVWJKXR.png) ## 理解 prefetch 程式 開發環境: ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 6144K ``` 首先注意 `impl.c` `naive_transpose` 是最直接的轉置作法,直接依照轉置矩陣的規則以迴圈實作 `sse_transpose` 則出現許多未曾使用過的 sse 指令集操作 ,因此使用 `gdb` 觀察程式執行(已加裝 [gef](https://github.com/hugsy/gef) ) 將中斷點設在 `impl.c:17` , 由 `gdb` 顯示的組合語言指令得知`_mm_loadu_si128` 用途是將數值存入 `xmm0` ,參考 [wikipedia](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Registers) 得知這是用於 sse 指令集的特殊 register ``` ────────────────────────────────────────────────────────────────────────────────────────────────────────────[ code:i386:x86-64 ]──── 0x555555554a52 <sse_transpose+109> add rax, rdx 0x555555554a55 <sse_transpose+112> mov QWORD PTR [rbp-0x200], rax 0x555555554a5c <sse_transpose+119> mov rax, QWORD PTR [rbp-0x200] 0x555555554a63 <sse_transpose+126> movdqu xmm0, XMMWORD PTR [rax] 0x555555554a67 <sse_transpose+130> movaps XMMWORD PTR [rbp-0x1c0], xmm0 → 0x555555554a6e <sse_transpose+137> mov eax, DWORD PTR [rbp-0x204] 0x555555554a74 <sse_transpose+143> add eax, 0x1 0x555555554a77 <sse_transpose+146> imul eax, DWORD PTR [rbp-0x224] 0x555555554a7e <sse_transpose+153> movsxd rdx, eax 0x555555554a81 <sse_transpose+156> mov eax, DWORD PTR [rbp-0x208] 0x555555554a87 <sse_transpose+162> cdqe ────────────────────────────────────────────────────────────────────────────────────────────────────────────[ source:impl.c+16 ]──── 12 { 13 for (int x = 0; x < w; x += 4) { 14 for (int y = 0; y < h; y += 4) { 15 __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x)); // y=0x0, x=0x0, src=0x00007fffffffd798 → [...] → 0x0000000100000000, w=0x4 → 16 __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x)); 17 __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x)); 18 __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x)); 19 __m128i T0 = _mm_unpacklo_epi32(I0, I1); 20 __m128i T1 = _mm_unpacklo_epi32(I2, I3); ``` 再觀察以下執行結果: `__m128i T0 = _mm_unpacklo_epi32(I0, I1)` ``` gef➤ p I0 $8 = {0x100000000, 0x300000002} gef➤ p I1 $9 = {0x500000004, 0x700000006} gef➤ p T0 $10 = {0x400000000, 0x500000001} ``` `__m128i T2 = _mm_unpackhi_epi32(I0, I1);`: ``` gef➤ p I0 $11 = {0x100000000, 0x300000002} gef➤ p I1 $12 = {0x500000004, 0x700000006} gef➤ p T2 $13 = {0x600000002, 0x700000003} ``` `I0 = _mm_unpacklo_epi64(T0, T1);` ``` gef➤ p T0 $23 = {0x400000000, 0x500000001} gef➤ p T1 $24 = {0xc00000008, 0xd00000009} gef➤ p I0 $25 = {0x400000000, 0xc00000008} ``` 透過以上觀察,加上參照 [MSDN](https://msdn.microsoft.com/en-us/library/ac53z079(v=vs.90).aspx) 的說明,即可理解這幾個 function 的行為 - sse register 有 128-bit, 善用可以在每道指令處理4個 32-bit integer 數值,比起一次處理一個來說可以省下許多手續 ## 產生新的執行檔,分別對應於不同模式 程式碼修改請見 [commit f15322b](https://github.com/HMKRL/prefetcher/commit/f15322bf4a66d0bc2549f8d78926f8f8908d43ad) 新增 `impl.h`, 統一所有 transpose function 名稱為 `transpose()` 並宣告函式原型在 `impl.h` 中 各種不同實作分別位於 `naive_transpose.c`,`sse_transpose.c` 與 `sse_prefetch_transpose.c` 中,透過 Makefile 決定要使用的檔案 測試結果: ``` make run ./naive_transpose naive: 600506 us ./sse_transpose sse: 307496 us ./sse_prefetch_transpose sse_prefetch: 149165 us ``` `make cache-test` |test case|cache-miss| |:-------:|:--------:| |naive |87.332 % | |sse |78.215 % | |sse_prefetch|66.306 % | ### 嘗試修改 _MM_HINT 並透過 raw counter 觀察 ``` _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1); ``` ![](https://i.imgur.com/zkO67ao.png) 參考 illusion030 的[共筆](https://hackmd.io/s/HkHDV-moe),觀察以下 raw counter: ``` r014c LOAD_HIT_PRE.SW_PF r024c LOAD_HIT_PRE.HW_PF r81d0 MEM_UOPS_RETIRED.ALL_LOADS r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT r04d1 MEM_LOAD_UOPS_RETIRED.L3_HIT ``` 結果: ``` Performance counter stats for './naive_transpose' (10 runs): 39,361,652 cache-misses # 87.332 % of all cache refs ( +- 1.18% ) (39.91%) 45,071,235 cache-references ( +- 1.30% ) (40.45%) 1,394,570,644 instructions # 1.18 insn per cycle ( +- 0.62% ) (50.60%) 1,183,001,040 cycles ( +- 0.39% ) (60.58%) 145 r014c ( +- 26.11% ) (60.62%) 347,288 r024c ( +- 8.84% ) (60.77%) 565,116,285 r81d0 ( +- 0.52% ) (60.89%) 541,841,834 r01d1 ( +- 0.72% ) (61.08%) 32,226 r02d1 ( +- 13.70% ) (40.05%) 704,847 r04d1 ( +- 5.39% ) (39.95%) ``` ``` Performance counter stats for './sse_transpose' (10 runs): 14,631,270 cache-misses # 78.215 % of all cache refs ( +- 1.32% ) (37.96%) 18,706,573 cache-references ( +- 1.21% ) (40.68%) 1,156,979,239 instructions # 1.44 insn per cycle ( +- 1.37% ) (51.62%) 801,805,797 cycles ( +- 0.67% ) (61.88%) 104 r014c ( +- 4.41% ) (62.54%) 228,122 r024c ( +- 11.42% ) (63.56%) 442,330,486 r81d0 ( +- 0.65% ) (63.34%) 444,642,666 r01d1 ( +- 0.66% ) (61.72%) 32,251 r02d1 ( +- 8.80% ) (38.78%) 138,286 r04d1 ( +- 2.73% ) (37.31%) ``` ``` Performance counter stats for './sse_prefetch_transpose' (10 runs): 8,407,856 cache-misses # 66.306 % of all cache refs ( +- 0.77% ) (39.42%) 12,680,436 cache-references ( +- 0.38% ) (39.50%) 1,266,181,395 instructions # 2.19 insn per cycle ( +- 0.38% ) (49.59%) 577,113,498 cycles ( +- 0.36% ) (59.67%) 1,194,382 r014c ( +- 3.70% ) (59.67%) 276,261 r024c ( +- 13.61% ) (60.82%) 450,951,942 r81d0 ( +- 1.27% ) (63.35%) 448,833,332 r01d1 ( +- 0.84% ) (62.97%) 1,185,907 r02d1 ( +- 2.41% ) (41.53%) 23,900 r04d1 ( +- 8.80% ) (40.14%) ``` 可以發現在使用了 `_mm_prefetch` 的 `sse_prefetch_transpose` 中,`r014c LOAD_HIT_PRE.SW_PF ` 明顯高出其他版本很多,證實了 software prefetching 有實際生效 接下來單獨觀察 `sse_prefetch_transpose`: - _MM_HINT_T0 ``` Performance counter stats for './sse_prefetch_transpose' (10 runs): 8,274,577 cache-misses # 65.760 % of all cache refs ( +- 1.55% ) (39.24%) 12,582,914 cache-references ( +- 1.15% ) (39.39%) 1,270,075,951 instructions # 2.22 insn per cycle ( +- 0.48% ) (49.56%) 571,300,575 cycles ( +- 0.29% ) (59.66%) 2,055,764 r014c ( +- 1.49% ) (59.81%) 285,024 r024c ( +- 12.43% ) (61.40%) 452,350,810 r81d0 ( +- 0.99% ) (63.40%) 450,444,764 r01d1 ( +- 0.97% ) (62.82%) 136,386 r02d1 ( +- 6.58% ) (41.17%) 21,463 r04d1 ( +- 9.31% ) (39.90%) ``` - _MM_HINT_T1 ``` Performance counter stats for './sse_prefetch_transpose' (10 runs): 8,294,601 cache-misses # 65.715 % of all cache refs ( +- 1.96% ) (39.20%) 12,622,082 cache-references ( +- 1.59% ) (39.49%) 1,267,488,698 instructions # 2.21 insn per cycle ( +- 0.20% ) (49.82%) 573,599,634 cycles ( +- 0.47% ) (59.97%) 1,138,961 r014c ( +- 3.27% ) (60.38%) 297,754 r024c ( +- 12.40% ) (62.11%) 452,313,963 r81d0 ( +- 0.71% ) (63.44%) 447,991,261 r01d1 ( +- 0.71% ) (63.10%) 1,211,477 r02d1 ( +- 2.48% ) (40.61%) 23,397 r04d1 ( +- 7.42% ) (39.48%) ``` - _MM_HINT_T2 ``` Performance counter stats for './sse_prefetch_transpose' (10 runs): 8,160,145 cache-misses # 64.491 % of all cache refs ( +- 2.19% ) (39.61%) 12,653,216 cache-references ( +- 0.45% ) (39.98%) 1,261,785,164 instructions # 2.19 insn per cycle ( +- 0.37% ) (50.15%) 575,843,475 cycles ( +- 0.38% ) (60.17%) 1,137,009 r014c ( +- 3.06% ) (60.17%) 340,189 r024c ( +- 9.23% ) (60.61%) 464,808,926 r81d0 ( +- 0.84% ) (61.75%) 450,360,519 r01d1 ( +- 0.92% ) (62.46%) 1,179,211 r02d1 ( +- 1.03% ) (40.89%) 20,659 r04d1 ( +- 7.79% ) (40.36%) ``` - _MM_HINT_NTA ``` Performance counter stats for './sse_prefetch_transpose' (10 runs): 8,121,213 cache-misses # 63.071 % of all cache refs ( +- 2.09% ) (39.80%) 12,876,217 cache-references ( +- 2.41% ) (40.06%) 1,267,977,461 instructions # 2.20 insn per cycle ( +- 0.43% ) (50.18%) 575,640,296 cycles ( +- 0.60% ) (60.18%) 2,028,726 r014c ( +- 1.83% ) (60.17%) 338,681 r024c ( +- 9.67% ) (60.74%) 460,173,646 r81d0 ( +- 0.99% ) (61.99%) 452,179,786 r01d1 ( +- 0.93% ) (62.14%) 11,199 r02d1 ( +- 3.08% ) (41.24%) 219,170 r04d1 ( +- 2.84% ) (40.43%) ``` 疑問: 可以發現 `T0` 和 `T1` 都如預期將資料放在 L1 / L2 cache 但 `T2` 並不是放在 L3 而是 L2 ? 反而是 NTA 模式 似乎傾向使用 L3 cache :::danger 用 perf raw-counter 確認! :notes: jserv ::: 此處尚待理解