Try   HackMD

2017q3 Homework2 (software-pipelining)

contributed by <HMKRL>

Reviewed by zhanyangch

  • SIMD 的指令可以參考 intel Intrinsics Guide ,比用 MSDN 方便查詢。
  • 已經對不同的_MM_HINT作實驗,若能增加對 PDIST 修改的實驗,與論文中 D 值的計算做比較,可使報告更完整。
  • 對 gef 介紹可以更詳盡,或是列出相關的參考資料,方便其它同學也能夠學習這項工具。

論文閱讀:When Prefetching Works, When It Doesn’t, and Why

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

由文中的實驗結果圖可以發現,當 Software prefetching (例如 _mm_prefetch()) 與 Hardware prefetching (例如 GHB) 同時使用時,並沒有明確的效能表現規則,本文重點為探索此兩種 prefetcher 交互使用的效果

其中有三項探討重點:

  1. What are the limitations and overheads of software prefetching?
  2. What are the limitations and overheads of hardware prefetching?
  3. When is it beneficial to use software and/or hardware prefetching?

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Table 1 為文中對不同資料結構適用的 prefetch 方式整理,其中關於 prefetch distance (D),可以參考圖二的時間線:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

可以發現 發出 prefetch request 後,資料需要一段時間才能進入 cache ,但若過太久仍未被使用,則會被移出 cache

因此採用 software prefetch 時,若太晚執行 prefetch ,則無法透過這次 prefetch 消除 memory latency 的影響,太早執行則會讓資料在被使用前就移出 cache ,都無法有效增進效能(甚至會因為 prefetch 本身的成本造成效能低落),適合的 D 值可以這樣計算:

D[ls]

其中

l 是 prefetch 的延遲,
s
則是一次迴圈的最短路徑

論文中 5.1.3 對不同 D 值的影響進行分析:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Memory Indexing

Memory indexing 可以分為兩種:direct 與 indirect
其中 direct 型(例如一維陣列) 較容易做 prefetch (由 index 可以直接得知資料所在的位置)
indirect 型(例如 Linked-List )則較不容易

圖3以簡化的組合語言指令比較了兩種 indexing 模式的差異,可以發現 indirect indexing 必須額外計算實際位址

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

理解 prefetch 程式

開發環境:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
Model name:            Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K

首先注意 impl.c

naive_transpose 是最直接的轉置作法,直接依照轉置矩陣的規則以迴圈實作

sse_transpose 則出現許多未曾使用過的 sse 指令集操作 ,因此使用 gdb 觀察程式執行(已加裝 gef )

將中斷點設在 impl.c:17 , 由 gdb 顯示的組合語言指令得知_mm_loadu_si128 用途是將數值存入 xmm0 ,參考 wikipedia 得知這是用於 sse 指令集的特殊 register

────────────────────────────────────────────────────────────────────────────────────────────────────────────[ code:i386:x86-64 ]────
   0x555555554a52 <sse_transpose+109> add    rax, rdx
   0x555555554a55 <sse_transpose+112> mov    QWORD PTR [rbp-0x200], rax
   0x555555554a5c <sse_transpose+119> mov    rax, QWORD PTR [rbp-0x200]
   0x555555554a63 <sse_transpose+126> movdqu xmm0, XMMWORD PTR [rax]
   0x555555554a67 <sse_transpose+130> movaps XMMWORD PTR [rbp-0x1c0], xmm0
 → 0x555555554a6e <sse_transpose+137> mov    eax, DWORD PTR [rbp-0x204]
   0x555555554a74 <sse_transpose+143> add    eax, 0x1
   0x555555554a77 <sse_transpose+146> imul   eax, DWORD PTR [rbp-0x224]
   0x555555554a7e <sse_transpose+153> movsxd rdx, eax
   0x555555554a81 <sse_transpose+156> mov    eax, DWORD PTR [rbp-0x208]
   0x555555554a87 <sse_transpose+162> cdqe
────────────────────────────────────────────────────────────────────────────────────────────────────────────[ source:impl.c+16 ]────
     12  {
     13      for (int x = 0; x < w; x += 4) {
     14          for (int y = 0; y < h; y += 4) {
     15              __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
                // y=0x0, x=0x0, src=0x00007fffffffd798  →  [...]  →  0x0000000100000000, w=0x4
 →   16              __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
     17              __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
     18              __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
     19              __m128i T0 = _mm_unpacklo_epi32(I0, I1);
     20              __m128i T1 = _mm_unpacklo_epi32(I2, I3);

再觀察以下執行結果:
__m128i T0 = _mm_unpacklo_epi32(I0, I1)

gef➤  p I0
$8 = {0x100000000, 0x300000002}
gef➤  p I1
$9 = {0x500000004, 0x700000006}
gef➤  p T0
$10 = {0x400000000, 0x500000001}

__m128i T2 = _mm_unpackhi_epi32(I0, I1);:

gef➤  p I0
$11 = {0x100000000, 0x300000002}
gef➤  p I1
$12 = {0x500000004, 0x700000006}
gef➤  p T2
$13 = {0x600000002, 0x700000003}

I0 = _mm_unpacklo_epi64(T0, T1);

gef➤  p T0
$23 = {0x400000000, 0x500000001}
gef➤  p T1
$24 = {0xc00000008, 0xd00000009}
gef➤  p I0
$25 = {0x400000000, 0xc00000008}

透過以上觀察,加上參照 MSDN 的說明,即可理解這幾個 function 的行為

  • sse register 有 128-bit, 善用可以在每道指令處理4個 32-bit integer 數值,比起一次處理一個來說可以省下許多手續

產生新的執行檔,分別對應於不同模式

程式碼修改請見 commit f15322b

新增 impl.h, 統一所有 transpose function 名稱為 transpose() 並宣告函式原型在 impl.h

各種不同實作分別位於 naive_transpose.c,sse_transpose.csse_prefetch_transpose.c 中,透過 Makefile 決定要使用的檔案

測試結果:

make run
./naive_transpose
naive:   600506 us
./sse_transpose
sse:     307496 us
./sse_prefetch_transpose
sse_prefetch:    149165 us

make cache-test

test case cache-miss
naive 87.332 %
sse 78.215 %
sse_prefetch 66.306 %

嘗試修改 _MM_HINT 並透過 raw counter 觀察

_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

參考 illusion030 的共筆,觀察以下 raw counter:

r014c LOAD_HIT_PRE.SW_PF
r024c LOAD_HIT_PRE.HW_PF
r81d0 MEM_UOPS_RETIRED.ALL_LOADS
r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT
r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT
r04d1 MEM_LOAD_UOPS_RETIRED.L3_HIT

結果:

Performance counter stats for './naive_transpose' (10 runs):

        39,361,652      cache-misses              #   87.332 % of all cache refs      ( +-  1.18% )  (39.91%)
        45,071,235      cache-references                                              ( +-  1.30% )  (40.45%)
     1,394,570,644      instructions              #    1.18  insn per cycle           ( +-  0.62% )  (50.60%)
     1,183,001,040      cycles                                                        ( +-  0.39% )  (60.58%)
               145      r014c                                                         ( +- 26.11% )  (60.62%)
           347,288      r024c                                                         ( +-  8.84% )  (60.77%)
       565,116,285      r81d0                                                         ( +-  0.52% )  (60.89%)
       541,841,834      r01d1                                                         ( +-  0.72% )  (61.08%)
            32,226      r02d1                                                         ( +- 13.70% )  (40.05%)
           704,847      r04d1                                                         ( +-  5.39% )  (39.95%)
Performance counter stats for './sse_transpose' (10 runs):

        14,631,270      cache-misses              #   78.215 % of all cache refs      ( +-  1.32% )  (37.96%)
        18,706,573      cache-references                                              ( +-  1.21% )  (40.68%)
     1,156,979,239      instructions              #    1.44  insn per cycle           ( +-  1.37% )  (51.62%)
       801,805,797      cycles                                                        ( +-  0.67% )  (61.88%)
               104      r014c                                                         ( +-  4.41% )  (62.54%)
           228,122      r024c                                                         ( +- 11.42% )  (63.56%)
       442,330,486      r81d0                                                         ( +-  0.65% )  (63.34%)
       444,642,666      r01d1                                                         ( +-  0.66% )  (61.72%)
            32,251      r02d1                                                         ( +-  8.80% )  (38.78%)
           138,286      r04d1                                                         ( +-  2.73% )  (37.31%)
Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,407,856      cache-misses              #   66.306 % of all cache refs      ( +-  0.77% )  (39.42%)
        12,680,436      cache-references                                              ( +-  0.38% )  (39.50%)
     1,266,181,395      instructions              #    2.19  insn per cycle           ( +-  0.38% )  (49.59%)
       577,113,498      cycles                                                        ( +-  0.36% )  (59.67%)
         1,194,382      r014c                                                         ( +-  3.70% )  (59.67%)
           276,261      r024c                                                         ( +- 13.61% )  (60.82%)
       450,951,942      r81d0                                                         ( +-  1.27% )  (63.35%)
       448,833,332      r01d1                                                         ( +-  0.84% )  (62.97%)
         1,185,907      r02d1                                                         ( +-  2.41% )  (41.53%)
            23,900      r04d1                                                         ( +-  8.80% )  (40.14%)

可以發現在使用了 _mm_prefetchsse_prefetch_transpose 中,r014c LOAD_HIT_PRE.SW_PF 明顯高出其他版本很多,證實了 software prefetching 有實際生效

接下來單獨觀察 sse_prefetch_transpose:

  • _MM_HINT_T0
Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,274,577      cache-misses              #   65.760 % of all cache refs      ( +-  1.55% )  (39.24%)
        12,582,914      cache-references                                              ( +-  1.15% )  (39.39%)
     1,270,075,951      instructions              #    2.22  insn per cycle           ( +-  0.48% )  (49.56%)
       571,300,575      cycles                                                        ( +-  0.29% )  (59.66%)
         2,055,764      r014c                                                         ( +-  1.49% )  (59.81%)
           285,024      r024c                                                         ( +- 12.43% )  (61.40%)
       452,350,810      r81d0                                                         ( +-  0.99% )  (63.40%)
       450,444,764      r01d1                                                         ( +-  0.97% )  (62.82%)
           136,386      r02d1                                                         ( +-  6.58% )  (41.17%)
            21,463      r04d1                                                         ( +-  9.31% )  (39.90%)
  • _MM_HINT_T1
Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,294,601      cache-misses              #   65.715 % of all cache refs      ( +-  1.96% )  (39.20%)
        12,622,082      cache-references                                              ( +-  1.59% )  (39.49%)
     1,267,488,698      instructions              #    2.21  insn per cycle           ( +-  0.20% )  (49.82%)
       573,599,634      cycles                                                        ( +-  0.47% )  (59.97%)
         1,138,961      r014c                                                         ( +-  3.27% )  (60.38%)
           297,754      r024c                                                         ( +- 12.40% )  (62.11%)
       452,313,963      r81d0                                                         ( +-  0.71% )  (63.44%)
       447,991,261      r01d1                                                         ( +-  0.71% )  (63.10%)
         1,211,477      r02d1                                                         ( +-  2.48% )  (40.61%)
            23,397      r04d1                                                         ( +-  7.42% )  (39.48%)
  • _MM_HINT_T2
Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,160,145      cache-misses              #   64.491 % of all cache refs      ( +-  2.19% )  (39.61%)
        12,653,216      cache-references                                              ( +-  0.45% )  (39.98%)
     1,261,785,164      instructions              #    2.19  insn per cycle           ( +-  0.37% )  (50.15%)
       575,843,475      cycles                                                        ( +-  0.38% )  (60.17%)
         1,137,009      r014c                                                         ( +-  3.06% )  (60.17%)
           340,189      r024c                                                         ( +-  9.23% )  (60.61%)
       464,808,926      r81d0                                                         ( +-  0.84% )  (61.75%)
       450,360,519      r01d1                                                         ( +-  0.92% )  (62.46%)
         1,179,211      r02d1                                                         ( +-  1.03% )  (40.89%)
            20,659      r04d1                                                         ( +-  7.79% )  (40.36%)
  • _MM_HINT_NTA
Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,121,213      cache-misses              #   63.071 % of all cache refs      ( +-  2.09% )  (39.80%)
        12,876,217      cache-references                                              ( +-  2.41% )  (40.06%)
     1,267,977,461      instructions              #    2.20  insn per cycle           ( +-  0.43% )  (50.18%)
       575,640,296      cycles                                                        ( +-  0.60% )  (60.18%)
         2,028,726      r014c                                                         ( +-  1.83% )  (60.17%)
           338,681      r024c                                                         ( +-  9.67% )  (60.74%)
       460,173,646      r81d0                                                         ( +-  0.99% )  (61.99%)
       452,179,786      r01d1                                                         ( +-  0.93% )  (62.14%)
            11,199      r02d1                                                         ( +-  3.08% )  (41.24%)
           219,170      r04d1                                                         ( +-  2.84% )  (40.43%)

疑問:
可以發現 T0T1 都如預期將資料放在 L1 / L2 cache
T2 並不是放在 L3 而是 L2 ? 反而是 NTA 模式 似乎傾向使用 L3 cache

用 perf raw-counter 確認!

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

此處尚待理解