2017q3 Homework2 (software-pipelining)

contributed by <HMKRL>

Reviewed by `zhanyangch`

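For SIMD instructions, the Intel Intrinsics Guide is a more convenient reference than MSDN.
The report already experiments with different _MM_HINT values. Adding experiments that vary PFDIST, and comparing the results with the paper's calculation of D, would make it more complete.
The introduction to gef could be more detailed, or could list related references, so that other students can learn the tool as well.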

Paper reading: When Prefetching Works, When It Doesn’t, and Why

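The experimental results figure in the paper shows that when software prefetching (e.g. _mm_prefetch()) and hardware prefetching (e.g. GHB) are used together, performance follows no clear rule. The paper focuses on how these two kinds of prefetcher interact.

It examines three questions: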

What are the limitations and overheads of software prefetching?
What are the limitations and overheads of hardware prefetching?
When is it beneficial to use software and/or hardware prefetching?

Table 1 is the paper's summary of which prefetching approach suits each data structure. For the prefetch distance (D), see the timeline in Figure 2:

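After a prefetch request is issued, the data takes some time to arrive in the cache. If it then stays unused for too long, it is evicted from the cache.

With software prefetching, therefore, a prefetch issued too late cannot hide the memory latency, while one issued too early lets the data be evicted before it is used. Neither case improves performance, and the cost of the prefetch itself can even make it worse. A suitable value of D can be calculated as follows: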

$D \geq \left\lceil \frac{l}{s} \right\rceil$

where $l$ is the prefetch latency and $s$ is the length of the shortest path through one loop iteration.
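As a worked illustration of the inequality above, the smallest admissible D can be computed with an integer ceiling division. This is only a sketch: the function name `prefetch_distance` is hypothetical, and measuring both `l` and `s` in cycles is an assumption made for the example.

```c
#include <assert.h>

/*
 * Smallest prefetch distance D (in loop iterations) with D >= ceil(l / s).
 * l: prefetch latency, s: shortest path through one loop iteration.
 * Both are assumed to be in cycles; the name is hypothetical.
 */
unsigned prefetch_distance(unsigned l, unsigned s)
{
    assert(s > 0);          /* an iteration must take at least one cycle */
    return (l + s - 1) / s; /* integer ceiling of l / s */
}
```

For instance, with an assumed latency of 200 cycles and a 30-cycle loop body, the prefetch would have to be issued at least 7 iterations ahead.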

Section 5.1.3 of the paper analyzes the effect of different D values:

Memory Indexing

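Memory indexing comes in two kinds: direct and indirect.
Direct indexing (e.g. a one-dimensional array) is easier to prefetch, because the index alone gives the location of the data.
Indirect indexing (e.g. a linked list) is harder.

Figure 3 compares the two indexing modes using simplified assembly instructions. It shows that indirect indexing needs extra work to compute the actual address.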
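The difference can be sketched in C. The example below is an illustration rather than code from the paper or the assignment: the function names and the `PFDIST` value are hypothetical, an index array `b` is assumed as the form of indirect indexing, and `__builtin_prefetch` is the GCC/Clang builtin rather than the `_mm_prefetch` intrinsic used later in this note.

```c
#include <assert.h>
#include <stddef.h>

#define PFDIST 8 /* hypothetical prefetch distance, in iterations */

/* Direct indexing: the address of a[i + PFDIST] follows from i alone. */
long sum_direct(const int *a, size_t n)
{
    assert(a);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PFDIST < n)
            __builtin_prefetch(&a[i + PFDIST], 0, 3);
        sum += a[i];
    }
    return sum;
}

/*
 * Indirect indexing: b[i + PFDIST] has to be loaded first, and only
 * then is the address of the element of a to prefetch known.
 */
long sum_indirect(const int *a, const size_t *b, size_t n)
{
    assert(a && b);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PFDIST < n)
            __builtin_prefetch(&a[b[i + PFDIST]], 0, 3);
        sum += a[b[i]];
    }
    return sum;
}
```

The extra load of `b[i + PFDIST]` in the second function is the additional address computation that the indirect case pays for each prefetch.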

Understanding the prefetch program

Development environment:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
Model name:            Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K

Start with impl.c.

naive_transpose is the most direct approach: it implements the definition of the transpose with nested loops.
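A minimal sketch of such a loop is shown below. It assumes a row-major source of `h` rows by `w` columns. The name `naive_transpose_sketch` and the exact signature are assumptions for illustration and are not copied from impl.c.

```c
#include <assert.h>

/*
 * Naive transpose sketch: src holds h rows of w ints (row-major),
 * dst receives w rows of h ints. Name and signature are hypothetical.
 */
void naive_transpose_sketch(const int *src, int *dst, int w, int h)
{
    assert(src && dst && w > 0 && h > 0);
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            dst[x * h + y] = src[y * w + x]; /* element (y, x) goes to (x, y) */
}
```

The inner loop reads `src` with a stride of `w` elements, so consecutive reads land on different cache lines once `w` is large, which is one plausible reason for the high cache-miss rate of the naive version.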

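sse_transpose uses many SSE intrinsics I had not used before, so I observed the program's execution with gdb (with gef installed).

With the breakpoint set at impl.c:17, the assembly shown by gdb reveals that _mm_loadu_si128 loads a value into xmm0 (the movdqu instruction below). According to Wikipedia, this is a special register used by the SSE instruction set.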

────────────────────────────────────────────────────────────────────────────────────────────────────────────[ code:i386:x86-64 ]────
   0x555555554a52 <sse_transpose+109> add    rax, rdx
   0x555555554a55 <sse_transpose+112> mov    QWORD PTR [rbp-0x200], rax
   0x555555554a5c <sse_transpose+119> mov    rax, QWORD PTR [rbp-0x200]
   0x555555554a63 <sse_transpose+126> movdqu xmm0, XMMWORD PTR [rax]
   0x555555554a67 <sse_transpose+130> movaps XMMWORD PTR [rbp-0x1c0], xmm0
 → 0x555555554a6e <sse_transpose+137> mov    eax, DWORD PTR [rbp-0x204]
   0x555555554a74 <sse_transpose+143> add    eax, 0x1
   0x555555554a77 <sse_transpose+146> imul   eax, DWORD PTR [rbp-0x224]
   0x555555554a7e <sse_transpose+153> movsxd rdx, eax
   0x555555554a81 <sse_transpose+156> mov    eax, DWORD PTR [rbp-0x208]
   0x555555554a87 <sse_transpose+162> cdqe
────────────────────────────────────────────────────────────────────────────────────────────────────────────[ source:impl.c+16 ]────
     12  {
     13      for (int x = 0; x < w; x += 4) {
     14          for (int y = 0; y < h; y += 4) {
     15              __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
                // y=0x0, x=0x0, src=0x00007fffffffd798  →  [...]  →  0x0000000100000000, w=0x4
 →   16              __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
     17              __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
     18              __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
     19              __m128i T0 = _mm_unpacklo_epi32(I0, I1);
     20              __m128i T1 = _mm_unpacklo_epi32(I2, I3);

Next, observe the following results:
__m128i T0 = _mm_unpacklo_epi32(I0, I1);

gef➤  p I0
$8 = {0x100000000, 0x300000002}
gef➤  p I1
$9 = {0x500000004, 0x700000006}
gef➤  p T0
$10 = {0x400000000, 0x500000001}

__m128i T2 = _mm_unpackhi_epi32(I0, I1);

gef➤  p I0
$11 = {0x100000000, 0x300000002}
gef➤  p I1
$12 = {0x500000004, 0x700000006}
gef➤  p T2
$13 = {0x600000002, 0x700000003}

I0 = _mm_unpacklo_epi64(T0, T1);

gef➤  p T0
$23 = {0x400000000, 0x500000001}
gef➤  p T1
$24 = {0xc00000008, 0xd00000009}
gef➤  p I0
$25 = {0x400000000, 0xc00000008}

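From these observations, together with the descriptions on MSDN, the behavior of these functions becomes clear.

An SSE register is 128 bits wide. Used well, each instruction can process four 32-bit integers, which saves a lot of work compared with handling one value at a time.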
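The gdb values above can be tied together in one 4x4 block transpose. The sketch below uses the same SSE2 intrinsics seen in the listing. The function name `transpose4x4` and the scalar fallback for non-SSE2 targets are assumptions added for illustration, not part of impl.c.

```c
#include <assert.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Transpose one contiguous 4x4 block of 32-bit ints (hypothetical helper). */
void transpose4x4(const int *src, int *dst)
{
    assert(src && dst && src != dst);
#if defined(__SSE2__)
    __m128i I0 = _mm_loadu_si128((const __m128i *)(src + 0));
    __m128i I1 = _mm_loadu_si128((const __m128i *)(src + 4));
    __m128i I2 = _mm_loadu_si128((const __m128i *)(src + 8));
    __m128i I3 = _mm_loadu_si128((const __m128i *)(src + 12));
    /* Interleave 32-bit lanes: T0 = {I0[0], I1[0], I0[1], I1[1]}. */
    __m128i T0 = _mm_unpacklo_epi32(I0, I1);
    __m128i T1 = _mm_unpacklo_epi32(I2, I3);
    /* Same for the upper lanes: T2 = {I0[2], I1[2], I0[3], I1[3]}. */
    __m128i T2 = _mm_unpackhi_epi32(I0, I1);
    __m128i T3 = _mm_unpackhi_epi32(I2, I3);
    /* Interleave 64-bit halves to form the four output rows. */
    _mm_storeu_si128((__m128i *)(dst + 0), _mm_unpacklo_epi64(T0, T1));
    _mm_storeu_si128((__m128i *)(dst + 4), _mm_unpackhi_epi64(T0, T1));
    _mm_storeu_si128((__m128i *)(dst + 8), _mm_unpacklo_epi64(T2, T3));
    _mm_storeu_si128((__m128i *)(dst + 12), _mm_unpackhi_epi64(T2, T3));
#else
    /* Scalar fallback with the same result on targets without SSE2. */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[c * 4 + r] = src[r * 4 + c];
#endif
}
```

With the source block holding 0 to 15, T0 is the lane sequence {0, 4, 1, 5}, which gdb prints as the two 64-bit values {0x400000000, 0x500000001}, and the first output row is {0, 4, 8, 12}, matching {0x400000000, 0xc00000008} above.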

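Generating new executables, one per mode

For the code changes, see commit f15322b.

Added impl.h, renamed every transpose function to transpose(), and declared the function prototype in impl.h.

The implementations live in naive_transpose.c, sse_transpose.c and sse_prefetch_transpose.c; the Makefile decides which file to use.

Test results: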

make run
./naive_transpose
naive:   600506 us
./sse_transpose
sse:     307496 us
./sse_prefetch_transpose
sse_prefetch:    149165 us

make cache-test

test case	cache-miss
naive	87.332 %
sse	78.215 %
sse_prefetch	66.306 %

Modifying _MM_HINT and observing the raw counters

_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
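To show where such prefetch calls sit in the loop, here is a scalar sketch of a 4x4-blocked transpose that requests the rows `PFDIST` ahead of the block being processed. It is an assumption-laden illustration, not the assignment's code: the function name and the `PFDIST` value are hypothetical, the bounds check is added here, and `__builtin_prefetch` (GCC/Clang) stands in for `_mm_prefetch`. On x86 GCC, a locality argument of 2 is expected to correspond roughly to `_MM_HINT_T1`.

```c
#include <assert.h>

#define PFDIST 8 /* hypothetical distance in rows; tune by measurement */

/*
 * 4x4-blocked transpose sketch. While block (x, y) is being copied,
 * the four source rows starting PFDIST rows further down are requested.
 */
void prefetch_transpose_sketch(const int *src, int *dst, int w, int h)
{
    assert(src && dst && w > 0 && h > 0 && w % 4 == 0 && h % 4 == 0);
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            if (y + PFDIST + 3 < h) { /* stay inside the source array */
                for (int r = 0; r < 4; r++)
                    __builtin_prefetch(src + (y + PFDIST + r) * w + x, 0, 2);
            }
            for (int r = 0; r < 4; r++)
                for (int c = 0; c < 4; c++)
                    dst[(x + c) * h + (y + r)] = src[(y + r) * w + (x + c)];
        }
    }
}
```

Changing the last argument of `__builtin_prefetch` (or the hint passed to `_mm_prefetch`) only changes which cache level is targeted. The result of the transpose is the same for every hint.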

Following illusion030's notes, observe these raw counters:

r014c LOAD_HIT_PRE.SW_PF
r024c LOAD_HIT_PRE.HW_PF
r81d0 MEM_UOPS_RETIRED.ALL_LOADS
r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT
r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT
r04d1 MEM_LOAD_UOPS_RETIRED.L3_HIT

Results:

Performance counter stats for './naive_transpose' (10 runs):

        39,361,652      cache-misses              #   87.332 % of all cache refs      ( +-  1.18% )  (39.91%)
        45,071,235      cache-references                                              ( +-  1.30% )  (40.45%)
     1,394,570,644      instructions              #    1.18  insn per cycle           ( +-  0.62% )  (50.60%)
     1,183,001,040      cycles                                                        ( +-  0.39% )  (60.58%)
               145      r014c                                                         ( +- 26.11% )  (60.62%)
           347,288      r024c                                                         ( +-  8.84% )  (60.77%)
       565,116,285      r81d0                                                         ( +-  0.52% )  (60.89%)
       541,841,834      r01d1                                                         ( +-  0.72% )  (61.08%)
            32,226      r02d1                                                         ( +- 13.70% )  (40.05%)
           704,847      r04d1                                                         ( +-  5.39% )  (39.95%)

Performance counter stats for './sse_transpose' (10 runs):

        14,631,270      cache-misses              #   78.215 % of all cache refs      ( +-  1.32% )  (37.96%)
        18,706,573      cache-references                                              ( +-  1.21% )  (40.68%)
     1,156,979,239      instructions              #    1.44  insn per cycle           ( +-  1.37% )  (51.62%)
       801,805,797      cycles                                                        ( +-  0.67% )  (61.88%)
               104      r014c                                                         ( +-  4.41% )  (62.54%)
           228,122      r024c                                                         ( +- 11.42% )  (63.56%)
       442,330,486      r81d0                                                         ( +-  0.65% )  (63.34%)
       444,642,666      r01d1                                                         ( +-  0.66% )  (61.72%)
            32,251      r02d1                                                         ( +-  8.80% )  (38.78%)
           138,286      r04d1                                                         ( +-  2.73% )  (37.31%)

Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,407,856      cache-misses              #   66.306 % of all cache refs      ( +-  0.77% )  (39.42%)
        12,680,436      cache-references                                              ( +-  0.38% )  (39.50%)
     1,266,181,395      instructions              #    2.19  insn per cycle           ( +-  0.38% )  (49.59%)
       577,113,498      cycles                                                        ( +-  0.36% )  (59.67%)
         1,194,382      r014c                                                         ( +-  3.70% )  (59.67%)
           276,261      r024c                                                         ( +- 13.61% )  (60.82%)
       450,951,942      r81d0                                                         ( +-  1.27% )  (63.35%)
       448,833,332      r01d1                                                         ( +-  0.84% )  (62.97%)
         1,185,907      r02d1                                                         ( +-  2.41% )  (41.53%)
            23,900      r04d1                                                         ( +-  8.80% )  (40.14%)

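In sse_prefetch_transpose, which uses _mm_prefetch, r014c (LOAD_HIT_PRE.SW_PF) is far higher than in the other versions (1,194,382 versus 145 and 104), confirming that software prefetching actually takes effect.

Next, look at sse_prefetch_transpose alone: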

_MM_HINT_T0

Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,274,577      cache-misses              #   65.760 % of all cache refs      ( +-  1.55% )  (39.24%)
        12,582,914      cache-references                                              ( +-  1.15% )  (39.39%)
     1,270,075,951      instructions              #    2.22  insn per cycle           ( +-  0.48% )  (49.56%)
       571,300,575      cycles                                                        ( +-  0.29% )  (59.66%)
         2,055,764      r014c                                                         ( +-  1.49% )  (59.81%)
           285,024      r024c                                                         ( +- 12.43% )  (61.40%)
       452,350,810      r81d0                                                         ( +-  0.99% )  (63.40%)
       450,444,764      r01d1                                                         ( +-  0.97% )  (62.82%)
           136,386      r02d1                                                         ( +-  6.58% )  (41.17%)
            21,463      r04d1                                                         ( +-  9.31% )  (39.90%)

_MM_HINT_T1

Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,294,601      cache-misses              #   65.715 % of all cache refs      ( +-  1.96% )  (39.20%)
        12,622,082      cache-references                                              ( +-  1.59% )  (39.49%)
     1,267,488,698      instructions              #    2.21  insn per cycle           ( +-  0.20% )  (49.82%)
       573,599,634      cycles                                                        ( +-  0.47% )  (59.97%)
         1,138,961      r014c                                                         ( +-  3.27% )  (60.38%)
           297,754      r024c                                                         ( +- 12.40% )  (62.11%)
       452,313,963      r81d0                                                         ( +-  0.71% )  (63.44%)
       447,991,261      r01d1                                                         ( +-  0.71% )  (63.10%)
         1,211,477      r02d1                                                         ( +-  2.48% )  (40.61%)
            23,397      r04d1                                                         ( +-  7.42% )  (39.48%)

_MM_HINT_T2

Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,160,145      cache-misses              #   64.491 % of all cache refs      ( +-  2.19% )  (39.61%)
        12,653,216      cache-references                                              ( +-  0.45% )  (39.98%)
     1,261,785,164      instructions              #    2.19  insn per cycle           ( +-  0.37% )  (50.15%)
       575,843,475      cycles                                                        ( +-  0.38% )  (60.17%)
         1,137,009      r014c                                                         ( +-  3.06% )  (60.17%)
           340,189      r024c                                                         ( +-  9.23% )  (60.61%)
       464,808,926      r81d0                                                         ( +-  0.84% )  (61.75%)
       450,360,519      r01d1                                                         ( +-  0.92% )  (62.46%)
         1,179,211      r02d1                                                         ( +-  1.03% )  (40.89%)
            20,659      r04d1                                                         ( +-  7.79% )  (40.36%)

_MM_HINT_NTA

Performance counter stats for './sse_prefetch_transpose' (10 runs):

         8,121,213      cache-misses              #   63.071 % of all cache refs      ( +-  2.09% )  (39.80%)
        12,876,217      cache-references                                              ( +-  2.41% )  (40.06%)
     1,267,977,461      instructions              #    2.20  insn per cycle           ( +-  0.43% )  (50.18%)
       575,640,296      cycles                                                        ( +-  0.60% )  (60.18%)
         2,028,726      r014c                                                         ( +-  1.83% )  (60.17%)
           338,681      r024c                                                         ( +-  9.67% )  (60.74%)
       460,173,646      r81d0                                                         ( +-  0.99% )  (61.99%)
       452,179,786      r01d1                                                         ( +-  0.93% )  (62.14%)
            11,199      r02d1                                                         ( +-  3.08% )  (41.24%)
           219,170      r04d1                                                         ( +-  2.84% )  (40.43%)

Question:
T0 and T1 place the data in the L1 / L2 caches as expected,
but T2 seems to place it in L2 rather than L3? NTA mode, by contrast, seems to favor the L3 cache.

Confirm this with perf raw counters!

jserv

This part is still to be understood.
