2018 software-pipelining

tags `BodomMoon` `TingL7` `software-pipelining` `sysprog2018`

contributed by<BodomMoon,TingL7,sabishi>
Group list, GitHub, YouTube: experiment walkthrough video, background knowledge video

Assignment: building on the software pipelining shared notes, update the descriptions and complete the experiments that were left unfinished

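Outline

Planned workflow
Background knowledge
SSE matrix experiment and empirical check of the prefetch paper
Automated testing of the prefetch part
Automated statistics and varying the matrix size
TODO
Related references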

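Planned workflow

To supplement and organize the notes left by earlier students, we first need to be familiar with the background of this research, so the work is split into the following stages:

Read the papers and understand the principles behind the experiments
Read the code and understand how the experiments are implemented
Fill in the unfinished experiment steps, then record and consolidate the results
Record videos (YouTube experiment walkthrough video, background knowledge video)
Done <- now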

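Background knowledge

software-pipelining is the accumulated work of more than three years and seven semesters, ~~second in complexity only to linux-kernel~~, so the background needed for this assignment is collected here in separate, simplified write-ups to help later readers.

A brief summary of prefetch background knowledge, by <BodomMoon,TingL7>
Analysis of the SSE matrix transpose algorithm, by <BodomMoon>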

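SSE matrix experiment and empirical check of the prefetch paper

Sharp-eyed readers who have gone through the analysis of the SSE matrix transpose algorithm will notice that, when extended to 8*8, the algorithm actually walks the matrix in column order!?

I think Jserv left this performance problem in on purpose, yet in three years very few people have noticed it. BodomMoon

In C, column order is less efficient than row order in most cases.

C arrays are stored in row-major order, which is why row-major traversal performs better; in other languages this does not necessarily hold. TingL7
The two are equally efficient only when the column and row lengths are the same and the number of accesses is a multiple of element size / cache line size. BodomMoon

You mean the array size, not the element size, right?
Put more simply: every piece of data brought into the cache gets fully used, and nothing is fetched that goes unused. TingL7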
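To make the row-major point concrete, here is a minimal sketch (not part of the original project) that counts how often a traversal moves on to a different cache line. It assumes 4-byte elements, a 64-byte cache line, and a matrix that starts on a cache-line boundary; `count_line_switches` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64 /* assumed cache line size */

/* Count how often consecutive accesses land on a different cache line
 * when walking an n x n int32_t matrix stored in row-major order.
 * by_column == 0: walk row by row; by_column != 0: walk column by column. */
size_t count_line_switches(size_t n, int by_column)
{
    assert(n > 0);
    size_t switches = 0;
    size_t prev_line = (size_t) -1; /* sentinel: no line visited yet */
    for (size_t outer = 0; outer < n; outer++) {
        for (size_t inner = 0; inner < n; inner++) {
            size_t row = by_column ? inner : outer;
            size_t col = by_column ? outer : inner;
            /* row-major layout: element (row, col) sits at row * n + col */
            size_t byte_offset = (row * n + col) * sizeof(int32_t);
            size_t line = byte_offset / LINE_BYTES;
            if (line != prev_line) {
                switches++;
                prev_line = line;
            }
        }
    }
    return switches;
}
```

Under these assumptions a 64 x 64 matrix gives 256 line switches when walked by rows (each line is used up before moving on) and 4096 when walked by columns (every access touches a new line).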

So let's swap the traversal order and test it.

-    for (int x = 0; x < w; x += 4) {
-        for (int y = 0; y < h; y += 4) {
+    for (int y = 0; y < h; y += 4) {
+        for (int x = 0; x < w; x += 4) {
        .....
        }
    }
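For readers without the project source at hand, the loop order in the diff can be sketched in plain scalar C. This is only an illustration of the traversal order, not the SSE implementation from the project; `tiled_transpose_row_order` is a hypothetical name, and the 4 x 4 tile body is written with ordinary loads and stores instead of intrinsics.

```c
#include <assert.h>

/* Transpose a matrix with w columns and h rows in 4 x 4 tiles.
 * src holds h rows of w ints; dst receives w rows of h ints.
 * The outer loop walks rows (y) and the inner loop walks along a
 * row (x), so src is read in row order as in the modified version. */
void tiled_transpose_row_order(const int *src, int *dst, int w, int h)
{
    assert(w % 4 == 0 && h % 4 == 0);
    for (int y = 0; y < h; y += 4) {
        for (int x = 0; x < w; x += 4) {
            /* scalar stand-in for the 4 x 4 SSE tile transpose */
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    dst[(x + j) * h + (y + i)] = src[(y + i) * w + (x + j)];
        }
    }
}
```

Swapping the two outer `for` lines gives back the column-order traversal of the original version; the result is the same, only the memory access pattern changes.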

Performance of the original version (column_order)

perf stat --repeat 20 -e cache-misses,cache-references,instructions,cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-prefetches,LLC-prefetch-misses,r02D1,r10D1,r04D1,r20D1 ./sse_transpose

Performance counter stats for './sse_transpose' (20 runs):

        12,933,825      cache-misses              #   73.395 % of all cache refs      ( +-  1.19% )  (31.89%)
        17,622,255      cache-references                                              ( +-  1.19% )  (32.95%)
     1,388,011,122      instructions              #    1.65  insn per cycle           ( +-  1.12% )  (42.58%)
       843,618,590      cycles                                                        ( +-  0.73% )  (43.16%)
       486,121,646      L1-dcache-loads                                               ( +-  1.32% )  (44.21%)
         9,925,358      L1-dcache-load-misses     #    2.04% of all L1-dcache hits    ( +-  0.76% )  (44.40%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   
         4,539,155      LLC-loads                                                     ( +-  0.78% )  (44.06%)
         3,938,302      LLC-load-misses           #   86.76% of all LL-cache hits     ( +-  0.92% )  (34.74%)
   <not supported>      LLC-prefetches                                              
   <not supported>      LLC-prefetch-misses                                         
           242,069      r02D1                                                         ( +-  4.97% )  (34.58%)
         3,963,272      r10D1                                                         ( +-  0.77% )  (33.78%)
           410,698      r04D1                                                         ( +-  2.53% )  (32.84%)
         3,408,067      r20D1                                                         ( +-  1.66% )  (32.04%)

       0.327289812 seconds time elapsed                                          ( +-  0.63% )

Performance of the optimized version (row_order)

perf stat --repeat 20 -e cache-misses,cache-references,instructions,cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-prefetches,LLC-prefetch-misses,r02D1,r10D1,r04D1,r20D1 ./sse_transpose_row

Performance counter stats for './sse_transpose_row' (20 runs):

        17,491,290      cache-misses              #   78.348 % of all cache refs      ( +-  1.63% )  (33.60%)
        22,325,027      cache-references                                              ( +-  1.42% )  (35.18%)
     1,257,804,286      instructions              #    1.69  insn per cycle           ( +-  0.35% )  (44.41%)
       743,668,631      cycles                                                        ( +-  0.36% )  (45.49%)
       499,101,040      L1-dcache-loads                                               ( +-  0.55% )  (46.52%)
         9,960,237      L1-dcache-load-misses     #    2.00% of all L1-dcache hits    ( +-  0.60% )  (46.54%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   
           254,015      LLC-loads                                                     ( +-  2.63% )  (44.96%)
           117,380      LLC-load-misses           #   46.21% of all LL-cache hits     ( +-  1.97% )  (33.83%)
   <not supported>      LLC-prefetches                                              
   <not supported>      LLC-prefetch-misses                                         
           296,008      r02D1                                                         ( +-  4.31% )  (32.03%)
            85,257      r10D1                                                         ( +-  8.36% )  (31.06%)
            13,754      r04D1                                                         ( +- 24.54% )  (32.60%)
            63,409      r20D1                                                         ( +-  2.65% )  (33.04%)

       0.214055056 seconds time elapsed                                          ( +-  0.36% )

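Execution is much faster: elapsed time drops from about 0.327 s to about 0.214 s, roughly 35% less. Most of the eliminated misses are in the LLC (last-level cache): LLC-load-misses fall from 3,938,302 to 117,380, while L1-dcache-load-misses stay at about 9.9 million. The reason is that the cache advantage of row order comes from the hardware prefetcher's STR behavior (when memory is read, a whole cache line is prefetched at a time). Hardware prefetch usually operates on the outer cache levels, so the benefit is concentrated in the LLC.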

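Aside: supposedly AMD and Intel have optimized things to the point where almost any program's L1 cache miss rate stays below 5%, and the numbers below seem to agree. How is this seemingly magical result achieved? BodomMoon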

Testing shows that switching sse_prefetch_transpose to row_order also improves performance.

Here is a performance comparison of naive_transpose, sse_transpose (row order), and sse_prefetch_transpose (row order):

naive_transpose (row order, matrix size 4096 * 4096)

naive: 		 367553429 ns

 Performance counter stats for './naive_transpose' (20 runs):

        39,629,222      cache-misses              #   76.159 % of all cache refs      ( +-  0.56% )  (33.33%)
        52,034,511      cache-references                                              ( +-  0.78% )  (33.71%)
     1,700,143,929      instructions              #    1.40  insn per cycle           ( +-  0.51% )  (42.18%)
     1,213,698,295      cycles                                                        ( +-  0.50% )  (42.32%)
       641,193,099      L1-dcache-loads                                               ( +-  0.45% )  (42.44%)
        29,021,316      L1-dcache-load-misses     #    4.53% of all L1-dcache hits    ( +-  0.50% )  (42.55%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   
        17,505,775      LLC-loads                                                     ( +-  0.55% )  (42.52%)
        14,261,795      LLC-load-misses           #   81.47% of all LL-cache hits     ( +-  0.46% )  (33.72%)
   <not supported>      LLC-prefetches                                              
   <not supported>      LLC-prefetch-misses                                         
           576,890      r02D1                                                         ( +-  3.13% )  (33.88%)
        16,141,089      r10D1                                                         ( +-  0.48% )  (33.68%)
         3,066,231      r04D1                                                         ( +-  1.41% )  (33.67%)
        12,905,041      r20D1                                                         ( +-  0.59% )  (33.39%)

       0.533723421 seconds time elapsed                                          ( +-  0.67% )

sse_transpose_row (row order, matrix size 4096 * 4096)

sse: 		 78642654 ns

 Performance counter stats for './sse_transpose_row' (100 runs):

        16,695,527      cache-misses              #   77.365 % of all cache refs      ( +-  0.34% )  (33.90%)
        21,580,334      cache-references                                              ( +-  0.39% )  (35.27%)
     1,327,756,804      instructions              #    1.79  insn per cycle           ( +-  0.22% )  (44.47%)
       741,817,972      cycles                                                        ( +-  0.19% )  (45.31%)
       559,820,011      L1-dcache-loads                                               ( +-  0.23% )  (45.54%)
         8,723,894      L1-dcache-load-misses     #    1.56% of all L1-dcache hits    ( +-  0.47% )  (44.76%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   
           361,260      LLC-loads                                                     ( +-  0.62% )  (42.93%)
           103,600      LLC-load-misses           #   28.68% of all LL-cache hits     ( +-  4.79% )  (31.97%)
   <not supported>      LLC-prefetches                                              
   <not supported>      LLC-prefetch-misses                                         
           143,543      r02D1                                                         ( +-  5.24% )  (33.79%)
            78,784      r10D1                                                         ( +-  6.68% )  (34.67%)
            11,216      r04D1                                                         ( +- 11.43% )  (34.20%)
            68,828      r20D1                                                         ( +-  7.22% )  (33.57%)

       0.206929540 seconds time elapsed                                          ( +-  0.23% )

sse_prefetch_transpose_opt (row order, matrix size 4096 * 4096, PFDIST = 6 * 4 = 24, _MM_HINT_T1)

sse_prefetch: 		 72837456 ns

  Performance counter stats for './sse_prefetch_transpose_opt 6' (100 runs):

        18,050,296      cache-misses              #   79.911 % of all cache refs      ( +-  0.25% )  (34.03%)
        22,587,956      cache-references                                              ( +-  0.23% )  (35.36%)
     1,372,153,797      instructions              #    1.84  insn per cycle           ( +-  0.17% )  (44.56%)
       747,232,008      cycles                                                        ( +-  0.07% )  (45.35%)
       576,889,463      L1-dcache-loads                                               ( +-  0.09% )  (45.45%)
         8,881,922      L1-dcache-load-misses     #    1.54% of all L1-dcache hits    ( +-  0.35% )  (44.47%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   
           303,539      LLC-loads                                                     ( +-  0.44% )  (42.52%)
            36,078      LLC-load-misses           #   11.89% of all LL-cache hits     ( +-  5.54% )  (31.61%)
   <not supported>      LLC-prefetches                                              
   <not supported>      LLC-prefetch-misses                                         
           284,348      r02D1                                                         ( +-  0.72% )  (34.08%)
            13,161      r10D1                                                         ( +-  0.50% )  (35.16%)
             5,295      r04D1                                                         ( +-  0.83% )  (34.40%)
             9,161      r20D1                                                         ( +-  0.84% )  (33.68%)

       0.205510823 seconds time elapsed                                          ( +-  0.06% )

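Automated testing of the prefetch part

Click here for a description of the Prefetch instruction

I implemented automated testing over the prefetch distance and repeated it more than a thousand times, but the best prefetch distance chosen by execution time differs from run to run. In particular, for PFDIST (prefetch distance) between 4 and 12 the performance differences are smaller than the run-to-run noise introduced by the operating system.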
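As a reminder of what PFDIST controls, here is a minimal sketch of software prefetching at a fixed distance. It is not the project's transpose code: it uses the GCC/Clang builtin `__builtin_prefetch` in place of `_mm_prefetch` so that it compiles without x86 headers, and `sum_with_prefetch` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array while prefetching pfdist elements ahead of the element
 * currently being read.
 * __builtin_prefetch(addr, rw, locality): rw = 0 requests a prefetch
 * for reading; locality = 0 requests a non-temporal prefetch, which is
 * assumed here to be comparable to _MM_HINT_NTA on x86. */
long sum_with_prefetch(const int *data, size_t n, size_t pfdist)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + pfdist < n) /* never form an address past the array */
            __builtin_prefetch(&data[i + pfdist], 0, 0);
        total += data[i];
    }
    return total;
}
```

The prefetch only changes timing, never the result, which is why the best distance has to be found by measurement.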

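So I turned to the LLC miss rate instead. The L1 miss rate is already below 1.5%, which suggests that in NTA mode prefetching barely touches the L1 cache, so the only place a significant change should appear is the LLC.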

perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-prefetches,LLC-prefetch-misses,r02D1,r10D1,r04D1,r20D1 ./sse_prefetch_transpose_opt 

for array size = 4096

At PFDIST 3:
26,183      LLC-load-misses           #   22.23% of all LL-cache hits     ( +-  3.23% )  (34.73%)
At PFDIST 4:
23,773      LLC-load-misses           #   21.94% of all LL-cache hits     ( +-  0.94% )  (35.13%)
At PFDIST 6:
26,653      LLC-load-misses           #   22.72% of all LL-cache hits     ( +-  6.92% )  (34.66%)
At PFDIST 8:
23,324      LLC-load-misses           #   23.59% of all LL-cache hits     ( +-  0.92% )  (35.84%)
At PFDIST 10:
26,911      LLC-load-misses           #   24.40% of all LL-cache hits     ( +-  6.73% )  (35.13%)

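The size 4096 has a rather particular effect on performance.

See software pipelining
- This has to do with the performance hit from how cache line indexes are mapped (cache line index shift). The sentence there, "when the CPU needs to read (1, 0), the cache line brought in by reading (0, 0) has very likely already been replaced", may be slightly off: reading (1, 0) and reading (0, 0) are not related.
- I can reproduce this performance effect on my machine as well; barring surprises, it should appear on any machine whose cache line size is 64 bytes.

Worked out with an n-way set associative cache: when reading in column order, the (1,0) brought in while reading (0,0) does indeed have a very high chance of being evicted, which is how it should be explained. The detailed calculations there are all correct; only the sentence above needs fixing, plus a note stating this condition:

The program reads in column order

From this we know that at size 4096, because the mapped locations overlap, data in the cache is overwritten faster, so in theory the better PFDIST should be smaller. Conversely, when mapped locations overlap more slowly, PFDIST should be larger. Let's run an experiment to check.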
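The overlap of mapped locations can be checked with a little arithmetic. The sketch below is an illustration under assumed cache parameters (64-byte lines and 64 sets, as in a 32 KiB 8-way L1 data cache); these numbers are assumptions, not measurements from the machine used above, and `distinct_sets_in_column` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES  64 /* assumed cache line size */
#define NUM_SETS    64 /* assumed: 32 KiB, 8-way set associative */
#define ROWS_PROBED 64 /* how many rows of one column to look at */

/* Walk down one column of an n x n int32_t matrix (row-major) and count
 * how many distinct cache sets the first ROWS_PROBED rows map to. */
size_t distinct_sets_in_column(size_t n)
{
    assert(n > 0);
    unsigned char seen[NUM_SETS] = {0};
    size_t distinct = 0;
    for (size_t row = 0; row < ROWS_PROBED; row++) {
        size_t byte_offset = row * n * sizeof(int32_t);
        size_t set = (byte_offset / LINE_BYTES) % NUM_SETS;
        if (!seen[set]) {
            seen[set] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

With these assumptions, n = 4096 sends every row of a column to the same set, so only as many lines as the associativity can stay resident, while n = 4160 spreads the rows over 16 sets. That is consistent with cached data being overwritten faster at size 4096.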

Keep in mind that the prefetch distance must be large enough to hide memory latency. BodomMoon

perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-prefetches,LLC-prefetch-misses,r02D1,r10D1,r04D1,r20D1 ./sse_prefetch_transpose_opt 

for array size = 4160
At PFDIST 4:
14,935      LLC-load-misses           #   11.84% of all LL-cache hits     ( +- 13.93% )  (36.58%)
At PFDIST 6:
12,004      LLC-load-misses           #   10.16% of all LL-cache hits     ( +- 11.18% )  (36.64%)
At PFDIST 8:
8,485      LLC-load-misses           #    7.18% of all LL-cache hits     ( +-  1.30% )  (36.79%)
At PFDIST 10:
11,505      LLC-load-misses           #    9.41% of all LL-cache hits     ( +-  9.46% )  (36.69%)

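This supports the conclusion that prefetch distance is not portable across machines, algorithms, or data. In software there is no option other than measuring and tuning it for each program individually. The measuring and tuning itself, of course, can be automated.

At this point I was a bit puzzled. My cache line size is 64 bytes, so in theory one line holds 16 numbers, and a prefetch distance only makes sense if it reaches more than 16 numbers ahead. Looking back, I found that the PFDIST values I had tested only ranged from 0 to 15, so it seems the effort was wasted!? BodomMoon

I suddenly realized that the calculations above missed the point entirely. The unit of computation here is a 4 * 4 block of the matrix, so what should be prefetched is a later 4 * 4 block, yet I was computing the memory offset in steps of 1 * 1. Any offset that is not a multiple of 4 therefore still lands in the same block, so performance within the range X+0 to X+3 does not differ much.

In fact, once the prefetch offset is changed to a multiple of 4, the swings in overall performance become very clear.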
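The two observations above (offsets inside the same 4 * 4 block, and offsets inside the same cache line) can be written down as small helpers. This is a sketch with hypothetical names, assuming 4-byte elements, a 64-byte cache line (16 elements per line), and rows that start on a line boundary.

```c
#include <assert.h>

#define TILE           4  /* the transpose works on 4 x 4 blocks */
#define ELEMS_PER_LINE 16 /* assumed: 64-byte line / 4-byte element */

/* Index of the block column that element x + offset falls into. */
int tile_of(int x, int offset)
{
    assert(x >= 0 && offset >= 0);
    return (x + offset) / TILE;
}

/* Element offset to use when the intent is "pfdist blocks ahead". */
int tile_offset(int pfdist)
{
    assert(pfdist >= 0);
    return pfdist * TILE;
}

/* 1 if x and x + offset share a cache line, 0 otherwise. */
int same_cache_line(int x, int offset)
{
    assert(x >= 0 && offset >= 0);
    return x / ELEMS_PER_LINE == (x + offset) / ELEMS_PER_LINE;
}
```

For example, starting from x = 0, offsets 1 to 3 stay in block 0, `tile_offset(6)` is 24 (the 6 * 4 = 24 setting used earlier), and an offset only leaves the current cache line once it reaches 16, that is, from `tile_offset(4)` onward.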

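Earlier students did not have to worry about this because they accessed the matrix in column order: beyond 4 * 4, prefetching one 4 * 4 block further down the column already exceeds the cache line size. BodomMoon

As shown in the figures:

Note: the times below are in microseconds (I forgot to change the unit on the y-axis)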

array size = 4096

The minimum falls at roughly 5 * 4 = 20 or 6 * 4 = 24

This time it agrees with the theory that, given the cache line size, the offset should be greater than 16. BodomMoon

array size = 4160

PFDIST around 8 to 10 (4x) appears to work better (?)

I ran this experiment more times, but still could not pin down a consistent sweet spot. BodomMoon

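The prefetch used so far follows the NTA rule. NTA tries to keep negative side effects to a minimum, and it is also the policy with the smallest relative benefit. Let's try changing the rule to T2 and T1 and see whether the data becomes more representative.

For the rules, see software-pipelining, which explains them in detail. BodomMoon
This article is also worth reading; it describes all four hints in detail.
To keep data in active use from being overwritten in the cache, and to keep data that will not be reused out of the cache (both of which cause cache misses), non-temporal access (NTA) uses an extra buffer to hold prefetched data that is not needed yet, which also protects it from being modified, and only moves it into the cache when it is about to be used. In theory, therefore, it should be slower than T0, T1, and T2, which place data directly in the cache. TingL7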
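Switching between the four hints can be sketched as follows. This uses the GCC/Clang builtin `__builtin_prefetch` in place of the project's `_mm_prefetch` calls; the correspondence between locality levels and the x86 hints noted in the comments is my understanding of common compiler behavior on x86, not something stated in the notes, and `prefetch_with_locality` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Issue a read prefetch for addr with the requested temporal locality.
 * Returns 1 if a prefetch was issued, 0 for an unknown level.
 * The third argument of __builtin_prefetch must be a compile-time
 * constant, hence the switch instead of passing locality through. */
int prefetch_with_locality(const void *addr, int locality)
{
    assert(addr != NULL);
    switch (locality) {
    case 3: __builtin_prefetch(addr, 0, 3); return 1; /* assumed similar to T0 */
    case 2: __builtin_prefetch(addr, 0, 2); return 1; /* assumed similar to T1 */
    case 1: __builtin_prefetch(addr, 0, 1); return 1; /* assumed similar to T2 */
    case 0: __builtin_prefetch(addr, 0, 0); return 1; /* assumed similar to NTA */
    default: return 0;
    }
}
```

A prefetch is only a hint: it never changes the data, so whichever level is chosen, the difference shows up only in the timing and cache counters.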

T2

The performance gap looks more noticeable now! Next, switch to T1
T1
T0

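With T0 the result is comparatively clear: PFDIST 10 to 12 (4x) works best, and the conclusion above still follows:

At size 4096, because the mapped locations overlap, data in the cache is overwritten faster, so in theory the better PFDIST should be smaller. Conversely, when mapped locations overlap more slowly, PFDIST should be larger. This also confirms once more the theory that, given the cache line size, the offset should be greater than 16. BodomMoon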

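Automated statistics and varying the matrix size

caculate.c in the project already contains the code for automated statistics and for computing the best PFDIST based on execution time; the array sizes to test can be adjusted through TEST_HW.txt.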
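The idea of picking a best PFDIST from timing data can be sketched like this. It is not the contents of caculate.c, which are not reproduced in these notes; the function name, the flat array layout, and the use of the mean are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Return the PFDIST index whose runs have the lowest mean time.
 * times is a flat array: times[p * runs + r] is run r at PFDIST p
 * (hypothetical layout). Ties keep the smallest index. */
size_t best_pfdist(const double *times, size_t num_pfdist, size_t runs)
{
    assert(times != NULL && num_pfdist > 0 && runs > 0);
    size_t best = 0;
    double best_mean = 0.0;
    for (size_t p = 0; p < num_pfdist; p++) {
        double sum = 0.0;
        for (size_t r = 0; r < runs; r++)
            sum += times[p * runs + r];
        double mean = sum / (double) runs;
        if (p == 0 || mean < best_mean) {
            best_mean = mean;
            best = p;
        }
    }
    return best;
}
```

Because the differences between neighboring PFDIST values are often within measurement noise, a more robust statistic such as the median may be preferable to the mean.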

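The assignment requirements follow

Varying the matrix size

But I am honestly not sure what this part of the experiment is meant to test or verify = = BodomMoon

Changing the settings in main.c changes the size of the input matrix
- It has been modified to handle any square matrix whose side length is a multiple of 4

Below, the execution time unit is changed to nanoseconds (otherwise the unit is too coarse to show any difference), and the Best PFDIST is computed by caculate.c.

Honestly, though, the results are for reference only: the performance differences caused by changing PFDIST are smaller than the noise from the operating system, so all that can be extracted is a range running from a PFDIST that is too small to one that is too large.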

Note: the times below are in nanoseconds (I forgot to change the unit on the y-axis)

8 x 8

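By calculation, the PFDIST here should be around 8, because the upper 4 * 8 part of the matrix is read in all at once on the first cache miss, and prefetch only helps when fetching the lower 4 * 8 part. BodomMoon

For a matrix of only 8 * 8, the time cost of the cache miss on the upper 4 * 8 part is significant.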

16 x 16

All we can tell is that the best PFDIST is roughly between 4 and 10, where times are more stable and lower

32 x 32

Result as shown in the figure

All we can tell is that the best PFDIST is roughly between 4 and 8, where times are more stable

64 x 64