Try   HackMD

2016q3 Homework3 (software-pipeline)

contribute by <kobeyu>

github
作業說明

指定閱讀

效能分析: Prefetching

When Prefetching Works, When It Doesn’t, and Why

Prefetch

在cpu要運算之前把資料從主記憶體預先載入cache,降低cpu等待時間進而提升效率.
從時間上來看載入至主記憶體的時機不能太晚(cpu已經運算完畢不再需要這些資料)也不能太早(有可能在使用之前就被踢出快取記憶體).
從記憶體層級上來看prefetch intrinsic提供參數可以選擇載入到哪個層級的快取記憶體.

Hardware prefetch

是用過去的存取紀錄作為prefetch的依據.
GHB:Global History Buffer

Software prefetch

可以是人為或是compiler在適當的地方加入prefetch指令.

Direct and Indirect Memory Indexing

Direct是直接可以從指令中知道要存取的記憶體位址,可以透過hareward prefetch.

Indirect 就像這行程式碼 x = a[b[i]],需要先計算b[i]的值才知道要存取a[]中的哪個位置,機器不會比人更知道b[i]是怎麼來的,所以使用software prefetch比較容易達成.

使用Software Prefetch優於Hardware Prefetch的情況:

​* Large Number of Streams (Limited Hardware Resources).
​* Short Streams 前面有提到hareware prefetch是根據歷史紀律作為prefetch的依據,所以過短的資料比較適合sofware prefetch.
​* Irregular Memory Access
​* Cache Locality Hint
​* Loop Bounds.

使用Software Prefetch的缺點

​* Increased Instruction Count
​* Static Insertion 無法再執行時期動態調整
​* Code Structure Change

同時使用Hardware與Software Prefetch能夠發會1+1>2的情況

​*Handling Multiple Streams
​*Positive Training

同時使用Hardware與Software Prefetch有人是豬隊友的情況

​*Negative Training
​*Harmful Software Prefetching 如果software prefetch的效能很差連帶會影響到hardware prefetch.

Prefetch Intrinsic Insertion Algorithm

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

K is a constant factor
L is an average memory latency
IPCb ench is the profiled average IPC of each benchmark
Wloop is the average instruction count in one loop iteration.

p.s在論文中所有的量測K=4 & L=300


Matrix Transpose

矩陣轉置就是把行變列,列變行的數學運算,由於記憶體空間是Row Major,所以在讀取列時是讀取連續的記憶體空間,寫入行則是寫入不連續的記憶體空間,所以矩陣轉置先天上就會造成很高比例的cache-misses.

使用Intel Assembary (SSE2)完成矩陣轉置

Programming trivia: 4x4 integer matrix transpose in SSE2
使用SSE/AVX能得到比較短的執行時間,以目前的認知來看原因有:
1.一次可以處理比較長的資料長度
2.CPU指令集直接支援,可以有比較好的CPI.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

PUNPCKLDQ

_mm_unpacklo_epi32 (__m128i __A, __m128i __B)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

PUNPCKHDQ

_mm_unpackhi_epi32 (__m128i __A, __m128i __B)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

PUNPCKLQDQ

_mm_unpacklo_epi64 (__m128i __A, __m128i __B)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

PUNPCKHQDQ

_mm_unpackhi_epi64 (__m128i __A, __m128i __B)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

圖片來源:32/64-Bit 80x86 Assembly Language Architecture


解析程式碼pefetch

資料夾結構

​.
​├── impl.c
​├── main.c
​├── Makefile
​├── README.md
​└── scripts
​	├── plot.gp
​	└── pre-commit.hook

​1 directory, 6 files

impl.c 實作了三個版本的矩陣轉制,分別是naive,sse,sse+prefectch

先來看一下執行時間的比較:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

根據先前閱讀的資料來看prefetch的效能好壞會根據Distance與Cache Level有關.
先來分析Distance對於效能的影響.
中imple.c中prefetch的距離是根據PFDIST的設定值,在實驗中分別測試D=0/4/8/12/16的執行時間分別是多少:

以結果來看執行時間的關係為:
D=8 < D=4,12 < D=16 < D=0
所以證明了prefetch的距離會影響到執行效能.其中D=0尤其明顯,推測可能是先前(2 iteration)用過的資料已經被移出cache或被覆蓋的關係.另外觀察到一個有趣的現象就是無論D是等於多少,都會比沒有加prefetch的sse版本還要來得快.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →


接著我將Distance固定在8,改變prefetch的位置(TO/T1/T2/NTA)
執行的結果是NTA最花時間,再來確認一下NTA的定義:
NTA (non-temporal data with respect to all cache levels)prefetch data into non-temporal cache structure. (This hint can be used to minimize pollution of caches.)
有兩個地方不太懂"non-temporal cache structure"與"pollution of caches"

簡單說就是 cache 被塞入不時宜的 data,而 early 則是指說太早抓這個 data 進來,結果要用到時已經被 replaced 了,inccorrect 是指說你載入了根本用不到的 data,這些都是會讓 cache miss 增加,也可以說是 汙染了cache Yen-Kuan Wu

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

接著用perf觀察cache misses
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main

D=8 + T0

​Performance counter stats for './main' (50 runs):

​​​​    62,963,773      cache-misses              #   86.289 % of all cache refs      ( +-  0.16% )  (66.55%)
​​​​    72,968,723      cache-references                                              ( +-  0.15% )  (66.62%)
​​​​    34,265,657      L1-dcache-load-misses     #    4.49% of all L1-dcache hits    ( +-  0.11% )  (66.74%)
​​​​   762,324,176      L1-dcache-loads                                               ( +-  0.06% )  (66.89%)
​​​​   314,796,493      L1-dcache-stores                                              ( +-  0.06% )  (66.92%)
​​​​       194,207      L1-icache-load-misses                                         ( +-  1.25% )  (66.69%)

​​​​   0.698045545 seconds time elapsed                                          ( +-  0.16% )

D=0+NTA

​​Performance counter stats for './main' (50 runs):

​​​​    65,505,666      cache-misses              #   86.406 % of all cache refs      ( +-  0.18% )  (66.35%)
​​​​    75,811,770      cache-references                                              ( +-  0.16% )  (66.62%)
​​​​    33,924,898      L1-dcache-load-misses     #    4.49% of all L1-dcache hits    ( +-  0.08% )  (66.93%)
​​​​   754,883,652      L1-dcache-loads                                               ( +-  0.05% )  (67.06%)
​​​​   316,034,709      L1-dcache-stores                                              ( +-  0.08% )  (66.89%)
​​​​       153,597      L1-icache-load-misses                                         ( +-  1.40% )  (66.49%)

​​​​   0.779092952 seconds time elapsed                                          ( +-  0.11% )

D=0+T0

​Performance counter stats for './main' (50 runs):

​​​​    69,063,108      cache-misses              #   86.741 % of all cache refs      ( +-  0.17% )  (66.29%)
​​​​    79,620,072      cache-references                                              ( +-  0.14% )  (66.56%)
​​​​    33,903,912      L1-dcache-load-misses     #    4.49% of all L1-dcache hits    ( +-  0.07% )  (66.94%)
​​​​   755,064,566      L1-dcache-loads                                               ( +-  0.05% )  (67.15%)
​​​​   316,197,275      L1-dcache-stores                                              ( +-  0.07% )  (66.94%)
​​​​       157,890      L1-icache-load-misses                                         ( +-  1.25% )  (66.45%)

​​​​   0.781984994 seconds time elapsed                                          ( +-  0.08% )

目前的結果發現執行時間似乎跟cache misses太大的關聯,似乎證明了之前的推論,prefetch的目的並不是在降低cache miss,而是降低cache misses所產生的延遲.


上述的實驗(cache-misses)結果是錯的,因為在測試cache misses的時候sse+prefetch/sse/naive並沒有分開執行,以下是分開執行的結果:

  1. SSE + Prefetch (Distancd=8,@TO)

    ​​​​​​​ Performance counter stats for './main' (50 runs):
    
    ​​​​​​​  9,022,860      cache-misses              #   67.566 % of all cache refs      ( +-  0.32% )  (66.61%)
    ​​​​​​​ 13,354,083      cache-references                                              ( +-  0.16% )  (67.11%)
    ​​​​​​​  8,964,264      L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  0.32% )  (67.20%)
    ​​​​​​​465,440,570      L1-dcache-loads                                               ( +-  0.08% )  (67.20%)
    ​​​​​​​233,797,035      L1-dcache-stores                                              ( +-  0.12% )  (66.88%)
    ​​​​​​​     62,034      L1-icache-load-misses                                         ( +-  2.20% )  (66.37%)
    
    ​​​​​​​0.195889775 seconds time elapsed 
    
  2. SSE

    ​​​​​​​ Performance counter stats for './main' (50 runs):
    
    ​​​​​​​ 15,047,354      cache-misses              #   79.379 % of all cache refs      ( +-  0.28% )  (66.25%)
    ​​​​​​​ 18,956,381      cache-references                                              ( +-  0.18% )  (66.36%)
    ​​​​​​​  
    ​​​​​​​  8,408,231      L1-dcache-load-misses     #    1.89% of all L1-dcache hits    ( +-  0.12% )  (66.71%)
    ​​​​​​​443,824,150      L1-dcache-loads                                               ( +-  0.15% )  (67.34%)
    ​​​​​​​229,787,688      L1-dcache-stores                                              ( +-  0.13% )  (67.44%)
    ​​​​​​​     90,578      L1-icache-load-misses                                         ( +-  1.63% )  (66.70%)
    
    ​​​​​​​0.356293017 seconds time elapsed    
    
  3. NAIVE

    ​​​​​​​ Performance counter stats for './main' (50 runs):
    
    ​​​​​​​ 42,155,708      cache-misses              #   87.722 % of all cache refs      ( +-  0.21% )  (66.42%)
    ​​​​​​​ 48,056,099      cache-references                                              ( +-  0.20% )  (66.60%)
    ​​​​​​​ 21,011,298      L1-dcache-load-misses     #    3.84% of all L1-dcache hits    ( +-  0.07% )  (66.80%)
    ​​​​​​​547,771,641      L1-dcache-loads                                               ( +-  0.09% )  (67.18%)
    ​​​​​​​215,693,429      L1-dcache-stores                                              ( +-  0.13% )  (67.09%)
    ​​​​​​​     76,732      L1-icache-load-misses                                         ( +-  1.87% )  (66.52%)
    
    ​​​​​​​0.432626384 seconds time elapsed                                          ( +-  0.20% )
    

從上面的實驗結果可以看到prefetch先將所需要的資料載入快取記憶體降低了cache-misses進而減少了執行時間.
思考cache miss的定義,要存取的資料若不在快取記憶體中=>cache misses,如果預先知道要存取的資料並載入快取記憶體中,就可以避免cache misses.

使用AVX提升效能

參考之前同學的實作方法 link

參考資料

GHB link

tags: kobeyu