contributed by <kobeyu>
Prefetching loads data from main memory into the cache before the CPU operates on it, reducing CPU stall time and thus improving performance.
In terms of timing, the data must not be loaded into the cache too late (the CPU has already finished and no longer needs it) nor too early (it may be evicted from the cache before it is ever used).
In terms of the memory hierarchy, the prefetch intrinsic takes a parameter that selects which cache level to load into.
Hardware prefetch uses past access history as the basis for prefetching.
GHB:Global History Buffer
Software prefetch instructions can be inserted at appropriate places, either by hand or by the compiler.
Direct: the memory address to access is known directly from the instruction, so it can be handled by hardware prefetch.
Indirect: as in `x = a[b[i]]`, the value of `b[i]` must be computed first before we know which element of `a[]` to access. The machine cannot know better than the programmer where `b[i]` comes from, so this case is easier to handle with software prefetch.
* Large Number of Streams (Limited Hardware Resources).
* Short Streams: as mentioned above, hardware prefetch relies on past access history, so streams too short to train it are better served by software prefetch.
* Irregular Memory Access
* Cache Locality Hint
* Loop Bounds.
* Increased Instruction Count
* Static Insertion: cannot be adjusted dynamically at run time
* Code Structure Change
* Handling Multiple Streams
* Positive Training
* Negative Training
* Harmful Software Prefetching: if software prefetch performs badly, it drags hardware prefetch down with it.
K is a constant factor
L is the average memory latency
IPC_bench is the profiled average IPC of each benchmark
W_loop is the average instruction count in one loop iteration
P.S. In the paper, all measurements use K=4 and L=300.
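Reconstructing from the variable definitions above, my reading of the paper's prefetch-distance formula (worth double-checking against the original) is:

```latex
\text{Distance} \ge \left\lceil \frac{K \times L \times \mathit{IPC}_{bench}}{W_{loop}} \right\rceil
```

Intuitively: the latency to hide (L cycles) times the instruction throughput (IPC_bench) gives the instructions executed while a miss is outstanding; dividing by the instructions per iteration (W_loop) converts that into loop iterations, and K scales it with a safety margin.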
Matrix transposition is the operation that turns rows into columns and columns into rows. Since memory layout is row-major, reading a row accesses contiguous memory while writing a column writes to non-contiguous memory, so transposition inherently produces a high rate of cache misses.
Programming trivia: 4x4 integer matrix transpose in SSE2
Using SSE/AVX gives shorter execution times; as far as I currently understand, the reasons are:
1. Each instruction processes a wider chunk of data.
2. Direct support in the CPU instruction set yields a better CPI.
_mm_unpacklo_epi32 (__m128i __A, __m128i __B)
_mm_unpackhi_epi32 (__m128i __A, __m128i __B)
_mm_unpacklo_epi64 (__m128i __A, __m128i __B)
_mm_unpackhi_epi64 (__m128i __A, __m128i __B)
Image source: 32/64-Bit 80x86 Assembly Language Architecture
.
├── impl.c
├── main.c
├── Makefile
├── README.md
└── scripts
├── plot.gp
└── pre-commit.hook
1 directory, 6 files
impl.c implements three versions of matrix transpose: naive, sse, and sse+prefetch.
First, compare their execution times:
According to the material read earlier, prefetch performance depends on both the distance and the target cache level.
First, analyze the effect of distance on performance.
In impl.c the prefetch distance is set by the PFDIST macro; the experiment measures the execution time for D=0/4/8/12/16:
The resulting execution times rank as:
D=8 < D=4,12 < D=16 < D=0
This shows that the prefetch distance does affect performance, most visibly at D=0, presumably because data used two iterations earlier has already been evicted from the cache or overwritten. Another interesting observation: whatever the value of D, the result is always faster than the sse version without prefetch.
Next, I fixed the distance at 8 and varied the prefetch locality hint (T0/T1/T2/NTA).
NTA turned out to be the slowest, so let us check NTA's definition:
NTA (non-temporal data with respect to all cache levels)–prefetch data into non-temporal cache structure. (This hint can be used to minimize pollution of caches.)
Two phrases were unclear to me: "non-temporal cache structure" and "pollution of caches".
Simply put, cache pollution means the cache gets filled with ill-timed data: "early" means data was fetched too soon and had already been replaced by the time it was needed, while "incorrect" means data that is never used at all was loaded. Both increase cache misses, which is what is meant by polluting the cache. — Yen-Kuan Wu
Next, observe the cache misses with perf:
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main
D=8 + T0
Performance counter stats for './main' (50 runs):
62,963,773 cache-misses # 86.289 % of all cache refs ( +- 0.16% ) (66.55%)
72,968,723 cache-references ( +- 0.15% ) (66.62%)
34,265,657 L1-dcache-load-misses # 4.49% of all L1-dcache hits ( +- 0.11% ) (66.74%)
762,324,176 L1-dcache-loads ( +- 0.06% ) (66.89%)
314,796,493 L1-dcache-stores ( +- 0.06% ) (66.92%)
194,207 L1-icache-load-misses ( +- 1.25% ) (66.69%)
0.698045545 seconds time elapsed ( +- 0.16% )
D=0 + NTA
Performance counter stats for './main' (50 runs):
65,505,666 cache-misses # 86.406 % of all cache refs ( +- 0.18% ) (66.35%)
75,811,770 cache-references ( +- 0.16% ) (66.62%)
33,924,898 L1-dcache-load-misses # 4.49% of all L1-dcache hits ( +- 0.08% ) (66.93%)
754,883,652 L1-dcache-loads ( +- 0.05% ) (67.06%)
316,034,709 L1-dcache-stores ( +- 0.08% ) (66.89%)
153,597 L1-icache-load-misses ( +- 1.40% ) (66.49%)
0.779092952 seconds time elapsed ( +- 0.11% )
D=0 + T0
Performance counter stats for './main' (50 runs):
69,063,108 cache-misses # 86.741 % of all cache refs ( +- 0.17% ) (66.29%)
79,620,072 cache-references ( +- 0.14% ) (66.56%)
33,903,912 L1-dcache-load-misses # 4.49% of all L1-dcache hits ( +- 0.07% ) (66.94%)
755,064,566 L1-dcache-loads ( +- 0.05% ) (67.15%)
316,197,275 L1-dcache-stores ( +- 0.07% ) (66.94%)
157,890 L1-icache-load-misses ( +- 1.25% ) (66.45%)
0.781984994 seconds time elapsed ( +- 0.08% )
These results show little correlation between execution time and cache misses, which seems to confirm the earlier inference: the goal of prefetch is not to reduce cache misses but to reduce the latency they cause.
However, the cache-miss measurements above are invalid, because sse+prefetch/sse/naive were not run separately. Below are the results of running each version on its own:
SSE + Prefetch (Distance=8, @T0)
Performance counter stats for './main' (50 runs):
9,022,860 cache-misses # 67.566 % of all cache refs ( +- 0.32% ) (66.61%)
13,354,083 cache-references ( +- 0.16% ) (67.11%)
8,964,264 L1-dcache-load-misses # 1.93% of all L1-dcache hits ( +- 0.32% ) (67.20%)
465,440,570 L1-dcache-loads ( +- 0.08% ) (67.20%)
233,797,035 L1-dcache-stores ( +- 0.12% ) (66.88%)
62,034 L1-icache-load-misses ( +- 2.20% ) (66.37%)
0.195889775 seconds time elapsed
SSE
Performance counter stats for './main' (50 runs):
15,047,354 cache-misses # 79.379 % of all cache refs ( +- 0.28% ) (66.25%)
18,956,381 cache-references ( +- 0.18% ) (66.36%)
8,408,231 L1-dcache-load-misses # 1.89% of all L1-dcache hits ( +- 0.12% ) (66.71%)
443,824,150 L1-dcache-loads ( +- 0.15% ) (67.34%)
229,787,688 L1-dcache-stores ( +- 0.13% ) (67.44%)
90,578 L1-icache-load-misses ( +- 1.63% ) (66.70%)
0.356293017 seconds time elapsed
NAIVE
Performance counter stats for './main' (50 runs):
42,155,708 cache-misses # 87.722 % of all cache refs ( +- 0.21% ) (66.42%)
48,056,099 cache-references ( +- 0.20% ) (66.60%)
21,011,298 L1-dcache-load-misses # 3.84% of all L1-dcache hits ( +- 0.07% ) (66.80%)
547,771,641 L1-dcache-loads ( +- 0.09% ) (67.18%)
215,693,429 L1-dcache-stores ( +- 0.13% ) (67.09%)
76,732 L1-icache-load-misses ( +- 1.87% ) (66.52%)
0.432626384 seconds time elapsed ( +- 0.20% )
The experiment above shows that prefetch loads the needed data into the cache in advance, reducing cache misses and thereby the execution time.
Consider the definition of a cache miss: the data being accessed is not in the cache. If we know in advance which data will be accessed and load it into the cache beforehand, those misses can be avoided.
Reference implementation from a previous classmate: link
GHB link
kobeyu