2017q1 Homework3 (software-pipelining)

contributed by <baimao8437>

論文筆記

前言:

本論文研究方向有三:

source code-level analysis 直接說明SW、HW的優缺點
SW、HW之間的協同作用(synergistic)和反作用(antagonistic effects)
在SW有request的情況下評估HW的訓練策略

名詞解釋:

Prefetch
在 cpu 要運算之前把資料從主記憶體預先載入 cache，降低 cpu 等待時間進而提升效率.
載入至主記憶體的時機不能太晚(cpu 已經運算完畢不再需要這些資料)也不能太早(有可能在使用之前就被踢出快取記憶體)，所以計算 prefetch distance 很重要。
Hardware prefetch
是用過去的存取紀錄作為 prefetch 的依據，所以需要 training。
e.g. GHB:Global History Buffer
Software prefetch
人為或 compiler 在適當的地方加入 prefetching intrinsics。
但 ICC or GCC 中只有簡單的 SW 指令，所以比較仰賴人為加入。
e.g. SSE 的 __mm_prefetch()

而對於手動加入指令有兩個最大的問題:

對於如何插入 prefetch intrinsic 沒有一個嚴格的標準
軟體與硬體的 prefetch 交互使用的複雜度沒有一個很好的了解

SW & HW 的交互作用實驗結果:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

GHB(stride prefetcher)
stride 是指說你是一次跳幾格去 prefetch cache line。
STR(stream prefetcher)
stream 是指說你的 prefetch 是連續的抓取 cache line。

可以看到左邊 Positive 區的實驗結果顯示在這些 benchmark 的情況下，SW+HW 會比 HW 來得更有效率。

實驗結果重點

SW 對於 short array streams, irregular memory address patterns, and L1 cache miss reduction 效果較好
SW 會對 HW 的 training 產生干擾，讓效能變差，尤其是在 stream 的時候。

從實驗中，產生下列主要的三個問題:

What are the limitations and overheads of software prefetching?
What are the limitations and overheads of hardware prefetching?
When is it beneficial to use software and/or hardware prefetching?

SW & HW 的背景知識

Data Structure

此表分析了不同 data structure 的性質及當時已經有提出的 HW SW prefetch 方式

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

而當中的 D 就是 prefetch distance

Prefetch hint

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

主要使用在SSE的__mm_prefetch([address],[hint]);

Prefetch 分類 & Timeliness

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Prefetch distance

Prefetch 希望能夠 timely，所以要設定prefetch distance，要大到足夠隱藏 latency。

D \geq ⌈ \frac{l}{s} ⌉

D is called the prefetch distance
l is the prefetch latency
s is the length of the shortest path through the loop body

Direct\Indirect index

Direct: 是直接可以從指令中知道要存取的記憶體位址,可以透過 hareward prefetch.
Indirect: 就像這行程式碼 x = a[b[i]],需要先計算b[i]的值才知道要存取a[]中的哪個位置,使用 software prefetch 比較容易達成.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

(a)Direct: 2個指令
(b)Indirect: 4個指令

程式實作矩陣轉置

開發環境

baimao@baimao-Aspire-V5-573G:~$ lscpu
Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數：2
每通訊端核心數：2
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              69
Model name:            Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
製程：              1
CPU MHz：             1711.242
CPU max MHz:           2600.0000
CPU min MHz:           800.0000
BogoMIPS:              4589.38
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           3072K
NUMA node0 CPU(s):     0-3

CPU:Intel Colre i5-4200U 資訊

Architecture / Microarchitecture
Microarchitecture        Haswell
Platform		  Shark Bay
Processor core   	  Haswell
Core stepping   	  C0 (SR170)
Manufacturing process	  0.022 micron
Data width		  64 bit
The number of CPU cores  2
The number of threads	  4
Floating Point Unit	  Integrated
Level 1 cache size  	  2 x 32 KB 8-way set associative instruction caches 
			  2 x 32 KB 8-way set associative data caches
Level 2 cache size  	  2 x 256 KB 8-way set associative caches
Level 3 cache size	  3 MB 12-way set associative shared cache
Physical memory		  16 GB
Multiprocessing		  Uniprocessor
Features		  MMX instructions
			  SSE / Streaming SIMD Extensions
			  SSE2 / Streaming SIMD Extensions 2
			  SSE3 / Streaming SIMD Extensions 3
			  SSSE3 / Supplemental Streaming SIMD Extensions 3
			  SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4  
			  AES / Advanced Encryption Standard instructions
			  AVX / Advanced Vector Extensions
			  AVX2 / Advanced Vector Extensions 2.0
			  BMI / BMI1 + BMI2 / Bit Manipulation instructions
			  F16C / 16-bit Floating-Point conversion instructions
			  FMA3 / 3-operand Fused Multiply-Add instructions
			  EM64T / Extended Memory 64 technology / Intel 64  
			  NX / XD / Execute disable bit  
			  HT / Hyper-Threading technology   
			  TBT 2.0 / Turbo Boost technology 2.0   
			  VT-x / Virtualization technology 
Low power features	  Enhanced SpeedStep technology

開發紀錄

第一次執行結果

  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15

  0  4  8 12
  1  5  9 13
  2  6 10 14
  3  7 11 15
sse prefetch: 	 70873 us
sse: 		 144184 us
naive: 		 244073 us

gnuplot make plot
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

這是100次取平均值 naive 的時間不穩定所以平均的誤差大
但可以看到 sse + prefetch 的時間最短

接著改寫makefile、main將三種方法分開進行 perf 測試$ make cache-test
perf 所設定的參數^[1]:

cache-misses
cache-references
L1-dcache-load-misses
L1-dcache-loads
L1-dcache-stores 
L1-icache-load-misses

makefile:

EXEC = \
	sse_prefetch_transpose\
	sse_transpose\
	naive_transpose

cache-test: $(EXEC)
	for method in $(EXEC);do \
		perf stat --repeat 100 \
		-e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses \
		./$$method;\
	done

這裡有個困惑當我嘗試將 -e 後面的參數斷行
-e cache-misses,cache-references, \
L1-dcache-load-misses,L1-dcache-loads, \
L1-dcache-stores,L1-icache-load-misses \
執行時會發生錯誤:
event syntax error: 'cache-misses,cache-references,'
___ parser error
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
-e, –event <event> event selector. use 'perf list' to list available events

我改成
-e cache-misses,cache-references \
-e L1-dcache-load-misses,L1-dcache-loads \
-e L1-dcache-stores,L1-icache-load-misses
就沒有出現錯誤訊息了 claaaaassic

歐歐~ 那我.. 還是一行好了orzbaimao8437

pref 結果 `$ make cache-test`

sse prefetch

 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,759,440      cache-misses              #   85.111 % of all cache refs      ( +-  0.69% )  (32.89%)
         5,592,027      cache-references                                              ( +-  0.49% )  (50.39%)
         8,481,323      L1-dcache-load-misses     #    1.83% of all L1-dcache hits    ( +-  0.26% )  (67.13%)
       464,186,704      L1-dcache-loads                                               ( +-  0.07% )  (65.37%)
       234,986,970      L1-dcache-stores                                              ( +-  0.10% )  (61.59%)
            55,906      L1-icache-load-misses                                         ( +-  3.66% )  (32.76%)

       0.268066902 seconds time elapsed                                          ( +-  0.64% )

 Performance counter stats for './sse_transpose' (100 runs):

         4,807,128      cache-misses              #   86.253 % of all cache refs      ( +-  0.64% )  (33.36%)
         5,573,299      cache-references                                              ( +-  0.31% )  (50.49%)
         8,545,880      L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  0.15% )  (67.18%)
       441,706,999      L1-dcache-loads                                               ( +-  0.10% )  (65.64%)
       231,528,304      L1-dcache-stores                                              ( +-  0.12% )  (62.41%)
            64,512      L1-icache-load-misses                                         ( +-  3.90% )  (32.84%)

       0.341006236 seconds time elapsed                                          ( +-  1.00% )

naive

 Performance counter stats for './naive_transpose' (100 runs):

        17,137,199      cache-misses              #   94.603 % of all cache refs      ( +-  0.28% )  (33.27%)
        18,114,904      cache-references                                              ( +-  0.19% )  (50.21%)
        21,069,839      L1-dcache-load-misses     #    3.82% of all L1-dcache hits    ( +-  0.11% )  (66.89%)
       551,000,987      L1-dcache-loads                                               ( +-  0.07% )  (65.83%)
       213,504,897      L1-dcache-stores                                              ( +-  0.16% )  (63.62%)
            71,981      L1-icache-load-misses                                         ( +-  4.93% )  (33.07%)

       0.467403806 seconds time elapsed                                          ( +-  1.33% )

可以看到 cache-misses 確實是 sse prefetch 最低
因為 prefetch 先將所需要的資料載入快取記憶體降低了 cache-misses，低的 miss rate 讓他的執行速度比較快。

Row Counter

參考kaizsv 的共筆在 intel 文件中找 4th Generation Haswell microarchitecture
察看下列項目:

r014c: LOAD_HIT_PRE.SW_PF
r024c: LOAD_HIT_PRE.HW_PF

結果:

 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,911,498      cache-misses              #   84.564 % of all cache refs      ( +-  0.78% )  (49.91%)
         5,808,046      cache-references                                              ( +-  0.49% )  (50.22%)
         8,670,151      L1-dcache-load-misses     #    1.88% of all L1-dcache hits    ( +-  0.39% )  (50.51%)
       460,232,822      L1-dcache-loads                                               ( +-  0.14% )  (45.85%)
       237,557,467      L1-dcache-stores                                              ( +-  0.22% )  (25.48%)
            54,105      L1-icache-load-misses                                         ( +-  3.69% )  (25.65%)
           499,850      r014c                                                         ( +-  1.84% )  (25.31%)
           272,620      r024c                                                         ( +-  5.06% )  (37.61%)

       0.277024409 seconds time elapsed                                          ( +-  1.18% )


 Performance counter stats for './sse_transpose' (100 runs):

         4,743,884      cache-misses              #   85.357 % of all cache refs      ( +-  0.39% )  (49.71%)
         5,557,698      cache-references                                              ( +-  0.28% )  (49.95%)
         8,494,094      L1-dcache-load-misses     #    1.99% of all L1-dcache hits    ( +-  0.23% )  (50.33%)
       426,955,982      L1-dcache-loads                                               ( +-  0.15% )  (46.63%)
       236,397,375      L1-dcache-stores                                              ( +-  0.26% )  (25.85%)
            60,052      L1-icache-load-misses                                         ( +-  3.60% )  (25.54%)
               464      r014c                                                         ( +-  4.00% )  (25.23%)
           312,872      r024c                                                         ( +-  4.40% )  (37.39%)

       0.326735088 seconds time elapsed                                          ( +-  0.44% )


 Performance counter stats for './naive_transpose' (100 runs):

        16,888,393      cache-misses              #   92.475 % of all cache refs      ( +-  0.25% )  (49.93%)
        18,262,729      cache-references                                              ( +-  0.20% )  (50.39%)
        21,341,610      L1-dcache-load-misses     #    4.01% of all L1-dcache hits    ( +-  0.17% )  (50.91%)
       532,720,948      L1-dcache-loads                                               ( +-  0.10% )  (48.00%)
       228,505,453      L1-dcache-stores                                              ( +-  0.23% )  (25.06%)
            64,509      L1-icache-load-misses                                         ( +-  3.95% )  (25.08%)
               496      r014c                                                         ( +-  5.12% )  (24.98%)
           249,019      r024c                                                         ( +-  4.90% )  (37.44%)

       0.437171123 seconds time elapsed                                          ( +-  0.56% )

sse prefetch 的 r014c 明顯比其他大代表真的有執行 SW prefetch

實驗不同 Prefetch Distance

論文中提到 Prefetch Distance 會影響 Prefetch 的效率
在 impl.c 中的 PFDIST 就是 distance
將其用 0,2,4,7,8,9,10,12,16 代入因為預設8 固檢查8附近的值
$ make plot DIS=1

結果符合理論 distance 過猶不及這邊 8 代入的效率最好

實驗不同 Prefetch Hints

如同前方表格顯示共有4種hints 將其分別代入比較(預設是 T1)

_MM_HINT_T0

 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,667,916      cache-misses              #   82.749 % of all cache refs      ( +-  0.57% )  (49.02%)
         5,641,054      cache-references                                              ( +-  0.44% )  (49.48%)
         8,916,708      L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  0.40% )  (50.67%)
       463,153,960      L1-dcache-loads                                               ( +-  0.18% )  (46.33%)
       235,037,792      L1-dcache-stores                                              ( +-  0.25% )  (25.85%)
            50,785      L1-icache-load-misses                                         ( +-  4.21% )  (25.70%)
         1,237,631      r014c                                                         ( +-  1.73% )  (24.94%)
           403,505      r024c                                                         ( +-  3.30% )  (36.74%)

       0.256539780 seconds time elapsed                                          ( +-  0.57% )

_MM_HINT_T1

 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,906,154      cache-misses              #   82.649 % of all cache refs      ( +-  0.20% )  (49.38%)
         5,936,140      cache-references                                              ( +-  0.14% )  (49.40%)
         8,625,730      L1-dcache-load-misses     #    1.84% of all L1-dcache hits    ( +-  0.27% )  (49.46%)
       467,546,983      L1-dcache-loads                                               ( +-  0.11% )  (44.87%)
       232,786,099      L1-dcache-stores                                              ( +-  0.17% )  (26.09%)
            37,113      L1-icache-load-misses                                         ( +-  1.55% )  (26.02%)
           441,530      r014c                                                         ( +-  1.60% )  (25.60%)
           428,107      r024c                                                         ( +-  2.72% )  (37.41%)

       0.253216818 seconds time elapsed                                          ( +-  0.10% )

_MM_HINT_T2

 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,931,244      cache-misses              #   83.404 % of all cache refs      ( +-  0.71% )  (49.53%)
         5,912,467      cache-references                                              ( +-  0.53% )  (49.67%)
         8,578,641      L1-dcache-load-misses     #    1.84% of all L1-dcache hits    ( +-  0.39% )  (49.89%)
       466,874,784      L1-dcache-loads                                               ( +-  0.17% )  (45.45%)
       234,373,157      L1-dcache-stores                                              ( +-  0.22% )  (26.58%)
            45,452      L1-icache-load-misses                                         ( +-  6.35% )  (25.87%)
           435,123      r014c                                                         ( +-  1.95% )  (25.43%)
           408,551      r024c                                                         ( +-  3.97% )  (37.36%)

       0.264967830 seconds time elapsed                                          ( +-  1.18% )

_MM_HINT_NTA

 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,848,945      cache-misses              #   82.589 % of all cache refs      ( +-  0.41% )  (49.29%)
         5,871,166      cache-references                                              ( +-  0.36% )  (49.36%)
         8,568,859      L1-dcache-load-misses     #    1.83% of all L1-dcache hits    ( +-  0.27% )  (49.50%)
       467,023,161      L1-dcache-loads                                               ( +-  0.12% )  (45.17%)
       232,552,596      L1-dcache-stores                                              ( +-  0.16% )  (26.36%)
            35,771      L1-icache-load-misses                                         ( +-  2.76% )  (25.96%)
           451,018      r014c                                                         ( +-  1.83% )  (25.57%)
           428,982      r024c                                                         ( +-  2.64% )  (37.33%)

       0.253415363 seconds time elapsed                                          ( +-  0.18% )

gnuplot

可以觀察出 T0 時 r014c 最高因為他將資料 prefetch 到所有 level 所以效率最高

AVX

首先先從了解 SSE 的運作方式^[2] 然後將原本的 4*4 matrix 改為 8*8 matrix
code的部份也照著 SSE 的步驟改為 AVX 但結果並不正確:

I0 ->  0  8 16 24 | 4 12 20 28
I1 ->  1  9 17 25 | 5 13 21 29
I2 ->  2 10 18 26 | 6 14 22 30
I3 ->  3 11 19 27 | 7 15 23 31
-----------------------------
I4 -> 32 40 48 56 | 36 44 52 60
I5 -> 33 41 49 57 | 37 45 53 61
I6 -> 34 42 50 58 | 38 46 54 62
I7 -> 35 43 51 59 | 39 47 55 63

左下和右上的角應該要交換才對
最後參考周曠宇的共筆加入這幾行code完成交換

T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
T1 = _mm256_permute2x128_si256(I1, I5, 0x20);
T2 = _mm256_permute2x128_si256(I2, I6, 0x20);
T3 = _mm256_permute2x128_si256(I3, I7, 0x20);
T4 = _mm256_permute2x128_si256(I0, I4, 0x31);
T5 = _mm256_permute2x128_si256(I1, I5, 0x31);
T6 = _mm256_permute2x128_si256(I2, I6, 0x31);
T7 = _mm256_permute2x128_si256(I3, I7, 0x31);

但這個指令我還在研究原理為何

試著加入 prefetch 但是 AVX 並沒有支援 prefetch 要到 AVX-512才有支援
所以先用 SSE 的 prefetch 這裡先預設 distance = 8, Hints = T1

結果

avx perf

    -  Performance counter stats for './avx_transpose' (100 runs):

         3,612,052      cache-misses              #   77.914 % of all cache refs      ( +-  0.36% )  (50.22%)
         4,635,946      cache-references                                              ( +-  0.29% )  (50.45%)
         8,223,505      L1-dcache-load-misses     #    2.11% of all L1-dcache hits    ( +-  0.19% )  (50.50%)
       390,434,639      L1-dcache-loads                                               ( +-  0.07% )  (45.13%)
       216,084,485      L1-dcache-stores                                              ( +-  0.14% )  (25.38%)
            36,183      L1-icache-load-misses                                         ( +-  1.32% )  (25.96%)
               359      r014c                                                         ( +-  5.20% )  (25.59%)
           208,075      r024c                                                         ( +-  4.01% )  (37.76%)

       0.259036199 seconds time elapsed                                          ( +-  0.15% )

avx_prefetch perf

 Performance counter stats for './avx_prefetch_transpose' (100 runs):

         3,870,080      cache-misses              #   75.536 % of all cache refs      ( +-  1.15% )  (49.67%)
         5,123,514      cache-references                                              ( +-  0.64% )  (49.90%)
         8,183,794      L1-dcache-load-misses     #    2.01% of all L1-dcache hits    ( +-  0.43% )  (50.22%)
       407,171,607      L1-dcache-loads                                               ( +-  0.22% )  (45.78%)
       214,583,897      L1-dcache-stores                                              ( +-  0.31% )  (25.80%)
            58,325      L1-icache-load-misses                                         ( +-  3.85% )  (25.72%)
           259,162      r014c                                                         ( +-  2.77% )  (25.31%)
           351,523      r024c                                                         ( +-  4.43% )  (37.40%)

       0.277570860 seconds time elapsed                                          ( +-  1.61% )

看到 r014c 的差異代表有進行SW prefetch

gnuplot

avx 的效能又比 sse prefetch 好因為一次處理的資料量比較大

但有一個重點 avx 所用的 prefetch distance 應該要是 sse 版本的兩倍也就是16才對
這裡比較一些不同的 distance 代入的結果

結果

gnuplot

執行時間還是 distance = 8 時比較好
perf

 Performance counter stats for './avx_prefetch_transpose 8' (20 runs):

         3,867,416      cache-misses              #   80.189 % of all cache refs      ( +-  2.16% )  (44.33%)
         4,822,896      cache-references                                              ( +-  2.20% )  (44.72%)
       684,115,466      cycles                                                        ( +-  2.98% )  (45.18%)
         7,658,761      L1-dcache-load-misses     #    1.90% of all L1-dcache hits    ( +-  1.46% )  (45.68%)
       403,675,699      L1-dcache-loads                                               ( +-  0.92% )  (40.26%)
       209,757,372      L1-dcache-stores                                              ( +-  0.89% )  (22.63%)
            60,160      L1-icache-load-misses                                         ( +-  8.71% )  (22.67%)
           284,360      r014c                                                         ( +-  5.87% )  (22.46%)
           419,681      r024c                                                         ( +- 10.79% )  (33.30%)

       0.285078156 seconds time elapsed                                          ( +-  3.82% )

 Performance counter stats for './avx_prefetch_transpose 16' (20 runs):

         3,839,351      cache-misses              #   58.430 % of all cache refs      ( +-  1.73% )  (44.75%)
         6,570,840      cache-references                                              ( +-  1.35% )  (45.41%)
       644,629,299      cycles                                                        ( +-  0.44% )  (45.90%)
         7,695,401      L1-dcache-load-misses     #    1.86% of all L1-dcache hits    ( +-  1.10% )  (46.16%)
       413,234,940      L1-dcache-loads                                               ( +-  0.37% )  (39.60%)
       213,451,682      L1-dcache-stores                                              ( +-  0.71% )  (22.13%)
            55,960      L1-icache-load-misses                                         ( +- 10.40% )  (22.69%)
           178,769      r014c                                                         ( +-  8.47% )  (22.50%)
           197,597      r024c                                                         ( +-  1.80% )  (33.55%)

       0.262698318 seconds time elapsed

cache misses 確實是 distance = 16 最低但是是因為 cache refs 比別人多了很多所以 miss rate 才低
cycles 數 dis = 16 比 dis = 8 少了 40,000,000 次
r014c r024c cache hit 次數都比 dis = 8 還少

Code 再進化

Use higher level macro to simplify avx code
參考到類似作法還在學習中
Use Stopwatch to record time
將老師推薦的 Stopwatch 結合 clock_gettime 達到精準的時間測試

2017q1 Homework3 (software-pipelining)

論文筆記

前言:

名詞解釋:

SW & HW 的交互作用實驗結果:

實驗結果重點

SW & HW 的背景知識

Data Structure

Prefetch hint

Prefetch 分類 & Timeliness

Prefetch distance

Direct\Indirect index

程式實作 矩陣轉置

開發環境

開發紀錄

第一次執行結果

pref 結果 $ make cache-test

Row Counter

實驗不同 Prefetch Distance

實驗不同 Prefetch Hints

AVX

Code 再進化

程式實作矩陣轉置

pref 結果 `$ make cache-test`