contributed by <baimao8437
>
本論文研究方向有三:
Prefetch
在 cpu 要運算之前把資料從主記憶體預先載入 cache,降低 cpu 等待時間進而提升效率.
載入至主記憶體的時機不能太晚(cpu 已經運算完畢不再需要這些資料)也不能太早(有可能在使用之前就被踢出快取記憶體),所以計算 prefetch distance 很重要。
Hardware prefetch
是用過去的存取紀錄作為 prefetch 的依據,所以需要 training。
e.g. GHB:Global History Buffer
Software prefetch
人為或 compiler 在適當的地方加入 prefetching intrinsics。
但 ICC or GCC 中只有簡單的 SW 指令,所以比較仰賴人為加入。
e.g. SSE 的 __mm_prefetch()
而對於手動加入指令有兩個最大的問題:
可以看到左邊 Positive 區的實驗結果顯示在這些 benchmark 的情況下,SW+HW 會比 HW 來得更有效率。
從實驗中,產生下列主要的三個問題:
此表分析了不同 data structure 的性質及當時已經有提出的 HW SW prefetch 方式
Prefetch 希望能夠 timely,所以要設定prefetch distance,要大到足夠隱藏 latency。
D
is called the prefetch distancel
is the prefetch latencys
is the length of the shortest path through the loop bodyDirect: 是直接可以從指令中知道要存取的記憶體位址,可以透過 hareward prefetch.
Indirect: 就像這行程式碼 x = a[b[i]],需要先計算b[i]的值才知道要存取a[]中的哪個位置,使用 software prefetch 比較容易達成.
baimao@baimao-Aspire-V5-573G:~$ lscpu
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
每核心執行緒數:2
每通訊端核心數:2
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 69
Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
製程: 1
CPU MHz: 1711.242
CPU max MHz: 2600.0000
CPU min MHz: 800.0000
BogoMIPS: 4589.38
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 3072K
NUMA node0 CPU(s): 0-3
CPU:Intel Colre i5-4200U 資訊
Architecture / Microarchitecture
Microarchitecture Haswell
Platform Shark Bay
Processor core Haswell
Core stepping C0 (SR170)
Manufacturing process 0.022 micron
Data width 64 bit
The number of CPU cores 2
The number of threads 4
Floating Point Unit Integrated
Level 1 cache size 2 x 32 KB 8-way set associative instruction caches
2 x 32 KB 8-way set associative data caches
Level 2 cache size 2 x 256 KB 8-way set associative caches
Level 3 cache size 3 MB 12-way set associative shared cache
Physical memory 16 GB
Multiprocessing Uniprocessor
Features MMX instructions
SSE / Streaming SIMD Extensions
SSE2 / Streaming SIMD Extensions 2
SSE3 / Streaming SIMD Extensions 3
SSSE3 / Supplemental Streaming SIMD Extensions 3
SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4
AES / Advanced Encryption Standard instructions
AVX / Advanced Vector Extensions
AVX2 / Advanced Vector Extensions 2.0
BMI / BMI1 + BMI2 / Bit Manipulation instructions
F16C / 16-bit Floating-Point conversion instructions
FMA3 / 3-operand Fused Multiply-Add instructions
EM64T / Extended Memory 64 technology / Intel 64
NX / XD / Execute disable bit
HT / Hyper-Threading technology
TBT 2.0 / Turbo Boost technology 2.0
VT-x / Virtualization technology
Low power features Enhanced SpeedStep technology
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 70873 us
sse: 144184 us
naive: 244073 us
make plot
這是100次取平均值 naive 的時間不穩定 所以平均的誤差大
但可以看到 sse + prefetch 的時間最短
接著改寫makefile
、main
將三種方法分開進行 perf 測試$ make cache-test
perf 所設定的參數[1]:
cache-misses
cache-references
L1-dcache-load-misses
L1-dcache-loads
L1-dcache-stores
L1-icache-load-misses
makefile:
EXEC = \
sse_prefetch_transpose\
sse_transpose\
naive_transpose
cache-test: $(EXEC)
for method in $(EXEC);do \
perf stat --repeat 100 \
-e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses \
./$$method;\
done
這裡有個困惑 當我嘗試將 -e 後面的參數斷行
-e cache-misses,cache-references, \
L1-dcache-load-misses,L1-dcache-loads, \
L1-dcache-stores,L1-icache-load-misses \
執行時會發生錯誤:
event syntax error: 'cache-misses,cache-references,'
___ parser error
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
-e, –event <event> event selector. use 'perf list' to list available events
我改成
-e cache-misses,cache-references \
-e L1-dcache-load-misses,L1-dcache-loads \
-e L1-dcache-stores,L1-icache-load-misses
就沒有出現錯誤訊息了 claaaaassic歐歐~ 那我.. 還是一行好了orzbaimao8437
$ make cache-test
Performance counter stats for './sse_prefetch_transpose' (100 runs):
4,759,440 cache-misses # 85.111 % of all cache refs ( +- 0.69% ) (32.89%)
5,592,027 cache-references ( +- 0.49% ) (50.39%)
8,481,323 L1-dcache-load-misses # 1.83% of all L1-dcache hits ( +- 0.26% ) (67.13%)
464,186,704 L1-dcache-loads ( +- 0.07% ) (65.37%)
234,986,970 L1-dcache-stores ( +- 0.10% ) (61.59%)
55,906 L1-icache-load-misses ( +- 3.66% ) (32.76%)
0.268066902 seconds time elapsed ( +- 0.64% )
Performance counter stats for './sse_transpose' (100 runs):
4,807,128 cache-misses # 86.253 % of all cache refs ( +- 0.64% ) (33.36%)
5,573,299 cache-references ( +- 0.31% ) (50.49%)
8,545,880 L1-dcache-load-misses # 1.93% of all L1-dcache hits ( +- 0.15% ) (67.18%)
441,706,999 L1-dcache-loads ( +- 0.10% ) (65.64%)
231,528,304 L1-dcache-stores ( +- 0.12% ) (62.41%)
64,512 L1-icache-load-misses ( +- 3.90% ) (32.84%)
0.341006236 seconds time elapsed ( +- 1.00% )
Performance counter stats for './naive_transpose' (100 runs):
17,137,199 cache-misses # 94.603 % of all cache refs ( +- 0.28% ) (33.27%)
18,114,904 cache-references ( +- 0.19% ) (50.21%)
21,069,839 L1-dcache-load-misses # 3.82% of all L1-dcache hits ( +- 0.11% ) (66.89%)
551,000,987 L1-dcache-loads ( +- 0.07% ) (65.83%)
213,504,897 L1-dcache-stores ( +- 0.16% ) (63.62%)
71,981 L1-icache-load-misses ( +- 4.93% ) (33.07%)
0.467403806 seconds time elapsed ( +- 1.33% )
可以看到 cache-misses 確實是 sse prefetch 最低
因為 prefetch 先將所需要的資料載入快取記憶體降低了 cache-misses,低的 miss rate 讓他的執行速度比較快。
參考kaizsv 的共筆在 intel 文件中找 4th Generation Haswell microarchitecture
察看下列項目:
結果:
Performance counter stats for './sse_prefetch_transpose' (100 runs):
4,911,498 cache-misses # 84.564 % of all cache refs ( +- 0.78% ) (49.91%)
5,808,046 cache-references ( +- 0.49% ) (50.22%)
8,670,151 L1-dcache-load-misses # 1.88% of all L1-dcache hits ( +- 0.39% ) (50.51%)
460,232,822 L1-dcache-loads ( +- 0.14% ) (45.85%)
237,557,467 L1-dcache-stores ( +- 0.22% ) (25.48%)
54,105 L1-icache-load-misses ( +- 3.69% ) (25.65%)
499,850 r014c ( +- 1.84% ) (25.31%)
272,620 r024c ( +- 5.06% ) (37.61%)
0.277024409 seconds time elapsed ( +- 1.18% )
Performance counter stats for './sse_transpose' (100 runs):
4,743,884 cache-misses # 85.357 % of all cache refs ( +- 0.39% ) (49.71%)
5,557,698 cache-references ( +- 0.28% ) (49.95%)
8,494,094 L1-dcache-load-misses # 1.99% of all L1-dcache hits ( +- 0.23% ) (50.33%)
426,955,982 L1-dcache-loads ( +- 0.15% ) (46.63%)
236,397,375 L1-dcache-stores ( +- 0.26% ) (25.85%)
60,052 L1-icache-load-misses ( +- 3.60% ) (25.54%)
464 r014c ( +- 4.00% ) (25.23%)
312,872 r024c ( +- 4.40% ) (37.39%)
0.326735088 seconds time elapsed ( +- 0.44% )
Performance counter stats for './naive_transpose' (100 runs):
16,888,393 cache-misses # 92.475 % of all cache refs ( +- 0.25% ) (49.93%)
18,262,729 cache-references ( +- 0.20% ) (50.39%)
21,341,610 L1-dcache-load-misses # 4.01% of all L1-dcache hits ( +- 0.17% ) (50.91%)
532,720,948 L1-dcache-loads ( +- 0.10% ) (48.00%)
228,505,453 L1-dcache-stores ( +- 0.23% ) (25.06%)
64,509 L1-icache-load-misses ( +- 3.95% ) (25.08%)
496 r014c ( +- 5.12% ) (24.98%)
249,019 r024c ( +- 4.90% ) (37.44%)
0.437171123 seconds time elapsed ( +- 0.56% )
sse prefetch 的 r014c 明顯比其他大 代表真的有執行 SW prefetch
論文中提到 Prefetch Distance 會影響 Prefetch 的效率
在 impl.c
中的 PFDIST
就是 distance
將其用 0,2,4,7,8,9,10,12,16 代入 因為預設8 固檢查8附近的值
$ make plot DIS=1
結果符合理論 distance 過猶不及 這邊 8 代入的效率最好
如同前方表格顯示共有4種hints 將其分別代入比較(預設是 T1)
Performance counter stats for './sse_prefetch_transpose' (100 runs):
4,667,916 cache-misses # 82.749 % of all cache refs ( +- 0.57% ) (49.02%)
5,641,054 cache-references ( +- 0.44% ) (49.48%)
8,916,708 L1-dcache-load-misses # 1.93% of all L1-dcache hits ( +- 0.40% ) (50.67%)
463,153,960 L1-dcache-loads ( +- 0.18% ) (46.33%)
235,037,792 L1-dcache-stores ( +- 0.25% ) (25.85%)
50,785 L1-icache-load-misses ( +- 4.21% ) (25.70%)
1,237,631 r014c ( +- 1.73% ) (24.94%)
403,505 r024c ( +- 3.30% ) (36.74%)
0.256539780 seconds time elapsed ( +- 0.57% )
Performance counter stats for './sse_prefetch_transpose' (100 runs):
4,906,154 cache-misses # 82.649 % of all cache refs ( +- 0.20% ) (49.38%)
5,936,140 cache-references ( +- 0.14% ) (49.40%)
8,625,730 L1-dcache-load-misses # 1.84% of all L1-dcache hits ( +- 0.27% ) (49.46%)
467,546,983 L1-dcache-loads ( +- 0.11% ) (44.87%)
232,786,099 L1-dcache-stores ( +- 0.17% ) (26.09%)
37,113 L1-icache-load-misses ( +- 1.55% ) (26.02%)
441,530 r014c ( +- 1.60% ) (25.60%)
428,107 r024c ( +- 2.72% ) (37.41%)
0.253216818 seconds time elapsed ( +- 0.10% )
Performance counter stats for './sse_prefetch_transpose' (100 runs):
4,931,244 cache-misses # 83.404 % of all cache refs ( +- 0.71% ) (49.53%)
5,912,467 cache-references ( +- 0.53% ) (49.67%)
8,578,641 L1-dcache-load-misses # 1.84% of all L1-dcache hits ( +- 0.39% ) (49.89%)
466,874,784 L1-dcache-loads ( +- 0.17% ) (45.45%)
234,373,157 L1-dcache-stores ( +- 0.22% ) (26.58%)
45,452 L1-icache-load-misses ( +- 6.35% ) (25.87%)
435,123 r014c ( +- 1.95% ) (25.43%)
408,551 r024c ( +- 3.97% ) (37.36%)
0.264967830 seconds time elapsed ( +- 1.18% )
Performance counter stats for './sse_prefetch_transpose' (100 runs):
4,848,945 cache-misses # 82.589 % of all cache refs ( +- 0.41% ) (49.29%)
5,871,166 cache-references ( +- 0.36% ) (49.36%)
8,568,859 L1-dcache-load-misses # 1.83% of all L1-dcache hits ( +- 0.27% ) (49.50%)
467,023,161 L1-dcache-loads ( +- 0.12% ) (45.17%)
232,552,596 L1-dcache-stores ( +- 0.16% ) (26.36%)
35,771 L1-icache-load-misses ( +- 2.76% ) (25.96%)
451,018 r014c ( +- 1.83% ) (25.57%)
428,982 r024c ( +- 2.64% ) (37.33%)
0.253415363 seconds time elapsed ( +- 0.18% )
首先先從了解 SSE 的運作方式[2] 然後將原本的 4*4 matrix 改為 8*8 matrix
code的部份也照著 SSE 的步驟改為 AVX 但結果並不正確:
I0 -> 0 8 16 24 | 4 12 20 28
I1 -> 1 9 17 25 | 5 13 21 29
I2 -> 2 10 18 26 | 6 14 22 30
I3 -> 3 11 19 27 | 7 15 23 31
-----------------------------
I4 -> 32 40 48 56 | 36 44 52 60
I5 -> 33 41 49 57 | 37 45 53 61
I6 -> 34 42 50 58 | 38 46 54 62
I7 -> 35 43 51 59 | 39 47 55 63
左下和右上的角應該要交換才對
最後參考周曠宇的共筆 加入這幾行code完成交換
T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
T1 = _mm256_permute2x128_si256(I1, I5, 0x20);
T2 = _mm256_permute2x128_si256(I2, I6, 0x20);
T3 = _mm256_permute2x128_si256(I3, I7, 0x20);
T4 = _mm256_permute2x128_si256(I0, I4, 0x31);
T5 = _mm256_permute2x128_si256(I1, I5, 0x31);
T6 = _mm256_permute2x128_si256(I2, I6, 0x31);
T7 = _mm256_permute2x128_si256(I3, I7, 0x31);
但這個指令我還在研究原理為何
試著加入 prefetch 但是 AVX 並沒有支援 prefetch 要到 AVX-512才有支援
所以先用 SSE 的 prefetch 這裡先預設 distance = 8, Hints = T1
- Performance counter stats for './avx_transpose' (100 runs):
3,612,052 cache-misses # 77.914 % of all cache refs ( +- 0.36% ) (50.22%)
4,635,946 cache-references ( +- 0.29% ) (50.45%)
8,223,505 L1-dcache-load-misses # 2.11% of all L1-dcache hits ( +- 0.19% ) (50.50%)
390,434,639 L1-dcache-loads ( +- 0.07% ) (45.13%)
216,084,485 L1-dcache-stores ( +- 0.14% ) (25.38%)
36,183 L1-icache-load-misses ( +- 1.32% ) (25.96%)
359 r014c ( +- 5.20% ) (25.59%)
208,075 r024c ( +- 4.01% ) (37.76%)
0.259036199 seconds time elapsed ( +- 0.15% )
Performance counter stats for './avx_prefetch_transpose' (100 runs):
3,870,080 cache-misses # 75.536 % of all cache refs ( +- 1.15% ) (49.67%)
5,123,514 cache-references ( +- 0.64% ) (49.90%)
8,183,794 L1-dcache-load-misses # 2.01% of all L1-dcache hits ( +- 0.43% ) (50.22%)
407,171,607 L1-dcache-loads ( +- 0.22% ) (45.78%)
214,583,897 L1-dcache-stores ( +- 0.31% ) (25.80%)
58,325 L1-icache-load-misses ( +- 3.85% ) (25.72%)
259,162 r014c ( +- 2.77% ) (25.31%)
351,523 r024c ( +- 4.43% ) (37.40%)
0.277570860 seconds time elapsed ( +- 1.61% )
但有一個重點 avx 所用的 prefetch distance 應該要是 sse 版本的兩倍 也就是16才對
這裡比較一些不同的 distance 代入的結果
Performance counter stats for './avx_prefetch_transpose 8' (20 runs):
3,867,416 cache-misses # 80.189 % of all cache refs ( +- 2.16% ) (44.33%)
4,822,896 cache-references ( +- 2.20% ) (44.72%)
684,115,466 cycles ( +- 2.98% ) (45.18%)
7,658,761 L1-dcache-load-misses # 1.90% of all L1-dcache hits ( +- 1.46% ) (45.68%)
403,675,699 L1-dcache-loads ( +- 0.92% ) (40.26%)
209,757,372 L1-dcache-stores ( +- 0.89% ) (22.63%)
60,160 L1-icache-load-misses ( +- 8.71% ) (22.67%)
284,360 r014c ( +- 5.87% ) (22.46%)
419,681 r024c ( +- 10.79% ) (33.30%)
0.285078156 seconds time elapsed ( +- 3.82% )
Performance counter stats for './avx_prefetch_transpose 16' (20 runs):
3,839,351 cache-misses # 58.430 % of all cache refs ( +- 1.73% ) (44.75%)
6,570,840 cache-references ( +- 1.35% ) (45.41%)
644,629,299 cycles ( +- 0.44% ) (45.90%)
7,695,401 L1-dcache-load-misses # 1.86% of all L1-dcache hits ( +- 1.10% ) (46.16%)
413,234,940 L1-dcache-loads ( +- 0.37% ) (39.60%)
213,451,682 L1-dcache-stores ( +- 0.71% ) (22.13%)
55,960 L1-icache-load-misses ( +- 10.40% ) (22.69%)
178,769 r014c ( +- 8.47% ) (22.50%)
197,597 r024c ( +- 1.80% ) (33.55%)
0.262698318 seconds time elapsed