contributed by < Chihsiang >
First, look for the answers to the three questions raised in the paper:
The paper opens with a table mapping data structures to the software/hardware prefetching schemes currently applied to them;
from this table we can see that the widely used Array and Hash structures both have prefetch support.
Next comes the classification of prefetching together with a timeline; I do not fully understand the timeline figure. According to the text:
Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency
I take this to mean that prefetch requests must be issued early enough to hide the memory latency.
Conditions on the prefetch distance
Prefetch-distance insertion algorithm
Advantages of software over hardware prefetching (with example cases)
Drawbacks of software prefetching
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
Stepping:              9
CPU MHz:               1279.431
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              4589.27
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
Flags: fpu vme de pse tsc msr pae mce
cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx rdtscp lm constant_tsc arch_perfmon
pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid
sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer
aes xsave avx f16c rdrand lahf_lm tpr_shadow
vnmi flexpriority ept vpid fsgsbase smep
erms xsaveopt dtherm ida arat pln pts
Analysis of the original program
Execution time
sse prefetch: 54074 us
sse: 107184 us
naive: 224151 us
perf stat
naive: 241652 us
Performance counter stats for './naive_transpose':
18,172,399 cache-misses # 93.396 % of all cache refs (43.87%)
19,457,264 cache-references (43.86%)
20,157,827 L1-dcache-load-misses # 3.77% of all L1-dcache hits (44.24%)
534,896,289 L1-dcache-loads (44.73%)
22,319 L1-dcache-prefetch-misses (23.05%)
4,204,752 L1-dcache-store-misses (22.85%)
32,932 L1-icache-load-misses (33.98%)
282,353,767 branch-instructions (44.91%)
563,433 branch-misses # 0.20% of all branches (44.05%)
0.463622418 seconds time elapsed
sse: 116348 us
Performance counter stats for './sse_transpose':
5,864,063 cache-misses # 82.944 % of all cache refs (45.02%)
7,069,928 cache-references (45.68%)
8,338,277 L1-dcache-load-misses # 1.90% of all L1-dcache hits (46.32%)
439,214,122 L1-dcache-loads (44.80%)
108,383 L1-dcache-prefetch-misses (22.49%)
4,283,935 L1-dcache-store-misses (21.30%)
35,892 L1-icache-load-misses (31.90%)
278,020,930 branch-instructions (42.50%)
571,414 branch-misses # 0.21% of all branches (42.43%)
0.340159594 seconds time elapsed
sse prefetch: 55429 us
Performance counter stats for './sse_prefetch_transpose':
7,835,410 cache-misses # 92.816 % of all cache refs (45.05%)
8,441,831 cache-references (45.39%)
8,624,095 L1-dcache-load-misses # 1.84% of all L1-dcache hits (45.40%)
469,033,285 L1-dcache-loads (42.97%)
17,002 L1-dcache-prefetch-misses (21.98%)
3,474,969 L1-dcache-store-misses (23.26%)
24,527 L1-icache-load-misses (34.40%)
259,320,698 branch-instructions (45.21%)
509,994 branch-misses # 0.20% of all branches (44.58%)
0.293630274 seconds time elapsed
From this analysis, the naive version has both the highest cache-miss ratio and count and the longest run time; the SSE version without prefetch is also worse than the one with prefetch, and the prefetch version takes only about half the time of the non-prefetch one.
perf raw counter
Following kaizsv's notes, first find the mask number and event number for my own CPU architecture.
Looking up my CPU model, the Intel Core i7-3615QM belongs to the 3rd generation.
Consult the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B for the relevant raw-counter event encodings.
perf raw counter example
Example:
If the Intel docs for a QM720 Core i7 describe an event as:
Event Umask Event Mask
Num. Value Mnemonic Description Comment
A8H 01H LSD.UOPS Counts the number of micro-ops Use cmask=1 and
delivered by loop stream detector invert to count
cycles
raw encoding of 0x1A8 can be used:
perf stat -e r1a8 -a sleep 1
perf record -e r1a8 ...
To use this, first look up the event to be measured, then combine Umask value + Event num = 01 + A8.
Looking up SW prefetch in the manual:
| Event | Umask | Event Mask Mnemonic | Description |
|---|---|---|---|
| 4CH | 01H | LOAD_HIT_PRE.SW_PF | Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch. |
| 4CH | 02H | LOAD_HIT_PRE.HW_PF | Non-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch. |
perf stat -e r014C ./naive
naive: 224461 us
Performance counter stats for './naive_transpose':
402 r014C
0.450811263 seconds time elapsed
perf stat -e r014C ./sse
sse: 107869 us
Performance counter stats for './sse_transpose':
339 r014C
0.346157841 seconds time elapsed
perf stat -e r014C ./sse_prefetch
sse prefetch: 55197 us
Performance counter stats for './sse_prefetch_transpose':
1,798,956 r014C
0.279139593 seconds time elapsed
These raw counters give the occurrence counts of much finer-grained events; since I do not yet fully understand every event, I am not yet sure what kinds of performance analysis each count can support.
AVX Transpose
Following the SSE approach, implement a transpose that works on 8*8 blocks.
void avx_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 8) {
for (int y = 0; y < h; y += 8) {
__m256 r0, r1, r2, r3, r4, r5, r6, r7;
__m256 t0, t1, t2, t3, t4, t5, t6, t7;
r0 = _mm256_loadu_ps((float *)(src + (y + 0) * w + x));
r1 = _mm256_loadu_ps((float *)(src + (y + 1) * w + x));
r2 = _mm256_loadu_ps((float *)(src + (y + 2) * w + x));
r3 = _mm256_loadu_ps((float *)(src + (y + 3) * w + x));
r4 = _mm256_loadu_ps((float *)(src + (y + 4) * w + x));
r5 = _mm256_loadu_ps((float *)(src + (y + 5) * w + x));
r6 = _mm256_loadu_ps((float *)(src + (y + 6) * w + x));
r7 = _mm256_loadu_ps((float *)(src + (y + 7) * w + x));
t0 = _mm256_unpacklo_ps(r0, r1);
t1 = _mm256_unpackhi_ps(r0, r1);
t2 = _mm256_unpacklo_ps(r2, r3);
t3 = _mm256_unpackhi_ps(r2, r3);
t4 = _mm256_unpacklo_ps(r4, r5);
t5 = _mm256_unpackhi_ps(r4, r5);
t6 = _mm256_unpacklo_ps(r6, r7);
t7 = _mm256_unpackhi_ps(r6, r7);
r0 = _mm256_shuffle_ps(t0, t2, 0x44);
r1 = _mm256_shuffle_ps(t0, t2, 0xEE);
r2 = _mm256_shuffle_ps(t1, t3, 0x44);
r3 = _mm256_shuffle_ps(t1, t3, 0xEE);
r4 = _mm256_shuffle_ps(t4, t6, 0x44);
r5 = _mm256_shuffle_ps(t4, t6, 0xEE);
r6 = _mm256_shuffle_ps(t5, t7, 0x44);
r7 = _mm256_shuffle_ps(t5, t7, 0xEE);
t0 = _mm256_permute2f128_ps(r0, r4, 0x20);
t1 = _mm256_permute2f128_ps(r1, r5, 0x20);
t2 = _mm256_permute2f128_ps(r2, r6, 0x20);
t3 = _mm256_permute2f128_ps(r3, r7, 0x20);
t4 = _mm256_permute2f128_ps(r0, r4, 0x31);
t5 = _mm256_permute2f128_ps(r1, r5, 0x31);
t6 = _mm256_permute2f128_ps(r2, r6, 0x31);
t7 = _mm256_permute2f128_ps(r3, r7, 0x31);
_mm256_storeu_ps((float *)(dst + ((x + 0) * h) + y), t0);
_mm256_storeu_ps((float *)(dst + ((x + 1) * h) + y), t1);
_mm256_storeu_ps((float *)(dst + ((x + 2) * h) + y), t2);
_mm256_storeu_ps((float *)(dst + ((x + 3) * h) + y), t3);
_mm256_storeu_ps((float *)(dst + ((x + 4) * h) + y), t4);
_mm256_storeu_ps((float *)(dst + ((x + 5) * h) + y), t5);
_mm256_storeu_ps((float *)(dst + ((x + 6) * h) + y), t6);
_mm256_storeu_ps((float *)(dst + ((x + 7) * h) + y), t7);
}
}
}
naive: 231006 us
sse: 114900 us
avx: 65514 us
sse prefetch: 54519 us
From the timings, AVX processes twice as many elements per iteration as SSE, so it runs almost twice as fast.
Synopsis
void _mm_prefetch (char const* p, int i)
#include <xmmintrin.h>
Instruction: prefetchnta mprefetch
             prefetcht0 mprefetch
             prefetcht1 mprefetch
             prefetcht2 mprefetch
CPUID Flags: SSE
Description
Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
SSE
sse: 114395 us
Performance counter stats for './sse_transpose':
6,613,088 cache-misses # 81.627 % of all cache refs (44.81%)
8,101,572 cache-references (44.81%)
8,963,416 L1-dcache-load-misses # 2.03% of all L1-dcache hits (44.80%)
441,279,258 L1-dcache-loads (42.83%)
19,481 L1-dcache-prefetch-misses (22.27%)
4,194,679 L1-dcache-store-misses (23.04%)
29,409 L1-icache-load-misses (34.18%)
279,189,951 branch-instructions (45.05%)
555,463 branch-misses # 0.20% of all branches (44.55%)
0.362906873 seconds time elapsed
AVX
avx: 64653 us
Performance counter stats for './avx_transpose':
5,649,660 cache-misses # 80.898 % of all cache refs (45.06%)
6,983,688 cache-references (45.80%)
7,684,501 L1-dcache-load-misses # 1.86% of all L1-dcache hits (45.84%)
412,369,472 L1-dcache-loads (43.41%)
23,589 L1-dcache-prefetch-misses (21.68%)
4,697,455 L1-dcache-store-misses (23.25%)
26,099 L1-icache-load-misses (34.38%)
248,622,769 branch-instructions (45.20%)
518,603 branch-misses # 0.21% of all branches (44.57%)
0.295939411 seconds time elapsed
In the analysis above, what caught my attention most is L1-dcache-prefetch-misses and L1-dcache-store-misses: the AVX version actually produces more of these misses.
SSE prefetch(D = 8)
sse prefetch: 54383 us
Performance counter stats for './sse_prefetch_transpose':
6,208,199 cache-misses # 98.981 % of all cache refs (43.46%)
6,272,119 cache-references (43.45%)
6,500,214 L1-dcache-load-misses # 1.46% of all L1-dcache hits (44.17%)
444,322,322 L1-dcache-loads (44.87%)
24,325 L1-dcache-prefetch-misses (23.58%)
4,606,491 L1-dcache-store-misses (23.23%)
26,308 L1-icache-load-misses (34.35%)
254,746,423 branch-instructions (45.06%)
542,430 branch-misses # 0.21% of all branches (43.65%)
0.283370894 seconds time elapsed
AVX prefetch(D = 8)
avx prefetch: 58723 us
Performance counter stats for './avx_prefetch_transpose':
6,158,483 cache-misses # 86.765 % of all cache refs (44.96%)
7,097,877 cache-references (44.94%)
7,500,835 L1-dcache-load-misses # 1.76% of all L1-dcache hits (44.94%)
425,937,839 L1-dcache-loads (42.50%)
23,246 L1-dcache-prefetch-misses (22.29%)
4,833,529 L1-dcache-store-misses (23.26%)
29,966 L1-icache-load-misses (34.40%)
247,753,365 branch-instructions (45.21%)
515,668 branch-misses # 0.21% of all branches (44.58%)
0.291183267 seconds time elapsed
The data does not yet cover the full range of prefetch distances D or prefetch counts; I will add those comparisons later.
Memory Latency
Following classmate carolc0708's notes, I also wanted to measure my own machine's memory latency.
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
Memory node
Socket 0
0 60.6
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 22924.6
3:1 Reads-Writes : 21408.4
2:1 Reads-Writes : 21079.7
1:1 Reads-Writes : 20016.7
Stream-triad like: 20424.8
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node
Socket 0
0 22919.5
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 135.19 23058.9
00002 134.82 23053.4
00008 133.35 23043.2
00015 132.27 23037.6
00050 126.29 22870.4
00100 82.99 20031.6
00200 64.48 11700.4
00300 61.22 8385.7
00400 61.37 6637.3
00500 60.76 5573.2
00700 60.60 4321.9
01000 61.27 3353.8
01300 61.46 2828.7
01700 61.51 2413.9
02500 61.51 1978.3
03500 61.88 1682.8
05000 61.84 1492.8
09000 61.96 1285.1
20000 61.83 1150.6
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 19.1
Local Socket L2->L2 HITM latency 22.8
Running the mlc test again while avx prefetch is executing:
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
Memory node
Socket 0
0 60.8
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 22224.3
3:1 Reads-Writes : 20991.3
2:1 Reads-Writes : 20937.0
1:1 Reads-Writes : 19920.7
Stream-triad like: 20050.8
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node
Socket 0
0 22883.8
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 141.15 22473.7
00002 136.83 22856.0
00008 135.33 22841.5
00015 136.96 22582.6
00050 131.97 22277.7
00100 86.87 19832.6
00200 67.08 11648.3
00300 61.98 8370.2
00400 62.60 6614.3
00500 62.30 5439.4
00700 70.03 3344.9
01000 70.56 2536.9
01300 62.81 2642.3
01700 63.33 2192.8
02500 62.48 1885.3
03500 61.62 1704.1
05000 62.17 1473.5
09000 82.62 917.5
20000 64.81 1086.5
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 21.0
Local Socket L2->L2 HITM latency 23.2
The numbers really do change, but I am still working out how to control the test conditions precisely.