contributed by <zhanyangch>
ZixinYang
zhanyangch
sysprog2017
week3
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
Stepping: 7
CPU MHz: 855.421
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5587.06
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
Modify the Makefile and impl.c so that the macro passed with -D decides which implementation is compiled.
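One way to arrange this, as a minimal sketch: the Makefile passes a different -D flag for each binary, and impl.c compiles only the matching body. The macro names NAIVE and SSE, the shared function name transpose, and the fallback to NAIVE when no flag is given are assumptions made for illustration, not the names used in the actual repository.

```c
#include <assert.h>
#if defined(SSE)
#include <emmintrin.h> /* SSE2 intrinsics, only needed for the SSE build */
#endif

/* Hypothetical macro names. A Makefile rule might look like:
 *   $(CC) $(CFLAGS) -DSSE -o sse_transpose main.c
 * Fall back to NAIVE when no -D flag is given. */
#if !defined(NAIVE) && !defined(SSE)
#define NAIVE
#endif

#if defined(NAIVE)
/* Element-by-element transpose of a matrix with h rows and w columns. */
void transpose(const int *src, int *dst, int w, int h)
{
    assert(src && dst && w > 0 && h > 0);
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            dst[x * h + y] = src[y * w + x];
}
#elif defined(SSE)
/* 4x4 block transpose; assumes w and h are multiples of 4. */
void transpose(const int *src, int *dst, int w, int h)
{
    assert(src && dst && w % 4 == 0 && h % 4 == 0);
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            /* Load four rows of four ints each. */
            __m128i I0 = _mm_loadu_si128((const __m128i *) (src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((const __m128i *) (src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((const __m128i *) (src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((const __m128i *) (src + (y + 3) * w + x));
            /* Interleave 32-bit elements, then 64-bit halves. */
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            /* Store the four transposed rows. */
            _mm_storeu_si128((__m128i *) (dst + (x + 0) * h + y), I0);
            _mm_storeu_si128((__m128i *) (dst + (x + 1) * h + y), I1);
            _mm_storeu_si128((__m128i *) (dst + (x + 2) * h + y), I2);
            _mm_storeu_si128((__m128i *) (dst + (x + 3) * h + y), I3);
        }
    }
}
#endif
```

With this layout the caller always invokes transpose, and a third variant such as a prefetching one could be added as another #elif branch under its own macro.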
For example, `__m128i T0 = _mm_unpacklo_epi32(I0, I1);` interleaves the two low 32-bit elements of I0 and I1:
__m128i | d0 | d1 | d2 | d3 |
---|---|---|---|---|
I0 | i00 | i01 | i02 | i03 |
I1 | i10 | i11 | i12 | i13 |
T0 | i00 | i10 | i01 | i11 |
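The table can be reproduced with a small helper, shown here as a sketch. The function name unpacklo_epi32_demo is hypothetical; the scalar branch is only a fallback model for compilers that do not define `__SSE2__`.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Writes { a[0], b[0], a[1], b[1] } to out, i.e. the T0 row of the table. */
void unpacklo_epi32_demo(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    assert(a && b && out);
#if defined(__SSE2__)
    __m128i I0 = _mm_loadu_si128((const __m128i *) a);
    __m128i I1 = _mm_loadu_si128((const __m128i *) b);
    __m128i T0 = _mm_unpacklo_epi32(I0, I1); /* interleave the low halves */
    _mm_storeu_si128((__m128i *) out, T0);
#else
    /* Scalar model of the same interleaving. */
    out[0] = a[0];
    out[1] = b[0];
    out[2] = a[1];
    out[3] = b[1];
#endif
}
```

Calling it with a = {0, 1, 2, 3} and b = {10, 11, 12, 13} fills out with {0, 10, 1, 11}; the high elements of both inputs are ignored.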
Performance counter stats for './naive_transpose' (5 runs):
18,321,550 cache-misses # 91.193 % of all cache refs ( +- 0.03% )
20,090,955 cache-references ( +- 0.01% )
1,448,992,488 instructions # 1.04 insns per cycle ( +- 0.01% )
1,388,146,906 cycles ( +- 1.86% )
0.476780726 seconds time elapsed ( +- 3.28% )
Performance counter stats for './sse_transpose' (5 runs):
6,044,604 cache-misses # 80.660 % of all cache refs ( +- 0.14% )
7,493,943 cache-references ( +- 0.06% )
1,237,151,919 instructions # 1.33 insns per cycle ( +- 0.02% )
933,002,791 cycles ( +- 1.95% )
0.349286181 seconds time elapsed ( +- 8.65% )
Performance counter stats for './sse_prefetch_transpose' (5 runs):
6,005,813 cache-misses # 80.029 % of all cache refs ( +- 0.10% )
7,504,588 cache-references ( +- 0.12% )
1,283,017,647 instructions # 1.73 insns per cycle ( +- 0.00% )
741,071,615 cycles ( +- 0.37% )
0.293647190 seconds time elapsed ( +- 6.50% )
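Surprisingly, with PFDIST set to 8 the cache-miss counts of sse and sse+prefetch are almost the same. According to the paper, the prefetch distance must be large enough to cover the memory latency.
Varying PFDIST shows turning points at PFDIST = 16 and PFDIST = 112, with little difference in between. When PFDIST < 16, the prefetch distance is shorter than the memory latency; when PFDIST > 112, data is prefetched too early, which raises the cache-miss count.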
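Where PFDIST enters the loop can be sketched on a scalar transpose. This is not the sse_prefetch_transpose code measured above: the function name prefetch_transpose is hypothetical, the prefetch uses the GCC/Clang builtin `__builtin_prefetch` rather than an SSE intrinsic, and the bounds check is an addition that keeps the prefetch address inside the array.

```c
#include <assert.h>

#ifndef PFDIST
#define PFDIST 8 /* prefetch distance in rows; override with -DPFDIST=n */
#endif

/* Naive transpose that requests the row PFDIST ahead of the one being read. */
void prefetch_transpose(const int *src, int *dst, int w, int h)
{
    assert(src && dst && w > 0 && h > 0);
    for (int x = 0; x < w; x++) {
        for (int y = 0; y < h; y++) {
#if defined(__GNUC__)
            /* Only prefetch rows that exist, so the address stays valid. */
            if (y + PFDIST < h)
                __builtin_prefetch(&src[(y + PFDIST) * w + x],
                                   0 /* prefetch for read */,
                                   2 /* moderate temporal locality */);
#endif
            dst[x * h + y] = src[y * w + x];
        }
    }
}
```

Rebuilding with different values, for example -DPFDIST=16 or -DPFDIST=112, is one way to repeat the sweep described above: too small a distance leaves the load waiting on memory, while too large a distance risks the line being evicted before it is used.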
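Raw counters give more detailed data. Following illusion030's shared notes, look up the PERFORMANCE-MONITORING EVENTS for the matching processor model in the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B. The entries below are taken from Table 19.5.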
Event Num. | Umask Value | Event Mask Mnemonic |
---|---|---|
4CH | 01H | LOAD_HIT_PRE.SW_PF |
4CH | 02H | LOAD_HIT_PRE.HW_PF |
D1H | 01H | MEM_LOAD_UOPS_RETIRED.L1_HIT |
D1H | 02H | MEM_LOAD_UOPS_RETIRED.L2_HIT |
D1H | 04H | MEM_LOAD_UOPS_RETIRED.LLC_HIT |
D1H | 08H | MEM_LOAD_UOPS_RETIRED.L1_MISS |
D1H | 10H | MEM_LOAD_UOPS_RETIRED.L2_MISS |
D1H | 20H | MEM_LOAD_UOPS_RETIRED.LLC_MISS |
Command format:
perf stat -e r<Umask Value><Event Num>
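The raw event name is the letter r followed by the umask and then the event number, each written as two lowercase hex digits. A small helper makes the encoding explicit; the function name perf_raw_event is hypothetical and is not part of perf.

```c
#include <assert.h>
#include <stdio.h>

/* Build the perf raw-event name r<umask><event>, two hex digits each.
 * Returns the number of characters written, as snprintf does. */
int perf_raw_event(unsigned umask, unsigned event, char *buf, size_t len)
{
    assert(buf && umask <= 0xff && event <= 0xff);
    return snprintf(buf, len, "r%02x%02x", umask, event);
}
```

For LOAD_HIT_PRE.SW_PF (event 4CH, umask 01H) this produces r014c, which can then be passed as `perf stat -e r014c ./sse_prefetch_transpose`.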
Performance counter stats for './naive_transpose' (10 runs):
4,380 r014c LOAD_HIT_PRE.SW_PF ( +- 20.20% ) (24.90%)
416,936 r024c LOAD_HIT_PRE.HW_PF ( +- 27.46% ) (25.69%)
521,970,614 r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 0.87% ) (26.13%)
55,233 r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 58.72% ) (25.76%)
457,605 r04d1 MEM_LOAD_UOPS_RETIRED.LLC_HIT ( +- 17.80% ) (25.39%)
16,221,927 r08d1 MEM_LOAD_UOPS_RETIRED.L1_MISS ( +- 0.90% ) (25.18%)
16,138,041 r10d1 MEM_LOAD_UOPS_RETIRED.L2_MISS ( +- 1.07% ) (25.06%)
3,110 r20d1 MEM_LOAD_UOPS_RETIRED.LLC_MISS ( +- 34.97% ) (24.78%)
0.485324876 seconds time elapsed ( +- 2.56% )
Performance counter stats for './sse_transpose' (10 runs):
11,687 r014c LOAD_HIT_PRE.SW_PF ( +- 67.66% ) (24.64%)
613,046 r024c LOAD_HIT_PRE.HW_PF ( +- 27.57% ) (25.75%)
427,907,439 r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 2.01% ) (26.56%)
32,728 r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 53.98% ) (26.24%)
105,153 r04d1 MEM_LOAD_UOPS_RETIRED.LLC_HIT ( +- 16.43% ) (25.64%)
3,968,212 r08d1 MEM_LOAD_UOPS_RETIRED.L1_MISS ( +- 3.31% ) (25.45%)
4,018,123 r10d1 MEM_LOAD_UOPS_RETIRED.L2_MISS ( +- 3.23% ) (25.23%)
2,194 r20d1 MEM_LOAD_UOPS_RETIRED.LLC_MISS ( +- 43.05% ) (24.75%)
0.314186726 seconds time elapsed ( +- 4.80% )
Performance counter stats for './sse_prefetch_transpose' (10 runs):
73,962 r014c LOAD_HIT_PRE.SW_PF ( +- 7.85% ) (24.98%)
645,530 r024c LOAD_HIT_PRE.HW_PF ( +- 27.46% ) (25.76%)
436,823,098 r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 1.01% ) (26.42%)
250,519 r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 4.97% ) (25.91%)
3,701,145 r04d1 MEM_LOAD_UOPS_RETIRED.LLC_HIT ( +- 1.58% ) (25.79%)
3,950,488 r08d1 MEM_LOAD_UOPS_RETIRED.L1_MISS ( +- 1.36% ) (25.68%)
3,771,673 r10d1 MEM_LOAD_UOPS_RETIRED.L2_MISS ( +- 1.11% ) (25.31%)
1,026 r20d1 MEM_LOAD_UOPS_RETIRED.LLC_MISS ( +- 10.91% ) (24.86%)
0.298254699 seconds time elapsed ( +- 3.08% )
The sse_prefetch version shows a much higher LOAD_HIT_PRE.SW_PF count than the other two: 73,962 versus 4,380 for naive and 11,687 for sse.