contributed by <HMKRL
>
zhanyangch
由文中的實驗結果圖可以發現,當 Software prefetching (例如 _mm_prefetch()
) 與 Hardware prefetching (例如 GHB) 同時使用時,並沒有明確的效能表現規則,本文重點為探索此兩種 prefetcher 交互使用的效果
其中有三項探討重點:
Table 1 為文中對不同資料結構適用的 prefetch 方式整理,其中關於 prefetch distance (D),可以參考圖二的時間線:
可以發現 發出 prefetch request 後,資料需要一段時間才能進入 cache ,但若過太久仍未被使用,則會被移出 cache
因此採用 software prefetch 時,若太晚執行 prefetch ,則無法透過這次 prefetch 消除 memory latency 的影響,太早執行則會讓資料在被使用前就移出 cache ,都無法有效增進效能(甚至會因為 prefetch 本身的成本造成效能低落),適合的 D 值可以這樣計算:
其中
論文中 5.1.3 對不同 D 值的影響進行分析:
Memory indexing 可以分為兩種:direct 與 indirect
其中 direct 型(例如一維陣列) 較容易做 prefetch (由 index 可以直接得知資料所在的位置)
indirect 型(例如 Linked-List )則較不容易
圖3以簡化的組合語言指令比較了兩種 indexing 模式的差異,可以發現 indirect indexing 必須額外計算實際位址
開發環境:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
首先注意 impl.c
naive_transpose
是最直接的轉置作法,直接依照轉置矩陣的規則以迴圈實作
sse_transpose
則出現許多未曾使用過的 sse 指令集操作 ,因此使用 gdb
觀察程式執行(已加裝 gef )
將中斷點設在 impl.c:17
, 由 gdb
顯示的組合語言指令得知_mm_loadu_si128
用途是將數值存入 xmm0
,參考 wikipedia 得知這是用於 sse 指令集的特殊 register
────────────────────────────────────────────────────────────────────────────────────────────────────────────[ code:i386:x86-64 ]────
0x555555554a52 <sse_transpose+109> add rax, rdx
0x555555554a55 <sse_transpose+112> mov QWORD PTR [rbp-0x200], rax
0x555555554a5c <sse_transpose+119> mov rax, QWORD PTR [rbp-0x200]
0x555555554a63 <sse_transpose+126> movdqu xmm0, XMMWORD PTR [rax]
0x555555554a67 <sse_transpose+130> movaps XMMWORD PTR [rbp-0x1c0], xmm0
→ 0x555555554a6e <sse_transpose+137> mov eax, DWORD PTR [rbp-0x204]
0x555555554a74 <sse_transpose+143> add eax, 0x1
0x555555554a77 <sse_transpose+146> imul eax, DWORD PTR [rbp-0x224]
0x555555554a7e <sse_transpose+153> movsxd rdx, eax
0x555555554a81 <sse_transpose+156> mov eax, DWORD PTR [rbp-0x208]
0x555555554a87 <sse_transpose+162> cdqe
────────────────────────────────────────────────────────────────────────────────────────────────────────────[ source:impl.c+16 ]────
12 {
13 for (int x = 0; x < w; x += 4) {
14 for (int y = 0; y < h; y += 4) {
15 __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
// y=0x0, x=0x0, src=0x00007fffffffd798 → [...] → 0x0000000100000000, w=0x4
→ 16 __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
17 __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
18 __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
19 __m128i T0 = _mm_unpacklo_epi32(I0, I1);
20 __m128i T1 = _mm_unpacklo_epi32(I2, I3);
再觀察以下執行結果:
__m128i T0 = _mm_unpacklo_epi32(I0, I1)
gef➤ p I0
$8 = {0x100000000, 0x300000002}
gef➤ p I1
$9 = {0x500000004, 0x700000006}
gef➤ p T0
$10 = {0x400000000, 0x500000001}
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
:
gef➤ p I0
$11 = {0x100000000, 0x300000002}
gef➤ p I1
$12 = {0x500000004, 0x700000006}
gef➤ p T2
$13 = {0x600000002, 0x700000003}
I0 = _mm_unpacklo_epi64(T0, T1);
gef➤ p T0
$23 = {0x400000000, 0x500000001}
gef➤ p T1
$24 = {0xc00000008, 0xd00000009}
gef➤ p I0
$25 = {0x400000000, 0xc00000008}
透過以上觀察,加上參照 MSDN 的說明,即可理解這幾個 function 的行為
程式碼修改請見 commit f15322b
新增 impl.h
, 統一所有 transpose function 名稱為 transpose()
並宣告函式原型在 impl.h
中
各種不同實作分別位於 naive_transpose.c
,sse_transpose.c
與 sse_prefetch_transpose.c
中,透過 Makefile 決定要使用的檔案
測試結果:
make run
./naive_transpose
naive: 600506 us
./sse_transpose
sse: 307496 us
./sse_prefetch_transpose
sse_prefetch: 149165 us
make cache-test
test case | cache-miss |
---|---|
naive | 87.332 % |
sse | 78.215 % |
sse_prefetch | 66.306 % |
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
參考 illusion030 的共筆,觀察以下 raw counter:
r014c LOAD_HIT_PRE.SW_PF
r024c LOAD_HIT_PRE.HW_PF
r81d0 MEM_UOPS_RETIRED.ALL_LOADS
r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT
r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT
r04d1 MEM_LOAD_UOPS_RETIRED.L3_HIT
結果:
Performance counter stats for './naive_transpose' (10 runs):
39,361,652 cache-misses # 87.332 % of all cache refs ( +- 1.18% ) (39.91%)
45,071,235 cache-references ( +- 1.30% ) (40.45%)
1,394,570,644 instructions # 1.18 insn per cycle ( +- 0.62% ) (50.60%)
1,183,001,040 cycles ( +- 0.39% ) (60.58%)
145 r014c ( +- 26.11% ) (60.62%)
347,288 r024c ( +- 8.84% ) (60.77%)
565,116,285 r81d0 ( +- 0.52% ) (60.89%)
541,841,834 r01d1 ( +- 0.72% ) (61.08%)
32,226 r02d1 ( +- 13.70% ) (40.05%)
704,847 r04d1 ( +- 5.39% ) (39.95%)
Performance counter stats for './sse_transpose' (10 runs):
14,631,270 cache-misses # 78.215 % of all cache refs ( +- 1.32% ) (37.96%)
18,706,573 cache-references ( +- 1.21% ) (40.68%)
1,156,979,239 instructions # 1.44 insn per cycle ( +- 1.37% ) (51.62%)
801,805,797 cycles ( +- 0.67% ) (61.88%)
104 r014c ( +- 4.41% ) (62.54%)
228,122 r024c ( +- 11.42% ) (63.56%)
442,330,486 r81d0 ( +- 0.65% ) (63.34%)
444,642,666 r01d1 ( +- 0.66% ) (61.72%)
32,251 r02d1 ( +- 8.80% ) (38.78%)
138,286 r04d1 ( +- 2.73% ) (37.31%)
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,407,856 cache-misses # 66.306 % of all cache refs ( +- 0.77% ) (39.42%)
12,680,436 cache-references ( +- 0.38% ) (39.50%)
1,266,181,395 instructions # 2.19 insn per cycle ( +- 0.38% ) (49.59%)
577,113,498 cycles ( +- 0.36% ) (59.67%)
1,194,382 r014c ( +- 3.70% ) (59.67%)
276,261 r024c ( +- 13.61% ) (60.82%)
450,951,942 r81d0 ( +- 1.27% ) (63.35%)
448,833,332 r01d1 ( +- 0.84% ) (62.97%)
1,185,907 r02d1 ( +- 2.41% ) (41.53%)
23,900 r04d1 ( +- 8.80% ) (40.14%)
可以發現在使用了 _mm_prefetch
的 sse_prefetch_transpose
中,r014c LOAD_HIT_PRE.SW_PF
明顯高出其他版本很多,證實了 software prefetching 有實際生效
接下來單獨觀察 sse_prefetch_transpose
:
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,274,577 cache-misses # 65.760 % of all cache refs ( +- 1.55% ) (39.24%)
12,582,914 cache-references ( +- 1.15% ) (39.39%)
1,270,075,951 instructions # 2.22 insn per cycle ( +- 0.48% ) (49.56%)
571,300,575 cycles ( +- 0.29% ) (59.66%)
2,055,764 r014c ( +- 1.49% ) (59.81%)
285,024 r024c ( +- 12.43% ) (61.40%)
452,350,810 r81d0 ( +- 0.99% ) (63.40%)
450,444,764 r01d1 ( +- 0.97% ) (62.82%)
136,386 r02d1 ( +- 6.58% ) (41.17%)
21,463 r04d1 ( +- 9.31% ) (39.90%)
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,294,601 cache-misses # 65.715 % of all cache refs ( +- 1.96% ) (39.20%)
12,622,082 cache-references ( +- 1.59% ) (39.49%)
1,267,488,698 instructions # 2.21 insn per cycle ( +- 0.20% ) (49.82%)
573,599,634 cycles ( +- 0.47% ) (59.97%)
1,138,961 r014c ( +- 3.27% ) (60.38%)
297,754 r024c ( +- 12.40% ) (62.11%)
452,313,963 r81d0 ( +- 0.71% ) (63.44%)
447,991,261 r01d1 ( +- 0.71% ) (63.10%)
1,211,477 r02d1 ( +- 2.48% ) (40.61%)
23,397 r04d1 ( +- 7.42% ) (39.48%)
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,160,145 cache-misses # 64.491 % of all cache refs ( +- 2.19% ) (39.61%)
12,653,216 cache-references ( +- 0.45% ) (39.98%)
1,261,785,164 instructions # 2.19 insn per cycle ( +- 0.37% ) (50.15%)
575,843,475 cycles ( +- 0.38% ) (60.17%)
1,137,009 r014c ( +- 3.06% ) (60.17%)
340,189 r024c ( +- 9.23% ) (60.61%)
464,808,926 r81d0 ( +- 0.84% ) (61.75%)
450,360,519 r01d1 ( +- 0.92% ) (62.46%)
1,179,211 r02d1 ( +- 1.03% ) (40.89%)
20,659 r04d1 ( +- 7.79% ) (40.36%)
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,121,213 cache-misses # 63.071 % of all cache refs ( +- 2.09% ) (39.80%)
12,876,217 cache-references ( +- 2.41% ) (40.06%)
1,267,977,461 instructions # 2.20 insn per cycle ( +- 0.43% ) (50.18%)
575,640,296 cycles ( +- 0.60% ) (60.18%)
2,028,726 r014c ( +- 1.83% ) (60.17%)
338,681 r024c ( +- 9.67% ) (60.74%)
460,173,646 r81d0 ( +- 0.99% ) (61.99%)
452,179,786 r01d1 ( +- 0.93% ) (62.14%)
11,199 r02d1 ( +- 3.08% ) (41.24%)
219,170 r04d1 ( +- 2.84% ) (40.43%)
疑問:
可以發現 T0
和 T1
都如預期將資料放在 L1 / L2 cache
但 T2
並不是放在 L3 而是 L2 ? 反而是 NTA 模式 似乎傾向使用 L3 cache
用 perf raw-counter 確認!
此處尚待理解