contributed by<changyuanhua
,vonchuang
>
naive_transpose
, sse_transpose
, sse_prefetch_transpose
之間的效能差異,以及 prefetcher 對 cache 的影響Makefile
,產生新的執行檔,分別對應於 naive_transpose
, sse_transpose
, sse_prefetch_transpose
(學習 phonebook 的做法),學習 你所不知道的 C 語言:物件導向程式設計篇 提到的封裝技巧,以物件導向的方式封裝轉置矩陣的不同實作,得以透過一致的介面 (interface) 存取個別方法並且評估效能perf stat
的 raw counter 命令 lscpu
CPU 快取是用於減少處理器訪問記憶體所需平均時間的部件,下圖為處理器架構:
level-1 data cache:一級數據快取(I$)
level-1 inst cache:一級指令快取(D$)
MMU:記憶體管理單元
TLB:轉譯後備緩衝區 (translation lookaside buffer)
level-2 cache:二級緩存(E$)
level-3 cache:三級緩存
處理器讀取數據過程如下面兩個圖:
整個過程大致如下:
reference:
如果在CPU操作數據之前,我們就已經將數據提前加載到快取中,那麼就減少了由於快取未中 (cache-miss),需要從記憶體取值的情況,這樣就可以加速操作,獲得性能上提升。使用主動快取技術來優化記憶體複製。
值得注意的是,CPU 對數據操作擁有絕對自由!使用預取指令只是按我們自己的想法對 CPU 的數據操作進行補充,有可能CPU當前並不需要我們加載到快取的數據。這樣,我們的預取指令可能會帶來相反的結果,比如對於多工系統,有可能我們沖掉了有用的 cache。不過,在多工系統上,由於執行緒 (thread) 或行程 (process) 的切換所花費的時間相對於預取操作來說太長了, 所以可以忽略執行緒或行程切換對預取的影響。
prefetch distance 的定義: the distance ahead of which a prefetch should be requested on a memory address.
Table I 中的 D 表示為 prefetch distance ,在 array data structure 中要預取的 cache block 的存儲器地址是通過向 array 中的 index 添加常數值D來計算的
計算公式:
為了可以及時取出需要的資料,每次 prefetch 都要取出:
當下往後D次迴圈後的資料
,這樣在需要使用時,就會有正確的資料可以使用,不然每次 prefetch 會來不及趕上或是等待過久。
論文中將 Prefetch 的成功與否,分類為下面這幾種:
prefetch distance 必須要大到能夠隱藏 memory latency
CFLAGS += -D
:定義巨集變數,經過編譯之後,就可以在.c檔案中使用此定義巨集來做判斷perf
實驗結果:
實驗結果(重複執行100次的實驗結果):
cache miss
r10D1(L2 cache miss):
IPC(instruction per cycle):
distance | time | distance | time | distance | time |
---|---|---|---|---|---|
1 | 55542.054 | 11 | 51445.986 | 21 | 53940.23 |
2 | 55913.266 | 12 | 51587.33 | 22 | 53641.179 |
3 | 54025.94 | 13 | 52554.398 | 23 | 55055.568 |
4 | 54084.633 | 14 | 52422.197 | 24 | 54264.255 |
5 | 52981.885 | 15 | 52104.55 | 25 | 54850.852 |
6 | 52725.476 | 16 | 52381.843 | 26 | 54749.051 |
7 | 50900.602 | 17 | 52925.904 | 27 | 54883.265 |
8 | 51283.657 | 18 | 53391.204 | 28 | 55703.9 |
9 | 51379.069 | 19 | 53579.475 | 29 | 55961.272 |
10 | 50877.663 | 20 | 53242.423 | 30 | 55942.22 |
distance | time | distance | time | distance | time |
---|---|---|---|---|---|
1 | 54534.382 | 11 | 49985.784 | 21 | 52316.857 |
2 | 54182.105 | 12 | 49759.201 | 22 | 51906.385 |
3 | 52002.398 | 13 | 50525.176 | 23 | 53225.511 |
4 | 52318.836 | 14 | 51049.788 | 24 | 52741.997 |
5 | 50781.044 | 15 | 51073.010 | 25 | 53903.179 |
6 | 51153.778 | 16 | 50807.230 | 26 | 53713.169 |
7 | 49509.160 | 17 | 50636.573 | 27 | 53082.176 |
8 | 49670.598 | 18 | 51116.230 | 28 | 54087.863 |
9 | 49381.901 | 19 | 51542.549 | 29 | 54663.468 |
10 | 49712.684 | 20 | 51592.086 | 30 | 53808.543 |
distance | 平均時間 |
---|---|
distance 9 | 49553.042 |
distance 16 | 50539.942 |
程式碼:github
$ cat /etc/issue
Ubuntu 16.04.3 LTS
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 37
Model name: Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz
Stepping: 2
CPU MHz: 1866.000
CPU max MHz: 2266.0000
CPU min MHz: 933.0000
BogoMIPS: 4522.32
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
先將原先的impl.c
分成naive_transpose.c
,sse_transpose.c
,sse_prefetch_transpose.c
,並在各檔中定義一數值以利 main.c 判斷該使用哪種 transpose
#define NAIVE 1
#define SSE 1
#define SSE_PREFETCH 1
在 Makefile 中分別編譯之
naive_transpose: $(SRCS_common) naive_transpose.c
$(CC) $(CFLAGS) \
-DIMPL="\"$@.c\"" -o $@ \
$(SRCS_common)
sse_transpose: $(SRCS_common) sse_transpose.c
$(CC) $(CFLAGS) \
-DIMPL="\"$@.c\"" -o $@ \
$(SRCS_common)
sse_prefetch_transpose: $(SRCS_common) sse_prefetch_transpose.c
$(CC) $(CFLAGS) \
-DIMPL="\"$@.c\"" -o $@ \
$(SRCS_common)
其中的-DIMPL="\"$@.c\""
為依據不同檔案將之 include 進 main.c 中
perf
再分別對三種 transpose 以 pref 分析cache miss/hit
cache-test: $(EXEC)
perf stat --repeat 10 \
-e cache-misses,cache-references,instructions,cycles \
./naive_transpose
perf stat --repeat 10 \
-e cache-misses,cache-references,instructions,cycles \
./sse_transpose
perf stat --repeat 10 \
-e cache-misses,cache-references,instructions,cycles \
./sse_prefetch_transpose
以#include IMPL
依據 Makefile 的 DIMPL 分別 include 不同的 transpose 檔,再依該檔判斷並執行不同的 transpose
直接分別將代表行與列的 x , y 對調後存至 dst。假設 1 word = 4 Byte,Cache 為 4 word,而 Cache 初始為空,則
dst[i][j] | j = 0 | j = 1 | j = 2 | j = 3 |
---|---|---|---|---|
i = 0 | [m] | [h] | [h] | [h] |
i = 1 | [m] | [h] | [h] | [h] |
i = 2 | [m] | [h] | [h] | [h] |
i = 3 | [m] | [h] | [h] | [h] |
在 dst[0][0] 時,會 miss,同時相應的包含 dst[0][0]~ dst[0][3] 被存到 Cache 中,因此接下來三個引用都會是 hit。i = 1 ~ i = 3 皆是如此,故其 Cache miss rate 為 1/4。
src[i][j] | j = 0 | j = 1 | j = 2 | j = 3 |
---|---|---|---|---|
i = 0 | [m] | [m] | [m] | [m] |
i = 1 | [m] | [m] | [m] | [m] |
i = 2 | [m] | [m] | [m] | [m] |
i = 3 | [m] | [m] | [m] | [m] |
在這裡是一列一列而不是一行一行的掃描矩陣 src。如果夠幸運的話,整個矩陣都在 Cache 中,那麼也會是 1/4 的 Cache miss rate。但若是矩陣比 Cache 要大,那麼每次對 src[i][j] 的存取都是 miss,嚴重影響程式的執行效率!
此處可用分塊(Blocking)來提升效能,即將一個程式中的數據結構組織成大的片(Chunk),稱為塊(Block)。這樣的結構程式,能夠將一個片加載到 L1 Cache 中,並在這個片中進行所有的讀和寫,然後就丟掉這個片,加仔下一個片,依此類推。此方式可以提高其 Temporal Locality,在一些沒有 Prefetch 的系統上獲得極大的性能收益。
temporal locality(時間的局部性): 一個記憶體位址被存取後,不久會再度被存取
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 1 4 5
8 9 12 13
2 3 6 7
10 11 14 15
0 1 8 9
2 3 10 11
4 5 12 13
6 7 14 15
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Performance counter stats for './naive_transpose' (10 runs):
19,219,864 cache-misses # 96.869 % of all cache refs ( +- 0.03% )
19,841,159 cache-references ( +- 0.26% )
1,527,263,121 instructions # 0.73 insn per cycle ( +- 0.01% )
2,088,780,011 cycles ( +- 0.35% )
0.938373013 seconds time elapsed ( +- 1.01% )
Performance counter stats for './sse_transpose' (10 runs):
6,511,540 cache-misses # 88.634 % of all cache refs ( +- 0.02% )
7,346,541 cache-references ( +- 0.17% )
1,314,247,165 instructions # 1.06 insn per cycle ( +- 0.01% )
1,242,508,090 cycles ( +- 0.72% )
0.556691565 seconds time elapsed ( +- 0.64% )
經過實際執行測試後,確認 sse_transpose 確實比 naive_transpose 有較低的 Cache miss,執行時間降低為約 0.6 倍。
將 sse_transpose 的程使稍作改良,在每次轉置前先存取資料到 Cache 中,可降低 Cache miss,減少 miss penalty,進而增進效能。
void _mm_prefetch(char *p, int i)
_MM_HINT_T0
,_MM_HINT_T1
,_MM_HINT_T2
,_MM_HINT_NTA
)_mm_prefetch
本質上為指令,其對應的組合語言為PREFETCHh
—Prefetch Data Into Caches
PREFETCHh
只是一個 hint,CPU 並不一定會執行動作PREFETCHh
:
Opcode | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description |
---|---|---|---|---|---|
0F 18 /1 | PREFETCHT0 m8 | M | Valid | Valid | Move data from m8 closer to the processor using T0 hint |
0F 18 /2 | PREFETCHT1 m8 | M | Valid | Valid | Move data from m8 closer to the processor using T1 hint |
0F 18 /3 | PREFETCHT2 m8 | M | Valid | Valid | Move data from m8 closer to the processor using T2 hint |
0F 18 /0 | PREFETCHNTA m8 | M | Valid | Valid | Move data from m8 closer to the processor using NTA hint |
Intel's manuals imply that prefetchNTA fetches into L1D and (into one way of) L3
$perf list
並不會列出所有可偵測的情形,若是無法在其中找到所需,可於 Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B 中找到更多可使用的 Counter,稱之為 Raw Counter
Raw Counter 的形式為rUUEE
,其中 UU 為 umask,EE 為 Event。
根據不同的處理器,Event 跟 Umask 的值也會不同。以下為依據測試環境Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz
所列之值:
Event Num. | Umask Value | Event Mask Mnemonic | Description |
---|---|---|---|
4CH | 01H | LOAD_HIT_PRE.SW_PF | Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch. |
D0H | 81H | MEM_UOPS_RETIRED.ALL_LOADS | All retired load uops. |
D0H | 82H | MEM_UOPS_RETIRED.ALL_STORES | All retired store uops. |
D0H | 01H | MEM_LOAD_UOPS_RETIRED.L1_HIT | Retired load uops with L1 cache hits as data sources. |
D0H | 02H | MEM_LOAD_UOPS_RETIRED.L2_HIT | Retired load uops with L2 cache hits as data sources. |
D0H | 04H | MEM_LOAD_UOPS_RETIRED.LLC_HIT | Retired load uops whose data source was LLC hit with no snoop required. |
D0H | 08H | MEM_LOAD_UOPS_RETIRED.L1_MISS | Retired load uops whose data source followed an L1 miss. |
D0H | 10H | MEM_LOAD_UOPS_RETIRED.L2_MISS | Retired load uops that missed L2, excluding unknown sources. |
D0H | 20H | MEM_LOAD_UOPS_RETIRED.LLC_MISS | Retired load uops whose data source is LLC miss. |
依此實際測試給定程式不同的 Hint 時,其 Cache miss, prefetch 等情形,並比較四者的差異:
sse-transpose:
Performance counter stats for './sse_transpose' (10 runs):
6,282,455 cache-misses # 92.612 % of all cache refs ( +- 1.93% ) (18.94%)
6,783,656 cache-references ( +- 1.74% ) (19.95%)
1,208,375,002 instructions # 1.05 insn per cycle ( +- 1.09% ) (25.19%)
1,148,164,660 cycles ( +- 1.41% ) (25.73%)
443,295,887 L1-dcache-loads ( +- 1.19% ) (26.19%)
8,620,595 L1-dcache-load-misses # 1.94% of all L1-dcache hits ( +- 1.12% ) (26.25%)
2,251,129 L1-dcache-prefetches ( +- 1.30% ) (26.11%)
4,440 L1-dcache-prefetch-misses ( +- 10.70% ) (20.91%)
4,440,890 LLC-loads ( +- 3.03% ) (20.80%)
4,448,917 LLC-load-misses # 100.18% of all LL-cache hits ( +- 3.37% ) (20.63%)
363,532 LLC-prefetches ( +- 9.30% ) (10.10%)
39,545 LLC-prefetch-misses ( +- 14.52% ) (9.91%)
19,819 r014c ( +- 4.27% ) (14.91%)
0 r82d0 (19.47%)
0 r82d0 (19.09%)
1,246,538,911 r01d0 ( +- 1.97% ) (18.89%)
0 r02d0 (18.74%)
0 r04d0 (19.25%)
0 r08d0 (19.20%)
0 r10d0 (18.90%)
0 r20d0 (18.67%)
0.559394008 seconds time elapsed ( +- 0.88% )
sse-prefetch-transpose(T0)
Performance counter stats for './sse_prefetch_transpose' (10 runs):
6,234,747 cache-misses # 96.627 % of all cache refs ( +- 3.31% ) (19.28%)
6,452,393 cache-references ( +- 3.60% ) (19.61%)
1,350,558,124 instructions # 1.48 insn per cycle ( +- 1.19% ) (24.43%)
915,436,632 cycles ( +- 0.91% ) (24.55%)
481,342,795 L1-dcache-loads ( +- 1.21% ) (24.64%)
7,207,800 L1-dcache-load-misses # 1.50% of all L1-dcache hits ( +- 5.20% ) (24.78%)
2,330,886 L1-dcache-prefetches ( +- 4.27% ) (25.49%)
4,674 L1-dcache-prefetch-misses ( +- 7.88% ) (21.43%)
3,827,603 LLC-loads ( +- 3.84% ) (21.52%)
4,121,594 LLC-load-misses # 107.68% of all LL-cache hits ( +- 3.96% ) (21.26%)
311,904 LLC-prefetches ( +- 13.26% ) (10.40%)
26,175 LLC-prefetch-misses ( +- 9.35% ) (10.44%)
156,362 r014c ( +- 7.06% ) (15.54%)
0 r82d0 (20.50%)
0 r82d0 (20.30%)
1,222,176,158 r01d0 ( +- 0.33% ) (20.13%)
0 r02d0 (19.89%)
0 r04d0 (19.70%)
0 r08d0 (19.55%)
0 r10d0 (19.31%)
0 r20d0 (19.12%)
0.426371795 seconds time elapsed ( +- 0.76% )
sse-prefetch-transpose(T1)
Performance counter stats for './sse_prefetch_transpose' (10 runs):
5,758,146 cache-misses # 92.114 % of all cache refs ( +- 5.01% ) (18.36%)
6,251,077 cache-references ( +- 5.67% ) (18.96%)
1,321,154,025 instructions # 1.46 insn per cycle ( +- 1.87% ) (24.30%)
905,959,322 cycles ( +- 2.14% ) (24.60%)
473,197,320 L1-dcache-loads ( +- 1.70% ) (25.14%)
7,398,959 L1-dcache-load-misses # 1.56% of all L1-dcache hits ( +- 6.23% ) (26.86%)
2,179,161 L1-dcache-prefetches ( +- 5.21% ) (26.50%)
4,441 L1-dcache-prefetch-misses ( +- 8.41% ) (21.54%)
4,239,907 LLC-loads ( +- 4.80% ) (21.23%)
4,339,259 LLC-load-misses # 102.34% of all LL-cache hits ( +- 4.73% ) (21.01%)
312,456 LLC-prefetches ( +- 11.39% ) (10.51%)
27,690 LLC-prefetch-misses ( +- 4.29% ) (10.30%)
182,223 r014c ( +- 7.21% ) (15.30%)
0 r82d0 (20.20%)
0 r82d0 (19.99%)
1,235,676,607 r01d0 ( +- 1.17% ) (20.13%)
0 r02d0 (19.93%)
0 r04d0 (19.69%)
0 r08d0 (19.43%)
0 r10d0 (19.08%)
0 r20d0 (18.57%)
0.422474481 seconds time elapsed ( +- 1.19% )
sse-prefetch-transpose(T2)
Performance counter stats for './sse_prefetch_transpose' (10 runs):
6,049,351 cache-misses # 96.873 % of all cache refs ( +- 2.96% ) (19.18%)
6,244,603 cache-references ( +- 2.31% ) (19.56%)
1,347,072,434 instructions # 1.43 insn per cycle ( +- 0.66% ) (24.33%)
943,767,484 cycles ( +- 3.60% ) (24.36%)
474,619,500 L1-dcache-loads ( +- 1.43% ) (24.76%)
6,865,977 L1-dcache-load-misses # 1.45% of all L1-dcache hits ( +- 2.02% ) (24.45%)
2,481,744 L1-dcache-prefetches ( +- 2.70% ) (24.33%)
4,206 L1-dcache-prefetch-misses ( +- 2.88% ) (21.08%)
3,972,475 LLC-loads ( +- 3.35% ) (21.52%)
4,357,785 LLC-load-misses # 109.70% of all LL-cache hits ( +- 1.96% ) (21.28%)
498,598 LLC-prefetches ( +- 11.46% ) (10.53%)
31,366 LLC-prefetch-misses ( +- 2.81% ) (10.45%)
188,798 r014c ( +- 1.89% ) (15.49%)
0 r82d0 (20.46%)
0 r82d0 (20.27%)
1,236,227,193 r01d0 ( +- 0.60% ) (20.04%)
0 r02d0 (19.87%)
0 r04d0 (19.69%)
0 r08d0 (19.48%)
0 r10d0 (19.29%)
0 r20d0 (19.04%)
0.440200280 seconds time elapsed ( +- 3.55% )
sse-prefetch-transpose(NTA)
Performance counter stats for './sse_prefetch_transpose' (10 runs):
5,921,357 cache-misses # 89.033 % of all cache refs ( +- 4.62% ) (18.30%)
6,650,740 cache-references ( +- 7.27% ) (19.05%)
1,287,992,474 instructions # 1.40 insn per cycle ( +- 2.35% ) (24.54%)
917,113,958 cycles ( +- 4.18% ) (24.97%)
461,517,367 L1-dcache-loads ( +- 2.29% ) (25.92%)
7,674,133 L1-dcache-load-misses # 1.66% of all L1-dcache hits ( +- 9.60% ) (26.84%)
2,281,064 L1-dcache-prefetches ( +- 4.77% ) (26.82%)
7,122 L1-dcache-prefetch-misses ( +- 10.88% ) (21.03%)
4,468,090 LLC-loads ( +- 7.10% ) (20.90%)
4,254,908 LLC-load-misses # 95.23% of all LL-cache hits ( +- 5.98% ) (20.54%)
467,020 LLC-prefetches ( +- 8.57% ) (10.21%)
148,930 LLC-prefetch-misses ( +- 24.88% ) (10.47%)
245,885 r014c ( +- 19.58% ) (15.52%)
0 r82d0 (20.49%)
0 r82d0 (20.28%)
1,205,895,629 r01d0 ( +- 1.31% ) (20.07%)
0 r02d0 (19.89%)
0 r04d0 (19.66%)
0 r08d0 (19.34%)
0 r10d0 (18.95%)
0 r20d0 (18.43%)
0.432545105 seconds time elapsed ( +- 2.64% )
HINT | no-prefetch | T0 | T1 | T2 | NTA |
---|---|---|---|---|---|
cache-misses | 6,282,455 | 6,234,747 | 5,758,146 | 6,049,351 | 5,921,357 |
cache-references | 6,783,656 | 6,452,393 | 6,251,077 | 6,049,351 | 6,650,740 |
instructions | 1,208,375,002 | 1,350,558,124 | 1,321,154,025 | 1,347,072,434 | 1,287,992,474 |
cycles | 1,148,164,660 | 915,436,632 | 905,959,322 | 943,767,484 | 917,113,958 |
L1-dcache-loads | 443,295,887 | 481,342,795 | 473,197,320 | 474,619,500 | 461,517,367 |
L1-dcache-load-misses | 8,620,595 | 7,207,800 | 7,398,959 | 6,865,977 | 7,674,133 |
L1-dcache-prefetches | 2,251,129 | 2,330,886 | 2,179,161 | 2,481,744 | 2,281,064 |
L1-dcache-prefetch-misses | 4,440 | 4,674 | 4,441 | 4,206 | 7,122 |
LLC-loads | 4,440,890 | 3,827,603 | 4,239,907 | 3,972,475 | 4,468,090 |
LLC-load-misses | 4,448,917 | 4,121,594 | 4,339,259 | 4,357,785 | 4,254,908 |
LLC-prefetches | 363,532 | 311,904 | 312,456 | 498,598 | 467,020 |
LLC-prefetch-misses | 39,545 | 26,175 | 27,690 | 31,366 | 148,930 |
r014c | 19,819 | 156,362 | 182,223 | 188,798 | 245,885 |
r82d0 | 0 | 0 | 0 | 0 | 0 |
r01d0 | 0 | 0 | 0 | 0 | 0 |
r02d0 | 1,246,538,911 | 1,222,176,158 | 1,235,676,607 | 1,236,227,193 | 1,205,895,629 |
r04d0 | 0 | 0 | 0 | 0 | 0 |
r08d0 | 0 | 0 | 0 | 0 | 0 |
r10d0 | 0 | 0 | 0 | 0 | 0 |
r20d0 | 0 | 0 | 0 | 0 | 0 |
Where we might expect the T1 and T2 hints to work well is in the case of streaming applications that have little data reuse. Since our benchmarks do not have such applications, we find that T0 always performs better than other hints in all evaluated benchmarks.
改變 main.c 裡的設定,即可改變輸入矩陣的大小