contributed by <kobeyu>
Prefetching loads data from main memory into the cache before the CPU operates on it, reducing CPU stall time and thus improving performance.
In terms of timing, the data must not be loaded into the cache too late (the CPU has already finished and no longer needs it) nor too early (it may be evicted from the cache before it is ever used).
In terms of the memory hierarchy, the prefetch intrinsic takes a parameter that selects which cache level to load into.
Hardware prefetch uses past access history as the basis for prefetching.
GHB:Global History Buffer
Software prefetch instructions can be inserted at appropriate places, either by hand or by the compiler.
Direct: the memory address to access is known directly from the instruction, so it can be handled by hardware prefetch.
Indirect: as in `x = a[b[i]]`, the value of `b[i]` must be computed first before we know which element of `a[]` to access. The machine cannot know better than the programmer where `b[i]` comes from, so this case is easier to handle with software prefetch.
* Large Number of Streams (Limited Hardware Resources).
* Short Streams: as mentioned above, hardware prefetch relies on past access history, so streams too short to train it are better served by software prefetch.
* Irregular Memory Access
* Cache Locality Hint
* Loop Bounds.
* Increased Instruction Count
* Static Insertion: cannot be adjusted dynamically at run time
* Code Structure Change
* Handling Multiple Streams
* Positive Training
* Negative Training
* Harmful Software Prefetching: if software prefetch performs badly, it drags hardware prefetch down with it.
K is a constant factor
L is the average memory latency
IPC_bench is the profiled average IPC of each benchmark
W_loop is the average instruction count in one loop iteration
P.S. In the paper, all measurements use K=4 and L=300.
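Reconstructing from the variable definitions above, my reading of the paper's prefetch-distance formula (worth double-checking against the original) is:

```latex
\text{Distance} \ge \left\lceil \frac{K \times L \times \mathit{IPC}_{bench}}{W_{loop}} \right\rceil
```

Intuitively: the latency to hide (L cycles) times the instruction throughput (IPC_bench) gives the instructions executed while a miss is outstanding; dividing by the instructions per iteration (W_loop) converts that into loop iterations, and K scales it with a safety margin.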
Matrix transposition is the operation that turns rows into columns and columns into rows. Since memory layout is row-major, reading a row accesses contiguous memory while writing a column writes to non-contiguous memory, so transposition inherently produces a high rate of cache misses.
Programming trivia: 4x4 integer matrix transpose in SSE2
Using SSE/AVX gives shorter execution times; as far as I currently understand, the reasons are:
1. Each instruction processes a wider chunk of data.
2. Direct support in the CPU instruction set yields a better CPI.
_mm_unpacklo_epi32 (__m128i __A, __m128i __B)
_mm_unpackhi_epi32 (__m128i __A, __m128i __B)
_mm_unpacklo_epi64 (__m128i __A, __m128i __B)
_mm_unpackhi_epi64 (__m128i __A, __m128i __B)
Image source: 32/64-Bit 80x86 Assembly Language Architecture
.
├── impl.c
├── main.c
├── Makefile
├── README.md
└── scripts
├── plot.gp
└── pre-commit.hook
1 directory, 6 files
impl.c implements three versions of matrix transpose: naive, sse, and sse+prefetch.
First, compare their execution times:
According to the material read earlier, prefetch performance depends on both the distance and the target cache level.
First, analyze the effect of distance on performance.
In impl.c the prefetch distance is set by the PFDIST macro; the experiment measures the execution time for D=0/4/8/12/16:
The resulting execution times rank as:
D=8 < D=4,12 < D=16 < D=0
This shows that the prefetch distance does affect performance, most visibly at D=0, presumably because data used two iterations earlier has already been evicted from the cache or overwritten. Another interesting observation: whatever the value of D, the result is always faster than the sse version without prefetch.
Next, I fixed the distance at 8 and varied the prefetch locality hint (T0/T1/T2/NTA).
NTA turned out to be the slowest, so let us check NTA's definition:
NTA (non-temporal data with respect to all cache levels)–prefetch data into non-temporal cache structure. (This hint can be used to minimize pollution of caches.)
Two phrases were unclear to me: "non-temporal cache structure" and "pollution of caches".
Simply put, cache pollution means the cache gets filled with ill-timed data: "early" means data was fetched too soon and had already been replaced by the time it was needed, while "incorrect" means data that is never used at all was loaded. Both increase cache misses, which is what is meant by polluting the cache. — Yen-Kuan Wu
Next, observe the cache misses with perf:
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main
D=8 + T0
Performance counter stats for './main' (50 runs):
62,963,773 cache-misses # 86.289 % of all cache refs ( +- 0.16% ) (66.55%)
72,968,723 cache-references ( +- 0.15% ) (66.62%)
34,265,657 L1-dcache-load-misses # 4.49% of all L1-dcache hits ( +- 0.11% ) (66.74%)
762,324,176 L1-dcache-loads ( +- 0.06% ) (66.89%)
314,796,493 L1-dcache-stores ( +- 0.06% ) (66.92%)
194,207 L1-icache-load-misses ( +- 1.25% ) (66.69%)
0.698045545 seconds time elapsed ( +- 0.16% )
D=0 + NTA
Performance counter stats for './main' (50 runs):
65,505,666 cache-misses # 86.406 % of all cache refs ( +- 0.18% ) (66.35%)
75,811,770 cache-references ( +- 0.16% ) (66.62%)
33,924,898 L1-dcache-load-misses # 4.49% of all L1-dcache hits ( +- 0.08% ) (66.93%)
754,883,652 L1-dcache-loads ( +- 0.05% ) (67.06%)
316,034,709 L1-dcache-stores ( +- 0.08% ) (66.89%)
153,597 L1-icache-load-misses ( +- 1.40% ) (66.49%)
0.779092952 seconds time elapsed ( +- 0.11% )
D=0 + T0
Performance counter stats for './main' (50 runs):
69,063,108 cache-misses # 86.741 % of all cache refs ( +- 0.17% ) (66.29%)
79,620,072 cache-references ( +- 0.14% ) (66.56%)
33,903,912 L1-dcache-load-misses # 4.49% of all L1-dcache hits ( +- 0.07% ) (66.94%)
755,064,566 L1-dcache-loads ( +- 0.05% ) (67.15%)
316,197,275 L1-dcache-stores ( +- 0.07% ) (66.94%)
157,890 L1-icache-load-misses ( +- 1.25% ) (66.45%)
0.781984994 seconds time elapsed ( +- 0.08% )
These results show little correlation between execution time and cache misses, which seems to confirm the earlier inference: the goal of prefetch is not to reduce cache misses but to reduce the latency they cause.
However, the cache-miss measurements above are invalid, because sse+prefetch/sse/naive were not run separately. Below are the results of running each version on its own:
SSE + Prefetch (Distance=8, @T0)
Performance counter stats for './main' (50 runs):
9,022,860 cache-misses # 67.566 % of all cache refs ( +- 0.32% ) (66.61%)
13,354,083 cache-references ( +- 0.16% ) (67.11%)
8,964,264 L1-dcache-load-misses # 1.93% of all L1-dcache hits ( +- 0.32% ) (67.20%)
465,440,570 L1-dcache-loads ( +- 0.08% ) (67.20%)
233,797,035 L1-dcache-stores ( +- 0.12% ) (66.88%)
62,034 L1-icache-load-misses ( +- 2.20% ) (66.37%)
0.195889775 seconds time elapsed
SSE
Performance counter stats for './main' (50 runs):
15,047,354 cache-misses # 79.379 % of all cache refs ( +- 0.28% ) (66.25%)
18,956,381 cache-references ( +- 0.18% ) (66.36%)
8,408,231 L1-dcache-load-misses # 1.89% of all L1-dcache hits ( +- 0.12% ) (66.71%)
443,824,150 L1-dcache-loads ( +- 0.15% ) (67.34%)
229,787,688 L1-dcache-stores ( +- 0.13% ) (67.44%)
90,578 L1-icache-load-misses ( +- 1.63% ) (66.70%)
0.356293017 seconds time elapsed
NAIVE
Performance counter stats for './main' (50 runs):
42,155,708 cache-misses # 87.722 % of all cache refs ( +- 0.21% ) (66.42%)
48,056,099 cache-references ( +- 0.20% ) (66.60%)
21,011,298 L1-dcache-load-misses # 3.84% of all L1-dcache hits ( +- 0.07% ) (66.80%)
547,771,641 L1-dcache-loads ( +- 0.09% ) (67.18%)
215,693,429 L1-dcache-stores ( +- 0.13% ) (67.09%)
76,732 L1-icache-load-misses ( +- 1.87% ) (66.52%)
0.432626384 seconds time elapsed ( +- 0.20% )
The experiment above shows that prefetch loads the needed data into the cache in advance, reducing cache misses and thereby the execution time.
Consider the definition of a cache miss: the data being accessed is not in the cache. If we know in advance which data will be accessed and load it into the cache beforehand, those misses can be avoided.
Reference implementation from a previous classmate: link
GHB link
kobeyu