# 2016q3 Homework3 (software-pipeline)
contributed by <`kobeyu`>
[github](https://github.com/kobe0308/prefetcher.git)
[Assignment spec](https://hackmd.io/s/ry7eqDEC)
## Assigned Reading
### [Performance Analysis: Prefetching](https://embedded2015.hackpad.com/ep/pad/static/7ZSsa98cSKw)
### [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
## Prefetch
Prefetching loads data from main memory into the cache before the CPU needs it, reducing CPU stall time and thus improving performance.
Timing-wise, the prefetch must not issue too late (the CPU has already finished its computation and no longer needs the data) nor too early (the data may be evicted from the cache before it is used).
In terms of the memory hierarchy, the prefetch intrinsic takes a hint argument that selects which cache level the data is loaded into.
### Hardware prefetch
Uses past access history as the basis for prefetching.
GHB: Global History Buffer
### Software prefetch
Prefetch instructions are inserted at appropriate points, either by hand or by the compiler.
### Direct and Indirect Memory Indexing
Direct indexing: the target memory address can be determined directly from the instruction, so it can be handled by hardware prefetch.
Indirect indexing: as in `x = a[b[i]]`, the value of `b[i]` must be computed before we know which element of `a[]` will be accessed. The machine cannot know better than the programmer how `b[i]` is produced, so this case is easier to handle with software prefetch.
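A minimal sketch of software prefetch for this indirect pattern (the function name, `DIST`, and the data are illustrative, not from this assignment's code): the loop itself computes the future index `b[i + DIST]` and prefetches `a[b[i + DIST]]`, something a hardware prefetcher cannot predict from the address stream alone.

```c
#include <assert.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

#define DIST 4 /* prefetch distance in iterations; illustrative value */

/* Gather-sum over the indirect pattern x = a[b[i]].
 * The index array b is read sequentially, so b[i + DIST] is cheap;
 * prefetching a[b[i + DIST]] hides the random-access latency on a. */
static int gather_sum(const int *a, const int *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            _mm_prefetch((const char *) &a[b[i + DIST]], _MM_HINT_T0);
        sum += a[b[i]];
    }
    return sum;
}
```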
### Cases where Software Prefetch beats Hardware Prefetch:
* Large number of streams (limited hardware resources)
* Short streams: as noted above, hardware prefetch trains on access history, so streams too short to train on are better served by software prefetch
* Irregular memory access
* Cache locality hints
* Loop bounds
### Drawbacks of Software Prefetch
* Increased instruction count
* Static insertion: cannot be adjusted dynamically at run time
* Code structure change
### Cases where combining Hardware and Software Prefetch gives a 1+1>2 effect
* Handling multiple streams
* Positive training
### Cases where Hardware and Software Prefetch drag each other down...
* Negative training
* Harmful software prefetching: badly placed software prefetches can in turn degrade the hardware prefetcher
### Prefetch Intrinsic Insertion Algorithm

The paper computes the prefetch distance as

$$D = \left\lceil \frac{K \times L \times IPC_{bench}}{W_{loop}} \right\rceil$$

where
* $K$ is a constant factor
* $L$ is the average memory latency
* $IPC_{bench}$ is the profiled average IPC of each benchmark
* $W_{loop}$ is the average instruction count in one loop iteration

P.S. All measurements in the paper use $K = 4$ and $L = 300$.

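As a sanity check, the quantities above combine into a distance in loop iterations: $L \times IPC_{bench}$ converts the latency into an instruction count, dividing by $W_{loop}$ converts that into iterations, and $K$ is a safety factor. With illustrative numbers (mine, not measured from any benchmark) $K=4$, $L=300$ cycles, $IPC_{bench}=1.5$, and $W_{loop}=120$ instructions per iteration:

```latex
D = \left\lceil \frac{K \times L \times IPC_{bench}}{W_{loop}} \right\rceil
  = \left\lceil \frac{4 \times 300 \times 1.5}{120} \right\rceil
  = 15
```

i.e. the prefetch for the data of iteration $i+15$ would be issued during iteration $i$.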
---
## Matrix Transpose
A matrix transpose turns rows into columns and columns into rows. Because memory is laid out row-major, reading a row accesses contiguous memory while writing a column accesses non-contiguous memory, so a transpose inherently incurs a high proportion of cache misses.
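The naive version can be sketched as follows (the signature is modeled on this assignment's impl.c, so treat the details as an approximation): the read from `src` walks contiguous memory, but each write to `dst` jumps by `h` elements, which is exactly the strided access that causes the misses.

```c
#include <assert.h>

/* Naive transpose of a w-by-h row-major matrix.
 * Reads src row by row (contiguous, cache-friendly),
 * writes dst column by column (stride of h ints, cache-hostile). */
static void naive_transpose(const int *src, int *dst, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[x * h + y] = src[y * w + x];
}
```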
## Matrix Transpose with Intel Assembly (SSE2)
[Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
Using SSE/AVX gives shorter execution times; as I currently understand it, the reasons are:
1. each instruction processes a wider chunk of data
2. direct CPU instruction-set support, giving a better CPI

### PUNPCKLDQ
`_mm_unpacklo_epi32 (__m128i __A, __m128i __B)`: interleave the low two 32-bit elements of `__A` and `__B`

### PUNPCKHDQ
`_mm_unpackhi_epi32 (__m128i __A, __m128i __B)`: interleave the high two 32-bit elements of `__A` and `__B`

### PUNPCKLQDQ
`_mm_unpacklo_epi64 (__m128i __A, __m128i __B)`: concatenate the low 64-bit halves of `__A` and `__B`

### PUNPCKHQDQ
`_mm_unpackhi_epi64 (__m128i __A, __m128i __B)`: concatenate the high 64-bit halves of `__A` and `__B`

Image source: [32/64-Bit 80x86 Assembly Language Architecture](https://books.google.com.tw/books?id=c3SlgrqMid4C)
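The four unpack intrinsics above compose into the 4x4 integer transpose described in the linked post. A sketch of that scheme (the function name is mine, not from impl.c):

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */

/* Transpose a 4x4 int32 matrix with two rounds of unpacks:
 * round 1 interleaves 32-bit elements, round 2 interleaves
 * 64-bit pairs, yielding the transposed rows. */
static void sse_transpose4x4(const int *src, int *dst)
{
    __m128i r0 = _mm_loadu_si128((const __m128i *) (src + 0));  /* a0 a1 a2 a3 */
    __m128i r1 = _mm_loadu_si128((const __m128i *) (src + 4));  /* b0 b1 b2 b3 */
    __m128i r2 = _mm_loadu_si128((const __m128i *) (src + 8));  /* c0 c1 c2 c3 */
    __m128i r3 = _mm_loadu_si128((const __m128i *) (src + 12)); /* d0 d1 d2 d3 */

    __m128i t0 = _mm_unpacklo_epi32(r0, r1); /* a0 b0 a1 b1 */
    __m128i t1 = _mm_unpacklo_epi32(r2, r3); /* c0 d0 c1 d1 */
    __m128i t2 = _mm_unpackhi_epi32(r0, r1); /* a2 b2 a3 b3 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3); /* c2 d2 c3 d3 */

    _mm_storeu_si128((__m128i *) (dst + 0),  _mm_unpacklo_epi64(t0, t1)); /* a0 b0 c0 d0 */
    _mm_storeu_si128((__m128i *) (dst + 4),  _mm_unpackhi_epi64(t0, t1)); /* a1 b1 c1 d1 */
    _mm_storeu_si128((__m128i *) (dst + 8),  _mm_unpacklo_epi64(t2, t3)); /* a2 b2 c2 d2 */
    _mm_storeu_si128((__m128i *) (dst + 12), _mm_unpackhi_epi64(t2, t3)); /* a3 b3 c3 d3 */
}
```

Eight unpacks plus the loads/stores handle a whole 4x4 tile, versus 16 scalar load/store pairs in the naive version.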
---
### Walking through the prefetcher code
#### Directory layout
```
.
├── impl.c
├── main.c
├── Makefile
├── README.md
└── scripts
    ├── plot.gp
    └── pre-commit.hook

1 directory, 6 files
```
impl.c implements three versions of the matrix transpose: naive, sse, and sse+prefetch.
First, a comparison of their execution times:

From the earlier reading, prefetch effectiveness depends on the prefetch distance and on the target cache level.
First, the effect of distance on performance.
In impl.c the prefetch distance is set by `PFDIST`; I measured the execution time for D = 0/4/8/12/16.
The resulting ordering of execution times is:
D=8 < D=4,12 < D=16 < D=0
This confirms that the prefetch distance affects performance. D=0 stands out as especially slow; my guess is that the data used earlier (2 iterations before) has already been evicted from the cache or overwritten. Another interesting observation: whatever the value of D, every prefetching variant is still faster than the plain SSE version without prefetch.
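The role of the distance parameter can be sketched with a loop simpler than the transpose (`PFDIST` and the function name are illustrative, not taken from impl.c): each iteration prefetches the element `PFDIST` iterations ahead, so the memory latency overlaps with current work. With `PFDIST` = 0 the prefetch would target data the iteration is already about to load, which arrives too late to help.

```c
#include <assert.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

#define PFDIST 8 /* prefetch distance in elements; illustrative value */

/* Sum an array while prefetching PFDIST elements ahead of the
 * current index, so the load latency is hidden behind the adds. */
static long sum_with_prefetch(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PFDIST < n)
            _mm_prefetch((const char *) &a[i + PFDIST], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
```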

---
Next I fixed the distance at 8 and varied the prefetch hint (T0/T1/T2/NTA).
NTA took the longest, so let's double-check NTA's definition:
> NTA (non-temporal data with respect to all cache levels)--prefetch data into non-temporal cache structure. (This hint can be used to minimize pollution of caches.)

Two things here were unclear to me: "non-temporal cache structure" and "pollution of caches".
>> In short, cache pollution means the cache gets filled with data that is not timely. "Early" means the data was fetched so early that it had already been replaced by the time it was needed; "incorrect" means data was loaded that is never used at all. All of these increase cache misses, which is what "polluting the cache" refers to. [name=Yen-Kuan Wu]

Next, observe cache misses with perf:
```shell
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main
```
D=8 + T0
```
 Performance counter stats for './main' (50 runs):

        62,963,773  cache-misses           # 86.289 % of all cache refs    ( +- 0.16% )  (66.55%)
        72,968,723  cache-references                                       ( +- 0.15% )  (66.62%)
        34,265,657  L1-dcache-load-misses  # 4.49% of all L1-dcache hits   ( +- 0.11% )  (66.74%)
       762,324,176  L1-dcache-loads                                        ( +- 0.06% )  (66.89%)
       314,796,493  L1-dcache-stores                                       ( +- 0.06% )  (66.92%)
           194,207  L1-icache-load-misses                                  ( +- 1.25% )  (66.69%)

       0.698045545 seconds time elapsed                                    ( +- 0.16% )
```
---
D=0 + NTA
```
 Performance counter stats for './main' (50 runs):

        65,505,666  cache-misses           # 86.406 % of all cache refs    ( +- 0.18% )  (66.35%)
        75,811,770  cache-references                                       ( +- 0.16% )  (66.62%)
        33,924,898  L1-dcache-load-misses  # 4.49% of all L1-dcache hits   ( +- 0.08% )  (66.93%)
       754,883,652  L1-dcache-loads                                        ( +- 0.05% )  (67.06%)
       316,034,709  L1-dcache-stores                                       ( +- 0.08% )  (66.89%)
           153,597  L1-icache-load-misses                                  ( +- 1.40% )  (66.49%)

       0.779092952 seconds time elapsed                                    ( +- 0.11% )
```
---
D=0 + T0
```
 Performance counter stats for './main' (50 runs):

        69,063,108  cache-misses           # 86.741 % of all cache refs    ( +- 0.17% )  (66.29%)
        79,620,072  cache-references                                       ( +- 0.14% )  (66.56%)
        33,903,912  L1-dcache-load-misses  # 4.49% of all L1-dcache hits   ( +- 0.07% )  (66.94%)
       755,064,566  L1-dcache-loads                                        ( +- 0.05% )  (67.15%)
       316,197,275  L1-dcache-stores                                       ( +- 0.07% )  (66.94%)
           157,890  L1-icache-load-misses                                  ( +- 1.25% )  (66.45%)

       0.781984994 seconds time elapsed                                    ( +- 0.08% )
```
~~These results suggested that execution time has little correlation with cache misses, seemingly confirming the earlier conjecture that the point of prefetch is not to reduce cache misses but to reduce the latency they cause.~~
---
The cache-miss results above are wrong: sse+prefetch/sse/naive were not run separately during measurement. Below are the results with each version run on its own:
1. SSE + Prefetch (Distance=8, @T0)
```
 Performance counter stats for './main' (50 runs):

         9,022,860  cache-misses           # 67.566 % of all cache refs    ( +- 0.32% )  (66.61%)
        13,354,083  cache-references                                       ( +- 0.16% )  (67.11%)
         8,964,264  L1-dcache-load-misses  # 1.93% of all L1-dcache hits   ( +- 0.32% )  (67.20%)
       465,440,570  L1-dcache-loads                                        ( +- 0.08% )  (67.20%)
       233,797,035  L1-dcache-stores                                       ( +- 0.12% )  (66.88%)
            62,034  L1-icache-load-misses                                  ( +- 2.20% )  (66.37%)

       0.195889775 seconds time elapsed
```
2. SSE
```
 Performance counter stats for './main' (50 runs):

        15,047,354  cache-misses           # 79.379 % of all cache refs    ( +- 0.28% )  (66.25%)
        18,956,381  cache-references                                       ( +- 0.18% )  (66.36%)
         8,408,231  L1-dcache-load-misses  # 1.89% of all L1-dcache hits   ( +- 0.12% )  (66.71%)
       443,824,150  L1-dcache-loads                                        ( +- 0.15% )  (67.34%)
       229,787,688  L1-dcache-stores                                       ( +- 0.13% )  (67.44%)
            90,578  L1-icache-load-misses                                  ( +- 1.63% )  (66.70%)

       0.356293017 seconds time elapsed
```
3. NAIVE
```
 Performance counter stats for './main' (50 runs):

        42,155,708  cache-misses           # 87.722 % of all cache refs    ( +- 0.21% )  (66.42%)
        48,056,099  cache-references                                       ( +- 0.20% )  (66.60%)
        21,011,298  L1-dcache-load-misses  # 3.84% of all L1-dcache hits   ( +- 0.07% )  (66.80%)
       547,771,641  L1-dcache-loads                                        ( +- 0.09% )  (67.18%)
       215,693,429  L1-dcache-stores                                       ( +- 0.13% )  (67.09%)
            76,732  L1-icache-load-misses                                  ( +- 1.87% )  (66.52%)

       0.432626384 seconds time elapsed                                    ( +- 0.20% )
```
The separated results show that prefetching loads the needed data into the cache ahead of time, lowering cache misses and thereby reducing execution time.
Consider the definition of a cache miss: accessing data that is not in the cache is a miss; if we know in advance which data will be accessed and load it into the cache first, those misses can be avoided.
### Improving performance with AVX
Reference: a previous student's implementation [link](https://embedded2015.hackpad.com/ep/pad/static/VGN4PI1cUxh)
### References
GHB [link](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf)
###### tags: `kobeyu`