2016q3 Homework3 (software-pipelining)

# 2016q3 Homework3 (software-pipelining) contributed by <`TotallyWrong`> ## 開發作業環境 **CPU** : Intel Core i5-5200U **Cache size**: L1 Cache 128KB, L2 Cache 512K, L3 Cache 3072KB **Operating System** : Ubuntu 15.10 Wily Werewolf --- **Feature**: * MMX instructions * SSE / Streaming SIMD Extensions * SSE2 / Streaming SIMD Extensions 2 * SSE3 / Streaming SIMD Extensions 3 * SSSE3 / Supplemental Streaming SIMD Extensions 3 * SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4 ? * AES / Advanced Encryption Standard instructions * AVX / Advanced Vector Extensions * AVX2 / Advanced Vector Extensions 2.0 --- ## Prefetching 相關閱讀再閱讀了老師所推荐的Modern Microprocessors A 90 Minute Guide!了解Memory Hierarchy 的大概時間差，這樣了解了Prefectching可能增加的效應。 PC Core i*4 Level Size Latency Physical Location L1 cache 32 KB 4 cycles inside each core L2 cache 256 KB 12 cycles beside each core L3 cache 6 MB ~21 cycles shared between all cores L4 E-cache 128 MB ~58 cycles separate eDRAM chip RAM 4+ GB ~117 cycles SDRAM DIMMs on motherboard Swap 100+ GB 10,000+ cycles hard disk or SSD 閱讀"When Prefetching Works, When It Doesn’t, and Why"時了解prefetching有分為CPU prefetching和 Software prefecthing，而不適當的Software prefetching 會影響CPU prefetching 而影響Overall的效能。內容提到幾個重點使用Prefecthing的時機： 1. 當有Short streamer 而CPU prefectcher 還來不及學會prefectch的Pattern。 2. 不規則資料結構例如 Hash 和Resursive。 3. 可以減少L1 Cache miss的時候。 4. 當過多的資料結構再同時進行運算時超過Hardware Prefetcher的數量。內容也提到Prefetch的時機點是需要控制的太早Cache可能會被換掉做白工太晚則Pefetching的效應不佳，而Prefecting也可以依需求放到L1 ,L2或L3 Cache就好。 ![](https://i.imgur.com/3UOzEOT.png) 再做Prefetch作業前我想了解一下 Intel CPU的Cache設計狀況，在搜尋後找Linux有提共這類資料查尋的方法。 `grep . /sys/devices/system/cpu/cpu0/cache/index*/*` 在了解後Intel broadwell的一個core　cache狀況: * L1 Data cache: 32KB * Cache Line : 64Byte * 一個Block有 : 64個Line * 8 way associative * L2 cache: 256KB * Cache Line : 64Byte * 一個Block有 : 512個Line * 8 way associative * L3 cache: 3072KB * Cache Line : 64Byte * 一個Block有 : 4096個Line * 12 way associative ## Prefetching 程式解析程式本身並不是很難懂就是建立一個4096x4096 Integer Array做Tranpose的動作，整個Array 的大小約是4096X4096X8個Byte大概128MB，看到這時了解原來4096這個數字是有考量過的數字。 >原本一直把int Array 當成4個Byte，後來照這樣的邏輯去Prefectch的效果很差，一直到後來才想起C語言在Array的處理是以指標的型式做處理，所以是64位元的Pointer是8個Byte。在所有Array都是align的情況下一行大概是8個Page，而L2 Cache一個block可以Hold一行,而Ｌ3的一個block可以hold八行。而且既有這程式中的Prefechter都只有Prefetch到L2 Cache而不是Ｌ1 Cache，Prefech 的方式似乎是讓L1 cache miss時可以在L2找到而不是直接替換Ｌ1 Cache，可能是考慮L1很小需要比較精準的計算。在看Code時就有點納悶為什麼，Write 不用Prefectch? 難道Intel的CPU是write through?　再讀過[Write buffer](https://en.wikipedia.org/wiki/Write_buffer)了解這個程式中Write會先被到CPU寫到Write buffer中等待對的時機才寫入，所以不用Prefetch。我也做了Prefetch Write 試試看結果果然是比較差 ``` Performance counter stats for ./main_AVX_prefetch' (100 runs): 9,271,268 cache-misses # 81.664 % of all cache refs 11,361,723 cache-references 1,163,935,141 instructions # 1.81 insns per cycle 640,361,365 cycles 0.242706453 seconds time elapsed ( +- 0.11% ) ``` ``` Performance counter stats for './main_AVX_prefetchV2' (100 runs): 8,830,566 cache-misses # 77.818 % of all cache refs 11,323,298 cache-references 1,163,997,144 instructions # 1.77 insns per cycle 658,548,796 cycles 0.248680126 seconds time elapsed ( +- 0.27% ) ``` ## Prefetching 程式修改在把SEE版本修改為AVX2版本時對AVX load 指令不是這麼熟在參考[許士杰](https://embedded2015.hackpad.com/-Homework-7-8-XZe3c94XjUh)學長的筆記後寫出AVX版和第一版AVX_Prefecth。 ``` clike= void AVX_prefetch_transpose(int *src, int *dst, int w, int h) { for (int x = 0; x < w; x += 8) { for (int y = 0; y < h; y += 8) { _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 4) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 5) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 6) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 7) *w + x, _MM_HINT_T1); __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x)); __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x)); __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x)); __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x)); __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x)); __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x)); __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x)); __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x)); __m256i T0 = _mm256_unpacklo_epi32(I0, I1); __m256i T1 = _mm256_unpackhi_epi32(I0, I1); __m256i T2 = _mm256_unpacklo_epi32(I2, I3); __m256i T3 = _mm256_unpackhi_epi32(I2, I3); __m256i T4 = _mm256_unpacklo_epi32(I4, I5); __m256i T5 = _mm256_unpackhi_epi32(I4, I5); __m256i T6 = _mm256_unpacklo_epi32(I6, I7); __m256i T7 = _mm256_unpackhi_epi32(I6, I7); I0 = _mm256_unpacklo_epi64(T0, T2); I1 = _mm256_unpackhi_epi64(T0, T2); I2 = _mm256_unpacklo_epi64(T1, T3); I3 = _mm256_unpackhi_epi64(T1, T3); I4 = _mm256_unpacklo_epi64(T4, T6); I5 = _mm256_unpackhi_epi64(T4, T6); I6 = _mm256_unpacklo_epi64(T5, T7); I7 = _mm256_unpackhi_epi64(T5, T7); _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0); _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1); _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2); _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3); _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), I0); _mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), I1); _mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), I2); _mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), I3); } } } ``` 在修改Makefile把各個方法獨立成為一個執行檔，並加入perf跑100次後的結果如下： **Naive :** ``` Performance counter stats for './main_naive' (100 runs): 23,910,730 cache-misses # 91.377 % of all cache refs 26,050,626 cache-references 1,448,938,705 instructions # 1.27 insns per cycle 1,150,593,694 cycles 0.432103936 seconds time elapsed ( +- 0.18% ) ``` **SSE :** ``` Performance counter stats for './main_sse' (100 runs): 11,043,650 cache-misses # 86.766 % of all cache refs 12,735,285 cache-references 1,236,998,307 instructions # 1.59 insns per cycle 789,708,333 cycles 0.297204120 seconds time elapsed ( +- 0.10% ) ``` **SSE Prefetch :** ``` Performance counter stats for './main_sse_prefetch' (100 runs): 8,184,734 cache-misses # 83.109 % of all cache refs 9,898,505 cache-references 1,282,988,040 instructions # 2.01 insns per cycle 640,139,684 cycles 0.238750082 seconds time elapsed ( +- 0.12% ) ``` **AVX :** ``` Performance counter stats for './main_AVX' (100 runs): 8,791,762 cache-misses # 85.228 % of all cache refs 10,313,159 cache-references 1,140,901,574 instructions # 1.80 insns per cycle 637,450,976 cycles 0.239936268 seconds time elapsed ( +- 0.11% ) ``` **AVX Prefetching:** ``` Performance counter stats for './main_AVX_prefetch' (100 runs): 9,196,764 cache-misses # 81.035 % of all cache refs 11,320,842 cache-references 1,163,973,064 instructions # 1.87 insns per cycle 619,061,075 cycles 0.235337862 seconds time elapsed ( +- 0.11% ) ``` 這個結果顯示出的頻均時間是 AVX Prefetch > SSE Prefetch > AVX > SSE >Naive 。在幾次測試後發現似乎因為Ｌ2是8 way associatives 所以原本一次Prefetch　8個block 會把一些既有而且還在用的Block換掉這似乎會脫慢一些時間，在幾次測試過後發現把Prefetching的時間分開會變快。還有整個矩陣是128MB所以L3　cache是絕對塞不下的，所以有時做prefecthing需要去Memory搬，如果能先從Ｍemory　prefetch到Ｌ3 這樣之後要prefetch 到Ｌ2快一點。做過這兩個變動後有了 AVX Prefetch V2 速度的確有提升一點，程式碼和結果如下。 **AVX Prefetch V2 Code :** ``` clike= void AVX_prefetchV2_transpose(int *src, int *dst, int w, int h) { for (int x = 0; x < w; x += 8) { for (int y = 0; y < h; y += 8) { _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1); __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x)); __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x)); __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x)); __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x)); __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x)); __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x)); __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x)); __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x)); __m256i T0 = _mm256_unpacklo_epi32(I0, I1); __m256i T1 = _mm256_unpackhi_epi32(I0, I1); __m256i T2 = _mm256_unpacklo_epi32(I2, I3); __m256i T3 = _mm256_unpackhi_epi32(I2, I3); __m256i T4 = _mm256_unpacklo_epi32(I4, I5); __m256i T5 = _mm256_unpackhi_epi32(I4, I5); __m256i T6 = _mm256_unpacklo_epi32(I6, I7); __m256i T7 = _mm256_unpackhi_epi32(I6, I7); _mm_prefetch(src+(y + PFDIST + 4) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 5) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 6) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 7) *w + x, _MM_HINT_T1); I0 = _mm256_unpacklo_epi64(T0, T2); I1 = _mm256_unpackhi_epi64(T0, T2); I2 = _mm256_unpacklo_epi64(T1, T3); I3 = _mm256_unpackhi_epi64(T1, T3); I4 = _mm256_unpacklo_epi64(T4, T6); I5 = _mm256_unpackhi_epi64(T4, T6); I6 = _mm256_unpacklo_epi64(T5, T7); I7 = _mm256_unpackhi_epi64(T5, T7); _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0); _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1); _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2); _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3); _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), I0); _mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), I1); _mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), I2); _mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), I3); _mm_prefetch(src+(y + PFDIST2 + 0) *w + x, _MM_HINT_T2); } } } ``` **AVX Prefetch V2 結果 :** ``` 8,744,585 cache-misses # 83.729 % of all cache refs 10,436,599 cache-references 1,166,862,788 instructions # 1.91 insns per cycle 615,119,198 cycles 0.230133365 seconds time elapsed ( +- 0.12% ) ``` 最後結果AVXPrefetchV2　0.230133365seconds　比　AVXPrefetch　快了約2%，比想像中的小很多這也說明Prefetch還是要考量到成本假如改良幅度比成本還小可能反而會使程式變慢而且CPU也有既有的Prefetching機制隨便亂使用software prefetching機制也會有干擾。 ## 結論在做過許多嘗試後發現Prefetch的使用跟cache block的大小和associate的degree有很大的關聯。要完全掌握使用時機和prefetch ###### tags: `TotallyWrong` `On going`

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.