# 2017q1 Homework3 (software-pipelining)
contributed by <`ryanwang522`>
* [Github](https://github.com/ryanwang522/prefetcher)

---

## Development Environment
* OS: Ubuntu 16.04 LTS
* L1d cache: 32K
* L1i cache: 32K
* L2 cache: 256K
* L3 cache: 3072K
* Architecture: x86_64
* CPU op-mode(s): 32-bit, 64-bit
* Byte Order: Little Endian
* CPU(s): 4
* Model name: Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz

## Paper Reading
* Read ["When Prefetching Works, When It Doesn't, and Why"](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
* Also consulted the [notes with key points and commentary](https://hackmd.io/s/HJtfT3icx)
* On the first pass many of the terms were only half familiar, so I decided to look at the assignment's code first.

>>The [assignment description](https://hackmd.io/s/rks62p1sl#) reminds students:
>>Be sure to read the paper beforehand; diving straight into the code without it only leads to confusion!
>>Please read the paper first.
>>[name=Course TA][color=red]

## First Attempt
* Output after compiling and running:
```
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 86235 us
sse: 171542 us
naive: 359344 us
```
* While reading `impl.c` I ran into quite a few instructions I did not recognize
    * See [Integer Memory and Initialization Using Streaming SIMD Extensions 2](https://msdn.microsoft.com/zh-tw/library/hfhxtdwx(v=vs.100).aspx)
    * It documents `_mm_loadu_si128`, `_mm_unpacklo_epi32`, `_mm_storeu_si128`, and the other SSE intrinsics that appear here.
* It took a while to understand the matrix-transpose implementation
    * Working the steps out on paper makes it much easier to follow (a worked trace is included after the two code listings below)
    * While searching I also found a clear, simple conceptual diagram; I wish I had seen it sooner
    * ![](https://i.imgur.com/BxvN77F.png)
* Next came the prefetching concepts; see [Cache prefetching](https://en.wikipedia.org/wiki/Cache_prefetching)
    * `PFDIST` is defined as 8, while each loop iteration processes 4 rows, so each iteration prefetches into the cache the data that will be needed two iterations later (the iteration after next).
* SSE code
```C=
void sse_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```
* SSE_prefetch code
```C=
#define PFDIST 8

void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            _mm_prefetch(src + (y + PFDIST + 0) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 1) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 2) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 3) * w + x, _MM_HINT_T1);
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```
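* To double-check the pen-and-paper derivation of the unpack steps, here is a small stand-alone trace I wrote separately (my own sketch, not part of the project; it uses the 4x4 test matrix 0..15 from the output above and can be built with something like `gcc -msse2 -std=gnu99`):
```c
#include <stdio.h>
#include <emmintrin.h>

/* Print the four 32-bit lanes of an SSE register in memory order. */
static void dump(const char *name, __m128i v)
{
    int out[4];
    _mm_storeu_si128((__m128i *) out, v);
    printf("%s = {%2d %2d %2d %2d}\n", name, out[0], out[1], out[2], out[3]);
}

int main(void)
{
    int src[16];
    for (int i = 0; i < 16; i++)
        src[i] = i;

    __m128i I0 = _mm_loadu_si128((__m128i *)(src + 0));   /* { 0  1  2  3} */
    __m128i I1 = _mm_loadu_si128((__m128i *)(src + 4));   /* { 4  5  6  7} */
    __m128i I2 = _mm_loadu_si128((__m128i *)(src + 8));   /* { 8  9 10 11} */
    __m128i I3 = _mm_loadu_si128((__m128i *)(src + 12));  /* {12 13 14 15} */

    /* Step 1: interleave 32-bit lanes of adjacent rows. */
    __m128i T0 = _mm_unpacklo_epi32(I0, I1);  /* { 0  4  1  5} */
    __m128i T1 = _mm_unpacklo_epi32(I2, I3);  /* { 8 12  9 13} */
    __m128i T2 = _mm_unpackhi_epi32(I0, I1);  /* { 2  6  3  7} */
    __m128i T3 = _mm_unpackhi_epi32(I2, I3);  /* {10 14 11 15} */

    /* Step 2: interleave 64-bit halves; each result is one column of src. */
    dump("col0", _mm_unpacklo_epi64(T0, T1)); /* { 0  4  8 12} */
    dump("col1", _mm_unpackhi_epi64(T0, T1)); /* { 1  5  9 13} */
    dump("col2", _mm_unpacklo_epi64(T2, T3)); /* { 2  6 10 14} */
    dump("col3", _mm_unpackhi_epi64(T2, T3)); /* { 3  7 11 15} */
    return 0;
}
```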
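* For reference, the `naive` baseline used in the comparisons below is not quoted here; presumably it is just a plain element-by-element copy along these lines (a sketch of my assumption, not the actual `impl.c`):
```c
/* Assumed naive baseline: one scalar load and one strided store per element,
 * which is what makes it slow on large matrices compared with the SSE paths. */
void naive_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            *(dst + x * h + y) = *(src + y * w + x);
}
```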
## Comparing `naive`, `SSE` and `SSE_Prefetch`
* First, borrowing the OOP ideas from [你所不知道的 C 語言:物件導向程式設計篇](https://hackmd.io/s/HJLyQaQMl), I built a common calling interface for the different functions
```c=
typedef void (*func_t)(int *src, int *dst, int w, int h);

typedef struct __Object Object;
struct __Object {
    func_t trans_func;
};
```
* Each implementation then gets an init function that points the function pointer at the routine to be used
```C=
int init_sse(Object **self)
{
    if ((*self = malloc(sizeof(Object))) == NULL)
        return -1;
    (*self)->trans_func = sse_transpose;
    return 0;
}
...
```
* Finally, conditional compilation driven by the `-D` flag given at build time produces a separate executable for each function.
```C=
/* create an object and init it with the corresponding define */
Object *interface = NULL;
#ifdef NAIVE
    if (init_naive(&interface) == -1) {
        printf("init naive error.\n");
        return -1;
    }
#elif defined(SSE_PREFETCH)
    if (init_sse_prefetch(&interface) == -1) {
        printf("init sse_prefetch error.\n");
        return -1;
    }
#else
    if (init_sse(&interface) == -1) {
        printf("init sse error.\n");
        return -1;
    }
#endif

interface->trans_func(testin, testout, 4, 4);
```
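* With this, one source tree yields the three executables used below just by changing the macro at build time. Hypothetically the build commands look roughly like this (the compiler flags and the assumption that `main.c` pulls in `impl.c` are mine, not the project's actual Makefile):
```
# assuming main.c pulls in the implementations from impl.c
$ gcc -msse2 -O0 -std=gnu99 -DNAIVE -o naive main.c
$ gcc -msse2 -O0 -std=gnu99 -o sse_transpose main.c
$ gcc -msse2 -O0 -std=gnu99 -DSSE_PREFETCH -o sse_prefetch_transpose main.c
```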
* Modified `main.c` so that each binary exercises a single implementation, which makes it straightforward to compare the three functions with perf.
* Drop the caches before each run: `$ echo 1 | sudo tee /proc/sys/vm/drop_caches`
```
Performance counter stats for './naive' (100 runs):

            955  cache-misses      # 7.011 % of all cache refs   ( +- 13.85% )
         1,3616  cache-references                                 ( +-  0.72% )
        63,5161  instructions      # 0.76  insn per cycle         ( +-  0.25% )
        84,0346  cycles                                           ( +-  6.58% )

    0.000523424 seconds time elapsed                              ( +-  7.21% )
```
```
Performance counter stats for './sse_transpose' (100 runs):

            763  cache-misses      # 5.590 % of all cache refs   ( +- 15.07% )
         1,3645  cache-references                                 ( +-  0.66% )
        63,6933  instructions      # 0.82  insn per cycle         ( +-  0.22% )
        78,0010  cycles                                           ( +-  5.09% )

    0.000545729 seconds time elapsed                              ( +- 23.24% )
```
```
Performance counter stats for './sse_prefetch_transpose' (100 runs):

            752  cache-misses      # 5.688 % of all cache refs   ( +- 17.94% )
         1,3215  cache-references                                 ( +-  0.60% )
        63,0374  instructions      # 0.84  insn per cycle         ( +-  0.20% )
        74,8224  cycles                                           ( +-  5.34% )

    0.000507983 seconds time elapsed                              ( +- 18.33% )
```
* Looking at the results, the cache-miss counts barely differ, and the prefetch version shows no obvious improvement in cache-miss percentage either, yet its execution time is clearly lower.
    * One thought: prefetching itself also produces cache misses (a fetch that has to go to a higher-level memory unit still counts as a miss).
    * A more detailed performance evaluation is needed.

### Analysis with perf raw counters
* Following [kaizsv's notes](https://hackmd.io/s/r1IWtb9R), perf can do even more detailed measurements. Cross-checking Chapter 19 of the [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html), my CPU is an i5-4200H, which the manual lists as a 4th-generation (Haswell) part.
* I picked out some events related to cache misses
* `If you find one you want to instrument, you can specify it as a raw event with the format: rUUEE, where UU == umask, and EE == event number.`
    * For example, MEM_LOAD_UOPS_RETIRED.L1_MISS (event number D1H, umask 08H) becomes the raw event `r08d1`.

|Event Num|Umask Value|Event Mask Mnemonic|Description|
|-|-|-|-|
|4CH|01H|LOAD_HIT_PRE.SW_PF|Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch.|
|4CH|02H|LOAD_HIT_PRE.HW_PF|Non-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch.|
|D1H|01H|MEM_LOAD_UOPS_RETIRED.L1_HIT|Retired load uops with L1 cache hits as data sources.|
|D1H|02H|MEM_LOAD_UOPS_RETIRED.L2_HIT|Retired load uops with L2 cache hits as data sources.|
|D1H|04H|MEM_LOAD_UOPS_RETIRED.L3_HIT|Retired load uops with L3 cache hits as data sources.|
|D1H|08H|MEM_LOAD_UOPS_RETIRED.L1_MISS|Retired load uops missed L1 cache as data sources.|
|D1H|10H|MEM_LOAD_UOPS_RETIRED.L2_MISS|Retired load uops missed L2. Unknown data source excluded.|
|D1H|20H|MEM_LOAD_UOPS_RETIRED.L3_MISS|Retired load uops missed L3. Excludes unknown data source.|

```
$ sudo perf stat --repeat 100 -e cache-misses,cache-references,r014C,r024C,r01D1,r02d1,r04d1,r08d1,r10d1,r20d1 ./sse_prefetch_transpose

Performance counter stats for './sse_prefetch_transpose' (100 runs):

           1042  cache-misses      # 7.732 % of all cache refs   ( +- 14.03% )
         1,3473  cache-references                                 ( +-  0.83% )
             15  r014C                                            ( +- 13.45% )
           6388  r024C                                            ( +-  5.34% )
  <not counted>  r01D1                                            (0.00%)
  <not counted>  r02d1                                            (0.00%)
  <not counted>  r04d1                                            (0.00%)
  <not counted>  r08d1                                            (0.00%)
  <not counted>  r10d1                                            (0.00%)
  <not counted>  r20d1                                            (0.00%)

    0.000501777 seconds time elapsed                              ( +-  8.75% )
```
* ~~At first I could not tell what had gone wrong; the raw counters simply produced nothing.~~
* It turned out that leftover debugging changes had never been cleaned up: the 4096 * 4096 transpose was skipped and the program only transposed a single 4 * 4 matrix, which is why the raw counters had essentially nothing to count.
* While at it, I also reworked the earlier interface implementation to make it more reusable.
* Run perf stat:
```
$ sudo perf stat --repeat 100 -e cache-misses,cache-references,r014C,r024C,r01D1,r02d1,r04d1,r08d1,r10d1,r20d1 ./naive

Performance counter stats for './naive' (100 runs):

      1556,1539  cache-misses  # 88.602 % of all cache refs       ( +-  0.38% )  (19.95%)
      1756,3494  cache-references                                  ( +-  0.25% )  (20.84%)
         2,8704  r014C  (LOAD_HIT_PRE.SW_PF)                       ( +-  3.65% )  (31.41%)
        22,5545  r024C  (LOAD_HIT_PRE.HW_PF)                       ( +-  6.06% )  (41.65%)
    5,2906,5443  r01D1  (MEM_LOAD_UOPS_RETIRED.L1_HIT)             ( +-  0.16% )  (40.03%)
        65,1904  r02d1  (MEM_LOAD_UOPS_RETIRED.L2_HIT)             ( +-  1.78% )  (37.99%)
       145,5382  r04d1  (MEM_LOAD_UOPS_RETIRED.L3_HIT)             ( +-  2.57% )  (20.16%)
      1639,8087  r08d1  (MEM_LOAD_UOPS_RETIRED.L1_MISS)            ( +-  0.19% )  (19.96%)
      1562,3707  r10d1  (MEM_LOAD_UOPS_RETIRED.L2_MISS)            ( +-  0.21% )  (19.83%)
      1421,5128  r20d1  (MEM_LOAD_UOPS_RETIRED.L3_MISS)            ( +-  0.37% )  (19.68%)

---

$ sudo perf stat --repeat 100 -e cache-misses,cache-references,r014C,r024C,r01D1,r02d1,r04d1,r08d1,r10d1,r20d1 ./sse_transpose

Performance counter stats for './sse_transpose' (100 runs):

       428,6746  cache-misses  # 74.599 % of all cache refs       ( +-  0.54% )  (20.53%)
       574,6364  cache-references                                  ( +-  0.51% )  (21.49%)
         1,3171  r014C                                             ( +-  3.67% )  (32.00%)
        14,3591  r024C                                             ( +-  2.24% )  (42.14%)
    4,4080,0600  r01D1                                             ( +-  0.21% )  (39.35%)
        37,4749  r02d1                                             ( +-  1.26% )  (35.94%)
        25,6156  r04d1                                             ( +-  2.06% )  (19.55%)
       404,8482  r08d1                                             ( +-  0.54% )  (20.19%)
       376,3449  r10d1                                             ( +-  0.65% )  (20.22%)
       359,4998  r20d1                                             ( +-  0.73% )  (20.07%)

---

$ sudo perf stat --repeat 100 -e cache-misses,cache-references,r014C,r024C,r01D1,r02d1,r04d1,r08d1,r10d1,r20d1 ./sse_prefetch_transpose

Performance counter stats for './sse_prefetch_transpose' (100 runs):

       425,0698  cache-misses  # 74.312 % of all cache refs       ( +-  1.04% )  (20.22%)
       572,0070  cache-references                                  ( +-  1.10% )  (21.65%)
       330,9106  r014C                                             ( +-  1.60% )  (32.59%)
        11,1900  r024C                                             ( +-  3.56% )  (43.04%)
    4,6314,2283  r01D1                                             ( +-  0.22% )  (40.01%)
       365,8195  r02d1                                             ( +-  0.67% )  (36.26%)
         5,3485  r04d1                                             ( +-  8.65% )  (20.09%)
       311,3756  r08d1                                             ( +-  0.97% )  (19.76%)
        26,0548  r10d1                                             ( +-  5.63% )  (19.70%)
        21,1605  r20d1                                             ( +-  5.71% )  (19.62%)
```
* The jump in `r014C` (LOAD_HIT_PRE.SW_PF) shows that the software prefetches are being issued and used as expected.
* The prefetch version also cuts cache misses considerably (in particular the L2/L3 miss counts `r10d1` and `r20d1` drop sharply), which illustrates how prefetching hides memory latency and produces the large speedup.
* However, the prefetch version shows a clear increase in L2 cache hits (`r02d1`).
    * I suspect this is related to the prefetch hint (_MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA): the code uses _MM_HINT_T1, which fetches data into the L2 level (and beyond) rather than L1, so loads that would otherwise miss now hit in L2.

---

## Trying a Memory Pool
* After seeing `jserv`'s suggestion on GitHub, I wanted to try how a memory pool affects performance.
* `main.c` only calls malloc() twice, for the 4096 * 4096 integer matrices, so I am not sure the memory pool can really show its strength; the overhead of creating it may well outweigh the benefit.
* Implementation
```C=
Pool *pool_create(size_t size)
{
    Pool *poolPtr = (Pool *) malloc(size * sizeof(int) + sizeof(Pool));
    if (poolPtr == NULL) {
        printf("Pool Create Failed.\n");
        return NULL;
    }
    poolPtr->nextAvail = (int *) &poolPtr[1];
    poolPtr->end = poolPtr->nextAvail + size;
    return poolPtr;
}

void pool_free(Pool *poolPtr)
{
    free(poolPtr);
}

int pool_available(Pool *poolPtr, size_t size)
{
    /* alignment is not considered yet */
    return ((size_t)(poolPtr->end - poolPtr->nextAvail)) < size ? 0 : 1;
}

void *pool_alloc(Pool *poolPtr, size_t size)
{
    int check;
    if ((check = pool_available(poolPtr, size)) == 0)
        return NULL;
    void *mem = (void *) poolPtr->nextAvail;
    poolPtr->nextAvail += size;
    return mem;
}
```
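* In `main.c` the two `malloc()` calls can then presumably be replaced by one pool plus two `pool_alloc()` calls. A minimal usage sketch, assuming sizes are counted in ints (which is how `pool_create()`/`pool_alloc()` above count) and that the buffers keep the `testin`/`testout` names; the `TEST_W`/`TEST_H` macros are my own:
```c
#define TEST_W 4096
#define TEST_H 4096

/* Inside main(), replacing the original malloc() calls:
 * one pool large enough for both the source and destination matrices. */
Pool *pool = pool_create(2 * TEST_W * TEST_H);
int *testin  = (int *) pool_alloc(pool, TEST_W * TEST_H);
int *testout = (int *) pool_alloc(pool, TEST_W * TEST_H);
if (!testin || !testout)
    return -1;

/* ... fill testin, run and time the transposes as before ... */

pool_free(pool);   /* a single free() releases both matrices */
```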
* Then compare efficiency
    * Compared with the original version, the elapsed time shrinks only slightly; the `malloc()` overhead here is probably not large enough for the pool to produce a noticeable improvement.
    * ![](https://i.imgur.com/tu5kixM.png)
```
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
naive: 341012 us
sse: 160502 us
sse_prefetch: 94221 us
```

## AVX Version

## References
[4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
[Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx)