2016q3 Homework4 (software-pipelining)

# 2016q3 Homework4 (software-pipelining) contributed by <`fzzzz`> ## 什麼是 prefectching? 節錄自wikipedia In computer architecture, instruction prefetch is a technique used in central processor units to speed up the execution of a program by reducing wait states. ## 論文閱讀 [論文 When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf) >論文的部份只有稍微掃過一次，大部份還是看學長姐的筆記 ### Key points * source code-level analysis * study of the synergistic(協同) and antagonistic（拮抗） effects between sw/hw * evaluation of HW prefetching training policies ### Software / Hardware Prefetching 的背景知識 * Software : 通常會在程式碼間安插一些 extension 的 intrinsics ，來告訴 compiler 哪邊要做 prefetch 及　prefetch distance * prefetch 要有實際效果，傳送要求時間點要夠早，要能和 memory latency 抵銷。 * Hardware : 利用 CPU 中的 GHB(Global History Buffer)，通常需要做 training，要經過幾次的 cache misses 來偵測。 * Prefetch distance : * ![](https://i.imgur.com/wVw6qIr.png) * l : prefetch latency (memory) * s : 迴圈當中的最短路徑 ### 文章閱讀概要現在內建有 software prefetching algorithms 的 compiler 有 icc(intel) / gcc，但是對於如何安插 prefetch instrinsics 的 guideline 相關文件不多，再來就是對於軟體及硬體 prefetching 的分界是不清楚的。(沒有夠嚴謹的標準存在)，所以這篇論文表示他想要提供一個 prefetch distance 標準。 ![](https://i.imgur.com/xq57kqO.png) * 左Y軸為正規化後執行時間（除以未做 prefetch 的執行時間） * 右Y軸為加速倍率 * 從這張圖的結果來看，最好的結果是　SW+STR (stream prefetcher) or SW+GHB (stride prefetcher) 結論為 * software prefetching 在　short array streams , irregular memory address patterns 和 L1 cache reduction　上有較為顯著的正面影響。 #### Synergistic effects using software + hardware prefetching * 可以同時處理多個 data streams 。 * software prefetching能夠解決不規則的 stream ，而　hardware　則去做規律的部份。 * Positive training *　如果一個 software prefetch 太慢的話，可以藉由 hardware prefetch來增加 prefetch 的即時性。 #### Antagonistic Effects when Using Software and Hardware Prefetching * Negative training * 沒有完整地對所有 stream 做 sw prefetch的話，會使　hardware prefecting 的訓練結果不正確或過度訓練造成工作量過大。 * Harmful software prefecting * 通常 hardware prefecthing 的精度會低於 software ，而當 prefecting request 出現的時間點不對或是過早的話，會造成再記憶體及快取頻寬上的負擔，甚至是降低 hardware prefetching 的效率。 ## 作業實驗結果 ( native / sse / sse-prefetch ) ### 更改 Makefile 並重新執行 ``` native_transpose: $(GIT_HOOKS) main.c $(CC) $(CFLAGS_common) -DIMPL="\"$@.h\"" -o $@ main.c echo $@ sse_transpose: $(GIT_HOOKS) main.c $(CC) $(CFLAGS_common) $(CFLAGS_sse) -DIMPL="\"$@.h\"" -o $@ main.c sse_prefetch_transpose: $(GIT_HOOKS) main.c $(CC) $(CFLAGS_common) $(CFLAGS_sse) -DIMPL="\"$@.h\"" -o $@ main.c ``` native transpose ``` fzzzz@fzzzz-Aspire-V3-571G:~/Workspace/prefetcher$ ./native_transpose 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 native_transpose.h: 237301 us ``` sse transpose ``` fzzzz@fzzzz-Aspire-V3-571G:~/Workspace/prefetcher$ ./sse_transpose 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse_transpose.h: 123182 us ``` sse transpose using software prefetching ``` fzzzz@fzzzz-Aspire-V3-571G:~/Workspace/prefetcher$ ./sse_prefetch_transpose 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse_prefetch_transpose.h: 60141 us ``` ### 修改成 AVX 版本這邊筆記一下會用到的 function * `__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)` * 將 src1 與 src2 的低/高位元以交錯的方式做排列，尾端的 32 就是每次交錯排列的位元數 * `__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)` * 將 src1 與　src2 各取一半進行排列， * 方法如下 ``` SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0 ``` 修改過後的程式碼 [( code )](https://github.com/FZzzz/prefetcher/blob/master/avx_transpose.h) #### 在桌機上跑（筆電沒有支援到 AVX2） ``` ./native_transpose native_transpose.h: 308829 us ./sse_transpose sse_transpose.h: 157873 us ./sse_prefetch_transpose sse_prefetch_transpose.h: 80256 us ./avx_transpose avx_transpose.h: 85646 us ./avx_prefetch_transpose avx_prefetch_transpose.h: 84487 us ``` 其實在看完論文和學長姐的筆記後，我還是不太知道怎麼取 prefetch distance。即便是按照內文提到的演算法來算的話，也仍然會因為各處理器 IPC 不同，而有不同的 prefetch distance。所以總結了一下想法，我還是全都跑一次看哪個 prefetch distance 會帶來最好的效果。下圖為 avx_prefetch 在各個　prefetch distance 下的執行時間。 ![](https://i.imgur.com/96kjz1B.png) 可以看到在 prefetch distance = 8 的時候最快 #### 各版本的 cache misses native_transpose ``` Performance counter stats for './native_transpose' (5 runs): 16,543,459 cache-misses # 91.690 % of all cache refs ( +- 0.05% ) 18,042,742 cache-references ( +- 0.03% ) 1,448,816,603 instructions # 0.86 insns per cycle ( +- 0.00% ) 1,690,823,261 cycles ( +- 0.48% ) 0.453539595 seconds time elapsed ( +- 1.14% ) ``` sse_transpose ``` Performance counter stats for './sse_transpose' (5 runs): 4,271,798 cache-misses # 78.221 % of all cache refs ( +- 0.18% ) 5,461,196 cache-references ( +- 0.09% ) 1,236,784,781 instructions # 1.17 insns per cycle ( +- 0.00% ) 1,061,389,121 cycles ( +- 1.19% ) 0.303614305 seconds time elapsed ( +- 7.57% ) ``` sse_prefetch_transpose ``` Performance counter stats for './sse_prefetch_transpose' (5 runs): 4,272,500 cache-misses # 78.128 % of all cache refs ( +- 0.21% ) 5,468,597 cache-references ( +- 0.02% ) 1,282,833,253 instructions # 1.60 insns per cycle ( +- 0.00% ) 803,852,552 cycles ( +- 0.37% ) 0.213622428 seconds time elapsed ( +- 0.50% ) ``` avx_transpose ``` Performance counter stats for './avx_transpose' (5 runs): 3,243,450 cache-misses # 73.077 % of all cache refs ( +- 0.03% ) 4,438,375 cache-references ( +- 0.09% ) 1,149,164,499 instructions # 1.42 insns per cycle ( +- 0.00% ) 810,729,581 cycles ( +- 0.30% ) 0.219928612 seconds time elapsed ( +- 2.58% ) ``` avx_prefetch_tanspose ( PFDIST = 8 ) ``` Performance counter stats for './avx_prefetch_transpose' (5 runs): 3,269,625 cache-misses # 50.112 % of all cache refs ( +- 0.02% ) 6,524,581 cache-references ( +- 0.04% ) 1,172,166,712 instructions # 1.42 insns per cycle ( +- 0.00% ) 825,967,980 cycles ( +- 0.31% ) 0.218844538 seconds time elapsed ( +- 0.29% ) ``` 從 cache misses 的比例上來看可以看見 avx 在做了 prefetch 之後，有很大幅度的改善。但在執行時間上仍然是和 avx_transpose 與 sse_prefetch_transpose 沒有相差太多。（甚至是更慢的），另外 avx 的兩個版本的 IPC 均比 sse_prefetch 來的低。我個人猜測這很可能是為什麼 avx 效能上不及 sse_prefetch 的原因（整整少了10% 的 IPC ）。 ## Reference * [Instruction Prefetching](https://en.wikipedia.org/wiki/Instruction_prefetch) * [作業網站](https://hackmd.io/s/ry7eqDEC) * [Carol Chen的筆記](https://hackmd.io/s/rk8ivCT0) * [許元杰的筆記](https://embedded2015.hackpad.com/-Homework-7-8-XZe3c94XjUh) * [周曠的宇的共筆](https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh) ###### tags: `fzzzz` `sysprog`