2017q1 Homework03 (software-pipelining)

# 2017q1 Homework03 (software-pipelining) contributed by <`hugikun999`> ## [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf) * 矩陣或遞迴式容易預測的，軟體排程較容易;hashing 相對則是叫不容易預測。 * `void _mm_prefetch(char * p , int i )` **建議** compiler 預取資料進入 cache。 [參考網站](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_7.html) * `void _mm_stream_ps(float * p , __m128 a );` 要求資料寫入時，直接寫入記憶體而非 cache 中。 * Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency。 * Prefetch distance for an array D $\geq$ $\lceil l/S\rceil$ l: prefetch latency S: the length of the shortest path through the loop body * If prefetch distance is too large,prefetched data could evict useful cache blocks。 * Two types of memory indexing * Direct easily prefetched by hardware * Indirect easily prefetched by software * Benefits of Software Prefetching over Hardware Prefetching * Large Number of Streams Not be limited by hardware resource * Short Streams Hardware need to be trained.If the length of stream isextremely short, there will not be enough cache misses for hardware to load useful cache blocks. * Irregular Memory Access * Cache Locality Hint Greater flexibility of placing data in the right cache level with hint. * Loop Bounds Some methods to detect loop bounds such as loop unrolling, software pipelining, and using branch instructions. * Negative Impacts of Software Prefetching * Increased Instruction Count. * Static Insertion cannot adapt to run time behavioral changes. * Code Structure Change When the number of instructions in a loop is extremely small,it can be challenging to insert software prefetching instructions. ### 名詞解釋 * prefetch distance 預取的距離。 * degree of prefetching 預取幾個 cache line。假設預取512 bytes 且每一個 cache line 大小是 256 bytes，則 degree 為2。 * intrinsics 型態類似一般的函式，但會被 compiler 直接譯為組語。 * [loop splitting](http://stackoverflow.com/questions/13866962/what-is-the-overhead-in-splitting-a-for-loop-into-multiple-for-loops-if-the-tot) 即將一個 loop 中互相獨立的事件分開放在不同的 loop 底下，這個方法不一定對程式有優化的效果，端看在 loop 中做的事情決定是否會使程式效能提升。在下面的測試中，拆成3個 for loop 會使程式變得**更耗費時間**，這個結果能符合預期，因為 for loop 越多代表條件判斷也會越多，branch predicter 的預測不對也會隨之增加。 ```C++ for (int i = 0; i < 1000000; i++){ #ifdef TOGETHER for (int c = 0; c < size; c++){ data[c] *= 10; data[c] += 7; data[c] &= 15; } #else for (int c = 0; c < size; c++){ data[c] *= 10; } for (int c = 0; c < size; c++){ data[c] += 7; } for (int c = 0; c < size; c++){ data[c] &= 15; } #endif } ``` > [Loop Transformations](http://users.ices.utexas.edu/~lenharth/cs378/fall14/8%20-%20LoopOptimizations.pdf) > 這裏面有提到許多關於 loop 方面的處理技巧 ## Prefetch ### file analyze * _mm_loadu_si128 這個 function 在 [intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_loadu_si128&expand=3043,3144) 裡會有兩個結果，差異在於 `_mm_lddqu_si128` 有特別注明當資料有跨 cache line 時**可能**會比 `_mm_loadu_si128` 效能更好。 > 這邊有個疑惑，在 main 中只有 `#include <xmmintrin.h>` 並沒有 `#include "emmintrin.h"`，這不符合 intel 的對於這個 API 的說明。有嘗試將其改過，但並不影響 compile 和其數據。 ```shell= <mmintrin.h> MMX <xmmintrin.h> SSE <emmintrin.h> SSE2 <pmmintrin.h> SSE3 <tmmintrin.h> SSSE3 <smmintrin.h> SSE4.1 <nmmintrin.h> SSE4.2 <ammintrin.h> SSE4A <wmmintrin.h> AES <immintrin.h> AVX <zmmintrin.h> AVX512 ``` ### 優化 #### _mm_lddqu_si128 先嘗試在[intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_loadu_si128&expand=3043,3144)到的這個 API，但是測試出來沒有特別的加快，應該再另外設計實驗，看這個 API 在跨 cache line 時讀資料的效能是否較好。 ```shell sse prefetch: 57656 us sse: 119557 us sse_lddqu: 118414 us naive: 258989 us ``` [A faster integer SSE unalligned load that's rarely used ](http://stackoverflow.com/questions/38370622/a-faster-integer-sse-unalligned-load-thats-rarely-used) 嘗試改變所運算的矩陣大小，看是否會因為矩陣而影響運算的效能，另一個原因是想嘗試 `_mm_lddqu_si128` 效能在 cross cache line 時會不會如官方所述比 `_mm_loadu_si128` 來的好，但是從圖表的結果看來，`sse_lddqu_align` 是多了對指標做 align 的動作，我在這邊的 align 大小是16 bytes，這邊是為了比較另一個 API,`_mm_load_si128`。 :::info 這邊的實作基於 SSE 上，因此每128 bits 會被當作一個單位，所以不是4的倍數在運算的時候會有問題，要特別注意。 ::: 下面三張是用同樣的數據畫的，因為疊再一起不易分辨所以分開化了三張，從第一張圖可以看到其實`sse_lddqu_align` 在大部份的大小花的時間是最多的。 ![](https://i.imgur.com/yMKBQPv.png) ![](https://i.imgur.com/aAzL4kF.png) ![](https://i.imgur.com/n2dQd9A.png) 下面這張比較清楚，將 height 減少為 1000，可以明顯看到 `sse-lddqu-align` 花費的時間式最多的。且 `_mm_lddqu_si128` 的速度比 `_mm_loadu_si128` 還來得快一些，但是 align(16) 的就明顯比其他兩種方法都慢。 > align(16)的上下跳動幅度很大，當 width 變大時另外兩個的變動幅度也變大了。[memory aligned](https://wr.informatik.uni-hamburg.de/_media/teaching/wintersemester_2013_2014/epc-14-haase-svenhendrik-alignmentinc-paper.pdf) > 目前對於時間大幅跳動沒有什麼想法。另外對於是否有確實做 align 的動做也無法確定? ![](https://i.imgur.com/gUbM6I0.png) 在每次執行間都加上清空 cache 的動作，`sse-lddqu-align`的時間明顯下降到與 `sse` 相同了。 ``` $ echo 1 | sudo tee /proc/sys/vm/drop_caches ``` >1：Clear PageCache 2：Clear dentries and inodes 3：Clear PageCache, dentries and inodes ![](https://i.imgur.com/lzeTQHJ.png) #### prefetch distance 每次要預取多遠的地方是由 programmer 指定的，這個距離取多少是很重要的，太早或太晚預取都會讓 cache 的使用效率降低，太早會將有用的資訊提早踢出，太晚又會達不到預取的原本目的。下面的圖可以很清楚的看出預取使得效能上的提升，上升的斜率明顯小於不做預取的 function，但是在距離間的效能差卻不是很明顯，大致上 distance 為4的時候比其他的要慢了一些。 ![](https://i.imgur.com/628onz1.png) 將運算的矩陣變大觀察，distance 為4時幾乎全部的花費時間都是最高的。在3750~4500之間的運算時間開始有大幅跳動的現象，之後的時間大致呈現線性上升。 ![](https://i.imgur.com/vLKPsen.png) 重跑一次並去掉 distance 為4的數據所得出來的圖變得更趨進線性模型，3750~4500之間仍然高出了一點點。 ![](https://i.imgur.com/mONSTH7.png) #### reverse the naive function 在 naive_transpose 中的 for loop 沿著 y 軸由上往下一一將地址給 dst，因為沿著 y 軸位址是跳躍的，考慮到每次讀取資料進 cache 都是連續的，改成沿著 x 軸的方向把地址給 dst。剛開始時沒有什麼差別是因為矩陣不大，即使跳躍也較不會超過每次讀進的 cache。但是到了後半段反而是 reverse 的版本比較慢，這裡可能是因為反轉 src 的同時變成 dst 會產生跳躍的現象。 ![](https://i.imgur.com/VCRTWvx.png) >當矩陣一放大就可以看到，反轉 x 和 y 極度緩慢QQ ![](https://i.imgur.com/fZ46COU.png) ```C _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1); ```