software pipelining

# software pipelining contributed by <`kevinbird61`>, <`CheHsuan`> # Note ### Performance of SSE and AVX Instruction Sets >>中英關鍵字間請以空白區隔! >>[color=red][name=課程助教] - Data packing : 多個資料包進一個 SIMD operation 中的 register 裡頭，使之能夠同時進行運算 - Data reusing : 迴圈進行時，同一個 data 會被使用多次，那麼我們就可以使用 data reuse 的機制來重複使用沒有變動的 register（優化明顯） - Asynchronous data transfer : 為了增加從 data memory 到 register 之間的傳輸效能，再使用前先 load 進資料至 cache memory (注意：prefetching並沒有支援完全的控制，所以並沒辦法強迫 CPU 預載進所提供的 data，只能暗示 CPU 這些 data 較為重要，如果要 preload 的話，可以先預載這些進來 => 因為 CPU 並不是只有執行這支程式而已！！) ### When Prefetching Works, When It Doesn’t, and Why - 定義(本篇)： - strided：access stride distances greater than two cache lines - streams：unit-stride cache-line accesses ## Cache Miss 分析 SS 和 SSE prefetch 能有巨大的效能差異應該是處理器預先將需要的 data 從 main memory 載到 cache 裡面，因此照理來說，cache misses 應該會降低，使執行時間減少。因此我做了以下實驗來檢查是否和推測的一樣 ```C Performance counter stats for './sse_only' (10 runs): 6,415,416 cache-misses # 85.008 % of all cache refs ( +- 0.12% ) 7,546,881 cache-references ( +- 0.04% ) 1,239,381,237 instructions # 1.30 insns per cycle ( +- 0.02% ) 953,680,054 cycles ( +- 0.10% ) 0.255951245 seconds time elapsed ( +- 0.28% ) ``` 只有 SSE ```C Performance counter stats for './sse_prefetch' (10 runs): 6,424,985 cache-misses # 85.203 % of all cache refs ( +- 0.08% ) 7,540,824 cache-references ( +- 0.08% ) 1,284,271,432 instructions # 1.78 insns per cycle ( +- 0.02% ) 720,105,763 cycles ( +- 0.14% ) 0.195868131 seconds time elapsed ( +- 0.48% ) ``` 同時使用 SSE 和 SSE prefetch，我們觀察到說 cache misses 並沒有減少！ OK，因為我們 prefetch 是將 data 載入到 dcache 去，所以應該把 perf 的資訊更精確一點 ```c Performance counter stats for './sse_only' (10 runs): 6,609,974 cache-misses # 86.487 % of all cache refs ( +- 0.34% ) 7,642,716 cache-references ( +- 0.50% ) 8,648,877 L1-dcache-load-misses ( +- 0.48% ) 4,332,796 L1-dcache-store-misses ( +- 0.39% ) 90,646 L1-dcache-prefetch-misses ( +- 8.75% ) 92,394 L1-icache-load-misses ( +- 10.04% ) 0.273118939 seconds time elapsed ( +- 0.35% ) ``` ```c Performance counter stats for './sse_prefetch' (10 runs): 6,513,958 cache-misses # 85.623 % of all cache refs ( +- 0.10% ) 7,607,699 cache-references ( +- 0.17% ) 8,577,430 L1-dcache-load-misses ( +- 0.11% ) 4,304,992 L1-dcache-store-misses ( +- 0.07% ) 79,447 L1-dcache-prefetch-misses ( +- 1.06% ) 80,910 L1-icache-load-misses ( +- 8.26% ) 0.207518135 seconds time elapsed ( +- 0.63% ) ``` 這邊特別列出L1 dcache load misses, store misses和prefetch misses # Prefetch Distance 參考論文裡面提到的 prefetch 時間軸 ![](https://i.imgur.com/38Igwb5.png) 我們必須在正確的時間 prefetch 正確的資料到 cache 上，因此提到了 prefetch distance 這個概念 ![](https://i.imgur.com/muZ9SJv.png) l就是在 prefetch timliness 裡面的IN-INIT，而 s 就是一個 loop 執行的時間，因此我們可以理解為在本次的 iteration 當中，必須prefetch D個 iteration 之後的資料以我們矩陣轉置來說，在impl.c的 sse_prefetch_transpose() 有定義一個巨集 PFDIST 為 8，等於預先載入2個迴圈後的資料到 cache 上，我們可以修改這個巨集來驗證我們的想法。 ```clike= #define PFDIST 8 ``` ![](https://i.imgur.com/AVgupst.png) 這邊可以觀察到當 prefetch distance 為0時，一定是 late，而4~176為 timely，之後是 early 接下來我們做一下不同 prefetch distance 的 cache miss 分析 ```C Performance counter stats for './prefetch 0' (10 runs): 6,469,153 cache-misses # 86.910 % of all cache refs ( +- 0.48% ) 7,443,546 cache-references ( +- 0.52% ) 1,380,741,324 instructions # 1.28 insns per cycle ( +- 0.06% ) 1,082,544,766 cycles ( +- 0.23% ) 0.295275443 seconds time elapsed ( +- 0.68% ) ``` prefetch distance 為 0(等同於沒有prefetch) ```C Performance counter stats for './prefetch 8' (10 runs): 6,530,074 cache-misses # 84.966 % of all cache refs ( +- 0.91% ) 7,685,479 cache-references ( +- 1.23% ) 1,295,966,776 instructions # 1.77 insns per cycle ( +- 0.24% ) 730,825,221 cycles ( +- 0.74% ) 0.199804215 seconds time elapsed ( +- 1.50% ) ``` prefetch distance 為 8 ```C Performance counter stats for './prefetch 200' (10 runs): 9,525,837 cache-misses # 80.809 % of all cache refs ( +- 0.26% ) 11,788,122 cache-references ( +- 0.06% ) 1,294,261,441 instructions # 1.21 insns per cycle ( +- 0.03% ) 1,074,000,442 cycles ( +- 0.32% ) 0.291899917 seconds time elapsed ( +- 0.65% ) ``` prefetch distance 為 200 # SIMD Learning `Header file - <xmmintrin.h>` - SSE intrinsics命名規則 ```C _mm_<opcode>_<suffix> ``` - opcode：表示指令類別（add, sub...） - suffix：資料的種類，在SSE浮點運算指令中，只有兩個種類-> `ps`(packed single-precision，也就是這個指令對暫存器中的四個單精度浮點數進行運算) , `ss`(Scalar Single-precision，也就是這個指令只對暫存器中的 DATA0 進行運算) - 而SSE也需要一個新的資料型態，便是上面所看到`__m128` - 是一個 16 bytes（128 bits）的資料型態，對應 SSE 的 128 位元暫存器 - `_mm_loadu_ps`這個intrinsic專門用來處理沒有對齊再16bytes邊上的資料（但是相對處理速度較慢） - 若是little endian特性（x86），所有的資料會以逆向的方式儲存；而我們可以使用`_mm_loadr_ps` - 若需要對齊16bytes邊上，可以利用SSE指令：`__declspec(align(16)) float input[4]`，對等於`_MM_ALIGN16 float input[4]` - 原始碼裡頭出現的 - `_mm_loadu_si128`: - `__m128i _mm_loadu_si128 (__m128i const* mem_addr)` - Instruction : `movdqu xmm, m128` - Load 128-bits of integer data from memory into `dst.mem_addr` , doesn't need to be aligned on any particular boundary - 等同於`dst[127:0]:=MEM[mem_addr+127:mem_addr]` - 只是再guideline上，使用的`#include "emmintrin.h"`，但上面使用的是`#include "xmmintrin.h"` (Question?) - `_mm_unpacklo_epi32`: - Unpack and interleave(交錯) 32-bit integers from the **`low`** half of a and b, and store the results in dst. - 從底下的operation中看，這支指令可以把32-bit分成四份分別交錯，並做儲存 - `_mm_unpackhi_epi32` - Unpack and interleave(交錯) 32-bit integers from the **`high`** half of a and b, and store the results in dst. - `_mm_unpackhi_epi64` - 發現他是以64-bit為一單位做儲存 ```C ## _mm_unpacklo_epi32 INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[31:0] dst[63:32] := src2[31:0] dst[95:64] := src1[63:32] dst[127:96] := src2[63:32] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0]) ## _mm_unpackhi_epi32 INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[95:64] dst[63:32] := src2[95:64] dst[95:64] := src1[127:96] dst[127:96] := src2[127:96] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0]) ## _mm_unpackhi_epi64 INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[127:64] dst[127:64] := src2[127:64] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0]) ## _mm_unpacklo_epi64 INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[63:0] dst[127:64] := src2[63:0] RETURN dst[127:0] } dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0]) ``` - 比較transpose實作: - naive_transpose:原始使用C程式碼實作轉置矩陣 - sse_transpose:利用SSE指令來實作轉置矩陣 - sse_prefetch_transpose:利用SSE指令以及其提供的prefetch指令來實作 - 可以到[實作](https://github.com/kevinbird61/prefetcher)中，利用epi32與epi64兩種不同指令的特性，進行不同位置交換！漂亮 - 先從原先的矩陣中，獲取同column上的4個位置，並且load進從他開始往後的4個資料，存入I0,I1,I2,I3中 - 用`_mm_unpacklo_epi32 / _mm_unpackhi_epi32`把I0-3交錯存在T0-3之中；以T0為例，其內容順序分別為I0[31:0]->I1[31:0]->I0[63:32]->I1[63:32]；而T1則是存取較高位的I0,1；T2,3則跟T0,1同理(注意，這邊是**T0,T2->pair , T1,3->pair**) - 之後利用`_mm_unpacklo_epi64 / _mm_unpackhi_epi64`來操作，利用原本交錯的特性，分別擷取部份：取T0的low[63:0]=>I0,I1的第1項，也就是原本取出array資料的第1項(而此處則是以T0+T1，便可以取到I0~3的第一筆資料)；以此類推，可以依據此方式，把所有資訊分別儲存回I0\~3中，並且於此同時，資料已經做出改變，Ex:I0中，從原本取得x[h]\~x[h+3]到變成x[h],x[h+w],x[h+2\*w],x[h+3\*w]\(w為row長度) - 因此只需要存回去即可！ - `void _mm_prefetch (char const* p, int i)` - Fetch the line of data from memory that contains address `p` to a location in the cache heirarchy specified by the locality hint `i`. - 依據後面`i`給出的hint來做prefetch的動作以及該去哪塊cache拿取資料 - 分為t0,t1,t2,nta；T0 - T2對應了L1 - L3 caches，NTA表示加載數據在L1 cache並標記為首先被替換的 - `_MM_HINT_T1`:Temporal data with respect to `first level cache（L1 cache）` - 改寫程式，檢視執行時間[(Source code)](https://github.com/kevinbird61/prefetcher) - 利用執行時間畫圖，可以看到prefetch時間遠小於naive ![](https://i.imgur.com/peaN2OC.png) - 執行數次perf，看到圖形變化趨勢大，唯有大小順序相同 >下面幾張圖都麻煩你修改圖例位置,都跟圖表打架了~可以考慮將它移出圖表外 >參考連結:http://blog.csdn.net/iemyxie/article/details/41548583 >[color=red][name=課程助教] ![](https://i.imgur.com/MpPrTau.png) - 加入AVX指令版本 - 參考曠宇學長版本，做出AVX的transpose實作 - 更改Makefile編譯參數，加上`-mavx2`來實作 - 可以由下面的實作中看到這支function的實作的control值的存在 - `0x20` = `..0010100` , `0x31` = `..0011111` - 下面的實作(`dst[127:0]...`)可以看到此操作分為三部份，來決定最終dst的值 - 0x20 -> 分為[3:0]`0100`,以及[7:4]`0001`；tmp[127:0]=src1[127:0], tmp[255:128]則=src1[255:128] - 0x31 -> 分為[3:0]`1111`，以及[7:4]`0001`;tmp[127:0]=src2[255:128]\(?其中後面`IF control[3]`此時為1，是否這邊輸出改為0？),tmp[255:128]=src1[255:128] ```c= ## _mm256_permute2x128_si256 SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0 ``` ![](https://i.imgur.com/JoZwuh3.png) - 可以看到，使用一次操作8個單位的avx的版本，比使用原本操作4個單位的sse版本還要來的快上將近一倍 ![](https://i.imgur.com/m1o2o5T.png) - (PFDIST=8)可以看到，使用avx+prefetch版本，和原本的sse+prefetch版本較低一些，但不會低多少 **Why using cc , instead of using gcc** - 輸出man cc , man gcc的資料，並且用diff來檢視；發現相同 - cc is the name of the original UNIX c compiler command. The default c compiler for your operating system should be executable with that command. gcc is the GNU operating system c compiler. On GNU+Linux systems it is usual for cc to be a link to gcc so you and your scripts can use either interchangeably. Traditionally, a c compiler that is named or linked to cc has to obey certain standards and respect certain command arguement interfaces in order for it to be used in this capacity. While using GNU+Linux (like Ubuntu) consider cc and gcc to be synonyms for each other. PS: Like all *nix typically ends with an "x", it is also traditional for c compilers to end with "cc", eg. tcc (tiny c compiler) or icc (intel c compiler). ## 效能改進-parallel computing 因為矩陣轉置的for loop當中，每一個轉換都是沒有關聯興的，因此我們可以用openmp來平行化運算。 [吳彥寬](https://github.com/c14006078)同學提示我說每一個CPU都有SIMD的暫存器和特殊處理器，因此如果我們把程式平行化到每個core上去計算的話，也不會每個thread在搶SIMD暫存器和處理器的使用，所以應該可以再加速更多 ![](https://i.imgur.com/AmFeQin.png) 下面有點亂，用單次數據來看一下 ```c sse prefetch: 56706 us sse prf & omp: 59678 us sse: 122146 us sse omp: 59864 us naive: 242929 us naive omp: 174083 us ``` * 這邊我們可以看到使用了openmp平行化矩陣轉置運算，naive omp比naive快，sse omp也比sse快，而sse prefetch+omp卻比單純sse prefetch慢。 * 關於sse prefetch+omp沒有比較快這件事，我有一個猜想，是不是因為我們使用了multi-thread去分散處理，也就是說，每次去跑迴圈時，每個thread(core)會使用到的x和y變得不可預測，導致存取到的memory address沒有規律(或有特別的規律，要去看openmp如何平行化)，因此沒有發揮到prefetch的特性 => 因此如果我們找到了openmp如何平行化for loop的運算，我們去調整PFDIST的數值，因該還是可以加速的 >> 主要因為資料切割方式帶來限制 [name=jserv] ### openmp平行化程式的任務切割－schedule 考慮到為了找到最佳的PFDIST，必須先了解openmp如何切割loop分配給thread運作，切割方法如下四種 * static(default) * dynamic : 像是threadpool的概念，動態的分配任務給空閒的thread去執行 * guided * runtime ### 更改成schedule(static,1) ![](https://i.imgur.com/Ro11sg6.png) PFDIST為0 ![](https://i.imgur.com/XzB0CSg.png) PFDIST為8(原本) ![](https://i.imgur.com/SCHrZUv.png) PFDIST為32 原本是想要探討一下在多執行緒的環境下(4個執行緒)，更改不同prefetch distance對運算時間的影響，從0到8到32，0和8的話預計和sse omp差不多時間(因為prefetch到的記憶體位址無用)，預計32為最好的，原因是用static並且chunk size為1，這個thread下兩次運算使用到的記憶體位置為目前記憶體位置加上32，可是由上面幾個實驗下來，prefetch對multi-thread環境下好像無用？ ### 將static方法改成dynamic方法 ![](https://i.imgur.com/Ysa55eE.png) 可以看到原本執行時間非常相似SSE OMP、SSE prefetch+omp和SSE prefetch，前二者經由dynamic的方式，已經和後者拉出些距離，不過這是藉由改變平行運算分配任務的方式達到的效果，不是提高cache hit所帶來的效益 # Reference - [ Basic SIMD ](https://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html) - [ Intel intrinsics guide ](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) - [ 周曠宇學長共筆 ](https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh) - [淺談memory cache](http://enginechang.logdown.com/posts/249025-discussion-on-memory-cache)