# 2016q3 Homework3 (software-pipelining)
contributed by <`ktvexe`>
- [ ] Study computer architecture and verify what was learned through experiments
- [ ] Understand how prefetching affects the cache, and design experiments to clarify the related issues
- [ ] Paper reading and reflection
## Homework Requirements
* Read the paper mentioned in [Implementing algorithms on real machines](/s/HyKtIPN0): "[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)" (be sure to read the paper first; jumping straight into the code will only leave you lost!), then build and run [prefetcher](https://github.com/embedded2015/prefetcher) on Linux/x86_64 (note: a 64-bit system is required, and it must not be run inside a virtual machine)
* Explain the performance differences among `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose`, and how the prefetcher affects the cache
* Fork [prefetcher](https://github.com/embedded2015/prefetcher) on GitHub and try to push performance further with AVX
* Modify the `Makefile` to produce separate executables for `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose` (following the approach used in [phonebook](/s/S1RVdgza))
* Analyze cache misses/hits with perf
* Learn the raw counter syntax of `perf stat`
* Rewrite the code with SSE/AVX intrinsics, referring to [Performance of SSE and AVX Instruction Sets](http://arxiv.org/pdf/1211.0820.pdf)
* Describe the experiment design and your observations in detail
## Reviewing Computer Organization
>Note: I flipped through the "abacus book" (Computer Organization and Design) to recall its description of prefetching, and was startled to find it full of keywords such as AVX and SIMD. I had always thought I first encountered these techniques in the phonebook assignment; what on earth happened to my sophomore year... [name=劉亮谷][color=#c63eef]
## When Prefetching Works, When It Doesn’t, and Why
### Introduction
Although many software and hardware prefetching mechanisms exist to cope with cache misses, compilers only implement fairly simple software prefetching algorithms. As a result, performance-minded engineers insert prefetch intrinsic functions by hand.
This approach has its own problems, however. First, there are no rigorous rules describing which way of inserting intrinsic functions actually improves performance. Second, the complexity of the interaction between software and hardware prefetching is not well understood.
## Performance of SSE and AVX Instruction Sets
The introduction of this paper explains SIMD and the SSE/AVX instructions very clearly; a summary follows:
SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) are both instruction sets of the SIMD (single instruction, multiple data streams) kind.
SIMD lets the CPU process data in parallel: basic data-movement and arithmetic instructions operate on several data elements at once. Although compilers such as GNU's and Intel's offer SIMD optimization options, applying proper SIMD programming techniques, such as data packing, data reuse, and asynchronous data transfer, can achieve even better performance.
### Introduction
Unlike SISD, a single SIMD instruction operates on multiple data elements at once, providing CPU-level parallelism. Concrete examples: adding several pairs of numbers simultaneously, or finishing a dot product faster.
The paper mentions three ways to write SIMD code: inline assembly, intrinsic functions, and vector classes.
* Inline assembly: naturally the most complex approach, but its advantage is full control over every step (much like a hand-written compiler versus one built with lex/yacc).
* Intrinsic functions: these provide a C/C++ interface that maps directly to assembly code; the drawback is that the highest degree of optimization is not guaranteed.
* Vector classes: easier to write than intrinsic functions, but with correspondingly lower performance.
SSE provides sixteen 128-bit XMM registers (xmm0 ∼ xmm15).
AVX widens the original 128-bit SSE registers to 256 bits, called YMM registers (ymm0 ∼ ymm15).
AVX has since progressed to AVX-512 (mentioned in Champ Yen's slides); unfortunately my machine does not support it, so I cannot experiment with it this time.
![](https://i.imgur.com/ljewRAl.png)
In addition, AVX instructions take three operands while SSE takes only two, so SSE can only express operations of the form a = a + b, whereas AVX can do a = b + c.
### Optimization Scheme
* Data Packing
This section is interesting. Looking at the code: each version copies 256 bits of data per iteration, and the instruction counts differ across versions. The most interesting part is that if the C version is compiled with full optimization, its performance matches the SIMD versions; dumping the binary with objdump shows that the compiler automatically turns this loop into SIMD instructions. Also, comparing AVX with SSE shows no significant performance difference, so the wider YMM registers do not speed up pure data movement.
C code
```clike=
for(i=0 ; i< ArraySize ; i+=8){
b[i] = a[i];
b[i+1] = a[i+1];
b[i+2] = a[i+2];
b[i+3] = a[i+3];
b[i+4] = a[i+4];
b[i+5] = a[i+5];
b[i+6] = a[i+6];
b[i+7] = a[i+7];
}
```
> Oh my, it refuses to give me syntax highlighting [name=劉亮谷]
> It turns out a space is needed after the less-than sign, otherwise HackMD hits a parsing error [name=劉亮谷][color=#c63eef]
SSE:
```clike=
for(i=0; i< ArraySize; i+=8) {
asm volatile (
"movups %2, %%xmm0 \n\t"
"movups %%xmm0, %0 \n\t"
"movups %3, %%xmm1 \n\t"
"movups %%xmm1, %1 \n\t"
: "=m" (b[i]), "=m" (b[i+4])
: "m" (a[i]), "m" (a[i+4])
);
}
```
AVX:
```clike=
for(i=0; i< ArraySize; i+=8) {
asm volatile (
"vmovups %1, %%ymm0 \n\t"
"vmovups %%ymm0, %0 \n\t"
: "=m" (b[i])
: "m" (a[i]) );
}
```
Performance comparison:
| |C++|SSE4.2|AVX1|
|---|---|----|----|
|no opt.|163|94.5|97.7|
|max opt.|75|71|75|
---
* Data Reuse
Reuse is easiest to explain with the example: because the SSE and AVX versions do not repeatedly move the data into registers but keep reusing the same registers, they show a clear performance gain over the non-SIMD version.
C code
```clike=
for(i=0; i < LoopNum; i++) {
c[0] = a[0] + b[0];
c[1] = a[1] + b[1];
c[2] = a[2] + b[2];
c[3] = a[3] + b[3];
c[4] = a[4] + b[4];
c[5] = a[5] + b[5];
c[6] = a[6] + b[6];
c[7] = a[7] + b[7];
}
```
SSE
```clike=
asm volatile (
...
"LOOP: \n\t"
"addps %%xmm0, %%xmm1 \n\t"
"addps %%xmm2, %%xmm3 \n\t"
"dec %%ecx \n\t"
"jnz LOOP \n\t"
...
```
AVX
```clike=
asm volatile (
...
"LOOP: \n\t"
"vaddps %%ymm0, %%ymm1, %%ymm2 \n\t"
"dec %%ecx \n\t"
"jnz LOOP \n\t"
...
```
Performance comparison:
| |C++|SSE4.2|AVX1|
|---|---|----|----|
|no opt.|732|83.7|26.9|
|max opt.|317|78.7|26.9|
>Honestly, I don't understand why AVX is this much faster than SSE here [name=劉亮谷]
---
* Asynchronous Data Transfer
Moving data from memory into registers is very slow, so data packing alone rarely brings a real performance improvement. Asynchronous data transfer, however, is a technique that overlaps computation with data movement, reducing the data-transfer overhead.
Both SSE and AVX provide prefetching: bringing data into the cache before the CPU actually computes on it. Note, however, that neither SSE nor AVX can force data to be prefetched; the instruction merely gives the CPU a hint, which it may ignore, for instance when a higher-priority task (from the OS or elsewhere) needs to be handled.
Performance comparison:
| |C++|SSE4.2|AVX1|
|---|---|----|----|
|no prefetching|75|71|75|
|512 bytes prefetching||69.9|70.9|
The conclusion here: we need a more powerful asynchronous data transfer method.
>This was rather stunning; the description sounded impressive, yet it fails to deliver a clear performance gain [name=劉亮谷][color=#c63eef]
## Experiment Evaluation
### Original version:
```shell=
ktvexe@ktvexe-SVS13126PWR:~/sysprog21/prefetcher$ ./main
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 88613 us
sse: 156130 us
naive: 296468 us
```
>~~Prefetching does speed things up here, but after the reading above, I suspect this is an illusion that only appears when the data size and amount of computation are small.~~[name=劉亮谷]
### cache miss:
* naive
```shell
Performance counter stats for './naive' (100 runs):
18,929,950 cache-misses # 93.514 % of all cache refs (66.28%)
20,315,136 cache-references (66.48%)
21,216,594 L1-dcache-load-misses (66.79%)
4,127,598 L1-dcache-store-misses (67.16%)
53,629 L1-dcache-prefetch-misses (67.13%)
66,024 L1-icache-load-misses (66.54%)
0.469926622 seconds time elapsed ( +- 0.60% )
```
* SSE
```shell
Performance counter stats for './SSE' (100 runs):
6,892,848 cache-misses # 90.614 % of all cache refs (66.29%)
7,713,047 cache-references (66.55%)
8,645,086 L1-dcache-load-misses (66.87%)
3,996,170 L1-dcache-store-misses (67.20%)
84,852 L1-dcache-prefetch-misses (67.09%)
166,184 L1-icache-load-misses (66.50%)
0.333443107 seconds time elapsed ( +- 0.80% )
```
* SSE_PF
```shell
Performance counter stats for './SSE_PF' (100 runs):
6,681,287 cache-misses # 87.202 % of all cache refs (65.92%)
7,349,026 cache-references (66.51%)
8,312,963 L1-dcache-load-misses (67.10%)
3,965,986 L1-dcache-store-misses (67.56%)
34,294 L1-dcache-prefetch-misses (67.32%)
63,104 L1-icache-load-misses (66.23%)
0.275961985 seconds time elapsed ( +- 0.97% )
```
* Exploring AVX
Following the same pattern as the SSE version:
|Type |Meaning|
|-------|-------|
|__m256 |256-bit as eight single-precision floating-point values, representing a YMM register or memory location|
|__m256d|256-bit as four double-precision floating-point values, representing a YMM register or memory location|
|__m256i|256-bit as integers, (bytes, words, etc.)|
|__m128|128-bit single precision floating-point (32 bits each)|
|__m128d|128-bit double precision floating-point (64 bits each)|
In SSE:
`__m128i _mm_lddqu_si128 (__m128i const* mem_addr)`
Description:
Load 128-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line boundary.
`__m128i _mm_loadu_si128 (__m128i const* mem_addr)`
Description:
Load 128-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
The corresponding intrinsics in AVX:
`__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)`
Description:
Load 256-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.
`__m256i _mm256_loadu_si256 (__m256i const * mem_addr)`
Description:
Load 256-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
>I'm not entirely sure about the difference between `lddqu` and `loadu`; I wanted to add a `lddqu` version, but it kept failing to compile. [name=ktvexe]
```shell
cc -msse3 -msse2 --std gnu99 -O0 -Wall -g -DSSE_PF -o SSE_PF main.c
In file included from main.c:17:0:
impl.c: In function ‘sse_transpose2’:
impl.c:39:26: warning: implicit declaration of function ‘_mm_lddqu_si128’ [-Wimplicit-function-declaration]
__m128i I0 = _mm_lddqu_si128((__m128i *)(src + (y + 0) * w + x));
```
>Finally, following the Synopsis in the Intel Intrinsics Guide, adding `#include "pmmintrin.h"` made it compile at last,
though the final result did not …
## Reference
* [How to Vectorize Code Using C/C++ Classes on 32-Bit Intel® Architecture](https://software.intel.com/en-us/articles/how-to-vectorize-code-using-cc-classes-on-32-bit-intel-architecture)
* [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
* [Introduction to Intel® Advanced Vector Extensions](https://software.intel.com/sites/default/files/m/d/4/1/d/8/Intro_to_Intel_AVX.pdf)
* [Intel® Architecture Instruction Set Extensions Programming Reference](https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf)
* [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide)
* [SIMD Programming Introduction](https://docs.google.com/presentation/d/1LeVe7EAmZvqD3KN7p4Wynbd36xPOk9biBCFrdmsqzKs/edit#slide=id.p3)
* [Hotball's HIVE](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_1.html)
Not yet read:
* [Lattice gauge theory](https://en.wikipedia.org/wiki/Lattice_gauge_theory)
* [Shared notes by 許士杰](https://embedded2015.hackpad.com/-Homework-7-8-XZe3c94XjUh)
* [Shared notes by 周曠宇](https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh)