contributed by <ktvexe>

Goal: compare the performance of `naive_transpose`, `sse_transpose` and `sse_prefetch_transpose`, and observe how the prefetcher affects the cache.

- Modify the Makefile so it produces a separate executable for each of `naive_transpose`, `sse_transpose` and `sse_prefetch_transpose` (following the phonebook approach).
- Record measurements with `perf stat` raw-counter commands.

Flipping through the 算盤書 to recall what it says about prefetch, I was startled to find it full of keywords like AVX and SIMD. I had always thought phonebook was where I first met these techniques; what on earth happened to my sophomore year...
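For reference, the measurement command might look like the fragment below. This is my own reconstruction: the event list and `-r 100` are inferred from the `perf` output later in this note, not copied from the original Makefile.

```shell
perf stat -r 100 \
    -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses \
    ./naive
```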
Although many software and hardware prefetching mechanisms exist to cope with potential cache misses, compilers implement only fairly simple software-prefetching algorithms. As a result, performance-minded engineers insert prefetching intrinsic functions into their code by hand.
This approach has its own problems. First, there is no rigorous rule describing which way of inserting intrinsic functions actually yields the best performance; second, the interaction between software and hardware prefetching is complex and not well understood.
The introduction of this paper gives a clear overview of SIMD and the SSE/AVX instructions; excerpted below:

- SSE (streaming SIMD extensions) and AVX (advanced vector extensions) are both instruction sets of the SIMD (single instruction, multiple data streams) family.
- SIMD lets the CPU process multiple data in parallel: basic data-movement and arithmetic instructions operate on several values at once. Although both the GNU and Intel compilers provide SIMD optimization options, applying proper SIMD programming techniques such as data packing, data reuse and asynchronous data transfer can achieve better performance.
- Unlike SISD, a single SIMD instruction operates on multiple data at the same time, providing CPU-level parallelism. Concretely: it can add several values simultaneously, or finish a dot product faster.
The paper mentions three ways to write SIMD code: inline assembly, intrinsic functions, and vector classes.

- Inline assembly: by far the most complex, but the advantage is full control over every step (much like hand-rolling a compiler versus using lex/yacc).
- Intrinsic functions: these provide a C/C++ interface that maps directly to assembly instructions, but there is no guarantee of the highest degree of optimization.
- Vector classes: easier to write than intrinsic functions, but the performance is also worse.
- SSE provides 16 128-bit XMM registers (xmm0 ~ xmm15).
- AVX widens SSE's 128-bit registers to 256 bits, called YMM registers (ymm0 ~ ymm15).

AVX has in fact progressed to AVX-512 (mentioned in @Champ Yen's slides); unfortunately my machine does not support it, so I could not experiment with it this time.
Moreover, AVX instructions take three operands while SSE takes only two, so SSE supports only operations of the form a = a + b, whereas AVX can express a = b + c.
C code:
```c
for (i = 0; i < ArraySize; i += 8) {
    b[i] = a[i];
    b[i+1] = a[i+1];
    b[i+2] = a[i+2];
    b[i+3] = a[i+3];
    b[i+4] = a[i+4];
    b[i+5] = a[i+5];
    b[i+6] = a[i+6];
    b[i+7] = a[i+7];
}
```
SSE:
```c
for (i = 0; i < ArraySize; i += 8) {
    asm volatile (
        "movups %2, %%xmm0 \n\t"
        "movups %%xmm0, %0 \n\t"
        "movups %3, %%xmm1 \n\t"
        "movups %%xmm1, %1 \n\t"
        : "=m" (b[i]), "=m" (b[i+4])
        : "m" (a[i]), "m" (a[i+4])
    );
}
```
AVX:
```c
for (i = 0; i < ArraySize; i += 8) {
    asm volatile (
        "vmovups %1, %%ymm0 \n\t"
        "vmovups %%ymm0, %0 \n\t"
        : "=m" (b[i])
        : "m" (a[i]) );
}
```
Performance comparison:

| | C++ | SSE4.2 | AVX1 |
|---|---|---|---|
| no opt. | 163 | 94.5 | 97.7 |
| max opt. | 75 | 71 | 75 |
C code:
```c
for (i = 0; i < LoopNum; i++) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
    c[4] = a[4] + b[4];
    c[5] = a[5] + b[5];
    c[6] = a[6] + b[6];
    c[7] = a[7] + b[7];
}
```
SSE:
```c
asm volatile (
    ...
    "LOOP: \n\t"
    "addps %%xmm0, %%xmm1 \n\t"
    "addps %%xmm2, %%xmm3 \n\t"
    "dec %%ecx \n\t"
    "jnz LOOP \n\t"
    ...
```
AVX:
```c
asm volatile (
    ...
    "LOOP: \n\t"
    "vaddps %%ymm0, %%ymm1, %%ymm2 \n\t"
    "dec %%ecx \n\t"
    "jnz LOOP \n\t"
    ...
```
Performance comparison:

| | C++ | SSE4.2 | AVX1 |
|---|---|---|---|
| no opt. | 732 | 83.7 | 26.9 |
| max opt. | 317 | 78.7 | 26.9 |
Honestly, I do not understand why AVX is this much faster than SSE here; the 256-bit `vaddps` does twice as many adds per instruction as `addps`, but that alone would not explain a 3x gap.
Performance comparison:

| | C++ | SSE4.2 | AVX1 |
|---|---|---|---|
| no prefetching | 75 | 71 | 75 |
| 512 bytes prefetching | 69.9 | 70.9 | |

The conclusion drawn here: we need a more powerful asynchronous data transfer method.
I was rather stunned by this: the earlier description sounded impressive, yet it fails to deliver any obvious performance gain.
```
ktvexe@ktvexe-SVS13126PWR:~/sysprog21/prefetcher$ ./main
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 88613 us
sse: 156130 us
naive: 296468 us
```
Prefetching does improve speed here, but given the reading above I suspect this is an illusion that only holds while the data and computation sizes are small.
```
Performance counter stats for './naive' (100 runs):
    18,929,950 cache-misses              # 93.514 % of all cache refs  (66.28%)
    20,315,136 cache-references                                        (66.48%)
    21,216,594 L1-dcache-load-misses                                   (66.79%)
     4,127,598 L1-dcache-store-misses                                  (67.16%)
        53,629 L1-dcache-prefetch-misses                               (67.13%)
        66,024 L1-icache-load-misses                                   (66.54%)

   0.469926622 seconds time elapsed                                    ( +- 0.60% )
```

```
Performance counter stats for './SSE' (100 runs):
     6,892,848 cache-misses              # 90.614 % of all cache refs  (66.29%)
     7,713,047 cache-references                                        (66.55%)
     8,645,086 L1-dcache-load-misses                                   (66.87%)
     3,996,170 L1-dcache-store-misses                                  (67.20%)
        84,852 L1-dcache-prefetch-misses                               (67.09%)
       166,184 L1-icache-load-misses                                   (66.50%)

   0.333443107 seconds time elapsed                                    ( +- 0.80% )
```

```
Performance counter stats for './SSE_PF' (100 runs):
     6,681,287 cache-misses              # 87.202 % of all cache refs  (65.92%)
     7,349,026 cache-references                                        (66.51%)
     8,312,963 L1-dcache-load-misses                                   (67.10%)
     3,965,986 L1-dcache-store-misses                                  (67.56%)
        34,294 L1-dcache-prefetch-misses                               (67.32%)
        63,104 L1-icache-load-misses                                   (66.23%)

   0.275961985 seconds time elapsed                                    ( +- 0.97% )
```
Type | Meaning |
---|---|
__m256 | 256-bit as eight single-precision floating-point values, representing a YMM register or memory location |
__m256d | 256-bit as four double-precision floating-point values, representing a YMM register or memory location |
__m256i | 256-bit as integers, (bytes, words, etc.) |
__m128 | 128-bit single precision floating-point (32 bits each) |
__m128d | 128-bit double precision floating-point (64 bits each) |
In SSE:

`__m128i _mm_lddqu_si128 (__m128i const* mem_addr)`
Description: Load 128 bits of integer data from unaligned memory into dst. This intrinsic may perform better than `_mm_loadu_si128` when the data crosses a cache line boundary.

`__m128i _mm_loadu_si128 (__m128i const* mem_addr)`
Description: Load 128 bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
The corresponding intrinsics in AVX:

`__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)`
Description: Load 256 bits of integer data from unaligned memory into dst. This intrinsic may perform better than `_mm256_loadu_si256` when the data crosses a cache line boundary.

`__m256i _mm256_loadu_si256 (__m256i const * mem_addr)`
Description: Load 256 bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
I am actually not clear on the difference between `lddqu` and `loadu`; I wanted to add a `lddqu` version, but it kept failing to compile.
```
cc -msse3 -msse2 --std gnu99 -O0 -Wall -g -DSSE_PF -o SSE_PF main.c
In file included from main.c:17:0:
impl.c: In function ‘sse_transpose2’:
impl.c:39:26: warning: implicit declaration of function ‘_mm_lddqu_si128’ [-Wimplicit-function-declaration]
  __m128i I0 = _mm_lddqu_si128((__m128i *)(src + (y + 0) * w + x));
```
In the end, following the Synopsis in the Intel Intrinsics Guide, I added
`#include "pmmintrin.h"`
and it finally compiled, though the final result did not ...