contributed by <ktvexe>

Goal: compare the performance of `naive_transpose`, `sse_transpose` and `sse_prefetch_transpose`, and observe how the prefetcher affects the cache.

- Modify the Makefile so it produces a separate executable for each of `naive_transpose`, `sse_transpose` and `sse_prefetch_transpose` (following the phonebook approach).
- Record measurements with `perf stat` raw-counter commands.

Flipping through the 算盤書 to recall what it says about prefetch, I was startled to find it full of keywords like AVX and SIMD. I had always thought phonebook was where I first met these techniques; what on earth happened to my sophomore year...
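For reference, the measurement command might look like the fragment below. This is my own reconstruction: the event list and `-r 100` are inferred from the `perf` output later in this note, not copied from the original Makefile.

```shell
perf stat -r 100 \
    -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses \
    ./naive
```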
Although many software and hardware prefetching mechanisms exist to cope with potential cache misses, compilers implement only fairly simple software-prefetching algorithms. As a result, performance-minded engineers insert prefetching intrinsic functions into their code by hand.
This approach has its own problems. First, there is no rigorous rule describing which way of inserting intrinsic functions actually yields the best performance; second, the interaction between software and hardware prefetching is complex and not well understood.
The introduction of this paper gives a clear overview of SIMD and the SSE/AVX instructions; excerpted below:

- SSE (streaming SIMD extensions) and AVX (advanced vector extensions) are both instruction sets of the SIMD (single instruction, multiple data streams) family.
- SIMD lets the CPU process multiple data in parallel: basic data-movement and arithmetic instructions operate on several values at once. Although both the GNU and Intel compilers provide SIMD optimization options, applying proper SIMD programming techniques such as data packing, data reuse and asynchronous data transfer can achieve better performance.
- Unlike SISD, a single SIMD instruction operates on multiple data at the same time, providing CPU-level parallelism. Concretely: it can add several values simultaneously, or finish a dot product faster.
The paper mentions three ways to write SIMD code: inline assembly, intrinsic functions, and vector classes.

- Inline assembly: by far the most complex, but the advantage is full control over every step (much like hand-rolling a compiler versus using lex/yacc).
- Intrinsic functions: these provide a C/C++ interface that maps directly to assembly instructions, but there is no guarantee of the highest degree of optimization.
- Vector classes: easier to write than intrinsic functions, but the performance is also worse.
- SSE provides 16 128-bit XMM registers (xmm0 ~ xmm15).
- AVX widens SSE's 128-bit registers to 256 bits, called YMM registers (ymm0 ~ ymm15).

AVX has in fact progressed to AVX-512 (mentioned in @Champ Yen's slides); unfortunately my machine does not support it, so I could not experiment with it this time.
Moreover, AVX instructions take three operands while SSE takes only two, so SSE supports only operations of the form a = a + b, whereas AVX can express a = b + c.
C code:
```c
for (i = 0; i < ArraySize; i += 8) {
    b[i] = a[i];
    b[i+1] = a[i+1];
    b[i+2] = a[i+2];
    b[i+3] = a[i+3];
    b[i+4] = a[i+4];
    b[i+5] = a[i+5];
    b[i+6] = a[i+6];
    b[i+7] = a[i+7];
}
```
SSE:
```c
for (i = 0; i < ArraySize; i += 8) {
    asm volatile (
        "movups %2, %%xmm0 \n\t"
        "movups %%xmm0, %0 \n\t"
        "movups %3, %%xmm1 \n\t"
        "movups %%xmm1, %1 \n\t"
        : "=m" (b[i]), "=m" (b[i+4])
        : "m" (a[i]), "m" (a[i+4])
    );
}
```
AVX:
```c
for (i = 0; i < ArraySize; i += 8) {
    asm volatile (
        "vmovups %1, %%ymm0 \n\t"
        "vmovups %%ymm0, %0 \n\t"
        : "=m" (b[i])
        : "m" (a[i]) );
}
```
Performance comparison:

| | C++ | SSE4.2 | AVX1 |
|---|---|---|---|
| no opt. | 163 | 94.5 | 97.7 |
| max opt. | 75 | 71 | 75 |
C code:
```c
for (i = 0; i < LoopNum; i++) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
    c[4] = a[4] + b[4];
    c[5] = a[5] + b[5];
    c[6] = a[6] + b[6];
    c[7] = a[7] + b[7];
}
```
SSE:
```c
asm volatile (
    ...
    "LOOP: \n\t"
    "addps %%xmm0, %%xmm1 \n\t"
    "addps %%xmm2, %%xmm3 \n\t"
    "dec %%ecx \n\t"
    "jnz LOOP \n\t"
    ...
```
AVX:
```c
asm volatile (
    ...
    "LOOP: \n\t"
    "vaddps %%ymm0, %%ymm1, %%ymm2 \n\t"
    "dec %%ecx \n\t"
    "jnz LOOP \n\t"
    ...
```
Performance comparison:

| | C++ | SSE4.2 | AVX1 |
|---|---|---|---|
| no opt. | 732 | 83.7 | 26.9 |
| max opt. | 317 | 78.7 | 26.9 |
Honestly, I do not understand why AVX is this much faster than SSE here; the 256-bit `vaddps` does twice as many adds per instruction as `addps`, but that alone would not explain a 3x gap.
Performance comparison:

| | C++ | SSE4.2 | AVX1 |
|---|---|---|---|
| no prefetching | 75 | 71 | 75 |
| 512 bytes prefetching | 69.9 | 70.9 | |

The conclusion drawn here: we need a more powerful asynchronous data transfer method.
I was rather stunned by this: the earlier description sounded impressive, yet it fails to deliver any obvious performance gain.
```
ktvexe@ktvexe-SVS13126PWR:~/sysprog21/prefetcher$ ./main
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 88613 us
sse: 156130 us
naive: 296468 us
```
Prefetching does improve speed here, but given the reading above I suspect this is an illusion that only holds while the data and computation sizes are small.
```
Performance counter stats for './naive' (100 runs):
    18,929,950 cache-misses              # 93.514 % of all cache refs  (66.28%)
    20,315,136 cache-references                                        (66.48%)
    21,216,594 L1-dcache-load-misses                                   (66.79%)
     4,127,598 L1-dcache-store-misses                                  (67.16%)
        53,629 L1-dcache-prefetch-misses                               (67.13%)
        66,024 L1-icache-load-misses                                   (66.54%)

   0.469926622 seconds time elapsed                                    ( +- 0.60% )
```

```
Performance counter stats for './SSE' (100 runs):
     6,892,848 cache-misses              # 90.614 % of all cache refs  (66.29%)
     7,713,047 cache-references                                        (66.55%)
     8,645,086 L1-dcache-load-misses                                   (66.87%)
     3,996,170 L1-dcache-store-misses                                  (67.20%)
        84,852 L1-dcache-prefetch-misses                               (67.09%)
       166,184 L1-icache-load-misses                                   (66.50%)

   0.333443107 seconds time elapsed                                    ( +- 0.80% )
```

```
Performance counter stats for './SSE_PF' (100 runs):
     6,681,287 cache-misses              # 87.202 % of all cache refs  (65.92%)
     7,349,026 cache-references                                        (66.51%)
     8,312,963 L1-dcache-load-misses                                   (67.10%)
     3,965,986 L1-dcache-store-misses                                  (67.56%)
        34,294 L1-dcache-prefetch-misses                               (67.32%)
        63,104 L1-icache-load-misses                                   (66.23%)

   0.275961985 seconds time elapsed                                    ( +- 0.97% )
```
Type | Meaning |
---|---|
__m256 | 256-bit as eight single-precision floating-point values, representing a YMM register or memory location |
__m256d | 256-bit as four double-precision floating-point values, representing a YMM register or memory location |
__m256i | 256-bit as integers, (bytes, words, etc.) |
__m128 | 128-bit single precision floating-point (32 bits each) |
__m128d | 128-bit double precision floating-point (64 bits each) |
In SSE:

`__m128i _mm_lddqu_si128 (__m128i const* mem_addr)`
Description: Load 128 bits of integer data from unaligned memory into dst. This intrinsic may perform better than `_mm_loadu_si128` when the data crosses a cache line boundary.

`__m128i _mm_loadu_si128 (__m128i const* mem_addr)`
Description: Load 128 bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
The corresponding intrinsics in AVX:

`__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)`
Description: Load 256 bits of integer data from unaligned memory into dst. This intrinsic may perform better than `_mm256_loadu_si256` when the data crosses a cache line boundary.

`__m256i _mm256_loadu_si256 (__m256i const * mem_addr)`
Description: Load 256 bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
I am actually not clear on the difference between `lddqu` and `loadu`; I wanted to add a `lddqu` version, but it kept failing to compile.
```
cc -msse3 -msse2 --std gnu99 -O0 -Wall -g -DSSE_PF -o SSE_PF main.c
In file included from main.c:17:0:
impl.c: In function ‘sse_transpose2’:
impl.c:39:26: warning: implicit declaration of function ‘_mm_lddqu_si128’ [-Wimplicit-function-declaration]
  __m128i I0 = _mm_lddqu_si128((__m128i *)(src + (y + 0) * w + x));
```
In the end, following the Synopsis in the Intel Intrinsics Guide, I added
`#include "pmmintrin.h"`
and it finally compiled, though the final result did not ...