contributed by <fzzzz
>
節錄自wikipedia
In computer architecture, instruction prefetch is a technique used in central processor units to speed up the execution of a program by reducing wait states.
論文 When Prefetching Works, When It Doesn’t, and Why
論文的部份只有稍微掃過一次,大部份還是看學長姐的筆記
Software : 通常會在程式碼間安插一些 extension 的 intrinsics ,來告訴 compiler 哪邊要做 prefetch 及 prefetch distance
Hardware : 利用 CPU 中的 GHB(Global History Buffer),通常需要做 training,要經過幾次的 cache misses 來偵測。
Prefetch distance :
現在內建有 software prefetching algorithms 的 compiler 有 icc(intel) / gcc,但是對於如何安插 prefetch instrinsics 的 guideline 相關文件不多,再來就是對於軟體及硬體 prefetching 的分界是不清楚的。(沒有夠嚴謹的標準存在),所以這篇論文表示他想要提供一個 prefetch distance 標準。
結論為
native_transpose:
$(GIT_HOOKS) main.c
$(CC) $(CFLAGS_common) -DIMPL="\"$@.h\"" -o $@ main.c
echo $@
sse_transpose:
$(GIT_HOOKS) main.c
$(CC) $(CFLAGS_common) $(CFLAGS_sse) -DIMPL="\"$@.h\"" -o $@ main.c
sse_prefetch_transpose:
$(GIT_HOOKS) main.c
$(CC) $(CFLAGS_common) $(CFLAGS_sse) -DIMPL="\"$@.h\"" -o $@ main.c
native transpose
fzzzz@fzzzz-Aspire-V3-571G:~/Workspace/prefetcher$ ./native_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
native_transpose.h: 237301 us
sse transpose
fzzzz@fzzzz-Aspire-V3-571G:~/Workspace/prefetcher$ ./sse_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse_transpose.h: 123182 us
sse transpose using software prefetching
fzzzz@fzzzz-Aspire-V3-571G:~/Workspace/prefetcher$ ./sse_prefetch_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse_prefetch_transpose.h: 60141 us
這邊筆記一下會用到的 function
__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
SELECT4(src1, src2, control){
CASE(control[1:0])
0: tmp[127:0] := src1[127:0]
1: tmp[127:0] := src1[255:128]
2: tmp[127:0] := src2[127:0]
3: tmp[127:0] := src2[255:128]
ESAC
IF control[3]
tmp[127:0] := 0
FI
RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0
修改過後的程式碼 ( code )
./native_transpose
native_transpose.h: 308829 us
./sse_transpose
sse_transpose.h: 157873 us
./sse_prefetch_transpose
sse_prefetch_transpose.h: 80256 us
./avx_transpose
avx_transpose.h: 85646 us
./avx_prefetch_transpose
avx_prefetch_transpose.h: 84487 us
其實在看完論文和學長姐的筆記後,我還是不太知道怎麼取 prefetch distance。即便是按照內文提到的演算法來算的話,也仍然會因為各處理器 IPC 不同,而有不同的 prefetch distance。所以總結了一下想法,我還是全都跑一次看哪個 prefetch distance 會帶來最好的效果。
下圖為 avx_prefetch 在各個 prefetch distance 下的執行時間。
可以看到在 prefetch distance = 8 的時候最快
native_transpose
Performance counter stats for './native_transpose' (5 runs):
16,543,459 cache-misses # 91.690 % of all cache refs ( +- 0.05% )
18,042,742 cache-references ( +- 0.03% )
1,448,816,603 instructions # 0.86 insns per cycle ( +- 0.00% )
1,690,823,261 cycles ( +- 0.48% )
0.453539595 seconds time elapsed ( +- 1.14% )
sse_transpose
Performance counter stats for './sse_transpose' (5 runs):
4,271,798 cache-misses # 78.221 % of all cache refs ( +- 0.18% )
5,461,196 cache-references ( +- 0.09% )
1,236,784,781 instructions # 1.17 insns per cycle ( +- 0.00% )
1,061,389,121 cycles ( +- 1.19% )
0.303614305 seconds time elapsed ( +- 7.57% )
sse_prefetch_transpose
Performance counter stats for './sse_prefetch_transpose' (5 runs):
4,272,500 cache-misses # 78.128 % of all cache refs ( +- 0.21% )
5,468,597 cache-references ( +- 0.02% )
1,282,833,253 instructions # 1.60 insns per cycle ( +- 0.00% )
803,852,552 cycles ( +- 0.37% )
0.213622428 seconds time elapsed ( +- 0.50% )
avx_transpose
Performance counter stats for './avx_transpose' (5 runs):
3,243,450 cache-misses # 73.077 % of all cache refs ( +- 0.03% )
4,438,375 cache-references ( +- 0.09% )
1,149,164,499 instructions # 1.42 insns per cycle ( +- 0.00% )
810,729,581 cycles ( +- 0.30% )
0.219928612 seconds time elapsed ( +- 2.58% )
avx_prefetch_tanspose ( PFDIST = 8 )
Performance counter stats for './avx_prefetch_transpose' (5 runs):
3,269,625 cache-misses # 50.112 % of all cache refs ( +- 0.02% )
6,524,581 cache-references ( +- 0.04% )
1,172,166,712 instructions # 1.42 insns per cycle ( +- 0.00% )
825,967,980 cycles ( +- 0.31% )
0.218844538 seconds time elapsed ( +- 0.29% )
從 cache misses 的比例上來看 可以看見 avx 在做了 prefetch 之後,有很大幅度的改善。但在執行時間上仍然是和 avx_transpose 與 sse_prefetch_transpose 沒有相差太多。(甚至是更慢的),另外 avx 的兩個版本的 IPC 均比 sse_prefetch 來的低。 我個人猜測這很可能是為什麼 avx 效能上不及 sse_prefetch 的原因 ( 整整少了10% 的 IPC )。
fzzzz
sysprog