2016q3 Homework4 (software-pipelining)

contributed by <fzzzz>

什麼是 prefectching?

節錄自wikipedia
In computer architecture, instruction prefetch is a technique used in central processor units to speed up the execution of a program by reducing wait states.

論文閱讀

論文 When Prefetching Works, When It Doesn’t, and Why

論文的部份只有稍微掃過一次，大部份還是看學長姐的筆記

Key points

source code-level analysis
study of the synergistic(協同) and antagonistic（拮抗） effects between sw/hw
evaluation of HW prefetching training policies

Software / Hardware Prefetching 的背景知識

Software : 通常會在程式碼間安插一些 extension 的 intrinsics ，來告訴 compiler 哪邊要做 prefetch 及　prefetch distance
- prefetch 要有實際效果，傳送要求時間點要夠早，要能和 memory latency 抵銷。
Hardware : 利用 CPU 中的 GHB(Global History Buffer)，通常需要做 training，要經過幾次的 cache misses 來偵測。
Prefetch distance :
- - l : prefetch latency (memory)
  - s : 迴圈當中的最短路徑

文章閱讀概要

現在內建有 software prefetching algorithms 的 compiler 有 icc(intel) / gcc，但是對於如何安插 prefetch instrinsics 的 guideline 相關文件不多，再來就是對於軟體及硬體 prefetching 的分界是不清楚的。(沒有夠嚴謹的標準存在)，所以這篇論文表示他想要提供一個 prefetch distance 標準。

左Y軸為正規化後執行時間（除以未做 prefetch 的執行時間）
右Y軸為加速倍率
從這張圖的結果來看，最好的結果是　SW+STR (stream prefetcher) or SW+GHB (stride prefetcher)

結論為

software prefetching 在　short array streams , irregular memory address patterns 和 L1 cache reduction　上有較為顯著的正面影響。

Synergistic effects using software + hardware prefetching

可以同時處理多個 data streams 。
- software prefetching能夠解決不規則的 stream ，而　hardware　則去做規律的部份。
Positive training
*　如果一個 software prefetch 太慢的話，可以藉由 hardware prefetch來增加 prefetch 的即時性。

Antagonistic Effects when Using Software and Hardware Prefetching

Negative training
- 沒有完整地對所有 stream 做 sw prefetch的話，會使　hardware prefecting 的訓練結果不正確或過度訓練造成工作量過大。
Harmful software prefecting
- 通常 hardware prefecthing 的精度會低於 software ，而當 prefecting request 出現的時間點不對或是過早的話，會造成再記憶體及快取頻寬上的負擔，甚至是降低 hardware prefetching 的效率。

作業實驗結果 ( native / sse / sse-prefetch )

更改 Makefile 並重新執行

native_transpose:
	$(GIT_HOOKS) main.c
	$(CC) $(CFLAGS_common) -DIMPL="\"$@.h\"" -o $@ main.c
	echo $@
sse_transpose:
	$(GIT_HOOKS) main.c
	$(CC) $(CFLAGS_common) $(CFLAGS_sse) -DIMPL="\"$@.h\"" -o $@ main.c
sse_prefetch_transpose:
	$(GIT_HOOKS) main.c
	$(CC) $(CFLAGS_common) $(CFLAGS_sse) -DIMPL="\"$@.h\"" -o $@ main.c

native transpose

fzzzz@fzzzz-Aspire-V3-571G:~/Workspace/prefetcher$ ./native_transpose 
  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15

  0  4  8 12
  1  5  9 13
  2  6 10 14
  3  7 11 15
native_transpose.h: 	 237301 us

sse transpose

fzzzz@fzzzz-Aspire-V3-571G:~/Workspace/prefetcher$ ./sse_transpose 
  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15

  0  4  8 12
  1  5  9 13
  2  6 10 14
  3  7 11 15
sse_transpose.h: 	 123182 us

sse transpose using software prefetching

fzzzz@fzzzz-Aspire-V3-571G:~/Workspace/prefetcher$ ./sse_prefetch_transpose 
  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15

  0  4  8 12
  1  5  9 13
  2  6 10 14
  3  7 11 15
sse_prefetch_transpose.h: 	 60141 us

修改成 AVX 版本

這邊筆記一下會用到的 function

__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
- 將 src1 與 src2 的低/高位元以交錯的方式做排列，尾端的 32 就是每次交錯排列的位元數
__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)
- 將 src1 與　src2 各取一半進行排列，
- 方法如下

SELECT4(src1, src2, control){
    CASE(control[1:0])
    0:	tmp[127:0] := src1[127:0]
    1:	tmp[127:0] := src1[255:128]
    2:	tmp[127:0] := src2[127:0]
    3:	tmp[127:0] := src2[255:128]
    ESAC
    IF control[3]
    	tmp[127:0] := 0
    FI
    RETURN tmp[127:0]
}

dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0

修改過後的程式碼 ( code )

在桌機上跑（筆電沒有支援到 AVX2）

./native_transpose
native_transpose.h: 		 308829 us
./sse_transpose
sse_transpose.h: 		 157873 us
./sse_prefetch_transpose
sse_prefetch_transpose.h: 		 80256 us
./avx_transpose
avx_transpose.h: 		 85646 us
./avx_prefetch_transpose
avx_prefetch_transpose.h: 	 84487 us

其實在看完論文和學長姐的筆記後，我還是不太知道怎麼取 prefetch distance。即便是按照內文提到的演算法來算的話，也仍然會因為各處理器 IPC 不同，而有不同的 prefetch distance。所以總結了一下想法，我還是全都跑一次看哪個 prefetch distance 會帶來最好的效果。
下圖為 avx_prefetch 在各個　prefetch distance 下的執行時間。

可以看到在 prefetch distance = 8 的時候最快

各版本的 cache misses

native_transpose

 Performance counter stats for './native_transpose' (5 runs):

        16,543,459      cache-misses              #   91.690 % of all cache refs      ( +-  0.05% )
        18,042,742      cache-references                                              ( +-  0.03% )
     1,448,816,603      instructions              #    0.86  insns per cycle          ( +-  0.00% )
     1,690,823,261      cycles                                                        ( +-  0.48% )

       0.453539595 seconds time elapsed                                          ( +-  1.14% )

sse_transpose

 Performance counter stats for './sse_transpose' (5 runs):

         4,271,798      cache-misses              #   78.221 % of all cache refs      ( +-  0.18% )
         5,461,196      cache-references                                              ( +-  0.09% )
     1,236,784,781      instructions              #    1.17  insns per cycle          ( +-  0.00% )
     1,061,389,121      cycles                                                        ( +-  1.19% )

       0.303614305 seconds time elapsed                                          ( +-  7.57% )

sse_prefetch_transpose

 Performance counter stats for './sse_prefetch_transpose' (5 runs):

         4,272,500      cache-misses              #   78.128 % of all cache refs      ( +-  0.21% )
         5,468,597      cache-references                                              ( +-  0.02% )
     1,282,833,253      instructions              #    1.60  insns per cycle          ( +-  0.00% )
       803,852,552      cycles                                                        ( +-  0.37% )

       0.213622428 seconds time elapsed                                          ( +-  0.50% )

avx_transpose

 Performance counter stats for './avx_transpose' (5 runs):

         3,243,450      cache-misses              #   73.077 % of all cache refs      ( +-  0.03% )
         4,438,375      cache-references                                              ( +-  0.09% )
     1,149,164,499      instructions              #    1.42  insns per cycle          ( +-  0.00% )
       810,729,581      cycles                                                        ( +-  0.30% )

       0.219928612 seconds time elapsed                                          ( +-  2.58% )

avx_prefetch_tanspose ( PFDIST = 8 )

 Performance counter stats for './avx_prefetch_transpose' (5 runs):

         3,269,625      cache-misses              #   50.112 % of all cache refs      ( +-  0.02% )
         6,524,581      cache-references                                              ( +-  0.04% )
     1,172,166,712      instructions              #    1.42  insns per cycle          ( +-  0.00% )
       825,967,980      cycles                                                        ( +-  0.31% )

       0.218844538 seconds time elapsed                                          ( +-  0.29% )

從 cache misses 的比例上來看可以看見 avx 在做了 prefetch 之後，有很大幅度的改善。但在執行時間上仍然是和 avx_transpose 與 sse_prefetch_transpose 沒有相差太多。（甚至是更慢的），另外 avx 的兩個版本的 IPC 均比 sse_prefetch 來的低。我個人猜測這很可能是為什麼 avx 效能上不及 sse_prefetch 的原因（整整少了10% 的 IPC ）。

2016q3 Homework4 (software-pipelining)

什麼是 prefectching?

論文閱讀

Key points

Software / Hardware Prefetching 的背景知識

文章閱讀概要

Synergistic effects using software + hardware prefetching

Antagonistic Effects when Using Software and Hardware Prefetching

作業實驗結果 ( native / sse / sse-prefetch )

更改 Makefile 並重新執行

修改成 AVX 版本

在桌機上跑（筆電沒有支援到 AVX2）

各版本的 cache misses

Reference

tags: `fzzzz` `sysprog`

2016q3 Homework4 (software-pipelining)

什麼是 prefectching?

論文閱讀

Key points

Software / Hardware Prefetching 的背景知識

文章閱讀概要

Synergistic effects using software + hardware prefetching

Antagonistic Effects when Using Software and Hardware Prefetching

作業實驗結果 ( native / sse / sse-prefetch )

更改 Makefile 並重新執行

修改成 AVX 版本

在桌機上跑（筆電沒有支援到 AVX2）

各版本的 cache misses

Reference

tags: fzzzz sysprog

tags: `fzzzz` `sysprog`