2017q1 Homework3 (software-pipelining)

contributed by <ierosodin>

Reviewed by `jackyhobingo`

在實驗的作圖上，橫軸的單位看起來很混亂，數字都擠在一起，可以考慮修改一下
在 commit 上都只有一行說明，沒有明確說明為什麼改了就可以增進效能
有些標題下的內文還未出現，期待版本三的出現

論文筆記

hardware prefetching
- stream prefetcher (STR)
  當發生連續 data access cache miss 時，就會先將接下來連續的 data prefetch 進 stream buffer 。
- stride prefetcher (GHB)
  不同於 stream prefetcher ，當發生固定間隔的 cache miss 時，預測接下來會 access 的 data ，進行 prefetching 。
  而 global history buffer (GHB) 以 FIFO 記錄過去 data access 的 address ，並利用 hash table 作為 index 來減少空間的浪費，再以這些紀錄進行 prefetch 。
software prefetching
- icc
- gcc
- intrinsic
  _mm prefetch(char *ptr, int hint)
  - _MM HINT_T0 (all level caches)
  - _MM HINT_T1 (respect to first level cache)
  - _MM HINT_T2 (respect to second level cache)
  - _MM HINT_NTA (Non-temporal data)
Data Structures and Corresponding Prefetching

structure	code	pattern	sizeof a	index	SW	HW
Array	a[i]	stream	< cache block	direct	a[i+D]	stream
Array	a[i], a[i][const]	stride	> cache block	direct	a[i+D][j]	stride, GHB
Array	a[b[i]]	irregular	< cache block	indirect	a[b[i+D]]	CDP, markov…
RDS	a = a -> next	stride, irregular		indirect	a -> next ->next…	CDP, markov…
Hash	b[f(i)] -> a	irregular		indirect	Group prefetching…	helper threads, pre-computation

Prefetch Classification

分類 prefetch 的精確度

Software Prefetch Distance
- $D >= [\frac{l}{s}]$
  where D is the prefetch distance, l is the prefetch latency and s is the length of the shortest path through the loop body
- large enough to hide the latency
- if it is too large, prefetched data could evict useful cache blocks, and the elements in the beginning of the array may not be prefetched
Direct and Indirect Memory Indexing
- indirect memory indexing requires special hardware prefetching mechanisms
- indirect indexing is relatively simpler to compute in software. So software prefetching is more effective than hardware prefetching.
Benefits of Software Prefetching
- 在面對 Large Number of Streams 時，由於 HW prefetcher 一次能 prefetch 的數量有限，不像 SW prefetcher 的靈活度較高。
- 然而 HW prefetcher 通常至少需要兩次的 cache misses 來 training ，因此 Short Streams 在 HW prefetcher 上的效能也較 SW prefetcher 來得不理想。
- Irregular Memory Access 對 HW prefetcher 來說較複雜，不容易進行 prefetch 。
- HW prefetcher 通常會將 data prefetch 進 lowerlevel cache，例如 L2、L3 ，而從 L2 到 L1 又是一筆效能支出(雖然這能減少 cache pollution)，相較於 SW prefetcher ，提供 Cache Locality Hint 來更彈性的進行 data prefetch。
- 如何決定 Loop Bounds 對 HW prefetcher 來說是困難的，而 SW prefetcher 則可以使用像是 loop unrolling、sofeware pipelining 等來預防 prefetch 超出 array bounds。
Negative Impacts of Software Prefetching
- 由於 SW prefetcher 需要花費額外的支出在指令的 fetch and execution ，因此會增加指令的數量。
- SW prefetcher 無論是 compiler 或是 intrinsic ，都是事先決定好 prefetch distance ，然而像是 memory latency 都是會隨時間變化的。有些 HW prefetcher 則可以對此作出相對應的改變。
- 當 loop 中要執行的 instruction 數量較少時， SW prefetcher 可能會發生來不及 prefetch 的問題。
Synergistic Effects
- HW 與 SW 可以分別處理不同的 data structure，像是 HW prefetcher 處裡 regular streams ，而 SW prefetcher 則處理 irregular streams 。
- SW prefetcher 的動作可以用來 training HW prefetcher ，讓 HW prefetcher 知道何時要 prefetch 。
Antagonistic Effects
- 如果 SW prefetcher 跳過部分的 streams ，將會影響 HW prefetcher 的 training ，且過度的 training 也會造成反效果。
- SW prefetcher 通常比 HW prefetcher 來得精準，因此若 SW prefetcher 發生 incorrect 或是 early ，將降低 HW prefetcher 的效益。

EXPERIMENTAL

Transpose Matrix

修改 Makefile

改成分別產生 naive_transpose sse_transpose sse_prefetch_transpose 執行檔，並用 -D 來辨別實作的方法。

封裝技巧

修改 impl.c ，用 #ifdef 來區別不同的 transpose 方法，這樣在 main.c 中，同樣是呼叫 transpose() ，會因為不同的 define ，而呼叫不一樣的實作函式。

How SSE transpose Matrix

SEE 儲存變數的 type 是 __m128i ，因此可以存 4 的整數(32 bits)。

__m128i _mm_loadu_si128 (__m128i *p);
在每一個 for 迴圈中，會呼叫 4 次 load 來儲存 4 個 row 的資料，也就是一個 4*4 的矩陣。

I0	01	02	03	04
I1	11	12	13	14
I2	21	22	23	24
I3	31	32	33	34

__m128i _mm_unpacklo_epi32 (__m128i a, __m128i b);
會接收兩個 128 bits 的 data ，會取出低位的兩組 32 bits，並做交叉組合
__m128i _mm_unpackhi_epi32 (__m128i a, __m128i b);
會接收兩個 128 bits 的 data ，會取出高位的兩組 32 bits，並做交叉組合

D0 = _mm_unpacklo_epi32(I0, I1)	01	11	02	12
D1 = _mm_unpacklo_epi32(I2, I3)	21	31	22	32
D2 = _mm_unpackhi_epi32(I0, I1)	03	13	04	14
D3 = _mm_unpackhi_epi32(I2, I3)	23	33	24	34

__m128i _mm_unpacklo_epi64 (__m128i a, __m128i b);
會接收兩個 128 bits 的 data ，會取出低位的一組 64 bits，並做交叉組合
__m128i _mm_unpackhi_epi64 (__m128i a, __m128i b);
會接收兩個 128 bits 的 data ，會取出高位的一組 64 bits，並做交叉組合

I0 = _mm_unpacklo_epi64(D0, D1)	01	11	21	31
I1 = _mm_unpackhi_epi64(D0, D1)	02	02	22	32
I2 = _mm_unpacklo_epi64(D2, D3)	03	13	23	33
I3 = _mm_unpackhi_epi64(D2, D3)	04	14	24	34

__m128i _mm_storeu_si128 (__m128i *p, __m128i a);
將轉置後的 data 存回陣列

I0	01	11	21	31
I1	02	02	22	32
I2	03	13	23	33
I3	04	14	24	34

從 SSE 實作 transpose 的方法可以發現，矩陣的邊長必須是 4 的倍數！ierosodin

效能分析

以 gnuplot ，畫出三種方法矩陣邊長對時間的折線圖：

可以發現使用 prefetch 明顯的提升了轉置矩陣的速度，但 naive 與 sse 兩者的效能卻很難比較，並發現 sse 矩陣邊長對時間的震盪很明顯且似乎有規律！拉出來分析：

這是從邊長 1600 開始，每次加 4 所得到的結果，可以發現，邊長為 8 的倍數的表現明顯較好

是 cache 的行為造成的？思考如何解釋這個現象…ierosodin

perf 分析 cache-misses

AVX

有關 SIMD 語法的使用，可以參考The Intel Intrinsics Guide

版本一：仿照 SSE 的方法，實作 8*8 矩陣的 transpose





























































for (int x = 0; x < w; x += 8) {
    for (int y = 0; y < h; y += 8) {
        __m256 I0 = _mm256_loadu_ps((src + (y + 0) * w + x));
        __m256 I1 = _mm256_loadu_ps((src + (y + 1) * w + x));
        __m256 I2 = _mm256_loadu_ps((src + (y + 2) * w + x));
        __m256 I3 = _mm256_loadu_ps((src + (y + 3) * w + x));
        __m256 I4 = _mm256_loadu_ps((src + (y + 4) * w + x));
        __m256 I5 = _mm256_loadu_ps((src + (y + 5) * w + x));
        __m256 I6 = _mm256_loadu_ps((src + (y + 6) * w + x));
        __m256 I7 = _mm256_loadu_ps((src + (y + 7) * w + x));
        __m256 T0 = _mm256_unpacklo_ps(I0, I1);
        __m256 T1 = _mm256_unpacklo_ps(I2, I3);
        __m256 T2 = _mm256_unpackhi_ps(I0, I1);
        __m256 T3 = _mm256_unpackhi_ps(I2, I3);
        __m256 T4 = _mm256_unpacklo_ps(I4, I5);
        __m256 T5 = _mm256_unpacklo_ps(I6, I7);
        __m256 T6 = _mm256_unpackhi_ps(I4, I5);
        __m256 T7 = _mm256_unpackhi_ps(I6, I7);
        __m256d D0 = _mm256_castps_pd(T0);
        __m256d D1 = _mm256_castps_pd(T1);
        __m256d D2 = _mm256_castps_pd(T2);
        __m256d D3 = _mm256_castps_pd(T3);
        __m256d D4 = _mm256_castps_pd(T4);
        __m256d D5 = _mm256_castps_pd(T5);
        __m256d D6 = _mm256_castps_pd(T6);
        __m256d D7 = _mm256_castps_pd(T7);
        __m256d J0 = _mm256_unpacklo_pd(D0, D1);
        __m256d J1 = _mm256_unpackhi_pd(D0, D1);
        __m256d J2 = _mm256_unpacklo_pd(D2, D3);
        __m256d J3 = _mm256_unpackhi_pd(D2, D3);
        __m256d J4 = _mm256_unpacklo_pd(D4, D5);
        __m256d J5 = _mm256_unpackhi_pd(D4, D5);
        __m256d J6 = _mm256_unpacklo_pd(D6, D7);
        __m256d J7 = _mm256_unpackhi_pd(D6, D7);
        D0 = _mm256_permute2f128_pd(J0, J4, 0x20);
        D1 = _mm256_permute2f128_pd(J1, J5, 0x20);
        D2 = _mm256_permute2f128_pd(J2, J6, 0x20);
        D3 = _mm256_permute2f128_pd(J3, J7, 0x20);
        D4 = _mm256_permute2f128_pd(J0, J4, 0x31);
        D5 = _mm256_permute2f128_pd(J1, J5, 0x31);
        D6 = _mm256_permute2f128_pd(J2, J6, 0x31);
        D7 = _mm256_permute2f128_pd(J3, J7, 0x31);
        I0 = _mm256_castpd_ps(D0);
        I1 = _mm256_castpd_ps(D1);
        I2 = _mm256_castpd_ps(D2);
        I3 = _mm256_castpd_ps(D3);
        I4 = _mm256_castpd_ps(D4);
        I5 = _mm256_castpd_ps(D5);
        I6 = _mm256_castpd_ps(D6);
        I7 = _mm256_castpd_ps(D7);
        _mm256_store_ps((dst + ((x + 0) * h) + y), I0);
        _mm256_store_ps((dst + ((x + 1) * h) + y), I1);
        _mm256_store_ps((dst + ((x + 2) * h) + y), I2);
        _mm256_store_ps((dst + ((x + 3) * h) + y), I3);
        _mm256_store_ps((dst + ((x + 4) * h) + y), I4);
        _mm256_store_ps((dst + ((x + 5) * h) + y), I5);
        _mm256_store_ps((dst + ((x + 6) * h) + y), I6);
        _mm256_store_ps((dst + ((x + 7) * h) + y), I7);
    }
}

雖然成功的將 8*8 矩陣轉置，但當矩陣邊長增加時，卻出現了 Segmentation fault (core dumped) ！問題出在 avx 僅有 15 個 register (ymm0~ymm14)，然而 code 中卻大量宣告 __m256 。

register 不是免費的阿！！！ierosodin

版本二

利用 _mm_256_shuffle_ps 代替 _mm256_unpacklo_pd ，這樣可以免去資料型態的轉換













































for (int x = 0; x < w; x += 8) {
    for (int y = 0; y < h; y += 8) {
        __m256 I0 = _mm256_loadu_ps((src + (y + 0) * w + x));
        __m256 I1 = _mm256_loadu_ps((src + (y + 1) * w + x));
        __m256 I2 = _mm256_loadu_ps((src + (y + 2) * w + x));
        __m256 I3 = _mm256_loadu_ps((src + (y + 3) * w + x));
        __m256 I4 = _mm256_loadu_ps((src + (y + 4) * w + x));
        __m256 I5 = _mm256_loadu_ps((src + (y + 5) * w + x));
        __m256 I6 = _mm256_loadu_ps((src + (y + 6) * w + x));
        __m256 I7 = _mm256_loadu_ps((src + (y + 7) * w + x));
        __m256 T0 = _mm256_unpacklo_ps(I0, I1);
        __m256 T1 = _mm256_unpacklo_ps(I2, I3);
        __m256 T2 = _mm256_unpackhi_ps(I0, I1);
        __m256 T3 = _mm256_unpackhi_ps(I2, I3);
        __m256 T4 = _mm256_unpacklo_ps(I4, I5);
        __m256 T5 = _mm256_unpacklo_ps(I6, I7);
        __m256 T6 = _mm256_unpackhi_ps(I4, I5);
        __m256 T7 = _mm256_unpackhi_ps(I6, I7);
        I0 = _mm256_shuffle_ps(T0, T1, _MM_SHUFFLE(1,0,1,0));
        I1 = _mm256_shuffle_ps(T0, T1, _MM_SHUFFLE(3,2,3,2));
        I2 = _mm256_shuffle_ps(T2, T3, _MM_SHUFFLE(1,0,1,0));
        I3 = _mm256_shuffle_ps(T2, T3, _MM_SHUFFLE(3,2,3,2));
        I4 = _mm256_shuffle_ps(T4, T5, _MM_SHUFFLE(1,0,1,0));
        I5 = _mm256_shuffle_ps(T4, T5, _MM_SHUFFLE(3,2,3,2));
        I6 = _mm256_shuffle_ps(T6, T6, _MM_SHUFFLE(1,0,1,0));
        I7 = _mm256_shuffle_ps(T7, T7, _MM_SHUFFLE(3,2,3,2));
        T0 = _mm256_permute2f128_ps(I0, I4, 0x20);
        T1 = _mm256_permute2f128_ps(I1, I5, 0x20);
        T2 = _mm256_permute2f128_ps(I2, I6, 0x20);
        T3 = _mm256_permute2f128_ps(I3, I7, 0x20);
        T4 = _mm256_permute2f128_ps(I0, I4, 0x31);
        T5 = _mm256_permute2f128_ps(I1, I5, 0x31);
        T6 = _mm256_permute2f128_ps(I2, I6, 0x31);
        T7 = _mm256_permute2f128_ps(I3, I7, 0x31);
        _mm256_store_ps((dst + ((x + 0) * h) + y), T0);
        _mm256_store_ps((dst + ((x + 1) * h) + y), T1);
        _mm256_store_ps((dst + ((x + 2) * h) + y), T2);
        _mm256_store_ps((dst + ((x + 3) * h) + y), T3);
        _mm256_store_ps((dst + ((x + 4) * h) + y), T4);
        _mm256_store_ps((dst + ((x + 5) * h) + y), T5);
        _mm256_store_ps((dst + ((x + 6) * h) + y), T6);
        _mm256_store_ps((dst + ((x + 7) * h) + y), T7);
    }
}