2017q1 Homework03 (software-pipelining)

contributed by <hugikun999>

When Prefetching Works, When It Doesn’t, and Why

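Access patterns that are easy to predict, such as matrices or recurrences, are easier to schedule in software; hashing, by contrast, is harder to predict.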
void _mm_prefetch(char * p , int i )
A hint to the processor that the data should be prefetched into the cache.
Reference website
void _mm_stream_ps(float * p , __m128 a );
Requests that the data be written directly to memory rather than into the cache.
Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency.
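
A minimal sketch of how these two intrinsics might be used together. The helper name `scale_stream` and the look-ahead of 16 floats are my own illustrative choices, not part of the assignment code, and the destination is assumed to be 16-byte aligned because `_mm_stream_ps` requires it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: dst[i] = src[i] * k for n floats.
 * Assumes dst is 16-byte aligned, which _mm_stream_ps requires. */
static void scale_stream(float *dst, const float *src, size_t n, float k)
{
    assert(((uintptr_t)dst & 15u) == 0);
    const __m128 vk = _mm_set1_ps(k);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Hint: start fetching the data 16 floats (64 bytes) ahead. */
        if (i + 16 < n)
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_T0);
        __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), vk);
        _mm_stream_ps(dst + i, v);  /* non-temporal store */
    }
    for (; i < n; i++)              /* scalar tail for the last n % 4 floats */
        dst[i] = src[i] * k;
    _mm_sfence();                   /* order the streaming stores before later stores */
}
```

The non-temporal store only pays off when `dst` is not read again soon; otherwise a normal store that keeps the line in the cache is probably the better choice.
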
Prefetch distance for an array: $D \geq \lceil l / S \rceil$, where $l$ is the prefetch latency and $S$ is the length of the shortest path through the loop body.
If the prefetch distance is too large, prefetched data could evict useful cache blocks.
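
To make the formula concrete, here is a small sketch that computes the distance with integer arithmetic and uses it in a loop. The function names are hypothetical, and $l$ and $S$ are assumed to be measured in the same unit (for example cycles), so that $D$ comes out in loop iterations.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch */

/* Smallest D with D >= ceil(latency / path_len), in loop iterations. */
static size_t prefetch_distance(size_t latency, size_t path_len)
{
    assert(path_len > 0);
    return (latency + path_len - 1) / path_len;
}

/* Sums a[0..n) while prefetching the element dist iterations ahead. */
static float sum_with_prefetch(const float *a, size_t n, size_t dist)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)  /* branch keeps the prefetch inside the array */
            _mm_prefetch((const char *)&a[i + dist], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
```

With made-up numbers, a latency of 200 cycles and a loop body of 12 cycles give $\lceil 200 / 12 \rceil = 17$, so the prefetch would be issued at least 17 iterations ahead.
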
Two types of memory indexing
- Direct
  easily prefetched by hardware
- Indirect
  relatively easier to prefetch in software than in hardware
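
A sketch of the indirect case, where the address depends on another array (`data[idx[i]]`). Software can read the future index and prefetch the element it points to. The name `gather_sum` and the `dist` parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch */

/* Sums data[idx[i]] for i in [0, n).
 * Every idx[i] must be a valid index into data. */
static long gather_sum(const int *data, const size_t *idx, size_t n, size_t dist)
{
    assert(data != NULL && idx != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* idx itself is accessed directly (sequentially); the target
         * data[idx[i + dist]] is the indirect access we prefetch by hand. */
        if (i + dist < n)
            _mm_prefetch((const char *)&data[idx[i + dist]], _MM_HINT_T0);
        sum += data[idx[i]];
    }
    return sum;
}
```
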
Benefits of Software Prefetching over Hardware Prefetching
- Large Number of Streams
  Not limited by hardware resources.
- Short Streams
  Hardware prefetchers need to be trained. If a stream is extremely short, there will not be enough cache misses to train the hardware to load useful cache blocks.
- Irregular Memory Access
- Cache Locality Hint
  Greater flexibility of placing data in the right cache level with hint.
- Loop Bounds
  Software can keep prefetch requests within array bounds using methods such as loop unrolling, software pipelining, and branch instructions.
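
The last two points can be sketched together. The hint argument of `_mm_prefetch` selects the locality (`_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2`, `_MM_HINT_NTA`), and the iteration range can be divided so that the main loop never prefetches past the end of the array. The function name `sum_once` is hypothetical, and using `_MM_HINT_NTA` here assumes the array is read only once.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch and the _MM_HINT_* constants */

/* Sums a[0..n), assuming the data is used once and not needed again. */
static float sum_once(const float *a, size_t n, size_t dist)
{
    assert(a != NULL || n == 0);
    float sum = 0.0f;
    size_t main_end = n > dist ? n - dist : 0;
    size_t i = 0;
    /* Main loop: i + dist < n always holds, so no bounds branch is needed. */
    for (; i < main_end; i++) {
        _mm_prefetch((const char *)&a[i + dist], _MM_HINT_NTA);
        sum += a[i];
    }
    /* Remaining iterations: everything they need was already requested. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```
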
Negative Impacts of Software Prefetching
- Increased Instruction Count.
- Static Insertion
  Cannot adapt to run-time behavioral changes.
- Code Structure Change
  When the number of instructions in a loop is extremely small, it can be challenging to insert software prefetching instructions.

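Terminology

prefetch distance
How far ahead of the current access a prefetch is issued.
degree of prefetching
How many cache lines are prefetched. For example, if 512 bytes are prefetched and each cache line is 256 bytes, the degree is 2.
intrinsics
They look like ordinary functions, but the compiler translates them directly into assembly.
loop splitting
Independent operations inside one loop are separated into different loops. This does not necessarily optimize the program; whether performance improves depends on what the loop does. In the test below, splitting into three for loops makes the program slower. This matches expectations: more for loops mean more condition checks, so branch predictor mispredictions increase as well.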

    for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
        for (int c = 0; c < size; c++){
            data[c] *= 10;
            data[c] += 7;
            data[c] &= 15;
        }
#else
        for (int c = 0; c < size; c++){
            data[c] *= 10;
        }
        for (int c = 0; c < size; c++){
            data[c] += 7;
        }
        for (int c = 0; c < size; c++){
            data[c] &= 15;
        }
#endif
    }
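
The fragment above does not define `data` or `size`, so here is a self-contained sketch of the two variants as functions. The names `update_together` and `update_split` are mine. Both are meant to produce the same result, so only the running time should differ when they are timed.

```c
#include <assert.h>
#include <stddef.h>

/* All three operations in a single pass over the array. */
static void update_together(int *data, size_t size)
{
    assert(data != NULL || size == 0);
    for (size_t c = 0; c < size; c++) {
        data[c] *= 10;
        data[c] += 7;
        data[c] &= 15;
    }
}

/* The same work split into three loops, one operation per pass. */
static void update_split(int *data, size_t size)
{
    assert(data != NULL || size == 0);
    for (size_t c = 0; c < size; c++)
        data[c] *= 10;
    for (size_t c = 0; c < size; c++)
        data[c] += 7;
    for (size_t c = 0; c < size; c++)
        data[c] &= 15;
}
```

For an input of 3, both variants compute (3 * 10 + 7) & 15 = 5.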

Loop Transformations
This covers many techniques for handling loops.

Prefetch

File analysis

_mm_loadu_si128
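Searching for this function in the Intel Intrinsics Guide returns two results. The difference is that _mm_lddqu_si128 carries a note saying it may perform better than _mm_loadu_si128 when the data crosses a cache-line boundary.

One thing puzzles me here: main only has #include <xmmintrin.h> and not #include "emmintrin.h", which does not match Intel's documentation for this API. I tried changing it, but that affected neither compilation nor the measurements.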

<mmintrin.h>  MMX
<xmmintrin.h> SSE
<emmintrin.h> SSE2
<pmmintrin.h> SSE3
<tmmintrin.h> SSSE3
<smmintrin.h> SSE4.1
<nmmintrin.h> SSE4.2
<ammintrin.h> SSE4A
<wmmintrin.h> AES
<immintrin.h> AVX
<zmmintrin.h> AVX512

Optimization

_mm_lddqu_si128

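I first tried this API, which I found in the Intel Intrinsics Guide, but the measurements showed no particular speedup. A separate experiment should be designed to check whether this API really reads data faster when the data crosses a cache line.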

sse prefetch: 	 57656 us
sse: 		 119557 us
sse_lddqu: 	 118414 us
naive: 		 258989 us

A faster integer SSE unalligned load that's rarely used

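I tried changing the size of the matrix to see whether the size affects performance. Another reason was to check whether _mm_lddqu_si128 really performs better than _mm_loadu_si128 when crossing a cache line, as the official documentation states. The results are shown in the charts. The sse_lddqu_align variant adds a step that aligns the pointer; the alignment I used is 16 bytes, in order to compare against another API, _mm_load_si128.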
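
A sketch of the checks such an experiment needs: whether a pointer is 16-byte aligned, and whether a 16-byte load from it straddles a cache line. The 64-byte line size and all helper names are assumptions. The sketch sticks to the SSE2 loads; _mm_lddqu_si128 from <pmmintrin.h> could be swapped in for the unaligned case when the build enables SSE3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_loadu_si128 */

#define CACHE_LINE 64u  /* assumed cache-line size in bytes */

static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* True if the 16 bytes starting at p straddle a cache-line boundary. */
static int crosses_cache_line(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) > CACHE_LINE - 16u;
}

/* Uses the aligned load when allowed, the unaligned one otherwise. */
static __m128i load128(const void *p)
{
    assert(p != NULL);
    return is_aligned16(p) ? _mm_load_si128((const __m128i *)p)
                           : _mm_loadu_si128((const __m128i *)p);
}
```

With a 64-byte-aligned buffer `buf`, a load from `buf + 56` covers bytes 56 to 71 and therefore crosses a line, which is the case the cross-cache-line comparison would need to exercise.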

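The implementation here is based on SSE, so every 128 bits are treated as one unit. Sizes that are not a multiple of 4 will cause problems during the computation, so take special care.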
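
One common way around the multiple-of-4 restriction is a scalar tail loop for the leftover elements. The sketch below uses an element-wise add chosen only for illustration (the name `add_const` is hypothetical); it is not the code from the assignment.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: 128-bit integer intrinsics */

/* Adds k to each of the n ints in a; n need not be a multiple of 4. */
static void add_const(int *a, size_t n, int k)
{
    assert(a != NULL || n == 0);
    const __m128i vk = _mm_set1_epi32(k);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  /* four 32-bit ints per 128-bit register */
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        _mm_storeu_si128((__m128i *)(a + i), _mm_add_epi32(v, vk));
    }
    for (; i < n; i++)            /* scalar tail: the last n % 4 elements */
        a[i] += k;
}
```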

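The three charts below are drawn from the same data. They were hard to tell apart when overlaid, so I split them into three. The first chart shows that sse_lddqu_align actually takes the most time at most sizes.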