# 2017q1 Homework3 (software-pipelining)
contributed by < ` changyuanhua ` >
## Development Environment
* Command: ` lscpu `
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
Stepping: 3
CPU MHz: 799.890
CPU max MHz: 3200.0000
CPU min MHz: 800.0000
BogoMIPS: 4608.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
```
## Paper Reading Notes
### Terminology
Data reference patterns are classified into three cases:
* Temporal: the data will be used again soon
* Spatial: the data will be used at nearby locations (e.g., on the same cache line)
* Non-temporal: the data is referenced once and not reused (e.g., some multimedia data types, such as vertex buffers in 3D graphics applications)
* Prefetch:
  * In terms of the memory hierarchy, the prefetch intrinsic takes a hint argument that selects which cache level the data is loaded into
> First, understand prefetch
> > Before the data is needed, load it from main memory into the cache. When the data is actually used, accessing the cache is much faster than going to main memory, so prefetching can hide memory access latency.
> Where to place prefetched data?
> > Cache or dedicated buffer
* Hardware prefetch
  * Uses past access history as the basis for prefetching, so the HW prefetcher needs training
  * GHB: Global History Buffer
> Understand hardware prefetch
> > The hardware prefetcher watches memory accesses going from the processor to the cache: it monitors which memory locations the processor is currently accessing, predicts which locations will be needed next, and issues prefetches to those locations.
* Software prefetch
  * Prefetch instructions can be inserted at the right places either by hand or by the compiler.
  * SW here refers to the _mm_prefetch calls written by the instructor in the provided code, which we can control ourselves
### Research Goals
* the goal of this study is to develop a novel, foundational understanding of both the benefits and limitations of hardware and software prefetching
* Provide guidelines for using prefetch intrinsics effectively in the presence of a HW prefetcher
* Identify combinations that perform well when HW & SW prefetchers coexist
### Study
* Source code-level analysis: helps understand the real benefits and limitations of compiler- and software-based prefetching
* The synergistic and antagonistic effects between software and hardware prefetching
* An evaluation of hardware prefetching training policies in the presence of software prefetching requests
### Introduction
The two biggest problems with inserting prefetch instructions by hand:
* There is no rigorous guideline for how to insert prefetch intrinsics
* The complexity of interactions between software and hardware prefetching is not well understood
#### Experiments
* Summarizes the complex interactions between software and hardware prefetching
![](https://i.imgur.com/s3Omhht.png)
* x-axis: SPEC CPU 2006 benchmarks
* y-axis: execution time normalized to the baseline binaries (only compiler-inserted SW prefetches)
* CPU: Intel’s Core 2 and Nehalem
* Compiled with icc (experiments with gcc were also done)
* GHB (stride prefetcher) / STR (stream prefetcher): compiler + HW prefetcher
* Base: (i.e., the best among Base, GHB, and STR)
* SW: compiler + intrinsics inserted in the program (i.e., the best among SW, SW+GHB, and SW+STR)
* Confirmed prior findings: SW prefetch performs better for short arrays, continuous and irregular memory accesses, and for reducing L1 cache misses
* New finding: SW prefetch can have a training effect on the HW prefetcher and degrade performance, especially when SW prefetch is applied to only part of a stream
* The study targets only existing mechanisms; both simulation and real systems are used: the more elaborate experiments are run in simulation, and the findings are then validated on real systems
### Background on HW & SW Prefetchers
#### SW and HW prefetching suited to different data structures
* summarizes common data structures, their corresponding data access patterns, and the hardware/software prefetching mechanisms previously proposed for each.
![](https://i.imgur.com/ERu9psG.png)
* Dead-block-based prefetching works for every kind of data structure
* Recursive Data Structures (RDS): RDS accesses can easily be predicted and prefetched in software
* Hashing: hard to prefetch effectively
* Cache-line access definitions:
  * stream: unit-stride accesses
  * stride: access stride distances larger than two cache lines
* HW prefetchers evaluated: stream prefetcher, GHB (stride) prefetcher, content-based prefetcher
#### Software Prefetch Intrinsics
![](https://i.imgur.com/850jWfQ.png)
* When inserting prefetch requests by hand, we use software prefetch intrinsics (part of the x86 SSE SIMD extensions)
* When using SSE's _mm_prefetch(char *ptr, int hint), you must specify
  * prefetch address
  * prefetch usage hint
* direct address prefetches: 2 instructions per intrinsic
* indirect memory address prefetches: 4 instructions per intrinsic
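A minimal usage sketch of the intrinsic (my own example, not code from the paper; the array `a`, the loop, and the distance 16 are assumptions for illustration):
```clike=
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_* */

/* Hypothetical example: walk over a[] and prefetch a few iterations ahead. */
void scan(const int *a, int n)
{
    for (int i = 0; i < n; i++) {
        /* The hint selects the target cache level (exact mapping is
         * implementation-defined; roughly, on recent Intel CPUs):
         *   _MM_HINT_T0  - all cache levels (L1 and up)
         *   _MM_HINT_T1  - L2 and up, bypassing L1
         *   _MM_HINT_T2  - L3 and up
         *   _MM_HINT_NTA - non-temporal, minimize cache pollution */
        _mm_prefetch((const char *)(a + i + 16), _MM_HINT_T0);
        /* ... use a[i] here ... */
    }
}
```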
#### Prefetch classification
![](https://i.imgur.com/cdaB0ox.png)
![](https://i.imgur.com/8QR3ye5.png)
> early vs late
> > Too early: the prefetched data evicts other useful data (cache pollution), or is itself evicted before it is used
> > Too late: the prefetch cannot hide the processor stall
* ex: if a demand access occurs after its corresponding prefetch request but before it is inserted into the cache, we classify the prefetch as “late.”
* MSHR: Miss Status Holding Register
#### Software Prefetch Distance
* The prefetch distance must be large enough to hide the memory latency
* D is the prefetch distance: how far ahead of the current memory address a prefetch request should be issued
* l is the prefetch latency
* s is the length of the shortest path through one loop iteration
* A prefetch distance that is too large can backfire, e.g., smaller coverage and more cache misses
* Coverage: the fraction of cache misses eliminated by prefetching
* 100 x (Prefetch Hits / (Prefetch Hits + Cache Misses))
![](https://i.imgur.com/PUh5dfT.png)
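Written out (this is my reading of the distance condition shown in the figure above, with D, l, and s as defined in the bullets):

$$D \ge \left\lceil \frac{l}{s} \right\rceil$$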
#### Direct and Indirect Memory Indexing
* Direct memory indexing:
  * can easily be prefetched by hardware, because the memory addresses exhibit regular stream/stride behavior
  * is not as easy to compute in software
  * Direct means the memory address to access is known straight from the instruction, so hardware prefetching can cover it.
* Indirect memory indexing:
  * requires special hardware prefetching mechanisms
  * is easier to compute in software
  * For code such as x = a[b[i]], the value of b[i] must be computed first to know which element of a[] is accessed; the machine cannot know better than the programmer how b[i] is produced, so software prefetching is easier to apply here.
  * Indirect memory accesses can incur SW prefetching overhead
    * e.g., a[i] needs one prefetch, while a[b[i]] needs two separate prefetches
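A minimal sketch contrasting the two cases (my own illustration, not code from the paper; `sum_direct`, `sum_indirect`, and `DIST` are made-up names):
```clike=
#include <xmmintrin.h>

#define DIST 16  /* assumed prefetch distance, for illustration only */

/* Direct indexing: the address a + i + DIST is a simple linear function of i,
 * so one prefetch per iteration is enough (prefetches never fault, so going
 * slightly past the end of the array is harmless). */
long sum_direct(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        _mm_prefetch((const char *)(a + i + DIST), _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}

/* Indirect indexing: prefetching a[b[i + DIST]] requires b[i + DIST] to be
 * available first, so it costs two steps: one prefetch for b[], one for a[]. */
long sum_indirect(const int *a, const int *b, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        _mm_prefetch((const char *)(b + i + DIST), _MM_HINT_T0);
        if (i + DIST < n)
            _mm_prefetch((const char *)(a + b[i + DIST]), _MM_HINT_T0);
        sum += a[b[i]];
    }
    return sum;
}
```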
## Prefetching in the Intel® Core™ Microarchitecture
![](https://i.imgur.com/ovC0E9y.png)
* My CPU's microarchitecture is Skylake [[reference]](http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5-6300HQ%20Mobile%20processor.html)
--> Knowing which microarchitecture the CPU has tells us which kinds of HW prefetchers are present
## Original Source Code
* naive
```clike=
void naive_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x++)
for (int y = 0; y < h; y++)
*(dst + x * h + y) = *(src + y * w + x);
}
```
This program performs ` dst[x][y] = src[y][x] `.
* C is a row-major language, so reading in column order keeps the cache from being used effectively.
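A tiny illustration of the point above (my own example; the width 4096 is an arbitrary assumption): in the naive inner loop, consecutive reads of src are `w * sizeof(int)` bytes apart, so for a large matrix every read touches a different cache line.
```clike=
#include <stdio.h>

int main(void)
{
    int w = 4096;  /* assumed matrix width, for illustration */
    /* Byte distance between two consecutive reads of src in the naive
     * inner loop (fixed x, y increasing), and how many 64-byte cache
     * lines that stride skips over. */
    printf("byte stride between consecutive reads: %zu\n", w * sizeof(int));
    printf("cache lines skipped per read: %zu\n", w * sizeof(int) / 64);
    return 0;
}
```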
* sse
```clike=
void sse_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
__m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```
This code loads 16 integers (four rows of four) into four 128-bit SSE registers at a time, uses unpackhi/unpacklo to shuffle the data, and finally stores each column of the 4x4 block as a row of dst.
* For matrix1, one load reads 128 bits = four ints into I0; given the starting address, it fetches the following 128 bits. I1, I2, and I3 work the same way.
```
__m128i I0 = _mm_loadu_si128((__m128i *)(src1 + (x + 0) * src1_w + k));
```
For details see [kobeyu's notes](https://hackmd.io/s/BycRx2EC), which explain this very clearly.
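To make the shuffle steps concrete, here is my trace of one 4x4 block through the intrinsics above (comments only, using a/b/c/d for the four source rows):
```clike=
/* Input rows (row y holds src[y][x .. x+3]):
 *   I0 = [a0 a1 a2 a3]
 *   I1 = [b0 b1 b2 b3]
 *   I2 = [c0 c1 c2 c3]
 *   I3 = [d0 d1 d2 d3]
 *
 * 32-bit unpack interleaves the low/high halves of two rows:
 *   T0 = unpacklo_epi32(I0, I1) = [a0 b0 a1 b1]
 *   T1 = unpacklo_epi32(I2, I3) = [c0 d0 c1 d1]
 *   T2 = unpackhi_epi32(I0, I1) = [a2 b2 a3 b3]
 *   T3 = unpackhi_epi32(I2, I3) = [c2 d2 c3 d3]
 *
 * 64-bit unpack then yields the columns, which are stored as rows of dst:
 *   I0 = unpacklo_epi64(T0, T1) = [a0 b0 c0 d0]   (column 0)
 *   I1 = unpackhi_epi64(T0, T1) = [a1 b1 c1 d1]   (column 1)
 *   I2 = unpacklo_epi64(T2, T3) = [a2 b2 c2 d2]   (column 2)
 *   I3 = unpackhi_epi64(T2, T3) = [a3 b3 c3 d3]   (column 3)
 */
```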
* sse_prefetch
```clike=
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
#define PFDIST 8
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
__m128i I0 = _mm_loadu_si128 ((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128 ((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128 ((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128 ((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```
This version adds SW prefetch to the sse code.
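With PFDIST = 8 and the inner loop advancing by 4, the `_mm_prefetch` calls fetch the four source rows that are 8 rows (two inner-loop iterations) ahead into L2 (`_MM_HINT_T1`). A small sketch, assuming one wanted to make the distance tunable for experiments (the helper name `prefetch_block` is mine, not part of the original code):
```clike=
#include <xmmintrin.h>

#ifndef PFDIST
#define PFDIST 8            /* prefetch distance in rows; tune per machine */
#endif

/* Prefetch the four source rows of the 4x4 block PFDIST rows ahead.
 * _MM_HINT_T1 loads into L2 (and lower levels), leaving L1 untouched. */
static inline void prefetch_block(const int *src, int x, int y, int w)
{
    for (int r = 0; r < 4; r++)
        _mm_prefetch((const char *)(src + (y + PFDIST + r) * w + x),
                     _MM_HINT_T1);
}
```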
* Performance analysis
* naive
```
Performance counter stats for './naive_transpose' (100 runs):
40,396,950 cache-misses # 86.186 % of all cache refs ( +- 0.30% )
46,872,044 cache-references ( +- 0.24% )
21,069,452 L1-dcache-load-misses # 3.79% of all L1-dcache hits ( +- 0.01% )
555,330,933 L1-dcache-loads ( +- 0.00% )
219,108,523 L1-dcache-stores ( +- 0.00% )
72,803 L1-icache-load-misses ( +- 2.12% )
0.358395944 seconds time elapsed ( +- 0.31% )
```
* sse
```
Performance counter stats for './sse_transpose' (100 runs):
14,842,277 cache-misses # 77.006 % of all cache refs ( +- 0.28% )
19,274,268 cache-references ( +- 0.05% )
8,514,541 L1-dcache-load-misses # 1.91% of all L1-dcache hits ( +- 0.02% )
445,306,115 L1-dcache-loads ( +- 0.00% )
232,791,733 L1-dcache-stores ( +- 0.00% )
103,480 L1-icache-load-misses ( +- 2.20% )
0.273274670 seconds time elapsed ( +- 0.31% )
```
* sse_prefetch
```
Performance counter stats for './sse_prefetch_transpose' (100 runs):
10,613,966 cache-misses # 72.171 % of all cache refs ( +- 0.70% )
14,706,647 cache-references ( +- 0.28% )
8,519,990 L1-dcache-load-misses # 1.83% of all L1-dcache hits ( +- 0.02% )
466,169,620 L1-dcache-loads ( +- 0.00% )
232,724,643 L1-dcache-stores ( +- 0.00% )
83,636 L1-icache-load-misses ( +- 2.14% )
0.186173153 seconds time elapsed ( +- 0.63% )
```
| method | cache-misses rate| L1-dcache-load-misses rate |
| ------ | ---------------- | -------------------------- |
| naive | 86.186 % | 3.79% |
| sse | 77.006 % | 1.91% |
| sse prefetcher | 72.171 % | 1.83% |
* Time
```
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
```
* naive : 236494 us
* sse : 139678 us
* sse_prefetch : 62750 us
## Experiment 1: AVX
The AVX instruction set has no 128-bit "unpack" instruction, so _mm256_permute2x128_si256() is used instead.
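How the replacement works: the 8-bit immediate of `_mm256_permute2x128_si256` selects one 128-bit lane from each source register, so the immediates 0x20 and 0x31 used in the last shuffle step of the code below behave like a hypothetical "unpacklo/unpackhi_epi128". The wrappers here (my own naming, requires AVX2, e.g. `-mavx2`) just spell that out:
```clike=
#include <immintrin.h>

/* imm = 0x20 -> [ a.lane0 | b.lane0 ]  (acts like "unpacklo_epi128") */
static inline __m256i merge_lo128(__m256i a, __m256i b)
{
    return _mm256_permute2x128_si256(a, b, 0x20);
}

/* imm = 0x31 -> [ a.lane1 | b.lane1 ]  (acts like "unpackhi_epi128") */
static inline __m256i merge_hi128(__m256i a, __m256i b)
{
    return _mm256_permute2x128_si256(a, b, 0x31);
}
```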
* code:
```clike=
void avx_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 8) {
for (int y = 0; y < h; y += 8) {
__m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
__m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
__m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
__m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
__m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
__m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
__m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
__m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
__m256i T0 = _mm256_unpacklo_epi32(I0, I1);
__m256i T1 = _mm256_unpacklo_epi32(I2, I3);
__m256i T2 = _mm256_unpackhi_epi32(I0, I1);
__m256i T3 = _mm256_unpackhi_epi32(I2, I3);
__m256i T4 = _mm256_unpacklo_epi32(I4, I5);
__m256i T5 = _mm256_unpacklo_epi32(I6, I7);
__m256i T6 = _mm256_unpackhi_epi32(I4, I5);
__m256i T7 = _mm256_unpackhi_epi32(I6, I7);
I0 = _mm256_unpacklo_epi64(T0, T1);
I1 = _mm256_unpackhi_epi64(T0, T1);
I2 = _mm256_unpacklo_epi64(T2, T3);
I3 = _mm256_unpackhi_epi64(T2, T3);
I4 = _mm256_unpacklo_epi64(T4, T5);
I5 = _mm256_unpackhi_epi64(T4, T5);
I6 = _mm256_unpacklo_epi64(T6, T7);
I7 = _mm256_unpackhi_epi64(T6, T7);
            /* Combine 128-bit lanes: 0x20 takes both low lanes, 0x31 both high lanes. */
            T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
            T1 = _mm256_permute2x128_si256(I1, I5, 0x20);
            T2 = _mm256_permute2x128_si256(I2, I6, 0x20);
            T3 = _mm256_permute2x128_si256(I3, I7, 0x20);
            T4 = _mm256_permute2x128_si256(I0, I4, 0x31);
            T5 = _mm256_permute2x128_si256(I1, I5, 0x31);
            T6 = _mm256_permute2x128_si256(I2, I6, 0x31);
            T7 = _mm256_permute2x128_si256(I3, I7, 0x31);
_mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
_mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
_mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
_mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
_mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
_mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5);
_mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6);
_mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7);
}
}
}
```
* Performance analysis
* naive
```
Performance counter stats for './naive_transpose' (100 runs):
41,691,591 cache-misses # 86.654 % of all cache refs ( +- 0.24% )
48,112,706 cache-references ( +- 0.22% )
21,074,570 L1-dcache-load-misses # 3.79% of all L1-dcache hits ( +- 0.01% )
555,368,170 L1-dcache-loads ( +- 0.00% )
219,132,867 L1-dcache-stores ( +- 0.00% )
78,851 L1-icache-load-misses ( +- 2.26% )
0.362426909 seconds time elapsed ( +- 0.45% )
```
* sse
```
Performance counter stats for './sse_transpose' (100 runs):
15,237,833 cache-misses # 78.909 % of all cache refs ( +- 0.47% )
19,310,745 cache-references ( +- 0.06% )
8,526,042 L1-dcache-load-misses # 1.91% of all L1-dcache hits ( +- 0.03% )
445,281,290 L1-dcache-loads ( +- 0.00% )
232,777,361 L1-dcache-stores ( +- 0.00% )
98,770 L1-icache-load-misses ( +- 2.42% )
0.266709924 seconds time elapsed ( +- 0.38% )
```
* sse_prefetch
```
Performance counter stats for './sse_prefetch_transpose' (100 runs):
8,041,382 cache-misses # 65.431 % of all cache refs ( +- 0.89% )
12,289,899 cache-references ( +- 0.28% )
8,523,157 L1-dcache-load-misses # 1.83% of all L1-dcache hits ( +- 0.02% )
466,153,254 L1-dcache-loads ( +- 0.00% )
232,714,767 L1-dcache-stores ( +- 0.00% )
71,794 L1-icache-load-misses ( +- 2.35% )
0.168626115 seconds time elapsed ( +- 0.71% )
```
* avx
```
Performance counter stats for './avx_transpose' (100 runs):
10,646,344 cache-misses # 70.440 % of all cache refs ( +- 0.17% )
15,114,039 cache-references ( +- 0.06% )
8,129,909 L1-dcache-load-misses # 2.02% of all L1-dcache hits ( +- 0.03% )
402,956,831 L1-dcache-loads ( +- 0.00% )
210,944,765 L1-dcache-stores ( +- 0.00% )
72,525 L1-icache-load-misses ( +- 1.06% )
0.192949994 seconds time elapsed ( +- 0.20% )
```
| method | cache-misses rate| L1-dcache-load-misses rate |
| ------ | ---------------- | -------------------------- |
| naive | 86.654 % | 3.79% |
| sse | 78.909 % | 1.91% |
| sse prefetcher | 65.431 % | 1.83% |
| avx | 70.440 % | 2.02% |
> I notice the L1-dcache-load-misses rates are exactly the same as in the first run; I am not sure whether this is a coincidence or a mistake in my measurement. I will test other versions later to check.
* Time:
Transposed an 8x8 matrix:
```
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
0 8 16 24 32 40 48 56
1 9 17 25 33 41 49 57
2 10 18 26 34 42 50 58
3 11 19 27 35 43 51 59
4 12 20 28 36 44 52 60
5 13 21 29 37 45 53 61
6 14 22 30 38 46 54 62
7 15 23 31 39 47 55 63
```
* naive : 231472 us
* sse : 128587 us
* sse_prefetch : 58680 us
* avx : 55967 us
## References
* [Highlights and commentary on "When Prefetching Works, When It Doesn't, and Why"](https://hackmd.io/s/HJtfT3icx)
* [Paper: When Prefetching Works, When It Doesn't, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
* [kobeyu's notes](https://hackmd.io/s/BycRx2EC)
* [Prefetching lecture slides](http://www.ece.cmu.edu/~ece740/f11/lib/exe/fetch.php%3Fmedia%3Dwiki:lectures:onur-740-fall11-lecture24-prefetching-afterlecture.pdf)