# 2017q1 Homework3 (software-pipelining)
contributed by <`yangyang95`>
###### tags: `sysprog2017` `HW3` `yangyang95`
See the GitHub page [here](https://github.com/yangyang95/prefetcher)
See the assignment requirements [here](https://hackmd.io/s/rks62p1sl#作業要求)
## Paper Reading
Paper: [When Prefetching Works, When It Doesn't, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
### 1. INTRODUCTION
The paper points out that both software and hardware prefetching are important techniques for high-end processors to tolerate cache-miss latency and manage memory bandwidth in order to achieve high performance. Hardware prefetching already has many mechanisms in production, whereas only relatively simple software prefetching algorithms are used on modern processors.
- prefetching - a technique used in central processor units to speed up the execution of a program by reducing wait states.
The figure below is a [SPEC CPU 2006](http://baike.baidu.com/item/SPEC%20CPU%202006) performance comparison provided in the paper
![](https://i.imgur.com/pwHxvmJ.png)
> Note: the execution cycles in the figure are normalized to the baseline binaries (Base^2^), which contain only the software prefetches inserted automatically by the compiler. Smaller values mean better performance.
- GHB - Global History Buffer
* A FIFO buffer that temporarily stores cache-miss data; when the buffer fills up, the oldest entries are evicted first. It can be used to "tune" cache behavior and raise the hit rate. [[source](http://www.techbang.com/posts/13076-tegra-4-schemas-fully-resolved-the-old-structure-to-the-new-core-reloaded?page=3)]
* stride prefetch (hardware)
- STR
* stream prefetch (hardware)
- SW
* manually inserted software prefetch
- speedup
* best software+hardware prefetching scheme (i.e.,
the best among SW, SW+GHB, and SW+STR) over the best performing hardware prefetching (i.e., the best among Base, GHB, and STR)
* i.e., *(fastest SW+HW scheme) / (fastest HW-only scheme)*
* benchmarks are divided by speedup into three groups: positive, neutral, and negative
:::danger
What I don't quite understand in this part is how stream prefetch works.
I consulted the stream-prefetch section of [0xff07's notes](https://hackmd.io/AwRgrARgpgHMDsBaALAYzFFYCGxEE4wATERMAM2CNSIDZURyp4g=?view#一些基本的-prefetch-機制).
Still digesting this sentence: "put the data at consecutive addresses near a cache-miss location into the stream buffer."
Questions:
* Which stream buffer does the missed data go into?
* What is the point of doing this, and how is the data retrieved when it is needed?
![](https://i.imgur.com/WXEGQmT.png)
:::
### 2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
* **SW and HW prefetching strategies suited to different data structures:** [[original figure](https://i.imgur.com/yN0hMjp.png)]
![](https://i.imgur.com/yN0hMjp.png)
* **Software Prefetch**
prefetch distance - the distance ahead of which a prefetch should be requested on a memory address
$$D \geq \left\lceil \frac{l}{s} \right\rceil$$
i.e., each prefetch targets the data that will be accessed $D$ loop iterations later, where
$l$ is the prefetch latency
$s$ is the length of the shortest path through the loop body
* **Direct and Indirect Memory Indexing**
Generally, direct memory indexing is better handled in hardware because its memory accesses follow stream/stride patterns, while indirect indexing is better handled in software.
The figure below compares the assembly for the two kinds of indexing.
![](https://i.imgur.com/0f4HRNo.png)
### 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
#### Benefits of SW prefetching
* Large Number of Streams (Limited Hardware Resources)
* Short Streams
* Irregular Memory Access
* Cache Locality Hint
* Loop Bounds
#### Drawbacks of SW prefetching
* Increased Instruction Count
![](https://i.imgur.com/4je9gba.png)
For example, the number of instructions in bwaves increases by more than 100% due to software prefetches.
* Static Insertion
* Code Structure Change
#### Benefits of SW + HW prefetching
* Handling Multiple Streams
* Positive Training
#### Drawbacks of SW + HW prefetching
* Negative Training
* Harmful Software Prefetching
:::success
See [here](https://hackmd.io/AwRgrAbMCmBGYFpoBMSwQFhCCCCGIe0CEysAzAGYYAc5GyA7MkA=?view#sw-prefetching對效能的正面和負面影響) for a more detailed walkthrough of Chapter 3
:::
:::danger
Need to find concrete examples of SW + HW training when I have time; otherwise it is hard to understand
:::
### 4. EXPERIMENTAL METHODOLOGY
#### Prefetch Intrinsic Insertion Algorithm
* prefetch candidates: L1 cache misses per 1000 instructions (MPKI) > 0.05
* prefetch distance: ![](https://i.imgur.com/KcVHDSm.png)
$k$ is a constant
$L$ is the average memory latency
$IPC_{bench}$ is each benchmark's profiled average IPC (instructions per cycle)
$W_{loop}$ is the average instruction count of one loop iteration
### 5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
#### Overhead and Limitations of Software Prefetching
![](https://i.imgur.com/4je9gba.png)
Figure 4 shows how the instruction count increases in SW binaries
(GemsFDTD, bwaves, and leslie3d)
* **Software Prefetching Overhead**
Remove some of the negative effects caused by SW prefetching to observe how much performance improves
![](https://i.imgur.com/AJHlQOj.png)
:::danger
What I don't understand here: how are the negative effects of SW prefetching removed?
:::
* **The Effect of Prefetch Distance**
![](https://i.imgur.com/Ak3axXP.png)
Figure 6 shows the sensitivity of performance to prefetch distance.
x-axis - the prefetch distance relative to the base prefetch distance
y-axis - IPC (instructions per cycle)
The benchmarks fall into five groups according to their optimal distance zone and performance variance, as shown below
![](https://i.imgur.com/s7bpZz7.png)
* **Static Distance vs. Machine Configuration**
![](https://i.imgur.com/KNo86kn.png)
Conclusion: the performance effect of a fixed prefetch distance is not very sensitive to the machine configuration, and it changes little even if the configuration changes at run time
* **Cache-Level Insertion Policy**
Most out-of-order processors can tolerate L1 cache misses (the impact is small enough to ignore), so HW prefetchers usually prefetch data into the last-level cache (L2 or L3) to reduce the chance of polluting the L1 cache.
==> However, too many L1 cache misses still hurt performance.
==> In that case, more accurate SW prefetches can bring data into the L1 cache without polluting it (misses are already so frequent that pollution is unlikely to recur).
:::danger
If misses are already severe, why is further pollution unlikely?
:::
The figure below shows each benchmark's performance under different locality hints
![](https://i.imgur.com/1N4TAmF.png)
The benefit of T0 over T1 (T2) mainly comes from hiding L1 cache misses by inserting prefetched blocks into the L1 cache. In the evaluated benchmarks, libquantum represents this behavior.
![](https://i.imgur.com/iyL4LN2.png)
With software prefetching, the L1 cache miss rate is reduced from 10.4% to 0.01% (Figure 26), which results in a significant performance improvement.
For streaming applications with little data reuse, T1 and T2 should perform well. However, the evaluated benchmarks contain no such application, so T0 beats T1 and T2 in every observed case
#### Effect of Using Hardware and Software Prefetching Together
* **Hardware Prefetcher Training Effects**
1. NT (SW+GHB, SW+STR)
The hardware prefetcher’s training algorithm ignores software prefetch requests. That is, the hardware prefetcher is only trained by demand cache misses.
2. Train (SW+GHB+T, SW+STR+T)
The hardware prefetcher’s training algorithm includes software prefetch requests, in addition to the usual demand misses.
:::danger
I still don't quite understand how the HW prefetcher is trained by cache misses and by SW prefetches
:::
The figure below shows, for each benchmark, the effect of training the HW prefetcher with SW prefetch requests:
![](https://i.imgur.com/Cbyal9Q.png)
* sphinx3: both HW prefetchers improve after training
* milc, gcc, soplex: both HW prefetchers regress
* mcf, bwaves: performance is largely unaffected by training
* other benchmarks: the effect depends on which HW prefetcher is used (GHB or STR)
Conclusion: although training improves some benchmarks, the regressions are far larger than the gains, so in general using SW prefetching to train the HW prefetcher is not a good choice
* **Prefetch Coverage** - the ratio of useful prefetches to total L2 misses
![](https://i.imgur.com/IrcvW4y.png)
Figure 10(a) shows prefetch coverage for GHB and STR
Figure 10(b) shows the prefetch coverage of SW+GHB and SW+STR
Conclusion: in general, SW prefetching has higher coverage than HW prefetching, but the opposite holds in the neutral and negative groups
==> Hence low coverage can be identified as the main reason those groups fail to improve
* **Using the GCC Compiler**
![](https://i.imgur.com/Vf4ajoY.png)
- Except for mcf and lbm, the gcc-compiled binaries contain more SW prefetch intrinsics
- gcc still cannot eliminate all cache misses
- gcc's performance falls between NOSW (icc) and SW
* **Real System Experiments**
The experimental results are shown below:
![](https://i.imgur.com/4hsfpSx.png)
* cactusADM, soplex, bwaves, sphinx3, and leslie3d show slight improvements on the simulator but regress on a real processor
* the benchmarks in the positive group also improve on real processors
* Compared with Core2, Nehalem's HW prefetching is even more effective; because its HW prefetcher covers more cache levels, the benefit of SW prefetch (which targets L1 only) is reduced
* **Profile-Guided Optimization (PGO)**
Enable gcc 4.4's `-fprofile-generate` and `-fprofile-use` flags; the results are shown below:
![](https://i.imgur.com/wrbkHrK.png)
* Performance falls between base and SW, but the gains do not all come from prefetching
==> Figures (c) and (d) show that PGO reduces both the total instruction count and the number of prefetch instructions
:::danger
If the number of prefetch instructions decreases, where does the performance gain come from?
:::
* **Hardware Prefetcher for Short Streams**
:::warning
The [summary notes](https://hackmd.io/AwRgrAbMCmBGYFpoBMSwQFhCCCCGIe0CEysAzAGYYAc5GyA7MkA=?view#適合處理-short-streams-的-hw-prefetcher) explain the remaining sections clearly; refer to them directly
:::
## Development Environment
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 58
Model name: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
Stepping: 9
CPU MHz: 1316.186
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 5187.80
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
```
## Development Log
The goal of this assignment is to analyze the performance of matrix transposition and try to improve it.
### Analysis of the Transpose Implementations
The methods used are introduced below:
* **naive transpose**
```C==
void naive_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x++)
for (int y = 0; y < h; y++)
*(dst + x * h + y) = *(src + y * w + x);
}
```
Simply reads the source top to bottom within each column, moving left to right across the columns, and writes each value to its transposed position in the destination
* **sse transpose**
```C==
void sse_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
__m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```
SSE variables use the `__m128i` type, which holds four 32-bit integers.
Instruction analysis:
* `__m128i _mm_loadu_si128 (__m128i *p)`
Each inner-loop iteration issues four loads to fetch four rows of data, i.e., one 4*4 block of the matrix.
* `__m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)`
takes two 128-bit operands and interleaves their low 32-bit elements in turn (a0, b0, a1, b1)
* `__m128i _mm_unpackhi_epi32 (__m128i a, __m128i b)`
takes two 128-bit operands and interleaves their high 32-bit elements in turn (a2, b2, a3, b3)
* `__m128i _mm_storeu_si128 (__m128i *p, __m128i a)`
stores the transposed data back into the array
Reference: [ierosodin's HackMD notes](https://hackmd.io/MYNgzA7ArMAMCmBaAHATgCYCNEBYSwEMUIcJEAmAMxwEZR0dYopMg===?view#)
* **sse prefetch transpose**
```C==
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
```
Prefetch before reading each block of the matrix.
Next, run the programs and compare the elapsed times:
sse prefetch: 92009 us
sse: 184088 us
naive: 365143 us
Plotting the numbers makes the difference obvious
![](https://i.imgur.com/GxUTFoT.png)
### Makefile Changes
Next, modify `main.c` and the `Makefile` so that each method gets its own executable.
Makefile:
```makefile
EXECUTABLE := naive_transpose \
sse_transpose sse_prefetch_transpose
all: $(GIT_HOOKS) $(EXECUTABLE)
%_transpose: main.c
$(CC) $(CFLAGS) -D$(subst _transpose,,$@) -o $@ main.c
```
Referred to a [Makefile](https://github.com/0140454/prefetcher/blob/master/Makefile) that uses the wildcard `%` to practice writing a concise Makefile.
A quick note on the `subst` function:
```
$(subst <from>,<to>,<text>)
```
It replaces every occurrence of `<from>` in the string `<text>` with `<to>`.
In the Makefile above it strips the `_transpose` suffix from the target name.
Note that `$@` is Make's automatic variable holding the name of the current target (unlike shell, where `$@` means all arguments).
References:
* [跟我一起寫 Makefile(九)](http://blog.csdn.net/haoel/article/details/2894)
### Encapsulation
Implemented encapsulation by following [你所不知道的 C 語言:物件導向程式設計篇](https://hackmd.io/s/HJLyQaQMl#物件導向是一種態度).
Added `transClass` as a unified interface for accessing the transpose functions.
`impl.c`:
```C==
typedef struct _transClass transClass;
typedef void (*func_t)(int *, int *, int, int);
struct _transClass {
func_t run;
};
int init_object (transClass **self, void *func)
{
if (NULL == (*self = malloc(sizeof(transClass)))) return -1;
(*self)->run = func;
return 0;
}
```
`main.c` :
```C==
#if defined(sse_prefetch)
init_object(&transpose, &sse_prefetch_transpose);
#elif defined(sse)
init_object(&transpose, &sse_transpose);
#else
init_object(&transpose, &naive_transpose);
#endif
clock_gettime(CLOCK_REALTIME, &start);
transpose->run(src, out1, TEST_W, TEST_H);
clock_gettime(CLOCK_REALTIME, &end);
#if defined(sse_prefetch)
printf("sse prefetch: \t %ld us\n", diff_in_us(start, end));
#elif defined(sse)
printf("sse: \t\t %ld us\n", diff_in_us(start, end));
#else
printf("naive: \t\t %ld us\n", diff_in_us(start, end));
#endif
```
### perf Analysis
```
Performance counter stats for './naive_transpose' (100 runs):
444.625426 task-clock (msec) # 0.997 CPUs utilized ( +- 0.12% )
7 context-switches # 0.016 K/sec ( +- 9.48% )
0 cpu-migrations # 0.001 K/sec ( +- 15.76% )
12,879 page-faults # 0.029 M/sec ( +- 0.12% )
1,349,182,914 cycles # 3.034 GHz ( +- 0.22% )
802,582,659 stalled-cycles-frontend # 59.49% frontend cycles idle ( +- 0.37% )
1,498,472,412 instructions # 1.11 insn per cycle
# 0.54 stalled cycles per insn ( +- 0.01% )
292,033,795 branches # 656.809 M/sec ( +- 0.01% )
573,629 branch-misses # 0.20% of all branches ( +- 0.56% )
0.445790066 seconds time elapsed ( +- 0.13% )
```
stalled-cycles-frontend is as high as 59.49%.
What does that mean? [[source](http://stackoverflow.com/a/29059380)]
The **cycles stalled in the front-end** are a waste because that means that the CPU does not feed the Back End with instructions. This can mean that you have misses in the Instruction cache, or complex instructions that are not already decoded in the micro-op cache.
The **cycles stalled in the back-end** are a waste because the CPU has to wait for resources (usually memory) or to finish long latency instructions (e.g. transcedentals - sqrt, logs, etc.).
```
Performance counter stats for './sse_prefetch_transpose' (100 runs):
284.116064 task-clock (msec) # 0.997 CPUs utilized ( +- 0.35% )
5 context-switches # 0.016 K/sec ( +- 12.11% )
0 cpu-migrations # 0.000 K/sec ( +- 28.59% )
31,797 page-faults # 0.112 M/sec ( +- 0.04% )
884,873,715 cycles # 3.114 GHz ( +- 0.28% )
356,371,500 stalled-cycles-frontend # 40.27% frontend cycles idle ( +- 0.77% )
1,383,115,994 instructions # 1.56 insn per cycle
# 0.26 stalled cycles per insn ( +- 0.01% )
284,731,213 branches # 1002.165 M/sec ( +- 0.01% )
588,273 branch-misses # 0.21% of all branches ( +- 0.10% )
0.284906362 seconds time elapsed ( +- 0.37% )
```
stalled-cycles-frontend drops noticeably; based on the hint above, I suspect this is a cache-miss effect.
Let's try a cache-miss analysis next.
```
Performance counter stats for './naive_transpose' (20 runs):
18,108,797 cache-misses # 89.423 % of all cache refs ( +- 0.37% ) (42.77%)
20,250,709 cache-references ( +- 0.44% ) (57.23%)
573,289,873 L1-dcache-loads ( +- 0.20% ) (57.12%)
30,843,713 L1-dcache-load-misses # 5.38% of all L1-dcache hits ( +- 0.29% ) (55.23%)
245,054,831 L1-dcache-stores ( +- 0.51% ) (28.65%)
3,151,174 L1-dcache-store-misses ( +- 0.87% ) (28.62%)
<not supported> L1-dcache-prefetches
280,366 L1-dcache-prefetch-misses ( +- 10.32% ) (28.46%)
0.550054635 seconds time elapsed ( +- 0.59% )
```
```
Performance counter stats for './sse_transpose' (20 runs):
6,532,498 cache-misses # 83.151 % of all cache refs ( +- 0.34% ) (42.94%)
7,856,201 cache-references ( +- 0.40% ) (57.48%)
462,040,537 L1-dcache-loads ( +- 0.26% ) (57.11%)
10,638,638 L1-dcache-load-misses # 2.30% of all L1-dcache hits ( +- 0.40% ) (54.07%)
259,134,952 L1-dcache-stores ( +- 0.37% ) (28.48%)
3,045,792 L1-dcache-store-misses ( +- 1.03% ) (28.82%)
<not supported> L1-dcache-prefetches
170,996 L1-dcache-prefetch-misses ( +- 12.96% ) (28.68%)
0.370285903 seconds time elapsed ( +- 0.45% )
```
```
Performance counter stats for './sse_prefetch_transpose' (20 runs):
6,792,976 cache-misses # 86.945 % of all cache refs ( +- 0.40% ) (43.02%)
7,812,952 cache-references ( +- 0.28% ) (57.38%)
484,854,755 L1-dcache-loads ( +- 0.17% ) (56.64%)
10,641,926 L1-dcache-load-misses # 2.19% of all L1-dcache hits ( +- 0.46% ) (52.57%)
259,703,220 L1-dcache-stores ( +- 0.24% ) (28.66%)
3,099,040 L1-dcache-store-misses ( +- 0.83% ) (29.13%)
<not supported> L1-dcache-prefetches
240,285 L1-dcache-prefetch-misses ( +- 9.34% ) (28.75%)
0.284322205 seconds time elapsed ( +- 0.59% )
```
* From naive to sse, the SIMD optimization reduces cache misses and L1d misses.
cache-references and L1-dcache-loads also drop significantly, which I suspect is the main cause of the lower miss rate.
* From sse to sse prefetch, the SW prefetch optimization improves performance,
but HW prefetch events and the miss rate rise slightly; I suspect a training effect.
Let's try to verify the impact of HW and SW prefetching.
Raw counters found via [kaizsv's HackMD notes](https://hackmd.io/s/r1IWtb9R#sse-prefetch-transpose):
* r014c: LOAD_HIT_PRE.SW_PF
* r024c: LOAD_HIT_PRE.HW_PF
```
Performance counter stats for './naive_transpose' (20 runs):
46,763 r014c ( +- 4.77% )
222,165 r024c ( +- 7.74% )
0.547286871 seconds time elapsed ( +- 0.88% )
```
```
Performance counter stats for './sse_transpose' (20 runs):
13,678 r014c ( +- 1.22% )
290,109 r024c ( +- 2.50% )
0.361726347 seconds time elapsed ( +- 0.59% )
```
```
Performance counter stats for './sse_prefetch_transpose'
(20 runs):
2,033,722 r014c ( +- 3.32% )
255,622 r024c ( +- 4.78% )
0.295382564 seconds time elapsed ( +- 0.76% )
```
![](https://i.imgur.com/V0xuUbM.png)
* From naive to sse, SW prefetch hit events drop sharply, presumably because of SIMD (I don't understand the reason yet)
* After adding prefetch to sse, SW prefetch hit events spike
### Future Improvements
I originally planned to implement AVX and AVX-with-prefetch versions,
but three more assignments are due next week, that's a lot orz
Rushing onward for now @@
* The instructor pointed out that the encapsulated type should use a meaningful name, e.g. Matrix
* Use an unnamed struct to save the trouble of naming
* Initialize the struct with designated initializers; see [你所不知道的C語言:技巧篇](https://hackmd.io/s/HyIdoLnjl#從矩陣操作談起)
* Many experiments remain, e.g. prefetch distance and locality hints