# 2017q3 Homework2 (software-pipelining)
contributed by <`HMKRL`>
### Reviewed by `zhanyangch`
* SIMD 的指令可以參考 [intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) ,比用 MSDN 方便查詢。
* 已經對不同的_MM_HINT作實驗,若能增加對 PDIST 修改的實驗,與論文中 D 值的計算做比較,可使報告更完整。
* 對 gef 介紹可以更詳盡,或是列出相關的參考資料,方便其它同學也能夠學習這項工具。
## 論文閱讀:[When Prefetching Works, When It Doesn’t, and Why](https://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)

由文中的實驗結果圖可以發現,當 Software prefetching (例如 `_mm_prefetch()`) 與 Hardware prefetching (例如 [GHB](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf)) 同時使用時,並沒有明確的效能表現規則,本文重點為探索此兩種 prefetcher 交互使用的效果
其中有三項探討重點:
1. What are the limitations and overheads of software prefetching?
2. What are the limitations and overheads of hardware prefetching?
3. When is it beneficial to use software and/or hardware prefetching?

Table 1 為文中對不同資料結構適用的 prefetch 方式整理,其中關於 prefetch distance (D),可以參考圖二的時間線:

可以發現 發出 prefetch request 後,資料需要一段時間才能進入 cache ,但若過太久仍未被使用,則會被移出 cache
因此採用 software prefetch 時,若太晚執行 prefetch ,則無法透過這次 prefetch 消除 memory latency 的影響,太早執行則會讓資料在被使用前就移出 cache ,都無法有效增進效能(甚至會因為 prefetch 本身的成本造成效能低落),適合的 D 值可以這樣計算:
$$
D\geq[\frac{l}{s}]
$$
其中 $l$ 是 prefetch 的延遲, $s$ 則是一次迴圈的最短路徑
論文中 5.1.3 對不同 D 值的影響進行分析:


### Memory Indexing
Memory indexing 可以分為兩種:direct 與 indirect
其中 direct 型(例如一維陣列) 較容易做 prefetch (由 index 可以直接得知資料所在的位置)
indirect 型(例如 Linked-List )則較不容易
圖3以簡化的組合語言指令比較了兩種 indexing 模式的差異,可以發現 indirect indexing 必須額外計算實際位址

## 理解 prefetch 程式
開發環境:
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
```
首先注意 `impl.c`
`naive_transpose` 是最直接的轉置作法,直接依照轉置矩陣的規則以迴圈實作
`sse_transpose` 則出現許多未曾使用過的 sse 指令集操作 ,因此使用 `gdb` 觀察程式執行(已加裝 [gef](https://github.com/hugsy/gef) )
將中斷點設在 `impl.c:17` , 由 `gdb` 顯示的組合語言指令得知`_mm_loadu_si128` 用途是將數值存入 `xmm0` ,參考 [wikipedia](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Registers) 得知這是用於 sse 指令集的特殊 register
```
────────────────────────────────────────────────────────────────────────────────────────────────────────────[ code:i386:x86-64 ]────
0x555555554a52 <sse_transpose+109> add rax, rdx
0x555555554a55 <sse_transpose+112> mov QWORD PTR [rbp-0x200], rax
0x555555554a5c <sse_transpose+119> mov rax, QWORD PTR [rbp-0x200]
0x555555554a63 <sse_transpose+126> movdqu xmm0, XMMWORD PTR [rax]
0x555555554a67 <sse_transpose+130> movaps XMMWORD PTR [rbp-0x1c0], xmm0
→ 0x555555554a6e <sse_transpose+137> mov eax, DWORD PTR [rbp-0x204]
0x555555554a74 <sse_transpose+143> add eax, 0x1
0x555555554a77 <sse_transpose+146> imul eax, DWORD PTR [rbp-0x224]
0x555555554a7e <sse_transpose+153> movsxd rdx, eax
0x555555554a81 <sse_transpose+156> mov eax, DWORD PTR [rbp-0x208]
0x555555554a87 <sse_transpose+162> cdqe
────────────────────────────────────────────────────────────────────────────────────────────────────────────[ source:impl.c+16 ]────
12 {
13 for (int x = 0; x < w; x += 4) {
14 for (int y = 0; y < h; y += 4) {
15 __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
// y=0x0, x=0x0, src=0x00007fffffffd798 → [...] → 0x0000000100000000, w=0x4
→ 16 __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
17 __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
18 __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
19 __m128i T0 = _mm_unpacklo_epi32(I0, I1);
20 __m128i T1 = _mm_unpacklo_epi32(I2, I3);
```
再觀察以下執行結果:
`__m128i T0 = _mm_unpacklo_epi32(I0, I1)`
```
gef➤ p I0
$8 = {0x100000000, 0x300000002}
gef➤ p I1
$9 = {0x500000004, 0x700000006}
gef➤ p T0
$10 = {0x400000000, 0x500000001}
```
`__m128i T2 = _mm_unpackhi_epi32(I0, I1);`:
```
gef➤ p I0
$11 = {0x100000000, 0x300000002}
gef➤ p I1
$12 = {0x500000004, 0x700000006}
gef➤ p T2
$13 = {0x600000002, 0x700000003}
```
`I0 = _mm_unpacklo_epi64(T0, T1);`
```
gef➤ p T0
$23 = {0x400000000, 0x500000001}
gef➤ p T1
$24 = {0xc00000008, 0xd00000009}
gef➤ p I0
$25 = {0x400000000, 0xc00000008}
```
透過以上觀察,加上參照 [MSDN](https://msdn.microsoft.com/en-us/library/ac53z079(v=vs.90).aspx) 的說明,即可理解這幾個 function 的行為
- sse register 有 128-bit, 善用可以在每道指令處理4個 32-bit integer 數值,比起一次處理一個來說可以省下許多手續
## 產生新的執行檔,分別對應於不同模式
程式碼修改請見 [commit f15322b](https://github.com/HMKRL/prefetcher/commit/f15322bf4a66d0bc2549f8d78926f8f8908d43ad)
新增 `impl.h`, 統一所有 transpose function 名稱為 `transpose()` 並宣告函式原型在 `impl.h` 中
各種不同實作分別位於 `naive_transpose.c`,`sse_transpose.c` 與 `sse_prefetch_transpose.c` 中,透過 Makefile 決定要使用的檔案
測試結果:
```
make run
./naive_transpose
naive: 600506 us
./sse_transpose
sse: 307496 us
./sse_prefetch_transpose
sse_prefetch: 149165 us
```
`make cache-test`
|test case|cache-miss|
|:-------:|:--------:|
|naive |87.332 % |
|sse |78.215 % |
|sse_prefetch|66.306 % |
### 嘗試修改 _MM_HINT 並透過 raw counter 觀察
```
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
```

參考 illusion030 的[共筆](https://hackmd.io/s/HkHDV-moe),觀察以下 raw counter:
```
r014c LOAD_HIT_PRE.SW_PF
r024c LOAD_HIT_PRE.HW_PF
r81d0 MEM_UOPS_RETIRED.ALL_LOADS
r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT
r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT
r04d1 MEM_LOAD_UOPS_RETIRED.L3_HIT
```
結果:
```
Performance counter stats for './naive_transpose' (10 runs):
39,361,652 cache-misses # 87.332 % of all cache refs ( +- 1.18% ) (39.91%)
45,071,235 cache-references ( +- 1.30% ) (40.45%)
1,394,570,644 instructions # 1.18 insn per cycle ( +- 0.62% ) (50.60%)
1,183,001,040 cycles ( +- 0.39% ) (60.58%)
145 r014c ( +- 26.11% ) (60.62%)
347,288 r024c ( +- 8.84% ) (60.77%)
565,116,285 r81d0 ( +- 0.52% ) (60.89%)
541,841,834 r01d1 ( +- 0.72% ) (61.08%)
32,226 r02d1 ( +- 13.70% ) (40.05%)
704,847 r04d1 ( +- 5.39% ) (39.95%)
```
```
Performance counter stats for './sse_transpose' (10 runs):
14,631,270 cache-misses # 78.215 % of all cache refs ( +- 1.32% ) (37.96%)
18,706,573 cache-references ( +- 1.21% ) (40.68%)
1,156,979,239 instructions # 1.44 insn per cycle ( +- 1.37% ) (51.62%)
801,805,797 cycles ( +- 0.67% ) (61.88%)
104 r014c ( +- 4.41% ) (62.54%)
228,122 r024c ( +- 11.42% ) (63.56%)
442,330,486 r81d0 ( +- 0.65% ) (63.34%)
444,642,666 r01d1 ( +- 0.66% ) (61.72%)
32,251 r02d1 ( +- 8.80% ) (38.78%)
138,286 r04d1 ( +- 2.73% ) (37.31%)
```
```
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,407,856 cache-misses # 66.306 % of all cache refs ( +- 0.77% ) (39.42%)
12,680,436 cache-references ( +- 0.38% ) (39.50%)
1,266,181,395 instructions # 2.19 insn per cycle ( +- 0.38% ) (49.59%)
577,113,498 cycles ( +- 0.36% ) (59.67%)
1,194,382 r014c ( +- 3.70% ) (59.67%)
276,261 r024c ( +- 13.61% ) (60.82%)
450,951,942 r81d0 ( +- 1.27% ) (63.35%)
448,833,332 r01d1 ( +- 0.84% ) (62.97%)
1,185,907 r02d1 ( +- 2.41% ) (41.53%)
23,900 r04d1 ( +- 8.80% ) (40.14%)
```
可以發現在使用了 `_mm_prefetch` 的 `sse_prefetch_transpose` 中,`r014c LOAD_HIT_PRE.SW_PF
` 明顯高出其他版本很多,證實了 software prefetching 有實際生效
接下來單獨觀察 `sse_prefetch_transpose`:
- _MM_HINT_T0
```
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,274,577 cache-misses # 65.760 % of all cache refs ( +- 1.55% ) (39.24%)
12,582,914 cache-references ( +- 1.15% ) (39.39%)
1,270,075,951 instructions # 2.22 insn per cycle ( +- 0.48% ) (49.56%)
571,300,575 cycles ( +- 0.29% ) (59.66%)
2,055,764 r014c ( +- 1.49% ) (59.81%)
285,024 r024c ( +- 12.43% ) (61.40%)
452,350,810 r81d0 ( +- 0.99% ) (63.40%)
450,444,764 r01d1 ( +- 0.97% ) (62.82%)
136,386 r02d1 ( +- 6.58% ) (41.17%)
21,463 r04d1 ( +- 9.31% ) (39.90%)
```
- _MM_HINT_T1
```
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,294,601 cache-misses # 65.715 % of all cache refs ( +- 1.96% ) (39.20%)
12,622,082 cache-references ( +- 1.59% ) (39.49%)
1,267,488,698 instructions # 2.21 insn per cycle ( +- 0.20% ) (49.82%)
573,599,634 cycles ( +- 0.47% ) (59.97%)
1,138,961 r014c ( +- 3.27% ) (60.38%)
297,754 r024c ( +- 12.40% ) (62.11%)
452,313,963 r81d0 ( +- 0.71% ) (63.44%)
447,991,261 r01d1 ( +- 0.71% ) (63.10%)
1,211,477 r02d1 ( +- 2.48% ) (40.61%)
23,397 r04d1 ( +- 7.42% ) (39.48%)
```
- _MM_HINT_T2
```
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,160,145 cache-misses # 64.491 % of all cache refs ( +- 2.19% ) (39.61%)
12,653,216 cache-references ( +- 0.45% ) (39.98%)
1,261,785,164 instructions # 2.19 insn per cycle ( +- 0.37% ) (50.15%)
575,843,475 cycles ( +- 0.38% ) (60.17%)
1,137,009 r014c ( +- 3.06% ) (60.17%)
340,189 r024c ( +- 9.23% ) (60.61%)
464,808,926 r81d0 ( +- 0.84% ) (61.75%)
450,360,519 r01d1 ( +- 0.92% ) (62.46%)
1,179,211 r02d1 ( +- 1.03% ) (40.89%)
20,659 r04d1 ( +- 7.79% ) (40.36%)
```
- _MM_HINT_NTA
```
Performance counter stats for './sse_prefetch_transpose' (10 runs):
8,121,213 cache-misses # 63.071 % of all cache refs ( +- 2.09% ) (39.80%)
12,876,217 cache-references ( +- 2.41% ) (40.06%)
1,267,977,461 instructions # 2.20 insn per cycle ( +- 0.43% ) (50.18%)
575,640,296 cycles ( +- 0.60% ) (60.18%)
2,028,726 r014c ( +- 1.83% ) (60.17%)
338,681 r024c ( +- 9.67% ) (60.74%)
460,173,646 r81d0 ( +- 0.99% ) (61.99%)
452,179,786 r01d1 ( +- 0.93% ) (62.14%)
11,199 r02d1 ( +- 3.08% ) (41.24%)
219,170 r04d1 ( +- 2.84% ) (40.43%)
```
疑問:
可以發現 `T0` 和 `T1` 都如預期將資料放在 L1 / L2 cache
但 `T2` 並不是放在 L3 而是 L2 ? 反而是 NTA 模式 似乎傾向使用 L3 cache
:::danger
用 perf raw-counter 確認!
:notes: jserv
:::
此處尚待理解