# 2017 q1 Homework3 (software-pipelining)
contributed by < stanleytazi >
## Assignment requirements
- [x] Read the paper "[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)" (be sure to read it before touching the code, or you will just be lost!), and record your understanding and questions
- [ ] Build and run [prefetcher](https://github.com/sysprog21/prefetcher) on Linux/x86_64 (note: a 64-bit system is required, and it must not run inside a virtual machine)
- [ ] Explain the performance differences among `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose`, and the prefetcher's impact on the cache
- [ ] Fork [prefetcher](https://github.com/sysprog21/prefetcher) on GitHub, and, following the experiments in the selected shared notes, try to push performance further with AVX. For analysis methods, see [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx)
- [x] Modify the `Makefile` to produce separate executables for `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose` (following the [phonebook](s/S1RVdgza) approach). Using the encapsulation techniques from [你所不知道的 C 語言:物件導向程式設計篇](https://hackmd.io/s/HJLyQaQMl), wrap the different transpose implementations in an object-oriented style so that each method can be accessed and benchmarked through a common interface
- [ ] Analyze cache misses/hits with perf
- [ ] Learn the raw counter syntax of `perf stat`
- [ ] Referring to [Performance of SSE and AVX Instruction Sets](http://arxiv.org/pdf/1211.0820.pdf), rewrite the code with SSE/AVX intrinsics
## Reading the paper
While reading, I found that the Chinese guided-reading version the instructor provided translates the paper very thoroughly.
A few points of confusion and unfamiliar terms first:
+ Section 5.1.3 (p.11) examines how prefetch distance affects different benchmarks, noting that two of them are more sensitive to the distance and already reach the optimal at distance = -4. My question is what "optimal" means here: graphs B & C in Fig. 6 show no particular trend, and the IPC (instructions per cycle) is not that high either, so how can one tell they are optimal?
+ The prefetch distance can be negative, down to -4; does that imply the value obtained from equation (1) must be $\ge4$?
$$
D\ge\lceil\frac{l}{s}\rceil
$$
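To see how the bound works, recall the paper's notation: $D$ is the prefetch distance, $l$ the prefetch latency, and $s$ the length of the shortest path through one loop iteration. With made-up numbers (for illustration only, not measured):

$$
l = 100 \text{ cycles},\quad s = 25 \text{ cycles} \quad\Rightarrow\quad D \ge \left\lceil \frac{100}{25} \right\rceil = 4
$$

i.e. the prefetch must be issued at least 4 iterations ahead for the data to arrive before the demand load.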
## Modifying the Makefile
Following phonebook, the Makefile is modified to produce a separate executable and output file for each implementation; the output file stores the `diff_in_us()` result originally printed in main.c into a txt file.
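The rule layout might look like the sketch below. This is my guess at the structure, not the repository's actual Makefile: the target names follow the method names, and the `-D` macros used to select an implementation are hypothetical.

```makefile
CC ?= gcc
CFLAGS = -msse2 -O0 -Wall -std=gnu99

EXEC = naive_transpose sse_transpose sse_prefetch_transpose

all: $(EXEC)

# One executable per method, selected by a compile-time macro
# (recipe lines must be indented with a tab)
naive_transpose: main.c
	$(CC) $(CFLAGS) -DNAIVE -o $@ $<

sse_transpose: main.c
	$(CC) $(CFLAGS) -DSSE -o $@ $<

sse_prefetch_transpose: main.c
	$(CC) $(CFLAGS) -DSSE_PREFETCH -o $@ $<

clean:
	rm -f $(EXEC) *.txt
```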
## perf (raw counter)
Yes, perf again. Since the whole point of prefetching is to hide the cache-miss penalty, the cache-miss statistics have to be observed. Being a bit greedy, I defined many events to watch in one go:
``` shell=
perf stat --repeat 100 \
-e cache-misses,cache-references,instructions,cycles,\
stalled-cycles-frontend,L1-dcache-loads,L1-dcache-load-misses,\
L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,\
L1-dcache-prefetch-misses,L1-icache-loads,L1-icache-load-misses,\
L1-icache-prefetches,L1-icache-prefetch-misses \
./sse_pf_transpose
```
But many of them came back as **<not supported>**:
```shell=
Performance counter stats for './naive_transpose' (100 runs):
547,977,252 cpu/event=0xD0,umask=0x81/ ( +- 0.32% ) (22.15%)
16,859,312 cache-misses # 96.267 % of all cache refs ( +- 0.49% ) (22.85%)
17,513,130 cache-references ( +- 0.53% ) (23.08%)
1,457,427,284 instructions # 1.26 insns per cycle ( +- 0.33% ) (34.65%)
1,160,748,998 cycles ( +- 0.95% ) (45.79%)
<not supported> stalled-cycles-frontend
553,146,550 L1-dcache-loads ( +- 0.19% ) (44.41%)
22,249,547 L1-dcache-load-misses # 4.02% of all L1-dcache hits ( +- 0.74% ) (41.56%)
228,269,859 L1-dcache-stores ( +- 0.37% ) (22.24%)
<not supported> L1-dcache-store-misses
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
<not supported> L1-icache-loads
97,976 L1-icache-load-misses ( +- 9.06% ) (21.99%)
<not supported> L1-icache-prefetches
<not supported> L1-icache-prefetch-misses
0.455125433 seconds time elapsed ( +- 1.30% )
```
+ Next, from Intel's [documentation](https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf) and [a forum thread asking how to obtain L1 & L2 cache misses](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/279983), the values can be read from CPU hardware counters and then computed manually, for example:
```
L1D_load_hit = 100.0 * MEM_LOAD_UOPS_RETIRED.L1_HIT / MEM_UOPS_RETIRED.ALL_LOADS
and
L1D_load_miss= 100.0 * (MEM_UOPS_RETIRED.ALL_LOADS - MEM_LOAD_UOPS_RETIRED.L1_HIT) / MEM_UOPS_RETIRED.ALL_LOADS
For the L2 load hit and miss you can use:
L2_RQSTS.DEMAND_DATA_RD_HIT
L2_RQSTS.ALL_DEMAND_DATA_RD
Then you can compute:
%L2_demand_rd_hit = 100.0 * L2_RQSTS.DEMAND_DATA_RD_HIT / L2_RQSTS.ALL_DEMAND_DATA_RD
%L2_demand_rd_miss = 100.0 * (L2_RQSTS.ALL_DEMAND_DATA_RD - L2_RQSTS.DEMAND_DATA_RD_HIT) / L2_RQSTS.ALL_DEMAND_DATA_RD
```
+ https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/499119
+ https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/610581
+ https://docs.google.com/presentation/d/1I0-SiHid1hTsv7tjLST2dYW5YF5AJVfs9l4Rg9rvz48/edit#slide=id.g1eefe20b_2_19
### Experimental results
According to
+ [Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide](https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf)
+ [How to count L1 cache miss/hit on Intel Haswell 4790](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/610581)
we can see that:
+ MEM_UOPS_RETIRED.**ALL_LOADS** = MEM_LOAD_UOPS_RETIRED.**L1_HIT** + MEM_LOAD_UOPS_RETIRED.**HIT_LFB** + MEM_LOAD_UOPS_RETIRED.**L2_HIT** +MEM_LOAD_UOPS_RETIRED.**L3_HIT** + MEM_LOAD_UOPS_RETIRED.**L3_MISS**
+ MEM_LOAD_UOPS_RETIRED.**L1_MISS** = MEM_LOAD_UOPS_RETIRED.**L2_HIT** + MEM_LOAD_UOPS_RETIRED.**L2_MISS**
+ MEM_LOAD_UOPS_RETIRED.**L2_MISS** = MEM_LOAD_UOPS_RETIRED.**L3_HIT** + MEM_LOAD_UOPS_RETIRED.**L3_MISS**
> What does HIT_LFB mean?
> To understand these HW events, first reading Intel's [Intel® 64 and IA-32 Architectures Optimization Reference Manual](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf) should help a lot
>
> First, the meaning of **LFB** (it sits at the bottom center of the figure above)
>>[Explanation 1](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/282055) If you search for 'fill buffer' in the PDF you can see that the Line fill buffer (LFB) is allocated after an L1D miss. The LFB holds the data as it comes in to satisfy the L1D miss but before all the data is ready to be written to the L1D cache.
>>
>> My interpretation: when an L1D cache miss occurs, the data being moved from L2 to L1 first lands in the LFB; during that latency window, any later load whose data is already in the LFB can still hit. This also explains why ALL_LOADS in the formula above includes HIT_LFB: even while the data is still in flight, the time is not wasted
>> Explanation 2: none for now; sticking with explanation 1
+ Comparing against the results below from sse+prefetch (pf-dist=8, hint=T1) over 100 runs, L1_HIT alone already exceeds ALL_LOADS, and other pf-dist values show the same, so the interpretation of the numbers must be off somewhere (?)
+ Also, the cache-misses and L1 load cache-misses numbers do not quite agree
``` shell=
Performance counter stats for './sse_pf_d8_transpose' (100 runs):
348,587,929 cpu/event=0xD0,umask=0x81/(ALL_LOADS) ( +- 0.17% ) (9.64%)
353,203,969 cpu/event=0xd1,umask=0x01/(L1_HIT) ( +- 0.20% ) (11.54%)
7,789,452 cpu/event=0x24,umask=0xe1/ ( +- 0.51% ) (12.37%)
3,598,814 cpu/event=0x24,umask=0x41/ ( +- 0.38% ) (12.12%)
3,536,041 cpu/event=0xd1,umask=0x02/(L2_HIT) ( +- 0.41% ) (11.88%)
13,443 cpu/event=0xd1,umask=0x04/(L3_HIT) ( +- 3.19% ) (11.65%)
3,718,549 cpu/event=0xd1,umask=0x08/ ( +- 0.36% ) (11.43%)
35,854 cpu/event=0xd1,umask=0x10/ ( +- 5.04% ) (11.22%)
22,237 cpu/event=0xd1,umask=0x20/(L3_MISS) ( +- 7.66% ) (11.01%)
538,552 cpu/event=0xd1,umask=0x40/(HIT_LFB) ( +- 1.70% ) (10.81%)
0 cpu/event=0xd1,umask=0x80/ (10.62%)
5,430,823 cache-misses # 81.418 % of all cache refs ( +- 0.27% ) (10.44%)
6,670,330 cache-references ( +- 0.28% ) (10.26%)
1,249,464,905 instructions # 1.95 insns per cycle ( +- 0.19% ) (14.96%)
639,756,626 cycles ( +- 0.07% ) (18.68%)
<not supported> stalled-cycles-frontend
466,907,531 L1-dcache-loads ( +- 0.15% ) (15.39%)
7,240,334 L1-dcache-load-misses # 1.55% of all L1-dcache hits ( +- 2.49% ) (8.72%)
254,032,834 L1-dcache-stores ( +- 0.24% ) (6.82%)
<not supported> L1-dcache-store-misses
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
<not supported> L1-icache-loads
15,332 L1-icache-load-misses ( +- 5.95% ) (9.48%)
<not supported> L1-icache-prefetches
<not supported> L1-icache-prefetch-misses
9,598 LLC-load-misses # 7.96% of all LL-cache hits ( +- 3.97% ) (9.33%)
241,283 LLC-loads ( +- 13.52% ) (13.74%)
148,297 LLC-store-misses ( +- 4.94% ) (8.96%)
1,287,418 LLC-stores ( +- 2.61% ) (8.80%)
0.239078408 seconds time elapsed ( +- 0.11% )
```
:::danger
Looking at [yenWu](https://hackmd.io/EYMwbAhgDArAnAFgLQmGATEhB2AppiCAZhCQBMBGXGBGXCkCMEIA?view)'s notes, I saw a more concise way to write raw counters, but after one run the two notations still produce slightly different numbers, though close. If they are the same counter, why would they differ?
:::
> Instructor's hint: prefetching has a cost; you also need to check the L1 icache
``` shell=
Performance counter stats for './sse_pf_d8_transpose' (100 runs):
371,201,978 cpu/event=0xD0,umask=0x81/ ( +- 1.34% ) (8.92%)
371,170,416 r81d0 ( +- 1.17% ) (10.67%)
...
```
+ The actual meaning of these events still needs to be sorted out; the material I am finding is too scattered for now
## Experimental results
Before going into the results, here is my current understanding of prefetch hints, so that the raw-counter values from perf can be matched against them.
+ The meanings of the **prefetch hints** are:
    + T0: the data may be stored into any cache level and is marked temporal, i.e. it will be used again soon
    + T1: temporal data, may be stored into L2 & L3
    + T2: temporal data, may be stored into L3
    + NTA: non-temporal, hinting that the data will not be reused soon; it may be stored into L1
### hint = T1
+ Still, judging from **cache-misses** and **L1-dcache-load-misses**, sse + prefetch is indeed better than both sse and naive
| method | cache-misses rate | L1-dcache-load-misses rate | L1-dcache-load-misses |
| ----- | ----- | ----- | ----- |
|naive| 96.71% | 4.12 % | 22,511,461 |
|sse| 89.299% | 2.11 % | 8,558,031 |
|sse + prefetch distance=1| 90.564 % | 2.08 %| 8,508,991 |
|sse + prefetch distance=2| 86.380 % | 2.27 %|9,136,501 |
|sse + prefetch distance=4| 85.033 % | 2.69 %|11,438,953 |
|sse + prefetch distance=8| 82.516 % | 1.95 %|8,826,953 |
### hint = T0
| method | cache-misses | cache-misses rate | L1-dcache-load-misses rate | L1-dcache-load-misses |
| ----- | ----- | ----- | ----- | ----- |
|naive| 16,106,594|87.519% | 5.72 % | 31,054,722 |
|sse| 4,241,509 | 83% | 2.78 % | 11,185,910 |
|sse + prefetch distance=1| 4,787,493|83.087 % | 2.85 %| 11,185,910 |
|sse + prefetch distance=2| 5,381,545 | 88.981 % | 2.40 %|11,315,885 |
|sse + prefetch distance=4| 3,534,252 | 83.505 % | 2.65 %|10,699,825 |
|sse + prefetch distance=8| 3,702,525 | 82.754 % | 2.95 % | 11,858,492|
+ From the results above alone, the methods do perform better with sse+pfdist=4 or 8
+ But compared with hint=T1, both the execution time and the L1-dcache-misses differ. Why?
### hint=T2
### hint=NTA
## References
+ [gnuplot abbrev.](https://superuser.com/questions/508644/looking-up-gnuplot-abbreviations)