# 2016q3 Homework3 (software-pipeline)
contributed by <`kobeyu`>
[github](https://github.com/kobe0308/prefetcher.git)
[Assignment spec](https://hackmd.io/s/ry7eqDEC)
## Assigned Reading
### [Performance Analysis: Prefetching](https://embedded2015.hackpad.com/ep/pad/static/7ZSsa98cSKw)
### [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
## Prefetch
Prefetching loads data from main memory into the cache before the CPU needs it, reducing CPU stall time and thus improving performance.
Timing-wise, the prefetch must not issue too late (the CPU has already finished its computation and no longer needs the data) nor too early (the data may be evicted from the cache before it is used).
In terms of the memory hierarchy, the prefetch intrinsic takes a hint argument that selects which cache level the data is loaded into.
### Hardware prefetch
Uses past access history as the basis for prefetching.
GHB: Global History Buffer
### Software prefetch
Prefetch instructions are inserted at appropriate points, either by hand or by the compiler.
### Direct and Indirect Memory Indexing
Direct indexing: the target memory address can be determined directly from the instruction, so it can be handled by hardware prefetch.
Indirect indexing: as in `x = a[b[i]]`, the value of `b[i]` must be computed before we know which element of `a[]` will be accessed. The machine cannot know better than the programmer how `b[i]` is produced, so this case is easier to handle with software prefetch.
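A minimal sketch of software prefetch for this indirect pattern (the function name, `DIST`, and the data are illustrative, not from this assignment's code): the loop itself computes the future index `b[i + DIST]` and prefetches `a[b[i + DIST]]`, something a hardware prefetcher cannot predict from the address stream alone.

```c
#include <assert.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

#define DIST 4 /* prefetch distance in iterations; illustrative value */

/* Gather-sum over the indirect pattern x = a[b[i]].
 * The index array b is read sequentially, so b[i + DIST] is cheap;
 * prefetching a[b[i + DIST]] hides the random-access latency on a. */
static int gather_sum(const int *a, const int *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            _mm_prefetch((const char *) &a[b[i + DIST]], _MM_HINT_T0);
        sum += a[b[i]];
    }
    return sum;
}
```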
### Cases where Software Prefetch beats Hardware Prefetch:
* Large number of streams (limited hardware resources)
* Short streams: as noted above, hardware prefetch trains on access history, so streams too short to train on are better served by software prefetch
* Irregular memory access
* Cache locality hints
* Loop bounds
### Drawbacks of Software Prefetch
* Increased instruction count
* Static insertion: cannot be adjusted dynamically at run time
* Code structure change
### Cases where combining Hardware and Software Prefetch gives a 1+1>2 effect
* Handling multiple streams
* Positive training
### Cases where Hardware and Software Prefetch drag each other down...
* Negative training
* Harmful software prefetching: badly placed software prefetches can in turn degrade the hardware prefetcher
### Prefetch Intrinsic Insertion Algorithm

The paper computes the prefetch distance as

$$D = \left\lceil \frac{K \times L \times IPC_{bench}}{W_{loop}} \right\rceil$$

where
* $K$ is a constant factor
* $L$ is the average memory latency
* $IPC_{bench}$ is the profiled average IPC of each benchmark
* $W_{loop}$ is the average instruction count in one loop iteration

P.S. All measurements in the paper use $K = 4$ and $L = 300$.

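As a sanity check, the quantities above combine into a distance in loop iterations: $L \times IPC_{bench}$ converts the latency into an instruction count, dividing by $W_{loop}$ converts that into iterations, and $K$ is a safety factor. With illustrative numbers (mine, not measured from any benchmark) $K=4$, $L=300$ cycles, $IPC_{bench}=1.5$, and $W_{loop}=120$ instructions per iteration:

```latex
D = \left\lceil \frac{K \times L \times IPC_{bench}}{W_{loop}} \right\rceil
  = \left\lceil \frac{4 \times 300 \times 1.5}{120} \right\rceil
  = 15
```

i.e. the prefetch for the data of iteration $i+15$ would be issued during iteration $i$.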
---
## Matrix Transpose
A matrix transpose turns rows into columns and columns into rows. Because memory is laid out row-major, reading a row accesses contiguous memory while writing a column accesses non-contiguous memory, so a transpose inherently incurs a high proportion of cache misses.
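The naive version can be sketched as follows (the signature is modeled on this assignment's impl.c, so treat the details as an approximation): the read from `src` walks contiguous memory, but each write to `dst` jumps by `h` elements, which is exactly the strided access that causes the misses.

```c
#include <assert.h>

/* Naive transpose of a w-by-h row-major matrix.
 * Reads src row by row (contiguous, cache-friendly),
 * writes dst column by column (stride of h ints, cache-hostile). */
static void naive_transpose(const int *src, int *dst, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[x * h + y] = src[y * w + x];
}
```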
## Matrix Transpose with Intel Assembly (SSE2)
[Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
Using SSE/AVX gives shorter execution times; as I currently understand it, the reasons are:
1. each instruction processes a wider chunk of data
2. direct CPU instruction-set support, giving a better CPI

### PUNPCKLDQ
`_mm_unpacklo_epi32 (__m128i __A, __m128i __B)`: interleave the low two 32-bit elements of `__A` and `__B`

### PUNPCKHDQ
`_mm_unpackhi_epi32 (__m128i __A, __m128i __B)`: interleave the high two 32-bit elements of `__A` and `__B`

### PUNPCKLQDQ
`_mm_unpacklo_epi64 (__m128i __A, __m128i __B)`: concatenate the low 64-bit halves of `__A` and `__B`

### PUNPCKHQDQ
`_mm_unpackhi_epi64 (__m128i __A, __m128i __B)`: concatenate the high 64-bit halves of `__A` and `__B`

Image source: [32/64-Bit 80x86 Assembly Language Architecture](https://books.google.com.tw/books?id=c3SlgrqMid4C)
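The four unpack intrinsics above compose into the 4x4 integer transpose described in the linked post. A sketch of that scheme (the function name is mine, not from impl.c):

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */

/* Transpose a 4x4 int32 matrix with two rounds of unpacks:
 * round 1 interleaves 32-bit elements, round 2 interleaves
 * 64-bit pairs, yielding the transposed rows. */
static void sse_transpose4x4(const int *src, int *dst)
{
    __m128i r0 = _mm_loadu_si128((const __m128i *) (src + 0));  /* a0 a1 a2 a3 */
    __m128i r1 = _mm_loadu_si128((const __m128i *) (src + 4));  /* b0 b1 b2 b3 */
    __m128i r2 = _mm_loadu_si128((const __m128i *) (src + 8));  /* c0 c1 c2 c3 */
    __m128i r3 = _mm_loadu_si128((const __m128i *) (src + 12)); /* d0 d1 d2 d3 */

    __m128i t0 = _mm_unpacklo_epi32(r0, r1); /* a0 b0 a1 b1 */
    __m128i t1 = _mm_unpacklo_epi32(r2, r3); /* c0 d0 c1 d1 */
    __m128i t2 = _mm_unpackhi_epi32(r0, r1); /* a2 b2 a3 b3 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3); /* c2 d2 c3 d3 */

    _mm_storeu_si128((__m128i *) (dst + 0),  _mm_unpacklo_epi64(t0, t1)); /* a0 b0 c0 d0 */
    _mm_storeu_si128((__m128i *) (dst + 4),  _mm_unpackhi_epi64(t0, t1)); /* a1 b1 c1 d1 */
    _mm_storeu_si128((__m128i *) (dst + 8),  _mm_unpacklo_epi64(t2, t3)); /* a2 b2 c2 d2 */
    _mm_storeu_si128((__m128i *) (dst + 12), _mm_unpackhi_epi64(t2, t3)); /* a3 b3 c3 d3 */
}
```

Eight unpacks plus the loads/stores handle a whole 4x4 tile, versus 16 scalar load/store pairs in the naive version.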
---
### Walking through the prefetcher code
#### Directory layout
```
.
├── impl.c
├── main.c
├── Makefile
├── README.md
└── scripts
    ├── plot.gp
    └── pre-commit.hook

1 directory, 6 files
```
impl.c implements three versions of the matrix transpose: naive, sse, and sse+prefetch.
First, a comparison of their execution times:

From the earlier reading, prefetch effectiveness depends on the prefetch distance and on the target cache level.
First, the effect of distance on performance.
In impl.c the prefetch distance is set by `PFDIST`; I measured the execution time for D = 0/4/8/12/16.
The resulting ordering of execution times is:
D=8 < D=4,12 < D=16 < D=0
This confirms that the prefetch distance affects performance. D=0 stands out as especially slow; my guess is that the data used earlier (2 iterations before) has already been evicted from the cache or overwritten. Another interesting observation: whatever the value of D, every prefetching variant is still faster than the plain SSE version without prefetch.
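The role of the distance parameter can be sketched with a loop simpler than the transpose (`PFDIST` and the function name are illustrative, not taken from impl.c): each iteration prefetches the element `PFDIST` iterations ahead, so the memory latency overlaps with current work. With `PFDIST` = 0 the prefetch would target data the iteration is already about to load, which arrives too late to help.

```c
#include <assert.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

#define PFDIST 8 /* prefetch distance in elements; illustrative value */

/* Sum an array while prefetching PFDIST elements ahead of the
 * current index, so the load latency is hidden behind the adds. */
static long sum_with_prefetch(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PFDIST < n)
            _mm_prefetch((const char *) &a[i + PFDIST], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
```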

---
Next I fixed the distance at 8 and varied the prefetch hint (T0/T1/T2/NTA).
NTA took the longest, so let's double-check NTA's definition:
> NTA (non-temporal data with respect to all cache levels)--prefetch data into non-temporal cache structure. (This hint can be used to minimize pollution of caches.)

Two things here were unclear to me: "non-temporal cache structure" and "pollution of caches".
>> In short, cache pollution means the cache gets filled with data that is not timely. "Early" means the data was fetched so early that it had already been replaced by the time it was needed; "incorrect" means data was loaded that is never used at all. All of these increase cache misses, which is what "polluting the cache" refers to. [name=Yen-Kuan Wu]

Next, observe cache misses with perf:
```shell
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main
```
D=8 + T0
```
 Performance counter stats for './main' (50 runs):

        62,963,773  cache-misses           # 86.289 % of all cache refs    ( +- 0.16% )  (66.55%)
        72,968,723  cache-references                                       ( +- 0.15% )  (66.62%)
        34,265,657  L1-dcache-load-misses  # 4.49% of all L1-dcache hits   ( +- 0.11% )  (66.74%)
       762,324,176  L1-dcache-loads                                        ( +- 0.06% )  (66.89%)
       314,796,493  L1-dcache-stores                                       ( +- 0.06% )  (66.92%)
           194,207  L1-icache-load-misses                                  ( +- 1.25% )  (66.69%)

       0.698045545 seconds time elapsed                                    ( +- 0.16% )
```
---
D=0 + NTA
```
 Performance counter stats for './main' (50 runs):

        65,505,666  cache-misses           # 86.406 % of all cache refs    ( +- 0.18% )  (66.35%)
        75,811,770  cache-references                                       ( +- 0.16% )  (66.62%)
        33,924,898  L1-dcache-load-misses  # 4.49% of all L1-dcache hits   ( +- 0.08% )  (66.93%)
       754,883,652  L1-dcache-loads                                        ( +- 0.05% )  (67.06%)
       316,034,709  L1-dcache-stores                                       ( +- 0.08% )  (66.89%)
           153,597  L1-icache-load-misses                                  ( +- 1.40% )  (66.49%)

       0.779092952 seconds time elapsed                                    ( +- 0.11% )
```
---
D=0 + T0
```
 Performance counter stats for './main' (50 runs):

        69,063,108  cache-misses           # 86.741 % of all cache refs    ( +- 0.17% )  (66.29%)
        79,620,072  cache-references                                       ( +- 0.14% )  (66.56%)
        33,903,912  L1-dcache-load-misses  # 4.49% of all L1-dcache hits   ( +- 0.07% )  (66.94%)
       755,064,566  L1-dcache-loads                                        ( +- 0.05% )  (67.15%)
       316,197,275  L1-dcache-stores                                       ( +- 0.07% )  (66.94%)
           157,890  L1-icache-load-misses                                  ( +- 1.25% )  (66.45%)

       0.781984994 seconds time elapsed                                    ( +- 0.08% )
```
~~These results suggested that execution time has little correlation with cache misses, seemingly confirming the earlier conjecture that the point of prefetch is not to reduce cache misses but to reduce the latency they cause.~~
---
The cache-miss results above are wrong: sse+prefetch/sse/naive were not run separately during measurement. Below are the results with each version run on its own:
1. SSE + Prefetch (Distance=8, @T0)
```
 Performance counter stats for './main' (50 runs):

         9,022,860  cache-misses           # 67.566 % of all cache refs    ( +- 0.32% )  (66.61%)
        13,354,083  cache-references                                       ( +- 0.16% )  (67.11%)
         8,964,264  L1-dcache-load-misses  # 1.93% of all L1-dcache hits   ( +- 0.32% )  (67.20%)
       465,440,570  L1-dcache-loads                                        ( +- 0.08% )  (67.20%)
       233,797,035  L1-dcache-stores                                       ( +- 0.12% )  (66.88%)
            62,034  L1-icache-load-misses                                  ( +- 2.20% )  (66.37%)

       0.195889775 seconds time elapsed
```
2. SSE
```
 Performance counter stats for './main' (50 runs):

        15,047,354  cache-misses           # 79.379 % of all cache refs    ( +- 0.28% )  (66.25%)
        18,956,381  cache-references                                       ( +- 0.18% )  (66.36%)
         8,408,231  L1-dcache-load-misses  # 1.89% of all L1-dcache hits   ( +- 0.12% )  (66.71%)
       443,824,150  L1-dcache-loads                                        ( +- 0.15% )  (67.34%)
       229,787,688  L1-dcache-stores                                       ( +- 0.13% )  (67.44%)
            90,578  L1-icache-load-misses                                  ( +- 1.63% )  (66.70%)

       0.356293017 seconds time elapsed
```
3. NAIVE
```
 Performance counter stats for './main' (50 runs):

        42,155,708  cache-misses           # 87.722 % of all cache refs    ( +- 0.21% )  (66.42%)
        48,056,099  cache-references                                       ( +- 0.20% )  (66.60%)
        21,011,298  L1-dcache-load-misses  # 3.84% of all L1-dcache hits   ( +- 0.07% )  (66.80%)
       547,771,641  L1-dcache-loads                                        ( +- 0.09% )  (67.18%)
       215,693,429  L1-dcache-stores                                       ( +- 0.13% )  (67.09%)
            76,732  L1-icache-load-misses                                  ( +- 1.87% )  (66.52%)

       0.432626384 seconds time elapsed                                    ( +- 0.20% )
```
The separated results show that prefetching loads the needed data into the cache ahead of time, lowering cache misses and thereby reducing execution time.
Consider the definition of a cache miss: accessing data that is not in the cache is a miss; if we know in advance which data will be accessed and load it into the cache first, those misses can be avoided.
### Improving performance with AVX
Reference: a previous student's implementation [link](https://embedded2015.hackpad.com/ep/pad/static/VGN4PI1cUxh)
### References
GHB [link](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf)
###### tags: `kobeyu`