# 2017q1 Homework3 (software-pipelining)
contributed by <`ryanwang522`>
* [Github](https://github.com/ryanwang522/prefetcher)
---
## Development environment
* OS: ubuntu 16.04 LTS
* L1d cache: 32K
* L1i cache: 32K
* L2 cache: 256K
* L3 cache: 3072K
* Architecture: x86_64
* CPU op-mode(s): 32-bit, 64-bit
* Byte Order: Little Endian
* CPU(s): 4
* Model name: Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz
## Paper reading
* Read [“When Prefetching Works, When It Doesn’t, and Why”](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
* Also referred to the [summary and commentary](https://hackmd.io/s/HJtfT3icx)
* On a first pass many of the terms only half made sense, so I decided to look at the assignment code first.
>>The [assignment description](https://hackmd.io/s/rks62p1sl#) reminds students:
>>Be sure to read the paper beforehand; diving straight into the code will only leave you lost!
>>Please read the paper first.
>>[name=Course TA][color=red]
## Hands-on
* Build and run:
```
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 86235 us
sse: 171542 us
naive: 359344 us
```
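* For reference, the `naive` baseline is presumably just an element-by-element transpose along these lines (a minimal sketch; the actual routine lives in `impl.c` and may differ):
```C=
void naive_transpose(int *src, int *dst, int w, int h)
{
    /* read src row by row and write each element to its transposed position in dst */
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            *(dst + x * h + y) = *(src + y * w + x);
}
```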
* Reading `impl.c`, there were quite a few intrinsics I did not recognize
* Referred to [Integer Memory and Initialization Using Streaming SIMD Extensions 2](https://msdn.microsoft.com/zh-tw/library/hfhxtdwx(v=vs.100).aspx)
* It documents the SSE intrinsics that show up here, such as `_mm_loadu_si128`, `_mm_unpacklo_epi32`, `_mm_storeu_si128`, and so on (a small standalone demo of the unpack intrinsics follows the two code listings below)
* It took quite a while to understand the matrix-transpose implementation
* Working the steps out on paper makes it fairly easy to follow
* While searching for material I found a concise concept diagram; I wish I had seen it earlier
* ![](https://i.imgur.com/BxvN77F.png)
* Next, building up the prefetch concepts; see [Cache prefetching](https://en.wikipedia.org/wiki/Cache_prefetching)
* In the code `PFDIST` is defined as 8, while each loop iteration handles four rows, so the prefetches pull in the data needed two iterations from now: while transposing rows `y`..`y+3`, rows `y+8`..`y+11` are fetched into the cache ahead of time.
* SSE code
```C=
void sse_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            /* load four rows of a 4x4 block */
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            /* interleave 32-bit elements of adjacent rows */
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            /* interleave 64-bit pairs to finish the 4x4 transpose */
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            /* store the transposed rows as columns of dst */
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```
* SSE prefetch code
```C=
#define PFDIST 8
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            /* prefetch the block PFDIST rows ahead into L2/L3 (T1 hint) */
            _mm_prefetch(src + (y + PFDIST + 0) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 1) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 2) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 3) * w + x, _MM_HINT_T1);
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```
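* To double-check what the unpack intrinsics actually do, here is a small standalone demo (written purely for illustration, not part of the assignment code; it assumes SSE2, e.g. build with `gcc -msse2`):
```C=
#include <stdio.h>
#include <emmintrin.h>   /* SSE2 integer intrinsics */

int main(void)
{
    /* the first two rows of the 4x4 test matrix */
    __m128i r0 = _mm_setr_epi32(0, 1, 2, 3);
    __m128i r1 = _mm_setr_epi32(4, 5, 6, 7);

    int lo[4], hi[4];
    /* interleave the low / high 32-bit halves of the two rows */
    _mm_storeu_si128((__m128i *) lo, _mm_unpacklo_epi32(r0, r1));
    _mm_storeu_si128((__m128i *) hi, _mm_unpackhi_epi32(r0, r1));

    printf("unpacklo: %d %d %d %d\n", lo[0], lo[1], lo[2], lo[3]);  /* 0 4 1 5 */
    printf("unpackhi: %d %d %d %d\n", hi[0], hi[1], hi[2], hi[3]);  /* 2 6 3 7 */
    return 0;
}
```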
## Comparing `naive`, `SSE`, and `SSE_Prefetch`
* First, using the OOP ideas from [你所不知道的 C 語言:物件導向程式設計篇](https://hackmd.io/s/HJLyQaQMl), build a common calling interface for the different transpose functions
```c=
typedef void (*func_t)(int *src, int *dst, int w, int h);

typedef struct __Object Object;
struct __Object {
    func_t trans_func;
};
```
* Each implementation gets an init function that uses the function pointer to select the transpose routine to call (the init functions for the other versions follow the same pattern; see the sketch after the next listing)
```C=
int init_sse(Object **self)
{
    if ((*self = malloc(sizeof(Object))) == NULL) return -1;
    (*self)->trans_func = sse_transpose;
    return 0;
}
...
```
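* The init functions for the other two versions presumably follow the same pattern (a sketch; `naive_transpose` is the assumed name of the baseline routine in `impl.c`):
```C=
int init_naive(Object **self)
{
    if ((*self = malloc(sizeof(Object))) == NULL) return -1;
    (*self)->trans_func = naive_transpose;        /* assumed baseline routine */
    return 0;
}

int init_sse_prefetch(Object **self)
{
    if ((*self = malloc(sizeof(Object))) == NULL) return -1;
    (*self)->trans_func = sse_prefetch_transpose;
    return 0;
}
```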
* Finally, conditional compilation with `-D` flags passed at build time produces a separate executable for each transpose function (example compile commands follow the listing below).
```C=
/* create an object and initialize it according to the compile-time define */
Object *interface = NULL;
#ifdef NAIVE
if (init_naive(&interface) == -1) {
    printf("init naive error.\n");
    return -1;
}
#elif defined(SSE_PREFETCH)
if (init_sse_prefetch(&interface) == -1) {
    printf("init sse_prefetch error.\n");
    return -1;
}
#else
if (init_sse(&interface) == -1) {
    printf("init sse error.\n");
    return -1;
}
#endif

interface->trans_func(testin, testout, 4, 4);
```
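* The three builds then look roughly like this (hypothetical commands; the actual Makefile targets and flags may differ):
```
$ gcc -msse2 -DNAIVE        -o naive                   main.c impl.c
$ gcc -msse2                -o sse_transpose           main.c impl.c
$ gcc -msse2 -DSSE_PREFETCH -o sse_prefetch_transpose  main.c impl.c
```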
* Modified `main.c` so that perf can observe the three functions separately.
* Drop the page cache before each run: `$ echo 1 | sudo tee /proc/sys/vm/drop_caches`
```
Performance counter stats for './naive' (100 runs):
955 cache-misses # 7.011 % of all cache refs ( +- 13.85% )
1,3616 cache-references ( +- 0.72% )
63,5161 instructions # 0.76 insn per cycle ( +- 0.25% )
84,0346 cycles ( +- 6.58% )
0.000523424 seconds time elapsed ( +- 7.21% )
```
```
Performance counter stats for './sse_transpose' (100 runs):
763 cache-misses # 5.590 % of all cache refs ( +- 15.07% )
1,3645 cache-references ( +- 0.66% )
63,6933 instructions # 0.82 insn per cycle ( +- 0.22% )
78,0010 cycles ( +- 5.09% )
0.000545729 seconds time elapsed ( +- 23.24% )
```
```
Performance counter stats for './sse_prefetch_transpose' (100 runs):
752 cache-misses # 5.688 % of all cache refs ( +- 17.94% )
1,3215 cache-references ( +- 0.60% )
63,0374 instructions # 0.84 insn per cycle ( +- 0.20% )
74,8224 cycles ( +- 5.34% )
0.000507983 seconds time elapsed ( +- 18.33% )
```
* Looking at the results, the cache-miss counts barely differ. The prefetch version shows no clear improvement in cache-miss rate either, yet its execution time drops noticeably.
* Keep in mind that prefetching itself also generates cache misses (a fetch served from a higher level of the memory hierarchy still counts).
* A more detailed performance analysis is needed.
### Analysis with perf raw counters
* Following [kaizsv's notes](https://hackmd.io/s/r1IWtb9R), I found that perf can do much more detailed measurements. Cross-checking chapter 19 of the [Intel® 64 and IA-32 Architectures Developer’s Manual: Vol. 3B](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html): my CPU is an i5-4200H, which the manual lists as a 4th-generation (Haswell) part.
* I picked out some events related to cache misses:
* `If you find one you want to instrument, you can specify it as a raw event with the format: rUUEE, where UU == umask, and EE == event number.` For example, `MEM_LOAD_UOPS_RETIRED.L1_MISS` (event `D1H`, umask `08H`) becomes the raw event `r08d1`.
|Event Num|Umask Value|Event Mask Mnemonic|Description|
|-|-|-|-|
|4CH|01H|LOAD_HIT_PRE.SW_PF|Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch.|
|4CH|02H|LOAD_HIT_PRE.HW_PF|Non-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch.|
|D1H|01H|MEM_LOAD_UOPS_RETIRED.L1_HIT|Retired load uops with L1 cache hits as data sources.|
|D1H|02H|MEM_LOAD_UOPS_RETIRED.L2_HIT|Retired load uops with L2 cache hits as data sources.|
|D1H|04H|MEM_LOAD_UOPS_RETIRED.L3_HIT|Retired load uops with L3 cache hits as data sources.|
|D1H|08H|MEM_LOAD_UOPS_RETIRED.L1_MISS|Retired load uops missed L1 cache as data sources.|
|D1H|10H|MEM_LOAD_UOPS_RETIRED.L2_MISS|Retired load uops missed L2. Unknown data source excluded.|
|D1H|20H|MEM_LOAD_UOPS_RETIRED.L3_MISS|Retired load uops missed L3. Excludes unknown data sources.|
```
$ sudo perf stat --repeat 100 -e cache-misses,cache-references,r014C,r024C,r01D1,r02d1,r04d1,r08d1,r10d1,r20d1 ./sse_prefetch_transpose
Performance counter stats for './sse_prefetch_transpose' (100 runs):
1042 cache-misses # 7.732 % of all cache refs ( +- 14.03% )
1,3473 cache-references ( +- 0.83% )
15 r014C ( +- 13.45% )
6388 r024C ( +- 5.34% )
<not counted> r01D1 (0.00%)
<not counted> r02d1 (0.00%)
<not counted> r04d1 (0.00%)
<not counted> r08d1 (0.00%)
<not counted> r10d1 (0.00%)
<not counted> r20d1 (0.00%)
0.000501777 seconds time elapsed ( +- 8.75% )
```
* ~~Not sure what went wrong here; the raw counters would not show up.~~
* It turned out some leftover debugging changes had not been cleaned up: the 4096 × 4096 transpose was being skipped and the program only performed a single 4 × 4 transpose, so the raw counters had essentially nothing to record.
* While at it, I also reworked the earlier interface implementation to make it more adaptable.
* Run perf stat:
```
$ sudo perf stat --repeat 100 -e cache-misses,cache-references,r014C,r024C,r01D1,r02d1,r04d1,r08d1,r10d1,r20d1 ./naive
Performance counter stats for './naive' (100 runs):
1556,1539 cache-misses # 88.602 % of all cache refs ( +- 0.38% ) (19.95%)
1756,3494 cache-references ( +- 0.25% ) (20.84%)
2,8704 r014C (LOAD_HIT_PRE.SW_PF) ( +- 3.65% ) (31.41%)
22,5545 r024C (LOAD_HIT_PRE.HW_PF) ( +- 6.06% ) (41.65%)
5,2906,5443 r01D1 (MEM_LOAD_UOPS_RETIRED.L1_HIT) ( +- 0.16% ) (40.03%)
65,1904 r02d1 (MEM_LOAD_UOPS_RETIRED.L2_HIT) ( +- 1.78% ) (37.99%)
145,5382 r04d1 (MEM_LOAD_UOPS_RETIRED.L3_HIT) ( +- 2.57% ) (20.16%)
1639,8087 r08d1 (MEM_LOAD_UOPS_RETIRED.L1_MISS) ( +- 0.19% ) (19.96%)
1562,3707 r10d1 (MEM_LOAD_UOPS_RETIRED.L2_MISS) ( +- 0.21% ) (19.83%)
1421,5128 r20d1 (MEM_LOAD_UOPS_RETIRED.L3_MISS) ( +- 0.37% ) (19.68%)
---
$ sudo perf stat --repeat 100 -e cache-misses,cache-references,r014C,r024C,r01D1,r02d1,r04d1,r08d1,r10d1,r20d1 ./sse_transpose
Performance counter stats for './sse_transpose' (100 runs):
428,6746 cache-misses # 74.599 % of all cache refs ( +- 0.54% ) (20.53%)
574,6364 cache-references ( +- 0.51% ) (21.49%)
1,3171 r014C ( +- 3.67% ) (32.00%)
14,3591 r024C ( +- 2.24% ) (42.14%)
4,4080,0600 r01D1 ( +- 0.21% ) (39.35%)
37,4749 r02d1 ( +- 1.26% ) (35.94%)
25,6156 r04d1 ( +- 2.06% ) (19.55%)
404,8482 r08d1 ( +- 0.54% ) (20.19%)
376,3449 r10d1 ( +- 0.65% ) (20.22%)
359,4998 r20d1 ( +- 0.73% ) (20.07%)
---
$ sudo perf stat --repeat 100 -e cache-misses,cache-references,r014C,r024C,r01D1,r02d1,r04d1,r08d1,r10d1,r20d1 ./sse_prefetch_transpose
Performance counter stats for './sse_prefetch_transpose' (100 runs):
425,0698 cache-misses # 74.312 % of all cache refs ( +- 1.04% ) (20.22%)
572,0070 cache-references ( +- 1.10% ) (21.65%)
330,9106 r014C ( +- 1.60% ) (32.59%)
11,1900 r024C ( +- 3.56% ) (43.04%)
4,6314,2283 r01D1 ( +- 0.22% ) (40.01%)
365,8195 r02d1 ( +- 0.67% ) (36.26%)
5,3485 r04d1 ( +- 8.65% ) (20.09%)
311,3756 r08d1 ( +- 0.97% ) (19.76%)
26,0548 r10d1 ( +- 5.63% ) (19.70%)
21,1605 r20d1 ( +- 5.71% ) (19.62%)
```
* The jump in `r014C` (LOAD_HIT_PRE.SW_PF) shows that the software prefetches are indeed being issued as expected.
* The prefetch version also shows a sizable reduction in cache misses (especially in the retired-load L2/L3 miss counts), which illustrates how prefetching hides memory latency and noticeably improves performance.
* However, the prefetch version's L2 cache hits increase markedly.
* This may be related to the choice among `_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2`, and `_MM_HINT_NTA`; the sketch below shows one way to experiment with the hint.
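* One way to test that (a sketch, not from the repo): make the hint a compile-time macro and rebuild once per hint, e.g. with `-DPF_HINT=_MM_HINT_NTA`, since the hint argument of `_mm_prefetch()` must be a compile-time constant.
```C=
#include <xmmintrin.h>

#ifndef PFDIST
#define PFDIST 8
#endif

/* hypothetical knob: override at build time with -DPF_HINT=_MM_HINT_T0 / _MM_HINT_T2 / _MM_HINT_NTA */
#ifndef PF_HINT
#define PF_HINT _MM_HINT_T1   /* the hint currently used in impl.c */
#endif

static inline void prefetch_block(const int *src, int y, int x, int w)
{
    /* prefetch the four rows of the 4x4 block PFDIST rows ahead */
    for (int i = 0; i < 4; i++)
        _mm_prefetch((const char *) (src + (y + PFDIST + i) * w + x), PF_HINT);
}
```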
---
## Trying a memory pool
* After seeing `jserv`'s suggestion on GitHub, I wanted to see what a memory pool does to performance
* `main.c` only calls `malloc()` twice, for the 4096 × 4096 integer matrices, so I was not sure the memory pool could show its effect; the cost of setting it up might swamp any gain (a usage sketch follows the implementation below).
* Implementation
```C=
#include <stdio.h>
#include <stdlib.h>

/* the struct was not shown in the original note; fields inferred from usage below */
typedef struct {
    int *nextAvail;   /* next free slot in the arena */
    int *end;         /* one past the last slot */
} Pool;

/* allocate a single arena able to hold `size` ints */
Pool *pool_create(size_t size)
{
    Pool *poolPtr = (Pool *) malloc(size * sizeof(int) + sizeof(Pool));
    if (poolPtr == NULL) {
        printf("Pool Create Failed.\n");
        return NULL;
    }
    poolPtr->nextAvail = (int *) &poolPtr[1];
    poolPtr->end = poolPtr->nextAvail + size;
    return poolPtr;
}

void pool_free(Pool *poolPtr)
{
    free(poolPtr);
}

/* returns 1 if at least `size` ints remain in the arena */
int pool_available(Pool *poolPtr, size_t size)
{
    /* does not yet consider alignment */
    return ((size_t)(poolPtr->end - poolPtr->nextAvail)) < size ? 0 : 1;
}

void *pool_alloc(Pool *poolPtr, size_t size)
{
    if (!pool_available(poolPtr, size)) return NULL;
    void *mem = (void *) poolPtr->nextAvail;
    poolPtr->nextAvail += size;   /* bump the cursor by `size` ints */
    return mem;
}
```
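* In `main.c` the two `malloc()` calls are then presumably replaced by something like this (a usage sketch; `TEST_W`/`TEST_H` stand for the 4096 × 4096 test dimensions):
```C=
Pool *pool = pool_create(2 * TEST_W * TEST_H);   /* one arena holding both matrices */
int *src = pool_alloc(pool, TEST_W * TEST_H);
int *dst = pool_alloc(pool, TEST_W * TEST_H);

/* ... fill src, run the transpose, measure ... */

pool_free(pool);                                 /* releases both matrices at once */
```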
* Then compare performance
* Compared with the original version the times shrink only slightly; the `malloc()` overhead is probably not large enough here to produce a noticeable improvement.
* ![](https://i.imgur.com/tu5kixM.png)
```
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
naive: 341012 us
sse: 160502 us
see_prefetch: 94221 us
```
## AVX version
## References
* [4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
* [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx)