# 2017q1 Homework3 (software-pipelining)
contributed by <`zmke`>
## Development environment
OS: Ubuntu 16.04 LTS
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
## Experiments
### Performance analysis
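Function pointers let all three methods run the matrix transpose through the same interface. The Makefile is modified to build three executables, one per method, so that each can be measured separately.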
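The shared interface described above can be sketched as follows. This is only an illustration, not the code used for the measurements: the names `transpose_fn`, `TRANSPOSE`, `transpose`, and `check_transpose` are hypothetical, and only the naive variant is shown.

```c
#include <assert.h>

/* Common signature shared by every transpose variant (hypothetical name). */
typedef void (*transpose_fn)(const int *src, int *dst, int w, int h);

/* Baseline: element-by-element transpose of a row-major matrix
 * with w columns and h rows. */
void naive_transpose(const int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            dst[x * h + y] = src[y * w + x];
}

/* Assumption: the Makefile picks the variant per executable by passing
 * -DTRANSPOSE=<function name>; the default is the naive version. */
#ifndef TRANSPOSE
#define TRANSPOSE naive_transpose
#endif

/* The benchmark only ever calls through this pointer. */
const transpose_fn transpose = TRANSPOSE;

/* Fill a small matrix with distinct values, transpose it through `fn`,
 * and return 1 if every element landed at the mirrored position. */
int check_transpose(transpose_fn fn, int w, int h)
{
    enum { MAX_ELEMS = 64 };
    int src[MAX_ELEMS], dst[MAX_ELEMS];

    assert(w > 0 && h > 0 && w * h <= MAX_ELEMS);
    for (int i = 0; i < w * h; i++)
        src[i] = i;
    fn(src, dst, w, h);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (dst[x * h + y] != src[y * w + x])
                return 0;
    return 1;
}
```

Under this assumption, each executable is the same source compiled with a different `-DTRANSPOSE=...` value, so the timing code never needs to know which variant it is running.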
* According to the [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html), on the Intel Haswell microarchitecture:
  * r014c corresponds to LOAD_HIT_PRE.SW_PF
  * r024c corresponds to LOAD_HIT_PRE.HW_PF
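The raw event names above follow perf's `rUUEE` convention on x86: `EE` is the event select (0x4C for LOAD_HIT_PRE) and `UU` is the unit mask (0x01 for SW_PF, 0x02 for HW_PF). A minimal sketch of that encoding, with hypothetical helper names `raw_config` and `raw_event_name`:

```c
#include <assert.h>
#include <stdio.h>

/* Pack unit mask and event select the way perf's rUUEE raw syntax expects:
 * event select in bits 0-7, unit mask in bits 8-15. */
unsigned raw_config(unsigned umask, unsigned event)
{
    return ((umask & 0xffu) << 8) | (event & 0xffu);
}

/* Format the raw event as the string passed to `perf stat -e`.
 * Returns the number of characters written, excluding the terminator. */
int raw_event_name(unsigned umask, unsigned event, char *buf, size_t len)
{
    assert(buf != NULL && len >= 6);
    return snprintf(buf, len, "r%04x", raw_config(umask, event));
}
```

Calling `raw_event_name(0x01, 0x4c, buf, sizeof buf)` fills `buf` with `r014c`, and a unit mask of 0x02 gives `r024c`.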
```
$ make cache-test
perf stat --repeat 50 \
-e cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,r014c,r024c \
./naive
Performance counter stats for './naive' (50 runs):
16,909,801 cache-misses # 93.023 % of all cache refs ( +- 0.04% ) (66.35%)
18,178,165 cache-references ( +- 0.05% ) (66.34%)
535,492,512 L1-dcache-loads ( +- 0.09% ) (63.94%)
20,746,860 L1-dcache-load-misses # 3.87% of all L1-dcache hits ( +- 0.11% ) (33.84%)
394 r014c ( +- 3.57% ) (33.80%)
405,319 r024c ( +- 3.92% ) (50.00%)
0.381020000 seconds time elapsed ( +- 0.13% )
perf stat --repeat 50 \
-e cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,r014c,r024c \
./sse
Performance counter stats for './sse' (50 runs):
4,492,349 cache-misses # 81.706 % of all cache refs ( +- 0.20% ) (65.76%)
5,498,214 cache-references ( +- 0.13% ) (66.23%)
426,170,790 L1-dcache-loads ( +- 0.11% ) (63.47%)
8,383,786 L1-dcache-load-misses # 1.97% of all L1-dcache hits ( +- 0.19% ) (34.23%)
444 r014c ( +- 4.13% ) (33.21%)
389,386 r024c ( +- 4.40% ) (49.39%)
0.253542948 seconds time elapsed ( +- 0.18% )
perf stat --repeat 50 \
-e cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,r014c,r024c \
./sse_prefetch
Performance counter stats for './sse_prefetch' (50 runs):
4,631,619 cache-misses # 79.482 % of all cache refs ( +- 0.54% ) (66.97%)
5,827,248 cache-references ( +- 0.21% ) (67.32%)
461,334,331 L1-dcache-loads ( +- 0.04% ) (62.35%)
7,718,862 L1-dcache-load-misses # 1.67% of all L1-dcache hits ( +- 0.36% ) (33.02%)
2,780,379 r014c ( +- 2.75% ) (34.00%)
198,855 r024c ( +- 3.45% ) (50.57%)
0.196338559 seconds time elapsed ( +- 0.10% )
```
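sse_prefetch performs best in both elapsed time (0.196 s, versus 0.254 s for sse and 0.381 s for naive) and miss rate (79.48% of cache references and 1.67% of L1-dcache loads). The r014c raw counter corresponds to software prefetch, and its count for sse_prefetch (2,780,379) is far higher than for the other two methods (394 for naive and 444 for sse).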
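The comparison can be reproduced from the counters in the perf output. The sketch below uses two hypothetical helpers, `miss_rate_pct` and `speedup`, and the only inputs are the numbers reported above.

```c
#include <assert.h>

/* Miss rate in percent: misses divided by references. */
double miss_rate_pct(double misses, double refs)
{
    assert(refs > 0.0);
    return 100.0 * misses / refs;
}

/* Speedup of a variant relative to the baseline elapsed time. */
double speedup(double baseline_s, double variant_s)
{
    assert(variant_s > 0.0);
    return baseline_s / variant_s;
}

/* Elapsed times (seconds) taken from the perf stat output above. */
static const double NAIVE_S        = 0.381020000;
static const double SSE_S          = 0.253542948;
static const double SSE_PREFETCH_S = 0.196338559;

/* Speedup of each SSE variant over the naive transpose. */
double sse_speedup(void)          { return speedup(NAIVE_S, SSE_S); }
double sse_prefetch_speedup(void) { return speedup(NAIVE_S, SSE_PREFETCH_S); }
```

With these inputs, sse is roughly 1.5 times as fast as naive and sse_prefetch roughly 1.9 times as fast. `miss_rate_pct(4631619, 5827248)` gives the 79.48% that perf prints for sse_prefetch.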
## Reference
[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
[Key points and commentary on the paper When Prefetching Works, When It Doesn’t, and Why](https://hackmd.io/s/HJtfT3icx)
[What You Didn't Know About C: Object-Oriented Programming](https://hackmd.io/s/HJLyQaQMl)
[twzjwang](https://hackmd.io/s/HJDVmyZje#naive-transpose)