# 2017q1 Homework3 (software-pipelining)

contributed by <`zmke`>

## Development environment

OS: Ubuntu 16.04 LTS
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K

## Experiment

### Performance analysis

Using function pointers, the three methods are exposed through the same interface for testing matrix transposition. The Makefile is modified to build three executables so that each method can be profiled separately.

* According to [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html), on the Intel Haswell architecture:
    * r014c corresponds to LOAD_HIT_PRE.SW_PF
    * r024c corresponds to LOAD_HIT_PRE.HW_PF

```
$ make cache-test
perf stat --repeat 50 \
    -e cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,r014c,r024c \
    ./naive

 Performance counter stats for './naive' (50 runs):

        16,909,801      cache-misses              #   93.023 % of all cache refs      ( +-  0.04% )  (66.35%)
        18,178,165      cache-references                                              ( +-  0.05% )  (66.34%)
       535,492,512      L1-dcache-loads                                               ( +-  0.09% )  (63.94%)
        20,746,860      L1-dcache-load-misses     #    3.87% of all L1-dcache hits    ( +-  0.11% )  (33.84%)
               394      r014c                                                         ( +-  3.57% )  (33.80%)
           405,319      r024c                                                         ( +-  3.92% )  (50.00%)

       0.381020000 seconds time elapsed                                          ( +-  0.13% )

perf stat --repeat 50 \
    -e cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,r014c,r024c \
    ./sse

 Performance counter stats for './sse' (50 runs):

         4,492,349      cache-misses              #   81.706 % of all cache refs      ( +-  0.20% )  (65.76%)
         5,498,214      cache-references                                              ( +-  0.13% )  (66.23%)
       426,170,790      L1-dcache-loads                                               ( +-  0.11% )  (63.47%)
         8,383,786      L1-dcache-load-misses     #    1.97% of all L1-dcache hits    ( +-  0.19% )  (34.23%)
               444      r014c                                                         ( +-  4.13% )  (33.21%)
           389,386      r024c                                                         ( +-  4.40% )  (49.39%)

       0.253542948 seconds time elapsed                                          ( +-  0.18% )

perf stat --repeat 50 \
    -e cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,r014c,r024c \
    ./sse_prefetch

 Performance counter stats for './sse_prefetch' (50 runs):

         4,631,619      cache-misses              #   79.482 % of all cache refs      ( +-  0.54% )  (66.97%)
         5,827,248      cache-references                                              ( +-  0.21% )  (67.32%)
       461,334,331      L1-dcache-loads                                               ( +-  0.04% )  (62.35%)
         7,718,862      L1-dcache-load-misses     #    1.67% of all L1-dcache hits    ( +-  0.36% )  (33.02%)
         2,780,379      r014c                                                         ( +-  2.75% )  (34.00%)
           198,855      r024c                                                         ( +-  3.45% )  (50.57%)

       0.196338559 seconds time elapsed                                          ( +-  0.10% )
```

sse_prefetch performs best on both execution time and cache miss rate: 0.196 s elapsed, versus 0.254 s for sse and 0.381 s for naive, with a cache-miss ratio of 79.482 % and an L1-dcache load-miss ratio of 1.67%. The raw counter r014c corresponds to software prefetch (LOAD_HIT_PRE.SW_PF), and its count for sse_prefetch (2,780,379) is far higher than for the other two methods (394 and 444).

## Reference

* [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
* [Key points and commentary on the paper When Prefetching Works, When It Doesn’t, and Why](https://hackmd.io/s/HJtfT3icx)
* [The C Language You Don't Know: Object-Oriented Programming](https://hackmd.io/s/HJLyQaQMl)
* [twzjwang](https://hackmd.io/s/HJDVmyZje#naive-transpose)
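
## Appendix: sketch of the shared interface

The function-pointer interface mentioned under the performance analysis could look roughly like the sketch below. It is not the code that was measured: the names `transpose_fn`, `naive_transpose`, `tiled_prefetch_transpose`, `check_transpose`, and the distance `PFDIST` are assumptions made for illustration, and the prefetching variant uses the GCC/Clang builtin `__builtin_prefetch` on 4x4 tiles as a portable stand-in for the SSE implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical common signature: transpose the w x h row-major matrix
 * `src` into the h x w row-major matrix `dst`. */
typedef void (*transpose_fn)(const int *src, int *dst, int w, int h);

/* Baseline: element-by-element transpose. */
static void naive_transpose(const int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            dst[x * h + y] = src[y * w + x];
}

/* Assumed prefetch distance, in source rows ahead of the current tile. */
#define PFDIST 8

/* Stand-in for a prefetching variant: processes 4x4 tiles and issues
 * software prefetch hints for source rows PFDIST rows ahead.
 * Assumes w and h are multiples of 4. */
static void tiled_prefetch_transpose(const int *src, int *dst, int w, int h)
{
    assert(w % 4 == 0 && h % 4 == 0);
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            /* Hint only rows that exist, so no out-of-range address is formed. */
            for (int r = 0; r < 4; r++)
                if (y + PFDIST + r < h)
                    __builtin_prefetch(&src[(y + PFDIST + r) * w + x], 0, 2);
            /* Transpose the current 4x4 tile. */
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    dst[(x + i) * h + (y + j)] = src[(y + j) * w + (x + i)];
        }
    }
}

/* Run any implementation through the shared interface and verify the
 * result against the definition of a transpose. Returns 1 on success. */
static int check_transpose(transpose_fn fn, int w, int h)
{
    size_t n = (size_t)w * (size_t)h;
    int *src = malloc(n * sizeof(int));
    int *dst = malloc(n * sizeof(int));
    int ok = (src != NULL && dst != NULL);
    if (ok) {
        for (size_t i = 0; i < n; i++)
            src[i] = (int)i;
        fn(src, dst, w, h);
        for (int y = 0; y < h && ok; y++)
            for (int x = 0; x < w && ok; x++)
                ok = (dst[x * h + y] == src[y * w + x]);
    }
    free(src);
    free(dst);
    return ok;
}
```

With this shape, a benchmark driver only needs to pick one `transpose_fn` per executable (for example through a compile-time macro set by the Makefile) and call it through the pointer, so the timing and `perf stat` harness stays identical across methods.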