2017q1 Homework3 (software-pipelining)

contributed by <zmke>

Development environment

OS: Ubuntu 16.04 LTS
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel® Core™ i5-4200H CPU @ 2.80GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K

Experiments

Performance analysis

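The three methods are exposed through a common interface using a function pointer, so the same test code can drive each matrix-transpose implementation. The Makefile is modified to build three executables, one per method, so that each can be profiled separately.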
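
As a rough illustration of the interface described above, the following C sketch dispatches through a function pointer. The names `transpose_fn`, `naive_transpose`, `TRANSPOSE_IMPL`, and `run_transpose` are assumptions made for this example and are not taken from the homework's source; only a naive variant is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Common signature shared by every transpose variant (hypothetical name). */
typedef void (*transpose_fn)(const int *src, int *dst, int w, int h);

/* Baseline variant: src is h rows by w columns, dst is w rows by h columns. */
static void naive_transpose(const int *src, int *dst, int w, int h)
{
    assert(src != NULL && dst != NULL && src != dst); /* out-of-place only */
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            dst[x * h + y] = src[y * w + x];
}

/* The variant is chosen at compile time. TRANSPOSE_IMPL is an assumed macro;
 * overriding it requires that the named function is defined in this file. */
#ifndef TRANSPOSE_IMPL
#define TRANSPOSE_IMPL naive_transpose
#endif

static const transpose_fn transpose = TRANSPOSE_IMPL;

/* The test driver only ever sees the function pointer. */
static void run_transpose(const int *src, int *dst, int w, int h)
{
    transpose(src, dst, w, h);
}
```

Passing a different `-DTRANSPOSE_IMPL=...` value per Makefile target is one possible way to obtain three executables from a single driver; the homework's Makefile may do this differently.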

According to the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B, on the Intel Haswell architecture:
- r014c corresponds to LOAD_HIT_PRE.SW_PF
- r024c corresponds to LOAD_HIT_PRE.HW_PF
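
The raw event names above follow perf's `r<hex>` form, where on Intel CPUs the low byte of the hex value is the event select code and the next byte is the unit mask. The sketch below rebuilds the two names from those parts. It assumes event code 0x4C with umasks 0x01 and 0x02, which is simply what the codes r014c and r024c decode to; `raw_config` and `raw_event_name` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Pack an 8-bit event select and an 8-bit unit mask into a raw config value. */
static unsigned raw_config(unsigned event, unsigned umask)
{
    assert(event <= 0xff && umask <= 0xff);
    return (umask << 8) | event;
}

/* Format the config as a perf raw event name, e.g. "r014c".
 * Returns the snprintf result (number of characters needed). */
static int raw_event_name(char *buf, size_t len, unsigned event, unsigned umask)
{
    return snprintf(buf, len, "r%04x", raw_config(event, umask));
}
```

With these helpers, `raw_event_name(buf, sizeof buf, 0x4c, 0x01)` writes the name used for LOAD_HIT_PRE.SW_PF in the commands below, and umask 0x02 gives the LOAD_HIT_PRE.HW_PF name.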

$ make cache-test

perf stat --repeat 50 \
-e cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,r014c,r024c \
./naive

 Performance counter stats for './naive' (50 runs):

        16,909,801      cache-misses              #   93.023 % of all cache refs      ( +-  0.04% )  (66.35%)
        18,178,165      cache-references                                              ( +-  0.05% )  (66.34%)
       535,492,512      L1-dcache-loads                                               ( +-  0.09% )  (63.94%)
        20,746,860      L1-dcache-load-misses     #    3.87% of all L1-dcache hits    ( +-  0.11% )  (33.84%)
               394      r014c                                                         ( +-  3.57% )  (33.80%)
           405,319      r024c                                                         ( +-  3.92% )  (50.00%)

       0.381020000 seconds time elapsed                                          ( +-  0.13% )

perf stat --repeat 50 \
-e cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,r014c,r024c \
./sse

 Performance counter stats for './sse' (50 runs):

         4,492,349      cache-misses              #   81.706 % of all cache refs      ( +-  0.20% )  (65.76%)
         5,498,214      cache-references                                              ( +-  0.13% )  (66.23%)
       426,170,790      L1-dcache-loads                                               ( +-  0.11% )  (63.47%)
         8,383,786      L1-dcache-load-misses     #    1.97% of all L1-dcache hits    ( +-  0.19% )  (34.23%)
               444      r014c                                                         ( +-  4.13% )  (33.21%)
           389,386      r024c                                                         ( +-  4.40% )  (49.39%)

       0.253542948 seconds time elapsed                                          ( +-  0.18% )

perf stat --repeat 50 \
-e cache-misses,cache-references,L1-dcache-loads,L1-dcache-load-misses,r014c,r024c \
./sse_prefetch

 Performance counter stats for './sse_prefetch' (50 runs):

         4,631,619      cache-misses              #   79.482 % of all cache refs      ( +-  0.54% )  (66.97%)
         5,827,248      cache-references                                              ( +-  0.21% )  (67.32%)
       461,334,331      L1-dcache-loads                                               ( +-  0.04% )  (62.35%)
         7,718,862      L1-dcache-load-misses     #    1.67% of all L1-dcache hits    ( +-  0.36% )  (33.02%)
         2,780,379      r014c                                                         ( +-  2.75% )  (34.00%)
           198,855      r024c                                                         ( +-  3.45% )  (50.57%)

       0.196338559 seconds time elapsed                                          ( +-  0.10% )
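
To compare the three runs side by side, the miss rate and the speedup over naive can be derived from the figures reported above. The sketch below hard-codes the measured counts and elapsed times from this page; the struct and function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One perf stat run, using the averaged figures printed above. */
struct run {
    const char *name;
    double cache_misses;
    double cache_refs;
    double seconds;
};

static const struct run runs[] = {
    {"naive",        16909801.0, 18178165.0, 0.381020000},
    {"sse",           4492349.0,  5498214.0, 0.253542948},
    {"sse_prefetch",  4631619.0,  5827248.0, 0.196338559},
};

/* Fraction of cache references that missed. */
static double miss_rate(const struct run *r)
{
    assert(r->cache_refs > 0.0);
    return r->cache_misses / r->cache_refs;
}

/* How many times faster r is than base, by elapsed time. */
static double speedup(const struct run *base, const struct run *r)
{
    assert(r->seconds > 0.0);
    return base->seconds / r->seconds;
}

/* Print one summary line per run, relative to the naive baseline. */
static void print_summary(void)
{
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++)
        printf("%-12s miss rate %6.2f%%  speedup vs naive %.2fx\n",
               runs[i].name, 100.0 * miss_rate(&runs[i]),
               speedup(&runs[0], &runs[i]));
}
```

By elapsed time, sse_prefetch is roughly 1.94 times faster than naive and roughly 1.29 times faster than sse. Its miss rate is the lowest of the three even though its absolute cache-miss count is slightly higher than that of sse.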

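sse_prefetch has both the shortest execution time (0.196 s) and the lowest miss rates (79.48% of cache references, 1.67% for L1-dcache loads). The raw counter r014c corresponds to software prefetch, and its count for sse_prefetch (2,780,379) is far higher than for the other two methods (394 for naive, 444 for sse).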
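
For readers unfamiliar with software prefetching, the following is a minimal sketch of the idea applied to a scalar transpose. It is not the homework's sse_prefetch code: it uses the GCC/Clang builtin `__builtin_prefetch` instead of SSE intrinsics so that it compiles without SSE headers, and `PFDIST` and `transpose_prefetch` are hypothetical names with an arbitrary prefetch distance.

```c
#include <assert.h>
#include <stddef.h>

#define PFDIST 8 /* rows to prefetch ahead; arbitrary value for illustration */

/* Out-of-place transpose: src is h rows by w columns, dst is w rows by h. */
static void transpose_prefetch(const int *src, int *dst, int w, int h)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (int x = 0; x < w; x++) {
        for (int y = 0; y < h; y++) {
            /* Reads walk down a column of src, so each one touches a new
             * row. Hint the element needed PFDIST iterations from now.
             * Arguments: address, 0 = read, 2 = moderate temporal locality.
             * The bounds check keeps the address inside the array. */
            if (y + PFDIST < h)
                __builtin_prefetch(&src[(y + PFDIST) * w + x], 0, 2);
            dst[x * h + y] = src[y * w + x];
        }
    }
}
```

A prefetch is only a hint and does not change the result, so the output is identical to a plain transpose; any benefit depends on the matrix size, the access stride, and the chosen distance.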

Reference

When Prefetching Works, When It Doesn’t, and Why
Key points and commentary on the paper When Prefetching Works, When It Doesn’t, and Why
你所不知道的 C 語言：物件導向程式設計篇 (What You Didn't Know About C: Object-Oriented Programming)
twzjwang