
2017q1 Homework3 (software-pipelining)

contributed by <zhanyangch>

Reviewed by ZixinYang

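  • This note finds that sse and sse-prefetch produce nearly the same number of cache-misses. It then applies the paper's point that the prefetch distance must be greater than the memory latency, observes how raising PFDIST affects cache-misses, and highlights the performance cost of prefetching too early.
  • Finally, raw counters are used to collect more detailed data. The author could explain why these particular events were chosen for testing, and interpret the numbers.
tags: zhanyangch sysprog2017 week3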

Paper reading

Execution environment

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Model name:            Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
Stepping:              7
CPU MHz:               855.421
CPU max MHz:           3500.0000
CPU min MHz:           800.0000
BogoMIPS:              5587.06
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K
NUMA node0 CPU(s):     0-3

Transpose Matrix

A unified interface

Modify the Makefile and impl.c so that the macro passed with -D at compile time decides which implementation is built.
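One way to do this selection can be sketched as follows. The macro names `SSE` and `SSE_PREFETCH`, the common name `transpose`, and the function signatures are assumptions made for illustration; the actual impl.c may be organized differently. Only the naive version is defined here, the other two are merely declared.

```c
#include <assert.h>

/* Hypothetical prototypes sharing one signature. */
void naive_transpose(const int *src, int *dst, int w, int h);
void sse_transpose(const int *src, int *dst, int w, int h);
void sse_prefetch_transpose(const int *src, int *dst, int w, int h);

/* Pick the implementation from a -D flag, e.g. `gcc -DSSE ...`. */
#if defined(SSE_PREFETCH)
#define transpose sse_prefetch_transpose
#elif defined(SSE)
#define transpose sse_transpose
#else /* no flag given: fall back to the naive version */
#define transpose naive_transpose
#endif

/* src has h rows of w ints; dst receives w rows of h ints. */
void naive_transpose(const int *src, int *dst, int w, int h)
{
    assert(src && dst && w > 0 && h > 0);
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            dst[x * h + y] = src[y * w + x];
}
```

With this layout the benchmark code only ever calls `transpose(src, dst, w, h)`, and the Makefile can build one binary per flag from the same source file.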

SSE transpose

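  • __m128i _mm_loadu_si128(__m128i const *p): loads 128 bits (i.e. four 32-bit integers) from p, which does not have to be 16-byte aligned
  • __m128i _mm_unpacklo_epi32(__m128i a, __m128i b): interleaves the two lowest 32-bit elements of a and b as [a0 b0 a1 b1]

For example, __m128i T0 = _mm_unpacklo_epi32(I0, I1);

__m128i  d0   d1   d2   d3
I0       i00  i01  i02  i03
I1       i10  i11  i12  i13
T0       i00  i10  i01  i11
  • __m128i _mm_unpackhi_epi32(__m128i a, __m128i b): interleaves the two highest 32-bit elements of a and b as [a2 b2 a3 b3]
  • void _mm_storeu_si128(__m128i *p, __m128i a): stores the 128 bits of a to p (no alignment required)
  • _mm_prefetch(char *p, int i): prefetches the cache line containing address p; i is the locality hint, and each call brings in one cache line
    (_MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA)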
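The intrinsics above combine into a 4x4 transpose roughly as sketched below. The function name `transpose4x4_sse` is made up for this example, and the second stage uses `_mm_unpacklo_epi64` / `_mm_unpackhi_epi64`, which are not listed above but are the 64-bit counterparts of the same interleave. Treat it as an illustration of the data movement rather than the exact code in impl.c.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */

/* Transpose one 4x4 block of 32-bit ints stored row by row. */
void transpose4x4_sse(const int *src, int *dst)
{
    assert(src && dst);
    __m128i I0 = _mm_loadu_si128((const __m128i *) (src + 0));
    __m128i I1 = _mm_loadu_si128((const __m128i *) (src + 4));
    __m128i I2 = _mm_loadu_si128((const __m128i *) (src + 8));
    __m128i I3 = _mm_loadu_si128((const __m128i *) (src + 12));

    /* Stage 1: interleave 32-bit elements of row pairs. */
    __m128i T0 = _mm_unpacklo_epi32(I0, I1); /* i00 i10 i01 i11 */
    __m128i T1 = _mm_unpacklo_epi32(I2, I3); /* i20 i30 i21 i31 */
    __m128i T2 = _mm_unpackhi_epi32(I0, I1); /* i02 i12 i03 i13 */
    __m128i T3 = _mm_unpackhi_epi32(I2, I3); /* i22 i32 i23 i33 */

    /* Stage 2: interleave 64-bit halves to form whole columns. */
    I0 = _mm_unpacklo_epi64(T0, T1); /* i00 i10 i20 i30 */
    I1 = _mm_unpackhi_epi64(T0, T1); /* i01 i11 i21 i31 */
    I2 = _mm_unpacklo_epi64(T2, T3); /* i02 i12 i22 i32 */
    I3 = _mm_unpackhi_epi64(T2, T3); /* i03 i13 i23 i33 */

    _mm_storeu_si128((__m128i *) (dst + 0), I0);
    _mm_storeu_si128((__m128i *) (dst + 4), I1);
    _mm_storeu_si128((__m128i *) (dst + 8), I2);
    _mm_storeu_si128((__m128i *) (dst + 12), I3);
}
```

A full-matrix version would call this on every 4x4 tile, with the row stride taken from the matrix width instead of the fixed 4 used here.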

Observing cache misses with perf

 Performance counter stats for './naive_transpose' (5 runs):

        18,321,550      cache-misses              #   91.193 % of all cache refs      ( +-  0.03% )
        20,090,955      cache-references                                              ( +-  0.01% )
     1,448,992,488      instructions              #    1.04  insns per cycle          ( +-  0.01% )
     1,388,146,906      cycles                                                        ( +-  1.86% )

       0.476780726 seconds time elapsed                                          ( +-  3.28% )

 Performance counter stats for './sse_transpose' (5 runs):

         6,044,604      cache-misses              #   80.660 % of all cache refs      ( +-  0.14% )
         7,493,943      cache-references                                              ( +-  0.06% )
     1,237,151,919      instructions              #    1.33  insns per cycle          ( +-  0.02% )
       933,002,791      cycles                                                        ( +-  1.95% )

       0.349286181 seconds time elapsed                                          ( +-  8.65% )

 Performance counter stats for './sse_prefetch_transpose' (5 runs):

         6,005,813      cache-misses              #   80.029 % of all cache refs      ( +-  0.10% )
         7,504,588      cache-references                                              ( +-  0.12% )
     1,283,017,647      instructions              #    1.73  insns per cycle          ( +-  0.00% )
       741,071,615      cycles                                                        ( +-  0.37% )

       0.293647190 seconds time elapsed                                          ( +-  6.50% )
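The two derived columns that perf prints after `#` are simple ratios of the raw counts, which makes it easy to recompute them when comparing runs. A small sketch, with helper names invented for this note:

```c
#include <assert.h>

/* Share of cache references that missed, in percent. */
double miss_percent(unsigned long long misses, unsigned long long refs)
{
    assert(refs > 0);
    return 100.0 * (double) misses / (double) refs;
}

/* Instructions retired per cycle. */
double insns_per_cycle(unsigned long long insns, unsigned long long cycles)
{
    assert(cycles > 0);
    return (double) insns / (double) cycles;
}
```

For the naive run, `miss_percent(18321550, 20090955)` comes out at about 91.19, and for the prefetch run `insns_per_cycle(1283017647, 741071615)` at about 1.73, in line with the figures perf reports above.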
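  • Surprisingly, with PFDIST set to 8 the cache-miss counts of sse and sse+prefetch are almost the same. According to the paper, the prefetch distance must be greater than the memory latency.

  • Varying PFDIST shows turning points at PFDIST = 16 and PFDIST = 112, with little difference in between. When PFDIST < 16, the distance is shorter than the memory latency, so the data has not yet arrived when it is needed. When PFDIST > 112, the data is prefetched too early, and cache misses rise again.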

  • Raw counters give more detailed data. Following illusion030's notes, the PERFORMANCE-MONITORING EVENTS for this processor model can be found in the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B. The entries below are taken from Table 19.5.

Event Num.  Umask Value  Event Mask Mnemonic
4CH         01H          LOAD_HIT_PRE.SW_PF
4CH         02H          LOAD_HIT_PRE.HW_PF
D1H         01H          MEM_LOAD_UOPS_RETIRED.L1_HIT
D1H         02H          MEM_LOAD_UOPS_RETIRED.L2_HIT
D1H         04H          MEM_LOAD_UOPS_RETIRED.LLC_HIT
D1H         08H          MEM_LOAD_UOPS_RETIRED.L1_MIS
D1H         10H          MEM_LOAD_UOPS_RETIRED.L2_MIS
D1H         20H          MEM_LOAD_UOPS_RETIRED.LLC_MIS

Command format

perf stat -e r<Umask Value><Event Num>
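The raw event string is just the umask followed by the event number, each written as two hex digits after the letter `r`. A tiny helper (the name `perf_raw_event` is invented here) shows the encoding:

```c
#include <assert.h>
#include <stdio.h>

/* Write "r<umask><event>" into buf, two lowercase hex digits each.
 * Returns the number of characters snprintf wanted to write. */
int perf_raw_event(char *buf, size_t size, unsigned event, unsigned umask)
{
    assert(buf && event <= 0xff && umask <= 0xff);
    return snprintf(buf, size, "r%02x%02x", umask, event);
}
```

For LOAD_HIT_PRE.SW_PF (event 4CH, umask 01H) this yields `r014c`, and for MEM_LOAD_UOPS_RETIRED.LLC_MIS (event D1H, umask 20H) it yields `r20d1`, the same codes that appear in the perf output.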
Performance counter stats for './naive_transpose' (10 runs):

             4,380      r014c LOAD_HIT_PRE.SW_PF                                      ( +- 20.20% )  (24.90%)
           416,936      r024c LOAD_HIT_PRE.HW_PF                                      ( +- 27.46% )  (25.69%)
       521,970,614      r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT                            ( +-  0.87% )  (26.13%)
            55,233      r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT                            ( +- 58.72% )  (25.76%)
           457,605      r04d1 MEM_LOAD_UOPS_RETIRED.LLC_HIT                           ( +- 17.80% )  (25.39%)
        16,221,927      r08d1 MEM_LOAD_UOPS_RETIRED.L1_MIS                            ( +-  0.90% )  (25.18%)
        16,138,041      r10d1 MEM_LOAD_UOPS_RETIRED.L2_MIS                            ( +-  1.07% )  (25.06%)
             3,110      r20d1 MEM_LOAD_UOPS_RETIRED.LLC_MIS                           ( +- 34.97% )  (24.78%)

       0.485324876 seconds time elapsed                                          ( +-  2.56% )

 Performance counter stats for './sse_transpose' (10 runs):

            11,687      r014c LOAD_HIT_PRE.SW_PF                                      ( +- 67.66% )  (24.64%)
           613,046      r024c LOAD_HIT_PRE.HW_PF                                      ( +- 27.57% )  (25.75%)
       427,907,439      r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT                            ( +-  2.01% )  (26.56%)
            32,728      r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT                            ( +- 53.98% )  (26.24%)
           105,153      r04d1 MEM_LOAD_UOPS_RETIRED.LLC_HIT                           ( +- 16.43% )  (25.64%)
         3,968,212      r08d1 MEM_LOAD_UOPS_RETIRED.L1_MIS                            ( +-  3.31% )  (25.45%)
         4,018,123      r10d1 MEM_LOAD_UOPS_RETIRED.L2_MIS                            ( +-  3.23% )  (25.23%)
             2,194      r20d1 MEM_LOAD_UOPS_RETIRED.LLC_MIS                           ( +- 43.05% )  (24.75%)

       0.314186726 seconds time elapsed                                          ( +-  4.80% )

 Performance counter stats for './sse_prefetch_transpose' (10 runs):

            73,962      r014c LOAD_HIT_PRE.SW_PF                                      ( +-  7.85% )  (24.98%)
           645,530      r024c LOAD_HIT_PRE.HW_PF                                      ( +- 27.46% )  (25.76%)
       436,823,098      r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT                            ( +-  1.01% )  (26.42%)
           250,519      r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT                            ( +-  4.97% )  (25.91%)
         3,701,145      r04d1 MEM_LOAD_UOPS_RETIRED.LLC_HIT                           ( +-  1.58% )  (25.79%)
         3,950,488      r08d1 MEM_LOAD_UOPS_RETIRED.L1_MIS                            ( +-  1.36% )  (25.68%)
         3,771,673      r10d1 MEM_LOAD_UOPS_RETIRED.L2_MIS                            ( +-  1.11% )  (25.31%)
             1,026      r20d1 MEM_LOAD_UOPS_RETIRED.LLC_MIS                           ( +- 10.91% )  (24.86%)

       0.298254699 seconds time elapsed                                          ( +-  3.08% )

sse_prefetch shows a far higher LOAD_HIT_PRE.SW_PF count than the other two versions (73,962 versus 4,380 for naive and 11,687 for sse).
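For reference, the kind of loop that produces these software-prefetch hits can be sketched as below. This is a scalar stand-in, not the sse_prefetch code itself: the function name `prefetch_transpose`, the choice of `_MM_HINT_T1`, and the reading of PFDIST as "rows ahead of the current row" are all assumptions made for the illustration.

```c
#include <assert.h>
#include <xmmintrin.h> /* _mm_prefetch */

#ifndef PFDIST
#define PFDIST 8 /* how many source rows ahead to prefetch (assumed meaning) */
#endif

/* src has h rows of w ints; dst receives w rows of h ints. */
void prefetch_transpose(const int *src, int *dst, int w, int h)
{
    assert(src && dst && w > 0 && h > 0);
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y++) {
            /* Ask for the row we will need PFDIST iterations from now.
             * The bounds check keeps the address inside the array. */
            if (y + PFDIST < h)
                _mm_prefetch((const char *) (src + (y + PFDIST) * w + x),
                             _MM_HINT_T1);
            /* Copy up to four ints of the current row into their columns. */
            for (int i = 0; i < 4 && x + i < w; i++)
                dst[(x + i) * h + y] = src[y * w + x + i];
        }
    }
}
```

Building with different values, for example `-DPFDIST=16` or `-DPFDIST=112`, is one way to repeat the distance experiment described earlier without touching the source.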