contributed by < Chihsiang >
First, look for the answers to the three questions raised in the paper:
The paper opens with a table mapping data structures to the software/hardware prefetching schemes currently applied to them;
from this table we can see that the widely used Array and Hash structures both have prefetch support.
Next comes the classification of prefetching together with a timeline; I do not fully understand the timeline figure. According to the text:
Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency
I take this to mean that prefetch requests must be issued early enough to hide the memory latency.
Conditions on the prefetch distance
Prefetch-distance insertion algorithm
Advantages of software over hardware prefetching (with example cases)
Drawbacks of software prefetching
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
Stepping:              9
CPU MHz:               1279.431
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              4589.27
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
Flags: fpu vme de pse tsc msr pae mce
cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx rdtscp lm constant_tsc arch_perfmon
pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid
sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer
aes xsave avx f16c rdrand lahf_lm tpr_shadow
vnmi flexpriority ept vpid fsgsbase smep
erms xsaveopt dtherm ida arat pln pts
Analysis of the original program
Execution time
sse prefetch: 54074 us
sse: 107184 us
naive: 224151 us
perf stat
naive: 241652 us
Performance counter stats for './naive_transpose':
18,172,399 cache-misses # 93.396 % of all cache refs (43.87%)
19,457,264 cache-references (43.86%)
20,157,827 L1-dcache-load-misses # 3.77% of all L1-dcache hits (44.24%)
534,896,289 L1-dcache-loads (44.73%)
22,319 L1-dcache-prefetch-misses (23.05%)
4,204,752 L1-dcache-store-misses (22.85%)
32,932 L1-icache-load-misses (33.98%)
282,353,767 branch-instructions (44.91%)
563,433 branch-misses # 0.20% of all branches (44.05%)
0.463622418 seconds time elapsed
sse: 116348 us
Performance counter stats for './sse_transpose':
5,864,063 cache-misses # 82.944 % of all cache refs (45.02%)
7,069,928 cache-references (45.68%)
8,338,277 L1-dcache-load-misses # 1.90% of all L1-dcache hits (46.32%)
439,214,122 L1-dcache-loads (44.80%)
108,383 L1-dcache-prefetch-misses (22.49%)
4,283,935 L1-dcache-store-misses (21.30%)
35,892 L1-icache-load-misses (31.90%)
278,020,930 branch-instructions (42.50%)
571,414 branch-misses # 0.21% of all branches (42.43%)
0.340159594 seconds time elapsed
sse prefetch: 55429 us
Performance counter stats for './sse_prefetch_transpose':
7,835,410 cache-misses # 92.816 % of all cache refs (45.05%)
8,441,831 cache-references (45.39%)
8,624,095 L1-dcache-load-misses # 1.84% of all L1-dcache hits (45.40%)
469,033,285 L1-dcache-loads (42.97%)
17,002 L1-dcache-prefetch-misses (21.98%)
3,474,969 L1-dcache-store-misses (23.26%)
24,527 L1-icache-load-misses (34.40%)
259,320,698 branch-instructions (45.21%)
509,994 branch-misses # 0.20% of all branches (44.58%)
0.293630274 seconds time elapsed
From this analysis, the naive version has both the highest cache-miss ratio and count and the longest run time; the SSE version without prefetch is also worse than the one with prefetch, and the prefetch version takes only about half the time of the non-prefetch one.
perf raw counter
Following kaizsv's notes, first find the mask number and event number for my own CPU architecture.
Looking up my CPU model, the Intel Core i7-3615QM belongs to the 3rd generation.
Consult the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B for the relevant raw-counter event encodings.
perf raw counter example
Example:
If the Intel docs for a QM720 Core i7 describe an event as:
Event Umask Event Mask
Num. Value Mnemonic Description Comment
A8H 01H LSD.UOPS Counts the number of micro-ops Use cmask=1 and
delivered by loop stream detector invert to count
cycles
raw encoding of 0x1A8 can be used:
perf stat -e r1a8 -a sleep 1
perf record -e r1a8 ...
To use this, first look up the event to be measured, then combine Umask value + Event num = 01 + A8.
Looking up SW prefetch in the manual:
| Event | Umask | Event Mask Mnemonic | Description |
|---|---|---|---|
| 4CH | 01H | LOAD_HIT_PRE.SW_PF | Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch. |
| 4CH | 02H | LOAD_HIT_PRE.HW_PF | Non-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch. |
perf stat -e r014C ./naive
naive: 224461 us
Performance counter stats for './naive_transpose':
402 r014C
0.450811263 seconds time elapsed
perf stat -e r014C ./sse
sse: 107869 us
Performance counter stats for './sse_transpose':
339 r014C
0.346157841 seconds time elapsed
perf stat -e r014C ./sse_prefetch
sse prefetch: 55197 us
Performance counter stats for './sse_prefetch_transpose':
1,798,956 r014C
0.279139593 seconds time elapsed
These raw counters give the occurrence counts of much finer-grained events; since I do not yet fully understand every event, I am not yet sure what kinds of performance analysis each count can support.
AVX Transpose
Following the SSE approach, implement a transpose that works on 8*8 blocks.
void avx_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 8) {
for (int y = 0; y < h; y += 8) {
__m256 r0, r1, r2, r3, r4, r5, r6, r7;
__m256 t0, t1, t2, t3, t4, t5, t6, t7;
r0 = _mm256_loadu_ps((float *)(src + (y + 0) * w + x));
r1 = _mm256_loadu_ps((float *)(src + (y + 1) * w + x));
r2 = _mm256_loadu_ps((float *)(src + (y + 2) * w + x));
r3 = _mm256_loadu_ps((float *)(src + (y + 3) * w + x));
r4 = _mm256_loadu_ps((float *)(src + (y + 4) * w + x));
r5 = _mm256_loadu_ps((float *)(src + (y + 5) * w + x));
r6 = _mm256_loadu_ps((float *)(src + (y + 6) * w + x));
r7 = _mm256_loadu_ps((float *)(src + (y + 7) * w + x));
t0 = _mm256_unpacklo_ps(r0, r1);
t1 = _mm256_unpackhi_ps(r0, r1);
t2 = _mm256_unpacklo_ps(r2, r3);
t3 = _mm256_unpackhi_ps(r2, r3);
t4 = _mm256_unpacklo_ps(r4, r5);
t5 = _mm256_unpackhi_ps(r4, r5);
t6 = _mm256_unpacklo_ps(r6, r7);
t7 = _mm256_unpackhi_ps(r6, r7);
r0 = _mm256_shuffle_ps(t0, t2, 0x44);
r1 = _mm256_shuffle_ps(t0, t2, 0xEE);
r2 = _mm256_shuffle_ps(t1, t3, 0x44);
r3 = _mm256_shuffle_ps(t1, t3, 0xEE);
r4 = _mm256_shuffle_ps(t4, t6, 0x44);
r5 = _mm256_shuffle_ps(t4, t6, 0xEE);
r6 = _mm256_shuffle_ps(t5, t7, 0x44);
r7 = _mm256_shuffle_ps(t5, t7, 0xEE);
t0 = _mm256_permute2f128_ps(r0, r4, 0x20);
t1 = _mm256_permute2f128_ps(r1, r5, 0x20);
t2 = _mm256_permute2f128_ps(r2, r6, 0x20);
t3 = _mm256_permute2f128_ps(r3, r7, 0x20);
t4 = _mm256_permute2f128_ps(r0, r4, 0x31);
t5 = _mm256_permute2f128_ps(r1, r5, 0x31);
t6 = _mm256_permute2f128_ps(r2, r6, 0x31);
t7 = _mm256_permute2f128_ps(r3, r7, 0x31);
_mm256_storeu_ps((float *)(dst + ((x + 0) * h) + y), t0);
_mm256_storeu_ps((float *)(dst + ((x + 1) * h) + y), t1);
_mm256_storeu_ps((float *)(dst + ((x + 2) * h) + y), t2);
_mm256_storeu_ps((float *)(dst + ((x + 3) * h) + y), t3);
_mm256_storeu_ps((float *)(dst + ((x + 4) * h) + y), t4);
_mm256_storeu_ps((float *)(dst + ((x + 5) * h) + y), t5);
_mm256_storeu_ps((float *)(dst + ((x + 6) * h) + y), t6);
_mm256_storeu_ps((float *)(dst + ((x + 7) * h) + y), t7);
}
}
}
naive: 231006 us
sse: 114900 us
avx: 65514 us
sse prefetch: 54519 us
From the timings, AVX processes twice as many elements per iteration as SSE, so it runs almost twice as fast.
Synopsis
void _mm_prefetch (char const* p, int i)
#include <xmmintrin.h>
Instruction: prefetchnta mprefetch
             prefetcht0 mprefetch
             prefetcht1 mprefetch
             prefetcht2 mprefetch
CPUID Flags: SSE
Description
Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
SSE
sse: 114395 us
Performance counter stats for './sse_transpose':
6,613,088 cache-misses # 81.627 % of all cache refs (44.81%)
8,101,572 cache-references (44.81%)
8,963,416 L1-dcache-load-misses # 2.03% of all L1-dcache hits (44.80%)
441,279,258 L1-dcache-loads (42.83%)
19,481 L1-dcache-prefetch-misses (22.27%)
4,194,679 L1-dcache-store-misses (23.04%)
29,409 L1-icache-load-misses (34.18%)
279,189,951 branch-instructions (45.05%)
555,463 branch-misses # 0.20% of all branches (44.55%)
0.362906873 seconds time elapsed
AVX
avx: 64653 us
Performance counter stats for './avx_transpose':
5,649,660 cache-misses # 80.898 % of all cache refs (45.06%)
6,983,688 cache-references (45.80%)
7,684,501 L1-dcache-load-misses # 1.86% of all L1-dcache hits (45.84%)
412,369,472 L1-dcache-loads (43.41%)
23,589 L1-dcache-prefetch-misses (21.68%)
4,697,455 L1-dcache-store-misses (23.25%)
26,099 L1-icache-load-misses (34.38%)
248,622,769 branch-instructions (45.20%)
518,603 branch-misses # 0.21% of all branches (44.57%)
0.295939411 seconds time elapsed
In the analysis above, what caught my attention most is L1-dcache-prefetch-misses and L1-dcache-store-misses: the AVX version actually produces more of these misses.
SSE prefetch(D = 8)
sse prefetch: 54383 us
Performance counter stats for './sse_prefetch_transpose':
6,208,199 cache-misses # 98.981 % of all cache refs (43.46%)
6,272,119 cache-references (43.45%)
6,500,214 L1-dcache-load-misses # 1.46% of all L1-dcache hits (44.17%)
444,322,322 L1-dcache-loads (44.87%)
24,325 L1-dcache-prefetch-misses (23.58%)
4,606,491 L1-dcache-store-misses (23.23%)
26,308 L1-icache-load-misses (34.35%)
254,746,423 branch-instructions (45.06%)
542,430 branch-misses # 0.21% of all branches (43.65%)
0.283370894 seconds time elapsed
AVX prefetch(D = 8)
avx prefetch: 58723 us
Performance counter stats for './avx_prefetch_transpose':
6,158,483 cache-misses # 86.765 % of all cache refs (44.96%)
7,097,877 cache-references (44.94%)
7,500,835 L1-dcache-load-misses # 1.76% of all L1-dcache hits (44.94%)
425,937,839 L1-dcache-loads (42.50%)
23,246 L1-dcache-prefetch-misses (22.29%)
4,833,529 L1-dcache-store-misses (23.26%)
29,966 L1-icache-load-misses (34.40%)
247,753,365 branch-instructions (45.21%)
515,668 branch-misses # 0.21% of all branches (44.58%)
0.291183267 seconds time elapsed
The data does not yet cover the full range of prefetch distances D or prefetch counts; I will add those comparisons later.
Memory Latency
Following classmate carolc0708's notes, I also wanted to measure my own machine's memory latency.
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
Memory node
Socket 0
0 60.6
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 22924.6
3:1 Reads-Writes : 21408.4
2:1 Reads-Writes : 21079.7
1:1 Reads-Writes : 20016.7
Stream-triad like: 20424.8
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node
Socket 0
0 22919.5
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 135.19 23058.9
00002 134.82 23053.4
00008 133.35 23043.2
00015 132.27 23037.6
00050 126.29 22870.4
00100 82.99 20031.6
00200 64.48 11700.4
00300 61.22 8385.7
00400 61.37 6637.3
00500 60.76 5573.2
00700 60.60 4321.9
01000 61.27 3353.8
01300 61.46 2828.7
01700 61.51 2413.9
02500 61.51 1978.3
03500 61.88 1682.8
05000 61.84 1492.8
09000 61.96 1285.1
20000 61.83 1150.6
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 19.1
Local Socket L2->L2 HITM latency 22.8
Running the mlc test again while avx prefetch is executing:
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
Memory node
Socket 0
0 60.8
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 22224.3
3:1 Reads-Writes : 20991.3
2:1 Reads-Writes : 20937.0
1:1 Reads-Writes : 19920.7
Stream-triad like: 20050.8
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node
Socket 0
0 22883.8
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 141.15 22473.7
00002 136.83 22856.0
00008 135.33 22841.5
00015 136.96 22582.6
00050 131.97 22277.7
00100 86.87 19832.6
00200 67.08 11648.3
00300 61.98 8370.2
00400 62.60 6614.3
00500 62.30 5439.4
00700 70.03 3344.9
01000 70.56 2536.9
01300 62.81 2642.3
01700 63.33 2192.8
02500 62.48 1885.3
03500 61.62 1704.1
05000 62.17 1473.5
09000 82.62 917.5
20000 64.81 1086.5
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 21.0
Local Socket L2->L2 HITM latency 23.2
The numbers really do change, but I am still working out how to control the test conditions precisely.