
2017q1 Homework3 (software-pipelining)

contributed by < Chihsiang >

Homework wiki
Prefetch paper
Chinese summary of the paper

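Paper study

First, look for the answers to the three questions the paper poses:

  1. What are the limitations and costs of software prefetching?
  2. What are the limitations and costs of hardware prefetching?
  3. When does using software and/or hardware prefetching give the best results?

The paper starts with a table that maps common data structures to the software and hardware prefetching mechanisms available for them.

From this table, the widely used array and hash structures both have prefetching mechanisms available.

Next come the classification of prefetches and a timeline figure. I do not fully understand the timeline figure; according to the text: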

Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency

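All I can take from this is that a prefetch request has to be issued early enough to hide the memory latency.

  • Prefetch distance condition

    • D is the prefetch distance
    • l is the prefetch latency, the time it takes to bring the data in
    • s is the length of the shortest path through the loop body
  • Prefetch distance used by the insertion algorithm

    • D is the prefetch distance
    • L is the average memory latency
    • K is a constant
    • IPCb is the profiled average IPC of each benchmark
    • Wloop is the average number of instructions per loop iteration
  • Advantages of software prefetching over hardware prefetching (example cases)

    • Large Number of Streams:
      • A hardware prefetcher can only track a limited number of streams, whereas software prefetching can insert prefetch requests for each stream as needed
    • Short Streams:
      • A hardware prefetcher needs training time, at least two cache misses, before it can detect the direction of a stream or stride
    • Irregular Memory Access
      • Software prefetching is better able to prefetch for complex data structures
    • Cache Locality Hint
      • With software prefetching, the programmer can choose the locality hint
      • Hardware prefetchers are mostly designed to place prefetched data in a lower-level cache (L2 or L3)
      • Advantage: less cache pollution in the L1 cache
      • Disadvantage: the latency of moving data from the lower-level cache to L1 can reduce performance
    • Loop Bounds
      • Software prefetching can be written so that it never issues prefetch requests beyond the loop's iteration count
      • For example: loop unrolling, software pipelining, or branch instructions
      • Hardware prefetching cannot be controlled this way; an aggressive hardware prefetcher (large prefetch distance, high prefetch rate) may even consume too much memory bandwidth
  • Disadvantages of software prefetching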

    • Instruction count
    • Static insert
    • Code structure change
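The formulas that belong to the two prefetch-distance lists above are not reproduced in these notes. A reconstruction from the listed symbols, assuming the usual statement of both relations in the paper, is given below. The numbers in the last line are purely illustrative and are not measurements from this machine.

```latex
% Condition for a prefetch to be timely (symbols as in the first list)
D \ge \left\lceil \frac{l}{s} \right\rceil

% Distance used by the insertion algorithm (symbols as in the second list)
D = \frac{K \times L \times IPC_{b}}{W_{loop}}

% Illustrative values only: l = 200 cycles, s = 25 cycles
D \ge \left\lceil \frac{200}{25} \right\rceil = 8
```

Read this way, a longer latency or a shorter loop body both push the required distance up.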

Program improvements

  • Environment
    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                8
    On-line CPU(s) list:   0-7
    Thread(s) per core:    2
    Core(s) per socket:    4
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 58
    Model name:            Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
    Stepping:              9
    CPU MHz:               1279.431
    CPU max MHz:           3300.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              4589.27
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              6144K
    NUMA node0 CPU(s):     0-7
    Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
  • Analysis of the original programs

    • Execution time

      ​​​​​​​​sse prefetch: 	 54074 us
      ​​​​​​​​sse: 		 107184 us
      ​​​​​​​​naive: 		 224151 us
      
    • perf stat

      • naive
      ​​​​​​​​naive: 		 241652 us
      
      ​​​​​​​​Performance counter stats for './naive_transpose':
      
      ​​​​​​​​ 18,172,399      cache-misses              #   93.396 % of all cache refs      (43.87%)
      ​​​​​​​​ 19,457,264      cache-references                                              (43.86%)
      ​​​​​​​​ 20,157,827      L1-dcache-load-misses     #    3.77% of all L1-dcache hits    (44.24%)
      ​​​​​​​534,896,289      L1-dcache-loads                                               (44.73%)
      ​​​​​​​​     22,319      L1-dcache-prefetch-misses                                     (23.05%)
      ​​​​​​​​  4,204,752      L1-dcache-store-misses                                        (22.85%)
      ​​​​​​​​     32,932      L1-icache-load-misses                                         (33.98%)
      ​​​​​​​282,353,767      branch-instructions                                           (44.91%)
      ​​​​​​​​    563,433      branch-misses             #    0.20% of all branches          (44.05%)
      
      ​​​​​​​0.463622418 seconds time elapsed
      
      • sse
      ​​​​​​​​sse: 		 116348 us
      
      ​​​​​​​​Performance counter stats for './sse_transpose':
      
      ​​​​​​​​  5,864,063      cache-misses              #   82.944 % of all cache refs      (45.02%)
      ​​​​​​​​  7,069,928      cache-references                                              (45.68%)
      ​​​​​​​​  8,338,277      L1-dcache-load-misses     #    1.90% of all L1-dcache hits    (46.32%)
      ​​​​​​​439,214,122      L1-dcache-loads                                               (44.80%)
      ​​​​​​​​    108,383      L1-dcache-prefetch-misses                                     (22.49%)
      ​​​​​​​​  4,283,935      L1-dcache-store-misses                                        (21.30%)
      ​​​​​​​​     35,892      L1-icache-load-misses                                         (31.90%)
      ​​​​​​​278,020,930      branch-instructions                                           (42.50%)
      ​​​​​​​​    571,414      branch-misses             #    0.21% of all branches          (42.43%)
      
      ​​​​​​​0.340159594 seconds time elapsed
      
      • sse_prefetch
      ​​​​​​​​sse prefetch: 	 55429 us
      
      ​​​​​​​​Performance counter stats for './sse_prefetch_transpose':
      
      ​​​​​​​​  7,835,410      cache-misses              #   92.816 % of all cache refs      (45.05%)
      ​​​​​​​​  8,441,831      cache-references                                              (45.39%)
      ​​​​​​​​  8,624,095      L1-dcache-load-misses     #    1.84% of all L1-dcache hits    (45.40%)
      ​​​​​​​469,033,285      L1-dcache-loads                                               (42.97%)
      ​​​​​​​​     17,002      L1-dcache-prefetch-misses                                     (21.98%)
      ​​​​​​​​  3,474,969      L1-dcache-store-misses                                        (23.26%)
      ​​​​​​​​     24,527      L1-icache-load-misses                                         (34.40%)
      ​​​​​​​259,320,698      branch-instructions                                           (45.21%)
      ​​​​​​​​    509,994      branch-misses             #    0.20% of all branches          (44.58%)
      
      ​​​​​​​0.293630274 seconds time elapsed
      

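      According to these results, the naive version has the highest cache-miss count and ratio (18,172,399 misses, 93.396 %) and also takes the longest. In this run the SSE version without prefetch actually reports fewer cache misses (5,864,063, 82.944 %) than the SSE version with prefetch (7,835,410, 92.816 %), yet the prefetch version needs only about half the time (55429 us versus 116348 us). The percentages in parentheses show that each counter was active for only part of the run, so the counts are scaled estimates.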

    • perf raw counter

      Following kaizsv's notes, first find the umask value and event number for my own CPU architecture.

      My CPU, the Intel Core i7-3615QM, belongs to the 3rd generation.

      Look up the relevant raw event codes in the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B.

      perf raw counter example:

      ​​​​​​​​Example:
      
      ​​​​​​​If the Intel docs for a QM720 Core i7 describe an event as:
      
      ​​​​​​​​   Event  Umask  Event Mask
      ​​​​​​​​   Num.   Value  Mnemonic    Description                        Comment
      
      ​​​​​​​​   A8H      01H  LSD.UOPS    Counts the number of micro-ops     Use cmask=1 and
      ​​​​​​​​                             delivered by loop stream detector  invert to count
      ​​​​​​​​                                                                cycles
      
      ​​​​​​​raw encoding of 0x1A8 can be used:
      
      ​​​​​​​​   perf stat -e r1a8 -a sleep 1
      ​​​​​​​​   perf record -e r1a8 ...
      

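      To use it, look up the event to be measured, then concatenate the umask value and the event number: umask 01 followed by event A8 gives the raw code r01A8 (the same as r1a8).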

      • Looking up SW prefetch in the manual

        Event  Umask  Event Mask
        Num.   Value  Mnemonic            Description
        4CH    01H    LOAD_HIT_PRE.SW_PF  Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch.
        4CH    02H    LOAD_HIT_PRE.HW_PF  Non-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch.
      • perf stat -e r014C ./naive_transpose

        ​​​​​​​​​​​​naive: 		 224461 us
        
        ​​​​​​​​​​​​Performance counter stats for './naive_transpose':
        
        ​​​​​​​​​​​​   402      r014C
        
        ​​​​​​​​​​​​0.450811263 seconds time elapsed
        
      • perf stat -e r014C ./sse_transpose

        ​​​​​​​​​​​​sse: 		 107869 us
        
        ​​​​​​​​​​​​Performance counter stats for './sse_transpose':
        
        ​​​​​​​​​​​​   339      r014C
        
        ​​​​​​​​​​​​0.346157841 seconds time elapsed
        
      • perf stat -e r014C ./sse_prefetch_transpose

        ​​​​​​​​​​​​sse prefetch: 	 55197 us
        
        ​​​​​​​​​​​​Performance counter stats for './sse_prefetch_transpose':
        
        ​​​​​​​​​​​​1,798,956      r014C
        
        ​​​​​​​​​​​​0.279139593 seconds time elapsed
        

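        Raw counters give more detailed event counts. Here r014C reports 1,798,956 loads that hit a fill buffer allocated by a software prefetch in the prefetch version, against only 402 and 339 in the naive and SSE versions. There are many such events, and I do not yet fully understand what kind of performance analysis each count supports.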

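  • Verification mechanism
    I want the check that the transpose result is correct to be automated; the implementation is modeled on twzjwang's.
    P.S. source code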

  • AVX Transpose

    • Following the SSE approach, implement a transpose that works on 8*8 blocks

      • _mm256_loadu_ps
        Loads 256 bits of data from memory into dst; the address does not have to be aligned
      • _mm256_unpacklo_ps
        Splits sources a and b into two 128-bit lanes and interleaves the low 64 bits of each lane into dst
      • _mm256_shuffle_ps
        Selects elements from the sources according to the bits of imm8 and stores them in dst
        P.S. imm8 wiki
      • _mm256_permute2f128_ps
        Selects whole 128-bit lanes from the sources according to the bits of imm8 and stores them in dst

      #include <immintrin.h>

      /* Transpose an h-row by w-column int matrix in 8x8 blocks.
       * w and h must be multiples of 8; compile with AVX enabled (e.g. -mavx).
       * The _ps intrinsics only move bits here, so int data is passed through
       * float pointers unchanged. */
      void avx_transpose(int *src, int *dst, int w, int h)
      {
          for (int x = 0; x < w; x += 8) {
              for (int y = 0; y < h; y += 8) {
                  __m256 r0, r1, r2, r3, r4, r5, r6, r7;
                  __m256 t0, t1, t2, t3, t4, t5, t6, t7;

                  /* load the eight rows of the block */
                  r0 = _mm256_loadu_ps((const float *)(src + (y + 0) * w + x));
                  r1 = _mm256_loadu_ps((const float *)(src + (y + 1) * w + x));
                  r2 = _mm256_loadu_ps((const float *)(src + (y + 2) * w + x));
                  r3 = _mm256_loadu_ps((const float *)(src + (y + 3) * w + x));
                  r4 = _mm256_loadu_ps((const float *)(src + (y + 4) * w + x));
                  r5 = _mm256_loadu_ps((const float *)(src + (y + 5) * w + x));
                  r6 = _mm256_loadu_ps((const float *)(src + (y + 6) * w + x));
                  r7 = _mm256_loadu_ps((const float *)(src + (y + 7) * w + x));

                  /* interleave 32-bit elements of neighbouring rows */
                  t0 = _mm256_unpacklo_ps(r0, r1);
                  t1 = _mm256_unpackhi_ps(r0, r1);
                  t2 = _mm256_unpacklo_ps(r2, r3);
                  t3 = _mm256_unpackhi_ps(r2, r3);
                  t4 = _mm256_unpacklo_ps(r4, r5);
                  t5 = _mm256_unpackhi_ps(r4, r5);
                  t6 = _mm256_unpacklo_ps(r6, r7);
                  t7 = _mm256_unpackhi_ps(r6, r7);

                  /* combine 64-bit pairs within each 128-bit lane */
                  r0 = _mm256_shuffle_ps(t0, t2, 0x44);
                  r1 = _mm256_shuffle_ps(t0, t2, 0xEE);
                  r2 = _mm256_shuffle_ps(t1, t3, 0x44);
                  r3 = _mm256_shuffle_ps(t1, t3, 0xEE);
                  r4 = _mm256_shuffle_ps(t4, t6, 0x44);
                  r5 = _mm256_shuffle_ps(t4, t6, 0xEE);
                  r6 = _mm256_shuffle_ps(t5, t7, 0x44);
                  r7 = _mm256_shuffle_ps(t5, t7, 0xEE);

                  /* exchange 128-bit lanes: 0x20 takes the low lanes, 0x31 the high lanes */
                  t0 = _mm256_permute2f128_ps(r0, r4, 0x20);
                  t1 = _mm256_permute2f128_ps(r1, r5, 0x20);
                  t2 = _mm256_permute2f128_ps(r2, r6, 0x20);
                  t3 = _mm256_permute2f128_ps(r3, r7, 0x20);
                  t4 = _mm256_permute2f128_ps(r0, r4, 0x31);
                  t5 = _mm256_permute2f128_ps(r1, r5, 0x31);
                  t6 = _mm256_permute2f128_ps(r2, r6, 0x31);
                  t7 = _mm256_permute2f128_ps(r3, r7, 0x31);

                  /* store the eight transposed rows */
                  _mm256_storeu_ps((float *)(dst + ((x + 0) * h) + y), t0);
                  _mm256_storeu_ps((float *)(dst + ((x + 1) * h) + y), t1);
                  _mm256_storeu_ps((float *)(dst + ((x + 2) * h) + y), t2);
                  _mm256_storeu_ps((float *)(dst + ((x + 3) * h) + y), t3);
                  _mm256_storeu_ps((float *)(dst + ((x + 4) * h) + y), t4);
                  _mm256_storeu_ps((float *)(dst + ((x + 5) * h) + y), t5);
                  _mm256_storeu_ps((float *)(dst + ((x + 6) * h) + y), t6);
                  _mm256_storeu_ps((float *)(dst + ((x + 7) * h) + y), t7);
              }
          }
      }
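      • Timing
      naive:        231006 us
      sse:          114900 us
      avx:          65514 us
      sse prefetch: 54519 us

      The timings show that AVX, which handles twice as many elements per instruction as SSE, is about 1.75 times faster than the SSE version (114900 us versus 65514 us), close to the expected factor of two.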

    • Adding prefetch
      AVX does not provide a prefetch instruction of its own, so the SSE one is used instead.
      Synopsis
        void _mm_prefetch (char const* p, int i)
        #include "xmmintrin.h"
        Instruction: prefetchnta mprefetch
                     prefetcht0 mprefetch
                     prefetcht1 mprefetch
                     prefetcht2 mprefetch
        CPUID Flags: SSE
      Description
        Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.

Main analysis

  • Perf Stat
    • SSE

      ​​​​​​​​sse: 		 114395 us
      ​​​​​​​​Performance counter stats for './sse_transpose':
      
      ​​​​​​​​     6,613,088      cache-misses              #   81.627 % of all cache refs      (44.81%)
      ​​​​​​​​     8,101,572      cache-references                                              (44.81%)
      ​​​​​​​​     8,963,416      L1-dcache-load-misses     #    2.03% of all L1-dcache hits    (44.80%)
      ​​​​​​​​   441,279,258      L1-dcache-loads                                               (42.83%)
      ​​​​​​​​        19,481      L1-dcache-prefetch-misses                                     (22.27%)
      ​​​​​​​​     4,194,679      L1-dcache-store-misses                                        (23.04%)
      ​​​​​​​​        29,409      L1-icache-load-misses                                         (34.18%)
      ​​​​​​​​   279,189,951      branch-instructions                                           (45.05%)
      ​​​​​​​​       555,463      branch-misses             #    0.20% of all branches          (44.55%)
      
      ​​​​​​​​   0.362906873 seconds time elapsed
      
    • AVX

      ​​​​​​​​avx: 		 64653 us
      ​​​​​​​​Performance counter stats for './avx_transpose':
      
      ​​​​​​​​    5,649,660      cache-misses              #   80.898 % of all cache refs      (45.06%)
      ​​​​​​​​    6,983,688      cache-references                                              (45.80%)
      ​​​​​​​​    7,684,501      L1-dcache-load-misses     #    1.86% of all L1-dcache hits    (45.84%)
      ​​​​​​​​  412,369,472      L1-dcache-loads                                               (43.41%)
      ​​​​​​​​       23,589      L1-dcache-prefetch-misses                                     (21.68%)
      ​​​​​​​​    4,697,455      L1-dcache-store-misses                                        (23.25%)
      ​​​​​​​​       26,099      L1-icache-load-misses                                         (34.38%)
      ​​​​​​​​  248,622,769      branch-instructions                                           (45.20%)
      ​​​​​​​​      518,603      branch-misses             #    0.21% of all branches          (44.57%)
      
      ​​​​​​​​ 0.295939411 seconds time elapsed
      

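      In the results above, what stands out most to me is that for L1-dcache-prefetch-misses and L1-dcache-store-misses the AVX version actually produces more misses than the SSE version (23,589 versus 19,481 and 4,697,455 versus 4,194,679).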

    • SSE prefetch(D = 8)

      ​​​​​​​​sse prefetch: 	 54383 us
      
      ​​​​​​​​Performance counter stats for './sse_prefetch_transpose':
      
      ​​​​​​​​         6,208,199      cache-misses              #   98.981 % of all cache refs      (43.46%)
      ​​​​​​​​         6,272,119      cache-references                                              (43.45%)
      ​​​​​​​​         6,500,214      L1-dcache-load-misses     #    1.46% of all L1-dcache hits    (44.17%)
      ​​​​​​​​       444,322,322      L1-dcache-loads                                               (44.87%)
      ​​​​​​​​            24,325      L1-dcache-prefetch-misses                                     (23.58%)
      ​​​​​​​​         4,606,491      L1-dcache-store-misses                                        (23.23%)
      ​​​​​​​​            26,308      L1-icache-load-misses                                         (34.35%)
      ​​​​​​​​       254,746,423      branch-instructions                                           (45.06%)
      ​​​​​​​​           542,430      branch-misses             #    0.21% of all branches          (43.65%)
      
      ​​​​​​​​       0.283370894 seconds time elapsed
      
    • AVX prefetch(D = 8)

      ​​​​​​​​avx prefetch: 	 58723 us
      
      ​​​​​​​​Performance counter stats for './avx_prefetch_transpose':
      
      ​​​​​​​​         6,158,483      cache-misses              #   86.765 % of all cache refs      (44.96%)
      ​​​​​​​​         7,097,877      cache-references                                              (44.94%)
      ​​​​​​​​         7,500,835      L1-dcache-load-misses     #    1.76% of all L1-dcache hits    (44.94%)
      ​​​​​​​​       425,937,839      L1-dcache-loads                                               (42.50%)
      ​​​​​​​​            23,246      L1-dcache-prefetch-misses                                     (22.29%)
      ​​​​​​​​         4,833,529      L1-dcache-store-misses                                        (23.26%)
      ​​​​​​​​            29,966      L1-icache-load-misses                                         (34.40%)
      ​​​​​​​​       247,753,365      branch-instructions                                           (45.21%)
      ​​​​​​​​           515,668      branch-misses             #    0.21% of all branches          (44.58%)
      
      ​​​​​​​0.291183267 seconds time elapsed
      

      The data does not yet cover the full range of D values or a comparison of prefetch counts; these will be added later.

    • Memory Latency
      Following carolc0708's notes, I also wanted to measure the memory latency of my own machine.

      ​​​​​​​​Intel(R) Memory Latency Checker - v3.1a
      ​​​​​​​​Measuring idle latencies (in ns)...
      ​​​​​​​​    Memory node
      ​​​​​​​​Socket	     0
      ​​​​​​​​     0	  60.6
      
      ​​​​​​​​Measuring Peak Memory Bandwidths for the system
      ​​​​​​​​Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
      ​​​​​​​​Using all the threads from each core if Hyper-threading is enabled
      ​​​​​​​​Using traffic with the following read-write ratios
      ​​​​​​​​ALL Reads        :	22924.6
      ​​​​​​​​3:1 Reads-Writes :	21408.4
      ​​​​​​​​2:1 Reads-Writes :	21079.7
      ​​​​​​​​1:1 Reads-Writes :	20016.7
      ​​​​​​​​Stream-triad like:	20424.8
      
      ​​​​​​​​Measuring Memory Bandwidths between nodes within system
      ​​​​​​​​Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
      ​​​​​​​​Using all the threads from each core if Hyper-threading is enabled
      ​​​​​​​​Using Read-only traffic type
      ​​​​​​​​    Memory node
      ​​​​​​​​ Socket	     0
      ​​​​​​​​     0	22919.5
      
      ​​​​​​​​Measuring Loaded Latencies for the system
      ​​​​​​​​Using all the threads from each core if Hyper-threading is enabled
      ​​​​​​​​Using Read-only traffic type
      ​​​​​​​​Inject	Latency	Bandwidth
      ​​​​​​​​Delay	(ns)	MB/sec
      ​​​​​​​​==========================
      ​​​​​​​​ 00000	135.19	  23058.9
      ​​​​​​​​ 00002	134.82	  23053.4
      ​​​​​​​​ 00008	133.35	  23043.2
      ​​​​​​​​ 00015	132.27	  23037.6
      ​​​​​​​​ 00050	126.29	  22870.4
      ​​​​​​​​ 00100	 82.99	  20031.6
      ​​​​​​​​ 00200	 64.48	  11700.4
      ​​​​​​​​ 00300	 61.22	   8385.7
      ​​​​​​​​ 00400	 61.37	   6637.3
      ​​​​​​​​ 00500	 60.76	   5573.2
      ​​​​​​​​ 00700	 60.60	   4321.9
      ​​​​​​​​ 01000	 61.27	   3353.8
      ​​​​​​​​ 01300	 61.46	   2828.7
      ​​​​​​​​ 01700	 61.51	   2413.9
      ​​​​​​​​ 02500	 61.51	   1978.3
      ​​​​​​​​ 03500	 61.88	   1682.8
      ​​​​​​​​ 05000	 61.84	   1492.8
      ​​​​​​​​ 09000	 61.96	   1285.1
      ​​​​​​​​ 20000	 61.83	   1150.6
      
      ​​​​​​​​Measuring cache-to-cache transfer latency (in ns)...
      ​​​​​​​​Local Socket L2->L2 HIT  latency	19.1
      ​​​​​​​​Local Socket L2->L2 HITM latency	22.8
      
    • mlc results while running avx prefetch

      ​​​​​​​​Intel(R) Memory Latency Checker - v3.1a
      ​​​​​​​​Measuring idle latencies (in ns)...
      ​​​​​​​​    Memory node
      ​​​​​​​​Socket	     0
      ​​​​​​​​     0	  60.8
      
      ​​​​​​​​Measuring Peak Memory Bandwidths for the system
      ​​​​​​​​Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
      ​​​​​​​​Using all the threads from each core if Hyper-threading is enabled
      ​​​​​​​​Using traffic with the following read-write ratios
      ​​​​​​​​ALL Reads        :	22224.3
      ​​​​​​​​3:1 Reads-Writes :	20991.3
      ​​​​​​​​2:1 Reads-Writes :	20937.0
      ​​​​​​​​1:1 Reads-Writes :	19920.7
      ​​​​​​​​Stream-triad like:	20050.8
      
      ​​​​​​​​Measuring Memory Bandwidths between nodes within system
      ​​​​​​​​Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
      ​​​​​​​​Using all the threads from each core if Hyper-threading is enabled
      ​​​​​​​​Using Read-only traffic type
      ​​​​​​​​    Memory node
      ​​​​​​​​ Socket	     0
      ​​​​​​​​     0	22883.8
      
      ​​​​​​​​Measuring Loaded Latencies for the system
      ​​​​​​​​Using all the threads from each core if Hyper-threading is enabled
      ​​​​​​​​Using Read-only traffic type
      ​​​​​​​​Inject	Latency	Bandwidth
      ​​​​​​​​Delay	(ns)	MB/sec
      ​​​​​​​​==========================
      ​​​​​​​​ 00000	141.15	  22473.7
      ​​​​​​​​ 00002	136.83	  22856.0
      ​​​​​​​​ 00008	135.33	  22841.5
      ​​​​​​​​ 00015	136.96	  22582.6
      ​​​​​​​​ 00050	131.97	  22277.7
      ​​​​​​​​ 00100	 86.87	  19832.6
      ​​​​​​​​ 00200	 67.08	  11648.3
      ​​​​​​​​ 00300	 61.98	   8370.2
      ​​​​​​​​ 00400	 62.60	   6614.3
      ​​​​​​​​ 00500	 62.30	   5439.4
      ​​​​​​​​ 00700	 70.03	   3344.9
      ​​​​​​​​ 01000	 70.56	   2536.9
      ​​​​​​​​ 01300	 62.81	   2642.3
      ​​​​​​​​ 01700	 63.33	   2192.8
      ​​​​​​​​ 02500	 62.48	   1885.3
      ​​​​​​​​ 03500	 61.62	   1704.1
      ​​​​​​​​ 05000	 62.17	   1473.5
      ​​​​​​​​ 09000	 82.62	    917.5
      ​​​​​​​​ 20000	 64.81	   1086.5
      
      ​​​​​​​​Measuring cache-to-cache transfer latency (in ns)...
      ​​​​​​​​Local Socket L2->L2 HIT  latency	21.0
      ​​​​​​​​Local Socket L2->L2 HITM latency	23.2
      

      The numbers do change, but I am still working out how to fully control the test conditions.

References