# 2017q1 Homework3 (software-pipelining)
contributed by < `twzjwang` >

Assignment spec: [B05: software-pipelining](https://hackmd.io/s/rks62p1sl)
github: [twzjwang/prefetcher](https://github.com/twzjwang/prefetcher)

# Development Environment
- Ubuntu 16.04.2 LTS, Linux 4.8.0-36-generic
- CPU
  - version: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
  - Architecture: x86_64
  - CPU op-mode(s): 32-bit, 64-bit
  - Byte Order: Little Endian
  - CPU(s): 4
  - L1d cache: 32K
  - L1i cache: 32K
  - L2 cache: 256K
  - L3 cache: 3072K
- memory
  - size: 4GiB

# Reading the Paper
## When Prefetching Works, When It Doesn't, and Why
- Keys to high performance
  - hiding cache miss latency
  - properly managing memory bandwidth
- Goal of the study
  - develop a novel, foundational understanding of both the benefits and limitations of hardware and software prefetching
- What the study covers
  - source code-level analysis
  - practical pros and cons of compiler- and software-based prefetching
  - synergistic and antagonistic effects between SW (software) and HW (hardware) prefetching
  - evaluation of prefetcher training policies

## 1. INTRODUCTION
- Only relatively simple SW prefetching algorithms are used in state-of-the-art compilers (e.g. `icc`, `gcc`)
- There is no rigorous standard for where to insert prefetch intrinsics
- The complexity of the interaction between SW and HW prefetching is not well understood
- Summary
  - ![](https://i.imgur.com/Ib9u9HH.png)
  - GHB : prefetching with the Global History Buffer
  - STR : stream-based prefetcher
  - SW : compiler-inserted software prefetchers
- The benchmarks are divided into three groups by speedup
  - positive : > +5%
    - `milc`, `GemsFDTD`, `mcf`, `lbm`, `libquantum`, `gcc`, `zeusmp`
  - neutral
    - `bzip2`, `cactusADM`, `soplex`, `bwaves`
  - negative : < -5%
    - `sphinx3`, `leslie3d`
- Interactions among software and hardware prefetching yield a variety of benchmark-dependent performance behaviors
- SW prefetching generally has a positive effect when it targets
  - short array streams
  - irregular memory address patterns
  - L1 cache miss reduction
- It can also have negative effects
  - SW prefetching can interfere with the training of the HW prefetcher
  - especially when software prefetch instructions cover only part of the streams

## 2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
- Table I summarizes the access pattern of each common data structure and the SW/HW prefetching mechanisms suited to it
  - ![](https://i.imgur.com/3Pzkr0C.png)
  - arrays and Recursive Data Structures are easy to predict
  - hashing is comparatively hard to predict
  - many mechanisms have been proposed to predict complex data structures
- Terminology (cache-line accesses)
  - streams : unit-stride cache-line accesses
  - strided : access stride distances greater than two cache lines

### 2.1 Software Prefetch Intrinsics
- x86 SSE SIMD extensions
- `_mm_prefetch(char *ptr, int hint)`
  - ![](https://i.imgur.com/rve3mKs.png)

### 2.2 Prefetch Classification
- the paper illustrates this classification on a timeline
  - ![](https://i.imgur.com/m357fXc.png)
  - ![](https://i.imgur.com/SIFNlJt.png)

### 2.3 Software Prefetch Distance
- Prefetching is useful only when prefetch requests are sent early enough to fully hide memory latency
- D : prefetch distance
  - the distance ahead of which a prefetch should be requested on a memory address
  - has a significant effect on prefetcher performance
  - must be large enough to hide the latency
  - but not too large : that leads to less coverage and more cache misses
- l : prefetch latency
- s : the length of the shortest path through the loop body
- ![](https://i.imgur.com/vBGeC9G.png)

### 2.4 Direct and Indirect Memory Indexing
- memory indexing
  - direct : easily prefetched by HW
    - the memory addresses show regular stream/stride behavior
  - indirect : relatively simpler to compute in SW
    - indirect prefetch requests have a higher overhead than direct memory accesses
  - ![](https://i.imgur.com/0nXXcLs.png)

## 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
### 3.1 Benefits of Software Prefetching over Hardware Prefetching
- Large Number of Streams (Limited Hardware Resources)
  - the number of streams a stream prefetcher can track is limited by hardware
    - e.g. stream detectors and book-keeping mechanisms
  - HW prefetchers are easily confused when there are too many streams
- Short Streams
  - HW prefetchers require training time to detect the direction and distance of a stream or stride
    - at least 2 cache misses are needed to determine the direction of a stream
  - if a stream is extremely short, there will not be enough cache misses to train on
- Irregular Memory Access
  - SW can handle irregular accesses by inserting appropriate prefetch intrinsics, which would otherwise require very complex hardware structures
- Cache Locality Hint
  - the SW prefetching mechanism
    - has greater flexibility in placing data in the right cache level
    - in particular, there is a prefetch hint argument
  - HW prefetchers
    - have to apply the same cache insertion policy to every prefetch
    - or require complex hardware mechanisms to differentiate individual prefetches and place data in a smart manner
- Loop Bounds
  - SW can determine loop bounds via
    - loop unrolling
    - software pipelining
    - branch instructions
  - HW cannot
    - especially when the hardware prefetcher is aggressive (large prefetch distance and high prefetch degree)
    - this results in early or incorrect prefetch requests, consuming memory bandwidth

### 3.2 Negative Impacts of Software Prefetching
- Increased Instruction Count
  - ![](https://i.imgur.com/2h5pejf.png)
  - SW prefetches add instructions, which consume fetch and execution bandwidth and machine resources
- Static Insertion
  - the programmer or the compiler determines the data to be prefetched and the corresponding prefetch distance
  - these decisions are static and cannot adapt to runtime behavioral changes such as
    - varying memory latency
    - effective cache size
    - bandwidth
    - especially in heterogeneous architectures
  - HW : some recently proposed feedback-driven adaptive hardware prefetching mechanisms exist
  - SW : adaptive mechanisms are largely nonexistent
- Code Structure Change
  - when the number of instructions in a loop is very small
    - inserting SW prefetching instructions is difficult
    - there is not enough computation between the prefetch instructions and the demand requests
    - code structure changes, such as loop splitting, are required
  - when the instruction count grows too much, inserting SW prefetches also becomes difficult

### 3.3 Synergistic Effects when Using Software and Hardware Prefetching Together
- Handling Multiple Streams
  - together they cover more data streams
    - HW : covers regular streams
    - SW : covers irregular streams
- Positive Training
  - SW prefetch requests can help train the HW prefetcher
  - when SW prefetches are too late, a trained HW prefetcher can improve timeliness

### 3.4 Antagonistic Effects when Using Software and Hardware Prefetching Together
- Negative Training
  - software prefetch requests can slow down the hardware prefetcher's training
  - software prefetch instructions can trigger overly aggressive hardware prefetches, which results in early requests
- Harmful Software Prefetching
  - HW prefetchers are generally less accurate than SW prefetchers
  - when software prefetch requests are incorrect or too early
    - they waste cache space and memory bandwidth
    - and reduce the effectiveness of HW prefetching

## 4. EXPERIMENTAL METHODOLOGY
### 4.1 Prefetch Intrinsic Insertion Algorithm
- a prefetch candidate is any load whose L1 misses per thousand instructions (MPKI) exceeds 0.05
- the choice of prefetch distance is systematic
  - ![](https://i.imgur.com/Y5kqZ9s.png)
  - K : constant factor
  - L : average memory latency
  - IPCbench : the profiled average IPC of each benchmark
  - Wloop : average instruction count in one loop iteration
  - (K = 4 and L = 300 for all benchmarks)
  - IPCbench and Wloop come from each benchmark's profile data, so the choice of distance is effectively profile-driven

### 4.2 Simulation Methodology
- MacSim, a trace-driven and cycle-level simulator, is used
- experiments are performed on 13 memory-intensive benchmarks from SPEC CPU 2006
  - (in the remaining benchmarks, memory accesses essentially all hit in the L1 cache)
- use Pinpoints [Patil et al.
2004] to select a representative simulation region for each benchmark using the reference input set
- each benchmark is run for roughly 200M x86 instructions
- the number of function calls is kept the same, so the same section of code is simulated across different binaries

## 5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
### 5.1 Overhead and Limitations of Software Prefetching
- Instruction Overhead
  - prefetch instructions are added
  - regular instructions also increase, due to handling of indirect memory accesses and index calculations
  - the benefits outweigh the instruction overhead
- Software Prefetching Overhead
  - ![](https://i.imgur.com/5EiawEh.png)
  - cache pollution
    - caused by early or incorrect prefetches
    - its effect is small
  - bandwidth consumption
    - the evaluated machines provide enough bandwidth for single-thread applications, so this is also minor
  - memory access latency
    - software prefetching does not completely hide memory latency
  - redundant prefetch overhead
    - even with a large number of redundant prefetch instructions, the negative impact is almost negligible
  - instruction overhead
    - even when the instruction count grows noticeably, the actual time cost remains low
- The Effect of Prefetch Distance
  - ![](https://i.imgur.com/I17mtPh.png) ![](https://i.imgur.com/QsU9ZcI.png)
  - most benchmarks in the neutral and negative groups are insensitive to the prefetch distance (Group E)
  - unlike the other benchmarks, cactusADM and bwaves reach their best distance at -4, and their optimal distance zones are wide
- Static Distance vs. Machine Configuration
  - the limitation of a static prefetch distance is that the optimal distance may differ widely across machines, and the effect grows in multicore systems
  - the best-performing static distance is compared on three machine configurations (base, less-aggressive, and aggressive processors)
  - ![](https://i.imgur.com/NMj1FrO.png)
  - ![](https://i.imgur.com/c04X2yd.png)
  - left : performance of the best prefetch distance under each machine configuration
  - right : performance difference between the best distance on the baseline and on each configuration
  - most benchmarks are classified as Group E (insensitive) in Table VII
  - even if the machine configuration changes at runtime, varying the static prefetch distance has little performance effect
- Cache-Level Insertion Policy
  - an out-of-order processor can usually tolerate L1 misses, so HW prefetchers often insert prefetched blocks into the last-level cache to reduce the chance of polluting the L1 cache
  - when the L1 cache miss rate is too high for the processor to tolerate, performance drops
  - since SW prefetching is usually accurate, prefetched blocks can safely be inserted directly into the L1 cache with little risk of polluting it
  - all experiments use the T0 cache-level placement hint
    - in the two-level cache hierarchy used in the experiments, T1 and T2 behave identically
  - ![](https://i.imgur.com/FsvbTLU.png)
    - performance under different hints (levels)
  - ![](https://i.imgur.com/9lz2Y2v.png)
  - T0 outperforms T1 because inserting prefetched blocks into the L1 cache hides L1 cache misses
  - expectation : T1 and T2 hints should do well on streaming applications with almost no data reuse
  - result : T0 always performs better than the other hints

### 5.2 Effect of Using Hardware and Software Prefetching Together
- Hardware Prefetcher Training Effects
  - two HW prefetcher training policies are used to evaluate the SW/HW interaction
    - NT (SW+GHB, SW+STR) : trained by demand cache misses only, ignoring SW prefetch requests
    - Train (SW+GHB+T, SW+STR+T) : trained by both SW prefetch requests and demand cache misses
  - performance improvement of Train relative to NT for GHB and STR
    - ![](https://i.imgur.com/yN89iHU.png)
  - sphinx3 : both GHB and STR improve when trained by SW prefetches
  - milc, gcc, and soplex : both GHB and STR regress when trained by SW prefetches
  - mcf and bwaves : almost unaffected by training on SW prefetches
  - the rest : the two HW prefetchers react differently to training on SW prefetches
  - gains
    - 3~5%, from reducing late prefetching
  - losses
    - -55.9% for libquantum, and -22.5% for milc
    - milc
      - has short streams
      - software prefetching can trigger unnecessary hardware prefetches
    - libquantum
      - severe L1 miss penalties
      - SW prefetches mainly fetch into the L1 and L2 caches
      - a HW prefetcher trained by SW prefetches fetches data only into L2, so the benefit of L1 prefetching is lost and many L1 cache misses occur
    - overly aggressive prefetches
      - the prefetch distances of HW and SW are originally independent of each other
      - when SW prefetch instructions train a HW prefetcher, the HW prefetch requests can become too aggressive, resulting in many early prefetches
  - conclusion
    - the negative impact clearly degrades performance in some cases
    - some cases do see gains
    - in general it is best not to train the HW prefetcher with SW prefetch requests
- Prefetch Coverage
  - prefetch coverage : the ratio of useful prefetches to total L2 misses, shown in Figure 10(a)
  - ![](https://i.imgur.com/n4ls1QJ.png)
  - in general, the coverage of SW prefetching is higher than that of hardware prefetchers
  - the benchmarks in the neutral and negative groups show the opposite
  - conclusion : lower coverage is the main reason the neutral and negative groups fail to improve
- Prefetch Classification
  - classification of prefetches and the measured results
  - ![](https://i.imgur.com/op1t0Lf.png)
  - ![](https://i.imgur.com/9OIjbyb.png)
  - only bzip2 and soplex have > 10% early prefetches
    - a prefetch cache can mitigate the cache-pollution effect
  - mcf has 10% late prefetches
    - because the prefetch intrinsic generates a dependent load
    - removing that latency overhead raises the improvement to 9%
- even though many benchmarks contain a significant number of redundant prefetches, they have almost no negative performance impact

### 5.3 Using the GCC Compiler
- every section other than this one (5.3) is evaluated with icc
  - because gcc (more specifically gfortran) does not support intrinsics in Fortran
- here, prefetch intrinsics are inserted after compiling with gcc; the simulated regions match those used with icc, but benchmarks containing Fortran are excluded
- conclusions
  - except for mcf and lbm, gcc inserts many prefetch intrinsics of its own (icc does not)
  - gcc still cannot cover all cache misses
  - gcc's performance lies between NOSW (prefetch flag disabled) and SW
  - icc is the same as NOSW

### 5.4 Real System Experiments
- experiments on real processors : Core 2 and Nehalem
  - ![](https://i.imgur.com/qSRtVGi.png)
- the entire reference input sets are run, not just the simpointed section
  - ![](https://i.imgur.com/BystWUl.png)
  - both the simulated regions and the entire execution have a similar portion of prefetch instructions
  - except for milc, gcc, and zeusmp, the simulated regions accurately represent the whole program
- the simulated system configuration is similar to Core 2
- results
  - ![](https://i.imgur.com/W6Z4h4L.png)
  - cactusADM, soplex, bwaves, sphinx3, and leslie3d all improve slightly on the simulator but regress on the real systems
  - the positive group gains from SW prefetches
  - HW prefetchers perform better on Nehalem than on Core 2
  - due to the multiple cache levels of prefetchers in Nehalem, the benefits of software prefetching are reduced

### 5.5 Profile-Guided Optimization (PGO)
- `gcc` with the flags `-fprofile-generate` and `-fprofile-use`
- the performance gains do not come entirely from SW prefetch

### 5.6 Hardware Prefetcher for Short Streams
- short streams are a weakness of HW prefetching
- the ASD hardware prefetcher was proposed to address them
- although ASD can predict short-length streams, it still requires observing the first instances of a memory stream
- for prefetching short streams, SW prefetching is far more effective than ASD

### 5.7 Content Directed Prefetching (CDP)
- CDP is evaluated with a separate prefetch cache (CDP+PF), which prevents cache pollution
- software prefetching is more effective for irregular data structures

### 5.8 Manual Tuning
- it is not certain that SW prefetching has reached its full potential
- the benchmarks are manually and exhaustively tuned
  - excluding milc, GemsFDTD, lbm, bwaves, bzip2 (already highly optimized, or requiring significant code structure changes)
- it is more effective to remove negative intrinsics manually
- there are no significant performance changes between manual tuning and the algorithm in Section 4.1, especially in the positive group

### 5.9 Summary of New Findings
1. SW prefetching is frequently more effective for short streams
2. The SW prefetching distance (as long as it exceeds the minimum distance) is relatively insensitive to the hardware configuration
3. L1 cache misses are usually tolerated by out-of-order processors, but once the L1 cache miss rate exceeds 20%, reducing L1 misses by prefetching into the L1 cache becomes more effective
4. The extra SW prefetch instructions do not add much execution time
5. Training the HW prefetcher with SW prefetches may improve performance, but may also severely degrade it, so it must be done judiciously if at all

## 6. CASE STUDIES: SOURCE CODE ANALYSIS OF INDIVIDUAL BENCHMARKS
- explains the reasons for the observed software prefetching behaviors using source code analysis

## 7. RELATED WORK
- reducing software prefetching overhead
- hardware and software cooperative prefetching

## 8. CONCLUSION AND FUTURE WORK
- quantitatively confirms the following conclusions
  - having to select a static prefetch distance is generally considered one of the limitations of more aggressive software prefetching
  - we should expect to use software-based prefetching techniques more
  - short streams and indirect references => good candidates for using software prefetching
- strong stream behavior or regular stride patterns => no clear improvement, or even regressions
- future work : software prefetch algorithms that decide how to train hardware prefetchers
  - reducing the negative interactions between software and hardware prefetching schemes => compilers can insert prefetch instructions more aggressively

# Before the Experiments
## `naive_transpose`
- transposes one element at a time
- SRC~xy~ => DST~yx~

## `sse_transpose`
- ![](https://i.imgur.com/Wb2MZGL.png)
- [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
- the examples below use the matrix
  - S~00~ S~01~ S~02~ S~03~
  - S~10~ S~11~ S~12~ S~13~
  - S~20~ S~21~ S~22~ S~23~
  - S~30~ S~31~ S~32~ S~33~
- `_mm_loadu_si128`
  - loads 128 bits of integer data from memory into dst
  - mem_addr does not need to be aligned on any particular boundary
  - e.g. `__m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));`
    => I0 : S~00~ S~01~ S~02~ S~03~
- `_mm_unpacklo_epi32`
  - unpacks and interleaves 32-bit integers from the low half of a and b, and stores the results in dst
```=
INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[31:0]
    dst[63:32] := src2[31:0]
    dst[95:64] := src1[63:32]
    dst[127:96] := src2[63:32]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0])
```
  - e.g. `__m128i T0 = _mm_unpacklo_epi32(I0, I1);`
    I0 : S~00~ S~01~ S~02~ S~03~
    I1 : S~10~ S~11~ S~12~ S~13~
    => T0 : S~00~ S~10~ S~01~ S~11~
- `_mm_unpackhi_epi32`
  - unpacks and interleaves 32-bit integers from the high half of a and b, and stores the results in dst
```=
INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){
    dst[31:0] := src1[95:64]
    dst[63:32] := src2[95:64]
    dst[95:64] := src1[127:96]
    dst[127:96] := src2[127:96]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0])
```
  - e.g. `__m128i T2 = _mm_unpackhi_epi32(I0, I1);`
    I0 : S~00~ S~01~ S~02~ S~03~
    I1 : S~10~ S~11~ S~12~ S~13~
    => T2 : S~02~ S~12~ S~03~ S~13~
- `_mm_unpacklo_epi64`
  - unpacks and interleaves 64-bit integers from the low half of a and b, and stores the results in dst
```=
INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[63:0]
    dst[127:64] := src2[63:0]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0])
```
  - e.g. `I0 = _mm_unpacklo_epi64(T0, T1);`
    T0 : S~00~ S~10~ S~01~ S~11~
    T1 : S~20~ S~30~ S~21~ S~31~
    => I0 : S~00~ S~10~ S~20~ S~30~
- `_mm_unpackhi_epi64`
  - unpacks and interleaves 64-bit integers from the high half of a and b, and stores the results in dst
```=
INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){
    dst[63:0] := src1[127:64]
    dst[127:64] := src2[127:64]
    RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0])
```
  - e.g. `I1 = _mm_unpackhi_epi64(T0, T1);`
    T0 : S~00~ S~10~ S~01~ S~11~
    T1 : S~20~ S~30~ S~21~ S~31~
    => I1 : S~01~ S~11~ S~21~ S~31~
- `_mm_storeu_si128`
  - stores 128 bits of integer data from a into memory
  - mem_addr does not need to be aligned on any particular boundary

## `sse_prefetch_transpose`
- adds the following 5 lines to `sse_transpose`
```C=
#define PFDIST 8
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
```
- `_mm_prefetch`
  - fetches the line of data from memory that contains address p into a location in the cache hierarchy specified by the locality hint i

# Experiments
- original version
- results
  - execution time : naive_transpose > sse_transpose > sse_prefetch_transpose
```
$ make test_exetime
...
naive_transpose : 278009 us
sse_transpose : 131516 us
sse_prefetch_transpose : 69849 us
```
  - cache misses show no clear difference
```
$ make test_perf
Performance counter stats for './main NAI' (5 runs):
      1887,6889 cache-misses   # 93.942 % of all cache refs ( +- 0.12% )
      2009,4218 cache-references                             ( +- 0.03% )
   14,4853,4306 instructions   # 1.15 insn per cycle        ( +- 0.00% )
   12,5997,0514 cycles                                       ( +- 0.65% )
    0.480518521 seconds time elapsed                         ( +- 0.43% )

Performance counter stats for './main SSE' (5 runs):
       690,4713 cache-misses   # 91.452 % of all cache refs ( +- 1.94% )
       755,0107 cache-references                             ( +- 0.34% )
   12,3654,5367 instructions   # 1.36 insn per cycle        ( +- 0.06% )
    9,0683,3447 cycles                                       ( +- 2.19% )
    0.350957360 seconds time elapsed                         ( +- 2.30% )

Performance counter stats for './main SSE_PF' (5 runs):
       685,7546 cache-misses   # 89.687 % of all cache refs ( +- 1.64% )
       764,6069 cache-references                             ( +- 0.94% )
   12,8263,5856 instructions   # 1.66 insn per cycle        ( +- 0.00% )
    7,7312,5336 cycles                                       ( +- 3.19% )
    0.300050426 seconds time elapsed                         ( +- 4.46% )
```
- interpretation
  - sse_transpose is faster than naive_transpose because SIMD does more work per instruction
  - sse_prefetch_transpose is faster than sse_transpose : the cache-miss rates are close, but SW prefetching hides part of the memory latency

# Experiment 1
- 8*8 matrix transpose using AVX
- method
  - reference : [Transpose an 8x8 float using AVX/AVX2](http://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2)
  - ![](https://i.imgur.com/JW1Bc3t.png) => ![](https://i.imgur.com/AQVYHrC.png) => ![](https://i.imgur.com/KeSXyH1.png) => ![](https://i.imgur.com/bzjrv2s.png) => ![](https://i.imgur.com/a3xyQT0.png)
- code
  - the stores use `_mm256_storeu_ps` rather than `_mm256_store_ps` because the former has weaker alignment requirements; the difference:
    - `_mm256_storeu_ps` : mem_addr does not need to be aligned on any particular boundary
    - `_mm256_store_ps` : mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated
```C=
void sse_transpose_256(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            __m256 I0 = _mm256_loadu_ps((const float *)(src + (y + 0) * w + x));
            __m256 I1 = _mm256_loadu_ps((const float *)(src + (y + 1) * w + x));
            __m256 I2 = _mm256_loadu_ps((const float *)(src + (y + 2) * w + x));
            __m256 I3 = _mm256_loadu_ps((const float *)(src + (y + 3) * w + x));
            __m256 I4 = _mm256_loadu_ps((const float *)(src + (y + 4) * w + x));
            __m256 I5 = _mm256_loadu_ps((const float *)(src + (y + 5) * w + x));
            __m256 I6 = _mm256_loadu_ps((const float *)(src + (y + 6) * w + x));
            __m256 I7 = _mm256_loadu_ps((const float *)(src + (y + 7) * w + x));
            __m256 T0 = _mm256_unpacklo_ps(I0, I1);
            __m256 T1 = _mm256_unpackhi_ps(I0, I1);
            __m256 T2 = _mm256_unpacklo_ps(I2, I3);
            __m256 T3 = _mm256_unpackhi_ps(I2, I3);
            __m256 T4 = _mm256_unpacklo_ps(I4, I5);
            __m256 T5 = _mm256_unpackhi_ps(I4, I5);
            __m256 T6 = _mm256_unpacklo_ps(I6, I7);
            __m256 T7 = _mm256_unpackhi_ps(I6, I7);
            I0 = _mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(1,0,1,0));
            I1 = _mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(3,2,3,2));
            I2 = _mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(1,0,1,0));
            I3 = _mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(3,2,3,2));
            I4 = _mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(1,0,1,0));
            I5 = _mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(3,2,3,2));
            I6 = _mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(1,0,1,0));
            I7 = _mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(3,2,3,2));
            T0 = _mm256_permute2f128_ps(I0, I4, 0x20);
            T1 = _mm256_permute2f128_ps(I1, I5, 0x20);
            T2 = _mm256_permute2f128_ps(I2, I6, 0x20);
            T3 = _mm256_permute2f128_ps(I3, I7, 0x20);
            T4 = _mm256_permute2f128_ps(I0, I4, 0x31);
            T5 = _mm256_permute2f128_ps(I1, I5, 0x31);
            T6 = _mm256_permute2f128_ps(I2, I6, 0x31);
            T7 = _mm256_permute2f128_ps(I3, I7, 0x31);
            _mm256_storeu_ps((float *)(dst + ((x + 0) * h) + y), T0);
            _mm256_storeu_ps((float *)(dst + ((x + 1) * h) + y), T1);
            _mm256_storeu_ps((float *)(dst + ((x + 2) * h) + y), T2);
            _mm256_storeu_ps((float *)(dst + ((x + 3) * h) + y), T3);
            _mm256_storeu_ps((float *)(dst + ((x + 4) * h) + y), T4);
            _mm256_storeu_ps((float *)(dst + ((x + 5) * h) + y), T5);
            _mm256_storeu_ps((float *)(dst + ((x + 6) * h) + y), T6);
            _mm256_storeu_ps((float *)(dst + ((x + 7) * h) + y), T7);
        }
    }
}
```
- results
  - time : AVX (8*8, 256 bits) is faster than SSE (4*4, 128 bits)
```
$ make test_exetime
naive_transpose : 275393 us
sse_transpose : 130499 us
sse_prefetch_transpose : 97015 us
AVX_transpose : 101904 us
```
  - cache misses are slightly lower than with SSE
```
perf stat --repeat 5 \
-e cache-misses,cache-references,instructions,cycles \
./main AVX
Performance counter stats for './main AVX' (5 runs):
       555,2161 cache-misses   # 85.753 % of all cache refs ( +- 0.14% )
       647,4611 cache-references                             ( +- 0.04% )
   11,4055,0914 instructions   # 1.55 insn per cycle        ( +- 0.00% )
    7,3811,8569 cycles                                       ( +- 0.44% )
    0.286325584 seconds time elapsed                         ( +- 0.73% )
```
- verifying correctness
  - 8*8 matrix
```
$ make verify
cc -msse2 -mavx2 --std gnu99 -O0 -Wall -Wextra verify.c -o verify
./verify 8
verify 8 x 8 matrix transpose
testin
 0  1  2  3  4  5  6  7
 8  9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
expected
 0  8 16 24 32 40 48 56
 1  9 17 25 33 41 49 57
 2 10 18 26 34 42 50 58
 3 11 19 27 35 43 51 59
 4 12 20 28 36 44 52 60
 5 13 21 29 37 45 53 61
 6 14 22 30 38 46 54 62
 7 15 23 31 39 47 55 63
naive_transpose : correct
...
sse_transpose : correct
...
sse_prefetch_transpose : correct
...
AVX_transpose : correct
...
```
- other sizes (n*n)
  - naive_transpose : works for any matrix size
  - sse_transpose (including the prefetch version) : n must be a multiple of 4
  - avx_transpose : n must be a multiple of 8

# Experiment 2
- 8*8 matrix transpose using AVX + SW prefetch

## Adding `_mm_prefetch` following the SSE version
```C=
void avx_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
#define AVX_PFDIST 8
            _mm_prefetch(src+(y + AVX_PFDIST + 0) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + AVX_PFDIST + 1) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + AVX_PFDIST + 2) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + AVX_PFDIST + 3) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + AVX_PFDIST + 4) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + AVX_PFDIST + 5) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + AVX_PFDIST + 6) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + AVX_PFDIST + 7) *w + x, _MM_HINT_T1);
            __m256 I0 = _mm256_loadu_ps((const float *)(src + (y + 0) * w + x));
            __m256 I1 = _mm256_loadu_ps((const float *)(src + (y + 1) * w + x));
            __m256 I2 = _mm256_loadu_ps((const float *)(src + (y + 2) * w + x));
            __m256 I3 = _mm256_loadu_ps((const float *)(src + (y + 3) * w + x));
            __m256 I4 = _mm256_loadu_ps((const float *)(src + (y + 4) * w + x));
            __m256 I5 = _mm256_loadu_ps((const float *)(src + (y + 5) * w + x));
            __m256 I6 = _mm256_loadu_ps((const float *)(src + (y + 6) * w + x));
            __m256 I7 = _mm256_loadu_ps((const float *)(src + (y + 7) * w + x));
            __m256 T0 = _mm256_unpacklo_ps(I0, I1);
            __m256 T1 = _mm256_unpackhi_ps(I0, I1);
            __m256 T2 = _mm256_unpacklo_ps(I2, I3);
            __m256 T3 = _mm256_unpackhi_ps(I2, I3);
            __m256 T4 = _mm256_unpacklo_ps(I4, I5);
            __m256 T5 = _mm256_unpackhi_ps(I4, I5);
            __m256 T6 = _mm256_unpacklo_ps(I6, I7);
            __m256 T7 = _mm256_unpackhi_ps(I6, I7);
            I0 = _mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(1,0,1,0));
            I1 = _mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(3,2,3,2));
            I2 = _mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(1,0,1,0));
            I3 = _mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(3,2,3,2));
            I4 = _mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(1,0,1,0));
            I5 = _mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(3,2,3,2));
            I6 = _mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(1,0,1,0));
            I7 = _mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(3,2,3,2));
            T0 = _mm256_permute2f128_ps(I0, I4, 0x20);
            T1 = _mm256_permute2f128_ps(I1, I5, 0x20);
            T2 = _mm256_permute2f128_ps(I2, I6, 0x20);
            T3 = _mm256_permute2f128_ps(I3, I7, 0x20);
            T4 = _mm256_permute2f128_ps(I0, I4, 0x31);
            T5 = _mm256_permute2f128_ps(I1, I5, 0x31);
            T6 = _mm256_permute2f128_ps(I2, I6, 0x31);
            T7 = _mm256_permute2f128_ps(I3, I7, 0x31);
            _mm256_storeu_ps((float *)(dst + ((x + 0) * h) + y), T0);
            _mm256_storeu_ps((float *)(dst + ((x + 1) * h) + y), T1);
            _mm256_storeu_ps((float *)(dst + ((x + 2) * h) + y), T2);
            _mm256_storeu_ps((float *)(dst + ((x + 3) * h) + y), T3);
            _mm256_storeu_ps((float *)(dst + ((x + 4) * h) + y), T4);
            _mm256_storeu_ps((float *)(dst + ((x + 5) * h) + y), T5);
            _mm256_storeu_ps((float *)(dst + ((x + 6) * h) + y), T6);
            _mm256_storeu_ps((float *)(dst + ((x + 7) * h) + y), T7);
        }
    }
}
```

## Choosing the distance
- averaged over 50 runs
- distance = 8 is chosen

distance |0|2|4|6|8|10|12|14|16|18|20|
---------|-|-|-|-|-|--|--|--|--|--|--|
execute time(us)|76279|75820|76703|75968|74315|76408|76005|76624|77762|77839|78046|

- results
  - SSE reduces execution time to 48% of naive
  - AVX reduces execution time to 29% of naive
  - with SW prefetch, SSE runs in 52% of the plain SSE time
  - with SW prefetch, AVX runs in 93% of the plain AVX time
```
naive_transpose : 275385 us
sse_transpose : 131880 us
sse_prefetch_transpose : 69229 us
avx_transpose : 79326 us
avx_prefetch_transpose : 73965 us
```

## Moving the `_mm_prefetch` instructions
- move `_mm_prefetch` to after the `_mm256_loadu_ps` loads (other positions were tried; placing it after `_mm256_loadu_ps` gave the shortest execution time)
```C=
...
_mm_prefetch(src+(y + AVX_PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 3) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 4) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 5) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 6) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 7) *w + x, _MM_HINT_T1);
__m256 I0 = _mm256_loadu_ps((const float *)(src + (y + 0) * w + x));
__m256 I1 = _mm256_loadu_ps((const float *)(src + (y + 1) * w + x));
__m256 I2 = _mm256_loadu_ps((const float *)(src + (y + 2) * w + x));
__m256 I3 = _mm256_loadu_ps((const float *)(src + (y + 3) * w + x));
__m256 I4 = _mm256_loadu_ps((const float *)(src + (y + 4) * w + x));
__m256 I5 = _mm256_loadu_ps((const float *)(src + (y + 5) * w + x));
__m256 I6 = _mm256_loadu_ps((const float *)(src + (y + 6) * w + x));
__m256 I7 = _mm256_loadu_ps((const float *)(src + (y + 7) * w + x));
...
```
changed to
```C=
...
__m256 I0 = _mm256_loadu_ps((const float *)(src + (y + 0) * w + x));
__m256 I1 = _mm256_loadu_ps((const float *)(src + (y + 1) * w + x));
__m256 I2 = _mm256_loadu_ps((const float *)(src + (y + 2) * w + x));
__m256 I3 = _mm256_loadu_ps((const float *)(src + (y + 3) * w + x));
__m256 I4 = _mm256_loadu_ps((const float *)(src + (y + 4) * w + x));
__m256 I5 = _mm256_loadu_ps((const float *)(src + (y + 5) * w + x));
__m256 I6 = _mm256_loadu_ps((const float *)(src + (y + 6) * w + x));
__m256 I7 = _mm256_loadu_ps((const float *)(src + (y + 7) * w + x));
_mm_prefetch(src+(y + AVX_PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 3) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 4) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 5) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 6) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 7) *w + x, _MM_HINT_T1);
...
```
- results
  - execution time drops from 73965 us to 60779 us (82% of before)
```
avx_prefetch_transpose : 60779 us
```
  - the cache-miss rate actually increases, from 78% to 85%
```
Performance counter stats for './main AVX_PF' (50 runs):
       557,2321 cache-misses   # 77.903 % of all cache refs ( +- 0.04% )
       715,2936 cache-references                             ( +- 0.03% )
   11,6349,6212 instructions   # 1.67 insn per cycle        ( +- 0.00% )
    6,9523,9092 cycles                                       ( +- 0.06% )
    0.272111501 seconds time elapsed                         ( +- 0.33% )

Performance counter stats for './main AVX_PF' (50 runs):
       554,6242 cache-misses   # 85.417 % of all cache refs ( +- 0.04% )
       649,3141 cache-references                             ( +- 0.02% )
   11,6353,4796 instructions   # 1.75 insn per cycle        ( +- 0.00% )
    6,6666,4729 cycles                                       ( +- 0.07% )
    0.258250260 seconds time elapsed                         ( +- 0.32% )
```
- the same change applied to the SSE version
  - execution time drops from 69229 us to 65405 us (94% of before)
  - the cache-miss rate stays around 87.7% in both cases
```
Performance counter stats for './main SSE_PF' (50 runs):
       659,9302 cache-misses   # 87.709 % of all cache refs ( +- 0.03% )
       752,4061 cache-references                             ( +- 0.03% )
   12,8257,3532 instructions   # 1.85 insn per cycle        ( +- 0.00% )
    6,9223,9566 cycles                                       ( +- 0.13% )
    0.266634360 seconds time elapsed                         ( +- 0.32% )

Performance counter stats for './main SSE_PF' (50 runs):
       659,4715 cache-misses   # 87.692 % of all cache refs ( +- 0.03% )
       752,0292 cache-references                             ( +- 0.02% )
   12,8256,2815 instructions   # 1.85 insn per cycle        ( +- 0.00% )
    6,9168,6504 cycles                                       ( +- 0.13% )
    0.268256938 seconds time elapsed                         ( +- 0.33% )
```
- prefetch hits
  - ![](https://i.imgur.com/LJL60D6.png)
  - `$ perf stat -e r014c ./main SSE_PF`
- results
  - SW prefetch hits increase about 8x
  - HW prefetch hits stay about the same

instruction position|before|after
---------------|-|-
SW prefetch hit|16,7489|133,0769
HW prefetch hit|42,1308|41,7050

# Reference
- [When Prefetching Works, When It Doesn't, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
- [Key points and commentary on the paper "When Prefetching Works, When It Doesn't, and Why"](https://hackmd.io/s/HJtfT3icx)
- [Cache prefetching](https://en.wikipedia.org/wiki/Cache_prefetching)
- [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
- [IntrinsicsGuide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)
- [Performance of SSE and AVX Instruction Sets](https://arxiv.org/pdf/1211.0820.pdf)
- [Transpose an 8x8 float using AVX/AVX2](http://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2)

###### tags: `twzjwang`