# 2017q1 Homework3 (software-pipelining)
contributed by < `twzjwang` >
Assignment description: [B05: software-pipelining](https://hackmd.io/s/rks62p1sl)
github: [twzjwang/prefetcher](https://github.com/twzjwang/prefetcher)
# Development Environment
- OS
    - Ubuntu 16.04.2 LTS
    - Linux 4.8.0-36-generic
- CPU
    - Model: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
    - Architecture: x86_64
    - CPU op-mode(s): 32-bit, 64-bit
    - Byte Order: Little Endian
    - CPU(s): 4
    - L1d cache: 32K
    - L1i cache: 32K
    - L2 cache: 256K
    - L3 cache: 3072K
- Memory
    - size: 4GiB
# Reading the Paper
## When Prefetching Works, When It Doesn’t, and Why
- Keys to high performance
    - tolerating ever-increasing cache miss latency
    - properly managing memory bandwidth
- Goal of the study
    - develop a novel, foundational understanding of both the benefits and limitations of hardware and software prefetching
- What the study covers
    - source code-level analysis
    - practical strengths and weaknesses of compiler- and software-based prefetching
    - synergistic and antagonistic effects between SW (software) and HW (hardware) prefetching
    - evaluation of prefetch training policies
## 1. INTRODUCTION
- Only relatively simple SW prefetching algorithms are employed in state-of-the-art compilers (e.g., icc, gcc)
- There is no rigorous standard for when and how to insert prefetch intrinsics
- The complexity of the interactions between SW and HW prefetching is not well understood
- Summary
- ![](https://i.imgur.com/Ib9u9HH.png)
- GHB : Prefetching with the Global History Buffer
- STR : stream-based prefetcher
- SW : compiler inserted software prefetchers
- Benchmarks are classified into three groups by speedup
    - positive : > +5%
        - `milc`, `GemsFDTD`, `mcf`, `lbm`, `libquantum`, `gcc`, `zeusmp`
    - neutral : between -5% and +5%
        - `bzip2`, `cactusADM`, `soplex`, `bwaves`
    - negative : < -5%
        - `sphinx3`, `leslie3d`
- Interactions among software and hardware prefetching yield a variety of benchmark-dependent performance behaviors
- SW prefetching has a generally positive effect when it targets
    - short array streams
    - irregular memory address patterns
    - L1 cache miss reduction
- It can also have negative effects
    - SW prefetching can interfere with the training of the HW prefetcher
    - especially when software prefetch instructions cover only part of a stream
## 2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
- Table I summarizes the access patterns associated with common data structures and the SW/HW prefetching mechanisms suited to each
- ![](https://i.imgur.com/3Pzkr0C.png)
- Arrays and Recursive Data Structures are easy to predict
- hashing is comparatively hard to predict
- many mechanisms have been proposed to predict accesses into complex data structures
- Terminology (cache-line accesses)
    - streams : unit-stride cache-line accesses
    - strided : access stride distances greater than two cache lines
### 2.1 Software Prefetch Intrinsics
- x86 SSE SIMD extensions
- `_mm_prefetch(char *ptr, int hint)`
- ![](https://i.imgur.com/rve3mKs.png)
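As a concrete illustration of the intrinsic, here is a minimal sketch of using it in a loop (the array, the loop, and the distance of 16 are invented for this example):

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Sum an array while prefetching 16 elements ahead into all cache
 * levels (hint T0). The distance of 16 is purely illustrative. */
float sum_with_prefetch(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            _mm_prefetch((const char *)(a + i + 16), _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```

The hint argument selects the target cache level, as shown in the table above.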
### 2.2 Prefetch Classification
- the paper illustrates this classification on a timeline
- ![](https://i.imgur.com/m357fXc.png)
- ![](https://i.imgur.com/SIFNlJt.png)
### 2.3 Software Prefetch Distance
- Prefetching is only useful when prefetch requests are sent early enough to fully hide the memory latency
- D : prefetch distance
    - the distance (in iterations) ahead of the demand access at which a prefetch should be requested
    - has a significant effect on prefetcher performance
    - must be large enough to hide the latency
    - but not too large, which leads to less coverage and more cache misses
- l : prefetch latency
- s : the length of the shortest path through the loop body
- ![](https://i.imgur.com/vBGeC9G.png)
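Written out with the symbols above, the constraint in the figure says the distance (in loop iterations) must at least cover the prefetch latency:

```latex
D \ge \left\lceil \frac{l}{s} \right\rceil
```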
### 2.4 Direct and Indirect Memory Indexing
- memory indexing
    - direct : easily prefetched by HW
        - the memory addresses show regular stream/stride behavior
    - indirect : relatively simpler to compute in SW
        - indirect prefetch requests have a higher overhead than direct memory accesses
- ![](https://i.imgur.com/0nXXcLs.png)
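An indirect access pattern of the `a[b[i]]` kind can be prefetched in software once the index is known ahead of time; a minimal sketch (function name and distance of 16 are invented for illustration):

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Sum a[b[i]]: the index array b is streamed (direct), while a is
 * accessed indirectly. Software can prefetch a[b[i + 16]] because
 * b[i + 16] is already available; hardware cannot easily predict it. */
long sum_indirect(const int *a, const int *b, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            _mm_prefetch((const char *)&a[b[i + 16]], _MM_HINT_T0);
        s += a[b[i]];
    }
    return s;
}
```

This also shows why indirect prefetches carry extra overhead: computing the prefetch address requires an additional load of `b[i + 16]`.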
## 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
### 3.1 Benefits of Software Prefetching over Hardware Prefetching
- Large Number of Streams (Limited Hardware Resources)
    - the number of streams a stream prefetcher can track is limited by hardware resources
        - e.g., stream detectors and book-keeping mechanisms
    - HW prefetchers are easily confused when there are too many streams
- Short Streams
    - HW prefetchers require training time to detect the direction and distance of a stream or stride
        - at least 2 cache misses are needed to determine the direction of a stream
        - if the stream is extremely short, there will not be enough cache misses to train on
- Irregular Memory Access
    - SW prefetching handles these by inserting appropriate prefetch intrinsics, which usually requires analyzing fairly complex data structures
- Cache Locality Hint
    - SW prefetching
        - greater flexibility in placing data in the right cache level
        - in particular, a prefetch hint can be supplied per instruction
    - HW prefetchers
        - have to apply the same cache insertion policy to all prefetches
        - or require complex hardware mechanisms to differentiate individual prefetches and place data in a smart manner
- Loop Bounds
    - SW can respect loop bounds via
        - loop unrolling
        - software pipelining
        - branch instructions
    - this is not possible in HW
        - especially when the hardware prefetcher is aggressive (large prefetch distance and high prefetch degree)
        - which results in early or incorrect prefetch requests
        - consuming memory bandwidth
### 3.2 Negative Impacts of Software Prefetching
- Increased Instruction Count
    - ![](https://i.imgur.com/2h5pejf.png)
    - SW prefetch instructions
        - increase the instruction count
        - consume fetch and execution bandwidth and machine resources
- Static Insertion
    - the programmer or the compiler determines the data to be prefetched and the corresponding prefetch distance
    - these decisions are static and cannot adapt to runtime behavioral changes such as
        - varying memory latency
        - effective cache size
        - bandwidth
        - especially in heterogeneous architectures
    - HW : there are some recently proposed feedback-driven adaptive hardware prefetching mechanisms
    - SW : adaptive mechanisms are largely nonexistent
- Code Structure Change
    - when the number of instructions in a loop is very small
        - inserting SW prefetch instructions is difficult
        - because there is not enough computation between the prefetch instructions and the demand requests
    - code structure changes are required
        - e.g., loop splitting
    - when the instruction count grows too much, inserting SW prefetches becomes difficult
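One common code structure change can be sketched as follows: split the loop so the main body prefetches ahead while a short epilogue handles the tail (function name and the distance of 16 are invented for this example):

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Loop splitting: the main loop prefetches D elements ahead; the
 * epilogue finishes the last D iterations without prefetching, so no
 * out-of-bounds prefetch addresses are ever formed. */
void scale_by_two(float *a, int n)
{
    enum { D = 16 };            /* illustrative prefetch distance */
    int i = 0;
    for (; i + D < n; i++) {    /* main loop: prefetch ahead */
        _mm_prefetch((const char *)&a[i + D], _MM_HINT_T0);
        a[i] *= 2.0f;
    }
    for (; i < n; i++)          /* epilogue: no prefetch */
        a[i] *= 2.0f;
}
```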
### 3.3 Synergistic Effects when Using Software and Hardware Prefetching Together
- Handling Multiple Streams
    - together they cover more data streams
    - HW : covers regular streams
    - SW : covers irregular streams
- Positive Training
    - SW prefetch requests can help train the HW prefetcher
    - when SW prefetches are too late, a trained HW prefetcher can improve timeliness
### 3.4 Antagonistic Effects when Using Software and Hardware Prefetching Together
- Negative Training
    - software prefetch requests can slow down the hardware prefetcher's training
    - software prefetch instructions can trigger overly aggressive hardware prefetches, which results in early requests
- Harmful Software Prefetching
    - HW prefetchers are less accurate than SW prefetchers
    - when software prefetch requests are incorrect or early, they
        - increase cache and memory bandwidth pressure
        - reduce the effectiveness of HW prefetching
## 4. EXPERIMENTAL METHODOLOGY
### 4.1 Prefetch Intrinsic Insertion Algorithm
- A prefetch candidate is any load whose L1 misses per thousand instructions (MPKI) exceed 0.05
- the choice of prefetch distance is systematic
    - ![](https://i.imgur.com/Y5kqZ9s.png)
    - K : constant factor
    - L : average memory latency
    - IPCbench : the profiled average IPC of each benchmark
    - Wloop : average instruction count in one loop iteration
    - (K = 4 and L = 300 are used for all benchmarks)
    - IPCbench and Wloop are determined from each benchmark's profile data, so the choice of distance is effectively profile-driven
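Based on these definitions, the distance formula in the figure can be reconstructed as follows: `Wloop / IPCbench` is the cycle count of one loop iteration, so the memory latency `L` (scaled by the safety factor `K`) divided by it gives the number of iterations to prefetch ahead:

```latex
D = \left\lceil \frac{K \times L}{W_{loop} / IPC_{bench}} \right\rceil
```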
### 4.2 Simulation Methodology
- uses MacSim, a trace-driven and cycle-level simulator
- experiments are performed on 13 memory-intensive benchmarks from SPEC CPU 2006
    - the remaining benchmarks are excluded because most of their memory accesses hit in the L1 cache
- Pinpoints [Patil et al. 2004] is used to select a representative simulation region for each benchmark with the reference input set
- each benchmark is run for roughly 200M x86 instructions
- the number of function calls is kept the same, so that the same section of the code is simulated across different binaries
## 5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
### 5.1 Overhead and Limitations of Software Prefetching
- Instruction Overhead
    - prefetch instructions are added
    - regular instructions also increase
        - due to the handling of indirect memory accesses and index calculations
    - the benefits outweigh the instruction overhead
- Software Prefetching Overhead
    - ![](https://i.imgur.com/5EiawEh.png)
    - cache pollution
        - caused by early or incorrect prefetches
        - its impact is small
    - bandwidth consumption
        - the experimental machines provide enough bandwidth for single-threaded applications
        - so the impact is likewise small
    - memory access latency
        - software prefetching does not completely hide memory latency
    - redundant prefetch overhead
        - even with a large number of redundant prefetch instructions, the negative impact is almost negligible
    - instruction overhead
        - even when the instruction count increases noticeably, the actual time cost remains low
- The Effect of Prefetch Distance
    - ![](https://i.imgur.com/I17mtPh.png)
    ![](https://i.imgur.com/QsU9ZcI.png)
    - most benchmarks from the neutral and negative groups are insensitive to the prefetch distance (Group E)
    - cactusADM and bwaves differ from the other benchmarks: they reach their best performance at distance = -4, and their optimal distance zones are wide
- Static Distance vs. Machine Configuration
    - a limitation of static prefetch distances is that the optimal distance may vary greatly across machines, an effect amplified on multicore systems
    - the best-performing static distance is compared across three machine configurations (base, less-aggressive, and aggressive processors)
    - ![](https://i.imgur.com/NMj1FrO.png)
    - ![](https://i.imgur.com/c04X2yd.png)
    - left : performance of the best prefetch distance under each machine configuration
    - right : performance difference between the best distance on the baseline and the best distance on each configuration
    - most benchmarks are classified as Group E (insensitive) in Table VII
    - even if the machine configuration changes at runtime, varying the static prefetch distance has little visible effect on performance
- Cache-Level Insertion Policy
    - out-of-order processors can usually tolerate L1 misses, so HW prefetchers commonly insert prefetched blocks into the last-level cache to reduce the chance of polluting the L1 cache
    - when the L1 cache miss rate is too high for the processor to tolerate, performance degrades
    - since SW prefetching is usually accurate, prefetched blocks can safely be inserted directly into the L1 cache with little risk of polluting it
    - all experiments use the T0 cache-level placement hint
        - under the two-level cache hierarchy used in the experiments, T1 and T2 behave identically
    - ![](https://i.imgur.com/FsvbTLU.png)
    - performance under different hints (levels)
    - ![](https://i.imgur.com/9lz2Y2v.png)
    - T0 outperforms T1 because inserting prefetched blocks into the L1 cache hides L1 cache misses
    - expectation : T1 and T2 hints should do well for streaming applications with almost no data reuse
    - result : T0 always performs better than the other hints
### 5.2 Effect of Using Hardware and Software Prefetching Together
- Hardware Prefetcher Training Effects
    - two HW prefetcher training policies are used to evaluate the mutual influence of SW and HW prefetching
        - NT (SW+GHB, SW+STR)
            - the HW prefetcher is trained by demand cache misses only, ignoring SW prefetch requests
        - Train (SW+GHB+T, SW+STR+T)
            - trained by both SW prefetch requests and demand cache misses
    - performance improvement of Train over NT for GHB and STR
    - ![](https://i.imgur.com/yN89iHU.png)
    - sphinx3 : both GHB and STR improve when trained by SW prefetches
    - milc, gcc, and soplex : both GHB and STR degrade when trained by SW prefetches
    - mcf and bwaves : almost unaffected by training with SW prefetches
    - the rest : the two HW prefetchers react differently to training with SW prefetches
    - positive cases
        - 3~5%
        - from reducing late prefetching
    - negative cases
        - −55.9% for libquantum, and −22.5% for milc
        - milc :
            - has short streams
            - software prefetching can trigger unnecessary hardware prefetches
        - libquantum :
            - severe L1 miss penalties
            - SW prefetches mainly bring data into the L1 and L2 caches
            - when the HW prefetcher is trained by SW prefetches, it only prefetches data into L2, losing the benefit of L1 prefetching and causing many L1 cache misses
    - overly aggressive prefetches
        - prefetch distances in HW and SW are normally independent of each other
        - when SW prefetch instructions train a HW prefetcher, HW prefetch requests can become too aggressive
        - resulting in many early prefetches
    - conclusions
        - the negative impact can degrade performance significantly
        - some results do show gains
        - in general it is best not to train the HW prefetcher with SW prefetch requests
- Prefetch Coverage
    - prefetch coverage : the ratio of useful prefetches to total L2 misses, shown in Figure 10(a)
    - ![](https://i.imgur.com/n4ls1QJ.png)
    - in general, the coverage of SW prefetching is higher than that of the hardware prefetchers
    - the opposite holds for benchmarks in the neutral and negative groups
    - conclusion : lower coverage is the main reason the neutral and negative groups fail to improve
- Prefetch Classification
    - classification of prefetches and the experimental results
    - ![](https://i.imgur.com/op1t0Lf.png)
    - ![](https://i.imgur.com/9OIjbyb.png)
    - only bzip2 and soplex have >10% early prefetches
        - a prefetch cache can mitigate the cache pollution effect
    - mcf has 10% late prefetches
        - because the prefetch intrinsic generates a dependent load
        - removing this latency overhead raises the performance improvement to 9%
    - even though a noticeable number of redundant prefetches exist in many benchmarks, they have almost no negative effect on performance
### 5.3 Using the GCC Compiler
- everywhere outside this section (5.3), icc is used for evaluation
    - because gcc (more specifically gfortran) does not support intrinsics in Fortran
- in this section, prefetch intrinsics are inserted after compiling with gcc; the simulated regions match those used with icc, but benchmarks containing Fortran are excluded
- conclusions
    - except for mcf and lbm, gcc inserts many prefetch intrinsics into the binaries (while icc does not)
    - gcc still cannot cover all cache misses
    - gcc's performance falls between NOSW (prefetch flag disabled) and SW
    - icc performs the same as NOSW
### 5.4 Real System Experiments
- experiments on real processors
    - Core 2 and Nehalem
    - ![](https://i.imgur.com/qSRtVGi.png)
    - the entire reference input sets are run, not just the simpointed section
    - ![](https://i.imgur.com/BystWUl.png)
    - both the simulated regions and the entire execution have a similar portion of prefetch instructions
    - except for milc, gcc, and zeusmp, each benchmark's simulated region accurately represents the whole program
    - the simulated system configuration is similar to Core 2
- results
    - ![](https://i.imgur.com/W6Z4h4L.png)
    - cactusADM, soplex, bwaves, sphinx3, and leslie3d improve slightly on the simulator but regress on the real systems
    - the positive group gains from SW prefetches
    - HW prefetchers perform better on Nehalem than on Core 2
    - due to the multiple levels of cache prefetchers in Nehalem, the benefits of software prefetching are reduced
### 5.5 Profile-Guided Optimization (PGO)
- `gcc` is used with the flags `-fprofile-generate` and `-fprofile-use`
- the performance improvement does not come entirely from SW prefetching
### 5.6 Hardware Prefetcher for Short Streams
- short streams are a weakness of HW prefetching
- the ASD hardware prefetcher was proposed to address this
- although ASD can predict short-length streams, it still requires observing the first instances of a memory stream
- for prefetching short streams, SW prefetching is far more effective than ASD
### 5.7 Content Directed Prefetching (CDP)
- CDP is evaluated with a separate prefetch cache (CDP+PF), which prevents cache pollution
- software prefetching is more effective for irregular data structures
### 5.8 Manual Tuning
- it is not certain that SW prefetching has reached its full potential
- the benchmarks are manually and exhaustively tuned
- milc, GemsFDTD, lbm, bwaves, and bzip2 are excluded
    (already highly optimized or requiring significant code structure changes)
- it is more effective to remove negative intrinsics manually
- there are no significant performance changes between manual tuning and the algorithm in Section 4.1, especially in the positive group
### 5.9 Summary of New Findings
1. SW prefetching is frequently more effective for short streams
2. The SW prefetch distance (as long as it exceeds the minimum distance) is relatively insensitive to the hardware configuration
3. L1 cache misses are usually tolerated by out-of-order processors, but once the L1 cache miss rate exceeds 20%, reducing L1 misses by prefetching into the L1 cache becomes much more effective
4. The extra instructions (SW prefetch instructions) do not add much execution time
5. Training the HW prefetcher with SW prefetches may improve performance, but may also degrade it severely; it must be done judiciously if at all
## 6. CASE STUDIES: SOURCE CODE ANALYSIS OF INDIVIDUAL BENCHMARKS
- explains the reasons for the observed software prefetching behaviors using source code analysis
## 7. RELATED WORK
- reducing software prefetching overhead
- hardware and software cooperative prefetching
## 8. CONCLUSION AND FUTURE WORK
- the paper confirms, quantitatively, the following conclusions
    - having to select a static prefetch distance is generally considered one of the limitations of more aggressive software prefetching
    - we should expect to use software-based prefetching techniques more
    - short streams and indirect references are good candidates for software prefetching;
    workloads with strong stream behavior or regular stride patterns see no clear improvement and may even regress
    - future work should focus on developing software prefetch algorithms that decide how to train hardware prefetchers
    - by reducing the negative interactions between software and hardware prefetching schemes,
    compilers can insert prefetch instructions more aggressively
# Before the Experiments
## `naive_transpose`
- transposes one element at a time
- SRC~xy~ => DST~yx~
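A minimal sketch of this version (the signature follows the other transpose functions in the repository; details may differ from the actual source):

```c
/* Transpose a w x h matrix element by element: src[y][x] -> dst[x][y].
 * src is h rows of w ints; dst is w rows of h ints. */
void naive_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            *(dst + x * h + y) = *(src + y * w + x);
}
```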
## `sse_transpose`
- ![](https://i.imgur.com/Wb2MZGL.png)
- [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
- the examples below use this 4×4 source matrix
- S~00~ S~01~ S~02~ S~03~
S~10~ S~11~ S~12~ S~13~
S~20~ S~21~ S~22~ S~23~
S~30~ S~31~ S~32~ S~33~
- `_mm_loadu_si128`
- loads 128 bits of integer data from memory into dst
- mem_addr does not need to be aligned on any particular boundary.
- e.g. `__m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));`
=> I0 : S~00~ S~01~ S~02~ S~03~
- `_mm_unpacklo_epi32`
- Unpack and interleave 32-bit integers from the low half of a and b, and store the results in dst.
```=
INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){
dst[31:0] := src1[31:0]
dst[63:32] := src2[31:0]
dst[95:64] := src1[63:32]
dst[127:96] := src2[63:32]
RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0])
```
- e.g. `__m128i T0 = _mm_unpacklo_epi32(I0, I1);`
I0 : S~00~ S~01~ S~02~ S~03~
I1 : S~10~ S~11~ S~12~ S~13~
=>T0 : S~00~ S~10~ S~01~ S~11~
- `_mm_unpackhi_epi32`
- Unpack and interleave 32-bit integers from the high half of a and b, and store the results in dst.
```=
INTERLEAVE_HIGH_DWORDS(src1[127:0], src2[127:0]){
dst[31:0] := src1[95:64]
dst[63:32] := src2[95:64]
dst[95:64] := src1[127:96]
dst[127:96] := src2[127:96]
RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_DWORDS(a[127:0], b[127:0])
```
- e.g. `__m128i T2 = _mm_unpackhi_epi32(I0, I1);`
I0 : S~00~ S~01~ S~02~ S~03~
I1 : S~10~ S~11~ S~12~ S~13~
=>T2 : S~02~ S~12~ S~03~ S~13~
- `_mm_unpacklo_epi64`
- Unpack and interleave 64-bit integers from the low half of a and b, and store the results in dst.
```=
INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){
dst[63:0] := src1[63:0]
dst[127:64] := src2[63:0]
RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_QWORDS(a[127:0], b[127:0])
```
- e.g. `I0 = _mm_unpacklo_epi64(T0, T1);`
T0 : S~00~ S~10~ S~01~ S~11~
T1 : S~20~ S~30~ S~21~ S~31~
=>I0 : S~00~ S~10~ S~20~ S~30~
- `_mm_unpackhi_epi64`
- Unpack and interleave 64-bit integers from the high half of a and b, and store the results in dst.
```=
INTERLEAVE_HIGH_QWORDS(src1[127:0], src2[127:0]){
dst[63:0] := src1[127:64]
dst[127:64] := src2[127:64]
RETURN dst[127:0]
}
dst[127:0] := INTERLEAVE_HIGH_QWORDS(a[127:0], b[127:0])
```
- e.g. `I1 = _mm_unpackhi_epi64(T0, T1);`
T0 : S~00~ S~10~ S~01~ S~11~
T1 : S~20~ S~30~ S~21~ S~31~
=>I1 : S~01~ S~11~ S~21~ S~31~
- `_mm_storeu_si128`
- Store 128-bits of integer data from a into memory.
- mem_addr does not need to be aligned on any particular boundary.
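Putting the intrinsics above together, the 4×4 block kernel can be sketched as follows (adapted from the linked article; loop bounds assume w and h are multiples of 4):

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

void sse_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            /* load four rows of the 4x4 block */
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            /* interleave 32-bit elements */
            __m128i T0 = _mm_unpacklo_epi32(I0, I1); /* S00 S10 S01 S11 */
            __m128i T1 = _mm_unpacklo_epi32(I2, I3); /* S20 S30 S21 S31 */
            __m128i T2 = _mm_unpackhi_epi32(I0, I1); /* S02 S12 S03 S13 */
            __m128i T3 = _mm_unpackhi_epi32(I2, I3); /* S22 S32 S23 S33 */
            /* interleave 64-bit halves to finish the transpose */
            I0 = _mm_unpacklo_epi64(T0, T1);         /* S00 S10 S20 S30 */
            I1 = _mm_unpackhi_epi64(T0, T1);         /* S01 S11 S21 S31 */
            I2 = _mm_unpacklo_epi64(T2, T3);         /* S02 S12 S22 S32 */
            I3 = _mm_unpackhi_epi64(T2, T3);         /* S03 S13 S23 S33 */
            /* store as four rows of the transposed block */
            _mm_storeu_si128((__m128i *)(dst + (x + 0) * h + y), I0);
            _mm_storeu_si128((__m128i *)(dst + (x + 1) * h + y), I1);
            _mm_storeu_si128((__m128i *)(dst + (x + 2) * h + y), I2);
            _mm_storeu_si128((__m128i *)(dst + (x + 3) * h + y), I3);
        }
    }
}
```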
## `sse_prefetch_transpose`
- compared with `sse_transpose`, the following 5 lines are added
```C=
#define PFDIST 8
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
```
- `_mm_prefetch`
- Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
# Experiment: Original Version
- results
    - execution time : naive_transpose > sse_transpose > sse_prefetch_transpose
```
$ make test_exetime
...
naive_transpose : 278009 us
sse_transpose : 131516 us
sse_prefetch_transpose : 69849 us
```
- cache-miss counts show no significant difference
```
$ make test_perf
Performance counter stats for './main NAI' (5 runs):
1887,6889 cache-misses # 93.942 % of all cache refs ( +- 0.12% )
2009,4218 cache-references ( +- 0.03% )
14,4853,4306 instructions # 1.15 insn per cycle ( +- 0.00% )
12,5997,0514 cycles ( +- 0.65% )
0.480518521 seconds time elapsed ( +- 0.43% )
Performance counter stats for './main SSE' (5 runs):
690,4713 cache-misses # 91.452 % of all cache refs ( +- 1.94% )
755,0107 cache-references ( +- 0.34% )
12,3654,5367 instructions # 1.36 insn per cycle ( +- 0.06% )
9,0683,3447 cycles ( +- 2.19% )
0.350957360 seconds time elapsed ( +- 2.30% )
Performance counter stats for './main SSE_PF' (5 runs):
685,7546 cache-misses # 89.687 % of all cache refs ( +- 1.64% )
764,6069 cache-references ( +- 0.94% )
12,8263,5856 instructions # 1.66 insn per cycle ( +- 0.00% )
7,7312,5336 cycles ( +- 3.19% )
0.300050426 seconds time elapsed ( +- 4.46% )
```
- interpretation
    - sse_transpose is faster than naive_transpose because SIMD increases throughput
    - sse_prefetch_transpose is faster than sse_transpose: although the cache-miss rates are similar, SW prefetching hides part of the memory latency
# Experiment 1: 8×8 Matrix Transpose Using AVX
- method:
    - reference: [Transpose an 8x8 float using AVX/AVX2](http://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2)
![](https://i.imgur.com/JW1Bc3t.png)
=>
![](https://i.imgur.com/AQVYHrC.png)
=>
![](https://i.imgur.com/KeSXyH1.png)
=>
![](https://i.imgur.com/bzjrv2s.png)
=>
![](https://i.imgur.com/a3xyQT0.png)
- code
    - `_mm256_storeu_ps` is chosen for the stores instead of `_mm256_store_ps` because it has weaker alignment requirements; the difference:
        - `_mm256_storeu_ps` : mem_addr does not need to be aligned on any particular boundary.
        - `_mm256_store_ps` : mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
```C=
void sse_transpose_256(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 8) {
for (int y = 0; y < h; y += 8) {
__m256 I0 = _mm256_loadu_ps((const float *)(src + (y + 0) * w + x));
__m256 I1 = _mm256_loadu_ps((const float *)(src + (y + 1) * w + x));
__m256 I2 = _mm256_loadu_ps((const float *)(src + (y + 2) * w + x));
__m256 I3 = _mm256_loadu_ps((const float *)(src + (y + 3) * w + x));
__m256 I4 = _mm256_loadu_ps((const float *)(src + (y + 4) * w + x));
__m256 I5 = _mm256_loadu_ps((const float *)(src + (y + 5) * w + x));
__m256 I6 = _mm256_loadu_ps((const float *)(src + (y + 6) * w + x));
__m256 I7 = _mm256_loadu_ps((const float *)(src + (y + 7) * w + x));
__m256 T0 = _mm256_unpacklo_ps(I0, I1);
__m256 T1 = _mm256_unpackhi_ps(I0, I1);
__m256 T2 = _mm256_unpacklo_ps(I2, I3);
__m256 T3 = _mm256_unpackhi_ps(I2, I3);
__m256 T4 = _mm256_unpacklo_ps(I4, I5);
__m256 T5 = _mm256_unpackhi_ps(I4, I5);
__m256 T6 = _mm256_unpacklo_ps(I6, I7);
__m256 T7 = _mm256_unpackhi_ps(I6, I7);
I0 = _mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(1,0,1,0));
I1 = _mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(3,2,3,2));
I2 = _mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(1,0,1,0));
I3 = _mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(3,2,3,2));
I4 = _mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(1,0,1,0));
I5 = _mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(3,2,3,2));
I6 = _mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(1,0,1,0));
I7 = _mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(3,2,3,2));
T0 = _mm256_permute2f128_ps(I0, I4, 0x20);
T1 = _mm256_permute2f128_ps(I1, I5, 0x20);
T2 = _mm256_permute2f128_ps(I2, I6, 0x20);
T3 = _mm256_permute2f128_ps(I3, I7, 0x20);
T4 = _mm256_permute2f128_ps(I0, I4, 0x31);
T5 = _mm256_permute2f128_ps(I1, I5, 0x31);
T6 = _mm256_permute2f128_ps(I2, I6, 0x31);
T7 = _mm256_permute2f128_ps(I3, I7, 0x31);
_mm256_storeu_ps((float *)(dst + ((x + 0) * h) + y) , T0);
_mm256_storeu_ps((float *)(dst + ((x + 1) * h) + y) , T1);
_mm256_storeu_ps((float *)(dst + ((x + 2) * h) + y) , T2);
_mm256_storeu_ps((float *)(dst + ((x + 3) * h) + y) , T3);
_mm256_storeu_ps((float *)(dst + ((x + 4) * h) + y) , T4);
_mm256_storeu_ps((float *)(dst + ((x + 5) * h) + y) , T5);
_mm256_storeu_ps((float *)(dst + ((x + 6) * h) + y) , T6);
_mm256_storeu_ps((float *)(dst + ((x + 7) * h) + y) , T7);
}
}
}
```
- results :
    - time : AVX (8×8, 256 bits) is faster than SSE (4×4, 128 bits)
```
$ make test_exetime
naive_transpose : 275393 us
sse_transpose : 130499 us
sse_prefetch_transpose : 97015 us
AVX_transpose : 101904 us
```
- the cache-miss count is slightly lower than with SSE
```
perf stat --repeat 5 \
-e cache-misses,cache-references,instructions,cycles \
./main AVX
Performance counter stats for './main AVX' (5 runs):
555,2161 cache-misses # 85.753 % of all cache refs ( +- 0.14% )
647,4611 cache-references ( +- 0.04% )
11,4055,0914 instructions # 1.55 insn per cycle ( +- 0.00% )
7,3811,8569 cycles ( +- 0.44% )
0.286325584 seconds time elapsed ( +- 0.73% )
```
- verifying correctness
    - 8×8 matrix
```
$ make verify
cc -msse2 -mavx2 --std gnu99 -O0 -Wall -Wextra verify.c -o verify
./verify 8
verify 8 x 8 matrix transpose
testin
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
expected
0 8 16 24 32 40 48 56
1 9 17 25 33 41 49 57
2 10 18 26 34 42 50 58
3 11 19 27 35 43 51 59
4 12 20 28 36 44 52 60
5 13 21 29 37 45 53 61
6 14 22 30 38 46 54 62
7 15 23 31 39 47 55 63
naive_transpose : correct
...
sse_transpose : correct
...
sse_prefetch_transpose : correct
...
AVX_transpose : correct
...
```
- other sizes (n×n)
    - naive_transpose : works for any matrix size
    - sse_transpose (including the prefetch version) : n must be a multiple of 4
    - avx_transpose : n must be a multiple of 8
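These size constraints suggest a simple dispatcher; a hypothetical sketch (the encoding and function name are invented, they just mirror the divisibility rules above):

```c
/* Pick the widest usable implementation for an n x n transpose:
 * 2 = avx_transpose (n % 8 == 0), 1 = sse_transpose (n % 4 == 0),
 * 0 = naive_transpose (any n). */
int widest_transpose(int n)
{
    if (n % 8 == 0) return 2;
    if (n % 4 == 0) return 1;
    return 0;
}
```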
# Experiment 2: 8×8 Matrix Transpose Using AVX + SW Prefetch
## Adding `_mm_prefetch` following the SSE version
```C=
void avx_prefetch_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 8) {
for (int y = 0; y < h; y += 8) {
#define AVX_PFDIST 8
_mm_prefetch(src+(y + AVX_PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 3) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 4) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 5) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 6) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 7) *w + x, _MM_HINT_T1);
__m256 I0 = _mm256_loadu_ps((const float *)(src + (y + 0) * w + x));
__m256 I1 = _mm256_loadu_ps((const float *)(src + (y + 1) * w + x));
__m256 I2 = _mm256_loadu_ps((const float *)(src + (y + 2) * w + x));
__m256 I3 = _mm256_loadu_ps((const float *)(src + (y + 3) * w + x));
__m256 I4 = _mm256_loadu_ps((const float *)(src + (y + 4) * w + x));
__m256 I5 = _mm256_loadu_ps((const float *)(src + (y + 5) * w + x));
__m256 I6 = _mm256_loadu_ps((const float *)(src + (y + 6) * w + x));
__m256 I7 = _mm256_loadu_ps((const float *)(src + (y + 7) * w + x));
__m256 T0 = _mm256_unpacklo_ps(I0, I1);
__m256 T1 = _mm256_unpackhi_ps(I0, I1);
__m256 T2 = _mm256_unpacklo_ps(I2, I3);
__m256 T3 = _mm256_unpackhi_ps(I2, I3);
__m256 T4 = _mm256_unpacklo_ps(I4, I5);
__m256 T5 = _mm256_unpackhi_ps(I4, I5);
__m256 T6 = _mm256_unpacklo_ps(I6, I7);
__m256 T7 = _mm256_unpackhi_ps(I6, I7);
I0 = _mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(1,0,1,0));
I1 = _mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(3,2,3,2));
I2 = _mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(1,0,1,0));
I3 = _mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(3,2,3,2));
I4 = _mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(1,0,1,0));
I5 = _mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(3,2,3,2));
I6 = _mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(1,0,1,0));
I7 = _mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(3,2,3,2));
T0 = _mm256_permute2f128_ps(I0, I4, 0x20);
T1 = _mm256_permute2f128_ps(I1, I5, 0x20);
T2 = _mm256_permute2f128_ps(I2, I6, 0x20);
T3 = _mm256_permute2f128_ps(I3, I7, 0x20);
T4 = _mm256_permute2f128_ps(I0, I4, 0x31);
T5 = _mm256_permute2f128_ps(I1, I5, 0x31);
T6 = _mm256_permute2f128_ps(I2, I6, 0x31);
T7 = _mm256_permute2f128_ps(I3, I7, 0x31);
_mm256_storeu_ps((float *)(dst + ((x + 0) * h) + y) , T0);
_mm256_storeu_ps((float *)(dst + ((x + 1) * h) + y) , T1);
_mm256_storeu_ps((float *)(dst + ((x + 2) * h) + y) , T2);
_mm256_storeu_ps((float *)(dst + ((x + 3) * h) + y) , T3);
_mm256_storeu_ps((float *)(dst + ((x + 4) * h) + y) , T4);
_mm256_storeu_ps((float *)(dst + ((x + 5) * h) + y) , T5);
_mm256_storeu_ps((float *)(dst + ((x + 6) * h) + y) , T6);
_mm256_storeu_ps((float *)(dst + ((x + 7) * h) + y) , T7);
}
}
}
```
## Choosing the Distance
- averaged over 50 runs
- distance = 8 is chosen
distance |0|2|4|6|8|10|12|14|16|18|20|
---------|-|-|-|-|-|--|--|--|--|--|--|
execute time(us)|76279|75820|76703|75968|74315|76408|76005|76624|77762|77839|78046
- results
    - with SSE, execution time drops to 48% of naive
    - with AVX, execution time drops to 29% of naive
    - with SW prefetch, SSE execution time drops to 52% of plain SSE
    - with SW prefetch, AVX execution time drops to 93% of plain AVX
```
naive_transpose : 275385 us
sse_transpose : 131880 us
sse_prefetch_transpose : 69229 us
avx_transpose : 79326 us
avx_prefetch_transpose : 73965 us
```
## Moving the `_mm_prefetch` Instructions
- move `_mm_prefetch` to after the `_mm256_loadu_ps` loads
    (other positions were tried; placing them after `_mm256_loadu_ps` gives the shortest execution time)
```C=
...
_mm_prefetch(src+(y + AVX_PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 3) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 4) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 5) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 6) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 7) *w + x, _MM_HINT_T1);
__m256 I0 = _mm256_loadu_ps((const float *)(src + (y + 0) * w + x));
__m256 I1 = _mm256_loadu_ps((const float *)(src + (y + 1) * w + x));
__m256 I2 = _mm256_loadu_ps((const float *)(src + (y + 2) * w + x));
__m256 I3 = _mm256_loadu_ps((const float *)(src + (y + 3) * w + x));
__m256 I4 = _mm256_loadu_ps((const float *)(src + (y + 4) * w + x));
__m256 I5 = _mm256_loadu_ps((const float *)(src + (y + 5) * w + x));
__m256 I6 = _mm256_loadu_ps((const float *)(src + (y + 6) * w + x));
__m256 I7 = _mm256_loadu_ps((const float *)(src + (y + 7) * w + x));
...
```
becomes
```C=
...
__m256 I0 = _mm256_loadu_ps((const float *)(src + (y + 0) * w + x));
__m256 I1 = _mm256_loadu_ps((const float *)(src + (y + 1) * w + x));
__m256 I2 = _mm256_loadu_ps((const float *)(src + (y + 2) * w + x));
__m256 I3 = _mm256_loadu_ps((const float *)(src + (y + 3) * w + x));
__m256 I4 = _mm256_loadu_ps((const float *)(src + (y + 4) * w + x));
__m256 I5 = _mm256_loadu_ps((const float *)(src + (y + 5) * w + x));
__m256 I6 = _mm256_loadu_ps((const float *)(src + (y + 6) * w + x));
__m256 I7 = _mm256_loadu_ps((const float *)(src + (y + 7) * w + x));
_mm_prefetch(src+(y + AVX_PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 3) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 4) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 5) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 6) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + AVX_PFDIST + 7) *w + x, _MM_HINT_T1);
...
```
- results
    - execution time drops from 73965 us to 60779 us (82% of the original)
```
avx_prefetch_transpose : 60779 us
```
- the cache-miss rate actually increases, from 78% to 85%
```
Performance counter stats for './main AVX_PF' (50 runs):
557,2321 cache-misses # 77.903 % of all cache refs ( +- 0.04% )
715,2936 cache-references ( +- 0.03% )
11,6349,6212 instructions # 1.67 insn per cycle ( +- 0.00% )
6,9523,9092 cycles ( +- 0.06% )
0.272111501 seconds time elapsed ( +- 0.33% )
Performance counter stats for './main AVX_PF' (50 runs):
554,6242 cache-misses # 85.417 % of all cache refs ( +- 0.04% )
649,3141 cache-references ( +- 0.02% )
11,6353,4796 instructions # 1.75 insn per cycle ( +- 0.00% )
6,6666,4729 cycles ( +- 0.07% )
0.258250260 seconds time elapsed ( +- 0.32% )
```
- modifying the SSE version the same way
    - execution time drops from 69229 us to 65405 us (94% of the original)
    - the cache-miss rate stays around 87.7% in both cases
```
Performance counter stats for './main SSE_PF' (50 runs):
659,9302 cache-misses # 87.709 % of all cache refs ( +- 0.03% )
752,4061 cache-references ( +- 0.03% )
12,8257,3532 instructions # 1.85 insn per cycle ( +- 0.00% )
6,9223,9566 cycles ( +- 0.13% )
0.266634360 seconds time elapsed ( +- 0.32% )
Performance counter stats for './main SSE_PF' (50 runs):
659,4715 cache-misses # 87.692 % of all cache refs ( +- 0.03% )
752,0292 cache-references ( +- 0.02% )
12,8256,2815 instructions # 1.85 insn per cycle ( +- 0.00% )
6,9168,6504 cycles ( +- 0.13% )
0.268256938 seconds time elapsed ( +- 0.33% )
```
- prefetch hits
    - ![](https://i.imgur.com/LJL60D6.png)
    - `$ perf stat -e r014c ./main SSE_PF`
    - results
        - SW prefetch hits increase roughly 8×
        - HW prefetch hits stay about the same
instruction position|before|after
---------------|-|-
SW prefetch hit|16,7489|133,0769
HW prefetch hit|42,1308|41,7050
# Reference
- [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
- [Highlights and commentary (in Chinese) on "When Prefetching Works, When It Doesn’t, and Why"](https://hackmd.io/s/HJtfT3icx)
- [Cache prefetching](https://en.wikipedia.org/wiki/Cache_prefetching)
- [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
- [IntrinsicsGuide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)
- [ Performance of SSE and AVX Instruction Sets](https://arxiv.org/pdf/1211.0820.pdf)
- [Transpose an 8x8 float using AVX/AVX2](http://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2)
###### tags: `twzjwang`