2017q3 Homework2 (software-pipelining)

tags: `sysprogram`

contributed by <st9007a>

Github

開發環境

Linux 4.4.0-92-generic
Ubuntu 16.04.1 LTS
CPU family: 6
CPU model: 94
CPU model name: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
L1d cache:  32K
L1i cache:  32K
L2 cache:   256K
L3 cache:   6144K

論文閱讀整理

原始文章

Section 1 - Introduction

有許多種 SW , HW prefetching 機制，卻只有相對簡單的 SW prefetching 演算法出現在 complier 中 (icc , gcc)
- 通常需要手動加入 prefetching intrinsics 來進行效能優化
兩個原因：
- 加入 prefetching intrinsics 的方式沒有一個嚴格的標準
- SW , HW prefetching 之間的交互影響複雜不易理解
prefetcher 在不同 benchmark 中的比較如下圖：
- SW+HW prefetcher 的最好執行結果與 HW prefetcher 最好執行結果 進行比較，分為三個 group
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →

Section 2 - Background of SW and HW prefetching

名詞定義：
- streams：unit-stride cache-line accesses
- strided：access stride distances greater than two cache lines
不同資料結構對應不同 prefetching 策略
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Software Prefetch Intrinsics

使用 x86 SSE SIMD extenssions
void _mm_prefetch(char *ptr, int hint)
- ptr：prefetch address
- hint：prefetch hint，參照下圖
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
intrinsics translate to instructions
- direct address prefetches：2 instructions
- indirect memory address prefetches：4 instructions
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →

Prefetch Classification

分為六個類別
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Timely, Late, Early 在時間線上的定義
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Software Prefetch Distance

定義：在 array 的 index 中加入一常數 D 來代表要 prefetch 的 address，D 即是 prefetch distance
- e.g. 參考以下程式碼

int vector[100] = get_orgin(100);
for (int i; i < 100; i++) {
    _mm_prefetch(&vector[i + 4], T0); //prefetch distance = 4
    printf("%d\n", vector[i]);
}

stream pattern 的 array structure 的 prefetch distance 可以表示成：
$D > ⌈ \frac{l}{s} ⌉$
- $l$ ：prefetch latency
- $s$ ：the length of the shortest path through the loop body
prefetch distance 必須夠大到足以隱藏 memory latency
- prefetch distance 太小：prefetch request 太晚，cpu 已經開始在 cache 上存取資料
- prefetch distance 太大：prefetch address 太早出現在 cache 上，導致 prefetch address 可能被其他資料覆蓋，造成更多的 cache miss

Direct and Indirect Memory Indexing

HW prefetcher 可以較輕易的 prefetch direct memory indexing 的資料
- 因為 memory address 是正常的 stream/stride 行為
- 相對地，HW prefetcher 不擅長處理 indirect memory indexing
SW prefetcher 在處理 indirect memory indexing 比 HW prefetcher 還要有效率
Indirect prefetch requests have a higher overhead
- 4 個 instructions , 2 次 address 載入
- 2 次 address 載入有相依性，無法並行處理

Section 3 - Positive and Negative Impacts of Software Prefetching

SW prefetching 優點

Large Number of Streams (Limited Hardware Resources)
- stream prefetcher( 其中一種 HW prefetcher ) 的 streams 數量受限於硬體資源
- SW prefetching 可以在每個 streams 加入獨立的 prefetch requests
Short Streams
- HW prefetcher 至少需要 2 個 cache miss 來偵測 stream/stride 的 direction 跟 distance
- stream 過短的狀況下，沒有足夠的 cache miss 能訓練 HW prefetcher，進而載入有效的 cache block
Irregular Memory Access
- 參照 Section 2 Direct and Indirect Memory Indexing
- SW prefetching 可以透過加入適當的 intrinsics 來存取 irregular data structure
Cache Locality Hint
- HW prefetcher 將 prefetched data 放置於 low-level cache ( L2, L3 )，減 L1 cache pollution，但是 L2 到 L1 的延遲會影響效能
- SW prefetching 可以更靈活地將 prefetched data 放置到正確的 cache 上，參照 Section 2 Software Prefetch Intrinsics
Loop Bounds
- SW prefetching 可以透過需多方法避免 prefetch requet out of array bound 的狀況，e.g. loop unrolling , software piplining , branch instructions
- It is not possible for HW prefetcher

SW prefetching 缺點

Increased Instruction Count
- SW prefetching 需要消耗額外的機器資源來執行 prefetch 相關的 instruction，下圖為對應由於 SW prefetches 而增加的 instructions 量
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
Static Insertion
- SW prefetching 策略在編譯完之後就無法更動，因此 SW prefetching 無法即時的應付多種 memory latency, cache size, bandwidth
- HW prefetcher 可以適時的修正 prefetch distance 跟 degree
Code Structure Change
- 在每次 loop 要執行的 instructions 極少的狀況下，SW prefetching 無法隱藏住 memory latency
- 在某些狀況下，加入 prefetch 會導致 instructions 增加過多，使結構調整變得困難

SW + HW prefetching 的優點

Handling Multiple Streams
- 兩種 prefetching 機制並用可以處理更多 data streams
- e.g. HW prefetching 處理常規資料，SW prefetching 處理非常規資料
Positive Training
- SW prefetch requests 可以幫助訓練 HW prefetcher

SW + HW prefetching 的缺點

Negative Training
- SW prefetching 會使 HW prefetcher 訓練速度變慢
- 如果 SW prefetching 隱藏一到多個 data streams 會使 HW prefetcher 無法確實的訓練
- SW prefetching 可能觸發 HW prefetcher，造成 prefetch request 過早
Harmful Software Prefetching
- HW prefetcher 精確度比 SW prefetching 還低，當 SW prefetching 出錯時會造成 HW prefetcher 精確度更低

Section 4 - Experimental Methodology

Prefetch Intrinsic Insertion Algorithm

針對每個 prefetch candidate 加入 SW prefetch
Prefetch candidate：每 1000 個 instructions 中 L1 misses 超過 5%
Prefetch distance：
$D i s t a n c e = \frac{K \times L \times I P C_{b e n c h}}{W_{l o o p}}$
- $K$ ：constant factor
- $L$ ：average memory latency
- $I P C_{b e n c h}$ ：the profiled average IPC of each benchmark
- $W_{l o o p}$ ：s the average instruction count in one loop iteration
實驗使用
$K = 4, L = 300$ ，
$W_{l o o p}, I P C_{b e n c h}$ 則由各 benchmark 的 profile data 決定

Simulation Methodology

使用 MacSim：a trace-driven cycle-level simulator，下圖為處理器的配置
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Perform experiments on 13 memory-intensive benchmarks from SPEC CPU 2006
Excluded four additional benchmarks that meet this definition of memory intensive — omnetpp, astar, xalancbmk, and wrf — because we were not able to insert any software prefetch intrinsics.
下圖為 evaluated benchmark 在 baesline processor 的特徵
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Section 5 - Evaluation：Base Obvervations of Prefetching

Overhead and Limitations of Software Prefetching

Instruction Overhead
- 參照 Section 3 SW prefetching 缺點
- 除了 prefetch instruction 以外，還增加了 indirect memory access 跟 index calculation 相關的 instructions
- GemsFDTD 增加了超過 50% 的 instructions，但是 prefetch 的效益還是大於 instructions 數量增加的影響
Software Prefetching Overhead
- 針對個別去除 overhead of cache pollution , bandwidth consumption , memory access latency , redundant prefetch , instruction 做量測，結果如下圖
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
- Y 軸為去除 overhead 後所減少的執行 cycle 數量
- cache pollution 原因在於錯誤的 prefetch，所以其 overhead 影響不大
- bandwidth comsumption 影響也不大，這證明了機器提供了足夠的 bandwidth
- memory latency access 的影響較大，代表 SW prefetching 無法完全隱藏 memory latency
- redundant prefetch 的值較小，這代表即使實驗中 redundant prefetch 數量很大，其影響仍然能忽略不計
- 雖然在　GemsFDTD　, bwaves　, leslie3d 增加了許多 instructions (參照 Section 3 SW prefetching 缺點)，其 ovrehead of instructions 仍然不大

The Effect of Prefetch Distance

下圖為 prefetch distance 效能的靈敏度
- X 軸為相對於 base prefetch distance 的 prefetch distance
- 參照Section 2 Prefetch Distance，base prefetch distance 套用公式
  $D > ⌈ \frac{l}{s} ⌉$
- X 軸的一個單位長度為一個 cache block
根據 the optimal zone 以及 performance variance 將所有 benchmark 分為 5 個 groups，如下圖

Static Distance vs. Machine Configuration

比較三種不同的機器配置(base, less-aggressive, aggressive，參照 Section 4 Simulation Methodology)
- 左邊三張圖為在不同機器配置下的 best distance
- 右邊三張圖 baseline 的 best distance 與不同機器配置的 best distance 之間的 performance difference
- 除了 lbm 以外，其他的 benchmark 在不同機器配置下效能差異並不大
- 結論：static prefetch distance variance 在不同機器配置下並不會影響效能太多

Cache-Level Insertion Policy

參照 Section 3 SW prefetching 優點 - Cache Locality Hint
下圖為不同 cache locality hint 對應不同 benchmark 比較結果
T0 將 prefetched data 放置到 L1 cache，因此使 L1 cache miss rate 下降
- 在 benchmark 的 libquantum 就有這個現象，其 L1 cache miss rate 從 10.4% 降至 0.01%

Effect of Using Hardware and Software Prefetching Together

Hardware Prefetcher Training Effects
- two different hardware prefetcher training policies
  - NT (SW+GHB, SW+STR)：HW prefetcher 訓練演算法忽略 SW prefetch request，僅透過 cache miss 來訓練
  - Train (SW+GHB+T, SW+STR+T)：HW prefetcher 訓練演算法透過 SW prefetch request 以及 cache miss 來訓練
- 下圖為利用 SW prefetching 訓練 HW prefetcher 的效能比較
- 透過對 HW prefetching 預先做訓練可以達到 3% - 5% 的效能成長
- 然而效能下降幅度也很大 (milc , libquantum)
- 結論：負面影響非常顯著，因此透過 SW prefetching 來訓練 HW prefetching 不是一個好方法

Prefetch Coverage

定義：the ratio of useful prefetches to total L2 misses
理論上，SW prefetching 的 coverage 要高於 HW prefetchers，但是在 neutral 與 negative group 中，SW prefetching 的 coverage都小於 HW prefetchers
結論：less coverage is the main reason for performance loss in the neutral and negative groups

Prefetching Classification

參照 Section 2 Prefetch Classification
下圖為不同 prefetch requests 在各個 benchmark 以及不同 prefetchers 中所佔的比例

Using the GCC Compiler

之前的實驗使用 icc compiler
gcc compiler 不支援 Fortran 的 intrinsics
實驗進行方式為 gcc 編譯完程式碼後再加入 intrinisics，並且排除　Fortran benchmarks
gcc version = 4.4
實驗結果如下
下圖為 gcc compiler 增加的 instructions 數量
- The gcc compiler inserts more software prefetches than icc, except for mcf and lbm.

Real System Experiments

在兩個 Intel 處理器中量測 SW prefetching 的效果
- Core 2
- Nehalem
在實驗中使用完整的 reference input sets，而不是只有 simpointed section
下圖表示 simulated region 與 entire execution 有相似的 prefetch instructions
下圖為實驗結果
- 在 real system 中 cactusADM, soplex, bwaves, sphinx3, leslie3d 效能下降，儘管它們在 simulater 中效能是上升的
- all benchmarks in the positive group also show benefits from using software prefetches
- Nehalem 的 HW prefetcher 效果較 Core 2 好
- 由於 Nehalem 的 HW prefetcher 有 the multiple cache levels，因此 SW prefetching 的效益有所下降

Profile-Guided Optimization (PGO)

perform experiments using both icc and gcc, but we do not include the results from icc because it does not insert any software prefetches
use gcc 4.4 with fprofile-generate and fprofile-use flags
下圖為實驗結果
- 大部分的 PGO 效能表現介於 base 與 SW 之間
- 執行效能的效益並非全部來自於 SW prefetching
- PGO 減少了 total instruction count 以及 prefetch instruction count

Hardware Prefetcher for Short Streams

short stream 是 HW prefetcher 的弱點 (參照 Section 3 SW prefetching 優點 - short stream)
ASD hardware prefetcher：一個可用於處理 short stream 的 HW prefetcher(僅概念，沒有被實作出來)
下圖為　ASD 與 SW/HW prefetching 的實驗結果
- milc 有 short stream 的行為，而 ASD 在沒有 SW prefetching 的狀況下有 7% 的效能提升
- SW prefetching 本身就能讓執行速度快 2 倍以上
- ASD 的效益比 STR 與 GHB 好，但是效益 < 3%
- gcc, sph, les 在結合 ASD 與 SW prefetching 以及 GHB/STR HW prefetching 的狀況下，效能有所提升，但是提升幅度 < 2%
- 結論：在處理 short stream 的情況下，SW prefetching 的效率大於 ASD

Content Directed Prefetching

處理 linked 或 irregular 的 data structure
下圖為實驗結果，只處理 gcc 與 mcf 這兩個 benchmark，因為只有這兩個有受影養
- CDF HW prefetcher 對 gcc , mcf 分別提升了 6.5% 與 3.9%
- 對 CDF 進行防止 cache pollution 的處理後效能又個別上升了 5% 與 2%
- SW prefetching 效能提升 78%
- 結論：SW prefetching 在處理 irregular data structure 上是比較有效率的

Manual Tuning

無法確定 SW prefetching 將結果優化到最好
調整項目：
- remove useless prefetches
- adjust prefetch distance
- apply additional overhead-reduction techniques (loop unrolling , prologue/epilogue transformations)
事先排除 milc, GemsFDTD, lbm, bwaves, bzip2，因為它們不是已經最佳化了，就是需要大幅改變程式碼結構
下圖為實驗結果
- speedup 在 neutral 與 negative group 中比 positive group 高，代表手動移除 negative intrinsics 是一個有效的方法
- 在 positive group 中，manual tuning 與 Section 4 的演算法彼此間的效能沒有顯著的改變，這代表演算法就已經是一個很好的 guideline for inserting software prefetch intrinisics.

Summary

在 regular access pattern 的狀況下，HW prefetchers 會 under-exploit，尤其是 short stream，這時使用 SW prefetching 的效率是比較好的
Prefetch distance 不會受限於機器配置，只要 prefetch distance 大於公式的最小距離，大部分的 applications 不會因為 prefetch distance 而受影響
雖然 out-of-order execution 可以容忍 L1 cache misses，但是當 L1 cache misses 超過 20% 時，利用 prefetch 來降低 L1 cache misses 是比較有效率的
The overhead of useless prefetching instructions is not very significant.
SW prefetching 可以訓練 HW prefetcher 來提升效能，但是也可能拖垮效能

Section 6 - Case Studies：Source Code Analysis of Individual Benchmarks

參照原文 Section 6

Section 7 - Relatived Work

Reducing software prefetching overhead
Hardware and software cooperative prefetching

Section 8 - Conclusion and Furture Work

Conclusion

選擇一個 static prefetch distance 是 aggressive SW prefetching 的限制
- 從實驗中可以看到 aggressive SW prefetching 產生了許多 early prefetch requests，optimal prefetch distance 也會受機器配置影響
- 但是從實驗中可以觀察到，在大部分的 benchmark 中，prefetch distance 不容易影響效能，也很容易調整
- SW prefetch 有高 coverage of delinquent loads，可以透過 prefetch 來減少額外 instruction 的 overhead
Prefetching for short streams shows the most positive effects.
- HW prefetchers 沒辦法 prefetch short stream, irregular data structure, a large number of concurrent streams
- 必須使用 software-base prefetching techniques
Coverage 可以做為選擇 prefetch scheme 的主要評估
- 透過 coverage 可以得知應該使用 SW prefetching 來處理 short stream 以及 indirect reference
- 而在 strong stream 或 regular stride pattern 中使用 SW prefetching 則對效能不會有太大的影響，甚至可能使 HW prefetching 效能下降

Futrue Work

開發 SW prefetch 演算法用於：
- 訓練 HW prefetchers
- 減少 SW/HW prefetching scheme 的 negative instructions 讓 compiler 可以更積極(?)的安插 prefetch instructions

編譯執行 prefetcher

將專案編譯執行後結果如下：

  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15

  0  4  8 12
  1  5  9 13
  2  6 10 14
  3  7 11 15
sse prefetch:    54226 us
sse:             135799 us
naive:           238329 us

上面兩個矩陣是用來驗證

4 \times 4

矩陣的 transpose，而驗證時使用的方法是 sse_transpose

接著下面三個是使用三種不同方法來進行矩陣 transpose 時的執行時間，而這裡使用的矩陣大小為

4096 \times 4096

，可以從輸出的結果知道效能是 sse prefetch > sse > naive，其中 sse 使用的是 intrinsics function

再來看看這三個方法是怎麼時做的，首先是 naive_transpose()

void naive_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            *(dst + x * h + y) = *(src + y * w + x);
}

naive_transpose() 將 src 的 (x, y) 位置的值直接 assign 到 dst 的 (y, x) 位置

然後是 sse_transpose()

void sse_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}

根據 intel intrinsics guide，可以知道程式碼中每個 intrinsics function 的作用

__m128i __mm_loadu_si128(__m128i const *mem_addr)：將 128-bit 的整數從 mem_addr 載入到 dst，dst 會作為回傳值回傳出來
__m128i __mm_unpacklo_epi32(__m128i a, __m128i b)：將 a 與 b 低位的 64-bit 資料切成 a[31:0] , a[63:32] , b[31:0] , b[63:32]，然後填入 dst 回傳，dst 由高位到低位依序是 b[63:32] , a[63:32] , b[31:0] , a[31:0]
__mm128i __mm_unpackhi_epi32(__m128i a, __m128i b)：行為與 __mm_unpacklo_epi32 類似，只是改成取 a 與 b 的高位
__m128i _mm_unpacklo_epi64 (__m128i a, __m128i b)：行為與 __mm_unpacklo_epi32 類似，只是資料不會被分割，所以 dst 由高位到低位是 b[63:0], a[63:0]
__m128i _mm_unpackhi_epi64 (__m128i a, __m128i b)：行為與 __mm_unpackhi_epi32 類似，dst 由高位到低位是 b[127:64], a[127:64]
void __mm_storeu_si128(__m128i *mem_addr, __m128i a)：將 a 的值寫入 mem_addr 中

知道了每個 function 的功用後，sse_transpose 做了什麼還是很難想像，所以先帶入

4 \times 4

矩陣觀察其行為，選擇

4 \times 4

的原因是 for loop 只需做一次就行，方便觀察

src_matrix = {
    { 1, 2, 3, 4},
    { 5, 6, 7, 8},
    { 9,10,11,12},
    {13,14,15,16},
}

I0 = { 1, 2, 3, 4}
I1 = { 5, 6, 7, 8}
I2 = { 9,10,11,12}
I3 = {13,14,15,16}

T0 = { 1, 5, 2, 6}
T1 = { 9,13,10,14}
T2 = { 3, 7, 4, 8}
T3 = {11,15,12,16}

I0 = { 1, 5, 9,13}
I1 = { 2, 6,10,14}
I2 = { 3, 7,11,15}
I3 = { 4, 8,12,16}

dst_matrix = {
    { 1, 5, 9,13},
    { 2, 6,10,14},
    { 3, 7,11,15},
    { 4, 8,12,16}
}

將每一行執行結果列出來後就可以看到，透過 unpack 的方式，矩陣被轉置了

最後是 sse_prefetch_transpose()

void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
#define PFDIST  8
            _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);

            __m128i I0 = _mm_loadu_si128 ((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128 ((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128 ((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128 ((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}

sse_prefetch_transpose() 除了 prefetch 的部分，其他部分跟 sse_transpose() 是一樣的，而 prefetch 的部分則是 prefetch 了往後數 PFDIST + 0 ~ 3 個 row 的資料並且放置到 L2 以及更高階的 cache 上

分析效能

為了能個別測試效能，所以先設計一個通用的介面 transpose.h

void transpose(int *src, int *dst, int w, int h);

接著用不同檔案來個別實作不同方法：naive_transpose.c , sse_transpose.c , sse_prefetch_transpose.c

完成後，修改一下 main.c 跟 Makefile 讓其能自動產生對應的執行檔，接著用 perf stat 來看看各自的 cache miss

$ perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./naive_transpose
Performance counter stats for './navie_transpose' (100 runs):

         4108,3426  cache-misses      #  86.021 % of all cache refs      ( +-  0.31% )
         4775,9788  cache-references                                     ( +-  0.28% )
      14,4866,1674  instructions      #   1.48  insns per cycle          ( +-  0.00% )
       9,7703,3562  cycles                                               ( +-  0.07% )

       0.343869644 seconds time elapsed                                  ( +-  0.07% )

$ perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./sse_transpose
Performance counter stats for './sse_transpose' (100 runs):

         1455,1286  cache-misses      #  75.665 % of all cache refs      ( +-  0.06% )
         1923,1186  cache-references                                     ( +-  0.07% )
      12,3676,0551  instructions      #   1.72  insns per cycle          ( +-  0.00% )
       7,1766,9516  cycles                                               ( +-  0.07% )

       0.264611085 seconds time elapsed                                  ( +-  0.17% )

$ perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./sse_prefetch_transpose
Performance counter stats for './sse_prefetch_transpose' (100 runs):

          823,1685  cache-misses      #  64.939 % of all cache refs      ( +-  0.18% )
         1267,5942  cache-references                                     ( +-  0.12% )
      12,8279,3332  instructions      #   2.30  insns per cycle          ( +-  0.00% )
       5,5755,6001  cycles                                               ( +-  0.01% )

       0.156492681 seconds time elapsed                                  ( +-  0.05% )

可以看到 prefetch 讓 cache miss 下降了，接著搭配 raw counter 使用 perf stat

raw counter 參照 Intel® 64 and IA-32 Architectures Developer’s Manual: Vol. 3B 的第 19-2 章：Performance Monitoring Events For 6th Generation Intel®Core™ Processor And 7th Generation Intel®Core™ Processor

r2124：L2_RQSTS.DEMAND_DATA_RD_MISS (Demand Data Read requests that missed L2, no rejects.)
r4124：L2_RQSTS.DEMAND_DATA_RD_HIT (Demand Data Read requests that hit L2 cache.)
rE124：L2_RQSTS.ALL_DEMAND_DATA_RD (All demand data read requests to L2.)

找了一下，決定觀察這三個 event，原因在於上面 prefetch hint 為 T1，T1 會將 prefetched data 放置於 L2 以及更高階的 cache，所以理論上可以在 L2 cache 上觀察到明顯的數據變化，第三個 event 是拿來算百分比用的

$ perf stat --repeat 100 -e r2124,r4124,rE124 ./navie_transpose
Performance counter stats for './navie_transpose' (100 runs):

         1678,1605  r2124                ( +-  0.00% )
            4,6208  r4124                ( +-  1.15% )
         1683,4637  rE124                ( +-  0.00% )

       0.343165430 seconds time elapsed  ( +-  0.08% )

$ perf stat --repeat 100 -e r2124,r4124,rE124 ./sse_transpose
Performance counter stats for './sse_transpose' (100 runs):

          421,6532  r2124                ( +-  0.00% )
            2,3383  r4124                ( +-  0.39% )
          424,9238  rE124                ( +-  0.01% )

       0.264649831 seconds time elapsed  ( +-  0.15% )

$ perf stat --repeat 100 -e r2124,r4124,rE124 ./sse_prefetch_transpose
Performance counter stats for './sse_prefetch_transpose' (100 runs):

            3,3902  r2124                ( +-  0.30% )
          332,5174  r4124                ( +-  0.29% )
          336,6697  rE124                ( +-  0.29% )

       0.156584539 seconds time elapsed  ( +-  0.05% )

	naive	sse	sse prefetch
L2_RQSTS.DEMAND_DATA_RD_MISS	1678,1605 (99.684%)	421,6532 (99.699%)	3,3902 (1.007%)
L2_RQSTS.DEMAND_DATA_RD_HIT	4,6208 (0.274%)	2,3383 (0.553%)	332,5174 (98.767%)
L2_RQSTS.ALL_DEMAND_DATA_RD	1683,4637	424,9238	336,6697

從上面的表格可以看到，naive 到 sse 的 L2 cache miss/hit rate 沒有顯著變化，但是 sse prefetch 很明顯的使 L2 hit rate 大幅提升，驗證了 prefetch hint = T1 的狀況 (註：這裡的 miss/hit rate 僅計算 demand data read request)

AVX Transpose

AVX transpose 參考了 stackoverflow – Transpose an 8x8 float using AVX/AVX2 的解答，由於 AVX 的暫存器是 256-bit，一次可以讀取 8 個 32-bit 整數，所以 for loop 的 step 改成 8

 for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
            __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
            __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
            __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T2 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
            __m256i T4 = _mm256_unpacklo_epi32(I4, I5);
            __m256i T5 = _mm256_unpackhi_epi32(I4, I5);
            __m256i T6 = _mm256_unpacklo_epi32(I6, I7);
            __m256i T7 = _mm256_unpackhi_epi32(I6, I7);
            I0 = (__m256i)_mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(1, 0, 1, 0));
            I1 = (__m256i)_mm256_shuffle_ps(T0, T2, _MM_SHUFFLE(3, 2, 3, 2));
            I2 = (__m256i)_mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(1, 0, 1, 0));
            I3 = (__m256i)_mm256_shuffle_ps(T1, T3, _MM_SHUFFLE(3, 2, 3, 2));
            I4 = (__m256i)_mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(1, 0, 1, 0));
            I5 = (__m256i)_mm256_shuffle_ps(T4, T6, _MM_SHUFFLE(3, 2, 3, 2));
            I6 = (__m256i)_mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(1, 0, 1, 0));
            I7 = (__m256i)_mm256_shuffle_ps(T5, T7, _MM_SHUFFLE(3, 2, 3, 2));
            T0 = _mm256_permute2f128_si256(I0, I4, 0x20);
            T1 = _mm256_permute2f128_si256(I1, I5, 0x20);
            T2 = _mm256_permute2f128_si256(I2, I6, 0x20);
            T3 = _mm256_permute2f128_si256(I3, I7, 0x20);
            T4 = _mm256_permute2f128_si256(I0, I4, 0x31);
            T5 = _mm256_permute2f128_si256(I1, I5, 0x31);
            T6 = _mm256_permute2f128_si256(I2, I6, 0x31);
            T7 = _mm256_permute2f128_si256(I3, I7, 0x31);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
    }
}

附上執行結果比較，avx_prefetch_transpose 的 prefetch distance 這裡事先設定為 8

exec naive_transpose
execution time: 225405 us

exec sse_transpose
execution time: 151882 us

exec sse_prefetch_transpose
execution time: 43714 us

exec avx_transpose
execution time: 56837 us

exec avx_prefetch_transpose
execution time: 43255 us

可以看到 avx_transpose 的執行速度比 sse_transpose 還快，執行速度的部分可以參照 Performance of SSE and AVX Instruction Sets 的 Data Reuse 的部分，avx_transpose，不斷重複使用資料(I0 ~ I7 , T0 ~ T7)，而 avx 在 Data Reuse 上的效能是勝過 sse 的

接著嘗試使用不同的 prefetch distance 來比較 avx_prefetch_transpose 的效能

prefetch distance	8	16	24	32
Average	42850.82	43546.13	45357.07	47583.64
Standard Deviation	298.008	320.335	606.971	815.092
Confidence Interval (95%)	42254.804 - 43446.836	42905.46 - 44186.8	44143.128 - 46571.012	45953.456 - 49213.824

表格為不同 prefetch distance 個跑 100 次後其執行時間的統計結果，從資料分布來看 prefetch distance = 8 的效能最好，這裡就不考慮更大的 prefetch distance，根據論文的敘述，prefetch distance 太大會導致 cache miss 機率提升

再來重新比較 L2 cache miss/hit

	avx	avx prefetch	sse	sse prefetch	naive
DATA_RD_MISS	307,4598 (97.822%)	89,6375 (32.186%)	421,5791 (99.239%)	3,3613 (1.009%)	1678,0959 (99.686%)
DATA_RD_HIT	3,2268 (1.027%)	185,2637 (66.522%)	2,3144 (0.545%)	329,1381 (98.761%)	4,6184 (0.274%)
DEMAND_DATA_RD	314,3045	278,4986	424,8124	333,2668	1683,3807

可以看出 prefetch 有確實提升 L2 cache hit rate，但是提升幅度沒有 sse prefetch 那麼大

最後是比較 sse_prefetch_transpose 與 avx_prefetch_transpose 執行速度，其他方法的執行速度差距明顯，所以就不特別比較

	avx_prefetch_transpose	sse_prefetch_transpose
Average	42750.75	43661.29
Standard Deviation	328.847	128.357
Confidence Interval (95%)	42093.056 - 43408.444	43404.576 - 43918.004

比較執行 100 次的結果，avx_prefetch_transpose 的執行速度是比較快的，但是卻沒有贏太多，這裡可以同樣參照　Performance of SSE and AVX Instruction Sets 中 Asynchronous Data Transfer 的執行結果比較，avx 跟 sse 在有 prefetch 的狀況下，效能是相近的

調整矩陣大小做驗證

為了進一步驗證 sse 與 avx 之間的效能比較，所以針對長度從 1024 到 8192 的矩陣做不同方法的效能比較，矩陣長度以 1024 為單位做遞增，其結果如下

第一張圖可以很直覺看出各個方法的效能差異，sse 與 avx 的效能在不同矩陣大小下也保持著前述的結論：

Data Reuse：avx > sse
Asynchronous Data Transfer：avx ~= sse

而第二張圖的 y 軸使用 logscale，主要原因是第一張圖會給人成長倍率為：naive > sse > avx > sse prefetch = avx prefetch 的錯覺，但是 y 軸用成 logscale 後就可以看出其成長倍率是相近的

參考資料

intel intrinsics guide
Intel® 64 and IA-32 Architectures Developer’s Manual: Vol. 3B
software-pipling 共筆
 stackoverflow – Transpose an 8x8 float using AVX/AVX2
_MM_SHUFFLE question
Performance of SSE and AVX Instruction Sets