# 2016q3 Homework3 (software-pipelining)
contributed by <`carolc0708`>

[Assignment A08: Software Pipelining](https://hackmd.io/s/ry7eqDEC)

## Prefetching
### References
* [Data Prefetch Mechanisms](http://www.cs.ucy.ac.cy/courses/EPL605/Fall2014Files/DataPrefetchSurvey.pdf)
* [Caches and Prefetching](http://web.cecs.pdx.edu/~alaa/courses/ece587/spring2012/notes/prefetching.pdf)
* [Prefetching](http://www.cc.gatech.edu/~hyesoon/fall11/lec_pref.pdf)

---

* Key terms
    * Prefetch hit: a prefetched line that is hit in the cache before being replaced (a miss avoided)
    * Prefetch miss: a prefetched line that is replaced before ever being accessed
    * Prefetch rate: prefetches per instruction (or per 1000 instructions)
    * Accuracy: the percentage of all prefetches that are prefetch hits
    * Coverage: the percentage of misses avoided thanks to prefetching
        * 100 x (Prefetch Hits / (Prefetch Hits + Cache Misses))

## [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
### Finished reading the whole paper; full notes (in Chinese) are [here](https://hackmd.io/s/rk8ivCT0)
* software prefetching
* hardware prefetching

> I still don't quite get the difference between SW prefetching and HW prefetching @@
>> SW prefetching is the `_mm_prefetch` the instructor wrote — it is under our control. HW prefetching is done directly by the hardware, which usually infers the next line to prefetch from the access history, so a HW prefetcher needs training. [name=Yen-Kuan Wu]

### 1. INTRODUCTION
* First, there are few rigorous guidelines on how best to insert prefetch intrinsics. Second, the complexity of the interaction between software and hardware prefetching is not well understood.
==> The goal is to provide guidelines for inserting prefetch intrinsics in the presence of a hardware prefetcher and, more importantly, to identify promising new directions for automated cooperative software/hardware prefetching schemes.
* Baseline: only compiler-inserted software prefetches, without any hardware prefetching.
    (1) What are the limitations and overheads of software prefetching?
    (2) What are the limitations and overheads of hardware prefetching?
    (3) When is it beneficial to use software and/or hardware prefetching?
When software prefetching targets short array streams, irregular memory address patterns, and L1 cache miss reduction, there is an overall positive impact, which the paper demonstrates with code examples. Note that, in this article, unit-stride cache-line accesses are called *streams*, and access-stride distances greater than two cache lines are called *strided*.

> stream & stride?
> HW prefetching ==> fetches in a stream or at an arbitrary stride
> A *stream* means the prefetcher grabs consecutive cache lines, while a *stride* means each prefetch skips a fixed number of cache lines ahead. [name=Yen-Kuan Wu]

### 2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
#### 2.1 Software Prefetch Intrinsics
#### 2.2 Prefetch Classification
#### 2.3 Software Prefetch Distance
* **software prefetch distance**: the distance ahead of which a prefetch should be requested on a memory address.
![](https://i.imgur.com/aCfmkqd.png)
*where l is the prefetch latency and s is the length of the shortest path through the loop body*
* If the distance is too large, prefetched data may evict useful cache blocks, and the elements at the beginning of the array may not be prefetched, leading to less coverage and more cache misses. Hence the prefetch distance has a significant effect on the overall performance of the prefetcher.
#### 2.4 Direct and Indirect Memory Indexing
* Direct memory indexing can be easily prefetched by hardware, since the memory addresses show regular stream/stride behavior, but indirect memory indexing requires special hardware prefetching mechanisms. Indirect indexing is relatively simpler to compute in software, so we expect software prefetching to be more effective than hardware prefetching in the indirect-indexing case.

> I don't quite get this @@
> Simply put, it is `a[i]` versus `a[b[i]]`: the first is plainly direct, the second indirect — the addresses it fetches jump around. This also answers your earlier question: stream vs. arbitrary stride. [name=Yen-Kuan Wu]
> Many thanks!!! ^^ [name=Carol Chen]

---
### 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
#### 3.1 Benefits of SW prefetching
* Large number of streams: the number of streams a stream prefetcher can track is limited by hardware resources.
* Short streams: if a stream is extremely short, there are not enough cache misses to train a hardware prefetcher and load useful cache blocks.
* Irregular memory accesses: irregular data structures.
* Cache locality hints: greater flexibility in placing data at the right cache level; each prefetch request carries a prefetch hint (Table II).
* Loop bounds: can be handled with little effort via loop unrolling, software pipelining, or branch instructions.
#### 3.2 Negative impacts of SW prefetching
* Increased instruction count: prefetches consume fetch and execution bandwidth and machine resources.
* Static insertion: cannot adapt to runtime behavioral changes.
* Code structure change
#### 3.3 Synergistic effects of SW & HW prefetching
* Handling multiple streams: e.g., a hardware prefetcher might cover the regular streams while software prefetching covers the irregular ones.
* Positive training: if a block prefetched by software arrives late, a trained hardware prefetcher can improve prefetch timeliness.
#### 3.4 Antagonistic effects of SW & HW prefetching
* Negative training:
    * If blocks prefetched by software hide part of one or more streams, the hardware prefetcher will not be trained properly.
    * Software prefetch instructions can trigger overly aggressive hardware prefetches, resulting in early requests.
* Harmful software prefetching: when software prefetch requests are incorrect or early, the increased stress on cache and memory bandwidth further reduces the effectiveness of hardware prefetching.

---
### 4. EXPERIMENTAL METHODOLOGY
#### 4.1 Prefetch Intrinsic Insertion Algorithm
* Prefetch candidate: any load whose L1 misses per thousand instructions (MPKI) exceeds 0.05.
* Prefetch distance: ![](https://i.imgur.com/gnnjX6l.png)
  where K is a constant factor, L is the average memory latency, IPC_bench is the profiled average IPC of each benchmark, and W_loop is the average instruction count in one loop iteration.
    * Chose K = 4, L = 300; IPC and W_loop come from profile data ==> profile-driven
    * Not applicable to complex data structures
#### 4.2 Simulation Methodology
* MacSim: evaluates software and hardware prefetching effects
* Pinpoints: selects a representative simulation region for each benchmark using the reference input set, so the same section of code is simulated across different binaries

---
### 5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
#### 5.1 Overhead and Limitations of Software Prefetching
* Instruction overhead
    * intrinsic instructions + extra instructions for handling indirect memory accesses and index calculations
    * GemsFDTD's instruction count increases by more than 50%, yet the performance impact is still positive
* Software prefetching overhead
    * To evaluate it, **eliminate** in turn the overhead of cache pollution (SW+P), bandwidth consumption (SW+B, between core and memory), memory access latency (SW+L), redundant prefetches (SW+R), and instruction overhead (SW+I)
    * SW+P (small): caused by early or incorrect prefetches
    * SW+B (small): current machines provide enough bandwidth for single-threaded applications
    * SW+L (large): software prefetching does not completely hide the memory latency
    * SW+R (small): the negative effect of redundant prefetch instructions is generally negligible
    * SW+I (small): even when the instruction count increases a lot

> What is cache pollution (caused by early or incorrect prefetches)?
> A prefetch replaces a line that is requested later [name=Carol Chen]
>> Simply put, the cache gets filled with ill-timed data. *Early* means the data was fetched too soon, so it has already been replaced by the time it is needed; *incorrect* means data that is never used was loaded. Both increase cache misses — they pollute the cache. [name=Yen-Kuan Wu]

* The effect of prefetch distance
![](https://i.imgur.com/MWe4YIx.png)
The x-axis shows the prefetch distance relative to the base prefetch distance.
> Why use IPC for performance evaluation?

![](https://i.imgur.com/Uo9JRId.png)
* Static distance vs. machine configuration
    * the optimal distance may vary by machine
    * for evaluation, compare the best-performing static distance across three machine configurations (base, less-aggressive, and aggressive processors)
    * observation: the static prefetch distance does not impact performance significantly, even if the machine configuration changes at runtime
* Cache-level insertion policy
    * Since software prefetching is generally accurate, prefetched blocks can safely be inserted directly into the L1 cache without much risk of polluting it.
    * **T0** always performs better than the other hints in all evaluated benchmarks
> "In general, an out-of-order processor can easily tolerate L1 misses, so hardware prefetchers usually insert prefetched blocks into the last-level cache to reduce the chance of polluting the L1 cache." ==> why??
#### 5.2 Effect of Using Hardware and Software Prefetching Together
* Hardware prefetcher training effects
    * NT (SW+GHB, SW+STR): the hardware prefetcher's training algorithm ignores software prefetch requests; it is trained only by demand cache misses.
    * Train (SW+GHB+T, SW+STR+T): the hardware prefetcher's training algorithm includes software prefetch requests, in addition to the usual demand misses.
> I don't quite get the concept of the training effect @@

* Prefetch coverage
    * coverage: the ratio of useful prefetches to total L2 misses
    * Overall, the coverage of software prefetching is higher than that of the hardware prefetchers, but for benchmarks in the neutral and negative groups the software coverage is lower ==> ==lower coverage== is the main reason for the performance loss in the neutral and negative groups
* Prefetching classification
    * Eliminating the latency overhead (SW+L) improves the performance of mcf by 9%.
#### 5.3 Using the GCC Compiler
> I don't quite understand how this step was carried out @@
* The gcc compiler inserts more software prefetches than icc, except for mcf and lbm.
#### 5.4 Real System Experiments
#### 5.5 Profile-Guided Optimization (PGO)
#### 5.6 Hardware Prefetcher for Short Streams
* Evaluate ASD (an abstraction of a HW prefetcher) along with software and hardware prefetching
* Although ASD can predict short-length streams, it still has to observe the first instances of a memory stream.
* Software prefetching is much more effective at prefetching short streams than ASD.
#### 5.7 Content Directed Prefetching
* CDP: targets linked and other irregular data structures
#### 5.8 Manual Tuning
* Remove useless prefetches, adjust the prefetch distance of each software prefetch, and apply additional overhead-reduction techniques, such as *loop unrolling* and *prologue/epilogue transformations*, for each benchmark.
> prologue/epilogue transformations??
* The insertion algorithm is already good enough to serve as a guideline for inserting software prefetch intrinsics.
#### 5.9 Summary of New Findings
(1) Hardware prefetchers can under-exploit even regular access patterns, such as streams. Short streams are particularly problematic, and software prefetching is frequently more effective in such cases.
(2) The software prefetching distance is relatively ==insensitive to the hardware configuration==.
The prefetch distance does need to be set carefully, but as long as it is greater than the minimum distance, most applications are not sensitive to it.
(3) Although most L1 cache misses can be tolerated through out-of-order execution, ==when the L1 cache miss rate is much higher than 20%, reducing L1 cache misses by prefetching into the L1 cache can be effective.==
(4) The overhead of useless prefetch instructions is not very significant. Most applications that have prefetch instructions are typically limited by memory operations, so a few extra (software prefetch) instructions do not increase execution time by much.
(5) Software prefetching can be used to train a hardware prefetcher and thereby yield some performance improvement. However, it can also degrade performance severely, and therefore must be done judiciously, if at all.

---
### 6. CASE STUDIES: SOURCE CODE ANALYSIS OF INDIVIDUAL BENCHMARKS
![](https://i.imgur.com/PLIFfor.png)
#### 6.1 Positive Group
#### 6.2 Neutral Group
#### 6.3 Negative Group

---
### 7. RELATED WORK

---
### 8. CONCLUSION AND FUTURE WORK

## Performance differences among naive_, sse_, and sse_prefetch_
* Modified the Makefile to evaluate performance with perf:

`naive_transpose`:
```
 Performance counter stats for './main' (100 runs):

        13,664,142 cache-misses        #  75.001 % of all cache refs   ( +- 0.03% )
        18,218,644 cache-references                                    ( +- 0.15% )
     1,565,542,213 instructions        #   0.97  insns per cycle       ( +- 0.01% )
     1,608,864,478 cycles                                              ( +- 0.27% )

       0.464768873 seconds time elapsed                                ( +- 0.29% )
```
`sse_transpose`:
```
 Performance counter stats for './main_sse' (100 runs):

         4,470,699 cache-misses        #  79.313 % of all cache refs   ( +- 0.17% )
         5,636,790 cache-references                                    ( +- 0.53% )
     1,343,263,111 instructions        #   1.27  insns per cycle       ( +- 0.08% )
     1,055,228,135 cycles                                              ( +- 0.35% )

       0.304981652 seconds time elapsed                                ( +- 0.37% )
```
`sse_prefetch_transpose`:
```
 Performance counter stats for './main_pre' (100 runs):

         4,519,376 cache-misses        #  78.251 % of all cache refs   ( +- 0.12% )
         5,775,457 cache-references                                    ( +- 0.13% )
     1,384,727,378 instructions        #   1.64  insns per cycle       ( +- 0.00% )
       843,211,849 cycles                                              ( +- 0.38% )

       0.243784674 seconds time elapsed                                ( +- 0.41% )
```
* Observations
    * Besides the higher IPC (instructions per cycle), the prefetching version also shows a large drop in cycle count.
    * Cache misses barely differ, whether compared against `naive` or against `sse_transpose`.
    ==> This got me thinking about the relationship between prefetching and cache misses, and about two suspicious knobs: the locality hint `_MM_HINT_T1` and `PFDIST`.
    ==> I went back and reread the earlier references carefully; my understanding is recorded below @@
* About prefetch in SSE:
    * `_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1)` is defined in the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,AVX,AVX2&text=prefetch&expand=4060,4078,4061,3861,3877,4060) as follows:
:::info
void _mm_prefetch (char const* p, int i)

**Synopsis**
void _mm_prefetch (char const* p, int i)
#include "xmmintrin.h"

**Instruction**: prefetchnta / prefetcht0 / prefetcht1 / prefetcht2

**CPUID Flags**: SSE

**Description**
Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
:::
* The argument `p` is the memory address of the data to prefetch ==> once this is clear, the meaning of `PFDIST` in the original program is obvious!!
* The argument `i` is the locality hint, one of 4 values: `_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2`, and `_MM_HINT_NTA`.
    * `T0 (temporal data)` -- prefetch data into all cache levels.
    * `T1 (temporal data with respect to first level cache)` -- prefetch data into all cache levels except the 0th (i.e., everywhere but the L1 cache).
    * `T2 (temporal data with respect to second level cache)` -- prefetch data into all cache levels except the 0th and 1st.
    * `NTA (non-temporal data with respect to all cache levels)` -- prefetch data into a non-temporal cache structure. (This hint can be used to minimize cache pollution.)
* About `PFDIST`:
    * In the original code, `PFDIST` is defined slightly differently from the paper's prefetch distance, but the concept is the same. Here `PFDIST` is 8, and each loop iteration processes 4 ints, so the real effect is prefetching 8/4 = 2 iterations ahead.
    * The paper defines the (software) prefetch distance as "the distance ahead of which a prefetch should be requested on a memory address."
![](https://i.imgur.com/aCfmkqd.png)
*where **l** is the **prefetch latency** and **s** is **the length of the shortest path through the loop body***
    * This figure from the references makes it very clear:
![](https://i.imgur.com/OuyQWUj.png)
    * In the figure, loading an element takes twice as long as one iteration's computation (to get these times in practice, look up the memory latency in cycles and the cycle count of the instructions inside the loop).
    * So a prefetch distance of 2 means prefetching 2 iterations before the data is needed, loading it from main memory into the cache ahead of time. The distance must be at least 2 here for the data to arrive by the iteration that actually uses it.
    ==> Of course, prefetching too early is also a problem — the cache pollution the paper mentions (currently useless data wasting cache space), or later loads evicting data before it is ever used, causing cache misses all over again.
* Conclusions and hypotheses to test:
    * Investigate how the different ==locality hints== perform at run time
    * Look up the memory latency and the execution time of the instructions in the loop to ==estimate an effective prefetch distance==. Is 2 iterations ahead optimal? How much difference does prefetching too early or too late make?
==> I don't know whether the original "2 iterations ahead" was actually calculated, but I feel any change to it should be backed by some reasoning @@

## Using AVX to improve performance
* Converted the SSE version to AVX (in the intuitive way), cross-checking with the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)

`avx_transpose()`:
```C=
void avx_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T2 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
            I0 = _mm256_unpacklo_epi64(T0, T1);
            I1 = _mm256_unpackhi_epi64(T0, T1);
            I2 = _mm256_unpacklo_epi64(T2, T3);
            I3 = _mm256_unpackhi_epi64(T2, T3);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```

`avx_prefetch_transpose()`:
* Found that dedicated prefetch intrinsics only appear in AVX-512, but my CPU supports only up to AVX2, so I borrow the SSE `_mm_prefetch` for now @@
* [Checking CPU info](http://www.binarytides.com/linux-cpu-information/): find out which instruction sets your own CPU supports

> Doing this feels shaky — I'm not sure whether it has flaws @@
>> In the earlier clz assignment you should already have built a mechanism to verify output correctness; present it here as well. You can use the pure-C program's input and output to validate the SSE/AVX implementations [name=jserv]
>> Thanks! I'll try it later ^^ [name=Carol Chen]

* Following the instructor's advice, I added a `transpose_verify()` to check whether the SSE and AVX transposes are wrong
    * It simply calls `naive_transpose()` and compares the results
    * All results verify as correct~~~

```C=
void avx_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T2 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
            I0 = _mm256_unpacklo_epi64(T0, T1);
            I1 = _mm256_unpackhi_epi64(T0, T1);
            I2 = _mm256_unpacklo_epi64(T2, T3);
            I3 = _mm256_unpackhi_epi64(T2, T3);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```

* Performance analysis after converting to AVX

`avx_transpose`:
```
 Performance counter stats for './main_avx' (100 runs):

         5,308,277 cache-misses        #  79.449 % of all cache refs   ( +- 0.08% )
         6,681,339 cache-references                                    ( +- 0.18% )
     1,340,262,129 instructions        #   1.23  insns per cycle       ( +- 0.03% )
     1,087,092,611 cycles                                              ( +- 0.36% )

       0.314026976 seconds time elapsed                                ( +- 0.38% )
```
`avx_prefetch_transpose`:
```
 Performance counter stats for './main_avx_pre' (100 runs):

         5,605,493 cache-misses        #  77.948 % of all cache refs   ( +- 0.10% )
         7,191,278 cache-references                                    ( +- 0.14% )
     1,393,902,294 instructions        #   1.45  insns per cycle       ( +- 0.00% )
       961,599,446 cycles                                              ( +- 0.35% )

       0.277883340 seconds time elapsed                                ( +- 0.38% )
```
The results are not much different from the SSE version @@

---
* Referring to [周曠宇's notes](https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh) on the AVX transpose, I realized my original version did not exploit AVX's ability to process 256 bits at a time at all — it merely re-implemented the SSE version with AVX instructions.
* It took some time to understand the code; notes below:

This figure from the instructor's reference [4x4 transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html):
![](https://i.imgur.com/uebouDZ.png)
makes it clear why the SIMD transpose is written this way.
* `__m256i T0 = _mm256_unpacklo_epi32(I0, I1)` interleaves the low 128 bits of I0 and I1 into T0 in 32-bit units; `unpackhi` handles the high 128 bits.

Moving on to an 8x8 matrix, the conversion is understandably no longer that simple:

```C=
for (int x = 0; x < w; x += 8) {
    for (int y = 0; y < h; y += 8) {
        __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
        __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
        __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
        __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
        __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
        __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
        __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
        __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
        __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
        __m256i T1 = _mm256_unpackhi_epi32(I0, I1);
        __m256i T2 = _mm256_unpacklo_epi32(I2, I3);
        __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
        __m256i T4 = _mm256_unpacklo_epi32(I4, I5);
        __m256i T5 = _mm256_unpackhi_epi32(I4, I5);
        __m256i T6 = _mm256_unpacklo_epi32(I6, I7);
        __m256i T7 = _mm256_unpackhi_epi32(I6, I7);
        I0 = _mm256_unpacklo_epi64(T0, T2);
        I1 = _mm256_unpackhi_epi64(T0, T2);
        I2 = _mm256_unpacklo_epi64(T1, T3);
        I3 = _mm256_unpackhi_epi64(T1, T3);
        I4 = _mm256_unpacklo_epi64(T4, T6);
        I5 = _mm256_unpackhi_epi64(T4, T6);
        I6 = _mm256_unpacklo_epi64(T5, T7);
        I7 = _mm256_unpackhi_epi64(T5, T7);
        T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
        T1 = _mm256_permute2x128_si256(I1, I5, 0x20);
        T2 = _mm256_permute2x128_si256(I2, I6, 0x20);
        T3 = _mm256_permute2x128_si256(I3, I7, 0x20);
        T4 = _mm256_permute2x128_si256(I0, I4, 0x31);
        T5 = _mm256_permute2x128_si256(I1, I5, 0x31);
        T6 = _mm256_permute2x128_si256(I2, I6, 0x31);
        T7 = _mm256_permute2x128_si256(I3, I7, 0x31);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6);
        _mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7);
    }
}
```

I tried working this code out on paper, but still don't quite see why it is done this way @@ — especially the result of `permute()`
![](https://i.imgur.com/psEgfuw.png)
> T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
> My understanding is that the low 128 bits of I0 go into T0's low 128 bits,
> and the low 128 bits of I4 go into T0's high 128 bits,
> but the results differ a lot @@

So I looked for another [approach](http://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2) on Stack Overflow:
```C=
__t0 = _mm256_unpacklo_ps(row0, row1);
__t1 = _mm256_unpackhi_ps(row0, row1);
__t2 = _mm256_unpacklo_ps(row2, row3);
__t3 = _mm256_unpackhi_ps(row2, row3);
__t4 = _mm256_unpacklo_ps(row4, row5);
__t5 = _mm256_unpackhi_ps(row4, row5);
__t6 = _mm256_unpacklo_ps(row6, row7);
__t7 = _mm256_unpackhi_ps(row6, row7);
__tt0 = _mm256_shuffle_ps(__t0,__t2,_MM_SHUFFLE(1,0,1,0));
__tt1 = _mm256_shuffle_ps(__t0,__t2,_MM_SHUFFLE(3,2,3,2));
__tt2 = _mm256_shuffle_ps(__t1,__t3,_MM_SHUFFLE(1,0,1,0));
__tt3 = _mm256_shuffle_ps(__t1,__t3,_MM_SHUFFLE(3,2,3,2));
__tt4 = _mm256_shuffle_ps(__t4,__t6,_MM_SHUFFLE(1,0,1,0));
__tt5 = _mm256_shuffle_ps(__t4,__t6,_MM_SHUFFLE(3,2,3,2));
__tt6 = _mm256_shuffle_ps(__t5,__t7,_MM_SHUFFLE(1,0,1,0));
__tt7 = _mm256_shuffle_ps(__t5,__t7,_MM_SHUFFLE(3,2,3,2));
row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);
row1 = _mm256_permute2f128_ps(__tt1, __tt5, 0x20);
row2 = _mm256_permute2f128_ps(__tt2, __tt6, 0x20);
row3 = _mm256_permute2f128_ps(__tt3, __tt7, 0x20);
row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);
row5 = _mm256_permute2f128_ps(__tt1, __tt5, 0x31);
row6 = _mm256_permute2f128_ps(__tt2, __tt6, 0x31);
row7 = _mm256_permute2f128_ps(__tt3, __tt7, 0x31);
```
> I'll study this part later @@

* Results of the corrected AVX version:
    * Importantly, to compare against SSE, `PFDIST` must be adjusted accordingly.
    * To get an effect equivalent to the SSE version here, ==`PFDIST` should change from 8 to 16==: in the SSE version `PFDIST = 8`, and 8/4 (four ints handled per iteration) = 2, i.e., prefetching 2 iterations ahead; the AVX version handles 8 ints per iteration, so prefetching 2 iterations ahead requires `PFDIST` = 8*2 = 16.

`avx_transpose`:
```
 Performance counter stats for './main_avx' (100 runs):

         3,660,681 cache-misses        #  69.870 % of all cache refs   ( +- 0.11% )
         5,239,311 cache-references                                    ( +- 0.10% )
     1,261,624,058 instructions        #   1.44  insns per cycle       ( +- 0.00% )
       875,580,047 cycles                                              ( +- 0.25% )

       0.253157364 seconds time elapsed                                ( +- 0.27% )
```
`avx_prefetch_transpose`:
```
 Performance counter stats for './main_avx_pre' (100 runs):

         3,863,180 cache-misses        #  69.442 % of all cache refs   ( +- 0.14% )
         5,563,190 cache-references                                    ( +- 0.10% )
     1,284,699,571 instructions        #   1.52  insns per cycle       ( +- 0.00% )
       846,756,431 cycles                                              ( +- 0.29% )

       0.245009690 seconds time elapsed                                ( +- 0.33% )
```
* Observations:
    * Compared with the non-prefetching `avx_transpose()`, the cycle count drops by about 30,000,000.
    * Both corrected AVX versions really do improve performance; their cache-miss rate is about 10 percentage points lower than the SSE versions.

## Tuning attempt 1: changing the locality hint
* locality hints (see the `_mm_prefetch` notes above): `T0`, `T1`, `T2`, `NTA`

## Tuning attempt 2: adjusting PFDIST
* To know what prefetch distance to set, we need the memory latency and the execution time of the instructions in the loop.
* Not sure it can really be measured, but I was curious how memory latency is measured; searching turned up two methods:
    * Install i2c-tools, then run `decode-dimms` (see [here](http://www.richud.com/wiki/Ubuntu_See_Live_RAM_Timings_Decode_DIMMS)). This failed for me~~ @@
    * Use the [Intel® Memory Latency Checker v3.1a](https://software.intel.com/en-us/articles/intelr-memory-latency-checker): usage is documented quite clearly on the official page.
      Download `mlcv3.1a.tgz` from the official site and extract it with `$ tar -zxvf mlcv3.1a.tgz`.
      Under `Linux/` there are two executables, `mlc` and `mlc_avx512`; my CPU does not support AVX-512, so I use `mlc`.
      `$ sudo ./mlc` gives an error message @@
```
tried to open /dev/cpu/0/msr
Cannot access MSR module.
Possibilities include not having root privileges, or MSR module not inserted
(try 'modprobe msr', and refer README)
```
As it suggests, after `$ sudo modprobe msr`, running `mlc` again works. Results:
```
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
	Memory node
Socket	     0
     0	  53.4

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        : 10770.2
3:1 Reads-Writes : 10047.3
2:1 Reads-Writes : 9765.2
1:1 Reads-Writes : 9274.6
Stream-triad like: 9805.0

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
	Memory node
Socket	     0
     0	11037.1

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	177.28	 10649.6
 00002	192.86	 10994.8
 00008	167.25	 10948.4
 00015	159.44	 10612.9
 00050	154.25	 10795.7
 00100	140.75	 10143.9
 00200	128.83	  7893.9
 00300	105.37	  5952.7
 00400	100.31	  4776.8
 00500	 86.84	  4003.8
 00700	 88.70	  3163.5
 01000	 86.94	  2478.6
 01300	 85.03	  2088.2
 01700	 82.19	  1816.9
 02500	 82.56	  1500.1
 03500	 72.52	  1369.3
 05000	 66.17	  1288.9
 09000	 88.03	   929.9
 20000	 78.31	   905.1

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	26.4
Local Socket L2->L2 HITM latency	25.7
```

---
* Try running `./mlc` while the transpose program is executing. Results:
```
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
	Memory node
Socket	     0
     0	  77.9

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        : 9720.8
3:1 Reads-Writes : 8869.7
2:1 Reads-Writes : 8909.7
1:1 Reads-Writes : 8313.2
Stream-triad like: 9235.2

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
	Memory node
Socket	     0
     0	9383.3

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	323.97	  9325.1
 00002	187.46	  8582.3
 00008	337.34	  9331.5
 00015	177.40	  7845.7
 00050	246.49	  8772.4
 00100	296.18	  7506.0
 00200	243.75	  5463.3
 00300	221.81	  4152.5
 00400	154.30	  3104.2
 00500	221.95	  2651.5
 00700	190.65	  1968.7
 01000	158.13	  1544.2
 01300	235.15	  1242.4
 01700	153.25	  1097.1
 02500	149.58	   901.2
 03500	163.42	   734.9
 05000	 91.15	   885.4
 09000	191.48	   472.6
 20000	222.20	   354.6

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	32.5
Local Socket L2->L2 HITM latency	37.6
```
The numbers changed~~ it really is measuring the running program!!
If I'm not mistaken, the memory latency I wanted is roughly between `90 (ns) ~ 320 (ns)` ==> treat the average as `200 (ns)` for calculations.
> I'm really not sure whether I missed anything @@; please point out any flaws~ ><

* Going back to the program's perf results: cache misses jumped overall by more than 10 percentage points!! `naive_transpose()` even reached 92%. This seems no accident — it was never that high over all the previous runs.
* My guess is what the official page mentions: MLC **disables the hardware prefetcher** during its run via the MSR HW Prefetcher Control, so a performance drop is to be expected.
==> So this is the legendary hardware prefetcher; worth digging up more material on it.
==> Official reference: [Disclosure of H/W prefetcher control on some Intel processors](https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors): roughly explains how the MSR controls the hardware prefetcher.
* But the performance drop never recovered: after MLC finished, I expected the cache-miss rate to return to around 70%, yet it stayed at 80-90%.
==> Two possibilities: either the HW prefetcher really was changed (not re-enabled afterwards, or not fully), or it relates to the training effect when HW & SW prefetchers are used together (could that really be it?!)
==> Will read up on HW prefetchers and study this part further @@

---
* Back to the original question — still missing the execution time (cycle count) of each loop iteration ==> function execution time / number of loop iterations = time per iteration.
* Execution times of all versions, with average time per loop iteration [note: 1 μs = $10^3$ ns]:
```
naive: 266798 us
naive per iteration: 0.015902 us
```
```
sse: 119329 us
sse per iteration: 0.113801 us
```
==> With an average memory latency of about `200 (ns)` and an SSE loop iteration of about `100 (ns)`, the computed prefetch distance really is about 2!!
==> Prefetching 2 iterations ahead is justified~~
```
sse prefetch: 63210 us
sse prefetch per iteration: 0.060282 us
```
```
avx: 62812 us
avx per iteration: 0.239609 us
```
==> By the same computation, AVX could prefetch just 1 iteration ahead; keeping `PFDIST` at 8 suffices.
```
avx prefetch: 66318 us
avx prefetch per iteration: 0.252983 us
```

## Intel Intrinsics Prefetch usage and concepts
* Supplementary notes from my reading, even though my CPU doesn't support AVX-512 ><
> Material to be added — I haven't understood this part yet @@
### AVX-512 Prefetch
![](https://i.imgur.com/Cr2Suqu.png)
==> Seeing a pile of scatter and gather intrinsics, with no idea how to pass the parameters
* First, understand what scatter and gather are. From the [Intel® 64 and IA-32 Architectures Optimization Reference Manual](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf):
    * Data gather
        * The gather operation reads elements from an input buffer based on indexes specified in an index buffer.
        * The gathered elements are written to an output buffer.
![](https://i.imgur.com/4fHYOrM.png)
    * Data scatter
        * The scatter operation reads elements from an input buffer sequentially.
        * It then writes them to an output buffer based on indexes specified in an index buffer.
![](https://i.imgur.com/whRVC3S.png)