contributed by <carolc0708>
Definitions of key terms
Honestly I still don't fully understand the difference between SW prefetching and HW prefetching @@
SW refers to what the instructor wrote, _mm_prefetch, which we control ourselves; HW is done directly in the hardware, usually deducing which address to prefetch next from the access history, so HW prefetching needs training. — Yen-Kuan Wu
==> Goal: to provide guidelines for inserting prefetch intrinsics in the presence of a hardware prefetcher and, more importantly, to identify promising new directions for automated cooperative software/hardware prefetching schemes
baseline: only compiler inserted software prefetchers without any hardware prefetching
(1) What are the limitations and overheads of software prefetching?
(2) What are the limitations and overheads of hardware prefetching?
(3) When is it beneficial to use software and/or hardware prefetching?
When software prefetching targets short array streams, irregular memory address patterns, and L1 cache miss reduction, there is an overall positive impact (the paper demonstrates this with code examples).
Note that, in this article, we refer to unit-stride cache-line accesses as streams and access stride distances greater than two cache lines as strided.
stream & stride?
HW prefetching ==> fetch in stream or arbitrary stride
A stream means the prefetch fetches cache lines contiguously; a stride means each fetch jumps a fixed number of cache lines ahead. — Yen-Kuan Wu
software prefetch distance: the distance ahead of which a prefetch should be requested on a memory address. The paper gives the minimum distance as D ≥ ⌈l/s⌉, where l is the prefetch latency and s is the length of the shortest path through the loop body.
if it is too large, prefetched data could evict useful cache blocks, and the elements in the beginning of the array may not be prefetched, leading to less coverage and more cache misses. Hence the prefetch distance has a significant effect on the overall performance of the prefetcher.
direct memory indexing can be easily prefetched by hardware since the memory addresses show regular stream/stride behavior, but indirect memory indexing requires special hardware prefetching mechanisms.
indirect indexing is relatively simpler to compute in software. Thus, we expect software prefetching to be more effective than hardware prefetching for the indirect indexing case
I don't quite understand this @@
Simply put, it's a[i] vs. a[b[i]]: the first is plain direct indexing and the second is indirect — the fetched address jumps around. This also answers your question above about stream vs. arbitrary. — Yen-Kuan Wu
Huge thanks!!! ^^ — Carol Chen
Negative Training:
Harmful Software Prefetching: When software prefetch requests are incorrect or early, the increased stress on cache and memory bandwidth can further reduce the effectiveness of hardware prefetching.
prefetch candidate: any load whose L1 misses per thousand instructions (MPKI) exceeds 0.05.
prefetch distance: Distance = (K × L × IPC_bench) / W_loop, where K is a constant factor, L is an average memory latency, IPC_bench is the profiled average IPC of each benchmark, and W_loop is the average instruction count in one loop iteration.
not for complex data structures
Instruction Overhead
Software Prefetching Overhead
What is cache pollution (caused by early or incorrect prefetches)?
A prefetch replaces a line that is requested later. — Carol Chen
Simply put, the cache gets stuffed with data at the wrong time. "Early" means the data is fetched too soon, so by the time it is needed it has already been replaced; "incorrect" means loading data that will never be used. Both increase cache misses — in other words, they pollute the cache. — Yen-Kuan Wu
why use IPC for performance evaluation?
Static Distance vs. Machine Configuration
Cache-Level Insertion Policy
"In general, an out-of-order processor can easily tolerate L1 misses, so hardware prefetchers usually insert prefetched blocks into the last level cache to reduce the chance of polluting the L1 cache."==>why??
I don't quite understand the concept of the training effect @@
Prefetch Coverage
Prefetching Classification
I don't quite understand how this procedure works @@
prologue/epilogue transformations??
(1) Hardware prefetchers can under-exploit even regular access patterns, such as streams. Short streams are particularly problematic, and software prefetching is frequently more effective in such cases.
(2) The software prefetching distance is relatively insensitive to the hardware configuration. The prefetch distance does need to be set carefully, but as long as the prefetch distance is greater than the minimum distance, most applications will not be sensitive to the prefetch distance.
(3) Although most L1 cache misses can be tolerated through out-of-order execution, when the L1 cache miss rate is much higher than 20%, reducing L1 cache misses by prefetching into the L1 cache can be effective.
(4) The overhead of useless prefetching instructions is not very significant. Most applications that have prefetching instructions are typically limited by memory operations. Thus, having some extra instructions (software prefetching instructions) does not increase execution time by much.
(5) Software prefetching can be used to train a hardware prefetcher and thereby yield some performance improvement. However, it can also degrade performance severely, and therefore must be done judiciously if at all.
naive_transpose:
Performance counter stats for './main' (100 runs):
13,664,142 cache-misses # 75.001 % of all cache refs ( +- 0.03% )
18,218,644 cache-references ( +- 0.15% )
1,565,542,213 instructions # 0.97 insns per cycle ( +- 0.01% )
1,608,864,478 cycles ( +- 0.27% )
0.464768873 seconds time elapsed ( +- 0.29% )
sse_transpose:
Performance counter stats for './main_sse' (100 runs):
4,470,699 cache-misses # 79.313 % of all cache refs ( +- 0.17% )
5,636,790 cache-references ( +- 0.53% )
1,343,263,111 instructions # 1.27 insns per cycle ( +- 0.08% )
1,055,228,135 cycles ( +- 0.35% )
0.304981652 seconds time elapsed ( +- 0.37% )
sse_prefetch_transpose:
Performance counter stats for './main_pre' (100 runs):
4,519,376 cache-misses # 78.251 % of all cache refs ( +- 0.12% )
5,775,457 cache-references ( +- 0.13% )
1,384,727,378 instructions # 1.64 insns per cycle ( +- 0.00% )
843,211,849 cycles ( +- 0.38% )
0.243784674 seconds time elapsed ( +- 0.41% )
Observations: compare against naive, or against sse_transpose. Also examine _MM_HINT_T1 and PFDIST.
About prefetch in SSE:
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1)
is defined in the Intel Intrinsics Guide as follows:
Synopsis
void _mm_prefetch (char const* p, int i)
#include "xmmintrin.h"
Instruction: prefetchnta, prefetcht0, prefetcht1, prefetcht2
CPUID Flags: SSE
Description
Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
So that's what PFDIST means in the original program!! The locality hint i has 4 possible values: _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA.
T0 (temporal data) – prefetch data into all cache levels.
T1 (temporal data with respect to first level cache) – prefetch data into all cache levels except the 0th level <fetch the data into every cache level except L1>.
T2 (temporal data with respect to second level cache) – prefetch data into all cache levels except the 0th and 1st levels.
NTA (non-temporal data with respect to all cache levels) – prefetch data into a non-temporal cache structure. (This hint can be used to minimize pollution of caches.)
About PFDIST:
In the original code, PFDIST is defined somewhat differently from the prefetch distance in the paper, but conceptually they are the same. Here PFDIST is set to 8, and each loop iteration processes 4 ints, so the real effect is to prefetch 8/4 = 2 iterations ahead.
The paper defines the (software) prefetch distance as "the distance ahead of which a prefetch should be requested on a memory address": D ≥ ⌈l/s⌉, where l is the prefetch latency and s is the length of the shortest path through the loop body. The figure in the reference material shows this very clearly.
Conclusions and working hypotheses:
avx_transpose():
void avx_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
__m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
__m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
__m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
__m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
__m256i T0 = _mm256_unpacklo_epi32(I0, I1);
__m256i T1 = _mm256_unpacklo_epi32(I2, I3);
__m256i T2 = _mm256_unpackhi_epi32(I0, I1);
__m256i T3 = _mm256_unpackhi_epi32(I2, I3);
I0 = _mm256_unpacklo_epi64(T0, T1);
I1 = _mm256_unpackhi_epi64(T0, T1);
I2 = _mm256_unpacklo_epi64(T2, T3);
I3 = _mm256_unpackhi_epi64(T2, T3);
_mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0);
_mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1);
_mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2);
_mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
avx_prefetch_transpose():
I feel a bit uneasy about this — not sure whether the implementation has flaws @@
From the earlier clz homework you should already have a mechanism for verifying output correctness; you should present it here too. You can use the pure-C program's input and output to validate the SSE/AVX implementations. — jserv
Thanks! I'll try it later tonight ^^ — Carol Chen
Use naive_transpose() to check whether the result is correct.
void avx_prefetch_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
__m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
__m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
__m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
__m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
__m256i T0 = _mm256_unpacklo_epi32(I0, I1);
__m256i T1 = _mm256_unpacklo_epi32(I2, I3);
__m256i T2 = _mm256_unpackhi_epi32(I0, I1);
__m256i T3 = _mm256_unpackhi_epi32(I2, I3);
I0 = _mm256_unpacklo_epi64(T0, T1);
I1 = _mm256_unpackhi_epi64(T0, T1);
I2 = _mm256_unpacklo_epi64(T2, T3);
I3 = _mm256_unpackhi_epi64(T2, T3);
_mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0);
_mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1);
_mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2);
_mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
avx_transpose:
Performance counter stats for './main_avx' (100 runs):
5,308,277 cache-misses # 79.449 % of all cache refs ( +- 0.08% )
6,681,339 cache-references ( +- 0.18% )
1,340,262,129 instructions # 1.23 insns per cycle ( +- 0.03% )
1,087,092,611 cycles ( +- 0.36% )
0.314026976 seconds time elapsed ( +- 0.38% )
avx_prefetch_transpose:
Performance counter stats for './main_avx_pre' (100 runs):
5,605,493 cache-misses # 77.948 % of all cache refs ( +- 0.10% )
7,191,278 cache-references ( +- 0.14% )
1,393,902,294 instructions # 1.45 insns per cycle ( +- 0.00% )
961,599,446 cycles ( +- 0.35% )
0.277883340 seconds time elapsed ( +- 0.38% )
The results are not much different from the SSE version @@
The figure in the instructor's reference "4x4 transpose in SSE2" makes it easy to see why a SIMD transpose is written this way.
__m256i T0 = _mm256_unpacklo_epi32(I0, I1) interleaves, 32 bits at a time and within each 128-bit lane, the low elements of I0 and I1 into T0; unpackhi handles the high elements of each lane. Turning to the 8x8 matrix, the conversion is predictably not that simple anymore.
for ( int x = 0; x < w; x += 8 ) {
for ( int y = 0; y < h; y += 8 ) {
__m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
__m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
__m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
__m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
__m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
__m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
__m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
__m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
__m256i T0 = _mm256_unpacklo_epi32(I0, I1);
__m256i T1 = _mm256_unpackhi_epi32(I0, I1);
__m256i T2 = _mm256_unpacklo_epi32(I2, I3);
__m256i T3 = _mm256_unpackhi_epi32(I2, I3);
__m256i T4 = _mm256_unpacklo_epi32(I4, I5);
__m256i T5 = _mm256_unpackhi_epi32(I4, I5);
__m256i T6 = _mm256_unpacklo_epi32(I6, I7);
__m256i T7 = _mm256_unpackhi_epi32(I6, I7);
I0 = _mm256_unpacklo_epi64(T0, T2);
I1 = _mm256_unpackhi_epi64(T0, T2);
I2 = _mm256_unpacklo_epi64(T1, T3);
I3 = _mm256_unpackhi_epi64(T1, T3);
I4 = _mm256_unpacklo_epi64(T4, T6);
I5 = _mm256_unpackhi_epi64(T4, T6);
I6 = _mm256_unpacklo_epi64(T5, T7);
I7 = _mm256_unpackhi_epi64(T5, T7);
T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
T1 = _mm256_permute2x128_si256(I1, I5, 0x20);
T2 = _mm256_permute2x128_si256(I2, I6, 0x20);
T3 = _mm256_permute2x128_si256(I3, I7, 0x20);
T4 = _mm256_permute2x128_si256(I0, I4, 0x31);
T5 = _mm256_permute2x128_si256(I1, I5, 0x31);
T6 = _mm256_permute2x128_si256(I2, I6, 0x31);
T7 = _mm256_permute2x128_si256(I3, I7, 0x31);
_mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
_mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
_mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
_mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
_mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
_mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5);
_mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6);
_mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7);
}
}
I tried to work this code out on paper and still don't quite understand why it is done this way @@ — especially the result of permute():
T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
My understanding is: put the low 128 bits of I0 into the low 128 bits of T0, and the low 128 bits of I4 into the high 128 bits of T0. But the results differ a lot @@
So I looked for another approach on Stack Overflow:
__t0 = _mm256_unpacklo_ps(row0, row1);
__t1 = _mm256_unpackhi_ps(row0, row1);
__t2 = _mm256_unpacklo_ps(row2, row3);
__t3 = _mm256_unpackhi_ps(row2, row3);
__t4 = _mm256_unpacklo_ps(row4, row5);
__t5 = _mm256_unpackhi_ps(row4, row5);
__t6 = _mm256_unpacklo_ps(row6, row7);
__t7 = _mm256_unpackhi_ps(row6, row7);
__tt0 = _mm256_shuffle_ps(__t0,__t2,_MM_SHUFFLE(1,0,1,0));
__tt1 = _mm256_shuffle_ps(__t0,__t2,_MM_SHUFFLE(3,2,3,2));
__tt2 = _mm256_shuffle_ps(__t1,__t3,_MM_SHUFFLE(1,0,1,0));
__tt3 = _mm256_shuffle_ps(__t1,__t3,_MM_SHUFFLE(3,2,3,2));
__tt4 = _mm256_shuffle_ps(__t4,__t6,_MM_SHUFFLE(1,0,1,0));
__tt5 = _mm256_shuffle_ps(__t4,__t6,_MM_SHUFFLE(3,2,3,2));
__tt6 = _mm256_shuffle_ps(__t5,__t7,_MM_SHUFFLE(1,0,1,0));
__tt7 = _mm256_shuffle_ps(__t5,__t7,_MM_SHUFFLE(3,2,3,2));
row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);
row1 = _mm256_permute2f128_ps(__tt1, __tt5, 0x20);
row2 = _mm256_permute2f128_ps(__tt2, __tt6, 0x20);
row3 = _mm256_permute2f128_ps(__tt3, __tt7, 0x20);
row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);
row5 = _mm256_permute2f128_ps(__tt1, __tt5, 0x31);
row6 = _mm256_permute2f128_ps(__tt2, __tt6, 0x31);
row7 = _mm256_permute2f128_ps(__tt3, __tt7, 0x31);
I'll study this part later @@
PFDIST also has to be adjusted accordingly: change it from 8 to 16. With PFDIST = 8, 8/4 = 2 (each loop handles four ints), i.e., prefetching 2 iterations ahead; to keep that, PFDIST becomes 8*2 = 16.
avx_transpose:
Performance counter stats for './main_avx' (100 runs):
3,660,681 cache-misses # 69.870 % of all cache refs ( +- 0.11% )
5,239,311 cache-references ( +- 0.10% )
1,261,624,058 instructions # 1.44 insns per cycle ( +- 0.00% )
875,580,047 cycles ( +- 0.25% )
0.253157364 seconds time elapsed ( +- 0.27% )
avx_prefetch_transpose:
Performance counter stats for './main_avx_pre' (100 runs):
3,863,180 cache-misses # 69.442 % of all cache refs ( +- 0.14% )
5,563,190 cache-references ( +- 0.10% )
1,284,699,571 instructions # 1.52 insns per cycle ( +- 0.00% )
846,756,431 cycles ( +- 0.29% )
0.245009690 seconds time elapsed ( +- 0.33% )
Compared with avx_transpose(), the cycle count dropped by about 30,000,000.
locality hint:
T0 (temporal data)
–prefetch data into all cache levels.
T1 (temporal data with respect to first level cache)
–prefetch data in all cache levels except 0th cache level <fetch the data into every cache level except L1>
T2 (temporal data with respect to second level cache)
–prefetch data in all cache levels, except 0th and 1st cache levels.
NTA (non-temporal data with respect to all cache levels)
–prefetch data into non-temporal cache structure. (This hint can be used to minimize pollution of caches.)
decode-dimms: I followed the reference here, but it failed ~~ @@
So instead, download mlcv3.1a.tgz from the official site and extract it: $ tar -zxvf mlcv3.1a.tgz
Under the Linux/ directory there are two executables, mlc and mlc_avx512; my CPU does not support AVX-512, so I use mlc.
$ sudo ./mlc
This produced an error message @@
tried to open /dev/cpu/0/msr
Cannot access MSR module. Possibilities include not having root privileges, or MSR module not inserted (try 'modprobe msr', and refer README)
As the message suggests, after $ sudo modprobe msr, running mlc again works.
The results are below:
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
Memory node
Socket 0
0 53.4
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 10770.2
3:1 Reads-Writes : 10047.3
2:1 Reads-Writes : 9765.2
1:1 Reads-Writes : 9274.6
Stream-triad like: 9805.0
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node
Socket 0
0 11037.1
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 177.28 10649.6
00002 192.86 10994.8
00008 167.25 10948.4
00015 159.44 10612.9
00050 154.25 10795.7
00100 140.75 10143.9
00200 128.83 7893.9
00300 105.37 5952.7
00400 100.31 4776.8
00500 86.84 4003.8
00700 88.70 3163.5
01000 86.94 2478.6
01300 85.03 2088.2
01700 82.19 1816.9
02500 82.56 1500.1
03500 72.52 1369.3
05000 66.17 1288.9
09000 88.03 929.9
20000 78.31 905.1
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 26.4
Local Socket L2->L2 HITM latency 25.7
Try running ./mlc while the program is executing:
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
Memory node
Socket 0
0 77.9
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 9720.8
3:1 Reads-Writes : 8869.7
2:1 Reads-Writes : 8909.7
1:1 Reads-Writes : 8313.2
Stream-triad like: 9235.2
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node
Socket 0
0 9383.3
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 323.97 9325.1
00002 187.46 8582.3
00008 337.34 9331.5
00015 177.40 7845.7
00050 246.49 8772.4
00100 296.18 7506.0
00200 243.75 5463.3
00300 221.81 4152.5
00400 154.30 3104.2
00500 221.95 2651.5
00700 190.65 1968.7
01000 158.13 1544.2
01300 235.15 1242.4
01700 153.25 1097.1
02500 149.58 901.2
03500 163.42 734.9
05000 91.15 885.4
09000 191.48 472.6
20000 222.20 354.6
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 32.5
Local Socket L2->L2 HITM latency 37.6
The numbers changed ~~ it really is working!! If I haven't misunderstood, the memory latency I wanted is roughly between 90 ns and 320 ns.
==> Treat the average as 200 ns when calculating.
I'm really not sure whether I've missed anything @@ If you find flaws, please point them out ~><
naive_transpose() even reached 92%; I don't think it's a coincidence, since over that many runs it had never been this high before.
naive: 266798 us
naive per iteration: 0.015902 us
sse: 119329 us
sse per iteration: 0.113801 us
==> With an average memory latency of about 200 ns, and each SSE loop iteration taking about 100 ns to execute, the computed prefetch distance really is about 2!!
==> So prefetching 2 iterations ahead has a solid basis ~~
sse prefetch: 63210 us
sse prefetch per iteration: 0.060282 us
avx: 62812 us
avx per iteration: 0.239609 us
==> By the same calculation, AVX could prefetch just 1 iteration ahead, i.e., PFDIST can stay at 8.
avx prefetch: 66318 us
avx prefetch per iteration: 0.252983 us
Data to be added — I haven't understood this part yet @@
==> I see a bunch of scatters and gathers and don't know how to set the parameters.
First, figure out what scatter and gather actually are.
Reference: Intel® 64 and IA-32 Architectures Optimization Reference Manual:
Data Gather