owned this note
owned this note
Published
Linked with GitHub
# [software-pipelining](https://hackmd.io/s/ry7eqDEC)
[github](https://github.com/diana0651/prefetcher) contributed by <`Diana Ho`>
###### tags: `d0651` `sys`
## 案例分析
### 預期目標
- [ ] 學習計算機結構並且透過實驗來驗證所學
- [ ] 理解 prefetch 對 cache 的影響,從而設計實驗來釐清相關議題
- [ ] 論文閱讀和思考
### 作業要求
- [ ] 閱讀 [在計算機裡頭實踐演算法](/s/HyKtIPN0) 提到的論文: "[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)" (務必事先閱讀論文,否則貿然看程式碼,只是陷入一片茫然!),在 Linux/x86_64 (注意,要用 64-bit 系統,不能透過虛擬機器執行) 上編譯並執行 [prefetcher](https://github.com/embedded2015/prefetcher)
* 說明 `naive_transpose`, `sse_transpose`, `sse_prefetch_transpose` 之間的效能差異,以及 prefetcher 對 cache 的影響
- [ ] 在 github 上 fork [prefetcher](https://github.com/embedded2015/prefetcher),嘗試用 AVX 進一步提昇效能
* 修改 `Makefile`,產生新的執行檔,分別對應於 `naive_transpose`, `sse_transpose`, `sse_prefetch_transpose` (學習 [phonebook](s/S1RVdgza) 的做法)
* 用 perf 分析 cache miss/hit
* 學習 `perf stat` 的 raw counter 命令
* 參考 [Performance of SSE and AVX Instruction Sets](http://arxiv.org/pdf/1211.0820.pdf),用 SSE/AVX intrinsic 來改寫程式碼
---
## Prefetching
>[Data Prefetch Mechanisms](http://www.cs.ucy.ac.cy/courses/EPL605/Fall2014Files/DataPrefetchSurvey.pdf)
>[Caches and Prefetching](http://web.cecs.pdx.edu/~alaa/courses/ece587/spring2012/notes/prefetching.pdf)
>[Prefetching](http://www.cc.gatech.edu/~hyesoon/fall11/lec_pref.pdf)
預先將資料從 memory 載入 cache,再進行 CPU 運算,可以降低 CPU 等待時間進而提升效率
- 時間
載入至主記憶體的時機
- 不能太晚(cpu已經運算完,不再需要這些資料)
- 不能太早(有可能在使用之前就被踢出cache)
- 記憶體
prefetch intrinsic 提供參數可以選擇載入到哪個層級的快取記憶體
### 背景知識
* SW prefetcher:
* compiler中加強效能的algo.
* intrinsics(內聯函數) e.g. SSE中的__mm_prefetch()
* HW prefetcher: CPU 當中的 prefetcher
* 用過去的存取紀錄作為prefetch的依據
* 論文中提到的 [GHB](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf) (Global History Buffer)
* 名詞定義
* Prefetch Hit: Prefetched line that was hit in the cache
before being replaced (miss avoided)
* Prefetch Miss: Prefetched line that was replaced before
being accessed
* Prefetch rate: Prefetches per instruction (or 1000 inst.)
* Accuracy: Percentage of prefetch hits to all prefetches
* Coverage: Percentage of misses avoided due to prefetching
* 100 x (Prefetch Hits / (Prefetch Hits + Cache Misses))
>>[參考概念](https://embedded2015.hackpad.com/-Homework-7-8-XZe3c94XjUh)
>>[參考概念](https://hackmd.io/OwZgHAhmBMYgtBADAThfALCkAzeAjWAE3gFYMBTAYxwEYQN8rokg?both)
>>[參考概念](https://hackmd.io/CYFgzAxgHFEOwFoJhI8BWARgqcAMAbAugIYgkBmATAZiNZkA?both)
## 閱讀論文
[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
### 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
* benefits of SW prefetching
* large num of stream
* short stream
* Irregular memory access
* cache locality hint
* loop bound
* negative impacts of SW prefetching
* Increased Instruction Count
* Static Insertion
* Code Structure Change
* synergistic effect of SW & HW prefetching
* Handling Multiple Streams
* Positive Training
* antagonistic effect of SW & HW prefetching
* Negative Training
* Harmful Software Prefetching
---
## 程式測試
```clike=
$ ./main
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 192603 us
sse: 215824 us
naive: 365479 us
```
- 使用Perf分析cache miss
```clike=
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main
Performance counter stats for './main' (50 runs):
27,985,483 cache-misses # 88.670 % of all cache refs (33.29%)
31,806,730 cache-references (50.24%)
34,757,822 L1-dcache-load-misses # 4.34% of all L1-dcache hits (66.90%)
806,498,136 L1-dcache-loads (66.16%)
352,545,104 L1-dcache-stores (64.64%)
377,450 L1-icache-load-misses (33.11%)
0.731182840 seconds time elapsed ( +- 3.65% )
```
### 更改 Makefile
```clike=
$ make cache-test
perf stat -e cache-misses,cache-references,instructions,cycles ./naive_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
naive: 310485 us
Performance counter stats for './naive_transpose':
18,959,975 cache-misses # 93.414 % of all cache refs
20,296,767 cache-references
1,572,649,544 instructions # 0.89 insns per cycle
1,761,874,413 cycles
0.511842859 seconds time elapsed
perf stat -e cache-misses,cache-references,instructions,cycles ./sse_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse: 160067 us
Performance counter stats for './sse_transpose':
6,657,239 cache-misses # 84.130 % of all cache refs
7,913,069 cache-references
1,421,957,448 instructions # 1.13 insns per cycle
1,253,144,039 cycles
0.333203789 seconds time elapsed
perf stat -e cache-misses,cache-references,instructions,cycles ./sse_prefetch_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 85866 us
Performance counter stats for './sse_prefetch_transpose':
6,592,423 cache-misses # 83.190 % of all cache refs
7,924,577 cache-references
1,407,620,222 instructions # 1.54 insns per cycle
914,875,974 cycles
0.283419155 seconds time elapsed
```
## 效能差異
### 矩陣轉置
#### TRANSPOSE_IMPL
記憶體空間是 Row Major,所以在讀取列時是讀取連續的記憶體空間,寫入行則是寫入不連續的記憶體空間,所以矩陣轉置造成很高比例的 cache-misses.
### naive_transpose
最簡單的矩陣轉置想法最直覺的矩陣轉置,從矩陣的左上角第一個元素開始,計算每一個元素轉置之後的目的地,並且存入成為新矩陣。
```clike=
void naive_transpose(int *src, int *dst, int w, int h)
{
for(int x = 0; x < w; x++){
for(int y = 0; y < h; y++){
*(dst + x*h + y) = *(src + y*w + x);
}
}
}
```
### sse_transpose
使用了 Intel 處理器 [SIMD](https://en.wikipedia.org/wiki/SIMD) 的技術,每個 int element 為 32-bit,所以 4x4 矩陣的每一個 row 剛好可以打包成 128-bit,再透過高低位的 swap,達到轉置的效
相較之下`naive transpose` 執行次數多且需要做 jump,增加 cache-misses 機會,`sse transpose` 更快。
#### 效能改進
* 一次將 4 筆數據放入 sse 暫存器中,一條指令處理 4 筆數據,比 4 筆數據 4 條指令處理快
* loop unrolling:
* 執行 loop 循環的組合語言代碼執行次數會變少
* branch prediction miss 機率降低
> [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
```clike=
void sse_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
__m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```
:::info
[Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)
查詢函式的內部操作
- `__m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)`
把 src1, src2 的 lo 的部分依序以 32-bit 的部分移進 dst 裡
- 其餘類推
:::
:::danger
#### 理解實作
>>[參考概念](https://hackmd.io/KYQwbMBMkAxgtAVgOwE5LwCwmAY3qogIwBm8kywAzIhCCDLrkA==?both)
1. 首先,載入的資料分佈
```
0 32 64 96 128
+-------+-------+-------+-------+
I0 | I00 | I01 | I02 | I03 |
+-------+-------+-------+-------+
I1 | I10 | I11 | I12 | I13 |
+-------+-------+-------+-------+
I2 | I20 | I21 | I22 | I23 |
+-------+-------+-------+-------+
I3 | I30 | I31 | I32 | I33 |
+-------+-------+-------+-------+
```
2. 操作 `_mm_unpacklo_epi32()` 與 `_mm_unpackhi_epi32()`
```
0 32 64 96 128
+-------+-------+-------+-------+
T0 | I00 | I10 | I01 | I11 |
+-------+-------+-------+-------+
T1 | I20 | I30 | I21 | I31 |
+-------+-------+-------+-------+
T2 | I02 | I12 | I03 | I13 |
+-------+-------+-------+-------+
T3 | I22 | I32 | I23 | I33 |
+-------+-------+-------+-------+
```
3. 使用 `_mm_unpacklo_epi64()` 與 `_mm_unpackhi_epi64()` 完成轉置
```
0 32 64 96 128
+-------+-------+-------+-------+
I0 | I00 | I10 | I20 | I30 |
+-------+-------+-------+-------+
I1 | I01 | I11 | I21 | I31 |
+-------+-------+-------+-------+
I2 | I02 | I12 | I22 | I32 |
+-------+-------+-------+-------+
I3 | I03 | I13 | I23 | I33 |
+-------+-------+-------+-------+
```
:::
### sse_prefetch_transpose
> Loads one cache line of data from address p to a location closer to the processor.
比 `sse_transpose` 多使用了 4 次 [`_mm_prefetch`](https://msdn.microsoft.com/en-us/library/84szxsww(v=vs.90).aspx) 指令,prefetch 將用到的資料放在比較近的地方降低去 memory 拿資料的成本
> Fetch the line of data from memory that contains address p to a location in the cache heirarchy specified by the locality hint i.
```clike=
_mm_prefetch(char * p , int i )
# 抓取 line of data
#p : 從 address p 去讀取
#i : the type of prefetch operation: the constants _MM_HINT_T0,_MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA
```
#### 論文證明
[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
- benefits of SW prefetching
- Short Streams (16\*4 byte)
- Cache Locality Hint (預先 prefetch 到 L1cache )
- Loop Bounds (unroll 4x4 矩陣)
#### 效能改進
- 降低 cache-misses
- 降低執行時間(減少了去 memory 拿資料和 cache-misses 的時間)
```clike=
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
#define PFDIST 8
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
__m128i I0 = _mm_loadu_si128 ((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128 ((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128 ((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128 ((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```
### [SSE/AVX](http://arxiv.org/pdf/1211.0820.pdf)
#### 以 AVX 改寫 Prefetcher
依照 SSE 版本的想法來擴展成 AVX 版本
>[Introduction to Intel® Advanced Vector Extensions](https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions)
AVX 指令集是 256-bit,所以一次處理 8 個 byte,loop 一次加 8
1. 將依次將 8 行數組的元素載入到暫存器中
2. `_mm256_unpacklo/hi_epi32` 函式讀入兩個 256-bit 的數,將低/高 128-bit 以 32-bit 為單位交錯排列,`_mm256_unpacklo_epi64` 同理
- __m256i A = [ A0, A1, A2, A3, A4, A5, A6, A7 ];
- __m256i B = [ B0, B1, B2, B3, B4, B5, B6, B7 ];
- __m256i C = \_mm256_unpacklo_epi32(I0, I1) = [ A0, B0, A1, B1, A2, B2, A3, B3 ];
```clike=
void avx_transpose(int *src, int *dst, int w, int h)
{
for(int x=0; x < w ; x+= 8) {
for(int y=0; y < h ; y+=8) {
__m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
__m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
__m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
__m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
__m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
__m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
__m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
__m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
__m256i T0 = _mm256_unpacklo_epi32(I0 , I1);
__m256i T1 = _mm256_unpackhi_epi32(I0 , I1);
__m256i T2 = _mm256_unpacklo_epi32(I2 , I3);
__m256i T3 = _mm256_unpackhi_epi32(I2 , I3);
__m256i T4 = _mm256_unpacklo_epi32(I4 , I5);
__m256i T5 = _mm256_unpackhi_epi32(I4 , I5);
__m256i T6 = _mm256_unpacklo_epi32(I6 , I7);
__m256i T7 = _mm256_unpackhi_epi32(I6 , I7);
I0 = _mm256_unpacklo_epi64(T0, T2);
I1 = _mm256_unpackhi_epi64(T0, T2);
I2 = _mm256_unpacklo_epi64(T1, T3);
I3 = _mm256_unpackhi_epi64(T1, T3);
I4 = _mm256_unpacklo_epi64(T4, T6);
I5 = _mm256_unpackhi_epi64(T4, T6);
I6 = _mm256_unpacklo_epi64(T5, T7);
I7 = _mm256_unpacklo_epi64(T5, T7);
T0 = _mm256_permute2f128_si256(I0 , I4 , 0x20);
T1 = _mm256_permute2f128_si256(I1 , I5 , 0x20);
T2 = _mm256_permute2f128_si256(I2 , I6 , 0x20);
T3 = _mm256_permute2f128_si256(I3 , I7 , 0x20);
T4 = _mm256_permute2f128_si256(I0 , I4 , 0x31);
T5 = _mm256_permute2f128_si256(I1 , I5 , 0x31);
T6 = _mm256_permute2f128_si256(I2 , I6 , 0x31);
T7 = _mm256_permute2f128_si256(I3 , I7 , 0x31);
_mm256_storeu_si256((__m256i *) (dst + (x + 0) * h + y) , T0);
_mm256_storeu_si256((__m256i *) (dst + (x + 1) * h + y) , T1);
_mm256_storeu_si256((__m256i *) (dst + (x + 2) * h + y) , T2);
_mm256_storeu_si256((__m256i *) (dst + (x + 3) * h + y) , T3);
_mm256_storeu_si256((__m256i *) (dst + (x + 4) * h + y) , T4);
_mm256_storeu_si256((__m256i *) (dst + (x + 5) * h + y) , T5);
_mm256_storeu_si256((__m256i *) (dst + (x + 6) * h + y) , T6);
_mm256_storeu_si256((__m256i *) (dst + (x + 7) * h + y) , T7);
}
}
}
```
```clike=
void avx_prefetch_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 8) {
for (int y = 0; y < h; y += 8) {
#define AVX_PFDIST 8
_mm_prefetch(src + (y + AVX_PFDIST + 0) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 1) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 2) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 3) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 4) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 5) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 6) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 7) * w + x, _MM_HINT_T1);
__m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
__m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
__m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
__m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
__m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
__m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
__m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
__m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
__m256i T0 = _mm256_unpacklo_epi32(I0, I1);
__m256i T1 = _mm256_unpacklo_epi32(I2, I3);
__m256i T2 = _mm256_unpacklo_epi32(I4, I5);
__m256i T3 = _mm256_unpacklo_epi32(I6, I7);
__m256i T4 = _mm256_unpackhi_epi32(I0, I1);
__m256i T5 = _mm256_unpackhi_epi32(I2, I3);
__m256i T6 = _mm256_unpackhi_epi32(I4, I5);
__m256i T7 = _mm256_unpackhi_epi32(I6, I7);
I0 = _mm256_unpacklo_epi64(T0, T1);
I1 = _mm256_unpackhi_epi64(T0, T1);
I2 = _mm256_unpacklo_epi64(T2, T3);
I3 = _mm256_unpackhi_epi64(T2, T3);
I4 = _mm256_unpacklo_epi64(T4, T5);
I5 = _mm256_unpackhi_epi64(T4, T5);
I6 = _mm256_unpacklo_epi64(T6, T7);
I7 = _mm256_unpackhi_epi64(T6, T7);
T0 = _mm256_permute2x128_si256(I0, I2, 0x20);
T1 = _mm256_permute2x128_si256(I1, I3, 0x20);
T2 = _mm256_permute2x128_si256(I4, I6, 0x20);
T3 = _mm256_permute2x128_si256(I5, I7, 0x20);
T4 = _mm256_permute2x128_si256(I0, I2, 0x31);
T5 = _mm256_permute2x128_si256(I1, I3, 0x31);
T6 = _mm256_permute2x128_si256(I4, I6, 0x31);
T7 = _mm256_permute2x128_si256(I5, I7, 0x31);
_mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
_mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
_mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
_mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
_mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
_mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5);
_mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6);
_mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7);
}
}
}
```
:::danger
理解實作
>>[參考概念](https://hackmd.io/KYQwbMBMkAxgtAVgOwE5LwCwmAY3qogIwBm8kywAzIhCCDLrkA==?both)
:::
#### 程式實作
- 如果在 Makefile 裡試圖執行 sse_transpose 和 avx_transpose,會直接在產生目的檔時出現錯誤訊息
- 如果在 Makefile 裡不執行 sse_transpose,只執行 sse_transpose,不會直接在產生目的檔時出現錯誤訊息,但是會在將目的檔轉成編譯檔時出現`不合法的命令 (core dumped)`
- ==還沒找到解決方法==
```clike=
$./avx_transpose
make: *** [test] 不合法的命令 (core dumped)
$ make cache-test
In file included from /usr/lib/gcc/x86_64-linux-gnu/4.9/include/immintrin.h:41:0,
from main.c:9:
/usr/lib/gcc/x86_64-linux-gnu/4.9/include/avxintrin.h:896:1: error: inlining failed in call to always_inline ‘_mm256_storeu_si256’: target specific option mismatch
_mm256_storeu_si256 (__m256i *__P, __m256i __A)
^
In file included from main.c:18:0:
impl.c:115:13: error: called from here
_mm256_storeu_si256((__m256i *) (dst + (x + 7) * h + y) , T7);
^
make: *** [main] Error 1
```
- cache-miss
- 執行時間
>>[參考筆記](https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh)
>>[參考筆記](https://embedded2015.hackpad.com/-Homework-7-8-XZe3c94XjUh)
---
<style>
h2.part {color: red;}
h3.part {color: green;}
h4.part {color: blue;}
h5.part {color: black;}
</style>