# 2017q1 Homework3 (software-pipelining)
contributed by <`chenweiii`>
## Hardware Environment
* Operating System: Ubuntu 16.04.2 LTS (64-bit)
* CPU: [Intel i7-6700 @ 3.4 GHz](http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-6700.html)
* Cache:
    * L1d cache: 32 KB, 8-way set associative
    * L1i cache: 32 KB, 8-way set associative
    * L2 cache: 256 KB, 4-way set associative
    * L3 cache: 8 MB, 16-way set associative
    * Cache alignment: 64 B
* Memory: 32 GB
## Paper Reading
> Parts of this section are excerpted from [notes and commentary on the paper "When Prefetching Works, When It Doesn't, and Why"](https://hackmd.io/s/HJtfT3icx#)
### Summary
![](https://i.imgur.com/pwHxvmJ.png)
* Note how Speedup is computed: for each benchmark, the best-performing `SW + HW prefetcher` configuration is compared against the `HW prefetcher` alone. The benchmarks are therefore ordered left to right by how well each performs when software and hardware prefetching coexist.
* Observe that, except for gcc, bzip2, and soplex, adding intrinsics improves performance substantially.
> The text says the experiments use three binaries and three simulated hardware prefetching schemes, but the chart only shows GHB & STR, not a third scheme; likewise only the baseline and SW binaries appear, so I don't know what the third binary is.
* Intel's optimization guidelines point out that SW prefetching has a positive effect for short arrays, continuous but irregular memory accesses, and reducing L1 cache misses, which is consistent with the data analyzed in this paper.
> But I don't recall the paper specifically analyzing these three cases; maybe I missed it.
* The goal of this paper is to discuss the interaction between SW & HW prefetchers; it also observes that in some benchmarks the SW prefetcher over-trains the HW prefetcher, causing a large negative impact.
<br>
### Background on SW & HW prefetching
![](https://i.imgur.com/yN0hMjp.png)
* Different data structures suit different prefetching mechanisms; for example, Recursive Data Structures (RDS) are better handled by a SW prefetcher than by a HW prefetcher.
* Cache-line access classification:
    * stream: unit-stride cache-line accesses
    * strided: access stride distances greater than two cache lines
* Stream prefetchers, GHB prefetchers, and content-based prefetchers are all hardware-based prefetching mechanisms, but this experiment adopts only the first two, which have been successfully commercialized.
> Not sure whether my understanding here is correct. Still reading up on GHB.
#### Prefetch Classification
![](https://i.imgur.com/JCu3psd.png)
* A prefetch is classified according to where the demand access time falls on this axis.
<br>
![](https://i.imgur.com/dFYj7Eo.png)
* Redundant_dc: the cache block to be prefetched is already in the cache.
> How can the number of prefetches in each category be observed?
* Miss status holding registers (MSHR)
* 參考 [cornell memory system 的講義](https://jontse.com/courses/files/cornell/ece5730/Lecture04.pdf)
* When a miss occurs, the MSHRs are looked up to determine if the cache block is already being fetched.
* Each MSHR holds information for a primary miss.
* If an MSHR hits, then a secondary miss has occurred (a primary miss to the same line is outstanding).
> Still not clear what role the MSHR plays in a miss-under-miss cache design, or how the process works. Stuck... but the paper does not discuss MSHRs any further, so I'll look for more articles to read later.
#### Software Prefetch Distance
* Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency.
* If the prefetch distance is too large, prefetched data can evict useful cache blocks, and the elements at the beginning of the array may not be prefetched, leading to less coverage and more cache misses.
* Prefetch coverage
    * The ratio of useful prefetches to total L2 misses, i.e. the fraction of cache misses avoided by prefetching:
    * 100 × (Prefetch Hits / (Prefetch Hits + Cache Misses))
> How can the prefetch-hit counts be observed?
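As a sanity check on the definition above, the coverage formula can be written out directly. This is a trivial helper of my own for illustration; the numbers in the comment are made up.

```c
/* Prefetch coverage: the percentage of would-be misses that prefetching
 * turned into hits, per the formula above.
 * e.g. 8,000 useful prefetches against 2,000 remaining misses -> 80%. */
double coverage(double prefetch_hits, double cache_misses)
{
    return 100.0 * prefetch_hits / (prefetch_hits + cache_misses);
}
```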
#### Direct and Indirect Memory Indexing
* **Direct memory indexing** can be easily prefetched by hardware since the memory addresses show regular stream/stride behavior.
* **Indirect indexing** is relatively simpler to compute in software, but it usually has a higher overhead than direct memory indexing.
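To make the contrast concrete, here is a minimal sketch of my own (not from the paper): a stride-1 sum whose address pattern a hardware prefetcher detects on its own, versus an indirect sum `b[idx[i]]` where only software can compute the upcoming address. The distance `DIST` is an assumed value.

```c
#include <xmmintrin.h>

#define DIST 16  /* prefetch distance in elements (assumed; tune per machine) */

/* Direct indexing: a[i] has a fixed stride, easy for a HW prefetcher. */
long sum_direct(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                 /* stride-1: HW prefetcher handles this */
    return s;
}

/* Indirect indexing: the future address depends on idx[], so software
 * computes it and issues the prefetch itself. */
long sum_indirect(const int *b, const int *idx, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)          /* SW prefetch of the future address */
            _mm_prefetch((const char *) &b[idx[i + DIST]], _MM_HINT_T0);
        s += b[idx[i]];
    }
    return s;
}
```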
### Pros and cons of SW prefetching, and the effect of HW + SW
> I'm curious whether this summary comes from other literature or is drawn from this paper's own experimental results.
#### Benefits of SW prefetching over HW
* Large Number of Streams (Limited Hardware Resources)
* Short Streams
* Irregular Memory Access
* Cache Locality Hint
* Most hardware prefetchers place prefetched data in a lower-level cache (L2 or L3), but a software prefetcher can use a locality hint to place data directly into the L1 cache.
* Lower-level prefetch block insertion greatly reduces higher-level cache pollution, but the L2-to-L1 latency can degrade performance significantly.
* So the SW prefetching mechanism has greater flexibility in placing data at the right cache level.
* Loop Bounds
* Several methods, such as loop unrolling, software pipelining, and using branch instructions, can **prevent generating prefetch requests out of array bounds** in software.
* However, overly aggressive prefetching results in early or incorrect prefetch requests, in addition to **consuming memory bandwidth**.
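The loop-splitting idea above can be sketched as follows (my own example; `PFDIST` is an assumed distance): the main loop issues only in-bounds prefetches, and a short epilogue finishes the array without any, so no per-iteration bounds branch is needed around the prefetch.

```c
#include <xmmintrin.h>

#define PFDIST 8   /* prefetch distance, assumed for illustration */

/* Split the loop so prefetch requests never go past the array bounds. */
long sum_with_bounded_prefetch(const int *a, int n)
{
    long s = 0;
    int i = 0;
    for (; i + PFDIST < n; i++) {   /* main body: prefetches stay in bounds */
        _mm_prefetch((const char *) &a[i + PFDIST], _MM_HINT_T0);
        s += a[i];
    }
    for (; i < n; i++)              /* epilogue: no prefetching */
        s += a[i];
    return s;
}
```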
#### Negative Impacts of Software Prefetching
* Increased Instruction Count
* Static Insertion
* The decision to prefetch and the choice of parameters, such as **the data to be prefetched and the corresponding prefetch distance**, are made statically and therefore cannot adapt to runtime behavioral changes such as **varying memory latency, effective cache size, and bandwidth**, especially in heterogeneous architectures.
* Code Structure Change
#### Synergistic Effects when using HW + SW prefetching
* Handling Multiple Streams
* The HW prefetcher covers regular streams.
* The SW prefetcher covers irregular streams.
* Positive Training
* e.g. If a block prefetched by a software prefetcher is late, then a trained hardware prefetcher can improve prefetch timeliness.
#### Antagonistic Effects
* Negative Training
* Software prefetch requests can slow down the hardware prefetcher training.
* e.g. If prefetched blocks by software prefetching hide a part of one or more streams, the hardware prefetcher will not be trained properly.
* [Note] I don't quite understand what this example is trying to express.
* Software prefetch instructions can trigger overly aggressive hardware prefetches, which results in early requests.
* Harmful Software Prefetching
* When software prefetch requests are incorrect or early, the increased stress on cache and memory bandwidth can further reduce the effectiveness of hardware prefetching.
<br>
## Learning SSE
Reference: [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#)
> Start by learning the instructions used in the program.
> I searched all over and couldn't find a good SSE tutorial; it turns out the Intrinsics Guide already provides the clearest explanation.
### First, reading the SSE introduction written on Hotball 的小屋
* Naming convention for SSE intrinsics:
    * `_mm_<opcode>_<suffix>`
    * `<opcode>` is the instruction class, and `<suffix>` is the data type.
* SSE floating-point instructions use only two data types: ps and ss.
    * ps means Packed Single-precision: the instruction operates on all four single-precision floats in the register.
    * ss means Scalar Single-precision: the instruction operates only on DATA0 of the register.
> Still not clear what the `<suffix>` of the three instructions below means, e.g. si128, epi32.
* Before using SSE intrinsics, remember to include the xmmintrin.h header first.
* Most SSE instructions that access memory require the address to be a multiple of 16 (i.e., aligned on a 16-byte boundary).
* If the data is not 16-byte aligned, the _mm_load_ps intrinsic cannot be used to load it; use _mm_loadu_ps instead, which is designed for data not aligned on a 16-byte boundary, but it is slower.
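A small sketch of the distinction (the helper name is my own): check the alignment at runtime and pick the matching intrinsic.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* _mm_load_ps requires a 16-byte-aligned address; _mm_loadu_ps does not
 * but may be slower.  Choose based on the address's low four bits. */
__m128 load4(const float *p)
{
    if (((uintptr_t) p & 15) == 0)
        return _mm_load_ps(p);    /* aligned: potentially faster */
    return _mm_loadu_ps(p);       /* unaligned: always safe */
}
```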
> ~~Found a possible direction for improvement, but how do I make the matrix start at an address that is a multiple of 16?~~
* Depending on the result of a computation, the appropriate flags (divide-by-zero, invalid, etc.) are set, or an exception is raised. This is configured through the MXCSR register, a 32-bit flag register that controls whether each kind of exception is raised and records what happened during the last computation.
* Besides arithmetic instructions, SSE also supports some cache-control instructions: prefetch and movntps. The prefetch instruction actually comes in four variants, prefetcht0, prefetcht1, prefetcht2, and prefetchnta, all expressed through the same intrinsic, _mm_prefetch.
* The main purpose of the prefetch instruction is to have the CPU load data needed by later computation ahead of time: typically, before operating on the current data, tell the CPU to load the next batch. **The current computation and the load of the next batch can then proceed in parallel.** If the computation is complex enough, this can completely hide main-memory read latency. The different prefetch instructions tell the CPU to load the data into different cache levels. The most commonly used is prefetchnta, which loads the data into the cache closest to the CPU (usually L1 or L2 cache) and suits data that will be used within a short time.
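The mapping from the single intrinsic to the four instruction variants looks like this. These are only hints; they never change program results, so the calls below are functionally no-ops (the level comments are my rough understanding).

```c
#include <xmmintrin.h>

/* The four prefetch instructions share one intrinsic, _mm_prefetch;
 * the hint constant selects which variant is emitted. */
void prefetch_examples(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);   /* prefetcht0: into all cache levels */
    _mm_prefetch(p, _MM_HINT_T1);   /* prefetcht1: into L2 and beyond */
    _mm_prefetch(p, _MM_HINT_T2);   /* prefetcht2: into L3 */
    _mm_prefetch(p, _MM_HINT_NTA);  /* prefetchnta: non-temporal, minimizes pollution */
}
```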
### SSE intrinsics used in [impl.c]
#### void _mm_prefetch (char const* p, int i)
* Description
* Fetch **the line of data** from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
#### __m128i _mm_loadu_si128 (__m128i const* mem_addr)
* Description
* Load 128-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
* Operation
* ```dst[127:0] := MEM[mem_addr+127:mem_addr]```
#### __m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)
* Description
* Unpack and interleave 32-bit integers from the low half of a and b, and store the results in dst.
* Operation
* Split the 128 bits in half: take the low 64 bits of both a and b and interleave them into the new variable.
* Replacing lo with hi does the corresponding thing with the high halves.
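To convince myself of the semantics, a tiny demo of my own that interleaves two vectors with both unpack variants: with a = {a0,a1,a2,a3} and b = {b0,b1,b2,b3}, unpacklo yields {a0,b0,a1,b1} and unpackhi yields {a2,b2,a3,b3}.

```c
#include <emmintrin.h>

/* Load a and b, interleave their low and high halves, store the results. */
void unpack_demo(const int *a, const int *b, int *lo, int *hi)
{
    __m128i va = _mm_loadu_si128((const __m128i *) a);
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) lo, _mm_unpacklo_epi32(va, vb));
    _mm_storeu_si128((__m128i *) hi, _mm_unpackhi_epi32(va, vb));
}
```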
#### void _mm_storeu_si128 (__m128i* mem_addr, __m128i a)
* Description
* Store 128-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.
> After working through the teacher's transpose program with pen and paper, it's amazing!
<br>
## raw counter
Looked up the [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html); my processor is a 6th-generation (Skylake) part.
> Searching the raw counters here was frustrating; I don't know which one can help my analysis. I tried to find the same two events as <`yenWu`>, but couldn't find them.
<br>
## Initial Observations
### Execution time of the three versions
> 100 samples per version
```
naive: 233908 us
sse: 154229 us
sse prefetch: 42235 us
```
### Analyzing cache misses with perf
```$perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./...```
```$sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"```
#### naive impl.
```
Performance counter stats for './naive' (100 runs):
41,213,342 cache-misses # 85.814 % of all cache refs ( +- 0.29% )
48,026,565 cache-references ( +- 0.25% )
1,448,367,681 instructions # 1.55 insn per cycle ( +- 0.00% )
936,303,948 cycles ( +- 0.30% )
0.320454039 seconds time elapsed ( +- 0.45% )
```
#### sse impl.
```
Performance counter stats for './sse' (100 runs):
14,535,369 cache-misses # 75.659 % of all cache refs ( +- 0.07% )
19,211,776 cache-references ( +- 0.05% )
1,236,449,412 instructions # 1.79 insn per cycle ( +- 0.00% )
690,682,519 cycles ( +- 0.15% )
0.257315101 seconds time elapsed ( +- 0.24% )
```
#### sse_prefetch impl.
```
Performance counter stats for './sse_prefetch' (100 runs):
8,599,805 cache-misses # 65.805 % of all cache refs ( +- 0.29% )
13,068,657 cache-references ( +- 0.18% )
1,282,489,200 instructions # 2.40 insn per cycle ( +- 0.00% )
534,001,192 cycles ( +- 0.14% )
0.140952357 seconds time elapsed ( +- 0.40% )
```
<br>
## Improving the SSE version
> Improve the alignment situation for SSE loads. I later found that <`yenWu`> also wrote a section on this problem, but apparently without results so far.
### memalign - allocate aligned memory
#### int posix_memalign(void **memptr, size_t alignment, size_t size);
* Description
* The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *). If size is 0, then posix_memalign() returns either NULL, or a unique pointer value that can later be successfully passed to free(3).
#### void *memalign(size_t alignment, size_t size);
* Description
* The obsolete function memalign() allocates size bytes and returns a pointer to the allocated memory. The memory address will be a multiple of alignment, which must be a power of two.
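A minimal wrapper around posix_memalign (the helper name is my own) matching the 16-byte requirement of _mm_load_si128 / _mm_store_si128:

```c
#include <stdlib.h>

/* Allocate a 16-byte-aligned buffer of `count` ints, suitable for the
 * aligned SSE load/store intrinsics; returns NULL on failure. */
int *alloc_aligned_ints(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, count * sizeof(int)) != 0)
        return NULL;
    return (int *) p;
}
```

The buffer is released with the ordinary free(3), unlike some legacy aligned allocators.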
### Modified code
```clike=
/* main.c */
int *src = (int *) memalign(16, sizeof(int) * TEST_W * TEST_H);
int *out = (int *) memalign(16, sizeof(int) * TEST_W * TEST_H);
/* impl.c */
__m128i I0 = _mm_load_si128((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_load_si128((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_load_si128((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_load_si128((__m128i *)(src + (y + 3) * w + x));
_mm_store_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_store_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_store_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_store_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
```
### Results
```
sse: 155314 us
sse prefetch: 44440 us
```
No significant improvement.
```
Performance counter stats for './sse' (100 runs):
15,797,948 cache-misses # 82.011 % of all cache refs ( +- 0.33% )
19,263,250 cache-references ( +- 0.07% )
1,236,505,999 instructions # 1.52 insn per cycle ( +- 0.00% )
813,709,554 cycles ( +- 0.66% )
0.273664634 seconds time elapsed ( +- 0.40% )
Performance counter stats for './sse_prefetch' (100 runs):
10,186,595 cache-misses # 72.837 % of all cache refs ( +- 0.59% )
13,985,518 cache-references ( +- 0.24% )
1,282,511,076 instructions # 2.06 insn per cycle ( +- 0.00% )
623,511,430 cycles ( +- 0.49% )
0.168184185 seconds time elapsed ( +- 0.55% )
```
The cache-miss ratios of both sse & sse_prefetch are about 7% higher than in the original versions.
> I don't know why the cache misses increased. As with phonebook, I looked at where the cache-miss hotspots are, but I can't really see anything orz
<br>
## AVX version
> First I tried not to look at other people's code and dug through the Intel Intrinsics Guide for the corresponding instructions. With a width of 256 bits, each loop iteration handles an 8*8 block of the matrix, so the original verify apparently can't be used... [time=Fri, Mar 31, 2017 5:27 PM]
> The pen-and-paper work went smoothly, but there seems to be no unpack that operates in 128-bit units... [time=Fri, Mar 31, 2017 5:27 PM]
> Not as easy as imagined, since AVX's unpack works differently from SSE2's; after some fiddling I finally pieced it together (huh?) [time=Fri, Mar 31, 2017 7:41 PM]
> Debugging the AVX version involved heavy use of gdb [time=Fri, Mar 31, 2017 8:29 PM]
> >[breakpoint](http://www.delorie.com/gnu/docs/gdb/gdb_29.html)
### Added code
```clike=
/* verify whether these two matrices are identical */
static int equal(int *a, int *b, int w, int h)
{
for (int x = 0; x < w; x++)
for (int y = 0; y < h; y++)
if (*(a + x * h + y) != *(b + x * h + y))
return -1;
return 1;
}
```
Used to compare against naive_transpose and verify that the result of avx_transpose is correct.
### AVX intrinsics used
#### __m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
* Description
* Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
> This one tripped me up badly; you can't just extrapolate from SSE2, you have to read the manual carefully.
#### __m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
* Description
* Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.
* Adjust imm8 as needed; the manual's Operation section explains how to set it.
> Luckily I read about this intrinsic in <`yenWu`>'s notes, otherwise I would never have known to pick it.
### Results
```
avx: 64688 us
```
Speedup relative to the sse version: 2.4
```
Performance counter stats for './avx' (100 runs):
10,980,390 cache-misses # 72.052 % of all cache refs ( +- 0.44% )
15,239,534 cache-references ( +- 0.09% )
1,148,666,522 instructions # 2.07 insn per cycle ( +- 0.00% )
556,088,254 cycles ( +- 0.80% )
0.174747969 seconds time elapsed ( +- 0.74% )
```
<br>
## Analyzing the effect of prefetch distance on performance
> Want to try reproducing <`kaizsv`>'s experiment [time=Fri, Mar 31, 2017 8:23 PM]
> A quick test shows that different prefetch distances matter a lot; this chart must be made! [time=Fri, Mar 31, 2017 8:24 PM]
> One difficulty: to test PFDIST it has to become a passed-in parameter, but that changes the uniform transpose interface. What is the best way to handle this? [time=Fri, Mar 31, 2017 8:45 PM]
> In the end I crudely rewrote the prefetch interface, just to get the chart drawn first. [time=Fri, Mar 31, 2017 9:17 PM]
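One possible shape for the interface change is sketched below (my own simplified example, not the homework's transpose signature): take the distance as a runtime parameter, so sweeping PFDIST only needs a different argument per data point rather than a rebuild.

```c
#include <xmmintrin.h>

/* Copy src to dst while prefetching `pfdist` elements ahead.
 * pfdist == 0 disables prefetching entirely. */
void copy_with_prefetch(const int *src, int *dst, int n, int pfdist)
{
    for (int i = 0; i < n; i++) {
        if (pfdist > 0 && i + pfdist < n)
            _mm_prefetch((const char *) &src[i + pfdist], _MM_HINT_T1);
        dst[i] = src[i];
    }
}
```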
![](https://i.imgur.com/FmGaVbA.png)
## Reference
- [ ] [kaizsv](https://hackmd.io/s/r1IWtb9R#)
- [ ] [yenWu](https://hackmd.io/s/ryTASBCT#)
- [x] [SSE introduction](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_1.html)
- [ ] [A good NTU SIMD lecture](http://www.csie.ntu.edu.tw/~cyy/courses/assembly/10fall/lectures/handouts/lec17_x86SIMD.pdf)