2017q1 Homework3 (software-pipelining)

# 2017q1 Homework3 (software-pipelining) contributed by < `xdennisx` > ## 論文研讀閱讀論文：[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf) 論文中前五大點[這裡](https://hackmd.io/s/HJtfT3icx)都有做大部分的解釋 ### Case Studies #### Positive Group - 433.milc 下面程式碼是有關一個 3x3 的矩陣運算 ![](https://i.imgur.com/GadFayJ.png) `milc` 大部分的 cache-miss 都是發生在存取小規模的陣列，因為這些 matrix access 都是屬於 **short stream**，所以無法用 HW prefetch 去訓練它，因此他在 function call 的附近做了 SW prefetch 的動作，有了 SW prefetch，速度甚至比最好的 HW prefetch 快了 2.3 倍。 - 459.GemsFDTD `GemsFDTD` 大部分的問題出在 **complex indirect memory accesses**，像下面的變數 m 就是一個 indirect index，而 HW prefetch 對這種 irregular index 的加速效能有限，在這邊只增加了 12%(stream prefetcher)，然而 SW prefetch 卻能降低超過 50% 的執行時間(與 Baseline 相比) ![](https://i.imgur.com/E45NwuE.png) - 429.mcf `mcf` 有密集的記憶體操作，他也使用了 array of pointer 的操作，由於 array of pointer 也是屬於 irregular behavior，所以 HW prefetcher 也不會表現太好。 ![](https://i.imgur.com/LCXHP54.png) 這邊將 `arc` structure 做 prefetch，利用下一個 array index 去計算 next address，這是 HW prefetcher 做不到的。而使用 SW+STR 可以讓原本只有 STR 的效能提升 180% - 470.lbm `lbm` 計算兩個 3-D array，這讓 GHB 跟 STR 可以有效預測 address，即使如此，實驗過後還是發現 SW prefetcher 還是可以表現的比較好，大大提升 cache hit ratio。另外發現 `lbm` 對 prefetch distance 非常敏感，下圖顯示 SW prefetching 可達到將近 100% coverage，因為幾乎所有 `srcGrid` 的元素都會被 prefetch 到 ![](https://i.imgur.com/JO8hKhX.png) 這張圖說明不同的 distance 對 performance 的影響 ![](https://i.imgur.com/EoIzbrx.png) 下圖說明一個 prefetch distance unit 取決於下一個 loop iteration ![](http://i.imgur.com/JAv02Rl.png) - 462.libquantum `libquantum` 主要存取的方式都是 **fixed-stride streaming**，所以 SW 跟 HW 都可以預測 `reg->node` 的 address，可是 SW 可以做到 1.23 倍速的提升，因為 HW 有太多的 L1 cache miss penalties。 ![](http://i.imgur.com/KbRgcwp.png) 下圖顯示因為有太多的 L1 cache miss，所以 execution cycles 跟 on-chip L2 traffics 都會增加。GHB 在 SW 的訓練下，因為產生過多沒用的 prefetch，造成效能下降 55%。 ![](http://i.imgur.com/Hi1pyRh.png) - 403.gcc `gcc` 屬於 RDS，是非常 irregular，而且他的 access 是 indirect，藉由 insert prefetch 我們得到 14% 的效能提升，下面是程式碼片斷 ![](http://i.imgur.com/vY8BwUe.png) - 434.zeusmp `zeusmp` 也是一個 3-D 的計算，每次 iteration 都會存取每個 dimension 的所有鄰近點，`v3` 可以同時被 HW 跟 SW prefetch，SW 降低了 L1 cache misses 而 HW 除去了一些 late prefetches，分別提升了 10%,4% 的效能 ![](http://i.imgur.com/i4P9ail.png) #### Neutral Group - 401.bzip2 `bzip2` 的核心演算法是[Burrows–Wheeler Transform](https://zh.wikipedia.org/wiki/Burrows-Wheeler%E5%8F%98%E6%8D%A2)，當中有很多的 branch，所以 HW prefetchers 沒有什麼用處，SW 在這裡也沒有表現的很好 ![](http://i.imgur.com/035S5NL.png) 下圖顯示當中有很多的 early prefetch，不過 software prefetching 還是能提升 8% 左右的效能 ![](http://i.imgur.com/oYPHNNf.png) - 436.cactusADM `cactusADM` 也是計算 3-D 陣列，每次 iteration 都會存取一個維度鄰近的點，下圖在 `alp`、`ADM_kxx_stag_p` 插入 prefetch，雖然這裡的 data strcture 都是很好預測的，GHB 的表現還是比 STR 好很多，因為有太多的 in-flight stream，不過 SW prefetcher 可以縮小他們之間的差距(5%)。GHB 經過 training 之後可以減少 7% 的 late prefetcher ![](http://i.imgur.com/Vl2xV9m.png) - 450.soplex `soplex` 中大部份 cahce-miss 發生在 C++ class overloaded operators 跟沒有迴圈的 member function，因為不知道實際的 data structure 長怎樣，所以不知道要在哪邊插入 prefetcher，但是他的 memory access 有一定程度的規律，所以 HW prefetching 就變得相對有用。 ![](http://i.imgur.com/gGBe2FF.png) 上圖顯示雖然 SW perfetching 是有用的，但是他被歸類在 neutral group 是因為 bandwidth consumption 以及太多錯誤的 SW prefetches - 410.bwaves #### Negative Group - 482.sphinx3 `sphinx3` 有許多不同的資料存取方式，像是 RDS, array of pointers, regular strides，不過主要是 unit-stride streams，因此 HW prefetchers 很適合用在這上面。因為 prefetch timeliness 和小陣列的關係，SW prefetchers 不會用到每一個 data structure 上。 - 437.leslie3d ![](https://i.imgur.com/vuvjEuX.png) `leslie3d` 有非常規律的 stride，但因為每個 iteration 之間的運算很受限，非得要用大的 prefetch distance，造成過多的 instruction 反而抵消原本預期帶來的利益。 ## 測試環境 ```shell dennis@dennis-X550CC:~/prefetcher$ lscpu Architecture: x86_64 CPU 作業模式： 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數：2 每通訊端核心數：2 Socket(s): 1 NUMA 節點： 1 供應商識別號： GenuineIntel CPU 家族： 6 型號： 58 Model name: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz 製程： 9 CPU MHz： 993.585 CPU max MHz: 2700.0000 CPU min MHz: 800.0000 BogoMIPS: 3592.01 虛擬： VT-x L1d 快取： 32K L1i 快取： 32K L2 快取： 256K L3 快取： 3072K NUMA node0 CPU(s): 0-3 ``` ## 最初實驗數據 ### 時間 ``` naive: 250935 us sse: 124485 us sse prefetch: 65676 us ``` ### 用 **perf stat** 看一下cache-miss ``` Performance counter stats for './naive_transpose' (100 runs): 19,145,519 cache-misses # 95.049 % of all cache refs ( +- 0.13% ) 20,142,743 cache-references ( +- 0.04% ) 1,449,345,026 instructions # 1.17 insns per cycle ( +- 0.01% ) 1,238,612,771 cycles ( +- 0.66% ) 0.492766897 seconds time elapsed ( +- 0.73% ) Performance counter stats for './sse_transpose' (100 runs): 6,959,105 cache-misses # 91.276 % of all cache refs ( +- 0.52% ) 7,624,244 cache-references ( +- 0.26% ) 1,237,818,237 instructions # 1.38 insns per cycle ( +- 0.01% ) 899,769,987 cycles ( +- 0.75% ) 0.358673348 seconds time elapsed ( +- 0.81% ) Performance counter stats for './sse_prefetch_transpose' (100 runs): 6,752,137 cache-misses # 89.026 % of all cache refs ( +- 0.22% ) 7,584,432 cache-references ( +- 0.08% ) 1,283,460,940 instructions # 1.75 insns per cycle ( +- 0.01% ) 733,322,586 cycles ( +- 0.29% ) 0.286447261 seconds time elapsed ( +- 0.46% ) ``` 好像用了 prefetch，cache-miss 就會下降的樣子，而且時間也有明顯下降。不過用 perf 所測得的 cache-miss 是**整體**的，而 prefetch 所減少的是 **L1-dcache cache misses** ### L1-cache-test ``` Performance counter stats for './sse_transpose' (100 runs): 8,513,522 L1-dcache-load-misses ( +- 0.04% ) 4,136,589 L1-dcache-store-misses ( +- 0.29% ) 65,288 L1-dcache-prefetch-misses ( +- 0.38% ) 94,785 L1-icache-load-misses ( +- 5.73% ) 0.345807079 seconds time elapsed ( +- 0.60% ) Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,517,476 L1-dcache-load-misses ( +- 0.02% ) 4,175,545 L1-dcache-store-misses ( +- 0.19% ) 64,775 L1-dcache-prefetch-misses ( +- 0.43% ) 82,900 L1-icache-load-misses ( +- 4.33% ) 0.284601743 seconds time elapsed ( +- 0.50% ) ``` ㄜ....所以這是沒有減少的意思嗎... 我嘗試不同的 prefetch distance 看看 ``` D = 4 Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,527,637 L1-dcache-load-misses ( +- 0.03% ) 4,130,140 L1-dcache-store-misses ( +- 0.36% ) 68,948 L1-dcache-prefetch-misses ( +- 0.31% ) 88,588 L1-icache-load-misses ( +- 4.91% ) 0.294001054 seconds time elapsed ( +- 0.58% ) D = 16 Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,552,080 L1-dcache-load-misses ( +- 0.03% ) 4,115,989 L1-dcache-store-misses ( +- 0.41% ) 64,701 L1-dcache-prefetch-misses ( +- 0.42% ) 92,911 L1-icache-load-misses ( +- 5.27% ) 0.304326302 seconds time elapsed ( +- 0.72% ) D = 32 Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,580,343 L1-dcache-load-misses ( +- 0.03% ) 4,080,988 L1-dcache-store-misses ( +- 0.41% ) 65,730 L1-dcache-prefetch-misses ( +- 0.38% ) 112,045 L1-icache-load-misses ( +- 4.02% ) 0.301808613 seconds time elapsed ( +- 0.54% ) D = 64 Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,595,820 L1-dcache-load-misses ( +- 0.03% ) 4,089,623 L1-dcache-store-misses ( +- 0.32% ) 65,383 L1-dcache-prefetch-misses ( +- 0.38% ) 107,591 L1-icache-load-misses ( +- 4.24% ) 0.303709076 seconds time elapsed ( +- 0.52% ) ``` D = 8 的時候是最好的 > 之後該來探討為啥 8 是最好的 ### Raw Counter 在一些系統架構下，有一些 counter 沒有在 `perf list` 裡面，如果是 Intel CPU 的話可以看 [Intel® 64 and IA-32 Architectures Developer’s Manual: Vol. 3B的19章](http://www.intel.com/content/www/us/en/processors/core/core-technical-resources.html)，要對應到自己的 CPU 型號查 mask number 及 event number。我的 CPU 是 Intel Core i5-3337U，是屬於第三代的，對應可用的 Raw Counter 是 **LOAD_HIT_PRE.SW_PF** 跟 **LOAD_HIT_PRE.HW_PF** ![](https://i.imgur.com/gsLCQVL.png) #### 測試(這邊 D = 8) - `perf stat -e r014c ./raw_counter_sse` ``` Performance counter stats for './sse_transpose' (100 runs): 309 r014c ( +- 2.23% ) 0.344172960 seconds time elapsed ( +- 0.48% ) ``` - `perf stat -e r014c ./raw_counter_sse_prefetch` ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 656,289 r014c ( +- 2.72% ) 0.288038377 seconds time elapsed ( +- 0.47% ) ``` - `perf stat -e r024c ./time_sse` ``` Performance counter stats for './sse_transpose' (100 runs): 599,852 r024c ( +- 0.34% ) 0.347760797 seconds time elapsed ( +- 0.65% ) ``` - `perf stat -e r024c ./time_sse_prefetch` ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 614,884 r024c ( +- 0.79% ) 0.293914743 seconds time elapsed ( +- 0.82% ) ``` 整體來說，prefetch 還是有增加 cache-hit 的次數 ## SSE/AVX intrinsic 這裡有一個很不一樣的是 `__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)`，利用控制 `imm8` 可以達到 unpack_ep128 的效果 - Operation ``` SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0 ``` > 發現我的電腦根本不能用 AVX2，WTH 該買新電腦了 ## Reference - [吳彥寬學長筆記](https://hackmd.io/s/ryTASBCT) - [kaizsv](https://hackmd.io/IwVmCMGMCYFMGYC0A2e5mICwAYDs5EBDAMwBMAORWU8ATl2N2EhHNiA=?view)

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.