# 2017q1 Homework3 (software-pipelining) contributed by <`baimao8437`> ## 論文筆記 ### 前言: 本[論文](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)研究方向有三: 1. source code-level analysis 直接說明SW、HW的優缺點 2. SW、HW之間的協同作用(synergistic)和反作用(antagonistic effects) 3. 在SW有request的情況下評估HW的訓練策略 #### 名詞解釋: - Prefetch 在 cpu 要運算之前把資料從主記憶體預先載入 cache,降低 cpu 等待時間進而提升效率. 載入至主記憶體的時機不能太晚(cpu 已經運算完畢不再需要這些資料)也不能太早(有可能在使用之前就被踢出快取記憶體),所以計算 prefetch distance 很重要。 - Hardware prefetch 是用過去的存取紀錄作為 prefetch 的依據,所以需要 training。 e.g. GHB:Global History Buffer - Software prefetch 人為或 compiler 在適當的地方加入 prefetching intrinsics。 但 ICC or GCC 中只有簡單的 SW 指令,所以比較仰賴人為加入。 e.g. SSE 的 `__mm_prefetch()` 而對於手動加入指令有兩個最大的問題: 1. 對於如何插入 prefetch intrinsic 沒有一個嚴格的標準 2. 軟體與硬體的 prefetch 交互使用的複雜度沒有一個很好的了解 ### SW & HW 的交互作用實驗結果: ![](https://i.imgur.com/3SYkKyL.png) - GHB(stride prefetcher) stride 是指說你是一次跳幾格去 prefetch cache line。 - STR(stream prefetcher) stream 是指說你的 prefetch 是連續的抓取 cache line。 可以看到左邊 Positive 區的實驗結果顯示在這些 benchmark 的情況下,SW+HW 會比 HW 來得更有效率。 #### 實驗結果重點 - SW 對於 ==short array streams==, ==irregular memory address patterns==, and ==L1 cache miss reduction== 效果較好 - SW 會對 HW 的 training 產生干擾,讓效能變差,尤其是在 stream 的時候。 從實驗中,產生下列主要的三個問題: 1. What are the limitations and overheads of software prefetching? 2. What are the limitations and overheads of hardware prefetching? 3. When is it beneficial to use software and/or hardware prefetching? ### SW & HW 的背景知識 #### Data Structure 此表分析了不同 data structure 的性質及當時已經有提出的 HW SW prefetch 方式 ![](https://i.imgur.com/qAbjupc.png) 而當中的 D 就是 prefetch distance #### Prefetch hint ![](https://i.imgur.com/RpiCSqA.png) 主要使用在SSE的__mm_prefetch([address],==[hint]==); #### Prefetch 分類 & Timeliness ![](https://i.imgur.com/sbwqrPc.png) ![](https://i.imgur.com/7ZFCX3E.png) #### Prefetch distance Prefetch 希望能夠 timely,所以要設定prefetch distance,要大到足夠隱藏 latency。 $$D\ge\lceil\frac{l}{s}\rceil$$ - `D` is called the prefetch distance - `l` is the prefetch latency - `s` is the length of the shortest path through the loop body #### Direct\Indirect index - Direct: 是直接可以從指令中知道要存取的記憶體位址,可以透過 hareward prefetch. - Indirect: 就像這行程式碼 x = a[b[i]],需要先計算b[i]的值才知道要存取a[]中的哪個位置,使用 software prefetch 比較容易達成. ![](https://i.imgur.com/eHtiR2o.png) (a)Direct: 2個指令 (b)Indirect: 4個指令 ## 程式實作 矩陣轉置 ### 開發環境 ``` baimao@baimao-Aspire-V5-573G:~$ lscpu Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數:2 每通訊端核心數:2 Socket(s): 1 NUMA 節點: 1 供應商識別號: GenuineIntel CPU 家族: 6 型號: 69 Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz 製程: 1 CPU MHz: 1711.242 CPU max MHz: 2600.0000 CPU min MHz: 800.0000 BogoMIPS: 4589.38 虛擬: VT-x L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 3072K NUMA node0 CPU(s): 0-3 ``` CPU:Intel Colre i5-4200U [資訊](http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5-4200U%20Mobile%20processor.html) ``` Architecture / Microarchitecture Microarchitecture Haswell Platform Shark Bay Processor core Haswell Core stepping C0 (SR170) Manufacturing process 0.022 micron Data width 64 bit The number of CPU cores 2 The number of threads 4 Floating Point Unit Integrated Level 1 cache size 2 x 32 KB 8-way set associative instruction caches 2 x 32 KB 8-way set associative data caches Level 2 cache size 2 x 256 KB 8-way set associative caches Level 3 cache size 3 MB 12-way set associative shared cache Physical memory 16 GB Multiprocessing Uniprocessor Features MMX instructions SSE / Streaming SIMD Extensions SSE2 / Streaming SIMD Extensions 2 SSE3 / Streaming SIMD Extensions 3 SSSE3 / Supplemental Streaming SIMD Extensions 3 SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4 AES / Advanced Encryption Standard instructions AVX / Advanced Vector Extensions AVX2 / Advanced Vector Extensions 2.0 BMI / BMI1 + BMI2 / Bit Manipulation instructions F16C / 16-bit Floating-Point conversion instructions FMA3 / 3-operand Fused Multiply-Add instructions EM64T / Extended Memory 64 technology / Intel 64 NX / XD / Execute disable bit HT / Hyper-Threading technology TBT 2.0 / Turbo Boost technology 2.0 VT-x / Virtualization technology Low power features Enhanced SpeedStep technology ``` ### 開發紀錄 #### 第一次執行結果 ``` 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse prefetch: 70873 us sse: 144184 us naive: 244073 us ``` * gnuplot `make plot` ![](https://i.imgur.com/q8imCf3.png) 這是100次取平均值 naive 的時間不穩定 所以平均的誤差大 但可以看到 sse + prefetch 的時間最短 接著改寫`makefile`、`main`將三種方法分開進行 perf 測試`$ make cache-test` perf 所設定的參數[^first]: [^first]: [kobeyu的筆記](https://hackmd.io/s/BycRx2EC#解析程式碼pefetch) ``` cache-misses cache-references L1-dcache-load-misses L1-dcache-loads L1-dcache-stores L1-icache-load-misses ``` makefile: ``` EXEC = \ sse_prefetch_transpose\ sse_transpose\ naive_transpose cache-test: $(EXEC) for method in $(EXEC);do \ perf stat --repeat 100 \ -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses \ ./$$method;\ done ``` > 這裡有個困惑 當我嘗試將 -e 後面的參數斷行 > -e cache-misses,cache-references, \\ >L1-dcache-load-misses,L1-dcache-loads, \\ >L1-dcache-stores,L1-icache-load-misses \\ >執行時會發生錯誤: >event syntax error: 'cache-misses,cache-references,' > \___ parser error >Run 'perf list' for a list of valid events > Usage: perf stat [<options>] [<command>] > -e, --event <event> event selector. use 'perf list' to list available events >> 我改成 >> -e cache-misses,cache-references \\ >> -e L1-dcache-load-misses,L1-dcache-loads \\ >> -e L1-dcache-stores,L1-icache-load-misses \ >> 就沒有出現錯誤訊息了 [name=claaaaassic] >>> 歐歐~ 那我.. 還是一行好了orz[name=baimao8437] #### pref 結果 `$ make cache-test` - sse prefetch ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 4,759,440 cache-misses # 85.111 % of all cache refs ( +- 0.69% ) (32.89%) 5,592,027 cache-references ( +- 0.49% ) (50.39%) 8,481,323 L1-dcache-load-misses # 1.83% of all L1-dcache hits ( +- 0.26% ) (67.13%) 464,186,704 L1-dcache-loads ( +- 0.07% ) (65.37%) 234,986,970 L1-dcache-stores ( +- 0.10% ) (61.59%) 55,906 L1-icache-load-misses ( +- 3.66% ) (32.76%) 0.268066902 seconds time elapsed ( +- 0.64% ) ``` - sse ``` Performance counter stats for './sse_transpose' (100 runs): 4,807,128 cache-misses # 86.253 % of all cache refs ( +- 0.64% ) (33.36%) 5,573,299 cache-references ( +- 0.31% ) (50.49%) 8,545,880 L1-dcache-load-misses # 1.93% of all L1-dcache hits ( +- 0.15% ) (67.18%) 441,706,999 L1-dcache-loads ( +- 0.10% ) (65.64%) 231,528,304 L1-dcache-stores ( +- 0.12% ) (62.41%) 64,512 L1-icache-load-misses ( +- 3.90% ) (32.84%) 0.341006236 seconds time elapsed ( +- 1.00% ) ``` - naive ``` Performance counter stats for './naive_transpose' (100 runs): 17,137,199 cache-misses # 94.603 % of all cache refs ( +- 0.28% ) (33.27%) 18,114,904 cache-references ( +- 0.19% ) (50.21%) 21,069,839 L1-dcache-load-misses # 3.82% of all L1-dcache hits ( +- 0.11% ) (66.89%) 551,000,987 L1-dcache-loads ( +- 0.07% ) (65.83%) 213,504,897 L1-dcache-stores ( +- 0.16% ) (63.62%) 71,981 L1-icache-load-misses ( +- 4.93% ) (33.07%) 0.467403806 seconds time elapsed ( +- 1.33% ) ``` 可以看到 cache-misses 確實是 sse prefetch 最低 因為 prefetch 先將所需要的資料載入快取記憶體降低了 cache-misses,低的 miss rate 讓他的執行速度比較快。 #### Row Counter 參考[kaizsv 的共筆](https://hackmd.io/s/r1IWtb9R#sse-prefetch-transpose)在 [intel 文件](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html)中找 4th Generation Haswell microarchitecture 察看下列項目: - r014c: LOAD_HIT_PRE.SW_PF - r024c: LOAD_HIT_PRE.HW_PF 結果: ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 4,911,498 cache-misses # 84.564 % of all cache refs ( +- 0.78% ) (49.91%) 5,808,046 cache-references ( +- 0.49% ) (50.22%) 8,670,151 L1-dcache-load-misses # 1.88% of all L1-dcache hits ( +- 0.39% ) (50.51%) 460,232,822 L1-dcache-loads ( +- 0.14% ) (45.85%) 237,557,467 L1-dcache-stores ( +- 0.22% ) (25.48%) 54,105 L1-icache-load-misses ( +- 3.69% ) (25.65%) 499,850 r014c ( +- 1.84% ) (25.31%) 272,620 r024c ( +- 5.06% ) (37.61%) 0.277024409 seconds time elapsed ( +- 1.18% ) Performance counter stats for './sse_transpose' (100 runs): 4,743,884 cache-misses # 85.357 % of all cache refs ( +- 0.39% ) (49.71%) 5,557,698 cache-references ( +- 0.28% ) (49.95%) 8,494,094 L1-dcache-load-misses # 1.99% of all L1-dcache hits ( +- 0.23% ) (50.33%) 426,955,982 L1-dcache-loads ( +- 0.15% ) (46.63%) 236,397,375 L1-dcache-stores ( +- 0.26% ) (25.85%) 60,052 L1-icache-load-misses ( +- 3.60% ) (25.54%) 464 r014c ( +- 4.00% ) (25.23%) 312,872 r024c ( +- 4.40% ) (37.39%) 0.326735088 seconds time elapsed ( +- 0.44% ) Performance counter stats for './naive_transpose' (100 runs): 16,888,393 cache-misses # 92.475 % of all cache refs ( +- 0.25% ) (49.93%) 18,262,729 cache-references ( +- 0.20% ) (50.39%) 21,341,610 L1-dcache-load-misses # 4.01% of all L1-dcache hits ( +- 0.17% ) (50.91%) 532,720,948 L1-dcache-loads ( +- 0.10% ) (48.00%) 228,505,453 L1-dcache-stores ( +- 0.23% ) (25.06%) 64,509 L1-icache-load-misses ( +- 3.95% ) (25.08%) 496 r014c ( +- 5.12% ) (24.98%) 249,019 r024c ( +- 4.90% ) (37.44%) 0.437171123 seconds time elapsed ( +- 0.56% ) ``` sse prefetch 的 r014c 明顯比其他大 代表真的有執行 SW prefetch #### 實驗不同 Prefetch Distance 論文中提到 Prefetch Distance 會影響 Prefetch 的效率 在 `impl.c` 中的 `PFDIST` 就是 distance 將其用 0,2,4,7,8,9,10,12,16 代入 因為預設8 固檢查8附近的值 `$ make plot DIS=1` ![](https://i.imgur.com/pwalPgT.png) 結果符合理論 distance 過猶不及 這邊 8 代入的效率最好 #### 實驗不同 Prefetch Hints 如同[前方表格](https://hackmd.io/s/Skky1srig#prefetch-hint)顯示共有4種hints 將其分別代入比較(預設是 T1) - _MM_HINT_T0 ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 4,667,916 cache-misses # 82.749 % of all cache refs ( +- 0.57% ) (49.02%) 5,641,054 cache-references ( +- 0.44% ) (49.48%) 8,916,708 L1-dcache-load-misses # 1.93% of all L1-dcache hits ( +- 0.40% ) (50.67%) 463,153,960 L1-dcache-loads ( +- 0.18% ) (46.33%) 235,037,792 L1-dcache-stores ( +- 0.25% ) (25.85%) 50,785 L1-icache-load-misses ( +- 4.21% ) (25.70%) 1,237,631 r014c ( +- 1.73% ) (24.94%) 403,505 r024c ( +- 3.30% ) (36.74%) 0.256539780 seconds time elapsed ( +- 0.57% ) ``` - _MM_HINT_T1 ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 4,906,154 cache-misses # 82.649 % of all cache refs ( +- 0.20% ) (49.38%) 5,936,140 cache-references ( +- 0.14% ) (49.40%) 8,625,730 L1-dcache-load-misses # 1.84% of all L1-dcache hits ( +- 0.27% ) (49.46%) 467,546,983 L1-dcache-loads ( +- 0.11% ) (44.87%) 232,786,099 L1-dcache-stores ( +- 0.17% ) (26.09%) 37,113 L1-icache-load-misses ( +- 1.55% ) (26.02%) 441,530 r014c ( +- 1.60% ) (25.60%) 428,107 r024c ( +- 2.72% ) (37.41%) 0.253216818 seconds time elapsed ( +- 0.10% ) ``` - _MM_HINT_T2 ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 4,931,244 cache-misses # 83.404 % of all cache refs ( +- 0.71% ) (49.53%) 5,912,467 cache-references ( +- 0.53% ) (49.67%) 8,578,641 L1-dcache-load-misses # 1.84% of all L1-dcache hits ( +- 0.39% ) (49.89%) 466,874,784 L1-dcache-loads ( +- 0.17% ) (45.45%) 234,373,157 L1-dcache-stores ( +- 0.22% ) (26.58%) 45,452 L1-icache-load-misses ( +- 6.35% ) (25.87%) 435,123 r014c ( +- 1.95% ) (25.43%) 408,551 r024c ( +- 3.97% ) (37.36%) 0.264967830 seconds time elapsed ( +- 1.18% ) ``` - _MM_HINT_NTA ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 4,848,945 cache-misses # 82.589 % of all cache refs ( +- 0.41% ) (49.29%) 5,871,166 cache-references ( +- 0.36% ) (49.36%) 8,568,859 L1-dcache-load-misses # 1.83% of all L1-dcache hits ( +- 0.27% ) (49.50%) 467,023,161 L1-dcache-loads ( +- 0.12% ) (45.17%) 232,552,596 L1-dcache-stores ( +- 0.16% ) (26.36%) 35,771 L1-icache-load-misses ( +- 2.76% ) (25.96%) 451,018 r014c ( +- 1.83% ) (25.57%) 428,982 r024c ( +- 2.64% ) (37.33%) 0.253415363 seconds time elapsed ( +- 0.18% ) ``` - gnuplot ![](https://i.imgur.com/hGeL1XW.png) 可以觀察出 T0 時 r014c 最高 因為他將資料 prefetch 到所有 level 所以效率最高 #### AVX 首先先從了解 SSE 的運作方式[^second] 然後將原本的 4\*4 matrix 改為 8\*8 matrix code的部份也照著 SSE 的步驟改為 AVX 但結果並不正確: [^second]: [ierosodin的筆記](https://hackmd.io/s/rkX95E-il#how-sse-transpose-matrix) ``` I0 -> 0 8 16 24 | 4 12 20 28 I1 -> 1 9 17 25 | 5 13 21 29 I2 -> 2 10 18 26 | 6 14 22 30 I3 -> 3 11 19 27 | 7 15 23 31 ----------------------------- I4 -> 32 40 48 56 | 36 44 52 60 I5 -> 33 41 49 57 | 37 45 53 61 I6 -> 34 42 50 58 | 38 46 54 62 I7 -> 35 43 51 59 | 39 47 55 63 ``` 左下和右上的角應該要交換才對 最後參考[周曠宇的共筆](https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh#:h=使用Perf分析cache-miss/hit) 加入這幾行code完成交換 ``` T0 = _mm256_permute2x128_si256(I0, I4, 0x20); T1 = _mm256_permute2x128_si256(I1, I5, 0x20); T2 = _mm256_permute2x128_si256(I2, I6, 0x20); T3 = _mm256_permute2x128_si256(I3, I7, 0x20); T4 = _mm256_permute2x128_si256(I0, I4, 0x31); T5 = _mm256_permute2x128_si256(I1, I5, 0x31); T6 = _mm256_permute2x128_si256(I2, I6, 0x31); T7 = _mm256_permute2x128_si256(I3, I7, 0x31); ``` 但這個[指令](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&expand=3152,3146,5610,4717,4770,4717,3901,3896,3895,3896,3896,4717,4726,3896,3901,5159,5205,4717,4717,3896,3896&text=_mm256_permute2x128_si256)我還在研究原理為何 試著加入 prefetch 但是 AVX 並沒有支援 prefetch 要到 AVX-512才有支援 所以先用 SSE 的 prefetch 這裡先預設 distance = 8, Hints = T1 - 結果 - avx perf ``` - Performance counter stats for './avx_transpose' (100 runs): 3,612,052 cache-misses # 77.914 % of all cache refs ( +- 0.36% ) (50.22%) 4,635,946 cache-references ( +- 0.29% ) (50.45%) 8,223,505 L1-dcache-load-misses # 2.11% of all L1-dcache hits ( +- 0.19% ) (50.50%) 390,434,639 L1-dcache-loads ( +- 0.07% ) (45.13%) 216,084,485 L1-dcache-stores ( +- 0.14% ) (25.38%) 36,183 L1-icache-load-misses ( +- 1.32% ) (25.96%) 359 r014c ( +- 5.20% ) (25.59%) 208,075 r024c ( +- 4.01% ) (37.76%) 0.259036199 seconds time elapsed ( +- 0.15% ) ``` - avx_prefetch perf ``` Performance counter stats for './avx_prefetch_transpose' (100 runs): 3,870,080 cache-misses # 75.536 % of all cache refs ( +- 1.15% ) (49.67%) 5,123,514 cache-references ( +- 0.64% ) (49.90%) 8,183,794 L1-dcache-load-misses # 2.01% of all L1-dcache hits ( +- 0.43% ) (50.22%) 407,171,607 L1-dcache-loads ( +- 0.22% ) (45.78%) 214,583,897 L1-dcache-stores ( +- 0.31% ) (25.80%) 58,325 L1-icache-load-misses ( +- 3.85% ) (25.72%) 259,162 r014c ( +- 2.77% ) (25.31%) 351,523 r024c ( +- 4.43% ) (37.40%) 0.277570860 seconds time elapsed ( +- 1.61% ) ``` 看到 r014c 的差異 代表有進行SW prefetch - gnuplot ![](https://i.imgur.com/lDvmE3Q.png) avx 的效能又比 sse prefetch 好 因為一次處理的資料量比較大 但有一個重點 avx 所用的 prefetch distance 應該要是 sse 版本的兩倍 也就是16才對 這裡比較一些不同的 distance 代入的結果 - 結果 - gnuplot ![](https://i.imgur.com/NXLKyZy.png) 執行時間還是 distance = 8 時比較好 - perf ``` Performance counter stats for './avx_prefetch_transpose 8' (20 runs): 3,867,416 cache-misses # 80.189 % of all cache refs ( +- 2.16% ) (44.33%) 4,822,896 cache-references ( +- 2.20% ) (44.72%) 684,115,466 cycles ( +- 2.98% ) (45.18%) 7,658,761 L1-dcache-load-misses # 1.90% of all L1-dcache hits ( +- 1.46% ) (45.68%) 403,675,699 L1-dcache-loads ( +- 0.92% ) (40.26%) 209,757,372 L1-dcache-stores ( +- 0.89% ) (22.63%) 60,160 L1-icache-load-misses ( +- 8.71% ) (22.67%) 284,360 r014c ( +- 5.87% ) (22.46%) 419,681 r024c ( +- 10.79% ) (33.30%) 0.285078156 seconds time elapsed ( +- 3.82% ) Performance counter stats for './avx_prefetch_transpose 16' (20 runs): 3,839,351 cache-misses # 58.430 % of all cache refs ( +- 1.73% ) (44.75%) 6,570,840 cache-references ( +- 1.35% ) (45.41%) 644,629,299 cycles ( +- 0.44% ) (45.90%) 7,695,401 L1-dcache-load-misses # 1.86% of all L1-dcache hits ( +- 1.10% ) (46.16%) 413,234,940 L1-dcache-loads ( +- 0.37% ) (39.60%) 213,451,682 L1-dcache-stores ( +- 0.71% ) (22.13%) 55,960 L1-icache-load-misses ( +- 10.40% ) (22.69%) 178,769 r014c ( +- 8.47% ) (22.50%) 197,597 r024c ( +- 1.80% ) (33.55%) 0.262698318 seconds time elapsed ``` 1. cache misses 確實是 distance = 16 最低 但是是因為 cache refs 比別人多了很多 所以 miss rate 才低 2. cycles 數 dis = 16 比 dis = 8 少了 40,000,000 次 3. r014c r024c cache hit 次數都比 dis = 8 還少 ### Code 再進化 - [ ] Use higher level macro to simplify avx code 參考到[類似作法](https://github.com/Naruil/Yastopwatch/blob/master/yastopwatch.h) 還在學習中 - [x] Use Stopwatch to record time 將老師推薦的 [Stopwatch](https://github.com/berndbischl/rscimark/blob/master/src/Stopwatch.h) 結合 clock_gettime 達到精準的時間測試