2016q3 Homework3 (software-pipelining)

# 2016q3 Homework3 (software-pipelining) contribute by <`yenWu`> ###### tags: `sysprog2016` 這次想要特別加強作圖的部分，我後來發現放圖才是最快又清楚的方式。 1. 數學跟電腦的差別 memory 的 2. cache locality + memory hiere ## 論文研讀 ([作業要求](/s/ry7eqDEC)) 閱讀 [效能分析: Prefetching](https://embedded2015.hackpad.com/ep/pad/static/7ZSsa98cSKw) 提到的論文: “[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)” * 首先我想先釐清，software/ hardware prefetch * hardware prefetch * Hardware prefetchers may use ==some storage to detect access patterns and based on it, prefetch== instructions are issued. * 就敘述上可以想到 `branch prediction` 的 `history table`，也形成一個 state machine * software prefetch * Software prefetchers ==insert prefetch instructions== in program source-code based on knowledge of program control flow. * 再來是一些，關於 prefetch 的名詞 * Prefetch degree * Prefetch degree is ==the number of cache lines prefetched== in each prefetching operation. * Prefetch distance * Prefetch distance shows how far ahead of the demand access stream, the data blocks are prefetched. * 跨謀參考資料: [wikipedia](https://en.wikipedia.org/wiki/Instruction_prefetch) ### Paper Study 簡介就說來教你怎麼用 prefetch 阿，囧。沒看還不知道，一看就不得了，我發現這份 paper 的詳細程度真的是太令人吃驚拉。 * ==就我看到現在的想法是，`prefetch`的功用就是隱藏 `cache miss` 所帶來的 overhead，但 prefetch 的時機很重要，快了慢了都不行，所以計算 `prefetch distance`(也就你 prefetch 使用後，會在哪個 loop 的時機近來)，再來就是比較了 HD，還有雙劍合璧(HD+SW)== * 以下圖表非常重要，他幫我們分析了不同 data structure 的性質及當時已經有提出的 HW SW prefetch 方式 * 這邊注意到 `D is prefetch distance` ![](https://i.imgur.com/tQ7ZRtd.png) * Prefetch hint 的列表 ![](https://i.imgur.com/x0tQqnt.png) * prefetch 的分類，也就是你這個 prefetch 的效果，如果太早拿進 L1 有可能要用的時候就被 evict 了 ![](https://i.imgur.com/gaquQVl.png) * prefetch 的考慮要考慮到 request, insert, evict，所以才會需要下面的 prefetch distance，並不是你一下 prefetch 就直接搬到 L1 ![](https://i.imgur.com/ZKrE8Wb.png) * Prefetch Distance 的計算方式 * `D` 是我們要設定 prefetch 的目標距離，prefetch 希望能夠 timely 是最好的，最少也要是 early ，不然抓進來的 data 就沒有用了 * `l` = prefetch latency，白話文是用了 prefetch 多久才會把 data 搬到 cache 裡 * `s` = ![](https://i.imgur.com/yrmVLf6.png) * 上面計算的重點還要考慮 a[i] 和 a[b[i]] 的差異，因為這兩個所花的 cycle time 是不一樣的，我們必須對應到不同的 `s` * 比較 hardware prefetch 和 software prefetch * **hardware prefetch** 是很依賴訓練的，例如你一個 for 迴圈算 1加 100，那麼他可能就需要透過 1 2 3 的訓練學習到你接下來可能要拿 4，但它並不能處理 indirect 的 memory access * **software prefetch** 就是由 programmer 決定啥時該把 data 放到 cache 裡面，這種方式就可以處理 indirect 的類別，但==最重要最重要的就是你還要考慮 latency 阿，這個超難的阿== * **雙劍合璧?** 對，可以用 software prefetch 來引導 hardware prefetch 阿，用 software prefetch 來建立良好的 cache history 會對 hardware prefetch 有所幫助的 * ==benefits of software prefetch== 舉例用在哪邊會比較好 * **Latge num streams**: 太多 prefetch 的分支，會讓 * **short streams**:太短的 code 用 hardware 是沒辦法學習的，hardware 會需要有一定的訓練時間 * irregular memory access: BJ4 * **cache locality hint**: 我們能決定要搬到 L1 或者是更多的選項(參考 prefetch hint) * loop bound >>反正只要知道 hardware 一定是用先前的訊息來推敲後面該怎麼 prefetch，大概就能理解這些好處了 [name=Yen-Kuan Wu] * ==Negative impacts of software prefetch== * **instruction count** 會增加，因為差了 prefetch 阿(笑 * **static insert** * **code structure change** 我認為這是最重要也是最難的，當我自己嘗試插入 code 時會發現怎麼做都不會比較快，等於是我要考慮計算機的特性還有搬移進 xmm 的時間甚至是 prefetch distence # 測試環境 * 用簡單的`$lscpu`來檢測環境，這邊我留下比較重要的資訊和我看不懂的東西 ```clike= Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 6144K NUMA node0 CPU(s): 0-7 ``` 所以我的 L1 d-cache 是 32KB 這個肯定會在 prefetch 用得到。 ## 不清楚的資訊 1. `Thread(s) per core: 2` * 也就是老師所說的 hyperthreading，一個 core上可以跑多個 thread 2. `Core(s) per socket: 4` * 3. `Socket(s): 1` * 簡單講就式可以差 `CPU` 的洞有幾個 4. `NUMA node0 CPU(s): 0-7` * 這說起來就複雜了，開一個段落來研 # 最初實驗數據 * **時間** ``` sse prefetch: 104668 us sse: 178829 us naive: 326185 us ``` * **perf event** * 我設計了第一個實驗，但我發現這個的成效沒有很好，使用了 `cache-misses,cache-references,instructions,cycles` ``` Performance counter stats for './time_naive' (100 runs): 16,119,335 cache-misses # 78.804 % of all cache refs ( +- 0.03% ) 20,454,855 cache-references ( +- 0.12% ) 1,588,542,305 instructions # 0.95 insns per cycle ( +- 0.00% ) 1,670,079,404 cycles ( +- 0.34% ) 0.586998614 seconds time elapsed ( +- 0.64% ) perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./time_sse > /dev/null Performance counter stats for './time_sse' (100 runs): 6,797,259 cache-misses # 83.452 % of all cache refs ( +- 0.02% ) 8,145,160 cache-references ( +- 0.14% ) 1,376,489,596 instructions # 1.15 insns per cycle ( +- 0.01% ) 1,195,796,152 cycles ( +- 0.36% ) 0.435765644 seconds time elapsed ( +- 0.83% ) perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./time_sse_prefetch > /dev/null Performance counter stats for './time_sse_prefetch' (100 runs): 6,860,052 cache-misses # 82.287 % of all cache refs ( +- 0.07% ) 8,336,692 cache-references ( +- 0.15% ) 1,422,283,317 instructions # 1.47 insns per cycle ( +- 0.01% ) 968,118,604 cycles ( +- 0.38% ) 0.357185164 seconds time elapsed ( +- 0.99% ) ``` ### ==觀察上面的結果我們發現其實，cache-miss差不了多少 ...，但為什麼 prefetch 的時間會比較快呢？== >>看似差不了多少, 但你明確清楚資料從 DRAM => register 跟 L2 => register 以及 L1 => register 之間的時間差異嗎? [name=champ] >>L1 => register = 1 cycle >>L2 => register = 3-4 cycle >>DRAM=>register =100up cycle >>而這邊的 cache-miss 則是包山包海，我們需要清楚分析 L1, L2, DRAM 的 miss 才對。 >>[name=Yen-Kuan Wu] ## 實驗問題 1. 算時間的方式應該採取，多次取平均，我會發限其時我的時間有再跳動 2. `cache-miss` 裡面包含太多我們不在意的東西了，我們的 prefetch 其實只有減少 `dcache cache misses` >>後來想想，prefetch 究竟會不會減少 cache-miss呢? 如果 prefetch 也是透過 cache-miss 的方式來將 data 載入 cache 的話，我該怎麼測量 prefetch 真正帶來的效益呢?[name=Yen-Kuan Wu] >>論文看過了嗎? 引用一段 "Among our numerous findings, we show that when software prefetching targets short array streams, irregular memory address patterns, and L1 cache miss reduction, there is an overall positive impact with code examples." [name=champ] ## L1-cache-test ``` Performance counter stats for './time_sse' (100 runs): 10,597,659 L1-dcache-load-misses ( +- 0.11% ) 3,213,291 L1-dcache-store-misses ( +- 0.14% ) 216,535 L1-dcache-prefetch-misses ( +- 0.20% ) 109,996 L1-icache-load-misses ( +- 5.25% ) 0.435007594 seconds time elapsed ( +- 0.94% ) perf stat --repeat 100 \ -e L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses \ ./time_sse_prefetch > /dev/null Performance counter stats for './time_sse_prefetch' (100 runs): 11,551,554 L1-dcache-load-misses ( +- 0.08% ) 3,220,207 L1-dcache-store-misses ( +- 0.13% ) 217,016 L1-dcache-prefetch-misses ( +- 0.21% ) 94,153 L1-icache-load-misses ( +- 2.13% ) 0.351679917 seconds time elapsed ( +- 1.02% ) ``` ## Raw Counter 參照[<kaizsv> prefetch 共筆](https://hackmd.io/IwVmCMGMCYFMGYC0A2e5mICwAYDs5EBDAMwBMAORWU8ATl2N2EhHNiA=?both) 首先先去 [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html) 找我的CPU型號= =(超多) 我的是 3rd generation CPU 在 `19-34` `19.5 PERFORMANCE MONITORING EVENTS FOR 3RD GENERATION INTEL® CORE™ PROCESSORS ` 第 `19-34` 頁 `19.7 PERFORMANCE MONITORING EVENTS FOR INTEL® CORE™ I7 PROCESSOR FAMILY AND INTEL® XEON® PROCESSOR FAMILY` 第`19-57`頁 ``` 4CH 01H | LOAD_HIT_PRE | Counts load operations sent to the L1 data cache while a previous SSE prefetch instruction to the same cache line has started prefetching but has not yet finished. ``` * 這邊是我的 Intel-Core-i7-3610QM [連結](http://ark.intel.com/products/64899/Intel-Core-i7-3610QM-Processor-6M-Cache-up-to-3_30-GHz) * 他屬於 3th generation[連結](http://www.intel.com/content/www/us/en/processors/core/core-technical-resources.html) 關於我找到有用的 raw counter LOAD_HIT_PRE.SW_PF ### raw counter example in man perf-list ``` Example: If the Intel docs for a QM720 Core i7 describe an event as: Event Umask Event Mask Num. Value Mnemonic Description Comment A8H 01H LSD.UOPS Counts the number of micro-ops Use cmask=1 and delivered by loop stream detector invert to count cycles raw encoding of 0x1A8 can be used: perf stat -e r1a8 -a sleep 1 perf record -e r1a8 ... ``` >>慘...，有點看不出來為什麼，但我知道反正就取前兩個就對了，A8 01=>1A8 這樣是對的，我再測試 LOAD_HIT_PRE.SW_PF 是成功的[name=Yen-Kuan Wu] >>經過測試很像是這樣的 A8 01=>01A8 這樣就對了！[name=Yen-Kuan Wu] ### 測試 Hardware prefetch hit and Software prefetch hit * 首先是手冊上的內容 ``` 4CH 01H LOAD_HIT_PRE.SW_PF Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch. 4CH 02H LOAD_HIT_PRE.HW_PF Non-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch. ``` * ==perf stat -e r014c ./time_sse== ``` sse: 129214 us Performance counter stats for './time_sse': 166 r014c ``` * ==perf stat -e r014c ./time_sse_prefetch== ``` sse prefetch: 52287 us Performance counter stats for './time_sse_prefetch': 5,416,833 r014c ``` * ==perf stat -e r024c ./time_sse== ``` sse: 129301 us Performance counter stats for './time_sse': 604,121 r024c ``` * ==perf stat -e r024c ./time_sse_prefetch== ``` sse prefetch: 51559 us Performance counter stats for './time_sse_prefetch': 609,725 r024c ``` # perf record 這邊我們使用 perf record 來觀察每個程式的 event 狀態，重點是我們可以看到使用分佈 ## sse ### cache-references = 7515089 * 68.48% = ==5146332.9472== ![](https://i.imgur.com/A0BZbcP.png) ![](https://i.imgur.com/lC4J24T.png) ### cache-misses = 6326107 * 63.95% = 4045545.4265 ![](https://i.imgur.com/T7wrVMN.png) ![](https://i.imgur.com/15IAiO0.png) ## SSE cache-miss-rates = 78.61025448617908 % ### L1-dcache-load-misses = 8567714 * 61.02 / 100 = 5228019.0828 ![](https://i.imgur.com/rRE1wtD.png) ### L1-dcache-store-misses = 4212845 * 24.80 / 100 = 1044785.56 ![](https://i.imgur.com/YGh18li.png) ## sse_prefetch ### cache-references = 7539706 * 69.99% = ==5277040.2294== ![](https://i.imgur.com/88rnnQd.png) ![](https://i.imgur.com/biIQWV6.png) * cache-misses = 6353014 * 66.65% = ==4234283.831== ![](https://i.imgur.com/uhpZ5Yx.png) ![](https://i.imgur.com/qeGSoyS.png) ### L1-dcache-load-misses = 8703811 * 62.73 / 100 = 5459900.6403 ![](https://i.imgur.com/OiIG1cw.png) ![](https://i.imgur.com/2iTKohE.png) ### L1-dcache-store-misses = 4212394 * 24.82 / 100 = 1045516.1908 ![](https://i.imgur.com/xm8nbzQ.png) ## SSE prefetch cache-miss-rates = 80.23974892989283 % ## [SSE and AVX Instruction Sets](http://arxiv.org/pdf/1211.0820.pdf) * `__m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)` * 看程式碼，大約就可以看出端倪，意思就是把 src1, src2 的 lo 的部分依序以 32-bit 的部分移進 dst裡 ```clike= INTERLEAVE_DWORDS(src1[127:0], src2[127:0]){ dst[31:0] := src1[31:0] dst[63:32] := src2[31:0] dst[95:64] := src1[63:32] dst[127:96] := src2[63:32] RETURN dst[127:0] } ``` * `__m128i _mm_unpacklo_epi64 (__m128i a, __m128i b)` * 這個指令唯一不同的就是一次移 64-bits ```clike= INTERLEAVE_QWORDS(src1[127:0], src2[127:0]){ dst[63:0] := src1[63:0] dst[127:64] := src2[63:0] RETURN dst[127:0] } ``` * `void _mm_prefetch (char const* p, int i)` * 抓取 line of data > Description > Fetch the line of data from memory that contains address p to a location in the cache heirarchy specified by the locality hint i. >> 程式碼和文字說明請重新用 Markdown 語法排版 [name=jserv] * Prefetch hint 的列表 ![](https://i.imgur.com/x0tQqnt.png) ==特別注意這段 code== ```clike= #define PFDIST 8 _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T0); _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T0); _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T0); _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T0); ``` 我們會發現這個 `prefetch` 居然是抓下一次要執行的指令耶，所以我索性將`PFDIT`設成 `0` 試試看，到最後我們會發現，他們兩個的時間居然就一樣了，所以 `prefetch` 是有成效的。 >> 這算是"重要"的發現, 但是這表示你沒看過論文, 再者這並不是下"一"次, PFDIST 當時在定義時, 我的原意是 "PreFetch DISTance", 這裡是 8, 所以每次 loop 執行時, 固定去抓這個 loop iteration 之後的第二次的資料 (8/4 = 2) [name=champ] >> ``` sse prefetch: 172069 us sse: 172072 us ``` * `void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)` * Store 256-bits of integer data from a into memory ==using a non-temporal memory hint==. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated. * 所以他會內建，non-temporal 的 hint，這也是我們需要的功能之一 * `__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)` * 我們的 imm8 很重要，他將會是控制我們怎麼輸出的方式 * 我想要一個 unpack_ep128 就可以透過這個來實作 ```clike= SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0 ``` 參考:https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh ## 減少重複的宣告 prefetch 版本修改很簡單就是這樣，我們減少了多次的declaration ``` __m128i I0, I1, I2, I3, T0, T1, T2, T3; ``` 時間就自動減少了 10000us ``` sse prefetch: 88859 us sse: 178110 us naive: 323165 us ``` ## `register` 變數 ```clike= register __m128i I0, I1, I2, I3, T0, T1, T2, T3; for (register int x = 0; x < w; x += 4) { for (register int y = 0; y < h; y += 4) { ``` ``` sse prefetch: 81531 us sse: 177034 us ``` ## Parallize SIMD 因為我們有 8 科 CPU，每個 CPU 有自己的 SSE 和 AVX 暫存器 L1 cache，那我還不把他們拆開來做，這邊我想嘗試 openmp simd 很像會很方便! ## Pthread SIMD 這個本版還只是，一般版，並沒有做 prefetch，想法同上，由於考慮到上次 code review 提到的 memory 最好要靠近一點，所以這次切我是用 row 來切，根據 thread num 來把 matrix 切開，這樣能達到更好的 locality ``` sse prefetch: 78827 us sse: 179930 us sse pthread: 41185 us sse pthread prefetch: 40256 us naive: 297112 us ``` >> 這邊我先暫時用 8 個 thread(我的八核)，總體時間可以說是，又再減少一半，但奇怪的是 pthread prefetch 版本並沒有增進的感覺，可能要調 prefetch distance了。[name=Yen-Kuan Wu] >> 把之前為了開發 [clz](/s/B1LHefQ6) 作業而設計的統計和圖表繪製工具拿出來 [name=jserv] >> Q: 我使用 i7-8core 的 CPU，並且使用了 muti-thread 的方式來跑 SIMD，為什麼 8 個 thread 的時間跟 4 個thread 的時間差不多呢? >>>A: 目前的 i7 其實只有四核，那為什麼說自己有 8 核呢? 因為 Hyper Threading，簡單說明一下，Hyper Threading 其實一個 core 裡面支援可以跑兩個 thread，這個想法式建構在我們有多的 resource 也是有多個 ALU agent(或者其他的)，所以在某些情況下是有可能讓兩個 thread 一起跑的，但是其實 SIMD agent 每個 core 都只有一個，更何況 SIMD 專用的 register 也是只有一組，所以我們是沒有辦法平行的跑 SIMD的，這也是為什麼 8 個 thread 不會有更好的效能(就真的只有四倍)。[name=Yen-Kuan Wu] ## aligned 版本由於老師給的 code 裡都是使用 `loadu` 這是可以自動幫你解決 align的問題，這邊我如果想要用 `load`，我就必須先解決 align的問題，以下是我嘗試的方法 1. `void *memalign(size_t alignment, size_t size)` 2. `__attribute__(align((16)))` ### memalign 這是就是 malloc 的 aligned 版本 ```C #include <malloc.h> void *memalign(size_t alignment, size_t size); ``` 目前遇到一個問題是在 assign 會出現問題 >> 後來發現是 malloc 根本沒有回傳足夠的空間，而且 `out` 直接是 `NULL`[name=Yen-Kuan Wu] x=3072 y=3715 >> 參考 [GCC hacks in the Linux kernel](http://www.ibm.com/developerworks/library/l-gcc-hacks/)，裡面的 gcc extension 用法 [name=jserv] ## 改善 data representation >> 我提問該如做到 unpack128 呢? 因為我那時再做 avx ，如果我用老師的演算法就逼需要用到他，老師提示 https://fgiesen.wordpress.com/2013/07/09/simd-transposes-1/[name=Yen-Kuan Wu] ### Only unpackedlo/hi_ep32 時間上是有減少一點，但這個給我們的想法是，如果我能用 32 就解決，用 64 可以嗎? ```clike= T0 = _mm_unpacklo_epi32(I0, I2); T1 = _mm_unpacklo_epi32(I1, I3); T2 = _mm_unpackhi_epi32(I0, I2); T3 = _mm_unpackhi_epi32(I1, I3); I0 = _mm_unpacklo_epi32(T0, T1); I1 = _mm_unpackhi_epi32(T0, T1); I2 = _mm_unpacklo_epi32(T2, T3); I3 = _mm_unpackhi_epi32(T2, T3); ``` # 3/21 了解 PMU 的特性，及硬體的架構、測量的標準 * [Intel® 64 and IA-32 Architectures Software Developer’s Manual](https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf) > CH18 PMU * [White Paper: performance-monitoring-unit-sharing-guide](https://software.intel.com/sites/default/files/managed/c5/15/performance-monitoring-unit-sharing-guide.pdf) * [perf wiki hardware ref](https://perf.wiki.kernel.org/index.php/HardwareReference) * [Intel® Xeon Phi™ Coprocessor Performance Monitoring Units](https://software.intel.com/sites/default/files/forum/278102/intelr-xeon-phitm-pmu-rev1.01.pdf) ![](https://i.imgur.com/UE5yQ1T.png) * [A Study of Performance Monitoring Unit, perf and perf_events subsystem](http://rts.lab.asu.edu/web_438/project_final/CSE_598_Performance_Monitoring_Unit.pdf) * [Intel Slide: Processor Performance Counter Monitoring](https://hpi.de/fileadmin/user_upload/fachgebiete/plattner/teaching/TrendsandConcepts/2010/Dementiev_Processor__Performance__Counter_Monitoring_by_Roman_Dementiev_14-07-2010.pdf) * [Blackhat Slide: CPU Performance Counters](https://www.blackhat.com/docs/us-15/materials/us-15-Herath-These-Are-Not-Your-Grand-Daddys-CPU-Performance-Counters-CPU-Hardware-Performance-Counters-For-Security.pdf) ###### tags: `yenWu` `software-piplining` `prefetch`