# 2017q1 Homework3 (software-pipelining)
contributed by < ` changyuanhua ` >
## Development Environment
* Command: ` lscpu `
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
Stepping: 3
CPU MHz: 799.890
CPU max MHz: 3200.0000
CPU min MHz: 800.0000
BogoMIPS: 4608.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
```
## Paper Reading Notes
### Terminology
Data reference patterns are classified into three cases:
* Temporal: the data will be used again soon
* Spatial: the data will be used at nearby locations (e.g., on the same cache line)
* Non-temporal: the data is referenced once and not reused (e.g., some multimedia data types, such as vertex buffers in 3D graphics applications)
* Prefetch:
  * In terms of the memory hierarchy, the prefetch intrinsic takes a hint argument that selects which cache level the data is loaded into
> First, understand prefetch
> > Before the data is needed, load it from main memory into the cache. When the data is actually used, accessing the cache is much faster than going to main memory, so prefetching can hide memory access latency.
> Where to place prefetched data?
> > Cache or dedicated buffer
* Hardware prefetch
  * Uses past access history as the basis for prefetching, so the HW prefetcher needs training
  * GHB: Global History Buffer
> Understand hardware prefetch
> > The hardware prefetcher watches memory accesses going from the processor to the cache: it monitors which memory locations the processor is currently accessing, predicts which locations will be needed next, and issues prefetches to those locations.
* Software prefetch
  * Prefetch instructions can be inserted at the right places either by hand or by the compiler.
  * SW here refers to the _mm_prefetch calls written by the instructor in the provided code, which we can control ourselves
### Research Goals
* the goal of this study is to develop a novel, foundational understanding of both the benefits and limitations of hardware and software prefetching
* Provide guidelines for using prefetch intrinsics effectively in the presence of a HW prefetcher
* Identify combinations that perform well when HW & SW prefetchers coexist
### Study
* Source code-level analysis: helps understand the real benefits and limitations of compiler- and software-based prefetching
* The synergistic and antagonistic effects between software and hardware prefetching
* An evaluation of hardware prefetching training policies in the presence of software prefetching requests
### Introduction
The two biggest problems with inserting prefetch instructions by hand:
* There is no rigorous guideline for how to insert prefetch intrinsics
* The complexity of interactions between software and hardware prefetching is not well understood
#### Experiments
* Summarizes the complex interactions between software and hardware prefetching
![](https://i.imgur.com/s3Omhht.png)
* x-axis: SPEC CPU 2006 benchmarks
* y-axis: execution time normalized to the baseline binaries (only compiler-inserted SW prefetches)
* CPU: Intel’s Core 2 and Nehalem
* Compiled with icc (experiments with gcc were also done)
* GHB (stride prefetcher) / STR (stream prefetcher): compiler + HW prefetcher
* Base: (i.e., the best among Base, GHB, and STR)
* SW: compiler + intrinsics inserted in the program (i.e., the best among SW, SW+GHB, and SW+STR)
* Confirmed prior findings: SW prefetch performs better for short arrays, continuous and irregular memory accesses, and for reducing L1 cache misses
* New finding: SW prefetch can have a training effect on the HW prefetcher and degrade performance, especially when SW prefetch is applied to only part of a stream
* The study targets only existing mechanisms; both simulation and real systems are used: the more elaborate experiments are run in simulation, and the findings are then validated on real systems
### Background on HW & SW Prefetchers
#### SW and HW prefetching suited to different data structures
* summarizes common data structures, their corresponding data access patterns, and the hardware/software prefetching mechanisms previously proposed for each.
![](https://i.imgur.com/ERu9psG.png)
* Dead-block-based prefetching works for every kind of data structure
* Recursive Data Structures (RDS): RDS accesses can easily be predicted and prefetched in software
* Hashing: hard to prefetch effectively
* Cache-line access definitions:
  * stream: unit-stride accesses
  * stride: access stride distances larger than two cache lines
* HW prefetchers evaluated: stream prefetcher, GHB (stride) prefetcher, content-based prefetcher
#### Software Prefetch Intrinsics
![](https://i.imgur.com/850jWfQ.png)
* When inserting prefetch requests by hand, we use software prefetch intrinsics (part of the x86 SSE SIMD extensions)
* When using SSE's _mm_prefetch(char *ptr, int hint), you must specify
  * prefetch address
  * prefetch usage hint
* direct address prefetches: 2 instructions per intrinsic
* indirect memory address prefetches: 4 instructions per intrinsic
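A minimal usage sketch of the intrinsic (my own example, not code from the paper; the array `a`, the loop, and the distance 16 are assumptions for illustration):
```clike=
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_* */

/* Hypothetical example: walk over a[] and prefetch a few iterations ahead. */
void scan(const int *a, int n)
{
    for (int i = 0; i < n; i++) {
        /* The hint selects the target cache level (exact mapping is
         * implementation-defined; roughly, on recent Intel CPUs):
         *   _MM_HINT_T0  - all cache levels (L1 and up)
         *   _MM_HINT_T1  - L2 and up, bypassing L1
         *   _MM_HINT_T2  - L3 and up
         *   _MM_HINT_NTA - non-temporal, minimize cache pollution */
        _mm_prefetch((const char *)(a + i + 16), _MM_HINT_T0);
        /* ... use a[i] here ... */
    }
}
```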
#### Prefetch classification
![](https://i.imgur.com/cdaB0ox.png)
![](https://i.imgur.com/8QR3ye5.png)
> early vs late
> > Too early: the prefetched data evicts other useful data (cache pollution), or is itself evicted before it is used
> > Too late: the prefetch cannot hide the processor stall
* ex: if a demand access occurs after its corresponding prefetch request but before it is inserted into the cache, we classify the prefetch as “late.”
* MSHR: Miss Status Holding Register
#### Software Prefetch Distance
* The prefetch distance must be large enough to hide the memory latency
* D is the prefetch distance: how far ahead of the current memory address a prefetch request should be issued
* l is the prefetch latency
* s is the length of the shortest path through one loop iteration
* A prefetch distance that is too large can backfire, e.g., smaller coverage and more cache misses
* Coverage: the fraction of cache misses eliminated by prefetching
* 100 x (Prefetch Hits / (Prefetch Hits + Cache Misses))
![](https://i.imgur.com/PUh5dfT.png)
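Written out (this is my reading of the distance condition shown in the figure above, with D, l, and s as defined in the bullets):

$$D \ge \left\lceil \frac{l}{s} \right\rceil$$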
#### Direct and Indirect Memory Indexing
* Direct memory indexing:
  * can easily be prefetched by hardware, because the memory addresses exhibit regular stream/stride behavior
  * is not as easy to compute in software
  * Direct means the memory address to access is known straight from the instruction, so hardware prefetching can cover it.
* Indirect memory indexing:
  * requires special hardware prefetching mechanisms
  * is easier to compute in software
  * For code such as x = a[b[i]], the value of b[i] must be computed first to know which element of a[] is accessed; the machine cannot know better than the programmer how b[i] is produced, so software prefetching is easier to apply here.
  * Indirect memory accesses can incur SW prefetching overhead
    * e.g., a[i] needs one prefetch, while a[b[i]] needs two separate prefetches
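A minimal sketch contrasting the two cases (my own illustration, not code from the paper; `sum_direct`, `sum_indirect`, and `DIST` are made-up names):
```clike=
#include <xmmintrin.h>

#define DIST 16  /* assumed prefetch distance, for illustration only */

/* Direct indexing: the address a + i + DIST is a simple linear function of i,
 * so one prefetch per iteration is enough (prefetches never fault, so going
 * slightly past the end of the array is harmless). */
long sum_direct(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        _mm_prefetch((const char *)(a + i + DIST), _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}

/* Indirect indexing: prefetching a[b[i + DIST]] requires b[i + DIST] to be
 * available first, so it costs two steps: one prefetch for b[], one for a[]. */
long sum_indirect(const int *a, const int *b, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        _mm_prefetch((const char *)(b + i + DIST), _MM_HINT_T0);
        if (i + DIST < n)
            _mm_prefetch((const char *)(a + b[i + DIST]), _MM_HINT_T0);
        sum += a[b[i]];
    }
    return sum;
}
```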
## Prefetching in the Intel® Core™ Microarchitecture
![](https://i.imgur.com/ovC0E9y.png)
* My CPU's microarchitecture is Skylake [[reference]](http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5-6300HQ%20Mobile%20processor.html)
--> Knowing which microarchitecture the CPU has tells us which kinds of HW prefetchers are present
## Original Source Code
* naive
```clike=
void naive_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x++)
for (int y = 0; y < h; y++)
*(dst + x * h + y) = *(src + y * w + x);
}
```
This program performs ` dst[x][y] = src[y][x] `.
* C is a row-major language, so reading in column order keeps the cache from being used effectively.
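A tiny illustration of the point above (my own example; the width 4096 is an arbitrary assumption): in the naive inner loop, consecutive reads of src are `w * sizeof(int)` bytes apart, so for a large matrix every read touches a different cache line.
```clike=
#include <stdio.h>

int main(void)
{
    int w = 4096;  /* assumed matrix width, for illustration */
    /* Byte distance between two consecutive reads of src in the naive
     * inner loop (fixed x, y increasing), and how many 64-byte cache
     * lines that stride skips over. */
    printf("byte stride between consecutive reads: %zu\n", w * sizeof(int));
    printf("cache lines skipped per read: %zu\n", w * sizeof(int) / 64);
    return 0;
}
```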
* sse
```clike=
void sse_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
__m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```
This code loads 16 integers (four rows of four) into four 128-bit SSE registers at a time, uses unpackhi/unpacklo to shuffle the data, and finally stores each column of the 4x4 block as a row of dst.
* For matrix1, one load reads 128 bits = four ints into I0; given the starting address, it fetches the following 128 bits. I1, I2, and I3 work the same way.
```
__m128i I0 = _mm_loadu_si128((__m128i *)(src1 + (x + 0) * src1_w + k));
```
For details see [kobeyu's notes](https://hackmd.io/s/BycRx2EC), which explain this very clearly.
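To make the shuffle steps concrete, here is my trace of one 4x4 block through the intrinsics above (comments only, using a/b/c/d for the four source rows):
```clike=
/* Input rows (row y holds src[y][x .. x+3]):
 *   I0 = [a0 a1 a2 a3]
 *   I1 = [b0 b1 b2 b3]
 *   I2 = [c0 c1 c2 c3]
 *   I3 = [d0 d1 d2 d3]
 *
 * 32-bit unpack interleaves the low/high halves of two rows:
 *   T0 = unpacklo_epi32(I0, I1) = [a0 b0 a1 b1]
 *   T1 = unpacklo_epi32(I2, I3) = [c0 d0 c1 d1]
 *   T2 = unpackhi_epi32(I0, I1) = [a2 b2 a3 b3]
 *   T3 = unpackhi_epi32(I2, I3) = [c2 d2 c3 d3]
 *
 * 64-bit unpack then yields the columns, which are stored as rows of dst:
 *   I0 = unpacklo_epi64(T0, T1) = [a0 b0 c0 d0]   (column 0)
 *   I1 = unpackhi_epi64(T0, T1) = [a1 b1 c1 d1]   (column 1)
 *   I2 = unpacklo_epi64(T2, T3) = [a2 b2 c2 d2]   (column 2)
 *   I3 = unpackhi_epi64(T2, T3) = [a3 b3 c3 d3]   (column 3)
 */
```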
* sse_prefetch
```clike=
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
#define PFDIST 8
_mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
_mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);
__m128i I0 = _mm_loadu_si128 ((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128 ((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128 ((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128 ((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```
This version adds SW prefetch to the sse code.
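With PFDIST = 8 and the inner loop advancing by 4, the `_mm_prefetch` calls fetch the four source rows that are 8 rows (two inner-loop iterations) ahead into L2 (`_MM_HINT_T1`). A small sketch, assuming one wanted to make the distance tunable for experiments (the helper name `prefetch_block` is mine, not part of the original code):
```clike=
#include <xmmintrin.h>

#ifndef PFDIST
#define PFDIST 8            /* prefetch distance in rows; tune per machine */
#endif

/* Prefetch the four source rows of the 4x4 block PFDIST rows ahead.
 * _MM_HINT_T1 loads into L2 (and lower levels), leaving L1 untouched. */
static inline void prefetch_block(const int *src, int x, int y, int w)
{
    for (int r = 0; r < 4; r++)
        _mm_prefetch((const char *)(src + (y + PFDIST + r) * w + x),
                     _MM_HINT_T1);
}
```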
* Performance analysis
* naive
```
Performance counter stats for './naive_transpose' (100 runs):
40,396,950 cache-misses # 86.186 % of all cache refs ( +- 0.30% )
46,872,044 cache-references ( +- 0.24% )
21,069,452 L1-dcache-load-misses # 3.79% of all L1-dcache hits ( +- 0.01% )
555,330,933 L1-dcache-loads ( +- 0.00% )
219,108,523 L1-dcache-stores ( +- 0.00% )
72,803 L1-icache-load-misses ( +- 2.12% )
0.358395944 seconds time elapsed ( +- 0.31% )
```
* sse
```
Performance counter stats for './sse_transpose' (100 runs):
14,842,277 cache-misses # 77.006 % of all cache refs ( +- 0.28% )
19,274,268 cache-references ( +- 0.05% )
8,514,541 L1-dcache-load-misses # 1.91% of all L1-dcache hits ( +- 0.02% )
445,306,115 L1-dcache-loads ( +- 0.00% )
232,791,733 L1-dcache-stores ( +- 0.00% )
103,480 L1-icache-load-misses ( +- 2.20% )
0.273274670 seconds time elapsed ( +- 0.31% )
```
* sse_prefetch
```
Performance counter stats for './sse_prefetch_transpose' (100 runs):
10,613,966 cache-misses # 72.171 % of all cache refs ( +- 0.70% )
14,706,647 cache-references ( +- 0.28% )
8,519,990 L1-dcache-load-misses # 1.83% of all L1-dcache hits ( +- 0.02% )
466,169,620 L1-dcache-loads ( +- 0.00% )
232,724,643 L1-dcache-stores ( +- 0.00% )
83,636 L1-icache-load-misses ( +- 2.14% )
0.186173153 seconds time elapsed ( +- 0.63% )
```
| method | cache-misses rate| L1-dcache-load-misses rate |
| ------ | ---------------- | -------------------------- |
| naive | 86.186 % | 3.79% |
| sse | 77.006 % | 1.91% |
| sse prefetcher | 72.171 % | 1.83% |
* Time
```
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
```
* naive : 236494 us
* sse : 139678 us
* sse_prefetch : 62750 us
## Experiment 1: AVX
The AVX instruction set has no 128-bit "unpack" instruction, so _mm256_permute2x128_si256() is used instead.
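How the replacement works: the 8-bit immediate of `_mm256_permute2x128_si256` selects one 128-bit lane from each source register, so the immediates 0x20 and 0x31 used in the last shuffle step of the code below behave like a hypothetical "unpacklo/unpackhi_epi128". The wrappers here (my own naming, requires AVX2, e.g. `-mavx2`) just spell that out:
```clike=
#include <immintrin.h>

/* imm = 0x20 -> [ a.lane0 | b.lane0 ]  (acts like "unpacklo_epi128") */
static inline __m256i merge_lo128(__m256i a, __m256i b)
{
    return _mm256_permute2x128_si256(a, b, 0x20);
}

/* imm = 0x31 -> [ a.lane1 | b.lane1 ]  (acts like "unpackhi_epi128") */
static inline __m256i merge_hi128(__m256i a, __m256i b)
{
    return _mm256_permute2x128_si256(a, b, 0x31);
}
```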
* code:
```clike=
void avx_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 8) {
for (int y = 0; y < h; y += 8) {
__m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
__m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
__m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
__m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
__m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
__m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
__m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
__m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
__m256i T0 = _mm256_unpacklo_epi32(I0, I1);
__m256i T1 = _mm256_unpacklo_epi32(I2, I3);
__m256i T2 = _mm256_unpackhi_epi32(I0, I1);
__m256i T3 = _mm256_unpackhi_epi32(I2, I3);
__m256i T4 = _mm256_unpacklo_epi32(I4, I5);
__m256i T5 = _mm256_unpacklo_epi32(I6, I7);
__m256i T6 = _mm256_unpackhi_epi32(I4, I5);
__m256i T7 = _mm256_unpackhi_epi32(I6, I7);
I0 = _mm256_unpacklo_epi64(T0, T1);
I1 = _mm256_unpackhi_epi64(T0, T1);
I2 = _mm256_unpacklo_epi64(T2, T3);
I3 = _mm256_unpackhi_epi64(T2, T3);
I4 = _mm256_unpacklo_epi64(T4, T5);
I5 = _mm256_unpackhi_epi64(T4, T5);
I6 = _mm256_unpacklo_epi64(T6, T7);
I7 = _mm256_unpackhi_epi64(T6, T7);
            /* Combine 128-bit lanes: 0x20 takes both low lanes, 0x31 both high lanes. */
            T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
            T1 = _mm256_permute2x128_si256(I1, I5, 0x20);
            T2 = _mm256_permute2x128_si256(I2, I6, 0x20);
            T3 = _mm256_permute2x128_si256(I3, I7, 0x20);
            T4 = _mm256_permute2x128_si256(I0, I4, 0x31);
            T5 = _mm256_permute2x128_si256(I1, I5, 0x31);
            T6 = _mm256_permute2x128_si256(I2, I6, 0x31);
            T7 = _mm256_permute2x128_si256(I3, I7, 0x31);
_mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
_mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
_mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
_mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
_mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
_mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5);
_mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6);
_mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7);
}
}
}
```
* Performance analysis
* naive
```
Performance counter stats for './naive_transpose' (100 runs):
41,691,591 cache-misses # 86.654 % of all cache refs ( +- 0.24% )
48,112,706 cache-references ( +- 0.22% )
21,074,570 L1-dcache-load-misses # 3.79% of all L1-dcache hits ( +- 0.01% )
555,368,170 L1-dcache-loads ( +- 0.00% )
219,132,867 L1-dcache-stores ( +- 0.00% )
78,851 L1-icache-load-misses ( +- 2.26% )
0.362426909 seconds time elapsed ( +- 0.45% )
```
* sse
```
Performance counter stats for './sse_transpose' (100 runs):
15,237,833 cache-misses # 78.909 % of all cache refs ( +- 0.47% )
19,310,745 cache-references ( +- 0.06% )
8,526,042 L1-dcache-load-misses # 1.91% of all L1-dcache hits ( +- 0.03% )
445,281,290 L1-dcache-loads ( +- 0.00% )
232,777,361 L1-dcache-stores ( +- 0.00% )
98,770 L1-icache-load-misses ( +- 2.42% )
0.266709924 seconds time elapsed ( +- 0.38% )
```
* sse_prefetch
```
Performance counter stats for './sse_prefetch_transpose' (100 runs):
8,041,382 cache-misses # 65.431 % of all cache refs ( +- 0.89% )
12,289,899 cache-references ( +- 0.28% )
8,523,157 L1-dcache-load-misses # 1.83% of all L1-dcache hits ( +- 0.02% )
466,153,254 L1-dcache-loads ( +- 0.00% )
232,714,767 L1-dcache-stores ( +- 0.00% )
71,794 L1-icache-load-misses ( +- 2.35% )
0.168626115 seconds time elapsed ( +- 0.71% )
```
* avx
```
Performance counter stats for './avx_transpose' (100 runs):
10,646,344 cache-misses # 70.440 % of all cache refs ( +- 0.17% )
15,114,039 cache-references ( +- 0.06% )
8,129,909 L1-dcache-load-misses # 2.02% of all L1-dcache hits ( +- 0.03% )
402,956,831 L1-dcache-loads ( +- 0.00% )
210,944,765 L1-dcache-stores ( +- 0.00% )
72,525 L1-icache-load-misses ( +- 1.06% )
0.192949994 seconds time elapsed ( +- 0.20% )
```
| method | cache-misses rate| L1-dcache-load-misses rate |
| ------ | ---------------- | -------------------------- |
| naive | 86.654 % | 3.79% |
| sse | 78.909 % | 1.91% |
| sse prefetcher | 65.431 % | 1.83% |
| avx | 70.440 % | 2.02% |
> I notice the L1-dcache-load-misses rates are exactly the same as in the first run; I am not sure whether this is a coincidence or a mistake in my measurement. I will test other versions later to check.
* Time:
Transposed an 8x8 matrix:
```
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
0 8 16 24 32 40 48 56
1 9 17 25 33 41 49 57
2 10 18 26 34 42 50 58
3 11 19 27 35 43 51 59
4 12 20 28 36 44 52 60
5 13 21 29 37 45 53 61
6 14 22 30 38 46 54 62
7 15 23 31 39 47 55 63
```
* naive : 231472 us
* sse : 128587 us
* sse_prefetch : 58680 us
* avx : 55967 us
## References
* [Highlights and commentary on "When Prefetching Works, When It Doesn't, and Why"](https://hackmd.io/s/HJtfT3icx)
* [Paper: When Prefetching Works, When It Doesn't, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
* [kobeyu's notes](https://hackmd.io/s/BycRx2EC)
* [Prefetching lecture slides](http://www.ece.cmu.edu/~ece740/f11/lib/exe/fetch.php%3Fmedia%3Dwiki:lectures:onur-740-fall11-lecture24-prefetching-afterlecture.pdf)