Try   HackMD

software-pipelining

github contributed by <Diana Ho>

tags: d0651 sys

案例分析

預期目標

  • 學習計算機結構並且透過實驗來驗證所學
  • 理解 prefetch 對 cache 的影響,從而設計實驗來釐清相關議題
  • 論文閱讀和思考

作業要求

  • 閱讀 在計算機裡頭實踐演算法 提到的論文: "When Prefetching Works, When It Doesn’t, and Why" (務必事先閱讀論文,否則貿然看程式碼,只是陷入一片茫然!),在 Linux/x86_64 (注意,要用 64-bit 系統,不能透過虛擬機器執行) 上編譯並執行 prefetcher
    • 說明 naive_transposesse_transposesse_prefetch_transpose 之間的效能差異,以及 prefetcher 對 cache 的影響
  • 在 github 上 fork prefetcher,嘗試用 AVX 進一步提昇效能
    • 修改 Makefile,產生新的執行檔,分別對應於 naive_transposesse_transposesse_prefetch_transpose (學習 phonebook 的做法)
    • 用 perf 分析 cache miss/hit
    • 學習 perf stat 的 raw counter 命令
    • 參考 Performance of SSE and AVX Instruction Sets,用 SSE/AVX intrinsic 來改寫程式碼

Prefetching

Data Prefetch Mechanisms
Caches and Prefetching
Prefetching

預先將資料從 memory 載入 cache,再進行 CPU 運算,可以降低 CPU 等待時間進而提升效率

  • 時間
    載入至主記憶體的時機
    • 不能太晚(cpu已經運算完,不再需要這些資料)
    • 不能太早(有可能在使用之前就被踢出cache)
  • 記憶體
    prefetch intrinsic 提供參數可以選擇載入到哪個層級的快取記憶體

背景知識

  • SW prefetcher:
    • compiler中加強效能的algo.
    • intrinsics(內聯函數) e.g. SSE中的__mm_prefetch()
  • HW prefetcher: CPU 當中的 prefetcher
    • 用過去的存取紀錄作為prefetch的依據
    • 論文中提到的 GHB (Global History Buffer)
  • 名詞定義
    • Prefetch Hit: Prefetched line that was hit in the cache
      before being replaced (miss avoided)
    • Prefetch Miss: Prefetched line that was replaced before
      being accessed
    • Prefetch rate: Prefetches per instruction (or 1000 inst.)
    • Accuracy: Percentage of prefetch hits to all prefetches
    • Coverage: Percentage of misses avoided due to prefetching
      • 100 x (Prefetch Hits / (Prefetch Hits + Cache Misses))

參考概念
參考概念
參考概念

閱讀論文

When Prefetching Works, When It Doesn’t, and Why

3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING

  • benefits of SW prefetching

    • large num of stream
    • short stream
    • Irregular memory access
    • cache locality hint
    • loop bound
  • negative impacts of SW prefetching

    • Increased Instruction Count
    • Static Insertion
    • Code Structure Change
  • synergistic effect of SW & HW prefetching

    • Handling Multiple Streams
    • Positive Training
  • antagonistic effect of SW & HW prefetching

    • Negative Training
    • Harmful Software Prefetching

程式測試

$ ./main 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse prefetch: 192603 us sse: 215824 us naive: 365479 us
  • 使用Perf分析cache miss
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main Performance counter stats for './main' (50 runs): 27,985,483 cache-misses # 88.670 % of all cache refs (33.29%) 31,806,730 cache-references (50.24%) 34,757,822 L1-dcache-load-misses # 4.34% of all L1-dcache hits (66.90%) 806,498,136 L1-dcache-loads (66.16%) 352,545,104 L1-dcache-stores (64.64%) 377,450 L1-icache-load-misses (33.11%) 0.731182840 seconds time elapsed ( +- 3.65% )

更改 Makefile

$ make cache-test perf stat -e cache-misses,cache-references,instructions,cycles ./naive_transpose 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 naive: 310485 us Performance counter stats for './naive_transpose': 18,959,975 cache-misses # 93.414 % of all cache refs 20,296,767 cache-references 1,572,649,544 instructions # 0.89 insns per cycle 1,761,874,413 cycles 0.511842859 seconds time elapsed perf stat -e cache-misses,cache-references,instructions,cycles ./sse_transpose 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse: 160067 us Performance counter stats for './sse_transpose': 6,657,239 cache-misses # 84.130 % of all cache refs 7,913,069 cache-references 1,421,957,448 instructions # 1.13 insns per cycle 1,253,144,039 cycles 0.333203789 seconds time elapsed perf stat -e cache-misses,cache-references,instructions,cycles ./sse_prefetch_transpose 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse prefetch: 85866 us Performance counter stats for './sse_prefetch_transpose': 6,592,423 cache-misses # 83.190 % of all cache refs 7,924,577 cache-references 1,407,620,222 instructions # 1.54 insns per cycle 914,875,974 cycles 0.283419155 seconds time elapsed

效能差異

矩陣轉置

TRANSPOSE_IMPL

記憶體空間是 Row Major,所以在讀取列時是讀取連續的記憶體空間,寫入行則是寫入不連續的記憶體空間,所以矩陣轉置造成很高比例的 cache-misses.

naive_transpose

最簡單的矩陣轉置想法最直覺的矩陣轉置,從矩陣的左上角第一個元素開始,計算每一個元素轉置之後的目的地,並且存入成為新矩陣。

void naive_transpose(int *src, int *dst, int w, int h) { for(int x = 0; x < w; x++){ for(int y = 0; y < h; y++){ *(dst + x*h + y) = *(src + y*w + x); } } }

sse_transpose

使用了 Intel 處理器 SIMD 的技術,每個 int element 為 32-bit,所以 4x4 矩陣的每一個 row 剛好可以打包成 128-bit,再透過高低位的 swap,達到轉置的效
相較之下naive transpose 執行次數多且需要做 jump,增加 cache-misses 機會,sse transpose 更快。

效能改進

  • 一次將 4 筆數據放入 sse 暫存器中,一條指令處理 4 筆數據,比 4 筆數據 4 條指令處理快
  • loop unrolling:
    • 執行 loop 循環的組合語言代碼執行次數會變少
    • branch prediction miss 機率降低

Programming trivia: 4x4 integer matrix transpose in SSE2

void sse_transpose(int *src, int *dst, int w, int h) { for (int x = 0; x < w; x += 4) { for (int y = 0; y < h; y += 4) { __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x)); __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x)); __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x)); __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x)); __m128i T0 = _mm_unpacklo_epi32(I0, I1); __m128i T1 = _mm_unpacklo_epi32(I2, I3); __m128i T2 = _mm_unpackhi_epi32(I0, I1); __m128i T3 = _mm_unpackhi_epi32(I2, I3); I0 = _mm_unpacklo_epi64(T0, T1); I1 = _mm_unpackhi_epi64(T0, T1); I2 = _mm_unpacklo_epi64(T2, T3); I3 = _mm_unpackhi_epi64(T2, T3); _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0); _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1); _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2); _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3); } } }

Intel Intrinsics Guide
查詢函式的內部操作

  • __m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)
    把 src1, src2 的 lo 的部分依序以 32-bit 的部分移進 dst 裡
  • 其餘類推

理解實作

參考概念

  1. 首先,載入的資料分佈
      0       32      64      96      128
      +-------+-------+-------+-------+
I0    |  I00  |  I01  |  I02  |  I03  |
      +-------+-------+-------+-------+
I1    |  I10  |  I11  |  I12  |  I13  |
      +-------+-------+-------+-------+
I2    |  I20  |  I21  |  I22  |  I23  |
      +-------+-------+-------+-------+
I3    |  I30  |  I31  |  I32  |  I33  |
      +-------+-------+-------+-------+
  1. 操作 _mm_unpacklo_epi32()_mm_unpackhi_epi32()
      0       32      64      96      128
      +-------+-------+-------+-------+
T0    |  I00  |  I10  |  I01  |  I11  |
      +-------+-------+-------+-------+
T1    |  I20  |  I30  |  I21  |  I31  |
      +-------+-------+-------+-------+
T2    |  I02  |  I12  |  I03  |  I13  |
      +-------+-------+-------+-------+
T3    |  I22  |  I32  |  I23  |  I33  |
      +-------+-------+-------+-------+
  1. 使用 _mm_unpacklo_epi64()_mm_unpackhi_epi64() 完成轉置
      0       32      64      96      128
      +-------+-------+-------+-------+
I0    |  I00  |  I10  |  I20  |  I30  |
      +-------+-------+-------+-------+
I1    |  I01  |  I11  |  I21  |  I31  |
      +-------+-------+-------+-------+
I2    |  I02  |  I12  |  I22  |  I32  |
      +-------+-------+-------+-------+
I3    |  I03  |  I13  |  I23  |  I33  |
      +-------+-------+-------+-------+

sse_prefetch_transpose

Loads one cache line of data from address p to a location closer to the processor.

sse_transpose 多使用了 4 次 _mm_prefetch 指令,prefetch 將用到的資料放在比較近的地方降低去 memory 拿資料的成本

Fetch the line of data from memory that contains address p to a location in the cache heirarchy specified by the locality hint i.

_mm_prefetch(char * p , int i ) # 抓取 line of data #p : 從 address p 去讀取 #i : the type of prefetch operation: the constants _MM_HINT_T0,_MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA

論文證明

When Prefetching Works, When It Doesn’t, and Why

  • benefits of SW prefetching
    • Short Streams (16*4 byte)
    • Cache Locality Hint (預先 prefetch 到 L1cache )
    • Loop Bounds (unroll 4x4 矩陣)

效能改進

  • 降低 cache-misses
  • 降低執行時間(減少了去 memory 拿資料和 cache-misses 的時間)
void sse_prefetch_transpose(int *src, int *dst, int w, int h) { for (int x = 0; x < w; x += 4) { for (int y = 0; y < h; y += 4) { #define PFDIST 8 _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1); _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1); __m128i I0 = _mm_loadu_si128 ((__m128i *)(src + (y + 0) * w + x)); __m128i I1 = _mm_loadu_si128 ((__m128i *)(src + (y + 1) * w + x)); __m128i I2 = _mm_loadu_si128 ((__m128i *)(src + (y + 2) * w + x)); __m128i I3 = _mm_loadu_si128 ((__m128i *)(src + (y + 3) * w + x)); __m128i T0 = _mm_unpacklo_epi32(I0, I1); __m128i T1 = _mm_unpacklo_epi32(I2, I3); __m128i T2 = _mm_unpackhi_epi32(I0, I1); __m128i T3 = _mm_unpackhi_epi32(I2, I3); I0 = _mm_unpacklo_epi64(T0, T1); I1 = _mm_unpackhi_epi64(T0, T1); I2 = _mm_unpacklo_epi64(T2, T3); I3 = _mm_unpackhi_epi64(T2, T3); _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0); _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1); _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2); _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3); } } }

SSE/AVX

以 AVX 改寫 Prefetcher

依照 SSE 版本的想法來擴展成 AVX 版本

Introduction to Intel® Advanced Vector Extensions

AVX 指令集是 256-bit,所以一次處理 8 個 byte,loop 一次加 8

  1. 將依次將 8 行數組的元素載入到暫存器中
  2. _mm256_unpacklo/hi_epi32 函式讀入兩個 256-bit 的數,將低/高 128-bit 以 32-bit 為單位交錯排列,_mm256_unpacklo_epi64 同理
    • __m256i A = [ A0, A1, A2, A3, A4, A5, A6, A7 ];
    • __m256i B = [ B0, B1, B2, B3, B4, B5, B6, B7 ];
    • __m256i C = _mm256_unpacklo_epi32(I0, I1) = [ A0, B0, A1, B1, A2, B2, A3, B3 ];
void avx_transpose(int *src, int *dst, int w, int h) { for(int x=0; x < w ; x+= 8) { for(int y=0; y < h ; y+=8) { __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x)); __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x)); __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x)); __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x)); __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x)); __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x)); __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x)); __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x)); __m256i T0 = _mm256_unpacklo_epi32(I0 , I1); __m256i T1 = _mm256_unpackhi_epi32(I0 , I1); __m256i T2 = _mm256_unpacklo_epi32(I2 , I3); __m256i T3 = _mm256_unpackhi_epi32(I2 , I3); __m256i T4 = _mm256_unpacklo_epi32(I4 , I5); __m256i T5 = _mm256_unpackhi_epi32(I4 , I5); __m256i T6 = _mm256_unpacklo_epi32(I6 , I7); __m256i T7 = _mm256_unpackhi_epi32(I6 , I7); I0 = _mm256_unpacklo_epi64(T0, T2); I1 = _mm256_unpackhi_epi64(T0, T2); I2 = _mm256_unpacklo_epi64(T1, T3); I3 = _mm256_unpackhi_epi64(T1, T3); I4 = _mm256_unpacklo_epi64(T4, T6); I5 = _mm256_unpackhi_epi64(T4, T6); I6 = _mm256_unpacklo_epi64(T5, T7); I7 = _mm256_unpacklo_epi64(T5, T7); T0 = _mm256_permute2f128_si256(I0 , I4 , 0x20); T1 = _mm256_permute2f128_si256(I1 , I5 , 0x20); T2 = _mm256_permute2f128_si256(I2 , I6 , 0x20); T3 = _mm256_permute2f128_si256(I3 , I7 , 0x20); T4 = _mm256_permute2f128_si256(I0 , I4 , 0x31); T5 = _mm256_permute2f128_si256(I1 , I5 , 0x31); T6 = _mm256_permute2f128_si256(I2 , I6 , 0x31); T7 = _mm256_permute2f128_si256(I3 , I7 , 0x31); _mm256_storeu_si256((__m256i *) (dst + (x + 0) * h + y) , T0); _mm256_storeu_si256((__m256i *) (dst + (x + 1) * h + y) , T1); _mm256_storeu_si256((__m256i *) (dst + (x + 2) * h + y) , T2); _mm256_storeu_si256((__m256i *) (dst + (x + 3) * h + y) , T3); _mm256_storeu_si256((__m256i *) (dst + (x + 4) * h + y) , T4); _mm256_storeu_si256((__m256i *) (dst + (x + 5) * h + y) , T5); _mm256_storeu_si256((__m256i *) (dst + (x + 6) * h + y) , T6); _mm256_storeu_si256((__m256i *) (dst + (x + 7) * h + y) , T7); } } }
void avx_prefetch_transpose(int *src, int *dst, int w, int h) { for (int x = 0; x < w; x += 8) { for (int y = 0; y < h; y += 8) { #define AVX_PFDIST 8 _mm_prefetch(src + (y + AVX_PFDIST + 0) * w + x, _MM_HINT_T1); _mm_prefetch(src + (y + AVX_PFDIST + 1) * w + x, _MM_HINT_T1); _mm_prefetch(src + (y + AVX_PFDIST + 2) * w + x, _MM_HINT_T1); _mm_prefetch(src + (y + AVX_PFDIST + 3) * w + x, _MM_HINT_T1); _mm_prefetch(src + (y + AVX_PFDIST + 4) * w + x, _MM_HINT_T1); _mm_prefetch(src + (y + AVX_PFDIST + 5) * w + x, _MM_HINT_T1); _mm_prefetch(src + (y + AVX_PFDIST + 6) * w + x, _MM_HINT_T1); _mm_prefetch(src + (y + AVX_PFDIST + 7) * w + x, _MM_HINT_T1); __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x)); __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x)); __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x)); __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x)); __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x)); __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x)); __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x)); __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x)); __m256i T0 = _mm256_unpacklo_epi32(I0, I1); __m256i T1 = _mm256_unpacklo_epi32(I2, I3); __m256i T2 = _mm256_unpacklo_epi32(I4, I5); __m256i T3 = _mm256_unpacklo_epi32(I6, I7); __m256i T4 = _mm256_unpackhi_epi32(I0, I1); __m256i T5 = _mm256_unpackhi_epi32(I2, I3); __m256i T6 = _mm256_unpackhi_epi32(I4, I5); __m256i T7 = _mm256_unpackhi_epi32(I6, I7); I0 = _mm256_unpacklo_epi64(T0, T1); I1 = _mm256_unpackhi_epi64(T0, T1); I2 = _mm256_unpacklo_epi64(T2, T3); I3 = _mm256_unpackhi_epi64(T2, T3); I4 = _mm256_unpacklo_epi64(T4, T5); I5 = _mm256_unpackhi_epi64(T4, T5); I6 = _mm256_unpacklo_epi64(T6, T7); I7 = _mm256_unpackhi_epi64(T6, T7); T0 = _mm256_permute2x128_si256(I0, I2, 0x20); T1 = _mm256_permute2x128_si256(I1, I3, 0x20); T2 = _mm256_permute2x128_si256(I4, I6, 0x20); T3 = _mm256_permute2x128_si256(I5, I7, 0x20); T4 = _mm256_permute2x128_si256(I0, I2, 0x31); T5 = _mm256_permute2x128_si256(I1, I3, 0x31); T6 = _mm256_permute2x128_si256(I4, I6, 0x31); T7 = _mm256_permute2x128_si256(I5, I7, 0x31); _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0); _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1); _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2); _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3); _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4); _mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5); _mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6); _mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7); } } }

理解實作

參考概念

程式實作

  • 如果在 Makefile 裡試圖執行 sse_transpose 和 avx_transpose,會直接在產生目的檔時出現錯誤訊息
  • 如果在 Makefile 裡不執行 sse_transpose,只執行 sse_transpose,不會直接在產生目的檔時出現錯誤訊息,但是會在將目的檔轉成編譯檔時出現不合法的命令 (core dumped)
    • 還沒找到解決方法
$./avx_transpose make: *** [test] 不合法的命令 (core dumped) $ make cache-test In file included from /usr/lib/gcc/x86_64-linux-gnu/4.9/include/immintrin.h:41:0, from main.c:9: /usr/lib/gcc/x86_64-linux-gnu/4.9/include/avxintrin.h:896:1: error: inlining failed in call to always_inline_mm256_storeu_si256’: target specific option mismatch _mm256_storeu_si256 (__m256i *__P, __m256i __A) ^ In file included from main.c:18:0: impl.c:115:13: error: called from here _mm256_storeu_si256((__m256i *) (dst + (x + 7) * h + y) , T7); ^ make: *** [main] Error 1
  • cache-miss
  • 執行時間

參考筆記
參考筆記