github
contributed by <Diana Ho>
d0651
sys
Goals:
Compare the performance of naive_transpose, sse_transpose, and sse_prefetch_transpose, and observe how the prefetcher affects the cache.
Modify the Makefile so it produces separate executables for naive_transpose, sse_transpose, and sse_prefetch_transpose (following the phonebook approach); a sketch of one possible Makefile appears below.
Use perf stat raw-counter events to measure the prefetcher.
Key idea: loading data from memory into the cache before the CPU needs it reduces CPU stall time and therefore improves performance.
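A minimal sketch of how the Makefile might produce the three executables. This is illustrative only: it assumes main.c selects the implementation through a preprocessor macro, and the macro names (NAIVE, SSE, SSE_PREFETCH) are hypothetical, not the assignment's actual build rules.

CC = gcc
CFLAGS = -msse2 -O0 -std=gnu99

EXEC = naive_transpose sse_transpose sse_prefetch_transpose

all: $(EXEC)

naive_transpose: main.c
	$(CC) $(CFLAGS) -DNAIVE -o $@ main.c

sse_transpose: main.c
	$(CC) $(CFLAGS) -DSSE -o $@ main.c

sse_prefetch_transpose: main.c
	$(CC) $(CFLAGS) -DSSE_PREFETCH -o $@ main.c

clean:
	rm -f $(EXEC)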
The paper When Prefetching Works, When It Doesn't, and Why covers:
benefits of SW prefetching
negative impacts of SW prefetching
synergistic effect of SW & HW prefetching
antagonistic effect of SW & HW prefetching
$ ./main
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 192603 us
sse: 215824 us
naive: 365479 us
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main
Performance counter stats for './main' (50 runs):
27,985,483 cache-misses # 88.670 % of all cache refs (33.29%)
31,806,730 cache-references (50.24%)
34,757,822 L1-dcache-load-misses # 4.34% of all L1-dcache hits (66.90%)
806,498,136 L1-dcache-loads (66.16%)
352,545,104 L1-dcache-stores (64.64%)
377,450 L1-icache-load-misses (33.11%)
0.731182840 seconds time elapsed ( +- 3.65% )
$ make cache-test
perf stat -e cache-misses,cache-references,instructions,cycles ./naive_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
naive: 310485 us
Performance counter stats for './naive_transpose':
18,959,975 cache-misses # 93.414 % of all cache refs
20,296,767 cache-references
1,572,649,544 instructions # 0.89 insns per cycle
1,761,874,413 cycles
0.511842859 seconds time elapsed
perf stat -e cache-misses,cache-references,instructions,cycles ./sse_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse: 160067 us
Performance counter stats for './sse_transpose':
6,657,239 cache-misses # 84.130 % of all cache refs
7,913,069 cache-references
1,421,957,448 instructions # 1.13 insns per cycle
1,253,144,039 cycles
0.333203789 seconds time elapsed
perf stat -e cache-misses,cache-references,instructions,cycles ./sse_prefetch_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 85866 us
Performance counter stats for './sse_prefetch_transpose':
6,592,423 cache-misses # 83.190 % of all cache refs
7,924,577 cache-references
1,407,620,222 instructions # 1.54 insns per cycle
914,875,974 cycles
0.283419155 seconds time elapsed
Memory is laid out in row-major order, so reading along a row touches contiguous memory while writing down a column touches non-contiguous memory; this is why the matrix transpose produces such a high proportion of cache misses.
naive_transpose is the most intuitive transpose: starting from the top-left element, compute where each element lands after transposition and store it into the new matrix.
void naive_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++) {
        for (int y = 0; y < h; y++) {
            /* Gather column x of src (stride-w reads) and write it contiguously as row x of dst. */
            *(dst + x * h + y) = *(src + y * w + x);
        }
    }
}
sse_transpose uses Intel's SSE SIMD instructions. Each int element is 32 bits wide, so one row of a 4x4 block packs exactly into a 128-bit register; interleaving the low and high halves of these registers then yields the transpose. Compared with this, naive_transpose executes far more iterations and jumps, giving cache misses more chances to occur, so sse_transpose is faster.
void sse_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            /* Load a 4x4 block: one row per 128-bit register. */
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            /* Interleave 32-bit elements, then 64-bit pairs, to transpose the block. */
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            /* Store the transposed rows into the destination. */
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
The internal behavior of each intrinsic can be looked up in the Intel Intrinsics Guide, for example:
__m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)
which interleaves the two low 32-bit elements of a and b. The 4x4 block starts out as:
0 32 64 96 128
+-------+-------+-------+-------+
I0 | I00 | I01 | I02 | I03 |
+-------+-------+-------+-------+
I1 | I10 | I11 | I12 | I13 |
+-------+-------+-------+-------+
I2 | I20 | I21 | I22 | I23 |
+-------+-------+-------+-------+
I3 | I30 | I31 | I32 | I33 |
+-------+-------+-------+-------+
After _mm_unpacklo_epi32() and _mm_unpackhi_epi32():
0 32 64 96 128
+-------+-------+-------+-------+
T0 | I00 | I10 | I01 | I11 |
+-------+-------+-------+-------+
T1 | I20 | I30 | I21 | I31 |
+-------+-------+-------+-------+
T2 | I02 | I12 | I03 | I13 |
+-------+-------+-------+-------+
T3 | I22 | I32 | I23 | I33 |
+-------+-------+-------+-------+
After _mm_unpacklo_epi64() and _mm_unpackhi_epi64(), the transpose is complete:
0 32 64 96 128
+-------+-------+-------+-------+
I0 | I00 | I10 | I20 | I30 |
+-------+-------+-------+-------+
I1 | I01 | I11 | I21 | I31 |
+-------+-------+-------+-------+
I2 | I02 | I12 | I22 | I32 |
+-------+-------+-------+-------+
I3 | I03 | I13 | I23 | I33 |
+-------+-------+-------+-------+
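A minimal, self-contained sketch (not part of the assignment code) that prints the result of the two unpack steps for one 4x4 block and reproduces the tables above; compile with something like gcc -msse2:

#include <stdio.h>
#include <emmintrin.h> /* SSE2 intrinsics */

static void print128(const char *name, __m128i v)
{
    int out[4];
    _mm_storeu_si128((__m128i *)out, v);
    printf("%s: %2d %2d %2d %2d\n", name, out[0], out[1], out[2], out[3]);
}

int main(void)
{
    int src[16] = { 0, 1, 2, 3,   4, 5, 6, 7,
                    8, 9, 10, 11, 12, 13, 14, 15 };
    __m128i I0 = _mm_loadu_si128((__m128i *)(src + 0));
    __m128i I1 = _mm_loadu_si128((__m128i *)(src + 4));
    __m128i I2 = _mm_loadu_si128((__m128i *)(src + 8));
    __m128i I3 = _mm_loadu_si128((__m128i *)(src + 12));

    /* Step 1: interleave 32-bit elements of adjacent rows. */
    __m128i T0 = _mm_unpacklo_epi32(I0, I1);  /* 0 4 1 5 */
    __m128i T1 = _mm_unpacklo_epi32(I2, I3);  /* 8 12 9 13 */
    __m128i T2 = _mm_unpackhi_epi32(I0, I1);  /* 2 6 3 7 */
    __m128i T3 = _mm_unpackhi_epi32(I2, I3);  /* 10 14 11 15 */

    /* Step 2: interleave 64-bit halves to finish the transpose. */
    print128("row0", _mm_unpacklo_epi64(T0, T1)); /* 0 4 8 12 */
    print128("row1", _mm_unpackhi_epi64(T0, T1)); /* 1 5 9 13 */
    print128("row2", _mm_unpacklo_epi64(T2, T3)); /* 2 6 10 14 */
    print128("row3", _mm_unpackhi_epi64(T2, T3)); /* 3 7 11 15 */
    return 0;
}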
Loads one cache line of data from address p to a location closer to the processor.
sse_prefetch_transpose adds four _mm_prefetch instructions on top of sse_transpose; prefetching moves data that is about to be used to a location closer to the processor, reducing the cost of fetching it from memory.
Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
_mm_prefetch(char *p, int i)
# fetches one cache line of data
# p: the address whose cache line is read
# i: the type of prefetch operation: one of the constants _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA
The paper When Prefetching Works, When It Doesn't, and Why discusses how to choose the prefetch distance and locality hints.
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
#define PFDIST 8 /* prefetch distance: fetch the block 8 rows ahead of the one being transposed */
            _mm_prefetch(src + (y + PFDIST + 0) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 1) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 2) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 3) * w + x, _MM_HINT_T1);
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
Following the idea of the SSE version, the transpose can be extended to AVX. AVX registers are 256 bits wide, so each one holds eight 32-bit ints and the loops advance by 8 per iteration. _mm256_unpacklo/hi_epi32 takes two 256-bit values and interleaves their low/high halves in 32-bit units within each 128-bit lane, and _mm256_unpacklo/hi_epi64 does the same in 64-bit units. Because these unpacks never cross a 128-bit lane boundary, a final _mm256_permute2f128_si256 step is needed to exchange the lanes and finish the transpose (a small standalone demo of this lane behavior follows the code below).
void avx_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            /* Load an 8x8 block: one row per 256-bit register. */
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
            __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
            __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
            __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
            /* Interleave 32-bit elements within each 128-bit lane. */
            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T2 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
            __m256i T4 = _mm256_unpacklo_epi32(I4, I5);
            __m256i T5 = _mm256_unpackhi_epi32(I4, I5);
            __m256i T6 = _mm256_unpacklo_epi32(I6, I7);
            __m256i T7 = _mm256_unpackhi_epi32(I6, I7);
            /* Interleave 64-bit pairs within each 128-bit lane. */
            I0 = _mm256_unpacklo_epi64(T0, T2);
            I1 = _mm256_unpackhi_epi64(T0, T2);
            I2 = _mm256_unpacklo_epi64(T1, T3);
            I3 = _mm256_unpackhi_epi64(T1, T3);
            I4 = _mm256_unpacklo_epi64(T4, T6);
            I5 = _mm256_unpackhi_epi64(T4, T6);
            I6 = _mm256_unpacklo_epi64(T5, T7);
            I7 = _mm256_unpackhi_epi64(T5, T7);
            /* Exchange the 128-bit lanes to finish the transpose. */
            T0 = _mm256_permute2f128_si256(I0, I4, 0x20);
            T1 = _mm256_permute2f128_si256(I1, I5, 0x20);
            T2 = _mm256_permute2f128_si256(I2, I6, 0x20);
            T3 = _mm256_permute2f128_si256(I3, I7, 0x20);
            T4 = _mm256_permute2f128_si256(I0, I4, 0x31);
            T5 = _mm256_permute2f128_si256(I1, I5, 0x31);
            T6 = _mm256_permute2f128_si256(I2, I6, 0x31);
            T7 = _mm256_permute2f128_si256(I3, I7, 0x31);
            _mm256_storeu_si256((__m256i *)(dst + (x + 0) * h + y), T0);
            _mm256_storeu_si256((__m256i *)(dst + (x + 1) * h + y), T1);
            _mm256_storeu_si256((__m256i *)(dst + (x + 2) * h + y), T2);
            _mm256_storeu_si256((__m256i *)(dst + (x + 3) * h + y), T3);
            _mm256_storeu_si256((__m256i *)(dst + (x + 4) * h + y), T4);
            _mm256_storeu_si256((__m256i *)(dst + (x + 5) * h + y), T5);
            _mm256_storeu_si256((__m256i *)(dst + (x + 6) * h + y), T6);
            _mm256_storeu_si256((__m256i *)(dst + (x + 7) * h + y), T7);
        }
    }
}
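To see why the final permute step is needed, here is a small standalone sketch (not part of the assignment code; requires AVX2, e.g. gcc -mavx2) that prints what _mm256_unpacklo/hi_epi32 and _mm256_permute2f128_si256 actually produce for two known rows:

#include <stdio.h>
#include <immintrin.h> /* AVX/AVX2 intrinsics */

static void print256(const char *name, __m256i v)
{
    int out[8];
    _mm256_storeu_si256((__m256i *)out, v);
    printf("%s:", name);
    for (int i = 0; i < 8; i++)
        printf(" %2d", out[i]);
    printf("\n");
}

int main(void)
{
    int a[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };         /* row 0 */
    int b[8] = { 10, 11, 12, 13, 14, 15, 16, 17 }; /* row 1 */
    __m256i I0 = _mm256_loadu_si256((__m256i *)a);
    __m256i I1 = _mm256_loadu_si256((__m256i *)b);

    /* unpacklo works per 128-bit lane: it interleaves elements 0..1 of each
     * lane, giving 0 10 1 11 | 4 14 5 15 rather than 0 10 1 11 2 12 3 13. */
    __m256i lo = _mm256_unpacklo_epi32(I0, I1);
    __m256i hi = _mm256_unpackhi_epi32(I0, I1); /* 2 12 3 13 | 6 16 7 17 */
    print256("unpacklo_epi32", lo);
    print256("unpackhi_epi32", hi);

    /* permute2f128 glues lanes back together: 0x20 picks the low lane of each
     * operand, 0x31 picks the high lane of each operand. */
    print256("permute 0x20  ", _mm256_permute2f128_si256(lo, hi, 0x20)); /* 0 10 1 11 2 12 3 13 */
    print256("permute 0x31  ", _mm256_permute2f128_si256(lo, hi, 0x31)); /* 4 14 5 15 6 16 7 17 */
    return 0;
}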
void avx_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
#define AVX_PFDIST 8
            _mm_prefetch(src + (y + AVX_PFDIST + 0) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 1) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 2) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 3) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 4) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 5) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 6) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 7) * w + x, _MM_HINT_T1);
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
            __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
            __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
            __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T2 = _mm256_unpacklo_epi32(I4, I5);
            __m256i T3 = _mm256_unpacklo_epi32(I6, I7);
            __m256i T4 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T5 = _mm256_unpackhi_epi32(I2, I3);
            __m256i T6 = _mm256_unpackhi_epi32(I4, I5);
            __m256i T7 = _mm256_unpackhi_epi32(I6, I7);
            I0 = _mm256_unpacklo_epi64(T0, T1);
            I1 = _mm256_unpackhi_epi64(T0, T1);
            I2 = _mm256_unpacklo_epi64(T2, T3);
            I3 = _mm256_unpackhi_epi64(T2, T3);
            I4 = _mm256_unpacklo_epi64(T4, T5);
            I5 = _mm256_unpackhi_epi64(T4, T5);
            I6 = _mm256_unpacklo_epi64(T6, T7);
            I7 = _mm256_unpackhi_epi64(T6, T7);
            T0 = _mm256_permute2x128_si256(I0, I2, 0x20);
            T1 = _mm256_permute2x128_si256(I1, I3, 0x20);
            T2 = _mm256_permute2x128_si256(I4, I6, 0x20);
            T3 = _mm256_permute2x128_si256(I5, I7, 0x20);
            T4 = _mm256_permute2x128_si256(I0, I2, 0x31);
            T5 = _mm256_permute2x128_si256(I1, I3, 0x31);
            T6 = _mm256_permute2x128_si256(I4, I6, 0x31);
            T7 = _mm256_permute2x128_si256(I5, I7, 0x31);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7);
        }
    }
}
Problems encountered while trying to run and understand the AVX implementation.
Running the AVX executable aborted with an illegal instruction:
$ ./avx_transpose
make: *** [test] Illegal instruction (core dumped)
$ make cache-test
In file included from /usr/lib/gcc/x86_64-linux-gnu/4.9/include/immintrin.h:41:0,
from main.c:9:
/usr/lib/gcc/x86_64-linux-gnu/4.9/include/avxintrin.h:896:1: error: inlining failed in call to always_inline ‘_mm256_storeu_si256’: target specific option mismatch
_mm256_storeu_si256 (__m256i *__P, __m256i __A)
^
In file included from main.c:18:0:
impl.c:115:13: error: called from here
_mm256_storeu_si256((__m256i *) (dst + (x + 7) * h + y) , T7);
^
make: *** [main] Error 1
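Both errors point at missing AVX support: the "target specific option mismatch" message appears when AVX intrinsics are compiled without an -mavx/-mavx2 flag, and "Illegal instruction" appears when an AVX binary runs on a CPU (or VM) that does not advertise AVX. A likely fix, sketched below under the assumption that the AVX code is built from main.c as in the error output (the -DAVX macro is hypothetical):

# Check that the CPU actually advertises AVX2
grep -m1 -o avx2 /proc/cpuinfo

# Rebuild with AVX2 enabled
gcc -mavx2 -O0 -std=gnu99 -DAVX -o avx_transpose main.c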