2017q1 Homework3 (software-pipelining)

tags: `embedded`

contributed by <Cayonliow>
tags:

作業

題目： B05: software-pipelining
github: prefetcher
論文： When Prefetching Works, When It Doesn’t, and Why
相關資料： SIMD Programming Introduction, 在計算機裡頭實踐演算法
參考資料：

論文閱讀

在讀論文之前，去找了找 Prefetch 的定義

來自百度
- =预读取文件夹，用来存放系统已访问过的文件的预读信息，扩展名为PF。之所以自动创建Prefetch文件夹，是为了加快系统启动的进程
來自wikipedia（比較完整）
- Prefetching in computer science is a technique for speeding up fetch operations by beginning a fetch operation whose result is expected to be needed soon. Usually this is before it is known to be needed, so there is a risk of wasting time by prefetching data that will not be used. The technique can be applied in several circumstances
來自0xff07的共筆所提供的一份論文
- Table-Based Prefetching
  - Stride Prefetching
    
    * 在 stride prefetching 裏有一個表是用來儲存 stride-related local history information
    * 儲存 the most recent stride, last address, state infoemation
    * 當 prefetch 被觸發， a+s, a+2s,…a+ds 的 address 會被預讀
    * a = 最根本的地址
    * s = 偵測到的範圍
    * d = degree of prefetch （不太懂。。
  - Markov Prefetching
    - miss address stream: 發生 cache miss 的資料流
    - Correlation table : 記錄這個 address 的下一個發生cache miss 的 address
  - Distance Prefetching
    - 圖會跟 Markov Prefetching 長得很像，因爲是他是 Markov Prefetching 的概括（推論
    - 原本是設計給 TLB ，可是後來發現比較適合用來預讀 cache lines
    - address delta : 距離上一個發生 cache miss 的 address 的長度
    - Markov 是直接記錄地址，可是 Distance 是記錄與上一個距離，所以會出現很多重復的資料

接下來是論文的部分

提到的是 Software Prefetching, Hardware Prefetching, 跟兩者之間一起使用的利弊

圖上表示的是 Sw, Hw,Sw&Hw 用各種 benchmark 的效能的測試結果

這有提到各種形態造成的 index 是 direct 或是 Indirect 的，這會影響到 Hw 的 Prefetch 效能

Direct 指的是連續或是有某種規律的， Indirect 則是相反（沒有規律的）
Direct 可以很容易的被 HW 預讀，因爲有規律，可是如果是 Indirect，在Hw，需要一些特別的 prefetching mechanism, 可是 Sw 卻能更輕易的 prefetch
- 如上圖：在使用 Sw 進行 Indirect 的 Prefetching，會增加 Intruction Count 和更多的 memory access

在使用 Sw Prefetching 需要手動輸入一些程式碼

在 Intel 中的 SSE SIMD 指令

#include <mmintrinsics.h>
void _mm_prefetch(char * p , int i );

p = 要做 Prefetch 的資料的地址
i = 上圖的 Hint 的部分

_MM_HINT_T0 : 放到 L1 Cache
_MM_HINT_T1 : 放到 L2 Cache
_MM_HINT_T2 : 放到 L3 Cache
_MM_HINT_NTA : 這塊資料暫時不會用到

這裏說的是對 Prefetch 的分類只有Timely 是好的

Late: 在這筆資料需要被取用的時候還沒送到，可是卻做了 Prefetch 的動作，浪費效能
Early: 提早太多將資料送到，可是 cache 會一直刷新，在資料被取用之前就被洗到了，做了 Prefetch 的動作可是卻沒有用到
Redundant_dc: 送到多餘的資料去 data cache
dedundant_mshr: 送到多餘的資料去 MSHR⁵
Incorrect: 送到錯的資料

Software Prefetch Distance

l = prefetch latency, 延遲
s = the length of the shortest path through the loop body, （不懂

Software Prefetching & Hardware Prefetching

SW 的好處 = HW 的壞處

Large number of Stream
- SW 因爲可以手動輸入程式所以不會像 HW 一樣，因爲有多條資料流而混亂，
- Prefetch request 可以獨立被輸入進 lbm
- 下面的程式碼就有多條資料流

b[i][j][k] = (a[i][j][k] + a[i][j][k+1] + a[i][j][k-1] + a[i][j-1][k] + a[i][j+1][k]+...)/27

Short Streams
- HW 因爲時間去 "學習" 這串資料流的規律，至少兩個 cache misses 才能判斷資料流的方向，所以當資料流太短，可能在 HW 還沒有學會就已經結束了
Irregular Memory Access
- 只要手動輸入對應的內聯函數(intrinsics) 就可以任意 prefetch 各種沒有規律的資料，可是 HW 的話則需要非常復雜的架構
Cache locality Hint
- 可以選擇要放入的 cache
- HW 的準確性一般都很低，所以會很容易 L1 cache pollution
- L1 cache 的空間最小，速度最快，所以如果有L1 cache pollution 的話，會出現很多 cache miss ,對效能造成很大的影響
Loop Bounds
- 因爲對 HW 來說，他是靠學習來 prefetch 所以沒有人告訴他範圍，他會不停依據規律取資料，可是這些資料可能有些已經越界，而最後取到無用的資料

SW 的壞處 = HW 的好處

Instruction count
- SW 因爲需要手動輸入指令，所以指令的數量會增加，而且如果是 irregular / Indirect 的話，所需要的指令會更多, HW 因爲是已經寫死在硬體裏，所以沒有額外的指令
- - 在這裏可以看的出來 Indirect 所需要的指令會比較多
- - 在不同的 benchmark 測試下，有一些的 Prefetch 指令是多於原本的指令超過 100%
Static Insertion
- prefetching distance 跟記憶體搬資料到 Cache 的時間，以及 Cache 大小有關。而這些可能隨著不同硬體而有所變化，因而最佳的 prefetching distance 也有所不同。SW 不能在 Prefetch 進行的期間修改 Prefetching Distance
Code Structure change
- 如果在 loop 裏的指令很少，導致 SW 來不及 Prefetch，可能就要做 Loop Splitting 的動作來改變資料結構

SW & HW 的混用

Handling Multiple Streams
- 可以 prefetch 更多的資料流
- 如果有很多的資料流，可以將 Direct 跟 Regular 的分配給 HW， Indirect 跟 Irregulaer 的分配給 SW
Training
- 好處： SW 會幫助加快的 HW 的 Training
  - 有 refer 參考的概念
- 壞處：可是 SW 也會幫助減慢的 HW 的 Training
  - SW 所 prefetch 好的 prefetched blocks 有時候會被隱藏起來，這會導致 HW 的“學習” 錯誤，需要更長的 Training time
Harmful Software Prefetching
- HW 的準確性一般而言都會比 SW 差，當 SW 發生錯誤時，這會使 HW 有更大的壓力，使得 HW 的效能降低

開發記錄

先執行看看結果

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

 0  4  8 12
 1  5  9 13
 2  6 10 14
 3  7 11 15
sse prefetch: 	 58170 us
sse: 		     122664 us
naive: 		     267798 us

做的是矩陣轉置, 並看到這三種方式的執行時間

然後個別檢查各種方式的 cache miss

SSE Prefetch

 Performance counter stats for './main' (100 runs):

          428,9651      cache-misses              #   78.672 % of all cache refs      ( +-  0.05% )
          545,2589      cache-references                                              ( +-  0.01% )
          848,9913      L1-dcache-load-misses                                         ( +-  0.02% )
   <not supported>      L1-dcache-store-misses   
   <not supported>      L1-dcache-prefetch-misses
            2,9095      L1-icache-load-misses                                         ( +-  0.35% )

       0.186183896 seconds time elapsed                                          ( +-  0.09% )

 Performance counter stats for './main' (100 runs):

          448,4773      cache-misses              #   81.924 % of all cache refs      ( +-  0.61% )
          547,4291      cache-references                                              ( +-  0.06% )
          849,2977      L1-dcache-load-misses                                         ( +-  0.03% )
   <not supported>      L1-dcache-store-misses   
   <not supported>      L1-dcache-prefetch-misses
            4,1314      L1-icache-load-misses                                         ( +-  3.31% )

       0.253898320 seconds time elapsed                                          ( +-  0.57% )

Naive

 Performance counter stats for './main' (100 runs):

         1678,6696      cache-misses              #   93.020 % of all cache refs      ( +-  0.09% )
         1804,6256      cache-references                                              ( +-  0.01% )
         2105,5287      L1-dcache-load-misses                                         ( +-  0.01% )
   <not supported>      L1-dcache-store-misses   
   <not supported>      L1-dcache-prefetch-misses
            3,6499      L1-icache-load-misses                                         ( +-  2.63% )

       0.387200088 seconds time elapsed                                          ( +-  0.53% )

很明顯 naive 所造成的 cache-miss 是比較高的

perf stat 的結果有兩個 not supported 分別是 L1-dcache-store-misses 跟 L1-dcache-prefetch-misses 原因還在查詢中

AVX

參考 illusion030 的共筆與ierosodin 的共筆中的使用 SSE 的指令集

依樣畫葫蘆地跟着寫了一個版本
去 Intel Intrinsics Guide(AVX指令集的部分) 找每一個指令的意思
一開始忘記了 header file <immintrin.h>
然後還一直編譯不過，因爲沒有在 Makefile 裏加這個

CFLAGS = -msse2 --std gnu99 -O0 -Wall -Wextra -mavx2

	Description
_mm256_loadu_si256	從記憶體中讀入 256-bits 的整數（integer data）放入 dst. mem_addr (不需要有特定邊界？
_mm256_unpacklo_epi32	將兩個參數的 lower bits 以 32 bits 為單位輪流排序並 return
_mm256_unpackhi_epi32	將兩個參數的 higher bits 以 32 bits 為單位輪流排序並 return
_mm256_unpacklo_epi64	將兩個參數的 lower bits 以 64 bits 為單位輪流排序並 return
_mm256_unpackhi_epi64	將兩個參數的 higher bits 以 64 bits 為單位輪流排序並 return
_mm256_permute2x128_si256	將兩個參數以 128 bits 爲單位進行隨機洗牌，然後存入 dst
_mm256_storeu_si256	將整數數據以 256 bits 爲單位存入記憶體

執行100次，其中的5次執行時間:

avx: 		 58198 us
avx: 		 57160 us
avx: 		 59003 us
avx: 		 56443 us
avx: 		 56071 us

cache miss:

 Performance counter stats for './cache-miss-avx' (100 runs):

          331,0269      cache-misses              #   74.376 % of all cache refs      ( +-  0.31% )
          445,0747      cache-references                                              ( +-  0.32% )
          803,6992      L1-dcache-load-misses                                         ( +-  0.25% )
   <not supported>      L1-dcache-store-misses   
   <not supported>      L1-dcache-prefetch-misses
            3,4753      L1-icache-load-misses                                         ( +-  2.62% )

       0.200664836 seconds time elapsed                                          ( +-  1.17% )

不低，可是已經比錢三個版本來的低