2016q3 Homework4 (software-pipelining+)

contributed by <SwimGlass>, <carolc0708>

github

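Goals

Summarize the paper, marking key points with notes and explanations
Collect detailed information on hardware prefetchers
Run the transpose experiment with ARM NEON (as in Homework 3) and observe the performance when prefetching is used
If possible, reproduce the paper's experiment on an ARM CPU: the performance impact of using HW and SW prefetchers together

Open questions

Verify the data-consistency question: if the data is already in the cache and a prefetch is issued anyway, how is it handled?

Terminology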

Data reference patterns can be classified as follows:

Temporal: data will be used again soon
Spatial: data will be used in adjacent locations (for example, on the same cache line)
Non-temporal: data which is referenced once and not reused in the immediate future (for example, for some multimedia data types, such as the vertex buffer in a 3D graphics application)

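Paper highlights

1. Please put a space between Chinese and English keywords
2. Remember to update the GitHub link!
Course TA

Purpose of the study

Provide an effective method for inserting prefetch intrinsics when a HW prefetcher is present
Identify the best-performing combinations when HW and SW prefetchers coexist

Background

SW prefetcher:
- performance-enhancing algorithms in the compiler
- intrinsics, e.g. _mm_prefetch() in SSE
HW prefetcher: the prefetcher inside the CPU
- e.g. the GHB (Global History Buffer) prefetcher discussed in the paper

Outline

There is no strict standard for how to insert prefetch intrinsics
The complexity of the interaction between software and hardware prefetching is not well understood

Experimental results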

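CPU: Intel Core 2 and Nehalem
Compiler: icc (experiments were also run with gcc)
GHB (stride prefetcher) / STR (stream prefetcher): compiler + HW prefetcher
SW: compiler + prefetch intrinsics inserted in the program
x-axis: SPEC CPU 2006 benchmarks
y-axis: execution time normalized to the baseline binaries (only compiler-inserted SW prefetches)
Speedup (Positive/Neutral/Negative): how much faster the best SW+HW prefetcher configuration is than the best HW-only prefetcher configuration

Which formula is used for the normalization here??

Summary of key conclusions

Confirmed: SW prefetching works well for short arrays, for sequential and irregular memory accesses, and for reducing L1 cache misses
New finding: SW prefetching has a training effect on the HW prefetcher that can degrade performance, especially when SW prefetches cover only part of a stream.

cache-line access: stream & stride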

HW & SW prefetcher background

Data structures

Suitable SW and HW prefetching strategies for different data structures:
[Side note] Applicable to all data structures: dead-block-based prefetching
In the experiments…
- Arrays and RDS (Recursive Data Structures) are tried
- Cache-line access is defined as:
  - stream: unit-stride
  - stride: access stride distance greater than 2 cache lines
- HW prefetchers tried: stream prefetcher, GHB prefetcher (stride prefetcher), content-based prefetcher

What is a content-based prefetcher?
We should study the different prefetching mechanisms properly!!

This is covered in the evaluation section. Carol Chen

SW prefetch intrinsics

Uses the SSE intrinsic _mm_prefetch(char *ptr, int hint)
- prefetch address
- prefetch usage hint (one of _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA)
How the intrinsic translates to assembly:
- direct address prefetches: 2 instructions per intrinsic
- indirect memory address prefetches: 4 instructions per intrinsic
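
A minimal sketch of inserting a prefetch hint into a streaming loop. It uses the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)` instead of `_mm_prefetch` so that it also compiles on non-x86 targets such as ARM. The function name `sum_with_prefetch` and the distance `PF_DIST` are illustrative assumptions, not values from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative prefetch distance in elements (an assumption, not a tuned value). */
#define PF_DIST 16

/* Sum an array while hinting that a[i + PF_DIST] will be read soon.
 * rw = 0 requests a prefetch for reading; locality = 3 asks to keep the
 * line in all cache levels (similar in intent to the T0 hint). */
long sum_with_prefetch(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)                 /* stay inside the array */
            __builtin_prefetch(&a[i + PF_DIST], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

The bounds check costs one extra branch per iteration; this is part of the instruction overhead that the notes discuss later.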

Prefetch classification

MSHR: Miss Status Holding Register

redundant_dc is exactly the part we want to verify!!

SW prefetch distance

$D \ge \lceil l / s \rceil$

The prefetch distance D must be large enough to hide the memory latency
- l : prefetch latency
- s : length of the shortest path through the loop body
However, a prefetch distance that is too large can backfire, e.g. lower coverage and more cache misses
- Coverage: the fraction of cache misses avoided by prefetching
  - 100 x (Prefetch Hits / (Prefetch Hits + Cache Misses))
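
The two quantities above can be sketched in C. This assumes the usual lower bound D >= ceil(l / s) for the prefetch distance and the coverage definition given in the notes. The function names and the cycle counts used in the usage note are hypothetical.

```c
#include <assert.h>

/* Smallest prefetch distance (in loop iterations) that hides the latency:
 * D >= ceil(l / s), where l is the prefetch latency in cycles and s is the
 * number of cycles on the shortest path through one loop iteration. */
unsigned min_prefetch_distance(unsigned l, unsigned s)
{
    assert(s > 0);
    return (l + s - 1) / s;   /* integer ceiling of l / s */
}

/* Coverage in percent: 100 * hits / (hits + misses); 0 if nothing was counted. */
double prefetch_coverage(unsigned long prefetch_hits, unsigned long cache_misses)
{
    unsigned long total = prefetch_hits + cache_misses;
    return total == 0 ? 0.0 : 100.0 * (double)prefetch_hits / (double)total;
}
```

For instance, with a hypothetical latency of 300 cycles and a 36-cycle loop body, `min_prefetch_distance(300, 36)` gives 9 iterations.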

Direct and Indirect Memory Indexing

SW prefetching should handle indirect memory indexing better than HW prefetching; HW prefetching usually targets predictable streams and strides
Indirect memory accesses can add SW prefetching overhead
- e.g. a[i] needs one prefetch, whereas a[b[i]] needs two
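
A sketch of why the indirect case needs two prefetches, under stated assumptions: to prefetch `a[b[i + D]]` the index `b[i + D]` must already be cached, so `b` itself is prefetched further ahead. `gather_sum` and `PF_DIST` are hypothetical names, and the choice of `2 * PF_DIST` for the index stream is an illustrative convention, not a tuned value.

```c
#include <assert.h>
#include <stddef.h>

#define PF_DIST 8   /* illustrative distance in iterations (assumption) */

/* Sum a[b[i]]: b is a regular stream, a is accessed indirectly through b. */
long gather_sum(const int *a, const size_t *b, size_t n)
{
    assert((a != NULL && b != NULL) || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 2 * PF_DIST < n)
            __builtin_prefetch(&b[i + 2 * PF_DIST], 0, 3);  /* index stream */
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[b[i + PF_DIST]], 0, 3);   /* indirect target */
        sum += a[b[i]];
    }
    return sum;
}
```

Computing `&a[b[i + PF_DIST]]` requires an extra load of `b[i + PF_DIST]`, which is one source of the additional instructions per indirect prefetch.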

The abbreviation of "for example" is e.g., not ex; the latter is the prefix for ex-boyfriends and ex-bosses

Positive and negative effects of SW prefetching on performance

Benefits of SW prefetching

Large Number of Streams (Limited Hardware Resources)
- The number of streams a HW prefetcher can handle is limited by hardware resources such as stream detectors and book-keeping mechanisms
- Stream count exceeding HW capacity: e.g. lbm, a stencil-based computation, a 3D grid data structure and loop references all 27 nearest neighbors for each grid point (b[i][j][k] = (a[i][j][k] + a[i][j][k+1] + a[i][j][k−1] + a[i][j−1][k] + a[i][j+1][k] + …)/27)
- A SW prefetcher can instead insert prefetch requests for each individual stream

What is a book-keeping mechanism?

"Book-keeping" literally means keeping accounts; in computer science it has come to mean repeatedly tracking and recording state. jserv

Short Streams
- A HW prefetcher needs time to train. It usually takes 2 cache misses to determine the direction of a stream or stride.
- Example: milc exhibits short stream behavior. In particular, it operates on many 3x3 matrices using 16 bytes per element, so the total size of the matrix is 144 bytes, or just three cache blocks.
- HW prefetchers that handle short streams have since been proposed [Hur and Lin 2006, 2009]
Irregular Memory Access
- A SW prefetcher (by inserting intrinsics) is better able to prefetch complex data structures
- RDS: e.g. mcf
Cache Locality Hint
- With SW prefetching, the programmer can tune the locality hint
- HW prefetchers are mostly designed to place prefetched data in the lower-level caches (L2 or L3)
  - Pro: less cache pollution in the L1 cache
  - Con: the latency of moving data from the lower-level cache to L1 hurts performance
Loop Bounds
- With SW prefetching, the program can be written so that prefetch requests never go beyond the loop's iteration count
- e.g. using loop unrolling, SW pipelining, or branch instructions
- HW prefetching cannot be controlled this way, especially when the HW prefetcher is aggressive (large prefetch distance, high prefetch degree); it may even consume excessive memory bandwidth

I would like to know how the SW prefetching techniques above (loop unrolling, SW pipelining, branch instructions) are implemented
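
One possible answer, as a sketch rather than the paper's own code: the loop is split into a main loop, in which every prefetch address is known to be in bounds, and an epilogue that handles the last `PF_DIST` iterations without prefetching. This removes the per-iteration bounds check that a branch-based version needs. `scale_split` and `PF_DIST` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define PF_DIST 16   /* illustrative distance in elements (assumption) */

/* dst[i] = src[i] * k, with prefetches that never pass the end of src. */
void scale_split(float *dst, const float *src, size_t n, float k)
{
    assert((dst != NULL && src != NULL) || n == 0);
    size_t main_end = n > PF_DIST ? n - PF_DIST : 0;
    size_t i = 0;

    /* Main loop: i + PF_DIST < n always holds, so no bounds check is needed. */
    for (; i < main_end; i++) {
        __builtin_prefetch(&src[i + PF_DIST], 0, 3);
        dst[i] = src[i] * k;
    }
    /* Epilogue: the final iterations have nothing left to prefetch. */
    for (; i < n; i++)
        dst[i] = src[i] * k;
}
```

A software-pipelined version would add a prologue that prefetches the first `PF_DIST` elements before the main loop starts.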

Drawbacks of SW prefetching

Increased Instruction Count
- SW prefetching increases the number of instructions
- The figure shows how many instructions each benchmark gains after prefetch requests are added
Static Insertion
- Once the programmer has inserted SW prefetches, they cannot be adjusted dynamically at run time, even when memory latency, effective cache size, or bandwidth actually change
- The problem is worse in heterogeneous architectures
- Feedback-driven adaptive hardware prefetching has been developed [Ebrahimi et al. 2009; Srinath et al. 2007]; it adjusts prefetch distance and prefetch degree dynamically

Adaptivity is important because it is well-known that the pattern of memory addresses can exhibit phase behavior [Zhang et al. 2006].
What is that?

Code Structure Change
- When a loop body has too few instructions (each iteration is too short), inserting prefetch requests becomes difficult (memory latency may not be hidden ==> the prefetch distance < 1 case); loop splitting is then needed
- Inserting prefetches may also change the instruction count so much that restructuring the code becomes difficult

What is loop splitting?

Exactly what it says: to make prefetching effective, the instructions in the loop have to be split up appropriately. jserv
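
As an illustration of restructuring a loop whose body is too small, here is one possible transformation (strip-mining, chosen here as an assumption; the paper's own transformation may differ). The tiny body is grouped into cache-line-sized chunks so that only one prefetch is issued per line instead of one per element. The names and the constants `LINE_INTS` and `PF_LINES` are hypothetical and assume 4-byte `int` and 64-byte cache lines.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_INTS 16   /* ints per cache line (assumes 4-byte int, 64-byte line) */
#define PF_LINES  4    /* illustrative prefetch distance in cache lines */

long sum_one_prefetch_per_line(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    size_t i = 0;

    /* Outer loop walks whole cache lines and issues one prefetch per line. */
    for (; i + LINE_INTS <= n; i += LINE_INTS) {
        if (i + PF_LINES * LINE_INTS < n)
            __builtin_prefetch(&a[i + PF_LINES * LINE_INTS], 0, 3);
        /* Inner loop: the original tiny body, now free of prefetches. */
        for (size_t j = 0; j < LINE_INTS; j++)
            sum += a[i + j];
    }
    for (; i < n; i++)   /* remainder that does not fill a whole line */
        sum += a[i];
    return sum;
}
```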

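Benefits of SW + HW prefetching

Handling Multiple Streams
- More streams can be handled at once: the HW prefetcher covers regular streams and the SW prefetcher covers irregular streams
Positive Training
- SW prefetches can train the HW prefetcher in a positive way
- Example: when a SW prefetch is too late, a HW prefetcher trained by it can improve timeliness

Drawbacks of SW + HW prefetching

Negative Training
- When SW prefetches are not applied to every stream, the HW prefetcher's training can suffer
- SW prefetches can also overtrain the HW prefetcher, making it overly aggressive and degrading performance, as shown in the figure:

Find a way to reproduce this experiment. jserv

Harmful Software Prefetching
- SW prefetches are usually more accurate than HW prefetches, so when a SW prefetch is wrong or too early, it can make HW prefetching behave even less accurately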

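Methodology

Prefetch insertion algorithm

Prefetch candidates: L1 cache misses per 1000 instructions (MPKI) > 0.05
Prefetch distance:

$Distance = \frac{K \times L \times IPC_{bench}}{W_{loop}}$

where K is a constant,
L is the average memory latency,
$IPC_{bench}$ is the profiled average IPC of each benchmark, and
$W_{loop}$ is the average number of instructions per loop iteration
- Here K = 4 and L = 300
- $IPC_{bench}$ and $W_{loop}$ are determined from each benchmark's profile data
  ==> the choice of prefetch distance is therefore profile-driven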
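
A worked sketch of this calculation, assuming the formula is Distance = K × L × IPC_bench / W_loop with the symbols defined above. The function names, the sample IPC and loop size in the usage note, and the choice to round up to at least one iteration are illustrative assumptions.

```c
#include <assert.h>

/* Distance = K * L * IPC_bench / W_loop, in loop iterations (may be fractional). */
double profile_prefetch_distance(double k, double latency,
                                 double ipc_bench, double w_loop)
{
    assert(w_loop > 0.0);
    return k * latency * ipc_bench / w_loop;
}

/* Round up to a whole number of iterations, and use at least 1
 * (the lower bound of 1 is an assumption made for this sketch). */
unsigned prefetch_distance_iters(double distance)
{
    assert(distance >= 0.0);
    unsigned d = (unsigned)distance;   /* truncates toward zero */
    if ((double)d < distance)
        d++;
    return d == 0 ? 1 : d;
}
```

With K = 4, L = 300, a hypothetical profiled IPC of 0.5, and 60 instructions per iteration, the distance is 4 × 300 × 0.5 / 60 = 10 iterations.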

Simulation

MacSim
Pinpoints: selects representative program regions to simulate

Evaluation

Overhead and limitations of SW prefetching

Instruction overhead: a large increase in instruction count
- GemsFDTD, bwaves, and leslie3d gain more than 50% instructions; bwaves gains more than 100%
- Besides the prefetch instructions themselves, some of the added instructions handle indirect memory accesses and index calculations
- Even with more than 50% extra instructions, GemsFDTD still improves with SW prefetching (it is in the positive group)
  ==> the benefit of prefetching outweighs the instruction overhead
SW prefetching overhead: each negative effect of SW prefetching is removed in turn to see how much performance improves
- The experiment removes, one at a time, cache pollution overhead (SW+P), bandwidth consumption (SW+B, between the CPU core and memory), memory access latency (SW+L), redundant prefetch overhead (SW+R), and instruction overhead (SW+I)
- Cache pollution can come from early or incorrect prefetches, but both are rare in the experiments, so removing cache pollution (SW+P in the figure) barely reduces the cycle count
- Removing bandwidth consumption (SW+B in the figure) also has little effect, meaning bandwidth is sufficient for the single-threaded applications in the experiments
- Removing memory access latency (SW+L in the figure) has a large effect, meaning prefetching does not fully hide memory latency in these experiments
- Although there are many redundant prefetches, removing redundant prefetch overhead (SW+R in the figure) has an almost negligible effect
- Although GemsFDTD, bwaves, and leslie3d have a large instruction overhead, removing it (SW+I in the figure) also has little effect
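Effect of prefetch distance: how sensitive each benchmark is to the prefetch distance
- The x-axis is the offset from the base prefetch distance (the distance computed by the formula in the prefetch insertion algorithm)
  - Prefetch distance formula:
    $Distance = \frac{K \times L \times IPC_{bench}}{W_{loop}}$
- Based on the width of the optimal zone and the difference in performance, the benchmarks are divided into five groups
- Benchmarks with no overall improvement or with a slowdown mostly fall in group E, the group that is almost insensitive to prefetch distance
- cactusADM and bwaves are exceptions to the classification: they reach their best performance right from the start (x-axis offset -4), so their optimal zone is wider

I am not sure how these five groups are divided.

It seems to be the combinations of narrow / wide Optimal Zone and low / high Perf. Delta, with the benchmarks roughly classified by how they behave. Carol Chen

Does Perf. Delta mean the difference between the best and the non-best performance?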

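Fixed prefetch distance vs. machine configuration: how sensitive the best prefetch distance is to the machine configuration
- With a fixed prefetch distance, performance may vary across machine configurations.
- The experiment tests the (fixed) best prefetch distance on three processor configurations: base, less-aggressive, and aggressive.
  - The three left plots show the best prefetch distance on each of the three processors
  - The three right plots show the performance difference between using the baseline's best prefetch distance and using each processor's own best distance, on the less-aggressive and aggressive processors
  - For example, the best distances for lbm are 2, 0, and 8 for less-aggressive, base, and aggressive processors, respectively. We measure the performance difference between that of distance 2 and 0 in the less-aggressive processor and the difference between that of distance 8 and 0 in the aggressive processor.
- Except for lbm (explained in the case study later), the benchmarks show little performance difference when using the best prefetch distance. This confirms the prefetch distance experiment, in which most benchmarks were classified into group E (insensitive)
  ==> Conclusion: the performance effect of a fixed prefetch distance is largely insensitive to the machine configuration; even if the configuration changes at run time, the difference is small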
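Cache-level insertion policy
- Most out-of-order (OOO) processors can tolerate L1 cache misses (the impact is negligible), so HW prefetchers usually prefetch into the last-level cache (L2 or L3) to lower the chance of polluting the L1 cache.
  ==> However, too many L1 cache misses also degrade performance.
  ==> In that case, the more accurate SW prefetch can bring data into the L1 cache without polluting it (misses are already so frequent that further pollution is unlikely)
  - The experiments use T0
  - The cache hierarchy used in the experiments has two levels (only L1 and L3), so T1 and T2 give the same result
- The figure shows each benchmark's performance under different locality hints
- As expected, T0 performs better than T1 because it also hides L1 cache misses. This can be observed in libquantum, where the L1 cache miss rate drops from 10.4% to 0.01%, as shown in the figure:
- For streaming applications with little data reuse, T1 and T2 should perform well. However, no benchmark in the experiments behaves this way; in all observed results T0 outperforms T1 and T2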

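Effect of using HW and SW prefetchers together

HW prefetcher training effect
- Two training policies are evaluated:
  - NT (SW+GHB, SW+STR): the HW prefetcher's training algorithm ignores SW prefetch requests and is trained only by demand cache misses
  - Train (SW+GHB+T, SW+STR+T): the HW prefetcher's training also includes SW prefetch requests
- The figure shows, per benchmark, the result of training the HW prefetcher with SW prefetch requests:
  - sphinx3: both HW prefetchers improve with training
  - milc, gcc, soplex: both HW prefetchers get worse
  - mcf, bwaves: largely unaffected by training
  - other benchmarks: the outcome depends on the HW prefetcher (GHB or STR)
- Gains from training reach 3-5%, all from training the HW prefetcher earlier and thereby reducing late prefetches
- Losses from training are much larger: -55.9% for libquantum and -22.5% for milc.
  - milc contains short streams, so SW prefetch requests trigger unnecessary HW prefetches
  - Why libquantum degrades:
    (1) Severe L1 miss penalties: SW prefetches mainly bring data into the L1 and L2 caches. When the HW prefetcher is trained by SW prefetches, data is prefetched only into L2, so the benefit of L1 prefetching is lost and L1 cache misses rise sharply
    (2) Overly aggressive prefetching: the HW prefetcher has its own prefetch distance and is additionally trained by SW prefetch requests, producing too many early prefetches
- Conclusion: training helps some benchmarks, but the losses are far larger than the gains. In general, using SW prefetches to train the HW prefetcher is not a good choice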
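Prefetch Coverage (ratio)
- Coverage: Percentage of misses avoided due to prefetching
  100 x (Prefetch Hits / (Prefetch Hits + Cache Misses))

  (a) Effective prefetch coverage of GHB and STR (relative to all L2 misses)
  (b) Prefetch coverage of SW, SW+GHB, and SW+STR. Additional useful prefetches generated by the HW prefetcher are shown in white.
- In general SW prefetching has higher coverage than HW prefetching, but the opposite holds in the neutral and negative groups
  ==> low coverage can therefore be identified as the main reason performance fails to improve
Prefetching Classification
- Classification of prefetches:
- Classifying the experimental results by the table above:
  - Only bzip2 and soplex have more than 10% early prefetches. A prefetch cache could reduce the effect of cache pollution
  - mcf has 10% late prefetches, caused by a dependent load instruction generated by the prefetch intrinsic. Removing memory latency (SW+L in the earlier experiment) improves performance by 9%
  - Even with many redundant prefetches, the negative impact is small (SW+R in the earlier experiment)
- Recall the earlier SW+L and SW+R results:

What does "prefetch cache" refer to here? How is it implemented?

When does a dependent load instruction occur?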

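Using the GCC compiler

The gcc compiler (gfortran) does not support prefetch intrinsics for Fortran
==> In the experiments, prefetch intrinsics are inserted after compiling with gcc, and benchmarks containing Fortran code are excluded
gcc 4.4 is used
The results are shown in the two figures:
- Except for mcf and lbm, all benchmarks get more prefetches inserted when compiled with gcc
- icc inserts no prefetches (so the icc result = NOSW: the compiler's prefetch flag is disabled)
- gcc still cannot eliminate all cache misses
- gcc's performance lies between NOSW (icc) and SW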

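Real-system experiments

Experiments are run on real processors (not the simulator), Core 2 and Nehalem, with the hardware specifications in the table:
The full benchmark programs are run, not just the simpointed sections
The experiments show that the simpointed sections and the whole programs contain about the same amount of prefetch instructions, as in the figure:
- For milc, gcc, and zeusmp, the simpointed sections represent the whole program accurately (similar results)
- The simulator's hardware configuration resembles Core 2
Results are shown in the figure:
- cactusADM, soplex, bwaves, sphinx3, and leslie3d improve slightly on the simulator but slow down on the real processors
- All benchmarks in the positive group also improve on the real processors
Compared with Core 2, Nehalem's HW prefetching is more effective. Because its HW prefetcher prefetches at more cache levels, the gain from SW prefetching (L1 prefetch only) is smaller than in the simulator-only case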

Profile-Guided Optimization (PGO)

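icc results are excluded because the compiler inserts no prefetch intrinsics
gcc 4.4 is run with the -fprofile-generate and -fprofile-use flags; the results are shown in the figure:
Performance lies between base and SW, but the gains do not all come from prefetching
==> Figures (c) and (d) show that with PGO both the total instruction count and the number of prefetch instructions decrease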

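HW prefetcher suited to short streams

ASD HW prefetcher: a HW prefetcher that can handle short streams
- Hur and Lin 2006, 2009 (so far only a proposed concept, with no real implementation)
Results of combining ASD with SW and HW prefetching:
- milc (has short streams):
  - 7% improvement without SW prefetching
  - With SW prefetching alone, it runs more than 2x faster
Even though ASD can predict short streams, it still has to observe the first instance of a memory stream
ASD gives more benefit than GHB and STR, but the overall improvement is < 3%
Adding ASD improves gcc, sphinx3 (sph), and leslie3d (les), but by at most 2%
==> Conclusion: SW prefetching still handles short streams better than ASD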

Content Directed Prefetching (CDP)

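Purpose: prefetching for linked or irregular data structures
Results are shown in the figure:
- gcc and mcf improve by 3.9% and 6.5% respectively
- A prefetch cache is added to avoid a major problem of CDP: cache pollution. CDP+PF in the figure shows a further improvement of 2% and 5%
SW prefetching (78% improvement) is still the better fit for irregular data structures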

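Manual tuning of SW prefetching

There is no guarantee that the SW prefetch algorithm produces the best possible result
In the experiments, the SW prefetches are additionally tuned by hand:
- adjusting the prefetch distance
- handling extra overhead, e.g. with loop unrolling and prologue/epilogue transformations

How do prologue/epilogue transformations deal with the overhead?

milc, GemsFDTD, lbm, bwaves, and bzip2 are excluded. Some are already at their best performance; for others, tuning would affect the program structure and take too much effort.
The figure compares the results after manual tuning:
- Speedup is how many times faster the manually tuned version is than the version produced by the SW prefetch algorithm
- Here the prefetch distance is the best-performing value for each benchmark
In the neutral and negative groups, speedup improves after tuning, showing that removing harmful intrinsics is an effective way to improve performance
In the positive group, performance changes little after tuning, so the SW prefetch algorithm defined in the experiments can be taken as a sound basis for inserting SW prefetch intrinsics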

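Summary of new findings

(1) Even for regular access patterns such as streams, HW prefetchers often under-exploit them. This is especially true for short streams, where SW prefetching is more effective

(2) SW prefetch distance is not very sensitive to the HW configuration. Even so, the prefetch distance must still be set carefully. In typical applications, performance does not vary much as long as the prefetch distance is larger than the minimum distance given by the formula

(3) Although most L1 cache misses are tolerated by out-of-order processors, when the L1 cache miss rate is too high (> 20%), prefetching to reduce L1 cache misses is the more effective way to improve performance

(4) Most applications with prefetch instructions are also limited by memory operations, so instruction overhead has little effect on performance in SW prefetching.

(5) SW prefetching can train the HW prefetcher and thereby improve performance. In some cases it can also cause severe slowdowns, so it must be used with care.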

Case studies: benchmark source code analysis

Each benchmark in the positive, neutral, and negative groups is examined here

See Section 6 of the paper

Conclusion and future work

Future work
- Design an algorithm that trains the HW prefetcher effectively
- Find ways to reduce the negative interactions between SW and HW prefetching

Hardware Prefetcher

References

Prefetching in the Intel® Core™ Microarchitecture

First check your CPU's microarchitecture; see here:

Once you know which microarchitecture it is, you can search for the kinds of HW prefetcher it has

Data Prefetch Logic (Hardware Prefetch):
- fetches streams(either backward or forward) of instructions and data from memory to the unified second-level cache upon detecting a stride.
- triggered when …
  - successive cache misses occur in the last-level cache
  - a stride in the access pattern is detected, such as in the case of loop iterations that access array elements.
- The prefetching occurs up to a page boundary.

what is page boundary?

L2 Streaming Prefetch(Adjacent Cache Line Prefetch):
- two 64-byte cache lines are fetched into a 128-byte sector(last-level cache)
  - pros: bus utilization +
  - cons: bus traffic+
DCU(L1-data cache) Prefetcher: also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
DCU(L1-data cache) IP Prefetcher: Uses sequential load history (based on Instruction Pointer of previous loads) to determine whether to prefetch additional lines
- If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride.
- can prefetch forward or backward and can detect strides of up to half of a 4KB-page, or 2 KBytes.

How to turn on HW prefetcher

through BIOS
changing the bits in the IA32_MISC_ENABLE register – MSR 0x1A4 of each core (older microarchitecture uses 0x1A0)
- disable = 1 / enable = 0
- MSR is present in every core and changes made to the MSR of a core will impact the prefetchers only in that core.
- If hyper-threading is enabled, both the threads share the same MSR.
- Tools for modifying MSRs:
  - Linux: msr-tools
  - Linux/Windows/MacOSX/FreeBSD: Intel PCM
[ Small experiment ] Reading MSR 0x1A4 with msr-tools

First load the msr module

$ sudo modprobe msr

Then read MSR 0x1A4

$ sudo rdmsr -a 0x1A4

Syntax: rdmsr [options] <register no.>; here -a prints the value for each processor (4 in this case). The result:

==> All 4 HW prefetchers are enabled on all 4 processors
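
A small sketch for interpreting the value printed by rdmsr. It assumes the bit layout Intel has published for MSR 0x1A4 on Nehalem-class and several later cores: bit 0 = L2 hardware prefetcher, bit 1 = L2 adjacent cache line prefetcher, bit 2 = DCU prefetcher, bit 3 = DCU IP prefetcher, with a set bit meaning "disabled". Other microarchitectures may differ, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions in MSR 0x1A4 (1 = prefetcher disabled). */
enum {
    PF_L2_STREAMER  = 1u << 0,
    PF_L2_ADJACENT  = 1u << 1,
    PF_DCU_STREAMER = 1u << 2,
    PF_DCU_IP       = 1u << 3,
    PF_ALL          = 0xFu
};

/* Returns a mask of the prefetchers that are ENABLED: a set bit in the MSR
 * means "disabled", so the low four bits are inverted. */
unsigned enabled_prefetchers(uint64_t msr_1a4)
{
    return (unsigned)(~msr_1a4) & PF_ALL;
}

/* Guard for a benchmark that expects every HW prefetcher to be on. */
void require_all_prefetchers_enabled(uint64_t msr_1a4)
{
    assert(enabled_prefetchers(msr_1a4) == PF_ALL);
}
```

Under this layout, a reading of 0 on every processor corresponds to all four prefetchers being enabled, and a reading of 0xf to all four being disabled.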

When to turn on the Prefetch Mechanisms

a high bus utilization could see a performance degradation if prefetch is turned on (may cause bus traffic)
prefetch functionality can hamper the performance of applications that do not have a good spatial locality by causing cache pollution, which results in high levels of cache misses.
The default prefetch setting provides optimal performance for many workloads, so careful consideration should be given to memory-bandwidth utilization of applications before enabling or disabling these mechanisms.

Collected notes on prefetching

How are prefetching and paging related?

ARM SIMD

References

ARM® C Language Extensions
NEON and VFP Programming
NEON to SSE porting solution
Mandelbrot Set with SIMD Intrinsics (the Make flags are also taken from this)

NEON

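NEON is ARM's SIMD technology. Compared with ARMv6 and earlier architectures, NEON combines the 64-bit and 128-bit SIMD instruction sets and provides 128-bit vector operations. NEON was introduced with ARMv7 and is currently available in ARM Cortex-A and Cortex-R series processors.

NEON is included by default in Cortex-A7, Cortex-A12, and Cortex-A15 processors, but is optional in the other ARMv7 Cortex-A processors. NEON shares its registers with VFP but has its own independent execution pipeline.

Data types provided by NEON

32-bit single precision floating-point
8, 16, 32 and 64-bit unsigned and signed integers
8 and 16-bit polynomials

NEON data type notation:

Unsigned integer: U8 U16 U32 U64
Signed integer: S8 S16 S32 S64
Integer of unspecified type: I8 I16 I32 I64
Floating-point number: F16 F32
Polynomial over {0,1}: P8

Note: F16 is not suitable for data processing; it is only used for data conversion

NEON registers

The NEON register file can be viewed as:
16 × 128-bit registers (Q0-Q15);
or 32 × 64-bit registers (D0-D31);
or a combination of the above.
Note: each of the registers Q0-Q15 maps to a pair of D registers (as in the figure above).
Mapping between registers:

D<2n> maps to the least significant half of Q<n>;
D<2n+1> maps to the most significant half of Q<n>;
Combined with the data types NEON supports, NEON registers can take forms such as those in the figure: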

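Classes of NEON data processing instructions:

Details can be found here

Normal instructions operate on any vector type and produce result vectors of the same size, and usually the same type, as the operand vectors.
Long instructions operate on doubleword vector operands and produce a quadword vector result. The result elements are usually twice the width of the operand elements and of the same type. These instructions have an L appended to the mnemonic.
Wide instructions operate on a doubleword vector operand and a quadword vector operand and produce a quadword vector result. The result elements and the first operand are twice the width of the second operand's elements. These instructions have a W appended to the mnemonic.
Narrow instructions, as the name suggests, operate on quadword vector operands and produce a doubleword vector result. The result elements are usually half the width of the operand elements. These instructions have an N appended to the mnemonic.
Saturating variants
Saturating arithmetic in ARM, for a value of n that depends on the instruction:
For signed saturating operations, if the result would be less than $- 2^{n}$, the returned result is $- 2^{n}$;
For unsigned saturating operations, if the full result would be negative, the returned result is 0; if the result would be greater than $2^{n} - 1$, the returned result is $2^{n} - 1$;
Saturation in NEON: a saturating instruction is specified by putting the Q prefix between V and the instruction mnemonic; the behavior is the same as described above.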
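
A scalar sketch of what saturation means, written in plain C so that it runs on any machine. It models 8-bit lanes, i.e. n = 7 for the signed case and n = 8 for the unsigned case; the function names are hypothetical. On NEON the corresponding intrinsics would be `vqadd_s8` and `vqsub_u8`, which apply the same clamping to every lane of a vector.

```c
#include <assert.h>
#include <stdint.h>

/* Signed 8-bit saturating add: compute the exact sum in a wider type, then
 * clamp to [-128, 127] instead of letting it wrap around. */
int8_t sat_add_s8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

/* Unsigned 8-bit saturating subtract: a result that would be negative
 * becomes 0. */
uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}
```

For example, `sat_add_s8(100, 100)` yields 127 rather than the wrapped value -56, and `sat_sub_u8(3, 5)` yields 0.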

Code

Reference: Using your C compiler to exploit NEON™ Advanced SIMD, used to implement the ARM SIMD transpose

#include <arm_neon.h>

/* Transpose a w x h int matrix in 4x4 blocks; w and h must be multiples of 4. */
void neon_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            /* Load four rows of the 4x4 block */
            int32x4_t I0 = vld1q_s32((int32_t *)(src + (y + 0) * w + x));
            int32x4_t I1 = vld1q_s32((int32_t *)(src + (y + 1) * w + x));
            int32x4_t I2 = vld1q_s32((int32_t *)(src + (y + 2) * w + x));
            int32x4_t I3 = vld1q_s32((int32_t *)(src + (y + 3) * w + x));

            /* vzipq_s32 returns the interleaved pair; it does not modify its
             * arguments, so the result has to be kept.
             * Z01.val[0] = {a00 a10 a01 a11}, Z01.val[1] = {a02 a12 a03 a13} */
            int32x4x2_t Z01 = vzipq_s32(I0, I1);
            int32x4x2_t Z23 = vzipq_s32(I2, I3);

            /* vcombine_s32(low, high): assemble each output row */
            int32x4_t T0 = vcombine_s32(vget_low_s32(Z01.val[0]), vget_low_s32(Z23.val[0]));
            int32x4_t T1 = vcombine_s32(vget_high_s32(Z01.val[0]), vget_high_s32(Z23.val[0]));
            int32x4_t T2 = vcombine_s32(vget_low_s32(Z01.val[1]), vget_low_s32(Z23.val[1]));
            int32x4_t T3 = vcombine_s32(vget_high_s32(Z01.val[1]), vget_high_s32(Z23.val[1]));

            vst1q_s32((int32_t *)(dst + ((x + 0) * h) + y), T0);
            vst1q_s32((int32_t *)(dst + ((x + 1) * h) + y), T1);
            vst1q_s32((int32_t *)(dst + ((x + 2) * h) + y), T2);
            vst1q_s32((int32_t *)(dst + ((x + 3) * h) + y), T3);
        }
    }
}
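
To check that a SIMD transpose produces the right output, a plain C reference is useful. The sketch below is an addition for verification and not part of the original homework code; the names `naive_transpose` and `is_transpose` are hypothetical. It uses the same layout as `neon_transpose`: `src` is h rows of w ints, and `dst` is w rows of h ints.

```c
#include <assert.h>

/* Scalar reference: dst[x * h + y] = src[y * w + x]. */
void naive_transpose(const int *src, int *dst, int w, int h)
{
    assert(w >= 0 && h >= 0);
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            dst[x * h + y] = src[y * w + x];
}

/* Returns 1 if dst is the transpose of the w x h matrix src, 0 otherwise. */
int is_transpose(const int *src, const int *dst, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (dst[x * h + y] != src[y * w + x])
                return 0;
    return 1;
}
```

Calling `is_transpose(src, dst, w, h)` after `neon_transpose(src, dst, w, h)` and before timing guards against measuring a kernel that produces wrong output.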

Testing and running

Add the ARM-related flags to the Makefile

main_arm: $(GIT_HOOKS) main.c
        arm-linux-gnueabihf-gcc -c -g -Wall -Wextra -Ofast -mfpu=neon -o main_arm.o main.c
        arm-linux-gnueabihf-gcc -Wall -g -Wextra -Ofast -o main_arm main_arm.o

The result is shown below. Our code runs on a Raspberry Pi 3 Model B+,
which has four ARM Cortex-A53 cores implementing ARMv8 (64/32-bit)

swimglass@swimglass-desktop:~/prefetcher$ ./main_arm
neon:            769104 us

Using prefetch

Reference: NEON prefetch usage

#include <arm_neon.h>

#define PFDIST  8   /* prefetch distance in rows */

/* Same 4x4 block transpose, prefetching the rows PFDIST ahead of the block. */
void neon_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            /* A prefetch is only a hint, so addresses past the end do not fault */
            __builtin_prefetch(src + (y + PFDIST + 0) * w + x);
            __builtin_prefetch(src + (y + PFDIST + 1) * w + x);
            __builtin_prefetch(src + (y + PFDIST + 2) * w + x);
            __builtin_prefetch(src + (y + PFDIST + 3) * w + x);

            int32x4_t I0 = vld1q_s32((int32_t *)(src + (y + 0) * w + x));
            int32x4_t I1 = vld1q_s32((int32_t *)(src + (y + 1) * w + x));
            int32x4_t I2 = vld1q_s32((int32_t *)(src + (y + 2) * w + x));
            int32x4_t I3 = vld1q_s32((int32_t *)(src + (y + 3) * w + x));

            /* vzipq_s32 returns the interleaved pair; keep the result */
            int32x4x2_t Z01 = vzipq_s32(I0, I1);
            int32x4x2_t Z23 = vzipq_s32(I2, I3);

            /* vcombine_s32(low, high): assemble each output row */
            int32x4_t T0 = vcombine_s32(vget_low_s32(Z01.val[0]), vget_low_s32(Z23.val[0]));
            int32x4_t T1 = vcombine_s32(vget_high_s32(Z01.val[0]), vget_high_s32(Z23.val[0]));
            int32x4_t T2 = vcombine_s32(vget_low_s32(Z01.val[1]), vget_low_s32(Z23.val[1]));
            int32x4_t T3 = vcombine_s32(vget_high_s32(Z01.val[1]), vget_high_s32(Z23.val[1]));

            vst1q_s32((int32_t *)(dst + ((x + 0) * h) + y), T0);
            vst1q_s32((int32_t *)(dst + ((x + 1) * h) + y), T1);
            vst1q_s32((int32_t *)(dst + ((x + 2) * h) + y), T2);
            vst1q_s32((int32_t *)(dst + ((x + 3) * h) + y), T3);
        }
    }
}

Result:

swimglass@swimglass-desktop:~/prefetcher$ ./main_arm_pre
neon_pre:                458715 us

About 310,000 us faster (from 769104 us to 458715 us, roughly 40% less time)
