Prefetch Paper Reading
contributed by <zhanyangch>
Paper reading
When Prefetching Works, When It Doesn’t, and Why
- Cooperation and conflict between hardware and software prefetching
- INTRODUCTION
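- prefetching: a technique for tolerating cache miss latency
- Compilers can only insert simple prefetches automatically, so programmers have to insert prefetching intrinsics by hand. There are few rigorous guidelines for doing so, and the complex interaction between software and hardware prefetching is poorly understood
- intrinsics: look like function calls, but the compiler replaces them directly with assembly instructions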
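
To make the idea of a prefetching intrinsic concrete, here is a minimal sketch in C. It assumes GCC or Clang and uses their `__builtin_prefetch` builtin rather than the x86 SSE intrinsic `_mm_prefetch` that the paper works with. The function name `sum_with_prefetch` and the distance of 16 elements are hypothetical and untuned.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; a real value needs tuning. */
#define PREFETCH_DISTANCE 16

/* Sums an array while asking the CPU to fetch data a few iterations ahead. */
long sum_with_prefetch(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Bounds check keeps the prefetch address inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /* read */, 3 /* keep in cache */);
        total += a[i];
    }
    return total;
}
```

The prefetch is only a hint: the result is the same with or without it, only the timing may change.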

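- Figure: "sw" is software prefetching; GHB and STR are hardware prefetchers. It compares the speedup of adding software prefetching against using hardware prefetching alone, and classifies each result as positive, neutral, or negative using a 5% threshold
- When software prefetching targets short array streams, irregular memory address patterns, and L1 cache miss reduction, there is an overall positive impact; the paper illustrates this with code examples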
- software prefetching can interfere with the training of the hardware prefetcher, resulting in strong negative effects, especially when using software prefetch instructions for a part of streams.
- stream prefetcher (STR): prefetches consecutive data
stride prefetcher (GHB): prefetches the instruction or data that lies one stride away; the stride is learned from the history of previous accesses
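
How a stride can be "learned from previous accesses" is sketched below as a toy software model in C. This is not the GHB design evaluated in the paper; the `stride_entry` structure, the `stride_observe` function, and the rule of predicting after two equal strides are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* One tracking entry of a toy stride detector (hypothetical layout). */
typedef struct {
    uint64_t last_addr;   /* most recent address seen */
    int64_t  stride;      /* last observed difference between addresses */
    int      confidence;  /* how many times in a row the stride repeated */
    bool     valid;       /* false until the first access is recorded */
} stride_entry;

/* Records one access. Returns true and writes the predicted next address
 * to *next once the same non-zero stride has been seen twice in a row. */
bool stride_observe(stride_entry *e, uint64_t addr, uint64_t *next)
{
    if (!e->valid) {
        e->last_addr = addr;
        e->stride = 0;
        e->confidence = 0;
        e->valid = true;
        return false;
    }
    int64_t s = (int64_t)(addr - e->last_addr);
    if (s != 0 && s == e->stride) {
        e->confidence++;
    } else {
        e->stride = s;        /* new pattern: start training again */
        e->confidence = 0;
    }
    e->last_addr = addr;
    if (e->confidence >= 1) {
        *next = addr + (uint64_t)e->stride;
        return true;
    }
    return false;
}
```

In this model three accesses are needed before the first prediction, which hints at why very short streams give a hardware prefetcher little to train on.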
- BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
- The access patterns of different data structures determine how they should be prefetched
- Recursive Data Structures (RDS)
- An intrinsic from the x86 SSE SIMD extensions is translated into 2 instructions (direct addressing) or 4 instructions (indirect addressing)
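
The difference between direct and indirect addressing can be seen at source level in the sketch below. It assumes GCC or Clang (`__builtin_prefetch` stands in for the SSE intrinsic), and the names `sum_direct`, `sum_indirect`, and `DIST` are hypothetical. The indirect version has to load the index `b[i + DIST]` before it can form the prefetch address, which is where the extra instructions come from.

```c
#include <stddef.h>

#define DIST 8  /* hypothetical prefetch distance in iterations */

/* Direct indexing: the prefetch address is simply &a[i + DIST]. */
long sum_direct(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]);
        total += a[i];
    }
    return total;
}

/* Indirect indexing: b[i + DIST] is loaded first, then used to index a.
 * Every value in b is assumed to be a valid index into a. */
long sum_indirect(const int *a, const int *b, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[b[i + DIST]]);
        total += a[b[i]];
    }
    return total;
}
```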

- Prefetch Classification: by timing (Timely, Late, Early), by redundancy (Redundant_dc, Redundant_mshr), and by correctness (Incorrect)
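- Software Prefetch Distance
D: prefetch distance
l: prefetch latency
s: length of the shortest path through the loop body
D must be large enough to hide the memory latency, but if it is too large, prefetched data is evicted from the cache before it is used, which raises the number of cache misses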
- Direct memory indexing is well suited to hardware prefetching
Indirect memory indexing needs special hardware support, so it is better suited to software prefetching
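
The notes define D, l, and s for the prefetch distance but do not reproduce the relation between them. The standard form of that relation, as far as I recall it from the paper, is shown below with a worked example; the numbers (300 cycles of latency, 40 cycles per iteration) are hypothetical.

```latex
D \ge \left\lceil \frac{l}{s} \right\rceil
\qquad\text{e.g. } l = 300 \text{ cycles},\; s = 40 \text{ cycles}
\;\Rightarrow\; D \ge \lceil 7.5 \rceil = 8 \text{ iterations}
```

Read this way, D counts loop iterations: the prefetch has to be issued enough iterations ahead that the data arrives before it is needed.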
- POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
- Advantages of software prefetching
- Hardware resources are limited, so a large number of streams is hard to handle
- stream detectors and book-keeping mechanisms
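- A hardware prefetcher needs training; if a stream is too short, it does not produce enough cache misses to train the prefetcher
- Hardware prefetchers mostly place data in lower-level caches to limit L1 cache pollution, whereas software prefetches can bring data directly into L1
- In software, loop bounds are easy to compute, and loop unrolling, software pipelining, and branch instructions can keep prefetch requests from going outside the array bounds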
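
One way to keep prefetch requests inside the array bounds without a branch in every iteration is to split the loop into a main part that prefetches and a tail that does not. The sketch below assumes GCC or Clang for `__builtin_prefetch`; the function `scale` and the distance `DIST` are hypothetical.

```c
#include <stddef.h>

#define DIST 16  /* hypothetical prefetch distance in elements */

/* Multiplies every element by k, prefetching DIST elements ahead. */
void scale(float *a, size_t n, float k)
{
    size_t i = 0;
    /* Last index for which a[i + DIST] is still inside the array. */
    size_t main_end = n > DIST ? n - DIST : 0;

    /* Main loop: every prefetch address is in bounds, no check needed. */
    for (; i < main_end; i++) {
        __builtin_prefetch(&a[i + DIST], 1 /* write */, 3);
        a[i] *= k;
    }
    /* Tail loop: the remaining elements, with no prefetch. */
    for (; i < n; i++)
        a[i] *= k;
}
```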
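- Disadvantages of software prefetching
- Increases the instruction count
- Which data to prefetch is decided ahead of time and cannot adapt to runtime behavior
- Loops with few instructions leave little room for inserting prefetch instructions, so techniques such as loop splitting are needed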
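
For a loop whose body is only a few instructions, one common way to limit the added instruction count is to issue a single prefetch per cache line instead of one per element. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes 64-byte cache lines, 4-byte `int`, and GCC or Clang for `__builtin_prefetch`; `sum_per_line`, `LINE_ELEMS`, and `DIST_LINES` are hypothetical names.

```c
#include <stddef.h>

#define LINE_ELEMS 16  /* assumes 64-byte cache lines and 4-byte int */
#define DIST_LINES 4   /* hypothetical prefetch distance in cache lines */

/* Sums an array, issuing one prefetch per block of LINE_ELEMS elements. */
long sum_per_line(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;

    /* Main part: whole blocks whose prefetch target is still in bounds. */
    for (; i + (DIST_LINES + 1) * LINE_ELEMS <= n; i += LINE_ELEMS) {
        __builtin_prefetch(&a[i + DIST_LINES * LINE_ELEMS]);
        for (size_t j = 0; j < LINE_ELEMS; j++)
            total += a[i + j];
    }
    /* Remainder: leftover elements, no prefetch. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

With these assumptions the prefetch overhead drops from one instruction per element to one per 16 elements.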
- Benefits of combining software + hardware prefetching
- Handling Multiple Streams
- Positive Training
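- Drawbacks of combining software + hardware prefetching
- Negative Training: software prefetch requests can slow down the training of the hardware prefetcher, or hide streams from it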
- Harmful Software Prefetching : stress on cache, memory bandwidth
- EXPERIMENTAL METHODOLOGY
- Prefetch distance formula: K is a constant factor, L is the average memory latency, IPCbench is the profiled average IPC of each benchmark, and W is the average instruction count in one loop iteration
- MacSim, a trace-driven and cycle-level simulator
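
The prefetch distance formula that K, L, IPCbench, and W belong to is not reproduced in these notes. As far as I recall it has the form Distance = K × L × IPCbench / W, which is also what the units suggest (cycles × instructions per cycle ÷ instructions per iteration gives iterations). The sketch below computes it in C; the function name `prefetch_distance`, the rounding up to a whole iteration, and the input values in the usage note are my assumptions, not values taken from the paper.

```c
/* Assumed form: Distance = K * L * IPC_bench / W_loop.
 * The result is rounded up to a whole number of loop iterations
 * (the rounding is an assumption of this sketch). */
unsigned prefetch_distance(double k, double latency_cycles,
                           double ipc_bench, double insts_per_iter)
{
    if (insts_per_iter <= 0.0)
        return 0;  /* no meaningful distance without a loop body size */

    double d = k * latency_cycles * ipc_bench / insts_per_iter;
    unsigned whole = (unsigned)d;          /* truncates toward zero */
    return d > (double)whole ? whole + 1 : whole;
}
```

For example, with hypothetical inputs K = 1, L = 300 cycles, IPC = 0.5, and W = 40 instructions, the raw value is 3.75, so the function returns a distance of 4 iterations.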
- EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
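- Instruction Overhead: prefetch instructions plus the extra instructions for indirect memory accesses; a high overhead does not necessarily mean poor performance
- Software Prefetching Overhead: cache pollution, bandwidth consumption, memory access latency, redundant prefetch overhead, and instruction overhead; each factor is removed in turn to observe its effect on performance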
- small:cache pollution,bandwidth
- larger:memory latency
- negligible:redundant prefetch instructions
- not high:instruction overhead
- optimal distance might vary by machine configurations
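- Using software prefetches to train the hardware prefetcher has limited benefit: software prefetches place data in L1 while the hardware prefetcher works at L2, and hardware prefetch requests can be too aggressive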
- profile-guided optimization
- ASD hardware prefetcher:Short Streams
- content directed prefetching :irregular data structures
Case studies
References
CMU Computer Architecture lecture slides