# Prefetch Paper Reading
contributed by <`zhanyangch`>

###### tags: `zhanyangch` `sysprog2017` `week3`

## Paper Reading
[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)
* Cooperation and conflict between hardware and software prefetching

1. INTRODUCTION
    * Prefetching is a technique for tolerating cache miss latency.
    * Compilers can only insert simple prefetches automatically, so programmers often have to insert prefetch intrinsics by hand. There are few rigorous guidelines for doing so, and the complex interaction between software and hardware prefetching is poorly understood.
    * Intrinsics look like function calls, but the compiler replaces them directly with assembly instructions.

    ![](https://i.imgur.com/MNAsWYq.jpg)
    * Figure: SW denotes software prefetching; GHB and STR are hardware prefetchers. The figure compares the speedup with software prefetching added against hardware prefetching alone, and uses a 5% threshold to classify each result as positive, neutral, or negative.
    * When software prefetching targets short array streams, irregular memory address patterns, and L1 cache miss reduction, the overall impact is positive (the paper gives code examples).
    * Software prefetching can interfere with the training of the hardware prefetcher, resulting in strong negative effects, especially when software prefetch instructions are used for only part of the streams.
    * Stream prefetcher (STR): prefetches consecutive data.
    * Stride prefetcher (GHB, global history buffer): prefetches instructions or data at a stride distance; the stride is learned from previously recorded accesses.
2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING

    ![](https://i.imgur.com/vPaSNOB.jpg)
    * Different data structures have different access patterns, which determine how they can be prefetched.
        * Recursive Data Structures (RDS)
    * Prefetch intrinsics from the x86 SSE SIMD extensions are translated into 2 instructions (direct addressing) or 4 instructions (indirect addressing).

    ![](https://i.imgur.com/tNyz1M8.jpg)
    * Prefetch classification:
        * Timing-related: Timely, Late, Early
        * Redundant: Redundant_dc, Redundant_mshr
        * Wrong: Incorrect
    * Software prefetch distance
        * D: prefetch distance
        * l: prefetch latency
        * s: length of the shortest path through the loop body
        * D must be large enough to cover the memory latency, but if it is too large, data in the cache is evicted early and cache misses increase.

        $$ D \geq {l \over s}. $$
    * Direct memory indexing is easy to handle with hardware prefetching. Indirect memory indexing requires special hardware, so it is easier to handle with software prefetching.
3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
    * Advantages of software prefetching
        * Hardware resources are limited, so a large number of streams is hard to handle.
            * Stream detectors and book-keeping mechanisms
        * A hardware prefetcher needs training; if a stream is too short, there are not enough cache misses to train it.
        * Hardware prefetchers mainly place data in lower-level caches to reduce L1 cache pollution, whereas software prefetching can target L1 cache misses.
        * In software, loop bounds are easy to compute, and prefetch requests beyond the array bounds can be avoided by loop unrolling, software pipelining, and using branch instructions.
    * Disadvantages of software prefetching
        * It increases the instruction count.
        * Which data to prefetch is decided in advance and cannot adapt to runtime behavior.
        * It is hard to insert prefetch instructions into loops with few instructions; techniques such as loop splitting are needed.
    * Benefits of combining software and hardware prefetching
        * Handling multiple streams
        * Positive training
    * Drawbacks of combining software and hardware prefetching
        * Negative training: the software prefetches are slower than the hardware prefetcher, or they hide streams from it.
        * Harmful software prefetching: extra stress on the cache and memory bandwidth.
4. EXPERIMENTAL METHODOLOGY

    $$ Distance = {K \times L \times {IPC}_{bench} \over {W}_{loop}}. $$
    * where K is a constant factor, L is the average memory latency, ${IPC}_{bench}$ is the profiled average IPC of each benchmark, and ${W}_{loop}$ is the average instruction count in one loop iteration
    * MacSim, a trace-driven and cycle-level simulator
5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
    * Instruction overhead: prefetch instructions plus the instructions for indirect memory accesses. High overhead does not necessarily mean poor performance.
    * Software prefetching overhead: cache pollution, bandwidth consumption, memory access latency, redundant prefetch overhead, and instruction overhead. Each factor is removed individually to observe its effect on performance.
        * small: cache pollution, bandwidth
        * larger: memory latency
        * negligible: redundant prefetch instructions
        * not high: instruction overhead
    * The optimal distance might vary by machine configuration.
    * Using software prefetches to train the hardware prefetcher has limited benefit: software prefetches place data in L1 while the hardware prefetcher works at L2, and hardware prefetch requests can be too aggressive.
    * Profile-guided optimization
    * ASD hardware prefetcher: short streams
    * Content directed prefetching: irregular data structures
6. CASE STUDIES

## References
[cmu Computer Architecture ppt](http://www.ece.cmu.edu/~ece740/f11/lib/exe/fetch.php%3Fmedia%3Dwiki:lectures:onur-740-fall11-lecture24-prefetching-afterlecture.pdf)
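
As a supplement to the prefetch distance notes in sections 2 and 4, here is a minimal C sketch of how the two distance formulas might be applied to a simple array loop. It is not code from the paper. The function names (`prefetch_distance`, `profiled_distance`, `sum_with_prefetch`) are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin, used here as a stand-in for the SSE prefetch intrinsic. The sketch rounds $l / s$ up because D counts whole loop iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest whole-iteration distance D with D >= l / s.
 * latency (l): prefetch latency in cycles.
 * shortest_path (s): cycles of the shortest path through the loop body. */
size_t prefetch_distance(size_t latency, size_t shortest_path)
{
    assert(shortest_path > 0);
    return (latency + shortest_path - 1) / shortest_path; /* ceiling division */
}

/* Section 4 heuristic: Distance = K * L * IPC_bench / W_loop. */
double profiled_distance(double k, double latency, double ipc_bench, double w_loop)
{
    assert(w_loop > 0.0);
    return k * latency * ipc_bench / w_loop;
}

/* Sum an array while prefetching `dist` elements ahead of the current one.
 * The prefetch is only a hint; the result equals a plain sum. */
long sum_with_prefetch(const int *a, size_t n, size_t dist)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* The branch keeps prefetch requests inside the array bounds. */
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 3 /* high locality */);
        total += a[i];
    }
    return total;
}
```

For example, with an assumed latency of 200 cycles and a 10-cycle loop body, `prefetch_distance(200, 10)` gives a distance of 20 iterations. The per-iteration branch in `sum_with_prefetch` is the simplest way to stay in bounds; loop unrolling or splitting the last `dist` iterations into a separate loop would avoid the extra branch at the cost of more code.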