# 2017q3 Homework2 (software-pipelining)
###### tags: `sysprog2017` `dev_record`

contributed by <`HTYISABUG`>

### Reviewed by `jackyhobingo`

* The experimental environment is listed, but no actual experiments are shown; the experimental process and results are expected.

```shell=
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping: 3
CPU MHz: 1232.873
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5184.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
```

---

## Reading the Paper

### INTRODUCTION

* Only relatively simple software prefetching algorithms currently appear in state-of-the-art compilers
    * Programmers have to insert prefetching instructions by hand
* Two open problems:
    * There is no rigorous guideline for the best way to insert prefetch intrinsics
    * The complexity of the interaction between software and hardware prefetching is not well understood
* ![](https://i.imgur.com/O2i7YYv.png)
* Two HW prefetchers are evaluated
    * **GHB** (Global History Buffer)
        * If a cache miss occurs, an address offset by some distance from the missed address is likely to miss in the near future
        * **Stride access**: on either a cache miss or a hit, prefetch the address offset by a fixed stride from the accessed address
            * <span style='color:red;'>Targets access strides greater than two cache lines</span>
    * **STR** (Stream Buffer)
        * Cache-miss addresses are fetched into a separate buffer
        * <span style='color:red;'>Targets unit-stride cache-line accesses</span>
* Compares the performance of SW prefetching alone against combined HW/SW prefetching
* Questions to answer:
    1. What are the limitations and overheads of SW prefetching?
    2. What are the limitations and overheads of HW prefetching?
    3. When is it beneficial to use SW and/or HW prefetching?
* SW prefetching experiments
    * Prefetching **irregular memory addresses**, which reduces L1 cache misses, is the main source of positive impact

### BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING

![](https://i.imgur.com/z58ahcy.png)
![](https://i.imgur.com/GVUNyib.png)
![](https://i.imgur.com/3proo5M.png)

### POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING

* Where SW prefetching beats HW prefetching
    * Large Number of Streams
        * The number of streams a stream prefetcher can track is limited by HW resources
    * Short Streams
        * Hardware prefetchers require training time to detect the direction and distance of a stream or stride
    * Irregular Memory Access
    * Cache Locality Hint
        * A HW prefetcher places data in a lower-level (L2 or L3) cache
        * SW-prefetched data can be placed directly into the **L1** cache
    * Loop Bounds
        * Several software methods prevent generating prefetch requests beyond loop bounds
        * The same is not possible in hardware
* Negative impacts of SW prefetching
    * Increased Instruction Count
    * Static Insertion
        * (I don't quite understand this part yet)
    * Code Structure Change
* Synergy between SW and HW prefetching
    * Handling Multiple Streams
    * Positive Training
* Antagonism between SW and HW prefetching
    * Negative Training
    * Harmful SW Prefetching

### EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING

* Limitations and overheads of SW prefetching
    * Instruction Overhead
    * SW Prefetching Overhead
        * ![](https://i.imgur.com/ytw8QfP.png)
        * Effects of **cache pollution** are **small**
        * Current machines provide **enough bandwidth** for single-thread applications
        * SW prefetching **doesn't completely hide memory latency**
        * The **negative effect** of redundant prefetch instructions is **generally negligible**
    * The Effect of Prefetch Distance
        * ![](https://i.imgur.com/gzAqEWJ.png)
        * ![](https://i.imgur.com/sL2T6d4.png)
    * Static Distance vs. Machine Configuration
        * Varying the static prefetch distance **doesn't impact performance significantly**
    * Cache-Level Insertion Policy
        * The benefit of T0 over T1/T2 mainly comes from **hiding** L1 cache misses **by inserting prefetched blocks into the L1 cache**
* Effects of using SW and HW prefetching together
    * Hardware Prefetcher Training Effects
        * Eliminating negative training effects can reduce performance degradation significantly
        * It's generally better not to train the HW prefetcher with SW prefetching requests
    * Prefetch Coverage
        * **Lower coverage** is the main reason for performance loss in the neutral and negative groups
    * Prefetching Classification
        * Even though many benchmarks contain a significant number of redundant prefetches, the negative effect on performance is small
    * HW Prefetcher for Short Streams
        * One weakness of hardware prefetching is the difficulty of exploiting short streams
        * ASD HW Prefetcher
            * SW prefetching is much more effective at prefetching short streams than ASD
        * Content Directed Prefetching (CDP)
            * Targets linked and other irregular data structures
            * SW prefetching is more effective for irregular data structures than CDP
* Summary
    * HW prefetchers can **under-exploit** even regular access patterns, and SW prefetching is **frequently more effective** in such cases
    * The best SW prefetching distance is relatively insensitive to the HW configuration
    * The prefetch distance does need to be set carefully, but **as long as it is greater than the minimum distance**, most applications are **not sensitive** to it
    * Although most L1 cache misses can be tolerated through out-of-order execution, when the **L1 cache miss rate is much higher than 20%**, reducing L1 cache misses by **prefetching into the L1 cache can be effective**
    * The overhead of useless prefetch instructions is not very significant
    * SW prefetching can be used to train a HW prefetcher and thereby yield some performance improvement; however, it **can also degrade performance severely**, so it must be done judiciously if at all

---

## References

[When Prefetching Works, When It Doesn't, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)