# 2017q3 Homework2 (software-pipelining)
###### tags: `sysprog2017` `dev_record`

contributed by <`HTYISABUG`>

### Reviewed by `jackyhobingo`

* The experimental environment is listed, but no actual experiments are shown; the experimental process and results are expected.

```shell=
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping: 3
CPU MHz: 1232.873
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5184.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
```

---

## Reading the Paper

### INTRODUCTION

* Only relatively simple software prefetching algorithms currently appear in state-of-the-art compilers
    * Programmers have to insert prefetching instructions by hand
* Two open problems:
    * There is no rigorous guideline for the best way to insert prefetch intrinsics
    * The complexity of the interaction between software and hardware prefetching is not well understood
* ![](https://i.imgur.com/O2i7YYv.png)
* Two HW prefetchers are evaluated
    * **GHB** (Global History Buffer)
        * If a cache miss occurs, an address offset by some distance from the missed address is likely to miss in the near future
        * **Stride access**: on either a cache miss or a hit, prefetch the address offset by a fixed stride from the accessed address
            * <span style='color:red;'>Targets access strides greater than two cache lines</span>
    * **STR** (Stream Buffer)
        * Cache-miss addresses are fetched into a separate buffer
        * <span style='color:red;'>Targets unit-stride cache-line accesses</span>
* Compares the performance of SW prefetching alone against combined HW/SW prefetching
* Questions to answer:
    1. What are the limitations and overheads of SW prefetching?
    2. What are the limitations and overheads of HW prefetching?
    3. When is it beneficial to use SW and/or HW prefetching?
* SW prefetching experiments
    * Prefetching **irregular memory addresses**, which reduces L1 cache misses, is the main source of positive impact

### BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING

![](https://i.imgur.com/z58ahcy.png)
![](https://i.imgur.com/GVUNyib.png)
![](https://i.imgur.com/3proo5M.png)

### POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING

* Where SW prefetching beats HW prefetching
    * Large Number of Streams
        * The number of streams a stream prefetcher can track is limited by HW resources
    * Short Streams
        * Hardware prefetchers require training time to detect the direction and distance of a stream or stride
    * Irregular Memory Access
    * Cache Locality Hint
        * A HW prefetcher places data in a lower-level (L2 or L3) cache
        * SW-prefetched data can be placed directly into the **L1** cache
    * Loop Bounds
        * Several software methods prevent generating prefetch requests beyond loop bounds
        * The same is not possible in hardware
* Negative impacts of SW prefetching
    * Increased Instruction Count
    * Static Insertion
        * (I don't quite understand this part yet)
    * Code Structure Change
* Synergy between SW and HW prefetching
    * Handling Multiple Streams
    * Positive Training
* Antagonism between SW and HW prefetching
    * Negative Training
    * Harmful SW Prefetching

### EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING

* Limitations and overheads of SW prefetching
    * Instruction Overhead
    * SW Prefetching Overhead
        * ![](https://i.imgur.com/ytw8QfP.png)
        * Effects of **cache pollution** are **small**
        * Current machines provide **enough bandwidth** for single-thread applications
        * SW prefetching **doesn't completely hide memory latency**
        * The **negative effect** of redundant prefetch instructions is **generally negligible**
    * The Effect of Prefetch Distance
        * ![](https://i.imgur.com/gzAqEWJ.png)
        * ![](https://i.imgur.com/sL2T6d4.png)
    * Static Distance vs. Machine Configuration
        * Varying the static prefetch distance **doesn't impact performance significantly**
    * Cache-Level Insertion Policy
        * The benefit of T0 over T1/T2 mainly comes from **hiding** L1 cache misses **by inserting prefetched blocks into the L1 cache**
* Effects of using SW and HW prefetching together
    * Hardware Prefetcher Training Effects
        * Eliminating negative training effects can reduce performance degradation significantly
        * It's generally better not to train the HW prefetcher with SW prefetching requests
    * Prefetch Coverage
        * **Lower coverage** is the main reason for performance loss in the neutral and negative groups
    * Prefetching Classification
        * Even though many benchmarks contain a significant number of redundant prefetches, the negative effect on performance is small
    * HW Prefetcher for Short Streams
        * One weakness of hardware prefetching is the difficulty of exploiting short streams
        * ASD HW Prefetcher
            * SW prefetching is much more effective at prefetching short streams than ASD
        * Content Directed Prefetching (CDP)
            * Targets linked and other irregular data structures
            * SW prefetching is more effective for irregular data structures than CDP
* Summary
    * HW prefetchers can **under-exploit** even regular access patterns, and SW prefetching is **frequently more effective** in such cases
    * The best SW prefetching distance is relatively insensitive to the HW configuration
    * The prefetch distance does need to be set carefully, but **as long as it is greater than the minimum distance**, most applications are **not sensitive** to it
    * Although most L1 cache misses can be tolerated through out-of-order execution, when the **L1 cache miss rate is much higher than 20%**, reducing L1 cache misses by **prefetching into the L1 cache can be effective**
    * The overhead of useless prefetch instructions is not very significant
    * SW prefetching can be used to train a HW prefetcher and thereby yield some performance improvement; however, it **can also degrade performance severely**, so it must be done judiciously if at all

---

## References

[When Prefetching Works, When It Doesn't, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)