
2017q3 Homework2 (software-pipelining)

tags: sysprog2017 dev_record

contributed by <HTYISABUG>

Reviewed by jackyhobingo

  • ๆœ‰ๆไพ›ๅฏฆ้ฉ—็’ฐๅขƒ๏ผŒๅปๆฒ’ๆœ‰ๆไพ›ๅฏฆ้ฉ—็š„ๆƒ…ๆณ๏ผŒๆœŸๅพ…ๆœ‰ๅฏฆ้ฉ—็š„้Ž็จ‹
Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 94 Model name: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz Stepping: 3 CPU MHz: 1232.873 CPU max MHz: 3500.0000 CPU min MHz: 800.0000 BogoMIPS: 5184.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 6144K NUMA node0 CPU(s): 0-7 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

Reading the Paper

INTRODUCTION

  • Only relatively simple software prefetching algorithms have made their way into state-of-the-art compilers
  • Programmers need to insert prefetching instructions manually (a minimal sketch follows this list)
    • Two problems arise
      • There is no definitive guideline for the best way to insert prefetches
      • The complexity of the interaction between software and hardware prefetching is not well understood
    • Two kinds of HW prefetchers are used
      • GHB (Global History Buffer)
        • On a cache miss, an address offset by some distance from the missed address is likely to miss in the near future
        • Stride access: on either a cache miss or a hit, prefetch an address offset by the stride distance from the accessed address
        • Targets access strides greater than two cache lines
      • STR (Stream Buffer)
        • Data at cache-miss addresses is fetched into a separate buffer
        • Targets unit-stride cache-line accesses
    • Compares the performance of SW prefetching alone against combined HW/SW prefetching
  • Questions the paper seeks to answer
    1. The limitations and overhead of SW prefetching
    2. The limitations and overhead of HW prefetching
    3. The benefits of using SW and/or HW prefetching
  • SW prefetching experiments
    • The main positive effect is that prefetching irregular memory addresses reduces L1 cache misses

BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING



POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING

  • Where SW prefetching beats HW prefetching
    • Large number of Streams
      • The number of streams in the stream prefetcher is limited by HW resources
    • Short Streams
      • Hardware prefetchers require training time to detect the direction and distance of a stream or stride
    • Irregular Memory Access (prefetching indirect accesses that a HW prefetcher cannot predict; see the sketch after this list)
    • Cache Locality Hint
      • HW prefetchers place prefetched data in the lower-level (L2 or L3) caches
      • SW prefetched data is placed directly into the L1 cache
    • Loop Bounds
      • Several methods prevent generating prefetch requests out of bounds in software
      • The same isn't possible in hardware
  • Negative effects of SW prefetching
    • Increased Instruction Count
    • Static Insertion
      • (Not entirely clear to me; presumably the point is that the prefetch decisions are fixed at compile time and cannot adapt to run-time changes in memory latency, effective cache size, or bandwidth)
    • Code Structure Change
  • Synergistic effects of SW and HW prefetching
    • Handling Multiple Streams
    • Positive Training
  • Antagonistic effects of SW and HW prefetching
    • Negative Training
    • Harmful SW Prefetching
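
The irregular-access and cache-hint points above can be made concrete with a small sketch. It uses the x86 `_mm_prefetch` intrinsic with an explicit cache-level hint and prefetches an indirectly indexed element that a stream/stride HW prefetcher cannot predict; the kernel and all names are hypothetical, not taken from the paper:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0/T1/T2 */

#define PREFETCH_DIST 8  /* prefetch distance in iterations (illustrative) */

/* Hypothetical gather kernel: b[idx[i]] is an irregular access, so a HW
 * stream/stride prefetcher cannot predict it, but software can prefetch it
 * because idx[i + PREFETCH_DIST] is already available. */
double gather_sum(const double *b, const int *idx, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    /* Main loop: safe to prefetch PREFETCH_DIST iterations ahead. */
    for (; i + PREFETCH_DIST < n; i++) {
        /* T0 hint: bring the line into every cache level, including L1.
         * A T1 hint would stop at L2, roughly where a HW prefetcher
         * would normally place the data. */
        _mm_prefetch((const char *) &b[idx[i + PREFETCH_DIST]], _MM_HINT_T0);
        sum += b[idx[i]];
    }

    /* Epilogue without prefetching, so no request goes out of bounds
     * (the "Loop Bounds" point above). */
    for (; i < n; i++)
        sum += b[idx[i]];

    return sum;
}
```

The split-loop structure is one example of the "several methods" for keeping prefetch requests in bounds mentioned in the list.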

EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING

  • Limitations and overhead of SW prefetching
    • Instruction Overhead
    • SW Prefetching Overhead
      • Effects of cache pollution are small
      • Current machines provide enough bandwidth for single-thread applications
      • SW prefetching isn't completely hiding memory latency
      • Negative effect of redundant prefetch instructions is generally negligible
    • The Effect of Prefetch Distance (the minimum-distance relation is sketched after this list)
    • Static Distance vs. Machine Configuration
      • Keeping a statically chosen prefetch distance across different machine configurations does not significantly impact performance
    • Cache-Level Insertion Policy
      • The benefit of T0 over T1/T2 mainly comes from hiding L1 cache misses by inserting prefetched blocks into the L1 cache
  • ๅŒๆ™‚ไฝฟ็”จ SW/HW Prefetching ็š„ๅฝฑ้Ÿฟ
    • Hardware Prefetcher Training Effects
      • The negative impact of such training can reduce performance significantly
      • It's generally better not to train HW prefetching with SW prefetching requests
    • Prefetch Coverage
      • Less coverage is the main reason for performance loss in the neutral and negative groups
    • Prefetching Classification
      • Even though a significant number of redundant prefetches exists in many benchmarks, there is little negative effect on the performance
  • HW Prefetcher for Short Streams
    • One weakness of hardware prefetching is the difficulty of exploiting short streams
    • ASD HW Prefetcher
      • SW prefetching is much more effective for prefetching short streams than ASD
  • Content Directed Prefetching (CDP)
    • Target linked and other irregular data structures
    • SW prefetching is more effective for irregular data structures than CDP
  • Summary
    • HW prefetchers can under-exploit even regular access patterns and SW prefetching is frequently more effective in such cases
    • The SW prefetching distance is relatively insensitive to the HW configuration
    • The prefetch distance does need to be set carefully, but as long as the prefetch distance is greater than the minimum distance, most applications will not be sensitive to the prefetch distance
    • Although most L1 cache misses can be tolerated through out-of-order execution, when the L1 cache miss rate is much higher than 20%, reducing L1 cache misses by prefetching into the L1 cache can be effective
    • The overhead of useless prefetching instructions is not very significant
    • SW prefetching can be used to train a HW prefetcher and thereby yield some performance improvement. However, it can also degrade performance severely, and therefore must be done judiciously if at all
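
The "minimum distance" mentioned in the summary has a standard formulation, sketched here for reference (the symbols follow common convention and are not quoted verbatim from the paper): the prefetch distance D, counted in loop iterations, should satisfy

$$
D \ge \left\lceil \frac{l}{s} \right\rceil
$$

where l is the average prefetch (memory) latency and s is the length of the shortest path through one loop iteration, both in cycles. Once D is above this bound the prefetched data arrives in time, which matches the observation that most applications are insensitive to the exact distance beyond the minimum.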

ๅƒ่€ƒ่ณ‡ๆ–™

Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, "When Prefetching Works, When It Doesn't, and Why," ACM Transactions on Architecture and Code Optimization (TACO), 2012