only relatively simple software prefetching algorithms have appeared in state-of-the-art compilers like icc and gcc
programmers often have to insert prefetching intrinsics manually
few rigorous guidelines on how best to insert prefetch intrinsics
complexity of the interaction between software and hardware prefetching is not well understood
group the benchmarks into three categories
positive (performance improvements from software prefetching in excess of 5%)
milc, GemsFDTD, mcf, lbm, libquantum, and gcc
negative (degrade performance more than 5% by using software prefetching)
sphinx3 and leslie3d
neutral
bzip2, cactusADM, soplex, and bwaves
2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
unit-stride cache-line accesses as streams
access stride distances greater than two cache lines as strided
2.1 Software Prefetch Intrinsics
use software prefetch intrinsics that are part of the x86 SSE SIMD extensions
the intrinsic for x86 systems is _mm_prefetch(char *ptr, int hint)
requires the programmer to specify both a prefetch address as well as a prefetch usage hint
One intrinsic is roughly translated to two instructions for direct address prefetches or four instructions for indirect memory addresses.
Table II. Different Prefetch Hints in Intrinsic
| HINT               | L1 | L2 | L3 | Remarks                                            |
|--------------------+----+----+----+----------------------------------------------------|
| _MM_HINT_T0 (T0)   | O  | O  | O  | Temporal data with respect to all level caches     |
| _MM_HINT_T1 (T1)   | X  | O  | O  | Temporal data with respect to first level cache    |
| _MM_HINT_T2 (T2)   | X  | X  | O  | Temporal data with respect to second level cache   |
| _MM_HINT_NTA (NTA) | O  | X  | X  | Non-temporal data with respect to all level caches |
2.2 Prefetch Classification
Table III. Prefetch Classifications
| Classification | Accuracy | Timeliness                                                       |
|----------------+----------+------------------------------------------------------------------|
| Timely         | O        | best prefetching                                                 |
| Late           | O        | demand request arrives before the prefetched block is serviced   |
| Early          | O        | demand request arrives after the block is evicted from the cache |
| Redundant_dc   | O        | the same cache block is already in the data cache                |
| Redundant_mshr | O        | the same cache block is already in the MSHR                      |
| Incorrect      | X        | block is never used, even after it is evicted from the cache     |
2.3 Software Prefetch Distance
Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency.
The average memory latency and the average execution time of one loop iteration can vary
so the prefetch distance should be chosen in such a way that it is large enough to hide the latency.
define the prefetch distance D as how far ahead (in loop iterations) of the current demand access a prefetch should be requested
with l the prefetch latency and s the length of the shortest path through the loop body, the minimum safe distance is D ≥ ⌈l / s⌉
2.4 Direct and Indirect Memory Indexing
direct memory indexing
can be easily prefetched by hardware since the memory addresses show regular stream/stride behavior
indirect memory indexing
requires special hardware prefetching mechanisms
is relatively simple to compute in software, so software prefetching tends to be more effective than hardware prefetching for these accesses
still have a higher overhead than direct memory access
3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
3.1 Benefits of Software Prefetching over Hardware Prefetching
Large Number of Streams (Limited Hardware Resources)
The number of streams in the stream prefetcher is limited by hardware resources, such as stream detectors and book-keeping mechanisms
e.g. software prefetching can insert prefetch requests independently for each stream in lbm, whereas hardware prefetchers are easily confused when there are too many streams
Short Streams
Hardware prefetchers require training time to detect the direction and distance of a stream or stride
require at least two cache misses to detect the direction of stream
If the length of stream is extremely short, there will not be enough cache misses to train a hardware prefetcher and load useful cache blocks.
Irregular Memory Access
prefetch irregular data structures by inserting appropriate prefetch intrinsics
Cache Locality Hint
Hardware prefetchers in today’s high-performance processors such as Nehalem [Intel 2009] and Barcelona [AMD] place the data in the lower-level cache of the requesting core, whereas most software prefetched data is placed directly into the L1 cache.
Inserting prefetched blocks into a lower-level cache (L2 or L3) greatly reduces higher-level (L1) cache pollution, but the extra L2-to-L1 latency can degrade performance significantly.
In the software prefetching mechanism, there is greater flexibility of placing data in the right cache level
Loop Bounds
Several methods prevent generating prefetch requests out of array bounds in software such as loop unrolling, software pipelining, and using branch instructions
preventing out-of-bounds prefetch requests is not possible in hardware, especially when the hardware prefetcher is aggressive (large prefetch distance and high prefetch degree)
3.2 Negative Impacts of Software Prefetching
Increased Instruction Count
Unlike hardware prefetching, software prefetch instructions consume fetch and execution bandwidth and machine resources
Static Insertion
The decision to prefetch and choice of these parameters are made statically (by programmer or compiler), and therefore cannot adapt to runtime behavioral changes such as varying memory latency, effective cache size, and bandwidth, especially in heterogeneous architectures.
Code Structure Change
When the number of instructions in a loop is extremely small, it can be challenging to insert software prefetching instructions to provide timely prefetching requests because there are not enough computations between prefetching instructions and demand requests.
loop splitting
breaking the loop into multiple loops which have the same bodies but iterate over different contiguous portions of the index range.
3.3 Synergistic Effects when Using Software and Hardware Prefetching Together
Handling Multiple Streams
Using software and hardware prefetching together can help cover more data streams.
Positive Training
Software prefetch requests can actually help train the hardware prefetcher.
If a block prefetched by a software prefetcher is late, then a trained hardware prefetcher can improve prefetch timeliness.
3.4 Antagonistic Effects when Using Software and Hardware Prefetching Together
Negative Training
Software prefetch requests can slow down the hardware prefetcher training.
If blocks prefetched by software prefetching hide part of one or more streams, the hardware prefetcher will not be trained properly.
software prefetch instructions can trigger overly aggressive hardware prefetches, which results in early requests.
Harmful Software Prefetching
Hardware prefetchers generally have a lower accuracy than software prefetchers.
When software prefetch requests are incorrect or early, the increased stress on cache and memory bandwidth can further reduce the effectiveness of hardware prefetching.
4. EXPERIMENTAL METHODOLOGY
4.1 Prefetch Intrinsic Insertion Algorithm
profile each application to identify prefetch candidates.
A prefetch candidate is any load whose L1 misses per thousand instructions (MPKI) exceeds 0.05
To make the choice of prefetch distance systematic, compute it as D = ⌈(K × L × IPC) / W⌉
K is a constant factor and L is an average memory latency; use K = 4 and L = 300 cycles for all benchmarks
IPC is the profiled average IPC of each benchmark, and W is the average instruction count in one loop iteration
since IPC and W are determined from the profile data, the choice of distance is effectively profile-driven
For loads on complex data structures, such as recursive data structures (RDS), hash tables, and C++ overloaded operators, and for short functions without any loops, do not use prefetch intrinsics.
4.2 Simulation Methodology
MacSim
a trace-driven and cycle-level simulator
to evaluate software and hardware prefetching effects
5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
5.1 Overhead and Limitations of Software Prefetching
Instruction Overhead
major negative impacts of software prefetching
For GemsFDTD, bwaves, and leslie3d, the total number of instructions increases by more than 50%
These extra instructions come not just from prefetch instructions but also from regular instructions due to handling of indirect memory accesses and index calculations
Surprisingly, GemsFDTD is still in the positive group
Software Prefetching Overhead
eliminate
cache pollution (SW+P)
would be caused by early or incorrect prefetches (but small in experiments)
bandwidth consumption (SW+B)
mostly small values
current machines provide enough bandwidth for single-thread applications
memory access latency (SW+L)
larger values
software prefetching does not completely hide memory latency
redundant prefetch overhead (SW+R)
the negative effect of redundant prefetch instructions is negligible even though instructions increased significantly
instruction overhead (SW+I)
even though the number of instructions increased significantly for GemsFDTD, bwaves, and leslie3d, their actual time overhead is not high
The Effect of Prefetch Distance
Most benchmarks from the neutral and negative groups are insensitive to the prefetch distance
Static Distance vs. Machine Configuration
One limitation of using static prefetch distances is that the optimal distance might vary by machine
the static prefetch distance variance does not impact performance significantly even if the machine configuration is changed at runtime
Cache-Level Insertion Policy
an out-of-order processor can easily tolerate L1 misses, so hardware prefetchers usually insert prefetched blocks into the last level cache to reduce the chance of polluting the L1 cache.
when the L1 cache miss rate is higher than what a processor can tolerate, even an L1 cache miss can degrade performance
Since software prefetching is generally accurate, we can safely insert prefetched blocks into the L1 cache directly
T0 always performs better than other hints in all evaluated benchmarks
5.2 Effect of Using Hardware and Software Prefetching Together
Hardware Prefetcher Training Effects
two different hardware prefetcher training policies
NT (SW+GHB, SW+STR)
hardware prefetcher's training algorithm ignores software prefetch requests
only trained by demand cache misses
Train (SW+GHB+T, SW+STR+T)
includes software prefetch requests
sphinx3
For both GHB and STR, the benchmark shows a small benefit from training with software prefetches.
milc, gcc, and soplex
negative effects for both GHB and STR
mcf and bwaves
largely unaffected by training policy
the remaining benchmarks
the effect of the training policy is prefetcher-specific, for instance, positive with GHB and negative with STR
it is generally better not to train hardware prefetching with software prefetching requests
Prefetch Coverage
the coverage of software prefetching is higher than that of hardware prefetchers, but the software coverage for benchmarks in the neutral and negative groups is lower than that of the hardware prefetchers
this lower coverage is the main reason for performance loss in the neutral and negative groups
Prefetching Classification
Only bzip2 and soplex show more than 10% early prefetches.
For mcf, 10% of prefetches are late, due to a dependent load instruction that the prefetch intrinsic generates
By eliminating the latency overhead (SW+L), the performance of mcf is improved by 9%.
even though a significant number of redundant prefetches exists in many benchmarks, there is little negative effect on the performance
5.3 Using the GCC Compiler
insert intrinsics after compiling the code with gcc
The gcc compiler inserts more software prefetches than icc, except for mcf and lbm
gcc still cannot cover all cache misses
5.4 Real System Experiments
measured the effect of software prefetching on two recent Intel processors, Core 2 and Nehalem
software prefetching reduced the performance of cactusADM, soplex, bwaves, sphinx3, and leslie3d by a small amount in the real system measurements
all benchmarks in the positive group show benefits from using software prefetches
due to the multiple cache levels of prefetchers in Nehalem (L3 to L2 and L2 to L1), the benefits of software prefetching are reduced
5.5 Profile-Guided Optimization (PGO)
the performance of using PGO lies between that of Base and SW
5.6 Hardware Prefetcher for Short Streams
One weakness of hardware prefetching is the difficulty of exploiting short streams.
ASD
On average, ASD provides a 7% performance benefit on milc without software prefetching. However, software prefetching itself can still provide more than a 2x speedup.
software prefetching is much more effective for prefetching short streams than ASD
5.7 Content Directed Prefetching
target linked and other irregular data structures
CDP provides 3.9% and 6.5% improvement on mcf and gcc, respectively
software prefetching is more effective for irregular data structures
5.8 Manual Tuning
cannot be sure that software prefetching is being used to its full potential
manually and exhaustively tune the benchmarks
remove useless prefetches
adjust the prefetch distance of each software prefetch
apply additional overhead-reduction techniques, such as loop unrolling and prologue/epilogue transformations
there are no significant performance changes between manual tuning and the algorithm, especially in the positive group, which shows that the algorithm serves well as a guideline for inserting software prefetch intrinsics
5.9 Summary of New Findings
Hardware prefetchers can under-exploit even regular access patterns
software prefetching is frequently more effective
The prefetch distance does need to be set, but performance is largely insensitive to it as long as it is greater than the minimum distance
when the L1 cache miss rate is much higher than 20%, reducing L1 cache misses by prefetching into the L1 cache can be effective.
having some extra instructions (software prefetching instructions) does not increase execution time by much
Software prefetching can be used to train a hardware prefetcher and yield some performance improvement. However, it can also degrade performance.
6. CASE STUDIES: SOURCE CODE ANALYSIS OF INDIVIDUAL BENCHMARKS