The benefit of the T0 prefetch hint over T1/T2 comes mainly from hiding L1 cache misses, because T0 inserts prefetched blocks into the L1 cache
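As a rough illustration of the hint levels, the sketch below uses the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)`. On x86 these compilers typically lower locality 3, 2 and 1 to `prefetcht0`, `prefetcht1` and `prefetcht2`, but the exact cache level each hint targets depends on the processor. The function names and the distance of 16 elements are assumptions made for this example, not values from the notes.

```c
#include <stddef.h>

/* Assumed prefetch distance in elements; illustrative only. */
#define PREFETCH_AHEAD 16

/* Sum an array with a T0-style hint (locality = 3): the block is
 * requested into all cache levels, including L1. */
long sum_prefetch_t0(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
        total += a[i];
    }
    return total;
}

/* Same loop with a T2-style hint (locality = 1): the block is usually
 * placed only in an outer cache level, so the demand load can still
 * miss in L1. */
long sum_prefetch_t2(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        total += a[i];
    }
    return total;
}
```

Both functions return the same sum; only the cache level targeted by the hint differs, so any difference shows up in timing rather than in results.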
Effect of Using SW and HW Prefetching Together
Hardware Prefetcher Training Effects
Training the HW prefetcher with SW prefetch requests has some benefits, but its negative impact can degrade performance significantly
It's generally better not to train HW prefetching with SW prefetching requests
Prefetch Coverage
Lower prefetch coverage is the main reason for the performance loss in the neutral and negative groups
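For reference, coverage can be computed as below. This is a minimal sketch that assumes the usual definition, the ratio of useful prefetches to the cache misses observed without prefetching. The function name and parameter names are hypothetical.

```c
/* Prefetch coverage under the assumed definition:
 * useful prefetches divided by the misses seen without prefetching.
 * Returns 0.0 when there are no baseline misses to cover. */
double prefetch_coverage(unsigned long useful_prefetches,
                         unsigned long baseline_misses)
{
    if (baseline_misses == 0)
        return 0.0;
    return (double)useful_prefetches / (double)baseline_misses;
}
```

For example, 50 useful prefetches against 200 baseline misses gives a coverage of 0.25.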
Prefetching Classification
Even though a significant number of redundant prefetches exist in many benchmarks, they have little negative effect on performance
HW Prefetcher for Short Streams
One weakness of hardware prefetching is the difficulty of exploiting short streams
ASD HW Prefetcher
SW prefetching is much more effective for prefetching short streams than ASD
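To make the short-stream case concrete, the sketch below walks many short rows and prefetches the start of the next row while the current one is processed. A stream prefetcher would have little time to train on each row, whereas the program already knows where the next row begins. The function name and the data layout (an array of row pointers) are assumptions for illustration, and `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <stddef.h>

/* Sum n_rows short rows of row_len elements each. While row r is
 * summed, the first cache block of row r + 1 is prefetched. */
long sum_short_rows(const long *const *rows, size_t n_rows, size_t row_len)
{
    long total = 0;
    for (size_t r = 0; r < n_rows; r++) {
        if (r + 1 < n_rows)
            __builtin_prefetch(rows[r + 1], 0, 3);
        for (size_t i = 0; i < row_len; i++)
            total += rows[r][i];
    }
    return total;
}
```

Only the first block of each row is prefetched here; rows spanning several cache blocks would need one prefetch per block.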
Content Directed Prefetching (CDP)
Targets linked and other irregular data structures
SW prefetching is more effective for irregular data structures than CDP
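A minimal sketch of SW prefetching on a linked structure follows. It prefetches the next node while the current one is processed, using the GCC/Clang builtin `__builtin_prefetch`. The `struct node` layout and the function name are hypothetical, chosen only for this example.

```c
#include <stddef.h>

/* Hypothetical list node used only for this illustration. */
struct node {
    long value;
    struct node *next;
};

/* Sum a singly linked list, prefetching the successor of each node
 * before the current node's payload is used. */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        if (p->next != NULL)
            __builtin_prefetch(p->next, 0, 3);
        total += p->value;
    }
    return total;
}
```

Prefetching only one node ahead leaves little time to hide the miss when the per-node work is small. Prefetching further ahead requires extra pointers stored in the nodes, which this sketch does not attempt.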
Summary
HW prefetchers can under-exploit even regular access patterns and SW prefetching is frequently more effective in such cases
The SW prefetching distance is relatively insensitive to the HW configuration
The prefetch distance does need to be set carefully, but as long as the prefetch distance is greater than the minimum distance, most applications will not be sensitive to the prefetch distance
Although most L1 cache misses can be tolerated through out-of-order execution, when the L1 cache miss rate is much higher than 20%, reducing L1 cache misses by prefetching into the L1 cache can be effective
The overhead of useless prefetching instructions is not very significant
SW prefetching can be used to train a HW prefetcher and thereby yield some performance improvement. However, it can also degrade performance severely, and therefore must be done judiciously if at all
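The minimum-distance point in the summary can be sketched as follows. A common way to state the bound (an assumption here, since the notes do not give the formula) is D >= ceil(l / s), where l is the prefetch latency and s is the length of the shortest path through the loop body, both in cycles. The function names, the loop body and any cycle counts passed in are illustrative, and `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <stddef.h>

/* Smallest prefetch distance, in iterations, that satisfies
 * D >= ceil(latency / loop_body). Returns 0 if the loop body length
 * is given as 0, since the bound is undefined in that case. */
size_t min_prefetch_distance(size_t latency_cycles, size_t loop_body_cycles)
{
    if (loop_body_cycles == 0)
        return 0;
    return (latency_cycles + loop_body_cycles - 1) / loop_body_cycles;
}

/* dst[i] += 2 * src[i], prefetching `distance` iterations ahead.
 * One prefetch per element is issued for simplicity, so several
 * prefetches fall on the same cache block and are redundant. */
void scale_add(double *dst, const double *src, size_t n, size_t distance)
{
    for (size_t i = 0; i < n; i++) {
        if (i + distance < n) {
            __builtin_prefetch(&src[i + distance], 0, 3); /* read  */
            __builtin_prefetch(&dst[i + distance], 1, 3); /* write */
        }
        dst[i] += 2.0 * src[i];
    }
}
```

With an assumed latency of 200 cycles and a 30-cycle loop body, `min_prefetch_distance(200, 30)` yields 7. Per the summary above, a somewhat larger distance should not matter much for most applications.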