Software Prefetch

# Software Prefetch This note summarizes my experiments and findings while learn prefetching works, when it doesn’t, and why. Any corrections or feedback would be greatly appreciated. ## Software Prefetching ### What is software prefetching? Software prefetching means **the programmer explicitly specifies the address where user wants CPU the fetch into cache before it is actually needed**. It can be done by following intrinsics ```c= // GCC Built-in void __builtin_prefetch(const void *addr, int rw, int locality); // SSE instruction set void _mm_prefetch(void *p, enum _mm_hint h); ``` ### When to use software prefetching? In my experiment, I use a linear scan over a contiguous region with different strides. Here is the linear scan demonstration ```c= for (size_t i = 0; i < capacity; i += stride) { sum += buffer[i]; } ``` This is a good setup for a software prefetch example since **the access pattern is simple and fully predictable**. You always know exactly which element you will read next. With the software prefetch, we can modify our code like ```c= size_t i = 0; for (; i + prefetch_distance < capacity; i += stride) { __builtin_prefetch(&buffer[i + prefetch_distance], 0, 3); sum += buffer[i]; } for (; i < capacity; i += stride) { sum += buffer[i]; } ``` Here: * `prefetch_distance` controls how far ahead we prefetch. * The first loop handles the **steady state** where we can still prefetch the next address. * The second loop **drains** the remaining iterations. ### Experiment: Normal Scan vs Software Prefetch The first experiment compares the performance between linear scan with and without software prefetch. The only variable in this experiment is the **stride size**. To keep the experiment consistent, I fix the following: * Fixed number of read (from `buffer`) * **Prefetch distance equals to stride size** * Memory allocation and random initialization are done **beforehand**, and the data is mapped into the process address space using `mmap`. * Before each run, I explicitly drop the caches and pin the process (single thread) to a specific core using `taskset`, so that measurements are less affected by noise from other cores or cached data. 1. **Delta Elapsed Time** The elapsed time measures how long it takes to complete the buffer scan (with fixed iteration). In the below figure, the delta elapsed time is defined as **no-prefetch solution minus the prefetch version**. A positive delta means the prefetch version is faster (lower elapsed time, better performance). ![dt_prefetch_vs_noprefetch](https://hackmd.io/_uploads/H1sGP3a-bx.png =800x400) It's important to realize **why** the prefetch version runs faster, I profile both implementation by linux perf events and compare their behavior. 2. **L1 Cache Miss Rate** and **Instruction Per Cycle** For the small strides, both versions have a low miss rate because the scan benefits from strong localit. As the stride increases, each access touches a differents cache line and eventually different pages, so the miss rate rises. The no-prefetch curve climbs much faster while the prefetch curve grows more slowly and stays noticeably lower. This shows that software prefetching successfully brings many of the future lines into L1 early enough, reducing the number of demand misses. ![L1_cache_miss_rate](https://hackmd.io/_uploads/BkaIepaWZl.png =800x400) The **instructions per cycle (IPC)** metric reports how many instructions the CPU executes on average in each clock cycle. In the below figure, the no-prefetch version has lower IPC than the prefetch version, which suggests that the core is spending more time stalled waiting for data from lower levels of the memory hierarchy. In the modern CPU, **you would usually expect IPC to be close to 1.0 or higher**. If your program’s IPC is significantly lower and you are not doing heavy I/O, that is a strong hint that your memory access pattern might be the main performance bottleneck. ![IPC](https://hackmd.io/_uploads/BJSog6pZZx.png =800x400) ### Experiment: Software prefetch with different prefetch distance Previous experiments demonstrated that **software prefetching improves read performance** compared to a no-prefetch baseline. A natural follow-up question is whether the **prefetch distance**—the byte offset between the current access and the prefetched address—further affects performance. This experiment **isolates prefetch distance as the only variable** and evaluates its impact on elapsed time and L1 cache miss rate. ##### First Experimental Setup * Fixed Stride Size: 4096 bytes * Fixed Iteration: 4 * 1024 * 1024 * Memory pre-allocation & mmap to user space * Prefetch Distance: 4096 -> 77,824 bytes 1. **Elapsed Time** ![Elapsed Time](https://hackmd.io/_uploads/HJtYUMS7-e.png =800x400) 2. **L1 Cache Miss Rate** ![L1 Cache Miss Rate](https://hackmd.io/_uploads/ryXCLGS7bl.png =800x400) The two figures reveal a clear trend: **increasing the software prefetch distance beyond a certain range does not improve performance and instead gradually degrades it**. The **L1 cache miss rate** figure shows **a monotonic increase as prefetch distance grows**. When the distance is small, prefetched cache lines are more likely to be consumed while they are still resident in the L1 cache. However, as the prefetch distance increases, **prefetched data arrives too early (Cache Pollution)** and is more likely to be evicted before use, either by demand loads or by subsequent prefetches. **This results in a higher L1 miss rate** despite the presence of software prefetch instructions. The elapsed time figure closely mirrors this behavior. **At shorter prefetch distances, execution time remains relatively stable**, indicating that prefetching is effectively overlapping memory latency with computation. Once the distance exceeds a certain threshold, elapsed time begins to increase steadily. This suggests that overly aggressive prefetching introduces additional overhead: wasted memory bandwidth, cache pollution, and increased pressure on the memory hierarchy, all of which outweigh any potential latency-hiding benefits. Taken together, these results demonstrate that software prefetching is highly sensitive to prefetch distance. While **prefetching itself can improve performance compared to a no-prefetch baseline, larger prefetch distances are not inherently better**. The key takeaway is **that effective software prefetching requires careful tuning.** Prefetch distance must be chosen to match **the workload’s access pattern and the underlying cache hierarchy**. **Do not guess** about performance just do the benchmark. ### When 'not' to use software prefetching? Software prefetching does not universally improve memory access performance. In certain access patterns—particularly random memory access—prefetching can become ineffective or even harmful. It is therefore important to understand not only when to use software prefetching, but also when it should be avoided. #### Random Access Experimental Setup * Fixed Prefetch Distance * Random access index & read memory is pre-generated and mmap to test user space. ![Elapsed Time(2)](https://hackmd.io/_uploads/SyM0S4HXbg.png) As shown in the elapsed time results, **no clear performance trend emerges across different stride sizes**, and **the prefetch-enabled version performs worse than the no-prefetch baseline in most cases**. This behavior indicates that software prefetching is unable to accurately predict future memory accesses under random access patterns. This experiment highlights a key limitation of software prefetching: **it is most effective when future memory accesses are predictable**. When access patterns **lack spatial or temporal locality, software prefetching provides little benefit and may degrade overall performance**. ## Conclusion Software prefetching is a powerful technique for hiding memory access latency, but it's not the dark magic for performance. Throught this experiments, we can draw these conclusions: * **Predictability is Mandatory**: Software prefetching only provides a benefit when the access pattern is **linear or preditable**. In random access scenarios, the overhead of the prefetch instruction and the risk of cache pollution actually degrade performance. * **The Prefetch Distance**: Prefetch distance is the most critical variable. If the distance is too short, the data won't arrive in time to hide the latency; if it is too long, the data will be evicted before it is ever used. * **Trust Data, Not Intuition**: Modern CPUs are equipped with sophisticated hardware prefetchers that handle simple patterns efficiently. **Software prefetching should be reserved for cases where the programmer has "insider knowledge" of a complex pattern that the hardware cannot see**. ## References * [Stack Overflow - prefetching-examples](https://stackoverflow.com/questions/7327994/prefetching-examples) * [Mohit Mishra's Note](https://mohitmishra786687.medium.com/turbocharging-your-code-a-no-bs-guide-to-software-prefetching-in-c-7db1abacd92f) * [GNU Data Prefetch](https://gcc.gnu.org/projects/prefetch.html)