# P-OPT: Practical Optimal Cache Replacement for Graph Analytics Architectures
###### tags: `Cache Replacement`
###### paper origin: HPCA-2021
###### papers: [link](https://users.ece.cmu.edu/~vigneshb/papers/hpca21.pdf)
# 1. Introduction
* Graph data reuse is dynamically variable and graph-structure-dependent, two properties not captured well by existing replacement policies.
* The main contribution of this paper is a practical cache replacement policy which can approach Belady’s MIN replacement policy without using future knowledge **by directly referring to the graph’s structure** for replacement decisions.
## 1.1 Contribution
* This paper shows that state-of-the-art cache replacement policies are ineffective for graph processing.
* This paper shows that guiding replacement by graph structure (T-OPT) allows emulating Belady's MIN policy without oracular knowledge of all future accesses.
* This paper describes P-OPT's quantized Rereference Matrix data structure and the architectural mechanisms that allow low-cost access to the graph transpose for optimal replacement.
* This paper evaluates P-OPT across a range of graphs and applications, comparing against many state-of-the-art systems and showing P-OPT's consistent performance benefits.
# 2. Background
## 2.1 Overview of Graph Processing
### Data Structure

* Real-world graphs are often extremely sparse (>99% sparse). The Compressed Sparse Row/Column (CSR/CSC) format is storage efficient and can quickly identify a vertex's neighbors.
* Offsets Array (OA): stores the starting offset of each vertex's neighbor list in the Neighbor Array (NA). Neighbor Array (NA): stores each vertex's neighbors.
* To access the neighbors of vertex i, an application reads the ith and (i+1)th entries of the OA to find the range of indices in the NA containing vertex i's neighbors (see the sketch after this list).
* **CSR encodes outgoing destination neighbors** for each source vertex, **CSC encodes incoming source neighbors** for each destination vertex
* Most frameworks store CSR and CSC because common algorithms and optimizations require fast access to both outgoing and incoming neighbors
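
A minimal sketch of the OA/NA lookup, using a toy 3-vertex graph (the arrays and names below are illustrative, not from the paper):

```python
# CSR of a toy 3-vertex graph with edges 0->1, 0->2, 2->1.
offsets_array = [0, 2, 2, 3]   # OA: entry i marks where vertex i's list starts in NA
neighbor_array = [1, 2, 1]     # NA: all out-neighbor lists, concatenated

def out_neighbors(v):
    # The ith and (i+1)th OA entries bound vertex v's slice of NA.
    start, end = offsets_array[v], offsets_array[v + 1]
    return neighbor_array[start:end]

assert out_neighbors(0) == [1, 2]
assert out_neighbors(1) == []     # vertex 1 has no outgoing edges
assert out_neighbors(2) == [1]
```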
### Graph Traversal Pattern

* A common idiom in graph processing is to iteratively visit each edge and update per-vertex data corresponding to the edge’s source and destination vertices.
* Traversing **outgoing neighbors (using CSR)** is a **push** execution.
* Traversing **incoming neighbors (using CSC)** is a **pull** execution
* Algorithm 1 shows a pull execution of a graph processing kernel (a sketch follows this list).
* However, the **arbitrary order of the CSC's Neighbor Array** leads to an irregular, graph-dependent pattern of accesses to srcData. Real-world graphs are very large (GBs to TBs), and **irregular accesses to large graphs have poor cache locality.**
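
A minimal sketch of the pull idiom, assuming a simple sum-style update (Algorithm 1's actual kernel may differ); it makes the regular/irregular access split explicit:

```python
def pull_kernel(csc_oa, csc_na, src_data, dst_data):
    # Outer loop over destinations: sequential, cache-friendly accesses.
    for dst in range(len(csc_oa) - 1):
        # Inner loop over dst's incoming neighbors from the CSC.
        for idx in range(csc_oa[dst], csc_oa[dst + 1]):
            src = csc_na[idx]               # arbitrary, graph-dependent index
            dst_data[dst] += src_data[src]  # irregular read with poor locality
    return dst_data
```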
## 2.2 Limitations of existing replacement policies

* The characteristics of graph processing render state-of-the-art replacement policies ineffective.
* Simple policies (LRU and DRRIP) **do not learn graph-structure-dependent irregular access patterns.**
* SHiP-PC and Hawkeye use the PC to predict rereference behavior, assuming all accesses by an instruction have the same reuse properties; in graph kernels a single load instruction touches many vertices with very different reuse, so the assumption breaks down.
# 3. TRANSPOSE-BASED OPTIMAL CACHE REPLACEMENT
## 3.1 Transpose Encodes Future Reference Information
++Problem with OPT:++
* An OPT implementation must scan the contents of the NA to find when a particular source vertex will next be referenced by a destination vertex.
* For example, after the first access to S1 (pointed to by D0), OPT must scan the NA to find that S1 will be re-referenced by D4.
* In the worst case, **the entire NA may be scanned** to find the next reference to a srcData element, which results in a complexity of $O(|Edges|)$.
++Solution using the transpose:++
* By accessing the CSR, we can quickly learn that vertex S1 has two outgoing neighbors (vertices D0 and D4) and, hence, that srcData[S1] will next be accessed while processing the incoming neighbors of vertex D4.
* The complexity of finding the next future reference of a srcData element is reduced to $O(|OutDegree|)$, i.e., scanning the outgoing neighbors of a vertex.
* Conversely, a push execution using the CSR can use its transpose (the CSC) to estimate next references for irregular dstData accesses.

* Figure 3 (center panel) shows a 2-way set-associative cache in which each cache way can store only a single element of srcData; a sketch of the transpose-based next-reference query follows.
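
A minimal sketch of T-OPT's next-reference query for a pull execution (our reconstruction; it assumes neighbor lists are sorted by vertex id):

```python
INFINITY = float("inf")   # stands for "never referenced again"

def next_reference(csr_oa, csr_na, v, curr_dst):
    # Scan v's out-neighbors in the transpose (CSR): the first one greater
    # than the currently processed destination is the next vertex whose
    # pull will read srcData[v]. Cost: O(|OutDegree(v)|), not O(|Edges|).
    for idx in range(csr_oa[v], csr_oa[v + 1]):
        if csr_na[idx] > curr_dst:
            return csr_na[idx]
    return INFINITY
```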
## 3.2 Transpose-based Optimal Replacement Performance

* T-OPT significantly reduces LLC MPKI compared to LRU and other policies, achieving a 41% miss rate for PageRank (compared to a 60-70% miss rate for other policies).
* The main reason T-OPT is better is that it does not guess the re-reference pattern; instead, it uses precise information about future reuse, encoded in the graph's transpose, to make better replacement decisions.
## 3.3 Limitations of Transpose-based Optimal Replacement
* A key challenge posed by T-OPT is that naively **accessing the transpose** imposes an untenable runtime overhead and cache footprint.
### Increased Run Time
* Finding the next reference of a vertex costs $O(|OutDegree|)$ for that vertex.
* Finding the next reference of a cache line involves **finding the next reference of each vertex** in the cache line (and reporting the minimum of these values).
### Increased Memory Access
* Since the vertices resident in cache can be arbitrary, the neighbor lookups using the Offset Array (OA) and Neighbor Array (NA) of the transpose (Figure 1) **incur additional irregular memory accesses** that increase cache contention with graph application data
# 4. P-OPT: PRACTICAL OPTIMAL REPLACEMENT
* P-OPT uses a specialized data structure (called the Rereference Matrix) for fast access of re-reference information available in a graph’s transpose without incurring T-OPT’s overheads.
## 4.1 Reducing the Overheads of T-OPT

### Quantizing Re-reference Information
* By quantizing next references into fewer, uniform-sized epochs, P-OPT reduces the range of next references for each vertex (spanning Epoch-0 to 2), unlike T-OPT where the next reference spans the entire range of vertices in the graph (D0 to D4)
### Rereference Matrix
* In Figure 3's example, the number of cache lines equals the number of vertices because each cache line stores a single srcData element.
* Each Rereference Matrix entry stores the **distance** (in epochs) to the epoch of a cache line's next reuse:
    * the maximum value M indicates no future reference (infinity)
    * 0 indicates the cache line will be accessed in the current epoch
* Instead of traversing the neighbors of each vertex in a cache line (as in T-OPT), P-OPT need only look up a single next-reference entry for the cache line in $O(1)$.
* For a graph of 32 million vertices, 64B cache lines, and 4B srcData elements, 8-bit quantization yields a Rereference Matrix column size of 2MB (32M × 4B / 64B = 2M lines × 1B), consuming only a small part of a typical server CPU's LLC (a construction sketch follows this list).
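
A minimal sketch of constructing one quantized Rereference Matrix column (our reconstruction from the description above; all names and the single-column scope are assumptions):

```python
M = 255  # 8-bit saturation value standing in for "no future reference"

def build_column(csr_oa, csr_na, verts_per_epoch, curr_epoch, verts_per_line):
    num_vertices = len(csr_oa) - 1
    num_lines = (num_vertices + verts_per_line - 1) // verts_per_line
    column = [M] * num_lines
    for v in range(num_vertices):
        line = v // verts_per_line            # cache line holding srcData[v]
        for idx in range(csr_oa[v], csr_oa[v + 1]):
            # Quantize the referencing destination vertex to an epoch.
            epoch = csr_na[idx] // verts_per_epoch
            dist = epoch - curr_epoch
            if 0 <= dist < column[line]:
                column[line] = dist           # keep the nearest future epoch
    return column
```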
## 4.2 Tolerating Quantization Loss

* Only inter-epoch reference information is tracked, so an execution cannot identify a cache line's final reference within an epoch.
* The MSB of an entry encodes whether the cache line is accessed in the current epoch:
    * if not: MSB = 1, and the lower bits encode the distance to the cache line's next reference
    * if so: MSB = 0, and the lower bits encode when the final access to the cache line happens within the epoch
        * P-OPT separates an epoch into many sub-epochs and encodes the sub-epoch of the cache line's final reference

* P-OPT checks the MSB of the cache line's Rereference Matrix entry for the current epoch (currEntry) (Line 5); a sketch of this decode follows the list.
    * if currEntry[7] (MSB) = 1, return currEntry[6:0], the distance to the next reference epoch
    * if MSB = 0, check whether execution has already passed the cache line's final reference in this epoch
        * if not, return 0 (the cache line will be reused within this epoch)
        * if so, return the distance between the current epoch and the next reference epoch
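
A minimal sketch of this decode, reconstructed from the description above (the fallback to the next-epoch column is our assumption about where the post-final-reference distance comes from):

```python
def next_ref(curr_entry, curr_sub_epoch, next_epoch_entry):
    if (curr_entry >> 7) & 1:           # MSB = 1: not accessed this epoch
        return curr_entry & 0x7F        # lower 7 bits: distance to next reference
    last_sub_epoch = curr_entry & 0x7F  # MSB = 0: lower bits give final sub-epoch
    if curr_sub_epoch <= last_sub_epoch:
        return 0                        # final reference still ahead: reused soon
    # Past the final intra-epoch access: fall back to the next-epoch column
    # (assumed source of the post-final distance).
    return next_epoch_entry
```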

* Figure 7 shows the reduction in LLC misses achieved by the different policies relative to DRRIP.
* P-OPT-INTER+INTRA achieves LLC miss reductions close to those of the idealized T-OPT.
# 5. P-OPT ARCHITECTURE
## 5.1 Storing Next References in LLC

* Figure 8 shows some LLC ways reserved for current (orange) and next (blue) epoch columns of the Rereference Matrix
## 5.2 Finding a Replacement Candidate

* P-OPT maintains two registers – irreg_base and irreg_bound – to track the address range of a graph kernel’s irregData
* During cache replacement, P-OPT compares the address in the tag portion of each way in the eviction set against irreg_base and irreg_bound registers to determine if the way contains an irregData cache line
* P-OPT uses a **next-ref engine** (a finite state machine) to compute the next reference of each non-reserved way in the eviction set using Algorithm 2, and then stores those next references in the next-ref buffer.
* P-OPT maintains **next-ref buffers** at the LLC to keep track of the next references of each way in the eviction set
++Steps:++
1. The next-ref engine computes next references by accessing the Rereference Matrix entries for each irregData line in the eviction set.
2. The next-ref engine then stores the computed next references in the next-ref buffer.
3. The next-ref engine then searches the next-ref buffer for the way with the **largest (i.e., furthest-in-the-future)** next reference value to evict, settling ties with a baseline replacement policy (P-OPT uses DRRIP). A software model of these steps follows.
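
A minimal software model of the three steps above (names are ours, and a generic `baseline_rank` stands in for DRRIP):

```python
def pick_victim(eviction_set, next_ref_of, baseline_rank):
    # Steps 1-2: compute each way's next reference into the next-ref buffer.
    next_ref_buffer = {way: next_ref_of(way) for way in eviction_set}
    # Step 3: evict the way whose reuse is furthest in the future.
    furthest = max(next_ref_buffer.values())
    tied = [w for w, nr in next_ref_buffer.items() if nr == furthest]
    return min(tied, key=baseline_rank)   # tie-break with the baseline policy
```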
## 5.3 Streaming Rereference Matrix columns into the LLC
* To transfer Rereference Matrix entries from memory to LLC, P-OPT uses a dedicated hardware unit called the **streaming engine**
* The programmer invokes the streaming engine at every epoch boundary using a new stream_nextrefs instruction. The instruction swaps the pointers to the current and next epoch columns (Figure 8) and streams the upcoming epoch's column of the Rereference Matrix into its reserved LLC locations.
* Doing so does not impose a performance penalty: epoch boundaries are infrequent, so the streaming engine's latency is easily amortized. A conceptual software stand-in for the instruction follows.
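
A conceptual stand-in for `stream_nextrefs` (the real mechanism is a hardware streaming engine; this only illustrates the pointer swap and column fetch):

```python
def stream_nextrefs(cols, rereference_matrix, upcoming_epoch):
    # Swap roles: last epoch's "next" column becomes the new "current" one.
    cols["curr"], cols["next"] = cols["next"], cols["curr"]
    # Stream the upcoming epoch's column into the reserved LLC ways
    # (modeled here as overwriting the "next" slot).
    cols["next"] = rereference_matrix[upcoming_epoch]
    return cols
```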
## 5.4 Generalizing P-OPT
* With simple extensions, P-OPT supports multi-threading, multiple irregularly-accessed data streams, and context switches.
### Support Parallel multi-threaded Execution
* In a multi-threaded execution, multiple active vertices are being traversed at a time (i.e., a unique currDstId for each thread) and P-OPT needs to select one of the active vertices for next reference computation (Algorithm 2; Lines 8-12).
* To guarantee that all threads always process vertices in the same epoch, P-OPT requires slight modification of the application to **execute epochs serially** (vertices within an epoch are executed in parallel).
* Due to the relatively small number of epochs (256 in the default 8-bit P-OPT configuration), each epoch consists of a large number of vertices, and restricting parallelism to within an epoch does not significantly impact performance (see the sketch below).
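
A minimal sketch of the epoch-serial schedule, with a thread pool standing in for the paper's runtime (names are ours):

```python
from concurrent.futures import ThreadPoolExecutor

def run_epoch_serial(num_vertices, num_epochs, process_vertex):
    verts_per_epoch = (num_vertices + num_epochs - 1) // num_epochs
    with ThreadPoolExecutor() as pool:
        for e in range(num_epochs):               # epochs execute serially
            lo = e * verts_per_epoch
            hi = min(lo + verts_per_epoch, num_vertices)
            # All threads stay inside the same epoch at any given time.
            list(pool.map(process_vertex, range(lo, hi)))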
### Handling Multiple Irregular Data Streams
P-OPT can support multiple irregular data structures using three architecture changes.
1. P-OPT holds a **separate Rereference Matrix for each irregular data structure**
2. P-OPT reserves the minimum number of ways in the LLC to hold the Rereference Matrix data for all the different irregular data structures. The system maintains a separate set-base and way-base register for each irregular data structure.
3. P-OPT **maintains an irreg_base and irreg_bound register for each irregular data structure** to use the right Rereference Matrix data corresponding to each data structure.
### Virtualization
* The Rereference Matrix in P-OPT only tracks reuse among graph application data.
* If applications share LLC, **P-OPT may unfairly prefer caching graph data over other data.**
* To remain fair, we assume per-process way-partitioning, with P-OPT replacing data only in the LLC ways designated for the graph process.
* P-OPT supports context switches, by saving its registers (set-base, way-base, irreg_base, irreg_bound, currVertex) with the process context
* On resumption, P-OPT invokes the streaming engine to refetch Rereference Matrix contents into reserved LLC ways
## 5.5 Implementation Complexity
* P-OPT has low architectural complexity.
* P-OPT only adds next-ref buffers to temporarily store next references during replacement
* The size of the next-ref buffer state is bounded by the maximum cache-level parallelism at the LLC.
* The next-ref engine is a simple FSM that only needs support for integer division and basic bit manipulation.
# 6. Evaluation
## 6.1 P-OPT Improves Performance

* P-OPT outperforms DRRIP across the board, with an average speedup of 22% and an LLC miss reduction of 24%.
## 6.2 P-OPT Scales with Graph Size

* We evaluate a P-OPT variant, P-OPT-Single-Epoch (P-OPT-SE), that **computes next references using only the current epoch column of the Rereference Matrix**.
* P-OPT-SE encodes information about the next epoch within the current epoch column by repurposing the **second most significant bit** of an entry to track if the cache line is accessed in the next epoch (Figure 6).
* P-OPT-SE stores only the current epoch column in LLC. However, the reduced cache footprint in P-OPT-SE comes at the expense of **reduced next reference quality**.
* With two bits of each entry used as flags (the MSB and the second MSB), only 6 bits remain for next-reference distances, halving the tracked range from 128 to 64; P-OPT-SE is forced to use coarser quantization (see the encoding sketch after this list).
* For graphs with fewer than 32 million vertices, P-OPT has better LLC locality than P-OPT-SE.
    * P-OPT's richer replacement information (i.e., current and next epoch columns) overshadows the reduction in effective LLC capacity.
* However, for larger graphs, P-OPT-SE has better locality because P-OPT's two resident columns reduce effective LLC capacity too much.
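
A minimal sketch of a P-OPT-SE entry encoding consistent with the description above (the exact bit layout is our assumption; Figure 6 in the paper is authoritative):

```python
def encode_se(accessed_this_epoch, accessed_next_epoch, payload):
    entry = 0 if accessed_this_epoch else (1 << 7)   # MSB, as in full P-OPT
    entry |= (1 if accessed_next_epoch else 0) << 6  # repurposed second MSB
    entry |= payload & 0x3F                          # only 6 payload bits remain,
    return entry                                     # halving the range to 64
```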
## 6.3 P-OPT compared to prior optimizations
### Graph-agnostic improvements

* Unlike prior work, **P-OPT is graph-agnostic, not reliant on specific structure or vertex ordering of a graph**
#### GRASP
* GRASP is a replacement policy for graphs with very skewed degree distributions. GRASP expects a pre-processed input in which vertices have been reordered using Degree-Based Grouping (DBG).
* Figure 12(a) shows locality improvements from GRASP and P-OPT for PageRank on DBG-ordered graphs. P-OPT outperforms GRASP in three ways:
1. GRASP works well for graphs with skewed degree distributions but is less effective for other inputs; P-OPT is agnostic to graph structure, offering consistent improvements.
2. Even for skewed graphs, P-OPT achieves higher LLC miss reduction than GRASP.
3. GRASP requires the input graph to be reordered (using DBG), whereas P-OPT is applicable across any vertex ordering.
#### HATS-BDFS
* HATS-BDFS is a dynamic vertex-scheduling architecture: it runs a hardware Bounded Depth-First Search (BDFS) to schedule vertices, yielding locality improvements on graphs with community structure.
* Figure 12(b) compares P-OPT on the standard vertex schedule (“Vertex Ordered” per HATS) against HATS-BDFS. The data shows that HATS-BDFS’s improvements are sensitive to graph structure.
* For graphs with community structure, BDFS offers locality improvements, even outpacing T-OPT because BDFS improves locality at all cache levels.
* However, for graphs without community structure (even power-law graphs such as DBP and KRON), BDFS increases LLC misses.
* In contrast, P-OPT offers consistent LLC locality improvements, leading to a higher mean LLC miss reduction compared to HATS-BDFS
## 6.4 Sensitivity Studies
### Sensitivity to quantization level

* Figure 15 shows P-OPT’s performance with 4-bit, 8-bit, and 16-bit quantization in the Rereference Matrix
* On average, we observe that for P-OPT with 4b, 8b, and 16b quantization of rereferences, 41%, 12%, and 0% of all LLC replacements respectively result in a tie
* The already low percentage of replacement ties at 8b quantization explains why P-OPT sees little benefit with higher precision.
### Sensitivity to LLC parameters

* The benefit offered by P-OPT over DRRIP increases with LLC capacity because the fraction of LLC consumed for Rereference Matrix columns reduces.
* P-OPT also offers higher miss reduction with higher LLC associativity. As associativity increases, P-OPT has more options for replacement and makes a better choice by considering next references of all ways in the eviction set
### Preprocessing cost of P-OPT

* The main performance results (Figure 10) omitted preprocessing costs because the Rereference Matrix is algorithm agnostic and needs to be created only once for a graph.
* On average, constructing the Rereference Matrix accounts for 19.8% of PageRank runtime
* Since the Rereference Matrix is algorithm agnostic, the preprocessing cost of P-OPT can be easily amortized by reusing the Rereference Matrix across multiple applications running on the same graph