# P-OPT: Practical Optimal Cache Replacement for Graph Analytics Architectures
###### tags: `Cache Replacement`
###### paper origin: HPCA-2021
###### papers: [link](https://users.ece.cmu.edu/~vigneshb/papers/hpca21.pdf)
# 1. Introduction
* Graph data reuse is dynamically variable and graph-structure-dependent, two properties not captured well by existing replacement policies.
* The main contribution of this paper is a practical cache replacement policy which can approach Belady’s MIN replacement policy without using future knowledge **by directly referring to the graph’s structure** for replacement decisions.
## 1.1 Contribution
* This paper shows that state-of-the-art cache replacement policies are ineffective for graph processing.
* This paper shows that guiding replacement by graph structure (T-OPT) allows emulating Belady's MIN policy without oracular knowledge of all future accesses.
* This paper describes P-OPT's quantized Rereference Matrix data structure and the architectural mechanisms that allow low-cost access to the graph transpose for optimal replacement.
* This paper evaluates P-OPT across a range of graphs and applications, comparing against many state-of-the-art systems and showing P-OPT's consistent performance benefits.
# 2. Background
## 2.1 Overview of Graph Processing
### Data Structure

* Real-world graphs are often extremely sparse (>99% sparse). The Compressed Sparse Row/Column (CSR/CSC) format is storage efficient and can quickly identify a vertex's neighbors.
* Offsets Array (OA): stores the starting offset of each vertex's neighbor list in the Neighbor Array (NA). Neighbor Array (NA): stores each vertex's neighbors.
* To access the neighbors of vertex i, an application reads the ith and (i+1)th entries of the OA to find the range of indices in the NA containing vertex i's neighbors (see the sketch after this list).
* **CSR encodes outgoing destination neighbors** for each source vertex, **CSC encodes incoming source neighbors** for each destination vertex
* Most frameworks store CSR and CSC because common algorithms and optimizations require fast access to both outgoing and incoming neighbors
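
A minimal sketch of the OA/NA lookup, using a toy 3-vertex graph (the arrays and names below are illustrative, not from the paper):

```python
# CSR of a toy 3-vertex graph with edges 0->1, 0->2, 2->1.
offsets_array = [0, 2, 2, 3]   # OA: entry i marks where vertex i's list starts in NA
neighbor_array = [1, 2, 1]     # NA: all out-neighbor lists, concatenated

def out_neighbors(v):
    # The ith and (i+1)th OA entries bound vertex v's slice of NA.
    start, end = offsets_array[v], offsets_array[v + 1]
    return neighbor_array[start:end]

assert out_neighbors(0) == [1, 2]
assert out_neighbors(1) == []     # vertex 1 has no outgoing edges
assert out_neighbors(2) == [1]
```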
### Graph Traversal Pattern

* A common idiom in graph processing is to iteratively visit each edge and update per-vertex data corresponding to the edge’s source and destination vertices.
* Traversing **outgoing neighbors (using CSR)** is a **push** execution.
* Traversing **incoming neighbors (using CSC)** is a **pull** execution
* Algorithm 1 shows a pull execution of a graph processing kernel (a sketch follows this list).
* However, the **arbitrary order of the CSC's Neighbor Array** leads to an irregular, graph-dependent pattern of accesses to srcData. Real-world graphs are very large (GBs to TBs), and **irregular accesses to large graphs have poor cache locality.**
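
A minimal sketch of the pull idiom, assuming a simple sum-style update (Algorithm 1's actual kernel may differ); it makes the regular/irregular access split explicit:

```python
def pull_kernel(csc_oa, csc_na, src_data, dst_data):
    # Outer loop over destinations: sequential, cache-friendly accesses.
    for dst in range(len(csc_oa) - 1):
        # Inner loop over dst's incoming neighbors from the CSC.
        for idx in range(csc_oa[dst], csc_oa[dst + 1]):
            src = csc_na[idx]               # arbitrary, graph-dependent index
            dst_data[dst] += src_data[src]  # irregular read with poor locality
    return dst_data
```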
## 2.2 Limitations of existing replacement policies

* The characteristics of graph processing render state-of-the-art replacement policies ineffective.
* Simple policies (LRU and DRRIP) **do not learn graph-structure-dependent irregular access patterns.**
* SHiP-PC and Hawkeye use the PC to predict rereference behavior, assuming all accesses by an instruction have the same reuse properties; in graph kernels a single load instruction touches many vertices with very different reuse, so the assumption breaks down.
# 3. TRANSPOSE-BASED OPTIMAL CACHE REPLACEMENT
## 3.1 Transpose Encodes Future Reference Information
++Problem with OPT:++
* An OPT implementation must scan the contents of the NA to find when a particular source vertex will next be referenced by a destination vertex.
* For example, after the first access to S1 (pointed to by D0), OPT must scan the NA to find that S1 will be re-referenced by D4.
* In the worst case, **the entire NA may be scanned** to find the next reference to a srcData element, which results in a complexity of $O(|Edges|)$.
++Solution using the transpose:++
* By accessing the CSR, we can quickly learn that vertex S1 has two outgoing neighbors (vertices D0 and D4) and, hence, that srcData[S1] will next be accessed while processing the incoming neighbors of vertex D4.
* The complexity of finding the next future reference of a srcData element is reduced to $O(|OutDegree|)$, i.e., scanning the outgoing neighbors of a vertex.
* Conversely, a push execution using the CSR can use its transpose (the CSC) to estimate next references for irregular dstData accesses.

* Figure 3 (center panel) shows a 2-way set-associative cache in which each cache way can store only a single element of srcData; a sketch of the transpose-based next-reference query follows.
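
A minimal sketch of T-OPT's next-reference query for a pull execution (our reconstruction; it assumes neighbor lists are sorted by vertex id):

```python
INFINITY = float("inf")   # stands for "never referenced again"

def next_reference(csr_oa, csr_na, v, curr_dst):
    # Scan v's out-neighbors in the transpose (CSR): the first one greater
    # than the currently processed destination is the next vertex whose
    # pull will read srcData[v]. Cost: O(|OutDegree(v)|), not O(|Edges|).
    for idx in range(csr_oa[v], csr_oa[v + 1]):
        if csr_na[idx] > curr_dst:
            return csr_na[idx]
    return INFINITY
```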
## 3.2 Transpose-based Optimal Replacement Performance

* T-OPT significantly reduces LLC MPKI compared to LRU and other policies, achieving a 41% miss rate for PageRank (compared to a 60-70% miss rate for other policies).
* The main reason T-OPT is better is that it does not guess the re-reference pattern; instead, it uses precise information about future reuse, encoded in the graph's transpose, to make better replacement decisions.
## 3.3 Limitations of Transpose-based Optimal Replacement
* A key challenge posed by T-OPT is that naively **accessing the transpose** imposes an untenable runtime overhead and cache footprint.
### Increased Run Time
* Finding the next reference of a vertex costs $O(|OutDegree|)$ for that vertex.
* Finding the next reference of a cache line involves **finding the next reference of each vertex** in the cache line (and reporting the minimum of these values).
### Increased Memory Access
* Since the vertices resident in cache can be arbitrary, the neighbor lookups using the Offset Array (OA) and Neighbor Array (NA) of the transpose (Figure 1) **incur additional irregular memory accesses** that increase cache contention with graph application data
# 4. P-OPT: PRACTICAL OPTIMAL REPLACEMENT
* P-OPT uses a specialized data structure (called the Rereference Matrix) for fast access of re-reference information available in a graph’s transpose without incurring T-OPT’s overheads.
## 4.1 Reducing the Overheads of T-OPT

### Quantizing Re-reference Information
* By quantizing next references into fewer, uniform-sized epochs, P-OPT reduces the range of next references for each vertex (spanning Epoch-0 to 2), unlike T-OPT where the next reference spans the entire range of vertices in the graph (D0 to D4)
### Rereference Matrix
* In Figure 3's example, the number of cache lines equals the number of vertices because each cache line stores a single srcData element.
* Each Rereference Matrix entry stores the **distance** (in epochs) to the epoch of a cache line's next reuse:
    * the maximum value M indicates no future reference (infinity)
    * 0 indicates the cache line will be accessed in the current epoch
* Instead of traversing the neighbors of each vertex in a cache line (as in T-OPT), P-OPT need only look up a single next-reference entry for the cache line in $O(1)$.
* For a graph of 32 million vertices, 64B cache lines, and 4B srcData elements, 8-bit quantization yields a Rereference Matrix column size of 2MB (32M × 4B / 64B = 2M lines × 1B), consuming only a small part of a typical server CPU's LLC (a construction sketch follows this list).
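
A minimal sketch of constructing one quantized Rereference Matrix column (our reconstruction from the description above; all names and the single-column scope are assumptions):

```python
M = 255  # 8-bit saturation value standing in for "no future reference"

def build_column(csr_oa, csr_na, verts_per_epoch, curr_epoch, verts_per_line):
    num_vertices = len(csr_oa) - 1
    num_lines = (num_vertices + verts_per_line - 1) // verts_per_line
    column = [M] * num_lines
    for v in range(num_vertices):
        line = v // verts_per_line            # cache line holding srcData[v]
        for idx in range(csr_oa[v], csr_oa[v + 1]):
            # Quantize the referencing destination vertex to an epoch.
            epoch = csr_na[idx] // verts_per_epoch
            dist = epoch - curr_epoch
            if 0 <= dist < column[line]:
                column[line] = dist           # keep the nearest future epoch
    return column
```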
## 4.2 Tolerating Quantization Loss

* Only inter-epoch reference information is tracked, so an execution cannot identify a cache line's final reference within an epoch.
* The MSB of an entry encodes whether the cache line is accessed in the current epoch:
    * if not: MSB = 1, and the lower bits encode the distance to the cache line's next reference
    * if so: MSB = 0, and the lower bits encode when the final access to the cache line happens within the epoch
        * P-OPT separates an epoch into many sub-epochs and encodes the sub-epoch of the cache line's final reference

* P-OPT checks the MSB of the cache line's Rereference Matrix entry for the current epoch (currEntry) (Line 5); a sketch of this decode follows the list.
    * if currEntry[7] (MSB) = 1, return currEntry[6:0], the distance to the next reference epoch
    * if MSB = 0, check whether execution has already passed the cache line's final reference in this epoch
        * if not, return 0 (the cache line will be reused within this epoch)
        * if so, return the distance between the current epoch and the next reference epoch
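
A minimal sketch of this decode, reconstructed from the description above (the fallback to the next-epoch column is our assumption about where the post-final-reference distance comes from):

```python
def next_ref(curr_entry, curr_sub_epoch, next_epoch_entry):
    if (curr_entry >> 7) & 1:           # MSB = 1: not accessed this epoch
        return curr_entry & 0x7F        # lower 7 bits: distance to next reference
    last_sub_epoch = curr_entry & 0x7F  # MSB = 0: lower bits give final sub-epoch
    if curr_sub_epoch <= last_sub_epoch:
        return 0                        # final reference still ahead: reused soon
    # Past the final intra-epoch access: fall back to the next-epoch column
    # (assumed source of the post-final distance).
    return next_epoch_entry
```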

* Figure 7 shows the reduction in LLC misses achieved by the different policies relative to DRRIP.
* P-OPT-INTER+INTRA achieves LLC miss reductions close to those of the idealized T-OPT.
# 5. P-OPT ARCHITECTURE
## 5.1 Storing Next References in LLC

* Figure 8 shows some LLC ways reserved for current (orange) and next (blue) epoch columns of the Rereference Matrix
## 5.2 Finding a Replacement Candidate

* P-OPT maintains two registers – irreg_base and irreg_bound – to track the address range of a graph kernel’s irregData
* During cache replacement, P-OPT compares the address in the tag portion of each way in the eviction set against irreg_base and irreg_bound registers to determine if the way contains an irregData cache line
* P-OPT uses a **next-ref engine** (a finite state machine) to compute the next reference of each non-reserved way in the eviction set using Algorithm 2, and then stores those next references in the next-ref buffer.
* P-OPT maintains **next-ref buffers** at the LLC to keep track of the next references of each way in the eviction set
++Steps:++
1. The next-ref engine computes next references by accessing the Rereference Matrix entries for each irregData line in the eviction set.
2. The next-ref engine then stores the computed next references in the next-ref buffer.
3. The next-ref engine then searches the next-ref buffer for the way with the **largest (i.e., furthest-in-the-future)** next reference value to evict, settling ties with a baseline replacement policy (P-OPT uses DRRIP). A software model of these steps follows.
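
A minimal software model of the three steps above (names are ours, and a generic `baseline_rank` stands in for DRRIP):

```python
def pick_victim(eviction_set, next_ref_of, baseline_rank):
    # Steps 1-2: compute each way's next reference into the next-ref buffer.
    next_ref_buffer = {way: next_ref_of(way) for way in eviction_set}
    # Step 3: evict the way whose reuse is furthest in the future.
    furthest = max(next_ref_buffer.values())
    tied = [w for w, nr in next_ref_buffer.items() if nr == furthest]
    return min(tied, key=baseline_rank)   # tie-break with the baseline policy
```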
## 5.3 Streaming Rereference Matrix columns into the LLC
* To transfer Rereference Matrix entries from memory to LLC, P-OPT uses a dedicated hardware unit called the **streaming engine**
* The programmer invokes the streaming engine at every epoch boundary using a new stream_nextrefs instruction. The instruction swaps the pointers to the current and next epoch columns (Figure 8) and streams the upcoming epoch's column of the Rereference Matrix into its reserved LLC locations.
* Doing so does not impose a performance penalty: epoch boundaries are infrequent, so the streaming engine's latency is easily amortized. A conceptual software stand-in for the instruction follows.
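
A conceptual stand-in for `stream_nextrefs` (the real mechanism is a hardware streaming engine; this only illustrates the pointer swap and column fetch):

```python
def stream_nextrefs(cols, rereference_matrix, upcoming_epoch):
    # Swap roles: last epoch's "next" column becomes the new "current" one.
    cols["curr"], cols["next"] = cols["next"], cols["curr"]
    # Stream the upcoming epoch's column into the reserved LLC ways
    # (modeled here as overwriting the "next" slot).
    cols["next"] = rereference_matrix[upcoming_epoch]
    return cols
```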
## 5.4 Generalizing P-OPT
* With simple extensions, P-OPT supports multi-threading, multiple irregularly-accessed data streams, and context switches.
### Support Parallel multi-threaded Execution
* In a multi-threaded execution, multiple active vertices are being traversed at a time (i.e., a unique currDstId for each thread) and P-OPT needs to select one of the active vertices for next reference computation (Algorithm 2; Lines 8-12).
* To guarantee that all threads always process vertices in the same epoch, P-OPT requires slight modification of the application to **execute epochs serially** (vertices within an epoch are executed in parallel).
* Due to the relatively small number of epochs (256 in the default 8-bit P-OPT configuration), each epoch consists of a large number of vertices, and restricting parallelism to within an epoch does not significantly impact performance (see the sketch below).
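
A minimal sketch of the epoch-serial schedule, with a thread pool standing in for the paper's runtime (names are ours):

```python
from concurrent.futures import ThreadPoolExecutor

def run_epoch_serial(num_vertices, num_epochs, process_vertex):
    verts_per_epoch = (num_vertices + num_epochs - 1) // num_epochs
    with ThreadPoolExecutor() as pool:
        for e in range(num_epochs):               # epochs execute serially
            lo = e * verts_per_epoch
            hi = min(lo + verts_per_epoch, num_vertices)
            # All threads stay inside the same epoch at any given time.
            list(pool.map(process_vertex, range(lo, hi)))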
### Handling Multiple Irregular Data Streams
P-OPT can support multiple irregular data structures using three architecture changes.
1. P-OPT holds a **separate Rereference Matrix for each irregular data structure**
2. P-OPT reserves the minimum number of ways in the LLC to hold the Rereference Matrix data for all the different irregular data structures. The system maintains a separate set-base and way-base register for each irregular data structure.
3. P-OPT **maintains an irreg_base and irreg_bound register for each irregular data structure** to use the right Rereference Matrix data corresponding to each data structure.
### Virtualization
* The Rereference Matrix in P-OPT only tracks reuse among graph application data.
* If applications share LLC, **P-OPT may unfairly prefer caching graph data over other data.**
* To remain fair, we assume per-process way-partitioning, with P-OPT replacing data only in the LLC ways designated for the graph process.
* P-OPT supports context switches, by saving its registers (set-base, way-base, irreg_base, irreg_bound, currVertex) with the process context
* On resumption, P-OPT invokes the streaming engine to refetch Rereference Matrix contents into reserved LLC ways
## 5.5 Implementation Complexity
* P-OPT has low architectural complexity.
* P-OPT only adds next-ref buffers to temporarily store next references during replacement
* The size of the next-ref buffer state is bounded by the maximum cache-level parallelism at the LLC.
* The next-ref engine is a simple FSM that only needs support for integer division and basic bit manipulation.
# 6. Evaluation
## 6.1 P-OPT Improves Performance

* P-OPT outperforms DRRIP across the board, with an average speedup of 22% and an LLC miss reduction of 24%.
## 6.2 P-OPT Scales with Graph Size

* We evaluate a P-OPT variant, P-OPT-Single-Epoch (P-OPT-SE), that **computes next references using only the current epoch column of the Rereference Matrix**.
* P-OPT-SE encodes information about the next epoch within the current epoch column by repurposing the **second most significant bit** of an entry to track if the cache line is accessed in the next epoch (Figure 6).
* P-OPT-SE stores only the current epoch column in LLC. However, the reduced cache footprint in P-OPT-SE comes at the expense of **reduced next reference quality**.
* With two bits of each entry used as flags (the MSB and the second MSB), only 6 bits remain for next-reference distances, halving the tracked range from 128 to 64; P-OPT-SE is forced to use coarser quantization (see the encoding sketch after this list).
* For graphs with fewer than 32 million vertices, P-OPT has better LLC locality than P-OPT-SE.
    * P-OPT's richer replacement information (i.e., current and next epoch columns) overshadows the reduction in effective LLC capacity.
* However, for larger graphs, P-OPT-SE has better locality because P-OPT's two resident columns reduce effective LLC capacity too much.
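
A minimal sketch of a P-OPT-SE entry encoding consistent with the description above (the exact bit layout is our assumption; Figure 6 in the paper is authoritative):

```python
def encode_se(accessed_this_epoch, accessed_next_epoch, payload):
    entry = 0 if accessed_this_epoch else (1 << 7)   # MSB, as in full P-OPT
    entry |= (1 if accessed_next_epoch else 0) << 6  # repurposed second MSB
    entry |= payload & 0x3F                          # only 6 payload bits remain,
    return entry                                     # halving the range to 64
```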
## 6.3 P-OPT compared to prior optimizations
### Graph-agnostic improvements

* Unlike prior work, **P-OPT is graph-agnostic, not reliant on specific structure or vertex ordering of a graph**
#### GRASP
* GRASP is a replacement policy for graphs with very skewed degree distributions. GRASP expects a pre-processed input in which vertices have been reordered using Degree-Based Grouping (DBG).
* Figure 12(a) shows locality improvements from GRASP and P-OPT for PageRank on DBG-ordered graphs. P-OPT outperforms GRASP in three ways:
1. GRASP works well for graphs with skewed degree distributions but is less effective for other inputs; P-OPT is agnostic to graph structure, offering consistent improvements.
2. Even for skewed graphs, P-OPT achieves higher LLC miss reduction than GRASP.
3. GRASP requires the input graph to be reordered (using DBG), whereas P-OPT is applicable across any vertex ordering.
#### HATS-BDFS
* HATS-BDFS is a dynamic vertex-scheduling architecture: it runs a hardware Bounded Depth-First Search (BDFS) to schedule vertices, yielding locality improvements on graphs with community structure.
* Figure 12(b) compares P-OPT on the standard vertex schedule (“Vertex Ordered” per HATS) against HATS-BDFS. The data shows that HATS-BDFS’s improvements are sensitive to graph structure.
* For graphs with community structure, BDFS offers locality improvements, even outpacing T-OPT because BDFS improves locality at all cache levels.
* However, for graphs without community structure (even power-law graphs such as DBP and KRON), BDFS increases LLC misses.
* In contrast, P-OPT offers consistent LLC locality improvements, leading to a higher mean LLC miss reduction compared to HATS-BDFS
## 6.4 Sensitivity Studies
### Sensitivity to quantization level

* Figure 15 shows P-OPT’s performance with 4-bit, 8-bit, and 16-bit quantization in the Rereference Matrix
* On average, we observe that for P-OPT with 4b, 8b, and 16b quantization of rereferences, 41%, 12%, and 0% of all LLC replacements respectively result in a tie
* The already low percentage of replacement ties at 8b quantization explains why P-OPT sees little benefit with higher precision.
### Sensitivity to LLC parameters

* The benefit offered by P-OPT over DRRIP increases with LLC capacity because the fraction of LLC consumed for Rereference Matrix columns reduces.
* P-OPT also offers higher miss reduction with higher LLC associativity. As associativity increases, P-OPT has more options for replacement and makes a better choice by considering next references of all ways in the eviction set
### Preprocessing cost of P-OPT

* The main performance results (Figure 10) omitted preprocessing costs because the Rereference Matrix is algorithm agnostic and needs to be created only once for a graph.
* On average, constructing the Rereference Matrix accounts for 19.8% of PageRank runtime
* Since the Rereference Matrix is algorithm agnostic, the preprocessing cost of P-OPT can be easily amortized by reusing the Rereference Matrix across multiple applications running on the same graph