# GraphPEG: Accelerating Graph Processing on GPUs
###### tags: `GPUs`
##### origin: ACM Transactions on Architecture and Code Optimization
##### paper: [link](https://dl.acm.org/doi/pdf/10.1145/3450440)
## Introduction
* problems of parallelizing graph processing
    * control flow divergence
    * non-uniform work distribution
    * memory divergence
* problems of software-based solutions
    * may incur high load-balancing overhead
    * graph datasets do not admit a one-size-fits-all characterization
    * memory divergence is left unaddressed
    * unfriendly to developers who are unfamiliar with GPU programming
* contributions
* analyze the limitations of software optimization strategies
* propose GraphPEG, combining **automatic edge gathering** and **fine-grain work distribution**
* detailed evaluation of GraphPEG
## BACKGROUND AND MOTIVATION
* characteristics of graph applications
    * the data-driven approach maintains an **active list**
    * control flow divergence
    * load imbalance
* problems of software-based load-balancing strategies
    * complicated code
    * may perform worse when the graphs are not highly skewed
    * no single strategy works for all types of datasets
    * memory divergence is not addressed
* graph applications have two levels of parallelism
    * vertices in the vertex-list
    * edges in the edge frontier (EF)
* limitations of the SIMT execution model
    * the out-degree of the vertex accessed at the first level determines the number of iterations executed at the second level (see the sketch after this list)
    * dynamically detecting load imbalance and exchanging work between threads is difficult for software strategies due to the expensive global inter-thread communication on GPUs
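To make the SIMT limitation concrete, below is a minimal sketch (not from the paper) of a naive vertex-parallel kernel; the BFS-style `labels` array and kernel name are illustrative assumptions:

```cuda
// Illustrative sketch of a naive data-driven kernel: each thread expands
// one active vertex, so the inner loop's trip count is that vertex's
// out-degree. Threads in a warp with short adjacency lists idle while one
// thread walks a long list -- control flow divergence and load imbalance
// at the second level of parallelism.
__global__ void naive_expand(const int *vertex_array,  // CSR row offsets
                             const int *edge_array,    // CSR neighbor ids
                             const int *active, int num_active,
                             int *labels, int level) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_active) return;
    int v = active[tid];
    // Second level of parallelism is left serial: iteration count varies per thread.
    for (int e = vertex_array[v]; e < vertex_array[v + 1]; ++e) {
        int nbr = edge_array[e];
        if (labels[nbr] == -1)        // unvisited neighbor
            labels[nbr] = level + 1;  // (a real BFS would also build the next frontier)
    }
}
```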
* Gunrock improves on the simple data-driven implementation, but its load balancing itself adds overhead
* datasets have an important impact on the load-balancing overhead of Gunrock
    * the load-balancing overheads for datasets with small average out-degrees, such as NACA and USAC, are relatively high
* hardware scheme
* gather edges automatically and directly supply threads with the edges of the EF in a **load-balanced manner**
* benefits
    * the kernel code only needs to process edges in the EF
    * gathers enough edges and distributes them evenly to fully utilize the GPU computing resources and eliminate control flow divergence
    * gathers edge data as early as possible to improve the utilization of memory bandwidth, thereby alleviating memory divergence
    * frees graph application developers from choosing between different strategies, because the hardware scheme can adapt to various graph datasets
## GRAPHPEG
### Overview
1. it takes advantage of the common data access pattern in graph traversal algorithms and **automatically gathers edge** data on the basis of the neighbor-list indices of active vertices.
2. the gathered edges are **evenly distributed** to threads to achieve dynamic load balancing.

* composition
    1. the **edge gathering unit**: gathers edge data from global memory into shared memory on the basis of the indices gathered from adjacency lists
    2. the **vertex coordination unit**: controls refilling/spilling of the vertices in a vertex-list into/off shared memory
    3. the **edge frontier (EF) exchange unit**: balances the edges of EFs between different SMs
    4. the central data structure, the **adjacency list working table (ALWT)**: tracks the indices and edge-gathering state of active vertices
    5. GraphPEG leverages the GPU's **control processor** for global workload balancing
### Programming Interface
* replace the graph traversal code with the **intrinsic functions** of GraphPEG to process edges

* GraphPEG works on **vertex-lists** and **edge frontiers**
* two working modes:
    * edge gathering mode
        * primary mode
        * automatically gathers edge data
        * the **__pop_edge** intrinsic retrieves gathered edges
        * the **__push_vert** intrinsic pushes active vertices to the output vertex-list
    * vertex-list mode
        * complementary mode
        * filters out certain vertices of a vertex-list on the basis of an application-specific criterion
        * the **__pop_vert** intrinsic gets vertices from the input vertex-list
        * the **__push_vert** intrinsic pushes vertices to the output vertex-list
---

* the resulting kernel is simpler than the pseudocode kernel in Figure 1
* GraphPEG kernels follow the persistent thread model (a hypothetical kernel sketch follows below)
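As a hedged illustration of the interface, the sketch below imagines kernels for both working modes under the persistent thread model; only the intrinsic names come from the paper, while all signatures, return conventions, and the BFS/filter payloads are assumptions:

```cuda
// Prototypes the GraphPEG hardware would supply (assumed signatures):
__device__ bool __pop_edge(int *src, int *dst); // false when the EF is drained
__device__ bool __pop_vert(int *v);             // false when the input list is empty
__device__ void __push_vert(int v);             // append v to the output vertex-list

// Edge-gathering mode: threads only see load-balanced edges; there is no
// per-thread adjacency-list loop, so no divergence from skewed out-degrees.
__global__ void bfs_expand(int *labels, int level) {
    int src, dst;
    while (__pop_edge(&src, &dst)) {
        if (labels[dst] == -1) {
            labels[dst] = level + 1;
            __push_vert(dst);   // dst becomes active in the next iteration
        }
    }
}

// Vertex-list mode: filter vertices by an application-specific criterion.
__global__ void filter_converged(const float *delta, float eps) {
    int v;
    while (__pop_vert(&v)) {
        if (delta[v] > eps)     // keep only vertices that have not converged
            __push_vert(v);
    }
}
```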
### Workflow
#### Automatic Edge Gathering and Work Distribution
* vertices and edges are usually represented with two one-dimensional arrays, in compressed sparse row (CSR) or compressed sparse column (CSC) format:
    * the **vertex array**: its ith entry contains an index into the **edge array**; that index points to the start of the adjacency list holding the neighbors of vertex i
    * the **edge array**: its entries store vertex identifiers that can in turn index the **vertex array**
    * for weighted graphs, an extra array stores the **weight** of each edge
    * (see the layout sketch at the end of this list)
* push/pull modes
    * push mode
        * the application should use the CSR graph format, where the vertex array contains source vertices
    * pull mode
        * the application should use the CSC format, where the vertex array contains destination vertices
* input and output **vertex-lists**
* each GraphPEG unit in an SM maintains its local input and output vertex-lists to prevent memory contention
* this is transparent to the software interface
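For concreteness, here is a minimal sketch of the CSR layout on a small example graph (the graph and the array names are illustrative, not from the paper):

```cuda
// CSR layout for a 4-vertex example graph: edges 0->1, 0->2, 1->2, 2->3, 3->0.
// vertex_array[i] .. vertex_array[i+1]-1 spans vertex i's adjacency list in
// edge_array. CSC (pull mode) uses the same layout with in-edges instead.
int vertex_array[] = {0, 2, 3, 4, 5};  // |V|+1 entries, indices into edge_array
int edge_array[]   = {1, 2, 2, 3, 0};  // |E| entries, neighbor vertex ids
// float weight_array[] = {...};       // parallel to edge_array for weighted graphs

// Enumerating the neighbors of vertex i:
//   for (int e = vertex_array[i]; e < vertex_array[i + 1]; ++e)
//       use(edge_array[e]);
```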
---

1. the adjacency list working table (ALWT) is used to gather edge data (a struct sketch follows this numbered list)
    * each entry of the ALWT stores the status data of an active vertex
        * Vertex: the base vertex
            * for the CSR graph format, the base vertex is the **source vertex** of an edge
            * for the CSC graph format, the base vertex is the **destination vertex** of an edge
        * SIndex: the start index of the adjacency list in the edge array
        * EIndex: the end index of the adjacency list in the edge array
        * Status: the current working status
            * fetching index (FI)
            * ready (RD)
            * fetching edge (FE)
            * invalid (IVD)
2. if an IVD entry exists in the ALWT and the vertex-list is not empty, the edge gathering unit changes the state of an IVD entry to FI
3. it then sends a memory request to the LD/ST unit of the SM to fetch the **start** and **end** indices from the vertex array
4. if an RD entry exists in the ALWT and no FE entry exists, the edge gathering unit changes an RD entry to the FE state and issues memory requests to fetch edge data from the edge array and the edge weight array (if necessary) on the basis of the start and end indices
5. the start index of the entry increases by the number of edges fetched each time; the fetched edge data is filled into shared memory, provided it is not full
6. when the number of fetched edges in shared memory is equal to or larger than the warp size, or all edges have been gathered, the instructions corresponding to the **__pop_edge** intrinsic function can be issued into the pipeline to retrieve edges from shared memory
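A software sketch of one ALWT entry may help here; the field names and the four states come from the paper, while the C++ types are assumptions (the real table is a hardware structure, not software state):

```cuda
enum AlwtStatus {
    IVD,  // invalid: slot free, can be refilled from the vertex-list
    FI,   // fetching index: start/end indices requested from the vertex array
    RD,   // ready: indices present, edge fetching not yet started
    FE    // fetching edge: edge data being streamed into shared memory
};

struct AlwtEntry {
    int vertex;      // base vertex (source in CSR push, destination in CSC pull)
    int s_index;     // SIndex: next edge-array index; advances as edges are fetched
    int e_index;     // EIndex: end of the adjacency list in the edge array
    AlwtStatus status;
};
// An entry is finished when s_index reaches e_index; it then returns to IVD.
```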
* vertex coordination unit
* monitors the sizes of the vertex-lists in shared memory
* vertex-list working mode
* **edge gathering unit** is disabled
* **ALWT** is disabled
#### Lightweight EF Load Balancing
* the workload is balanced within a thread block; however, thread blocks scheduled onto different SMs may still suffer from load imbalance
* both **vertex-list** and **EF** load balancing should be considered; however, experiments show that the latter is more important
* GraphPEG uses an ALWT entry exchange scheme with simple load-balancing policies to balance edges between SMs
* the **control processor** on a contemporary GPU assigns kernels to SMs and manages hardware resources
* the added control registers
    * EF size register (REFS)
        * the total number of edges to be processed in an SM
    * balancing control register (RBC)
    * balancing state register (RBS)
    * two balance working registers (RBW0 and RBW1)
* these registers act as the interaction interface between the **EF exchange unit** of GraphPEG and the **control processor**
* the control processor periodically compares the REFS values across SMs; if the difference exceeds a predefined threshold (512), it sets the RBC states:
    * edge donating (ED) state for the more loaded SM
    * edge receiving standby (ERS) state for the less loaded SM
* for EF load balancing, only one ALWT entry is transferred per load-balancing interval to simplify the hardware design
* to respond to the balancing request, the donor SM#0 picks the entry with the largest remaining neighbor list in its ALWT and splits it by advancing the entry's start index, yielding a donated index range
* the **control processor** copies the two index values into registers RBW0 and RBW1 of SM#1 and changes the state of register RBC of SM#1 to edge receiving (ER)
* the **EF exchange unit** on SM#1 inserts a new entry into its ALWT, with its start and end indices taken from registers RBW0 and RBW1, respectively (see the behavioral sketch below)
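The following host-side sketch approximates the control processor's balancing behavior; the 512-edge threshold and the register/state names are from the paper, while the struct layout and the polling loop are assumptions about the hardware's behavior:

```cuda
enum RbcState { IDLE, ED, ERS, ER };  // edge donating / receiving standby / receiving

struct SMRegs {
    unsigned refs;        // REFS: edges remaining in this SM's EF
    RbcState rbc;         // RBC: balancing control state
    unsigned rbw0, rbw1;  // RBW0/RBW1: donated [start, end) edge-array indices
};

void balance_check(SMRegs *sm, int num_sms) {
    int donor = 0, receiver = 0;
    for (int i = 1; i < num_sms; ++i) {  // find the most / least loaded SMs
        if (sm[i].refs > sm[donor].refs)    donor = i;
        if (sm[i].refs < sm[receiver].refs) receiver = i;
    }
    if (sm[donor].refs - sm[receiver].refs > 512) {
        sm[donor].rbc    = ED;   // ask the donor to split off part of an ALWT entry
        sm[receiver].rbc = ERS;  // put the receiver on standby
        // Once the donor publishes the donated index range, the control
        // processor copies it into the receiver's RBW0/RBW1 and sets the
        // receiver's RBC to ER; the receiver's EF exchange unit then inserts
        // a new ALWT entry from those indices.
    }
}
```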

* for Turing and Ampere GPUs, EF load balancing should be applied to GPCs rather than individual SMs; GraphPEG in each GPC is then responsible for distributing the edge workload evenly among the SMs within that GPC
## EXPERIMENTS AND EVALUATION
### Experiments


### Evaluation

## CONCLUSIONS AND FUTURE WORK
### Conclusion
* hardware architecture
* automatic edge processing
* low overhead inter-SM load balancing
### Future Work
* accelerate CPU-GPU and multi-GPU scenarios
* combining graph format transformation techniques such as Tigr and GraphCage with GraphPEG