# Graph-Waving architecture: Efficient execution of graph applications on GPUs

###### tags: `GPUs`
###### paper source: Journal of Parallel and Distributed Computing
###### paper link: [link](https://www.sciencedirect.com/science/article/pii/S0743731520303889)
###### keywords: GPGPU, GPU, microarchitecture, graph application, scalar waving, SIMD efficiency

## Abstract

* SIMD-group mapping
* Scalar-Waving
* narrow SIMD-group width with a clustered issue approach and re-use of instructions in the front-end

## 1. Introduction

* leverages the idea of vertex-to-warp mapping
* scalarized vertex-centric parallel graph computing on GPUs
* introduces the Graph-Waving (GW) architecture to better support scalarized vertex-centric graph computing while maintaining the efficiency of SIMD execution for regular GPU applications
* in scalarized vertex-centric parallel graph computing, programmers are allowed to annotate redundant computation as scalar operations to fully utilize Scalar-Execution
* Scalar-Waving (SW): dynamically groups scalar operations possessing the same PC and executes them together as a Scalar-wave on the SIMD pipelines for improved utilization of SIMD processing elements
* employs narrow SIMD-units with clustered issue logic to keep the benefits of a larger issue width for regular applications
* leverages instruction re-use to eliminate redundant fetching and decoding of instructions across warps and across loop iterations

## 2. GPU execution model and baseline GPU microarchitecture

* the baseline is similar to the NVIDIA Tesla C2050 GPU, which is based on NVIDIA's Fermi architecture

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr1_lrg.jpg)

* at each cycle, two schedulers select and dispatch instructions for two warps

## 3. Typical vertex-centric parallel graph computing on GPUs

* Pannotia's implementation
* Compressed Sparse Row (CSR) format
* **vertex-centric parallel computation** where a vertex-to-SIMD-thread mapping is applied
* called the **edge-list-expanding computation pattern** in this paper (a minimal CUDA sketch of this pattern follows Section 4.2 below)

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-fx1001_lrg.jpg)

## 4. Inefficiencies of vertex-centric graph parallel computing on GPUs

* performance penalties
* having many vertices with divergent degrees

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr2_lrg.jpg)
<center>Fig. 2</center>

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr4_lrg.jpg)
<center>Fig. 4</center>

### 4.1. Intra-warp load imbalance and under-utilization of SIMD units

* intra-warp load imbalance
* Fig. 2(e)
* Fig. 4(a)

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr3_lrg.jpg)
<center>Fig. 3</center>

### 4.2. Non-coalesced memory accesses

* non-coalesced memory accesses & memory divergence
* Fig. 2(f)
* Fig. 4(b)
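To make Sections 3 and 4 concrete, here is a minimal CUDA sketch of the vertex-to-SIMD-thread CSR pattern described above. It is not the paper's or Pannotia's actual code; the kernel name, parameters, and the gather-style loop body are illustrative assumptions. One thread walks one vertex's edge list, so the loop trip count follows the vertex degree and neighbouring lanes read unrelated parts of the edge arrays.

```cuda
// Minimal sketch (not the paper's or Pannotia's code): typical vertex-centric
// pattern with one vertex mapped to one SIMD-thread on a CSR graph.
// row_ptr[num_vertices + 1] and col_idx[num_edges] hold the CSR structure;
// the gather of val_in into val_out is an illustrative placeholder body.
__global__ void vertex_per_thread_gather(const int *row_ptr,
                                         const int *col_idx,
                                         const float *val_in,
                                         float *val_out,
                                         int num_vertices)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // one vertex per SIMD-thread
    if (v >= num_vertices) return;

    float acc = 0.0f;
    // "Edge-list-expanding" loop: the trip count equals degree(v), so lanes of
    // the same warp whose vertices have divergent degrees finish at very
    // different times (intra-warp load imbalance, Section 4.1).
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
        // Adjacent lanes read edge lists that live in unrelated regions of
        // col_idx/val_in, so these loads are generally non-coalesced (Section 4.2).
        acc += val_in[col_idx[e]];
    }
    val_out[v] = acc;
}
// Example launch: vertex_per_thread_gather<<<(num_vertices + 255) / 256, 256>>>(...)
```

The degree-dependent trip count and the scattered `col_idx` reads are exactly the two inefficiencies Section 4 quantifies.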
## 5. Scalarized vertex-centric parallel graph computing

* scalarized vertex-centric parallel graph computing
    * (1) maps a vertex to a warp
    * (2) utilizes Scalar-Execution (SE)
* shortcoming addressed: redundant computations within the warps
* steps
    * (1) the programmer associates a vertex with a warp and constructs the algorithm such that the work on edges gets distributed over the SIMD-threads of the warp
    * (2) the programmer annotates scalar code when implementing the algorithm
    * (3) the compiler marks each scalar instruction based on the programmer's annotations and on static scalarization analysis
    * (4) at execution time, the hardware utilizes Scalar-Execution (SE) when the instructions to execute are scalar instructions

### 5.1. Vertex-to-warp mapping

* advantages:
    * (1) SIMD utilization
    * (2) memory access coalescing

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr5_lrg.jpg)

### 5.2. Scalar execution (SE) to eliminate redundant computations

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr6_lrg.jpg)
<center>redundant computations</center>

* static identification of scalar operations using programmer annotation and compiler analysis
* a few changes to the Instruction Set Architecture (ISA)
* a number of modifications to the baseline multi-threaded SIMD processor to support SE

#### 5.2.1. Programming for scalarized vertex-centric parallel graph computing

* warps become an explicit part of the programming model
* two new intrinsics (both appear in the sketch after Section 6.3)
    * get_warp_id()
    * get_simd_width()

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-fx1002_lrg.jpg)

#### 5.2.2. Identification of scalar operations

* compiler annotations
    * user annotation
* compiler analysis
    * divergence analysis

#### 5.2.3. Architectural support for scalar execution on SIMD units

* includes an extra bit in
    * the instruction buffer
    * the ISA

## 6. Graph-Waving architecture for scalarized vertex-centric parallel graph computing

* (1) uses narrow SIMD-widths with a clustered issue approach
* (2) employs decoded-instruction re-use
* (3) extends and employs the Scalar-Waving (SW) architecture

### 6.1. Narrow SIMDs with clustered issue

* supports 8-wide warps
* organizes the available SIMD processing elements into 4 groups; each group is 16-wide and executes a pair of an odd- and an even-numbered warp together
* at each cycle, 4 scalar/regular warp instructions are issued on the SIMD-units

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr7_lrg.jpg)

### 6.2. Instruction re-use in the front-end

* the increased number of warps is likely to create pressure on the front-end

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr8_lrg.jpg)

* re-uses already fetched and decoded instructions
* requires modifications to the fetch unit and a re-organization of the instruction buffer
* instruction storage must be decoupled from warps
* Fig. 7(b)

### 6.3. Efficient execution of scalar instructions

* multiple warps executing the same scalar instruction (possessing the same PC) can be grouped into a Scalar-wave and executed together in SIMD fashion on the SIMD pipelines
* scalar values are stored in a modified register file
* (1) modification of the instruction buffer to differentiate scalar instructions
* (2) addition of a Scalar-Wave Formation Unit (SWFU), a new unit that includes a Scalar-Wave Status Table (SWST) and manages the formation of scalar waves
* (3) modifications to the register file to provide efficient storage of scalar values

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr9_lrg.jpg)
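As a rough illustration of Sections 5.1 and 5.2.1, the sketch below maps one vertex to one warp and strides its edge list over the warp's lanes. It is written in plain CUDA: `SIMD_WIDTH` and the index arithmetic stand in for the paper's `get_simd_width()` / `get_warp_id()` intrinsics, the `/* @scalar */` comments only mark where the programmer annotation would go (they are not real syntax), and the gather/reduction body is an assumption rather than the paper's code.

```cuda
// Minimal sketch of the scalarized vertex-to-warp mapping (Sections 5.1-5.2.1).
// Assumptions: plain CUDA, so SIMD_WIDTH and the index arithmetic stand in for
// the paper's get_simd_width() / get_warp_id() intrinsics, and the /* @scalar */
// comments only indicate where the programmer annotation would go.
// Launch with blockDim.x a multiple of 32 and one warp per vertex.
#define SIMD_WIDTH 32   // stand-in for get_simd_width()

__global__ void vertex_per_warp_gather(const int *row_ptr,
                                       const int *col_idx,
                                       const float *val_in,
                                       float *val_out,
                                       int num_vertices)
{
    int lane = threadIdx.x % SIMD_WIDTH;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / SIMD_WIDTH; // ~ get_warp_id()
    int v = warp;                             // one vertex per warp
    if (v >= num_vertices) return;            // whole warp exits together

    /* @scalar */ int begin = row_ptr[v];     // same value in every lane:
    /* @scalar */ int end   = row_ptr[v + 1]; // candidates for Scalar-Execution

    float acc = 0.0f;
    // Edges of v are strided over the warp's lanes: per-lane trip counts differ
    // by at most one, and adjacent lanes touch adjacent col_idx entries
    // (better SIMD utilization and coalescing, Section 5.1).
    for (int e = begin + lane; e < end; e += SIMD_WIDTH)
        acc += val_in[col_idx[e]];

    // Warp-level reduction of the per-lane partial sums.
    for (int offset = SIMD_WIDTH / 2; offset > 0; offset /= 2)
        acc += __shfl_down_sync(0xffffffffu, acc, offset);

    if (lane == 0) val_out[v] = acc;
}
```

Under Scalar-Execution, loads like `begin`/`end` above would be issued once per warp instead of once per lane, and Section 6.3's Scalar-Waving goes further by batching the same-PC scalar instruction of several warps into a single Scalar-wave on the SIMD pipeline.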
## 7. Hardware cost and power consumption

The authors focus specifically on the performance aspects of the GW architecture and leave a detailed study of area estimation and power consumption as future work.

## 8. Evaluation

### 8.1. Experimental setup

* GPGPU-Sim version 3.x
* benchmarks:

![](https://i.imgur.com/kNI2Iwi.png)

### 8.2. Performance of the Graph-Waving architecture

* REGULAR benchmarks: 9% performance improvement on average (geometric mean)
* IRREGULAR benchmarks: 17%
* graph applications: 4.41x

![](https://ars.els-cdn.com/content/image/1-s2.0-S0743731520303889-gr10_lrg.jpg)

## 10. Conclusions and future work

* GW uses a narrow SIMD-width with a clustered issue approach
* introduces extensions to the baseline GPU architecture to re-use instructions in the front-end
* eliminates redundant computations using Scalar-Waving

## Compare to our work

* [Our Work](https://docs.google.com/presentation/d/1Thz_tu13iYqtAW99xmj1XgWEpikkIbobYvLYku2fkGw/edit?usp=sharing)
* if the degree of most of the nodes is less than 8, then GW may not fully utilize the SIMD lanes (see the utilization sketch below)
* hardware modification
    * GW
        * scoreboard
        * instruction buffer/storage, scheduler
        * issue unit
        * adds a Scalar-Wave Formation Unit (SWFU)
    * Ours
        * operand collector
        * adds an Index Unit
* software modification
    * GW
        * vertex-to-warp mapping
        * user annotation
    * Ours
        * **working list** or **None**
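To make the degree-below-8 point concrete, here is a small host-side back-of-the-envelope helper (ours, not from the paper; the degree mix in `main()` is made up). It estimates how busy an 8-wide GW warp's lanes stay in the vertex-to-warp edge loop for a given degree distribution, ignoring the scalar portion of the loop, memory stalls, and GW's clustered issue.

```cuda
// Back-of-the-envelope helper (ours, not from the paper); host-side code only.
// For a given degree distribution it estimates the fraction of an 8-wide GW
// warp's lanes that the vertex-to-warp edge loop keeps busy.
#include <cstdio>
#include <vector>

double estimated_lane_utilization(const std::vector<int> &degree, int width = 8)
{
    long long active = 0, issued = 0;
    for (int d : degree) {
        if (d == 0) continue;                        // isolated vertex: no edge loop
        long long iters = (d + width - 1) / width;   // edge-loop iterations of this warp
        active += d;                                 // lane-cycles doing useful work
        issued += iters * width;                     // lane-cycles the warp occupies
    }
    return issued ? static_cast<double>(active) / issued : 0.0;
}

int main()
{
    // Made-up example: most vertices have degree 3, a few are high-degree hubs.
    std::vector<int> degree(1000, 3);
    for (int i = 0; i < 10; ++i) degree[i] = 200;
    std::printf("estimated 8-wide lane utilization: %.2f\n",
                estimated_lane_utilization(degree));  // ~0.50 for this mix
    return 0;
}
```

With this made-up mix the estimate comes out around 0.5, i.e. roughly half of the 8-wide lanes idle, which is the low-degree regime the first comparison bullet above points at.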