# Graph-Waving architecture: Efficient execution of graph applications on GPUs
###### tags: `GPUs`
###### paper source: Journal of Parallel and Distributed Computing
###### paper link: [link](https://www.sciencedirect.com/science/article/pii/S0743731520303889)
###### keywords: GPGPU, GPU, microarchitecture, Graph application, Scalar waving, SIMD efficiency
## Abstract
* SIMD-group mapping
* Scalar-Waving
* narrow SIMD-group width with a clustered issue approach and reuse of instructions in the front-end
## 1. Introduction
* leverages the idea of vertex-to-warp mapping
* scalarized vertex-centric parallel graph computing on GPUs
* introduces the Graph-Waving (GW) architecture to better support scalarized vertex-centric graph computing while maintaining the efficiency of SIMD execution for regular GPU applications
* in scalarized vertex-centric parallel graph computing, programmers can annotate redundant computations as scalar operations so that Scalar-Execution is fully utilized
* Scalar-Waving (SW) dynamically groups scalar operations possessing the same PC and executes them together as a Scalar-wave on the SIMD pipelines, improving the utilization of the SIMD processing elements
* employs narrow SIMD-units but with a clustered issue logic to maintain the benefits of larger issue width with regular applications
* leverages instruction re-use to eliminate redundant fetching and decoding of instructions across the warps and the iterations of loops
## 2. GPU execution model and baseline GPU microarchitecture
* the baseline is similar to the NVIDIA Tesla C2050 GPU, which is based on NVIDIA's Fermi architecture

* At each cycle, two schedulers select and dispatch instructions for two warps.
## 3. Typical vertex-centric parallel graph computing on GPUs
* Pannotia’s implementation
* Compressed Sparse Row (CSR) format
* **vertex-centric parallel computation** where a vertex-to-SIMD-thread mapping is applied
* called the **edge-list-expanding computation pattern** in this paper
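For concreteness, a minimal sketch of this typical pattern as I read it from the notes; the CSR array names and the relaxation kernel below are illustrative placeholders, not Pannotia's actual code.

```cuda
// Illustrative sketch only (not Pannotia's code): CSR storage and the typical
// vertex-to-SIMD-thread mapping. Each thread owns one vertex and walks that
// vertex's neighbor list through the CSR row_offsets / column_indices arrays.
__global__ void relax_vertex_per_thread(const int *row_offsets,    // size: num_vertices + 1
                                        const int *column_indices, // size: num_edges
                                        const float *dist_in,
                                        float *dist_out,
                                        int num_vertices)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // one vertex per SIMD-thread
    if (v >= num_vertices) return;

    float best = dist_in[v];
    // Expanding the edge list of v: the iteration count equals v's degree,
    // which differs from lane to lane (the divergence discussed in Section 4).
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int u = column_indices[e];
        best = fminf(best, dist_in[u] + 1.0f);       // a BFS/SSSP-style relaxation step
    }
    dist_out[v] = best;
}
```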

## 4. Inefficiencies of vertex-centric graph parallel computing on GPUs
* performance penalties
* having many vertices with divergent degrees

<center>Fig. 2</center>

<center>Fig. 4</center>
### 4.1. Intra-warp load imbalance and under-utilization of SIMD units
* Intra-warp load imbalance: the SIMD-threads of a warp own vertices with different degrees, so the warp keeps iterating over edges until its highest-degree vertex finishes while the other lanes sit idle
* Fig. 2(e)
* Fig. 4(a)

<center>Fig. 3</center>
### 4.2. Non-coalesced memory accesses
* non-coalesced memory accesses & memory divergence: the neighbor lists of the vertices mapped to one warp start at unrelated offsets in the edge array, so a single warp-wide access spans many cache lines
* Fig. 2(f)
* Fig. 4(b)
## 5. Scalarized vertex-centric parallel graph computing
* scalarized vertex-centric parallel graph computing
* (1) mapping a vertex to a warp
* (2) utilizing Scalar-execution (SE)
* shortcoming of vertex-to-warp mapping alone
* redundant computations within the warps (every SIMD-thread of a warp computes the same warp-uniform values)
* step
* (1) the programmer associates a vertex with a warp and structures the algorithm so that the work on edges is distributed over the SIMD-threads of a warp (see the kernel sketch after this list)
* (2) the programmer annotates scalar code when implementing the algorithm
* (3) the compiler marks each scalar instruction based on the programmer's annotation and on static scalarization analysis
* (4) at execution time, the hardware uses Scalar-Execution (SE) when the instructions to execute are scalar instructions
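A minimal sketch of what steps (1) and (2) could look like, written in plain CUDA as an approximation (the paper exposes warps through the intrinsics of Section 5.2.1 rather than the raw index arithmetic used here); the `// scalar` comments mark the warp-uniform work a programmer would annotate.

```cuda
// Illustrative sketch only: the vertex-to-warp version of the kernel from
// Section 3. One warp owns one vertex and its lanes stride over that vertex's
// edges. Assumes blockDim.x is a multiple of warpSize so warp_id (and v) are
// uniform within each warp.
__global__ void relax_vertex_per_warp(const int *row_offsets,
                                      const int *column_indices,
                                      const float *dist_in,
                                      float *dist_out,
                                      int num_vertices)
{
    int lane    = threadIdx.x % warpSize;
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;

    int v = warp_id;                    // scalar: identical across the warp
    if (v >= num_vertices) return;      // scalar: uniform branch, whole warp exits together

    int row_begin = row_offsets[v];     // scalar: one load could serve the whole warp
    int row_end   = row_offsets[v + 1]; // scalar
    float best    = dist_in[v];         // scalar

    for (int e = row_begin + lane; e < row_end; e += warpSize) {
        int u = column_indices[e];      // consecutive lanes read consecutive edges (coalesced)
        best = fminf(best, dist_in[u] + 1.0f);
    }

    // Warp-wide min reduction; the full mask is safe because the warp re-converges here.
    for (int off = warpSize / 2; off > 0; off /= 2)
        best = fminf(best, __shfl_down_sync(0xffffffffu, best, off));
    if (lane == 0) dist_out[v] = best;  // lane 0 writes the per-vertex result
}
```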
### 5.1. Vertex-to-warp mapping
* advantages:
* (1) SIMD utilization (the warp's lanes stay busy as long as the vertex still has edges to process)
* (2) memory access coalescing (consecutive lanes read consecutive entries of the vertex's neighbor list)

### 5.2. Scalar execution (SE) to eliminate redundant computations

<center>redundant computations</center>
* static identification of scalar operations using programmer annotation and compiler analysis
* a few changes to Instruction Set Architecture (ISA)
* a number of modifications to our baseline multi-threaded SIMD processor to support SE
#### 5.2.1. Programming for scalarized vertex-centric parallel graph computing
* warps become an explicit part of the programming model
* two new intrinsics
* get_warp_id()
* get_simd_width()
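The notes do not show how these intrinsics are used; below is a minimal sketch that approximates them with standard CUDA built-ins purely for illustration (the intrinsic names are the paper's, the bodies are my assumption).

```cuda
// Approximation of the paper's proposed intrinsics using standard CUDA
// built-ins, for illustration only (assumes one-dimensional blocks and grids).
__device__ inline unsigned get_warp_id() {
    // Global index of the warp this thread belongs to; in the GW programming
    // model a warp uses this value to pick the vertex it owns.
    return (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
}

__device__ inline unsigned get_simd_width() {
    // Number of SIMD-threads per warp: 32 on current NVIDIA GPUs,
    // 8 for the narrow warps of the GW architecture.
    return warpSize;
}
```

A warp would then process vertex `get_warp_id()` and stride over its edge list in steps of `get_simd_width()`, as in the kernel sketched after the step list in Section 5.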

#### 5.2.2. Identification of scalar operations
* compiler annotations
* user annotation
* compiler analysis
* divergence analysis
#### 5.2.3. Architectural support for scalar execution on SIMD units
* include an extra bit
* Instruction Buffer
* ISA
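A conceptual, simulator-style sketch of my reading of the "extra bit" bullet (not the paper's actual hardware): each instruction-buffer entry carries a flag telling the issue logic whether the instruction is scalar.

```cuda
#include <cstdint>

// Conceptual sketch only: an instruction-buffer entry extended with a scalar
// flag. The compiler sets the flag through the ISA extension (Section 5.2.2);
// the issue logic reads it to choose between regular SIMD issue and
// Scalar-Execution.
struct IBufEntry {
    uint64_t pc;        // program counter of the decoded instruction
    uint64_t decoded;   // decoded-instruction payload (opaque in this sketch)
    bool     valid;     // entry currently holds a ready instruction
    bool     is_scalar; // the extra bit: instruction is warp-uniform (scalar)
};
```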
## 6. Graph-waving architecture for scalarized vertex-centric parallel graph computing
* (1) it uses narrow SIMD-widths with clustered issue approach
* (2) it employs decoded-instruction re-use, and
* (3) it extends and employs Scalar-Waving (SW) architecture
### 6.1. Narrow SIMDs with clustered issue
* supports 8-wide warps
* organizes the available SIMD processing elements into 4 groups; each group is 16-wide and executes a pair of an odd- and an even-numbered warp together
* at each cycle, 4 scalar/regular warp instructions are issued to the SIMD-units
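A small sketch of how I read the clustered issue; the low-half/high-half lane assignment below is an assumption made for illustration, not a detail stated in these notes.

```cuda
// Conceptual sketch only. GW warps are 8 wide; each 16-wide SIMD group
// co-issues one even- and one odd-numbered warp per cycle, so the 4 groups
// together issue 4 warp instructions per cycle.
constexpr int GW_WARP_WIDTH = 8;   // narrow GW warp
constexpr int GROUP_WIDTH   = 16;  // one clustered SIMD group
constexpr int NUM_GROUPS    = 4;   // groups per core

// Lane inside a 16-wide group occupied by (warp_id, lane_in_warp), assuming
// the even warp of a pair fills lanes 0..7 and the odd warp fills lanes 8..15.
constexpr int group_lane(int warp_id, int lane_in_warp) {
    return (warp_id % 2) * GW_WARP_WIDTH + lane_in_warp;
}
```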

### 6.2. Instruction re-use in front-end
* the increased number of warps is likely to create pressure in the front-end

* re-use the already fetched and decoded instructions
* requires modifications to fetch unit and re-organization of instruction buffer
* instruction storage must be decoupled from warps.
* Fig. 7(b)
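A simulator-style sketch of what "decoupling instruction storage from warps" might mean in practice (the structure and names are my assumptions for illustration): decoded instructions sit in a shared, PC-indexed store and each warp only references into it, so warps and loop iterations that hit the same PC reuse one fetched-and-decoded copy.

```cuda
#include <cstdint>
#include <unordered_map>

// Conceptual sketch only: a shared, warp-agnostic store of decoded
// instructions, indexed by PC. Warps (and loop iterations) that reach the
// same PC reuse the stored copy instead of re-fetching and re-decoding it.
struct DecodedInst {
    uint64_t pc;
    uint64_t payload;   // decoded-instruction bits (opaque in this sketch)
    bool     is_scalar; // carried along for Scalar-Waving (Section 6.3)
};

struct SharedInstStore {
    std::unordered_map<uint64_t, DecodedInst> by_pc;   // shared across warps

    // Return the decoded instruction for `pc`, decoding only on a miss.
    const DecodedInst &lookup_or_decode(uint64_t pc) {
        auto it = by_pc.find(pc);
        if (it == by_pc.end())
            it = by_pc.emplace(pc, DecodedInst{pc, /*payload=*/0, /*is_scalar=*/false}).first;
        return it->second;
    }
};
```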
### 6.3. Efficient execution of scalar instructions
* Multiple warps executing the same scalar instruction (possessing the same PC) can be grouped into a Scalar-wave to execute together in SIMD fashion on the SIMD-pipelines
* store scalar values in a modified register file
* (1) modification to the instruction buffer to differentiate scalar instructions
* (2) adding a Scalar-Wave Formation Unit (SWFU), a new unit that includes a Scalar-Wave Status Table (SWST) and manages the formation of Scalar-waves
* (3) modifications to the register file to provide efficient storage of scalar values
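A rough, simulator-style sketch of how the SWST might collect warps into a Scalar-wave (the table layout and the "wave is full" trigger are my assumptions for illustration; the paper defines the actual policy).

```cuda
#include <cstdint>
#include <vector>

// Conceptual sketch only: one SWST entry per pending scalar PC. Warps that
// reach the same scalar instruction set their bit; once one warp per SIMD lane
// has gathered (an illustrative trigger), they issue together as a single
// Scalar-wave, one warp's scalar operation per lane. Assumes warp ids < 32.
struct SWSTEntry {
    uint64_t pc        = 0;     // scalar instruction being collected
    uint32_t warp_mask = 0;     // which warps are waiting at this PC
    bool     valid     = false;
};

struct ScalarWaveFormationUnit {
    std::vector<SWSTEntry> table;
    int simd_width;             // lanes available to one Scalar-wave

    explicit ScalarWaveFormationUnit(int lanes, int entries = 8)
        : table(entries), simd_width(lanes) {}

    // Record that `warp_id` waits at scalar `pc`; return true when the wave
    // should issue.
    bool note_scalar(uint64_t pc, int warp_id) {
        SWSTEntry *slot = nullptr;
        for (auto &e : table) {
            if (e.valid && e.pc == pc) { slot = &e; break; }  // existing entry wins
            if (!slot && !e.valid) slot = &e;                 // remember a free entry
        }
        if (!slot) return false;                              // table full (simplified: no wave)
        if (!slot->valid) *slot = SWSTEntry{pc, 0u, true};    // allocate for this PC
        slot->warp_mask |= (1u << warp_id);
        if (__builtin_popcount(slot->warp_mask) == simd_width) {
            slot->valid = false;                              // issue the wave, free the entry
            return true;
        }
        return false;
    }
};
```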

## 7. Hardware cost and power consumption
The authors focus on the performance aspects of the GW architecture and leave a detailed study of area estimation and power consumption as future work.
## 8. Evaluation
### 8.1. Experimental setup
* GPGPU-Sim version 3.x
* benchmark:

### 8.2. Performance of graph-waving architecture
* REGULAR benchmarks: 9% performance improvement on average (geometric mean)
* IRREGULAR benchmarks: 17% improvement on average
* graph applications: 4.41x speedup on average

## 10. Conclusions and future work
* GW uses a narrow SIMD-width with a clustered issue approach
* introduces extensions to the baseline GPU architecture to reuse instructions in the front-end
* eliminates redundant computations using Scalar-waving
## Comparison with our work
* [Our Work](https://docs.google.com/presentation/d/1Thz_tu13iYqtAW99xmj1XgWEpikkIbobYvLYku2fkGw/edit?usp=sharing)
* if the degree of most of the nodes is less than 8, then GW may not fully utilize the SIMD lanes
* hardware modification
    * GW: scoreboard, instruction buffer/storage, scheduler, issue unit; adds a Scalar-Wave Formation Unit (SWFU)
    * ours: operand collector; adds an Index Unit
* software modification
    * GW: vertex-to-warp mapping, user annotation
    * ours: **working list** or **None**