# Hardware Acceleration of Graph Neural Networks
##### origin: DAC'20
##### paper: [link](https://passat.crhc.illinois.edu/dac20.pdf)
###### tags: `Accelerators`
## Introduction
### Background
- **Graph Neural Networks (GNNs)** extend the popular Deep Neural Network (DNN) to natively support **graph-structured inputs, outputs, and/or internal state**.
- GNNs have generated particular interest due to their ability to outperform existing techniques in many applications, extending the power of machine learning to problems with graph-structured inputs.
- ConvGNN
    - Message passing over the graph, in two phases per layer (see the sketch after this list):
        - Aggregation: each vertex combines the features of its neighbors
        - Update: the aggregated result is transformed by dense DNN layers
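A minimal NumPy sketch of one such message-passing layer, assuming a mean aggregator and a single ReLU dense layer for the update (illustrative choices, not the paper's specific models):

```python
import numpy as np

def conv_gnn_layer(features, neighbors, W):
    """One message-passing layer: aggregate neighbor features, then update.

    features:  (num_vertices, d_in) dense per-vertex feature matrix
    neighbors: adjacency list; neighbors[v] = list of neighbor vertex ids
    W:         (d_in, d_out) weight matrix of the update DNN
    """
    aggregated = np.zeros_like(features)
    # Aggregation: irregular, graph-dependent gather
    for v in range(features.shape[0]):
        if neighbors[v]:
            aggregated[v] = features[neighbors[v]].mean(axis=0)
    # Update: dense, regular matrix multiply (the DNN part)
    return np.maximum(aggregated @ W, 0.0)

# Toy 4-vertex graph
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
adj = [[1, 2], [0], [0, 3], [2]]
print(conv_gnn_layer(feats, adj, rng.normal(size=(8, 16))).shape)  # (4, 16)
```

Note how the two phases stress hardware differently: the aggregation loop is a data-dependent gather, while the update is a regular dense matmul.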

### Problem
- The computation of GNNs is not a good fit for existing DNN accelerators:
    - DNN accelerators target regular, SIMD-style dataflow
    - GNN execution is irregular and MIMD-like: data-dependent graph traversal interleaved with dense compute
- Low utilization due to the **sparsity** of the input graph
- Several recent works optimize DNN accelerators to exploit sparsity in matrices for traditional DNN workloads and demonstrate significant benefits, but **that sparsity is at least two orders of magnitude lower than the sparsity of the graphs typically fed to a GNN** (a back-of-the-envelope comparison follows this list).
- Conversely, the sparse representations used by graph accelerators are a poor fit for the dense per-vertex feature vectors that GNNs attach to each vertex.
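To make the gap concrete, a back-of-the-envelope comparison; the densities below are illustrative assumptions, not numbers from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pruned DNN weight matrix: the kind of sparsity sparse-DNN accelerators
# target, often on the order of 50-90% zeros (illustrative 50% here).
W = rng.normal(size=(256, 256))
W[rng.random(W.shape) < 0.5] = 0.0

# Adjacency matrix of a typical GNN input graph: ~10 edges per vertex
# out of 100,000 vertices gives a density of only 0.01%.
n_vertices, avg_degree = 100_000, 10
adj_density = avg_degree / n_vertices

print(f"weight matrix density: {np.count_nonzero(W) / W.size:.2%}")  # ~50%
print(f"adjacency density:     {adj_density:.4%}")                   # 0.0100%
# Here the adjacency matrix is more than three orders of magnitude sparser.
```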
### Proposed Solution
- This paper proposed the **first accelerator architecture targeting GNNs**, including dedicated hardware units to efficiently execute the irregular data movement required for graph computation in GNNs, while also providing high compute throughput required by GNN models.

## An Accelerator Architecture for GNNs
- **Graph PE (GPE)**: graph traversal and the software runtime (a behavioral sketch of its memory access pattern follows this list)
    - Allocation bus: connects to the other modules to request scratchpad space
    - Memory interface: issues indirect, asynchronous memory requests (responses flow: memory → fill buffer → scratchpad)
    - General-purpose CPU: runs the runtime
        - Manages a pool of software threads
        - Synchronization
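A behavioral sketch of the access pattern the GPE is built around: traversal produces indirect loads (neighbor id → feature row), which are issued asynchronously into a fill buffer and later committed to the scratchpad. The class, method names, and drain policy are assumptions for illustration, not the paper's microarchitecture:

```python
import numpy as np

class GPE:
    """Behavioral model of one Graph PE (illustrative, not the paper's RTL)."""

    def __init__(self, features):
        self.features = features  # backing memory: per-vertex feature rows
        self.fill_buffer = []     # in-flight indirect memory requests
        self.scratchpad = {}      # on-chip space requested via the allocation bus

    def issue(self, vertex_id):
        # Indirect request: the address depends on a previously loaded vertex
        # id, so it cannot be statically scheduled like a dense DNN access
        # stream; the GPE issues it asynchronously instead of stalling.
        self.fill_buffer.append(vertex_id)

    def drain(self):
        # Memory responses return asynchronously; commit them to the scratchpad.
        for v in self.fill_buffer:
            self.scratchpad[v] = self.features[v]
        self.fill_buffer.clear()

# Gather the features of vertex 0's neighbors
feats = np.arange(20.0).reshape(5, 4)
gpe = GPE(feats)
for neighbor in [1, 3, 4]:
    gpe.issue(neighbor)
gpe.drain()
print(sorted(gpe.scratchpad))  # [1, 3, 4]
```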
- **DNN Queue (DNQ)**: buffers memory requests and intermediate results as they are passed to the DNA (a behavioral sketch follows this list)
    - Delayed enqueue: a slot is reserved in order at enqueue time and filled later, when the memory response arrives
    - Data scratchpad: queue entries plus per-entry ready bits
    - Control: two pairs of head/tail pointers (one queue fills while the other drains)
    - Lazy queue-switching algorithm: the queues are switched only after the DNA has been idle for 16 cycles
    - When an entry's ready bit = 1 → dequeue → destination buffer
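A behavioral sketch of how the delayed enqueue, ready bits, and lazy switching could fit together. The 16-cycle threshold is from the paper; the queue depth, method names, and one-dequeue-per-cycle semantics are assumptions for illustration:

```python
IDLE_SWITCH_THRESHOLD = 16  # DNA idle cycles before switching queues (from the paper)

class DNQ:
    """Behavioral sketch of the DNN Queue (no pointer wraparound, for brevity)."""

    def __init__(self, depth=64):
        self.data  = [[None] * depth for _ in range(2)]   # data scratchpad: 2 queues
        self.ready = [[False] * depth for _ in range(2)]  # per-slot ready bits
        self.head  = [0, 0]                               # head/tail pair per queue
        self.tail  = [0, 0]
        self.fill, self.drain = 0, 1  # GPE fills one queue, DNA drains the other
        self.idle_cycles = 0

    def delayed_enqueue(self):
        """Delayed enqueue: reserve a slot in order now; data arrives later."""
        slot = self.tail[self.fill]
        self.tail[self.fill] += 1
        return slot

    def complete(self, slot, value):
        """Asynchronous memory response: write the data, set its ready bit."""
        self.data[self.fill][slot] = value
        self.ready[self.fill][slot] = True

    def dequeue(self):
        """DNA pops the head entry once its ready bit is set, else stalls."""
        q, h = self.drain, self.head[self.drain]
        if h < self.tail[q] and self.ready[q][h]:
            self.idle_cycles = 0
            self.head[q] += 1
            return self.data[q][h]
        # Lazy switching: swap queues only after a sustained idle period,
        # so a queue still waiting on memory is not abandoned too eagerly.
        self.idle_cycles += 1
        if self.idle_cycles >= IDLE_SWITCH_THRESHOLD:
            self.fill, self.drain = self.drain, self.fill
            self.idle_cycles = 0
        return None  # DNA stalls this cycle

dnq = DNQ()
slot = dnq.delayed_enqueue()    # GPE reserves a slot in the fill queue
dnq.complete(slot, "features")  # memory response lands, ready bit set
stalls = 0
while dnq.dequeue() != "features":  # DNA stalls, then lazily switches queues
    stalls += 1
print(stalls)  # 16
```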
- **DNN Accelerator (DNA)**: executes the DNN computation within the GNN model

- **Aggregator (AGG)**: aggregates the per-vertex features in the GNN model

## Methodology
- GNN Benchmarks
- Graph Convolutional Networks (GCNs)
- Graph Attention Networks (GATs)
- Message Passing Neural Networks (MPNNs)
- Power Graph Neural Networks (PGNN)
- GNN Accelerator Model
    - Booksim-based custom simulation model
    - NN-Dataflow: maps DNN models onto an Eyeriss-like single-tile spatial-array accelerator with 182 PEs configured in a 13x14 array
- Input Datasets

- Accelerator Configurations
    - Baseline CPU and GPU systems
        - Memory bandwidth: about 68 GB/s

- Proposed architecture (3 configurations)

## Results
- Latency (baseline systems)

- Speedup

- Utilization (CPU iso-bandwidth)

- Overall: 7.5x higher performance than GPUs and 18x higher than CPUs at iso-bandwidth.