# Hardware Acceleration of Graph Neural Networks

##### origin: DAC'20
##### paper: [link](https://passat.crhc.illinois.edu/dac20.pdf)
###### tags: `Accelerators`

## Introduction

### Background

- **Graph Neural Networks (GNNs)** can be described as an extension of the popular Deep Neural Network (DNN) that allows the network to natively support **graph-structured inputs, outputs, and/or internal state**.
- GNNs have generated particular interest due to their ability to outperform existing techniques in many applications, extending the power of machine learning to problems with graph-structured inputs.
- ConvGNN
    - Message passing
        - Aggregation
        - Update

![](https://i.imgur.com/OfqE2no.png)
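To make the aggregate/update pattern concrete, here is a minimal NumPy sketch of one ConvGNN layer. It is an illustration only, not the paper's exact formulation: the CSR adjacency (`indptr`, `indices`), mean aggregation, and single-weight-matrix ReLU update are all assumptions.

```python
import numpy as np

def convgnn_layer(indptr, indices, X, W):
    """One message-passing layer: aggregate neighbor features, then update.

    indptr, indices: CSR adjacency (per-vertex neighbor lists).
    X: (num_vertices, in_feats) vertex feature matrix.
    W: (in_feats, out_feats) dense weight matrix for the update step.
    Mean aggregation and a ReLU update are illustrative choices only.
    """
    H = np.zeros_like(X)
    for v in range(X.shape[0]):
        nbrs = indices[indptr[v]:indptr[v + 1]]  # irregular, data-dependent gather
        if len(nbrs) > 0:
            H[v] = X[nbrs].mean(axis=0)          # Aggregation (message passing)
    return np.maximum(H @ W, 0.0)                # Update: dense DNN-style compute

# Toy 3-vertex graph: vertex 1 has in-neighbor {0}, vertex 2 has {0, 1}
indptr  = np.array([0, 0, 1, 3])
indices = np.array([0, 0, 1])
X = np.random.rand(3, 4).astype(np.float32)
W = np.random.rand(4, 2).astype(np.float32)
print(convgnn_layer(indptr, indices, X, W).shape)  # (3, 2)
```

The per-vertex gather through `indices` is exactly the irregular, MIMD-style access pattern that the next section argues is a poor fit for SIMD DNN accelerators, while the `H @ W` update is the dense compute they handle well.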
### Problem

- The computations of GNNs are not a good fit for existing DNN accelerators:
    - DNN: SIMD
    - GNN: MIMD
    - Low utilization due to **sparsity**

![](https://i.imgur.com/5hdlpAV.png)
![](https://i.imgur.com/1Y9h88x.png)
![](https://i.imgur.com/wIQjn6j.png)

- Several recent works optimize DNN accelerators to exploit sparsity in matrices for traditional DNN workloads and demonstrate significant benefits, but **that sparsity is at least two orders of magnitude lower than the sparsity of a typical sparse graph inputted to a GNN**.
- The sparse representations used in graph accelerators are a poor fit for the per-vertex feature components.

### Proposed Solution

- This paper proposes the **first accelerator architecture targeting GNNs**, including dedicated hardware units to efficiently execute the irregular data movement required for graph computation in GNNs, while also providing the high compute throughput required by GNN models.

![](https://i.imgur.com/3qM94Ru.png)

## An Accelerator Architecture for GNNs

- **Graph PE (GPE)**: graph traversal and the runtime
    ![](https://i.imgur.com/KuCAhf5.png)
    - Allocation bus: connects to the other modules in order to request scratchpad space
    - Memory interface: issues indirect asynchronous memory requests → fill buffer → scratchpad
    - General-purpose CPU: runs the runtime
        ![](https://i.imgur.com/B6uZPRE.png)
        - Manages a pool of software threads
        - Synchronization
- **DNN Queue (DNQ)**: buffers memory requests and intermediate results as they are passed to the DNA
    ![](https://i.imgur.com/LyX6u8U.png)
    - Delayed enqueue
    - Data scratchpad: queue entries and ready bits
    - Control: two pairs of head/tail pointers
    - Lazy queue-switching algorithm: the active queue is switched only when the DNA has been idle for 16 cycles (see the sketch at the end of this note)
    - When ready bit = 1 → dequeue → destination buffer
- **DNN Accelerator (DNA)**: executes the DNN computation within the GNN model
    ![](https://i.imgur.com/4XtHmFz.png)
- **Aggregator (AGG)**: aggregates the features in the GNN model
    ![](https://i.imgur.com/1curxvW.png)

## Methodology

- GNN benchmarks
    - Graph Convolutional Networks (GCNs)
    - Graph Attention Networks (GATs)
    - Message Passing Neural Networks (MPNNs)
    - Power Graph Neural Networks (PGNN)
- GNN accelerator model
    - BookSim-based custom simulation model
        ![](https://i.imgur.com/mYbsz0Y.png)
    - NN-Dataflow: maps DNN models onto an Eyeriss-like single-tile spatial-array accelerator with 182 PEs configured in a 13×14 array
- Input datasets
    ![](https://i.imgur.com/LCWKyD6.png)
- Accelerator configurations
    - Baseline CPU and GPU systems
        - Memory bandwidth: about 68 GB/s
        ![](https://i.imgur.com/tIhZY6y.png)
    - Proposed architecture (3 configurations)
        ![](https://i.imgur.com/5PpfwBr.png)
        ![](https://i.imgur.com/WujNfcm.png)

## Results

- Latency (baseline systems)
    ![](https://i.imgur.com/lrhiJs9.png)
    ![](https://i.imgur.com/XDPtEXB.png)
- Speedup
    ![](https://i.imgur.com/mbnQ8io.png)
- Utilization (CPU iso-bandwidth)
    ![](https://i.imgur.com/XmM9q1f.png)
- Overall performance: 7.5× higher performance than GPUs and 18× higher performance than CPUs at iso-bandwidth
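Finally, to make the DNQ's lazy queue-switching policy from the architecture section above concrete, here is a minimal cycle-level sketch. Only the two head/tail pointer pairs, the ready bits, and the 16-idle-cycle switching threshold come from the paper; the entry contents, queue depths, and producer side are illustrative assumptions.

```python
from collections import deque

IDLE_SWITCH_THRESHOLD = 16  # from the paper: switch only after 16 idle DNA cycles

class DNQ:
    """Toy model of the DNN Queue's lazy switching between its two queues.

    Entry layout and the ready-bit protocol details are assumptions; only
    the two-queue / 16-idle-cycle policy follows the paper's description.
    """
    def __init__(self):
        self.queues = [deque(), deque()]  # two head/tail pointer pairs
        self.active = 0                   # queue currently feeding the DNA
        self.idle_cycles = 0

    def enqueue(self, which, item, ready=True):
        # Delayed enqueue (roughly modeled): an entry occupies a slot but is
        # only consumable once its ready bit is set.
        self.queues[which].append((item, ready))

    def dequeue(self):
        """Called once per DNA cycle; returns an item or None (idle cycle)."""
        q = self.queues[self.active]
        if q and q[0][1]:                 # head entry exists and ready bit = 1
            self.idle_cycles = 0
            return q.popleft()[0]
        self.idle_cycles += 1
        if self.idle_cycles >= IDLE_SWITCH_THRESHOLD:
            self.active ^= 1              # lazy switch to the other queue
            self.idle_cycles = 0
        return None

dnq = DNQ()
dnq.enqueue(1, "features[v0]")            # work lands on the inactive queue
out = [dnq.dequeue() for _ in range(20)]
# 16 idle cycles trigger the switch; the entry then drains on cycle 17
print([o for o in out if o is not None])  # ['features[v0]']
```

The point of switching lazily rather than eagerly is to avoid thrashing between the two queues while one is still being filled, at the cost of a short idle window when work is only available on the inactive queue.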