# AWB-GCN: A Graph Convolutional Network Accelerator with Runtime Workload Rebalancing
##### origin: MICRO 2020
##### paper: [link](https://microarch.org/micro53/papers/738300a922.pdf)
###### tags: `Accelerators`

## I. Introduction

### Background
- A large (and increasing) number of applications use **non-Euclidean data structures** that are modeled as **graphs**: **nodes** and **edges** represent objects and the relationships between those objects, respectively, as appropriate for the application.
- The irregularity of graph data makes most existing Neural Network (NN) algorithms ill-suited.
- To tackle this issue, **Graph Neural Networks (GNNs)** have been proposed; the **Graph Convolutional Network (GCN)**, which borrows ideas from CNNs, is one of the most popular GNNs.

### Problem
- With the rapid development of GCNs, designing dedicated hardware accelerators has become an urgent issue.
- Accelerators developed for other domains, such as the sparse-CNN accelerator (SCNN), are unlikely to be optimal as GCN accelerators.
- **GCN applications have highly unbalanced non-zero data distributions**:
    - Power-law distribution ![](https://hackmd.io/_uploads/HydW7Q8s2.png)
    - **Extremely high sparsity**
        - Graph adjacency matrix: exceeds 99.9%
        - SCNN: generally ranges from 10% to 50%
    - **Large matrix size**
        - Real-world graphs are usually very large (e.g., Reddit dataset: 233K nodes)
        - Although neural networks also have large models, the matrix of a particular layer is often much smaller (e.g., 1k×1k), so the working set can often fit easily into on-chip memory.

### Proposed Solution
- **AWB-GCN**: a hardware accelerator for GCN inference with workload auto-tuning
- It monitors the workload distribution at three levels at runtime and, accordingly, rebalances the distribution per round (one output matrix column).
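To make the imbalance concrete, here is a small synthetic sketch (not data from the paper): with power-law per-row non-zero counts and a static round-robin row-to-PE mapping (as in the baseline architecture of Section III), the PE holding the highest-degree rows does far more MAC work than the average. The function names and the constants (1024 rows, 8 PEs, scale 1000) are illustrative assumptions.

```python
def row_nnz_power_law(num_rows, beta=1.0, scale=1000):
    """Synthetic per-row non-zero counts following y = x^-beta (power law)."""
    return [max(1, int(scale * (i + 1) ** -beta)) for i in range(num_rows)]

def static_mapping_load(nnz_per_row, num_pes):
    """Static mapping: row i -> PE (i % num_pes); returns per-PE MAC counts."""
    load = [0] * num_pes
    for i, nnz in enumerate(nnz_per_row):
        load[i % num_pes] += nnz
    return load

nnz = row_nnz_power_law(num_rows=1024)
load = static_mapping_load(nnz, num_pes=8)
mean = sum(load) / len(load)
# The PE that owns the few very dense rows is loaded well above the mean,
# while most PEs sit below it -- the busiest PE bounds the round's latency.
```

Since every round must wait for the slowest PE, the max/mean load ratio directly translates into lost utilization, which is what the three rebalancing techniques below attack.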
- Three techniques:
    - **Distribution smoothing**: balances the workload among neighboring PEs
    - **Remote switching**: shuffles the workloads of the regions with the most and the least clustered non-zero elements
    - **Evil row remapping**: if a row still contains too many non-zeros to be smoothed or balanced by remote switching, it is designated an **evil row** and remapped ![](https://hackmd.io/_uploads/BJYaJrUi2.png) ![](https://hackmd.io/_uploads/B1m9frUj3.png)

## II. Motivation

### A. Graph Convolutional Network (GCN) Structure
![](https://hackmd.io/_uploads/rJ9OnrLi3.png)
- Equation 1 shows the layer-wise forward propagation of a multi-layer spectral GCN, $X^{(l+1)} = \sigma(AX^{(l)}W^{(l)})$:
    - A: Adjacency matrix
        - In general A needs to be normalized: ![](https://hackmd.io/_uploads/H17oO8vsh.png)
        - D: Degree matrix (![](https://hackmd.io/_uploads/H1Rc0HIj2.png))
        - A is used to denote the normalized Ã in this paper.
        - Multi-hop neighbor information: A^2^, A^3^, etc.
    - X^(l)^: Input feature matrix of layer l
    - W^(l)^: Weight matrix of layer l
    - σ(.): Non-linear activation function (e.g., ReLU)
- Equation 1 is derived from graph signal processing theory: convolution in the frequency domain via the Fourier transform, simplified using Chebyshev polynomials ![](https://hackmd.io/_uploads/S1uUiIIs3.png)

### B. Characteristics of Power-Law Graphs
- Real-world graphs in many critical domains typically follow the **power-law distribution**: y = x^-β^, β > 0 ![](https://hackmd.io/_uploads/rJi1NUIin.png)
    - x: degree of a node
    - y: number of nodes
- Adjacency matrix: a small number of rows (or columns) include the majority of non-zeros, whereas the majority of rows (or columns) contain only a few non-zeros but are not empty ![](https://hackmd.io/_uploads/rkEJIU8s3.png)
- Matrix density ![](https://hackmd.io/_uploads/BkvUIL8s3.png)
    - A: ultra sparse (sparsity ≥ 99%), stored in sparse format
    - X: generally sparse
    - W: dense

## III. GCN Baseline Architecture

### A. Matrix Computation Order
- (A × X) × W vs. A × (X × W)
    - (A × X) × W
        - A × X: sparse–sparse matrix multiplication → produces a large dense matrix
        - (AX) × W: general dense matrix multiplication → time consuming!
    - **A × (X × W)**
        - Both X × W and A × (XW) are sparse–dense matrix multiplications (SpMM)
- As shown in Table II, A × (X × W) is clearly the better choice ![](https://hackmd.io/_uploads/SJozTd8sn.png)

### B. SpMM Execution Order and Mapping
- **Column-wise-product-based SpMM**: ![](https://hackmd.io/_uploads/Hk22AO8sn.png)
    - Given S~(m×n)~ × B~(n×k)~ = C~(m×k)~
        - S~j~: j-th column of S
        - b~(j,k)~: element of B at row j and column k
    - Example:

$$
\begin{bmatrix}1&0&2\\-1&3&1\end{bmatrix}
\begin{bmatrix}3&1\\2&1\\1&0\end{bmatrix}
=\left[\,
\begin{bmatrix}1\\-1\end{bmatrix}3+
\begin{bmatrix}0\\3\end{bmatrix}2+
\begin{bmatrix}2\\1\end{bmatrix}1,\;
\begin{bmatrix}1\\-1\end{bmatrix}1+
\begin{bmatrix}0\\3\end{bmatrix}1+
\begin{bmatrix}2\\1\end{bmatrix}0
\,\right]
$$

$$
=\left[\,
\begin{bmatrix}3\\-3\end{bmatrix}+
\begin{bmatrix}0\\6\end{bmatrix}+
\begin{bmatrix}2\\1\end{bmatrix},\;
\begin{bmatrix}1\\-1\end{bmatrix}+
\begin{bmatrix}0\\3\end{bmatrix}+
\begin{bmatrix}0\\0\end{bmatrix}
\,\right]
=\begin{bmatrix}5&1\\4&2\end{bmatrix}
$$

    - Each element b~(j,k)~ finishes all the computation it is involved in at once and is then evicted
    - The computation of each column of C reuses the entire sparse matrix S (k times in total)
- **Workload Partitioning**: direct and static mapping from matrix rows to PEs ![](https://hackmd.io/_uploads/SkrzSYLj2.png)

### C. Design of Baseline Architecture
![](https://hackmd.io/_uploads/B1V9UtUo2.png)
- **Sparse-matrix-memory (SpMMeM)**
    - Buffers the input sparse matrix S (from off-chip)
    - Feeds non-zeros and their indices to the TDQ
- **Dense-column-memory (DCM)**
    - Buffers the input dense matrix B
    - Broadcasts its elements to the TDQ
- **Task-distributor & Queue (TDQ)**
    - Distributes tasks to the PEs
    - TDQ-1 & TDQ-2 (two designs of TDQ)
- **PE-array**
    - Performs concurrent multiplication of non-zero pairs and partial-result accumulation (MAC)
        - Receives a new non-zero pair [d1, d2] from the TDQ
        - Performs the new multiplication task with d1, d2
        - Fetches the corresponding partial result [d_acc] of output matrix C from the ACC buffers
        - Accumulates the multiplication result with d_acc
    - Data exchange with the ACC buffers
        - Address-Generation-Unit (AGU): result address generation and forwarding
- **Accumulation-buffers-array (ACC Buffer)**
    - Caches the partial results of the resulting matrix C for accumulation
    - Sends them to the next SpMM engine upon completion of a whole column calculation
- Two designs of TDQ
    - **TDQ-1** ![](https://hackmd.io/_uploads/H1YCcs8jh.png)
        - Used when S is **generally sparse** (**sparsity < 75%**) and stored in dense format
        - Direct row partition: non-zeros are mapped to the input buffers of the corresponding PEs
        - In each cycle, **NPE/(1 − Sparsity)** elements are forwarded to the PE array; only non-zeros are kept in the queues
        - Each PE can receive at most **1/(1 − Sparsity)** non-zeros in one cycle
            - 1/(1 − 75%) = 4 non-zero elements → 4 **Task Queues (TQs)** per PE
        - **Read-after-Write (RaW) check**
            - If a RaW hazard occurs (i.e., two tasks come from the same row of the sparse matrix), the new job is cached in the stall buffer
        - **Arbiter**
            - Selects a non-empty queue, pops an element without a hazard, and forwards it to the PE for processing
    - **TDQ-2** ![](https://hackmd.io/_uploads/HJ7giiIon.png)
        - Used when S is **ultra-sparse** and stored in **CSC format**
        - **Multi-stage Omega network**
            - Routes the non-zeros to the correct PE according to their row indices
                - PE_Des: row index
            - Each router in the Omega network has a local buffer in case the buffer of the next stage is saturated
    - **TDQ-1 for X × W, TDQ-2 for A × (XW)**

### D. Pipelining SpMM Chains
- Intra-layer SpMM pipelining
    - Fine-grained pipelining: since A is constant, once a column of (XW) is calculated, the multiplication of this column with A can start immediately, without waiting for the entire XW ![](https://hackmd.io/_uploads/Sy2g3jLjn.png)
    - Instead of requiring off-chip storage to cache the big resulting XW matrix, only a single column of XW needs to be buffered; this can be done on-chip
- Inter-layer SpMM pipelining
    - Hardware resources (PEs) are allocated in proportion to the workload of each layer, making the execution times of different layers similar, which avoids pipeline bubbles and large intermediate buffers
    - Off-chip accesses of A are only required by the first layer; A is then forwarded through the layers
- Bandwidth analysis
    - Off-chip data access of the big adjacency matrix A, which is reused across layers
- Matrix blocking ![](https://hackmd.io/_uploads/Hy_ub2Iin.png)
    - A is partitioned into multiple blocks
    - Calculating t columns of A(XW) in parallel improves the data reuse of matrix A by t times
    - Scratchpad memory caches as much of A on-chip as possible

### E. The Workload Balance Problem
- The baseline architecture works well when non-zeros are evenly distributed among the rows of A.
- Evil rows and regionally clustered non-zeros in imbalanced power-law graph matrices cause inefficiency ![](https://hackmd.io/_uploads/Hkb5Q2Ij3.png)

## IV. AWB-GCN Architecture
![](https://hackmd.io/_uploads/rJ3Dwn8oh.png)

### A. Distribution Smoothing
- Averages out the workloads among neighboring PEs (up to 3 hops away)
    - TDQ-1: tasks are redirected before being pushed into the TQs
    - TDQ-2: the final layer of the multi-stage Omega network handles neighbor task forwarding
- Monitors runtime PE utilization by tracking the number of pending tasks in the TQs, and keeps offloading work from PEs with more pending tasks to their less busy neighbors.
- Offloaded work must be sent back to the ACC buffers of its original PE after processing. ![](https://hackmd.io/_uploads/ryGRo28s3.png)
- Works for local utilization fluctuations, but is not sufficient when
    - non-zeros are clustered in a region spanning many PEs, so that all neighbors are busy (PE20, 21, 22 in Fig. 11(a))
    - most non-zeros are clustered in only a few rows (evil rows), so that the major crests cannot be eliminated even if all neighboring PEs help (PE9 in Fig. 11(b))

    → Needs remote switching and evil row remapping

### B. Remote Switching
![](https://hackmd.io/_uploads/H1rkn3Uj3.png)
- Partially or completely exchanges the workloads of under- and overloaded PEs
- **Autotuner**
    - **PE Status Monitor (PESM)**
        - An *empty* signal is triggered when the counter of pending tasks in a TQ reaches zero.
        - The **XOR** results indicate which PEs have newly finished and are stored in the **Switch Candidate Buffer**.
        - The **AND** over all *empty* signals produces a *completion* signal.
    - **Switch Candidate Buffer**
        - Caches the newly-idle information of the most recent cycles
    - **Arbiter**
        - At the start of each round (after A has been completely sent to the TDQs), the arbiter scans the buffer and records the IDs of newly finished PEs until enough under-loaded PEs have been found (4 in Fig. 13; this number can be customized).
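The arbiter's pairing of early-idle and long-busy PEs can be modeled in software as below. This is a rough sketch, not the paper's exact hardware or formula: `finish_cycle`, the function names, and the 50% switch-fraction cap are my own illustrative assumptions; the real UGT uses threshold-based counting and a lookup table.

```python
def select_switch_pairs(finish_cycle, num_pairs=4):
    """Pair the PEs that go idle earliest (under-loaded) with those that
    stay busy longest (over-loaded), mimicking the arbiter's two scans of
    the Switch Candidate Buffer. finish_cycle[p] = cycle PE p went empty."""
    order = sorted(range(len(finish_cycle)), key=lambda p: finish_cycle[p])
    under = order[:num_pairs]          # first PEs to raise 'empty'
    over = order[-num_pairs:][::-1]    # last PEs to finish, busiest first
    return list(zip(over, under))      # (over-loaded, under-loaded) tuples

def rows_to_switch(gap_cycles, rows_per_pe, round_cycles):
    """Crude stand-in for the UGT: switch a fraction of the R rows per PE,
    proportional to the observed execution-cycle gap (capped at half)."""
    fraction = min(0.5, gap_cycles / (2 * round_cycles))
    return int(fraction * rows_per_pe)

pairs = select_switch_pairs([10, 90, 20, 80, 30, 70, 40, 60], num_pairs=2)
# pairs -> [(1, 0), (3, 2)]: PE1 (busiest) is paired with PE0 (idle first)
```

The feedback loop in the real design then observes the post-switch gap of each tuple in later rounds and adjusts the fraction, which this sketch omits.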
        - After the arbiter finds the first 4 idle PEs, it stops scanning the buffer and instead waits for the *completion* signal.
        - Whenever the arbiter receives the *completion* signal, it starts scanning the **Switch Candidate Buffer** again and continues until the four most over-loaded PEs have been found.
    - **Utilization Gap Tracker (UGT)**
        - Calculates the number of jobs (i.e., rows of A) to be switched in the i-th round ![](https://hackmd.io/_uploads/rJAFqpUin.png)
            - G~i~: workload gap of the selected PE tuple at the i-th round (approximated as the difference in execution cycles to finish all tasks)
            - R: the number of rows per PE under equal mapping
        - Tracks the post-switching utilization gaps of the PE tuples selected in prior rounds and uses them as feedback to adjust the switch fraction ![](https://hackmd.io/_uploads/ryMWcaUoh.png)
            - j: the number of rounds of fraction updates
        - Hardware-friendly approximation
            - Threshold-based counting for the execution-time gap
            - Table lookup for the number of rows to be switched, forwarded to the **WDC** with the corresponding PE IDs
    - **Workload Distribution Controller (WDC)**
        - At the start of the next round, the destination PEs of these rows are updated in the Shuffle Switches (SS).

### C. Evil Row Remapping
![](https://hackmd.io/_uploads/H1rkn3Uj3.png)
- Distributes an evil row to the most under-loaded PEs, whose neighbors can also help
- Row-remapping support is built into the remote-switching hardware (**Autotuner**)
    - UGT
        - A **comparator** identifies whether evil row remapping is required
            - It calculates the utilization gap between the most over- and under-loaded PEs at the end of each round → if the gap is too large, row remapping is performed by switching the overloaded workload to the **Super (Master) PE** temporarily in the next round
        - Sends the information to the WDC
    - WDC
        - If row remapping is triggered or an evil row is found, the entries of the Super- and Labor-PEs in the Distribution Switch Table in the WDC are updated
- Super (Master)- and Labor-PEs
    - **Super-PE**
        - Much bigger than the other PEs, as it also serves as a profiler to find the evil rows
            - Non-zero counter: records the number of non-zeros per row
            - Parallel sorting circuit: finds the evil rows containing the most non-zeros
        - The workload of each evil row is partitioned and distributed to a set of Labor-PEs controlled by the Super-PE in the next round
    - **Labor-PE**
        - Similar to a normal PE, but equipped with a small separate **ACC buffer** that stores the accumulation results of evil rows, which are summed into the **Evil Row ACC buffer**
        - The original workloads of the Labor-PEs can still be swapped with the most under-loaded PEs via remote switching

## V. Evaluation
- Implemented in Verilog HDL and measured on an Intel acceleration card D5005 equipped with a Stratix 10 SX FPGA
- Designs compared:
    - Baseline design
    - Design A: 1-hop distribution smoothing
    - Design B: 2-hop distribution smoothing
    - Design C: 1-hop distribution smoothing + remote switching and row remapping
    - Design D: 2-hop distribution smoothing + remote switching and row remapping
- Datasets: Cora, Citeseer, Pubmed, Nell, and Reddit
- GCN model: two-layer GCN (Standard_networks) [8], [29] ![](https://hackmd.io/_uploads/Byafh38oh.png) ![](https://hackmd.io/_uploads/Hy8Q238oh.png) ![](https://hackmd.io/_uploads/rk1Eh28jn.png) ![](https://hackmd.io/_uploads/H1SE3nIjn.png)
- Cross-platform comparison
    - PyG-based implementation on an Intel Xeon E5-2680v3 CPU (PyG-CPU)
    - PyG-based implementation on an NVIDIA RTX 8000 GPU (PyG-GPU)
    - SCNN [45] reproduction with 4096 multipliers (SystemC-based cycle-accurate simulator)
    - HyGCN (prior work on GCN inference acceleration) [31]
    - 4096-PE baseline AWB-GCN without workload rebalancing
    - AWB-GCN Design D with 4096 PEs
    - GCN models: Standard_ and HyGCN_networks ![](https://hackmd.io/_uploads/HJFEh38oh.png)

## VI. Related Work
- GNN (GCN) models
- Accelerating sparse CNNs (SpMM, SpMV)
- Eliminating workload imbalance
- Graph processing

## VII. Conclusion
- Proposed AWB-GCN to accelerate GCN inference, including three runtime workload-rebalancing techniques: distribution smoothing, remote switching, and row remapping.
- Evaluated AWB-GCN on an Intel FPGA D5005 Accelerator Card with 5 widely used GCN datasets. Results show that AWB-GCN achieves, on average, 3255×, 80.3×, and 5.1× speedups over high-end CPUs, GPUs, and prior work, respectively.
- It is generally efficient for GNNs whose major arithmetic primitives are also SpMMs, e.g., GraphSage [66], GINConv [67], and GTN [30].
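Since SpMM is the arithmetic primitive throughout, here is a minimal pure-Python model of the column-wise-product SpMM from Section III-B, reproducing the worked example there. The list-of-columns CSC-like representation and the function name are my own sketch conventions, not from the paper.

```python
def spmm_column_wise(S_csc, m, B):
    """Column-wise-product SpMM: column k of C = sum_j S_j * b_(j,k).
    S_csc[j] holds the (row, value) non-zeros of column j of S (CSC-like);
    m is the number of rows of S (and of C)."""
    n_cols_B = len(B[0])
    C = [[0] * n_cols_B for _ in range(m)]
    for k in range(n_cols_B):              # one round per output column of C
        for j, col in enumerate(S_csc):    # the whole sparse S is re-streamed
            b = B[j][k]                    # b_(j,k) is used once, then evicted
            if b == 0:
                continue
            for row, val in col:           # each non-zero pair is one PE MAC
                C[row][k] += val * b
    return C

# The worked example from Section III-B:
S_csc = [[(0, 1), (1, -1)], [(1, 3)], [(0, 2), (1, 1)]]  # [[1,0,2],[-1,3,1]]
B = [[3, 1], [2, 1], [1, 0]]
C = spmm_column_wise(S_csc, m=2, B=B)
# C -> [[5, 1], [4, 2]]
```

Note how the inner loop matches the hardware: each element b~(j,k)~ is broadcast once from the DCM, multiplied against all non-zeros of column S~j~, and the partial sums accumulate in per-row ACC-buffer slots of C's column k.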