[paper link](https://arxiv.org/pdf/2301.10388.pdf)
[TOC]
## Overview
### <font color = darkred> Challenge: </font>
1. First, we need a special format for the GCN intermediate features.
2. Second, existing sparse DNN accelerators do not fit well for handling sparse features of GCNs. The focus should be on reducing the memory traffic volume, not on reducing the amount of computation.
3. Lastly, the varying level of sparsity makes it difficult to handle locality with tiling techniques.
### <font color = #1560BD> Solution: </font>
* SGCN, a fast and energy-efficient GCN accelerator which minimizes off-chip DRAM accesses by fully exploiting the intermediate feature sparsity of modern GCNs.
1. Bitmap-index Embedded In-place CSR (BEICSR) format that is tailored for GCN execution
2. Microarchitectures and the pipeline structure for executing the sparsity-aware GCNs
3. Sparsity-aware cooperation for handling the varying working set size due to the dynamic level of sparsity
## I. INTRODUCTION
* The early challenge of GCNs has been to deal with the high sparsity of input graph topology. It is known that the graph topology data exhibit near-100% sparsity [41].
* In contrast to the topology data, the per-vertex feature data has often been treated with dense array representation by existing GCN accelerators because the sparsity of the features is far lower than that of the graph topology.
* Prior efforts targeting the feature accesses were often focused on input features or relied on the fact that the feature width shrinks towards the output layer. Those approaches were attractive with traditional GCNs, especially when the number of layers is small (≤ 5) and each layer width is carefully tuned.
* However, recent advances in the GCN are leading to different circumstances. With the introduction of residual connections, modern GCNs now have tens to hundreds, or even a thousand layers, where the feature width remains constant throughout the entire network.

* When fully exploited, this intermediate feature sparsity can provide higher performance and energy efficiency. Considering that the majority of the GCN execution is memory-intensive and shows repetitive random accesses, the expected benefit is significant.
## II. MOTIVATION
### A. Intermediate Feature Sparsity

1. First, the addition of residual connections immediately increases the sparsity to a much higher level [66].
2. Deeper networks have more sparsity in general.
* From the two observations, we found that high intermediate feature sparsity exists in modern GCNs across datasets, and is likely to exist in future GCNs too.
* <font color=darkred>The goal of SGCN is to utilize this characteristic to design an accelerator with better performance and energy efficiency.</font>
### B. Need for a New Accelerator Architecture
* The goal is to exploit the ample amount of intermediate feature sparsity for performance benefits.

* Despite the large potential from the intermediate features, the results show that naïvely supporting sparse features on a GCN accelerator results in little or negative speedup.
## III. BACKGROUND
### A. Graph Convolutional Networks
* In a basic GCN, the outgoing features of the $l$-th layer, $X^{l+1}$, are computed as:

$$X^{l+1} = \sigma(\tilde{A} \cdot X^l \cdot W^l)$$

* $\sigma$ is a non-linear activation function (e.g., ReLU)
* $\tilde{A}$ is the adjacency matrix of the graph
* $X^l$ and $W^l$ denote the incoming features and weights of the $l$-th layer
* In the past, it was believed that GCNs deeper than five layers would lead to very low accuracy, a phenomenon widely known as over-smoothing.
* As the technology advances, GCNs have evolved to employ residual connections:

$$X^{l+1} = \sigma(\tilde{A} \cdot X^l \cdot W^l) + X^l$$
* The calculation of $\tilde{A} \cdot X^l$ and its variants are usually called aggregation, which does not involve trainable weights.

* With the introduction of residual connections, it was found that GCNs can be as deep as convolutional neural networks, and tens [10], [44], [45] to even a thousand layers [43] are being used for state-of-the-art GCNs.
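To make the aggregation/combination split concrete, below is a minimal NumPy/SciPy sketch of one residual GCN layer following the formulas above; the function name and the combination-first ordering are illustrative choices for the example, not taken from the paper.

```python
import numpy as np
import scipy.sparse as sp

def residual_gcn_layer(A_tilde: sp.csr_matrix, X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One residual GCN layer: X^{l+1} = sigma(A~ . X^l . W^l) + X^l."""
    H = X @ W                      # combination: dense GEMM with the trainable weights W^l
    H = A_tilde @ H                # aggregation: sparse A~ times features, no trainable weights
    return np.maximum(H, 0.0) + X  # ReLU activation plus the residual connection
```

Because the residual addition requires $X^{l+1}$ and $X^l$ to have the same shape, $W^l$ must be square, which matches the observation that the feature width remains constant throughout the network. Computing $X W$ before multiplying by $\tilde{A}$ corresponds to the combination-first ordering; the aggregation-first ordering simply swaps the two steps.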
### B. Hardware Acceleration of GCNs

* GCN accelerators can be further classified into two types depending on which of the aggregation and combination phases is performed first.

## IV. DESIGN GOALS
### From the motivation, our objective is to devise a sparse format for the features and design an accelerator microarchitecture that efficiently processes them.
1. First, the primary target should be the memory access reduction, not the computation or the capacity.
* The primary bottleneck of GCN execution is known to be the aggregation phase, which is extremely memory intensive.
2. Second, the format and the resulting access pattern should be cache- and DRAM-friendly.
* In contrast to CNNs where the feature accesses can be done sequentially, GCN aggregation incurs random accesses to small per-vertex features (i.e., a few cachelines each).
3. Third, the design should remain compatible with existing GCN accelerator techniques (e.g., feature matrix slicing).
## V. SGCN ARCHITECTURE
### A. Compression Format
* The ultimate purpose of compression in SGCN is to reduce off-chip memory accesses.

#### Embedded Bitmap Index
* Instead of column indices, we use bitmap indices for each vertex
* Therefore, the choice of embedding the bitmap index in the same array with the non-zero values yields a better memory access pattern.
#### In-place Compression
* We compress the data row-by-row and store each row at the same reserved place in memory as if the row were left uncompressed. Even though this provides no benefit to the memory capacity, we argue that such a choice is almost necessary for the following reasons.
1. First, it would provide reduced off-chip traffic aligned to the cacheline boundaries.
2. Second, it allows parallel writes at the output of each layer.
3. Lastly, because the sizes of the rows are uniform, there is no need for an indirection array.
* This scheme reduces the memory access count, improves row-buffer locality, and achieves better memory bandwidth utilization. A sketch of the row-level compression is given below.
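The following Python sketch illustrates the embedded bitmap index and in-place storage described above; the feature width, element type, and function names are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

FEATURE_WIDTH = 128   # illustrative feature width (elements per vertex row)
ELEM_BYTES = 4        # float32 feature values

def compress_row(row: np.ndarray) -> bytes:
    """Pack one feature row as [bitmap | non-zero values], padded to the row's
    original dense size so every row keeps the same reserved, aligned slot."""
    mask = row != 0
    bitmap = np.packbits(mask)                          # 1 bit per element, stored in the head
    nonzeros = row[mask].astype(np.float32).tobytes()   # only the non-zero values follow
    slot = bytearray(FEATURE_WIDTH * ELEM_BYTES)        # in-place: same size as the dense row
    slot[:bitmap.nbytes] = bitmap.tobytes()
    slot[bitmap.nbytes:bitmap.nbytes + len(nonzeros)] = nonzeros
    return bytes(slot)

def decompress_row(slot: bytes) -> np.ndarray:
    """Recover the dense row from the embedded bitmap and the packed non-zeros."""
    head = FEATURE_WIDTH // 8
    mask = np.unpackbits(np.frombuffer(slot[:head], dtype=np.uint8)).astype(bool)
    nnz = int(mask.sum())
    row = np.zeros(FEATURE_WIDTH, dtype=np.float32)
    row[mask] = np.frombuffer(slot[head:head + nnz * ELEM_BYTES], dtype=np.float32)
    return row
```

Although each row still occupies a dense-sized slot, only the cachelines that actually hold the bitmap and the non-zero values need to be fetched, which is where the traffic reduction comes from.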
### B. Supporting Feature Matrix Slicing
* When compressing the entire row together, accessing a slice of the feature matrix involves the following sequence:
1) Read the bit vector index to find the range of non-zero values corresponding to the slice
2) Read multiple cachelines that contain non-zero values from the slice
3) Perform aggregation.
**The pitfall of the above sequence is the overhead of unaligned accesses.**

* Instead of a single set of bit vector indices for the entire row, we partition the bit vector and embed it in the head of each corresponding unit slice.
* Then we align the slices to the burst boundaries. Using the in-place compression from Section V-A for each slice, we allocate the memory space that can hold the maximum number of non-zeros within them.
* With the right choice of the unit slice size C, the amount of wasted memory access can be minimized because the per-slice non-zero counts have a small variance with only a few outliers (see the sketch below).
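The per-slice layout can be sketched as follows; sizing each slice slot as "bitmap plus the maximum observed non-zero count, rounded up to the burst size" is my reading of the description above, and the constants are illustrative assumptions.

```python
import numpy as np

def slice_layout(X: np.ndarray, C: int, elem_bytes: int = 4, burst: int = 64):
    """For each unit slice of width C, compute a fixed per-vertex allocation that
    holds the slice's bitmap plus the worst-case number of non-zeros, aligned to
    the DRAM burst boundary so slice accesses never straddle a burst."""
    n, F = X.shape
    assert F % C == 0, "feature width must be a multiple of the unit slice size"
    layout = []
    for s in range(F // C):
        cols = X[:, s * C:(s + 1) * C]
        max_nnz = int((cols != 0).sum(axis=1).max())       # worst case over all vertices
        raw = C // 8 + max_nnz * elem_bytes                # embedded bitmap + padded values
        slot = (raw + burst - 1) // burst * burst          # align to the burst boundary
        layout.append({"slice": s, "max_nnz": max_nnz, "slot_bytes": slot})
    return layout
```

With a good choice of C, the padding wasted on the few outlier rows stays small, matching the argument above.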
### C. Sparsity-Aware Cooperation for Graph Topology Tiling
* Existing approaches usually find the optimal tile size based on an off-line analysis, often by statically calculating the working set size [88], [47].
* However, with SGCN, the level of sparsity varies dynamically depending on each vertex, dataset, and layer. Because of this, estimating the optimal tile size is very difficult, and a working set size exceeding the cache capacity results in significant performance degradation.

* When the sparsity of the features is lower than expected, the effective working set size increases because there are more non-zeros per vertex. This has the risk of exceeding the cache capacity, and the performance would quickly drop due to the thrashing pattern.
* To address this issue, we propose sparsity-aware cooperation, a method that alters the access pattern such that there are variously sized working sets that can be captured by the caches.

* Sparsity-aware cooperation takes advantage of this: each engine accesses a small strip of vertices in an interleaved manner (see the sketch below).

* As a result, when the sparsity level is high, the cache will capture the larger working set window (denoted as ‘large window’), and when the sparsity is low, the cache can capture a smaller working set window to avoid thrashing (denoted as ‘small window’).
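A rough sketch of the interleaved strip schedule is below; the strip size, engine count, and scheduling details are illustrative assumptions, since the exact hardware policy is not spelled out in these notes.

```python
def cooperative_schedule(num_vertices: int, num_engines: int, strip: int):
    """Assign vertices to aggregation engines in small interleaved strips instead of
    one large contiguous block per engine, so neighboring engines share a working
    set window whose effective size tracks the dynamic feature sparsity."""
    schedule = {e: [] for e in range(num_engines)}
    stride = num_engines * strip
    for base in range(0, num_vertices, stride):
        for e in range(num_engines):
            lo, hi = base + e * strip, min(base + (e + 1) * strip, num_vertices)
            if lo < hi:
                schedule[e].append((lo, hi))   # engine e processes vertices [lo, hi) next
    return schedule

# Example: 3 engines, strips of 4 vertices over a 24-vertex tile
# engine 0 -> (0,4), (12,16); engine 1 -> (4,8), (16,20); engine 2 -> (8,12), (20,24)
```

When the features are sparser than expected, the cache ends up holding many such strips (the large window); when they are denser, it still holds at least the nearby strips (the small window), avoiding thrashing.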
### D. Sparse Aggregation

* Each sparse aggregator of SGCN has 16 multipliers, which can process a single cache line worth of data together
**STEP:**
1. When a row of the feature matrix X is selected, its first 64 bytes are fetched to the aggregation engine, where its head contains the bitmaps and the rest contains the non-zero values.
2. The non-zero values are multiplied with the corresponding edge weight of $\tilde{A}$, which is broadcast to each multiplier
3. (2′) In parallel, the bitmap is processed by a parallel prefix-sum unit to convert the 1’s in the bitmap into reversed indices to the non-zero values.
4. The bitmap and the reversed indices are sent to the accumulator. If the bitmap value is 1, the accumulators at the corresponding positions load the multiplier outputs and add them to the current value.
5. When the accumulation for a single vertex is complete, it is sent to the combination engine for performing combination, ReLU, and compression.
6. (Optional) When there are still non-zeros remaining in the next cacheline (identified by the prefix-sum result), the next 64 bytes are fetched and steps 1 - 4 are performed again (see the sketch below).
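The prefix-sum trick in steps 2-4 can be mimicked in a few lines of NumPy; this is a simplified, row-at-a-time sketch (the hardware works cacheline by cacheline), and the function name is made up for the example.

```python
import numpy as np

def sparse_accumulate(acc: np.ndarray, bitmap: np.ndarray,
                      nonzeros: np.ndarray, edge_weight: float) -> np.ndarray:
    """Accumulate one neighbor's compressed feature row, scaled by its edge weight.
    bitmap: 0/1 per feature position; nonzeros: the packed non-zero values."""
    products = nonzeros * edge_weight        # multipliers: non-zeros times the broadcast weight
    prefix = np.cumsum(bitmap) - 1           # prefix sum turns each '1' into its rank (index)
    ones = np.flatnonzero(bitmap)            # feature positions that hold a non-zero
    acc[ones] += products[prefix[ones]]      # accumulators load/add at those positions only
    return acc

# Usage: acc = np.zeros(feature_width); call sparse_accumulate(...) once per neighbor
```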
### E. Post-Combination Compression

* To avoid the extra memory access for the compression, we place the compressor at the output stage of the combination engine, which uses an output-stationary systolic array. One compressor entry, containing a BEICSR buffer and compression logic, is assigned to each row of the systolic array.
**STEP:**
1. When the combination phase is complete, the data are streamed to the compressor after processing the residual addition and the ReLU activation.
2. The compression logic checks whether the output value is zero.
3. If the output value is zero, a ‘0’ is appended to the bitmap index in the compression entry.
4. (3′) When the output value is non-zero, a ‘1’ is appended to the bitmap index
5. The output value is saved to the location pointed to by the counter. The compressor repeats steps 1 - 4 for each output value from the systolic array (see the sketch below).
6. After the compressor has processed a unit slice amount of data, the data stored in the buffer is flushed to the DRAM, and the compressor is re-initialized.
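A software analogue of one compressor entry might look like the following; the slice-width handling and the flushed byte layout are illustrative assumptions.

```python
import numpy as np

class SliceCompressor:
    """Mimics one compressor entry attached to a systolic-array row: for each streamed
    output value it appends a bitmap bit, stores non-zeros compactly, and flushes a
    [bitmap | non-zeros] slice once a full unit slice has been seen."""

    def __init__(self, slice_width: int):
        self.slice_width = slice_width
        self._reset()

    def _reset(self):
        self.bits = []      # bitmap index under construction
        self.values = []    # non-zero values, written at the position the counter points to

    def push(self, value: float):
        if value == 0.0:
            self.bits.append(0)            # zero output: append a '0' to the bitmap
        else:
            self.bits.append(1)            # non-zero output: append a '1' ...
            self.values.append(value)      # ... and store the value compactly
        if len(self.bits) == self.slice_width:
            slice_bytes = self._flush()    # full unit slice: write back, then re-initialize
            self._reset()
            return slice_bytes
        return None

    def _flush(self) -> bytes:
        bitmap = np.packbits(np.array(self.bits, dtype=np.uint8))
        return bitmap.tobytes() + np.asarray(self.values, dtype=np.float32).tobytes()
```

Residual addition and ReLU happen upstream of this stage, so the compressor only sees the final output values of the layer.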
### F. Putting It All Together

## VI. EVALUATION
### A. Experimental Setup


* We designed the accelerator using Verilog HDL and synthesized it using Synopsys Design Compiler with the 45 nm OpenPDK library and scaled it to 32 nm to match the tech node of the on-chip memory model.
* The baseline chip area of GCNAX is 3.95 mm², and the area of the accelerator with SGCN is 4.05 mm², a 2.5% area overhead accounting for the added logic in the sparse aggregation and compression engines.
* To measure performance, we designed a cycle-accurate simulator written in C++. The model is based on SCALESim [67] for the systolic array used in the combination engine, which we extended with an in-house module for the sparse aggregation engine and compression engine.
### B. Fast and Energy-Efficient GCN Executions
#### Performance

#### Ablation Study

#### Energy Consumption

#### Memory Access Breakdown

### C. Sensitivity Studies
#### Number of GCN Layers

* This shows that the performance gain from SGCN is not fine-tuned to a certain number of layers, and can be broadly used for various specifications.
#### Cache Size
* In general, the speedup from the sparsity of features is not greatly affected by the cache size unless the data entirely fits into the cache. With a small cache, the benefit of sparsity-aware cooperation becomes marginal because it becomes harder to capture any locality. However, even with larger caches, the speedup remains relatively consistent.
#### GCN Variants

* Because SGCN can efficiently reduce the accesses to the feature matrix, the speedup slightly increases compared to that of the vanilla GCN.
* In contrast, GraphSAGE applies random sampling on the edges to reduce the computational overhead. It reduces the effective edge count of the graph topology and thus the portion of aggregation.
* Thus, SGCN experiences slightly less, but still significant speedup over the prior arts.
#### Choice of Slice Size

* According to the experiment, the best performance overall is at C = 96, but a poor choice still provides a great amount of speedup over the baseline.
#### Scalability

* When enough bandwidth is provided, increasing the number of engines provides an almost linear amount of speedup up to around eight engines, demonstrating good scalability.
* The scalability starts saturating at around 16 engines, which is where the system reaches near the maximum bandwidth of the memory module.
## VII. DISCUSSION
### A. Range of Sparsity

* As displayed, SGCN shows better performance across almost the entire sparsity range. The dense format is better only at sparsity under 5%, because of the additional bitmap indices needed for BEICSR.
* The break-even point for CSR is at over 90% sparsity, which is the point where the size of the column indices in CSRs becomes smaller than the bitmap indices of BEICSR.
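As a rough back-of-the-envelope check (assuming 2-byte CSR column indices and feature width $F$, both of which are assumptions made here for illustration): BEICSR's bitmap costs $F/8$ bytes per row regardless of sparsity, while CSR's column indices cost $2(1-s)F$ bytes at sparsity $s$, so CSR only becomes smaller when

$$2(1-s)F < \frac{F}{8} \;\Longleftrightarrow\; s > 1 - \frac{1}{16} = 93.75\%,$$

which is consistent with the reported break-even point of over 90% sparsity.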
### B. Applicability to Future GCNs
* As demonstrated in Fig. 19, SGCN is beneficial over a wide range of sparsity. We would like to argue that the future trend would place the sparsity level in a similar range.
## VIII. RELATED WORK
### A. GCN Accelerators
### B. Graph Processing Accelerators
### C. Sparse DNNs
### D. Memory Bandwidth Optimization for DNN Execution
## IX. CONCLUSION
1. SGCN is a GCN accelerator that exploits sparsity in intermediate features.
2. Traditional shallow GCNs lack the high intermediate feature sparsity found in deeper GCNs with residual connections, and thus had little potential to benefit from it.
3. The paper proposes BEICSR, a sparse feature representation format designed for the sparse aggregation phase of GCN execution.
4. BEICSR uses embedded bitmap indices, in-place compression, and supports feature slicing techniques commonly used in GCN accelerators.
5. Sparsity-aware cooperation is introduced to handle varying sparsity and changing access patterns efficiently.
6. A naïve attempt to utilize feature sparsity could lead to performance degradation.
7. SGCN demonstrates superior performance and energy efficiency compared to previous state-of-the-art GCN accelerators.
8. Exploiting sparsity in feature data optimizes memory traffic in GCN execution.