# Hardware/Software Co-Programmable Framework for Computational SSDs to Accelerate Deep Learning Service on Large-Scale Graphs
Link: [paper](https://www.usenix.org/system/files/fast22-kwon.pdf)
## Introduction
* GNNs need to deal with real-world graphs consisting of billions of edges and node embeddings.
* GNNs consist of various computing components, which are non-trivial to fully accelerate or parallelize over conventional computing hardware.
* We propose HolisticGNN, a hardware/software co-programmable framework that leverages a computational SSD (CSSD) to accelerate GNN inference services near storage.
* The software part of HolisticGNN enables users to program various GNN algorithms and infer embeddings directly atop the graph data where it resides.
* The hardware part provides the fundamental logic that makes the CSSD fully programmable, along with an architectural environment that can accelerate various types of GNN inference under different hardware configurations.
## Background
* GNN: inference proceeds layer by layer, where each layer aggregates the embeddings of a node's neighbors and then transforms the result through a dense weight matrix (see the sketch after this list).
* Graph Dataset Preprocessing: before inference, raw graph data must be converted into an in-memory representation (e.g., an adjacency list) and paired with its node feature vectors.
* Challenge Analysis: for large graphs, transferring the dataset from storage to the host/GPU and preprocessing it dominates end-to-end inference latency.
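As a concrete illustration of the two phases above, here is a minimal NumPy sketch of one GNN layer (mean-pooling aggregation followed by a ReLU-activated transformation); all names are illustrative, not the paper's code:

```python
import numpy as np

def gnn_layer(adj_list, H, W):
    """One GNN layer: aggregate neighbor embeddings, then transform them.

    adj_list : dict mapping node id -> list of neighbor ids
    H        : (num_nodes, in_dim) current node embeddings
    W        : (in_dim, out_dim) learned weight matrix
    """
    # Aggregation: sparse, irregular gathers and reductions (vector/SIMD work).
    agg = np.stack([H[[v] + adj_list[v]].mean(axis=0)  # self + mean of neighbors
                    for v in range(H.shape[0])])
    # Transformation: a single dense GEMM (systolic-array-friendly), plus ReLU.
    return np.maximum(agg @ W, 0.0)

adj = {0: [1], 1: [0, 2], 2: [1]}
H = np.random.rand(3, 4).astype(np.float32)
W = np.random.rand(4, 2).astype(np.float32)
print(gnn_layer(adj, H, W).shape)  # -> (3, 2)
```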

## Storage as a GNN Accelerator
* HolisticGNN
* The software part of the framework offers easy-to-use programming/management interfaces and performs GNN preprocessing directly where the data is stored, thereby minimizing the aforementioned storage access overhead.
* The hardware logic and administration module provide a low-overhead bare-metal computing environment and reconfigurable hardware to accelerate GNN model executions.

* Module Decomposition (see the illustrative sketch after this list)
* Graph-centric archiving system (GraphStore)
* Programmable inference model (GraphRunner)
* Accelerator builder (XBuilder)
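A hypothetical Python sketch of how these three modules could divide one near-storage inference request; every class and method name here is an assumption for illustration, not HolisticGNN's actual API:

```python
class GraphStore:
    """Graph-centric archiving: serves subgraphs and embeddings from the SSD."""
    def __init__(self, adj, feats):
        self.adj, self.feats = adj, feats
    def get(self, nodes):
        return ({v: self.adj[v] for v in nodes},
                {v: self.feats[v] for v in nodes})

class XBuilder:
    """Accelerator builder: loads an FPGA bitstream into the user logic."""
    def ensure(self, name):
        return name  # stands in for (partial) FPGA reconfiguration

class GraphRunner:
    """Programmable inference: runs a user-registered GNN plan on the device."""
    def infer(self, plan, subgraph, feats, device):
        return plan(subgraph, feats)  # device dispatch elided in this sketch

# One request flows GraphStore -> XBuilder -> GraphRunner inside the CSSD.
store = GraphStore({0: [1], 1: [0]}, {0: [1.0], 1: [2.0]})
sub, feats = store.get([0, 1])
dev = XBuilder().ensure("Hetero-HGNN")
print(GraphRunner().infer(lambda s, x: x, sub, feats, dev))
```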
## Design Details and Implementation
* Efficient Storage Accesses for Graphs
* GraphStore maintains the graph dataset as an adjacency list and an embedding table, which hold the geometric (connectivity) information and the feature vectors, respectively.

* Adjacency list: keeps the graph's connectivity in a storage-friendly layout so that a node's neighbors can be retrieved with few accesses.
* Embedding table: keeps a fixed-width feature vector per node, so any node's vector can be located directly by its index (see the layout sketch below).
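A minimal sketch of plausible layouts for the two structures; the CSR-style encoding below is an assumption for illustration, not the paper's exact on-disk format:

```python
import numpy as np

# Adjacency list in CSR form (connectivity / geometric information).
# Edges: 0->1, 0->2, 1->2, 2->0
row_ptr = np.array([0, 2, 3, 4])  # neighbors of v sit at col_idx[row_ptr[v]:row_ptr[v+1]]
col_idx = np.array([1, 2, 2, 0])  # neighbor ids, grouped by source node

# Embedding table: one fixed-width feature vector per node, so node v's
# vector sits at a directly computable offset (v * dim entries in).
dim = 4
emb = np.arange(3 * dim, dtype=np.float32).reshape(3, dim)

def neighbors(v):
    return col_idx[row_ptr[v]:row_ptr[v + 1]]

print(neighbors(0))       # -> [1 2]
print(emb[neighbors(0)])  # feature vectors of node 0's neighbors
```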

## Evaluation
* Prototypes
* CSSD: a 14nm 730MHz FPGA chip, 16GB DDR4-2400 DRAM, and a 4TB high-performance SSD, all within the same PCIe 3.0 ×4 subsystem
* Three sets of hardware accelerators for XBuilder's user logic
* Multi-core processor with eight cores (Octa-HGNN)
* Large systolic array processors (Lsap-HGNN)
* Heterogeneous accelerator combining a vector processor and a systolic array (Hetero-HGNN)
* GPU
* GTX 1060 and RTX 3090
* GNN models and graph datasets.
* GCN, GIN, and NGCF
* 14 real-world graph datasets from LBC, MUSAE, and SNAP

* End-to-end Performance Comparisons
* HGNN shows 7.1× and 7.0× shorter latency than GTX 1060 and RTX 3090, respectively, across all graph datasets except road-ca, wikitalk, and ljournal.
* HGNN's benefit grows when inferring features on large-scale graphs (>3M edges), where HGNN is 201.4× faster than GTX 1060 and RTX 3090, on average.

* Pure Inference Acceleration Comparison
* Lsap-HGNN accelerates the transformation (GEMM) well with its systolic arrays, so GEMM dominates its execution, but overall performance suffers from the large portion of SIMD operations.
* Octa-HGNN is also limited because the dense matrix computation of the embedding transformation runs poorly on general-purpose cores.
* Since Hetero-HGNN can accelerate both SIMD and GEMM, it successfully shortens both the aggregation and the transformation (see the sketch below).
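A NumPy sketch of why the two phases favor different hardware, as observed above: aggregation is an irregular gather-and-reduce (vector/SIMD work), while transformation is one regular dense GEMM (systolic-array work); this is a stand-in illustration, not the paper's kernels:

```python
import numpy as np

num_nodes, in_dim, out_dim = 4, 8, 8
H = np.random.rand(num_nodes, in_dim).astype(np.float32)   # node embeddings
W = np.random.rand(in_dim, out_dim).astype(np.float32)     # layer weights
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}

# Aggregation: data-dependent gathers + reductions; a poor fit for
# systolic arrays (Lsap-HGNN) but a good fit for a vector processor.
agg = np.stack([H[adj[v]].sum(axis=0) for v in range(num_nodes)])

# Transformation: one regular dense matrix multiply; a poor fit for
# general-purpose cores (Octa-HGNN) but ideal for a systolic array.
out = agg @ W
print(out.shape)  # -> (4, 8)
```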

* Performance Analysis on GraphStore
* GraphStore shows 1.3× better bandwidth on graph updates than the conventional storage stack (bulk operations).
* Since Write feature in the figure always takes longer than Graph pre, GraphStore makes the graph preprocessing completely invisible to users.
* For the earliest batch preprocessing, GraphStore performs batch preprocessing 1.7× (chmleon) and 114.5× (youtube) faster than the GPU-enabled host (batch preprocessing, Get).

* For unit operations, GraphStore takes 970ms per day of updates on average, and its worst-case accumulated latency is just 8.4 sec, a negligible fraction (0.01%) of the workload execution time.

## Conclusion
Our empirical evaluations show that HolisticGNN outperforms GNN inference services running on high-performance modern GPUs by 7.1× in inference time while reducing energy consumption by 33.2×, on average.