# Hardware/Software Co-Programmable Framework for Computational SSDs to Accelerate Deep Learning Service on Large-Scale Graphs

link : [paper](https://www.usenix.org/system/files/fast22-kwon.pdf)

## Introduction

* GNNs need to deal with real-world graphs consisting of billions of edges and node embeddings.
* GNNs consist of various computing components, which are non-trivial to fully accelerate or parallelize on conventional computing hardware.
* We propose HolisticGNN, a hardware and software co-programmable framework that leverages CSSD (computational SSD) to accelerate GNN inference services near storage.
* The software part of HolisticGNN enables users to program various GNN algorithms and infer embedding(s) directly atop the graph data.
* The hardware part provides the fundamental logic to make CSSD fully programmable, along with an architectural environment that can accelerate various types of GNN inference with different hardware configurations.

## Background

* GNN
![](https://i.imgur.com/Kb1z2Kp.png)
* Graph Dataset Preprocessing
![](https://i.imgur.com/W8huqSm.png)
* Challenge Analysis
![](https://i.imgur.com/WXtbL9K.png)

## Storage as a GNN Accelerator

* HolisticGNN
    * The software part of the framework offers easy-to-use programming/management interfaces and performs GNN preprocessing directly where the data is stored, thereby minimizing the aforementioned storage access overhead.
    * The hardware logic and administration module provide a low-overhead bare-metal computing environment and reconfigurable hardware to accelerate GNN model execution.
![](https://i.imgur.com/jY3zTqu.png)
* Module Decomposition
    * Graph-centric archiving system (GraphStore)
    * Programmable inference model (GraphRunner)
    * Accelerator builder (XBuilder)

## Design Details and Implementation

* Efficient Storage Accesses for Graphs
    * GraphStore maintains the graph datasets as an adjacency list and an embedding table, handling the geometric information and the feature vectors separately (see the sketch below).
![](https://i.imgur.com/wBRe9My.png)
    * Adjacency list
![](https://i.imgur.com/92dnnoH.png)
    * Embedding table
![](https://i.imgur.com/QdgIqYk.png)
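As a concrete illustration of this split, here is a minimal Python sketch that keeps topology in an adjacency list and feature vectors in a separate embedding table, with a `get` helper standing in for GraphStore's batch preprocessing. All class and method names (`ToyGraphStore`, `add_edge`, `write_feature`, `get`) are made up for illustration and are not the paper's actual interface.

```python
# Hypothetical sketch of GraphStore's two structures, written in plain
# Python for illustration; names and layout are assumptions, not the
# paper's actual on-device implementation.
from collections import defaultdict

import numpy as np


class ToyGraphStore:
    """Keeps topology (adjacency list) separate from node feature
    vectors (embedding table), mirroring GraphStore's split between
    geometric information and embeddings."""

    def __init__(self, feature_dim: int):
        self.adj = defaultdict(list)   # vertex id -> list of neighbor ids
        self.emb = {}                  # vertex id -> feature vector
        self.feature_dim = feature_dim

    def add_edge(self, src: int, dst: int) -> None:
        # Undirected edge: record both directions in the adjacency list.
        self.adj[src].append(dst)
        self.adj[dst].append(src)

    def write_feature(self, vertex: int, vec: np.ndarray) -> None:
        # Feature updates touch only the embedding table, so they can
        # proceed independently of topology updates.
        assert vec.shape == (self.feature_dim,)
        self.emb[vertex] = vec

    def get(self, batch: list[int]) -> tuple[dict, dict]:
        # Stand-in for batch preprocessing ("Get"): gather the 1-hop
        # subgraph and the embeddings it needs for a batch of targets.
        sub_adj = {v: list(self.adj[v]) for v in batch}
        needed = set(batch).union(*map(set, sub_adj.values()))
        sub_emb = {v: self.emb[v] for v in needed if v in self.emb}
        return sub_adj, sub_emb


store = ToyGraphStore(feature_dim=4)
store.add_edge(0, 1)
store.add_edge(1, 2)
store.write_feature(1, np.ones(4))
print(store.get([0]))  # -> ({0: [1]}, {1: array([1., 1., 1., 1.])})
```

Keeping the two structures separate is what lets feature writes and topology updates be served (and preprocessed) independently, which matters for the GraphStore results in the evaluation.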
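The evaluation below repeatedly distinguishes two kernels inside a GNN layer: neighbor aggregation (irregular and SIMD/vector-bound) and feature transformation (a dense GEMM). A minimal single-layer GCN-style sketch, using a dense adjacency matrix and omitting normalization and bias for brevity, shows why the two phases stress different hardware; this is a simplified illustration, not the paper's kernel code.

```python
import numpy as np


def gcn_layer(adj: np.ndarray, feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One simplified GCN-style layer (normalization and bias omitted).

    adj:    (N, N) adjacency matrix with self-loops
    feats:  (N, F_in) node embeddings
    weight: (F_in, F_out) learned transformation matrix
    """
    # Aggregation: combine each vertex's neighbor embeddings. Real
    # adjacency matrices are sparse, so this phase is irregular and
    # SIMD/vector-bound rather than GEMM-friendly.
    aggregated = adj @ feats
    # Transformation: a dense GEMM, the part systolic arrays excel at.
    return np.maximum(aggregated @ weight, 0.0)  # ReLU activation


# Toy usage: 3 vertices, 2-dim input features, 2-dim output features.
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
feats = np.random.rand(3, 2)
weight = np.random.rand(2, 2)
print(gcn_layer(adj, feats, weight).shape)  # (3, 2)
```

A systolic array maps the second matmul well but stalls on the sparse first phase, while general-purpose cores handle the first phase but are slow on the GEMM; this mismatch is exactly what the accelerator comparison below measures.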
## Evaluation

* Prototypes
    * CSSD: a 14nm 730MHz FPGA chip, 16GB DDR4-2400 DRAM, and a 4TB high-performance SSD, all within the same PCIe 3.0 ×4 subsystem
    * Three sets of hardware accelerators for XBuilder's user logic:
        * Multi-core processor with eight cores (Octa-HGNN)
        * Large systolic array processor (Lsap-HGNN)
        * Heterogeneous accelerator with a vector processor and a systolic array (Hetero-HGNN)
    * GPU baselines: GTX 1060 and RTX 3090
* GNN models and graph datasets
    * GCN, GIN, and NGCF
    * 14 real-graph datasets from LBC, MUSAE, and SNAP
![](https://i.imgur.com/fPPD60K.png)
* End-to-end Performance Comparisons
    * HGNN shows 7.1× and 7.0× shorter latency than GTX 1060 and RTX 3090 across all the graph datasets except road-ca, wikitalk, and ljournal.
    * HGNN's advantage grows when inferring features on large-scale graphs (>3M edges), where it is 201.4× faster than GTX 1060 and RTX 3090, on average.
![](https://i.imgur.com/nRavoNO.png)
* Pure Inference Acceleration Comparison
    * Lsap-HGNN mostly spends its time on GEMM, since its systolic arrays accelerate the transformation well, but its performance degrades because of the large SIMD portion (aggregation), as in the layer sketch above.
    * Octa-HGNN is also limited, because the dense matrix computation on embeddings is slow on its general-purpose cores.
    * Hetero-HGNN can accelerate both SIMD and GEMM, so it successfully shortens both the aggregation and the transformation.
![](https://i.imgur.com/1hLRAyI.png)
* Performance Analysis on GraphStore
    * GraphStore shows 1.3× better bandwidth on graph updates than the conventional storage stack (bulk operations).
    * Since "Write feature" in the figure shows only the times longer than "Graph pre", GraphStore makes graph preprocessing completely invisible to users.
    * For the earliest batch, GraphStore performs batch preprocessing (Get) 1.7× (chmleon) and 114.5× (youtube) faster than the GPU-enabled host.
![](https://i.imgur.com/lcDiREA.png)
    * GraphStore takes 970ms for per-day updates on average (unit operations); even in the worst case, the accumulated latency is just 8.4 sec, a negligible 0.01% of the workload execution time.
![](https://i.imgur.com/4AGKb2c.png)

## Conclusion

Our empirical evaluations show that the inference time of HolisticGNN outperforms GNN inference services using high-performance modern GPUs by 7.1×, while reducing energy consumption by 33.2×, on average.