# Astraea: Towards QoS-Aware and Resource-Efficient Multi-stage GPU Services

#### [paper](https://dl.acm.org/doi/10.1145/3503222.3507721)

###### tags: `GPUs`

#### ASPLOS 2022

## Introduction

### Problem

The QoS of GPU-based microservices is hard to ensure due to the different communication patterns and shared-resource contention.

1. Communication overhead should be a first-class factor when deploying GPU microservices. Deploying the microservice graph based on the interconnect topology of the GPUs matters.
2. The optimal deployment changes dynamically due to factors such as the real-time user load, performance/resource trade-offs, resource contention, and communication overhead.

![](https://i.imgur.com/MJlRTrO.png)

## Solution

![](https://i.imgur.com/KpDUSFX.png)

1. A comprehensive characterization of GPU microservices. The characterization helps address the challenges in managing GPU microservices.
2. An online microservice performance predictor. The ML-based predictor precisely predicts the global memory bandwidth usage, duration, and throughput of each GPU microservice under various resource configurations.
3. A lightweight deployment policy for GPU microservices. The policy considers communication overhead, global memory capacity, shared-resource contention, and pipeline stalls when managing GPU resources.
4. An auto-scaling GPU communication framework. Similar to unified communication frameworks for CPUs such as Thrift and gRPC, the proposed framework enables auto-scaling without modifying the microservice source code, no matter whether the microservices are on the same GPU, different GPUs, or different nodes.

## PERFORMANCE ISSUES OF GPU MICROSERVICES

### Benchmarks and Experimental Platforms

![](https://i.imgur.com/mTgFKrG.png)

### Characterizing GPU Microservices

1. Large communication overhead.
   ![](https://i.imgur.com/yCjAHxn.png)
   Communication takes 26.4% to 46.9% of the end-to-end latency. It is crucial to minimize communication overhead when deploying multi-stage GPU microservices.
2. Require adaptive unified communication. A unified, low-overhead communication framework that scales according to the deployment without modifying the microservice source code is required.
3. Limited global memory space.
   ![](https://i.imgur.com/T7qpR4W.png)
   We are not able to deploy multiple instances of a global-memory-consuming microservice on a single GPU.
4. Shared-resource contention.
   ![](https://i.imgur.com/FfWJ24i.png)
   Microservices on the same GPU contend for PCI-e bandwidth and global memory bandwidth, although the SMs are explicitly allocated (a small sketch of per-process SM quotas follows this list). It is crucial to handle the unstable runtime contention caused by the dynamic microservice deployment.
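The "explicitly allocated" SMs above refer to per-process compute quotas under CUDA MPS (which the paper also relies on via its process pool). A minimal sketch of how such a quota could be applied when launching a microservice process, assuming an MPS daemon is already running; the `worker.py` entry point and `--stage` flag are hypothetical, and this is illustrative rather than the paper's actual launcher:

```python
import os
import subprocess


def launch_with_sm_quota(cmd, sm_percentage):
    """Start a microservice process whose kernels may use at most
    `sm_percentage` percent of the GPU's SMs (enforced by CUDA MPS)."""
    env = os.environ.copy()
    # MPS reads this variable when the client process creates its CUDA context.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(cmd, env=env)


# e.g., give a feature-extraction stage 30% of the SMs and a ranking stage 60%;
# PCI-e and global memory bandwidth remain shared and are not capped this way.
p1 = launch_with_sm_quota(["python", "worker.py", "--stage", "feature"], 30)
p2 = launch_with_sm_quota(["python", "worker.py", "--stage", "ranking"], 60)
p1.wait()
p2.wait()
```

Note that only the SM share is controlled here, which is exactly why the characterization still observes contention on PCI-e and memory bandwidth.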
### Design Principles of Astraea

1. The microservice deployment on GPUs should consider the different communication overheads. The communication overhead across nodes, across GPUs on the same node, and within a single GPU shows large gaps. Astraea should account for these differences when deploying GPU microservices, because the data passed between stages is relatively large. It is better to use the fewest nodes to host a service, to keep cross-node communication overhead low.
2. The communication overhead between microservices on the same GPU should be reduced. CPU-GPU data transfers between microservices result in long end-to-end latency, and PCI-e bandwidth contention further increases communication overhead and latency.
3. Microservices have to be scheduled across GPUs/nodes considering the limited global memory space. Since global memory space is one of the resource bottlenecks of a GPU, Astraea should be able to use multiple GPUs to host an entire multi-stage user-facing service.
4. The microservice pipeline efficiency should be maximized while achieving the required QoS online. Since pipeline efficiency is affected by both the percentage of SM resources allocated to each microservice and the runtime contention behavior, Astraea considers runtime contention on the shared resources (e.g., global memory bandwidth).

## THE ASTRAEA METHODOLOGY

![](https://i.imgur.com/oLOXhaj.png)

The performance model predicts the duration and shared-resource usage of a microservice stage when it is allocated a given number of SMs on a GPU (Section 5). Different from other performance models, this model also predicts the shared-resource usage (consumed global memory bandwidth). The shared-resource usage is used to quantify the runtime contention between microservices on a GPU, since only the SMs can be explicitly allocated. A process-pool technique is also used to enable dynamic SM allocation. Astraea first collects the performance data of each microservice online and trains the performance predictor until its accuracy meets the requirement. When the load or resources change, Astraea predicts the performance and shared-resource usage of each microservice under different resource configurations.

## PREDICTING PERFORMANCE AND RESOURCE USAGE

For each microservice, we use the microservice's input parameters, input data size, batch size, and percentage of computational resources as input features. Since the QoS target of a user query is within hundreds of milliseconds to support smooth user interaction [30], it is crucial to choose a modeling technique that offers both high accuracy and low complexity for online prediction.

![](https://i.imgur.com/wVmTOx0.png)

We therefore choose GBDT as our performance modeling technique.
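A minimal sketch of this kind of per-microservice GBDT predictor, using scikit-learn's gradient boosting as a stand-in for the paper's model; the feature layout, placeholder numbers, and `profile_samples` table are illustrative assumptions, not data from the paper:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each profiled sample: (input_data_size_kb, batch_size, sm_percentage) ->
# (duration_ms, global_mem_bandwidth_GBps). Values are placeholders standing
# in for the online profiling data Astraea would collect.
profile_samples = np.array([
    # size_kb, batch, sm%,  dur_ms, bw_GBps
    [128.0,    4.0,   20.0, 18.0,   95.0],
    [128.0,    8.0,   40.0, 12.5,  160.0],
    [256.0,    8.0,   60.0, 10.2,  210.0],
    [256.0,   16.0,   80.0,  9.1,  240.0],
])
X = profile_samples[:, :3]

# One GBDT per predicted quantity: stage duration and bandwidth usage.
duration_model = GradientBoostingRegressor().fit(X, profile_samples[:, 3])
bandwidth_model = GradientBoostingRegressor().fit(X, profile_samples[:, 4])

# Query the models for a candidate resource configuration.
candidate = np.array([[256.0, 16.0, 50.0]])  # 256 KB input, batch 16, 50% SMs
print("predicted duration (ms):  ", duration_model.predict(candidate)[0])
print("predicted bandwidth (GB/s):", bandwidth_model.predict(candidate)[0])
```

The predicted bandwidth feeds the contention constraints in the deployment step below, while the predicted duration feeds the QoS constraint.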
## DEPLOYING MICROSERVICES

### Contention-Aware Resource Allocation

* Determine the minimum number of GPUs. Astraea uses the predicted number of floating-point operations and the global memory footprint of the microservices to determine the minimum number of GPUs required.
* Fine-grained resource allocation. Uses a weighted longest-path algorithm (NP-hard), subject to the following constraints:
  1. The accumulated global memory bandwidth required by all the microservices on a GPU should be less than its available global memory bandwidth.
  2. The accumulated SM quotas allocated to concurrent instances should not exceed the overall available SMs.
  3. The number of microservice instances on a GPU should not exceed 48 (MPS allows at most 48 client-server connections per device).
  4. The global memory capacity used should be smaller than the GPU's global memory limit.
  5. The latency of the entire service should be smaller than the QoS target.
* Resource allocation space exploration. We use a heuristic approach that effectively avoids local optima to solve the problem.

### Identifying Appropriate Deployment

![](https://i.imgur.com/GqxWmzA.png)

## AUTO-SCALING COMMUNICATION FRAMEWORK

### Unified Communication API

We propose a unified communication API for programmers. With our API (Listing 1), developers only need to set a unique identifier for each stage.

### Optimizing Intra-GPU Communication

![](https://i.imgur.com/tfJk0l8.png)

## EVALUATION OF ASTRAEA

### Experimental Setup

![](https://i.imgur.com/jpvkY7T.png)

1. Load generation: the arrival times of user requests follow an exponential distribution, as in CPU microservice studies (a small load-generator sketch appears at the end of these notes).
2. Comparison baselines:
   * FIRM, a resource management system for CPU microservices
   * Laius, a resource management work for GPU co-location

![](https://i.imgur.com/P51ULpf.png)

### Maximizing Throughput and Guaranteeing QoS

### Considering Bandwidth Contention and Effect of Global Memory-based Communication

* Astraea-NC: an Astraea variant that disables the global memory bandwidth contention constraint.
* Astraea-NG: an Astraea variant that disables global memory-based communication.

![](https://i.imgur.com/y97TaYq.png)

### Overhead of Astraea

* Training overhead: the overhead of training the models for predicting microservice performance is acceptable. Collecting the training samples of all microservices and running the training process finished within 60 minutes on a single GPU. For online prediction, each prediction completes within 1 ms.
* Resource allocation overhead: this operation completes within 10 ms on a single CPU in the experiments.
* Communication overhead: the setup operation for a pair of microservices using CUDA IPC is one-off, performed when the end-to-end service is launched, and completes within 1 ms.
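As a small illustration of the load-generation setup referenced above, the sketch below draws exponentially distributed inter-arrival times (i.e., Poisson arrivals) for a target average query rate. The rate, query count, and `send_query` callback are placeholders and not the paper's actual load generator:

```python
import time
import numpy as np


def send_query(i):
    # Placeholder for issuing request i to the user-facing service.
    print(f"query {i} issued at {time.time():.3f}")


def generate_load(qps, num_queries, seed=0):
    """Issue queries whose inter-arrival times are exponentially
    distributed with mean 1/qps (a Poisson arrival process)."""
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_queries)
    for i, gap in enumerate(gaps):
        time.sleep(gap)
        send_query(i)


generate_load(qps=50, num_queries=10)
```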