# PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers
###### tags: `GPUs`
###### paper origin: DAC '22
###### paper: [link](https://people.ece.ubc.ca/aamodt/publications/papers/saed.micro2022.pdf)
## Background
### A “Reconfigurable” GPU Architecture
In the A100, the GPCs (compute) and the L2/DRAM slices (memory) serve as the basic building blocks for architecting a reconfigurable GPU.

A GPU partition can be defined at the granularity of a single GPC, so the A100, which contains seven GPCs, can be configured into up to seven GPU partitions
(each partition having just a single GPC's worth of compute capability).

Alternatively, it can be (re)configured into one big GPU (7 GPCs), or into multiple small (1 or 2 GPCs) or medium (3 or 4 GPCs) sized GPUs.
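
As a rough sketch of this reconfiguration space, the snippet below enumerates partition mixes whose GPC counts fit within a seven-GPC GPU. It is a simplified model: the actual MIG profiles on the A100 restrict which sizes and combinations are allowed, so the `sizes` tuple here is an assumption.

```python
from itertools import combinations_with_replacement

TOTAL_GPCS = 7  # an A100 exposes seven GPCs

def valid_configs(sizes=(1, 2, 3, 4, 7), total=TOTAL_GPCS):
    """Enumerate partition mixes whose GPC counts fit within one GPU.

    Simplified: real MIG profiles further restrict which mixes are allowed,
    but this illustrates the size of the reconfiguration space.
    """
    configs = set()
    for n in range(1, total + 1):
        for combo in combinations_with_replacement(sizes, n):
            if sum(combo) <= total:
                configs.add(tuple(sorted(combo, reverse=True)))
    return sorted(configs, key=lambda c: (-sum(c), c))

if __name__ == "__main__":
    for cfg in valid_configs():
        print(cfg)  # e.g. (7,), (4, 3), (2, 2, 2, 1), (1, 1, 1, 1, 1, 1, 1), ...
```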
## Characterization And Motivation
### Effect of Model Size on Latency & Server Utility

Under small partition sizes like GPU(1), all DNN models universally achieve high GPU utilization.
However, blindly partitioning the large GPU into smaller ones without considering the unique computation demands of the target model can be suboptimal.
For instance, while both MobileNet and ResNet are DNN models for computer vision applications, MobileNet's computation requirements are much lighter than ResNet's because MobileNet heavily employs compute-efficient 1×1 convolutions as well as depthwise filters. Consequently, **ResNet experiences a steeper increase in latency when the GPU partition size is decreased**, because its performance is more sensitive to the (relatively) smaller computation power of GPU(1,2) than the lightweight MobileNet's.
### Effect of Batch Size on Latency & Server Utility
Inference queries with large batch sizes help increase GPU utilization, as they better exploit parallelism and locality across the batched inputs.
On the other hand, large batches increase the amount of computation, which can push latency to an unacceptable level and increase SLA violations.

Given these observations, one might use the characterization results to manually determine a model-specific and batch-size-specific partitioning point that balances GPU utilization and latency.
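
A minimal sketch of what such a manual, per-batch-size selection could look like: pick the smallest partition whose profiled latency still meets the SLA. The `profiled_latency_ms` table, its numbers, and the SLA targets are hypothetical placeholders for the characterization results, not values from the paper.

```python
# Hypothetical profiled latencies (ms) for one model:
# profiled_latency_ms[partition_gpcs][batch_size] -> measured latency.
profiled_latency_ms = {
    1: {1: 5.0, 4: 14.0, 16: 52.0},
    3: {1: 3.0, 4: 7.0, 16: 21.0},
    7: {1: 2.5, 4: 4.5, 16: 11.0},
}

def pick_partition(batch_size, sla_ms):
    """Return the smallest partition (in GPCs) that meets the SLA for this batch.

    Smaller partitions give higher utilization, so among all partitions that
    satisfy the latency target we prefer the smallest one.
    """
    for gpcs in sorted(profiled_latency_ms):  # try small partitions first
        latency = profiled_latency_ms[gpcs].get(batch_size)
        if latency is not None and latency <= sla_ms:
            return gpcs
    return None  # no partition meets the SLA for this batch size

print(pick_partition(batch_size=4, sla_ms=10.0))   # -> 3
print(pick_partition(batch_size=16, sla_ms=15.0))  # -> 7
```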
### Our Goal: A Heterogeneously Partitioned GPU Inference Server and Its Scheduling Algorithm
* A “heterogeneous” multi-GPU ML inference server.
* Two key challenges
1. A statically chosen, fixed partitioning granularity cannot efficiently capture the model-specific computation diversity of DNNs, failing to achieve low latency and high GPU utilization simultaneously.
2. The dynamically varying input batch size poses another problem, because a rigidly configured, single-granularity GPU partition size cannot flexibly adapt to the varying computation requirements of input batches.
* A “heterogeneity-aware” scheduling algorithm.

A baseline FIFS scheduler can lead to suboptimal scheduling decisions, as it fails to accommodate the diverse computation power of the heterogeneous GPU partitions. In the paper's example, a compute-heavy query A is dispatched to an idle small partition; a better scheduling decision would have been to wait until any one of the large GPUs completes its current query and schedule query A there instead.
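
A tiny numeric sketch of this pitfall; all latencies are made up and not from the paper:

```python
# Hypothetical numbers illustrating the FIFS pitfall for a compute-heavy query A.
exec_on_small_ms = 40.0   # query A's execution time on the idle small partition
exec_on_large_ms = 12.0   # query A's execution time on a large partition
large_busy_for_ms = 10.0  # time until a large partition finishes its current query

finish_if_dispatched_now = exec_on_small_ms               # 40 ms
finish_if_we_wait = large_busy_for_ms + exec_on_large_ms  # 22 ms

# Briefly waiting for the large partition finishes query A 18 ms earlier,
# even though the small partition was idle when the query arrived.
print(finish_if_dispatched_now, finish_if_we_wait)
```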
## Proposed Architecture: PARIS And ELSA
### High-level Overview

PARIS utilizes both the model-specific inference properties (e.g., latency vs. GPU utility under a target GPU partition size) and the batch size distribution information to systematically generate a heterogeneous set of partitioning granularities as well as the number of instances to deploy for each partition.
ELSA uses a heterogeneity-aware inference latency prediction model to estimate a given query's SLA slack and determine which among the heterogeneous GPUs is best suited to service the query.
### PARIS
* Key observations
1. For any given GPU partition size, having it handle batch sizes larger than its MaxBatch is not cost-effective.
2. Assuming the input batch size is smaller than the MaxBatch for a given model, small (or medium) GPU partitions are generally more cost-effective in terms of GPU utility.
3. Large GPU partitions are efficient when handling large batch sizes, as they achieve higher utilization.
* Partitioning with both model-specific properties “and” the batch size distribution in mind.

* We first conduct a one-time profiling of the [GPU utilization vs. latency] curve for each GPU partition size
* Using the characterization results, we are able to derive each GPU partition’s MaxBatch for a target DNN model
* Determining the number of partition “instances”
* Once PARIS has determined which batch size range the partitioned GPUs will be handling, it must derive how many instances of each GPU partition to deploy (see the sketch below).
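
A heuristic sketch of these two PARIS steps, assuming the per-partition MaxBatch values are already derived from profiling and the batch size distribution is given as a histogram. The tables, the GPC budget, and the proportional-allocation rule are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical inputs: MaxBatch per partition size (from profiling) and the
# observed/assumed batch size distribution of incoming queries (in percent).
max_batch = {1: 4, 3: 16, 7: 64}                  # per partition size (GPCs)
batch_histogram = {1: 35, 4: 30, 16: 25, 64: 10}  # % of queries per batch size
TOTAL_GPCS = 8 * 7                                # e.g., eight 7-GPC GPUs

# Step 1: assign each batch size to the smallest partition whose MaxBatch covers it.
traffic_pct = {p: 0 for p in max_batch}
for batch, pct in batch_histogram.items():
    for p in sorted(max_batch):                   # smallest partition first
        if batch <= max_batch[p]:
            traffic_pct[p] += pct
            break

# Step 2: split the GPC budget across partition sizes in proportion to the
# traffic each will serve, then convert each GPC share into whole instances.
instances = {p: max(1, pct * TOTAL_GPCS // (100 * p)) for p, pct in traffic_pct.items()}

print(traffic_pct)  # {1: 65, 3: 25, 7: 10}
print(instances)    # {1: 36, 3: 4, 7: 1}  -> 36 + 12 + 7 = 55 of 56 GPCs used
```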
### ELSA

* Estimating DNN model execution time via profiling
* A one-time profiling of a target DNN model's execution time over a target GPU partition size and all possible batch sizes.
* SLA slack time prediction
* Whenever a new service query is received at the server, ELSA first calculates how much time this new query must wait inside a target GPU partition until it gets a chance to be serviced

* Iterate through all available GPU partitions and calculate the SLA slack the query would have if it were scheduled to that partition

* Prioritizing the scheduling of new queries to smaller GPU partitions if multiple GPU partitions satisfy the SLA (see the sketch below)
* Servicing a query with a smaller GPU partition is always beneficial from a GPU utilization perspective
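
A minimal sketch of ELSA's slack-based decision, assuming hypothetical profiled execution-time and queue-drain tables. The partition IDs, the numbers, and the fallback when no partition meets the SLA are illustrative, not the paper's exact policy.

```python
# Hypothetical profiled execution-time table (ms):
# exec_time_ms[partition_id][batch_size] -> predicted latency on that partition.
exec_time_ms = {
    "small-0": {1: 5.0, 4: 14.0, 16: 52.0},
    "large-0": {1: 2.5, 4: 4.5, 16: 11.0},
}
# Estimated time (ms) until each partition's pending queue is drained,
# derived from the same profiled table.
queue_drain_ms = {"small-0": 3.0, "large-0": 20.0}
partition_size = {"small-0": 1, "large-0": 7}  # in GPCs

def sla_slack(pid, batch, elapsed_ms, sla_ms):
    """Slack that would remain if the query were scheduled to `pid` right now."""
    return sla_ms - (elapsed_ms + queue_drain_ms[pid] + exec_time_ms[pid][batch])

def schedule(batch, elapsed_ms, sla_ms):
    """Pick the smallest partition that still meets the SLA; if none does,
    fall back to the partition with the largest (least negative) slack."""
    ok = [(partition_size[p], p) for p in exec_time_ms
          if sla_slack(p, batch, elapsed_ms, sla_ms) >= 0]
    if ok:
        return min(ok)[1]  # smallest partition among the SLA-satisfying ones
    return max(exec_time_ms, key=lambda p: sla_slack(p, batch, elapsed_ms, sla_ms))

print(schedule(batch=4, elapsed_ms=2.0, sla_ms=25.0))   # -> small-0
print(schedule(batch=16, elapsed_ms=2.0, sla_ms=40.0))  # -> large-0
```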
## Evaluation
* Benchmarks
* Models: low (ShuffleNet, MobileNet), medium (ResNet), large (BERT)
* Query size distribution, query arrival rate
* MLPerf inference benchmark’s recommended Poisson distribution
* Software
* Modified DeepRecInfra
* Hardware
* We conduct our experiments on an Amazon EC2 p4d instance (p4d.24xlarge), which contains 8 NVIDIA A100 GPUs, 96 vCPUs, and 1,152 GB of host memory.
* Configuration of homogeneous vs. heterogeneous GPU partitions
* SLA target
### Tail Latency

### Latency-bounded Throughput

### Comparison with my project
* We both use a heterogeneous GPU system, but the granularity of tasks in our design is smaller than theirs, and the input model could also differ.