# VELTAIR: Towards High-Performance Multi-tenant Deep Learning Services via Adaptive Compilation and Scheduling
###### tags: `Model Serving`
###### paper origin: ASPLOS 2022
###### paper: [link](https://dl.acm.org/doi/10.1145/3503222.3507752)
### Research Problems
Multi-tenant DL serving has unique challenges that are overlooked by previous multi-tenant DL serving works:
1. Owing to the complex inner structure of DL models, the scheduling granularity has a profound impact on multi-model serving throughput.
2. The performance of DL models is very sensitive to the code generation strategy.
3. The performance of generated code degrades rapidly in multi-tenant scenarios due to shared-resource contention.
### Proposed Solutions
1. To reduce resource conflicts while keeping CPU usage efficient under different load levels, we propose a layer-block granularity scheduling strategy, which is finer than model-wise scheduling but coarser than layer-wise scheduling.
2. We propose a single-pass compilation strategy built on TVM's existing auto-scheduler, which produces multiple code versions for different interference levels from one tuning pass.
## OPTIMIZATION SPACE ANALYSIS
* Two dimensions of the optimization space are critical for achieving high-performance multi-tenant DL services:
* Scheduling granularity
* Compilation strategy

### Optimization Space Definition
* Scheduling granularity
    * To achieve higher resource usage efficiency and fewer resource conflicts, we consider a new scheduling granularity that treats multiple layers as one unit, which we call a **layer block**.
* Compilation strategy
    * For compiling DNN layers or models on a CPU, we mainly consider nested-loop transformations plus CPU-specific annotations (pragmas) such as parallelization and unrolling. Compilation is effectively a trade-off between the parallelism and the locality of the generated program; a sketch of this trade-off follows.
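A minimal sketch of this trade-off, assuming a TVM version that still exposes the TE schedule API (the matrix size, tile factors, and schedule choices below are illustrative, not the paper's):

```python
import tvm
from tvm import te

# Illustrative 1024x1024 matrix multiplication (not a layer from the paper).
N = 1024
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
# Larger tile factors improve cache locality but leave fewer independent
# blocks to parallelize; smaller factors do the opposite.
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=64, y_factor=64)
s[C].parallel(io)  # CPU-specific parallelization annotation
s[C].unroll(ji)    # CPU-specific unrolling annotation
print(tvm.lower(s, [A, B, C], simple_mode=True))
```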
### Scheduling Granularity Analysis
* Experimental Setup
    * For model-wise scheduling, we implement a simple First-Come-First-Serve (FCFS) strategy used in prior work.
* Results
    * Model-Wise Inefficiency: the distinctive computation resource requirements across DNN layers are the root cause of why model-wise scheduling is sub-optimal.
    * Layer-Wise Inefficiency: layer-wise scheduling is sub-optimal owing to frequent scheduling conflicts when the query arrival rate is high.

### Compilation Strategy Analysis
* Extending TVM Auto-Scheduler
    * The original TVM auto-scheduler does not consider interference when multiple DNN models run together. To mitigate the impact of interference, we propose a naive extension to TVM's existing auto-scheduler.
* Results
    * The best-performing code version changes with the interference level, so no single statically chosen version is optimal across all levels.
* Multi-Version Static Compilation
    * A naive way to exploit the above insight for multi-tenant DL services is to perform Just-in-Time (JIT) compilation according to the current interference level. However, the JIT compilation overhead can offset the benefit of adaptive compilation. Instead, we use static multi-version compilation to get the same benefit as adaptive JIT compilation; a dispatch sketch follows this list.
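A minimal sketch of such a dispatcher, assuming each layer has been compiled ahead of time into a handful of versions keyed by interference level (the class name, threshold values, and `v0`..`v4` handles are hypothetical):

```python
from bisect import bisect_right

class MultiVersionKernel:
    """Picks one of several ahead-of-time compiled code versions at call
    time, based on the measured interference level (hypothetical helper)."""

    def __init__(self, versions, thresholds):
        # versions[i] serves interference levels below thresholds[i];
        # the last version serves everything above the final threshold.
        assert len(versions) == len(thresholds) + 1
        self.versions = versions
        self.thresholds = thresholds  # sorted ascending

    def __call__(self, interference, *args):
        idx = bisect_right(self.thresholds, interference)
        return self.versions[idx](*args)

# Hypothetical usage with five precompiled versions of one layer:
# kernel = MultiVersionKernel([v0, v1, v2, v3, v4], [0.2, 0.4, 0.6, 0.8])
# out = kernel(current_interference, layer_input)
```

Because every version is compiled ahead of time, run-time selection costs only a binary search rather than a JIT pass.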
## DETAILED DESIGN OF VELTAIR
### Single-Pass Static Multi-Version Compiler
A single tuning pass for one layer typically takes about 20 minutes on our high-end CPU, so naively searching five times for five versions would take close to two hours; Veltair therefore extracts all versions from a single pass.
* Parallelism-Locality Trade-off
    * We derive the following insight: generated code with higher locality (a larger blocking size) performs better under light interference but is interference-vulnerable, while generated code with higher parallelism performs better under heavy interference and is interference-tolerant.
* Single-Pass Compilation
    * Instead of searching only for the single best-performing implementation, we record as many samples as possible during the search and calculate their parallelism and locality metrics, then select versions that span the trade-off; a selection sketch follows.
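A sketch of one plausible selection step, assuming each recorded sample carries a measured latency plus derived parallelism and locality metrics (the `Sample` fields and the banding scheme are assumptions, not the paper's exact procedure):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    latency: float      # measured stand-alone latency (ms)
    parallelism: float  # e.g., extent of the parallelized outer loop
    locality: float     # e.g., blocking/tile size

def pick_versions(samples, n_versions=5):
    """Keep the fastest sample in each parallelism band so the selected
    versions cover the parallelism-locality spectrum."""
    lo = min(s.parallelism for s in samples)
    hi = max(s.parallelism for s in samples)
    band = (hi - lo) / n_versions or 1.0
    best = {}
    for s in samples:
        b = min(int((s.parallelism - lo) / band), n_versions - 1)
        if b not in best or s.latency < best[b].latency:
            best[b] = s
    return [best[b] for b in sorted(best)]
```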

### Dynamic Threshold Based Layer-Block Formation
A fixed-size layer block is not efficient because the optimal block size varies with the system load and with the interference from other co-executing models. We therefore use a threshold to form blocks: when the system load is high, we use a low threshold, which means each layer in a block must have a core count close to the average value, reducing the scheduling conflict rate. A sketch of one possible formation rule follows.
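A minimal sketch of one plausible reading of that rule (an assumption, not the paper's exact algorithm): consecutive layers are merged into a block while each layer's preferred core count stays within `threshold` cores of the block's running average.

```python
def form_layer_blocks(layer_cores, threshold):
    """Group consecutive layers whose preferred core counts are within
    `threshold` of the current block's average (hypothetical rule)."""
    blocks, current = [], []
    for cores in layer_cores:
        if current and abs(cores - sum(current) / len(current)) > threshold:
            blocks.append(current)
            current = []
        current.append(cores)
    if current:
        blocks.append(current)
    return blocks

# A low threshold yields blocks whose layers have similar core demands:
print(form_layer_blocks([4, 5, 16, 15, 6], threshold=3))
# -> [[4, 5], [16, 15], [6]]
```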

### Veltair Runtime Scheduler
Our scheduler monitors the current CPU interference level and dynamically chooses the code version and the scheduling granularity.
* Interference Proxy
    * We first build a proxy for the system's interference pressure level using hardware performance counters.
* Dynamic Scheduling Threshold
    * We use a simple heuristic that determines each model's threshold by subtracting the sum of all models' average core counts from the total core count and distributing the remaining cores in proportion to each model's average core count; a worked sketch follows.
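A sketch of that arithmetic (the function name and example numbers are illustrative; the 64-core total matches the Threadripper 3990X used in the evaluation):

```python
def scheduling_thresholds(total_cores, avg_cores):
    """Per-model thresholds: each model keeps its average core count plus a
    share of the spare cores proportional to that average."""
    used = sum(avg_cores)
    spare = max(total_cores - used, 0)
    return [avg + spare * avg / used for avg in avg_cores]

# Example on a 64-core CPU with three co-running models:
print(scheduling_thresholds(64, [8, 16, 24]))
# -> approximately [10.67, 21.33, 32.0], i.e. threshold_i = avg_i * total / used
```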
## EVALUATION
### Experimental Setup
* Multi-Tenant Deep Learning Models
    * To simulate realistic deep learning services, we use deep learning models from MLPerf (Server), as listed in Tbl. 2 of the paper. The evaluated models cover image classification, object detection, and neural machine translation (NMT) tasks. We categorize the models' workloads as light, medium, or heavy, and set their QoS targets according to the MLPerf guidance.
* Workload Generation
    * We also follow the MLPerf guidance and generate random queries whose arrivals follow a Poisson distribution (see the sketch after this list).
* Hardware and Software
    * An AMD Ryzen Threadripper 3990X CPU (64 cores) and 256 GB of DDR4 RAM at 3200 MHz
* Evaluation Metrics
    * QPS at which 95% of tasks satisfy their QoS target
* Average Latency
* CPU Usage Efficiency
* Baseline Choice
* Planaria
* Evaluation Plan
* Veltair-AS: with only adaptive scheduling.
* Veltair-AC: with only adaptive compilation.
* Veltair-FULL: with both adaptive scheduling and adaptive compilation enabled.
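As mentioned under Workload Generation above, here is a minimal sketch of a Poisson load generator (the rate and duration are illustrative; MLPerf's actual LoadGen is more involved):

```python
import random

def poisson_arrivals(qps, duration_s, seed=0):
    """Yield query arrival timestamps (in seconds) at an average rate of
    `qps`: exponential inter-arrival gaps give a Poisson arrival process."""
    rng = random.Random(seed)
    t = 0.0
    while True:
        t += rng.expovariate(qps)
        if t >= duration_s:
            return
        yield t

# Example: roughly 200 queries over one second of simulated load.
timestamps = list(poisson_arrivals(qps=200, duration_s=1.0))
```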
### Query per Second (QPS) Improvement

### Query Execution Latency Result

### Result of CPU Efficiency
