# Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
###### tags: `GPUs`
###### paper origin: OSDI-2020
###### papers: [link](https://cs.stanford.edu/~deepakn/assets/papers/gavel-osdi20.pdf)
###### slides and video: [link](https://www.usenix.org/sites/default/files/conference/protected-files/osdi20_slides_narayanan.pdf)
###### detail slides: [link](https://www2.slideshare.net/databricks/heterogeneityaware-cluster-scheduling-policies-for-deep-learning-workloads)
# Introduction
* Domain-specific architectures have emerged as an alternative to general-purpose CPUs. Consequently, users today must choose from a wide variety of accelerators to train their DNN models.
* Allocating these diverse resources across many users while optimizing complex cluster-wide scheduling objectives is a hard problem.
* Three main difficulties:
1. Performance Heterogeneity
* Models show heterogeneous performance behavior across accelerator types due to various architectural differences.
* Some scheduling policies can also benefit from splitting a job between multiple resource types
2. Generality across Policies
* Cluster operators might want to implement different scheduling policies based on their business goals
3. Colocation and Placement Optimizations
* To improve cluster utilization, existing GPU schedulers often deploy optimizations such as space sharing (colocating jobs on the same GPU) and placement-sensitive scheduling.
* contributions:
1. A systematic method to convert existing cluster scheduling policies into equivalent policies that consider heterogeneity and colocation; these equivalent optimization problems are practical for current DNN clusters.
2. A round-based scheduling mechanism to ensure that the cluster realizes the allocations returned by these policies.
3. Generalizations of many existing policies in our framework that improve corresponding objectives.
# Background
### Deep Neural Network (DNN) Training
* DNN training proceeds in iterations.
* Gavel leverages the repetitive, predictable nature of minibatch iterations in its throughput estimator.
* Modern DNN schedulers leverage the fact that DNN training is iterative to suspend and resume training at iteration boundaries
### Performance Optimizations
* GPUs can be severely underutilized in multi-tenant clusters.
* The placement of tasks for a distributed training job can have significant impact on performance.
#### Space Sharing.
* concurrently executing multiple models on the same GPU.
#### Placement Sensitivity.
* DNN models show heterogeneity in their distributed scaling behavior depending on the size of the tensors that need to be exchanged between workers during training
# System Overview

* Gavel's heterogeneity-aware policies compute the fraction of time different jobs (and job combinations) should run on different accelerator types to optimize the desired objective.
* These policies require as input the performance behavior in terms of throughputs for each job on each accelerator type, which can either be provided by the user, or can be measured on the fly by Gavel’s throughput estimator.
* Allocations are intended to be respected only between allocation recomputation events.
* Gavel can recompute its policy either when a reset event occurs, or at periodic intervals of time.
* Given the policy’s output allocation, Gavel’s scheduling mechanism grants jobs time on the different resources, and moves jobs between workers as necessary, so that the true fraction of time each job spends on different resources closely tracks the optimal allocation returned by the policy (a rough sketch of this loop follows).
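A rough sketch of how these pieces fit together. All names here are illustrative assumptions, not Gavel's actual API: a policy turns a throughput matrix into a time-fraction allocation, and a round-based mechanism realizes that allocation on the cluster.

```python
# Illustrative control loop (hypothetical names, not Gavel's actual API).
def scheduler_loop(policy, throughput_estimator, mechanism, cluster):
    allocation = None
    while cluster.has_active_jobs():
        if allocation is None or cluster.reset_event_occurred():
            # T: rows = jobs (and job combinations), columns = accelerator types.
            T = throughput_estimator.get_throughputs(cluster.active_jobs())
            # X[m][j]: fraction of time job m should spend on accelerator type j.
            allocation = policy.get_allocation(T, cluster.worker_counts())
        # One scheduling round: run the (job, accelerator type) pairs whose
        # received time fraction lags the target allocation the most.
        mechanism.run_round(allocation)
```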
### Heterogeneity-Aware Policies
* Gavel expresses scheduling policies as optimization problems for various objectives of interest.
* Allocations are represented as a matrix $X$, where $X_{mj}$ is the fraction of wall-clock time job $m$ should spend on accelerator type $j$; on a single accelerator type this reduces to a per-job time fraction. A throughput matrix $T$ (job $m$'s throughput on accelerator type $j$) is the policy's main input.
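* The effective throughput of job $m$ under allocation $X$ is the time-weighted sum of its per-accelerator-type throughputs:

$$
\mathrm{throughput}(m, X) = \sum_{j} T_{mj} \cdot X_{mj}
$$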
#### Space Sharing.
* Space sharing is handled by treating each runnable job combination as its own entity, with additional rows in the throughput matrix $T$ and allocation matrix $X$; the optimization then naturally favors combinations with high combined throughput.
#### Placement Sensitivity.
* Placement sensitivity is handled by treating consolidated and unconsolidated placements as two different worker types, with corresponding throughput values in $T$ and allocation values in $X$.
### Round-based Scheduling Mechanism

* Gavel's round-based scheduling mechanism assigns jobs to accelerator types while matching the optimal allocation as closely as possible.
* the scheduler computes a priority score for every job and accelerator type combination that is high when a job has received a smaller time fraction than the optimal allocation.
* Scheduling is performed in rounds: each round, available workers are assigned to the highest-priority (job, accelerator type) pairs (see the sketch below).
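One way the priority rule above could be computed. This is an illustrative interpretation, not Gavel's exact code; the epsilon and the per-job normalization are assumptions made here:

```python
import numpy as np

def compute_priorities(x_opt, rounds_received):
    """Priority of each (job, accelerator type) pair: high when the fraction
    of rounds received so far lags the target time-fraction allocation.

    x_opt          : (num_jobs, num_types) target allocation X from the policy.
    rounds_received: (num_jobs, num_types) rounds each job has run on each type.
    """
    total = np.maximum(rounds_received.sum(axis=1, keepdims=True), 1)
    fraction_received = rounds_received / total
    # Small epsilon so pairs that have never run (fraction 0) sort first.
    return x_opt / (fraction_received + 1e-9)
```

Each round, available workers would then be greedily assigned to the highest-priority pairs, respecting the number of workers each job needs.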
### Throughput Estimator
* Gavel's throughput estimator maps a new job to the closest job among a set of pre-profiled reference jobs.
* The throughputs of that closest reference job (including its colocation combinations) are then used as the new job's initial performance estimate (see the sketch after this list).
* For individual jobs, the throughputs can be estimated on the fly as jobs run on different resource types.
* A round duration of around 6 minutes works best.
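A hedged sketch of the "closest reference job" idea, using a simple cosine-similarity match over the configurations measured so far; the function and its inputs are illustrative simplifications, and the paper's actual estimator is more involved:

```python
import numpy as np

def estimate_initial_throughputs(partial_profile, reference_profiles):
    """Return the full profile of the reference job most similar to the new job.

    partial_profile   : {configuration: measured throughput} for the new job
                        (only a few configurations measured so far).
    reference_profiles: {reference job name: {configuration: throughput}},
                        fully pre-profiled offline.
    """
    measured = sorted(partial_profile)
    v_new = np.array([partial_profile[c] for c in measured])

    def similarity(profile):
        v_ref = np.array([profile[c] for c in measured])
        denom = np.linalg.norm(v_new) * np.linalg.norm(v_ref) + 1e-12
        return float(np.dot(v_new, v_ref)) / denom

    best = max(reference_profiles.values(), key=similarity)
    return dict(best)  # used as the initial estimate, refined on the fly
```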
### Limitations and Non-Goals
* Gavel does not propose new scheduling policies or performance optimizations.
* The goal is to share resources among many different users and jobs in a heterogeneity-aware way, while supporting many existing cluster-wide objectives.
# Scheduling Policies

### Max-Min Fairness as an Optimization Problem
* Max-min fairness over allocated resources corresponds to the Least Attained Service (LAS) policy.
* On a homogeneous cluster, max-min fairness simply gives each of the $n$ active jobs an equal $1/n$ share of the cluster; the heterogeneity-aware version instead maximizes the minimum normalized effective throughput across jobs, optimizing over the allocation matrix $X$ (see the cvxpy sketch below).
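As a concrete illustration, here is a minimal cvxpy sketch of a heterogeneity-aware max-min fairness (LAS-style) policy in the spirit of Gavel's formulation. The function name, the naive equal-share baseline used for normalization, and the exact constraint set are simplifications made here, not the paper's code:

```python
import cvxpy as cp
import numpy as np

def max_min_fairness_allocation(T, scale_factors, num_workers):
    """Maximize the minimum normalized effective throughput across jobs.

    T             : (num_jobs, num_types) throughput matrix (jobs x accelerator types).
    scale_factors : (num_jobs,) number of workers each job needs.
    num_workers   : (num_types,) number of available workers of each accelerator type.
    """
    num_jobs, num_types = T.shape
    X = cp.Variable((num_jobs, num_types))  # X[m, j]: time fraction of job m on type j

    # Effective throughput of each job under X: sum_j T[m, j] * X[m, j].
    effective_throughput = cp.sum(cp.multiply(T, X), axis=1)

    # Normalize by the throughput under a naive equal time share of every
    # accelerator type (a simplification; the paper normalizes by an explicit
    # equal-share allocation).
    baseline = (T / num_jobs).sum(axis=1)

    objective = cp.Maximize(cp.min(cp.multiply(effective_throughput, 1.0 / baseline)))
    constraints = [
        X >= 0,
        X <= 1,
        cp.sum(X, axis=1) <= 1,              # each job runs at most 100% of the time
        X.T @ scale_factors <= num_workers,  # don't oversubscribe any accelerator type
    ]
    cp.Problem(objective, constraints).solve()
    return X.value
```

The other policies in the paper follow the same pattern: only the objective (and any extra constraints) change, while the allocation matrix $X$ and throughput matrix $T$ stay the same.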


# Scheduling Mechanism


# Implementation
* Implemented a prototype of Gavel in approximately 9,000 lines of Python code, and a simulator in about 500 lines.
* Uses cvxpy to implement Gavel’s heterogeneity-aware policies.
* Uses gRPC to communicate control messages between the scheduler and workers.
# Evaluation
* Physical-cluster experiments are run on a cluster with 8 V100s, 16 P100s, and 24 K80s; simulated experiments use a cluster with 36 GPUs of each type.




# Conclusion
* Proposed Gavel, a heterogeneity-aware cluster scheduler that can optimize for many high-level objectives such as fairness, makespan, and cost.
* Demonstrated how existing policies can be expressed as optimization problems, and extended these policies to be heterogeneity-aware.
* Gavel then uses a decoupled round-based scheduling mechanism to ensure that the computed optimal allocation is realized.
* Gavel’s heterogeneity-aware policies improve end objectives on both physical and simulated clusters.
* It can support a higher average input job rate, while improving objectives such as average job completion time by 3.5×, makespan by 2.5×, and cost by 1.4×.