# Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning
###### tags: `GPUs`
###### paper origin: EuroSys, 2020
###### papers: [link](https://dl.acm.org/doi/10.1145/3342195.3387555)
# Introduction
## Problem
* While several DLT job schedulers have been proposed in the literature, none of them support user-level fairness, even in homogeneous clusters.
## Solution
* Present Gandiva<sub>fair</sub>, a scheduler that guarantees cluster-wide fair share of GPU minutes across users in a shared cluster of heterogeneous GPUs, while ensuring cluster efficiency by distributing unused quota fairly among other active users.
* A novel split, gang-aware stride scheduler to enforce fairness at short time-scales
* A load balancer that uses job migration as a key mechanism to distribute jobs, weighted by their tickets, evenly throughout the cluster.
* A novel technique of automatic resource trading across users.
## Motivation
### Fairness

Consider an eight-GPU cluster shared by two users, A and B, each running 2-GPU jobs.
* Inter-user fairness
* The cluster's eight GPUs are split evenly, so users A and B are allocated four GPUs each. Thus, the allocation is inter-user fair.
* A new 2-GPU job arrives for user C
* User C’s fair share in this example is 8 GPUs/3 active users = 2.66 GPUs
* But since user C has only one 2-GPU job, user C's effective fair share is capped at 2 GPUs.
* Inter-server gang scheduling: a multi-GPU job needs all of its GPUs allocated simultaneously, which is harder to enforce fairly when the job spans multiple servers.

### GPU Heterogeneity

### Heterogeneous GPU performance observations
* It is clear that all models benefit from newer generation GPUs compared to older GPUs.
* We can see that the marginal utility of using a newer GPU varies significantly across models.
## Design
Each DLT job is assigned a number of tickets which represents its share of resources.
### Split Stride Gang-scheduling
* Small jobs
* Jobs whose GPU needs can be served by a single server.
* Lottery and Stride are classic ticket-based scheduling algorithms.
* Problem: neither is directly feasible here, because multi-GPU DLT jobs must be gang-scheduled.
* Lottery
* At every time quantum, the scheduler holds a lottery over the total tickets of all active jobs and schedules the job holding the winning ticket.
* If the winning job's GPU demand fits within the remaining free GPUs (e.g., a 1- or 2-GPU job when two GPUs are free), we allocate it; otherwise we skip that job and redraw, iterating until all GPUs are allocated or full allocation is infeasible.
* Stride
* A job's stride is inversely proportional to its tickets and denotes the interval between successive times it is scheduled.
* A new job’s pass value is set to the minimum pass value of all jobs in the node. At each time quantum, the job with the minimum pass value is scheduled. The job’s pass value is then updated by the job’s stride.
* If the job does not fit, we skip it, but note that the job retains its pass value, so it is favored in subsequent quanta (a sketch follows below).
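A minimal sketch of the gang-aware stride scheduler for small jobs on a single server, assuming a fixed stride constant; the `Job` and `GangStrideScheduler` names and structure are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of gang-aware stride scheduling on a single server
# (class and field names are illustrative, not the paper's implementation).

BIG = 10_000  # stride constant

class Job:
    def __init__(self, name, tickets, gpus):
        self.name = name
        self.tickets = tickets
        self.gpus = gpus
        self.stride = BIG / tickets      # stride is inversely proportional to tickets
        self.pass_value = 0.0

class GangStrideScheduler:
    def __init__(self, num_gpus):
        self.num_gpus = num_gpus
        self.jobs = []

    def add_job(self, job):
        # a new job's pass value starts at the minimum pass value on this node
        job.pass_value = min((j.pass_value for j in self.jobs), default=0.0)
        self.jobs.append(job)

    def schedule_quantum(self):
        """Pick the jobs to run for one time quantum, lowest pass value first."""
        free = self.num_gpus
        scheduled = []
        for job in sorted(self.jobs, key=lambda j: j.pass_value):
            if job.gpus <= free:
                free -= job.gpus
                job.pass_value += job.stride   # only scheduled jobs advance
                scheduled.append(job)
            # a job that does not fit is skipped but retains its pass value,
            # so it is favored in subsequent quanta
            if free == 0:
                break
        return scheduled

# Example: on a 4-GPU server, a 1-GPU job and a 4-GPU job with equal tickets
# end up alternating, each running in roughly half of the quanta.
```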
* Large jobs
* Split Stride scheduler
* A central scheduler maintains pass values of all large jobs and one aggregate job (and pass value) per server. The aggregate job's stride is inversely proportional to the cumulative tickets of all small jobs in that server (see the sketch below).
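Building on the single-server sketch above, the following illustrates the split: a central stride scheduler over large (multi-server) jobs plus one aggregate entry per server; when a server's aggregate entry wins a quantum, that server's local gang-aware scheduler runs. The class names and the quantum-filling loop are assumptions, not the paper's code.

```python
# Hypothetical sketch of the split stride scheduler (illustrative names).
# The central scheduler keeps stride state for large jobs plus one aggregate
# entry per server, whose tickets are the sum of that server's small-job tickets.

BIG = 10_000

class Entry:
    """Stride-scheduling state for a large job or a server aggregate."""
    def __init__(self, tickets):
        self.tickets = tickets
        self.pass_value = 0.0

    @property
    def stride(self):
        return BIG / max(self.tickets, 1)

class CentralSplitStride:
    def __init__(self, servers):
        # servers: dict name -> GangStrideScheduler (single-server sketch above)
        self.servers = servers
        self.server_entries = {name: Entry(0) for name in servers}
        self.large_jobs = {}  # Job -> Entry

    def add_large_job(self, job, placement):
        job.placement = placement            # names of the servers the job spans
        entry = Entry(job.tickets)
        existing = list(self.large_jobs.values()) + list(self.server_entries.values())
        entry.pass_value = min((e.pass_value for e in existing), default=0.0)
        self.large_jobs[job] = entry

    def schedule_quantum(self):
        # refresh each aggregate's tickets from the small jobs currently placed there
        for name, sched in self.servers.items():
            self.server_entries[name].tickets = sum(j.tickets for j in sched.jobs)

        busy, decisions = set(), []
        candidates = [(e, j) for j, e in self.large_jobs.items()] + \
                     [(e, n) for n, e in self.server_entries.items()]
        # fill the quantum in increasing pass-value order; skipped entries keep
        # their pass value, mirroring the single-server skip rule
        for entry, who in sorted(candidates, key=lambda c: c[0].pass_value):
            needed = {who} if isinstance(who, str) else set(who.placement)
            if needed & busy:
                continue
            busy |= needed
            entry.pass_value += entry.stride
            decisions.append(who)

        # servers whose aggregate won run their local gang-aware stride scheduler
        return [self.servers[w].schedule_quantum() if isinstance(w, str) else [w]
                for w in decisions]
```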
### Load balancing
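As noted in the solution overview, the load balancer uses job migration to distribute jobs, weighted by their tickets, evenly across servers. Below is a minimal, illustrative sketch that equalizes per-server ticket load by moving the smallest jobs from overloaded to underloaded servers; the greedy pairing and the `migrate` callback are assumptions, not the paper's actual policy.

```python
# Illustrative load-balancing sketch (assumed policy, not the paper's): keep
# each server's total ticket load near the cluster average by migrating the
# smallest jobs from overloaded to underloaded servers. `migrate` is a
# hypothetical callback that performs the checkpoint-based move.

def ticket_load(sched):
    return sum(j.tickets for j in sched.jobs)

def balance(servers, migrate):
    """servers: dict name -> GangStrideScheduler; migrate(job, src, dst) moves a job."""
    avg = sum(ticket_load(s) for s in servers.values()) / len(servers)
    overloaded = sorted(servers.items(), key=lambda kv: -ticket_load(kv[1]))
    underloaded = sorted(servers.items(), key=lambda kv: ticket_load(kv[1]))
    for (src_name, src), (dst_name, dst) in zip(overloaded, underloaded):
        if src_name == dst_name:
            continue
        while src.jobs and ticket_load(src) > avg:
            job = min(src.jobs, key=lambda j: j.tickets)   # smallest job first
            if ticket_load(dst) + job.tickets > avg:
                break                                      # would overfill dst
            src.jobs.remove(job)
            dst.jobs.append(job)
            migrate(job, src_name, dst_name)
```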


### Scheduler Efficiency
* Gandiva<sub>fair</sub> addresses the apparent conflict between fairness and efficiency by leveraging two domain-specific customizations:
* Gandiva<sub>fair</sub> leverages a mechanism to migrate a job on demand at low cost.
* The specific workload of DLT jobs in large clusters is particularly well-suited to a migration policy that performs intelligent packing of jobs, avoiding pathological scenarios such as fragmentation that leaves GPUs idle.
### Handling GPU heterogeneity transparently
Consider a cluster with 60 K80s and 12 V100s, and three active users, A, B, and C, running hyper-parameter tuning over their VAE, ResNet and ResNeXt models, respectively. A strict fair share would give each user 20 K80s and 4 V100s; but because the V100 speedup differs widely across these models, users can trade shares (e.g., the VAE user cedes V100 time in exchange for extra K80 time) so that every user's throughput improves over the strict fair share.

### Gandiva<sub>fair</sub>
* Allocation
* Profiling
* As jobs arrive and are scheduled on different GPU models, each job's profiled speedup is updated with an average of its most recent measurements, so it reflects current performance.
* Migration
* Job migration options are evaluated every time quantum.
* If the user would benefit from having their jobs migrated from one GPU model to another, and the resulting improvement exceeds a threshold, the scheduler adds that user to the candidate migration list.
* Trading
* The scheduler pairs the user with the highest speedup on the fast GPU relative to the slow GPU (the buyer of fast-GPU time) with the user with the lowest speedup (the seller), and trades fast-GPU share for slow-GPU share at a rate between the two speedups, so both users gain (see the trading sketch after this list).
* Scalability
* The auction run time is O(numUsers) and the allocation run time is O(numUsers + numJobs).
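A minimal sketch of one trading step, assuming per-user speedups (fast-GPU throughput divided by slow-GPU throughput) and per-user shares expressed in GPU minutes; `trade_once` and its midpoint price are illustrative, not the paper's exact auction.

```python
# Illustrative sketch of one trading step (assumed names, not the paper's exact
# auction). speedups[u] = user u's fast-GPU throughput / slow-GPU throughput;
# shares are GPU minutes per scheduling interval.

def trade_once(speedups, fast_share, slow_share, amount=1.0):
    seller = min(speedups, key=speedups.get)   # benefits least from the fast GPU
    buyer = max(speedups, key=speedups.get)    # benefits most from the fast GPU
    if buyer == seller or speedups[buyer] <= speedups[seller]:
        return None                            # no mutually beneficial trade

    # any price strictly between the two speedups (slow-GPU minutes per
    # fast-GPU minute) leaves both users better off than their fair share
    price = (speedups[buyer] + speedups[seller]) / 2

    amount = min(amount, fast_share[seller], slow_share[buyer] / price)
    fast_share[seller] -= amount
    fast_share[buyer]  += amount
    slow_share[buyer]  -= amount * price
    slow_share[seller] += amount * price
    return buyer, seller, amount, price

# Hypothetical numbers: a VAE-like user (speedup 1.2) sells its V100 share to a
# ResNeXt-like user (speedup 5.0) and receives extra K80 minutes in return.
speedups = {"A": 1.2, "B": 5.0}
fast, slow = {"A": 6.0, "B": 6.0}, {"A": 30.0, "B": 30.0}
trade_once(speedups, fast, slow, amount=6.0)
```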
## Implementation
* Uses Kubernetes as the cluster manager with a custom scheduler that allocates jobs to nodes. Jobs are submitted as Docker containers.
* The manager exposes a REST API and a gRPC endpoint for clients to connect to the scheduler.
* Job migration is done via training checkpoints (see the sketch after this list).
* The TensorFlow and PyTorch frameworks are modified to support on-demand checkpointing for migration.
* CUDA call proxy
* Maps a job's virtual GPU address space onto the physical GPU address space, allowing the job to be detached from one GPU and re-attached to another during migration.
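A minimal, framework-level sketch of checkpoint-based migration in PyTorch, assuming a checkpoint file on shared storage; the path and function names are hypothetical, and the real system modifies the frameworks to checkpoint and resume on demand.

```python
# Minimal PyTorch-style sketch of checkpoint-based migration (illustrative;
# the real system modifies TensorFlow/PyTorch to checkpoint and resume on
# demand). The checkpoint path is hypothetical shared storage.
import os
import torch

CKPT = "/shared/ckpts/job-42.pt"   # hypothetical path reachable from all servers

def save_for_migration(model, optimizer, epoch):
    """Source node: persist enough state to resume training elsewhere."""
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "epoch": epoch}, CKPT)

def resume_after_migration(model, optimizer):
    """Destination node (possibly a different GPU model): reload and continue."""
    if not os.path.exists(CKPT):
        return 0                                   # fresh start
    state = torch.load(CKPT, map_location="cuda")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["epoch"] + 1
```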
## Evaluation
* 50 servers with 200 GPUs in total, comprising 24 V100, 48 P100 and 128 K80 GPUs. Each server has four GPUs of a single model
* Single-server Gang-aware Stride scheduling
* a single server with 4 V100 GPUs and three users, each with 100 tickets.
* User1 submits four 1-GPU jobs running VAE on MNIST dataset, User2 submits four 2-GPU jobs running DCGAN on CIFAR10 dataset, and User3 submits four 4-GPU jobs running ResNet-50 on CIFAR10 dataset.
* Multi-server Split Stride scheduling
* 2 servers each with 4 P100 GPUs and three users, each with 100 tickets.
* User1 submits one 8-GPU job running ResNet-50, User2 submits two 2-GPU jobs running ResNet-50, and User3 submits four 1-GPU jobs running VAE.
* Homogeneous cluster
* A GPU cluster with 48 P100s is used to illustrate load balancing working together with the split stride scheduler.
* 70 users in total, divided into four classes: 62 users with one 1-GPU job running VAE or SuperResolution on the BSDS300 dataset, 3 users with one 2-GPU job running ResNet-50, 3 users with one 4-GPU job running ResNet-50, and 2 users with one large (8-GPU) job running ResNet-50.
* Heterogeneous Cluster: No Trading
* Heterogeneous Cluster: Automated trading