---
title: "Scheduling Algorithm for Clusters with Resource Hogs and Mice"
author:
- Vinayak Agarwal <va2083@nyu.edu>
- Neeraj Yadav <ny736@nyu.edu>
---
# Introduction
While DRF is a fair scheduling algorithm, it inherently assumes that all incoming jobs are similarly sized. Real-world production traces show that this is not the case. In this project, we take inspiration from other scheduling algorithms and modify DRF to support pre-emption, which reduces job wait times. This especially helps short-lived jobs, which can then achieve much shorter turnaround times. We further extend the algorithm so that the data center operator can tune it to weight either time or resources more heavily.
# Motivation
While [Dominant Resource Fairness](http://web.eecs.umich.edu/~mosharaf/Readings/DRF.pdf) is a reasonably fair scheduling algorithm, it falls short in certain scenarios. Consider a user who requests resources for a very long-running job. Jobs that arrive after it may be starved for a long time, because DRF chooses the job with the smallest dominant share but does not take the job's running time into account. From the [Borg]() papers, we see that in an actual production environment a small fraction of jobs takes up most of the resources (and for very long durations), while most other jobs are short-lived and use far fewer resources. This means that most small jobs would see very large wait times, because DRF does not support pre-emption.
One way this problem is addressed is in [Themis](https://cs.nyu.edu/~apanda/classes/sp21/papers/themis.pdf), where each job runs for a short span of time before being put back into the queue. This ensures that all jobs make progress and none are starved. Themis, however, assumes that most jobs are resource hogs, which may not hold for workloads like those seen in Borg. We need an algorithm that schedules both resource hogs and mice with equal priority while reducing the number of times a job is put back into the queue.
# Approach
We build upon the standard DRF algorithm, which works as follows:
* The algorithm first calculates each user's share of every resource type.
* The maximum share across resource types is called the user's dominant share, and the corresponding resource is called the dominant resource. For example, a user with a large computation-intensive job will have CPU as its dominant resource.
* DRF then applies max-min fairness across users' dominant shares, i.e. it maximizes the smallest dominant share first, then the second smallest, and so on (see the sketch below).
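For concreteness, here is a minimal sketch of this selection step; the resource names, capacities, and function names are illustrative, not taken from the DRF paper:

```python
# Minimal sketch of DRF's selection step (names are illustrative).
# `total` is the cluster capacity per resource; `allocations` is each
# user's current usage.

def dominant_share(allocation, total):
    """Largest fraction of any single resource held by a user."""
    return max(allocation[r] / total[r] for r in total)

def pick_next_user(allocations, total):
    """Max-min fairness: serve the user with the smallest dominant share."""
    return min(allocations, key=lambda u: dominant_share(allocations[u], total))

total = {"cpu": 90, "mem": 180}
allocations = {
    "A": {"cpu": 30, "mem": 10},  # dominant share = 30/90 ~ 0.33 (CPU)
    "B": {"cpu": 5,  "mem": 30},  # dominant share = 30/180 ~ 0.17 (memory)
}
print(pick_next_user(allocations, total))  # -> "B" (smallest dominant share)
```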
We name our version Pre-emptive DRF (PDRF). We define a job's dominant share in PDRF as:
`p * (DRF dominant share) + (1 - p) * t`
where,
t is the total time the job has executed, across allocations;
p is a tuning parameter that weights resources against time.
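A minimal sketch of this score, assuming t is normalized (for example against the longest running time in the system) so that both terms lie in [0, 1]; the formula above leaves this normalization open:

```python
# Sketch of the PDRF dominant share (names are illustrative).
# drf_dom_share is DRF's dominant share in [0, 1]; t_normalized is the job's
# total execution time so far, assumed here to be scaled into [0, 1] so the
# two terms are comparable.

def pdrf_share(drf_dom_share, t_normalized, p):
    """p -> 1 weights resources more; p -> 0 weights accumulated run time."""
    return p * drf_dom_share + (1 - p) * t_normalized

# Lower PDRF share means higher scheduling priority, as in DRF.
print(pdrf_share(drf_dom_share=0.3, t_normalized=0.1, p=0.7))  # 0.24
```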
Just like in DRF, jobs with a lower dominant share are given higher priority in PDRF and are thus scheduled first. To perform this computation, we track both the total running time and the total waiting time of each job. These quantities, for all jobs currently executing or still waiting to be scheduled, let us decide when to preempt a running job and replace it with a higher-priority waiting job. We define two metrics, t_alloc and t_wait, as follows:
```
t_alloc = a_1 + a_2 + ... + a_n   (a_i: duration of the job's i-th allocation interval)
t_wait  = w_1 + w_2 + ... + w_m   (w_j: duration of the job's j-th waiting interval)
```
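One way to maintain these totals is to accumulate them whenever a job changes state. Below is a sketch under the assumption that the scheduler is notified on every allocation and preemption; the `Job` class and its method names are ours, not part of the algorithm's specification:

```python
import time

class Job:
    """Per-job bookkeeping for cumulative running and waiting time."""
    def __init__(self):
        self.t_alloc = 0.0         # total time spent running, across allocations
        self.t_wait = 0.0          # total time spent waiting in the queue
        self._since = time.time()  # timestamp of the last state change

    def on_allocated(self):        # queue -> running
        now = time.time()
        self.t_wait += now - self._since
        self._since = now

    def on_preempted(self):        # running -> queue
        now = time.time()
        self.t_alloc += now - self._since
        self._since = now
```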
We now use these metrics to decide when to preempt a job. A job is preempted when:
`t_wait > t_alloc * alpha and (run time of job's user) > min(average finish time of jobs / queue_length, beta)`
where,
alpha and beta are tuning parameters.
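A hedged sketch of this check, reading t_wait as the waiting job's accumulated wait time and t_alloc as the running job's accumulated run time (an interpretation on our part); `avg_finish_time` and `queue_length` would come from the scheduler's own bookkeeping, and all names here are illustrative:

```python
def should_preempt(waiting_job, running_job, user_run_time,
                   avg_finish_time, queue_length, alpha, beta):
    """Preempt when the waiting job has waited long enough relative to the
    running job's service (first clause) and the running job's user has
    already received a fair amount of run time (second clause).
    Assumes queue_length > 0."""
    waited_enough = waiting_job.t_wait > running_job.t_alloc * alpha
    had_its_share = user_run_time > min(avg_finish_time / queue_length, beta)
    return waited_enough and had_its_share
```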