# Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem
###### tags: `GPUs`
###### origin:
###### paper: [link](https://arxiv.org/abs/2109.11067)
## Background
Multi-Instance GPU (MIG) is a new hardware feature introduced with the A100. MIG allows users to partition one physical GPU into multiple (up to seven) GPU instances.

## Research problem
1. different DNNs have different performance per resource on different sized instances
* This means that we cannot simply assume that two 1/7 instances equal one 2/7 instance and assign resources by total amounts, which is a common assumption used by traditional resource allocation algorithms (like allocating CPU cores).
2. instance allocation is restricted: partitioning GPUs follows specific and (arguably) peculiar rules.
* For example, an A100 cannot allocate a 3/7 instance while a 4/7 instance is running, even if it has three free units of resources.
3. MIG supports partial reconfiguration
* a subset of a GPU’s instances can be repartitioned on-the-fly, without affecting other working instances on the same GPU.
## Solution
* Proposed MIG-serving, which takes DNN models and their SLOs as inputs, and produces a set of GPU partitions and service assignments, called a deployment, that satisfies all SLOs and uses as few GPUs as possible.
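As a rough illustration (not the paper's actual data model), a deployment could be represented along these lines; the class and field names here are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical summary of one GPU: its partition (instance sizes in 1/7 units)
# plus the service assigned to each instance, in the same order.
@dataclass
class GPUConfig:
    partition: List[int]   # e.g. [4, 2, 1] -> one 4/7, one 2/7, one 1/7 instance
    services: List[str]    # service running on each instance

# A deployment is the set of GPU configurations across the cluster.
@dataclass
class Deployment:
    gpus: List[GPUConfig] = field(default_factory=list)

    def num_gpus(self) -> int:
        return len(self.gpus)
```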
## Introduction

### Observation 1
the growth of inference throughput is non-linear relative to the increase in resources (i.e., from 1/7 to 7/7 instances).
### Observation 2
for the same DNN model, different GPU partitions yield different performance in terms of throughput and latency
### Observation 3
The performance characteristics of different DNN models are different.
For example, densenet121 prefers small instances
## System overview

### Optimizer
Optimizer tackles the optimization problem of serving DNNs with MIG: finding a valid deployment that uses as few GPUs as possible.
The first phase aims at finding a candidate deployment which is valid but suboptimal in terms of GPU usage efficiency. The second phase improves the candidate deployment via a combination of custom-designed GA and MCTS.
This two-phase design is crucial in practice because it balances two important but conflicting requirements: (i) getting a valid deployment quickly and (ii) taking full advantage of every single GPU.
### Controller
Controller receives two inputs, the new deployment (from optimizer) and the current deployment on GPU clusters. Controller’s duty is to (i) plan a series of actions (called a transition plan) that switch GPUs from the current configurations to the new version, and (ii) execute the transition plan without affecting user experiences.
To achieve the above goals, controller runs an algorithm, called exchange-and-compact. At a high level, the algorithm first changes current service instances to the wanted sized instances while maintaining the required throughputs during this process, with the help of extra GPUs; it then repartitions GPUs and packs the services into the planned number of GPUs
## Optimizer algorithm
### Optimizer’s inputs
1. service performance (throughput and latency) on different sized GPU instances (1/7–7/7 instances)
2. SLOs which include required throughputs and latencies for each service
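A small sketch of how these two inputs might be laid out; the class and field names are assumptions for illustration, and the real numbers come from offline profiling:

```python
from dataclasses import dataclass
from typing import Dict

# MIG instance sizes offered by the A100, in 1/7 units (1g, 2g, 3g, 4g, 7g profiles).
INSTANCE_SIZES = [1, 2, 3, 4, 7]

# Hypothetical per-service profiling result: measured throughput and latency
# on each instance size.
@dataclass
class ServicePerf:
    throughput: Dict[int, float]   # instance size -> requests/s
    latency: Dict[int, float]      # instance size -> ms (at the chosen batch size)

# Hypothetical SLO: the throughput a service must sustain and its latency bound.
@dataclass
class SLO:
    required_throughput: float     # requests/s
    latency_bound: float           # ms
```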
### Completion rates
* a deployment's progress toward satisfying the SLOs.
* For example, completion rates of [0%, 100%, ...] mean that the deployment does not run service0 on any GPU while service1 is fully satisfied.
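A rough sketch of one way to compute a per-service completion rate, assuming it is the fraction of the required throughput that the deployment already provides (capped at 100%); this interprets the description above rather than reproducing the paper's exact definition:

```python
def completion_rates(deployed_throughput, required_throughput):
    """Both arguments are lists indexed by service id (requests/s)."""
    return [min(1.0, d / r) for d, r in zip(deployed_throughput, required_throughput)]

# Example: service0 gets nothing, service1 is fully served.
print(completion_rates([0.0, 500.0], [200.0, 400.0]))  # -> [0.0, 1.0]
```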

### Optimizer's goal
Given
1. the utilities of every service on every sized instance, and
2. the current completion rates,

the optimizer should produce a set of GPU configurations such that the element-wise sum of all GPU configuration utilities and the completion rates is at least [100%, 100%, ...].
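A minimal sketch of this validity check, assuming utilities and completion rates are both vectors indexed by service id (all names are illustrative):

```python
def is_valid(config_utilities, completion_rates):
    """config_utilities: list of per-configuration utility vectors, one entry per service.
    completion_rates: current per-service completion rates (0.0..1.0).
    A solution is valid when every service reaches at least 100% in total."""
    totals = list(completion_rates)
    for util in config_utilities:
        totals = [t + u for t, u in zip(totals, util)]
    return all(t >= 1.0 for t in totals)

# Example with two services: one configuration contributes 60% to service0
# and nothing to service1; completion rates start at [50%, 100%].
print(is_valid([[0.6, 0.0]], [0.5, 1.0]))  # -> True
```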
### Fast algorithm: heuristic score and greedy algorithm.
* In each step, the greedy algorithm ranks the known GPU configurations by their heuristic scores and picks the one with the highest score, repeating until every completion rate reaches 100% (see the sketch below).
* The heuristic score of a GPU configuration is based on two factors:
    1. the current completion rates
    2. the GPU configuration's utility
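A simplified sketch of that greedy loop; the scoring function here just rewards progress on unmet completion rates and is an assumption, since the paper's heuristic combines the two factors in its own way:

```python
def fast_greedy(config_utilities, required):
    """config_utilities: {config_name: per-service utility vector};
    required: per-service completion target (1.0 == 100%).
    Repeatedly pick the configuration with the best heuristic score until
    every service's completion rate reaches its target."""
    rates = [0.0] * len(required)
    chosen = []
    while any(r < t for r, t in zip(rates, required)):
        def score(util):
            # reward only progress toward still-unmet completion rates
            return sum(min(u, t - r) for u, r, t in zip(util, rates, required) if r < t)
        best = max(config_utilities, key=lambda name: score(config_utilities[name]))
        if score(config_utilities[best]) <= 0:
            raise RuntimeError("no known configuration makes progress")
        chosen.append(best)
        rates = [r + u for r, u in zip(rates, config_utilities[best])]
    return chosen

# Example: two services and two known GPU configurations (utilities are made up).
configs = {"7x1/7 all serviceA": [0.7, 0.0], "4-2-1 mixed": [0.2, 0.5]}
print(fast_greedy(configs, [1.0, 1.0]))
```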
### Slow algorithm: MCTS.
Monte Carlo Tree Search

MIG-serving customizes MCTS in two ways (both sketched below):
1. pruning each node's children to those with the top-K heuristic scores (K=10 by default), and
2. using a fast-and-accurate estimation via memoization and randomization.
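A much-simplified sketch of those two customizations, assuming a node is a partial deployment summarized by its unmet completion gaps; the function names, the rollout body, and the surrounding search skeleton are illustrative assumptions, not the paper's implementation:

```python
import random
from functools import lru_cache

K = 10  # keep only the top-K scored children of each node (paper's default K=10)

def top_k_children(children, heuristic_score, k=K):
    """Customization 1: prune a node's expansion to the K children with the
    highest heuristic scores instead of expanding every known configuration."""
    return sorted(children, key=heuristic_score, reverse=True)[:k]

@lru_cache(maxsize=None)
def rollout_estimate(gaps, configs, trials=8):
    """Customization 2: a memoized, randomized estimate of how many more GPUs a
    partial deployment needs. `gaps` is a tuple of unmet completion per service,
    `configs` a tuple of per-service utility tuples; both are tuples so results
    can be cached and reused across the search tree."""
    rng = random.Random(0)
    costs = []
    for _ in range(trials):
        remaining, used = list(gaps), 0
        while any(g > 1e-9 for g in remaining) and used < 10_000:
            util = rng.choice(configs)        # complete the deployment randomly
            remaining = [max(0.0, g - u) for g, u in zip(remaining, util)]
            used += 1
        costs.append(used)
    return sum(costs) / trials

# Example: service0 is still 80% short; two known configurations to draw from.
print(rollout_estimate((0.8, 0.0), ((0.7, 0.0), (0.2, 0.5))))
```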
## Controller algorithm
The controller's goal is to finish a transition without interrupting user experience, and to finish it quickly.
### Exchange phase
Controller handles each (new instance, unneeded instance) pair by
1. creating the new instance (using extra GPUs if needed), and
2. deleting the paired unneeded instance.

After finishing all pairs, controller deletes any remaining instances in the unneeded list (sketched below).
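One plausible reading of the steps above, sketched with hypothetical callables standing in for the controller's k8s-backed actions:

```python
def exchange_phase(pairs, leftover_unneeded, create_instance, delete_instance):
    """pairs: (new_instance_spec, old_instance_id) matches found by diffing the new
    deployment against the current one; leftover_unneeded: old instances with no
    new counterpart. Throughput is preserved because each new instance comes up,
    possibly on an extra GPU, before its old counterpart is removed."""
    for new_spec, old_id in pairs:
        create_instance(new_spec)    # bring the replacement up first
        delete_instance(old_id)      # then retire the instance it replaces
    for old_id in leftover_unneeded:
        delete_instance(old_id)      # finally drop instances nothing replaced

# Example with print-based stand-ins for the real cluster actions:
exchange_phase(pairs=[("2/7:resnet50", "inst-01"), ("4/7:bert-base", "inst-02")],
               leftover_unneeded=["inst-03"],
               create_instance=lambda spec: print("create", spec),
               delete_instance=lambda iid: print("delete", iid))
```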
### Compact phase
After the exchange phase, some GPUs may not be fully occupied.
Controller gathers running instances from other GPUs so that, together, they fully occupy a not-fully-occupied GPU, and the GPUs they leave become free (sketched below).
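A toy sketch of the packing idea, where each GPU is summarized only by the instance sizes it runs (in 1/7 units); real compaction must also respect MIG's partitioning rules and actually migrate services, which this sketch ignores:

```python
def compact_phase(gpus):
    """gpus: {gpu_id: list of instance sizes currently running, in 1/7 units}.
    Greedily evacuate the least-loaded GPUs onto GPUs that still have room,
    so that some GPUs end up empty and can be released.
    Returns the planned migrations as (instance_size, src_gpu, dst_gpu)."""
    migrations = []
    # visit the least-loaded GPUs first: their instances are the easiest to move
    for src in sorted(gpus, key=lambda g: sum(gpus[g])):
        for size in list(gpus[src]):
            dst = next((g for g in gpus
                        if g != src and gpus[g] and sum(gpus[g]) + size <= 7), None)
            if dst is not None:
                gpus[src].remove(size)
                gpus[dst].append(size)
                migrations.append((size, src, dst))
    return migrations

# Example: gpu2's lone 2/7 instance fits on gpu0, freeing gpu2 entirely.
state = {"gpu0": [4, 1], "gpu1": [7], "gpu2": [2]}
print(compact_phase(state), state)
```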
### Optimizations
1. locality-aware
* prioritizes local instance migrations over cross-machine migrations.
2. parallelism: actions can run in parallel if the affected GPUs are separate
## Implementation
MIG-serving is implemented in Python and Kubernetes (k8s). 
MIG-serving’s actions—instance creation, deletion, migration, and GPU partition—are wrappers of k8s operations.
MIG-serving always chooses the largest batch size possible, as long as the inference latency stays below what the SLOs require (see the sketch below).
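A small sketch of that batch-size rule, assuming a measured latency profile per batch size; the profile data here is made up, and real profiles would come from offline measurement:

```python
def pick_batch_size(latency_profile, slo_latency_ms):
    """latency_profile: {batch_size: measured inference latency in ms}.
    Return the largest batch size whose latency still fits within the SLO."""
    feasible = [b for b, lat in latency_profile.items() if lat <= slo_latency_ms]
    if not feasible:
        raise ValueError("no batch size meets the latency SLO on this instance")
    return max(feasible)

# Example: hypothetical profile of a model on a 2/7 instance, 50 ms latency SLO.
profile = {1: 8.0, 4: 21.0, 8: 39.0, 16: 74.0}
print(pick_batch_size(profile, 50.0))  # -> 8
```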
## Experimental evaluation
### Baselines
1. A100-7×1/7, partitioning GPUs into 1/7 instances
2. A100-7/7, using A100 GPUs as-is
3. A100-MIX, partitioning all A100s into a “4-2-1” layout (one 4/7, one 2/7, and one 1/7 instance)
### Workloads
1. roberta-large
2. bert-base-uncased
3. albert-large-v2
4. resnet101
5. resnet50
### Experiments
* Simulation workloads (requiring hundreds of GPUs)
* Real-world workloads (requiring up to 16 GPUs)

### Results
* GPUs saved
    * We run MIG-serving and the baselines on the simulation workloads and count the number of GPUs each uses.
    * MIG-serving saves up to 40% of GPUs compared to A100-7/7.
    * It is close to the optimal allocation: MIG-serving uses <3% more GPUs than the GPU lower bound.