# LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference
###### tags: `Accelerators`
###### paper origin: HPCA 2021
###### papers: [link](https://arxiv.org/abs/2010.13103)
###### video: [link](https://www.youtube.com/watch?v=GhWS_K4YIZk&ab_channel=vialab)
###### slide: `none`
# 1. INTRODUCTION
*Assume NPUs as the baseline accelerator architecture in this paper.*


## Research Problems
**Limits of “One-Size-Fits-All” Batching**
A fixed batching policy must balance the tradeoff between batching latency and throughput, which is difficult in settings such as cloud inference over streaming request data. For example:

The effect of graph-level batching on ResNet's effective throughput and latency.

Effect of sweeping the batching time-window from 5 ms to 99 ms (traffic = 16/250/2000 requests/sec):

> From the slides:
For this experiment, assume that the batched inputs are already formed at size N, without waiting for them to be collected.

Increasing the batching time-window from 2 to 4 in Figure 4(a-b) does not help increase the batch size and needlessly delays the time at which Req1 and Req2 can start execution. A toy simulation of this time-window tradeoff is sketched below.
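To make the tradeoff concrete, here is a minimal simulation sketch (my own illustration, not from the paper): a fixed-window policy dispatches a batch once `max_batch` requests are collected or the window expires, and the per-batch latency model `exec_ms` is an arbitrary assumption. Short windows cap the wait of early requests but yield small batches; long windows grow batches at the cost of added queuing delay.

```python
import random

def simulate_fixed_window(arrival_rate, window_ms, max_batch,
                          exec_ms=lambda n: 5.0 + 0.5 * n, duration_ms=10_000):
    """Toy model of a fixed batching time-window policy.

    A batch is dispatched when either `max_batch` requests have been
    collected or `window_ms` has elapsed since the window opened.
    `exec_ms(n)` is an illustrative (made-up) latency model for a batch of size n.
    """
    # Poisson arrivals: exponential inter-arrival times, in milliseconds.
    t, arrivals = 0.0, []
    while t < duration_ms:
        t += random.expovariate(arrival_rate / 1000.0)
        arrivals.append(t)

    latencies, i, now = [], 0, 0.0   # `now` = time the accelerator becomes free
    while i < len(arrivals):
        first = max(arrivals[i], now)   # window opens once the first request is queued and the accelerator is free
        deadline = first + window_ms
        batch = [a for a in arrivals[i:i + max_batch] if a <= deadline]
        # Dispatch early if the batch fills up, otherwise wait out the window.
        dispatch = batch[-1] if len(batch) == max_batch else deadline
        finish = max(dispatch, first) + exec_ms(len(batch))
        latencies.extend(finish - a for a in batch)
        i += len(batch)
        now = finish

    return sum(latencies) / len(latencies), len(latencies) / (now / 1000.0)

for window in (5, 25, 99):
    lat, thr = simulate_fixed_window(arrival_rate=250, window_ms=window, max_batch=8)
    print(f"window={window:3d} ms  avg latency={lat:6.1f} ms  throughput={thr:5.0f} req/s")
```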
## Proposed Solutions
We propose LazyBatching, an intelligent batching system that dynamically adjusts the level of batching to balance latency, throughput, and SLA (service level agreement) satisfaction.
Rather than having a batched input execute uninterrupted until the entire graph completes, LazyBatching maintains scheduling at a fine-grained, node-level granularity and allows different (batched) inputs to be interleaved for execution.
The key innovation of LazyBatching is an SLA-aware scheduling algorithm that utilizes domain-specific properties of ML inference to intelligently decide when, and which, inputs are worth lazily batching.
## Contributions
* We develop an SLA-aware slack prediction model which exploits a domain-specific property of ML inference to predict the DNN inference time for slack estimation.
* We propose LazyBatching, a low-cost and practical batching system for cloud inference. Unlike prior work, our solution is not limited to a particular type of DNN layer and can flexibly adapt to the deployment environment without hand-tuning the batching parameters.
* Compared to graph batching, LazyBatching provides an average 15×, 1.5×, and 5.5× improvement in latency, throughput, and SLA satisfaction, respectively.
# 2. Implementation
## Batching
CellularBatching (prior work):

A key challenge of cellular batching is its limited applicability across generic DNN workloads. Because cellular batching is specifically designed to leverage a unique feature of RNNs (weight sharing across recurrent cells), the effect it takes advantage of is no longer applicable when the end-to-end DNN application contains non-RNN layers.
---
An example scenario where cellular batching fails to batch inputs:

---
High-level overview of the LazyBatching model serving system:

* Rather than having a single batched input exclusively execute until completion, our key approach is to maintain scheduling at a fine-grained, node-level granularity and allow different (batched) inputs to be interleaved for execution.
* LazyBatching utilizes this node-level scheduling framework to preferentially schedule newly requested inputs so that they catch up with the progress of earlier inputs that are already in flight but not yet finished. This opens up more batching opportunities, as inputs can be "lazily" batched with each other in an incremental manner.
* The notion of a batching time-window is therefore nonexistent in LazyBatching, because there is no fixed-length time window that inputs must wait for in order to be batched together. In effect, the LazyBatching scheduler constantly fires off one of the nodes within the pool of schedulable inputs, whenever the batching unit finds it appropriate for meeting latency, throughput, and SLA goals.
LazyBatching:

* A key innovation of LazyBatching is the development of an SLA-aware, slack time prediction model which our scheduler utilizes to intelligently judge when/which inputs are worth lazily batching.

(a) Timeline of LazyBatching when executing a graph with 8 nodes.
(b) The changes in the BatchTable as stack entries are pushed/merged.
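Below is a rough sketch of the BatchTable idea in Python, under my own simplifying assumptions (the entry fields, the merge condition, and the `pick` policy hook are illustrative, not the paper's exact bookkeeping): each entry tracks a batch of requests and the next graph node it will execute, a new request is pushed as a fresh entry at node 0, and a newer entry is merged into an older one once it catches up to the same node.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """One BatchTable entry: a (lazily grown) batch and its next node to run."""
    requests: list          # request IDs batched together
    next_node: int = 0      # index of the next graph node to execute

class BatchTable:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.entries = []   # ordered oldest-first, used like a stack

    def push(self, request_id):
        """A newly arrived request starts as its own entry at node 0."""
        self.entries.append(Entry(requests=[request_id]))

    def _merge_caught_up(self):
        """Merge a newer entry into the older one above it once both sit at the same node."""
        merged = []
        for entry in self.entries:
            if merged and merged[-1].next_node == entry.next_node:
                merged[-1].requests.extend(entry.requests)   # lazy, incremental batching
            else:
                merged.append(entry)
        self.entries = merged

    def step(self, pick):
        """Fire one node of one entry. `pick` chooses which entry runs next
        (standing in for the paper's SLA-aware decision)."""
        if not self.entries:
            return None
        entry = pick(self.entries)
        entry.next_node += 1
        if entry.next_node == self.num_nodes:                # whole graph finished
            self.entries.remove(entry)
        else:
            self._merge_caught_up()
        return entry

# Usage: Req1 runs a few nodes, Req2 arrives later, catches up, and gets merged.
table = BatchTable(num_nodes=8)
table.push("Req1")
oldest, newest = (lambda e: e[0]), (lambda e: e[-1])   # placeholder pick policies
table.step(oldest); table.step(oldest)                 # Req1 now waits at node 2
table.push("Req2")
table.step(newest); table.step(newest)                 # Req2 catches up to node 2
print(table.entries)   # one merged entry batching Req1 and Req2 at node 2
```

In the paper, the choice of which entry to fire (and whether an older entry should wait for a newer one to catch up) is driven by the SLA-aware slack estimation described next.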
---
The SLA-aware slack time estimator consists of:
1) node-level latency estimation,
2) graph-wide latency estimation, and
3) the use of these two components for slack estimation.

The other terms in the slack computation are known values, so slack estimation only requires an estimate of an individual, single-batched input's end-to-end, graph-wide execution time.
The slack time of Req1 without batching is:

Assume the initial input (Req1) is batched with (N−1) future requests:
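The equations themselves appear in figures that are not reproduced in this note, so the following is only a rough reconstruction under my own assumptions: slack is the SLA budget left over after the estimated execution time, and batching with future requests additionally charges the time spent waiting for those requests plus the (longer) batched execution time.

$$
\text{Slack}_{\text{noBatch}}(\text{Req}_1) \approx \text{SLA} - T_{\text{single}}
$$

$$
\text{Slack}_{\text{batch}}(\text{Req}_1, N) \approx \text{SLA} - T_{\text{wait}}(N-1) - T_{\text{batch}}(N)
$$

Here $T_{\text{single}}$ is the estimated SingleInputExecTime, $T_{\text{wait}}(N-1)$ is the time Req1 spends waiting for the remaining $(N-1)$ requests to arrive and catch up, and $T_{\text{batch}}(N)$ is the estimated execution time of a size-$N$ batch; intuitively, Req1 is only worth batching further while its estimated slack stays positive.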

---
Predicting SingleInputExecTime:

* Because the computation and memory access characteristics of a graph node (i.e., a DNN layer) are highly regular and input-independent, an individual graph node's execution time on a target hardware architecture is highly deterministic and predictable.
* Profiling a DNN graph's node-level latencies therefore only has to be done once, and the results can be reused for all future inferences of that model.
* Precisely estimating the latency of a dynamic-graph DNN is challenging, because the number of nodes to traverse within the DAG is variable and input-dependent.
* The time-unrolled recurrence length, however, will likely fall within the set of output sequence lengths observed when characterizing the training dataset (a sketch of this estimation follows below).

Number of words within a sentence, characterized across WMT-2019's 30,000 "English-to-German/French/Russian" translation pairs.
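A minimal sketch of how such a predictor could be structured (the function names, the profiling table, and the use of a high percentile of the training-set output lengths are my own illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def profile_node_latencies(graph_nodes, measure_node):
    """One-time offline profiling: per-node latency is input-independent,
    so measure each node once and reuse the table for every future query."""
    return {node: measure_node(node) for node in graph_nodes}   # node -> latency (ms)

def predict_single_input_exec_time(node_latency_ms, static_nodes,
                                   recurrent_nodes=(), train_output_lengths=None,
                                   percentile=99):
    """Estimate a single input's end-to-end, graph-wide execution time.

    Static portion: sum of the profiled node latencies.
    Dynamic portion (e.g., a time-unrolled seq2seq decoder): the unroll count
    is unknown at query time, so assume it falls within the output-length
    distribution characterized over the training set and take a high percentile.
    """
    total = sum(node_latency_ms[n] for n in static_nodes)
    if recurrent_nodes:
        unroll = int(np.percentile(train_output_lengths, percentile))
        total += unroll * sum(node_latency_ms[n] for n in recurrent_nodes)
    return total

# Illustrative numbers only (not measured on any real NPU).
latency = {"encoder": 1.2, "decoder_step": 0.4, "softmax": 0.1}
lengths = np.random.poisson(lam=20, size=30_000)   # stand-in for the WMT-2019 word counts
estimate = predict_single_input_exec_time(latency,
                                          static_nodes=["encoder"],
                                          recurrent_nodes=["decoder_step", "softmax"],
                                          train_output_lengths=lengths)
print(f"estimated SingleInputExecTime ~= {estimate:.1f} ms")
```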
# 3. Result
Effect on average latency per query-arrival rate (x-axis, requests/sec):

Effect on throughput per query-arrival rate (x-axis, requests/sec):

CDF of inference latency under high load (1K req/sec), showing LazyBatching’s effectiveness in reducing tail latency. For clarity, we only plot the best performing graph batching configuration for each workload:

SLA violation rate as a function of batching policy and SLA deadline (x-axis). The query-arrival rate is set to a high load (1K req/sec) to stress-test a batching policy's ability to minimize SLA violations (studying SLA under a low query-arrival rate is meaningless because none of the requests will violate the SLA). We omit impractical data points for brevity (e.g., it does not make sense to configure the batching time-window at 75 ms when the SLA deadline is 40 ms). As the SLA deadline increases, from left to right on the x-axis, the violation rate monotonically decreases for all policies:

LazyBatching sensitivity to other benchmarks. Due to space constraints, we only show two datapoints under the low/high load in (a,b), assuming 16/1000 requests/sec, respectively. Similarly, the SLA violation rate in (c) summarizes our evaluation under high load (1000 requests/sec), where we report the average violation rate as a single result when sweeping the SLA deadline from 20 to 100 ms. BERT's short end-to-end latency means the assumed 20-100 ms SLA deadlines do not cause any SLA violations even under Serial. Regardless, LazyBatching significantly improves latency and throughput under this workload:

LazyBatching sensitivity to GPU-based inference systems. Due to space constraints, we only present detailed results for Transformer, assuming the same evaluation methodology in Section VI-A and Section VI-B:

# 4. Report
LazyBatching is an intelligent batching system that dynamically adjusts the level of batching to meet latency, throughput, and SLA requirements.
Compared to the baseline graph batching, LazyBatching provides average 15×, 1.5×, and 5.5× improvements in user-responsiveness (latency), throughput, and SLA satisfaction, respectively.