# Clipper: A Low-Latency Online Prediction Serving System
###### tags: `Model Serving`
###### paper origin: NSDI 2017
###### papers: [link](https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf)
###### slides: [link](https://www.usenix.org/sites/default/files/conference/protected-files/nsdi17_slides_crankshaw.pdf)
## INTRODUCTION
### Research Problems
* Most machine learning frameworks and systems only address model training and not deployment.
* Complexity of Deploying Machine Learning
* Prediction Latency and Throughput
* Model Selection
### Proposed Solution
* Clipper introduces a model abstraction layer and common prediction interface that isolates applications from variability in machine learning frameworks and simplifies the process of deploying a new model or framework to a running application
* Clipper automatically and adaptively batches prediction requests to maximize the use of batch-oriented system optimizations in machine learning frameworks while ensuring that prediction latency objectives are still met
* Clipper leverages adaptive online model selection and ensembling techniques to incorporate feedback and automatically select and combine predictions from models that can span multiple machine learning frameworks.
## Model Abstraction Layer
* a prediction cache
* an adaptive query-batching component
* a set of model containers connected to Clipper via a lightweight RPC system
### Caching
* The prediction cache acts as a function cache for the generic prediction function

* Clipper employs an LRU eviction policy for the prediction cache, using the standard CLOCK cache eviction algorithm
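
A minimal sketch of such a prediction cache, assuming a simple in-process map keyed on (model, input) with CLOCK approximating LRU eviction. The `PredictionCache` class and its methods are illustrative assumptions, not Clipper's actual implementation.

```python
import hashlib


class PredictionCache:
    """Approximate-LRU (CLOCK) cache keyed on (model, serialized input)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.slots = [None] * capacity   # each slot: [key, prediction, ref_bit]
        self.index = {}                  # key -> slot position
        self.hand = 0                    # CLOCK hand

    def _key(self, model, x):
        return (model, hashlib.sha1(repr(x).encode()).hexdigest())

    def get(self, model, x):
        key = self._key(model, x)
        pos = self.index.get(key)
        if pos is None:
            return None                  # cache miss: caller queries the model
        self.slots[pos][2] = 1           # mark as recently used
        return self.slots[pos][1]

    def put(self, model, x, prediction):
        key = self._key(model, x)
        if key in self.index:
            pos = self.index[key]
            self.slots[pos][1] = prediction
            self.slots[pos][2] = 1
            return
        # advance the CLOCK hand until a slot with a cleared reference bit is found
        while self.slots[self.hand] is not None and self.slots[self.hand][2] == 1:
            self.slots[self.hand][2] = 0
            self.hand = (self.hand + 1) % self.capacity
        evicted = self.slots[self.hand]
        if evicted is not None:
            del self.index[evicted[0]]
        self.slots[self.hand] = [key, prediction, 1]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity
```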
### Batching
* Batching increases throughput via two mechanisms
* Batching amortizes the cost of RPC calls and internal framework overheads (e.g., copying inputs to GPU memory)
* Batching enables machine learning frameworks to exploit existing data-parallel optimizations
* The challenge is choosing a batch size that maximizes throughput while still meeting the latency SLO.

* Dynamic Batch Size
* Additive-increase-multiplicative-decrease (AIMD) scheme
* Additively increase the batch size by a fixed amount until the latency to process a batch exceeds the latency objective. At this point, perform a small multiplicative backoff, reducing the batch size by 10% (see the sketch after this list).

* Delayed Batching
* For some models, briefly delaying the dispatch to allow more queries to arrive can significantly improve throughput under bursty loads.
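
Below is a minimal sketch of the AIMD batch-size controller described above. The `AIMDBatcher` class, the additive step size, and the 10% backoff factor are illustrative assumptions; Clipper's real batcher sits behind its RPC system and per-model queues.

```python
import time


class AIMDBatcher:
    """Adaptively tunes the batch size so batch processing stays under the latency SLO."""

    def __init__(self, slo_seconds, additive_step=4, backoff=0.9, initial_batch_size=1):
        self.slo = slo_seconds
        self.step = additive_step        # additive increase (illustrative value)
        self.backoff = backoff           # multiplicative decrease: shrink batch by 10%
        self.batch_size = initial_batch_size

    def process(self, queue, model_predict_batch):
        """Dequeue up to batch_size inputs, run the model, and adapt the batch size."""
        batch = [queue.pop(0) for _ in range(min(self.batch_size, len(queue)))]
        if not batch:
            return []
        start = time.monotonic()
        outputs = model_predict_batch(batch)
        elapsed = time.monotonic() - start
        if elapsed > self.slo:
            # latency objective exceeded: small multiplicative backoff
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        else:
            # still under the SLO: additively probe for a larger batch
            self.batch_size += self.step
        return outputs


# usage: batcher = AIMDBatcher(slo_seconds=0.020)
#        predictions = batcher.process(pending_queries, model.predict_batch)
```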
### Model Containers
Use a common interface for each model container
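
A hedged sketch of what such a common interface could look like: each framework-specific container wraps its model behind a single batch-prediction method, with the RPC plumbing omitted. The `ModelContainer` and `SklearnContainer` names and signatures are assumptions for illustration, not Clipper's container API.

```python
from abc import ABC, abstractmethod
from typing import List


class ModelContainer(ABC):
    """Uniform interface every framework-specific container implements."""

    @abstractmethod
    def predict_batch(self, inputs: List[bytes]) -> List[str]:
        """Take a batch of serialized inputs and return one prediction per input."""


class SklearnContainer(ModelContainer):
    """Example wrapper around a scikit-learn style model (illustrative only)."""

    def __init__(self, model, deserialize):
        self.model = model
        self.deserialize = deserialize   # bytes -> feature vector

    def predict_batch(self, inputs):
        features = [self.deserialize(x) for x in inputs]
        return [str(y) for y in self.model.predict(features)]
```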

* Container Replica Scaling
* A GPU cluster may consist of multiple GPU cards.
* But the network can become the bottleneck.
## Model Selection Layer
* The Model Selection Layer uses feedback to dynamically select one or more of the deployed models and combine their outputs to provide more accurate and robust predictions.
* Build a common interface for multiple model selection policies (see the interface sketch below)

* Provide two generic model selection policies based on robust bandit algorithms
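
A sketch of a common selection-policy interface in the spirit of the paper's state-based design; the method names below (`init_state`, `select`, `combine`, `observe`) are paraphrased for illustration, not the paper's exact signatures.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class SelectionPolicy(ABC):
    """Common interface a model selection policy implements."""

    @abstractmethod
    def init_state(self) -> Any:
        """Create the per-application selection state."""

    @abstractmethod
    def select(self, state: Any, x: Any) -> List[str]:
        """Choose which deployed model(s) to query for input x."""

    @abstractmethod
    def combine(self, state: Any, x: Any, predictions: Dict[str, Any]) -> Tuple[Any, float]:
        """Merge the returned predictions into one output plus a confidence score."""

    @abstractmethod
    def observe(self, state: Any, x: Any, prediction: Any, feedback: Any) -> Any:
        """Update the state when ground-truth feedback arrives."""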
### Single Model Selection Policy
* The model selection problem can be reduced to a multi-armed bandit problem
* Use Exp3
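
A self-contained Exp3 sketch, treating each deployed model as a bandit arm and using a feedback-derived accuracy in [0, 1] as the reward; the `Exp3` class and the `gamma` value are illustrative assumptions.

```python
import math
import random


class Exp3:
    """Exp3 bandit: keeps one weight per model and samples proportionally."""

    def __init__(self, num_models, gamma=0.1):
        self.k = num_models
        self.gamma = gamma                     # exploration rate (illustrative)
        self.weights = [1.0] * num_models

    def _probs(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def select(self):
        """Sample a model index according to the Exp3 distribution."""
        return random.choices(range(self.k), weights=self._probs())[0]

    def observe(self, model_index, reward):
        """Update the chosen model's weight with its importance-weighted reward in [0, 1]."""
        p = self._probs()[model_index]
        estimated = reward / p                 # unbiased estimate of the reward
        self.weights[model_index] *= math.exp(self.gamma * estimated / self.k)
```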
### Ensemble Model Selection Policies

* Use Exp4
* Robust Predictions
* Applications can choose to only accept predictions above a confidence threshold by using the robust model selection policy.
* Straggler Mitigation
* As we add model containers we increase the chance of stragglers adversely affecting tail latencies.
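
As a rough illustration of the ensembling and confidence-threshold ideas (not the paper's Exp4-based policy), the sketch below combines per-model predictions with a weighted vote and reports the agreeing weight fraction as a confidence score; `combine_predictions` and the threshold value are assumptions. The same combiner also fits the straggler-mitigation point above: at the latency deadline it can simply be run over whichever predictions have already arrived.

```python
from collections import defaultdict


def combine_predictions(predictions, weights, confidence_threshold=0.7):
    """predictions: {model_name: label}, weights: {model_name: float}.

    Returns (label, confidence), or (None, confidence) when the ensemble is
    not confident enough, letting the application fall back to a default action."""
    votes = defaultdict(float)
    total = 0.0
    for model, label in predictions.items():
        w = weights.get(model, 0.0)
        votes[label] += w
        total += w
    if total == 0.0:
        return None, 0.0
    best_label, best_weight = max(votes.items(), key=lambda kv: kv[1])
    confidence = best_weight / total
    if confidence < confidence_threshold:
        return None, confidence       # below threshold: the robust policy abstains
    return best_label, confidence
```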

## System Comparison
* Compared with TensorFlow Serving
* Clipper adds only a small overhead in throughput and latency