# Clipper: A Low-Latency Online Prediction Serving System

###### tags: `Model Serving`
###### paper origin: NSDI 2017
###### paper: [link](https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf)
###### slides: [link](https://www.usenix.org/sites/default/files/conference/protected-files/nsdi17_slides_crankshaw.pdf)

## INTRODUCTION

### Research Problems
* Most machine learning frameworks and systems address only model training, not deployment.
* Complexity of Deploying Machine Learning
* Prediction Latency and Throughput
* Model Selection

### Proposed Solution
* Clipper introduces a model abstraction layer and a common prediction interface that isolates applications from the variability of machine learning frameworks and simplifies deploying a new model or framework to a running application.
* Clipper automatically and adaptively batches prediction requests to maximize the use of batch-oriented system optimizations in machine learning frameworks while still meeting prediction latency objectives.
* Clipper leverages adaptive online model selection and ensembling techniques to incorporate feedback and to automatically select and combine predictions from models that can span multiple machine learning frameworks.

## Model Abstraction Layer
The model abstraction layer consists of:
* a prediction cache
* an adaptive query-batching component
* a set of model containers connected to Clipper via a lightweight RPC system

### Caching
* The prediction cache acts as a function cache for the generic prediction function.

![](https://i.imgur.com/XCHiUf5.png)

* Clipper employs an LRU eviction policy for the prediction cache, using the standard CLOCK cache eviction algorithm.

### Batching
* Batching increases throughput via two mechanisms:
    * it amortizes the cost of RPC calls and internal framework overheads, and
    * it enables machine learning frameworks to exploit existing data-parallel optimizations.
* The challenge is to choose a batch size that maximizes throughput while still meeting the latency SLO.

![](https://i.imgur.com/3PdxEWP.png)

* Dynamic Batch Size
    * Additive-increase-multiplicative-decrease (AIMD) scheme (a minimal sketch appears at the end of this note).
    * Additively increase the batch size by a fixed amount until the latency to process a batch exceeds the latency objective; at that point, perform a small multiplicative backoff, reducing the batch size by 10%.

![](https://i.imgur.com/waLjDck.png)

* Delayed Batching
    * For some models, briefly delaying dispatch to allow more queries to arrive can significantly improve throughput under bursty loads.

### Model Containers
Clipper uses a common interface for each model container (see the sketch at the end of this note).

![](https://i.imgur.com/Gsk8RVW.png)

* Container Replica Scaling
    * A GPU cluster may consist of multiple GPU cards, so model containers can be replicated to scale throughput.
    * However, the network may become the bottleneck.

## Model Selection Layer
* The model selection layer uses feedback to dynamically select one or more of the deployed models and combine their outputs to provide more accurate and robust predictions.
* It defines a common interface that supports multiple model selection policies.

![](https://i.imgur.com/RCMGCZU.png)

* Clipper provides two generic model selection policies based on robust bandit algorithms.

### Single Model Selection Policy
* The model selection problem can be reduced to a multi-armed bandit problem.
* Clipper uses Exp3 (see the sketch below).

### Ensemble Model Selection Policies

![](https://i.imgur.com/yGmsBEk.png)

* Clipper uses Exp4.
* Robust Predictions
    * Applications can choose to accept only predictions above a confidence threshold by using the robust model selection policy.
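To make the bandit-based selection above concrete, here is a minimal sketch of an Exp3-style weight update over deployed models, assuming feedback arrives as a reward in [0, 1] (e.g., 1 when the served prediction matches the later-observed label). The names (`Exp3Selector`, `select`, `update`) are illustrative and not Clipper's actual API.

```python
import math
import random


class Exp3Selector:
    """Minimal Exp3-style selector over K deployed models (illustrative, not Clipper's API)."""

    def __init__(self, num_models, gamma=0.1):
        self.k = num_models
        self.gamma = gamma                  # exploration rate
        self.weights = [1.0] * num_models   # one weight per deployed model

    def _probs(self):
        total = sum(self.weights)
        # Mix the weight-proportional distribution with uniform exploration.
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def select(self):
        """Pick which model serves the next query; also return the sampling distribution."""
        probs = self._probs()
        idx = random.choices(range(self.k), weights=probs)[0]
        return idx, probs

    def update(self, model_idx, probs, reward):
        """Fold feedback (reward in [0, 1]) back into the chosen model's weight."""
        estimated = reward / probs[model_idx]   # importance-weighted reward estimate
        self.weights[model_idx] *= math.exp(self.gamma * estimated / self.k)


# Usage: pick a model per query, serve it, and report feedback when the label arrives.
selector = Exp3Selector(num_models=3)
idx, probs = selector.select()
# ... dispatch the query to model container `idx`, later observe feedback ...
selector.update(idx, probs, reward=1.0)
```

Exp4 extends the same multiplicative-weight idea to the ensemble case by treating each deployed model as an expert whose per-query predictions are combined according to the learned weights.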
* Straggler Mitigation
    * As more model containers are added, the chance of stragglers adversely affecting tail latencies increases.
    * Clipper's selection layer mitigates this with a best-effort strategy: it renders the ensemble prediction from the models that respond within the latency deadline rather than waiting for stragglers.

![](https://i.imgur.com/r2hekb1.png)

## System Comparison
* Comparison with TensorFlow Serving
    * Clipper adds only a small overhead in throughput and latency.

![](https://i.imgur.com/Yj2PVAa.png)
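Referring back to the adaptive batching section above, the following is a minimal sketch of the AIMD batch-size rule it describes: additive increase while batches meet the latency objective, then a 10% multiplicative backoff on a miss. The queue and container calls (`take_up_to`, `predict_batch`) are hypothetical placeholders rather than Clipper's implementation.

```python
import time


class AimdBatchSizer:
    """Illustrative AIMD controller for a model's batch size (not Clipper's actual code)."""

    def __init__(self, slo_seconds, additive_step=4, backoff=0.9, initial=1):
        self.slo = slo_seconds       # per-batch latency objective
        self.step = additive_step    # additive increase after a batch that meets the SLO
        self.backoff = backoff       # multiplicative decrease: 0.9 == reduce by 10%
        self.batch_size = initial

    def next_batch_size(self):
        return self.batch_size

    def record(self, batch_latency_seconds):
        """Grow additively until a batch exceeds the SLO, then back off by 10%."""
        if batch_latency_seconds > self.slo:
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        else:
            self.batch_size += self.step


def serve_loop(queue, container, sizer):
    """Pull up to the current batch size from the queue, predict, and adapt."""
    while True:
        batch = queue.take_up_to(sizer.next_batch_size())  # hypothetical queue API
        if not batch:
            continue
        start = time.monotonic()
        container.predict_batch(batch)                      # hypothetical container call
        sizer.record(time.monotonic() - start)
```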
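Likewise, for the Model Containers section, here is a minimal sketch of what a framework-agnostic container interface could look like, assuming byte-serialized inputs and outputs; the names (`ModelContainer`, `predict_batch`, `SklearnContainer`) are illustrative only and do not reflect Clipper's RPC API.

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class ModelContainer(ABC):
    """Illustrative framework-agnostic container interface (names assumed)."""

    @abstractmethod
    def predict_batch(self, inputs: List[bytes]) -> List[bytes]:
        """Batched prediction over serialized inputs; returns one serialized output per input."""


class SklearnContainer(ModelContainer):
    """Example adapter that puts a scikit-learn-style model behind the common interface."""

    def __init__(self, model, deserialize: Callable, serialize: Callable):
        self.model = model
        self.deserialize = deserialize   # bytes -> feature vector
        self.serialize = serialize       # prediction -> bytes

    def predict_batch(self, inputs: List[bytes]) -> List[bytes]:
        features = [self.deserialize(x) for x in inputs]
        predictions = self.model.predict(features)   # the framework sees the whole batch at once
        return [self.serialize(p) for p in predictions]
```

A thin adapter of this shape per framework is what lets the batching and selection layers treat all deployed models uniformly.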