# Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems
###### tags: `Model Serving`
###### paper: [link](https://www.usenix.org/system/files/osdi18-lee.pdf)
###### slide: [link](https://www.usenix.org/sites/default/files/conference/protected-files/osdi18_slides_lee.pdf)
##### paper origin: OSDI, 2018

## Abstract
* Prediction serving requires **low latency**, **high throughput**, and **graceful performance degradation** under heavy load.
* Current **black box** serving systems ignore prediction-time-specific optimizations in favor of ease of deployment.
* PRETZEL is a **white box** architecture enabling both end-to-end and multi-model optimizations.

## 1 Introduction

## 2 Model Serving: State-of-the-Art and Limitations
* This work only considers pipelines composed of featurizers and classical ML models (**no deep learning**).
* Existing systems handle multiple requests in **batches** and **cache** inference results when the same prediction is frequently issued to the same pipeline.
* Limitations:
    * Memory Waste
      Taking sentiment analysis as an example, many pipelines share the same structure; only the **logistic regression** weights are unique.
      ![](https://i.imgur.com/nzJLb2I.png)
    * Prediction Initialization
        * The existing design lazily materializes input feature vectors and tries to reuse existing vectors between intermediate transformations.
        * This forces memory allocation along the data path, making prediction latency sub-optimal and hard to predict.
        * As a result, it is hard for ML-as-a-service providers to give strong tail-latency guarantees.
      ![](https://i.imgur.com/m8ovvz8.png)
    * Infrequent Accesses
      Models must reside in memory; otherwise loading them on demand may violate the Service Level Agreement (SLA).
    * Operator-at-a-time Model
      ![](https://i.imgur.com/L66jAkr.png)
      ![](https://i.imgur.com/09VsKKP.png)
    * Coarse-Grained Scheduling
      Allocating one thread per request results in too many threads.

## 3 White Box Prediction Serving: Design Principles
### White Box Prediction Serving
* With visibility into pipeline internals, the system can allocate resources freely.
### End-to-end Optimizations
1. Avoid memory allocation on the data path.
2. Avoid creating separate routines per operator when possible, since these are sensitive to branch mis-prediction and poor data locality.
3. Avoid reflection and JIT compilation at prediction time.
* Optimal computation units can be compiled Ahead-Of-Time (AOT).
### Multi-model Optimizations
* Share components across models.

## 4 The Pretzel System
### 4.1 Off-line Phase
![](https://i.imgur.com/lhdPYov.png)
#### Flour
* Intermediate representation: a pipeline is written as a chain of transformations that is only compiled into an executable plan later (see the sketch below).
![](https://i.imgur.com/SsZ3Wha.png)
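The paper's own Flour example is a C#/LINQ-style program, which I don't reproduce here. As a rough illustration of the "build a lazy DAG of operators, then compile it into a plan" pattern, here is a minimal Python sketch; every name in it (`Node`, `then`, `concat`, `plan`, the operator strings) is a hypothetical stand-in, not the actual Flour API.

```python
# Illustrative sketch only: hypothetical stand-ins for the C#/LINQ-style
# Flour API described in the paper.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    op: str                      # e.g. "tokenize", "char_ngram", "linear_classifier"
    params: dict
    inputs: List["Node"] = field(default_factory=list)

    def then(self, op, **params):
        # Chain another transformation; nothing executes yet (lazy DAG build).
        return Node(op, params, inputs=[self])


def concat(*nodes):
    # Merge several feature branches into one node.
    return Node("concat", {}, inputs=list(nodes))


def plan(root):
    # Stand-in for Oven: walk the DAG once and emit a linearized "model plan".
    stages, seen = [], set()

    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for parent in n.inputs:
            visit(parent)
        stages.append((n.op, n.params))

    visit(root)
    return stages


# Example: the sentiment-analysis pipeline shape used in the paper's examples.
text = Node("csv_text", {"field": "review"})
tokens = text.then("tokenize")
chars = tokens.then("char_ngram", n=3)
words = tokens.then("word_ngram", n=2)
model = concat(chars, words).then("linear_classifier", weights="per-model")
print(plan(model))
```

The useful property is the separation: the chained calls only record the pipeline, so a compiler in the role of Oven can later fuse operators into stages, pick physical implementations, and AOT-compile them, leaving no reflection or JIT work for prediction time.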
#### Oven
* Optimizer
* Model Plan Compiler
#### Object Store
* Keeps only one copy of each shared component.
### 4.2 On-line Phase
![](https://i.imgur.com/BJEJlyR.png)
#### Runtime
* request-response engine
* batch engine
#### Scheduler
### 4.3 Additional Optimizations
#### Sub-plan Materialization
* An LRU policy reuses sub-plan results on top of the Object Store.
#### External Optimizations
* caching
* delayed batching

## 5 Evaluation
### 5.1 Memory
As shown in the following figure, the Object Store lets Pretzel use much less memory.
![](https://i.imgur.com/Tkpz2id.png)
### 5.2 Latency
#### Micro-benchmark
As shown in the following figure, Pretzel is faster in both the cold and hot cases.
![](https://i.imgur.com/iUa53t5.png)
##### AOT compilation
* no JIT at prediction time
* stage code can be loaded into the cache ahead of time
##### Vector Pooling
* vectors are pre-allocated in memory and reused
##### Sub-plan Materialization
* results of common featurizers are reused
![](https://i.imgur.com/rmzKHjE.png)
#### End-to-end
End-to-end, the overhead is about the same.
![](https://i.imgur.com/z4dJGAz.png)
### 5.3 Throughput
![](https://i.imgur.com/avgfpI6.png)
### 5.4 Heavy Load
![](https://i.imgur.com/wV9Ez4i.png)

# My point of view
* Only CPU servers are discussed.
* Deep learning scenarios are not discussed.
* Only a mix of two kinds of models is evaluated.
* In practice, most queries and models may differ, making reuse hard.