# Pretzel : Opening the Black Box of Machine Learning Prediction Serving Systems
###### tags: `Model Serving`
###### paper: [link](https://www.usenix.org/system/files/osdi18-lee.pdf)
###### slide: [link](https://www.usenix.org/sites/default/files/conference/protected-files/osdi18_slides_lee.pdf)
##### paper origin: OSDI, 2018
## Abstract
* Prediction serving requires **low latency**, **high throughput** and **graceful performance degradation** under heavy load.
* Current **black box** serving systems ignore prediction-time-specific optimizations in favor of ease of deployment.
* PRETZEL is a **white box** architecture enabling both end-to-end and multi-model optimizations.
## 1 Introduction
## 2 Model Serving: State-of-the-Art and Limitations
* This work focuses only on pipelines composed of featurizers and classical ML models (**no deep learning**).
* Existing systems handle multiple requests in **batches** and **cache** inference results when the same predictions are frequently issued to a pipeline.
* Limitations:
    * Memory Waste
    Take sentiment analysis as an example: many pipelines share the same structure, and only the **logistic regression** weights are unique to each model, yet black-box systems load a full copy of every pipeline (see the sketch after this list).

    * Prediction Initialization
        * Input feature vectors are materialized lazily, and existing vectors are reused between intermediate transformations when possible.
        * This design forces memory allocation along the data path, making prediction latency sub-optimal and hard to predict.
        * As a result, ML-as-a-service providers find it hard to offer strong tail-latency guarantees.

    * Infrequent Accesses
    Even rarely used models must stay resident in memory; loading them on demand risks violating the Service Level Agreement (SLA).
    * Operator-at-a-time Model
    Each operator runs in isolation over its input, which adds per-operator overhead and hurts data locality for single-record predictions.


    * Coarse-Grained Scheduling
    Allocating one thread per request leads to an excessive number of threads and poor utilization under load.
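The following sketch (my own Python illustration, not code from the paper) shows why black-box loading wastes memory: every deployed pipeline carries its own copy of an identical featurizer, even though only the classifier weights differ. `SHARED_VOCAB`, `BlackBoxPipeline`, and `WhiteBoxPipeline` are made-up names.

```python
# Hypothetical featurizer state (e.g., an n-gram vocabulary) that is
# byte-for-byte identical across all sentiment-analysis pipelines.
SHARED_VOCAB = {f"ngram_{i}": i for i in range(100_000)}

class BlackBoxPipeline:
    """Opaque deployment unit: carries its own copy of the featurizer
    state plus the only truly unique part, the classifier weights."""
    def __init__(self, weights):
        self.vocab = dict(SHARED_VOCAB)   # duplicated in every instance
        self.weights = weights

class WhiteBoxPipeline:
    """White-box alternative: reference one shared featurizer and keep
    only the per-model weights private."""
    def __init__(self, weights, shared_vocab=SHARED_VOCAB):
        self.vocab = shared_vocab         # shared reference, not a copy
        self.weights = weights

# With 50 deployed models, the black-box variant holds 50 vocabulary
# copies while the white-box variant holds 1.
black = [BlackBoxPipeline([0.1] * 10) for _ in range(50)]
white = [WhiteBoxPipeline([0.1] * 10) for _ in range(50)]
```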
## 3 White Box Prediction Serving: Design Principles
### White Box Prediction Serving
* Pipeline internals are visible to the serving runtime, so it can allocate resources (memory, CPU) across models freely.
### End-to-end Optimizations
1. avoid memory allocation on the data path.
2. avoid creating separate routines per operator when possible, which are sensitive to branch mis-prediction and poor data locality.
3. avoid reflection and JIT compilation at prediction time.
* Optimal computation units can be compiled Ahead-Of-Time (AOT).
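A minimal sketch of what the principles above mean in practice (my own illustration; `fused_stage` and the bag-of-words featurizer are assumptions, not PRETZEL's API): featurization and scoring run in a single routine, and the output buffer comes from a pool instead of being allocated per request.

```python
import numpy as np

DIM = 1_000
# Pooled output buffer: reused across requests, so the hot path
# performs no allocation.
_POOL = np.zeros(DIM, dtype=np.float32)

def fused_stage(tokens, vocab, weights, out=_POOL):
    """One routine covering featurization + scoring, instead of one
    routine (and one temporary vector) per operator."""
    out[:] = 0.0
    for t in tokens:                      # bag-of-words into the pooled buffer
        idx = vocab.get(t)
        if idx is not None:
            out[idx] += 1.0
    return float(out @ weights)           # score without materializing intermediates

vocab = {"good": 0, "bad": 1}
weights = np.zeros(DIM, dtype=np.float32)
weights[0], weights[1] = 1.0, -1.0
print(fused_stage("good good bad".split(), vocab, weights))  # 1.0
```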
### Multi-model Optimizations
* Shared components (e.g., identical featurizers and their parameters) are stored once and executed on behalf of multiple pipelines.
## 4 The Pretzel System
### 4.1 Off-line Phase

#### Flour
* A language-integrated API used as the intermediate representation: pipelines are expressed as a DAG of transformations that Oven can later optimize (see the sketch below).
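The paper's Flour API is embedded in C#; the sketch below is a rough Python stand-in (the `Plan`/`then` names are mine) just to show the idea of an intermediate representation: chaining transforms records a plan rather than executing anything.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Plan:
    """Recorded pipeline IR: an ordered list of (operator, params)."""
    ops: List[Tuple[str, dict]] = field(default_factory=list)

    def then(self, name: str, **params) -> "Plan":
        # Nothing runs here; we only extend the plan.
        return Plan(self.ops + [(name, params)])

pipeline_ir = (
    Plan()
    .then("Tokenize", separator=" ")
    .then("CharNgram", n=3)
    .then("LogisticRegression", weights_ref="model_42")  # parameters bound later
)
print(pipeline_ir.ops)
```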

#### Oven
* Optimizer: applies end-to-end optimizations over the Flour DAG (e.g., grouping operators into stages).
* Model Plan Compiler: compiles logical stages into physical stages ahead of time (see the sketch below).
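A minimal sketch of the compilation step, under my own simplifying assumptions (the `FUSABLE` set and the stage-boundary rule are invented): contiguous featurizers in the IR are grouped into one stage so that each stage can be AOT-compiled as a single computation unit, while the predictor becomes its own stage.

```python
FUSABLE = {"Tokenize", "CharNgram", "ConcatVectors"}  # featurizers safe to fuse

def compile_to_stages(ops):
    """Group contiguous fusable operators into stages."""
    stages, current = [], []
    for name, params in ops:
        if name in FUSABLE:
            current.append((name, params))      # keep fusing into the open stage
        else:
            if current:
                stages.append(current)          # close the featurizer stage
                current = []
            stages.append([(name, params)])     # predictor gets its own stage
    if current:
        stages.append(current)
    return stages

ops = [("Tokenize", {}), ("CharNgram", {"n": 3}), ("LogisticRegression", {})]
print(compile_to_stages(ops))
# [[('Tokenize', {}), ('CharNgram', {'n': 3})], [('LogisticRegression', {})]]
```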
#### Object Store
* Keeps only one copy of each shared component (e.g., operator parameters such as vocabularies and weight matrices) across model plans (see the sketch below).
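A toy sketch of the single-copy idea (content-hash keying is my assumption; the paper only says shared objects are kept once): identical parameter blobs registered by different models resolve to the same stored object.

```python
import hashlib, pickle

class ObjectStore:
    """Keep one copy of each parameter blob, keyed by content hash."""
    def __init__(self):
        self._objects = {}

    def put(self, obj):
        key = hashlib.sha1(pickle.dumps(obj)).hexdigest()
        # If an identical blob was registered before, return that copy.
        return self._objects.setdefault(key, obj), key

store = ObjectStore()
a, key_a = store.put({"good": 0, "bad": 1})   # model A's vocabulary
b, key_b = store.put({"good": 0, "bad": 1})   # model B loads identical content
assert key_a == key_b and a is b              # only one copy lives in the store
```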
### 4.2 On-line Phase

#### Runtime
* request-response engine for single, latency-sensitive predictions
* batch engine for scoring many records at once
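A sketch of the two execution modes (function names are mine, not PRETZEL's API): the request-response path scores one record with minimal overhead, while the batch path amortizes per-stage overhead over many records.

```python
def predict_one(stage_fns, record):
    """Request-response engine: run the compiled stages on one record."""
    x = record
    for fn in stage_fns:
        x = fn(x)
    return x

def predict_batch(stage_fns, records):
    """Batch engine: run each stage over the whole batch."""
    xs = list(records)
    for fn in stage_fns:
        xs = [fn(x) for x in xs]          # a real engine would vectorize here
    return xs

stages = [lambda s: len(s.split()), lambda n: n * 0.1]
print(predict_one(stages, "good movie"))              # 0.2
print(predict_batch(stages, ["good", "good movie"]))  # [0.1, 0.2]
```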
#### Scheduler
* Coordinates execution of physical stages over a fixed set of cores, so many pipelines share threads instead of each request owning one (see the sketch below).
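A minimal contrast with thread-per-request scheduling (this worker-pool design is my own illustration, not PRETZEL's actual scheduler): stage tasks from many pipelines are queued and served by a small, fixed pool of workers, so thread count is bounded by cores rather than by concurrent requests.

```python
import queue, threading

tasks, results = queue.Queue(), []

def worker():
    while True:
        item = tasks.get()
        if item is None:                 # sentinel: shut this worker down
            break
        pipeline_id, payload = item
        results.append((pipeline_id, payload * 2))   # stand-in for running a stage
        tasks.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]  # bounded by cores
for t in pool:
    t.start()
for i in range(100):                     # 100 requests, still only 4 threads
    tasks.put((i % 3, i))
tasks.join()
for _ in pool:
    tasks.put(None)
for t in pool:
    t.join()
print(len(results))                      # 100
```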
### 4.3 Additional Optimizations
#### Sub-plan Materialization
* Outputs of common sub-plans (e.g., shared featurizers) are cached with an LRU policy on top of the Object Store and reused across predictions (see the sketch below).
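A sketch of sub-plan materialization using a plain LRU cache (keying by `(sub_plan_id, input)` is my assumption): when several pipelines share the same featurizer prefix, its output for a given input is computed once and reused.

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def materialize_subplan(sub_plan_id: str, text: str) -> tuple:
    """Stand-in for running a shared featurizer sub-plan."""
    return tuple(sorted(set(text.split())))

materialize_subplan("char_ngram_v1", "good movie good")   # computed
materialize_subplan("char_ngram_v1", "good movie good")   # served from cache
print(materialize_subplan.cache_info().hits)              # 1
```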
#### External Optimizations
* caching of prediction results
* delayed batching (see the sketch below)
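A sketch of delayed batching under assumed knobs (`max_batch` and `max_delay` are illustrative): incoming requests are held briefly until either the batch fills or a small deadline expires, trading a bounded amount of latency for throughput.

```python
import time

def collect_batch(poll_request, max_batch=32, max_delay=0.002):
    """Gather requests until the batch is full or the delay expires."""
    batch, deadline = [], time.monotonic() + max_delay
    while len(batch) < max_batch and time.monotonic() < deadline:
        req = poll_request()             # non-blocking; returns None if no request
        if req is not None:
            batch.append(req)
    return batch

pending = list(range(10))
batch = collect_batch(lambda: pending.pop(0) if pending else None)
print(batch)                             # up to 10 requests gathered in one batch
```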
## 5 Evaluation
### 5.1 Memory
As shown in the following figure, the Object Store lets Pretzel use much less memory than black-box serving.

### 5.2 Latency
#### micro-benchmark
As shown in the following figure, Pretzel is faster in both cold and hot conditions.

##### AOT compilation
* no JIT compilation at prediction time
* stage code is compiled ahead of time, so it is ready in the instruction cache when a prediction arrives
##### Vector Pooling
* vectors are pre-allocated in memory pools, so the data path avoids allocation and garbage collection
##### Sub-plan Materialization
* outputs of common featurizers are computed once and reused across pipelines

#### end-to-end
The additional end-to-end overhead (beyond the prediction itself) is comparable across systems.

### 5.3 Throughput

### 5.4 Heavy Load

# My point of view
* Discusses only CPU servers.
* Does not cover deep learning scenarios.
* The evaluation only mixes two kinds of model pipelines.
* In practice, queries and models may differ widely, making sharing and reuse harder.