Vessels: Efficient and Scalable DNN Prediction on Trusted Processors
===

## Introduction

* Deep learning is popular and prevalent
  * Object and voice recognition, image classification
  * Medical diagnosis, face recognition, autonomous driving
* DNN
  * Training -> prediction
* Stealing and manipulation attacks on deep learning
  * Training data, model, and architecture all exist in the form of data
  * Data breaches, Trojan attacks
  * Data at rest and in motion are protected by cryptography
  * Difficult to protect data in use (during prediction)
* Protecting DNN prediction using Intel SGX
  * SGX can protect the confidentiality and integrity of data during DNN prediction
  * EPC limit of 128 MB
    * Linux: page swapping; Windows: cannot run at all
    * A single page swap takes XXX cycles [CK: Cite]
  * DNN prediction requires large memory
    * The DenseNet model is XXX MB and requires XXX MB of memory for prediction
  * EPC is shared by all enclaves on the physical machine
    * Scalability issue with many prediction requests in parallel
    * Instant DNN prediction in the cloud (e.g., Siri, Alexa, [CK: Use the examples in the Neurosurgeon paper])
    * EPC thrashing
* Existing work and its problems
  * Fig. 1: Performance of existing DNN prediction with SGX
    * AlexNet (?)
    * Native, Eleos, TF-Trusted [CK: Use these as the baselines instead of Darknet-SGX in evaluation?]
    * Y-axis: Avg. response time (per request) [CK: instead of throughput?]
    * X-axis: Number of simultaneous requests
  * Eleos: we ported Darknet (a DL framework written in C) to run on Eleos
    * Replaces SGX page swapping with more efficient page swapping inside the enclave
    * Slow because it was not designed for DNN workloads
      * DNN prediction reads and writes many floating-point numbers at a very high frequency
      * Incurs frequent page swapping and address translation for each floating-point number
  * TF-Trusted: TensorFlow Lite on SCONE
    * SCONE
      * Reduces the number of context switches for system calls
      * Does not address page swapping, which matters more for DNN prediction performance
    * TensorFlow Lite
      * Quantization: integer-only prediction for embedded devices (see the quantization sketch after this outline)
      * Changes the DNN model: affects model accuracy and does not support layers that require floating-point operations
    * Slow because XXX
  * Neither addresses the scalability issue
* In this paper, we focus on the memory usage of DNN prediction with SGX
  * Systematic study: we found that the memory footprint of DNN prediction can be significantly reduced
  * [CK: Turn the bullet points O1-4 into regular sentences]
* A new system for efficient and scalable DNN prediction with SGX
  * Vessel: an optimized enclave for efficient and scalable DNN prediction
    * Reduces peak enclave memory without modifying the DNN architecture or model
    * This significantly reduces expensive SGX page swapping
    * Brings the performance of SGX-protected DNN prediction close to that of unprotected prediction
  * Carefully schedules the launch of vessels for parallel prediction requests to avoid EPC thrashing, minimizing the impact on scalability
* Key techniques of the system
  * Shared memory across layers (see the buffer-reuse sketch below)
  * On-demand weight loading (see the loading sketch below)
  * EPC-aware scheduling: based on enclave size estimation before execution (see the admission-control sketch below)
* Experimental results
* Contributions
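To make the TensorFlow Lite quantization point concrete: integer-only inference relies on an affine mapping between real values and small integers, `real ≈ scale * (q - zero_point)`. Below is a minimal C++ sketch of that mapping; the struct and function names are ours for illustration, not TF Lite's API. The rounding and clamping in `quantize` are exactly what can change model accuracy, and layers whose math has no integer formulation cannot be expressed this way.

```cpp
// Minimal sketch of affine (asymmetric) quantization as used by
// integer-only inference. Illustrative names, not TF Lite's API.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct QuantParams {
    float   scale;      // real-valued step between adjacent integer levels
    int32_t zero_point; // integer value that represents real 0.0
};

// Map a float to int8; rounding + clamping lose information,
// which is why quantization can affect model accuracy.
int8_t quantize(float x, const QuantParams& p) {
    int q = static_cast<int>(std::lround(x / p.scale)) + p.zero_point;
    return static_cast<int8_t>(std::clamp(q, -128, 127));
}

// Recover an approximation of the original float.
float dequantize(int8_t q, const QuantParams& p) {
    return p.scale * (static_cast<int32_t>(q) - p.zero_point);
}
```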
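The "shared memory across layers" technique can be sketched as buffer reuse over a sequential network: instead of keeping one activation buffer per layer, pre-allocate two buffers sized to the largest layer and ping-pong between them, so peak enclave memory is bounded by the two largest activations rather than their sum over all layers. This is a hypothetical illustration of the general idea, assuming a strictly sequential model; all names are ours, not Vessels' actual interface.

```cpp
// Buffer-reuse sketch: every layer reads from one shared buffer and
// writes into the other, so only two activation buffers ever occupy EPC.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct Layer {
    size_t in_elems;   // floats consumed by this layer
    size_t out_elems;  // floats produced by this layer
    void (*forward)(const float* in, float* out); // layer computation
};

void predict(const std::vector<Layer>& layers, const float* input, float* output) {
    // Size both shared buffers to the largest activation in the network.
    size_t max_elems = 0;
    for (const Layer& l : layers)
        max_elems = std::max({max_elems, l.in_elems, l.out_elems});

    std::vector<float> buf_a(max_elems), buf_b(max_elems);
    float* cur = buf_a.data();
    float* nxt = buf_b.data();

    std::copy(input, input + layers.front().in_elems, cur);
    for (const Layer& l : layers) {
        l.forward(cur, nxt);  // layer i writes where layer i+1 will read
        std::swap(cur, nxt);  // reuse the same two buffers for every layer
    }
    std::copy(cur, cur + layers.back().out_elems, output);
}
```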
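On-demand weight loading can be sketched under the same caveat: the model stays outside the enclave in sealed form, and only the layer about to execute has its weights copied (and, in a real system, decrypted and integrity-checked) into enclave memory, with the buffer freed immediately afterwards, so at most one layer's weights occupy EPC at a time. The layout and helper below are illustrative assumptions, not Vessels' or the SGX SDK's API.

```cpp
// On-demand loading sketch: weights enter EPC per layer and are
// released before the next layer's weights are brought in.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct LayerDesc {
    size_t offset;       // byte offset of this layer's weights in the sealed model
    size_t weight_bytes; // size of this layer's weights
};

// Stand-in for bringing sealed weights into the enclave; a real
// implementation would also decrypt and verify the bytes.
std::vector<float> load_weights(const std::vector<uint8_t>& sealed_model,
                                const LayerDesc& d) {
    std::vector<float> w(d.weight_bytes / sizeof(float));
    std::memcpy(w.data(), sealed_model.data() + d.offset, d.weight_bytes);
    return w;
}

void run_layer(const std::vector<uint8_t>& sealed_model, const LayerDesc& d,
               const float* in, float* out) {
    std::vector<float> weights = load_weights(sealed_model, d); // enters EPC here
    // ... compute this layer's output from `in` and `weights` into `out` ...
    (void)in; (void)out; // computation elided in this sketch
    // `weights` is destroyed on return, releasing its EPC pages before
    // the next layer's weights are loaded.
}
```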
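Finally, EPC-aware scheduling can be sketched as admission control driven by a pre-execution size estimate: each vessel's enclave footprint is estimated before launch (weights plus peak activation buffers plus a fixed runtime overhead), and a request is dispatched only while the total estimated usage stays under the EPC budget; otherwise it waits in a queue rather than triggering thrashing. The budget constant, the overhead term, and all names are assumptions for illustration.

```cpp
// Admission-control sketch: reserve estimated EPC before launching a
// vessel; queue requests that would overflow the budget.
#include <cstddef>
#include <deque>

constexpr size_t kEpcBudget = 96ULL << 20; // usable EPC after SGX metadata (assumed)

struct Vessel {
    size_t model_bytes;      // weights resident at peak
    size_t activation_bytes; // shared activation buffers at peak
};

size_t estimated_size(const Vessel& v) {
    constexpr size_t kRuntimeOverhead = 8ULL << 20; // stack, heap, code (assumed)
    return v.model_bytes + v.activation_bytes + kRuntimeOverhead;
}

class EpcScheduler {
    size_t in_use_ = 0;
    std::deque<Vessel> waiting_;
public:
    // Launch immediately if the vessel fits in the remaining budget,
    // otherwise queue it to avoid EPC thrashing.
    bool try_launch(const Vessel& v) {
        size_t need = estimated_size(v);
        if (in_use_ + need > kEpcBudget) { waiting_.push_back(v); return false; }
        in_use_ += need;
        return true; // caller now creates the enclave and runs the prediction
    }
    // On completion, release the reservation and admit queued vessels that fit.
    void finished(const Vessel& v) {
        in_use_ -= estimated_size(v);
        while (!waiting_.empty() &&
               in_use_ + estimated_size(waiting_.front()) <= kEpcBudget) {
            in_use_ += estimated_size(waiting_.front());
            waiting_.pop_front(); // the enclave for this vessel launches here
        }
    }
};
```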