# Serving DNNs like Clockwork: Performance Predictability from the Bottom Up

###### tags: `Model Serving`
###### paper: [link](https://arxiv.org/pdf/2006.02464.pdf)
###### slide: [link](https://www.usenix.org/sites/default/files/conference/protected-files/osdi20_slides_gujarati.pdf)
###### video: [link](https://www.youtube.com/watch?v=wHOpY_MY57Y)
##### paper origin: OSDI, 2020

# Introduction

* **Contributions:**
    * We demonstrate that predictability is a fundamental trait of DNN inference that can be exploited to build a predictable model serving system.
    * We propose a system design approach, consolidating choice, to preserve predictable responsiveness in a larger system comprised of components with predictable performance.
    * We present the design and implementation of Clockwork, a distributed model serving system that mitigates tail latency of DNN inference from the bottom up.
    * We report on an experimental evaluation of Clockwork showing that the system supports thousands of models concurrently per GPU and substantially mitigates tail latency, even while supporting tight latency SLOs. Clockwork achieves close to ideal goodput even under overload, with unpredictable and bursty workloads, and with many contending users.

# Background and Motivation

![](https://i.imgur.com/qCEP73c.png)

# Predictable Performance

![](https://i.imgur.com/27x5kvk.png)

# Design

![](https://i.imgur.com/nY9u7kD.png)
![](https://i.imgur.com/lyqkA4a.png)
![](https://i.imgur.com/WFXouv7.png)
![](https://i.imgur.com/EAvD3fo.png)

# Evaluation

* Experimental setup: a cluster of 12 Dell PowerEdge R740 servers. Each server has 32 cores, 768 GB RAM, and 2× NVIDIA Tesla V100 GPUs with 32 GB memory. The servers are connected by 2× 10 Gbps Ethernet on a shared network. In all experiments, the controller, clients, and workers run on separate machines.

## How Low Can Clockwork Go?

![](https://i.imgur.com/fOKQKF9.png)

## Can Clockwork Isolate Performance?
![](https://i.imgur.com/3FLDIMv.png)

## Are Realistic Workloads Predictable?

![](https://i.imgur.com/7E53RWu.png)

* Clockwork with realistic workloads

![](https://i.imgur.com/m8XOABd.png)

* Predictable executions

![](https://i.imgur.com/12D244C.png)

## Can Clockwork Scale?

![](https://i.imgur.com/0jTNkHf.png)

# Conclusion

Model serving faces growing demands:

* response-time requirements tighten
* the volume of requests expands
* the number of models grows

Clockwork efficiently meets these demands by:

* achieving aggressive tail-latency SLOs
* supporting thousands of DNN models with different workload characteristics concurrently on each GPU
* scaling out to additional worker machines for increased capacity
* isolating models from performance interference caused by other models served on the same system
* ensuring all internal architecture components have predictable performance, by concentrating all choices in the centralized controller

Achieving this required either circumventing canonical best-effort mechanisms or orchestrating them to become predictable, and it illustrates how consolidating choice can be applied in practice to achieve predictable performance.
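The core idea, a centralized controller that uses predicted execution times to admit only requests it can provably serve within their SLO, can be sketched in a few lines. This is a toy single-GPU illustration of the concept, not Clockwork's actual implementation; the class and method names are invented for this example.

```python
class PredictableScheduler:
    """Toy sketch of SLO-aware admission control with predicted latencies.

    Because DNN inference times are predictable, the controller can compute
    a request's finish time up front and reject requests that would miss
    their SLO, instead of executing them best-effort and failing late.
    All names and structures here are illustrative, not Clockwork's API.
    """

    def __init__(self, predicted_exec_ms):
        # predicted_exec_ms: model name -> predicted GPU execution time (ms)
        self.predicted_exec_ms = predicted_exec_ms
        self.gpu_free_at_ms = 0.0  # time at which the (single) GPU goes idle

    def try_schedule(self, model, arrival_ms, slo_ms):
        """Admit the request only if it can finish within its SLO.

        Returns the predicted completion time (ms) if admitted, else None.
        """
        exec_ms = self.predicted_exec_ms[model]
        start_ms = max(arrival_ms, self.gpu_free_at_ms)
        finish_ms = start_ms + exec_ms
        if finish_ms - arrival_ms > slo_ms:
            return None  # reject early rather than miss the SLO
        self.gpu_free_at_ms = finish_ms
        return finish_ms


sched = PredictableScheduler({"resnet50": 10.0})
print(sched.try_schedule("resnet50", 0.0, 15.0))   # admitted, finishes at 10.0
print(sched.try_schedule("resnet50", 0.0, 15.0))   # None: would finish at 20 ms
print(sched.try_schedule("resnet50", 12.0, 15.0))  # admitted, finishes at 22.0
```

The second request is rejected immediately because, with the GPU busy until 10 ms, it could not complete within its 15 ms SLO; this early rejection under overload is what keeps goodput close to ideal.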