# Research Topic

## Heterogeneous Serverless Computing

- Focus on: **Ensemble Training of DNNs**, moving from a distributed cluster to serverless computing

### Parallel training on a large-scale GPU cluster

- Ref:
    - FLEET [[MLSys'20](https://proceedings.mlsys.org/static/paper_files/mlsys/2020/98-Paper.pdf)]
    - Exploring flexible communications for streamlining DNN ensemble training pipelines [[SC'18](https://drive.google.com/file/d/1mYrD_vuYMoZ3orHGCeue4PWdhSh0wSu8/view)]
        - deep neural network training pipeline: preprocessing stage on CPU, training stage on GPU
        - implemented using **TensorFlow with Horovod** (a minimal sketch of this pattern appears below, after the Preprocessing Part)
            - [Horovod](https://github.com/horovod/horovod): a distributed deep learning training framework

### From GPU cluster to a serverless computing environment

- use microservice containers to support ensemble training of DNNs

#### **Preprocessing Part:**

- use thousands of Lambda functions to match or exceed the preprocessing throughput of a distributed cluster (a minimal fan-out sketch closes this subsection)
- in the **gg** paper [[ATC'19](https://cs.stanford.edu/~matei/papers/2019/usenix_atc_gg.pdf)], several applications are compared between serverless computing and distributed systems
    - Example 1: **Video Processing**
        - **ExCamera** [[NSDI'17](https://www.usenix.org/system/files/conference/nsdi17/nsdi17-fouladi.pdf)]
            - a framework for running general-purpose parallel computations on a commercial “cloud function” service; the paper focuses on video encoding
            - encodes a 15-minute animated movie in 4K resolution into the VP8 compressed-video format
              ![](https://i.imgur.com/RnKgyd8.png)
            - “ExCamera[6,16]” refers to encoding chunks of six frames independently in parallel, then stitching them together in strings of 16 chunks
        - **Sprocket** [[SoCC'18](https://dl.acm.org/doi/pdf/10.1145/3267809.3267815)]
            - a serverless video processing framework that exploits intra-video parallelism to achieve low latency
              ![](https://i.imgur.com/WMJxMIX.png)
    - Example 2: **Object Recognition**
        - **Scanner** [[TOG'18](https://dl.acm.org/doi/pdf/10.1145/3197517.3201394)]
            - video → images → TensorFlow
              ![](https://i.imgur.com/qgoTGzb.png)
            - Scanner local: runs on a 64-core machine
            - Scanner on a cluster: runs with a 4-core master and four 36-core workers
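To make the fan-out concrete, here is a minimal driver sketch, assuming AWS Lambda with boto3 and a hypothetical `preprocess-shard` function that reads raw samples from S3, runs the CPU preprocessing, and writes results back to S3. The function name, bucket, and shard layout are illustrative, not taken from the papers above.

```python
# Hypothetical fan-out driver: invoke one Lambda per input shard in parallel.
# Assumes boto3 credentials are configured and a Lambda named
# "preprocess-shard" exists (name is made up for this sketch).
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def invoke_preprocess(shard_id: int) -> dict:
    """Synchronously invoke one preprocessing Lambda for a single shard."""
    response = lambda_client.invoke(
        FunctionName="preprocess-shard",  # hypothetical function name
        Payload=json.dumps({
            "input": f"s3://bucket/raw/shard-{shard_id}",
            "output": f"s3://bucket/preprocessed/shard-{shard_id}",
        }),
    )
    return json.loads(response["Payload"].read())

# Fan out to thousands of shards; the thread pool only bounds how fast we
# issue requests, while the actual work runs concurrently inside Lambda.
with ThreadPoolExecutor(max_workers=256) as pool:
    results = list(pool.map(invoke_preprocess, range(4096)))
```

The default `invoke` call is synchronous (`RequestResponse`); passing `InvocationType="Event"` instead would fire the functions asynchronously and leave result collection to polling the output prefix in S3.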
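For the training stage itself, the cluster baseline mentioned above (TensorFlow with Horovod) follows a standard data-parallel pattern: one process per GPU, gradients all-reduced every step. The sketch below is a generic illustration of that pattern with a placeholder model; it is not code from FLEET or the SC'18 paper. The serverless GPU design in the next subsection has to reproduce this pattern with containers.

```python
# Minimal data-parallel training sketch with TensorFlow + Horovod,
# one process per GPU. The model and dataset are placeholders.
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

# Pin each process to its own GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None)

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across all GPUs each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

# Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(dataset, callbacks=callbacks, epochs=...)  # dataset omitted here
```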
#### **GPU Training Part:**

- If we do not have a heterogeneous Elastic Lambda Function (ELF) runtime yet:
    - use [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) (builds and runs GPU-accelerated Docker containers) to simulate it
      ![](https://i.imgur.com/qDIOkxf.png)
    - run [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) on an open-source, distributed serverless platform:
        - [**Apache OpenWhisk**](https://openwhisk.apache.org/)
          ![](https://i.imgur.com/ISWtYBV.png)
        - [**OpenLambda**](https://www.usenix.org/system/files/conference/hotcloud16/hotcloud16_hendrickson.pdf)
          ![](https://i.imgur.com/pH6zgMt.png)
- How to **schedule** those NVIDIA Docker containers?
    - extend a software preemptive GPU scheduler, EffiSha [[PPoPP'17](https://people.engr.ncsu.edu/xshen5/Publications/ppopp17.pdf)]
- How to **create** Lambda functions automatically?
    - modify **gg** [[ATC'19](https://cs.stanford.edu/~matei/papers/2019/usenix_atc_gg.pdf)] to create NVIDIA containers automatically
- How do NVIDIA Docker and CPU-based Docker containers **communicate** with each other? (a Redis-based hand-off sketch appears at the end of these notes)
    - **RDMA technology**: Shimmy [[HotEdge'19](https://www.usenix.org/system/files/hotedge19-paper-abranches.pdf)]
    - **Distributed caches**: [Amazon ElastiCache for Redis](https://aws.amazon.com/elasticache/redis/)
        - [Redis](https://redis.io/) is an open-source, in-memory data structure store, used as a database, cache, and message broker
    - **Database system**: Apache CouchDB, used in Apache OpenWhisk
    - **Amazon Simple Storage Service** (Amazon S3): an object storage service designed for high durability
        - in the **gg** paper [[ATC'19](https://cs.stanford.edu/~matei/papers/2019/usenix_atc_gg.pdf)], Amazon S3 is used for communication between Lambda functions
    - **Heterogeneous storage system**: Pocket [[OSDI'18](https://www.usenix.org/system/files/osdi18-klimovic.pdf)] (I will present it in the reading group on Thursday)
      ![](https://i.imgur.com/Nkes7Se.png)

## Multi-task ML system support

- Focus on: making serverless computing support multi-task ML
- Type 1: **one DNN per task**
    - in a **CPU- and memory-constrained environment**, run separately trained DNNs
        - embedded multitask learning system: Neural Weight Virtualization [[MobiSys'20](https://dl.acm.org/doi/pdf/10.1145/3386901.3388947)]
    - **Embedded system → multitask learning in a container**
        - offer a microservice for multi-task ML in one container
        - or store the weights on remote storage; when a task is activated, the corresponding container loads them from remote storage (see the weight-loading sketch at the end of these notes)
- Type 2: adding multiple tasks to **a single DNN**
    - **Network pruning**
        - iterative pruning: PackNet [[CVPR'18](https://openaccess.thecvf.com/content_cvpr_2018/papers/Mallya_PackNet_Adding_Multiple_CVPR_2018_paper.pdf)] (a toy sketch of the pruning bookkeeping appears at the end of these notes)
          ![](https://i.imgur.com/3yLuRa9.png)
        - composability-based network pruning: Wootz [[PLDI'19](https://people.engr.ncsu.edu/xshen5/Publications/pldi2019.pdf)]
            - **concurrent pre-training with Lambda functions**
              ![](https://i.imgur.com/KAo9iE2.png)
    - **Weight sharing**
        - Sub-Network Routing [[AAAI'19](http://www.jiaqima.com/papers/SNR.pdf)]
          ![](https://i.imgur.com/TMRaNYG.png)
        - map **each fixed block (group of layers) to its own container**
        - advantage: at inference time, the whole network need not be resident in memory at once
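Returning to the container-communication question in the GPU Training Part: one of the listed options, a distributed cache, could look like the following hand-off sketch. It assumes a shared Redis instance (e.g. ElastiCache); the host name and the `batches` key are made up for illustration.

```python
# Hypothetical hand-off between a CPU preprocessing container and a GPU
# training container through a shared Redis instance. The host and the
# "batches" key are illustrative, not from any of the papers above.
import pickle

import numpy as np
import redis

r = redis.Redis(host="cache.internal", port=6379)

# --- CPU-side preprocessing container: push finished batches ---
def produce_batch(images: np.ndarray, labels: np.ndarray) -> None:
    # RPUSH appends the serialized batch to a shared list acting as a queue.
    r.rpush("batches", pickle.dumps((images, labels)))

# --- GPU-side training container: block until a batch is available ---
def consume_batch(timeout_s: int = 30):
    item = r.blpop("batches", timeout=timeout_s)  # (key, value) or None
    if item is None:
        return None  # queue drained; preprocessing has fallen behind
    _, payload = item
    return pickle.loads(payload)
```

A blocking list doubles as a simple queue and gives natural back-pressure: if training outruns preprocessing, `blpop` simply waits.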
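For Type 1, the load-weights-on-activation idea could be sketched as follows, assuming one `.npz` weight file per task stored in an S3 bucket; the bucket name and key layout are hypothetical.

```python
# Minimal sketch of Type 1 multi-task serving: one set of trained weights
# per task kept on remote storage (S3 here), pulled into the container only
# when that task is activated. Bucket and key layout are made up.
import io

import boto3
import numpy as np

s3 = boto3.client("s3")
_cache: dict[str, list[np.ndarray]] = {}  # weights already pulled in

def activate_task(task: str) -> list[np.ndarray]:
    """Fetch the weight arrays for `task` from S3 on first use."""
    if task not in _cache:
        obj = s3.get_object(Bucket="model-weights", Key=f"{task}.npz")
        arrays = np.load(io.BytesIO(obj["Body"].read()))
        _cache[task] = [arrays[name] for name in arrays.files]
    return _cache[task]

# e.g. weights = activate_task("object-detection"); model.set_weights(weights)
```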
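And for Type 2, PackNet's core move is iterative magnitude pruning: after training task *k*, release the lowest-magnitude fraction of the weights task *k* occupies, freeze the survivors, and let the next task train the freed weights. The toy sketch below only illustrates that bookkeeping with NumPy; it is not the paper's implementation.

```python
# Toy illustration of PackNet-style iterative pruning: each weight is
# tagged with the task that owns it (0 = free). After training a task,
# its lowest-magnitude weights are released for the next task.
import numpy as np

def prune_for_next_task(weights: np.ndarray,
                        owner: np.ndarray,   # task id per weight, 0 = free
                        task_id: int,
                        prune_frac: float = 0.5) -> np.ndarray:
    """Release the smallest-magnitude `prune_frac` of task `task_id`'s weights."""
    mine = owner == task_id
    threshold = np.quantile(np.abs(weights[mine]), prune_frac)
    released = mine & (np.abs(weights) < threshold)
    owner[released] = 0      # freed weights become trainable for the next task
    weights[released] = 0.0  # zero them out; surviving weights stay frozen
    return owner
```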