# Research Topic
## Heterogeneous Serverless Computing
- Focus on: **Ensemble Training of DNNs**, moving from a distributed cluster to serverless computing
### Parallel training on a large-scale GPU cluster
- Ref:
- FLEET [[MLSys'20](https://proceedings.mlsys.org/static/paper_files/mlsys/2020/98-Paper.pdf)]
- Exploring flexible communications for streamlining DNN ensemble training pipelines [[SC'18](https://drive.google.com/file/d/1mYrD_vuYMoZ3orHGCeue4PWdhSh0wSu8/view)]
- Deep Neural Network Training Pipeline
- Preprocessing stage runs on the CPU; the training stage runs on the GPU
- Implemented using **TensorFlow with Horovod**
- [Horovod](https://github.com/horovod/horovod): a distributed deep learning training framework
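A minimal sketch of the pipeline split above, assuming TensorFlow 2.x with a `tf.data` preprocessing stage on the CPU feeding Horovod data-parallel training on the GPUs. The dataset and model are placeholders, and this shows only the basic wiring for a single ensemble member, not the FLEET or SC'18 pipelines themselves.

```python
# Minimal sketch: CPU-side preprocessing feeding GPU-side data-parallel training.
# Assumes TensorFlow 2.x and Horovod; the dataset and model are placeholders,
# and this shows only the wiring for a single ensemble member.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to one local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

def preprocess(image, label):
    # Preprocessing stage: runs on the CPU threads of the tf.data pipeline.
    image = tf.image.resize(image, (224, 224)) / 255.0
    return image, label

(x, y), _ = tf.keras.datasets.cifar10.load_data()  # placeholder dataset
dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shard(hvd.size(), hvd.rank())           # each worker reads its shard
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))

model = tf.keras.applications.ResNet50(weights=None, classes=10)  # placeholder model
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

# Broadcast initial weights so all workers start from the same state.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(dataset, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```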
### From GPU Cluster to Serverless Computing environment
- Using microservice containers to support Ensemble Training of DNNs
#### **Preprocessing Part:**
- Use thousands of Lambda functions to match or exceed the throughput of a distributed system (see the sketch at the end of this subsection)
- In the **ggIR paper** [[ATC'19](https://cs.stanford.edu/~matei/papers/2019/usenix_atc_gg.pdf)], several applications are compared between serverless computing and distributed systems
- Example 1: **Video Processing**
- **ExCamera**: [[NSDI'17](https://www.usenix.org/system/files/conference/nsdi17/nsdi17-fouladi.pdf)]
- a framework for running general-purpose parallel computations on a commercial “cloud function” service; the paper focuses on video encoding
- encodes a 15-minute animated movie in 4K resolution into the VP8 compressed-video format

- “ExCamera[6,16]” refers to encoding chunks of six frames independently in parallel, then stitching them together in strings of 16 chunks.
- **Sprocket**: [[SoCC'18](https://dl.acm.org/doi/pdf/10.1145/3267809.3267815)]
- a serverless video processing framework that exploits intra-video parallelism to achieve low latency

- Example 2: **Object Recognition**
- **Scanner** [[ACM Transactions on Graphics '18](https://dl.acm.org/doi/pdf/10.1145/3197517.3201394)]
- video -> images -> TensorFlow

- Scanner local:
- runs on a single 64-core machine
- Scanner on cluster:
- runs with a 4-core master and four 36-core workers
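A minimal sketch of the fan-out idea mentioned at the start of this subsection, assuming `boto3` and an already-deployed Lambda function hypothetically named `preprocess-chunk`; the payload fields, bucket paths, and shard count are illustrative only.

```python
# Sketch: fan preprocessing out to many parallel Lambda invocations.
# Assumes boto3 and an already-deployed function hypothetically named
# "preprocess-chunk"; payload fields, bucket paths, and shard count are illustrative.
import json
from concurrent.futures import ThreadPoolExecutor
import boto3

lam = boto3.client("lambda")

def invoke_shard(shard_id, num_shards):
    payload = {"shard_id": shard_id,
               "num_shards": num_shards,
               "input_prefix": "s3://my-bucket/raw/",          # hypothetical bucket
               "output_prefix": "s3://my-bucket/preprocessed/"}
    resp = lam.invoke(FunctionName="preprocess-chunk",
                      InvocationType="RequestResponse",
                      Payload=json.dumps(payload))
    return json.load(resp["Payload"])

num_shards = 1000  # "thousands of Lambda functions"
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(lambda i: invoke_shard(i, num_shards), range(num_shards)))
```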
#### **GPU Training Part:**
- If we do not have the heterogeneous Elastic Lambda Function (ELF) yet:
- Use [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) (builds and runs GPU-accelerated Docker containers) to simulate it.

- Run the [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) containers on an open-source, distributed serverless platform:
- [**Apache OpenWhisk**](https://openwhisk.apache.org/)

- [**OpenLambda**](https://www.usenix.org/system/files/conference/hotcloud16/hotcloud16_hendrickson.pdf)

- How to **schedule** those NVIDIA Docker containers?
- Extend a software framework for preemptive GPU scheduling: EffiSha [[PPoPP'17](https://people.engr.ncsu.edu/xshen5/Publications/ppopp17.pdf)]
- How to **create** Lambda functions automatically?
- Modify **ggIR** [[ATC'19](https://cs.stanford.edu/~matei/papers/2019/usenix_atc_gg.pdf)] to create NVIDIA Docker containers automatically
- How do the NVIDIA Docker and CPU-based Docker containers **communicate** with each other?
- **RDMA technology**:
- Shimmy [[HotEdge'19](https://www.usenix.org/system/files/hotedge19-paper-abranches.pdf)]
- **Distributed caches** (see the exchange sketch at the end of this subsection):
- [Amazon ElastiCache for Redis](https://aws.amazon.com/elasticache/redis/)
- [Redis](https://redis.io/) is an open-source, in-memory data structure store, used as a database, cache and message broker.
- **Database System**:
- Apache CouchDB: used in Apache OpenWhisk
- **Amazon Simple Storage Service** (Amazon S3)
- an object storage service designed for high durability
- In the **ggIR paper** [[ATC'19](https://cs.stanford.edu/~matei/papers/2019/usenix_atc_gg.pdf)], Amazon S3 is used to communicate between Lambda functions
- **Heterogeneous Storage System**
- Pocket [[OSDI'18](https://www.usenix.org/system/files/osdi18-klimovic.pdf)] (I will present it in the reading group on Thursday)
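For the distributed-cache option above, a minimal sketch of how a CPU-based preprocessing container and an NVIDIA Docker training container could exchange batches through Redis; the host name, key, and pickle wire format are assumptions, not taken from any of the cited systems.

```python
# Sketch: the CPU-based container pushes preprocessed batches into Redis,
# and the NVIDIA Docker (GPU) container pops them for training.
# The host name, key, and pickle wire format are illustrative.
import pickle
import numpy as np
import redis

r = redis.Redis(host="redis.internal", port=6379)  # hypothetical endpoint

# --- CPU-based container (producer) ---
def push_batch(batch_id, images, labels):
    r.rpush("train:batches", pickle.dumps({"id": batch_id,
                                           "images": images,
                                           "labels": labels}))

# --- NVIDIA Docker container (consumer) ---
def pop_batch(timeout=30):
    item = r.blpop("train:batches", timeout=timeout)  # blocking pop
    if item is None:
        return None
    _, payload = item
    return pickle.loads(payload)

# Example round trip with a dummy batch
push_batch(0, np.zeros((64, 224, 224, 3), dtype=np.float32),
           np.zeros(64, dtype=np.int64))
batch = pop_batch()
```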

## Multi-task ML system support
- Focus on:
- making serverless computing support multi-task ML
- Type 1: **One DNN per task**
- In a **CPU- and memory-constrained environment**, run separately trained DNNs
- Embedded multitask learning system
- Neural Weight Virtualization [[MobiSys'20](https://dl.acm.org/doi/pdf/10.1145/3386901.3388947)]
- **Embedded system --> multi-task learning in a container**
- Offer a microservice for multi-task ML in one container
- Or, store the weights on remote storage; when a certain task is activated, the corresponding container loads them from remote storage (see the sketch below).
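A minimal sketch of the remote-weight idea above: one serving container keeps a shared model skeleton and pulls per-task weights from object storage when a task is activated. The bucket name, key layout, task name, and `build_model()` are illustrative assumptions.

```python
# Sketch: one multi-task serving container; per-task weights are fetched
# from remote storage on activation. The bucket name, key layout, and
# build_model() are illustrative assumptions.
import boto3
import tensorflow as tf

s3 = boto3.client("s3")
WEIGHT_BUCKET = "multitask-weights"   # hypothetical bucket

def build_model():
    # Shared skeleton; every task reuses the same architecture here.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(128,)),
        tf.keras.layers.Dense(10),
    ])

model = build_model()
_current_task = None

def activate_task(task_name):
    """Load the weights of `task_name` from remote storage into the shared model."""
    global _current_task
    if task_name != _current_task:
        path = f"/tmp/{task_name}.weights.h5"
        s3.download_file(WEIGHT_BUCKET, f"{task_name}/weights.h5", path)
        model.load_weights(path)
        _current_task = task_name
    return model

# e.g. when a request for the hypothetical "sentiment" task arrives:
# outputs = activate_task("sentiment")(inputs)
```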
- Type 2: Adding Multiple Tasks to **a Single DNN**
- **Network Pruning** (a masking sketch follows this sub-list)
- Iterative Pruning: PackNet [[CVPR'18](https://openaccess.thecvf.com/content_cvpr_2018/papers/Mallya_PackNet_Adding_Multiple_CVPR_2018_paper.pdf)]

- Composability-based network pruning: Wootz [[PLDI'19](https://people.engr.ncsu.edu/xshen5/Publications/pldi2019.pdf)]
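A minimal sketch of the masking step behind PackNet-style iterative pruning, in NumPy: after a task is trained, the largest-magnitude free weights are claimed and frozen for that task, and the remaining free weights are reused for later tasks. This is a schematic of the masking logic only, not PackNet's full train-prune-retrain procedure.

```python
# Sketch of PackNet-style masking: after task t is trained, the largest-magnitude
# free weights are claimed (frozen) for task t; the remaining free weights are
# reused for later tasks. Schematic only -- the retraining step is simulated.
import numpy as np

def prune_for_task(weights, free_mask, keep_ratio=0.5):
    """Mask selecting the top `keep_ratio` fraction of still-free weights."""
    free_vals = np.abs(weights[free_mask])
    threshold = np.quantile(free_vals, 1.0 - keep_ratio)
    return free_mask & (np.abs(weights) >= threshold)

weights = np.random.randn(4, 4).astype(np.float32)   # toy weight tensor
free = np.ones_like(weights, dtype=bool)              # nothing claimed yet

task_masks = []
for task in range(3):
    m = prune_for_task(weights, free, keep_ratio=0.5)
    task_masks.append(m)
    free &= ~m  # weights claimed by this task are frozen from now on
    # Stand-in for retraining the still-free weights on the next task:
    weights[free] = np.random.randn(int(free.sum())).astype(np.float32)

# At inference for task t, apply the union of task_masks[0..t] to the weights.
```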
- **Concurrent Pre-Training with Lambda Functions**

- **Weight Sharing**
- Sub-Network Routing [[AAAI'19](http://www.jiaqima.com/papers/SNR.pdf)]

- Map **each separate fixed block (a group of layers) to its own container** (see the sketch below).
- Advantage: at inference time, the whole network does not need to be held in memory at once.
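A minimal sketch of the block-to-container mapping above, assuming each block of layers sits behind its own HTTP endpoint and activations are passed along the chain as serialized arrays; the endpoint URLs and JSON wire format are illustrative assumptions.

```python
# Sketch: each fixed block (group of layers) lives in its own container behind
# an HTTP endpoint; inference chains the activations block by block.
# Endpoint URLs and the JSON wire format are illustrative assumptions.
import numpy as np
import requests

BLOCK_ENDPOINTS = [
    "http://block-0.internal:8080/forward",   # e.g. stem + first layer group
    "http://block-1.internal:8080/forward",   # middle layer group
    "http://block-2.internal:8080/forward",   # final group + task head
]

def remote_forward(url, activations):
    resp = requests.post(url, json={"activations": activations.tolist()})
    resp.raise_for_status()
    return np.array(resp.json()["activations"], dtype=np.float32)

def predict(x):
    # Only one block's weights must be resident in any single container.
    act = x
    for url in BLOCK_ENDPOINTS:
        act = remote_forward(url, act)
    return act

# prediction = predict(np.random.randn(1, 128).astype(np.float32))
```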