# Distributed GPU training

Part of the mini-course [Apache Submarine: Design and Implementation of a Machine Learning Platform](https://hackmd.io/@submarine/B17x8LhAH). Day 2, 30 minutes.

- [First Class GPUs Support in Apache Hadoop 3.1, YARN & HDP 3.0](https://blog.cloudera.com/gpus-support-in-apache-hadoop-3-1-yarn-hdp-3/)
- [YARN-6223](https://issues.apache.org/jira/browse/YARN-6223) - [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
- [YARN-3409](https://issues.apache.org/jira/browse/YARN-3409) - Support Node Attribute functionality
- [YARN-5983](https://issues.apache.org/jira/browse/YARN-5983) - [Umbrella] Support for FPGA as a Resource in YARN
- [YARN node labeling, queueing](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeLabel.html)
- [All You Need to Know about Scheduling Deep Learning Jobs](https://www.sigops.org/src/srcsosp2017/sosp17src-final35.pdf)
- [Scheduling CPU for GPU-based Deep Learning Jobs](https://dl.acm.org/citation.cfm?id=3275445)

> Deep learning (DL) is popular in the data center as an important workload for artificial intelligence. With the recent breakthrough of graphics accelerators and the popularity of DL frameworks, GPU server clusters dominate DL training in current practice. The cluster scheduler simply treats DL jobs as black boxes and allocates GPUs per the job request specified by the user. However, other resources, e.g. CPU, are often allocated with workload-agnostic approaches. Kubeflow [1] performs heuristic static CPU resource assignment based on task types (e.g., worker, parameter server), while [2] evenly divides the CPUs of a server among its GPUs. Despite the traditional impression that the GPU is critical in DL, our observation suggests that the importance of the CPU is undervalued. Identifying an appropriate CPU core count in a heterogeneous cluster is challenging yet performance-critical for DL jobs. The diverse CPU usage characteristics are not well recognized, in the following three aspects.
>
> Heterogeneous CPU demand across jobs. Although powered by GPU accelerators, different workloads exhibit a tremendous gap in CPU demand. Figure 1a illustrates the CPU cores required to reach maximal training performance for different workloads: Overfeat and Resnet50 require 40 and 7 cores respectively on a V100. Moreover, different workloads are not equally sensitive to an insufficient resource offer. The training speed of Overfeat drops 45% going from 14 cores to 7 cores, whereas Resnet50 drops only 20%. Therefore, sacrificing the performance of Resnet50 is more cost-effective than sacrificing Overfeat when resources are insufficient.
>
> Better GPU, more CPU. Another key insight from Figure 1a is that the better the GPU allocated, the more CPUs are required. Moreover, with a better GPU, Overfeat requires much more CPU than Resnet50 to maximize performance, showing different sensitivities. DL frameworks (e.g., TensorFlow) tend to overlap computation on the CPU (e.g., data pre-processing) with computation on the GPU (e.g., convolution) to maximize resource utilization. With a better GPU allocated, the latency of GPU operators drops, so the latency of CPU operations becomes relatively more prominent, calling for more CPUs. Furthermore, in contrast to the slowdown of CPU scaling, hardware accelerators (e.g., GPUs) are developing fast, which calls for careful assignment of CPU resources to coordinate execution.
>
> Waving demand over time. DL training is a feedback-driven exploration that periodically switches between training and model validation. For some sequence-to-sequence models, such as text summarization, validating the generated output of the trained model requires computation different from training. Figure 1b illustrates the profile of a neural machine translation (NMT) task with 1 GPU and 4 CPU cores allocated. CPU and GPU utilization vary cyclically. In training, 4 cores are sufficient, since the average CPU utilization is only 104%. In validation, however, GPU utilization is only 8% while CPU utilization is 387%: the latency is CPU-bound. Increasing the allocation to 24 CPU cores reduces validation time by 75%.
>
> To address the CPU resource scheduling challenges, we present SAD, which maximizes cluster throughput with coarse-grained periodic rescheduling over a performance predictor based on optimal experiment design. SAD exhibits adaptive, characteristic-aware features to automatically infer the appropriate CPU resources to allocate. Through lightweight profiling and continual monitoring, SAD captures the inter-job and intra-job resource demand heterogeneity of DL. The performance predictor in SAD can accurately estimate DL job training speed for different CPU counts across various GPUs. Our preliminary result on a small trace shows that SAD improves overall utilization by 19% while reducing job completion time by 34% compared with workload-agnostic allocation.
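The abstract above argues that DL frameworks overlap CPU-side input pre-processing with GPU-side training, which is why the number of CPU cores per GPU is performance-critical. Below is a minimal, hypothetical TensorFlow sketch of that overlap using the `tf.data` pipeline; the file pattern, parse function, and model are placeholders for illustration, not anything from the paper.

```python
import tensorflow as tf

# Hypothetical input files -- placeholder path for illustration.
files = tf.data.Dataset.list_files("/data/train-*.tfrecord")

def parse_and_augment(record):
    # CPU-side pre-processing: decode and resize images on CPU cores.
    feature_spec = {"image": tf.io.FixedLenFeature([], tf.string),
                    "label": tf.io.FixedLenFeature([], tf.int64)}
    example = tf.io.parse_single_example(record, feature_spec)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, example["label"]

dataset = (tf.data.TFRecordDataset(files)
           # num_parallel_calls decides how many CPU workers do pre-processing;
           # too few cores here starves the GPU, as the paper observes.
           .map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(10_000)
           .batch(64)
           # prefetch overlaps CPU pre-processing of the next batch
           # with the GPU executing the current training step.
           .prefetch(tf.data.AUTOTUNE))

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(optimizer="sgd",
              loss=tf.keras.losses.SparseCategoricalCrossentropy())
model.fit(dataset, epochs=1)
```

If the GPU sits idle waiting on the `map` stage, giving the job more cores (or more parallel calls) raises throughput; that is the kind of per-job CPU sensitivity the paper measures.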
[Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads](https://www.usenix.org/system/files/atc19-jeon.pdf)

> With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, fair sharing across users, etc. However, Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, reducing the flexibility of scheduling and making the jobs themselves inelastic to failures at runtime. In this paper we present a detailed workload characterization of a two-month-long trace from a multi-tenant GPU cluster at Microsoft. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.

## GPU isolation

* Nvidia Multi-Process Service (MPS) provides isolation when multiple processes access the same GPU.
  * It only works on the Volta architecture, and MPS is not yet widely supported by deep learning platforms.
* Per-GPU device isolation
  * Uses cgroups to enforce the isolation. This works by putting a YARN container – a process tree – into a cgroup that allows access to only the prescribed GPU devices. When Docker containers are used on YARN, [nvidia-docker-plugin](https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin) – an optional plugin that admins have to configure – is used to enforce GPU resource isolation.
* GPU discovery
  * The NVIDIA System Management Interface (nvidia-smi) is used to get the number of GPUs on each machine and the usage of these GPU devices.

A single GPU is limited in both compute capability and memory, which is why training moves to multiple GPUs or even distributed GPU training.

## TensorFlow on YARN (30 mins - 1 hr)

* Run TensorFlow, PyTorch, MXNet and Horovod natively on YARN (a minimal worker-side TensorFlow sketch follows this list).
* [TonY: An Orchestrator for Distributed Machine Learning Jobs](https://arxiv.org/pdf/1904.01631.pdf) – challenges it addresses:
  * Resource contention
  * Tedious and error-prone configuration
  * Lack of monitoring
  * Lack of fault tolerance
  * Validation of the ML model requires access to the full dataset, typically petabytes of data, which is beyond the reach of a data scientist's single machine.
* Other distributed TensorFlow systems: [ToY](https://github.com/Intel-bigdata/TensorFlowOnYARN), [TensorFlow on Spark](https://github.com/yahoo/TensorFlowOnSpark)
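Orchestrators in this space launch one gang-scheduled process per worker and hand each process a cluster spec (for TensorFlow, typically via the `TF_CONFIG` environment variable). The following is a minimal sketch of what each worker process might run, assuming the orchestrator has exported `TF_CONFIG`; the toy dataset and model are placeholders.

```python
import json
import os

import tensorflow as tf

# Assumption: the orchestrator (e.g., a YARN launch script) exports TF_CONFIG
# for every worker, e.g.:
# {"cluster": {"worker": ["host1:2222", "host2:2222"]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ["TF_CONFIG"])
num_workers = len(tf_config["cluster"]["worker"])

# All workers run the same program; the strategy reads TF_CONFIG and performs
# synchronous all-reduce of gradients across workers and their GPUs.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Toy in-memory dataset as a placeholder for a real (e.g., HDFS-backed) input.
global_batch_size = 64 * num_workers
features = tf.random.uniform([1024, 32])
labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(1024)
           .batch(global_batch_size)
           .repeat())

with strategy.scope():
    # Variables created under the scope are mirrored across workers/GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Every worker calls fit(); gradients are synchronized on each step, which is
# why the whole gang must be scheduled together.
model.fit(dataset, epochs=2, steps_per_epoch=50)
```

Because every step blocks on gradient synchronization, a single slow or missing worker stalls the whole job; this is the gang-scheduling and inelasticity constraint highlighted in the Microsoft trace paper above.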
[Scaling Deep Learning on Hadoop at LinkedIn (slideshare)](https://www.slideshare.net/ssuser72f42a/scaling-deep-learning-on-hadoop-at-linkedin)

![](https://lh4.googleusercontent.com/03lcX9tW1tZTmI_5BeACw15C7LxZA2cMMKKRb-MVviigmQyyk8Ko7eb2HC-3PcpsCF0E6bAb5EPws6G5sZlSp3ZkXY06MaKswncns69a_rDX4gzfEQUpX0tgfZXMQwPTGoiDD97z =70%x)

A single training job typically uses at most a few hundred GPUs; at Google, scaling beyond that means moving to TPUs.

## Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

* [Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (publication)](https://www.usenix.org/system/files/atc19-jeon.pdf)
* [Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (slides)](https://www.usenix.org/sites/default/files/conference/protected-files/atc19-slides-jeon.pdf)
* [Large-Scale Shared GPU Clusters for DL Training Workloads](http://csl.snu.ac.kr/courses/4190.568/2019-1/41-mjjeon.pdf)

Key takeaways:
* A GPU device cannot be shared easily.
* Synchronization cost in distributed training is high, so scheduling must account for locality.

###### tags: `2019-minicourse-submarine` `Machine Learning`