# Lecture 6: Distributed resource scheduling

Part of the mini-course [Apache Submarine: Design and Implementation of a Machine Learning Platform](https://hackmd.io/@submarine/B17x8LhAH). Day 2, [Lecture 6](https://cloudera2008-my.sharepoint.com/:p:/g/personal/weichiu_cloudera2008_onmicrosoft_com/EU5R7f189oFKka2URAqv91sBkdeT1gyfBZxOlYyU6nQXKw?e=NBzv5X) (1.5 hr).

## Slides

* 6.1 Specialized acceleration hardware (30 mins)
* 6.2 YARN (15 mins)
* 6.3 Kubernetes, containers (15 mins)
* 6.4 Distributed GPU training ([doc](https://hackmd.io/@submarine/ByZ60ypAr)) (30 mins)

## Specialized acceleration hardware

- System ML lecture: [GPU Programming](http://dlsys.cs.washington.edu/pdf/lecture5.pdf)
- [chethiya/Deep-Learning-Processor-List](https://github.com/chethiya/Deep-Learning-Processor-List)

### CPU

* AVX/AVX2/AVX-512 instruction sets
* OpenSSL hardware encryption acceleration
* TensorFlow
* [Crunching Numbers with AVX and AVX2](https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX)

### GPU

* Started as graphics cards; now fully programmable.
* General-Purpose GPU (GPGPU) refers to the use of Graphics Processing Units (GPUs) for general-purpose parallel computing, outside of computer graphics.
* Intel Xeon Phi
  * ![](https://lh6.googleusercontent.com/rm2F9ySPVvWtv07CQyLybNyKrT2mQBWw2NZ2i-_kNMNOhVaZ0Ic1vyAmW2kqPlwINPfhf-YdioWuVZe0dJqvTEvu85THKbpqSjptNTsmRS7i6poJx0x3rSNdeF11RaN2Qh10DRU8 =60%x)
    [https://www.extremetech.com/extreme/290963-intel-quietly-kills-off-xeon-phi](https://www.extremetech.com/extreme/290963-intel-quietly-kills-off-xeon-phi)
  * An x86-compatible coprocessor installed on a PCIe bus.
  * Up to 72 cores / 288 threads, AVX-512 instruction set.
  * Failed and was discontinued in 2019.
* [General-Purpose Graphics Processor Architectures](https://www.morganclaypool.com/doi/abs/10.2200/S00848ED1V01Y201804CAC044)
* [oxford-cs-deepnlp-2017/lectures](https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/Lecture%206%20-%20Nvidia%20RNNs%20and%20GPUs.pdf)
* [Introduction to GPU Architecture](http://haifux.org/lectures/267/Introduction-to-GPUs.pdf)
* [How a GPU Works](https://www.cs.cmu.edu/afs/cs/academic/class/15462-f11/www/lec_slides/lec19.pdf)
* [From Shader Code to a Teraflop: How Shader Cores Work](https://engineering.purdue.edu/~smidkiff/KKU/files/GPUIntro.pdf)
* [Introduction to the GPU architecture, Karakasis](https://www.youtube.com/watch?v=oShf5JIpqNc)
* SIMT (Single Instruction, Multiple Threads) execution model
  * An application launches kernels. Each thread block of a kernel runs on a single SM; its threads are scheduled in warps of 32 that execute the same instruction in lockstep. (See the sketch at the end of this section.)
* CPU vs. GPU: Ferrari vs. pickup truck
  * CPUs typically have dozens of cores; GPUs have thousands.
* CUDA (Compute Unified Device Architecture)
  * A scalable parallel programming model and software environment for parallel computing.
  * [An Introduction to GPU Architecture and CUDA C/C++ Programming](https://rcc.fsu.edu/files/2018-04/cuda-spring18.pdf)
  * Early GPU programming was limited to graphics processing. CUDA is essentially the reason we can use GPUs for general-purpose computation.
* OpenCL
  * Intended to support programming on heterogeneous accelerators: GPUs, FPGAs, etc.
* Why are GPUs essential for DL training?
  * Massively parallel floating-point computation.
  * High memory bandwidth (note, however, the bandwidth limitation of PCIe): roughly 558 GB/s and up on recent GPUs, vs. up to ~490 GB/s for Xeon Phi and well under 100 GB/s for most CPUs.
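To make the SIMT model above concrete, here is a minimal vector-add sketch. It uses Numba's CUDA support (my choice for illustration, not something the lecture prescribes): each thread handles one array element, threads are grouped into blocks, and blocks are scheduled onto SMs.

```python
# Minimal SIMT sketch: one GPU thread per output element.
# Assumes Numba and a CUDA-capable GPU are available.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)              # global index = block index * block size + thread index
    if i < out.shape[0]:          # guard: the last block may be only partially used
        out[i] = a[i] + b[i]      # threads in a warp execute this same instruction

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

threads_per_block = 256                                        # one block runs on one SM
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks_per_grid, threads_per_block](a, b, out)      # Numba copies arrays to/from the device
assert np.allclose(out, a + b)
```

The equivalent kernel in CUDA C/C++ looks nearly identical; the `<<<blocks, threads>>>` launch syntax plays the role of the bracketed launch configuration here.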
### TPU

* An ASIC that serves one purpose: neural network inference (i.e., prediction), as opposed to training/learning.
* A coprocessor on the PCIe bus.
* Non-von Neumann architecture: a [systolic array](https://en.wikipedia.org/wiki/Systolic_array).
* At Google, an ML training job may take up to a few hundred GPUs; if even larger compute power is required, they switch to TPUs. -- Dachen Juan
* Problem solved:
  * Inference is latency sensitive.
  * Observation: CNNs accounted for just 5% of the NN workload at Google in 2016. The rest were [multilayer perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron) and [long short-term memory](https://en.wikipedia.org/wiki/Long_short-term_memory) networks (recurrent neural networks). Yet CNNs were where most acceleration architectures focused.
* Quantization (8-bit) is the key, reducing model size.
  * Note: the training phase requires floating-point calculation.
  * 8-bit quantization is good enough for inference.
  * A little math: see the quantization sketch at the end of this TPU section.
  * Shorter word length means more matrix multiply units can be crammed onto the die.
  * The TPU's matrix multiply unit contains 65,536 (64K) 8-bit MACs, compared to a few thousand 32-bit units on contemporary GPUs.
  * 65,536 × 700 MHz ≈ 46 × 10^12 multiply-add operations per second.
* Scalar vs. vector vs. matrix processing
* CISC-style instructions
* Little control logic
* 2nd gen / 3rd gen
  * [bfloat16 floating-point format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
  * [BFloat16: The secret to high performance on Cloud TPUs](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus)
    * bfloat16 is a truncated version of the IEEE 754 single-precision 32-bit float: the exponent bits are preserved, but the significand field is truncated.
    * Example: see the bfloat16 sketch at the end of this TPU section.
    * More extensive use of bfloat16 enables Cloud TPUs to train models that are deeper, wider, or have larger inputs. And since larger models often lead to higher accuracy, this improves the ultimate quality of the products that depend on them.
    * Google's experience shows that representing activations in bfloat16 is generally safe.
    * They typically recommend keeping weights and gradients in FP32 but converting activations to bfloat16.
  * 420 teraops
* Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." *2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)*. IEEE, 2017.
  * Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC—called a *Tensor Processing Unit (TPU)*—deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a **65,536 8-bit** MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X–30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X–80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
  * In summary, the TPU succeeded because of the large—but not too large—matrix multiply unit; the substantial software-controlled on-chip memory; the ability to run whole inference models to reduce dependence on its host CPU; a single-threaded, deterministic execution model that proved to be a good match to 99th-percentile response time limits; enough flexibility to match the NNs of 2017 as well as of 2013; the omission of general-purpose features that enabled a small and low power die despite the larger datapath and memory; the use of 8-bit integers by the quantized applications; and that applications were written using TensorFlow, which made it easy to port them to the TPU at high performance rather than them having to be rewritten to run well on the very different TPU hardware.
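As a small worked example of the 8-bit quantization idea above, here is a generic affine (scale and zero-point) quantization sketch in NumPy. It is illustrative only: the function names are mine, and it is not the TPU's exact quantization scheme.

```python
# Generic affine int8 quantization sketch: store fp32 values as uint8 plus a
# scale and zero-point, which shrinks the tensor to roughly 1/4 of its size.
import numpy as np

def quantize(x, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)        # fp32 range covered per integer step
    zero_point = int(round(qmin - x.min() / scale))    # integer code that represents real 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)   # fp32 weights: 4 bytes per value
q, scale, zp = quantize(w)                     # uint8 weights: 1 byte per value
print("max reconstruction error:", np.abs(w - dequantize(q, scale, zp)).max())
```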
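For the bfloat16 example referenced above: bfloat16 keeps float32's sign bit and all 8 exponent bits, and truncates the 23-bit significand to 7 bits. Since NumPy has no bfloat16 dtype, the snippet below simulates the conversion by zeroing the low 16 bits of the float32 bit pattern (the helper name is mine; real hardware typically rounds to nearest rather than truncating).

```python
# Simulate float32 -> bfloat16 by keeping only the top 16 bits of the bit pattern:
# sign (1) + exponent (8) + significand (7). Range is unchanged; precision drops.
import numpy as np

def to_bfloat16(x):
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.14159265, 1e30, 1e-30], dtype=np.float32)
print(to_bfloat16(x))   # 3.14159265 -> 3.140625; large/small values keep their magnitude,
                        # but only ~2-3 decimal digits of precision survive
```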
### ASIC

* The TPU is an ASIC, but there are other ASICs out there.
* Compared to GPUs and CPUs: faster and more power efficient.
* Many companies are active in this domain: Intel, AWS, ARM.
* Examples: Vector Engine, Cerebras Wafer Scale Engine, Intel Neural Network Processor (Nervana NNP-L1000).

### FPGA

## Hadoop YARN

* [Apache Hadoop YARN: Yet Another Resource Negotiator](https://www.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/YARN.pdf)
* [Hydra: a federated resource manager for data-center scale analytics](https://www.usenix.org/conference/nsdi19/presentation/curino)
* [Hadoop: Fair Scheduler](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/FairScheduler.html)
* [Hadoop: Capacity Scheduler](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html)

## Kubernetes

* Borg
  * Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., & Wilkes, J. (2015, April). Large-scale cluster management at Google with Borg. In *Proceedings of the Tenth European Conference on Computer Systems* (p. 18). ACM.
  * Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). Borg, Omega, and Kubernetes.

## Mesos

* [Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center](https://people.eecs.berkeley.edu/~alig/papers/mesos.pdf)

## Compare YARN, Mesos, Kubernetes, OpenStack

YARN is designed for big data workloads; Kubernetes is designed for long-running services.

## Container

Containerization = operating-system-level virtualization, in contrast to full virtualization.

Cgroups (resource isolation) → LXC (OS-level virtualization via namespace isolation) → Docker (application packaging)

[Hypervisor](https://en.wikipedia.org/wiki/Hypervisor)

Miklos: YARN can optionally use cgroups, but the support is rough; it enforces CPU isolation, though not very well.

LXC: process-tree and file-system isolation, implemented in user space, whereas cgroups are a kernel-space mechanism. LXC leverages cgroups in addition to chroot, kernel namespaces, seccomp, etc. [What's LXC?](https://linuxcontainers.org/lxc/introduction/#LXC)

![](https://lh3.googleusercontent.com/x3eWXX7xkXSTk-fQOedVRE7_SjlqAPxwbCYLXh-7nOj21UBox9Op_ZoDYzyITAhL-TX9HCY6DUcsyw1K6Q3FCzzuayFioZNbisUcGmytTv3DiXQC7MQSvzIoZha7C3KMQRMIzpqV)

Cgroups: resource isolation (a minimal sketch follows below).
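To make the cgroups idea concrete, here is a minimal sketch that puts the current process under a 256 MiB memory cap by writing to the cgroup filesystem. It assumes cgroup v1 mounted at /sys/fs/cgroup and root privileges, and the group name `demo` is arbitrary; container runtimes such as LXC and Docker do essentially this (plus namespaces) on our behalf.

```python
# Minimal cgroup v1 sketch: cap this process's memory at 256 MiB.
# Assumes cgroup v1 is mounted at /sys/fs/cgroup and the script runs as root.
import os

CGROUP = "/sys/fs/cgroup/memory/demo"      # "demo" is an arbitrary group name
os.makedirs(CGROUP, exist_ok=True)         # creating the directory creates the cgroup

with open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w") as f:
    f.write(str(256 * 1024 * 1024))        # hard memory limit for every task in this group

with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))              # move the current process into the cgroup

# From here on, allocations that push this process past 256 MiB hit the cgroup's OOM handling.
```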
Container vs. virtual machine

[https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html-single/resource_management_guide/index](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html-single/resource_management_guide/index)

![](https://lh6.googleusercontent.com/4xkIhtaorkvz9WiWODDcyVX3PrnejdgIIiZifRn1DvHdV9YKptFbktxBoaebPnUvog8Xn5zQbkj3MC82xudEV5faOjMNu96tcIaFfpcwJ79Z7XvwX32OeDVUDlBHPTIdGPGp5ubX)

### LXC

```
A container uses isolation facilities provided by the operating system, allowing multiple logically independent execution environments to run on top of the existing OS kernel. The Linux kernel implements the cgroups mechanism to limit how specific resources (CPU, memory, network, etc.) are allocated; combined with namespaces and copy-on-write storage, this is the substrate for containerized runtimes such as Docker.
```

### Docker

![](https://lh6.googleusercontent.com/_1h2DnTWhQIPwVoy_5B1u9GmDCOZn8R-1sLnvU1eMsha0pdoewAuBy7aU1ZLVrhcqrrkqcVahsx66zB6NylKci14q0z385o1o7M5KhHbkunE9KAkCbpMETY716haLZUmwIR-b_h3)

[Using NVIDIA GPU within Docker Containers](https://marmelab.com/blog/2018/03/21/using-nvidia-gpu-within-docker-container.html)

Security? Sharing between jobs/users? Is it mature?

- An ML model is essentially shipped as a container these days.
- **API containerization**
- seccomp

## GPU support in K8s

### YuniKorn

[YuniKorn: a universal resources scheduler](https://blog.cloudera.com/yunikorn-a-universal-resources-scheduler/)

![](https://lh5.googleusercontent.com/tFEGtzEO62C7ZMy5SO40WGys4o714PSpZq-5ljRfLOVpO0NjwFbxUc8oFWEjK90GA0NRrG04semcbb4SUqLjg_y-2VxPsQ87d40KJydkfAyuoXgvCp3YDi-D1_qc7ma3UqbQG1z8)

###### tags: `2019-minicourse-submarine` `Machine Learning`