# A Domain-Specific Supercomputer for Training Deep Neural Networks
###### tags: `Accelerators`
###### papers: [link](https://dl.acm.org/doi/pdf/10.1145/3360307)
## Introduction
* Google’s Tensor Processing Unit (TPU) offered a 50x improvement in performance per watt over conventional architectures for inference
* This article shows how Google built the first production DSA for the much harder training problem
* Key design differences between CPUs and TPUs:
    1. TPUs have 1-2 large cores, while CPUs have 32-64 small cores.
    2. TPUs have two-dimensional 128x128- or 256x256-element systolic arrays of multipliers, while CPUs have only a few scalar or SIMD multipliers per core.
    3. TPUs use narrower data (8-16 bits) in computation and memory, while CPUs use 32-64 bits.
    4. TPUs drop general-purpose features that are irrelevant for DNNs but critical for CPUs, such as caches and branch predictors.
* A DNN DSA must support both training and inference, but training DSAs differ from inference DSAs in several key ways:
    1. Harder parallelization: Each inference is independent, but a training run must coordinate millions of examples processed in parallel.
    2. More computation: Back-propagation requires a derivative for every computation in the model.
    3. More memory: The weight update needs the intermediate values from forward propagation to be kept in memory, while inference needs little memory (see the sketch after this list).
    4. More programmability: Training algorithms and models are continually changing.
    5. Wider data: Quantized arithmetic (8-bit integers instead of 32-bit floating point) works for inference, but reduced-precision training needs wider formats to converge.
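
To make the compute and memory differences concrete, here is a minimal NumPy sketch (not from the paper; the two-layer network, layer sizes, loss, and the function names `infer` and `train_step` are arbitrary choices): inference is a single forward pass whose intermediates can be discarded, while a training step keeps the activations and computes a derivative for every operation before updating the weights.

```python
import numpy as np

# Toy two-layer network: contrasts inference with a training step.
# Sizes, loss, and learning rate are arbitrary choices for illustration.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 256)).astype(np.float32)        # input batch
y = rng.standard_normal((32, 10)).astype(np.float32)         # targets
W1 = 0.01 * rng.standard_normal((256, 128)).astype(np.float32)
W2 = 0.01 * rng.standard_normal((128, 10)).astype(np.float32)

def infer(x):
    # Inference: one forward pass; intermediates can be discarded immediately.
    return np.maximum(x @ W1, 0.0) @ W2

def train_step(x, y, lr=1e-3):
    # Training: the forward pass must keep activations (h_pre, h) in memory,
    # because back-propagation reuses them to compute a derivative for every
    # operation -- roughly tripling the computation of a single inference.
    global W1, W2
    h_pre = x @ W1                         # kept for the backward pass
    h = np.maximum(h_pre, 0.0)             # kept for the backward pass
    out = h @ W2
    grad_out = 2.0 * (out - y) / out.size  # derivative of mean squared error
    grad_W2 = h.T @ grad_out
    grad_h = grad_out @ W2.T
    grad_h_pre = grad_h * (h_pre > 0)      # ReLU derivative
    grad_W1 = x.T @ grad_h_pre
    W1 -= lr * grad_W1                     # weight update
    W2 -= lr * grad_W2
    return float(((out - y) ** 2).mean())

print("inference output shape:", infer(x).shape)
print("training loss:", train_step(x, y))
```
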
## Designing a DSA supercomputer using TPUv2
### Designing Interconnect
1. Most traffic is an all-reduce over weight updates, so switch functionality is distributed into each chip rather than placed in a standalone unit.
2. The supercomputer is built from many TPUs connected in a 2D torus. Each TPU chip has four **Inter-Core Interconnect (ICI)** links, each running at 496Gbits/s, and ICI uses only about 13% of each chip. Interconnect capacity is measured as bisection bandwidth: **TPUv2 uses a 16x16 2D torus, so a worst-case bisection cuts 32 links**, giving **32 * 496Gbps ≈ 15.9Tbps**. In contrast, **an InfiniBand switch connecting 64 machines at 100Gbits/s each** has a bisection bandwidth of **6.4Tbps**. Therefore, **TPUv2's interconnect offers higher bandwidth at lower cost than InfiniBand** (see the calculation after this list).
3. A fast interconnect that quickly reconciles weights across learners with well-controlled tail latencies is critical for fast training.
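
As referenced in point 2 above, the quoted bandwidth figures follow from simple arithmetic. A minimal Python check, using only the link rates stated in the list; the 32-link count is the standard bisection cut of a 16x16 2D torus:

```python
# Back-of-the-envelope check of the bisection figures quoted above.
# Cutting a 16x16 2D torus into two 8x16 halves severs 2 links per row
# (one direct edge plus the wrap-around edge), i.e. 32 links in total.
links_cut = 2 * 16
ici_link_gbps = 496                               # Gbit/s per TPUv2 ICI link
print(f"TPUv2 torus bisection: {links_cut * ici_link_gbps / 1000:.1f} Tbit/s")  # ~15.9

# The InfiniBand comparison above: 64 hosts at 100 Gbit/s each.
print(f"InfiniBand cluster: {64 * 100 / 1000:.1f} Tbit/s")                      # 6.4
```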

### Designing Node

1. **Matrix multiplication unit (MXU):**
    * TPUv2 has two MXUs per chip (one per core); they rely on large batch sizes, which amortize memory accesses for weights.
    * Each MXU produces 32-bit FP products from 16-bit FP inputs and accumulates them in 32 bits.
    * The design uses 128x128 MXUs: a larger 256x256 MXU would provide more raw computation but lower utilization, while smaller MXUs would have slightly higher utilization but need more area (see the tiled-matmul sketch after this list).
2. **High Bandwidth Memory (HBM):**
* It offers 20 times the bandwidth of TPUv1 by using an interposer substrate that connects the TPUv2 chip via thirty-two 128-bit buses to four short stacks of DRAM chips.
3. **The Core Sequencer:**
    * Fetches VLIW (Very Long Instruction Word) instructions from the core's on-chip, software-managed instruction memory, executes scalar operations using a 4K x 32-bit scalar data memory and 32 32-bit scalar registers, and forwards vector instructions to the VPU.
4. **The Vector Processing Unit (VPU):**
* Performs vector operations using a large on-chip vector memory with 32K 128 x 32-bit elements (16MiB), and 32 2D vector registers that each contain 128 x 8 32-bit elements
5. **The Transpose Reduction Permute Unit:**
    * Performs 128x128 matrix transposes, reductions, and permutations of the VPU lanes.
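
As noted in the MXU item above, large matrix multiplies are fed to the 128x128 unit block by block. The NumPy sketch below is purely illustrative (fp16 stands in for bf16 because NumPy has no native bfloat16, and the loops do not model the systolic dataflow of the real hardware): it shows the tiling plus the 16-bit-input / 32-bit-accumulate arithmetic described above.

```python
import numpy as np

TILE = 128  # matches the MXU dimension

def tiled_matmul(A, B, tile=TILE):
    """Multiply A (M,K) by B (K,N) one tile x tile block at a time.

    Each block product stands in for one pass of data through the matrix
    unit: inputs are narrowed to 16 bits, and partial products are summed
    into a 32-bit accumulator, mirroring the 16-bit-in / 32-bit-accumulate
    design described above. fp16 stands in for bf16 here.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and K % tile == 0 and N % tile == 0
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=np.float32)       # 32-bit accumulator
            for k in range(0, K, tile):
                a16 = A[i:i+tile, k:k+tile].astype(np.float16)   # 16-bit inputs
                b16 = B[k:k+tile, j:j+tile].astype(np.float16)
                acc += a16.astype(np.float32) @ b16.astype(np.float32)
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.default_rng(1).random((256, 256), dtype=np.float32)
B = np.random.default_rng(2).random((256, 256), dtype=np.float32)
print(np.allclose(tiled_matmul(A, B), A @ B, rtol=1e-2))   # True: close to the fp32 result
```

Accumulating in 32 bits is what keeps the result close to the full fp32 product despite the 16-bit inputs; only the inputs, not the running sums, are narrowed.
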
### Designing arithmetic
1. Peak performance is ≥8x higher when using 16-bit FP instead of 32-bit FP for matrix multiply.
2. Uses the **brain floating-point format (bf16)**, which keeps the same 8-bit exponent as fp32, so small update values are not lost to FP underflow in a narrower exponent field.
3. bf16 reduces hardware cost and energy while simplifying software by making loss scaling unnecessary (see the sketch below).
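
A minimal NumPy sketch of that trade-off (NumPy has no native bfloat16, so the format is emulated here by rounding fp32 values to their top 16 bits): because bf16 keeps fp32's 8-bit exponent, tiny weight updates survive, whereas fp16's 5-bit exponent flushes them to zero.

```python
import numpy as np

def to_bf16(x):
    """Emulate bfloat16 by rounding an fp32 value to its top 16 bits.

    bf16 keeps fp32's 8-bit exponent (same dynamic range) and shortens the
    mantissa from 23 to 7 bits (less precision). NumPy has no native bf16,
    hence the bit-level emulation.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> 16) & 1)          # round to nearest even
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

updates = np.array([1e-30, 1e-10, 0.1], dtype=np.float32)  # e.g. tiny gradient steps
print(to_bf16(updates))             # all survive: bf16 shares fp32's exponent range
print(updates.astype(np.float16))   # the small values flush to zero in fp16
```

With fp16, frameworks typically compensate by scaling the loss so gradients stay representable; bf16's wider exponent makes that step unnecessary, which is the simplification noted in point 3.
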

## Comparing DSA supercomputers using TPUv2 and TPUv3
1. TPUv3 has 1.35x the clock rate, ICI bandwidth, and memory bandwidth of TPUv2, plus twice the number of MXUs, so peak performance rises about 2.7x (see the quick check below).
2. Thanks to liquid cooling, TDP rises 1.6x, yet the TPUv3 die is only 6% larger than TPUv2's.
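
A quick check of the 2.7x claim, using only the factors quoted above:

```python
# Rough peak-performance scaling from TPUv2 to TPUv3, from the factors above.
clock_ratio = 1.35   # clock rate (ICI and memory bandwidth scale similarly)
mxu_ratio = 2.0      # twice the number of MXUs
print(f"peak performance ratio ~ {clock_ratio * mxu_ratio:.2f}x")   # ~2.70x
```
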
## Contrasting GPU and TPU Architectures

1. Multi-chip parallelization is built into TPUs through ICI, while GPUs rely on NVLink within a chassis and InfiniBand networks and switches across chassis.
2. TPUs offer bf16 FP arithmetic designed for DNNs inside 128x128 systolic arrays, while GPUs offer fp16 FP arithmetic in much smaller 4x4 or 16x16 units.
3. TPUs are dual-core, in-order machines programmed through XLA, while GPUs are latency-tolerant, many-core machines programmed through CUDA.
4. TPUs use software-controlled 32MiB scratchpad memories that the compiler schedules, while GPU hardware manages a 6MiB cache and software manages a 7.5MiB scratchpad memory.
5. GPUs incur overhead in performance, area, and energy from heavy multithreading, which is unnecessary for DNNs because their memory accesses are sequential and prefetchable.
## Conclusion
* **The GPU is not designed specifically for DNNs**, so it spends power on features DNNs do not need, giving the TPU an edge. A single TPU core has no clear advantage over a GPU, but TPUs have the advantage when scaling out.
* The TPUv3 supercomputer runs a production application on real-world data at 70% of peak performance, higher than general-purpose supercomputers typically achieve.
* Reasons for this success include the built-in ICI network, large systolic arrays, and bf16 arithmetic, which we expect will become a standard data type for DNN DSAs.