# RaPiD: AI Accelerator for Ultra-low Precision Training and Inference

###### tags: `Accelerators`
###### paper origin: 2021 ISCA
###### paper: [link](https://dl.acm.org/doi/pdf/10.1109/ISCA52012.2021.00021)

## Introduction

One approach that has been broadly adopted by the industry is building specialized systems for AI with hardware accelerators.

### Precision Scaling

![](https://i.imgur.com/vEpok5y.png)

* AI workloads present a unique opportunity for performance/energy improvement through precision scaling.

### Ultra-Low Precision Capable AI Accelerators

The next generation of AI accelerators should be capable of ultra-low precision execution - beyond 16-bit floating point for training, and 8-bit integer for inference.

* **Mixed (ultra-low) precision support**
    * Supports multiple precisions: FP16, FP8-fwd, FP8-bwd (together forming HFP8), INT4 and INT2.
    * Although ultra-low precision execution is the target, it is critical to retain support for higher precisions.
* **Scaling both TOPS and TOPS/W at low precision**
    * Improve both performance (TOPS) and energy efficiency (TOPS/W) at ultra-low precision.
* **End-to-end application coverage**
    * Data transfer between the accelerator and the host can be costly.
* **Sparsity-aware frequency throttling**
    * The fused-multiply-and-accumulate (FMA) engines of RaPiD are designed with zero-gating logic.
    * Includes a power management unit (controlled from software) that can rapidly throttle the effective clock frequency.
* **Multi-core scaling**
    * Contains a Memory-Neighbor Interface (MNI) that enables core-to-core and core-to-memory communication and synchronization.

## Background

### Systolic dataflow

![](https://i.imgur.com/TXV1mBx.png)

* Processing Elements (PEs) supporting 16-bit floating-point (FP16) computations execute the convolution and matrix-multiplication operations in DNNs.
* Special Function Units (SFUs) supporting both 16- and 32-bit floating-point computations (FP16 and FP32) perform activation functions, pooling, gradient reduction and normalization operations.
* The L0 scratchpad is used to feed data along the rows (X direction).
* The L1 scratchpad memory is connected to the L0 memories and the columns (Y direction) of the SFU/PE array on one side, and interfaces with the external memory on the other.

### Scaling Training beyond FP16

![](https://i.imgur.com/idNIWCq.png)

* Hybrid-FP8 (HFP8) is a data format using two different FP8 (sign, exponent, mantissa) representations (a small encoding sketch appears at the end of this Background section):
    * One representation with lower dynamic range (1,4,3) for activations and weights, and the other with higher dynamic range (1,5,2) for errors.
    * HFP8 requires the exponent bias to be configurable.

### Scaling Inference beyond INT8

* Precision scaling for inference has been demonstrated successfully for 4-bit (INT4) and 2-bit (INT2) fixed-point representations.
* PArameterized Clipping acTivation (PACT) for activations and Statistics-aware Weight Binning (SaWB) for weights.
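To make the two HFP8 encodings concrete, below is a minimal sketch of rounding a value into a generic FP8 (sign, exponent, mantissa) format with a configurable exponent bias. This is illustrative only, not the paper's hardware converter: the rounding mode (round-to-nearest), the bias values used in the examples, and the omission of NaN/Inf handling are assumptions of this sketch.

```python
# Minimal sketch (not the paper's hardware converter): round an FP32 value to a
# generic FP8 format (sign, exponent, mantissa) with a configurable exponent
# bias. Round-to-nearest is assumed; NaN/Inf handling is omitted for brevity.
import math

def quantize_fp8(x: float, exp_bits: int, man_bits: int, bias: int) -> float:
    """Return the nearest value representable in the given FP8 format."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e = math.floor(math.log2(mag))        # unbiased exponent of x
    e_min = 1 - bias                      # smallest normal exponent
    e_max = (2 ** exp_bits - 2) - bias    # largest normal exponent
    e = max(e, e_min)                     # values below e_min quantize like subnormals
    frac = mag / 2.0 ** e                 # in [1, 2) for normal-range inputs
    step = 2.0 ** -man_bits
    frac = round(frac / step) * step      # round the mantissa to man_bits
    if frac >= 2.0:                       # mantissa rounded up past 2.0
        frac /= 2.0
        e += 1
    if e > e_max:                         # overflow: clamp to the largest normal
        return sign * (2.0 - step) * 2.0 ** e_max
    return sign * frac * 2.0 ** e

# (1,4,3): finer mantissa, smaller range -> weights/activations (FP8-fwd)
# (1,5,2): coarser mantissa, larger range -> back-propagated errors (FP8-bwd)
print(quantize_fp8(0.0423, exp_bits=4, man_bits=3, bias=7))    # 0.04296875
print(quantize_fp8(3.1e4,  exp_bits=4, man_bits=3, bias=7))    # clamps to 240.0
print(quantize_fp8(3.1e4,  exp_bits=5, man_bits=2, bias=15))   # 32768.0, still in range
```

With the illustrative biases used here, the (1,4,3) format saturates at 240 while the (1,5,2) format reaches tens of thousands, which is the dynamic-range trade-off behind assigning the formats to activations/weights versus errors; making the bias configurable, as noted above, shifts these ranges.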
## Core Architecture For Ultra-Low Precision

The key feature of precision scaling is that it preserves the regularity of the computation while saving energy both in the execution units and in the memory and interconnect subsystems.

* The baseline architecture has to be enhanced to:
    * Scale the overall peak TOPS
    * Balance the area and power of the PE array
    * Choose dataflows that reuse the operands effectively across SIMD lanes
    * Produce outputs in 16-bit format from the PE array
    * Balance the computational units to match the distribution of high-precision activation functions and ultra-low precision convolutions and matrix operations

### MPE Array: Mixed-Precision PE Array

One of the key challenges is to scale the overall peak TOPS to be commensurate with the scaled precision.

![](https://i.imgur.com/oBXBDgR.png)

#### 1. Supporting both INT and FP pipelines

* The MPE array has to provide comprehensive support for mixed-precision execution across different number formats, namely FP16, Hybrid-FP8, INT4 and INT2.
* Separating the integer and floating-point pipelines resolves the architectural complexity of handling multiple precisions while providing circuit-implementation opportunities to aggressively improve power efficiency.

#### 2. HFP8 Training

* The MPE's FPU pipeline supports both FP16 and HFP8 using the same 128-bit datapath for the 8-way SIMD FPU.
* On-the-fly conversion to a custom FP8 representation:
    * FP8 training requires matrix multiplications and convolutions in the backward path of training to use tensors of different FP8 formats as inputs.
    * The FPU is enhanced to use a custom (sign, exponent, mantissa) format of (1,5,3), converted on the fly.
* Sub-SIMD partitioning:
    * The SIMD units within the FPU are partitioned at a fine grain (sub-SIMD) to scale the peak TOPS in HFP8 mode.
    * In HFP8 mode, the multiply-accumulate instruction (FMMA) of the SIMD MPE realizes 2 multiplications and 2 additions.
    * The FP16 and HFP8 compute paths of the FPU merge at the adder to produce FP16 results.

#### 3. INT4/INT2 Inference

* Since MPEs have separate FPU and FXU pipelines, the INT4/INT2 engines target only DNN inference.
* Double-pumping the INT4 and INT2 pipelines:
    * Analysis of the decoupled FPU and FXU units revealed opportunities to double the INT4 and INT2 engines within the FXU.
    * Each FXU has 8 INT4 (16 INT2) multiply-accumulate engines.
* Operand reuse: sub-SIMD + across columns:
    * Each 8-way SIMD unit completes 8 INT4 (16 INT2) multiply-accumulate operations per cycle.
    * The 8-way SIMD FXU supports 4- and 2-bit integer MAC operations producing 16-bit integer results.

#### 4. Convolution and GEMM Dataflows in the MPE array

![](https://i.imgur.com/5TKAjz4.png)

* Define which dimensions are spatially mapped along the rows, columns, SIMD lanes and local register file (LRF) of the MPE array.
* Determine which data structures are streamed along rows and columns, and which are held stationary in the LRF.
* A novel weight-stationary dataflow (a loop-nest sketch follows this subsection):
    * Achieves high utilization all the way down to a batch size of 1
    * Avoids cross-row or cross-column communication
    * Minimizes residue effects due to strip-mining when workload dimensions are not a multiple of the hardware dimensions
* The input data structure is streamed along the rows and the output data structure along the columns.
* Weights need to be block-loaded into the LRF before the computation begins, so the interval between block-loads needs to be maximized for high utilization.
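The weight-stationary mapping can be made concrete with a small emulation of the loop nest. This is a sketch under assumed parameters (`R`, `C`, `S`) and an assumed dimension-to-hardware mapping (the reduction dimension K split across PE rows and SIMD lanes, the output dimension M across PE columns, and the batch dimension N streamed in time); the paper's exact mapping and tile sizes may differ.

```python
# Illustrative emulation of a weight-stationary GEMM dataflow on an R x C PE
# array with S SIMD lanes per PE. Weights are "block-loaded" into per-PE local
# register files (LRFs) and reused across all streamed inputs; inputs stream
# along the rows and outputs accumulate and drain along the columns.
import numpy as np

R, C, S = 4, 4, 8      # PE rows, PE columns, SIMD lanes per PE (assumed sizes)

def weight_stationary_gemm(A, W):
    """out[n, m] = sum_k A[n, k] * W[k, m], tiled over the PE array."""
    N, K = A.shape
    _, M = W.shape
    out = np.zeros((N, M), dtype=np.float32)
    for k0 in range(0, K, R * S):           # each iteration = one weight block-load
        for m0 in range(0, M, C):
            # Block-load phase: PE (r, c) now holds W[k0 + r*S : k0 + (r+1)*S, m0 + c]
            # in its LRF and keeps it stationary for the whole n loop below.
            for n in range(N):              # stream input rows along the PE rows
                for r in range(R):
                    ks = slice(k0 + r * S, min(k0 + (r + 1) * S, K))
                    for c in range(C):
                        m = m0 + c
                        if m < M:
                            # S-wide SIMD multiply-accumulate inside one PE; the
                            # partial sum drains along the column toward SFU/L1.
                            out[n, m] += np.dot(A[n, ks], W[ks, m])
    return out

# Quick check against a reference GEMM (dimensions deliberately not multiples
# of the array sizes, to exercise the strip-mining residue handling).
A = np.random.rand(5, 19).astype(np.float32)
W = np.random.rand(19, 7).astype(np.float32)
assert np.allclose(weight_stationary_gemm(A, W), A @ W, atol=1e-4)
```

All input rows are streamed against a weight tile before the next block-load, which reflects the "maximize the interval between block-loads" point above, and no PE ever needs data from another row or column.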
### SFU arrays: Full Spectrum of Activation Functions

* Include both accurate and fast versions of a spectrum of FP16 non-linear activation functions, as well as higher-precision FP32 operations.
* The SFU arrays are doubled to maintain the balance between the compute time spent in ultra-low precision convolutions and matrix operations and the time spent in higher-precision auxiliary operations.

### Sparsity-aware Zero-gating and Frequency Throttling

#### 1. Energy Savings: Zero-gating Support

* MPEs include support to skip the entire FPU pipeline when the multiplicands are zero and simply pass the addend to the result.

#### 2. Sparsity-aware Frequency Throttling

![](https://i.imgur.com/taflnBQ.png)

* Since the distribution of sparsity in the weights is static for inference, RaPiD exploits a hardware/software co-design to throttle power and maximize TOPS within the power limit.
* RaPiD includes an on-chip power control module based on clock-edge skipping, which uses the throttling rate recommended by the compiler.

### Ultra-low Precision Core with 2 corelets

![](https://i.imgur.com/3fultP2.png)

* The core is organized as two corelets to maximize reuse of data from the L1 scratchpad and to exploit the reduction in capacity requirements due to ultra-low precision.

### Data Communication Among Cores and Memory

* A ring interconnect in each direction (i.e., a bi-directional ring) is adopted to communicate data between cores and memory.
* Each core has a programmable Memory-Neighbor Interface (MNI) unit to facilitate data communication with memory and neighbors via a ring interface unit (RIU).

![](https://i.imgur.com/3TkO6b2.png)

* Data-fetch latency can be effectively hidden by double-buffering data in the L1 scratchpad, overlapped in time with computations in the core.
* Using load and store queues, the MNI supports multiple outstanding requests to neighbors and memory, and allows out-of-order data returns.
* To exploit the bi-directional ring bandwidth, the MNI load unit (MNI-LU) is designed to receive up to 2 data returns in any cycle.
* Data-sharing behavior is exploited by supporting multicast communication both in the ISA and in the hardware of the MNI's load and store units.

## RaPiD Chip For Training And Inference Systems

![](https://i.imgur.com/Kgdu2wC.png)

### Inference and Training System

![](https://i.imgur.com/zQTg4J4.png)

* Since the RaPiD chip architecture is designed to scale to a large number of cores, it is possible to construct multi-core/multi-chip inference and training systems.

### Software Architecture

![](https://i.imgur.com/SDSLsTA.png)

* A graph compiler automatically identifies how best to execute a given DNN graph on the AI chip and constructs the program binaries.
* A systematic design-space exploration is performed, focusing on graph optimization, scratchpad management, and work assignment to the cores of the AI chip.
* This design-space exploration is guided by a bandwidth-centric analytical power-performance model of the AI chip (a minimal sketch of such a model follows).
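As an illustration of what a bandwidth-centric analytical estimate can look like, here is a minimal roofline-style sketch: a layer's time is taken as the maximum of its compute time and its data-transfer time, assuming compute overlaps with double-buffered data movement. The `ChipConfig` fields and all numbers are assumptions for illustration, not parameters of the paper's model.

```python
# Minimal roofline-style sketch in the spirit of a bandwidth-centric analytical
# model; all parameters are illustrative assumptions, not values from the paper.
from dataclasses import dataclass

@dataclass
class ChipConfig:
    peak_tops: float      # peak compute at the chosen precision, in TOPS
    ext_bw_gbps: float    # external memory bandwidth, in GB/s
    utilization: float    # fraction of peak the chosen dataflow sustains

def layer_time_us(ops: float, bytes_moved: float, cfg: ChipConfig) -> float:
    """Estimate layer latency as max(compute time, data-transfer time), assuming
    compute and (double-buffered) data movement fully overlap."""
    t_compute = ops / (cfg.peak_tops * 1e12 * cfg.utilization)   # seconds
    t_memory = bytes_moved / (cfg.ext_bw_gbps * 1e9)             # seconds
    return max(t_compute, t_memory) * 1e6                        # microseconds

# Example: the same GEMM-like layer at FP16 vs INT4 on a hypothetical chip.
# Lower precision raises peak TOPS and shrinks the bytes moved, so a layer can
# move from being compute-bound toward being bandwidth-bound.
ops = 2 * 1024**3                                # 2*M*N*K multiply-accumulates
fp16 = ChipConfig(peak_tops=8.0,  ext_bw_gbps=200.0, utilization=0.8)
int4 = ChipConfig(peak_tops=32.0, ext_bw_gbps=200.0, utilization=0.8)
print(layer_time_us(ops, bytes_moved=6.0e6, cfg=fp16))   # ~336 us, compute-bound
print(layer_time_us(ops, bytes_moved=1.5e6, cfg=int4))   # ~84 us, compute time drops 4x
```

A model of this shape lets the design-space exploration compare candidate graph optimizations, scratchpad-management choices and work assignments by summing per-layer estimates.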
## Results

### Experimental Methodology

* Performance estimation
    * Developed a detailed performance model of the RaPiD chip.
    * Power was measured in silicon and combined with the projected utilization of the different components.
* System configuration
    * For inference, we study a RaPiD chip with 4 cores attached to an external DDR memory with 200 GBps bandwidth.
    * For training, we consider a distributed system with 4 scaled-up RaPiD chips, each containing 32 cores and a 64 MB distributed L1 scratchpad, attached to a High Bandwidth Memory (HBM) supplying 400 GBps bandwidth.
    * The chip-to-chip interconnect bandwidth is 128 GBps.
* Benchmarks
    * We use 11 state-of-the-art DNN benchmarks.
* Experimental setup and baseline
    * A batch size of 1 for inference, and a minibatch size of 512 for training.
    * The FP16 implementation on RaPiD is used as the baseline to report relative improvements achieved at lower precisions.

### Inference Performance and Efficiency

![](https://i.imgur.com/2tFVTxo.png)

* The speedup at lower precisions is primarily limited by the fraction of operations that are executed in FP16 (see the short sketch at the end of these notes).

![](https://i.imgur.com/2XOpk2q.png)

* Shows the sustained compute efficiency (TOPS/W) achieved at FP8 and INT4 precisions.

### Training Throughput

![](https://i.imgur.com/qOqiMEk.png)

* Training incurs additional off-chip communication for gradient reduction and weight broadcast.
* Training is memory intensive, as activations produced during the forward pass need to be retained for computing the weight gradients during back-propagation.

### Benefits of Sparsity-aware Throttling

![](https://i.imgur.com/sDuKuDR.png)

* The pruned models used FP16 precision; combining pruning with low precision is still an evolving area of research.

### Performance Breakdown Analysis

![](https://i.imgur.com/bMpLBRu.png)

### Inference/Training System Scaling

![](https://i.imgur.com/vQzKXeb.png)

* For inference systems, we increase the number of cores in the chip and show the speedup for INT4 precision.
    * Performance scales as we scale the number of cores from 1 to 32.
    * Compute-intensive benchmarks show performance improvement even as we scale to 32 cores.
    * For benchmarks dominated either by auxiliary operations or by memory stalls, we see the speedup saturate as we increase the number of cores.
* For training systems, we increase the number of chips in the system and show the speedup for HFP8 precision.

## Related Work

* CPU-based techniques
    * Optimized linear algebra libraries
    * Efficient parallelization on multicores
    * Efficient data layouts and batching
    * Proposed compiler, ISA and micro-architectural optimizations to exploit certain properties
* GPU-based techniques
    * Data/model/pipeline parallelization techniques
    * Memory management
    * Locality-aware device placement
    * Exploiting sparsity in activations and weights
* Hardware accelerator techniques
    * Satisfy the computational needs of AI workloads
    * A myriad of accelerators, ranging from low-power ASIC / FPGA / CGRA cores to large-scale systems, have been proposed
    * Using dense arithmetic arrays, heterogeneous processing tiles, low-precision data representations and sometimes dynamic hardware reconfiguration
    * Exploiting sparsity in activations and weights, 3D memory technologies, bit-serial architectures, in-memory computation
* Commercial AI chips
    * Google TPUs, NVIDIA Tensor Cores, Intel NNP, ...
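Closing sketch, referenced from the Inference Performance and Efficiency section above: an Amdahl-style bound that shows why the fraction of operations left in FP16 limits the overall speedup over the FP16 baseline. The numbers are illustrative, not measurements from the paper.

```python
# Amdahl-style bound on mixed-precision speedup over the FP16 baseline.
# `low_precision_speedup` is the assumed per-op speedup of the low-precision
# engines; both example values below are illustrative.

def bounded_speedup(fp16_fraction: float, low_precision_speedup: float) -> float:
    """Overall speedup when (1 - fp16_fraction) of the work runs faster by
    low_precision_speedup and the remainder still executes at the FP16 rate."""
    return 1.0 / (fp16_fraction + (1.0 - fp16_fraction) / low_precision_speedup)

print(bounded_speedup(0.20, 4.0))   # 2.5x: 20% of ops in FP16 caps a 4x engine at 2.5x
print(bounded_speedup(0.05, 4.0))   # ~3.48x: shrinking the FP16 fraction recovers most of it
```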