# Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture

###### tags: `Accelerators`
###### paper origin: 52nd IEEE/ACM International Symposium on Microarchitecture
###### papers: [link](https://people.eecs.berkeley.edu/~ysshao/assets/papers/shao2019-micro.pdf)
###### slides: [link](https://people.eecs.berkeley.edu/~ysshao/assets/talks/shao2019-micro-slides.pdf)

## 1. INTRODUCTION

### Research Problems
* Many applications, from edge devices to data centers, demand fast and efficient inference, often with low-latency or real-time throughput requirements.
* Previously proposed multi-chip DL accelerators have focused on improving total compute throughput and on-chip storage size but have not addressed the scalability challenges of building a large-scale system from multiple discrete components.

### Proposed Solution
* This work presents Simba, a scalable MCM-based deep-learning inference accelerator architecture.
* Specifically, it examines the implications of the non-uniform latency and bandwidth of on-chip versus on-package communication, which lead to significant latency variability across chiplets.

## 2. BACKGROUND AND MOTIVATION

### DNN Basics
* DNNs are constructed from a series of layers, including convolutional layers, pooling layers, activation layers, and fully-connected layers.
* Convolutional layer
![](https://imgur.com/X4qKG7r.png =80%x)
* Activation and pooling layers are typically merged with convolutional layers during execution to reduce data movement.

### Multi-Chip-Module Packaging
* Such systems consist of multiple chiplets connected via on-package links that employ efficient intra-package signaling circuits.
* Compared to a large monolithic die, MCMs can reduce both design and fabrication costs.

### Non-Uniformity in MCM-based Design
* Communication latency between two elements in an MCM depends heavily on their spatial locality within the package.
* In large-scale systems with heterogeneous interconnect architectures such as MCMs, assuming uniform latency and bandwidth when selecting DNN tilings can degrade performance and energy efficiency.

## 3. SIMBA ARCHITECTURE AND SYSTEM

### Simba Architecture
![](https://imgur.com/m0EdElo.png =100%x)
* The design target is an accelerator scalable to data-center inference.
* Simba adopts a hierarchical interconnect to efficiently connect different processing elements (PEs). It consists of a network-on-chip (NoC) that connects PEs on the same chiplet and a network-on-package (NoP) that connects chiplets together on the same package. All communication is designed to be latency-insensitive.
* The figure above illustrates the three-level hierarchy of the Simba architecture: package, chiplet, and PE.
* **Simba PE:**
    * Includes a distributed weight buffer, an input buffer, parallel vector multiply-and-add (MAC) units, an accumulation buffer, and a post-processing unit.
    * Weights remain in the vector MAC registers and are reused across iterations, while new inputs are read every cycle.
* **Simba Global PE:**
    * The Global PE can either unicast data to one PE or multicast to multiple PEs, even across chiplet boundaries.
    * It also serves as a platform for near-memory computation and can perform computations locally to reduce communication overhead.

### Simba Silicon Prototype
![](https://imgur.com/pt1f9we.png)
* A 6 mm² chiplet in TSMC 16 nm FinFET process technology, with 36 chiplets per package.
* Uses ground-referenced signaling (GRS) technology for intra-package communication.
* **Simba Controller:**
    * Contains a RISC-V processor core responsible for configuring and managing the chiplet's PE and Global PE states via memory-mapped registers.
    * Synchronization of chiplet control processors across the package is implemented via memory-mapped interrupts.
* **Simba Interconnect:**
![](https://imgur.com/xvySLkl.png)

### Simba Baseline Tiling
![](https://imgur.com/jLuoh3j.png)
* The default dataflow uniformly partitions weights along the input-channel (C) and output-channel (K) dimensions (see the partitioning sketch after this list).
* In addition, Simba can uniformly partition the height (P) and width (Q) dimensions of an output activation across chiplets and PEs.
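To make the baseline tiling concrete, here is a minimal Python sketch, not the authors' code: it uniformly splits a layer's C and K dimensions, first across chiplets and then across PEs within a chiplet. The function name and the example layer shape are illustrative; the 36-chiplet package and 16 PEs per chiplet follow the prototype.

```python
# Minimal sketch of Simba-style baseline tiling: weights are uniformly
# partitioned along input channels (C) and output channels (K), first
# across chiplets, then across PEs within each chiplet.

NUM_CHIPLETS = 36      # chiplets per package (prototype)
PES_PER_CHIPLET = 16   # PEs per chiplet (prototype)

def uniform_ck_tiling(C, K, chiplets_c, chiplets_k, pes_c, pes_k):
    """Return, per (chiplet, PE), the [c_lo, c_hi) x [k_lo, k_hi) weight slice.

    Note: partitioning along C means each PE produces partial sums that must
    be accumulated across PEs sharing the same K slice.
    """
    assert chiplets_c * chiplets_k <= NUM_CHIPLETS
    assert pes_c * pes_k <= PES_PER_CHIPLET
    c_per_chiplet = C // chiplets_c
    k_per_chiplet = K // chiplets_k
    c_per_pe = c_per_chiplet // pes_c
    k_per_pe = k_per_chiplet // pes_k
    tiles = {}
    for cc in range(chiplets_c):
        for ck in range(chiplets_k):
            for pc in range(pes_c):
                for pk in range(pes_k):
                    c_lo = cc * c_per_chiplet + pc * c_per_pe
                    k_lo = ck * k_per_chiplet + pk * k_per_pe
                    tiles[(cc, ck), (pc, pk)] = (
                        (c_lo, c_lo + c_per_pe),
                        (k_lo, k_lo + k_per_pe),
                    )
    return tiles

# Example: a 256x512 (C x K) layer split 2x2 across chiplets, 2x2 across PEs.
tiles = uniform_ck_tiling(C=256, K=512, chiplets_c=2, chiplets_k=2, pes_c=2, pes_k=2)
print(tiles[(0, 1), (1, 0)])  # -> ((64, 128), (256, 384))
```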
## 4. SIMBA CHARACTERIZATION

### Methodology
![](https://imgur.com/Ol8pXHL.png)
* Power and performance measurements begin after the weights have been loaded into each PE's weight buffer and the inputs have been loaded into the Global PE buffers.
* The chiplets operate at a core voltage of 0.72 V, a PE frequency of 1.03 GHz, and a GRS bandwidth of 11 Gbps.
* Application measurements focus on ResNet-50, DriveNet, and AlexNet.

### Overview
![](https://imgur.com/gvQVTow.png)
* The results highlight the importance of strategies for efficiently mapping DNNs to hardware.
* The degree of data reuse strongly influences efficiency; layers with high reuse factors tend to perform computation more efficiently than layers that require more data movement.
* Increasing the number of chiplets used in the system improves performance, but it also increases the energy cost of chiplet-to-chiplet communication and synchronization.

### Mapping Sensitivity
* When a layer is mapped to a single chiplet, execution latency decreases linearly from one to eight PEs because of the improved compute throughput. However, performance flattens out beyond eight PEs due to memory-bandwidth contention at the Global PE's SRAM.
* When mapping across chiplets, execution time does not scale down beyond four PEs: the additional latency of communicating across multiple chiplets leads to longer execution times.
![](https://imgur.com/MAOZDOt.png =60%x)

### Layer Sensitivity
* The amount of compute parallelism that an MCM can leverage varies from layer to layer (the weights in early layers of the network are so small that they cannot fully utilize Simba's compute throughput), and the cost of communication can hinder the ability to exploit that parallelism.
![](https://imgur.com/8vSQDXH.png =60%x)

### NoP Bandwidth Sensitivity
* The relative NoP bandwidth is swept by reducing the frequencies of the PE, Global PE, and RISC-V partitions below nominal while maintaining a constant NoP frequency.
* End-to-end performance is sensitive to NoP bandwidth, especially for applications with a significant amount of communication.
![](https://imgur.com/VCVeDew.png =60%x)

### NoP Latency Sensitivity
* The NoP has higher latency than the NoC.
* Mapping layers to four chiplets, the experiments adjust the locations of the selected chiplets in the package to modulate latency.
* Communication latency plays a significant role in achieving good performance and energy efficiency in a large-scale system.
![](https://imgur.com/00iRRKC.png =60%x)

### Weak Scaling
* Fix the amount of work per chiplet but increase the total amount of computation by increasing the batch size.
* Throughput improves, but latency also increases due to the synchronization cost across multiple chiplets (see the first-order model sketched after this section).
![](https://imgur.com/OGiOBj3.png =60%x)
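The weak-scaling trend can be captured with a first-order analytic model. The sketch below is an illustration under stated assumptions, not the paper's data: `t_compute` and `t_sync` are made-up normalized constants, and synchronization cost is assumed to grow linearly with the number of chiplets.

```python
# First-order weak-scaling model (illustrative assumptions, not measured data):
# per-chiplet work is fixed, batch size grows with the number of chiplets,
# and a per-chiplet synchronization term adds to end-to-end latency.

def weak_scaling(num_chiplets, t_compute=1.0, t_sync=0.05):
    """t_compute: normalized per-batch compute time (hypothetical constant).
    t_sync: per-chiplet synchronization overhead (hypothetical constant)."""
    batch = num_chiplets                          # one sample per chiplet
    latency = t_compute + t_sync * num_chiplets   # sync grows with package size
    throughput = batch / latency                  # samples per unit time
    return latency, throughput

for n in (1, 4, 16, 36):
    lat, thr = weak_scaling(n)
    print(f"{n:2d} chiplets: latency={lat:.2f}, throughput={thr:.2f}")
# Throughput keeps improving with more chiplets, but sublinearly,
# while per-batch latency steadily rises -- the trend the figure shows.
```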
### Comparisons with GPUs
![](https://imgur.com/bVGm0m6.png)
* Because of Simba's limited on-package storage capacity for input activations, Simba is run only at batch sizes one and two.
* At larger batch sizes, instead of exploiting batch-level parallelism as GPUs do, Simba runs each batch sequentially, so its throughput stays close to that of batch size one.

## 5. SIMBA NON-UNIFORM TILING

### Non-Uniform Work Partitioning
* PEs are spatially distributed, with different communication latencies between them.
* PEs closer to the data producers perform more work to maximize physical data locality, while PEs farther away do less work to reduce tail-latency effects.
![](https://imgur.com/1q7X9tt.png)
* The achievable performance improvement is highly sensitive to the compute-to-communication ratio of a given mapping.
* When compute and communication latencies are comparable, as is typically desired in a good mapping, the performance improvement is more pronounced.

### Communication-Aware Data Placement
* Communication latency becomes highly sensitive to the physical location of data.
![](https://imgur.com/Nb3tqwN.png =40%x)
* A practical greedy algorithm iteratively determines where input and output activation data should be placed in the Simba system (a greedy-placement sketch appears at the end of this note).
![](https://imgur.com/Qe9vnA2.png)
![](https://imgur.com/6tjwZak.png)
* Since the previous stage of the mapping process has already determined the data tiling, this stage need only focus on data placement, not re-tiling.
![](https://imgur.com/EjnYgoO.png)

### Cross-Layer Pipelining
* Because the Simba interconnect supports flexible communication patterns, different-sized clusters of chiplets can be assigned to different layers.
![](https://imgur.com/ZjJkfFI.png)
* Partition the package into clusters, assign different layers to each cluster, and execute the layers in a pipelined fashion.
* With pipelining, the overall throughput is limited by the longest pipeline stage (see the throughput sketch below).
![](https://imgur.com/83taG5Y.png)
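To make the "longest stage limits throughput" point concrete, here is a small sketch under illustrative assumptions: the per-layer work values and cluster sizes are hypothetical, and stage time is modeled simply as work divided by cluster size.

```python
# Cross-layer pipelining sketch: the package is split into clusters of
# chiplets, one layer per cluster; steady-state throughput is set by the
# slowest pipeline stage. All numbers are made up for illustration.

def pipeline_throughput(layer_work, cluster_sizes):
    """layer_work[i]: normalized compute for layer i (hypothetical);
    cluster_sizes[i]: chiplets assigned to layer i."""
    assert len(layer_work) == len(cluster_sizes)
    stage_times = [w / n for w, n in zip(layer_work, cluster_sizes)]
    return 1.0 / max(stage_times), stage_times

work = [4.0, 8.0, 2.0]                            # three layers, unequal compute
even = pipeline_throughput(work, [12, 12, 12])    # uniform clusters (36 chiplets)
prop = pipeline_throughput(work, [10, 20, 6])     # clusters sized to balance stages
print(even)  # bottlenecked by the heaviest layer -> throughput 1.5
print(prop)  # balanced stages -> throughput 2.5
```

Sizing clusters in proportion to layer work balances the stage times, which is why assigning different-sized clusters to different layers raises steady-state throughput.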
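Returning to communication-aware data placement: the sketch below shows one plausible greedy pass. The Manhattan-hop cost model, the buffer-pressure penalty, and the 6x6 mesh layout of the 36-chiplet package are assumptions for illustration; the paper's exact algorithm and cost function may differ.

```python
# Greedy, communication-aware placement sketch (assumed cost model, not the
# paper's exact algorithm). Each activation tile is placed, one at a time,
# at the chiplet minimizing estimated NoP hop distance to its consumers.

from itertools import product

GRID = 6  # 36 chiplets arranged as a 6x6 mesh (assumed layout)

def hops(a, b):
    """Manhattan distance between chiplets, as a proxy for NoP latency."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def greedy_place(tiles):
    """tiles: list of (tile_id, [consumer chiplet coords]) pairs.
    Returns tile_id -> chosen chiplet, placed greedily by total hop cost."""
    placement = {}
    load = {loc: 0 for loc in product(range(GRID), repeat=2)}  # buffer pressure
    for tile_id, consumers in tiles:
        # Cost = total hops to consumers, plus a penalty for crowded chiplets.
        best = min(load, key=lambda loc: sum(hops(loc, c) for c in consumers)
                                         + load[loc])
        placement[tile_id] = best
        load[best] += 1
    return placement

# Two tiles: one consumed in the top-left corner, one spread across the package.
print(greedy_place([("in0", [(0, 0), (0, 1)]), ("in1", [(0, 0), (5, 5)])]))
```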