# MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores

###### tags: `Accelerators`
###### paper origin: HPCA-2022
###### paper: [link](https://arxiv.org/abs/2104.13997)

## I. Introduction

### Motivation
* With the trend toward building large accelerators (e.g., MCM-based SIMBA, wafer-scale Cerebras, scaled-out platforms, heterogeneous multi-accelerators), multi-tenancy becomes a natural use case.
* Mapping jobs from several DNNs onto such an accelerator simultaneously is an open problem.

### Problem
![](https://i.imgur.com/r1pldlY.png)
* Existing multi-DNN schedulers (summarized above) rely on manually-designed heuristics.
* This limits their scalability to diverse accelerator back-ends and emerging workloads.

### Solution
* We propose an optimization framework called **Multi-workload Multi-accelerator Mapping Explorer (M3E)**:
    * an efficient encoding scheme for the mapping search space;
    * several modules that let the found mapping effectively orchestrate data movement across sub-accelerator cores;
    * support for several commonly used black-box optimization algorithms and two reinforcement-learning methods as the underlying optimizers.

## II. Background

### Accelerator Architecture
![](https://i.imgur.com/04672Tl.png)

### Sub-Accelerator Architecture
* Each sub-accelerator is a conventional DNN accelerator comprised of an array of processing elements (PEs).
* Each PE has a MAC unit to compute partial sums and a local scratchpad (SL) to store weights, activations, and partial sums.
* Each sub-accelerator also houses a shared global scratchpad (SG) to prefetch activations and weights from HBM/DRAM for the next computation tile that will be mapped over the PEs and SLs.

### Dataflow
* Given a DNN layer, each sub-accelerator employs a dataflow (aka *local mapping*).
* The dataflow determines the **loop order, parallelism dimensions, and tile size** for running the layer on the sub-accelerator.

## III. Problem Formulation

### Jobs
* We focus on systems targeting **multi-tenancy** and **batched-job tasks**.
    * Batched jobs are usually **not latency sensitive** and target **high throughput** (compared to single jobs).
    * Multi-tenant batched-job tasks have far **fewer layer-dependency issues** than single-tenant single-job tasks:
        * Multi-tenancy (running multiple neural networks together) largely alleviates the layer-dependency problem, since layers from different neural networks can be scheduled freely without any dependency between them.
        * In batched-job tasks, the activations are broken down into mini-batches because of memory constraints; these mini-batches are independent of each other.

### Groups
* At the host, a lightweight control program divides the pool of queued jobs into multiple **dependency-free** groups (a minimal sketch follows at the end of this section).
* The group size should be greater than or equal to the number of sub-accelerators, so that no sub-accelerator sits completely idle.

### Mapping
* The mapping is the strategy that orchestrates job assignment and job prioritization across sub-accelerators over space and time.
* It is a *global mapping*: global control of data movement between the host and the sub-accelerators.

### Mapper
* Manually-designed mapping
    * \+ highly tuned for specific accelerators/systems or tasks/workloads
    * \- whenever the underlying accelerator or the target task changes, the mapper needs to be re-designed, or at least undergo another lengthy tuning cycle.
* **Mapping optimization method** (this work)
    * We simply re-run the optimizer's search whenever the underlying accelerator evolves or the target task changes.
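To make the host-side grouping step described under *Groups* concrete, here is a minimal sketch under simplifying assumptions (the `Job` type and `make_groups` function are hypothetical, not from the paper): it admits at most one layer of each DNN into a group, so jobs inside a group carry no dependencies, and it warns if a group falls below the number of sub-accelerators.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Job:
    model_id: int   # which DNN this layer belongs to (hypothetical field)
    layer_id: int   # position of the layer within that DNN

def make_groups(jobs: List[Job], num_sub_accels: int) -> List[List[Job]]:
    """Split the job queue into dependency-free groups.

    At most one layer of each DNN is admitted into a group, so jobs in
    the same group never depend on one another (layers from different
    DNNs, or from different mini-batches, are independent).
    """
    groups: List[List[Job]] = []
    remaining = list(jobs)
    while remaining:
        group: List[Job] = []
        deferred: List[Job] = []
        for job in remaining:
            if all(j.model_id != job.model_id for j in group):
                group.append(job)      # independent of everything already in the group
            else:
                deferred.append(job)   # a later layer of a DNN already in this group
        if len(group) < num_sub_accels:
            print("warning: group smaller than the number of sub-accelerators; "
                  "some cores will idle during this group")
        groups.append(group)
        remaining = deferred
    return groups
```

With a large multi-tenant, batched queue, each group naturally ends up with at least as many jobs as there are sub-accelerators, which is the condition the paper asks for.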
## IV. Optimization Framework (M3E)
![](https://i.imgur.com/IesQrQK.png)
* **Evaluation phase**: the candidate mapping is evaluated by the cost model.
* **Optimization phase**: the optimization algorithm tweaks the mapping based on feedback from the evaluation phase. The optimization-evaluation loop runs iteratively until the target objective converges or a fixed budget of epochs is spent.
* A mapping consists of two key components:
    * **Sub-accelerator selection**: the assignment of each job to a specific sub-accelerator.
    * **Job prioritization**: the execution order of jobs on a given sub-accelerator.

### Encoding *(how the search space is described)*
![](https://i.imgur.com/Q3CQ1vX.png)
* The **Sub-Accel Selection Sec.** decides which job goes to which sub-accelerator.
    * Each value is the sub-accel ID of the corresponding job.
* The **Job Prioritizing Sec.** decides the order of the jobs on each sub-accelerator.
    * Each value is the priority of the corresponding job, ranging from 0 (high) to 1 (low).

### Optimization algorithm *(how the search space is explored)*
![](https://i.imgur.com/wEp0Fp4.png)

### Building Blocks of M3E
1. **BW Allocator**: in a multi-core accelerator, the system BW to the accelerator is a globally shared resource among cores.
    * Evenly allocating the same amount of BW to every sub-accelerator increases the chance of compute under-utilization.
    * The BW allocator redistributes BW based on the memory-BW intensity of the jobs running on the different sub-accelerators.
    ![](https://i.imgur.com/t1tUCv4.png)
2. **Job Analyzer**: takes the job descriptions as input and estimates the no-stall latency and required BW of each job on each sub-accelerator using a cost model, producing the Job Analysis Table (Fig. 3).
3. **HW cost model for sub-accelerators**: we leverage MAESTRO as the underlying cost model for each sub-accelerator.
    * Given a DNN layer, a HW resource configuration (PEs, SL size, SG size, NoC latency, and BW), and a mapping/dataflow strategy, MAESTRO estimates statistics such as latency, energy, runtime, power, and area.
4. **Job Analysis Table**: the Job Analyzer profiles each job in the task with the cost model and stores the results in this table.
    * No-stall latency: the latency of running each job on each sub-accelerator, assuming sufficient memory bandwidth.
    * No-stall bandwidth: the minimum bandwidth each sub-accelerator needs to stay compute-bound rather than memory-bound.

### M3E Workflow
1. **Set-up**: the user configures the optimizer with the job descriptions, the configuration (number of PEs, dataflow) of each sub-accelerator, the system constraint (system BW), and the objective (e.g., throughput).
2. **Pre-process**: the **Job Analyzer** prepares the Job Analysis Table.
3. **Optimization loop** (a minimal sketch of the decode and BW-allocation steps follows below)
    * Optimization phase: the optimization algorithm updates the encoded mapping based on feedback from the evaluation block.
    * Evaluation phase:
        * The **Decoder** decodes the encoded mapping into a mapping description.
        * The **BW Allocator** takes the mapping description and allocates BW to each sub-accelerator.
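To make the encoding, the Decoder, and the BW Allocator concrete, here is a minimal sketch under simplifying assumptions (all names are hypothetical, and the proportional split is an illustrative policy; the real M3E obtains the no-stall numbers from MAESTRO): the genome's two sections are decoded into per-sub-accelerator job queues, and the system BW is then split in proportion to each core's no-stall BW demand rather than evenly.

```python
from typing import Dict, List, Sequence, Tuple

def decode(accel_ids: Sequence[int],
           priorities: Sequence[float],
           num_sub_accels: int) -> Dict[int, List[int]]:
    """Decode the two sections of an encoded mapping.

    accel_ids[j]  -- sub-accel selection section: which core runs job j
    priorities[j] -- job prioritizing section: 0 = highest, 1 = lowest priority
    Returns, for each sub-accelerator, its jobs sorted by priority.
    """
    queues: Dict[int, List[int]] = {a: [] for a in range(num_sub_accels)}
    for job, accel in enumerate(accel_ids):
        queues[accel].append(job)
    for accel in queues:
        queues[accel].sort(key=lambda j: priorities[j])
    return queues

def allocate_bw(queues: Dict[int, List[int]],
                no_stall_bw: Dict[Tuple[int, int], float],  # (job, accel) -> BW to stay compute-bound
                system_bw: float) -> Dict[int, float]:
    """Split system BW in proportion to each core's demand.

    A core's demand is taken as the no-stall bandwidth of the job at the
    head of its queue; bandwidth-hungry cores receive a larger share,
    instead of an even split that under-utilizes compute.
    """
    demand = {a: (no_stall_bw[(q[0], a)] if q else 0.0) for a, q in queues.items()}
    total = sum(demand.values()) or 1.0
    return {a: system_bw * d / total for a, d in demand.items()}

# Example: 5 jobs on 2 sub-accelerators
accel_ids  = [0, 1, 0, 1, 1]
priorities = [0.7, 0.1, 0.2, 0.9, 0.4]
queues = decode(accel_ids, priorities, num_sub_accels=2)
# queues == {0: [2, 0], 1: [1, 4, 3]}
```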
## V. Optimization Algorithm (MAGMA)
MAGMA is a GA-based search technique.

### Why GA?
* GA reaches performance competitive with deep reinforcement learning.
* GA is lightweight and fast.

### MAGMA Algorithm Details
1. **Terminology and basics of GA:**
    ![](https://i.imgur.com/DMgWfyK.png)
    * The basic mechanism in GAs is to create a population of individuals in each generation.
    * All individuals are evaluated and sorted by their fitness.
    * The best-performing individuals are used as parents to create the next population via genetic operators.
2. **Genetic operators** (a minimal sketch of one generation follows at the end of this section):
    * Mutation: ![](https://i.imgur.com/aqjSlGv.png)
        * Randomly select multiple genes and mutate them to random values.
    * Crossover: three specialized crossover operators
        * **Crossover-gen**: ![](https://i.imgur.com/t9rpx8P.png)
            * First, randomly sample which type of genome to cross over.
            * Next, randomly sample a pivot point and exchange the genes of the two genomes.
        * **Crossover-rg**: ![](https://i.imgur.com/GGFI9WA.png)
            * A range crossover mechanism structured to preserve the dependency of genes across genomes.
            * Randomly pick a range and simultaneously cross over all genes falling into that region in both genomes, so the cross-genome dependency is preserved.
        * **Crossover-accel**: ![](https://i.imgur.com/5IP26Zm.png)
            * A crossover method that preserves the dependency of job ordering within a sub-accelerator.
            * Randomly select a sub-accelerator and pass its job-ordering information to the children.
    ![](https://i.imgur.com/nkCqgTH.png)
3. **Hyper-parameter tuning**:
    * Mutation rate, crossover rates, population size, and elite ratio are MAGMA's hyper-parameters.
    * We applied a hyper-parameter search via a Bayesian optimization framework to select the setting that gives MAGMA the highest performance across multiple workloads.
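As a rough illustration of how the genetic operators act on the two-section encoding, below is a minimal sketch of one MAGMA generation (a hypothetical simplification: only mutation and crossover-accel are shown, and `fitness` stands in for the full M3E evaluation phase).

```python
import random
from typing import Callable, List, Tuple

# An individual = (sub-accel selection section, job prioritizing section)
Individual = Tuple[List[int], List[float]]

def mutate(ind: Individual, num_sub_accels: int, rate: float = 0.1) -> Individual:
    """Randomly re-draw some genes in both sections."""
    accel_ids, priorities = list(ind[0]), list(ind[1])
    for j in range(len(accel_ids)):
        if random.random() < rate:
            accel_ids[j] = random.randrange(num_sub_accels)
        if random.random() < rate:
            priorities[j] = random.random()
    return accel_ids, priorities

def crossover_accel(p1: Individual, p2: Individual) -> Individual:
    """Crossover-accel: copy parent 1, then graft in parent 2's complete
    assignment and priorities for one randomly chosen sub-accelerator,
    preserving the job ordering within that sub-accelerator."""
    accel_ids, priorities = list(p1[0]), list(p1[1])
    target = random.choice(p2[0])              # a sub-accelerator used by parent 2
    for j, a in enumerate(p2[0]):
        if a == target:
            accel_ids[j] = a
            priorities[j] = p2[1][j]
    return accel_ids, priorities

def run_generation(population: List[Individual],
                   fitness: Callable[[Individual], float],
                   num_sub_accels: int,
                   elite_ratio: float = 0.1) -> List[Individual]:
    """Evaluate, keep the elites, and refill the population with children."""
    ranked = sorted(population, key=fitness, reverse=True)
    num_elites = max(1, int(elite_ratio * len(population)))
    parents = ranked[:num_elites]
    children = list(parents)                   # elites survive unchanged
    while len(children) < len(population):
        p1, p2 = random.sample(parents, 2) if len(parents) > 1 else (parents[0], parents[0])
        children.append(mutate(crossover_accel(p1, p2), num_sub_accels))
    return children
```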
## VI. Evaluation

### Methodology
1. Tasks and benchmarks: four types of tasks
    1. Vision
    2. Language
    3. Recommendation
    4. Mix
2. Accelerators: two classes of sub-accelerator
    ![](https://i.imgur.com/kWfqUaW.png)
    * HB: high bandwidth (inspired by NVDLA)
    * LB: low bandwidth (inspired by Eyeriss)

### Mapper Settings
* Baseline manually-tuned mappers
* Optimization methods
* MAGMA

### Homogeneous Accelerators
![](https://i.imgur.com/riMtvgq.png)
Even though AI-MT and Herald-like are designed for vision and language tasks, they also work well when applied to recommendation and mix tasks.

### Heterogeneous Accelerators
* Small Accel & Large Accel
    ![](https://i.imgur.com/TZ4X1HM.png)
    * (a)(b)(S2) Herald-like performs well because it is designed for heterogeneous accelerators, while AI-MT targets homogeneous accelerators.
    * (c)(d)(S4) On the large accelerator, the mapping task becomes more complex since the design space of the mapping grows.
* BW-limited environment
    ![](https://i.imgur.com/7f5V1f5.png)
    * For both the small and large accelerators, as BW decreases, MAGMA stands out more clearly with better relative performance.
* Homogeneous vs. heterogeneous
    ![](https://i.imgur.com/iLfljJx.png)
    * LB-style sub-accelerators usually have longer runtime but lower BW requirements than HB-style ones, so S4 (Large Heterogeneous) incurs more no-stall latency but requires less BW than S3 (Large Homogeneous).
    * When BW is limited, the heterogeneous setting lets the accelerator exploit the differing BW requirements among sub-accelerators to relax BW contention.
* Big (S4) vs. BigLittle (S5)
    * BigLittle has a smaller BW requirement because of its smaller sub-accelerator sizes.

### Flexible Accelerator
* Accelerator configuration
    * The number of PEs in each sub-accelerator is fixed; the shape of the 2D PE array is flexible.
* Dataflow strategy
    * We pick the dataflow strategy of each sub-accelerator to maximize the utilization of the PE array.
    * To maximize utilization, we align the PE array dimensions to be factors of the tile's parallelized dimensions as much as possible (see the sketch at the end of this note).
* Evaluation
    ![](https://i.imgur.com/rf4btlA.png)
    * For both the Vision and Mix tasks, the flexible configuration outperforms the fixed one in average per-job no-stall latency, owing to its ability to maximize PE-array utilization.
    * However, it also incurs a higher BW requirement.
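As a rough illustration of the array-shape alignment heuristic described under *Dataflow strategy*, the sketch below picks, among all 2D shapes with a fixed PE count, the one whose dimensions best divide the tile's two parallelized dimensions (the function names and the utilization metric are hypothetical, not from the paper).

```python
from typing import List, Tuple

def factor_pairs(num_pes: int) -> List[Tuple[int, int]]:
    """All (rows, cols) shapes with rows * cols == num_pes."""
    return [(r, num_pes // r) for r in range(1, num_pes + 1) if num_pes % r == 0]

def utilization(shape: Tuple[int, int], tile_dims: Tuple[int, int]) -> float:
    """Fraction of PEs doing useful work when the tile's two parallelized
    dimensions are mapped spatially onto the array; padding up to a
    multiple of each array dimension leaves some PEs idle."""
    util = 1.0
    for dim, size in zip(tile_dims, shape):
        passes = -(-dim // size)          # ceiling division
        util *= dim / (passes * size)
    return util

def best_shape(num_pes: int, tile_dims: Tuple[int, int]) -> Tuple[int, int]:
    """Flexible configuration: choose the array shape with the highest utilization."""
    return max(factor_pairs(num_pes), key=lambda s: utilization(s, tile_dims))

# Example: 256 PEs, tile that parallelizes dimensions of size 48 and 20
print(best_shape(256, (48, 20)))   # -> (64, 4): better aligned than a fixed 32x8 array
```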