MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores
paper origin: HPCA-2022
I. Introduction
Motivation
- With the trend toward building large accelerators (e.g., MCM-based SIMBA, wafer-scale Cerebras, scaled-out platforms, heterogeneous multi-accelerators), enabling multi-tenancy is a natural use case.
- Mapping jobs from several DNNs simultaneously onto such an accelerator is a challenging problem.
Problem
- Prior approaches to this problem rely on manually designed heuristics.
- This limits their scalability to diverse accelerator back-ends and emerging workloads.
Solution
- We propose an optimization framework called Multi-workload Multi-accelerator Mapping Explorer (M3E)
- An efficient encoding scheme to encode the search space of the mapping
- Several modules to enable the found mapping to effectively orchestrate the data movement across sub-accelerator cores.
- Support for several commonly used black-box optimization algorithms and two reinforcement-learning methods as the underlying optimizers.
II. Background
Accelerator Architecture
Sub-Accelerator Architecture
- Each sub-accelerator is a conventional DNN accelerator composed of an array of Processing Elements (PEs).
- Each PE has a MAC unit to compute partial sums and a local scratchpad (SL) to store weights, activations, and partial sums.
- Each sub-accelerator also houses a shared global scratchpad (SG) to prefetch activations and weights from HBM/DRAM for the next tile of computation that will be mapped over the PEs and SLs.
Dataflow
- Given a DNN layer, each sub-accelerator employs a dataflow (a.k.a. local mapping).
- The dataflow determines the loop order, parallelism dimensions, and tile sizes for running the layer on the sub-accelerator (see the illustrative descriptor below).
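A dataflow can be summarized by exactly these three components; below is a minimal illustrative descriptor in Python (field names and values are ours, not MAESTRO's actual directive syntax):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Dataflow:
    """Illustrative dataflow descriptor (not MAESTRO's directive format)."""
    loop_order: List[str]       # loop nest order, outermost to innermost
    parallel_dims: List[str]    # dimensions spatially unrolled over the PE array
    tile_sizes: Dict[str, int]  # tile size per dimension kept resident in SL/SG

# An NVDLA-like style that parallelizes output and input channels (values are made up)
nvdla_like = Dataflow(
    loop_order=["K", "C", "Y", "X", "R", "S"],
    parallel_dims=["K", "C"],
    tile_sizes={"K": 16, "C": 16, "Y": 4, "X": 4, "R": 3, "S": 3},
)
print(nvdla_like)
```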
Jobs
- We focus on systems targeting multi-tenancy and batched-job tasks
- Batched jobs are usually not latency-sensitive and target high throughput (compared to single jobs).
- Multi-tenant batched-job tasks have far fewer layer-dependency issues than single-tenant single-job tasks:
- Multi-tenancy (running multiple neural networks together) largely alleviates the layer-dependency problem, since layers from different networks can be scheduled freely without any dependency issue.
- In batched-job tasks, the activations are broken into mini-batches because of memory constraints; these mini-batches are independent of each other.
Groups
- At the host, we run a lightweight control program to divide a pool of queued jobs into multiple dependency-free groups.
- The group size should be greater than or equal to the number of sub-accelerators, so that no sub-accelerator sits completely idle (a grouping sketch follows).
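A minimal sketch of how such a control program could form groups (our own simplification; the paper only states that groups are dependency-free and should be at least as large as the number of sub-accelerators):

```python
from typing import Dict, List, Set

def form_groups(deps: Dict[str, Set[str]], num_sub_accels: int) -> List[List[str]]:
    """Peel off successive dependency-free 'levels' of the job queue.
    deps maps each job ID to the set of job IDs it depends on."""
    done: Set[str] = set()
    remaining = dict(deps)
    groups: List[List[str]] = []
    while remaining:
        ready = [j for j, d in remaining.items() if d <= done]   # all predecessors done
        if not ready:
            raise ValueError("dependency cycle in the job queue")
        if len(ready) < num_sub_accels:
            print(f"warning: group of {len(ready)} jobs may leave some of the "
                  f"{num_sub_accels} sub-accelerators idle")
        groups.append(ready)
        done.update(ready)
        for j in ready:
            del remaining[j]
    return groups

# Two independent two-layer networks (A, B) on 2 sub-accelerators:
# layers from different networks land in the same group with no dependency issue.
jobs = {"A1": set(), "A2": {"A1"}, "B1": set(), "B2": {"B1"}}
print(form_groups(jobs, num_sub_accels=2))   # [['A1', 'B1'], ['A2', 'B2']]
```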
Mapping
- Mapping is the strategy that orchestrates job assignment and job prioritization across sub-accelerators over space and time.
- Global mapping: global control of data movement across the host and the sub-accelerators.
Mapper
- Manually designed mapping
- + highly tuned for specific accelerators/systems or tasks/workloads
- - whenever the underlying accelerator or target task changes, the mapper needs to be re-designed, or at least undergo another lengthy tuning cycle.
- Mapping optimization method (this work)
- We simply re-run the optimizer's search whenever the underlying accelerator evolves or the target task changes.
IV. Optimization Framework (M3E)
- Evaluation phase: the candidate mapping is evaluated by the cost model.
- Optimization phase: the optimization algorithm tweaks the mapping based on feedback from the evaluation phase. This optimization-evaluation loop runs iteratively until the target objective converges or a fixed number of epochs has elapsed.
- The mapping consists of two key components
- Sub-accelerator selection: the assignment of each job to a specific sub-accelerator.
- Job prioritization: the execution order of jobs on a given sub-accelerator.
Encoding (how the search space is described)
- The Sub-accel Selection section decides which job goes to which sub-accelerator.
- Each value gives the sub-accel ID for the corresponding job.
- The Job Prioritizing section decides the order of the jobs within each sub-accelerator.
- Each value gives the priority of the corresponding job.
- Priorities range from 0 (highest) to 1 (lowest); see the sketch below.
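A minimal sketch of this two-section encoding and its decoding into a per-sub-accelerator schedule (the genome layout follows the description above; the helper names are ours):

```python
import random
from typing import Dict, List, Tuple

def random_individual(num_jobs: int, num_accels: int) -> Tuple[List[int], List[float]]:
    """One encoded mapping: a sub-accel-selection genome and a job-prioritizing genome,
    with one gene per job."""
    accel_sel = [random.randrange(num_accels) for _ in range(num_jobs)]
    priority  = [random.random() for _ in range(num_jobs)]   # 0.0 = highest priority
    return accel_sel, priority

def decode(accel_sel: List[int], priority: List[float]) -> Dict[int, List[int]]:
    """Decode the genomes into a schedule: for each sub-accelerator, the list of
    assigned job IDs ordered by priority (lowest value runs first)."""
    schedule: Dict[int, List[int]] = {}
    for job, accel in enumerate(accel_sel):
        schedule.setdefault(accel, []).append(job)
    for job_list in schedule.values():
        job_list.sort(key=lambda j: priority[j])
    return schedule

sel, prio = random_individual(num_jobs=6, num_accels=2)
print(decode(sel, prio))    # e.g., {0: [4, 1, 5], 1: [2, 0, 3]}
```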
Optimization algorithm (how the search space is explored)
Building Blocks of M3E
- BW Allocator: In a multi-core accelerator, the system BW to the accelerator becomes a globally shared resource among cores.
- Evenly allocating the same amount of BW to all sub-accelerators increases the likelihood of compute under-utilization.
- The BW allocator reallocates BW based on the memory-BW intensity of the jobs running on the different sub-accelerators (see the sketch below).
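A minimal sketch of demand-proportional allocation (our own heuristic to illustrate the idea; the paper's allocator works from the Job Analysis Table described below):

```python
from typing import List

def allocate_bw(no_stall_bw: List[float], system_bw: float) -> List[float]:
    """Split the shared system BW across sub-accelerators in proportion to the
    no-stall BW demanded by the job each one is currently running, rather than
    giving every core the same even share."""
    total_demand = sum(no_stall_bw)
    if total_demand <= system_bw:
        return list(no_stall_bw)          # every core can stay compute-bound
    scale = system_bw / total_demand
    return [demand * scale for demand in no_stall_bw]

# Two BW-hungry jobs and one light job sharing a 100 GB/s system bus (illustrative units)
print(allocate_bw([80.0, 60.0, 10.0], system_bw=100.0))   # [53.33..., 40.0, 6.66...]
```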
- Job Analyzer: The Job Analyzer takes the job descriptions as input and estimates the no-stall latency and required BW of each job on each sub-accelerator using a cost model, generating a Job Analysis Table (Fig. 3).
- HW cost model for sub-accelerators: we leverage MAESTRO as the underlying cost model for each sub-accelerator.
- Given a DNN layer, a HW resource configuration (number of PEs, SL size, SG size, NoC latency, and BW), and a mapping/dataflow strategy, MAESTRO estimates statistics such as latency, energy, runtime, power, and area.
- Job Analysis Table: the Job Analyzer profiles each job in the task with the cost model and stores the profiling results in the Job Analysis Table (a sketch follows this list).
- No-stall Latency: the latency for running each job on each sub-accelerator, assuming sufficient memory bandwidth.
- No-stall Bandwidth: the minimum bandwidth each sub-accelerator needs to stay compute-bound rather than memory-bound.
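A minimal sketch of how the Job Analysis Table could be built from a per-(job, sub-accelerator) cost model (the interface is assumed; the paper uses MAESTRO in this role):

```python
from typing import Callable, Dict, List, Tuple

def build_job_analysis_table(
    jobs: List[str],
    sub_accels: List[str],
    cost_model: Callable[[str, str], Tuple[float, float]],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """For every (job, sub-accelerator) pair, record the no-stall latency and derive
    the no-stall bandwidth as data_traffic / no_stall_latency, i.e. the minimum BW
    that keeps the core compute-bound rather than memory-bound."""
    table: Dict[Tuple[str, str], Dict[str, float]] = {}
    for job in jobs:
        for accel in sub_accels:
            latency, traffic = cost_model(job, accel)   # assumed units: cycles, bytes
            table[(job, accel)] = {
                "no_stall_latency": latency,
                "no_stall_bw": traffic / latency,
            }
    return table

# Toy cost model: every job moves 1 MB and runs 4x faster on the "HB" sub-accelerator.
toy_cost = lambda job, accel: (1e4 if accel == "HB" else 4e4, 1e6)
print(build_job_analysis_table(["conv1", "fc1"], ["HB", "LB"], toy_cost))
```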
M3E Workflow
- Set-up: the user sets up the optimizer by feeding in the job descriptions, the configuration (number of PEs, dataflow) of each sub-accelerator, the system constraint (system BW), and the objective (e.g., throughput).
- Pre-process: the Job Analyzer prepares the Job Analysis Table.
- Optimization Loop (a minimal end-to-end sketch follows this list)
- Optimization phase: the optimization algorithm updates the encoded mapping based on feedback from the evaluation block.
- Evaluation phase:
- The decoder decodes the encoded mapping into a mapping description.
- The BW allocator takes in the mapping description and allocates BW to each sub-accelerator.
- The cost model then evaluates the resulting schedule to produce the feedback for the next optimization phase.
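An end-to-end sketch of this loop with a trivial random-search stand-in for the optimizer and a unit-cost stand-in for the cost model (both are placeholders; MAGMA and MAESTRO fill these roles in the paper):

```python
import random
from typing import Dict, List, Tuple

def m3e_search(num_jobs: int, num_accels: int, num_epochs: int = 50):
    """The optimization phase proposes an encoded mapping; the evaluation phase
    decodes it, scores it, and feeds the score back. Higher score is better
    (here: negative makespan)."""

    def propose() -> Tuple[List[int], List[float]]:        # optimization phase (random search)
        sel = [random.randrange(num_accels) for _ in range(num_jobs)]
        prio = [random.random() for _ in range(num_jobs)]
        return sel, prio

    def evaluate(sel: List[int], prio: List[float]) -> float:   # evaluation phase
        per_accel: Dict[int, List[int]] = {}                # decode: jobs per sub-accelerator
        for job, accel in enumerate(sel):
            per_accel.setdefault(accel, []).append(job)
        # stand-in cost model: every job takes one time unit, so priorities do not
        # change the makespan; a real cost model would use the Job Analysis Table.
        makespan = max(len(jobs) for jobs in per_accel.values())
        return -makespan

    best_score, best_mapping = float("-inf"), None
    for _ in range(num_epochs):
        candidate = propose()
        score = evaluate(*candidate)
        if score > best_score:
            best_score, best_mapping = score, candidate
    return best_mapping, best_score

print(m3e_search(num_jobs=8, num_accels=3))
```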
V. Optimization Algorithm (MAGMA)
MAGMA is a GA-based search technique.
Why GA?
- GA reaches performance competitive with deep reinforcement learning.
- GA is light and fast.
MAGMA Algorithm Details
- Terminology and Basics of GA:
- The basic mechanism in GAs is to create a population of individuals in each generation.
- All individuals are evaluated and sorted based on their fitness.
- The best-performing individuals are used as parents to create a new population of individuals using genetic operators (a minimal generation step is sketched below).
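A minimal, generic GA generation step (textbook form, not MAGMA's exact operators; hyper-parameter values are illustrative):

```python
import random
from typing import Callable, List

def ga_step(population: List[list], fitness: Callable[[list], float],
            elite_ratio: float = 0.2, mutation_rate: float = 0.1) -> List[list]:
    """Rank the population by fitness, keep the best individuals as parents (elitism),
    and refill the population with mutated single-point crossovers of parent pairs."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:max(2, int(len(population) * elite_ratio))]
    next_pop = parents[:]                                   # elites survive unchanged
    while len(next_pop) < len(population):
        p1, p2 = random.sample(parents, 2)
        pivot = random.randrange(1, len(p1))                # single-point crossover
        child = p1[:pivot] + p2[pivot:]
        for i in range(len(child)):                         # per-gene mutation
            if random.random() < mutation_rate:
                child[i] = random.random()
        next_pop.append(child)
    return next_pop

# Toy usage: 20 individuals with 6 float genes each, minimizing the gene sum.
pop = [[random.random() for _ in range(6)] for _ in range(20)]
pop = ga_step(pop, fitness=lambda ind: -sum(ind))
```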
- Genetic operators:
- Mutation:
- randomly select multiple genes and mutate them to random values.
- Crossover: three specialized crossover genetic operators
- Crossover-gen:
- First, we randomly sample a type of genome (sub-accel selection or job prioritizing) to cross over.
- Next, we randomly sample a pivot point and exchange that genome's genes between the two parents.
- Crossover-rg:
- This is a range crossover mechanism structured to preserve the dependency of genes across genomes.
- We randomly pick a range of gene positions and simultaneously cross over all genes falling in that range from both genomes, so the cross-genome dependency is preserved.
- Crossover-accel:
- This is a crossover method that preserves the dependency of the job ordering within a sub-accelerator.
- We randomly select a sub-accelerator and pass its job-ordering information to the children (all four operators are sketched below).
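Hedged sketches of the four operators on the two-genome encoding (the operator semantics follow the descriptions above; rates and other details are our own choices):

```python
import random
from typing import List, Tuple

Individual = Tuple[List[int], List[float]]   # (sub-accel selection genome, priority genome)

def mutate(ind: Individual, num_accels: int, rate: float = 0.1) -> Individual:
    """Randomly re-draw a fraction of genes in both genomes."""
    sel, prio = list(ind[0]), list(ind[1])
    for i in range(len(sel)):
        if random.random() < rate:
            sel[i] = random.randrange(num_accels)
        if random.random() < rate:
            prio[i] = random.random()
    return sel, prio

def crossover_gen(a: Individual, b: Individual) -> Individual:
    """Crossover-gen: pick one genome type, pick a pivot, swap the tail between parents."""
    sel, prio = list(a[0]), list(a[1])
    pivot = random.randrange(1, len(sel))
    if random.random() < 0.5:
        sel = a[0][:pivot] + b[0][pivot:]
    else:
        prio = a[1][:pivot] + b[1][pivot:]
    return sel, prio

def crossover_rg(a: Individual, b: Individual) -> Individual:
    """Crossover-rg: pick a range of job positions and take BOTH genomes' genes in that
    range from the other parent, preserving the cross-genome dependency."""
    sel, prio = list(a[0]), list(a[1])
    lo = random.randrange(len(sel))
    hi = random.randrange(lo + 1, len(sel) + 1)
    sel[lo:hi], prio[lo:hi] = b[0][lo:hi], b[1][lo:hi]
    return sel, prio

def crossover_accel(a: Individual, b: Individual) -> Individual:
    """Crossover-accel: pick one sub-accelerator in parent b and copy its jobs'
    assignment and relative ordering (priorities) into the child."""
    sel, prio = list(a[0]), list(a[1])
    accel = random.choice(b[0])
    for job, assigned in enumerate(b[0]):
        if assigned == accel:
            sel[job], prio[job] = b[0][job], b[1][job]
    return sel, prio
```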
- Hyper-parameter Tuning:
- The mutation rate, crossover rate, population size, and elite ratio are hyper-parameters in MAGMA.
- We apply a hyper-parameter search via a Bayesian optimization framework to select the set of hyper-parameters that gives MAGMA the highest performance across multiple workloads (a sketch follows).
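A minimal sketch of such a search using Optuna's TPE-based optimizer (the notes only say "a Bayesian optimization framework", so the library choice and the synthetic surrogate objective below are ours):

```python
import optuna

def run_magma(mutation_rate: float, crossover_rate: float,
              population_size: int, elite_ratio: float) -> float:
    """Placeholder for 'run the GA with these hyper-parameters across the workloads and
    return the achieved performance'; a synthetic surrogate keeps the sketch runnable."""
    return (-(mutation_rate - 0.1) ** 2 - (elite_ratio - 0.2) ** 2
            - (crossover_rate - 0.6) ** 2 + population_size * 1e-4)

def objective(trial: optuna.Trial) -> float:
    return run_magma(
        mutation_rate=trial.suggest_float("mutation_rate", 0.01, 0.5),
        crossover_rate=trial.suggest_float("crossover_rate", 0.1, 0.9),
        population_size=trial.suggest_int("population_size", 20, 200),
        elite_ratio=trial.suggest_float("elite_ratio", 0.05, 0.5),
    )

study = optuna.create_study(direction="maximize")   # maximize GA performance
study.optimize(objective, n_trials=50)
print(study.best_params)
```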
VI. Evaluation
Methodology
- Task and Benchmark: four different types of tasks
- Vision
- Language
- Recommendation
- Mix
- Accelerators: two classes of accelerator

- HB: high bandwidth (inspired by NVDLA)
- LB: low bandwidth (inspired by Eyeriss)
Mapper Settings
- Baseline Manual-tuned Mapper
- Optimization methods
- MAGMA
Homogeneous Accelerators

Even though AI-MT and Herald-like are designed for vision and language tasks, they also work well when applied to recommendation and mix tasks.
Heterogeneous Accelerators
- Small Accel & Large Accel
- (a)(b)(S2) Herald-like performs well because it is designed for heterogeneous accelerators, while AI-MT is designed for homogeneous accelerators.
- (c)(d)(S4) In the large accelerator, the mapping task becomes more complex since the mapping design space grows.
- BW-limited Environment

For both small and large accelerators, as the BW decreases, MAGMA stands out more clearly by reaching better relative performance.
- Homogeneous vs. Heterogeneous
- LB-style sub-accelerators usually have longer runtime but lower BW requirements than HB-style ones. S4 (Large Heterogeneous) incurs more no-stall latency but requires less BW than S3 (Large Homogeneous).
- When BW is limited, the heterogeneous setting lets the accelerator exploit the differing BW requirements of its sub-accelerators to relax BW contention.
- Big (S4) vs. BigLittle (S5)
- BigLittle has a smaller BW requirement because of its smaller sub-accelerator sizes.
Flexible Accelerator
- Accelerator Configuration
- The number of PEs in each sub-accelerator is fixed; the shape of the 2D PE array is flexible.
- Dataflow strategy
- We pick the dataflow strategy of each sub-accelerator to maximize the utilization of the PE array.
- To maximize utilization, we align the PE array dimensions to be factors of the tile's parallelized dimensions as much as possible (see the sketch after this section).
- Evaluation

- For both Vision and Mix tasks, the flexible configuration outperforms the fixed one in average per-job no-stall latency, owing to its ability to maximize the utilization of the PE array.
- However, it also incurs a higher BW requirement.
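A minimal sketch of the flexible-shape selection (our own utilization heuristic; the paper only states that the PE count is fixed and the 2D array shape is chosen to align with the tile's parallelized dimensions):

```python
from typing import Tuple

def best_array_shape(num_pes: int, parallel_dims: Tuple[int, int]) -> Tuple[int, int]:
    """Among all 2D factorizations of the fixed PE count, pick the shape that maximizes
    PE utilization when the two parallelized tile dimensions are unrolled over the array."""
    def utilization(shape: Tuple[int, int]) -> float:
        (px, py), (dx, dy) = shape, parallel_dims
        # With ceil(d/p) temporal folds per axis, useful work is d out of p*ceil(d/p) slots.
        used_x = dx / (px * -(-dx // px))
        used_y = dy / (py * -(-dy // py))
        return used_x * used_y

    shapes = [(w, num_pes // w) for w in range(1, num_pes + 1) if num_pes % w == 0]
    return max(shapes, key=utilization)

# 256 PEs, tile parallelizes a 48 x 64 pair of dimensions (illustrative numbers)
print(best_array_shape(256, (48, 64)))   # -> (4, 64): both array dims divide the tile dims
```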