# MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores
###### tags: `Accelerators`
###### paper origin: HPCA-2022
###### paper: [link](https://arxiv.org/abs/2104.13997)
## I. Introduction
### Motivation
* With the trend toward building large accelerators (e.g., MCM-based SIMBA, wafer-scale Cerebras, scaled-out platforms, heterogeneous multi-accelerators), enabling multi-tenancy is a natural use case.
* Mapping jobs from several DNNs simultaneously onto such an accelerator is a challenging problem.
### Problem

* Prior approaches to this problem rely on manually-designed heuristics.
* This limits their scalability to diverse accelerator back-ends and emerging workloads.
### Solution
* We propose an optimization framework called **Multi-workload Multi-accelerator Mapping Explorer (M3E)**
* An efficient encoding scheme to encode the search space of the mapping
* Several modules to enable the found mapping to effectively orchestrate the data movement across sub-accelerator cores.
* Enabling several commonly-used black-box optimization algorithms and two reinforcement learning methods to be leveraged as the underlying optimization methods.
## II. Background
### Accelerator Architecture

### Sub-Accelerator Architecture
* Each sub-accelerator is a conventional DNN accelerator comprising an array of Processing Elements (PEs).
* Each PE has a MAC unit to compute partial sums and a local scratchpad (SL) to store weights, activations, and partial sums.
* Each sub-accelerator also houses a shared global scratchpad (SG) to prefetch activations and weights from HBM/DRAM for the next tile of computation that will be mapped over the PEs and SLs.
### Dataflow
* Given a DNN layer, each sub-accelerator employs a dataflow (aka *local-mapping*).
* The dataflow determines the **loop order, parallelism dimensions, and tile sizes** for running the layer on the sub-accelerator.
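
To make this concrete, below is a minimal sketch of a dataflow descriptor capturing these three choices; the field names and tile values are illustrative assumptions, not MAESTRO's actual directive syntax.

```python
from dataclasses import dataclass

# Hypothetical dataflow (local-mapping) descriptor; the fields mirror the
# three choices above, but the names are illustrative, not MAESTRO syntax.
@dataclass
class Dataflow:
    loop_order: list[str]       # loop-nest order, outermost first
    parallel_dims: list[str]    # dimensions spatially mapped across the PE array
    tile_sizes: dict[str, int]  # dimension -> tile size held in SL/SG

# A weight-stationary-style dataflow for a conv layer (made-up tile sizes):
df = Dataflow(
    loop_order=["K", "C", "Y", "X", "R", "S"],
    parallel_dims=["K", "C"],   # parallelize output/input channels over PEs
    tile_sizes={"K": 16, "C": 16, "Y": 7, "X": 7, "R": 3, "S": 3},
)
```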
## III. Problem Formulation
### Jobs
* We focus on systems targeting **multi-tenancy** and **batched-job tasks**
    * Batched jobs are usually **not latency-sensitive** and target **high throughput** (compared to single jobs).
    * Multi-tenant batched-job tasks have far **fewer layer-dependency issues** than single-tenant single-job tasks.
        * Multi-tenancy, i.e., running multiple neural networks together, largely alleviates the layer-dependency problem, since layers from different neural networks can be freely scheduled without any dependency issue.
    * In batched-job tasks, the activations are broken down into mini-batches because of memory concerns. These mini-batches are independent of each other.
### Groups
* At the host, we run a lightweight control program to divide a pool of queued jobs into multiple **dependency-free** groups.
* The group size should be greater than or equal to the number of sub-accelerators, to avoid leaving some sub-accelerators completely idle.
### Mapping
* Mapping is the strategy for orchestrating job assignment and job prioritization across sub-accelerators over space and time.
* *Global mapping*: the global control of data movement across the host and sub-accelerators.
### Mapper
* Manually-designed mapping
    * \+ highly tuned for specific accelerators/systems or tasks/workloads
    * \- whenever the underlying accelerators or the target tasks change, the mapper needs to be re-designed, or at least undergo another lengthy cycle of tuning.
* **Mapping optimization method**(this work)
    * we simply need to re-run the search of the optimizer whenever the underlying accelerator evolves or the target task changes.
## IV. Optimization Framework(M3E)

* **Evaluation phase**: the candidate mapping is evaluated by the cost model.
* **Optimization phase**: the optimization algorithm tweaks the mapping based on the feedback from the evaluation phase. This optimization-evaluation loop runs iteratively until the target objective converges or a fixed number of epochs has passed.
* The mapping consists of two key components:
    * **Sub-accelerator selection**: the assignment of each job to a specific sub-accelerator.
    * **Job prioritization**: the execution order of jobs on a given sub-accelerator.
### Encoding *(how the search space is described)*

* The **Sub-accel Selection Sec.** decides which job goes to which sub-accelerator.
    * Each value describes the sub-accel ID for the corresponding job.
* The **Job Prioritizing Sec.** decides the order of the jobs on each sub-accelerator.
    * Each value describes the priority of the corresponding job.
    * Values range from 0 (high priority) to 1 (low priority).
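
A minimal sketch of this two-section encoding and its decoder follows; the concrete layout (the two sections concatenated into one flat list, with priorities as floats in [0, 1]) is our assumption for illustration.

```python
import random

def random_genome(num_jobs, num_accels):
    """One candidate mapping: a sub-accel ID plus a priority value per job.
    Layout (assumed for illustration): selection genes first, then priority
    genes, concatenated into one flat genome."""
    selection = [random.randrange(num_accels) for _ in range(num_jobs)]
    priority = [random.random() for _ in range(num_jobs)]  # 0 = highest
    return selection + priority

def decode(genome, num_jobs):
    """Turn the flat genome into an ordered job queue per sub-accelerator."""
    selection, priority = genome[:num_jobs], genome[num_jobs:]
    queues = {}
    for job, accel in enumerate(selection):
        queues.setdefault(accel, []).append(job)
    for accel in queues:                  # lower priority value runs earlier
        queues[accel].sort(key=lambda j: priority[j])
    return queues

# e.g., 6 jobs over 2 sub-accelerators
genome = random_genome(6, 2)
print(decode(genome, 6))   # {accel_id: [jobs in execution order], ...}
```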
### Optimization algorithm *(how the search space is explored)*

### Building Blocks of M3E
1. **BW Allocator**: In a multi-core accelerator, the system BW to the accelerator becomes a globally shared resource between cores.
    * Evenly allocating the same amount of BW to all sub-accelerators increases the possibility of compute-resource under-utilization.
    * The BW allocator instead reallocates the BW based on the memory-BW intensity of the different jobs running on the different sub-accelerators (see the allocation sketch after this list).

2. **Job Analyzer**: The job analyzer takes the job descriptions as input and estimates the no-stall latency and the required BW for each sub-accelerator using a cost model, to generate a job analysis table. (Fig. 3)
3. **HW cost module for Sub-Accelerators**: we leverage MAESTRO as our underlying cost model for each sub-accelerator.
    * Given a DNN layer, a HW resource configuration (PEs, SL size, SG size, NoC latency, and BW), and a mapping/dataflow strategy, MAESTRO estimates statistics such as latency, energy, runtime, power, and area.
4. **Job Analysis Table**: Job Analyzer profiles each job in the task by the cost model and stores the profiling results in the Job Analysis Table.
    * No-stall latency: the latency for running each job on each sub-accelerator, assuming it has sufficient memory bandwidth.
    * No-stall bandwidth: the minimum bandwidth each sub-accelerator requires to stay compute-bound rather than memory-bound.
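
Below is a sketch of how the Job Analysis Table could drive the BW allocator. The proportional-scaling policy is our assumption for illustration; the paper only states that BW is reallocated according to memory-BW intensity.

```python
def allocate_bw(system_bw, demand):
    """Split the shared system BW across sub-accelerators in proportion to
    the no-stall BW of the job each one is running (assumed policy).

    demand: {accel_id: no-stall BW of its current job}
    """
    total = sum(demand.values())
    if total <= system_bw:            # every core can stay compute-bound
        return dict(demand)
    scale = system_bw / total         # otherwise scale demands down uniformly
    return {a: bw * scale for a, bw in demand.items()}

# Job Analysis Table entries for one job on two sub-accelerators
# (made-up numbers): no-stall latency (cycles) and no-stall BW (GB/s).
table = {
    ("job0", "accelA"): {"latency": 1.0e6, "bw": 12.0},
    ("job0", "accelB"): {"latency": 2.5e6, "bw": 4.0},
}
print(allocate_bw(system_bw=10.0, demand={"accelA": 12.0, "accelB": 4.0}))
# -> {'accelA': 7.5, 'accelB': 2.5}; both fall below their no-stall demand,
#    so both cores become memory-bound and stall proportionally.
```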
### M3E Workflow
1. **Set-up**: the user sets up the optimizer by feeding in the job descriptions, the configuration (PE count, dataflow) of each sub-accelerator, the system constraint (system BW), and the objective (e.g., throughput).
2. **Pre-process**: **Job analyzer** prepares the Job Analysis Table.
3. **Optimization Loop**
    * Optimization phase: the optimization algorithm updates the encoded mapping based on the feedback from the evaluation block.
* Evaluation phase:
* **Decoder** decodes encoded mapping into a mapping description.
        * **BW Allocator** takes in the mapping description and allocates the BW for each sub-accelerator (a sketch of the full evaluation step follows below).
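
A sketch of how the evaluation phase could score a decoded mapping, reusing the queues, table, and BW allocation from the sketches above. The memory-bound slowdown model and the throughput objective (inverse makespan) are illustrative assumptions; the paper delegates per-job costs to MAESTRO.

```python
def evaluate(queues, table, bw_alloc):
    """Score one decoded mapping (illustrative stand-in for the cost model).

    queues:   {accel: [job, ...]} from the decoder, in execution order
    table:    Job Analysis Table, {(job, accel): {"latency": ..., "bw": ...}}
    bw_alloc: {accel: allocated BW} from the BW allocator
    """
    finish = {}
    for accel, jobs in queues.items():
        t = 0.0
        for job in jobs:
            entry = table[(job, accel)]
            # Below the no-stall BW the job turns memory-bound; assume its
            # latency stretches in proportion to the BW shortfall.
            slowdown = max(1.0, entry["bw"] / bw_alloc[accel])
            t += entry["latency"] * slowdown
        finish[accel] = t
    makespan = max(finish.values())  # the group finishes with its slowest core
    return 1.0 / makespan            # higher fitness = higher throughput
```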
## V. Optimization Algorithm (MAGMA)
MAGMA is a GA-based search technique.
### Why GA?
* GA reaches competitive performance with deep reinforcement learning.
* GA is light and fast.
### MAGMA Algorithm Details
1. **Terminology and Basics of GA:**

    * The basic mechanism in GAs is to create a population of individuals in each generation.
* All individuals are evaluated and sorted based on their fitness.
* The best performing individuals are used as parents to create a new population of individuals using genetic operators.
2. **Genetic operators**:
* Mutation:

* randomly select multiple genes and mutate them to random values.
    * Crossover: three specialized crossover genetic operators (see the operator sketch after this list)
* **Crossover-gen**:

            * First, we randomly sample one of the two genome types (sub-accel selection or job prioritizing) to crossover.
            * Next, we randomly sample a pivot point and exchange the genes of the two parent genomes around it.
* **Crossover-rg**:

            * This is a range-crossover mechanism structured to preserve the dependency of genes across genomes.
            * We randomly pick a range of genes and simultaneously crossover all the genes falling into the picked region in both genomes, thus preserving the cross-genome dependency.
* **Crossover-accel**:

            * This is a crossover method that preserves the dependency of job ordering within a sub-accelerator.
* We randomly select a sub-accelerator and pass the job ordering information of this sub-accelerator to the children.

3. **Hyper-parameter Tuning**:
    * Mutation rate, crossover rate, population size, and elite ratio are the hyper-parameters in MAGMA.
* We applied a hyper-parameter search via a Bayesian optimization framework to select a set of hyper-parameters that makes MAGMA achieve the highest performance across multiple workloads.
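
The sketch below (referenced from the operator list above) implements the mutation and the three crossovers on the flat two-section genome from the encoding sketch: selection genes occupy indices `[0, N)` and priority genes `[N, 2N)`. The mutation rate and sampling details are illustrative assumptions.

```python
import random

def mutate(genome, num_jobs, num_accels, rate=0.05):
    """Re-draw a few randomly chosen genes (rate is illustrative)."""
    child = list(genome)
    for i in range(len(child)):
        if random.random() < rate:
            child[i] = (random.randrange(num_accels) if i < num_jobs
                        else random.random())
    return child

def crossover_gen(p1, p2, num_jobs):
    """Crossover-gen: pick one genome type (selection or priority), then
    exchange its genes past a random pivot point."""
    base = random.choice([0, num_jobs])     # start of the chosen section
    pivot = base + random.randrange(num_jobs)
    end = base + num_jobs
    return p1[:pivot] + p2[pivot:end] + p1[end:]

def crossover_rg(p1, p2, num_jobs):
    """Crossover-rg: swap a range of jobs across BOTH sections at once, so
    each job keeps its (sub-accel, priority) pairing intact."""
    lo, hi = sorted(random.sample(range(num_jobs + 1), 2))
    child = list(p1)
    for j in range(lo, hi):
        child[j] = p2[j]                        # job j's sub-accel gene
        child[num_jobs + j] = p2[num_jobs + j]  # job j's priority gene
    return child

def crossover_accel(p1, p2, num_jobs):
    """Crossover-accel: copy one sub-accelerator's job assignment and
    ordering wholesale from the second parent."""
    accel = random.choice(p2[:num_jobs])    # a sub-accel used in parent 2
    child = list(p1)
    for j in range(num_jobs):
        if p2[j] == accel:
            child[j] = accel
            child[num_jobs + j] = p2[num_jobs + j]
    return child
```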
## VI. Evaluation
### Methodology
1. Tasks and Benchmarks: four different types of tasks
1. Vision
2. Language
3. Recommendation
4. Mix
2. Accelerators: two classes of accelerators

    * HB: high-bandwidth (inspired by NVDLA)
    * LB: low-bandwidth (inspired by Eyeriss)
### Mapper Settings
* Baseline: manually-tuned mappers
* Optimization methods
* MAGMA
### Homogeneous Accelerators

Even though AI-MT and Herald-like are designed for vision and language tasks, they also work well when applied to recommendation and mix tasks.
### Heterogeneous Accelerators
* Small Accel & Large Accel

    * (a)(b)(S2) Herald-like performs well because it is designed for heterogeneous accelerators, while AI-MT is designed for homogeneous accelerators.
    * \(c\)(d)(S4) In the large accelerator, the mapping task becomes more complex since the design space of the mapping grows.
* BW-limited Environment

For both the Small and Large accelerators, as BW decreases, MAGMA stands out more clearly, reaching better relative performance.
* Homogeneous vs. Heterogeneous

    * The LB-style sub-accelerators usually have longer runtimes but lower BW requirements than HB-style ones. S4 (Large Heterogeneous) incurs more no-stall latency but requires less BW than S3 (Large Homogeneous).
    * When BW is limited, the heterogeneous setting enables the accelerator to leverage the difference in BW requirements among sub-accelerators to relax the BW contention.
* Big (S4) vs. BigLittle (S5)
    * BigLittle has a smaller BW requirement because of its smaller sub-accelerator sizes.
### Flexible Accelerator
* Accelerator Configuration
    * The number of PEs in each sub-accelerator is fixed; the shape of the 2D PE array is flexible.
* Dataflow strategy
    * We pick the dataflow strategy of each sub-accelerator to maximize the utilization of its PE array.
    * To maximize utilization, we align the PE array dimensions to be factors of the parallelized tile dimensions as much as possible (see the shape-selection sketch at the end of this section).
* Evaluation
    * We can observe that for both Vision and Mix tasks, flexible outperforms fixed in average per-job no-stall latency, owing to its ability to maximize the utilization of the PE array.
    * However, it also incurs a higher BW requirement.
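
The shape-selection sketch referenced above: among factor pairs of a fixed PE budget, pick the 2D array shape whose dimensions best divide the tile's parallelized dimensions. The folding-based utilization model and all numbers are illustrative assumptions.

```python
import math

def utilization(rows, cols, dim_y, dim_x):
    """Average PE occupancy when tile dims (dim_y, dim_x) are folded onto a
    rows x cols array: the last fold runs partially empty unless the array
    dimension divides the tile dimension (assumed utilization model)."""
    occ_y = dim_y / (math.ceil(dim_y / rows) * rows)
    occ_x = dim_x / (math.ceil(dim_x / cols) * cols)
    return occ_y * occ_x

def best_shape(num_pes, dim_y, dim_x):
    """Among factor pairs of a fixed PE count, pick the array shape that
    maximizes utilization for this tile."""
    shapes = [(r, num_pes // r) for r in range(1, num_pes + 1)
              if num_pes % r == 0]
    return max(shapes, key=lambda s: utilization(s[0], s[1], dim_y, dim_x))

# 256 PEs, parallelizing tile dims 48 x 20 (made-up numbers):
print(best_shape(256, 48, 20))   # -> (64, 4), utilization 0.75
# A fixed 16x16 array would reach only 0.625 on the same tile.
```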