# MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores
###### tags: `Accelerators`
###### paper origin: HPCA-2022
###### paper: [link](https://arxiv.org/abs/2104.13997)
## I. Introduction
### Motivation
* With the trend toward building large accelerators (e.g., MCM-based SIMBA, wafer-scale Cerebras, scaled-out platforms, heterogeneous multi-accelerators), enabling multi-tenancy is a natural use case.
* Mapping jobs from several DNNs simultaneously onto such an accelerator is a challenging problem.
### Problem

* Prior approaches to this problem rely on manually-designed heuristics.
* This limits their scalability to diverse accelerator back-ends and emerging workloads.
### Solution
* We propose an optimization framework called **Multi-workload Multi-accelerator Mapping Explorer (M3E)**
* An efficient encoding scheme to encode the search space of the mapping
* Several modules to enable the found mapping to effectively orchestrate the data movement across sub-accelerator cores.
* Enabling several commonly-used black-box optimization algorithms and two reinforcement learning methods to be leveraged as the underlying optimization methods.
## II. Background
### Accelerator Architecture

### Sub-Accelerator Architecture
* Each sub-accelerator is a conventional DNN accelerator comprising an array of Processing Elements (PEs).
* Each PE has a MAC unit to compute partial sums and a local scratchpad (SL) to store weights, activations, and partial sums.
* Each sub-accelerator also houses a shared global scratchpad (SG) to prefetch activations and weights from HBM/DRAM for the next tile of computation that will be mapped over the PEs and SLs.
### Dataflow
* Given a DNN layer, each sub-accelerator employs a dataflow (aka *local-mapping*).
* The dataflow determines the **loop order, parallelism dimensions, and tile sizes** for running the layer on the sub-accelerator.
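
To make this concrete, below is a minimal sketch of a dataflow descriptor capturing these three choices; the field names and tile values are illustrative assumptions, not MAESTRO's actual directive syntax.

```python
from dataclasses import dataclass

# Hypothetical dataflow (local-mapping) descriptor; the fields mirror the
# three choices above, but the names are illustrative, not MAESTRO syntax.
@dataclass
class Dataflow:
    loop_order: list[str]       # loop-nest order, outermost first
    parallel_dims: list[str]    # dimensions spatially mapped across the PE array
    tile_sizes: dict[str, int]  # dimension -> tile size held in SL/SG

# A weight-stationary-style dataflow for a conv layer (made-up tile sizes):
df = Dataflow(
    loop_order=["K", "C", "Y", "X", "R", "S"],
    parallel_dims=["K", "C"],   # parallelize output/input channels over PEs
    tile_sizes={"K": 16, "C": 16, "Y": 7, "X": 7, "R": 3, "S": 3},
)
```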
## III. Problem Formulation
### Jobs
* We focus on systems targeting **multi-tenancy** and **batched-job tasks**
    * Batched jobs are usually **not latency-sensitive** and target **high throughput** (compared to single jobs).
    * Multi-tenant batched-job tasks have far **fewer layer-dependency issues** than single-tenant single-job tasks.
        * Multi-tenancy, i.e., running multiple neural networks together, largely alleviates the layer-dependency problem, since layers from different neural networks can be freely scheduled without any dependency issue.
    * In batched-job tasks, the activations are broken down into mini-batches because of memory concerns. These mini-batches are independent of each other.
### Groups
* At the host, we run a lightweight control program to divide a pool of queued jobs into multiple **dependency-free** groups.
* The group size should be greater than or equal to the number of sub-accelerators, to avoid leaving some sub-accelerators completely idle.
### Mapping
* Mapping is the strategy for orchestrating job assignment and job prioritization across sub-accelerators over space and time.
* *Global mapping*: the global control of data movement across the host and sub-accelerators.
### Mapper
* Manually-designed mapping
    * \+ highly tuned for specific accelerators/systems or tasks/workloads
    * \- whenever the underlying accelerators or the target tasks change, the mapper needs to be re-designed, or at least undergo another lengthy cycle of tuning.
* **Mapping optimization method**(this work)
    * we simply need to re-run the search of the optimizer whenever the underlying accelerator evolves or the target task changes.
## IV. Optimization Framework(M3E)

* **Evaluation phase**: the candidate mapping is evaluated by the cost model.
* **Optimization phase**: the optimization algorithm tweaks the mapping based on the feedback from the evaluation phase. This optimization-evaluation loop runs iteratively until the target objective converges or a fixed number of epochs has passed.
* The mapping consists of two key components:
    * **Sub-accelerator selection**: the assignment of each job to a specific sub-accelerator.
    * **Job prioritization**: the execution order of jobs on a given sub-accelerator.
### Encoding *(how the search space is described)*

* The **Sub-accel Selection Sec.** decides which job goes to which sub-accelerator.
    * Each value describes the sub-accel ID for the corresponding job.
* The **Job Prioritizing Sec.** decides the order of the jobs on each sub-accelerator.
    * Each value describes the priority of the corresponding job.
    * Values range from 0 (high priority) to 1 (low priority).
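
A minimal sketch of this two-section encoding and its decoder follows; the concrete layout (the two sections concatenated into one flat list, with priorities as floats in [0, 1]) is our assumption for illustration.

```python
import random

def random_genome(num_jobs, num_accels):
    """One candidate mapping: a sub-accel ID plus a priority value per job.
    Layout (assumed for illustration): selection genes first, then priority
    genes, concatenated into one flat genome."""
    selection = [random.randrange(num_accels) for _ in range(num_jobs)]
    priority = [random.random() for _ in range(num_jobs)]  # 0 = highest
    return selection + priority

def decode(genome, num_jobs):
    """Turn the flat genome into an ordered job queue per sub-accelerator."""
    selection, priority = genome[:num_jobs], genome[num_jobs:]
    queues = {}
    for job, accel in enumerate(selection):
        queues.setdefault(accel, []).append(job)
    for accel in queues:                  # lower priority value runs earlier
        queues[accel].sort(key=lambda j: priority[j])
    return queues

# e.g., 6 jobs over 2 sub-accelerators
genome = random_genome(6, 2)
print(decode(genome, 6))   # {accel_id: [jobs in execution order], ...}
```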
### Optimization algorithm *(how the search space is explored)*

### Building Blocks of M3E
1. **BW Allocator**: In a multi-core accelerator, the system BW to the accelerator becomes a globally shared resource between cores.
    * Evenly allocating the same amount of BW to all sub-accelerators increases the possibility of compute-resource under-utilization.
    * The BW allocator instead reallocates the BW based on the memory-BW intensity of the different jobs running on the different sub-accelerators (see the allocation sketch after this list).

2. **Job Analyzer**: The job analyzer takes the job descriptions as input and estimates the no-stall latency and the required BW for each sub-accelerator using a cost model, to generate a job analysis table. (Fig. 3)
3. **HW cost module for Sub-Accelerators**: we leverage MAESTRO as our underlying cost model for each sub-accelerator.
    * Given a DNN layer, a HW resource configuration (PEs, SL size, SG size, NoC latency, and BW), and a mapping/dataflow strategy, MAESTRO estimates statistics such as latency, energy, runtime, power, and area.
4. **Job Analysis Table**: Job Analyzer profiles each job in the task by the cost model and stores the profiling results in the Job Analysis Table.
    * No-stall latency: the latency for running each job on each sub-accelerator, assuming it has sufficient memory bandwidth.
    * No-stall bandwidth: the minimum bandwidth each sub-accelerator requires to stay compute-bound rather than memory-bound.
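
Below is a sketch of how the Job Analysis Table could drive the BW allocator. The proportional-scaling policy is our assumption for illustration; the paper only states that BW is reallocated according to memory-BW intensity.

```python
def allocate_bw(system_bw, demand):
    """Split the shared system BW across sub-accelerators in proportion to
    the no-stall BW of the job each one is running (assumed policy).

    demand: {accel_id: no-stall BW of its current job}
    """
    total = sum(demand.values())
    if total <= system_bw:            # every core can stay compute-bound
        return dict(demand)
    scale = system_bw / total         # otherwise scale demands down uniformly
    return {a: bw * scale for a, bw in demand.items()}

# Job Analysis Table entries for one job on two sub-accelerators
# (made-up numbers): no-stall latency (cycles) and no-stall BW (GB/s).
table = {
    ("job0", "accelA"): {"latency": 1.0e6, "bw": 12.0},
    ("job0", "accelB"): {"latency": 2.5e6, "bw": 4.0},
}
print(allocate_bw(system_bw=10.0, demand={"accelA": 12.0, "accelB": 4.0}))
# -> {'accelA': 7.5, 'accelB': 2.5}; both fall below their no-stall demand,
#    so both cores become memory-bound and stall proportionally.
```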
### M3E Workflow
1. **Set-up**: the user sets up the optimizer by feeding in the job descriptions, the configuration (PE count, dataflow) of each sub-accelerator, the system constraint (system BW), and the objective (e.g., throughput).
2. **Pre-process**: **Job analyzer** prepares the Job Analysis Table.
3. **Optimization Loop**
    * Optimization phase: the optimization algorithm updates the encoded mapping based on the feedback from the evaluation block.
* Evaluation phase:
* **Decoder** decodes encoded mapping into a mapping description.
        * **BW Allocator** takes in the mapping description and allocates the BW for each sub-accelerator (a sketch of the full evaluation step follows below).
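
A sketch of how the evaluation phase could score a decoded mapping, reusing the queues, table, and BW allocation from the sketches above. The memory-bound slowdown model and the throughput objective (inverse makespan) are illustrative assumptions; the paper delegates per-job costs to MAESTRO.

```python
def evaluate(queues, table, bw_alloc):
    """Score one decoded mapping (illustrative stand-in for the cost model).

    queues:   {accel: [job, ...]} from the decoder, in execution order
    table:    Job Analysis Table, {(job, accel): {"latency": ..., "bw": ...}}
    bw_alloc: {accel: allocated BW} from the BW allocator
    """
    finish = {}
    for accel, jobs in queues.items():
        t = 0.0
        for job in jobs:
            entry = table[(job, accel)]
            # Below the no-stall BW the job turns memory-bound; assume its
            # latency stretches in proportion to the BW shortfall.
            slowdown = max(1.0, entry["bw"] / bw_alloc[accel])
            t += entry["latency"] * slowdown
        finish[accel] = t
    makespan = max(finish.values())  # the group finishes with its slowest core
    return 1.0 / makespan            # higher fitness = higher throughput
```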
## V. Optimization Algorithm (MAGMA)
MAGMA is a GA-based search technique.
### Why GA?
* GA reaches competitive performance with deep reinforcement learning.
* GA is light and fast.
### MAGMA Algorithm Details
1. **Terminology and Basics of GA:**

    * The basic mechanism in GAs is to create a population of individuals in each generation.
* All individuals are evaluated and sorted based on their fitness.
* The best performing individuals are used as parents to create a new population of individuals using genetic operators.
2. **Genetic operators**:
* Mutation:

* randomly select multiple genes and mutate them to random values.
    * Crossover: three specialized crossover genetic operators (see the operator sketch after this list)
* **Crossover-gen**:

            * First, we randomly sample one of the two genome types (sub-accel selection or job prioritizing) to crossover.
            * Next, we randomly sample a pivot point and exchange the genes of the two parent genomes around it.
* **Crossover-rg**:

            * This is a range-crossover mechanism structured to preserve the dependency of genes across genomes.
            * We randomly pick a range of genes and simultaneously crossover all the genes falling into the picked region in both genomes, thus preserving the cross-genome dependency.
* **Crossover-accel**:

            * This is a crossover method that preserves the dependency of job ordering within a sub-accelerator.
* We randomly select a sub-accelerator and pass the job ordering information of this sub-accelerator to the children.

3. **Hyper-parameter Tuning**:
    * Mutation rate, crossover rate, population size, and elite ratio are the hyper-parameters in MAGMA.
* We applied a hyper-parameter search via a Bayesian optimization framework to select a set of hyper-parameters that makes MAGMA achieve the highest performance across multiple workloads.
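
The sketch below (referenced from the operator list above) implements the mutation and the three crossovers on the flat two-section genome from the encoding sketch: selection genes occupy indices `[0, N)` and priority genes `[N, 2N)`. The mutation rate and sampling details are illustrative assumptions.

```python
import random

def mutate(genome, num_jobs, num_accels, rate=0.05):
    """Re-draw a few randomly chosen genes (rate is illustrative)."""
    child = list(genome)
    for i in range(len(child)):
        if random.random() < rate:
            child[i] = (random.randrange(num_accels) if i < num_jobs
                        else random.random())
    return child

def crossover_gen(p1, p2, num_jobs):
    """Crossover-gen: pick one genome type (selection or priority), then
    exchange its genes past a random pivot point."""
    base = random.choice([0, num_jobs])     # start of the chosen section
    pivot = base + random.randrange(num_jobs)
    end = base + num_jobs
    return p1[:pivot] + p2[pivot:end] + p1[end:]

def crossover_rg(p1, p2, num_jobs):
    """Crossover-rg: swap a range of jobs across BOTH sections at once, so
    each job keeps its (sub-accel, priority) pairing intact."""
    lo, hi = sorted(random.sample(range(num_jobs + 1), 2))
    child = list(p1)
    for j in range(lo, hi):
        child[j] = p2[j]                        # job j's sub-accel gene
        child[num_jobs + j] = p2[num_jobs + j]  # job j's priority gene
    return child

def crossover_accel(p1, p2, num_jobs):
    """Crossover-accel: copy one sub-accelerator's job assignment and
    ordering wholesale from the second parent."""
    accel = random.choice(p2[:num_jobs])    # a sub-accel used in parent 2
    child = list(p1)
    for j in range(num_jobs):
        if p2[j] == accel:
            child[j] = accel
            child[num_jobs + j] = p2[num_jobs + j]
    return child
```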
## VI. Evaluation
### Methodology
1. Tasks and Benchmarks: four different types of tasks
1. Vision
2. Language
3. Recommendation
4. Mix
2. Accelerators: two classes of accelerators

    * HB: high-bandwidth (inspired by NVDLA)
    * LB: low-bandwidth (inspired by Eyeriss)
### Mapper Settings
* Baseline: manually-tuned mappers
* Optimization methods
* MAGMA
### Homogeneous Accelerators

Even though AI-MT and Herald-like are designed for vision and language tasks, they also work well when applied to recommendation and mix tasks.
### Heterogeneous Accelerators
* Small Accel & Large Accel

    * (a)(b)(S2) Herald-like performs well because it is designed for heterogeneous accelerators, while AI-MT is designed for homogeneous accelerators.
    * \(c\)(d)(S4) In the large accelerator, the mapping task becomes more complex since the design space of the mapping grows.
* BW-limited Environment

For both the Small and Large accelerators, as BW decreases, MAGMA stands out more clearly, reaching better relative performance.
* Homogeneous vs. Heterogeneous

    * The LB-style sub-accelerators usually have longer runtimes but lower BW requirements than HB-style ones. S4 (Large Heterogeneous) incurs more no-stall latency but requires less BW than S3 (Large Homogeneous).
    * When BW is limited, the heterogeneous setting enables the accelerator to leverage the difference in BW requirements among sub-accelerators to relax the BW contention.
* Big (S4) vs. BigLittle (S5)
    * BigLittle has a smaller BW requirement because of its smaller sub-accelerator sizes.
### Flexible Accelerator
* Accelerator Configuration
    * The number of PEs in each sub-accelerator is fixed; the shape of the 2D PE array is flexible.
* Dataflow strategy
    * We pick the dataflow strategy of each sub-accelerator to maximize the utilization of its PE array.
    * To maximize utilization, we align the PE array dimensions to be factors of the parallelized tile dimensions as much as possible (see the shape-selection sketch at the end of this section).
* Evaluation
    * We can observe that for both Vision and Mix tasks, flexible outperforms fixed in average per-job no-stall latency, owing to its ability to maximize the utilization of the PE array.
    * However, it also incurs a higher BW requirement.
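
The shape-selection sketch referenced above: among factor pairs of a fixed PE budget, pick the 2D array shape whose dimensions best divide the tile's parallelized dimensions. The folding-based utilization model and all numbers are illustrative assumptions.

```python
import math

def utilization(rows, cols, dim_y, dim_x):
    """Average PE occupancy when tile dims (dim_y, dim_x) are folded onto a
    rows x cols array: the last fold runs partially empty unless the array
    dimension divides the tile dimension (assumed utilization model)."""
    occ_y = dim_y / (math.ceil(dim_y / rows) * rows)
    occ_x = dim_x / (math.ceil(dim_x / cols) * cols)
    return occ_y * occ_x

def best_shape(num_pes, dim_y, dim_x):
    """Among factor pairs of a fixed PE count, pick the array shape that
    maximizes utilization for this tile."""
    shapes = [(r, num_pes // r) for r in range(1, num_pes + 1)
              if num_pes % r == 0]
    return max(shapes, key=lambda s: utilization(s[0], s[1], dim_y, dim_x))

# 256 PEs, parallelizing tile dims 48 x 20 (made-up numbers):
print(best_shape(256, 48, 20))   # -> (64, 4), utilization 0.75
# A fixed 16x16 array would reach only 0.625 on the same tile.
```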