
MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores

tags: Accelerators
paper origin: HPCA-2022

I. Introduction

Motivation

  • With the trend toward building large accelerators (e.g., MCM-based SIMBA, wafer-scale Cerebras, scaled-out platforms, heterogeneous multi-accelerators), enabling multi-tenancy is a natural use case.
  • Mapping jobs from several DNNs simultaneously onto such an accelerator is an open problem.

Problem


  • Prior multi-DNN mapping approaches rely on manually-designed heuristics.
  • This limits their scalability to diverse accelerator back-ends and emerging workloads.

Solution

  • We propose an optimization framework called Multi-workload Multi-accelerator Mapping Explorer (M3E):
    • An efficient encoding scheme for the mapping search space.
    • Several modules that enable the found mapping to effectively orchestrate data movement across sub-accelerator cores.
    • Support for several commonly used black-box optimization algorithms and two reinforcement learning methods as the underlying optimization method.

II. Background

Accelerator Architecture


Sub-Accelerator Architecture

  • Each sub-accelerator is a conventional DNN accelerator comprised of an array of Processing Elements (PEs).
  • Each PE has a MAC unit to compute partial sums and a local scratchpad (SL) to store weights, activations, and partial sums.
  • Each sub-accelerator also houses a shared global scratchpad (SG) to prefetch activations and weights from HBM/DRAM for the next tile of computation that will be mapped over the PEs and SLs (see the configuration sketch below).
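
A minimal sketch of how one sub-accelerator core could be described in code; the field names and the example numbers are illustrative assumptions, not the paper's or MAESTRO's actual interface.

```python
from dataclasses import dataclass

@dataclass
class SubAccelConfig:
    """Illustrative description of one sub-accelerator core (field names are assumptions)."""
    num_pes: int        # number of Processing Elements (each PE = MAC + local scratchpad SL)
    sl_size_b: int      # per-PE local scratchpad (SL) capacity in bytes
    sg_size_kb: int     # shared global scratchpad (SG) capacity in KB
    noc_bw_gbps: float  # NoC bandwidth feeding the PE array
    dataflow: str       # local mapping style, e.g. "NVDLA-like" (HB) or "Eyeriss-like" (LB)

# Example (made-up numbers): a platform mixing an HB-style and an LB-style core
platform = [
    SubAccelConfig(num_pes=1024, sl_size_b=512, sg_size_kb=512, noc_bw_gbps=64.0, dataflow="NVDLA-like"),
    SubAccelConfig(num_pes=256,  sl_size_b=512, sg_size_kb=128, noc_bw_gbps=16.0, dataflow="Eyeriss-like"),
]
```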

Dataflow

  • Given a DNN layer, each sub-accelerator employs a dataflow (aka local mapping).
  • The dataflow determines the loop order, parallelism dimensions, and tile sizes used to run the layer on the sub-accelerator (illustrated below).
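
For concreteness, a dataflow for one convolution layer can be thought of as a loop order plus parallel dimensions plus tile sizes. The dictionary below is a hypothetical illustration, not MAESTRO's actual dataflow directives.

```python
# Hypothetical dataflow descriptor for one conv layer (illustrative, not MAESTRO syntax).
# K = output channels, C = input channels, Y/X = output rows/cols, R/S = filter rows/cols.
dataflow = {
    "loop_order":    ["K", "C", "Y", "X", "R", "S"],   # outermost -> innermost temporal loops
    "parallel_dims": ["K", "C"],                       # dimensions unrolled spatially over the PE array
    "tile_sizes":    {"K": 16, "C": 16, "Y": 7, "X": 7, "R": 3, "S": 3},  # per-tile extents held in SL/SG
}
```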

III. Problem Formulation

Jobs

  • We focus on systems targeting multi-tenancy and batched-job tasks.
    • Batched jobs are usually not latency-sensitive and target high throughput (compared to a single job).
    • Multi-tenant batched-job tasks have far fewer layer-dependency issues than single-tenant single-job tasks.
      • Multi-tenancy, i.e., running multiple neural networks together, largely alleviates the layer-dependency problem because layers from different neural networks can be freely scheduled without any dependency between them.
      • In batched-job tasks, the activations are broken down into mini-batches because of memory concerns; these mini-batches are independent of each other.

Groups

  • At the host, we run a light-weight control program to divide a pool of queued jobs into multiple dependency-free groups.
  • The group size should be larger than or equal to the number of sub-accelerators, so that no sub-accelerator sits completely idle (see the sketch below).
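
A minimal sketch of this grouping step, assuming the queued jobs are already mutually independent (layers of different DNNs or separate mini-batches); the chunk-and-fold policy is an illustrative simplification, not the paper's control program.

```python
from typing import List

def make_groups(jobs: List[int], num_sub_accels: int, group_size: int) -> List[List[int]]:
    """Chunk a pool of mutually independent queued jobs into dependency-free groups.

    group_size must be >= num_sub_accels so that no sub-accelerator is left
    completely idle while a group is being executed.
    """
    assert group_size >= num_sub_accels, "a group must be able to occupy every core"
    groups = [jobs[i:i + group_size] for i in range(0, len(jobs), group_size)]
    if len(groups) > 1 and len(groups[-1]) < num_sub_accels:
        groups[-2].extend(groups.pop())    # fold an undersized trailing group into its predecessor
    return groups

# e.g. make_groups(list(range(50)), num_sub_accels=4, group_size=16) -> groups of 16, 16, 18 jobs
```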

Mapping

  • A mapping is the strategy that orchestrates job assignment and job prioritization across sub-accelerators over space and time.
  • It is a global mapping: global control of the data movement across the host and sub-accelerators.

Mapper

  • Manually-designed mapping
    • + Highly tuned for specific accelerators/systems or tasks/workloads.
    • - Whenever the underlying accelerators or the target tasks change, the mapper needs to be re-designed, or at least undergo another lengthy cycle of tuning.
  • Mapping optimization method (this work)
    • We simply re-run the optimizer's search whenever the underlying accelerator evolves or the target task changes.

IV. Optimization Framework (M3E)


  • Evaluation phase: the candidate mapping is evaluated by the cost model.
  • Optimization phase: the optimization algorithm tweaks the mapping based on feedback from the evaluation phase. The optimization-evaluation loop runs iteratively until the target objective converges or a fixed number of epochs elapses.
  • The mapping consists of two key components:
    • Sub-accelerator selection: the assignment of each job to a specific sub-accelerator.
    • Job prioritization: the execution order of jobs on a given sub-accelerator.

Encoding (how the search space is described)


  • The Sub-accel Selection section decides which job goes to which sub-accelerator.
    • Each value is the sub-accelerator ID for the corresponding job.
  • The Job Prioritizing section decides the order of the jobs within each sub-accelerator.
    • Each value is the priority of the corresponding job; lower values mean higher priority (a job with priority 0 runs before one with priority 1). A sketch of this encoding is given below.
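
A minimal sketch of this two-section encoding and its decoding into per-core job queues, assuming N jobs and M sub-accelerators; the use of floating-point priorities is an implementation convenience here, not necessarily the paper's exact representation.

```python
import random
from collections import defaultdict

def random_mapping(num_jobs: int, num_accels: int):
    """One encoded individual: a sub-accel ID per job, and a priority per job."""
    selection = [random.randrange(num_accels) for _ in range(num_jobs)]   # Sub-accel Selection section
    priority  = [random.random() for _ in range(num_jobs)]                # Job Prioritizing section (lower = earlier)
    return selection, priority

def decode(selection, priority):
    """Decode the two sections into an ordered job queue per sub-accelerator."""
    queues = defaultdict(list)
    for job_id, accel_id in enumerate(selection):
        queues[accel_id].append(job_id)
    for accel_id in queues:
        queues[accel_id].sort(key=lambda j: priority[j])                  # higher-priority jobs run first
    return dict(queues)

# Example: 6 jobs on 2 sub-accelerators
sel, pri = random_mapping(num_jobs=6, num_accels=2)
print(decode(sel, pri))   # e.g. {0: [3, 1, 5], 1: [0, 4, 2]}
```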

Optimization algorithm (how the search space is explored)

  • Several commonly used black-box optimization algorithms and two RL methods can be plugged in as the optimizer; MAGMA (Sec. V) is the GA-based algorithm proposed in this work.

Building Blocks of M3E

  1. BW Allocator: In a multi-core accelerator, system BW becomes a global resource shared between cores.
    • Evenly allocating the same amount of BW to every sub-accelerator increases the likelihood of compute-resource under-utilization.
    • The BW allocator instead reallocates BW based on the memory-BW intensity of the different jobs running on the different sub-accelerators.
  2. Job Analyzer: The Job Analyzer takes the job descriptions as input and estimates the no-stall latency and the required BW of each job on each sub-accelerator using a cost model, producing a Job Analysis Table (Fig. 3).
  3. HW cost module for sub-accelerators: we leverage MAESTRO as the underlying cost model for each sub-accelerator.
    • Given a DNN layer, a HW resource configuration (PEs, SL size, SG size, NoC latency, and BW), and a mapping/dataflow strategy, MAESTRO estimates statistics such as latency, energy, runtime, power, and area.
  4. Job Analysis Table: the Job Analyzer profiles each job in the task with the cost model and stores the profiling results in the Job Analysis Table.
    • No-stall latency: the latency of running each job on each sub-accelerator, assuming sufficient memory bandwidth.
    • No-stall bandwidth: the minimum bandwidth each sub-accelerator needs to stay compute-bound rather than memory-bound.
    • A sketch of the table and a BW-reallocation step follows this list.
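
A minimal sketch of the Job Analysis Table and one plausible BW-reallocation policy (proportional to each core's no-stall BW demand). The `cost_model` function is a toy stand-in for the MAESTRO-based cost module, and the proportional policy is my reading of "reallocate by memory-BW intensity", not the paper's exact rule.

```python
from typing import Dict, List, Tuple

def cost_model(job: int, accel: int) -> Tuple[float, float]:
    """Toy stand-in for the MAESTRO-based cost module: returns
    (no-stall latency, no-stall BW) for `job` on sub-accelerator `accel`."""
    latency = 1000.0 * (job % 7 + 1) / (accel + 1)    # made-up numbers, for illustration only
    bandwidth = 4.0 * (job % 3 + 1)
    return latency, bandwidth

def build_job_analysis_table(jobs: List[int], accels: List[int]) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """Profile every (job, sub-accelerator) pair once, before the search starts."""
    return {(j, a): cost_model(j, a) for j in jobs for a in accels}

def allocate_bw(active_jobs: Dict[int, int], table, system_bw: float) -> Dict[int, float]:
    """Split the system BW across sub-accelerators in proportion to the no-stall BW
    of the job each one is currently running (one plausible policy)."""
    demand = {a: table[(j, a)][1] for a, j in active_jobs.items()}    # accel -> no-stall BW need
    total = sum(demand.values())
    if total <= system_bw:
        return demand                                                 # everyone stays compute-bound
    return {a: system_bw * d / total for a, d in demand.items()}      # scale down proportionally
```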

M3E Workflow

  1. Set-up: the user configures the optimizer by feeding in the job descriptions, the configuration (PE count, dataflow) of each sub-accelerator, the system constraint (system BW), and the objective (e.g., throughput).
  2. Pre-process: the Job Analyzer prepares the Job Analysis Table.
  3. Optimization loop (sketched below)
    • Optimization phase: the optimization algorithm updates the encoded mapping based on feedback from the evaluation block.
    • Evaluation phase:
      • The decoder decodes the encoded mapping into a mapping description.
      • The BW Allocator takes the mapping description and allocates BW to each sub-accelerator.
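
Putting the pieces together, a minimal sketch of the M3E set-up / pre-process / optimization loop, reusing the illustrative helpers sketched earlier (`random_mapping`, `decode`, `build_job_analysis_table`, `allocate_bw`). The random proposal and the simple makespan objective are placeholders for the real optimizer (e.g., MAGMA) and evaluation block.

```python
# `accels` is assumed to be list(range(num_accels)) so sub-accel IDs line up with the encoding.

def estimate_latency(queues, bw_alloc, table) -> float:
    """Placeholder objective: per-core sums of no-stall latency, makespan = slowest core.
    (Ignores BW stalls for brevity; the real evaluation accounts for the allocated BW.)"""
    per_core = [sum(table[(j, a)][0] for j in q) for a, q in queues.items()]
    return max(per_core) if per_core else 0.0

def m3e_search(jobs, accels, system_bw, epochs=1000):
    table = build_job_analysis_table(jobs, accels)             # 2. pre-process
    best, best_cost = None, float("inf")
    for _ in range(epochs):                                    # 3. optimization loop
        sel, pri = random_mapping(len(jobs), len(accels))      #    optimization phase (random stub; MAGMA goes here)
        queues = decode(sel, pri)                              #    evaluation phase: decode the mapping
        heads = {a: q[0] for a, q in queues.items()}           #    job at the head of each core's queue
        bw = allocate_bw(heads, table, system_bw)              #    BW allocation for the current jobs
        cost = estimate_latency(queues, bw, table)             #    cost-model feedback
        if cost < best_cost:
            best, best_cost = (sel, pri), cost
    return best, best_cost

# e.g. m3e_search(jobs=list(range(32)), accels=[0, 1, 2, 3], system_bw=128.0)
```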

V. Optimization Algorithm (MAGMA)

MAGMA is a GA-based search technique.

Why GA?

  • GA achieves performance competitive with deep reinforcement learning.
  • GA is lightweight and fast.

MAGMA Algorithm Details

  1. Terminology and Basics of GA:
  • The basic mechanism of a GA is to create a population of individuals in each generation.
  • All individuals are evaluated and sorted by their fitness.
  • The best-performing individuals are used as parents to create the next population of individuals via genetic operators.
  2. Genetic operators:
    • Mutation:
      Image Not Showing Possible Reasons
      • The image file may be corrupted
      • The server hosting the image is unavailable
      • The image path is incorrect
      • The image format is not supported
      Learn More →
      • Randomly select multiple genes and mutate them to random values.
    • Crossover: three specialized crossover operators (sketched in code after this list)
      • Crossover-gen:
        • First, we randomly sample which genome type (sub-accel selection or job prioritizing) to crossover.
        • Next, we randomly sample a pivot point and exchange the two parents' genes of that genome around the pivot.
      • Crossover-rg:
        • This is a range-crossover mechanism structured to preserve the dependency of genes across genomes.
        • We randomly pick a range of gene positions and simultaneously crossover all genes that fall into that range in both genomes, so the cross-genome dependency (a job's accelerator assignment and its priority) is preserved.
      • Crossover-accel:
        • This is a crossover method that preserves the dependency of the job ordering within a sub-accelerator.
        • We randomly select a sub-accelerator and pass its job-ordering information to the children.
  3. Hyper-parameter Tuning:
  • The mutation and crossover rates, population size, and elite ratio are the hyper-parameters of MAGMA.
  • We applied a hyper-parameter search via a Bayesian optimization framework to select the set of hyper-parameters that gives MAGMA the highest performance across multiple workloads.
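
A minimal sketch of MAGMA's genetic operators over the two-genome encoding (sub-accel selection + job prioritizing). The operator names follow the paper; the concrete implementations are one illustrative reading, not the authors' code.

```python
import random

def mutation(selection, priority, num_accels, rate=0.05):
    """Randomly pick genes and reset them to random values."""
    sel, pri = selection[:], priority[:]
    for j in range(len(sel)):
        if random.random() < rate:
            sel[j] = random.randrange(num_accels)     # re-assign job j to a random core
        if random.random() < rate:
            pri[j] = random.random()                  # re-draw job j's priority
    return sel, pri

def crossover_gen(parent_a, parent_b):
    """Crossover-gen: pick one genome type, then single-point crossover on that genome only."""
    child_a = [g[:] for g in parent_a]                # parent = (selection, priority)
    child_b = [g[:] for g in parent_b]
    g = random.randrange(2)                           # 0 = selection genome, 1 = prioritizing genome
    p = random.randrange(len(child_a[g]))             # pivot point
    child_a[g][p:], child_b[g][p:] = child_b[g][p:], child_a[g][p:]
    return tuple(child_a), tuple(child_b)

def crossover_rg(parent_a, parent_b):
    """Crossover-rg: swap a contiguous range of jobs in BOTH genomes at once, so each
    swapped job keeps its (accelerator, priority) pairing across genomes."""
    child_a = [g[:] for g in parent_a]
    child_b = [g[:] for g in parent_b]
    lo, hi = sorted(random.sample(range(len(child_a[0]) + 1), 2))
    for g in range(2):
        child_a[g][lo:hi], child_b[g][lo:hi] = child_b[g][lo:hi], child_a[g][lo:hi]
    return tuple(child_a), tuple(child_b)

def crossover_accel(parent_a, parent_b):
    """Crossover-accel: copy one sub-accelerator's job set and ordering from parent_a
    into a copy of parent_b, preserving the job ordering within that core."""
    child = [g[:] for g in parent_b]
    accel = random.choice(parent_a[0])                # pick a sub-accelerator used in parent_a
    for j, a in enumerate(parent_a[0]):
        if a == accel:
            child[0][j] = a                           # keep job j on that sub-accelerator...
            child[1][j] = parent_a[1][j]              # ...with its original relative priority
    return tuple(child)
```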

VI. Evaluation

Methodology

  1. Task and Benchmark: four different types of tasks
    1. Vision
    2. Language
    3. Recommendation
    4. Mix
  2. Accelerators: two classes of accelerators
  • HB: high-bandwidth (inspired by NVDLA)
  • LB: low-bandwidth (inspired by Eyeriss)

Mapper Settings

  • Baseline Manual-tuned Mapper
  • Optimization methods
  • MAGMA

Homogeneous Accelerators


Even though AI-MT and Herald-like are designed for vision and language tasks, they also work well when applied to recommendation and mix tasks.

Heterogeneous Accelerators

  • Small Accel & Large Accel
    • (a)(b)(S2) Herald-like performs well because it is designed for heterogeneous accelerators, while AI-MT is designed for homogeneous ones.
    • (c)(d)(S4) On the large accelerator, the mapping task becomes more complex since the design space of the mapping grows.
  • BW-limited environment
    • For both the small and large accelerators, as the BW decreases, MAGMA stands out more clearly by reaching better relative performance.
  • Homogeneous vs. heterogeneous
    • The LB-style sub-accelerators usually have longer runtimes but lower BW requirements than the HB-style ones. S4 (large heterogeneous) incurs more no-stall latency but requires less BW than S3 (large homogeneous).
    • When BW is limited, the heterogeneous setting lets the accelerator exploit the different BW requirements of its sub-accelerators to relax BW contention.
  • Big (S4) vs. BigLittle (S5)
    • BigLittle has a smaller BW requirement because of its smaller sub-accelerator sizes.

Flexible Accelerator

  • Accelerator configuration
    • The number of PEs in each sub-accelerator is fixed; the shape of the 2D PE array is flexible.
  • Dataflow strategy
    • We pick the dataflow strategy of each sub-accelerator to maximize the utilization of the PE array.
    • To maximize utilization, we align the PE-array dimensions to be factors of the tile's parallelized dimensions as much as possible (see the sketch at the end of this section).
  • Evaluation
    • For both Vision and Mix tasks, the flexible configuration outperforms the fixed one in average per-job no-stall latency, owing to its ability to maximize PE-array utilization.
    • However, it also incurs a higher BW requirement.
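
A minimal sketch of the PE-array shape selection described above: given a fixed PE budget and the two tile dimensions being parallelized, pick the (rows, cols) factorization that leaves the fewest PEs idle. The utilization model is a simplifying assumption; the paper only states the alignment goal.

```python
def best_pe_array_shape(num_pes: int, par_dim_a: int, par_dim_b: int):
    def occupied(rows: int, cols: int) -> int:
        # PEs doing work when a (par_dim_a x par_dim_b) tile is spread over a (rows x cols) array
        return min(rows, par_dim_a) * min(cols, par_dim_b)

    best, best_util = None, -1.0
    for rows in range(1, num_pes + 1):
        if num_pes % rows:
            continue                                   # only exact factorizations of the PE budget
        cols = num_pes // rows
        util = occupied(rows, cols) / num_pes
        if util > best_util:
            best, best_util = (rows, cols), util
    return best, best_util

# Example: 64 PEs, parallelizing tile dimensions K=16 and C=12
print(best_pe_array_shape(64, 16, 12))                 # -> ((8, 8), 1.0) under this simple model
```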