# Heterogeneous Dataflow Accelerators for Multi-DNN Workloads
###### tags: `Accelerators`
###### paper origin: HPCA, 2021
###### papers: [link](https://ieeexplore.ieee.org/document/9407116)
## 1. Introduction
### Problems
* An accelerator’s dataflow design for specific layers can lead to inefficiency across other layers.
* Eyeriss: CONV2D
* The reconfigurable dataflow accelerator (RDA) approach enables flexibility, but at the cost of extra hardware components.
### Solutions
* Propose heterogeneous dataflow accelerators (HDAs), which provide flexibility by employing multiple sub-accelerators, each tuned for a different dataflow, within a single accelerator chip.
* dataflow flexibility
* high utilization

## Background
### Heterogeneous Multi-DNN Workloads
The diversity of models naturally leads to high variation in layer (1) shapes and (2) operations, which gives rise to heterogeneous multi-DNN workloads.

* Layer Shape
* Layer Operation
Because each layer prefers a different dataflow style and hardware configuration depending on its shape and operation, such workloads are challenging for fixed-dataflow accelerators (FDAs); the sketch below illustrates this diversity.
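As a toy illustration (the model names and dimensions below are invented, not taken from the paper), a handful of layer descriptors makes the shape/operation diversity concrete: an early CONV2D layer is activation-heavy while an FC layer is weight-heavy, which is why they prefer different dataflows.

```python
# Illustrative layer descriptors (dims and model names are hypothetical).
from dataclasses import dataclass

@dataclass
class Layer:
    model: str
    op: str   # CONV2D, DWCONV (depthwise), FC, ...
    K: int    # output channels
    C: int    # input channels
    Y: int    # activation height
    X: int    # activation width
    R: int    # filter height
    S: int    # filter width

workload = [
    Layer("resnet-like",    "CONV2D", K=64,   C=64,   Y=56, X=56, R=3, S=3),
    Layer("mobilenet-like", "DWCONV", K=1,    C=144,  Y=56, X=56, R=3, S=3),
    Layer("mlp-like",       "FC",     K=1024, C=1024, Y=1,  X=1,  R=1, S=1),
]

for l in workload:
    # Rough first-order proxies for the data footprint of each layer.
    acts, wts = l.C * l.Y * l.X, l.K * l.C * l.R * l.S
    print(f"{l.model:15s} {l.op:7s} activations={acts:>9,} weights={wts:>9,}")
```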
### Dataflow and Mapping
No single dataflow style is good for all layers; the dataflow must be optimized for each layer in the target workloads to maximize the efficiency of an accelerator. The sketch after the list below contrasts two common styles.

* ShiDianNao
* Output-stationary dataflow
* NVDLA
* Weight-stationary dataflow
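As a minimal sketch (my illustration, not the paper's hardware model), the two styles correspond to different loop orders over the same computation; a matrix-vector product stands in for a layer here.

```python
# Toy illustration: the same matrix-vector product computed with two loop
# orders that mirror the two dataflow styles.
M, N = 4, 3                                              # outputs, inputs
W = [[m * N + n for n in range(N)] for m in range(M)]    # weights
x = [1.0] * N                                            # input activations

# Output-stationary (ShiDianNao-style): each partial sum stays resident in
# a PE until the output is complete; weights and inputs stream past it.
y_os = []
for m in range(M):
    acc = 0.0
    for n in range(N):
        acc += W[m][n] * x[n]
    y_os.append(acc)

# Weight-stationary (NVDLA-style): each weight stays resident in a PE while
# the partial sums it contributes to are updated.
y_ws = [0.0] * M
for n in range(N):
    for m in range(M):
        y_ws[m] += W[m][n] * x[n]

assert y_os == y_ws   # identical math; only the reuse pattern differs
```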
## Heterogeneous Dataflow Accelerators (HDAs)
By default, HDAs assign each layer to the sub-accelerator whose dataflow style that layer prefers most.
### Design Considerations and Definition of HDA
* Dataflow Selection for Sub-accelerators
* To maximize the benefits from dataflow flexibility, the dataflow styles of sub-accelerators need to be sufficiently different so that the resulting HDA can adapt to different layers with diverse shapes and operations.
* Hardware Resource Partitioning
* The optimal distribution depends on the workloads and the selected dataflows, which makes determining the hardware resource partitioning even more challenging.
* Layer Scheduling
* Designing a scheduler that satisfies all of the aforementioned requirements is challenging.
### Benefits of HDAs
* Selective Scheduling
* Because each layer prefers a different dataflow and hardware configuration, running each layer on its most preferred sub-accelerator in an HDA is an effective way to maximize overall efficiency.
* Layer Parallelism
* HDAs can simultaneously run multiple layers of different models.
* Low Hardware Cost for Dataflow Flexibility
* HDAs do not incur the hardware costs for reconfigurability that RDAs do.
### Challenges for HDAs
* Reduced Parallelism for Each Layer
* The maximum degree of parallelism for each sub-accelerator decreases compared to an FDA or an RDA with the same number of PEs in total.
* Shared Memory and NoC Bandwidth.
* Because multiple sub-accelerators share a global scratchpad memory and a global NoC, those resources need to be either time-multiplexed or hard-partitioned across sub-accelerators.
* Scheduling under Memory and Dependence Constraints to Minimize Dark Silicon
* The scheduler needs to assign layers to their most preferred sub-accelerators to exploit the benefits of flexible dataflow, while respecting memory and dependence constraints so that sub-accelerators do not sit idle as dark silicon.

We develop a hardware and schedule co-design space exploration (DSE) algorithm for HDAs that co-optimizes all of these design considerations.
## Design Space Exploration Algorithm for HDAs

### Execution Model
1. Fetch a global-buffer-level filter weight tile from DRAM into the global buffer.
2. Distribute sub-accelerator-level filter weight tiles to sub-accelerators based on the layer execution schedule.
3. Fetch a global-buffer-level activation tile from DRAM into the global buffer.
4. Stream sub-accelerator-level activation tiles into their corresponding sub-accelerators based on the layer execution schedule.
5. Store the output activations streamed out of each sub-accelerator to the global buffer.
6. Overlap computation with data fetches from DRAM: while the sub-accelerators compute output activations, pre-fetch the next activation and filter tiles from DRAM and send them to their target sub-accelerators (double buffering).
7. When a sub-accelerator finishes executing a layer, stream the output activations stored in the global buffer as the input activations of the next layer.
8. Repeat the above steps until all layers of all models are processed (a minimal sketch of this loop follows).
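A minimal sketch of this loop, assuming hypothetical names (`SubAccelerator`, `run_layer`, `dram`) rather than Herald's actual interface:

```python
# Sketch of the double-buffered execution model described above.
from collections import deque

class SubAccelerator:
    def __init__(self, name):
        self.name = name

    def run_layer(self, layer, tiles):
        # Stand-in for the real dataflow computation (steps 4-5).
        print(f"{self.name}: executing {layer}")
        return f"out({layer})"

def execute(schedule, sub_accels, dram):
    """schedule: list of (layer, sub_accelerator_name) in execution order."""
    global_buffer = {}
    pending = deque(schedule)
    first_layer, _ = pending[0]
    global_buffer[first_layer] = dram[first_layer]    # steps 1-3: initial fetch
    while pending:
        layer, target = pending.popleft()
        tiles = global_buffer.pop(layer)
        # Step 6: while this layer computes, pre-fetch the next layer's
        # weight/activation tiles from DRAM (double buffering).
        if pending:
            nxt, _ = pending[0]
            global_buffer.setdefault(nxt, dram[nxt])
        # Steps 4-5 and 7: stream tiles in, store outputs to the global
        # buffer, where they become the next layer's input activations.
        global_buffer[layer + "_out"] = sub_accels[target].run_layer(layer, tiles)

dram = {"conv1": ("W1", "A1"), "conv2": ("W2", "A2")}
accels = {"shi": SubAccelerator("shi"), "nvdla": SubAccelerator("nvdla")}
execute([("conv1", "shi"), ("conv2", "nvdla")], accels, dram)
```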
### Latency and Energy Estimation
We use MAESTRO as the base cost model; it is a validated cost model for monolithic DNN accelerators (i.e., FDAs and RDAs) with arbitrary dataflows.
### Hardware Resource Partitioning Optimization
Unlike FDAs and RDAs, which implement a monolithic accelerator substrate and can dedicate all hardware resources to it, HDAs need to distribute the following resources across sub-accelerators (see the sketch after this list):
* Global memory
* NoC bandwidth
* PEs
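A minimal enumeration sketch under assumed search granularities (`pe_step`, `bw_step`, and `cost_fn` are my parameters; `cost_fn` stands in for a MAESTRO-based estimate of the scheduled workload):

```python
# Enumerate ways to split a PE and NoC-bandwidth budget across
# sub-accelerators, keeping the split with the lowest estimated cost.
from itertools import product

def partitions(total, parts, step):
    """All ways to split `total` units into `parts` positive multiples of `step`."""
    if parts == 1:
        if total % step == 0 and total > 0:
            yield (total,)
        return
    for first in range(step, total, step):
        for rest in partitions(total - first, parts - 1, step):
            yield (first,) + rest

def best_partition(num_sub_accels, total_pes, total_bw, cost_fn,
                   pe_step=64, bw_step=8):
    best, best_cost = None, float("inf")
    for pes, bws in product(partitions(total_pes, num_sub_accels, pe_step),
                            partitions(total_bw, num_sub_accels, bw_step)):
        cost = cost_fn(pes, bws)   # e.g., scheduled latency from a cost model
        if cost < best_cost:
            best, best_cost = (pes, bws), cost
    return best, best_cost

# Toy cost function penalizing imbalance (stand-in for real estimates).
toy_cost = lambda pes, bws: max(pes) / min(pes) + max(bws) / min(bws)
print(best_partition(2, 256, 64, toy_cost))
```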
### Layer Execution Schedule Optimization

* Dataflow preference-based layer assignment on sub-accelerators.
* Use a greedy assignment with a feedback loop for global load balancing (see the sketch after this list).
* Heuristic-based Initial Layer Ordering.
* Layers with a linear dependence chain within a model: depth-first ordering
* Independent layers across models: breadth-first ordering
* Eliminating Redundant Idle Time in Initial Schedules via Post-processing.
* An initial schedule based on simple depth-first or breadth-first layer ordering often contains unnecessary idle time due to a poor layer execution order; the post-processing algorithm eliminates such inefficiencies.
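A hedged sketch of the greedy assignment plus load-balancing feedback loop; the latency table and layer names are illustrative, and the dependence and memory constraints the paper's scheduler also handles are omitted here.

```python
# Preference-based greedy assignment with a re-balancing feedback loop.

def schedule(layers, sub_accels, latency):
    """layers: layer names in an initial (depth/breadth-first) order.
    latency[layer][accel]: estimated latency of `layer` on each sub-accelerator.
    Returns ({accel: [layers]}, makespan) after greedy assignment + balancing."""
    finish = {a: 0.0 for a in sub_accels}
    assignment = {a: [] for a in sub_accels}
    for layer in layers:
        # Greedy step: pick the most-preferred (lowest-latency) dataflow.
        best = min(sub_accels, key=lambda a: latency[layer][a])
        assignment[best].append(layer)
        finish[best] += latency[layer][best]
    # Feedback loop: move layers off the bottleneck sub-accelerator whenever
    # doing so strictly shortens the makespan (global load balancing).
    improved = True
    while improved:
        improved = False
        slowest = max(finish, key=finish.get)
        for layer in list(assignment[slowest]):
            for other in sub_accels:
                if other == slowest:
                    continue
                new_other = finish[other] + latency[layer][other]
                new_slow = finish[slowest] - latency[layer][slowest]
                if max(new_other, new_slow) < finish[slowest]:
                    assignment[slowest].remove(layer)
                    assignment[other].append(layer)
                    finish[slowest], finish[other] = new_slow, new_other
                    improved = True
                    break
            if improved:
                break
    return assignment, max(finish.values())

lat = {"conv1": {"shi": 5, "nvdla": 9}, "conv2": {"shi": 4, "nvdla": 6},
       "fc1":   {"shi": 8, "nvdla": 3}}
print(schedule(["conv1", "conv2", "fc1"], ["shi", "nvdla"], lat))
```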
### Herald: An Implementation of the DSE Algorithm

* As outputs, Herald reports an optimized PE and global NoC bandwidth partitioning, together with an optimized layer execution schedule for the partitioned sub-accelerators.
* Herald also reports the estimated total latency and energy based on the MAESTRO cost model; a sketch of this top-level DSE loop follows.
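A sketch of how the pieces plausibly fit together at the top level; `build_schedule` and `estimate` are stand-ins for the schedule optimizer and the MAESTRO-based cost model, not Herald's actual API.

```python
# Top-level DSE loop: for each candidate hardware partition, build a
# schedule and score it with the cost model; keep the best design found.

def herald_dse(partition_candidates, workload, build_schedule, estimate):
    """partition_candidates: iterable of hardware splits, e.g. (pe_split, bw_split)
    build_schedule(partition, workload) -> layer execution schedule
    estimate(partition, schedule) -> (latency, energy) per the cost model"""
    best = None
    for partition in partition_candidates:
        sched = build_schedule(partition, workload)
        latency, energy = estimate(partition, sched)
        if best is None or latency < best["latency"]:   # latency-first objective
            best = {"partition": partition, "schedule": sched,
                    "latency": latency, "energy": energy}
    return best

# Toy usage with stand-in functions.
cands = [((128, 128), (32, 32)), ((192, 64), (48, 16))]
toy_sched = lambda p, w: list(w)
toy_est = lambda p, s: (len(s) * max(p[0]) / sum(p[0]), 1.0)
print(herald_dse(cands, ["conv1", "fc1"], toy_sched, toy_est))
```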
## Evaluation
### Evaluation Settings
* Workloads

* Dataflow
* ShiDianNao
* NVDLA
* Eyeriss
* Cost Estimation
* MAESTRO
* Accelerator Styles

### Results
* Costs and Benefits of HDAs.

Maelstrom (an example HDA design produced by Herald) demonstrates 65.30% lower latency and 5.0% lower energy than the best FDA, 63.11% lower latency and 4.1% lower energy than the SM-FDAs, and 20.7% higher runtime but 22.0% lower energy compared to a MAERI-based RDA.
* Optimal HW Resource Partitioning.

* Impact of Workloads
* The AR/VR-B workload is less friendly to HDAs, providing only 6.8% latency and 6.61% energy improvements over the best FDAs in the Figure 11 case study, compared to 63.26% latency and 4.05% energy improvements for AR/VR-A and 48.1% latency and 4.4% energy improvements for MLPerf.
* Single-DNN Case
* Even for a single DNN, HDAs can still exploit layer parallelism and heterogeneity within a model by batch-processing the workload.
* Efficacy of Scheduling Algorithm
* Impact of Batch Size.
* Results show the HDA's preference for large batch sizes.
* Comparison against RDAs
* RDA designs provided 22.9%, 21.5%, and 24.3% lower latency for the AR/VR-A, AR/VR-B, and MLPerf workloads, respectively. However, the RDA designs required 18.7%, 15.5%, and 18.9% more energy for those workloads, respectively.