# Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks
###### tags: `Accelerators` `Multi-Tenant` `Dynamic Architecture Fission`
###### paper origin: MICRO 2020
###### papers: [link](https://ieeexplore.ieee.org/document/9251939)
###### video: [link](https://www.youtube.com/watch?v=PKS6VgxYVg0)
# 1. INTRODUCTION
## Research Problems
* As the demand for INFaaS (INference-as-a-Service) scales, continuously increasing the number of accelerators in the cloud may not be efficient, whereas multi-tenancy has been a primary enabler of cloud computing's success at its current scale.
## Proposed Solutions
* Dynamically fissioning the DNN accelerator at runtime to spatially co-locate multiple DNN inferences on the same hardware.
* Dynamic architecture fission for spatial multi-tenant execution.
* Task scheduling for spatial multi-tenant execution.
# Dynamic Architecture Fission: Concepts And Overview

The objective is to enable multi-tenant execution of DNNs by spatially co-locating multiple DNN tasks on a single accelerator. To do so, the underlying accelerator needs to dynamically fission at runtime into smaller, full-fledged logical accelerators, each of which executes its pertinent DNN.
* Fission microarchitecture
* Microarchitecture that can fission dynamically into smaller full-fledged accelerators to execute multiple DNNs simultaneously.
* Task scheduler
* A task scheduling algorithm that adaptively schedules and assigns the resources to different tasks.
* the scheduler identifies the minimal amount of resources required to execute each DNN while meeting its imposed QoS constraints.
* it uses a scoring mechanism that combines task priority and remaining slack time to distribute the remaining resources on the accelerator and spatially co-locate tasks (a sketch follows below).
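As a rough illustration of this scoring idea, here is a minimal Python sketch. The formula is an assumption for illustration (the paper only describes the mechanism qualitatively at this point); `priority` and `remaining_slack` are hypothetical inputs.

```python
def score(priority: float, remaining_slack: float, eps: float = 1e-9) -> float:
    """Hypothetical scoring: tasks with higher priority and less remaining
    slack before their QoS deadline get a larger score, so the scheduler
    favors them when distributing leftover resources.
    NOTE: assumed formula, not the paper's exact equation."""
    return priority / (remaining_slack + eps)
```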
# Architecture Design For Fission: Challenges And Opportunities

* Fission for Compute and the Need for New Communication Patterns
* The need for flexible and cost-effective fission of compute resources
* mapping a convolution or matrix-multiplication operation to a big monolithic systolic array can lead to underutilization of compute resources when the operation's dimensions do not fill the array (illustrated below)
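To make the underutilization concrete, a back-of-the-envelope model (the dimensions are chosen for illustration and are not taken from the paper):

```python
def systolic_utilization(rows_used: int, cols_used: int,
                         array_rows: int, array_cols: int) -> float:
    """Fraction of PEs doing useful work when a GEMM tile that maps onto
    rows_used x cols_used PEs runs on an array_rows x array_cols systolic
    array (simplified spatial-utilization model)."""
    used = min(rows_used, array_rows) * min(cols_used, array_cols)
    return used / (array_rows * array_cols)

# A layer that only occupies 64 x 32 PEs on a monolithic 128 x 128 array:
print(systolic_utilization(64, 32, 128, 128))  # 0.125 -> 87.5% of PEs idle
# The same layer on a fissioned 64 x 32 logical accelerator:
print(systolic_utilization(64, 32, 64, 32))    # 1.0 -> fully utilized
```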

* Fission granularity
* only a subset of the inter-PE links is made reconfigurable, so fission disconnects a subarray of PEs rather than a single PE; this choice sets the fission granularity while limiting hardware cost.
* The need for new and flexible patterns of communication for richer fission possibilities.
* the inter-PE links are made bi-directional, so data can flow in either direction along each axis

The paper proposes omni-directional systolic arrays that can forward the input activations and partial sums in all directions, as opposed to conventional systolic arrays that always forward the data in just two fixed directions (see the sketch below).
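A toy model of omni-directional forwarding, assuming a subarray holds a 2D grid of values (e.g., partial sums) that move one PE-hop per cycle; this illustrates the dataflow idea only, not the paper's actual microarchitecture:

```python
import numpy as np

def forward(grid: np.ndarray, direction: str) -> np.ndarray:
    """Shift a subarray's values one PE-hop in the given direction,
    zero-filling the vacated edge. A conventional systolic array supports
    only two fixed directions (e.g., 'east' for activations and 'south'
    for partial sums); an omni-directional subarray supports all four."""
    out = np.zeros_like(grid)
    if direction == "east":
        out[:, 1:] = grid[:, :-1]
    elif direction == "west":
        out[:, :-1] = grid[:, 1:]
    elif direction == "south":
        out[1:, :] = grid[:-1, :]
    elif direction == "north":
        out[:-1, :] = grid[1:, :]
    else:
        raise ValueError(f"unknown direction: {direction}")
    return out
```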
* Enabling full-fledged logical accelerators through fission for the SIMD Vector Unit.
* the SIMD Vector Unit (which executes the non-GEMM operations) also needs to be broken into smaller segments, so that each logical accelerator remains full-fledged
* Fission for the On-Chip Memory and the Need for Reorganizing the Entire Design
* Weight buffer fission.
* Activation and output buffer fission.

* Fission without Reorganization Defeats the Purpose

# Microarchitecture For Fission
* Omni-Directional Systolic Array Design

* Reorganizing the Accelerator Microarchitecture through Fission Pod Design
* Objective
* Creating multiple stand-alone and full-fledged logical accelerators to enable spatial co-location.
* Enriching the fission possibilities as much as possible to serve various computational needs of co-located DNNs.
* Maximizing the PE subarray utilization.
* Maximizing the utilization of the on-chip buffers and their bandwidth to the subarrays.
* Constraints
* Imposing minimal power/area overhead on the hardware
* Maintaining the baseline clock frequency

* Memory-compute interweaving in Fission Pod
* Intra Fission Pod data communication.
* Clock frequency consideration.
* Planaria Overall Architecture

* The original monolithic systolic array is broken down into 16 omni-directional systolic subarrays, where each group of four subarrays forms one Fission Pod that contains a Pod Memory (a structural sketch follows).
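As a structural summary of that organization, a minimal Python sketch; the class names, subarray dimensions, and Pod Memory capacity are illustrative placeholders, since the exact sizes are not restated here:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Subarray:
    """One omni-directional systolic subarray (dimensions illustrative)."""
    rows: int
    cols: int

@dataclass
class FissionPod:
    """A group of four subarrays interwoven with a shared Pod Memory."""
    subarrays: List[Subarray]
    pod_memory_kb: int  # placeholder capacity, not specified here

def build_planaria(sub_rows: int, sub_cols: int, pod_mem_kb: int) -> List[FissionPod]:
    """16 subarrays total, grouped four per pod into 4 Fission Pods."""
    return [FissionPod([Subarray(sub_rows, sub_cols) for _ in range(4)],
                       pod_mem_kb)
            for _ in range(4)]
```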
# Spatial Task Scheduling
* Requirements
* The scheduler ideally needs to be aware of the optimal fission configurations for DNN tasks to leverage dynamic fission and co-location.
* The scheduler needs to be QoS-aware and leverage the slack time offered by each task's QoS constraint to maximize co-location and utilization while adhering to the SLA.
* Task re-allocation requires checkpointing the intermediate results, while ensuring that re-allocation and checkpointing do not overuse on-chip memory or incur significant context-switching overhead.

* Estimating the minimal resources needed to meet the QoS requirement.
* Scheduling is triggered whenever a new inference task is dispatched to the datacenter node's task queue or a running inference task finishes.
* Allocating resources to improve QoS.
* The scheduler determines the allocation of subarrays based on their availability and the priority of the inference requests.
* If a task cannot be allocated, its score is increased so that it is favored in subsequent scheduling rounds.
* Tile-based scheduling to minimize re-allocation overheads.
* Scheduling happens at tile granularity.
* Tasks are preempted only when the resource allocation changes (a sketch of the loop follows).
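Putting these steps together, a minimal sketch of the scheduling loop. The scoring formula, the `min_subarrays` estimate, and the leftover-distribution policy are assumptions standing in for the paper's estimator and scoring mechanism; tile-boundary preemption is only noted in the comments.

```python
from dataclasses import dataclass
from typing import List

TOTAL_SUBARRAYS = 16  # Planaria fissions into 16 omni-directional subarrays

@dataclass
class Task:
    name: str
    priority: float
    deadline: float      # absolute QoS deadline (seconds)
    min_subarrays: int   # minimal resources that still meet QoS (assumed given)
    allocated: int = 0
    score: float = 0.0

def schedule_round(tasks: List[Task], now: float) -> None:
    """One scheduling round, triggered when a task arrives or finishes.
    Hypothetical policy: waiting tasks accumulate score (priority over
    remaining slack), each waiting task is granted its minimal QoS
    allocation in score order, and leftover subarrays are spread over
    running tasks. Re-allocation would preempt a task only at a tile
    boundary, after checkpointing the current tile's results."""
    free = TOTAL_SUBARRAYS - sum(t.allocated for t in tasks)
    for t in tasks:
        if t.allocated == 0:
            slack = max(t.deadline - now, 1e-9)
            t.score += t.priority / slack  # assumed scoring formula
    for t in sorted(tasks, key=lambda t: t.score, reverse=True):
        if t.allocated == 0 and t.min_subarrays <= free:
            t.allocated = t.min_subarrays
            free -= t.min_subarrays
    running = sorted((t for t in tasks if t.allocated > 0),
                     key=lambda t: t.score, reverse=True)
    i = 0
    while free > 0 and running:
        running[i % len(running)].allocated += 1  # extra subarrays tighten latency
        free -= 1
        i += 1
```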
# Results
