# Serving Multi-DNN Workloads on FPGAs: a Coordinated Architecture, Scheduling, and Mapping Perspective

###### tags: `Accelerators`
###### paper origin: IEEE Transactions on Computers
###### paper: [link](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9917279)

## Introduction
* Cloud-backed Inference-as-a-Service (INFaaS) currently dominates Artificial Intelligence (AI) workloads in data centers.
* However, simply increasing the number of nodes to accommodate the computing demand is neither scalable nor cost-effective. Recently, there has been a clear trend toward enabling multi-tenancy on a single-node accelerator.

### Architecture Level
* This paper explores a novel spatial multi-tenant architecture that incorporates the benefits of both HDAs and HMCAs.
* On the one hand, the dataflow flexibility of HDAs better accommodates model heterogeneity; on the other hand, HMCAs allow multiple DNN inference tasks to run concurrently, each allocated an appropriate number of cores to meet its Query-Per-Second (QPS) and Service Level Agreement (SLA) requirements.
* Homogeneous multi-core accelerators (HMCAs):
    * Consist of multiple identical cores that communicate with each other via a Network-on-Chip (NoC). Each core can run a DNN model independently or share the same workload with other cores.
    * Provide dynamic fission flexibility that can quickly adapt to dynamic load changes.
* Heterogeneous dataflow accelerators (HDAs):
    * Offer a unique optimization dimension, which stems from the diverse dataflow preferences of different layers.

![](https://i.imgur.com/5j53lKa.png)

### Scheduling Level
* The multi-DNN schedule decides the execution order and resource allocation for layers from different DNN models.
* Recent multi-DNN schedulers use heuristics-based algorithms, such as Shortest-Job-First (SJF) and greedy methods, which heavily rely on pre-designed stationary architectures.
    * Heuristic-based methods lead to sub-optimal schedules and cannot fit the rapidly evolving hardware platforms (especially FPGAs) in the cloud.
* Prior studies assume a fixed bandwidth (BW) allocation at runtime.
    * This leads to waste and over-competition for bandwidth resources, thus deteriorating efficiency and performance.

### Mapping Level
* Decides how a DNN layer is mapped to the HDAs (dataflow mapping in terms of loop reordering and loop tiling) or to the HMCAs (multi-core mapping via batch, activation, weight, or partial-sum parallelization).

![](https://i.imgur.com/E4uQmy9.png)

## H3M Co-Exploration Framework
![](https://i.imgur.com/M5q5Ciu.png)

### Framework Overview
#### Sample Search Space
* The optimizer generates samples from the search space:
    * hardware parameters, mapping scheme, layer schedule, and bandwidth allocation.
* The search space consists of:
    * The architecture design space of the accelerators when deployed to heterogeneous FPGA systems under the given multi-DNN workload.
    * The schedule design space of the execution order (via the scheduling function) and the off-chip bandwidth allocation for parallel jobs.
    * The mapping design space of sub-accelerator selection (via the allocation function) and of choosing the proper combination of spatial/temporal mapping for the input feature pixel, input channel, and kernel weight dimensions.

![](https://i.imgur.com/VTr1dnk.png)

#### Decode
* The decoder translates the encoding vectors into solutions.

#### Schedule
* The scheduler orders the jobs in the queue according to their priority scores and layer dependencies.
* The scheduler allocates each job's share of off-chip bandwidth according to its normalized priority score (a sketch of this policy follows after this list).
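A minimal sketch of this priority-driven ordering and proportional bandwidth split, in Python with hypothetical `Job` fields and helper names (not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    """One layer job; the fields here are hypothetical, for illustration only."""
    name: str
    priority: float                              # score from the scheduling encoding vector
    deps: list = field(default_factory=list)     # layers that must finish before this one

def order_jobs(jobs):
    """Order jobs by priority score while respecting layer dependencies."""
    done, order, pending = set(), [], list(jobs)
    while pending:
        # only jobs whose dependencies have completed are ready to issue
        ready = [j for j in pending if all(d in done for d in j.deps)]
        nxt = max(ready, key=lambda j: j.priority)   # highest priority goes first
        order.append(nxt)
        done.add(nxt.name)
        pending.remove(nxt)
    return order

def bandwidth_shares(parallel_jobs, total_bw):
    """Split off-chip bandwidth in proportion to normalized priority scores."""
    total = sum(j.priority for j in parallel_jobs)
    return {j.name: total_bw * j.priority / total for j in parallel_jobs}
```

Because the priority scores come from the sampled scheduling encoding vector (described below) and are evaluated offline, the resulting order and bandwidth split are fixed before deployment; as noted in the Evaluator section, no dynamic scheduler is needed at runtime.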
#### Simulator
* The simulator runs the jobs with the given configuration and outputs the total latency and energy consumption to calculate the Energy-Delay Product (EDP).

### Optimization Algorithm
* We use [CMA-ES](https://link.springer.com/content/pdf/10.1007/3-540-32494-1.pdf) to sample the search space and optimize the objective function.

### The Three-tier Encoding Format
#### Hardware Encoding Vector
![](https://i.imgur.com/gGhc5sO.png)
* An array of 2M floating-point numbers, where M is the total number of sub-accelerator dataflow choices.
* Each pair of floating-point numbers encodes the hardware resource share and the PE number for one choice of sub-accelerator dataflow.

#### Mapping Encoding Vector
![](https://i.imgur.com/Vd92njA.png)
* An array of 5L floating-point numbers, with L being the total number of layers.
* A mapping score higher than one indicates spatial mapping in that dimension; otherwise it indicates temporal mapping.
* The dataflow choice determines which HMCA the layer maps to.
* The core-number factor indicates how many cores of the chosen HMCA the current layer maps to.

#### Scheduling Encoding Vector
![](https://i.imgur.com/lwl3FcO.png)
* An array of L floating-point numbers.
* Layers with higher priority scores are likely to execute earlier in the queue and be assigned more bandwidth.

### Evaluator
#### Layer Schedule and Dynamic Bandwidth Allocation
* H3M does not require a dynamic scheduler as a hardware module or a piece of runtime software, since the schedule is known and fixed at runtime.
* Off-chip bandwidth is dynamically allocated based on the normalized priority scores of parallel jobs.

#### Simulator
* Each layer's latency and energy consumption is evaluated by MAESTRO.

## Implementation on FPGAs
![](https://i.imgur.com/o0hyZNY.png)

### Load/Save Data Movement Module
* The Load/Save data movement module has forwarding and broadcast control for Load instructions and write control for Save instructions to the local shared buffer or off-chip DDR memory.
* It is responsible for:
    * Reading data from the left neighbor or off-chip memory to the local cores.
    * Forwarding data from the off-chip memory or the left neighbor to the right neighbor.
    * Writing data from local cores to the local shared buffer or off-chip memory.
    * Configuring the dynamic bandwidth controller for runtime bandwidth allocation.
    * Merging identical read requests into one and broadcasting the fetched data.
* The Network-on-Chip (NoC), adopted from [Hoplite](https://dl.acm.org/doi/pdf/10.1145/3027486), forwards instructions and data between groups.
    * A uni-directional NoC takes up far fewer hardware resources than a bi-directional NoC.
* The Instruction Decoder generates control signals for all of the multiplexers and the dynamic bandwidth controller to manage broadcast, forwarding, and save directions.
* The Control Multiplexers are driven by the Instruction Decoder.
* Pack/Unpack Module:
    * The pack module packs data and instructions into frames for inter-group communication.
    * The unpack module extracts the instruction for decoding.

### Sub-accelerator Architecture
![](https://i.imgur.com/gMVZ04k.png)
* The basic templates of the DNN sub-accelerator architecture are based on Xilinx DPUs.
* Since the DPU is a Xilinx proprietary functional block, we implement it based on [Angel-Eye](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7930521), which is a basic implementation of a Xilinx DPU.
* Each sub-accelerator contains five modules:
    * Local Instruction Decoder and Scheduler (LIDS): decodes instructions and locally schedules the other four modules.
    * Data loader module (LOAD)
    * Data writer module (SAVE)
    * Convolution operator module (CONV)
    * Non-convolution operator module (MISC)
* By enabling heterogeneous dataflow at the PE-array level, we implement DPU-based accelerators with heterogeneous dataflow and a homogeneous multi-core architecture.

### Dynamic Bandwidth Controller Implementation
* Based on [AXI HyperConnect](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9218652).

![](https://i.imgur.com/gKRAZPy.png)

* A bus equalizer is proposed as the key module that works alongside the conventional AXI interconnection module.
    * It realizes bandwidth reservation by limiting the number of outstanding transactions based on the threshold h.
        * Once the number of transactions set by the threshold h is reached, the Central Control Unit (CCU) suspends further transaction requests from the corresponding sub-accelerator.
    * It deals with heterogeneous burst sizes with the help of splitters and mergers.
        * When the burst size is larger than the pre-defined uniform burst size b, the splitter divides the read and write transaction requests into multiple sub-transactions, and the merger merges the corresponding responses.
* The AXI slave control interface is exposed to the hypervisor as a standard memory-mapped device. The hypervisor can read/write internal registers to configure parameters and monitor statistics at runtime.

#### Bandwidth Reservation
* Bandwidth reservation for each sub-accelerator is essential to guarantee performance isolation under multi-tenant sharing.
* [It is realized by limiting data transactions to a specific threshold number over a periodic time window.](https://drops.dagstuhl.de/opus/volltexte/2019/10761/pdf/LIPIcs-ECRTS-2019-24.pdf)

#### Security Isolation
* It is crucial to ensure security isolation among different DNN accelerators under multi-tenant sharing.
* Current commercial AXI interconnection modules use round-robin arbitration to resolve conflicts among multiple accelerators. However, this can lead to seriously unfair bandwidth allocation when burst sizes are heterogeneous.
    * A malicious tenant could bring down the entire FPGA system by uploading an accelerator bitstream with a particularly large burst size.
* The design employs the [technique](https://retis.sssup.it/~a.biondi/papers/CASES19.pdf) of equalizing the burst size of each sub-accelerator to a uniform size.

#### Runtime Reconfigurability
* Runtime reconfigurability is enabled by exposing an AXI interface for the hypervisor to configure internal registers at runtime (a behavioral sketch of the resulting throttling follows after this list):
    * The threshold h limits the number of AXI transactions, implementing the bandwidth allocation of each sub-accelerator.
    * The uniform burst size b enforces fair bandwidth allocation and security isolation.
    * The period T impacts the total bandwidth utilization rate, depending on the specific workload.
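A minimal behavioral sketch of how these three registers interact, in Python with hypothetical names (the actual design is an RTL module next to the AXI interconnect; here h is simplified to a per-period transaction budget, following the bandwidth-reservation reference above):

```python
class BusEqualizerModel:
    """Behavioral sketch only: models the throttling that the bus equalizer's
    registers (h, b, T) control for one sub-accelerator."""

    def __init__(self, h, b, T):
        self.h = h          # max transactions allowed per period (bandwidth reservation)
        self.b = b          # uniform burst size, in beats (fairness / security isolation)
        self.T = T          # period length, in cycles
        self.issued = 0     # transactions issued in the current period
        self.cycle = 0

    def tick(self):
        """Advance one cycle; the transaction budget resets at each period boundary."""
        self.cycle += 1
        if self.cycle % self.T == 0:
            self.issued = 0

    def split_burst(self, burst_len):
        """Split an oversized burst into sub-transactions of at most b beats each."""
        return [min(self.b, burst_len - i) for i in range(0, burst_len, self.b)]

    def request(self, burst_len):
        """Return the sub-transactions to issue, or [] if the budget is exhausted
        (the CCU would then suspend the request until the next period)."""
        subs = self.split_burst(burst_len)
        if self.issued + len(subs) > self.h:
            return []       # suspended: over this sub-accelerator's reserved share
        self.issued += len(subs)
        return subs
```

Under this simplification, a sub-accelerator's reserved share is roughly h transactions of b beats each per period T, which is what the hypervisor tunes through the memory-mapped registers.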
## Evaluation

#### Comparison with Multi-tenant DNN Accelerators
![](https://i.imgur.com/sUz2xQw.png)

#### Comparison with Mapping and Scheduling Baselines
![](https://i.imgur.com/QfJpJ7x.png)

#### Ablation Study
![](https://i.imgur.com/ZR6BWIX.png)
* The scheduling design space has the greatest impact on performance.
* The architecture design space has relatively the least impact.

#### Sampling Efficiency
![](https://i.imgur.com/MaqnFNp.png)

#### Trade-off between Latency and Energy
![](https://i.imgur.com/T7t52t8.png)
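For reference, the objective the optimizer minimizes is the Energy-Delay Product reported by the simulator; a tiny sketch (generic helpers, not from the paper) of how EDP and a latency-energy trade-off curve relate:

```python
def edp(latency, energy):
    """Energy-Delay Product: the single scalar objective that balances latency against energy."""
    return energy * latency

def dominates(q, p):
    """q dominates p if q is no worse in both metrics and strictly better in at least one."""
    return q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])

def pareto_front(points):
    """Keep the (latency, energy) points that no other point dominates,
    i.e., the achievable latency-energy trade-off curve."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```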