# A Multi-Neural Network Acceleration Architecture
###### tags: `Accelerators` `TPU` `multi-nn`
###### paper origin: ISCA 2020
###### papers: [link](https://ieeexplore.ieee.org/document/9138929)
###### video: `none`
###### slide: `none`

# Outline
* 1. Introduction
    * Motivation
    * Overview of AI-MT
    * Contribution
* 2. Background
    * Neural Networks
    * Baseline neural network accelerator architecture
    * Baseline scheduling
    * Four problems of baseline scheduling
* 3. AI-Multitasking Architecture
    * At compile time
        1. Sub-layer creation
        2. Data structure initialization
    * At runtime
        1. Track dependency-free tasks
        2. Load-balancing scheduling mechanism
* 4. Evaluation
    * Multi-NN Execution Latency
    * Sensitivity Test: Batch Size
    * Sensitivity Test: SRAM Size
    * Power & Area Overheads
* 5. Discussion and Conclusion
    * I/O Buffer Capacity
    * Spatial Aspects of PE Array Utilization
    * Conclusion

# 1. Introduction
* Motivation
    * Emerging AI services consist of many heterogeneous neural network executions.
    * Existing accelerators are optimized for a **single neural network**; they suffer from severe resource underutilization when running multiple neural networks, mainly due to **load imbalance**.
    * The goal is an architecture that enables cost-effective, high-performance **multi-neural-network** execution.
* Overview of AI-MT
    1. Create fine-grained compute- and memory-intensive tasks from different networks -> divide each layer into **sub-layers**.
    2. Use an H/W-based sub-layer scheduler to schedule dependency-free sub-layer tasks and execute them in parallel to maximize resource utilization -> **1. MB prefetching, 2. CB merging, 3. MB eviction**.
* Contribution
    1. Cost-effective, high-performance multi-neural-network acceleration
    2. Efficient scheduling methods
    3. Minimal SRAM requirement

# 2. Background
* **Neural Networks (NN)**
![](https://img.onl/lsSz1Q =400x300)
* **Baseline neural network accelerator architecture**
`A conventional systolic-array architecture based on a purposely scaled Google TPU.`
    1. 16 PE arrays, each with a 128x128 8-bit integer MAC unit
    `The 16 arrays and reduced bit precision target server-scale neural network inference.`
    `MAC = multiply-accumulate`
    2. HBM (450 GB/s) and decoupled on-chip SRAM buffers
    `The HBM bandwidth is higher than TPUv2's to mitigate the memory bottleneck.`
    `It supports double buffering to hide the latency of weight prefetching.`
    `HBM = High Bandwidth Memory (3D-stacked DRAM)`
![](https://img.onl/M4gsGM =500x250)
* **Baseline scheduling: Fig. 4-c**
    1. A sub-layer execution = a single PE-array mapping
    `A layer is divided into a number of equal-sized sub-layers, and each contains 2 phases: a memory block (MB) and a compute block (CB).`
![](https://img.onl/XevGi =480x360)
    2. **Four problems of baseline scheduling:**
        * 2-1: Utilization of the memory/compute resource drops in compute-/memory-intensive layers.
        * 2-2: Inter-layer dependency `can be alleviated by running multiple NNs`
        * 2-3: High resource idleness due to frequently mismatched resource intensities (Fig. 6-b, 6-c)
![](https://img.onl/vhg3iC =410x300)
        * 2-4: Limited SRAM capacity and a high capacity requirement
![](https://img.onl/1K654h =460x300)

# 3. AI-Multitasking Architecture
* **At compile time**
    1. Create fine-grained computation and memory-access tasks.
    2. Initialize (1) the sub-layer scheduling table, (2) the candidate queues, and (3) the weight management table. See Algo. 1.
![](https://img.onl/1liuyK =1000x350)
![](https://img.onl/b9UDWR =600x300)
    `1. read_cyc_per_array = # of cycles to prefetch weights into a PE array`
    `2. filling_time = # of cycles from the 1st weight input until the 1st output is generated`
    `3. #iters = ⌈ oc / # of filters runnable concurrently ⌉, i.e., how many times the filters must be mapped to complete all dot-product operations`
    `-----------------------------------------`
    `CONV:`
    `(line 2) All arrays share the same weight mapping.`
    `(line 3) Each CB operates on a partition of the input feature map.`
    `FC:`
    `(line 6) Each array has a different weight mapping.`
    `(line 7) Each CB operates on all input features.`
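A minimal sketch of the compile-time pass above, assuming a simple cycle model: it derives `read_cyc_per_array`, `filling_time`, and `#iters` for each layer and stores them as scheduling-table entries. The class names, fields, and the systolic fill-time formula are illustrative assumptions, not the paper's Algorithm 1.

```python=
# Illustrative compile-time table construction (not the paper's Algorithm 1).
# Only read_cyc_per_array / filling_time / #iters come from the note above;
# the Layer fields and the cycle model are assumptions.
import math
from dataclasses import dataclass

ARRAY_DIM = 128  # one PE array is 128x128 MACs (Section 2)

@dataclass
class Layer:
    name: str
    kind: str                 # "CONV" or "FC"
    oc: int                   # number of output channels (filters)
    weights_per_array: int    # weight bytes mapped onto one PE array
    hbm_bytes_per_cycle: int  # sustained HBM bytes per cycle (assumed)

@dataclass
class SubLayerEntry:
    layer: str
    read_cyc_per_array: int   # cycles to prefetch one array's weights (MB)
    filling_time: int         # cycles from 1st weight input to 1st output
    iters: int                # how many times the sub-layer must be scheduled

def build_schedule_table(layers):
    table = []
    for l in layers:
        # MB cost: cycles to stream one array's weights from HBM into SRAM
        read_cyc_per_array = math.ceil(l.weights_per_array / l.hbm_bytes_per_cycle)
        # Assumed systolic fill model: rows + cols cycles before the 1st output
        filling_time = 2 * ARRAY_DIM
        # iters = ceil(oc / #filters runnable concurrently on one array)
        iters = math.ceil(l.oc / ARRAY_DIM)
        table.append(SubLayerEntry(l.name, read_cyc_per_array, filling_time, iters))
    return table
```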
* **At runtime**
    1. Track dependency-free blocks (MBs or CBs).

    *1-1 Between layers:*
    ```c=
    // blk = the 1st MB or the 1st CB of the next layer L
    // blk_CQ = the candidate queue for that block type (MB_CQ or CB_CQ)
    if L's #indegree == 0:        // all producer layers have finished
        blk_CQ.push(blk)          // blk becomes dependency-free
        L's #iters of blk -= 1
    ```

    *1-2 Between sub-layers:*
    ```c=
    // within the current layer L
    if L's #iters of MB < L's #iters of CB:   // an MB finished before its CB was issued
        CB_CQ.push(CB)                        // the matching CB becomes schedulable
        L's #iters of CB -= 1
    ```

    2. Apply the **load-balancing scheduling mechanism**.

    *2-1 MB prefetching*
    To increase memory bandwidth utilization, prefetch the next MB whenever it still fits in the remaining SRAM capacity (`next MB.cycle < remaining SRAM capacity`).

    *2-2 CB merging*
    To increase PE-array utilization, adopt Algo. 2.
    ![](https://img.onl/Q9McKl)
    `line 11: If no MB in the candidate queue is smaller than the remaining SRAM capacity (RMC), the scheduler waits until the executing CB finishes to recover the corresponding SRAM capacity.`
    `(See block 4 in Fig. 12-c.)`
    ![](https://img.onl/IMtuJF =450x500)

    *2-3 Early MB eviction*
    `Problem:`
    `(1) A long-running CB can exhaust the SRAM capacity and leave the memory bandwidth idle.`
    `(2) The same happens when MB.cycle > the available SRAM capacity.`
    `(line 11 of Algo. 2, Fig. 13-a)`
    Early MB eviction gives high priority to SRAM-capacity-critical MBs whose MB cycles exceed their CB cycles (e.g., an FC sub-layer's MB). Because such a large MB occupies a large amount of SRAM while its CB is relatively short, scheduling it early recovers a large amount of SRAM capacity quickly.
    ![](https://img.onl/bZJNBW =450x500)
    It also halts the currently executing long CB and schedules smaller CBs first to recover SRAM capacity quickly, rather than waiting until the large CB finishes. (Fig. 13-c)
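The three mechanisms above can be summarized in a small scheduler sketch. This is an illustrative reconstruction, not the paper's Algorithm 2: the queue layout, the `size`/`mb_cycles`/`cb_cycles`/`weight_size` fields, and the eviction ordering are assumptions based on the descriptions above.

```python=
# Illustrative runtime scheduler combining MB prefetching (2-1), CB merging
# (2-2), and early MB eviction (2-3). Names and bookkeeping are assumptions.
from collections import deque

class Scheduler:
    def __init__(self, sram_capacity, num_pe_arrays=16):
        self.sram_free = sram_capacity    # remaining SRAM capacity (bytes)
        self.mb_cq = deque()              # dependency-free memory blocks
        self.cb_cq = deque()              # dependency-free compute blocks
        self.free_arrays = num_pe_arrays  # idle PE arrays

    def prefetch_mb(self):
        """2-1 MB prefetching with 2-3 early-eviction priority:
        capacity-critical MBs (mb_cycles > cb_cycles, e.g. FC sub-layers)
        are issued first so their SRAM space is recovered soon."""
        critical = [mb for mb in self.mb_cq if mb.mb_cycles > mb.cb_cycles]
        ordered = critical + [mb for mb in self.mb_cq if mb not in critical]
        for mb in ordered:
            if mb.size <= self.sram_free:     # MB fits in the remaining SRAM
                self.sram_free -= mb.size
                self.mb_cq.remove(mb)
                return mb                     # hand to the HBM fetch engine
        return None                           # wait for a CB to free SRAM

    def merge_cbs(self):
        """2-2 CB merging: pack dependency-free CBs from different NNs onto
        idle PE arrays so compute- and memory-intensive phases overlap."""
        issued = []
        while self.cb_cq and self.free_arrays > 0:
            issued.append(self.cb_cq.popleft())
            self.free_arrays -= 1
        return issued

    def retire_cb(self, cb):
        """When a CB finishes, release its PE array and evict its weights."""
        self.free_arrays += 1
        self.sram_free += cb.weight_size
```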
# 4. Evaluation
* The authors evaluate AI-MT by extending a cycle-accurate systolic-array simulator to support multi-neural-network execution, using MLPerf workloads.
![](https://img.onl/Xr1HbB =450x250)
![](https://img.onl/6DBjZ9 =550x250)
`TABLE II: 4 CNNs and 1 RNN, where VGG16 and GNMT are compute-intensive and the others are memory-intensive.`
`Together they provide a balanced distribution of CBs and MBs.`
* Multi-NN Execution Latency
    * Although VGG16 has large memory-intensive FC layers, its compute-intensive CONV layers come earlier than those memory-intensive layers, which reduces the opportunity to overlap the two kinds of resource-demanding sub-layers. (Fig. 14)
    * The evaluation shows a 1.57x speedup over the baseline scheduling method (FIFO).
    * The benefit varies with the co-located neural networks because of their different resource demands.
![](https://img.onl/iR5Z6r)
* Sensitivity Test: Batch Size
    * As the batch size increases, CBs become larger and the limited SRAM capacity becomes the bottleneck that prevents fully utilizing the memory bandwidth.
    * Applying all three schemes achieves a 1.47x speedup over the baseline.
![](https://img.onl/aMGE7g =500x250)
* Sensitivity Test: SRAM Size
    * The workloads are also executed iteratively to observe the impact of long-running neural networks, as in a cloud-server environment.
    * The SRAM capacities required to reach the ideal performance are 4 GB for the baseline and 1 MB for AI-MT, respectively.
* Power & Area Overheads
![](https://img.onl/ibUfbA =500x250)

# 5. Discussion and Conclusion
* **A larger input can increase the SRAM capacity requirement.**
  The work currently assumes that the SRAM has enough dedicated space to store the required input and output features. However, it might be useful to deploy a preemption mechanism that keeps only a minimal working set of input and output features, together with ways to mitigate its overhead.
* To further improve resource utilization, the **spatial aspects of the PE array** could be addressed as well. For example, when a CB is too small to fully utilize all the MAC units of a PE array, it might be possible to execute multiple CBs on that array at the same time (a small sketch of this idea follows below).
* The proposed method includes **(1) memory block prefetching** and **(2) compute block merging** for the best resource load matching, and **(3) memory block eviction**, which schedules and evicts SRAM-capacity-critical MBs early.
* Combining all of these methods, AI-MT achieves its performance improvement with the minimum SRAM capacity.
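As a toy illustration of that spatial-packing idea, the check below tests whether two small CBs could share one 128x128 PE array side by side. The CB shape fields and the column-wise packing rule are assumptions for illustration only; the paper does not define this mechanism.

```python=
# Toy check for co-locating two small CBs on one 128x128 PE array.
# cb.rows / cb.cols are assumed attributes: the weight-tile dimensions
# the CB would map onto the array.
ARRAY_ROWS, ARRAY_COLS = 128, 128

def can_copack(cb_a, cb_b):
    """Return True if the two CBs fit on one PE array when placed
    next to each other along the column dimension."""
    fits_rows = max(cb_a.rows, cb_b.rows) <= ARRAY_ROWS
    fits_cols = (cb_a.cols + cb_b.cols) <= ARRAY_COLS
    return fits_rows and fits_cols
```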