# IOS: INTER-OPERATOR SCHEDULER FOR CNN ACCELERATION
###### tags: `TinyML`
###### source: MLSys 2021
###### paper: [link](https://arxiv.org/pdf/2011.01302.pdf)
###### slides: [link](http://www.yaoyaoding.com/ios/files/mlsys-ios-slides.pdf)
###### website: [link](http://www.yaoyaoding.com/ios/)

## Introduction and Motivation
Convolutional neural networks (CNNs) have achieved state-of-the-art performance across many tasks. Deep learning frameworks such as TensorFlow and PyTorch exploit intra-operator parallelism, which parallelizes the arithmetic operations within a single CNN operator (e.g., a convolution). However, intra-operator parallelism alone cannot saturate the high parallelism of modern hardware, leaving compute resources underutilized. To address this problem, recent work explores inter-operator parallelism, which executes multiple CNN operators in parallel, guided by different heuristics. MetaFlow fuses multiple operators matching a specific pattern into a larger operator to increase operator granularity. Tang et al. propose a greedy strategy that directly executes all available CNN operators on the CPU to maximize resource utilization. These approaches apply different heuristics to optimize local parallelization across a few CNN operators; however, such techniques do not lead to a globally optimal schedule for the entire CNN architecture.

![](https://i.imgur.com/gIo08ul.png)

## Contributions
1. They point out that inter-operator parallelism is crucial: a major bottleneck for efficient CNN inference is that existing intra-operator parallelism cannot saturate the high parallelism of modern hardware.
2. They propose a dynamic programming algorithm that finds a highly optimized schedule for inter-operator parallelization; the technique is platform-agnostic.
3. They apply IOS to various hardware and inference settings. IOS automatically customizes the scheduling policy for different hardware and inference configurations.

## Methods
#### Problem Definition
![](https://i.imgur.com/71SagKG.png)
- Computation Graph: A CNN is defined by a computation graph G = (V, E), where V is the set of operators and E is the edge set representing dependencies. A computation graph is a directed acyclic graph (DAG). Each node in the graph is an operator such as a convolution or a matrix multiplication. Each edge (u, v) is a tensor that is an output of operator u and an input of operator v.
- Stage: To exploit inter-operator parallelism in a CNN architecture, its computation graph is partitioned into multiple stages. Stages are executed sequentially, and the operators within the same stage are executed according to a chosen parallelization strategy.
- Parallelization Strategy: Each stage adopts one of two parallelization strategies: operator merge or concurrent execution. IOS considers both and automatically picks the more efficient one for each stage. The choice depends on the operator types, the input tensor shapes, and the hardware device that performs the CNN computation.
- Schedule: A schedule Q of a computation graph G is defined as Q = {(S1, T1), (S2, T2), ..., (Sk, Tk)}, where Si is the set of operators in the i-th stage and Ti is the corresponding parallelization strategy, either "concurrent execution" or "operator merge".

#### IOS
- Ending
  ![](https://i.imgur.com/6hxmVRX.png)
  To find an optimized schedule for a CNN architecture, IOS first partitions the computation graph G = (V, E) into V - S' and S', where all edges between V - S' and S' start from V - S' and end in S'. Such an S' is called an ending of V. The ending S' forms the last stage, and the remaining operators V - S' are scheduled recursively in the same way.
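As a concrete illustration of the ending condition, here is a minimal Python sketch (not from the paper: the toy graph, the operator names, and the brute-force enumeration are all placeholders). It lists every ending of a small block in which two convolutions feed a concat; the cost recurrence described next searches over exactly these endings.

```python
from itertools import combinations

# Hypothetical toy graph: SUCCESSORS[u] lists the operators consuming u's
# output, i.e. edges u -> v. Two convolutions feed a concat operator.
SUCCESSORS = {
    "conv_a": ["concat"],
    "conv_b": ["concat"],
    "concat": [],
}

def is_ending(subset, ops, successors):
    """Ending condition from the paper: every edge between ops - subset and
    subset must point INTO subset, i.e. no operator in subset may feed an
    operator that is still outside subset."""
    rest = ops - subset
    return all(succ not in rest for op in subset for succ in successors[op])

def enumerate_endings(ops, successors):
    """Brute-force enumeration of all non-empty endings of `ops`.
    IOS bounds and prunes this search (see the pruning parameters r and s
    in the Ablation Study below); this sketch simply tries every subset."""
    ops = frozenset(ops)
    for k in range(1, len(ops) + 1):
        for combo in combinations(sorted(ops), k):
            subset = frozenset(combo)
            if is_ending(subset, ops, successors):
                yield subset

if __name__ == "__main__":
    for ending in enumerate_endings(SUCCESSORS.keys(), SUCCESSORS):
        print(sorted(ending))
    # -> ['concat'], ['concat', 'conv_a'], ['concat', 'conv_b'],
    #    ['concat', 'conv_a', 'conv_b']
```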
- Cost
  ![](https://i.imgur.com/IvH91v3.png)
  - cost[S] is the latency of an optimal schedule for the operator set S.
  - stage_latency[(S', T)] is the latency of the stage (S', T).
  - S' is an ending of S, and cost[∅] = 0.
  - Putting these together, the recurrence in the figure above reads: cost[S] = min over all endings S' of S of ( stage_latency[(S', T)] + cost[S - S'] ), where T is the better of the two strategies for S'.
- Steps
  1. Implicitly enumerate all possible schedules by enumerating the endings of each operator set.
  2. Use dynamic programming to pick the schedule with the minimum cost.
- Algorithm
  ![](https://i.imgur.com/DUdGXDz.png)
  (A minimal code sketch of this dynamic program is given in the appendix at the end of this note.)
- Example
  ![](https://i.imgur.com/8NgE0gy.png)

#### Time complexity
![](https://i.imgur.com/fNtUpcM.png =100x)
- 𝒏: number of operators
- 𝒅: maximum number of parallelizable operators

## Experiment
- Comparison of different schedules
  ![](https://i.imgur.com/DVR2jda.png)
- Comparison of cuDNN-based frameworks
  ![](https://i.imgur.com/6P8kpVr.png)
  - TVM-cuDNN is the TVM framework compiling a convolutional neural network with the cuDNN library: convolutions use the kernels provided by cuDNN, while all other operators such as addition and concatenation use TVM's own kernels.
- More active warps improve utilization
  ![](https://i.imgur.com/YOB4DWd.png)
  Model operators are mapped to GPU kernels for execution. A kernel launches a collection of threads that are grouped into multiple thread blocks. Thread blocks are distributed to streaming multiprocessors (SMs), and each thread block on an SM is further partitioned into multiple warps.

## Ablation Study
- Schedule Pruning Reduces Search Time
  ![](https://i.imgur.com/EhgtOkU.png)
  - To explore the trade-off between the optimized latency and the optimization cost (i.e., search time), they experiment on Inception V3 and NasNet with pruning parameters r ∈ {1, 2, 3} and s ∈ {3, 8}.
- Specialized Scheduling is Beneficial
  ![](https://i.imgur.com/sG96cSe.png)
  - Different workloads (e.g., the same network with different batch sizes) have different computation features, so it is necessary to specialize the schedule for each workload.
- Consistent Improvements for Different Batch Sizes
  ![](https://i.imgur.com/RepsnUM.png)
  - Real-world applications need to handle different batch sizes for inference, and changing the workload requires a different inter-operator parallelization schedule.
  - When the batch size is larger than 128, performance saturates and the throughput no longer increases significantly.
  - Even though a larger batch size provides more data parallelism, inter-operator parallelism can still be exploited to further improve the throughput.
- Intra- and Inter-Operator Parallelism
  ![](https://i.imgur.com/GIbssR3.png)
  - IOS focuses on inter-operator parallelism and leaves the exploitation of intra-operator parallelism to the cuDNN library.
  - IOS's optimization cost (the schedule search time) is lower than TVM's.
  - TVM outperforms IOS on RandWire and NasNet because TVM finds more efficient kernels for the separable convolutions that make up the majority of operators in those networks.
  - They believe combining TVM and IOS would boost performance further and leave this for future work.

## Conclusion
1. When a model consists mainly of a single branch of convolutions (so there is little inter-operator parallelism to exploit), the results may not be as good as those reported in this paper.
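## Appendix: DP Sketch
To make the scheduling algorithm concrete, here is a minimal, self-contained Python sketch of the dynamic program from the Methods section (referenced from the Algorithm item above). It is only an illustration under simplifying assumptions, not the authors' implementation: the toy graph, the operator names, and the stage latencies are placeholders, the ending check repeats the earlier sketch so this block runs on its own, and it does not validate whether a given operator set can actually be executed with a chosen strategy. The real IOS measures stage latencies on the target device and prunes the ending enumeration with the r and s parameters.

```python
from functools import lru_cache
from itertools import combinations

# Toy graph (placeholder): two convolutions feeding a concat, edges u -> v.
SUCCESSORS = {"conv_a": ["concat"], "conv_b": ["concat"], "concat": []}
STRATEGIES = ("concurrent execution", "operator merge")

def endings(remaining):
    """Yield every non-empty ending S' of `remaining`: no operator in S'
    may feed an operator that is still in remaining - S'."""
    for k in range(1, len(remaining) + 1):
        for combo in combinations(sorted(remaining), k):
            subset = frozenset(combo)
            if all(succ in subset or succ not in remaining
                   for op in subset for succ in SUCCESSORS[op]):
                yield subset

def stage_latency(stage, strategy):
    """Placeholder for profiling stage (S', T) on the target device;
    IOS measures this, here we just return dummy numbers."""
    return 1.0 if strategy == STRATEGIES[0] else 1.2

@lru_cache(maxsize=None)
def cost(remaining):
    """cost[S] = min over endings S' of S and strategies T of
       stage_latency[(S', T)] + cost[S - S'],  with cost[empty] = 0.
       Returns (latency, schedule); the chosen ending is the last stage."""
    if not remaining:
        return 0.0, ()
    best = (float("inf"), ())
    for stage in endings(remaining):
        for strategy in STRATEGIES:
            rest_latency, rest_schedule = cost(remaining - stage)
            total = rest_latency + stage_latency(stage, strategy)
            if total < best[0]:
                # earlier stages first, the current ending goes last
                best = (total, rest_schedule + ((tuple(sorted(stage)), strategy),))
    return best

if __name__ == "__main__":
    latency, schedule = cost(frozenset(SUCCESSORS))
    print("latency:", latency)
    for i, (stage, strategy) in enumerate(schedule, 1):
        print(f"stage {i}: {stage}  [{strategy}]")
```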