# Split-CNN: Splitting Window-based Operations in Convolutional Neural Networks for Memory System Optimization

###### tags: `GPUs`
###### paper origin: ASPLOS’19
###### paper: [Link](https://dl.acm.org/doi/10.1145/3297858.3304038)

## 1. Introduction

### problem
- The memory bottleneck of training deep convolutional neural networks keeps getting worse, since the trend in deep learning is toward deeper and larger networks.
- Three challenging trends contribute to the memory bottleneck
    - prevalence of memory-bound layers
    - increasing batch sizes
    - increasing model complexity

### solution
- Propose Split-CNN, a generic instrumentation of regular CNN models that breaks apart the memory bottleneck
- Introduce a new heterogeneous memory management system (HMMS) that optimally schedules memory allocation, deallocation, offloading and prefetching, without any tuning of the network models, to address the challenge of memory-bound layers
- Combine Split-CNN and HMMS to address the challenge of increasing batch sizes

## 2. Background and Motivation

### 2.1 Training a Deep Neural Network
- The primary reason behind the high memory capacity requirement of DNN training is that intermediate values computed in the forward pass are often required again in the backward pass.

### 2.2 Challenges to the Memory Capacity Constraint
- More Memory-Bound Layers
    - Recent advances (e.g. batch normalization layers, fast convolution algorithms) and their quick adoption have made traditional techniques (e.g. memory offloading, CUDA managed unified memory) suffer heavy performance degradation
- Larger Batch Size
    - Recent studies emphasize the importance of using very large batch sizes to improve the quality of the gradients and thus accelerate convergence
- Higher Model Complexity
    - Deep learning models only grow more complex, which translates directly into increased demand for higher accelerator memory capacity

### 2.3 Limitation of Layer-wise Memory Allocation
- vDNN offloads intermediate results after they are computed and frees them after consumption by the ensuing layers
    - requires a complex, multi-stage tuning process
    - incurs noticeable performance degradation
    - throughput and trainability are still bottlenecked by the layer that produces the largest intermediate results

### 2.4 Opportunities and Challenges with NVLink
- generated data size: size of the intermediate results generated by a layer
- offloadable data size: size of data that can be offloaded, obtained as measured NVLink peak bandwidth × profiled execution time of the layer (see the back-of-the-envelope sketch at the end of this section)
- offloading all of VGG's intermediate results with the previous layer-wise offloading technique would incur a heavy performance penalty
- in partial layer-wise offloading of ResNet, the largest intermediate results often have the smallest budget for offloading

![](https://i.imgur.com/K27Tds4.png)

- two improvements over the prior art of layer-wise memory offloading
    - the ability to spread memory offloading and prefetching across multiple layers
    - break the bottleneck down into smaller pieces and spread them across the forward pass, keeping them far away from each other
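As a quick illustration of the per-layer offloading budget this definition implies: the 34.1 GB/s figure is the NVLink bandwidth reported later in Section 6.1, while the 5 ms layer execution time is an assumed number for illustration only.

```python
# Back-of-the-envelope check of the "offloadable data size" definition above.
# 34.1 GB/s is the NVLink peak bandwidth from Section 6.1; the 5 ms layer
# execution time is an assumed, illustrative value.
nvlink_bw_gb_s = 34.1        # measured NVLink peak bandwidth (GB/s)
layer_time_s = 5e-3          # profiled execution time of one layer (assumed)

offloadable_mb = nvlink_bw_gb_s * 1024 * layer_time_s
print(f"offloadable data size ~= {offloadable_mb:.0f} MB per 5 ms layer")  # ~175 MB
```

Any layer whose generated intermediate results exceed this budget cannot hide its transfer behind its own computation, which is why spreading offloads across multiple layers matters.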
## 3. Split Convolutional Neural Network
- Split-CNN splits the computation of window-based operations into multiple small pieces
- it does not respect data dependencies across layers, thus changing the semantics of the NN

### 3.1 Single Layer Formulation
- concepts and notations
    - a tensor is a generalized matrix (multi-dimensional array)
    - Op(X,k,s,p) with window of operation k, stride s and padding p denotes a window-based operation (p consists of p~b~ and p~e~, the padding at the beginning and the end of the spatial dimension W)
    - Split~D~(T, (s~0~, · · · , s~N−1~)) denotes partitioning tensor T along dimension D following the N-tuple (s~0~, · · · , s~N−1~), where s~i~ denotes the index of the starting element of the *i*th part (Split~D~(T, (s~0~, · · · , s~N−1~))[i] = *i*th partition)
    - [T~0~, · · · ,T~n~]~D~ denotes the concatenation of multiple tensors along dimension D
1. partition the output into (O~0~, · · · ,O~N-1~), where O~i~ denotes the starting element of the *i*th partition along the spatial dimension W
2. find an input split I = (I~0~, · · · ,I~N-1~) such that, with paddings P = (P~0~, · · · ,P~N-1~), Op(Split~W~(T,I)[i], k, s, p~i~) produces an output of the desired size O~i+1~ - O~i~
    - I~i~ must lie within the closed interval [lb(I~i~), ub(I~i~)] (corresponding to splitting before the element that produces the first element of O~i+1~, and after the element that produces the last element of the current O~i~)

![](https://i.imgur.com/xnubU71.png)

3. compute the proper padding p~i~ for each input patch X~i~

![](https://i.imgur.com/HedjX7c.png)

- with an arbitrarily chosen output partition scheme O and a suitably computed input scheme I (based on steps 1 & 2), the operation X = Op(X,k,s,p) can be reformulated as follows (see the PyTorch sketch at the end of Section 3)

![](https://i.imgur.com/MiK8ANa.png)
![](https://i.imgur.com/BR5cW5b.png)

### 3.2 Multi-Layer Formulation
- If the split scheme of the *m*th layer's output O^m^ equals the split scheme of the (*m+1*)th layer's input I^m+1^, splitting becomes a multi-layer construct with no communication between individual patches

### 3.3 Stochastic Splitting
- The intuition is to prevent the network from exploiting the split structure of Split-CNN during training, so that the trained weights still perform well with the original unsplit network architecture at test/inference time in production.
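A minimal PyTorch sketch of the single-layer formulation above, for a 2D convolution split along W only with an even output partition; the helper `split_conv_w` and the concrete shapes in the sanity check are illustrative assumptions, not code from the paper. Note that this exact single-layer version recomputes the overlapping input regions per patch, whereas the multi-layer Split-CNN of Section 3.2 reuses one split scheme across layers with no communication between patches.

```python
import torch
import torch.nn.functional as F

def split_conv_w(x, weight, k, s, p_b, p_e, n_splits):
    """Compute conv2d(x) by splitting the output along W into n_splits patches."""
    W = x.shape[-1]
    out_w = (W + p_b + p_e - k) // s + 1
    # step 1: partition the output evenly; O[i] = first output index of patch i
    O = [i * out_w // n_splits for i in range(n_splits)] + [out_w]
    patches = []
    for i in range(n_splits):
        # step 2: input split -- the indices of x needed for outputs [O[i], O[i+1])
        start = max(0, O[i] * s - p_b)
        end = min(W, (O[i + 1] - 1) * s - p_b + k)
        # step 3: per-patch padding compensates for clamping at the borders
        pad_b = max(0, p_b - O[i] * s)
        pad_e = max(0, (O[i + 1] - 1) * s - p_b + k - W)
        xi = F.pad(x[..., start:end], (pad_b, pad_e))
        patches.append(F.conv2d(xi, weight, stride=(1, s)))
    # concatenating the patch outputs along W reproduces the unsplit result
    return torch.cat(patches, dim=-1)

# sanity check against the unsplit convolution (shapes are illustrative)
x = torch.randn(1, 3, 8, 32)
w = torch.randn(4, 3, 1, 3)          # 1x3 kernel, i.e. window only along W
ref = F.conv2d(F.pad(x, (1, 1)), w)  # k=3, s=1, p_b=p_e=1
out = split_conv_w(x, w, k=3, s=1, p_b=1, p_e=1, n_splits=4)
assert torch.allclose(out, ref, atol=1e-4)
```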
## 4. Heterogeneous Memory Management System
- Proposed to intelligently schedule data offloading and prefetching on NVLink-enabled devices and to exploit the memory-friendliness of Split-CNN
- two key terms
    - computation graph: directed acyclic graph
    - tensor storage object (TSO): contiguous region of memory storage space used by one or more tensors to store their results
- five-step method of planning memory usage for the computation graph

![](https://i.imgur.com/b2EilQ7.png)

### 4.1 Splitting and Graph Generation
- first step
    - split the training model with depth *d* (percentage of convolutional layers to break apart) and tuple (*h*,*w*) (number of splits in each spatial dimension)
    - minimal *d*, *h*, *w* can be searched at coarse granularity and can be overestimated to simplify or eliminate the tuning process, since the final model accuracy degrades only slowly as splitting becomes more aggressive
- second step
    - serialize the computation by topologically sorting the compute nodes, then generate the serialized back-propagation graph and append it to the serialized graph

### 4.2 Storage Assignment and Optimization
- third step
    - assign each tensor a tensor storage object and keep a reference counter for each tensor storage object
- two optimizations
    - In-place ReLU: safely replace the input tensor's TSO with the output tensor's if the reference counter indicates that no other tensor references the TSO of the layer's input tensor
    - Summation Error Storage Object Sharing: allow all error terms to occupy the same TSO, since when the chain rule is applied to obtain the back-propagated error terms, all error terms have the same value (for a summation $y=\sum_i x_i$, $\frac{\partial y}{\partial x_i}=1 \Rightarrow \frac{\partial E}{\partial x_i}=\frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial x_i}=\frac{\partial E}{\partial y}$)

### 4.3 Offload and Prefetch Planning
- fourth step
    - derive the optimal offloading and prefetching scheme to offload the largest amount of memory without hurting performance
- two stages (a rough planning sketch appears at the end of Section 5)
    - profiling stage: obtain the profiled execution time of each layer/operation
    - memory planning stage: decide when to end the offload and when to start the prefetch by calculating the size of memory transfer that does not slow down the computation

![](https://i.imgur.com/UQuOcCy.png)

### 4.4 Static Memory Planning
- fifth step
    - statically plan memory allocation and deallocation
- three memory pools
    - host general-purpose memory pool
    - device parameter memory pool
    - device general-purpose memory pool
- use a first-fit memory allocation strategy to allocate memory to each TSO for the minimum duration mandated by the reference counter and the offloading & prefetching scheme

## 5. Split-CNN Evaluation
- using AlexNet, VGG-19, ResNet-18, ResNet-50

### 5.1 Datasets
- CIFAR-10 with batch size 256 and ImageNet with batch size 128

### 5.2 Impact of Hyperparameters on Accuracy
- three tunable hyperparameters
    - depth of the split
      ![](https://i.imgur.com/uwKT2IG.png)
    - number of splits
      ![](https://i.imgur.com/M76ppT5.png)
    - whether stochastic splitting is allowed
      ![](https://i.imgur.com/Aw38uCi.png)

### 5.3 Convergence Speed
- additional experiments using AlexNet and ResNet-50 on ImageNet and CIFAR

![](https://i.imgur.com/wbAhv8r.png)
![](https://i.imgur.com/JAYtgmY.png)
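A rough sketch of the memory-planning stage from Section 4.3, assuming a serialized graph with profiled per-layer execution times and per-tensor sizes. The greedy rule of spreading each transfer over as many neighboring layers as needed is my reading of the notes, not the paper's exact algorithm; the names (`plan_transfers`, `produced_at`, `consumed_at`) and the example numbers are illustrative.

```python
def plan_transfers(layer_time_s, tensor_size_mb, produced_at, consumed_at, bw_mb_s):
    """Return (offload_end, prefetch_start) serialized-layer indices per tensor."""
    plan = {}
    for t, size in tensor_size_mb.items():
        # offload: start right after the producing layer and overlap the transfer
        # with subsequent layers until bandwidth * compute time covers the tensor
        end, budget = produced_at[t], 0.0
        while budget < size and end + 1 < consumed_at[t]:
            end += 1
            budget += bw_mb_s * layer_time_s[end]
        # prefetch: walk backwards from the consuming layer until the transfer
        # can finish before the tensor is needed again (e.g. in the backward pass)
        start, budget = consumed_at[t], 0.0
        while budget < size and start - 1 > end:
            start -= 1
            budget += bw_mb_s * layer_time_s[start]
        plan[t] = (end, start)
    return plan

# illustrative numbers only: five serialized layers, one 150 MB activation
# produced by layer 0 and consumed again by layer 4
times = [0.004, 0.006, 0.005, 0.005, 0.004]
print(plan_transfers(times, {"act0": 150.0}, {"act0": 0}, {"act0": 4},
                     bw_mb_s=34.1 * 1024))        # -> {'act0': (1, 3)}
```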
## 6. Evaluation of Heterogeneous Memory Management System (HMMS)

### 6.1 Experimental Setup
- equipment: IBM Power System S822LC, NVIDIA Tesla P100 GPUs, IBM Power8 CPU
- NVLink peak bandwidth: 34.1 GB/sec

### 6.2 Efficacy of Memory Scheduling Algorithm
- using ResNet-50 and VGG-19 with batch size 64
- three memory plans
    - baseline memory plan
    - layer-wise memory allocation scheme
    - HMMS

![](https://i.imgur.com/YSbE04e.png)

- HMMS can plan longer offloading durations without eagerly synchronizing with the computation stream, which often causes slowdowns

![](https://i.imgur.com/H9aaaxg.png)

### 6.3 Split-CNN with HMMS
- When combined with HMMS, Split-CNN can achieve 6x larger batch size on VGG and 2x larger batch size on ResNet with a mere 1.5% and 4.9% throughput degradation respectively

![](https://i.imgur.com/0VM1CXI.png)

- two factors enable these techniques to contribute the most to the increased trainability of DNNs
    - cuDNN convolutions require a large amount of workspace; smaller convolutions can re-use the same workspace to reduce memory capacity pressure
    - Split-CNN breaks down the sudden increase in memory capacity requirements caused by a small subset of layers

### 6.4 Accelerating Distributed Training
- Split-CNN can be used to accelerate distributed training: by increasing the batch size, Split-CNN decreases the number of parameter updates (and thus network communication) in distributed training (see the arithmetic sketch at the end of these notes).

![](https://i.imgur.com/5eXzOhz.png)

## 7. Related Work
- Deep neural networks are over-parameterized and have large amounts of redundancy in the network models, resulting in inefficient memory usage
- approaches to address inefficient memory usage
    - use quantization or reduced precision when representing data
    - reduce network complexity and prune the size of the network
    - vDNN utilizes CPU DRAM as an external buffer
- these approaches can be used together with the ideas in this paper to reduce GPU memory usage

## 8. Discussion and Conclusion
- propose Split-CNN, an automatic instrumentation that enables new system optimizations to improve the memory scalability of training deep CNNs
- propose HMMS, which statically plans the memory allocation, deallocation, offloading and prefetching of deep neural network training, to fully unleash the potential of Split-CNN
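A quick arithmetic sketch of the communication argument in Section 6.4, assuming synchronous data-parallel training in which gradients are exchanged once per parameter update; the worker count is hypothetical and the 6x batch-size factor is borrowed from the VGG result in Section 6.3 purely for illustration.

```python
# Rough arithmetic behind Sec. 6.4: with synchronous data-parallel training,
# gradients are exchanged once per parameter update, so fewer updates per
# epoch means proportionally less network communication.
# Illustrative numbers only (ImageNet-sized dataset, 4 hypothetical workers).
import math

dataset_size = 1_281_167              # ImageNet training images
workers = 4                           # hypothetical number of data-parallel workers

for per_gpu_batch in (128, 6 * 128):  # baseline vs. the 6x batch enabled on VGG
    updates_per_epoch = math.ceil(dataset_size / (per_gpu_batch * workers))
    print(f"batch {per_gpu_batch}: {updates_per_epoch} updates/epoch")
# -> batch 128: 2503 updates/epoch; batch 768: 418 updates/epoch (~6x fewer exchanges)
```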