# Split-CNN: Splitting Window-based Operations in Convolutional Neural Networks for Memory System Optimization
###### tags: `GPUs`
###### paper origin: ASPLOS’19
###### paper: [Link](https://dl.acm.org/doi/10.1145/3297858.3304038)
## 1. Introduction
### problem
- The memory bottleneck of training deep convolutional neural networks worsens as the trend in deep learning moves toward deeper and larger networks.
- Three challenging trends contributing to the memory bottleneck
- prevalence of memory-bound layers
- increasing batch sizes
- increasing model complexity
### solution
- Propose Split-CNN, a generic instrumentation of regular CNN models that breaks apart the memory bottleneck
- Introduce a new heterogeneous memory management system (HMMS) that optimally schedules memory allocation, deallocation, offloading, and prefetching without requiring any tuning of the network models, addressing the challenge of memory-bound layers
- Combine Split-CNN and HMMS to address the challenge of increasing batch sizes
## 2. Background and Motivation
### 2.1 Training a Deep Neural Network
- The primary reason behind the high memory capacity requirement of training a DNN is that intermediate values computed in the forward pass are often required again in the backward pass.
### 2.2 Challenges to the Memory Capacity Constraint
- More Memory Bound Layers
- Recent advances (e.g., batch normalization layers, fast convolution algorithms) and their quick adoption have made traditional techniques (e.g., memory offloading, CUDA managed unified memory) suffer heavy performance degradation
- Larger Batch Size
- Recent studies emphasize the importance of using very large batch sizes to improve the quality of the gradients and thus accelerate convergence
- Higher Model Complexity
- Deep learning models only grow more complex, which translates directly into increased demand for accelerator memory capacity
### 2.3 Limitation of Layer-wise Memory Allocation
- vDNN offloads intermediate results after they are computed and frees them after they are consumed by the ensuing layers
- requires a complex, multi-stage tuning process
- incurs noticeable performance degradation
- throughput and trainability are still bottlenecked by the layer that produces the largest intermediate results
### 2.4 Opportunities and Challenges with NVLink
- generated data size: size of intermediate results generated by the layer
- offloadable data size: size of data that can be offloaded, obtained as measured NVLink peak bandwidth × profiled execution time of the layer (see the worked example below)
- offloading all of VGG's intermediate results with the previous layer-wise offloading technique incurs a heavy performance penalty
- in partial layer-wise offloading of ResNet, the largest intermediate results often have the smallest budget for offloading
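- For scale (using the 34.1 GB/s measured NVLink peak bandwidth reported in Section 6.1 and a hypothetical layer time): a layer that executes for 10 ms can hide at most roughly 34.1 GB/s × 0.01 s ≈ 341 MB of offloading behind its own computation, so any layer producing more than that cannot fully overlap its offload under purely layer-wise offloading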

- two improvements over the prior-art layer-wise memory offloading technique
- the ability to spread memory offloading and prefetching across multiple layers
- the ability to break the bottleneck down into smaller pieces and spread them across the forward pass so that they stay far apart from each other
## 3. Split Convolutional Neural Network
- Split-CNN splits the computation of window-based operations into multiple small pieces
- it does not respect data dependencies across layers, thus changing the semantics of the NN
### 3.1 Single Layer Formulation
- concepts and notations
- tensor is a generalized matrix (multi-dimensional array)
- Op(X,k,s,p) with window of operation k, stride s, and padding p denotes a window-based operation (p consists of p~b~ and p~e~, the padding at the beginning and at the end of the spatial dimension W)
- Split~D~(T, (s~0~, · · · , s~N−1~)) denotes partitioning tensor T along dimension D following the N-tuple (s~0~, · · · , s~N−1~), where s~i~ denotes the index of the starting element of the *i*th part (Split~D~(T, (s~0~, · · · , s~N−1~))[i] = the *i*th partition)
- [T~0~, · · · ,T~n~]~D~ denotes the concatenation of multiple tensors along dimension D
1. partition the output into (O~0~, · · · , O~N−1~), where O~i~ denotes the starting element of the *i*th partition along the spatial dimension of the output tensor
2. find an input split I = (I~0~, · · · , I~N−1~) such that, with paddings P = (p~0~, · · · , p~N−1~), Op(Split~W~(X,I)[i], k, s, p~i~) produces an output of the desired size O~i+1~ − O~i~
- I~i~ must lie within the closed interval [lb(I~i~), ub(I~i~)], which corresponds to splitting before the element that produces the first element of O~i+1~ and after the element that produces the last element of the current O~i~

3. compute proper padding p~i~ for each input patch X~i~

- with an arbitrarily chosen output partition scheme O and a suitably computed input scheme I (based on steps 1 & 2), the operation Op(X,k,s,p) can be reformulated as the concatenation of per-patch operations [Op(Split~W~(X,I)[0], k, s, p~0~), · · · , Op(Split~W~(X,I)[N−1], k, s, p~N−1~)]~W~ (sketched in code below)
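A minimal 1-D NumPy sketch of this single-layer formulation, using a toy max-window op; the helper names (`plan_split`, `window_op`) and the particular choice of I~i~ inside its [lb, ub] interval are ours, and, as noted above, only the output *shape* is preserved, not the values near internal patch boundaries:

```python
import numpy as np

def out_len(L, k, s, pb, pe):
    """Output length of a window-based op on a length-L input."""
    return (L + pb + pe - k) // s + 1

def window_op(x, k, s, pb, pe):
    """Toy window-based op Op(x,k,s,p): max over each window of a padded 1-D input."""
    xp = np.concatenate([np.full(pb, -np.inf), x, np.full(pe, -np.inf)])
    return np.array([xp[j * s:j * s + k].max()
                     for j in range(out_len(len(x), k, s, pb, pe))])

def plan_split(W, k, s, pb, pe, O):
    """Given an output partition O (starting output indices, O[0] == 0), pick
    input split points I and per-patch paddings p_i so that patch i produces
    exactly O[i+1] - O[i] outputs (steps 1-3 above; shape equivalence only)."""
    bounds = list(O) + [out_len(W, k, s, pb, pe)]
    # one valid choice of I_i inside [lb, ub]: start patch i at the input
    # element sitting under the first window of output O[i]
    I = [0] + [min(W, max(0, o * s - pb)) for o in O[1:]] + [W]
    plan = []
    for i in range(len(O)):
        L, n_i = I[i + 1] - I[i], bounds[i + 1] - bounds[i]
        pb_i = pb if i == 0 else 0
        pe_i = max(0, (n_i - 1) * s + k - L - pb_i)   # smallest padding giving n_i outputs
        plan.append((I[i], I[i + 1], pb_i, pe_i))
    return plan

x = np.arange(10, dtype=float)
k, s, pb, pe = 3, 2, 1, 1
pieces = [window_op(x[a:b], k, s, pbi, pei)
          for a, b, pbi, pei in plan_split(len(x), k, s, pb, pe, O=(0, 3))]
y_split, y_full = np.concatenate(pieces), window_op(x, k, s, pb, pe)
assert y_split.shape == y_full.shape      # same shape; boundary values may differ
```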


### 3.2 Multi-Layer Formulation
- If the split scheme of the *m*th layer's output O^m^ equals the split scheme of the (*m*+1)th layer's input I^m+1^, splitting becomes a multi-layer construct with no communication between individual patches
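Restated informally under the notation of 3.1 (our paraphrase, not the paper's exact equation): with O^m^ = I^m+1^, output patch *i* of layer *m* is exactly input patch *i* of layer *m*+1, so the two split layers compose patch by patch and no patch ever reads data produced by another patch:

$$
\Big[\,\mathrm{Op}^{m+1}\big(\mathrm{Op}^{m}(\mathrm{Split}_W(X, I^{m})[i],\,k^{m},s^{m},p^{m}_{i}),\;k^{m+1},s^{m+1},p^{m+1}_{i}\big)\,\Big]_{W},\qquad i=0,\dots,N-1
$$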
### 3.3 Stochastic Splitting
- The intuition is to prevent the network from exploiting the fixed split structure of Split-CNN during training, so that the trained weights also perform well with the original unsplit network architecture at testing/inference time in production.
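A small sketch of how such randomization could look, assuming the split boundaries are re-drawn every training iteration (the paper's exact sampling scheme is not restated here, so the distribution below is illustrative):

```python
import numpy as np

def sample_split_scheme(out_len, n_splits, rng):
    """Draw a random output partition O = (0, O_1, ..., O_{N-1}) for one
    training iteration, so patch boundaries keep moving between iterations."""
    cuts = rng.choice(np.arange(1, out_len), size=n_splits - 1, replace=False)
    return (0, *sorted(int(c) for c in cuts))

rng = np.random.default_rng(0)
for step in range(3):                          # re-split on every iteration
    O = sample_split_scheme(out_len=56, n_splits=4, rng=rng)
    # ... run this iteration's split forward/backward pass with scheme O ...
    print(step, O)
```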
## 4. Heterogeneous Memory Management System
- Proposed to intelligently schedule data offloading and prefetching on NVLink-enabled devices and to exploit the memory-friendliness of Split-CNN
- two key terms
- computation graph: directed acyclic graph
- tensor storage object (TSO): contiguous region of memory used by one or more tensors to store their results
- five-step method of planning memory usage for the computation graph

### 4.1 Splitting and Graph Generation
- first step
- split the training model according to split depth *d* (percentage of convolutional layers to break apart) and tuple (*h*, *w*) (number of splits in each spatial dimension)
- the minimal *d*, *h*, *w* can be searched at coarse granularity and can be overestimated to simplify or eliminate the tuning process, since the final model accuracy degrades only slowly as the model is split more aggressively
- second step
- serialize the computation by topologically sorting the compute nodes, then generate the serialized back-propagation graph and append it to the serialized graph (as sketched below)
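A schematic of these two steps, assuming the computation graph is given as a `node -> predecessors` mapping (the node names and the `grad(...)` labels are ours):

```python
from graphlib import TopologicalSorter

def serialize(forward_graph):
    """Topologically sort the forward compute nodes, then append the
    corresponding back-propagation nodes in reverse order, yielding one
    serialized schedule for the whole training iteration."""
    fwd = list(TopologicalSorter(forward_graph).static_order())
    bwd = [f"grad({n})" for n in reversed(fwd)]
    return fwd + bwd

# toy forward graph: conv1 -> relu1 -> conv2 -> loss
graph = {"conv1": set(), "relu1": {"conv1"}, "conv2": {"relu1"}, "loss": {"conv2"}}
print(serialize(graph))
# ['conv1', 'relu1', 'conv2', 'loss', 'grad(loss)', 'grad(conv2)', 'grad(relu1)', 'grad(conv1)']
```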
### 4.2 Storage Assignment and Optimization
- third step
- assign each tensor a tensor storage object and keep a reference counter for each tensor storage object
- two optimizations
- In-place ReLU: the output tensor can safely share the input tensor's TSO if the reference counter indicates that no other tensor references that TSO (see the sketch after this list)
- Summation Error Storage Object Sharing: all error terms of a summation can occupy the same TSO, since when the chain rule is applied to obtain the back-propagated error terms they all have the same value (summation $y=\sum_ix_i,\ \frac{\partial y}{\partial x_i}=1 \Rightarrow\frac{\partial E}{\partial x_i}=\frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial x_i}=\frac{\partial E}{\partial y}$)
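A schematic of reference-counted TSO assignment with the in-place ReLU optimization; the layer description tuples and the simplified counting rule are ours, not HMMS's actual data structures:

```python
from dataclasses import dataclass

@dataclass
class TSO:
    """Tensor storage object: one contiguous region shared by one or more tensors."""
    nbytes: int
    refcount: int = 0

def assign_storage(layers):
    """Give every layer output a TSO; a ReLU output reuses its input's TSO
    whenever the reference counter says no other tensor references it."""
    tso_of = {}
    for name, kind, inp, nbytes in layers:        # (name, op kind, input, output bytes)
        if kind == "relu" and inp is not None and tso_of[inp].refcount == 1:
            tso_of[name] = tso_of[inp]            # in-place ReLU: share storage
        else:
            tso_of[name] = TSO(nbytes)
        tso_of[name].refcount += 1
    return tso_of

layers = [("conv1", "conv", None, 4 << 20),
          ("relu1", "relu", "conv1", 4 << 20),
          ("conv2", "conv", "relu1", 8 << 20)]
storage = assign_storage(layers)
assert storage["relu1"] is storage["conv1"]       # ReLU ran in place
```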
### 4.3 Offload and Prefetch Planning
- fourth step
- derive the optimal offloading and prefetching scheme, offloading the largest amount of memory without hurting performance
- two stages
- profiling stage: obtain a profiled execution time for each layer/operation
- memory planning stage: decide when each offload must end and when each prefetch must start by calculating how much memory can be transferred without slowing down the computation (see the sketch below)
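A much-simplified sketch of the planning stage, assuming the profiled per-layer times and the measured NVLink bandwidth from Section 6.1; the function name and return convention are ours:

```python
NVLINK_BW = 34.1e9        # bytes/s, the peak bandwidth measured in Section 6.1

def plan_transfers(profile, tensor_bytes, produced_at, consumed_at):
    """Spread a tensor's offload over the layers that run after it is produced,
    and start its prefetch just early enough to finish before it is consumed.
    Returns (layer by which the offload completes, layer at which prefetch starts)."""
    need = tensor_bytes / NVLINK_BW               # link time the transfer requires
    t, end = 0.0, produced_at + 1                 # offload: walk forward in time
    while end < len(profile) and t < need:
        t += profile[end]
        end += 1
    t, start = 0.0, consumed_at                   # prefetch: walk backward in time
    while start > 0 and t < need:
        start -= 1
        t += profile[start]
    return end - 1, start

# toy profile (seconds): forward layers 0..4, then their backward mirror 5..9
profile = [0.004, 0.010, 0.003, 0.006, 0.005, 0.005, 0.006, 0.003, 0.010, 0.004]
print(plan_transfers(profile, tensor_bytes=200e6, produced_at=1, consumed_at=8))
```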

### 4.4 Static Memory Planning
- fifth step
- statically plan memory allocation and deallocation
- three memory pools
- host general purpose memory pool
- device parameter memory pool
- device general purpose memory pool
- use a first-fit memory allocation strategy to allocate memory to each TSO for the minimum duration mandated by the reference counter and the offloading & prefetching scheme (see the sketch below)
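A simplified first-fit sketch (not HMMS's actual allocator): each request is a TSO's size plus the [alloc, free) step interval implied by its reference counter and offload/prefetch schedule, and two TSOs may share address space only when their lifetimes do not overlap:

```python
def first_fit_plan(requests):
    """Statically place each request (size, alloc_step, free_step) at the first
    byte offset of a single pool where it fits next to the requests whose
    lifetimes overlap its own; returns the offsets and the peak pool size."""
    placed, offsets = [], []
    for size, alloc, free in requests:
        live = sorted((o, s) for o, s, a, f in placed if alloc < f and a < free)
        off = 0
        for o, s in live:                         # scan gaps in address order
            if off + size <= o:
                break                             # first gap large enough
            off = max(off, o + s)
        placed.append((off, size, alloc, free))
        offsets.append(off)
    return offsets, max((o + s for o, s, _, _ in placed), default=0)

# (bytes, first step the TSO is needed, first step it is no longer needed)
requests = [(100, 0, 4), (50, 1, 3), (50, 3, 6), (100, 5, 8)]
print(first_fit_plan(requests))                   # peak 150 < 300 total bytes
```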
## 5. Split-CNN Evaluation
- using AlexNet, VGG-19, ResNet-18, ResNet-50
### 5.1 Datasets
- CIFAR-10 with batch size 256 and ImageNet with batch size 128
### 5.2 Impact of Hyperparameters on Accuracy
- three tunable hyperparameters
- depth of the split

- number of splits

- whether stochastic splitting is allowed

### 5.3 Convergence Speed
- additional experiments using AlexNet and ResNet-50 on ImageNet and CIFAR


## 6. Evaluation of Heterogeneous Memory Management System (HMMS)
### 6.1 Experimental Setup
- equipment: IBM Power System S822LC, NVIDIA Tesla P100 GPUs, IBM Power8 CPU
- NVLink peak bandwidth: 34.1 GB/s
### 6.2 Efficacy of Memory Scheduling Algorithm
- using ResNet-50, VGG-19, batch size = 64
- three memory plans
- baseline memory plan
- layer-wise memory allocation scheme
- HMMS

- HMMS can plan longer offloading durations without eagerly synchronizing with the computation stream, which often results in slowdowns

### 6.3 Split-CNN with HMMS
- When combined with HMMS, Split-CNN can achieve a 6x larger batch size on VGG and a 2x larger batch size on ResNet with a mere 1.5% and 4.9% throughput degradation, respectively

- two factors enable these techniques to contribute the most to the increased trainability of DNNs
- cuDNN convolutions require a large amount of workspace; smaller convolutions can reuse the same workspace, reducing memory capacity pressure
- Split-CNN breaks down such sudden increases in memory capacity requirements caused by a small subset of layers
### 6.4 Accelerating Distributed Training
- Split-CNN can be used to accelerate distributed training: by increasing the batch size, it decreases the number of parameter updates (and thus the amount of network communication) in distributed training.

## 7. Related Work
- Deep neural networks are over-parameterized and contain large amounts of redundancy, resulting in inefficient memory usage
- approaches to address inefficient memory usage
- use quantization or a reduced precision when representing data
- reduce network complexity and prune size of network
- vDNN utilizes CPU DRAM as an external buffer
- these approaches can be used together with ideas in this paper to reduce GPU memory usage
## 8. Discussion and Conclusion
- propose Split-CNN, an automatic instrumentation that enables new system optimizations to improve the memory scalability of training deep CNNs.
- propose HMMS, which statically plans the memory allocation, deallocation, offloading and prefetching of deep neural network training, to fully unleash the potential of Split-CNN.