# Training Deep Nets with Sublinear Memory Cost

##### Link: [Paper](https://arxiv.org/pdf/1604.06174.pdf)

###### tags: `model compiler`

## 1. Introduction

A systematic approach is proposed to reduce the memory consumption of deep neural network training.

- Focus on reducing the memory cost of storing intermediate results (feature maps) and gradients.
- Use computation graph analysis to do automatic in-place operation and memory sharing optimization, and propose a novel method to trade computation for memory.

The proposed algorithm costs **O(√n) memory** for feature maps to train an *n*-layer network with **double the forward-pass computational cost** (and **O(log n) memory** in the extreme case).

## 2. Related Works

- Theano and TensorFlow use reference-count-based recycling and runtime garbage collection to manage memory during training.
- MXNet uses a static memory allocation strategy prior to the actual computation.

Most existing frameworks focus on graph analysis to optimize computation after the gradient graph is constructed, but do not discuss the computation/memory trade-off.

- Gradient checkpointing technique: drop intermediate results and recompute them when needed.

The proposed algorithm applies this technique, making it possible to train a general deep neural network with sublinear memory cost. Moreover, the algorithm can be readily *combined with all the existing memory optimizations* in the computation graph, and needs no additional communication over PCI-E.

## 3. Memory Optimization with Computation Graph

A computation graph consists of operation nodes and edges that represent the dependencies between the operations. Fig. 1 gives an example of the computation graph of a two-layer fully connected neural network. The backward pathway in Fig. 1 represents the gradient calculation steps explicitly, so the gradient calculation step in training is simplified to just a forward pass on the entire computation graph (including the gradient calculation pathway). Fig. 1 also shows a possible allocation plan for the example two-layer neural network.

Two types of memory optimizations can be used:

- In-place operation: directly store the output values in the memory of an input value.
- Memory sharing: memory used by intermediate results that are no longer needed can be recycled and used by another node.

In Fig. 1, the first sigmoid transformation is carried out as an in-place operation to save memory, and its result is then reused by its backward operation. The storage of the softmax gradient is shared with the gradient of the first fully connected layer.

![](https://i.imgur.com/elH23mo.png)

The algorithm demonstrated in Fig. 2 traverses the graph in topological order, uses a counter to indicate the liveness of each record, and costs only O(n) computation time. An in-place operation can happen when no other pending operation depends on its input. Memory sharing happens when a recycled tag is reused by another node. The same algorithm can also serve as a dynamic runtime algorithm that traverses the graph and uses a garbage collector to recycle outdated memory. (A minimal sketch of the counter-based idea is given at the end of this section.)

![](https://i.imgur.com/oR8zheU.png)

As Fig. 2 shows, data dependencies cause a longer lifespan for each output and increase the memory consumption of big networks.

Important for deep learning frameworks:

- Declare the dependency requirements of gradient operators in a minimal manner.
- Apply liveness analysis on the dependency information and enable memory sharing.
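Below is a minimal Python sketch of the counter-based liveness idea described above. It is not the paper's or MXNet's actual allocator; `plan_memory` and its arguments are hypothetical names chosen for illustration. Each output carries a counter of pending consumers; when the counter hits zero, its memory tag is recycled into a pool that later nodes can reuse, and a node may run in place when it is the only remaining consumer of an input.

```python
def plan_memory(nodes, deps, consumers):
    """Assign a memory tag to each node's output using liveness counters.

    nodes:     all node ids (including graph inputs) in topological order.
    deps:      dict node -> list of input node ids ([] for graph inputs).
    consumers: dict node -> number of nodes that will read its output.
    Returns a dict node -> tag; nodes sharing a tag share memory.
    """
    free_pool = []              # tags whose contents are no longer needed
    pending = dict(consumers)   # liveness counter per produced output
    tag_of = {}
    next_tag = 0

    for node in nodes:
        # In-place: overwrite an input that has no other pending consumer.
        inplace_src = next((i for i in deps[node] if pending[i] == 1), None)
        if inplace_src is not None:
            tag_of[node] = tag_of[inplace_src]
        elif free_pool:         # memory sharing: reuse a recycled tag
            tag_of[node] = free_pool.pop()
        else:                   # otherwise allocate a fresh tag
            tag_of[node] = next_tag
            next_tag += 1

        # Consume inputs: decrement counters, recycle tags of dead outputs.
        for i in deps[node]:
            pending[i] -= 1
            if pending[i] == 0 and tag_of[i] != tag_of[node]:
                free_pool.append(tag_of[i])
    return tag_of
```

Because the traversal is a single pass over the topologically ordered nodes with constant work per edge, the planning cost stays linear in the graph size, matching the O(n) claim above.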
## 4. Trade Computation for Memory

### 4.1 General Methodology

To further reduce memory, the paper proposes to *drop some of the intermediate results* and recover them from extra forward computation when needed. More specifically, during the back-propagation phase, the dropped intermediate results can be recomputed by running forward from the closest recorded results.

Alg. 1 shows a simplified algorithm for a linear-chain feed-forward neural network (a code sketch of this segment idea appears at the end of Section 4.3).

- The neural network is divided into several segments, and the algorithm only remembers the output of each segment; all the intermediate results within each segment are dropped.
- Dropped results are recomputed at the segment level during back-propagation.
- Memory cost: the outputs of each segment + the maximum memory cost of doing back-propagation on any single segment.

![](https://i.imgur.com/qf9oLrv.png)

Problems:

- Users have to manually divide the graph and write a customized training loop.
- It cannot benefit from the other memory optimizations in [3](#3-Memory-Optimization-with-Computation-Graph).

Alg. 2 solves these problems by introducing a general gradient graph construction algorithm that uses essentially the same idea.

- A function m is specified on the nodes of the computation graph to indicate how many times a result can be recomputed.
- Recomputation is essentially duplicating (mirroring) the node.
- To specify the recomputation pattern in Alg. 2, the user only needs to set m(v) = 1 (or > 1, which leads to a recursive generalization discussed in [4.4](#44-More-General-View-Recursion-and-Subroutine)) for nodes within each segment and m(v) = 0 for the output node of each segment.
- It also outputs a traversal order for the computation so that memory usage can be optimized, and this order can help introduce control-flow dependencies for frameworks that rely on runtime allocation.

![](https://i.imgur.com/maASpfZ.png)

Fig. 3 shows an example of a memory-optimized gradient graph.

![](https://i.imgur.com/bsTZHOG.png)

### 4.2 Drop the Results of Low Cost Operations

Another general methodology is to drop the results of low-cost operations and keep the results that are time consuming to recompute.

- Convolutional neural networks: keep the result of convolution; drop the results of batch normalization, the activation function, and pooling.

### 4.3 An O(√n) Memory Cost Algorithm

Assume we divide the n-layer network into k segments; the memory cost to train this network is given as follows.

![](https://i.imgur.com/eZQvaeD.png)

- Setting k = √n gives a cost of O(2√n), which shows that this algorithm reduces the memory cost to be sub-linear.
- In the most general case, the memory cost of each layer is not the same, so we cannot simply set k = √n. However, the trade-off between storing the intermediate outputs and the recomputation cost of each stage still holds.

Alg. 3 does a greedy allocation given a single parameter B, the memory budget within each segment.

- Varying B gives various allocation plans that assign more memory either to the intermediate outputs or to the computation within each stage.
- When doing static memory allocation, we can get the exact memory cost of each allocation plan. Using this information to do a heuristic search over B, we can find an optimal memory plan that balances the two costs.
- We can also generalize this algorithm by considering the cost of running each operation, so that time-consuming results are kept when possible.

![](https://i.imgur.com/RTEP0q3.png)
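To make the segment idea from Alg. 1 concrete, here is a hedged Python sketch (not the paper's code): the chain of n layers is cut into k ≈ √n segments, only the activation entering each segment is stored during the forward pass, and each segment is re-run forward from its stored input when its backward pass is needed. The `.forward`/`.backward` layer interface is a hypothetical stand-in for whatever the framework provides.

```python
import math

def train_step_checkpointed(layers, x0, grad_out):
    """Backprop over a linear chain, storing only segment-boundary activations.

    layers:   list of objects with .forward(x) and .backward(x_in, g_out),
              where .backward returns the gradient w.r.t. the layer input
              (hypothetical interface, for illustration only).
    x0:       network input.
    grad_out: gradient of the loss w.r.t. the final output.
    """
    n = len(layers)
    k = max(1, int(math.sqrt(n)))          # number of segments ~ sqrt(n)
    step = math.ceil(n / k)
    bounds = list(range(0, n, step)) + [n]

    # Forward pass: keep only the activation entering each segment.
    seg_inputs = []
    x = x0
    for s in range(len(bounds) - 1):
        seg_inputs.append(x)
        for i in range(bounds[s], bounds[s + 1]):
            x = layers[i].forward(x)

    # Backward pass: recompute each segment's intermediates, then backprop.
    g = grad_out
    for s in reversed(range(len(bounds) - 1)):
        acts = [seg_inputs[s]]
        for i in range(bounds[s], bounds[s + 1]):
            acts.append(layers[i].forward(acts[-1]))   # the extra forward pass
        for i in reversed(range(bounds[s], bounds[s + 1])):
            g = layers[i].backward(acts[i - bounds[s]], g)
    return g  # gradient w.r.t. x0
```

The live memory is the k boundary activations plus one segment's worth of intermediates (about n/k), which is minimized at k = √n, matching the O(2√n) figure above, while each layer's forward is executed at most twice.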
### 4.4 More General View: Recursion and Subroutine

We can view each segment as a bulk operator that combines all the operations inside the segment. The idea is illustrated in Fig. 4.

- The combined operator calculates the gradient by executing over the sub-graph that describes its internal computation, which allows us to treat a series of operations as a subroutine.
- We can recursively apply the memory optimization scheme to each sub-graph, since the optimization within a sub-graph does not affect the external world.

![](https://i.imgur.com/yByntRU.png)

Assume that we store k intermediate results in the graph and apply the same strategy recursively when doing the forward and backward pass on each sub-path.

- Let g(n) be the memory cost of doing the forward and backward pass on an n-layer neural network.
- As a special case, setting k = 1 gives g(n) = O(log~2~n), which is remarkable since existing implementations take O(n) memory in feature maps to train an n-layer neural network.

![](https://i.imgur.com/RnSjo7P.png)
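For reference, the recursion behind these numbers can be written out as below (a hedged reconstruction consistent with the k = 1 special case stated above; the figure shows the paper's exact formula): keep k results at the current level, then recurse into sub-paths of length about n/(k+1).

$$
\begin{aligned}
  g(n) &= k + g\!\left(\frac{n}{k+1}\right) \;\Longrightarrow\; g(n) = O\!\left(k \log_{k+1} n\right),\\
  g(n) &= 1 + g\!\left(\frac{n}{2}\right) \;\Longrightarrow\; g(n) = O(\log_2 n) \quad (k = 1).
\end{aligned}
$$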
### 4.5 Guideline for Deep Learning Frameworks

It is helpful for deep learning frameworks to:

- Enable an option to drop the results of low-cost operations.
- Provide planning algorithms that produce efficient memory plans.
- Enable the user to set the mirror attribute in the computation graph for memory optimization (not strictly necessary).

## 5. Experiments

### 5.1 Experiment Setup

- Using MXNet:
    - statically allocate all the intermediate feature maps before computation;
    - enable reporting of the exact memory cost spent on feature maps;
    - the memory cost of parameters and temporary workspace memory (e.g. required by convolution) is not part of the reported memory cost;
    - record the runtime total memory cost by running training steps on a Titan X GPU.
- Compare the following memory allocation algorithms:
    - *no optimization*: directly allocate memory to each node in the graph without any optimization.
    - *inplace*: enable in-place optimization when possible.
    - *sharing*: enable in-place optimization as well as memory sharing. This represents all the system optimizations presented in [3](#3-Memory-Optimization-with-Computation-Graph).
    - *drop bn-relu*: apply all system optimizations and drop the results of batch norm and ReLU; this is only shown in the convolutional net benchmark.
    - *sublinear plan*: apply all system optimizations and use the plan search with Alg. 3 to trade computation for memory.

### 5.2 Deep Convolutional Network

- Using ResNet (but with increased depth of each residual stage), batch size = 32, input image shape = (3, 224, 224).
- A conv-bn-relu block is counted as one layer.
- The optimizations introduced in [3](#3-Memory-Optimization-with-Computation-Graph) help reduce the memory cost by a factor of two to three, but even then it is only possible to train a 200-layer ResNet on the best GPU available.
- The proposed algorithm gives a sub-linear trend in the number of layers, and can train a 1000-layer ResNet using less than 7 GB of GPU memory.

![](https://i.imgur.com/j7QKIkl.png)

### 5.3 LSTM for Long Sequences

- Unrolled a four-layer LSTM with 1024 hidden states for 64 time steps, batch size = 64, input = a 50-dimensional vector, output = softmax over 5000 classes.
- In-place optimization helps a lot in this experiment, since it enables adding weight gradients directly into a single memory cell, avoiding the allocation of gradient space at each timestamp.
- The sub-linear plan gives more than a 4x reduction over the optimized memory plan.

![](https://i.imgur.com/jugC0VJ.png)

### 5.4 Impact on Training Speed

- Because of the extra forward pass in the gradient calculation, the sublinear allocation strategy costs about 30% additional runtime compared to the normal strategy.

![](https://i.imgur.com/yRJEIIE.png)

## 6. Conclusion

- Proposed a systematic approach to reduce the memory consumption of intermediate feature maps when training deep neural networks.
- By combining techniques such as computation graph liveness analysis and trading computation for memory, an n-layer deep neural network can be trained with only O(√n) memory cost, paying no more than one extra forward computation per mini-batch.