# Capuchin: Tensor-based GPU Memory Management for Deep Learning
###### paper origin: ASPLOS 2020
###### Link: [Paper](https://dl.acm.org/doi/10.1145/3373376.3378505)
## 1. Introduction
- problem
- current deep learning frameworks maintain large intermediate results produced in forward propagation in GPU memory until they are no longer needed in back propagation
- storing these intermediate results incurs high GPU memory consumption
- two current solutions: swapping and recomputing
- prior works use coarse-grained memory management and choose between swapping and recomputation through static analysis of the computation graph
- a layer may contain several nodes in the computation graph, so managing memory at graph-node (tensor) granularity is finer-grained than layerwise management
- coarse-grained information (e.g. layer type) can't quantify the overhead of a specific memory operation, making it difficult to prioritize the memory optimization candidates or choose between swapping and recomputation
- each layer's computation time varies significantly with hardware and input sizes, which limits the potential of memory optimization based on static analysis
- solution
- makes memory management decisions (e.g. eviction, timing of prefetching or recomputation) based on tensor access patterns tracked at runtime, with a unique ID for each tensor
- tensor accesses follow regular, repeated patterns across iterations of the training process
## 2. Background
- two execution modes used by deep learning frameworks
- eager mode (imperative programming)
- evaluates operations immediately without building graphs
- slower than executing the equivalent graph due to the overhead of interpreting Python code and the lack of graph optimizations
- convenient for deploying and debugging a new model, and popular in the academic community for that reason
- graph mode (declarative programming)
- computation graph is built before execution starts, and the actual computations are scheduled only when necessary
- certain optimizations are applied to the program when it is converted to the internal computation graph (a minimal example of both modes follows)
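For illustration, here is a minimal TensorFlow 2 example of the two modes (not from the paper); `tf.function` is TF's mechanism for tracing Python into a graph:

```python
import tensorflow as tf

x = tf.random.normal([128, 128])

# Eager mode: the matmul runs immediately and returns a concrete tensor.
y_eager = tf.matmul(x, x)

# Graph mode: @tf.function traces the Python function into a computation
# graph once; subsequent calls execute the optimized graph.
@tf.function
def matmul_chain(a):
    return tf.matmul(tf.matmul(a, a), a)

y_graph = matmul_chain(x)
```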
## 3. Observation and Motivation
- limitation of static analysis
- synchronization overhead is huge when the current layer's execution time is too short to hide the data transfer, and such overhead persists in subsequent iterations

- the execution time of different convolution layers within one neural network varies significantly
- convolution layers can't be categorically excluded as recomputation targets, since some of them have low execution time

- opportunity: regular tensor access pattern
- tensor accesses clearly follow a regular pattern while running a CNN task in graph mode
- other kinds of workloads (e.g. speech, NLP) also exhibit similar patterns

## 4. Capuchin Memory Management Mechanisms
### 4.1 Design Goals
- minimize performance overhead under memory oversubscription
- should be general and incur little code modification for different deep learning frameworks
### 4.2 Design Overview
- two phases during training
- measured execution
- observe dynamic tensor access pattern from the execution of the first iteration
- guided execution
- the training after the first iteration follows the memory management policy generated from the observed tensor access patterns
- tensor eviction
- the tensor is copied to CPU memory synchronously, and its ID is recorded for later access
- operates at tensor granularity and uses CPU memory as an extended buffer
- cannot handle the case where a single operation's inputs and outputs combined exceed GPU memory (extremely rare)
- eviction of one tensor is always triggered by the access of another tensor
- this is called passive mode, which performs on-demand swapping
- eviction takes time and is in the critical path of execution
- proactive eviction
- releases memory earlier for other tensors and hides the eviction overhead compared to on-demand swapping
- when an evicted tensor is accessed again
- consider two methods
- swapping
- recomputation
- need to select an appropriate time point to re-generate the tensor, to avoid increasing memory pressure or adding overhead on the critical path
- definition of terms
- evicted-access
- the tensor access that triggers the tensor's own eviction right after it is used in the computation
- back-access
- the first access to a tensor after it has been evicted from GPU memory, i.e., the access following the tensor's evicted-access
- when this access happens, the tensor may or may not already be back in GPU memory
- in-trigger
- a tensor's re-generation triggered earlier by another tensor's access to reduce overhead on the critical path; a bookkeeping sketch follows this list
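A minimal pure-Python sketch of this bookkeeping (passive-mode eviction plus the terms above). All names, including `PassiveSwapper` and the least-recently-accessed victim choice, are illustrative assumptions rather than Capuchin's actual policy or API; it simulates only the accounting, not real GPU transfers:

```python
from collections import OrderedDict

class PassiveSwapper:
    """Simulated on-demand (passive-mode) swapping at tensor granularity."""

    def __init__(self, gpu_capacity_bytes):
        self.capacity = gpu_capacity_bytes
        self.in_gpu = OrderedDict()   # tensor_id -> size, kept in access order
        self.in_cpu = {}              # tensor_id -> size (evicted tensors)

    def access(self, tensor_id, size):
        if tensor_id in self.in_cpu:
            # back-access: the tensor was evicted earlier, so it must be
            # swapped back in synchronously before computation proceeds.
            self.in_cpu.pop(tensor_id)
        self.in_gpu.pop(tensor_id, None)
        # Eviction is triggered by this access: make room on demand by
        # evicting other resident tensors (illustrative LRA order).
        while self.in_gpu and sum(self.in_gpu.values()) + size > self.capacity:
            victim, victim_size = self.in_gpu.popitem(last=False)
            self.in_cpu[victim] = victim_size   # victim's last access was its evicted-access
        self.in_gpu[tensor_id] = size

swapper = PassiveSwapper(gpu_capacity_bytes=8 * 2**30)
swapper.access("conv1/output", 2 * 2**30)
```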
### 4.3 Estimating Swap and Recomputation Benefit
- need accurate estimation of recomputation time and swapping time

- swapping time
- quantify how promising it is to swap a tensor using its free time (FT)
- $FreeTime=SwapInStartTime-SwapOutEndTime$
- $SwapOutEndTime=EvictedAccessTime+SwapTime$
- $SwapInStartTime=BackAccessTime-SwapTime$ (ideal)
- recomputation time
- the cost depends on the operation's algorithm, the device's computing capability, and the input sizes
- measures the time during measured execution by recording tensors’ lineage and runtime access time (comparing the access time of input and output tensors)
- indicates how favorable it is to recompute a tensor using Memory Saving Per Second (MSPS); a sketch of both metrics follows this list
- $MSPS=\frac{Memory\ Saving}{Recomputation\ Time}$
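A small sketch of the two estimates, assuming tensor access times and swap/recomputation times have already been measured; the function and argument names are illustrative:

```python
def free_time(evicted_access_time, back_access_time, swap_time):
    """FreeTime = SwapInStartTime - SwapOutEndTime.

    A non-positive value means the gap between the two accesses is too
    short to hide the swap-out plus swap-in behind computation.
    """
    swap_out_end_time = evicted_access_time + swap_time
    swap_in_start_time = back_access_time - swap_time   # ideal case
    return swap_in_start_time - swap_out_end_time

def msps(memory_saving_bytes, recomputation_time_s):
    """Memory Saving Per Second: how favorable recomputing a tensor is."""
    return memory_saving_bytes / recomputation_time_s

# Example: 20 ms per swap direction, accesses 30 ms apart -> FreeTime < 0,
# so swapping this tensor cannot be fully hidden.
ft = free_time(evicted_access_time=0.100, back_access_time=0.130, swap_time=0.020)
score = msps(256 * 2**20, 0.004)   # bytes freed per second of recomputation
```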
### 4.4 Determining Tensor Re-generation Cost
- select the appropriate time to initiate the re-generation operation
- swap
- traverse the tensor access list backward from the back-access to find the first tensor access whose time is earlier than SwapInStartTime (a sketch follows at the end of this subsection)
- in-trigger shouldn't be set within the time range of peak memory
- a swap cannot start until its preceding swap finishes, which can cause the swap-in to happen later than the in-trigger
- introduce feedback-driven adjustment to adjust the in-trigger time of a tensor dynamically at runtime
- recomputation
- recomputation is performed in an on-demand manner
- it also needs GPU computing resources, which are usually exhausted when GPU memory is insufficient, according to observation
- generate the recomputation cost of each candidate using two pieces of information
- the lifetime of tensors in the lineage, to determine whether a tensor can serve as the recomputation source
- if a tensor in the lineage is itself a recomputation candidate, it is assumed to be in GPU memory
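A sketch of the in-trigger search for swapping, assuming measured execution produced an access list of (tensor_id, time) pairs sorted by time; the names and the peak-memory handling are illustrative simplifications, and the feedback-driven adjustment is omitted:

```python
def find_in_trigger(access_list, back_access_index, swap_time, peak_start, peak_end):
    """Pick the tensor access that should trigger swap-in of an evicted tensor.

    Walk backward from the back-access and return the first access whose
    time is earlier than SwapInStartTime, skipping accesses that fall
    inside the peak memory usage period.
    """
    back_access_time = access_list[back_access_index][1]
    swap_in_start_time = back_access_time - swap_time
    for i in range(back_access_index - 1, -1, -1):
        tensor_id, t = access_list[i]
        if t >= swap_in_start_time:
            continue                      # too late to finish the swap-in
        if peak_start <= t <= peak_end:
            continue                      # don't add pressure at peak memory
        return tensor_id, t               # use this access as the in-trigger
    return None                           # fall back to on-demand swap-in

accesses = [("a", 0.00), ("b", 0.03), ("c", 0.06), ("x", 0.10)]
trigger = find_in_trigger(accesses, back_access_index=3, swap_time=0.05,
                          peak_start=0.055, peak_end=0.065)   # -> ("b", 0.03)
```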
### 4.5 Selecting Memory Optimization
- make choice between swap and recomputation
- swapping is the first choice
- if swapping can't perfectly hide its overhead, compare swapping and recomputation and choose the one with the smaller overhead

1. add two kinds of tensors into the eviction candidate set
- tensors whose access count is more than one
- tensors whose accesses span the peak memory usage period (i.e., whose lifetime covers the peak)
2. generate a ranked list based on the FreeTime (FT) between two consecutive accesses of each tensor, assuming the tensor is swapped out between the two accesses
3. move tensors with no overhead from the top of the ranked list into the eviction set
- the required memory reduction is obtained from the measured execution in passive mode
4. insert tensors into the eviction set iteratively, considering both swap and recomputation (a condensed sketch follows this list)
- this step runs only if the memory freed by the zero-overhead swaps is smaller than the required memory reduction
- two steps in iteration
- update: compute the MSPS for tensors in the current candidate set based on the current candidate set and eviction set
- selection: insert the tensor with the largest MSPS to eviction set and remove it from candidate set
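A condensed sketch of steps 2-4, treating the exposed swap overhead, recomputation overhead, and MSPS as precomputed inputs; all helper names are illustrative and the logic is simplified relative to the paper:

```python
def choose_evictions(candidates, required_reduction,
                     swap_overhead, recompute_overhead, msps):
    """Hybrid swap/recompute selection (simplified).

    candidates:         dict tensor_id -> memory saving in bytes
    swap_overhead:      dict tensor_id -> exposed swap time (0 = fully hidden)
    recompute_overhead: dict tensor_id -> recomputation time
    msps:               callable (tensor_id, current_plan) -> MSPS, so the
                        metric can be refreshed as the eviction set grows
    Returns {tensor_id: "swap" | "recompute"}.
    """
    plan, freed = {}, 0
    # Step 3: take zero-overhead swaps first (best FreeTime first).
    for tid in sorted(candidates, key=swap_overhead.get):
        if freed >= required_reduction:
            return plan
        if swap_overhead[tid] > 0:
            break
        plan[tid] = "swap"
        freed += candidates.pop(tid)
    # Step 4: still short of the target; repeatedly pick the candidate with
    # the largest (refreshed) MSPS and keep whichever option costs less.
    while freed < required_reduction and candidates:
        best = max(candidates, key=lambda t: msps(t, plan))
        plan[best] = "swap" if swap_overhead[best] <= recompute_overhead[best] else "recompute"
        freed += candidates.pop(best)
    return plan
```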

## 5. Capuchin Implementation

- two modules that the underlying framework needs to provide
- executor
- adopts on-the-fly lineage-based recomputation: given a tensor's ID, it can search for the closest available input tensors for recomputation
- allocator
- needs to support SwapOut and SwapIn, which allocate memory of equal size on the GPU or CPU given an address as a parameter
- extra fields in tensor structure
- five statuses (sketched below)
- IN, SWAPPING_OUT, OUT, SWAPPING_IN, RECOMPUTE
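A sketch of the per-tensor state this description suggests; the field names are assumptions for illustration, not the paper's exact structure:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional

class TensorStatus(Enum):
    IN = auto()            # resident in GPU memory
    SWAPPING_OUT = auto()  # asynchronous copy to CPU in flight
    OUT = auto()           # only the CPU copy exists
    SWAPPING_IN = auto()   # asynchronous copy back to GPU in flight
    RECOMPUTE = auto()     # will be re-generated from its lineage

@dataclass
class ManagedTensor:
    tensor_id: int
    status: TensorStatus = TensorStatus.IN
    cpu_address: Optional[int] = None                  # where the evicted copy lives
    lineage: List[int] = field(default_factory=list)   # input tensor IDs for recompute
```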

- Tensor Access Tracker (TAT)
- interacts with Executor, Tensor, and Allocator in deep learning frameworks
- supports on-demand swapping (passive mode)
- identifies the boundaries of tensor accesses between complete iterations

- tracks tensors' access patterns for the Policy Maker (a minimal tracker sketch follows)
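A minimal sketch of the access bookkeeping the TAT hands to the Policy Maker; the class and method names are assumptions, and host-side timestamps are used here even though the real implementation needs CUPTI (see the GPU-specific issues below):

```python
import time
from collections import defaultdict

class TensorAccessTracker:
    """Records the tensor access sequence during measured execution."""

    def __init__(self):
        self.trace = []                      # global (tensor_id, time) sequence
        self.per_tensor = defaultdict(list)  # tensor_id -> list of access times

    def on_access(self, tensor_id):
        # Host-side timestamp; kernel launches return before the GPU work
        # finishes, so the real system relies on CUPTI for accurate timing.
        t = time.perf_counter()
        self.trace.append((tensor_id, t))
        self.per_tensor[tensor_id].append(t)

    def access_count(self, tensor_id):
        return len(self.per_tensor[tensor_id])
```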
- Policy Maker (PM)
- determines memory optimization policy based on tensor access sequence
- optimizations
- decoupled computation and swapping
- decouple computation and data transfer at a tensor’s swapping-out and only synchronize the earliest unfinished swapping-out when OOM occurs
- overlaps more computation with data transfer (a bookkeeping sketch appears after the optimizations list)

- collective recomputation
- keeps as many recomputation target tensors as possible in one recomputation to reduce recomputation complexity
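A sketch of the decoupled swap-out bookkeeping: swap-outs are issued asynchronously and tracked in FIFO order, and only the earliest unfinished one is synchronized when an allocation fails. `transfer_handle.is_done()` / `.wait()` stand in for whatever completion mechanism the framework provides:

```python
from collections import deque

class SwapOutQueue:
    """Track in-flight asynchronous swap-outs in issue (FIFO) order."""

    def __init__(self):
        self.pending = deque()   # handles of unfinished swap-out transfers

    def issue(self, transfer_handle):
        # Computation continues immediately; the copy runs on a side stream.
        self.pending.append(transfer_handle)

    def on_allocation_failure(self):
        # Drop transfers that already finished, then block only on the
        # earliest unfinished swap-out before retrying the allocation;
        # later swap-outs keep overlapping with computation.
        while self.pending and self.pending[0].is_done():
            self.pending.popleft()
        if self.pending:
            self.pending.popleft().wait()
```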
- GPU-specific Issues
- access time profiling
- CPU processing runs in parallel with GPU computation, so a kernel call returns immediately after being enqueued into the GPU stream
- needs the CUDA Profiling Tools Interface (CUPTI) to get the real GPU processing time
- asynchronous and delayed operation
- kernels execute in GPU stream order (the CPU just enqueues them), so Capuchin's memory operations must also follow that execution sequence
- uses CUDA events to support swapping and two CUDA streams to perform swap-out/swap-in (a sketch follows this list)
- recomputation is naturally delayed since it is issued just like normal computation
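A minimal PyTorch-based sketch of this mechanism (PyTorch is used only for illustration; Capuchin itself is implemented inside TensorFlow): a CUDA event marks when the tensor is ready, dedicated streams carry the copies, and the compute stream waits only right before the back-access:

```python
import torch

assert torch.cuda.is_available()
compute_stream = torch.cuda.current_stream()
swap_out_stream = torch.cuda.Stream()   # device-to-host copies
swap_in_stream = torch.cuda.Stream()    # host-to-device copies

x = torch.randn(4096, 4096, device="cuda")                      # tensor to evict
cpu_buf = torch.empty(x.shape, dtype=x.dtype, pin_memory=True)  # pinned staging buffer

# Swap-out: mark the point where x is produced, then copy on a side stream
# so kernels on the compute stream keep running.
produced = torch.cuda.Event()
produced.record(compute_stream)
with torch.cuda.stream(swap_out_stream):
    swap_out_stream.wait_event(produced)      # don't read x before it is ready
    cpu_buf.copy_(x, non_blocking=True)
    swapped_out = torch.cuda.Event()
    swapped_out.record(swap_out_stream)
# (A real eviction would free x's GPU buffer here and reallocate it later;
#  it is kept allocated for brevity.)

# Swap-in at the in-trigger: prefetch on the other side stream; the compute
# stream waits on the event only right before the back-access needs x.
with torch.cuda.stream(swap_in_stream):
    swap_in_stream.wait_event(swapped_out)    # the swap-out must have finished
    x.copy_(cpu_buf, non_blocking=True)
    swapped_in = torch.cuda.Event()
    swapped_in.record(swap_in_stream)
compute_stream.wait_event(swapped_in)         # back-access can now use x
```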
## 6. Evaluation
- workloads

- baselines
- original TensorFlow (eager mode)
- vDNN
- OpenAI’s gradient-checkpointing
- memory mode: select a set of nodes to checkpoint to achieve $O(\sqrt{n})$ memory usage
- speed mode: checkpointing the outputs of all operations that are typically expensive to compute
- breakdown analysis
- DS: decoupled computation and swapping
- ATP: enable measured execution
- FA: dynamically adjust in-trigger time
- CR: collective recomputation

- improvement is very limited at batch size 400 when using swapping, since the total data transfer time is more than twice the computation time
- memory footprint reduction
- uses the maximum trainable batch size to represent the degree of memory footprint reduction
- graph mode

- eager mode

- Performance
- graph mode

- eager mode

## 7. Related Work
- deep learning framework support
- data parallelism: each GPU keeps its own network replica, reducing the per-GPU memory footprint by decreasing the per-GPU batch size
- model parallelism: splits the whole neural network across multiple GPUs, each GPU handling its own part of the computation
- orthogonal to Capuchin
- computation graph dependent techniques
- three categories of memory optimization works based on computation graph
- swapping, recomputation, compression
- algorithm level, orthogonal to Capuchin
- computation graph agnostic techniques
- virtualizes GPU memory by leveraging host memory as extended memory
- profiles the training process to derive a good memory swapping policy
## 8. Conclusions
- proposes a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation
- motivated by the observation that the access pattern to tensors is regular during training iterations
- exploits the full memory optimization space and offers fine-grained and flexible control of when and how to perform memory optimization techniques