# SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping
###### tags: `GPUs`
###### paper origin: ASPLOS-2020
###### paper: [link](https://dl.acm.org/doi/10.1145/3373376.3378530)
## Introduction
### Problem
* It is difficult to continue the trend of increasing model size due to limited GPU memory.
* Existing work has sought to reduce memory consumption by using **lower-precision floating points** or **compressing model parameters**, which can affect model accuracy. Other solutions **discard intermediate data and recompute** them later when needed, which cannot support models whose parameters alone exceed GPU memory.
* Existing work that supports swapping between GPU and CPU memory (TFLMS, vDNN, SuperNeurons) only handles certain models and does not achieve satisfactory performance.
### Solution
* SwapAdvisor, a general swapping system which can support training and inference of various kinds of large models with limited GPU memory.
* SwapAdvisor uses a custom-designed genetic algorithm to search the space of all memory allocations and operator schedules.
## SwapAdvisor Design

* Given a dataflow graph, SwapAdvisor picks any legitimate schedule and memory allocation based on the graph as initial values, and passes them to the swap planner to determine what tensors to swap in/out and when.
* The result of the swap planner is an **augmented dataflow graph** which includes extra swap-in and swap-out operators and additional control flow edges.
* For optimization, the augmented graph is passed to a simulator to estimate the overall execution time.
* SwapAdvisor's GA-based search measures the performance of many memory allocation/schedule combinations and proposes new allocation/schedule candidates for the swap planner, as sketched below.
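A minimal sketch of this outer search loop in Python; the planner, simulator, and GA steps are passed in as illustrative stand-ins, not SwapAdvisor's actual API:

```python
def ga_search(graph, init_population, plan_swaps, simulate, evolve,
              generations=100):
    """Sketch of SwapAdvisor's outer loop: every candidate is a
    (schedule, allocation) pair; the swap planner turns it into an
    augmented graph, the simulator scores it, and the GA evolves the
    population toward faster candidates."""
    population = init_population(graph)
    best = None
    for _ in range(generations):
        # Score each candidate via swap planner + simulator (lower is better).
        scored = sorted(population,
                        key=lambda cand: simulate(plan_swaps(graph, *cand)))
        best = scored[0]
        population = evolve(scored)  # selection + crossover + mutation
    return best
```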
### Operator schedule and memory allocation
* Operator schedule
* an operator schedule is any topological sort ordering of nodes in a dataflow graph
* SwapAdvisor uses 3 GPU streams:
1. performing GPU computation
2. swapping tensors out to CPU memory
3. swapping tensors in from CPU memory
* Memory allocation
* The **memory pool** consists of a number of different size-classes, each of which is assigned a certain number of fixed-size tensor objects.
* Given a dataflow graph G, a memory allocation scheme can be defined by specifying two things:
1. the mapping from each tensor size in G to some size classes
2. the set of supported size classes as well as the number of tensor objects assigned in each class
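As a toy illustration of such a scheme (the names and sizes here are made up, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Allocation:
    # Maps each tensor size in the graph to the size-class that serves it,
    # and records how many fixed-size objects each size-class holds.
    size_class: dict  # tensor size (MB) -> size-class (MB)
    count: dict       # size-class (MB) -> number of tensor objects

    def footprint_mb(self):
        # The whole pool must fit within the GPU memory budget.
        return sum(cls_mb * n for cls_mb, n in self.count.items())

# Four tensor sizes served by two size-classes: 1MB and 4MB.
alloc = Allocation(size_class={1: 1, 2: 4, 3: 4, 4: 4}, count={1: 8, 4: 3})
assert alloc.footprint_mb() == 8 * 1 + 3 * 4  # 20 MB total pool size
```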
### Swap planning
#### Which tensors to swap-out?
* The swap planner uses **Belady's strategy** to pick the tensor that will not be needed for the longest time in the future to swap out.
* Upon encountering memory pressure when allocating a tensor of size s, the planner chooses a tensor from the same size-class as s to swap out, as in the sketch below.
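A minimal sketch of this eviction rule, assuming a `next_use` table computed by scanning the schedule forward (all names are illustrative):

```python
def pick_swap_out(resident, wanted_class, next_use):
    """Belady's strategy within one size-class: among resident tensors of
    the needed size-class, evict the one whose next use lies furthest in
    the future (infinity if it is never used again)."""
    candidates = [t for t in resident if t["size_class"] == wanted_class]
    return max(candidates,
               key=lambda t: next_use.get(t["name"], float("inf")))

resident = [{"name": "w1", "size_class": 4}, {"name": "w2", "size_class": 4}]
print(pick_swap_out(resident, 4, {"w1": 10, "w2": 42}))  # -> the "w2" tensor
```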
#### When to swap in and out?
* To maximize computation and communication overlap, we want to complete a pair of swap-out and swap-in operations as early as possible, so as not to block the execution of the corresponding operator in the schedule.

## Optimization via Genetic Algorithm
### Algorithm overview
* **Crossover, mutation, selection**
* The first generation of individuals is created randomly, and the population size is decided by a hyper-parameter, N~p~.
* A crossover takes a pair of parent chromosomes and produces new individuals by combining features from both parents.
* In SwapAdvisor, each crossover generates two new schedules and two new memory allocations.
* We then perform mutation on the children to escape local minima and avoid premature convergence.
* The resulting mutated children are given to the swap planner to generate augmented dataflow graphs with swapping nodes.
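A sketch of what one generation could look like; the rank-biased parent selection here is an assumption, not necessarily the paper's exact scheme:

```python
import random

def next_generation(ranked, crossover, mutate):
    """ranked: candidates sorted best-first. Parents are drawn with a bias
    toward fitter individuals; each crossover yields two children, which
    are mutated before being handed back to the swap planner."""
    weights = [len(ranked) - i for i in range(len(ranked))]  # rank-based bias
    children = []
    while len(children) < len(ranked):
        p1, p2 = random.choices(ranked, weights=weights, k=2)
        for child in crossover(p1, p2):   # two new schedules + allocations
            children.append(mutate(child))
    return children[:len(ranked)]
```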
### Creating new schedules
#### Encoding
* As a schedule is a topological ordering of the dataflow graph (G), it is natural to encode a schedule as a list, SCH, where each element in SCH is a node id in G.
#### Crossover
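The figure illustrating the schedule crossover is missing here. As a hedged sketch, one single-point crossover that provably preserves topological validity takes the first k nodes from one parent and appends the remaining nodes in the other parent's relative order:

```python
def crossover_schedules(sch1, sch2, k):
    """Single-point crossover of two operator schedules (sketch).
    The child keeps sch1's first k nodes, then the remaining nodes in the
    order they appear in sch2.  Any predecessor of a suffix node is either
    in the prefix or ordered before it by sch2, so the child is still a
    valid topological order."""
    prefix = sch1[:k]
    taken = set(prefix)
    return prefix + [n for n in sch2 if n not in taken]

print(crossover_schedules(["a", "b", "c", "d"], ["a", "c", "b", "d"], k=1))
# -> ['a', 'c', 'b', 'd']
```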

#### Mutation
* SwapAdvisor maintains a **ready set**, containing all the nodes which are ready to run.
* The core function of the mutation algorithm is to choose a node from the ready set based on the following two rules (see the sketch after this list):
1. With probability **P**, a mutation happens: the algorithm randomly chooses a node from the ready set.
2. Otherwise, the algorithm chooses the node from the ready set that is executed earliest in the original (un-mutated) schedule.
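A minimal sketch of this mutation following the two rules above (`preds` and the node names are illustrative):

```python
import random

def mutate_schedule(preds, schedule, p):
    """Ready-set based schedule mutation (illustrative sketch).
    preds[n] is the set of predecessors of node n in the dataflow graph."""
    rank = {n: i for i, n in enumerate(schedule)}  # position in original order
    done, mutated = set(), []
    while len(mutated) < len(schedule):
        # Ready set: unscheduled nodes whose predecessors have all run.
        ready = [n for n in schedule if n not in done and preds[n] <= done]
        if random.random() < p:
            pick = random.choice(ready)        # rule 1: random ready node
        else:
            pick = min(ready, key=rank.get)    # rule 2: earliest in original
        mutated.append(pick)
        done.add(pick)
    return mutated

preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(mutate_schedule(preds, ["a", "b", "c", "d"], p=0.3))
```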
### Creating new memory allocation
#### Encoding
* Let TS be the sorted list of unique tensor sizes observed in the dataflow graph.
* CLS is a list with the same length as TS, and the i~th~ item in CLS is a positive integer representing the size-class for tensors with size TS[i].
* CNT is a list representing the number of tensor objects allocated for each size-class.
#### Crossover

* There are two CLS before crossover (CLS~1~ and CLS~2~) and four different tensor sizes: 1MB, 2MB, 3MB, and 4MB.
* CLS~1~ = [1,1,1,1] means that all four tensor sizes belong to the same size-class, 4MB.
* CNT~1~ = [4] indicates that there are four 4MB tensor objects allocated.
* CLS~2~ = [1,2,2,2] means that there are two size-classes, 1MB and 4MB.
* CNT~2~ = [8,1] means that there are eight 1MB tensor objects and one 4MB tensor object.
* A child mapping such as CLS~C1~ may not be monotonically increasing after crossover. Thus, we repair it to ensure the new size-class mapping is still valid.
* We extend a CNT to an extended CNT~EXT~, which has the same length as CLS. CNT~EXT~ indicates how many tensor objects can be used for each tensor size.
* CLS~2~ is [1,2,2,2], which means 1MB belongs to the 1MB size-class while 2MB, 3MB, and 4MB belong to the 4MB size-class.
* CNT~2~ is [8,1], so CNT~EXT2~ can then be viewed as [8,1,1,1].
* We average all the elements in the same size-class to get the count for that size-class (worked example below).
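The note's numbers, worked through in a short sketch; the averaging step follows one plausible reading of the description:

```python
def extend(cls, cnt):
    """CNT_EXT: per-tensor-size counts.  cls[i] is the 1-based size-class
    index for the i-th tensor size; cnt[j] is the count of class j+1."""
    return [cnt[c - 1] for c in cls]

def child_counts(child_cls, cnt_ext):
    """Average the extended counts of all tensor sizes that fall into the
    same child size-class to get that class's object count."""
    buckets = {}
    for c, n in zip(child_cls, cnt_ext):
        buckets.setdefault(c, []).append(n)
    return [int(sum(v) / len(v)) for _, v in sorted(buckets.items())]

cls2, cnt2 = [1, 2, 2, 2], [8, 1]
assert extend(cls2, cnt2) == [8, 1, 1, 1]
# A hypothetical repaired child mapping with two size-classes:
print(child_counts([1, 1, 2, 2], extend(cls2, cnt2)))  # -> [4, 1]
```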
#### Mutation
* To mutate the i~th~ element in CLS, we can either increase it by 1 or decrease it by 1.
* To mutate CNT, we sample from a Gaussian distribution with the original value as the mean.
* The mutated CNT and CLS can exceed the memory limit; we use the same repair methodology as in crossover.
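A small sketch of this mutation step; the parameters `p` and `sigma` are illustrative, and the repair pass is omitted:

```python
import random

def mutate_allocation(cls, cnt, p=0.2, sigma=2.0):
    """Each CLS entry may step up or down one size-class with probability p;
    each CNT entry is redrawn from a Gaussian centred on its current value.
    A repair pass (as in crossover, omitted here) restores monotonicity
    and the memory budget afterwards."""
    new_cls = [max(1, c + random.choice([-1, 1])) if random.random() < p else c
               for c in cls]
    new_cnt = [max(1, round(random.gauss(n, sigma))) for n in cnt]
    return new_cls, new_cnt

print(mutate_allocation([1, 2, 2, 2], [8, 1]))
```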
## Evaluation
### Experimental setup
* Prototype implementation
* SwapAdvisor is based on MXNet 1.2.
* The GA and simulator are written in Python.
* For MXNet, we modify the scheduler to ensure swap-in and swap-out operations run on two separate GPU streams, distinct from the computation stream.
* We also implemented a new memory allocator which follows the allocation scheme produced by SwapAdvisor.
* Testbeds
* We run SwapAdvisor on an EC2 c5d.18xlarge instance with 72 virtual CPU cores and 144GB memory.
* The results of SwapAdvisor are executed on an EC2 p3.2xlarge GPU instance, which has one NVIDIA V100 GPU with 16GB of memory.
* Evaluated DNN models
* ResNet
* Inception-V4
* NasNet-25
* RNN
* Baselines
* ideal: assumes GPU memory is infinite
* ODSwap: an on-demand swapping system that incorporates the heuristics used by vDNN and SuperNeurons, including LRU-based swapping and prefetching
### Training evaluation


* Comparing with TFLMS

### Inference evaluation

* When the batch size is 1, SwapAdvisor reduces the memory usage to as little as 64MB with a 40% running-time overhead, or 192MB with only a 6% overhead.
### The effectiveness of SwapAdvisor’s design choices

* We would like to see the importance of optimizing both scheduling and memory allocation.
### Simulator accuracy

* These results indicate that using simulated performance is an effective way to accelerate GA-based search.