nctu-cas lab
# Capuchin: Tensor-based GPU Memory Management for Deep Learning
###### paper origin: ASPLOS 2020
###### Link: [Paper](https://dl.acm.org/doi/10.1145/3373376.3378505)

## 1. Introduction
- problem
    - current deep learning frameworks keep the large intermediate results produced in forward propagation in GPU memory until they are no longer needed in back propagation
        - storing these intermediate results incurs high memory consumption
    - two current solutions: swapping and recomputation
    - prior works use coarse-grained memory management and choose between swapping and recomputation through static analysis of the computation graph
        - a layer may contain several nodes in the computation graph, so node-level management is more fine-grained than layer-wise management
        - coarse-grained information (e.g. layer type) can't quantify the overhead of a specific memory operation, making it difficult to prioritize the memory optimization candidates or to choose between swapping and recomputation
        - each layer's computation time varies widely with the hardware and the input sizes, which limits the potential of memory optimization based on static analysis
- solution
    - makes memory management decisions (e.g. eviction, timing of prefetching or recomputation) based on tensor access patterns tracked at runtime, with a unique ID for each tensor
    - tensor accesses follow regular, repeated patterns across iterations during training

## 2. Background
- two execution modes used by deep learning frameworks
    - eager mode (imperative programming)
        - evaluates operations immediately without building graphs
        - slower than executing the equivalent graph due to the overhead of interpreting Python code and the lack of graph optimizations
        - convenient for deploying and debugging a new model, which makes it popular in the academic community
    - graph mode (declarative programming)
        - the computation graph is built before execution starts, and the actual computations are scheduled only when necessary
        - certain optimizations are applied to the program when it is converted to the internal computation graph

## 3. Observation and Motivation
- limitation of static analysis
    - synchronization overhead is huge when the current layer's execution time is too short to overlap the data transfer, and this overhead persists in subsequent iterations
    ![](https://hackmd.io/_uploads/HyohmMMh2.png =60%x)
    - the execution time of different convolution layers in one neural network varies widely
        - convolution layers can't be completely ruled out as recomputation targets, since some of them have low execution time
    ![](https://hackmd.io/_uploads/S16R_zf3h.png =60%x)
- opportunity: regular tensor access pattern
    - tensor accesses clearly follow a regular pattern when running a CNN task in graph mode
    - other kinds of workloads (e.g. speech, NLP) also exhibit a similar pattern
    ![](https://hackmd.io/_uploads/S1fmszM23.png =60%x)

## 4. Capuchin Memory Management Mechanisms
### 4.1 Design Goals
- minimize performance overhead under memory oversubscription
- should be general and require little code modification across different deep learning frameworks

### 4.2 Design Overview
- two phases during training
    - measured execution
        - observe the dynamic tensor access pattern from the execution of the first iteration
    - guided execution
        - the training process after the first iteration, driven by the memory management policy generated from the observed tensor access patterns
- tensor eviction
    - the tensor will be copied to CPU memory synchronously and the tensor's ID will be recorded for later access
    - operates at tensor granularity and uses CPU memory as an extended buffer
        - cannot handle the case where a function's input and output combined are larger than GPU memory (extremely rare)
    - eviction of one tensor is always triggered by the access of another tensor
        - called passive mode, which performs on-demand swapping
        - eviction takes time and is on the critical path of execution
    - proactively evict tensors
        - can release memory early for other tensors and hide the overhead, compared to on-demand swapping
- when an evicted tensor is accessed again
    - consider two methods
        - swapping
        - recomputation
    - need to select an appropriate time point to re-generate the tensor, to avoid increasing memory pressure or adding overhead on the critical path
- definition of terms
    - evicted-access
        - the tensor access that triggers the tensor's own eviction after it is used in the computation
    - back-access
        - the first access to a tensor after it is evicted from GPU memory, i.e. the access after the tensor's evicted-access
        - when it is performed, the tensor may or may not be in GPU memory
    - in-trigger
        - another tensor's access that triggers the tensor's re-generation earlier, to reduce critical path overhead

### 4.3 Estimating Swap and Recomputation Benefit
- need accurate estimates of recomputation time and swapping time
![](https://hackmd.io/_uploads/BJ0eRmG2n.png =60%x)
- swapping time
    - quantify how promising it is to swap a tensor using free time
        - $FreeTime=SwapInStartTime-SwapOutEndTime$
        - $SwapOutEndTime=EvictedAccessTime+SwapTime$
        - $SwapInStartTime=BackAccessTime-SwapTime$ (ideal)
- recomputation time
    - the cost depends on the operation's algorithm, the device's computing capability, and the input sizes
    - measured during measured execution by recording tensors' lineage and runtime access times (comparing the access times of input and output tensors)
    - indicate how favorable it is to recompute a tensor using Memory Saving Per Second (MSPS)
        - $MSPS=\frac{Memory\ Saving}{Recomputation\ Time}$

### 4.4 Determining Tensor Re-generation Cost
- select the appropriate time to initiate the re-generation operation
- swap
    - traverse backward from the back-access in the tensor access list to find the first tensor access whose time is earlier than SwapInStartTime
    - in-trigger shouldn't be set within the time range of peak memory
    - a swap cannot start until its preceding swap finishes, which can make the actual swap-in start later than the in-trigger
        - introduce feedback-driven adjustment to adjust the in-trigger time of a tensor dynamically at runtime
- recomputation
    - perform recomputation on demand
        - recomputation also needs GPU computing resources, which according to observation are usually exhausted when GPU memory is insufficient
    - generate the recomputation cost of each candidate from two pieces of information
        - the lifetime of the tensors in the lineage, to determine whether a tensor can serve as the source
        - whether a tensor in the lineage is itself a recomputation candidate; such tensors are assumed to be in GPU memory

### 4.5 Selecting Memory Optimization
- make the choice between swap and recomputation
    - choose swap as the first choice
    - if swapping can't perfectly hide the overhead, compare swapping and recomputation and choose the one with the smaller overhead
![](https://hackmd.io/_uploads/r1H4ySG32.png =60%x)
1. add two kinds of tensors into the eviction candidate set
    - tensors whose access count is more than one
    - tensors that are accessed while live in the peak memory usage period
2. generate a ranked list based on the free time (FT) between two consecutive tensor accesses, assuming the tensor is swapped out between the two accesses
3. move tensors with no overhead from the top of the ranked list to the eviction set
    - the required memory reduction can be obtained from measured execution with passive mode
4. insert tensors into the eviction set iteratively, considering both swap and recomputation
    - only if the memory reduced by swapping is smaller than the required memory reduction
    - two steps in each iteration
        - update: compute the MSPS for tensors in the current candidate set based on the current candidate set and eviction set
        - selection: insert the tensor with the largest MSPS into the eviction set and remove it from the candidate set

![](https://hackmd.io/_uploads/B1SE_T72n.png =60%x)

## 5. Capuchin Implementation
![](https://hackmd.io/_uploads/Syt_BbEn3.png =50%x)
- two modules that the underlying framework needs to have
    - executor
        - adopts on-the-fly lineage-based recomputation: given a tensor's ID, it can search for the closest input tensors for recomputation
    - allocator
        - needs to support SwapOut and SwapIn, which allocate equal-size memory on the GPU or CPU with an address as parameter
- extra fields in the tensor structure
    - five statuses
        - IN, SWAPPING_OUT, OUT, SWAPPING_IN, RECOMPUTE
    ![](https://hackmd.io/_uploads/HylcYWV2n.png =70%x)
- Tensor Access Tracker (TAT)
    - interacts with the Executor, Tensor, and Allocator in deep learning frameworks
    - supports on-demand memory swapping (passive mode)
        - identifies the tensor access boundary between complete iterations
        ![](https://hackmd.io/_uploads/B18FTmI23.png =60%x)
    - tracks tensors' access patterns for the Policy Maker
- Policy Maker (PM)
    - determines the memory optimization policy based on the tensor access sequence
- optimizations
    - decoupled computation and swapping
        - decouple computation and data transfer at a tensor's swapping-out and only synchronize the earliest unfinished swapping-out when OOM occurs
        - overlaps more computation with data transfer
        ![](https://hackmd.io/_uploads/SkwHf8Uhh.png =60%x)
    - collective recomputation
        - keep as many recomputation target tensors as possible from a single recomputation to reduce recomputation complexity
- GPU-specific issues
    - access time profiling
        - CPU processing runs in parallel with GPU computation, so a call returns immediately after enqueuing the kernel into the GPU stream
        - need the CUDA Profiling Tools Interface (CUPTI) to get the real GPU processing time
    - asynchronous and delayed operation
        - kernels execute in GPU stream order (the CPU just enqueues them), so Capuchin's functions should follow the same execution sequence
        - uses CUDA events to support swapping and two CUDA streams to perform swapping in and out
        - recomputation is already delayed since it behaves just like normal computation

## 6. Evaluation
- workloads
![](https://hackmd.io/_uploads/S1jvmYIh2.png)
- baselines
    - original TensorFlow (eager mode)
    - vDNN
    - OpenAI's gradient-checkpointing
        - memory mode: select a set of nodes to checkpoint to achieve O(sqrt(n)) memory usage
        - speed mode: checkpoint the outputs of all operations that are typically expensive to compute
- breakdown analysis
    - DS: decoupled computation and swapping
    - ATP: enable measured execution
    - FA: dynamically adjust in-trigger time
    - CR: collective recomputation
    ![](https://hackmd.io/_uploads/rkiUIF8nn.png =70%x)
    - the improvement from swapping is very limited at batch size 400, since the total data transfer time is more than twice the computation time
- memory footprint reduction
    - use batch size to represent the degree of memory footprint reduction
    - graph mode
    ![](https://hackmd.io/_uploads/SynBK0w32.png =60%x)
    - eager mode
    ![](https://hackmd.io/_uploads/BkVT_J_22.png =50%x)
- performance
    - graph mode
    ![](https://hackmd.io/_uploads/HyiZQyun2.png =70%x)
    - eager mode
    ![](https://hackmd.io/_uploads/S1iZ5yd3h.png =70%x)

## 7. Related Work
- deep learning framework support
    - data parallelism: each GPU keeps its own network replica, and the GPU memory footprint is reduced by decreasing the batch size per GPU
    - model parallelism: splits the whole neural network across multiple GPUs, and each GPU is in charge of computing its own part
    - orthogonal to Capuchin
- computation graph dependent techniques
    - three categories of memory optimization works based on the computation graph
        - swapping, recomputation, compression
    - algorithm level, orthogonal to Capuchin
- computation graph agnostic techniques
    - virtualize GPU memory by leveraging host memory as extended memory
    - profile the training process in order to choose a good memory swapping algorithm

## 8. Conclusions
- proposes a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation
- motivated by the observation that the access pattern to tensors is regular across training iterations
- exploits the whole memory optimization space and offers fine-grained and flexible control of when and how to perform memory optimization techniques
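
The free-time and MSPS estimates from Section 4.3 can be written out directly. This is only a minimal sketch of the three formulas given there; the function names, the argument names, and the choice of seconds and bytes as units are assumptions made for illustration, not part of Capuchin's code.

```python
def free_time(evicted_access_time, back_access_time, swap_time):
    """FreeTime = SwapInStartTime - SwapOutEndTime (all times in seconds)."""
    # The swap-out can finish no earlier than the evicted-access plus one transfer.
    swap_out_end_time = evicted_access_time + swap_time
    # Ideally the swap-in starts exactly one transfer before the back-access.
    swap_in_start_time = back_access_time - swap_time
    return swap_in_start_time - swap_out_end_time


def msps(memory_saving_bytes, recomputation_time):
    """Memory Saving Per Second: bytes freed per second of recomputation."""
    if recomputation_time <= 0:
        raise ValueError("recomputation_time must be positive")
    return memory_saving_bytes / recomputation_time


if __name__ == "__main__":
    # Hypothetical numbers: evicted at t=1.0 s, needed again at t=5.0 s,
    # and one transfer over the host link takes 1.5 s.
    print(free_time(1.0, 5.0, 1.5))   # positive: the swap can be fully hidden
    print(msps(100, 0.5))             # bytes saved per second of recomputation
```

A non-negative free time means both transfers fit between the two accesses, so the swap adds no overhead; a negative value is the part of the transfer that cannot be hidden.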
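
The in-trigger search for swapping in Section 4.4 (walk backward from the back-access until an access earlier than SwapInStartTime is found) might look like the sketch below. The `Access` record, `find_in_trigger`, and the `peak_range` parameter are hypothetical names. Skipping to an even earlier access when the match falls inside the peak-memory window is an assumption about how the "not within peak memory" rule could be applied. The feedback-driven adjustment is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Access:
    tensor_id: str   # hypothetical: ID of the tensor being accessed
    time: float      # access time observed during measured execution


def find_in_trigger(accesses, back_access_index, swap_time, peak_range=None):
    """Return the index of the access chosen as in-trigger, or None.

    accesses is the time-ordered tensor access list; peak_range is an
    optional (start, end) window of peak memory usage.
    """
    swap_in_start_time = accesses[back_access_index].time - swap_time
    # Traverse backward from the back-access.
    for i in range(back_access_index - 1, -1, -1):
        t = accesses[i].time
        if t >= swap_in_start_time:
            continue  # too late: the swap-in would not finish in time
        if peak_range is not None and peak_range[0] <= t <= peak_range[1]:
            continue  # assumption: move earlier to avoid the peak window
        return i
    return None


if __name__ == "__main__":
    trace = [Access(f"t{i}", float(i)) for i in range(6)]
    print(find_in_trigger(trace, 5, 1.5))
    print(find_in_trigger(trace, 5, 1.5, peak_range=(2.5, 3.2)))
```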
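
The four selection steps of Section 4.5 can be approximated with a simplified greedy sketch. It is not the paper's algorithm: the `Candidate` fields and `plan_evictions` are hypothetical, the exposed swap overhead is assumed to be the negated free time, and the per-iteration MSPS update step is omitted, so each tensor's recomputation time stays fixed.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    tensor_id: str
    size: int              # bytes freed if the tensor is evicted
    free_time: float       # FT between two consecutive accesses (seconds)
    recompute_time: float  # measured recomputation time (seconds, > 0)


def plan_evictions(candidates, required_saving):
    """Return ({tensor_id: "swap" | "recompute"}, bytes saved)."""
    plan, saved, remaining = {}, 0, []
    # Step 2: rank by free time, largest first.
    ranked = sorted(candidates, key=lambda c: c.free_time, reverse=True)
    # Step 3: tensors whose swap is fully hidden (FT >= 0) are swapped first.
    for c in ranked:
        if saved < required_saving and c.free_time >= 0:
            plan[c.tensor_id] = "swap"
            saved += c.size
        else:
            remaining.append(c)
    # Step 4: still short on memory, so pick by largest MSPS and use the
    # cheaper of swap (exposed overhead assumed to be -FT) and recomputation.
    while saved < required_saving and remaining:
        best = max(remaining, key=lambda c: c.size / c.recompute_time)
        remaining.remove(best)
        swap_overhead = -best.free_time
        cheaper = "swap" if swap_overhead <= best.recompute_time else "recompute"
        plan[best.tensor_id] = cheaper
        saved += best.size
    return plan, saved


if __name__ == "__main__":
    cands = [
        Candidate("A", 100, 2.0, 1.0),
        Candidate("B", 50, -0.5, 0.1),
        Candidate("C", 80, -3.0, 0.4),
    ]
    print(plan_evictions(cands, 200))
```

In this hypothetical run, `A` is swapped for free, while `B` and `C` are recomputed because their exposed swap time exceeds their recomputation time.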
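
The five tensor statuses listed in Section 5 suggest a small state machine. In the sketch below, only the status names come from the note. The allowed transitions are an assumption inferred from those names, not read from the paper's figure, and `TrackedTensor` is a hypothetical class.

```python
from enum import Enum, auto


class Status(Enum):
    IN = auto()
    SWAPPING_OUT = auto()
    OUT = auto()
    SWAPPING_IN = auto()
    RECOMPUTE = auto()


# Assumed transitions: a swap cycle IN -> SWAPPING_OUT -> OUT -> SWAPPING_IN -> IN,
# and a recomputation cycle IN -> RECOMPUTE -> IN.
ALLOWED = {
    Status.IN: {Status.SWAPPING_OUT, Status.RECOMPUTE},
    Status.SWAPPING_OUT: {Status.OUT},
    Status.OUT: {Status.SWAPPING_IN},
    Status.SWAPPING_IN: {Status.IN},
    Status.RECOMPUTE: {Status.IN},
}


class TrackedTensor:
    """Hypothetical per-tensor bookkeeping: unique ID plus current status."""

    def __init__(self, tensor_id):
        self.tensor_id = tensor_id
        self.status = Status.IN

    def transition(self, new_status):
        # Reject any move that the assumed transition table does not allow.
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.name} -> {new_status.name} not allowed")
        self.status = new_status


if __name__ == "__main__":
    t = TrackedTensor("conv1/output")
    for s in (Status.SWAPPING_OUT, Status.OUT, Status.SWAPPING_IN, Status.IN):
        t.transition(s)
    print(t.status.name)
```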
