## Activation Checkpoint Mechanism
### PyTorch Native Checkpoint Support
- Function definition
```python
torch.utils.checkpoint.checkpoint(function, *args, use_reentrant=True, **kwargs)
```
1. Function description
Checkpointing trades compute time for memory: certain intermediate activations are not stored and are instead recomputed during the backward pass.
A `forward` function wrapped by checkpoint runs under `torch.no_grad()` (so its activations are not recorded), but the function's `inputs` and the `function` itself still need to be saved.
When `backward` runs, the saved `inputs` and `function` are retrieved, the `forward` pass is replayed to regenerate the `activation`s, and those are used to compute the `gradients`.
2. Parameters
1. `function` → the forward function to run.
2. `preserve_rng_state` → if enabled, the RNG state is saved as well (default: True).
3. `args` → tuple of inputs passed to `function`.
3. Output
Returns the result of calling `function` on `args`.
4. How checkpoint works
1. Checkpoint mechanism

2. Only code inside a function wrapped by checkpoint skips storing activations; the meaning of a "checkpoint" is that the `input` at that point is recorded, so everything before it does not need to be recomputed.
3. Example (a minimal sketch follows after this list)

- Because every `Conv2d` is wrapped in `checkpoint`, none of their internal `forward` passes keep any `activation`.
- In practice, 4 `activation`s are still recorded: the `input x` of each `Conv2d`.
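Below is a minimal sketch of the example above, assuming a toy module with four `Conv2d` layers (hypothetical code, not from the original notes): each `Conv2d` call is wrapped in `checkpoint`, so only its input is saved and its internal activations are recomputed during `backward`.
```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # four Conv2d layers, matching the example above
        self.convs = nn.ModuleList(nn.Conv2d(8, 8, 3, padding=1) for _ in range(4))

    def forward(self, x):
        for conv in self.convs:
            # each Conv2d runs under checkpoint: nothing inside it is stored,
            # only its input x is kept so the forward can be replayed in backward
            x = checkpoint(conv, x, use_reentrant=True)
        return x

net = Net()
x = torch.randn(2, 8, 16, 16, requires_grad=True)
net(x).sum().backward()  # the four Conv2d forwards are recomputed here
```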
- Initial questions
1. Where are the `inputs` and the `function` stored? Do they stay on the GPU, or go to CPU memory? Is there a config for choosing where they are stored? If they can be kept in CPU memory, there may be further room to relieve GPU memory.
- Direction: our current tests enable HuggingFace's gradient checkpointing → which in turn enables the native PyTorch checkpoint mechanism, so this can be confirmed by tracing the PyTorch checkpoint implementation.
- Direction 2: the gradient checkpoint offload technique mentioned earlier might be exactly the way to guarantee these values are offloaded to the CPU? Needs some tracing to confirm.
2. Why does DeepSpeed's own `activation checkpoint` have no effect? Or why doesn't it conflict with HuggingFace's `gradient checkpointing`?
- Direction: trace how DeepSpeed's activation checkpointing is implemented.
- [Answers](https://www.notion.so/8-29-Discussion-526e21453a6741779a7a85ac7d9776a5?pvs=21)
- Conclusion from the follow-up meeting: gradient checkpoint offload can also offload the checkpointed inputs (activations) to DRAM.
- References
- Official doc: [torch.utils.checkpoint — PyTorch 2.0 documentation](https://pytorch.org/docs/stable/checkpoint.html)
### HuggingFace Gradient Checkpointing
- Description
Gradient checkpointing is essentially an extension of PyTorch's checkpoint mechanism.
There are two strategies for handling activations:
1. Store every activation from the forward pass so gradients can be computed directly during backward → compute speed is unaffected, but the memory requirement is large.
2. Discard activations and recompute them during backward → adds computational overhead but saves memory.
Gradient checkpointing offers a hybrid strategy: only some modules are checkpointed, so only the activations inside those modules are dropped and must be recomputed from the input recorded at each checkpoint.
The remaining modules keep the original forward behavior, i.e. their activations are stored normally and used directly during backward.
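A minimal usage sketch, assuming a causal LM from the Hub (the model name below is only an illustrative choice): gradient checkpointing can be enabled either directly on the model or through the Trainer's training arguments.
```python
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model

# Option 1: enable it directly on the model
model.gradient_checkpointing_enable()

# Option 2: let Trainer enable it through the training arguments
args = TrainingArguments(output_dir="out", gradient_checkpointing=True)
```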
- Initial questions
1. If the currently enabled gradient checkpointing does not checkpoint every module, does that mean there is still room for improvement?
- Implementation trace
1. At transformers/src/transformers/trainer.py line 1659

- self.model here maps to the model class that HuggingFace implements.
2. Taking LLaMA as an example, the implementation inherits all the way up from PreTrainedModel.



3. In transformers/src/transformers/modeling_utils.py

- nn.Module's `apply` is used to call `self._set_gradient_checkpointing` on every sub-module with the value `True`.

4. Taking LLaMA as an example

- So far, HuggingFace only checkpoints the decoder transformer layers (a paraphrased sketch of this code path follows after this list).
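The following is a paraphrased, runnable sketch of the pattern traced above: a flag propagated via nn.Module's `apply`, with only selected layers wrapped in torch.utils.checkpoint inside `forward`. The toy class is an assumption for illustration; the real transformers code differs between versions.
```python
from functools import partial
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class ToyDecoderStack(nn.Module):
    """Toy stand-in for LlamaModel: only the 'decoder layers' get checkpointed."""

    def __init__(self, n_layers=4, dim=32):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.gradient_checkpointing = False  # the flag that gets flipped

    def _set_gradient_checkpointing(self, module, value=False):
        # mirrors the per-model hook seen in modeling_llama.py
        if isinstance(module, ToyDecoderStack):
            module.gradient_checkpointing = value

    def gradient_checkpointing_enable(self):
        # mirrors PreTrainedModel.gradient_checkpointing_enable in modeling_utils.py
        self.apply(partial(self._set_gradient_checkpointing, value=True))

    def forward(self, x):
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                # only these layers are checkpointed, as observed for LLaMA
                x = checkpoint(layer, x, use_reentrant=True)
            else:
                x = layer(x)
        return x

model = ToyDecoderStack().train()
model.gradient_checkpointing_enable()       # what Trainer triggers around trainer.py:1659
x = torch.randn(2, 32, requires_grad=True)
model(x).sum().backward()                   # checkpointed layers are recomputed here
```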
- [Answers](https://www.notion.so/8-29-Discussion-526e21453a6741779a7a85ac7d9776a5?pvs=21)
- From the code we can see that HuggingFace's gradient checkpointing is exposed as a flag; how checkpointing is actually implemented from that flag is left to each model implementation, so it can be confirmed per model.
- Since LLaMA does not gradient-checkpoint every layer, there may still be room for changes, but that room should be limited because the largest components, the decoder layers, are already checkpointed.
- So GPU memory currently only holds each layer's input (activation) plus the internal activations of all modules that are not wrapped in a checkpoint.
- References
- Detailed introduction to checkpointing: [Fitting larger networks into memory. | by Yaroslav Bulatov | TensorFlow | Medium](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9)
- Checkpoint library: https://github.com/cybertronai/gradient-checkpointing
- Checkpointing paper (657 citations): [1604.06174.pdf (arxiv.org)](https://arxiv.org/pdf/1604.06174.pdf)
- HuggingFace doc: [Methods and tools for efficient training on a single GPU (huggingface.co)](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#gradient-checkpointing)
- Blog on torch & HuggingFace checkpointing (Chinese): [聊一下关于使用torch.utils.checkpoint.checkpoint (检查点技术)来节省gpu显存,以及在huggingface中如何使用 - 知乎 (zhihu.com)](https://zhuanlan.zhihu.com/p/615122110#ref_2)
### DeepSpeed Activation Checkpointing
- Brief description
- DeepSpeed uses the same checkpoint concept, and its usage is similar to PyTorch's.
- In deepspeed/runtime/activation_checkpointing/checkpointing.py
```python
def checkpoint(function, *args):
    """Checkpoint a model or part of the model.
    This has been directly copied from torch.utils.checkpoint."""
    all_outputs = []
    CheckpointFunction.apply(function, all_outputs, *args)
    if len(all_outputs) == 1:
        return all_outputs[0]
    else:
        return tuple(all_outputs)
```
- Because of how `apply` is used, the checkpoint mechanism effectively covers every `module` and `sub-function` inside `function`.
- On top of the native checkpoint, DeepSpeed additionally implements mechanisms such as `activation partitioning` and `cpu checkpointing` (see the hedged configuration sketch below).
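A hedged configuration sketch, based on the Activation Checkpointing doc linked in the references (parameter names are taken from that doc and from checkpointing.py and may differ between DeepSpeed versions): these flags map to the PARTITION_ACTIVATIONS / CONTIGUOUS_CHECKPOINTING / CPU_CHECKPOINT globals used below, and can equivalently be set in the "activation_checkpointing" section of the DeepSpeed config JSON.
```python
import deepspeed

deepspeed.checkpointing.configure(
    mpu_=None,                       # model-parallel unit; None when not using model parallelism
    partition_activations=True,      # PARTITION_ACTIVATIONS
    contiguous_checkpointing=False,  # CONTIGUOUS_CHECKPOINTING
    checkpoint_in_cpu=True,          # CPU_CHECKPOINT: offload the saved inputs to DRAM
    synchronize=False,               # per the config doc, cpu offload is meant to pair with partitioning
    profile=False,
)
```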
- CheckpointFunction (implements the checkpoint mechanism, i.e. the required forward and backward behavior)
- Path: deepspeed/runtime/activation_checkpointing/checkpointing.py
- Class declaration
```python
class CheckpointFunction(torch.autograd.Function):

    @staticmethod
    def forward(ctx, run_function, all_outputs, *args):
        pass

    @staticmethod
    def backward(ctx, *grads):
        pass
```
- In the forward wrapper
1. Determine cuda_device
```python
if cuda_device is None:
    cuda_device = get_accelerator().current_device_name()
```
2. Prepare the checkpointed inputs to be stored under the different strategies
```python
if PARTITION_ACTIVATIONS:
    inputs = partition_activations(args, CPU_CHECKPOINT, CONTIGUOUS_CHECKPOINTING)
elif CPU_CHECKPOINT:
    inputs = copy_to_device(args, device=torch.device('cpu'), criterion_func=is_activation_to_checkpoint)
```
3. Run the original forward, but without recording activations
```python
inputs_cuda = copy_to_device(args, device=cuda_device, criterion_func=is_activation_to_checkpoint)
with torch.no_grad():
    outputs = run_function(*inputs_cuda)
del inputs_cuda
```
- Here the inputs are copied once more on the GPU → the code comment explains this is done out of concern that reusing the originals could cause problems.
4. Save the checkpoint according to the chosen strategy
```python
if PARTITION_ACTIVATIONS:
    new_args = get_partitioned_activations_for_backward(args, inputs, CONTIGUOUS_CHECKPOINTING)
    assert len(new_args) % 2 == 0, f'save_for_backward called with odd number of args, {len(new_args)}'
    save_args_for_backward(*new_args)
elif CPU_CHECKPOINT:
    new_args = get_cpu_activations_for_backward(args, inputs)
    save_args_for_backward(*new_args)
else:
    save_args_for_backward(*args)
```
5. The checkpoint is stored in ctx
```python
def save_args_for_backward(*all_args):
    tensor_args, non_tensor_args, tensor_flags = extract_tensors(all_objects=all_args)
    ctx.deepspeed_saved_tensors = tensor_args
    ctx.non_tensor_args = non_tensor_args
    ctx.tensor_flags = tensor_flags
```
- In the backward wrapper
1. Retrieve the checkpoint according to the chosen strategy
```python
if PARTITION_ACTIVATIONS:
    inputs = gather_partitioned_activations(ctx.deepspeed_saved_tensors,
                                            device=cuda_device if CPU_CHECKPOINT else None)
    detached_inputs = detach_variable(inputs)
elif CPU_CHECKPOINT:
    inputs = move_to_device(ctx.deepspeed_saved_tensors, cuda_device, is_activation_to_checkpoint)
    detached_inputs = detach_variable(inputs)
else:
    inputs = ctx.deepspeed_saved_tensors
    detached_inputs = detach_variable(inputs)
```
2. Recompute the original activations
```python
with torch.enable_grad():
    outputs = ctx.run_function(*detached_inputs)
```
3. Compute the corresponding gradients
```python
output_tensors = []
grad_tensors = []
for out, grad in zip(outputs, grads):
    if out.requires_grad:
        output_tensors.append(out)
        grad_tensors.append(grad)
torch.autograd.backward(output_tensors, grad_tensors)
```
4. Clear the checkpoint (it will not be used again)
```python
ctx.deepspeed_saved_tensors = None
ctx.non_tensor_args = None
ctx.tensor_flags = None
```
- [Answers](https://www.notion.so/8-29-Discussion-526e21453a6741779a7a85ac7d9776a5?pvs=21)
- So to use it, the forward function must also be wrapped with deepspeed.checkpointing.checkpoint.
- Right now this only appears to be used inside the transformer layers DeepSpeed implements itself (hybrid engine). Since we use HuggingFace models, whose implementations do not call DeepSpeed's checkpoint method, enabling it has no effect, while enabling HuggingFace's gradient checkpointing does take effect. A hedged usage sketch follows below.
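A hedged usage sketch (assuming a CUDA/accelerator device is available; `block` and `x` are placeholders, not code from our models): wrapping a forward call with deepspeed.checkpointing.checkpoint, analogous to the native torch.utils.checkpoint usage.
```python
import torch
import deepspeed

block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU()).cuda()
x = torch.randn(4, 512, device="cuda", requires_grad=True)

# Nothing inside `block` is stored during forward; depending on the configure()
# flags, the saved inputs can additionally be partitioned and/or offloaded to CPU.
out = deepspeed.checkpointing.checkpoint(block, x)
out.sum().backward()
```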
- References
- Official doc: [DeepSpeed Configuration JSON - DeepSpeed](https://www.deepspeed.ai/docs/config-json/#activation-checkpointing)
- Official doc 2: [Activation Checkpointing — DeepSpeed 0.10.2 documentation](https://deepspeed.readthedocs.io/en/latest/activation-checkpointing.html)