## Activation Checkpoint Mechanism
### PyTorch Native Checkpoint Support
- Function definition
```python
torch.utils.checkpoint.checkpoint(function, *args, use_reentrant=True, **kwargs)
```
1. Function description
Checkpointing trades compute time for memory: certain intermediate activations are not stored and are instead recomputed during the backward pass.
A `forward` function wrapped by checkpoint runs under `torch.no_grad()` (so its activations are not recorded), but the function's `inputs` and the `function` itself still need to be saved.
When `backward` runs, the saved `inputs` and `function` are retrieved, the `forward` pass is replayed to regenerate the `activation`s, and those are used to compute the `gradients`.
2. Parameters
1. `function` → the forward function to run.
2. `preserve_rng_state` → if enabled, the RNG state is saved as well (default: True).
3. `args` → tuple of inputs passed to `function`.
3. Output
Returns the result of calling `function` on `args`.
4. How checkpoint works
1. Checkpoint mechanism

2. Only code inside a function wrapped by checkpoint skips storing activations; the meaning of a "checkpoint" is that the `input` at that point is recorded, so everything before it does not need to be recomputed.
3. Example (a minimal sketch follows after this list)

- Because every `Conv2d` is wrapped in `checkpoint`, none of their internal `forward` passes keep any `activation`.
- In practice, 4 `activation`s are still recorded: the `input x` of each `Conv2d`.
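Below is a minimal sketch of the example above, assuming a toy module with four `Conv2d` layers (hypothetical code, not from the original notes): each `Conv2d` call is wrapped in `checkpoint`, so only its input is saved and its internal activations are recomputed during `backward`.
```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # four Conv2d layers, matching the example above
        self.convs = nn.ModuleList(nn.Conv2d(8, 8, 3, padding=1) for _ in range(4))

    def forward(self, x):
        for conv in self.convs:
            # each Conv2d runs under checkpoint: nothing inside it is stored,
            # only its input x is kept so the forward can be replayed in backward
            x = checkpoint(conv, x, use_reentrant=True)
        return x

net = Net()
x = torch.randn(2, 8, 16, 16, requires_grad=True)
net(x).sum().backward()  # the four Conv2d forwards are recomputed here
```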
- Initial questions
1. Where are the `inputs` and the `function` stored? Do they stay on the GPU, or go to CPU memory? Is there a config for choosing where they are stored? If they can be kept in CPU memory, there may be further room to relieve GPU memory.
- Direction: our current tests enable HuggingFace's gradient checkpointing → which in turn enables the native PyTorch checkpoint mechanism, so this can be confirmed by tracing the PyTorch checkpoint implementation.
- Direction 2: the gradient checkpoint offload technique mentioned earlier might be exactly the way to guarantee these values are offloaded to the CPU? Needs some tracing to confirm.
2. Why does DeepSpeed's own `activation checkpoint` have no effect? Or why doesn't it conflict with HuggingFace's `gradient checkpointing`?
- Direction: trace how DeepSpeed's activation checkpointing is implemented.
- [Answers](https://www.notion.so/8-29-Discussion-526e21453a6741779a7a85ac7d9776a5?pvs=21)
- Conclusion from the follow-up meeting: gradient checkpoint offload can also offload the checkpointed inputs (activations) to DRAM.
- References
- Official doc: [torch.utils.checkpoint — PyTorch 2.0 documentation](https://pytorch.org/docs/stable/checkpoint.html)
### HuggingFace Gradient Checkpointing
- Description
Gradient checkpointing is essentially an extension of PyTorch's checkpoint mechanism.
There are two strategies for handling activations:
1. Store every activation from the forward pass so gradients can be computed directly during backward → compute speed is unaffected, but the memory requirement is large.
2. Discard activations and recompute them during backward → adds computational overhead but saves memory.
Gradient checkpointing offers a hybrid strategy: only some modules are checkpointed, so only the activations inside those modules are dropped and must be recomputed from the input recorded at each checkpoint.
The remaining modules keep the original forward behavior, i.e. their activations are stored normally and used directly during backward.
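A minimal usage sketch, assuming a causal LM from the Hub (the model name below is only an illustrative choice): gradient checkpointing can be enabled either directly on the model or through the Trainer's training arguments.
```python
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model

# Option 1: enable it directly on the model
model.gradient_checkpointing_enable()

# Option 2: let Trainer enable it through the training arguments
args = TrainingArguments(output_dir="out", gradient_checkpointing=True)
```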
- Initial questions
1. If the currently enabled gradient checkpointing does not checkpoint every module, does that mean there is still room for improvement?
- Implementation trace
1. At transformers/src/transformers/trainer.py line 1659

- self.model here maps to the model class that HuggingFace implements.
2. Taking LLaMA as an example, the implementation inherits all the way up from PreTrainedModel.



3. In transformers/src/transformers/modeling_utils.py

- nn.Module's `apply` is used to call `self._set_gradient_checkpointing` on every sub-module with the value `True`.

4. Taking LLaMA as an example

- So far, HuggingFace only checkpoints the decoder transformer layers (a paraphrased sketch of this code path follows after this list).
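The following is a paraphrased, runnable sketch of the pattern traced above: a flag propagated via nn.Module's `apply`, with only selected layers wrapped in torch.utils.checkpoint inside `forward`. The toy class is an assumption for illustration; the real transformers code differs between versions.
```python
from functools import partial
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class ToyDecoderStack(nn.Module):
    """Toy stand-in for LlamaModel: only the 'decoder layers' get checkpointed."""

    def __init__(self, n_layers=4, dim=32):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.gradient_checkpointing = False  # the flag that gets flipped

    def _set_gradient_checkpointing(self, module, value=False):
        # mirrors the per-model hook seen in modeling_llama.py
        if isinstance(module, ToyDecoderStack):
            module.gradient_checkpointing = value

    def gradient_checkpointing_enable(self):
        # mirrors PreTrainedModel.gradient_checkpointing_enable in modeling_utils.py
        self.apply(partial(self._set_gradient_checkpointing, value=True))

    def forward(self, x):
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                # only these layers are checkpointed, as observed for LLaMA
                x = checkpoint(layer, x, use_reentrant=True)
            else:
                x = layer(x)
        return x

model = ToyDecoderStack().train()
model.gradient_checkpointing_enable()       # what Trainer triggers around trainer.py:1659
x = torch.randn(2, 32, requires_grad=True)
model(x).sum().backward()                   # checkpointed layers are recomputed here
```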
- [Answers](https://www.notion.so/8-29-Discussion-526e21453a6741779a7a85ac7d9776a5?pvs=21)
- From the code we can see that HuggingFace's gradient checkpointing is exposed as a flag; how checkpointing is actually implemented from that flag is left to each model implementation, so it can be confirmed per model.
- Since LLaMA does not gradient-checkpoint every layer, there may still be room for changes, but that room should be limited because the largest components, the decoder layers, are already checkpointed.
- So GPU memory currently only holds each layer's input (activation) plus the internal activations of all modules that are not wrapped in a checkpoint.
- References
- Detailed introduction to checkpointing: [Fitting larger networks into memory. | by Yaroslav Bulatov | TensorFlow | Medium](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9)
- Checkpoint library: https://github.com/cybertronai/gradient-checkpointing
- Checkpointing paper (657 citations): [1604.06174.pdf (arxiv.org)](https://arxiv.org/pdf/1604.06174.pdf)
- HuggingFace doc: [Methods and tools for efficient training on a single GPU (huggingface.co)](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#gradient-checkpointing)
- Blog on torch & HuggingFace checkpointing (Chinese): [聊一下关于使用torch.utils.checkpoint.checkpoint (检查点技术)来节省gpu显存,以及在huggingface中如何使用 - 知乎 (zhihu.com)](https://zhuanlan.zhihu.com/p/615122110#ref_2)
### DeepSpeed Activation Checkpointing
- Brief description
- DeepSpeed uses the same checkpoint concept, and its usage is similar to PyTorch's.
- In deepspeed/runtime/activation_checkpointing/checkpointing.py
```python
def checkpoint(function, *args):
    """Checkpoint a model or part of the model.
    This has been directly copied from torch.utils.checkpoint."""
    all_outputs = []
    CheckpointFunction.apply(function, all_outputs, *args)
    if len(all_outputs) == 1:
        return all_outputs[0]
    else:
        return tuple(all_outputs)
```
- Because of how `apply` is used, the checkpoint mechanism effectively covers every `module` and `sub-function` inside `function`.
- On top of the native checkpoint, DeepSpeed additionally implements mechanisms such as `activation partitioning` and `cpu checkpointing` (see the hedged configuration sketch below).
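A hedged configuration sketch, based on the Activation Checkpointing doc linked in the references (parameter names are taken from that doc and from checkpointing.py and may differ between DeepSpeed versions): these flags map to the PARTITION_ACTIVATIONS / CONTIGUOUS_CHECKPOINTING / CPU_CHECKPOINT globals used below, and can equivalently be set in the "activation_checkpointing" section of the DeepSpeed config JSON.
```python
import deepspeed

deepspeed.checkpointing.configure(
    mpu_=None,                       # model-parallel unit; None when not using model parallelism
    partition_activations=True,      # PARTITION_ACTIVATIONS
    contiguous_checkpointing=False,  # CONTIGUOUS_CHECKPOINTING
    checkpoint_in_cpu=True,          # CPU_CHECKPOINT: offload the saved inputs to DRAM
    synchronize=False,               # per the config doc, cpu offload is meant to pair with partitioning
    profile=False,
)
```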
- CheckpointFunction (implements the checkpoint mechanism, i.e. the required forward and backward behavior)
- Path: deepspeed/runtime/activation_checkpointing/checkpointing.py
- Class declaration
```python
class CheckpointFunction(torch.autograd.Function):

    @staticmethod
    def forward(ctx, run_function, all_outputs, *args):
        pass

    @staticmethod
    def backward(ctx, *grads):
        pass
```
- In the forward wrapper
1. Determine cuda_device
```python
if cuda_device is None:
    cuda_device = get_accelerator().current_device_name()
```
2. Prepare the checkpointed inputs to be stored under the different strategies
```python
if PARTITION_ACTIVATIONS:
    inputs = partition_activations(args, CPU_CHECKPOINT, CONTIGUOUS_CHECKPOINTING)
elif CPU_CHECKPOINT:
    inputs = copy_to_device(args, device=torch.device('cpu'), criterion_func=is_activation_to_checkpoint)
```
3. Run the original forward, but without recording activations
```python
inputs_cuda = copy_to_device(args, device=cuda_device, criterion_func=is_activation_to_checkpoint)
with torch.no_grad():
    outputs = run_function(*inputs_cuda)
del inputs_cuda
```
- Here the inputs are copied once more on the GPU → the code comment explains this is done out of concern that reusing the originals could cause problems.
4. Save the checkpoint according to the chosen strategy
```python
if PARTITION_ACTIVATIONS:
    new_args = get_partitioned_activations_for_backward(args, inputs, CONTIGUOUS_CHECKPOINTING)
    assert len(new_args) % 2 == 0, f'save_for_backward called with odd number of args, {len(new_args)}'
    save_args_for_backward(*new_args)
elif CPU_CHECKPOINT:
    new_args = get_cpu_activations_for_backward(args, inputs)
    save_args_for_backward(*new_args)
else:
    save_args_for_backward(*args)
```
5. The checkpoint is stored in ctx
```python
def save_args_for_backward(*all_args):
    tensor_args, non_tensor_args, tensor_flags = extract_tensors(all_objects=all_args)
    ctx.deepspeed_saved_tensors = tensor_args
    ctx.non_tensor_args = non_tensor_args
    ctx.tensor_flags = tensor_flags
```
- In the backward wrapper
1. Retrieve the checkpoint according to the chosen strategy
```python
if PARTITION_ACTIVATIONS:
    inputs = gather_partitioned_activations(ctx.deepspeed_saved_tensors,
                                            device=cuda_device if CPU_CHECKPOINT else None)
    detached_inputs = detach_variable(inputs)
elif CPU_CHECKPOINT:
    inputs = move_to_device(ctx.deepspeed_saved_tensors, cuda_device, is_activation_to_checkpoint)
    detached_inputs = detach_variable(inputs)
else:
    inputs = ctx.deepspeed_saved_tensors
    detached_inputs = detach_variable(inputs)
```
2. Recompute the original activations
```python
with torch.enable_grad():
    outputs = ctx.run_function(*detached_inputs)
```
3. Compute the corresponding gradients
```python
output_tensors = []
grad_tensors = []
for out, grad in zip(outputs, grads):
    if out.requires_grad:
        output_tensors.append(out)
        grad_tensors.append(grad)
torch.autograd.backward(output_tensors, grad_tensors)
```
4. Clear the checkpoint (it will not be used again)
```python
ctx.deepspeed_saved_tensors = None
ctx.non_tensor_args = None
ctx.tensor_flags = None
```
- [Answers](https://www.notion.so/8-29-Discussion-526e21453a6741779a7a85ac7d9776a5?pvs=21)
- So to use it, the forward function must also be wrapped with deepspeed.checkpointing.checkpoint.
- Right now this only appears to be used inside the transformer layers DeepSpeed implements itself (hybrid engine). Since we use HuggingFace models, whose implementations do not call DeepSpeed's checkpoint method, enabling it has no effect, while enabling HuggingFace's gradient checkpointing does take effect. A hedged usage sketch follows below.
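A hedged usage sketch (assuming a CUDA/accelerator device is available; `block` and `x` are placeholders, not code from our models): wrapping a forward call with deepspeed.checkpointing.checkpoint, analogous to the native torch.utils.checkpoint usage.
```python
import torch
import deepspeed

block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU()).cuda()
x = torch.randn(4, 512, device="cuda", requires_grad=True)

# Nothing inside `block` is stored during forward; depending on the configure()
# flags, the saved inputs can additionally be partitioned and/or offloaded to CPU.
out = deepspeed.checkpointing.checkpoint(block, x)
out.sum().backward()
```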
- References
- Official doc: [DeepSpeed Configuration JSON - DeepSpeed](https://www.deepspeed.ai/docs/config-json/#activation-checkpointing)
- Official doc 2: [Activation Checkpointing — DeepSpeed 0.10.2 documentation](https://deepspeed.readthedocs.io/en/latest/activation-checkpointing.html)