## Torch Multiprocess with Multiple CUDA Contexts
### Experiment Background
1. Ubuntu 20.04 / NVIDIA GeForce RTX 3090 24GB ×1 / CUDA 11.6 (driver 510.108)
2. Define an `app` function that simply allocates a 1024×1024 tensor on the GPU, then runs `del` and `torch.cuda.empty_cache()`, and finally enters an infinite loop.
3. The main function uses Python `multiprocessing` to launch n processes.
4. While the main function runs, open a separate terminal with `watch -n 0.1 nvidia-smi` to observe how each process's VRAM usage changes.
- main.py
```python
from multiprocessing import Process
from basc_torch_usage import app

if __name__ == '__main__':
    import torch
    import time

    print("torch version:", torch.__version__)
    process_number = 3
    process_list = []
    for _ in range(process_number):
        p = Process(target=app)
        process_list.append(p)
        p.start()
        time.sleep(1)  # stagger the launches so the interleaved logs stay readable
    for p in process_list:
        p.join()
```
- basc_torch_usage.py
```python
import torch
import os


def get_memory_allocated():
    # MiB currently occupied by live tensors, per the caching allocator.
    return torch.cuda.memory_allocated() / 1024 / 1024


def get_memory_reserved():
    # MiB reserved by the caching allocator, including cached free blocks.
    return torch.cuda.memory_reserved() / 1024 / 1024


def app():
    process_id = os.getpid()
    print(f"[{process_id}] Before initializing, memory allocated:{get_memory_allocated()}, memory reserved: {get_memory_reserved()}")
    # 1024*1024 float32 values = 4 MiB; moving it to the GPU also creates
    # this process's CUDA context.
    a = torch.rand(1024, 1024).cuda()
    print(f"[{process_id}] After initializing, memory allocated:{get_memory_allocated()}, memory reserved: {get_memory_reserved()}")
    del a
    print(f"[{process_id}] After deleting, memory allocated:{get_memory_allocated()}, memory reserved: {get_memory_reserved()}")
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print(f"[{process_id}] After empty_cache, memory allocated:{get_memory_allocated()}, memory reserved: {get_memory_reserved()}")
    # Keep the process alive so its VRAM footprint stays visible in nvidia-smi.
    while True:
        pass
```
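A side note on the launcher: `fork` (the default start method on Linux) works here only because the parent process never initializes CUDA before forking; if it did, the children would fail with "Cannot re-initialize CUDA in forked subprocess". A minimal sketch of the safer variant, explicitly using the `spawn` start method:
```python
from multiprocessing import get_context
from basc_torch_usage import app

if __name__ == '__main__':
    # spawn gives every child a fresh interpreter, so it stays safe even
    # if the parent has already created its own CUDA context.
    ctx = get_context('spawn')
    processes = [ctx.Process(target=app) for _ in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```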
### Experiment Results
1. Judging by the CUDA memory statistics that torch itself exposes, the `app` function appears to occupy very little VRAM (for a device-level counterpart to these counters, see the first sketch after this list):
```
torch version: 1.12.1+cu116
[3598094] Before initializing, memory allocated:0.0, memory reserved: 0.0
[3598145] Before initializing, memory allocated:0.0, memory reserved: 0.0
[3598094] After initializing, memory allocated:4.0, memory reserved: 20.0
[3598094] After deleting, memory allocated:0.0, memory reserved: 20.0
[3598094] After empty_cache, memory allocated:0.0, memory reserved: 0.0
[3598185] Before initializing, memory allocated:0.0, memory reserved: 0.0
[3598145] After initializing, memory allocated:4.0, memory reserved: 20.0
[3598145] After deleting, memory allocated:0.0, memory reserved: 20.0
[3598145] After empty_cache, memory allocated:0.0, memory reserved: 0.0
[3598185] After initializing, memory allocated:4.0, memory reserved: 20.0
[3598185] After deleting, memory allocated:0.0, memory reserved: 20.0
[3598185] After empty_cache, memory allocated:0.0, memory reserved: 0.0
```
2. `nvidia-smi`, however, shows that even after the tensor is deleted and `empty_cache()` is called, each process still holds roughly 731 MiB of VRAM (the second sketch after this list shows how to read these per-process numbers programmatically):
```
Every 0.1s: nvidia-smi tesla: Tue Feb 28 11:33:08 2023
Tue Feb 28 11:33:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 0% 40C P8 23W / 350W | 2206MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1349 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2305 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3598094 C python 731MiB |
| 0 N/A N/A 3598145 C python 731MiB |
| 0 N/A N/A 3598185 C python 731MiB |
+-----------------------------------------------------------------------------+
```
### Possible directions found in the survey so far:
1. A new official PyTorch repo was presented at a recent PyTorch conference: https://github.com/pytorch/multipy. It uses a single process to drive multiple interpreters/GPUs.
2. In the 2019 issue "Couple hundred MB are taken just by initializing cuda" (https://github.com/pytorch/pytorch/issues/20532), PyTorch developer colesbury (an engineer at Meta AI) explains that PyTorch loads a large set of CUDA kernels right at startup, and their dependencies are so entangled that they cannot be split apart.
3. NVIDIA offers a Multi-Process Service (MPS): https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf. Basic daemon commands are sketched below.
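For reference, MPS is controlled from the shell; the two basic commands below are taken from the linked overview document (whether MPS actually reduces the per-process context footprint on this GPU still needs to be verified):
```
# Start the MPS control daemon; CUDA programs launched afterwards become MPS clients.
nvidia-cuda-mps-control -d

# ... run the multi-process experiment here ...

# Shut the daemon down.
echo quit | nvidia-cuda-mps-control
```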