## Torch multiprocess with multi CUDA Context

### Experiment Background

1. Ubuntu 20.04 / NVIDIA GeForce RTX 3090 24GB ×1 / CUDA 11.6 (driver 510.108)
2. Define an `app` function that simply allocates a 1024×1024 tensor on the GPU, runs `del` and `torch.cuda.empty_cache()`, and then enters an infinite loop.
3. The `main` function uses Python `multiprocessing` to launch n processes.
4. While `main` is running, open a separate terminal with `watch -n 0.1 nvidia-smi` to watch how each process's VRAM usage changes.

- main.py

```python
from multiprocessing import Process

from basc_torch_usage import app

if __name__ == '__main__':
    import torch
    import time

    print("torch version:", torch.__version__)

    process_number = 3
    process_list = []
    # Start the workers one second apart so their log lines interleave cleanly.
    for _ in range(process_number):
        p = Process(target=app)
        process_list.append(p)
        p.start()
        time.sleep(1)
    for p in process_list:
        p.join()
```

- basc_torch_usage.py

```python
import os

import torch


def get_memory_allocated():
    """Tensor memory currently allocated by torch, in MiB."""
    return torch.cuda.memory_allocated() / 1024 / 1024


def get_memory_reserved():
    """Memory held by torch's caching allocator, in MiB."""
    return torch.cuda.memory_reserved() / 1024 / 1024


def app():
    process_id = os.getpid()
    print(f"[{process_id}] Before initializing, memory allocated:{get_memory_allocated()}, memory reserved: {get_memory_reserved()}")
    a = torch.rand(1024, 1024).cuda()
    print(f"[{process_id}] After initializing, memory allocated:{get_memory_allocated()}, memory reserved: {get_memory_reserved()}")
    del a
    print(f"[{process_id}] After deleting, memory allocated:{get_memory_allocated()}, memory reserved: {get_memory_reserved()}")
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print(f"[{process_id}] After empty_cache, memory allocated:{get_memory_allocated()}, memory reserved: {get_memory_reserved()}")
    # Keep the process alive so its VRAM footprint stays visible in nvidia-smi.
    while True:
        pass
```

### Experiment Results

1. Measured with the CUDA memory statistics torch provides, the `app` function appears to use very little VRAM:

```
torch version: 1.12.1+cu116
[3598094] Before initializing, memory allocated:0.0, memory reserved: 0.0
[3598145] Before initializing, memory allocated:0.0, memory reserved: 0.0
[3598094] After initializing, memory allocated:4.0, memory reserved: 20.0
[3598094] After deleting, memory allocated:0.0, memory reserved: 20.0
[3598094] After empty_cache, memory allocated:0.0, memory reserved: 0.0
[3598185] Before initializing, memory allocated:0.0, memory reserved: 0.0
[3598145] After initializing, memory allocated:4.0, memory reserved: 20.0
[3598145] After deleting, memory allocated:0.0, memory reserved: 20.0
[3598145] After empty_cache, memory allocated:0.0, memory reserved: 0.0
[3598185] After initializing, memory allocated:4.0, memory reserved: 20.0
[3598185] After deleting, memory allocated:0.0, memory reserved: 20.0
[3598185] After empty_cache, memory allocated:0.0, memory reserved: 0.0
```

2. nvidia-smi, however, shows that even after the tensor is deleted and `empty_cache()` is called, each process still occupies 731MiB of VRAM:

```
Every 0.1s: nvidia-smi                                  tesla: Tue Feb 28 11:33:08 2023

Tue Feb 28 11:33:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   40C    P8    23W / 350W |   2206MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1349      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2305      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    3598094      C   python                            731MiB |
|    0   N/A  N/A    3598145      C   python                            731MiB |
|    0   N/A  N/A    3598185      C   python                            731MiB |
+-----------------------------------------------------------------------------+
```
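The discrepancy comes from memory that sits outside torch's caching allocator (the per-process CUDA context plus PyTorch's preloaded kernels), which `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` never count. To confirm the per-process footprint programmatically instead of eyeballing nvidia-smi, here is a minimal sketch using the third-party `pynvml` package (an assumption: `pip install pynvml`), which reads the same NVML counters that nvidia-smi displays:

```python
# Minimal sketch (assumes the third-party pynvml package: pip install pynvml).
# NVML is the same interface nvidia-smi reads, so this reports each process's
# full footprint, including the CUDA context that torch's
# memory_allocated()/memory_reserved() statistics never count.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # usedGpuMemory is in bytes; it can be None when the driver hides it.
    used_mib = (proc.usedGpuMemory or 0) / 1024 / 1024
    print(f"pid={proc.pid}, GPU memory={used_mib:.0f} MiB")

pynvml.nvmlShutdown()
```

Running this while the three workers sit in their loops should report roughly the same 731MiB per process that nvidia-smi shows above.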
### Possible Directions Surveyed So Far

1. A recent PyTorch conference introduced a new official repo, https://github.com/pytorch/multipy, which uses a single process to drive multiple interpreters/GPUs.
2. In the 2019 issue "Couple hundred MB are taken just by initializing cuda", PyTorch developer colesbury (an engineer at Meta AI) explains that PyTorch loads a large set of CUDA kernels at startup, and that their tight interdependence makes them impossible to split apart: https://github.com/pytorch/pytorch/issues/20532
3. NVIDIA provides a Multi-Process Service (MPS): https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
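As a point of comparison while evaluating these directions: the context overhead is paid per process, so threads inside a single process pay it only once. The sketch below is my own illustration of that idea (not the multipy or MPS API); it reworks the experiment with `threading`, and nvidia-smi should then show a single python process instead of three:

```python
# Sketch: threads share their parent process's single CUDA context, so the
# ~731MiB context/kernel overhead observed above is paid once, not per worker.
# This only illustrates the idea; it is not how multipy or MPS work.
import threading

import torch


def app(worker_id: int):
    a = torch.rand(1024, 1024).cuda()  # allocates inside the shared context
    print(f"[thread {worker_id}] allocated: {torch.cuda.memory_allocated() / 1024 / 1024} MiB")
    del a


threads = [threading.Thread(target=app, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
torch.cuda.empty_cache()
```

The trade-off is the GIL: Python-level work in the threads serializes, which is part of what multipy's multiple-interpreter design is meant to address.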