BioNeMo & DiffDock: Setting Up a Biomedical AI Training Environment and Tuning Parallelization Parameters

## Prerequisites

### Generate the API keys used for training

NVIDIA API key:

```
{Zm1mNGp0.....}
```

https://org.ngc.nvidia.com/setup/api-key

API key for visualizing the training process (Weights & Biases):

```
{4c47..........}
```

https://wandb.ai/ntcu_sunfrancis12

### Install the required environment

Install Docker (on Ubuntu the engine package is `docker.io`):

```
sudo apt install docker.io
```

Install Python:

```
sudo apt install python3
```

Install pip:

```
sudo apt install python3-pip
```

Install wandb:

```
pip install wandb
```

Install the NVIDIA Container Toolkit:

```
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

```
sudo apt-get update
```

```
sudo apt-get install -y nvidia-container-toolkit
```

Restart Docker:

```
sudo systemctl restart docker
```

### Install the image

https://docs.nvidia.com/bionemo-framework/latest/quickstart-fw.html

Download the BioNeMo image. Log in to the registry first, using `$oauthtoken` as the username and the NGC API key as the password. The commands below run Docker through `sudo`, so log in the same way to keep the credentials with the same user:

```
sudo docker login nvcr.io
```

```
sudo docker pull nvcr.io/nvidia/clara/bionemo-framework:1.3
```

Once the download finishes, start the container:

```
sudo docker run -it --rm --gpus all nvcr.io/nvidia/clara/bionemo-framework:1.3 bash
```

:::info
To avoid permission problems later, when the shared memory size has to be changed, it is recommended to start the container with the following command instead:

```
sudo docker run -it --ipc="host" --privileged --rm --gpus all nvcr.io/nvidia/clara/bionemo-framework:1.3 bash
```

https://stackoverflow.com/questions/63938723/permission-denied-when-trying-to-access-a-shared-memory-from-a-docker-containe
:::

### Train the model: preprocess

```
cd /workspace/bionemo/examples/molecule/megamolbart
python pretrain.py --config-path=conf --config-name=pretrain_xsmall_span_aug \
  do_training=False \
  model.data.links_file='${oc.env:BIONEMO_HOME}/examples/molecule/megamolbart/dataset/ZINC-downloader-sample.txt' \
  model.data.dataset_path=$(pwd)/zinc_csv
```

https://docs.nvidia.com/bionemo-framework/latest/quickstart-fw.html

If `/dev/shm` is too small, it can be resized with a one-off remount (the new size does not persist across reboots):

```
sudo mount -o remount,size=8G /dev/shm
```
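Before remounting, it can help to check how large `/dev/shm` currently is. The following is a minimal sketch, not part of BioNeMo: `mount_size_mib` and `needs_resize` are hypothetical helper names, and the 8192 MiB threshold in the usage line simply mirrors the `size=8G` value above.

```shell
# Hypothetical helper (not part of BioNeMo): print the total size, in MiB,
# of the filesystem that holds the given path. Defaults to /dev/shm.
mount_size_mib() {
  # -P forces one line per filesystem; column 2 is the size in 1024-byte blocks
  df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { print int($2 / 1024) }'
}

# Hypothetical helper: exit 0 when the mount at $1 is smaller than $2 MiB,
# i.e. when a remount with a larger size would be worthwhile.
needs_resize() {
  [ "$(mount_size_mib "$1")" -lt "$2" ]
}
```

Under these assumptions, `needs_resize /dev/shm 8192 && sudo mount -o remount,size=8G /dev/shm` would remount only when the current size is below 8 GiB.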
https://stackoverflow.com/questions/58804022/how-to-resize-dev-shm

Train a small model with the default parameters:

```
python pretrain.py --config-path=conf --config-name=pretrain_xsmall_span_aug \
  do_training=True \
  model.data.dataset_path=$(pwd)/zinc_csv \
  model.data.dataset.train=x000 \
  model.data.dataset.val=x000 \
  model.data.dataset.test=x000 \
  exp_manager.exp_dir=$(pwd)/results
```

* `--config-name`: name of the config file

### Tuning steps

Modify the parameters in `pretrain_base.yaml`:

1. Change `max_steps`; 10000 is a reasonable first try.
2. Adjust `batch_size`, aiming for 80-90% GPU memory utilization.
3. Lower `gradient_clip_val`, down to a minimum of 0.1.
4. Lower `lr` (the learning rate). Suggested values: 1e-4, 2e-4, 5e-5.
5. Increase the global batch size, for example with `accumulate_grad_batches: 2`.

### Official terminology

* Reduced Train Loss: the value of the training loss function aggregated from all parallel processes. If the training loss doesn't decrease or explodes, this is a possible sign that the learning rate needs to be reduced.
* Loss Scale: the scaling factor of the loss.
* Gradient Norm: the value of the gradient norm. Increasing or undefined (NaN) values of the gradient norm usually indicate instabilities in training. In such cases, the learning rate may need to be reduced.
* Learning Rate: the value of the learning rate.
* Epoch: the value of the epoch. Note: Megatron datasets upsample the data, so the entire training up to `max_steps` is considered a single epoch.
* Consumed Samples: the number of training samples that have been consumed during training.
* Validation Step Timing: monitors the time required for each validation step during training. This is useful for diagnosing performance issues and bottlenecks in the validation process, and can help optimize the speed and efficiency of model training.
* Train Backward Timing: a measure of the time required for the backpropagation step. Measuring the time this takes can help identify bottlenecks in the training process and assist in performance optimization.
* Validation Loss: the loss function computed on the validation set.
Validation loss is used during model training to avoid over-fitting (when a model learns the training data too well and performs poorly on unseen data). If validation loss starts to increase while training loss decreases, this is usually a sign of over-fitting.

https://docs.nvidia.com/bionemo-framework/0.4.0/bionemo-fw-for-model-training-fw.html

Other parameters noted while tuning:

* `hidden_size: 768`
* `ffn_hidden_size`: usually 4 * `hidden_size`
* `bias_dropout_add_fusion=True`
* `gradient_clip_val: 0.5`
* `accumulate_grad_batches`: adjust as needed

## DiffDock

### Build the dataset

Create `/workspace/bionemo/data/splits`:

```
mkdir -p /workspace/bionemo/data/splits && cd /workspace/bionemo/data/splits
```

Download the dataset and unzip it:

```
wget "https://zenodo.org/records/8278563/files/posebusters_paper_data.zip?download=1" -O PDB && unzip PDB
```

https://zenodo.org/records/8278563

Make three copies of the text file containing the IDs, named `split_train`, `split_val`, and `split_test`. `cp` cannot write one source file to several targets in a single call, so use a loop:

```
for split in train val test; do cp posebusters_benchmark_set_ids.txt "split_$split"; done
```

:::info
Alternatively, change the file names in the .yaml file directly to posebusters_benchmark_set_ids.txt
:::

Create the `PDB_processed` directory and copy the contents of the `posebusters_benchmark_set` directory into it:

```
mkdir -p ../PDB_processed && cp -r posebusters_benchmark_set/* ../PDB_processed/ && cd ../PDB_processed
```

The file names the trainer expects differ from the ones the dataset provides, so rename the files with a shell script:

```
vim rename.sh
```

**rename.sh**

```bash
#!/bin/bash

# Iterate over every directory in the current directory
for dir in */; do
    # Strip the trailing slash from the directory name
    dirname=${dir%/}
    # Skip directories that have no {id}_protein.pdb file
    [ -f "$dirname/${dirname}_protein.pdb" ] || continue
    # Rename the file to {id}_protein_processed.pdb
    mv "$dirname/${dirname}_protein.pdb" "$dirname/${dirname}_protein_processed.pdb"
done
```

```
chmod +x rename.sh ; ./rename.sh
```

### Preprocessing

:::info
Preprocessing consists of two stages.
:::

Switch to the diffdock directory:

```
cd /workspace/bionemo/examples/molecule/diffdock
```

First, compute the protein embeddings with ESM2:

```
python train.py --config-name=train_score do_embedding_preprocessing=True do_training=False
```

Then run the graph preprocessing for the score model:

```
python train.py --config-name=train_score do_preprocessing=True do_training=False
```

https://docs.nvidia.com/bionemo-framework/latest/preprocessing-bcp-training-diffdock.html

### Train the model

Train the DiffDock model (small_score).
* `nproc_per_node`: number of GPUs

```
torchrun --nproc_per_node=2 train.py
```

https://idataagent.com/2023/01/07/pytorch-distributed-data-parallerl-multi-gpu/
https://blog.51cto.com/u_16099279/6887253

`torchrun` sets the `RANK` environment variable for each process itself (DistributedDataParallel).

https://blog.csdn.net/weixin_44504393/article/details/127172367
https://pytorch.org/docs/stable/elastic/run.html

:::info
Single node, single GPU:

```
python train.py --config-name=train_score
```

Parameters:
- micro_batch_size: 4
- max_total_size: 5896

Results (on a GTX 1080 Ti):
- About 10 GB of GPU memory is actually used
- One epoch takes about 13 minutes
:::

:::warning
To avoid out-of-memory (OOM) errors, use a smaller batch size; DiffDock is memory-hungry.
:::

### Tuning steps

https://docs.nvidia.com/bionemo-framework/latest/notebooks/model_training_diffdock.html

## Train confidence

We can continue training the confidence model from the officially provided `.nemo` file.

https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/diffdock_confidence

Generate a token:

```
API_KEY="YOUR_NGC_API_KEY"  # the NVIDIA API key generated in the prerequisites; never publish the real value
TOKEN=$(curl -s -u "\$oauthtoken":"$API_KEY" -H 'Accept:application/json' 'https://authn.nvidia.com/token?service=ngc&scope=group/ngc:cvdgvlzx07ek&group/ngc:cvdgvlzx07ek/sun' | jq -r '.token')
```

Download the nemo file (this is version 1.1):

```
echo "Download Model file"
curl -LO --request GET 'https://api.ngc.nvidia.com/v2/orgs/nvidia/teams/clara/models/diffdock_confidence/versions/1.1/files/diffdock_confidence.nemo' -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json"
```

https://docs.nvidia.com/ngc/gpu-cloud/ngc-catalog-user-guide/index.html#download-models-via-wget-authenticated-access

Use `find` to locate the nemo file:

```
find /workspace/ -type f -name '*.nemo'
```

Run the data preprocessing, specifying the nemo file to continue training from:

```
python train.py --config-name=train_confidence do_preprocessing=True do_training=False score_infer.restore_from_path=/workspace/bionemo/examples/molecule/diffdock/model/diffdock_confidence.nemo
```

:::info
I ran into a bug here: for some reason only the `model_paper_large_score_split_train_limit_0` directory was created. If the others are missing, create the following two directories:
* `model_paper_large_score_test_train_limit_0`
* `model_paper_large_score_val_train_limit_0`

Then copy `confidence_cache_id_base.sqlite3` from the train directory into both of them.

**Before:**

![image](https://hackmd.io/_uploads/B1bCWeBvC.png)

**After copying:**

![image](https://hackmd.io/_uploads/Hy0UfgSvA.png)
:::

Train the model:

```
torchrun --nproc_per_node=2 train.py --config-name=train_confidence score_infer.restore_from_path=/workspace/bionemo/examples/molecule/diffdock/model/diffdock_confidence.nemo
```

https://wandb.ai/ntcu_sunfrancis12/diffdock_large_score_model/reports/Diffdock--Vmlldzo4NDk3OTgx?accessToken=m3h47o9dm2b5b8ilyxcltgyiz1fti259pliprgqqr8fa68pxe8vzjm2r9u0pc3jf
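If the cache-directory workaround from the note in this section is needed more than once, it could be scripted before the training command is run. The following is a minimal sketch under stated assumptions: `copy_confidence_cache` is a hypothetical helper name, and the parent directory of the cache folders is passed as an argument because this note does not record where it lives. The directory and file names are the ones listed in the note.

```shell
# Hypothetical helper (not part of BioNeMo): copy the train split's
# confidence cache into the test and val cache directories.
# $1 is the directory that contains the model_paper_large_score_* folders
# (an assumption; locate it in your own container first).
copy_confidence_cache() {
  cache_root="$1"
  src="$cache_root/model_paper_large_score_split_train_limit_0/confidence_cache_id_base.sqlite3"

  # Stop early when the train cache has not been generated yet
  if [ ! -f "$src" ]; then
    echo "missing $src" >&2
    return 1
  fi

  for name in model_paper_large_score_test_train_limit_0 \
              model_paper_large_score_val_train_limit_0; do
    mkdir -p "$cache_root/$name"
    # Leave an existing cache file untouched
    [ -e "$cache_root/$name/confidence_cache_id_base.sqlite3" ] || cp "$src" "$cache_root/$name/"
  done
}
```

The function is safe to call repeatedly: directories that already exist are reused, and a cache file that is already in place is not overwritten.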