UW HYAK Notes

SALLOC

  • Request a GPU
salloc --partition=gpu-l40 --account=stf --mem=10G --gres=gpu:1 --cpus-per-task=1 --time=2:00:00
  • Check whether the job got a GPU (replace the job ID with your own)
scontrol show job 24333466 | grep gpu

Conda Reinstall

rm -rf '/gscratch/scrubbed/andysu/miniconda3'
bash Miniconda3-latest-Linux-x86_64.sh -p /gscratch/scrubbed/andysu/miniconda3
python -m pip install --force-reinstall --upgrade setuptools pip

Check which nodes are idle

sinfo -t idle
salloc --partition=ckpt-all --gres=gpu:1 --nodelist=g3091 --time=8:00:00

GPU check code

module load cuda/11.8.0
python -c "import torch; print(torch.cuda.is_available())"
  • Python code
import torch
print(torch.__version__)
print(torch.version.cuda)  # confirm the PyTorch build supports CUDA (None means a CPU-only build)
print(torch.backends.cudnn.enabled)
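
The checks above can be bundled into one function that also reports when PyTorch is missing from the active environment. This is only a minimal sketch, not part of the original notes: the name gpu_report is hypothetical, and it relies on the torch attributes already used above plus torch.cuda.device_count().

```python
import importlib.util


def gpu_report():
    """Collect the PyTorch/CUDA facts checked above into a dict (hypothetical helper)."""
    # Avoid an ImportError when PyTorch is not installed in the active environment.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for a CPU-only build
        "cudnn_enabled": torch.backends.cudnn.enabled,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Run it on the compute node after module load cuda/11.8.0; if cuda_available is False while built_with_cuda is set, the job probably has no GPU attached.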

Check whether the job has a GPU device

scontrol show job 24202314 | grep TRES
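
The same check can be scripted by parsing the scontrol text. This is a rough sketch, not from the original notes: count_gpus is a hypothetical helper, the sample strings are illustrative rather than real output from the job above, and the exact field name (TRES, ReqTRES, AllocTRES) and layout depend on the Slurm version.

```python
import re


def count_gpus(scontrol_text):
    """Return the largest gres/gpu count found in the text, or 0 if none (hypothetical helper)."""
    # Untyped entry, e.g. "gres/gpu=1".
    generic = re.findall(r"gres/gpu=(\d+)", scontrol_text)
    if generic:
        return max(int(n) for n in generic)
    # Typed entry, e.g. "gres/gpu:l40=1" (assumed format).
    typed = re.findall(r"gres/gpu:[\w.-]+=(\d+)", scontrol_text)
    return max((int(n) for n in typed), default=0)


if __name__ == "__main__":
    # Illustrative line only; paste the real output of `scontrol show job <jobid>`.
    sample = "   TRES=cpu=1,mem=10G,node=1,billing=1,gres/gpu=1"
    print(count_gpus(sample))
```

On a node with Slurm available, the text could come from subprocess.run(["scontrol", "show", "job", jobid], capture_output=True, text=True).stdout. A result of 0 suggests the job was allocated without --gres=gpu.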

If you used salloc and got logged out, get back to the original compute node

srun --jobid=<jobid> --pty bash

flash_attn: tried 2.6.1 (chosen with the CUDA version in mind) -> installed successfully

  • Create conda environment
conda create --name my_env python=3.9
conda activate my_env