idist-snippets test on multiple configurations:

On data cluster:

`nvidia-smi`:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
    | N/A   31C    P0    35W / 250W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-PCIE...  Off  | 00000000:AF:00.0 Off |                    0 |
    | N/A   35C    P0    37W / 250W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |

`pip freeze`:

    certifi==2021.5.30
    numpy==1.21.0
    Pillow==8.2.0
    pytorch-ignite==0.5.0.dev20210623
    torch==1.9.0
    torchvision==0.10.0
    typing-extensions==3.10.0.0

RUN OK:

- with torch spawner
  - `python -u ignite_idist.py --nproc_per_node 1 --backend nccl`
  - `python -u ignite_idist.py --nproc_per_node 2 --backend nccl`
  - `python -u ignite_idist.py --nproc_per_node 1 --backend gloo`
  - `python -u ignite_idist.py --nproc_per_node 2 --backend gloo`
  - with docker hvd-vision:latest (pytorch-ignite==0.4.4 torch==1.8.1)
    - `python -u ignite_idist.py --backend horovod --nproc_per_node 1`
    - `python -u ignite_idist.py --backend horovod --nproc_per_node 2`
- with torch.distributed.launch
  - `python -m torch.distributed.launch --nproc_per_node 1 --use_env ignite_idist.py --backend gloo`
  - `python -m torch.distributed.launch --nproc_per_node 1 --use_env ignite_idist.py --backend nccl`
  - `python -m torch.distributed.launch --nproc_per_node 2 --use_env ignite_idist.py --backend gloo`
  - `python -m torch.distributed.launch --nproc_per_node 2 --use_env ignite_idist.py --backend nccl`
- with horovodrun
  - with docker hvd-vision:latest (pytorch-ignite==0.4.4 torch==1.8.1)
    - `horovodrun -np 1 -H localhost:2 python -u ignite_idist.py --backend horovod`
    - `horovodrun -np 2 -H localhost:2 python -u ignite_idist.py --backend horovod`

On HPC cluster with slurm:

- on HPC cluster with conda pack (same env as data cluster)
  - OK with NCCL
    - `srun --nodes=1 --ntasks-per-node=2 --time=00:03:00 --partition=gpgpu --gres=gpu:2 python ignite_idist.py --backend nccl --nb_samples 256`
  - OK with gloo (note: had to increase memory)
    - `srun --nodes=1 --ntasks-per-node=2 --time=00:03:00 --partition=gpgpu --gres=gpu:2 --mem=10G python ignite_idist.py --backend gloo --nb_samples 256`

On Colab:

- PyTorch version: 1.9.0+cu102
- PyTorch xla version: 1.9
- PyTorch-Ignite version: 0.5.0.dev20210623
- OK
  - `python -u ignite_idist.py --backend xla-tpu --nproc_per_node 8 --batch_size 64 --nb_samples 512`
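
The torch spawner and `torch.distributed.launch` runs on the data cluster cover a small backend × process-count matrix. As a convenience, here is a minimal sketch that regenerates those command lines using only the standard library. The helper names (`spawner_cmd`, `launch_cmd`, `command_matrix`) are hypothetical and not part of ignite or of `ignite_idist.py`; only the script name and flags are taken from the commands in these notes. The output order may differ from the order listed in the notes.

```python
from itertools import product

SCRIPT = "ignite_idist.py"  # script name as used in the notes


def spawner_cmd(backend: str, nproc: int) -> str:
    """Command line for the 'torch spawner' runs (script spawns its own workers)."""
    return f"python -u {SCRIPT} --nproc_per_node {nproc} --backend {backend}"


def launch_cmd(backend: str, nproc: int) -> str:
    """Command line for the torch.distributed.launch runs."""
    return (
        f"python -m torch.distributed.launch --nproc_per_node {nproc} "
        f"--use_env {SCRIPT} --backend {backend}"
    )


def command_matrix(builder, backends=("nccl", "gloo"), nprocs=(1, 2)):
    """Hypothetical helper: one command per (backend, nproc) combination."""
    return [builder(backend, nproc) for backend, nproc in product(backends, nprocs)]


if __name__ == "__main__":
    # Print the eight data-cluster commands, one per line.
    for cmd in command_matrix(spawner_cmd) + command_matrix(launch_cmd):
        print(cmd)
```

The printed lines can be pasted into a shell one at a time; the horovod, slurm, and Colab runs use different launchers and are deliberately left out of this sketch.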