# PyTorch distributed training

###### tags: `DLRM`

## Prerequisites

* [torch_ucc](https://github.com/facebookresearch/torch_ucc) (PyTorch plugin)
    * [ucx](https://github.com/openucx/ucx) (requires v1.11 or later)
    * [ucc](https://github.com/openucx/ucc)
    * gcc (the newer the better?)
* [ucx](https://github.com/openucx/ucx) (requires v1.11 or later)
    * [doxygen](https://github.com/doxygen/doxygen)
        * gcc
        * python
        * cmake
        * GNU tools
            * [m4](https://www.gnu.org/software/m4/)
            * [bison](https://ftp.gnu.org/gnu/bison/)
            * [automake (requires 1.15 or later)](https://ftp.gnu.org/gnu/automake/)
            * [autoconf (requires 2.71 or later)](https://ftp.gnu.org/gnu/autoconf/)
            * [gperf](http://ftp.gnu.org/pub/gnu/gperf/gperf-3.1.tar.gz)
            * [texinfo (makeinfo)](https://ftp.gnu.org/gnu/texinfo/)
            * [help2man](https://ftp.gnu.org/gnu/help2man/)
            * [libiconv](https://ftp.gnu.org/gnu/libiconv/)
            * [flex](https://github.com/westes/flex/)
* [pytorch](https://github.com/pytorch/pytorch)
    * backend
        * [NCCL](https://github.com/NVIDIA/nccl)
        * [GLOO](https://github.com/facebookincubator/gloo)
        * MPI (OpenMPI, MPICH, etc.; can be combined with UCX, CUDA, etc.)

## torch-ucc (cuda10.1_ompi3.1.6)

Enter the container (two variants were used):

`singularity shell --writable --bind /opt/pbs,/usr/lib:/usr/lib2 cu10.1dist`

`singularity shell --writable --bind /opt/pbs,/usr/lib:/usr/lib2,/usr/include:/usr/include2,/usr/bin:/usr/bin2 10.1distofficial`

Basic exports:

```
export PATH=/home/users/industry/ai-hpc/apacsc19/openmpi-3.1.6/bin:$PATH
export LD_LIBRARY_PATH=/home/users/industry/ai-hpc/apacsc19/openmpi-3.1.6/lib:$LD_LIBRARY_PATH

# sanity-check the toolchain
mpirun --version
gcc --version
g++ --version
cmake --version
nvcc --version
```

### ucx

```
git clone https://github.com/openucx/ucx.git
cd ucx
./autogen.sh
./contrib/configure-release --prefix=/src/ucx --with-cuda=/usr/local/cuda --with-rc --with-mlx5-dv --with-ib-hw-tm --enable-mt
make -j install

# export
export LD_LIBRARY_PATH=/src/ucx/lib:/src/ucx/lib/ucx:$LD_LIBRARY_PATH
```

### ucc

```
git clone https://github.com/openucx/ucc.git
cd ucc
./autogen.sh
./configure --prefix=/src/ucc --with-ucx=/src/ucx --with-cuda=/usr/local/cuda --with-mpi=/home/users/industry/ai-hpc/apacsc19/openmpi-3.1.6 --with-nccl
make -j8
make install
```

### pytorch

```
conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c pytorch magma-cuda101

git clone --branch v1.7.1 https://github.com/pytorch/pytorch.git pytorch-1.7.1
cd pytorch-1.7.1
git submodule sync
git submodule update --init --recursive

export TORCH_CUDA_ARCH_LIST="7.0"
export USE_GLOO=1
export USE_DISTRIBUTED=1
export USE_OPENCV=0
export USE_CUDA=1
export USE_NCCL=0
export USE_MKLDNN=0
export BUILD_TEST=0
export USE_FBGEMM=0
export USE_NNPACK=0
export USE_QNNPACK=0
export USE_XNNPACK=0
export USE_KINETO=1
export MAX_JOBS=$(($(nproc)-1))

python setup.py clean
python setup.py install
```

### torch-ucc

```
git clone https://github.com/facebookresearch/torch_ucc.git
cd torch_ucc
UCX_HOME=/src/ucx UCC_HOME=/src/ucc WITH_CUDA=/usr/local/cuda/ python setup.py install
```

### preprocess

Preprocessing needs at least 8 × 32 GB GPUs; see the official recommendation in [preprocess the terabyte dataset (github)](https://github.com/facebookresearch/dlrm/issues/125).

Preprocessing has three main stages, "resulting in: (i) `day_*.npz`, (ii) `day_*_processed.npz`, and (iii) `day_*_reordered.npz`".

If any stage fails, you must delete that stage's intermediate files cleanly before restarting preprocessing from that stage:

```
need to delete any intermediate files up to the beginning of the stage
(also, be careful with two small files day_fea_count.npz and day_day_count.npz
they are very important).
```
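For example, if the reordering stage died halfway, a minimal cleanup sketch might look like the following. The `DATA_DIR` value is taken from the run command below; adjust it to wherever your `day_*` files actually live.

```
# hypothetical cleanup before restarting the reordering stage
DATA_DIR=/home/nckuhpclab07/dlrm1/dataset_for_torchucc

# these two small files are very important -- make sure they survive
ls -l "$DATA_DIR"/day_fea_count.npz "$DATA_DIR"/day_day_count.npz

# remove only the intermediate output of the failed stage
rm -f "$DATA_DIR"/day_*_reordered.npz
```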
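Once preprocessing is done, and before launching the full training run below, it is worth checking that the `ucc` backend is actually usable. A minimal single-rank sketch, assuming torch_ucc was installed as above and that importing `torch_ucc` registers the `ucc` backend (as its README describes):

```
export MASTER_ADDR=127.0.0.1 MASTER_PORT=29500
python - <<'EOF'
import torch
import torch.distributed as dist
import torch_ucc  # importing the plugin registers the "ucc" backend

# a single-rank group is enough to prove the backend loads and a collective runs
dist.init_process_group("ucc", rank=0, world_size=1)
t = torch.ones(4)
dist.all_reduce(t)  # no-op sum over one rank
print("ucc all_reduce ok:", t.tolist())
dist.destroy_process_group()
EOF
```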
### run

```
python /home/nckuhpclab07/torchucconly/dlrm/dlrm_s_pytorch.py \
    --arch-sparse-feature-size=128 \
    --arch-mlp-bot="13-512-256-128" \
    --arch-mlp-top="1024-1024-512-256-1" \
    --max-ind-range=40000000 \
    --data-generation=dataset \
    --data-set=terabyte \
    --raw-data-file=/home/nckuhpclab07/dlrm1/dataset_for_torchucc/day \
    --processed-data-file=/home/nckuhpclab07/torchucconly/terabyte_processed.npz \
    --loss-function=bce \
    --round-targets=True \
    --learning-rate=1.0 \
    --mini-batch-size=2048 \
    --print-freq=2048 \
    --print-time \
    --test-freq=102400 \
    --test-mini-batch-size=16384 \
    --test-num-workers=16 \
    --memory-map \
    --mlperf-logging \
    --mlperf-auc-threshold=0.8025 \
    --mlperf-bin-loader \
    --mlperf-bin-shuffle \
    --use-gpu
```
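The command above runs a single process. To actually exercise UCC collectives across ranks, here is a hypothetical two-rank launch using the OpenMPI built earlier; rank and world size are derived from the `OMPI_COMM_WORLD_*` variables that `mpirun` sets, and the heredoc smoke test is the same idea as the single-rank check above:

```
export MASTER_ADDR=127.0.0.1 MASTER_PORT=29500
mpirun -np 2 -x MASTER_ADDR -x MASTER_PORT bash -c '
export RANK=$OMPI_COMM_WORLD_RANK WORLD_SIZE=$OMPI_COMM_WORLD_SIZE
python - <<EOF
import os, torch, torch.distributed as dist, torch_ucc
dist.init_process_group("ucc",
                        rank=int(os.environ["RANK"]),
                        world_size=int(os.environ["WORLD_SIZE"]))
t = torch.ones(1) * dist.get_rank()
dist.all_reduce(t)  # expect 0 + 1 = 1 on both ranks
print("rank", dist.get_rank(), "->", t.item())
dist.destroy_process_group()
EOF
'
```

The same pattern should extend to multiple nodes with a hostfile, as long as `MASTER_ADDR` points at a host reachable from every rank.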