---
tags: CSE5449
---

# Hands-on 2: Run distributed training with Horovod

## Command README

- Once on the shell, copy the scripts and README file to your `~/hands-on-2/` directory

```
mkdir -p ~/hands-on-2/
cp -r /fs/ess/PAS2312/owens/hands-on-2/* ~/hands-on-2/
```

- Use the following command to see your own jobs

```
squeue -u $USER
```

## Lab 1

### Task 1: Install Miniconda

```
cd ~/hands-on-2/environment
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $PWD/miniconda
source miniconda/bin/activate
export PYTHONNOUSERSITE=true
```

### Task 2: Create environments

```
conda create -n pytorch_cpu python=3.9
conda activate pytorch_cpu
module load mvapich2/2.3.6
```

### Task 3: Install PyTorch and Horovod

```
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
conda install cmake
HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod
```

### Task 4: Test the installation

```
srun --time=00:10:00 --nodes=1 --ntasks-per-node=1 --account=PAS2312 --exclusive --cpus-per-task=28 run_test.sh $PWD
srun --time=00:10:00 --nodes=2 --ntasks-per-node=1 --account=PAS2312 --exclusive --cpus-per-task=28 run_test.sh $PWD
```

## Lab 2: Training

```
cd ~/hands-on-2/gpu_experiments
```

### Task 1: Run MLP training on a single GPU

```
srun --time=00:10:00 --nodes=1 --ntasks-per-node=1 --gpus-per-node=1 --account=PAS2312 --exclusive --gpu_cmode=shared --cpus-per-task=28 batch_dist_mlp.sh
```

### Task 2: Run MLP training on two GPUs

```
srun --time=00:10:00 --nodes=2 --ntasks-per-node=1 --gpus-per-node=1 --account=PAS2312 --exclusive --gpu_cmode=shared --cpus-per-task=28 batch_dist_mlp.sh
```

### Task 3: Distributed Training Performance Evaluation (1 GPU)

```
srun --time=00:10:00 --nodes=1 --ntasks-per-node=1 --gpus-per-node=1 --account=PAS2312 --exclusive --gpu_cmode=shared --cpus-per-task=28 batch_horovod_gpu.sh
```

### Task 4: Distributed Training Performance Evaluation (2 GPUs)

```
srun --time=00:10:00 --nodes=2 --ntasks-per-node=1 --gpus-per-node=1 --account=PAS2312 --exclusive --gpu_cmode=shared --cpus-per-task=28 batch_horovod_gpu.sh
```
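
For the two performance evaluation tasks, it can help to turn the measured run times into a speedup and a parallel efficiency. The sketch below is one possible way to do that. The `speedup` function is a hypothetical helper, not part of the provided lab scripts, and the timings in the example call are made-up placeholders to be replaced with your own measurements.

```shell
# Hypothetical helper (not part of the lab scripts): compute speedup and
# parallel efficiency from two measured wall-clock times in seconds.
#   $1 = time on 1 GPU, $2 = time on N GPUs, $3 = N
speedup() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        s = t1 / tn                      # speedup relative to 1 GPU
        printf "speedup=%.2f efficiency=%.1f%%\n", s, 100 * s / n
    }'
}

# Example with placeholder timings: 120 s on 1 GPU, 75 s on 2 GPUs
speedup 120 75 2
```

An efficiency well below 100% on two GPUs usually points to communication or input-pipeline overhead rather than compute.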