# Ray Cluster

## Bare metal Ray cluster

How to create a bare metal cluster for parallel processing.

### Update Ubuntu to 20.04 if needed

```
sudo apt update && sudo apt upgrade
sudo apt autoremove
sudo apt install update-manager-core
sudo do-release-upgrade
sudo reboot
```

### Remove previous CUDA installation

```
sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
    "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*"
sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
sudo apt-get autoremove
```

### Install CUDA 11.7

```
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
```

Edit `~/.bashrc` to use the new CUDA path:

```
# Old CUDA 10.1 paths, now commented out:
#export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
#export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
#export CUDA_HOME=/usr/local/cuda-10.1

export PATH=/usr/local/cuda-11.7/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda-11.7
```

### Prepare Ray using an Anaconda environment

Get the environment file from [here](https://gist.github.com/darwinharianto/d05a32389b0a89ff3588b22046573120) and the `cluster.yaml` template from [here](https://gist.github.com/darwinharianto/2667fd526ee6115bb7593fd2d63fd471).

On your own PC, create the Anaconda environment and start the Ray cluster:

```
conda env create -q -n ray_cluster -f /home/doors/environment.yaml
conda activate ray_cluster
ray up cluster.yaml
```

## Docker Ray Cluster

How to create a containerized cluster for parallel processing.

### Install Docker and make sure it is working

```
sudo apt-get remove docker docker-engine docker.io containerd runc
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
```

To check that it is working:

```
sudo docker run hello-world
```

### Install NVIDIA Docker

```
distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
    && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
        | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
        | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```

To check that it is working:

```
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```

Get the `cluster_docker.yaml` template from [here](https://gist.github.com/darwinharianto/2667fd526ee6115bb7593fd2d63fd471).

On your own PC, create the Anaconda environment and start the Ray cluster:

```
conda env create -q -n ray_cluster -f /home/doors/environment.yaml
conda activate ray_cluster
ray up cluster_docker.yaml
```

## Sending a script command

A sample Python script is available from [here](https://gist.github.com/darwinharianto/2840cf9d370d42d9a95e8749efceabf1).

#### From inside the cluster (SSH into the head node or one of the workers)

```
python training_pytorch_distributed.py --address=HEAD_IP_ADDRESS:6379 --num_workers=2 \
    --model_name_or_path=textattack/roberta-base-CoLA --task_name=cola --use_gpu \
    --per_device_train_batch_size=16
```

#### Ray Client, for a PC outside of the cluster

```
python training_pytorch_distributed.py --address=ray://HEAD_IP_ADDRESS:10001 --num_workers=2 \
    --model_name_or_path=textattack/roberta-base-CoLA --task_name=cola --use_gpu \
    --per_device_train_batch_size=16
```

## Stopping the cluster

```
ray down cluster.yaml
```

or

```
ray down cluster_docker.yaml
```
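For reference, the `cluster.yaml` handed to `ray up` and `ray down` above follows the Ray autoscaler configuration schema with the `local` (on-premise) provider. The sketch below shows roughly what the linked template contains; all IPs, the SSH user, and the conda environment name are placeholders, and the real template should be treated as authoritative:

```
# Rough sketch of a Ray autoscaler config for a bare metal cluster
# ("local" provider). All values here are placeholders.
cluster_name: ray_cluster

provider:
    type: local
    head_ip: HEAD_IP_ADDRESS
    worker_ips: [WORKER_IP_1, WORKER_IP_2]

auth:
    ssh_user: YOUR_SSH_USER

# Commands the autoscaler runs on each node, activating the conda
# environment created earlier before starting Ray.
head_start_ray_commands:
    - conda activate ray_cluster && ray stop
    - conda activate ray_cluster && ray start --head --port=6379
worker_start_ray_commands:
    - conda activate ray_cluster && ray stop
    - conda activate ray_cluster && ray start --address=$RAY_HEAD_IP:6379
```

The `cluster_docker.yaml` variant additionally carries a `docker:` section (image and container name) so that Ray runs inside containers on each node.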
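Note that `--num_workers`, `--use_gpu`, and the other flags in the submission commands above are consumed by the training script itself, not by Ray; only the address is forwarded to `ray.init`. As a minimal sketch (this is a hypothetical stand-in, not the actual `training_pytorch_distributed.py`), a submitted script might parse them like this:

```python
import argparse


def parse_args(argv=None):
    """Parse the cluster-related flags used in the submission commands."""
    parser = argparse.ArgumentParser()
    # "ray://HOST:10001" goes through Ray Client; "HOST:6379" connects directly.
    parser.add_argument("--address", required=True)
    parser.add_argument("--num_workers", type=int, default=1)
    parser.add_argument("--use_gpu", action="store_true")
    parser.add_argument("--model_name_or_path", default=None)
    parser.add_argument("--task_name", default=None)
    parser.add_argument("--per_device_train_batch_size", type=int, default=16)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # In the real script, the address would then be handed to Ray, e.g.:
    #   import ray
    #   ray.init(address=args.address)
    print(args.address, args.num_workers, args.use_gpu)
```

Because the address decides between a direct connection and Ray Client, the same script works unchanged whether it is launched from inside the cluster or from an outside PC.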