# Installation Instructions for GPU-Server ready for Multi-GPU Computing

###### tags: `GPU-Cluster`

> Presuming a fresh MAAS system image as the starting point

## Preliminaries

#### 1. Enable password-less sudo

```bash=
sudo vim /etc/sudoers
```

and add `NOPASSWD:` in front of the last `ALL` in the line defining the commands sudo users are allowed to execute.

#### 2. Enable password-based ssh login

```bash=
sudo vim /etc/ssh/sshd_config
```

and change the `no` for `PasswordAuthentication` to `yes`. Then restart the ssh service

```bash=
sudo service ssh restart
```

## Actual System Installation

#### Install [CUDA](https://developer.nvidia.com/cuda-downloads) as a network-based installation

Get the NVIDIA `*.deb` keyring and update the apt cache

```bash=
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring*.deb
sudo apt-get update
```

after which we can install CUDA with **all its dependencies** in one go.

```bash=
sudo apt-get -y install cuda
```

Then we only have to reboot the machine

```bash=
sudo reboot
```

> At this stage we can optionally run the CUDA integration tests by downloading and running the NVIDIA-provided [CUDA-samples](https://github.com/nvidia/cuda-samples).

The CUDA paths need to be exported as usual, which can be done either by adding them to `~/.bashrc` or by running the command individually from the shell of your choosing.

```bash=
export PATH=/usr/local/cuda-11.8/bin${PATH:+:${PATH}}
```

#### Install [cuDNN](https://developer.nvidia.com/rdp/cudnn-download) as a local debian-installer

> A network installation option is unfortunately not available for cuDNN

Install the prerequisites for cuDNN

```bash=
sudo apt-get install zlib1g
```

Download the local installer for the `.deb` package from the download page and `scp` or `sftp` the package onto the server to continue. Then unpack the `.deb` local installer and enable the keyring.

```bash=
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.6.0.163_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-8.6.0.163/cudnn-local-B0FE0A41-keyring.gpg /usr/share/keyrings/
```

then refresh the apt metadata

```bash=
sudo apt-get update
```

Then we can install the runtime library, the developer library, and the cuDNN samples & documentation

```bash=
sudo apt-get install libcudnn8=8.6.0.163-1+cuda11.8
sudo apt-get install libcudnn8-dev=8.6.0.163-1+cuda11.8
sudo apt-get install libcudnn8-samples=8.6.0.163-1+cuda11.8
```

#### Install [NCCL](https://developer.nvidia.com/nccl/nccl-download) as a network-based installation

The apt keyring for NCCL has already been set up in the CUDA installation step, so we only need to install the requisite packages with

```bash=
sudo apt install libnccl2=2.15.1-1+cuda11.8 libnccl-dev=2.15.1-1+cuda11.8
```

## Litmus Test

Create a Python virtual environment as usual (Python 3.8 was used in this instance) and install JAX into it.

```bash=
pip install --upgrade pip wheel setuptools
pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```

To disable direct peer-to-peer NCCL communication, set the following environment variable

```bash=
export NCCL_P2P_DISABLE=1
```

> In Ubuntu 22.04 LTS this is no longer required and peer-to-peer NCCL works as expected.

Then launch a Python shell and run

```python=
import jax
import jax.numpy as jnp

# One element per device; pmap maps over the leading axis, so this
# assumes at least four GPUs are installed.
x = jnp.arange(4)
# all_gather over the 'i' axis exercises NCCL communication between the GPUs.
y = jax.pmap(lambda x: jax.lax.all_gather(x, 'i'), axis_name='i')(x)
```
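
As a quick sanity check of the multi-GPU setup, the sketch below is a minimal extension of the litmus test above; it assumes the same CUDA-enabled JAX install and simply verifies that JAX sees every installed GPU and that the gathered array is replicated on each device.

```python=
# Minimal sanity-check sketch; assumes the CUDA-enabled JAX install from above.
import jax
import jax.numpy as jnp

# List the accelerators JAX can see; every installed GPU should show up here.
print(jax.devices())
n = jax.device_count()

# One element per device; all_gather over the 'i' axis goes through NCCL.
x = jnp.arange(n)
y = jax.pmap(lambda v: jax.lax.all_gather(v, 'i'), axis_name='i')(x)

# Every device should hold the full gathered array, i.e. y has shape (n, n).
print(y.shape)
```

If `jax.devices()` lists only a CPU device or fewer GPUs than installed, revisit the CUDA, cuDNN, and NCCL steps above before debugging JAX itself.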