# Installation Instructions for a GPU Server Ready for Multi-GPU Computing
###### tags: `GPU-Cluster`
> Presuming a fresh MAAS system image as the starting point
## Preliminaries
#### 1. Enable password-less sudo
```bash=
sudo vim /etc/sudoers
```
and add `NOPASSWD:` in front of the last `ALL` in the line that defines which commands sudo users are allowed to execute.
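As an illustration, assuming the default Ubuntu rule for the `%sudo` group, the edited line would look roughly like this:
```bash=
%sudo   ALL=(ALL:ALL) NOPASSWD:ALL
```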
#### 2. Enable password-based ssh login
```bash=
sudo vim /etc/ssh/sshd_config
```
and change the value of `PasswordAuthentication` from `no` to `yes`. Then we just need to restart the ssh service
```bash=
sudo service ssh restart
```
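For reference, the relevant directive in `/etc/ssh/sshd_config` should read as follows after the change:
```bash=
PasswordAuthentication yes
```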
## Actual System Installation
#### Install [CUDA](https://developer.nvidia.com/cuda-downloads) as a network-based installation
Get the NVIDIA keyring `.deb` package and update the apt cache
```bash=
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring*.deb
sudo apt-get update
```
after which we can install CUDA with **all its dependencies** in one go.
```bash=
sudo apt-get -y install cuda
```
Then we only have to reboot the machine
```bash=
sudo reboot
```
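After the reboot, a quick sanity check is to confirm that both the driver and the toolkit are visible (the `nvcc` path assumes the default CUDA 11.8 install location):
```bash=
nvidia-smi                                # the driver should list every installed GPU
/usr/local/cuda-11.8/bin/nvcc --version   # toolkit version, via the full path since PATH is only exported below
```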
> At this stage we can optionally run the CUDA integration tests by downloading and running the NVIDIA-provided [CUDA-samples](https://github.com/nvidia/cuda-samples); a sketch follows after the path export below.
The CUDA paths need to be exported as usual, either by adding them to `~/.bashrc` or by running the command directly in the shell of your choosing.
```bash=
export PATH=/usr/local/cuda-11.8/bin${PATH:+:${PATH}}
```
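If the optional integration test mentioned above is desired, a minimal sketch for building and running the `deviceQuery` sample could look like this (the directory layout may differ between sample releases):
```bash=
git clone https://github.com/nvidia/cuda-samples.git
cd cuda-samples
# optionally check out the tag matching the installed toolkit, e.g. git checkout v11.8
cd Samples/1_Utilities/deviceQuery
make
./deviceQuery   # should enumerate all GPUs and end with "Result = PASS"
```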
#### Install [cuDNN](https://developer.nvidia.com/rdp/cudnn-download) from the local Debian installer
> A network installation option is sadly not available for cuDNN
Install the prerequisites for cuDNN
```bash=
sudo apt-get install zlib1g
```
Download the local `.deb` installer from the download page and transfer it onto the server via `scp` or `sftp`. Then install the local repository package and enable its keyring.
```bash=
sudo dpkg -i cudnn-local-repo-ubuntu2204-8.6.0.163_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-8.6.0.163/cudnn-local-B0FE0A41-keyring.gpg /usr/share/keyrings/
```
Then we have to refresh the apt metadata
```bash=
sudo apt-get update
```
Then we can install the runtime library, the developer library, and the cuDNN samples & documentation
```bash=
sudo apt-get install libcudnn8=8.6.0.163-1+cuda11.8
sudo apt-get install libcudnn8-dev=8.6.0.163-1+cuda11.8
sudo apt-get install libcudnn8-samples=8.6.0.163-1+cuda11.8
```
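Installing `libcudnn8-samples` also allows a quick functional check. Here is a sketch following the layout NVIDIA uses for the cuDNN 8 samples package (the `mnistCUDNN` sample additionally needs FreeImage):
```bash=
sudo apt-get install libfreeimage3 libfreeimage-dev
cp -r /usr/src/cudnn_samples_v8/ $HOME
cd $HOME/cudnn_samples_v8/mnistCUDNN
make clean && make
./mnistCUDNN   # should finish with "Test passed!"
```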
#### Install [NCCL](https://developer.nvidia.com/nccl/nccl-download) as a network-based installation
The apt keyring for NCCL has already been set up during the CUDA installation step, so we only need to install the requisite packages with
```bash=
sudo apt install libnccl2=2.15.1-1+cuda11.8 libnccl-dev=2.15.1-1+cuda11.8
```
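To exercise NCCL across all GPUs, NVIDIA's [nccl-tests](https://github.com/NVIDIA/nccl-tests) suite can be used. A sketch assuming four GPUs (adjust `-g` to the number of GPUs in the machine):
```bash=
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda-11.8
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4   # all-reduce bandwidth sweep from 8 B to 128 MB
```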
## Litmus Test
Create a Python virtual environment as usual (Python 3.8 was used in this instance) and then install JAX into it.
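For instance, assuming a Python 3.8 interpreter is already on the `PATH`, the environment could be set up as follows (the location `~/venvs/jax` is just illustrative):
```bash=
python3.8 -m venv ~/venvs/jax
source ~/venvs/jax/bin/activate
```
With the environment activated, upgrade the packaging tools and install JAX: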
```bash=
pip install --upgrade pip wheel setuptools
pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```
To disable direct peer-to-peer NCCL communication, set the following environment variable
```bash=
export NCCL_P2P_DISABLE=1
```
> In Ubuntu 22.04 LTS this is not required any more and peer-to-peer NCCL is working as expected.
Then launch a Python shell and run the following
```python=
import jax
import jax.numpy as jnp

# one element per local GPU so pmap can map the computation across all of them
x = jnp.arange(jax.local_device_count())
# all_gather over the mapped axis 'i': every device receives the full array
y = jax.pmap(lambda x: jax.lax.all_gather(x, 'i'), axis_name='i')(x)
print(y.shape)  # (n_gpus, n_gpus): the gathered array is replicated on every device
```