# Notes on AER/MEP GPU-Cluster Setup
###### tags: `GPU-Cluster`
## Current Status
- [x] Base-OS Installs on all servers
- [x] MAAS Setup to have a bare-metal cloud
- [x] Do virtual machines have access to the internet?
- [x] NAT setup on Hideko (made redundant by the next point on the agenda)
- [ ] Own sub-network for the cluster
- [ ] Juju Installation (LP) - in progress, held up by curl
- [ ] Cloud Component Deployment
    - [ ] OpenStack
    - [ ] Kubernetes on OpenStack
    - [ ] Ceph-Storage
    - [ ] GPU-Operator on GPU-Nodes
## Currently Possible Usage
- Ursell is deployed as a bare-metal machine to allow for stop-gap GPU computing.
- Kushana can be deployed as a second bare-metal machine to meet larger demand.
- Full virtual machines (VMs) and containers can only be deployed once Juju works.
- The GPU operator is installed as part of the Juju deployment, so pass-through of the GPUs to the respective containers/VMs only becomes possible once that deployment is complete.
## Updates
- [26.8.2022](/e_Uc0_stSzW__cZ6rw6IIQ)
## Installation Layout
The installation breaks down into the following steps:
1. Base-OS installation
2. Set up [MAAS](https://maas.io) ([reference](https://maas.io/docs/how-to-install-maas))
3. Use [Juju](https://juju.is) for charmed deployments ([reference]())
    1. Deploy [OpenStack]() with Juju ([reference]())
    2. Create [Ceph]() storage cluster ([reference]())
    3. Deploy [Kubernetes]() on top of OpenStack ([reference]())
    4. Add [NVIDIA GPU-Operator]() to enable GPU-computing ([reference]())
A high-level summary of the deployment approach is available in this [video](https://www.youtube.com/watch?v=sLADei_c9Qg). The video leaves out the deployment of the NVIDIA GPU-Operator, which is covered in this [Readme](https://github.com/NVIDIA/gpu-operator) by NVIDIA.
:::success
We deliberately followed the stock install instructions so that all issues are reproducible at the package level and the cluster can be maintained by any eventual successors.
:::
:::warning
**Dump of Deployment Links**
- [Virtual GPUs](https://docs.openstack.org/charm-guide/latest/admin/vgpu.html)
- [NVIDIA Virtual GPU Docs](https://docs.nvidia.com/grid/13.0/grid-vgpu-release-notes-generic-linux-kvm/index.html)
- [Interaction between KVM drivers and PCIe-Passthrough](https://www.kraxel.org/blog/2021/05/virtio-gpu-qemu-graphics-update/)
- [MAAS GPU-Tagging](https://gist.github.com/ThinGuy/661a88be3f8b0ed7770277374ac3546b)
- [OpenStack PCIe-Passthrough](https://docs.openstack.org/charm-guide/latest/admin/compute/pci-passthrough.html)
- [Juju installation instructions](https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/rocky/install-juju.html)
:::
```graphviz
digraph hierarchy {
// Separation between nodes
nodesep=1.0
// Define the style
node [color=Red,fontname=Courier,shape=oval]
edge [color=Blue, style=dashed]
// Graph definition itself
MAAS->{JUJU}
JUJU->{OpenStack}
OpenStack->{Ceph, Kubernetes, GPU_Operator}
{Ceph,Kubernetes, GPU_Operator}->{GPU_Cloud}
}
```
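The dependency graph above also fixes the order in which the components have to be deployed. As a minimal sketch (not part of the actual deployment tooling), the edges can be written down as a dependency table and sorted with Python's standard `graphlib` module. The node names are copied from the graph; the `DEPENDS_ON` table and the `deployment_order` helper are names made up for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each component maps to the set of components it depends on,
# mirroring the edges of the graphviz graph above.
DEPENDS_ON = {
    "MAAS": set(),
    "JUJU": {"MAAS"},
    "OpenStack": {"JUJU"},
    "Ceph": {"OpenStack"},
    "Kubernetes": {"OpenStack"},
    "GPU_Operator": {"OpenStack"},
    "GPU_Cloud": {"Ceph", "Kubernetes", "GPU_Operator"},
}


def deployment_order(depends_on):
    """Return one valid order in which the components can be deployed."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(" -> ".join(deployment_order(DEPENDS_ON)))
```

The relative order of Ceph, Kubernetes and the GPU-Operator is not determined by the graph, so any of their permutations is a valid result.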
##### Currently Virtualized Servers
- Hideko (Head-node)
- Ursell (8 * A6000 GPU-Server)
- Lundquist (8 * A6000 GPU-Server)
- Nusselt (Storage-Server with [Ceph](https://ceph.io/en/))
- Kushana (4 * A6000)
- **(Arriving in the coming week)** name-to-be-determined (2 * A100 80GB)
##### Known Issues
:::danger
Newly created virtual machines receive their own IP addresses, which are not among the addresses cleared on the network. We would therefore need to set up our own separate IP range to work around this constraint, which is imposed by the general network setup of AER.
- Need to open a running trouble ticket for this!
- Need to set up Hideko as a NAT gateway to handle these IP addresses and resolve this issue
:::
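The addressing problem described above can be illustrated with a small sketch based on Python's standard `ipaddress` module. Both ranges below are placeholders chosen for illustration (`192.0.2.0/24` is a documentation range, `10.10.0.0/24` an arbitrary private range); the real cleared range and the eventual cluster sub-network are not recorded in these notes. The helper names are likewise hypothetical.

```python
import ipaddress

# Placeholder ranges for illustration only; the real values are assumptions.
CLEARED_RANGE = ipaddress.ip_network("192.0.2.0/24")  # addresses cleared on the general network
CLUSTER_RANGE = ipaddress.ip_network("10.10.0.0/24")  # private range behind the head node


def is_cleared(address: str) -> bool:
    """True if the address is one the general network accepts directly."""
    return ipaddress.ip_address(address) in CLEARED_RANGE


def needs_nat(address: str) -> bool:
    """True if the address is in the private cluster range and therefore
    has to be translated by the head node to reach the outside."""
    ip = ipaddress.ip_address(address)
    return ip in CLUSTER_RANGE and ip not in CLEARED_RANGE


if __name__ == "__main__":
    # The private range must not collide with the cleared addresses.
    assert not CLEARED_RANGE.overlaps(CLUSTER_RANGE)
    for vm_address in ("10.10.0.17", "192.0.2.5"):
        print(vm_address, "needs NAT" if needs_nat(vm_address) else "is cleared")
```

A check like this only documents the intended address plan; the NAT itself would still have to be configured on Hideko.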
:::danger
GPU-passthrough needs to be enabled and tested
- Can we tag the physical machines with an additional GPU attribute?
- Only focus on this once the Juju-charms [for the GPU deployment](https://github.com/juju-solutions/layer-nvidia-cuda) can be used
:::
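One possible way to decide which physical machines should carry a GPU attribute is to count the NVIDIA display devices reported by `lspci -nn` on each machine. The sketch below only parses such output. It relies on the NVIDIA PCI vendor ID `10de` and the PCI class codes `0300` (VGA compatible controller) and `0302` (3D controller). The sample lines are illustrative rather than captured from the cluster, and the tag names `gpu` and `gpu-count-N` are a hypothetical naming scheme. Applying the tags in MAAS is not shown.

```python
import re

NVIDIA_VENDOR_ID = "10de"           # PCI vendor ID assigned to NVIDIA
GPU_CLASS_CODES = {"0300", "0302"}  # VGA compatible controller, 3D controller

# Extracts the class code and the vendor:device pair from one `lspci -nn` line.
LINE_RE = re.compile(
    r"\[(?P<cls>[0-9a-f]{4})\]:.*\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)

# Illustrative sample, not real output from the cluster.
SAMPLE = """\
41:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102GL [RTX A6000] [10de:2230] (rev a1)
41:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
c3:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 41)
"""


def count_nvidia_gpus(lspci_output: str) -> int:
    """Count lines that describe an NVIDIA device with a display class code."""
    count = 0
    for line in lspci_output.splitlines():
        match = LINE_RE.search(line)
        if match and match["vendor"] == NVIDIA_VENDOR_ID and match["cls"] in GPU_CLASS_CODES:
            count += 1
    return count


def suggested_tags(lspci_output: str) -> list:
    """Return hypothetical tag names for a machine, or an empty list without GPUs."""
    gpus = count_nvidia_gpus(lspci_output)
    return ["gpu", f"gpu-count-{gpus}"] if gpus else []


if __name__ == "__main__":
    print(suggested_tags(SAMPLE))
```

The audio function of the GPU and the onboard management graphics are deliberately excluded, so that only the compute-capable devices are counted.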
#### Documentation of Individual Setup Components
- [Initial Setup of the Server](/VX17IBrJRuOH_KYTRylq1g)
- [MAAS](/oLS4hAsKQWCLXWzx16KwoQ)
- [Networking Setup](/xc-GaM5CQdiVf5QcfcY6SA)
- [Juju](/mC-sF0gnRKWoiuH3PPZ3IQ)
- [OpenStack](/cte_RGoETw61vNQ7uEapRA)
- [Ceph](/Mfy4bdM4RC6XaD9KK0abjg)
- [Kubernetes](/veyV7WntSr2buaz1W1G4TA)
- [NVIDIA GPU-Operator](/orFfzRWPTZapcWh8nwPigA)