Notes on AER/MEP GPU-Cluster Setup

# Notes on AER/MEP GPU-Cluster Setup

###### tags: `GPU-Cluster`

## Current Status

- [x] Base-OS Installs on all servers
- [x] MAAS Setup to have a bare-metal cloud
- [x] Do virtual machines have access to the internet?
    - [x] NAT setup on Hideko (made redundant by the next point on the agenda)
    - [ ] Own sub-network for the cluster
- [ ] Juju Installation (LP) - in progress, held up by curl
- [ ] Cloud Component Deployment
    - [ ] OpenStack
    - [ ] Kubernetes on OpenStack
    - [ ] Ceph-Storage
    - [ ] GPU-Operator on GPU-Nodes

## Currently Possible Usage

- Ursell is deployed as a bare-metal machine to allow for stop-gap GPU computing
- Kushana can be deployed as a second bare-metal machine to meet larger demand
- Full virtual machines (VMs) and containers can only be deployed once Juju works.
- The GPU-Operator is installed as part of the Juju deployment, so GPU pass-through to the respective containers/VMs only becomes available once that deployment is complete

## Updates

- [26.8.2022](/e_Uc0_stSzW__cZ6rw6IIQ)

## Installation Layout

The installation breaks down into the following steps:

1. Base-OS installation
2. Set up [MAAS](https://maas.io) ([reference](https://maas.io/docs/how-to-install-maas))
3. Use [Juju](https://juju.is) for charmed deployments ([reference]())
    1. Deploy [OpenStack]() with Juju ([reference]())
    2. Create [Ceph]() storage cluster ([reference]())
    3. Deploy [Kubernetes]() on top of OpenStack ([reference]())
    4. Add [NVIDIA GPU-Operator]() to enable GPU-computing ([reference]())

A high-level summary of the deployment approach is available in this [video](https://www.youtube.com/watch?v=sLADei_c9Qg). It leaves out the deployment of the NVIDIA GPU-Operator, which is covered in this [Readme](https://github.com/NVIDIA/gpu-operator) by NVIDIA.

:::success
We consciously followed the stock install instructions so that all issues are reproducible at the package level and the cluster can be maintained by any eventual successors.
:::

:::warning
**Dump of Deployment Links**

- [Virtual GPUs](https://docs.openstack.org/charm-guide/latest/admin/vgpu.html)
- [NVIDIA Virtual GPU Docs](https://docs.nvidia.com/grid/13.0/grid-vgpu-release-notes-generic-linux-kvm/index.html)
- [Interaction between KVM drivers and PCIe-Passthrough](https://www.kraxel.org/blog/2021/05/virtio-gpu-qemu-graphics-update/)
- [MAAS GPU-Tagging](https://gist.github.com/ThinGuy/661a88be3f8b0ed7770277374ac3546b)
- [OpenStack PCIe-Passthrough](https://docs.openstack.org/charm-guide/latest/admin/compute/pci-passthrough.html)
- [Juju installation instructions](https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/rocky/install-juju.html)
:::

```graphviz
digraph hierarchy {
    // Separation between nodes
    nodesep=1.0

    // Define the style
    node [color=Red,fontname=Courier,shape=oval]
    edge [color=Blue, style=dashed]

    // Graph definition itself
    MAAS->{JUJU}
    JUJU->{OpenStack}
    OpenStack->{Ceph, Kubernetes, GPU_Operator}
    {Ceph,Kubernetes, GPU_Operator}->{GPU_Cloud}
}
```

##### Currently Virtualized Servers

- Hideko (Head-node)
- Ursell (8 * A6000 GPU-Server)
- Lundquist (8 * A6000 GPU-Server)
- Nusselt (Storage-Server with [Ceph](https://ceph.io/en/))
- Kushana (4 * A6000)
- **(Arriving in the coming week)** name-to-be-determined (2 * A100 80GB)

##### Known Issues

:::danger
Newly created virtual machines receive their own IP addresses, which are not among the cleared IP addresses on the network. We would therefore need to set up our own separate IP range to work around this constraint, which is imposed by the general network setup of AER.

- Need to open a running trouble ticket for this!
- Need to set up Hideko as a NAT to handle the IP addresses, and fix this issue
:::

:::danger
GPU-passthrough needs to be enabled and tested

- Can we tag the physical machines with an additional GPU attribute?
- Only focus on this once the Juju-charms [for the GPU deployment](https://github.com/juju-solutions/layer-nvidia-cuda) can be used
:::

#### Documentation of Individual Setup Components

- [Initial Setup of the Server](/VX17IBrJRuOH_KYTRylq1g)
- [MAAS](/oLS4hAsKQWCLXWzx16KwoQ)
- [Networking Setup](/xc-GaM5CQdiVf5QcfcY6SA)
- [Juju](/mC-sF0gnRKWoiuH3PPZ3IQ)
- [OpenStack](/cte_RGoETw61vNQ7uEapRA)
- [Ceph](/Mfy4bdM4RC6XaD9KK0abjg)
- [Kubernetes](/veyV7WntSr2buaz1W1G4TA)
- [NVIDIA GPU-Operator](/orFfzRWPTZapcWh8nwPigA)
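
##### Deployment Order Sketch

The dependency graph in the Installation Layout section implies a deployment order. As a rough illustration only, the Python sketch below encodes the same edges as the graphviz diagram and groups the components into stages using the standard-library `graphlib` module (Python 3.9+). The `DEPENDS_ON` mapping and the `deployment_stages` helper are hypothetical names introduced here, not part of MAAS, Juju, or any of the tooling; the grouping reflects only what the diagram states, not any additional ordering constraints that may exist in practice.

```python
from graphlib import TopologicalSorter

# Hypothetical mapping: each component -> the components it depends on.
# The edges mirror the graphviz diagram in "Installation Layout".
DEPENDS_ON = {
    "JUJU": {"MAAS"},
    "OpenStack": {"JUJU"},
    "Ceph": {"OpenStack"},
    "Kubernetes": {"OpenStack"},
    "GPU_Operator": {"OpenStack"},
    "GPU_Cloud": {"Ceph", "Kubernetes", "GPU_Operator"},
}


def deployment_stages(depends_on):
    """Return a list of stages; components within one stage have no
    dependencies on each other according to the graph."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        # All nodes whose dependencies are already marked as done.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(deployment_stages(DEPENDS_ON), start=1):
        print(f"Stage {number}: {', '.join(stage)}")
```

Read against the Current Status checklist, everything from the OpenStack stage onward is blocked as long as the Juju installation is not finished.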