---
title: NCS Gen AI Workshop - Lab Guide
---

<style>
html, body, .ui-content { background-color: #1c1c1c; color: #ddd; }
.markdown-body h1, .markdown-body h2, .markdown-body h3, .markdown-body h4, .markdown-body h5, .markdown-body h6 { color: #ddd; }
.markdown-body h1, .markdown-body h2 { border-bottom-color: #ffffff69; }
.markdown-body h1 .octicon-link, .markdown-body h2 .octicon-link, .markdown-body h3 .octicon-link, .markdown-body h4 .octicon-link, .markdown-body h5 .octicon-link, .markdown-body h6 .octicon-link { color: #fff; }
.markdown-body img { background-color: transparent; }
.ui-toc-dropdown .nav>.active:focus>a, .ui-toc-dropdown .nav>.active:hover>a, .ui-toc-dropdown .nav>.active>a { color: white; border-left: 2px solid white; }
.expand-toggle:hover, .expand-toggle:focus, .back-to-top:hover, .back-to-top:focus, .go-to-bottom:hover, .go-to-bottom:focus { color: white; }
.ui-toc-dropdown { background-color: #333; }
.ui-toc-label.btn { background-color: #191919; color: white; }
.ui-toc-dropdown .nav>li>a:focus, .ui-toc-dropdown .nav>li>a:hover { color: white; border-left: 1px solid white; }
.markdown-body blockquote { color: #bcbcbc; }
.markdown-body table tr { background-color: #5f5f5f; }
.markdown-body table tr:nth-child(2n) { background-color: #4f4f4f; }
.markdown-body code, .markdown-body tt { color: #eee; background-color: rgba(230, 230, 230, 0.36); }
a, .open-files-container li.selected a { color: #5EB7E0; }
</style>

# NCS Gen AI Workshop - Lab Guide

## 1. Objectives
This Lab Guide provides the steps to deploy a TKG workload cluster with an allocated GPU, and to deploy the NVIDIA GPU Operator on that workload cluster.

<br/><br/>

## 2. System Requirements
### 2.1 Hardware Requirements
1. Dell PowerEdge Servers
2. NVIDIA A30 Tensor Core GPU

### 2.2 Software Requirements
1. VMware vSphere 8
2. NVIDIA AI Enterprise 3.1
    * NVIDIA vGPU Host Driver (525.105.14)
    * NVIDIA vGPU Guest Driver (525.105.17)
    * NVIDIA GPU Operator (v23.3.1) - https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/release-notes.html

<br/><br/>

## 3. Setup of TKG Workload Cluster
Reference - https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/tanzu.html#it-administrator

### 3.1 Prerequisites
1. TKG Workload Management configured (with vSphere Networking)
2. Custom VM Class created with NVIDIA A30 vGPU

<br/>

### 3.2 Login to vSphere Namespace
1. Authenticate with the Supervisor Cluster
```bash!
kubectl-vsphere login --server <supervisor-cluster-ip> --insecure-skip-tls-verify -u <username>
```
<br/>

2. Switch to the correct namespace
```bash!
kubectl config use-context ncs
```
<br/>

### 3.3 Deploy TKG Workload Cluster with Custom VM Class
1. In your jumphost, create a new YAML manifest file named `tkc.yaml` for provisioning a TKG cluster.
    - Copy the following content into the YAML file
    - Define the cluster's name (e.g. group-1)
    - Save changes
```yaml!
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  name: # CLUSTER_NAME
  annotations:
    run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
spec:
  topology:
    controlPlane:
      replicas: 1
      storageClass: tkg-storage
      vmClass: best-effort-medium
      tkr:
        reference:
          name: v1.24.9---vmware.1-tkg.4
    nodePools:
    - name: n-a30-4c
      replicas: 1
      storageClass: tkg-storage
      vmClass: n-a30-24-4gb # CUSTOM VMClass
      tkr:
        reference:
          name: v1.24.9---vmware.1-tkg.4
      volumes:
      - capacity:
          storage: 20Gi
        mountPath: /var/lib/containerd
        name: containerd
      - capacity:
          storage: 2Gi
        mountPath: /var/lib/kubelet
        name: kubelet
```
<br/>
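*Optional check (a minimal sketch, not part of the original lab steps): before provisioning the cluster in step 2, you can validate the manifest against the Supervisor API and confirm that the custom vGPU VM class referenced in `tkc.yaml` (assumed here to be `n-a30-24-4gb`, as in the sample above) is available to your vSphere Namespace.*
```bash!
# Server-side dry run: validates tkc.yaml against the Supervisor API without creating anything
kubectl apply --dry-run=server -f tkc.yaml

# Confirm the custom vGPU VM class named in the manifest is visible from the current context
kubectl get virtualmachineclass n-a30-24-4gb
```
<br/>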
2. Using the manifest file, provision your TKG cluster.
```bash!
kubectl apply -f tkc.yaml
```
<br/>

3. Wait for the status of your TKG cluster to be ***READY*** (approx. 6-10 min)
```bash!
kubectl get tkc <cluster-name> -w
```
<br/>

### 3.4 Login to TKG Workload Cluster
```bash!
kubectl-vsphere login --server <supervisor-cluster-ip> --insecure-skip-tls-verify -u <username> --tanzu-kubernetes-cluster-namespace=ncs --tanzu-kubernetes-cluster-name=<cluster-name>
```
<br/>

### 3.5 Grant Permissions for Authenticated Users
```bash!
kubectl create clusterrolebinding psp:system:authenticated --clusterrole=psp:vmware-system-privileged --group=system:authenticated
```
<br/><br/>

## 4. Installation of GPU Operator
The NVIDIA GPU Operator manages NVIDIA GPU resources in a Kubernetes cluster and automates tasks related to bootstrapping GPU nodes.

There are TWO (2) ways to install the GPU Operator: via ***Helm*** or via ***Operator***.
> In this workshop, we deploy the GPU Operator via ***Helm***.

<br/>

### 4.1 Prerequisites
1. NVIDIA A30 Tensor Core GPU installed
2. NVIDIA vGPU Host Driver installed on each ESXi host (current version: 525.105.14)
3. NVIDIA DLS (Delegated License Service) instance installed
    - Client Configuration Token (DLS license key/token) downloaded from DLS
4. NGC API Key
    - You will need to request access to NGC (https://ngc.nvidia.com) from NVIDIA with your corporate email ID
    - Once you receive approval from NVIDIA, you will be able to get your NGC API Key to download and deploy NVIDIA Enterprise containers from NVIDIA NGC for testing purposes.

<br/>

### 4.2 Install NVIDIA GPU Operator
1. Verify that you are in your TKG cluster
```bash!
kubectl config current-context
```
<br/>

2. Create the NVIDIA GPU Operator namespace
```bash!
kubectl create namespace gpu-operator
```
<br/>

3. From the jumphost, view the manifest that contains the license token.
```bash!
cat token.yaml
```
<br/>

4. Using this manifest, create the ConfigMap for NVIDIA vGPU licensing.
```bash!
kubectl apply -f token.yaml
```
<br/>

5. Set the following environment variables
```bash!
export NGC_API_KEY=<NGC_API_KEY>       # api key is stored in a file in the jumphost
export NGC_EMAIL=<NGC_EMAIL_ADDRESS>   # you may use your email
```
<br/>

6. Create a Kubernetes Secret to access the NGC registry
```bash!
kubectl create secret docker-registry ngc-secret \
  --docker-server="nvcr.io/nvaie" \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY" \
  --docker-email="$NGC_EMAIL" \
  -n gpu-operator
```
<br/>

7. Add the Helm repo
```bash!
helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password="$NGC_API_KEY"
```
<br/>

8. Update the Helm repo
```bash!
helm repo update
```
<br/>

9. Search to see if the GPU Operator for NVAIE 3.1 is available
```bash!
helm search repo nvaie/gpu-operator -l
```
<br/>

10. Install the NVIDIA GPU Operator. Allow 1-2 min for the command to complete.
```bash!
helm install --wait gpu-operator nvaie/gpu-operator-3-1 -n gpu-operator
```
<br/>

11. Wait for all Pods' status to be ***running*** or ***completed*** (approx. 8-10 min)
```bash!
watch -n 10 'kubectl get pods -n gpu-operator'
```
<br/>

### 4.3 Validate NVIDIA GPU Operator Deployment
1. Get the name of the NVIDIA driver Pod.
```bash!
export GPU_DRIVER=$(kubectl -n gpu-operator get pods -l app=nvidia-driver-daemonset -o=jsonpath='{.items[*].metadata.name}')
```
<br/>

2. Run `nvidia-smi` within the nvidia-driver Pod
```bash!
kubectl exec $GPU_DRIVER -n gpu-operator -- nvidia-smi
```
<br/>
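*Optional extra check (a small sketch, not part of the original lab steps): you can also confirm that the GPU worker node now advertises an allocatable `nvidia.com/gpu` resource, which indicates that the device plugin deployed by the GPU Operator has registered the vGPU with Kubernetes.*
```bash!
# List each node name together with any nvidia.com/gpu capacity/allocatable entries
kubectl describe nodes | grep -E "^Name:|nvidia.com/gpu"
```
<br/>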
3. Confirm that the NVIDIA vGPU driver has obtained a license
```bash!
kubectl exec $GPU_DRIVER -n gpu-operator -- nvidia-smi -a | grep License
```
<br/>

### 4.4 Run a Sample GPU Application
1. In your jumphost, create a new YAML manifest file named `cuda-vectoradd.yaml`.
    - Copy the following content into the YAML file
    - Save changes
```yaml!
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1 # request for GPU resource
```
<br/>

2. Run the Pod on your TKG cluster.
```bash!
kubectl apply -f cuda-vectoradd.yaml
```
<br/>

3. Wait for the `cuda-vectoradd` Pod's status to be ***completed***
```bash!
kubectl get pods -w
```
<br/>

4. View the logs from the container.
```bash!
kubectl logs pod/cuda-vectoradd
```
<br/>

*If the application ran **successfully**, you should see output similar to the sample below:*
```!
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
<br/>

5. Remove the stopped Pod.
```bash!
kubectl delete -f cuda-vectoradd.yaml
```
<br/>

## End of Lab Guide
<br/>
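*Optional troubleshooting note (not part of the original lab): if the `cuda-vectoradd` Pod from Section 4.4 stays in **Pending**, the scheduler usually cannot find a node with a free `nvidia.com/gpu` resource; the Pod's events normally show the reason (for example, an `Insufficient nvidia.com/gpu` message).*
```bash!
# Show recent events for the sample Pod to see why it has not been scheduled
kubectl describe pod cuda-vectoradd | grep -A 10 "Events:"
```
<br/>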