# How to manage your GPUs, MLOps edition

Greetings, tech enthusiasts! In this guide, we'll traverse the transformative world of Graphics Processing Units (GPUs): what they are, what makes them important to ML, and what makes them tricky. We'll also cover common pitfalls in day-to-day usage, and the problems and solutions that appear once you start using GPUs at scale. Since Nvidia currently dominates the ML/DL ecosystem, we'll focus on Nvidia devices today; rest assured, ROCm and OpenCL are very capable GPU stacks and deserve their own blog post.

To start, let's explore the GPU. What makes it different from a CPU anyway?

![](https://hackmd.io/_uploads/rkinjkpan.png)

In the graphic above, you can see a conventional 4-core CPU; you may notice that the GPU has a few more than 4! The cores in an Nvidia device are called CUDA cores. While a CPU adeptly handles a handful of complex tasks, a GPU excels at executing thousands of simpler tasks concurrently. It's like comparing a small team of generalists to an army of specialists.

We won't be diving into the extreme details of what a GPU is; there are [other fantastic](https://nyu-cds.github.io/python-gpu/02-cuda/) [blog](https://research.nvidia.com/sites/default/files/pubs/2007-02_How-GPUs-Work/04085637.pdf) [posts](https://www.extremetech.com/gaming/269335-how-graphics-cards-work) for that. However, it's worth looking at an example of GPU-accelerated code and what makes it so different.

### Let's Code: The Classic Way vs. The CUDA Way

Let's do something interesting. We've got a list of 1 million integers (randomly initialized), and we want to apply a series of math operations to each of them. Let's look at the standard Python way of approaching the problem, and then a GPU-accelerated JAX equivalent.

```python
import time
import random
import math

# Generate a list of 1 million random integers between 1 and 100
inputs = [random.randint(1, 100) for _ in range(1000000)]

# Define a list of 500 math operations (placeholder: alternating sin and cos)
operations = [math.sin, math.cos] * 250

start_time = time.time()

outputs = []
for item in inputs:
    result = item
    for operation in operations:
        result = operation(result)
    outputs.append(result)

end_time = time.time()

print("Output sample:", outputs[:5])
print("Estimated Compute Duration (Classic):", end_time - start_time, "seconds")
# Completed in 7.8 secs
```

The JAX-based approach:

```python
import jax
import jax.numpy as jnp
import numpy as np
import time
import random

# Fail fast if no GPU backend is available; JAX places arrays on the GPU by default when one is present
jax.devices('gpu')

# Generate a list of 1 million random integers between 1 and 100
inputs = np.array([random.randint(1, 100) for _ in range(1000000)], dtype=np.float32)

# Define a list of 500 math operations (placeholder: alternating sin and cos)
operations = [jnp.sin, jnp.cos] * 250

@jax.jit
def apply_operations(input_array):
    def apply_ops_to_item(item):
        result = item
        for operation in operations:
            result = operation(result)
        return result
    # Vectorize the per-element function across the whole array
    return jnp.vectorize(apply_ops_to_item)(input_array)

start_time = time.time()
outputs = apply_operations(inputs)
end_time = time.time()

print("Output sample:", outputs[:5])
print("Estimated Compute Duration (JAX):", end_time - start_time, "seconds")
# Completed in 0.15 secs
```

In this example, every element can be computed independently of every other element. That makes the problem *parallelizable*, and perfectly suited for computation on a GPU.
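One caveat about timing JAX code like this: the first call to a `jax.jit`-compiled function includes compilation time, and JAX dispatches work asynchronously, so `time.time()` can return before the GPU has actually finished. A fairer measurement warms the function up once and then blocks on the result. Here's a minimal sketch, reusing the `apply_operations` and `inputs` defined above:

```python
import time

# Warm-up call: triggers JIT compilation so it isn't counted in the timing
apply_operations(inputs).block_until_ready()

start_time = time.time()
# block_until_ready() waits for the asynchronous GPU computation to actually finish
outputs = apply_operations(inputs).block_until_ready()
end_time = time.time()

print("Steady-state Compute Duration (JAX):", end_time - start_time, "seconds")
```

Either way, the gap between the sequential CPU loop and the compiled GPU version remains dramatic.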
### But Why Should We Use GPUs for ML?

Ever felt the thrill of upgrading to a more powerful device and experiencing the speed difference? Now imagine that thrill magnified many times over. That's the GPU advantage in ML.

At the core of machine learning, and especially deep learning, lie matrices. Large, complex matrices. Whether we're forwarding inputs through neural network layers or backpropagating errors, we're doing matrix operations.

![](https://hackmd.io/_uploads/BkBA5Gpan.png)

Matrix multiplication is *parallelizable* row-wise, just like the previous example. Instead of computing every row sequentially, we can calculate them all at the same time. NVIDIA's CUDA and cuDNN are the secret sauce that makes GPUs even more potent for ML tasks. Here's a glimpse of matrix multiplication using CUDA:

```cpp
__global__ void matrixMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0;
    for (int i = 0; i < N; i++) {
        sum += A[row * N + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
}

// Usage
dim3 threadsPerBlock(16, 16);
dim3 blocksPerGrid(N / threadsPerBlock.x, N / threadsPerBlock.y);
matrixMul<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
```

This setup, especially when paired with optimizations from cuDNN, makes deep learning computations blazingly fast.

## Here be dragons, however

GPU-accelerated workflows may be fast, but there are serious considerations once you move beyond personal, single-user setups. One example is the "surprise OOM" (Out of Memory) error, which crops up when several workloads share a device without careful planning.

---

### Hypothetical Machine Specs:

- **GPU Memory**: 12 GB

### Scenario:

User A and User B are both running deep learning models on this machine.

**User A's Model**:
- A large neural network for image classification.
- Requires 9 GB of GPU memory for its operations.

**User B's Model**:
- A smaller neural network for text classification.
- Requires 5 GB of GPU memory for its operations.

### Situation 1: Parallel Execution (OOM Error)

**User A's Configuration**:

```python
import torch

# Define and load User A's large model
model_A = torch.nn.Sequential(
    # ... many layers here ...
)
model_A.cuda()  # Moves the model to the GPU
```

**User B's Configuration**:

```python
import torch

# Define and load User B's smaller model
model_B = torch.nn.Sequential(
    # ... some layers here ...
)
model_B.cuda()  # Moves the model to the GPU
```

In this situation, both User A and User B attempt to load their models onto the GPU simultaneously. The combined memory requirement is `9GB + 5GB = 14GB`, which exceeds the GPU's 12 GB capacity. This triggers an OOM error.

### Situation 2: Sequential Execution (No OOM Error)

Now, let's assume both users run their models one after the other instead of simultaneously.

**User A's Configuration**:

```python
import torch

# Define and load User A's large model
model_A = torch.nn.Sequential(
    # ... many layers here ...
)
model_A.cuda()  # Moves the model to the GPU

# Run training or inference
# ...

# After completion, free up GPU memory
del model_A
torch.cuda.empty_cache()  # Clears the cached memory
```

After User A's operations complete and the memory is cleared, User B starts their operations.

**User B's Configuration**:

```python
import torch

# Define and load User B's smaller model
model_B = torch.nn.Sequential(
    # ... some layers here ...
)
model_B.cuda()  # Moves the model to the GPU

# Run training or inference
# ...
```

In this sequential setup, GPU memory usage never exceeds the 12 GB capacity, so there's no OOM error.

---

This example illustrates how runtime configuration and the order of operations influence memory usage on a GPU. Running tasks sequentially and cleaning up memory properly can prevent OOM errors in scenarios where parallel execution would fail.
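One practical way to avoid "Situation 1" on a shared machine is to check how much device memory is actually free before loading a model. Here's a minimal sketch, assuming a reasonably recent PyTorch where `torch.cuda.mem_get_info` is available; the 5 GB threshold mirrors User B's hypothetical model:

```python
import torch

def gpu_has_room(required_gb: float, device: int = 0) -> bool:
    """Return True if the GPU currently has at least `required_gb` of free memory."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    free_gb = free_bytes / 1024**3
    print(f"GPU {device}: {free_gb:.1f} GB free of {total_bytes / 1024**3:.1f} GB")
    return free_gb >= required_gb

# User B could check before moving their model onto the GPU
if gpu_has_room(required_gb=5):
    model_B = torch.nn.Sequential(
        # ... some layers here ...
    ).cuda()
else:
    print("Not enough free GPU memory right now; waiting or falling back to CPU.")
```

Keep in mind the check is only advisory: another process can claim memory between the check and your allocation, so it reduces, rather than eliminates, the chance of an OOM.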
## Effective Strategies for GPU Resource Allocation and Management

In the rush to adopt GPUs for machine learning tasks, one often overlooked challenge is managing these powerful resources effectively. While the computational advantage of GPUs is undeniable, their potential can be squandered without proper resource allocation and management. Let's dive into strategies that help you get the most out of your GPU resources.

### **If your Model is Small**

Assuming you're not training an LLM from scratch, your workload may well fit on a single, modest GPU. Even so, workloads differ: some are memory-hungry, while others are more compute-bound. If your workload can be run locally, the following practices are worth adopting:

- **Profile Before Deploying**: Use profiling tools like NVIDIA's `nvprof` to get insight into the memory and compute utilization of your models. This helps you understand bottlenecks and guides resource allocation decisions. You can even profile your Python code (PyTorch and TensorFlow both work) with `nvprof python execute_your_script.py`.
- **Validate Memory Footprint**: Before deploying to a shared or constrained environment, check the memory footprint of your model with tools like `torch.cuda.memory_allocated()` for PyTorch or `tf.profiler.experimental.Profiler` for TensorFlow. Small models can have hidden memory costs, especially when tied to large frameworks or auxiliary data structures. Knowing the total GPU memory requirement lets you allocate resources appropriately and avoid surprise OOM errors even with small models. (A short sketch appears at the end of this section.)
- **Model Optimization and Pruning**: Before deploying, consider techniques like model pruning or quantization. Tools such as TensorFlow's Model Optimization Toolkit or PyTorch's pruning utilities can reduce the effective size of your model without significantly compromising accuracy. As an example, here's how pruning can be done in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Define a simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(64*32*32, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = x.view(x.size(0), -1)
        x = self.fc1(x)
        return x

# Create a model instance
model = SimpleNet()

# Use global unstructured pruning on the 'weight' of the conv1 layer.
# Prune 30% of the connections based on their absolute magnitude.
prune.global_unstructured(
    [(model.conv1, 'weight')],
    pruning_method=prune.L1Unstructured,
    amount=0.3,
)

# You can inspect the pruned model's weights
print(model.conv1.weight)

# To make the pruning permanent, remove the 'weight_orig' and 'weight_mask' attributes
prune.remove(model.conv1, 'weight')
```

This can reduce memory consumption and lead to faster inference times, with one caveat: unstructured pruning by itself only zeroes weights, so the memory and latency wins materialize once you pair it with sparse storage, structured pruning, or quantization. Remember, a leaner model often translates to a more efficient deployment, especially in environments where resources are at a premium.
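To make the "Validate Memory Footprint" point above concrete, here's a minimal sketch that measures how much GPU memory a model actually claims once loaded. It reuses the hypothetical `SimpleNet` from the pruning example; `torch.cuda.memory_allocated` reports tensors PyTorch has allocated, while `torch.cuda.memory_reserved` includes the caching allocator's overhead:

```python
import torch

device = torch.device("cuda:0")

torch.cuda.reset_peak_memory_stats(device)
before = torch.cuda.memory_allocated(device)

model = SimpleNet().to(device)                     # weights now live on the GPU
dummy = torch.randn(8, 3, 32, 32, device=device)   # a representative input batch
with torch.no_grad():
    _ = model(dummy)                               # activations add to the footprint

after = torch.cuda.memory_allocated(device)
peak = torch.cuda.max_memory_allocated(device)

print(f"Model + batch allocated: {(after - before) / 1024**2:.1f} MiB")
print(f"Peak during forward pass: {peak / 1024**2:.1f} MiB")
print(f"Reserved by caching allocator: {torch.cuda.memory_reserved(device) / 1024**2:.1f} MiB")
```

Training needs considerably more than this (gradients, optimizer state, larger activation batches), so treat the measured number as a lower bound when planning shared capacity.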
### **If your Model is Not So Small**

Dealing with larger models presents a different set of challenges. They come with significant computational demands, both in processing and in memory. When your model starts to grow, these considerations help ensure efficient GPU utilization and smooth operation:

- **Distributed Training**: When a model becomes too large to fit on a single GPU, or when training times become prohibitively long, it's time to consider distributed training. Libraries like NVIDIA's NCCL or frameworks like Horovod let you distribute your model and data across multiple GPUs or even multiple nodes. This parallelism can significantly accelerate training.
- **Gradient Accumulation**: If your mini-batch doesn't fit into GPU memory, gradient accumulation can be a lifesaver. Instead of shrinking the mini-batch (which can affect convergence), process smaller chunks and accumulate gradients over several iterations before performing a weight update.

```python
optimizer.zero_grad()  # Reset gradient tensors
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()  # Backpropagate (accumulate) the gradients
    if (i + 1) % accumulation_steps == 0:  # Wait for several backward steps
        optimizer.step()       # Now we can do an optimizer step
        optimizer.zero_grad()  # Reset gradient tensors
```

A common refinement is to scale the loss by `1 / accumulation_steps` so the accumulated gradient matches what a single full-size batch would have produced.

- **Mixed Precision Training**: Training in mixed precision, using both 16-bit and 32-bit floating point types, can speed up training and reduce memory usage without a significant loss in model accuracy. NVIDIA's Apex library, as well as PyTorch's native `torch.cuda.amp`, enables mixed precision with a few lines of code (a short sketch using `torch.cuda.amp` follows this list).
- **Use Gradient Checkpointing**: For deep architectures, gradient checkpointing can be invaluable. This technique trades compute for memory, letting you fit much larger models onto GPUs by re-computing intermediate activations during the backward pass.
- **On-the-fly Data Loading**: Rather than pre-loading your entire dataset into GPU memory, use data generators or PyTorch's `DataLoader` with `pin_memory=True`. Data is then loaded on the fly and moved to GPU memory in batches, reducing overall memory consumption.
- **Model Parallelism**: For models too large to fit on a single GPU, consider model parallelism: split the model into distinct parts and place each part on a different GPU. Frameworks like PyTorch and TensorFlow offer native support for this.
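To make the mixed precision bullet concrete, here's a minimal training-loop sketch using PyTorch's native `torch.cuda.amp`; the `model`, `dataloader`, `optimizer`, and `criterion` are assumed to exist already, as in the gradient accumulation snippet above:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # Scales the loss to avoid underflow in fp16 gradients

for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()

    with autocast():  # Ops inside this block run in reduced precision where safe
        outputs = model(inputs)
        loss = criterion(outputs, labels)

    scaler.scale(loss).backward()  # Backward pass on the scaled loss
    scaler.step(optimizer)         # Unscales gradients, then calls optimizer.step()
    scaler.update()                # Adjusts the scale factor for the next iteration
```

On GPUs with Tensor Cores this typically cuts memory use and speeds up matmul-heavy workloads, though it's worth validating accuracy on your own model.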
Remember, as models grow in complexity and size, the nuances of GPU resource management become even more critical. Efficiently leveraging accelerators can be the difference between a model that's viable in production and one that remains confined to the drawing board. Always keep a close eye on GPU metrics, iterate, and optimize.

## Build your own GPU Accelerated K8s

Harnessing the power of GPUs in a Kubernetes ecosystem requires not just the right tools but also a solid understanding of the intricacies involved. If you're keen on building GPU support yourself, Kubernetes offers an array of specialized tools that can be customized to fit unique requirements. Below are some foundational tools and services that you can use and potentially extend.

**Rolling Out Your NVIDIA GPU Support**: NVIDIA provides a GPU Operator for Kubernetes, and understanding its underpinnings is valuable. The operator streamlines GPU provisioning, manages the container runtime configuration, and handles device plugin deployment, discovery, and monitoring. If you wish to modify or build upon it, it helps to understand how GPU resources are scheduled alongside CPU and memory.

```bash
# To deploy the NVIDIA GPU Operator manually
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install --wait --generate-name nvidia/gpu-operator
```

**Customizing Multi-Instance GPU (MIG) Configurations**: NVIDIA's MIG allows a single A100 GPU to be partitioned into multiple GPU instances. While Kubernetes supports MIG out of the box through the device plugin, digging into the configuration allows for more tailored GPU instance allocations, which is especially useful for specific workloads.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-gpu-mig-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.0-base
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```

**Enhancing Node Feature Discovery (NFD)**: As your cluster expands, the granularity of node feature detection becomes pivotal. NFD automatically labels nodes based on hardware features; understanding its mechanisms allows for more precise GPU type detection, CUDA versioning, and other GPU-related characteristics.

**Extending NVIDIA DCGM (Data Center GPU Manager) Exporter Metrics**: NVIDIA's DCGM Exporter provides a suite of GPU performance metrics. While it integrates seamlessly with Prometheus, you may want to capture custom metrics or integrate with other monitoring solutions; understanding its core makes such extensions straightforward. (A small Python sketch for reading its metrics appears at the end of this section.)

```bash
# To deploy DCGM Exporter manually in a Kubernetes cluster
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/dcgm-exporter.yaml
```

By diving into these tools and services, you can ensure that your Kubernetes cluster not only supports GPUs but does so in a manner tailored to your specific needs. Building GPU support yourself offers the flexibility to optimize, customize, and ensure that your ML deployments are both efficient and scalable.
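As a tiny example of building on the DCGM Exporter, here's a sketch that scrapes its Prometheus endpoint and pulls out per-GPU utilization. It assumes a default-style deployment (metrics served on port 9400 at `/metrics`, reached here via a port-forward to `localhost`) and the standard `DCGM_FI_DEV_GPU_UTIL` metric name; adjust for your setup.

```python
import urllib.request

# Assumes something like: kubectl port-forward svc/dcgm-exporter 9400:9400
METRICS_URL = "http://localhost:9400/metrics"

def gpu_utilization():
    """Parse the DCGM exporter's Prometheus text output for per-GPU utilization."""
    with urllib.request.urlopen(METRICS_URL) as resp:
        text = resp.read().decode()

    utilization = {}
    for line in text.splitlines():
        # Lines look roughly like: DCGM_FI_DEV_GPU_UTIL{gpu="0",...} 87
        if line.startswith("DCGM_FI_DEV_GPU_UTIL{"):
            labels, value = line.rsplit(" ", 1)
            utilization[labels] = float(value)
    return utilization

if __name__ == "__main__":
    for labels, util in gpu_utilization().items():
        print(f"{labels} -> {util:.0f}% utilized")
```

From there you could feed the values into a custom autoscaler, an alerting hook, or whatever fits your monitoring stack.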
## Or maybe you don't have to: MLOps Orchestration

Setting up and maintaining a Kubernetes stack tailored for GPU orchestration is a daunting endeavor. But what if you could focus on your core ML tasks and leave the orchestration complexities to dedicated platforms? Enter machine learning orchestration platforms.

### **Kubeflow**
- **Pros**: A comprehensive machine learning platform built on Kubernetes, Kubeflow provides tools for training, serving, and monitoring ML models, integrating seamlessly with popular ML libraries.
- **Cons**: Its rich feature set brings complexity, making it potentially overwhelming for smaller projects.

### **MLFlow**
- **Pros**: MLFlow streamlines end-to-end machine learning workflows with its modular structure, supporting diverse ML libraries and offering tools for experiment tracking and model deployment.
- **Cons**: While versatile, it doesn't offer the deep Kubernetes-native integrations that platforms like Kubeflow provide.

### **Airflow**
- **Pros**: A robust platform for orchestrating computational workflows and data processing pipelines. With an extensive plugin system, it offers integrations with a multitude of systems.
- **Cons**: Primarily designed with data engineering in mind; integrating machine learning workflows may require additional legwork.

### **TensorFlow Serving**
- **Pros**: A dedicated tool from the TensorFlow ecosystem, TensorFlow Serving is optimized for serving machine learning models in production environments and requires minimal setup.
- **Cons**: It's tailored exclusively to TensorFlow models, which can be limiting if your ecosystem involves multiple frameworks.

### **Flyte**

Emerging as a Kubernetes-native workflow automation platform, Flyte stands out with its ability to interoperate with systems like Airflow, Spark, and Kubeflow. It leverages dedicated operators to spin up and tear down ephemeral clusters on demand, ensuring resource efficiency.

- **Pros**: Beyond its scalability and native Python SDK, Flyte's strength lies in its extensibility. It can capitalize on the best of Airflow, Spark, and Kubeflow, offering a versatile orchestration solution. The platform also emphasizes type safety, ensuring data integrity.
- **Cons**: As a newer entrant, it might not have all the features of established platforms, but its rapid evolution is bridging any gaps.

Here's a snapshot of how you'd define a GPU-enabled task and workflow in Flyte:

```python
from flytekit import task, workflow, Resources
from flytekit.types.file import FlyteFile

# Defining a Flyte task that uses a GPU
@task(
    requests=Resources(cpu="2", mem="500Mi", gpu="1"),
    limits=Resources(cpu="2", mem="500Mi", gpu="1")
)
def gpu_powered_task(input_data: FlyteFile) -> str:
    # Your GPU processing logic here
    return "processed"

# Defining a Flyte workflow
@workflow
def gpu_workflow(data: FlyteFile) -> str:
    return gpu_powered_task(input_data=data)
```

Flyte's intuitive structure and native Python SDK simplify workflow management. Its Kubernetes-native design ensures that GPU resource requests and limits are honored.

In summary, while constructing your own orchestration stack offers unparalleled flexibility, MLOps orchestration platforms can significantly reduce overhead. Your selection will hinge on your team's know-how, your workflow's complexity, and your operational scale. Ensure your choice is aligned with both present requirements and future growth.