Demystifying GPU Management: High-Performance AI on Kubernetes#

The field of artificial intelligence (AI) has rapidly evolved in recent years, driven by exponential growth in data and improvements in compute power. In particular, Graphics Processing Units (GPUs) have become a critical component of modern AI and machine learning (ML) infrastructure because of their massive parallel processing capabilities. Kubernetes—the most popular container orchestration platform—has also become central to deploying, managing, and scaling AI/ML workloads in production.

This blog post introduces you to GPU acceleration on Kubernetes. From the basics of GPU hardware and containerization to multi-GPU scheduling strategies and advanced HPC (High-Performance Computing) techniques, we will cover a wide range of topics that bring clarity to GPU-powered AI/ML operations in a Kubernetes ecosystem. By the end of this guide, you will have a solid grounding in how to run your GPU-accelerated workloads on Kubernetes in a scalable, efficient, and manageable manner.


Table of Contents#

  1. Introduction to GPUs for AI
  2. Kubernetes Basics
  3. Why Run GPUs on Kubernetes?
  4. Getting Started with GPU Enablement
  5. Scheduling and Resource Management
  6. Running Your First GPU-Accelerated Workload
  7. Best Practices and Advanced Concepts
  8. Production Considerations
  9. Use Cases
  10. Hands-On Example: End-to-End Workflow
  11. Conclusion

Introduction to GPUs for AI#

GPUs stand out because they excel at parallel processing—this means GPUs can perform many mathematical computations simultaneously. While Central Processing Units (CPUs) handle a broad range of workloads—including single-threaded tasks—GPUs specialize in tasks that can be parallelized. AI models, especially deep learning neural networks, benefit tremendously from parallel computation, making GPUs almost a necessity for training and sometimes even for inference.

Key GPU Features#

  • Parallelism: Thousands of smaller cores for concurrent operations.
  • High Throughput: GPUs can perform billions of floating-point operations per second (FLOPS).
  • Specialized Libraries: Ecosystems such as CUDA (for NVIDIA GPUs) and ROCm (for AMD GPUs) provide specialized APIs and libraries.

Because of these capabilities, GPUs have become indispensable to modern AI and high-performance computing tasks. However, the challenge is how to effectively manage and scale these GPU resources across dozens or even hundreds of machines. This is where Kubernetes steps in.


Kubernetes Basics#

Kubernetes (often abbreviated as K8s) is an open-source platform that automates the deployment, scaling, and management of containerized workloads. Here are a few essential components and concepts:

  • Pod: The smallest deployable unit in Kubernetes, often containing one or more containers that share storage and network resources.
  • Node: A worker machine (virtual or physical) running containerized workloads.
  • Cluster: A set of machines (nodes) managed by a control plane that schedules workloads and ensures the desired state is maintained.
  • Control Plane: The collection of processes (kube-apiserver, kube-controller-manager, kube-scheduler, etc.) that manage the cluster.
  • Deployment: A way to define how many replicas of a Pod should run, and how to update them in a controlled manner.
  • Service: An abstraction that provides stable network access to a set of Pods, load-balancing traffic across them.

Container Orchestration and AI#

Container orchestration with Kubernetes isn’t just about running any workload; it’s about managing distributed systems in an automated and scalable manner. For AI and ML workloads, which can be resource-intensive, Kubernetes provides advanced scheduling to ensure that containers using GPUs are placed on nodes that have GPU capabilities.


Why Run GPUs on Kubernetes?#

Before diving into the practicalities, it’s worth understanding the “why.” Many AI/ML practitioners hesitate to jump into Kubernetes because they feel it might be overkill for smaller experimentation or local development. However, if you are aiming for production-scale workloads or have a large data science team working concurrently, Kubernetes offers the following benefits:

  1. Resource Efficiency: By pooling GPU-enabled nodes, you can share GPU resources among teams, limiting idle GPU time.
  2. Scalability: As the demand for training or inference grows, Kubernetes can automatically spin up more Pods to handle the load (assuming you have enough GPU nodes in the cluster).
  3. Isolation and Multitenancy: Kubernetes namespaces and resource quotas help isolate workloads from different teams or projects, avoiding resource contention.
  4. Data Locality: In advanced deployments, you can place workload Pods close to data storage for faster IO.
  5. Extensibility: The Kubernetes community is large and active, and you can tap into many existing solutions (e.g., operators, custom resource definitions) to manage complex GPU tasks more easily.

Getting Started with GPU Enablement#

Prerequisites#

  1. A functioning Kubernetes cluster: You should already have a cluster set up—either on premises, on a public cloud, or in a hybrid environment.
  2. GPU Hardware: At least one node should be equipped with a GPU (e.g., an NVIDIA GPU).
  3. Suitable drivers and runtime: For NVIDIA GPUs, you need to install NVIDIA drivers on the node.

Installing GPU Drivers on Your Nodes#

On GPU-equipped nodes, you need to install the appropriate drivers for your hardware. For NVIDIA GPUs:

  1. Add the CUDA repository:

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring.gpg
    sudo rpm --import cuda-keyring.gpg # or 'apt-key add' for Debian/Ubuntu
  2. Install the driver packages:

    sudo apt-get update
    sudo apt-get install -y cuda-drivers

    or:

    sudo yum clean expire-cache
    sudo yum install -y nvidia-driver-latest-dkms
  3. Verify:

    nvidia-smi

    If successful, you should see information about your GPU model and driver version.

NVIDIA Device Plugin for Kubernetes#

Kubernetes supports hardware accelerators such as GPUs through its device plugin framework. NVIDIA provides a device plugin that makes GPU resources discoverable by the kubelet.

  1. Deploy using YAML (for a typical setup):

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml

    This creates a DaemonSet that runs on every GPU node in your cluster, registering NVIDIA GPUs with Kubernetes.

  2. Check the logs to confirm:

    kubectl logs -n kube-system <nvidia-device-plugin-pod-name>

    It should report something like “Found X GPU(s)”.

This device plugin fundamentally exposes GPUs as schedulable resources. With the plugin in place, you can start running GPU-accelerated Pods.
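
To confirm that the GPUs were registered, you can inspect a node's capacity; <gpu-node-name> below is a placeholder for one of your GPU-equipped nodes:

kubectl describe node <gpu-node-name> | grep nvidia.com/gpu

Both the Capacity and Allocatable sections should list nvidia.com/gpu with the number of GPUs the plugin discovered.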


Scheduling and Resource Management#

GPU Requests and Limits#

Kubernetes uses a resource request/limit model to ensure Pods get the resources they need:

  • Requests: The amount of resource (CPU, GPU, memory) that a container is guaranteed.
  • Limits: The maximum amount of resource a container is allowed to use.

For GPU-enabled Pods, you specify GPUs in the resources section of your Pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda:11.3.0-base
    resources:
      limits:
        nvidia.com/gpu: 1

In this example, Kubernetes will schedule the Pod on a node with at least one available GPU.

Important Note#

In current practice, you do not treat GPU requests and limits the way you treat CPU or memory. GPUs are exposed as extended resources that cannot be overcommitted: if you specify both requests and limits for nvidia.com/gpu, the two values must be equal, and if you specify only limits, Kubernetes sets the request to the same value. The common convention is therefore to specify only limits for GPUs.

Node Labels and GPU Affinity#

Kubernetes node labels help the scheduler decide where to place your workloads. If you have different GPU models across your cluster—for example, some nodes have Tesla V100, and others have Tesla T4—you can label the nodes:

kubectl label nodes <node-name> gpu-type=tesla-t4
kubectl label nodes <node-name> gpu-type=tesla-v100

Then you can use node affinity in your Pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-affinity-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - tesla-t4
  containers:
  - name: gpu-container
    image: nvidia/cuda:11.3.0-base
    resources:
      limits:
        nvidia.com/gpu: 1

In this manner, you can ensure that your workloads land on nodes that have the correct GPU type.

GPU-Sharing Mechanisms#

By default, Kubernetes schedules GPU resources in an exclusive manner, meaning one container receives the entire physical GPU. However, you can use proprietary or specialized device plugins that enable fractional GPU sharing. This is particularly useful for inference workloads, where a single GPU can be shared among multiple models at once without exhausting GPU memory. Examples of GPU sharing solutions include:

  • NVIDIA MPS (Multi-Process Service): Allows multiple processes to share GPU contexts.
  • Third-Party Device Plugins: Provide fractional GPU allocation.

However, these are more advanced setups and might require extra configuration in your cluster.
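
As one concrete illustration, recent releases of the NVIDIA device plugin (v0.12 and later) support time-slicing, which advertises each physical GPU as several schedulable replicas. The sketch below is a hedged example of that configuration; the ConfigMap name is hypothetical, and how it gets wired into the plugin (for example via the Helm chart's config option) depends on how you deploy it:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # hypothetical name; match your plugin deployment
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # each physical GPU is advertised as 4 schedulable units

Keep in mind that time-slicing provides no memory or fault isolation between the sharing Pods, so it is best suited to trusted, bursty inference workloads.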


Running Your First GPU-Accelerated Workload#

You have installed GPU drivers on your nodes, deployed the NVIDIA device plugin, and optionally labeled your GPU nodes. The next step is to run a GPU-accelerated workload.

Example: TensorFlow with GPUs#

Here is a minimal YAML file running TensorFlow on GPUs:

apiVersion: batch/v1
kind: Job
metadata:
  name: tf-gpu-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: tf-container
        image: tensorflow/tensorflow:latest-gpu
        command: ["python", "-c"]
        args:
        - >
          import tensorflow as tf;
          print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')));
        resources:
          limits:
            nvidia.com/gpu: 1

  1. Create the Job:
    kubectl apply -f tf-gpu-job.yaml
  2. Check the status:
    kubectl get jobs
  3. Check the logs:
    kubectl logs <the-pod-name>

If everything is correct, the output will show the number of GPUs recognized by TensorFlow.

Example: PyTorch with GPUs#

Similarly, for PyTorch:

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-gpu-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pytorch-container
        image: pytorch/pytorch:latest
        command: ["python", "-c"]
        args:
        - >
          import torch;
          print("Is CUDA available?", torch.cuda.is_available());
          print("GPU Count:", torch.cuda.device_count());
        resources:
          limits:
            nvidia.com/gpu: 1

You should see Is CUDA available? True and the number of GPU devices recognized by PyTorch.


Best Practices and Advanced Concepts#

Once you have the basics down, you can explore more nuanced areas that significantly impact your cluster’s performance, security, and ease of management.

Managing GPU Memory#

Deep learning containers can be memory-hungry, and GPU memory constraints are a common bottleneck. If your GPU has 16 GB of VRAM and your model plus batch size demands more, you will run into out-of-memory errors.

  • Batch Size Tuning: Find the optimal batch size that fits in GPU memory.
  • Gradient Checkpointing: Reduce GPU memory usage by recomputing some intermediate states during backpropagation.
  • Mixed-Precision Training: Use half-precision (FP16) or Tensor Cores on supported GPUs to reduce memory usage and improve speed; see the sketch after this list.
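
Here is a minimal sketch of mixed-precision training in PyTorch using torch.cuda.amp; the model and the single dummy batch are stand-ins for your real pipeline:

import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(512, 10).to(device)      # stand-in for your real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()       # rescales the loss so FP16 gradients do not underflow

# One dummy batch stands in for your data loader.
inputs = torch.randn(64, 512, device=device)
targets = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():            # run eligible ops in FP16 / on Tensor Cores
    loss = nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()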

Multi-GPU and Distributed Training#

For more advanced scenarios, you may want to run distributed training across multiple GPUs and potentially across multiple nodes. Popular frameworks like PyTorch’s torch.distributed or TensorFlow’s tf.distribute can help.

Example: Multi-GPU Scheduling#

Use a Pod spec that requests multiple GPUs:

apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-pod
spec:
  containers:
  - name: multi-gpu-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2

This Pod requires two GPUs on the same node. Make sure your node has at least two available GPUs.

Example: Distributed Training on Multiple Nodes#

A typical pattern is to create multiple replicas (Pods) using a Kubernetes mechanism like a StatefulSet or a specialized Operator. Each Pod runs as a worker, with one Pod acting as a parameter server or coordinator. The important part is networking, so that the Pods can communicate. Tools like Horovod or the Kubeflow Training Operator can help coordinate distributed training.
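
For reference, here is a hedged sketch of what such a job might look like with the Kubeflow Training Operator's PyTorchJob resource; it assumes the operator is installed in your cluster, and train_distributed.py is a hypothetical script built on torch.distributed:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch                 # the operator expects this container name
            image: pytorch/pytorch:latest
            command: ["python", "train_distributed.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:latest
            command: ["python", "train_distributed.py"]
            resources:
              limits:
                nvidia.com/gpu: 1

The operator injects rendezvous environment variables (such as MASTER_ADDR and WORLD_SIZE) into each replica so the Pods can discover one another.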

Monitoring and Metrics#

Keeping an eye on GPU utilization is crucial. You can integrate solutions like:

  1. Prometheus + Grafana:

    • Deploy the Prometheus node exporter or the NVIDIA GPU exporter.
    • Export metrics like GPU usage, memory usage, temperature, etc.
    • Visualize using Grafana dashboards.
  2. NVIDIA DCGM (Data Center GPU Manager):

    • Provides GPU-level metrics and can integrate with Prometheus.

Below is a snippet of a Prometheus configuration for scraping GPU metrics from a node exporter:

scrape_configs:
- job_name: 'gpu-metrics'
  static_configs:
  - targets: ['<node-ip>:9100'] # or a service if you have one

GPU Operator and CRDs#

The NVIDIA GPU Operator uses the Operator pattern to simplify GPU management. It automates:

  • Driver installation
  • Device plugin deployment
  • Monitoring components

It leverages Custom Resource Definitions (CRDs) to abstract complex GPU management tasks into higher-level concepts. If you’re scaling up GPU usage across many nodes, or you frequently add new GPU nodes, adopting the GPU Operator can save you a lot of time.
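
A typical installation uses NVIDIA's official Helm chart; the commands below are a hedged sketch that assumes Helm is available and the chart defaults suit your cluster:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

If you already manage drivers on the nodes yourself, the chart lets you disable the driver component through its values.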


Production Considerations#

Version Compatibility#

The moving parts in a GPU-enabled Kubernetes cluster include:

  • Kubernetes version
  • NVIDIA drivers
  • CUDA toolkit
  • NVIDIA device plugin

Incompatibilities can cause major headaches—for instance, if the device plugin version does not match the GPU driver version. Always check the official compatibility matrix from NVIDIA or your GPU vendor.

Resource Quotas and Avoiding Overcommitment#

Kubernetes allows you to set Resource Quotas in a namespace to define the maximum GPU usage. Overcommitment is generally not applicable to GPUs because they cannot be shared the same way CPU or memory can be overcommitted. If your cluster tries to schedule a GPU workload when none are available, the Pod will be stuck in Pending.
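
As an illustration, a hedged ResourceQuota that caps a hypothetical team-a namespace at four GPUs might look like this (extended resources such as nvidia.com/gpu use the requests. prefix in a quota):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a        # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: 4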

Security and Isolation#

Because GPUs are shared hardware, GPU isolation is crucial when running multi-tenant workloads. By default, the device plugin provides exclusive GPU allocation. However, malicious container processes could still attempt side-channel attacks. To mitigate:

  • Use Runtime Security Tools: Tools such as SELinux, AppArmor, or seccomp can provide additional constraint layers.
  • Sandboxing: For extremely sensitive workloads, consider using separate GPU nodes or advanced techniques like GPU virtualization.

Upgrades and Maintenance#

Keeping your cluster current is a never-ending task. Regularly updating the Kubernetes version, device plugin versions, and GPU drivers is important but can be disruptive:

  • Rollouts: Use rolling upgrades for node updates; a node-drain sketch follows this list.
  • Version Skew: Validate that your device plugin is forward-compatible with the new driver or Kubernetes release.
  • Backup: Always back up your cluster state, especially if using critical data or states in ephemeral volumes.
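
A common sequence for taking a GPU node out of service before a driver or kubelet upgrade looks roughly like this; the node name is a placeholder:

kubectl cordon <gpu-node-name>        # stop scheduling new Pods onto the node
kubectl drain <gpu-node-name> --ignore-daemonsets --delete-emptydir-data
# ...upgrade the driver, reboot if required...
kubectl uncordon <gpu-node-name>      # return the node to service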

Use Cases#

Real-Time Inference#

For use cases like real-time video analytics or online recommendation engines, GPU-enabled inference can drastically reduce latency. By deploying your inference model in a GPU-enabled container, you can scale horizontally as traffic grows. Tools like Triton Inference Server from NVIDIA provide a standardized way to serve multiple models on the same GPU.

Data Analytics and ETL#

ETL or data-processing tasks can benefit from GPU acceleration. Libraries like RAPIDS accelerate data transformations and analytics in Python using GPUs. Utilizing these libraries within a Kubernetes environment allows large data sets to be processed faster.

Model Training Pipelines#

From data preprocessing to hyperparameter tuning, GPUs accelerate each step of an ML pipeline. Integrating Kubeflow or Argo Workflows allows you to define complex pipelines where certain tasks specifically request GPUs when required, and revert to CPU nodes for lighter tasks.
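
As a hedged sketch of this idea in Argo Workflows, the training step below requests a GPU while the preprocessing step runs on CPU only; the images and commands are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    steps:
    - - name: preprocess
        template: preprocess-step
    - - name: train
        template: train-step
  - name: preprocess-step
    container:
      image: python:3.9
      command: ["python", "-c", "print('preprocessing on CPU')"]
  - name: train-step
    container:
      image: pytorch/pytorch:latest
      command: ["python", "-c", "print('training on GPU')"]
      resources:
        limits:
          nvidia.com/gpu: 1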


Hands-On Example: End-to-End Workflow#

Demo Architecture#

In this section, we’ll walk through a simple demonstration pipeline:

  1. A data preprocessing step (CPU-based).
  2. A model training step (GPU-based).
  3. An inference service (GPU-based, for demonstration).
  4. A monitoring setup to track GPU usage.

Below is a high-level architecture diagram (in text form):

Data Source
     |
     v
[Preprocessing Pod] --(CPU only)--> [Processed Data]
     |
     v
[Training Pod] ------(GPU)--------> [Trained Model]
     |
     v
[Inference Pod] ------(GPU)------> [Service Exposure]
     |
     v
[Monitoring Setup: Prometheus + Node Exporter + Grafana]

Step-by-Step Guide#

1. Preprocessing Pod#

A simple CPU-based container that processes data. Could be an Apache Spark job or a simple Python script.

apiVersion: batch/v1
kind: Job
metadata:
  name: data-preprocessing-job
spec:
  template:
    spec:
      containers:
      - name: preprocess
        image: python:3.9
        command: ["python", "-c"]
        args:
        - |
          import sys
          # Pseudo code for processing:
          # read input data, clean it, transform it, save to persistent storage
          print("Data preprocessing completed")
      restartPolicy: Never

2. Model Training Pod#

Leverages a GPU. This example uses PyTorch, but any ML framework could be used:

apiVersion: batch/v1
kind: Job
metadata:
  name: model-training-job
spec:
  template:
    spec:
      containers:
      - name: train
        image: pytorch/pytorch:latest
        command: ["python", "train.py"]
        # 'train.py' could be part of the container or mounted via ConfigMap
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never

Within train.py, you might have code like:

import torch
import torch.nn as nn

# Hyperparameters
epochs = 5
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Stand-in model, optimizer, and data; replace these with your real pipeline.
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data_loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(10)]

for epoch in range(epochs):
    for inputs, targets in data_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()

# Save the trained model
torch.save(model.state_dict(), "model.pt")

3. Inference Pod#

Wrap your trained model in an inference service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: inference
        # python:3.9 is a placeholder; use an image that bundles your inference
        # code together with a CUDA-enabled framework such as PyTorch.
        image: python:3.9
        command: ["python", "inference_server.py"]
        resources:
          limits:
            nvidia.com/gpu: 1

Where inference_server.py loads the model from a shared storage or a persisted volume:

from flask import Flask, request, jsonify
import torch
import torch.nn as nn

app = Flask(__name__)

# Rebuild the model architecture and load the weights saved by train.py.
# nn.Linear(10, 1) is only a stand-in; it must match your real architecture.
model = nn.Linear(10, 1)
model.load_state_dict(torch.load("model.pt"))
model.cuda()
model.eval()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    # Preprocess the input into a float tensor and move it to the GPU.
    inputs = torch.tensor(data["inputs"], dtype=torch.float32).cuda()
    with torch.no_grad():
        output = model(inputs)
    return jsonify({"output": output.cpu().tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Expose this with a Kubernetes Service:

apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  type: NodePort
  selector:
    app: inference
  ports:
  - port: 8080
    targetPort: 8080
    nodePort: 30002
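
Assuming the NodePort is reachable from your machine, you could exercise the endpoint with something like the request below; the payload shape must match whatever model inference_server.py actually loads (the ten-feature vector matches the stand-in model sketched earlier):

curl -X POST http://<node-ip>:30002/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]]}'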

4. Monitoring Setup#

Deploy Prometheus and a GPU exporter DaemonSet. Then configure Grafana to create dashboards for GPU usage.

  • Prometheus:

    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus
    spec:
      type: ClusterIP
      selector:
        app: prometheus
      ports:
      - port: 9090
        protocol: TCP
        name: web

    (Prometheus deployment omitted for brevity.)

  • NVIDIA GPU Exporter:

    kubectl create -f nvidia-dcgm-exporter.yaml

    This DaemonSet exports GPU metrics to Prometheus.

With this pipeline in place, you have an end-to-end system that processes data, trains a model, serves predictions, and monitors GPU resource usage.


Conclusion#

In the modern AI-driven world, GPUs are paramount for training and inference tasks. Kubernetes helps you orchestrate these GPU resources at scale, offering features like node affinity, powerful scheduling, resource quotas, and an entire ecosystem of tooling. By combining these features with GPU-specific add-ons (like the NVIDIA device plugin or the GPU Operator), you can create a robust, production-grade setup where multiple data scientists and engineers can efficiently share GPU resources.

This blog post walked you through the foundational concepts necessary to harness GPUs in Kubernetes. We explored installation, configuration, scheduling, best practices, monitoring, and a step-by-step example of how to put it all together. Armed with this knowledge, you can confidently deploy high-performance AI workloads on Kubernetes, whether for small-scale experiments or large enterprise applications.

Remember that each environment may have unique challenges, from network configuration to security constraints. Still, the Kubernetes GPU ecosystem is mature and continues to evolve, making it an increasingly indispensable choice for any organization looking to supercharge its AI initiatives in a containerized, multi-tenant environment.

Now that you have a comprehensive overview, go ahead and spin up your own GPU-enabled cluster. Experiment with training models, deploy inference services, and watch your GPU metrics in real-time. Learning through practice is the best way to demystify GPU management in Kubernetes and unlock high-performance AI.
