AI and GPU Workloads on Kubernetes: Device Plugin and Training Jobs
In 2026, 66% of AI inference clusters run on Kubernetes (CNCF Survey 2026). The reason It's simple: Kubernetes solves AI workloads' toughest operational problems — scheduling intelligent GPU scaling, elastic scaling of training jobs, integration with distributed storage for datasets, automatic retry in case of node failure. But setting up Kubernetes for i GPU workload requires specific skills that go well beyond the deployment of a normal one web application.
In this article we will see how to configure the NVIDIA Device Plugins for expose GPUs to the cluster, how to schedule Distributed training jobs with PyTorch and TensorFlow, how to use Karpenter to do spot scaling of GPU (reducing costs by 40-70%), and patterns to optimize GPU usage in workloads of inference into production.
What You Will Learn
- Installation and configuration of the NVIDIA Device Plugin for Kubernetes
- Scheduling Pods with GPU demand (nvidia.com/gpu resource)
- Distributed training with PyTorchJob and TFJob (Kubeflow Training Operator)
- Karpenter NodePool for automatic provisioning of spot GPU nodes
- GPU time-slicing to share GPUs between multiple Pods
- MIG (Multi-Instance GPU) for A100/H100 GPU partition
- GPU monitoring with DCGM Exporter and Grafana
- Pattern for high throughput inference with TorchServe on K8s
GPU architecture on Kubernetes
Kubernetes is not natively aware of GPUs. GPUs are exposed to the cluster via the Device Plugin Framework: a DaemonSet that runs on every node with GPU, registers with the kubelet, and manages the allocation of GPUs to containers. The NVIDIA Device Plugin is the most popular implementation of this framework.
Installing NVIDIA Device Plugin
# Pre-requisiti: NVIDIA GPU drivers installati sui nodi
# Verifica driver sui nodi
kubectl get nodes -l accelerator=nvidia
kubectl describe node gpu-node-1 | grep -i nvidia
# Installa NVIDIA Device Plugin con Helm
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvdp nvdp/nvidia-device-plugin \
--namespace kube-system \
--version 0.16.0 \
--set failOnInitError=false
# Oppure con manifest diretto
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.0/deployments/static/nvidia-device-plugin.yml
# Verifica che le GPU siano visibili nel cluster
kubectl get nodes -o json | jq '.items[].status.allocatable | select(."nvidia.com/gpu" != null)'
# Output: { "nvidia.com/gpu": "8" } per un nodo con 8 GPU A100
Installing NVIDIA GPU Operator (Recommended Approach)
For clusters in production, the GPU Operator of NVIDIA manages automatically all the necessary components: drivers, device plugins, container runtime, DCGM Exporter for monitoring:
# Installa GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--version v24.9.0 \
--set driver.enabled=true \
--set mig.strategy=single \
--set dcgmExporter.enabled=true \
--set dcgmExporter.serviceMonitor.enabled=true
# Verifica installazione
kubectl get pods -n gpu-operator
# Attendi che tutti i pod siano Running
kubectl wait --for=condition=ready pod -l app=nvidia-device-plugin-daemonset -n gpu-operator --timeout=300s
# Verifica GPU allocabili
kubectl describe node gpu-node-1 | grep -A 5 "Allocatable:"
# nvidia.com/gpu: 8
Scheduling Pods with GPU
Once the Device Plugin is active, you can request GPUs in manifests like any other Kubernetes resource. The difference: you don't use requests, only limits for GPUs (Kubernetes always guarantees exactly the number of GPUs required).
# pod-gpu-basic.yaml
apiVersion: v1
kind: Pod
metadata:
name: gpu-test
spec:
restartPolicy: OnFailure
containers:
- name: inference
image: nvcr.io/nvidia/pytorch:24.01-py3
command: ["python3", "-c"]
args:
- |
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")
x = torch.rand(1000, 1000).cuda()
print(f"Tensor on GPU: {x.device}")
resources:
limits:
nvidia.com/gpu: "1" # richiedi 1 GPU
memory: "16Gi"
cpu: "4"
requests:
memory: "16Gi"
cpu: "4"
volumeMounts:
- name: model-storage
mountPath: /models
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: model-pvc
nodeSelector:
accelerator: "nvidia-a100" # schedule solo su nodi A100
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
Training Distributed with Kubeflow Training Operator
Training large models often requires multiple GPUs on multiple nodes. The Kubeflow Training Operator manages distributed training jobs with PyTorchJob, TFJob, MXJob and MPIJob. Install the operator first:
# Installa Training Operator
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.8.0"
# Verifica
kubectl get pods -n kubeflow
kubectl get crd | grep kubeflow
PyTorchJob for Multi-GPU Multi-Node Training
# pytorch-distributed-training.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: llm-finetuning-job
namespace: ml-training
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: company.registry.io/training:llm-v2.1
command:
- python3
- -m
- torch.distributed.run
- --nproc_per_node=8
- --nnodes=4
- --node_rank=$(RANK)
- --master_addr=$(MASTER_ADDR)
- --master_port=23456
- train_llm.py
- --model=llama-7b
- --dataset=/data/training_set
- --batch-size=32
- --epochs=3
- --output=/models/finetuned
env:
- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_SOCKET_IFNAME
value: "eth0"
resources:
limits:
nvidia.com/gpu: "8"
memory: "120Gi"
cpu: "32"
requests:
memory: "120Gi"
cpu: "32"
volumeMounts:
- name: training-data
mountPath: /data
- name: model-output
mountPath: /models
- name: shm
mountPath: /dev/shm
volumes:
- name: training-data
persistentVolumeClaim:
claimName: training-dataset-pvc
- name: model-output
persistentVolumeClaim:
claimName: model-output-pvc
- name: shm
emptyDir:
medium: Memory
sizeLimit: "64Gi" # shared memory per NCCL
nodeSelector:
accelerator: "nvidia-a100-80gb"
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
Worker:
replicas: 3 # 3 worker + 1 master = 4 nodi, 32 GPU totali
restartPolicy: OnFailure
template:
spec: # stesso spec del Master...
containers:
- name: pytorch
image: company.registry.io/training:llm-v2.1
resources:
limits:
nvidia.com/gpu: "8"
memory: "120Gi"
cpu: "32"
Karpenter for Spot GPU Node Provisioning
GPUs are the most expensive resource in the cloud. Spot GPU instances cost 60-70% less compared to on-demand. Karpenter manages automatic provisioning of spot GPU nodes, with fallback to on-demand in case of outage:
# karpenter-gpu-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu-spot
spec:
template:
metadata:
labels:
role: gpu-worker
accelerator: nvidia
spec:
nodeClassRef:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
name: gpu-nodeclass
requirements:
# Tipologie di istanze GPU AWS
- key: node.kubernetes.io/instance-type
operator: In
values:
- p4d.24xlarge # 8x A100 80GB
- p3.8xlarge # 4x V100
- g5.12xlarge # 4x A10G
- g4dn.12xlarge # 4x T4
# Preferisci spot
- key: karpenter.sh/capacity-type
operator: In
values:
- spot
- on-demand # fallback
- key: kubernetes.io/os
operator: In
values:
- linux
taints:
- key: nvidia.com/gpu
value: "true"
effect: NoSchedule
limits:
nvidia.com/gpu: 256 # max 256 GPU totali nel cluster
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30m # rimuovi nodi GPU spot quando il job finisce
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: gpu-nodeclass
spec:
amiFamily: AL2
role: KarpenterNodeRole
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "my-cluster"
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "my-cluster"
instanceStorePolicy: RAID0 # usa i dischi NVMe locali per storage temporaneo
userData: |
#!/bin/bash
# Installa NVIDIA drivers al primo boot
/etc/eks/bootstrap.sh my-cluster
nvidia-smi # verifica GPU disponibili
GPU Time-Slicing: Share a GPU between multiple Pods
For light inference or development workloads, an entire GPU is often wasted. The GPU time-slicing allows you to share a physical GPU between multiple Pods, each of which sees a "virtual GPU" with a slice of the computation time:
# gpu-time-slicing-config.yaml
# Configura il Device Plugin per time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
name: time-slicing-config
namespace: gpu-operator
data:
any: |-
version: v1
flags:
migStrategy: none
sharing:
timeSlicing:
renameByDefault: false
failRequestsGreaterThanOne: false
resources:
- name: nvidia.com/gpu
replicas: 4 # ogni GPU fisica diventa 4 "GPU" logiche
---
# Applica il config all'operator
kubectl patch clusterpolicy gpu-cluster-policy \
-n gpu-operator \
--type merge \
-p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config"}}}}'
# Verifica: ogni nodo con 1 GPU A100 ora mostra 4 GPU allocabili
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
# Allocatable:
# nvidia.com/gpu: 4
# Pod che usa 1/4 di GPU
apiVersion: v1
kind: Pod
metadata:
name: inference-small
spec:
containers:
- name: model-server
image: company.registry.io/inference:v1
resources:
limits:
nvidia.com/gpu: "1" # ottiene 1/4 della GPU fisica
MIG: Multi-Instance GPU for A100 and H100
NVIDIA A100 and H100 GPUs support MIG (Multi-Instance GPU), that partition the GPU into isolated hardware instances (not just time-sharing). Each MIG instance it has guaranteed memory and computation and does not interfere with the others:
# Configura MIG sul nodo (eseguito sul nodo GPU, non da kubectl)
# Richiede: driver NVIDIA >= 525, GPU A100 o H100
# Abilita MIG mode sulla GPU
sudo nvidia-smi -mig 1
# Crea 7 istanze MIG da 1/7 di A100 (1g.10gb)
sudo nvidia-smi mig -cgip -p 0,9 # Profile 9 = MIG 1g.10gb
# Verifica istanze create
sudo nvidia-smi mig -lgi
# +-------------------------------------------------------+
# | GPU instances: |
# | GPU Name Profile Instance Placement |
# | ID ID Start:Size |
# |=======================================================|
# | 0 MIG 1g.10gb 9 1 0:1 |
# | 0 MIG 1g.10gb 9 2 1:1 |
# | 0 MIG 1g.10gb 9 3 2:1 |
# ... (7 istanze totali)
# Nel cluster Kubernetes, configurare MIG Strategy nel GPU Operator
kubectl patch clusterpolicy gpu-cluster-policy \
-n gpu-operator \
--type json \
-p '[{"op":"replace","path":"/spec/mig/strategy","value":"mixed"}]'
# Pod che richiede specifica istanza MIG
apiVersion: v1
kind: Pod
metadata:
name: inference-mig
spec:
containers:
- name: model
image: nvcr.io/nvidia/pytorch:24.01-py3
resources:
limits:
nvidia.com/mig-1g.10gb: "1" # richiedi 1 istanza MIG 1g.10gb
GPU Monitoring with DCGM Exporter
GPU monitoring is essential to understand if training jobs are efficient and for FinOps. DCGM Exporter exposes GPU metrics to Prometheus:
# DCGM Exporter viene installato automaticamente con GPU Operator
# Verifica che le metriche siano disponibili
kubectl port-forward svc/gpu-operator-dcgm-exporter 9400:9400 -n gpu-operator &
curl -s localhost:9400/metrics | grep DCGM_FI
# Metriche chiave da monitorare:
# DCGM_FI_DEV_GPU_UTIL - utilizzo GPU (0-100%)
# DCGM_FI_DEV_MEM_COPY_UTIL - utilizzo memoria GPU
# DCGM_FI_DEV_FB_USED - memoria GPU usata (MB)
# DCGM_FI_DEV_POWER_USAGE - consumo energetico (W)
# DCGM_FI_DEV_SM_CLOCK - clock streaming multiprocessor
# DCGM_FI_DEV_GPU_TEMP - temperatura GPU
# Alert: GPU sottoutilizzata (< 50% per 30 minuti = spreco)
- alert: GPUUnderutilized
expr: DCGM_FI_DEV_GPU_UTIL < 50
for: 30m
labels:
severity: warning
annotations:
summary: "GPU {{ $labels.gpu }} sul nodo {{ $labels.Hostname }} utilization < 50%"
description: "Valuta se il job puo essere terminato o ottimizzato"
# Dashboard Grafana: importa ID 12239 (NVIDIA DCGM Exporter Dashboard)
Deployment of Models for Inference with TorchServe
For production inference, you need a model server that handles load balancing including multiple model replication, request batching, and versioning. TorchServe and the official PyTorch solution:
# torchserve-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: model-inference
namespace: ml-inference
spec:
replicas: 3 # 3 replica per alta disponibilita
selector:
matchLabels:
app: model-inference
template:
metadata:
labels:
app: model-inference
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8082" # TorchServe metrics port
spec:
containers:
- name: torchserve
image: pytorch/torchserve:0.11.0-gpu
args:
- torchserve
- --start
- --model-store=/models
- --models=text-classifier=bert-classifier.mar
- --ts-config=/config/config.properties
ports:
- containerPort: 8080 # inference API
- containerPort: 8081 # management API
- containerPort: 8082 # metrics
resources:
limits:
nvidia.com/gpu: "1"
memory: "16Gi"
cpu: "4"
requests:
memory: "8Gi"
cpu: "2"
readinessProbe:
httpGet:
path: /ping
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
livenessProbe:
httpGet:
path: /ping
port: 8080
initialDelaySeconds: 120
periodSeconds: 30
volumeMounts:
- name: model-store
mountPath: /models
- name: ts-config
mountPath: /config
volumes:
- name: model-store
persistentVolumeClaim:
claimName: model-store-pvc
- name: ts-config
configMap:
name: torchserve-config
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
---
apiVersion: v1
kind: ConfigMap
metadata:
name: torchserve-config
namespace: ml-inference
data:
config.properties: |
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_gpu=1
batch_size=32
max_batch_delay=100 # ms: attendi fino a 100ms per fare batching
max_response_size=6553500
install_py_dep_per_model=true
---
# HPA basato su latenza con KEDA (event-driven autoscaling)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-scaler
namespace: ml-inference
spec:
scaleTargetRef:
name: model-inference
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring.svc:9090
metricName: torchserve_queue_latency_microseconds
threshold: "100000" # 100ms di coda = scala up
query: avg(torchserve_queue_latency_microseconds{model_name="bert-classifier"})
Best Practices for AI Workloads on Kubernetes
GPU Cost Optimization
- Spot for training, on-demand for inference: Training can handle interruptions with checkpoints; inference must always be available
- Frequent checkpoints: Save checkpoints every 30 minutes to resume training after a spot interruption
- Time-slicing for development: Use time-sliced GPUs for developers, MIG or full GPUs for production
- Scale-to-zero: Karpenter removes spot GPU nodes when training finishes — you don't pay for idle GPUs
- Batching in inference: TorchServe with batch_size=32 increases throughput 10-20x compared to single requests
- Profile before deploying: Use NVIDIA Nsight to profile your training job and identify inefficiencies
Common GPU Errors on Kubernetes
- Containers without taint tolerance: GPU nodes have taint
nvidia.com/gpu=true:NoSchedule; without toleration the Pod is not scheduled on the GPU node - Failure to isolate memory: The GPU does not isolate memory between containers like the CPU does. If you allocate 1 GPU but the model uses more memory than available, the job crashes with CUDA OOM
- NCCL without shared memory: PyTorch distributed training uses NCCL which requires large /dev/shm (typically 10-60GB); always configure an emptyDir with medium: Memory
- Do not monitor GPU utilization: A GPU at 20% utilization is a huge waste. The DCGM dashboard should be the first place you look after every deployment
Conclusions and Next Steps
Kubernetes has become the standard platform for AI/ML workloads not by chance: the its resource abstraction, advanced scheduling system and operator ecosystem (Kubeflow, Training Operator) make it the ideal context to orchestrate both training that inference to scale. With Karpenter automatically managing node provisioning Spot GPU, the cost of a training job can be reduced by 40-70% compared to usage of on-demand instances.
The next step is to integrate these workloads with a complete MLOps pipeline: logging of models with MLflow, dataset management with DVC, CI/CD for automatic retraining. The FinOps for Kubernetes article (Article 9 in this series) delves into how to measure and optimize the total cost of GPU workloads in the cluster.
Upcoming Articles in the Kubernetes at Scale Series
Related Series
- MLOps and Machine Learning in Production — CI/CD pipeline for ML, model registry
- Deep Learning and Neural Networks — theoretical foundations of models training on GPUs
- Autoscaling in Kubernetes — KEDA for scaling based on custom metrics







