Autoscaling in Kubernetes: HPA, VPA, KEDA and Karpenter
One of the main advantages of Kubernetes in production is its ability to scale workloads automatically in response to demand. Yet most teams use only a fraction of the available autoscaling capabilities: they configure an HPA on CPU and leave it at that. The result? Under-provisioned pods that slow down under load, or over-provisioned nodes that burn budget for no reason.
Kubernetes offers four complementary levels of autoscaling: the HPA scales Pods horizontally on CPU/memory/custom metrics, the VPA fixes resource requests automatically, KEDA enables event-driven scaling on any source (queues, databases, Prometheus metrics), and Karpenter provisions nodes in under 30 seconds, 40% faster than the traditional Cluster Autoscaler according to CNCF 2025 benchmarks. This article shows how to use them together in production.
What You Will Learn
- How the Horizontal Pod Autoscaler (HPA) works with custom and external metrics
- Configure the Vertical Pod Autoscaler (VPA) for automatic rightsizing
- KEDA: Event-driven autoscaling on SQS, Kafka and Redis queues and on Prometheus metrics
- Karpenter: Just-in-time node provisioning with NodePool and NodeClass
- Combination Pattern: Use HPA and KEDA together without conflicts
- Troubleshooting: why your HPA isn't scaling as you expect
- Best practices to avoid flap loops and cold starts
Horizontal Pod Autoscaler (HPA)
The HPA is the Kubernetes component that scales the number of replicas of a Deployment, StatefulSet, or ReplicaSet based on observed metrics. The HPA controller queries the metrics every 15 seconds (configurable) and calculates the desired number of replicas with the formula:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
To avoid flapping (continuously scaling up and down), the HPA applies a stabilization window: by default, 5 minutes for scale-down and 0 seconds for scale-up.
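As a sanity check, the formula can be evaluated directly. The sketch below is a toy reproduction of the replica computation (not the actual HPA code), with the result clamped to minReplicas/maxReplicas as the controller does:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     desired_metric: float, min_r: int, max_r: int) -> int:
    """Replica count per the HPA formula, clamped to the min/max bounds."""
    raw = math.ceil(current_replicas * (current_metric / desired_metric))
    return max(min_r, min(max_r, raw))

# 5 replicas at 90% average CPU with a 60% target -> ceil(5 * 1.5) = 8
print(desired_replicas(5, 90, 60, min_r=2, max_r=20))   # 8
```

Note how the ceiling rounds up: even a small overshoot of the target adds a replica, which is why the stabilization window matters.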
HPA on CPU and Memory
Here is the basic configuration with CPU and memory. Note that to scale on memory, the application must release memory when load decreases, otherwise scale-down never happens:
# hpa-basic.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # scale when average CPU > 60%
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: "512Mi"    # scale when average memory > 512Mi
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60             # at most double the replicas per minute
      - type: Pods
        value: 4
        periodSeconds: 60             # or at most 4 Pods per minute
      selectPolicy: Max               # use the more aggressive policy
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60             # remove at most 25% of replicas per minute
      selectPolicy: Min
HPA with Custom Metrics via Prometheus Adapter
To scale on application metrics (requests per second, queue length, etc.), you need the Prometheus Adapter, which exposes Prometheus metrics through the Kubernetes Custom Metrics API:
# Install the Prometheus Adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://kube-prometheus-stack-prometheus.monitoring.svc \
  --set prometheus.port=9090

# prometheus-adapter-config.yaml - metric mapping rule
rules:
  custom:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
# hpa-custom-metric.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-rps-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"   # 1000 req/s per Pod
HPA with External Metrics
External metrics let you scale on sources outside the cluster, such as the length of an SQS queue or the number of unconsumed Kafka messages:
# hpa-external-metric.yaml
# Scale based on the length of an SQS queue
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-queue-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 100
  metrics:
  - type: External
    external:
      metric:
        name: sqs_approximate_number_of_messages_visible
        selector:
          matchLabels:
            queue: "job-queue-prod"
      target:
        type: AverageValue
        averageValue: "10"   # 10 messages per worker
# Check the HPA status
kubectl get hpa -n production -w
kubectl describe hpa api-server-hpa -n production
Vertical Pod Autoscaler (VPA)
The VPA monitors the actual CPU and memory usage of your Pods and automatically adjusts their resources.requests and limits. It is the answer to the "garbage in, garbage out" problem of resource requests: if you don't know how many resources your Pod needs, the VPA finds out for you.
VPA and HPA: Beware of Conflicts
Do not use the VPA in Auto mode together with an HPA that scales on CPU or memory: the two controllers will conflict. The correct combination is: VPA in Off or Initial mode for resource requests, and HPA on custom metrics for replicas. Alternatively, use KEDA instead of the HPA to avoid the problem.
VPA Installation and Configuration
# Install the VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# or with Helm
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install vpa fairwinds-stable/vpa --namespace vpa --create-namespace
---
# vpa-recommendation.yaml - Off mode (recommendations only)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # Off|Initial|Recreate|Auto
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "4"
        memory: "4Gi"
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
# Read the VPA recommendations
kubectl describe vpa api-server-vpa -n production
# Typical output:
# Recommendation:
#   Container Recommendations:
#     Container Name: api-server
#     Lower Bound:     cpu: 100m, memory: 256Mi
#     Target:          cpu: 450m, memory: 512Mi
#     Uncapped Target: cpu: 450m, memory: 512Mi
#     Upper Bound:     cpu: 2000m, memory: 2Gi
VPA in Auto Mode
# vpa-auto.yaml - updates resources automatically (restarts Pods)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: background-worker-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: background-worker
  updatePolicy:
    updateMode: "Auto"   # restarts Pods with the new resources
    minReplicas: 2       # do not evict if fewer than 2 replicas are running
  resourcePolicy:
    containerPolicies:
    - containerName: worker
      minAllowed:
        cpu: "200m"
        memory: "256Mi"
      maxAllowed:
        cpu: "2"
        memory: "2Gi"
KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) is a CNCF project that extends the HPA with 60+ pre-built scalers: AWS SQS, Azure Service Bus, Kafka, RabbitMQ, Redis, Prometheus, Datadog, and many more. KEDA can scale a Deployment down to 0 replicas when there are no events, and back up to 1 when the first event arrives.
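Conceptually, for a queue scaler KEDA hands the HPA an AverageValue target, so the desired replica count is roughly the lag divided by the per-replica threshold, with an explicit zero case. This simplified sketch (function names are illustrative, not KEDA internals) captures the idea:

```python
import math

def keda_desired_replicas(queue_length: int, per_replica_target: int,
                          min_replicas: int, max_replicas: int) -> int:
    """Approximate replica count for a KEDA queue scaler."""
    if queue_length == 0:
        return min_replicas      # with minReplicaCount: 0 this scales to zero
    raw = math.ceil(queue_length / per_replica_target)
    return max(min_replicas, min(max_replicas, raw))

# 750 messages of Kafka lag with lagThreshold "100" -> 8 consumers
print(keda_desired_replicas(750, 100, min_replicas=0, max_replicas=50))  # 8
```

The zero-to-one transition is the part the HPA alone cannot do: KEDA's own controller activates the workload, then delegates 1-to-N scaling to the HPA it manages.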
KEDA installation
# Install KEDA via Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --version 2.14.0

# Verify
kubectl get pods -n keda
ScaledObject for Kafka
A worker consuming from a Kafka topic scales based on consumer group lag:
# keda-kafka-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-consumer
  pollingInterval: 15   # check every 15 seconds
  cooldownPeriod: 30    # wait 30s before scaling to 0
  minReplicaCount: 0    # scale to zero when there are no messages
  maxReplicaCount: 50
  advanced:
    restoreToOriginalReplicaCount: true
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-broker.kafka.svc:9092
      consumerGroup: my-consumer-group
      topic: orders-topic
      lagThreshold: "100"   # 100 messages per replica
      offsetResetPolicy: latest
      allowIdleConsumers: "false"
      scaleToZeroOnInvalidOffset: "false"
    authenticationRef:
      name: kafka-auth   # TriggerAuthentication with the Kafka credentials
ScaledObject for AWS SQS
# keda-sqs-scaledobject.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: production
data:
  AWS_ACCESS_KEY_ID: BASE64_KEY
  AWS_SECRET_ACCESS_KEY: BASE64_SECRET
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-trigger-auth
  namespace: production
spec:
  secretTargetRef:
  - parameter: awsAccessKeyID
    name: aws-credentials
    key: AWS_ACCESS_KEY_ID
  - parameter: awsSecretAccessKey
    name: aws-credentials
    key: AWS_SECRET_ACCESS_KEY
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sqs-worker
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: aws-trigger-auth
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/job-queue
      queueLength: "5"   # 5 messages per replica
      awsRegion: eu-west-1
      identityOwner: pod # use IRSA when available
ScaledObject on Prometheus
# keda-prometheus-scaledobject.yaml
# Scale based on a custom Prometheus query
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-latency-scaler
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
      metricName: http_request_duration_p99
      threshold: "0.5"   # scale when P99 latency > 500ms
      query: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="api-server"}[2m])) by (le))

# Check the KEDA status
kubectl get scaledobject -n production
kubectl describe scaledobject kafka-consumer-scaler -n production
Karpenter: Just-in-Time Node Provisioning
Karpenter is a next-generation node provisioner created by AWS and now a CNCF project. Unlike the Cluster Autoscaler, which works with predefined node groups, Karpenter provisions nodes with the exact characteristics that pending Pods require: instance type, zone, on-demand or spot capacity, CPU/GPU. The result: provisioning in 30-60 seconds versus 3-5 minutes with the Cluster Autoscaler.
Karpenter Architecture
Karpenter completely replaces the Cluster Autoscaler. It has two main CRDs:
- NodePool: defines the requirements of the nodes that Karpenter can create (instance types, zones, taints, labels, limits)
- NodeClass (EC2NodeClass on AWS): cloud-provider-specific configuration (AMI, subnet, security groups, user data)
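To build intuition for what happens when Pods go pending, here is a deliberately simplified sketch of Karpenter's selection idea: from the instance types the NodePool allows, pick the cheapest one that covers the aggregate requests. Real Karpenter bin-packs across many dimensions and uses live pricing; the catalog and prices below are made up for illustration:

```python
# Toy catalog: (name, vCPU, memory GiB, hourly price). Prices are illustrative.
CATALOG = [
    ("c6i.xlarge",  4,  8, 0.17),
    ("m6i.xlarge",  4, 16, 0.19),
    ("m6i.2xlarge", 8, 32, 0.38),
    ("r6i.2xlarge", 8, 64, 0.50),
]

def pick_instance(pending_cpu: float, pending_mem_gib: float) -> str:
    """Cheapest catalog entry whose capacity covers the aggregate requests."""
    fitting = [(price, name) for name, cpu, mem, price in CATALOG
               if cpu >= pending_cpu and mem >= pending_mem_gib]
    if not fitting:
        raise ValueError("no instance type fits the pending Pods")
    return min(fitting)[1]

# Pending Pods requesting a total of 6 vCPU and 20 GiB -> m6i.2xlarge
print(pick_instance(6, 20))
```

This is why accurate resource requests matter for Karpenter too: the requests of pending Pods are the only signal it has when sizing new nodes.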
Karpenter installation on EKS
# Prerequisite: IRSA configured for Karpenter
export CLUSTER_NAME="my-production-cluster"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_REGION=eu-west-1

# Install Karpenter with Helm (1.x charts are published to the public ECR OCI registry)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --version 1.0.0 \
  --set serviceAccount.annotations."eks.amazonaws.com/role-arn"=arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole \
  --set settings.clusterName=${CLUSTER_NAME} \
  --set settings.interruptionQueue=${CLUSTER_NAME} \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi
NodePool and EC2NodeClass for Production
# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    metadata:
      labels:
        node-type: general-purpose
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]   # prefers spot, falls back to on-demand
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]         # compute, memory, general purpose
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["2"]                   # only generation 3+ instances
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values: ["4", "8", "16", "32"]
      taints: []
      expireAfter: 720h                 # recycle nodes every 30 days
      terminationGracePeriod: 48h
  limits:
    cpu: "500"        # max 500 vCPUs in this NodePool
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m   # consolidate empty nodes after 1 minute
    budgets:
    - nodes: "20%"         # do not drain more than 20% of the nodes at a time
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  amiSelectorTerms:
  - alias: al2023@latest   # always use the latest AMI
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: "my-production-cluster"
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: "my-production-cluster"
  instanceProfile: KarpenterNodeInstanceProfile
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 100Gi
      volumeType: gp3
      iops: 3000
      encrypted: true
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1   # security: blocks IMDS access from containers
    httpTokens: required         # require IMDSv2
NodePool for GPU Workloads
# karpenter-gpu-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodes
spec:
  template:
    metadata:
      labels:
        node-type: gpu
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]   # spot GPUs are not available in every zone
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["g5", "p3", "p4d"]   # GPU instance families
      taints:
      - key: nvidia.com/gpu
        effect: NoSchedule      # only Pods that tolerate this taint
  limits:
    cpu: "128"
    memory: 1024Gi
    nvidia.com/gpu: "32"        # max 32 GPUs in this NodePool
Consolidation and Cost Optimization
# Remove the do-not-disrupt annotation so consolidation can proceed (useful for testing)
kubectl annotate node <node-name> karpenter.sh/do-not-disrupt-

# List the nodes created by Karpenter
kubectl get nodes -l karpenter.sh/nodepool=general-purpose -o wide

# Watch Karpenter's decisions in real time
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f | grep -E "launched|terminated|consolidated"

# Inspect the NodeClaims Karpenter created (pair with Kubecost for per-node cost)
kubectl get nodeclaims -o json | jq '.items[] | {name: .metadata.name, providerID: .status.providerID, nodepool: .metadata.labels["karpenter.sh/nodepool"]}'
Combine HPA, KEDA and Karpenter
In a mature production cluster, these three components work in synergy:
- KEDA scales Pods from 0 to N based on events (Kafka lag, SQS queue depth, Prometheus queries)
- Karpenter detects pending Pods and provisions nodes with the exact characteristics required in 30-60 seconds
- VPA (in Off mode) provides resource request recommendations that you apply manually or via CI/CD pipeline
Recommended Pattern for Production
- Stateless API servers: KEDA on Prometheus (P99 latency) + Karpenter general-purpose NodePool
- Queue workers: KEDA on SQS/Kafka with minReplicaCount=0 + Karpenter with an on-demand/spot mix
- Databases/StatefulSets: VPA in Auto mode with minReplicas >= 2, no HPA on memory
- Batch jobs: KEDA ScaledJob (not ScaledObject) for Kubernetes Jobs that run to completion
- Do not use an HPA on CPU together with KEDA on the same workload: the two controllers issue conflicting scaling decisions
KEDA ScaledJob for Batch
# keda-scaledjob.yaml - for batch jobs that run to completion
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: ml-training-job
  namespace: production
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: trainer
          image: my-registry/ml-trainer:latest
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
        tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        restartPolicy: Never
  pollingInterval: 30
  maxReplicaCount: 20
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 3
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: aws-trigger-auth
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/123456789/ml-jobs
      queueLength: "1"   # 1 Job per message
      awsRegion: eu-west-1
HPA and KEDA troubleshooting
# HPA not scaling? Check its status
kubectl describe hpa api-server-hpa -n production
# Look for: "AbleToScale", "ScalingActive", "DesiredReplicas"
# Common error: "failed to get cpu utilization" = metrics-server is not installed

# Verify that metrics-server works
kubectl top pods -n production
kubectl top nodes

# KEDA not scaling to zero? Check the cooldownPeriod
kubectl get scaledobject kafka-consumer-scaler -n production -o yaml | grep -A5 "conditions"

# Karpenter not provisioning?
kubectl get pods --field-selector=status.phase=Pending -A
kubectl describe pod <pending-pod> | grep "Events" -A20
# Look for: "0/N nodes are available" + the reason the Pod is pending

# View the Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50 | grep -i "error\|warning\|launched"

# Protect a Pod from Karpenter disruption (e.g. while debugging)
kubectl annotate pod <pod-name> karpenter.sh/do-not-disrupt=true
Best Practices and Anti-Patterns
Best Practices for Autoscaling
- Always set minReplicas >= 2 for critical services: scaling from 0 implies a cold start; for production APIs, keep at least 2 replicas at all times
- Use PodDisruptionBudgets: prevent Karpenter/HPA from draining too many Pods during consolidation
- Configure accurate resource requests: the HPA computes percentage utilization against resources.requests; if requests are too high, utilization stays low and the HPA never scales up; if they are too low, it scales constantly
- Strict readiness probes: Kubernetes waits for a Pod to be Ready before sending it traffic; without readiness probes, freshly scaled Pods receive traffic before they are ready
- Monitor flapping: if the HPA scales up and down every few minutes, increase the stabilizationWindowSeconds of the scale-down
- Use topology spread constraints with Karpenter: distribute Pods across zones for high availability even during provisioning
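The point about accurate resource requests is worth making concrete: the same absolute CPU usage yields very different utilization percentages depending on the request, and that percentage is what the HPA compares against its target. The numbers below are illustrative:

```python
def hpa_utilization(usage_millicores: int, request_millicores: int) -> int:
    """Utilization percentage as the HPA sees it: usage / request."""
    return round(100 * usage_millicores / request_millicores)

# A Pod actually using 400m of CPU, measured against a 60% target:
print(hpa_utilization(400, 500))    # 80  -> over target, scales up
print(hpa_utilization(400, 2000))   # 20  -> requests too high, never scales up
print(hpa_utilization(400, 100))    # 400 -> requests too low, scales aggressively
```

This is why rightsizing requests (for example from VPA recommendations) comes before any HPA tuning: with bad requests, every target is the wrong target.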
Anti-Patterns to Avoid
- HPA without resource requests defined: the HPA cannot compute percentage utilization without requests in the container spec
- VPA Auto + HPA on CPU/memory: the two controllers compete over the same resources and cause inconsistent scaling; use KEDA on custom metrics if you want both
- maxReplicas too low: if your peak traffic needs 100 Pods but maxReplicas is 20, autoscaling is not enough and the service degrades
- Karpenter without disruption budgets: without disruption.budgets, Karpenter can drain 100% of the nodes during a night-time consolidation
- Polling interval too low in KEDA: a pollingInterval of 5 seconds against external sources (SQS, external APIs) generates too many API calls and possible throttling
Conclusions and Next Steps
Effective autoscaling in Kubernetes is not a single solution but a strategy on multiple levels: KEDA for event-driven scaling of Pods, HPA for usage-based scaling on resource metrics, VPA to optimize resource requests, and Karpenter for fast node provisioning. Used together, these tools can cut costs by 30-50% compared to statically provisioned clusters while maintaining high SLAs.
The keys to success are accurate resource requests (the VPA helps here), choosing the right metrics to scale on (CPU is not always the answer), and tuning the scaling behavior (stabilization windows, rate limiting) to avoid oscillations that worsen performance instead of improving it.