Cluster observability: Prometheus, Grafana and OpenTelemetry
“If you can't measure it, you can't manage it.” For a production Kubernetes cluster, this means having visibility along three dimensions: metrics (how much each component consumes), logs (what is happening in the system), and traces (how requests move through the microservices). The Prometheus + Grafana + Loki + OpenTelemetry stack is the open-source answer to this need.
In this article we will build a complete observability stack for Kubernetes: we will install kube-prometheus-stack for infrastructure metrics, configure Loki for log aggregation, and integrate the OpenTelemetry Collector to collect distributed traces from applications and forward them to Tempo (Grafana's tracing backend). The result is a unified observability platform surfaced in Grafana.
What You Will Learn
- Install kube-prometheus-stack: Prometheus Operator, kube-state-metrics, Node Exporter
- Create ServiceMonitor and PodMonitor for automatic app scraping
- PrometheusRule for critical cluster alerts (OOMKill, CrashLoopBackOff, etc.)
- Loki + Promtail for aggregated logs with Kubernetes labels
- OpenTelemetry Collector: configurable telemetry pipeline
- Grafana Tempo for distributed tracing
- Prebuilt Grafana dashboards for Kubernetes
- Metrics-log-trace correlation in Grafana (Exemplars)
Architecture of the Observability Stack
Before installing, let's understand how the components relate to each other:
- Prometheus: Collects metrics via HTTP scraping; stores data for 15-30 days
- kube-state-metrics: Exposes metrics on the status of K8s objects (Deployment, Pod, etc.)
- Node Exporter: Exposes node hardware metrics (CPU, disk, network)
- Loki: Aggregates Pod logs. It does not index log content, only the labels
- Promtail: DaemonSet that sends container logs to Loki
- OpenTelemetry Collector: Receives traces/metrics/logs from apps and routes them to backends
- Grafana Tempo: Backend for distributed tracing (traces)
- Grafana: Unified UI to view metrics (Prometheus), logs (Loki), and traces (Tempo)
Installing kube-prometheus-stack
# Install kube-prometheus-stack (includes Prometheus, Alertmanager, Grafana, kube-state-metrics, Node Exporter)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Production values.yaml
cat > kube-prometheus-values.yaml << 'EOF'
# Prometheus
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: ssd
          resources:
            requests:
              storage: 100Gi
    # Scrape all ServiceMonitors/PodMonitors in the cluster
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 8Gi

# Alertmanager
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: ssd
          resources:
            requests:
              storage: 10Gi
  config:
    global:
      slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    route:
      receiver: 'slack-critical'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      routes:
        - receiver: 'slack-critical'
          matchers:
            - alertname =~ ".*Critical.*"
        - receiver: 'slack-warning'
          matchers:
            - severity = warning
    receivers:
      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
      - name: 'slack-warning'
        slack_configs:
          - channel: '#alerts-warning'
            send_resolved: true

# Grafana
grafana:
  enabled: true
  ingress:
    enabled: true
    hosts:
      - grafana.company.com
  persistence:
    enabled: true
    size: 10Gi
  # Pre-configured Loki datasource
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki.monitoring.svc:3100
      jsonData:
        derivedFields:
          - datasourceUid: tempo
            matcherRegex: '"traceID":"(\w+)"'
            name: TraceID
            url: '${__value.raw}'
    - name: Tempo
      type: tempo
      url: http://tempo.monitoring.svc:3100

# kube-state-metrics
kube-state-metrics:
  metricLabelsAllowlist:
    - pods=[team,environment,app]
    - deployments=[team,environment,app]
EOF
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --version 65.0.0 \
  -f kube-prometheus-values.yaml

# Verify
kubectl get pods -n monitoring
kubectl get servicemonitors -A
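Once the stack is running, you can sanity-check it by querying the Prometheus HTTP API directly. A minimal sketch, assuming a port-forward to the Prometheus Service on localhost:9090 (the exact Service name depends on your Helm release name, so adjust it as needed):

```python
import urllib.parse

def build_query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for Prometheus' /api/v1/query endpoint."""
    return f"{base}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"

# Check that node-exporter targets are up
url = build_query_url("http://localhost:9090", 'up{job="node-exporter"}')
print(url)

# To execute the query (requires the port-forward to be active):
# import json, urllib.request
# with urllib.request.urlopen(url) as resp:
#     result = json.load(resp)
#     print(result["status"])
```

Start the port-forward with `kubectl port-forward -n monitoring svc/<prometheus-service> 9090` before running the query.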
ServiceMonitor for Application Scraping
The Prometheus Operator uses ServiceMonitor and PodMonitor resources to configure application metrics scraping dynamically. There is no need to modify the Prometheus configuration by hand:
# servicemonitor-api-service.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
  namespace: team-alpha-production
  labels:
    team: team-alpha        # label used to select this monitor
spec:
  selector:
    matchLabels:
      app: api-service      # selects the Service with this label
  endpoints:
    - port: metrics         # port name in the Service
      interval: 30s
      path: /metrics
      # Basic auth if the metrics endpoint is protected
      # basicAuth:
      #   username: { name: metrics-auth, key: username }
      #   password: { name: metrics-auth, key: password }
  namespaceSelector:
    matchNames:
      - team-alpha-production
---
# Add the metrics port to the application's Service
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: team-alpha-production
  labels:
    app: api-service
spec:
  selector:
    app: api-service
  ports:
    - name: http
      port: 8080
    - name: metrics   # dedicated metrics port
      port: 9090
      targetPort: 9090
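The `metrics` port declared above must serve the Prometheus text exposition format. A real service would use a Prometheus client library; as a stdlib-only sketch (the metric name and counter are illustrative), here is what the scraped endpoint actually returns:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative in-memory counter; a client library would manage this for you
REQUEST_COUNT = {"GET /healthz": 0}

def render_metrics() -> str:
    """Emit the Prometheus text exposition format by hand."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for endpoint, count in REQUEST_COUNT.items():
        method, path = endpoint.split(" ")
        lines.append(f'http_requests_total{{method="{method}",path="{path}"}} {count}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve on the "metrics" port of the Service above (blocks forever):
# HTTPServer(("0.0.0.0", 9090), MetricsHandler).serve_forever()
```

Prometheus scrapes this endpoint every `interval` (30s here) and stores each sample with the labels from the exposition plus the Kubernetes metadata labels.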
Critical Cluster Alerts
# prometheusrule-kubernetes-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-critical-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: kubernetes.critical
      rules:
        # Pod in CrashLoopBackOff
        - alert: PodCrashLoopBackOff
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
            and
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} in CrashLoopBackOff"
        # Pod OOMKilled
        - alert: PodOOMKilled
          expr: |
            kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} terminated by OOM"
            description: "Increase the memory limits of container {{ $labels.container }}"
        # Node under memory pressure
        - alert: NodeMemoryPressure
          expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} under MemoryPressure"
        # PVC almost full
        - alert: PersistentVolumeFillingUp
          expr: |
            kubelet_volume_stats_available_bytes /
            kubelet_volume_stats_capacity_bytes < 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is over 85% full"
        # Deployment with zero available replicas
        - alert: DeploymentUnavailable
          expr: kube_deployment_status_replicas_available == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has 0 available replicas"
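The PVC rule is worth a second look: the expression fires when available/capacity drops below 0.15, i.e. when the volume is more than 85% full. A quick Python mirror of the arithmetic:

```python
def pvc_alert_fires(available_bytes: float, capacity_bytes: float) -> bool:
    """Mirror of the PromQL expression:
    kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.15"""
    return available_bytes / capacity_bytes < 0.15

GIB = 1024 ** 3
print(pvc_alert_fires(20 * GIB, 100 * GIB))  # 20% free: no alert
print(pvc_alert_fires(10 * GIB, 100 * GIB))  # 10% free: alert fires
```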
Loki + Promtail for Aggregated Logs
# Install Loki (monolithic mode for a medium-sized cluster)
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki \
  --namespace monitoring \
  --set loki.commonConfig.replication_factor=1 \
  --set loki.storage.type=filesystem \
  --set singleBinary.replicas=1 \
  --set monitoring.selfMonitoring.enabled=false

# Install Promtail (DaemonSet that ships container logs to Loki)
helm install promtail grafana/promtail \
  --namespace monitoring \
  --set config.clients[0].url=http://loki.monitoring.svc:3100/loki/api/v1/push \
  --set config.snippets.extraScrapeConfigs='
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      pipeline_stages:
        - cri: {}
        - labeldrop:
            - filename
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_team]
          target_label: team
        - source_labels: [__meta_kubernetes_pod_label_app]
          target_label: app
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace'

# Loki queries in Grafana (LogQL):
# All error logs from team-alpha:
#   {namespace="team-alpha-production"} |= "ERROR"
# Error rate per app over the last 5 minutes:
#   sum(rate({namespace="team-alpha-production"} |= "ERROR" [5m])) by (app)
# Structured JSON logs - extract a field:
#   {app="api-service"} | json | level="error" | line_format "{{.message}}"
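To make the last LogQL pipeline concrete, here is a plain-Python mirror of what each stage does: `| json` parses the line, `| level="error"` filters on the extracted field, and `| line_format "{{.message}}"` rewrites the output line (the sample log lines are invented for illustration):

```python
import json

raw_lines = [
    '{"level": "info", "message": "request served", "traceID": "abc123"}',
    '{"level": "error", "message": "db timeout", "traceID": "def456"}',
]

def logql_like(lines):
    out = []
    for line in lines:
        fields = json.loads(line)           # | json
        if fields.get("level") != "error":  # | level="error"
            continue
        out.append(fields["message"])       # | line_format "{{.message}}"
    return out

print(logql_like(raw_lines))  # ['db timeout']
```

This is also why Loki stays cheap: the JSON parsing happens at query time, so only the stream labels (namespace, app, team) are indexed, never the log content.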
OpenTelemetry Collector for Distributed Traces
# Install the OpenTelemetry Operator
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
---
# otel-collector.yaml - telemetry pipeline
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  mode: daemonset   # one collector per node
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      # Adds Kubernetes metadata to traces (namespace, Pod name, etc.)
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.node.name
          labels:
            - tag_name: team
              key: team
              from: pod
      # Sampling: keep only 10% of traces in production (high volume)
      probabilistic_sampler:
        sampling_percentage: 10
    exporters:
      otlp/tempo:
        endpoint: http://tempo.monitoring.svc:4317
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
        const_labels:
          cluster: production
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes, probabilistic_sampler, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [k8sattributes, batch]
          exporters: [prometheus]
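For the Collector to receive a coherent trace, every service must forward the trace context on outgoing requests. OTel SDKs do this automatically via the W3C `traceparent` header; as a stdlib-only sketch of what that header contains (not part of the Collector config):

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C Trace Context traceparent header:
    version-traceid-spanid-flags."""
    version = "00"
    trace_id = secrets.token_hex(16)   # 16 bytes -> 32 hex chars, shared by all spans in the trace
    span_id = secrets.token_hex(8)     # 8 bytes -> 16 hex chars, identifies this hop's span
    flags = "01" if sampled else "00"  # sampled bit hints downstream sampling decisions
    return f"{version}-{trace_id}-{span_id}-{flags}"

header = make_traceparent()
print(header)  # e.g. 00-<32 hex chars>-<16 hex chars>-01
```

Each downstream service keeps the trace ID, generates a fresh span ID, and exports its span to the Collector, which is how Tempo can later reassemble the full request path.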
Installation of Grafana Tempo
# Install Grafana Tempo (distributed mode)
helm install tempo grafana/tempo-distributed \
  --namespace monitoring \
  --set storage.trace.backend=local

# Alternative: monolithic Tempo for small/medium clusters
helm install tempo grafana/tempo \
  --namespace monitoring \
  --set tempo.storage.trace.backend=filesystem \
  --set tempo.storage.trace.local.path=/var/tempo
Grafana Dashboards for Kubernetes
Grafana has a catalog of pre-built dashboards. Import these IDs from the UI (Dashboards > Import):
| Dashboard | Grafana ID | Purpose |
|---|---|---|
| Kubernetes Cluster Overview | 7249 | CPU/memory/Pod overview by node |
| Kubernetes Deployments | 8588 | Deployment status, restart rate, replicas |
| Node Exporter Full | 1860 | Node hardware metrics (CPU, disk, network) |
| Kubernetes PVC | 13646 | PVC storage usage |
| Loki Dashboard | 15141 | Aggregated logs, error rate, log explorer |
| NGINX Ingress Controller | 9614 | Request rate, latency, ingress status codes |
Metrics-Log-Trace Correlation (Exemplars)
The true power of this stack is correlation: from a high-latency metric you can jump directly to the corresponding trace, and from the trace to the logs of that Pod at that moment. The links from metrics to traces are called exemplars:
# Enable exemplars in Prometheus (already enabled in kube-prometheus-stack)

# In the application, attach the traceID to the histogram metric:
# Go/Python with OpenTelemetry:
#   when recording a histogram observation, attach an exemplar with the current trace ID
#   the Prometheus scraper collects and stores it alongside the sample

# In Grafana:
# 1. Open the API latency dashboard
# 2. Spot a latency spike
# 3. Click the diamond (exemplar) on the graph
# 4. Grafana opens the corresponding trace in Tempo
# 5. In the trace, click the failing service
# 6. Grafana shows that Pod's logs in Loki for that timestamp

# This metrics -> traces -> logs flow is the "holy grail" of observability
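On the wire, an exemplar is a trailing annotation on a histogram bucket line in the OpenMetrics exposition format. Client libraries emit this for you; a stdlib sketch of the line Prometheus actually scrapes (metric name and values are illustrative):

```python
def bucket_line_with_exemplar(metric: str, le: str, count: int,
                              trace_id: str, observed: float) -> str:
    """Format an OpenMetrics histogram bucket line carrying an exemplar:
    <metric>_bucket{le="..."} <count> # {trace_id="..."} <observed value>"""
    return (f'{metric}_bucket{{le="{le}"}} {count} '
            f'# {{trace_id="{trace_id}"}} {observed}')

line = bucket_line_with_exemplar(
    "http_request_duration_seconds", "0.5", 42,
    "4bf92f3577b34da6a3ce929d0e0e4736", 0.43,
)
print(line)
```

When Grafana renders the histogram, each stored exemplar becomes a clickable diamond whose `trace_id` is matched against the Tempo datasource.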
Best Practices for Kubernetes Observability
Production Observability Checklist
- USE Method for resources: For each resource (CPU, memory, disk, network): Utilization, Saturation, Errors. These are the 3 fundamental alerts for each node
- RED Method for services: For each service: Rate (req/s), Errors (error rate), Duration (latency). Alert on all three
- SLO-based alerting: Don't alert on every anomaly, but only when you are using up your SLO error budget. Less noise, more signal
- Differentiated retention: Raw metrics 15 days, monthly aggregates 1 year. Raw logs 7 days, audit logs 1 year
- Sampling for traces: Don't keep 100% of traces in production; it costs too much. 1-10% is enough for debugging
- Consistent labels: Each metric, log and trace must have team, environment, app, version for filtering
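To make SLO-based alerting concrete: with a 99.9% availability SLO the error budget is 0.1%, and the burn rate is the observed error ratio divided by that budget. A burn rate of 14.4 over a short window is the commonly cited page-worthy threshold (it would exhaust a 30-day budget in roughly two days):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being consumed.
    1.0 means exactly on budget; >1 means burning faster than allowed."""
    budget = 1.0 - slo          # 99.9% SLO -> 0.1% error budget
    return error_ratio / budget

print(burn_rate(0.0144, 0.999))  # ~14.4: page immediately
print(burn_rate(0.0005, 0.999))  # ~0.5: within budget, no alert
```

Alerting on burn rate instead of raw error count is what turns "every anomaly pages someone" into "only budget-threatening incidents page someone".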
Conclusions and Next Steps
A complete observability stack (Prometheus for metrics, Loki for logs, OpenTelemetry + Tempo for traces) transforms Kubernetes from a black box into an understandable system. Correlating the three signals in Grafana cuts average debugging time from hours to minutes.
The next and last article in this series, Kubernetes Multi-Cloud with Federation and Submariner, addresses the challenge of managing multiple clusters across different cloud providers as if they were one, extending all the concepts of this series to the multi-cluster scenario.
Related Series
- Observability and OpenTelemetry — in-depth analysis of application instrumentation
- GitOps with ArgoCD — Argo Rollouts uses Prometheus for automatic canary analysis