Monitoring Kafka: JMX Exporter, Prometheus and Grafana Dashboard
A silent Kafka cluster is not a healthy cluster - it's a cluster you're not looking at. The consumer lag that keeps growing, the under-replicated partitions that accumulate, the request latency that creeps up: these signals precede failures by hours or days. In this guide we configure a complete monitoring stack: JMX Exporter, Prometheus, Grafana, and the alerts that every production Kafka cluster should have.
Why Kafka Monitoring is Critical
Apache Kafka is designed to be resilient, but resilience is not automatic: it requires active observability. An overloaded broker, a consumer group that can't keep up with producers, or a partition with degraded replication can all turn into data loss or downtime if not detected early.
The three fundamental dimensions to monitor in Kafka are:
- Cluster health: active brokers, active controller, under-replicated partitions, offline partitions
- Producer performance: request rate, record error rate, request latency, batch size
- Consumer performance: consumer lag per group and per partition, commit rate, rebalance frequency
The 5 Most Critical Kafka Metrics
- kafka.consumer.lag: difference between LEO and committed offset — the #1 metric for system health
- kafka.server.UnderReplicatedPartitions: partitions with incomplete ISR — degraded broker signal
- kafka.server.OfflinePartitionsCount: leaderless partitions — critical error, data inaccessible
- kafka.network.RequestMetrics.TotalTimeMs: total latency of requests
- kafka.server.BrokerTopicMetrics.MessagesInPerSec: ingestion throughput
How Kafka Exposes Metrics: JMX
Kafka exposes all of its internal metrics via JMX (Java Management Extensions), the standard framework for monitoring JVM applications. Each metric is identified by an MBean name in the format domain:type=Type,name=Metric.
To make JMX metrics accessible to Prometheus, the most widely used monitoring system in cloud-native environments, you use the JMX Exporter: a Java agent that exposes JMX metrics as an HTTP endpoint in the Prometheus text format (pull model).
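Once the agent is running (configuration below), scraping the endpoint returns plain-text metrics. The lines below are an illustrative sample (the values are made up, but the metric names match the rules configured in this guide):

```text
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 0
# TYPE kafka_controller_kafkacontroller_activecontrollercount gauge
kafka_controller_kafkacontroller_activecontrollercount 1
# TYPE kafka_server_brokertopicmetrics_messagesinpersec gauge
kafka_server_brokertopicmetrics_messagesinpersec{topic="orders"} 1234.5
```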
Configure JMX Exporter as a Java Agent
JMX Exporter runs as a JVM agent: you simply add the -javaagent parameter to the Kafka broker JVM. Download the JAR from the official Prometheus repository and create a YAML configuration file that specifies which MBeans to collect and how to map them to Prometheus metric names.
# Download JMX Exporter (latest stable version)
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.0.1/jmx_prometheus_javaagent-1.0.1.jar \
  -O /opt/kafka/libs/jmx_prometheus_javaagent.jar
# Configuration file: /opt/kafka/config/kafka-jmx-exporter.yml
# This file tells JMX Exporter which MBeans to collect
# kafka-jmx-exporter.yml
# Complete configuration for Kafka broker + consumer group lag
---
startDelaySeconds: 0
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # Broker metrics: message throughput
  # (the topic capture group feeds the "topic" label)
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=(.+)><>OneMinuteRate'
    name: kafka_server_brokertopicmetrics_messagesinpersec
    labels:
      topic: "$1"
  # Under-replicated partitions: CRITICAL
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_replicamanager_underreplicatedpartitions
  # Offline partitions: CRITICAL
  - pattern: 'kafka.controller<type=KafkaController, name=OfflinePartitionsCount><>Value'
    name: kafka_controller_kafkacontroller_offlinepartitionscount
  # Active controller count (must always be 1)
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_kafkacontroller_activecontrollercount
  # Request latency per request type
  - pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(\w+)><>(Mean|99thPercentile)'
    name: kafka_network_requestmetrics_totaltimems
    labels:
      request: "$1"
      quantile: "$2"
  # Producer byte rate (broker-level aggregate)
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec><>OneMinuteRate'
    name: kafka_server_brokertopicmetrics_bytesinpersec
  # Log size per topic-partition
  - pattern: 'kafka.log<type=Log, name=Size, topic=(.+), partition=(\d+)><>Value'
    name: kafka_log_log_size
    labels:
      topic: "$1"
      partition: "$2"
  # JVM heap usage (HeapMemoryUsage is composite: the key becomes the "area" label)
  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>(\w+)'
    name: jvm_heap_memory_usage
    labels:
      area: "$1"
    type: GAUGE
Enable JMX Exporter in the Broker
# Edit kafka-server-start.sh or set an environment variable
# Add to bin/kafka-server-start.sh or configure via systemd
# Option 1: KAFKA_OPTS environment variable
export KAFKA_OPTS="-javaagent:/opt/kafka/libs/jmx_prometheus_javaagent.jar=9404:/opt/kafka/config/kafka-jmx-exporter.yml"
# Option 2: in the systemd service file
# [Service]
# Environment="KAFKA_OPTS=-javaagent:/opt/kafka/libs/jmx_prometheus_javaagent.jar=9404:/opt/kafka/config/kafka-jmx-exporter.yml"
# Verify the endpoint responds after restarting the broker
curl http://localhost:9404/metrics | grep kafka_server_replicamanager
Consumer Lag: The Most Important Metric
Consumer lag measures how many records a consumer still has to process: it is the difference between the Log End Offset (the last record written to the topic) and the Committed Offset (the last record processed by the consumer group). A steadily increasing lag means the consumers are not keeping up with the producers.
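As a worked example (the numbers are made up): if a partition's Log End Offset is 1200 and the group's committed offset for it is 950, the lag on that partition is 250. A minimal sketch of the same arithmetic:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LagMath {
    // Lag per partition = logEndOffset - committedOffset, floored at 0
    // (a committed offset can momentarily exceed a stale end-offset reading).
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0, logEndOffset - committedOffset);
    }

    public static void main(String[] args) {
        // Hypothetical partitions: {LEO, committed offset}
        Map<String, long[]> partitions = new LinkedHashMap<>();
        partitions.put("orders-0", new long[] {1200, 950}); // lagging by 250
        partitions.put("orders-1", new long[] {800, 800});  // fully caught up

        long total = 0;
        for (Map.Entry<String, long[]> e : partitions.entrySet()) {
            long l = lag(e.getValue()[0], e.getValue()[1]);
            System.out.println(e.getKey() + " lag=" + l);
            total += l;
        }
        System.out.println("total lag=" + total); // prints "total lag=250"
    }
}
```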
The brokers' JMX metrics do not expose consumer lag directly: it must be collected by querying the consumer group coordinator. The standard tools for this are Kafka Lag Exporter (an open source project originally from Lightbend) and LinkedIn's Burrow.
# Docker Compose for Kafka Lag Exporter
# https://github.com/seglo/kafka-lag-exporter
services:
  kafka-lag-exporter:
    image: seglo/kafka-lag-exporter:0.8.0
    ports:
      - "8000:8000"
    volumes:
      - ./lag-exporter.conf:/opt/docker/conf/application.conf
    environment:
      JAVA_OPTS: "-Xmx256m"

# lag-exporter.conf (HOCON format)
kafka-lag-exporter {
  port = 8000
  clusters = [
    {
      name = "production-cluster"
      bootstrap-brokers = "kafka1:9092,kafka2:9092,kafka3:9092"
      # Polling interval
      poll-interval = 30 seconds
      # Consumer groups to monitor (regex list, empty = all)
      consumer-group-whitelist = [".*"]
    }
  ]
  # Prometheus metrics exposed:
  #   kafka_consumer_group_latest_offset
  #   kafka_consumer_group_partition_lag
  #   kafka_consumer_group_sum_lag  <-- total lag per group
  #   kafka_consumer_group_max_lag  <-- max lag across partitions
}
Calculate Lag with PromQL
# PromQL queries for Grafana / Prometheus
# Total lag per consumer group
sum(kafka_consumer_group_partition_lag) by (group)
# Top 10 consumer groups by lag
topk(10, sum(kafka_consumer_group_partition_lag) by (group))
# Lag growth rate: deriv() rather than rate(), since lag is a gauge
# (positive = consumers are falling behind)
deriv(kafka_consumer_group_sum_lag[5m])
# Consumer lag for a specific topic
sum(kafka_consumer_group_partition_lag{topic="ordini-effettuati"}) by (group)
# Alert: lag above a critical threshold for more than 5 minutes
# kafka_consumer_group_sum_lag{group="servizio-inventario"} > 10000
Configure Prometheus for Kafka
Prometheus uses a pull model: it is Prometheus that collects metrics from the endpoints exposed by the exporters, at a configurable interval (scrape_interval). The configuration specifies the scrape targets: the HTTP endpoints to query.
# prometheus.yml - Kafka scraping configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "kafka-alerts.yml"

scrape_configs:
  # JMX Exporter on every Kafka broker
  - job_name: "kafka-brokers"
    static_configs:
      - targets:
          - "kafka1:9404"
          - "kafka2:9404"
          - "kafka3:9404"
    relabel_configs:
      - source_labels: [__address__]
        target_label: broker
        regex: "([^:]+):.*"
        replacement: "$1"

  # Kafka Lag Exporter for consumer group metrics
  - job_name: "kafka-lag-exporter"
    static_configs:
      - targets:
          - "kafka-lag-exporter:8000"
    scrape_interval: 30s  # lag changes less frequently

  # JVM metrics: the JMX Exporter java agent already exposes the standard
  # jvm_* metrics on the same 9404 endpoint, so no separate scrape job is needed.
Alert Rules Kafka: The Essential Rules
Alert rules define the conditions that trigger a notification. In Prometheus they are written in PromQL and grouped into separate YAML files. Here are the baseline rules for Kafka in production:
# kafka-alerts.yml - Alert rules for Prometheus Alertmanager
groups:
  - name: kafka.critical
    rules:
      # Broker down: no data from the JMX endpoint
      - alert: KafkaBrokerDown
        expr: up{job="kafka-brokers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker {{ $labels.broker }} unreachable"
          description: "Broker {{ $labels.broker }} has not responded for more than 1 minute."
      # Offline partitions: CRITICAL, data not accessible
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} offline partitions in Kafka"
          description: "Partitions without a leader: data cannot be read or written."
      # No active controller
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka has no active controller"

  - name: kafka.warning
    rules:
      # Under-replicated partitions: degraded replication
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} under-replicated partitions on {{ $labels.broker }}"
          description: "Degraded replication: a broker may be slow or down."
      # High consumer lag
      - alert: KafkaConsumerLagHigh
        expr: sum(kafka_consumer_group_partition_lag) by (group) > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag for group {{ $labels.group }}"
          description: "Lag: {{ $value }}. Consumers are not keeping up with producers."
      # Prolonged critical consumer lag
      - alert: KafkaConsumerLagCritical
        expr: sum(kafka_consumer_group_partition_lag) by (group) > 200000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL consumer lag for group {{ $labels.group }}"
      # High JVM heap usage
      - alert: KafkaBrokerHeapHigh
        expr: jvm_heap_memory_usage{area="used"} / jvm_heap_memory_usage{area="max"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High JVM heap on Kafka broker {{ $labels.broker }}"
Grafana Dashboard for Kafka
Grafana is the standard visualization tool for Prometheus. Ready-made dashboards can be imported directly from the Grafana dashboard repository (grafana.com/dashboards). The most used for Kafka are:
- Dashboard ID 7589: Kafka Overview (broker health, throughput, partitions)
- Dashboard ID 9021: Confluent Platform Kafka Metrics
- Dashboard ID 13282: Kafka Consumer Lag (detailed consumer group lag)
To import a dashboard into Grafana: Dashboards → Import → enter ID → select Prometheus datasource.
Key Panels to Have in Your Dashboard
# Panel 1: Broker Health Overview
# Stat panel: up/down per broker
up{job="kafka-brokers"}
# Panel 2: Under-Replicated Partitions
# Stat panel with colored thresholds (green=0, red>0)
sum(kafka_server_replicamanager_underreplicatedpartitions)
# Panel 3: Consumer Lag per Group (time series)
sum(kafka_consumer_group_partition_lag) by (group)
# Panel 4: Messages In per second
# (already a rate: the JMX rule exports the OneMinuteRate attribute)
sum(kafka_server_brokertopicmetrics_messagesinpersec) by (topic)
# Panel 5: Request Latency p99 (heatmap or time series)
kafka_network_requestmetrics_totaltimems{quantile="99thPercentile"}
# Panel 6: Disk Usage per Broker (requires node_exporter)
# Useful for planning storage expansion
node_filesystem_size_bytes{mountpoint="/kafka-data"} -
node_filesystem_avail_bytes{mountpoint="/kafka-data"}
Docker Compose: Complete Stack Monitoring
For development or staging environments, this Docker Compose file launches the entire monitoring stack: Kafka with the JMX Exporter agent, Kafka Lag Exporter, Prometheus and Grafana.
# docker-compose.monitoring.yml
version: "3.9"

services:
  kafka:
    image: apache/kafka:4.0.0
    container_name: kafka
    ports:
      - "9092:9092"
      - "9404:9404"  # JMX Exporter
    volumes:
      - ./config/kafka-jmx-exporter.yml:/opt/kafka-jmx-exporter.yml
      - ./libs/jmx_prometheus_javaagent.jar:/opt/jmx_prometheus_javaagent.jar
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: "broker,controller"
      KAFKA_LISTENERS: "PLAINTEXT://kafka:9092,CONTROLLER://kafka:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://localhost:9092"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka:9093"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      # Note: Kafka expects a base64-encoded UUID here;
      # generate one with: kafka-storage.sh random-uuid
      CLUSTER_ID: "monitoring-demo-cluster"
      KAFKA_OPTS: "-javaagent:/opt/jmx_prometheus_javaagent.jar=9404:/opt/kafka-jmx-exporter.yml"

  kafka-lag-exporter:
    image: seglo/kafka-lag-exporter:0.8.0
    container_name: kafka-lag-exporter
    ports:
      - "8000:8000"
    volumes:
      - ./config/lag-exporter.conf:/opt/docker/conf/application.conf
    depends_on:
      - kafka

  prometheus:
    image: prom/prometheus:v2.50.1
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/kafka-alerts.yml:/etc/prometheus/kafka-alerts.yml
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  grafana:
    image: grafana/grafana:10.3.1
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "kafka-monitoring"
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./config/grafana-datasources.yml:/etc/grafana/provisioning/datasources/prometheus.yml
      - ./config/grafana-dashboards.yml:/etc/grafana/provisioning/dashboards/kafka.yml

volumes:
  grafana-data:
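The two Grafana provisioning files mounted into the grafana container are not shown above. A minimal sketch of what they could contain, following Grafana's standard provisioning format (the dashboards path is an assumption; dashboard JSON files exported from grafana.com would go there):

```yaml
# config/grafana-datasources.yml - datasource provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

# config/grafana-dashboards.yml - dashboard provider
apiVersion: 1
providers:
  - name: kafka
    type: file
    options:
      path: /var/lib/grafana/dashboards
```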
Consumer Lag Monitoring with Java (Programmatic)
In some cases it is useful to measure consumer lag directly in application code, for example to expose it as a metric via Micrometer/Actuator in a Spring Boot application. This approach complements external monitoring and allows more precise correlation with application behavior.
// ConsumerLagMonitor.java - Measures lag programmatically with the Kafka AdminClient
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class ConsumerLagMonitor implements AutoCloseable {

    private final AdminClient adminClient;
    private final String groupId;

    public ConsumerLagMonitor(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        this.adminClient = AdminClient.create(props);
        this.groupId = groupId;
    }

    public Map<TopicPartition, Long> getLagPerPartition() throws ExecutionException, InterruptedException {
        // 1. Fetch the consumer group's committed offsets
        ListConsumerGroupOffsetsResult offsetsResult =
                adminClient.listConsumerGroupOffsets(groupId);
        Map<TopicPartition, OffsetAndMetadata> committedOffsets =
                offsetsResult.partitionsToOffsetAndMetadata().get();

        // 2. Fetch the end offset (Log End Offset) for the same partitions
        Map<TopicPartition, OffsetSpec> latestOffsetRequest = committedOffsets.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
        ListOffsetsResult latestOffsets = adminClient.listOffsets(latestOffsetRequest);
        Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                latestOffsets.all().get();

        // 3. Compute lag = endOffset - committedOffset
        Map<TopicPartition, Long> lag = new HashMap<>();
        for (Map.Entry<TopicPartition, OffsetAndMetadata> entry : committedOffsets.entrySet()) {
            if (entry.getValue() == null) continue; // partition with no committed offset
            TopicPartition tp = entry.getKey();
            long committedOffset = entry.getValue().offset();
            long endOffset = endOffsets.get(tp).offset();
            lag.put(tp, Math.max(0, endOffset - committedOffset));
        }
        return lag;
    }

    public long getTotalLag() throws ExecutionException, InterruptedException {
        return getLagPerPartition().values().stream()
                .mapToLong(Long::longValue)
                .sum();
    }

    @Override
    public void close() {
        adminClient.close();
    }
}
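As a hypothetical complement (the class and metric name below are illustrative, not part of the code above): the lag map could be rendered in the Prometheus text exposition format and served from your application's own /metrics endpoint. A dependency-free sketch of the formatting step, using "topic-partition" strings as keys so it runs without kafka-clients:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LagExposition {

    // Renders per-partition lag as Prometheus text-format lines.
    // With kafka-clients on the classpath you would iterate a
    // Map<TopicPartition, Long> and use topic()/partition() instead.
    static String render(String group, Map<String, Long> lagByPartition) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : lagByPartition.entrySet()) {
            // Split "topic-partition" on the last dash
            int dash = e.getKey().lastIndexOf('-');
            String topic = e.getKey().substring(0, dash);
            String partition = e.getKey().substring(dash + 1);
            sb.append("app_consumer_group_partition_lag{group=\"").append(group)
              .append("\",topic=\"").append(topic)
              .append("\",partition=\"").append(partition)
              .append("\"} ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Long> lag = new LinkedHashMap<>();
        lag.put("orders-0", 250L);
        lag.put("orders-1", 0L);
        System.out.print(render("inventory-service", lag));
    }
}
```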
Best Practices for Monitoring Kafka
Anti-Pattern: Alerting on Every Lag Spike
Consumer lag fluctuates naturally: a momentary production peak or a consumer GC pause causes temporary lag that recovers quickly. Configuring alerts on absolute thresholds without a time window (for: 10m) generates constant false positives. Always require a minimum duration (5-10 minutes) before an alert fires.
- Monitor the trend, not the absolute value: a lag of 100K that is shrinking is less worrying than a lag of 5K that keeps growing. Use deriv() in PromQL to see the derivative.
- Separate dashboards by persona: an operational dashboard with critical alerts for SREs, a capacity planning dashboard for the infrastructure team, a business KPI dashboard for product managers.
- Include JVM metrics: many Kafka performance problems originate from GC pressure on the broker. Always monitor jvm_gc_pause_seconds and heap usage.
- Keep Prometheus data for at least 30 days: enough for trend analysis and monthly capacity planning. Consider Thanos or Mimir for long-term retention.
- Test alerts with promtool and amtool: verify that the alert rules are syntactically correct and that they fire on the expected test data before going to production.
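As a sketch of that last step (file paths are illustrative; promtool ships with Prometheus, amtool with Alertmanager):

```shell
# Validate the Prometheus configuration and the alert rule syntax
promtool check config prometheus.yml
promtool check rules kafka-alerts.yml

# Fire a synthetic alert against Alertmanager to test routing/notifications
amtool alert add KafkaConsumerLagHigh \
  group=inventory-service severity=warning \
  --alertmanager.url=http://localhost:9093
```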
Kafka in Cloud Managed: Confluent Cloud Metrics API
If you use Confluent Cloud or a managed Kafka service (AWS MSK, Aiven for Apache Kafka), JMX metrics are not directly accessible. In that case you use the cloud provider's metrics API. Confluent Cloud exposes a Prometheus-compatible export endpoint that Prometheus can scrape directly:
# prometheus.yml for the Confluent Cloud Metrics API
scrape_configs:
  - job_name: "confluent-cloud"
    honor_timestamps: true
    metrics_path: "/v2/metrics/cloud/export"
    scheme: https
    basic_auth:
      username: "<CONFLUENT_CLOUD_API_KEY>"
      password: "<CONFLUENT_CLOUD_API_SECRET>"
    static_configs:
      - targets:
          - "api.telemetry.confluent.cloud"
    params:
      "resource.kafka.id":
        - "<CLUSTER_ID>"

# For AWS MSK: use the CloudWatch exporter
# https://github.com/prometheus/cloudwatch_exporter
scrape_configs:
  - job_name: "aws-msk"
    static_configs:
      - targets:
          - "cloudwatch-exporter:9106"
Summary: Kafka Monitoring Checklist
- JMX Exporter configured on every broker, port 9404
- Kafka Lag Exporter or equivalent for consumer group metrics
- Prometheus with a 15s scrape interval for brokers, 30s for the lag exporter
- Critical alerts: broker down, offline partitions, no active controller
- Warning alerts: under-replicated partitions, high consumer lag, heap >85%
- Grafana dashboards imported (ID 7589 for overview, ID 13282 for lag)
- Prometheus retention of at least 30 days
- Alerts tested, with a minimum duration (for: 5m) to avoid false positives
Next Steps in the Series
You have configured monitoring. The next step is knowing what to do when problems occur:
- Article 10 – Dead Letter Queue and Error Handling: patterns for handling messages that fail processing, including DLQs, exponential backoff retries and poison pill detection.
- Article 11 – Kafka in Production: a complete operational guide covering cluster sizing, optimal retention and replication factor configuration, and MirrorMaker 2 for disaster recovery.
Link with Other Series
- Apache Kafka Fundamentals (Article 1): the concept of consumer lag requires understanding the difference between LEO, HW and committed offset, explained in the first article of the series.
- Event-Driven Architecture: Monitoring consumer lag is also critical in EDA systems with AWS SQS and SNS, where the equivalent metric is queue depth.