Why Kafka Monitoring is Critical

Apache Kafka is designed to be resilient, but resilience is not automatic: it requires active observability. An overloaded broker, a consumer group that cannot keep up with producers, or a partition with degraded replication can all turn into data loss or downtime if not detected early.

The three fundamental dimensions to monitor in Kafka are:

  • Cluster health: active brokers, active controller, under-replicated partitions, offline partitions
  • Producer performance: request rate, record error rate, request latency, batch size
  • Consumer performance: consumer lag per group and per partition, commit rate, rebalance frequency
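
Before building a full metrics pipeline, all three dimensions can be spot-checked with the CLI tools that ship with Kafka. The commands below are a sketch; the bootstrap address and group name are illustrative:

```shell
# Cluster health: partitions whose ISR is smaller than the replica set
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Cluster health: partitions that currently have no leader
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions

# Consumer performance: per-partition lag for a consumer group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group
```

These are fine for one-off checks; the rest of this article wires the same signals into Prometheus for continuous monitoring and alerting.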

The 5 Most Critical Kafka Metrics

  • kafka.consumer.lag: difference between LEO and committed offset — the #1 metric for system health
  • kafka.server.UnderReplicatedPartitions: partitions with incomplete ISR — degraded broker signal
  • kafka.server.OfflinePartitionsCount: leaderless partitions — critical error, data inaccessible
  • kafka.network.RequestMetrics.TotalTimeMs: total latency of requests
  • kafka.server.BrokerTopicMetrics.MessagesInPerSec: ingestion throughput

How Kafka Exposes Metrics: JMX

Kafka exposes all its internal metrics via JMX (Java Management Extensions), the standard framework for monitoring JVM applications. Each metric is identified by an MBean name in the format domain:type=Type,name=Metric.

To make JMX metrics accessible to Prometheus, the most widely used monitoring system in cloud-native environments, you use the JMX Exporter: a Java agent that exposes JMX metrics as an HTTP endpoint in the Prometheus text format (pull model).

Configure JMX Exporter as a Java Agent

The JMX Exporter runs as a JVM agent: you simply add the -javaagent parameter to the Kafka broker JVM. Download the JAR from the official Prometheus repository and create a YAML configuration file that specifies which MBeans to collect and how to map them to Prometheus metric names.

# Download JMX Exporter (latest stable version)
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.0.1/jmx_prometheus_javaagent-1.0.1.jar \
  -O /opt/kafka/libs/jmx_prometheus_javaagent.jar

# Configuration file: /opt/kafka/config/kafka-jmx-exporter.yml
# This file tells JMX Exporter which MBeans to collect
# kafka-jmx-exporter.yml
# Complete configuration for Kafka broker + consumer group lag
---
startDelaySeconds: 0
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true

rules:
  # Broker metrics: message throughput per topic
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=(.+)><>OneMinuteRate'
    name: kafka_server_brokertopicmetrics_messagesinpersec
    labels:
      topic: "$1"

  # Under-replicated partitions: CRITICAL
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_replicamanager_underreplicatedpartitions

  # Offline partitions: CRITICAL
  - pattern: 'kafka.controller<type=KafkaController, name=OfflinePartitionsCount><>Value'
    name: kafka_controller_kafkacontroller_offlinepartitionscount

  # Active controller count (must always be 1)
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_kafkacontroller_activecontrollercount

  # Request latency per request type
  - pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(\w+)><>(Mean|99thPercentile)'
    name: kafka_network_requestmetrics_totaltimems
    labels:
      request: "$1"
      quantile: "$2"

  # Ingress throughput in bytes (producer traffic)
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec><>OneMinuteRate'
    name: kafka_server_brokertopicmetrics_bytesinpersec

  # Log size per topic-partition
  - pattern: 'kafka.log<type=Log, name=Size, topic=(.+), partition=(\d+)><>Value'
    name: kafka_log_log_size
    labels:
      topic: "$1"
      partition: "$2"

  # JVM heap usage
  - pattern: 'java.lang<type=Memory><>HeapMemoryUsage'
    name: jvm_heap_memory_usage
    type: GAUGE

Enable JMX Exporter in the Broker

# Edit kafka-server-start.sh or set the environment variable
# Add to bin/kafka-server-start.sh or configure via systemd

# Option 1: KAFKA_OPTS environment variable
export KAFKA_OPTS="-javaagent:/opt/kafka/libs/jmx_prometheus_javaagent.jar=9404:/opt/kafka/config/kafka-jmx-exporter.yml"

# Option 2: in the systemd service file
# [Service]
# Environment="KAFKA_OPTS=-javaagent:/opt/kafka/libs/jmx_prometheus_javaagent.jar=9404:/opt/kafka/config/kafka-jmx-exporter.yml"

# Verify that the endpoint works after the restart
curl http://localhost:9404/metrics | grep kafka_server_replicamanager

Consumer Lag: The Most Important Metric

Consumer lag measures how many records the consumer still has to process: it is the difference between the Log End Offset (the offset of the last record written to the topic) and the Committed Offset (the last record processed by the consumer group). A growing lag indicates that consumers are not keeping up with the rate of production.
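
The arithmetic can be sketched in a few lines of Java. This is a toy illustration of the formula, not part of any Kafka API; the clamp at zero mirrors the defensive Math.max used later in this article, since committed offsets read moments apart from the LEO can transiently appear ahead of it:

```java
public class LagFormula {

    // lag = Log End Offset - Committed Offset, clamped at zero
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0, logEndOffset - committedOffset);
    }

    public static void main(String[] args) {
        // Producers have written up to offset 1500, the group committed 1200:
        // the consumer is 300 records behind on this partition.
        System.out.println(lag(1_500, 1_200)); // prints 300
        System.out.println(lag(1_000, 1_000)); // prints 0 (fully caught up)
    }
}
```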

Lag is not exposed by the brokers' JMX metrics: it must be computed by querying the consumer group coordinator. The standard tool for this is Kafka Lag Exporter (an open-source project originally from Lightbend) or an equivalent consumer-group exporter.

# Docker Compose for Kafka Lag Exporter
# https://github.com/seglo/kafka-lag-exporter

services:
  kafka-lag-exporter:
    image: seglo/kafka-lag-exporter:0.8.0
    ports:
      - "8000:8000"
    volumes:
      - ./lag-exporter.conf:/opt/docker/conf/application.conf
    environment:
      JAVA_OPTS: "-Xmx256m"

# lag-exporter.conf (HOCON format)
kafka-lag-exporter {
  port = 8000

  clusters = [
    {
      name = "production-cluster"
      bootstrap-brokers = "kafka1:9092,kafka2:9092,kafka3:9092"

      # Polling interval
      poll-interval = 30 seconds

      # Consumer groups to monitor (regex; empty = all)
      consumer-group-whitelist = [".*"]
    }
  ]

  # Prometheus metrics exposed:
  # kafka_consumer_group_latest_offset
  # kafka_consumer_group_partition_lag
  # kafka_consumer_group_sum_lag  <-- total lag per group
  # kafka_consumer_group_max_lag  <-- maximum lag across partitions
}

Calculate Lag with PromQL

# PromQL queries for Grafana / Prometheus

# Total lag per consumer group
sum(kafka_consumer_group_partition_lag) by (group)

# Top 10 consumer groups by lag
topk(10, sum(kafka_consumer_group_partition_lag) by (group))

# Lag growth rate (if positive, consumers are falling behind)
rate(kafka_consumer_group_sum_lag[5m])

# Consumer lag for a specific topic
sum(kafka_consumer_group_partition_lag{topic="ordini-effettuati"}) by (group)

# Alert: lag above the critical threshold for more than 5 minutes
# kafka_consumer_group_sum_lag{group="servizio-inventario"} > 10000

Configure Prometheus for Kafka

Prometheus uses a pull model: Prometheus itself collects metrics from the endpoints exposed by exporters, at a configurable interval (scrape_interval). The configuration specifies the scrape targets: the HTTP endpoints to query.

# prometheus.yml - scrape configuration for Kafka
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "kafka-alerts.yml"

scrape_configs:
  # JMX Exporter on every Kafka broker
  - job_name: "kafka-brokers"
    static_configs:
      - targets:
          - "kafka1:9404"
          - "kafka2:9404"
          - "kafka3:9404"
    relabel_configs:
      - source_labels: [__address__]
        target_label: broker
        regex: "([^:]+):.*"
        replacement: "$1"

  # Kafka Lag Exporter per consumer group metrics
  - job_name: "kafka-lag-exporter"
    static_configs:
      - targets:
          - "kafka-lag-exporter:8000"
    scrape_interval: 30s  # lag changes less frequently

  # Broker JVM metrics (if a separate jvm_exporter is enabled)
  - job_name: "kafka-jvm"
    static_configs:
      - targets:
          - "kafka1:9404"
          - "kafka2:9404"
          - "kafka3:9404"
    metrics_path: /metrics
    params:
      module: [jvm]

Alert Rules Kafka: The Essential Rules

Alert rules define the conditions that trigger a notification. In Prometheus they are expressed in PromQL and grouped into separate YAML files. Here are the essential rules for Kafka in production:

# kafka-alerts.yml - Alert rules for Prometheus Alertmanager
groups:
  - name: kafka.critical
    rules:
      # Broker down: no data from the JMX endpoint
      - alert: KafkaBrokerDown
        expr: up{job="kafka-brokers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker {{ $labels.broker }} unreachable"
          description: "Broker {{ $labels.broker }} has not responded for more than 1 minute."

      # Offline partitions: CRITICAL, data not accessible
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} offline partitions in Kafka"
          description: "Partitions without a leader: data cannot be read or written."

      # No active controller
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka has no active controller"

  - name: kafka.warning
    rules:
      # Under-replicated partitions: degraded replication
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} under-replicated partitions on {{ $labels.broker }}"
          description: "Degraded replication: a broker may be slow or down."

      # High consumer lag
      - alert: KafkaConsumerLagHigh
        expr: sum(kafka_consumer_group_partition_lag) by (group) > 50000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag for group {{ $labels.group }}"
          description: "Lag: {{ $value }}. Consumers are not keeping up with production."

      # Prolonged critical consumer lag
      - alert: KafkaConsumerLagCritical
        expr: sum(kafka_consumer_group_partition_lag) by (group) > 200000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL consumer lag for group {{ $labels.group }}"

      # High JVM heap usage
      - alert: KafkaBrokerHeapHigh
        expr: jvm_heap_memory_usage{area="used"} / jvm_heap_memory_usage{area="max"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High JVM heap on Kafka broker {{ $labels.broker }}"

Grafana Dashboard for Kafka

Grafana is the standard visualization tool for Prometheus. Ready-made dashboards can be imported directly from the Grafana Dashboard Hub (grafana.com/dashboards). The most widely used for Kafka are:

  • Dashboard ID 7589: Kafka Overview (broker health, throughput, partitions)
  • Dashboard ID 9021: Confluent Platform Kafka Metrics
  • Dashboard ID 13282: Kafka Consumer Lag (detailed consumer group lag)

To import a dashboard into Grafana: Dashboards → Import → enter ID → select Prometheus datasource.

Key Panels to Have in Your Dashboard

# Panel 1: Broker Health Overview
# Stat panel: up/down for each broker
up{job="kafka-brokers"}

# Panel 2: Under-Replicated Partitions
# Stat panel with color-coded threshold (green=0, red>0)
sum(kafka_server_replicamanager_underreplicatedpartitions)

# Panel 3: Consumer Lag per Group (time series)
sum(kafka_consumer_group_partition_lag) by (group)

# Panel 4: Messages In per second
# (the exported OneMinuteRate is already a rate: no rate() needed)
sum(kafka_server_brokertopicmetrics_messagesinpersec)

# Panel 5: Request Latency p99 (heatmap or time series)
kafka_network_requestmetrics_totaltimems{quantile="99thPercentile"}

# Panel 6: Disk Usage per Broker
# Useful for planning storage expansion
node_filesystem_size_bytes{mountpoint="/kafka-data"} -
node_filesystem_avail_bytes{mountpoint="/kafka-data"}

Docker Compose: Complete Monitoring Stack

For development or staging environments, this Docker Compose file launches the entire monitoring stack: JMX Exporter (embedded in the broker), Kafka Lag Exporter, Prometheus, and Grafana.

# docker-compose.monitoring.yml
version: "3.9"

services:
  kafka:
    image: apache/kafka:4.0.0
    container_name: kafka
    ports:
      - "9092:9092"
      - "9404:9404"   # JMX Exporter
    volumes:
      - ./config/kafka-jmx-exporter.yml:/opt/kafka-jmx-exporter.yml
      - ./libs/jmx_prometheus_javaagent.jar:/opt/jmx_prometheus_javaagent.jar
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: "broker,controller"
      KAFKA_LISTENERS: "PLAINTEXT://kafka:9092,CONTROLLER://kafka:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://localhost:9092"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka:9093"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      CLUSTER_ID: "monitoring-demo-cluster"
      KAFKA_OPTS: "-javaagent:/opt/jmx_prometheus_javaagent.jar=9404:/opt/kafka-jmx-exporter.yml"

  kafka-lag-exporter:
    image: seglo/kafka-lag-exporter:0.8.0
    container_name: kafka-lag-exporter
    ports:
      - "8000:8000"
    volumes:
      - ./config/lag-exporter.conf:/opt/docker/conf/application.conf
    depends_on:
      - kafka

  prometheus:
    image: prom/prometheus:v2.50.1
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/kafka-alerts.yml:/etc/prometheus/kafka-alerts.yml
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  grafana:
    image: grafana/grafana:10.3.1
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "kafka-monitoring"
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./config/grafana-datasources.yml:/etc/grafana/provisioning/datasources/prometheus.yml
      - ./config/grafana-dashboards.yml:/etc/grafana/provisioning/dashboards/kafka.yml

volumes:
  grafana-data:
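
The compose file mounts two Grafana provisioning files that were not shown above. A minimal sketch of both, assuming the service names from this compose file (adjust the Prometheus URL and dashboard path to your setup):

```yaml
# config/grafana-datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

---
# config/grafana-dashboards.yml
apiVersion: 1

providers:
  - name: kafka
    folder: Kafka
    type: file
    options:
      path: /var/lib/grafana/dashboards
```

With these in place, Grafana starts with Prometheus already configured and picks up any dashboard JSON dropped into the provisioned path, so no manual import is needed on a fresh container.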

Consumer Lag Monitoring with Java (Programmatic)

In some cases it is useful to measure consumer lag directly in application code, for example to expose metrics via Micrometer/Actuator in a Spring Boot application. This approach complements external monitoring and allows more precise correlation with application behavior.

// ConsumerLagMonitor.java - Measures lag programmatically with the Kafka AdminClient
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.*;

import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class ConsumerLagMonitor {

    private final AdminClient adminClient;
    private final String groupId;

    public ConsumerLagMonitor(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        this.adminClient = AdminClient.create(props);
        this.groupId = groupId;
    }

    public Map<TopicPartition, Long> getLagPerPartition() throws ExecutionException, InterruptedException {
        // 1. Fetch the consumer group's committed offsets
        ListConsumerGroupOffsetsResult offsetsResult =
            adminClient.listConsumerGroupOffsets(groupId);

        Map<TopicPartition, OffsetAndMetadata> committedOffsets =
            offsetsResult.partitionsToOffsetAndMetadata().get();

        // 2. Fetch the end offset (Log End Offset) for the same partitions
        Map<TopicPartition, OffsetSpec> latestOffsetRequest = committedOffsets.keySet().stream()
            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));

        ListOffsetsResult latestOffsets = adminClient.listOffsets(latestOffsetRequest);
        Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
            latestOffsets.all().get();

        // 3. Compute lag = endOffset - committedOffset
        Map<TopicPartition, Long> lag = new HashMap<>();
        for (Map.Entry<TopicPartition, OffsetAndMetadata> entry : committedOffsets.entrySet()) {
            TopicPartition tp = entry.getKey();
            long committedOffset = entry.getValue().offset();
            long endOffset = endOffsets.get(tp).offset();
            lag.put(tp, Math.max(0, endOffset - committedOffset));
        }

        return lag;
    }

    public long getTotalLag() throws ExecutionException, InterruptedException {
        return getLagPerPartition().values().stream()
            .mapToLong(Long::longValue)
            .sum();
    }

    public void close() {
        adminClient.close();
    }
}
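
One way to use the class above is a background poller that keeps the latest total lag in an AtomicLong, which a Micrometer Gauge or a health check can then read. This is a sketch under stated assumptions: the scheduling code is illustrative, kafka-clients must be on the classpath, and the broker address and group name are placeholders taken from the earlier examples.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class LagPoller {
    public static void main(String[] args) {
        // Placeholder bootstrap address and group id: adapt to your environment
        ConsumerLagMonitor monitor =
            new ConsumerLagMonitor("localhost:9092", "servizio-inventario");

        // Latest observed total lag, readable by a gauge or health check
        AtomicLong lastTotalLag = new AtomicLong(0);

        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                lastTotalLag.set(monitor.getTotalLag());
                System.out.println("total lag = " + lastTotalLag.get());
            } catch (Exception e) {
                // Swallow transient broker errors so the scheduler stays alive
                System.err.println("lag poll failed: " + e.getMessage());
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```

With Micrometer, lastTotalLag could back a registered Gauge so the value is published on the Actuator Prometheus endpoint alongside the application's other metrics.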

Best Practices for Monitoring Kafka

Anti-Pattern: Alert on Every Spike of Lag

Consumer lag fluctuates naturally: a momentary production peak or a consumer GC pause causes temporary lag that recovers quickly. Configuring alerts on absolute thresholds without a minimum duration (for: 10m) generates constant false positives. Always require a minimum duration (5-10 minutes) before triggering the alert.

  • Monitor the trend, not the absolute value: a decreasing lag of 100K is less concerning than a growing lag of 5K. Use rate() in PromQL to see the derivative.
  • Separate dashboards by audience: an operational dashboard with critical alerts for SREs, a capacity-planning dashboard for infrastructure teams, a business KPI dashboard for product managers.
  • Include JVM metrics: roughly 30% of Kafka performance problems stem from GC pressure on the broker. Always monitor jvm_gc_pause_seconds and heap usage.
  • Retain Prometheus data for at least 30 days: needed for trend analysis and monthly capacity planning. Consider Thanos or Mimir for long-term retention.
  • Test alerts with promtool and amtool: verify that the alert rules are syntactically correct and that they fire on the expected test data before going into production.
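
The validation step in the last bullet can be scripted. The file names match the earlier examples; the alertmanager.yml is assumed and not shown in this article (promtool ships with Prometheus, amtool with Alertmanager):

```shell
# Check that the alert rule file is syntactically valid
promtool check rules kafka-alerts.yml

# Check the full Prometheus configuration, including rule_files references
promtool check config prometheus.yml

# Check the Alertmanager routing configuration (file assumed, not shown here)
amtool check-config alertmanager.yml
```

Running these in CI on every change to the rule files catches syntax errors before a reload silently drops your alerts in production.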

Kafka in Cloud Managed: Confluent Cloud Metrics API

If you use Confluent Cloud or a managed Kafka service (AWS MSK, Aiven), JMX metrics are not directly accessible. In that case you use the cloud provider's Metrics API. Confluent Cloud exposes a Prometheus-compatible export endpoint that Prometheus can scrape directly:

# prometheus.yml for the Confluent Cloud Metrics API
scrape_configs:
  - job_name: "confluent-cloud"
    honor_timestamps: true
    metrics_path: "/v2/metrics/cloud/export"
    scheme: https
    basic_auth:
      username: "<CONFLUENT_CLOUD_API_KEY>"
      password: "<CONFLUENT_CLOUD_API_SECRET>"
    static_configs:
      - targets:
          - "api.telemetry.confluent.cloud"
    params:
      resource.kafka.id:
        - "<CLUSTER_ID>"

# For AWS MSK: use the CloudWatch exporter
# https://github.com/prometheus/cloudwatch_exporter
scrape_configs:
  - job_name: "aws-msk"
    static_configs:
      - targets:
          - "cloudwatch-exporter:9106"

Summary: Kafka Monitoring Checklist

  • JMX Exporter configured on all brokers, port 9404
  • Kafka Lag Exporter or equivalent for consumer group metrics
  • Prometheus with a 15s scrape interval for brokers, 30s for the lag exporter
  • Critical alerts: broker down, offline partitions, no active controller
  • Alert warning: under-replicated partitions, high consumer lag, heap >85%
  • Grafana dashboard imported (ID 7589 for overview, ID 13282 for lag)
  • Prometheus retention of at least 30 days
  • Alerts tested with minimum duration (for: 5m) to avoid false positives

Next Steps in the Series

You have configured monitoring. The next step is knowing what to do when problems occur:

  • Article 10 – Dead Letter Queue and Error Handling: patterns for managing messages that fail to be processed, including DLQ, exponential retry and poison pill detection.
  • Article 11 – Kafka in Production: a complete operational guide to cluster sizing, optimal retention and replication-factor configuration, and MirrorMaker 2 for disaster recovery.

Link with Other Series

  • Apache Kafka Fundamentals (Article 1): The concept of consumer lag requires understanding the difference between LEO, HW and committed offset explained in the first article of the series.
  • Event-Driven Architecture: Monitoring consumer lag is also critical in EDA systems with AWS SQS and SNS, where the equivalent metric is queue depth.