The Problem with ZooKeeper

For nearly ten years, every Apache Kafka cluster required a separate Apache ZooKeeper ensemble to manage metadata: which brokers were active, which broker led which partition, topic configurations, and ACLs. ZooKeeper is a robust and reliable distributed coordination system, but it introduced several significant operational problems:

  • Double operational complexity: Each team managing Kafka also had to manage a separate ZooKeeper cluster (typically 3 or 5 nodes), with its own monitoring, upgrade cycle, and distinct configuration.
  • Limited metadata scalability: ZooKeeper showed performance degradation beyond ~200,000 partitions per cluster, because each partition's metadata was written as separate ZooKeeper nodes.
  • Slow controller election: when the Kafka controller broker failed, the new controller had to read the entire cluster state from ZooKeeper before it could operate, a process that could take tens of seconds on large clusters.
  • Difficult disaster recovery: recovering a Kafka cluster after data loss in ZooKeeper was a complex and risky manual process.
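To make the first point concrete, here is a minimal sketch of what a ZooKeeper-mode broker configuration looked like (hostnames illustrative): the zookeeper.* settings below implied an entire second distributed system to deploy, monitor, and upgrade.

```properties
# Illustrative pre-KRaft broker config (Kafka 3.x and earlier, ZooKeeper mode)
broker.id=1
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=18000
log.dirs=/var/lib/kafka/data
```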

KRaft Timeline

  • KIP-500 (2020): Original proposal to remove ZooKeeper from Kafka
  • Kafka 2.8 (April 2021): first version with KRaft in early access (for testing only)
  • Kafka 3.3 (October 2022): KRaft declared production-ready for new clusters
  • Kafka 3.5 (June 2023): ZooKeeper to KRaft migration tool available
  • Kafka 3.7 (March 2024): ZooKeeper mode deprecated
  • Kafka 4.0 (March 2025): ZooKeeper mode permanently removed

How KRaft Works: The Raft Consensus Log

The Concept of Metadata Log

The solution adopted in KRaft (Kafka Raft) is elegant: instead of depending on an external system for metadata, Kafka manages its metadata as an internal Kafka topic called __cluster_metadata. This topic is replicated among the controller nodes via a Raft-based protocol.

In KRaft, cluster nodes take on one of two roles (or both, in small clusters):

  • Controllers: manage cluster metadata. In production, a quorum of 3 controllers is recommended. The active controller (the Raft leader) processes all metadata changes and replicates them to the other controllers.
  • Brokers: manage the partition logs and serve producers and consumers. Brokers keep a cached copy of the metadata received from the controllers, updated continuously.

The Raft Protocol in Kafka

Raft is a distributed consensus algorithm designed to be understandable (unlike Paxos). In short: among the quorum nodes, one is elected leader. The leader receives all writes and propagates them to the followers; once a majority of nodes has acknowledged a write, the leader considers it committed.

In KRaft, this translates as follows:

  1. A metadata operation (create topic, assign partition leader, etc.) reaches the active controller
  2. The active controller writes the operation to the metadata log as a serialized event
  3. Follower controllers replicate the event by fetching from the leader (reusing Kafka's existing Fetch protocol)
  4. Once a majority of controllers (the quorum) has acknowledged it, the operation is committed
  5. Brokers learn about committed metadata changes by fetching the metadata log from the active controller
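The commit rule in step 4 is just a majority calculation over the voters' log-end offsets. A toy shell sketch (the offsets are invented numbers, not output from a real cluster):

```shell
#!/bin/sh
# Hypothetical log-end offsets reported by a 3-node controller quorum
offsets="156789 156700 156789"

# The committed offset is the highest offset present on a majority of voters:
# sort ascending and take the median (the 2nd of 3)
committed=$(printf '%s\n' $offsets | sort -n | sed -n '2p')
echo "committed offset: $committed"   # prints: committed offset: 156789
```

With 5 voters, the same rule would take the 3rd of the 5 sorted offsets.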
# Structure of a KRaft data directory (combined broker+controller)
# /var/lib/kafka/data/

/var/lib/kafka/data/
  meta.properties          # cluster.id, node.id, version
  __cluster_metadata-0/    # the metadata log (partition 0)
    00000000000000000000.log
    00000000000000000000.index
    00000000000000000000.timeindex
    leader-epoch-checkpoint
  ordini-effettuati-0/     # log of a regular partition
  ordini-effettuati-1/
  ...

# Example meta.properties:
node.id=1
version=1
cluster.id=MkU3OEVBNTcwNTJENDM2Qk

Quorum Controller: Sizing

The controller quorum follows the standard consensus rule: to tolerate f failures, 2f+1 nodes are required.

  • 3 controllers: tolerates 1 failure (minimum configuration for production)
  • 5 controllers: tolerates 2 simultaneous failures (recommended for critical clusters)
  • 1 controller: For local development/testing only, no fault tolerance
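The 2f+1 rule above is easy to sanity-check: for n voters, the number of tolerated failures is floor((n-1)/2). A throwaway shell check:

```shell
#!/bin/sh
# Failure tolerance of a Raft quorum: f = floor((n - 1) / 2)
for n in 1 3 5; do
  echo "voters=$n tolerates f=$(( (n - 1) / 2 )) failure(s)"
done
```

Note that 4 voters tolerate the same single failure as 3, which is why even-sized quorums are not used.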

Controllers can be dedicated (controller role only, managing no user partitions) or combined (the same machines also act as brokers). For small clusters (< 10 brokers), combined controllers are fine. For large or high-throughput clusters, dedicated controllers isolate the metadata management load from the partition I/O load.

Configuring a KRaft Cluster from Scratch

# server.properties for a combined controller+broker node (single-node dev cluster)

# ─── Identity ─────────────────────────────────────────────────────────────────
# In KRaft every node has a node.id unique within the cluster (replaces broker.id)
node.id=1

# Roles: "broker" | "controller" | "broker,controller"
process.roles=broker,controller

# Controller quorum address: format node.id@host:port
controller.quorum.voters=1@localhost:9093

# ─── Listeners ────────────────────────────────────────────────────────────────
# KAFKA: listener for producers/consumers
# CONTROLLER: listener for internal KRaft communication
listeners=KAFKA://localhost:9092,CONTROLLER://localhost:9093
advertised.listeners=KAFKA://localhost:9092

listener.security.protocol.map=KAFKA:PLAINTEXT,CONTROLLER:PLAINTEXT
inter.broker.listener.name=KAFKA
controller.listener.names=CONTROLLER

# ─── Storage ──────────────────────────────────────────────────────────────────
log.dirs=/var/lib/kafka/data

# ─── Replication defaults ─────────────────────────────────────────────────────
default.replication.factor=1        # 1 for dev, 3 for production
min.insync.replicas=1               # 1 for dev, 2 for production
offsets.topic.replication.factor=1

# ─── Retention ────────────────────────────────────────────────────────────────
log.retention.hours=168             # 7 days
log.segment.bytes=1073741824        # 1GB per segment
# Initialize the KRaft cluster (one-time operation)
# Step 1: generate a unique cluster UUID
KAFKA_CLUSTER_ID=$(kafka-storage.sh random-uuid)
echo "Cluster ID: $KAFKA_CLUSTER_ID"

# Step 2: format the storage directory with the cluster ID
kafka-storage.sh format \
  --config /etc/kafka/server.properties \
  --cluster-id "$KAFKA_CLUSTER_ID"

# Output:
# Formatting /var/lib/kafka/data with metadata.version 4.0-IV3.

# Step 3: start the broker
kafka-server-start.sh /etc/kafka/server.properties

Important: The Cluster ID is Immutable

The cluster.id generated during formatting is written to the meta.properties file of each node and into the metadata log. It cannot be changed after initialization. If you lose this file and want to re-add a node to the existing cluster, you must use the appropriate bootstrap procedure. Store the cluster ID in a secrets management system.

Docker Compose: KRaft Cluster for Local Development

# docker-compose.yml for a Kafka 4.0 KRaft cluster (3 brokers)
# Image: apache/kafka:4.0.0 (the official Apache image, not Confluent's)

version: "3.9"

services:
  kafka1:
    image: apache/kafka:4.0.0
    container_name: kafka1
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: "broker,controller"
      KAFKA_LISTENERS: "PLAINTEXT://kafka1:9092,CONTROLLER://kafka1:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://kafka1:9092"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka1:9093,2@kafka2:9093,3@kafka3:9093"
      KAFKA_INTER_BROKER_LISTENER_NAME: "PLAINTEXT"
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qk"
    volumes:
      - kafka1-data:/var/lib/kafka/data

  kafka2:
    image: apache/kafka:4.0.0
    container_name: kafka2
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: "broker,controller"
      KAFKA_LISTENERS: "PLAINTEXT://kafka2:9092,CONTROLLER://kafka2:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://kafka2:9092"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka1:9093,2@kafka2:9093,3@kafka3:9093"
      KAFKA_INTER_BROKER_LISTENER_NAME: "PLAINTEXT"
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qk"
    volumes:
      - kafka2-data:/var/lib/kafka/data

  kafka3:
    image: apache/kafka:4.0.0
    container_name: kafka3
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: "broker,controller"
      KAFKA_LISTENERS: "PLAINTEXT://kafka3:9092,CONTROLLER://kafka3:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://kafka3:9092"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka1:9093,2@kafka2:9093,3@kafka3:9093"
      KAFKA_INTER_BROKER_LISTENER_NAME: "PLAINTEXT"
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qk"
    volumes:
      - kafka3-data:/var/lib/kafka/data

volumes:
  kafka1-data:
  kafka2-data:
  kafka3-data:

Migration from Kafka 3.x with ZooKeeper to KRaft

If you run a Kafka 3.x cluster in ZooKeeper mode and need to move to KRaft (required for Kafka 4.0), the process is called KRaft migration and has been officially supported since version 3.5. The good news: the migration involves no downtime for producers and consumers.

Phases of Migration

The official process is divided into 6 phases:

  1. Check prerequisites: upgrade to Kafka 3.7 (the last version with ZooKeeper+KRaft dual-write support) and verify that all brokers have an aligned metadata.version.
  2. Deploy the KRaft controllers: start the KRaft controller nodes (3 new nodes, or existing brokers with the additional role). The controllers obtain the initial metadata from ZooKeeper via the migration tool.
  3. Dual-write mode: brokers write metadata to both ZooKeeper and the KRaft metadata log. During this phase the system is fully operational.
  4. Migration complete: all brokers have migrated, and ZooKeeper becomes read-only for Kafka. Producers and consumers perceive no interruption.
  5. Finalize ZooKeeper removal: run the finalizer that cleans Kafka metadata out of ZooKeeper.
  6. Shut down ZooKeeper: decommission the ZooKeeper ensemble. The cluster is now fully KRaft.
# Step 1: check the cluster's current metadata.version
# (run with Kafka 3.7)
kafka-features.sh --bootstrap-server kafka1:9092 describe

# Output:
# Feature: metadata.version
#   SupportedMinVersion: 3.0-IV1
#   SupportedMaxVersion: 3.7-IV4
#   FinalizedVersion: 3.7-IV4

# Step 2: start the KRaft controllers with the special migration config
# In the server.properties of the KRaft controllers:
process.roles=controller
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181   # still required during migration
controller.quorum.voters=10@kc1:9093,11@kc2:9093,12@kc3:9093

# Step 3: start the migration (run once the KRaft controllers are up)
# Edit the server.properties of EVERY existing Kafka broker
# and add these parameters:
zookeeper.metadata.migration.enable=true
controller.quorum.voters=10@kc1:9093,11@kc2:9093,12@kc3:9093

# Restart the brokers one at a time (rolling restart, zero downtime)
# The brokers enter migration mode automatically

# Step 4: monitor the migration status
kafka-metadata-shell.sh \
  --snapshot /var/lib/kafka/data/__cluster_metadata-0/00000000000000000000.snapshot

# Step 5: finalize (after all brokers have migrated)
kafka-features.sh --bootstrap-server kafka1:9092 upgrade \
  --metadata 3.7-IV4  # or the target version

# Step 6: remove zookeeper.connect from server.properties and restart the brokers

Important Notices for Migration

  • Rollback is not easy: once the KRaft migration is finalized and ZooKeeper is removed, rolling back is very complex. Rehearse the migration first in a staging environment identical to production.
  • ACLs and configurations: ACLs and dynamic configurations previously stored in ZooKeeper are migrated automatically into the metadata log, but verify that they are present after migration.
  • Kafka Connect connectors: connectors that use the Kafka cluster as a backend for state (group.id, offsets) keep working unchanged.
  • MirrorMaker 2: If you use MM2 for geo-replication, update remote clusters in the same maintenance window to avoid version incompatibilities.

KRaft with Advanced Configuration: Dedicated Controllers

For clusters with high throughput or a very large number of partitions (>50,000), it is advisable to separate controllers from brokers (dedicated controllers). This way, metadata operations (topic creation, leader election, config changes) do not compete with partition log I/O on the same disks.

# server.properties for a DEDICATED CONTROLLER (manages no user partitions)
node.id=10
process.roles=controller
controller.quorum.voters=10@kc1:9093,11@kc2:9093,12@kc3:9093
listeners=CONTROLLER://kc1:9093
listener.security.protocol.map=CONTROLLER:PLAINTEXT
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/metadata

# server.properties for a PURE BROKER (not a controller)
node.id=1
process.roles=broker
controller.quorum.voters=10@kc1:9093,11@kc2:9093,12@kc3:9093
listeners=KAFKA://kafka1:9092
advertised.listeners=KAFKA://kafka1:9092
listener.security.protocol.map=KAFKA:PLAINTEXT,CONTROLLER:PLAINTEXT
inter.broker.listener.name=KAFKA
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/data

# With this configuration:
# - 3 dedicated controller machines (lightweight: modest RAM and CPU)
# - N pure brokers (optimized for disk I/O)
# - No resource contention between metadata ops and partition I/O

In Confluent Cloud and in managed environments such as Amazon MSK (which has adopted KRaft since version 3.6), the controller/broker separation occurs automatically and is transparent to the user.

Operational Benefits of KRaft

Faster Startup and Recovery

With ZooKeeper, when the Kafka controller broker failed over, the new controller had to read the entire cluster state from ZooKeeper before it could operate. For clusters with 100,000+ partitions, this could mean 30-90 seconds of controller unavailability.

With KRaft, every controller in the quorum already keeps the metadata log in memory and on local disk. A controller failover typically takes less than 5 seconds, even on large clusters. A case study of a fintech company (Confluent Engineering Blog, 2025) documents a 40% reduction in startup time after migrating to KRaft.

Metadata Scalability

ZooKeeper imposed a practical limit of around 200,000 partitions per cluster, beyond which the performance of metadata operations degraded significantly. KRaft handles the metadata log like any other Kafka log, with compaction, and has been tested with millions of partitions per cluster.

Operational Simplicity

Removing ZooKeeper means:

  • One system to monitor instead of two
  • One upgrade cycle instead of two (often ZooKeeper and Kafka had complex version constraints)
  • Easier deployment on Kubernetes (one fewer StatefulSet, fewer PVCs)
  • Easier disaster recovery (cluster state is in the metadata log, not distributed between Kafka and ZooKeeper)

KRaft on Kubernetes with Strimzi

Strimzi is the most popular Kubernetes operator for managing Kafka. Since version 0.38, Strimzi supports KRaft natively:

# KRaft-mode Kafka cluster with the Strimzi Operator (Kubernetes)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  namespace: kafka
  annotations:
    # Enable KRaft mode (requires Strimzi 0.38+)
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 4.0.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      # KRaft-specific
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      # Retention
      log.retention.hours: 168
      log.segment.bytes: 1073741824
    storage:
      type: persistent-claim
      size: 100Gi
      class: fast-ssd
    # Separate controllers (production: dedicated controllers)
    # Omit this section for combined controllers (default)
  # entityOperator manages topics and users via CRDs
  entityOperator:
    topicOperator: {}
    userOperator: {}
---
# Create a topic with a Strimzi CRD (instead of kafka-topics.sh)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: ordini-effettuati
  namespace: kafka
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: "604800000"
    min.insync.replicas: "2"
    compression.type: snappy

Checking the Status of the KRaft Cluster

# Check which node is the current controller leader
kafka-metadata-quorum.sh \
  --bootstrap-server kafka1:9092 \
  describe --status

# Output:
# ClusterId:              MkU3OEVBNTcwNTJENDM2Qk
# LeaderId:               1
# LeaderEpoch:            42
# HighWatermark:          156789
# MaxFollowerLag:         0
# MaxFollowerLagTimeMs:   12
# CurrentVoters:          [{"nodeId":1,"logEndOffset":156789,"lag":0},
#                          {"nodeId":2,"logEndOffset":156789,"lag":0},
#                          {"nodeId":3,"logEndOffset":156789,"lag":0}]
# CurrentObservers:       []

# Inspect the quorum replication details
kafka-metadata-quorum.sh \
  --bootstrap-server kafka1:9092 \
  describe --replication

# Read the metadata log (for debugging)
kafka-dump-log.sh \
  --files /var/lib/kafka/data/__cluster_metadata-0/00000000000000000000.log \
  --cluster-metadata-decoder

Configuration Differences: ZooKeeper vs KRaft

For those coming from a ZooKeeper cluster, here are the main configuration differences to know:

Configuration         ZooKeeper mode             KRaft mode
Cluster connection    zookeeper.connect          controller.quorum.voters
Node ID               broker.id                  node.id
Roles                 always broker              process.roles
Controller listener   N/A                        controller.listener.names
Initialization        automatic (ZK handles it)  kafka-storage.sh format
ACL storage           ZooKeeper znodes           metadata log
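Applied to a minimal broker configuration, the mapping in the table looks roughly like this (hostnames and node IDs are illustrative):

```properties
# Before: ZooKeeper mode (Kafka 3.x)
broker.id=1
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181

# After: KRaft mode (Kafka 4.0)
node.id=1
process.roles=broker
controller.quorum.voters=10@kc1:9093,11@kc2:9093,12@kc3:9093
controller.listener.names=CONTROLLER
```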

Metadata Version and Feature Flags in KRaft

With KRaft, Kafka introduces the concept of metadata.version: the version of the metadata format used in the cluster. This enables rolling upgrades of a cluster without downtime, one node at a time. The metadata version is bumped only when all brokers in the cluster support the new version.

# Check the current metadata.version and the supported versions
kafka-features.sh \
  --bootstrap-server kafka1:9092 \
  describe

# Typical output with Kafka 4.0:
# Feature: metadata.version
#   SupportedMinVersion: 3.0-IV1
#   SupportedMaxVersion: 4.0-IV3
#   FinalizedVersion: 4.0-IV3

# List all available feature flags
kafka-features.sh \
  --bootstrap-server kafka1:9092 \
  describe --all

# Bump the metadata.version after a cluster upgrade
# (run AFTER all brokers have been upgraded to the new version)
kafka-features.sh \
  --bootstrap-server kafka1:9092 \
  upgrade --metadata 4.0-IV3

Version 4.0-IV3 (Kafka 4.0, Incremental Version 3) is the latest available in the Kafka 4.0 release (March 2025). Each version bump enables new features and protocol optimizations.

Troubleshooting KRaft: Common Problems

The Cluster Does Not Start: “No voters found in quorum”

This error indicates that controller nodes cannot find other quorum voters. Common causes:

  • misconfigured controller.quorum.voters: Verify that the format is correct (nodeId@hostname:port) and that hostnames are resolvable by all nodes.
  • CONTROLLER listener unreachable: Verify that the firewall allows communication on the controller listener port (default: 9093) between controller nodes.
  • Cluster ID mismatch: if you re-ran kafka-storage.sh format on one of the nodes without using the correct cluster ID, that node will not join the cluster.
# Check the cluster ID on every node
cat /var/lib/kafka/data/meta.properties
# node.id=1
# version=1
# cluster.id=MkU3OEVBNTcwNTJENDM2Qk  <-- must be identical on all nodes

# Check that a controller leader has been elected
kafka-metadata-quorum.sh \
  --bootstrap-server kafka1:9092 \
  describe --status | grep LeaderId

# If LeaderId=-1, no leader has been elected (quorum not reached)

# Check the broker logs for KRaft-related errors
grep -E "WARN|ERROR" /var/log/kafka/kafka.log | grep -i "kraft\|quorum\|controller"

Broker Not Added to the Cluster

When you add a new broker to an existing KRaft cluster, the broker must be formatted with the same cluster ID as the existing cluster:

# Retrieve the cluster ID from the existing cluster
CLUSTER_ID=$(kafka-metadata-quorum.sh \
  --bootstrap-server kafka1:9092 \
  describe --status | grep ClusterId | awk '{print $2}')

echo "Cluster ID: $CLUSTER_ID"

# Format the new broker with the same cluster ID
kafka-storage.sh format \
  --config /etc/kafka/server.properties \
  --cluster-id "$CLUSTER_ID"

# Start the new broker
kafka-server-start.sh /etc/kafka/server.properties

# Check that the new broker is visible in the cluster
kafka-broker-api-versions.sh \
  --bootstrap-server kafka1:9092 | grep "id:"

Next Steps in the Series

With KRaft covered, you are ready to tackle more advanced aspects of Kafka configuration:

  • Article 3 – Advanced Producer and Consumer: the detailed configuration of acks, idempotent producer, and retry strategies to ensure durability without duplicates.
  • Article 4 – Exactly-Once Semantics: Kafka transactions for atomic writes on multiple topics, with the new transaction coordinator implemented in the KRaft metadata log.
  • Article 11 – Kafka in Production: KRaft cluster sizing, configuration of controller replicas, disaster recovery and metadata log backup.

Link with Other Series

  • Advanced Kubernetes: deployment of Kafka on Kubernetes with Strimzi operator, persistent storage management and consumer group autoscaling.
  • Observability: monitoring the KRaft quorum with JMX Exporter, critical metrics such as kafka.controller:type=KafkaController,name=ActiveControllerCount, and alerts on leader election.