The Problem: What Happens to Messages That Fail

Asynchronous messaging systems like SQS, SNS, Kafka, and EventBridge use at-least-once delivery: a message is delivered at least once, but it may be delivered multiple times (duplicates on retry). This creates two critical scenarios:

  1. Transient errors: the downstream service is temporarily unavailable. Automatic retry solves the problem: after a few attempts the message is processed correctly.
  2. Permanent errors (poison pills): the message is malformed, contains data that violates business invariants, or the consumer's code has a bug. Retry doesn't help: the message will keep failing indefinitely, consuming resources and potentially blocking the processing of subsequent messages.

The DLQ solves the second scenario: after a configurable number of failed attempts (maxReceiveCount in SQS; an application- or framework-level retry limit in Kafka, which has no native DLQ), the message is moved to the Dead Letter Queue, where it can be analyzed and reprocessed in a controlled manner.
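
The retry-then-DLQ flow described above can be sketched as a small local simulation. All names here are illustrative, not a broker API; real brokers track the receive count server-side.

```python
# Minimal local model of the retry-then-DLQ flow: the message is
# redelivered until the receive count reaches the limit, then parked.

def deliver(message, handler, max_receive_count=3):
    """Deliver a message at-least-once; returns (outcome, attempts)."""
    receive_count = 0
    while receive_count < max_receive_count:
        receive_count += 1
        try:
            handler(message)
            return ("processed", receive_count)
        except Exception:
            # Handler failed: the message becomes visible again
            continue
    # Receive count exhausted: park the message for analysis
    return ("dlq", receive_count)

def poison_handler(message):
    raise ValueError("malformed payload")  # permanent error

print(deliver({"body": "not-json"}, poison_handler))  # ('dlq', 3)
```

A handler that succeeds returns `('processed', 1)`; only the poison pill exhausts its attempts and is parked.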

DLQ: the Resilience Contract

  • Zero data loss: No messages are silently discarded
  • Problem isolation: Poison pills don't block good messages
  • Visibility: Messages in DLQ are inspectable for debugging
  • Controlled reprocessing: once the problem is fixed, the messages can be reprocessed

DLQ in Amazon SQS

In SQS, the DLQ is simply another SQS queue configured as the destination for messages that exceed maxReceiveCount. The mechanism is based on the visibility timeout: when a consumer receives a message, it becomes "invisible" to other consumers for the duration of the visibility timeout. If it is not deleted within that time (because the consumer failed or crashed), SQS makes it visible again for another attempt.

The ApproximateReceiveCount counter is incremented on each receive. When it exceeds maxReceiveCount, SQS moves the message to the configured DLQ.
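
A consumer can inspect this counter on each receive, for example to log extra context on the final attempt. A sketch assuming the message shape returned by boto3's receive_message, where system attributes arrive as strings:

```python
# Decide whether a received message is on its last attempt before
# SQS moves it to the DLQ. Assumes the boto3 receive_message shape,
# in which system attributes are returned as strings.

def is_last_attempt(message: dict, max_receive_count: int) -> bool:
    count = int(message["Attributes"]["ApproximateReceiveCount"])
    return count >= max_receive_count

msg = {"Body": "{}", "Attributes": {"ApproximateReceiveCount": "3"}}
print(is_last_attempt(msg, 3))  # True
```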

# DLQ configuration for SQS with Terraform

# 1. Create the DLQ (same type as the main queue)
resource "aws_sqs_queue" "ordini_dlq" {
  name                       = "ordini-queue-dlq"
  message_retention_seconds  = 1209600  # 14 days (SQS maximum)
  visibility_timeout_seconds = 300      # 5 minutes to process from the DLQ

  # CloudWatch alarm on the DLQ
  tags = {
    Environment = "production"
    Alert       = "critical"
  }
}

# 2. Create the main queue with a redrive policy pointing to the DLQ
resource "aws_sqs_queue" "ordini" {
  name                       = "ordini-queue"
  visibility_timeout_seconds = 60       # 60s to process each message
  receive_wait_time_seconds  = 20       # long polling
  message_retention_seconds  = 345600   # 4 days

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.ordini_dlq.arn
    maxReceiveCount     = 3  # 3 failed attempts -> DLQ
  })
}

# 3. CloudWatch alarm: alert when the DLQ has messages
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "ordini-dlq-not-empty"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = "60"
  statistic           = "Sum"
  threshold           = "0"
  alarm_description   = "CRITICAL: messages in the ordini DLQ"

  dimensions = {
    QueueName = aws_sqs_queue.ordini_dlq.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Inspecting and Reprocessing the SQS DLQ

// SqsDlqReprocessor.java - Reprocess messages from the SQS DLQ
import java.util.List;

import software.amazon.awssdk.services.sqs.*;
import software.amazon.awssdk.services.sqs.model.*;

public class SqsDlqReprocessor {

    private final SqsClient sqsClient;
    private final String dlqUrl;
    private final String mainQueueUrl;

    // Reprocess all messages from the DLQ back to the main queue
    public void reprocessAll() {
        int reprocessed = 0;
        List<Message> messages;

        do {
            messages = receiveMessages(dlqUrl, 10);

            for (Message message : messages) {
                try {
                    // Inspect the message to understand the error type
                    System.out.printf("Reprocessing message: id=%s, receiveCount=%s%n",
                        message.messageId(),
                        message.attributes().get(MessageSystemAttributeName.APPROXIMATE_RECEIVE_COUNT)
                    );

                    // Re-enqueue on the main queue (with optional delay)
                    sqsClient.sendMessage(
                        SendMessageRequest.builder()
                            .queueUrl(mainQueueUrl)
                            .messageBody(message.body())
                            .messageAttributes(message.messageAttributes())
                            .delaySeconds(0)
                            .build()
                    );

                    // Delete from the DLQ
                    sqsClient.deleteMessage(
                        DeleteMessageRequest.builder()
                            .queueUrl(dlqUrl)
                            .receiptHandle(message.receiptHandle())
                            .build()
                    );

                    reprocessed++;

                } catch (Exception e) {
                    System.err.println("Reprocessing error: " + e.getMessage());
                    // Do not delete: the message stays in the DLQ
                }
            }

        } while (!messages.isEmpty());

        System.out.printf("Reprocessing complete: %d messages re-enqueued%n", reprocessed);
    }

    private List<Message> receiveMessages(String queueUrl, int maxMessages) {
        return sqsClient.receiveMessage(
            ReceiveMessageRequest.builder()
                .queueUrl(queueUrl)
                .maxNumberOfMessages(maxMessages)
                .waitTimeSeconds(5)
                .attributeNames(QueueAttributeName.ALL)
                .build()
        ).messages();
    }
}

DLQ in AWS Lambda: Function-Level vs Queue-Level

In AWS Lambda, a DLQ can be configured at two different levels with distinct semantics:

  • SQS queue DLQ: configured on the source SQS queue. Messages are moved to the DLQ when they exceed maxReceiveCount; SQS does this itself, independently of Lambda. This is the recommended configuration for Lambda + SQS.
  • Lambda function DLQ: configured on the function itself (only for asynchronous invocations, not for event source mappings with SQS). It captures failures of the Lambda invocation, not of the queue.

# Terraform: Lambda with an SQS event source and the DLQ configured on the queue

resource "aws_lambda_function" "ordini_consumer" {
  function_name = "ordini-consumer"
  handler       = "handler.lambda_handler"
  runtime       = "python3.12"
  role          = aws_iam_role.lambda_role.arn
  timeout       = 30  # 30 seconds per message

  # Function-level DLQ (only for direct async invocations)
  dead_letter_config {
    target_arn = aws_sqs_queue.lambda_dlq.arn
  }
}

# SQS as an event source for Lambda
resource "aws_lambda_event_source_mapping" "sqs_trigger" {
  event_source_arn = aws_sqs_queue.ordini.arn
  function_name    = aws_lambda_function.ordini_consumer.arn
  batch_size       = 10
  enabled          = true

  # Bisection: on a batch error, retry with half the messages first
  # Helps isolate the poison pill without sending the whole batch to the DLQ
  bisect_batch_on_function_error = true

  # Report batch item failures: the Lambda can indicate which specific
  # messages in the batch failed (only those are retried / dead-lettered)
  function_response_types = ["ReportBatchItemFailures"]
}
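
The effect of bisect_batch_on_function_error can be illustrated with a small local model. This is our own sketch of the idea, not Lambda's actual algorithm: a failing batch is split in half and each half is retried until the poison pill is isolated.

```python
# Local model of batch bisection: a failing batch of size 1 is the
# isolated poison pill; larger failing batches are split and retried.

def bisect_process(batch, process_one):
    """Returns the items that still fail when processed alone."""
    try:
        for item in batch:
            process_one(item)
        return []
    except Exception:
        if len(batch) == 1:
            return list(batch)  # isolated poison pill
        mid = len(batch) // 2
        return (bisect_process(batch[:mid], process_one)
                + bisect_process(batch[mid:], process_one))

def handle(item):
    if item == "poison":
        raise ValueError("cannot process")

print(bisect_process(["a", "b", "poison", "c"], handle))  # ['poison']
```

Note that "a" and "b" are processed again while the batch is bisected: at-least-once semantics leak through here too, which is another reason consumers must be idempotent.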

ReportBatchItemFailures: Granular DLQ for Batches

# Python Lambda handler with batch item failures:
# only the failed messages are retried (and eventually dead-lettered),
# not the whole batch
import json

# Placeholder business exceptions (assumed to be raised by process_ordine)
class PermanentError(Exception): pass
class TransientError(Exception): pass

def lambda_handler(event, context):
    """
    ReportBatchItemFailures: return only the failed message IDs.
    SQS will redeliver only those, not the entire batch; repeated
    failures eventually send them to the DLQ via maxReceiveCount.
    """
    failed_items = []

    for record in event['Records']:
        message_id = record['messageId']
        try:
            # Process the message
            payload = json.loads(record['body'])
            process_ordine(payload)
            print(f"Success: {message_id}")

        except PermanentError as e:
            # Permanent error: report it as failed; retries will exhaust
            # maxReceiveCount and the message will land in the DLQ
            print(f"PERMANENT: {message_id} - {e}")
            failed_items.append({'itemIdentifier': message_id})

        except TransientError as e:
            # Transient error: also report it as failed so SQS retries it.
            # With ReportBatchItemFailures only the reported messages
            # are redelivered; the rest of the batch is deleted.
            print(f"TRANSIENT: {message_id} - {e}")
            failed_items.append({'itemIdentifier': message_id})

    return {'batchItemFailures': failed_items}

DLQ in EventBridge

EventBridge supports a DLQ at the target level: if delivery of an event to a target (Lambda, SQS) fails after all configured retries, the event is sent to the SQS DLQ specified in the target's dead_letter_config.

# EventBridge DLQ for a Lambda target
resource "aws_cloudwatch_event_target" "ordini_lambda" {
  rule           = aws_cloudwatch_event_rule.ordini.name
  event_bus_name = aws_cloudwatch_event_bus.mioapp.name
  arn            = aws_lambda_function.ordini_consumer.arn

  # EventBridge retry policy: number of attempts before the DLQ
  retry_policy {
    maximum_event_age_in_seconds = 86400  # Retry for at most 24h
    maximum_retry_attempts       = 185    # ~exponential backoff over 24h
  }
  }

  # DLQ for undelivered events
  dead_letter_config {
    arn = aws_sqs_queue.eventbridge_dlq.arn
  }
}

# An event in the EventBridge DLQ includes debug metadata (illustrative shape):
# {
#   "version": "1.0",
#   "timestamp": "...",
#   "requestId": "...",
#   "condition": "...",
#   "approximateInvokeCount": 185,
#   "requestParameters": {
#     "FunctionName": "ordini-consumer"
#   },
#   "responseParameters": {
#     "statusCode": 500,
#     "errorCode": "Lambda.ServiceException"
#   },
#   "originalEvent": { ... the original event ... }
# }
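
A reprocessing job can triage these DLQ records by error code before redriving them. A sketch based on the illustrative shape above; the set of retryable error codes is an example, not an exhaustive list:

```python
import json

# Triage an EventBridge DLQ record: redrive only retryable delivery
# failures, park everything else for manual analysis.
RETRYABLE_ERRORS = {
    "Lambda.ServiceException",
    "Lambda.TooManyRequestsException",
}

def should_redrive(dlq_record_body: str) -> bool:
    record = json.loads(dlq_record_body)
    error_code = record.get("responseParameters", {}).get("errorCode")
    return error_code in RETRYABLE_ERRORS

body = json.dumps({
    "responseParameters": {"statusCode": 500,
                           "errorCode": "Lambda.ServiceException"},
    "originalEvent": {"orderId": 42},
})
print(should_redrive(body))  # True
```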

Advanced Pattern: Retry with Progressive Backoff using SQS Delay

SQS lets you set a per-message delay of up to 15 minutes on standard queues (FIFO queues only support a queue-level delay). By re-enqueueing the failed message with an increasing delay and an attempt counter, it is possible to implement an exponential retry pattern without blocking other messages:

// SqsExponentialRetry.java - Progressive retry with SQS message delay
public class SqsExponentialRetry {

    private static final int MAX_ATTEMPTS = 5;
    private static final int MAX_DELAY_SECONDS = 900;  // 15 minutes (SQS max)

    public void handleWithRetry(String queueUrl, Message sqsMessage) {
        // Read the current attempt number from the message attribute
        int currentAttempt = Integer.parseInt(
            sqsMessage.messageAttributes()
                .getOrDefault("retryAttempt",
                    MessageAttributeValue.builder().stringValue("0").build())
                .stringValue()
        );

        try {
            processMessage(sqsMessage.body());
            // Success: delete from the queue
            sqsClient.deleteMessage(DeleteMessageRequest.builder()
                .queueUrl(queueUrl)
                .receiptHandle(sqsMessage.receiptHandle())
                .build());

        } catch (TransientException e) {
            if (currentAttempt >= MAX_ATTEMPTS) {
                // Too many attempts: route to a manual DLQ
                sendToManualDLQ(sqsMessage, e);
                sqsClient.deleteMessage(DeleteMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .receiptHandle(sqsMessage.receiptHandle())
                    .build());
                return;
            }

            // Compute the exponential delay (1s, 2s, 4s, 8s, 16s...)
            int delaySeconds = (int) Math.min(
                Math.pow(2, currentAttempt),
                MAX_DELAY_SECONDS
            );

            // Re-enqueue the message with the delay and an incremented counter
            sqsClient.sendMessage(
                SendMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .messageBody(sqsMessage.body())
                    .delaySeconds(delaySeconds)
                    .messageAttributes(Map.of(
                        "retryAttempt", MessageAttributeValue.builder()
                            .stringValue(String.valueOf(currentAttempt + 1))
                            .dataType("Number")
                            .build()
                    ))
                    .build()
            );

            // Delete the original message (do not rely on the automatic DLQ)
            sqsClient.deleteMessage(DeleteMessageRequest.builder()
                .queueUrl(queueUrl)
                .receiptHandle(sqsMessage.receiptHandle())
                .build());
        }
    }
}
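
The delay schedule computed by the Java sketch above is shown below, together with the commonly recommended jitter refinement (the jitter is our addition, not part of the original code):

```python
import random

# Exponential delay as in the Java example: 1s, 2s, 4s, ... capped
# at the 900s SQS per-message maximum.
def retry_delay(attempt: int, cap: int = 900) -> int:
    return min(2 ** attempt, cap)

# Full jitter (a common refinement): pick a random delay in
# [0, retry_delay] to avoid synchronized retry storms.
def retry_delay_jittered(attempt: int, cap: int = 900) -> int:
    return random.randint(0, retry_delay(attempt, cap))

print([retry_delay(a) for a in range(5)])  # [1, 2, 4, 8, 16]
```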

DLQ Monitoring: Essential Alerts

The DLQ must be actively monitored: a message in the DLQ indicates a real problem that requires attention. The key SQS metrics to watch in CloudWatch:

# CloudWatch metric alarms for the DLQ - AWS CLI

# Alert: any message in the DLQ (threshold 0)
aws cloudwatch put-metric-alarm \
  --alarm-name "ordini-dlq-not-empty" \
  --alarm-description "CRITICAL: messages in the ordini DLQ" \
  --metric-name "ApproximateNumberOfMessagesVisible" \
  --namespace "AWS/SQS" \
  --dimensions Name=QueueName,Value=ordini-queue-dlq \
  --period 60 \
  --evaluation-periods 1 \
  --statistic Sum \
  --comparison-operator GreaterThanThreshold \
  --threshold 0 \
  --alarm-actions "arn:aws:sns:eu-west-1:123456:alerts-topic"

# Key DLQ metrics to monitor:
# - ApproximateNumberOfMessagesVisible: messages ready to be read
# - ApproximateNumberOfMessagesNotVisible: messages being processed
# - NumberOfMessagesSent: arrival rate into the DLQ (growth = problem)
# - NumberOfMessagesDeleted: reprocessing rate
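
The alarm above evaluates GreaterThanThreshold over one 60-second period. A simplified local model of that evaluation (our sketch, not the CloudWatch engine):

```python
# Simplified model of CloudWatch alarm evaluation for the alarm
# above: GreaterThanThreshold, threshold 0, one evaluation period.
def alarm_state(datapoints, threshold=0, evaluation_periods=1):
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(v > threshold for v in recent) else "OK"

# ApproximateNumberOfMessagesVisible samples on the DLQ, per minute:
print(alarm_state([0, 0, 3]))  # ALARM -> a message landed in the DLQ
```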

Best Practices for DLQ in Asynchronous Systems

  • DLQ mandatory for every consumer: there is no production-grade asynchronous system without a DLQ. Without one, failed messages are either lost or block the flow.
  • Monitor the DLQ with zero-threshold alerts: any message in the DLQ is a sign of a problem. Don't wait for hundreds to accumulate before reacting.
  • Enrich DLQ messages with metadata: add headers or attributes with the error type, stack trace, number of retries, and failure timestamp. Without this data, debugging is nearly impossible.
  • Long retention on the DLQ: configure 14-day retention (the SQS maximum), or at least 30 days on Kafka. Messages in the DLQ must remain available for deferred analysis.
  • Test reprocessing regularly: a DLQ is useless if you don't know how to reprocess its messages. Document and test the reprocessing procedure.
  • Separate error types: consider separate DLQs for permanent errors (corrupt payloads) and transient errors (downstream services down). The reprocessing strategy for each is different.
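
The "enrich with metadata" practice can be sketched as a helper that builds the send_message arguments for a manual DLQ. The attribute names here are our own convention, not an AWS standard:

```python
import time

# Build boto3 send_message kwargs for a manual DLQ, attaching the
# debugging metadata recommended above (error type, message, retry
# count, failure timestamp).
def build_dlq_message(queue_url: str, body: str,
                      error: Exception, attempt: int) -> dict:
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageAttributes": {
            "errorType": {"DataType": "String",
                          "StringValue": type(error).__name__},
            "errorMessage": {"DataType": "String",
                             "StringValue": str(error)[:256] or "unknown"},
            "retryAttempt": {"DataType": "Number",
                             "StringValue": str(attempt)},
            "failedAtEpoch": {"DataType": "Number",
                              "StringValue": str(int(time.time()))},
        },
    }

msg = build_dlq_message(
    "https://sqs.eu-west-1.amazonaws.com/123456789012/ordini-queue-dlq",
    '{"id": 1}', ValueError("invalid total"), 3)
print(msg["MessageAttributes"]["errorType"]["StringValue"])  # ValueError
```

The returned dict can be passed directly as `sqs.send_message(**msg)` with boto3.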

Anti-Pattern: Ignore the DLQ

The most dangerous pattern is to configure the DLQ and then not monitor it. Messages accumulate silently for weeks, and then someone notices that critical data is missing. Always set up an alert on the DLQ: it is a safety net that you should never ignore.

Next Steps in the Series

  • Article 9 – Idempotency in Consumers: retries and reprocessing from the DLQ can produce duplicate messages. The idempotency key pattern is the main defense.
  • Article 10 – Outbox Pattern: guarantee that event publication always happens, even if the producer crashes, using an outbox table in the database.

Link with Other Series

  • SQS vs SNS vs EventBridge (Article 7): each AWS service has its own DLQ and retry semantics. That article covers the configuration differences between the services.
  • Kafka Dead Letter Queue (Article 10 of the Kafka series): the DLQ pattern in Kafka, with consumer groups, retry topics, and reprocessing, has the same conceptual foundations as SQS but a different implementation.