Dead letter queue for permanently failed sends

ab-002457 · sending-pipeline-infrastructure.queue-architecture.dead-letter-queue

Severity: high · active

Why it matters

When a send job exhausts its retry budget and disappears silently, every permanently failed send becomes invisible. Customer support cannot answer "did we send that?", engineers cannot distinguish a flaky network blip from a systemic ESP misconfiguration, and the business has no path to replay failed communications. CWE-391 (Unchecked Error Condition) covers errors that occur but are never acted on; a missing DLQ is the queue-layer equivalent. ISO-25010:2011 reliability.recoverability requires that a system can recover from faults; a DLQ is the prerequisite for that recovery.

Severity rationale

High because, without a DLQ, permanently failed sends can be neither audited nor replayed. Not critical because the user impact is a delayed or missing message rather than immediate data loss.

Remediation

Set removeOnFail: false on the BullMQ queue configuration so failed jobs move to the built-in failed set instead of being deleted:

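import { Queue } from 'bullmq'

// `connection` is assumed to be the Redis connection options defined elsewhere
export const emailQueue = new Queue('email', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnComplete: { age: 86400, count: 1000 }, // age is in seconds (1 day)
    removeOnFail: false // keep failed jobs in the failed set
  }
})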

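For SQS, configure a redrive policy pointing to a DLQ with maxReceiveCount: 5. Retain at least 30 days of failed job history to satisfy audit and replay requirements. SQS caps message retention at 14 days, so on SQS the 30-day target requires archiving DLQ messages to durable storage before they expire.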

Detection

ID: dead-letter-queue
Severity: high
What to look for: Check the queue configuration for dead letter queue (DLQ) setup. After a job exhausts all retry attempts, it should move to a dedicated DLQ rather than being silently discarded. Look for removeOnFail: false in BullMQ job options, failedJobsHistoryLength configuration, or equivalent DLQ routing in RabbitMQ (dead letter exchange) or SQS (redrive policy).
Pass criteria: Failed jobs that exhaust retry attempts are moved to a DLQ, failed queue, or dedicated error log that allows inspection and replay. The DLQ retains at least 30 days of failed job history. Failed jobs can be replayed manually. Count the number of DLQ retention and inspection mechanisms present.
Fail criteria: Any of the following: jobs are silently discarded after max retries with no persistent record; there is no way to inspect which sends failed without reading application logs; or removeOnFail: true is set, which deletes failed jobs.
Skip (N/A) when: The application has no queue (synchronous sends only) — confirmed by the absence of queue libraries in package.json.
Cross-reference: The Operational Resilience (Email) Audit's monitoring-alerting category verifies that the DLQ is monitored — this check verifies that the DLQ itself exists and retains data.
Detail on fail: "BullMQ configured with removeOnFail: true; failed jobs are deleted after max retries with no recoverable record" or "No DLQ configured for SQS queue; failed messages are redelivered until the message retention period expires, then dropped"
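
The pass and fail criteria above can be approximated in code. The following is a minimal sketch, not part of the audit tooling: the function name checkRemoveOnFail, the Verdict type, and the extra "review" outcome are hypothetical, and the option shape (boolean, a count, or an object with age in seconds and count) is an assumption about BullMQ's removeOnFail setting. It only inspects the default option and does not account for per-job overrides.

```typescript
// Assumed shape of BullMQ's removeOnFail option: a boolean, a number of jobs
// to keep, or an object with `age` (seconds) and/or `count`.
type KeepJobs = { age?: number; count?: number }
type RemoveOnFail = boolean | number | KeepJobs | undefined

// Hypothetical result type; "review" means a human should confirm retention.
type Verdict = { result: 'pass' | 'fail' | 'review'; detail: string }

const THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60 // 2,592,000

function checkRemoveOnFail(value: RemoveOnFail): Verdict {
  if (value === true) {
    return { result: 'fail', detail: 'removeOnFail: true deletes failed jobs after max retries' }
  }
  if (value === false || value === undefined) {
    return { result: 'pass', detail: 'failed jobs are kept in the failed set' }
  }
  if (typeof value === 'number') {
    // A count-only cap says nothing about how long jobs survive.
    return { result: 'review', detail: `keeps the last ${value} failed jobs; 30-day retention not guaranteed` }
  }
  if (value.age === undefined) {
    return { result: 'review', detail: 'no age limit set; retention depends on count only' }
  }
  if (value.age < THIRTY_DAYS_SECONDS) {
    return { result: 'fail', detail: `failed jobs expire after ${value.age}s, under the 30-day minimum` }
  }
  if (value.count !== undefined) {
    // A count cap can evict jobs before they reach the age limit.
    return { result: 'review', detail: `age is sufficient, but count cap ${value.count} may evict jobs early` }
  }
  return { result: 'pass', detail: 'failed jobs are retained for at least 30 days' }
}
```

For example, checkRemoveOnFail({ age: 86400 }) reports a fail because one day is below the 30-day minimum, while checkRemoveOnFail(false) reports a pass.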

Remediation: Configure BullMQ to retain failed jobs and optionally route them to a separate queue:

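// Retain failed jobs indefinitely. To cap retention at 30 days / 10,000 entries
// instead, use removeOnFail: { age: 2592000, count: 10000 } (age is in seconds)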
export const emailQueue = new Queue('email', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnComplete: { age: 86400, count: 1000 },
    removeOnFail: false // retain failed jobs
  }
})

// Inspect failed jobs
const failed = await emailQueue.getFailed()

// Retry a specific failed job by ID
const job = await emailQueue.getJob(jobId)
await job?.retry()
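
Replaying more than one job usually means looping over the failed set. The sketch below is one possible shape for that loop, assuming only the two calls shown above (getFailed and retry). The names replayFailed, FailedQueueLike, and FailedJobLike are hypothetical; the interfaces are written so that a BullMQ Queue and Job should satisfy them structurally, which also lets the function be exercised without Redis.

```typescript
// Minimal structural types covering only the calls this sketch uses.
// Assumption: a BullMQ Queue and Job satisfy these shapes.
interface FailedJobLike {
  id?: string
  retry(): Promise<void>
}
interface FailedQueueLike {
  getFailed(start?: number, end?: number): Promise<FailedJobLike[]>
}

interface ReplayReport {
  retried: string[]
  errors: { id: string; reason: string }[]
}

// Retries up to `limit` failed jobs and reports which IDs were re-queued
// and which could not be retried.
async function replayFailed(queue: FailedQueueLike, limit = 100): Promise<ReplayReport> {
  const report: ReplayReport = { retried: [], errors: [] }
  if (limit < 1) return report

  // The end index is treated as inclusive, so limit - 1 yields `limit` jobs.
  const jobs = await queue.getFailed(0, limit - 1)
  for (const job of jobs) {
    const id = job.id ?? '(no id)'
    try {
      await job.retry()
      report.retried.push(id)
    } catch (err) {
      // Keep going: one job that cannot be retried should not block the rest.
      report.errors.push({ id, reason: err instanceof Error ? err.message : String(err) })
    }
  }
  return report
}
```

With the queue defined above, this might be called as await replayFailed(emailQueue, 50). Replay should only run once the underlying cause is fixed; otherwise the same jobs will exhaust their attempts and land in the failed set again.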

For SQS, configure a redrive policy pointing to a DLQ:

{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:email-dlq",
    "maxReceiveCount": 5
  }
}
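
When the policy is applied through the SQS API rather than written by hand, the RedrivePolicy attribute is passed as a JSON-encoded string, not a nested object. The helpers below are a sketch of how the attribute maps could be built; buildRedriveAttributes and buildDlqAttributes are hypothetical names, and the resulting objects are meant to be passed as the Attributes of a queue create or set-attributes call. Because SQS retains messages for at most 14 days (1,209,600 seconds), the DLQ helper sets that maximum, and the 30-day target still needs a separate archive.

```typescript
// SQS maximum for MessageRetentionPeriod: 14 days, in seconds.
const SQS_MAX_RETENTION_SECONDS = 14 * 24 * 60 * 60 // 1,209,600

// Attributes for the SOURCE queue: route messages to the DLQ after
// `maxReceiveCount` failed receives.
function buildRedriveAttributes(dlqArn: string, maxReceiveCount = 5): { RedrivePolicy: string } {
  if (!Number.isInteger(maxReceiveCount) || maxReceiveCount < 1) {
    throw new RangeError('maxReceiveCount must be a positive integer')
  }
  return {
    // SQS expects the policy serialized as a JSON string.
    RedrivePolicy: JSON.stringify({ deadLetterTargetArn: dlqArn, maxReceiveCount }),
  }
}

// Attributes for the DLQ itself: keep dead-lettered messages as long as SQS allows.
function buildDlqAttributes(): { MessageRetentionPeriod: string } {
  // Attribute values are strings in the SQS API.
  return { MessageRetentionPeriod: String(SQS_MAX_RETENTION_SECONDS) }
}
```

The ARN passed to buildRedriveAttributes is whatever the DLQ reports as its QueueArn attribute; the example ARN in this section is a placeholder.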

External references

cwe · CWE-391 — Unchecked Error Condition
iso-25010:2011 · reliability.recoverability — Recoverability

Taxons

error-resilience observability

History

2026-04-18·v1.0.0·Initial import from sending-pipeline-infrastructure·automated