Retry limit set with dead letter escalation

ab-002452 · sending-pipeline-infrastructure.retry-error-handling.max-retry-limit

Severity: high · active

Why it matters

A retry limit that is unbounded, or missing entirely, allows a job to cycle indefinitely, consuming worker capacity and queue resources without ever resolving. CWE-770 covers allocation of resources without limits or throttling; an infinite retry loop is a resource exhaustion vector against the queue infrastructure itself. CWE-391 covers unchecked error conditions, where a failure is ignored or never acted on; a job that fails silently after an unbounded retry chain leaves no trace for operators. ISO-25010:2011 reliability.recoverability requires that failed states are detectable and recoverable; a failure that is never recorded is neither.

Severity rationale

High because an unbounded retry limit can hold worker concurrency slots indefinitely and provides no path for operators to inspect or replay permanently failed sends.

Remediation

Set an explicit attempts limit between 3 and 10 on the BullMQ queue, keep failed jobs, and fire an alert when the limit is reached:

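import { Queue } from 'bullmq'

// Assumes `connection`, `worker`, `logger`, and `alertOps` are defined elsewhere.
export const emailQueue = new Queue('email', {
  connection,
  defaultJobOptions: {
    attempts: 5, // total attempts, including the first run
    backoff: { type: 'exponential', delay: 2000 },
    removeOnFail: false // keep exhausted jobs in the failed set
  }
})

// 'failed' fires on each failed attempt; alert only once retries are exhausted.
worker.on('failed', async (job, err) => {
  // Fall back to 1: a job with no attempts option is not retried.
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    logger.error({ jobId: job.id, data: job.data }, 'Job exhausted retries')
    await alertOps(`Email job ${job.id} permanently failed: ${err.message}`)
  }
})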

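Never set attempts: Infinity, and never omit the attempts field. Infinity retries forever; an omitted field falls back to the library default (in BullMQ, a single attempt with no retries), which leaves the limit implicit and unaudited.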
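To make the effect of the bound concrete, the sketch below lists the wait before each retry for the configuration above (attempts: 5, exponential backoff, delay: 2000). It assumes BullMQ's documented built-in exponential formula, 2^(attemptsMade - 1) * delay. The helper name exponentialRetryDelays is hypothetical and is not part of BullMQ.

```typescript
// Hypothetical helper, not a BullMQ API: computes the delay before each retry
// for { type: 'exponential', delay: baseDelayMs } and a fixed attempts limit.
// Assumes the built-in formula 2^(attemptsMade - 1) * delay.
function exponentialRetryDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // A job with N total attempts is retried at most N - 1 times.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs));
  }
  return delays;
}

const retryDelays = exponentialRetryDelays(5, 2000);
const totalWaitMs = retryDelays.reduce((sum, d) => sum + d, 0);

console.log(retryDelays); // [ 2000, 4000, 8000, 16000 ]
console.log(totalWaitMs); // 30000
```

With five attempts the job spends about 30 seconds in backoff before landing in the failed set. Because the delay doubles on each retry, the same base delay with attempts: 10 gives roughly 17 minutes of backoff, which is one reason to keep the limit at the low end of the range for time-sensitive sends.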

Detection

ID: max-retry-limit
Severity: high
What to look for: Check the maximum retry attempt count for failed jobs. Verify that a reasonable upper bound of at least 3 and no more than 10 attempts is set. After the limit is reached, the job must be moved to a dead letter queue or error log rather than being silently discarded. Count all queue configurations and verify each has an explicit attempts value. Check for attempts: Infinity or no attempt limit, which causes a permanent retry loop.
Pass criteria: A maximum retry limit is set between 3 and 10 attempts. Jobs that exhaust retries are moved to a DLQ or persistent failed state with removeOnFail: false. An alert or notification is triggered when jobs reach the DLQ. Every queue configuration has an explicit attempts limit.
Fail criteria: No retry limit is set, or the limit is effectively unbounded (e.g., attempts: 999). Failed jobs are silently discarded after exhausting retries with no DLQ escalation. No alert fires when jobs permanently fail. Must not pass when removeOnFail: true is set.
Skip (N/A) when: Never — all retry systems must have a finite upper bound.
Detail on fail: "emailQueue configured with no attempts limit — transient errors could cause jobs to retry indefinitely" or "Max attempts set to 3, but failed jobs are removed with removeOnFail: true — no DLQ inspection possible"
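The pass and fail criteria above can be sketched as a small check over each queue's configuration. This is an illustrative sketch only: the QueueRetryConfig shape, the alertsOnExhaustion flag (which a reviewer would set after inspecting the worker code), and the checkRetryLimit function are all hypothetical names, not part of BullMQ or of any audit tool.

```typescript
// Hypothetical input shape: only the fields this check reads.
interface QueueRetryConfig {
  attempts?: number;           // from defaultJobOptions.attempts
  removeOnFail?: boolean;      // from defaultJobOptions.removeOnFail
  alertsOnExhaustion: boolean; // set by the reviewer after reading the worker code
}

interface Verdict {
  pass: boolean;
  detail: string;
}

// Applies the criteria in order and reports the first violation found.
function checkRetryLimit(name: string, cfg: QueueRetryConfig): Verdict {
  const fail = (detail: string): Verdict => ({ pass: false, detail: `${name}: ${detail}` });

  if (cfg.attempts === undefined) {
    return fail("no explicit attempts limit");
  }
  if (!Number.isInteger(cfg.attempts) || cfg.attempts < 3 || cfg.attempts > 10) {
    return fail(`attempts ${cfg.attempts} is outside the 3 to 10 range`);
  }
  if (cfg.removeOnFail !== false) {
    return fail("failed jobs are not kept (removeOnFail must be false)");
  }
  if (!cfg.alertsOnExhaustion) {
    return fail("no alert fires when retries are exhausted");
  }
  return { pass: true, detail: `${name}: ok` };
}

console.log(checkRetryLimit("emailQueue", { attempts: 5, removeOnFail: false, alertsOnExhaustion: true }));
console.log(checkRetryLimit("smsQueue", { attempts: Infinity, removeOnFail: false, alertsOnExhaustion: true }));
```

Running the check once per queue configuration covers the requirement that every queue, not just the first one found, carries an explicit limit.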

Remediation: Set a bounded retry limit and ensure failed jobs persist:

import { Queue } from 'bullmq'

// Queue default options
// Assumes `connection`, `worker`, `logger`, and `alertOps` are defined elsewhere.
export const emailQueue = new Queue('email', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnFail: false // Keep failed jobs in the failed set
  }
})

// Worker: listen for exhausted jobs and alert
worker.on('failed', async (job, err) => {
  // Fall back to 1: a job with no attempts option is not retried.
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    logger.error(
      { jobId: job.id, data: job.data, err: err.message },
      'Email job exhausted retries; kept in failed set'
    )
    // Notify ops (PagerDuty, Slack, etc.)
    await alertOps(`Email job ${job.id} permanently failed: ${err.message}`)
  }
})

External references

cwe · CWE-770 — Allocation of Resources Without Limits or Throttling
cwe · CWE-391 — Unchecked Error Condition
iso-25010:2011 · reliability.recoverability — Recoverability

Taxons

error-resilience

History

2026-04-18·v1.0.0·Initial import from sending-pipeline-infrastructure·automated