Retries use exponential backoff

ab-002450 · sending-pipeline-infrastructure.retry-error-handling.exponential-backoff

Severity: high · active

Why it matters

Fixed-interval or immediate retries during an ESP outage or rate-limit window cause every retry attempt to fire within seconds. The ESP rejects each one, the job exhausts its retry budget, and the send fails permanently, even though the ESP would have recovered in minutes had the worker waited. CWE-770 covers allocation of resources without limits or throttling; CWE-400 covers uncontrolled resource consumption. Exponential backoff is the standard mitigation: it gives transient faults time to resolve without burning the retry budget, and ESP developer guidelines commonly recommend it for handling 429 and 5xx responses.
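The difference in retry windows can be worked out directly. The following is a minimal sketch, not part of any queue library: retryDelays and retryWindowMs are hypothetical helper names, and the exponential branch assumes the doubling schedule (base × 2^(retry − 1)) used elsewhere in this check.

```javascript
// Delay before each retry, in milliseconds, for a job with `attempts` total tries.
// Hypothetical helper for illustration only.
function retryDelays(type, baseDelayMs, attempts) {
  const delays = []
  // A job with N total attempts has N - 1 retries.
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(type === 'exponential' ? baseDelayMs * 2 ** (retry - 1) : baseDelayMs)
  }
  return delays
}

// Total time between the first failure and the final attempt.
function retryWindowMs(type, baseDelayMs, attempts) {
  return retryDelays(type, baseDelayMs, attempts).reduce((sum, d) => sum + d, 0)
}

console.log(retryDelays('fixed', 1000, 5))         // [1000, 1000, 1000, 1000]
console.log(retryWindowMs('fixed', 1000, 5))       // 4000
console.log(retryDelays('exponential', 2000, 5))   // [2000, 4000, 8000, 16000]
console.log(retryWindowMs('exponential', 2000, 5)) // 30000
```

With a 2-second base and 5 attempts the exponential schedule spans about 30 seconds; covering an outage that lasts several minutes needs more attempts or a larger base delay (for example, 8 attempts at the same base span 254 seconds).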

Severity rationale

High because fixed or immediate retries exhaust the retry budget during short outages that exponential backoff would survive, causing preventable permanent delivery failures.

Remediation

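Set backoff: { type: 'exponential', delay: 2000 } on BullMQ job options. The worker's backoffStrategy is used only for jobs whose backoff type is not one of the built-in ones, so to enforce a maximum delay, add jobs with backoff: { type: 'custom' } and define the capped strategy in the worker settings:

import { Worker } from 'bullmq'

// Built-in exponential backoff: 5 attempts means 4 retries, delayed 2s, 4s, 8s, 16s
await emailQueue.add('send', jobData, {
  attempts: 5,
  backoff: { type: 'exponential', delay: 2000 }
})

// Optional: custom strategy with a 10-minute cap.
// Applies only to jobs added with backoff: { type: 'custom' }.
const worker = new Worker('email', processEmail, {
  connection,
  settings: {
    backoffStrategy: (attemptsMade) =>
      Math.min(2000 * Math.pow(2, attemptsMade - 1), 600_000)
  }
})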

For rate-limit (429) responses, add explicit detection in the error handler and override the delay to at least the Retry-After value returned by the ESP.
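One way to honor Retry-After is to compute the backoff delay as usual and then take the larger of the two values. The sketch below is illustrative only: parseRetryAfterMs and nextRetryDelayMs are hypothetical names, and it assumes the header value has already been extracted from the ESP's 429 response. Retry-After may be either a number of seconds or an HTTP date, so both forms are handled.

```javascript
// Parse a Retry-After header value into milliseconds.
// The value is either a whole number of seconds or an HTTP date.
// Missing or unparseable values yield 0, so the backoff delay wins.
function parseRetryAfterMs(headerValue, nowMs = Date.now()) {
  if (headerValue == null) return 0
  const value = String(headerValue).trim()
  if (/^\d+$/.test(value)) return Number(value) * 1000
  const dateMs = Date.parse(value)
  return Number.isNaN(dateMs) ? 0 : Math.max(dateMs - nowMs, 0)
}

// Delay before the next retry: exponential backoff capped at 10 minutes,
// but never shorter than what the ESP asked for via Retry-After.
function nextRetryDelayMs(attemptsMade, retryAfterHeader, nowMs = Date.now()) {
  const backoffMs = Math.min(2000 * 2 ** (attemptsMade - 1), 600_000)
  return Math.max(backoffMs, parseRetryAfterMs(retryAfterHeader, nowMs))
}

console.log(nextRetryDelayMs(1, '30'))      // 30000: Retry-After exceeds the 2s backoff
console.log(nextRetryDelayMs(3, '1'))       // 8000: backoff exceeds Retry-After
console.log(nextRetryDelayMs(2, undefined)) // 4000: no header, plain backoff
```

In a BullMQ setup, a function like this could be called from a custom backoffStrategy, which receives the error thrown by the processor as a later argument; how the header is attached to that error (for example, as a property on a custom error class) is left as an assumption here.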

Detection

ID: exponential-backoff
Severity: high
What to look for: Enumerate all retry configurations for failed jobs. Check whether failed sends are retried immediately or at a fixed interval, or whether the delay between retries increases exponentially. Count the queue configurations that use exponential backoff versus those with fixed or no backoff. Immediate retries during an ESP outage or rate-limit window add load while the ESP is already rejecting requests and can exhaust the retry budget quickly. Look for backoff: { type: 'exponential', delay: ... } in BullMQ job options, or equivalent configurations in other queue libraries.
Pass criteria: Failed send jobs use exponential backoff between retry attempts. Each retry delay is at least double the previous one (e.g., 2s, 4s, 8s, 16s) until the cap is reached. A maximum delay cap of no more than 600 seconds (10 minutes) is set. Before evaluating, quote the exact backoff configuration from the queue setup code. Do NOT pass when backoff type is 'fixed' or when no maximum delay cap exists.
Fail criteria: Retries are immediate (no delay), fixed-interval (same delay each retry), or linear. An ESP outage causes all retry attempts to fire within minutes and exhaust the retry budget.
Skip (N/A) when: The application has no retry mechanism because it uses exactly-once delivery at the infrastructure level — confirmed by the queue configuration.
Cross-reference: The Operational Resilience (Email) Audit's failure-recovery category verifies backpressure mechanisms — this check verifies the individual job retry behavior that feeds into that backpressure.
Detail on fail: "BullMQ configured with backoff: { type: 'fixed', delay: 1000 } — all 5 retries fire within 5 seconds, exhausting budget during brief outages" or "No backoff configured — failed jobs retried immediately 3 times in rapid succession"
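The pass and fail criteria above can be expressed as a small check over an observed retry schedule. This is a sketch under stated assumptions rather than the audit's actual tooling: evaluateBackoff is a hypothetical name, delaysMs is the delay before each retry in order, and capMs is the configured maximum delay (undefined when none is set).

```javascript
// Evaluate a retry schedule against the pass criteria above.
function evaluateBackoff(delaysMs, capMs) {
  const MAX_CAP_MS = 600_000 // 600 seconds, per the pass criteria
  if (delaysMs.length === 0 || delaysMs.some((d) => d <= 0)) {
    return { pass: false, reason: 'retries are immediate or not configured' }
  }
  for (let i = 1; i < delaysMs.length; i++) {
    // A delay sitting exactly at the cap is allowed to stop doubling.
    const atCap = capMs !== undefined && delaysMs[i] === capMs
    if (delaysMs[i] < delaysMs[i - 1] * 2 && !atCap) {
      return { pass: false, reason: 'delay does not at least double between retries' }
    }
  }
  if (capMs === undefined) return { pass: false, reason: 'no maximum delay cap' }
  if (capMs > MAX_CAP_MS) return { pass: false, reason: 'cap exceeds 600 seconds' }
  return { pass: true, reason: 'exponential backoff with cap' }
}

console.log(evaluateBackoff([2000, 4000, 8000, 16000], 600_000).pass) // true
console.log(evaluateBackoff([1000, 1000, 1000, 1000], undefined).reason)
// 'delay does not at least double between retries'
```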

Remediation: Configure exponential backoff with a cap:

await emailQueue.add('send', jobData, {
  attempts: 5,
  backoff: {
    type: 'exponential',
    delay: 2000 // with attempts: 5 the four retries wait 2s, 4s, 8s, 16s (2^(attemptsMade - 1) * delay)
  }
})

// Or with a custom backoff function capped at 10 minutes
await emailQueue.add('send', jobData, {
  attempts: 5,
  backoff: {
    type: 'custom'
  }
})

// In worker options:
const worker = new Worker('email', processEmail, {
  connection,
  settings: {
    backoffStrategy: (attemptsMade) => {
      const delay = Math.min(2000 * Math.pow(2, attemptsMade - 1), 600_000)
      return delay
    }
  }
})
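To see where the 10-minute cap takes effect, the capped formula can be evaluated on its own. This is a standalone sketch: cappedDelayMs is a hypothetical name for the same expression used in the backoffStrategy, and no BullMQ calls are involved.

```javascript
// Same formula as the custom strategy: 2s base, doubling, capped at 10 minutes.
const cappedDelayMs = (attemptsMade) =>
  Math.min(2000 * Math.pow(2, attemptsMade - 1), 600_000)

for (let attemptsMade = 1; attemptsMade <= 11; attemptsMade++) {
  console.log(attemptsMade, cappedDelayMs(attemptsMade))
}
// Delays double from 2000 ms up to 512000 ms at attemptsMade = 9,
// then stay at 600000 ms from attemptsMade = 10 onward.
```

With attempts: 5 the largest delay is 16 seconds, so the cap is never reached; it acts as a safeguard when the attempt count or base delay is raised later.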

External references

cwe · CWE-400 — Uncontrolled Resource Consumption
cwe · CWE-770 — Allocation of Resources Without Limits or Throttling
iso-25010:2011 · reliability.fault-tolerance — Fault Tolerance

Taxons

error-resilience cost-efficiency

History

2026-04-18·v1.0.0·Initial import from sending-pipeline-infrastructure·automated