During an ESP outage, a queue worker without backpressure configuration becomes a thundering herd: all failed jobs retry immediately at full concurrency, exhausting the ESP's connection limit, burning through retry budget, and degrading IP reputation simultaneously. CWE-400 (uncontrolled resource consumption) and CWE-770 (allocation without limits) both apply. The damage compounds: by the time the ESP recovers, the IP may have accumulated enough connection-abuse signals to land in spam. Exponential backoff and a concurrency cap are not optional niceties — they are what prevents a 30-minute outage from becoming a 3-day reputation recovery.
High, because a worker with no backpressure configuration converts a temporary ESP outage into a thundering herd against the ESP that can do lasting damage to IP reputation.
Configure exponential backoff and a concurrency cap in your queue worker file (e.g., src/workers/email.worker.ts):
import { Queue, Worker, JobsOptions } from 'bullmq'

const worker = new Worker('email', processJob, {
  connection,
  concurrency: 10, // cap concurrent jobs well below the ESP connection limit
  limiter: { max: 100, duration: 60_000 } // at most 100 sends per minute
})

// Pass these when constructing the Queue so every job inherits them:
const defaultJobOptions: JobsOptions = {
  attempts: 5,
  backoff: { type: 'exponential', delay: 2000 } // retries after ~2s, 4s, 8s, 16s
}
const queue = new Queue('email', { connection, defaultJobOptions })
Linear retry and full-concurrency defaults both fail this check. Quote the exact backoff type and delay values when reporting — "backoff configured" without specifics is not sufficient.
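Why linear retry fails the check can be shown with a small sketch. The helper below is hypothetical and assumes BullMQ-style exponential backoff (base × 2^(retry − 1)); it only illustrates how retry pressure decays under each strategy.

```typescript
// Hypothetical helper: delays (ms) before each retry under a given strategy.
// Assumes BullMQ-style exponential backoff: base * 2^(retry - 1).
function retryDelaysMs(
  retries: number,
  baseMs: number,
  type: "exponential" | "fixed"
): number[] {
  return Array.from({ length: retries }, (_, i) =>
    type === "exponential" ? baseMs * 2 ** i : baseMs
  );
}

// With base 2000ms and 4 retries:
// exponential → [2000, 4000, 8000, 16000]  (retry rate halves each cycle)
// fixed       → [2000, 2000, 2000, 2000]   (constant hammering of the ESP)
```

Under exponential backoff the retry rate against the failing ESP halves each cycle; under linear (fixed) retry it never drops, which is exactly the thundering-herd behavior this check flags.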
ID: operational-resilience-email.failure-recovery.queue-backpressure-on-outage
Severity: high
What to look for: Enumerate all backpressure mechanisms in the queue worker's error handling for ESP failures: exponential backoff, concurrency limits, and pause/drain logic when a circuit breaker opens. Count how many of these three mechanisms are present. The failure mode to prevent is a queue worker that immediately retries at full concurrency against a non-responsive ESP, exhausting connections, rate limits, and IP reputation simultaneously.
Pass criteria: When ESP calls fail repeatedly, the worker applies exponential backoff with a maximum delay of no more than 600 seconds (10 minutes), caps concurrency at no more than 50 concurrent jobs, and bounds retries to at most 10 per job. Before evaluating, quote the exact backoff configuration (type and delay values) from the queue setup code.
Fail criteria: Worker retries at full concurrency with no backoff — all jobs retry immediately on each cycle, generating a thundering herd against the failing ESP. Or backoff is linear rather than exponential.
Skip (N/A) when: The project uses no async queue system (synchronous sends only) — confirmed by the absence of queue libraries in package.json.
Detail on fail: "BullMQ concurrency is set to 50 with no backoff — ESP outage would trigger 50 simultaneous retries per worker cycle" or "No exponential backoff configured on job retry — all failed jobs retry immediately at maximum concurrency"
Remediation: Configure exponential backoff in your queue worker file (e.g., src/workers/email.worker.ts):
import { Queue, Worker, JobsOptions } from 'bullmq'

const worker = new Worker('email', processJob, {
  connection,
  concurrency: 10, // cap concurrent jobs well below the ESP connection limit
  limiter: { max: 100, duration: 60_000 } // at most 100 sends per minute
})

// Pass these when constructing the Queue so every job inherits them:
const defaultJobOptions: JobsOptions = {
  attempts: 5,
  backoff: { type: 'exponential', delay: 2000 } // retries after ~2s, 4s, 8s, 16s
}
const queue = new Queue('email', { connection, defaultJobOptions })
Also disable the queue when the circuit breaker is open — do not enqueue new jobs while the ESP is confirmed down.
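A minimal sketch of that pause/drain wiring, assuming an event-emitting circuit breaker. The `Breaker` interface, event names, and `wireBreakerToWorker` helper are all illustrative; BullMQ's `Worker` does expose `pause()` and `resume()`, which the structural type below mirrors.

```typescript
// Structural types so the sketch stands alone; a real BullMQ Worker satisfies
// PausableWorker via its pause()/resume() methods.
interface PausableWorker {
  pause(): Promise<void>;
  resume(): void;
}
interface Breaker {
  on(event: "open" | "halfOpen", handler: () => void): void;
}

// Hypothetical wiring: stop consuming jobs while the ESP is confirmed down,
// resume to probe once the breaker half-opens.
function wireBreakerToWorker(worker: PausableWorker, breaker: Breaker): void {
  breaker.on("open", () => {
    void worker.pause(); // in-flight jobs finish; queued jobs wait in the backing store
  });
  breaker.on("halfOpen", () => {
    worker.resume(); // let a trickle of jobs test the ESP again
  });
}
```

Pausing the worker (rather than failing jobs) keeps queued sends intact and burns no retry budget while the ESP is known to be down.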