Temporary failures do not block the rest of the queue

ab-002449 · sending-pipeline-infrastructure.retry-error-handling.failure-isolation

Severity: medium · active

Why it matters

A queue worker configured with concurrency: 1 processes exactly one job at a time. If a failing job's 32-second exponential-backoff wait is served while the job still occupies that slot (an in-process retry loop, or a queue that re-queues at the head instead of parking the job in a delayed set), the worker cannot pull the next available job for 32 seconds. In a FIFO queue, every subsequent recipient in the batch waits behind the failing job. CWE-400 (Uncontrolled Resource Consumption) describes the mechanism: a single failing job consumes the worker's only processing slot. The fault-tolerance sub-characteristic of reliability in ISO/IEC 25010:2011 calls for a system to keep operating as intended despite faults, so a fault in one job should not propagate to unrelated sends.
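
To make the stall concrete, here is a minimal simulation. It is a toy model, not BullMQ itself: one worker slot, a FIFO batch, and two assumed retry strategies (waiting out the backoff inline versus parking the failed job in a delayed set). The names `simulate`, `Job`, and `RetryMode`, and the 1-second send time, are hypothetical and chosen only for illustration.

```typescript
// Toy model: one worker slot, FIFO order, all times in seconds.
type RetryMode = 'inline' | 'delayed';

interface Job {
  id: string;
  failsOnce: boolean; // true: the first attempt fails, the retry succeeds
}

const SEND_TIME = 1; // assumed duration of one send attempt
const BACKOFF = 32;  // retry delay from the scenario above

// Returns the completion time of each job, keyed by job id.
function simulate(jobs: Job[], mode: RetryMode): Record<string, number> {
  const done: Record<string, number> = {};
  const delayed: { job: Job; readyAt: number }[] = [];
  let clock = 0;

  for (const job of jobs) {
    clock += SEND_TIME; // first attempt
    if (!job.failsOnce) {
      done[job.id] = clock;
    } else if (mode === 'inline') {
      // The worker holds the job for the whole backoff, then retries.
      clock += BACKOFF + SEND_TIME;
      done[job.id] = clock;
    } else {
      // The job is parked; the worker moves on immediately.
      delayed.push({ job, readyAt: clock + BACKOFF });
    }
  }

  // Retry parked jobs once their delay has elapsed.
  for (const { job, readyAt } of delayed) {
    clock = Math.max(clock, readyAt) + SEND_TIME;
    done[job.id] = clock;
  }
  return done;
}

const batch: Job[] = [
  { id: 'a', failsOnce: true },
  { id: 'b', failsOnce: false },
  { id: 'c', failsOnce: false },
];

const inlineResult = simulate(batch, 'inline');
const parkedResult = simulate(batch, 'delayed');
```

Under these assumptions, recipients b and c finish at 35 s and 36 s when the retry is served inline, but at 2 s and 3 s when the failed job is parked. Job a completes at 34 s in both cases, so isolating the failure costs the failing job nothing.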

Severity rationale

Medium because single-worker FIFO blocking causes throughput stalls and delivery delays rather than data loss, but affects all recipients in a campaign when any single job fails.

Remediation

In workers/email.worker.ts, set worker concurrency to at least 2 so that a job waiting on a retry does not stall the rest of the queue:

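import { Worker } from 'bullmq'

// connection and processEmailJob are assumed to be defined elsewhere in this module
const worker = new Worker('email', processEmailJob, {
  connection,
  concurrency: 5
})
// BullMQ moves failed-with-backoff jobs to the 'delayed' set automatically,
// so they do not block other jobs while waiting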

For marketing sends, add a rate limiter so that higher concurrency does not exceed the ESP's rate limits:

// Queue name uses '-' because recent BullMQ versions reject names that contain ':'
const marketingWorker = new Worker('email-marketing', processEmailJob, {
  connection,
  concurrency: 10,
  limiter: { max: 50, duration: 1000 }
})
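
How concurrency and the limiter interact can be estimated with simple arithmetic. The sketch below is a rough upper bound for a single worker process, not a description of BullMQ internals. `WorkerTuning`, `maxSendsPerSecond`, and the average job durations are hypothetical, introduced only to illustrate the calculation.

```typescript
// Rough ceiling on sends per second for one worker process.
interface WorkerTuning {
  concurrency: number;       // parallel jobs in the worker
  limiterMax: number;        // limiter.max
  limiterDurationMs: number; // limiter.duration
}

function maxSendsPerSecond(t: WorkerTuning, avgJobMs: number): number {
  // Each slot completes 1000 / avgJobMs jobs per second.
  const concurrencyCap = t.concurrency * (1000 / avgJobMs);
  // The limiter admits limiterMax jobs per limiterDurationMs window.
  const limiterCap = t.limiterMax * (1000 / t.limiterDurationMs);
  return Math.min(concurrencyCap, limiterCap);
}

const marketing: WorkerTuning = {
  concurrency: 10,
  limiterMax: 50,
  limiterDurationMs: 1000,
};

const fastSends = maxSendsPerSecond(marketing, 100); // limiter is the bottleneck
const slowSends = maxSendsPerSecond(marketing, 400); // concurrency is the bottleneck
```

With an assumed 100 ms per send, the ten slots could reach 100 sends per second, so the limiter caps throughput at 50. With 400 ms per send, concurrency caps it at 25 and the limiter never engages. Measuring the real average send time shows which setting actually protects the ESP limit.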

Detection

ID: failure-isolation
Severity: medium
What to look for: Examine whether a single failing recipient or campaign job can stall or block the processing of other jobs in the queue. In a FIFO queue served by one worker with a concurrency of 1, a job stuck in a retry delay blocks all subsequent jobs until that delay expires. Check the worker concurrency setting, whether retried jobs are actually delayed (not immediately re-queued), and whether the queue supports independent per-job retries that do not block the queue head.
Pass criteria: Enumerate all worker configurations and record their concurrency values. Every queue worker has a concurrency of at least 2, so that a job in a retry delay does not prevent other jobs from being processed. Jobs in delayed retry are moved to the "delayed" set and do not block the queue head.
Fail criteria: A single worker with a concurrency of 1 processes a FIFO queue. A failing job with a 32-second retry delay blocks all subsequent sends for 32 seconds. Head-of-line blocking is possible.
Skip (N/A) when: The application sends fewer than 100 emails per day where single-worker FIFO processing is acceptable — documented in code or configuration comments.
Detail on fail: "Single worker with concurrency: 1, so a job in 32-second backoff delay blocks all subsequent sends in the queue" or "FIFO queue with no priority or delay bypass, so a rate-limited campaign job stalls all other sends"
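
The pass and fail criteria above can be sketched as a small check over worker configurations. This is an illustrative helper, not part of any audit tool. `WorkerConfig`, `Finding`, `auditConcurrency`, and the queue names in the sample input are hypothetical. The one library fact it relies on is that BullMQ defaults concurrency to 1 when the option is omitted.

```typescript
interface WorkerConfig {
  queue: string;
  concurrency?: number; // omitted means the BullMQ default of 1
}

interface Finding {
  queue: string;
  concurrency: number;
  pass: boolean;
}

// Flags every worker whose effective concurrency is below the minimum.
function auditConcurrency(workers: WorkerConfig[], minimum = 2): Finding[] {
  return workers.map((w) => {
    const concurrency = w.concurrency ?? 1;
    return { queue: w.queue, concurrency, pass: concurrency >= minimum };
  });
}

const findings = auditConcurrency([
  { queue: 'email', concurrency: 5 },
  { queue: 'email-marketing', concurrency: 10 },
  { queue: 'email-digest' }, // no explicit concurrency set
]);

const failing = findings.filter((f) => !f.pass).map((f) => f.queue);
```

In this sample only 'email-digest' fails, because it falls back to the default of 1. A real check would also need to confirm that retries go through the delayed set, which a static look at concurrency values cannot establish.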

Remediation: Configure worker concurrency and use BullMQ's delayed job system correctly:

// Use concurrency to process multiple jobs in parallel
const worker = new Worker('email', processEmailJob, {
  connection,
  concurrency: 5, // Process up to 5 jobs simultaneously
  // Failed jobs with backoff are re-queued as delayed jobs
  // — they do not block other jobs while waiting
})

// BullMQ handles this correctly by design:
// When a job fails with backoff, it moves to the "delayed" set
// and the worker picks up the next available job immediately.
// Confirm your version of BullMQ (>=2.x) uses this behavior.

// For rate-limited campaign sends, use the limiter to throttle
// without blocking unrelated jobs. The queue name uses '-' because
// recent BullMQ versions reject names that contain ':'.
const marketingWorker = new Worker('email-marketing', processEmailJob, {
  connection,
  concurrency: 10,
  limiter: {
    max: 50,       // Max 50 jobs per interval
    duration: 1000 // 1 second window
  }
})
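
For context on where a 32-second delay can come from: BullMQ's built-in exponential backoff is documented as 2^(attempts - 1) × delay. The sketch below reproduces that arithmetic as a standalone function. The 1000 ms base delay is an assumption, since the source does not state one, and `exponentialBackoffMs` is a hypothetical name.

```typescript
// Delay before the next retry, given how many attempts have already been made.
function exponentialBackoffMs(attemptsMade: number, baseDelayMs: number): number {
  return Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs);
}

// Assumed base delay of 1000 ms; attempts 1 through 6.
const schedule = [1, 2, 3, 4, 5, 6].map((n) => exponentialBackoffMs(n, 1000));
```

Under that assumption the delays are 1, 2, 4, 8, 16, and 32 seconds, so the 32-second wait corresponds to the retry after the sixth failed attempt. With a different base delay, the same 32 seconds is reached at a different attempt.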

External references

cwe · CWE-400 — Uncontrolled Resource Consumption
iso-25010:2011 · reliability.fault-tolerance — Fault Tolerance

Taxons

error-resilience cost-efficiency

History

2026-04-18·v1.0.0·Initial import from sending-pipeline-infrastructure·automated