Queue depth and age monitoring

ab-001977 · operational-resilience-email.monitoring-alerting.queue-depth-age-monitored

Severity: high · active

Why it matters

A queue that is silently backing up means recipients are waiting hours for time-sensitive transactional email — password resets, order confirmations, MFA codes. Without polling queue depth and job age on a schedule, a backlog caused by an ESP outage or a misconfigured concurrency setting is only discovered when users complain. ISO 25010 reliability requires that fault conditions are observable before they become user-facing failures. Job age is the earlier signal: a large depth with young jobs is a spike; a small depth with ancient jobs is a stuck worker.
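The spike versus stuck-worker distinction above can be sketched as a small classifier over the two signals. This is only an illustration: the threshold values, the `classifyQueue` name, and the extra `backlog` label for the case where both signals are high are assumptions, not part of this check.

```typescript
type QueueCondition = 'healthy' | 'spike' | 'stuck-worker' | 'backlog';

// Hypothetical thresholds; tune them to your own traffic and delivery targets.
const DEPTH_THRESHOLD = 1000;
const AGE_THRESHOLD_SECONDS = 300;

// Combines depth and oldest-job age into a single coarse diagnosis.
function classifyQueue(depth: number, oldestAgeSeconds: number): QueueCondition {
  const deep = depth >= DEPTH_THRESHOLD;
  const old = oldestAgeSeconds >= AGE_THRESHOLD_SECONDS;
  if (deep && old) return 'backlog';   // many jobs and they are not draining
  if (deep) return 'spike';            // many jobs, but all recently enqueued
  if (old) return 'stuck-worker';      // few jobs, yet one has waited too long
  return 'healthy';
}
```

Neither signal alone can separate these cases, which is why the check requires both.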

Severity rationale

High because an unmonitored queue backlog can silently delay time-critical transactional emails — MFA codes and password resets — causing user lockouts before any alert fires.

Remediation

Add a periodic health poller in your queue worker bootstrap file (e.g., src/workers/email.worker.ts or src/lib/queue/monitor.ts), running on a setInterval of no more than 5 minutes:

// Depth: jobs waiting to run plus jobs scheduled for later
const counts = await queue.getJobCounts('wait', 'delayed')
// Age: start and end are inclusive, so 0..0 fetches only the oldest waiting job (asc = true)
const [oldest] = await queue.getJobs(['wait'], 0, 0, true)
const ageMs = oldest ? Date.now() - oldest.timestamp : 0
metrics.gauge('email_queue_depth', counts.wait + counts.delayed)
metrics.gauge('email_queue_oldest_job_age_seconds', Math.floor(ageMs / 1000))

Both depth and age are required — emitting only one of the two is not sufficient for this check to pass.
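A minimal sketch of how the snippet might be wrapped in a scheduled poller follows. `QueueLike` and `MetricsClient` are hypothetical structural types introduced here so the example stands alone; a BullMQ `Queue` is expected to satisfy `QueueLike`, and `metrics` stands in for whatever metrics client the project already uses. The function names `emitQueueHealth` and `startQueueHealthPoller` are also assumptions.

```typescript
// Hypothetical minimal shapes: only the calls used below.
interface QueueLike {
  getJobCounts(...types: string[]): Promise<Record<string, number>>;
  getJobs(types: string[], start?: number, end?: number, asc?: boolean): Promise<Array<{ timestamp: number }>>;
}

interface MetricsClient {
  gauge(name: string, value: number): void;
}

// Upper bound on the polling interval required by this check (5 minutes).
const MAX_POLL_INTERVAL_MS = 5 * 60 * 1000;

// Reads both signals once and emits them as gauges.
async function emitQueueHealth(
  queue: QueueLike,
  metrics: MetricsClient,
  now: () => number = Date.now,
): Promise<void> {
  const counts = await queue.getJobCounts('wait', 'delayed');
  // Inclusive range 0..0 with asc = true: only the oldest waiting job.
  const [oldest] = await queue.getJobs(['wait'], 0, 0, true);
  const ageMs = oldest ? Math.max(0, now() - oldest.timestamp) : 0;
  metrics.gauge('email_queue_depth', (counts.wait ?? 0) + (counts.delayed ?? 0));
  metrics.gauge('email_queue_oldest_job_age_seconds', Math.floor(ageMs / 1000));
}

// Starts the schedule and returns a function that stops it.
function startQueueHealthPoller(
  queue: QueueLike,
  metrics: MetricsClient,
  intervalMs: number = MAX_POLL_INTERVAL_MS,
): () => void {
  const tick = (): void => {
    // A failed poll is logged and must not crash the worker.
    emitQueueHealth(queue, metrics).catch((err) => {
      console.error('queue health poll failed', err);
    });
  };
  tick(); // emit once at startup rather than waiting a full interval
  const timer = setInterval(tick, Math.min(intervalMs, MAX_POLL_INTERVAL_MS));
  return () => clearInterval(timer);
}
```

Calling `startQueueHealthPoller(queue, metrics)` from the worker bootstrap and invoking the returned function during shutdown keeps the timer from holding the process open. Clamping the interval to the 5-minute maximum is a design choice made here so that a misconfigured value cannot silently push polling past the limit.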

Detection

ID: queue-depth-age-monitored
Severity: high
What to look for: Examine the queue worker setup for code that reads and emits the current queue depth (number of jobs waiting) and job age (time since the oldest job was enqueued). With BullMQ, depth comes from queue.getJobCounts() and age from the timestamp of the oldest job returned by queue.getJobs(). With other queue systems, look for equivalent polling. Count the number of queue health signals emitted: depth and age are both required. Both metrics must be emitted on a schedule, not only on failure. The polling interval must be at most 5 minutes (300 seconds).
Pass criteria: Queue depth and the age of the oldest waiting job are both polled at intervals of no more than 5 minutes and emitted to a monitoring system, log aggregator, or metrics platform. Before evaluating, quote the exact polling interval or cron expression found in the code. Do NOT pass when only depth is polled but age is not, or vice versa — both signals are required.
Fail criteria: No code polls queue depth or job age; or depth is logged only when a job fails rather than proactively; or only 1 of the 2 required signals (depth, age) is emitted.
Skip (N/A) when: The project uses no async queue — all email sends are synchronous inline calls, confirmed by the absence of bull, bullmq, bee-queue, agenda, or similar queue libraries in package.json.
Detail on fail: "No queue depth polling found — queue backlog would only be noticed through increased latency or missed delivery" or "BullMQ job counts are never read in application code"
Cross-reference: The Sending Pipeline & Infrastructure Audit's queue-architecture category verifies that the queue itself is durable — this check verifies that the queue's health is observable in production.

External references

iso-25010:2011 · reliability.fault-tolerance — Reliability / Fault Tolerance — unmonitored queue depth hides backlog failures
cwe · CWE-391 — Unchecked Error Condition — queue health not proactively polled

Taxons

observability

History

2026-04-18·v1.0.0·Initial import from operational-resilience-email·automated