All 19 checks with why-it-matters prose, severity, and cross-references to related audits.
Bounce rate, complaint rate, and open rate are the three signals ESPs use to determine whether your sending domain is trustworthy. When these metrics are not stored persistently and surfaced in a dashboard, deliverability collapse happens silently — Gmail and Yahoo start routing to spam, complaint rates climb past the 0.1% threshold, and you have no historical trend to show when you contact ESP support. CWE-391 (insufficient logging) at the operational layer means you cannot distinguish a transient spike from a structural reputation problem. Without queryable trend data, you are flying blind into potential domain blacklisting.
Why this severity: Critical because unmonitored bounce and complaint rates can trigger ESP account suspension and domain blacklisting before any operator notices the problem.
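A minimal sketch of the counting layer behind such a dashboard, assuming events arrive as plain type strings; a real system would persist per-day counters to a database rather than an in-process `Counter`:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class DeliverabilityStats:
    """Rolling counters for the reputation signals a dashboard needs."""
    events: Counter = field(default_factory=Counter)

    def record(self, event: str) -> None:
        # event is one of: "delivered", "bounced", "complained", "opened"
        self.events[event] += 1

    def _rate(self, event: str) -> float:
        # Denominator here is accepted sends (delivered + bounced); a real
        # system should match its ESP's definition exactly.
        total = self.events["delivered"] + self.events["bounced"]
        return self.events[event] / total if total else 0.0

    def bounce_rate(self) -> float:
        return self._rate("bounced")

    def complaint_rate(self) -> float:
        return self._rate("complained")

    def over_complaint_threshold(self, limit: float = 0.001) -> bool:
        # 0.001 mirrors the 0.1% danger line discussed above.
        return self.complaint_rate() > limit
```

Persisting these counters per day is what produces the queryable trend data the check asks for.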
operational-resilience-email.monitoring-alerting.deliverability-metrics-dashboarded

A queue that is silently backing up means recipients are waiting hours for time-sensitive transactional email — password resets, order confirmations, MFA codes. Without polling queue depth and job age on a schedule, a backlog caused by an ESP outage or a misconfigured concurrency setting is only discovered when users complain. ISO 25010 reliability requires that fault conditions are observable before they become user-facing failures. Job age is the earlier signal: a large depth with young jobs is a spike; a small depth with ancient jobs is a stuck worker.
Why this severity: High because an unmonitored queue backlog can silently delay time-critical transactional emails — MFA codes and password resets — causing user lockouts before any alert fires.
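The spike-versus-stuck heuristic in the last sentence can be written down directly. The thresholds below are illustrative placeholders, not recommendations:

```python
def classify_queue(depth: int, oldest_job_age_s: float,
                   depth_limit: int = 1000, age_limit_s: float = 600) -> str:
    """Interpret queue depth and oldest-job age together."""
    if oldest_job_age_s > age_limit_s:
        return "stuck-worker"   # ancient jobs: nothing is draining the queue
    if depth > depth_limit:
        return "spike"          # many jobs, but they are young and moving
    return "healthy"
```

Running this on a schedule (and alerting on anything non-healthy) is the polling the check asks for.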
operational-resilience-email.monitoring-alerting.queue-depth-age-monitored

ESP API latency is a leading indicator of outages and rate-limit throttling. A call that normally completes in 80 ms suddenly taking 4 seconds means the ESP is struggling — but without per-call timing recorded to a metrics system or log, the only way you discover the degradation is through user reports or queue backlog growth. ISO 25010 performance-efficiency requires that time-behaviour is measurable. Missing timing on error paths is especially dangerous: slow errors are the most likely signal of an impending outage.
Why this severity: Medium because latency spikes are invisible without instrumentation, delaying incident detection until the problem has already caused queue backlog or send failures.
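One low-ceremony way to record per-call timing that also covers the error path is a decorator around the ESP client call. The list-based metrics sink here is a stand-in for a real metrics client:

```python
import time
from functools import wraps

def timed(metrics: list):
    """Record (name, outcome, seconds) for every call, success or failure."""
    def wrap(fn):
        @wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                outcome = "ok"
                return result
            except Exception:
                outcome = "error"   # slow errors get timed too
                raise
            finally:
                metrics.append((fn.__name__, outcome, time.monotonic() - start))
        return inner
    return wrap
```

Because the timing lives in `finally`, the error path is instrumented for free, which is exactly the gap the check flags.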
operational-resilience-email.monitoring-alerting.esp-api-latency-tracked

Log-only alerting means a human must be watching logs at the exact moment an email system failure occurs. A Slack webhook or PagerDuty alert configured for queue backlog and ESP failure conditions changes the detection model from passive (someone notices) to active (system notifies). ISO 25010 reliability.operability requires that operational problems are communicated to operators without manual monitoring. Teams that rely on log scraping during on-call hours routinely miss email incidents for 30–90 minutes — long enough for complaint rates to spike above Gmail's 0.1% threshold.
Why this severity: High because log-only alerting guarantees delayed incident response — production email failures go undetected until a human happens to look at logs or users report problems.
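A sketch of the active-notification path, assuming a Slack incoming webhook; the URL is a hypothetical placeholder, and a production version would add timeouts and retry:

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def build_alert(signal: str, value: float, threshold: float) -> dict:
    """Format a breach of `threshold` as a Slack message payload."""
    return {"text": f":rotating_light: {signal} is {value:g}, threshold {threshold:g}"}

def post_alert(payload: dict) -> None:
    """Push the alert to Slack (fire-and-forget in this sketch)."""
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Calling `post_alert(build_alert(...))` from the same scheduled job that measures queue depth is what turns the measurement into a notification.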
operational-resilience-email.monitoring-alerting.alert-channels-configured

Aggregate email metrics hide which campaign is hurting your sender reputation. A bounce rate of 4% looks alarming in aggregate but harmless if it comes entirely from one poorly-segmented cold campaign — and fixable once you can isolate it. Without `campaign_id` on individual event records, triage requires reconstructing send state from logs and timestamps, which takes hours during an active incident. The Campaign Analytics & Attribution Audit validates attribution accuracy; this check verifies that the raw event storage layer records the campaign context in the first place.
Why this severity: Medium because missing campaign attribution makes deliverability triage time-consuming and imprecise, slowing incident response during bounce rate spikes.
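Once events carry `campaign_id`, per-campaign triage is a small aggregation. This sketch assumes events are `(campaign_id, event_type)` pairs:

```python
from collections import defaultdict

def bounce_rate_by_campaign(events) -> dict:
    """events: iterable of (campaign_id, event_type) pairs.
    Returns bounce rate per campaign over accepted sends."""
    sent = defaultdict(int)
    bounced = defaultdict(int)
    for cid, etype in events:
        if etype in ("delivered", "bounced"):
            sent[cid] += 1
        if etype == "bounced":
            bounced[cid] += 1
    return {cid: bounced[cid] / sent[cid] for cid in sent}
```

This is the query that lets an operator see that a 4% aggregate rate is really one bad segment.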
operational-resilience-email.monitoring-alerting.per-campaign-health-tracked

When a primary ESP goes down — API timeout, account suspension, or DNS failure — email stops. Without automatic failover to a secondary ESP, every email queued during the outage either fails permanently or waits for manual operator intervention. CWE-391 (insufficient logging of errors) compounds the problem: a silent primary failure with no secondary means transactional emails — password resets, order confirmations — are silently dropped. The Sending Pipeline & Infrastructure Audit verifies queue resilience; this check verifies that the ESP layer itself has a recovery path that does not require a human in the loop.
Why this severity: Critical because a single-ESP architecture converts any ESP outage into a complete email blackout — no transactional sends, no recovery path until the outage resolves.
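A minimal shape for the failover path, assuming each provider exposes a send callable; real code would also log each failure and record which provider served each message:

```python
def send_with_failover(message, providers):
    """providers: ordered list of (name, send_callable).
    Try each in turn; return the name of the one that succeeded."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception as exc:
            errors.append((name, exc))   # record and fall through to the next ESP
    raise RuntimeError(f"all providers failed: {errors}")
```

The key property is that the fallback happens in code, with no operator in the loop.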
operational-resilience-email.failure-recovery.esp-failover-exists

During an ESP outage, a queue worker without backpressure configuration becomes a thundering herd: all failed jobs retry immediately at full concurrency, exhausting the ESP's connection limit, burning through retry budget, and degrading IP reputation simultaneously. CWE-400 (uncontrolled resource consumption) and CWE-770 (allocation without limits) both apply. The damage compounds: by the time the ESP recovers, the IP may have accumulated enough connection-abuse signals to land in spam. Exponential backoff and a concurrency cap are not optional niceties — they are what prevents a 30-minute outage from becoming a 3-day reputation recovery.
Why this severity: High because a worker with no backpressure configuration converts a temporary ESP outage into a thundering-herd attack on the ESP that can permanently damage IP reputation.
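Full-jitter exponential backoff is one standard way to spread retries out after an outage so recovered capacity is not hit by a synchronized herd; the base and cap values below are illustrative:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)].
    The jitter is what de-synchronizes retries across jobs and workers."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

This pairs with a worker concurrency cap: backoff spaces retries in time, the cap bounds them in parallelism.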
operational-resilience-email.failure-recovery.queue-backpressure-on-outage

Contacts and suppression lists are the two most operationally critical datasets in an email system. Losing the suppression list means mailing previously unsubscribed contacts — a CAN-SPAM violation that triggers regulatory fines and FTC complaints. Losing contacts means losing revenue and user relationships with no recovery path. NIST SP 800-53 CP-9 (Information System Backup) requires that backup procedures are tested, not just configured. ISO 25010 reliability.recoverability requires that a restore procedure exists and is documented. A backup without a documented restore procedure is not a recovery plan — it is an archive of unknown integrity.
Why this severity: Critical because loss of the contacts or suppression list without a tested restore procedure means permanent data loss, CAN-SPAM violations, and irreversible damage to sender reputation.
operational-resilience-email.failure-recovery.contact-db-backup-tested

When a send batch fails partway through — worker crash, ESP timeout, deployment restart — the question is: which contacts got the email and which did not? Without a `status` field on send records that tracks `queued`, `sent`, and `failed` per contact, the only way to reconstruct send state is to cross-reference logs with timestamps, which is slow and error-prone under incident pressure. CWE-391 covers the missing audit trail. ISO 25010 reliability.recoverability requires that recovery from a partial failure is a defined, executable procedure — not an ad-hoc investigation.
Why this severity: High because without per-contact send status and a documented recovery procedure, a partial batch failure forces manual log reconstruction and risks either re-sending to already-reached contacts or missing contacts entirely.
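A sketch of the per-contact status model using SQLite for illustration; the table name and status values are assumptions, not a prescribed schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE sends (
    contact_id TEXT PRIMARY KEY,
    status     TEXT NOT NULL CHECK (status IN ('queued', 'sent', 'failed'))
)"""

def resumable_contacts(conn: sqlite3.Connection) -> set:
    """After a partial batch failure, only 'queued' and 'failed' contacts
    are safe to (re)attempt; 'sent' contacts must not receive a duplicate."""
    rows = conn.execute(
        "SELECT contact_id FROM sends WHERE status IN ('queued', 'failed')")
    return {row[0] for row in rows}
```

With this field in place, the recovery procedure is a query rather than a log archaeology session.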
operational-resilience-email.failure-recovery.failed-send-recovery-documented

Exponential backoff slows down retry frequency, but it does not stop new jobs from being accepted and attempted against a broken ESP. A circuit breaker adds the missing control: after N consecutive failures, it stops attempting sends entirely and waits for a cooldown before probing again. Without it, a prolonged ESP outage fills the queue with failed jobs that burn retry budget. CWE-400 applies because new jobs continue to consume processing resources against an endpoint that is confirmed unavailable. The circuit breaker is the difference between "we paused sends during the outage" and "we burned through all retry attempts before the outage resolved."
Why this severity: Low because exponential backoff provides partial protection, but without a circuit breaker, indefinite new-job ingestion during an outage still exhausts retry budget and worker capacity.
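A minimal circuit breaker with an injectable clock (so the cooldown is testable); the threshold and cooldown values are illustrative:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `cooldown` s."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let a single probe through.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

The worker checks `allow()` before each attempt, so jobs skip the broken ESP instead of burning retry budget against it.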
operational-resilience-email.failure-recovery.circuit-breaker-on-esp-failures

Every ESP plan has hard sending limits — SES sandbox defaults to 1 send per second and 200 per day; SendGrid's free tier caps at 100 sends per day. Exceeding these limits produces 429 errors that fail jobs and degrade IP reputation if retried without backoff. CWE-770 (allocation without limits) applies when queue concurrency is uncapped relative to the ESP tier. The Deliverability Engineering Audit's IP reputation category depends on sends staying within known limits — this check verifies that those limits are actually documented and enforced in code, not just hoped to be safe.
Why this severity: High because sending above ESP tier limits produces throttling errors that increase bounce rate, consume retry budget, and can trigger account suspension on plans with anti-abuse enforcement.
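A token-bucket limiter is one way to enforce a documented tier limit in code; the rate and burst values would come from the ESP plan (e.g. 1/s for the SES sandbox), and the clock is injectable for testing:

```python
import time

class TokenBucket:
    """Cap outgoing send rate at the documented ESP tier limit."""

    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; refill based on elapsed time."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A worker that skips or delays the job when `try_acquire()` returns False never sees the ESP's 429s in the first place.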
operational-resilience-email.capacity-scaling.volume-limits-documented

A single-instance queue worker is a capacity ceiling and a single point of failure. When send volume spikes — product launch, re-engagement campaign — a single worker becomes the bottleneck. ISO 25010 performance-efficiency.capacity requires that the system can be scaled to meet demand. The second failure mode is subtler: in-process state (a local Map of sent IDs, a module-level cache) will cause duplicate sends or inconsistent behavior when multiple instances run, which is worse than not scaling at all.
Why this severity: Low because single-instance workers are a capacity ceiling, not an immediate failure — the impact is gradual throughput degradation during volume spikes rather than an outright outage.
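The in-process-state pitfall can be avoided by claiming work through a shared store. The dict below stands in for an atomic backend operation such as Redis `SET ... NX`; a plain dict is only safe in this single-process sketch:

```python
class SharedDedup:
    """Claim-before-send: the first worker to claim a message id wins.
    `store` must be shared across instances (e.g. Redis) for this to hold;
    the dict here is illustration only."""

    def __init__(self, store: dict):
        self.store = store

    def claim(self, message_id: str) -> bool:
        # In Redis this would be a single atomic SET key value NX call.
        if message_id in self.store:
            return False
        self.store[message_id] = True
        return True
```

Two worker instances sharing the same store then agree on who sends, which a module-level Map can never guarantee.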
operational-resilience-email.capacity-scaling.workers-scale-horizontally

Queue workers that have never been tested at above-normal volume reveal their breaking points in production: the rate limiter was set to a library default, concurrency was never tuned, and the ESP starts returning 429s at 3x normal volume. ISO 25010 reliability.fault-tolerance requires that the system behaves predictably under adverse load conditions. Without a burst test script that runs against a sandbox ESP at 10x normal volume, the concurrency and rate limiter values are guesses — and the first real burst is the test.
Why this severity: High because untested burst capacity means the first high-volume send reveals queue configuration failures in production, where bounces and ESP throttling are real and affect sender reputation.
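A burst test can be as small as a loop that fires probes at a sandbox sender and tallies outcomes; `ThrottledError` is a hypothetical stand-in for an ESP 429 response:

```python
class ThrottledError(Exception):
    """Stand-in for an ESP 429 (rate limited) response."""

def burst_test(send, n: int) -> dict:
    """Fire n probe sends as fast as the worker allows; tally outcomes.
    Run against a sandbox ESP, never production."""
    results = {"ok": 0, "throttled": 0, "error": 0}
    for i in range(n):
        try:
            send(f"probe-{i}")
            results["ok"] += 1
        except ThrottledError:
            results["throttled"] += 1
        except Exception:
            results["error"] += 1
    return results
```

A non-zero `throttled` count at 10x volume is the signal that concurrency or the rate limiter needs tuning before a real burst arrives.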
operational-resilience-email.capacity-scaling.burst-capacity-tested

A queue worker with no memory limit can consume all available host memory during a burst, taking down the database, Redis, and web server on the same host. CWE-400 (uncontrolled resource consumption) applies directly. The connection pool problem is multiplicative: 5 replicas with a default pg connection pool of 10 connections each create 50 connections — fine for one replica, catastrophic for five on a database with a max_connections of 25. ISO 25010 performance-efficiency.resource-utilization requires that resource consumption is bounded and predictable.
Why this severity: Low because resource limits protect co-located services from worker memory exhaustion and connection pool saturation, but the failure mode is gradual rather than instantaneous.
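The multiplicative pool arithmetic can be made explicit and enforced at startup; the reserved-connection count for admin sessions is an assumption:

```python
def pool_size_per_replica(max_connections: int, replicas: int,
                          reserved: int = 5) -> int:
    """Size each replica's pool so replicas * pool_size never exceeds the
    database's max_connections, keeping `reserved` free for admin access."""
    usable = max_connections - reserved
    if usable < replicas:
        raise ValueError("more replicas than available connections")
    return usable // replicas
```

Computing the pool size from the replica count (instead of taking a library default) is what keeps the 5-replica case from exhausting a 25-connection database.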
operational-resilience-email.capacity-scaling.resource-limits-enforced

Capacity alerts at failure (queue full, ESP quota exceeded) give operators zero time to react — sends are already failing when the alert fires. An 80% threshold alert gives operators time to scale workers, contact the ESP about a quota increase, or delay a non-urgent campaign before the limit is hit. ISO 25010 reliability.operability requires that the system communicates approaching resource exhaustion, not just completed resource exhaustion. The Monitoring & Alerting category checks in this bundle verify that signals are being measured — this check verifies that pre-warning thresholds are configured on those signals.
Why this severity: Low because the gap between no pre-warning and pre-warning alerts is operational comfort rather than correctness — sends degrade at 100% regardless, but operators with early warning can prevent reaching 100%.
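The pre-warning logic is a small pure function, which also makes the 80% threshold easy to test; the state names and default are illustrative:

```python
def capacity_status(used: float, limit: float, warn_at: float = 0.8) -> str:
    """Classify a capacity signal so alerts fire before exhaustion."""
    ratio = used / limit
    if ratio >= 1.0:
        return "exhausted"
    if ratio >= warn_at:
        return "warning"    # alert now, while there is still headroom
    return "ok"
```

Wiring "warning" to the same alert channel as failures is what converts the 100% incident into an 80% chore.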
operational-resilience-email.capacity-scaling.capacity-alerts-at-threshold

A deliverability drop — bounce rate spike, complaint rate above Gmail's 0.1% threshold, or inbox placement collapse — requires immediate, structured triage. Without a runbook, operators improvise: they may pause all sends instead of isolating the affected campaign, spend 30 minutes finding the right dashboard query, or miss the SPF/DKIM check that would have identified the root cause in 2 minutes. NIST SP 800-53 IR-8 (Incident Response Plan) requires documented procedures for anticipated failure modes. RFC 5321 bounce handling standards define the specific events this runbook must cover.
Why this severity: High because an undocumented deliverability incident response guarantees slower triage, which allows complaint rates to climb further past ESP thresholds during the response window.
operational-resilience-email.incident-response.deliverability-drop-runbook

When a campaign starts generating complaint spikes, the correct response is to pause that campaign — not halt the entire email system. A queue worker that can only be stopped globally means that pausing one broken campaign takes down order confirmations, password resets, and MFA codes simultaneously. NIST SP 800-53 IR-4 (Incident Handling) requires that incident containment minimizes collateral impact. The Campaign Orchestration & Sequencing Audit verifies sequence management; this check verifies that the pause mechanism is fine-grained enough to contain an incident without collateral damage.
Why this severity: High because the absence of per-campaign pause forces operators to choose between continuing a damaging campaign or halting all email — including time-critical transactional sends.
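A per-campaign gate in the worker is the fine-grained mechanism this check looks for; the job shape below is an assumption for illustration:

```python
def should_process(job: dict, paused_campaigns: set) -> bool:
    """Gate each job on its campaign, not on a global kill switch.
    Jobs with no campaign_id (e.g. transactional sends) always pass."""
    return job.get("campaign_id") not in paused_campaigns
```

The worker consults this check per job, so pausing one campaign id leaves password resets and MFA codes flowing.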
operational-resilience-email.incident-response.blast-radius-containment

During an incident, operators are under time pressure — every minute a bad campaign runs increases complaints and risks ESP account action. A pause mechanism that requires navigating to an undocumented admin page, writing a SQL UPDATE, or finding the right API endpoint in source code adds minutes to incident response. NIST SP 800-53 IR-4 (Incident Handling) requires that containment actions are pre-planned and executable without discovery overhead. An admin UI button with no runbook entry fails because the runbook is what makes it findable under pressure.
Why this severity: Medium because an undocumented quick-disable mechanism is operationally equivalent to no mechanism — operators cannot use what they cannot find under incident pressure.
operational-resilience-email.incident-response.quick-disable-single-campaign

Without a post-mortem template, email incidents end when sends resume — the root cause, contributing factors, and action items are never formally captured. The same misconfiguration repeats. NIST SP 800-53 IR-5 (Incident Monitoring) and ISO 25010 reliability.recoverability both require that incidents are tracked and that learnings feed back into system improvements. A template is not bureaucracy — it is the mechanism that converts a painful incident into a concrete backlog item that prevents recurrence.
Why this severity: Low because missing post-mortems cause recurring incidents rather than immediate failures — the impact accumulates over months as the same root causes repeat without structural fixes.
operational-resilience-email.incident-response.post-mortem-template