ESP failover mechanism exists

ab-001981 · operational-resilience-email.failure-recovery.esp-failover-exists

Severity: critical · active

Why it matters

When a primary ESP goes down (API timeout, account suspension, or DNS failure), email stops. Without automatic failover to a secondary ESP, every email queued during the outage either fails permanently or waits for manual operator intervention. CWE-391 (Unchecked Error Condition) compounds the problem: if a primary failure goes unhandled and there is no secondary, transactional emails such as password resets and order confirmations are silently dropped. The Sending Pipeline & Infrastructure Audit verifies queue resilience; this check verifies that the ESP layer itself has a recovery path that does not require a human in the loop.

Severity rationale

Critical because a single-ESP architecture converts any ESP outage into a complete email blackout — no transactional sends, no recovery path until the outage resolves.

Remediation

Implement automatic failover in your ESP router (e.g., src/lib/email/router.ts) so primary failures trigger the secondary without operator action:

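// Sends through the primary ESP; on any primary error, retries once through the secondary.
// If the secondary also fails, its error propagates to the caller (e.g., the queue worker).
async function sendWithFailover(message: EmailMessage): Promise<void> {
  try {
    await primaryEsp.send(message)
  } catch (primaryErr) {
    logger.warn({ err: primaryErr }, 'Primary ESP failed, attempting failover')
    await secondaryEsp.send(message)
    // Counted only after the secondary has accepted the message.
    metrics.increment('esp_failover_used')
  }
}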

Both ESPs must have valid credentials loaded from environment variables. Manual failover — where an operator must change an env var and redeploy — does not satisfy this check.
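One way to make the credential requirement checkable is to validate both keys at startup, so a missing secondary key is found before an outage rather than during one. The following is a minimal sketch only: the variable names PRIMARY_ESP_API_KEY and SECONDARY_ESP_API_KEY and the EspCredentials type are assumptions for illustration, not names required by this check or by any ESP SDK.

```typescript
// Hypothetical shape; real ESP SDKs may need more than a single API key.
interface EspCredentials {
  primaryApiKey: string
  secondaryApiKey: string
}

type Env = Record<string, string | undefined>

// Reads both ESP keys from the given environment map and throws if either
// is missing or blank. The variable names are illustrative assumptions.
function loadEspCredentials(env: Env): EspCredentials {
  const required = ['PRIMARY_ESP_API_KEY', 'SECONDARY_ESP_API_KEY']
  const missing = required.filter((name) => !(env[name] ?? '').trim())
  if (missing.length > 0) {
    throw new Error(`Missing ESP credentials: ${missing.join(', ')}`)
  }
  return {
    primaryApiKey: env['PRIMARY_ESP_API_KEY'] as string,
    secondaryApiKey: env['SECONDARY_ESP_API_KEY'] as string,
  }
}
```

In a Node.js service this would typically be called once at boot as loadEspCredentials(process.env), so that a deploy with only one ESP configured fails immediately instead of during the first outage.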

Detection

ID: esp-failover-exists
Severity: critical
What to look for: Look for code that handles the case where the primary ESP API call fails and routes the send through a secondary ESP. This requires: (1) more than one ESP client configured, and (2) a try/catch or fallback pattern in the send path that invokes the secondary ESP when the primary fails. The Sending Pipeline & Infrastructure Audit verifies queue resilience; this check verifies the ESP-level failover layer.
Pass criteria: When the primary ESP API returns an error or times out, the system automatically retries through a secondary ESP client. Count all ESP SDKs configured in the codebase — at least 2 must be present with valid credential loading. The failover path must be exercised automatically (try/catch or circuit breaker), not via manual env var switch. Report even on pass: "Primary: [ESP name], Secondary: [ESP name], failover trigger: [mechanism]."
Fail criteria: Only 1 ESP is configured, or multiple ESPs are installed but no failover logic routes sends to the secondary when the primary fails. Manual failover (operator must change config) does not count as pass.
Skip (N/A) when: The project sends no email — confirmed by the absence of any ESP SDK in package.json dependencies.
Detail on fail: Describe the gap. Example: "Only SendGrid is configured — no secondary ESP installed" or "Both SendGrid and SES are installed but the send path has no fallback — primary failure throws and the job fails" or "ESP failover is documented as manual — operator must update env var during an outage"
Cross-reference: The Sending Pipeline & Infrastructure Audit's esp-integration category verifies ESP abstraction — this check verifies the operational failover layer that depends on that abstraction.

Remediation: Implement automatic failover in your ESP router (e.g., src/lib/email/router.ts):

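// Sends through the primary ESP; on any primary error, retries once through the secondary.
// If the secondary also fails, its error propagates to the caller (e.g., the queue worker).
async function sendWithFailover(message: EmailMessage): Promise<void> {
  try {
    await primaryEsp.send(message)
  } catch (primaryErr) {
    logger.warn({ err: primaryErr }, 'Primary ESP failed, attempting failover')
    await secondaryEsp.send(message)
    // Counted only after the secondary has accepted the message.
    metrics.increment('esp_failover_used')
  }
}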

Ensure both ESPs are configured and tested monthly against a seed list as recommended in the Deliverability Engineering Audit.
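The pass criteria accept a circuit breaker as the failover trigger and treat a primary timeout as a failure. A plain try/catch only fires when the primary call rejects, so a hung request needs an explicit timeout. The sketch below combines the two under stated assumptions: EmailMessage, EspClient, FailoverRouter, withTimeout, and the threshold, cooldown, and timeout defaults are all hypothetical names and values chosen for illustration, not part of any ESP SDK.

```typescript
// Illustrative types; a real router would wrap actual ESP SDK clients.
interface EmailMessage { to: string; subject: string; body: string }
interface EspClient { name: string; send(message: EmailMessage): Promise<void> }

// Rejects if `work` has not settled within `ms`, so a hung primary counts as a failure.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms)
    work.then(
      (value) => { clearTimeout(timer); resolve(value) },
      (err) => { clearTimeout(timer); reject(err) },
    )
  })
}

class FailoverRouter {
  private consecutiveFailures = 0
  private openUntil = 0 // epoch ms; the primary is skipped while now < openUntil

  constructor(
    private primary: EspClient,
    private secondary: EspClient,
    // Assumed defaults; tune to your ESP's latency and error profile.
    private opts = { failureThreshold: 3, cooldownMs: 30_000, timeoutMs: 5_000 },
    private now: () => number = () => Date.now(),
  ) {}

  // Returns the name of the ESP that accepted the message.
  async send(message: EmailMessage): Promise<string> {
    if (this.now() >= this.openUntil) {
      try {
        await withTimeout(this.primary.send(message), this.opts.timeoutMs)
        this.consecutiveFailures = 0
        return this.primary.name
      } catch {
        // After the threshold is reached, skip the primary for the cooldown period.
        // Once the cooldown ends, one probe send is allowed; if it fails, the
        // breaker reopens immediately because the failure count was never reset.
        this.consecutiveFailures += 1
        if (this.consecutiveFailures >= this.opts.failureThreshold) {
          this.openUntil = this.now() + this.opts.cooldownMs
        }
      }
    }
    // Either the breaker is open or the primary just failed: use the secondary.
    // A secondary failure propagates to the caller so the job can be retried.
    await this.secondary.send(message)
    return this.secondary.name
  }
}
```

Compared with a bare try/catch, the breaker avoids paying the primary's timeout on every message during a sustained outage, at the cost of a little state in the router. The injected clock exists only so the cooldown can be tested without waiting.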

External references

cwe · CWE-391 — Unchecked Error Condition — ESP failure has no failover path
iso-25010:2011 · reliability.fault-tolerance — Reliability / Fault Tolerance — system must continue operating when primary ESP fails

Taxons

error-resilience

History

2026-04-18·v1.0.0·Initial import from operational-resilience-email·automated