Multi-ESP fallback routing configured

ab-002463 · sending-pipeline-infrastructure.esp-integration.multi-esp-fallback

Severity: info · active

Why it matters

A single-ESP sending architecture creates a hard dependency on one provider's availability. When SendGrid, Mailgun, or Postmark experiences an outage, all outbound email stops until the provider recovers. For SaaS products where password resets and billing receipts block user actions, even a one-hour outage translates directly into churn and support volume. The availability sub-characteristic of reliability in ISO-25010:2011 (reliability.availability) concerns whether a system stays operational and accessible when it is needed; a second ESP is the minimal architectural hedge against that dependency.

Severity rationale

Info because multi-ESP fallback is a hardening concern; many teams reasonably accept the availability dependency of a single ESP with a good SLA.

Remediation

Implement a simple circuit-breaker router in lib/email/router.ts that switches to a fallback provider after a configurable failure threshold:

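// EmailProvider and EmailMessage are the application's own ESP adapter types (not shown here).
export class EmailRouter {
  private failures = 0
  private openedAt = 0 // ms timestamp of the last time the circuit opened
  constructor(
    private readonly primary: EmailProvider,
    private readonly fallback: EmailProvider,
    private readonly threshold = 5, // consecutive failures before switching
    private readonly resetAfterMs = 60_000 // how long to stay on the fallback
  ) {}
  async send(msg: EmailMessage) {
    if (this.failures >= this.threshold) {
      // Circuit open: use the fallback until the cool-down has elapsed.
      if (Date.now() - this.openedAt < this.resetAfterMs) return this.fallback.send(msg)
      // Cool-down over: half-open the circuit so one more failure reopens it.
      this.failures = this.threshold - 1
    }
    try {
      const result = await this.primary.send(msg)
      this.failures = 0 // any success closes the circuit
      return result
    } catch (err) {
      this.failures++
      if (this.failures >= this.threshold) {
        this.openedAt = Date.now()
        return this.fallback.send(msg)
      }
      // Below the threshold the error propagates; the caller is expected to retry.
      throw err
    }
  }
}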

Store the fallback ESP's credentials in separate environment variables and verify both during the startup health check.
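
The credential check can be sketched as a small presence test that runs before the router is constructed. This is a minimal illustration, not a definitive implementation: the variable names PRIMARY_ESP_API_KEY and FALLBACK_ESP_API_KEY and both function names are hypothetical, and the check only confirms that values are set. A fuller health check would presumably also make an authenticated call to each provider.

```typescript
// Hypothetical variable names; substitute whatever your two providers require.
const REQUIRED_ESP_VARS = {
  primary: ['PRIMARY_ESP_API_KEY'],
  fallback: ['FALLBACK_ESP_API_KEY'],
} as const

type EspRole = keyof typeof REQUIRED_ESP_VARS

// Returns the names of required variables that are unset or blank, grouped by role.
function findMissingEspVars(
  env: Record<string, string | undefined>
): Record<EspRole, string[]> {
  const missing: Record<EspRole, string[]> = { primary: [], fallback: [] }
  for (const role of Object.keys(REQUIRED_ESP_VARS) as EspRole[]) {
    for (const name of REQUIRED_ESP_VARS[role]) {
      const value = env[name]
      if (value === undefined || value.trim() === '') missing[role].push(name)
    }
  }
  return missing
}

// Throws if either provider is missing credentials, so a misconfigured
// fallback is caught at startup rather than during an outage.
function assertEspCredentials(env: Record<string, string | undefined>): void {
  const missing = findMissingEspVars(env)
  const names = [...missing.primary, ...missing.fallback]
  if (names.length > 0) {
    throw new Error(`Missing ESP credentials: ${names.join(', ')}`)
  }
}
```

In a Node.js service this would typically be called once at boot as assertEspCredentials(process.env).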

Detection

ID: multi-esp-fallback
Severity: info
What to look for: Check whether the application supports routing sends to a secondary ESP when the primary is unavailable. Look for fallback logic in the ESP adapter, a circuit breaker pattern around the primary ESP, or configuration for a backup provider. This is a hardening concern, not a correctness requirement — many projects acceptably rely on a single ESP.
Pass criteria: The system can route sends to at least 1 secondary ESP when the primary returns persistent errors or is detected as unhealthy. Count all configured ESP providers — at least 2 must be present. Fallback routing is automatic (circuit breaker) or manual (operator toggle). The secondary ESP's credentials are also stored securely via environment variables.
Fail criteria: A single ESP is the only send path with no fallback, so an ESP outage halts all outbound email until the provider recovers. Alternatively, a secondary ESP is configured but no routing logic exists to activate it.
Skip (N/A) when: The application deliberately relies on a single ESP and accepts the availability dependency — documented in code comments or README.
Detail on fail: "Single ESP with no fallback — if SendGrid is unavailable, all outbound email stops until the outage resolves" or "No circuit breaker or secondary provider configured"

Remediation: Implement a simple circuit breaker that falls back to a secondary provider:

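// lib/email/router.ts
// Assumes EmailProvider, EmailMessage, and a pino-style `logger` are imported
// from the application's own modules (not shown here).
export class EmailRouter {
  private primaryFailures = 0
  private circuitOpenedAt = 0 // ms timestamp of the most recent trip
  private readonly failureThreshold = 5
  private readonly resetAfterMs = 60_000

  constructor(
    private readonly primary: EmailProvider,
    private readonly fallback: EmailProvider
  ) {}

  async send(message: EmailMessage): Promise<{ messageId: string }> {
    if (this.primaryFailures >= this.failureThreshold) {
      if (Date.now() - this.circuitOpenedAt < this.resetAfterMs) {
        logger.warn('Primary ESP circuit open, routing to fallback')
        return this.fallback.send(message)
      }
      // Cool-down elapsed: half-open the circuit so one more failure reopens it.
      this.primaryFailures = this.failureThreshold - 1
    }

    try {
      const result = await this.primary.send(message)
      this.primaryFailures = 0 // Reset on success
      return result
    } catch (err) {
      this.primaryFailures++
      logger.error({ failures: this.primaryFailures }, 'Primary ESP failure')
      if (this.primaryFailures >= this.failureThreshold) {
        // Trip the circuit and deliver this message through the fallback.
        this.circuitOpenedAt = Date.now()
        return this.fallback.send(message)
      }
      // Below the threshold the error propagates; the caller is expected to retry.
      throw err
    }
  }
}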
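
The pass criteria also allow a manual operator toggle in place of an automatic circuit breaker. A rough sketch of that variant follows, under stated assumptions: the EMAIL_FORCE_FALLBACK flag name and both helper functions are hypothetical, and the two interfaces are stand-ins for the application's real EmailProvider and EmailMessage types, which this page does not show.

```typescript
// Stand-in shapes for illustration only; the real adapter types may differ.
interface EmailMessage {
  to: string
  subject: string
  body: string
}

interface EmailProvider {
  send(message: EmailMessage): Promise<{ messageId: string }>
}

// Hypothetical flag: an operator sets EMAIL_FORCE_FALLBACK to "true" or "1"
// to route every send through the secondary provider.
function isFallbackForced(env: Record<string, string | undefined>): boolean {
  const raw = (env['EMAIL_FORCE_FALLBACK'] ?? '').trim().toLowerCase()
  return raw === 'true' || raw === '1'
}

// Picks the provider for the next send based on the operator flag alone.
function selectProvider(
  primary: EmailProvider,
  fallback: EmailProvider,
  env: Record<string, string | undefined>
): EmailProvider {
  return isFallbackForced(env) ? fallback : primary
}
```

Reading the flag on every send, rather than once at startup, lets an operator flip providers without a redeploy if the environment is reloaded at runtime; whether that holds depends on how the deployment injects configuration.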

External references

iso-25010:2011 · reliability.availability — Availability

Taxons

error-resilience operational-readiness

History

2026-04-18·v1.0.0·Initial import from sending-pipeline-infrastructure·automated