ESP health check validates connectivity before processing

ab-002461 · sending-pipeline-infrastructure.esp-integration.esp-health-check

Severity: low · active

Why it matters

An email worker that starts processing jobs before validating its ESP credentials discovers a misconfiguration only when the first real send fails in production, potentially after dozens of jobs have been dequeued, acknowledged, and logged as in-flight. A rotated API key, a wrong environment variable, or a network policy blocking outbound HTTPS to the ESP produces a failure that surfaces only once real jobs are being processed. In terms of ISO-25010:2011 reliability.availability, such faults are better detected at startup than at runtime, where the blast radius is larger.

Severity rationale

Low because the failure is discovered quickly (on first job attempt) rather than going undetected, but the absence of a startup check delays diagnosis and allows job state to become inconsistent.

Remediation

Add a pre-startup health check in workers/email.worker.ts that exits the process with a non-zero code if the ESP is unreachable or rejects the configured API key:

async function startEmailWorker() {
  const healthy = await emailProvider.healthCheck()
  if (!healthy) {
    logger.fatal('ESP health check failed — not starting')
    process.exit(1)
  }
  logger.info('ESP connectivity verified — starting worker')
  const worker = new Worker('email', processEmailJob, { connection })
  worker.on('error', (err) => logger.error(err, 'Worker error'))
}

startEmailWorker().catch((err) => {
  logger.fatal(err, 'Failed to start email worker')
  process.exit(1)
})

Verify all configured ESP providers, not just the primary, before the worker accepts its first job.
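
One way to cover every provider is sketched below. It assumes each provider exposes a healthCheck() method that resolves to a boolean, as emailProvider.healthCheck() does above. The EspProvider interface, the HealthResult shape, and the withTimeout and checkAllProviders helpers are hypothetical names introduced for illustration, and the 30-second default simply mirrors the startup window named in the detection criteria.

```typescript
// Hypothetical provider shape; this entry only shows emailProvider.healthCheck().
interface EspProvider {
  name: string
  healthCheck(): Promise<boolean>
}

interface HealthResult {
  name: string
  healthy: boolean
  reason?: string
}

// Reject if the promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms)
    promise.then(
      (value) => { clearTimeout(timer); resolve(value) },
      (err) => { clearTimeout(timer); reject(err) },
    )
  })
}

// Check every configured provider in parallel.
// A thrown error or a timeout is reported as unhealthy rather than propagated.
async function checkAllProviders(
  providers: EspProvider[],
  timeoutMs = 30000,
): Promise<HealthResult[]> {
  return Promise.all(
    providers.map(async (p): Promise<HealthResult> => {
      try {
        const healthy = await withTimeout(p.healthCheck(), timeoutMs)
        if (healthy) return { name: p.name, healthy: true }
        return { name: p.name, healthy: false, reason: 'healthCheck returned false' }
      } catch (err) {
        const reason = err instanceof Error ? err.message : String(err)
        return { name: p.name, healthy: false, reason }
      }
    }),
  )
}
```

Under these assumptions, startEmailWorker would await checkAllProviders with the full provider list, log each result whose healthy field is false, and call process.exit(1) if any are present, before constructing the Worker. Running the checks in parallel keeps the total wait close to the slowest single provider rather than the sum of all of them.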

Detection

ID: esp-health-check
Severity: low
What to look for: Look for a startup or pre-flight check that validates the ESP connection and API key before the worker begins processing jobs. The health check must run within the first 30 seconds of worker startup. Also count the number of ESP providers configured — each must have its own health check. An ESP misconfiguration (wrong key, revoked credential) should be caught at deploy time, not when the first email fails in production.
Pass criteria: A health check or startup validation tests the ESP API key and connectivity within the first 30 seconds of worker startup, before the worker processes any jobs. The worker must not start processing until the health check passes. Count the number of ESP providers that have health checks: all configured ESPs must be validated. A failed health check produces a clear log entry and causes the worker process to exit with a non-zero code.
Fail criteria: No ESP health check exists. An invalid API key is discovered only when the first production job fails, potentially after many retry cycles.
Skip (N/A) when: The ESP SDK validates the API key on first use and throws a clear error that the monitoring system is confirmed to alert on.
Detail on fail: "No ESP health check — invalid SENDGRID_API_KEY would go undetected until the first send attempt fails" or "Worker starts processing immediately without validating ESP connectivity"

Remediation: Add a startup validation that runs before the worker begins processing:

// workers/email.worker.ts
async function startEmailWorker() {
  // Validate ESP connectivity before processing any jobs
  const healthy = await emailProvider.healthCheck()
  if (!healthy) {
    logger.fatal('ESP health check failed — email worker not starting')
    process.exit(1)
  }

  logger.info('ESP connectivity verified — email worker starting')

  const worker = new Worker('email', processEmailJob, { connection })
  worker.on('error', (err) => logger.error(err, 'Email worker error'))
  worker.on('failed', (job, err) => logger.error({ jobId: job?.id, err }, 'Email job failed'))
}

startEmailWorker().catch((err) => {
  logger.fatal(err, 'Failed to start email worker')
  process.exit(1)
})
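
The emailProvider.healthCheck() call above is not defined in this entry. As a minimal sketch of what such a check might do, the function below takes a probe that performs one cheap authenticated request against the ESP and classifies the outcome, so that the fatal log can distinguish a rejected key from a network problem. The Probe type, the HealthOutcome shape, and classifyEspHealth are hypothetical names; which endpoint the probe calls depends on the ESP and is deliberately left out.

```typescript
// Hypothetical: performs one lightweight authenticated request to the ESP
// and resolves with the HTTP status code. Network failures are expected to reject.
type Probe = () => Promise<{ status: number }>

type HealthOutcome =
  | { healthy: true }
  | { healthy: false; kind: 'credentials' | 'unreachable' | 'unexpected'; detail: string }

async function classifyEspHealth(probe: Probe): Promise<HealthOutcome> {
  try {
    const { status } = await probe()
    if (status >= 200 && status < 300) {
      return { healthy: true }
    }
    if (status === 401 || status === 403) {
      // The request reached the ESP but the key was rejected (rotated or revoked).
      return { healthy: false, kind: 'credentials', detail: `ESP rejected the API key (HTTP ${status})` }
    }
    return { healthy: false, kind: 'unexpected', detail: `ESP returned HTTP ${status}` }
  } catch (err) {
    // The request never completed: DNS failure, blocked egress, TLS error, and so on.
    const detail = err instanceof Error ? err.message : String(err)
    return { healthy: false, kind: 'unreachable', detail }
  }
}
```

A healthCheck() built on this could return outcome.healthy and include outcome.kind and outcome.detail in the fatal log line, which makes a wrong environment variable easier to tell apart from a network policy blocking outbound HTTPS.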

External references

iso-25010:2011 · reliability.availability — Availability

Taxons

operational-readiness

History

2026-04-18·v1.0.0·Initial import from sending-pipeline-infrastructure·automated