When a send batch fails partway through (worker crash, ESP timeout, deployment restart), the first question is which contacts received the email and which did not. If send records have no status field tracking queued, sent, and failed per contact, the only way to reconstruct send state is to cross-reference logs by timestamp, which is slow and error-prone under incident pressure. CWE-391 covers the missing audit trail. ISO 25010 reliability.recoverability requires that recovery from a partial failure be a defined, executable procedure, not an ad-hoc investigation.
High because without per-contact send status and a documented recovery procedure, a partial batch failure forces manual log reconstruction and risks either re-sending to already-reached contacts or missing contacts entirely.
Add a status column to send records and document the recovery query. The schema should support filtering by campaign and status:
// Schema: email_sends table
// status: 'queued' | 'sent' | 'failed' | 'bounced'
// sent_at: timestamp, nullable
// Recovery query: contacts who were queued but never sent
const unsent = await db.emailSend.findMany({
  where: { campaign_id: campaignId, status: 'queued', sent_at: null }
})
// Re-enqueue these contacts only
Document this query and the re-queue steps in docs/runbooks/failed-campaign-recovery.md. A data model with no status field makes this check fail regardless of documentation quality.
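The three recovery steps (identify, de-duplicate, re-send) can be sketched against in-memory records. This is a minimal illustration, not the project's implementation: the EmailSend interface, the camelCase field names, and selectForRequeue are hypothetical names chosen to mirror the email_sends schema above.

```typescript
// Hypothetical in-memory shape mirroring the email_sends table (assumed names).
type SendStatus = "queued" | "sent" | "failed" | "bounced";

interface EmailSend {
  contactId: string;
  campaignId: string;
  status: SendStatus;
  sentAt: Date | null;
}

// Returns the contact IDs that are safe to re-enqueue for one campaign.
function selectForRequeue(records: EmailSend[], campaignId: string): string[] {
  const inCampaign = records.filter((r) => r.campaignId === campaignId);

  // De-duplicate: a contact with any 'sent' record must not be sent again.
  const alreadySent = new Set(
    inCampaign.filter((r) => r.status === "sent").map((r) => r.contactId)
  );

  // Identify: affected contacts are still queued and have no sent timestamp.
  const requeue = new Set<string>();
  for (const r of inCampaign) {
    if (r.status === "queued" && r.sentAt === null && !alreadySent.has(r.contactId)) {
      requeue.add(r.contactId);
    }
  }

  // Re-send: the caller re-enqueues exactly this list, each contact once.
  return Array.from(requeue);
}

const sample: EmailSend[] = [
  { contactId: "a", campaignId: "c1", status: "sent", sentAt: new Date(0) },
  { contactId: "b", campaignId: "c1", status: "queued", sentAt: null },
  { contactId: "b", campaignId: "c1", status: "queued", sentAt: null },
  { contactId: "a", campaignId: "c1", status: "queued", sentAt: null },
  { contactId: "d", campaignId: "c2", status: "queued", sentAt: null },
];
const requeueIds = selectForRequeue(sample, "c1");
console.log(requeueIds);
```

In this sample only contact b is selected for campaign c1: contact a has a stale queued row but also a sent row, and contact d belongs to another campaign. The sketch trusts the stored status. A worker that crashed after the ESP accepted a message but before the row was updated would still show that contact as queued, so a runbook may also want a check against ESP delivery records.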
ID: operational-resilience-email.failure-recovery.failed-send-recovery-documented
Severity: high
What to look for: Enumerate the recovery steps documented for failed send batches: (1) identify affected contacts, (2) de-duplicate against already-sent records, (3) re-send safely. Count how many of these 3 steps are documented. Also check for a failed_at or status column on send records that enables filtering. Without such a column, sent and unsent contacts cannot be reliably distinguished, regardless of documentation.
Pass criteria: There is a documented procedure (not necessarily automated) for recovering a failed send batch that covers all 3 steps: (1) identifying affected contacts, (2) de-duplicating against already-sent records, and (3) re-sending safely without duplicates. The data model must include a per-contact send status field (e.g., status: 'queued' | 'sent' | 'failed') that supports the recovery query.
Fail criteria: No documentation exists for this scenario. Or the data model does not track per-contact send status, making it impossible to determine who was missed. Or the procedure exists but covers fewer than 3 steps.
Skip (N/A) when: The system only sends transactional one-off emails where re-send is always safe and obvious — confirmed by the absence of campaign or batch-send models.
Detail on fail: "No runbook or documentation for failed batch recovery — operator would need to manually reconstruct send state from logs" or "Send records have no status field — cannot distinguish sent from unsent contacts after a partial failure"
Remediation: Add a status field to send records and document recovery steps:
// Schema: email_sends table
// status: 'queued' | 'sent' | 'failed' | 'bounced'
// sent_at: timestamp, nullable
// Recovery query: contacts who were queued but never sent
const unsent = await db.emailSend.findMany({
  where: { campaign_id: campaignId, status: 'queued', sent_at: null }
})
// Re-enqueue these contacts only
Document this query and the re-queue steps in a runbook file (docs/runbooks/failed-campaign-recovery.md).
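The recovery query only works if the sending worker keeps the status field current for each contact. The following is a rough sketch of that bookkeeping under stated assumptions: SendRow, processQueued, and the deliver callback are hypothetical names, and deliver is a synchronous stand-in for what would normally be an asynchronous ESP call.

```typescript
type SendStatus = "queued" | "sent" | "failed" | "bounced";

// Hypothetical row shape; field names are assumptions for illustration.
interface SendRow {
  contactId: string;
  status: SendStatus;
  sentAt: Date | null;
}

// `deliver` stands in for the ESP call and is expected to throw on failure.
function processQueued(rows: SendRow[], deliver: (contactId: string) => void): void {
  for (const row of rows) {
    // Skip contacts that were already handled in an earlier run.
    if (row.status !== "queued") continue;
    try {
      deliver(row.contactId);
      // Record success per contact, immediately after the send.
      row.status = "sent";
      row.sentAt = new Date();
    } catch (_err) {
      // Record failure per contact; sentAt stays null so the row is filterable.
      row.status = "failed";
    }
  }
}

const rows: SendRow[] = [
  { contactId: "a", status: "queued", sentAt: null },
  { contactId: "b", status: "queued", sentAt: null },
  { contactId: "c", status: "sent", sentAt: new Date(0) },
];
const delivered: string[] = [];
processQueued(rows, (id) => {
  if (id === "b") throw new Error("simulated ESP timeout");
  delivered.push(id);
});
console.log(rows.map((r) => `${r.contactId}:${r.status}`).join(" "));
```

Updating each row right after its own send, rather than once at the end of the batch, is what lets a later query separate reached contacts from missed ones. Rows marked 'failed' here would not match a recovery query that filters on 'queued', so whether and how to retry them is a separate decision for the runbook.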