Runbook for deliverability drop

ab-001991 · operational-resilience-email.incident-response.deliverability-drop-runbook

Severity: high · active

Why it matters

A deliverability drop — bounce rate spike, complaint rate above Gmail's 0.1% threshold, or inbox placement collapse — requires immediate, structured triage. Without a runbook, operators improvise: they may pause all sends instead of isolating the affected campaign, spend 30 minutes finding the right dashboard query, or miss the SPF/DKIM check that would have identified the root cause in 2 minutes. NIST SP 800-53 IR-8 (Incident Response Plan) requires documented procedures for anticipated failure modes. RFC 5321 bounce handling standards define the specific events this runbook must cover.

Severity rationale

High because an undocumented deliverability incident response guarantees slower triage, which allows complaint rates to climb further past ESP thresholds during the response window.

Remediation

Create docs/runbooks/deliverability-incident.md with at least four sections: detection thresholds, triage checklist, immediate actions, and escalation contacts:

# Deliverability Incident Runbook
## Detection
- Bounce rate exceeds 5% over 1-hour window
- Complaint rate above 0.1% (Gmail threshold)
## Triage
1. Check bounce breakdown: `SELECT domain, count(*) FROM bounces WHERE ...`
2. Review Postmaster Tools for Gmail/Yahoo reputation
## Immediate Actions
- Pause affected campaign via admin API
- Verify DNS records (SPF, DKIM, DMARC)
## Escalation
- On-call: [name] via PagerDuty
- ESP support: [contact info]

The runbook must reference at least one numeric threshold, such as "bounce rate exceeds 5%", to avoid the generic "contact ops" anti-pattern.
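The section and threshold requirements above can be checked mechanically. The following is a minimal sketch, not the audit's actual implementation: the function name `check_runbook`, the keyword map, and the rule that any number followed by a percent sign counts as a threshold are all assumptions made for illustration.

```python
import re

# Hypothetical keyword map: each required section and the heading
# substrings that are accepted as a match for it.
REQUIRED_SECTIONS = {
    "detection": ("detection",),
    "triage": ("triage",),
    "immediate actions": ("immediate action", "pause"),
    "escalation": ("escalation",),
}

# Markdown headings of level 2 or deeper (the level-1 title is ignored).
HEADING_RE = re.compile(r"^#{2,}[ \t]+(.*\S)[ \t]*$", re.MULTILINE)
# Assumption: a numeric threshold is a number followed by "%", e.g. "5%".
THRESHOLD_RE = re.compile(r"\d+(?:\.\d+)?\s*%")


def check_runbook(text: str) -> dict:
    """Report which required sections and numeric thresholds a runbook has."""
    headings = [h.lower() for h in HEADING_RE.findall(text)]
    found = sorted(
        name
        for name, keywords in REQUIRED_SECTIONS.items()
        if any(k in h for h in headings for k in keywords)
    )
    missing = sorted(set(REQUIRED_SECTIONS) - set(found))
    thresholds = THRESHOLD_RE.findall(text)
    return {
        "sections_found": found,
        "missing_sections": missing,
        "thresholds": thresholds,
        # Pass only when every section is present and a threshold is cited.
        "passes": not missing and bool(thresholds),
    }
```

Called on the contents of docs/runbooks/deliverability-incident.md, the returned dictionary lists the sections that were found, which supports the requirement to enumerate them. A keyword match on headings is a coarse check and does not confirm that a section contains usable steps.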

Detection

ID: deliverability-drop-runbook
Severity: high
What to look for: Look for a runbook document (in docs/, runbooks/, or a wiki link referenced in code) that describes step-by-step how to respond when deliverability drops — bounce rate spikes, complaint rate rises above the Gmail 0.1% threshold, or open rates collapse. The runbook should cover: how to detect the drop, initial triage steps (which campaigns, which IP, which domain), escalation path, and resolution actions (pause campaigns, check DNS, contact ESP support). This complements the Deliverability Engineering Audit's DNS monitoring requirements.
Pass criteria: A deliverability incident runbook exists with at least 4 sections: detection steps, triage checklist, campaign pause procedure, and escalation contacts. Enumerate all sections present in the runbook. The runbook must include at least 1 specific threshold (e.g., "bounce rate exceeds 5%", "complaint rate above 0.1%"). It is NOT a pass when only a generic "escalate to ops" note exists with no specific triage steps.
Fail criteria: No deliverability runbook exists, only a generic "escalate to ops" note exists with no specific triage steps, or the runbook has fewer than 4 sections.
Skip (N/A) when: The project sends no email — confirmed by the absence of any ESP SDK in package.json.
Detail on fail: "No deliverability incident runbook found — operators would need to improvise triage during an active incident" or "README mentions 'contact ESP support' but has no structured triage steps before escalation"
Remediation: Create docs/runbooks/deliverability-incident.md with sections for:
- Detection: which metric crossed which threshold
- Initial triage: check bounce breakdown by domain, campaign, and IP; review Postmaster Tools for Gmail reputation
- Immediate actions: pause affected campaigns, check DNS/SPF/DKIM (ref: Deliverability Engineering Audit findings)
- Escalation: who to call and when, ESP support contact procedure
- Resolution: criteria for resuming sends, warm-up plan if IP was flagged
```
# Deliverability Incident Runbook
## Detection
- Bounce rate exceeds 5% over 1-hour window
- Complaint rate above 0.1% (Gmail threshold)
## Triage
1. Check bounce breakdown: `SELECT domain, count(*) FROM bounces WHERE ...`
2. Review Postmaster Tools for Gmail/Yahoo reputation
## Immediate Actions
- Pause affected campaign via admin API
- Verify DNS records (SPF, DKIM, DMARC)
## Escalation
- On-call: [name] via PagerDuty
- ESP support: [contact info]
```
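The Detection and Triage entries in the template can be sketched in code. This is an illustration under stated assumptions, not a prescribed implementation: `WindowStats`, `breached_thresholds`, and `bounce_breakdown` are hypothetical names, the counters are assumed to come from your own send logs, and the complaint rate uses sent volume as its denominator as a simplification (mailbox providers may compute it differently).

```python
from collections import Counter
from dataclasses import dataclass

# Thresholds copied from the runbook template above.
BOUNCE_RATE_LIMIT = 0.05      # 5% over the evaluation window
COMPLAINT_RATE_LIMIT = 0.001  # 0.1% (Gmail threshold)


@dataclass
class WindowStats:
    """Hypothetical counters for one evaluation window, e.g. the last hour."""
    sent: int
    bounced: int
    complaints: int


def breached_thresholds(stats: WindowStats) -> list[str]:
    """Return the names of the detection thresholds this window exceeds."""
    if stats.sent == 0:
        return []  # No traffic in the window, so no rate can be computed.
    breaches = []
    if stats.bounced / stats.sent > BOUNCE_RATE_LIMIT:
        breaches.append("bounce_rate")
    if stats.complaints / stats.sent > COMPLAINT_RATE_LIMIT:
        breaches.append("complaint_rate")
    return breaches


def bounce_breakdown(bounced_recipients: list[str], top: int = 5) -> list[tuple[str, int]]:
    """Count bounces per recipient domain, most affected domains first."""
    domains = Counter(
        addr.rsplit("@", 1)[-1].lower() for addr in bounced_recipients
    )
    return domains.most_common(top)
```

For example, `breached_thresholds(WindowStats(sent=10000, bounced=620, complaints=4))` flags only the bounce rate, and `bounce_breakdown` then shows whether the bounces concentrate on one provider, which helps decide between pausing a single campaign and investigating DNS or IP reputation.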

External references

iso-25010:2011 · reliability.recoverability — Reliability / Recoverability — structured runbook required to restore deliverability after incident
nist:rev5 · IR-8 — NIST 800-53 IR-8: Incident Response Plan
external · RFC-5321-bounce — RFC 5321 §3.6 — SMTP bounce handling requires distinguishing permanent vs. transient failures for triage
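
The permanent versus transient distinction in the RFC 5321 entry maps onto SMTP reply code classes: 4yz replies are transient negative completions and 5yz replies are permanent negative completions. A minimal sketch of that split for triage follows; the function names and bucket labels are hypothetical, and real bounce processing would typically also inspect enhanced status codes and ESP-specific bounce categories.

```python
def classify_smtp_reply(code: int) -> str:
    """Map an SMTP reply code to a triage bucket by its first digit."""
    if 400 <= code <= 499:
        return "transient"   # 4yz: temporary failure, a retry may succeed
    if 500 <= code <= 599:
        return "permanent"   # 5yz: permanent failure, retrying unchanged will not help
    return "not_a_failure"   # 2yz and 3yz replies are not bounces


def count_by_class(codes: list[int]) -> dict[str, int]:
    """Tally reply codes per bucket so triage can see which class dominates."""
    counts = {"transient": 0, "permanent": 0, "not_a_failure": 0}
    for code in codes:
        counts[classify_smtp_reply(code)] += 1
    return counts
```

A spike dominated by permanent failures points toward list quality or blocking, while a spike of transient failures more often indicates throttling or temporary deferrals.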

Taxons

operational-readiness

History

2026-04-18 · v1.0.0 · Initial import from operational-resilience-email · automated
