When production goes down at 2am, the difference between a 15-minute resolution and a 3-hour outage is often whether your team has a documented incident runbook. NIST SP 800-53 IR-8 requires an incident response plan; SOC 2 CC9.1 requires risk-mitigation procedures; NIST CSF RS.RP-1 requires that the response plan be executed during or after an incident. Without defined severity levels, teams cannot triage — every problem looks like a P1. Without escalation contacts and communication channels, the on-call engineer is making high-stakes decisions alone, without the authority to act.
Medium because the absence of an incident runbook degrades response quality significantly during outages, but its impact is realized only when an incident occurs, not on every deployment.
Create INCIDENT_RESPONSE.md at the repo root. At minimum it must define severity levels, escalation contacts, and the communication plan.
# Incident Response Runbook
## Severity Levels
| Level | Response SLA | Example |
|-------|--------------|---------|
| P1 — Critical | 15 min | Production down, data loss |
| P2 — High | 1 hour | Core feature unavailable |
| P3 — Medium | 4 hours | Minor feature broken, workaround exists |
## Escalation
- On-call rotation: [link to PagerDuty/schedule]
- Slack: #incidents
- P1 only: page on-call + notify engineering lead
## Communication
1. Declare incident in #incidents: "P2 incident — login flow returning 500"
2. Update status page at status.your-app.com within 10 minutes
3. Post resolution summary within 30 minutes of recovery
Link the runbook from your DEPLOYMENT.md and README so on-call engineers can find it under pressure.
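The declaration step in the communication plan can also be automated so the on-call engineer posts a consistently formatted message with one command. A minimal sketch in Python, assuming a Slack incoming webhook whose URL lives in a `SLACK_WEBHOOK_URL` environment variable (the variable name and helper functions are illustrative, not part of the runbook):

```python
import json
import os
import urllib.request


def incident_payload(severity: str, summary: str) -> dict:
    """Build the Slack message for the #incidents declaration step,
    matching the runbook's format: "P2 incident — login flow returning 500"."""
    return {"text": f"{severity} incident — {summary}"}


def declare_incident(severity: str, summary: str) -> None:
    """POST the declaration to the incoming webhook if one is configured;
    otherwise print the message so the engineer can paste it manually."""
    url = os.environ.get("SLACK_WEBHOOK_URL")  # assumed env var, not from the runbook
    payload = incident_payload(severity, summary)
    if not url:
        print(payload["text"])
        return
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Keeping the message format in one function means the Slack post, the status-page update, and the resolution summary can all reuse the same severity/summary pair.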
ID: deployment-readiness.rollback-recovery.incident-runbook
Severity: medium
What to look for: RUNBOOK.md, INCIDENT_RESPONSE.md, or similar documentation. It should define severity levels (P1/P2/P3 or Critical/High/Medium), escalation contacts, an on-call rotation, and communication channels (Slack, status page, etc.).
Pass criteria: An incident response runbook exists, defines severity levels, includes escalation contacts or on-call rotation, and documents communication channels.
Fail criteria: No incident response documentation found, or documentation is incomplete (missing severity levels, contacts, or communication plan).
Skip (N/A) when: The project is not planned for production.
Detail on fail: "No incident response runbook found in repository." or "Runbook exists but does not define severity levels or escalation contacts."
Remediation: Create INCIDENT_RESPONSE.md in your repository:
# Incident Response Runbook
## Severity Levels
| Level | Response Time | Example |
|-------|---------------|---------|
| P1 - Critical | 15 min | Production down, data loss risk |
| P2 - High | 1 hour | Major feature unavailable, severe degradation |
| P3 - Medium | 4 hours | Minor feature broken, workaround available |
## Escalation
- **On-call:** See on-call schedule at [link to schedule]
- **Slack channel:** #incidents
- **Page the team:** PagerDuty integration (P1 only)
## Communication Plan
1. Post incident declaration in #incidents: "P2 incident: Login flow down"
2. Update status page at status.your-app.com
3. For P1: Page on-call engineer, notify leadership
4. Post incident summary 30 min after resolution
## Common Procedures
### Database Connection Pool Exhausted
1. Check connection count: `SELECT COUNT(*) FROM pg_stat_activity;`
2. Kill idle connections: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND pid <> pg_backend_pid();`
3. Restart application: [deployment steps]
4. Verify health endpoint recovers
### High Error Rate (> 5%)
1. Check error tracking dashboard (Sentry)
2. Identify error pattern from stack traces
3. If caused by recent deploy, execute rollback
4. Otherwise, apply hotfix
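The high-error-rate procedure hinges on a threshold check plus a judgment call about recent deploys. A minimal sketch of that triage decision in Python (the 5% threshold comes from the runbook; the function names and the `deployed_recently` flag are illustrative):

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of requests that errored; 0.0 when there is no traffic."""
    return errors / total if total else 0.0


def should_rollback(errors: int, total: int, deployed_recently: bool,
                    threshold: float = 0.05) -> bool:
    """Mirror the runbook's decision: roll back only when the error rate
    exceeds the 5% threshold AND the spike coincides with a recent deploy.
    Otherwise the procedure calls for a hotfix, not a rollback."""
    return error_rate(errors, total) > threshold and deployed_recently
```

Encoding the threshold once, where alerting and the runbook can both reference it, avoids the dashboard and the on-call engineer disagreeing about what counts as "high".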