Recovery time objectives documented

ab-001298 · error-resilience.graceful-degradation-shutdown.recovery-time-objectives

Severity: info · active

Why it matters

Without a documented RTO (Recovery Time Objective) and RPO (Recovery Point Objective), teams have no shared definition of what a successful recovery looks like, no way to tell when an incident is resolved, and no accountability for response speed. NIST SP 800-53 CP-2 (contingency planning) requires organizations to define recovery objectives. ISO 25010 reliability.recoverability requires measurable recovery capability, which cannot exist without first defining what recovery means. In vibe-coded projects, missing RTOs are common because AI coding tools do not prompt for operational documentation; it must be written explicitly.

Severity rationale

Info severity because missing RTO documentation is an operational gap rather than a runtime defect — but the gap becomes critical the moment an incident occurs and the team has no defined resolution criteria.

Remediation

Document RTO and RPO for at least your three most critical system components. Store the document in docs/disaster-recovery.md and reference it from your incident runbook.

<!-- docs/disaster-recovery.md -->
## Recovery Objectives

| Component          | RTO       | RPO        | Strategy                          |
|--------------------|-----------|------------|-----------------------------------|
| API server         | 5 min     | 0          | Auto-restart via Railway/Fly.io   |
| Database           | 15 min    | 1 hour     | Restore from hourly snapshot      |
| Payment processing | 30 min    | 0          | Queue + replay via webhook retry  |

## Review cadence
Update after any incident that breached the RTO or RPO above.
Last reviewed: 2026-04-18

Tie RTO targets to your alert thresholds (see alert-thresholds). The RTO clock starts when the failure begins, not when someone notices it, so if detection takes 15 minutes and the RTO is 5 minutes, the target cannot be met.
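The consistency check between detection time and RTO can be automated. The following is a minimal sketch, not part of the rule itself: the `RecoveryTarget` type, the `infeasible_targets` helper, and the detection times are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    component: str
    rto: timedelta             # documented recovery time objective
    detection_time: timedelta  # assumed worst-case time for an alert to fire


def infeasible_targets(targets):
    """Return targets whose detection time alone consumes the entire RTO."""
    return [t for t in targets if t.detection_time >= t.rto]


# Hypothetical values: the first entry mirrors the 15-minute detection,
# 5-minute RTO case described above.
targets = [
    RecoveryTarget("API server", timedelta(minutes=5), timedelta(minutes=15)),
    RecoveryTarget("Database", timedelta(minutes=15), timedelta(minutes=2)),
]

for t in infeasible_targets(targets):
    print(f"{t.component}: detection takes {t.detection_time}, RTO is {t.rto}")
```

A check like this only flags targets that are impossible on paper; a target that passes still needs enough time left after detection for the actual repair.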

Detection

ID: recovery-time-objectives
Severity: info
What to look for: Before evaluating, extract and quote any RTO/RPO documentation found. Count all recovery-related documentation and configurations. Note whether RTO (Recovery Time Objective) and RPO (Recovery Point Objective) are each defined. Look for documented RTOs for common error scenarios such as database down, API down, and network partition. Check whether the incident response playbook references these scenarios.
Pass criteria: Recovery time objectives for common error scenarios are documented, and the incident response playbook references error types and expected recovery times. At least 1 RTO and 1 RPO must be documented with specific time targets.
Fail criteria: No RTOs or RPOs are documented, or the playbook lacks specific guidance for error scenarios.
Skip (N/A) when: The application is in early development or is not production-critical.
Cross-reference: For chaos testing that validates recovery, see chaos-testing. For alert thresholds, see alert-thresholds.
Detail on fail: "No RTOs documented. Team doesn't know expected recovery time for common failures" or "Incident playbook exists but doesn't reference specific error types or recovery strategies"
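
The "at least 1 RTO and 1 RPO with specific time targets" criterion can be approximated with a text scan. This is a rough sketch under stated assumptions: it only recognizes the `RTO: 15 minutes` bullet style, not table cells, and the `find_objectives` and `passes_check` names are hypothetical rather than part of any audit tool.

```python
import re

# A specific time target: a number followed by a time unit.
TIME_TARGET = r"(\d+(?:\.\d+)?)\s*(seconds?|secs?|minutes?|mins?|hours?|hrs?|days?)\b"
OBJECTIVE = re.compile(rf"\b(RTO|RPO)\b\s*:\s*{TIME_TARGET}", re.IGNORECASE)


def find_objectives(text):
    """Return (kind, value, unit) tuples for entries such as 'RTO: 15 minutes'."""
    return [
        (kind.upper(), float(value), unit.lower())
        for kind, value, unit in OBJECTIVE.findall(text)
    ]


def passes_check(text):
    """True when at least one RTO and one RPO carry a specific time target."""
    kinds = {kind for kind, _, _ in find_objectives(text)}
    return {"RTO", "RPO"} <= kinds


sample = """## Recovery Objectives
- RTO: 15 minutes (time to restore service)
- RPO: 1 hour (maximum acceptable data loss)
"""
print(find_objectives(sample))
print(passes_check(sample))
```

A scan like this cannot judge whether the playbook actually references the scenarios, so it complements the manual review rather than replacing it.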

Remediation: Document your RTO and RPO, then list the expected recovery time for each common scenario:

<!-- docs/disaster-recovery.md — RTO/RPO documentation -->
## Recovery Objectives
- RTO: 15 minutes (time to restore service)
- RPO: 1 hour (maximum acceptable data loss)

# Incident Response & Recovery Times

## Common Scenarios

### Database Connection Lost
- Detection: Error rate spikes above 5%
- Expected RTO: 5 minutes (auto-reconnect via connection pool)
- Manual remediation: Restart database or switch to replica

### Payment API Down
- Detection: All payment requests fail
- Expected RTO: 15 minutes (queue payments, retry when API recovers)
- Manual remediation: Contact payment provider support

### Third-Party Service Down (e.g., email)
- Detection: Email sending fails
- Expected RTO: 30 minutes (queue emails, retry)
- Manual remediation: Implement fallback (SMS, in-app notification)
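
Once scenarios like these are written down, each incident can be compared against its expected RTO to decide whether the document needs review. The sketch below assumes the three RTO values from the scenarios above; the scenario keys, the `rto_breached` helper, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Expected RTOs copied from the scenarios above; the keys are illustrative.
EXPECTED_RTO = {
    "database-connection-lost": timedelta(minutes=5),
    "payment-api-down": timedelta(minutes=15),
    "third-party-service-down": timedelta(minutes=30),
}


def rto_breached(scenario, outage_started_at, resolved_at):
    """True when the measured recovery time exceeds the documented RTO.

    Recovery time is measured from the start of the outage, so time spent
    detecting the failure counts against the RTO.
    """
    return (resolved_at - outage_started_at) > EXPECTED_RTO[scenario]


# Hypothetical incident: 12 minutes from outage start to resolution.
started = datetime(2026, 4, 18, 10, 0)
resolved = datetime(2026, 4, 18, 10, 12)

print(rto_breached("database-connection-lost", started, resolved))  # True
print(rto_breached("payment-api-down", started, resolved))          # False
```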

External references

iso-25010:2011 · reliability.recoverability
nist:rev5 · CP-2 — Contingency Plan

Taxons

operational-readiness error-resilience

History

2026-04-18·v1.0.0·Initial import from error-resilience·automated