Error tracking without alert thresholds is passive observation. A team that checks the Sentry dashboard manually once per day will miss an error spike that began at 2 AM and drove 400 failed signups before business hours. NIST SP 800-53 IR-5 (incident monitoring) requires automated monitoring and notification. Under ISO/IEC 25010, availability (a sub-characteristic of reliability) is reduced by every minute between spike onset and team awareness; unconfigured alerts turn those minutes into hours. Alert thresholds are also the only objective way to distinguish a transient blip from a systemic regression.
Severity is info because alert gaps increase mean time to detection rather than causing failures directly — but undetected production incidents compound quietly, often reaching high user impact before anyone notices.
Configure at minimum three alert rules in your error tracking or monitoring service: error rate, p95 response time, and uptime. After setup, test that each rule actually fires.
# docs/runbook/alert-thresholds.md
## Active Alerts
| Metric | Threshold | Channel | Response SLA |
|-----------------|-----------------|---------------|--------------|
| Error rate | >1% of requests | Slack #ops | 15 min |
| p95 latency | >2000 ms | Slack #ops | 30 min |
| Uptime | <99.9% (5 min window) | PagerDuty | 5 min |
## Testing
Trigger test: send `POST /api/chaos/error-spike` in staging.
Last verified: 2026-04-01
Document the last test date and next scheduled test so alert configuration does not silently drift into a non-firing state.
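The runbook's thresholds can also be evaluated programmatically, for example in a periodic self-check. A minimal sketch, assuming a hypothetical `MetricsWindow` shape fed by your monitoring API (the names here are illustrative, not any vendor's SDK):

```typescript
// Thresholds matching the runbook table above:
// errorRate is a fraction of requests (0.01 = 1%), latency is in milliseconds.
const THRESHOLDS = { errorRate: 0.01, p95LatencyMs: 2000, uptimePercent: 99.9 };

// Hypothetical aggregate over one evaluation window (e.g. 5 minutes).
interface MetricsWindow {
  totalRequests: number;
  failedRequests: number;
  p95LatencyMs: number;
  uptimePercent: number;
}

// Returns a human-readable description of each breached threshold;
// an empty array means all three alert conditions are healthy.
function breaches(m: MetricsWindow): string[] {
  const out: string[] = [];
  const rate = m.failedRequests / m.totalRequests;
  if (rate > THRESHOLDS.errorRate)
    out.push(`error rate ${(rate * 100).toFixed(2)}% > 1%`);
  if (m.p95LatencyMs > THRESHOLDS.p95LatencyMs)
    out.push(`p95 latency ${m.p95LatencyMs} ms > 2000 ms`);
  if (m.uptimePercent < THRESHOLDS.uptimePercent)
    out.push(`uptime ${m.uptimePercent}% < 99.9%`);
  return out;
}
```

Keeping the check pure (metrics in, breach descriptions out) makes it trivial to unit-test the thresholds themselves, separately from notification delivery.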
ID: error-resilience.graceful-degradation-shutdown.alert-thresholds
Severity: info
What to look for: Count all alert configurations in monitoring services. Enumerate thresholds for error rate (no more than 1%), response time (no more than 2 seconds p95), and uptime (at least 99.9%). Check whether error rate thresholds are documented and alerts configured (in error tracking service, monitoring tool, or runbook). Verify team receives notification within 5 minutes of threshold breach.
Pass criteria: Alert thresholds for error rate spikes are documented; team receives notification within 5 minutes of threshold breach. At least 3 alert thresholds must be configured covering error rate, response time, and availability.
Fail criteria: No alert thresholds documented or configured; team does not receive notifications.
Skip (N/A) when: The application has no production deployment or error tracking.
Cross-reference: For error tracking integration, see error-tracking-service. For recovery time objectives, see recovery-time-objectives.
Detail on fail: "No alert thresholds configured. Error spikes will not be noticed by the team" or "Thresholds documented but alerts configured to email only, not Slack — may take hours to notice"
Remediation: Document and test alert thresholds:
// lib/monitoring/alerts.ts — alert thresholds
// errorRate is a fraction of requests (0.01 = 1%); p95Latency is in milliseconds.
const THRESHOLDS = { errorRate: 0.01, p95Latency: 2000, uptimePercent: 99.9 }
# Error Response Runbook
## Alert Thresholds
- Error rate spike: >1% of requests returning 5xx (normal: <0.5%)
- Notification channels: Slack #incidents, PagerDuty, email
- Response time: Slack within 5 minutes, PagerDuty immediate
## Testing
- Test alert firing monthly: `curl -X POST https://staging.example.com/api/chaos/error-spike` (staging only)
- Last tested: 2026-02-15
- Next test: 2026-03-15
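Notification delivery is where alert configurations most often rot (the "email only" failure mode in the detail-on-fail text). A sketch of a Slack notifier, assuming the incoming-webhook URL lives in a `SLACK_WEBHOOK_URL` environment variable (the variable name is an assumption; Slack incoming webhooks do accept a JSON body with a `text` field, and `fetch` is global in Node 18+):

```typescript
// Build the Slack payload separately so it can be unit-tested without a network call.
function buildSlackPayload(message: string): string {
  return JSON.stringify({ text: `:rotating_light: ${message}` });
}

// Posts a breach message to the Slack webhook. Fails loudly when the
// webhook URL is unset, so a misconfigured notifier cannot drop alerts silently.
async function notifyBreach(message: string): Promise<void> {
  const url = process.env.SLACK_WEBHOOK_URL;
  if (!url) throw new Error("SLACK_WEBHOOK_URL not set: alert would be dropped silently");
  await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildSlackPayload(message),
  });
}
```

Throwing on a missing webhook URL is deliberate: a notifier that no-ops when unconfigured is exactly the silent drift this rule exists to catch.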