Error tracking without alert thresholds is passive observation. A team that checks the Sentry dashboard manually once per day will miss an error spike that began at 2 AM and drove 400 failed signups before business hours. NIST SP 800-53 IR-5 (incident monitoring) requires automated monitoring and notification. Under ISO/IEC 25010, availability (a sub-characteristic of reliability) is reduced by every minute between spike onset and team awareness; unconfigured alerts turn those minutes into hours. Alert thresholds are also the only objective way to distinguish a transient blip from a systemic regression.
Severity is info because alert gaps increase mean time to detection rather than causing failures directly — but undetected production incidents compound quietly, often reaching high user impact before anyone notices.
Configure at minimum three alert rules in your error tracking or monitoring service: error rate, p95 response time, and uptime. After setup, test that each rule actually fires.
# docs/runbook/alert-thresholds.md
## Active Alerts
| Metric | Threshold | Channel | Response SLA |
|-----------------|-----------------|---------------|--------------|
| Error rate | >1% of requests | Slack #ops | 15 min |
| p95 latency | >2000 ms | Slack #ops | 30 min |
| Uptime | <99.9% (5 min window) | PagerDuty | 5 min |
## Testing
Trigger test: send `POST /api/chaos/error-spike` in staging.
Last verified: 2026-04-01
Document the last test date and next scheduled test so alert configuration does not silently drift into a non-firing state.
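The runbook's thresholds can also be evaluated programmatically, for example in a periodic self-check. A minimal sketch, assuming a hypothetical `MetricsWindow` shape fed by your monitoring API (the names here are illustrative, not any vendor's SDK):

```typescript
// Thresholds matching the runbook table above:
// errorRate is a fraction of requests (0.01 = 1%), latency is in milliseconds.
const THRESHOLDS = { errorRate: 0.01, p95LatencyMs: 2000, uptimePercent: 99.9 };

// Hypothetical aggregate over one evaluation window (e.g. 5 minutes).
interface MetricsWindow {
  totalRequests: number;
  failedRequests: number;
  p95LatencyMs: number;
  uptimePercent: number;
}

// Returns a human-readable description of each breached threshold;
// an empty array means all three alert conditions are healthy.
function breaches(m: MetricsWindow): string[] {
  const out: string[] = [];
  const rate = m.failedRequests / m.totalRequests;
  if (rate > THRESHOLDS.errorRate)
    out.push(`error rate ${(rate * 100).toFixed(2)}% > 1%`);
  if (m.p95LatencyMs > THRESHOLDS.p95LatencyMs)
    out.push(`p95 latency ${m.p95LatencyMs} ms > 2000 ms`);
  if (m.uptimePercent < THRESHOLDS.uptimePercent)
    out.push(`uptime ${m.uptimePercent}% < 99.9%`);
  return out;
}
```

Keeping the check pure (metrics in, breach descriptions out) makes it trivial to unit-test the thresholds themselves, separately from notification delivery.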
ID: error-resilience.graceful-degradation-shutdown.alert-thresholds
Severity: info
What to look for: Count all alert configurations in monitoring services. Enumerate thresholds for error rate (no more than 1%), response time (no more than 2 seconds p95), and uptime (at least 99.9%). Check whether error rate thresholds are documented and alerts configured (in error tracking service, monitoring tool, or runbook). Verify team receives notification within 5 minutes of threshold breach.
Pass criteria: Alert thresholds for error rate spikes are documented; team receives notification within 5 minutes of threshold breach. At least 3 alert thresholds must be configured covering error rate, response time, and availability.
Fail criteria: No alert thresholds documented or configured; team does not receive notifications.
Skip (N/A) when: The application has no production deployment or error tracking.
Cross-reference: For error tracking integration, see error-tracking-service. For recovery time objectives, see recovery-time-objectives.
Detail on fail: "No alert thresholds configured. Error spikes will not be noticed by the team" or "Thresholds documented but alerts configured to email only, not Slack — may take hours to notice"
Remediation: Document and test alert thresholds:
// lib/monitoring/alerts.ts — alert thresholds
// errorRate is a fraction of requests (0.01 = 1%); p95Latency is in milliseconds.
const THRESHOLDS = { errorRate: 0.01, p95Latency: 2000, uptimePercent: 99.9 }
# Error Response Runbook
## Alert Thresholds
- Error rate spike: >1% of requests returning 5xx (normal: <0.5%)
- Notification channels: Slack #incidents, PagerDuty, email
- Response time: Slack within 5 minutes, PagerDuty immediate
## Testing
- Test alert firing monthly: `curl -X POST https://staging.example.com/api/chaos/error-spike` (staging only)
- Last tested: 2026-02-15
- Next test: 2026-03-15
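Notification delivery is where alert configurations most often rot (the "email only" failure mode in the detail-on-fail text). A sketch of a Slack notifier, assuming the incoming-webhook URL lives in a `SLACK_WEBHOOK_URL` environment variable (the variable name is an assumption; Slack incoming webhooks do accept a JSON body with a `text` field, and `fetch` is global in Node 18+):

```typescript
// Build the Slack payload separately so it can be unit-tested without a network call.
function buildSlackPayload(message: string): string {
  return JSON.stringify({ text: `:rotating_light: ${message}` });
}

// Posts a breach message to the Slack webhook. Fails loudly when the
// webhook URL is unset, so a misconfigured notifier cannot drop alerts silently.
async function notifyBreach(message: string): Promise<void> {
  const url = process.env.SLACK_WEBHOOK_URL;
  if (!url) throw new Error("SLACK_WEBHOOK_URL not set: alert would be dropped silently");
  await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildSlackPayload(message),
  });
}
```

Throwing on a missing webhook URL is deliberate: a notifier that no-ops when unconfigured is exactly the silent drift this rule exists to catch.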