An error rate that spikes from 0.1% to 15% without triggering an alert means your team finds out from a user tweet, not a PagerDuty notification. NIST SP 800-53 SI-4 and NIST CSF DE.AE-3 require automated detection of anomalous system events; ISO/IEC 25010 reliability.fault-tolerance requires that faults be detected and responded to. Without defined thresholds and alert destinations, error rate monitoring is passive — a dashboard that nobody checks during an incident rather than an active signal that wakes someone up.
Low because error alerting is a detection mechanism, not a prevention mechanism — its absence degrades response time rather than directly causing failures.
Configure an error rate alert in your monitoring service. For Datadog, create a monitor on the error rate metric and add notification handles to its message:
avg:trace.web.request.error_rate{service:your-app} by {env}
Notify: @slack-incidents @pagerduty-oncall
For Prometheus + Alertmanager:
# prometheus-alerts.yml
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.instance }}"
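The rule above only defines the firing condition; Alertmanager must still route firing alerts to a destination, or the alert never reaches the team. A minimal routing sketch, assuming a placeholder Slack webhook URL and PagerDuty routing key (substitute your own):

```yaml
# alertmanager.yml — hypothetical destinations; replace placeholder values
route:
  receiver: team-pager        # default receiver for all firing alerts
  group_by: [alertname, env]  # batch related alerts into one notification
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: team-pager
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX  # placeholder webhook
        channel: "#incidents"
    pagerduty_configs:
      - routing_key: YOUR-PD-ROUTING-KEY               # placeholder Events API v2 key
```

Routing to both Slack and PagerDuty from one receiver gives the channel a visible record while still paging on-call.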
Document the alert owner, threshold justification, and expected response steps in INCIDENT_RESPONSE.md.
ID: deployment-readiness.environment-configuration.error-alerts
Severity: low
What to look for: Enumerate every relevant item. Check monitoring service configuration (Datadog, New Relic, Prometheus, CloudWatch). Look for alert rules defined on error rate metrics. Verify alert destinations are configured (Slack, email, PagerDuty, SMS).
Pass criteria: Error rate alerting is configured. Thresholds are defined (e.g., alert if error rate > 5% for 5 minutes). Alert destinations are configured to notify the team.
Fail criteria: No error rate alerting found, or alerting is configured but thresholds are not defined, or no alert destinations are set.
Skip (N/A) when: The project has no monitoring service in place.
Detail on fail: "No error rate alerting configured. Error spikes will not trigger notifications." or "Error alerting configured but no alert destinations specified."
Remediation: Configure error rate alerting. Using Datadog:
avg:trace.web.request.error_rate{service:your-app}
Notify: @slack-#incidents
Or with Prometheus + Alertmanager:
# prometheus-alerts.yml
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        # Alert when 5xx responses exceed 5% of all requests for 5 minutes
        expr: rate(http_requests_total{job="app",status=~"5.."}[5m]) / rate(http_requests_total{job="app"}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
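Per the fail criteria above, a rule with no destination still fails the check — an Alertmanager receiver is needed so the alert actually notifies someone. A minimal sketch, assuming a placeholder Slack webhook URL:

```yaml
# alertmanager.yml — placeholder values; wire in your real webhook
route:
  receiver: incidents-slack
receivers:
  - name: incidents-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX  # placeholder webhook URL
        channel: "#incidents"
```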