Alerting configured for error rate spikes; thresholds defined; alert destinations configured
Why it matters
An error rate that spikes from 0.1% to 15% without triggering an alert means your team finds out from a user tweet, not a PagerDuty notification. NIST SI-4 and NIST CSF DE.AE-3 require automated detection of anomalous system events; ISO 25010 reliability.fault-tolerance requires that faults be detected and responded to. Without defined thresholds and alert destinations, error rate monitoring is passive — a dashboard that nobody checks during an incident rather than an active signal that wakes someone up.
Severity rationale
Low because error alerting is a detection mechanism, not a prevention mechanism — its absence degrades response time rather than directly causing failures.
Remediation
Configure an error rate alert in your monitoring service. For Datadog:
- Monitors → New Monitor → Metric
- Query: avg:trace.web.request.error_rate{service:your-app} by {env}
- Alert threshold: > 0.05 (5%) for 5 consecutive minutes
- Notification: @slack-incidents @pagerduty-oncall
For Prometheus + Alertmanager:
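# prometheus-alerts.yml
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        # Aggregate before dividing. Dividing the raw series matches each 5xx
        # series only against itself, so the ratio would always be 1.
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.instance }}"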
Document the alert owner, threshold justification, and expected response steps in INCIDENT_RESPONSE.md.
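The threshold-plus-duration condition used in the alert rules above ("> 5% for 5 minutes") can be illustrated with a minimal Python sketch. It is a simplification for reasoning about the rule, not how Datadog or Prometheus evaluate it internally. The `Sample` shape, the function names, and the 60-second sampling interval in the usage note are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of two monotonically increasing counters (hypothetical shape)."""
    timestamp: float       # seconds
    total_requests: int    # all requests served so far
    error_requests: int    # requests that returned a 5xx so far


def error_ratio(prev: Sample, curr: Sample) -> float:
    """Fraction of requests that failed between two consecutive samples."""
    total = curr.total_requests - prev.total_requests
    errors = curr.error_requests - prev.error_requests
    return errors / total if total > 0 else 0.0


def should_fire(samples: list[Sample], threshold: float = 0.05,
                hold_seconds: float = 300.0) -> bool:
    """Return True once the ratio has stayed above threshold for hold_seconds."""
    breach_start = None
    for prev, curr in zip(samples, samples[1:]):
        if error_ratio(prev, curr) > threshold:
            if breach_start is None:
                breach_start = curr.timestamp  # first breaching evaluation
            if curr.timestamp - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any healthy interval resets the timer
    return False
```

With one sample per minute and a steady 10% error ratio, `should_fire` returns True only after the breach has persisted for the full five minutes; a single healthy interval resets the timer, which is the behavior that keeps short blips from paging anyone.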
Detection
- ID: error-alerts
- Severity: low
- What to look for: Enumerate every error rate alert rule. Check monitoring service configuration (Datadog, New Relic, Prometheus, CloudWatch). Look for alert rules defined on error rate metrics. Verify alert destinations are configured (Slack, email, PagerDuty, SMS).
- Pass criteria: Error rate alerting is configured. Thresholds are defined (e.g., alert if error rate > 5% for 5 minutes). Alert destinations are configured to notify the team.
- Fail criteria: No error rate alerting found, or alerting is configured but thresholds are not defined, or no alert destinations are set.
- Skip (N/A) when: The project has no monitoring service in place.
- Detail on fail: "No error rate alerting configured. Error spikes will not trigger notifications." or "Error alerting configured but no alert destinations specified."
- Remediation: Configure error rate alerting. Using Datadog:
  - Go to Datadog → Monitors → New Monitor
  - Select "Metric" type
  - Define query: avg:trace.web.request.error_rate{service:your-app}
  - Set threshold: Alert if > 0.05 (5%) for 5 minutes
  - Add notification: @slack-incidents
  - Save monitor
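  Or in code with Prometheus:
  # prometheus-alerts.yml
  groups:
    - name: app
      rules:
        - alert: HighErrorRate
          # Divide 5xx rate by total rate. A bare rate() is requests per
          # second, not a ratio, so comparing it to 0.05 would be wrong.
          expr: |
            sum by (instance) (rate(http_requests_total{job="app",status=~"5.."}[5m]))
              / sum by (instance) (rate(http_requests_total{job="app"}[5m])) > 0.05
          for: 5m
          annotations:
            summary: "High error rate on {{ $labels.instance }}"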
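The pass, fail, and skip criteria above can be sketched as a small decision function. This is a tool-agnostic illustration under stated assumptions: the `AlertRule` shape and `evaluate_error_alerts` are hypothetical names, and the "no threshold defined" message is illustrative wording that this check does not itself specify.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical, tool-agnostic view of one error rate alert rule."""
    name: str
    threshold: Optional[float] = None              # e.g. 0.05 for 5%
    destinations: list[str] = field(default_factory=list)


def evaluate_error_alerts(monitoring_present: bool,
                          rules: list[AlertRule]) -> tuple[str, str]:
    """Return (result, detail) following the criteria listed in this check."""
    if not monitoring_present:
        return ("skip", "The project has no monitoring service in place.")
    if not rules:
        return ("fail", "No error rate alerting configured. "
                        "Error spikes will not trigger notifications.")
    # Pass requires one rule that has both a threshold and a destination.
    for rule in rules:
        if rule.threshold is not None and rule.destinations:
            return ("pass", f"Rule {rule.name!r} has a threshold and a destination.")
    if any(rule.threshold is not None for rule in rules):
        return ("fail", "Error alerting configured but no alert destinations specified.")
    # Illustrative wording: the check defines no message for this case.
    return ("fail", "Error alerting configured but no threshold defined.")
```

For example, `evaluate_error_alerts(True, [AlertRule("HighErrorRate", 0.05, ["pagerduty-oncall"])])` yields a pass, while the same rule with an empty destination list yields the second fail detail.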
External references
- iso-25010:2011 · reliability.fault-tolerance — Fault Tolerance — error rate alerting enables timely response to failures
- nist:rev5 · SI-4 — System Monitoring
- nist-csf:2.0 · DE.AE-3 — Detect — Event data are collected and correlated from multiple sources
Taxons
History
- 2026-04-18·v1.0.0·Initial import from deployment-readiness·automated