Alerting configured for error rate spikes; thresholds defined; alert destinations configured

ab-000986 · deployment-readiness.environment-configuration.error-alerts

Severity: low · active

Why it matters

An error rate that spikes from 0.1% to 15% without triggering an alert means your team finds out from a user tweet, not a PagerDuty notification. NIST SI-4 and NIST CSF DE.AE-3 require automated detection of anomalous system events; ISO 25010 reliability.fault-tolerance requires that faults be detected and responded to. Without defined thresholds and alert destinations, error rate monitoring is passive — a dashboard that nobody checks during an incident rather than an active signal that wakes someone up.

Severity rationale

Low because error alerting is a detection mechanism, not a prevention mechanism — its absence degrades response time rather than directly causing failures.

Remediation

Configure an error rate alert in your monitoring service. For Datadog:

Monitors → New Monitor → Metric
Query: avg:trace.web.request.error_rate{service:your-app} by {env}
Alert threshold: > 0.05 (5%) for 5 consecutive minutes
Notification: @slack-incidents @pagerduty-oncall

For Prometheus + Alertmanager:

# prometheus-alerts.yml
groups:
  - name: app
    rules:
      - alert: HighErrorRate
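        # Ratio of 5xx responses to all responses, aggregated per instance
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05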
        for: 5m
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.instance }}"

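A Prometheus rule only fires the alert; notifications are delivered by Alertmanager, so make sure a route and receiver (for example Slack or PagerDuty) are configured for it as well.

Document the alert owner, threshold justification, and expected response steps in INCIDENT_RESPONSE.md.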
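
To make the threshold semantics concrete, here is a minimal Python sketch of "error rate above 5% for 5 consecutive minutes". It is an illustration only, not how Datadog or Prometheus evaluate rules internally. The names `error_rate` and `should_alert`, and the per-minute `(errors, total)` sample format, are assumptions made for this example.

```python
from typing import Sequence, Tuple


def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total > 0 else 0.0


def should_alert(
    samples: Sequence[Tuple[int, int]],
    threshold: float = 0.05,
    sustained_minutes: int = 5,
) -> bool:
    """Return True if the most recent `sustained_minutes` samples all exceed `threshold`.

    `samples` holds one (errors, total) pair per minute, oldest first
    (a hypothetical input format chosen for this sketch).
    """
    if len(samples) < sustained_minutes:
        # Not enough history yet to say the condition has been sustained.
        return False
    recent = samples[-sustained_minutes:]
    return all(error_rate(e, t) > threshold for e, t in recent)
```

Requiring the condition to hold across the whole window is what keeps a single bad minute from paging anyone, at the cost of a few minutes of detection delay. That trade-off is worth recording alongside the threshold justification.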

Detection

ID: error-alerts
Severity: low
What to look for: Enumerate every alert rule in the monitoring service configuration (Datadog, New Relic, Prometheus, CloudWatch). Identify the rules defined on error rate metrics, confirm each has an explicit threshold and evaluation window, and verify that alert destinations are configured (Slack, email, PagerDuty, SMS).
Pass criteria: Error rate alerting is configured. Thresholds are defined (e.g., alert if error rate > 5% for 5 minutes). Alert destinations are configured to notify the team.
Fail criteria: No error rate alerting found, or alerting is configured but thresholds are not defined, or no alert destinations are set.
Skip (N/A) when: The project has no monitoring service in place.
Detail on fail: "No error rate alerting configured. Error spikes will not trigger notifications." or "Error alerting configured but no alert destinations specified."
Remediation: Configure error rate alerting. Using Datadog:
1. Go to Datadog → Monitors → New Monitor
2. Select "Metric" type
3. Define query: avg:trace.web.request.error_rate{service:your-app}
4. Set threshold: Alert if > 0.05 (5%) for 5 minutes
5. Add notification: @slack-incidents
6. Save monitor
Or in code with Prometheus:
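```
# prometheus-alerts.yml
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, aggregated per instance
        expr: |
          sum by (instance) (rate(http_requests_total{job="app",status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total{job="app"}[5m])) > 0.05
        # Fire only after the condition has held for 5 minutes
        for: 5m
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
```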
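
The pass, fail, and skip criteria above can be sketched as a small Python function. This is a hedged illustration, not the audit tool's actual implementation. The function name `check_error_alerts`, its arguments, and the regex used to spot a threshold are assumptions, and the "thresholds are not defined" detail string is invented wording, since the check only specifies the other two messages.

```python
import re
from typing import Any, Dict, List

# Heuristic (assumption): a threshold is a comparison operator followed by a number.
THRESHOLD_PATTERN = re.compile(r"[<>]=?\s*\d")


def check_error_alerts(
    monitoring_present: bool,
    error_rules: List[Dict[str, Any]],
    destinations: List[str],
) -> Dict[str, str]:
    """Evaluate the error-alerts check.

    `error_rules` are already-parsed alert rules on error rate metrics,
    each with an `expr` string (hypothetical input shape for this sketch).
    """
    if not monitoring_present:
        return {"status": "skip", "detail": "The project has no monitoring service in place."}
    if not error_rules:
        return {
            "status": "fail",
            "detail": "No error rate alerting configured. Error spikes will not trigger notifications.",
        }
    if any(not THRESHOLD_PATTERN.search(str(r.get("expr", ""))) for r in error_rules):
        # Assumed wording; the check does not define a detail string for this case.
        return {"status": "fail", "detail": "Error alerting configured but thresholds are not defined."}
    if not destinations:
        return {
            "status": "fail",
            "detail": "Error alerting configured but no alert destinations specified.",
        }
    return {"status": "pass", "detail": "Error rate alerting, thresholds, and destinations are configured."}
```

For example, a rule with a `> 0.05` comparison and a `pagerduty` destination passes, while the same rule with an empty destination list fails with the second detail message listed above.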

External references

iso-25010:2011 · reliability.fault-tolerance — Fault Tolerance — error rate alerting enables timely response to failures
nist:rev5 · SI-4 — System Monitoring
nist-csf:2.0 · DE.AE-3 — Detect — Event data are collected and correlated from multiple sources

Taxons

observability

History

2026-04-18·v1.0.0·Initial import from deployment-readiness·automated