Disk, memory, and CPU monitoring is configured

ab-002296 · saas-logging.monitoring.disk-memory-cpu

Severity: medium · active

Why it matters

NIST AU-6 requires organizations to review and analyze audit records for unusual activity, and resource metrics extend that review to the health of the system itself. On self-hosted deployments, memory leaks, log files that fill the disk, and runaway CPU usage are among the most common causes of unexpected outages, and they stay invisible without infrastructure-level monitoring. A Node.js process with a memory leak degrades gradually over hours until the out-of-memory (OOM) killer terminates it; without a memory-usage alert, the first signal is the application going down. This check applies only to self-hosted deployments because serverless platforms (Vercel, Netlify, Cloudflare Workers) manage infrastructure scaling automatically and do not surface disk, CPU, or memory as metrics the operator needs to act on.

Severity rationale

Medium because unmonitored infrastructure metrics allow memory leaks and disk exhaustion to cause outages on self-hosted deployments, with no early warning.

Remediation

Configure infrastructure metrics collection appropriate to your hosting environment.

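Fly.io collects basic VM metrics (CPU, memory, disk, network) for every app automatically. To have Fly also scrape metrics that your application exposes in the Prometheus format, add a [metrics] section to fly.toml:

```
[metrics]
  port = 9091
  path = "/metrics"
```

The application itself must serve Prometheus-format metrics on that port and path. View the results in the Fly.io dashboard under your app's Metrics tab.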
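For a Node.js app, the following is a minimal, dependency-free sketch of an endpoint that serves Prometheus-format metrics on port 9091 at /metrics, matching the [metrics] section above. It assumes Node.js 18.15 or later (for fs.statfsSync). The function names, and every metric name other than process_resident_memory_bytes and process_cpu_seconds_total, are illustrative choices of this sketch, not Fly.io or Prometheus conventions.

```typescript
// metrics-endpoint.ts: serve process and host resource readings in the
// Prometheus text exposition format. Sketch only; assumes Node.js 18.15+.
import { createServer, type Server } from "node:http";
import { statfsSync } from "node:fs";
import * as os from "node:os";

interface Sample {
  name: string;
  help: string;
  type: "gauge" | "counter";
  value: number;
}

// Read resource usage at scrape time so every scrape sees fresh values.
function collectSamples(diskPath = "/"): Sample[] {
  const mem = process.memoryUsage();
  const cpu = process.cpuUsage(); // microseconds of CPU time since process start
  const disk = statfsSync(diskPath);
  return [
    { name: "process_resident_memory_bytes", help: "Resident set size of this process.", type: "gauge", value: mem.rss },
    { name: "process_cpu_seconds_total", help: "User plus system CPU time of this process.", type: "counter", value: (cpu.user + cpu.system) / 1e6 },
    { name: "nodejs_heap_used_bytes", help: "V8 heap currently in use.", type: "gauge", value: mem.heapUsed },
    { name: "host_memory_free_bytes", help: "Free memory reported by the OS.", type: "gauge", value: os.freemem() },
    { name: "host_load1", help: "One-minute load average (always 0 on Windows).", type: "gauge", value: os.loadavg()[0] },
    { name: "host_disk_available_bytes", help: `Bytes available to unprivileged users at ${diskPath}.`, type: "gauge", value: disk.bavail * disk.bsize },
  ];
}

// Each sample becomes a HELP line, a TYPE line, and one value line.
function renderMetrics(samples: Sample[]): string {
  return samples
    .map((s) => `# HELP ${s.name} ${s.help}\n# TYPE ${s.name} ${s.type}\n${s.name} ${s.value}\n`)
    .join("");
}

// Listen on all interfaces so the platform's scraper can reach the port.
function startMetricsServer(port = 9091, path = "/metrics"): Server {
  const server = createServer((req, res) => {
    if (req.method === "GET" && req.url === path) {
      res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4; charset=utf-8" });
      res.end(renderMetrics(collectSamples()));
    } else {
      res.writeHead(404).end();
    }
  });
  return server.listen(port);
}
```

Call startMetricsServer(9091, "/metrics") once during application startup. In production, the widely used prom-client package (its collectDefaultMetrics() function) produces the standard process metrics without hand-written formatting; the sketch only shows what the scrape target has to return.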

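For Docker or VPS deployments, install the Datadog Agent on the host:

```
DD_API_KEY=<your-key> bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"
```

The Agent collects CPU, memory, and disk metrics automatically and surfaces them in the Datadog infrastructure dashboard. If your Datadog account is not on the default US1 site, also set DD_SITE (for example, DD_SITE=datadoghq.eu) alongside DD_API_KEY.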

For Railway, use the built-in metrics view in your service's Metrics tab — no configuration needed.
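Where no agent or platform dashboard is available, the detection rule also accepts a script or cron job that checks system resources. The following is a minimal sketch of such a check, assuming Node.js 18.15 or later; the thresholds, the file name, and the --run flag are hypothetical conventions of this sketch, not part of any tool.

```typescript
// check-resources.ts: a cron-friendly host resource check. Prints one line per
// threshold breach to stderr and sets exit status 1. Thresholds are hypothetical.
import * as os from "node:os";
import { statfsSync } from "node:fs";

interface Reading {
  memUsedRatio: number; // 0..1, share of RAM the OS does not report as free
  loadPerCpu: number; // one-minute load average divided by CPU count
  diskUsedRatio: number; // 0..1, share of blocks not available to unprivileged users
}

interface Thresholds {
  maxMemUsedRatio: number;
  maxLoadPerCpu: number;
  maxDiskUsedRatio: number;
}

// Depending on the Node.js version, os.freemem() on Linux may exclude
// reclaimable page cache, so memory usage can read higher than expected.
function readHost(diskPath = "/"): Reading {
  const disk = statfsSync(diskPath);
  return {
    memUsedRatio: 1 - os.freemem() / os.totalmem(),
    loadPerCpu: os.loadavg()[0] / Math.max(1, os.cpus().length),
    diskUsedRatio: 1 - disk.bavail / disk.blocks,
  };
}

const pct = (x: number): string => `${(x * 100).toFixed(1)}%`;

// Return a human-readable message for every reading above its threshold.
function violations(r: Reading, t: Thresholds): string[] {
  const out: string[] = [];
  if (r.memUsedRatio > t.maxMemUsedRatio) out.push(`memory ${pct(r.memUsedRatio)} used (limit ${pct(t.maxMemUsedRatio)})`);
  if (r.loadPerCpu > t.maxLoadPerCpu) out.push(`load ${r.loadPerCpu.toFixed(2)} per CPU (limit ${t.maxLoadPerCpu})`);
  if (r.diskUsedRatio > t.maxDiskUsedRatio) out.push(`disk ${pct(r.diskUsedRatio)} used (limit ${pct(t.maxDiskUsedRatio)})`);
  return out;
}

const DEFAULTS: Thresholds = { maxMemUsedRatio: 0.9, maxLoadPerCpu: 2, maxDiskUsedRatio: 0.85 };

// Only act when invoked with --run, so importing the file has no side effects.
if (process.argv.includes("--run")) {
  const problems = violations(readHost(), DEFAULTS);
  for (const p of problems) console.error(`resource alert: ${p}`);
  if (problems.length > 0) process.exitCode = 1;
}
```

A crontab entry such as */5 * * * * cd /srv/app && npx tsx check-resources.ts --run (the path is hypothetical) runs it every five minutes. With MAILTO set, cron mails any output the job produces, so a message arrives only when a threshold is crossed. Unlike an external monitor, this check cannot report on a host that is already down.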

Detection

ID: disk-memory-cpu
Severity: medium
What to look for: Enumerate all relevant files and check for infrastructure-level monitoring, such as: Datadog Agent configuration, a Prometheus/Grafana setup (prometheus.yml, Grafana config), the New Relic infrastructure agent, CloudWatch agent configuration, Fly.io metrics configuration (the [metrics] section in fly.toml), references to the Railway metrics dashboard, or any script or cron job that checks system resources.
Pass criteria: At least one conforming pattern exists: infrastructure metrics (at minimum memory and CPU, ideally disk too) are monitored through an external service, platform dashboard, or agent, with evidence in config files, deployment manifests, or documentation.
Fail criteria: No infrastructure monitoring found for a self-hosted application.
Skip (N/A) when: Hosting is serverless (Vercel, Netlify, Cloudflare Pages/Workers, AWS Lambda). Serverless platforms manage infrastructure scaling automatically and do not expose disk/CPU/memory as meaningful metrics. Signal: vercel.json present, or netlify.toml present, or wrangler.toml present, or hosting detected as one of these platforms.
Detail on fail: "No infrastructure monitoring configured for self-hosted deployment — no Datadog agent, Prometheus, or platform metrics integration found"
Remediation: Memory leaks, disk-filling logs, and CPU runaway are common causes of production outages. An infrastructure monitor catches these before they take down the app.

For Fly.io applications, Fly provides built-in metrics at https://fly.io/apps/YOUR_APP/metrics. To have Fly also scrape the app's own Prometheus endpoint, add a [metrics] section to fly.toml:
```
[metrics]
  port = 9091
  path = "/metrics"
```
For Docker/VPS deployments, the Datadog Agent is a comprehensive option:
```
DD_API_KEY=<key> bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"
```
For Railway, use the built-in metrics view in the service's Metrics tab.
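The skip/pass/fail decision described in the Detection fields above can be sketched as a function over a repository's files. The file-name patterns (datadog.yaml, newrelic-infra.yml, amazon-cloudwatch-agent.json, and so on) are assumptions about where such configuration usually lives, not the checker's actual rule set, and the sketch covers only file-based evidence: Railway dashboard references, cron scripts, documentation, and platform-detected hosting are left out.

```typescript
// Hypothetical sketch of the disk-memory-cpu decision; not the checker's real code.
type Verdict = "pass" | "fail" | "skip";

// Config files whose presence signals serverless hosting (the N/A case).
const SERVERLESS_MARKERS = ["vercel.json", "netlify.toml", "wrangler.toml"];

// Evidence of infrastructure monitoring: a path pattern plus an optional content pattern.
const EVIDENCE: Array<{ file: RegExp; content?: RegExp }> = [
  { file: /(^|\/)datadog\.yaml$/ },
  { file: /(^|\/)prometheus\.ya?ml$/ },
  { file: /(^|\/)grafana[^/]*\.(ini|ya?ml|json)$/ },
  { file: /(^|\/)newrelic-infra\.yml$/ },
  { file: /amazon-cloudwatch-agent\.json$/ },
  { file: /(^|\/)fly\.toml$/, content: /^\s*\[metrics\]/m }, // fly.toml counts only with a [metrics] section
];

// files maps repository-relative paths to file contents.
function evaluate(files: Map<string, string>): { verdict: Verdict; evidence: string[] } {
  const basenames = Array.from(files.keys()).map((p) => p.split("/").pop() ?? p);
  if (SERVERLESS_MARKERS.some((m) => basenames.includes(m))) {
    return { verdict: "skip", evidence: [] };
  }
  const evidence = Array.from(files.entries())
    .filter(([path, text]) => EVIDENCE.some((e) => e.file.test(path) && (!e.content || e.content.test(text))))
    .map(([path]) => path);
  // Pass criteria: at least one conforming pattern.
  return { verdict: evidence.length >= 1 ? "pass" : "fail", evidence };
}
```

For example, a repository containing only a Dockerfile and a fly.toml without a [metrics] section evaluates to "fail", while adding deploy/datadog.yaml turns it into "pass".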

External references

nist:rev5 · AU-6 — Audit Record Review, Analysis, and Reporting
iso-25010:2011 · reliability

Taxons

observability operational-readiness

History

2026-04-18·v1.0.0·Initial import from saas-logging·automated