All 20 checks with why-it-matters prose, severity, and cross-references to related audits.
Deploying untested code to production is a direct path to user-facing regressions, data corruption, and outages. Without CI/CD gates (NIST SP 800-218, the SSDF, practice PW.8), a broken commit reaches users the moment it's merged. ISO 25010 reliability.maturity requires demonstrable automated quality gates. Teams without mandatory test runs discover failures in production — where the cost is measured in downtime, rollbacks, and lost trust — rather than in a pull request, where the cost is a 5-minute fix.
Why this severity: Critical because untested code in production is a direct cause of undetected regressions, outages, and data integrity failures that affect all users simultaneously.
`deployment-readiness.ci-cd-pipeline.automated-tests`

A bad production deployment without a documented rollback plan means your team is improvising under pressure during an outage — when clarity matters most. Database schema changes and infrastructure shifts cannot be undone by `git revert` alone. Without step-by-step rollback coverage for code, database, and infrastructure (SOC 2 A1.3, NIST CP-10), teams routinely extend incidents from minutes to hours. ISO 25010 reliability.recoverability requires demonstrable recovery procedures, not tribal knowledge.
Why this severity: Critical because the absence of a documented rollback procedure converts every failed deployment into an extended incident where recovery time depends entirely on who is online and what they remember.
`deployment-readiness.ci-cd-pipeline.rollback-documented`

Docker images tagged exclusively as `latest` make it impossible to pinpoint which code version is running in production, roll back to a known-good build, or audit what changed between deployments. SLSA Build-L2 requires traceable artifact provenance; SSDF DS.1 requires unique build identifiers. Without SHA or semantic version tagging, a failed deployment becomes a debugging exercise because `latest` is overwritten with each push — the previous good build is gone.
Why this severity: High because mutable `latest` tags eliminate the ability to pin or roll back to a specific artifact, making incident recovery slower and deployment audits impossible.
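One way to satisfy this check is to push an immutable SHA-based tag alongside the convenience tags on every build. A minimal sketch in Python; the registry path, version, and commit SHA below are illustrative, not from a real project:

```python
# Sketch: emit one immutable tag per build alongside the convenience tags.
# Repo path, version, and SHA are hypothetical placeholders.
def image_tags(repo: str, version: str, sha: str) -> list[str]:
    """Tags to push for one build: semver, commit SHA, and `latest` last."""
    short = sha[:12]  # a short SHA still uniquely identifies the commit
    return [
        f"{repo}:{version}",    # human-readable release tag, e.g. 1.4.2
        f"{repo}:sha-{short}",  # immutable pointer to the exact build
        f"{repo}:latest",       # mutable convenience tag, never the only one
    ]

tags = image_tags("registry.example.com/app", "1.4.2",
                  "9fceb02d0ae598e95dc970b74767f19372d61af8")
```

Rolling back then means pointing the deployment at the previous `sha-` tag instead of rebuilding or guessing what `latest` contained yesterday.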
`deployment-readiness.ci-cd-pipeline.artifact-versioning`

Deploying directly to production without a staging environment means every untested configuration change, database migration, or third-party integration issue hits live users first. NIST CM-4 requires evaluating the security impact of changes before production; that evaluation requires a representative environment. A staging environment that uses SQLite while production uses PostgreSQL is not representative — subtle query behavior differences and migration incompatibilities only surface at the worst time.
Why this severity: High because skipping a production-equivalent staging environment means integration failures, migration errors, and environment-specific bugs are discovered by users rather than by the team.
`deployment-readiness.ci-cd-pipeline.staging-environment`

Applying untested database migrations directly to production risks table corruption, constraint violations, or data loss that cannot be easily undone. NIST CP-9 and SOC 2 A1.2 require verified recovery procedures for data; a migration that drops a column or alters a type without pre-testing can trigger application errors for every user and may require hours of manual data recovery. ISO 25010 reliability.recoverability demands that any data-modifying operation be validated before it reaches the primary store.
Why this severity: High because a failed migration in production can corrupt live data or take down the application for all users, with recovery time measured in hours if no pre-tested rollback path exists.
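The rehearsal this check asks for can be sketched as applying the migration to a throwaway database first. SQLite keeps the example self-contained; a real rehearsal should run on a disposable copy of the production engine (as the staging check above notes, SQLite is not representative of PostgreSQL). Table and column names are illustrative:

```python
import sqlite3

# Sketch: rehearse a migration on a scratch database before production.
def rehearse_migration(schema_sql: str, migration_sql: str) -> list[str]:
    """Apply the migration to an in-memory database and return the
    resulting columns; a bad migration raises here, not in production."""
    db = sqlite3.connect(":memory:")
    try:
        db.executescript(schema_sql)     # recreate the current schema
        db.executescript(migration_sql)  # the statement under test
        return [row[1] for row in db.execute("PRAGMA table_info(users)")]
    finally:
        db.close()

columns = rehearse_migration(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);",
    "ALTER TABLE users ADD COLUMN created_at TEXT;",
)
```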
`deployment-readiness.ci-cd-pipeline.migration-testing`

Without uptime monitoring, you learn about production outages from users, not from alerts. A 5-minute monitoring interval is the operational threshold between acceptable detection time and prolonged silent downtime. SOC 2 A1.1 requires availability controls; NIST SI-4 requires system monitoring. ISO 25010 reliability.availability cannot be demonstrated without instrumentation. An undetected 2-hour outage during off-hours can erase a week of user trust — monitoring configured to alert in under 5 minutes limits exposure to minutes, not hours.
Why this severity: High because without sub-5-minute uptime checks, production outages remain undetected until users report them, extending mean time to detection and amplifying business impact.
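One common shape for the sub-5-minute target is a 60-second probe that alerts on consecutive failures, so a single dropped probe stays quiet. A minimal sketch; the interval and failure count are illustrative defaults, not a standard:

```python
def should_alert(recent_probes: list[bool], failures_to_alert: int = 3) -> bool:
    """With a ~60s probe interval, three consecutive failures pages within
    about 3 minutes -- under the 5-minute detection target -- while one
    transient blip alone does not wake anyone up."""
    tail = recent_probes[-failures_to_alert:]
    return len(tail) == failures_to_alert and not any(tail)
```

For example, `should_alert([True, False, False, False])` fires, while `[True, False, True]` (a recovered blip) does not.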
`deployment-readiness.monitoring-alerting.uptime-monitoring`

Uncaught exceptions that silently swallow stack traces (CWE-209, CWE-215) make production failures invisible until users complain. Without error tracking (Sentry, Datadog, Rollbar), you cannot quantify your application's error rate, identify the most impactful failures, or detect regressions introduced by a deployment. NIST SI-4 requires active system monitoring; ISO 25010 reliability.fault-tolerance requires that faults be detected and surfaced. A single untracked 500 error affecting 10% of checkout requests can silently drain revenue for days.
Why this severity: High because untracked production exceptions mean teams are blind to user-impacting failures until the error rate is high enough for users to notice and report.
`deployment-readiness.monitoring-alerting.error-tracking`

A rollback procedure that exists on paper but has never been executed is not a rollback procedure — it is a hypothesis. SOC 2 A1.3 and NIST CP-4 require that recovery procedures be tested, not just documented. Teams that have never exercised rollback discover gaps (missing permissions, undocumented dependencies, wrong migration commands) during an active production incident, extending recovery time from minutes to hours. ISO 25010 reliability.recoverability requires verified recovery capability, not theoretical capability.
Why this severity: Medium because an untested rollback procedure fails unpredictably under incident pressure, but the risk is bounded by the existence of documentation and the fact that rollbacks are infrequent.
`deployment-readiness.monitoring-alerting.rollback-tested`

Deploying a high-risk feature without a feature flag means the only remediation for a bad release is a full rollback — redeploying old code, re-running migrations in reverse, and potentially incurring downtime. NIST CM-3 covers controlled configuration changes; feature flags implement that control at runtime without touching infrastructure. A flag that can be toggled in a dashboard in seconds is vastly safer than a rollback that takes 15 minutes. Teams without feature flags tie their deployment risk directly to their rollback speed.
Why this severity: Medium because missing feature flags force full rollbacks for high-risk feature failures, but the risk is only realized when a high-risk feature ships and misbehaves.
`deployment-readiness.monitoring-alerting.feature-flags`

Without automated database backups with at least 7-day retention, a corrupted migration, accidental `DELETE` without a `WHERE` clause, or infrastructure failure can result in permanent data loss. SOC 2 A1.2 and NIST CP-9 require backup and recovery capabilities and periodic restore testing. ISO 25010 reliability.recoverability cannot be satisfied by a backup that has never been successfully restored — untested backups have an unknown failure rate.
Why this severity: Medium because automated backups address a catastrophic but infrequent failure mode; the risk is bounded by the fact that most deployments don't corrupt data, but when they do the impact is irreversible.
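The 7-day retention rule is just date arithmetic over the backup set. A sketch of the pruning side (note pruning says nothing about restorability; restores still need periodic testing, as the check says). File names and dates are illustrative:

```python
from datetime import date, timedelta

def prune_backups(backups: dict[date, str], today: date,
                  retain_days: int = 7) -> dict[date, str]:
    """Drop backups older than the retention window; everything inside
    the window stays available for restore."""
    cutoff = today - timedelta(days=retain_days)
    return {day: name for day, name in backups.items() if day >= cutoff}

kept = prune_backups(
    {date(2024, 1, d): f"db-2024-01-{d:02}.dump" for d in range(1, 11)},
    today=date(2024, 1, 10),
)
```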
`deployment-readiness.rollback-recovery.database-backups`

When production goes down at 2am, the difference between a 15-minute resolution and a 3-hour outage is whether your team has a documented incident runbook. NIST IR-8 requires incident response plans; SOC 2 CC9.1 requires risk mitigation procedures; NIST CSF RS.RP-1 requires response plan execution. Without defined severity levels, teams cannot triage — every problem looks like a P1. Without escalation contacts and communication channels, the on-call engineer is making high-stakes decisions alone with no authority to act.
Why this severity: Medium because the absence of an incident runbook degrades response quality significantly during outages, but its impact is realized only when an incident occurs, not on every deployment.
`deployment-readiness.rollback-recovery.incident-runbook`

An expired SSL/TLS certificate takes your entire site offline for all users instantly — browsers block access, not just warn. CWE-295 covers improper certificate validation; NIST SC-17 and PCI DSS Req-4.2.1 require maintaining valid certificates. A certificate expiring within 30 days with no auto-renewal is a timed production outage. Manual renewal requires remembering a calendar event and having the right credentials available — both fail under operational stress.
Why this severity: Low because most modern platforms (Vercel, Netlify, AWS) handle SSL auto-renewal automatically, so this only affects self-hosted deployments, but expiry causes immediate and total service unavailability.
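The 30-day check itself is a one-line date comparison. A sketch; in practice `not_after` would come from the served certificate (Python's `ssl` module exposes it as the `notAfter` field of the peer certificate), but fixed dates keep this self-contained:

```python
from datetime import datetime, timezone

def needs_renewal(not_after: datetime, now: datetime,
                  threshold_days: int = 30) -> bool:
    """True when the certificate is inside the renewal window --
    i.e. a timed outage if nothing auto-renews it."""
    return (not_after - now).days < threshold_days

# Example: 12 days of validity left -> inside the 30-day window.
expiring = needs_renewal(datetime(2024, 2, 1, tzinfo=timezone.utc),
                         now=datetime(2024, 1, 20, tzinfo=timezone.utc))
```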
`deployment-readiness.rollback-recovery.ssl-auto-renew`

A DNS TTL set to 86400 seconds (24 hours) means a misconfigured or failed DNS record takes up to 24 hours to propagate a fix — even after you've corrected it. NIST SC-20 covers secure DNS; ISO 25010 reliability.availability requires timely failover capability. A TTL between 300 and 3600 seconds gives you fast failover when you need to redirect traffic during an outage while keeping query load on DNS servers reasonable. A CNAME pointing to staging instead of production silently serves wrong content to all users.
Why this severity: Low because DNS misconfiguration causes routing failures affecting all users, but correct DNS is typically set once and rarely changes, making ongoing risk low.
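The TTL window above can be expressed as a trivial range check, worth automating because the worst-case time to redirect traffic after a fix is roughly the TTL itself. Bounds are the ones the audit names:

```python
def ttl_in_failover_window(ttl_seconds: int, low: int = 300,
                           high: int = 3600) -> bool:
    """5 minutes to 1 hour: fast failover without hammering DNS servers."""
    return low <= ttl_seconds <= high
```

So `ttl_in_failover_window(86400)` flags the 24-hour TTL, while a 10-minute TTL (`600`) passes.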
`deployment-readiness.rollback-recovery.dns-configured`

Deploying new code without purging the CDN cache means users may receive stale assets — old JavaScript bundles, outdated HTML, broken references to renamed files — for hours or days after a deployment. ISO 25010 reliability.maturity requires that deployments deliver what was deployed. Without a documented invalidation strategy or automated purge step in the CI/CD pipeline, cache staleness is the default outcome, not the exception, and debugging it requires correlating CDN access logs rather than a simple cache clear.
Why this severity: Low because CDN staleness causes user-visible bugs and inconsistency after deployments, but doesn't expose data or cause security failures — the impact is reliability and UX degradation.
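A common complement to purging is content-hashed asset names, so changed content gets a new URL and stale CDN copies can never be served for new HTML (the HTML itself, which can't be fingerprinted, still needs a purge). A sketch, assuming a simple `name.ext` convention:

```python
import hashlib

def fingerprinted_name(filename: str, content: bytes) -> str:
    """app.js -> app.<hash>.js: new content yields a new URL, so CDN
    caches of the old bundle are bypassed instead of invalidated."""
    digest = hashlib.sha256(content).hexdigest()[:10]
    stem, dot, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{filename}.{digest}"

old = fingerprinted_name("app.js", b"console.log('v1');")
new = fingerprinted_name("app.js", b"console.log('v2');")
```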
`deployment-readiness.rollback-recovery.cdn-purge`

Deploying to production without load testing means your application's failure threshold under traffic is unknown until it collapses. ISO 25010 performance-efficiency.capacity requires knowing the system's limits before they are breached. NIST SC-5 covers resource availability protection. A deployment that handles 50 concurrent users gracefully may fail at 200 — a traffic spike from a product launch, press mention, or marketing campaign will find that limit in the worst possible context. A 2x peak traffic load test with under 1% error rate gives you a defensible headroom margin.
Why this severity: Low because load testing failures are discovered through the test rather than in production, and most apps never hit the traffic levels required to expose the gap, but an untested capacity ceiling is an invisible risk.
`deployment-readiness.environment-configuration.load-testing`

An error rate that spikes from 0.1% to 15% without triggering an alert means your team finds out from a user tweet, not a PagerDuty notification. NIST SI-4 and NIST CSF DE.AE-3 require automated detection of anomalous system events; ISO 25010 reliability.fault-tolerance requires that faults be detected and responded to. Without defined thresholds and alert destinations, error rate monitoring is passive — a dashboard that nobody checks during an incident rather than an active signal that wakes someone up.
Why this severity: Low because error alerting is a detection mechanism, not a prevention mechanism — its absence degrades response time rather than directly causing failures.
`deployment-readiness.environment-configuration.error-alerts`

A health check endpoint that only returns HTTP 200 without verifying the database or downstream services will report healthy during a database connection pool exhaustion or broken third-party integration — exactly the conditions that cause user-facing failures. NIST SI-6 requires software verification; ISO 25010 reliability.availability requires accurate system state reporting. A load balancer routing traffic to a broken instance that reports itself healthy multiplies the failure, not the resilience. The health endpoint is the contract between your application and its infrastructure.
Why this severity: Low because a shallow health check returns false positives under partial failures, causing load balancers to route to unhealthy instances rather than triggering failover — the gap is only realized during compound failures.
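A deep health check is just the shallow one plus per-dependency probes. A framework-agnostic sketch; the probe names and the probes themselves (a `SELECT 1`, a cache ping, a vendor status call) are illustrative:

```python
def health_report(probes: dict) -> tuple[int, dict]:
    """Run each dependency probe; any failure turns the endpoint into a
    503 so the load balancer stops routing here instead of amplifying
    the fault."""
    results, healthy = {}, True
    for name, probe in probes.items():
        try:
            probe()  # raises if the dependency is unreachable or broken
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
            healthy = False
    return (200 if healthy else 503, results)

# Simulated partial failure: database fine, cache probe raising.
status, detail = health_report({"database": lambda: None,
                                "cache": lambda: 1 / 0})
```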
`deployment-readiness.environment-configuration.health-check`

Without response time percentiles and throughput baselines, every deployment is a blind release — you cannot distinguish a 20% p95 regression from normal variance, or confirm that a refactor didn't silently degrade performance. ISO 25010 performance-efficiency.time-behaviour and NIST AU-6 require that performance data be collected and reviewed. Teams without p50/p95 dashboards measure performance by user complaints, which means regressions are invisible until they're severe enough for users to notice and report.
Why this severity: Info because performance metrics are an observability gap rather than an active failure mode — the system works, but you cannot see whether it is degrading or whether deployments improve or worsen response times.
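For reference, p50/p95 over a window of latency samples is a short computation (this sketch uses the nearest-rank method; monitoring backends may interpolate differently):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample at or above pct%
    of the ordered distribution."""
    ordered = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

latencies_ms = list(range(1, 101))  # stand-in for one window of samples
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

The p95/p50 pair matters because averages hide tail latency: one slow dependency can leave p50 flat while p95 doubles.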
`deployment-readiness.environment-configuration.performance-metrics`

Hardcoded credentials in source code (CWE-798) or committed `.env` files (CWE-312) expose API keys, database passwords, and service tokens to every person with repository access — and to public GitHub if the repo is ever made public or forked. OWASP A07 (Identification and Authentication Failures) and NIST IA-5 both classify hardcoded credentials as a critical authentication control failure. PCI DSS v4.0 Req-8.6.2 explicitly prohibits hardcoded passwords/passphrases in scripts, configuration/property files, and bespoke source code. Unlike most vulnerabilities, secrets in git history persist even after the secret is removed from the current codebase — they require key rotation, not just file deletion.
Why this severity: Info in this bundle context because other patterns cover active secret exposure; this pattern focuses on the secrets management hygiene baseline rather than a detected active leak.
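The hygiene baseline can be approximated with pattern matching over the codebase. A deliberately tiny sketch with illustrative rules only; real scanners (gitleaks, trufflehog) ship much larger curated rule sets and, crucially, also walk git history, since secrets persist there after deletion:

```python
import re

# Illustrative rules -- not a production rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key id
    re.compile(r"(?i)(password|secret|api_key)\s*=\s*['\"][^'\"]{8,}['\"]"),
]

def find_hardcoded_secrets(text: str) -> list[str]:
    """Return every substring matching a known credential pattern."""
    return [m.group(0) for pat in SECRET_PATTERNS for m in pat.finditer(text)]

hits = find_hardcoded_secrets('DB_PASSWORD = "hunter2hunter2"')
```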
`deployment-readiness.environment-configuration.secrets-management`

A deployment that succeeds at the infrastructure level can still break login, payment processing, or core data flows — the CI/CD pipeline passes, but users hit errors. Without automated post-deployment smoke tests (NIST SA-11, SSDF PW.8), the first signal of a broken deployment is a user support ticket. ISO 25010 reliability.maturity requires that production changes be verified as correct after deployment, not just before it. A 3-minute automated smoke test covering the critical user journey is the difference between a 5-minute rollback and a 45-minute incident.
Why this severity: Info because smoke tests are a post-deployment safety net rather than a preventive control — their absence extends time-to-detect when a deployment breaks production flows.
`deployment-readiness.environment-configuration.smoke-tests`

Run this audit in your AI coding tool (Claude Code, Cursor, Bolt, etc.) and submit results here for scoring and benchmarks.