All 24 checks with why-it-matters prose, severity, and cross-references to related audits.
Without a root error boundary, any unhandled React component error produces a white screen of death — the entire application unmounts silently. Users lose their context, their unsaved work, and often assume the product is broken. ISO 25010 reliability.fault-tolerance requires the system to continue operating under fault conditions; a missing root boundary violates that baseline. One null-pointer in a third-party widget can take down checkout, dashboard, and every other route.
Why this severity: Critical because a single component failure without a root error boundary crashes the entire application for every active user simultaneously.
Pattern: error-resilience.error-boundaries-ui-recovery.root-error-boundary

A root error boundary keeps the shell alive, but without feature-level boundaries a single crashing widget still takes down the entire dashboard, settings panel, or payment form. ISO 25010 reliability.fault-tolerance expects fault isolation: one failed revenue widget should not blank the user analytics widget beside it. Data tables, real-time feeds, and payment forms are the highest-risk surfaces — each is independently capable of crashing a page and forcing the user to reload and start over.
Why this severity: Critical because without feature-level isolation, one widget crash eliminates access to all other features on the same page, amplifying impact beyond the single failure.
Pattern: error-resilience.error-boundaries-ui-recovery.feature-error-boundaries

A fallback that displays a raw stack trace violates CWE-209 (information exposure through error messages) and erodes user trust irreparably. Worse, a fallback with no retry or navigation option leaves users stranded — they cannot recover without a hard reload, and even then they may lose unsaved state. ISO 25010 reliability.recoverability requires the system to restore a defined state after failure; a blank or cryptic fallback achieves neither. User churn after an unrecoverable error is measurably higher than after a friendly one.
Why this severity: High because a non-actionable fallback UI converts a recoverable runtime error into a user-abandoned session, and raw error detail exposes internal implementation.
Pattern: error-resilience.error-boundaries-ui-recovery.friendly-fallback-components

Raw JavaScript errors exposed in the UI — `TypeError: Cannot read properties of undefined`, `ECONNREFUSED`, HTTP status codes — violate CWE-209 (information exposure) and OWASP A05 (Security Misconfiguration). Beyond security, they erode user trust: a non-technical user reading a stack trace has no path forward. Every technical error string that reaches a user represents a conversion failure, a support ticket, or churn. Error message quality is a measurable product metric, not a cosmetic concern.
Why this severity: High because raw error strings leak internal architecture to users and are direct evidence of OWASP A05 misconfiguration that also degrades perceived reliability.
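One way to enforce this is a single translation point between caught failures and rendered copy. A minimal sketch; `userMessage`, its pattern list, and the wording are illustrative assumptions, not a library API:

```typescript
// Map raw failures to user-facing copy at one choke point.
// The raw message goes to logging/tracking only, never to the UI.
function userMessage(err: unknown): string {
  const raw = err instanceof Error ? err.message : String(err);
  console.error("request failed:", raw); // full detail stays server/log side
  if (/ECONNREFUSED|ETIMEDOUT|Failed to fetch|NetworkError/i.test(raw)) {
    return "We couldn't reach the server. Check your connection and try again.";
  }
  return "Something went wrong on our end. Please try again.";
}
```

Usage: every catch block that feeds the UI renders `userMessage(err)` instead of `err.message`, so new raw-error leaks cannot be introduced one component at a time.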
Pattern: error-resilience.error-boundaries-ui-recovery.user-friendly-error-messages

Unstructured `console.log` output is unsearchable, unqueryable, and useless when diagnosing production incidents. Without consistent fields — `level`, `timestamp`, `traceId`, `userId` — correlating a user-reported failure to a server-side log line can take hours. NIST SP 800-53 AU-3 mandates that audit records contain sufficient information to establish what events occurred and who was responsible. ISO 25010 reliability.fault-tolerance requires detectable failure conditions; logs that can't be filtered by severity or correlated by trace ID fail that standard.
Why this severity: High because unstructured logs make production incident investigation hours slower, directly extending mean time to resolution for every outage.
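A structured logger can be sketched in a few lines. The field names (`traceId`, `userId`) follow the ones mentioned above; the JSON-lines output format is an assumption about what your log aggregator ingests:

```typescript
// Minimal structured logger: each log line is one JSON object with
// consistent fields, so aggregators can filter by level and
// correlate by trace ID.
type Level = "debug" | "info" | "warn" | "error";

interface LogContext {
  traceId?: string;
  userId?: string;
  [key: string]: unknown;
}

function log(level: Level, message: string, context: LogContext = {}): string {
  const entry = {
    level,
    timestamp: new Date().toISOString(),
    message,
    ...context,
  };
  const line = JSON.stringify(entry);
  console[level === "debug" ? "log" : level](line);
  return line; // returned so callers (and tests) can inspect the emitted line
}

// Usage: log("error", "payment failed", { traceId: "abc-123", userId: "u42" });
```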
Pattern: error-resilience.logging-observability.structured-logging

Unhandled promise rejections are silent killers. In Node.js prior to v15 they surfaced only as a console warning that was easy to miss; in modern runtimes they terminate the process, causing unannounced restarts. In browsers, they leave background operations in an unknown state with no user feedback. CWE-703 (improper check for exceptional conditions) applies directly. Without a global handler, async errors from queued jobs, scheduled tasks, or deferred operations vanish entirely — producing outages with no trace in logs and no signal in error tracking.
Why this severity: High because unhandled rejections terminate Node.js processes without warning or logging, causing unannounced service restarts with zero diagnostic trail.
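A minimal global handler for Node.js might look like the following. The `reported` array exists only to make the behavior observable; in production you would forward to your structured logger or error tracker instead:

```typescript
// Global safety net: once any "unhandledRejection" listener is
// registered, Node.js no longer applies its default behavior
// (terminating the process in v15+), and the failure becomes visible.
const reported: unknown[] = [];

process.on("unhandledRejection", (reason: unknown) => {
  reported.push(reason); // observable stand-in for a real tracker call
  console.error("[unhandledRejection]", reason);
  // Deliberately no rethrow: the process stays up and the failure
  // surfaces in logs and alerting instead of vanishing with the process.
});
```

The same idea applies in browsers via `window.addEventListener("unhandledrejection", ...)`.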
Pattern: error-resilience.logging-observability.unhandled-promise-rejection

Third-party scripts — analytics, chat widgets, A/B testing tools — run in the same JavaScript context as your application. A synchronous error in any of them crashes the page if `window.onerror` is not configured. CWE-703 applies: failures in external code are exceptional conditions your app must handle. Without this handler, you have no visibility into crashes caused by vendor script updates, and users experience blank pages with no error surface in your error tracking dashboard.
Why this severity: High because synchronous errors from third-party scripts silently crash the page with no diagnostic trail unless a global handler is in place.
Pattern: error-resilience.logging-observability.window-onerror-handler

Without an error tracking service, production failures are invisible until users report them — often via churn, refund requests, or social media. NIST SP 800-53 SI-11 requires software fault handling that produces meaningful diagnostic output; `console.error` into a log file no one monitors does not meet that bar. ISO 25010 reliability.fault-tolerance depends on detection: you cannot respond to a fault you cannot observe. A free-tier Sentry account covers most production workloads and takes under 30 minutes to configure.
Why this severity: Medium because unmonitored production errors go undetected until users report them, inflating mean time to detection for every incident.
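For Node, the baseline Sentry setup is a few lines. This is a configuration sketch, not a complete integration; the environment-variable names and sample rate are assumptions to adjust for your project:

```typescript
// Baseline error tracking with @sentry/node (config sketch).
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN, // keep the DSN out of source control
  environment: process.env.NODE_ENV ?? "development",
  tracesSampleRate: 0.1, // sample performance data; errors are always sent
});

// At points where a notable failure is handled rather than rethrown:
// Sentry.captureException(err);
```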
Pattern: error-resilience.logging-observability.error-tracking-service

When an external API — payment gateway, SMS provider, identity service — starts returning errors, repeated synchronous retries amplify load on an already-struggling service and cascade failures throughout your application. CWE-703 (improper check for exceptional conditions) applies to this failure mode. ISO 25010 reliability.fault-tolerance requires the system to degrade gracefully under component failures; a missing circuit breaker means one downstream outage can freeze all request-handling threads or queue workers indefinitely, taking down your entire service.
Why this severity: Medium because without circuit-breaking, a single downstream API failure causes cascading thread exhaustion or queue saturation across the entire application.
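The pattern can be sketched as a small class. The threshold and cooldown values are illustrative defaults, and the half-open state is modeled simply as letting one probe call through after the cooldown:

```typescript
// Minimal circuit breaker: after `threshold` consecutive failures the
// circuit opens and calls fail fast for `cooldownMs`, shedding load
// from the struggling downstream service.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 5,
    private cooldownMs = 30_000,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast");
      }
      // Cooldown elapsed: half-open, allow one probe call through.
    }
    try {
      const result = await fn();
      this.failures = 0; // any success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

Production code would usually reach for a maintained library, but the state machine above is the whole idea: fail fast while the downstream service is down, probe occasionally, close on success.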
Pattern: error-resilience.network-api-resilience.circuit-breaker-pattern

HTTP requests without timeouts block indefinitely when a server stops responding without closing the TCP connection — a common failure mode in cloud environments during rolling restarts or network partitions. CWE-400 (uncontrolled resource consumption) applies: a single hung request holds a thread, connection slot, or serverless concurrency unit until the platform forcibly kills it — typically after 30+ seconds. At scale, simultaneous slow requests exhaust your connection pool entirely, making the application unresponsive for all users.
Why this severity: Medium because requests without timeouts deplete connection pools and serverless concurrency under partial network failures, causing application-wide unavailability.
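A deadline wrapper is a few lines with `AbortController`, available in browsers and Node 18+. A sketch; the injectable `fetchImpl` parameter exists only so the deadline can be exercised without a network:

```typescript
// Wrap a fetch call with a hard deadline. AbortController cancels the
// underlying request when the timer fires; the timer is always cleared
// so it never leaks on the success path.
async function fetchWithTimeout(
  url: string,
  ms: number,
  fetchImpl: typeof fetch = fetch, // injectable for testing
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetchImpl(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```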
Pattern: error-resilience.network-api-resilience.request-timeout

Transient network failures — DNS blips, cloud provider hiccups, brief overloads — are normal in distributed systems. Without retry logic, a single dropped packet causes a user-visible failure on an operation that would succeed 200ms later. ISO 25010 reliability.recoverability requires the application to restore a defined performance level after failure; immediate hard failures with no retry violate that objective. Equally important: retrying non-idempotent POST requests without user confirmation risks duplicate charges, duplicate submissions, or duplicate order creation — a common and costly AI-coding defect.
Why this severity: Medium because missing retry logic converts recoverable transient failures into permanent user-visible errors, and naive retry on non-idempotent calls risks duplicate transactions.
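A sketch of the technique, under the explicit assumption that callers pass only idempotent operations; the attempt count and delay constants are illustrative:

```typescript
// Retry an idempotent operation with exponential backoff and jitter.
// ASSUMPTION: only idempotent operations (GET, or writes guarded by an
// idempotency key) belong here; retrying a bare POST risks duplicates.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break;
      // Exponential backoff (100ms, 200ms, 400ms, ...) plus random
      // jitter so simultaneous clients do not retry in lockstep.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 50;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```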
Pattern: error-resilience.network-api-resilience.retry-logic-backoff

When an API call fails, applications that render nothing leave users with no way to understand what happened or what they can still do. ISO 25010 reliability.fault-tolerance expects partial operation: a profile page should still show the edit form even if the user data fetch fails; a dashboard should still show navigation even if a widget's data source is unavailable. Applications that hard-fail to a blank screen or infinite spinner on any API error eliminate user agency entirely — an especially high-impact defect in payment and checkout flows.
Why this severity: Medium because full-page blank screens on API failure eliminate user agency and produce abandonment at the exact moments — payment, onboarding — where recovery matters most.
Pattern: error-resilience.network-api-resilience.graceful-api-failure

A default server 404 response — raw text or a framework-generated HTML stub — gives users no path forward. They cannot navigate home, search for what they wanted, or understand whether the URL was ever valid. ISO 25010 reliability.recoverability requires the system to guide users back to a functional state; a missing custom 404 page abandons them entirely. For SEO, a well-structured custom 404 with navigation also prevents Google from interpreting broken links as site-wide quality signals.
Why this severity: Low because the user impact is limited to the dead-end session; most users can manually navigate home, but the experience degrades trust and increases bounce rate.
Pattern: error-resilience.graceful-degradation-shutdown.not-found-error-page

A generic 500 response gives users no information about whether the problem is transient or permanent, no way to report what happened, and no confidence that anyone knows the system is broken. CWE-209 warns against exposing internal error details, but the opposite failure — showing nothing at all — is equally problematic for recoverability. ISO 25010 reliability.recoverability requires the system to restore or redirect to a usable state; a 500 page with no retry and no incident ID achieves neither. An incident ID on the error page links user-reported failures to server-side logs without requiring users to paste stack traces.
Why this severity: Low because server errors are less frequent than 404s, but a well-designed 500 page with an incident ID directly reduces support resolution time.
Pattern: error-resilience.graceful-degradation-shutdown.server-error-page

Components that only handle the success path leave users stranded in three common failure modes: slow network (indefinite spinner), API error (blank or partially rendered page), and empty server response (no indication of why nothing appeared). CWE-703 (improper check for exceptional conditions) applies to missing error state handling. In vibe-coded applications this is one of the most frequent defects: AI code generation reliably handles the success path and omits the loading, error, and empty branches — leaving large surface areas completely unhandled.
Why this severity: Low individually, but high in aggregate — missing error and empty states across many data-fetching components produce a product that feels broken on any slow or unreliable connection.
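In TypeScript the four outcomes can be modeled as a discriminated union, so an exhaustive `switch` over `status` is compiler-checked and no branch can be silently omitted. `toFetchState` is an illustrative helper, not a library API:

```typescript
// Model all four fetch outcomes explicitly: a render function that
// switches on `status` and misses a branch fails to typecheck.
type FetchState<T> =
  | { status: "loading" }
  | { status: "error"; message: string }
  | { status: "empty" }
  | { status: "success"; data: T };

function toFetchState<T>(items: T[] | null, error?: string): FetchState<T[]> {
  if (error) return { status: "error", message: error };
  if (items === null) return { status: "loading" }; // no response yet
  if (items.length === 0) return { status: "empty" }; // response, no rows
  return { status: "success", data: items };
}
```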
Pattern: error-resilience.graceful-degradation-shutdown.data-fetching-states

A form that clears its fields on network failure destroys user work and guarantees the user abandons the operation — or at minimum re-enters all data under frustration. ISO 25010 reliability.recoverability requires the application to preserve state across failure events. For signup, checkout, and contact forms this is a direct conversion killer: a single network timeout during form submission can permanently lose a customer. The fix is architectural, not cosmetic: form state must live in React state or a form library, and must never be cleared in the catch block.
Why this severity: Low severity at the technical level, but high business impact — form abandonment after submission failure is a direct revenue loss vector for signup and checkout flows.
Pattern: error-resilience.graceful-degradation-shutdown.form-submission-retry

Without a connection pool, applications create a new database connection for every request — a pattern that fails catastrophically under load. At moderate traffic, the database server exhausts its connection limit and begins rejecting connections, producing 500 errors for all users simultaneously. CWE-400 (uncontrolled resource consumption) applies directly. ISO 25010 reliability.fault-tolerance requires resource management that prevents single-request failures from cascading into service-wide outages; unmanaged connection creation is one of the most common causes of that cascade in vibe-coded applications.
Why this severity: Low in development where request concurrency is low, but a systemic failure mode in production — connection exhaustion under modest load produces complete service outages.
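For PostgreSQL the fix is one shared pool per process, created once at startup. A configuration sketch using node-postgres; the limits are illustrative and must fit within your database server's connection budget:

```typescript
// One pool for the whole process (node-postgres config sketch).
import { Pool } from "pg";

const pool = new Pool({
  max: 10,                        // hard cap on concurrent connections
  connectionTimeoutMillis: 2_000, // fail fast instead of queueing forever
  idleTimeoutMillis: 30_000,      // recycle idle connections
});

// Request handlers borrow from the pool instead of connecting:
// const { rows } = await pool.query("SELECT 1");
```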
Pattern: error-resilience.graceful-degradation-shutdown.database-connection-pool

A backend process that ignores SIGTERM drops in-flight requests immediately when the orchestrator (Kubernetes, Railway, Fly.io, Heroku) restarts or redeploys it. Users mid-checkout, mid-upload, or mid-auth flow receive hard errors with no recovery path. CWE-460 (improper cleanup on thrown exception) encompasses improper shutdown handling. Platforms send SIGTERM and wait a grace period (typically 30 seconds) before SIGKILL; failing to handle it means every deploy causes unannounced request failures that are invisible in your error tracking.
Why this severity: Low in environments with short-lived requests, but causes guaranteed data loss and user-facing errors on every deployment in applications with long-running operations.
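A sketch for a Node.js HTTP server; the 25-second default assumes the typical 30-second SIGTERM grace window mentioned above, and `shutdown` is separated from the signal handler so the drain logic can be tested directly:

```typescript
import http from "node:http";

// Stop accepting new connections, let in-flight requests finish, then
// resolve 0 (clean) or 1 (grace period expired with requests pending).
function shutdown(server: http.Server, graceMs: number): Promise<number> {
  return new Promise((resolve) => {
    const forceTimer = setTimeout(() => resolve(1), graceMs);
    forceTimer.unref(); // do not keep the process alive just for this timer
    server.close(() => {
      clearTimeout(forceTimer);
      resolve(0); // all in-flight requests completed
    });
  });
}

function installGracefulShutdown(server: http.Server, graceMs = 25_000): void {
  process.on("SIGTERM", () => {
    console.log("SIGTERM received, draining connections");
    shutdown(server, graceMs).then((code) => process.exit(code));
  });
}
```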
Pattern: error-resilience.graceful-degradation-shutdown.graceful-shutdown

HTTP 429 (Too Many Requests) responses from third-party APIs — OpenAI, Stripe, SendGrid, Twilio — are not errors; they are instructions. An application that treats them as generic failures and retries immediately makes the rate-limit situation worse and may get its API key suspended. CWE-703 (improper check for exceptional conditions) applies. Users who trigger rate-limited operations see cryptic failures when they should see a clear message like "You've sent too many requests. Please wait 60 seconds." The `Retry-After` header provides the exact wait time — ignoring it is wasteful and user-hostile.
Why this severity: Low severity for individual users, but repeated unhandled 429s can trigger API key suspension, causing a service-wide outage for all users simultaneously.
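Per RFC 9110, `Retry-After` carries either a delay in seconds or an HTTP-date; a small parser covers both forms. The 60-second fallback is an assumption for when the header is missing or unparseable:

```typescript
// Parse a Retry-After header value into milliseconds to wait.
// Handles both forms: delay-seconds ("60") and HTTP-date.
function retryAfterMs(header: string | null, fallbackMs = 60_000): number {
  if (!header) return fallbackMs;
  const seconds = Number(header);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const dateMs = Date.parse(header); // HTTP-date form
  if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - Date.now());
  return fallbackMs;
}
```

On a 429, the result drives both the user message ("Please wait N seconds") and the earliest allowed retry time.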
Pattern: error-resilience.graceful-degradation-shutdown.rate-limit-handling

Error logs that contain email addresses, passwords, API keys, or session tokens become a secondary attack surface. CWE-209 (information exposure through error messages) and OWASP A09 (security logging and monitoring failures) both cover this failure mode. NIST SP 800-53 AU-3 requires audit records to contain the right information — but not uncontrolled PII or credentials. A compromised logging pipeline or accidental log export can expose the credentials of every user whose request happened to error during a specific window.
Why this severity: Info severity because exploitation requires log access, which typically requires a separate compromise — but when that compromise occurs, unredacted logs amplify every credential in them.
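A minimal recursive scrubber applied before any log entry leaves the process; the key pattern and email regex here are illustrative starting points, not a complete PII taxonomy:

```typescript
// Redact known-sensitive fields and email addresses from a log entry.
const SENSITIVE_KEYS = /password|token|secret|authorization|apikey|api_key/i;
const EMAIL = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

function scrub(value: unknown): unknown {
  if (typeof value === "string") return value.replace(EMAIL, "[redacted-email]");
  if (Array.isArray(value)) return value.map(scrub);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) =>
        SENSITIVE_KEYS.test(k) ? [k, "[redacted]"] : [k, scrub(v)],
      ),
    );
  }
  return value; // numbers, booleans, null pass through unchanged
}
```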
Pattern: error-resilience.graceful-degradation-shutdown.scrub-sensitive-logs

Error boundaries, circuit breakers, and retry logic are only as good as the failure scenarios they were written to handle. Without chaos testing, these resilience mechanisms are untested hypotheses. ISO 25010 reliability.fault-tolerance requires demonstrated tolerance, not assumed tolerance. Vibe-coded applications are especially prone to this gap: the happy path is well-exercised, but the circuit breaker has never opened in a real test, and the error boundary's fallback has never rendered in a staged failure. The first time these paths run is in production, during an incident.
Why this severity: Info severity because the defect is a testing gap, not a runtime defect — but untested error paths are a leading indicator of incidents that take longer to resolve.
Pattern: error-resilience.graceful-degradation-shutdown.chaos-testing

Minified production JavaScript produces stack traces like `a.b is not a function at t (main.abc123.js:1:4821)` — unactionable without the corresponding source map. When source maps are not uploaded to the error tracking service, every production stack trace requires a manual deobfuscation step that most teams skip, leaving errors effectively uninvestigated. ISO 25010 reliability.fault-tolerance depends on diagnostic capability; without readable stack traces, fault isolation time increases dramatically. Exposing source maps publicly (via `productionBrowserSourceMaps: true`) solves the readability problem but exposes your full source code to anyone with DevTools.
Why this severity: Low because the defect slows diagnosis rather than causing failures directly — but unreadable production stack traces systematically increase mean time to resolution for every incident.
Pattern: error-resilience.graceful-degradation-shutdown.source-maps-error-tracking

Error tracking without alert thresholds is passive observation. A team that checks the Sentry dashboard manually once per day will miss an error spike that began at 2 AM and drove 400 failed signups before business hours. NIST SP 800-53 IR-5 (incident monitoring) requires automated monitoring and notification. ISO 25010 reliability.availability is directly reduced by every minute between spike onset and team awareness; unconfigured alerts convert that to hours. Alert thresholds are also the only objective way to distinguish a transient blip from a systemic regression.
Why this severity: Info severity because alert gaps increase mean time to detection rather than causing failures directly — but undetected production incidents compound quietly, often reaching high user impact before anyone notices.
Pattern: error-resilience.graceful-degradation-shutdown.alert-thresholds

Without documented RTO (Recovery Time Objective) and RPO (Recovery Point Objective), teams have no shared definition of what a successful recovery looks like, no way to tell when an incident is resolved, and no accountability for response speed. NIST SP 800-53 CP-2 (contingency planning) requires organizations to define recovery targets. ISO 25010 reliability.recoverability requires measurable recovery capability — which cannot exist without first measuring what recovery means. In vibe-coded projects, the absence of RTOs is common because AI coding tools do not prompt for operational documentation; it must be written explicitly.
Why this severity: Info severity because missing RTO documentation is an operational gap rather than a runtime defect — but the gap becomes critical the moment an incident occurs and the team has no defined resolution criteria.
Pattern: error-resilience.graceful-degradation-shutdown.recovery-time-objectives

Run this audit in your AI coding tool (Claude Code, Cursor, Bolt, etc.) and submit results here for scoring and benchmarks.