Error boundaries, circuit breakers, and retry logic are only as good as the failure scenarios they were written to handle. Without chaos testing, these resilience mechanisms are untested hypotheses. ISO 25010 reliability.fault-tolerance requires demonstrated tolerance, not assumed tolerance. Vibe-coded applications are especially prone to this gap: the happy path is well-exercised, but the circuit breaker has never opened in a real test, and the error boundary's fallback has never rendered in a staged failure. The first time these paths run is in production, during an incident.
Info severity because the finding is a testing gap rather than a runtime fault — but untested error paths are a leading indicator of incidents that take longer to resolve.
Add at least one chaos test that deliberately injects a failure and asserts the application degrades gracefully. MSW (Mock Service Worker) is the lowest-friction option for frontend tests.
// tests/chaos/api-failure.test.tsx
import { render, screen } from '@testing-library/react'
import { http, HttpResponse } from 'msw'
import { server } from '../mocks/server'
import { Dashboard } from '../../src/Dashboard' // import path is project-specific

test('shows error message when dashboard API fails', async () => {
  // Override the happy-path handler with a 503 for this test only
  server.use(
    http.get('/api/dashboard', () =>
      HttpResponse.json({ error: 'service unavailable' }, { status: 503 })
    )
  )
  render(<Dashboard />)
  expect(await screen.findByText(/failed to load/i)).toBeVisible()
  expect(screen.queryByRole('progressbar')).not.toBeInTheDocument()
})
Document findings from each chaos run — what failed, what recovered correctly, what needed fixing — in docs/chaos-findings.md.
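An entry per run keeps the log reviewable. A sketch of what a docs/chaos-findings.md entry might look like (the scenario and outcomes here are hypothetical, not prescribed by this rule):

```
## Chaos run — dashboard API outage
- Injected: 503 from /api/dashboard via MSW
- Recovered correctly: error message rendered, loading spinner cleared
- Needed fixing: no retry affordance offered; remediation tracked in the issue tracker
```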
ID: error-resilience.graceful-degradation-shutdown.chaos-testing
Severity: info
What to look for: Count all chaos/resilience testing configurations or documentation. Enumerate whether the project has tested failure scenarios (service outage, network partition, high latency). Look for documentation or code evidence of chaos testing: intentionally failing APIs, cutting off network, causing timeouts. Check for findings documented and remediation tracked.
Pass criteria: Chaos testing or fault injection exercises have been conducted; findings are documented and addressed. At least 1 chaos testing approach must be documented or configured.
Fail criteria: No evidence of chaos testing or fault injection exercises.
Skip (N/A) when: The application is in early development with no production deployment yet.
Cross-reference: For recovery time objectives, see recovery-time-objectives.
Detail on fail: "No chaos testing conducted. Error paths untested and likely to fail in production" or "Chaos tests planned but not yet executed"
Remediation: Add chaos testing to your QA process:
// tests/chaos/api-failure.test.tsx — chaos test example (MSW v2 syntax)
test('handles network failure gracefully', async () => {
  // HttpResponse.error() simulates a network-level failure in MSW v2
  server.use(http.get('/api/data', () => HttpResponse.error()))
  render(<App />) // component name is illustrative
  expect(await screen.findByText(/service unavailable/i)).toBeVisible()
})
// Example: fault-injection wrapper for testing
const createChaoticFetch = (failureRate = 0.1) => {
  return async (url: string) => {
    // Fail a configurable fraction of requests to exercise error paths
    if (Math.random() < failureRate) {
      throw new Error('Simulated network failure')
    }
    return fetch(url)
  }
}
// Or use a dedicated fault-injection tool: Gremlin, Chaos Toolkit, or Toxiproxy
// Document findings: "When payment API is unavailable, checkout shows blank screen (should show friendly error)"
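The chaotic-fetch helper above can be exercised deterministically by setting failureRate to 1, so the error path runs on every test execution rather than probabilistically. A minimal sketch (the expectServiceDown name and the /api/data URL are illustrative; the helper is repeated so the snippet is self-contained):

```typescript
// Self-contained sketch of deterministic fault injection.
const createChaoticFetch = (failureRate = 0.1) => {
  return async (url: string) => {
    if (Math.random() < failureRate) {
      throw new Error('Simulated network failure')
    }
    return fetch(url)
  }
}

// failureRate = 1 forces every request down the failure branch,
// so the assertion below is deterministic, not flaky.
const alwaysFailingFetch = createChaoticFetch(1)

async function expectServiceDown(): Promise<string> {
  try {
    await alwaysFailingFetch('/api/data')
    return 'request unexpectedly succeeded'
  } catch (err) {
    return (err as Error).message
  }
}

expectServiceDown().then(console.log) // prints "Simulated network failure"
```

In a component test, the chaotic fetch would typically replace the real `fetch` via dependency injection or a global stub before rendering, so the UI under test sees the injected failures.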