Blast radius containment — pause affected campaigns only

ab-001992 · operational-resilience-email.incident-response.blast-radius-containment

Severity: high · active

Why it matters

When a campaign starts generating complaint spikes, the correct response is to pause that campaign, not to halt the entire email system. If the queue worker can only be stopped globally, pausing one broken campaign also takes down order confirmations, password resets, and MFA codes. NIST SP 800-53 IR-4 (Incident Handling) calls for an incident handling capability that includes containment; scoping that containment to the affected campaign keeps collateral impact to a minimum. The Campaign Orchestration & Sequencing Audit verifies sequence management; this check verifies that the pause mechanism is fine-grained enough to contain an incident without collateral damage.

Severity rationale

High because the absence of per-campaign pause forces operators to choose between continuing a damaging campaign or halting all email — including time-critical transactional sends.

Remediation

Add a campaign status check at the start of each job in the queue worker, before any send logic runs:

async function processEmailJob(job: Job) {
  const campaign = await db.campaign.findUnique({ where: { id: job.data.campaignId } })
  if (campaign?.status === 'paused') {
    return { skipped: true, reason: 'campaign paused' }
  }
  // proceed with send
}

Wire a PATCH /api/campaigns/:id endpoint that sets status: 'paused' without requiring a code deploy. A pause mechanism that requires a redeploy or infrastructure restart does not satisfy this check.
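
As a rough illustration of such an endpoint, the sketch below shows a framework-agnostic core for PATCH /api/campaigns/:id. The in-memory store, the patchCampaignStatus name, the 'active' status value, and the response shapes are assumptions made for the example; only status: 'paused' comes from this check. A real implementation would write to the same campaign record that the worker reads.

```typescript
// Hypothetical types for illustration; a real system would use its own schema.
type CampaignStatus = 'active' | 'paused'

interface Campaign {
  id: string
  status: CampaignStatus
}

interface HttpResult {
  status: number
  body: { id?: string; status?: CampaignStatus; error?: string }
}

// Core of PATCH /api/campaigns/:id, independent of any web framework.
// `campaigns` stands in for the database table the queue worker reads from.
function patchCampaignStatus(
  campaigns: Record<string, Campaign>,
  id: string,
  payload: { status?: unknown }
): HttpResult {
  // Own-property check so keys such as "constructor" are not treated as campaigns.
  if (!Object.prototype.hasOwnProperty.call(campaigns, id)) {
    return { status: 404, body: { error: 'campaign not found' } }
  }
  const campaign = campaigns[id]
  const requested = payload.status

  // Accept only known status values; reject everything else.
  if (requested === 'active' || requested === 'paused') {
    campaign.status = requested
    return { status: 200, body: { id: campaign.id, status: campaign.status } }
  }
  return { status: 400, body: { error: "status must be 'active' or 'paused'" } }
}
```

Because the handler only changes a stored value that the worker checks on each job, a pause issued this way takes effect without a deploy or restart, and it touches no campaign other than the one named in the URL.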

Detection

ID: blast-radius-containment
Severity: high
What to look for: Enumerate all campaign-level control mechanisms: a paused or status flag on individual campaign records, an API endpoint or admin action that sets this flag, and queue worker logic that checks the flag before processing a job. Count how many of these 3 components are present. The Campaign Orchestration & Sequencing Audit verifies sequence management; this check verifies that incident response does not require a full system halt.
Pass criteria: An individual campaign can be paused in under 60 seconds without affecting other campaigns. The pause mechanism is operable without a code deploy (e.g., a database flag checked by the worker, admin UI action, or API call). The worker must check the campaign status before processing each job, not after. Do NOT pass when the only pause mechanism requires a code deploy or infrastructure restart.
Fail criteria: Any of the following: the only way to stop a problematic campaign is to halt the entire queue worker or stop all email sending; no per-campaign pause mechanism exists; or pausing requires a code change and redeploy.
Skip (N/A) when: The system only ever runs one campaign at a time and a full pause is equivalent — confirmed by the application architecture.
Detail on fail: "No per-campaign pause flag — stopping one campaign requires halting all queue workers" or "Campaign status is managed in code only — a pause requires a code change and redeploy"
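
One way to exercise the pass criteria is a small isolation test: pause one campaign, run a mixed batch, and confirm that only the other campaigns' jobs are sent. The sketch below is a minimal, synchronous, in-memory version written under assumptions; the EmailJob shape, the runBatch helper, and the campaign names are hypothetical, and a real worker would look the status up in its database asynchronously.

```typescript
// Hypothetical shapes for illustration only.
type CampaignStatus = 'active' | 'paused'

interface EmailJob {
  campaignId: string
  recipient: string
}

interface JobResult {
  skipped: boolean
  reason?: string
}

// Checks the campaign status before any send logic runs.
function processEmailJob(
  job: EmailJob,
  statuses: Record<string, CampaignStatus>,
  send: (job: EmailJob) => void
): JobResult {
  if (statuses[job.campaignId] === 'paused') {
    return { skipped: true, reason: 'campaign paused' }
  }
  send(job)
  return { skipped: false }
}

// Processes every job in order and returns the recipients actually sent to.
function runBatch(
  jobs: EmailJob[],
  statuses: Record<string, CampaignStatus>
): string[] {
  const sent: string[] = []
  jobs.forEach((job) => {
    processEmailJob(job, statuses, (j) => {
      sent.push(j.recipient)
    })
  })
  return sent
}
```

With a 'promo' campaign paused and a 'password-reset' campaign active, runBatch should return only the password-reset recipients. If a paused campaign's recipients appear in the result, the worker is checking the flag too late or not at all.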

Remediation: Add a status check in the queue worker:

async function processEmailJob(job: Job) {
  const campaign = await db.campaign.findUnique({ where: { id: job.data.campaignId } })
  if (campaign?.status === 'paused') {
    // Return without sending. The job completes as skipped; re-enqueue it on resume if it should still go out.
    return { skipped: true, reason: 'campaign paused' }
  }
  // proceed with send
}

Wire a PATCH /api/campaigns/:id endpoint that allows setting status: 'paused' without a deploy.

External references

iso-25010:2011 · reliability.fault-tolerance — Reliability / Fault Tolerance — blast radius containment limits incident impact to affected campaigns only
nist:rev5 · IR-4 — NIST 800-53 IR-4: Incident Handling

Taxons

error-resilience operational-readiness

History

2026-04-18·v1.0.0·Initial import from operational-resilience-email·automated