All 23 checks with why-it-matters prose, severity, and cross-references to related audits.
An in-memory queue — a plain array, a Map, or a setInterval drain loop — lives entirely in the Node.js process heap. When the process restarts after a deploy, a crash, or an OOM kill, every pending email in that queue vanishes. Recipients who triggered a password-reset or order-confirmation during that window never receive it. CWE-400 covers resource exhaustion from unbounded growth; here the risk is the inverse — the store has zero durability. ISO-25010:2011 reliability.fault-tolerance requires components to survive fault conditions; in-process storage is the antithesis of that property.
Why this severity: Critical because a process restart silently drops all pending sends, and the application has no mechanism to detect or replay the lost jobs.
sending-pipeline-infrastructure.queue-architecture.persistent-queue
When a send job exhausts its retry budget and disappears silently, every permanently failed send becomes invisible. Customer support cannot answer "did we send that?", engineers cannot distinguish a flaky network blip from a systemic ESP misconfiguration, and the business has no path to replay failed communications. CWE-391 covers unchecked error conditions; a missing dead-letter queue (DLQ) is the queue-layer equivalent. ISO-25010:2011 reliability.recoverability requires that a system can recover from faults; a DLQ is the prerequisite for that recovery.
Why this severity: High because permanently failed sends are unrecoverable and unauditable without a DLQ, but the user impact surfaces later rather than as immediate data loss.
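The dead-letter step can be pictured with a minimal in-memory sketch. The `SendJob` shape, the `maxAttempts` parameter, and the `deadLetter` array are hypothetical stand-ins for whatever durable store the queue library provides; none of this is a specific library's API.

```typescript
// Hypothetical job shape and stores, for illustration only.
interface SendJob {
  id: string;
  to: string;
  attemptsMade: number;
}

interface DeadLetterEntry {
  job: SendJob;
  reason: string;
  failedAt: string;
}

// Stand-in for a durable dead-letter store (a table, a second queue, etc.).
const deadLetter: DeadLetterEntry[] = [];

// Returns "retry" while budget remains; otherwise records the job in the DLQ.
function handleFailure(job: SendJob, error: Error, maxAttempts: number): "retry" | "dead-lettered" {
  job.attemptsMade += 1;
  if (job.attemptsMade < maxAttempts) return "retry";
  deadLetter.push({ job, reason: error.message, failedAt: new Date().toISOString() });
  return "dead-lettered";
}
```

Keeping the failure reason and timestamp next to the job is what lets support answer "did we send that?" and lets an operator replay the entry later.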
sending-pipeline-infrastructure.queue-architecture.dead-letter-queue
A job payload containing only `{to, subject, html}` becomes useless the moment it lands in the DLQ. Without a campaign ID, a recipient ID, and a correlation key, you cannot answer which campaign generated the failure, which contact never received the message, or whether the failure was isolated or systemic. ISO-25010:2011 maintainability.analysability requires that system faults can be diagnosed from the available artifacts. An untraceable job payload defeats that entirely, forcing triage through raw log searches across worker instances.
Why this severity: High because untraceable failed jobs in a DLQ cannot be attributed, replayed selectively, or used to satisfy delivery audit requirements without manual cross-referencing of logs.
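A traceable payload might look like the following sketch. The field names `campaignId`, `recipientId`, and `correlationId` are illustrative assumptions rather than a prescribed schema.

```typescript
// Field names are illustrative assumptions, not a prescribed schema.
interface TraceableEmailJob {
  to: string;
  subject: string;
  html: string;
  campaignId: string;    // which campaign generated the send
  recipientId: string;   // which contact it was addressed to
  correlationId: string; // ties queue, worker, and ESP log lines together
}

// Builds a one-line summary suitable for a DLQ listing or a log search.
function describeFailedJob(job: TraceableEmailJob): string {
  return `campaign=${job.campaignId} recipient=${job.recipientId} correlation=${job.correlationId}`;
}
```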
sending-pipeline-infrastructure.queue-architecture.job-traceability
At-least-once delivery guarantees that a job is processed at least once; it does not guarantee that it is processed exactly once. If a worker sends a password-reset email, then crashes before acknowledging the job, BullMQ re-queues it and the next worker pickup sends an identical email. For transactional messages, duplicates erode user trust immediately. CWE-362 covers race conditions on shared resources; duplicate sends are precisely that class of failure. ISO-25010:2011 reliability.fault-tolerance requires the system to behave correctly despite component failures.
Why this severity: High because worker crashes during network I/O are routine events, and without a dedup guard every such crash produces a duplicate transactional email delivered to real recipients.
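A dedup guard can be sketched as a claim-before-send check. The key format and the in-memory `Set` below are assumptions for illustration; across multiple workers the claim would have to live in a shared store with an atomic set-if-absent operation.

```typescript
// In-memory stand-in for a shared store with atomic set-if-absent semantics.
const claimedKeys = new Set<string>();

// Hypothetical key: one send per (message kind, recipient, triggering event).
function dedupKey(kind: string, recipientId: string, eventId: string): string {
  return `${kind}:${recipientId}:${eventId}`;
}

// Returns true only for the first caller of a given key; later callers should skip the send.
function claimSend(key: string): boolean {
  if (claimedKeys.has(key)) return false;
  claimedKeys.add(key);
  return true;
}
```

A worker would call `claimSend` before contacting the ESP and drop the job when it returns `false`. Claiming before the send trades a possible missed email (crash after the claim, before the send) for no duplicates, so the choice depends on which failure is worse for the message type.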
sending-pipeline-infrastructure.queue-architecture.at-least-once-dedup
A single FIFO email queue shared by transactional and marketing sends creates a hidden contention point. A bulk marketing campaign pushing 100,000 jobs can delay a password-reset email by 30 minutes, precisely the window in which a user is waiting at the browser. CWE-400 covers uncontrolled resource consumption; sharing a queue is the mechanism by which a marketing send consumes the delivery window for time-sensitive transactional messages. ISO-25010:2011 performance-efficiency.resource-utilization requires that resources are allocated in proportion to priority.
Why this severity: Low because the failure mode is latency degradation rather than data loss or security breach, and affects only deployments running simultaneous marketing and transactional sends.
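The separation can be illustrated with a small two-lane queue, where transactional jobs are always drained before marketing jobs. The `LanedQueue` class and lane names are hypothetical; a real deployment would more likely use two named queues or the queue library's priority option.

```typescript
type Lane = "transactional" | "marketing";

interface QueuedEmail {
  id: string;
  lane: Lane;
}

// Two FIFO lanes; `next` never returns a marketing job while a transactional one is waiting.
class LanedQueue {
  private lanes: Record<Lane, QueuedEmail[]> = { transactional: [], marketing: [] };

  enqueue(job: QueuedEmail): void {
    this.lanes[job.lane].push(job);
  }

  next(): QueuedEmail | undefined {
    return this.lanes.transactional.shift() ?? this.lanes.marketing.shift();
  }
}
```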
sending-pipeline-infrastructure.queue-architecture.priority-queues
When multiple workers process a drip sequence concurrently, the order in which emails land in a recipient's inbox is determined by worker timing and network latency, not by the sequence design. Email 3 can arrive before email 2 if workers race. For onboarding sequences where each message references the previous one, out-of-order delivery breaks the narrative and confuses recipients. CWE-362 covers race conditions on shared resources; concurrent workers competing for sequence jobs are precisely that. ISO-25010:2011 reliability.fault-tolerance requires the system to behave predictably under concurrent execution.
Why this severity: Low because mis-ordered emails harm user experience but do not cause data loss, security exposure, or system failure.
sending-pipeline-infrastructure.queue-architecture.message-ordering
A hardcoded SendGrid or Mailgun API key in a committed source file is a credential that any engineer with repository read access, and any attacker who gains access to the repo, a build artifact, or a public GitHub fork, can use immediately to send email at your expense, under your sending domain, to any recipient list. This maps to OWASP A02 (Cryptographic Failures) and CWE-798 (Use of Hard-coded Credentials). The damage is not theoretical: repositories are leaked, build logs are captured, and Docker image layers are extracted. Once committed, the credential stays exposed in repository history and cannot be made safe by deleting it; the only remedy is rotating the key.
Why this severity: Critical because a committed API key grants immediate unauthorized send access to the account and cannot be made safe after the fact without key rotation.
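One common alternative is to read the key from the environment and refuse to start when it is absent. The sketch below assumes a variable named `ESP_API_KEY`, which is a hypothetical name; the environment is passed in as a plain object so the function stays easy to test.

```typescript
// `ESP_API_KEY` is a hypothetical variable name; use whatever your deployment defines.
function loadEspApiKey(env: Record<string, string | undefined>): string {
  const key = env["ESP_API_KEY"];
  if (!key || key.trim() === "") {
    // Fail fast instead of falling back to a literal in source.
    throw new Error("ESP_API_KEY is not set; refusing to start the email worker");
  }
  return key;
}
```

At worker startup this would typically be called as `loadEspApiKey(process.env)`.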
sending-pipeline-infrastructure.esp-integration.credentials-secure
When the ESP SDK is imported and called in six different worker files, route handlers, and service modules, switching providers requires auditing and rewriting every one of those call sites. More practically, a team that needs to rotate to a backup ESP during an outage must make coordinated changes across the codebase under time pressure. ISO-25010:2011 maintainability.modifiability requires that changes to one component do not cascade to unrelated modules; direct SDK spread is the structural cause of that cascade.
Why this severity: High because tight coupling to a specific ESP SDK prevents safe provider rotation and forces disruptive multi-file changes whenever the ESP contract changes.
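A thin provider interface is one way to contain the SDK. In the sketch below, `EmailProvider`, `FakeProvider`, and `sendWelcome` are hypothetical names; a real adapter would wrap the vendor SDK inside its `send` method and nowhere else.

```typescript
interface EmailMessage {
  to: string;
  subject: string;
  html: string;
}

interface SendResult {
  providerMessageId: string;
}

// The only surface the rest of the codebase is allowed to depend on.
interface EmailProvider {
  readonly name: string;
  send(message: EmailMessage): Promise<SendResult>;
}

// Stub adapter for illustration; a real one would call the vendor SDK here.
class FakeProvider implements EmailProvider {
  readonly name = "fake";
  readonly sent: EmailMessage[] = [];

  async send(message: EmailMessage): Promise<SendResult> {
    this.sent.push(message);
    return { providerMessageId: `fake-${this.sent.length}` };
  }
}

// Call sites depend on the interface, so swapping providers is a one-line wiring change.
async function sendWelcome(provider: EmailProvider, to: string): Promise<SendResult> {
  return provider.send({ to, subject: "Welcome", html: "<p>Welcome aboard.</p>" });
}
```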
sending-pipeline-infrastructure.esp-integration.esp-abstraction
Polling the ESP's message status API on a timer introduces latency (you learn about a hard bounce minutes later, not seconds), consumes API quota on every poll cycle, and misses events that occur between polling intervals. Gmail and Yahoo's 2024 bulk sender requirements mandate one-click unsubscribe processing; a webhook-less implementation cannot honor those events promptly. CWE-345 covers insufficient verification of data authenticity; a webhook without HMAC signature verification lets anyone inject fake delivery events and manipulate your suppression list.
Why this severity: High because missing webhook processing means delivery failures, bounces, and unsubscribe events are never recorded, leading to repeated sends to invalid addresses and sender reputation damage.
sending-pipeline-infrastructure.esp-integration.webhook-delivery-status
When a worker catches all ESP errors with a single `catch` block and re-throws them into the retry pipeline, invalid email addresses are retried five times before being discarded, burning send quota and generating five bounce events against your domain reputation. SendGrid and Mailgun response codes do not all mean the same thing: a 400 (invalid address) is permanent; a 503 (service unavailable) is transient; a 429 (rate limit) needs a long backoff. Treating all three identically is the mechanism behind CWE-390 (detection of errors without action) and CWE-400 (resource exhaustion from unnecessary retry).
Why this severity: Medium because misclassified errors waste send quota and degrade sender reputation rather than causing immediate data loss or security breach.
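The three cases above can be separated with a small classifier. This is a simplified sketch: the `Disposition` names are hypothetical, and real ESPs document provider-specific codes that deserve their own branches.

```typescript
type Disposition = "permanent" | "rate_limited" | "transient";

// Maps an HTTP error status from the ESP to a retry decision.
function classifyStatus(status: number): Disposition {
  if (status === 429) return "rate_limited"; // back off longer than a normal retry
  if (status >= 500) return "transient";     // e.g. 503: retry with backoff
  if (status >= 400) return "permanent";     // e.g. 400 invalid address: do not retry
  return "transient"; // unexpected value: treated as retryable here, which is an assumption
}
```

A worker would retry only on `transient` and `rate_limited`, and route `permanent` results straight to the failure record.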
sending-pipeline-infrastructure.esp-integration.response-code-mapping
An email worker that starts processing jobs before validating its ESP credentials discovers a misconfiguration only when the first real send fails in production, potentially after dozens of jobs have been dequeued, acknowledged, and logged as in-flight. A rotated API key, a wrong environment variable, or a network policy blocking outbound HTTPS to the ESP creates a failure that only surfaces under load. ISO-25010:2011 reliability.availability requires that faults are detected at startup rather than at runtime, where the blast radius is larger.
Why this severity: Low because the failure is discovered quickly (on first job attempt) rather than going undetected, but the absence of a startup check delays diagnosis and allows job state to become inconsistent.
sending-pipeline-infrastructure.esp-integration.esp-health-check
A single-ESP sending architecture creates a hard dependency on one provider's availability. When SendGrid, Mailgun, or Postmark experiences an outage, all outbound email stops until the provider recovers. For SaaS products where password resets and billing receipts are blocking user actions, even a one-hour outage translates directly to churn and support volume. ISO-25010:2011 reliability.availability requires that systems maintain service continuity through component failures; a second ESP is the minimal architectural hedge against that dependency.
Why this severity: Info because multi-ESP fallback is a hardening concern; many teams reasonably accept the availability dependency of a single ESP with a good SLA.
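A fallback chain can be sketched as trying providers in order. The `Provider` interface and `sendWithFallback` function are hypothetical; a production version would fall through only on outage-class errors, not on permanent failures such as an invalid address.

```typescript
interface Message {
  to: string;
  subject: string;
  html: string;
}

interface Provider {
  name: string;
  send(m: Message): Promise<string>; // resolves to the provider's message ID
}

// Tries each provider in order; returns the first success, rethrows the last failure.
async function sendWithFallback(
  providers: Provider[],
  m: Message,
): Promise<{ provider: string; id: string }> {
  let lastError: unknown = new Error("no providers configured");
  for (const p of providers) {
    try {
      return { provider: p.name, id: await p.send(m) };
    } catch (err) {
      lastError = err; // remember the failure and move on to the next provider
    }
  }
  throw lastError;
}
```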
sending-pipeline-infrastructure.esp-integration.multi-esp-fallback
Email header injection exploits the CRLF-based structure of the SMTP protocol. An attacker who can control a merge field value (say, a user's first name registered as `Alice\r\nBcc: victim@example.com`) can inject additional headers into outgoing emails, silently copying sends to arbitrary recipients. This maps to CWE-93 (CRLF injection) and OWASP A03 (Injection). Subject lines and Reply-To addresses built from unsanitized user input are the most common injection surfaces. The Compliance & Consent Engine category verifies that unsubscribe mechanisms exist; header injection is the vector by which those mechanisms can be bypassed.
Why this severity: Critical because CRLF injection in email headers allows an attacker to redirect outbound email to unintended recipients without any access control bypass.
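A minimal sanitizer for values that end up in headers might look like the sketch below. `sanitizeHeaderValue` is a hypothetical helper; rejecting the input outright instead of rewriting it is an equally reasonable policy.

```typescript
// Replaces CR/LF runs with a space and drops other control characters,
// so a merge-field value cannot start a new header line.
function sanitizeHeaderValue(value: string): string {
  return value
    .replace(/[\r\n]+/g, " ")
    .replace(/[\u0000-\u001f\u007f]/g, "")
    .trim();
}
```

Applied to the example name above, the injected `Bcc:` text stays inside the original header value as inert text instead of becoming its own header.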
sending-pipeline-infrastructure.template-engine.merge-field-sanitization
Email template rendering that happens in the browser exposes your personalization logic, merge-field variable names, template structure, and potentially sensitive business rules to any user who opens DevTools. For SaaS products, this can reveal subscription tier logic, A/B test variants, or internal campaign identifiers embedded in templates. CWE-200 covers exposure of sensitive information to an unauthorized actor; shipping template compilation into the client bundle is the mechanism. OWASP A02 applies because the information asymmetry enables targeted attacks against the application's business logic.
Why this severity: High because client-side template rendering exposes personalization logic and merge-field schemas to any user who inspects the browser bundle.
sending-pipeline-infrastructure.template-engine.server-side-rendering
Email clients are notoriously inconsistent renderers: Outlook 2016-2019 strips flexbox and CSS grid, Gmail removes unreferenced `<style>` blocks, and Apple Mail handles media queries differently from Yahoo. Templates that render perfectly in the preview pane arrive as broken, unstyled blobs for a meaningful portion of recipients, directly degrading conversion on transactional and marketing sends. The user-experience cost compounds for password resets, receipts, and onboarding flows where a misaligned CTA means an abandoned signup or a support ticket.
Why this severity: Medium because broken rendering hurts conversion and trust across a large recipient slice without exposing data or breaking delivery.
sending-pipeline-infrastructure.template-engine.email-client-compatibility
Manual unsubscribe link inclusion means that any template published without that link violates CAN-SPAM and GDPR Article 21 simultaneously. In 2024, Gmail and Yahoo added a hard requirement for `List-Unsubscribe-Post` headers on all bulk sends; without it, messages are increasingly routed to spam or rejected outright. Relying on individual template authors to remember the link is the systematic failure mode: teams under deadline drop it, new contributors don't know the requirement exists, and automated campaign cloning propagates the omission silently.
Why this severity: High because missing unsubscribe mechanisms violate CAN-SPAM and GDPR requirements, and the absence of `List-Unsubscribe-Post` headers causes Gmail bulk sends to be filtered or rejected.
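Automatic injection can be as small as a function applied to every outgoing message's headers. The helper name and the URL in the usage note are hypothetical; the header names and the `List-Unsubscribe=One-Click` value follow RFC 8058.

```typescript
// Adds one-click unsubscribe headers regardless of what the template author supplied.
function withUnsubscribeHeaders(
  headers: Record<string, string>,
  unsubscribeUrl: string,
): Record<string, string> {
  return {
    ...headers,
    "List-Unsubscribe": `<${unsubscribeUrl}>`,
    "List-Unsubscribe-Post": "List-Unsubscribe=One-Click",
  };
}
```

For example, `withUnsubscribeHeaders({}, "https://example.com/u/abc")` yields a `List-Unsubscribe` value of `<https://example.com/u/abc>`. Running this in the send path, rather than in templates, removes the dependence on author memory.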
sending-pipeline-infrastructure.template-engine.unsubscribe-auto-inject
Jobs that reference a template by mutable name always resolve to the current version. When a template is updated mid-campaign, queued jobs that have not yet been processed will render the new version instead of the version that was live when the campaign was dispatched. Recipients in the same send batch receive different content depending on when their job was processed. ISO-25010:2011 maintainability.modifiability requires that changes to one component do not silently alter the behavior of unrelated queued work.
Why this severity: Low because mid-campaign template mutations affect content consistency rather than causing data loss, security exposure, or system failure.
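Pinning can be sketched by storing an immutable version number in the job and resolving templates by `(id, version)`. The `TemplateRef` shape and the in-memory store are assumptions for illustration.

```typescript
// A job carries the exact version that was live at dispatch time.
interface TemplateRef {
  templateId: string;
  version: number;
}

// In-memory stand-in for a template store keyed by "id@version".
const templateStore = new Map<string, string>();

function publishTemplate(templateId: string, version: number, body: string): void {
  templateStore.set(`${templateId}@${version}`, body);
}

// Resolves the pinned version; later publishes never change the result.
function resolveTemplate(ref: TemplateRef): string {
  const body = templateStore.get(`${ref.templateId}@${ref.version}`);
  if (body === undefined) {
    throw new Error(`template ${ref.templateId} v${ref.version} not found`);
  }
  return body;
}
```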
sending-pipeline-infrastructure.template-engine.template-versioning
An HTML-only email displays raw markup in Gmail's push notification preview, in plain-text email clients used in corporate environments, and in assistive technologies that cannot parse HTML. Spam filters from Barracuda, SpamAssassin, and Proofpoint penalize the absence of a multipart/alternative structure because it is a known characteristic of automated spam. RFC 2046 defines the multipart/alternative MIME structure specifically to ensure that all receivers can display a meaningful representation of the message.
Why this severity: Info because the consequence is display degradation and minor spam score penalties rather than security exposure or data loss.
sending-pipeline-infrastructure.template-engine.plain-text-alternative
Fixed-interval or immediate retries during an ESP outage or rate-limit window cause all five retry attempts to fire within seconds. The ESP rejects each one, the job exhausts its retry budget, and the send permanently fails, even though the ESP would have recovered in minutes if the worker had waited. CWE-770 covers allocation of resources without limits; CWE-400 covers resource exhaustion. Exponential backoff is the documented mitigation: it gives transient faults time to resolve without burning retry budget, and it is required by every major ESP's developer guidelines for handling 429 and 5xx responses.
Why this severity: High because fixed or immediate retries exhaust the retry budget during short outages that exponential backoff would survive, causing preventable permanent delivery failures.
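The delay schedule can be written as a one-line formula, delay = min(cap, base × 2^(attempt − 1)). The base and cap values below are illustrative defaults, not recommendations from any ESP.

```typescript
// Delay before retry number `attempt` (1-based), doubling each time up to a cap.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// Attempts 1 through 5 with the defaults wait 1s, 2s, 4s, 8s, and 16s.
const schedule = [1, 2, 3, 4, 5].map((n) => backoffDelayMs(n));
```

Adding random jitter to each delay is a common refinement that keeps many failed jobs from retrying at the same instant; it is omitted here to keep the schedule deterministic.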
sending-pipeline-infrastructure.retry-error-handling.exponential-backoff
A missing or effectively unbounded retry limit allows a job to cycle indefinitely, consuming worker capacity and queue resources without ever resolving. CWE-770 covers allocation of resources without limits; an infinite retry loop is a resource exhaustion vector against the queue infrastructure itself. CWE-391 covers unchecked error conditions; a job that fails silently after an unbounded retry chain leaves no trace for operators. ISO-25010:2011 reliability.recoverability requires that failed states are detectable and recoverable; an undocumented failure is neither.
Why this severity: High because an unbounded retry limit can hold worker concurrency slots indefinitely and provides no path for operators to inspect or replay permanently failed sends.
sending-pipeline-infrastructure.retry-error-handling.max-retry-limit
A worker that calls the ESP, receives a success response, and then crashes before updating the database leaves the job in an unacknowledged state. BullMQ re-queues it, and the next worker pickup sends an identical transactional email (a password reset, an order confirmation, a payment receipt) to a real recipient who has already received it. CWE-362 covers race conditions on shared resources; this is the classic check-then-act race across a network boundary. Unlike the queue-level dedup check, this pattern fails specifically on the retry after a successful-but-unacknowledged send.
Why this severity: Critical because worker crashes during ESP network I/O are expected events, not edge cases, and without idempotency guards every crash produces a duplicate transactional email sent to real recipients.
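An idempotency guard can be sketched as a send log consulted at the start of every attempt. The `sendLog` map and function names are hypothetical, and the map stands in for a durable table keyed by job ID.

```typescript
type SendState = "pending" | "sent";

interface SendLogEntry {
  state: SendState;
  espMessageId?: string;
}

// Stand-in for a durable table keyed by job ID.
const sendLog = new Map<string, SendLogEntry>();

// Called at the start of every attempt, including retries after a crash.
function beginAttempt(jobId: string): "send" | "skip" {
  if (sendLog.get(jobId)?.state === "sent") return "skip"; // already delivered: do not resend
  sendLog.set(jobId, { state: "pending" });
  return "send";
}

// Called immediately after the ESP confirms acceptance.
function recordSuccess(jobId: string, espMessageId: string): void {
  sendLog.set(jobId, { state: "sent", espMessageId });
}
```

This still leaves a window between the ESP's success response and `recordSuccess`. Where the ESP accepts a client-supplied idempotency key, passing the job ID as that key is one way to close it; whether a given provider supports this needs to be checked against its documentation.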
sending-pipeline-infrastructure.retry-error-handling.idempotent-retries
Retrying a hard bounce five times does not deliver the email; it generates five additional bounce events against your sending domain. Bounce rate is the primary metric email providers use to classify senders as spam sources. A sustained bounce rate above 2% triggers deliverability penalties from Gmail and Yahoo that affect all emails from the domain, including transactional messages to valid recipients. CWE-390 covers detection of errors without action; re-throwing permanent failures into the retry pipeline is the application-layer pattern that drives deliverability degradation.
Why this severity: Medium because repeated retries on permanent failures degrade sender reputation across all recipients rather than causing immediate security or data integrity failures.
sending-pipeline-infrastructure.retry-error-handling.no-retry-permanent-failures
A queue worker configured with `concurrency: 1` can process exactly one job at a time. When that job fails and BullMQ schedules a 32-second exponential backoff retry, the worker is blocked from pulling the next available job for 32 seconds. In a FIFO queue, every subsequent recipient in the batch waits behind the failing job. CWE-400 covers resource exhaustion via uncontrolled resource consumption; a single failing job consuming the entire worker thread is the mechanism. ISO-25010:2011 reliability.fault-tolerance requires that a fault in one component does not propagate to unrelated operations.
Why this severity: Medium because single-worker FIFO blocking causes throughput stalls and delivery delays rather than data loss, but affects all recipients in a campaign when any single job fails.
sending-pipeline-infrastructure.retry-error-handling.failure-isolation