Dedup check before insert

ab-000893 · data-sourcing-provenance.ingestion-pipeline.dedup-before-insert

Severity: highactive

Why it matters

Duplicate contact records from multiple ingestion runs inflate your list size, distort per-source quality metrics, and cause repeated outreach to the same person — a direct CAN-SPAM and GDPR Art. 5(1)(c) (data minimization) violation. A raw INSERT that relies on a unique constraint exception for dedup handling is not graceful deduplication — it is crash-driven dedup that requires explicit exception handling at every call site or risks silent swallowing of duplicates. CWE-694 (Use of Multiple Resources with Duplicate Identifier) applies when duplicate identity keys are not resolved at the insertion boundary.

Severity rationale

High because ungraceful duplicate handling either corrupts the contacts table with true duplicates or creates unhandled exceptions in ingestion pipelines, both of which degrade data integrity and pipeline reliability.

Remediation

Replace raw INSERTs with an explicit upsert pattern that handles the duplicate case intentionally — updating only non-provenance fields on conflict so the original provenance record is preserved.

// Prisma upsert — provenance fields are in `create`, not `update`
await db.contacts.upsert({
  where:  { email: contact.email },
  create: {
    email:       contact.email,
    source_type: contact.source_type,
    source_id:   contact.source_id,
    acquired_at: contact.acquired_at,
  },
  update: {
    last_seen_at: new Date(),  // Only update non-provenance fields
  },
})

-- Raw SQL alternative
INSERT INTO contacts (email, source_type, source_id, acquired_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (email) DO UPDATE
  SET last_seen_at = NOW();

A unique constraint on email remains required as a database-level safety net, but it should never be the primary dedup mechanism.

Detection

ID: dedup-before-insert
Severity: high
What to look for: Count all INSERT operations in the ingestion pipeline and for each, check whether a dedup mechanism exists. Look for deduplication logic in the ingestion path. Specifically: does the code check for an existing contact with the same email address (or other identity key) BEFORE attempting the INSERT, and handle the duplicate case explicitly (update, skip, merge)? A unique constraint on the database alone is not sufficient — it will raise an exception that must be caught and handled. Look for explicit dedup queries or INSERT ... ON CONFLICT / upsert patterns that handle duplicates gracefully. A raw INSERT with no dedup does not count as pass.
Pass criteria: 100% of ingestion insert paths use either: (a) a pre-insert dedup query, or (b) an upsert operation (INSERT ON CONFLICT DO UPDATE or ORM upsert). At least 1 unique constraint exists on the identity column as a safety net.
Fail criteria: Ingestion code performs a raw INSERT without a prior dedup check, relying on the database to throw a unique constraint violation exception. Duplicate records from different batch runs would either raise unhandled exceptions or be silently swallowed.
Skip (N/A) when: Not applicable — all contact ingestion should handle deduplication.
Detail on fail: "Ingestion uses raw INSERT with no prior dedup check — duplicate records cause unhandled constraint violations" or "No unique constraint or dedup logic found — duplicates can be inserted freely".

Remediation: Use an explicit upsert:

// Prisma upsert
await db.contacts.upsert({
  where: { email: contact.email },
  create: {
    email:       contact.email,
    source_type: contact.source_type,
    source_id:   contact.source_id,
    acquired_at: contact.acquired_at,
  },
  update: {
    // Only update non-provenance fields on conflict
    last_seen_at: new Date(),
  },
})

-- Raw SQL upsert
INSERT INTO contacts (email, source_type, source_id, acquired_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (email) DO UPDATE
  SET last_seen_at = NOW()
WHERE contacts.email = EXCLUDED.email;

External references

cwe · CWE-694 — Use of Multiple Resources with Duplicate Identifier
iso-25010:2011 · functional-suitability.functional-correctness — Functional Correctness — duplicate records violate data integrity
gdpr · Art. 5(1)(c) — Data minimisation

Taxons

data-integrity

History

2026-04-18·v1.0.0·Initial import from data-sourcing-provenance·automated