Dedup check before insert
Why it matters
Duplicate contact records from multiple ingestion runs inflate your list size, distort per-source quality metrics, and cause repeated outreach to the same person — a direct CAN-SPAM and GDPR Art. 5(1)(c) (data minimization) violation. A raw INSERT that relies on a unique constraint exception for dedup handling is not graceful deduplication — it is crash-driven dedup that requires explicit exception handling at every call site or risks silent swallowing of duplicates. CWE-694 (Use of Multiple Resources with Duplicate Identifier) applies when duplicate identity keys are not resolved at the insertion boundary.
Severity rationale
High because ungraceful duplicate handling either corrupts the contacts table with true duplicates or creates unhandled exceptions in ingestion pipelines, both of which degrade data integrity and pipeline reliability.
Remediation
Replace raw INSERTs with an explicit upsert pattern that handles the duplicate case intentionally — updating only non-provenance fields on conflict so the original provenance record is preserved.
// Prisma upsert — provenance fields are in `create`, not `update`
await db.contacts.upsert({
where: { email: contact.email },
create: {
email: contact.email,
source_type: contact.source_type,
source_id: contact.source_id,
acquired_at: contact.acquired_at,
},
update: {
last_seen_at: new Date(), // Only update non-provenance fields
},
})
-- Raw SQL alternative
INSERT INTO contacts (email, source_type, source_id, acquired_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (email) DO UPDATE
SET last_seen_at = NOW();
A unique constraint on email remains required as a database-level safety net, but it should never be the primary dedup mechanism.
Detection
-
ID:
dedup-before-insert -
Severity:
high -
What to look for: Count all INSERT operations in the ingestion pipeline and for each, check whether a dedup mechanism exists. Look for deduplication logic in the ingestion path. Specifically: does the code check for an existing contact with the same email address (or other identity key) BEFORE attempting the INSERT, and handle the duplicate case explicitly (update, skip, merge)? A unique constraint on the database alone is not sufficient — it will raise an exception that must be caught and handled. Look for explicit dedup queries or
INSERT ... ON CONFLICT/upsertpatterns that handle duplicates gracefully. A raw INSERT with no dedup does not count as pass. -
Pass criteria: 100% of ingestion insert paths use either: (a) a pre-insert dedup query, or (b) an upsert operation (
INSERT ON CONFLICT DO UPDATEor ORM upsert). At least 1 unique constraint exists on the identity column as a safety net. -
Fail criteria: Ingestion code performs a raw INSERT without a prior dedup check, relying on the database to throw a unique constraint violation exception. Duplicate records from different batch runs would either raise unhandled exceptions or be silently swallowed.
-
Skip (N/A) when: Not applicable — all contact ingestion should handle deduplication.
-
Detail on fail:
"Ingestion uses raw INSERT with no prior dedup check — duplicate records cause unhandled constraint violations"or"No unique constraint or dedup logic found — duplicates can be inserted freely". -
Remediation: Use an explicit upsert:
// Prisma upsert await db.contacts.upsert({ where: { email: contact.email }, create: { email: contact.email, source_type: contact.source_type, source_id: contact.source_id, acquired_at: contact.acquired_at, }, update: { // Only update non-provenance fields on conflict last_seen_at: new Date(), }, })-- Raw SQL upsert INSERT INTO contacts (email, source_type, source_id, acquired_at) VALUES ($1, $2, $3, $4) ON CONFLICT (email) DO UPDATE SET last_seen_at = NOW() WHERE contacts.email = EXCLUDED.email;
External references
- cwe · CWE-694 — Use of Multiple Resources with Duplicate Identifier
- iso-25010:2011 · functional-suitability.functional-correctness — Functional Correctness — duplicate records violate data integrity
- gdpr · Art. 5(1)(c) — Data minimisation
Taxons
History
- 2026-04-18·v1.0.0·Initial import from data-sourcing-provenance·automated