Data Sourcing & Provenance Audit: all 21 checks, with why-it-matters prose, severity ratings, and cross-references to related audits.
Unvalidated free-text source type fields let any string enter your database — 'scraped', 'Scraped', 'SCRAPER', and '' all become distinct values. Downstream deduplication queries, compliance lookups, and per-source quality analytics break silently when the field is inconsistent. CWE-20 (Improper Input Validation) and ISO 27001:2022 A.8.9 both require that data integrity controls are in place at the input boundary. Without a constrained enum, you cannot reliably segment contacts by acquisition method or demonstrate to a regulator that you know where each record came from.
Why this severity: High because an unconstrained source_type field silently corrupts every downstream provenance query and compliance report that groups or filters by source.
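A constrained enum can be enforced at the ingest boundary before the database ever sees the value. A minimal sketch in Python; the allowed values and the function name are illustrative, not taken from the audit:

```python
# Canonical acquisition methods; anything else is rejected at ingest.
# The value list is an assumption -- use whatever your pipeline supports.
ALLOWED_SOURCE_TYPES = frozenset({"scraped", "purchased", "api", "form_submission"})

def normalize_source_type(raw: str) -> str:
    """Normalize case and whitespace, then reject values outside the enum."""
    value = raw.strip().lower()
    if value not in ALLOWED_SOURCE_TYPES:
        raise ValueError(f"unknown source_type: {raw!r}")
    return value
```

Mirroring the same constraint in the schema (a CHECK constraint or a foreign key to a lookup table) also catches writes that bypass the application layer.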
See full pattern: data-sourcing-provenance.source-management.source-type-enum

Per-contact quality flags (bounce, invalid email) are only useful if you can trace them back to the source that produced them. Without per-source quality scoring you have no way to quarantine a bad purchased list, deprioritize a scraper hitting stale pages, or negotiate data credits with a vendor whose records are 40% invalid. ISO 25010:2011 §4.2.7 (data quality) requires that data accuracy be measurable and traceable. Absence of source-level aggregation means data rot goes undetected until deliverability collapses.
Why this severity: Medium because the gap is an operational blind spot rather than an immediate data breach, but it allows quality degradation to accumulate undetected across entire sources.
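Source-level aggregation can be as simple as rolling per-contact flags up by source. A sketch, assuming contacts are dicts carrying a `source_id` and a boolean `is_invalid` flag (both field names are assumptions):

```python
from collections import defaultdict

def per_source_invalid_rate(contacts):
    """Aggregate per-contact quality flags up to their source.

    Returns a mapping of source_id -> fraction of invalid records,
    the number a quarantine threshold can act on.
    """
    totals = defaultdict(int)
    invalid = defaultdict(int)
    for contact in contacts:
        totals[contact["source_id"]] += 1
        if contact["is_invalid"]:
            invalid[contact["source_id"]] += 1
    return {sid: invalid[sid] / totals[sid] for sid in totals}
```

A rate computed this way is what lets you act per source, for example pausing any list whose invalid rate crosses a chosen threshold.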
See full pattern: data-sourcing-provenance.source-management.per-source-quality-score

Scraper configurations stored only in the database — targets, rate limits, user-agent strings — can be changed by anyone with database access, bypassing code review and leaving no audit trail. SLSA Build L2 and SSDF PW.4 both require that build and configuration artifacts go through version control. A silent config change that removes rate-limit constraints or adds a scraping-prohibited domain is a legal and operational risk that version control would have caught.
Why this severity: Low because the immediate data integrity risk is indirect — the gap is a governance control weakness rather than an exploitable vulnerability today.
See full pattern: data-sourcing-provenance.source-management.source-config-versioned

A scraper or API integration that stops producing records with no alert will silently starve your pipeline for days. You will not know until someone notices list growth has plateaued or a manual check reveals the last ingest timestamp is a week old. NIST CSF 2.0 DE.CM-1 requires continuous monitoring of system components, which extends to data ingestion pipelines. Without dead-source detection, outages in data acquisition are invisible until they become business-impacting.
Why this severity: Low because silent failures affect future data acquisition rather than compromising existing records, but they can cause significant operational and business impact over time.
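Dead-source detection needs only a last-successful-ingest timestamp per source and a staleness threshold. A sketch; the two-day default and the field names are illustrative:

```python
from datetime import datetime, timedelta

def dead_sources(last_ingest_at: dict, now: datetime,
                 max_silence: timedelta = timedelta(days=2)) -> list:
    """Return source ids whose last successful ingest is older than the threshold.

    `last_ingest_at` maps source_id -> datetime of the last successful ingest;
    the returned list is what an alerting job would page on.
    """
    return sorted(
        source_id
        for source_id, ingested_at in last_ingest_at.items()
        if now - ingested_at > max_silence
    )
```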
See full pattern: data-sourcing-provenance.source-management.dead-source-alerting

Ingestion logs that lack a `source_id` dimension tell you what happened but not where. When one source starts producing bad records at scale — malformed emails, duplicate floods, elevated bounce rates — you cannot isolate it without manual log correlation. NIST CSF 2.0 DE.AE-3 requires that event data include enough context for anomaly detection. Missing source dimension in logs means the signal exists but is not actionable without hours of forensic work.
Why this severity: Low because the gap degrades observability rather than correctness, but the operational cost compounds every time a source-level incident requires investigation.
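The fix is to make `source_id` a mandatory dimension on every ingestion log event. A sketch using JSON-structured log lines; the event and field names are illustrative:

```python
import json
import logging

logger = logging.getLogger("ingest")

def log_ingest_event(source_id: str, event: str, **fields) -> dict:
    """Emit (and return) a structured event that always carries the
    source dimension, so per-source incidents can be filtered in the
    log aggregator without manual correlation."""
    payload = {"source_id": source_id, "event": event, **fields}
    logger.info(json.dumps(payload))
    return payload
```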
See full pattern: data-sourcing-provenance.source-management.source-level-metrics

Scraping without checking robots.txt is a legal and ethical violation that exposes your company to GDPR Art. 6 liability (no lawful basis for collecting personal data from disallowed pages) and CCPA §1798.100 scrutiny. CFAA and DMCA claims have been upheld against scrapers that ignored robots.txt. Beyond legal risk, platforms detect unconstrained scrapers and block your IP ranges, poisoning data quality for legitimate sources. CWE-749 (Exposed Dangerous Method or Function) applies when scraping code makes unchecked requests to external systems.
Why this severity: Critical because a robots.txt violation can constitute unlawful data collection under GDPR Art. 6, exposing the organization to regulatory penalties and civil liability on every scraping request made to a disallowed path.
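Python's standard library ships a robots.txt parser, so the check costs a few lines. A sketch that evaluates a candidate URL against an already-fetched robots.txt body (the fetch itself is omitted):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a candidate URL against a site's robots.txt before fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a real scraper the robots.txt body would be fetched once per domain and cached, and any disallowed URL dropped before a request is ever made.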
See full pattern: data-sourcing-provenance.legal-sourcing.robots-txt-enforced

Scrapers that make requests at full network speed to external domains violate the implicit contract of the web — and in many cases the explicit contract of a site's Terms of Service. Unthrottled scraping can constitute a denial-of-service condition under CWE-799 and may be interpreted as intentional interference under CFAA and GDPR Art. 6 (no lawful basis exists for extracting data via abusive request patterns). Rate limits that are hardcoded and non-configurable create a deploy bottleneck every time a domain operator asks you to slow down.
Why this severity: High because unthrottled scraping against a target domain can expose the company to legal claims of unauthorized computer access and undermine the lawful basis for collected data under GDPR Art. 6.
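A per-domain limiter with a configurable requests-per-second cap keeps the rate out of hardcoded constants. A minimal sketch that computes the required wait instead of sleeping, so the policy is testable; all names are illustrative:

```python
class DomainRateLimiter:
    """Minimum-interval limiter with a configurable requests-per-second cap.

    The cap would typically come from per-domain config rather than code,
    so slowing down for a domain operator is a config change, not a deploy.
    """

    def __init__(self, requests_per_second: float):
        self.min_interval = 1.0 / requests_per_second
        self._last_request_at = None

    def wait_time(self, now: float) -> float:
        """Seconds the caller must wait before the next request (0 if none)."""
        if self._last_request_at is None:
            return 0.0
        elapsed = now - self._last_request_at
        return max(0.0, self.min_interval - elapsed)

    def record_request(self, now: float) -> None:
        self._last_request_at = now
```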
See full pattern: data-sourcing-provenance.legal-sourcing.scraping-rate-limits

GDPR Art. 5(1)(d) requires that personal data be kept accurate and up to date. Purchased contact lists degrade rapidly — industry benchmarks cite 22–30% annual data decay on B2B contacts. Importing a 6-month-old list without a date check means you are processing data the vendor has already cycled out, inflating your bounce rate and contacting people who have changed roles or departed. GDPR Art. 13 requires you to inform data subjects of the data source; stale lists may have been collected under expired consent. CCPA §1798.100 adds a parallel obligation to ensure data accuracy.
Why this severity: High because importing stale purchased lists without age verification risks GDPR Art. 5(1)(d) accuracy violations and exposes the company to outreach based on outdated, potentially withdrawn consent.
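An age gate on import is a one-function check once the vendor's acquisition date is captured. A sketch; the 90-day default is an illustrative threshold, not a figure from GDPR:

```python
from datetime import date, timedelta

def list_is_fresh(acquired_on: date, today: date, max_age_days: int = 90) -> bool:
    """Gate a purchased-list import on the vendor's stated acquisition date.

    Imports of lists older than the threshold should be blocked or routed
    to re-verification rather than loaded directly.
    """
    return (today - acquired_on) <= timedelta(days=max_age_days)
```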
See full pattern: data-sourcing-provenance.legal-sourcing.purchased-list-age

Third-party data APIs (Apollo, Clearbit, Hunter, People Data Labs, ZoomInfo) enforce rate limits contractually. Ignoring 429 responses and retrying without backoff burns your monthly API quota in minutes, crashes enrichment pipelines, and may violate ToS. OWASP A05 (Security Misconfiguration) includes failure to honor upstream service constraints. Key cycling — rotating through multiple API keys to exceed per-account limits — is an explicit ToS violation that can result in account termination and, for GDPR-regulated data, loss of the legal basis for processing contacts sourced through that API.
Why this severity: High because key cycling to bypass rate limits violates vendor ToS and can invalidate the lawful basis for data collected through the API, while unhandled 429s crash enrichment pipelines and destroy data freshness.
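Honoring 429s means retrying with exponential backoff rather than hammering the endpoint. A sketch with injectable `fetch` and `sleep` so the policy can be tested without a live API; the names and the absence of jitter are simplifications:

```python
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `fetch` while it returns HTTP 429, doubling the delay each time.

    `fetch` returns any object with a `status_code` attribute; `sleep` is
    injectable so the retry schedule can be tested without waiting.
    """
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("rate limit still in effect after retries")
```

Production code would usually add jitter and honor a `Retry-After` header when the vendor sends one.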
See full pattern: data-sourcing-provenance.legal-sourcing.api-rate-limit-compliance

Authenticating as a user on platforms that prohibit scraping in their ToS — LinkedIn, Facebook, Twitter — bypasses access controls those platforms have implemented specifically to protect user data. This is not a legal gray area: GDPR Art. 6 requires a lawful basis for processing, and simulated session access provides none. CWE-284 (Improper Access Control) and OWASP A01 (Broken Access Control) both apply. Courts have found CFAA violations in cases involving credential-based scraping of platforms with clear no-scraping ToS. CCPA §1798.100 adds parallel liability for collecting personal information without a valid basis.
Why this severity: Critical because credential-based scraping of ToS-restricted platforms exposes the company to CFAA liability, GDPR Art. 6 unlawful-processing findings, and platform-level account termination affecting all legitimate API access.
See full pattern: data-sourcing-provenance.legal-sourcing.no-login-wall-scraping

GDPR Art. 13 requires that data subjects be informed of the source from which their personal data was obtained. Art. 30 requires a Records of Processing Activities (RoPA) that includes data sources. Without `source_type`, `source_id`, and `acquired_at` as NOT NULL fields on every contact record, you cannot respond to a regulatory audit, a data subject access request, or a right-to-erasure request that asks you to delete all contacts from a specific list. SLSA Provenance L1 requires a minimum verifiable provenance record for supply-chain integrity. CWE-345 (Insufficient Verification of Data Authenticity) applies when records lack traceable origin.
Why this severity: Critical because the absence of any provenance field means the system cannot fulfill GDPR Art. 13 disclosure obligations or Art. 30 RoPA requirements, making every contact record a potential regulatory liability.
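Enforcing the three fields as NOT NULL at the schema level makes provenance-free records unrepresentable. A sketch using SQLite for illustration; the table layout is an assumption:

```python
import sqlite3

# In-memory schema sketch: the three provenance fields are NOT NULL,
# so a contact cannot be created without a traceable origin.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contacts (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        source_type TEXT NOT NULL,
        source_id TEXT NOT NULL,
        acquired_at TEXT NOT NULL
    )
""")

def insert_contact(email, source_type, source_id, acquired_at):
    conn.execute(
        "INSERT INTO contacts (email, source_type, source_id, acquired_at)"
        " VALUES (?, ?, ?, ?)",
        (email, source_type, source_id, acquired_at),
    )
```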
See full pattern: data-sourcing-provenance.provenance-tracking.required-provenance-fields

Provenance fields that can be overwritten are not provenance — they are annotations. If an enrichment job can silently replace `source_id` or `acquired_at` on an existing contact, the audit trail is broken. GDPR Art. 5(1)(d) requires data accuracy, which includes accuracy about the origin of the data. SLSA Provenance L2 requires that provenance records be unforgeable after creation. CWE-345 (Insufficient Verification of Data Authenticity) applies when records can be modified post-creation in ways that destroy traceability.
Why this severity: High because mutable provenance fields allow enrichment jobs or admin tools to silently rewrite the record of origin, breaking GDPR Art. 30 compliance and destroying the evidentiary value of the audit trail.
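At the application layer, a frozen dataclass makes provenance write-once; at the database layer the same guarantee could come from revoking UPDATE on the columns or a BEFORE UPDATE trigger. A sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    """Write-once record of a contact's origin; any mutation raises at runtime."""
    source_type: str
    source_id: str
    acquired_at: str
```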
See full pattern: data-sourcing-provenance.provenance-tracking.provenance-immutable

Enrichment from third-party services (Clearbit, Apollo, Hunter) supplements original data with fields those vendors sourced from their own pipelines — which may themselves have provenance gaps or consent issues. GDPR Art. 30 requires that the RoPA document all processing operations, including data augmentation from third parties. SLSA Provenance L1 applies to data supply chains as much as software supply chains. CWE-345 applies when enriched fields can no longer be traced to their origin. Without an enrichment log, you cannot tell a regulator which fields a third party supplied and under what basis.
Why this severity: Medium because without an enrichment chain of custody, the company cannot demonstrate GDPR Art. 30 compliance for enriched fields or reconstruct the data lineage required to respond to a data subject access request.
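A chain of custody can start as an append-only log with one entry per enriched field. A sketch; the providers, field names, and in-memory list (standing in for a database table) are illustrative:

```python
from datetime import datetime, timezone

enrichment_log = []  # append-only; in production this would be a table

def record_enrichment(contact_id, field, value, provider):
    """Append one chain-of-custody entry per enriched field, so each
    value can later be traced to the vendor that supplied it."""
    enrichment_log.append({
        "contact_id": contact_id,
        "field": field,
        "value": value,
        "provider": provider,
        "enriched_at": datetime.now(timezone.utc).isoformat(),
    })

def fields_from_provider(contact_id, provider):
    """Which fields on this contact came from a given third party?"""
    return [e["field"] for e in enrichment_log
            if e["contact_id"] == contact_id and e["provider"] == provider]
```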
See full pattern: data-sourcing-provenance.provenance-tracking.enrichment-chain-of-custody

GDPR Art. 30 requires a Records of Processing Activities identifying each category of data subjects and the source of their data. Art. 17 (right to erasure) requires that you can efficiently locate and delete all records from a specific source when that source becomes non-compliant or a data subject invokes their rights. Without an index on `source_id`, bulk lookups require full table scans that become prohibitively slow at scale. CCPA §1798.100 adds a parallel right to know which business collected the personal information and from what source.
Why this severity: High because missing indexes on provenance fields make GDPR Art. 17 erasure-by-source and Art. 30 compliance lookups operationally infeasible at production table sizes.
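The fix is a plain index on `source_id`, and SQLite's EXPLAIN QUERY PLAN can verify the index is actually used by an erasure-by-source lookup. A sketch; table and index names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT, source_id TEXT)"
)
conn.execute("CREATE INDEX idx_contacts_source_id ON contacts (source_id)")

def erasure_lookup_plan() -> str:
    """Query plan for 'find every contact from this source'; it should
    report an index search rather than a full table scan."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM contacts WHERE source_id = ?",
        ("list_42",),
    ).fetchall()
    return " ".join(row[-1] for row in rows)  # last column is the plan detail
```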
See full pattern: data-sourcing-provenance.provenance-tracking.provenance-queryable

GDPR Art. 17 (right to erasure) and Art. 5(1)(e) (storage limitation) require that personal data be erased when the basis for holding it no longer applies. If a purchased list is invalidated — the vendor is found to have collected it unlawfully, or the list exceeds its contractual retention period — you need a reliable mechanism to remove or anonymize all contacts derived from that source. CWE-459 (Incomplete Cleanup) applies when deletion of a parent record leaves orphaned child records with dangling foreign keys and no defined disposition.
Why this severity: Medium because orphaned contacts after a source deletion leave the system in a GDPR Art. 17 non-compliant state — records exist with no valid processing basis and no mechanism to find or remove them.
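A defined disposition can be declared in the schema with ON DELETE CASCADE (or ON DELETE SET NULL plus an anonymization job, depending on policy). A SQLite sketch; note that SQLite requires foreign-key enforcement to be switched on per connection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
conn.executescript("""
    CREATE TABLE sources (id TEXT PRIMARY KEY);
    CREATE TABLE contacts (
        id INTEGER PRIMARY KEY,
        email TEXT,
        source_id TEXT NOT NULL REFERENCES sources(id) ON DELETE CASCADE
    );
""")
```

With this in place, deleting an invalidated source removes every contact derived from it in the same statement, leaving no orphans.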
See full pattern: data-sourcing-provenance.provenance-tracking.source-deletion-cascade

GDPR Art. 6 requires a lawful basis for every contact's personal data processing — and that basis must be determinable from the record itself, not from institutional memory. Art. 13 requires that data subjects be told the legal basis at the time their data is collected. Without a `legal_basis` indicator on the provenance record, you cannot demonstrate compliance per-contact or automate consent-status checks. A scraping operation categorized as 'legitimate interest' needs different treatment than a form submission with explicit consent — conflating them because the record carries no basis indicator is a regulatory failure waiting to happen.
Why this severity: Low because the field is supplemental when a separate consent management system carries a foreign key link, but its absence on standalone provenance records breaks GDPR Art. 6 auditability at the contact level.
See full pattern: data-sourcing-provenance.provenance-tracking.provenance-consent-context

Inserting unvalidated contact data — malformed emails, missing required provenance fields, incorrect data types — violates GDPR Art. 5(1)(d), which requires that personal data be accurate and kept up to date. OWASP A03 (Injection) and CWE-20 (Improper Input Validation) both apply when untrusted external data is written directly to the database without a validation gate. A single batch import with no schema check can flood your contacts table with garbage records that are expensive to identify and remove retroactively.
Why this severity: High because unvalidated ingestion paths allow malformed records — invalid emails, null provenance fields, wrong types — to corrupt the contacts table and undermine every downstream process that depends on data quality.
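A validation gate returns a list of errors per record so nothing reaches the INSERT without passing. A sketch; the email regex is deliberately simple and the required-field list is an assumption:

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simple
REQUIRED_FIELDS = ("email", "source_type", "source_id", "acquired_at")

def validate_contact(record: dict) -> list:
    """Return a list of validation errors; an empty list means safe to insert."""
    errors = [f"missing {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    email = record.get("email")
    if email and not EMAIL_RE.match(email):
        errors.append(f"malformed email: {email!r}")
    return errors
```

Returning errors rather than raising lets the caller route failed records to a quarantine store instead of aborting the batch.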
See full pattern: data-sourcing-provenance.ingestion-pipeline.schema-validation-on-ingest

Duplicate contact records from multiple ingestion runs inflate your list size, distort per-source quality metrics, and cause repeated outreach to the same person — a direct CAN-SPAM and GDPR Art. 5(1)(c) (data minimization) violation. A raw INSERT that relies on a unique constraint exception for dedup handling is not graceful deduplication — it is crash-driven dedup that requires explicit exception handling at every call site or risks silent swallowing of duplicates. CWE-694 (Use of Multiple Resources with Duplicate Identifier) applies when duplicate identity keys are not resolved at the insertion boundary.
Why this severity: High because ungraceful duplicate handling either corrupts the contacts table with true duplicates or creates unhandled exceptions in ingestion pipelines, both of which degrade data integrity and pipeline reliability.
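Graceful dedup resolves the conflict in the INSERT itself rather than via exception handling. A SQLite sketch using ON CONFLICT DO NOTHING (available since SQLite 3.24); the schema and email-as-key choice are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT PRIMARY KEY, source_id TEXT)")

def upsert_contact(email: str, source_id: str) -> bool:
    """Insert a contact, treating a duplicate email as a counted no-op.

    Returns True if a new row was inserted, False for a duplicate, so
    callers can track dedup hits instead of catching IntegrityError.
    """
    cur = conn.execute(
        "INSERT INTO contacts (email, source_id) VALUES (?, ?)"
        " ON CONFLICT (email) DO NOTHING",
        (email, source_id),
    )
    return cur.rowcount == 1
```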
See full pattern: data-sourcing-provenance.ingestion-pipeline.dedup-before-insert

Silently discarding records that fail validation destroys forensic visibility into data quality problems. A `catch` block that logs an error and continues leaves no persistent artifact of what was rejected, why, or how many records were affected. CWE-390 (Detection of Error Condition Without Action) applies directly — the error is detected but no remediation path exists. ISO 25010:2011 §4.2.5 (reliability) requires that fault conditions produce recoverable states. Without a quarantine store, a systematic problem in a purchased list (wrong column mapping, encoding issue) will produce thousands of silent discards that are invisible until someone notices the list's expected contact count does not match what was inserted.
Why this severity: Medium because silent discards make systematic ingestion failures invisible, prevent reprocessing when the root cause is fixed, and violate the principle that all received data should have an auditable disposition.
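A quarantine store gives every received record an auditable disposition. A sketch with injected `validate`, `insert`, and `quarantine` so the routing logic is explicit; all names are illustrative:

```python
def ingest_batch(records, validate, insert, quarantine):
    """Route every record to an auditable disposition: inserted or quarantined.

    `validate` returns a list of errors; failing records land in the
    quarantine store with their errors attached, so they can be
    reprocessed once the root cause (e.g. a column mapping) is fixed.
    """
    inserted = quarantined = 0
    for record in records:
        errors = validate(record)
        if errors:
            quarantine.append({"record": record, "errors": errors})
            quarantined += 1
        else:
            insert(record)
            inserted += 1
    return {"inserted": inserted, "quarantined": quarantined}
```

The returned summary is also the per-run count a monitoring job can compare against the list's expected size.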
See full pattern: data-sourcing-provenance.ingestion-pipeline.malformed-record-quarantine

Ingestion batch jobs with no structured output metrics are black boxes. You cannot tell whether a 30-minute job processed 10,000 records successfully or crashed after 100. NIST CSF 2.0 DE.AE-3 requires that event data be aggregated to support detection of anomalies. Without per-run throughput metrics — records processed, records quarantined, duration — you have no baseline against which to detect degradation, and no data to drive SLA decisions about ingestion pipeline capacity.
Why this severity: Low because the gap is a monitoring weakness rather than a data correctness issue, but the absence of throughput metrics makes it impossible to detect performance degradation or failure rate trends across ingestion runs.
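Per-run metrics need only a wrapper that counts and times the run. A sketch; the metric names are illustrative:

```python
import time

def run_ingest_with_metrics(records, process):
    """Wrap an ingestion run so every execution emits a structured metrics
    record: counts and duration, the baseline needed to spot degradation."""
    started = time.monotonic()
    processed = failed = 0
    for record in records:
        try:
            process(record)
            processed += 1
        except Exception:
            # Broad catch is for the sketch only: count the failure and continue.
            failed += 1
    return {
        "records_processed": processed,
        "records_failed": failed,
        "duration_seconds": time.monotonic() - started,
    }
```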
See full pattern: data-sourcing-provenance.ingestion-pipeline.ingestion-throughput-monitoring

An ingestion pipeline with no backpressure mechanism is a cost bomb waiting for a large import. A bulk upload of 500,000 contacts from a newly purchased list can exhaust database connection pools, max out queue worker memory, and cause cascading failures in unrelated services sharing the same infrastructure. CWE-400 (Uncontrolled Resource Consumption) applies when there is no ceiling on how fast upstream producers can push work into the pipeline. ISO 25010:2011 §4.2.6 (resource utilization) requires that systems manage resource consumption under load. The fix cost is low; the blast radius of omitting it is high.
Why this severity: Low because the risk materializes only during atypical high-volume imports, but when it does, an unbound queue can exhaust memory and crash the ingestion pipeline and adjacent services.
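Backpressure can be as simple as a bounded queue between producers and ingest workers: once it is full, producers get an explicit saturation signal instead of unbounded memory growth. A sketch; the tiny `maxsize` is for illustration only:

```python
import queue

# A bounded queue gives the pipeline a hard ceiling: once `maxsize` items
# are in flight, producers are refused instead of growing memory without
# limit during a 500,000-row import.
work_queue = queue.Queue(maxsize=3)  # tiny for illustration

def try_enqueue(item) -> bool:
    """Non-blocking put: returns False when the pipeline is saturated,
    letting the producer slow down, retry later, or spill to disk."""
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False
```

A blocking `put()` (producers wait) is the other common policy; which one fits depends on whether the upstream can tolerate stalling.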
See full pattern: data-sourcing-provenance.ingestion-pipeline.backpressure-on-saturation