Authenticating as a user on platforms that prohibit scraping in their ToS (LinkedIn, Facebook, Twitter) bypasses access controls those platforms implemented specifically to protect user data. This is not a legal gray area: GDPR Art. 6 requires a lawful basis for processing, and simulated session access provides none. CWE-284 (Improper Access Control) and OWASP A01 (Broken Access Control) are the relevant classifications. Courts have found CFAA violations in cases involving credential-based scraping of platforms with clear no-scraping ToS. CCPA §1798.100 adds parallel exposure, since it requires notice to consumers at or before the point of collection, which covert session-based scraping does not give.
This check is critical because credential-based scraping of ToS-restricted platforms exposes the company to CFAA liability, GDPR Art. 6 unlawful-processing findings, and platform-level account termination that can also cut off all legitimate API access.
Remove all stored session cookies, login credentials, and authentication flows targeting scraping-prohibited platforms. Replace with official API integrations where available.
// Remove this pattern entirely from the codebase:
// const cookies = JSON.parse(fs.readFileSync('config/linkedin-cookies.json', 'utf8'))
// await page.setCookie(...cookies)
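// Use the official API instead. The package name, constructor options, and
// search() call below are illustrative placeholders; take the real client and
// auth flow from the platform's developer documentation.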
import { LinkedInAPI } from '@linkedin/api-client'
const client = new LinkedInAPI({ apiKey: process.env.LINKEDIN_API_KEY })
const results = await client.search({ keywords: query })
For platforms with no official API, restrict scraping to unauthenticated public pages that are not excluded by robots.txt. Document the legal basis in a data processing register before any scraping begins. Consult legal counsel on each platform's ToS before implementation.
ID: data-sourcing-provenance.legal-sourcing.no-login-wall-scraping
Severity: critical
What to look for: Enumerate all scraping modules and, for each, check for authentication to third-party services. Count every instance of stored credentials, session cookies, or login flows for external platforms (LinkedIn, Facebook, Twitter, etc.). Look for code that logs in to external services before scraping, or that uses browser automation to navigate past login screens on platforms that prohibit scraping in their Terms of Service. Distinguish services that explicitly allow API access (e.g., using an official API key with their blessing) from authentication that simulates a user session on a platform that prohibits scraping. Simulating user sessions on platforms with no-scraping ToS does not count as a pass.
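The enumeration described above can be partly automated. The following is a minimal heuristic sketch, not an existing audit tool: `scanSource`, `Finding`, `PLATFORMS`, and the signal patterns are all hypothetical names chosen for illustration. It flags lines in a source file that mention one of the named platforms and that look like cookie injection (Puppeteer's `page.setCookie`, Playwright's `context.addCookies`), a stored cookie file, or a password field selector.

```typescript
// Hypothetical heuristic scanner. Every name and pattern here is an assumption
// made for illustration; hits are leads for manual review, not verdicts.

interface Finding {
  file: string;
  line: number; // 1-based line number within the scanned source
  platform: string;
  reason: string;
}

// Platforms named in this check; extend the list as needed.
const PLATFORMS = ["linkedin", "facebook", "twitter"];

// Signals that a user session is being simulated rather than an official API used.
const SESSION_SIGNALS: Array<{ pattern: RegExp; reason: string }> = [
  { pattern: /\.setCookie\s*\(/, reason: "injects stored session cookies" },
  { pattern: /\.addCookies\s*\(/, reason: "injects stored session cookies" },
  { pattern: /cookies?[\w-]*\.json/i, reason: "references a stored cookie file" },
  { pattern: /input\[(name|type)=["']?password/i, reason: "targets a login form" },
];

function scanSource(file: string, source: string): Finding[] {
  // Only report files that mention a listed platform in their path or contents.
  const haystack = (file + "\n" + source).toLowerCase();
  const platform = PLATFORMS.find((p) => haystack.includes(p));
  if (!platform) return [];

  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { pattern, reason } of SESSION_SIGNALS) {
      if (pattern.test(text)) {
        findings.push({ file, line: index + 1, platform, reason });
      }
    }
  });
  return findings;
}
```

Because the patterns are plain text matches, commented-out code and unrelated cookie handling will also be flagged. Each hit still needs a reviewer to confirm that the target platform's ToS actually prohibits the access in question.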
Pass criteria: Scrapers do not authenticate as users on platforms that prohibit scraping in their ToS. 0 instances of stored credentials or session cookies for scraping-prohibited platforms are found. Any authenticated data access uses the platform's official API with proper authorization (not credential simulation).
Fail criteria: The codebase contains at least 1 credential, session cookie, or login flow that authenticates to external platforms for the purpose of scraping content beyond what the platform's ToS permits.
Skip (N/A) when: The system performs no web scraping, or all scraping targets are public pages with no authentication involved.
Detail on fail: Name the file and platform. Example: "src/scrapers/linkedin.ts stores LinkedIn session cookies and navigates authenticated search pages" or "Playwright script logs in to Facebook and scrapes group members".
Remediation: Replace session-based scraping with official API integrations where available. Remove stored credentials and session cookies from the codebase entirely:
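// Before: session-based scraping (DO NOT USE)
// const cookies = JSON.parse(fs.readFileSync('src/config/linkedin-cookies.json', 'utf8'))
// await page.setCookie(...cookies)
// After: use the official API with proper authorization. The package name,
// constructor options, and search() call below are illustrative placeholders;
// take the real client and auth flow from the platform's developer documentation.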
import { LinkedInAPI } from '@linkedin/api-client'
const client = new LinkedInAPI({ apiKey: process.env.LINKEDIN_API_KEY })
const results = await client.search({ keywords: query })
For platforms without APIs, limit scraping to publicly accessible, unauthenticated pages that are not excluded by robots.txt. Consult your legal counsel on the specific platform's ToS before collecting data.
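As an illustration of the robots.txt restriction, here is a simplified sketch of a pre-fetch check. It is written under stated assumptions rather than as a complete implementation: `parseRules` and `isPathAllowed` are hypothetical helpers, only plain prefix rules are handled (no `*` or `$` wildcards), and only the group for the exact agent token or the `*` group is consulted. Among matching rules the longest prefix wins, with `Allow` winning ties.

```typescript
// Simplified robots.txt check. Assumptions: prefix rules only (no wildcards),
// and the crawler's agent token must match a User-agent line exactly,
// otherwise the "*" group applies.

interface Rule {
  allow: boolean;
  prefix: string;
}

function parseRules(robotsTxt: string, agent: string): Rule[] {
  const groups = new Map<string, Rule[]>();
  let currentAgents: string[] = [];
  let readingAgents = false; // true while consecutive User-agent lines are read

  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      if (!readingAgents) currentAgents = []; // a new group starts here
      readingAgents = true;
      const token = value.toLowerCase();
      currentAgents.push(token);
      if (!groups.has(token)) groups.set(token, []);
    } else if (field === "allow" || field === "disallow") {
      readingAgents = false;
      if (value === "") continue; // an empty rule imposes no restriction
      for (const token of currentAgents) {
        groups.get(token)!.push({ allow: field === "allow", prefix: value });
      }
    }
  }
  return groups.get(agent.toLowerCase()) ?? groups.get("*") ?? [];
}

function isPathAllowed(robotsTxt: string, agent: string, path: string): boolean {
  let best: Rule | undefined;
  for (const rule of parseRules(robotsTxt, agent)) {
    if (!path.startsWith(rule.prefix)) continue;
    const longer = !best || rule.prefix.length > best.prefix.length;
    const tieAllow = best && rule.prefix.length === best.prefix.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true; // no matching rule means the path is allowed
}
```

A scraper would call `isPathAllowed` with the fetched robots.txt body before requesting each URL and skip any path that returns `false`. Passing this check does not establish a legal basis on its own; the ToS review and legal consultation described above still apply.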