Scraping rate limits configurable per-domain

ab-000882 · data-sourcing-provenance.legal-sourcing.scraping-rate-limits

Severity: high · active

Why it matters

Scrapers that issue requests to external domains as fast as the network allows violate the implicit contract of the web and, in many cases, the explicit contract of a site's Terms of Service. Unthrottled scraping can create a denial-of-service condition on the target (the failure to control interaction frequency that CWE-799 describes), may be interpreted as intentional interference under the CFAA, and can undermine the lawful basis for processing under GDPR Art. 6, since abusive request patterns are hard to reconcile with any lawful basis for extracting data. Rate limits that are hardcoded and non-configurable create a deploy bottleneck every time a domain operator asks you to slow down.

Severity rationale

High because unthrottled scraping against a target domain can expose the company to legal claims of unauthorized computer access and undermine the lawful basis for collected data under GDPR Art. 6.

Remediation

Define per-domain rate limits in src/config/sources.ts as configurable constants, then enforce them in every scraper with an explicit await sleep(delay) before each request.

// src/config/sources.ts
export const SCRAPER_RATE_LIMITS: Record<string, number> = {
  'linkedin.com': 3000,   // 3 s between requests
  'default':      2000,   // 2 s for all other domains
}

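// In each scraper:
// Promise-based sleep helper (not provided by the runtime)
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))
// Strip a leading "www." so www.linkedin.com matches the 'linkedin.com' key
const domain = new URL(targetUrl).hostname.replace(/^www\./, '')
const delay = SCRAPER_RATE_LIMITS[domain] ?? SCRAPER_RATE_LIMITS['default']
await sleep(delay)
const response = await fetch(targetUrl)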

Minimum acceptable default delay is 1000 ms. The configuration must be external to the scraper logic itself so it can be adjusted without a code change.
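One way to keep the limits adjustable without a code change is to merge overrides from outside the codebase over the built-in values. The sketch below assumes the overrides arrive as a JSON string, for example from an environment variable; the names loadRateLimits, MIN_DELAY_MS, BUILT_IN_LIMITS and SCRAPER_RATE_LIMITS_JSON are hypothetical and not part of this rule. It also applies the 1000 ms floor to every entry, which is stricter than the rule requires (the rule only constrains the default).

```typescript
// Hypothetical floor, taken from the rule's minimum default delay.
const MIN_DELAY_MS = 1000

// Built-in values mirror the example in src/config/sources.ts.
const BUILT_IN_LIMITS: Record<string, number> = {
  'linkedin.com': 3000,
  'default': 2000,
}

// Merge JSON overrides (domain -> delay in ms) over the built-in limits.
// Malformed input is ignored so a bad override never removes throttling.
function loadRateLimits(rawOverrides: string | undefined): Record<string, number> {
  const limits: Record<string, number> = { ...BUILT_IN_LIMITS }
  if (!rawOverrides) return limits

  let parsed: unknown
  try {
    parsed = JSON.parse(rawOverrides)
  } catch {
    return limits // not valid JSON: keep the built-in values
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return limits
  }

  for (const [domain, value] of Object.entries(parsed as Record<string, unknown>)) {
    // Accept only finite numbers, and never go below the floor.
    if (typeof value === 'number' && Number.isFinite(value)) {
      limits[domain] = Math.max(MIN_DELAY_MS, value)
    }
  }
  return limits
}
```

A scraper could then build its table once at startup with something like loadRateLimits(process.env.SCRAPER_RATE_LIMITS_JSON), so that slowing down for one domain becomes a configuration change rather than a deploy.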

Detection

ID: scraping-rate-limits
Severity: high
What to look for: Count all scrapers and for each, check for throttling mechanisms. Look for: configurable delays between requests per domain (await sleep(rateLimit.ms)), concurrency limits (no more than 10 simultaneous requests to the same domain), or use of scraping frameworks that manage rate limits (Crawlee, Playwright with slot limits). Quote the actual delay value or configuration found. Check that the delay is configurable (not hardcoded) so it can be adjusted without a code deploy.
Pass criteria: Scraping code enforces a configurable delay between requests to the same domain, with a default delay of at least 1000 milliseconds between requests. The delay value is not hardcoded — it can be changed via configuration.
Fail criteria: Scrapers make requests to external domains with no intentional delay, or delays are hardcoded constants with no way to adjust per-domain.
Cross-reference: Check data-sourcing-provenance.legal-sourcing.robots-txt-enforced — rate limits complement robots.txt enforcement but do not replace it.
Skip (N/A) when: The system performs no web scraping.
Detail on fail: Example: "Scraper loops over URLs with no delay between requests" or "Rate limit hardcoded to 500ms with no per-domain configuration".

Remediation: Implement configurable per-domain rate limits:

// src/config/sources.ts
export const SCRAPER_RATE_LIMITS: Record<string, number> = {
  'linkedin.com': 3000,    // ms between requests
  'default': 2000,
}

// In scraper:
const domain = new URL(targetUrl).hostname.replace(/^www\./, '')  // match the 'linkedin.com' key
const delay = SCRAPER_RATE_LIMITS[domain] ?? SCRAPER_RATE_LIMITS['default']
await sleep(delay)
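A sleep before each request spaces out requests within a single loop, but it may not space them out when several workers hit the same domain concurrently. The following is a minimal sketch of a shared per-domain throttle that reserves time slots instead; it is a variation on the pattern above, not the rule's prescribed implementation. DomainThrottle, reserve, throttledFetch and FALLBACK_DELAY_MS are hypothetical names, and the "www." stripping is a simplifying assumption about how hostnames map to configuration keys.

```typescript
// Hypothetical fallback used only if the table has no 'default' entry.
const FALLBACK_DELAY_MS = 2000

class DomainThrottle {
  private limits: Record<string, number>
  // Earliest time (ms since epoch) at which the next request may start, per domain.
  private nextAllowedAt = new Map<string, number>()

  constructor(limits: Record<string, number>) {
    this.limits = limits
  }

  // Reserve the next slot for this hostname and return how long the caller must wait (ms).
  reserve(hostname: string, now: number = Date.now()): number {
    const domain = hostname.replace(/^www\./, '')
    const interval = this.limits[domain] ?? this.limits['default'] ?? FALLBACK_DELAY_MS
    const startAt = Math.max(now, this.nextAllowedAt.get(domain) ?? now)
    // The following caller may start no earlier than one interval after this slot.
    this.nextAllowedAt.set(domain, startAt + interval)
    return startAt - now
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Wait for the reserved slot, then issue the request.
async function throttledFetch(throttle: DomainThrottle, targetUrl: string) {
  const waitMs = throttle.reserve(new URL(targetUrl).hostname)
  if (waitMs > 0) await sleep(waitMs)
  return fetch(targetUrl)
}
```

Because reserve is synchronous, concurrent callers sharing one DomainThrottle instance each get a distinct slot, so requests to the same domain stay at least one interval apart while requests to different domains do not block each other. Note that the first request to a domain goes out immediately, which differs from the sleep-before-every-request pattern shown above.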

External references

gdpr · Art. 6 — Lawfulness of processing — scraping without rate limits may violate ToS and undermine lawful basis
cwe · CWE-799 — Improper Control of Interaction Frequency

Taxons

regulatory-conformance

History

2026-04-18·v1.0.0·Initial import from data-sourcing-provenance·automated