robots.txt parser present and enforced for scrapers

ab-000881 · data-sourcing-provenance.legal-sourcing.robots-txt-enforced

Severity: critical · active

Why it matters

Scraping without checking robots.txt is an ethical violation and a legal risk. Collecting personal data from disallowed pages can expose your company to GDPR Art. 6 liability (no lawful basis for the collection) and CCPA §1798.100 scrutiny, and scrapers that ignored robots.txt have faced CFAA and DMCA claims. Beyond legal risk, platforms detect unconstrained scrapers and block your IP ranges, which cuts off access to legitimate sources and degrades data quality. CWE-749 (Exposed Dangerous Method or Function) applies when scraping code makes unchecked requests to external systems.

Severity rationale

Critical because a robots.txt violation can constitute unlawful data collection under GDPR Art. 6, exposing the organization to regulatory penalties and civil liability on every scraping request made to a disallowed path.

Remediation

Wrap all outbound scraping requests in a robots.txt enforcement function. Cache the fetched robots.txt for at least 1 hour to avoid excessive fetching, and abort any request to a disallowed path before it executes.

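import robotsParser from 'robots-parser'

// Parsed robots.txt per origin, together with the time it was fetched.
const robotsCache = new Map<string, { robots: ReturnType<typeof robotsParser>; cachedAt: number }>()
const ROBOTS_TTL_MS = 3_600_000 // 1 hour

async function isAllowed(url: string, userAgent: string): Promise<boolean> {
  // robots.txt applies per origin (scheme + host + port), so the cache is keyed on the origin.
  // Keying on host alone would let http and https share an entry, and robots-parser
  // returns undefined for a URL whose origin differs from the robots.txt URL.
  const { origin } = new URL(url)
  const robotsUrl = `${origin}/robots.txt`
  const cached = robotsCache.get(origin)
  let robots: ReturnType<typeof robotsParser>

  if (cached && Date.now() - cached.cachedAt < ROBOTS_TTL_MS) {
    robots = cached.robots
  } else {
    const res = await fetch(robotsUrl)
    // A non-OK response is treated as an empty file, which allows everything.
    const text = res.ok ? await res.text() : ''
    robots = robotsParser(robotsUrl, text)
    robotsCache.set(origin, { robots, cachedAt: Date.now() })
  }
  // isAllowed returns undefined when the URL does not belong to this robots.txt.
  return robots.isAllowed(url, userAgent) ?? true
}

// In every scraper, before the request is made:
if (!(await isAllowed(targetUrl, 'MyBot/1.0'))) {
  logger.warn('robots_blocked', { url: targetUrl })
  return
}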

This check must run in 100% of scrapers; a single scraper path without it is a fail.
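One way to make the check hard to bypass is to hand scrapers only a request function that already includes it. The following is a minimal sketch, not a definitive implementation: `guardRequests`, `RobotsCheck`, and the convention of returning `null` for a blocked URL are hypothetical names chosen for illustration, and the `check` argument is assumed to have the same shape as the `isAllowed` function above.

```typescript
// Hypothetical type: any async robots.txt check, e.g. the isAllowed function above.
type RobotsCheck = (url: string, userAgent: string) => Promise<boolean>

// Wraps a request function so that every call is checked against robots.txt first.
// Returns null (an assumption of this sketch) when the URL is disallowed.
function guardRequests<T>(
  check: RobotsCheck,
  request: (url: string) => Promise<T>,
  userAgent: string,
  onBlocked: (url: string) => void = () => {},
): (url: string) => Promise<T | null> {
  return async (url: string): Promise<T | null> => {
    if (!(await check(url, userAgent))) {
      onBlocked(url) // e.g. log a robots_blocked event
      return null    // the underlying request is never executed
    }
    return request(url)
  }
}
```

A scraper module would then receive only the guarded function, for example `guardRequests(isAllowed, (url) => fetch(url), 'MyBot/1.0')`, so there is no unguarded request path to forget.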

Detection

ID: robots-txt-enforced
Severity: critical
What to look for: Enumerate all scraping and crawling code paths. Count every scraper module found. For each, look for a prior robots.txt check before any fetch or navigation of external URLs. A compliant scraper must: fetch https://{domain}/robots.txt before crawling, parse the Disallow and Allow directives for the bot's user-agent, and abort requests to disallowed paths. Look for libraries like robots-parser, robotstxt-guard, crawlee (which can enforce robots.txt when configured to do so), or manual parsing code. Quote the actual robots.txt enforcement function or library import found. A scraper without robots.txt enforcement does not count as a pass; do not pass the check if any scraper lacks it.
Pass criteria: Count all scrapers and report the ratio: "N of N scrapers enforce robots.txt." Every scraper performs a robots.txt check before making requests to a domain. Disallowed paths are skipped. A robots.txt cache exists with a TTL of at least 1 hour. The check applies to 100% of scrapers.
Fail criteria: Any scraper makes HTTP requests to an external domain without first checking robots.txt for that domain. The check is present in some scrapers but not all. Robots.txt is fetched but its Disallow directives are not enforced.
Skip (N/A) when: The system performs no web scraping — all data comes from APIs, forms, or purchased lists.
Detail on fail: Name the scraper files or modules. Example: "src/scrapers/linkedin.ts fetches profile pages without a robots.txt check" or "robots-parser library present in package.json but no enforcement code found in scraper paths".
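
When reviewing manual parsing code, it helps to know what correct Allow/Disallow handling looks like. The sketch below is a simplified illustration, not a complete parser: `rulesFor` and `isPathAllowed` are hypothetical names, and it assumes plain prefix matching only (no `*` or `$` wildcards, no percent-encoding normalization). It picks the group for the bot's user-agent, falls back to the `*` group, and lets the longest matching rule win, with Allow winning ties.

```typescript
type Rule = { allow: boolean; path: string }

// Collect the Allow/Disallow rules that apply to the given user-agent.
function rulesFor(robotsTxt: string, userAgent: string): Rule[] {
  const product = userAgent.replace(/\/.*$/, '').toLowerCase() // "MyBot/1.0" -> "mybot"
  const specific: Rule[] = []
  const wildcard: Rule[] = []
  let agents: string[] = []
  let inRules = false         // true once the current group has listed a rule
  let matchedSpecific = false // true if any group names this bot explicitly

  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim() // strip comments
    const sep = line.indexOf(':')
    if (sep < 0) continue
    const field = line.slice(0, sep).trim().toLowerCase()
    const value = line.slice(sep + 1).trim()

    if (field === 'user-agent') {
      if (inRules) { agents = []; inRules = false } // a new group begins
      agents.push(value.toLowerCase())
      if (value.toLowerCase() === product) matchedSpecific = true
    } else if (field === 'allow' || field === 'disallow') {
      inRules = true
      if (value === '') continue // an empty value restricts nothing
      const rule: Rule = { allow: field === 'allow', path: value }
      if (agents.includes(product)) specific.push(rule)
      else if (agents.includes('*')) wildcard.push(rule)
    }
  }
  // A group that names the bot replaces the "*" group entirely.
  return matchedSpecific ? specific : wildcard
}

// Longest matching prefix wins; Allow wins a tie; no match means allowed.
function isPathAllowed(rules: Rule[], path: string): boolean {
  let best: Rule | undefined
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue
    const longer = !best || rule.path.length > best.path.length
    const tieAllow = best !== undefined && rule.path.length === best.path.length && rule.allow
    if (longer || tieAllow) best = rule
  }
  return best ? best.allow : true
}
```

Manual code that only looks at Disallow lines, ignores the user-agent grouping, or takes the first matching rule instead of the longest one is a likely source of false "allowed" results and deserves a closer look during the audit.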

Remediation: Add robots.txt enforcement before any scraping request:

import robotsParser from 'robots-parser'

// Parsed robots.txt per origin, together with the time it was fetched.
const robotsCache = new Map<string, { robots: ReturnType<typeof robotsParser>; cachedAt: number }>()

async function isAllowed(url: string, userAgent: string): Promise<boolean> {
  const parsed = new URL(url)
  const robotsUrl = `${parsed.origin}/robots.txt`
  // Key on the origin (scheme + host + port): robots-parser returns undefined
  // for a URL whose origin differs from the robots.txt it was built from.
  const cacheKey = parsed.origin
  const cached = robotsCache.get(cacheKey)

  let robots: ReturnType<typeof robotsParser>
  // 24-hour TTL, which satisfies the "at least 1 hour" pass criterion.
  if (cached && Date.now() - cached.cachedAt < 24 * 60 * 60 * 1000) {
    robots = cached.robots
  } else {
    const res = await fetch(robotsUrl)
    // A non-OK response is treated as an empty file, which allows everything.
    const text = res.ok ? await res.text() : ''
    robots = robotsParser(robotsUrl, text)
    robotsCache.set(cacheKey, { robots, cachedAt: Date.now() })
  }

  return robots.isAllowed(url, userAgent) ?? true
}

// In your scraper, before the request is made:
if (!(await isAllowed(targetUrl, 'MyBot/1.0'))) {
  logger.warn('robots_blocked', { url: targetUrl })
  return
}
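
The remediation code treats every non-OK response as an empty robots.txt, which allows all paths. RFC 9309 distinguishes two cases: a 4xx response means the file is unavailable and crawling may proceed, while a 5xx response means it is unreachable and the crawler must assume a complete disallow. A stricter variant could look roughly like the sketch below; `robotsTextForStatus` and `DISALLOW_ALL` are hypothetical names, and treating every other status (for example an unfollowed redirect) as a disallow is a conservative assumption of this sketch.

```typescript
// Hypothetical constant: a robots.txt body that blocks every path for every bot.
const DISALLOW_ALL = 'User-agent: *\nDisallow: /'

// Maps the HTTP status of the robots.txt fetch to the text handed to the parser.
function robotsTextForStatus(status: number, body: string): string {
  if (status >= 200 && status < 300) return body // fetched: use the file as served
  if (status >= 400 && status < 500) return ''   // unavailable: no restrictions apply
  return DISALLOW_ALL                            // 5xx or anything else: block everything
}
```

Inside `isAllowed`, the line that computes `text` could then become `const text = robotsTextForStatus(res.status, await res.text())`. A network error thrown by `fetch` would need the same disallow-all treatment in a surrounding try/catch, which this sketch leaves out.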

External references

gdpr · Art. 6 — Lawfulness of processing — legitimate interest requires respect for ToS/robots.txt
ccpa · §1798.100 — Consumer right to know — unlawful scraping undermines right-to-know obligations
cwe · CWE-749 — Exposed Dangerous Method or Function

Taxons

regulatory-conformance privacy-consent

History

2026-04-18·v1.0.0·Initial import from data-sourcing-provenance·automated
