robots.txt parser present and enforced for scrapers
Why it matters
Scraping without checking robots.txt is an ethical violation and a legal risk: it can expose your company to GDPR Art. 6 liability (no lawful basis for collecting personal data from disallowed pages) and CCPA §1798.100 scrutiny. CFAA and DMCA claims have been brought against scrapers that ignored robots.txt. Beyond legal risk, platforms detect unconstrained scrapers and block your IP ranges, which cuts off access to legitimate sources and degrades data quality. CWE-749 (Exposed Dangerous Method or Function) applies when scraping code makes unchecked requests to external systems.
Severity rationale
Critical because a robots.txt violation can constitute unlawful data collection under GDPR Art. 6, exposing the organization to regulatory penalties and civil liability on every scraping request made to a disallowed path.
Remediation
Wrap all outbound scraping requests in a robots.txt enforcement function. Cache the fetched robots.txt for at least 1 hour to avoid excessive fetching, and abort any request to a disallowed path before it executes.
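import robotsParser from 'robots-parser'

// Per-host cache of parsed robots.txt files and the time each was fetched.
const robotsCache = new Map<string, { robots: ReturnType<typeof robotsParser>; cachedAt: number }>()

async function isAllowed(url: string, userAgent: string): Promise<boolean> {
  const { protocol, host } = new URL(url)
  const robotsUrl = `${protocol}//${host}/robots.txt`
  const cached = robotsCache.get(host)
  let robots: ReturnType<typeof robotsParser>
  if (cached && Date.now() - cached.cachedAt < 3_600_000) {
    // Cache hit: the entry is less than 1 hour old.
    robots = cached.robots
  } else {
    const res = await fetch(robotsUrl)
    // Per RFC 9309, a 4xx response means no robots.txt applies (crawling allowed),
    // while a 5xx response must be treated as a complete disallow.
    const text = res.ok ? await res.text() : res.status >= 500 ? 'User-agent: *\nDisallow: /' : ''
    robots = robotsParser(robotsUrl, text)
    robotsCache.set(host, { robots, cachedAt: Date.now() })
  }
  // isAllowed returns undefined when the URL is not covered by this robots.txt; default to allowed.
  return robots.isAllowed(url, userAgent) ?? true
}

// In every scraper:
if (!(await isAllowed(targetUrl, 'MyBot/1.0'))) {
  logger.warn('robots_blocked', { url: targetUrl })
  return
}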
This check must run in 100% of scrapers; a single scraper path without it is a fail.
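One way to make the check hard to skip is to route every outbound request through a single wrapper, so that no scraper calls the HTTP client directly. The sketch below is a minimal illustration under stated assumptions: the robots.txt check and the HTTP client are passed in as functions, and `makeGuardedFetch`, `RobotsCheck`, and `FetchLike` are hypothetical names, not part of robots-parser or any other library.

```typescript
// Hypothetical types: any async robots.txt check and any fetch-like client fit here.
type RobotsCheck = (url: string, userAgent: string) => Promise<boolean>
type FetchLike<T> = (url: string, init: { headers: Record<string, string> }) => Promise<T>

// Builds a fetch function that refuses to send a request to a disallowed URL.
function makeGuardedFetch<T>(
  isAllowed: RobotsCheck,
  fetchImpl: FetchLike<T>,
  userAgent: string,
): (url: string) => Promise<T> {
  return async function guardedFetch(url: string): Promise<T> {
    // The check runs first; the request is never issued for a disallowed path.
    if (!(await isAllowed(url, userAgent))) {
      throw new Error(`robots_blocked: ${url}`)
    }
    // Send the same user-agent string that was used for the robots.txt check.
    return fetchImpl(url, { headers: { 'User-Agent': userAgent } })
  }
}
```

A scraper would then build one `guardedFetch` at startup (for example from the `isAllowed` function above and the global `fetch`) and use it for every request. Throwing instead of returning silently is a design choice made for this sketch: it forces callers to handle the blocked case explicitly.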
Detection
- ID: robots-txt-enforced
- Severity: critical
- What to look for: Enumerate all scraping and crawling code paths. Count every scraper module found. For each, look for a prior robots.txt check before any fetch or navigation of external URLs. A compliant scraper must: fetch https://{domain}/robots.txt before crawling, parse the Disallow and Allow directives for the bot's user-agent, and abort requests to disallowed paths. Look for libraries like robots-parser, robotstxt-guard, crawlee (which handles this automatically), or manual parsing code. Quote the actual robots.txt enforcement function or library import found. A scraper without robots.txt enforcement does not count as pass; do not pass if any scraper lacks the check.
- Pass criteria: Count all scrapers and report the ratio: "N of N scrapers enforce robots.txt." Every scraper performs a robots.txt check before making requests to a domain. Disallowed paths are skipped. A robots.txt cache exists with a TTL of at least 1 hour. The check applies to 100% of scrapers.
- Fail criteria: Any scraper makes HTTP requests to an external domain without first checking robots.txt for that domain. The check is present in some scrapers but not all. Robots.txt is fetched but its Disallow directives are not enforced.
- Skip (N/A) when: The system performs no web scraping; all data comes from APIs, forms, or purchased lists.
- Detail on fail: Name the scraper files or modules. Example: "src/scrapers/linkedin.ts fetches profile pages without a robots.txt check" or "robots-parser library present in package.json but no enforcement code found in scraper paths".
- Remediation: Add robots.txt enforcement before any scraping request:
  import robotsParser from 'robots-parser'

  // Per-host cache of parsed robots.txt files and the time each was fetched.
  const robotsCache = new Map<string, { robots: ReturnType<typeof robotsParser>; cachedAt: number }>()

  async function isAllowed(url: string, userAgent: string): Promise<boolean> {
    const parsed = new URL(url)
    const robotsUrl = `${parsed.protocol}//${parsed.host}/robots.txt`
    const cacheKey = parsed.host
    const cached = robotsCache.get(cacheKey)
    let robots: ReturnType<typeof robotsParser>
    if (cached && Date.now() - cached.cachedAt < 24 * 60 * 60 * 1000) {
      // Cache hit: the entry is less than 24 hours old.
      robots = cached.robots
    } else {
      const res = await fetch(robotsUrl)
      // Per RFC 9309, a 4xx response means no robots.txt applies (crawling allowed),
      // while a 5xx response must be treated as a complete disallow.
      const text = res.ok ? await res.text() : res.status >= 500 ? 'User-agent: *\nDisallow: /' : ''
      robots = robotsParser(robotsUrl, text)
      robotsCache.set(cacheKey, { robots, cachedAt: Date.now() })
    }
    // isAllowed returns undefined when the URL is not covered by this robots.txt; default to allowed.
    return robots.isAllowed(url, userAgent) ?? true
  }

  // In your scraper:
  if (!(await isAllowed(targetUrl, 'MyBot/1.0'))) {
    logger.warn('robots_blocked', { url: targetUrl })
    return
  }
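The counting step in the detection criteria can be roughed out as a first-pass script. The sketch below is only a coarse text heuristic and assumes that source files are already loaded into a map of path to contents; `auditScrapers`, `REQUEST_MARKERS`, and `ENFORCEMENT_MARKERS` are hypothetical names, and the patterns are guesses that need tuning per codebase. It cannot prove that the check runs before every request, so a manual read of each flagged file is still required.

```typescript
// Hypothetical result shape for a first-pass audit.
interface ScraperAudit {
  enforced: string[]
  missing: string[]
  summary: string
  status: 'pass' | 'fail' | 'skip'
}

// Assumed patterns that suggest a file issues outbound requests.
const REQUEST_MARKERS: RegExp[] = [/\bfetch\s*\(/, /\baxios\b/, /\.goto\s*\(/]
// Assumed patterns that suggest an enforcement call, not merely a library import.
const ENFORCEMENT_MARKERS: RegExp[] = [/\bisAllowed\s*\(/, /\bisDisallowed\s*\(/]

function auditScrapers(sources: Record<string, string>): ScraperAudit {
  const enforced: string[] = []
  const missing: string[] = []
  for (const file of Object.keys(sources)) {
    const code = sources[file]
    // Files with no request marker are not treated as scrapers by this heuristic.
    if (!REQUEST_MARKERS.some((re) => re.test(code))) continue
    if (ENFORCEMENT_MARKERS.some((re) => re.test(code))) {
      enforced.push(file)
    } else {
      missing.push(file)
    }
  }
  const total = enforced.length + missing.length
  return {
    enforced,
    missing,
    summary: `${enforced.length} of ${total} scrapers enforce robots.txt`,
    // No scrapers found maps to Skip (N/A); any scraper without a check is a fail.
    status: total === 0 ? 'skip' : missing.length === 0 ? 'pass' : 'fail',
  }
}
```

The `missing` list gives the file names to quote in the detail-on-fail message. A file that wraps its requests in a shared helper would be misclassified by these patterns, which is one reason the output should be treated as a starting point rather than a verdict.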
External references
- gdpr · Art. 6 — Lawfulness of processing — legitimate interest requires respect for ToS/robots.txt
- ccpa · §1798.100 — Consumer right to know — unlawful scraping undermines right-to-know obligations
- cwe · CWE-749 — Exposed Dangerous Method or Function
Taxons
History
- 2026-04-18·v1.0.0·Initial import from data-sourcing-provenance·automated