Search and citation crawlers are not blocked in robots.txt

ab-001519 · geo-readiness.ai-crawler-access.ai-crawlers-not-blocked

Severity: critical · active

Why it matters

A small set of robots.txt tokens controls whether AI answer engines can surface your site at all, and every vendor documents its own. OpenAI is categorical — sites that opt out of OAI-SearchBot "will not be shown in ChatGPT search answers." Anthropic states that blocking Claude-SearchBot or Claude-User "may reduce your site's visibility" in Claude's search results. Perplexity recommends allowing PerplexityBot "to ensure your site appears in search results." Google's AI Overviews and AI Mode are built on the ordinary Search index, so a Googlebot block removes you from those too, and Bing's index powers Microsoft Copilot and is a named third-party provider behind ChatGPT search. Blocking these search and citation crawlers silently removes your product from the fastest-growing discovery channel. Blocking training-only tokens like GPTBot or ClaudeBot is a different, documented choice that does not affect search visibility.

Severity rationale

Critical because vendors document the exclusion as binary for their search surfaces — OpenAI states opted-out sites "will not be shown in ChatGPT search answers" — and a single Disallow line silently removes the site from that platform's answers.

Remediation

Remove Disallow rules targeting search and citation crawlers (OAI-SearchBot, Claude-SearchBot, Claude-User, PerplexityBot, Googlebot, bingbot) from public/robots.txt or app/robots.ts. OpenAI notes robots.txt changes take about 24 hours to propagate, and Anthropic requires the opt-out (or opt-in) per subdomain. A page blocked from OAI-SearchBot can still appear in ChatGPT as a bare title-and-link if discovered via a third-party search provider; only noindex removes that.

// app/robots.ts
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // One wildcard rule is enough. If you want to opt out of AI TRAINING
      // without losing search visibility, add Disallow entries only for the
      // training tokens (GPTBot, ClaudeBot, Google-Extended). Every vendor
      // documents these as independent of search appearance. Note that
      // blocking Google-Extended also opts you out of Gemini app grounding.
      { userAgent: '*', allow: '/', disallow: ['/api/', '/auth/'] },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  }
}
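
The training opt-out mentioned in the comment above could look roughly like the following. This is a minimal sketch, not a recommended configuration: the local `RobotsRule` and `RobotsConfig` types and the `TRAINING_TOKENS` constant are assumptions introduced so the snippet stands alone (in a Next.js project the return type would normally be `MetadataRoute.Robots`), and blocking Google-Extended carries the Gemini grounding trade-off described above.

```typescript
// app/robots.ts (sketch). The local types below are assumptions that mirror
// the shape of the robots config; they are not imported from any library.
type RobotsRule = {
  userAgent: string | string[];
  allow?: string | string[];
  disallow?: string | string[];
};
type RobotsConfig = { rules: RobotsRule[]; sitemap?: string };

// Training-only tokens named in this check. Blocking them is an owner choice.
const TRAINING_TOKENS = ['GPTBot', 'ClaudeBot', 'Google-Extended'];

export default function robots(): RobotsConfig {
  return {
    rules: [
      // Search and citation crawlers fall under the wildcard group and stay allowed.
      { userAgent: '*', allow: '/', disallow: ['/api/', '/auth/'] },
      // Training tokens get their own group with a site-wide Disallow.
      { userAgent: TRAINING_TOKENS, disallow: '/' },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  };
}
```

None of the six search/citation tokens appears in a Disallow group here, so this configuration would still pass the check.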

Detection

ID: ai-crawlers-not-blocked
Severity: critical
What to look for: Examine robots.txt (static file at public/robots.txt or generated via app/robots.ts or equivalent). Before evaluating, quote every User-agent / Disallow rule pair found in the file. Classify each user-agent token into one of three buckets, sourced from the vendors' own documentation:
- Search/citation bots (blocking these is the failure): OAI-SearchBot (controls appearing in ChatGPT search answers — OpenAI: opted-out sites "will not be shown in ChatGPT search answers"), Claude-SearchBot and Claude-User (Anthropic: blocking each "may reduce your site's visibility" in Claude search — note Anthropic documents honoring robots.txt for its user-triggered fetcher, unlike OpenAI/Perplexity, so Claude-User belongs in this bucket by Anthropic's own statement), PerplexityBot (Perplexity search results), Googlebot (Google's AI Overviews and AI Mode are grounded by the core Search index), bingbot (Bing's index powers Microsoft Copilot, and Bing is a named third-party provider for ChatGPT search — the Copilot/ChatGPT consequence is the obvious inference from those two documented facts).
- Training-opt-out tokens (blocking these is a documented owner choice — never a failure): GPTBot (OpenAI: training only, independent of search), ClaudeBot (Anthropic: training only), Google-Extended (a robots.txt token, not a crawler; controls Gemini model training and grounding in Gemini Apps/Vertex AI — Google states it "does not impact a site's inclusion in Google Search nor is it used as a ranking signal." Blocking it does cost Gemini app citations; report that trade-off, don't penalize it).
- Other tokens (report only, no score impact): CCBot, Bytespider, FacebookBot/meta-externalagent (no documented citation role among the vendors above; CCBot feeds third-party corpora that some search partners may use, so its classification is not vendor-verified), and the user-triggered fetchers ChatGPT-User and Perplexity-User (their vendors state robots.txt "may not apply" / is "generally ignored" for user-triggered fetches, and neither governs search appearance).
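
The three buckets above can be sketched as a small lookup. The names `Bucket`, `SEARCH_CITATION`, `TRAINING_OPT_OUT`, and `classifyToken` are hypothetical and exist only for illustration; the token lists are the ones given in this check. Matching is case-insensitive, on the assumption that user-agent tokens are compared that way.

```typescript
type Bucket = "search-citation" | "training-opt-out" | "other";

// Blocking any of these is the failure condition.
const SEARCH_CITATION = [
  "OAI-SearchBot", "Claude-SearchBot", "Claude-User",
  "PerplexityBot", "Googlebot", "bingbot",
];

// Blocking these is a documented owner choice and is never penalized.
const TRAINING_OPT_OUT = ["GPTBot", "ClaudeBot", "Google-Extended"];

// Classify one User-agent token. Anything not listed above (CCBot,
// Bytespider, ChatGPT-User, Perplexity-User, ...) is report-only.
// The wildcard "*" also lands in "other"; a blanket block is evaluated
// separately by the fail criteria.
function classifyToken(token: string): Bucket {
  const t = token.trim().toLowerCase();
  if (SEARCH_CITATION.some((n) => n.toLowerCase() === t)) return "search-citation";
  if (TRAINING_OPT_OUT.some((n) => n.toLowerCase() === t)) return "training-opt-out";
  return "other";
}
```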
Pass criteria: Count the search/citation bots (6 listed above) with an effective Disallow rule. The count must be 0. A generic User-agent: * / Allow: / with no AI-specific blocks passes. Training-opt-out blocks and other-bucket blocks do NOT count against the pass. Report even on pass: "0 of 6 search/citation bots blocked. Training-opt-out blocks present (owner's choice, not penalized): [names or none]."
Do NOT fail when: only training-opt-out tokens (GPTBot, ClaudeBot, Google-Extended) or other-bucket tokens are blocked. Worked example: a robots.txt containing User-agent: GPTBot / Disallow: / while all 6 search/citation bots are allowed → result pass, with detail noting the training opt-out as the owner's documented choice. Every vendor documents training controls as independent of search appearance.
Fail criteria: At least 1 of the 6 search/citation bots has an effective Disallow rule, or a blanket User-agent: * / Disallow: / blocks everything. Report: "X of 6 search/citation bots blocked: [names]", naming each blocked bot and its documented consequence using the vendor's own strength of claim — OpenAI's is categorical ("will not be shown in ChatGPT search answers"); Anthropic's is hedged ("may reduce visibility"); for Googlebot/bingbot note the block also removes the site from conventional search, which the SEO Fundamentals audit covers.
Do NOT pass when: a Disallow: / for * exists with Allow exceptions that do not cover the main content paths, or when a search/citation bot is blocked on the primary content subdomain even if allowed elsewhere (Anthropic documents per-subdomain semantics).
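
A rough sketch of the pass/fail count follows. It is a simplified reading of RFC 9309, not a full parser: it only detects a site-wide block on the root path (`Disallow: /` or `Disallow: /*` not outweighed by an equal or longer `Allow`), matches user-agent tokens by exact case-insensitive name, and ignores path-specific blocks and the per-subdomain case, which still need manual review. All function names are hypothetical.

```typescript
const SEARCH_BOTS = [
  "OAI-SearchBot", "Claude-SearchBot", "Claude-User",
  "PerplexityBot", "Googlebot", "bingbot",
];

interface Group { agents: string[]; allow: string[]; disallow: string[] }

// Split robots.txt into groups: consecutive User-agent lines share one group.
function parseGroups(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      if (!current || !lastWasAgent) {
        current = { agents: [], allow: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      if (current) {
        if (field === "allow") current.allow.push(value);
        else current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// Length of the longest pattern that matches the root path "/", or -1 if none.
function rootMatchLength(paths: string[]): number {
  let best = -1;
  for (const p of paths) {
    if ((p === "/" || p === "/*") && p.length > best) best = p.length;
  }
  return best;
}

// A bot uses its own named groups if any exist, otherwise the "*" groups.
function isSiteWideBlocked(groups: Group[], bot: string): boolean {
  const name = bot.toLowerCase();
  let matched = groups.filter((g) => g.agents.indexOf(name) !== -1);
  if (matched.length === 0) matched = groups.filter((g) => g.agents.indexOf("*") !== -1);
  const allow: string[] = [];
  const disallow: string[] = [];
  for (const g of matched) {
    allow.push(...g.allow);
    disallow.push(...g.disallow);
  }
  // Longest match wins; on a tie the Allow rule wins.
  return rootMatchLength(disallow) > rootMatchLength(allow);
}

function evaluate(robotsTxt: string): { status: "pass" | "fail"; blocked: string[] } {
  const groups = parseGroups(robotsTxt);
  const blocked = SEARCH_BOTS.filter((bot) => isSiteWideBlocked(groups, bot));
  return { status: blocked.length === 0 ? "pass" : "fail", blocked };
}
```

Under these assumptions, a file that blocks only GPTBot evaluates to pass, while a blanket `User-agent: *` / `Disallow: /` reports all six bots as blocked.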
Skip (N/A) when: No robots.txt found — this is caught by the SEO Fundamentals audit. Note in detail: "No robots.txt found — cannot verify AI crawler access. See SEO Fundamentals audit."
Detail on fail: Name the specific bots and mirror each vendor's hedge. Example: "2 of 6 search/citation bots blocked: OAI-SearchBot (OpenAI: site will not be shown in ChatGPT search answers) and Claude-User (Anthropic: may reduce visibility in Claude search). Training-opt-out blocks present: GPTBot (not penalized)."
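
One way to assemble that detail string is sketched below. The `CONSEQUENCE` map and the `failDetail` and `joinList` helpers are hypothetical; the wording for each bot paraphrases the vendor statements quoted in this check and keeps each vendor's own strength of claim, with the bingbot entry marked as an inference rather than a vendor statement.

```typescript
// Consequence text per bot, mirroring each vendor's own hedge.
const CONSEQUENCE: Record<string, string> = {
  "OAI-SearchBot": "OpenAI: site will not be shown in ChatGPT search answers",
  "Claude-SearchBot": "Anthropic: may reduce visibility in Claude search",
  "Claude-User": "Anthropic: may reduce visibility in Claude search",
  "PerplexityBot": "Perplexity: recommends allowing it to ensure the site appears in search results",
  "Googlebot": "Google: AI Overviews and AI Mode are built on the Search index; also removes the site from conventional search",
  "bingbot": "inferred: Bing's index powers Microsoft Copilot and is a provider for ChatGPT search; also removes the site from conventional search",
};

// "a", "a and b", or "a, b, and c".
function joinList(items: string[]): string {
  if (items.length <= 1) return items.join("");
  if (items.length === 2) return items[0] + " and " + items[1];
  return items.slice(0, -1).join(", ") + ", and " + items[items.length - 1];
}

function failDetail(blocked: string[], trainingBlocked: string[]): string {
  const named = blocked.map((bot) => {
    const why = CONSEQUENCE[bot];
    return why ? bot + " (" + why + ")" : bot;
  });
  const training = trainingBlocked.length > 0
    ? trainingBlocked.join(", ") + " (not penalized)"
    : "none";
  return blocked.length + " of 6 search/citation bots blocked: " + joinList(named) +
    ". Training-opt-out blocks present: " + training + ".";
}
```

Calling `failDetail(["OAI-SearchBot", "Claude-User"], ["GPTBot"])` reproduces the example sentence given above.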
Cross-reference: A Googlebot or bingbot block is also a conventional-SEO failure — the Advanced SEO audit covers robots.txt structure, crawlability, and sitemaps in depth; this check scores only the AI-answer consequence.
Remediation: Remove Disallow rules for the 6 search/citation bots. If you want to opt out of AI training without losing search visibility, block only the training tokens — OpenAI's own docs describe allowing OAI-SearchBot while disallowing GPTBot as the supported configuration:
```
# Stay visible in AI search, opt out of model training
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```
Notes from the vendors: robots.txt changes take ~24 hours to propagate (OpenAI); opt-outs apply per subdomain (Anthropic); a blocked page can still surface in ChatGPT as a bare title+link via third-party providers unless the page is noindex (OpenAI). Blocking Google-Extended does not affect AI Overviews but does opt you out of Gemini Apps/Vertex AI grounding (Google).

External references

external · robots-txt-spec — Robots Exclusion Protocol (RFC 9309)
external · openai-crawlers — Overview of OpenAI Crawlers (OAI-SearchBot vs GPTBot vs ChatGPT-User)
external · anthropic-crawlers — Anthropic — Does Anthropic crawl data from the web?
external · google-common-crawlers — Google common crawlers (Google-Extended is not a Search control)
external · perplexity-crawlers — Perplexity Crawlers (PerplexityBot, Perplexity-User)

Taxons

findability

History

2026-04-18·v1.0.0·Initial import from geo-readiness·automated
2026-06-10·v1.1.0·Realigned to first-party vendor docs: three-bucket bot taxonomy. Search/citation bots (OAI-SearchBot, Claude-SearchBot, Claude-User, PerplexityBot, Googlebot, bingbot) fail when blocked; training-opt-out tokens (GPTBot, ClaudeBot, Google-Extended) are a documented owner choice reported without penalty; removed the false "10 major AI crawlers" framing, the nonexistent ClaudeBot-User token, and the claim that GPTBot/ClaudeBot are citation-oriented. Added first-party source URLs.·by geo-first-party-alignment
