Cloudflare Turnstile, reCAPTCHA, and user-agent-sniffing middleware that challenge every request will block OpenAI's GPTBot, Google-Extended, PerplexityBot, and ClaudeBot alongside malicious traffic. When your blog and docs return a JavaScript challenge instead of HTML, AI systems record the page as uncrawlable and stop retrying — your content vanishes from generative answers even though it loads fine for humans.
Severity is medium because the impact depends on how broadly the bot challenge is scoped across routes.
Restrict bot-protection middleware to API, auth, and form-submission routes using an explicit matcher, and leave content pages unprotected at the application layer. If a CDN or edge firewall enforces challenges, allowlist known AI crawler user agents or exclude content-only paths. Update middleware.ts:
// middleware.ts — scope bot protection to API and auth routes only
export const config = {
  matcher: ['/api/:path*', '/auth/:path*'],
}
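If a challenge must stay on some shared routes, the middleware can also skip it for known AI crawlers. A minimal sketch, assuming a plain substring match on the user-agent header; the token list covers the crawlers named above, and `isKnownAICrawler` is a hypothetical helper (verify each token against the vendor's published crawler documentation):

```typescript
// User-agent tokens for the AI crawlers this check cares about.
// Assumption: substring matching is sufficient; verify tokens with each vendor.
const AI_CRAWLER_TOKENS = ['GPTBot', 'Google-Extended', 'PerplexityBot', 'ClaudeBot'];

// Returns true when the user-agent string identifies a known AI crawler.
function isKnownAICrawler(userAgent: string): boolean {
  return AI_CRAWLER_TOKENS.some((token) => userAgent.includes(token));
}
```

Note that user agents are trivially spoofed, so this allowlist only makes sense for skipping challenges on low-risk content routes, not for granting access to anything sensitive.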
ID: geo-readiness.ai-crawler-access.no-aggressive-bot-blocking
Severity: medium
What to look for: Count all bot-blocking mechanisms in the codebase: imports of CAPTCHA libraries (reCAPTCHA, hCaptcha, Cloudflare Turnstile), middleware files that inspect user agents, and any "verify you're human" interstitial components. For each mechanism found, determine which routes it applies to — content pages vs. API/auth routes. Bot protection on API routes, auth pages, or form submissions is acceptable and expected.
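The import-counting step above can be sketched as a simple pattern scan over source text. The package-name patterns below are common examples, not a verified or exhaustive list; a real checker would also walk middleware files and interstitial components:

```typescript
// Import patterns that typically indicate a CAPTCHA or bot-challenge library.
// Assumption: these package-name fragments are illustrative, not exhaustive.
const CAPTCHA_IMPORT_PATTERNS: RegExp[] = [
  /react-google-recaptcha/, // reCAPTCHA wrapper
  /hcaptcha/i,              // hCaptcha packages
  /turnstile/i,             // Cloudflare Turnstile packages
];

// Counts how many known bot-blocking libraries a source file imports.
function countCaptchaImports(source: string): number {
  return CAPTCHA_IMPORT_PATTERNS.filter((pattern) => pattern.test(source)).length;
}
```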
Pass criteria: Count all CAPTCHA, JavaScript challenge, and bot verification mechanisms. The count of such mechanisms on public content pages (homepage, marketing, blog, docs) must be 0. Bot protection scoped only to API routes, auth flows, or form submissions passes. If no bot-blocking code is found in the codebase at all, pass — note that CDN-level configuration (Cloudflare, Vercel) cannot be verified from code alone. Report even on pass: "Found X bot-blocking mechanisms total — 0 apply to public content pages."
Fail criteria: At least 1 CAPTCHA or challenge mechanism applies to public content page routes. Middleware that blocks or challenges requests based on user agent string for all routes including content pages. Report: "X bot-blocking mechanisms found on content routes: [list mechanisms and affected routes]".
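Applying the pass/fail criteria requires classifying each affected route as public content or as an acceptable protected route. A minimal sketch, assuming prefix matching and the hypothetical prefixes below (adjust to the project's actual route layout):

```typescript
// Route prefixes where bot protection is acceptable per the pass criteria.
// Assumption: '/api/' and '/auth/' are the project's protected prefixes.
const PROTECTED_PREFIXES = ['/api/', '/auth/'];

// Treats any path outside the protected prefixes as a public content route,
// which must have zero CAPTCHA/challenge mechanisms to pass this check.
function isPublicContentRoute(path: string): boolean {
  return !PROTECTED_PREFIXES.some((prefix) => path.startsWith(prefix));
}
```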
Skip (N/A) when: Never.
Detail on fail: "Cloudflare Turnstile challenge applied via middleware to all routes including /blog and /docs — AI crawlers cannot access 2 content sections" or "reCAPTCHA gate on homepage prevents automated content access — 1 bot-blocking mechanism on 1 content route"
Remediation: Bot protection should target form submissions and API endpoints, not content pages. Scope your middleware to exclude public content routes:
// middleware.ts — exclude content pages from bot challenges
export const config = {
  matcher: ['/api/:path*', '/auth/:path*'], // Only protect API and auth
}
CDN-level bot protection (Cloudflare, Vercel Firewall) should be configured to allow known AI crawler user agents or to skip challenges for content-only pages.
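As one illustration, a Cloudflare WAF skip rule could match crawler user agents with an expression like the following (syntax per Cloudflare's Rules language; the exact field names and rule action depend on the dashboard version, so treat this as a sketch to adapt, not a drop-in configuration):

```
(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot")
```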