Security Guide

Controlling OpenCrawl and AI Crawlers: How to Protect Your Site from Unwanted Scraping

web security · crawlers · robots.txt · AI scraping · OpenCrawl · bot management

The rise of large language models has created a new class of web crawlers that operate at unprecedented scale. Unlike traditional search engine crawlers that index pages for search results, AI training crawlers like OpenCrawl, GPTBot, CCBot, and others are scraping content to feed machine learning pipelines. For website operators, this means significantly more traffic, higher server costs, and the uncompensated use of original content.

IncidentHub-Bay blocks nine AI training crawlers in our own robots.txt configuration. Visit /robots.txt to see our approach, and use it as a template for your own site.

The Scale of the Problem

AI training crawlers do not behave like traditional search bots. Google's search crawler respects crawl delays, fetches pages at a measured pace, and focuses on indexing content for search. AI training crawlers, by contrast, often attempt to download entire sites as quickly as possible. They fetch every page, every asset, and every variant — because more data means better training outcomes for the model operators.

Server logs from sites that have not implemented crawler controls regularly show AI bots accounting for 20 to 40 percent of total traffic. For content-heavy sites — blogs, documentation, forums, news outlets — this can translate directly into higher bandwidth bills and degraded performance for real users.

Identifying AI Crawlers

The first step in controlling crawlers is knowing which ones are hitting your site. Check your server access logs for these common AI training User-Agent strings:

  • GPTBot — OpenAI's web crawler used for training data collection
  • ChatGPT-User — OpenAI's crawler for ChatGPT browsing features
  • Google-Extended — Google's crawler specifically for AI training (separate from Googlebot search)
  • CCBot — Common Crawl's crawler, whose datasets are widely used for LLM training
  • anthropic-ai / Claude-Web — Anthropic's crawlers for Claude training data
  • Bytespider — ByteDance's aggressive crawler used for TikTok and AI training
  • FacebookBot — Meta's crawler used for AI model training
  • cohere-ai — Cohere's crawler for LLM training data
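As a sketch, a short Python script can tally hits per crawler in your access logs. The log path and combined-log format are assumptions; adjust for your server.

```python
from collections import Counter

# User-Agent substrings for the AI training crawlers listed above
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "CCBot",
    "anthropic-ai", "Claude-Web", "Bytespider", "FacebookBot", "cohere-ai",
]

def audit_log(lines):
    """Count requests per AI crawler across access-log lines.

    Returns (Counter of crawler -> hits, total line count)."""
    counts = Counter()
    total = 0
    for line in lines:
        total += 1
        lowered = line.lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in lowered:
                counts[bot] += 1
                break  # one crawler per request line
    return counts, total
```

Feed it your log, e.g. `audit_log(open("/var/log/nginx/access.log"))` (path is an assumption), then print `counts.most_common()` alongside the total to see each bot's share of traffic.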

Layer 1: robots.txt

The robots.txt file is the first line of defense. It is a voluntary protocol — crawlers are expected to check and respect it, but compliance is not guaranteed. Despite this limitation, most major AI companies do respect robots.txt directives, making it an essential baseline control.

Add explicit disallow rules for each AI crawler you want to block. Do not rely on a generic wildcard rule, because that would also block legitimate search engine crawlers. Instead, create separate User-agent blocks for each AI bot. Allow Googlebot and Bingbot to continue indexing your site for search while blocking Google-Extended and GPTBot from scraping for AI training.
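A minimal robots.txt following this pattern might look like the sketch below; extend it with a block for each additional crawler from the list above.

```
# Search engines: full access
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI training crawlers: blocked site-wide
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
```

Note that Google-Extended is a robots.txt token only, not a distinct fetching bot, so robots.txt is the correct place to opt out of Google's AI training use.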

Layer 2: Server-Level Blocking

For crawlers that ignore robots.txt, server-level controls provide enforcement. This can be implemented through your web server configuration (nginx, Apache), a CDN-level WAF (Cloudflare, Vercel), or application middleware.

  • User-Agent header matching: Block requests from known AI crawler User-Agents at the web server or CDN level. This is fast and has minimal performance impact.
  • Rate limiting: Implement per-IP rate limits that allow normal browsing patterns but throttle aggressive crawling. A limit of 60 requests per minute per IP is reasonable for most sites.
  • IP range blocking: Some crawler operators publish their IP ranges. Blocking these at the firewall level stops requests before they reach your application server.
  • CAPTCHA challenges: For high-value content, consider serving CAPTCHA challenges to suspected bot traffic. This adds friction for automated scrapers while allowing human visitors through.
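The User-Agent matching and rate limiting above can be combined in an nginx configuration. This is a sketch, not a drop-in config: the server name, zone size, and burst value are assumptions to tune for your site.

```nginx
# Flag requests from known AI crawler User-Agents (case-insensitive match)
map $http_user_agent $ai_crawler {
    default 0;
    "~*(GPTBot|ChatGPT-User|CCBot|anthropic-ai|Claude-Web|Bytespider|FacebookBot|cohere-ai)" 1;
}

# Per-IP rate limit: 60 requests/minute, as suggested above
limit_req_zone $binary_remote_addr zone=perip:10m rate=60r/m;

server {
    listen 80;
    server_name example.com;  # placeholder

    # Reject flagged crawlers before any application work happens
    if ($ai_crawler) {
        return 403;
    }

    location / {
        # Allow short bursts of normal browsing, throttle sustained crawling
        limit_req zone=perip burst=20 nodelay;
        # ... proxy_pass or root configuration ...
    }
}
```

Blocking by User-Agent at this layer is cheap because the check happens before the request reaches your application, but remember that a determined scraper can spoof its User-Agent, which is why rate limiting and IP blocking remain necessary complements.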

Layer 3: Monitoring and Alerting

Crawler behavior changes over time. New bots appear, existing bots change their User-Agent strings, and request patterns evolve. Set up monitoring to track bot traffic as a percentage of total requests, alert on sudden spikes in crawler activity, and regularly review access logs for unfamiliar User-Agent strings.

If you notice a new crawler consuming significant resources, research its origin and decide whether to allow, throttle, or block it. The goal is to maintain control over who accesses your content and at what rate.
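One way to sketch this kind of monitoring in Python: compute the bot share of recent traffic with a crude substring heuristic and flag it when it crosses a threshold. The token list and 25 percent threshold are assumptions, not recommendations.

```python
from collections import Counter

# Crude heuristic: most crawlers self-identify with one of these tokens
KNOWN_BOT_TOKENS = ("bot", "crawler", "spider")

def bot_share(user_agents, alert_threshold=0.25):
    """Return (bot fraction of requests, alert flag, top bot UAs).

    user_agents: iterable of User-Agent strings from recent traffic."""
    uas = list(user_agents)
    total = len(uas)
    bots = [ua for ua in uas
            if any(token in ua.lower() for token in KNOWN_BOT_TOKENS)]
    share = len(bots) / total if total else 0.0
    return share, share > alert_threshold, Counter(bots).most_common(10)
```

Run this periodically over a sliding window of log entries and page someone when the alert flag fires; the top-10 list gives you the unfamiliar User-Agent strings worth researching by hand.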

Legal and Ethical Considerations

The legal landscape around AI training data is evolving rapidly. Several high-profile lawsuits are challenging whether web scraping for AI training constitutes fair use. Regardless of the legal outcome, website operators have the technical right to control access to their servers. Blocking crawlers is not anti-AI — it is exercising control over your own infrastructure and content.

Protecting your site from unwanted crawlers is one aspect of infrastructure security. For monitoring the reliability of the AI services you depend on, set up alerts at /alerts to get notified of provider outages in real time.

Recommended Action Plan

  • Audit your server logs to identify which AI crawlers are currently accessing your site and how much traffic they generate.
  • Update your robots.txt to explicitly block AI training crawlers while preserving search engine access.
  • Implement server-level or CDN-level User-Agent blocking as an enforcement layer for non-compliant crawlers.
  • Set up rate limiting to protect against any single IP consuming excessive resources.
  • Monitor bot traffic regularly and adjust your rules as the crawler landscape evolves.

Key Takeaways

  • AI training crawlers like GPTBot, CCBot, and Bytespider often disregard crawl-delay conventions and can consume significant server bandwidth if left unchecked.
  • robots.txt is a voluntary protocol — aggressive crawlers may ignore it entirely, requiring server-level blocking for enforcement.
  • A layered defense combining robots.txt, WAF rules, rate limiting, and User-Agent blocking provides the strongest protection.
  • Monitoring crawler traffic is essential — you cannot protect against what you cannot see.

Discussion Prompts

  • Have you audited your server logs recently to see which AI crawlers are hitting your site?
  • Does your robots.txt block AI training crawlers, or are you unknowingly providing free training data?
  • What would happen to your server performance if a crawler suddenly increased its request rate tenfold?
