Security Guide

Controlling OpenCrawl and AI Crawlers: How to Protect Your Site from Unwanted Scraping

web security · crawlers · robots.txt · AI scraping · OpenCrawl · bot management

The rise of large language models has created a new class of web crawlers that operate at unprecedented scale. Unlike traditional search engine crawlers that index pages for search results, AI training crawlers like OpenCrawl, GPTBot, CCBot, and others are scraping content to feed machine learning pipelines. For website operators, this means significantly more traffic, higher server costs, and the uncompensated use of original content.

IncidentHub-Bay blocks nine AI training crawlers in our own robots.txt configuration. Visit /robots.txt to see our approach, and use it as a template for your own site.

The Scale of the Problem

AI training crawlers do not behave like traditional search bots. Google's search crawler respects crawl delays, fetches pages at a measured pace, and focuses on indexing content for search. AI training crawlers, by contrast, often attempt to download entire sites as quickly as possible. They fetch every page, every asset, and every variant — because more data means better training outcomes for the model operators.

Server logs from sites that have not implemented crawler controls regularly show AI bots accounting for 20 to 40 percent of total traffic. For content-heavy sites — blogs, documentation, forums, news outlets — this can translate directly into higher bandwidth bills and degraded performance for real users.

Identifying AI Crawlers

The first step in controlling crawlers is knowing which ones are hitting your site. Check your server access logs for these common AI training User-Agent strings:

  • GPTBot — OpenAI's web crawler used for training data collection
  • ChatGPT-User — OpenAI's crawler for ChatGPT browsing features
  • Google-Extended — Google's crawler specifically for AI training (separate from Googlebot search)
  • CCBot — Common Crawl's crawler, whose datasets are widely used for LLM training
  • anthropic-ai / Claude-Web — Anthropic's crawlers for Claude training data
  • Bytespider — ByteDance's aggressive crawler used for TikTok and AI training
  • FacebookBot — Meta's crawler used for AI model training
  • cohere-ai — Cohere's crawler for LLM training data
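As a sketch, a short Python script can tally hits per crawler in your access logs. The log path and combined-log format are assumptions; adjust for your server.

```python
from collections import Counter

# User-Agent substrings for the AI training crawlers listed above
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "CCBot",
    "anthropic-ai", "Claude-Web", "Bytespider", "FacebookBot", "cohere-ai",
]

def audit_log(lines):
    """Count requests per AI crawler across access-log lines.

    Returns (Counter of crawler -> hits, total line count)."""
    counts = Counter()
    total = 0
    for line in lines:
        total += 1
        lowered = line.lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in lowered:
                counts[bot] += 1
                break  # one crawler per request line
    return counts, total
```

Feed it your log, e.g. `audit_log(open("/var/log/nginx/access.log"))` (path is an assumption), then print `counts.most_common()` alongside the total to see each bot's share of traffic.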

Layer 1: robots.txt

The robots.txt file is the first line of defense. It is a voluntary protocol — crawlers are expected to check and respect it, but compliance is not guaranteed. Despite this limitation, most major AI companies do respect robots.txt directives, making it an essential baseline control.

Add explicit disallow rules for each AI crawler you want to block. Do not rely on a generic wildcard rule, because that would also block legitimate search engine crawlers. Instead, create separate User-agent blocks for each AI bot. Allow Googlebot and Bingbot to continue indexing your site for search while blocking Google-Extended and GPTBot from scraping for AI training.
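A minimal robots.txt following this pattern might look like the sketch below; extend it with a block for each additional crawler from the list above.

```
# Search engines: full access
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI training crawlers: blocked site-wide
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
```

Note that Google-Extended is a robots.txt token only, not a distinct fetching bot, so robots.txt is the correct place to opt out of Google's AI training use.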

Layer 2: Server-Level Blocking

For crawlers that ignore robots.txt, server-level controls provide enforcement. This can be implemented through your web server configuration (nginx, Apache), a CDN-level WAF (Cloudflare, Vercel), or application middleware.

  • User-Agent header matching: Block requests from known AI crawler User-Agents at the web server or CDN level. This is fast and has minimal performance impact.
  • Rate limiting: Implement per-IP rate limits that allow normal browsing patterns but throttle aggressive crawling. A limit of 60 requests per minute per IP is reasonable for most sites.
  • IP range blocking: Some crawler operators publish their IP ranges. Blocking these at the firewall level stops requests before they reach your application server.
  • CAPTCHA challenges: For high-value content, consider serving CAPTCHA challenges to suspected bot traffic. This adds friction for automated scrapers while allowing human visitors through.
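The User-Agent matching and rate limiting above can be combined in an nginx configuration. This is a sketch, not a drop-in config: the server name, zone size, and burst value are assumptions to tune for your site.

```nginx
# Flag requests from known AI crawler User-Agents (case-insensitive match)
map $http_user_agent $ai_crawler {
    default 0;
    "~*(GPTBot|ChatGPT-User|CCBot|anthropic-ai|Claude-Web|Bytespider|FacebookBot|cohere-ai)" 1;
}

# Per-IP rate limit: 60 requests/minute, as suggested above
limit_req_zone $binary_remote_addr zone=perip:10m rate=60r/m;

server {
    listen 80;
    server_name example.com;  # placeholder

    # Reject flagged crawlers before any application work happens
    if ($ai_crawler) {
        return 403;
    }

    location / {
        # Allow short bursts of normal browsing, throttle sustained crawling
        limit_req zone=perip burst=20 nodelay;
        # ... proxy_pass or root configuration ...
    }
}
```

Blocking by User-Agent at this layer is cheap because the check happens before the request reaches your application, but remember that a determined scraper can spoof its User-Agent, which is why rate limiting and IP blocking remain necessary complements.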

Layer 3: Monitoring and Alerting

Crawler behavior changes over time. New bots appear, existing bots change their User-Agent strings, and request patterns evolve. Set up monitoring to track bot traffic as a percentage of total requests, alert on sudden spikes in crawler activity, and regularly review access logs for unfamiliar User-Agent strings.

If you notice a new crawler consuming significant resources, research its origin and decide whether to allow, throttle, or block it. The goal is to maintain control over who accesses your content and at what rate.
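One way to sketch this kind of monitoring in Python: compute the bot share of recent traffic with a crude substring heuristic and flag it when it crosses a threshold. The token list and 25 percent threshold are assumptions, not recommendations.

```python
from collections import Counter

# Crude heuristic: most crawlers self-identify with one of these tokens
KNOWN_BOT_TOKENS = ("bot", "crawler", "spider")

def bot_share(user_agents, alert_threshold=0.25):
    """Return (bot fraction of requests, alert flag, top bot UAs).

    user_agents: iterable of User-Agent strings from recent traffic."""
    uas = list(user_agents)
    total = len(uas)
    bots = [ua for ua in uas
            if any(token in ua.lower() for token in KNOWN_BOT_TOKENS)]
    share = len(bots) / total if total else 0.0
    return share, share > alert_threshold, Counter(bots).most_common(10)
```

Run this periodically over a sliding window of log entries and page someone when the alert flag fires; the top-10 list gives you the unfamiliar User-Agent strings worth researching by hand.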

Legal and Ethical Considerations

The legal landscape around AI training data is evolving rapidly. Several high-profile lawsuits are challenging whether web scraping for AI training constitutes fair use. Regardless of the legal outcome, website operators have the technical right to control access to their servers. Blocking crawlers is not anti-AI — it is exercising control over your own infrastructure and content.

Protecting your site from unwanted crawlers is one aspect of infrastructure security. For monitoring the reliability of the AI services you depend on, set up alerts at /alerts to get notified of provider outages in real time.

Recommended Action Plan

  • Audit your server logs to identify which AI crawlers are currently accessing your site and how much traffic they generate.
  • Update your robots.txt to explicitly block AI training crawlers while preserving search engine access.
  • Implement server-level or CDN-level User-Agent blocking as an enforcement layer for non-compliant crawlers.
  • Set up rate limiting to protect against any single IP consuming excessive resources.
  • Monitor bot traffic regularly and adjust your rules as the crawler landscape evolves.

Key Takeaways

  • AI training crawlers like GPTBot, CCBot, and Bytespider often disregard crawl-delay conventions and can consume significant server bandwidth if left unchecked.
  • robots.txt is a voluntary protocol — aggressive crawlers may ignore it entirely, requiring server-level blocking for enforcement.
  • A layered defense combining robots.txt, WAF rules, rate limiting, and User-Agent blocking provides the strongest protection.
  • Monitoring crawler traffic is essential — you cannot protect against what you cannot see.

Discussion Prompts

  • Have you audited your server logs recently to see which AI crawlers are hitting your site?
  • Does your robots.txt block AI training crawlers, or are you unknowingly providing free training data?
  • What would happen to your server performance if a crawler suddenly increased its request rate tenfold?
