AI crawlers in robots.txt: GPTBot, ClaudeBot, and more

How robots.txt works for AI crawlers, what GPTBot, ClaudeBot, ChatGPT-User, PerplexityBot, and Google-Extended mean, and how to avoid blocking useful visibility by accident.


AI crawlers in robots.txt should be managed by role. Separate user-agents may handle AI search, potential model training, or user-initiated retrieval. A blanket block can remove a site from ChatGPT Search, Claude Search, or Perplexity even when the original goal was only to restrict training use.

This guide provides an AI bot list for robots.txt, practical policies for GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended, and a way to test whether the rules work. It also explains why robots.txt is not access control for private content.

What exactly robots.txt controls

robots.txt is a text file at the root of a site, such as https://example.com/robots.txt. Automated clients read it to learn which URLs the site owner allows or disallows for crawling.

The basic format is simple:

User-agent: ExampleBot
Disallow: /private/
Allow: /blog/

Sitemap: https://example.com/sitemap.xml
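
One way to see how a compliant client might read this file is Python's standard urllib.robotparser module. The sketch below parses the example above and asks whether ExampleBot may fetch two URLs. The helper name is_allowed and the two page URLs are illustrative assumptions. Note that this parser applies the first matching rule in file order, which may differ from how a particular crawler resolves conflicting rules.

```python
from urllib.robotparser import RobotFileParser

# The example file from this section, kept as an in-memory string.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/
Allow: /blog/

Sitemap: https://example.com/sitemap.xml
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical page URLs used only for illustration.
    for url in ("https://example.com/private/report.html",
                "https://example.com/blog/post.html"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A bot with no matching group and no `User-agent: *` group is treated as allowed, which is why an unlisted crawler is not restricted by this file.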

RFC 9309 explicitly states that these rules are not a form of authorization. robots.txt does not prevent a person from opening a page, hide the URL, set a password, or guarantee that a non-compliant bot will follow the rules.

The file is useful for managing compliant crawlers. Protect private materials with authentication, authorization, a paywall, WAF rules, or server-level access controls. Use noindex and search-engine removal tools to manage indexing, not as security controls. If a URL must remain private, do not rely on robots.txt alone.

Kozak moves private content from crawl rules to access control

The rules apply to a specific protocol, host, and port. https://example.com/robots.txt does not control https://blog.example.com/; crawlers also treat http://example.com/ and https://example.com/ as separate origins. Google describes this scope in its robots.txt documentation.
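
The scope rule can be sketched as a small helper that derives the governing robots.txt URL from any page URL. This is a minimal illustration using Python's urllib.parse; the function name robots_url_for is an assumption, not part of any crawler's API.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Scheme, host, and port are kept unchanged, so every origin maps
    to its own robots.txt file.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for url in ("https://example.com/blog/post",
                "https://blog.example.com/post",
                "http://example.com/"):
        print(url, "->", robots_url_for(url))
```

Running it over the URLs found in a sitemap or in server logs gives the list of robots.txt files that actually need to be maintained.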

Why "AI crawler" no longer means one type of bot

In the past, the conversation often came down to whether or not to let Googlebot in. With AI search, the picture is more complicated. One service can use different user-agents for different tasks:

  • training or improvement of foundation models;
  • indexing for AI search;
  • opening the page at the direct request of the user;
  • technical verification of pages, advertising or security.

For a business, the difference is practical. You may want the site to appear in ChatGPT Search or Perplexity answers without passing the content to future training sets. Or the reverse: the company sees no value in AI search and wants to minimize all automated visits.

That is why it is better not to write a single blanket block:

User-agent: *
Disallow: /

Such a rule blocks not only AI bots but also ordinary search crawlers. For a site that relies on organic search, this is too risky.
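
The effect of the blanket rule can be checked with a short sketch based on Python's urllib.robotparser. The crawler list and the helper name blocked_crawlers are illustrative assumptions; the point is only that a wildcard group applies to every crawler that has no group of its own.

```python
from urllib.robotparser import RobotFileParser

BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

# A mix of AI crawlers and a classic search crawler, for illustration.
CRAWLERS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Googlebot"]


def blocked_crawlers(robots_txt: str, crawlers: list[str]) -> list[str]:
    """Return the crawlers that may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in crawlers if not parser.can_fetch(name, "/")]


if __name__ == "__main__":
    print(blocked_crawlers(BLANKET_BLOCK, CRAWLERS))
```

With this input every listed crawler, including Googlebot, ends up in the blocked list.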

Training, AI Search, and On-Demand Retrieval: What's the Difference

Before changing robots.txt, separate three scenarios.

robots.txt divides bots by role: AI search, model training, and user requests

| Scenario | What is happening | What it means for the business |
| --- | --- | --- |
| Model training | A crawler collects public content that can be used to train or improve future models | Control over whether you provide materials for training |
| AI search and indexing | The service crawls pages to show the site in answers, sources, and links | Brand visibility in ChatGPT Search, Claude, Perplexity, and other AI search scenarios |
| User-requested retrieval | The user asks the AI to open a URL or perform an action, and the bot requests that page | Site availability when a user explicitly initiates access |

This distinction has a practical effect. As explained in AI visibility and SEO, a brand can rank in Google yet remain absent from AI-assisted decisions. Blocking a crawler used for AI search can reduce visibility in that layer.

Main AI Bots and What They Mean

As of June 20, 2026, official documentation describes the following roles.

| Company | User-agent or token | Main role | What blocking changes |
| --- | --- | --- | --- |
| OpenAI | GPTBot | Crawler for content that may be used to train generative foundation models | A signal that site content should not be used for such training |
| OpenAI | OAI-SearchBot | Crawler that lets sites appear in ChatGPT search features | The site may not appear in ChatGPT search answers, although it may remain a navigation link |
| OpenAI | ChatGPT-User | User-initiated requests in ChatGPT or Custom GPTs | This is not automatic crawling; OpenAI writes that robots.txt may not apply |
| Anthropic | ClaudeBot | Collection of web content that could potentially end up in training sets | A signal to exclude future site materials from training sets |
| Anthropic | Claude-SearchBot | Indexing and improving the quality of web search in Claude | Less visibility and accuracy in Claude search answers |
| Anthropic | Claude-User | Access to pages at the request of a Claude user | Claude may not receive content in response to the user's request |
| Perplexity | PerplexityBot | Indexing for Perplexity search results and links | The site is less likely to appear in Perplexity |
| Perplexity | Perplexity-User | Perplexity user-initiated requests | The Perplexity documentation says this mechanism generally ignores robots.txt |
| Google | Google-Extended | robots.txt token to control content usage in Gemini Apps, Vertex AI API for Gemini, and grounding | Does not affect Google Search and is not a ranking signal |

Links to primary sources: OpenAI Crawlers, Anthropic Help Center, Perplexity Crawlers, Google common crawlers.

OpenAI: GPTBot, OAI-SearchBot, and ChatGPT-User

OpenAI separates search from model training. In OpenAI documentation, OAI-SearchBot is described as a search crawler that helps sites appear in ChatGPT search results. GPTBot is a separate crawler for content that may be used to train foundation models.

For many public sites, a practical policy is to allow ChatGPT Search while blocking the training crawler.

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

This does not guarantee a brand mention because ChatGPT still selects sources based on relevance and other signals. Blocking OAI-SearchBot, however, can prevent pages from being surfaced in ChatGPT search results.

Kozak allows AI search while blocking content use for model training

ChatGPT-User is a special case. OpenAI says this user-agent is used for certain user actions in ChatGPT and Custom GPTs, not for automated crawling. Because the action is user-initiated, robots.txt rules may not apply. If a page must not be opened in this scenario, use site-level access control rather than relying only on a rule in robots.txt.

Anthropic: ClaudeBot, Claude-SearchBot and Claude-User

Anthropic also divides bots by role. Claude help describes three user-agents:

  • ClaudeBot - for web content that can potentially be used to train models;
  • Claude-SearchBot - to improve the quality of search results in Claude;
  • Claude-User - for accessing sites at the request of the user.

If you want to block the training crawler while preserving Claude's search crawler, the rule might look like this:

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

Anthropic also states that it supports the non-standard Crawl-delay directive for limiting crawl activity. This is useful when the problem is server load rather than access itself.

User-agent: ClaudeBot
Crawl-delay: 1
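
To confirm that a Crawl-delay value is written in a form a parser can read, it can be extracted with Python's urllib.robotparser, which exposes a crawl_delay method. This is a minimal sketch; the helper name crawl_delay_for is an assumption, and it shows how one standard-library parser reads the value, not how ClaudeBot itself interprets it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ClaudeBot
Crawl-delay: 1
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay value for user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "ClaudeBot"))
    print(crawl_delay_for(ROBOTS_TXT, "GPTBot"))
```

For the file above, the helper returns 1 for ClaudeBot and None for a crawler without its own group.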

There is a technical nuance here. If you block the bot's IP ranges at the firewall, it may be unable to read the updated robots.txt. The crawler then cannot see the opt-out instructions. Make sure /robots.txt remains accessible before relying on those rules.

PerplexityBot and Perplexity-User

Perplexity's crawler documentation says that PerplexityBot is what allows sites to appear and be linked in Perplexity results. It also states that PerplexityBot is not used to crawl content for AI foundation models.

So, if your goal is to remain visible in Perplexity, do not block PerplexityBot without a good reason.

User-agent: PerplexityBot
Allow: /

Perplexity-User supports user actions. The Perplexity documentation explains that because the fetch is user-initiated, this mechanism generally ignores robots.txt rules. If this is unacceptable for your site, review not only robots.txt but also server access rules, the WAF, the paywall, authorization, and rate limiting.

For sensitive sections, it is better not to pretend that Disallow decides everything:

User-agent: Perplexity-User
Disallow: /clients/

This line can be a useful signal, but it shouldn't be the only barrier.

Google-Extended: not to be confused with Googlebot

Google-Extended is the most unusual item on this list. This is not a separate HTTP user-agent that you will see in the logs. Google describes it as a standalone product token in robots.txt. The crawl is performed by Google's existing user-agent, and Google-Extended works as a control token.

The main thing: Google-Extended does not affect the inclusion of the site in Google Search and is not used as a ranking signal. Google's documentation says that this token controls whether the content Google crawls on sites can be used to train future generations of Gemini models and for grounding in Gemini Apps and Grounding with Google Search on Vertex AI.

If you want to keep regular Google Search open but limit Google-Extended:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

Don't replace this with a rule for Googlebot if you don't want to close your site from Google Search. This is one of the most dangerous mistakes in this topic.

Kozak separates Google Search access from Gemini content use

Ready-to-use robots.txt rules

Below are templates that can be adapted. Before publishing, test them in a staging environment or at least in a robots.txt tester, if your CMS or SEO tool has one.

Scenario 1. Allow AI search but block training crawlers

Suitable for a blog, SaaS product, media outlet, or service site that wants to appear in AI search answers but does not want its content collected by training crawlers.

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

This is often a balanced option, but it does not guarantee visibility. After changing the rules, monitor whether pages appear in AI answers and which sources the model cites. The process in how to analyze the sources AI relies on can help.
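
Before publishing, the Scenario 1 template can be audited by role with a short script. The sketch below uses Python's urllib.robotparser and a role map taken from the table in this guide; the ROLES dictionary and the audit function are illustrative assumptions and should be updated if vendor documentation changes. Google-Extended is a control token rather than a crawler seen in logs, so it is checked here only as a robots.txt token.

```python
from urllib.robotparser import RobotFileParser

# Roles as described in this guide (an assumption for this sketch).
ROLES = {
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",  # control token for AI use
}

SCENARIO_1 = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def audit(robots_txt: str, path: str = "/") -> dict[str, dict[str, bool]]:
    """Group crawl permissions for path by crawler role."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    report: dict[str, dict[str, bool]] = {}
    for agent, role in ROLES.items():
        report.setdefault(role, {})[agent] = parser.can_fetch(agent, path)
    return report


if __name__ == "__main__":
    for role, agents in audit(SCENARIO_1).items():
        print(role, agents)
```

If any search agent shows False or any training agent shows True, the file does not match the intended policy.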

Scenario 2. Block major AI bots completely

Suitable for a site where AI visibility is not the goal, or for content with an increased risk of copying.

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

Before making this decision, note the consequence: you deliberately reduce your chances of visibility in AI search. For some sites, that is acceptable. For commercial categories where users already ask ChatGPT or Perplexity "who to choose", it can mean a lost touchpoint.

Scenario 3. Close only individual sections

Suitable when the public blog, documentation, and service pages should stay accessible, but internal, client, or archive sections should not.

User-agent: GPTBot
Disallow: /clients/
Disallow: /internal/
Allow: /blog/
Allow: /services/

User-agent: OAI-SearchBot
Disallow: /clients/
Disallow: /internal/
Allow: /

This option is closer to real life. Not all sections are equally valuable for AI search, and not all content is equally safe to open.
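
When Allow and Disallow rules overlap, as in the OAI-SearchBot group above, RFC 9309 says the most specific match (the one with the most octets) is used, and Allow is preferred when an Allow and a Disallow rule are equivalent. The following is a simplified sketch of that rule for a single user-agent group. The function name and the rule tuples are illustrative assumptions, and the wildcard characters * and $ are not handled.

```python
def longest_match_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide access for path using the longest matching rule.

    rules holds ("allow" | "disallow", path_prefix) pairs for ONE
    user-agent group. Wildcards are ignored in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = kind.lower() == "allow"
        # A longer prefix wins; on a tie, Allow is preferred.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The OAI-SearchBot group from Scenario 3.
OAI_SEARCHBOT_RULES = [
    ("disallow", "/clients/"),
    ("disallow", "/internal/"),
    ("allow", "/"),
]

if __name__ == "__main__":
    for path in ("/clients/acme", "/internal/wiki", "/blog/post"):
        print(path, longest_match_allowed(OAI_SEARCHBOT_RULES, path))
```

Under this rule the order of lines inside a group does not matter: /clients/ is more specific than /, so client pages stay blocked while the rest of the site remains open.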

Scenario 4. Keep Google Search open but limit Google-Extended

Suitable when a site needs to stay in Google Search, but the team wants to limit Google-Extended.

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

Once again: Google-Extended is not a substitute for Googlebot.

How to check that the rules really work

Adding lines to robots.txt is only the first step. Verification matters more.

  1. Open https://domain.com/robots.txt in a browser. The server must return 200 OK with the plain-text file, not an HTML page or a 403.
  2. Check the rules for each subdomain. If a blog lives on blog.example.com, it needs its own robots.txt.
  3. Look at the server logs by user-agent: GPTBot, OAI-SearchBot, ClaudeBot, Claude-SearchBot, PerplexityBot, Googlebot.
  4. Match the user-agent against official IP ranges if the provider publishes them. A user-agent string alone can be spoofed.
  5. Check the CDN and WAF. Cloudflare, AWS WAF, nginx, security plugins, or WordPress plugins can block the bot before it even reads robots.txt.
  6. Give the systems time. OpenAI and Perplexity mention in their documentation a delay of up to about 24 hours before some changes take effect; other crawlers may refresh their cache on a different schedule.
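
Steps 3 and 4 can be sketched in a few lines of Python. The example counts log lines per crawler token and checks whether a client IP falls inside a list of CIDR ranges. The log lines, the simplified user-agent strings, and the ranges are placeholders invented for illustration (the addresses come from reserved documentation blocks); real ranges must be taken from each provider's published list.

```python
import ipaddress
from collections import Counter

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot",
             "PerplexityBot", "Googlebot"]

# Placeholder ranges, NOT real provider ranges.
PLACEHOLDER_RANGES = ["192.0.2.0/24"]

# Hypothetical access-log lines with simplified user-agent strings.
SAMPLE_LOG = [
    '192.0.2.10 - - "GET /robots.txt HTTP/1.1" 200 "GPTBot/1.0"',
    '198.51.100.7 - - "GET /blog/ HTTP/1.1" 200 "ClaudeBot/1.0"',
    '203.0.113.5 - - "GET /pricing HTTP/1.1" 200 "GPTBot/1.0"',
]


def count_bot_hits(log_lines) -> Counter:
    """Count log lines whose text mentions each known crawler token."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for agent in AI_AGENTS:
            if agent.lower() in lowered:
                hits[agent] += 1
    return hits


def ip_in_ranges(ip: str, cidr_ranges) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


if __name__ == "__main__":
    print(count_bot_hits(SAMPLE_LOG))
    for line in SAMPLE_LOG:
        client_ip = line.split()[0]
        print(client_ip, ip_in_ranges(client_ip, PLACEHOLDER_RANGES))
```

A request that carries a crawler's user-agent but comes from outside the published ranges is a candidate for a spoofed bot and can be handled by WAF rules rather than robots.txt.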

Kozak verifies robots.txt, server logs, WAF, and the result after changes

Also check whether priority pages were blocked accidentally. If previously cited pages disappear after a rules change, access may be the cause. See which website pages most often appear in AI answers for a method to identify pages that should remain accessible.
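
A quick way to catch accidental blocks is to run a list of priority URLs against the current rules for each AI search crawler. The sketch below assumes Python's urllib.robotparser as the rule evaluator; the sample rules, the URLs, and the function name blocked_priority_pages are hypothetical.

```python
from urllib.robotparser import RobotFileParser

SEARCH_AGENTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]

# Hypothetical rules where /pricing/ was closed to OAI-SearchBot by mistake.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Disallow: /pricing/

User-agent: GPTBot
Disallow: /
"""

# Hypothetical priority pages.
PRIORITY_URLS = [
    "https://example.com/pricing/plans",
    "https://example.com/blog/guide",
]


def blocked_priority_pages(robots_txt, urls, agents=SEARCH_AGENTS):
    """Return (agent, url) pairs that the rules would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(agent, url) for agent in agents for url in urls
            if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    for agent, url in blocked_priority_pages(ROBOTS_TXT, PRIORITY_URLS):
        print(f"{agent} cannot crawl {url}")
```

An empty result means none of the listed pages is closed to the listed search crawlers under this parser's reading of the file.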

Common mistakes

Most problems come from choosing the wrong policy rather than writing invalid syntax. A team may block OAI-SearchBot when it intended to block only GPTBot, or block Googlebot instead of Google-Extended and damage Google Search visibility. Disallow prevents crawling; it does not guarantee deindexing. Listing private paths in the public robots.txt file can expose their locations. Other common mistakes include configuring only the main domain while ignoring subdomains, blocking /robots.txt or verified crawlers at the WAF, and relying on the non-standard Crawl-delay directive.

Remember the main limitation: robots.txt controls compliant crawler access but does not explain why a page deserves citation. Content, structure, visible facts, and external evidence serve that purpose. For the technical layer, see how structured data affects AI visibility.

Which policy to choose for different sites

There is no one-size-fits-all answer. The decision depends on what is more important to you: maximum AI visibility, content usage control, less workload, or legal caution.

| Site type | Practical policy |
| --- | --- |
| Blog or media | It often makes sense to allow AI search bots and decide separately on training bots |
| SaaS or service site | Allow product, blog, FAQ, and documentation pages; protect private accounts, test environments, and client areas |
| Ecommerce | Allow categories, product pages, and help content; protect carts, internal search, parameter filters, and personal accounts |
| B2B company | Keep service pages, cases, and expert materials available; handle PDFs, price lists, and client materials with care |
| Closed knowledge base | Do not rely on robots.txt; set up authorization and server access control |
| Medical, legal, or financial website | Do a separate review with a lawyer or compliance team, because the cost of an error is higher |

For many sites, a sensible starting policy is to allow AI search crawlers, block training crawlers if required, review the logs, and compare citations and server load after 2-4 weeks.

FAQ

If you block GPTBot, will the site disappear from ChatGPT?

Not necessarily. OAI-SearchBot is responsible for search visibility in ChatGPT, while GPTBot relates to the training scenario. Still, verify the behavior of specific answers in the interface and in the logs.

Why does the bot still come after Disallow?

Possible causes: a cached robots.txt, a different user-agent, a user-initiated fetch, a spoofed user-agent, a WAF that prevents the bot from reading the file, or a rule written for the wrong host. Start with the logs and check the actual request to /robots.txt.

Do I need to add all AI bots to robots.txt?

Not necessarily. It's better to start with those that really affect your business: OpenAI, Anthropic, Perplexity, Google. Then review the logs and add rules for bots that actually visit the site or create load.

Summary

Configure AI crawler access by role rather than allowing or blocking every bot together. GPTBot, ClaudeBot, and Google-Extended relate to training or use in AI products. OAI-SearchBot, Claude-SearchBot, and PerplexityBot support AI search visibility. ChatGPT-User, Claude-User, and Perplexity-User form a separate group because a user often initiates those requests.

For most commercial sites, a practical first iteration is to allow AI search, limit training use if required, leave Googlebot unchanged, check WAF rules, and review server logs. This preserves the AI visibility channel while retaining control over how public content may be used.
