AI crawlers in robots.txt should be managed by role. Separate user-agents may handle AI search, potential model training, or user-initiated retrieval. A blanket block can remove a site from ChatGPT Search, Claude Search, or Perplexity even when the original goal was only to restrict training use.
This guide provides an AI bot list for robots.txt, practical policies for GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended, and a way to test whether the rules work. It also explains why robots.txt is not access control for private content.
What robots.txt actually controls
robots.txt is a text file at the root of the site, such as https://example.com/robots.txt. Automated clients read it to learn which URLs the site owner allows or disallows for crawling.
The basic format is simple:
User-agent: ExampleBot
Disallow: /private/
Allow: /blog/
Sitemap: https://example.com/sitemap.xml
RFC 9309 explicitly states that these rules are not a form of authorization. robots.txt does not prevent a person from opening a page, hide the URL, set a password, or guarantee that a non-compliant bot will follow the rules.
The file is useful for managing compliant crawlers. Protect private materials with authentication, authorization, a paywall, WAF rules, or server-level access controls. Use noindex and search-engine removal tools to manage indexing, not as security controls. If a URL must remain private, do not rely on robots.txt alone.

The rules apply to a specific protocol, host, and port. https://example.com/robots.txt does not control https://blog.example.com/; crawlers also treat http://example.com/ and https://example.com/ as separate origins. Google describes this scope in its robots.txt documentation.
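The scope rule can be sketched in a few lines of Python. This is a minimal illustration, not a crawler implementation: the helper name robots_txt_url is hypothetical, and it only derives which robots.txt file would govern a given page URL based on scheme, host, and port.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme, host, and port)."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected an absolute URL, got {page_url!r}")
    # netloc keeps the host and any explicit port, so each origin maps to its own file
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


for url in (
    "https://example.com/blog/post",
    "https://blog.example.com/post",
    "http://example.com/blog/post",
    "https://example.com:8443/app",
):
    print(url, "->", robots_txt_url(url))
```

Each of the four sample URLs maps to a different robots.txt file, which is why a subdomain or a non-HTTPS variant needs its own rules.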
Why "AI crawler" no longer means one type of bot
In the past, the conversation often came down to whether to let Googlebot in. AI search makes this more complicated, because one service can use different user-agents for different tasks:
- training or improvement of foundation models;
- indexing for AI search;
- opening the page at the direct request of the user;
- technical verification of pages, advertising or security.
For a business, the difference is practical. You may want the site to appear in ChatGPT Search or Perplexity answers without passing the content to future training sets. Or the reverse: the company sees no value in AI search and wants to minimize all automated visits.
That is why a single blanket block is a poor choice:
User-agent: *
Disallow: /
Such a rule blocks not only AI bots but also ordinary search crawlers. For a site that relies on organic search, this is too risky.
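The effect of the blanket rule can be checked locally with Python's standard urllib.robotparser module. This is a rough sketch: the helper name allowed and the sample URL are assumptions for illustration, and a real crawler may interpret rules differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The wildcard group applies to every agent that has no group of its own.
for agent in ("GPTBot", "PerplexityBot", "Googlebot"):
    print(agent, allowed(BLANKET_BLOCK, agent, "https://example.com/blog/"))
```

All three agents, including Googlebot, are reported as blocked, which is the side effect the paragraph above warns about.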
Training, AI search, and user-requested retrieval: what's the difference
Before changing robots.txt, separate three scenarios.
| Scenario | What happens | What it means for the business |
|---|---|---|
| Model training | A crawler collects public content that can be used to train or improve future models | Control over whether your materials are available for training |
| AI search and indexing | The service crawls pages to show the site in answers, sources, and links | Brand visibility in ChatGPT Search, Claude, Perplexity, and other AI search scenarios |
| User-requested retrieval | The user asks the AI to open a URL or perform an action, and the bot requests that page | Site availability when a user explicitly initiates access |
This distinction has a practical effect. As explained in AI visibility and SEO, a brand can rank in Google yet remain absent from AI-assisted decisions. Blocking a crawler used for AI search can reduce visibility in that layer.
Basic AI Bots and What They Mean
As of June 20, 2026, official documentation describes the following roles.
| Company | User-agent or token | Main role | What blocking changes |
|---|---|---|---|
| OpenAI | GPTBot | Crawler for content that may be used to train generative foundation models | A signal that site content should not be used for such training |
| OpenAI | OAI-SearchBot | Crawler that lets sites appear in ChatGPT search features | The site may not appear in ChatGPT search answers, although it may still appear as a navigational link |
| OpenAI | ChatGPT-User | User-initiated requests in ChatGPT or Custom GPTs | Not automatic crawling; OpenAI notes that robots.txt rules may not apply |
| Anthropic | ClaudeBot | Collection of web content that could potentially end up in training sets | A signal to exclude the site's future materials from training sets |
| Anthropic | Claude-SearchBot | Indexing and improving the quality of web search in Claude | Less visibility and accuracy in Claude search answers |
| Anthropic | Claude-User | Access to pages at the request of a Claude user | Claude may not receive content in response to the user's request |
| Perplexity | PerplexityBot | Indexing for Perplexity search results and links | Reduced visibility in Perplexity |
| Perplexity | Perplexity-User | Perplexity user-initiated requests | Perplexity's documentation says this agent generally ignores robots.txt |
| Google | Google-Extended | robots.txt token that controls content use in Gemini Apps, the Vertex AI API for Gemini, and grounding | Blocking does not affect Google Search and is not a ranking signal |
Links to primary sources: OpenAI Crawlers, Anthropic Help Center, Perplexity Crawlers, Google common crawlers.
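One way to keep a role-based policy maintainable is to store the roles from the table in a small mapping and generate the rules from it. The sketch below is only an illustration: the names AI_AGENTS and build_rules are hypothetical, the role labels are a simplification of the table (Google-Extended is grouped with training although it also covers grounding), and the roles should be re-checked against the vendors' documentation before use.

```python
# Roles as summarized in the table above (simplified; an assumption, not vendor data).
AI_AGENTS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user",
    "ClaudeBot": "training",
    "Claude-SearchBot": "search",
    "Claude-User": "user",
    "PerplexityBot": "search",
    "Perplexity-User": "user",
    "Google-Extended": "training",  # control token: training and grounding use
}


def build_rules(policy: dict[str, bool]) -> str:
    """Render one robots.txt group per agent; policy maps a role to allowed (True/False).

    Agents whose role is missing from the policy get no group at all,
    so they fall back to whatever other rules the file contains.
    """
    groups = []
    for agent, role in AI_AGENTS.items():
        if role not in policy:
            continue
        directive = "Allow" if policy[role] else "Disallow"
        groups.append(f"User-agent: {agent}\n{directive}: /")
    return "\n\n".join(groups) + "\n"


# Allow AI search crawlers, block training crawlers, leave user-initiated agents untouched.
print(build_rules({"search": True, "training": False}))
```

Keeping the mapping in one place makes it easier to review the policy when a vendor adds or renames a user-agent.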
OpenAI: GPTBot, OAI-SearchBot, and ChatGPT-User
OpenAI separates search from model training. In OpenAI documentation, OAI-SearchBot is described as a search crawler that helps sites appear in ChatGPT search results. GPTBot is a separate crawler for content that may be used to train foundation models.
For many public sites, a practical policy is to allow ChatGPT Search while blocking the training crawler.
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
This does not guarantee a brand mention because ChatGPT still selects sources based on relevance and other signals. Blocking OAI-SearchBot, however, can prevent pages from being surfaced in ChatGPT search results.

ChatGPT-User is a special case. OpenAI says this user-agent is used for certain user actions in ChatGPT and Custom GPTs, not for automated crawling. Because the action is user-initiated, robots.txt rules may not apply. If a page must not be opened in this scenario, use site-level access control rather than relying only on a rule in robots.txt.
Anthropic: ClaudeBot, Claude-SearchBot and Claude-User
Anthropic also divides bots by role. Its help center describes three user-agents:
- ClaudeBot: for web content that can potentially be used to train models;
- Claude-SearchBot: to improve the quality of search results in Claude;
- Claude-User: for accessing sites at the request of the user.
If you want to block the training crawler while preserving Claude's search crawler, the rule might look like this:
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
Anthropic also states that it supports the non-standard Crawl-delay directive for limiting crawl activity. This is useful when the problem is server load rather than access itself.
User-agent: ClaudeBot
Crawl-delay: 1
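To confirm that the directive is at least syntactically readable, Python's standard urllib.robotparser can parse it. This is a minimal sketch; it shows only what Python's parser reads from the file, and says nothing about how Anthropic's crawler itself interprets the value.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: ClaudeBot
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

delay = parser.crawl_delay("ClaudeBot")  # value from the ClaudeBot group
other = parser.crawl_delay("GPTBot")     # None: no group applies to this agent
print(delay, other)

# A group that only sets Crawl-delay does not restrict access to any path.
print(parser.can_fetch("ClaudeBot", "https://example.com/blog/"))
```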
There is a technical nuance here. If you block the bot's IP ranges at the firewall, it may be unable to read the updated robots.txt. The crawler then cannot see the opt-out instructions. Make sure /robots.txt remains accessible before relying on those rules.
PerplexityBot and Perplexity-User
Perplexity's crawler documentation says that PerplexityBot is used to surface and link to sites in Perplexity search results. It also states that PerplexityBot is not used to crawl content for AI foundation models.
So, if your goal is to remain visible in Perplexity, do not block PerplexityBot without a good reason.
User-agent: PerplexityBot
Allow: /
Perplexity-User supports user actions. The Perplexity documentation explains that because the fetch is user-initiated, this agent generally ignores robots.txt rules. If that is unacceptable for your site, review not only robots.txt but also server access rules, the WAF, the paywall, authorization, and rate limiting.
For sensitive sections, do not assume that Disallow settles everything:
User-agent: Perplexity-User
Disallow: /clients/
This line can be a useful signal, but it shouldn't be the only barrier.
Google-Extended: not to be confused with Googlebot
Google-Extended is the most unusual item on this list. This is not a separate HTTP user-agent that you will see in the logs. Google describes it as a standalone product token in robots.txt. The crawl is performed by Google's existing user-agent, and Google-Extended works as a control token.
The main thing: Google-Extended does not affect the inclusion of the site in Google Search and is not used as a ranking signal. Google's documentation says that this token controls whether the content Google crawls on sites can be used to train future generations of Gemini models and for grounding in Gemini Apps and Grounding with Google Search on Vertex AI.
If you want to keep regular Google Search open but limit Google-Extended:
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Disallow: /
Don't replace this with a rule for Googlebot if you don't want to close your site from Google Search. This is one of the most dangerous mistakes in this topic.
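A quick local check can catch this mix-up before the file is published. The sketch below uses Python's standard urllib.robotparser; the helper name google_access and the sample URL are assumptions for illustration, and Python's parser is only an approximation of how Google evaluates the file.

```python
from urllib.robotparser import RobotFileParser


def google_access(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Report whether each Google token may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in ("Googlebot", "Google-Extended")}


INTENDED = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

# The mistake described above: the rule was written for the wrong token.
MISTAKE = """\
User-agent: Googlebot
Disallow: /
"""

print("intended:", google_access(INTENDED))
print("mistake: ", google_access(MISTAKE))
```

Under the intended rules, Googlebot stays allowed and Google-Extended is disallowed. Under the mistaken rules, the parser reports the opposite: Googlebot is blocked while Google-Extended has no restriction at all.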

Ready-to-use robots.txt rules
Below are templates that can be adapted. Before publishing, check them in a staging environment or at least in a robots.txt tester, if your CMS or SEO tool has one.
Scenario 1. Allow AI search but block training crawlers
Suitable for a blog, SaaS product, media outlet, or service site that wants to appear in AI search answers but does not want its content collected by training crawlers.
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
This is often a balanced option, but it does not guarantee visibility. After changing the rules, monitor whether pages appear in AI answers and which sources the model cites. The process in how to analyze the sources AI relies on can help.
Scenario 2. Block major AI bots completely
Suitable for a site where AI visibility is not the goal, or for content with an increased risk of copying.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Google-Extended
Disallow: /
Before making this decision, note the consequence: you deliberately reduce your chances of appearing in AI search scenarios. For some sites, that is acceptable. In commercial categories where users already ask ChatGPT or Perplexity "who to choose", it can mean losing a touchpoint.
Scenario 3. Block only individual sections
Suitable when the public blog, documentation, and service pages can be crawled, but internal, client, or archive sections should not be.
User-agent: GPTBot
Disallow: /clients/
Disallow: /internal/
Allow: /blog/
Allow: /services/
User-agent: OAI-SearchBot
Disallow: /clients/
Disallow: /internal/
Allow: /
This option is closer to real life. Not all sections of a site are equally valuable for AI search, and not all content is equally safe to expose.
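Path-level rules are easy to get wrong, so it helps to test a handful of representative URLs against the template. The following is a sketch using Python's standard urllib.robotparser; the sample paths (such as /clients/acme/report) are hypothetical. Note one caveat: Python's parser applies the first matching rule in file order, while RFC 9309 specifies that the most specific (longest) match wins. For this template both approaches give the same result, but that is not guaranteed for other rule sets.

```python
from urllib.robotparser import RobotFileParser

SCENARIO_3 = """\
User-agent: GPTBot
Disallow: /clients/
Disallow: /internal/
Allow: /blog/
Allow: /services/

User-agent: OAI-SearchBot
Disallow: /clients/
Disallow: /internal/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SCENARIO_3.splitlines())

# Hypothetical sample paths: two public, two that should stay blocked.
paths = ("/blog/ai-crawlers", "/services/", "/clients/acme/report", "/internal/wiki")

for agent in ("GPTBot", "OAI-SearchBot"):
    for path in paths:
        ok = parser.can_fetch(agent, f"https://example.com{path}")
        print(f"{agent:14} {path:22} {'allow' if ok else 'block'}")
```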
Scenario 4. Do not touch SEO, but limit Gemini-related use
Suitable when a site needs to stay in Google Search, but the team wants to limit Google-Extended.
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Disallow: /
Once again: Google-Extended is not a substitute for Googlebot.
How to check that the rules really work
Adding lines to robots.txt is only the first step. Verification matters more.
- Open https://domain.com/robots.txt in a browser. The file must return 200 OK, not an HTML page or a 403.
- Check the rules for each subdomain. If a blog lives on blog.example.com, it needs its own robots.txt.
- Look at the server logs by user-agent: GPTBot, OAI-SearchBot, ClaudeBot, Claude-SearchBot, PerplexityBot, Googlebot.
- Match the user-agent with official IP ranges if the provider publishes them. A user-agent string alone can be faked.
- Check the CDN and WAF. Cloudflare, AWS WAF, nginx, security plugins, or WordPress plugins can block the bot even before it reads robots.txt.
- Give the systems time. OpenAI and Perplexity, in their documentation, mention a delay of up to about 24 hours before some changes take effect; other crawlers may refresh their cache differently.
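The log review step can be sketched as a small script. This is an illustration under several assumptions: the log is in the common "combined" access log format, the function name summarize is hypothetical, and the sample lines are made up (real user-agent strings are longer, and the addresses come from reserved documentation ranges). It counts hits per AI user-agent token and records the status each one received for /robots.txt.

```python
import re
from collections import Counter

AI_TOKENS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "PerplexityBot", "Googlebot")

# Combined log format: ... "METHOD path HTTP/x" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def summarize(log_lines):
    """Return (hits per token, last /robots.txt status per token)."""
    hits = Counter()
    robots_status = {}
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        user_agent = match["ua"].lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                if match["path"].split("?")[0] == "/robots.txt":
                    robots_status[token] = match["status"]
    return hits, robots_status


# Made-up sample lines for illustration only.
SAMPLE = [
    '192.0.2.10 - - [20/Jun/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot)"',
    '192.0.2.10 - - [20/Jun/2026:10:00:05 +0000] "GET /blog/post HTTP/1.1" 200 18234 "-" "Mozilla/5.0 (compatible; GPTBot)"',
    '198.51.100.7 - - [20/Jun/2026:10:01:00 +0000] "GET /robots.txt HTTP/1.1" 403 162 "-" "Mozilla/5.0 (compatible; ClaudeBot)"',
    '203.0.113.5 - - [20/Jun/2026:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]

hits, robots_status = summarize(SAMPLE)
print(dict(hits))
print(robots_status)
```

In the sample, the ClaudeBot request for /robots.txt receives a 403, which is the kind of CDN or WAF block the checklist warns about. A matching user-agent string is not proof of identity, so this count is a starting point rather than verification.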

Also check whether priority pages were blocked accidentally. If previously cited pages disappear after a rules change, access may be the cause. See which website pages most often appear in AI answers for a method to identify pages that should remain accessible.
Common mistakes
Most problems come from choosing the wrong policy rather than writing invalid syntax. A team may block OAI-SearchBot when it intended to block only GPTBot, or block Googlebot instead of Google-Extended and damage Google Search visibility. Disallow prevents crawling; it does not guarantee deindexing. Listing private paths in the public robots.txt file can expose their locations. Other common mistakes include configuring only the main domain while ignoring subdomains, blocking /robots.txt or verified crawlers at the WAF, and relying on the non-standard Crawl-delay directive.
Remember the main limitation: robots.txt controls compliant crawler access but does not explain why a page deserves citation. Content, structure, visible facts, and external evidence serve that purpose. For the technical layer, see how structured data affects AI visibility.
Which policy to choose for different sites
There is no one-size-fits-all answer. The decision depends on what matters more to you: maximum AI visibility, control over content use, lower server load, or legal caution.
| Site type | Practical policy |
|---|---|
| Blog or media | It often makes sense to allow AI search bots and decide separately on training bots |
| SaaS or service site | Allow product, blog, FAQ, and documentation pages; protect private accounts, test environments, and client areas |
| Ecommerce | Allow categories, product pages, and help content; protect carts, internal search, parameter filters, and personal accounts |
| B2B company | Leave service pages, cases, expert materials available; handle PDFs, price lists, and client materials with care |
| Closed knowledge base | Do not rely on robots.txt; set authorization and server access control |
| Medical, legal, or financial website | Do a separate review with a lawyer or compliance team, because the cost of an error is higher |
For many sites, a sensible starting policy is to allow AI search crawlers, block training crawlers if required, review the logs, and compare citations and server load after 2-4 weeks.
FAQ
If you block GPTBot, will the site disappear from ChatGPT?
Not necessarily. OAI-SearchBot is responsible for search visibility in ChatGPT, while GPTBot relates to the training scenario. Still, check how specific answers behave in the interface and in the logs.
Why does the bot still come after Disallow?
Possible causes: a cached robots.txt, a different user-agent, a user-initiated fetch, a spoofed user-agent, a WAF that prevents the bot from reading the file, or a rule written for the wrong host. Start with the logs and check the actual request to /robots.txt.
Do I need to add all AI bots to robots.txt?
Not necessarily. It's better to start with those that really affect your business: OpenAI, Anthropic, Perplexity, Google. Then look at the logs and add rules for bots that actually visit the site or create load.
Summary
Configure AI crawler access by role rather than allowing or blocking every bot together. GPTBot, ClaudeBot, and Google-Extended relate to training or use in AI products. OAI-SearchBot, Claude-SearchBot, and PerplexityBot support AI search visibility. ChatGPT-User, Claude-User, and Perplexity-User form a separate group because a user often initiates those requests.
For most commercial sites, a practical first iteration is to allow AI search, limit training use if required, leave Googlebot unchanged, check WAF rules, and review server logs. This preserves the AI visibility channel while retaining control over how public content may be used.