AI crawler robots.txt: what to allow or block
The robots.txt line that can hide you
A single Disallow can keep your best page out of an AI answer, even when the page is public and indexed elsewhere.
If you want to appear in AI answers, allow the crawlers that build or refresh search results. For ChatGPT search, that crawler is OAI-SearchBot. For Claude search, it is Claude-SearchBot. For Perplexity, it is PerplexityBot. For Google AI Overviews, there is no separate AI crawler or separate opt-out crawler. Google AI Overviews use the normal Google Search index, so the key crawler is Googlebot. For Apple search and Apple Intelligence surfaces, the crawler to allow is Applebot.
Blocking those search crawlers in robots.txt removes you from that engine's live search or answer surface. Google has one extra wrinkle: a blocked URL can still show as a bare search result if other pages link to it, but Google cannot crawl the blocked page content for snippets or AI features.
- Allow OAI-SearchBot if you want pages eligible for ChatGPT search answers.
- Allow Claude-SearchBot if you want pages eligible for Claude search results.
- Allow PerplexityBot if you want pages eligible for Perplexity results and citations.
- Allow Googlebot if you want Google Search and AI Overviews to use your pages.
- Allow Applebot if you want Apple search features to reach your pages.
After you edit robots.txt, check the live file, then test an important page with the free AI visibility checker. It gives you a quick read on whether the main AI search crawlers can reach the site.
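For a quick local read before using a hosted checker, a short script can evaluate a robots.txt body against the five search crawler tokens named above. This is a minimal sketch using Python's standard urllib.robotparser. The function name search_crawler_access and the sample file are illustrative assumptions, not part of any vendor tooling. The standard-library matcher compares user-agent tokens by substring and uses the first group that matches, which may differ from how a given crawler selects its group, so treat the output as a hint and not as proof.

```python
from urllib.robotparser import RobotFileParser

# Search crawler tokens discussed in this article.
SEARCH_CRAWLERS = [
    "OAI-SearchBot",
    "Claude-SearchBot",
    "PerplexityBot",
    "Googlebot",
    "Applebot",
]


def search_crawler_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Return {crawler token: allowed?} for one URL, given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in SEARCH_CRAWLERS}


# Hypothetical file: one search crawler blocked by mistake, everyone else allowed.
sample = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

result = search_crawler_access(sample, "https://example.com/pricing")
for bot, allowed in result.items():
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

In the sample, PerplexityBot has its own blocking group, so it is reported as blocked while the other four fall through to the * group. To test a live site, fetch the real /robots.txt body first and pass it in as text.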
Search crawlers and training controls are different
Many site owners block the wrong bot. They lose search visibility when they only meant to opt out of training.
These tokens are training, dataset, or AI-use controls: GPTBot, ClaudeBot, CCBot, Google-Extended, and Applebot-Extended. Blocking them tells that vendor or data source you do not want your pages used that way. Blocking them does not block live AI-search visibility by itself.
Google-Extended and Applebot-Extended are easy to misread. They are robots.txt control tokens, not separate crawl user-agents you should expect to see in logs. Google says Google-Extended does not affect inclusion or ranking in Google Search. Apple says Applebot-Extended lets publishers opt out of foundation-model training while Applebot can still include pages in search results.
robots.txt is stated policy. It is not proof of what a bot did, and it is not access control. Perplexity-User and Bytespider have been reported to ignore robots.txt, so do not turn a robots.txt rule into a claim about crawler behavior. Use server logs, vendor IP ranges where published, reverse DNS checks, rate limits, and your firewall when access must be enforced.
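The reverse DNS check mentioned above is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname against the vendor's published domain, then resolve that hostname back and confirm it returns the same IP. The sketch below assumes a hypothetical helper name, verify_crawler_ip, and takes the allowed hostname suffixes as a parameter because each vendor publishes its own. Confirm the suffix (for example, googlebot.com for Googlebot) against the vendor's current documentation. The forward lookup used here is IPv4 only.

```python
import socket
from typing import Callable, Iterable


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable = socket.gethostbyaddr,
    forward: Callable = socket.gethostbyname_ex,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    The resolver functions are injectable so the logic can be tested
    without network access.
    """
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:  # no PTR record, or the lookup failed
        return False

    suffixes = [s.lower().strip(".") for s in allowed_suffixes]
    # Match whole labels only, so "x.notgooglebot.com" does not pass.
    if not any(hostname == s or hostname.endswith("." + s) for s in suffixes):
        return False

    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The PTR hostname must resolve back to the same IP.
    return ip in addresses
```

A user-agent string alone proves nothing, since anyone can send it. A request that claims to be Googlebot but fails this check is better treated as an unknown client.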
When the exact rule matters, check the source. Use the official docs from OpenAI, Anthropic, Perplexity, Google, Apple, and Common Crawl instead of copied bot lists.
A clean robots.txt pattern
Start with the pages you want found in answers. Then add training opt-outs only where you mean them.
This pattern keeps the main AI-search crawlers open and blocks the common training controls. Adjust the paths if only part of the site should be available.
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: Applebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
Test the file after every edit. A broad rule like User-agent: * Disallow: / can block crawlers you meant to allow if the file is malformed or if a crawler does not choose the group you expected. Keep public pages crawlable, return normal 200 responses, and make sure canonical URLs do not point to blocked pages.
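One way to make "test after every edit" repeatable is to write down the policy you intend and compare it with what a parser reads from the file. The sketch below is an assumption-laden example: policy_mismatches is a hypothetical helper, the sample file is invented, and Python's urllib.robotparser only approximates how real crawlers pick a group. It is still enough to catch the common mistake where a crawler with no group of its own falls through to a blocking * group.

```python
from urllib.robotparser import RobotFileParser


def policy_mismatches(robots_txt: str, expected: dict[str, bool], url: str) -> list[str]:
    """Return the crawler tokens whose parsed access differs from the intent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        bot for bot, should_allow in expected.items()
        if parser.can_fetch(bot, url) != should_allow
    ]


# Hypothetical file: PerplexityBot was meant to be allowed but has no group,
# so it inherits the catch-all Disallow.
robots_txt = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

intended = {"OAI-SearchBot": True, "GPTBot": False, "PerplexityBot": True}
print(policy_mismatches(robots_txt, intended, "https://example.com/docs"))
```

Here the only mismatch is PerplexityBot. An empty list means the file, as this parser reads it, matches the stated intent.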
JavaScript is a crawl risk. Google publicly documents rendering for Googlebot. Most AI crawler docs do not give the same promise for answer inclusion. If the useful text appears only after client-side JavaScript runs, some AI systems may miss it. Put the core content in server-rendered HTML when visibility matters.
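A rough way to check this is to look for a key phrase in the raw HTML response, ignoring anything inside script and style tags. The sketch below uses Python's standard html.parser. The name text_in_static_html and both sample pages are illustrative assumptions, and the check says nothing about how any particular crawler actually processes a page.

```python
from html.parser import HTMLParser


class _StaticText(HTMLParser):
    """Collect text nodes that sit outside script, style, and template tags."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_in_static_html(html: str, phrase: str) -> bool:
    """True if the phrase appears in the HTML text without running JavaScript."""
    parser = _StaticText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # collapse whitespace
    return phrase.lower() in text.lower()


server_rendered = "<html><body><h1>Pricing plans</h1></body></html>"
client_only = '<html><body><div id="root"></div><script>render("Pricing plans")</script></body></html>'

print(text_in_static_html(server_rendered, "Pricing plans"))  # True
print(text_in_static_html(client_only, "Pricing plans"))      # False
```

Feed it the body of a plain HTTP GET for an important page. If the phrase only shows up inside a script, a crawler that does not render JavaScript is unlikely to see it.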
Email deliverability has its own gate
robots.txt controls who can read your site. SPF, DKIM, and DMARC help mailbox providers decide whether to trust your mail.
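SPF is a DNS TXT record that lists the servers allowed to send for a domain. A record ending in ~all is a soft fail: it tells receivers to be suspicious but leaves room for delivery. A record ending in -all is a hard fail. It is stricter, so use it only when every real sender is included. SPF also limits evaluation to 10 DNS-querying terms (include, a, mx, ptr, exists, and redirect), and lookups inside nested includes count toward the same limit. Going over it produces a permanent error (permerror).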
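To get a first estimate of how close a record is to the lookup limit, you can count the terms that trigger DNS queries. The sketch below is a simplified, hypothetical helper (count_spf_lookups) with made-up domains. It counts only the terms in the one record it is given and does not follow include or redirect targets, so the real total for a domain can be higher.

```python
# Mechanisms that cause a DNS query during SPF evaluation (RFC 7208, section 4.6.4).
LOOKUP_MECHANISMS = {"include", "a", "mx", "ptr", "exists"}


def count_spf_lookups(record: str) -> int:
    """Count DNS-querying terms in a single SPF record (no recursion)."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")

    count = 0
    for term in terms[1:]:
        term = term.lower()
        if term.startswith("redirect="):  # the redirect modifier also counts
            count += 1
            continue
        name = term.lstrip("+-~?")  # drop the qualifier
        for sep in (":", "/"):      # drop the domain spec or CIDR length
            name = name.split(sep, 1)[0]
        if name in LOOKUP_MECHANISMS:
            count += 1
    return count


record = "v=spf1 include:_spf.example.net include:mail.example.org a mx ip4:192.0.2.0/24 ~all"
print(count_spf_lookups(record))  # prints 4
```

The ip4, ip6, and all terms do not query DNS, which is why replacing an include with explicit IP ranges is a common way to get back under the limit.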
DKIM signs each message with a private key. The public key lives in DNS under a selector, such as selector1._domainkey.example.com. If a service sends your mail but does not sign with your domain, DMARC can fail even when SPF passes.
DMARC sits on top of SPF and DKIM. A message passes DMARC when SPF or DKIM passes and aligns with the visible From domain. p=none monitors. p=quarantine asks receivers to treat failures as suspicious. p=reject asks receivers to reject failures. Add rua reports so you can see who is sending as your domain before moving to a stricter policy. If you need to read one of those XML reports, use the free DMARC report reader.
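The pass rule can be written as a small function: DMARC passes if SPF passes with an aligned domain, or DKIM passes with an aligned domain. This is a sketch under stated assumptions. The name dmarc_passes and the example domains are hypothetical, and the default comparison is strict (exact domain match). Relaxed alignment compares organizational domains, which needs a public suffix list, so the caller can supply that as the org_domain function.

```python
from typing import Callable


def dmarc_passes(
    from_domain: str,
    spf_pass: bool,
    spf_domain: str,
    dkim_pass: bool,
    dkim_domain: str,
    org_domain: Callable[[str], str] = lambda d: d,
) -> bool:
    """Return True if SPF or DKIM both passes and aligns with the From domain.

    spf_domain is the envelope sender (MAIL FROM) domain and dkim_domain is
    the d= value of the signature. With the default org_domain, alignment is
    strict; pass a public-suffix-aware function for relaxed alignment.
    """
    def norm(domain: str) -> str:
        return org_domain(domain.lower().rstrip("."))

    target = norm(from_domain)
    spf_ok = spf_pass and norm(spf_domain) == target
    dkim_ok = dkim_pass and norm(dkim_domain) == target
    return bool(spf_ok or dkim_ok)


# A service sends as example.com but authenticates only with its own domains.
print(dmarc_passes("example.com", True, "bounces.mailer.example.net",
                   True, "mailer.example.net"))  # False

# Same message, now DKIM-signed with the From domain.
print(dmarc_passes("example.com", True, "bounces.mailer.example.net",
                   True, "example.com"))  # True
```

The first call shows the case where SPF and DKIM both pass on their own terms and DMARC still fails, because neither result aligns with the visible From domain.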
Gmail and Outlook also look at valid forward and reverse DNS, blocklists, complaint rates, bounce rates, message content, sending volume, and past reputation. MX records matter for receiving mail and for domain health checks, but they do not replace sender authentication. Broken authentication is still the first thing to fix because it is visible and testable. The free domain scorecard checks SPF, DKIM, DMARC, MX, and related DNS issues in one place.
The standards behind this are public: SPF is RFC 7208, DKIM is RFC 6376, and DMARC is RFC 7489. Gmail and Microsoft sender guidelines add mailbox-provider rules on top.
Checklist before you publish changes
Treat robots.txt like DNS. Small edits can have wide effects, and cached results may take time to settle.
- Fetch /robots.txt from the live domain, not a staging copy.
- Confirm the search crawlers you care about are not blocked: OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, and Applebot.
- Block training controls only when you mean that opt-out: GPTBot, ClaudeBot, CCBot, Google-Extended, and Applebot-Extended.
- Load an important page with JavaScript disabled and check that the main text is still present in the HTML.
- Check server logs after launch, but verify bot identity before drawing conclusions from a user-agent string.
- Review the related InboxRadar guides when DNS, authentication, or deliverability warnings show up beside the crawl issue.
FAQ
Does blocking GPTBot remove my site from ChatGPT search?
No. GPTBot is a training crawler. ChatGPT search visibility is controlled by OAI-SearchBot. You can allow OAI-SearchBot and block GPTBot if you want search visibility without allowing training use.
Does Google have a separate AI Overviews crawler?
No. Google AI Overviews use the normal Google Search index. Blocking Googlebot can remove your page content from Google Search and AI features. Blocking Google-Extended does not remove you from Google Search or act as a ranking signal.
Can robots.txt force an AI company to obey my rule?
No. robots.txt is a published policy. Major search crawlers usually honor it, but it is not access control. Use authentication, firewall rules, rate limits, and log review when access must be enforced.
Should a SaaS block all AI crawlers?
Usually no. Block training controls if that is your policy, but keep AI-search crawlers open for public product pages, docs, pricing pages, and support content you want users to find in answers.
What should I fix first if email is going to spam?
Check SPF, DKIM, and DMARC first. Then check forward and reverse DNS, MX, blocklists, complaint rate, sending volume, and content. Mailbox providers use many signals, but authentication failures are the easiest to prove and fix.