Configuring robots.txt — LLM-LD

The two-file strategy

When you set up an AI subdomain, you need two separate robots.txt files with opposite rules:

  • Main site (yoursite.com/robots.txt) — Blocks AI crawlers, allows search engines
  • AI subdomain (ai.yoursite.com/robots.txt) — Allows AI crawlers, blocks search engines

Your main site is for humans and Google. Your AI subdomain is for ChatGPT, Claude, and Perplexity. robots.txt enforces the separation.

Why separate sites?

The core idea behind the two-site architecture is independence. Your human-facing website and your AI-facing content have different audiences, different needs, and will evolve differently over time.

Future-proofing your AI content

AI systems are changing fast. What AI needs to understand your business today might be different from what it needs in six months. With a separate AI layer, you can:

  • Add AI-specific content that doesn't belong on your main site (detailed entity relationships, machine-readable FAQs, structured pricing data)
  • Remove content that confuses AI without affecting what humans see (marketing fluff, outdated promotions, content that gets misinterpreted)
  • Experiment and iterate on your AI presence without touching your production website
  • Respond to new AI capabilities as they emerge (new structured data formats, agent-specific instructions, real-time feeds)

If your AI content lives on your main site, you're stuck. Every change optimized for AI risks breaking something for humans, and vice versa. You can't serve two masters from one page.

Think of it like responsive design

You wouldn't serve the exact same layout to mobile and desktop — you adapt to the device. The AI subdomain is the same concept: adapting your content to the consumer. Humans get a beautiful, interactive website. AI gets clean, structured, machine-optimized data.

🔮 Looking ahead: As AI agents become more capable (booking appointments, making purchases, comparing options), your AI layer can evolve to support these interactions — without cluttering your human experience with machine-readable instructions.


Why block AI from your main site?

This might seem counterintuitive — don't you want AI to crawl your site? Yes, but you want AI to crawl the right version of your site. Here's why:

1. Signal-to-noise ratio

Your main site is built for humans: navigation menus, hero images, JavaScript frameworks, cookie banners, chat widgets. AI crawlers have to wade through all of this to extract the actual content. Your AI subdomain is pure signal — clean HTML, rich Schema.org, no clutter.

2. Consistent structured data

Your AI subdomain has guaranteed Schema.org markup on every page, plus llm-index.json, entities, and knowledge graphs. Your main site might have inconsistent or missing structured data depending on how it was built.

3. Control over what AI learns

When AI crawls your main site, you don't control what it extracts or how it interprets your content. With an AI subdomain, you're explicitly defining what AI should know about your business — the same way you'd brief a new employee.

4. Avoid duplicate/conflicting information

If AI crawls both your main site and your AI subdomain, it might find slightly different information (different wording, outdated pages on the main site, etc.). By blocking the main site, you ensure AI only sees your canonical, structured version.

⚠️ Exception: the AI Discovery Page. Your ADP at /ai-discovery should always be crawlable — it's the bridge that points AI crawlers to your AI subdomain. Make sure to add Allow: /ai-discovery in your main site's robots.txt.


The traffic flow

Here's how crawlers are directed with this setup:

An AI crawler (GPTBot, ClaudeBot, etc.) that wants your content:

  1. yoursite.com → robots.txt: BLOCKED
  2. /ai-discovery → ALLOWED (the bridge)
  3. ai.yoursite.com → robots.txt: ALLOWED

A search crawler (Googlebot, Bingbot, etc.) indexing for search:

  1. yoursite.com → robots.txt: ALLOWED
The result: search engines index your human-friendly main site, AI systems consume your machine-optimized AI subdomain, and the AI Discovery Page connects them.


Main site robots.txt

This file blocks all known AI crawlers while preserving access for traditional search engines. Place it at yoursite.com/robots.txt.

yoursite.com/robots.txt
# ============================================================================
# ROBOTS.TXT FOR YOURSITE.COM (MAIN SITE)
# ============================================================================
# This file BLOCKS AI crawlers and ALLOWS regular search crawlers.
# AI crawlers should use ai.yoursite.com instead.
#
# IMPORTANT: All crawlers are allowed to access /ai-discovery
# This is the bridge page that directs AI to the AI subdomain.
#
# Last updated: February 2026
# ============================================================================

# ============================================================================
# OPENAI CRAWLERS - BLOCKED (except /ai-discovery)
# ============================================================================
User-agent: GPTBot
Allow: /ai-discovery
Disallow: /

User-agent: ChatGPT-User
Allow: /ai-discovery
Disallow: /

User-agent: OAI-SearchBot
Allow: /ai-discovery
Disallow: /

# ============================================================================
# ANTHROPIC (CLAUDE) CRAWLERS - BLOCKED (except /ai-discovery)
# ============================================================================
User-agent: anthropic-ai
Allow: /ai-discovery
Disallow: /

User-agent: ClaudeBot
Allow: /ai-discovery
Disallow: /

User-agent: Claude-Web
Allow: /ai-discovery
Disallow: /

User-agent: Claude-SearchBot
Allow: /ai-discovery
Disallow: /

# ============================================================================
# PERPLEXITY CRAWLERS - BLOCKED (except /ai-discovery)
# ============================================================================
User-agent: PerplexityBot
Allow: /ai-discovery
Disallow: /

User-agent: Perplexity-User
Allow: /ai-discovery
Disallow: /

# ============================================================================
# GOOGLE AI CRAWLERS - BLOCKED (except /ai-discovery)
# ============================================================================
User-agent: Google-Extended
Allow: /ai-discovery
Disallow: /

User-agent: GoogleOther
Allow: /ai-discovery
Disallow: /

# ============================================================================
# META AI CRAWLERS - BLOCKED (except /ai-discovery)
# ============================================================================
User-agent: FacebookBot
Allow: /ai-discovery
Disallow: /

User-agent: meta-externalagent
Allow: /ai-discovery
Disallow: /

# ============================================================================
# OTHER AI CRAWLERS - BLOCKED (except /ai-discovery)
# ============================================================================
User-agent: Amazonbot
Allow: /ai-discovery
Disallow: /

User-agent: Applebot-Extended
Allow: /ai-discovery
Disallow: /

User-agent: Bytespider
Allow: /ai-discovery
Disallow: /

User-agent: CCBot
Allow: /ai-discovery
Disallow: /

User-agent: cohere-ai
Allow: /ai-discovery
Disallow: /

User-agent: Diffbot
Allow: /ai-discovery
Disallow: /

User-agent: DeepSeekBot
Allow: /ai-discovery
Disallow: /

User-agent: DuckAssistBot
Allow: /ai-discovery
Disallow: /

User-agent: YouBot
Allow: /ai-discovery
Disallow: /

# ============================================================================
# REGULAR SEARCH CRAWLERS - ALLOWED
# ============================================================================
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Slurp
Allow: /

User-agent: DuckDuckBot
Allow: /

# ============================================================================
# DEFAULT - ALLOW (for unlisted crawlers)
# ============================================================================
User-agent: *
Allow: /

# ============================================================================
# SITEMAP
# ============================================================================
Sitemap: https://yoursite.com/sitemap.xml
🔑 Why Allow: /ai-discovery appears for every AI crawler: In robots.txt, a crawler obeys only the most specific user-agent group that matches it and ignores all others. If GPTBot has its own group containing Disallow: /, it never consults a separate User-agent: * group — so an Allow: /ai-discovery placed only under the wildcard would do nothing for GPTBot. Each AI crawler's group therefore needs its own explicit Allow: /ai-discovery alongside the Disallow: /, so it can find the bridge page that directs it to your AI subdomain. (Compliant parsers pick the most specific path rule regardless of line order; listing the Allow first also keeps older, order-sensitive parsers happy.)
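You can sanity-check this precedence behavior with Python's standard-library robots.txt parser, which applies the first matching rule within a group. The snippet below parses a small excerpt mirroring the main-site rules above (an inline string, not a live fetch; the path /pricing is a placeholder) and confirms that GPTBot can reach only /ai-discovery while Googlebot falls through to the wildcard group:

```python
import urllib.robotparser

# Excerpt mirroring the main-site rules above (hypothetical paths).
MAIN_SITE_RULES = """\
User-agent: GPTBot
Allow: /ai-discovery
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(MAIN_SITE_RULES.splitlines())

print(rp.can_fetch("GPTBot", "/ai-discovery"))  # True: explicit Allow in GPTBot's group
print(rp.can_fetch("GPTBot", "/pricing"))       # False: Disallow: / catches everything else
print(rp.can_fetch("Googlebot", "/pricing"))    # True: no Googlebot group here, so * applies
```

Note that flipping the two rule lines in GPTBot's group would block /ai-discovery under this parser's first-match semantics — which is exactly why the Allow line comes first in the file above.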


AI subdomain robots.txt

This file does the opposite: it welcomes all AI crawlers while blocking traditional search engines, so the subdomain doesn't compete with your main site as duplicate content. Place it at ai.yoursite.com/robots.txt. Keep in mind that a robots.txt Disallow prevents crawling, not indexing: if other sites link to your AI subdomain, search engines may still list the bare URLs.

ai.yoursite.com/robots.txt
# ============================================================================
# ROBOTS.TXT FOR AI.YOURSITE.COM (AI SUBDOMAIN)
# ============================================================================
# This file ALLOWS all AI crawlers and BLOCKS regular search crawlers.
# This is the machine-readable version of yoursite.com
#
# Last updated: February 2026
# ============================================================================

# ============================================================================
# OPENAI CRAWLERS - ALLOWED
# ============================================================================
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# ============================================================================
# ANTHROPIC (CLAUDE) CRAWLERS - ALLOWED
# ============================================================================
User-agent: anthropic-ai
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# ============================================================================
# PERPLEXITY CRAWLERS - ALLOWED
# ============================================================================
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# ============================================================================
# GOOGLE AI CRAWLERS - ALLOWED
# ============================================================================
User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: GoogleOther-Image
Allow: /

User-agent: GoogleOther-Video
Allow: /

User-agent: Google-CloudVertexBot
Allow: /

User-agent: GoogleAgent-Mariner
Allow: /

User-agent: Gemini-Deep-Research
Allow: /

User-agent: Gemini-AI
Allow: /

# ============================================================================
# MICROSOFT/BING AI - ALLOWED
# ============================================================================
User-agent: Copilot
Allow: /

# ============================================================================
# META AI CRAWLERS - ALLOWED
# ============================================================================
User-agent: FacebookBot
Allow: /

User-agent: meta-externalagent
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

# ============================================================================
# OTHER AI/LLM CRAWLERS - ALLOWED
# ============================================================================
User-agent: Amazonbot
Allow: /

User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: DeepSeekBot
Allow: /

User-agent: Diffbot
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: YouBot
Allow: /

User-agent: YouAgent
Allow: /

# ============================================================================
# BLOCK REGULAR SEARCH ENGINE CRAWLERS
# ============================================================================
# These should index the main site, not the AI subdomain
User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Googlebot-News
Disallow: /

User-agent: Googlebot-Video
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: BingPreview
Disallow: /

User-agent: msnbot
Disallow: /

User-agent: Slurp
Disallow: /

User-agent: Baiduspider
Disallow: /

User-agent: YandexBot
Disallow: /

User-agent: DuckDuckBot
Disallow: /

# ============================================================================
# BLOCK EVERYTHING ELSE BY DEFAULT
# ============================================================================
User-agent: *
Disallow: /

# ============================================================================
# SITEMAP (AI SUBDOMAIN)
# ============================================================================
Sitemap: https://ai.yoursite.com/sitemap.xml

Quick reference

Here's a summary of what each crawler type sees:

  Crawler                    Main Site    AI Subdomain
  GPTBot (OpenAI)            Blocked      Allowed
  ClaudeBot (Anthropic)      Blocked      Allowed
  PerplexityBot              Blocked      Allowed
  Google-Extended (AI)       Blocked      Allowed
  Googlebot (Search)         Allowed      Blocked
  Bingbot (Search)           Allowed      Blocked
  /ai-discovery page         Allowed for all crawlers

Keeping up with new crawlers

New AI crawlers appear regularly. When a new LLM or AI system launches, it typically announces its user-agent string. Add new crawlers to both files — blocked on main, allowed on AI subdomain.
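One way to notice crawlers you haven't listed yet is to tally bot-like user-agents in your web server's access log. This is a sketch assuming the common nginx/Apache "combined" log format; the sample log, entries, and paths below are made up stand-ins for your real log file:

```shell
# Create a small sample access log (stands in for /var/log/nginx/access.log).
cat > /tmp/sample_access.log <<'EOF'
1.2.3.4 - - [01/Feb/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.2"
5.6.7.8 - - [01/Feb/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"
9.9.9.9 - - [01/Feb/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "SomeNewBot/0.1"
EOF

# In the combined format, field 6 (splitting on double quotes) is the user-agent.
# Count distinct bot-like user-agents, most frequent first.
awk -F'"' '{print $6}' /tmp/sample_access.log | grep -i bot | sort | uniq -c | sort -rn
```

Any user-agent in this list that you don't recognize is a candidate for both files: blocked on the main site, allowed on the AI subdomain.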

Current major AI crawler user-agents to track:

  • OpenAI: GPTBot, ChatGPT-User, OAI-SearchBot
  • Anthropic: anthropic-ai, ClaudeBot, Claude-Web, Claude-SearchBot
  • Google: Google-Extended, GoogleOther, Gemini-AI
  • Meta: FacebookBot, meta-externalagent
  • Perplexity: PerplexityBot, Perplexity-User
  • Others: Amazonbot, Applebot-Extended, Bytespider, CCBot, cohere-ai, DeepSeekBot

Testing your setup

After deploying both files:

  1. Visit yoursite.com/robots.txt — verify AI crawlers are blocked
  2. Visit ai.yoursite.com/robots.txt — verify AI crawlers are allowed
  3. Check that /ai-discovery is explicitly allowed on the main site
  4. Validate the syntax with a robots.txt checker, such as the robots.txt report in Google Search Console (the standalone robots.txt Tester has been retired)
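Steps 1–3 can also be checked programmatically with Python's urllib.robotparser. This sketch parses minimal inline stand-ins for the two files (not live fetches; substitute your real file contents) and verifies the expectations from the quick-reference table:

```python
import urllib.robotparser

def is_allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Parse a robots.txt body and ask whether `agent` may fetch `path`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, path)

# Minimal stand-ins for the two files above.
MAIN = "User-agent: GPTBot\nAllow: /ai-discovery\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
AI_SUB = "User-agent: GPTBot\nAllow: /\n\nUser-agent: Googlebot\nDisallow: /\n\nUser-agent: *\nDisallow: /\n"

# (file, agent, path, expected) rows from the quick-reference table.
checks = [
    (MAIN,   "GPTBot",    "/",             False),  # AI blocked on main site
    (MAIN,   "GPTBot",    "/ai-discovery", True),   # except the bridge page
    (MAIN,   "Googlebot", "/",             True),   # search allowed on main site
    (AI_SUB, "GPTBot",    "/",             True),   # AI allowed on subdomain
    (AI_SUB, "Googlebot", "/",             False),  # search blocked on subdomain
]
for robots, agent, path, expected in checks:
    assert is_allowed(robots, agent, path) == expected, (agent, path)
print("all robots.txt checks passed")
```

Running the same matrix against your deployed files (fetched with any HTTP client) is a quick regression test whenever you add a new crawler to either list.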
⚠️ Common mistake: getting the two files swapped. Double-check that the main site blocks AI and the AI subdomain allows AI — it's easy to mix up.

Next: Build your AI Discovery Page

The ADP is the bridge that connects your main site to your AI subdomain.