Frustrated that all of the work you’ve done creating content for your personal blog or Web site is now being used to train generative AI? Turns out that you can tell the large language model (LLM) crawler ‘bots to ignore your content and not index it! Here’s how…
If you’ve spent any time exploring the capabilities of large language model-based generative AI tools, you know that they’re quite remarkable. Much like Google, they seem able to answer just about any query you throw at them. But where do they get all that information? It turns out that it’s through scraping and analyzing the Web, along with many other data sources. Imagine: when you ask about a topic you’ve written about, it might well be referencing your material to synthesize its answer.
This is pretty cool to think about, unless you’d really rather not have your content used to train these systems. But it’s on the Internet, so it’s too late, right? Well, maybe not so much. Turns out that all of the well-behaved AI site crawlers respect something called “robots.txt”, and that within that file you can specify that your content is off-limits. Let’s have a closer look.
ROBOTS.TXT
There’s some irony in the fact that the file you use to tell AI robot crawlers not to scour your site’s content is called “robots.txt”. Turns out that it’s been used for many years to tell search engine crawlers what to index and what to ignore, and almost every site has one. Just append “robots.txt” to the site’s base URL. For example, click through and you can read https://whitehouse.gov/robots.txt to see if the White House site blocks any crawlers:
User-agent: *
Disallow:
Sitemap: https://www.whitehouse.gov/sitemap_index.xml
Sitemap: https://www.whitehouse.gov/es/sitemap_index.xml
It doesn’t! Nor does it impose any blocks or limits on AI crawlers. Some sites have more detailed files, however. Meta.com, the site of Facebook’s parent company, has a quite detailed robots.txt:
User-agent: *
Allow: /*help/support/$
Allow: /*account/$
Allow: /*order/find/$
Allow: /*return/find/$
Disallow: *?_ga=
Disallow: /*help_app/*
Disallow: /*help/support/*
Disallow: /*account/*
Disallow: /order/*
Disallow: /*help/search/*
Disallow: /*rma/*
Disallow: /return/*
Disallow: /meta-employee-store/
Disallow: /bv/upload/*
Disallow: /intern/*
Disallow: /internal/*
Disallow: *utm_source
Disallow: *utm_content
Disallow: *utm_campaign
Disallow: *utm_offering
Disallow: *utm_product
Disallow: *utm_medium
Disallow: *?fb_comment_id=
Disallow: *?id=
Disallow: *?cursor=
Disallow: *?ref=
Disallow: *?intern_source
Disallow: *?intern_content
Disallow: */?uid
Sitemap: https://www.meta.com/sitemap.xml
Sitemap: https://www.meta.com/help/sitemap.xml
Sitemap: https://www.meta.com/blog/quest/sitemap.xml
Sitemap: https://www.meta.com/experiences/sitemap.xml
Notice that it’s still organized around the site’s content, rather than differentiating AI crawlers from other types of ‘bots and software that might seek to index the site. Here’s my robots.txt for comparison: askdavetaylor.com/robots.txt
WHAT ABOUT BLOCKING AI IN ROBOTS.TXT?
So what about blocking those pesky AI ‘bots? That’s a bit more complex, because to some extent we’re trying to close the proverbial barn door after the horses have already bolted; odds are good your content has already been indexed and analyzed by ChatGPT’s LLM, along with dozens of others ravenous for as much content as they can get. Still, what about new content you produce, and new crawlers that might be starting their index of the Web today rather than a few years ago?
ChatGPT, the busiest of the current generation of AI systems, comes from OpenAI. The way to block its crawler in your robots.txt is:
# GPTBot is OpenAI's web crawler
User-agent: GPTBot
Disallow: /
I haven’t found a single source that lists every user agent, but here are the ones I have dug up:
User-agent: CCBot - Common Crawl dataset, original training source for GPT
User-agent: GPTBot - OpenAI's Web crawler
User-agent: ChatGPT-User - ChatGPT interactive site access
User-agent: Google-Extended - Google Gemini (formerly Bard) and Vertex AI
User-agent: anthropic-ai - Anthropic
User-agent: Claude-Web - Claude
User-agent: Omgilibot - Webz.io, a company that sells data to AI researchers
User-agent: Omgili - Webz.io
User-agent: FacebookBot - Meta (Facebook)
User-agent: Bytespider - ByteDance, including Doubao
User-agent: magpie-crawler - Brandwatch AI
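If you want to block all of these at once, robots.txt lets you stack multiple User-agent lines into a single group and apply one rule to the lot. Here’s a sketch that combines the crawlers listed above, assuming each of them honors the standard Disallow directive the same way GPTBot does:

# Block known AI training and scraping crawlers
User-agent: CCBot
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: Omgilibot
User-agent: Omgili
User-agent: FacebookBot
User-agent: Bytespider
User-agent: magpie-crawler
Disallow: /

If you’d rather manage them individually, you can instead give each crawler its own User-agent / Disallow pair.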
What I haven’t figured out is Microsoft Copilot, a very popular AI. Since it’s ostensibly a layer atop ChatGPT, however, blocking OpenAI and ChatGPT should also effectively block Copilot.
WHAT ABOUT ON A PER-PAGE BASIS?
Beyond your site-wide robots.txt file, you can also add similar directives as meta information on an individual Web page, if you can get to the “raw” page source. In WordPress, for example, that might involve editing your theme template, something you should do with great caution lest you mess things up. Nonetheless, if you can get to the HEAD of the raw source page, you can try adding:
<meta name="robots" content="nocache">
<meta name="robots" content="noindex">
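In a classic WordPress theme, the HEAD typically lives in the theme’s header.php file, so a hypothetical placement might look like the sketch below; the comment line is just a placeholder for whatever tags your theme already has there:

<head>
  <!-- ...your theme's existing tags (charset, title, wp_head and so on)... -->
  <meta name="robots" content="nocache">
  <meta name="robots" content="noindex">
</head>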
The first one should discourage well-behaved AI systems from keeping a cached copy of your page’s content in their databases, and the noindex should tell them not to include the page in their index at all. Perhaps the key phrase is “well-behaved”, however, because any nefarious AI crawler that’s going to produce scammy SEO content or similar is also likely to ignore any requests to avoid indexing content.
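As an aside, the robots meta tag accepts a comma-separated list of values, so if you’d rather keep it to a single line, this should be equivalent:

<meta name="robots" content="noindex, nocache">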
I expect that we’ll soon have “robots” values specifically related to AI training and interactivity, but until then, these will at least give you a way to slow down these eager robots.
Pro Tip: I’ve been writing about AI for a while now. Please check out my AI and ChatGPT Help Area for more tutorials and help articles while you’re visiting!