Frustrated that all of the work you’ve done creating content for your personal blog or Web site is now being used to train generative AI? Turns out that you can tell the large language model (LLM) crawler ‘bots to ignore your content and not index it! Here’s how…
If you’ve spent any time exploring the capabilities of large language model-based generative AI tools, you know that they’re quite remarkable. Much like Google, they seem able to answer just about any query you throw at them. But where do they get all that information? It turns out that it’s through scraping and analyzing the Web, along with many other data sources. Imagine: when you ask about a topic you’ve written about, it might well be referencing your material to synthesize its answer.
This is pretty cool to think about, unless you’d really rather not have your content used to train these systems. But it’s on the Internet, so it’s too late, right? Well, maybe not so much. Turns out that all of the well-behaved AI site crawlers respect something called “robots.txt”, and that within that file you can specify that your content is off-limits. Let’s have a closer look.
ROBOTS.TXT
There’s some irony in the fact that the file you use to tell AI robot crawlers not to scour your site’s content is called “robots.txt”. Turns out that it’s been used for many years to tell search engine crawlers what to index and what to ignore, and almost every site has one. Just append “robots.txt” to the site’s base URL. For example, click through and you can read https://whitehouse.gov/robots.txt to see if the White House site blocks any crawlers:
User-agent: *
Disallow:
Sitemap: https://www.whitehouse.gov/sitemap_index.xml
Sitemap: https://www.whitehouse.gov/es/sitemap_index.xml
It doesn’t! Nor does it impose any blocks or limits on AI crawlers. Some sites have more detailed files, however. Meta.com, the site of Facebook’s parent company, has a quite detailed robots.txt:
User-agent: *
Allow: /*help/support/$
Allow: /*account/$
Allow: /*order/find/$
Allow: /*return/find/$
Disallow: *?_ga=
Disallow: /*help_app/*
Disallow: /*help/support/*
Disallow: /*account/*
Disallow: /order/*
Disallow: /*help/search/*
Disallow: /*rma/*
Disallow: /return/*
Disallow: /meta-employee-store/
Disallow: /bv/upload/*
Disallow: /intern/*
Disallow: /internal/*
Disallow: *utm_source
Disallow: *utm_content
Disallow: *utm_campaign
Disallow: *utm_offering
Disallow: *utm_product
Disallow: *utm_medium
Disallow: *?fb_comment_id=
Disallow: *?id=
Disallow: *?cursor=
Disallow: *?ref=
Disallow: *?intern_source
Disallow: *?intern_content
Disallow: */?uid
Sitemap: https://www.meta.com/sitemap.xml
Sitemap: https://www.meta.com/help/sitemap.xml
Sitemap: https://www.meta.com/blog/quest/sitemap.xml
Sitemap: https://www.meta.com/experiences/sitemap.xml
Notice that it’s still organized around the site’s content, rather than differentiating AI crawlers from other types of ‘bots and software that might seek to index the site. Here’s my robots.txt for comparison: askdavetaylor.com/robots.txt
WHAT ABOUT BLOCKING AI IN ROBOTS.TXT?
So what about blocking those pesky AI ‘bots? That’s a bit more complex, because to some extent we’re trying to close the proverbial barn door after the horses have already bolted; odds are good your content has already been indexed and analyzed by ChatGPT’s LLM, along with dozens of others ravenous for as much content as they can get. Still, what about new content you produce, and new crawlers that might be starting their index of the Web today rather than a few years ago?
ChatGPT, the busiest of the current generation of AI systems, comes from OpenAI. The way to block its crawler in your robots.txt is:
# GPTBot is OpenAI's web crawler
User-agent: GPTBot
Disallow: /
I haven’t found a single source that lists every user agent, but here are the ones I have dug up:
User-agent: CCBot - Common Crawl dataset, original training source for GPT
User-agent: GPTBot - OpenAI's Web crawler
User-agent: ChatGPT-User - ChatGPT interactive site access
User-agent: Google-Extended - Google Gemini (formerly Bard) and Vertex AI
User-agent: anthropic-ai - Anthropic
User-agent: Claude-Web - Claude
User-agent: Omgilibot - Webz.io, a company that sells data to AI researchers
User-agent: Omgili - Webz.io
User-agent: FacebookBot - Meta (Facebook)
User-agent: Bytespider - ByteDance, including Doubao
User-agent: magpie-crawler - Brandwatch AI
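If you want to block all of these at once, robots.txt lets you stack multiple User-agent lines into a single group and apply one rule to the lot. Here’s a sketch that combines the crawlers listed above, assuming each of them honors the standard Disallow directive the same way GPTBot does:

# Block known AI training and scraping crawlers
User-agent: CCBot
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: Omgilibot
User-agent: Omgili
User-agent: FacebookBot
User-agent: Bytespider
User-agent: magpie-crawler
Disallow: /

If you’d rather manage them individually, you can instead give each crawler its own User-agent / Disallow pair.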
What I haven’t figured out is Microsoft Copilot, a very popular AI. Since it’s ostensibly a layer atop ChatGPT, however, blocking OpenAI and ChatGPT should also effectively block Copilot.
WHAT ABOUT ON A PER-PAGE BASIS?
Beyond your site-wide robots.txt file, you can also add similar directives as meta information on an individual Web page, if you can get to the “raw” page source. In WordPress, for example, that might involve editing your theme template, something you should do with great caution lest you mess things up. Nonetheless, if you can get to the HEAD of the raw source page, you can try adding:
<meta name="robots" content="nocache">
<meta name="robots" content="noindex">
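In a classic WordPress theme, the HEAD typically lives in the theme’s header.php file, so a hypothetical placement might look like the sketch below; the comment line is just a placeholder for whatever tags your theme already has there:

<head>
  <!-- ...your theme's existing tags (charset, title, wp_head and so on)... -->
  <meta name="robots" content="nocache">
  <meta name="robots" content="noindex">
</head>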
The first one should discourage well-behaved AI systems from keeping a cached copy of your page’s content in their databases, and the noindex should tell them not to include the page in their index at all. Perhaps the key phrase is “well-behaved”, however, because any nefarious AI crawler that’s going to produce scammy SEO content or similar is also likely to ignore any requests to avoid indexing content.
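As an aside, the robots meta tag accepts a comma-separated list of values, so if you’d rather keep it to a single line, this should be equivalent:

<meta name="robots" content="noindex, nocache">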
I expect that we’ll soon have “robots” values specifically related to AI training and interactivity, but until then, these will at least give you a way to slow down these eager robots.
Pro Tip: I’ve been writing about AI for a while now. Please check out my AI and ChatGPT Help Area for more tutorials and help articles while you’re visiting!