Solving the AI scraping problem

People are mad at AI scrapers.

If you operate a website, have a blog, host documentation, or put any kind of content on the web, chances are you're going to be hit by dozens of scraper bots. Some of these bots are operated by search engines (Googlebot, Bingbot); others are automated vulnerability scanners probing for an unprotected server setup.

But increasingly, they are run by AI companies, scraping the content to power modern AI systems.

Many of these scrapers are badly behaved: they don't respect rate limits, don't respect robots.txt, and don't care if they effectively perform a Denial of Service attack on the site they're hitting. To make matters worse, the many different scrapers don't coordinate or share information, so the combined result is a Distributed Denial of Service (DDoS) attack, even if an unintentional one.

This is a textbook tragedy of the commons: each actor is behaving rationally in isolation, but the cumulative effect is bad for everyone. The additional irony is that when people try to protect the commons by legal or technical means, they often make it worse, not better.

Why a thousand scrapers bloom

From a pure engineering perspective, this is really inefficient. Most publicly available information on the internet only needs to be scraped once. After that, the resulting dataset could be shared and reused by everyone.

That's exactly what Common Crawl does, so why is everyone still running their own scraper in secret? Because the Common Crawl dataset is too small to be competitive, for two reasons:

  1. They don't have the infrastructure to crawl the entire Internet frequently enough (basically: it costs money!), so they only have a subset.
  2. They respect robots.txt, and a lot of sites disable bot access for anyone except Google, because everyone still wants to appear in Google search (see the snippet below).
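
For illustration, the "Google only" pattern typically looks something like this in robots.txt (the exact bot names and paths vary from site to site):

```
# Allow Google's crawler everywhere
User-agent: Googlebot
Allow: /

# Block every other crawler
User-agent: *
Disallow: /
```

Well-behaved crawlers honour this and go away; the badly behaved ones simply ignore it.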

Everyone else scrapes anyway, in secret. The data is not shared because it's treated like a competitive moat (even though it is, quite literally, publicly accessible on the Internet), and to avoid potential legal trouble.

If you are not Google and you want to build a search engine or train an AI model, you are stuck. Either you accept that you will never have comparable data, or you break the rules and hope nobody can prove it.

Quod licet Iovi, non licet bovi (what is permitted to Jupiter is not permitted to the ox)

Legally, content owners are perfectly free to allow Google to scrape their public sites and forbid everyone else. But that doesn't make it technically enforceable, or indeed moral. When you do that, what you're saying is that one huge corporation can take your data and use it (which includes training its AI), but everyone else is forbidden from doing the same thing.

The predictable outcome is that only the biggest players can afford to operate, and smaller companies, non-profits and individuals cannot compete. The moat around big companies gets deeper.

At that point, you are not “protecting your content”. You are just entrenching incumbents, and paying for it.

It's my content and I'll hide it if I want to

Two points that are easy to mix up:

  1. You should still be able to control whether your site is public or not. If something requires authentication or is not publicly accessible, that is a different story.
  2. If something is publicly accessible, you already cannot control where it ends up. That is how the web works, in practice.

Training LLMs on public data is, in practice, very similar to building a search engine index. Both are transformative uses. Both should fall under fair use, or an equivalent concept, even if the exact legal details still need work.

That does not mean "anything goes". A shared dataset should come with a license that allows specific uses as fair use (such as search indexing and, potentially, AI model training) and forbids things that are already obviously illegal, like bulk republishing or simple copy-paste mirrors. There could even be commercial licensing, similar to compulsory licensing in music, with pre-set rates (for example, a percentage of revenue) that go towards supporting Common Crawl and/or are redistributed to rights holders.

This is certainly a hugely challenging and complex issue, but we've seen it working in other domains — it can be done.

One small step

A good place to start would be to agree on one simple rule:

If you allow one public, unauthenticated actor to access and scrape your site, you must allow all others under the same terms.

In other words: Either something is public, or it's not. You should not be able to say “public, but only for Google”.

This would remove a lot of the legal gray area, make the rules clearer for everyone and reduce the incentive for secret scraping operations.

Here's how this could work in practice:

  • Common Crawl (or a similar non-profit, like the Internet Archive) becomes the primary public web archive, tasked with scraping, archiving, and making the dataset publicly available.
  • Big companies fund it because it is cheaper and cleaner than everyone scraping everything.
  • Everyone uses the shared dataset.

Why this probably won't happen soon

Many people are reflexively opposed to LLMs using their content, even when it is public. That makes any political or legal progress slow. At the same time, big companies are not going to stop. They will keep ingesting this data quietly, in ways that are hard to prove or challenge.

The worst possible outcome is:

  • Only big companies can do this at scale.
  • Small companies, researchers, and non-profits are locked out.
  • The moat gets deeper because of legal and financial risk, not technical skill.

Ironically, the attempt to “protect” public content is what strengthens this moat.

I don't want AI to be trained with my content

That's an understandable position. I believe the system I describe could actually help:

  1. You add metadata to your content that opts it out of AI training. Similar to how noindex opts a page out of search indexing, a hypothetical notrain directive could explicitly opt it out of AI training (see the sketch after this list).
  2. Your content gets crawled and stored in the corpus, with the opt-out metadata.
  3. AI labs are required to disclose which training corpus versions have been used for a specific model.
  4. If you find your content reproduced verbatim by an AI model that was trained on a corpus version carrying notrain metadata for your content, that is automatic proof that your copyright has been violated.
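
As a rough sketch of how the opt-out could be recorded at crawl time, here is one way a crawler might store the flag alongside the page. This assumes a hypothetical notrain value inside the standard robots meta tag, and a made-up corpus record format; only noindex exists today.

```python
# Illustrative sketch only: "notrain" is a hypothetical directive, and the
# corpus record format is invented for this example. "noindex" is real.
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def build_corpus_record(url: str, html: str, corpus_version: str) -> dict:
    """Store a crawled page together with its opt-out flags, so any training
    run on this corpus version can be audited later."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "url": url,
        "corpus_version": corpus_version,
        "html": html,
        "noindex": "noindex" in parser.directives,  # real, respected by search engines
        "notrain": "notrain" in parser.directives,  # hypothetical AI-training opt-out
    }


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, notrain"></head></html>'
    record = build_corpus_record("https://example.com/post", page, corpus_version="2025-01")
    print(record["notrain"])  # True: a compliant training pipeline must skip this page
```

The point of recording the corpus version next to the flag is that a later verbatim reproduction can be traced back to a specific dataset release, turning the opt-out into evidence rather than a polite request.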

Yes, bad actors could still ignore the flag, which is essentially the situation today. But ignoring an explicit opt-out is a deliberate act and thus easier to punish, whereas today there's really no recourse.

There must be a better way

We can keep the current situation, where everyone duplicates work, sites get hammered, the rules are unclear, and only the biggest players really win. Or we can treat public web data like the shared resource it already is: scrape it once, do it politely, share the results, and put clear rules around acceptable use.

That would be cheaper, cleaner, and more honest than what we are doing now. Even if it requires an uncomfortable compromise, it is still better than the slow, wasteful mess we are currently stuck with.
