
That's a pretty big assumption.

The largest site I work on has hundreds of thousands of pages, each in around 10 languages, so that's already millions of pages.

It generally works fine. Yesterday it averaged just under 1,000 RPS over the day.

AI crawlers have brought it down when a single crawler added 100, 200 or more RPS, distributed over a wide range of IPs. It's not just the raw number of extra requests, though that's wildly disproportionate for one "user"; the real problem is that they end up hitting expensive endpoints that are excluded by robots.txt and protected by rate-limiting measures which were never designed to withstand a DDoS.
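
To illustrate why the per-IP rate limiting never trips, here's a sketch of a naive fixed-window limiter (the threshold numbers are made up):

  # Sketch of a naive fixed-window per-IP rate limiter (thresholds
  # are assumed values). A crawler spreading 200 RPS across thousands
  # of IPs keeps every individual IP far below the limit.
  import time
  from collections import defaultdict

  WINDOW = 60    # seconds
  LIMIT = 120    # max requests per IP per window (assumed value)

  counts = defaultdict(int)
  window_start = time.time()

  def allow(ip: str) -> bool:
      global window_start
      if time.time() - window_start > WINDOW:
          counts.clear()
          window_start = time.time()
      counts[ip] += 1
      return counts[ip] <= LIMIT

  # 200 RPS spread over 10,000 IPs is ~1.2 requests per IP per
  # minute: every request is allowed, yet the origin still eats
  # 200 RPS.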



Ok, clearly I had no idea of the scale of it. 200 RPS from a single bot sounds pretty bad! Do all 100,000+ pages have to be served live to be useful, or could many be served from a cache that is minutes/hours/days old?


The main data for those pages is in a column store, so it can sustain many thousands of RPS (at least).

The problem is we have things like

  Disallow: /the-search-page
  Disallow: /some-statistics-pages
in robots.txt, which most search-engine (and similar) crawlers respect, but which the AI crawlers completely ignore.
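
For contrast, this is the check a well-behaved crawler does before fetching anything (Python stdlib; the URLs here are placeholders):

  # What a compliant crawler does before fetching: parse robots.txt
  # and honour Disallow. The AI crawlers described here skip this.
  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser("https://example.com/robots.txt")  # placeholder
  rp.read()

  if rp.can_fetch("SomeCrawler/1.0", "https://example.com/the-search-page"):
      print("allowed - fetch the page")
  else:
      print("disallowed - a polite crawler stops here")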

By chance, this morning I found a legacy site down, because in the last 8 hours it has had 2 million hits (~70/s) to a location disallowed in robots.txt. These hits came from over 1.5 million different IP addresses, so the existing rate-limit-by-IP didn't catch it.
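
One partial fix is to aggregate the limit by network prefix instead of exact address, though with 1.5 million scattered IPs even that only goes so far. A sketch (the prefix length and limit are guesses):

  # Sketch: count requests per /24 prefix rather than per exact IP,
  # so many addresses in one range share a single bucket. Helps
  # against cloud ranges; much weaker against a botnet scattered
  # across unrelated residential networks.
  import ipaddress
  from collections import defaultdict

  PREFIX_LEN = 24      # assumed aggregation width
  LIMIT = 600          # hypothetical per-prefix budget per window

  prefix_counts = defaultdict(int)

  def allow(ip: str) -> bool:
      net = ipaddress.ip_network(f"{ip}/{PREFIX_LEN}", strict=False)
      prefix_counts[str(net)] += 1
      return prefix_counts[str(net)] <= LIMIT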

The User-Agents are a huge mixture of real-looking web browsers; the IPs appear to come from residential, commercial and sometimes cloud ranges, so it's probably all hacked computers.

I could see Cloudflare having the data to block this better. They don't just get one or two requests from an IP; they presumably see a stream of them across many different sites. They could notice many different user agents being used from the same IP, along with other patterns, and assign a reputation score.
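
Even without their cross-site view, a single site can approximate one of those signals. A toy sketch (the threshold is a guess):

  # Toy reputation signal: an IP that presents many distinct
  # User-Agents in a short window is unlikely to be one real browser.
  from collections import defaultdict

  seen_agents = defaultdict(set)

  def suspicious(ip: str, user_agent: str, max_agents: int = 3) -> bool:
      seen_agents[ip].add(user_agent)
      # Assumed threshold: a real household rarely rotates through
      # more than a few UAs; a node in a crawler pool shows dozens.
      return len(seen_agents[ip]) > max_agents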

I think we will need to add a proof-of-work check in front of these pages and probably whitelist some 'good' bots (Wikipedia, Internet Archive, etc.). It's annoying, since this setup had been working fine for over 5 years.
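
The proof-of-work part would look something like hashcash: the server issues a random challenge, and the client must find a nonce whose hash has N leading zero bits. This is the shape of it, not production code, and the difficulty is a made-up tuning knob:

  # Sketch of a hashcash-style proof-of-work check: the client must
  # find a nonce so that sha256(challenge + nonce) starts with
  # DIFFICULTY zero bits. Cheap to verify, costly to mass-produce.
  import hashlib, os, itertools

  DIFFICULTY = 16  # leading zero bits; an assumed tuning knob

  def leading_zero_bits(digest: bytes) -> int:
      bits = bin(int.from_bytes(digest, "big"))[2:].zfill(len(digest) * 8)
      return len(bits) - len(bits.lstrip("0"))

  def verify(challenge: bytes, nonce: int) -> bool:
      h = hashlib.sha256(challenge + str(nonce).encode()).digest()
      return leading_zero_bits(h) >= DIFFICULTY

  # Client side (normally JavaScript in the browser): brute-force
  # a nonce that satisfies the challenge.
  challenge = os.urandom(16)
  nonce = next(n for n in itertools.count() if verify(challenge, n))
  assert verify(challenge, nonce)

For the whitelist, matching on User-Agent alone isn't enough since it's trivially spoofed; the usual approach is reverse-then-forward DNS verification of the claimed bot.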



