Block AI Scrapers With a Caddy Plugin
Recently Xe Iaso had some issues with AI scrapers targeting their Gitea instance, despite the scrapers having been nicely asked not to via `robots.txt`.
Amazon's AI crawler is making my git server unstable
Within the same day, not yet having put two and two together, I messaged betamike to let him know that his Gitea instance was unreachable, and he responded that he was experiencing the same issue as Xe. I decided to check the logs of my own cgit instance, and lo and behold, there was AmazonBot, over and over.
Xe's solution to this problem was to introduce a new service, Anubis, which acts as a reverse proxy serving up proof-of-work (PoW) challenges. When a client arrives at the site for the first time (be it a bot or a real web browser) it is served a simple webpage with some javascript on it. This is the challenge. That javascript churns through random data, hashing it, until it finds some data whose hash meets certain criteria. The client then presents this data to the server as a solution to the challenge, and the server allows the client through to view the real website.
Finding this data takes a non-trivial amount of effort for the client. The premise is that, for a normal person, waiting an extra second for the page to load isn't much of a hassle, but for a bot trying to crawl the whole site it's a major pain. In order to not get hit with the PoW challenge on every request the bot must retain the challenge and its solution across requests, which requires extra implementation effort. It also requires that the bot actually run javascript in the first place, which is a good low-pass filter on its own.
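To make that a bit more concrete, here's a minimal sketch of what the solving loop looks like. I've written it in Go for readability, though in practice it's javascript running in the browser; the specific scheme (SHA-256 with a leading-zero-bits difficulty target) is a common choice for this kind of thing, not necessarily exactly what Anubis or my plugin do.

```go
// A minimal sketch of the client's side of a hash-based PoW challenge.
// The hash function, encoding, and difficulty check are assumptions for
// illustration; real implementations vary.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts how many leading zero bits a hash has.
func leadingZeroBits(h [32]byte) int {
	n := 0
	for _, b := range h {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

// solve brute-forces a nonce such that sha256(challenge || nonce) has at
// least `difficulty` leading zero bits. Expected work doubles with each
// added bit of difficulty.
func solve(challenge []byte, difficulty int) uint64 {
	buf := make([]byte, len(challenge)+8)
	copy(buf, challenge)
	for nonce := uint64(0); ; nonce++ {
		binary.BigEndian.PutUint64(buf[len(challenge):], nonce)
		if leadingZeroBits(sha256.Sum256(buf)) >= difficulty {
			return nonce
		}
	}
}

func main() {
	nonce := solve([]byte("challenge-seed-from-server"), 16)
	fmt.Println("solution nonce:", nonce)
}
```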
More than all that, the PoW challenge and solution can act as a kind of session identifier. Even if a bot does run javascript, and does maintain state between its requests, its PoW "session" can be used to enforce further rate limits in a way that rotating IPs or faking user agents can't circumvent. Overall, PoW is a good solution to this problem.
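As a rough sketch of that idea: once the server has checked a solution, the solution itself can key a per-session request counter. Everything below (the names, the limit, the fixed window) is hypothetical, just to show the shape of it:

```go
// A sketch of rate limiting keyed by a solved PoW challenge. Hypothetical,
// not the actual behavior or API of Anubis or my plugin.
package main

import (
	"fmt"
	"sync"
	"time"
)

// sessionLimiter tracks request counts per solved-challenge "session".
type sessionLimiter struct {
	mu     sync.Mutex
	counts map[string]int // keyed by the PoW solution
	limit  int
}

func newSessionLimiter(limit int, window time.Duration) *sessionLimiter {
	l := &sessionLimiter{counts: map[string]int{}, limit: limit}
	go func() { // reset all counters at the start of each window
		for range time.Tick(window) {
			l.mu.Lock()
			l.counts = map[string]int{}
			l.mu.Unlock()
		}
	}()
	return l
}

// allow reports whether the session identified by this PoW solution may
// make another request within the current window.
func (l *sessionLimiter) allow(solution string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.counts[solution]++
	return l.counts[solution] <= l.limit
}

func main() {
	l := newSessionLimiter(100, time.Minute)
	fmt.Println(l.allow("some-verified-solution"))
}
```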
Proof-of-Work for Caddy
Xe's Anubis service does seem to work great, and they have put a lot of time into polishing it up. But for my own services I'd like to avoid configuring and running yet another reverse proxy. Instead I developed a `proof_of_work` plugin for Caddy, my preferred webserver, and added it to my mediocre-caddy-plugins project.
mediocre-caddy-plugins > proof_of_work
This was pretty easy for me to implement, as I'd already written a PoW checker for the old version of my website, mediocre-blog; I only needed to exfiltrate it, clean it up, and send it back out into the world. The implementation is much simpler than Anubis, but it allows enough customization to be useful. I've already set it up on my cgit instance - if you click the link to the plugin above you may notice it - and can see in the logs that the AmazonBot has been stumped.
If you're curious about the PoW algorithm being used, you can find it here:
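In the meantime, the cheap half of any hash-based PoW scheme looks roughly like this: the server recomputes a single hash over the challenge and the client's nonce and compares it against the difficulty target. This reuses leadingZeroBits and the imports from the solver sketch above, and like that sketch it's illustrative rather than the plugin's exact algorithm:

```go
// verify is the server's cheap counterpart to solve: one hash computation,
// versus the many the client had to grind through.
func verify(challenge []byte, nonce uint64, difficulty int) bool {
	buf := make([]byte, len(challenge)+8)
	copy(buf, challenge)
	binary.BigEndian.PutUint64(buf[len(challenge):], nonce)
	return leadingZeroBits(sha256.Sum256(buf)) >= difficulty
}
```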
One thing this plugin doesn't yet do is increase the challenge difficulty in response to load. What I'd like is to be able to configure a target number of solved challenges per minute; if more challenges than that are being solved, the difficulty gets raised each minute until the rate falls back to the target. At present the difficulty is only adjustable manually, which suffices for now.
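Sketched out, that feedback loop might look something like the following. None of this exists in the plugin yet, the names are made up, and the "ease back off" branch is my own addition beyond what the paragraph above strictly calls for:

```go
// A hypothetical sketch of load-adaptive PoW difficulty.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const minDifficulty = 8 // arbitrary floor, for illustration

type adaptiveDifficulty struct {
	target     int64 // desired solved challenges per minute
	solved     int64 // challenges solved in the current minute
	difficulty int64 // current difficulty, e.g. required leading zero bits
}

// recordSolve is called by the challenge handler whenever a solution passes.
func (a *adaptiveDifficulty) recordSolve() { atomic.AddInt64(&a.solved, 1) }

// run adjusts the difficulty once per minute based on the solve rate.
func (a *adaptiveDifficulty) run() {
	for range time.Tick(time.Minute) {
		solved := atomic.SwapInt64(&a.solved, 0)
		switch {
		case solved > a.target:
			atomic.AddInt64(&a.difficulty, 1) // over target: make challenges harder
		case solved < a.target && atomic.LoadInt64(&a.difficulty) > minDifficulty:
			atomic.AddInt64(&a.difficulty, -1) // quiet period: ease back off
		}
	}
}

func main() {
	a := &adaptiveDifficulty{target: 60, difficulty: 16}
	go a.run()
	a.recordSolve()
	fmt.Println("current difficulty:", atomic.LoadInt64(&a.difficulty))
}
```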
If you're using Caddy for your own purposes and are interested in trying out the plugin, there is documentation on how to use it in the mediocre-caddy-plugins repo linked above. Please try it out and let me know what you think!