Digital Tidying: Finding and Fixing Dead Links
On occasion I'll go back and check on an old post on this site, either to link it to someone or to reference it myself. As time has gone on I've found, more and more frequently, that older posts contain broken links. Either something the post used to reference has moved, or, more often, it's just gone for good.
I've had a thought floating around in my head that this site could move away from being just a series of chronological posts to something more like a wiki, where each page is maintained and evolved over time. Maybe I'll dive more into this idea in the future, if I ever care to follow through with it.
But one major hindrance to that idea would be broken links. It's one thing for an old post in a chronological series of posts to be out of date; the reader implicitly understands that such a post is a kind of snapshot, like an old newspaper article. But for a wiki-style website it would be a much bigger problem.
And really, even for a chronological series like mine, it's just messy to have broken links sitting around.
So as a little side project over this Christmas I've developed a tool, creatively named:
DeadLinks: A tool for crawling and finding links to URLs which no longer exist
Usage information is in the README, but the summary is that it will crawl a gemini and/or HTTP(S) website, traverse all links, and output a summary of all the dead ones at the end. With this tool in hand I'll be going through my older posts and fixing a bunch of busted links.
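To give a rough idea of the crawl-traverse-report loop, here's a minimal, standard-library-only Go sketch of that general pattern. It's not the actual deadlinks code, just an illustration of the idea; the regex-based link extraction, the HTTP-only handling, and the output format are all simplifications I'm assuming here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"regexp"
)

// hrefRe is a crude way to pull links out of HTML; fine for a sketch,
// not for a real crawler.
var hrefRe = regexp.MustCompile(`href="([^"]+)"`)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: crawler <start-url>")
		os.Exit(1)
	}
	start, err := url.Parse(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad start URL:", err)
		os.Exit(1)
	}

	visited := map[string]bool{}
	queue := []*url.URL{start}
	var dead []string

	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		if visited[u.String()] {
			continue
		}
		visited[u.String()] = true

		resp, err := http.Get(u.String())
		if err != nil {
			dead = append(dead, u.String())
			continue
		}
		if resp.StatusCode >= 400 {
			resp.Body.Close()
			dead = append(dead, u.String())
			continue
		}

		// External links get checked for liveness above, but only pages
		// on the starting host are crawled further.
		if u.Host != start.Host {
			resp.Body.Close()
			continue
		}

		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		for _, m := range hrefRe.FindAllStringSubmatch(string(body), -1) {
			next, err := u.Parse(m[1])
			if err != nil || (next.Scheme != "http" && next.Scheme != "https") {
				continue
			}
			next.Fragment = "" // treat page#a and page#b as the same page
			queue = append(queue, next)
		}
	}

	fmt.Println("dead links:")
	for _, d := range dead {
		fmt.Println("  " + d)
	}
}
```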
The CLI tool is actually just a thin wrapper around the `deadlinks` library, which I designed with the intention that I could someday run this crawler in a more automated fashion, hooked up to some kind of personal status page. Combined with a tool like Webmon, which I already use for downtime notifications, that would let me stay abreast of any dead links popping up in the future.
Webmon: Monitor web services and get notified if a service becomes unavailable.
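To sketch what that automated mode could look like: a long-running process that does a crawl on a schedule and writes the result somewhere a status page (or a monitor) can read it. The `crawl` function, the file name, and the URL below are hypothetical placeholders, not the actual `deadlinks` library API.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"
)

// report is what a personal status page could read to decide whether
// anything needs attention.
type report struct {
	CheckedAt time.Time `json:"checked_at"`
	DeadLinks []string  `json:"dead_links"`
}

// crawl is a stand-in for calling into the deadlinks library; its real API
// is likely different, so treat this signature as hypothetical.
func crawl(root string) []string {
	return nil
}

func main() {
	ticker := time.NewTicker(24 * time.Hour) // one pass a day is plenty for a personal site
	defer ticker.Stop()

	for {
		rep := report{
			CheckedAt: time.Now(),
			DeadLinks: crawl("https://example.com"), // hypothetical root URL
		}

		f, err := os.Create("deadlinks-status.json")
		if err != nil {
			log.Println("writing status file:", err)
		} else {
			json.NewEncoder(f).Encode(rep)
			f.Close()
		}

		<-ticker.C
	}
}
```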
There's still some polish needed before `deadlinks` is ready for that, however. There's some kind of ongoing issue with request timeouts that I need to track down; I think it may be related to having multiple requests in flight at once. I've also found that some sites I link to block the crawler, probably based on its user agent, so I'll need to make that more easily configurable. In other cases I might want to stop the crawler from visiting certain URLs at all.
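For concreteness, the kind of configurability I mean might look something like this when expressed with Go's standard `net/http` client: a per-request timeout, a settable User-Agent header, and a list of URL prefixes to skip entirely. None of this is the actual structure of `deadlinks`; the names and values are just illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// crawlerConfig sketches the knobs described above. It is not the deadlinks
// API, just an illustration built on net/http.
type crawlerConfig struct {
	UserAgent string        // some hosts block requests based on this header
	Timeout   time.Duration // bound on each request so one slow host can't stall a crawl
	SkipURLs  []string      // URL prefixes the crawler should never visit
}

func (c crawlerConfig) shouldSkip(u string) bool {
	for _, prefix := range c.SkipURLs {
		if strings.HasPrefix(u, prefix) {
			return true
		}
	}
	return false
}

// check reports whether a URL resolves to a non-error response.
func (c crawlerConfig) check(u string) (bool, error) {
	client := &http.Client{Timeout: c.Timeout}

	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("User-Agent", c.UserAgent)

	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	return resp.StatusCode < 400, nil
}

func main() {
	cfg := crawlerConfig{
		UserAgent: "deadlinks/0.1 (+https://example.com/contact)", // hypothetical UA string
		Timeout:   10 * time.Second,
		SkipURLs:  []string{"https://example.com/private/"}, // hypothetical exclusion
	}

	u := "https://example.com/"
	if cfg.shouldSkip(u) {
		fmt.Println("skipping", u)
		return
	}
	alive, err := cfg.check(u)
	fmt.Println(u, "alive:", alive, "err:", err)
}
```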
But it's still useful as-is, if you're ok with sorting through the false positives. Hopefully someone else finds it useful too!
-----
Published 2023-12-30