Checking for Bad Links
A frequent problem with webpage maintenance is that, through no fault of your own, links will break. Often these don’t get fixed unless a user is kind enough to notify the owner of the site.
However, we don’t have to rely on the kindness of our readers to address this, and we can even do so without having to do something totally boring like clicking the links ourselves to see if they work. We can automate this work away with a tool like Puppeteer and some sweet bash commands.
The first step is to have Puppeteer grab all the links from the page you are looking at. If you need a refresher on using Puppeteer, I wrote a blog post a few years back on how to do some basic scraping.
Once you have an array with this list of links, you should write it to a file. At this point you could probably also have Puppeteer check all those URLs, but in my experience, a tool like curl will be way faster. So we can write a bash script to manage all of that now.
#!/bin/bash
# You should include some stuff to run your Puppeteer script
echo "Checking HTTP status codes of links"
echo "status_code, url" > status_codes.csv
# Only run if the links file written by the Puppeteer script exists
if [[ -e ./links ]] ; then
  # Read the file one line at a time; IFS= and -r keep each line intact
  while IFS= read -r i; do
    URL="$i"
    echo "Getting status code for link $URL"
    # Fetch only the headers and pull the status code out of the first line
    STATUS_CODE=$(curl -sI "$URL" | awk 'FNR==1 { print $2 }')
    echo "$STATUS_CODE, $URL" >> status_codes.csv
  done < links
fi
echo "CSV file generated. You can find broken links with the following command:"
echo "grep '^404' status_codes.csv"
The while loop runs through each line of your file with the list of links, and assigns it to the variable URL. Note that I’m making the assumption here that your file has just one URL per line, with nothing else like quotation marks, commas, or square brackets that you might have carried over from a JavaScript array.
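If your Puppeteer script did dump a JavaScript array straight to disk, a small cleanup pass can get it into that one-URL-per-line shape. This is only a rough sketch: the clean_links function name and the sample URLs are made up for illustration, and it assumes none of your URLs contain commas, double quotes, or square brackets themselves.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): turn a JSON-style
# array such as ["https://a", "https://b"] into one URL per line.
# Assumes no URL contains a comma, a double quote, or a square bracket.
clean_links() {
  # Put each array element on its own line, then strip brackets,
  # quotes, and whitespace, and drop any lines left empty.
  tr ',' '\n' | sed -e 's/[]["[:space:]]//g' -e '/^$/d'
}

# Made-up sample input standing in for whatever your scraper wrote out
printf '["https://example.com/a", "https://example.com/b"]\n' | clean_links
```

In practice you would run it as something like clean_links < links.json > links, where links.json is whatever file name your scraper used.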
We assign the STATUS_CODE variable the output of a curl request piped to an awk one-liner. The -s curl flag has it run silently so it doesn’t output progress information we’re not interested in, and the -I flag sends a HEAD request so that only the HTTP headers come back, which have all the information we’re interested in.
Awk is a bit too complicated to go into in detail here. The big-picture idea, though, is that sed is used to edit lines of text, and awk is used to work with fields of data. Of course that’s a huge oversimplification, but it will work for our purposes. Since the first line of our curl response is a set of space-separated fields (something like HTTP/1.1 200 OK), we can use awk to manage it. FNR==1 tells awk to evaluate only the first row of our input, and print $2 tells awk to print the 2nd field, which is where the status code we are interested in is located.
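To see the awk one-liner on its own, without hitting the network, you can feed it a canned response. The headers below are a made-up sample of what curl -sI might print; a real response will have more lines.

```shell
#!/bin/bash
# Made-up sample of `curl -sI` output (real responses carry more headers)
sample_headers=$'HTTP/1.1 404 Not Found\r\nContent-Type: text/html\r\n'

# FNR==1 restricts awk to the first line (the status line);
# $2 is its second whitespace-separated field, the status code.
status_code=$(printf '%s' "$sample_headers" | awk 'FNR==1 { print $2 }')

echo "$status_code"   # prints 404
```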
Finally, you can find all the broken links you need with grep '^404' status_codes.csv. As a reminder, the ^ in a regular expression anchors the pattern to the beginning of the line.
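A 404 is not the only way a link can fail: a server error returns a 5xx code, and a host that can’t be reached leaves the status code empty. As a sketch, assuming the same "status_code, url" layout as status_codes.csv, the filter below lists every row that isn’t a 2xx or 3xx. The find_bad_links name and the sample rows are made up for illustration, and treating redirects as healthy is my assumption, so adjust to taste.

```shell
#!/bin/bash
# Made-up sample data in the same "status_code, url" layout as status_codes.csv
sample_csv='status_code, url
200, https://example.com/ok
404, https://example.com/missing
500, https://example.com/error
, https://example.invalid/unreachable'

# Hypothetical helper: skip the header row (NR > 1) and print every row
# whose first field is not a 2xx or 3xx code, including rows where the
# status code is empty because curl got no response.
find_bad_links() {
  awk -F', ' 'NR > 1 && $1 !~ /^[23][0-9][0-9]$/'
}

printf '%s\n' "$sample_csv" | find_bad_links
```

Against the real file you would run find_bad_links < status_codes.csv.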
You could even automate further with some tools to rewrite your code with the broken links, but that depends on the specifics of your codebase. Even the bare minimum here is a powerful way to quickly maintain your website.