Scraping with Puppeteer
Puppeteer is a Node library that provides an API for controlling a headless Chrome browser. A headless browser is a browser with no graphical user interface, which makes it useful for testing and other automated tasks, including web scraping. I found its web scraping tools to be very powerful, even if the documentation for them is limited at the moment. It came along at a good time, as I had already been wondering how to scrape data off of the Dota 2 pro circuit page.
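After installing it via npm, you can write a script that looks something like this:
const puppeteer = require('puppeteer');
(async () => {
  // Launch a headless Chrome instance and open a blank page (tab).
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    // Load the page we want to scrape.
    await page.goto('http://www.dota2.com/procircuit');
  } catch (e) {
    console.log(e);
  }
  // Close the browser so the Chrome process is not left running.
  await browser.close();
})();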
Setting the JavaScript aside, this should seem intuitive to anyone who has used a web browser. The script launches the browser, opens a page, and then loads a URL in that page. Once you have performed whatever operations you need, you finish the script by closing the browser. The async/await keywords available in current versions of Node keep the code easy to read. Nearly every script that uses Puppeteer is going to start off with this introductory bit of code.
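One possible refinement, offered as a sketch rather than anything taken from the Puppeteer documentation: wrap the launch and close steps in a small helper so the browser gets closed even if the work in between throws. The name `withBrowser` and its `launch` parameter are hypothetical, made up for this example; in a real script you would pass `() => puppeteer.launch()`.

```javascript
// Hypothetical helper: `launch` is any function that resolves to an object
// with newPage() and close(), for example () => puppeteer.launch().
async function withBrowser(launch, work) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    // Hand the page to the caller and pass its result back out.
    return await work(page);
  } finally {
    // Runs whether `work` succeeded or threw, so the browser is always closed.
    await browser.close();
  }
}
```

With Puppeteer installed, this might be used as `withBrowser(() => puppeteer.launch(), (page) => page.goto('http://www.dota2.com/procircuit'))`.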
Now that we have our headless browser up and running, it is time to collect some data from our website. The data I’m interested in on the Dota pro circuit page is in a table, and I’m only interested in the columns for the player names and the points they have accrued. Luckily this page is semantically well-organized, so when I inspect the elements I find the span.columnContent.playerNameColumn and span.columnContent.pointsColumn classes. Selecting those classes will produce the exact data I’m looking for.
const nameResult = await page.evaluate(() => {
  // This function runs inside the page, so `document` is the page's DOM.
  let nameArr = [];
  let playerColumn = document.querySelectorAll('span.columnContent.playerNameColumn');
  for (let i = 0; i < playerColumn.length; i++) {
    nameArr.push(playerColumn.item(i).textContent);
  }
  return nameArr;
});
As best I can tell, page.evaluate is the way to pull data out of the DOM using Puppeteer (if I’m wrong about this, please email me, I’d love to know more!). It runs the function you give it inside the page and returns whatever serializable value that function returns. The rest of it looks like a typical operation where a for loop pushes values into an empty array. However, I do want to call attention to one particular line that caused me an enormous amount of difficulty until I figured out a subtle but important distinction:
nameArr.push(playerColumn.item(i).textContent);
If the playerColumn variable were a normal array, we would more likely write it like this:
nameArr.push(playerColumn[i].textContent);
But playerColumn is not an array. When you perform a document.querySelectorAll call you retrieve an object called a Node List. This is an object that is similar to an array, but not identical to one. They share some features, like the .length property and bracket indexing, but the NodeList departs from the array in many other respects; for example, it has no map, filter, or push methods. You can read more about Node Lists here. For our purpose, the important thing to keep in mind is that a Node List is not an array, and that .item(i) is its own dedicated way of reading the item at a given index.
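If the difference between arrays and Node Lists gets in the way, one option is to copy the list into a real array first. This is only a minimal sketch; `toTextArray` is a hypothetical name, and it assumes every item exposes `textContent` the way DOM elements do.

```javascript
// Hypothetical helper: accepts a NodeList (or anything array-like) and
// returns a plain array holding each item's text.
function toTextArray(nodeList) {
  // Array.from copies array-like or iterable objects into a true array;
  // its second argument maps each item as it is copied.
  return Array.from(nodeList, (node) => node.textContent);
}
```

Because the function passed to `page.evaluate` runs inside the page, a helper defined in the Node script is not visible there, so in practice the call would be written inline: `return Array.from(document.querySelectorAll('span.columnContent.playerNameColumn'), (node) => node.textContent);`.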
Having created a method for collecting the list of player names, we can now repeat that process for selecting the player points.
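As a sketch of that repetition, the loop can be pulled into one function that takes the selector as a parameter. The name `collectColumnText` and the optional `root` argument are my own additions (the latter exists only so the function can be tried without a browser). `page.evaluate` accepts a function followed by serializable arguments, so the selector can be passed in from the Node side.

```javascript
// Hypothetical helper meant to run in the page context. It refers to no
// outside variables, so it can be handed to page.evaluate as-is.
function collectColumnText(selector, root = document) {
  const values = [];
  const column = root.querySelectorAll(selector);
  for (let i = 0; i < column.length; i++) {
    // item(i) returns the node at position i in the NodeList.
    values.push(column.item(i).textContent);
  }
  return values;
}
```

In the script it might be called as `const pointsResult = await page.evaluate(collectColumnText, 'span.columnContent.pointsColumn');`, and again with `'span.columnContent.playerNameColumn'` for the names.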
Web scraping is a fantastic tool to have at one’s disposal. It provides a way to collect data even when an API or JSON endpoint is unavailable. Of course, it is important to use web scraping responsibly and not perform any scraping operation that would infringe on another party’s copyrights.