An Experiment Using Machine Learning, Part 3
In the last blog post in this series I reviewed my research findings on other machine learning experiments regarding Shakespeare. After those findings I started to set up a repository of texts, and wrote some Python scripts to extract each text into a list of words, saved as a CSV file that included word frequency. This was harder than I expected because there are a lot of awkward edge cases to handle. I also had to start thinking about doing the same thing with sentences. After getting somewhat lost in the weeds trying to handle this, I finally did some research into existing Python libraries, and found my way to the Natural Language Toolkit (NLTK). It’s always nice when someone has already solved tricky problems for you and made their tools available.
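As a rough illustration of the kind of extraction script described above, here is a minimal sketch using only the standard library. The function names, the sample sentence, and the deliberately naive regex tokenizer are assumptions made for this example; they are not the scripts from the repository, and the regex ignores many of the edge cases mentioned above (curly apostrophes, hyphenation, and so on).

```python
import csv
import io
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens using a deliberately naive tokenizer."""
    # Matches runs of letters, optionally joined by straight apostrophes (e.g. o'er).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


def write_frequency_csv(counts, handle):
    """Write word,frequency rows to an open file-like object, most frequent first."""
    writer = csv.writer(handle)
    writer.writerow(["word", "frequency"])
    for word, freq in counts.most_common():
        writer.writerow([word, freq])


# Hypothetical sample text; a real run would read a whole play or essay from disk.
sample = "To be, or not to be: that is the question."
counts = word_frequencies(sample)

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_frequency_csv(counts, buffer)
print(buffer.getvalue())
```

Swapping the `StringIO` buffer for `open("some_text.csv", "w", newline="")` would write the same rows to disk.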
I am still learning the API, but I conducted a quick experiment using NLTK’s FreqDist class, which can quickly generate tables of word frequencies and distributions. I decided to look at hapaxes. A hapax is a word that occurs only once within a given body of text. In one of the experiments I looked at in the last post, the authors examined the configuration of glue words to generate an authorial fingerprint. An analysis of hapaxes is essentially the opposite approach. Whereas glue words are words that everyone has to use, hapaxes represent the outermost edges of a person’s vocabulary. Does a collection of hapaxes represent a meaningful authorial fingerprint? That remains to be seen, and it is something I plan to interrogate further. For now though, I have the following dataset:
Shared hapaxes: 1210
Shakespeare hapax list length: 13073
Bacon hapax list length: 12782
Total hapax list length: 25855
It was rather nice that the hapax list lengths for the Shakespeare works and the Bacon works were roughly comparable, differing by only 291 words! Out of a combined total of 25855 hapaxes, only 1210 were shared. In other words, for roughly every 21 hapaxes examined, only one was shared. The unique vocabulary shared between Bacon’s and Shakespeare’s written output is therefore less than 5% of the combined list.
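The comparison above can be sketched roughly as follows. This is a minimal example under stated assumptions, not the script that produced the numbers: the helper names, the regex tokenizer, and the two toy strings are all hypothetical. It uses NLTK’s `FreqDist.hapaxes()` when NLTK is installed and falls back to an equivalent `collections.Counter` filter otherwise. The ratio divides the shared count by the sum of the two list lengths, which matches how the total of 25855 was formed.

```python
import re
from collections import Counter

try:
    from nltk import FreqDist  # Counter subclass that provides .hapaxes()
except ImportError:  # fall back to the standard library if NLTK is missing
    FreqDist = None


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def hapaxes(tokens):
    """Return the set of words that occur exactly once in tokens."""
    if FreqDist is not None:
        return set(FreqDist(tokens).hapaxes())
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n == 1}


def shared_hapax_ratio(text_a, text_b):
    """Return (shared count, combined list length, shared / combined)."""
    a = hapaxes(tokenize(text_a))
    b = hapaxes(tokenize(text_b))
    shared = a & b
    total = len(a) + len(b)  # sum of both list lengths, as in the dataset above
    return len(shared), total, (len(shared) / total if total else 0.0)


# Toy strings standing in for the two corpora.
text_a = "the quick brown fox jumps over the lazy dog"
text_b = "a lazy cat naps while a quick storm passes"
shared, total, ratio = shared_hapax_ratio(text_a, text_b)
print(shared, total, round(ratio, 3))

# The same arithmetic applied to the figures reported in the post.
post_ratio = 1210 / 25855
print(f"{post_ratio:.2%}")
```

Because the combined total counts each shared word once per author, dividing by it gives a lower figure than dividing by either author’s list alone would.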
There are some reality checks we will want to consider immediately. For example, what’s the shared hapax ratio for works belonging to a single author? Is this ratio likely to change when an author writes in different genres? The answers to these questions might demonstrate that hapaxes are a poor measure to be using.
So far I haven’t performed any machine learning; this is simply statistical analysis. However, with the NLTK I now have the means to start gathering and analyzing data quickly, so I should be able to set up a proper experiment soon.