An Experiment Using Machine Learning, Part 2

My initial research into studies involving Shakespeare and machine learning didn’t turn up a ton of results, but luckily the results I did find all provided useful frameworks for thinking about my problem. In my last blog post I said it would be great if there were existing studies looking into the Bacon hypothesis. I did not find any. This is not surprising. The Bacon hypothesis was quite popular in the 19th century, but it has since lost steam, and almost no Shakespearean scholars subscribe to it. There is a similar, more recent hypothesis that the author of Shakespeare’s plays was Edward de Vere. This hypothesis has not had the popular success the Bacon hypothesis once had, which is not surprising, since I can’t imagine most people have even heard of Edward de Vere. I certainly hadn’t until I came across the theory. At any rate, I was unable to find any prior research into this question. That is bad news for making this experiment easier to conduct, but good news for anyone who would prefer that people not be persuaded by theories that don’t stand up to the scrutiny of Occam’s Razor.

The first piece of research I came across was not machine learning related, but extremely relevant to the project: a blog post about text mining the complete works of Shakespeare. If we want to avoid the problem of garbage in, garbage out, we need to clean up the text to make sure it only contains characters that are relevant data points. This post gives a good idea of how to do that.
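
To make the cleanup step concrete, here is a minimal sketch in Python. It is not the procedure from that blog post. It assumes the source is plain text with stage directions in square brackets, and the function name clean_text is made up for illustration.

```python
import re

def clean_text(raw: str) -> list[str]:
    """Drop bracketed stage directions, lowercase the text, and return word tokens."""
    # Assumption: stage directions appear in square brackets, e.g. "[Exit]".
    without_directions = re.sub(r"\[[^\]]*\]", " ", raw)
    # Normalize curly apostrophes so contractions tokenize consistently.
    normalized = without_directions.replace("\u2019", "'").lower()
    # Keep runs of letters, allowing internal apostrophes such as "o'er".
    # Leading apostrophes (as in "'tis") are dropped by this pattern.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", normalized)

sample = "HAMLET. To be, or not to be, that is the question: [Aside] 'Tis nobler..."
print(clean_text(sample))
```

A real edition of the plays would likely need more rules than this (speaker labels, act and scene headings, licensing text), and which of those count as noise depends on the question being asked.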

The next piece of research I came across was extremely interesting. An article summarizing the research described an analysis of function words like “and”, “or”, “the”, “to”, etc. that are used to form a word adjacency network, which serves as an author’s fingerprint. The theory is that everyone has to use these words to construct sentences, so the differences in how authors deploy them yield a more “objective measure of ‘style’”. The findings suggest that Christopher Marlowe may have written certain parts of Henry VI. Unfortunately, I’ve only been able to find an article summarizing the research and haven’t yet found the research itself, but this is definitely something I’ll be looking into.
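
The following is only a simplified sketch of the general idea of a word adjacency network, not the researchers' actual method. It assumes a tiny, hand-picked list of function words and a fixed window size, both chosen purely for illustration. For each function word, it counts which other function words follow it within the window, then normalizes the counts so each word's outgoing weights sum to 1.

```python
from collections import defaultdict

# Assumption: a tiny illustrative set of function words; a real study would use a longer list.
FUNCTION_WORDS = {"and", "or", "the", "to", "of", "a", "in", "that"}

def adjacency_network(tokens, window=5):
    """Build a normalized table of which function words follow which.

    For each function word, count the function words that appear among the
    next `window` tokens, then scale each row so its weights sum to 1.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        if word not in FUNCTION_WORDS:
            continue
        for follower in tokens[i + 1 : i + 1 + window]:
            if follower in FUNCTION_WORDS:
                counts[word][follower] += 1.0
    network = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        network[word] = {f: c / total for f, c in followers.items()}
    return network

tokens = "to be or not to be that is the question".split()
for word, followers in adjacency_network(tokens).items():
    print(word, followers)
```

Two texts could then be compared by measuring how different their networks are, though which distance measure is appropriate is a separate design decision that this sketch does not address.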

The other piece of research I came across is a university paper that used machine learning to guess the gender of Shakespeare’s characters with “reasonably good classification accuracy”. There are some assumptions baked into this study that I find questionable, but I’m not here to conduct a gender analysis of speech patterns, so this is less important to me. Of greater value is the author’s clarity in laying out how the experiment was conducted, and the analysis of the two algorithms used to tackle the problem: a naive Bayes model and a support vector machine model. Different algorithms do a better job at solving different types of problems. My first experiment will use a k-nearest neighbors algorithm to analyze different sets of textual data. I decided to start with this algorithm simply because it is relatively straightforward to implement. As I get a better sense of the data I am working with, I will start to consider what other algorithms I can use to better analyze the problem.
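
To show why k-nearest neighbors is straightforward to implement, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the code for this experiment. The feature words, the placeholder labels author_a and author_b, and the training snippets are all made up. Each text is reduced to the relative frequencies of a few words, and a new text takes the majority label of its k closest training texts.

```python
import math
from collections import Counter

# Assumption: a small hand-picked feature vocabulary, purely for illustration.
FEATURE_WORDS = ["and", "or", "the", "to", "of"]

def features(tokens):
    """Relative frequency of each feature word in a list of tokens."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FEATURE_WORDS]

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up placeholder snippets, not real quotations from any author.
train = [
    (features("the king and the queen of the realm".split()), "author_a"),
    (features("the lord of the house and the land".split()), "author_a"),
    (features("to learn or to teach or to read".split()), "author_b"),
    (features("to rise or to fall or to rest".split()), "author_b"),
]
query = features("to live or to die".split())
print(knn_predict(train, query, k=3))
```

With real data, the choice of k, the feature set, and the distance measure would all need to be tuned, and the texts would need to be much longer than these toy snippets for the frequencies to mean anything.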