not another bloody wordle?!?!

August 20, 2013

(UPDATE: an earlier version of this was totally wrong. It’s better now.)

Inspired by a Facebook post from a colleague, I decided to waste ten minutes this week knocking together a word cloud from the text of my thesis. The process was pretty straightforward.

First up - extracting the text from the thesis. Like all good scienticians, my thesis was written in LaTeX. I thought I could have used a couple of different tools to extract the plain text from the raw .tex input files, but actually none of the tools available from a quick googling seemed to work properly, so I went with extracting the text from the pdf file instead. Fortunately on Mac OS X this is pretty simple, as you can create a straightforward Automator application to extract the text from any pdf file, as documented in step 2 here.

Once I had the plain text contents of my thesis in a text file it was just a simple few lines of python (using the excellent NLTK) to get a frequency distribution of the words in my thesis:

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize, sent_tokenize

fdist = FreqDist()
with open("2012chorleymjphd.txt", "r") as inputfile:
    for sentence in sent_tokenize(inputfile.read()):
        for word in word_tokenize(sentence):
            fdist.inc(word.lower())

    for word, count in fdist.iteritems():
        if count > 10:
            print "%s: %d" % (word, count)

Then it was just a matter of copying and pasting the word frequency distribution into wordle:

And there we have it. A not particularly informative but quite nice looking representation of my thesis. As you can guess from the cloud, it’s not the most exciting thesis in the world. Interestingly, the word error doesn’t seem to be there 😉.

Next: Post-Processed Dinosaurs

Previous: SWN Festival 2013 plans – part 1: the data (2!)