Scraping the Assembly…

M’colleague is currently teaching a first-semester module on Data Journalism to the students on our MSc in Computational and Data Journalism. As part of this, they need to do some sort of data project. One of the students is looking at the expenses of Welsh Assembly Members. These are all freely available online, but not in an easy to manipulate form. According to the Assembly they’d be happy to give the data out as a spreadsheet, if we submitted an FOI.

To me, this seems quite stupid. The information is all online and freely accessible. You’ve admitted you’re willing to give it out to anyone who submits an FOI. So why not just make the raw data available to download? This does not sound like a helpful Open Government to me. Anyway, for whatever reason, they’ve chosen not to, and we can’t be bothered to wait around for an FOI to come back. It’s much quicker and easier to build a scraper! We’ll just use Selenium to drive a web browser, submit a search, page through all the results collecting the details, then dump it all out to CSV. Simple.

Scraping AM expenses

I built this as a quick hack this morning. It took about an hour or so, and it shows. The code is not robust in any way, but it works. You can ask it for data from any year (or a number of years) and it’ll happily sit there churning its way through the results and spitting them out as both .csv and .json.
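
The final 'spit it out' step might look something like the sketch below. It uses only the standard library, and the records and field names are made up for illustration; they aren't taken from the actual scraper, which may structure things quite differently.

```python
import csv
import json

# Hypothetical records standing in for scraped rows; the real scraper's
# field names may well differ.
claims = [
    {"member": "A. Member", "year": 2014, "category": "Travel", "amount": 12.50},
    {"member": "B. Member", "year": 2014, "category": "Office", "amount": 99.99},
]


def dump_claims(rows, csv_path, json_path):
    """Write the same list of dicts out as both CSV and JSON."""
    # Use the keys of the first record as the CSV header
    fieldnames = list(rows[0].keys()) if rows else []
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(rows, json_file, indent=2)
```

Calling `dump_claims(claims, "expenses.csv", "expenses.json")` would then write both files side by side (the file names are just examples).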

All the code is available on Github and it’s under an MIT Licence. Have fun 😉

Atom Plugins for Web Development

I’ve had a number of students in my web-dev module asking me what plugins I’m using in my text editor, so I thought I’d dash off a quick blog post on the plugins I find useful day-to-day. (Actually, most people are normally asking me ‘how did you do that thing where you typed one word and suddenly you had a whole HTML page?’ The answer is I used a plugin, so ‘what plugins do you use?’ is really the question they should be asking…)

I’m using Atom as my text editor. It’s free, open source, and generally reliable. If you’re a student on my web-dev course you’re stuck using Sublime Text in the lab for now. I’m pretty sure most of the Atom plugins I use have either direct Sublime equivalents, or similarly functioning alternatives.

There’s a guide to Atom packages here and one for Sublime Text here

A quick google for ‘best atom packages web developer’ will probably get you to a far more comprehensive list than this, but here’s my current pick of useful plugins anyway:


Emmet

This is essential for anyone writing any amount of HTML. This is the magic package that allows me to write ‘html:5’ in a blank document, hit the shortcut keys (CTRL + E in my setup), and suddenly have a simple boilerplate HTML page.

emmet auto-completion

It’s ace. Not only that, but it can write loads of HTML for you, and all you have to do is write a CSS selector for that HTML:

HTML CSS selector expansion

Great stuff. The documentation is here.


Atom Beautify

This will tidy up your code automatically, fixing the indentation and spacing etc. It can even be set to automatically tidy your code every time you save a file. Awesome huh? Imagine being set a coursework where some of the marks were dependent on not writing code that looks like it was written by a five-year-old child who’s addicted to hitting the tab key, then finding out that there’s software to strap that five-year-old’s thumbs to his hands so he can’t hit that tab key. Awesome.

Atom Beautify tidies your code


Colour Picker

This one adds a colour picker right into Atom. Just CMD-SHIFT-C and choose your colours!

Colour Picker in atom

Another useful colour related plugin you may want to look at is Pigments, which can highlight colours in your projects, and gather them all together so you can see your palette.


Linter

My last recommendation is linter. This plugin will automatically check your code for errors. You’ll need to install linters for whatever language you want to check, like linter-tidy, linter-csslint, linter-pylint and linter-jshint.

Linter finds errors in your code


So there we go – a few recommendations to get you started. Found anything else interesting? Let me know!

Accessing and Scraping MyFitnessPal Data with Python

Interesting news this morning that MyFitnessPal has been bought by Under Armour for $475 million. I’ve used MFP for many years now, and it was a big help in losing all the excess PhD weight that I’d put on, and then in maintaining a healthy(ish) lifestyle since 2010.

News of an acquisition always has me slightly worried though – not for someone else having access to my data, as I’ve made my peace with the fact that using a free service generally means that it’s me that’s being sold. Giving away my data is the cost of doing business. Rather, it worries me that I may lose access to all the data I’ve collected. I have no idea what Under Armour intend for the service in the long run, and while it’s likely that MFP will continue with business as usual for the foreseeable, it’s always worth having a backup of your data.

A few years ago, I wrote a couple of python scripts to back up data from MFP and then extract the food and exercise info from the raw HTML. These scripts use Python and Beautiful Soup to do a login to MFP, then go back through your diary history and save all the raw HTML pages, essentially scraping your data.

I came to run them this morning and found they needed a couple of changes to deal with site updates. I’ve made the necessary updates and the full code for all the scripts is available on GitHub. It’s not great, but it works. The code is Python 2 and requires BeautifulSoup and Matplotlib (if you want to use

NHS Hackday 2015

This weekend I took part in an incredibly successful NHS hackday, hosted at Cardiff University and organised by Anne Marie Cunningham and James Morgan. We went as a team from the MSc in Computational Journalism, with myself and Glyn attending along with Pooja, Nikita, Annalisa and Charles. At the last minute I recruited a couple of ringers as well, dragging along Rhys Priestland Dr William Wilberforce Webberley from Comsc and Dr Matthew Williams, previously of this parish. Annalisa also brought along Dan Hewitt, so in total we had a large and diverse team.

The hackday

This was the first NHS hackday I’d attended, but I believe it’s the second event held in Cardiff, so Anne Marie and the team have it down to a fine art. The whole weekend seemed to go pretty smoothly (barring a couple of misunderstandings on our part regarding the pitch sessions!). It was certainly one of the most well organised events that I’ve attended, with all the necessary ingredients for successful coding: much power, many wifi and plenty of food, snacks and coffee. Anne Marie and the team deserve much recognition and thanks for their hard work. I’m definitely in for next year.

The quality of the projects created at the hackday was incredibly high across the board, which was great to see. One of my favourites used an Oculus Rift virtual reality headset to create a zombie ‘game’ that could be used to test people’s peripheral vision. Another standout was a system for logging and visualising the ANGEL factors describing a patient’s health situation. It was really pleasing to see these rank highly with the judges too, coming in third and second in the overall rankings. Other great projects brought an old Open Source project back to life, created a system for managing groups walking the Wales Coast path, and created automatic notification systems for healthcare processes. Overall it was a really interesting mix of projects, many of which have clear potential to become useful products within or alongside the NHS. As Matt commented in the pub afterwards, it’s probably the first hackday we’ve been to where several of the projects have clear original IP with commercial potential.

Our project

We had decided before the event that we wanted to build some visualisations of health data across Wales, something like, but working with local health boards and local authorities in Wales. We split into two teams for the implementation: ‘the data team’ who were responsible for sourcing, processing and inputting data, and the ‘interface team’ who built the front-end and the visualisations.

Progress was good, with Matthew and William quickly defining a schema for describing data so that the data team could add multiple data sets and have the front-end automatically pick them up and be able to visualise them. The CompJ students worked to find and extract data, adding them to the github repository with the correct metadata. Meanwhile, I pulled a bunch of D3 code together for some simple visualisations.

By the end of the weekend we established a fairly decent system. It’s able to visualise a few different types of data, at different resolutions, is mostly mobile friendly, and most importantly is easily extensible and adaptable. It’s online now on our github pages, and all the code and documentation is also in the github repository.

We’ll continue development for a while to improve the usability and code quality, and hopefully we’ll find a community willing to take the code base on and keep improving what could be a fairly useful resource for understanding the health of Wales.


We didn’t win any of the prizes, which is understandable. Our project was really focused on the public understanding of the NHS and health, and not on solving a particular need within (or for users of) the NHS. We knew this going into the weekend, and we’d taken the decision that it was more important to work on a project related to the course, so that the students could experience some of the tools and technologies they’ll be using as the course progresses, than to do something more closely aligned with the brief that would perhaps have been less relevant to the students’ work.

I need to thank Will and Matt for coming and helping the team. Without Matt wrangling the data team and showing them how to create json metadata descriptors we probably wouldn’t have anywhere near as many example datasets as we do. Similarly, without Will’s hard work on the front end interface, the project wouldn’t look nearly as good as it does, or have anywhere near the functionality. His last-minute addition of localstorage for personal datasets was a triumph. (Sadly though he does lose some coder points for user agent sniffing to decide whether to show a mobile interface :-D.) They were both a massive help, and we couldn’t have done it without them.

Also, of course, I need to congratulate the CompJ students, who gave up their weekend to trawl through datasets, pull figures off websites and out of PDFs, and create the lovely easy-to-process .csv files we needed. It was a great effort from them, and I’m looking forward to our next Team CompJ hackday outing.

One thing that sadly did stand out was a lack of participation from Comsc undergraduate students, with only one or two attending. Rob Davies stopped by on Saturday, and both Will and I discussed with him what we can do to increase participation in these events. Hopefully we’ll make some progress on that front in time for the next hackday.


There’s some great photos from the event on Flickr, courtesy of Paul Clarke (Saturday and Sunday). I’ve pulled out some of the best of Team CompJ and added them here. All photos are released under a Creative Commons BY-NC 2.0 licence.



We got a lovely write-up about our project from Dyfrig Williams of the Good Practice Exchange at the Wales Audit Office. Dyfrig also curated a great storify of the weekend.

Hemavault labs have done a round up of the projects here

Computational Journalism – ‘a Manifesto’

While Glyn and I have been discussing the new MSc course between ourselves and with others, we have repeatedly come up with the same issues and themes, again and again. As a planning exercise earlier in the summer, we gathered some of these together into a ‘manifesto’.

The manifesto is online on our main ‘Computational Journalism’ website with a bit of extra commentary, but I thought I’d upload it here as well. Any comments should probably be directed to the article on the CompJ site, so I’ve turned them off just for this article.


GeoJSON and topoJSON for UK boundaries

I’ve just put an archive online containing GeoJSON and topoJSON for UK boundary data. It’s all stored on Github, with a viewer and download site hosted on Github pages.

Browser for the UK topoJSON stored in the Github repository

The data is all created from shapefiles released by the Office for National Statistics, Ordnance Survey and National Records of Scotland, all under the Open Government and OS OpenData licences.

In later posts I’ll detail how I created the files, and how to use them to create interactive choropleth maps.
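
In the meantime, here's a minimal sketch of what reading one of the GeoJSON files could look like in Python. The inline FeatureCollection is made up, and the property names (`CODE`, `NAME`) are assumptions rather than the names used in the archive, so check a real file before relying on them. It also only applies to the GeoJSON files; topoJSON stores shared arcs and needs converting back to GeoJSON before it can be read this way.

```python
import json

# A tiny, made-up FeatureCollection standing in for one of the boundary
# files; the property names (CODE, NAME) are assumptions.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"CODE": "X01", "NAME": "Example Area"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-3.2, 51.4], [-3.1, 51.4], [-3.1, 51.5],
                         [-3.2, 51.5], [-3.2, 51.4]]]
      }
    }
  ]
}
"""


def list_areas(collection):
    """Return (code, name, geometry type) for each feature in a FeatureCollection."""
    areas = []
    for feature in collection["features"]:
        props = feature["properties"]
        areas.append((props["CODE"], props["NAME"], feature["geometry"]["type"]))
    return areas


collection = json.loads(geojson_text)
print(list_areas(collection))
```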

CCGs and WPCs via the medium of OAs

As I was eating lunch this afternoon, I spotted a conversation between @JoeReddington and @MySociety whizz past in Tweetdeck. I traced the conversation back to the beginning and found this request for data:

I’ve been doing a lot of playing with geographic data recently while preparing to release a site making it easier to get GeoJSON boundaries of various areas in the UK. As a result, I’ve become pretty familiar with the Office for National Statistics Geography portal, and the data available there. I figured it must be pretty simple to hack something together to provide the data Joseph was looking for, so I took a few minutes out of lunch to see if I could help.

Checking the lookup tables at the ONS, it was clear that unfortunately there was no simple ‘NHS Trust to Parliamentary Constituency’ lookup table. However, there were two separate lookups involving Output Areas (OAs). One allows you to look up which Parliamentary Constituency (WPC) an OA belongs to. The other allows you to look up which NHS Clinical Commissioning Group (CCG) an OA belongs to. Clearly, all that’s required to link the two is a bit of quick scripting to join them via the Output Areas.

First, let’s create a dictionary with an entry for each CCG. For each CCG we’ll store its ID, name, and a set of the OAs contained within. We’ll also add empty sets for the IDs and names of the WPCs contained within the CCG:

import csv

data = {}

# extract information about clinical commissioning groups
with open('OA11_CCG13_NHSAT_NHSCR_EN_LU.csv', 'r') as oa_to_ccg_file:
  reader = csv.DictReader(oa_to_ccg_file)
  for row in reader:
    # first time we see a CCG, create its entry with empty sets to fill in
    if row['CCG13CD'] not in data:
      data[row['CCG13CD']] = {
        'CCG13CD': row['CCG13CD'],
        'CCG13NM': row['CCG13NM'],
        'PCON11CD list': set(),
        'PCON11NM list': set(),
        'OA11CD list': set(),
      }
    data[row['CCG13CD']]['OA11CD list'].add(row['OA11CD'])

Next we create a lookup table that allows us to convert from OA to WPC:

# extract information for output area to constituency lookup
oas = {}
pcon_nm = {}

with open('OA11_PCON11_EER11_EW_LU.csv', 'r') as oa_to_pcon_file:
  reader = csv.DictReader(oa_to_pcon_file)
  for row in reader:
    oas[row['OA11CD']] = row['PCON11CD']
    pcon_nm[row['PCON11CD']] = row['PCON11NM']

As the penultimate step we go through the CCGs, and for each one we go through the set of OAs it covers and look up the WPC each OA belongs to:

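# go through all the ccgs and lookup pcons from oas
for ccg, d in data.items():
  for oa in d['OA11CD list']:
    d['PCON11CD list'].add(oas[oa])
    d['PCON11NM list'].add(pcon_nm[oas[oa]])
  # the OA set isn't wanted in the output, so drop it from every CCG
  del d['OA11CD list']

Finally we just need to output the data:

# flatten each set into a semicolon-separated string
# (codes and names are sorted independently, so positions may not line up)
for d in data.values():
  d['PCON11CD list'] = ';'.join(sorted(d['PCON11CD list']))
  d['PCON11NM list'] = ';'.join(sorted(d['PCON11NM list']))

with open('output.csv', 'w', newline='') as out_file:
  writer = csv.DictWriter(out_file, ['CCG13CD', 'CCG13NM', 'PCON11CD list', 'PCON11NM list'])
  writer.writeheader()
  writer.writerows(data.values())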

Run the script, and we get a nice CSV with one row for each CCG, each row containing a list of the WPC ids and names the CCG covers.

Of course, this data only covers England (as CCGs are a division in NHS England). Although there don’t seem to be lookups for OAs to Health Boards in Scotland, or from OAs to Local Health Boards in Wales, it should still be possible to do something similar for these countries using Parliamentary Wards as the intermediate geography, as lookups for Wards to Health Boards and Local Health Boards are available. It’s also not immediately clear how well the boundaries for CCGs and WPCs match up; that would require further investigation, depending on what the lookup is to be used for.
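
One rough way to start on that investigation would be to count how many OAs each CCG shares with each constituency, rather than just recording that they overlap. The sketch below does that on a handful of made-up rows: the column names match the lookups used above, but the OA, CCG and WPC codes are invented, and this isn't part of the script in the repository.

```python
from collections import Counter

# Made-up rows mimicking the two ONS lookup files; the codes are not real.
oa_to_ccg = [
    {"OA11CD": "OA1", "CCG13CD": "CCG_A"},
    {"OA11CD": "OA2", "CCG13CD": "CCG_A"},
    {"OA11CD": "OA3", "CCG13CD": "CCG_A"},
    {"OA11CD": "OA4", "CCG13CD": "CCG_B"},
]
oa_to_pcon = [
    {"OA11CD": "OA1", "PCON11CD": "WPC_X"},
    {"OA11CD": "OA2", "PCON11CD": "WPC_X"},
    {"OA11CD": "OA3", "PCON11CD": "WPC_Y"},
    {"OA11CD": "OA4", "PCON11CD": "WPC_Y"},
]


def count_shared_oas(ccg_rows, pcon_rows):
    """Count the OAs that each (CCG, constituency) pair has in common."""
    oa_pcon = {row["OA11CD"]: row["PCON11CD"] for row in pcon_rows}
    shared = Counter()
    for row in ccg_rows:
        pcon = oa_pcon.get(row["OA11CD"])
        if pcon is not None:  # skip OAs missing from the constituency lookup
            shared[(row["CCG13CD"], pcon)] += 1
    return shared


shared = count_shared_oas(oa_to_ccg, oa_to_pcon)
for (ccg, pcon), n in sorted(shared.items()):
    print(ccg, pcon, n)
```

A pair that shares only a few OAs out of the hundreds in a CCG is probably a sliver of overlap rather than a meaningful link. OA counts are only a proxy though, as OAs differ in both area and population.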

All the code, input and output for this task is available on my github page.