Scraping the Assembly…

M’colleague is currently teaching a first-semester module on Data Journalism to the students on our MSc in Computational and Data Journalism. As part of this, they need to do some sort of data project. One of the students is looking at the expenses of Welsh Assembly Members. These are all freely available online, but not in an easy-to-manipulate form. According to the Assembly, they’d be happy to give the data out as a spreadsheet if we submitted an FOI.

To me, this seems quite stupid. The information is all online and freely accessible. You’ve admitted you’re willing to give it out to anyone who submits an FOI. So why not just make the raw data available to download? This does not sound like a helpful Open Government to me. Anyway, for whatever reason, they’ve chosen not to, and we can’t be bothered to wait around for an FOI to come back. It’s much quicker and easier to build a scraper! We’ll just use Selenium to drive a web browser, submit a search, page through all the results collecting the details, then dump it all out to CSV. Simple.

Scraping AM expenses

I built this as a quick hack this morning. It took about an hour or so, and it shows. The code is not robust in any way, but it works. You can ask it for data from any year (or a number of years) and it’ll happily sit there churning its way through the results and spitting them out as both .csv and .json.
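
The page-through-and-dump loop described above can be sketched roughly as follows. This is not the code from the repository: `collect_all_pages`, `dump_records` and the `fetch_page` callable are hypothetical names invented for illustration, and the browser-driving part (which the real scraper does with Selenium) is deliberately left out. The sketch only assumes that each results page can be turned into a list of dicts, and that an empty page means there are no more results.

```python
import csv
import json


def collect_all_pages(fetch_page):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    fetch_page is a hypothetical callable standing in for the browser step:
    it takes a page number and returns a list of dicts, one per result row.
    """
    records = []
    page = 1
    while True:
        rows = fetch_page(page)
        if not rows:
            break
        records.extend(rows)
        page += 1
    return records


def dump_records(records, csv_path, json_path):
    """Write the same list of dicts out as both CSV and JSON."""
    # Use the union of keys, in first-seen order, as the CSV header.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

    with open(json_path, 'w') as json_file:
        json.dump(records, json_file, indent=2)
```

In a real scraper, `fetch_page` would be whatever function clicks through to page n of the search results and reads the rows off the page. Keeping it as a plain callable means the collecting and dumping logic can be tried out without a browser.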

All the code is available on GitHub and it’s under an MIT Licence. Have fun 😉

Sustainable Software Institute – Research Data Visualisation Workshop

Last week I gave a talk and delivered a hands-on session at the Sustainable Software Institute’s ‘Research Data Visualisation Workshop’, which was held at Manchester University. It was a really engaging event, with a lot of good discussion on the issues surrounding data visualisation.

Professor Jessie Kennedy from Edinburgh Napier University gave a great keynote looking at some key design principles in visualisation, including a number of studies I hadn’t seen before but will definitely be including in my teaching in future.

I gave a talk on ‘Human Science Visualisation’ which really focused on a couple of key issues. Firstly, I tried to illustrate the importance of interactivity in complex visualisations. I then talked about how we as academic researchers need to publish our interactive visualisations for posterity, and how we should press academic publishers to help us communicate our data to readers. Finally, I wanted to point people towards the excellent visualisation work being done by data journalists, and to suggest that newsrooms are an excellent source of ideas and tips for data visualisation. The slides for my talk are here. It’s the first time I’ve spoken about visualisation outside of the classroom, and it was a really fun talk to give.

We also had two great talks from Dr Christina Bergmann and Dr Andy South, focusing on issues of biological visualisation and mapping respectively. All the talks generated some good discussion both in the room and online, which was fantastic to see.

In the afternoon I led a hands-on session looking at visualising data using d3. This was the first time I’d taught a session using d3 v4, which made things slightly interesting. I’m not fully up to speed with all the areas of the API that have changed, so getting the live coding right first time was a bit tricky, but I think I managed. Interestingly, I feel that the changes made to the .data(), .exit(), .enter(), update cycle as discussed in Mike’s “What Makes Software Good” make a lot more sense from a teaching perspective. The addition of .merge() in particular helps a great deal. As you might expect from a d3 workshop that lasted a mere three hours, I’m not entirely convinced that everybody ‘got’ it, but I think most went away satisfied.

Overall it was a very successful workshop. Raniere Silva did an excellent job putting it together and running the day, and I really enjoyed it. I’m looking forward to seeing what other people thought about it too.

Computational Journalism – ‘a Manifesto’

While Glyn and I have been discussing the new MSc course between ourselves and with others, we have repeatedly come up with the same issues and themes, again and again. As a planning exercise earlier in the summer, we gathered some of these together into a ‘manifesto’.

The manifesto is online on our main ‘Computational Journalism’ website with a bit of extra commentary, but I thought I’d upload it here as well. Any comments should probably be directed to the article on the CompJ site, so I’ve turned them off just for this article.


CCGs and WPCs via the medium of OAs

As I was eating lunch this afternoon, I spotted a conversation between @JoeReddington and @MySociety whizz past in Tweetdeck. I traced the conversation back to the beginning and found this request for data:

I’ve been doing a lot of playing with geographic data recently while preparing to release a site making it easier to get GeoJSON boundaries of various areas in the UK. As a result, I’ve become pretty familiar with the Office for National Statistics Geography portal, and the data available there. I figured it must be pretty simple to hack something together to provide the data Joseph was looking for, so I took a few minutes out of lunch to see if I could help.

Checking the lookup tables at the ONS, it was clear that unfortunately there was no simple ‘NHS Trust to Parliamentary Constituency’ lookup table. However, there were two separate lookups involving Output Areas (OAs). One allows you to look up which Parliamentary Constituency (WPC) an OA belongs to. The other allows you to look up which NHS Clinical Commissioning Group (CCG) an OA belongs to. Clearly, all that’s required is a bit of quick scripting to tie the two together via the Output Areas.

First, let’s create a dictionary with an entry for each CCG. For each CCG we’ll store its ID, name, and a set of OAs contained within. We’ll also add an empty set for the WPCs contained within the CCG:

import csv

data = {}

# extract information about clinical commissioning groups
with open('OA11_CCG13_NHSAT_NHSCR_EN_LU.csv', 'r') as oa_to_ccg_file:
  reader = csv.DictReader(oa_to_ccg_file)
  for row in reader:
    # first time we see a CCG, create its entry with empty sets to fill in
    if row['CCG13CD'] not in data:
      data[row['CCG13CD']] = {
        'CCG13CD': row['CCG13CD'],
        'CCG13NM': row['CCG13NM'],
        'PCON11CD list': set(),
        'PCON11NM list': set(),
        'OA11CD list': set(),
      }
    # record every output area that falls inside this CCG
    data[row['CCG13CD']]['OA11CD list'].add(row['OA11CD'])

Next we create a lookup table that allows us to convert from OA to WPC:

# extract information for output area to constituency lookup
oas = {}
pcon_nm = {}

with open('OA11_PCON11_EER11_EW_LU.csv', 'r') as oa_to_pcon_file:
  reader = csv.DictReader(oa_to_pcon_file)
  for row in reader:
    oas[row['OA11CD']] = row['PCON11CD']
    pcon_nm[row['PCON11CD']] = row['PCON11NM']

As the penultimate step we go through the CCGs, and for each one we go through the list of OAs it covers, and look up the WPC each OA belongs to:

# go through all the ccgs and look up pcons from oas
for ccg, d in data.items():
  for oa in d['OA11CD list']:
    d['PCON11CD list'].add(oas[oa])
    d['PCON11NM list'].add(pcon_nm[oas[oa]])
  # the OA set is no longer needed once the constituencies are collected
  del d['OA11CD list']

Finally we just need to output the data:

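# flatten each set into a single semicolon-separated string
for d in data.values():
  d['PCON11CD list'] = ';'.join(d['PCON11CD list'])
  d['PCON11NM list'] = ';'.join(d['PCON11NM list'])

with open('output.csv', 'w') as out_file:
  # extrasaction='ignore' skips any keys that aren't output columns
  writer = csv.DictWriter(out_file, ['CCG13CD', 'CCG13NM', 'PCON11CD list', 'PCON11NM list'], extrasaction='ignore')
  writer.writeheader()
  writer.writerows(data.values())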

Run the script, and we get a nice CSV with one row for each CCG, each row containing a list of the WPC ids and names the CCG covers.

Of course, this data only covers England (as CCGs are a division in NHS England). Although there don’t seem to be lookups for OAs to Health Boards in Scotland, or from OAs to Local Health Boards in Wales, it should still be possible to do something similar for these countries using Parliamentary Wards as the intermediate geography, as lookups for Wards to Health Boards and Local Health Boards are available. It’s also not immediately clear how well the boundaries for CCGs and WPCs match up. That would require further investigation, depending on what the lookup is to be used for.
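
One rough way to start that investigation, sketched here as an assumption rather than anything the original script does, is to count how many of a CCG's Output Areas fall in each constituency and turn the counts into shares. The function names `read_lookup` and `ccg_constituency_shares` are made up for this example. OA counts are only a proxy: they say nothing directly about population or land area, so the shares should be read as indicative at best.

```python
import csv
from collections import Counter, defaultdict


def read_lookup(path, key_field, value_field):
    """Read one column-to-column mapping out of a lookup CSV."""
    with open(path, 'r', newline='') as lookup_file:
        reader = csv.DictReader(lookup_file)
        return {row[key_field]: row[value_field] for row in reader}


def ccg_constituency_shares(oa_to_ccg, oa_to_pcon):
    """For each CCG, return the share of its OAs that fall in each constituency."""
    counts = defaultdict(Counter)
    for oa, ccg in oa_to_ccg.items():
        pcon = oa_to_pcon.get(oa)
        if pcon is not None:  # skip OAs missing from the constituency lookup
            counts[ccg][pcon] += 1

    shares = {}
    for ccg, pcon_counts in counts.items():
        total = sum(pcon_counts.values())
        shares[ccg] = {pcon: n / total for pcon, n in pcon_counts.items()}
    return shares
```

With the two ONS files used above, the inputs could be built with `read_lookup('OA11_CCG13_NHSAT_NHSCR_EN_LU.csv', 'OA11CD', 'CCG13CD')` and `read_lookup('OA11_PCON11_EER11_EW_LU.csv', 'OA11CD', 'PCON11CD')`. A CCG whose shares are spread thinly over many constituencies is one where the boundaries line up poorly.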

All the code, input and output for this task is available on my GitHub page.

Unified Diff and recruiting Guest Lecturers

Last week I gave a quick lightning talk at UnifiedDiff – a local tech meetup here in Cardiff. The main point of the talk was to try and recruit more industry involvement for our new MSc in Computational Journalism – preferably by getting some web developers and software engineers in to give guest lectures on the tools, languages and processes they use.

The talk went well, and I’ve had several offers from people wanting to get involved and add some real value to the course, which is brilliant. Of course, there’s always room to add more, so if you’re interested in coming and talking to our students, get in touch!

If you’re interested, the slides from the talk are here.


Welcome to 2014

So. 2013. That was an alright year. Finished the Recognition project, finally graduated, got a 12 month fellowship, started some interesting projects, and pushed on with the new MSc with JOMEC. Professionally, not too bad at all. Personally the year wasn’t bad either, what with getting engaged and finally getting the house on the market.

But now it’s a new year, so it’s time to push things on further. My plans so far for this year seem to be ‘smash it’. There are papers to be published, data to be analysed and project proposals to write (and get funded!). Getting a permanent job would be quite nice, while I’m at it. Here’s to 2014 being even more successful than last year.

SCA 2013 – Visiting Patterns and Personality of Foursquare Users

Today was presentation day for me at SCA 2013 – I was presenting the initial results of the Foursquare experiment, which has now been running for a while. The presentation seemed to go really well – I think it’s the strongest work I’ve done yet, and so it was easy to talk well and with confidence about it, which led to a nice talk. There was also plenty of discussion after the talk, with a lot of good comments and questions from the audience, which suggests that most people were quite interested in the research. I pitched it as a WIP paper, describing what the ultimate aim of the project is – very much recycling the talk I gave to the interview panel for my fellowship proposal. I think it certainly got a few people interested who’ll look to follow the project as it unfolds over the next couple of months.

After lunch there was an extra bonus when we discovered a beer vending machine in the hotel – what better way to celebrate a successful conference presentation than a cold beer in the sun?