Tag Archives: coding

Computational Journalism Manifesto

Computational Journalism – ‘a Manifesto’

While Glyn and I have been discussing the new MSc course between ourselves and with others, we have repeatedly come up against the same issues and themes. As a planning exercise earlier in the summer, we gathered some of these together into a ‘manifesto’.

The manifesto is online on our main ‘Computational Journalism‘ website with a bit of extra commentary, but I thought I’d upload it here as well. Any comments should probably be directed to the article on the CompJ site, so I’ve turned them off just for this article.

 

GeoJSON and topoJSON for UK boundaries

I’ve just put an archive online containing GeoJSON and topoJSON for UK boundary data. It’s all stored on Github, with a viewer and download site hosted on Github pages.

Browser for the UK topoJSON stored in the Github repository

The data is all created from shapefiles released by the Office for National Statistics, Ordnance Survey and National Records of Scotland, all under the Open Government and OS OpenData licences.

In later posts I’ll detail how I created the files, and how to use them to create interactive choropleth maps.
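
In the meantime, to give a flavour of how easy the files are to work with, here is a minimal sketch of loading one of the GeoJSON files in Python and listing the properties attached to each boundary. The filename is just a placeholder for whichever file you download:

import json

# placeholder filename - substitute whichever boundary file you've downloaded
with open("constituencies.geojson") as infile:
    boundaries = json.load(infile)

# a GeoJSON file is a FeatureCollection: one feature per boundary, each with
# a 'properties' dict (names, codes etc.) and a 'geometry'
for feature in boundaries["features"]:
    print feature["properties"]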

CCG to WPC lookup

CCGs and WPCs via the medium of OAs

As I was eating lunch this afternoon, I spotted a conversation between @JoeReddington and @MySociety whizz past in Tweetdeck. I traced the conversation back to the beginning and found this request for data:

I’ve been doing a lot of playing with geographic data recently while preparing to release a site making it easier to get GeoJSON boundaries of various areas in the UK. As a result, I’ve become pretty familiar with the Office for National Statistics Geography portal and the data available there. I figured it must be pretty simple to hack something together to provide the data Joseph was looking for, so I took a few minutes out of lunch to see if I could help.

Checking the lookup tables at the ONS, it was clear that unfortunately there was no simple ‘NHS Trust to Parliamentary Constituency’ lookup table. However, there were two separate lookups involving Output Areas (OAs). One allows you to look up which Parliamentary Constituency (WPC) an OA belongs to. The other allows you to look up which NHS Clinical Commissioning Group (CCG) an OA belongs to. Clearly, all that’s required is a bit of quick scripting to tie the two together via the Output Areas.

First, let’s create a dictionary with an entry for each CCG. For each CCG we’ll store its ID, its name, and the set of OAs contained within it. We’ll also add empty sets to hold the codes and names of the WPCs contained within the CCG:

import csv

data = {}

# extract information about clinical commissioning groups
with open('OA11_CCG13_NHSAT_NHSCR_EN_LU.csv', 'r') as oa_to_cgc_file:
  reader = csv.DictReader(oa_to_cgc_file)
  for row in reader:
    if not data.get(row['CCG13CD']):
      data[row['CCG13CD']] = {
        'CCG13CD': row['CCG13CD'],
        'CCG13NM': row['CCG13NM'],
        'PCON11CD list': set(),
        'PCON11NM list': set(),
        'OA11CD list': set(),
      }
    data[row['CCG13CD']]['OA11CD list'].add(row['OA11CD'])

Next we create a lookup table that allows us to convert from OA to WPC:

# extract information for output area to constituency lookup
oas = {}
pcon_nm = {}

with open('OA11_PCON11_EER11_EW_LU.csv', 'r') as oa_to_pcon_file:
  reader = csv.DictReader(oa_to_pcon_file)
  for row in reader:
    oas[row['OA11CD']] = row['PCON11CD']
    pcon_nm[row['PCON11CD']] = row['PCON11NM']

Almost the last step: we go through the CCGs, and for each one we run through the OAs it covers and look up the WPC each OA belongs to:

# go through all the ccgs and look up pcons from oas
for ccg, d in data.iteritems():

  for oa in d['OA11CD list']:
    d['PCON11CD list'].add(oas[oa])
    d['PCON11NM list'].add(pcon_nm[oas[oa]])

  # drop the OA list - we don't need it in the output
  del d['OA11CD list']

Finally we just need to output the data:

for d in data.values():
  d['PCON11CD list'] = ';'.join(d['PCON11CD list'])
  d['PCON11NM list'] = ';'.join(d['PCON11NM list'])

with open('output.csv', 'w') as out_file:
  writer = csv.DictWriter(out_file, ['CCG13CD', 'CCG13NM', 'PCON11CD list', 'PCON11NM list'])
  writer.writeheader()
  writer.writerows(data.values())

Run the script and we get a nice CSV with one row for each CCG, each row containing a list of the WPC IDs and names that the CCG covers.

Of course, this data only covers England (as CCGs are a division of NHS England). Although there don’t seem to be lookups from OAs to Health Boards in Scotland, or from OAs to Local Health Boards in Wales, it should still be possible to do something similar for those countries using Parliamentary Wards as the intermediate geography, as lookups from Wards to Health Boards and Local Health Boards are available. It’s also not immediately clear how well the boundaries for CCGs and WPCs match up; that would require further investigation, depending on what the lookup is to be used for.
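
To make the ward-based idea concrete, a rough sketch of the Welsh version might look something like the following. Note that the filenames and column names here are placeholders rather than real ONS files, so they would need swapping for the actual ward lookup tables:

import csv

# placeholder filenames and column names - substitute the real ONS
# ward-to-LHB and ward-to-constituency lookup files and their columns
lhbs = {}
with open('WD_LHB_LU.csv', 'r') as ward_to_lhb_file:
  reader = csv.DictReader(ward_to_lhb_file)
  for row in reader:
    lhbs.setdefault(row['LHBCD'], {'LHBNM': row['LHBNM'], 'WD list': set()})
    lhbs[row['LHBCD']]['WD list'].add(row['WDCD'])

wards_to_pcon = {}
with open('WD_PCON_LU.csv', 'r') as ward_to_pcon_file:
  reader = csv.DictReader(ward_to_pcon_file)
  for row in reader:
    wards_to_pcon[row['WDCD']] = row['PCONNM']

# join the two lookups via the wards, exactly as with the OAs above
for lhb in lhbs.values():
  lhb['PCON list'] = {wards_to_pcon[wd] for wd in lhb['WD list'] if wd in wards_to_pcon}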

All the code, input and output for this task is available on my github page.

MSc Computational Journalism about to launch

For the last two years I’ve been working on a project with some colleagues in the school of Journalism, Media and Cultural Studies (JOMEC) here at Cardiff University and it’s finally all coming together. This week we’ve been able to announce that (subject to some final internal paperwork wrangling) we’ll be launching an MSc in Computational Journalism this September. The story of how the course came about is fairly long, but starts simply with a tweet (unfortunately missing the context, but you get the drift):

An offer via social media from someone I’d never met, asking to pick my brains about an unknown topic. Of course, I jumped at the invite:

That ‘brain picking’ became an interesting chat over coffee in one of the excellent coffee shops in Cardiff, where Glyn and I discussed many things of interest, and many potential areas for collaboration – including the increased use of data and coding within modern journalism. At one point during this chat, m’colleague Glyn said something like “do you know, I think we should run a masters course on this.” I replied with something along the lines of “yes, I think that’s a very good idea.”

That short conversation became us taking the idea of an MSc in Computational Journalism to our respective heads of schools, which became us sitting around the table discussing what should be in such a course, which then became us (I say us, it was mainly all Richard) writing pages of documentation explaining what the course would be and arguing the case for it to the University.

Last week we held the final approval panel for the course, where the internal and external panel members agreed that we pretty much knew what we were doing, that the course was a good idea and had the right content, and that we should go ahead and launch it. From 25th July 2012 to 1st April 2014 is a long time to get an MSc up and running, but we’ve finally done it. Over that time I’ve discovered many things about the University and its processes, drunk many pints of fine ale while trying to hammer out a course structure in various pubs around the city, and come close on at least one occasion to screaming at a table full of people, but now it’s done.

As I write, draft press releases are being written, budgets are being sorted, and details are being uploaded to coursefinder. With any luck, September will see us with a batch of students ready and willing to step onto the course for the first time. It’s exciting, and I can’t wait.

Python + OAuth

As part of a current project I had the misfortune of having to deal with a bunch of OAuth-authenticated web services from a command line script in Python. Usually this isn’t really a problem, as most decent client libraries for services such as Twitter or Foursquare handle the authentication requests themselves, wrapping their own internal OAuth implementation. However, when it comes to web services that don’t have existing Python client libraries, you have to do the implementation yourself. Unfortunately, support for OAuth in Python is a mess, so this is not the most pleasant of tasks, especially when most Stack Overflow posts on the topic point to massively outdated and unmaintained libraries.

Fortunately after some digging around, I was able to find a nice, well maintained and fairly well documented solution: rauth, which is very clean and easy to use. As an example, I was trying to connect to the Fitbit API, and it really was as simple as following their example.

Firstly, we create an OAuth1Service:

import rauth
from _credentials import consumer_key, consumer_secret

base_url = "https://api.fitbit.com"
request_token_url = base_url + "/oauth/request_token"
access_token_url = base_url + "/oauth/access_token"
authorize_url = "http://www.fitbit.com/oauth/authorize"

fitbit = rauth.OAuth1Service(
    name="fitbit",
    consumer_key=consumer_key,
    consumer_secret=consumer_secret,
    request_token_url=request_token_url,
    access_token_url=access_token_url,
    authorize_url=authorize_url,
    base_url=base_url)

Then we get the temporary request token credentials:

request_token, request_token_secret = fitbit.get_request_token()

print " request_token = %s" % request_token
print " request_token_secret = %s" % request_token_secret
print

We then ask the user to authorise our application, and give us the PIN so we can prove to the service that they authorised us:

authorize_url = fitbit.get_authorize_url(request_token)

print "Go to the following page in your browser: " + authorize_url
print

accepted = 'n'
while accepted.lower() == 'n':
    accepted = raw_input('Have you authorized me? (y/n) ')
pin = raw_input('Enter PIN from browser ')

Finally, we can create an authenticated session and access user data from the service:

session = fitbit.get_auth_session(request_token,
                                  request_token_secret,
                                  method="POST",
                                  data={'oauth_verifier': pin})

print ""
print " access_token = %s" % session.access_token
print " access_token_secret = %s" % session.access_token_secret
print ""

url = base_url + "/1/" + "user/-/profile.json"

r = session.get(url, params={}, header_auth=True)
print r.json()

It really is that easy to perform a 3-legged OAuth authentication on the command line. If you’re only interested in data from one user, and you want to run the app multiple times, then once you have the access token and secret there’s nothing to stop you storing them and re-creating your session each time without having to re-authenticate (assuming the service does not expire access tokens):

base_url = "https://api.fitbit.com/"
api_version = "1/"
token = (fitbit_oauth_token, fitbit_oauth_secret)
consumer = (fitbit_consumer_key, fitbit_consumer_secret)

session = rauth.OAuth1Session(consumer[0], consumer[1], token[0], token[1])
url = base_url + api_version + "user/-/profile.json"
r = session.get(url, params={}, header_auth=True)
print r.json()

So there we have it. Simple OAuth authentication on the command line, in Python. As always, the code is available on github if you’re interested.

KSRI Services Summer School – Social Computing Theory and Hackathon

I was invited by Simon Caton to come to the KSRI Services Summer School, held at KIT in Germany, to help him run a workshop session on Social Computing. We decided to use the session as a crash course in retrieving and manipulating data from Social Media APIs – showing the students the basics, then running a mini ‘hackathon’ so they could gain some practical experience.

I think the session went really well; the students seemed to enjoy it and the feedback was very positive. We spent about 90 minutes talking about APIs, JSON, Twitter, Facebook and Foursquare, then set the students off on forming teams and brainstorming ideas. Very quickly they got themselves set up grabbing Twitter data from the streaming API, and came up with ways of analysing it for interesting facts and statistics. A number of the students were not coders and had never done anything like this before, so it was great to see them diving in, setting up servers and running PHP scripts to grab the data. It was also good to see the level of teamwork on display; everyone was communicating, dividing the work, and getting on well. Fuelled by a combination of pizza, beer, Red Bull and Haribo, they coded into the night, until we drew things to a close at about 10pm and retired to the nearest bar for a pint of debrief.

Teams hard at work hacking with Twitter data
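
For anyone wanting to try something similar, here is a minimal Python sketch of grabbing tweets from the (then) Twitter streaming API using the tweepy library – not what the students actually used (they mostly worked in PHP), and the credentials and track terms below are placeholders:

import tweepy

# placeholder credentials - substitute your own app's keys and tokens
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

class PrintListener(tweepy.StreamListener):
    def on_status(self, status):
        # do something with each tweet as it arrives - here we just print it
        print status.text.encode("utf-8")

    def on_error(self, status_code):
        # returning False disconnects the stream on an error
        return False

stream = tweepy.Stream(auth, PrintListener())
stream.filter(track=["ksri", "hackathon"])  # track is a list of keywords to follow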

It was a really good experience, and I think everyone got something useful out of it. I’m looking forward to the presentations later on today to see what everyone came up with.

Our slides from the talk are available on slideshare. As usual they’re information light and picture heavy, so their usefulness is probably limited!

SWN Festival 2013 plans – part 1: the data (2!)

In the previous post, I used python and BeautifulSoup to grab the list of artists appearing at SWN Festival 2013, and to scrape their associated soundcloud/twitter/facebook/youtube links (where available).

However, there are more places to find music online than just those listed on the festival site, and some of those extra sources include additional data that I want to collect, so now we need to search these other sources for the artists. Firstly, we need to load the artist data we previously extracted from the festival website, and iterate through the list of artists one by one:

import json
import urllib
import urllib2

artists = {}
with open("bands.json") as infile:
    artists = json.load(infile)

for artist, artist_data in artists.iteritems():

The first thing I want to do for each artist is to search Spotify to see if they have any music available there. Spotify has a simple web API for searching, which is pretty straightforward to use:

params = {
    "q" : "artist:" + artist.encode("utf-8")
}

spotify_root_url = "http://ws.spotify.com/search/1/artist.json"
spotify_url = "%s?%s" % (spotify_root_url, urllib.urlencode(params))

data = retrieve_json_data(spotify_url)

if data.get("artists", None) is not None:
    if len(data["artists"]) > 0:
        artist_id = data["artists"][0]["href"].lstrip("spotify:artist:")
        artist_data["spotify_id"] = data["artists"][0]["href"]
        artist_data["spotify_url"] = "http://open.spotify.com/artist/" + artist_id

The ‘retrieve_json_data’ function is just a wrapper to call a URL and parse the returned JSON data:

def retrieve_json_data(url):

    try:
        response = urllib2.urlopen(url)
    except urllib2.HTTPError, e:
        raise e
    except urllib2.URLError, e:
        raise e

    raw_data = response.read()
    data = json.loads(raw_data)

    return data

Once I’ve searched Spotify, I then want to see if the artist has a page on Last.FM. If they do, I also want to extract and store their top-tags from the site. Again, the Last.FM API makes this straightforward. Firstly, searching for the artist page:

params = {
    "artist": artist.encode("utf-8"),
    "api_key": last_fm_api_key,
    "method": "artist.getinfo",
    "format": "json"
}

last_fm_url = "http://ws.audioscrobbler.com/2.0/?" + urllib.urlencode(params)

data = retrieve_json_data(last_fm_url)

if data.get("artist", None) is not None:
    if data["artist"].get("url", None) is not None:
        artist_data["last_fm_url"] = data["artist"]["url"]

Then, searching for the artist’s top tags:

params = {
    "artist": artist.encode("utf-8"),
    "api_key": last_fm_api_key,
    "method": "artist.gettoptags",
    "format": "json"
}

last_fm_url = "http://ws.audioscrobbler.com/2.0/?" + urllib.urlencode(params)

data = retrieve_json_data(last_fm_url)

if data.get("toptags", None) is not None:

    artist_data["tags"] = {}

    if data["toptags"].get("tag", None) is not None:
        tags = data["toptags"]["tag"]
        if type(tags) == type([]):
            for tag in tags:
                name = tag["name"].encode('utf-8')
                count = 1 if int(tag["count"]) == 0 else int(tag["count"])
                artist_data["tags"][name] = count
            else:
                name = tags["name"].encode('utf-8')
                count = 1 if int(tags["count"]) == 0 else int(tags["count"])
                artist_data["tags"][name] = count

Again, once we’ve retrieved all the extra artist data, we can dump it to file:

with open("bands.json", "w") as outfile:
    json.dump(artists, outfile)

So, I now have 2 scripts that I can run regularly to capture any updates to the festival website (including lineup additions) and to search for artist data on Spotify and Last.FM. Now I’ve got all this data captured and stored, it’s time to start doing something interesting with it…

SWN Festival 2013 plans – part 1: the data

As I mentioned, I’m planning on doing a bit more development work this year connected to the SWN Festival. The first stage is to get hold of the data associated with the festival in an accessible and machine readable form so it can be used in other apps.

Unfortunately (but unsurprisingly), being a smallish local festival, there is no API for any of the data. So, getting a list of the bands and their info means we need to resort to web scraping. Fortunately, with a couple of lines of python and the BeautifulSoup library, getting the list of artists playing the festival is pretty straightforward:

import urllib2
import json

from bs4 import BeautifulSoup

root_page = "http://swnfest.com/"
lineup_page = root_page + "lineup/"

try:
    response = urllib2.urlopen(lineup_page)
except urllib2.HTTPError, e:
    raise e
except urllib2.URLError, e:
    raise e

raw_data = response.read()

soup = BeautifulSoup(raw_data)

links = soup.select(".artist-listing h5 a")

artists = {}

for link in links:
    url = link.attrs["href"]
    artist = link.contents[0]

    artists[artist] = {}
    artists[artist]["swn_url"] = url

All we’re doing here is loading the lineup page of the main festival website, using BeautifulSoup to find all the links to individual artist pages (which are in a div with a class of “artist-listing”, each one in an h5 tag), then parsing these links to extract the artist name and the URL of their page on the festival website.

Each artist page on the website includes handy links to soundcloud, twitter, youtube etc (where these exist), and since I’m going to want to include these kinds of things in the apps I’m working on, I’ll grab those too:

for artist, data in artists.iteritems():
    try:
        response = urllib2.urlopen(data["swn_url"])
    except urllib2.HTTPError, e:
        raise e
    except urllib2.URLError, e:
        raise e

    raw_data = response.read()

    soup = BeautifulSoup(raw_data)

    links = soup.select(".outlinks li")

    for link in links:
        source_name = link.attrs["class"][0]
        source_url = link.findChild("a").attrs["href"]
        data[source_name] = source_url

This code iterates through the list of artists we just extracted from the lineup page, retrieves the relevant artist page, and parses it for the outgoing links, which are stored in list items in an unordered list with a class of ‘outlinks’. Fortunately each link in this list has a class describing what type of link it is (facebook/twitter/soundcloud etc), so we can use that class as a key in our dictionary, with the link itself as the value. Later on, once schedule information is included in the artist pages, we can add some code to parse stage-times and venues, but at the moment that data isn’t present, so we can’t extract it yet.

Finally we can just dump our artist data to json, and we have the information we need in an easily accessible format:

with open("bands.json", "w") as outfile:
    json.dump(artists, outfile)

Now we have the basic data for each artist, we can go on to search for more information on other music sites. The nice thing about this script is that when the lineup gets updated, we can just re-run the code and capture all the new artists that have been added. I should also mention that all the code I’m using for this is available on github.

SWN Festival 2013 – plans

Last year I had a go at creating a couple of web apps based around the bands playing the SWN Festival here in Cardiff. I love SWN with all my heart, it’s a permanent fixture in my calendar and even if (when) I leave Cardiff it’ll be the one thing I come back for every year. It’s a great way to see and discover new bands, but sometimes the sheer volume of music on offer can be overwhelming. So I wanted to see if I could create some web apps that would help to navigate your way through all the bands, and find the ones that you should go and see.

The first was a simple app that gathered artist tags from Last.FM, allowing you to see which artists playing the festival had similar tags – so if you knew you liked one artist you could find other artists tagged with the same terms. The second (which technically wasn’t ever really finished) would allow you to log in with a last.fm account and find the artists whose tags best matched the tags of the top artists in your last.fm profile.
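
The matching in that second app essentially boiled down to comparing tag sets. A minimal sketch of that kind of overlap scoring (the data structures and weighting here are illustrative, not the actual code) looks something like this:

def tag_similarity(artist_tags, user_tags):
    # both arguments are dicts mapping tag name -> count; the score is just
    # the summed counts of the tags the artist and the user have in common
    shared = set(artist_tags) & set(user_tags)
    return sum(artist_tags[tag] + user_tags[tag] for tag in shared)

# rank festival artists against a user's tags (all values are illustrative)
user_tags = {"indie": 100, "folk": 60, "lo-fi": 20}
festival_artists = {
    "Artist A": {"indie": 80, "punk": 40},
    "Artist B": {"folk": 90, "lo-fi": 30},
}

ranked = sorted(festival_artists.items(),
                key=lambda pair: tag_similarity(pair[1], user_tags),
                reverse=True)
for name, tags in ranked:
    print name, tag_similarity(tags, user_tags)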

I liked both these apps and found them both useful – but I don’t think they went far enough. I only started development late in the year, about a month before the festival, so didn’t have a lot of time to really get into it. This year I’m starting a lot earlier, so I’ve got time to do a lot more.

Firstly I’d like to repeat the apps from last year, but perhaps combine them in some way. I’d like to include more links to the actual music, making it easy to get from an artist to their songs by including embeds from soundcloud, spotify, youtube etc. I’d also like to try making a mobile app guide to the festival (probably as an android app, as the official app is iOS only). I’m hopeful that given enough free time I should be able to get some genuinely useful stuff done, and I’ll be blogging about it here as I work on it.

Summer Project update

We are storming along with summer projects now, and starting to see some really good results.

Liam Turner (who is starting a PhD in the school in October) has been working hard to create a mobile version of the 4SQPersonality app. His work is coming along really well, with a great mobile HTML version now up and running, a native android wrapper working, and an iOS wrapper on its way. With any luck we’ll have mobile apps for both major platforms ready to be released before the summer is over.

Max Chandler, who is now a second year undergraduate, has done some great work looking at the Foursquare venues within various cities around the UK, analysing them for similarity and spatial distribution. He’s just over halfway through the project now and is beginning to work on visualising the data he’s collected and analysed. He’s creating some interesting interactive visualisations using D3, so as soon as he’s done I’ll link to the website here.

It’s been a really good summer for student projects so far, with some really pleasing results. I’ll post more details of the projects and share some of the results as they come to a close over the coming weeks.

Open Sauce Hackathon – Post Mortem

This weekend saw the second ‘Open Sauce Hackathon‘ run by undergraduate students here in the school. Last year’s was pretty successful, and they improved upon it this year, pulling in many more sponsors and offering more prizes.

Unlike last year, when I turned up having already decided with Jon Quinn what we were doing, I went along this year with no real ideas. I had a desire to do something with a map, as I’m pretty sure building stuff connected to maps is going to play a big part in work over the next couple of months. Other than that though, I was at a bit of a loss. After playing around with some ideas and APIs I finally came up with my app: dionysus.

Dionysus Screenshot

It’s a mobile friendly mapping app that shows you two important things: Where the pubs are (using venue data from Foursquare) and where the gigs are at (using event data from last.fm). If you sign in to either last.fm or Foursquare it will also pull in recommended bars and recommended gigs and highlight these for you.

The mapping is done using leaflet.js, which I found to be nicer and easier to use than Google Maps. The map tiles are based on OpenStreetMap data and come from CloudMade, while the (devastatingly beautiful) icons were rushed together by me over the weekend. The entire app is just client-side JavaScript and HTML, with HTML5 persistent localStorage used to maintain login authentication between sessions. It’s a simple app, but I’m pretty pleased with it. In the end I even won a prize for it (£50), so it can’t be too bad.

The app is hosted here, and the source code is available here. Obviously though the code is not very pretty and quite hacky, but it does the job!

LaTeX notes and links

During the ‘LaTeX for Beginners’ UGC course on Friday 1st February I promised that I would upload the source code for my presentations along with some useful links:

Some useful links for people new to LaTeX:

LaTeX beamer handouts (with frames and borders)

I’m working on some notes for a beginners’ LaTeX course that I’m giving for the University Graduate College this week. In a temporary fit of insanity I decided it would be nice to write all the slides in LaTeX, so that I could distribute the source to the students, giving them some real-world LaTeX examples to go along with the course notes.

I was attempting to make handouts for the students using the great handoutWithNotes package. However, as my slides are white, they looked a bit odd on a page without a frame around them:

Slides without border

I wanted to add a border to make the handouts look better, but there were no suggestions at the site I got the package from as to how to add a frame, and I’m too lazy to go digging in CTAN to see if there’s any documentation.

Instead, a little bit of googling (thank you tex.stackexchange!) revealed the answer:

\pgfpageslogicalpageoptions{1}{border code=\pgfusepath{stroke}}
\pgfpageslogicalpageoptions{2}{border code=\pgfusepath{stroke}}
\pgfpageslogicalpageoptions{3}{border code=\pgfusepath{stroke}}

You’ll need one command for each slide on a page, and you get simple frames around the slides:

Handouts with Frames

Easy!