Scraping the Assembly…

M’colleague is currently teaching a first-semester module on Data Journalism to the students on our MSc in Computational and Data Journalism. As part of this, they need to do some sort of data project. One of the students is looking at the expenses of Welsh Assembly Members. These are all freely available online, but not in an easy to manipulate form. According to the Assembly they’d be happy to give the data out as a spreadsheet, if we submitted an FOI.

To me, this seems quite stupid. The information is all online and freely accessible. You’ve admitted you’re willing to give it out to anyone who submits an FOI. So why not just make the raw data available to download? This does not sound like a helpful Open Government to me. Anyway, for whatever reason, they’ve chosen not to, and we can’t be bothered to wait around for an FOI to come back. It’s much quicker and easier to build a scraper! We’ll just use selenium to drive a web browser, submit a search, page through all the results collecting the details, then dump it all out to csv. Simple.

Scraping AM expenses

I built this as a quick hack this morning. It took about an hour or so, and it shows. The code is not robust in any way, but it works. You can ask it for data from any year (or a number of years) and it’ll happily sit there churning its way through the results and spitting them out as both .csv and .json.
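The final "spit it out as .csv and .json" step can be sketched with nothing but the standard library. This is a minimal illustration rather than the code from the repository: the function name `dump_records` and the field names in `sample` are made up for the example, and the Selenium part that actually collects the rows is left out.

```python
import csv
import json


def dump_records(records, csv_path, json_path):
    """Write a list of dicts out as both a CSV file and a JSON file."""
    # Column order follows the order in which keys are first seen
    fieldnames = list(dict.fromkeys(key for record in records for key in record))

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)


# Hypothetical rows, standing in for whatever the scraper collects
sample = [
    {"member": "A. Member", "year": 2015, "category": "Travel", "amount": 42.5},
    {"member": "B. Member", "year": 2015, "category": "Office", "amount": 120.0},
]
```

Calling `dump_records(sample, "expenses.csv", "expenses.json")` would then write both files side by side.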

All the code is available on GitHub and it’s under an MIT Licence. Have fun 😉

Atom Plugins for Web Development

I’ve had a number of students in my web-dev module asking me what plugins I’m using in my text editor, so I thought I’d dash off a quick blog post on the plugins I find useful day-to-day. (Actually, most people are normally asking me ‘how did you do that thing where you typed one word and suddenly you had a whole HTML page?’ The answer is I used a plugin, so ‘what plugins do you use?’ is really the question they should be asking…)

I’m using Atom as my text editor. It’s free, open source, and generally reliable. If you’re a student on my web-dev course you’re stuck using Sublime Text in the lab for now. I’m pretty sure most of the Atom plugins I use have either direct Sublime equivalents, or similarly functioning alternatives.

There’s a guide to Atom packages here and one for Sublime Text here.

A quick google for ‘best atom packages web developer’ will probably get you to a far more comprehensive list than this, but here’s my current pick of useful plugins anyway:

emmet

This is essential for anyone writing any amount of HTML. This is the magic package that allows me to write ‘html:5’ in a blank document, hit the shortcut keys (CTRL + E in my setup), and suddenly have a simple boilerplate HTML page.

emmet auto-completion

It’s ace. Not only that, but it can write loads of HTML for you, and all you have to do is write a CSS selector for that HTML:

HTML CSS selector expansion

Great stuff. The documentation is here.

atom-beautify

This will tidy up your code automatically, fixing the indentation, spacing and so on. It can even be set to automatically tidy your code every time you save a file. Awesome huh? Imagine being set a coursework where some of the marks were dependent on not writing code that looks like it was written by a five-year-old child who’s addicted to hitting the tab key, then finding out that there’s software to strap that five-year-old’s thumbs to his hands so he can’t hit that tab key. Awesome.

Atom Beautify tidies your code

color-picker

This one adds a colour picker right into atom. Just CMD-SHIFT-C and choose your colours!

Colour Picker in atom

Another useful colour related plugin you may want to look at is Pigments, which can highlight colours in your projects, and gather them all together so you can see your palette.

linter

My last recommendation is linter. This plugin will automatically check your code for errors. You’ll need to install linters for whatever language you want to check, like linter-tidy, linter-csslint, linter-pylint and linter-jshint.

Linter finds errors in your code

 

So there we go – a few recommendations to get you started. Found anything else interesting? Let me know!

Sustainable Software Institute – Research Data Visualisation Workshop

Last week I gave a talk and delivered a hands-on session at the Sustainable Software Institute’s ‘Research Data Visualisation Workshop’, which was held at Manchester University. It was a really engaging event, with a lot of good discussion on the issues surrounding data visualisation.

Professor Jessie Kennedy from Edinburgh Napier University gave a great keynote looking at some key design principles in visualisation, including a number of studies I hadn’t seen before but will definitely be including in my teaching in future.

I gave a talk on ‘Human Science Visualisation’ which really focused on a couple of key issues. Firstly, I tried to illustrate the importance of interactivity in complex visualisations. I then talked about how we as academic researchers need to publish our interactive visualisations for posterity, and how we should press academic publishers to help us communicate our data to readers. Finally, I wanted to point people towards the excellent visualisation work being done by data journalists, and to make the case that newsrooms are an excellent source of ideas and tips for data visualisation. The slides for my talk are here. It’s the first time I’ve spoken about visualisation outside of the classroom, and it was a really fun talk to give.

We also had two great talks from Dr Christina Bergmann and Dr Andy South, focusing on issues of biological visualisation and mapping respectively. All the talks generated some good discussion both in the room and online, which was fantastic to see.

In the afternoon I led a hands-on session looking at visualising data using d3. This was the first time I’d taught a session using d3 v4, which made things slightly interesting. I’m not fully up to speed with all the areas of the API that have changed, so getting the live coding right first time was a bit tricky, but I think I managed. Interestingly, I feel that the changes made to the .data(), .exit(), .enter(), update cycle as discussed in Mike’s “What Makes Software Good” make a lot more sense from a teaching perspective. The addition of .merge() in particular helps a great deal. As you might expect from a d3 workshop that lasted a mere three hours, I’m not entirely convinced that everybody ‘got’ it, but I think most went away satisfied.

Overall it was a very successful workshop. Raniere Silva did an excellent job putting it together and running the day, and I really enjoyed it. I’m looking forward to seeing what other people thought about it too.

Accessing and Scraping MyFitnessPal Data with Python

Interesting news this morning that MyFitnessPal has been bought by Under Armour for $475 million. I’ve used MFP for many years now, and it was pretty helpful in helping me lose all the excess PhD weight that I’d put on, and then maintaining a healthy(ish) lifestyle since 2010.

News of an acquisition always has me slightly worried though – not for someone else having access to my data, as I’ve made my peace with the fact that using a free service generally means that it’s me that’s being sold. Giving away my data is the cost of doing business. Rather, it worries me that I may lose access to all the data I’ve collected. I have no idea what Under Armour intend for the service in the long run, and while it’s likely that MFP will continue with business as usual for the foreseeable future, it’s always worth having a backup of your data.

A few years ago, I wrote a couple of python scripts to back up data from MFP and then extract the food and exercise info from the raw HTML. These scripts use Python and Beautiful Soup to do a login to MFP, then go back through your diary history and save all the raw HTML pages, essentially scraping your data.

I came to run them this morning and found they needed a couple of changes to deal with site updates. I’ve made the necessary updates and the full code for all the scripts is available on GitHub. It’s not great, but it works. The code is Python 2 and requires BeautifulSoup and Matplotlib (if you want to use generate_plots.py).
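To give a flavour of the "extract the info from the raw HTML" step, here is a small sketch that pulls table rows out of a saved page. Two caveats: the real scripts use Beautiful Soup, whereas this swaps in the standard library's `html.parser` so it runs without installing anything, and the markup in `SAMPLE` is invented for illustration and is not MFP's actual page structure.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return all table rows found in an HTML string."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented diary markup, standing in for a saved page
SAMPLE = """
<table>
  <tr><th>Food</th><th>Calories</th></tr>
  <tr><td>Porridge</td><td>250</td></tr>
  <tr><td>Banana</td><td>105</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
```

Running the same function over each saved diary page would give one list of rows per day, ready to be written out or plotted.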

NHS Hackday 2015

This weekend I took part in an incredibly successful NHS hackday, hosted at Cardiff University and organised by Anne Marie Cunningham and James Morgan. We went as a team from the MSc in Computational Journalism, with myself and Glyn attending along with Pooja, Nikita, Annalisa and Charles. At the last minute I recruited a couple of ringers as well, dragging along Rhys Priestland Dr William Wilberforce Webberley from Comsc and Dr Matthew Williams, previously of this parish. Annalisa also brought along Dan Hewitt, so in total we had a large and diverse team.

The hackday

This was the first NHS hackday I’d attended, but I believe it’s the second event held in Cardiff, so Anne Marie and the team have it down to a fine art. The whole weekend seemed to go pretty smoothly (barring a couple of misunderstandings on our part regarding the pitch sessions!). It was certainly one of the most well organised events that I’ve attended, with all the necessary ingredients for successful coding: much power, many wifi and plenty of food, snacks and coffee. Anne Marie and the team deserve much recognition and thanks for their hard work. I’m definitely in for next year.

The quality of the projects created at the hackday was incredibly high across the board, which was great to see. One of my favourites used an Oculus Rift virtual reality headset to create a zombie ‘game’ that could be used to test people’s peripheral vision. Another standout was a system for logging and visualising the ANGEL factors describing a patient’s health situation. It was really pleasing to see these rank highly with the judges too, coming in third and second in the overall rankings. Other great projects brought an old Open Source project back to life, created a system for managing groups walking the Wales Coast path, and created automatic notification systems for healthcare processes. Overall it was a really interesting mix of projects, many of which have clear potential to become useful products within or alongside the NHS. As Matt commented in the pub afterwards, it’s probably the first hackday we’ve been to where several of the projects have clear original IP with commercial potential.

Our project

We had decided before the event that we wanted to build some visualisations of health data across Wales, something like nhsmaps.co.uk, but working with local health boards and local authorities in Wales. We split into two teams for the implementation: ‘the data team’ who were responsible for sourcing, processing and inputting data, and the ‘interface team’ who built the front-end and the visualisations.

Progress was good, with Matthew and William quickly defining a schema for describing data so that the data team could add multiple data sets and have the front-end automatically pick them up and be able to visualise them. The CompJ students worked to find and extract data, adding them to the github repository with the correct metadata. Meanwhile, I pulled a bunch of D3 code together for some simple visualisations.

By the end of the weekend we had established a fairly decent system. It’s able to visualise a few different types of data, at different resolutions, is mostly mobile friendly, and most importantly is easily extensible and adaptable. It’s online now on our GitHub pages, and all the code and documentation are also in the GitHub repository.

We’ll continue development for a while to improve the usability and code quality, and hopefully we’ll find a community willing to take the code base on and keep improving what could be a fairly useful resource for understanding the health of Wales.

Debrief

We didn’t win any of the prizes, which is understandable. Our project was really focused on the public understanding of the NHS and health, not on solving a particular need within (or for users of) the NHS. We knew this going in to the weekend, and we’d taken the decision that it was more important to work on a project related to the course, so that the students could experience some of the tools and technologies they’ll be using as the course progresses, than to do something more closely aligned with the brief that would perhaps have been less relevant to the students’ work.

I need to thank Will and Matt for coming and helping the team. Without Matt wrangling the data team and showing them how to create json metadata descriptors we probably wouldn’t have anywhere near as many example datasets as we do. Similarly, without Will’s hard work on the front end interface, the project wouldn’t look nearly as good as it does, or have anywhere near the functionality. His last-minute addition of localstorage for personal datasets was a triumph. (Sadly though he does lose some coder points for user agent sniffing to decide whether to show a mobile interface :-D.) They were both a massive help, and we couldn’t have done it without them.

Also, of course, I need to congratulate the CompJ students, who gave up their weekend to trawl through datasets, pull figures off websites and out of PDFs, and create the lovely easy-to-process .csv files we needed. It was a great effort from them, and I’m looking forward to our next Team CompJ hackday outing.

One thing that sadly did stand out was a lack of participation from Comsc undergraduate students, with only one or two attending. Rob Davies stopped by on Saturday, and both Will and I discussed with him what we can do to increase participation in these events. Hopefully we’ll make some progress on that front in time for the next hackday.

Media

There are some great photos from the event on Flickr, courtesy of Paul Clarke (Saturday and Sunday). I’ve pulled out some of the best of Team CompJ and added them here. All photos are released under a Creative Commons BY-NC 2.0 licence.

 

Elsewhere…

We got a lovely write-up about our project from Dyfrig Williams of the Good Practice Exchange at the Wales Audit Office. Dyfrig also curated a great storify of the weekend.

Hemavault labs have done a round up of the projects here.

Quick and Dirty Twitter API in Python

QUICK DISCLAIMER: this is a quick and dirty solution to a problem, so may not represent best coding practice, and has absolutely no error checking or handling. Use with caution…

A recent project has needed me to scrape some data from Twitter. I considered using Tweepy, but as it was a project for the MSc in Computational Journalism, I thought it would be more interesting to write our own simple Twitter API wrapper in Python.

The code presented here will allow you to make any API request to Twitter that uses a GET request, so is really only useful for getting data from Twitter, not sending it to Twitter. It is also only for use with the REST API, not the streaming API, so if you’re looking for realtime monitoring, this is not the API wrapper you’re looking for. This API wrapper also uses a single user’s authentication (yours), so is not set up to allow other users to use Twitter through your application.

The first step is to get some access credentials from Twitter. Head over to https://apps.twitter.com/ and register a new application. Once the application is created, you’ll be able to access its details. Under ‘Keys and Access Tokens’ are four values we’re going to need for the API – the Consumer Key and Consumer Secret, and the Access Token and Access Token Secret. Copy all four values into a new Python file, and save it as ‘_credentials.py’. The images below walk through the process. Also – don’t try and use the credentials from these images, this app has already been deleted so they won’t work!
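For reference, the credentials file could look something like the sketch below. The variable names are the ones the rest of the code in this post refers to (`client_id`, `client_secret`, `access_token`, `access_secret`); the values are placeholders to be replaced with your own.

```python
# _credentials.py
# Placeholder values: paste in the four strings from 'Keys and Access Tokens'

client_id = "YOUR_CONSUMER_KEY"             # Consumer Key
client_secret = "YOUR_CONSUMER_SECRET"      # Consumer Secret
access_token = "YOUR_ACCESS_TOKEN"          # Access Token
access_secret = "YOUR_ACCESS_TOKEN_SECRET"  # Access Token Secret
```

It is worth keeping this file out of version control, since anyone holding these four values can make requests as you.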

Once we have the credentials, we can write some code to make some API requests!

First, we define a Twitter API object that will carry out our API requests. We need to store the API url, and some details to allow us to throttle our requests to Twitter to fit inside their rate limiting.

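import base64
import hmac
import json
import threading
import time
import urllib.error
import urllib.parse
import urllib.request
import uuid
from hashlib import sha1

# the four values saved in _credentials.py
from _credentials import client_id, client_secret, access_token, access_secret

class Twitter_API:

 def __init__(self):

   # URL for accessing API
   scheme = "https://"
   api_url = "api.twitter.com"
   version = "1.1"

   self.api_base = scheme + api_url + "/" + version

   #
   # seconds between queries to each endpoint
   # queries in this project limited to 180 per 15 minutes
   query_interval = float(15 * 60)/(175)

   #
   # rate limiting timer
   # ('earliest' starts at 0.0 so the first call never has to wait)
   self.__monitor = {'wait':query_interval,
     'earliest':0.0,
     'timer':None}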

We add a rate limiting method that will make our API sleep if we are requesting things from Twitter too fast:

 #
 # rate_controller puts the thread to sleep
 # if we're hitting the API too fast
 def __rate_controller(self, monitor_dict):

   #
   # join the timer thread
   if monitor_dict['timer'] is not None:
     monitor_dict['timer'].join()

   # sleep if necessary (nothing to wait for before the first call)
   earliest = monitor_dict['earliest']
   while earliest is not None and time.time() < earliest:
     time.sleep(max(0.0, earliest - time.time()))

   # work out when the next API call can be made
   earliest = time.time() + monitor_dict['wait']
   timer = threading.Timer( earliest-time.time(), lambda: None )
   monitor_dict['earliest'] = earliest
   monitor_dict['timer'] = timer
   monitor_dict['timer'].start()
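The same idea can be tried out on its own, away from the API class. The sketch below is a simplified stand-in rather than the method above: it drops the timer thread and just tracks the earliest time the next call is allowed. The class name `IntervalThrottle` and its parameters are made up for the example.

```python
import time


class IntervalThrottle:
    """Space calls out so that at most `calls` happen per `period_seconds`."""

    def __init__(self, calls, period_seconds):
        self.interval = period_seconds / calls
        self._earliest = None  # no restriction before the first call

    def wait(self):
        """Sleep until the next call is allowed; return the seconds slept."""
        delay = 0.0
        if self._earliest is not None:
            delay = max(0.0, self._earliest - time.monotonic())
        if delay > 0:
            time.sleep(delay)
        # the next call may happen one interval from now
        self._earliest = time.monotonic() + self.interval
        return delay


# 175 calls per 15 minutes, matching the interval used in the class above
throttle = IntervalThrottle(calls=175, period_seconds=15 * 60)
```

Calling `throttle.wait()` immediately before each request gives the same pacing: the first call goes straight through, and later ones sleep only if they arrive early.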

The Twitter API requires us to supply authentication headers in the request. One of these headers is a signature, created by encoding details of the request. We can write a function that will take in all the details of the request (method, url, parameters) and create the signature:

 #
 # make the signature for the API request
 def get_signature(self, method, url, params):

   # percent-encode all parameter keys and values
   # (OAuth wants a space as %20, so use quote() rather than quote_plus())
   encoded_params = {}
   for k, v in params.items():
     encoded_k = urllib.parse.quote(str(k), safe="")
     encoded_v = urllib.parse.quote(str(v), safe="")
     encoded_params[encoded_k] = encoded_v

   # sort the parameters alphabetically by key
   sorted_keys = sorted(encoded_params.keys())

   # create a string from the parameters: key=value pairs joined by "&"
   signing_string = "&".join(key + "=" + encoded_params[key] for key in sorted_keys)

   # construct the base string
   base_string = method.upper()
   base_string += "&"
   base_string += urllib.parse.quote(url, safe="")
   base_string += "&"
   base_string += urllib.parse.quote(signing_string, safe="")

   # construct the key
   signing_key = urllib.parse.quote(client_secret, safe="") + "&" + urllib.parse.quote(access_secret, safe="")

   # sign the base string with the key (HMAC-SHA1), and base64 encode the result
   hashed = hmac.new(signing_key.encode(), base_string.encode(), sha1)
   signature = base64.b64encode(hashed.digest())
   return signature.decode("utf-8")
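To see what actually gets signed, it can help to build the base string for a small request by hand. The standalone sketch below follows the same steps as the method above; the helper names, the search request and the two secrets are all made up for illustration. It percent-encodes with `urllib.parse.quote(..., safe="")`, which turns a space into `%20`.

```python
import base64
import hmac
from hashlib import sha1
from urllib.parse import quote


def percent_encode(value):
    """Percent-encode a value for OAuth 1.0a (a space becomes %20)."""
    return quote(str(value), safe="")


def signature_base_string(method, url, params):
    """Build METHOD&encoded-url&encoded-sorted-parameters."""
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    parameter_string = "&".join(k + "=" + v for k, v in pairs)
    return "&".join([method.upper(), percent_encode(url), percent_encode(parameter_string)])


def sign(base_string, consumer_secret, token_secret):
    """HMAC-SHA1 the base string and return the digest base64-encoded."""
    key = percent_encode(consumer_secret) + "&" + percent_encode(token_secret)
    digest = hmac.new(key.encode(), base_string.encode(), sha1).digest()
    return base64.b64encode(digest).decode("utf-8")


# Hypothetical request and secrets, for illustration only
base = signature_base_string(
    "get",
    "https://api.twitter.com/1.1/search/tweets.json",
    {"q": "data journalism", "count": 10},
)
signature = sign(base, "made-up-consumer-secret", "made-up-token-secret")
print(base)
```

Note that the parameters are encoded twice: once on their own, and again when the whole parameter string goes into the base string, which is why the space in the query ends up as `%2520`.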

Finally, we can write a method to actually make the API request:

 def query_get(self, endpoint, aspect, get_params={}):
 
   #
   # rate limiting
   self.__rate_controller(self.__monitor)

   # ensure we're dealing with strings as parameters
   str_param_data = {}
   for k, v in get_params.items():
     str_param_data[str(k)] = str(v)

   # construct the query url
   url = self.api_base + "/" + endpoint + "/" + aspect + ".json"
 
   # add the header parameters for authorisation
   # (the timestamp must be whole seconds, and the nonce a unique string)
   header_parameters = {
     "oauth_consumer_key": client_id,
     "oauth_nonce": uuid.uuid4().hex,
     "oauth_signature_method": "HMAC-SHA1",
     "oauth_timestamp": int(time.time()),
     "oauth_token": access_token,
     "oauth_version": "1.0"
   }

   # collect all the parameters together for creating the signature
   signing_parameters = {}
   for k, v in header_parameters.items():
     signing_parameters[k] = v
   for k, v in str_param_data.items():
     signing_parameters[k] = v

   # create the signature and add it to the header parameters
   header_parameters["oauth_signature"] = self.get_signature("GET", url, signing_parameters)

   # add the OAuth headers: key="value" pairs separated by ", "
   header_string = "OAuth " + ", ".join(
     urllib.parse.quote(str(k), safe="") + "=\"" + urllib.parse.quote(str(v), safe="") + "\""
     for k, v in header_parameters.items()
   )

   headers = {
     "Authorization": header_string
   }

   # create the full url including parameters
   url = url + "?" + urllib.parse.urlencode(str_param_data)
   request = urllib.request.Request(url, headers=headers)

   # make the API request
   try:
     response = urllib.request.urlopen(request)
   except urllib.error.HTTPError as e:
     print(e)
     raise e
   except urllib.error.URLError as e:
     print(e)
     raise e

   # read the response and return the json
   raw_data = response.read().decode("utf-8")
   return json.loads(raw_data)

Putting this all together, we have a simple Python class that acts as an API wrapper for GET requests to the Twitter REST API, including the signing and authentication of those requests. Using it is as simple as:

ta = Twitter_API()

# retrieve tweets for a user
params = {
   "screen_name": "martinjc",
}

user_tweets = ta.query_get("statuses", "user_timeline", params)
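A single call only returns one page of tweets. One way to go further back, sketched here under some assumptions, is to keep asking for tweets older than the oldest one seen so far using the `max_id` parameter of the v1.1 timeline endpoint. The function name `fetch_timeline` and its arguments are made up for this example; `fetch_page` is any callable with the same signature as `query_get`, which also makes the loop easy to try out without touching the network.

```python
def fetch_timeline(fetch_page, screen_name, max_pages=5, page_size=200):
    """Collect up to max_pages pages of a user's tweets, newest first."""
    tweets = []
    params = {"screen_name": screen_name, "count": page_size}

    for _ in range(max_pages):
        page = fetch_page("statuses", "user_timeline", params)
        if not page:
            break  # nothing older left to fetch
        tweets.extend(page)

        # ask next time only for tweets older than the oldest one we have
        oldest_id = min(tweet["id"] for tweet in page)
        params = dict(params, max_id=oldest_id - 1)

    return tweets
```

With the wrapper above this would be called as `fetch_timeline(ta.query_get, "martinjc")`, and the built-in rate limiting still applies to every page requested.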

As always, the full code is online on Github, in both my personal account and the account for the MSc Computational Journalism.