Scraping the Assembly

November 2, 2016 - by martin

M’colleague is currently teaching a first-semester module on Data Journalism to the students on our MSc in Computational and Data Journalism. As part of this, they need to do some sort of data project. One of the students is looking at the expenses of Welsh Assembly Members. These are all freely available online, but not in an easy-to-manipulate form. According to the Assembly, they’d be happy to give the data out as a spreadsheet if we submitted an FOI request.

To me, this seems quite stupid. The information is all online and freely accessible. You’ve admitted you’re willing to give it out to anyone who submits an FOI. So why not just make the raw data available to download? This does not sound like a helpful Open Government to me. Anyway, for whatever reason, they’ve chosen not to, and we can’t be bothered to wait around for an FOI to come back. It’s much quicker and easier to build a scraper! We’ll just use Selenium to drive a web browser, submit a search, page through all the results collecting the details, then dump it all out to CSV. Simple.

Scraping AM expenses

I built this as a quick hack this morning. It took about an hour or so, and it shows. The code is not robust in any way, but it works. You can ask it for data from any year (or a number of years) and it’ll happily sit there churning its way through the results and spitting them out as both .csv and .json.
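For the curious, the final "spit it out as .csv and .json" step can be sketched roughly as below. This is a minimal illustration and not the code from the repository: the function name dump_records and the record fields (member, year, amount, category) are made up for the example, and it assumes each scraped result has already been turned into a plain dictionary.

```python
import csv
import json


def dump_records(records, csv_path, json_path):
    """Write a list of dicts to both a CSV file and a JSON file.

    Hypothetical helper for illustration; not taken from the real scraper.
    """
    # Build the CSV header from every key seen, in first-seen order,
    # so records with extra fields do not break the writer.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # restval fills in fields that a particular record is missing.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)

    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Invented sample rows, standing in for scraped search results.
    sample = [
        {"member": "A. Example", "year": 2015, "amount": "12.50"},
        {"member": "B. Sample", "year": 2016, "amount": "7.00", "category": "Travel"},
    ]
    dump_records(sample, "expenses.csv", "expenses.json")
```

Writing both formats from the same list of dictionaries keeps the two files in step, which matters when a student wants the spreadsheet and I want the JSON.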

All the code is available on GitHub and it’s under an MIT Licence. Have fun 😉


Atom Plugins for Web Development

October 5, 2016 - by martin

I’ve had a number of students in my web-dev module asking me what plugins I’m using in my text editor, so I thought I’d dash off a quick blog post on the plugins I find useful day-to-day. (Actually, most people are normally asking me ‘how did you do that thing where you typed one word and suddenly you had a whole HTML page?’ The answer is I used a plugin, so ‘what plugins do you use?’ is really the question they should be asking…)

I’m using Atom as my text editor. It’s free, open source, and generally reliable. If you’re a student on my web-dev course you’re stuck using Sublime Text in the lab for now. I’m pretty sure most of the Atom plugins I use have either direct Sublime equivalents, or similarly functioning alternatives.

There’s a guide to Atom packages here and one for Sublime Text here.

A quick google for ‘best atom packages web developer’ will probably get you to a far more comprehensive list than this, but here’s my current pick of useful plugins anyway:

emmet

This is essential for anyone writing any amount of HTML. This is the magic package that allows me to write ‘html:5’ in a blank document, hit the shortcut keys (CTRL + E in my setup), and suddenly have a simple boilerplate HTML page.

emmet auto-completion

It’s ace. Not only that, but it can write loads of HTML for you, and all you have to do is write a CSS selector for that HTML:

html css Selector expansion

Great stuff. The documentation is here.

atom-beautify

This will tidy up your code automatically, fixing the indentation, spacing and so on. It can even be set to automatically tidy your code every time you save a file. Awesome huh? Imagine being set a coursework where some of the marks were dependent on not writing code that looks like it was written by a five-year-old child who’s addicted to hitting the tab key, then finding out that there’s software to strap that five-year-old’s thumbs to his hands so he can’t hit that tab key. Awesome.

Beautiful tidy code

color-picker

This one adds a colour picker right into Atom. Just CMD-SHIFT-C and choose your colours!

Colour picker

Another useful colour-related plugin you may want to look at is Pigments, which can highlight colours in your projects, and gather them all together so you can see your palette.

linter

My last recommendation is linter. This plugin will automatically check your code for errors. You’ll need to install linters for whatever language you want to check, like linter-tidy, linter-csslint, linter-pylint and linter-jshint.

Linter finds errors in your code

So there we go – a few recommendations to get you started. Found anything else interesting? Let me know!


Software Sustainability Institute – Research Data Visualisation Workshop

August 1, 2016 - by martin

Last week I gave a talk and delivered a hands-on session at the Software Sustainability Institute’s ‘Research Data Visualisation Workshop’, which was held at Manchester University. It was a really engaging event, with a lot of good discussion on the issues surrounding data visualisation.

Professor Jessie Kennedy from Edinburgh Napier University gave a great keynote looking at some key design principles in visualisation, including a number of studies I hadn’t seen before but will definitely be including in my teaching in future.

I gave a talk on ‘Human Science Visualisation’ which really focused on a couple of key issues. Firstly, I tried to illustrate the importance of interactivity in complex visualisations. I then talked about how we as academic researchers need to publish our interactive visualisations for posterity, and how we should press academic publishers to help us communicate our data to readers. Finally, I wanted to point people towards the excellent visualisation work being done by data journalists, and to make the case that newsrooms are an excellent source of ideas and tips for data visualisation. The slides for my talk are here. It’s the first time I’ve spoken about visualisation outside of the classroom, and it was a really fun talk to give.

We also had two great talks from Dr Christina Bergmann and Dr Andy South, focusing on issues of biological visualisation and mapping respectively. All the talks generated some good discussion both in the room and online, which was fantastic to see.

In the afternoon I led a hands-on session looking at visualising data using d3. This was the first time I’d taught a session using d3 v4, which made things slightly interesting. I’m not fully up to speed with all the areas of the API that have changed, so getting the live coding right first time was a bit tricky, but I think I managed. Interestingly, I feel that the changes made to the .data(), .exit(), .enter(), update cycle as discussed in Mike’s “What Makes Software Good” make a lot more sense from a teaching perspective. The addition of .merge() in particular helps a great deal. As you might expect from a d3 workshop that lasted a mere three hours, I’m not entirely convinced that everybody ‘got’ it, but I think most went away satisfied.

Overall it was a very successful workshop. Raniere Silva did an excellent job putting it together and running the day, and I really enjoyed it. I’m looking forward to seeing what other people thought about it too.


Quick Update...

July 13, 2015 - by martin

Been a bit quiet here recently. It’s been a very busy few months. I’ve got a few projects and thoughts that I’ll be posting more on in the next couple of weeks, but I figured it was worth a quick update on what’s been going on, and what I’ve been up to.

MSc Computational Journalism

We have finished the taught part of the MSc, and we’re getting well into the dissertation phase for the first cohort of our students. It’s been a really good first year, and I’ll be posting a debrief and some thoughts on the next year sometime over summer.

BarDiff

I’ve launched a data dashboard thing for beer drinking in Cardiff. Powered by Untappd checkins, it’s providing (I think) a fairly interesting overview of the city. I’ve got some ideas for some better visualisations, but for now it’s nicely ticking over. Plus it’s getting some decent interaction on the social medias.

Academia

The usual ticking over of academia continues - journal reviews, conference reviews, a book chapter to write, paper deadlines coming and going. It’s the same old same old…

Teaching

I’ve started on my teaching qualification (PgCUTL). The first module portfolio was submitted a couple of weeks ago, and results are due any day now (fingers crossed). I’ve also got a few thoughts on the recently announced TEF that I’ll be putting up soon, and some things on employability…

and finally…

The reason I’ve not posted in a while:

Arthur!

My son, Arthur James Chorley-Jones, was born on 13th May 2015. He’s amazing, I think he’s the best thing that has ever happened, and since he’s been around there has not been a huge amount of time for blogging, side-projects, and other such things. Which is ace.


Accessing and Scraping MyFitnessPal Data with Python

February 5, 2015 - by martin

Interesting news this morning that MyFitnessPal has been bought by Under Armour for $475 million. I’ve used MFP for many years now, and it was a big help in losing all the excess PhD weight that I’d put on, and then maintaining a healthy(ish) lifestyle since 2010.

News of an acquisition always has me slightly worried though - not for someone else having access to my data, as I’ve made my peace with the fact that using a free service generally means that it’s me that’s being sold. Giving away my data is the cost of doing business. Rather, it worries me that I may lose access to all the data I’ve collected. I have no idea what Under Armour intend for the service in the long run, and while it’s likely that MFP will continue with business as usual for the foreseeable, it’s always worth having a backup of your data.

A few years ago, I wrote a couple of Python scripts to back up data from MFP and then extract the food and exercise info from the raw HTML. These scripts use Python and Beautiful Soup to log in to MFP, then go back through your diary history and save all the raw HTML pages, essentially scraping your data.

I came to run them this morning and found they needed a couple of changes to deal with site updates. I’ve made the necessary updates and the full code for all the scripts is available on GitHub. It’s not great, but it works. The code is Python 2 and requires BeautifulSoup and Matplotlib (if you want to use generate_plots.py).
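To give a flavour of the two halves (walking back through the diary one day at a time, then pulling rows out of the saved HTML), here is a rough, self-contained sketch. It is not the code from the repository: it is written for Python 3 rather than Python 2, it swaps Beautiful Soup for the standard library’s html.parser so it runs without any installs, and the names diary_dates, RowCollector and extract_rows, along with the table markup in the sample, are all invented for illustration. The real diary pages will be structured differently.

```python
from datetime import date, timedelta
from html.parser import HTMLParser


def diary_dates(newest, oldest):
    """Yield each day from newest back to oldest, inclusive."""
    day = newest
    while day >= oldest:
        yield day
        day -= timedelta(days=1)


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Rows with no <td> cells (for example header rows) are skipped.
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    # Made-up markup standing in for one saved diary page.
    sample = (
        "<table><tr><th>Food</th><th>Calories</th></tr>"
        "<tr><td>Porridge</td><td> 250 </td></tr>"
        "<tr><td>Tea</td><td>15</td></tr></table>"
    )
    for day in diary_dates(date(2015, 2, 5), date(2015, 2, 3)):
        print(day.isoformat(), extract_rows(sample))
```

Keeping the raw HTML on disk and parsing it as a separate step means a site redesign only breaks the parser, not the backup.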