Catching a Bug

June 12, 2017 - by martin

I’m doing some data analysis, and I just caught a showstopper of a bug. Want to see it? Here’s the code as it was before:

new_index = [LIKERT[value] for value in LIKERT.keys() if value in data_counts.index]

and here’s a simple fix for the code:

new_index = [LIKERT[value] for value in data_counts.index]

Doesn’t look like much of a problem, but it completely changed the way my data was analysed. Both lines create a new index for a pandas dataframe. I have a dataframe with this index:

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

and I want to replace the index with the correct names from a likert scale that these values refer to:

['N/A', 'Disagree Strongly', 'Disagree', 'Neither Agree nor Disagree', 'Agree', 'Agree Strongly']

so I create a dictionary that maps from keys in the first index, to values for the new index:

LIKERT = {
    0.0: 'N/A',
    1.0: 'Disagree Strongly',
    2.0: 'Disagree',
    3.0: 'Neither Agree nor Disagree',
    4.0: 'Agree',
    5.0: 'Agree Strongly'
}

I then do a little list comprehension that adds the correct new value to the new index if its key is in the old index. If the key isn’t there, it gets skipped:

new_index = [LIKERT[value] for value in LIKERT.keys() if value in data_counts.index]

All fine, right? Sure, if the index is always in numerical order. Which it isn’t. Using this code, if the index is in the wrong order, you can get ‘5’ being replaced with ‘Disagree Strongly’ (or any of the values other than ‘Agree Strongly’) and suddenly your analysis is completely wrong.

The second line fixes this by looping through the index, not the dictionary, and so creates the new index in the correct order.
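To make the failure concrete, here’s a minimal sketch. It uses a plain list as a stand-in for data_counts.index (an assumption for illustration; the real thing is a pandas index), and the out-of-order values are made up:

```python
# Lookup from numeric responses to Likert labels, as described in the post.
LIKERT = {
    0.0: 'N/A',
    1.0: 'Disagree Strongly',
    2.0: 'Disagree',
    3.0: 'Neither Agree nor Disagree',
    4.0: 'Agree',
    5.0: 'Agree Strongly',
}

# Hypothetical stand-in for data_counts.index, deliberately not in numerical order.
old_index = [5.0, 1.0, 3.0]

# Buggy version: walks the dictionary, so labels come out in dictionary order.
buggy_index = [LIKERT[value] for value in LIKERT.keys() if value in old_index]

# Fixed version: walks the index, so each label lines up with its row.
fixed_index = [LIKERT[value] for value in old_index]

for value, buggy, fixed in zip(old_index, buggy_index, fixed_index):
    print(value, '| buggy:', buggy, '| fixed:', fixed)
```

With this ordering the buggy version labels the 5.0 row ‘Disagree Strongly’. One difference worth knowing: the fixed comprehension raises a KeyError if the index holds a value that isn’t in LIKERT, whereas the original silently left it out.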

A better fix is actually to use the .rename() method, which can rename the index of a dataframe (or the column names) using a dictionary as a lookup, like so:

data.rename(index=LIKERT, inplace=True)

Any values present in the index but not in the lookup are left alone, and values in the lookup but not in the index are ignored. The result is exactly what I need: all my ‘5s’ replaced with ‘Agree Strongly’, and so on.
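Here’s a small self-contained sketch of that behaviour. The dataframe contents are invented for illustration: the index is out of order and includes 9.0, which isn’t in the lookup.

```python
import pandas as pd

LIKERT = {
    0.0: 'N/A',
    1.0: 'Disagree Strongly',
    2.0: 'Disagree',
    3.0: 'Neither Agree nor Disagree',
    4.0: 'Agree',
    5.0: 'Agree Strongly',
}

# Hypothetical counts: index out of order, plus 9.0 which has no entry in LIKERT.
data = pd.DataFrame({'count': [12, 3, 7, 1]}, index=[5.0, 1.0, 3.0, 9.0])

# Each index label is looked up in the dictionary; unknown labels are kept as they are.
data.rename(index=LIKERT, inplace=True)

print(data)
```

Each row keeps its own count, whatever order the index was in, and the 9.0 row passes through untouched.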

So I guess the lesson learnt here is RTFM, and don’t try to be clever and re-invent functionality that already exists.

Weeknotes - 29th May 2017

June 4, 2017 - by martin

Monday 29th May


Tuesday 30th May


Finished my visualisation coursework marking today. Generally really good quality across the board, and a really enjoyable set of work to mark. As time goes on, I’m liking this visualisation course more and more. It’s fun to teach as it’s an interesting and quite subjective field, which is not usual in a ‘normal’ Computer Science course. There’s lots of scope for discussion and argument and plenty of chances for students to really get stuck into some data analysis and communication and really show off their skills. I get the feeling the mark distribution skewed a little higher than last year, but I haven’t checked that yet.

Also met with the last student who has expressed an interest in our CUROP project for this summer, so we’ll be able to make a decision on that soon and get that project rolling. I also met with another of our CompDJ students about their dissertation project - they’re looking to build a bot that will write articles automatically about particular topics. A very ambitious project, but one that looks to be really interesting.

The other major task on Tuesday was a Skype call with the rest of the organising committee of DataJConf to finalise the accepted talks and sort out the schedule. We had a really great set of submissions, with a good mix from industry and academia. Our programme committee did a great job of reviewing them, so it was a fairly simple task to conduct a quick meta-review of the papers, decide where our cut-off point is and then take the top 8 papers forward to the conference. Sadly the fact that we only have a one-day main track this year meant we had to lose some very good submissions, but I’m hopeful those authors will still come along and pitch their discussion topics for the Unconference on the day after DataJConf (and we invited them to in their notification emails). The schedule is now online, and it looks like it’s going to be a really good day. Tickets are selling, and the attendee numbers are ticking up. We were supposed to make a decision this week on which room to go for, the big room or the bigger room, but we put that off to see how numbers look in a week’s time. It’s a bit of a gamble as there’s always a chance that if we need to switch from the room we already have booked the other room will be unavailable by the time we make our minds up, but who doesn’t like a bit of risk in their conference planning?

Wednesday 31st May

A day in which very little was accomplished towards my own goals, but which had to be done. Most of the morning was taken up with a meeting with my counterpart in undergraduate operations, the school manager, and various faces from college about our generally low survey response rates in the School, and how we might do better at communicating with students to foster a culture that encourages these response rates to improve. One of the key points we came up with was that while we’re very good at listening to students as a school, and then acting upon their feedback, we’re pretty rubbish at communicating those actions and changes back to the students. The outcome of this discussion was a need to empower the operations teams for postgrads and undergrads to do more with the various surveys and module feedback questionnaires, to bring actions and recommendations to the teaching and learning quality committee and boards of studies, and to work with the comms team to make sure students know that what they tell us is listened to and acted upon, and is therefore quite important. Essentially it’s about a culture change within the school, and we all know how easy that is, right…

Also had some interesting discussions with my Head of School today about a number of projects I’ve got going on at the moment. I already wrote about trying to coordinate the large number of new programme / programme change approvals that we have happening within the school, but we also discussed a couple of other projects. One, looking at end-of-module feedback, has been going on for a while but is close to being ready for launch. The other was around module review, and how I want to improve that process by moving to a Git-based approach, which will allow better oversight and review of module changes and data collection. I’ll talk more about that as the project develops.

At lunchtime we had our first official meeting with Stuart, our third-year student who is working with us for the summer on our Education Innovation chatbot project. He seems to have really hit the ground running and is getting stuck in to building code and designing solutions. Really great to see, and it looks highly likely that we’re going to have something working to test with students in the Autumn.

The afternoon was taken up with an Academic Approval Event in the School of Modern Languages. I was on the panel as the internal member from outside the college. It’s the second approval event I’ve done, and was a fairly pleasant experience. The programme we were looking at was well thought out, and would clearly be a benefit to the school in question. There were the usual typos and small inconsistencies in the documentation, and we had some recommendations that might improve the student experience, particularly around assessment, where there were a lot of essays that might be replaced with some more interesting types of assessment. Overall though it looks good, and I hope they make a success of it.

While all that was going on, we were hosting a hackday over in Bute, a collaboration with The Bureau Local. A team came over from The Bristol Cable and along with our students spent the day looking at voter numbers within local constituencies. I wrote a tiny write-up over at the CompDJ blog, but I was a bit annoyed I couldn’t get more involved, what with everything else that was going on yesterday. Hopefully I’ll be able to get stuck in at the next one, as I’m sure this won’t be the last hackday-type event the Bureau organises.

Thursday 1st June

Today was spent interviewing students for another of our summer projects, looking at Journalism Education. We’ve been carrying out a data collection experiment since last summer looking at the skill requirements of the media industry as exposed through job advertising and mailing list postings, and now we’re looking to back that up through a qualitative analysis of journalism school educators and their syllabi. We had 12 students from a range of schools express an interest in working on this project with us, and choosing between them is a very hard task indeed. Luckily m’colleague is leading this project, so most of that particular burden falls on him. Hopefully we’ll have someone in place very soon and we can get the third of our summer projects up and running.

Friday 2nd June


Sunday 4th June

My ‘Friday’ was spent working on some analysis of module evaluation feedback. As I mentioned in Tuesday’s notes, we need to do more and better with the feedback given to us by students. I’ve been working for a while on creating some simple dashboards that transform the quite poor output of the module evaluation system into something that is firstly a little more usable by module leaders, but that also looks more like the survey dashboards (NSS, PTES, etc) that we are used to dealing with.

Module Evaluation Dashboards - WiP

The idea is that consistency between the types of visualisations and analysis used will reduce the cognitive burden when trying to assess the feedback and compare across surveys. I’m now starting to put together a system that will create individual dashboards for lecturers and module leaders, and that will also allow comparison between modules on the various programmes and years of our degree schemes, and allow comparison to the school as a whole. With any luck I’ll be able to present this at the next TLAQC and we can start to deliver these to lecturers and operations teams to help them understand what the students are telling them. Today was mostly refactoring my existing analysis code that takes the raw survey data and converts it into percent agree/disagree scores as per the NSS dashboards, and collects the data across the different groupings (years/programmes) of modules.
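As a rough illustration of the percent agree/disagree conversion, here’s a minimal sketch. It assumes responses are coded 0 to 5, counts 4 and 5 as ‘agree’ and 1 and 2 as ‘disagree’, and leaves N/A (0) out of the denominator. Those choices, the sample data and the function name percent_scores are assumptions for illustration, not a description of the real analysis code or of how the NSS dashboards calculate their figures.

```python
def percent_scores(responses):
    """Return (percent_agree, percent_disagree) for a list of 0-5 Likert codes."""
    # Assumption: N/A answers (coded 0) are excluded before calculating percentages.
    answered = [r for r in responses if r != 0]
    if not answered:
        return 0.0, 0.0
    agree = sum(1 for r in answered if r >= 4)     # Agree / Agree Strongly
    disagree = sum(1 for r in answered if r <= 2)  # Disagree / Disagree Strongly
    total = len(answered)
    return 100 * agree / total, 100 * disagree / total


# Hypothetical responses to one question on one module.
responses = [5, 4, 4, 3, 2, 0, 5, 1]
agree, disagree = percent_scores(responses)
print(f"agree: {agree:.1f}%, disagree: {disagree:.1f}%")
```

The same function could be applied per module, per year or per programme by grouping the responses first.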

Weeknotes - 22nd May 2017

May 28, 2017 - by martin

Strange week this week - coming back from holiday, lots of time spent catching up, arranging meetings and organising more meetings for next week.

Monday 22.05.17

Most of Monday morning was spent dealing with all the emails I’d received while away last week. The usual mix of requests for information from admin, queries from current, potential and past students, and a number of things relating to projects that are either about to start or were supposed to have started by now! It took an absolute age to crack through it all. The apocryphal tale of the colleague who just ‘deleted everything’ on the return from holiday with the assumption that anything important would be chased up loomed large in my mind as I replied to my fiftieth message. In a world where ‘responsiveness to communication’ is one of the questions in any number of student feedback surveys, I just don’t think that path is the right one to take.

Monday afternoon saw myself and Glyn working on our talk for Wednesday, taking the usual divide and conquer approach to put together something interesting (we hoped) for the ‘Investigating (with) Big Data’ symposium being held by the Digital Cultures Network.

Tuesday 23.05.17

Another morning of marking this morning. I mentioned last week how pleased I was with the quality of the submissions this year, and it has held up through this latest batch too. The students really seem to have engaged with the module, have thought about what the data says and the message they want to communicate, and have then brought the technical skills to the table to implement their solution. I’m really pleased with how it’s gone. Over halfway through the marking now. It’s supposed to be done by the end of this week, but with two days of training courses and a very busy Wednesday, that’s just not going to happen. I have supplied the necessary apologies to the admin staff and I’m fairly sure they’re not going to hurt me too much.

The afternoon was taken up with meetings with m’colleague, potential CUROP students, and a couple of our MSc CompDJ students who are beginning to think about their dissertation projects for this summer. One of the things Glyn and I discussed was our lack of self-promotion around the activities we do as the ‘Computational and Data Journalism’ team. In the last couple of months we’ve scored research project funding, student project funding, international workshop funding and our students have landed a number of prestigious summer internships, and we’re really not doing a good enough job at shouting about this activity. I’ve resolved to drive this forward a bit better, so came up with a list of potential items for promotion, and I’ll be trying to push those out over the summer, and then keep things ticking over during term time next year.

There was also some movement on the Untappd data project front, as I was finally pushed into responding to my co-investigators with some plans on how to progress from last year’s ICWSM conference paper to a fuller journal paper submission. This is one of those side projects that it’s a real shame to not have more time for, as I think we have a lot of interesting things that we can do, but are all lacking the time to really get stuck in to the analysis. Hopefully we’ll be able to push things forward over summer and get something delivered.

Wednesday 24.05.17

Wednesday started with my first catch up meeting with the DoT for a couple of weeks. I’ve been deputy DoT since September(ish), and we’ve probably not had enough of these meetings. The plan is to make them more regular in the future, and that will probably help with keeping all the plates spinning, as I’m now working on a lot of different projects for the School. We discussed the programme approval process, as we have a number of new programmes in the pipeline as well as some changes to existing programmes going on, and we need to make sure we keep everything coherent. I’ve been tasked with setting up some meetings with the key proposers and the usual suspects within the school to make sure there’s enough coordination going on.

In the afternoon, it was over to John Percival Building to give a talk as part of the ‘Investigating (with) Big Data’ symposium. This was a double hander with m’colleague, and we’d chosen to discuss some issues around large data investigations within news media. Glyn started by presenting some of the more recent large-scale collaborative data investigations that have been carried out by news orgs. I followed that up with a discussion around data openness, transparency, and some of the technical issues that are holding back data journalism. I think the talk went well; people seemed interested and receptive to the ideas we presented.

Sadly I couldn’t hang around for the rest of the symposium as I’d double booked myself for the afternoon, having agreed to go to a briefing for exam board chairs being held over in main building. There are a few new people taking on the exam board chair role within the school, and although I’m not one of them it was ‘decided’ (no idea who by) that I should also attend the briefing, as I’m probably going to be one of the people called upon to step in if the usual chair isn’t available. It was a fairly dull but not entirely useless presentation on the process of getting ready for and dealing with the aftermath of an exam board. It ticks the boxes though, so now I’m trained and can step into that particular set of shoes if necessary.

Thursday 25.05.17 & Friday 26.05.17

Days 2 and 3 of the ‘Leading Teaching Teams’ training programme that I’d managed to score a place on. This part of the programme was run by the Leadership Foundation for Higher Education, and was probably one of the best training courses I’ve been on so far. I spent a long time reflecting on the way I work, and it really delivered some useful insights. We did a lot of self-assessment and analysis of how our individual approaches may or may not be helpful in managing teams, and I’m looking forward to putting some of the ideas into practice.

As with many of these training courses, one of the added benefits was being able to spend time with colleagues from across the University. It’s always fascinating to find out how others work and to hear about common problems or issues across different schools and colleges, and how they’ve been solved (or not!) in different ways. It’s also nice to get an opportunity to discuss things and to hear that others feel the same way. There was a lot of discussion and dissatisfaction expressed over the 2 days about the increased corporatisation and commodification of Higher Education. I’d love to tell you that we’ve solved that particular issue, but sadly not. Many did get righteously angry about it though. I suspect a higher societal change is needed to fix it, and all we can do at this level is to keep pushing for that change.

Weeknotes - 14th May 2017

May 15, 2017 - by martin

A very good week this week, in that I was only in work for 3 days, but still accomplished a lot. Through some convenient meetings I’ve managed to get a whole mess of projects lined up for the next year or so, and I came out of the week on Friday very eager to get cracking on things.

Monday 8.5.17

This week started with our Assessment and Feedback focused teaching ‘away day’, which wasn’t really an away day because we didn’t go anywhere, but which was incredibly useful nonetheless. Put together by myself, Andrew (DoT) and Helen (A&F lead), the event was attended by a good number of teaching staff within the school and allowed us to spend the day thinking about our teaching practice and the way that we do things within COMSC.

We took a look at the upcoming Cardiff University commitments and principles around assessment and feedback, and considered how well our assessment lines up with some of the ideas within this draft of the document. A surprising amount of assessment within the school is some variation of ‘build this project in language X using paradigm Y and assess how well it performs in terms of property Z, then write a report on it’, and it turns out that trying to work out how well that corresponds with a 4,000 word essay is quite a challenge. Discussion around this topic also highlighted how the National Software Academy have done a good job of using larger projects as assessment for a number of different modules, something that we could do more in the BSc Computer Science, as currently there are a lot of (too many?) larger assessments within each module. Combining these makes a lot of sense - for instance why not have a software project in the first year that gets assessed for both the ‘OO Development in Java’ and the ‘Developing Quality Software’ modules, rather than a separate project in each module?

In a session that I chaired we looked at Learning Outcomes of our modules - with a particular focus on how well they match with assessment or are assessable. We also looked at trying to get a handle on the year level learning outcomes for BSc Computer Science to make sure they are up to date and relevant.

The final session of the morning saw us covering exam feedback, and how we can provide this to students in a useful fashion. The afternoon saw some discussion around a few different projects that aim to help give visibility to the workload that assessment gives to both staff and students. One project from a team in the University seemed almost useful, but focused too much on deadlines, with little regard to start dates, duration and effort. So, as a tool to help prevent deadline bunching it was great, but to actually monitor workload it was less than great. George is working on a project as part of the Cardiff Futures project that promises to deliver what I think is needed (essentially automating the creation of the coursework timetables that we delivered at the beginning of this year), and hopefully that will be taken up by central University, as effective communication of this information is a key part of helping students and staff manage their workload.

Tuesday 9.5.17

After a fairly involved and in-depth Monday, Tuesday was a day of playing catch up with admin and sorting things out before my week off next week. First things first was my PDR. This was my first PDR, having come off probation last September a few months early because I was fed up of not getting paid enough. I thought it had been a pretty decent year, and Andrew seemed to agree. I agreed some interesting objectives for the next year that were basically things I’ve been wanting to do for the last few months, and are all things I’m looking forward to getting stuck in to over the next 12 months.

In the afternoon I met with a few students who are interested in our summer CUROP project doing some analysis and visualisation of the Creative Cardiff data, which was fun as it’s always good to meet with interested and engaged students who are keen to get involved with research projects. Still got a couple more students to meet with, but hopefully we’ll have someone for this project relatively soon.

I also met up with m’colleague on Tuesday afternoon and we did some more planning for the next few months. The textbook we’re writing is coming along, and we’ve identified an opportunity to get some excellent input to the book from the attendees of the Data and Computational Journalism Conference we’re organising in Dublin. We discussed the upcoming intake of students for the next academic year, and the progress our current students are making on their dissertation project pitches. We also solidified our publication plans for the next six months - with a couple of decent journal papers in the pipeline alongside a couple of decent conference presentations it’s looking like a strong finish to the year.

Wednesday 10.5.17

I took a day off today to have a sneaky date with my wife for her birthday (which isn’t really until later this month). We went off to the theatre to watch OmiDaze doing Romeo and Juliet, which was very enjoyable, then to a bar in the bay for lunch which was very tasty (food) and very average (beer).

Thursday 11.5.17

Worked from home today ploughing through the CMT212 marking. The quality this year is incredibly high, and I’m very pleased with how the students have analysed and visualised their data. I’ve had a few students submitting data analysis in R, a lot of Python, and then the majority of the visualisation so far has been some very good D3 code. If the standard keeps up across the whole cohort when I get round to marking the rest I’ll be very happy indeed.

Friday 12.5.17

Nada. Day off again (2 in 1 week!) packing and preparing for the week off. Mad day rushing around with Arthur collecting parcels, packing bags, and trying to optimise the fitting of bags into the car boot so that we could get both ourselves and the luggage in the car at the same time.

Scraping the Assembly

November 2, 2016 - by martin

M’colleague is currently teaching a first-semester module on Data Journalism to the students on our MSc in Computational and Data Journalism. As part of this, they need to do some sort of data project. One of the students is looking at the expenses of Welsh Assembly Members. These are all freely available online, but not in an easy-to-manipulate form. According to the Assembly they’d be happy to give the data out as a spreadsheet, if we submitted an FOI.

To me, this seems quite stupid. The information is all online and freely accessible. You’ve admitted you’re willing to give it out to anyone who submits an FOI. So why not just make the raw data available to download? This does not sound like a helpful Open Government to me. Anyway, for whatever reason, they’ve chosen not to, and we can’t be bothered to wait around for an FOI to come back. It’s much quicker and easier to build a scraper! We’ll just use Selenium to drive a web browser, submit a search, page through all the results collecting the details, then dump it all out to CSV. Simple.

Scraping AM expenses

I built this as a quick hack this morning. It took about an hour or so, and it shows. The code is not robust in any way, but it works. You can ask it for data from any year (or a number of years) and it’ll happily sit there churning its way through the results and spitting them out as both .csv and .json.
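The output step is the easy bit. As a minimal sketch using only the standard library, assuming the scraped results have already been collected as a list of dictionaries (the records and field names below are invented for illustration and aren’t taken from the actual scraper):

```python
import csv
import io
import json

# Hypothetical scraped records; the field names are made up for this example.
records = [
    {"member": "Example AM", "year": "2015-16", "category": "Travel", "amount": "12.50"},
    {"member": "Another AM", "year": "2015-16", "category": "Office", "amount": "100.00"},
]


def to_csv(rows):
    """Serialise a list of dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialise the same rows to indented JSON text."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records)
json_text = to_json(records)
print(csv_text)
print(json_text)
```

Writing the two strings to files with open(..., "w") gives the .csv and .json outputs.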

All the code is available on GitHub and it’s under an MIT Licence. Have fun 😉