Scraping the Assembly…

M’colleague is currently teaching a first-semester module on Data Journalism to the students on our MSc in Computational and Data Journalism. As part of this, they need to do some sort of data project. One of the students is looking at the expenses of Welsh Assembly Members. These are all freely available online, but not in an easy-to-manipulate form. According to the Assembly, they’d be happy to give the data out as a spreadsheet if we submitted an FOI.

To me, this seems quite stupid. The information is all online and freely accessible. You’ve admitted you’re willing to give it out to anyone who submits an FOI. So why not just make the raw data available to download? This does not sound like helpful Open Government to me. Anyway, for whatever reason, they’ve chosen not to, and we can’t be bothered to wait around for an FOI to come back. It’s much quicker and easier to build a scraper! We’ll just use Selenium to drive a web browser, submit a search, page through all the results collecting the details, then dump it all out to CSV. Simple.
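For illustration, a minimal sketch of that Selenium pattern looks something like the following. The URL, form field names and page structure here are made up for the example; the real scraper on GitHub deals with the actual site.

# Minimal sketch of the scraping approach, NOT the real scraper.
# The URL, element names and page structure below are hypothetical.
import csv

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://example-assembly-expenses.wales/search")  # hypothetical search page

# submit a search for a given year (field names are assumed)
driver.find_element(By.NAME, "year").send_keys("2015")
driver.find_element(By.NAME, "submit").click()

rows = []
while True:
    # collect the details from each result row on the current page
    for result in driver.find_elements(By.CSS_SELECTOR, "table.results tr"):
        rows.append([cell.text for cell in result.find_elements(By.TAG_NAME, "td")])

    # page through the results until there is no 'Next' link left
    next_links = driver.find_elements(By.LINK_TEXT, "Next")
    if not next_links:
        break
    next_links[0].click()

# dump it all out to CSV
with open("expenses.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

driver.quit()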

[Image: Scraping AM expenses]

I built this as a quick hack this morning. It took about an hour or so, and it shows. The code is not robust in any way, but it works. You can ask it for data from any year (or a number of years) and it’ll happily sit there churning its way through the results and spitting them out as both .csv and .json.

All the code is available on GitHub and it’s under an MIT Licence. Have fun 😉

Sustainable Software Institute – Research Data Visualisation Workshop

Last week I gave a talk and delivered a hands-on session at the Sustainable Software Institute’s ‘Research Data Visualisation Workshop’, which was held at Manchester University. It was a really engaging event, with a lot of good discussion on the issues surrounding data visualisation.

Professor Jessie Kennedy from Edinburgh Napier University gave a great keynote looking at some key design principles in visualisation, including a number of studies I hadn’t seen before but will definitely be including in my teaching in future.

I gave a talk on ‘Human Science Visualisation’, which really focused on a couple of key issues. Firstly, I tried to illustrate the importance of interactivity in complex visualisations. I then talked about how we as academic researchers need to publish our interactive visualisations for posterity, and how we should press academic publishers to help us communicate our data to readers. Finally, I wanted to point people towards the excellent visualisation work being done by data journalists; newsrooms are an excellent source of ideas and tips for data visualisation. The slides for my talk are here. It was the first time I’d spoken about visualisation outside of the classroom, and it was a really fun talk to give.

We also had two great talks from Dr Christina Bergmann and Dr Andy South, focusing on issues of biological visualisation and mapping respectively. All the talks generated some good discussion both in the room and online, which was fantastic to see.

In the afternoon I led a hands-on session looking at visualising data using d3. This was the first time I’d taught a session using d3 v4, which made things slightly interesting. I’m not fully up to speed with all the areas of the API that have changed, so getting the live coding right first time was a bit tricky, but I think I managed. Interestingly, I feel that the changes made to the .data(), .exit(), .enter(), update cycle, as discussed in Mike Bostock’s “What Makes Software Good”, make a lot more sense from a teaching perspective. The addition of .merge() in particular helps a great deal. As you might expect from a d3 workshop that lasted a mere three hours, I’m not entirely convinced that everybody ‘got’ it, but I think most went away satisfied.

Overall it was a very successful workshop. Raniere Silva did an excellent job putting it together and running the day, and I really enjoyed it. I’m looking forward to seeing what other people thought about it too.

Personality and Places

Our paper examining the link between individual personality and the places people visit has just been published in Computers in Human Behavior. It’s open access, so you can go read it for free, now!

In an experiment we ran previously, we asked users of Foursquare to take a personality test and give us access to their checkin history. The personality test gives us a measure of how each person scores for five different factors: Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism. The checkin history lists all the places they’ve ever checked in to using Foursquare. Because a couple of hundred people took part in the experiment, we ended up with a large number of individual personalities that we could link to over a hundred thousand venues. In total, this represents a pretty staggering half a million Foursquare checkins that we have personality data associated with.

Our first step with this data has been to see if there are any links between personality factors and the places people choose to visit, and we found some interesting connections.

One of our main findings is that the use of Foursquare for recording checkins seems to correlate well with Conscientiousness. The more conscientious a user is, the more likely they are to have checked in at more places and to have visited more venues. This could be because people with a high Conscientiousness score tend to be quite organised and disciplined, and so are more likely to remember to check in at every place they visit.

The opposite is true for Neuroticism: the more neurotic an individual is, the fewer places they have visited. Neuroticism is associated with negative feelings and a tendency to be less social, which could translate into people going to fewer places, and so checking in less. This pattern appears again when we look at only those venues classed as ‘social’ (i.e. somewhere you would go to hang out with friends): the more neurotic someone is, the fewer ‘social’ venues they have been to.

Surprisingly, we have found no link between Extraversion and the number of social venues visited. It might be expected that extraverts (who are very social by nature) would go to more social venues. However, the data does not support this. In fact, we find no link between Extraversion and any aspect of Foursquare checkins that we have examined so far.

The personality factor of Openness is related to feelings of creativity and artistic expression, and a willingness to experience new things. It is interesting to find that there is a link between Openness and the average distance travelled between checkins: the more Open an individual is, the further they tend to have travelled. This could be an Open individual’s desire to experience new things expressing itself through wider travel, and a larger geographic spread of checkins. However, we do not find any link between Openness and the number of different categories visited by a user, so we do not see that desire for new experiences expressing itself in the range and diversity of places visited.

Ultimately, this data could be incredibly useful in improving venue recommendation systems. Current systems use many different information ‘cues’ to recommend places a user might like to visit. These cues include things such as where they have been in the past, where their friends have been, or what is popular nearby. Perhaps by including aspects of an individual’s personality (and so aspects of why they might visit somewhere) we can increase the usefulness of these recommendations.
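As a purely illustrative sketch (not anything from the paper), a recommender might fold a personality term into the scores it already builds from those cues; every weight, field name and adjustment below is invented for the example.

# Hypothetical illustration: folding a personality cue into a venue score.
# The cues, weights and adjustment are invented, not taken from the paper.
def venue_score(venue, user):
    score = 0.0
    score += 0.4 * venue["visited_before"]    # where they have been in the past
    score += 0.3 * venue["friends_visited"]   # where their friends have been
    score += 0.2 * venue["local_popularity"]  # what is popular nearby

    # personality-based adjustment, e.g. down-weight 'social' venues
    # for users with a high Neuroticism score
    if venue["is_social"]:
        score -= 0.1 * user["neuroticism"]
    return score

venues = [
    {"name": "Cafe A", "visited_before": 1, "friends_visited": 0,
     "local_popularity": 0.8, "is_social": True},
    {"name": "Museum B", "visited_before": 0, "friends_visited": 1,
     "local_popularity": 0.5, "is_social": False},
]
user = {"neuroticism": 0.7}

ranked = sorted(venues, key=lambda v: venue_score(v, user), reverse=True)
print([v["name"] for v in ranked])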

There is still a lot of analysis to be done on this data, and Nyala Noe and I are busy churning through it to discover other links between personality and the places people visit. As we find more interesting connections, I’ll post more here.

Extended Mind Crowdsourcing

Update 13/01/15: the paper containing the research described below is currently available from the HICSS website

This post is one I’m cross-posting both here and on the MobiSoc blog. Here, because it’s my personal translation of one of our latest research papers, and there because it’s a very good paper mostly written and driven by Roger Whitaker, so deserves an ‘official’ blog post!

Crowdsourcing is widely used in both business and academia. Business likes it because it allows simple tasks to be outsourced for a small cost. Researchers like it because it allows the gathering of large amounts of data from participants, again for minimal cost. (For an example of this, see our TweetCues work (paper here), where we paid Twitter users to take a simple survey and massively increased our sample size for a few dollars.) As technology develops, we can apply crowdsourcing to new problems, particularly those concerned with collective human behaviour and culture.

Crowdsourcing

The traditional definition of crowdsourcing involves several things:

  1. a clearly defined crowd
  2. a task with a clear goal
  3. clear recompense received by the crowd
  4. an identified owner of the task
  5. an online process

The combination of all these things allows us to complete a large set of simple tasks in a short time and often for a reduced cost. It also provides access to global labour markets for users who may not previously have been able to access these resources.

[Image: crowdsourcing]

Participatory Computing

Participatory computing is a concept related to crowdsourcing, based around the idea that the resources and data of computing devices can be shared and used to complete tasks. As with crowdsourcing, these tasks are often large, complex and data-driven, but capable of being broken down into smaller chunks that can be distributed to separate computing devices in order to complete the larger task. BOINC is a clear example of this class of participatory computing.

[Image: participatory computing]

Extended Mind Crowdsourcing

The extended mind hypothesis describes the way that humans extend their thinking beyond the internal mind, to use external objects. For instance, a person using a notebook to record a memory uses the ‘extended mind’ to record the memory; the internal mind simply recalls that the memory is located in the notebook, an object that is external to the individual.

Extended mind crowdsourcing takes crowdsourcing and participatory computing a step further by incorporating the extended mind hypothesis. It describes systems that use the extended mind of participants, as represented by their devices and objects, to add implicit as well as explicit human computation for collective discovery.

[Image: extended mind crowdsourcing]

What this means is that we can crowdsource the collection of data and completion of tasks using individual users, their devices, and the extended mind that the two together represent. By accessing the information stored within a smartphone or similar personal device, and the wider internet services that the device can connect to, we can access the extended mind of a participant and learn more about their behaviour and individual characteristics. In essence, extended mind crowdsourcing captures the way in which humans undertake and respond to daily activity. In this sense it supports observation of human life and our interpretation of and response to the environment. By including social networks and social media communication within the extended mind, it is clear that while an individual extended mind may represent a single human, it is also possible to represent a group, such as a network or a collective, using extended mind crowdsourcing.

By combining the ideas of social computing, crowdsourcing, and the extended mind, we are able to access and aggregate the data that is created through our use of technology. This allows us to extend ideas of human cognition into the physical world, in a less formal and structured way than other forms of human computational systems. The reduced focus on task-driven systems allows EMC to be directed at solving loosely defined problems, and those problems where we have no initial expectations of solutions or findings.

This is a new way of thinking about the systems we create in order to solve problems using computational systems focused on humans, but it has the potential to be a powerful tool in our research toolbox. We are presenting this new Extended Mind Crowdsourcing idea this week at HICSS.

Quick and Dirty Twitter API in Python

QUICK DISCLAIMER: this is a quick and dirty solution to a problem, so may not represent best coding practice, and has absolutely no error checking or handling. Use with caution…

A recent project required me to scrape some data from Twitter. I considered using Tweepy, but as it was a project for the MSc in Computational Journalism, I thought it would be more interesting to write our own simple Twitter API wrapper in Python.

The code presented here will allow you to make any API request to Twitter that uses a GET request, so it is really only useful for getting data from Twitter, not sending data to it. It is also only for use with the REST API, not the streaming API, so if you’re looking for real-time monitoring, this is not the API wrapper you’re looking for. This wrapper also uses a single user’s authentication (yours), so it is not set up to allow other users to use Twitter through your application.

The first step is to get some access credentials from Twitter. Head over to https://apps.twitter.com/ and register a new application. Once the application is created, you’ll be able to access its details. Under ‘Keys and Access Tokens’ are the four values we’re going to need for the API: the Consumer Key and Consumer Secret, and the Access Token and Access Token Secret. Copy all four values into a new Python file, and save it as ‘_credentials.py’. The images below walk through the process. Also, don’t try to use the credentials from these images; the app has already been deleted, so they won’t work!
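The code below expects those four values to be importable from that file as client_id, client_secret, access_token and access_secret (the names used in the wrapper), so a minimal ‘_credentials.py’ might look something like this, with placeholder values:

# _credentials.py
# placeholder values: substitute the four values from your own app's
# 'Keys and Access Tokens' page
client_id     = "YOUR_CONSUMER_KEY"
client_secret = "YOUR_CONSUMER_SECRET"
access_token  = "YOUR_ACCESS_TOKEN"
access_secret = "YOUR_ACCESS_TOKEN_SECRET"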

Once we have the credentials, we can write some code to make some API requests!

First, we define a Twitter API object that will carry out our API requests. We need to store the API URL, and some details that allow us to throttle our requests to Twitter to fit inside their rate limiting. The imports at the top pull in everything the class needs, including the credentials we just saved in ‘_credentials.py’.

import base64
import hmac
import json
import threading
import time
import urllib.error
import urllib.parse
import urllib.request
import uuid
from hashlib import sha1

# the four values saved in _credentials.py
from _credentials import (client_id, client_secret,
                          access_token, access_secret)


class Twitter_API:

    def __init__(self):

        # URL for accessing API
        scheme = "https://"
        api_url = "api.twitter.com"
        version = "1.1"

        self.api_base = scheme + api_url + "/" + version

        #
        # seconds between queries to each endpoint
        # queries in this project are limited to 180 per 15 minutes,
        # so we use 175 to leave a little headroom
        query_interval = float(15 * 60) / 175

        #
        # rate limiting timer
        self.__monitor = {'wait': query_interval,
                          'earliest': None,
                          'timer': None}

We add a rate limiting method that will make our API sleep if we are requesting things from Twitter too fast:

    #
    # rate_controller puts the thread to sleep
    # if we're hitting the API too fast
    def __rate_controller(self, monitor_dict):

        #
        # join the timer thread
        if monitor_dict['timer'] is not None:
            monitor_dict['timer'].join()

        # sleep if necessary
        if monitor_dict['earliest'] is not None:
            while time.time() < monitor_dict['earliest']:
                time.sleep(monitor_dict['earliest'] - time.time())

        # work out when the next API call can be made
        earliest = time.time() + monitor_dict['wait']
        timer = threading.Timer(earliest - time.time(), lambda: None)
        monitor_dict['earliest'] = earliest
        monitor_dict['timer'] = timer
        monitor_dict['timer'].start()

The Twitter API requires us to supply authentication headers in the request. One of these headers is a signature, created by encoding details of the request. We can write a function that will take in all the details of the request (method, url, parameters) and create the signature:

    #
    # make the signature for the API request
    def get_signature(self, method, url, params):

        # percent-encode all parameter keys and values
        encoded_params = {}
        for k, v in params.items():
            encoded_k = urllib.parse.quote_plus(str(k))
            encoded_v = urllib.parse.quote_plus(str(v))
            encoded_params[encoded_k] = encoded_v

        # sort the parameters alphabetically by key
        sorted_keys = sorted(encoded_params.keys())

        # create a 'key=value&key=value' string from the parameters
        signing_string = ""

        count = 0
        for key in sorted_keys:
            signing_string += key
            signing_string += "="
            signing_string += encoded_params[key]
            count += 1
            if count < len(sorted_keys):
                signing_string += "&"

        # construct the base string
        base_string = method.upper()
        base_string += "&"
        base_string += urllib.parse.quote_plus(url)
        base_string += "&"
        base_string += urllib.parse.quote_plus(signing_string)

        # construct the signing key from the two secrets
        signing_key = urllib.parse.quote_plus(client_secret) + "&" + \
            urllib.parse.quote_plus(access_secret)

        # HMAC-SHA1 the base string with the key, and base64 encode the result
        hashed = hmac.new(signing_key.encode(), base_string.encode(), sha1)
        signature = base64.b64encode(hashed.digest())
        return signature.decode("utf-8")

Finally, we can write a method to actually make the API request:

    def query_get(self, endpoint, aspect, get_params={}):

        #
        # rate limiting
        self.__rate_controller(self.__monitor)

        # ensure we're dealing with strings as parameters
        str_param_data = {}
        for k, v in get_params.items():
            str_param_data[str(k)] = str(v)

        # construct the query url
        url = self.api_base + "/" + endpoint + "/" + aspect + ".json"

        # add the header parameters for authorisation
        header_parameters = {
            "oauth_consumer_key": client_id,
            "oauth_nonce": uuid.uuid4(),
            "oauth_signature_method": "HMAC-SHA1",
            "oauth_timestamp": int(time.time()),  # integer seconds, as OAuth expects
            "oauth_token": access_token,
            "oauth_version": 1.0
        }

        # collect all the parameters together for creating the signature
        signing_parameters = {}
        for k, v in header_parameters.items():
            signing_parameters[k] = v
        for k, v in str_param_data.items():
            signing_parameters[k] = v

        # create the signature and add it to the header parameters
        header_parameters["oauth_signature"] = self.get_signature(
            "GET", url, signing_parameters)

        # build the OAuth Authorization header string
        header_string = "OAuth "
        count = 0
        for k, v in header_parameters.items():
            header_string += urllib.parse.quote_plus(str(k))
            header_string += "=\""
            header_string += urllib.parse.quote_plus(str(v))
            header_string += "\""
            count += 1
            if count < len(header_parameters):
                header_string += ", "

        headers = {
            "Authorization": header_string
        }

        # create the full url including parameters
        url = url + "?" + urllib.parse.urlencode(str_param_data)
        request = urllib.request.Request(url, headers=headers)

        # make the API request
        try:
            response = urllib.request.urlopen(request)
        except urllib.error.HTTPError as e:
            print(e)
            raise e
        except urllib.error.URLError as e:
            print(e)
            raise e

        # read the response and return the json
        raw_data = response.read().decode("utf-8")
        return json.loads(raw_data)

Putting this all together, we have a simple Python class that acts as an API wrapper for GET requests to the Twitter REST API, including the signing and authentication of those requests. Using it is as simple as:

ta = Twitter_API()

# retrieve tweets for a user
params = {
   "screen_name": "martinjc",
}

user_tweets = ta.query_get("statuses", "user_timeline", params)
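Any other GET endpoint on the REST API works the same way; for example, a simple keyword search against the v1.1 search/tweets endpoint (the endpoint and parameter names here are as documented by Twitter, not something specific to this wrapper):

# search the REST API for recent tweets matching a query
params = {
    "q": "datajournalism",
    "count": 50,
}

search_results = ta.query_get("search", "tweets", params)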

As always, the full code is online on GitHub, in both my personal account and the account for the MSc in Computational Journalism.

How do people decide whether or not to read a tweet?

It turns out that an existing relationship with the author of the tweet is one of the main factors influencing how someone decides whether or not to read it. At the same time, large numbers associated with a tweet (such as retweet or follower counts) can also make it more attractive to readers.

Our latest Open Access research looks at how much effect the information surrounding a tweet, rather than its content, has on whether people decide to read it.

By showing hundreds of Twitter users the information about two tweets but not the tweets themselves, and then asking the users which tweet they would like to read, we have been able to look at which information is more important when users are deciding to read a tweet.

We looked at two different types of information:

  1. Simple numbers that describe the tweet, such as the number of retweets it has, or numbers that describe the author, such as how many followers they have, or how many tweets they’ve written.
  2. Whether a relationship between the reader and the author is important, and whether that relationship is best shown through subtle hints or direct information.

When readers can see only one piece of information, the case is clear: they’d rather read the tweet written by someone they are following. Readers can easily recognise the usernames, names, and profile images of people they already follow, and are likely to choose to read content written by someone they follow (instead of content written by a stranger) around 75% of the time. If all they can see is a piece of numerical information, they would rather read the tweet with the highest number, no matter what that number is. The effect is strongest with the number of retweets, followed by the number of followers, but even for the number of accounts followed and the number of tweets written the effect is significant.

When readers can see two pieces of information, one about their relationship with the author, and one numerical, there are two cases to look at. When the author they follow also has a high numerical value, readers will choose that tweet in around 80% of the cases. When the author they already follow has a lower numerical value, it is still the existing relationship that is more of a draw. Readers would rather read a tweet from someone they know that has a low number of retweets, than one from a stranger with a high number of retweets.

This work offers an understanding of how the decision-making process works on Twitter when users are skimming their timelines for something to read, and has particular implications for the display and promotion of non-timeline content within content streams. For instance, readers may pay more attention to adverts and promoted content if the link between themselves and the author is highlighted.

Previous results from an earlier experiment were published at SocialCom. The results in this new paper are from a modified and expanded version of that earlier experiment.