olihawkins

Weeknotes: Week 38, 2020

18 Sep 2020 14:28 GMT

I have always been sceptical about the value of writing weeknotes. There are a million and one things I want to get done at work, and writing weeknotes always felt like time spent not doing any of them.

However, in the last few months I have started to notice how working from home has subtly changed the way we work. Some of these changes are good: I get longer periods of uninterrupted time between meetings, which makes it easier to be productive with development work.

But some are less good. In particular, with so many people working from home there is less opportunity for serendipity: those casual conversations where you find out that what you are doing joins up neatly with something a colleague is doing, and you both benefit.

So while we're all keeping our distance, I thought it might help to know what I'm up to.

House of Commons Library GitHub account

This week we launched our House of Commons Library GitHub account. Since we started the data science programme two years ago, researchers in the Library have been working on tools to make routine data analysis easier. Some of this work is potentially useful to researchers outside Parliament, and we share code that has a general purpose when we can.

Until now we have all been using personal GitHub accounts, but some of the tools we have developed have become important enough that we need to maintain and manage them in one place. I hope that having an organisational account will also make collaboration and training easier. It's been a real pleasure seeing statistical researchers with no previous programming experience discover its potential and get really good at it. We are now in a position to work collaboratively on shared projects as a team.

Parliamentary data

I spent a good part of this week working on a new R package for downloading data from Parliament's Committees API. The package is called clcommittees and I've published an initial working version on GitHub. We use data on committees in enquiries, briefings and dashboards. The package currently focusses on committees, memberships and roles, but it is likely to grow to include other data as and when we need it.
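Under the hood, this kind of package is essentially making HTTP requests to the API and turning the JSON responses into data frames. As a rough illustration (this is not the package's actual code, and the endpoint path is an assumption, so check the API's own documentation for the real routes), a request for the list of committees might look something like this.

# Illustrative only: request committee data from the Committees API and
# parse the JSON response into a data frame. The endpoint path here is
# an assumption, not taken from the package.
library(httr)
library(jsonlite)

response <- GET("https://committees-api.parliament.uk/api/Committees")
stop_for_status(response)

committees <- fromJSON(content(response, as = "text", encoding = "UTF-8"))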

This package is part of a wider programme of work to develop our capability with Parliamentary data. We are developing tools to work with a range of Parliamentary data sources (including the data platform, now that it has been unpaused), so watch this space.

New chart style for the Library

This week, the Library launched its new chart style. The style was developed by my colleagues Carl Baker and Paul Bolton. I've implemented the style as a ggplot2 theme in an R package called clcharts, so that Library researchers can easily produce charts in the new style when doing computational data analysis. To give you a flavour, here's an example of a ridgeline plot made with the package.

A smoothed ridge chart showing the distribution of Parliamentary constituencies by median age and settlement class

You can see more examples in the GitHub readme at the link above. I think the new style looks great, and thanks to patchwork we have been able to fully implement it in R, which wasn't the case with our old chart style.
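For anyone curious about what implementing a style as a ggplot2 theme involves: a theme is just a function that returns a theme object, which you add to a plot like any other layer. The snippet below is a generic sketch of that approach rather than the actual clcharts code, and the theme name, fonts and settings in it are placeholders.

# A generic sketch of packaging a house style as a ggplot2 theme: the
# name and settings below are placeholders, not the clcharts style
library(ggplot2)

theme_example <- function(base_size = 12, base_family = "sans") {
    theme_minimal(base_size = base_size, base_family = base_family) +
        theme(
            panel.grid.minor = element_blank(),
            panel.grid.major.x = element_blank(),
            plot.title = element_text(face = "bold"),
            legend.position = "top")
}

# Apply the theme to a plot in the usual way
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    labs(title = "Engine size and fuel economy") +
    theme_example()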

MSOA Names 1.5.0

Carl and I updated the MSOA Names dataset and map to version 1.5.0 to fix a couple of errors people had spotted. The dataset has been turning up everywhere from Public Health England's map of coronavirus cases to the Doogal postcode lookup service.

Weeknotes: Week 39, 2020

27 Sep 2020 15:32 GMT

This has been an interesting week that I mostly can't talk about. I have been working on something that hasn't yet launched. In the meantime, here are a few things that happened that I can talk about.

Parliamentary data

The week started with Elise and me working on improving the code she has written to collect and integrate data for her dashboard on Parliamentary activities.

You might be surprised by how much work it takes to pull this data together. Elise has done an amazing job. The data comes from a combination of public APIs, automated use of internal search tools, and web scraping of data that is not otherwise published as a structured dataset.

Combining data from such a varied range of sources is not easy. In many cases the data lacks unique identifiers, so there is no option but to join datasets by normalising names, then using fuzzy string matching, and then manually checking cases where the join failed. With such a potentially error-prone process, we want to be sure the code is robust and that we have a thorough understanding of all the potential sources of error in the underlying data.
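To give a rough idea of what that looks like in practice, here is a minimal sketch of the normalise-then-fuzzy-match approach. It is not Elise's actual code: the datasets, column names and distance threshold are all made up for the example.

library(dplyr)
library(stringr)
library(stringdist)

# Two illustrative datasets with no shared identifier: the names and
# columns here are made up for the example
members_a <- tibble(name = c("Rt Hon Jane Smith MP", "John  Doe"))
members_b <- tibble(name = c("Jane Smith", "Jon Doe"), party = c("X", "Y"))

# Normalise names before matching: lower case, strip titles and
# punctuation, collapse whitespace
normalise_name <- function(x) {
    x %>%
        str_to_lower() %>%
        str_remove_all("\\b(rt hon|mp)\\b") %>%
        str_replace_all("[[:punct:]]", "") %>%
        str_squish()
}

members_a <- mutate(members_a, name_clean = normalise_name(name))
members_b <- mutate(members_b, name_clean = normalise_name(name))

# First attempt an exact join on the normalised names
joined <- left_join(members_a, members_b,
    by = "name_clean", suffix = c("_a", "_b"))

# Where the exact join failed, fall back to fuzzy matching and keep the
# results aside for manual checking
failed <- is.na(joined$party)
joined$fuzzy_match <- NA_character_
joined$fuzzy_match[failed] <- members_b$name_clean[
    amatch(joined$name_clean[failed], members_b$name_clean,
        method = "jw", maxDist = 0.15)]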

We are working towards making it easier to use these data sources. We have encapsulated some of our routine data processing tasks in packages like clcommittees and Noel's parlygroups package. But there are a number of data sources for which we still need to build tools to use efficiently. And until the data platform is fully restored and renewed, linking these datasets will always be problematic.

On Wednesday, Anya and I met with Chris, head of the Library's Parliament and Constitution Centre, to take forward his idea for a regular meet-up of Parliamentary data users within the Library. The idea is that if we can learn a bit more about what we each need most urgently, we can identify which work to prioritise. The first full meeting of the group will be in October.

Data wrangling

This week, one of our researchers ran into a data wrangling problem that has come up a couple of times before, so I am going to post the best solution I have found so far.

Suppose you have some address level data, where the address itself is stored in a free text field. For each record, you need to extract the postcode from the address text and use it to link the data to the ONS postcode directory, so that you can aggregate the data to a higher-level geography, like a constituency.

The best way I have found to do this in R is to use str_extract from the stringr package on the address text with the following regular expression.


# Import stringr
library(stringr)

# Greedy regular expression for postcodes
POSTCODE_REGEXP <- str_c(
    "([A-Za-z][A-Ha-hJ-Yj-y]?[0-9][A-Za-z0-9]? ",
    "?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})")

# Extract the first matching postcode in each address 
df$postcode <- str_extract(df$address, POSTCODE_REGEXP)


This regular expression will match postcodes even if they are missing the space separating the outward and inward parts, so it will still catch postcodes that have been misrecorded in this way. Consequently, if you remove any spaces from inside the postcodes in both datasets when you perform the join (create new columns for joining if you want to preserve the original data), then you get a higher match rate when linking the data.
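For example, the join might look something like this (a minimal sketch, assuming the ONS postcode directory has been read into a data frame called postcode_directory with the postcode in a column called pcds; the object and column names are assumptions).

library(dplyr)
library(stringr)

# Create space-free, upper-case postcode columns in both datasets for
# joining, leaving the original columns intact
df$postcode_join <- str_to_upper(str_remove_all(df$postcode, "\\s"))
postcode_directory$postcode_join <-
    str_to_upper(str_remove_all(postcode_directory$pcds, "\\s"))

# Join on the normalised postcode
df <- left_join(df, postcode_directory, by = "postcode_join")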

Note that the str_extract call above only extracts the first matching postcode in each address, so you may want to double check that there is only one postcode per address with str_extract_all before relying on this strategy.
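A quick way to flag those cases for review is to count the matches per address, following on from the code above.

# Count the postcodes found in each address and flag any record with
# more than one match for manual review
postcode_counts <- lengths(str_extract_all(df$address, POSTCODE_REGEXP))
df$multiple_postcodes <- postcode_counts > 1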

R training

On Tuesday I ran the second module of the introductory R training for Georgina and Richard, which is all about using the tidyverse. I really like teaching this module because it introduces programming for the first time. One of the remarkable things about the tidyverse is that you can start by teaching techniques that researchers can benefit from using right away.
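To give a flavour, the sort of thing I mean is a short pipeline like the one below: a toy example using a dataset that ships with dplyr, not the actual training material.

library(dplyr)

# A small, immediately useful pipeline: summarise a built-in dataset by
# group and sort the result
starwars %>%
    group_by(species) %>%
    summarise(mean_height = mean(height, na.rm = TRUE)) %>%
    arrange(desc(mean_height))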

I think this is probably R's biggest advantage over other programming languages when it comes to teaching absolute beginners. The number of things you need to know before you can start using other languages to do useful data analysis is much larger (and I say this as someone who prefers Python overall).

Of course, this simplicity doesn't come entirely for free. Some of the apparently magical things that the tidyverse does through non-standard evaluation eventually need to be explained, and that's not trivial.

However, if you are learning programming specifically for data analysis, and you don't have a background in any programming language, I think R has a lower barrier to entry than everything else. And that is mainly thanks to the tidyverse.

Interactive visualisations and ES2015

I spent much of this week working on a new interactive visualisation for the Library. I am not going to talk about it before it's published. I will have some things to say about it next week. It's been an interesting collaboration between researchers with different skills and areas of expertise.

However, there was an issue related to interactives that came up this week that I can talk about and that I think is worth raising.

During the week, I was contacted by a colleague who knows we use D3 for our interactive data visualisations. They had seen that D3 version 6 has fully adopted ES2015, and they were worried that our visualisations would no longer be compatible with Internet Explorer.

In practice, this doesn't affect our work. We have been using ES2015 for at least two years, transpiling and polyfilling our code to work in older browsers.

If anything, D3 fully adopting ES2015 makes things a bit easier. D3 started using elements of modern JavaScript in version 5: you needed to polyfill Promises and fetch to use d3-fetch, for example. We were having to handle these issues on a case by case basis. With D3 version 6 you can take a consistent approach and transpile whichever D3 modules you import.

But I do think D3 switching fully to ES2015 is a big deal. Public sector organisations tend to be quite conservative about supporting older browsers, because they rightly don't want to exclude people who either can't afford to upgrade their devices, or who lack the technical expertise to keep up to date with what the industry expects of its customers.

At the same time, lots of people who are interested in data visualisation aren't necessarily that interested in software development more generally. Setting up a pipeline to compile modern JavaScript for older browsers can be challenging if you're not familiar with the wider JavaScript ecosystem.

In the pre-Covid past, I met data visualisation professionals from other public sector organisations who were taking the approach of avoiding modern JavaScript altogether, writing cross-browser compatible code natively with D3 version 4. That's surely not sustainable in the longer term.

D3 is clearly putting its weight behind a more modern approach to online data visualisation, compelling practitioners to either drop support for Internet Explorer or take on the responsibility of supporting it themselves.