Weeknotes: Week 39, 2020

27 Sep 2020 15:32 GMT

This has been an interesting week that I mostly can't talk about. I have been working on something that hasn't yet launched. In the meantime, here are a few things that happened that I can talk about.

Parliamentary data

The week started with Elise and me working on improving the code she has written to collect and integrate data for her dashboard on Parliamentary activities.

You might be surprised by how much work it takes to pull this data together. Elise has done an amazing job. The data comes from a combination of public APIs, automated use of internal search tools, and web scraping of data that is not otherwise published as a structured dataset.

Combining data from such a varied range of sources is not easy. In many cases the data lacks unique identifiers, so there is no option but to join datasets by normalising names, then using fuzzy string matching, and then manually checking cases where the join failed. With such a potentially error prone process we want to be sure the code is robust and that we have a thorough understanding of all the potential sources of error in the underlying data.
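To make the normalise-then-fuzzy-match approach concrete, here is a minimal sketch in Python (not the team's actual code, which I haven't shown here) using only the standard library's difflib. The function names, the normalisation rules, and the cutoff value are all illustrative assumptions; a real pipeline would tune these and route the unmatched names to manual checking.

```python
import difflib
import re

def normalise(name):
    # Illustrative normalisation: lowercase, strip common titles and
    # punctuation, and collapse runs of whitespace
    name = name.lower()
    name = re.sub(r"\b(mp|mr|mrs|ms|dr|sir|dame)\b\.?", "", name)
    name = re.sub(r"[^a-z\s]", " ", name)
    return " ".join(name.split())

def fuzzy_join(left_names, right_names, cutoff=0.9):
    # Map each left-hand name to its best right-hand match, or None if no
    # candidate clears the cutoff (these go to manual review)
    lookup = {normalise(n): n for n in right_names}
    matches = {}
    for name in left_names:
        candidates = difflib.get_close_matches(
            normalise(name), lookup.keys(), n=1, cutoff=cutoff)
        matches[name] = lookup[candidates[0]] if candidates else None
    return matches

matches = fuzzy_join(["Sir Keir Starmer MP"], ["Keir Starmer"])
```

A high cutoff trades recall for precision: you get fewer false joins, at the cost of more names falling through to the manual-checking step.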

We are working towards making it easier to use these data sources. We have encapsulated some of our routine data processing tasks in packages like clcommittees and Noel's parlygroups package. But there are a number of data sources for which we still need to build tools to use efficiently. And until the data platform is fully restored and renewed, linking these datasets will always be problematic.

On Wednesday, Anya and I met with Chris, head of the Library's Parliament and Constitution Centre, to take forward his idea for a regular meet-up of Parliamentary data users within the Library. The idea is that if we can learn a bit more about what we each need most urgently, we can identify which work to prioritise. The first full meeting of the group will be in October.

Data wrangling

This week, one of our researchers ran into a data wrangling problem that has come up a couple of times before, so I am going to post the best solution I have found so far.

Suppose you have some address level data, where the address itself is stored in a free text field. For each record, you need to extract the postcode from the address text and use it to link the data to the ONS postcode directory, so that you can aggregate the data to a higher-level geography, like a constituency.

The best way I have found to do this in R is to use str_extract on the address text with the following regular expression.

# Import stringr
library(stringr)

# Greedy regular expression for postcodes
POSTCODE_REGEXP <- paste0(
    "([A-Za-z][A-Ha-hJ-Yj-y]?[0-9][A-Za-z0-9]? ",
    "?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})")

# Extract the first matching postcode in each address
df$postcode <- str_extract(df$address, POSTCODE_REGEXP)

This regular expression will match postcodes even when the space separating the outward and inward parts is missing, so it will still catch postcodes that have been misrecorded in this way. For the same reason, if you remove all spaces from inside the postcodes in both datasets before performing the join (creating new columns for joining if you want to preserve the original data), you will get a higher match rate when linking the data.
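The post's example above is in R, but the space-stripping join key is language-agnostic. Here is a hedged sketch of the same idea in Python using only the standard library; the record layout is hypothetical and the postcode-to-constituency pairs are illustrative only, standing in for the ONS postcode directory.

```python
import re

def postcode_key(postcode):
    # Uppercase and remove all internal whitespace so that "SW1A 1AA" and
    # "SW1A1AA" produce the same join key
    return re.sub(r"\s+", "", postcode.strip().upper())

# Address data with the postcode already extracted, one record missing
# its internal space
records = [{"address": "1 Main St, SW1A1AA", "postcode": "SW1A1AA"},
           {"address": "2 High St, EC1A 1BB", "postcode": "EC1A 1BB"}]

# Stand-in for the ONS postcode directory, keyed on the normalised postcode
lookup = {postcode_key(p): c for p, c in
          [("SW1A 1AA", "Cities of London and Westminster"),
           ("EC1A 1BB", "Islington South and Finsbury")]}

# Join on the normalised key, keeping the original postcode intact
for record in records:
    record["constituency"] = lookup.get(postcode_key(record["postcode"]))
```

Because both sides of the join go through the same normalisation, the misrecorded "SW1A1AA" still links correctly.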

Note that the above code only looks for the first matching postcode in an address, so you may want to double-check that there is only one postcode per address with str_extract_all before relying on this strategy.

R training

On Tuesday I ran the second module of the introductory R training for Georgina and Richard, which is all about using the tidyverse. I really like teaching this module because it introduces programming for the first time. One of the remarkable things about the tidyverse is that you can start by teaching techniques that researchers can benefit from using right away.

I think this is probably R's biggest advantage over other programming languages when it comes to teaching absolute beginners. The number of things you need to know before you can start using other languages to do useful data analysis is much larger (and I say this as someone who prefers Python overall).

Of course, this simplicity doesn't come entirely for free. Some of the apparently magical things that the tidyverse does through non-standard evaluation eventually need to be explained, and that's not trivial.

However, if you are learning programming specifically for data analysis, and you don't have a background in any programming language, I think R has a lower barrier to entry than everything else. And that is mainly thanks to the tidyverse.

Interactive visualisations and ES2015

I spent much of this week working on a new interactive visualisation for the Library. I am not going to talk about it before it's published, but I will have some things to say about it next week. It has been an interesting collaboration between researchers with different skills and areas of expertise.

However, there was an issue related to interactives that came up this week that I can talk about and that I think is worth raising.

During the week, I was contacted by a colleague who knows we use D3 for our interactive data visualisations. They had seen that D3 version 6 has fully adopted ES2015, and they were worried that our visualisations would no longer be compatible with Internet Explorer.

In practice, this doesn't affect our work. We have been using ES2015 for at least two years, transpiling and polyfilling our code to work in older browsers.

If anything, D3 fully adopting ES2015 makes things a bit easier. D3 started using elements of modern JavaScript in version 5: you needed to polyfill Promises and fetch to use d3-fetch, for example. We were having to handle these issues on a case by case basis. With D3 version 6 you can take a consistent approach and transpile whichever D3 modules you import.
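For anyone setting this up from scratch, the consistent approach described above can be as simple as a single Babel configuration. The following is a minimal sketch, assuming @babel/preset-env with core-js 3 for polyfills; the exact targets and options will depend on your own browser support policy.

```json
{
  "presets": [
    ["@babel/preset-env", {
      "targets": "ie 11",
      "useBuiltIns": "usage",
      "corejs": 3
    }]
  ]
}
```

One caveat worth knowing: many bundler setups exclude node_modules from transpilation by default, so you need to make sure the D3 packages you import are actually included in the transpilation step rather than passed through untouched.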

But I do think D3 switching fully to ES2015 is a big deal. Public sector organisations tend to be quite conservative about supporting older browsers, because they rightly don't want to exclude people who either can't afford to upgrade their devices, or who lack the technical expertise to keep up to date with what the industry expects of its customers.

At the same time, lots of people who are interested in data visualisation aren't necessarily that interested in software development more generally. Setting up a pipeline to compile modern JavaScript for older browsers can be challenging if you're not familiar with the wider JavaScript ecosystem.

In the pre-Covid past, I met data visualisation professionals from other public sector organisations who were taking the approach of avoiding modern JavaScript altogether, writing cross-browser compatible code natively with D3 version 4. That's surely not sustainable in the longer term.

D3 is clearly putting its weight behind a more modern approach to online data visualisation, compelling practitioners to either drop support for Internet Explorer or take on the responsibility of supporting it themselves.