18 Sep 2020 14:28 GMT
I have always been skeptical about the value of writing weeknotes. There are a million and one things I want to get done at work and the idea of writing weeknotes always felt like not doing any of them.
However, in the last few months I have started to notice how working from home has subtly changed the way we work. Some of these changes are good: I get longer periods of uninterrupted time between meetings, which makes it easier to be productive with development work.
But some are less good. In particular, I've noticed that with many people working from home there is less opportunity for serendipity, when you find out through casual conversations that what you are doing joins up neatly with something one of your colleagues is doing and you both benefit.
So while we're all keeping our distance, I thought it might help to share what I'm up to.
House of Commons Library GitHub account
This week we launched our House of Commons Library GitHub account. Since we started the data science programme two years ago, researchers in the Library have been working on tools to make routine data analysis easier. Some of this work is potentially useful to researchers outside Parliament, and we share code that has a general purpose when we can.
Until now we have all been using personal GitHub accounts, but some of the tools we have developed have become important enough that we need to maintain and manage them in one place. I hope that having an organisational account will also make collaboration and training easier. It's been a real pleasure seeing statistical researchers with no previous programming experience discover the potential of programming and get really good at it. We are now in a position to work collaboratively on shared projects as a team.
I spent a good part of this week working on a new R package for downloading data from Parliament's Committees API. The package is called clcommittees and I've published an initial working version on GitHub. We use data on committees in enquiries, briefings and dashboards. The package currently focusses on committees, memberships and roles, but it is likely to grow to include other data as and when we need it.
This package is part of a wider programme of work we are doing developing our capability with Parliamentary data. We are developing tools to work with a range of Parliamentary data sources (including the data platform now it has been unpaused), so watch this space.
New chart style for the Library
This week, the Library launched its new chart style. The style was developed by my colleagues Carl Baker and Paul Bolton. I've implemented the style as a ggplot2 theme in an R package called clcharts, so that Library researchers can easily produce charts in the new style when doing computational data analysis. To give you a flavour, here's an example of a ridgeline plot made with the package.
You can see more examples in the GitHub readme at the link above. I think the new style looks great, and thanks to patchwork we have been able to fully implement it in R, which wasn't the case with our old chart style.
MSOA Names 1.5.0
Carl and I updated the MSOA Names dataset and map to version 1.5.0 to fix a couple of errors people had spotted. The dataset has been turning up everywhere from Public Health England's map of coronavirus cases to the Doogal postcode lookup service.
27 Sep 2020 15:32 GMT
This has been an interesting week that I mostly can't talk about. I have been working on something that hasn't yet launched. In the meantime, here are a few things that happened that I can talk about.
The week started with Elise and me working on improving the code she has written to collect and integrate data for her dashboard on Parliamentary activities.
You might be surprised by how much work it takes to pull this data together. Elise has done an amazing job. The data comes from a combination of public APIs, automated use of internal search tools, and webscraping data that is not otherwise published as a structured dataset.
Combining data from such a varied range of sources is not easy. In many cases the data lacks unique identifiers, so there is no option but to join datasets by normalising names, then using fuzzy string matching, and then manually checking cases where the join failed. With such a potentially error-prone process we want to be sure the code is robust and that we have a thorough understanding of all the potential sources of error in the underlying data.
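To make the three steps concrete, here is a minimal sketch in base R. It is not the code from Elise's dashboard: the function names, the example names and the distance threshold are all hypothetical, and it assumes plain ASCII names with no missing values. It normalises both sets of names, picks the closest candidate by edit distance using adist, and flags anything beyond the threshold for manual checking.

```r
# Normalise names before comparing them: lower case, strip punctuation,
# collapse repeated whitespace. Assumes ASCII names (an illustrative
# simplification; accented characters would be dropped by this pattern).
normalise_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Hypothetical helper: for each name in `left`, find the closest name in
# `right` by edit distance, and flag weak matches for manual review.
fuzzy_join_names <- function(left, right, max_distance = 2) {
  left_key <- normalise_name(left)
  right_key <- normalise_name(right)

  # Matrix of edit distances: one row per left name, one column per right name
  distances <- adist(left_key, right_key)

  # Index and distance of the closest candidate for each left name
  best <- apply(distances, 1, which.min)
  best_distance <- distances[cbind(seq_along(left_key), best)]
  matched <- best_distance <= max_distance

  data.frame(
    left = left,
    right = ifelse(matched, right[best], NA_character_),
    distance = best_distance,
    needs_review = !matched,
    stringsAsFactors = FALSE
  )
}

# Made-up example: the same people recorded slightly differently
left <- c("Jane Smith", "Jon  Doe", "Alex Brown")
right <- c("jane smith.", "John Doe", "Someone Else")
result <- fuzzy_join_names(left, right)
print(result)
```

The threshold is a judgment call. A tighter value sends more rows to manual review, while a looser one risks joining two different people with similar names, which is why the manual check at the end still matters.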
We are working towards making it easier to use these data sources. We have encapsulated some of our routine data processing tasks in packages like clcommittees and Noel's parlygroups package. But there are a number of data sources for which we still need to build tools to use efficiently. And until the data platform is fully restored and renewed, linking these datasets will always be problematic.
On Wednesday, Anya and I met with Chris, head of the Library's Parliament and Constitution Centre, to take forward his idea for a regular meet-up of Parliamentary data users within the Library. The idea is that if we can learn a bit more about what we each need most urgently, we can identify which work to prioritise. The first full meeting of the group will be in October.
This week, one of our researchers ran into a data wrangling problem that has come up a couple of times before, so I am going to post the best solution I have found so far.
Suppose you have some address level data, where the address itself is stored in a free text field. For each record, you need to extract the postcode from the address text and use it to link the data to the ONS postcode directory, so that you can aggregate the data to a higher-level geography, like a constituency.
The best way I have found to do this in R is to use str_extract from the stringr package on the address text with the following regular expression.
# Import stringr
library(stringr)
# Greedy regular expression for postcodes
POSTCODE_REGEXP <- str_c(
# Extract the first matching postcode in each address
df$postcode <- str_extract(df$address, POSTCODE_REGEXP)
This regular expression will match postcodes even if they are missing the space separating the outward and inward parts, so it will still catch postcodes that have been misrecorded in this way. Consequently, if you remove any spaces from inside the postcodes in both datasets when you perform the join (create new columns for joining if you want to preserve the original data), then you get a higher match rate when linking the data.
Note that the above code only looks for the first matching postcode in an address, so you may want to double check there is only one postcode per address with str_extract_all before relying on this strategy.
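Here is a rough end-to-end sketch of that strategy. The pattern below is a simplified stand-in that I am assuming for illustration, not the regular expression described above, and the data frames, column names and constituency labels are all made up. It extracts the first postcode, counts the matches per address, strips spaces to build a join key on both sides, and then links the two datasets.

```r
library(stringr)

# Simplified postcode pattern with an optional space. This is an assumption
# for illustration only and will not cover every valid postcode format.
POSTCODE_SKETCH <- "\\b[A-Za-z]{1,2}[0-9][A-Za-z0-9]? ?[0-9][A-Za-z]{2}\\b"

# Hypothetical address-level data
addresses <- data.frame(
  address = c(
    "1 High Street, London SW1A 0AA",
    "Flat 2, 5 Mill Lane, Leeds LS11AB",
    "No postcode here"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a postcode directory
lookup <- data.frame(
  postcode = c("SW1A 0AA", "LS1 1AB"),
  constituency = c("Constituency A", "Constituency B"),
  stringsAsFactors = FALSE
)

# Build a join key by removing all whitespace and upper-casing
make_join_key <- function(x) str_to_upper(str_remove_all(x, "\\s"))

# Extract the first postcode and count how many each address contains
addresses$postcode <- str_extract(addresses$address, POSTCODE_SKETCH)
addresses$n_postcodes <- lengths(str_extract_all(addresses$address, POSTCODE_SKETCH))

# Create separate key columns so the original values are preserved
addresses$join_key <- make_join_key(addresses$postcode)
lookup$join_key <- make_join_key(lookup$postcode)

# Left join: addresses without a usable postcode are kept with NA
linked <- merge(
  addresses,
  lookup[, c("join_key", "constituency")],
  by = "join_key",
  all.x = TRUE
)
print(linked)
```

Any row where n_postcodes is not exactly one is worth checking by hand before aggregating, since the first match may not be the postcode you want.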
On Tuesday I ran the second module of the introductory R training for Georgina and Richard, which is all about using the tidyverse. I really like teaching this module because it introduces programming for the first time. One of the remarkable things about the tidyverse is that you can start by teaching techniques that researchers can benefit from using right away.
I think this is probably R's biggest advantage over other programming languages when it comes to teaching absolute beginners. The number of things you need to know before you can start using other languages to do useful data analysis is much larger (and I say this as someone who prefers Python overall).
Of course, this simplicity doesn't come entirely for free. Some of the apparently magical things that the tidyverse does through non-standard evaluation eventually need to be explained, and that's not trivial.
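As a small illustration of where the explanation becomes necessary, here is a sketch using dplyr. The data and the mean_by function are hypothetical. Interactively you can refer to columns by their bare names, but as soon as you wrap the same code in a function, the column names arrive as arguments and need to be passed through with the embrace operator.

```r
library(dplyr)

# Made-up data for illustration
briefings <- data.frame(
  topic = c("health", "health", "transport"),
  pages = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Interactive use: `topic` and `pages` are looked up inside the data frame,
# even though no objects with those names exist in the environment
briefings %>%
  group_by(topic) %>%
  summarise(mean_pages = mean(pages), .groups = "drop")

# Inside a function the column names are arguments, so they have to be
# embraced with {{ }} to be evaluated in the context of the data
mean_by <- function(data, group, value) {
  data %>%
    group_by({{ group }}) %>%
    summarise(mean_value = mean({{ value }}), .groups = "drop")
}

result <- mean_by(briefings, topic, pages)
print(result)
```

Without the embrace, dplyr would look for columns literally called group and value, which is the kind of surprise that beginners tend to hit when they first try to reuse their own code.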
However, if you are learning programming specifically for data analysis, and you don't have a background in any programming language, I think R has a lower barrier to entry than everything else. And that is mainly thanks to the tidyverse.
Interactive visualisations and ES2015
I spent much of this week working on a new interactive visualisation for the Library. I am not going to talk about it before it's published. I will have some things to say about it next week. It's been an interesting collaboration between researchers with different skills and areas of expertise.
However, there was an issue related to interactives that came up this week that I can talk about and that I think is worth raising.
During the week, I was contacted by a colleague who knows we use D3 for our interactive data visualisations. They had seen that D3 version 6 has fully adopted ES2015, and they were worried that our visualisations would no longer be compatible with Internet Explorer.
In practice, this doesn't affect our work. We have been using ES2015 for at least two years, transpiling and polyfilling our code to work in older browsers.
But I do think D3 switching fully to ES2015 is a big deal. Public sector organisations tend to be quite conservative about supporting older browsers, because they rightly don't want to exclude people who either can't afford to upgrade their devices, or who lack the technical expertise to keep up to date with what the industry expects of its customers.
D3 is clearly putting its weight behind a more modern approach to online data visualisation, compelling practitioners to either drop support for Internet Explorer or take on the responsibility of supporting it themselves.