Weeknotes: Week 40, 2020

04 Oct 2020 13:04 GMT

I had a busy week, despite taking two days off. It was exciting but also a bit stressful.

Coronavirus restrictions map

On Wednesday, we launched our new interactive map of coronavirus restrictions. This was the project I mentioned I was working on last week.

The map shows a schematic representation of the coronavirus restrictions that apply in different parts of the UK. You can click on an area to see a summary of the local and national restrictions in force, and find a link to official guidance.

A screenshot of the interactive map showing coronavirus restrictions. The screenshot shows the map zoomed in on North Wales and the North West of England, with information about Denbighshire displayed in a popup.

The launch went really well. Jenny's tweet introducing the map got a big response and was shared by a number of MPs and journalists. I think the map has been successful largely because it is a collaboration between people with different specialisms.

Jenny and Dan are policy specialists; they classify the restrictions in each area using a simple taxonomy and enter them in a data file. Carl is our mapping specialist; he creates custom boundaries for the areas under restrictions and combines them with the data that Jenny and Dan produce. I take the boundary layer that Carl produces and map it with JavaScript and Leaflet.

This division of responsibilities makes it easier to update the map, which is going to be important as we respond to changes in the coming weeks.

The response to the map was interesting. We got a lot of positive feedback and also a few feature requests. People asked for things like a postcode lookup, layers showing the boundaries for other geographies, and an API for the data.

Those are all potentially worthwhile features to add, but there are reasons that we are unlikely to add many of them. A product like this is journalistic in nature: it's time-bound and contingent on events. Publishing the map already involves a commitment to maintain its current feature set and adding features only adds to that burden.

The Library has a well-earned reputation for the quality of its research, but what is less obvious from the outside is how efficient it is. The researchers who publish the briefing material that you see online also spend a good part of their time answering deadline-driven requests for information and analysis from Members, and are often working on internal projects for the wider organisation as well.

When we publish an online tool like this, we know that we will need to maintain it alongside everything else we are doing, so we have to decide where best to focus our limited resources. For example, Matthew Somerville already maintains a terrific postcode lookup tool for coronavirus restrictions, and there is no point us reproducing his work. What was missing when we started was a map.

There is also the question of what tools we have readily available. Can we take an idea forward quickly, given the immediately available resources? Anything that involves arranging new digital infrastructure takes time, and that constrains our options on projects where timeliness matters.

The unexpected level of interest in this project has raised questions about whether we have all the tools we need to do this kind of work effectively in future. I think we may need to make some long-term changes to our digital capability so that we have more options next time we tackle a similar problem.

So just because we haven't done a particular thing in this case does not mean we wouldn't ever do it. It depends on the time and resources available, and where we think we can add the most value.

Weeknotes: Week 39, 2020

27 Sep 2020 15:32 GMT

This has been an interesting week that I mostly can't talk about. I have been working on something that hasn't yet launched. In the meantime, here are a few things that happened that I can talk about.

Parliamentary data

The week started with Elise and me working on improving the code she has written to collect and integrate data for her dashboard on Parliamentary activities.

You might be surprised by how much work it takes to pull this data together. Elise has done an amazing job. The data comes from a combination of public APIs, automated use of internal search tools, and web scraping of data that is not otherwise published as a structured dataset.

Combining data from such a varied range of sources is not easy. In many cases the data lacks unique identifiers, so there is no option but to join datasets by normalising names, then using fuzzy string matching, and then manually checking cases where the join failed. With such a potentially error-prone process, we want to be sure the code is robust and that we have a thorough understanding of all the potential sources of error in the underlying data.
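To illustrate the general pattern (this is a made-up sketch, not Elise's actual code; the data frames, the normalisation rules and the abbreviation list are all hypothetical), a name-based join in base R might look like this:

```r
# Hypothetical example: joining two datasets on names with no shared identifiers
a <- data.frame(name = c("Public Accounts Cttee.", "Home Affairs Committee"))
b <- data.frame(name = c("public accounts committee", "home affairs committee"))

# Normalise: lower case, expand known abbreviations, strip punctuation and extra spaces
normalise <- function(x) {
  x <- tolower(x)
  x <- gsub("cttee\\.?", "committee", x)
  x <- gsub("[[:punct:]]", "", x)
  trimws(gsub(" +", " ", x))
}

a$key <- normalise(a$name)
b$key <- normalise(b$name)

# Exact join on the normalised key; rows that fail fall through to fuzzy matching
merged <- merge(a, b, by = "key", all.x = TRUE, suffixes = c("_a", "_b"))

# For remaining failures, find the nearest candidate by edit distance,
# then check these matches manually before accepting them
nearest <- b$key[apply(adist(a$key, b$key), 1, which.min)]
```

The manual check at the end is the important step: fuzzy matching will always find a nearest candidate, whether or not it is a correct match.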

We are working towards making it easier to use these data sources. We have encapsulated some of our routine data processing tasks in packages like clcommittees and Noel's parlygroups package. But there are a number of data sources for which we still need to build tools before we can use them efficiently. And until the data platform is fully restored and renewed, linking these datasets will always be problematic.

On Wednesday, Anya and I met with Chris, head of the Library's Parliament and Constitution Centre, to take forward his idea for a regular meet-up of Parliamentary data users within the Library. The idea is that if we can learn a bit more about what we each need most urgently, we can identify which work to prioritise. The first full meeting of the group will be in October.

Data wrangling

This week, one of our researchers ran into a data wrangling problem that has come up a couple of times before, so I am going to post the best solution I have found so far.

Suppose you have some address-level data, where the address itself is stored in a free text field. For each record, you need to extract the postcode from the address text and use it to link the data to the ONS postcode directory, so that you can aggregate the data to a higher-level geography, like a constituency.

The best way I have found to do this in R is to use stringr's str_extract on the address text with the following regular expression.

# Import stringr
library(stringr)

# Greedy regular expression for postcodes
POSTCODE_REGEXP <- paste0(
    "([A-Za-z][A-Ha-hJ-Yj-y]?[0-9][A-Za-z0-9]? ",
    "?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})")

# Extract the first matching postcode in each address
df$postcode <- str_extract(df$address, POSTCODE_REGEXP)

This regular expression will match postcodes even if they are missing the space separating the outward and inward parts, so it will still catch postcodes that have been misrecorded in this way. Consequently, if you remove any spaces from inside the postcodes in both datasets when you perform the join (create new columns for joining if you want to preserve the original data), then you get a higher match rate when linking the data.

Note that the above code only looks for the first matching postcode in an address, so you may want to double check there is only one postcode per address with str_extract_all before relying on this strategy.

R training

On Tuesday I ran the second module of the introductory R training for Georgina and Richard, which is all about using the tidyverse. I really like teaching this module because it introduces programming for the first time. One of the remarkable things about the tidyverse is that you can start by teaching techniques that researchers can benefit from using right away.

I think this is probably R's biggest advantage over other programming languages when it comes to teaching absolute beginners. The number of things you need to know before you can start using other languages to do useful data analysis is much larger (and I say this as someone who prefers Python overall).

Of course, this simplicity doesn't come entirely for free. Some of the apparently magical things that the tidyverse does through non-standard evaluation eventually need to be explained, and that's not trivial.

However, if you are learning programming specifically for data analysis, and you don't have a background in any programming language, I think R has a lower barrier to entry than anything else. And that is mainly thanks to the tidyverse.

Interactive visualisations and ES2015

I spent much of this week working on a new interactive visualisation for the Library. I am not going to talk about it before it's published. I will have some things to say about it next week. It's been an interesting collaboration between researchers with different skills and areas of expertise.

However, there was an issue related to interactives that came up this week that I can talk about and that I think is worth raising.

During the week, I was contacted by a colleague who knows we use D3 for our interactive data visualisations. They had seen that D3 version 6 has fully adopted ES2015, and they were worried that our visualisations would no longer be compatible with Internet Explorer.

In practice, this doesn't affect our work. We have been using ES2015 for at least two years, transpiling and polyfilling our code to work in older browsers.

If anything, D3 fully adopting ES2015 makes things a bit easier. D3 started using elements of modern JavaScript in version 5: you needed to polyfill Promises and fetch to use d3-fetch, for example. We were having to handle these issues on a case-by-case basis. With D3 version 6 you can take a consistent approach and transpile whichever D3 modules you import.

But I do think D3 switching fully to ES2015 is a big deal. Public sector organisations tend to be quite conservative about supporting older browsers, because they rightly don't want to exclude people who either can't afford to upgrade their devices, or who lack the technical expertise to keep up to date with what the industry expects of its customers.

At the same time, lots of people who are interested in data visualisation aren't necessarily that interested in software development more generally. Setting up a pipeline to compile modern JavaScript for older browsers can be challenging if you're not familiar with the wider JavaScript ecosystem.

In the pre-Covid past, I met data visualisation professionals from other public sector organisations who were taking the approach of avoiding modern JavaScript altogether, writing cross-browser compatible code natively with D3 version 4. That's surely not sustainable in the longer term.

D3 is clearly putting its weight behind a more modern approach to online data visualisation, compelling practitioners to either drop support for Internet Explorer or take on the responsibility of supporting it themselves.

Weeknotes: Week 38, 2020

18 Sep 2020 14:28 GMT

I have always been sceptical about the value of writing weeknotes. There are a million and one things I want to get done at work and the idea of writing weeknotes always felt like not doing any of them.

However, in the last few months I have started to notice how working from home has subtly changed the way we work. Some of these changes are good: I get longer periods of uninterrupted time between meetings, which makes it easier to be productive with development work.

But some are less good. In particular, I've noticed that with many people working from home there is less opportunity for serendipity: those moments when you find out through casual conversation that what you are doing joins up neatly with something one of your colleagues is doing, and you both benefit.

So while we're all keeping our distance, I thought it might help to know what I'm up to.

House of Commons Library GitHub account

This week we launched our House of Commons Library GitHub account. Since we started the data science programme two years ago, researchers in the Library have been working on tools to make routine data analysis easier. Some of this work is potentially useful to researchers outside Parliament, and we share code that has a general purpose when we can.

Until now we have all been using personal GitHub accounts, but some of the tools we have developed have become important enough that we need to maintain and manage them in one place. I hope that having an organisational account will also make collaboration and training easier. It's been a real pleasure seeing statistical researchers with no previous programming experience discover its potential and get really good at it. We are now in a position to work collaboratively on shared projects as a team.

Parliamentary data

I spent a good part of this week working on a new R package for downloading data from Parliament's Committees API. The package is called clcommittees and I've published an initial working version on GitHub. We use data on committees in enquiries, briefings and dashboards. The package currently focusses on committees, memberships and roles, but it is likely to grow to include other data as and when we need it.

This package is part of a wider programme of work we are doing to develop our capability with Parliamentary data. We are developing tools to work with a range of Parliamentary data sources (including the data platform now it has been unpaused), so watch this space.

New chart style for the Library

This week, the Library launched its new chart style. The style was developed by my colleagues Carl Baker and Paul Bolton. I've implemented the style as a ggplot2 theme in an R package called clcharts, so that Library researchers can easily produce charts in the new style when doing computational data analysis. To give you a flavour, here's an example of a ridgeline plot made with the package.

A smoothed ridge chart showing the distribution of Parliamentary constituencies by median age and settlement class

You can see more examples in the GitHub readme at the link above. I think the new style looks great, and thanks to patchwork we have been able to fully implement it in R, which wasn't the case with our old chart style.
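For anyone curious how a house style can be packaged as a ggplot2 theme, the general pattern is to start from an existing theme and override the elements the style specifies. This is a minimal sketch of that pattern, not the actual clcharts implementation; the theme name and all the values are placeholders.

```r
library(ggplot2)

# Minimal sketch of a house-style theme: start from an existing theme and
# override the elements the style specifies (placeholder values, not clcharts)
theme_house <- function(base_size = 12) {
  theme_minimal(base_size = base_size) %+replace%
    theme(
      panel.grid.minor = element_blank(),
      panel.grid.major.x = element_blank(),
      axis.title = element_text(size = rel(0.9)),
      plot.title = element_text(face = "bold", hjust = 0),
      plot.caption = element_text(size = rel(0.7), hjust = 0)
    )
}

# Apply it like any other theme
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_house()
```

Packaging the theme as a function means every chart picks up style changes from a single place.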

MSOA Names 1.5.0

Carl and I updated the MSOA Names dataset and map to version 1.5.0 to fix a couple of errors people had spotted. The dataset has been turning up everywhere from Public Health England's map of coronavirus cases to the Doogal postcode lookup service.

2019 General Election datasets and cartogram

29 Dec 2019 09:41 GMT

I had a busy end to the year. Parliament voted for a general election to be held two weeks before Christmas and I was once again running the House of Commons Library's data collection effort. This was my third general election working for the Library. We completed the data collection in record time and published the first edition of our briefing paper and datasets within a week of the polls closing.

In between the data collection, I found a bit of time to work on some data visualisation for the election. My colleague Carl Baker (who I worked with previously on MSOA Names) has designed a new constituency cartogram, which neatly balances equally sized constituencies with geographic groupings that make it easy to find particular constituencies and to see patterns within historic county areas.

I made an interactive version of the cartogram for showing the election results online on the morning after the vote. Embedding a small image of it doesn't do it justice; the interactive version of the election cartogram is better.

I think it is a really nice bit of visual design and I am happy with how we managed to make it look and work on the web in quite a short period of time. I hope we can do further work using a similar approach for other geographies next year.

Further adventures in animating uncertainty

17 Nov 2019 16:32 GMT

As I mentioned in my last post, I have been playing with uncertainty charts again. In that post, I wanted to simplify the task of creating animated bar charts so that I could easily create uncertainty bar charts with positive and negative values.

More recently, I have been exploring how the idea of animated uncertainty could be extended to other chart types. The following two experiments were inspired by suggestions from Paul Bolton and Harvey Goldstein.

Both examples are based on the same fundamental idea behind the uncertainty bar chart, which is to illustrate the statistical uncertainty in a set of estimates by generating alternative but equally plausible data using random values drawn from the error distribution for each estimate.
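That idea can be sketched in a few lines of R: given each estimate and its standard error, draw one random value per estimate from a normal error distribution to produce an alternative, equally plausible series. The numbers below are made up for illustration; they are not the APS figures.

```r
# Illustrative estimates and standard errors (made-up values, not the APS data)
years     <- 2004:2018
estimates <- seq(20000, 45000, length.out = length(years))
std_errs  <- rep(3000, length(years))

# One alternative, equally plausible series: a random draw from each
# estimate's error distribution (here assumed to be normal)
set.seed(1)
alternative <- rnorm(length(estimates), mean = estimates, sd = std_errs)

# Repeating the draw produces the family of plausible trends the charts animate
alternatives <- replicate(10, rnorm(length(estimates), mean = estimates, sd = std_errs))
```

Each column of alternatives is one plausible history; animating through them is what gives the charts their shifting, uncertain character.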

In these examples, I use estimates of the number of EU nationals living in the London Borough of Haringey, which are taken from the Annual Population Survey. These figures are published by the ONS in their regular statistical release on the population of the UK by country of birth and nationality.

Haringey is the borough where I live. There has been an increase in the number of EU nationals living in Haringey since 2004, but the precise extent of the increase is uncertain due to sampling error in the APS. Coincidentally, I was also a migrant to Haringey during this period (from Wales).

I have embedded screenshots of the new charts below, but please follow the links, or click the screenshots, to see the live animated versions.

Uncertainty line chart

The uncertainty line chart starts by showing the trend for the estimates as a line. Clicking on the chart generates alternative lines for the same estimates. Each line is drawn in sequence, and then fades gradually over time until it disappears. In this way the chart builds up a constantly evolving representation of the uncertainty in the estimates, showing the range of possible trends.

An uncertainty line chart showing different possible trends for growth in the number of EU nationals living in Haringey between 2004 and 2018

Uncertainty level chart

The uncertainty level chart takes a slightly different approach. In this chart, each horizontal line represents an estimate of the number of EU nationals living in Haringey in each year.

When you click on the chart, these lines move to newly generated random values, leaving a translucent shadow of the value in each case. Over time these shadows build up to represent the density of the error distribution: the more values drawn in a given region, the darker that region becomes.

An uncertainty level chart showing the range of possible values for each estimate of the number of EU nationals living in Haringey in each year from 2004 to 2018

Like the animated bar charts, these charts show different trends that are equally likely given the uncertainty in the estimates. But unlike the bar charts, these versions also build up a visual representation of the overall degree of uncertainty.

In some respects, I quite like the epistemological terror that the bar charts can induce. Seeing a trend erased by variation is a useful antidote to the apparent solidity of numbers. But it does mean those charts are missing some important context, which is made explicit in these versions of animated uncertainty.