olihawkins

Weeknotes: Week 43, 2020

23 Oct 2020 17:57 GMT

I have been on leave intermittently during the last few weeks. This was my first full week back at work and so my first full weeknotes after a short break. Not everything covered here happened this week, but I would rather write about the most interesting things that have happened than stick rigidly to a format.

Coronavirus restrictions map

Work on the coronavirus restrictions map has continued since it launched. The initial response was quite surprising. The map received over 100,000 unique visits in the first few days and settled down to an average of around 2,000 a day this week.

With this many people using the map, we (that is, me, Carl, Dan and Jenny) have felt a big responsibility to keep it as up to date and accurate as possible. It hasn't been easy, given the frequency with which new announcements have been made and the increasing complexity of the restrictions in place across the UK. I had to build more flexibility into the structure of the application so that different systems of restrictions can apply in different parts of the UK.

One unexpected piece of work was rebuilding the entire application from scratch using a different framework about a week after it launched. The map was originally built using Leaflet. However, Mapbox very kindly offered to waive the cost of our map tiles, as they were running a promotion for non-commercial maps providing public information on Covid. In return they asked if we would consider switching from using Leaflet to Mapbox GL JS as this would use their resources more efficiently. It seemed only fair to honour that request.

If you are not familiar with Mapbox GL, it's like a more powerful version of Leaflet. Mapbox GL JS is the JavaScript library for Mapbox GL, but there are also native SDKs for various operating systems. Mapbox GL gives you much more control over how your map looks and behaves than Leaflet, but you need to tell it in more detail what you want it to do. To help with this, it provides some neat declarative APIs which you can use to set the properties of different elements on the map based on the data they represent. I've used Mapbox GL JS once before, on MSOA Names, but I learned a lot more about it on this project.

One other important feature of the map that I haven't written about before is the icon set we are using for the restriction categories. The icons come from a rather wonderful open source icon set called Remix Icon. This is a general purpose collection of icons for application development. You can see a few examples of the icons in the latest national restrictions for Wales.

A screenshot of the interactive map showing coronavirus restrictions. The screenshot shows the map zoomed in on Wales to illustrate the icons used to represent the restrictions in the pop-up information box that is shown when a user clicks the map.

We've had to be quite creative in using the icons to represent different types of restrictions rather than the typical user-interface elements for which they were designed. But the collection has been comprehensive enough to meet all our needs so far.

I will never stop being grateful to the designers and coders who publish wonderful open source resources like this.

Parliamentary data

As I mentioned previously, one of my main focuses right now is developing the Library's capability for working with Parliamentary data. Noel, Elise and I have been building a set of R packages and scripts for working with different parliamentary data sources so that we can easily download, combine and analyse the data we need for Parliamentary research.

In an ideal world this work would not have been necessary. But the pause in the development of the data platform means we need to move ahead with using the currently available sources until the project to integrate procedural data is back on track. It's coming, but it won't happen overnight.

This week Noel finished initial development work on a new R package for retrieving data from the Members Names Information Service (MNIS), called clmnis. MNIS is the system that manages Member data in Parliament. The neat thing about Noel's package is that it has exactly the same interface as the Member functions in our R and Python packages for the data platform. The idea is that the packages should be interchangeable, so that we can switch back to using the data platform in future.
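As a rough sketch of what that interchangeability means in practice, analysis code can be written against a thin wrapper that doesn't care which backend supplies the data. The package and function names below are illustrative, not the packages' actual APIs.

# A minimal sketch of the interchangeability idea: the package and
# function names are illustrative, not the actual APIs
get_mps <- function(source = c("mnis", "platform")) {
    source <- match.arg(source)
    if (source == "mnis") {
        clmnis::fetch_mps()
    } else {
        dataplatform::fetch_mps()
    }
}

# Downstream analysis code calls the wrapper and doesn't need to know
# which backend supplied the data
mps <- get_mps("mnis")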

Noel and I went through some of the code together on Tuesday and I spent Wednesday working on unit tests for the package. It's still in beta for now, but we are working to make it production ready as soon as possible. I think Noel has done an amazing job. We are aiming to get colleagues to start testing it next week. And then after that we will start work on a package for the Commons and Lords votes APIs.

Weeknotes: Week 40, 2020

04 Oct 2020 13:04 GMT

I had a busy week, despite taking two days off. It was exciting but also a bit stressful.

Coronavirus restrictions map

On Wednesday, we launched our new interactive map of coronavirus restrictions. This was the project I mentioned I was working on last week.

The map shows a schematic representation of the coronavirus restrictions that apply in different parts of the UK. You can click on an area to see a summary of the local and national restrictions in force, and find a link to official guidance.

A screenshot of the interactive map showing coronavirus restrictions. The screenshot shows the map zoomed in on North Wales and the North West of England, with information about Denbighshire displayed in a popup.

The launch went really well. Jenny's tweet introducing the map got a big response and was shared by a number of MPs and journalists. I think the map has been successful largely because it is a collaboration between people with different specialisms.

Jenny and Dan are policy specialists; they classify the restrictions in each area using a simple taxonomy and enter them in a data file. Carl is our mapping specialist; he creates custom boundaries for the areas under restrictions and combines them with the data that Jenny and Dan produce. I take the boundary layer that Carl produces and map it with JavaScript and Leaflet.

This division of responsibilities makes it easier to update the map, which is going to be important as we respond to changes in the coming weeks.

The response to the map was interesting. We got a lot of positive feedback and also a few feature requests. People asked for things like a postcode lookup, layers showing the boundaries for other geographies, and an API for the data.

Those are all potentially worthwhile features to add, but there are reasons that we are unlikely to add many of them. A product like this is journalistic in nature: it's time-bound and contingent on events. Publishing the map already involves a commitment to maintain its current feature set, and adding features only adds to that burden.

The Library has a well-earned reputation for the quality of its research, but what is less obvious from the outside is how efficient it is. The researchers who publish the briefing material that you see online also spend a good part of their time answering deadlined requests for information and analysis from Members, and are often working on internal projects for the wider organisation as well.

When we publish an online tool like this, we know that we will need to maintain it alongside everything else we are doing, so we have to decide where best to focus our limited resources. For example, Matthew Somerville already maintains a terrific postcode lookup tool for coronavirus restrictions, and there is no point us reproducing his work. What was missing when we started was a map.

There is also the question of what tools we have readily available. Can we take an idea forward quickly, given the immediately available resources? Anything that involves arranging new digital infrastructure takes time, and that constrains our options on projects where timeliness matters.

The unexpected level of interest in this project has raised questions about whether we have all the tools we need to do this kind of work effectively in future. I think we may need to make some long-term changes to our digital capability so that we have more options next time we tackle a similar problem.

So just because we haven't done a particular thing in this case does not mean we wouldn't ever do it. It depends on the time and resources available, and where we think we can add the most value.

Weeknotes: Week 39, 2020

27 Sep 2020 15:32 GMT

This has been an interesting week that I mostly can't talk about. I have been working on something that hasn't yet launched. In the meantime, here are a few things that happened that I can talk about.

Parliamentary data

The week started with Elise and me working on improving the code she has written to collect and integrate data for her dashboard on Parliamentary activities.

You might be surprised by how much work it takes to pull this data together. Elise has done an amazing job. The data comes from a combination of public APIs, automated use of internal search tools, and web scraping of data that is not otherwise published as a structured dataset.

Combining data from such a varied range of sources is not easy. In many cases the data lacks unique identifiers, so there is no option but to join datasets by normalising names, then using fuzzy string matching, and then manually checking cases where the join failed. With such a potentially error-prone process we want to be sure the code is robust and that we have a thorough understanding of all the potential sources of error in the underlying data.
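As a rough illustration of that normalise-then-fuzzy-match approach, here is a generic sketch using dplyr, stringr and the stringdist package. This is not Elise's actual code, and the data frame and column names are made up.

# Illustrative sketch: normalise names, join exactly, then fuzzy match
# the leftovers for manual checking (data frame and column names are
# made up)
library(dplyr)
library(stringr)
library(stringdist)

normalise_name <- function(x) {
    str_squish(str_replace_all(str_to_lower(x), "[[:punct:]]", " "))
}

left <- mutate(left, join_name = normalise_name(name))
right <- mutate(right, join_name = normalise_name(name))

# First pass: exact join on the normalised names
matched <- inner_join(left, right, by = "join_name")

# Second pass: fuzzy match the remainder using Jaro-Winkler distance
# and set the results aside for manual checking
unmatched <- anti_join(left, right, by = "join_name")
unmatched$best_match <- right$join_name[
    amatch(unmatched$join_name, right$join_name,
           method = "jw", maxDist = 0.1)]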

We are working towards making it easier to use these data sources. We have encapsulated some of our routine data processing tasks in packages like clcommittees and Noel's parlygroups package. But there are a number of data sources for which we still need to build tools to use efficiently. And until the data platform is fully restored and renewed, linking these datasets will always be problematic.

On Wednesday, Anya and I met with Chris, head of the Library's Parliament and Constitution Centre, to take forward his idea for a regular meet-up of Parliamentary data users within the Library. The idea is that if we can learn a bit more about what we each need most urgently, we can identify which work to prioritise. The first full meeting of the group will be in October.

Data wrangling

This week, one of our researchers ran into a data wrangling problem that has come up a couple of times before, so I am going to post the best solution I have found so far.

Suppose you have some address level data, where the address itself is stored in a free text field. For each record, you need to extract the postcode from the address text and use it to link the data to the ONS postcode directory, so that you can aggregate the data to a higher-level geography, like a constituency.

The best way I have found to do this in R is to use str_extract on the address text with the following regular expression.


# Import stringr
library(stringr)

# Greedy regular expression for postcodes
POSTCODE_REGEXP <- str_c(
    "([A-Za-z][A-Ha-hJ-Yj-y]?[0-9][A-Za-z0-9]? ",
    "?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})")

# Extract the first matching postcode in each address 
df$postcode <- str_extract(df$address, POSTCODE_REGEXP)


This regular expression will match postcodes even if they are missing the space separating the outward and inward parts, so it will still catch postcodes that have been misrecorded in this way. Consequently, if you remove any spaces from inside the postcodes in both datasets when you perform the join (create new columns for joining if you want to preserve the original data), then you get a higher match rate when linking the data.
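As a minimal sketch of that join, assuming the ONS postcode directory has been read into a data frame called onspd with a pcds column holding the formatted postcode (the names here are illustrative):

# Create space-free join keys in both datasets, keeping the original
# postcode columns intact
library(dplyr)

df$postcode_key <- str_to_upper(str_remove_all(df$postcode, "\\s"))
onspd$postcode_key <- str_to_upper(str_remove_all(onspd$pcds, "\\s"))

# Join on the normalised keys to attach the higher-level geographies
df <- left_join(df, onspd, by = "postcode_key")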

Note that the above code only looks for the first matching postcode in an address, so you may want to double check there is only one postcode per address with str_extract_all before relying on this strategy.
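One quick way to do that check is to count the matches in each address:

# Count the postcodes found in each address and flag any addresses
# containing more than one for manual checking
matches <- str_extract_all(df$address, POSTCODE_REGEXP)
df$n_postcodes <- lengths(matches)
df[df$n_postcodes > 1, ]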

R training

On Tuesday I ran the second module of the introductory R training for Georgina and Richard, which is all about using the tidyverse. I really like teaching this module because it introduces programming for the first time. One of the remarkable things about the tidyverse is that you can start by teaching techniques that researchers can benefit from using right away.
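For example, a grouped summary with dplyr is the kind of thing a researcher can adapt to their own data straight away. Here is a minimal sketch (the file and column names are made up):

# An illustrative beginner example: read a CSV and produce a grouped
# summary (the file and column names are made up)
library(tidyverse)

results <- read_csv("constituency_results.csv")

results %>%
    filter(country == "England") %>%
    group_by(region) %>%
    summarise(mean_turnout = mean(turnout, na.rm = TRUE)) %>%
    arrange(desc(mean_turnout))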

I think this is probably R's biggest advantage over other programming languages when it comes to teaching absolute beginners. The number of things you need to know before you can start using other languages to do useful data analysis is much larger (and I say this as someone who prefers Python overall).

Of course, this simplicity doesn't come entirely for free. Some of the apparently magical things that the tidyverse does through non-standard evaluation eventually need to be explained, and that's not trivial.

However, if you are learning programming specifically for data analysis, and you don't have a background in any programming language, I think R has a lower barrier to entry than everything else. And that is mainly thanks to the tidyverse.

Interactive visualisations and ES2015

I spent much of this week working on a new interactive visualisation for the Library. I am not going to talk about it before it's published. I will have some things to say about it next week. It's been an interesting collaboration between researchers with different skills and areas of expertise.

However, there was an issue related to interactives that came up this week that I can talk about and that I think is worth raising.

During the week, I was contacted by a colleague who knows we use D3 for our interactive data visualisations. They had seen that D3 version 6 has fully adopted ES2015, and they were worried that our visualisations would no longer be compatible with Internet Explorer.

In practice, this doesn't affect our work. We have been using ES2015 for at least two years, transpiling and polyfilling our code to work in older browsers.

If anything, D3 fully adopting ES2015 makes things a bit easier. D3 started using elements of modern JavaScript in version 5: you needed to polyfill Promises and fetch to use d3-fetch, for example. We were having to handle these issues on a case by case basis. With D3 version 6 you can take a consistent approach and transpile whichever D3 modules you import.

But I do think D3 switching fully to ES2015 is a big deal. Public sector organisations tend to be quite conservative about supporting older browsers, because they rightly don't want to exclude people who either can't afford to upgrade their devices, or who lack the technical expertise to keep up to date with what the industry expects of its customers.

At the same time, lots of people who are interested in data visualisation aren't necessarily that interested in software development more generally. Setting up a pipeline to compile modern JavaScript for older browsers can be challenging if you're not familiar with the wider JavaScript ecosystem.

In the pre-Covid past, I met data visualisation professionals from other public sector organisations who were taking the approach of avoiding modern JavaScript altogether, writing cross-browser compatible code natively with D3 version 4. That's surely not sustainable in the longer term.

D3 is clearly putting its weight behind a more modern approach to online data visualisation, compelling practitioners to either drop support for Internet Explorer or take on the responsibility of supporting it themselves.

Weeknotes: Week 38, 2020

18 Sep 2020 14:28 GMT

I have always been sceptical about the value of writing weeknotes. There are a million and one things I want to get done at work and the idea of writing weeknotes always felt like not doing any of them.

However, in the last few months I have started to notice how working from home has subtly changed the way we work. Some of these changes are good: I get longer periods of uninterrupted time between meetings, which makes it easier to be productive with development work.

But some are less good. In particular, I've noticed that with many people working from home there is less opportunity for serendipity: those moments when you find out through casual conversation that what you are doing joins up neatly with something one of your colleagues is doing, and you both benefit.

So while we're all keeping our distance, I thought it might help to know what I'm up to.

House of Commons Library GitHub account

This week we launched our House of Commons Library GitHub account. Since we started the data science programme two years ago, researchers in the Library have been working on tools to make routine data analysis easier. Some of this work is potentially useful to researchers outside Parliament, and we share code that has a general purpose when we can.

Until now we have all been using personal GitHub accounts, but some of the tools we have developed have become important enough that we need to maintain and manage them in one place. I hope that having an organisational account will also make collaboration and training easier. It's been a real pleasure seeing statistical researchers with no previous programming experience discover the potential of programming and get really good at it. We are now in a position to work collaboratively on shared projects as a team.

Parliamentary data

I spent a good part of this week working on a new R package for downloading data from Parliament's Committees API. The package is called clcommittees and I've published an initial working version on GitHub. We use data on committees in enquiries, briefings and dashboards. The package currently focusses on committees, memberships and roles, but it is likely to grow to include other data as and when we need it.

This package is part of a wider programme of work to develop our capability with Parliamentary data. We are developing tools to work with a range of Parliamentary data sources (including the data platform now that it has been unpaused), so watch this space.

New chart style for the Library

This week, the Library launched its new chart style. The style was developed by my colleagues Carl Baker and Paul Bolton. I've implemented the style as a ggplot2 theme in an R package called clcharts, so that Library researchers can easily produce charts in the new style when doing computational data analysis. To give you a flavour, here's an example of a ridgeline plot made with the package.

A smoothed ridge chart showing the distribution of Parliamentary constituencies by median age and settlement class

You can see more examples in the GitHub readme at the link above. I think the new style looks great, and thanks to patchwork we have been able to fully implement it in R, which wasn't the case with our old chart style.
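For anyone curious about the mechanics, implementing a house style as a ggplot2 theme mostly comes down to wrapping a set of theme() settings in a function. The sketch below is purely illustrative: it is not the clcharts API, and the theme name and settings are made up.

# A generic sketch of a house-style theme: not the clcharts API, just
# an illustration of the approach
library(ggplot2)

theme_housestyle <- function(base_size = 12) {
    theme_minimal(base_size = base_size) +
        theme(
            panel.grid.minor = element_blank(),
            panel.grid.major.x = element_blank(),
            plot.title = element_text(face = "bold", hjust = 0),
            plot.caption = element_text(size = rel(0.8), hjust = 1))
}

# Apply it like any other ggplot2 theme
# ggplot(mpg, aes(class)) + geom_bar() + theme_housestyle()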

MSOA Names 1.5.0

Carl and I updated the MSOA Names dataset and map to version 1.5.0 to fix a couple of errors people had spotted. The dataset has been turning up everywhere from Public Health England's map of coronavirus cases to the Doogal postcode lookup service.

2019 General Election datasets and cartogram

29 Dec 2019 09:41 GMT

I had a busy end to the year. Parliament voted for a general election to be held two weeks before Christmas and I was once again running the House of Commons Library's data collection effort. This was my third general election working for the Library. We completed the data collection in record time and published the first edition of our briefing paper and datasets within a week of the polls closing.

In between the data collection, I found a bit of time to work on some data visualisation for the election. My colleague Carl Baker (who I worked with previously on MSOA Names) has designed a new constituency cartogram, which neatly balances equally sized constituencies with geographic groupings that make it easy to find particular constituencies and to see patterns within historic county areas.

I made an interactive version of the cartogram to show the election results online on the morning after the vote. Embedding a small image of it here doesn't do it justice; the interactive version of the election cartogram is better.

I think it is a really nice bit of visual design and I am happy with how we managed to make it look and work on the web in quite a short period of time. I hope we can do further work using a similar approach for other geographies next year.