An R package for simple data wrangling

7 Jul 2018 11:28 GMT

I recently started a new role at work where one of my tasks is helping statisticians to develop data science skills. I've noticed that one of the most challenging obstacles people encounter when first learning to program is how much you need to learn in order to become productive.

It takes time to become a good programmer (it's a learning experience that never really ends), but there is an inflection point when you become more fluent and the time you spent learning how to do each thing for the first time starts to pay off.

The question I keep coming up against is how to motivate people who are learning to program in a professional setting to persevere through the initial learning period, when doing something in a new way is less efficient than doing it the old way.

Part of the answer is to show people the remarkable things that can only be done in the new way. But perhaps even more important is lowering the barrier to entry: reducing the time it takes beginners to learn simple and useful things.

With that in mind I wrote an R package called cltools, which is designed to make common data wrangling tasks easier to perform. These are all things that an experienced R user could do with base R or tidyverse functions. But the point is to reduce the level of skill people need in order to do useful work with R.

The package is primarily designed to help statistical researchers and data journalists covering public policy fields. It focusses on the simple data wrangling tasks that these researchers do most often, such as calculating row and column percentages, creating indices, and deflating prices to real terms.
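To make those tasks concrete, here is a minimal sketch of the underlying calculations, written in Python purely for illustration. It is not the cltools API: the function names and arguments are hypothetical, and the tables are plain lists of numbers.

```python
def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = []
    for row in table:
        total = sum(row)
        result.append([100 * value / total if total else 0.0 for value in row])
    return result


def column_percentages(table):
    """Convert each column of counts to percentages of that column's total."""
    totals = [sum(column) for column in zip(*table)]
    return [
        [100 * value / total if total else 0.0 for value, total in zip(row, totals)]
        for row in table
    ]


def index_series(values, base_position=0, base_value=100):
    """Rescale a series so the value at base_position equals base_value."""
    base = values[base_position]
    return [base_value * value / base for value in values]


def deflate(nominal, deflator, base_position):
    """Express nominal prices in the prices of the period at base_position."""
    base = deflator[base_position]
    return [price * base / d for price, d in zip(nominal, deflator)]


counts = [[1, 3], [2, 2]]
print(row_percentages(counts))      # [[25.0, 75.0], [50.0, 50.0]]
print(index_series([50, 75, 100]))  # [100.0, 150.0, 200.0]
```

None of this is difficult, but each task takes a beginner a while to work out from first principles, which is the gap a helper package aims to close.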

Let me know if you have any suggestions for ways to make it better.

Animating mortality in England and Wales

8 Apr 2018 19:31 GMT

From around 1970 to 2011 there was a broad downward trend in mortality in England and Wales: both the total number of deaths and the crude death rate (the number of deaths per thousand people) fell.

Since 2011 this long-term downward trend has halted and both the number of deaths and the crude death rate have increased.

Two charts showing the annual number of deaths and the number of deaths per thousand people in England and Wales from 1970 to 2016.

These charts show trends in mortality from 1970 to 2016. But they do not include data for 2017 or 2018, as neither the final totals nor the population estimates needed to calculate mortality rates have been published for these years.

A more fine-grained and up-to-date analysis of recent trends in mortality can be produced using Office for National Statistics figures for the number of weekly deaths registered in England and Wales during the last few years.

This dataset provides the most recent statistics on deaths but comes with the caveats that the figures are provisional and are not normalised to the total population.
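For reference, the crude death rate is the number of deaths divided by the population, scaled to a rate per thousand people. The weekly figures cannot be normalised in this way until population estimates are available. A minimal Python sketch, with a hypothetical function name and made-up numbers rather than ONS data:

```python
def crude_death_rate(deaths, population):
    """Return the number of deaths per thousand people."""
    return 1000 * deaths / population


# Illustrative values only, not real statistics.
print(crude_death_rate(500, 50_000))  # 10.0
```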

Animated chart

This animated chart shows the number of deaths registered in England and Wales in each week of the year for each calendar year from 2011 to 2017, and so far in 2018.

As the chart shows, the number of weekly deaths has gradually increased: the darker lines showing data for more recent years are generally higher than the lighter lines showing data for earlier years.

Some of this increase may be explained by the growing population, but the larger number of winter deaths in 2015, 2017 and 2018, and the rising death rate before 2016 suggest this may not be the only factor.

I thought this was a novel way of presenting the data, and potentially a good use case for animation.

A simple hexmap editor

7 Feb 2018 21:15 GMT

I recently produced a hexmap of local authorities in England and Wales, which I need for mapping a number of different datasets, such as data on internal migration. As I couldn't find an existing hexmap for these areas, I had to create one from scratch.

Initially, I hoped to generate a hexmap algorithmically using the geogrid package in R. But while geogrid did a decent job of minimising the average distance between where areas were located on the geographic map and on the hexagonal map, the number of local authorities and the variation in their size led to some odd results. Birmingham was on the coast, for example.

Furthermore, the algorithm that geogrid uses isn't designed to preserve certain geospatial relationships, such as contiguous groups of areas like regions, or relative positions that matter logically. For instance, in the original output from geogrid, South Tyneside was north of North Tyneside.

Geogrid is a fantastic starting point, which can do a lot of the groundwork for you, but it seems that to make a good hexmap you need to do a certain amount of the work by hand.

To help with that, I wrote a small tool that lets you edit HexJSON data in the browser, and then export your work in both HexJSON and GeoJSON formats. Like a lot of tools I make, it doesn't have a massive feature set, but aims to do one thing well.
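As a rough idea of what converting HexJSON to GeoJSON involves, here is a minimal Python sketch. It assumes the "odd-r" HexJSON layout (pointy-topped hexes with odd rows shifted right) and draws hexes in abstract grid units rather than real-world coordinates. The function names are hypothetical, and this is not necessarily how the editor itself does the conversion.

```python
import math


def hex_center(q, r, size=1.0):
    """Centre of a pointy-topped hex in an odd-r layout (odd rows shifted right)."""
    x = size * math.sqrt(3) * (q + 0.5 * (r % 2))
    y = size * 1.5 * r
    return x, y


def hex_polygon(q, r, size=1.0):
    """Closed ring of corner coordinates for the hex at column q, row r."""
    cx, cy = hex_center(q, r, size)
    ring = []
    for i in range(6):
        angle = math.radians(60 * i + 30)  # corners at 30, 90, ... 330 degrees
        ring.append([cx + size * math.cos(angle), cy + size * math.sin(angle)])
    ring.append(ring[0])  # GeoJSON rings repeat the first point at the end
    return ring


def hexjson_to_geojson(hexjson, size=1.0):
    """Build a GeoJSON FeatureCollection with one polygon per hex."""
    if hexjson.get("layout") != "odd-r":
        raise ValueError("this sketch only handles the odd-r layout")
    features = []
    for hex_id, props in hexjson["hexes"].items():
        features.append({
            "type": "Feature",
            "properties": {"id": hex_id, **props},
            "geometry": {
                "type": "Polygon",
                "coordinates": [hex_polygon(props["q"], props["r"], size)],
            },
        })
    return {"type": "FeatureCollection", "features": features}
```

Each hex keeps its HexJSON properties in the GeoJSON output, so the same variables remain available for labelling and shading.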

Editing hexmaps in the browser

The HexJSON Editor lets you import hexmap data in HexJSON format and move the hexes around by hand. Simply select the hex you want to move by clicking on it.

A grid of hexagons with one of the hexagons highlighted indicating it is selected and ready to move

And then click on the destination to place it.

The same grid of hexagons as shown previously but the selected hexagon has moved

If the position where you place a hex is already occupied by another hex, the two hexes swap places.
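The move-or-swap behaviour can be sketched in a few lines. This is an illustrative Python version operating on HexJSON data, where each hex is keyed by an id and has column and row coordinates `q` and `r`. The function name `move_hex` is hypothetical, and the editor's own code may work differently.

```python
import copy


def move_hex(hexjson, hex_id, q, r):
    """Return a copy of hexjson with hex_id placed at (q, r).

    If another hex already occupies (q, r), it is moved to the position
    the selected hex came from, so the two swap places.
    """
    result = copy.deepcopy(hexjson)
    hexes = result["hexes"]
    moving = hexes[hex_id]
    for other_id, other in hexes.items():
        if other_id != hex_id and other["q"] == q and other["r"] == r:
            other["q"], other["r"] = moving["q"], moving["r"]
            break
    moving["q"], moving["r"] = q, r
    return result


grid = {
    "layout": "odd-r",
    "hexes": {"A": {"q": 0, "r": 0}, "B": {"q": 1, "r": 0}},
}
swapped = move_hex(grid, "A", 1, 0)
print(swapped["hexes"])  # {'A': {'q': 1, 'r': 0}, 'B': {'q': 0, 'r': 0}}
```

Swapping rather than refusing the move means no two hexes ever share a position, and a rearrangement can be built up from a sequence of simple swaps.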

If you want to try the editor, you can use the example HexJSON grid shown above, or alternatively try the local authority hexmap, or the Open Data Institute's constituency hexmap.

When you load the HexJSON data, you can choose which variable to use for labelling. You also have the option of choosing a categorical variable to use for shading the hexes — the colours are chosen for each category using a standard categorical colour scale. Finally, if you need more space around the hexes for editing, you have the option of adding more padding.

I made the HexJSON editor mainly to meet my own mapping needs, but I thought it was worth sharing. If there is enough interest in new features I could either add them, or open source the code so other people can. In the meantime, it's there if you need it.

A hexmap of district and unitary local authorities in England and Wales

4 Feb 2018 20:57 GMT

Last week I posted an interactive hexmap showing internal migration in England and Wales. Today I am publishing the boundaries for the hexmap so that other people can use them to show data for the same geographical areas.

The hexmap represents the 348 lower-tier (district and unitary) local authorities in England and Wales. The map preserves the organisation of local authorities within their countries and regions.

A hexmap of lower-tier local authorities in England and Wales where each hex is shaded according to its region.

The boundary data for this hexmap can be found in the files linked below in HexJSON and GeoJSON formats.

Arranging the local authorities where you would expect to find them, while preserving their geospatial relationships to one another and their regional groupings, was not easy. Each solution to a particular difficulty created its own new problems, and it was often a question of deciding which arrangement was least unsatisfactory.

What this process has taught me is that hexmaps that represent real geographies are a useful fiction, and producing one involves making compromises. So while I am happy with the arrangement shown here, it is by no means definitive. Feel free to take it as a starting point and modify it further. I would be interested to hear of any changes or improvements.

The map was made using an interactive tool I have written for editing HexJSON data in the browser. I intend to put it online once I have had an opportunity to tidy up the code a bit.

Finally, I want to say thanks to Carl Baker, who kindly reviewed the work and suggested some improvements. Carl not only has great mapping skills, but also a remarkable memory for the geography of the UK, so he was able to spot problems and suggest solutions quickly.

Mapping migration within England and Wales

29 Jan 2018 21:17 GMT

A few weeks ago I posted this scatterplot on Twitter. It shows Office for National Statistics estimates of net internal migration between local authorities in England and Wales in the year ending June 2016. The local authorities are grouped into quartiles based on how urban or rural they are.

A scatterplot showing net internal migration in local authorities in England and Wales, where local authorities are grouped into quartiles based on how urban or rural they are.

As you can see, there is an interesting pattern: the local authorities with the highest net outflows are among the most urban. But not all urban areas have large net outflows, and there are not many rural areas with correspondingly large net inflows.

I wanted to find a way of representing these migration flows that would let me explore how people were moving between local authorities: which local authorities had the largest flows in each direction, and what was the balance of flows between these and other local authorities?

With that in mind, I built an interactive hexmap that shows the ONS internal migration data.

How it works

The map shows the 348 district and unitary local authorities in England and Wales, and shades each of them according to the magnitude of their internal migration flows. By default the map shows the net flow in each area (these are the flows shown in the scatterplot above), but the tabs let you switch to see the gross inflows and outflows too.
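The relationship between the three types of flow is simple: an area's net flow is its gross inflow minus its gross outflow. Here is a minimal Python sketch of that calculation, assuming the data arrives as (origin, destination, people) records. The function name and the example figures are made up for illustration and are not ONS data.

```python
from collections import defaultdict


def summarise_flows(moves):
    """Total the inflow, outflow and net flow for every area.

    moves is an iterable of (origin, destination, people) records.
    """
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for origin, destination, people in moves:
        outflow[origin] += people
        inflow[destination] += people
    areas = set(inflow) | set(outflow)
    return {
        area: {
            "inflow": inflow[area],
            "outflow": outflow[area],
            "net": inflow[area] - outflow[area],
        }
        for area in areas
    }


# Illustrative records only.
moves = [("A", "B", 10), ("B", "A", 4), ("A", "C", 1)]
summary = summarise_flows(moves)
print(summary["A"])  # {'inflow': 4, 'outflow': 11, 'net': -7}
```

Because every move out of one area is a move into another, the net flows across all areas sum to zero.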

If you click on a local authority (or tap on it twice on a touchscreen) you can see the flows between that local authority and other areas. Mousing over an area (or tapping on it once on a touchscreen) pops up a label showing the name of the local authority and the flows in question.

The map stores the currently selected local authority and flow in the URL, so you can link directly to a given combination of local authority and flow. Here are the net migration flows between the London Borough of Barnet and other local authorities, for example. You can see that there are net flows into Barnet from local authorities in central London, and net flows out from Barnet to less urban areas. And here is the map for Sevenoaks, which shows a similar pattern.
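Keeping the selection in the URL amounts to encoding two values and reading them back when the page loads. The Python sketch below shows the idea with the standard library. The parameter names `area` and `flow`, the use of a query string, and the example.org address are all assumptions for illustration, not the map's actual URL scheme.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_link(base_url, area, flow):
    """Return a URL that records the selected area and flow type."""
    return f"{base_url}?{urlencode({'area': area, 'flow': flow})}"


def read_link(url, default_flow="net"):
    """Recover the selected area and flow type from a URL."""
    query = parse_qs(urlsplit(url).query)
    area = query.get("area", [None])[0]
    flow = query.get("flow", [default_flow])[0]
    return area, flow


link = build_link("https://example.org/map", "Barnet", "net")
print(link)             # https://example.org/map?area=Barnet&flow=net
print(read_link(link))  # ('Barnet', 'net')
```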

Strengths and weaknesses

I think this approach works well in some respects and less well in others. It effectively condenses a very large amount of data into a relatively simple interface that allows you to easily explore geographical patterns of internal migration, which is what I set out to do.

However, one weakness of this design is that the variation in the size of the migration flows between local authorities is so great that it's not really possible to represent the data using the same shading scale for every area.

Consequently, while each type of flow is represented using the same set of colours, the scale associated with those colours changes depending on which local authority is selected.

This ensures that you see an appropriate level of variation in the data for each local authority, but it also means that the maps for each area are not directly comparable with one another: you always have to glance at the colour key to check the magnitude of the flows shown in each case.

Effectively, this interface provides 348 different local authority maps for each type of migration flow. The maps all use the same grammar to communicate, but they show the data at different scales.
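The trade-off is easier to see in code. The Python sketch below builds a separate colour scale for each selected area by dividing the range from zero to that area's largest flow into equal-width bins. The equal-width binning, the function name and the colour labels are assumptions for illustration; the map's actual scale may be constructed differently.

```python
def make_scale(values, colours):
    """Return a function that maps a flow to one of the given colours.

    The bins are equal in width and span zero to the largest magnitude in
    values, so each set of values gets its own scale.
    """
    largest = max((abs(value) for value in values), default=0)

    def colour_for(value):
        if largest == 0:
            return colours[0]
        position = abs(value) / largest  # between 0 and 1
        index = min(int(position * len(colours)), len(colours) - 1)
        return colours[index]

    return colour_for


colours = ["light", "mid", "dark"]
large_area = make_scale([10, 50, 100], colours)
small_area = make_scale([1, 2, 3], colours)

# The same flow of 3 people gets a different shade on each map.
print(large_area(3))  # light
print(small_area(3))  # dark
```

This is why the colour key has to be checked for each map: the colours are consistent, but the values they stand for are not.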

A new hexmap of local authorities

One final feature of the visualisation worth mentioning is the hexmap itself. This is an entirely new hexmap of local authorities in England and Wales that was built from scratch. I intend to write more about the process of making it, and share some tools I have developed to make creating hexmaps easier, so look out for those in forthcoming posts.