olihawkins

Animating mortality in England and Wales

8 Apr 2018 19:31 GMT

From around 1970 to 2011 there was a broad downward trend in mortality in England and Wales: both the total number of deaths and the crude death rate (the number of deaths per thousand people) fell.

Since 2011 this long-term downward trend has halted and both the number of deaths and the crude death rate have increased.

Two charts showing the annual number of deaths and the number of deaths per thousand people in England and Wales from 1970 to 2016.

These charts show trends in mortality from 1970 to 2016. But they do not include data for 2017 or 2018, as neither the final totals nor the population estimates needed to calculate mortality rates have been published for these years.

A more fine-grained and up-to-date analysis of recent trends in mortality can be produced using Office for National Statistics figures for the number of weekly deaths registered in England and Wales during the last few years.

This dataset provides the most recent statistics on deaths but comes with the caveats that the figures are provisional and are not normalised to the total population.

Animated chart

This animated chart shows the number of deaths registered in England and Wales in each week of the year for each calendar year from 2011 to 2017, and so far in 2018.

As the chart shows, the number of weekly deaths has gradually increased: the darker lines showing data for more recent years are generally higher than the lighter lines showing data for earlier years.

Some of this increase may be explained by the growing population, but the larger number of winter deaths in 2015, 2017 and 2018, and the rising death rate before 2016 suggest this may not be the only factor.

I thought this was a novel way of presenting the data, and potentially a good use case for animation.

A simple hexmap editor

7 Feb 2018 21:15 GMT

I recently produced a hexmap of local authorities in England and Wales, which I need for mapping a number of different datasets, such as data on internal migration. As I couldn't find an existing hexmap for these areas, I had to create one from scratch.

Initially, I hoped to generate a hexmap algorithmically using the geogrid package in R. But while geogrid did a decent job of minimising the average distance between where areas were located on the geographic map and on the hexagonal map, the number of local authorities and the variation in their size led to some odd results. Birmingham was on the coast, for example.

Furthermore, the algorithm that geogrid uses isn't designed to preserve certain geospatial relationships, such as contiguous groups of areas like regions, or relative positions that matter logically. For instance, in the original output from geogrid, South Tyneside was north of North Tyneside.

Geogrid is a fantastic starting point, which can do a lot of the groundwork for you, but it seems that to make a good hexmap you need to do a certain amount of the work by hand.

To help with that, I wrote a small tool that lets you edit HexJSON data in the browser, and then export your work in both HexJSON and GeoJSON formats. Like a lot of tools I make, it doesn't have a massive feature set, but aims to do one thing well.

Editing hexmaps in the browser

The HexJSON Editor lets you import hexmap data in HexJSON format and move the hexes around by hand. Simply select the hex you want to move by clicking on it.

A grid of hexagons with one of the hexagons highlighted indicating it is selected and ready to move

And then click on the destination to place it.

The same grid of hexagons as shown previously, but the selected hexagon has moved

If the position where you place a hex is already occupied by another hex, the editor will swap them.

If you want to try the editor, you can use the example HexJSON grid shown above, or alternatively try the local authority hexmap, or the Open Data Institute's constituency hexmap.

When you load the HexJSON data, you can choose which variable to use for labelling. You also have the option of choosing a categorical variable to use for shading the hexes — the colours are chosen for each category using a standard categorical colour scale. Finally, if you need more space around the hexes for editing, you have the option of adding more padding.
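
For reference, HexJSON is just JSON with a particular structure: a layout, plus a set of hexes keyed by id, where each hex has grid coordinates (q and r) and whatever extra attributes you want to attach. The sketch below builds a minimal example in Python and saves it to a file. The ids, names and attribute values are invented for illustration: a name-like attribute is the kind of thing you might choose for labelling, and a categorical attribute like a region or group is the kind of thing you might choose for shading.


import json

# A minimal HexJSON-style structure: the ids, names and groups are invented
hexjson = {
    "layout": "odd-r",
    "hexes": {
        "A01": {"q": 0, "r": 0, "n": "Area One", "group": "North"},
        "A02": {"q": 1, "r": 0, "n": "Area Two", "group": "North"},
        "A03": {"q": 0, "r": 1, "n": "Area Three", "group": "South"}
    }
}

# Write the structure out as a HexJSON file
with open("example.hexjson", "w") as f:
    json.dump(hexjson, f, indent = 4)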

I made the HexJSON editor mainly to meet my own mapping needs, but I thought it was worth sharing. If there is enough interest in new features I could either add them, or open source the code so other people can. In the meantime, it's there if you need it.

A hexmap of district and unitary local authorities in England and Wales

4 Feb 2018 20:57 GMT

Last week I posted an interactive hexmap showing internal migration in England and Wales. Today I am publishing the boundaries for the hexmap so that other people can use them to show data for the same geographical areas.

The hexmap represents the 348 lower-tier (district and unitary) local authorities in England and Wales. The map preserves the organisation of local authorities within their countries and regions.

A hexmap of lower-tier local authorities in England and Wales where each hex is shaded according to its region.

The boundary data for this hexmap can be found in the files linked below in HexJSON and GeoJSON formats.

Arranging the local authorities where you would expect to find them, while preserving their geospatial relationships to one another and their regional groupings, was not easy. Each solution to a particular difficulty created its own new problems, and it was often a question of deciding which arrangement was least unsatisfactory.

What this process has taught me is that hexmaps that represent real geographies are a useful fiction, and producing one involves making compromises. So while I am happy with the arrangement shown here, it is by no means definitive. Feel free to take it as a starting point and modify it further. I would be interested to hear of any changes or improvements.

The map was made using an interactive tool I have written for editing HexJSON data in the browser. I intend to put it online once I have had an opportunity to tidy up the code a bit.

Finally, I want to say thanks to Carl Baker, who kindly reviewed the work and suggested some improvements. Carl not only has great mapping skills, but also a remarkable memory for the geography of the UK, so he was able to spot problems and suggest solutions quickly.

Mapping migration within England and Wales

29 Jan 2018 21:17 GMT

A few weeks ago I posted this scatterplot on Twitter. It shows Office for National Statistics estimates of net internal migration between local authorities in England and Wales in the year ending June 2016. The local authorities are grouped into quartiles based on how urban or rural they are.

A scatterplot showing net internal migration in local authorities in England and Wales, where local authorities are grouped into quartiles based on how urban or rural they are.

As you can see, there is an interesting pattern: the local authorities with the highest net outflows are among the most urban. But not all urban areas have large net outflows, and there are not many rural areas with correspondingly large net inflows.

I wanted to find a way of representing these migration flows that would let me explore how people were moving between local authorities: which local authorities had the largest flows in each direction, and what was the balance of flows between these and other local authorities?

With that in mind, I built an interactive hexmap that shows the ONS internal migration data.

How it works

The map shows the 348 district and unitary local authorities in England and Wales, and shades each of them according to the magnitude of their internal migration flows. By default the map shows the net flow in each area (these are the flows shown in the scatterplot above), but the tabs let you switch to see the gross inflows and outflows too.

If you click on a local authority (or tap on it twice on a touchscreen) you can see the flows between that local authority and other areas. Mousing over an area (or tapping on it once on a touchscreen) pops up a label showing the name of the local authority and the flows in question.

The map stores the currently selected local authority and flow in the URL, so you can link directly to a given combination of local authority and flow. Here are the net migration flows between the London Borough of Barnet and other local authorities, for example. You can see that there are net flows into Barnet from local authorities in central London, and net flows out from Barnet to less urban areas. And here is the map for Sevenoaks, which shows a similar pattern.

Strengths and weaknesses

I think this approach works well in some respects and less well in others. It effectively condenses a very large amount of data into a relatively simple interface that allows you to easily explore geographical patterns of internal migration, which is what I set out to do.

However, one weakness of this design is that the variation in the size of the migration flows between local authorities is so great that it's not really possible to represent the data using the same shading scale for every area.

Consequently, while each type of flow is represented using the same set of colours, the scale associated with those colours changes depending on which local authority is selected.

This ensures that you see an appropriate level of variation in the data for each local authority, but it also means that the maps for each area are not directly comparable with one another: you always have to glance at the colour key to check the magnitude of the flows shown in each case.

Effectively, this interface provides 348 different local authority maps for each type of migration flow. The maps all use the same grammar to communicate, but they show the data at different scales.

A new hexmap of local authorities

One final feature of the visualisation worth mentioning is the hexmap itself. This is an entirely new hexmap of local authorities in England and Wales that was built from scratch. I intend to write more about the process of making it, and share some tools I have developed to make creating hexmaps easier, so look out for those in forthcoming posts.

Getting started with the Labour Force Survey in Python and R

20 Nov 2017 19:58 GMT

The Labour Force Survey is a quarterly national survey that's designed to represent the UK's household population. The Office for National Statistics primarily uses the LFS to produce quarterly labour market statistics; but both the Labour Force Survey and its annual equivalent, the Annual Population Survey, are also used to produce a range of other statistics, such as estimates of families and households and of the population by country of birth and nationality.

The ONS publishes LFS and APS microdata through the UK Data Service. I can't link directly to the datasets here, but they are publicly available if you sign up for a UK Data Service account. This data is used in academic, political, and commercial research for various purposes, and a common use is to produce population estimates for specific groups.

The datasets are published in SPSS and Stata formats. Many people either can't afford these programs or would prefer to use something else. This article provides a basic introduction to working with LFS datasets in Python and R. It shows how to load the data in a usable format and how to produce population estimates as frequency tables and crosstabs.

Setting up

The examples below use a dataset stored in the Stata file format (.dta), so make sure you have a recent LFS dataset in that format and that you give it the same filename as you use in your code. In Python we will use the pandas library, while in R we will use the tidyverse and haven packages, so make sure you have the right libraries for your chosen language installed too. It's also a good idea to keep the LFS user guide to hand. Details of the survey variables and their codes are in Volume 3.

Working with LFS data

Python and R can both read files in .dta format, but there are a couple of issues with LFS data that it helps to know about.

First, the LFS contains a large number of detailed categorical variables. In .dta and .sav files these are each represented as a set of numerical codes, along with a set of associated labels. When working with these variables in Python and R it is best to refer to individual categories using their numerical codes, as these are guaranteed to be consistent across datasets for different quarters. But when summarising the data for presentation it is better to use the labels, so the results are readable.

Ideally, you want both the codes and the labels for any variables of interest available in your data frame, but the data structures for categorical variables in Python and R either convert the codes to labels, or only preserve the codes.

The second difficulty is that while Python and R both require unique values for their categorical data types, some of the variables in LFS datasets have labels that are not unique. So you can't use these labels as categoricals in pandas, or as factors in R, even if you want to. If you try to automatically convert these variables when you load the data, you'll get errors. Fortunately, there are ways around this, and the issue only affects a small number of variables which are often not needed.

The solution to these problems is to have two copies of each categorical variable of interest in your data frame whenever possible: one containing the codes and the other containing the labels. That way you can use the codes when you need to select particular categories in your analysis, and the labels to present the results. There are different ways of achieving this in Python and R but the result is essentially the same. For problem variables with duplicate labels you can just use the codes and create your own unique labels if necessary.

To illustrate how this works in practice, the first example below shows how to load and summarise an LFS dataset in order to produce a simple breakdown of the population of the UK by region. We'll do it first in R, and then do the same thing in Python. After that, we'll extend the example to produce a crosstab of the population by region and nationality. Finally, we'll filter the data to produce the same crosstab for a subset of the population: nationals of EU8 countries.

Frequency tables

To produce a population estimate from the LFS you need to sum the population weights for the population in question. The population weights are stored in a variable called PWTx where x is a year expressed as two digits. In the dataset I am using the weight variable is called PWT17.

The other variable we need is the one that identifies each respondent's country or region of residence within the UK, which is called GOVTOF2. We are going to create a second version of this variable using its labels rather than its codes. That way, when we group the data by this variable and sum the population weights in each region, we will end up with a readable frequency table.

Let's start with R. There is more than one way to load a .dta file with R, but we are going to use the haven package, because haven has a nice way of handling categorical data stored in .dta files.

Instead of forcing you to choose between using codes or labels for categorical data at the point you load the file, haven loads categorical variables as a special data type called the labelled type. This represents the data as codes in the data frame, but keeps a list of the labels associated with each code for reference. You can create a factor from a labelled variable at any point using as_factor() if you want to use the labels in the data frame instead.

The following code loads an LFS Stata file as a data frame with haven's read_dta() method, and uses as_factor() to create a second version of GOVTOF2 based on its labels rather than its codes. We'll call this variable GOVTOF2F to identify it as a factor.

It's then just a question of grouping the data by GOVTOF2F and summing the population weights within each group. (If you don't understand what is happening with the %>% operator, called a pipe, have a look at Chapter 5 of R for Data Science.)

It's also a good idea to round the estimates to at least the nearest thousand so as not to be spuriously precise. You can then save the output to disk with write.csv().


library(tidyverse)
library(haven)

# Load the data file
data <- read_dta("lfs-2017-q2.dta")

# Create a label version of the region variable
data$GOVTOF2F <- as_factor(data$GOVTOF2)

# Calculate the population by region
estimates <- data %>%
    # Group by region
    group_by(GOVTOF2F) %>%
    # Sum the population weights
    summarise(POPEST = sum(PWT17))

# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)

# Write out the estimates as a csv
write.csv(estimates, file = "frequency.csv", row.names = FALSE)

Now let's do the same thing with Python. Unlike haven's read_dta() method, the read_stata() method in pandas does not preserve both the codes and the labels when loading categorical data: you have to choose which you want when you load it. To get around this, you can load the data twice, once as codes and once as labels, and then combine the two sets of variables into a single dataset.

To do this, we can take advantage of a nice feature of read_stata() that haven's read_dta() doesn't have, which is the ability to load only selected variables from the data file. So while this takes a few more lines of code in Python, and involves reading from the data file twice, it actually executes faster.

First, we load the variables of interest as codes. These are the population weight (PWT17), the region of residence (GOVTOF2), and the respondent unique identifier (CASENOP) — we'll use CASENOP to link the codes with the labels.

Next, we load the variables we need as labels and rename them accordingly. We only need GOVTOF2 for labelling, but we'll also need CASENOP for linking. Then we merge the labels with the codes by joining the two datasets on CASENOP.

After that, we use the same approach we used in R: group the data by region, sum the population weights in each region, and round the estimates to the nearest thousand.


import pandas as pd

# Load the data as codes
codes = pd.read_stata(
    'lfs-2017-q2.dta',
    columns = ['CASENOP', 'GOVTOF2', 'PWT17'],
    convert_categoricals = False)

# Load the data as categorical labels
labels = pd.read_stata(
    'lfs-2017-q2.dta',
    columns = ['CASENOP', 'GOVTOF2'],
    convert_categoricals = True)

# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F']

# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')

# Group by region
gd = data.groupby(['GOVTOF2F'])

# Sum the population weights
estimates = gd[['PWT17']].sum()

# Round the estimates
estimates = round(estimates, -3)

# Write out the estimates as a csv
estimates.to_csv('frequency.csv')

In these examples we didn't use the version of GOVTOF2 that contains the codes. But it's a good idea to have both versions of a variable you're using available, in case you want to select specific categories within it later on. We'll do this in the section on filtering below.

Crosstabs

But first, let's adapt the examples to change the output from a one-dimensional frequency table to a two-dimensional crosstab of the population by region and nationality.

We need to do three new things: create a labelled version of the nationality variable (NATOX7), group the data by both region and nationality before summing the population weights, and then reshape the estimates for each combination of region and nationality from long form to wide form.

In R you can do this with spread(), which is part of the tidyr package.


library(tidyverse)
library(haven)

# Load data and create labelled versions of the variables
data <- read_dta("lfs-2017-q2.dta")
data$GOVTOF2F <- as_factor(data$GOVTOF2)
data$NATOX7F <- as_factor(data$NATOX7)

# Calculate the population estimates in long form
estimates <- data %>%
    # Group by region and nationality
    group_by(GOVTOF2F, NATOX7F) %>%
    # Sum the population weights
    summarise(POPEST = sum(PWT17))

# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)

# Reshape the population estimates to wide form
crosstab <- estimates %>% 
    spread(GOVTOF2F, POPEST, fill = 0)

# Write out the estimates as a csv
write.csv(crosstab, file = "crosstab.csv", row.names = FALSE)

In Python you can do the same thing with pandas' pivot_table().


import pandas as pd

# Load the data as codes
codes = pd.read_stata(
    'lfs-2017-q2.dta',
    columns = ['CASENOP', 'GOVTOF2', 'NATOX7', 'PWT17'],
    convert_categoricals = False)

# Load the data as categorical labels
labels = pd.read_stata(
    'lfs-2017-q2.dta',
    columns = ['CASENOP', 'GOVTOF2', 'NATOX7'],
    convert_categoricals = True)

# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F', 'NATOX7F']

# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')

# Group by region name
gd = data.groupby(['GOVTOF2F', 'NATOX7F'])

# Sum the population weights
estimates = gd[['PWT17']].sum()

# Round the estimates
estimates = round(estimates, -3)

# Reshape the population estimates to wide form
crosstab = pd.pivot_table(
    estimates, 
    index = ['NATOX7F'], 
    columns = ['GOVTOF2F'],
    fill_value = 0)

# Write out the estimates as a csv
crosstab.to_csv('crosstab.csv')

While this crosstab provides a very detailed breakdown of the population, many of the individual estimates within it are too small to be reliable, so you shouldn't report estimates at this level of detail for most nationalities (see the caveats below).

Filtering

Finally, we'll filter the data before summing the population weights so that the estimates are for a subset of the population. In this case we'll filter on nationality to select nationals of the eight Eastern European countries that joined the EU in 2004, known as the EU8.

To do this we'll use the codes for NATOX7 to select respondents by their nationality. Note that we filter the data frame on the variable containing the codes (NATOX7), but group by the variable containing the labels (NATOX7F).

You can find the codes for each variable in Volume 3 of the LFS user guide. There are nine codes listed here because sometimes survey respondents give their nationality as “Czechoslovakia”, which no longer exists but whose successor countries are in the EU8.

To filter in R, add a filter() to the start of the pipeline for transforming the data.


# NATOX7 codes for EU8
EU8 <- c(203, 971, 233, 348, 
    428, 440, 616, 703, 705)

# Calculate the population estimates in long form
estimates <- data %>%
    # Select only EU8 nationals
    filter(NATOX7 %in% EU8) %>%
    # Group by region and nationality
    group_by(GOVTOF2F, NATOX7F) %>%
    # Sum the population weights
    summarise(POPEST = sum(PWT17))

In Python, you can use pandas' subsetting operations to select the target rows from the data frame before you group and sum. This time, we'll also call dropna() on the results to remove empty rows left by nationalities we filtered out.


# NATOX7 codes for EU8
EU8 = [203, 971, 233, 348, 
    428, 440, 616, 703, 705]

# Select only EU8 nationals
filtered = data.loc[data['NATOX7'].isin(EU8)]

# Group by region and nationality 
gd = filtered.groupby(['GOVTOF2F', 'NATOX7F'])

# Sum the population weights (and drop empty rows)
estimates = gd[['PWT17']].sum().dropna()

Some caveats

The above examples only scratch the surface of what's possible with the Labour Force Survey. Other things you might want to do include recoding variables, repeating the same analysis over different quarters so as to build up a time series, or calculating confidence intervals for estimates. (These may be the subjects of future posts.)
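
To give a flavour of the time series case, here is a rough sketch in Python of one way to repeat the regional estimate across several quarterly files and stack the results. The filenames are invented, and the sketch assumes each dataset uses the same region and weight variables; in practice the weight variable name changes as the weights are revised, so check the user guide for each dataset.


import pandas as pd

# Hypothetical quarterly datasets: the filenames are placeholders, and the
# weight variable may differ between quarters
files = {
    '2017 Q1': 'lfs-2017-q1.dta',
    '2017 Q2': 'lfs-2017-q2.dta'}

quarters = []
for quarter, filename in files.items():

    # Load the region labels and the population weights
    df = pd.read_stata(
        filename,
        columns = ['GOVTOF2', 'PWT17'],
        convert_categoricals = True)
    df = df.rename(columns = {'GOVTOF2': 'GOVTOF2F'})

    # Sum the weights by region and record the quarter
    estimates = df.groupby(['GOVTOF2F'])[['PWT17']].sum()
    estimates['QUARTER'] = quarter
    quarters.append(estimates)

# Stack the quarterly estimates into a single long-form table
timeseries = pd.concat(quarters).reset_index()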

But bear in mind that while the scale of the Labour Force Survey makes it a rich source of data, it also introduces some potential pitfalls.

First and foremost, the LFS is a survey, so its estimates are surrounded by statistical uncertainty. Estimates based on a small number of respondents are particularly uncertain, and the level of detail in the LFS makes it easy to find lots of small populations, so take care not to base conclusions on tiny samples. To get the size of the sample underpinning each estimate you can count the number of respondents, rather than sum their weights.

In R, use n() rather than sum() as the function in summarise().


summarise(RESP = n())

In Python, call count() rather than sum() on the grouped data.


respondents = gd[['PWT17']].count()

And don't round the results.

If you do this for the crosstab of population by region and nationality you'll see that many individual estimates are based on just a handful of respondents, and are too small to be considered reliable. You should aggregate smaller estimates into higher level groups; in this case you could use the ONS country groupings.
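
As an illustration of that kind of aggregation, the sketch below maps the nationality labels from the crosstab example to broader groups before summing the weights. The group names and label spellings are placeholders rather than the actual ONS groupings, so check the exact labels in your dataset and choose groupings that suit your analysis.


# Hypothetical mapping from nationality labels to broader groups: the label
# spellings and group names are placeholders, not the ONS groupings
groups = {
    'Poland': 'EU8',
    'Hungary': 'EU8',
    'France': 'EU14',
    'Germany': 'EU14'}

# Assign each respondent to a group, with a catch-all for other nationalities
data['NATGROUP'] = data['NATOX7F'].astype(str).map(groups).fillna('Other')

# Sum the population weights by region and nationality group
estimates = data.groupby(['GOVTOF2F', 'NATGROUP'])[['PWT17']].sum()

# Round the estimates
estimates = round(estimates, -3)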

Another thing to bear in mind is that LFS data is not seasonally adjusted, so comparisons between estimates at different points in time may not be valid. A simple solution is to only compare estimates from the same quarter in each year.

Furthermore, there are technical issues affecting some aspects of the data — the LFS is known to underestimate benefit claimants and seasonal agricultural workers for example — so it's worth reading the literature on the LFS carefully, especially those parts that relate to your analysis.

Finally, it's worth looking at other people's research, to see how they have used the LFS to address similar questions to yours, and how they have dealt with any problems they encountered.