Mapping migration within England and Wales

29 Jan 2018 21:17 GMT

A few weeks ago I posted this scatterplot on Twitter. It shows Office for National Statistics estimates of net internal migration between local authorities in England and Wales in the year ending June 2016. The local authorities are grouped into quartiles based on how urban or rural they are.

A scatterplot showing net internal migration in local authorities in England and Wales, where local authorities are grouped into quartiles based on how urban or rural they are.

As you can see, there is an interesting pattern: the local authorities with the highest net outflows are among the most urban. But not all urban areas have large net outflows, and there are not many rural areas with correspondingly large net inflows.

I wanted to find a way of representing these migration flows that would let me explore how people were moving between local authorities: which local authorities had the largest flows in each direction, and what was the balance of flows between these and other local authorities?

With that in mind, I built an interactive hexmap that shows the ONS internal migration data.

How it works

The map shows the 348 district and unitary local authorities in England and Wales, and shades each of them according to the magnitude of their internal migration flows. By default the map shows the net flow in each area (these are the flows shown in the scatterplot above), but the tabs let you switch to see the gross inflows and outflows too.

If you click on a local authority (or tap on it twice on a touchscreen) you can see the flows between that local authority and other areas. Mousing over an area (or tapping on it once on a touchscreen) pops up a label showing the name of the local authority and the flows in question.

The map stores the currently selected local authority and flow in the URL, so you can link directly to a given combination of local authority and flow. Here are the net migration flows between the London Borough of Barnet and other local authorities, for example. You can see that there are net flows into Barnet from local authorities in central London, and net flows out from Barnet to less urban areas. And here is the map for Sevenoaks, which shows a similar pattern.

Strengths and weaknesses

I think this approach works well in some respects and less well in others. It effectively condenses a very large amount of data into a relatively simple interface that allows you to easily explore geographical patterns of internal migration, which is what I set out to do.

However, one weakness of this design is that the variation in the size of the migration flows between local authorities is so great that it's not really possible to represent the data using the same shading scale for every area.

Consequently, while each type of flow is represented using the same set of colours, the scale associated with those colours changes depending on which local authority is selected.

This ensures that you see an appropriate level of variation in the data for each local authority, but it also means that the maps for each area are not directly comparable with one another: you always have to glance at the colour key to check the magnitude of the flows shown in each case.

Effectively, this interface provides 348 different local authority maps for each type of migration flow. The maps all use the same grammar to communicate, but they show the data at different scales.

A new hexmap of local authorities

One final feature of the visualisation worth mentioning is the hexmap itself. This is an entirely new hexmap of local authorities in England and Wales that was built from scratch. I intend to write more about the process of making it, and share some tools I have developed to make creating hexmaps easier, so look out for those in forthcoming posts.

Getting started with the Labour Force Survey in Python and R

20 Nov 2017 19:58 GMT

The Labour Force Survey is a quarterly national survey that's designed to represent the UK's household population. The Office for National Statistics primarily uses the LFS to produce quarterly labour market statistics; but both the Labour Force Survey and its annual equivalent, the Annual Population Survey, are also used to produce a range of other statistics, such as estimates of families and households and of the population by country of birth and nationality.

The ONS publishes LFS and APS microdata through the UK Data Service. I can't link directly to the datasets here, but they are publicly available if you sign up for a UK Data Service account. This data is used in academic, political, and commercial research for various purposes, and a common use is to produce population estimates for specific groups.

The datasets are published in SPSS and Stata formats. Many people either can't afford these programs or would prefer to use something else. This article provides a basic introduction to working with LFS datasets in Python and R. It shows how to load the data in a usable format and how to produce population estimates as frequency tables and crosstabs.

Setting up

The examples below use a dataset stored in the Stata file format (.dta) so make sure you have a recent LFS dataset in that format and that you give it the same filename as you use in your code. In Python we will use the pandas library, while in R we will use the tidyverse and haven packages, so make sure you have the right libraries for your chosen language installed too. It's also a good idea to keep the LFS user guide to hand. Details of the survey variables and their codes are in Volume 3.

Working with LFS data

Python and R can both read files in .dta format, but there are a couple of issues with LFS data that it helps to know about.

First, the LFS contains a large number of detailed categorical variables. In .dta and .spss files these are each represented as a set of numerical codes, along with a set of associated labels. When working with these variables in Python and R it is best to refer to individual categories using their numerical codes, as these are guaranteed to be consistent across datasets for different quarters. But when summarising the data for presentation it is better to use the labels, so the results are readable.

Ideally, you want both the codes and the labels for any variables of interest available in your data frame, but the data structures for categorical variables in Python and R either convert the codes to labels, or only preserve the codes.

The second difficulty is that while Python and R both require unique values for their categorical data types, some of the variables in LFS datasets have labels that are not unique. So you can't use these labels as categoricals in pandas, or as factors in R, even if you want to. If you try to automatically convert these variables when you load the data, you'll get errors. Fortunately, there are ways around this, and the issue only affects a small number of variables which are often not needed.

The solution to these problems is to have two copies of each categorical variable of interest in your data frame whenever possible: one containing the codes and the other containing the labels. That way you can use the codes when you need to select particular categories in your analysis, and the labels to present the results. There are different ways of achieving this in Python and R but the result is essentially the same. For problem variables with duplicate labels you can just use the codes and create your own unique labels if necessary.

To illustrate how this works in practice, the first example below shows how to load and summarise an LFS dataset in order to produce a simple breakdown of the population of the UK by region. We'll do it first in R, and then do the same thing in Python. After that, we'll extend the example to produce a crosstab of the population by region and nationality. Finally, we'll filter the data to produce the same crosstab for a subset of the population: nationals of EU8 countries.

Frequency tables

To produce a population estimate from the LFS you need to sum the population weights for the population in question. The population weights are stored in a variable called PWTx where x is a year expressed as two digits. In the dataset I am using the weight variable is called PWT17.

The other variable we need is the one that identifies each respondents' country or region of residence within the UK, which is called GOVTOF2. We are going to create a second version of this variable using its labels rather than its codes. That way, when we group the data by this variable and sum the population weights in each region, we will end up with a readable frequency table.

Let's start with R. There is more than one way to load a .dta file with R, but we are going to use the haven package, because haven has a nice way of handling categorical data stored in .dta files.

Instead of forcing you to choose between using codes or labels for categorical data at the point you load the file, haven loads categorical variables as a special data type called the labelled type. This represents the data as codes in the data frame, but keeps a list of the labels associated with each code for reference. You can create a factor from a labelled variable at any point using as_factor() if you want to use the labels in the data frame instead.

The following code loads an LFS Stata file as a data frame with haven's read_dta() method, and uses as_factor() to create a second version of GOVTOF2 based on its labels rather than its codes. We'll call this variable GOVTOF2F to identify it as a factor.

It's then just a question of grouping the data by GOVTOF2F and summing the population weights within each group. (If you don't understand what is happening with the %>% operator, called a pipe, have a look at Chapter 5 of R for Data Science.)

It's also a good idea to round the estimates to at least a thousand so as not to be spuriously precise. You can then save the output to disk with write.csv().


# Load the data file
data <- read_dta("lfs-2017-q2.dta")

# Create a label version of the region variable
data$GOVTOF2F <- as_factor(data$GOVTOF2)

# Calculate the population by region
estimates <- data %>%
    # Group by region
    group_by(GOVTOF2F) %>%
    # Sum the population weights
    summarise(POPEST = sum(PWT17))

# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)

# Write out the estimates as a csv
write.csv(estimates, file = "frequency.csv", row.names = FALSE)

Now let's do the same thing with Python. Unlike haven's read_dta() method, the read_stata() method in pandas does not preserve both the codes and the labels when loading categorical data: you have to choose which you want when you load it. To get around this, you can load the data twice, once as codes and once as labels, and then combine the two sets of variables into a single dataset.

To do this, we can take advantage of a nice feature of read_stata() that haven's read_dta() doesn't have, which is the ability to load only selected variables from the data file. So while this takes a few more lines of code in Python, and involves reading from the data file twice, it actually executes faster.

First, we load the variables of interest as codes. These are the population weight (PWT17), the region of residence (GOVTOF2), and the respondent unique identifier (CASENOP) — we'll use CASENOP to link the codes with the labels.

Next, we load the variables we need as labels and rename them accordingly. We only need GOVTOF2 for labelling, but we'll also need CASENOP for linking. Then we merge the labels with the codes by joining the two datasets on CASENOP.

After that, we use the same approach we used in R: group the data by region, sum the population weights in each region, and round the estimates to the nearest thousand.

import pandas as pd

# Load the data as codes
codes = pd.read_stata(
    columns = ['CASENOP', 'GOVTOF2', 'PWT17'],
    convert_categoricals = False)

# Load the data as categorical labels
labels = pd.read_stata(
    columns = ['CASENOP', 'GOVTOF2'],
    convert_categoricals = True)

# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F']

# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')

# Group by region
gd = data.groupby(['GOVTOF2F'])

# Sum the population weights
estimates = gd[['PWT17']].sum()

# Round the estimates
estimates = round(estimates, -3)

# Write out the estimates as a csv

In these examples we didn't use the version of GOVTOF2 that contains the codes. But it's a good idea to have both versions of a variable you're using available, in case you want to select specific categories within it later on. We'll do this in the section on filtering below.


But first, let's adapt the examples to change the output from a one dimensional frequency table to a two dimensional crosstab of the population by region and nationality.

We need to do three new things: create a labelled version of the nationality variable (NATOX7), group the data by both region and nationality before summing the population weights, and then reshape the estimates for each combination of region and nationality from long form to wide form.

In R you can do this with spread(), which is part of the tidyr package.


# Load data and create labelled versions of the variables
data <- read_dta("lfs-2017-q2.dta")
data$GOVTOF2F <- as_factor(data$GOVTOF2)
data$NATOX7F <- as_factor(data$NATOX7)

# Calculate the population estimates in long form
estimates <- data %>%
    # Group by region and nationality
    group_by(GOVTOF2F, NATOX7F) %>%
    # Sum the population weights
    summarise(POPEST = sum(PWT17))

# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)

# Reshape the population estimates to wide form
crosstab <- estimates %>% 
    spread(GOVTOF2F, POPEST, fill = 0)

# Write out the estimates as a csv
write.csv(crosstab, file = "crosstab.csv", row.names = FALSE)

In Python you can do the same thing with pandas' pivot_table().

import pandas as pd

# Load the data as codes
codes = pd.read_stata(
    columns = ['CASENOP', 'GOVTOF2', 'NATOX7', 'PWT17'],
    convert_categoricals = False)

# Load the data as categorical labels
labels = pd.read_stata(
    columns = ['CASENOP', 'GOVTOF2', 'NATOX7'],
    convert_categoricals = True)

# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F', 'NATOX7F']

# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')

# Group by region name
gd = data.groupby(['GOVTOF2F', 'NATOX7F'])

# Sum the population weights
estimates = gd[['PWT17']].sum()

# Round the estimates
estimates = round(estimates, -3)

# Reshape the population estimates to wide form
crosstab = pd.pivot_table(
    index = ['NATOX7F'], 
    columns = ['GOVTOF2F'],
    fill_value = 0)

# Write out the estimates as a csv

While this crosstab provides a very detailed breakdown of the population, many of the inidividual estimates within it are too small to be reliable, so you shouldn't report estimates at this level of detail for most nationalities (see the caveats below).


Finally, we'll filter the data before summing the population weights so that the estimates are for a subset of the population. In this case we'll filter on nationality to select nationals of the eight Eastern European countries that joined the EU in 2004, known as the EU8.

To do this we'll use the codes for NATOX7 to select respondents by their nationality. Note that we filter the data frame on the variable containing the codes (NATOX7), but group by the variable containing the labels (NATOX7F).

You can find the codes for each variable in Volume 3 of the LFS user guide. There are nine codes listed here because sometimes survey respondents give their nationality as “Czechoslovakia”, which no longer exists but whose successor countries are in the EU8.

To filter in R, add a filter() to the start of the pipeline for transforming the data.

# NATOX7 codes for EU8
EU8 <- c(203, 971, 233, 348, 
    428, 440, 616, 703, 705)

# Calculate the population estimates in long form
estimates <- data %>%
    # Select only EU8 nationals
    filter(NATOX7 %in% EU8) %>%
    # Group by region and nationality
    group_by(GOVTOF2F, NATOX7F) %>%
    # Sum the population weights
    summarise(POPEST = sum(PWT17))

In Python, you can use pandas' subsetting operations to select the target rows from the data frame before you group and sum. This time, we'll also call dropna() on the results to remove empty rows left by nationalities we filtered out.

# NATOX7 codes for EU8
EU8 = [203, 971, 233, 348, 
    428, 440, 616, 703, 705]

# Select only EU8 nationals
filtered = data.loc[data['NATOX7'].isin(EU8)]

# Group by region and nationality 
gd = filtered.groupby(['GOVTOF2F', 'NATOX7F'])

# Sum the population weights (and drop empty rows)
estimates = gd[['PWT17']].sum().dropna()

Some caveats

The above examples only scratch the surface of what's possible with the Labour Force Survey. Other things you might want to do include recoding variables, repeating the same analysis over different quarters so as to build up a time series, or calculating confidence intervals for estimates. (These may be the subjects of future posts.)

But bear in mind that while the scale of the Labour Force Survey makes it a rich source of data, it also introduces some potential pitfalls.

First and foremost, the LFS is a survey, so its estimates are surrounded by statistical uncertainty. Estimates based on a small number of respondents are particularly uncertain, and the level of detail in the LFS makes it easy to find lots of small populations, so take care not to base conclusions on tiny samples. To get the size of the sample underpinning each estimate you can count the number of respondents, rather than sum their weights.

In R, use n() rather than sum() as the function in summarise().

summarise(RESP = n())

In Python, call count() rather than sum() on the grouped data.

respondents = gd[['PWT17']].count()

And don't round the results.

If you do this for the crosstab of population by region and nationality you'll see that many individual estimates are based on just a handful of respondents, and are too small to be considered reliable. You should aggregate smaller estimates into higher level groups; in this case you could use the ONS country groupings.

Another thing to bear in mind is that LFS data is not seasonally adjusted, so comparisions between estimates at different points in time may not be valid. A simple solution is to only compare estimates from the same quarter in each year.

Furthermore, there are technical issues affecting some aspects of the data — the LFS is known to underestimate benefit claimants and seasonal agricultural workers for example — so it's worth reading the literature on the LFS carefully, especially those parts that relate to your analysis.

Finally, it's worth looking at other people's research, to see how they have used the LFS to address similar questions to yours, and how they have dealt with any problems they encountered.

Westminster Bubble

16 Sep 2017 22:23 GMT

My latest project is Westminster Bubble, a Twitter bot that shows what Westminster journalists are talking about on Twitter, in one word cloud a day.

The @wmbubble account follows all journalists who are registered with Parliament and who primarily cover Westminster politics in their work and on Twitter.

This is a reasonably objective set of criteria for Westminster journalism but it's not perfect. There is some fuzziness around who primarily covers Westminster and whether that is their focus on Twitter, but the aim is to exclude journalists who are covering only one or two specific areas of policy and to include journalists who are talking about Westminster politics or policy generally.

The bot does not analyse or store individual tweets, or keep any record of who is saying what. It takes the text of each tweet in the stream of tweets from people it follows, cleans the text to remove mentions, hashtags and links, and adds what's left to an anonymous corpus. That corpus is used to make the word cloud at the end of each day.

If you are a journalist the account follows and you are uncomfortable with your tweets being included in this analysis, please let me know and I will remove your account from the group. Alternatively, you can just block the @wmbubble account. But to reiterate, the bot is not storing any data about individual accounts, just a single block of anonymised text which does not include any ids, account names, hashtags or URLs.

Word clouds are posted daily at around 9:00pm, just before the newspaper front pages come out. The word clouds show the most common words each day and the size of a word represents a measure of its relative frequency within that day's corpus. This measure is not based on frequency alone, but an equally weighted combination of frequency and rank on frequency.

This is because on some days a small number of words are so dominant in the corpus they would make most other words too small to render at the minimum font size, if size was scaled only to frequency. Giving equal weight to rank on frequency preserves the diversity of words that are visible in the cloud, and ensures you get a better sense of the content of the day's big stories. But this does mean that the relative size of the words is not a direct measure of their relative frequency alone.

Frequently collocated words are treated as a single term, so names like Theresa May and Jeremy Corbyn are preserved. The colours are not meaningful and are just randomly chosen from a range of colour maps for variety.

Feel free to address any questions to me via @olihawkins on Twitter. Don't try addressing questions to @wmbubble: it's programmed to ignore those tweets.

Going for broke in King of Tokyo

23 Jul 2017 15:24 GMT

King of Tokyo is a lighthearted but engrossing board game that's based on the classic monster movie genre. You play a monster in the style of Godzilla or King Kong and your objective is to defeat the other monsters battling for control of Tokyo. Since my brother introduced me to the game last summer it has quickly become a family favourite. It's a game I can play with my seven year-old daughter and her 83 year-old grandad and everyone has a good time.

I won't run through all of the rules here. The only thing you need to understand to read this article is the game's central dice mechanism. Underneath the thematic flavour of giant monsters with superpowers, King of Tokyo is a dice rolling game with similar rules to Yahtzee. The dice are the traditional six-sided type, but with different faces. These are the digits 1, 2, and 3, a claw, a heart, and a lightning bolt.

On your turn you roll six dice up to three times. After each roll you can keep any of the dice that you like and continue rolling the remaining dice until you either have all the faces that you want, or you run out of rolls. The dice that are face up at the end of all your rolls are your hand for that turn, and you then resolve the outcomes.

Dice with numbers give you victory points if you have three or more of the same number. Claws deal damage to your opponents, hearts restore your health, and lightning gives you energy which you can use to buy power cards.

Quite often in the game you can benefit from getting dice of more than one type on the same turn: do some damage to an opponent and heal yourself a bit; collect some lightning for power cards and gain a few victory points.

But at other times you need to go for broke and try to get as many of one dice face as possible. This tends to happen at the most decisive moments in the game: when you need just a few more victory points to win outright, when you need to do enough damage to kill an opponent before they win, when you need to heal quickly to prevent your own impending death, or when you need to collect enough lightning to buy a vital power card before someone else.

You can model this strategy with a Python function that looks like this:

import numpy as np

def max_outcome_for_face(dice=6, rolls=3):

    hits = 0
    outcome = [0] * rolls

    for roll in range(rolls):

        # Get a random number from one to six for each dice
        results = np.random.randint(1, 7, size=dice)
        # Count the number of ones: a one is a hit
        numhits = np.count_nonzero(results == 1)
        # Add to hits and remove a dice for each hit
        hits = hits + numhits
        dice = dice - numhits 

        # Store the hits after each roll
        outcome[roll] = hits

    return outcome

This function simulates a turn in the game with the given number of dice and rolls — six dice with three rolls by default. After each roll, it counts the number of times the target face was rolled, adds that number to a running total for the number of dice with the target face, and removes those dice with the target face from subsequent rolls. After all the rolls have completed, it returns a list showing the cumulative number of dice with the target face after each roll.

If you call this function a large enough number of times and collect the results, you can obtain the probability distribution for the outcomes of this strategy for any given number of dice and rolls. The heatmap below shows the distribution of outcomes after ten million turns with six dice and up to four rolls.

A heatmap showing the probability distribution of outcomes for dice rolls in King of Tokyo where the player is trying to get as many of one dice face as possible.

In King of Tokyo you get three rolls with six dice by default, so while this chart shows the outcomes for up to four rolls, the third column is the most relevant. This shows that if you are trying to get as many dice as possible with one particular face, you have an 80% chance of getting two or more dice with the given face, and a 21% chance of getting four or more.

So why does the heatmap show probabilities for up to four rolls? Because some of the power cards you can buy in the game give you an extra roll of the dice, and the Giant Brain card in particular gives you an extra roll as a permanent effect. With a fourth roll of the dice the probability of getting two or more dice with the target face increases to 91%, and the probability of getting four or more rises to 38%.

There are also power cards in the game that give you an extra dice on each roll: the Extra Head card gives you this as a permanent effect. Here is the distrbution of outcomes for seven dice and up to four rolls.

A heatmap showing the probability distribution of outcomes for dice rolls in King of Tokyo where the player is trying to get as many of one dice face as possible.

This shows that an extra dice is worth less than an extra roll when you are going after a particular face. With seven dice and three rolls the probability of getting two or more of a given face is 87%, and the probability of getting four or more is 33%. That's better than six dice with three rolls, but not as good as six dice with four. Although it's worth noting that an extra dice confers other benefits — you get more stuff.

Of course, if you can get the combination of cards that gives you seven dice with four rolls, then getting four or more dice with the target face becomes more likely than not, with a 54% chance.

This analysis was done using numpy and pandas, and the heatmaps were produced with matplotlib and seaborn. The complete source code is available on GitHub.

Making hexmaps with D3

24 Jun 2017 14:27 GMT

I spend a lot of time working with data for Parliamentary constituencies. I often want to map constituency data in a way that gives each constituency equal weight, especially when mapping election data, but until recently I had never seen an interactive hexmap of constituencies online. So I was really pleased when two months ago the Open Data Institute released not just a hexmap of Parliamentary constituencies, but also a specification for describing hexmap data called HexJSON.

You can read more about the HexJSON spec on the ODI's website, but briefly: HexJSON describes a hexmap as a set of hexes with column (q) and row (r) coordinates within a given coordinate system, which is specified with the layout property.

HexJSON that looks like this:

  "hexes": {

Describes a hexmap that looks like this:

An example of a hexmap with four columns, four rows, and a pointy top layout.

I wanted a way to render hexjson data generally, so I wrote a small D3 plugin called d3-hexjson, which takes a hexjson object and generates the data necessary to render it easily with D3.

Code examples can be found in the GitHub readme, and are shown in two blocks by Henry Lau. Giuseppe Sollazzo used d3-hexjson to create a visualisation showing the potential impact of swing on the number of Conservative and Labour seats won at the 2017 General Elecion, and wrote an article explaining how he did it.

My first use of the plugin was to run live hexmaps of the 2017 General Election results as they came in overnight on polling day. We were recording the results as they were announced at work, so we could send the data to the hexmaps very easily. At one point we fell behind the announced results as most of our election volunteers did not start until 3:00am and the results were already coming in quickly at that point.

Here are links to all the completed hexmaps showing the 2017 results, along with the 2015 results for comparison.