## 7 Feb 2018 21:15 GMT

I recently produced a hexmap of local authorities in England and Wales, which I need for mapping a number of different datasets, such as data on internal migration. As I couldn't find an existing hexmap for these areas, I had to create one from scratch.

Initially, I hoped to generate a hexmap algorithmically using the geogrid package in R. But while geogrid did a decent job of minimising the average distance between where areas were located on the geographic map and on the hexagonal map, the number of local authorities and the variation in their size led to some odd results. Birmingham was on the coast, for example.

Furthermore, the algorithm that geogrid uses isn't designed to preserve certain geospatial relationships, such as contiguous groups of areas like regions, or relative positions that matter logically. For instance, in the original output from geogrid, South Tyneside was north of North Tyneside.

Geogrid is a fantastic starting point, which can do a lot of the groundwork for you, but it seems that to make a good hexmap you need to do a certain amount of the work by hand.

To help with that, I wrote a small tool that lets you edit HexJSON data in the browser, and then export your work in both HexJSON and GeoJSON formats. Like a lot of tools I make, it doesn't have a massive feature set, but aims to do one thing well.

### Editing hexmaps in the browser

The HexJSON Editor lets you import hexmap data in HexJSON format and move the hexes around by hand. Simply select the hex you want to move by clicking on it.

And then click on the destination to place it.

If the position you place a hex is already occupied by another hex, it will swap them.

If you want to try the editor, you can use the example HexJSON grid shown above, or alternatively try the local authority hexmap, or the Open Data Institute's constituency hexmap.

When you load the HexJSON data, you can choose which variable to use for labelling. You also have the option of choosing a categorical variable to use for shading the hexes — the colours are chosen for each category using a standard categorical colour scale. Finally, if you need more space around the hexes for editing, you have the option of adding more padding.

I made the HexJSON editor mainly to meet my own mapping needs, but I thought it was worth sharing. If there is enough interest in new features I could either add them, or open source the code so other people can. In the meantime, it's there if you need it.

## 29 Jan 2018 21:17 GMT

A few weeks ago I posted this scatterplot on Twitter. It shows Office for National Statistics estimates of net internal migration between local authorities in England and Wales in the year ending June 2016. The local authorities are grouped into quartiles based on how urban or rural they are.

As you can see, there is an interesting pattern: the local authorities with the highest net outflows are among the most urban. But not all urban areas have large net outflows, and there are not many rural areas with correspondingly large net inflows.

I wanted to find a way of representing these migration flows that would let me explore how people were moving between local authorities: which local authorities had the largest flows in each direction, and what was the balance of flows between these and other local authorities?

With that in mind, I built an interactive hexmap that shows the ONS internal migration data.

### How it works

The map shows the 348 district and unitary local authorities in England and Wales, and shades each of them according to the magnitude of their internal migration flows. By default the map shows the net flow in each area (these are the flows shown in the scatterplot above), but the tabs let you switch to see the gross inflows and outflows too.

If you click on a local authority (or tap on it twice on a touchscreen) you can see the flows between that local authority and other areas. Mousing over an area (or tapping on it once on a touchscreen) pops up a label showing the name of the local authority and the flows in question.

The map stores the currently selected local authority and flow in the URL, so you can link directly to a given combination of local authority and flow. Here are the net migration flows between the London Borough of Barnet and other local authorities, for example. You can see that there are net flows *into* Barnet from local authorities in central London, and net flows *out from* Barnet to less urban areas. And here is the map for Sevenoaks, which shows a similar pattern.

### Strengths and weaknesses

I think this approach works well in some respects and less well in others. It effectively condenses a very large amount of data into a relatively simple interface that allows you to easily explore geographical patterns of internal migration, which is what I set out to do.

However, one weakness of this design is that the variation in the size of the migration flows between local authorities is so great that it's not really possible to represent the data using the same shading scale for every area.

Consequently, while each type of flow is represented using the same set of colours, the scale associated with those colours changes depending on which local authority is selected.

This ensures that you see an appropriate level of variation in the data for each local authority, but it also means that the maps for each area are not directly comparable with one another: you always have to glance at the colour key to check the magnitude of the flows shown in each case.

Effectively, this interface provides 348 different local authority maps for each type of migration flow. The maps all use the same grammar to communicate, but they show the data at different scales.

### A new hexmap of local authorities

One final feature of the visualisation worth mentioning is the hexmap itself. This is an entirely new hexmap of local authorities in England and Wales that was built from scratch. I intend to write more about the process of making it, and share some tools I have developed to make creating hexmaps easier, so look out for those in forthcoming posts.

## 20 Nov 2017 19:58 GMT

The Labour Force Survey is a quarterly national survey that's designed to represent the UK's household population. The Office for National Statistics primarily uses the LFS to produce quarterly labour market statistics; but both the Labour Force Survey and its annual equivalent, the Annual Population Survey, are also used to produce a range of other statistics, such as estimates of families and households and of the population by country of birth and nationality.

The ONS publishes LFS and APS microdata through the UK Data Service. I can't link directly to the datasets here, but they are publicly available if you sign up for a UK Data Service account. This data is used in academic, political, and commercial research for various purposes, and a common use is to produce population estimates for specific groups.

The datasets are published in SPSS and Stata formats. Many people either can't afford these programs or would prefer to use something else. This article provides a basic introduction to working with LFS datasets in Python and R. It shows how to load the data in a usable format and how to produce population estimates as frequency tables and crosstabs.

### Setting up

The examples below use a dataset stored in the Stata file format (.dta) so make sure you have a recent LFS dataset in that format and that you give it the same filename as you use in your code. In Python we will use the pandas library, while in R we will use the tidyverse and haven packages, so make sure you have the right libraries for your chosen language installed too. It's also a good idea to keep the LFS user guide to hand. Details of the survey variables and their codes are in Volume 3.

### Working with LFS data

Python and R can both read files in .dta format, but there are a couple of issues with LFS data that it helps to know about.

First, the LFS contains a large number of detailed categorical variables. In .dta and .spss files these are each represented as a set of numerical codes, along with a set of associated labels. When working with these variables in Python and R it is best to refer to individual categories using their numerical codes, as these are guaranteed to be consistent across datasets for different quarters. But when summarising the data for presentation it is better to use the labels, so the results are readable.

Ideally, you want both the codes and the labels for any variables of interest available in your data frame, but the data structures for categorical variables in Python and R either convert the codes to labels, or only preserve the codes.

The second difficulty is that while Python and R both require unique values for their categorical data types, some of the variables in LFS datasets have labels that are not unique. So you can't use these labels as categoricals in pandas, or as factors in R, even if you want to. If you try to automatically convert these variables when you load the data, you'll get errors. Fortunately, there are ways around this, and the issue only affects a small number of variables which are often not needed.

The solution to these problems is to have two copies of each categorical variable of interest in your data frame whenever possible: one containing the codes and the other containing the labels. That way you can use the codes when you need to select particular categories in your analysis, and the labels to present the results. There are different ways of achieving this in Python and R but the result is essentially the same. For problem variables with duplicate labels you can just use the codes and create your own unique labels if necessary.

To illustrate how this works in practice, the first example below shows how to load and summarise an LFS dataset in order to produce a simple breakdown of the population of the UK by region. We'll do it first in R, and then do the same thing in Python. After that, we'll extend the example to produce a crosstab of the population by region and nationality. Finally, we'll filter the data to produce the same crosstab for a subset of the population: nationals of EU8 countries.

### Frequency tables

To produce a population estimate from the LFS you need to sum the population weights for the population in question. The population weights are stored in a variable called PWTx where x is a year expressed as two digits. In the dataset I am using the weight variable is called PWT17.

The other variable we need is the one that identifies each respondents' country or region of residence within the UK, which is called GOVTOF2. We are going to create a second version of this variable using its labels rather than its codes. That way, when we group the data by this variable and sum the population weights in each region, we will end up with a readable frequency table.

Let's start with R. There is more than one way to load a .dta file with R, but we are going to use the haven package, because haven has a nice way of handling categorical data stored in .dta files.

Instead of forcing you to choose between using codes or labels for categorical data at the point you load the file, haven loads categorical variables as a special data type called the *labelled* type. This represents the data as codes in the data frame, but keeps a list of the labels associated with each code for reference. You can create a factor from a *labelled* variable at any point using *as_factor()* if you want to use the labels in the data frame instead.

The following code loads an LFS Stata file as a data frame with haven's *read_dta()* method, and uses *as_factor()* to create a second version of GOVTOF2 based on its labels rather than its codes. We'll call this variable GOVTOF2F to identify it as a factor.

It's then just a question of grouping the data by GOVTOF2F and summing the population weights within each group. (If you don't understand what is happening with the *%>%* operator, called a pipe, have a look at Chapter 5 of R for Data Science.)

It's also a good idea to round the estimates to at least a thousand so as not to be spuriously precise. You can then save the output to disk with *write.csv()*.

```
library(tidyverse)
library(haven)
# Load the data file
data <- read_dta("lfs-2017-q2.dta")
# Create a label version of the region variable
data$GOVTOF2F <- as_factor(data$GOVTOF2)
# Calculate the population by region
estimates <- data %>%
# Group by region
group_by(GOVTOF2F) %>%
# Sum the population weights
summarise(POPEST = sum(PWT17))
# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)
# Write out the estimates as a csv
write.csv(estimates, file = "frequency.csv", row.names = FALSE)
```

Now let's do the same thing with Python. Unlike haven's *read_dta()* method, the *read_stata()* method in pandas does not preserve both the codes and the labels when loading categorical data: you have to choose which you want when you load it. To get around this, you can load the data twice, once as codes and once as labels, and then combine the two sets of variables into a single dataset.

To do this, we can take advantage of a nice feature of *read_stata()* that haven's *read_dta()* doesn't have, which is the ability to load only selected variables from the data file. So while this takes a few more lines of code in Python, and involves reading from the data file twice, it actually executes faster.

First, we load the variables of interest as codes. These are the population weight (PWT17), the region of residence (GOVTOF2), and the respondent unique identifier (CASENOP) — we'll use CASENOP to link the codes with the labels.

Next, we load the variables we need as labels and rename them accordingly. We only need GOVTOF2 for labelling, but we'll also need CASENOP for linking. Then we merge the labels with the codes by joining the two datasets on CASENOP.

After that, we use the same approach we used in R: group the data by region, sum the population weights in each region, and round the estimates to the nearest thousand.

```
import pandas as pd
# Load the data as codes
codes = pd.read_stata(
'lfs-2017-q2.dta',
columns = ['CASENOP', 'GOVTOF2', 'PWT17'],
convert_categoricals = False)
# Load the data as categorical labels
labels = pd.read_stata(
'lfs-2017-q2.dta',
columns = ['CASENOP', 'GOVTOF2'],
convert_categoricals = True)
# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F']
# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')
# Group by region
gd = data.groupby(['GOVTOF2F'])
# Sum the population weights
estimates = gd[['PWT17']].sum()
# Round the estimates
estimates = round(estimates, -3)
# Write out the estimates as a csv
estimates.to_csv('frequency.csv')
```

In these examples we didn't use the version of GOVTOF2 that contains the codes. But it's a good idea to have both versions of a variable you're using available, in case you want to select specific categories within it later on. We'll do this in the section on filtering below.

### Crosstabs

But first, let's adapt the examples to change the output from a one dimensional frequency table to a two dimensional crosstab of the population by region and nationality.

We need to do three new things: create a labelled version of the nationality variable (NATOX7), group the data by both region *and* nationality before summing the population weights, and then reshape the estimates for each combination of region and nationality from long form to wide form.

In R you can do this with *spread()*, which is part of the tidyr package.

```
library(tidyverse)
library(haven)
# Load data and create labelled versions of the variables
data <- read_dta("lfs-2017-q2.dta")
data$GOVTOF2F <- as_factor(data$GOVTOF2)
data$NATOX7F <- as_factor(data$NATOX7)
# Calculate the population estimates in long form
estimates <- data %>%
# Group by region and nationality
group_by(GOVTOF2F, NATOX7F) %>%
# Sum the population weights
summarise(POPEST = sum(PWT17))
# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)
# Reshape the population estimates to wide form
crosstab <- estimates %>%
spread(GOVTOF2F, POPEST, fill = 0)
# Write out the estimates as a csv
write.csv(crosstab, file = "crosstab.csv", row.names = FALSE)
```

In Python you can do the same thing with pandas' *pivot_table()*.

```
import pandas as pd
# Load the data as codes
codes = pd.read_stata(
'lfs-2017-q2.dta',
columns = ['CASENOP', 'GOVTOF2', 'NATOX7', 'PWT17'],
convert_categoricals = False)
# Load the data as categorical labels
labels = pd.read_stata(
'lfs-2017-q2.dta',
columns = ['CASENOP', 'GOVTOF2', 'NATOX7'],
convert_categoricals = True)
# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F', 'NATOX7F']
# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')
# Group by region name
gd = data.groupby(['GOVTOF2F', 'NATOX7F'])
# Sum the population weights
estimates = gd[['PWT17']].sum()
# Round the estimates
estimates = round(estimates, -3)
# Reshape the population estimates to wide form
crosstab = pd.pivot_table(
estimates,
index = ['NATOX7F'],
columns = ['GOVTOF2F'],
fill_value = 0)
# Write out the estimates as a csv
crosstab.to_csv('crosstab.csv')
```

While this crosstab provides a very detailed breakdown of the population, many of the inidividual estimates within it are too small to be reliable, so you shouldn't report estimates at this level of detail for most nationalities (see the caveats below).

### Filtering

Finally, we'll filter the data before summing the population weights so that the estimates are for a subset of the population. In this case we'll filter on nationality to select nationals of the eight Eastern European countries that joined the EU in 2004, known as the EU8.

To do this we'll use the codes for NATOX7 to select respondents by their nationality. Note that we filter the data frame on the variable containing the codes (NATOX7), but group by the variable containing the labels (NATOX7F).

You can find the codes for each variable in Volume 3 of the LFS user guide. There are nine codes listed here because sometimes survey respondents give their nationality as “Czechoslovakia”, which no longer exists but whose successor countries are in the EU8.

To filter in R, add a *filter()* to the start of the pipeline for transforming the data.

```
# NATOX7 codes for EU8
EU8 <- c(203, 971, 233, 348,
428, 440, 616, 703, 705)
# Calculate the population estimates in long form
estimates <- data %>%
# Select only EU8 nationals
filter(NATOX7 %in% EU8) %>%
# Group by region and nationality
group_by(GOVTOF2F, NATOX7F) %>%
# Sum the population weights
summarise(POPEST = sum(PWT17))
```

In Python, you can use pandas' subsetting operations to select the target rows from the data frame before you group and sum. This time, we'll also call *dropna()* on the results to remove empty rows left by nationalities we filtered out.

```
# NATOX7 codes for EU8
EU8 = [203, 971, 233, 348,
428, 440, 616, 703, 705]
# Select only EU8 nationals
filtered = data.loc[data['NATOX7'].isin(EU8)]
# Group by region and nationality
gd = filtered.groupby(['GOVTOF2F', 'NATOX7F'])
# Sum the population weights (and drop empty rows)
estimates = gd[['PWT17']].sum().dropna()
```

### Some caveats

The above examples only scratch the surface of what's possible with the Labour Force Survey. Other things you might want to do include recoding variables, repeating the same analysis over different quarters so as to build up a time series, or calculating confidence intervals for estimates. (These may be the subjects of future posts.)

But bear in mind that while the scale of the Labour Force Survey makes it a rich source of data, it also introduces some potential pitfalls.

First and foremost, the LFS is a survey, so its estimates are surrounded by statistical uncertainty. Estimates based on a small number of respondents are particularly uncertain, and the level of detail in the LFS makes it easy to find lots of small populations, so take care not to base conclusions on tiny samples. To get the size of the sample underpinning each estimate you can count the number of respondents, rather than sum their weights.

In R, use *n()* rather than *sum()* as the function in *summarise()*.

```
summarise(RESP = n())
```

In Python, call *count()* rather than *sum()* on the grouped data.

```
respondents = gd[['PWT17']].count()
```

And don't round the results.

If you do this for the crosstab of population by region and nationality you'll see that many individual estimates are based on just a handful of respondents, and are too small to be considered reliable. You should aggregate smaller estimates into higher level groups; in this case you could use the ONS country groupings.

Another thing to bear in mind is that LFS data is not seasonally adjusted, so comparisions between estimates at different points in time may not be valid. A simple solution is to only compare estimates from the same quarter in each year.

Furthermore, there are technical issues affecting some aspects of the data — the LFS is known to underestimate benefit claimants and seasonal agricultural workers for example — so it's worth reading the literature on the LFS carefully, especially those parts that relate to your analysis.

Finally, it's worth looking at other people's research, to see how they have used the LFS to address similar questions to yours, and how they have dealt with any problems they encountered.