olihawkins

Getting started with the Labour Force Survey in Python and R

20 Nov 2017 19:58 GMT

The Labour Force Survey (LFS) is a quarterly national survey that's designed to represent the UK's household population. The Office for National Statistics primarily uses the LFS to produce quarterly labour market statistics, but both the LFS and its annual equivalent, the Annual Population Survey (APS), are also used to produce a range of other statistics, such as estimates of families and households and of the population by country of birth and nationality.

The ONS publishes LFS and APS microdata through the UK Data Service. I can't link directly to the datasets here, but they are publicly available if you sign up for a UK Data Service account. The data is used for a range of purposes in academic, political, and commercial research; one of the most common is to produce population estimates for specific groups.

The datasets are published in SPSS and Stata formats. Many people either can't afford these programs or would prefer to use something else. This article provides a basic introduction to working with LFS datasets in Python and R. It shows how to load the data in a usable format and how to produce population estimates as frequency tables and crosstabs.

Setting up

The examples below use a dataset stored in the Stata file format (.dta), so make sure you have a recent LFS dataset in that format and that it has the same filename as the one used in the code. In Python we will use the pandas library, while in R we will use the tidyverse and haven packages, so make sure you have the right libraries for your chosen language installed too. It's also a good idea to keep the LFS user guide to hand. Details of the survey variables and their codes are in Volume 3.
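If you don't already have the R packages installed, a one-off install along these lines should be enough; pandas users can install it with pip or conda in the usual way.


# Install the R packages used in the examples below
install.packages(c("tidyverse", "haven"))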

Working with LFS data

Python and R can both read files in .dta format, but there are a couple of issues with LFS data that it helps to know about.

First, the LFS contains a large number of detailed categorical variables. In .dta and .sav files these are each represented as a set of numerical codes, along with a set of associated labels. When working with these variables in Python and R it is best to refer to individual categories using their numerical codes, as these are guaranteed to be consistent across datasets for different quarters. But when summarising the data for presentation it is better to use the labels, so the results are readable.

Ideally, you want both the codes and the labels for any variables of interest available in your data frame, but the standard categorical data structures in Python and R keep one or the other: they either convert the codes to labels or preserve only the codes.

The second difficulty is that while Python and R both require the labels of their categorical data types to be unique, some of the variables in LFS datasets have labels that are not unique. So you can't use these labels as categoricals in pandas, or as factors in R, even if you want to. If you try to convert these variables automatically when you load the data, you'll get errors. Fortunately, there are ways around this, and the issue only affects a small number of variables, which are often not needed.

The solution to these problems is to have two copies of each categorical variable of interest in your data frame whenever possible: one containing the codes and the other containing the labels. That way you can use the codes when you need to select particular categories in your analysis, and the labels to present the results. There are different ways of achieving this in Python and R but the result is essentially the same. For problem variables with duplicate labels you can just use the codes and create your own unique labels if necessary.
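As a rough sketch of that last point, here is one way you might attach your own labels to a problem variable in R once the data is loaded, keeping the original codes and adding readable names with a manual lookup. The variable name and labels below are hypothetical placeholders, not taken from a real LFS dataset.


# Hypothetical example: suppose SOMEVAR has duplicate labels in the
# .dta file, so we keep its codes and attach our own unique labels
my_labels <- c("1" = "Category A", "2" = "Category B (north)",
    "3" = "Category B (south)")
data$SOMEVARF <- factor(my_labels[as.character(as.integer(data$SOMEVAR))])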

To illustrate how this works in practice, the first example below shows how to load and summarise an LFS dataset in order to produce a simple breakdown of the population of the UK by region. We'll do it first in R, and then do the same thing in Python. After that, we'll extend the example to produce a crosstab of the population by region and nationality. Finally, we'll filter the data to produce the same crosstab for a subset of the population: nationals of EU8 countries.

Frequency tables

To produce a population estimate from the LFS you need to sum the population weights for the population in question. The population weights are stored in a variable called PWTx where x is a year expressed as two digits. In the dataset I am using the weight variable is called PWT17.

The other variable we need is the one that identifies each respondent's country or region of residence within the UK, which is called GOVTOF2. We are going to create a second version of this variable using its labels rather than its codes. That way, when we group the data by this variable and sum the population weights in each region, we will end up with a readable frequency table.

Let's start with R. There is more than one way to load a .dta file with R, but we are going to use the haven package, because haven has a nice way of handling categorical data stored in .dta files.

Instead of forcing you to choose between using codes or labels for categorical data at the point you load the file, haven loads categorical variables as a special data type called the labelled type. This represents the data as codes in the data frame, but keeps a list of the labels associated with each code for reference. You can create a factor from a labelled variable at any point using as_factor() if you want to use the labels in the data frame instead.

The following code loads an LFS Stata file as a data frame with haven's read_dta() method, and uses as_factor() to create a second version of GOVTOF2 based on its labels rather than its codes. We'll call this variable GOVTOF2F to identify it as a factor.

It's then just a question of grouping the data by GOVTOF2F and summing the population weights within each group. (If you don't understand what is happening with the %>% operator, called a pipe, have a look at Chapter 5 of R for Data Science.)

It's also a good idea to round the estimates to at least the nearest thousand, so as not to be spuriously precise. You can then save the output to disk with write.csv().


library(tidyverse)
library(haven)

# Load the data file
data <- read_dta("lfs-2017-q2.dta")

# Create a label version of the region variable
data$GOVTOF2F <- as_factor(data$GOVTOF2)

# Calculate the population by region
estimates <- data %>%
    # Group by region
    group_by(GOVTOF2F) %>%
    # Sum the population weights
    summarise(POPEST = sum(PWT17))

# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)

# Write out the estimates as a csv
write.csv(estimates, file = "frequency.csv", row.names = FALSE)

Now let's do the same thing with Python. Unlike haven's read_dta() method, the read_stata() method in pandas does not preserve both the codes and the labels when loading categorical data: you have to choose which you want when you load it. To get around this, you can load the data twice, once as codes and once as labels, and then combine the two sets of variables into a single dataset.

To do this, we can take advantage of a nice feature of read_stata() that haven's read_dta() doesn't have: the ability to load only selected variables from the data file. So while this takes a few more lines of code in Python, and involves reading from the data file twice, it can actually execute faster, because only the columns you need are read each time.

First, we load the variables of interest as codes. These are the population weight (PWT17), the region of residence (GOVTOF2), and the respondent unique identifier (CASENOP) — we'll use CASENOP to link the codes with the labels.

Next, we load the variables we need as labels and rename them accordingly. We only need GOVTOF2 for labelling, but we'll also need CASENOP for linking. Then we merge the labels with the codes by joining the two datasets on CASENOP.

After that, we use the same approach we used in R: group the data by region, sum the population weights in each region, and round the estimates to the nearest thousand.


import pandas as pd

# Load the data as codes
codes = pd.read_stata(
    'lfs-2017-q2.dta',
    columns = ['CASENOP', 'GOVTOF2', 'PWT17'],
    convert_categoricals = False)

# Load the data as categorical labels
labels = pd.read_stata(
    'lfs-2017-q2.dta',
    columns = ['CASENOP', 'GOVTOF2'],
    convert_categoricals = True)

# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F']

# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')

# Group by region
gd = data.groupby(['GOVTOF2F'])

# Sum the population weights
estimates = gd[['PWT17']].sum()

# Round the estimates
estimates = round(estimates, -3)

# Write out the estimates as a csv
estimates.to_csv('frequency.csv')

In these examples we didn't use the version of GOVTOF2 that contains the codes. But it's a good idea to have both versions of a variable you're using available, in case you want to select specific categories within it later on. We'll do this in the section on filtering below.

Crosstabs

But first, let's adapt the examples to change the output from a one-dimensional frequency table to a two-dimensional crosstab of the population by region and nationality.

We need to do three new things: create a label version of the nationality variable (NATOX7), group the data by both region and nationality before summing the population weights, and then reshape the estimates for each combination of region and nationality from long form to wide form.

In R you can do this with spread(), which is part of the tidyr package.


library(tidyverse)
library(haven)

# Load data and create labelled versions of the variables
data <- read_dta("lfs-2017-q2.dta")
data$GOVTOF2F <- as_factor(data$GOVTOF2)
data$NATOX7F <- as_factor(data$NATOX7)

# Calculate the population estimates in long form
estimates <- data %>%
    # Group by region and nationality
    group_by(GOVTOF2F, NATOX7F) %>%
    # Sum the population weights
    summarise(POPEST = sum(PWT17))

# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)

# Reshape the population estimates to wide form
crosstab <- estimates %>% 
    spread(GOVTOF2F, POPEST, fill = 0)

# Write out the estimates as a csv
write.csv(crosstab, file = "crosstab.csv", row.names = FALSE)

In Python you can do the same thing with pandas' pivot_table().


import pandas as pd

# Load the data as codes
codes = pd.read_stata(
    'lfs-2017-q2.dta',
    columns = ['CASENOP', 'GOVTOF2', 'NATOX7', 'PWT17'],
    convert_categoricals = False)

# Load the data as categorical labels
labels = pd.read_stata(
    'lfs-2017-q2.dta',
    columns = ['CASENOP', 'GOVTOF2', 'NATOX7'],
    convert_categoricals = True)

# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F', 'NATOX7F']

# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')

# Group by region name
gd = data.groupby(['GOVTOF2F', 'NATOX7F'])

# Sum the population weights
estimates = gd[['PWT17']].sum()

# Round the estimates
estimates = round(estimates, -3)

# Reshape the population estimates to wide form
crosstab = pd.pivot_table(
    estimates, 
    index = ['NATOX7F'], 
    columns = ['GOVTOF2F'],
    fill_value = 0)

# Write out the estimates as a csv
crosstab.to_csv('crosstab.csv')

While this crosstab provides a very detailed breakdown of the population, many of the individual estimates within it are too small to be reliable, so you shouldn't report estimates at this level of detail for most nationalities (see the caveats below).

Filtering

Finally, we'll filter the data before summing the population weights so that the estimates are for a subset of the population. In this case we'll filter on nationality to select nationals of the eight Eastern European countries that joined the EU in 2004, known as the EU8.

To do this we'll use the codes for NATOX7 to select respondents by their nationality. Note that we filter the data frame on the variable containing the codes (NATOX7), but group by the variable containing the labels (NATOX7F).

You can find the codes for each variable in Volume 3 of the LFS user guide. There are nine codes listed below because some survey respondents give their nationality as “Czechoslovakia”, which no longer exists but whose successor countries are in the EU8.

To filter in R, add a filter() to the start of the pipeline for transforming the data.


# NATOX7 codes for EU8
EU8 <- c(203, 971, 233, 348, 
    428, 440, 616, 703, 705)

# Calculate the population estimates in long form
estimates <- data %>%
    # Select only EU8 nationals
    filter(NATOX7 %in% EU8) %>%
    # Group by region and nationality
    group_by(GOVTOF2F, NATOX7F) %>%
    # Sum the population weights
    summarise(POPEST = sum(PWT17))

In Python, you can use pandas' subsetting operations to select the target rows from the data frame before you group and sum. This time, we'll also call dropna() on the results to remove empty rows left by nationalities we filtered out.


# NATOX7 codes for EU8
EU8 = [203, 971, 233, 348, 
    428, 440, 616, 703, 705]

# Select only EU8 nationals
filtered = data.loc[data['NATOX7'].isin(EU8)]

# Group by region and nationality 
gd = filtered.groupby(['GOVTOF2F', 'NATOX7F'])

# Sum the population weights (and drop empty rows)
estimates = gd[['PWT17']].sum().dropna()

Some caveats

The above examples only scratch the surface of what's possible with the Labour Force Survey. Other things you might want to do include recoding variables, repeating the same analysis over different quarters so as to build up a time series, or calculating confidence intervals for estimates. (These may be the subjects of future posts.)

But bear in mind that while the scale of the Labour Force Survey makes it a rich source of data, it also introduces some potential pitfalls.

First and foremost, the LFS is a survey, so its estimates are surrounded by statistical uncertainty. Estimates based on a small number of respondents are particularly uncertain, and the level of detail in the LFS makes it easy to find lots of small populations, so take care not to base conclusions on tiny samples. To get the size of the sample underpinning each estimate you can count the number of respondents, rather than sum their weights.

In R, use n() rather than sum() as the function in summarise().


summarise(RESP = n())

In Python, call count() rather than sum() on the grouped data.


respondents = gd[['PWT17']].count()

And don't round the results.

If you do this for the crosstab of population by region and nationality, you'll see that many individual estimates are based on just a handful of respondents and are too small to be considered reliable. You should aggregate smaller estimates into higher-level groups; in this case you could use the ONS country groupings.
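For example, here is a minimal sketch in R of collapsing nationalities into broader groups before summing the weights, reusing the EU8 codes from the filtering example above. The "Other" catch-all is just for illustration and is not an official ONS grouping.


# Create a broader nationality grouping from the codes
# (illustrative only: EU8 versus everything else)
data$NATGROUP <- ifelse(data$NATOX7 %in% EU8, "EU8", "Other")

# Sum the population weights within the broader groups
estimates <- data %>%
    group_by(GOVTOF2F, NATGROUP) %>%
    summarise(POPEST = sum(PWT17))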

Another thing to bear in mind is that LFS data is not seasonally adjusted, so comparisons between estimates at different points in time may not be valid. A simple solution is to only compare estimates from the same quarter in each year.

Furthermore, there are technical issues affecting some aspects of the data — the LFS is known to underestimate benefit claimants and seasonal agricultural workers for example — so it's worth reading the literature on the LFS carefully, especially those parts that relate to your analysis.

Finally, it's worth looking at other people's research, to see how they have used the LFS to address similar questions to yours, and how they have dealt with any problems they encountered.