The ONS publishes LFS and APS microdata through the UK Data Service. I can't link directly to the datasets here, but they are publicly available if you sign up for a UK Data Service account. This data is used in academic, political, and commercial research for various purposes, and a common use is to produce population estimates for specific groups.

The datasets are published in SPSS and Stata formats. Many people either can't afford these programs or would prefer to use something else. This article provides a basic introduction to working with LFS datasets in Python and R. It shows how to load the data in a usable format and how to produce population estimates as frequency tables and crosstabs.

The examples below use a dataset stored in the Stata file format (.dta) so make sure you have a recent LFS dataset in that format and that you give it the same filename as you use in your code. In Python we will use the pandas library, while in R we will use the tidyverse and haven packages, so make sure you have the right libraries for your chosen language installed too. It's also a good idea to keep the LFS user guide to hand. Details of the survey variables and their codes are in Volume 3.

Python and R can both read files in .dta format, but there are a couple of issues with LFS data that it helps to know about.

First, the LFS contains a large number of detailed categorical variables. In .dta and .spss files these are each represented as a set of numerical codes, along with a set of associated labels. When working with these variables in Python and R it is best to refer to individual categories using their numerical codes, as these are guaranteed to be consistent across datasets for different quarters. But when summarising the data for presentation it is better to use the labels, so the results are readable.

Ideally, you want both the codes and the labels for any variables of interest available in your data frame, but the data structures for categorical variables in Python and R either convert the codes to labels, or only preserve the codes.

The second difficulty is that while Python and R both require unique values for their categorical data types, some of the variables in LFS datasets have labels that are not unique. So you can't use these labels as categoricals in pandas, or as factors in R, even if you want to. If you try to automatically convert these variables when you load the data, you'll get errors. Fortunately, there are ways around this, and the issue only affects a small number of variables which are often not needed.

The solution to these problems is to have two copies of each categorical variable of interest in your data frame whenever possible: one containing the codes and the other containing the labels. That way you can use the codes when you need to select particular categories in your analysis, and the labels to present the results. There are different ways of achieving this in Python and R but the result is essentially the same. For problem variables with duplicate labels you can just use the codes and create your own unique labels if necessary.

To illustrate how this works in practice, the first example below shows how to load and summarise an LFS dataset in order to produce a simple breakdown of the population of the UK by region. We'll do it first in R, and then do the same thing in Python. After that, we'll extend the example to produce a crosstab of the population by region and nationality. Finally, we'll filter the data to produce the same crosstab for a subset of the population: nationals of EU8 countries.

To produce a population estimate from the LFS you need to sum the population weights for the population in question. The population weights are stored in a variable called PWTx where x is a year expressed as two digits. In the dataset I am using the weight variable is called PWT17.

The other variable we need is the one that identifies each respondents' country or region of residence within the UK, which is called GOVTOF2. We are going to create a second version of this variable using its labels rather than its codes. That way, when we group the data by this variable and sum the population weights in each region, we will end up with a readable frequency table.

Let's start with R. There is more than one way to load a .dta file with R, but we are going to use the haven package, because haven has a nice way of handling categorical data stored in .dta files.

Instead of forcing you to choose between using codes or labels for categorical data at the point you load the file, haven loads categorical variables as a special data type called the *labelled* type. This represents the data as codes in the data frame, but keeps a list of the labels associated with each code for reference. You can create a factor from a *labelled* variable at any point using *as_factor()* if you want to use the labels in the data frame instead.

The following code loads an LFS Stata file as a data frame with haven's *read_dta()* method, and uses *as_factor()* to create a second version of GOVTOF2 based on its labels rather than its codes. We'll call this variable GOVTOF2F to identify it as a factor.

It's then just a question of grouping the data by GOVTOF2F and summing the population weights within each group. (If you don't understand what is happening with the *%>%* operator, called a pipe, have a look at Chapter 5 of R for Data Science.)

It's also a good idea to round the estimates to at least a thousand so as not to be spuriously precise. You can then save the output to disk with *write.csv()*.

```
library(tidyverse)
library(haven)
# Load the data file
data <- read_dta("lfs-2017-q2.dta")
# Create a label version of the region variable
data$GOVTOF2F <- as_factor(data$GOVTOF2)
# Calculate the population by region
estimates <- data %>%
# Group by region
group_by(GOVTOF2F) %>%
# Sum the population weights
summarise(POPEST = sum(PWT17))
# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)
# Write out the estimates as a csv
write.csv(estimates, file = "frequency.csv", row.names = FALSE)
```

Now let's do the same thing with Python. Unlike haven's *read_dta()* method, the *read_stata()* method in pandas does not preserve both the codes and the labels when loading categorical data: you have to choose which you want when you load it. To get around this, you can load the data twice, once as codes and once as labels, and then combine the two sets of variables into a single dataset.

To do this, we can take advantage of a nice feature of *read_stata()* that haven's *read_dta()* doesn't have, which is the ability to load only selected variables from the data file. So while this takes a few more lines of code in Python, and involves reading from the data file twice, it actually executes faster.

First, we load the variables of interest as codes. These are the population weight (PWT17), the region of residence (GOVTOF2), and the respondent unique identifier (CASENOP) — we'll use CASENOP to link the codes with the labels.

Next, we load the variables we need as labels and rename them accordingly. We only need GOVTOF2 for labelling, but we'll also need CASENOP for linking. Then we merge the labels with the codes by joining the two datasets on CASENOP.

After that, we use the same approach we used in R: group the data by region, sum the population weights in each region, and round the estimates to the nearest thousand.

```
import pandas as pd
# Load the data as codes
codes = pd.read_stata(
'lfs-2017-q2.dta',
columns = ['CASENOP', 'GOVTOF2', 'PWT17'],
convert_categoricals = False)
# Load the data as categorical labels
labels = pd.read_stata(
'lfs-2017-q2.dta',
columns = ['CASENOP', 'GOVTOF2'],
convert_categoricals = True)
# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F']
# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')
# Group by region
gd = data.groupby(['GOVTOF2F'])
# Sum the population weights
estimates = gd[['PWT17']].sum()
# Round the estimates
estimates = round(estimates, -3)
# Write out the estimates as a csv
estimates.to_csv('frequency.csv')
```

In these examples we didn't use the version of GOVTOF2 that contains the codes. But it's a good idea to have both versions of a variable you're using available, in case you want to select specific categories within it later on. We'll do this in the section on filtering below.

But first, let's adapt the examples to change the output from a one dimensional frequency table to a two dimensional crosstab of the population by region and nationality.

We need to do three new things: create a labelled version of the nationality variable (NATOX7), group the data by both region *and* nationality before summing the population weights, and then reshape the estimates for each combination of region and nationality from long form to wide form.

In R you can do this with *spread()*, which is part of the tidyr package.

```
library(tidyverse)
library(haven)
# Load data and create labelled versions of the variables
data <- read_dta("lfs-2017-q2.dta")
data$GOVTOF2F <- as_factor(data$GOVTOF2)
data$NATOX7F <- as_factor(data$NATOX7)
# Calculate the population estimates in long form
estimates <- data %>%
# Group by region and nationality
group_by(GOVTOF2F, NATOX7F) %>%
# Sum the population weights
summarise(POPEST = sum(PWT17))
# Round the estimates
estimates$POPEST <- round(estimates$POPEST, -3)
# Reshape the population estimates to wide form
crosstab <- estimates %>%
spread(GOVTOF2F, POPEST, fill = 0)
# Write out the estimates as a csv
write.csv(crosstab, file = "crosstab.csv", row.names = FALSE)
```

In Python you can do the same thing with pandas' *pivot_table()*.

```
import pandas as pd
# Load the data as codes
codes = pd.read_stata(
'lfs-2017-q2.dta',
columns = ['CASENOP', 'GOVTOF2', 'NATOX7', 'PWT17'],
convert_categoricals = False)
# Load the data as categorical labels
labels = pd.read_stata(
'lfs-2017-q2.dta',
columns = ['CASENOP', 'GOVTOF2', 'NATOX7'],
convert_categoricals = True)
# Rename the label variables
labels.columns = ['CASENOP', 'GOVTOF2F', 'NATOX7F']
# Merge the codes and labels on the respondent id
data = codes.merge(labels, on = 'CASENOP')
# Group by region name
gd = data.groupby(['GOVTOF2F', 'NATOX7F'])
# Sum the population weights
estimates = gd[['PWT17']].sum()
# Round the estimates
estimates = round(estimates, -3)
# Reshape the population estimates to wide form
crosstab = pd.pivot_table(
estimates,
index = ['NATOX7F'],
columns = ['GOVTOF2F'],
fill_value = 0)
# Write out the estimates as a csv
crosstab.to_csv('crosstab.csv')
```

While this crosstab provides a very detailed breakdown of the population, many of the inidividual estimates within it are too small to be reliable, so you shouldn't report estimates at this level of detail for most nationalities (see the caveats below).

Finally, we'll filter the data before summing the population weights so that the estimates are for a subset of the population. In this case we'll filter on nationality to select nationals of the eight Eastern European countries that joined the EU in 2004, known as the EU8.

To do this we'll use the codes for NATOX7 to select respondents by their nationality. Note that we filter the data frame on the variable containing the codes (NATOX7), but group by the variable containing the labels (NATOX7F).

You can find the codes for each variable in Volume 3 of the LFS user guide. There are nine codes listed here because sometimes survey respondents give their nationality as “Czechoslovakia”, which no longer exists but whose successor countries are in the EU8.

To filter in R, add a *filter()* to the start of the pipeline for transforming the data.

```
# NATOX7 codes for EU8
EU8 <- c(203, 971, 233, 348,
428, 440, 616, 703, 705)
# Calculate the population estimates in long form
estimates <- data %>%
# Select only EU8 nationals
filter(NATOX7 %in% EU8) %>%
# Group by region and nationality
group_by(GOVTOF2F, NATOX7F) %>%
# Sum the population weights
summarise(POPEST = sum(PWT17))
```

In Python, you can use pandas' subsetting operations to select the target rows from the data frame before you group and sum. This time, we'll also call *dropna()* on the results to remove empty rows left by nationalities we filtered out.

```
# NATOX7 codes for EU8
EU8 = [203, 971, 233, 348,
428, 440, 616, 703, 705]
# Select only EU8 nationals
filtered = data.loc[data['NATOX7'].isin(EU8)]
# Group by region and nationality
gd = filtered.groupby(['GOVTOF2F', 'NATOX7F'])
# Sum the population weights (and drop empty rows)
estimates = gd[['PWT17']].sum().dropna()
```

The above examples only scratch the surface of what's possible with the Labour Force Survey. Other things you might want to do include recoding variables, repeating the same analysis over different quarters so as to build up a time series, or calculating confidence intervals for estimates. (These may be the subjects of future posts.)

But bear in mind that while the scale of the Labour Force Survey makes it a rich source of data, it also introduces some potential pitfalls.

First and foremost, the LFS is a survey, so its estimates are surrounded by statistical uncertainty. Estimates based on a small number of respondents are particularly uncertain, and the level of detail in the LFS makes it easy to find lots of small populations, so take care not to base conclusions on tiny samples. To get the size of the sample underpinning each estimate you can count the number of respondents, rather than sum their weights.

In R, use *n()* rather than *sum()* as the function in *summarise()*.

```
summarise(RESP = n())
```

In Python, call *count()* rather than *sum()* on the grouped data.

```
respondents = gd[['PWT17']].count()
```

And don't round the results.

If you do this for the crosstab of population by region and nationality you'll see that many individual estimates are based on just a handful of respondents, and are too small to be considered reliable. You should aggregate smaller estimates into higher level groups; in this case you could use the ONS country groupings.

Another thing to bear in mind is that LFS data is not seasonally adjusted, so comparisions between estimates at different points in time may not be valid. A simple solution is to only compare estimates from the same quarter in each year.

Furthermore, there are technical issues affecting some aspects of the data — the LFS is known to underestimate benefit claimants and seasonal agricultural workers for example — so it's worth reading the literature on the LFS carefully, especially those parts that relate to your analysis.

Finally, it's worth looking at other people's research, to see how they have used the LFS to address similar questions to yours, and how they have dealt with any problems they encountered.

]]>The @wmbubble account follows all journalists who are registered with Parliament and who primarily cover Westminster politics in their work and on Twitter.

This is a reasonably objective set of criteria for Westminster journalism but it's not perfect. There is some fuzziness around who primarily covers Westminster and whether that is their focus on Twitter, but the aim is to exclude journalists who are covering only one or two specific areas of policy and to include journalists who are talking about Westminster politics or policy generally.

The bot does not analyse or store individual tweets, or keep any record of who is saying what. It takes the text of each tweet in the stream of tweets from people it follows, cleans the text to remove mentions, hashtags and links, and adds what's left to an anonymous corpus. That corpus is used to make the word cloud at the end of each day.

If you are a journalist the account follows and you are uncomfortable with your tweets being included in this analysis, please let me know and I will remove your account from the group. Alternatively, you can just block the @wmbubble account. But to reiterate, the bot is not storing any data about individual accounts, just a single block of anonymised text which does not include any ids, account names, hashtags or URLs.

Word clouds are posted daily at around 9:00pm, just before the newspaper front pages come out. The word clouds show the most common words each day and the size of a word represents a measure of its relative frequency within that day's corpus. This measure is not based on frequency alone, but an equally weighted combination of frequency and rank on frequency.

This is because on some days a small number of words are so dominant in the corpus they would make most other words too small to render at the minimum font size, if size was scaled only to frequency. Giving equal weight to rank on frequency preserves the diversity of words that are visible in the cloud, and ensures you get a better sense of the content of the day's big stories. But this does mean that the relative size of the words is not a direct measure of their relative frequency alone.

Frequently collocated words are treated as a single term, so names like Theresa May and Jeremy Corbyn are preserved. The colours are not meaningful and are just randomly chosen from a range of colour maps for variety.

Feel free to address any questions to me via @olihawkins on Twitter. Don't try addressing questions to @wmbubble: it's programmed to ignore those tweets.

]]>I won't run through all of the rules here. The only thing you need to understand to read this article is the game's central dice mechanism. Underneath the thematic flavour of giant monsters with superpowers, King of Tokyo is a dice rolling game with similar rules to Yahtzee. The dice are the traditional six-sided type, but with different faces. These are the digits 1, 2, and 3, a claw, a heart, and a lightning bolt.

On your turn you roll six dice up to three times. After each roll you can keep any of the dice that you like and continue rolling the remaining dice until you either have all the faces that you want, or you run out of rolls. The dice that are face up at the end of all your rolls are your hand for that turn, and you then resolve the outcomes.

Dice with numbers give you victory points if you have three or more of the same number. Claws deal damage to your opponents, hearts restore your health, and lightning gives you energy which you can use to buy power cards.

Quite often in the game you can benefit from getting dice of more than one type on the same turn: do some damage to an opponent *and* heal yourself a bit; collect some lightning for power cards *and* gain a few victory points.

But at other times you need to go for broke and try to get as many of one dice face as possible. This tends to happen at the most decisive moments in the game: when you need just a few more victory points to win outright, when you need to do enough damage to kill an opponent before they win, when you need to heal quickly to prevent your own impending death, or when you need to collect enough lightning to buy a vital power card before someone else.

You can model this strategy with a Python function that looks like this:

```
import numpy as np
def max_outcome_for_face(dice=6, rolls=3):
hits = 0
outcome = [0] * rolls
for roll in range(rolls):
# Get a random number from one to six for each dice
results = np.random.randint(1, 7, size=dice)
# Count the number of ones: a one is a hit
numhits = np.count_nonzero(results == 1)
# Add to hits and remove a dice for each hit
hits = hits + numhits
dice = dice - numhits
# Store the hits after each roll
outcome[roll] = hits
return outcome
```

This function simulates a turn in the game with the given number of dice and rolls — six dice with three rolls by default. After each roll, it counts the number of times the target face was rolled, adds that number to a running total for the number of dice with the target face, and removes those dice with the target face from subsequent rolls. After all the rolls have completed, it returns a list showing the cumulative number of dice with the target face after each roll.

If you call this function a large enough number of times and collect the results, you can obtain the probability distribution for the outcomes of this strategy for any given number of dice and rolls. The heatmap below shows the distribution of outcomes after ten million turns with six dice and up to four rolls.

In King of Tokyo you get three rolls with six dice by default, so while this chart shows the outcomes for up to four rolls, the third column is the most relevant. This shows that if you are trying to get as many dice as possible with one particular face, you have an 80% chance of getting two or more dice with the given face, and a 21% chance of getting four or more.

So why does the heatmap show probabilities for up to four rolls? Because some of the power cards you can buy in the game give you an extra roll of the dice, and the Giant Brain card in particular gives you an extra roll as a permanent effect. With a fourth roll of the dice the probability of getting two or more dice with the target face increases to 91%, and the probability of getting four or more rises to 38%.

There are also power cards in the game that give you an extra dice on each roll: the Extra Head card gives you this as a permanent effect. Here is the distrbution of outcomes for seven dice and up to four rolls.

This shows that an extra dice is worth less than an extra roll when you are going after a particular face. With seven dice and three rolls the probability of getting two or more of a given face is 87%, and the probability of getting four or more is 33%. That's better than six dice with three rolls, but not as good as six dice with four. Although it's worth noting that an extra dice confers other benefits — you get more stuff.

Of course, if you can get the combination of cards that gives you seven dice with four rolls, then getting four or more dice with the target face becomes more likely than not, with a 54% chance.

This analysis was done using numpy and pandas, and the heatmaps were produced with matplotlib and seaborn. The complete source code is available on GitHub.

]]>You can read more about the HexJSON spec on the ODI's website, but briefly: HexJSON describes a hexmap as a set of hexes with column *(q)* and row *(r)* coordinates within a given coordinate system, which is specified with the *layout* property.

HexJSON that looks like this:

```
{
"layout":"odd-r",
"hexes": {
"Q0R0":{"q":0,"r":0},
"Q1R0":{"q":1,"r":0},
"Q2R0":{"q":2,"r":0},
"Q3R0":{"q":3,"r":0},
"Q0R1":{"q":0,"r":1},
"Q1R1":{"q":1,"r":1},
"Q2R1":{"q":2,"r":1},
"Q3R1":{"q":3,"r":1},
"Q0R2":{"q":0,"r":2},
"Q1R2":{"q":1,"r":2},
"Q2R2":{"q":2,"r":2},
"Q3R2":{"q":3,"r":2},
"Q0R3":{"q":0,"r":3},
"Q1R3":{"q":1,"r":3},
"Q2R3":{"q":2,"r":3},
"Q3R3":{"q":3,"r":3}
}
}
```

Describes a hexmap that looks like this:

I wanted a way to render hexjson data generally, so I wrote a small D3 plugin called d3-hexjson, which takes a hexjson object and generates the data necessary to render it easily with D3.

Code examples can be found in the GitHub readme, and are shown in two blocks by Henry Lau. Giuseppe Sollazzo used d3-hexjson to create a visualisation showing the potential impact of swing on the number of Conservative and Labour seats won at the 2017 General Elecion, and wrote an article explaining how he did it.

My first use of the plugin was to run live hexmaps of the 2017 General Election results as they came in overnight on polling day. We were recording the results as they were announced at work, so we could send the data to the hexmaps very easily. At one point we fell behind the announced results as most of our election volunteers did not start until 3:00am and the results were already coming in quickly at that point.

Here are links to all the completed hexmaps showing the 2017 results, along with the 2015 results for comparison.

]]>I wanted to try presenting the data in this way because it helps address an interesting question: what is the role of animation in data visualisation? I've been thinking about this question ever since posting an animated uncertainty chart a few years ago.

Animation is superficially appealing because the eye loves motion, and interfaces that respond to user input feel alive. But sometimes animations in visualisations add little if anything to the reader's understanding of the data. They can be like animated transitions in PowerPoint, which tend to communicate that the presenter has spent more time thinking about the style of their presentation than the content.

I think animations can be valuable when they help show something about the data. The extent of change in a transition can be a dimension along which comparisons can be made. In this case, you could show two treemaps side by side (one for seats and one for votes) but the eye would have to move between the two to find the same region and make comparisons. Here you can hold the eye still and observe the magnitude of the transition from one state to the other. That comparison is meaningful in this context because in a perfectly proportional electoral system there would be no difference at all.

I don't think the effect works in all cases. In particular, when a party's position in a nation or region changes along with its size you lose the visual baseline for the comparison and it becomes harder to judge by how much a party's share has changed. Nevertheless, I thought this was an efficient way to summarise a large dataset and the comparison was worth sharing.

]]>