Westminster Bubble

16 Sep 2017 22:23 GMT

My latest project is Westminster Bubble, a Twitter bot that shows what Westminster journalists are talking about on Twitter, in one word cloud a day.

The @wmbubble account follows all journalists who are registered with Parliament and who primarily cover Westminster politics in their work and on Twitter.

This is a reasonably objective set of criteria for Westminster journalism but it's not perfect. There is some fuzziness around who primarily covers Westminster and whether that is their focus on Twitter, but the aim is to exclude journalists who are covering only one or two specific areas of policy and to include journalists who are talking about Westminster politics or policy generally.

The bot does not analyse or store individual tweets, or keep any record of who is saying what. It takes the text of each tweet in the stream of tweets from people it follows, cleans the text to remove mentions, hashtags and links, and adds what's left to an anonymous corpus. That corpus is used to make the word cloud at the end of each day.
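The cleaning step described above can be sketched in a few lines of Python. This is an illustration rather than the bot's actual code: the regular expressions and the `clean_tweet` name are assumptions, and the real bot may define mentions, hashtags and links differently.

```python
import re

# Assumed patterns for illustration; the real bot may use different rules.
LINK = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")

def clean_tweet(text):
    """Strip links, mentions and hashtags, then tidy up the whitespace."""
    # Links go first, as a URL can itself contain '@' or '#' characters
    for pattern in (LINK, MENTION, HASHTAG):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

# The corpus is just a list of cleaned strings, with no ids or account names
corpus = []
corpus.append(clean_tweet("Big vote tonight @someone #brexit https://t.co/abc123 says PM"))
```

Only the cleaned text is appended, so nothing in the corpus links a word back to the account that tweeted it.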

If you are a journalist the account follows and you are uncomfortable with your tweets being included in this analysis, please let me know and I will remove your account from the group. Alternatively, you can just block the @wmbubble account. But to reiterate, the bot is not storing any data about individual accounts, just a single block of anonymised text which does not include any ids, account names, hashtags or URLs.

Word clouds are posted daily at around 9:00pm, just before the newspaper front pages come out. The word clouds show the most common words each day and the size of a word represents a measure of its relative frequency within that day's corpus. This measure is not based on frequency alone, but on an equally weighted combination of frequency and rank on frequency.

This is because on some days a small number of words are so dominant in the corpus they would make most other words too small to render at the minimum font size, if size was scaled only to frequency. Giving equal weight to rank on frequency preserves the diversity of words that are visible in the cloud, and ensures you get a better sense of the content of the day's big stories. But this does mean that the relative size of the words is not a direct measure of their relative frequency alone.
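One way to combine frequency and rank with equal weight is sketched below. The normalisation is an assumption made for illustration (frequency divided by the largest frequency, and rank scaled to run from 1 down towards 0); the bot's exact formula may differ, and `word_weights` is a hypothetical name.

```python
def word_weights(counts):
    """Blend normalised frequency and normalised rank with equal weight.

    counts maps each word to its frequency in the day's corpus.
    The scaling used here is an assumption, not the bot's own formula.
    """
    if not counts:
        return {}
    # Most frequent first; tied words keep the order they came in
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    n = len(ordered)
    max_count = ordered[0][1]
    weights = {}
    for position, (word, count) in enumerate(ordered):
        freq_score = count / max_count    # 1.0 for the most common word
        rank_score = (n - position) / n   # 1.0 for rank one, 1/n for the last
        weights[word] = 0.5 * freq_score + 0.5 * rank_score
    return weights

weights = word_weights({"brexit": 100, "budget": 10, "tax": 5})
```

In this toy example "budget" has a tenth of the frequency of "brexit" but ends up with well over a third of its weight, which is the effect described above: a dominant word no longer squeezes everything else down to the minimum font size.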

Frequently collocated words are treated as a single term, so names like Theresa May and Jeremy Corbyn are preserved. The colours are not meaningful and are just randomly chosen from a range of colour maps for variety.

Feel free to address any questions to me via @olihawkins on Twitter. Don't try addressing questions to @wmbubble: it's programmed to ignore those tweets.

Going for broke in King of Tokyo

23 Jul 2017 15:24 GMT

King of Tokyo is a lighthearted but engrossing board game that's based on the classic monster movie genre. You play a monster in the style of Godzilla or King Kong and your objective is to defeat the other monsters battling for control of Tokyo. Since my brother introduced me to the game last summer it has quickly become a family favourite. It's a game I can play with my seven-year-old daughter and her 83-year-old grandad and everyone has a good time.

I won't run through all of the rules here. The only thing you need to understand to read this article is the game's central dice mechanism. Underneath the thematic flavour of giant monsters with superpowers, King of Tokyo is a dice rolling game with similar rules to Yahtzee. The dice are the traditional six-sided type, but with different faces. These are the digits 1, 2, and 3, a claw, a heart, and a lightning bolt.

On your turn you roll six dice up to three times. After each roll you can keep any of the dice that you like and continue rolling the remaining dice until you either have all the faces that you want, or you run out of rolls. The dice that are face up at the end of all your rolls are your hand for that turn, and you then resolve the outcomes.

Dice with numbers give you victory points if you have three or more of the same number. Claws deal damage to your opponents, hearts restore your health, and lightning gives you energy which you can use to buy power cards.

Quite often in the game you can benefit from getting dice of more than one type on the same turn: do some damage to an opponent and heal yourself a bit; collect some lightning for power cards and gain a few victory points.

But at other times you need to go for broke and try to get as many of one dice face as possible. This tends to happen at the most decisive moments in the game: when you need just a few more victory points to win outright, when you need to do enough damage to kill an opponent before they win, when you need to heal quickly to prevent your own impending death, or when you need to collect enough lightning to buy a vital power card before someone else.

You can model this strategy with a Python function that looks like this:

import numpy as np

def max_outcome_for_face(dice=6, rolls=3):

    hits = 0
    outcome = [0] * rolls

    for roll in range(rolls):

        # Get a random number from one to six for each dice
        results = np.random.randint(1, 7, size=dice)
        # Count the number of ones: a one is a hit
        numhits = np.count_nonzero(results == 1)
        # Add to hits and remove a dice for each hit
        hits = hits + numhits
        dice = dice - numhits 

        # Store the hits after each roll
        outcome[roll] = hits

    return outcome

This function simulates a turn in the game with the given number of dice and rolls — six dice with three rolls by default. After each roll, it counts the number of times the target face was rolled, adds that number to a running total for the number of dice with the target face, and removes those dice with the target face from subsequent rolls. After all the rolls have completed, it returns a list showing the cumulative number of dice with the target face after each roll.

If you call this function a large enough number of times and collect the results, you can obtain the probability distribution for the outcomes of this strategy for any given number of dice and rolls. The heatmap below shows the distribution of outcomes after ten million turns with six dice and up to four rolls.
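Collecting the results might look something like the sketch below. It is not the code from the GitHub repository: to keep it fast it simulates all the turns at once, drawing the number of hits on each roll from a binomial distribution, which is equivalent to rolling the remaining dice and counting the target face. The names `simulate_turns` and `outcome_distribution` are hypothetical.

```python
import numpy as np

def simulate_turns(turns=100_000, dice=6, rolls=3, seed=None):
    """Return an array of cumulative hits, one row per turn, one column per roll."""
    rng = np.random.default_rng(seed)
    remaining = np.full(turns, dice)      # dice still to be rolled in each turn
    hits = np.zeros(turns, dtype=int)
    outcomes = np.zeros((turns, rolls), dtype=int)
    for roll in range(rolls):
        # Each remaining dice shows the target face with probability 1/6
        new_hits = rng.binomial(remaining, 1 / 6)
        hits += new_hits
        remaining -= new_hits
        outcomes[:, roll] = hits
    return outcomes

def outcome_distribution(outcomes, dice=6):
    """Return the share of turns with 0..dice hits after each roll."""
    turns, rolls = outcomes.shape
    dist = np.zeros((dice + 1, rolls))
    for roll in range(rolls):
        counts = np.bincount(outcomes[:, roll], minlength=dice + 1)
        dist[:, roll] = counts / turns
    return dist

outcomes = simulate_turns(turns=200_000, dice=6, rolls=3, seed=1)
dist = outcome_distribution(outcomes, dice=6)
```

Each column of `dist` is the distribution of hits after that roll, so summing `dist[2:, 2]` gives the estimated probability of two or more hits after the third roll. A table like this is the kind of input a heatmap can be drawn from.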

A heatmap showing the probability distribution of outcomes for dice rolls in King of Tokyo where the player is trying to get as many of one dice face as possible.

In King of Tokyo you get three rolls with six dice by default, so while this chart shows the outcomes for up to four rolls, the third column is the most relevant. This shows that if you are trying to get as many dice as possible with one particular face, you have an 80% chance of getting two or more dice with the given face, and a 21% chance of getting four or more.

So why does the heatmap show probabilities for up to four rolls? Because some of the power cards you can buy in the game give you an extra roll of the dice, and the Giant Brain card in particular gives you an extra roll as a permanent effect. With a fourth roll of the dice the probability of getting two or more dice with the target face increases to 91%, and the probability of getting four or more rises to 38%.

There are also power cards in the game that give you an extra dice on each roll: the Extra Head card gives you this as a permanent effect. Here is the distribution of outcomes for seven dice and up to four rolls.

A heatmap showing the probability distribution of outcomes for dice rolls in King of Tokyo where the player is trying to get as many of one dice face as possible.

This shows that an extra dice is worth less than an extra roll when you are going after a particular face. With seven dice and three rolls the probability of getting two or more of a given face is 87%, and the probability of getting four or more is 33%. That's better than six dice with three rolls, but not as good as six dice with four. It's worth noting, though, that an extra dice confers other benefits: you get more stuff.

Of course, if you can get the combination of cards that gives you seven dice with four rolls, then getting four or more dice with the target face becomes more likely than not, with a 54% chance.
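These simulated figures can be cross-checked without any simulation. Under this strategy each dice is rerolled until it shows the target face, so it ends the turn on that face unless it misses on every roll, which happens with probability (5/6) to the power of the number of rolls. The number of hits is therefore binomial. The sketch below is a check of my own rather than part of the original analysis, and `prob_at_least` is a hypothetical name.

```python
from math import comb

def prob_at_least(k, dice=6, rolls=3, faces=6):
    """Exact probability of ending the turn with at least k dice on the target face."""
    # A dice only fails if it misses the target face on every one of its rolls
    p = 1 - (1 - 1 / faces) ** rolls
    return sum(
        comb(dice, i) * p**i * (1 - p) ** (dice - i)
        for i in range(k, dice + 1)
    )

for dice, rolls in [(6, 3), (6, 4), (7, 3), (7, 4)]:
    two_plus = prob_at_least(2, dice, rolls)
    four_plus = prob_at_least(4, dice, rolls)
    print(f"{dice} dice, {rolls} rolls: {two_plus:.0%} two or more, {four_plus:.0%} four or more")
```

The exact values should agree with the simulated percentages quoted above to within rounding.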

This analysis was done using numpy and pandas, and the heatmaps were produced with matplotlib and seaborn. The complete source code is available on GitHub.

Making hexmaps with D3

24 Jun 2017 14:27 GMT

I spend a lot of time working with data for Parliamentary constituencies. I often want to map constituency data in a way that gives each constituency equal weight, especially when mapping election data, but until recently I had never seen an interactive hexmap of constituencies online. So I was really pleased when two months ago the Open Data Institute released not just a hexmap of Parliamentary constituencies, but also a specification for describing hexmap data called HexJSON.

You can read more about the HexJSON spec on the ODI's website, but briefly: HexJSON describes a hexmap as a set of hexes with column (q) and row (r) coordinates within a given coordinate system, which is specified with the layout property.

HexJSON that looks like this:

	"hexes": {

Describes a hexmap that looks like this:

An example of a hexmap with four columns, four rows, and a pointy top layout.

I wanted a way to render HexJSON data generally, so I wrote a small D3 plugin called d3-hexjson, which takes a HexJSON object and generates the data necessary to render it easily with D3.
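To give a sense of the kind of calculation involved in rendering a hexmap, here is a rough Python sketch that turns q and r coordinates into hex centres for a pointy-top layout. It is not the d3-hexjson code or its API. The tiny `hexjson` object, its keys and the `hex_centres` function are made up for illustration, and the sketch assumes the "odd-r" layout means odd-numbered rows are shifted half a hex to the right, with y increasing with r.

```python
import math

# A hypothetical two-by-two HexJSON-style object, not the example from the post
hexjson = {
    "layout": "odd-r",
    "hexes": {
        "Q0R0": {"q": 0, "r": 0},
        "Q1R0": {"q": 1, "r": 0},
        "Q0R1": {"q": 0, "r": 1},
        "Q1R1": {"q": 1, "r": 1},
    },
}

def hex_centres(hexjson, size=1.0):
    """Return {key: (x, y)} centres for pointy-top hexes in an odd-r layout.

    size is the distance from a hex's centre to each of its corners.
    """
    if hexjson["layout"] != "odd-r":
        raise ValueError("This sketch only handles the odd-r layout")
    width = math.sqrt(3) * size      # horizontal distance between neighbours
    row_height = 1.5 * size          # vertical distance between rows
    centres = {}
    for key, hex_ in hexjson["hexes"].items():
        q, r = hex_["q"], hex_["r"]
        shift = 0.5 if r % 2 else 0  # odd rows sit half a hex to the right
        centres[key] = (width * (q + shift), row_height * r)
    return centres

centres = hex_centres(hexjson)
```

A renderer would then draw a hexagon around each centre, flipping the y axis if it wants row zero at the bottom of the screen.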

Code examples can be found in the GitHub readme, and are shown in two blocks by Henry Lau. Giuseppe Sollazzo used d3-hexjson to create a visualisation showing the potential impact of swing on the number of Conservative and Labour seats won at the 2017 General Election, and wrote an article explaining how he did it.

My first use of the plugin was to run live hexmaps of the 2017 General Election results as they came in overnight on polling day. We were recording the results as they were announced at work, so we could send the data to the hexmaps very easily. At one point we fell behind the announced results as most of our election volunteers did not start until 3:00am and the results were already coming in quickly at that point.

Below is the hexmap of MPs by gender: a record number of women MPs were elected at the 2017 General Election. And here are links to all the completed hexmaps showing the 2017 results, along with the 2015 results for comparison.

A hexmap showing MPs elected in June 2017 by gender.

Animating the difference between seats and votes in first past the post elections

9 May 2017 07:44 GMT

I've been playing with animated treemaps. This treemap illustrates the difference between the number of seats and votes won by political parties in each nation and region of Great Britain at the 2015 General Election. Northern Ireland is not shown as NI has its own distinct political parties, so comparisons with other parts of the UK are less meaningful.

I wanted to try presenting the data in this way because it helps address an interesting question: what is the role of animation in data visualisation? I've been thinking about this question ever since posting an animated uncertainty chart a few years ago.

Animation is superficially appealing because the eye loves motion, and interfaces that respond to user input feel alive. But sometimes animations in visualisations add little if anything to the reader's understanding of the data. They can be like animated transitions in PowerPoint, which tend to communicate that the presenter has spent more time thinking about the style of their presentation than the content.

I think animations can be valuable when they help show something about the data. The extent of change in a transition can be a dimension along which comparisons can be made. In this case, you could show two treemaps side by side (one for seats and one for votes) but the eye would have to move between the two to find the same region and make comparisons. Here you can hold the eye still and observe the magnitude of the transition from one state to the other. That comparison is meaningful in this context because in a perfectly proportional electoral system there would be no difference at all.

I don't think the effect works in all cases. In particular, when a party's position in a nation or region changes along with its size you lose the visual baseline for the comparison and it becomes harder to judge by how much a party's share has changed. Nevertheless, I thought this was an efficient way to summarise a large dataset and the comparison was worth sharing.

Visualising migration between the countries of the UK

14 Mar 2017 21:55 GMT

I've been experimenting with Sankey diagrams using d3 and thought I'd share an example. This visualisation shows migration flows between the different countries of the UK in the year ending June 2015. The data comes from the Office for National Statistics annual release on internal migration. In this dataset, internal migration refers to people moving to a new home in a different part of the UK.

When it comes to migration between the countries of the UK, most of the flows are between England and each of the other countries. There is much less direct migration between Wales, Scotland, and Northern Ireland. This may be because of the geographical arrangement of the UK, the size of England relative to the other countries, the movement of people to and from England's major cities (especially London), or a combination of all those things.

The flows between England and each other part of the UK are fairly balanced, with a similar number of people moving in each direction. Interestingly, the flows between Wales and England are slightly larger than the flows between Scotland and England, even though the population of Scotland is larger than that of Wales. What's not shown in this visualisation is the large number of moves within England itself.

One aspect of these charts that I'm undecided on is how the links — the flowing bridges between the origin and destination nodes — should be shaded. Most of the examples of Sankey diagrams made with d3 use a single neutral colour for all the links (see Mike Bostock's example). In this case I have used asymmetric shading: the links are shaded according to their origin node. This lets you trace the flows from their origin, reading from left to right, while you can easily see the composition of the flows at the destination without having to trace them back.