olihawkins

A bestiary of undead statistics

30 Oct 2018 19:47 GMT

For some time now, statisticians and fact-checkers have talked about zombie statistics: false statistical claims that have found their way into public debate and are repeated endlessly and uncritically. They are called zombies because no matter how many times you beat them to death with evidence, they keep coming back to life.

A few months ago I described a particular false claim as a vampire statistic, because it survives in the shadows of semi-public discourse and avoids the daylight of scrutiny that could kill it.

I've been wondering what other kinds of undead statistics there are, and seeing as it's Halloween I've had a go at a typology. This is a little schematic, so I'd welcome any suggestions to flesh it out.

  • Zombie statistic — The classic undead statistic; it survives all attempts to destroy it with facts and keeps on claiming victims.
  • Vampire statistic — A statistic that never dies because it is never exposed to daylight. It is too wrong to appear in the usual arenas of public debate, so it keeps circulating through viral channels. Vampire statistics survive in the dark corners of the internet where paranoia and conspiracy theories flourish.
  • Phantom statistic — A statistical claim with no apparent source. It is either asserted without evidence or attributed to a source that does not contain the statistic.
  • Skeleton statistic — The bare bones of a statistical claim that has been removed from the body of knowledge that gives it life and meaning. This kind of statistic is often true in a narrow or technical sense, but becomes misleading when presented without context.1
  • Frankenstein statistic — A false statistical claim produced by stitching together statistics from different sources that shouldn't be combined.2
  • Werewolf statistic — A statistical howler that comes up with predictable regularity at certain events or times of year.3
  • Mummy statistic — A statistic that was once true but is no longer true. It has somehow been embalmed in the public imagination and keeps coming back to life when it should have died a long time ago.4

Happy Halloween!

Footnotes

1. An example of a skeleton statistic is the claim that more than 90% of communication is non-verbal. Albert Mehrabian's finding was that, in certain experimental settings, more than 90% of the content of communications about feelings and attitudes was non-verbal.

2. An example I can recall is this FT article, which compared the number of EU migrants living in the UK with the number of British migrants living in the EU using datasets that had different definitions of a migrant. To the FT's credit, they quickly corrected the story (because they have a brilliant data team).

3. See Blue Monday, for example. See also the recurring confusion between the net change in the number of people in work and the number of new jobs that accompanies ONS's regular labour market statistics.

4. I once accidentally created a mummy statistic. In 2014, I tweeted that the Telegraph was wrong to say net migration was above 250,000. The tweet was picked up by a fact-checking bot, which intermittently cited me saying this over the next two years, as net migration rose to around 330,000.

Westminster Bubble's final word cloud

14 Aug 2018 20:12 GMT

Last summer I wrote a small Twitter bot called Westminster Bubble. It follows the Twitter accounts of registered Parliamentary journalists and shows what they are tweeting about in one word cloud a day. You can read more about it in the article I posted when it launched.

Westminster Bubble was always intended as a fun side project: a light-hearted way of presenting data on the topics obsessing political journalists in Westminster, which is where I work. It was fun to write, and it was fun to watch it work each day.

And it wasn't difficult to develop because Twitter's streaming API made it very easy to subscribe to the tweets from all the people an account follows. I wrote the code in free moments during my summer break. It was literally ‘what I did on my summer holidays’.
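
For a sense of how simple this was, here is a sketch of that kind of subscription in R using the rtweet package. I'm assuming rtweet for illustration, and journalist_ids is a made-up name for a vector of the user IDs the account follows.

    # Sketch only: stream tweets from a set of followed accounts via
    # Twitter's (now retired) streaming API. journalist_ids is a
    # hypothetical character vector of Twitter user IDs.
    library(rtweet)

    tweets <- stream_tweets(
        q = journalist_ids,     # user IDs to follow in the stream
        timeout = 60 * 60 * 24  # collect tweets for 24 hours
    )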

But in December last year, Twitter announced that it was shutting down its streaming API. The API was originally scheduled to close in June this year, but Twitter pushed the date back to August after some resistance from developers.

Twitter's new Account Activity API works in a completely different way, which means the only way to keep Westminster Bubble running would be to rewrite it from scratch. And I don't think that's going to happen. It could come back to life at some point, but being realistic this is probably the end of the road.

To wrap things up, I thought it would be interesting to make a word cloud covering the whole period Westminster Bubble has been online.

A word cloud representing the relative frequency of words used by journalists covering Westminster politics on Twitter from 16 September 2017 to 14 August 2018

This word cloud has been produced using all tweets from registered Parliamentary journalists covering Westminster politics from 16 September 2017 to 14 August 2018. As anyone who has been following the account knows, Brexit has been the biggest single issue, and the leaders of the Conservative and Labour parties routinely dominate the coverage.
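
The word counting behind each cloud is also simple. A minimal sketch, assuming the collected tweet text is in tweets$text (a made-up name), using the wordcloud package:

    # Sketch: tokenise tweet text, count word frequencies and draw
    # a word cloud. Short words are dropped as a crude stopword filter.
    library(wordcloud)

    words <- unlist(strsplit(tolower(tweets$text), "[^a-z']+"))
    words <- words[nchar(words) > 3]
    freq <- sort(table(words), decreasing = TRUE)
    wordcloud(names(freq), as.integer(freq), max.words = 100)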

An R package for simple data wrangling

7 Jul 2018 11:28 GMT

I recently started a new role at work where one of my tasks is helping statisticians to develop data science skills. I've noticed that one of the most challenging obstacles people encounter when first learning to program is how much they need to learn in order to become productive.

It takes time to become a good programmer — it's a learning experience that never really ends — but there is an inflection point when you become more fluent and the time you spent learning how to do each thing for the first time starts to pay off.

The question I keep coming up against is how to motivate people who are learning to program in a professional setting to persevere through the initial learning period, when doing something in a new way is less efficient than doing it the old way.

Part of the answer is to show people the remarkable things that can only be done in the new way. But perhaps even more important is lowering the barrier to entry: reducing the time it takes beginners to learn simple and useful things.

With that in mind I wrote an R package called cltools, which is designed to make common data wrangling tasks easier to perform. These are all things that an experienced R user could do with base R or tidyverse functions. But the point is to reduce the level of skill people need in order to do useful work with R.

The package is primarily designed to help statistical researchers and data journalists covering public policy fields. It focuses on the simple data wrangling tasks that these researchers do most often. Things like calculating row and column percentages, creating indices, and deflating prices to real terms.
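
To give a flavour of the kinds of tasks I mean, here is how they look in base R. This is a sketch with made-up data, not the cltools API itself:

    # Sketch of the underlying tasks in base R, using made-up data.
    counts <- matrix(c(10, 30, 20, 40), nrow = 2)

    # Row and column percentages
    row_pct <- prop.table(counts, margin = 1) * 100
    col_pct <- prop.table(counts, margin = 2) * 100

    # Index a series to 100 in its first period
    series <- c(200, 210, 231)
    index <- 100 * series / series[1]

    # Deflate nominal prices to real terms using a price deflator
    # indexed so that the base year equals 100
    nominal <- c(100, 105, 112)
    deflator <- c(100, 102, 106)
    real <- 100 * nominal / deflator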

Let me know if you have any suggestions for ways to make it better.

Animating mortality in England and Wales

8 Apr 2018 19:31 GMT

From around 1970 to 2011 there was a broad downward trend in mortality in England and Wales: both the total number of deaths and the crude death rate (the number of deaths per thousand people) fell.
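
Expressed as a calculation, the crude death rate is simply:

    # Crude death rate: deaths per thousand people
    crude_death_rate <- function(deaths, population) {
        1000 * deaths / population
    }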

Since 2011 this long-term downward trend has halted and both the number of deaths and the crude death rate have increased.

Two charts showing the annual number of deaths and the number of deaths per thousand people in England and Wales from 1970 to 2016.

These charts show trends in mortality from 1970 to 2016. But they do not include data for 2017 or 2018, as neither the final totals nor the population estimates needed to calculate mortality rates have been published for these years.

A more fine-grained and up-to-date analysis of recent trends in mortality can be produced using Office for National Statistics figures for the number of weekly deaths registered in England and Wales during the last few years.

This dataset provides the most recent statistics on deaths but comes with the caveats that the figures are provisional and are not normalised to the total population.

Animated chart

This animated chart shows the number of deaths registered in England and Wales in each week of the year for each calendar year from 2011 to 2017, and so far in 2018.

As the chart shows, the number of weekly deaths has gradually increased: the darker lines showing data for more recent years are generally higher than the lighter lines showing data for earlier years.

Some of this increase may be explained by the growing population, but the larger number of winter deaths in 2015, 2017 and 2018, and the rising death rate before 2016 suggest this may not be the only factor.

I thought this was a novel way of presenting the data, and potentially a good use case for animation.
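
If you want to try something similar, a chart along these lines can be built with ggplot2 and gganimate. This is a sketch, assuming a data frame called weekly with columns year, week and deaths; the names are mine, not the format of the ONS download.

    # Sketch: one line of weekly deaths per year, with darker lines
    # for more recent years, adding a year with each frame.
    library(ggplot2)
    library(gganimate)

    p <- ggplot(weekly, aes(x = week, y = deaths,
                            group = year, colour = year)) +
        geom_line() +
        scale_colour_gradient(low = "grey85", high = "grey15") +
        labs(x = "Week of the year", y = "Deaths registered")

    anim <- p +
        transition_states(year) +  # show one year per frame
        shadow_mark()              # keep earlier years on screen

    animate(anim)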

A simple hexmap editor

7 Feb 2018 21:15 GMT

I recently produced a hexmap of local authorities in England and Wales, which I need for mapping a number of different datasets, such as data on internal migration. As I couldn't find an existing hexmap for these areas, I had to create one from scratch.

Initially, I hoped to generate a hexmap algorithmically using the geogrid package in R. But while geogrid did a decent job of minimising the average distance between each area's position on the geographic map and its position on the hexagonal map, the number of local authorities and the variation in their size led to some odd results. Birmingham was on the coast, for example.

Furthermore, the algorithm that geogrid uses isn't designed to preserve certain geospatial relationships, such as contiguous groups of areas like regions, or relative positions that matter logically. For instance, in the original output from geogrid, South Tyneside was north of North Tyneside.

Geogrid is a fantastic starting point, which can do a lot of the groundwork for you, but it seems that to make a good hexmap you need to do a certain amount of the work by hand.
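
For anyone who wants to try it, the geogrid starting point is only a few lines. A sketch, assuming las is a spatial data frame of local authority boundaries (a made-up name):

    # Sketch of the geogrid workflow: fit a hexagonal grid to the
    # boundaries, then assign each area to a hexagon.
    library(geogrid)

    grid <- calculate_grid(shape = las, grid_type = "hexagonal", seed = 1)
    hexmap <- assign_polygons(las, grid)
    plot(hexmap)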

To help with that, I wrote a small tool that lets you edit HexJSON data in the browser, and then export your work in both HexJSON and GeoJSON formats. Like a lot of tools I make, it doesn't have a massive feature set, but aims to do one thing well.
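
If you haven't come across HexJSON, the format is simple: a layout plus a set of hexes with grid coordinates, keyed by an identifier. A minimal example, parsed in R with jsonlite; the q and r values here are made up for illustration.

    # Sketch: a minimal HexJSON document parsed with jsonlite
    library(jsonlite)

    hexjson <- '{
        "layout": "odd-r",
        "hexes": {
            "E08000025": {"n": "Birmingham", "q": 6, "r": 3}
        }
    }'

    hexes <- fromJSON(hexjson)
    hexes$hexes$E08000025$n  # "Birmingham"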

Editing hexmaps in the browser

The HexJSON Editor lets you import hexmap data in HexJSON format and move the hexes around by hand. Simply select the hex you want to move by clicking on it.

A grid of hexagons with one of the hexagons highlighted, indicating it is selected and ready to move

And then click on the destination to place it.

The same grid of hexagons as shown previously, but the selected hexagon has moved

If the position where you place a hex is already occupied by another hex, the editor will swap them.

If you want to try the editor, you can use the example HexJSON grid shown above, or alternatively try the local authority hexmap, or the Open Data Institute's constituency hexmap.

When you load the HexJSON data, you can choose which variable to use for labelling. You also have the option of choosing a categorical variable to use for shading the hexes — the colours are chosen for each category using a standard categorical colour scale. Finally, if you need more space around the hexes for editing, you have the option of adding more padding.

I made the HexJSON editor mainly to meet my own mapping needs, but I thought it was worth sharing. If there is enough interest in new features I could either add them, or open source the code so other people can. In the meantime, it's there if you need it.