
Experimenting with grid histograms

11 Mar 2019 08:38 GMT

A few months ago, when ONS published their population estimates for Parliamentary constituencies in mid-2017, I worked on an analysis of constituencies by median age. This included some interactive charts produced with D3, which I was never able to publish, as the online part of the project didn't get off the ground. Rather than see that work go to waste, I decided to publish the charts here for posterity.

The charts in question are grid histograms, which show the distribution of Parliamentary constituencies by median age at the 2017 General Election. In these charts, each square represents a single constituency, and each chart shows the squares shaded by another variable, such as party, turnout or majority. Click on the image below or the link above to see the interactive versions of these charts.

A set of four charts each showing the distribution of constituencies by median age and their relationship with another variable

On average, median age was lower in seats won by Labour and higher in seats won by the Conservatives. Turnout tended to be higher in seats with a higher median age. And some of Labour's biggest majorities were in seats with the lowest median age.

As a way of showing the strength of the relationship between two variables, I think this type of chart is probably less successful than a scatterplot. But as a way of showing the distribution of one variable within another, I think it works quite well, though perhaps better for the continuous variables than the categorical one.
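For anyone who wants to experiment with the format, here's a rough sketch of a grid histogram in Python using matplotlib. This isn't the D3 code behind the interactive charts, and it uses randomly generated stand-in data rather than the constituency figures, but it shows the basic construction: bin the observations, then stack one coloured square per observation within each bin.

```python
# A minimal sketch of a grid histogram, using made-up data for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
median_age = rng.normal(40, 5, 200)        # stand-in for median age data
colour = rng.choice(["red", "blue"], 200)  # stand-in categorical variable

bins = np.arange(25, 60, 1)                # one-year bins of median age
fig, ax = plt.subplots()

counts = {}
for age, c in zip(median_age, colour):
    b = int(np.digitize(age, bins))        # which bin this square sits in
    counts[b] = counts.get(b, 0) + 1
    # Draw one unit square, stacked on top of earlier squares in its bin
    ax.add_patch(plt.Rectangle((b, counts[b] - 1), 0.9, 0.9, color=c))

ax.set_xlim(0, len(bins) + 1)
ax.set_ylim(0, max(counts.values()) + 1)
ax.set_xlabel("Median age bin")
ax.set_ylabel("Number of observations")
plt.show()
```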

Web scraping for BBC More or Less

10 Feb 2019 17:27 GMT

During my career as a journalist I used to work for BBC Radio Current Affairs, where I would often work on Radio 4's pop stats programme More or Less. I don’t think anyone would be surprised to hear it was my favourite programme.

I am still in touch with the team that make the show, and every now and again the editor Richard Vadon sends me a message to chat about something statistical. Last year he sent me a DM with an interesting question.

A direct message from the editor of More or Less asking me to web scrape data on the position of the planets over time

More or Less wanted to work out which planet was closest to Earth on average, given how their relative positions change over time. They were exploring ways of getting the data and wanted to know if I could help. They did the story on the programme a couple of weeks ago, and they have produced a special version of that show for the BBC’s new interactive web player.

I want to be clear about how I helped with the story. Tim Harford was very generous with his praise on the programme, which was kind of him and the team, but my contribution was essentially a web scraping exercise.

I'm not an astronomer. I do use computational statistics in my job, but I work primarily with social, economic and political data. Richard asked if I could help them with the story by scraping the data from the web, which I did. I'm not even sure this is the best source for the data, but it is a source, and one that was relatively easy to use.

Here’s a chart of the data that I gathered for the story — these are the “wiggly lines” Professor David Rothery talked about during the piece. It shows that, on average over the last fifty years, Mercury was the planet closest to Earth.

A chart showing how the distances of Mercury, Venus and Mars from Earth vary over time

If you want to reproduce the chart yourself, you can download the Python code to gather the data and generate the image from this gist.
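If you don't want to run the scraper, here is a back-of-the-envelope version of the calculation in Python. It is not the code from the gist: it assumes circular, coplanar orbits with approximate published orbital radii and periods, rather than using real ephemeris data, but it reproduces the headline finding that Mercury is closest to Earth on average.

```python
# A rough approximation, not the gist's code: treat each orbit as circular
# and coplanar, and average each planet's distance from Earth over fifty
# years. The radii (AU) and periods (years) are approximate published values.
import numpy as np

planets = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Mars": (1.524, 1.881),
}

t = np.linspace(0, 50, 50 * 365)  # sample positions daily for fifty years

def position(radius, period, t):
    # Position on a circular orbit at time t (in years)
    angle = 2 * np.pi * t / period
    return radius * np.cos(angle), radius * np.sin(angle)

ex, ey = position(1.0, 1.0, t)  # Earth
for name, (radius, period) in planets.items():
    px, py = position(radius, period, t)
    print(f"{name}: {np.hypot(px - ex, py - ey).mean():.2f} AU on average")
```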

R and Python packages for the Parliamentary data platform

27 Jan 2019 18:26 GMT

I recently published two software packages for downloading and analysing data from the new Parliamentary data platform.

The data platform is an ambitious project, which aims to be a canonical source of integrated open data on Parliamentary activity. The data is stored in RDF and is available through a publicly accessible SPARQL endpoint. You can see the structure of the data stored in the platform visualised with WebVOWL.

These packages provide an easy way to use the data platform API in both R and Python. They are aimed at people who want to use Parliamentary data for research and analysis. Their main feature is that they let you easily download data in a structure and format that is suitable for analysis, preserving the links between data so that it is easy to combine the results of different queries.

The packages provide two different interfaces to the data platform:

  • A low-level interface that takes a SPARQL SELECT query, sends it to the platform, and returns the result as a tibble (R) or a DataFrame (Python), with data types appropriately converted (see the sketch after this list).
  • A high-level interface comprising families of functions for downloading specific datasets. This currently focuses on key data about Members of both Houses of Parliament.
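To give a flavour of the low-level pattern, here is a minimal sketch in Python that queries a SPARQL endpoint directly using the SPARQLWrapper and pandas libraries. This is not the package's own API: the endpoint URL and query are illustrative assumptions, the platform's actual predicates differ, and the packages do more work to convert data types.

```python
# A minimal sketch of the low-level pattern: send a SPARQL SELECT query to
# an endpoint and flatten the JSON results into a pandas DataFrame.
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://api.parliament.uk/sparql"  # assumed endpoint URL

query = """
SELECT ?person ?name
WHERE {
    ?person <http://example.com/schema/displayName> ?name .
}
LIMIT 10
"""  # illustrative query: the platform's actual predicates differ

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# One row per result binding, one column per query variable
rows = [
    {var: binding[var]["value"] for var in binding}
    for binding in results["results"]["bindings"]
]
df = pd.DataFrame(rows)
print(df)
```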

I think the data platform is great. It's a really valuable piece of public data infrastructure that has the potential to become a comprehensive digital record of what Parliament does. I hope to expand these packages as more data is added to the platform in future.

A bestiary of undead statistics

30 Oct 2018 19:47 GMT

For some time now, statisticians and fact-checkers have talked about zombie statistics: false statistical claims that have found their way into public debate and are repeated endlessly and uncritically. They are called zombies because no matter how many times you beat them to death with evidence, they keep coming back to life.

A few months ago I described a particular false claim as a vampire statistic, because it survives in the shadows of semi-public discourse and avoids the daylight of scrutiny that could kill it.

I've been wondering what other kinds of undead statistics there are, and seeing as it's Halloween I've had a go at a typology. This is a little schematic, so I'd welcome any suggestions to flesh it out.

  • Zombie statistic — The classic undead statistic; it survives all attempts to destroy it with facts and keeps on claiming victims.
  • Vampire statistic — A statistic that never dies because it is never exposed to daylight. It is too wrong to appear in the usual arenas of public debate, so it keeps circulating through viral channels. Vampire statistics survive in the dark corners of the internet where paranoia and conspiracy theories flourish.
  • Phantom statistic — A statistical claim with no apparent source. It is either asserted without evidence or attributed to a source that does not contain the statistic.
  • Skeleton statistic — The bare bones of a statistical claim that has been removed from the body of knowledge that gives it life and meaning. This kind of statistic is often true in a narrow or technical sense, but is untrue in the way it is presented without context.[1]
  • Frankenstein statistic — A false statistical claim produced by stitching together statistics from different sources that shouldn't be combined.[2]
  • Werewolf statistic — A statistical howler that comes up with predictable regularity at certain events or times of year.[3]
  • Mummy statistic — A statistic that was once true but is no longer true. It has somehow been embalmed in the public imagination and keeps coming back to life when it should have died a long time ago.[4]

Happy Halloween!

Footnotes

1. An example of a skeleton statistic is the claim that more than 90% of communication is non-verbal. Albert Mehrabian's finding was that, in certain experimental settings, more than 90% of the content of communications about feelings and attitudes was non-verbal.

2. An example I can recall is this article, which compared the number of EU migrants living in the UK with the number of British migrants living in the EU using datasets that had different definitions of a migrant. To the FT's credit, they quickly corrected the story (because they have a brilliant data team).

3. See Blue Monday, for example. See also the recurring confusion between the net change in the number of people in work and the number of new jobs that accompanies ONS's regular labour market statistics.

4. I once accidentally created a mummy statistic. In 2014, I tweeted that the Telegraph was wrong to say net migration was above 250,000. The tweet was picked up by a fact-checking bot, which intermittently cited me saying this over the next two years, as net migration rose to around 330,000.

Westminster Bubble's final word cloud

14 Aug 2018 20:12 GMT

Last summer I wrote a small Twitter bot called Westminster Bubble. It follows the Twitter accounts of registered Parliamentary journalists and shows what they are tweeting about in one word cloud a day. You can read more about it in the article I posted when it launched.

Westminster Bubble was always intended as a fun side project: a light-hearted way of presenting data on the topics obsessing political journalists in Westminster, which is where I work. It was fun to write, and it was fun to watch it work each day.

And it wasn't difficult to develop because Twitter's streaming API made it very easy to subscribe to the tweets from all the people an account follows. I wrote the code in free moments during my summer break. It was literally ‘what I did on my summer holidays’.
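For the record, here is a rough sketch of that pattern, written against tweepy 3.x and the streaming API that Twitter is now retiring. The credentials are placeholders, and this is not Westminster Bubble's actual code.

```python
# A minimal sketch: subscribe to tweets from everyone an account follows,
# using tweepy 3.x and Twitter's old streaming API. Credentials are
# placeholders, not real keys.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# Collect the ids of all the accounts this account follows
friend_ids = [str(fid) for fid in tweepy.Cursor(api.friends_ids).items()]

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Store the tweet text for the daily word cloud
        print(status.text)

# Subscribe to all tweets from the followed accounts in one stream
stream = tweepy.Stream(auth=api.auth, listener=TweetListener())
stream.filter(follow=friend_ids)
```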

But in December last year, Twitter announced that it was shutting down its streaming API. The API was originally scheduled to close in June this year, but Twitter pushed the date back to August after some resistance from developers.

Twitter's new Account Activity API works in a completely different way, which means the only way to keep Westminster Bubble running would be to rewrite it from scratch. And I don't think that's going to happen. It could come back to life at some point, but being realistic this is probably the end of the road.

To wrap things up, I thought it would be interesting to make a word cloud covering the whole period Westminster Bubble has been online.

A word cloud representing the relative frequency of words used by journalists covering Westminster politics on Twitter from 16 September 2017 to 14 August 2018

This word cloud has been produced using all tweets from registered Parliamentary journalists covering Westminster politics from 16 September 2017 to 14 August 2018. As anyone who has been following the account knows, Brexit has been the biggest single issue, and the leaders of the Conservative and Labour parties routinely dominate the coverage.
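In case anyone wants to make something similar, word clouds like this are straightforward to produce in Python. Here is a minimal sketch using the wordcloud library; it is not the bot's actual code, and the input file is a placeholder for an archive of collected tweets.

```python
# A minimal sketch: build a word cloud from a text file of collected tweets
# and save it as an image. The filenames are placeholders.
from wordcloud import WordCloud

with open("tweets.txt") as f:
    text = f.read()

cloud = WordCloud(width=1200, height=800, background_color="white")
cloud.generate(text)
cloud.to_file("westminster_bubble.png")
```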