Web scraping for BBC More or Less

10 Feb 2019 17:27 GMT

During my career as a journalist I worked for BBC Radio Current Affairs, where I often worked on Radio 4's pop stats programme More or Less. I don’t think anyone would be surprised to hear it was my favourite programme.

I am still in touch with the team that make the show, and every now and again the editor Richard Vadon sends me a message to chat about something statistical. Last year he sent me a DM with an interesting question.

A direct message from the editor of More or Less asking me to web scrape data on the position of the planets over time

More or Less wanted to work out which planet was closest to Earth on average, given how their relative positions change over time. They were exploring ways of getting the data and wanted to know if I could help. They did the story on the programme a couple of weeks ago, and they have produced a special version of that show for the BBC’s new interactive web player.

I want to be clear about how I helped with the story. Tim Harford was very generous with his praise on the programme, which was kind of him and the team, but my contribution was essentially a web scraping exercise.

I'm not an astronomer. I do use computational statistics in my job but I work primarily with social, economic and political data. Richard asked if I could help them with the story by scraping the data from the web, which I did. I’m not even sure this is the best source for the data, but it is a source, and one that was relatively easy to use.

Here’s a chart of the data that I gathered for the story — these are the “wiggly lines” Professor David Rothery talked about during the piece. It shows that on average over the last fifty years Mercury was the planet that was closest to Earth.

A chart showing how the distances of Mercury, Venus and Mars from Earth vary over time

If you want to reproduce the chart yourself, you can download the Python code to gather the data and generate the image from this gist.
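
If the result seems surprising, it can be sanity-checked without any scraped data. The sketch below is not the code from the gist and does not use the data I gathered. It is a deliberately simplified model that assumes circular orbits in the same plane, with every planet starting lined up on the same side of the Sun, using approximate orbital radii (in astronomical units) and periods (in years). The function names and the daily sampling interval are my own choices for illustration.

```python
import math

# Approximate orbital radius (AU) and orbital period (years).
# Assumption: circular, coplanar orbits, all planets aligned at t = 0.
ORBITS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
}


def position(planet, t_years):
    """Return the (x, y) position of a planet in AU at time t_years."""
    radius, period = ORBITS[planet]
    angle = 2 * math.pi * t_years / period
    return radius * math.cos(angle), radius * math.sin(angle)


def mean_distance_from_earth(planet, years=50, steps_per_year=365):
    """Average the Earth-planet distance over evenly spaced samples."""
    n = years * steps_per_year
    total = 0.0
    for i in range(n):
        t = i / steps_per_year
        ex, ey = position("Earth", t)
        px, py = position(planet, t)
        total += math.hypot(px - ex, py - ey)
    return total / n


means = {p: mean_distance_from_earth(p) for p in ("Mercury", "Venus", "Mars")}
closest = min(means, key=means.get)

for planet, mean in sorted(means.items(), key=lambda item: item[1]):
    print(f"{planet}: {mean:.2f} AU on average")
print(f"Closest on average: {closest}")
```

Even in this crude model Mercury comes out closest on average. Venus gets nearer to Earth than any other planet at its closest approach, but it also spends long stretches on the far side of the Sun, while Mercury's small orbit keeps it from ever straying far.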