25 Sep 2013 19:27 GMT
A few years ago I was talking to Michael Blastland about uncertainty. If you don't know Michael, he's an author and journalist who has a particular gift for explaining statistics in a way that people who normally fear numbers not only understand, but even enjoy.
Michael had been writing an article for his Go Figure column on the BBC website about the representation of uncertainty in charts and tables. He was making the point that many of the methods used to present data in mainstream media do not represent the statistical uncertainty that surrounds estimated values, and that this can lead to the appearance of trends or patterns that may not in fact exist.
He was looking for a way to visually illustrate this uncertainty: to give people a sense of the range of possible values that a set of estimates can represent, and to show how misleading a single set of values can be when it is presented as if there were nothing uncertain about it. For his column, the BBC produced a Flash animation that illustrated the concept of uncertainty, but Michael said he was still looking for a way to reflect more directly the scale of the potential variation in a given set of data.
One way to visually describe the uncertainty surrounding a set of estimates is to use error bars. The following chart shows estimates of net inward migration to the UK based on the International Passenger Survey (IPS) over the last fifteen years. The dark blue bars represent the central estimate of net migration in each calendar year, while the light blue bars indicate the margin of error surrounding the estimate.
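The margin of error reflects the 95% confidence interval for the estimate. Strictly speaking, this means that if the survey were repeated many times, around 95% of the intervals calculated in this way would contain the actual value and around 5% would miss it; loosely, you can be fairly confident that the actual value lies within the range shown by the error bar. The size of the error bar depends largely on the size of the sample on which the estimate is based: the smaller the sample, the wider the interval.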
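As a rough sketch of the arithmetic behind an error bar, assuming the estimate is approximately normally distributed, the margin of error is about 1.96 standard errors either side of the central estimate. The function name and the figures below are hypothetical and are not taken from the IPS.

```javascript
// Hypothetical helper: the name and the example numbers are illustrative only.
const Z_95 = 1.96; // standard normal quantile for a two-sided 95% interval

// Returns the bounds of an approximate 95% confidence interval,
// assuming the estimate is roughly normally distributed.
function confidenceInterval(estimate, standardError) {
  const margin = Z_95 * standardError;
  return { lower: estimate - margin, upper: estimate + margin, margin };
}

// Made-up example: an estimate of 200 (thousand) with a standard error of 20.
const ci = confidenceInterval(200, 20);
console.log(ci);
```

Halving the standard error in this sketch halves the width of the error bar, which is the sense in which a larger sample gives a tighter interval.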
One weakness of error bars is that because the upper and lower bounds of the confidence intervals show a similar trend to the central estimates, they tend to visually reinforce the pattern already apparent in the chart, when their purpose is to illustrate to what extent the chart might have shown a different pattern.
Error bars may be the best you can do on a printed chart, but the web creates more interesting possibilities. An alternative approach is to animate the uncertainty in the chart to show how else it might have looked, given the margin of error associated with each value. You can see this idea in action in this visualisation.
Try it for a little bit, then come back when you're done.
How it works
At the start, the chart shows the actual IPS estimate of net migration in each year represented as a dark blue bar — just as in the static chart above. When you click on the chart, the code randomly generates a new set of estimates based on the observed value for each year and its associated margin of error. It then transitions to the new data, waits one and a half seconds, and does the whole thing again. The randomised estimates are shaded light blue to distinguish them from the observed estimates. The loop continues until you click on the chart again and it returns to its initial state.
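The click-to-toggle loop described above could be sketched roughly as follows. This is not the chart's actual d3 code: createUncertaintyLoop, randomise and render are hypothetical names, the drawing and transition are left to whatever render function is passed in, and the scheduler is injectable only so that the loop can be exercised without real timers.

```javascript
// Hypothetical sketch of the loop; not the original d3 implementation.
function createUncertaintyLoop({
  observed,               // the observed estimates shown at the start
  randomise,              // function: observed -> a new set of randomised estimates
  render,                 // function: (values, kind) -> draws or transitions the chart
  delayMs = 1500,         // pause between randomised charts (one and a half seconds)
  schedule = setTimeout,  // injectable for testing
  cancel = clearTimeout,
}) {
  let running = false;
  let timer = null;

  function step() {
    if (!running) return;
    render(randomise(observed), "randomised"); // e.g. shaded light blue
    timer = schedule(step, delayMs);           // wait, then do it all again
  }

  // Intended to be called from the chart's click handler.
  function toggle() {
    running = !running;
    if (running) {
      step();
    } else {
      cancel(timer);
      timer = null;
      render(observed, "observed");            // return to the initial state
    }
    return running;
  }

  render(observed, "observed"); // initial state: the observed estimates
  return { toggle };
}
```

In a page, toggle would be attached to the chart's click event, and render would hand the new values to the charting code along with the colour to use.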
In each randomised version of the chart, the values represent what the estimate of net migration could also have been in each year, given the margin of error. The randomised values are drawn from probability distributions that reflect the degree of uncertainty associated with each estimate, based on the sample size. Over the long run, the values for each year will tend to cluster around the observed estimate for that year, and they stray from it only as far, and as often, as the margin of error allows. Each randomised chart therefore represents an alternative version of the original chart that shows how else it might have looked, given a different set of random samples from the same populations.1
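A minimal sketch of the sampling step, following the description in note 1: each value is a normal random variable whose mean is the observed estimate and whose standard deviation is the standard error implied by the 95% confidence interval. The function names, the Box-Muller generator and the figures are all assumptions made for illustration; they are not the chart's actual code or real IPS data.

```javascript
// Hypothetical sketch; names and data are illustrative assumptions.
const Z_95 = 1.96; // standard normal quantile for a two-sided 95% interval

// Standard error implied by a published 95% confidence interval.
function impliedStandardError(lower, upper) {
  return (upper - lower) / (2 * Z_95);
}

// One draw from a standard normal distribution (Box-Muller transform).
function standardNormal(random = Math.random) {
  const u1 = 1 - random(); // in (0, 1], so Math.log(u1) is finite
  const u2 = random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// One alternative version of the chart: an independent draw for each year.
function randomisedEstimates(observed, random = Math.random) {
  return observed.map(({ year, estimate, lower, upper }) => ({
    year,
    estimate: estimate + impliedStandardError(lower, upper) * standardNormal(random),
  }));
}

// Made-up figures in thousands for two unnamed years; not actual IPS data.
const observed = [
  { year: "A", estimate: 200, lower: 160, upper: 240 },
  { year: "B", estimate: 180, lower: 145, upper: 215 },
];
console.log(randomisedEstimates(observed));
```

Each call produces a different set of values. Because every year is drawn separately, the sketch relies on the estimates being independent of one another. In a d3 project, the randomNormal generator from the d3-random module could stand in for the hand-written standardNormal.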
The net migration estimates in the chart are the unadjusted IPS estimates of long-term international migration.2 These differ slightly from the final adjusted migration estimates, which are the figures you see quoted in the press. The final estimates are based on the IPS estimates, but they are adjusted to take account of certain types of migration that the IPS doesn't pick up, such as asylum seekers, people migrating for longer or shorter than they thought they would, and migration over land to and from Northern Ireland. Because it is difficult to quantify the uncertainty around these adjustments, confidence intervals can only be properly calculated for the survey-based estimates, which is why I have used them here.
This method of representing the uncertainty around estimated values is only valid if the estimates are independent of one another. IPS estimates of net migration in discrete twelve-month periods are based on independent samples, so they can be represented in this way. It would not be valid to do the same thing using the estimates of net migration for the years ending each subsequent quarter, because those rolling twelve-month periods overlap and so consecutive estimates share much of the same sample.
The chart was built using d3. The starting point was example code from Scott Murray's wonderful book on d3 for data visualisation. This is my first attempt at using d3 and the code is tailored to this specific example. It does not generalise well to other data in its current state — it can't represent negative values, for example. My aim is to extend this code into a general-purpose reusable chart for displaying uncertain data that is normally distributed, based on independent samples, with known standard errors.
1. For geeks, the randomised values are normal random variables with a mean equal to the observed estimate and a standard deviation equal to the standard error implied by the published 95% confidence interval for the observed estimate.