Visualising trends in running performance

2 Sep 2013 17:35 GMT

Three months ago I started running. Or at least I started running properly. I had been managing to run once a week since my daughter was born in 2010, but it wasn't enough to offset the inevitable slide into decrepitude that begins in your early thirties and accelerates when you become a parent. So since May this year I have been running at least three times a week, and every other day if possible.

Being a data geek I started tracking my runs using a smartphone app and recording the data in a spreadsheet. The chart below shows my overall pace in seconds per kilometre for every run I've been on since I started running more frequently. These runs were all on routes between 3.8 and 5.7km, with a mean distance of 4.6km per run, so some of the variation in pace may be explained by differences in route length. However, as I generally try to mix shorter and longer routes, this shouldn't affect comparisons over the longer term.

Chart of running pace

So how am I doing? (Relatively speaking. I know these aren't great stats to someone who is really serious about running.) There has been some improvement, but the variation in pace from one run to the next makes it difficult to get a clear picture of the underlying trend. I wanted to find a good way of representing my overall progress, taking the variation into account.

Regressive compulsive disorder

When you have a scatterplot of points like this, there is a strong temptation to draw the line of best fit and see where the trend line is pointing. But in this case it really wouldn't make much sense. When you carry out a simple linear regression you are not so much exposing a linear relationship that exists between two variables as estimating the characteristics of their linear relationship, assuming that such a relationship does in fact exist.1

But there is no reason to believe that a person's average speed over a particular distance would increase linearly. If it did, then based on the above trend I would achieve light speed in around April 2016. And even if I could fit the data to a more realistic model of how a runner's pace develops, I'm not trying to predict my future performance assuming that the current trend continues; I'm trying to measure how I'm doing right now, in a way that reflects whether my overall performance is getting better or worse.

Rolling an average

The easiest way to smooth out some of the variation in the series is to take a simple rolling average of the pace over the last few runs. But there is a tension in this approach between filtering out the noise and reflecting the most recent trend. The more observations you include in the average, the more the variation is reduced; but the oldest observation in the calculation has the same weight as the most recent one, so the more data you include in the rolling average, the less it reflects your current performance. Ideally, you want an average in which each observation is weighted according to its age, so that the most recent observations carry the most weight, while older observations still weigh enough to help reduce the variation in the series and reveal the underlying trend. This is what exponential smoothing does.
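(For comparison, here is a minimal sketch of a plain rolling average in code; the function name and window size are mine, purely for illustration.)

    // A plain rolling average over the last `window` runs (fewer at the start
    // of the series, where a full window isn't available yet).
    function rollingAverage(observed, window) {
      return observed.map(function (d, i) {
        var start = Math.max(0, i - window + 1);
        var slice = observed.slice(start, i + 1);
        var sum = slice.reduce(function (a, b) { return a + b; }, 0);
        return sum / slice.length;
      });
    }

    // e.g. rollingAverage(paces, 5) averages each run's pace with the four runs before it.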

Exponential smoothing is a way of averaging time series data that weights each observation from the newest to the oldest using an exponential decay function. If you don't already know what that means, don't worry, because you can get the result with surprisingly simple maths.2

Here's how it works. You have a time series of observations, like the running data above. We'll call this the observed series. You are going to create a corresponding time series of smoothed data. We'll call this the smoothed series. To begin, set the first value in the smoothed series to the first value in the observed series.3 Every subsequent smoothed value is then calculated by adding a percentage of the last observed value to a percentage of the last smoothed value, so that the percentages add up to a hundred. The actual equation is therefore:

next smoothed = A * last observed + (1 - A) * last smoothed

Where A (or alpha) is the smoothing factor, which is a number between 0 and 1. Counterintuitively, a lower smoothing factor leads to more smoothing and a higher smoothing factor leads to less smoothing. A smoothing factor of 1 means that 100% of each smoothed value is determined by the last observed value, so the smoothed series is identical to the observed series, just shifted in time by one value. A smoothing factor of 0 means that 0% of each smoothed value is determined by the last observed value, so the smoothed series never changes from its initial value. The smoothing factor therefore determines how much weight is placed on older and newer values of the observed series in calculating the smoothed value at each point.
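In code, the whole procedure is only a few lines. Here's a minimal sketch of the recurrence in JavaScript; the function name and the sample pace values are made up for illustration.

    // A minimal sketch of the smoothing recurrence described above.
    // `observed` is an array of pace values in seconds per kilometre;
    // `alpha` is the smoothing factor, a number between 0 and 1.
    function exponentialSmooth(observed, alpha) {
      if (observed.length === 0) return [];
      var smoothed = [observed[0]];  // initialise with the first observed value
      for (var i = 1; i < observed.length; i++) {
        // next smoothed = A * last observed + (1 - A) * last smoothed
        smoothed.push(alpha * observed[i - 1] + (1 - alpha) * smoothed[i - 1]);
      }
      return smoothed;
    }

    // Illustrative pace values, not my actual running data.
    var paces = [330, 345, 322, 338, 329, 341, 318, 325];
    console.log(exponentialSmooth(paces, 0.2));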

When deciding how to set the smoothing factor you might think it's a good idea to aim for the middle of the range and choose a smoothing factor of 0.5, hoping to get an even balance between more and less recent observations. But this may not give you the result you expect. The chart below shows the observed series of running data overlaid with a smoothed series constructed using a smoothing factor of 0.5.

Chart of running pace with smoothing factor of 0.5

The smoothed series gives a better sense of the underlying trend than the raw data, but it's still pretty noisy. This is because the exponential nature of the weighting in the smoothed series means that a smoothing factor of 0.5 still places a lot of weight on the most recent observations. To illustrate what's going on, the following chart shows the composition of ten consecutive values in the smoothed series, expressed as a percentage of the most recent smoothed value. (The most recent smoothed value is number 10).

Chart of weights with smoothing factor of 0.5

With a smoothing factor of 0.5 almost 90% of the most recent smoothed value is determined by the previous three observed values. Now see what happens when the smoothing factor is set to 0.2.

Chart of weights with smoothing factor of 0.2

With a smoothing factor of 0.2 only around 50% of the most recent smoothed value is determined by the previous three observed values, and around 90% is determined by the previous ten observed values. This weighting gives a much clearer picture of the underlying trend in the running data.

Chart of running pace with smoothing factor of 0.2
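To check the figures just quoted: the weight carried by the observed value k steps back is A * (1 - A)^(k - 1), so the previous n observed values account for 1 - (1 - A)^n of the latest smoothed value. A quick sketch (mine, not the code behind the charts above):

    // Cumulative share of the latest smoothed value accounted for by
    // the previous n observed values: 1 - (1 - A)^n.
    function cumulativeWeight(alpha, n) {
      return 1 - Math.pow(1 - alpha, n);
    }

    console.log(cumulativeWeight(0.5, 3));   // ~0.875: almost 90% with a factor of 0.5
    console.log(cumulativeWeight(0.2, 3));   // ~0.488: around 50% with a factor of 0.2
    console.log(cumulativeWeight(0.2, 10));  // ~0.893: around 90% with a factor of 0.2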

When using exponential smoothing for simple descriptive purposes like this, there is no “correct” smoothing factor as such; it simply controls how much emphasis the smoothing places on more recent observations.4 In that sense, choosing the smoothing factor still involves making a judgement about the nature of the data: deciding how much of the variation you want to preserve from one data point to the next. But by comparing values within a given smoothed series it becomes easier to see the longer-term trends in the data.

Footnotes

1. There is much more to linear regression than this, but the point here is that you are choosing to model the data using a linear relationship.

2. And if you do already know what it means, you don't really need to read this. Go and read something more challenging.

3. Strictly speaking, it's the second value in the smoothed series that gets set to the first observed value, but that follows automatically from the recurrence once you set the first smoothed value in this way. Note, however, that this is not the only way the smoothed series can be initialised. It will do for these purposes, but bear in mind that the lower the smoothing factor, the greater the influence of the initial smoothed value on the whole series; so how you choose it matters.

4. For technical applications, such as forecasting, there are methods for choosing the best smoothing factor based on the data in the series. The NIST Handbook of Statistical Methods has a short and clear explanation of one of the most common.

Animating uncertainty

25 Sep 2013 19:27 GMT

A few years ago I was talking to Michael Blastland about uncertainty. If you don't know Michael, he's an author and journalist who has a particular gift for explaining statistics in a way that people who normally fear numbers not only understand, but even enjoy.

Michael had been writing an article for his Go Figure column on the BBC website about the representation of uncertainty in charts and tables. He was making the point that many of the methods used to present data in mainstream media do not represent the statistical uncertainty that surrounds estimated values, and that this can lead to the appearance of trends or patterns that may not in fact exist.

He was looking for a way to illustrate this uncertainty visually: to give people a sense of the range of possible values that a set of estimates can represent, and to show how misleading a single set of values can be when it is presented as if there were nothing uncertain about it. For his column, the BBC produced a Flash animation illustrating the concept, but Michael said he was still looking for a way to reflect more directly the scale of the potential variation in a given set of data.

One way to visually describe the uncertainty surrounding a set of estimates is to use error bars. The following chart shows estimates of net inward migration to the UK based on the International Passenger Survey (IPS) over the last fifteen years. The dark blue bars represent the central estimate of net migration in each calendar year, while the light blue bars indicate the margin of error surrounding the estimate.

Estimates of net migration to the UK from 1998 to 2012 based on the International Passenger Survey

The margin of error reflects the 95% confidence interval for the estimate, which means there is a 95% chance that the actual value is within the range shown by the error bar and a 5% chance that it is outside this range. The size of the error bar is determined by the size of the sample on which the estimate is based.

One weakness of error bars is that because the upper and lower bounds of the confidence intervals show a similar trend to the central estimates, they tend to visually reinforce the pattern already apparent in the chart, when their purpose is to illustrate to what extent the chart might have shown a different pattern.

Error bars may be the best you can do on a printed chart, but the web creates more interesting possibilities. An alternative approach is to animate the uncertainty in the chart to show how else it might have looked, given the margin of error associated with each value. You can see this idea in action in this visualisation.

Try it for a little bit, then come back when you're done.

How it works

At the start, the chart shows the actual IPS estimate of net migration in each year represented as a dark blue bar — just as in the static chart above. When you click on the chart, the code randomly generates a new set of estimates based on the observed value for each year and its associated margin of error. It then transitions to the new data, waits one and a half seconds, and does the whole thing again. The randomised estimates are shaded light blue to distinguish them from the observed estimates. The loop continues until you click on the chart again and it returns to its initial state.
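In outline, the loop looks something like the sketch below. This is a simplified paraphrase rather than the actual chart code: `observed`, `yScale`, `chartHeight`, the `rect.bar` selector and the `#chart` container are all placeholder names, and `randomNormal` is the polar-method generator described under "Programming issues" below.

    // Rough sketch of the animation loop (not the real chart code).
    var timer = null;

    function randomise() {
      // Draw a new plausible estimate for each year from its sampling distribution.
      var randomised = observed.map(function (d) {
        return randomNormal(d.estimate, d.halfWidth / 1.96);
      });

      d3.selectAll("rect.bar")
        .data(randomised)
        .transition()
        .duration(750)
        .attr("y", function (d) { return yScale(d); })
        .attr("height", function (d) { return chartHeight - yScale(d); })
        .style("fill", "lightblue");
    }

    d3.select("#chart").on("click", function () {
      if (timer) {
        clearInterval(timer);   // second click: stop the loop
        timer = null;           // (restoring the observed bars is omitted here)
      } else {
        randomise();
        timer = setInterval(randomise, 1500);  // repeat every one and a half seconds
      }
    });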

In each randomised version of the chart, the values represent what the estimate of net migration could also have been in each year, given the margin of error. The randomised values are drawn from probability distributions that reflect the degree of uncertainty associated with each estimate, based on the sample size, so over the long term the values for each year will tend to cluster around the observed estimate for that year, straying from it only as often and as far as the margin of error allows. Each randomised chart therefore represents an alternative version of the original: a picture of how it might have looked, given a different set of random samples from the same populations.1

Statistical issues

The net migration estimates in the chart are the unadjusted IPS estimates of long-term international migration.2 These differ slightly from the final adjusted migration estimates, which are the figures you see quoted in the press. The final estimates are based on the IPS estimates, but they are adjusted to take account of certain types of migration that the IPS doesn't pick up, such as asylum seekers, people migrating for longer or shorter than they thought they would, and migration over land to and from Northern Ireland. Because it is difficult to quantify the uncertainty around these adjustments, confidence intervals can only be properly calculated for the survey-based estimates, which is why I have used them here.

This method of representing the uncertainty around estimated values is only valid if the estimates are independent of one another. IPS estimates of net migration in discrete twelve-month periods are based on independent samples, so they can be represented in this way. It would not be valid to do the same thing with the estimates for the rolling years ending in each successive quarter, because those periods overlap and so share sample data.

Programming issues

The chart was built using d3. The starting point was example code from Scott Murray's wonderful book on d3 for data visualisation. This is my first attempt at using d3 and the code is tailored to this specific example. It does not generalise well to other data in its current state — it can't represent negative values, for example. My aim is to extend this code into a general-purpose reusable chart for displaying uncertain data that is normally distributed, based on independent samples, with known standard errors.

Normal random variables are generated in JavaScript using the polar method. This method was suggested to me by Adam Hyland, who also provided lots of useful material on generating random numbers in JavaScript, including this great article and this seedable RNG, plus documentation.
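For reference, a typical JavaScript implementation of the polar method looks something like this; it's my own sketch of the standard algorithm rather than the chart's exact code.

    // Marsaglia polar method: turns pairs of uniform random numbers into
    // normally distributed ones, rejecting pairs outside the unit circle.
    function randomNormal(mean, sd) {
      var u, v, s;
      do {
        u = Math.random() * 2 - 1;   // uniform on (-1, 1)
        v = Math.random() * 2 - 1;
        s = u * u + v * v;
      } while (s >= 1 || s === 0);   // reject points outside the unit circle
      var factor = Math.sqrt(-2 * Math.log(s) / s);
      // u * factor is a standard normal deviate; v * factor would give a
      // second, independent one, discarded here for simplicity.
      return mean + sd * u * factor;
    }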

Acknowledgements

Many thanks to Michael Blastland and David Spiegelhalter for their suggestions and feedback, and to Adam Hyland for his statistical computational origami.

Footnotes

1. For geeks, the randomised values are normal random variables with a mean equal to the observed estimate and a standard deviation equal to the standard error implied by the published 95% confidence interval for the observed estimate.
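In other words, since a 95% confidence interval for a normally distributed estimate spans roughly plus or minus 1.96 standard errors, the standard deviation used for each year is just the published half-width of the interval divided by 1.96. As a sketch (the figures are illustrative, not taken from the IPS data):

    // Standard error implied by a published 95% confidence interval.
    function impliedStandardError(ciHalfWidth) {
      return ciHalfWidth / 1.96;
    }

    // e.g. an estimate quoted as 180,000 +/- 35,000 implies a standard
    // error of roughly 17,900 (illustrative numbers only).
    console.log(impliedStandardError(35000));  // ~17857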

2. The data sources for the chart are the following ONS statistical releases: Long-Term International Migration 2011, Migration Statistics Quarterly Report August 2013.