2 Sep 2013 17:35 GMT
Three months ago I started running. Or at least I started running properly. I had been managing to run once a week since my daughter was born in 2010, but it wasn't enough to offset the inevitable slide into decrepitude that begins in your early thirties and accelerates when you become a parent. So since May this year I have been running at least three times a week, and every other day if possible.
Being a data geek, I started tracking my runs using a smartphone app and recording the data in a spreadsheet. The chart below shows my overall pace in seconds per kilometre for every run I've been on since I started running more frequently. These runs were all on routes between 3.8 and 5.7 km, with a mean running distance of 4.6 km per run, so some of the variation in pace may be explained by differences in route length. However, as I generally try to mix shorter and longer routes, this shouldn't affect comparisons over the longer term.
So how am I doing? (Relatively speaking. I know these aren't great stats to someone who is really serious about running.) There has been some improvement, but the variation in pace from one run to the next makes it difficult to get a clear picture of the underlying trend. I wanted to find a good way of representing my overall progress, taking the variation into account.
Regressive compulsive disorder
When you have a scatterplot of points like this, there is a strong temptation to draw the line of best fit and see where the trend line is pointing. But in this case it really wouldn't make much sense. When you carry out a simple linear regression you are not so much exposing a linear relationship that exists between two variables as estimating the characteristics of their linear relationship, assuming that such a relationship does in fact exist.1
But there is no reason to believe that a person's average speed over a particular distance would increase linearly. If it did, then based on the above trend I would achieve light speed in around April 2016. And even if I could fit the data to a more realistic model of how a runner's pace develops, I'm not trying to predict my future performance assuming that the current trend continues; I'm trying to measure how I'm doing right now, in a way that reflects whether my overall performance is getting better or worse.
Rolling an average
The easiest way to smooth out some of the variation in the series is to take a simple rolling average of the pace over the last few runs. But there is a tension in this approach between filtering out the noise and reflecting the most recent trend. The more observations you include in the average, the more the variation is reduced; but the oldest observation in the calculation has the same weight as the most recent observation, so the more data you include in the rolling average, the less it reflects your current performance. Ideally, you want an average in which each observation is weighted according to its age, so the most recent observations carry the most weight, while older observations still weigh enough to help reduce the variation in the series and reveal the underlying trend. This is what exponential smoothing does.
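For comparison, the simple rolling average described above can be sketched in a few lines of Python. This is only an illustration: the function name `rolling_mean` is my own, and the pace values are invented for the example rather than taken from the chart.

```python
def rolling_mean(values, window):
    """Return the simple rolling mean of `values` over `window` observations.

    Every observation inside the window gets the same weight, 1 / window,
    whether it is the newest or the oldest.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical paces in seconds per kilometre (illustrative only).
paces = [330, 342, 325, 338, 320, 331, 318]

print(rolling_mean(paces, 3))  # shorter window: follows recent runs, noisier
print(rolling_mean(paces, 5))  # longer window: smoother, slower to react
```

Widening the window reduces the noise, but the run from five outings ago then counts exactly as much as today's.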
Exponential smoothing is a way of averaging time series data that weights each observation from the newest to the oldest using an exponential decay function. If you don't already know what that means, don't worry, because you can get the result with surprisingly simple maths.2
Here's how it works. You have a time series of observations, like the running data above. We'll call this the observed series. You are going to create a corresponding time series of smoothed data. We'll call this the smoothed series. To begin, set the first value in the smoothed series to the first value in the observed series.3 Every subsequent smoothed value is then calculated by adding a percentage of the last observed value to a percentage of the last smoothed value, so that the percentages add up to a hundred. The actual equation is therefore:
next smoothed = A * last observed + (1 - A) * last smoothed
Where A (or alpha) is the smoothing factor, which is a number between 0 and 1. Counterintuitively, a lower smoothing factor leads to more smoothing and a higher smoothing factor leads to less smoothing. A smoothing factor of 1 means that 100% of each smoothed value is determined by the last observed value, so the smoothed series is identical to the observed series, just shifted in time by one value. A smoothing factor of 0 means that 0% of each smoothed value is determined by the last observed value, so the smoothed series never changes from its initial value. The smoothing factor therefore determines how much weight is placed on older and newer values of the observed series in calculating the smoothed value at each point.
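The calculation can be sketched in Python as follows. Treat it as a minimal illustration rather than a definitive implementation: the function name `exponential_smooth` is my own choice, the series is initialised in the simple way described above, and the pace values are invented for the example.

```python
def exponential_smooth(observed, alpha):
    """Smooth `observed` using the update rule given above.

    The first smoothed value is set to the first observed value. Each later
    value is alpha * (last observed) + (1 - alpha) * (last smoothed).
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    if not observed:
        return []
    smoothed = [observed[0]]
    for t in range(1, len(observed)):
        smoothed.append(alpha * observed[t - 1] + (1 - alpha) * smoothed[t - 1])
    return smoothed


# Hypothetical paces in seconds per kilometre (illustrative only).
paces = [330, 342, 325, 338, 320, 331, 318]

print(exponential_smooth(paces, 1))    # the observed series, shifted by one value
print(exponential_smooth(paces, 0))    # never moves from the initial value
print(exponential_smooth(paces, 0.2))  # somewhere in between
```

The first two calls reproduce the two extreme cases: a smoothing factor of 1 simply copies the previous observation, and a smoothing factor of 0 repeats the initial value forever.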
When deciding how to set the smoothing factor you might think it's a good idea to aim for the middle of the range and choose a smoothing factor of 0.5, hoping to get an even balance between more and less recent observations. But this may not give you the result you expect. The chart below shows the observed series of running data overlaid with a smoothed series constructed using a smoothing factor of 0.5.
The smoothed series gives a better sense of the underlying trend than the raw data, but it's still pretty noisy. This is because the exponential nature of the weighting in the smoothed series means that a smoothing factor of 0.5 still places a lot of weight on the most recent observations. To illustrate what's going on, the following chart shows the composition of ten consecutive values in the smoothed series, expressed as a percentage of the most recent smoothed value. (The most recent smoothed value is number 10.)
With a smoothing factor of 0.5 almost 90% of the most recent smoothed value is determined by the previous three observed values. Now see what happens when the smoothing factor is set to 0.2.
With a smoothing factor of 0.2 only around 50% of the most recent smoothed value is determined by the previous three observed values, and around 90% is determined by the previous ten observed values. This weighting gives a much clearer picture of the underlying trend in the running data.
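These percentages follow directly from the smoothing equation. If you expand it repeatedly, the most recent observation gets a weight of A, the one before it A * (1 - A), the one before that A * (1 - A)^2, and so on. Here is a small Python sketch that adds the weights up; the function name `observation_weights` is just an illustrative choice.

```python
def observation_weights(alpha, n):
    """Weights of the n most recent observations, newest first.

    The k-th most recent observation (counting from zero) contributes
    alpha * (1 - alpha) ** k to the latest smoothed value.
    """
    return [alpha * (1 - alpha) ** k for k in range(n)]


for alpha in (0.5, 0.2):
    weights = observation_weights(alpha, 10)
    last_three = sum(weights[:3])
    last_ten = sum(weights)
    print(f"A = {alpha}: last 3 = {last_three:.1%}, last 10 = {last_ten:.1%}")
```

With a smoothing factor of 0.5 the last three observations account for 87.5% of the smoothed value; with 0.2 they account for about 48.8%, and the last ten for about 89.3%. Whatever weight is left over belongs to older observations and, ultimately, to the initial smoothed value.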
When using exponential smoothing for simple descriptive purposes like this, there is no “correct” smoothing factor as such; it simply controls how much emphasis the smoothing places on more recent observations.4 In that sense, choosing the smoothing factor still involves making a judgement about the nature of the data: deciding how much of the variation you want to preserve from one data point to the next. But by comparing values within a given smoothed series it becomes easier to see the longer-term trends in the data.
1. There is much more to linear regression than this, but the point here is that you are choosing to model the data using a linear relationship.
2. And if you do already know what it means, you don't really need to read this. Go and read something more challenging.
3. Strictly speaking, you're setting the second value in the smoothed series to the first value in the observed series, but this follows inevitably from the smoothing algorithm if you set the first smoothed value in this way. Note however that this is not the only way the smoothed series can be initialised. It will do for these purposes, but bear in mind that the lower the smoothing factor, the greater the influence of the initial smoothed value on the whole series; so how you choose it matters.
4. For technical applications, such as forecasting, there are methods for choosing the best smoothing factor based on the data in the series. The NIST/SEMATECH e-Handbook of Statistical Methods has a short and clear explanation of one of the most common.
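As a rough illustration of footnote 4, one widely used approach is to try a range of smoothing factors and keep the one that gives the smallest mean squared error between each smoothed value and the observation that follows it. The Python sketch below assumes the same simple initialisation used in this post; the function names, the candidate values and the example series are all made up for the illustration.

```python
def exponential_smooth(observed, alpha):
    """Smoothed series: first value copied, then the usual update rule."""
    smoothed = [observed[0]]
    for t in range(1, len(observed)):
        smoothed.append(alpha * observed[t - 1] + (1 - alpha) * smoothed[t - 1])
    return smoothed


def mean_squared_error(observed, alpha):
    """Mean squared gap between each observation and its smoothed value.

    The first point is skipped because it matches by construction.
    """
    if len(observed) < 2:
        raise ValueError("need at least two observations")
    smoothed = exponential_smooth(observed, alpha)
    errors = [(o - s) ** 2 for o, s in zip(observed[1:], smoothed[1:])]
    return sum(errors) / len(errors)


def best_alpha(observed, candidates):
    """Return the candidate smoothing factor with the smallest error."""
    return min(candidates, key=lambda a: mean_squared_error(observed, a))


# Hypothetical series that bounces around a steady level (illustrative only).
noisy = [10, 12, 8, 12, 8, 12, 8]
print(best_alpha(noisy, [0.1, 0.5, 1.0]))
```

For a series that just bounces around a steady level, a low smoothing factor wins; for a series with a strong trend, a high one does. That is a different goal from the descriptive use in this post, where the choice is a matter of judgement.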