Statistical Data Analysis
Both admin-recorded data and the honor system have the same flaw: invalid data points. Simply adding or removing one digit of
a participant’s data can drastically affect the outcome of the challenge. Therefore, a system of validation should be put in
place to check the reasonableness of every entry. A simple and effective method for performing this is to use standard deviation.
Standard deviation is used to judge whether a particular data point falls within an ordinary range. If the data is roughly
normally distributed, about 95% of values fall within two standard deviations of the mean, so a value inside that range is
probably valid and a value outside it deserves a closer look.
The first step in finding the standard deviation is finding the mean. To determine the mean, add all of the data points and
then divide by the number of data points.
E.g. for a given set of steps walked in a day (1000, 3000, 4000, 5000, 5000, 11000), the mean is:
Mean = (1000 + 3000 + 4000 + 5000 + 5000 + 11000) / 6 = 4833
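The mean calculation above can be sketched in a few lines of Python. This is only an illustration; the variable names `steps` and `mean` are assumptions chosen for this example and do not come from any particular challenge system.

```python
# Daily step counts from the example above.
steps = [1000, 3000, 4000, 5000, 5000, 11000]

# Mean: add all of the data points, then divide by the number of data points.
mean = sum(steps) / len(steps)

print(round(mean))  # 4833
```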
Next, compute the variance: subtract the mean from each data point, square each difference, and then average the squared differences.
Variance = ((1000 - 4833)² + (3000 - 4833)² + (4000 - 4833)² + (5000 - 4833)² + (5000 - 4833)² + (11000 - 4833)²) / 6
= (14691889 + 3359889 + 693889 + 27889 + 27889 + 38031889) / 6
= 9472222
Finally, to compute the standard deviation, take the square root of the variance:
Standard Deviation = √9472222 = 3078
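As a rough sketch, the variance and standard deviation steps look like this in Python. It divides by the number of data points, as the worked example does, which is the "population" form; the cross-check uses `statistics.pstdev` from the Python standard library, which computes the same quantity. The variable names are illustrative assumptions.

```python
import math
import statistics

steps = [1000, 3000, 4000, 5000, 5000, 11000]
mean = sum(steps) / len(steps)

# Variance: average of the squared differences from the mean.
squared_diffs = [(x - mean) ** 2 for x in steps]
variance = sum(squared_diffs) / len(steps)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance))  # 9472222
print(round(std_dev))   # 3078

# Cross-check against the standard library's population standard deviation.
print(math.isclose(std_dev, statistics.pstdev(steps)))  # True
```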
Now that you have the standard deviation, you can use it to compute the lower and upper bounds of the range you consider
ordinary. For a range of one standard deviation, subtract the standard deviation from the mean for the lower bound and add
the standard deviation to the mean for the upper bound. For example:
Lower Bound = 4833 – 3078 = 1755
Upper Bound = 4833 + 3078 = 7911
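A minimal Python sketch of the same bounds, assuming the population standard deviation used in the worked example. Because it works with unrounded values, the lower bound comes out near 1755.6 rather than exactly 1755; the names `lower`, `upper`, and `outside` are illustrative.

```python
import statistics

steps = [1000, 3000, 4000, 5000, 5000, 11000]

mean = statistics.mean(steps)
std_dev = statistics.pstdev(steps)  # population standard deviation (divide by N)

# One standard deviation on either side of the mean.
lower = mean - std_dev
upper = mean + std_dev

# Data points that fall outside the one-standard-deviation range.
outside = [x for x in steps if x < lower or x > upper]

print(f"{lower:.1f} to {upper:.1f}")  # 1755.6 to 7911.0
print(outside)                        # [1000, 11000]
```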
In a normal distribution, about 68% of all values fall within one standard deviation of the mean. In our example, both the 1000
data point and the 11000 data point fall outside of one standard deviation. If we check every outlier that is reported in our
fitness challenge and roughly 32% of entries are flagged as outliers, we are in for a lot of work. Instead, we should try two
standard deviations, which cover about 95% of values in a normal distribution. To calculate the upper and lower bounds with two
standard deviations, simply multiply the standard deviation by two (using the unrounded value, 2 × 3077.7 ≈ 6155):
Lower Bound = 4833 – 6155 = -1322
Upper Bound = 4833 + 6155 = 10988
Now, only the 11000 data point falls outside the range, and only barely, so it is the one entry that should be checked. If you
are considering thousands of data points, you may even want to consider using three standard deviations, which would cover over
99% of values in a normal distribution.
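The choice between one, two, or three standard deviations can be expressed as a single parameter. The sketch below is one possible way to do it; `flag_outliers` and its parameter `k` are hypothetical names invented for this example, and it assumes the population standard deviation used throughout this section.

```python
import statistics

def flag_outliers(values, k=2.0):
    """Return the values more than k population standard deviations from the mean."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)
    lower = mean - k * std_dev
    upper = mean + k * std_dev
    return [v for v in values if v < lower or v > upper]

steps = [1000, 3000, 4000, 5000, 5000, 11000]

print(flag_outliers(steps, k=1))  # [1000, 11000]
print(flag_outliers(steps, k=2))  # [11000]
print(flag_outliers(steps, k=3))  # []
```

A larger `k` flags fewer entries for manual review, at the cost of letting more questionable values through unchecked.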
Doing this by hand would require considerable work. Fortunately, spreadsheets can accomplish this with much less effort.
Challenge management systems should also provide this analysis automatically. A sample report from ChallengeRunner.com appears