In the last post we discussed how to identify the ‘What’, ‘Where’ and ‘How’ of your optimization efforts. In this blog post, we are going to focus on the analysis portion of optimization and talk about some of the common statistical pitfalls you could face and how you can avoid them.
There are an overwhelming number of issues you can encounter in your digital optimization journey, but today we will focus on four of the most common mistakes:
1. Cumulative Averages (Anscombe's Quartet)
2. Unrepresentative Time Frame
3. Continuous Monitoring
4. Multiplicity or Multiple Testing Problem
Cumulative Averages (Anscombe's Quartet)
Most optimization tools give you a way to exclude outliers when you are dealing with cumulative averages such as average order value, but what if you are simply trying to increase your lead generation conversion rate? In that case the tool is looking at the aggregate conversion rate and does not take the conversion rate per day into account. Normally this wouldn’t be an issue, but web data is not always clean and errors in reporting are always a possibility. The problem with cumulative averages is that they can hide outliers and errors within your test results. The most elegant illustration of this is Anscombe’s Quartet, a group of four datasets that share the same summary statistics yet tell four different stories when graphed.
You can see in the figure above that, while these datasets all produce the same summary statistics, the differences clearly emerge once they are visually represented.
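If you want to check this for yourself, here is a quick Python sketch using the published Anscombe values (NumPy is the only dependency); the numbers are the textbook ones, not anything from the test scenario below.

```python
# Anscombe's quartet: four small datasets whose summary statistics are
# (nearly) identical even though they look completely different when plotted.
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.array(x), np.array(y)
    print(f"{name}: mean(y)={y.mean():.2f}  var(y)={y.var(ddof=1):.2f}  "
          f"corr(x, y)={np.corrcoef(x, y)[0, 1]:.3f}")
# All four print roughly 7.50, 4.13 and 0.816 -- plot them to see how different they really are.
```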
A more real-world example of this would involve an error with data capture for one of your test variations. Say you are running a test in which you are trying to improve your lead generation rate. You let the test run for the appropriate amount of time, but there is no difference between the two variations. Is the test inconclusive? Should you go back to the drawing board? Well, let’s take a look at the trended results:
When you view the trended daily conversion rate, you can see that there may have been an issue with conversion tracking in the middle of the test. Before the error presented itself and after it was resolved, version B was outperforming version A. However, because of the tracking issue, the cumulative averages came out the same. In this instance it is best to test again; even though B was consistently better, you don’t have the requisite data to make a decision. When noise gets introduced into your test, it is always best to simply re-test to make sure your results are reliable.
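Here is a minimal pandas sketch of how you might pull that trended view yourself rather than relying only on the cumulative report; the file name and column names ("date", "variation", "visitors", "conversions") are hypothetical placeholders for whatever your tool exports.

```python
# Compare the cumulative conversion rate your tool reports with the
# day-by-day rate per variation, where tracking gaps become visible.
import pandas as pd

df = pd.read_csv("test_results_by_day.csv")  # hypothetical daily export

daily = (df.groupby(["date", "variation"])[["visitors", "conversions"]]
           .sum()
           .assign(daily_cr=lambda d: d["conversions"] / d["visitors"])
           .reset_index())

cumulative = (df.groupby("variation")[["visitors", "conversions"]]
                .sum()
                .assign(cumulative_cr=lambda d: d["conversions"] / d["visitors"]))

print(cumulative)  # the single number the tool shows you
print(daily.pivot(index="date", columns="variation", values="daily_cr"))
# Plotting the pivoted daily rates is where a mid-test tracking gap
# like the one described above becomes obvious.
```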
Unrepresentative Time Frame
The next statistical pitfall to avoid is error introduced by not selecting a representative time frame for your test. There are a few ways this type of error can creep in. One is selecting a time frame when your company is planning a media push (online or offline). These kinds of media blasts can bring in an influx of new visitors whose behavior differs from that of your normal users, skewing the test’s results. For example, say your company is planning a big TV buy during the Super Bowl. This could drive a large wave of traffic to your site that behaves completely differently from your normal traffic, and that temporary traffic infusion could end up tilting your test results in favor of a variation that actually performs worse for your normal users.
Another example of selecting an unrepresentative time frame (and probably the most common one) is running a test for only a handful of days. Even for sites that have enough traffic to support a quick test, this is not always a good idea, as your site’s visitors most likely behave differently depending on when they come to your site. For example, visitors who frequent a site during the work week, possibly while at work, might simply scan your site quickly looking for deals or new content, while visitors who come on weekends, when they have more time, might spend longer reading through your site. This is why it is always recommended to run your tests for a minimum of two weeks. This might not work for every test, like an email landing page that gets 90% of its traffic in the first 3-4 days, but for the most part it is a good rule of thumb to follow.
Continuous Monitoring
Another common error many testing professionals fall into is continuous monitoring. Now, I’m not saying that you shouldn’t monitor how a test is progressing… in fact, I would recommend reviewing your test results at least weekly (more frequent monitoring is even better). What you should avoid is making a decision on the test before your traffic and time thresholds have been reached. Monitoring your test results to ensure everything is moving along as expected (i.e. traffic allocation is correct, conversions are incrementing, nothing strange is showing up in the data, etc.) is a great practice, but watching your tests to see when the tool says it has reached significance so you can declare a winner and move on is not recommended.
Why shouldn’t you pull your tests as soon as your tool states statistical significance has been reached? That is a great question! One reason was already outlined in the previous section on unrepresentative time frames. If your tool says you’ve reached statistical significance on Wednesday when you only launched the test on Monday, it could be giving you a false reading. If your website’s user behavior differs on weekends, pulling the test before you get traffic from weekend users could cause you to miss a very valuable cross-section of your traffic. If you let your test run longer, you may find that one version works better on weekends, giving you insight into how you can get an even better conversion rate by targeting your web content based on the day of the week.
Another reason not to blindly follow your testing tool’s significance calculation has to do with the law of the iterated logarithm. In short, this law implies that if you keep checking for significance over an infinite amount of time, two identical test variations will falsely declare a winner with 100% probability. This is a very important point to understand, so let’s dive into it a bit deeper.
Say you are running an A/A test, a test in which you split your site’s traffic into two groups that receive the exact same experience. If you were to check the statistical significance of this test on a daily basis and let it run for an infinite amount of time, you would be 100% likely to see one version beat the other. Let me state that again: a test of the exact same experiences is 100% likely to declare a statistically significant winner at least once during its run, given enough time.
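If that sounds far-fetched, a quick simulation makes the point (it is an illustration of the peeking problem, not a proof of the theorem). The traffic numbers below (1,000 visitors per day per variation, a 5% conversion rate, 28 days) are made-up assumptions.

```python
# Simulate many A/A tests, checking a two-proportion z-test once per day,
# and count how often a "winner" is declared at least once.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
days, visitors_per_day, p = 28, 1000, 0.05   # assumed traffic and conversion rate
runs, false_winners = 2000, 0

for _ in range(runs):
    conv_a = rng.binomial(visitors_per_day, p, size=days).cumsum()
    conv_b = rng.binomial(visitors_per_day, p, size=days).cumsum()
    n = visitors_per_day * np.arange(1, days + 1)        # cumulative visitors per arm
    pa, pb = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (pa - pb) / se
    pvals = 2 * norm.sf(np.abs(z))                       # two-sided p-value each day
    if (pvals < 0.05).any():                             # "declare a winner" on any peek
        false_winners += 1

print(f"Declared a (false) winner at least once: {false_winners / runs:.1%}")
# With daily peeking this lands well above the nominal 5%, and it only
# grows the longer the identical variations are allowed to run.
```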
The way to avoid inflating your error rate is to calculate your traffic threshold prior to launching your test. There are a variety of calculators online to help you do this, and it should be ingrained in your digital optimization process. Some tools on the market today counteract the inflation of Type I error rates due to continuous monitoring (Optimizely, for one) by imposing stricter measures based on a statistical framework known as sequential testing. While this helps counteract false positives due to random fluctuations in responses, it cannot account for fluctuations due to an unrepresentative time frame (i.e. running a test for a few days during the week and ending it before the weekend).
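Calculating that threshold does not require anything fancy; the standard two-proportion sample size formula gets you most of the way there. The sketch below assumes a 5% baseline conversion rate, a 10% relative lift you care about detecting, a two-sided 5% significance level and 80% power; swap in your own numbers.

```python
# Back-of-the-envelope traffic threshold per variation, fixed before launch.
from scipy.stats import norm

p1 = 0.05                 # baseline conversion rate (assumption)
p2 = p1 * 1.10            # smallest relative lift worth detecting (assumption)
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

n_per_variation = ((z_alpha + z_beta) ** 2 *
                   (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2

print(f"~{int(round(n_per_variation)):,} visitors per variation before calling the test")
```

Once you have that number, the rule is simple: no decision until each variation has seen its share of the traffic and the test has covered a representative stretch of the calendar.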
Multiplicity or Multiple Testing Problem
No, I am not talking about the 1990s rom-com starring Michael Keaton. The issue of multiplicity, or multiple testing, has to do with how the Type I error rate increases as the number of test variations and/or goals measured in the test increases. For example, say you are running a test on your homepage and set it up to measure account creations, white paper downloads, and video views. As you analyze the results, you notice that the new design improves video views at 95% confidence but does not impact account creations or white paper downloads. You then declare that the test variation will provide a lift in video views while not positively or negatively impacting accounts or downloads, right? Well, not exactly, at least not at 95% confidence.
Since you are essentially running one A/B test with 3 independent goals, the 95% confidence your tool declares applies to each goal on its own, not to the combination of all 3. As you add measures or test variations, the false positive (Type I error) rate also increases. So rather than being 95% confident across all three goals, you are really only about 86% confident: with three independent goals there is roughly a 14% chance (1 - 0.95^3) of at least one false positive.
Most enterprise testing solutions offer a way to correct for multiplicity: Adobe Target gives its users a sample size calculator that applies the Bonferroni correction, while Optimizely builds corrections directly into its Stats Engine, which controls for false discoveries rather than false positives. No matter which approach you take, it is important that you take one to control for this common statistical pitfall.
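Whichever tool you use, the arithmetic behind both the inflation and the Bonferroni fix is simple enough to check yourself; the sketch below uses the three-goal example from above.

```python
# Familywise error rate for multiple independent goals, and the Bonferroni fix.
alpha, goals = 0.05, 3

fwer = 1 - (1 - alpha) ** goals
print(f"Chance of at least one false positive across {goals} goals: {fwer:.1%}")  # ~14.3%

bonferroni_alpha = alpha / goals
print(f"Bonferroni-corrected per-goal alpha: {bonferroni_alpha:.4f}")  # test each goal at ~0.0167
print(f"Familywise error rate after correction: "
      f"{1 - (1 - bonferroni_alpha) ** goals:.1%}")  # back under 5%
```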
Conclusion
While we did not cover every potential statistical pitfall you will encounter in your optimization journey (that would be a much longer blog post), avoiding the four outlined above will go a long way toward keeping your test results closer to the significance level you expect.
Further Reading
Five Steps to an Actionable Digital Analytics Strategy
Four Trends that Will Shape Your Analytics Strategy by 2017
Getting a Testing Program Off the Ground: Optimization Part 2
Optimization: The First Step