ABT – “Always Be Testing” – is a popular phrase almost everybody has heard, and rightfully so. For me, A/B testing is hands down the best value-for-money optimization method there is. The entry barrier to A/B testing is extremely low nowadays, so everybody can do it – every CRM platform has testing capabilities, and almost everyone can come up with some testing ideas.

So it’s easy to start testing, but it’s hard to get it completely right. The reason is that testing combines principles from user research methodology, statistics, and psychology, and since few of us are proficient in all three, we tend to overlook some really important things.

For example, my A/B testing philosophy went like this:

  1. Every campaign has to have an A/B test,
  2. Every A/B test has to have as many variants as possible.

I was sure this would generate the maximum amount of learnings for me, but I was wrong. It turns out there is a practical upper limit to the number of variants you can test in a single campaign, and that limit is set not by your sample size*, but by pure statistical chance.

*Audience sample size and statistical significance are connected. More on that further in the post.

Understanding Statistical Significance

When testing, we can conclude the outcome of a test only once it has reached the desired statistical significance – a claim that the observed data is not the result of chance but can instead be attributed to a specific cause.

This threshold is most often set to 95%, as it represents a good balance between statistical confidence on one side and the required audience sample size and detectable effect (lift) on the other. It’s a practical common ground: pushing the threshold towards 100% would require prohibitively large sample sizes or unrealistically large effects.
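To get a feel for that trade-off, here’s a minimal Python sketch of the kind of power calculation involved. It assumes the statsmodels package, and the 5% baseline conversion rate and one-point lift are made-up numbers purely for illustration:

```python
# A rough sketch of the trade-off: how many users each variant needs as the
# confidence threshold climbs. Assumes statsmodels is installed; the baseline
# conversion rate and the lift are purely illustrative numbers.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05  # hypothetical baseline conversion rate (5%)
variant = 0.06   # hypothetical rate after the change (a one-point lift)
power = 0.80     # conventional 80% statistical power

effect = proportion_effectsize(variant, baseline)  # Cohen's h for two proportions

for confidence in (0.85, 0.90, 0.95, 0.99):
    alpha = 1 - confidence
    n = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                     power=power, alternative="two-sided")
    print(f"{confidence:.0%} confidence -> ~{round(n):,} users per variant")
```

The exact numbers will depend on your baseline and the lift you want to detect, but the pattern is always the same: the stricter the confidence level, the more users each variant needs.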

The Implication of Lower Thresholds

People in fast-moving environments also tend to blur the line a little and call tests at lower thresholds – ahh, 85% is good enough. That is OK, but let’s explain what the implication of that is.

When we say that a result is statistically significant at the 95% confidence level, it means that we are 95% confident that the result is not due to random chance. However, this also implies that there is a 5% chance that the result could still be due to chance. In other words, 1 out of every 20 variants could show a significant result even if there is no real effect – this is known as a false positive.

The lower we set the threshold, the higher the chance of a false positive. If we call the test at the aforementioned 85%, there is a 15% chance that the result is still due to chance – or 3 out of every 20 variants.

It’s important to mention that this could cause both false positives and false negatives.

  • False Positive (Type I Error): A test result that incorrectly indicates the presence of an effect or condition when it is not actually present.

  • False Negative (Type II Error): A test result that incorrectly indicates the absence of an effect or condition when it is actually present.

The probability described above holds when testing a single additional variation, but as you add more variants, the chance of encountering at least one false positive grows, because the individual error probabilities compound. Put simply, every variant (hypothesis) you add to the A/B test is another roll of the dice with a 5% chance of a false result, so the overall chance of getting at least one keeps climbing.

Calculating False Positives

You can calculate the chance of getting at least one false positive using the following formula:

P(at least one false positive) = 1 − a^m

Here, “m” is the total number of variants (hypotheses) tested, and “a” is the desired statistical significance level expressed as a proportion (e.g. 0.95).

Example Calculation

Let’s illustrate this with an example calculation for testing 5 variants at a 95% desired statistical significance level:

P(at least one false positive) = 1 − 0.95^5

P(at least one false positive) = 1 − 0.774

P(at least one false positive) ≈ 0.226, or roughly 22.6%

So, there’s approximately a 22.6% chance of encountering at least one false positive when testing 5 variants with a desired significance level of 95%.
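If you’d rather let code do the arithmetic, the formula is a one-liner. Here’s a small, purely illustrative Python sketch that reproduces the numbers in the table below:

```python
# Chance of at least one false positive across m tested variants,
# given a statistical significance level a (e.g. 0.95 for 95%).
def false_positive_probability(m: int, a: float = 0.95) -> float:
    return 1 - a ** m

for m in (1, 2, 5, 8, 10, 20):
    print(f"{m:>2} variants -> {false_positive_probability(m):.1%}")
```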

The results of other calculations can be seen below.

Number of variants or hypotheses    Probability of at least one false positive
1                                    5.0%
2                                    9.8%
5                                   22.6%
8                                   33.7%
10                                  40.1%
20                                  64.2%

Table: Probability of at least one false positive with a 95% statistical significance threshold

You can now see why testing too many variants at once is not always a good idea.

Luckily, there is a solution for this, and it’s called the Bonferroni correction. In essence, it divides your overall error budget (the 5%) by the number of variants you’re testing, which gives you the stricter statistical significance threshold each individual variant needs to clear.

Number of variants or hypotheses    Required significance level
1                                   95.0%
2                                   97.5%
5                                   99.0%
8                                   99.4%
10                                  99.5%
20                                  99.8%

Table: Bonferroni-corrected significance levels that maintain a 5% overall false positive probability

Simply put, to test 5 variants and keep the overall error rate at 5%, you’d need to set your statistical significance threshold to 99% instead of 95%.
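If you prefer code to tables, the correction itself is just one line – divide the 5% error budget by the number of variants and express the result as a confidence level. A quick, purely illustrative Python sketch:

```python
# Bonferroni correction: split the 5% overall error budget evenly across
# the m variants, then express the per-variant threshold as a confidence level.
def bonferroni_threshold(m: int, alpha: float = 0.05) -> float:
    return 1 - alpha / m

print(f"{bonferroni_threshold(5):.1%}")   # -> 99.0% for 5 variants
print(f"{bonferroni_threshold(20):.2%}")  # -> 99.75% for 20 variants (99.8% in the table)
```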

The Conclusion

For me, A/B testing has evolved beyond simply throwing as many variants as possible at the wall to see what sticks. It’s about planning, understanding the limitations of your audience samples, and not cutting corners on statistical significance.

So, let’s remember to Always Be Testing, but let’s do so wisely.
