We’re Agile, we think lean, we’re data-driven. If you live in the new economy and work on some sort of digital product, you hear some of these buzzwords (or perhaps all of them) on a daily basis.
You can argue they don’t all mean the exact same thing but they do carry one common fundamental principle:
Blind guessing is not a desirable way of making decisions, but rather a necessity when nothing more reliable exists to inform our reasoning. In developing digital products, however, that necessity rarely exists.
Testing hypotheses with some sort of A/B testing is one of the ways to avoid blind guessing and “hope-for-the-best” approaches. Below you’ll find a collection of statistics principles and best practices that I consider fundamental for anyone wanting to use A/B testing.
Hope you’ll find it useful!
This article was first published internally at Freeletics. We sure take our decisions seriously and don’t favour “hope-for-the-best” approaches. Also, we’re hiring Product Managers and Product Designers!
What is an A/B test?
An A/B test consists of taking two comparable groups of users and exposing them to two different versions (the control and a variation) of a software experience. By doing that and measuring how each of the groups interacts with the software we hope to infer which of the two versions best serves its purpose.
Some relevant statistics concepts
Overall Evaluation Criterion (OEC)
Also called Primary Goal (in Optimizely and other tools) or dependent variable, in statistics terminology. It’s a quantitative measure that defines the experiment’s objective, for example the conversion rate.
Null hypothesis (H₀)
The hypothesis that the OECs for the variants are in fact not different, and that any differences observed during the experiment are due to random fluctuations. The goal of our split test is to reject this null hypothesis and therefore conclude, with statistical significance, that an observed difference in the OECs is not due to random fluctuations. Failing to reject the null hypothesis doesn’t mean it is true; it only means no conclusion can be drawn.
(Most of the statistics used in A/B testing come from the field of Statistical hypothesis testing.)
In the context of null hypothesis testing, the p-value for a given observed result represents the probability of obtaining such a result (or any result more extreme than that) due only to random fluctuations, assuming the null hypothesis is true.
For the purpose of A/B testing, we can say the p-value represents the probability of observing a given OEC difference (or any difference more extreme than that) between variants due only to random fluctuations, assuming the OECs are in reality the same for both.
The P-value is not the probability of the null-hypothesis being true — it is calculated assuming the null hypothesis is true to start with.
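To make the definition concrete, here is a minimal sketch of how a p-value could be computed for a difference in conversion rates. It assumes a two-sided two-proportion z-test with a pooled standard error, which is one common choice and not necessarily what any particular tool does; the function name and the visitor counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variants share one conversion rate,
    # so the standard error is built from the pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a difference at least this extreme, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 300 of 10,000 users convert in control, 360 of 10,000 in the variant.
p = two_proportion_p_value(300, 10_000, 360, 10_000)
print(f"p-value = {p:.4f}")
```

Note that the function only uses the observed counts and the assumption that both variants behave the same; it says nothing about how likely the null hypothesis itself is.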
Optimizely does not show the p-value; instead it reports a “Statistical Significance” value, which corresponds to 1 − p-value. It is often read as the probability that an observed difference is not due to chance, but strictly speaking the caveat above applies to it as well.
Significance level (SL)
Set at the beginning of the experiment, the significance level (traditionally set to 1% or 5%) is the probability threshold below which the null hypothesis can be rejected. In other words, if a given OEC difference has a p-value smaller than the significance level, we’ll conclude the OECs are not the same for the variants and, therefore, believe there’s causality between the change and the effect. We can then say variant B’s OEC is statistically significantly different from the control’s OEC.
Significance level is also the probability of rejecting the Null hypothesis when this hypothesis is true (called Type I error or false positive).
As an example, setting the SL of an A/B test to 0.05 (or 5%) means that when there is in fact no difference between the variations, there’s still a 5% risk of getting a statistically significant result and wrongly concluding that there is one.
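One way to get a feeling for the Type I error is to simulate many A/A tests, where both groups truly have the same conversion rate, and count how often a test is nevertheless declared significant. The sketch below is only an illustration: the pooled z-test, the 10% conversion rate, the group size and the random seed are all assumptions chosen to keep it small.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation observed, nothing to reject
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(true_rate=0.10, users=500, experiments=1000,
                        significance_level=0.05, seed=7):
    """Share of simulated A/A tests that wrongly come out significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(experiments):
        # Both groups are drawn from the same conversion rate: the null hypothesis is true.
        conv_a = sum(rng.random() < true_rate for _ in range(users))
        conv_b = sum(rng.random() < true_rate for _ in range(users))
        if p_value(conv_a, users, conv_b, users) < significance_level:
            false_positives += 1
    return false_positives / experiments


rate = false_positive_rate()
print(f"Share of A/A tests declared significant: {rate:.1%}")
```

The printed share should land in the neighbourhood of the chosen significance level, even though no variant is actually better than the other.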
Note: Optimizely uses different terminology here and allows an Optimizely-significance value to be set for each project (default 90%). According to the documentation, the Optimizely-significance value represents 1 − p-value, which means that to set the SL to 0.05 this value must be chosen to be 95%.
In contrast to the SL, which focuses on the downside of the experiment (the probability of a false positive), Power represents the probability of correctly rejecting the null hypothesis, that is, of obtaining a true positive.
In other words, statistical Power is the likelihood that an experiment will detect an effect if there is in fact an effect to be detected.
The power of an experiment is influenced by a number of factors (such as sample size), and 80% is a typical desired value. Experiments with low power are unlikely to ever be conclusive in practical terms.
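To illustrate how sample size drives power, the following sketch approximates the power of a two-sided z-test on conversion rates. It relies on simplifying assumptions that are mine, not the tool’s: a normal approximation, equal group sizes, and the baseline variance p(1 − p) used for both variants. The baseline rate and the change to detect are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(baseline_rate, delta, n_per_variant, significance_level=0.05):
    """Approximate probability of detecting an absolute change `delta`."""
    norm = NormalDist()
    sigma_squared = baseline_rate * (1 - baseline_rate)
    # Standard error of the difference between two groups of equal size.
    std_err = sqrt(2 * sigma_squared / n_per_variant)
    z_crit = norm.inv_cdf(1 - significance_level / 2)
    shift = abs(delta) / std_err
    # Chance that the observed difference falls outside the acceptance region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Hypothetical setup: 3% baseline conversion, looking for an absolute change of 0.15 points.
for n in (20_000, 100_000, 206_934):
    print(f"n = {n:>7} per variant -> power ~ {approximate_power(0.03, 0.0015, n):.0%}")
```

With too few users per variant the chance of detecting the change stays small, which is why an underpowered test tends to drag on without a conclusion.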
The standard deviation (σ, Greek ‘sigma’) is a measure used to quantify the amount of variation or dispersion of a set of data values. In other words, it represents how ‘spread out’ the data points are. In the context of A/B testing, we look at the distribution of observed OECs.
Conversion events, such as purchases or registrations, can be modelled as a Bernoulli trial with probability p = conversion rate. In that case, the standard deviation (σ) is given by:
σ = √(p(1 − p))
Minimum sample size
For a desired Power (probability of detecting a true positive if it exists) and a sensitivity Δ (the amount of change we want to detect, e.g. 5% of the control value) we can estimate the minimum sample size that’s needed to achieve that. A useful formula, for a Power of 80%, is the following:
n = 16σ² / Δ²
where n is the number of users in each variant. That means that to detect a change Δ with 80% probability, we’ll need n users in each variant.
Such an estimate helps us understand how long a planned experiment is likely to take in order to detect the change we’re looking for, if it exists. As can be seen from the formula above, n depends on the square of σ, which means we can reduce the minimum sample size by choosing an OEC with lower variability (e.g. conversion rate instead of purchase value).
Perhaps even more relevant, this formula tells us that small changes are harder to detect. This means that great ideas are unlikely to be missed, but also that A/B testing is often not a suitable or efficient way to make small incremental improvements.
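How quickly the required sample grows for small changes can be sketched in a few lines. The code assumes the rule of thumb n = 16σ²/Δ² for 80% power (as given by Kohavi et al., listed under the sources) and a Bernoulli OEC with σ² = p(1 − p); the 3% baseline and the function name are hypothetical.

```python
from math import ceil


def min_sample_size(sigma_squared, delta):
    """Users needed per variant for ~80% power (rule of thumb 16 * sigma^2 / delta^2)."""
    return ceil(16 * sigma_squared / delta ** 2)


baseline_rate = 0.03  # hypothetical baseline conversion rate
sigma_squared = baseline_rate * (1 - baseline_rate)

for relative_change in (0.01, 0.05, 0.10, 0.20):
    delta = baseline_rate * relative_change  # absolute change we want to detect
    n = min_sample_size(sigma_squared, delta)
    print(f"detect a {relative_change:.0%} change -> {n:,} users per variant")
```

Because Δ is squared in the denominator, halving the change we want to detect roughly quadruples the number of users needed.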
The Confidence interval is a range of values that is likely to contain the actual effect of the variant on the OEC of the control. Just as an example, a 95% confidence interval (that is, an interval that has 95% probability of containing the actual effect) could be calculated as:
(OB − OA) ± 1.96 × σd
where OB and OA are the average values of the observed OECs for the variant and the control, and σd is an estimate of the standard deviation of the difference. In practice though, tools such as Optimizely calculate this interval automatically.
Looking at the confidence interval gives us insight into the potential upside and downside of the experiment. It allows us to say, with a given probability (in the case above at least 95%), that the true difference between OECs is not higher than the upper bound of the interval. That can support the decision to stop the test because the upside is very likely lower than what we were hoping to achieve.
However, we might not be able to tell with significance whether or not the variant represents an improvement for the OEC. That conclusion can only be made if the whole interval is positive, that is, if it does not contain 0.
Fig 1. Inconclusive confidence interval as seen on Optimizely
Fig 2. Conclusive confidence interval as seen on Optimizely
- Optimizely automatically sets the Confidence interval to the same value as Optimizely-significance. Which means, if you set Optimizely-significance to 95% in your project, you’ll see 95% confidence intervals.
- Optimizely only marks an experiment as conclusive if the confidence interval does not contain 0. Depending on the OEC goal we’ve set for the experiment, we are therefore likely to be able to make a negative conclusion earlier. To be more precise, we can conclude that our goal was not met as soon as the whole confidence interval lies below our OEC goal.
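The reasoning about confidence intervals can be sketched as follows. This is an illustration under my own assumptions, not how Optimizely computes its intervals: it uses a normal approximation with an unpooled standard error for conversion rates, and the helper names, counts and goals are hypothetical. The goal is expressed as an absolute difference in conversion rate.

```python
from math import sqrt
from statistics import NormalDist


def difference_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for (rate_b - rate_a) using a normal approximation."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Estimated standard deviation of the difference between the two rates.
    sigma_d = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    diff = rate_b - rate_a
    return diff - z * sigma_d, diff + z * sigma_d


def read_interval(low, high, goal):
    """Interpret an interval against 0 and against the improvement we hoped for."""
    if high < goal:
        return "goal not met"  # even the optimistic bound falls short of the goal
    if low > 0:
        return "improvement"   # the whole interval is positive
    return "inconclusive"      # the interval still contains 0


# Hypothetical counts: control converts 300 of 10,000 users.
print(read_interval(*difference_confidence_interval(300, 10_000, 360, 10_000), goal=0.003))
print(read_interval(*difference_confidence_interval(300, 10_000, 320, 10_000), goal=0.003))
print(read_interval(*difference_confidence_interval(300, 10_000, 320, 10_000), goal=0.009))
```

The last call shows the early negative conclusion: the interval still contains 0, but its upper bound is already below an ambitious goal, so the test can be stopped.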
Choose OEC in advance
Define the success metric for your test in advance and define a minimum improvement you’d consider a success. Those two decisions must take into account the following factors:
Low-variability OECs (lower σ) need smaller sample sizes (and, therefore, less time) to detect changes. In an e-commerce setup, conversion rate is a good example of a low-variability OEC, as opposed to basket size.
Large differences are easier and faster to detect. Small improvements take big sample sizes, a lot of time and, therefore, might never reach a conclusion. So make big changes and set big goals.
Be ambitious not only in the goals you set but also in the variations you test. For example, try two completely new designs or split users between two completely different pages. Stay away from little incremental changes, since A/B testing is unlikely to be the right tool to support your decision in that situation.
Traditional A/B testing vs. Multivariable testing
Multivariable testing is a generalization of A/B testing in which several factors are tested simultaneously — in its most standard form (the one supported by Optimizely — also called full factorial) there’s one variant for each possible combination of the different factors.
This kind of testing is ideal if you suspect several factors will interact strongly. It should also be used if you have, or can have, several factors (called sections on Optimizely) implemented and tested straight away, because you’ll reach a conclusion faster than with consecutive A/B tests of a single factor each.
Minimum sample size for Multivariable experiments
The formula given above for the Minimum sample size of a traditional split test with two variants can be generalized to the case of Multivariable experiments, with several factors and several variants per factor:
where r stands for the total number of variants and n is still the number of users per variant.
It’s a good strategy to do a traffic ramp-up, that is: start by serving variant B to only 10% of the users to ensure there are no implementation problems or catastrophically losing variations.
To maximize the Power of the experiment, however, you should increase the traffic allocation to 50%/50% as soon as possible. As an example, an experiment running at 1%/99% will have to run about 25 times longer than the same experiment at 50%/50%. This effect is particularly relevant for multivariable testing, as any decrease in the traffic allocated to one of the variations will cap the sample size for any comparison within the test.
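The factor of 25 can be checked with a short calculation. The sketch assumes that both groups have the same variance, so that the variance of the observed difference is proportional to 1/n_control + 1/n_variant; the function name is made up.

```python
def relative_runtime(variant_share):
    """How much longer a test must run than a 50/50 split to reach the same precision."""
    control_share = 1 - variant_share
    # The variance of the difference is proportional to 1/n_control + 1/n_variant.
    variance_factor = 1 / control_share + 1 / variant_share
    balanced_factor = 1 / 0.5 + 1 / 0.5
    return variance_factor / balanced_factor


for share in (0.5, 0.2, 0.1, 0.01):
    print(f"{share:.0%} of traffic in the variant: {relative_runtime(share):.1f}x as long")
```

Under this assumption the penalty grows slowly at first and then explodes as one group becomes tiny, which is why the ramp-up phase should be kept short.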
EDIT: As mentioned by Chris Stucchio, it’s fundamental to throw away all the data and start collecting from scratch once you’ve ruled out any implementation bugs and scaled traffic allocation to 50–50. In his words “Since conversion rates change during the week (i.e., sat != tues), keeping the data during the ramp-up period is a great way to get wrong results due to Simpson’s Paradox.”
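A small numeric illustration shows how this can go wrong. All numbers below are invented: a ramp-up period on a low-converting weekend with a 90/10 split, followed by a weekday at 50/50. The variant is slightly worse than the control in both periods, yet looks clearly better once the two periods are pooled.

```python
# (conversions, users) per variant in each period; all numbers are hypothetical.
periods = {
    "ramp-up (weekend, 90/10 split)": {"A": (180, 9_000), "B": (19, 1_000)},
    "full test (weekday, 50/50 split)": {"A": (250, 5_000), "B": (245, 5_000)},
}


def rate(conversions, users):
    return conversions / users


def pooled_rate(variant):
    """Conversion rate of a variant when both periods are lumped together."""
    conversions = sum(p[variant][0] for p in periods.values())
    users = sum(p[variant][1] for p in periods.values())
    return conversions / users


for name, data in periods.items():
    print(f"{name}: A = {rate(*data['A']):.2%}, B = {rate(*data['B']):.2%}")
print(f"pooled: A = {pooled_rate('A'):.2%}, B = {pooled_rate('B'):.2%}")
```

The pooled result is distorted because most of the control’s users come from the low-converting period, while most of the variant’s users come from the high-converting one.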
Do you always have to wait until the test is “conclusive”?
No. It’s wise to assess each case individually, since in some cases (particularly if the minimum sample size or the magnitude of the change were not properly estimated) significance will never be reached because the power of the experiment is too low.
In the majority of cases, however, you will reach significance at some point. For these experiments, as described above, you might still be able to tell that your experiment didn’t reach its goal if the whole Confidence interval lies below the goal.
Estimate minimum sample size
Estimating the sample size gives you an idea of how long the test will likely have to run until you have a conclusion. Tests taking longer than a couple of months are not recommended because several external factors, such as cookie churn, start becoming relevant and affecting the experiment.
You can play with factors such as the number of variations, the magnitude of the expected change and the OEC you’ve chosen in order to tune and reduce the minimum sample size.
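As an example, let’s calculate the Minimum sample size for a traditional A/B test (1 control, 1 variant) for detecting a 5% change in the conversion rate of an e-commerce checkout page with a baseline conversion of 3%:
n = 16σ² / Δ² = 16 × 0.03 × (1 − 0.03) / (0.03 × 0.05)² ≈ 207,000 users per variant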
Note that Δ is given by 0.03*0.05 because the magnitude of the change we’re trying to measure is 5% of the 3% baseline.
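The same calculation can be reproduced in code and extended to a rough duration estimate. The sketch assumes the rule of thumb n = 16σ²/Δ² for 80% power and σ² = p(1 − p); the daily traffic figure is purely hypothetical and should be replaced with your own.

```python
from math import ceil

baseline_rate = 0.03    # baseline conversion rate from the example
relative_change = 0.05  # we want to detect a 5% relative change
daily_users = 5_000     # hypothetical number of users entering the experiment per day

sigma_squared = baseline_rate * (1 - baseline_rate)  # Bernoulli variance p(1 - p)
delta = baseline_rate * relative_change              # absolute change to detect

users_per_variant = ceil(16 * sigma_squared / delta ** 2)
total_users = 2 * users_per_variant                  # 1 control + 1 variant
days_needed = ceil(total_users / daily_users)

print(f"users per variant: {users_per_variant:,}")
print(f"estimated duration: {days_needed} days")
```

With this made-up traffic the test would need close to three months, which is already beyond the couple of months recommended above, so it would be worth aiming for a bigger change or a less variable OEC.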
Further reading and sources
- Kohavi, Ron, et al. “Controlled experiments on the web: survey and practical guide.” Data Mining and Knowledge Discovery 18.1 (2009): 140–181. (Published by Springer)