A/B Testing Best Practices

A/B testing refers to a specific type of randomized experiment in which a group of users (or participants) is presented with two variations of some thing (product, email, advertisement, landing page, whatever): a Variation A and a Variation B. The group of users exposed to Variation A is often referred to as the control group because its performance is held as the baseline against which any improvement observed from presenting Variation B is measured. Variation A is often the original version of the thing being tested, the one that existed before the test. The group of users exposed to Variation B is called the treatment group. When A/B testing is used to optimize a conversion rate, the performance of the treatment is measured against that of the control using the following calculations (sketched in code after this list):

  1. Observed Conversion Rate (CvR) for each variation = Unique Conversions / Users; where conversions are defined as users who completed a certain action
  2. 95% Confidence Interval for each variation = CvR +/- 1.96 * Standard Error (see here for calculating standard error); this is the range of values within which the true conversion rate lies with 95% probability
  3. Observed Improvement (Absolute) = CvR of B – CvR of A; the absolute amount of lift observed from the test
  4. Observed Improvement (Relative) = (CvR of B – CvR of A) / CvR of A * 100; the relative amount of lift observed from the test, expressed as a percentage
  5. Chance to Beat Baseline for Variation B is the probability that the true CvR of B is greater than the true CvR of A (this measure depends on the conversion rates of both variations, the observed improvement, and the amount of statistical power in the test; see the above link for how it is calculated)
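
The arithmetic above is straightforward to script. Below is a minimal Python sketch of these five calculations, assuming the standard error is the usual binomial approximation, sqrt(CvR * (1 - CvR) / Users), and that Chance to Beat Baseline is approximated with a normal model of the difference in observed rates; the function name and the example numbers are illustrative, not part of our calculator.

```python
import math
from statistics import NormalDist

def ab_test_summary(users_a, conversions_a, users_b, conversions_b):
    """Compute the five metrics listed above for a two-variation conversion test."""
    # 1. Observed conversion rates
    cvr_a = conversions_a / users_a
    cvr_b = conversions_b / users_b

    # Standard error of each observed rate (binomial approximation)
    se_a = math.sqrt(cvr_a * (1 - cvr_a) / users_a)
    se_b = math.sqrt(cvr_b * (1 - cvr_b) / users_b)

    # 2. 95% confidence intervals: CvR +/- 1.96 * SE
    ci_a = (cvr_a - 1.96 * se_a, cvr_a + 1.96 * se_a)
    ci_b = (cvr_b - 1.96 * se_b, cvr_b + 1.96 * se_b)

    # 3. and 4. Absolute and relative improvement of B over A
    lift_abs = cvr_b - cvr_a
    lift_rel = lift_abs / cvr_a * 100

    # 5. Chance to beat baseline: P(true CvR of B > true CvR of A),
    # approximated here with a normal model of the difference in rates
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    chance_to_beat = NormalDist().cdf(lift_abs / se_diff)

    return {
        "cvr_a": cvr_a, "cvr_b": cvr_b,
        "ci_a": ci_a, "ci_b": ci_b,
        "lift_abs": lift_abs, "lift_rel": lift_rel,
        "chance_to_beat_baseline": chance_to_beat,
    }

# Example: 10,000 users per variation, 500 vs. 560 conversions
print(ab_test_summary(10_000, 500, 10_000, 560))
```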

To get a better sense of how these performance and significance metrics work together, you can plug some numbers into our Significance Calculator. For a deeper introduction to this type of traditional A/B testing, check out this video of a data scientist from Quora lecturing on the subject at Harvard.

A/B/n testing, or split testing, is largely equivalent to A/B testing in its fundamental design: for the duration of the exploration phase (the testing period), a fixed proportion of users is presented with a given variation (e.g., 50% of users are exposed to Variation A). The difference is that split testing is not restricted to two variations (A and B). When running A/B or split tests, many practitioners keep Variation A as their control and will often fix the proportion of users shown A at a higher rate than the experimental variations, so as to mitigate the perceived risk of making a change.
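
To make the fixed-allocation idea concrete, here is a minimal Python sketch of deterministic user bucketing, assuming user IDs are hashed so that each user always sees the same variation; the 60/20/20 split, the experiment name, and the function name are illustrative choices, not a prescription.

```python
import hashlib

# Illustrative fixed allocation: control A gets a larger share to reduce
# perceived risk, while B and C split the remainder.
ALLOCATION = [("A", 0.60), ("B", 0.20), ("C", 0.20)]

def assign_variation(user_id: str, experiment: str = "landing_page_test") -> str:
    """Deterministically bucket a user so they always see the same variation."""
    # Hash user + experiment so the same user gets independent buckets per test
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform value in [0, 1)

    cumulative = 0.0
    for name, share in ALLOCATION:
        cumulative += share
        if bucket < cumulative:
            return name
    return ALLOCATION[-1][0]  # guard against floating-point edge cases

print(assign_variation("user-123"))  # stable across calls for the same user
```

Deterministic hashing, rather than a fresh random draw per visit, keeps each user's experience consistent for the life of the test.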

Multivariate testing is distinct from A/B or split testing in that it is better suited to testing multiple changes simultaneously. The problem with testing multiple changes at once is identifying which change was responsible for the observed effect on performance. To deal with this, multivariate tests employ a full factorial experimental design, in which every combination of the tested elements' variations constitutes a distinct version to be tested.

Take the example of a landing page which has the following elements and variations that are being tested:

  1. Headline: 2 variations
  2. Banner image: 3 variations
  3. Form fields: 3 variations
  4. Call to action: 4 variations

In this instance, a multivariate test would produce 72 (2 * 3 * 3 * 4) combinations for testing. Multivariate testing is a robust way to identify the best performing combination of changes, especially if you believe that the states of some elements influence one another. The classic example is text color and background color: a white background may perform better in one test and white text may perform better in another, but they are unlikely to perform best in combination because of their influence on one another. Multivariate tests are often a faster route to improvement than testing multiple changes through step-wise A/B tests.
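
For illustration, the full factorial design over these four elements can be enumerated in a few lines of Python; the variation labels (H1, B1, and so on) are placeholders, not real page content.

```python
from itertools import product

# The landing-page elements and variation counts from the example above
elements = {
    "headline": ["H1", "H2"],
    "banner_image": ["B1", "B2", "B3"],
    "form_fields": ["F1", "F2", "F3"],
    "call_to_action": ["C1", "C2", "C3", "C4"],
}

# Full factorial design: every combination of element variations is a distinct version
combinations = list(product(*elements.values()))
print(len(combinations))  # 72, i.e. 2 * 3 * 3 * 4

# Each tested version pairs every element with one of its variations
versions = [dict(zip(elements.keys(), combo)) for combo in combinations]
print(versions[0])  # {'headline': 'H1', 'banner_image': 'B1', 'form_fields': 'F1', 'call_to_action': 'C1'}
```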

Though they are easier to understand and conceptualize, all of the aforementioned methods offer a suboptimal answer to the question: what works better? Advances in machine learning and computing have made more sophisticated, better performing options available.

Bandit algorithms, in which the proportions of users shown different test variations are dynamically updated with each incremental datum collected, have been shown to deliver faster results and maintain higher performance than A/B, split, or multivariate tests. There are several bandit solutions, varying in their level of sophistication and their utility under different computing and development constraints. We've found that the best bandit solution for mobile app experiments is one that leverages random probability sampling, and you can read more about our implementation of this solution here.
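
As one concrete illustration of probability sampling, the sketch below implements a Beta-Bernoulli bandit (often called Thompson sampling), in which each variation's conversion rate is drawn from its posterior and the variation with the highest draw is served. This is a generic sketch under a Beta(1, 1) prior, not our production implementation, and the class and variable names are illustrative.

```python
import random

class BetaBernoulliBandit:
    """Allocate traffic by sampling each variation's conversion rate from a
    Beta posterior and serving the variation with the highest sampled value."""

    def __init__(self, variations):
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per variation
        self.successes = {v: 1 for v in variations}
        self.failures = {v: 1 for v in variations}

    def choose(self):
        # Draw a plausible conversion rate for each variation and pick the best
        draws = {v: random.betavariate(self.successes[v], self.failures[v])
                 for v in self.successes}
        return max(draws, key=draws.get)

    def update(self, variation, converted):
        # Each incremental datum shifts future traffic toward better performers
        if converted:
            self.successes[variation] += 1
        else:
            self.failures[variation] += 1

bandit = BetaBernoulliBandit(["A", "B"])
shown = bandit.choose()          # pick a variation for the next user
bandit.update(shown, converted=True)  # record whether that user converted
```

Because every observed conversion (or non-conversion) reshapes the posteriors, traffic drifts toward the better variation during the test rather than only after it ends, which is the source of the speed and performance advantage described above.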