This article explains advanced A/B test concepts. The primary audience is engineers encountering such A/B tests for the first time.

A/B tests allow engineers to validate fixes, test for regressions, and measure improvements. In most situations the standard approach is to allocate an experiment population (e.g. 10% of all users) and randomly split it into a control group and a test group of equal size (e.g. 5% each). However, sometimes more intricate methods can improve the user experience and make deployment safer and more effective.
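To make this concrete, here is a minimal sketch of such a split in Python. It assumes assignment is done by hashing the user id into buckets; the salt, bucket count, and group boundaries are arbitrary choices for illustration:

```python
import hashlib

def assign_group(user_id: str, salt: str = "experiment-42") -> str:
    """Deterministically place a user into 'control', 'test', or 'none'.

    The salted user id is hashed into 100 buckets: buckets 0-4 form the
    control group (5%), buckets 5-9 the test group (5%), and the
    remaining 90% of users are not part of the experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 5:
        return "control"  # receives the existing code A
    if bucket < 10:
        return "test"     # receives the new code B
    return "none"         # outside the 10% experiment population
```

Because the result only depends on the user id and the salt, the assignment is deterministic and can be computed on the client or on the server.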

In this article I explain the following advanced A/B testing strategies: shadow testing, test groups with fallback, hold-out groups, and negative experiments. These terms are often thrown around by senior engineers who do not necessarily agree on what exactly they mean. They certainly were not clear to me when I first encountered them.

The experiments described in this article are concerned with testing an existing piece of code or infrastructure, which I will call A (control), against new code or infrastructure, which I will call B (test).

Shadow Testing

Shadow tests make it possible to verify new infrastructure and code without any effects that are visible to the user. The incoming calls to the methods/APIs under test are duplicated and sent to both A and B. Only A is allowed to have side effects and only its responses are returned. B is measured, but it should not negatively affect A.

This approach is great for checking how a new back-end service implementation performs under real-world load. If the infrastructure allows, the responses from A and B can be compared to also verify correctness. However, this strategy comes at the cost of having to run twice the number of servers.
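As a rough illustration, a shadow-test wrapper on the server side might look like the following sketch, where call_a and call_b are placeholders for the two implementations and B is assumed to be free of side effects:

```python
import logging
import threading

log = logging.getLogger("shadow")

def handle_request(request, call_a, call_b):
    """Serve the request from A and shadow it to B in the background.

    Only A's response is returned and only A is allowed to have side
    effects; B's response is compared and measured but never exposed.
    """
    response_a = call_a(request)

    def shadow():
        try:
            response_b = call_b(request)  # B must be free of side effects
            if response_b != response_a:
                log.warning("shadow mismatch for request %r", request)
        except Exception:
            log.exception("shadow call for request %r failed", request)

    threading.Thread(target=shadow, daemon=True).start()
    return response_a
```

In a real system the background thread would typically be replaced by the platform's own async or queueing infrastructure and the mismatch would feed into a metrics pipeline, but the structure stays the same.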

In mobile apps shadow testing is less common because of the constrained resources. If both A and B are executed in parallel, they both put pressure on the limited memory and computational resources. This can then lead to exactly the kind of user-visible regressions that shadow tests were meant to avoid.

Test Group with Fallback

Falling back to the existing code when the new implementation fails sounds like a great idea at first. However, such setups are hellishly hard to get right, and even then it is easy to be fooled by the data. I suggest avoiding any fallback mechanism where possible.
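For reference, the kind of setup this section warns about is usually just a thin wrapper like the following sketch (the function and its arguments are placeholders):

```python
def checkout_total(cart, new_implementation, old_implementation):
    """Run the new code B and silently fall back to the old code A on failure."""
    try:
        return new_implementation(cart)
    except Exception:
        # The user still gets a result, but the test-group assignment no
        # longer reflects which implementation actually produced it.
        return old_implementation(cart)
```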

First, most tooling does not properly support this setup. Usually users are assigned to either a test or a control group, and the measured effect is determined afterwards by joining those groups with the metric table. If a user of the test group falls back from B to A, their assignment must be removed. This is often expensive and sometimes impossible when the assignment is done client-side (e.g. by hashing the user id).

What’s more, removing these users creates a bias. The failing users are likely a challenging sub-group (e.g. using low-memory devices) for which we do expect a higher failure rate. With such a setup we remove the difficult cases from the test group, which makes the new code look better than it is.

Second, running A after B can make A perform worse. For instance, B might have created many objects in memory that now cause an increase in garbage collection. But it can also be the other way around: B might have made an API request, and A can then benefit from an already established HTTPS connection.

Third, higher failure rates can make B look more performant. Assuming that failures correlate with the difficulty of a given task, crashes will more often occur for executions with long runtimes and high memory consumption. All these long runs would be filtered out, dramatically shifting the mean and percentiles of the performance data towards lower values.
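The size of this effect is easy to underestimate. Here is a small simulation of the filtering, assuming (purely for illustration) that runtimes are log-normally distributed and that the slowest 5% of executions crash and fall back:

```python
import random
import statistics

random.seed(0)

# Simulated runtimes in milliseconds for the test group B.
runtimes = [random.lognormvariate(4.0, 0.6) for _ in range(100_000)]  # mu=4.0, sigma=0.6

# Assume the slowest 5% of executions crash and fall back, so their
# measurements never reach the metrics table.
threshold = sorted(runtimes)[int(0.95 * len(runtimes))]
surviving = [r for r in runtimes if r < threshold]

print(f"true mean:     {statistics.mean(runtimes):.1f} ms")
print(f"measured mean: {statistics.mean(surviving):.1f} ms")
```

Even though the surviving executions themselves are unchanged, the measured mean ends up noticeably below the true mean, and the higher percentiles shift even more.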

Hold-Out Group

Hold-out groups are great for verifying fixes while bringing their benefit to most users as quickly as possible.

A conventional A/B test allocates a population of users (e.g. 10%) to an experiment and then evenly splits it into a test group and a control group. Hence, only 5% of all users will receive the new code B during the experiment.

In a hold-out group setup the new code becomes the default implementation, i.e. without the experiment setup it would run for 100% of all users. In this strategy we still allocate an experiment population (e.g. 10%) and split it into a test group and a control group. However, this time the roles are inverted: the test group receives the old code A and the control group receives the new code B.

This inversion can be confusing at first. However, it is usually necessary, as most tooling assumes that the control group receives the default implementation. Of course this means that the evaluation has to be read inversely as well: if the test group performs worse than control, it means that the new code is better.
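Continuing with the hashing sketch from the beginning of the article, a hold-out setup might be wired up like this (the salt and bucket boundaries are again arbitrary):

```python
import hashlib

def resolve_implementation(user_id: str, call_a, call_b):
    """Hold-out setup: the new code B is the default implementation.

    Only the small test group is held out on the old code A; the control
    group (and everyone outside the experiment) receives the default B.
    """
    digest = hashlib.sha256(f"holdout-7:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 5:
        return call_a  # test group: still runs the old code A
    # Buckets 5-9 form the control group, but like every other user they
    # simply receive the default implementation, which is now B.
    return call_b
```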

Hold-out groups are great for changes that fix important bugs or offer other significant improvements that should not be delayed. Such changes are of course verified in local tests, but it might be impossible to test exhaustively across all possible configurations. Sometimes hold-out groups are also used for estimating the impact of a fix (i.e. what is its bottom-line impact). Finally, this approach is a more comfortable choice for measuring long-term effects, because the benefit of the new code is available to most users from the start.

Negative Experiments

Negative experiments are an effective choice for testing hypotheses and identifying effect-to-metric relationships. Such experiments give a small test group a deliberately regressed behaviour (e.g. additional delays) in situations where an improved implementation would be too costly to justify without being sure it has a large impact. However, negative experiments incur costs, can damage trust, and require careful ethical considerations.

Consider the following example of an online shop: the team responsible for the check-out pages wants to understand whether the load time has an effect on total revenue. Testing locally is difficult, as they cannot emulate the behaviour of actual customers using the site. Implementing an actual 100ms improvement would require many weeks of work, which management does not approve without convincing impact estimates. Finally, publicly available research on this topic is inconclusive, and the team believes it does not apply to this particular online shop.

Therefore, the team decides to run a negative experiment. They create multiple small test groups and give them between 100ms and 500ms of extra delay when serving the check-out page. The evaluation of the corresponding revenue metrics might then show a strong correlation between load time and revenue – data that gets the performance improvement project approved.
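A sketch of how such a delay could be injected, assuming experiment groups like the ones above and placeholder names for the page-rendering code:

```python
import time

# Extra delay in seconds per experiment group; users outside the
# experiment get no artificial delay at all.
DELAY_BY_GROUP = {
    "delay_100ms": 0.1,
    "delay_250ms": 0.25,
    "delay_500ms": 0.5,
}

def render_checkout_page(user, experiment_group, render):
    """Artificially slow down the check-out page for small test groups."""
    time.sleep(DELAY_BY_GROUP.get(experiment_group, 0.0))
    return render(user)
```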

There are many costs to this experiment. First, the company is expected to make less revenue from the test groups. Second, something as simple as adding a delay can have more dramatic consequences. What if the delay means that at peak times the previous connection pool size is no longer sufficient and new connections are being dropped? Third, the users will spend more of their time with your product than necessary. This costs them battery life, mobile data, and attention.

The negative impact on the users especially requires careful ethical consideration, because they have not opted in to this experiment. The decision to run such an experiment requires a convincing argument that the long-term benefits to all users outweigh the negative impact. In particular, this means that local tests and tests of actual improvements should be chosen instead wherever possible.

Credits: cover photo by Isaac Smith on Unsplash.