This article explains an often-overlooked effect that can distort metrics during feature roll-outs. The primary audience is software engineers and managers working on infrastructure and libraries.
I want to motivate the problem with a fictional scenario: Alex wrote a replacement for an old and inefficient component in their app. It took the team a few weeks to implement, and they carefully measured on test devices that it improves all core metrics. It is faster, uses less memory, all the good stuff. Following procedure, Alex’s new code is bundled with the next release and they open it to 1% of users for an A/B test.
This is when Alex feels the world has gone mad. The very metrics they carefully tested locally seem to move in the opposite direction: crashes are up, engagement is down. Alex scrambles through log files and bytecode: Is the exposure logging broken? Are there device-specific bugs? Did they release the wrong code? Are the results even significant?
One of their team’s most actionable ideas is that they might have too little data. Therefore, they slowly increase the exposure to 2%, 3%, … . The metrics never turn disastrous, but they stay slightly negative. Without the positive results Alex promised, people start questioning the project…
What Happened Here?
The scenario above could be a situation where short-term negative effects hide a generally positive impact. I call these Roll-Out Phantoms, as they often occur during feature roll-outs in infrastructure projects. During a roll-out, the population under test runs the new code for the first time, and being the new code on the block can have negative implications at the beginning:
- The control group can use its existing caches (e.g. in an image library), but the new code has to build up its cache first.
- The new website code requires downloading new assets (e.g. CSS and JS files) that increase bandwidth and rendering time.
- During the first exposure your new component is loaded alongside the existing one, leading to higher memory pressure.
- The new code might only load after the app has been fully restarted. Therefore, the negative effects of a cold start impact the metrics of your test group.
- The new code fixes a long-standing bug, but the affected people have already churned away. It can take a long time for them to discover that the app works for them again.
Note that the slow roll-out in the story above made things worse. As the first 1% recover from the negative effects, the next 1% get burned. Average or percentile-based metrics blend these cohorts together and hide that they are at different stages of recovery.
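This cohort-mixing effect can be illustrated with a small simulation. All numbers below (the size of the transient penalty, the recovery time, the steady-state gain) are made up for illustration; the point is only that the blended average stays negative while every individual cohort recovers.

```python
# Sketch: why a slow roll-out keeps the blended average slightly negative.
# All numbers are hypothetical.

STEADY_STATE_GAIN = 2.0   # long-term metric improvement once caches are warm
INITIAL_PENALTY = -5.0    # transient cost of first exposure (cold caches etc.)
RECOVERY_DAYS = 7         # days until the penalty has faded

def cohort_effect(days_since_exposure: int) -> float:
    """Metric delta for one cohort, given how long it has been exposed."""
    if days_since_exposure >= RECOVERY_DAYS:
        return STEADY_STATE_GAIN
    # Linear fade from the initial penalty to the steady-state gain.
    progress = days_since_exposure / RECOVERY_DAYS
    return INITIAL_PENALTY + (STEADY_STATE_GAIN - INITIAL_PENALTY) * progress

def observed_average(day: int, rampup_days: int) -> float:
    """Blended delta on `day` if a fresh 1% cohort was added each ramp-up day."""
    cohorts = [cohort_effect(day - start) for start in range(min(day + 1, rampup_days))]
    return sum(cohorts) / len(cohorts)

for day in range(10):
    print(f"day {day}: blended delta {observed_average(day, rampup_days=10):+.2f}")
```

With a fresh cohort added every day, the blended delta on day 9 is still about -0.8, even though the oldest cohorts have long reached their +2.0 steady state.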
Phantom-Busters: Avoiding Experiment Distortion
There are a few strategies to minimize the effect of Roll-Out Phantoms. Trust in your local testing, together with patience, can help you.
Roll-out should happen in steps, with sufficient time between exposure and analysis. Rather than opening from 1% to 10% over a week, open to 5% and wait. We are all keen to see the results of our work right away, but do not trust the first days’ worth of data. If you are convinced the new code is better and the results are slightly down in the beginning – explain and wait. Of course: keep an eye on crashes and churn rates, which indicate whether things are going horribly wrong.
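One way to build this patience directly into the analysis is to ignore each user's first days after exposure. A minimal sketch, assuming a seven-day burn-in window (the window length and function names are my own, not prescribed above):

```python
from datetime import date, timedelta

# Hypothetical burn-in: ignore the first week after a user's exposure.
BURN_IN = timedelta(days=7)

def analysis_ready(exposure_date: date, event_date: date) -> bool:
    """Only count metric events recorded after the burn-in period."""
    return event_date - exposure_date >= BURN_IN

# A user exposed on March 1st only contributes events from March 8th onward.
print(analysis_ready(date(2024, 3, 1), date(2024, 3, 5)))  # False: still in burn-in
print(analysis_ready(date(2024, 3, 1), date(2024, 3, 8)))  # True
```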
Roll-out increases should be performed by adding an additional test group. Suppose you followed the previous paragraph and the experiment is happily sitting at 5%, but some metrics are not significant yet and you need more exposure. Keep the existing experiment and add a second one with 20% drawn from the currently unexposed population. This way you avoid mixing the short-term negative effects of the younger exposures with the long-term impact of the older ones.
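Assuming users are assigned to 100 stable buckets by hashing their ID – a common setup, but an assumption on my part – carving the new group out of the unexposed population might look like this:

```python
import hashlib

def bucket(user_id: str) -> int:
    """Stable assignment of a user to one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def assign(user_id: str) -> str:
    b = bucket(user_id)
    if b < 5:
        return "experiment_v1"  # original 5% group, keeps its exposure history
    if b < 25:
        return "experiment_v2"  # new 20% group, freshly exposed
    return "control"            # remaining 75%
```

Because hashing is deterministic, the original 5% keep their buckets, and the two experiment groups can be analysed separately: v2 carries the short-term phantom effects while v1 shows the long-term impact.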
The same negative effects can be added to the control group. In some instances you can reduce the relative advantage of the control group by levelling the playing field. When testing an image library, consider clearing the cache for both groups at the beginning. When testing a new website design, consider adding cache-busting to resources loaded by both groups. However, do carefully consider the ethics of this: Is getting your results faster more beneficial overall than the negative effects (battery drain, bandwidth usage, …) the control group has to endure?
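A minimal sketch of the cache-busting idea for web assets, assuming assets are referenced by URL and tagged with a per-experiment version token (the token and function names are hypothetical):

```python
from urllib.parse import urlencode

# Hypothetical token, bumped when the experiment starts.
EXPERIMENT_VERSION = "exp-20240301"

def asset_url(path: str, in_experiment: bool) -> str:
    """Append a cache-busting token for everyone in the experiment population,
    control and test alike, so both groups pay the same re-download cost."""
    if not in_experiment:
        return path
    return f"{path}?{urlencode({'v': EXPERIMENT_VERSION})}"

print(asset_url("/static/app.css", in_experiment=True))   # /static/app.css?v=exp-20240301
print(asset_url("/static/app.css", in_experiment=False))  # /static/app.css
```

Note that the token is applied to the whole experiment population, not just the test arm – that is what levels the playing field.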
Roll-Out Phantoms surprise most engineers when I mention them – especially when they are experiencing the distress of a feature roll-out not going to plan. Luckily, these projects (and team confidence) can be fixed by restarting the experiment with a large enough initial exposure and some patience.