
Typical 8-Step A/B Test Workflow for Data Scientists in 2022


A Summary of What Data Scientists Do in A/B Test Projects at Tech Companies

Source: by Stephen Dawson on Unsplash

Too Long; Didn’t Read

  • Before the launch of the experiment: (1) think about the product & talk with product managers; (2) design the experiment; (3) check the A/A variation¹.
  • During the experiment: (4) check the invariant metrics & experiment set-up; (5) monitor key evaluation metrics.
  • After the end of the experiment: (6) read the p-values, magnitudes of difference, and minimum detectable effects (MDE²), dive deeper when surprised; (7) formulate product recommendations; (8) write the report.

Introduction

This article discusses in detail what data scientists specifically do in A/B test projects and which steps deserve the primary focus. It is best suited for practitioners who already have a working knowledge of A/B tests, such as t-tests and minimum sample size calculation, and who are interested in how these procedures play out in actual work scenarios.

Why Am I Writing This? / Why Should We Care?

I felt motivated to write this article because I realized there are still gaps between day-to-day experiment practice at tech companies and the A/B test knowledge taught in schools. Though comprehensive in their theoretical justification, most online courses and articles, even the notable A/B Testing course by Google on Udacity, fail to connect the dots and present a holistic picture of how data scientists apply statistical knowledge in A/B tests.

Furthermore, how data scientists conduct A/B tests has changed a lot in recent years, shifting from largely manual work to extensive use of dedicated tools. However, managing these tools can be daunting, especially when a single data scientist is expected to be responsible for multiple A/B tests at once.

This makes a clear understanding of the new, tool-centric workflow essential: it helps us stay organized and plan ahead. Popular A/B test platforms now take on most of the computation, including sample sizes and p-values, freeing data scientists from writing out statistical formulas and calculating the numbers by hand. That is exactly why we should spend most of our time thinking about the product and the experiment itself, rather than getting stuck in complicated tools doing repetitive back-and-forth work.

Source: by Carlos Muza on Unsplash

The Typical Workflow in Detail

Before the Launch of the Experiment

1) Think About the Product & Talk with the Product Manager (PM)

From my perspective, this is the most important step in the entire A/B test workflow. Two key aspects should be evaluated thoroughly: feasibility and impact. After exploring the product logic and user journey in detail, we should think very critically about whether it is appropriate, or even doable, to test this feature with a randomized online A/B test. Questions to ask include: “What are the hypotheses?”, “Can we find calculable metrics to test them?”, “What are the risks for our product and users?”, “Is there a network effect?”, “Could users show a strong novelty effect or change aversion?”, “Is it fair to withhold the new feature treatment from some users?”, etc.

Additionally, as data scientists, we must be impact-driven rather than output-driven, which means we should pay attention to the potential gain or Return-On-Investment (ROI) of this feature testing. “Why do we need this?”, “What is our expected gain, is it short-term or long-term?” are some of the key questions we should ask ourselves and debate with product managers.

Why do we need to change the button to a brighter color? What are some candidate metrics (e.g. CTR for both buttons, fallout rate)? Will the user be disturbed too much? Is it beneficial in the long term? Source: Leanplum

2) Design the Experiment

Based on the hypotheses from 1), we as data scientists provide suggestions on designing the experiment for product managers and engineers, including but not limited to:

  • Evaluation Metrics: What are our north star metrics, directly-influenced metrics, guardrail metrics, and adoption metrics in the treatment group? Beyond their business interpretation, evaluation metrics should ideally have low variance and be sensitive to changes.
  • Unit of Diversion: Should we randomize the test based on user_id or session_id?
  • Minimum Required Sample Size: For each evaluation metric and each candidate minimum detectable effect (MDE), what sample size is necessary for the result to be statistically significant? Nowadays, most large tech companies provide mature tools to calculate the sample size, but we often have to retrieve historical data and calculate it ourselves when the metrics of interest are too specific to be supported by the A/B test platforms (see the sketch after this list).
  • Experiment Duration & Traffic Portion: This usually depends on the required sample size and the size of the eligible traffic that could potentially see the new feature. We should also take the risks of the experiment into consideration.
  • Experiment Layer & Experiment Type: We also need to determine which experiment layer to deploy our A/B test on. It is common for most experiment layers to be already populated with other tests, leaving very little traffic available. In that case, we might choose another layer or consider orthogonal experiments, provided our feature does not interact with the others.
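To make the sample size point concrete, here is a minimal sketch of the kind of offline calculation we fall back on when the platform does not support a metric. It assumes a conversion-rate metric with a made-up 10% baseline and a 2% relative MDE, and uses statsmodels' power analysis; all numbers are placeholders, not recommendations.

```python
# Minimal sketch: required sample size per group for a conversion-rate metric.
# Baseline rate, MDE, alpha, and power below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10                         # historical conversion rate (assumption)
mde_relative = 0.02                          # minimum detectable effect, relative lift (assumption)
target_rate = baseline_rate * (1 + mde_relative)

effect_size = proportion_effectsize(target_rate, baseline_rate)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                              # significance level
    power=0.80,                              # 1 - beta
    ratio=1.0,                               # equal traffic split
    alternative="two-sided",
)
print(f"Required samples per group: {int(round(n_per_group))}")
```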

3) Check the A/A Variation

Unlike in the past, most A/B test platforms have automated this process in 2022 and require very little manual labor. What we need to do is check the A/A variation output and make sure everything is as expected: the p-values from the A/A simulations should be uniformly distributed.
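If you ever need to reproduce this check by hand, the sketch below simulates repeated A/A splits on a stand-in historical metric and tests the resulting p-values for uniformity; the generated data and the thousand-iteration loop are illustrative assumptions, not any platform's actual procedure.

```python
# Minimal sketch: simulate A/A splits on historical metric values and check
# that the resulting p-values look roughly Uniform(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
historical = rng.normal(loc=5.0, scale=2.0, size=100_000)  # stand-in for real per-user metric values

p_values = []
for _ in range(1000):
    assignment = rng.integers(0, 2, size=historical.size).astype(bool)  # random 50/50 split
    a1, a2 = historical[assignment], historical[~assignment]
    p_values.append(stats.ttest_ind(a1, a2, equal_var=False).pvalue)

# Under a correct A/A set-up, the p-values should not deviate much from uniformity.
ks_stat, ks_p = stats.kstest(p_values, "uniform")
print(f"KS test against Uniform(0,1): stat={ks_stat:.3f}, p={ks_p:.3f}")
```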

During the Experiment

4) Check the Invariant Metrics & Experiment Set-up

This is the phase where we transition our analysis from offline work to the online A/B test platform. During the experiment, the first thing data scientists should do is double-check whether the experiment is set up correctly; specifically, make sure users in the treatment group actually see all the treated features as expected. There are numerous cases at tech firms where experiment parameters were wrong and no one caught them in time, resulting in a massive loss of time and traffic. This is even worse in the advertising scenario, where we use traffic bought by our advertisers for A/B testing but end up sabotaging the experiments due to wrong parameters (personal experience…).

The second thing is to explore the invariant metrics and ensure two things: (1) the diversion of traffic is random (we usually use a Chi-squared test to compare the populations of the control and treatment groups); (2) the distribution of user profiles (e.g. gender, age) is homogeneous across groups (most platforms compute these for us, and the results can be displayed within a few clicks).
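As a rough illustration of these two checks, the sketch below runs Chi-squared tests on the traffic split and on one user-profile dimension; all counts are made-up placeholders to be replaced with your own experiment data.

```python
# Minimal sketch: chi-squared checks on the experiment set-up.
from scipy import stats

# (1) Sample ratio check: does the observed split match the intended 50/50?
observed = [50_480, 49_520]                  # users in control / treatment (placeholder counts)
expected = [sum(observed) / 2] * 2
chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"Traffic split check: chi2={chi2:.2f}, p={p:.4f}")  # a small p-value warrants investigation

# (2) Homogeneity of a user-profile dimension (e.g. gender) across groups.
#                      female   male
contingency = [[25_300, 25_180],             # control
               [24_900, 24_620]]             # treatment
chi2, p, dof, _ = stats.chi2_contingency(contingency)
print(f"Gender homogeneity check: chi2={chi2:.2f}, p={p:.4f}")
```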

5) Monitor Key Evaluation Metrics

Although the samples are usually insufficient to establish statistical significance in the first few days, it is necessary to keep close track of the descriptive difference between the treatment and control groups. If the key metrics and guardrail metrics drop significantly and continuously, we should consider taking a step back and re-evaluating the experiment for the benefit of our product and users, as statistically significant negative results are likely just a matter of time.
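One way to keep such a daily eye on a metric, assuming a per-user log export with hypothetical date, group, and metric columns, is a simple pandas aggregation like the following sketch.

```python
# Minimal sketch: track the daily descriptive difference between groups.
# Assumes a hypothetical export with columns: date, group ("control"/"treatment"), metric.
import pandas as pd

logs = pd.read_csv("experiment_logs.csv")    # hypothetical export from the platform

daily = (
    logs.groupby(["date", "group"])["metric"]
        .mean()
        .unstack("group")                    # columns become "control" and "treatment"
)
daily["relative_diff"] = (daily["treatment"] - daily["control"]) / daily["control"]
print(daily[["control", "treatment", "relative_diff"]])
```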

After the End of the Experiment

6) Read the P-values & Magnitudes of Difference & Minimum Detectable Effect, Dive Deeper When Surprised

Data scientists collect the descriptive differences as well as the p-values between the treatment and control groups. Additionally, I suggest revisiting our product hypotheses and assessing whether we are surprised by the experiment results. If not, we can probably proceed to the next step. However, if we are indeed uncovering something beyond our expectations (e.g. a difference that was expected to be significantly positive turns out insignificant or significantly negative), we should dig deeper to understand the reasons by breaking the result down into key business dimensions (e.g. new/old users, channels, geographic regions).
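As an illustration of such a breakdown, the sketch below slices a hypothetical per-user result table by a user_type dimension and runs Welch's t-tests per segment; the file and column names are assumptions, not a standard platform export.

```python
# Minimal sketch: break a surprising result down by a business dimension.
# Assumes a hypothetical frame with columns: group ("control"/"treatment"), user_type, metric.
import pandas as pd
from scipy import stats

df = pd.read_csv("experiment_results.csv")   # hypothetical export

for user_type, seg in df.groupby("user_type"):            # e.g. "new" / "old"
    treat = seg.loc[seg["group"] == "treatment", "metric"]
    ctrl = seg.loc[seg["group"] == "control", "metric"]
    diff = treat.mean() - ctrl.mean()
    p = stats.ttest_ind(treat, ctrl, equal_var=False).pvalue  # Welch's t-test
    print(f"{user_type}: diff={diff:+.4f}, p={p:.4f}")
# Keep in mind that slicing into many segments inflates the false-positive rate.
```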

It is worth noting that a common reason for insignificant results is insufficient statistical power. Apart from increasing the sample size, we can also try metrics with less variance or the CUPED (Controlled-experiment Using Pre-Existing Data) method. Additionally, comparing the empirical MDE (how large the difference needs to be to produce a significant outcome, generally provided by the platform) against the actual difference in key metrics often reveals how volatile the metrics are. This is a great reference when studying user behavior or designing similar experiments in the future.
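For readers unfamiliar with CUPED, here is a minimal sketch of the core idea: adjust the in-experiment metric with its pre-experiment counterpart to reduce variance before testing. The data frame and column names are hypothetical.

```python
# Minimal sketch of CUPED: use the pre-experiment value of the same metric
# as a covariate to shrink variance, then re-run the test.
# Assumes hypothetical columns: group, metric (during experiment), metric_pre (before experiment).
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("experiment_results.csv")   # hypothetical export

theta = np.cov(df["metric"], df["metric_pre"])[0, 1] / df["metric_pre"].var()
df["metric_cuped"] = df["metric"] - theta * (df["metric_pre"] - df["metric_pre"].mean())

for col in ["metric", "metric_cuped"]:
    treat = df.loc[df["group"] == "treatment", col]
    ctrl = df.loc[df["group"] == "control", col]
    p = stats.ttest_ind(treat, ctrl, equal_var=False).pvalue
    print(f"{col}: variance={df[col].var():.4f}, p={p:.4f}")  # CUPED column should show lower variance
```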

7) Formulate Product Recommendations

With all the data being collected, we can proceed to formulate the experiment conclusion and derive the product recommendation on whether we should gradually expose more traffic to the treatment or select a more conservative strategy.

8) Write the Report

Lastly, data scientists are expected to compose and archive a report covering (1) product background and feature hypotheses; (2) experiment design; (3) result analysis; and (4) product recommendations.

Summary

The above-mentioned 8-step A/B test workflow helps minimize substantive errors and ensures a trustworthy controlled experiment. Different tech companies may vary in the details, but it can serve as a general framework, so that we spend more time exploring important product questions (one of the core values of data scientists and the most fruitful path for personal growth) rather than resolving procedural issues.

Note

  1. A/A Variation: A/A variation is the difference in key metrics between the control group (group A) and another control group (~group A). Since there is no actual difference between the two, the A/A variation is not expected to deviate from 0 in a statistically significant manner.
  2. Minimum Detectable Effects (MDE): MDE measures the smallest improvement over the baseline we are willing to detect in a controlled experiment.

Typical 8-Step A/B Test Workflow for Data Scientists in 2022 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


