
Approaches to analyzing binary data for large-scale A/B testing

An industry-academic collaboration was established to evaluate the choice of statistical test and study design for A/B testing in larger-scale industry experiments. Specifically, the standard approach at the industry partner was to apply a t-test for all outcomes, both continuous and binary, and to...


Bibliographic Details
Main Authors: Zhou, Wenru, Kroehl, Miranda, Meier, Maxene, Kaizer, Alexander
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9982610/
https://www.ncbi.nlm.nih.gov/pubmed/36875556
http://dx.doi.org/10.1016/j.conctc.2023.101091
author Zhou, Wenru
Kroehl, Miranda
Meier, Maxene
Kaizer, Alexander
collection PubMed
description An industry-academic collaboration was established to evaluate the choice of statistical test and study design for A/B testing in larger-scale industry experiments. Specifically, the standard approach at the industry partner was to apply a t-test to all outcomes, both continuous and binary, and to apply naïve interim monitoring strategies whose potential implications for operating characteristics such as power and type I error rates had not been evaluated. Although many papers have summarized the robustness of the t-test, an evaluation of its performance in the A/B testing context of large-scale proportion data, with or without interim analyses, is needed. Investigating the effect of interim analyses on the robustness of the t-test is important because interim analyses rely on a fraction of the total sample size, and one should ensure that desired properties are maintained when a t-test is used not just at the end of the study but also for making interim decisions. Through simulation studies, the performance of the t-test, the Chi-squared test, and the Chi-squared test with Yates's correction when applied to binary outcomes data is evaluated. Further, interim monitoring through a naïve approach with no correction for multiple testing is compared with monitoring using the O'Brien-Fleming boundary in designs that allow early termination for futility, difference, or both. Results indicate that, with the large sample sizes used in industrial A/B tests, the t-test achieves power and type I error rates similar to those of the Chi-squared tests for binary outcomes data, with and without interim monitoring, and that naïve interim monitoring without corrections leads to poorly performing studies.
format Online
Article
Text
id pubmed-9982610
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9982610 2023-03-04 Approaches to analyzing binary data for large-scale A/B testing Zhou, Wenru Kroehl, Miranda Meier, Maxene Kaizer, Alexander Contemp Clin Trials Commun Article Elsevier 2023-02-16 /pmc/articles/PMC9982610/ /pubmed/36875556 http://dx.doi.org/10.1016/j.conctc.2023.101091 Text en © 2023 The Authors. Published by Elsevier Inc.
https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Approaches to analyzing binary data for large-scale A/B testing
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9982610/
https://www.ncbi.nlm.nih.gov/pubmed/36875556
http://dx.doi.org/10.1016/j.conctc.2023.101091
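
The description field of this record refers to simulation studies that compare the t-test, the Chi-squared test, and the Chi-squared test with Yates's correction on binary outcomes, and that contrast naïve interim monitoring with the O'Brien-Fleming boundary. The Python sketch below illustrates that kind of simulation; it is not the authors' code. Everything in it is an assumption made for illustration: the function names, the 10% baseline rate, 5,000 users per arm, five equally spaced looks, the pooled-variance form of the t-test, the use of a two-proportion z-statistic at the interim looks, stopping for difference only (no futility rule), and the tabulated constant 2.040 for a two-sided 0.05 O'Brien-Fleming design with five looks.

```python
import numpy as np
from scipy import stats


def three_test_pvalues(x_a, n_a, x_b, n_b):
    """P-values of the t-test, Chi-squared test, and Yates-corrected
    Chi-squared test for x_a successes of n_a versus x_b of n_b."""
    p_a, p_b = x_a / n_a, x_b / n_b
    # Sample standard deviation of 0/1 data, written via the proportion.
    sd_a = np.sqrt(p_a * (1 - p_a) * n_a / (n_a - 1))
    sd_b = np.sqrt(p_b * (1 - p_b) * n_b / (n_b - 1))
    # Assumption: pooled-variance (Student) two-sample t-test.
    t_p = stats.ttest_ind_from_stats(p_a, sd_a, n_a, p_b, sd_b, n_b).pvalue
    table = np.array([[x_a, n_a - x_a], [x_b, n_b - x_b]])
    chi_p = stats.chi2_contingency(table, correction=False)[1]
    yates_p = stats.chi2_contingency(table, correction=True)[1]
    return t_p, chi_p, yates_p


def rejection_rates(p_a, p_b, n_per_arm, n_sims=2000, alpha=0.05, seed=0):
    """Share of simulated experiments in which each test rejects.
    With p_a == p_b this estimates type I error; otherwise power."""
    rng = np.random.default_rng(seed)
    x_a = rng.binomial(n_per_arm, p_a, size=n_sims)
    x_b = rng.binomial(n_per_arm, p_b, size=n_sims)
    pvals = np.array([three_test_pvalues(a, n_per_arm, b, n_per_arm)
                      for a, b in zip(x_a, x_b)])
    rates = (pvals < alpha).mean(axis=0)
    return dict(zip(("t", "chi2", "chi2_yates"), map(float, rates)))


def interim_type1_error(p, n_per_arm, n_sims=5000, seed=1):
    """Type I error with 5 equally spaced looks: naive |z| > 1.96 at
    every look versus an O'Brien-Fleming boundary (difference only)."""
    looks = 5
    rng = np.random.default_rng(seed)
    step = n_per_arm // looks
    # Cumulative successes per arm at each look, under the null.
    x_a = rng.binomial(step, p, size=(n_sims, looks)).cumsum(axis=1)
    x_b = rng.binomial(step, p, size=(n_sims, looks)).cumsum(axis=1)
    n = step * np.arange(1, looks + 1)
    pooled = (x_a + x_b) / (2 * n)
    z = (x_a - x_b) / n / np.sqrt(2 * pooled * (1 - pooled) / n)
    naive = (np.abs(z) > 1.96).any(axis=1).mean()
    # Assumed tabulated constant for 5 looks, two-sided alpha = 0.05.
    obf_bounds = 2.040 * np.sqrt(looks / np.arange(1, looks + 1))
    obf = (np.abs(z) > obf_bounds).any(axis=1).mean()
    return float(naive), float(obf)


if __name__ == "__main__":
    print("type I error:", rejection_rates(0.10, 0.10, 5000))
    print("power:", rejection_rates(0.10, 0.12, 5000))
    print("naive vs O'Brien-Fleming:", interim_type1_error(0.10, 5000))
```

Setting both arms to the same rate estimates type I error, and setting them to different rates estimates power. A fuller study would vary the baseline rate, effect size, sample size, and number of looks, and would add the futility stopping rules that the description mentions.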