Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference

Nonignorable technical variation is commonly observed across data from multiple experimental runs, platforms, or studies. These so-called batch effects can lead to difficulty in merging data from multiple sources, as they can severely bias the outcome of the analysis. Many groups have developed appr...

Bibliographic Details
Main Authors: Li, Tenglong, Zhang, Yuqing, Patil, Prasad, Johnson, W Evan
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10449015/
https://www.ncbi.nlm.nih.gov/pubmed/34893807
http://dx.doi.org/10.1093/biostatistics/kxab039
author Li, Tenglong
Zhang, Yuqing
Patil, Prasad
Johnson, W Evan
collection PubMed
description Nonignorable technical variation is commonly observed across data from multiple experimental runs, platforms, or studies. These so-called batch effects can lead to difficulty in merging data from multiple sources, as they can severely bias the outcome of the analysis. Many groups have developed approaches for removing batch effects from data, usually by accommodating batch variables into the analysis (one-step correction) or by preprocessing the data prior to the formal or final analysis (two-step correction). One-step correction is often desirable due to its simplicity, but its flexibility is limited and it can be difficult to include batch variables uniformly when an analysis has multiple stages. Two-step correction allows for richer models of batch mean and variance. However, prior investigation has indicated that two-step correction can lead to incorrect statistical inference in downstream analysis. Generally speaking, two-step approaches introduce a correlation structure in the corrected data, which, if ignored, may lead to either exaggerated or diminished significance in downstream applications such as differential expression analysis. Here, we provide more intuitive and more formal evaluations of the impacts of two-step batch correction compared to existing literature. We demonstrate that the undesired impacts of two-step correction (exaggerated or diminished significance) depend on both the nature of the study design and the batch effects. We also provide strategies for overcoming these negative impacts in downstream analyses using the estimated correlation matrix of the corrected data. We compare the results of our proposed workflow with the results from other published one-step and two-step methods and show that our methods lead to more consistent false discovery controls and power of detection across a variety of batch effect scenarios.
Software for our method is available through GitHub (https://github.com/jtleek/sva-devel) and will be available in future versions of the sva R package in the Bioconductor project (https://bioconductor.org/packages/release/bioc/html/sva.html).
format Online
Article
Text
id pubmed-10449015
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10449015 2023-08-25 Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference Li, Tenglong Zhang, Yuqing Patil, Prasad Johnson, W Evan Biostatistics Article Oxford University Press 2021-12-10 /pmc/articles/PMC10449015/ /pubmed/34893807 http://dx.doi.org/10.1093/biostatistics/kxab039 Text en © The Author 2021. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference
topic Article