Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference
Nonignorable technical variation is commonly observed across data from multiple experimental runs, platforms, or studies. These so-called batch effects can lead to difficulty in merging data from multiple sources, as they can severely bias the outcome of the analysis. Many groups have developed appr...
Main authors: | Li, Tenglong; Zhang, Yuqing; Patil, Prasad; Johnson, W Evan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10449015/ https://www.ncbi.nlm.nih.gov/pubmed/34893807 http://dx.doi.org/10.1093/biostatistics/kxab039 |
author | Li, Tenglong; Zhang, Yuqing; Patil, Prasad; Johnson, W Evan |
---|---|
collection | PubMed |
description | Nonignorable technical variation is commonly observed across data from multiple experimental runs, platforms, or studies. These so-called batch effects can lead to difficulty in merging data from multiple sources, as they can severely bias the outcome of the analysis. Many groups have developed approaches for removing batch effects from data, usually by accommodating batch variables into the analysis (one-step correction) or by preprocessing the data prior to the formal or final analysis (two-step correction). One-step correction is often desirable due to its simplicity, but its flexibility is limited and it can be difficult to include batch variables uniformly when an analysis has multiple stages. Two-step correction allows for richer models of batch mean and variance. However, prior investigation has indicated that two-step correction can lead to incorrect statistical inference in downstream analysis. Generally speaking, two-step approaches introduce a correlation structure in the corrected data, which, if ignored, may lead to either exaggerated or diminished significance in downstream applications such as differential expression analysis. Here, we provide more intuitive and more formal evaluations of the impacts of two-step batch correction compared to existing literature. We demonstrate that the undesired impacts of two-step correction (exaggerated or diminished significance) depend on both the nature of the study design and the batch effects. We also provide strategies for overcoming these negative impacts in downstream analyses using the estimated correlation matrix of the corrected data. We compare the results of our proposed workflow with the results from other published one-step and two-step methods and show that our methods lead to more consistent false discovery controls and power of detection across a variety of batch effect scenarios. Software for our method is available through GitHub (https://github.com/jtleek/sva-devel) and will be available in future versions of the sva R package in the Bioconductor project (https://bioconductor.org/packages/release/bioc/html/sva.html). |
format | Online Article Text |
id | pubmed-10449015 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10449015 2023-08-25 Biostatistics Article Oxford University Press 2021-12-10 /pmc/articles/PMC10449015/ /pubmed/34893807 http://dx.doi.org/10.1093/biostatistics/kxab039 Text en © The Author 2021. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference |
topic | Article |