
Batch Effect Confounding Leads to Strong Bias in Performance Estimates Obtained by Cross-Validation

BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences (“batch effects”) as well as differences in sample com...

Bibliographic Details
Main authors:	Soneson, Charlotte, Gerster, Sarah, Delorenzi, Mauro
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4072626/ https://www.ncbi.nlm.nih.gov/pubmed/24967636 http://dx.doi.org/10.1371/journal.pone.0100335

author	Soneson, Charlotte Gerster, Sarah Delorenzi, Mauro
collection	PubMed
description	BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences (“batch effects”) as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., ‘control’) or group 2 (e.g., ‘treated’). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, is performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.
format	Online Article Text
id	pubmed-4072626
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4072626 2014-07-02 Soneson, Charlotte Gerster, Sarah Delorenzi, Mauro PLoS One Research Article Public Library of Science 2014-06-26 /pmc/articles/PMC4072626/ /pubmed/24967636 http://dx.doi.org/10.1371/journal.pone.0100335 Text en © 2014 Soneson et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Batch Effect Confounding Leads to Strong Bias in Performance Estimates Obtained by Cross-Validation
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4072626/ https://www.ncbi.nlm.nih.gov/pubmed/24967636 http://dx.doi.org/10.1371/journal.pone.0100335
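
The effect named in the title and description can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a nearest-centroid classifier in place of the classifiers listed in the description, plain k-fold cross-validation in place of the nested scheme, and purely synthetic Gaussian features in place of real gene expression data. All function names and parameter values (`simulate`, `batch_shift`, `group_shift`, and so on) are hypothetical. With `group_shift=0.0` there is no true group difference, so any apparent accuracy comes from the batch effect alone.

```python
import random
from statistics import mean


def simulate(n, confounded, rng, n_features=20, n_batch_features=10,
             batch_shift=2.0, group_shift=0.0):
    """Return (X, y): Gaussian noise plus a batch shift and an optional group shift."""
    X, y = [], []
    for i in range(n):
        group = i % 2
        # Confounded: batch equals group. Otherwise batches are balanced within groups.
        batch = group if confounded else (i // 2) % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_batch_features):
            row[j] += batch_shift * batch      # technical (batch) effect
        row[-1] += group_shift * group         # true group signal (zero by default)
        X.append(row)
        y.append(group)
    return X, y


def fit_centroids(X, y):
    """Compute the per-feature mean of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    return mean(1.0 if predict(centroids, x) == lab else 0.0 for x, lab in zip(X, y))


def cv_accuracy(X, y, k=5):
    """Plain k-fold cross-validation on a single (possibly confounded) data set."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))
        X_tr = [x for i, x in enumerate(X) if i not in test_idx]
        y_tr = [lab for i, lab in enumerate(y) if i not in test_idx]
        X_te = [x for i, x in enumerate(X) if i in test_idx]
        y_te = [lab for i, lab in enumerate(y) if i in test_idx]
        scores.append(accuracy(fit_centroids(X_tr, y_tr), X_te, y_te))
    return mean(scores)


rng = random.Random(0)
# Training data: group and batch are fully confounded.
X_train, y_train = simulate(100, confounded=True, rng=rng)
# Independent data: batch is unrelated to group.
X_new, y_new = simulate(200, confounded=False, rng=rng)

cv_acc = cv_accuracy(X_train, y_train)
new_acc = accuracy(fit_centroids(X_train, y_train), X_new, y_new)
print(f"cross-validated accuracy:     {cv_acc:.2f}")
print(f"accuracy on independent data: {new_acc:.2f}")
```

In this toy setting the cross-validated estimate is expected to be far higher than the accuracy on the independent data, because the classifier has learned to separate batches rather than groups.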
