
Variable selection in omics data: A practical evaluation of small sample sizes

In omics experiments, variable selection involves a large number of metabolites/genes and a small number of samples (the n < p problem). The ultimate goal is often the identification of one or a few features that differ among conditions: a biomarker. Complicating biomarker identification...


Bibliographic Details
Main authors:	Kirpich, Alexander; Ainsworth, Elizabeth A.; Wedow, Jessica M.; Newman, Jeremy R. B.; Michailidis, George; McIntyre, Lauren M.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2018
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6013185/ https://www.ncbi.nlm.nih.gov/pubmed/29927942 http://dx.doi.org/10.1371/journal.pone.0197910

author	Kirpich, Alexander; Ainsworth, Elizabeth A.; Wedow, Jessica M.; Newman, Jeremy R. B.; Michailidis, George; McIntyre, Lauren M.
author_sort	Kirpich, Alexander
collection	PubMed
description	In omics experiments, variable selection involves a large number of metabolites/genes and a small number of samples (the n < p problem). The ultimate goal is often the identification of one or a few features that differ among conditions: a biomarker. Complicating biomarker identification, the p variables often have a correlation structure arising from the biology of the experiment, which makes it difficult to distinguish causal compounds from correlated compounds. Additionally, there may be elements in the experimental design (blocks, batches) that introduce structure in the data. While this problem has been discussed in the literature and various strategies have been proposed, the overfitting problems concomitant with such approaches are rarely acknowledged. Instead of viewing a single omics experiment as a definitive test for a biomarker, an unrealistic analytical goal, we propose to view such studies as screening studies where the goal is to reduce the number of features present in the second round of testing and to limit the Type II error. Using this perspective, the performance of LASSO, ridge regression and Elastic Net was compared with the performance of an ANOVA via a simulation study and two real data comparisons. Interestingly, a dramatic increase in the number of features had no effect on Type I error for the ANOVA approach. ANOVA, even without multiple test correction, has a low false positive rate in the scenarios tested. The Elastic Net has an inflated Type I error (from 10 to 50%) for small numbers of features, which increases with sample size. The Type II error rate for the ANOVA is comparable to or lower than that for the Elastic Net, leading us to conclude that an ANOVA is an effective analytical tool for the initial screening of features in omics experiments.
format	Online Article Text
id	pubmed-6013185
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-6013185 2018-07-06 PLoS One Research Article Public Library of Science 2018-06-21 /pmc/articles/PMC6013185/ /pubmed/29927942 http://dx.doi.org/10.1371/journal.pone.0197910 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
title	Variable selection in omics data: A practical evaluation of small sample sizes
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6013185/ https://www.ncbi.nlm.nih.gov/pubmed/29927942 http://dx.doi.org/10.1371/journal.pone.0197910
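
The description field above reports that increasing the number of features did not change the Type I error of the per-feature ANOVA. The following is a minimal sketch of why that can hold, not the authors' simulation: it assumes two conditions with 5 samples each, independent standard-normal features with no true differences, and a 5% threshold without multiple test correction. The function names, group size, and seeds are illustrative assumptions.

```python
import random
import statistics

# Assumed design: two conditions, 5 samples each (not taken from the paper).
N_PER_GROUP = 5
# Upper 5% critical value of F with (1, 8) degrees of freedom,
# which equals the square of the two-sided t critical value 2.306 for 8 df.
F_CRIT_1_8 = 5.318


def f_statistic(group_a, group_b):
    """One-way ANOVA F statistic for a single feature measured in two groups."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = statistics.fmean(group_a)
    mean_b = statistics.fmean(group_b)
    grand_mean = (sum(group_a) + sum(group_b)) / (n_a + n_b)
    ss_between = n_a * (mean_a - grand_mean) ** 2 + n_b * (mean_b - grand_mean) ** 2
    ss_within = sum((x - mean_a) ** 2 for x in group_a) + sum(
        (x - mean_b) ** 2 for x in group_b
    )
    df_between = 1
    df_within = n_a + n_b - 2
    return (ss_between / df_between) / (ss_within / df_within)


def null_false_positive_rate(n_features, seed=0):
    """Fraction of null features flagged at the 5% level, one test per feature."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_features):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        if f_statistic(group_a, group_b) > F_CRIT_1_8:
            flagged += 1
    return flagged / n_features


if __name__ == "__main__":
    for p in (200, 5000):
        print(p, null_false_positive_rate(p, seed=1))
```

Because each feature is tested separately, the per-feature false positive rate is set by the threshold rather than by the number of features; only the expected count of false positives grows with p. Correlated features, blocks, or batches, which the description mentions, are not modelled in this sketch.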
