
Variable selection and validation in multivariate modelling

MOTIVATION: Validation of variable selection and predictive performance is crucial in construction of robust multivariate models that generalize well, minimize overfitting and facilitate interpretation of results. Inappropriate variable selection leads instead to selection bias, thereby increasing t...


Bibliographic Details
Main authors: Shi, Lin; Westerhuis, Johan A; Rosén, Johan; Landberg, Rikard; Brunius, Carl
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6419897/
https://www.ncbi.nlm.nih.gov/pubmed/30165467
http://dx.doi.org/10.1093/bioinformatics/bty710
_version_ 1783404021022720000
author Shi, Lin
Westerhuis, Johan A
Rosén, Johan
Landberg, Rikard
Brunius, Carl
collection PubMed
description MOTIVATION: Validation of variable selection and predictive performance is crucial in construction of robust multivariate models that generalize well, minimize overfitting and facilitate interpretation of results. Inappropriate variable selection leads instead to selection bias, thereby increasing the risk of model overfitting and false positive discoveries. Although several algorithms exist to identify a minimal set of most informative variables (i.e. the minimal-optimal problem), few can select all variables related to the research question (i.e. the all-relevant problem). Robust algorithms combining identification of both minimal-optimal and all-relevant variables with proper cross-validation are urgently needed. RESULTS: We developed the MUVR algorithm to improve predictive performance and minimize overfitting and false positives in multivariate analysis. In the MUVR algorithm, minimal variable selection is achieved by performing recursive variable elimination in a repeated double cross-validation (rdCV) procedure. The algorithm supports partial least squares and random forest modelling, and simultaneously identifies minimal-optimal and all-relevant variable sets for regression, classification and multilevel analyses. Using three authentic omics datasets, MUVR yielded parsimonious models with minimal overfitting and improved model performance compared with state-of-the-art rdCV. Moreover, MUVR showed advantages over other variable selection algorithms, i.e. Boruta and VSURF, including simultaneous variable selection and validation scheme and wider applicability. AVAILABILITY AND IMPLEMENTATION: Algorithms, data, scripts and tutorial are open source and available as an R package (‘MUVR’) at https://gitlab.com/CarlBrunius/MUVR.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6419897
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6419897 2019-03-20 Variable selection and validation in multivariate modelling Shi, Lin Westerhuis, Johan A Rosén, Johan Landberg, Rikard Brunius, Carl Bioinformatics Original Papers Oxford University Press 2019-03-15 2018-08-28 /pmc/articles/PMC6419897/ /pubmed/30165467 http://dx.doi.org/10.1093/bioinformatics/bty710 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Variable selection and validation in multivariate modelling
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6419897/
https://www.ncbi.nlm.nih.gov/pubmed/30165467
http://dx.doi.org/10.1093/bioinformatics/bty710