nestedcv: an R package for fast implementation of nested cross-validation with embedded feature selection designed for transcriptomics and high-dimensional data

MOTIVATION: Although machine learning models are commonly used in medical research, many analyses implement a simple partition into training data and hold-out test data, with cross-validation (CV) for tuning of model hyperparameters. Nested CV with embedded feature selection is especially suited to...

Bibliographic Details
Main authors: Lewis, Myles J; Spiliopoulou, Athina; Goldmann, Katriona; Pitzalis, Costantino; McKeigue, Paul; Barnes, Michael R
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10125905/
https://www.ncbi.nlm.nih.gov/pubmed/37113250
http://dx.doi.org/10.1093/bioadv/vbad048
collection PubMed
description MOTIVATION: Although machine learning models are commonly used in medical research, many analyses implement a simple partition into training data and hold-out test data, with cross-validation (CV) for tuning of model hyperparameters. Nested CV with embedded feature selection is especially suited to biomedical data where the sample size is frequently limited, but the number of predictors may be significantly larger (P ≫ n). RESULTS: The nestedcv R package implements fully nested k × l-fold CV for lasso and elastic-net regularized linear models via the glmnet package and supports a large array of other machine learning models via the caret framework. Inner CV is used to tune models and outer CV is used to determine model performance without bias. Fast filter functions for feature selection are provided and the package ensures that filters are nested within the outer CV loop to avoid information leakage from performance test sets. Measurement of performance by outer CV is also used to implement Bayesian linear and logistic regression models using the horseshoe prior over parameters to encourage a sparse model and determine unbiased model accuracy. AVAILABILITY AND IMPLEMENTATION: The R package nestedcv is available from CRAN: https://CRAN.R-project.org/package=nestedcv.
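The nesting described in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not use or reproduce the nestedcv API: every function name below is hypothetical, the filter (absolute difference in class means) and the nearest-centroid classifier are simple stand-ins for the package's filter functions and glmnet/caret models, and the data are simulated. The point it shows is that feature filtering and hyperparameter tuning see only the outer-training rows, so each outer test fold stays untouched until the final scoring step.

```python
import random
from statistics import fmean

def make_folds(n, k, seed):
    """Shuffle positions 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def filter_features(X, y, rows, n_keep):
    """Keep the n_keep features with the largest class-mean gap, using `rows` only.
    Assumes both classes occur in `rows` (true for the balanced demo data)."""
    def gap(j):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        return abs(fmean(pos) - fmean(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:n_keep]

def fit_predict(X, y, train, test, feats):
    """Nearest-centroid classifier restricted to the selected features."""
    cents = {}
    for c in (0, 1):
        rows = [i for i in train if y[i] == c]
        cents[c] = [fmean(X[i][j] for i in rows) for j in feats]
    preds = []
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
                for c in cents}
        preds.append(min(dist, key=dist.get))
    return preds

def accuracy(y, rows, preds):
    return sum(y[i] == p for i, p in zip(rows, preds)) / len(rows)

def nested_cv(X, y, grid=(2, 5, 10), n_outer=5, n_inner=4, seed=1):
    """Outer folds estimate performance; inner folds tune n_keep.
    The filter is re-run inside every training split, never on test rows."""
    outer_scores = []
    for o, test in enumerate(make_folds(len(y), n_outer, seed)):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        inner = make_folds(len(train), n_inner, seed + o + 1)

        def inner_score(n_keep):
            accs = []
            for fold in inner:
                val = [train[p] for p in fold]
                val_set = set(val)
                fit = [i for i in train if i not in val_set]
                feats = filter_features(X, y, fit, n_keep)
                accs.append(accuracy(y, val, fit_predict(X, y, fit, val, feats)))
            return fmean(accs)

        best = max(grid, key=inner_score)              # tuned on inner CV only
        feats = filter_features(X, y, train, best)     # refit filter on outer-train
        preds = fit_predict(X, y, train, test, feats)
        outer_scores.append(accuracy(y, test, preds))  # scored on unseen rows
    return outer_scores

def simulate(n=60, p=200, n_signal=5, seed=0):
    """Toy P >> n data: only the first n_signal features carry class signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(1.5 * y[i] if j < n_signal else 0.0, 1.0) for j in range(p)]
         for i in range(n)]
    return X, y

X, y = simulate()
scores = nested_cv(X, y)
print(len(scores), round(fmean(scores), 2))
```

Running the filter once on all samples before splitting would let the outer test rows influence which features are kept, which is the information leakage the abstract warns about; here `filter_features` only ever receives training-row indices.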
format Online
Article
Text
id pubmed-10125905
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10125905 2023-04-26 Bioinform Adv Application Note Oxford University Press 2023-04-13 /pmc/articles/PMC10125905/ /pubmed/37113250 http://dx.doi.org/10.1093/bioadv/vbad048 Text en © The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Application Note