
Feature selection and validated predictive performance in the domain of Legionella pneumophila: a comparative study

BACKGROUND: Genetic comparisons of clinical and environmental Legionella strains form an essential part of outbreak investigations. DNA microarrays often comprise many DNA markers (features). Feature selection and the development of prediction models are particularly challenging in this domain with...


Bibliographic Details
Main authors: van der Ploeg, Tjeerd; Steyerberg, Ewout W.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4782323/
https://www.ncbi.nlm.nih.gov/pubmed/26951763
http://dx.doi.org/10.1186/s13104-016-1945-2
author van der Ploeg, Tjeerd
Steyerberg, Ewout W.
collection PubMed
description BACKGROUND: Genetic comparisons of clinical and environmental Legionella strains form an essential part of outbreak investigations. DNA microarrays often comprise many DNA markers (features). Feature selection and the development of prediction models are particularly challenging in this domain with many variables and comparatively few subjects or data points. We aimed to compare modeling strategies to develop prediction models for classifying infections as clinical or environmental. METHODS: We applied a bootstrap strategy for preselecting important features to a database containing 222 Legionella pneumophila strains with 448 continuous markers and a dichotomous outcome (clinical or environmental). Feature selection was done with 50 bootstrap samples, resulting in a top 10 of the most important features for each of four modeling techniques: classification and regression trees (CART), random forests (RF), support vector machines (SVM) and least absolute shrinkage and selection operator (LASSO). Validation was done in a second bootstrap re-sampling loop (200×) for evaluation of discriminatory model performance according to the AUC. RESULTS: The top 5 selected features differed considerably between the various modeling techniques, with only one common feature (“LePn.007B8”). The mean validated AUC-values of the SVM model and the CART model were 0.859 and 0.873 respectively. The LASSO and the RF model showed higher validated AUC-values (0.925 and 0.975 respectively). CONCLUSIONS: In the domain of Legionella pneumophila, which comprises many potential features for classifying infections as clinical or environmental, the RF and LASSO techniques provide good prediction models. The identification of potentially biologically relevant features is highly dependent on the technique used, and should hence be interpreted with caution.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13104-016-1945-2) contains supplementary material, which is available to authorized users.
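
The two-loop bootstrap procedure summarized in the METHODS of the description (50 bootstrap samples to preselect features, then 200 bootstrap samples to validate the AUC) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code. A univariate AUC ranking stands in for the CART, RF, SVM and LASSO importance measures; the validation loop scores a simple signed-sum model on out-of-bag rows, which is one common choice that the abstract does not confirm; the data are synthetic; and every function name here is hypothetical.

```python
import random
from collections import Counter


def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def importance(X, y, j):
    """Stand-in importance (assumption): distance of feature j's AUC from 0.5."""
    return abs(auc([row[j] for row in X], y) - 0.5)


def select_features(X, y, n_boot=50, top_k=10, seed=0):
    """Loop 1: count how often each feature lands in a bootstrap top-k."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = Counter()
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip degenerate samples containing a single class
        ranked = sorted(range(p), key=lambda j: importance(Xb, yb, j), reverse=True)
        counts.update(ranked[:top_k])
    return [j for j, _ in counts.most_common(top_k)]


def validate(X, y, features, n_boot=200, seed=1):
    """Loop 2: fit on each bootstrap sample, score the out-of-bag rows."""
    rng = random.Random(seed)
    n = len(X)
    aucs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        yb, y_oob = [y[i] for i in idx], [y[i] for i in oob]
        if len(set(yb)) < 2 or len(set(y_oob)) < 2:
            continue
        # "Model" (assumption): learn only the direction of each feature in-bag.
        signs = {j: 1.0 if auc([X[i][j] for i in idx], yb) >= 0.5 else -1.0
                 for j in features}
        scores = [sum(signs[j] * X[i][j] for j in features) for i in oob]
        aucs.append(auc(scores, y_oob))
    return sum(aucs) / len(aucs)


def make_demo_data(n=60, p=20, seed=42):
    """Synthetic data: only features 0 and 1 carry signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = []
    for label in y:
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        row[0] += 2.5 * label
        row[1] -= 2.5 * label
        X.append(row)
    return X, y


X, y = make_demo_data()
selected = select_features(X, y, n_boot=50, top_k=3)
mean_auc = validate(X, y, selected, n_boot=200)
print("selected features:", selected)
print("mean out-of-bag AUC:", round(mean_auc, 3))
```

Because the features in this sketch are chosen on the full data set before the validation loop runs, the validated AUC it reports may be optimistic; repeating the selection inside each validation sample is the usual way to reduce that bias.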
format Online
Article
Text
id pubmed-4782323
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4782323 2016-03-09 Feature selection and validated predictive performance in the domain of Legionella pneumophila: a comparative study van der Ploeg, Tjeerd Steyerberg, Ewout W. BMC Res Notes Research Article BioMed Central 2016-03-08 /pmc/articles/PMC4782323/ /pubmed/26951763 http://dx.doi.org/10.1186/s13104-016-1945-2 Text en
© van der Ploeg and Steyerberg. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Feature selection and validated predictive performance in the domain of Legionella pneumophila: a comparative study
topic Research Article