Feature selection and validated predictive performance in the domain of Legionella pneumophila: a comparative study


Bibliographic Details
Main Authors: van der Ploeg, Tjeerd, Steyerberg, Ewout W.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4782323/
https://www.ncbi.nlm.nih.gov/pubmed/26951763
http://dx.doi.org/10.1186/s13104-016-1945-2
Description
Summary: BACKGROUND: Genetic comparisons of clinical and environmental Legionella strains form an essential part of outbreak investigations. DNA microarrays often comprise many DNA markers (features). Feature selection and the development of prediction models are particularly challenging in this domain, with many variables and comparatively few subjects or data points. We aimed to compare modeling strategies to develop prediction models for classifying infections as clinical or environmental. METHODS: We applied a bootstrap strategy for preselecting important features to a database containing 222 Legionella pneumophila strains with 448 continuous markers and a dichotomous outcome (clinical or environmental). Feature selection was done with 50 bootstrap samples, yielding the 10 most important features for each of four modeling techniques: classification and regression trees (CART), random forests (RF), support vector machines (SVM) and the least absolute shrinkage and selection operator (LASSO). Validation was done in a second bootstrap re-sampling loop (200×) to evaluate discriminatory model performance according to the AUC. RESULTS: The top 5 selected features differed considerably between the modeling techniques, with only one common feature (“LePn.007B8”). The mean validated AUC values of the SVM model and the CART model were 0.859 and 0.873, respectively. The LASSO and the RF models showed higher validated AUC values (0.925 and 0.975, respectively). CONCLUSIONS: In the domain of Legionella pneumophila, which comprises many potential features for classifying infections as clinical or environmental, the RF and LASSO techniques provide good prediction models. The identification of potentially biologically relevant features is highly dependent on the technique used, and should hence be interpreted with caution.
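The bootstrap preselection step described under METHODS can be illustrated with a minimal sketch. This is not the authors' implementation: the study ranks features with CART, RF, SVM and LASSO, whereas the sketch below substitutes a simple univariate score (how far a single feature's AUC lies from 0.5) so that it runs with the Python standard library alone. The function names, the synthetic data and the parameter choices are all hypothetical.

```python
import random
from collections import Counter


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def univariate_importance(X, y, j):
    """Assumed stand-in for model-based importance: distance of feature j's AUC from 0.5."""
    return abs(auc([row[j] for row in X], y) - 0.5)


def bootstrap_top_features(X, y, n_boot=50, top_k=10, seed=0):
    """Return the top_k features that most often rank in the top_k across bootstrap samples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = Counter()
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # draw rows with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:  # skip a degenerate sample containing one class only
            continue
        Xb = [X[i] for i in idx]
        ranked = sorted(range(p),
                        key=lambda j: univariate_importance(Xb, yb, j),
                        reverse=True)
        counts.update(ranked[:top_k])
    return [j for j, _ in counts.most_common(top_k)]


# Synthetic illustration (not the Legionella data): 60 strains, 20 markers,
# where only marker 0 differs between the two outcome classes.
data_rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[data_rng.gauss(3.0 * label if j == 0 else 0.0, 1.0) for j in range(20)]
     for label in y]
selected = bootstrap_top_features(X, y, n_boot=50, top_k=3)
```

On this synthetic data the informative marker is expected to appear in the selected set, while the remaining slots are filled by noise markers that vary with the seed. That instability loosely mirrors the abstract's caution about interpreting selected features.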
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13104-016-1945-2) contains supplementary material, which is available to authorized users.