Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data

Untargeted metabolomics is a powerful phenotyping tool for better understanding biological mechanisms involved in human pathology development and identifying early predictive biomarkers. This approach, based on multiple analytical platforms, such as mass spectrometry (MS), chemometrics and bioinform...

Bibliographic Details
Main Authors:	Grissa, Dhouha; Pétéra, Mélanie; Brandolini, Marion; Napoli, Amedeo; Comte, Blandine; Pujos-Guillot, Estelle
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2016
Subjects:	Molecular Biosciences
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4937038/ https://www.ncbi.nlm.nih.gov/pubmed/27458587 http://dx.doi.org/10.3389/fmolb.2016.00030

author	Grissa, Dhouha; Pétéra, Mélanie; Brandolini, Marion; Napoli, Amedeo; Comte, Blandine; Pujos-Guillot, Estelle
author_sort	Grissa, Dhouha
collection	PubMed
description	Untargeted metabolomics is a powerful phenotyping tool for better understanding the biological mechanisms involved in human pathology development and for identifying early predictive biomarkers. This approach, based on multiple analytical platforms, such as mass spectrometry (MS), chemometrics and bioinformatics, generates massive and complex data that need appropriate analyses to extract the biologically meaningful information. Despite the various tools available, it is still a challenge to handle such large and noisy datasets with a limited number of individuals without risking overfitting. Moreover, when the objective is to identify early predictive markers of a clinical outcome a few years before its occurrence, it becomes essential to use appropriate algorithms and workflows to discover subtle effects in this large amount of data. In this context, this work studies a workflow describing the general feature selection process, using knowledge discovery and data mining methodologies to propose advanced solutions for predictive biomarker discovery. The strategy focused on evaluating a combination of numeric-symbolic approaches for feature selection, with the objective of obtaining the best combination of metabolites producing an effective and accurate predictive model. Relying first on numerical approaches, especially machine learning methods (SVM-RFE, RF, RF-RFE) and univariate statistical analyses (ANOVA), a comparative study was performed on an original metabolomic dataset and reduced subsets. As the resampling method, leave-one-out cross-validation (LOOCV) was applied to minimize the risk of overfitting. The best k features obtained with different importance scores from the combination of these approaches were compared, which allowed the stability of the variables to be determined using Formal Concept Analysis. The results revealed the value of RF-Gini combined with ANOVA for feature selection, as these two complementary methods allowed the selection of the 48 best candidates for prediction. Using linear logistic regression on this reduced dataset gave the best performance in terms of prediction accuracy and number of false positives, with a model including the 5 top variables. These results therefore highlight the value of feature selection methods and the importance of working on reduced datasets for identifying predictive biomarkers derived from untargeted metabolomics data.
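The abstract combines univariate ANOVA ranking with leave-one-out cross-validation. The sketch below illustrates that idea only and is not the authors' code: the data are synthetic, every function name is hypothetical, and a nearest-centroid classifier is swapped in for the logistic regression and RF/SVM models so that the example needs only the Python standard library. Feature ranking is repeated inside each fold, so the held-out sample never influences which features are kept.

```python
import random
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k_features(X, y, k):
    """Indices of the k features with the largest F statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Return the class whose centroid, on the selected features, is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        dist = sum((x[j] - mean(r[j] for r in rows)) ** 2 for j in features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy with feature selection redone in every fold."""
    hits = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = top_k_features(X_train, y_train, k)
        hits += nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]
    return hits / len(X)


def make_toy_data(n_per_class=15, n_features=50, n_informative=5, seed=0):
    """Synthetic data (an assumption): only the first features differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in range(n_informative):
                row[j] += 2.0 * label
            X.append(row)
            y.append(label)
    return X, y


X, y = make_toy_data()
selected = top_k_features(X, y, 5)
accuracy = loocv_accuracy(X, y, 5)
print("selected features:", sorted(selected))
print("LOOCV accuracy:", round(accuracy, 3))
```

Ranking the features once on the full dataset and only then cross-validating would leak information from the held-out sample, which is one form of the overfitting risk the abstract mentions for datasets with many variables and few individuals.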
format	Online Article Text
id	pubmed-4937038
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-4937038 2016-07-25 Front Mol Biosci Molecular Biosciences Frontiers Media S.A. 2016-07-08 /pmc/articles/PMC4937038/ /pubmed/27458587 http://dx.doi.org/10.3389/fmolb.2016.00030 Text en Copyright © 2016 Grissa, Pétéra, Brandolini, Napoli, Comte and Pujos-Guillot. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data
topic	Molecular Biosciences
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4937038/ https://www.ncbi.nlm.nih.gov/pubmed/27458587 http://dx.doi.org/10.3389/fmolb.2016.00030
