
Random forests for feature selection in QSPR Models - an application for predicting standard enthalpy of formation of hydrocarbons

BACKGROUND: One of the main topics in the development of quantitative structure-property relationship (QSPR) predictive models is the identification of the subset of variables that represent the structure of a molecule and which are predictors for a given property. There are several automated featur...


Bibliographic Details
Main Authors: Teixeira, Ana L; Leal, João P; Falcao, Andre O
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3599435/
https://www.ncbi.nlm.nih.gov/pubmed/23399299
http://dx.doi.org/10.1186/1758-2946-5-9
author Teixeira, Ana L
Leal, João P
Falcao, Andre O
collection PubMed
description BACKGROUND: One of the main topics in the development of quantitative structure-property relationship (QSPR) predictive models is the identification of the subset of variables that represent the structure of a molecule and are predictors for a given property. There are several automated feature selection methods, ranging from backward, forward or stepwise procedures to more elaborate methodologies such as evolutionary programming. The problem lies in selecting the minimum subset of descriptors that can predict a given property with good performance, in a computationally efficient and more robust way, since the presence of irrelevant or redundant features can cause poor generalization capacity. In this paper an alternative selection method, which uses Random Forests to determine variable importance, is proposed in the context of QSPR regression problems, with an application to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a list ranked by variable importance.
RESULTS: The model generalizes well even with a high-dimensional dataset and in the presence of highly correlated variables. The feature selection step yielded RMSE values 23% lower than without feature selection while using only 6% of the total number of variables (89 of the original 1485). The proposed approach also compared favourably with other feature selection methods and with dimension reduction of the feature space. The predictive model was selected using a 10-fold cross-validation procedure and, after selection, was validated with an independent set to assess its performance on new data. The results were similar to those obtained for the training set, supporting the robustness of the proposed approach.
CONCLUSIONS: The proposed methodology seemingly improves the prediction of the standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors. Reducing the number of descriptors makes their calculation faster and more cost-effective and provides a better understanding of the underlying relationship between the molecular structure represented by the descriptors and the property of interest.
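The selection loop the abstract describes (add variables one at a time from an importance-ranked list and keep the subset size with the lowest 10-fold cross-validated RMSE) can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' implementation: the k-nearest-neighbour regressor is a stand-in for the support vector machine, the `ranked` list is a hypothetical ranking standing in for Random Forest variable importance, and the data are synthetic. All function names are made up for this sketch.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def take(row, columns):
    """Keep only the selected feature columns of one sample."""
    return [row[c] for c in columns]


def knn_predict(train_X, train_y, x, k=3):
    """Stand-in regressor (NOT the SVM from the abstract): mean target of the k nearest samples."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_rmse(X, y, columns, n_folds=10, seed=0):
    """Mean RMSE over n_folds cross-validation folds, using only the given columns."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        train_X = [take(X[i], columns) for i in train]
        train_y = [y[i] for i in train]
        preds = [knn_predict(train_X, train_y, take(X[i], columns)) for i in held_out]
        scores.append(rmse([y[i] for i in held_out], preds))
    return sum(scores) / len(scores)


def select_top_k(X, y, ranked_columns, n_folds=10):
    """Introduce ranked features sequentially; return (best subset size, CV RMSE per size)."""
    curve = [cv_rmse(X, y, ranked_columns[:k], n_folds)
             for k in range(1, len(ranked_columns) + 1)]
    best_k = min(range(len(curve)), key=curve.__getitem__) + 1
    return best_k, curve


# Synthetic data (assumption): only features 0 and 1 carry signal, the rest are noise.
rng = random.Random(42)
X = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(120)]
y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 0.05) for row in X]

# Hypothetical ranking, as if it had been produced by a variable-importance measure.
ranked = [0, 1, 2, 3, 4, 5]

best_k, curve = select_top_k(X, y, ranked)
print("subset size with lowest CV RMSE:", best_k)
print("CV RMSE by subset size:", [round(v, 3) for v in curve])
```

In a real workflow the chosen subset would then be checked once against an independent validation set, as the abstract reports, so that the cross-validation used for selection is not also the only estimate of performance on new data.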
format Online
Article
Text
id pubmed-3599435
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3599435 2013-03-17 J Cheminform Research Article BioMed Central 2013-02-11 /pmc/articles/PMC3599435/ /pubmed/23399299 http://dx.doi.org/10.1186/1758-2946-5-9 Text en Copyright ©2013 Teixeira et al; licensee Chemistry Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Random forests for feature selection in QSPR Models - an application for predicting standard enthalpy of formation of hydrocarbons
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3599435/
https://www.ncbi.nlm.nih.gov/pubmed/23399299
http://dx.doi.org/10.1186/1758-2946-5-9