Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction

Bibliographic Details
Main Authors: O'Boyle, Noel M, Palmer, David S, Nigsch, Florian, Mitchell, John BO
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2603525/
https://www.ncbi.nlm.nih.gov/pubmed/18959785
http://dx.doi.org/10.1186/1752-153X-2-21
author O'Boyle, Noel M
Palmer, David S
Nigsch, Florian
Mitchell, John BO
collection PubMed
description BACKGROUND: We present a novel feature selection algorithm, Winnowing Artificial Ant Colony (WAAC), that performs simultaneous feature selection and model parameter optimisation for the development of predictive quantitative structure-property relationship (QSPR) models. The WAAC algorithm is an extension of the modified ant colony algorithm of Shen et al. (J Chem Inf Model 2005, 45: 1024–1029). We test the ability of the algorithm to develop a predictive partial least squares model for the Karthikeyan dataset (J Chem Inf Model 2005, 45: 581–590) of melting point values. We also test its ability to perform feature selection on a support vector machine model for the same dataset. RESULTS: Starting from an initial set of 203 descriptors, the WAAC algorithm selected a PLS model with 68 descriptors which has an RMSE on an external test set of 46.6°C and R² of 0.51. The number of components chosen for the model was 49, which was close to optimal for this feature selection. The selected SVM model has 28 descriptors (cost of 5, ε of 0.21) and an RMSE of 45.1°C and R² of 0.54. This model outperforms a kNN model (RMSE of 48.3°C, R² of 0.47) for the same data and has similar performance to a Random Forest model (RMSE of 44.5°C, R² of 0.55). However, it is much less prone to bias at the extremes of the range of melting points as shown by the slope of the line through the residuals: -0.43 for WAAC/SVM, -0.53 for Random Forest. CONCLUSION: With a careful choice of objective function, the WAAC algorithm can be used to optimise machine learning and regression models that suffer from overfitting. Where model parameters also need to be tuned, as is the case with support vector machine and partial least squares models, it can optimise these simultaneously. The moving probabilities used by the algorithm are easily interpreted in terms of the best and current models of the ants, and the winnowing procedure promotes the removal of irrelevant descriptors.
format Text
id pubmed-2603525
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2603525 2008-12-17 Chem Cent J Methodology BioMed Central 2008-10-29 /pmc/articles/PMC2603525/ /pubmed/18959785 http://dx.doi.org/10.1186/1752-153X-2-21 Text en Copyright © 2007 O'Boyle et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction
topic Methodology