Feature selection and classifier performance on diverse biological datasets

BACKGROUND: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algor...


Bibliographic Details
Main authors:	Hemphill, Edward; Lindsay, James; Lee, Chih; Măndoiu, Ion I; Nelson, Craig E
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Proceedings
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4248652/ https://www.ncbi.nlm.nih.gov/pubmed/25434802 http://dx.doi.org/10.1186/1471-2105-15-S13-S4

author	Hemphill, Edward; Lindsay, James; Lee, Chih; Măndoiu, Ion I; Nelson, Craig E
collection	PubMed
description	BACKGROUND: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore model performance across multiple biological data types. RESULTS: This paper presents the results of a large-scale empirical study in which a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize the predictive accuracy of all models at all parameters on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data to demonstrate robustness. CONCLUSIONS: As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers, gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genome hybridization (aCGH), and microRNA data. Interestingly, as the number of selected biomarkers increases, the best-performing classifiers based on SNP data match or slightly outperform those based on gene and protein expression, while those based on aCGH and microRNA data continue to perform the worst. One class of feature selection and classifier is consistently a top performer across data types and numbers of markers, suggesting that well-performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis.
format	Online Article Text
id	pubmed-4248652
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4248652 2014-12-04 Feature selection and classifier performance on diverse biological datasets Hemphill, Edward; Lindsay, James; Lee, Chih; Măndoiu, Ion I; Nelson, Craig E BMC Bioinformatics Proceedings BioMed Central 2014-11-13 /pmc/articles/PMC4248652/ /pubmed/25434802 http://dx.doi.org/10.1186/1471-2105-15-S13-S4 Text en Copyright © 2014 Hemphill et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Feature selection and classifier performance on diverse biological datasets
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4248652/ https://www.ncbi.nlm.nih.gov/pubmed/25434802 http://dx.doi.org/10.1186/1471-2105-15-S13-S4


Similar Items