Feature selection and classifier performance on diverse biological datasets

BACKGROUND: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algor...

Bibliographic Details
Main authors: Hemphill, Edward; Lindsay, James; Lee, Chih; Măndoiu, Ion I; Nelson, Craig E
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4248652/
https://www.ncbi.nlm.nih.gov/pubmed/25434802
http://dx.doi.org/10.1186/1471-2105-15-S13-S4
collection PubMed
description BACKGROUND: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore a model's performance across multiple biological data types. RESULTS: This paper presents the results of a large-scale empirical study in which a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize the predictive accuracy of all models at all parameters on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data in order to demonstrate robustness. CONCLUSIONS: As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers, gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genome hybridization (aCGH), and microRNA data. Interestingly, as the number of selected biomarkers increases, the best-performing classifiers based on SNP data match or slightly outperform those based on gene and protein expression, while those based on aCGH and microRNA data continue to perform the worst. It is observed that one class of feature selection and classifier is consistently a top performer across data types and numbers of markers, suggesting that well-performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis.
format Online
Article
Text
id pubmed-4248652
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4248652 2014-12-04 Feature selection and classifier performance on diverse biological datasets Hemphill, Edward; Lindsay, James; Lee, Chih; Măndoiu, Ion I; Nelson, Craig E BMC Bioinformatics Proceedings BioMed Central 2014-11-13 /pmc/articles/PMC4248652/ /pubmed/25434802 http://dx.doi.org/10.1186/1471-2105-15-S13-S4 Text en Copyright © 2014 Hemphill et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
topic Proceedings