Feature selection and classifier performance on diverse biological datasets
BACKGROUND: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algor...
Main Authors: | Hemphill, Edward; Lindsay, James; Lee, Chih; Măndoiu, Ion I; Nelson, Craig E |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Proceedings |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4248652/ https://www.ncbi.nlm.nih.gov/pubmed/25434802 http://dx.doi.org/10.1186/1471-2105-15-S13-S4 |
_version_ | 1782346843961688064 |
---|---|
author | Hemphill, Edward Lindsay, James Lee, Chih Măndoiu, Ion I Nelson, Craig E |
author_facet | Hemphill, Edward Lindsay, James Lee, Chih Măndoiu, Ion I Nelson, Craig E |
author_sort | Hemphill, Edward |
collection | PubMed |
description | BACKGROUND: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore model performance across multiple biological data types. RESULTS: This paper presents the results of a large-scale empirical study in which a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize predictive accuracy of all models at all parameters on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data in order to demonstrate robustness. CONCLUSIONS: As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers, gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genomic hybridization (aCGH), and microRNA data. Interestingly, as the number of selected biomarkers increases, the best-performing classifiers based on SNP data match or slightly outperform those based on gene and protein expression, while those based on aCGH and microRNA data continue to perform the worst. It is observed that one class of feature selection methods and classifiers consistently ranks among the top performers across data types and numbers of markers, suggesting that well-performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis. |
format | Online Article Text |
id | pubmed-4248652 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4248652 2014-12-04 Feature selection and classifier performance on diverse biological datasets Hemphill, Edward Lindsay, James Lee, Chih Măndoiu, Ion I Nelson, Craig E BMC Bioinformatics Proceedings BACKGROUND: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore model performance across multiple biological data types. RESULTS: This paper presents the results of a large-scale empirical study in which a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize predictive accuracy of all models at all parameters on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data in order to demonstrate robustness. CONCLUSIONS: As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers, gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genomic hybridization (aCGH), and microRNA data. Interestingly, as the number of selected biomarkers increases, the best-performing classifiers based on SNP data match or slightly outperform those based on gene and protein expression, while those based on aCGH and microRNA data continue to perform the worst. It is observed that one class of feature selection methods and classifiers consistently ranks among the top performers across data types and numbers of markers, suggesting that well-performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis. BioMed Central 2014-11-13 /pmc/articles/PMC4248652/ /pubmed/25434802 http://dx.doi.org/10.1186/1471-2105-15-S13-S4 Text en Copyright © 2014 Hemphill et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Proceedings Hemphill, Edward Lindsay, James Lee, Chih Măndoiu, Ion I Nelson, Craig E Feature selection and classifier performance on diverse biological datasets |
title | Feature selection and classifier performance on diverse biological datasets |
title_full | Feature selection and classifier performance on diverse biological datasets |
title_fullStr | Feature selection and classifier performance on diverse biological datasets |
title_full_unstemmed | Feature selection and classifier performance on diverse biological datasets |
title_short | Feature selection and classifier performance on diverse biological datasets |
title_sort | feature selection and classifier performance on diverse biological datasets |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4248652/ https://www.ncbi.nlm.nih.gov/pubmed/25434802 http://dx.doi.org/10.1186/1471-2105-15-S13-S4 |
work_keys_str_mv | AT hemphilledward featureselectionandclassifierperformanceondiversebiologicaldatasets AT lindsayjames featureselectionandclassifierperformanceondiversebiologicaldatasets AT leechih featureselectionandclassifierperformanceondiversebiologicaldatasets AT mandoiuioni featureselectionandclassifierperformanceondiversebiologicaldatasets AT nelsoncraige featureselectionandclassifierperformanceondiversebiologicaldatasets |