
Characterization of the Effectiveness of Reporting Lists of Small Feature Sets Relative to the Accuracy of the Prior Biological Knowledge

When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliab...


Bibliographic Details
Main Authors:	Zhao, Chen; Bittner, Michael L.; Chapkin, Robert S.; Dougherty, Edward R.
Format:	Text
Language:	English
Published:	Libertas Academica 2010
Subjects:	Methodology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2865771/ https://www.ncbi.nlm.nih.gov/pubmed/20458361

_version_	1782180868533518336
author	Zhao, Chen; Bittner, Michael L.; Chapkin, Robert S.; Dougherty, Edward R.
author_sort	Zhao, Chen
collection	PubMed
description	When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is to limit the number of features being considered, restrict feature sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best-performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased, so that there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed. The first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of the best feature set, and the second is the expected number of feature sets on the list whose true errors are within the given tolerance of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small (that is, the prior biological knowledge is not too poor), then one should expect, with high probability, to find good feature sets. Availability: companion website at http://gsp.tamu.edu/Publications/supplementary/zhao09a/
format	Text
id	pubmed-2865771
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	Libertas Academica
record_format	MEDLINE/PubMed
spelling	pubmed-2865771 2010-05-10 Cancer Inform Methodology Libertas Academica 2010-03-18 /pmc/articles/PMC2865771/ /pubmed/20458361 Text en © the author(s), publisher and licensee Libertas Academica Ltd. This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.
title	Characterization of the Effectiveness of Reporting Lists of Small Feature Sets Relative to the Accuracy of the Prior Biological Knowledge
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2865771/ https://www.ncbi.nlm.nih.gov/pubmed/20458361
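
The two measures named in the description (the probability that the reported list contains at least one feature set within a tolerance of the best, and the expected number of such feature sets on the list) can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation: the true-error model, the Gaussian noise standing in for a small-sample error estimator, and every name and default below (simulate_power, noise_sd, and so on) are assumptions made purely for illustration.

```python
import itertools
import random


def simulate_power(n_features=8, n_discriminating=3, set_size=2,
                   list_length=5, tolerance=0.05, noise_sd=0.1,
                   n_trials=2000, seed=0):
    """Estimate the two list-based measures on a toy model.

    Returns (probability that the list holds at least one good feature set,
             expected number of good feature sets on the list).
    """
    rng = random.Random(seed)
    # Exhaustive search: enumerate every feature set of the given size.
    feature_sets = list(itertools.combinations(range(n_features), set_size))

    # Assumed toy model: features 0..n_discriminating-1 are informative,
    # and each informative feature in a set shrinks its true error.
    def true_error(feature_set):
        informative = sum(1 for f in feature_set if f < n_discriminating)
        return 0.5 * (0.6 ** informative)

    true_errors = {fs: true_error(fs) for fs in feature_sets}
    best = min(true_errors.values())

    hits = 0        # trials in which the list holds at least one good set
    total_good = 0  # good sets on the list, summed over all trials
    for _ in range(n_trials):
        # Assumed stand-in for an unreliable small-sample error estimate.
        estimated = {fs: err + rng.gauss(0.0, noise_sd)
                     for fs, err in true_errors.items()}
        # Report the list_length feature sets with the lowest estimates.
        reported = sorted(feature_sets, key=estimated.get)[:list_length]
        good = sum(1 for fs in reported
                   if true_errors[fs] <= best + tolerance)
        hits += good > 0
        total_good += good
    return hits / n_trials, total_good / n_trials


if __name__ == "__main__":
    # Values for a simple power curve as a function of list length.
    for length in (1, 3, 5, 10, 28):
        prob, expected = simulate_power(list_length=length)
        print(length, round(prob, 3), round(expected, 3))
```

Reusing one seed across list lengths means every list length is evaluated on the same simulated estimates, so the shorter lists are prefixes of the longer ones and the estimated probability cannot decrease as the list grows. Varying n_discriminating in this toy model gives a rough feel for how the quality of the prior knowledge changes the curves.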
