The Illusion of Distribution-Free Small-Sample Classification in Genomics


Bibliographic Details
Main Authors: Dougherty, Edward R; Zollanvari, Amin; Braga-Neto, Ulisses M
Format: Online Article Text
Language: English
Published: Bentham Science Publishers Ltd 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3145263/
https://www.ncbi.nlm.nih.gov/pubmed/22294876
http://dx.doi.org/10.2174/138920211796429763
author Dougherty, Edward R
Zollanvari, Amin
Braga-Neto, Ulisses M
collection PubMed
description Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers for a classification rule to be applied to a small labeled data set and for the error of the resulting classifier to be estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
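The RMS mentioned in the description can be made concrete with a small Monte Carlo sketch. It is not the authors' method: it assumes a one-dimensional Gaussian model (class 0 ~ N(0, 1), class 1 ~ N(delta, 1), equal priors and equal class sizes), a midpoint-of-means classification rule, and leave-one-out as the cross-validation variant. All function names are hypothetical. Because the distribution is assumed, the true error of each designed classifier is available in closed form, which is what allows the RMS of the estimate to be computed at all.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def train_threshold(x0, x1):
    """Midpoint-of-means rule: predict class 1 when sign * (x - threshold) > 0."""
    m0 = sum(x0) / len(x0)
    m1 = sum(x1) / len(x1)
    sign = 1.0 if m1 >= m0 else -1.0
    return 0.5 * (m0 + m1), sign


def true_error(threshold, sign, delta):
    """Exact error of the rule under the ASSUMED model N(0,1) vs N(delta,1)."""
    above_given_0 = normal_cdf(-threshold)         # P(X > t | class 0)
    above_given_1 = normal_cdf(delta - threshold)  # P(X > t | class 1)
    if sign > 0:
        err0, err1 = above_given_0, 1.0 - above_given_1
    else:
        err0, err1 = 1.0 - above_given_0, above_given_1
    return 0.5 * (err0 + err1)


def loo_estimate(x0, x1):
    """Leave-one-out error estimate computed on the training sample itself."""
    errors = 0
    for i in range(len(x0)):
        t, s = train_threshold(x0[:i] + x0[i + 1:], x1)
        if s * (x0[i] - t) > 0:       # held-out class-0 point labeled 1
            errors += 1
    for i in range(len(x1)):
        t, s = train_threshold(x0, x1[:i] + x1[i + 1:])
        if s * (x1[i] - t) <= 0:      # held-out class-1 point labeled 0
            errors += 1
    return errors / (len(x0) + len(x1))


def rms_of_loo(n_per_class, delta, trials, seed=0):
    """Monte Carlo RMS deviation between the estimate and the true error."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        x1 = [rng.gauss(delta, 1.0) for _ in range(n_per_class)]
        t, s = train_threshold(x0, x1)
        diff = loo_estimate(x0, x1) - true_error(t, s, delta)
        total += diff * diff
    return math.sqrt(total / trials)


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(n, round(rms_of_loo(n, delta=1.0, trials=500), 3))
```

Running the script prints the simulated RMS for several per-class sample sizes. The figures hold only for the assumed model and rule; a different distribution or classifier would need its own analysis, which is the point the description makes about stating distributional assumptions.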
format Online
Article
Text
id pubmed-3145263
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Bentham Science Publishers Ltd
record_format MEDLINE/PubMed
spelling pubmed-3145263 2012-02-01 The Illusion of Distribution-Free Small-Sample Classification in Genomics Dougherty, Edward R Zollanvari, Amin Braga-Neto, Ulisses M Curr Genomics Article
Bentham Science Publishers Ltd 2011-08 /pmc/articles/PMC3145263/ /pubmed/22294876 http://dx.doi.org/10.2174/138920211796429763 Text en ©2011 Bentham Science Publishers Ltd. http://creativecommons.org/licenses/by/2.5/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.5/), which permits unrestrictive use, distribution, and reproduction in any medium, provided the original work is properly cited.
title The Illusion of Distribution-Free Small-Sample Classification in Genomics
topic Article