
Examining the significance of fingerprint-based classifiers


Bibliographic Details
Main authors: Luke, Brian T, Collins, Jack R
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2628908/
https://www.ncbi.nlm.nih.gov/pubmed/19091087
http://dx.doi.org/10.1186/1471-2105-9-545
_version_ 1782163749188141056
author Luke, Brian T
Collins, Jack R
author_facet Luke, Brian T
Collins, Jack R
author_sort Luke, Brian T
collection PubMed
description BACKGROUND: Experimental examinations of biofluids to measure concentrations of proteins or their fragments or metabolites are being explored as a means of detecting disease early, distinguishing diseases with similar symptoms, and assessing drug treatment efficacy. Many studies have produced classifiers with a high sensitivity and specificity, and it has been argued that accurate results necessarily imply some underlying biology-based features in the classifier. The simplest test of this conjecture is to apply classifiers used in many published studies to datasets designed to contain no information. RESULTS: The classification accuracy of two fingerprint-based classifiers, a decision tree (DT) algorithm and a medoid classification algorithm (MCA), is examined. These methods are used to examine 30 artificial datasets that contain random concentration levels for 300 biomolecules. Each dataset contains between 30 and 300 Cases and Controls, and since the 300 observed concentrations are randomly generated, these datasets are constructed to contain no biological information. A modest search of decision trees containing at most seven decision nodes finds a large number of unique decision trees with an average sensitivity and specificity above 85% for datasets containing 60 Cases and 60 Controls or fewer, and for datasets with 90 Cases and 90 Controls many DTs have an average sensitivity and specificity above 80%. Even for the largest dataset (300 Cases and 300 Controls), the MCA procedure finds several unique classifiers that have an average sensitivity and specificity above 88% using only six or seven features. CONCLUSION: While it has been argued that accurate classification results must imply some biological basis for the separation of Cases from Controls, our results show that this is not necessarily true. The DT and MCA classifiers are sufficiently flexible to produce good results from datasets that are specifically constructed to contain no information. This means that a chance fitting to the data is possible. All datasets used in this investigation are available on the web. This work is funded by NCI Contract N01-CO-12400.
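The chance-fitting effect described in the abstract can be illustrated with a much smaller experiment than the one the authors ran. The following is a minimal sketch, not the paper's DT or MCA procedure: it assumes uniformly distributed random "concentrations", 30 Cases and 30 Controls, and a classifier limited to a single decision node (one feature, one threshold). All function names and the seed are hypothetical choices made for illustration.

```python
import random


def make_random_dataset(n_cases, n_controls, n_features, seed):
    """Build a dataset with no information: every value is uniform noise.
    (The uniform distribution is an assumption of this sketch.)"""
    rng = random.Random(seed)
    n_rows = n_cases + n_controls
    data = [[rng.random() for _ in range(n_features)] for _ in range(n_rows)]
    labels = [True] * n_cases + [False] * n_controls  # True marks a Case
    return data, labels


def balanced_score(values, labels, threshold, case_if_above):
    """Average of sensitivity and specificity for a one-node rule."""
    tp = fn = tn = fp = 0
    for value, is_case in zip(values, labels):
        predicted_case = (value > threshold) == case_if_above
        if is_case and predicted_case:
            tp += 1
        elif is_case:
            fn += 1
        elif predicted_case:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def best_single_node(data, labels):
    """Exhaustively try every feature, every observed value as a threshold,
    and both orientations; return the best-scoring rule found."""
    best = (0.0, None, None, None)
    for j in range(len(data[0])):
        column = [row[j] for row in data]
        for threshold in column:
            for case_if_above in (True, False):
                score = balanced_score(column, labels, threshold, case_if_above)
                if score > best[0]:
                    best = (score, j, threshold, case_if_above)
    return best


data, labels = make_random_dataset(n_cases=30, n_controls=30, n_features=300, seed=0)
score, feature, threshold, case_if_above = best_single_node(data, labels)
print(f"best feature index: {feature}, average sensitivity/specificity: {score:.3f}")
```

Because the rule is selected and scored on the same samples, the reported score sits above the 0.5 expected for noise even though no feature carries signal. A search over trees with several decision nodes, as in the abstract, has far more freedom to fit the noise than this one-node search.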
format Text
id pubmed-2628908
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2628908 2009-01-21 Examining the significance of fingerprint-based classifiers Luke, Brian T Collins, Jack R BMC Bioinformatics Research Article BioMed Central 2008-12-17 /pmc/articles/PMC2628908/ /pubmed/19091087 http://dx.doi.org/10.1186/1471-2105-9-545 Text en Copyright © 2008 Luke and Collins; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Luke, Brian T
Collins, Jack R
Examining the significance of fingerprint-based classifiers
title Examining the significance of fingerprint-based classifiers
title_full Examining the significance of fingerprint-based classifiers
title_fullStr Examining the significance of fingerprint-based classifiers
title_full_unstemmed Examining the significance of fingerprint-based classifiers
title_short Examining the significance of fingerprint-based classifiers
title_sort examining the significance of fingerprint-based classifiers
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2628908/
https://www.ncbi.nlm.nih.gov/pubmed/19091087
http://dx.doi.org/10.1186/1471-2105-9-545
work_keys_str_mv AT lukebriant examiningthesignificanceoffingerprintbasedclassifiers
AT collinsjackr examiningthesignificanceoffingerprintbasedclassifiers