
ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classification

BACKGROUND: Current classification of protein folds are based, ultimately, on visual inspection of similarities. Previous attempts to use computerized structure comparison methods show only partial agreement with curated databases, but have failed to provide detailed statistical and structural analy...


Bibliographic Details
Main authors:	Sam, Vichetra; Tai, Chin-Hsien; Garnier, Jean; Gibrat, Jean-Francois; Lee, Byungkook; Munson, Peter J
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1513609/ https://www.ncbi.nlm.nih.gov/pubmed/16613604 http://dx.doi.org/10.1186/1471-2105-7-206

author	Sam, Vichetra Tai, Chin-Hsien Garnier, Jean Gibrat, Jean-Francois Lee, Byungkook Munson, Peter J
collection	PubMed
description	BACKGROUND: Current classification of protein folds are based, ultimately, on visual inspection of similarities. Previous attempts to use computerized structure comparison methods show only partial agreement with curated databases, but have failed to provide detailed statistical and structural analysis of the causes of these divergences. RESULTS: We construct a map of similarities/dissimilarities among manually defined protein folds, using a score cutoff value determined by means of the Receiver Operating Characteristics curve. It identifies folds which appear to overlap or to be "confused" with each other by two distinct similarity measures. It also identifies folds which appear inhomogeneous in that they contain apparently dissimilar domains, as measured by both similarity measures. At a low (1%) false positive rate, 25 to 38% of domain pairs in the same SCOP folds do not appear similar. Our results suggest either that some of these folds are defined using criteria other than purely structural consideration or that the similarity measures used do not recognize some relevant aspects of structural similarity in certain cases. Specifically, variations of the "common core" of some folds are severe enough to defeat attempts to automatically detect structural similarity and/or to lead to false detection of similarity between domains in distinct folds. Structures in some folds vary greatly in size because they contain varying numbers of a repeating unit, while similarity scores are quite sensitive to size differences. Structures in different folds may contain similar substructures, which produce false positives. Finally, the common core within a structure may be too small relative to the entire structure, to be recognized as the basis of similarity to another. 
CONCLUSION: A detailed analysis of the entire available protein fold space by two automated similarity methods reveals the extent and the nature of the divergence between the automatically determined similarity/dissimilarity and the manual fold type classifications. Some of the observed divergences can probably be addressed with better structure comparison methods and better automatic, intelligent classification procedures. Others may be intrinsic to the problem, suggesting a continuous rather than discrete protein fold space.
format	Text
id	pubmed-1513609
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1513609 2006-07-24 Sam, Vichetra; Tai, Chin-Hsien; Garnier, Jean; Gibrat, Jean-Francois; Lee, Byungkook; Munson, Peter J BMC Bioinformatics Research Article BioMed Central 2006-04-13 /pmc/articles/PMC1513609/ /pubmed/16613604 http://dx.doi.org/10.1186/1471-2105-7-206 Text en Copyright © 2006 Sam et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classification
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1513609/ https://www.ncbi.nlm.nih.gov/pubmed/16613604 http://dx.doi.org/10.1186/1471-2105-7-206
