Spectral clustering of protein sequences

An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance...

Bibliographic Details
Main authors: Paccanaro, Alberto; Casbon, James A.; Saqi, Mansoor A. S.
Format: Text
Language: English
Published: Oxford University Press 2006
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1409676/
https://www.ncbi.nlm.nih.gov/pubmed/16547200
http://dx.doi.org/10.1093/nar/gkj515
collection PubMed
description An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide theoretical justification of why it will have a remarkable improvement over local methods. We extensively tested our method and compared its performance with other local methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that, the number of clusters that we obtain for a given set of proteins is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs. In our experiments, the quality of the clusters as quantified by a measure that combines sensitivity and specificity was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL].
id pubmed-1409676
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-1409676; 2006-03-23; Nucleic Acids Res; Article; Oxford University Press 2006; 2006-03-17; Text; en; © The Author 2006. Published by Oxford University Press. All rights reserved