Spectral clustering of protein sequences
An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance...
Main authors: | Paccanaro, Alberto; Casbon, James A.; Saqi, Mansoor A. S. |
---|---|
Format: | Text |
Language: | English |
Published: | Oxford University Press 2006 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1409676/ https://www.ncbi.nlm.nih.gov/pubmed/16547200 http://dx.doi.org/10.1093/nar/gkj515 |
_version_ | 1782127047624097792 |
---|---|
author | Paccanaro, Alberto Casbon, James A. Saqi, Mansoor A. S. |
author_facet | Paccanaro, Alberto Casbon, James A. Saqi, Mansoor A. S. |
author_sort | Paccanaro, Alberto |
collection | PubMed |
description | An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide theoretical justification of why it will have a remarkable improvement over local methods. We extensively tested our method and compared its performance with other local methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that the number of clusters that we obtain for a given set of proteins is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs. In our experiments, the quality of the clusters as quantified by a measure that combines sensitivity and specificity was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL]. |
format | Text |
id | pubmed-1409676 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-1409676 2006-03-23 Spectral clustering of protein sequences Paccanaro, Alberto Casbon, James A. Saqi, Mansoor A. S. Nucleic Acids Res Article Oxford University Press 2006 2006-03-17 /pmc/articles/PMC1409676/ /pubmed/16547200 http://dx.doi.org/10.1093/nar/gkj515 Text en © The Author 2006. Published by Oxford University Press. All rights reserved |
spellingShingle | Article Paccanaro, Alberto Casbon, James A. Saqi, Mansoor A. S. Spectral clustering of protein sequences |
title | Spectral clustering of protein sequences |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1409676/ https://www.ncbi.nlm.nih.gov/pubmed/16547200 http://dx.doi.org/10.1093/nar/gkj515 |
work_keys_str_mv | AT paccanaroalberto spectralclusteringofproteinsequences AT casbonjamesa spectralclusteringofproteinsequences AT saqimansooras spectralclusteringofproteinsequences |
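
The description in this record contrasts local clustering, which thresholds a pairwise sequence measure, with a global method based on spectral clustering. The sketch below is only an illustration of that contrast, not the authors' implementation: the similarity values, the six-protein toy set, the function names, and the choice of the normalized affinity matrix are all assumptions made for this example, and the spectral step handles two clusters only.

```python
import math


def threshold_components(sim, threshold):
    """Local method: link every pair with similarity >= threshold,
    then return the connected components of the resulting graph."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def spectral_bipartition(sim, iterations=200):
    """Global method (two clusters only): split on the sign of the second
    eigenvector of the normalized affinity matrix D^-1/2 W D^-1/2."""
    n = len(sim)
    degree = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    # Normalized affinity matrix with a zero diagonal.
    m = [[0.0 if i == j else sim[i][j] / math.sqrt(degree[i] * degree[j])
          for j in range(n)] for i in range(n)]
    # The leading eigenvector is known in closed form: sqrt(degree), normalized.
    norm = math.sqrt(sum(degree))
    top = [math.sqrt(d) / norm for d in degree]

    v = [float(i) for i in range(n)]  # arbitrary deterministic start vector
    for _ in range(iterations):
        # Multiply by (M + I) / 2 so that all eigenvalues are non-negative.
        v = [0.5 * (v[i] + sum(m[i][j] * v[j] for j in range(n)))
             for i in range(n)]
        # Deflate: remove the component along the leading eigenvector.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]

    positive = [i for i in range(n) if v[i] > 0]
    negative = [i for i in range(n) if v[i] <= 0]
    return sorted([positive, negative])


# Hypothetical toy data: two planted families of three proteins each.
family = [0, 0, 0, 1, 1, 1]
sim = [[1.0 if i == j else (0.9 if family[i] == family[j] else 0.05)
        for j in range(6)] for i in range(6)]
sim[2][3] = sim[3][2] = 0.6  # one spurious cross-family hit (assumed)

local_clusters = threshold_components(sim, 0.5)
global_clusters = spectral_bipartition(sim)
print(local_clusters)
print(global_clusters)
```

On this toy matrix the single cross-family link above the threshold is enough to chain everything into one connected component, whereas the spectral split, which weighs all pairwise similarities at once, separates the two planted groups. Real protein data would need a similarity measure derived from sequence comparison and a way to choose the number of clusters, neither of which is shown here.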