
SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale

BACKGROUND: An important problem in genomics is the automatic inference of groups of homologous proteins from pairwise sequence similarities. Several approaches have been proposed for this task which are "local" in the sense that they assign a protein to a cluster based only on the distanc...


Bibliographic Details
Main authors: Nepusz, Tamás; Sasidharan, Rajkumar; Paccanaro, Alberto
Format: Text
Language: English
Published: BioMed Central 2010
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2841596/
https://www.ncbi.nlm.nih.gov/pubmed/20214776
http://dx.doi.org/10.1186/1471-2105-11-120
author Nepusz, Tamás
Sasidharan, Rajkumar
Paccanaro, Alberto
collection PubMed
description BACKGROUND: An important problem in genomics is the automatic inference of groups of homologous proteins from pairwise sequence similarities. Several approaches proposed for this task are "local" in the sense that they assign a protein to a cluster based only on the distances between that protein and the other proteins in the set. It was shown recently that global methods such as spectral clustering perform better on a wide variety of datasets. However, currently available implementations of spectral clustering methods mostly consist of a few loosely coupled Matlab scripts that assume a fair amount of familiarity with Matlab programming, and are hence inaccessible to large parts of the research community.
RESULTS: SCPS (Spectral Clustering of Protein Sequences) is an efficient and user-friendly implementation of a spectral method for inferring protein families. The method uses only pairwise sequence similarities, and is therefore practical when only sequence information is available. SCPS was tested on difficult sets of proteins whose relationships were extracted from the SCOP database, and its results were extensively compared with those obtained using other popular protein clustering algorithms such as TribeMCL, hierarchical clustering and connected component analysis. We show that SCPS is able to identify many of the family/superfamily relationships correctly and that the quality of the obtained clusters, as indicated by their F-scores, is consistently better than that of all the other methods we compared it with. We also demonstrate the scalability of SCPS by clustering the entire SCOP database (14,183 sequences) and the complete genome of the yeast Saccharomyces cerevisiae (6,690 sequences).
CONCLUSIONS: Besides the spectral method, SCPS implements connected component analysis and hierarchical clustering, integrates TribeMCL, provides several cluster quality tools, can extract human-readable protein descriptions from NCBI using GI numbers, interfaces with external tools such as BLAST and Cytoscape, and can produce publication-quality graphical representations of the clusters obtained. It thus constitutes a comprehensive and effective tool for practical research in computational biology. Source code and precompiled executables for Windows, Linux and Mac OS X are freely available at http://www.paccanarolab.org/software/scps.
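The abstract names connected component analysis as one of the baseline methods and uses F-scores to judge cluster quality. The following is a minimal Python sketch of both ideas, not the SCPS implementation. The protein names, similarity scores and threshold are hypothetical, and the F-score shown is the standard harmonic mean of precision and recall for one cluster against one reference family, which may differ from the exact variant used in the paper.

```python
def connected_components(similarities, threshold):
    """Group proteins joined by any chain of similarities >= threshold.

    `similarities` maps (protein_a, protein_b) pairs to a score. The score
    scale and the threshold are hypothetical choices for this sketch.
    """
    neighbours = {}
    for (a, b), score in similarities.items():
        # Register both proteins so isolated ones still form a cluster.
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if score >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, clusters = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def f_score(cluster, family):
    """Harmonic mean of precision and recall of a cluster vs. a family."""
    overlap = len(cluster & family)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(family)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pairwise similarities between five proteins.
similarities = {
    ("p1", "p2"): 0.9,
    ("p2", "p3"): 0.8,
    ("p3", "p4"): 0.1,
    ("p4", "p5"): 0.7,
}
clusters = connected_components(similarities, threshold=0.5)
reference_family = {"p1", "p2", "p3", "p4"}  # hypothetical reference family
quality = f_score(clusters[0], reference_family)
```

This illustrates why such a method is "local": a protein joins a cluster as soon as a single above-threshold link exists, regardless of the global structure of the similarity graph, which is the limitation that spectral methods aim to address.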
format Text
id pubmed-2841596
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2841596 2010-03-19 BMC Bioinformatics Software BioMed Central 2010-03-09 /pmc/articles/PMC2841596/ /pubmed/20214776 http://dx.doi.org/10.1186/1471-2105-11-120 Text en Copyright ©2010 Nepusz et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale
topic Software