Evaluation and improvements of clustering algorithms for detecting remote homologous protein families

BACKGROUND: An important problem in computational biology is the automatic detection of protein families (groups of homologous sequences). Clustering sequences into families is at the heart of most comparative studies dealing with protein evolution, structure, and function. Many methods have been de...

Bibliographic Details
Main Authors:	Bernardes, Juliana S; Vieira, Fabio RJ; Costa, Lygia MM; Zaverucha, Gerson
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4339679/ https://www.ncbi.nlm.nih.gov/pubmed/25651949 http://dx.doi.org/10.1186/s12859-014-0445-4

_version_	1782358898125045760
author	Bernardes, Juliana S; Vieira, Fabio RJ; Costa, Lygia MM; Zaverucha, Gerson
author_facet	Bernardes, Juliana S; Vieira, Fabio RJ; Costa, Lygia MM; Zaverucha, Gerson
author_sort	Bernardes, Juliana S
collection	PubMed
description	BACKGROUND: An important problem in computational biology is the automatic detection of protein families (groups of homologous sequences). Clustering sequences into families is at the heart of most comparative studies dealing with protein evolution, structure, and function. Many methods have been developed for this task, and they perform reasonably well (over 0.88 of F-measure) when grouping proteins with high sequence identity. However, for highly diverged proteins the performance of these methods can be much lower, mainly because a common evolutionary origin is not deduced directly from sequence similarity. To the best of our knowledge, a systematic evaluation of clustering methods over distant homologous proteins is still lacking. RESULTS: We performed a comparative assessment of four clustering algorithms: Markov Clustering (MCL), Transitive Clustering (TransClust), Spectral Clustering of Protein Sequences (SCPS), and High-Fidelity clustering of protein sequences (HiFix), considering several datasets with different levels of sequence similarity. Two types of similarity measures, required by the clustering sequence methods, were used to evaluate the performance of the algorithms: the standard measure obtained from sequence–sequence comparisons, and a novel measure based on profile-profile comparisons, used here for the first time. CONCLUSIONS: The results reveal low clustering performance for the highly divergent datasets when the standard measure was used. However, the novel measure based on profile-profile comparisons substantially improved the performance of the four methods, especially when very low sequence identity datasets were evaluated. We also performed a parameter optimization step to determine the best configuration for each clustering method. We found that TransClust clearly outperformed the other methods for most datasets. This work also provides guidelines for the practical application of clustering sequence methods aimed at detecting accurately groups of related protein sequences.
format	Online Article Text
id	pubmed-4339679
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4339679 2015-02-26 Evaluation and improvements of clustering algorithms for detecting remote homologous protein families Bernardes, Juliana S; Vieira, Fabio RJ; Costa, Lygia MM; Zaverucha, Gerson BMC Bioinformatics Research Article BACKGROUND: An important problem in computational biology is the automatic detection of protein families (groups of homologous sequences). Clustering sequences into families is at the heart of most comparative studies dealing with protein evolution, structure, and function. Many methods have been developed for this task, and they perform reasonably well (over 0.88 of F-measure) when grouping proteins with high sequence identity. However, for highly diverged proteins the performance of these methods can be much lower, mainly because a common evolutionary origin is not deduced directly from sequence similarity. To the best of our knowledge, a systematic evaluation of clustering methods over distant homologous proteins is still lacking. RESULTS: We performed a comparative assessment of four clustering algorithms: Markov Clustering (MCL), Transitive Clustering (TransClust), Spectral Clustering of Protein Sequences (SCPS), and High-Fidelity clustering of protein sequences (HiFix), considering several datasets with different levels of sequence similarity. Two types of similarity measures, required by the clustering sequence methods, were used to evaluate the performance of the algorithms: the standard measure obtained from sequence–sequence comparisons, and a novel measure based on profile-profile comparisons, used here for the first time. CONCLUSIONS: The results reveal low clustering performance for the highly divergent datasets when the standard measure was used. However, the novel measure based on profile-profile comparisons substantially improved the performance of the four methods, especially when very low sequence identity datasets were evaluated. We also performed a parameter optimization step to determine the best configuration for each clustering method. We found that TransClust clearly outperformed the other methods for most datasets. This work also provides guidelines for the practical application of clustering sequence methods aimed at detecting accurately groups of related protein sequences. BioMed Central 2015-02-05 /pmc/articles/PMC4339679/ /pubmed/25651949 http://dx.doi.org/10.1186/s12859-014-0445-4 Text en © Bernardes et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Evaluation and improvements of clustering algorithms for detecting remote homologous protein families
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4339679/ https://www.ncbi.nlm.nih.gov/pubmed/25651949 http://dx.doi.org/10.1186/s12859-014-0445-4
