Clustering evolving proteins into homologous families

BACKGROUND: Clustering sequences into groups of putative homologs (families) is a critical first step in many areas of comparative biology and bioinformatics. The performance of clustering approaches in delineating biologically meaningful families depends strongly on characteristics of the data, inc...

Bibliographic Details
Main Authors:	Chan, Cheong Xin; Mahbob, Maisarah; Ragan, Mark A
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3637521/ https://www.ncbi.nlm.nih.gov/pubmed/23566217 http://dx.doi.org/10.1186/1471-2105-14-120

_version_	1782267492449648640
author	Chan, Cheong Xin; Mahbob, Maisarah; Ragan, Mark A
author_facet	Chan, Cheong Xin; Mahbob, Maisarah; Ragan, Mark A
author_sort	Chan, Cheong Xin
collection	PubMed
description	BACKGROUND: Clustering sequences into groups of putative homologs (families) is a critical first step in many areas of comparative biology and bioinformatics. The performance of clustering approaches in delineating biologically meaningful families depends strongly on characteristics of the data, including content bias and degree of divergence. New, highly scalable methods have recently been introduced to cluster the very large datasets being generated by next-generation sequencing technologies. However, there has been little systematic investigation of how characteristics of the data impact the performance of these approaches. RESULTS: Using clusters from a manually curated dataset as reference, we examined the performance of a widely used graph-based Markov clustering algorithm (MCL) and a greedy heuristic approach (UCLUST) in delineating protein families coded by three sets of bacterial genomes of different G+C content. Both MCL and UCLUST generated clusters that are comparable to the reference sets at specific parameter settings, although UCLUST tends to under-cluster compositionally biased sequences (G+C content 33% and 66%). Using simulated data, we sought to assess the individual effects of sequence divergence, rate heterogeneity, and underlying G+C content. Performance decreased with increasing sequence divergence, decreasing among-site rate variation, and increasing G+C bias. Two MCL-based methods recovered the simulated families more accurately than did UCLUST. MCL using local alignment distances is more robust across the investigated range of sequence features than are greedy heuristics using distances based on global alignment. CONCLUSIONS: Our results demonstrate that sequence divergence, rate heterogeneity and content bias can individually and in combination affect the accuracy with which MCL and UCLUST can recover homologous protein families. For application to data that are more divergent, and exhibit higher among-site rate variation and/or content bias, MCL may often be the better choice, especially if computational resources are not limiting.
format	Online Article Text
id	pubmed-3637521
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3637521 2013-04-27 Clustering evolving proteins into homologous families Chan, Cheong Xin; Mahbob, Maisarah; Ragan, Mark A BMC Bioinformatics Methodology Article BioMed Central 2013-04-08 /pmc/articles/PMC3637521/ /pubmed/23566217 http://dx.doi.org/10.1186/1471-2105-14-120 Text en Copyright © 2013 Chan et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Chan, Cheong Xin Mahbob, Maisarah Ragan, Mark A Clustering evolving proteins into homologous families
title	Clustering evolving proteins into homologous families
title_full	Clustering evolving proteins into homologous families
title_fullStr	Clustering evolving proteins into homologous families
title_full_unstemmed	Clustering evolving proteins into homologous families
title_short	Clustering evolving proteins into homologous families
title_sort	clustering evolving proteins into homologous families
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3637521/ https://www.ncbi.nlm.nih.gov/pubmed/23566217 http://dx.doi.org/10.1186/1471-2105-14-120
work_keys_str_mv	AT chancheongxin clusteringevolvingproteinsintohomologousfamilies AT mahbobmaisarah clusteringevolvingproteinsintohomologousfamilies AT raganmarka clusteringevolvingproteinsintohomologousfamilies
