Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource

BACKGROUND: New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of prote...

Bibliographic Details
Main Authors:	Sharpton, Thomas J; Jospin, Guillaume; Wu, Dongying; Langille, Morgan GI; Pollard, Katherine S; Eisen, Jonathan A
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3481395/ https://www.ncbi.nlm.nih.gov/pubmed/23061897 http://dx.doi.org/10.1186/1471-2105-13-264

author	Sharpton, Thomas J; Jospin, Guillaume; Wu, Dongying; Langille, Morgan GI; Pollard, Katherine S; Eisen, Jonathan A
author_sort	Sharpton, Thomas J
collection	PubMed
description	BACKGROUND: New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences. RESULTS: We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as “Sifting Families,” or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology–based analyses. CONCLUSIONS: We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).
format	Online Article Text
id	pubmed-3481395
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3481395 2012-10-27 BMC Bioinformatics Research Article BioMed Central 2012-10-13 /pmc/articles/PMC3481395/ /pubmed/23061897 http://dx.doi.org/10.1186/1471-2105-13-264 Text en Copyright ©2012 Sharpton et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3481395/ https://www.ncbi.nlm.nih.gov/pubmed/23061897 http://dx.doi.org/10.1186/1471-2105-13-264
