Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

BACKGROUND: The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets are characterized by the presence of organisms with varying GC composition, codon usage biases etc.,...

Bibliographic Details
Main authors: Yooseph, Shibu; Li, Weizhong; Sutton, Granger
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2362130/
https://www.ncbi.nlm.nih.gov/pubmed/18402669
http://dx.doi.org/10.1186/1471-2105-9-182
author Yooseph, Shibu
Li, Weizhong
Sutton, Granger
collection PubMed
description BACKGROUND: The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets are characterized by the presence of organisms with varying GC composition, codon usage biases, etc., and consequently gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools. RESULTS: We present a computational improvement to a sequence clustering approach that we developed previously to identify and classify protein coding genes in large microbial metagenomic datasets. The clustering approach can be used to identify protein coding genes in prokaryotes, viruses, and intron-less eukaryotes. The computational improvement is based on an incremental clustering method that does not require the expensive all-against-all compute that was required by the original approach, while still preserving the remote homology detection capabilities. We present evaluations of the clustering approach in protein-coding gene identification and classification, and also present the results of updating the protein clusters from our previous work with recent genomic and metagenomic sequences. The clustering results are available via CAMERA (http://camera.calit2.net). CONCLUSION: The clustering paradigm is shown to be a very useful tool in the analysis of microbial metagenomic data. The incremental clustering method is shown to be much faster than the original approach in identifying genes, grouping sequences into existing protein families, and also identifying novel families that have multiple members in a metagenomic dataset. These clusters provide a basis for further studies of protein families.
format Text
id pubmed-2362130
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2362130 2008-04-30 Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering Yooseph, Shibu Li, Weizhong Sutton, Granger BMC Bioinformatics Methodology Article
BioMed Central 2008-04-10 /pmc/articles/PMC2362130/ /pubmed/18402669 http://dx.doi.org/10.1186/1471-2105-9-182 Text en Copyright © 2008 Yooseph et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering
topic Methodology Article